Random Hangs and Flash Drive Corruption

January 10, 2025

I recently moved my server from an X99 system to a 5800X on an X570 AORUS ELITE, swapping the motherboard, CPU, and RAM. The new CPU, board, and RAM were previously in my gaming PC and worked fine under Windows 11 with no crashes.

At first everything seemed fine, but then I noticed the server was offline. I looked at its display output and it was just hanging, with no disk activity that I could see or hear. After power cycling, it would hang at the hash check for one of the bz files on the flash drive. Restoring from a flash backup temporarily fixed the issue, but after a few hours the same thing would repeat.
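One way to check whether the boot files on the flash drive are actually corrupted, rather than waiting for the boot-time hash check to stall, is to compare each bz file with its stored checksum from a machine you trust. A minimal Python sketch, assuming each `bz*` file has a sibling `<name>.sha256` file whose first token is the hex digest (that layout, and the function names, are assumptions for illustration):

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_boot_files(flash_root):
    """Map each bz* file that has a .sha256 sibling to True (match) or False."""
    results = {}
    for checksum_file in sorted(Path(flash_root).glob("bz*.sha256")):
        target = checksum_file.with_suffix("")  # bzroot.sha256 -> bzroot
        tokens = checksum_file.read_text().split()
        if not target.is_file() or not tokens:
            # Missing file or empty checksum counts as a failure
            results[target.name] = False
            continue
        results[target.name] = sha256_of(target) == tokens[0].lower()
    return results
```

Calling `check_boot_files("/path/to/mounted/flash")` returns something like `{"bzimage": True, "bzfirmware": False}`. A file that keeps failing after a fresh restore would point toward the flash drive or USB port rather than toward the crash itself.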

I found a few other threads discussing crashes with Ryzen so I made the following changes with no success:

- Disabled XMP in BIOS

- Disabled global C-states in BIOS

- Set power supply idle current to typical in BIOS

- Added "/usr/local/sbin/zenstates --c6-disable" to /boot/config/go

- Enabled syslog writing to flash but observed no errors prior to the system hanging

I'm at my wits' end. I get no errors, this hardware worked fine before, and I don't know what to do next.

unraid-diagnostics-20250109-2047.zip

January 10, 2025

Author

Earlier today I also updated my motherboard BIOS from an older 2022 version to the current one and made sure my earlier settings (C-states and so on) persisted.

I experienced another hang, and after the forced reboot I had to replace the bzfirmware file on the flash drive before it would boot properly again.

I just updated to Unraid 7. I'll see if that resolves anything, but I don't have my hopes up.

January 10, 2025

Enable the syslog server and post that after a crash.

January 10, 2025

Author

I had the syslog server enabled, but right now the latest syslog-previous only covers a clean reboot after installing Unraid 7. I did not realize that the syslogs containing my crashes would not persist. Is there a way to keep old syslogs on my flash drive from being removed?

January 10, 2025

If saved to the flash drive, only the previous log will be kept, but if you save to a share, it will keep a continuous one.

January 10, 2025

Author

Thank you for the help, I have that enabled now. I have not had a crash since updating to Unraid 7. Hopefully the newer kernel fixes my issue. If I have another crash, I'll post the syslog.

January 11, 2025

Author

I had another crash. This time the server lasted much longer (~24 hrs), but it was still left hanging with no display output and nothing logged to syslog.

The system becomes unresponsive to keyboard inputs, has a blank display out, and the webgui doesn't load.

syslog

January 11, 2025

Unfortunately, there's nothing relevant logged, so this can also be a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem. If it doesn't, start turning the other services back on one by one, including the individual Docker containers.

January 11, 2025

Author

I unfortunately don't have the luxury of running it with containers disabled as a number of people rely on the services and I also will lose control of all the lighting in my house. I'll try swapping out the motherboard next. I have another set of memory to try as well. Waiting on the board to come in the mail now.

Is there a way to rule out a cpu issue? At this point hardware wise I ruled out the power supply since the system stays up and hangs when there is an issue. I ruled out any of the drives and their connections since that was untouched. The only pieces left are the memory, motherboard, and cpu.

January 11, 2025

8 minutes ago, bebis said:

Is there a way to rule out a cpu issue?

Basically you would need to swap it and retest; it is difficult to completely rule it out otherwise.

January 28, 2025

Author

Alright, I have now swapped out the motherboard and the PSU. I also ran a memtest for 24 hours with no errors.

I am still seeing random reboots, though.

I have attached diagnostics, I have been noticing this error:

Jan 28 15:49:21 unraid root: mcelog: ERROR: AMD Processor family 25: mcelog does not support this processor.  Please use the edac_mce_amd module instead.

I tried looking around on the forums and the general consensus seemed to be to ignore it... but I feel like this may be a clue. Not sure though.

I've also attached the latest logs. The crash happened between syslog-previous and syslog. Unfortunately, I don't think anything there will be too helpful. This is starting to drive me crazy. Is it just a Ryzen issue with Unraid?

unraid-diagnostics-20250128-1806.zip

Edited January 28, 2025 by bebis
add diagnostics

January 29, 2025

8 hours ago, bebis said:

is it just a ryzen issue with unraid?

There are some known issues with Ryzen and Linux, but those typically make the server hang, not restart by itself.

January 29, 2025

Author

The issue was previously hanging. After swapping the motherboard and power supply, I now have the restarting issue. I can't seem to tie the restarts to a time or process. They seem to happen during times of low load, usually the middle of the night, so I'm not sure whether it hangs and eventually reboots or just reboots immediately.
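With a continuous syslog saved to a share, one rough way to tell "hang, then reboot" from "immediate reboot" is to look for silent stretches in the log and see whether each one ends in a boot message. A minimal Python sketch, assuming standard `Mon DD HH:MM:SS` syslog timestamps and that a kernel line containing `Linux version` marks a fresh boot (the marker string, the 10-minute threshold, and the function names are assumptions):

```python
from datetime import datetime, timedelta


def parse_timestamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line, or return None."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, year, min_gap=timedelta(minutes=10), boot_marker="Linux version"):
    """Return (last_seen, next_seen, looks_like_boot) for each silent stretch."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_timestamp(line, year)
        if current is None:
            continue  # skip lines without a leading timestamp
        if previous is not None and current - previous >= min_gap:
            gaps.append((previous, current, boot_marker in line))
        previous = current
    return gaps
```

Syslog timestamps carry no year, so it has to be passed in, and a log that crosses New Year would need extra handling. An idle server also logs very little, so a long gap before a boot line is only suggestive of a hang, not proof of one.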

January 29, 2025

11 hours ago, JorgeB said:

There are some known issues with Ryzen and Linux, but those typically make the server hang, not restart by itself.

@bebis

I'm running a 5800X with a B550M board and I have not done any of the things listed in the link JorgeB cited. The only changes I made in my BIOS were to enable WOL and set the CPU to a 45W Eco Mode.

I have not run into any of these kinds of issues, with the exception of a single MCE. I started on 6.12.10 and have moved through to Unraid 7.0.0 without any further issues.

January 29, 2025

Author

4 hours ago, MowMdown said:

@bebis

I'm running a 5800X with a B550M board and I have not done any of the things listed in the link JorgeB cited. The only changes I made in my BIOS were to enable WOL and set the CPU to a 45W Eco Mode.

I have not run into any of these kinds of issues, with the exception of a single MCE. I started on 6.12.10 and have moved through to Unraid 7.0.0 without any further issues.

When you say a single MCE, you mean you received that mcelog error once after a crash/hang and then never saw it again? Or do you mean you received the mcelog error once with no crashes? Trying to figure out if the MCE log error is due to the crash or not since I swear I don't see it otherwise.
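To check whether machine check lines only ever show up around crashes, it may help to pull every MCE record out of the saved syslogs and compare their times against the known reboots. A minimal Python sketch, assuming the kernel prints machine checks in the usual `mce: [Hardware Error]: CPU N: Machine Check: M Bank B: STATUS` form (the function name and the returned field names are illustrative):

```python
import re

# Matches the per-CPU machine check line and captures time, CPU, bank, status
MCE_LINE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) .*mce: \[Hardware Error\]: "
    r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]+)"
)


def extract_mce_records(lines):
    """Return one dict per machine check line found in the given syslog lines."""
    records = []
    for line in lines:
        match = MCE_LINE.search(line)
        if match is None:
            continue
        when, cpu, bank, status = match.groups()
        records.append(
            {"when": when, "cpu": int(cpu), "bank": int(bank), "status": int(status, 16)}
        )
    return records
```

Running this over both syslog and syslog-previous gives a short list of timestamps that can be lined up with the crash times. The `mcelog does not support this processor` message is not matched, since it is a tool startup notice and not a hardware event.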

January 30, 2025

Author

Just had another crash, this time with an MCE error logged. How do I actually debug this?

Jan 30 11:16:41 unraid kernel: mce: [Hardware Error]: Machine check events logged
Jan 30 11:16:41 unraid kernel: mce: [Hardware Error]: CPU 15: Machine Check: 0 Bank 5: bea0000000000108
Jan 30 11:16:41 unraid kernel: mce: [Hardware Error]: TSC 0 ADDR 1479c98cf6ce MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Jan 30 11:16:41 unraid kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1738253773 SOCKET 0 APIC f microcode a20102d
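The 64-bit value after `Bank 5:` is the bank's status register, and its top bits can be read by hand. A minimal Python sketch of that decoding, assuming the architectural x86 machine check bit positions for VAL through PCC and, less certainly, the AMD-specific positions for TCC and SYNDV and a 6-bit extended error code in bits 21:16 (those AMD details are assumptions that should be checked against AMD's documentation):

```python
# Bit positions in the MCA status register. VAL..PCC are architectural x86
# bits; TCC and SYNDV are AMD-specific and treated here as assumptions.
STATUS_FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was not corrected by hardware
    (60, "EN"),     # reporting was enabled for this error
    (59, "MISCV"),  # MISC register contents are valid
    (58, "ADDRV"),  # ADDR register contents are valid
    (57, "PCC"),    # processor context corrupt
    (55, "TCC"),    # task context corrupt (AMD, assumed)
    (53, "SYNDV"),  # syndrome register valid (AMD, assumed)
]


def decode_status(status):
    """Split a status value into set flag names and its error code fields."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,          # low 16 bits
        "extended_code": (status >> 16) & 0x3F  # bits 21:16 (assumed width)
    }
```

For `0xbea0000000000108` this reports UC and PCC among the set flags, meaning an uncorrected error that left the processor context corrupt. That would be consistent with the machine resetting instead of logging the error and carrying on.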

January 30, 2025

Author

I think it is a memory error? But I don't know how to determine which stick is throwing the error. If I can figure that out, I could remove the bad stick and see if I am still crashing.

I ran the error through mcelog and got this. I'm not sure how to interpret it, though:

root@unraid:~# /usr/sbin/mcelog --ascii < error_text 
Hardware event. This is not a software error.
CPU 15 BANK 5 
MISC d012000100000000 ADDR 1479c98cf6ce 
STATUS bea0000000000108 MCGSTATUS 0
SYND 4d000000 IPID 500b000000000 
mcelog: Unknown CPU type vendor 2 family 25 model 1
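Since mcelog cannot decode this CPU family, the IPID field is the best hint about which hardware block raised the error. A minimal Python sketch, assuming the hardware ID sits in bits 43:32 and the MCA type in bits 63:48, and using a block table recalled from the Linux kernel's AMD machine check decoder (both the layout and the table are assumptions to verify against the kernel source for your version):

```python
# Assumed (hardware ID, MCA type) -> block name table. 0x0B0 is taken to be
# the CPU core and 0x096 the unified memory controller.
BLOCKS = {
    (0x0B0, 0x0): "core: load-store unit",
    (0x0B0, 0x1): "core: instruction fetch unit",
    (0x0B0, 0x2): "core: L2 cache",
    (0x0B0, 0x3): "core: decode unit",
    (0x0B0, 0x5): "core: execution unit",
    (0x0B0, 0x6): "core: floating point unit",
    (0x0B0, 0x7): "L3 cache",
    (0x096, 0x0): "memory controller",
}


def decode_ipid(ipid):
    """Return (hardware_id, mca_type, block_name) for an IPID value."""
    hardware_id = (ipid >> 32) & 0xFFF
    mca_type = (ipid >> 48) & 0xFFFF
    name = BLOCKS.get((hardware_id, mca_type), "unknown block")
    return hardware_id, mca_type, name
```

Under these assumptions, `decode_ipid(0x500b000000000)` returns hardware ID `0x0b0` with type 5, which maps to the core's execution unit and not to the memory controller. If that holds, pulling RAM sticks one at a time is unlikely to isolate this particular error.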

unraid-diagnostics-20250130-1322.zip

Edited January 30, 2025 by bebis
removed the mcelog output for the microcode update; added full diagnostics

January 30, 2025

Author

I found this section on the Arch Wiki. I'll take a look and see if it solves my issue.

https://wiki.archlinux.org/title/Ryzen#Random_reboots

Edit: I applied a +5 curve offset to pump more voltage into the CPU, +1 above what the Arch Wiki recommended. Hopefully that does the trick. I will report back if I continue to crash.

Also, based on other mcelog outputs that I have seen, either the one I am getting has nothing to do with RAM or it just is not being decoded properly because my CPU is not recognized. I have seen some outputs from other people's memory errors and they contain much more detail. Mine almost exactly matches the one in the Arch Wiki.

Quote

With Ryzen 5000 series, particularly the higher-end models of 5950X and 5900X there seem to be some slight instability issues under Linux, related possibly to the 5.11+ kernel, as shown by this kernel bug. After investigating and reading reports on the Internet, It seems that out of the box, Windows seems to run the CPUs at higher voltage and lower peak frequencies, compared to the stock linux kernel, which depending on your draw from the silicon lottery could cause a host of random application crashes or hardware errors that lead to reboots. You will recognise those by dmesg logs that look like:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: CPU 22: Machine Check: 0 Bank 1: bc800800060c0859
kernel: mce: [Hardware Error]: TSC 0 ADDR 7ea8f5b00 MISC d012000000000000 IPID 100b000000000
kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1636645367 SOCKET 0 APIC d microcode a201016

The CPU ID and the Processor number may vary. To solve this problem you need to supply higher voltage to your CPU so that it is stable when running at peak frequencies. The easiest way to achieve this is to use the AMD curve optimiser which is accessible via your motherboard's UEFI. Access it and put a positive offset of 4 points, which will increase the voltage your CPU is getting at higher loads. It will limit overclocking potential due to higher heat dissipation requirements, but it will run stable. For more details check this forum post. When I did this for my 5950X, my processor stabilised and the frequency and voltage ranges were more similar to those observed under windows.

Edited January 31, 2025 by bebis
Update with new attempt

January 31, 2025

On 1/29/2025 at 6:20 PM, bebis said:

When you say a single MCE, you mean you received that mcelog error once after a crash/hang and then never saw it again? Or do you mean you received the mcelog error once with no crashes? Trying to figure out if the MCE log error is due to the crash or not since I swear I don't see it otherwise.

I received a single MCE. The event itself even said it could be ignored because the system had handled it.

The system did not crash or lockup.
