May 12, 2025

My server ran 24/7 without any problems for the last 1.5 years, working as a NAS and hosting about 30 Docker containers. There are VMs, but none of them is running. There have been no hardware issues since it was first built, and no hardware changes during this time. Starting with OS 7, crashes and hang-ups emerged.

What happens exactly? Suddenly, the load on one CPU core rises to 100% and stays there. After a short time, sometimes 10 seconds, sometimes half a minute, a second core reaches 100%, and so on, until the system becomes completely unresponsive: the web server stops responding and then even SSH is no longer usable. Sometimes a ping from another machine still works, but not always. The only option then is a hard reset, i.e., pulling the plug. After the reboot, the problem happens again. Because of that, I at first assumed the update was at fault, but now I am not so sure anymore.

I tested several Docker containers that I suspected of being the culprit, one after another, but found no clear cause. The SMART reports show no errors for any of the disks, and to me the logs are inconclusive at best. I started the syslog server with localhost as the target and the Unraid flash drive as a mirror, and I am attaching the diagnostics, the SMART reports, and the syslog from just before the last crash (today at around 12).

I do not know whether the parity check itself is the culprit, but it is the best way to reproduce the behavior: just start the parity check, and after a minute the server is frozen. As I wrote above, the freeze happens again after a reboot, which, as I know now, is due to the parity check that starts because of the unclean shutdown once I start the array again. So when I stop the check immediately after starting the array, everything runs as smoothly as always. I did not find any container (or container log) that might indicate a problem, and I did not encounter any data errors.
Copying files and writing big files, over the network or locally, works without hiccups. Nothing seems to be wrong, and therefore I really do not know how to proceed…

There are hardware errors in the logs, e.g., this one quite soon after a reboot:

May 12 10:04:05 Server kernel: mce: [Hardware Error]: Machine check events logged
May 12 10:04:05 Server kernel: [Hardware Error]: Corrected error, no action required.
May 12 10:04:05 Server kernel: [Hardware Error]: CPU:2 (17:60:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000010859
May 12 10:04:05 Server kernel: [Hardware Error]: Error Addr: 0x000000027cedcd80
May 12 10:04:05 Server kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000005a020300
May 12 10:04:05 Server kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 1
May 12 10:04:05 Server kernel: [Hardware Error]: cache level: L1, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)

The last log entries in the syslog before the freeze were these:

May 12 11:51:34 Server Parity Check Tuning: Manual Correcting Parity-Check detected
May 12 11:51:34 Server Parity Check Tuning: Manual Correcting Parity-Check: Manually resumed
May 12 11:53:21 Server kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
May 12 11:53:21 Server kernel: rcu: 0-...0: (11 ticks this GP) idle=14dc/1/0x4000000000000000 softirq=418403/418405 fqs=117831
May 12 11:53:21 Server kernel: rcu: (detected by 1, t=240008 jiffies, g=840105, q=928877 ncpus=16)
May 12 11:53:21 Server kernel: Sending NMI from CPU 1 to CPUs 0:
May 12 11:53:21 Server kernel: NMI backtrace for cpu 0
May 12 11:53:21 Server kernel: CPU: 0 UID: 0 PID: 14837 Comm: unraidd0 Tainted: P D O 6.12.24-Unraid #1
May 12 11:53:21 Server kernel: Tainted: [P]=PROPRIETARY_MODULE, [D]=DIE, [O]=OOT_MODULE
May 12 11:53:21 Server kernel: Hardware name: To Be Filled By O.E.M. B550M Steel Legend/B550M Steel Legend, BIOS P2.50 10/19/2022
May 12 11:53:21 Server kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x85/0x1d0
May 12 11:53:21 Server kernel: Code: c2 0f b6 d2 c1 e2 08 30 e4 09 d0 3d ff 00 00 00 76 0c 0f ba e0 08 72 1e c6 43 01 00 eb 18 85 c0 74 0a 8a 03 84 c0 74 04 f3 90 <eb> f6 66 c7 03 01 00 e9 2e 01 00 00 e8 ca 72 ff ff 49 c7 c4 40 0e
May 12 11:53:21 Server kernel: RSP: 0018:ffffc90002b83d58 EFLAGS: 00000002
May 12 11:53:21 Server kernel: RAX: 0000000000000001 RBX: ffff888105c1c570 RCX: 0000000000000000
May 12 11:53:21 Server kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff888105c1c570
May 12 11:53:21 Server kernel: RBP: ffff888105c1c570 R08: 0000000000000000 R09: ffff88824ea30b10
May 12 11:53:21 Server kernel: R10: ffff88824ea30b08 R11: 00000000000000b1 R12: ffff888105c1c000
May 12 11:53:21 Server kernel: R13: ffff88824ea30f78 R14: ffff88824ea310a0 R15: ffff88822a0d2800
May 12 11:53:21 Server kernel: FS: 0000000000000000(0000) GS:ffff889bcdc00000(0000) knlGS:0000000000000000
May 12 11:53:21 Server kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 12 11:53:21 Server kernel: CR2: 0000000000454c78 CR3: 0000000006418000 CR4: 0000000000350ef0
May 12 11:53:21 Server kernel: Call Trace:
May 12 11:53:21 Server kernel: <TASK>
May 12 11:53:21 Server kernel: do_raw_spin_lock+0x14/0x20
May 12 11:53:21 Server kernel: release_stripe+0x20/0x40 [md_mod]
May 12 11:53:21 Server kernel: unraidd+0x1206/0x1280 [md_mod]
May 12 11:53:21 Server kernel: ? srso_return_thunk+0x5/0x5f
May 12 11:53:21 Server kernel: ? preempt_latency_start+0x2b/0x50
May 12 11:53:21 Server kernel: ? srso_return_thunk+0x5/0x5f
May 12 11:53:21 Server kernel: ? _raw_spin_lock_irqsave+0x1f/0x30
May 12 11:53:21 Server kernel: md_thread+0xf9/0x130 [md_mod]
May 12 11:53:21 Server kernel: ? __pfx_autoremove_wake_function+0x10/0x10
May 12 11:53:21 Server kernel: ? __pfx_md_thread+0x10/0x10 [md_mod]
May 12 11:53:21 Server kernel: kthread+0xef/0x100
May 12 11:53:21 Server kernel: ? __pfx_kthread+0x10/0x10
May 12 11:53:21 Server kernel: ret_from_fork+0x24/0x40
May 12 11:53:21 Server kernel: ? __pfx_kthread+0x10/0x10
May 12 11:53:21 Server kernel: ret_from_fork_asm+0x1a/0x30
May 12 11:53:21 Server kernel: </TASK>

Any help or ideas on how to identify the root cause would be much appreciated! Thanks!

server-diagnostics-20250512-1236.zip
syslog-127.0.0.1.log
SMART-Reports_20250509.zip
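For readers who want to triage a mirrored syslog like the one attached here, the following is a minimal Python sketch that counts the two event types quoted in this post (machine check lines and RCU stall reports) and records when each first appears. The pattern names, the `classify` helper and the shortened sample lines are illustrative assumptions, not part of Unraid or of the original logs' tooling.

```python
import re
from collections import Counter

# Hypothetical pattern names; the regexes match the kernel messages quoted above.
PATTERNS = {
    "mce": re.compile(r"\[Hardware Error\]"),
    "rcu_stall": re.compile(r"rcu: INFO: \w+ detected stalls"),
    "nmi_backtrace": re.compile(r"NMI backtrace for cpu \d+"),
}

def classify(lines):
    """Count matching lines per event type and remember the first timestamp."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                # Classic syslog timestamps ("May 12 10:04:05") are 15 characters.
                first_seen.setdefault(name, line[:15])
    return counts, first_seen

# A few lines copied from the post, used as sample input.
SAMPLE = [
    "May 12 10:04:05 Server kernel: mce: [Hardware Error]: Machine check events logged",
    "May 12 10:04:05 Server kernel: [Hardware Error]: Corrected error, no action required.",
    "May 12 11:51:34 Server Parity Check Tuning: Manual Correcting Parity-Check detected",
    "May 12 11:53:21 Server kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:",
    "May 12 11:53:21 Server kernel: NMI backtrace for cpu 0",
]

counts, first_seen = classify(SAMPLE)
for name in PATTERNS:
    print(name, counts[name], first_seen.get(name))
```

To run it against a real file, the list could be replaced by the lines of the syslog, for example `open("syslog-127.0.0.1.log", errors="replace")`.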
May 12, 2025 Community Expert

The Unraid driver is crashing, which is almost always a hardware problem. Start by running memtest, or, since memtest is only definitive if it finds errors and you have multiple RAM sticks, try again with just one pair; if the result is the same, try the other pair. That will basically rule out bad RAM.
May 12, 2025 Author

15 minutes ago, JorgeB said:
The Unraid driver is crashing, which is almost always a hardware problem. Start by running memtest, or, since memtest is only definitive if it finds errors and you have multiple RAM sticks, try again with just one pair; if the result is the same, try the other pair. That will basically rule out bad RAM.

Thank you for the fast reply! But to clarify one thing: Since I am running Unraid in UEFI-Mode, not legacy, I cannot use the memtest on the flash drive, am I right? If I recall correctly, the memtest boot option is only available in legacy mode. So, can you give me a small hint on the easiest way to run memtest?

Edited May 12, 2025 by Woosah
May 12, 2025

19 minutes ago, Woosah said:
Since I am running Unraid in UEFI-Mode, not legacy, I cannot use the memtest on the flash drive, am I right?

No. UEFI/BIOS boots the USB drive and from that moment on is no longer in control; the Linux loader on the USB drive is, and it can run memtest just fine regardless of whether it was started from BIOS or UEFI.
August 26, 2025

@Woosah following up on this. I'm running into the same issue. Did you find a fix? Thanks
October 15, 2025 Author Solution

Just documenting the solution here in case someone else needs it:

The base problem here was the same as in hammersandwhich's findings. I have an AMD Ryzen 7 Pro 4750G CPU, which ran completely flawlessly until the kernel update that came with the Unraid update to 7.1. The issue was somewhat obscured by a not-so-great SATA expansion card, which I recently replaced with an LSI 9300i. Since then, the error logs and messages changed from something like the above to something like this:

Oct 13 13:54:37 Server kernel: mce: [Hardware Error]: Machine check events logged
Oct 13 13:54:37 Server kernel: [Hardware Error]: Corrected error, no action required.
Oct 13 13:54:37 Server kernel: [Hardware Error]: CPU:2 (17:60:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000010859
Oct 13 13:54:37 Server kernel: [Hardware Error]: Error Addr: 0x000000020f818c40
Oct 13 13:54:37 Server kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000005a020300
Oct 13 13:54:37 Server kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 1
Oct 13 13:54:37 Server kernel: [Hardware Error]: cache level: L1, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)

This gave me the same hint as hammersandwhich: the kernel update changed something, and it seemed to be related to the L1 errors, which can be triggered by microcode problems. Such errors (machine check events, MCEs) in MC1, i.e., instruction fetch unit (IFU) errors, can happen when something goes wrong in the CPU logic. And my CPU, from the Renoir line, was bitten by IFU errors while running older AGESA releases (AGESA is AMD's firmware package, which includes the CPU microcode). Therefore, I updated my BIOS, which also upgraded the AGESA release from 1.2.0.D to 1.2.0.F.
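For anyone who wants to read the MC1_STATUS value by hand, here is a minimal Python sketch that unpacks the 64-bit status word from the log. The bit positions and the bus-error field layout are my understanding of how the Linux kernel's AMD MCE decoder interprets these registers, so treat them as assumptions to verify against the kernel source or AMD's documentation; the function names are hypothetical.

```python
# Assumed bit positions of the flags in an AMD MCA status register.
STATUS_BITS = [
    (63, "Val"), (62, "Over"), (61, "UC"), (60, "En"),
    (59, "MiscV"), (58, "AddrV"), (57, "PCC"), (55, "TCC"),
    (53, "SyndV"), (44, "Deferred"), (43, "Poison"),
]

def decode_status(status):
    """Split an MCA status word into flags and error-code fields."""
    flags = [name for bit, name in STATUS_BITS if (status >> bit) & 1]
    return {
        "flags": flags,
        # Neither uncorrected (UC) nor deferred means the error was corrected.
        "corrected": "UC" not in flags and "Deferred" not in flags,
        "ext_error_code": (status >> 16) & 0x3F,
        "error_code": status & 0xFFFF,
    }

def decode_bus_error(error_code):
    """Decode a bus-type error code (layout 0000 1PPT RRRR IILL)."""
    if (error_code & 0xF800) != 0x0800:
        return None  # not a bus error
    return {
        "cache_level": ["RESV", "L1", "L2", "L3/GEN"][error_code & 0x3],
        "mem_io": ["MEM", "RESV", "IO", "GEN"][(error_code >> 2) & 0x3],
        "mem_tx": ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"][
            min((error_code >> 4) & 0xF, 8)
        ],
        "part_proc": ["SRC", "RES", "OBS", "GEN"][(error_code >> 9) & 0x3],
        "timeout": bool((error_code >> 8) & 1),
    }

status = decode_status(0xDC20000000010859)
print(status["flags"], status["corrected"], status["ext_error_code"])
print(decode_bus_error(status["error_code"]))
```

With the value from the log, the decoded fields line up with what the kernel printed: a corrected error, extended error code 1, and an L1 instruction-fetch (IRD) transaction with no timeout.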
Then I reset the BIOS to its default values and applied the following settings for CPU power management, plus the tweaks needed for my Unraid setup:

SVM Mode = Enabled
IOMMU = Enabled
Above 4G Decoding = Enabled
Typical Current Idle = Enabled
Global C-State Control = Disabled
Precision Boost Overdrive (PBO) = Disabled

The following settings would be optimal, but were not available in my BIOS:

CPPC Ctrl → Enabled
CPPC Preferred Cores → Auto

Then I changed the Unraid boot settings to this:

append amd_iommu=on iommu=pt amd_pstate=active pcie_acs_override=downstream,multifunction vfio_iommu_type1.allow_unsafe_interrupts=1 video=efifb:off initrd=/bzroot

Testing after the reboot with

dmesg | grep -E "amd_pstate|cpufreq"

showed this:

amd_pstate: The CPPC feature is supported but currently disabled by the BIOS.

Therefore, amd_pstate did not work, because my BIOS doesn't expose CPPC to the kernel, and the kernel falls back to acpi_cpufreq, but that is more a nuisance than a problem. Another try with amd_pstate.shared_mem=1 showed the same. Because of this, I now run Unraid with this:

append amd_iommu=on iommu=pt amd_pstate=disable pcie_acs_override=downstream,multifunction vfio_iommu_type1.allow_unsafe_interrupts=1 video=efifb:off initrd=/bzroot

With this setup, I ran 4 passes of memtest on the RAM without any errors, and I can now happily say that the server is up and running without any glitches again...

Hopefully this helps someone else!
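As a small illustration of the check described above, this Python sketch parses an append line into its parameters and flags the case where amd_pstate is requested but the kernel reported that CPPC is disabled by the BIOS. The helper names `parse_append` and `cppc_blocked` are hypothetical, and the message string is simply the one quoted from my dmesg output; other kernels or boards may word it differently.

```python
# Message copied from the dmesg output quoted in this post.
CPPC_DISABLED_MSG = "The CPPC feature is supported but currently disabled by the BIOS"

def parse_append(line):
    """Turn a syslinux 'append' line into a dict of kernel parameters."""
    tokens = line.split()
    if tokens and tokens[0] == "append":
        tokens = tokens[1:]
    params = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        # Bare flags without '=' are stored as True.
        params[key] = value if sep else True
    return params

def cppc_blocked(params, dmesg_text):
    """True if an amd_pstate mode is requested but the BIOS hides CPPC."""
    wants_pstate = params.get("amd_pstate") in ("active", "passive", "guided")
    return wants_pstate and CPPC_DISABLED_MSG in dmesg_text

append_line = (
    "append amd_iommu=on iommu=pt amd_pstate=active "
    "pcie_acs_override=downstream,multifunction "
    "vfio_iommu_type1.allow_unsafe_interrupts=1 video=efifb:off initrd=/bzroot"
)
dmesg_text = "amd_pstate: The CPPC feature is supported but currently disabled by the BIOS."

params = parse_append(append_line)
print(params["amd_pstate"], cppc_blocked(params, dmesg_text))
```

If `cppc_blocked` returns True, setting `amd_pstate=disable` as in my final append line just makes the fallback to acpi_cpufreq explicit.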