
7.1.2 - Hard crashes when starting the parity check


My server ran without any problems 24/7 for the last 1.5 years, working as a NAS and hosting about 30 Docker containers. There are VMs, but none are running. There have been no hardware issues since it was first built, and no hardware changes during this time. Starting with Unraid OS 7, crashes and hang-ups emerged.
What happens exactly? Suddenly, the load on one CPU core rises to 100% and stays there. After a short period of time, sometimes 10 seconds, sometimes half a minute, a second core reaches 100%, and so on, until the system becomes completely unresponsive: the web server stops responding and then even SSH is unusable. Sometimes a ping from another machine still works, but not always. The only option then is a hard reset, i.e., pulling the plug. After the reboot, the problem happens again.
Because of that, I at first assumed the update was to blame, but now I am not so sure anymore. I tested several Docker containers that I suspected to be the culprit, one after another, but found no hint of a clear cause. The SMART reports show no errors for any of the disks, and the logs are inconclusive, at least to me. I started the syslog server with localhost as the target and the Unraid flash drive as a mirror, and I am attaching the diagnostics, the SMART reports and the syslog from just before the last crash (today at around 12).
I do not know if the parity check itself is the culprit, but it is the best way to reproduce the behavior: just start the parity check, and after a minute the server is frozen. As I wrote above, the freeze recurs after a reboot. I now know why: the unclean shutdown triggers a parity check as soon as I start the array again. When I stop the check immediately after starting the array, everything runs as smoothly as always. I did not find any container (or container log) that might indicate a problem, and I did not encounter any data errors. Copying files and writing big files, over the network or locally, works without hiccups. Nothing else seems to be wrong, and therefore I really do not know how to proceed…

There are hardware errors in the logs, e.g., quite soon after a reboot:

May 12 10:04:05 Server kernel: mce: [Hardware Error]: Machine check events logged
May 12 10:04:05 Server kernel: [Hardware Error]: Corrected error, no action required.
May 12 10:04:05 Server kernel: [Hardware Error]: CPU:2 (17:60:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000010859
May 12 10:04:05 Server kernel: [Hardware Error]: Error Addr: 0x000000027cedcd80
May 12 10:04:05 Server kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000005a020300
May 12 10:04:05 Server kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 1
May 12 10:04:05 Server kernel: [Hardware Error]: cache level: L1, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)
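
For readers who want to see how the MC1_STATUS value relates to the decoded lines, here is a minimal Python sketch that unpacks the 64-bit status word from the log above. The bit positions and lookup tables are my reading of the layout used by the Linux kernel's AMD MCE decoder and should be treated as an assumption, not a reference; `decode_status` is a hypothetical helper name.

```python
# Sketch only: bit positions are assumed from the x86 MCA status layout
# as printed by the Linux kernel's AMD MCE decoder.
FLAG_BITS = {
    "Over": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (False means corrected, "CE")
    "MiscV": 59,  # MISC register holds valid data
    "AddrV": 58,  # error address is valid
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # syndrome register is valid
}

# Lookup tables for the "bus error" format of the low 16 bits.
LL = ["resv", "L1", "L2", "L3/GEN"]                      # cache level
II = ["MEM", "RESV", "IO", "GEN"]                        # memory or I/O
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]
PP = ["SRC", "RES", "OBS", "GEN"]                        # participating processor


def decode_status(status: int) -> dict:
    """Split an MCA status word into flags and error-code fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    code = status & 0xFFFF
    result = {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,
        "error_code": code,
    }
    # Bus errors have the pattern 0000 1PPT RRRR IILL in the low 16 bits.
    if (code & 0xF800) == 0x0800:
        r = (code >> 4) & 0xF
        result["bus"] = {
            "cache_level": LL[code & 0x3],
            "mem_io": II[(code >> 2) & 0x3],
            "mem_tx": RRRR[r] if r < len(RRRR) else "unknown",
            "part_proc": PP[(code >> 9) & 0x3],
            "timed_out": bool((code >> 8) & 1),
        }
    return result


if __name__ == "__main__":
    # Status value copied from the log line above.
    print(decode_status(0xDC20000000010859))
```

With the value from the log, the bus fields come out as L1 / IO / IRD / SRC with no timeout, matching the "cache level" line the kernel printed.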

 

The last log entry in the syslog before the freeze was this:

May 12 11:51:34 Server Parity Check Tuning: Manual Correcting Parity-Check detected
May 12 11:51:34 Server Parity Check Tuning: Manual Correcting Parity-Check: Manually resumed
May 12 11:53:21 Server kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
May 12 11:53:21 Server kernel: rcu: 	0-...0: (11 ticks this GP) idle=14dc/1/0x4000000000000000 softirq=418403/418405 fqs=117831
May 12 11:53:21 Server kernel: rcu: 	(detected by 1, t=240008 jiffies, g=840105, q=928877 ncpus=16)
May 12 11:53:21 Server kernel: Sending NMI from CPU 1 to CPUs 0:
May 12 11:53:21 Server kernel: NMI backtrace for cpu 0
May 12 11:53:21 Server kernel: CPU: 0 UID: 0 PID: 14837 Comm: unraidd0 Tainted: P      D    O       6.12.24-Unraid #1
May 12 11:53:21 Server kernel: Tainted: [P]=PROPRIETARY_MODULE, [D]=DIE, [O]=OOT_MODULE
May 12 11:53:21 Server kernel: Hardware name: To Be Filled By O.E.M. B550M Steel Legend/B550M Steel Legend, BIOS P2.50 10/19/2022
May 12 11:53:21 Server kernel: RIP: 0010:native_queued_spin_lock_slowpath+0x85/0x1d0
May 12 11:53:21 Server kernel: Code: c2 0f b6 d2 c1 e2 08 30 e4 09 d0 3d ff 00 00 00 76 0c 0f ba e0 08 72 1e c6 43 01 00 eb 18 85 c0 74 0a 8a 03 84 c0 74 04 f3 90 <eb> f6 66 c7 03 01 00 e9 2e 01 00 00 e8 ca 72 ff ff 49 c7 c4 40 0e
May 12 11:53:21 Server kernel: RSP: 0018:ffffc90002b83d58 EFLAGS: 00000002
May 12 11:53:21 Server kernel: RAX: 0000000000000001 RBX: ffff888105c1c570 RCX: 0000000000000000
May 12 11:53:21 Server kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff888105c1c570
May 12 11:53:21 Server kernel: RBP: ffff888105c1c570 R08: 0000000000000000 R09: ffff88824ea30b10
May 12 11:53:21 Server kernel: R10: ffff88824ea30b08 R11: 00000000000000b1 R12: ffff888105c1c000
May 12 11:53:21 Server kernel: R13: ffff88824ea30f78 R14: ffff88824ea310a0 R15: ffff88822a0d2800
May 12 11:53:21 Server kernel: FS:  0000000000000000(0000) GS:ffff889bcdc00000(0000) knlGS:0000000000000000
May 12 11:53:21 Server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 12 11:53:21 Server kernel: CR2: 0000000000454c78 CR3: 0000000006418000 CR4: 0000000000350ef0
May 12 11:53:21 Server kernel: Call Trace:
May 12 11:53:21 Server kernel: <TASK>
May 12 11:53:21 Server kernel: do_raw_spin_lock+0x14/0x20
May 12 11:53:21 Server kernel: release_stripe+0x20/0x40 [md_mod]
May 12 11:53:21 Server kernel: unraidd+0x1206/0x1280 [md_mod]
May 12 11:53:21 Server kernel: ? srso_return_thunk+0x5/0x5f
May 12 11:53:21 Server kernel: ? preempt_latency_start+0x2b/0x50
May 12 11:53:21 Server kernel: ? srso_return_thunk+0x5/0x5f
May 12 11:53:21 Server kernel: ? _raw_spin_lock_irqsave+0x1f/0x30
May 12 11:53:21 Server kernel: md_thread+0xf9/0x130 [md_mod]
May 12 11:53:21 Server kernel: ? __pfx_autoremove_wake_function+0x10/0x10
May 12 11:53:21 Server kernel: ? __pfx_md_thread+0x10/0x10 [md_mod]
May 12 11:53:21 Server kernel: kthread+0xef/0x100
May 12 11:53:21 Server kernel: ? __pfx_kthread+0x10/0x10
May 12 11:53:21 Server kernel: ret_from_fork+0x24/0x40
May 12 11:53:21 Server kernel: ? __pfx_kthread+0x10/0x10
May 12 11:53:21 Server kernel: ret_from_fork_asm+0x1a/0x30
May 12 11:53:21 Server kernel: </TASK>
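
When reading a trace like this, it can help to drop the frames prefixed with "?", which the kernel marks as unreliable, and look only at the remaining call chain. The following is a rough Python sketch of that filtering; the regular expression and the function name `reliable_frames` are my own assumptions, written against the syslog format shown above.

```python
import re

# Matches e.g. "... kernel: release_stripe+0x20/0x40 [md_mod]"
# Group 1: optional "? " marker, group 2: symbol, group 3: optional module.
FRAME = re.compile(
    r"kernel:\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?\s*$"
)


def reliable_frames(lines):
    """Return (symbol, module) pairs for frames without a '?' marker."""
    frames = []
    in_trace = False
    for line in lines:
        if "</TASK>" in line:
            in_trace = False
            continue
        if "<TASK>" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME.search(line)
        if match and not match.group(1):
            frames.append((match.group(2), match.group(3)))
    return frames


# A few lines copied from the trace above.
SAMPLE = """\
May 12 11:53:21 Server kernel: <TASK>
May 12 11:53:21 Server kernel: do_raw_spin_lock+0x14/0x20
May 12 11:53:21 Server kernel: release_stripe+0x20/0x40 [md_mod]
May 12 11:53:21 Server kernel: unraidd+0x1206/0x1280 [md_mod]
May 12 11:53:21 Server kernel: ? srso_return_thunk+0x5/0x5f
May 12 11:53:21 Server kernel: md_thread+0xf9/0x130 [md_mod]
May 12 11:53:21 Server kernel: </TASK>
"""

if __name__ == "__main__":
    for symbol, module in reliable_frames(SAMPLE.splitlines()):
        print(symbol, module or "(core kernel)")
```

For the sample, the remaining chain is do_raw_spin_lock, release_stripe, unraidd, md_thread, i.e., the stalled CPU is spinning on a lock inside md_mod.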

 

Any help or idea how to identify the root cause will be much appreciated...!

 

Thanks!

 

server-diagnostics-20250512-1236.zip syslog-127.0.0.1.log SMART-Reports_20250509.zip

Solved by Woosah

  • Community Expert

The Unraid driver is crashing, and that is almost always a hardware problem. Start by running memtest. Since memtest is only definitive if it finds errors and you have multiple RAM sticks, also try running with just one pair; if the problem persists, try the other pair. That will basically rule out bad RAM.

  • Author
15 minutes ago, JorgeB said:

The Unraid driver is crashing, and that is almost always a hardware problem. Start by running memtest. Since memtest is only definitive if it finds errors and you have multiple RAM sticks, also try running with just one pair; if the problem persists, try the other pair. That will basically rule out bad RAM.

 

Thank you for the fast reply! But, to clarify one thing:

Since I am running Unraid in UEFI mode, not legacy, I cannot use the memtest on the flash drive, am I right? If I recall correctly, the boot option with memtest is only available in legacy mode. So, can you give me a small hint on the easiest way to run memtest?

Edited by Woosah

19 minutes ago, Woosah said:

Since I am running Unraid in UEFI-Mode, not legacy, I cannot use the memtest on the flash drive, am I right?

No. UEFI/BIOS boots the USB and from that moment is no longer in control; the Linux loader on the USB is, and that can run memtest just fine whether it was started from BIOS or UEFI.

@Woosah did you manage to solve the issue? I have the same problem.

  • 3 months later...

@Woosah following up on this. I'm running into the same issue. Did you find a fix? Thanks

  • 1 month later...
  • Author
  • Solution

Just documenting the solution here in case someone else needs it:

The underlying problem here was the same as in hammersandwhich's findings. I have an AMD Ryzen 7 Pro 4750G CPU, which ran flawlessly until the kernel update that came with the Unraid update to 7.1. The issue was somewhat obscured by a not-so-great SATA expansion card, which I recently replaced with an LSI 9300i. Since then, the error logs and messages changed from something like the above to something like this:

Oct 13 13:54:37 Server kernel: mce: [Hardware Error]: Machine check events logged
Oct 13 13:54:37 Server kernel: [Hardware Error]: Corrected error, no action required.
Oct 13 13:54:37 Server kernel: [Hardware Error]: CPU:2 (17:60:1) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000010859
Oct 13 13:54:37 Server kernel: [Hardware Error]: Error Addr: 0x000000020f818c40
Oct 13 13:54:37 Server kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000005a020300
Oct 13 13:54:37 Server kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 1
Oct 13 13:54:37 Server kernel: [Hardware Error]: cache level: L1, mem/io: IO, mem-tx: IRD, part-proc: SRC (no timeout)

This gave me the same hint as hammersandwhich: the kernel update changed something, and it seemed to be related to the L1 errors, which can be triggered by microcode problems. Such machine check events (MCEs) in bank MC1, i.e., instruction fetch unit (IFU) errors, can happen when something goes wrong in the CPU logic. My CPU, from the Renoir line, was bitten by IFU errors while running older AGESA releases (AGESA is AMD's firmware package, which BIOS updates typically ship together with CPU microcode). Therefore, I updated my BIOS, which also upgraded AGESA from 1.2.0.D to 1.2.0.F. Then I reset the BIOS to its default values and applied the following settings for CPU power management, plus the tweaks my Unraid setup needs:

  1. SVM Mode = Enabled

  2. IOMMU = Enabled

  3. Above 4G Decoding = Enabled

  4. Typical Current Idle = Enabled

  5. Global C-State Control = Disabled

  6. Precision Boost Overdrive (PBO) = Disabled

The following settings would be optimal, but were not available in my BIOS:

  •  CPPC Ctrl → Enabled

  •  CPPC Preferred Cores → Auto
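
To confirm that a BIOS update actually changed the microcode the kernel sees, one option is to compare the "microcode" field in /proc/cpuinfo before and after flashing. Below is a small Python sketch of that check; `microcode_revisions` is a hypothetical helper, and it assumes the x86 Linux /proc/cpuinfo format with one "microcode : 0x..." line per logical CPU.

```python
def microcode_revisions(cpuinfo_text: str) -> set:
    """Collect the distinct microcode revisions reported in /proc/cpuinfo."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            found = microcode_revisions(handle.read())
    except OSError:
        found = set()
    # More than one revision would mean the cores disagree, which is worth a look.
    print(sorted(found) or "no microcode field found")
```

Running it once before and once after the BIOS update and comparing the output shows whether the revision changed.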

Then, I tweaked the unraid boot settings to this:

append amd_iommu=on iommu=pt amd_pstate=active pcie_acs_override=downstream,multifunction vfio_iommu_type1.allow_unsafe_interrupts=1 video=efifb:off initrd=/bzroot 

Testing after a reboot with dmesg | grep -E "amd_pstate|cpufreq" showed this: amd_pstate: The CPPC feature is supported but currently disabled by the BIOS. So amd_pstate did not work: my BIOS does not expose CPPC to the kernel, which then falls back to acpi_cpufreq. That is more a nuisance than a problem. Another try with amd_pstate.shared_mem=1 showed the same message. Because of this, I now run Unraid with the following:

append amd_iommu=on iommu=pt amd_pstate=disable pcie_acs_override=downstream,multifunction vfio_iommu_type1.allow_unsafe_interrupts=1 video=efifb:off initrd=/bzroot
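
To double-check after a reboot that the append line took effect and which frequency driver is actually in use, something like the following Python sketch can be run on the server. It reads the standard /proc/cmdline and the cpufreq sysfs entry; the function names `parse_cmdline` and `scaling_driver` are my own, and the sketch assumes the cpufreq sysfs directory exists on the machine.

```python
from pathlib import Path


def parse_cmdline(cmdline: str) -> dict:
    """Turn a kernel command line into {parameter: value or None}."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def scaling_driver(cpu: int = 0):
    """Return the active cpufreq driver for one CPU, or None if unavailable."""
    path = Path(f"/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_driver")
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    try:
        params = parse_cmdline(Path("/proc/cmdline").read_text())
    except OSError:
        params = {}
    print("amd_pstate parameter:", params.get("amd_pstate"))
    print("active cpufreq driver:", scaling_driver())
```

If the amd_pstate parameter shows the value from the append line and the driver reported is the ACPI one, the fallback described above is what is running.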

With this setup, I ran 4 passes of memtest for the RAM without any errors, and I can now happily say that the server is up and running without any glitches again...

Hopefully this helps someone else!
