September 8, 2025

The issue started when I replaced the USB key with a DoM and, while building it, upgraded the server from 7.1.2 to 7.1.4.

Issue

Two crashes in the last 24 hours, three since the upgrade 2 weeks ago.

Steps so far

Memory test ran for 10 hours, clean.

Put both screen grabs of the crash into ChatGPT

Thanks for sharing the second screenshot — this gives more detail.

Here’s what’s happening:

What the Trace Shows

The panic here is in the RCU subsystem (Read-Copy-Update), which is a core Linux synchronization mechanism.
Specifically:

  rcu: rcu_preempt kthread starved for 664912 jiffies! g=7744983 fqs=6

This means the RCU preempt kernel thread was stuck/starved for too long — the kernel watchdog considers this a deadlock.

Backtrace shows:

  native_queued_spin_lock_slowpath
  raw_spin_lock
  resched_curr
  rcu_gp_fqs_loop
  rcu_gp_kthread

→ This indicates the kernel got stuck trying to grab a spinlock during RCU grace period cleanup.
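To put the 664912 jiffies from the stall message into wall-clock terms, a jiffy count can be divided by the kernel tick rate (HZ). The sketch below is only an illustration: `jiffies_to_seconds` is a hypothetical helper name, and the default HZ of 1000 is an assumption about how this kernel was built, not something the trace states.

```shell
#!/bin/sh
# Convert a jiffy count to whole seconds for a given kernel tick rate (HZ).
# HZ is a kernel build option (commonly 100, 250, 300 or 1000). The default
# of 1000 used here is an assumption, not a value taken from the crash trace.
jiffies_to_seconds() {
    jiffies=$1
    hz=${2:-1000}
    # Integer division: the result is rounded down to whole seconds.
    echo $(( jiffies / hz ))
}
```

Under that assumption, `jiffies_to_seconds 664912 1000` gives 664 seconds, roughly 11 minutes of starvation; a lower HZ would make the stall proportionally longer.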

Why This Happens

Kernel/Driver Bug
- Most common cause. Since you’re on Unraid 7.1.4 (kernel 6.12.24), this could be a regression in RCU + ZFS or RCU + NVIDIA drivers.
- Both ZFS and GPU drivers are heavy users of locking.
Resource Starvation
- If one CPU core is pegged at 100% (tight loop), the RCU threads can get starved and trigger this warning.
- Check if you had a VM, Plex, Tdarr, or GPU job maxing out CPU/GPU before the crash.
Hardware Edge Case
- Bad RAM or unstable overclocking (if enabled) can cause locks to never resolve.

How to Safely Restart

Since your system is already throwing RCU stalls, it may not shut down cleanly. Do this:

Try a graceful stop first (recommended):

   /usr/local/sbin/powerdown

* This will attempt to stop the array, unmount disks, and kill processes.

If that hangs for >2–3 minutes:

   sync
   reboot -f

* `sync` flushes writes to disk.
* `reboot -f` forces a reboot without waiting for hung threads.

Next Steps After Restart

Check syslog (Tools → System Log) for warnings before the panic.
Update:
- Make sure you’re on the latest BIOS for the ASRock Z690 Extreme.
- See if Unraid 7.1.5 (or newer kernel) is available — 6.12.24 has had reports of RCU-related hangs.
Isolate:
- Temporarily disable ZFS or NVIDIA drivers to test stability.
- Run memtest from the Unraid boot menu.

⚠️ Bottom line: This wasn’t you doing anything wrong — this is most likely a kernel/module bug aggravated by ZFS or NVIDIA drivers under 6.12.24.

Do you want me to build you a step-by-step recovery & isolation plan (BIOS update → memtest → driver isolation → kernel update) so you can nail down whether this is software or hardware?

Fix common issues

Share appdata set to cache-only, but files / folders exist on the array [no idea how to fix]
Share plex_cache set to cache-only, but files / folders exist on the array [no idea how to fix]
The plugin folder.view.plg is not known to Community Applications and is possibly incompatible with your server [action plan is to remove it, which sucks, as it is GOLD]
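For the two cache-only warnings, a first step could be to see which array disks actually hold a copy of the share folder. This is a minimal sketch, assuming the usual Unraid layout where array disks are mounted as /mnt/disk1, /mnt/disk2 and so on; `list_share_on_array` is a hypothetical helper name, and it only lists folders, it does not move or delete anything.

```shell
#!/bin/sh
# List the array-disk copies of a share's top-level folder.
# Assumes array disks are mounted at <mnt_root>/disk1, <mnt_root>/disk2, ...
# The mount root is a parameter so the function can be tried on a test directory.
list_share_on_array() {
    share=$1
    mnt_root=${2:-/mnt}
    for dir in "$mnt_root"/disk[0-9]*/"$share"; do
        # An unmatched glob stays literal, so only print paths that really exist.
        [ -d "$dir" ] && echo "$dir"
    done
    return 0
}
```

Running `list_share_on_array appdata` and `list_share_on_array plex_cache` would print one path per array disk that still has part of the share, which shows where the stray files live before deciding how to move them back to the cache.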

other actions

going to disable nVIDIA as well

September 8, 2025

Community Expert

Enable the syslog server and post that after a crash, together with fresh diagnostics.

September 8, 2025

Author

Server log from Syslog server (SPLUNK) and Diag, though I did remove the plugins before this was run.

dockyard-diagnostics-20250908-0530.zip

there are a lot of these

Sep 7 20:51:46 10.10.9.1 Sep 7 20:51:46 DockYard kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

Sep 7 20:51:46 10.10.9.1 Sep 7 20:51:46 DockYard kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x51:884)

also

Sep 7 20:54:18 10.10.9.1 Sep 7 20:54:18 DockYard kernel: #PF: supervisor instruction fetch in kernel mode

Sep 7 20:54:18 10.10.9.1 Sep 7 20:54:18 DockYard kernel: BUG: unable to handle page fault for address: ffffffff824fc5c0
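To put a number on "a lot of these", the messages can be counted in a saved copy of the syslog. This is a rough sketch using plain `grep`; `count_log_lines` is a hypothetical helper name, and the log file path is whatever export you have on hand.

```shell
#!/bin/sh
# Count the lines in a saved syslog file that contain a fixed string.
count_log_lines() {
    pattern=$1
    logfile=$2
    # -F treats the pattern as a literal string; -c prints the number of matching lines.
    # grep exits non-zero when nothing matches, so "|| true" keeps the function's
    # exit status at 0 while still printing the count of 0.
    grep -F -c -- "$pattern" "$logfile" || true
}
```

For example, `count_log_lines 'RmInitAdapter failed' syslog-export.txt` (the file name is made up) reports how many times the NVIDIA adapter failed to initialise, and `count_log_lines 'BUG:' syslog-export.txt` does the same for kernel BUG lines.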

1757334860_88.csv.zip

September 8, 2025

Community Expert

Retest without the GPU; also worth running memtest.

September 8, 2025

Author

59 minutes ago, JorgeB said:
Retest without the GPU; also worth running memtest.

As stated, I ran the memory test for 10 hours and it came back clean.

Retesting just means waiting some unknown length of time until it feels like the GPU was the issue, and then trying it again...

The crashes were not "I did this, then that occurred"; it is "the server is down, oh crap, let me fix it."

To keep it from crashing, I removed the nVidia drivers (as indicated), in the hope of having stability until I return home this weekend.

My assumptions are

nvidia driver issue
removed the drivers, so all the apps that use nvidia no longer can? I assume they will try CPU transcoding, or OLLAMA will just fail
wait 4 days in this state
if no crashes, then put them back in?

Downgrading the OS is my gut move, as it is the new OS that started this, but that does seem to carry risk.

September 23, 2025

Author

Update.

As suspected the server has become extremely stable with the Nvidia drivers and video card removed.

With the removal of the Nvidia graphics card and associated drivers there hasn’t been a crash.

September 23, 2025

Community Expert

That suggests the GPU or its driver is the problem. I recommend posting in the Nvidia plugin support thread; there may be known issues with some models/drivers.

September 27, 2025

Author

4 hours ago, alison2033 said:
But what was the reason you wanted to update? Was there an important feature you'd like to use in the new version, or did you just want to update because it was a new version?

When building the key, I did not ‘think’ and used the standard / latest version (the server was down for the move from USB to DOM, I was working through the backup of the old key, and I missed selecting 6.x), so I did an unintended upgrade. This was a 6.x to 7 move, and after the new DOM key install I did not think the effort to go back right away (back to the server rack, unmount the server, open up the case, remove the DOM from the motherboard, back to the workstation, and build again) was worth it; it is the new stuff, after all. Then the crashes started. If they had been immediate and constant, I probably would have gone back. As you know, the networking changed and had a history of issues, so moving down to a lower 7.x looked like a network issue risk, and after research the move back to 6.x looked risky after all the changes over the weeks of being at 7.x.

4 hours ago, alison2033 said:
Below are two options if you want to try:
1. If the reason for the failure was the new update, just go back to the version that was working.

That was not an easy option, see above. Moving back within 7.x introduces more known bugs. At the time of writing there were no stated issues with nVIDIA and, like most who posted, I assumed a memory issue and worked that.

4 hours ago, alison2033 said:
2. If the new update was done through the UnRAID panel, I recommend you try a clean update. Remove the USB flash drive and wipe it, then reinstall UnRAID and the GPU driver. This may solve your problem. After the operating system is clean, everything will work again, or go back to option 1.

Thank you. The stability of a DOM also means I have to dive into the rack server and pull the motherboard-based USB drive. The upgrade was done by building a new key via the Mac tool, for a new DoM, after too many USB stick failures. (Skipping the rant on the need to use a USB to boot a server.)

With 20+ containers, and this server hosting key services used daily/hourly, going through a process like this was not an option. It would be if it crashed after an hour, but there are days between crashes. Having a server (that is used hourly) down for a week to feel comfortable it is ‘stable’, and then layering the next fix on, was not an option.

I am extremely, extremely disappointed that there are so few diagnostic tools with this product that guesswork is the procedure for fixing issues. I have a syslog server (SPLUNK), I have console crash captures, and the best troubleshooting we have is ‘try this’, or some ‘canned’ response that it is your memory. Working with two AI models, diagnosing 4 crash console captures and syslogs, everything pointed to the NVIDIA issue.

Two weeks of stability with the GPU and drivers gone. I moved the nVIDIA workloads to another server on a lower release, and have stability. (BTW, why not move everything? The server with the instability is Intel, with an integrated GPU that I use; the other servers are XEON and AMD and cannot support the workloads.)

I do like the unRAID platform, with three servers. The value it brings for both hosting and experimenting is great. Support on USB failure is fast; I cannot express how incredible it is to get support within an hour on the weekend for a USB key failure. Unfortunately, I have learned how fast the support is too many times. Not enterprise priced, not enterprise quality.

PS. This post comes over a month after the failures / crashes / ‘upgrade’ started, thus the frustration. Not sure who has the need for unRAID and can support 5+ weeks of outages.

4 hours ago, alison2033 said:
There are some words that pros always recommend: don't mess with a good thing, because if you do, it can get bad.

The good thing is not having a USB drive (holding back on the rant). After too many server failures because of a $10 USB stick, I built a DOM-based server, which seems to be the community hive’s recommendation for stability. I was on a VERY VERY old version because of the great advice you just mentioned.

This Product is made great by great people like you willing to help. Thank You

Edited September 27, 2025 by kcossabo

September 27, 2025

Author

Assumed Root Cause

After 2 weeks of stability, the assumed root cause is one of the following:

Incompatibility between the nVidia drivers and 7.1.4: low probability, as I assume there are a lot of installs not having the issue
Incompatibility between the nVidia drivers, 7.1.4, and other drivers and combinations on my server: higher probability, as this gets to more of a snowflake case where some combination is causing the issue
Bad nVidia card: high probability, as the removal did result in stability, though it was not crashing on earlier versions of unRAID

Each of these could be refined, but it is not my jam to figure this out, just to provide the family with stability.

Edited September 27, 2025 by kcossabo

7.1.4 Crashing
