September 8, 2025

Attachments: Crash 9-7.pdf, crash 9-8.pdf

The issue started when I upgraded the USB key to a DoM and, while building it, upgraded the server from 7.1.2 to 7.1.4.

Issue
Two crashes in the last 24 hours, three since the upgrade 2 weeks ago.

Steps so far
* Memory test for 10 hours, clean.
* Put both screen grabs of the crash into ChatGPT.

ChatGPT's response:

Thanks for sharing the second screenshot, this gives more detail. Here's what's happening:

What the Trace Shows
The panic here is in the RCU subsystem (Read-Copy-Update), which is a core Linux synchronization mechanism. Specifically:

    rcu: rcu_preempt kthread starved for 664912 jiffies! g=7744983 fqs=6

This means the RCU preempt kernel thread was stuck/starved for too long; the kernel watchdog considers this a deadlock.

Backtrace shows:

    native_queued_spin_lock_slowpath
    raw_spin_lock
    resched_curr
    rcu_gp_fqs_loop
    rcu_gp_kthread

→ This indicates the kernel got stuck trying to grab a spinlock during RCU grace period cleanup.

Why This Happens
Kernel/Driver Bug
* Most common cause. Since you're on Unraid 7.1.4 (kernel 6.12.24), this could be a regression in RCU + ZFS or RCU + NVIDIA drivers.
* Both ZFS and GPU drivers are heavy users of locking.
Resource Starvation
* If one CPU core is pegged at 100% (tight loop), the RCU threads can get starved and trigger this warning.
* Check if you had a VM, Plex, Tdarr, or GPU job maxing out CPU/GPU before the crash.
Hardware Edge Case
* Bad RAM or unstable overclocking (if enabled) can cause locks to never resolve.

How to Safely Restart
Since your system is already throwing RCU stalls, it may not shut down cleanly. Do this:

Try a graceful stop first (recommended):

    /usr/local/sbin/powerdown

* This will attempt to stop the array, unmount disks, and kill processes.

If that hangs for >2–3 minutes:

    sync
    reboot -f

* `sync` flushes writes to disk.
* `reboot -f` forces a reboot without waiting for hung threads.
Next Steps After Restart
Check syslog (Tools → System Log) for warnings before the panic.
Update:
* Make sure you're on the latest BIOS for the ASRock Z690 Extreme.
* See if Unraid 7.1.5 (or newer kernel) is available; 6.12.24 has had reports of RCU-related hangs.
Isolate:
* Temporarily disable ZFS or NVIDIA drivers to test stability.
* Run memtest from the Unraid boot menu.

⚠️ Bottom line: This wasn't you doing anything wrong; this is most likely a kernel/module bug aggravated by ZFS or NVIDIA drivers under 6.12.24.

Do you want me to build you a step-by-step recovery & isolation plan (BIOS update → memtest → driver isolation → kernel update) so you can nail down whether this is software or hardware?

Fix common issues
* Share appdata set to cache-only, but files / folders exist on the array [no idea how to fix]
* Share plex_cache set to cache-only, but files / folders exist on the array [no idea how to fix]
* The plugin folder.view.plg is not known to Community Applications and is possibly incompatible with your server [action plan is to remove it, which sucks, it is GOLD]

Other actions
* Going to disable nVIDIA as well.
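The stall line quoted in the ChatGPT analysis above can be unpacked with a few lines of Python. This is only a sketch: it parses the message exactly as it was transcribed in this post (the raw kernel output may be formatted differently), and the tick rate `ASSUMED_HZ = 1000` is an assumption, since the kernel's CONFIG_HZ is not stated anywhere in the thread. The names `RcuStall` and `parse_rcu_stall` are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# ASSUMPTION: the kernel tick rate (CONFIG_HZ) is not given in the thread.
# 1000 is used here purely for illustration; check the running kernel's config.
ASSUMED_HZ = 1000

# Matches the stall line as transcribed in the post, not necessarily the
# exact format a given kernel version prints.
STALL_RE = re.compile(
    r"rcu: (?P<flavor>\w+) kthread starved for (?P<jiffies>\d+) jiffies! "
    r"g=(?P<gp>-?\d+)\b.*?\bfqs=(?P<fqs>\d+)"
)


class RcuStall(NamedTuple):
    flavor: str        # e.g. "rcu_preempt"
    jiffies: int       # how long the kthread was starved, in timer ticks
    grace_period: int  # the g= value from the message
    fqs: int           # the fqs= value from the message

    def seconds(self, hz: int = ASSUMED_HZ) -> float:
        """Convert the starvation time from jiffies to seconds."""
        return self.jiffies / hz


def parse_rcu_stall(line: str) -> Optional[RcuStall]:
    """Return the parsed stall fields, or None if the line does not match."""
    m = STALL_RE.search(line)
    if m is None:
        return None
    return RcuStall(m["flavor"], int(m["jiffies"]), int(m["gp"]), int(m["fqs"]))


if __name__ == "__main__":
    stall = parse_rcu_stall(
        "rcu: rcu_preempt kthread starved for 664912 jiffies! g=7744983 fqs=6"
    )
    if stall is not None:
        print(f"{stall.flavor} starved for about {stall.seconds():.0f} s")
```

Under the assumed 1000 Hz tick rate, 664912 jiffies works out to roughly 665 seconds, or about 11 minutes of starvation; at a different CONFIG_HZ the figure changes proportionally.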
September 8, 2025 (Community Expert)

Enable the syslog server and post that after a crash, together with fresh diagnostics.
September 8, 2025 (Author)

Server log from the syslog server (SPLUNK) and diagnostics, though I did remove the plugins before this was run.

Attachment: dockyard-diagnostics-20250908-0530.zip

There are a lot of these:

    Sep 7 20:51:46 10.10.9.1 Sep 7 20:51:46 DockYard kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
    Sep 7 20:51:46 10.10.9.1 Sep 7 20:51:46 DockYard kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x51:884)

Also:

    Sep 7 20:54:18 10.10.9.1 Sep 7 20:54:18 DockYard kernel: #PF: supervisor instruction fetch in kernel mode
    Sep 7 20:54:18 10.10.9.1 Sep 7 20:54:18 DockYard kernel: BUG: unable to handle page fault for address: ffffffff824fc5c0

Attachment: 1757334860_88.csv.zip
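To put a number on "a lot of these", a short script can tally the kernel messages in an exported syslog. The sketch below is illustrative only: the category names and the `count_kernel_events` helper are hypothetical, and the patterns are taken solely from the log excerpts quoted in this thread, so other failure signatures would need their own entries.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical category names; the patterns come only from the excerpts
# posted in this thread and are not an exhaustive list of failure messages.
PATTERNS = {
    "nvidia_init_failure": re.compile(r"NVRM: .*(rm_init_adapter|RmInitAdapter) failed"),
    "kernel_page_fault": re.compile(r"BUG: unable to handle page fault"),
    "rcu_stall": re.compile(r"rcu: .* kthread starved"),
}

# Everything after "kernel: " is treated as the kernel message text.
KERNEL_MSG = re.compile(r"kernel: (?P<msg>.*)$")


def count_kernel_events(lines: Iterable[str]) -> Counter:
    """Count how many lines fall into each known category."""
    counts: Counter = Counter()
    for line in lines:
        m = KERNEL_MSG.search(line)
        if m is None:
            continue  # not a kernel message
        msg = m.group("msg")
        for name, pattern in PATTERNS.items():
            if pattern.search(msg):
                counts[name] += 1
                break  # one category per line
    return counts


if __name__ == "__main__":
    sample = [
        "Sep 7 20:51:46 10.10.9.1 Sep 7 20:51:46 DockYard kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0",
        "Sep 7 20:51:46 10.10.9.1 Sep 7 20:51:46 DockYard kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x22:0x51:884)",
        "Sep 7 20:54:18 10.10.9.1 Sep 7 20:54:18 DockYard kernel: BUG: unable to handle page fault for address: ffffffff824fc5c0",
    ]
    for name, n in count_kernel_events(sample).most_common():
        print(f"{name}: {n}")
```

For a real export, the same function can be fed an open file object, since it only needs an iterable of lines. Counting per category over time would show whether the NVRM failures consistently precede the page faults, which is the pattern the excerpts above hint at.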
September 8, 2025 (Author)

59 minutes ago, JorgeB said:
"Retest without the GPU, also worth running memtest."

As stated, I did the memory test for 10 hours and it came back clean.

A retest just means waiting for some unknown length of time until it feels like the issue was the GPU, and then trying it again. The crashes were not "I did this, then that occurred"; it is "the server is down, oh crap, let me fix it".

In an effort to keep it from crashing, I removed the nVidia drivers (as indicated), in the hope that I have stability until I return home this weekend.

My assumptions are:
* nVidia driver issue.
* Removed the drivers, so all the apps that use nVidia no longer can? I assume they will try CPU transcoding, or OLLAMA will just fail.
* Wait 4 days in this state.
* If no crashes, then put them back in?

Downgrading the OS is my GUT move, as it is the new OS that started this, but that does seem to carry risk.
September 23, 2025 (Author)

Update. As suspected, the server has become extremely stable with the Nvidia drivers and video card removed. Since the removal of the Nvidia graphics card and associated drivers there hasn't been a crash.
September 23, 2025 (Community Expert)

That suggests the GPU or its driver is the problem. I recommend posting in the Nvidia plugin support thread; there may be some known issues with some models/drivers.
September 27, 2025 (Author)

4 hours ago, alison2033 said:
"But what was the reason you wanted to update? Was there an important feature you'd like to use in the new version, or did you just want to update because it was a new version?"

When building the key, I did not 'think' and used the standard / latest version. (The server was down for the upgrade to the USB DOM, I was working through the backup of the old key, and I missed selecting 6.x.) The result was an unintended upgrade from 6.x to 7. After the new DOM key was installed, I did not think the effort to go back right away (back to the server rack, unmount the server, open up the case, remove the DOM from the motherboard, back to the workstation, and build again) was worth it; it is the new stuff, after all. Then the crashes started. If they had been immediate and constant, I probably would have gone back. As you know, the networking changed and had a history of issues, so moving down to a lower 7.x looked like a network-issue risk, and after research the move back to 6.x looked risky after all the changes made over the weeks of being on 7.x.

4 hours ago, alison2033 said:
"Below are two options if you want to try: 1. If the reason for the failure was the new update, just go back to the version that was working."

That was not an easy option; see above. Moving back within 7.x introduces more known bugs. At the time of writing there were no stated issues with nVIDIA and, like most who posted, I assumed a memory issue and worked that.

4 hours ago, alison2033 said:
"2. If the new update was done through the UnRAID panel, I recommend you try a clean update. Remove the USB flash drive and wipe it, then reinstall UnRAID and the GPU driver. This may solve your problem. After the operating system is clean, everything will work again, or go back to option 1."

Thank you. The stability of a DOM also introduces the need for me to dive into the rack server and pull the motherboard-based USB drive.
The upgrade was done by building a new key via the Mac tool, for a new DoM after too many USB stick failures. (Skipping the rant on the need to use a USB to boot a server.)

With 20+ containers and this server hosting key services that are used daily/hourly, going through a process like this was not an option. It would be if the server crashed after an hour, but it is days between crashes. Having a server that is used hourly down for a week, just to feel comfortable that it is 'stable' before layering the next fix on, was not an option.

I am extremely, extremely disappointed that there are so few diagnostic tools with this product that guesswork is the procedure for fixing issues. I have a syslog server (SPLUNK), I have console crash captures, and the best troubleshooting we have is "try this", or some 'canned' response that it is your memory. Working with two AI models, diagnosing 4 crash console captures and syslogs, everything pointed to the NVIDIA issue.

Two weeks of stability with the GPU and drivers gone. I moved the nVIDIA workloads to another server on a lower release, and have stability. (By the way, why not move everything? The server with the instability is Intel, with an integrated GPU that I use; the other servers are XEON and AMD and cannot support the workloads.)

I do like the unRAID platform, with three servers. The value it brings for both hosting and experimenting is great. Support on USB failure is fast; I cannot express how incredible it is to get support within an hour on the weekend for a USB key failure. Unfortunately, I have learned how fast the support is too many times. Not enterprise priced, not enterprise quality.

PS. This post comes over a month after the failures / crashes / 'upgrade' started, thus the frustration.
Not sure who has the need for unRAID and can support 5+ weeks of outages.

4 hours ago, alison2033 said:
"There are some words that pros always recommend: don't mess with a good thing, because if you do, it can get bad."

The good thing is not having a USB drive (holding back on the rant). After too many server failures because of a $10 USB stick, I built a DOM-based server, which seems to be the community hive's recommendation for stability. I was on a VERY VERY old version because of the great advice you just mentioned.

This product is made great by great people like you willing to help. Thank you.

Edited September 27, 2025 by kcossabo
September 27, 2025 (Author)

Assumed Root Cause
After 2 weeks of stability, the assumed root cause is one of the following:
* Incompatibility between the nVidia drivers and 7.1.4. Low probability, as I assume there are a lot of installs that are not having the issue.
* Incompatibility between the nVidia drivers, 7.1.4, and other drivers and combinations on my server. Higher probability, as this gets to more of a snowflake case where some combination is causing the issue.
* BAD nVidia card. High probability, as the removal did result in stability, though the card was not crashing on earlier versions of unRAID.

Each of these could be refined, but it is not my jam to figure this out, just to provide the family with stability.

Edited September 27, 2025 by kcossabo