August 5, 2025
For months now my server has gone unresponsive about every 2 days, and I've had to hard power down to get it back. When it's back it runs like a dream.

Unresponsive means:
- I can't access the GUI dashboard.
- I can ssh in. 'Reboot', with force or otherwise, gives me a beep but not a shutdown. Docker stop commands fail. HTOP suggests 3 CPU threads are stuck at 100%.
- Docker stats suggests all my docker containers are running, but I can't access them either locally or through the reverse proxy.
- There is nothing of any importance (errors/warnings) that I can see in the syslogs or via dmesg.
- Failure times are random (middle of the night, middle of the day, idle times, busy times) - I can't pinpoint it to any one container or script.

In troubleshooting:
- I've changed/upgraded all of the PCI cards, memory and processor
- Uninstalled many plugins
- Tried switching off the dockers (but not all yet)
- Rebuilt the docker vdisk (twice)
- Switched off all VMs completely
- Prevented drives from spinning down
- Ensured BIOS idle settings are as everyone suggests (C-states disabled, idle control typical, Cool'n'Quiet disabled)
- Removed all USB devices
- Changed UPS
- Use Docker ipvlan networks
- Temperatures are all ok
- Hammered the box with 90% CPU processing for 7 hours - no problem
- All disks have plenty of capacity / SMARTs are good

The server cannot survive for more than 48 hours....

Can anyone spot anything in my diags?
Do I need to reinstall the UNRAID OS perhaps?
It all suggests to me that something is 'building up' or hammering the system - but what is it?

ANY suggestion is welcome. I'm almost ready to give up.

ding-a-ling-diagnostics-20250805-1130.zip
Edited August 5, 2025 by late4473: Add brief HTOP findings.
August 5, 2025
This sounds like the exact issue I have! Haven't looked at your diag, but are you on 7.1.4?
August 5, 2025 Author
16 minutes ago, mister_thew said: this sounds like the exact issue i have! havent looked at your diag but are you on 7.1.4?

Yes, 7.1.4.
I guess you haven't solved it? Did you try a rollback?
August 5, 2025 Author
21 minutes ago, JorgeB said: Enable the syslog server and post that after a crash.

Enabled, will post. But I should add that I've inspected the log before, during the 'hangup', via ssh. The last entries always seem to be just the normal 'spin down' of the hard drives.
Edited August 5, 2025 by late4473
August 7, 2025 Author
@JorgeB Here is the syslog and some other screenshots. It died 6-Aug @ 22:10ish, i.e. all dockers stopped responding. Hard reboot and all is well again.... the cycle continues.
- I could access the unraid GUI - note a CPU thread locked at 100%.
- HTOP doesn't report the same locked CPU but is reporting the load average endlessly climbing.
- There is a recurring error about a missing css file. I have tried both 'light' and 'dark' settings but this file is always missing.

I hope you can spot something.
Thanks
syslog-192.168.50.12.log
August 7, 2025 Community Expert
I'm afraid that there's nothing relevant logged. One thing you can try is to boot the server in safe mode with all docker containers/VMs disabled and let it run as a basic NAS for a few days. If it doesn't crash, start turning on the other services one by one, including the individual docker containers.
August 9, 2025
I've had the same problem with 7.1.4 - it started with 7.0.1 but was less frequent back then.

The only useful output I've seen appears in the log window if you have it open (it is not written to the log file):
Server kernel: CIFS: VFS: \\Server has not responded in 180 seconds. Reconnecting...

Most times it dies around 1:15am, hours after the last drive has spun down (noted in the log).
I need to roll back to 6.12 - the same hardware was stable back then.
August 9, 2025 Author
Thank you for the feedback. I'm not convinced that the version is the root cause, because I was actually reporting the same issue on this forum about a year ago! My server also dies at RANDOM times.

I've actually today just gone over 2 days of uptime (!) - unprecedented for months. I'd noticed that a 'flash backup' seemed to be running at random times during the day, so my last change was to completely remove the CONNECT plug-in. Let me monitor over the next few days. It's so good not having to hard boot every other day! #fingerscrossed

@JDGJr Do you have Connect installed and do you have auto backups switched on?
August 10, 2025 Author
My server survived for 4 days or so - and the unresponsive behaviour at the death was slightly different. I could access the GUI (all looked normal) with no reported stuck CPUs, but all docker containers were unresponsive. Any attempt to stop them resulted in 'server error'. I did notice my Docker PID limit was on the default (2048), and a quick docker stats showed I was in the high 1000s, so I have now bumped it up substantially and restarted.

Watch this space.
August 10, 2025
On 8/8/2025 at 11:55 PM, late4473 said: @JDGJr Do you have Connect installed and do you have auto backups switched on?

I do not have Connect installed. Auto backups run once a week, but the system had been stopping every other day.

I'm still prepping for rolling back to 6.12, and the system has been more stable since I posted here. But I've been doing a lot of backups to other systems overnight, when the stoppage had been common.

My system's UI looked ok - I could change pages, but I got the 'system error' message when interacting with Dockers, and the spin up/down actions on the main drive list did nothing with the drives.
Edited August 10, 2025 by JDGJr: more detail
August 15, 2025 Author
I'm going to mark this as solved - my uptime is now 5 days (!) and the server seems rock solid (40 docker containers, 4 VMs all happily running). Last key actions:
- Uninstalled the Connect plug-in (had some impact)
- Increased Settings/Docker/PID limit from the default setting to 8000 (even though my Docker stats PID column only sums to c. 1300). Could there have been spikes perhaps?

pids.sh
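For anyone wanting to repeat the PID check described above, here is a minimal Python sketch. It is not the attached pids.sh (whose contents are not shown in the thread). It assumes the `docker` CLI is on the PATH and uses `docker stats --no-stream` with a `--format` template of `{{.Name}}` and `{{.PIDs}}`. The thread does not say whether the Unraid PID limit applies per container or in total, so the sketch returns per-container counts and leaves the comparison to the reader. The container names in the sample text are made up.

```python
import subprocess


def parse_pid_counts(stats_output: str) -> dict[str, int]:
    """Parse lines of '<container name><TAB><pid count>' into a dict."""
    counts: dict[str, int] = {}
    for line in stats_output.splitlines():
        line = line.strip()
        if "\t" not in line:
            continue  # blank or unexpected line
        name, _, pids = line.rpartition("\t")
        # Skip rows whose count is not a plain number
        if pids.isdigit():
            counts[name] = int(pids)
    return counts


def read_docker_pid_counts() -> dict[str, int]:
    """Take one snapshot of per-container PID counts from the docker CLI."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.Name}}\t{{.PIDs}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_pid_counts(result.stdout)


# Hypothetical sample so the parsing can be tried without Docker running
sample = "plex\t120\nnextcloud\t45\nstopped-container\t--\n"
counts = parse_pid_counts(sample)
total = sum(counts.values())
busiest = max(counts, key=counts.get)
print(f"total PIDs: {total}, busiest container: {busiest} ({counts[busiest]})")
```

A single snapshot will not catch the kind of spike wondered about above; calling `read_docker_pid_counts()` from a scheduled job and logging the maximum over time would be one way to look for that.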
September 6, 2025 Author Solution
One more follow-up - the previous solution was not permanent and the server continued to die. I'm putting this down to the unRaid FUSE bug (shfs). The resolution, which has given a 7 day up-time so far...
1) Upgrade to 7.2.0-beta.2 (can't be any worse than what I've got!)
2) Change all drive references (docker container persistent drives, docker.img, libvirt.img and all VM storage paths) from /mnt/user/... to /mnt/cache/... These are all cache-only drives anyway and this takes FUSE out of the equation.

I hope this helps someone!
Will I make 2 weeks of up-time??
Edited September 6, 2025 by late4473
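The path change in step 2 amounts to swapping a prefix. A small Python sketch of that mapping follows, for checking a list of container or VM paths before editing them by hand. It assumes the pool is named `cache`, as in the post above; the example paths are hypothetical, and the sketch only rewrites strings without touching any Unraid configuration.

```python
from pathlib import PurePosixPath

USER_ROOT = PurePosixPath("/mnt/user")    # FUSE (shfs) view of the shares
CACHE_ROOT = PurePosixPath("/mnt/cache")  # direct path to the pool; name assumed


def to_cache_path(path: str) -> str:
    """Map /mnt/user/<rest> to /mnt/cache/<rest>; leave other paths alone."""
    try:
        rest = PurePosixPath(path).relative_to(USER_ROOT)
    except ValueError:
        return path  # not under /mnt/user (e.g. /mnt/user0 or /mnt/disk1)
    return str(CACHE_ROOT / rest)


# Hypothetical examples
for p in ["/mnt/user/appdata/plex", "/mnt/user/system/docker/docker.img", "/mnt/disk1/media"]:
    print(p, "->", to_cache_path(p))
```

This is only safe for data that lives entirely on that pool, which is the situation described in the post ("cache-only drives").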
September 6, 2025
Thanks for posting this. I'm still trying to prove my 7.1.4 system isn't crash prone. (I am wary of moving to the next beta.)

Interestingly, I made the /mnt/user -> /mnt/cache changes late this week - mostly to become consistent. I hope that gives me the positive result you're looking for.

Question: do you see a line similar to 'CIFS: VFS: \\Server has not responded in 180 seconds. Reconnecting...' at the end of the syslog when your machine crashes?
(You need to have a console window displaying the syslog when it dies, as it would not be able to write to the disk log file.)
Edited September 6, 2025 by JDGJr: addition
September 7, 2025 Author
No - never seen the CIFS line. I only see drives spinning up and down in my syslog.
Good luck!
September 7, 2025
I've recently been dealing with the same issues. It doesn't seem related to upgrading the OS version, since I did my update a few weeks ago and the problem started 4 days ago.

Basically, the server freezes about every 24h to 36h. I did a complete server check and no hardware faults were found. I did turn on syslog and will edit this post when I get the file. Help would certainly be greatly appreciated for now.
tower-diagnostics-20250907-0950.zip
September 9, 2025
Syslog server is enabled and the following files were recorded after multiple crashes. Today, I moved my appdata and docker.img files to my cache drive. The server still crashed.
syslog
syslog-previous
Edited September 9, 2025 by leprechaun17
September 9, 2025 Community Expert
You have a serious LAN loop configured; this will kill the box sooner or later:

Sep 8 19:02:47 Tower kernel: br0: received packet on eth0 with own address as source address (addr:a4:ba:db:19:b2:c2, vlan:0)
Sep 8 19:02:47 Tower kernel: br0: received packet on eth2 with own address as source address (addr:a4:ba:db:19:b2:c2, vlan:0)
Sep 8 19:02:47 Tower kernel: br0: received packet on eth0 with own address as source address (addr:a4:ba:db:19:b2:be, vlan:0)
Sep 8 19:02:47 Tower kernel: br0: received packet on eth0 with own address as source address (addr:a4:ba:db:19:b2:c2, vlan:0)
Sep 8 19:02:47 Tower kernel: br0: received packet on eth2 with own address as source address (addr:a4:ba:db:19:b2:be, vlan:0)
Sep 8 19:02:47 Tower kernel: br0: received packet on eth3 with own address as source address (addr:a4:ba:db:19:b2:be, vlan:0)
Sep 8 19:02:47 Tower kernel: br0: received packet on eth2 with own address as source address (addr:a4:ba:db:19:b2:c2, vlan:0)
Sep 8 19:02:47 Tower kernel: br0: received packet on eth0 with own address as source address (addr:a4:ba:db:19:b2:c2, vlan:0)
Sep 8 19:02:47 Tower kernel: br0: received packet on eth0 with own address as source address (addr:a4:ba:db:19:b2:be, vlan:0)
Sep 8 19:02:47 Tower kernel: br0: received packet on eth3 with own address as source address (addr:a4:ba:db:19:b2:c2, vlan:0)

It looks like you have attached all ports to the same switch (or to different switches that are themselves interconnected). I assume you want to do some kind of bonding, but either you have not told your switch about it, or it does not support this kind of setup.

The only valid options would be to pull out 3 of those cables or to set the bond mode to "active backup".
Either way, only one card will be in use.
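To check a syslog for the same symptom, the warning lines quoted above can be counted per port and MAC address. The following is a minimal Python sketch; the regular expression is written against the exact wording of the lines in this post and may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Matches e.g. "br0: received packet on eth0 with own address as source
# address (addr:a4:ba:db:19:b2:c2, vlan:0)"
LOOP_RE = re.compile(
    r"(?P<bridge>\S+): received packet on (?P<port>\S+) with own address "
    r"as source address \(addr:(?P<mac>[0-9a-fA-F:]{17}), vlan:(?P<vlan>\d+)\)"
)


def count_loop_warnings(log_text: str) -> Counter:
    """Count 'own address as source address' warnings per (port, MAC)."""
    hits: Counter = Counter()
    for match in LOOP_RE.finditer(log_text):
        hits[(match["port"], match["mac"])] += 1
    return hits


# Three of the lines quoted in the post above
sample = """\
Sep 8 19:02:47 Tower kernel: br0: received packet on eth0 with own address as source address (addr:a4:ba:db:19:b2:c2, vlan:0)
Sep 8 19:02:47 Tower kernel: br0: received packet on eth2 with own address as source address (addr:a4:ba:db:19:b2:c2, vlan:0)
Sep 8 19:02:47 Tower kernel: br0: received packet on eth0 with own address as source address (addr:a4:ba:db:19:b2:c2, vlan:0)
"""
hits = count_loop_warnings(sample)
ports = sorted({port for port, _mac in hits})
print(f"{sum(hits.values())} warnings across ports: {', '.join(ports)}")
```

Seeing the same MAC address reported on several ports, as in the quoted lines, is the pattern being pointed at here.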
September 18, 2025
So the server has been up for the past 6 days without interruption. Thank you for your help.
September 23, 2025
On 9/6/2025 at 2:44 PM, late4473 said: Change all drive references (docker container persistent drives, docker.img, libvirt.img and all VM storage paths) from /mnt/user/... to /mnt/cache/... These are all cache only drives anyway and this takes FUSE out of the equation. Will I make 2 weeks of up-time??

Did you make it to 2 weeks?
I'm happy to report that making those changes to my 7.1.4 system has given me 16+ days of uptime! Haven't had this in months.
September 23, 2025 Author
Yep - solid as a rock - like you, I haven't seen that for months! I'm pretty sure it was due to taking FUSE out of the equation rather than the upgrade (my conclusion being that it couldn't handle the throughput).... Glad you're in the same boat. I was ready to chuck unRAID into the bin!