Help diagnosing system hanging intermittently

Hi All,

I'm tearing my hair out trying to diagnose a problem where, after several days (10 in the last instance), my whole system just hangs and becomes completely unresponsive. No response to ping, no display output, numlock not even toggling. I have to hard power cycle via IPMI. No logs in the SuperMicro IPMI health events either.

The failure also does not seem to follow any particular schedule, so I can't align it with any scheduled jobs that might be causing the issue.

I've mirrored syslog to flash, but I can't see any obvious syslog events at the tail end of the syslog-previous.txt either. As such, I'm left scratching my head about what the issue is.
I've had a browse of the syslog and nothing is jumping out at me. A few FFMPEG errors, which I'm guessing are just Frigate dealing with a few blips in my CCTV rtsp streams.
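
One way to pin down when the box actually died is to look for the largest jump between consecutive syslog timestamps: a hard hang leaves a hole between the last entry written and the first entry of the next boot. Below is a minimal Python sketch, assuming the mirrored log (for example syslog-previous.txt followed by the current syslog) uses the same "Mon D HH:MM:SS" prefix as the excerpts in this thread and an English locale for month names. The function names are made up for illustration, and the year has to be passed in because the prefix does not carry one.

```python
from datetime import datetime


def parse_ts(line, year):
    """Return the timestamp at the start of a syslog line, or None if absent."""
    parts = line.split(None, 3)  # month, day, time, rest of line
    if len(parts) < 3:
        return None
    try:
        return datetime.strptime(
            f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None  # line does not start with a syslog timestamp


def largest_gap(lines, year):
    """Return (gap, before, after) for the biggest jump between entries."""
    prev = None
    best = None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue
        if prev is not None:
            gap = ts - prev
            if best is None or gap > best[0]:
                best = (gap, prev, ts)
        prev = ts
    return best
```

The "before" timestamp of the largest gap is then a rough upper bound on the time of death, which can be lined up against IPMI events or scheduled jobs.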

The only one that I thought might be worth looking at was this one, where it's starting up the custom network that lets the PiHole container get its own IP on the network:
Jul 26 00:08:16 tower rc.docker: Processing... br0
Jul 26 00:08:16 tower root: Error response from daemon: network with name br0 already exists
Jul 26 00:08:16 tower rc.docker: connecting pihole to network br0
Jul 26 00:08:16 tower root: Error response from daemon: endpoint with name pihole already exists in network br0
Jul 26 00:08:16 tower rc.docker: ip link add link br0 name shim-br0 type ipvlan mode l2 bridge
Jul 26 00:08:16 tower rc.docker: ip link set shim-br0 up
Jul 26 00:08:16 tower rc.docker: ip -6 addr flush dev shim-br0
Jul 26 00:08:16 tower rc.docker: ip -4 addr add 192.168.1.11/24 dev shim-br0 metric 0
Jul 26 00:08:16 tower rc.docker: ip -4 route add default via 192.168.1.1 dev shim-br0 metric 0
Jul 26 00:08:16 tower rc.docker: created network shim-br0 for host access
Jul 26 00:08:16 tower rc.docker: Network started.

I think I've probably screwed up the config for pihole's network in some way, but it seems unlikely that would cause the whole system to hang days later.
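
For anyone wanting to check their own syslog for the same thing, here is a small Python sketch that pulls out just the Docker daemon errors from lines like the ones above. It assumes the errors always carry the literal text "Error response from daemon: " as they do in this excerpt; the function name is hypothetical.

```python
MARKER = "Error response from daemon: "


def daemon_errors(lines):
    """Return the message part of every Docker daemon error line."""
    return [
        line.split(MARKER, 1)[1].strip()  # keep only the text after the marker
        for line in lines
        if MARKER in line
    ]
```

Run against the excerpt above, this would return the two "already exists" messages and skip the ordinary rc.docker lines.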

Does anyone have any ideas or suggestions on how to diagnose further? I feel like I've hit a wall. Diagnostics attached.

Cheers

tower-diagnostics-20250805-0801.zip

Solved by kabadisha

Without anything relevant logged there's not much to go on, and this can also be a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one, including the individual Docker containers.

  • Author
2 hours ago, JorgeB said:

Without anything relevant logged there's not much to go on, and this can also be a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one, including the individual Docker containers.

Yes, indeed. I was worried this might be the answer. Debugging that way is going to take an age because of how infrequently the issue occurs.
Maybe I'll try to run a memtest - that's easy at least.
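
To put a rough number on "an age": if the hangs are treated as random events with a mean interval of about 10 days (the figure from the first post), the chance of seeing no hang in t days while the fault is still present is exp(-t / 10). The sketch below is only a back-of-the-envelope model; the exponential assumption and the function name are mine, not anything measured on this system.

```python
import math


def quiet_days_needed(mean_days_between_hangs, confidence):
    """Days without a hang needed before a fix can be trusted at `confidence`.

    Assumes hangs arrive at random with the given mean interval, so
    P(no hang in t days | fault still present) = exp(-t / mean).
    """
    return -mean_days_between_hangs * math.log(1.0 - confidence)
```

With a 10-day mean interval, 95% confidence works out to roughly 30 quiet days per configuration tested, which is why enabling services one at a time is so slow for a fault this infrequent.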

I figure it's unlikely to be a hardware fault given that I haven't changed the hardware configuration at all for months. It all started happening when I made the jump to the v7 release. I was suffering from the widely reported networking issues. This seemed to be significantly improved by the 7.1.4 patch and so I thought that was the root cause.

That's why I'm suspecting maybe the PiHole network configuration and ipvlan. I think I'll try disabling PiHole as well as the custom network and see if that resolves it.
I hate issues like this with not much to go on.

Cheers,
Charlie

6 minutes ago, kabadisha said:

Maybe I'll try to run a memtest - that's easy at least.

Worth a try, but since memtest is only definitive if it finds errors, if you have multiple sticks, try running the server with just one. If it still crashes, try a different one. That will basically rule out bad RAM.

  • Author
5 hours ago, JorgeB said:

Worth a try, but since memtest is only definitive if it finds errors, if you have multiple sticks, try running the server with just one. If it still crashes, try a different one. That will basically rule out bad RAM.

It died again at some point this morning. Interestingly, this time there does seem to be something in the IPMI health event log, but the error codes don't mean anything to me, so not sure it's very helpful. Screenshot attached.
Successfully did a memtest pass with no issues. Going to try disabling PiHole now and see if that improves things...

Screenshot 2025-08-05 at 17.27.09.png

24 minutes ago, kabadisha said:

but the error codes don't mean anything to me

Me neither, but Supermicro support may be able to help.

  • Author

Ok, some progress:

I've managed to resolve the "Error response from daemon: network with name br0 already exists" error.

It appears that somewhere along the line, the custom br0 network was duplicated (or maybe I created it manually and simply can't remember). Since I have "Preserve user defined networks" enabled in my Docker settings, it wasn't being cleaned up, and so was conflicting with the custom network that Unraid was trying to create for me when I enabled IPv4 custom network on interface br0 under Docker Settings.

I suspect that dodgy instance of br0 may have been causing the issues. To resolve it, I did the following:

  1. Disable autostart on the pi-hole container

  2. Stop Docker

  3. Disable IPv4 custom network on interface br0 under Docker Settings

  4. Start Docker

  5. Edit the pi-hole container config and set the network type to host. It will fail to start and complain about port 80 being taken, but that's ok.

  6. Stop Docker

  7. Enable IPv4 custom network on interface br0 under Docker Settings

  8. Start Docker

  9. Edit the pi-hole container config and set network type to Custom-br0 and set the static IP I use for pi-hole on my network (192.168.1.12 in my case).

  10. Re-enable autostart on the pi-hole container

Not sure if that's going to resolve the system crashes, but I do now get a clean network startup when starting Docker, and I'm counting that as progress:

Aug 5 19:39:32 tower rc.docker: Processing... br0
Aug 5 19:39:32 tower rc.docker: created network ipvlan br0 with subnets: 192.168.1.0/24;
Aug 5 19:39:32 tower rc.docker: connecting pihole to network br0
Aug 5 19:39:32 tower rc.docker: ip link add link br0 name shim-br0 type ipvlan mode l2 bridge
Aug 5 19:39:32 tower rc.docker: ip link set shim-br0 up
Aug 5 19:39:32 tower rc.docker: ip -6 addr flush dev shim-br0
Aug 5 19:39:32 tower rc.docker: ip -4 addr add 192.168.1.11/24 dev shim-br0 metric 0
Aug 5 19:39:32 tower rc.docker: ip -4 route add default via 192.168.1.1 dev shim-br0 metric 0
Aug 5 19:39:32 tower rc.docker: created network shim-br0 for host access
Aug 5 19:39:32 tower rc.docker: Network started.

Time will tell if that resolves the issue.

  • 2 weeks later...
  • Author

Still no dice. I was getting optimistic, but it died again last night.

Pumped the syslog into ChatGPT on the off-chance it might spot something I didn't. It didn't give me much - it seemed to hallucinate a kernel panic in the logs.
It did cause me to pay more attention to:

kernel: mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 8.

This doesn't look like a fatal error to me, but it did cause me to go and check for BIOS and IPMI firmware updates. Both had updates available, so I decided to try upgrading those.

Upgrading both hasn't resolved those mce firmware bugs, but maybe it'll help system stability.

  • Author

Still having the issue, so BIOS update didn't do the trick.
I'm also still not getting anything useful in the logs. This is going to be a bastard to resolve. There's so little to go on :-(

  • Author

It just died for the first time while I was actively using it.
I tried switching from ipvlan to macvlan to give pihole its own IP to see if that had any impact. Apparently not.

I also noticed that the "Error response from daemon: network with name br0 already exists" error had reappeared. It seems that if you have "Preserve user defined networks" enabled, this error will be seen in the log. I've disabled preserving networks for now (as I no longer need that feature) and that seems to have resolved that one.

I've now pulled out two of the four RAM sticks. If it fails again, I'll swap the pair. If I still get the failure on both pairs, then I'll know it's not a RAM issue.

  • Author

Update: Swapping RAM had no impact. However, I have just managed to trigger the issue several times in quick succession.

I started a large download in sabnzbd and it seemed to trigger the failure. I was tailing syslog at the time of failure and there was literally nothing logged.
I'm going to disable sabnzbd and see if that leads to stability.

  • 3 weeks later...
  • Author

Another update in case anyone finds this thread in the future. Shutting down the Sabnzbd container resolved the stability issues.
I now have it running again, and so far it seems to be stable. I have changed two things:

  1. Previously, I had /downloads/incomplete and /downloads/complete mapped as two different mounts on the container. I have now switched to one mount of the parent directory /downloads instead.

    My theory is that my previous method was preventing simple file move operations from behaving properly since the move (as far as the container was concerned) was from one mount to another.

  2. I have switched from /mnt/user/downloads to /mnt/cache/downloads as the host end of the mount. This avoids the overhead of the FUSE filesystem layer: https://www.reddit.com/r/unRAID/comments/uwxmg6/comment/i9uj9z1/
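
The theory in point 1 can be checked from inside the container. On Linux, rename() does not work across mount points, even when the same filesystem is mounted on both, so a "move" between two separately mapped directories falls back to copy-then-delete. Here is a minimal Python sketch that tries a plain rename and reports whether the kernel refused it with EXDEV; the function name is hypothetical and the paths you pass in are up to you.

```python
import errno
import os


def try_rename(src, dst):
    """Attempt a plain rename; return False if it would cross a mount point."""
    try:
        os.rename(src, dst)
        return True  # cheap metadata-only move
    except OSError as exc:
        if exc.errno == errno.EXDEV:
            return False  # caller would have to copy the data and delete src
        raise  # any other error is a real problem, do not hide it
```

If this returns False for a test file moved from the incomplete directory to the complete one, every finished download is being copied in full rather than renamed, which is the extra IO the single /downloads mount is meant to avoid.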

I hope this helps someone else in future :-)

Edited by kabadisha

  • Author

A further update: The mystery continues.

My server was stable for several weeks with Sabnzbd disabled. I brought it back online after the changes above and it was stable for a while, successfully downloading a large number of files.
This morning, though: Dead again.

This time I booted back up, and started downloads whilst tailing /var/log/* via ssh as well as having htop open and the logs for both Sabnzbd and gluetun.
I was able to get the system to hang multiple times over the course of an hour. Each time, there were no entries at all in any of the logs, and htop showed no unusual system load.
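
When nothing is logged at the moment of the hang, a heartbeat file at least records how recently the system was alive. This is only a sketch of the idea, not something used in this thread: a script appends a timestamp and forces it to disk, and after the reboot the last line shows the latest time the system was known to be running. The function names and the choice of file location are assumptions.

```python
import os
import time


def write_heartbeat(path, now=None):
    """Append the current Unix time to `path` and force it onto disk."""
    stamp = time.time() if now is None else now
    with open(path, "a") as fh:
        fh.write(f"{stamp:.0f}\n")
        fh.flush()
        os.fsync(fh.fileno())  # make sure the line survives a hard power cycle


def last_heartbeat(path):
    """Return the most recent timestamp written, or None if the file is empty."""
    with open(path) as fh:
        stamps = fh.read().split()
    return float(stamps[-1]) if stamps else None
```

Calling write_heartbeat from a loop or a scheduled job every few seconds narrows the time of death to that interval, at the cost of a small, constant trickle of writes to whichever device holds the file.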

While I'm glad I've narrowed it down to Sabnzbd (or maybe Gluetun), this is proving very difficult indeed to diagnose fully.

  • Author

Another update:
I just had another system freeze while sabnzbd was shut down. That means it isn't the culprit.

Interestingly, the issue occurred while I was conducting an internet speedtest from the server using a different containerised service. The theme here is that this involves a reasonably large download.

This is smelling increasingly like a hardware issue. Since I have already ruled out RAM and I have high confidence in my power supply, I'm wondering if this might be one of the SSDs in my cache pool struggling to keep up when data is being written.
I've got two 2TB SATA SSDs from different manufacturers, both about a year old and both reporting good health with 96+% life remaining:

  • Samsung 870 EVO 2TB

  • Crucial MX500 (CT2000MX500SSD1)

Feel like I'm clutching at straws though.
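
One way to test the "SSD struggling to keep up" idea is to time synchronous writes to a file on the pool and look for outliers. The following is a rough Python sketch under that assumption, not a proper benchmark: it writes random chunks, calls fsync after each one, and returns the per-chunk latency in seconds. The function name, chunk size and chunk count are all arbitrary choices.

```python
import os
import time


def fsync_latencies(path, chunk_bytes, chunks):
    """Write `chunks` blocks to `path`, fsync each one, return seconds per block."""
    block = os.urandom(chunk_bytes)  # random data so compression cannot hide the cost
    latencies = []
    with open(path, "wb") as fh:
        for _ in range(chunks):
            start = time.perf_counter()
            fh.write(block)
            fh.flush()
            os.fsync(fh.fileno())  # wait until the device reports the data as written
            latencies.append(time.perf_counter() - start)
    return latencies
```

A drive that occasionally stalls under sustained writes would show up as a few latencies far above the rest; uniformly small numbers would point away from the cache pool.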

  • 2 weeks later...
  • Author

I have changed a few things, and so far the system seems stable (fingers crossed).

My Supermicro motherboard has an NVMe M.2 slot, so I bought a 2TB WD Black one and then changed a number of things:

  1. I made the new NVMe drive the primary cache and migrated appdata, docker image etc to it.

  2. I took the Samsung SATA SSD and made it into a separate cache pool called data-cache. This cache is now used for downloads, frigate cctv and other potentially heavy write IO.

  3. I completely overhauled my directory structure to adopt the structure suggested by the TRaSH guides. My previous setup involved completed downloads being copied to the array as soon as they were downloaded. Now the move is instantaneous and relies on the mover to transfer files to the array.

  4. I configured my paths to bypass FUSE for appdata, docker & downloads and instead write directly to the relevant cache pool.
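
The path change in point 4 amounts to swapping the /mnt/user prefix for the pool's own mount point. Here is a small Python sketch of that mapping, using the pool names from this post; the function name is hypothetical, and the rewritten path is only valid if the share's files really live on that pool (anything already moved to the array is not visible there).

```python
from pathlib import PurePosixPath


def direct_pool_path(user_path, pool):
    """Map a /mnt/user/... path to the same location directly on a pool."""
    # relative_to raises ValueError if the path is not under /mnt/user
    rel = PurePosixPath(user_path).relative_to("/mnt/user")
    return str(PurePosixPath("/mnt") / pool / rel)
```

For example, direct_pool_path("/mnt/user/downloads", "data-cache") gives "/mnt/data-cache/downloads", which would be the host side of the container mapping.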

My latest theory is that my sub-optimal setup was causing some kind of disk IO bottleneck and since everything was relying on the same cache pool, the system crapped out. It's an unsatisfactory answer to be honest, but I can't seem to narrow it down any further.

The final issue for me now is that the cache pools are both using a single disk right now, which I'm not a fan of, so I've ordered a PCIe NVMe adaptor so I can mount two NVMe drives. I'll also try adding the second disk back onto the data-cache pool once I have seen a bit of long-term stability.

  • Author

FFS. System freeze again today while I was at work.
I'm going to try swapping out the Crucial SATA SSD for the Samsung one and see if that resolves it. Still clutching at fog though :-(

  • 3 weeks later...
  • Author
  • Solution

Finally resolved!
I refactored all my media and download directories & mounts according to the TRaSH guides, which was worth doing anyway. However, this also didn't resolve the issue.


Eventually I found a deal on a replacement motherboard on Ebay. I replaced the motherboard nearly two weeks ago and so far it has been rock solid, even whilst downloading.
The thermal paste on the PCH of both my original board and the replacement was very crisp. It's possible that was the issue, but it's hard to tell.
I repasted the new one before swapping it in and maybe when I get some free time I'll test the original once repasted.

It's frustrating not to know exactly what the root cause was, but I'm glad it appears to be resolved. :-)

Edited by kabadisha
Additional information added.
