Help diagnosing system hanging intermittently

August 5, 2025

Hi All,

I'm tearing my hair out trying to diagnose a problem where, after several days (10 in the last instance), my whole system just hangs and becomes completely unresponsive. No response to ping, no display output, numlock not even toggling. I have to hard power cycle via IPMI. No logs in the SuperMicro IPMI health events either.

The failure also does not seem to follow any particular schedule, so I can't align it with any scheduled jobs that might be causing the issue.

I've mirrored syslog to flash, but I can't see any obvious syslog events at the tail end of the syslog-previous.txt either. As such, I'm left scratching my head about what the issue is.
I've had a browse of the syslog and nothing is jumping out at me. A few FFMPEG errors, which I'm guessing are just Frigate dealing with a few blips in my CCTV rtsp streams.

The only one that I thought might be worth looking at was this one, where it's starting up the custom network that lets the PiHole container get its own IP on the network:
Jul 26 00:08:16 tower rc.docker: Processing... br0
Jul 26 00:08:16 tower root: Error response from daemon: network with name br0 already exists
Jul 26 00:08:16 tower rc.docker: connecting pihole to network br0
Jul 26 00:08:16 tower root: Error response from daemon: endpoint with name pihole already exists in network br0
Jul 26 00:08:16 tower rc.docker: ip link add link br0 name shim-br0 type ipvlan mode l2 bridge
Jul 26 00:08:16 tower rc.docker: ip link set shim-br0 up
Jul 26 00:08:16 tower rc.docker: ip -6 addr flush dev shim-br0
Jul 26 00:08:16 tower rc.docker: ip -4 addr add 192.168.1.11/24 dev shim-br0 metric 0
Jul 26 00:08:16 tower rc.docker: ip -4 route add default via 192.168.1.1 dev shim-br0 metric 0
Jul 26 00:08:16 tower rc.docker: created network shim-br0 for host access
Jul 26 00:08:16 tower rc.docker: Network started.

I think I've probably screwed up the config for pihole's network in some way, but it seems unlikely that would cause the whole system to hang days later.
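For anyone who wants to pull lines like these out of a mirrored syslog rather than eyeballing it, here is a minimal Python sketch. It assumes the classic "Mon DD HH:MM:SS host tag: message" layout shown above. The names find_daemon_errors, LINE_RE and SAMPLE are illustrative only and are not part of Unraid or Docker.

```python
import re

# Assumed line layout: "Jul 26 00:08:16 tower root: Error response from daemon: ..."
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<tag>[^:]+): (?P<msg>.*)$"
)


def find_daemon_errors(lines):
    """Return (timestamp, message) pairs for Docker daemon error lines."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("msg").startswith("Error response from daemon"):
            hits.append((match.group("ts"), match.group("msg")))
    return hits


# Four lines copied from the excerpt above.
SAMPLE = [
    "Jul 26 00:08:16 tower rc.docker: Processing... br0",
    "Jul 26 00:08:16 tower root: Error response from daemon: network with name br0 already exists",
    "Jul 26 00:08:16 tower rc.docker: connecting pihole to network br0",
    "Jul 26 00:08:16 tower root: Error response from daemon: endpoint with name pihole already exists in network br0",
]

for timestamp, message in find_daemon_errors(SAMPLE):
    print(timestamp, "|", message)
```

To run it against a real file, pass an open file object instead of SAMPLE, since iterating a file yields lines in the same way.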

Does anyone have any ideas or suggestions on how to diagnose further? I feel like I've hit a wall. Diagnostics attached.

Cheers

tower-diagnostics-20250805-0801.zip

August 5, 2025

Community Expert

Without anything relevant logged there's not much to go on. This can also be a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one, including the individual Docker containers.

August 5, 2025

Author

2 hours ago, JorgeB said:
Without anything relevant logged there's not much to go on. This can also be a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one, including the individual Docker containers.

Yes, indeed. I was worried this might be the answer. Debugging that way is going to take an age because of how infrequently the issue occurs.
Maybe I'll try to run a memtest - that's easy at least.

I figure it's unlikely to be a hardware fault given that I haven't changed the hardware configuration at all for months. It all started happening when I made the jump to the v7 release. I was suffering from the widely reported networking issues. This seemed to be significantly improved by the 7.1.4 patch and so I thought that was the root cause.

That's why I'm suspecting maybe the PiHole network configuration and ipvlan. I think I'll try disabling PiHole as well as the custom network and see if that resolves it.
I hate issues like this with not much to go on.

Cheers,
Charlie

August 5, 2025

Community Expert

6 minutes ago, kabadisha said:
Maybe I'll try to run a memtest - that's easy at least.

Worth a try. Alternatively, since memtest is only definitive if it finds errors, if you have multiple sticks try running the server with just one; if it still crashes, try a different one. That will basically rule out bad RAM.

August 5, 2025

Author

5 hours ago, JorgeB said:
Worth a try. Alternatively, since memtest is only definitive if it finds errors, if you have multiple sticks try running the server with just one; if it still crashes, try a different one. That will basically rule out bad RAM.

It died again at some point this morning. Interestingly, this time there does seem to be something in the IPMI health event log, but the error codes don't mean anything to me, so not sure it's very helpful. Screenshot attached.
Successfully did a memtest pass with no issues. Going to try disabling PiHole now and see if that improves things...

August 5, 2025

Community Expert

24 minutes ago, kabadisha said:
but the error codes don't mean anything to me

Me neither, but Supermicro support may be able to help.

August 5, 2025

Author

Ok, some progress:

I've managed to resolve the "Error response from daemon: network with name br0 already exists" error.

It appears that somewhere along the line, the custom br0 network was duplicated (or maybe I created it manually and simply can't remember). Since I have "Preserve user defined networks" enabled in my Docker settings, it wasn't being cleaned up, and so was conflicting with the custom network that Unraid was trying to create for me when I enabled IPv4 custom network on interface br0 under Docker Settings.

I suspect that dodgy instance of br0 may have been causing the issues. To resolve it, I did the following:

1. Disable autostart on the pi-hole container
2. Stop Docker
3. Disable IPv4 custom network on interface br0 under Docker Settings
4. Start Docker
5. Edit the pi-hole container config and set the network type to host. It will fail to start and complain about port 80 being taken, but that's ok.
6. Stop Docker
7. Enable IPv4 custom network on interface br0 under Docker Settings
8. Start Docker
9. Edit the pi-hole container config and set network type to Custom-br0 and set the static IP I use for pi-hole on my network (192.168.1.12 in my case).
10. Re-enable autostart on the pi-hole container

Not sure if that's going to resolve the system crashes, but I do now get a clean network startup when starting Docker, and I'm counting that as progress:

Aug 5 19:39:32 tower rc.docker: Processing... br0
Aug 5 19:39:32 tower rc.docker: created network ipvlan br0 with subnets: 192.168.1.0/24;
Aug 5 19:39:32 tower rc.docker: connecting pihole to network br0
Aug 5 19:39:32 tower rc.docker: ip link add link br0 name shim-br0 type ipvlan mode l2 bridge
Aug 5 19:39:32 tower rc.docker: ip link set shim-br0 up
Aug 5 19:39:32 tower rc.docker: ip -6 addr flush dev shim-br0
Aug 5 19:39:32 tower rc.docker: ip -4 addr add 192.168.1.11/24 dev shim-br0 metric 0
Aug 5 19:39:32 tower rc.docker: ip -4 route add default via 192.168.1.1 dev shim-br0 metric 0
Aug 5 19:39:32 tower rc.docker: created network shim-br0 for host access
Aug 5 19:39:32 tower rc.docker: Network started.

Time will tell if that resolves the issue.

August 14, 2025

Author

Still no dice. I was getting optimistic, but it died again last night.

Pumped the syslog into ChatGPT on the off-chance it might spot something I didn't. It didn't give me much - it seemed to hallucinate a kernel panic in the logs.
It did cause me to pay more attention to:

kernel: mce: [Firmware Bug]: Ignoring request to disable invalid MCA bank 8.

This doesn't look like a fatal error to me, but it did cause me to go and check for BIOS and IPMI firmware updates. Both had updates available, so I decided to try upgrading those.

Upgrading both hasn't resolved those mce firmware bugs, but maybe it'll help system stability.

August 18, 2025

Author

Still having the issue, so BIOS update didn't do the trick.
I'm also still not getting anything useful in the logs. This is going to be a bastard to resolve. There's so little to go on :-(

August 23, 2025

Author

It just died for the first time while I was actively using it.
I tried switching from ipvlan to macvlan to give pihole its own IP to see if that had any impact. Apparently not.

I also noticed that the error "Error response from daemon: network with name br0 already exists" had reappeared. It seems that if you have "Preserve user defined networks" enabled, this error will be seen in the log. I've disabled preserving networks for now (as I no longer need that feature) and that seems to have resolved that one.

I've now pulled out two of the four RAM sticks. If it fails again, I'll swap the pair. If I still get the failure on both pairs then I know it's not a RAM issue.

August 23, 2025

Author

Update: Swapping the RAM had no impact. However, I have just managed to trigger the issue several times in quick succession.

I started a large download in sabnzbd and it seemed to trigger the failure. I was tailing syslog at the time of failure and there was literally nothing logged.
I'm going to disable sabnzbd and see if that leads to stability.

September 8, 2025

Author

Another update in case anyone finds this thread in the future. Shutting down the Sabnzbd container resolved the stability issues.
I now have it running again, and so far it seems to be stable. I have changed two things:

1. Previously, I had /downloads/incomplete and /downloads/complete mapped as two different mounts on the container. I have now switched to one mount of the parent directory /downloads instead. My theory is that my previous method was preventing simple file move operations from behaving properly, since the move (as far as the container was concerned) was from one mount to another.
2. I have switched from /mnt/user/downloads to /mnt/cache/downloads as the host end of the mount. This avoids the overhead of the FUSE filesystem layer: https://www.reddit.com/r/unRAID/comments/uwxmg6/comment/i9uj9z1/
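To illustrate the theory about moves between two mounts, here is a minimal Python sketch. On Linux, a rename across two mount points fails with EXDEV, so the mover has to fall back to copying the data and deleting the source. The function name move_kind and the demo paths are made up for illustration. I have not checked how Sabnzbd itself performs its moves, so treat this as a sketch of the general mechanism only.

```python
import errno
import os
import shutil
import tempfile


def move_kind(src, dst):
    """Move src to dst and report whether it was a rename or a copy fallback."""
    try:
        os.rename(src, dst)  # cheap metadata change, only works within one mount
        return "rename"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        shutil.move(src, dst)  # crosses mounts: copies the data, then deletes src
        return "copy+delete"


# Demo with two sibling directories under one temporary directory.
with tempfile.TemporaryDirectory() as root:
    incomplete = os.path.join(root, "incomplete")
    complete = os.path.join(root, "complete")
    os.mkdir(incomplete)
    os.mkdir(complete)
    src = os.path.join(incomplete, "file.bin")
    with open(src, "wb") as fh:
        fh.write(b"x" * 1024)
    # Both directories share one mount here, so this prints: rename
    print(move_kind(src, os.path.join(complete, "file.bin")))
```

Run inside a container where incomplete and complete are separate mounts, the same call would be expected to report "copy+delete" instead.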

I hope this helps someone else in future :-)

Edited September 8, 2025 by kabadisha

September 12, 2025

Author

A further update: The mystery continues.

My server was stable for several weeks with Sabnzbd disabled. I brought it back online after the changes above and it was stable for a while, successfully downloading a large number of files.
This morning, though: Dead again.

This time I booted back up, and started downloads whilst tailing /var/log/* via ssh as well as having htop open and the logs for both Sabnzbd and gluetun.
I was able to get the system to hang multiple times over the course of an hour. Each time, there were no entries at all in any of the logs, and htop showed no unusual system load.
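When nothing is logged at the moment of a hang, one option is a heartbeat file that at least pins down when the system stopped. This is a minimal Python sketch, not something I ran in this thread. The names write_heartbeat and last_heartbeat are hypothetical, and any path you pick (for example somewhere on the flash drive) is your own choice. Frequent writes to a USB flash drive add wear, so keep the interval sensible.

```python
import os
import time


def write_heartbeat(path, now=None):
    """Append one timestamp line and force it to disk before returning."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    with open(path, "a") as fh:
        fh.write(stamp + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # ask the OS to push the write out to the device
    return stamp


def last_heartbeat(path):
    """Return the most recent timestamp line, or None if the file is empty."""
    with open(path) as fh:
        lines = fh.read().splitlines()
    return lines[-1] if lines else None
```

Calling write_heartbeat in a loop with time.sleep(30) between calls, then reading last_heartbeat after the reboot, gives the last time the system was known to be alive to within the sleep interval.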

While I'm glad I've narrowed it down to Sabnzbd (or maybe Gluetun), this is proving very difficult indeed to diagnose fully.

September 16, 2025

Author

Another update:
I just had another system freeze while sabnzbd was shut down. That means it isn't the culprit.

Interestingly, the issue occurred while I was conducting an internet speedtest from the server using a different containerised service. The theme here is that this involves a reasonably large download.

This is smelling increasingly like a hardware issue. Since I have already ruled out RAM and I have high confidence in my power supply, I'm wondering if this might be one of the SSDs in my cache pool struggling to keep up when data is being written.
I've got two 2TB SATA SSDs from different manufacturers, both about a year old and both reporting good health with 96+% life remaining:

Samsung 870 EVO 2TB
Crucial MX500 (CT2000MX500SSD1)

Feel like I'm clutching at straws though.

September 28, 2025

Author

I have changed a few things, and so far the system seems stable (fingers crossed).

My Supermicro motherboard has an NVMe M.2 slot, so I bought a 2TB WD Black one and then changed a number of things:

I made the new NVME drive the primary cache and migrated appdata, docker image etc to it.
I took the Samsung SATA SSD and made it into a separate cache pool called data-cache. This cache is now used for downloads, frigate cctv and other potentially heavy write IO.
I completely overhauled my directory structure to adopt the structure suggested by the TRaSH guides. My previous setup involved completed downloads being copied to the array as soon as they were downloaded. Now the move is instantaneous and relies on the mover to transfer files to the array.
I configured my paths to bypass FUSE for appdata, docker & downloads and instead write directly to the relevant cache pool.
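As a small illustration of the path change, here is a Python sketch that rewrites a /mnt/user/... share path to the matching path on a named pool. The helper name direct_pool_path is hypothetical. It assumes everything in the share actually lives on that pool, because files sitting on the array would not be visible through the direct pool path.

```python
from pathlib import PurePosixPath


def direct_pool_path(user_path, pool):
    """Map /mnt/user/<share>/... to /mnt/<pool>/<share>/... (pure string logic)."""
    parts = PurePosixPath(user_path).parts
    if parts[:3] != ("/", "mnt", "user"):
        raise ValueError(f"not a /mnt/user path: {user_path}")
    return str(PurePosixPath("/mnt", pool, *parts[3:]))


# The two pools mentioned in this thread: "cache" and "data-cache".
print(direct_pool_path("/mnt/user/downloads", "cache"))
print(direct_pool_path("/mnt/user/downloads/complete", "data-cache"))
```

It only builds the string and never touches the disk, so it is safe to run anywhere.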

My latest theory is that my sub-optimal setup was causing some kind of disk IO bottleneck and since everything was relying on the same cache pool, the system crapped out. It's an unsatisfactory answer to be honest, but I can't seem to narrow it down any further.

The final issue for me now is that the cache pools are both using a single disk right now, which I'm not a fan of, so I've ordered a PCIe NVMe adapter so I can mount two NVMe drives. I'll also try adding the second disk back onto the data-cache pool once I have seen a bit of long-term stability.

October 2, 2025

Author

FFS. System freeze again today while I was at work.
I'm going to try swapping out the Crucial SATA SSD for the Samsung one and see if that resolves it. Still clutching at fog though :-(

October 18, 2025

Author
Solution

Finally resolved!
I refactored all my media and download directories and mounts according to the TRaSH guides, which was worth doing anyway. However, this didn't resolve the issue either.

Eventually I found a deal on a replacement motherboard on eBay. I replaced the motherboard nearly two weeks ago and so far it has been rock solid, even whilst downloading.
The thermal paste on the PCH of both my original board and the replacement was very crisp. It's possible that was the issue, but hard to tell.
I repasted the new one before swapping it in and maybe when I get some free time I'll test the original once repasted.

It's frustrating not to know exactly what the root cause was, but I'm glad it appears to be resolved. :-)

Edited October 18, 2025 by kabadisha
Additional information added.
