I have been having an inexplicable issue where the server hangs and requires a hard reboot to return to normal. This has occurred since upgrading to 6.12.3 (and now 6.12.4).
When I say it's random, I mean it:
- Sometimes the server has lasted 4+ days after a reboot, sometimes it's become unresponsive within minutes of booting.
- Sometimes it has crashed while disks are active, sometimes when they are idle.
- It's crashed with all of these combinations:
  - Docker disabled, VMs disabled
  - Docker enabled, VMs disabled
  - Docker disabled, VMs enabled
  - Docker enabled, VMs enabled
- It's crashed when it's connected to the usual network, and when it's on its own dedicated subnet.
- It's crashed with each Ethernet port on the motherboard used as the sole network connection (and that's a 1G and a 2.5G port, so they aren't even the same hardware or drivers!)
- It's crashed with the configuration that I've built up over the last eight years of running Unraid, and it's crashed with a fresh configuration on a fresh flash drive.
There is not a single condition that is actually correlated with crashes.
I believe this is related to these issues, but I'm creating a new post because one was marked as closed, and I also went to some pretty significant lengths to try to debug this:
Debugging Process
This has been going on since August, and I've done absolutely everything possible to eliminate defective hardware as a possibility. That includes:
- swapping out all PCIe cards with spares
- running the system with each individual drive disconnected, one at a time (i.e. I remove one disk and see if it still crashes; if it does, I put the disk back in and pull the next one)
- running every single non-destructive stress test I can think of
- running fsck on each disk and pool individually
- running every maintenance operation I can think of
- testing various configurations of power and sleep settings in the motherboard BIOS
Logging Process
Here's the wildly frustrating part: I created a syslog configuration that would log basically every single message it could (including marks) to a log file on an ext4-formatted flash drive mounted as an unassigned device. I ended up with at least a dozen log files that didn't contain a single error, and before each of those log files ends, there is an unbroken sequence of --MARK-- lines that goes back for hours before the system locked up.
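For anyone who wants to check their own logs the same way, here is a rough Python sketch of how the "unbroken sequence of marks" claim could be verified automatically. It is not what I actually ran. It assumes traditional syslog timestamps ("Sep 12 03:20:01 host ..."), which carry no year, so the `year` argument is supplied by hand, and it assumes a 20-minute mark interval. The function names and the "twice the interval" threshold are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout: "Sep 12 03:20:01 Tower -- MARK --" (spacing around MARK may vary).
# Adjust the pattern if your syslog template differs.
MARK_LINE = re.compile(r"^(\w{3}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s.*--\s*MARK\s*--")


def mark_times(lines, year):
    """Return the timestamps of all mark lines, in file order."""
    times = []
    for line in lines:
        match = MARK_LINE.match(line)
        if match:
            # Traditional syslog timestamps have no year, so prepend one.
            stamp = f"{year} {match.group(1)}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def summarize_marks(lines, year, expected=timedelta(minutes=20)):
    """Report the last mark seen and any gap longer than twice the expected interval."""
    times = mark_times(lines, year)
    gaps = [(a, b) for a, b in zip(times, times[1:]) if b - a > 2 * expected]
    return {
        "count": len(times),
        "last_mark": times[-1] if times else None,
        "gaps": gaps,
    }
```

The `last_mark` value gives the latest moment the system was known to be alive, and an empty `gaps` list means the marks really were unbroken up to that point.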
I've also tried using various notification methods to try to receive messages that the system is dying, and I've also tried setting up remote logging. None of them ever surface an issue anywhere near the time of the crash, so the crash is definitely also killing outbound networking.
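One way to narrow down the time of death from another machine is a plain UDP heartbeat. The sketch below is only an illustration of the idea, not the remote logging setup I used. The function names, the JSON payload, and the 5-second interval are all assumptions, and the collector on the other end is left out.

```python
import json
import socket
import time


def build_heartbeat(sequence, now=None):
    """Encode one heartbeat as a small JSON datagram."""
    payload = {"seq": sequence, "sent_at": time.time() if now is None else now}
    return json.dumps(payload).encode("utf-8")


def send_heartbeat(sock, address, sequence):
    """Send a single heartbeat over UDP; delivery is not guaranteed."""
    sock.sendto(build_heartbeat(sequence), address)


def run(address, interval_seconds=5.0):
    """Send heartbeats until the process (or the whole machine) stops."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sequence = 0
        while True:
            send_heartbeat(sock, address, sequence)
            sequence += 1
            time.sleep(interval_seconds)
```

Calling `run(("192.0.2.10", 9999))` on the server (the address is a placeholder) would let a listener elsewhere record the last sequence number it saw, which brackets the hang to within one interval even if nothing is ever logged.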
There is one tiny hint of what might be going on, and that is for a period of time, when I rebooted after a crash, I would get a "udma crc error count returned to normal value" for a drive (but it never seemed to be consistent). However, all the components have since been removed and added back to the server and I haven't seen an issue like that in a while.
I'll also add that rebooting requires holding down the power button on the computer until it shuts off. If I just do a quick press once, nothing happens: the server keeps running, the monitor doesn't wake up, nothing happens to indicate that anything was actually able to capture that ACPI signal.
Fresh Install
Last night, as my last step, I used the USB creator tool to create a brand new boot disk with 6.12.4 (on a factory-sealed flash drive) and copied over only the bare minimum configuration files (like the array config).
Again, it crashed.
Unraid 6.12.4 is Fundamentally Broken on Some Systems?
I'm just rolling back to 6.12.2 at this point, because there is nothing abnormal in the diagnostics or logs that would indicate an actual problem.
I've attached the diagnostics file from the fresh install from before the array was even started, because this locking up problem happens even when nothing is mounted and the system is just idling. But it also happens when a parity check is running, so it's not just a high-load or low-load issue.
tl;dr: There is some issue occurring with Unraid ≥6.12.3 that cannot be detected through any normal logging methods, and has made my local installation totally unusable since August—and I'm apparently not the only one.