[6.12.x] Completely inexplicable, random crashes for 2+ months

  • Urgent

I have been having an inexplicable issue where the server hangs and requires a hard reboot to return to normal. This has occurred since upgrading to 6.12.3 (and now 6.12.4).

 

When I say it's random, I mean it:

  • Sometimes the server has lasted 4+ days after a reboot, sometimes it's become unresponsive within minutes of booting.
  • Sometimes it has crashed while disks are active, sometimes when they are idle.
  • It's crashed with all of these combinations:
    • Docker disabled, VMs disabled
    • Docker enabled, VMs disabled
    • Docker disabled, VMs enabled
    • Docker enabled, VMs enabled
  • It's crashed when it's connected to the usual network, and when it's on its own dedicated subnet.
  • It's crashed with each Ethernet port on the motherboard used as the sole network connection (and that's a 1G and a 2.5G port, so they aren't even the same hardware or drivers!)
  • It's crashed with the configuration that I've built up over the last eight years of running Unraid, and it's crashed with a fresh configuration on a fresh flash drive.

 

There is not a single condition that is actually correlated with crashes.

 

I believe this is related to these issues, but I'm creating a new post because one was marked as closed, and I also went to some pretty significant lengths to try to debug this:

 

Debugging Process

This has been going on since August, and I've done absolutely everything possible to eliminate defective hardware as a possibility. That includes:

  • swapping out all PCIe cards with spares
  • running the system with each individual drive disconnected, one at a time (i.e. I remove one disk and see if it still crashes; if it does, I put the disk back in and pull the next one)
  • every single non-destructive stress test I can think of
  • fsck each disk and pool individually
  • run every maintenance operation I can think of
  • tested various configurations of power and sleep settings on the motherboard BIOS

Logging Process

Here's the wildly frustrating part: I created a syslog configuration that would log basically every single message it could (including marks) to a log file on an ext4-formatted flash drive mounted as an unassigned device. I ended up with at least a dozen log files that don't contain a single error. Each one ends with an unbroken sequence of --MARK-- lines going back for hours before the system locked up.
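For anyone who wants to run the same check on their own captures, here is a minimal Python sketch. It assumes the traditional syslog line format (`Oct 27 13:38:01 host message`, with no year) and mark lines containing `-- MARK --` or `--MARK--`. The function names, the 30-minute gap threshold, and the sample lines are illustrative and are not taken from the actual log files.

```python
import re
from datetime import datetime, timedelta

# Traditional syslog timestamp at the start of a line, e.g. "Oct 27 13:38:01".
STAMP = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s\d{2}:\d{2}:\d{2})\s")
# Mark lines, with or without spaces around the word MARK.
MARK = re.compile(r"--\s*MARK\s*--")


def parse_stamp(line, year):
    """Return the line's timestamp as a datetime, or None if it has none."""
    m = STAMP.match(line)
    if not m:
        return None
    # Collapse the space-padded day ("Oct  7") before parsing.
    text = " ".join(m.group(1).split())
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")


def summarize(lines, year, max_gap=timedelta(minutes=30)):
    """Report the last logged time, the number of marks, and any long gaps."""
    lines = list(lines)
    stamps = [t for t in (parse_stamp(l, year) for l in lines) if t is not None]
    marks = sum(1 for l in lines if MARK.search(l))
    gaps = [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > max_gap]
    return {"last": stamps[-1] if stamps else None, "marks": marks, "gaps": gaps}


if __name__ == "__main__":
    # Made-up sample lines, only to show the expected input format.
    sample = [
        "Oct 27 01:00:00 MediaTower -- MARK --",
        "Oct 27 01:20:00 MediaTower -- MARK --",
        "Oct 27 03:00:00 MediaTower kernel: example message",
    ]
    print(summarize(sample, 2023))
```

The `last` value is the latest moment the logger was demonstrably still alive, and an empty `gaps` list confirms the marks were continuous right up to that point.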

 

I've also tried using various notification methods to try to receive messages that the system is dying, and I've also tried setting up remote logging. None of them ever surface an issue anywhere near the time of the crash, so the crash is definitely also killing outbound networking.
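Since nothing useful ever arrives before the hang, one option is to record when the server goes silent rather than wait for an error message. Below is a rough Python sketch of a UDP heartbeat. It is an idea, not something that has been tried in this thread. The host, port, interval, and function names are all placeholders. `send_heartbeats` would run on the server, and `silence_started` would be applied to arrival times recorded on a second machine.

```python
import socket
import time


def send_heartbeats(host, port, interval=5.0, count=None):
    """Send a numbered UDP datagram every `interval` seconds.

    Runs forever when `count` is None; intended to run on the server.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    seq = 0
    try:
        while count is None or seq < count:
            payload = f"{seq} {time.time():.3f}".encode()
            sock.sendto(payload, (host, port))
            seq += 1
            time.sleep(interval)
    finally:
        sock.close()


def silence_started(arrivals, now, interval=5.0, missed=3):
    """Return the arrival time of the last heartbeat before the first
    silence longer than `missed` intervals, or None if there was none.

    `arrivals` are arrival times in seconds as seen by the listener;
    `now` is the listener's current time, so a hang that is still
    ongoing is detected as well.
    """
    limit = missed * interval
    points = list(arrivals) + [now]
    for prev, nxt in zip(points, points[1:]):
        if nxt - prev > limit:
            return prev
    return None
```

The timestamp returned by `silence_started` could then be lined up against the last --MARK-- line in the local log to see whether networking and logging stop at the same moment.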

 

There is one tiny hint of what might be going on, and that is for a period of time, when I rebooted after a crash, I would get a "udma crc error count returned to normal value" for a drive (but it never seemed to be consistent). However, all the components have since been removed and added back to the server and I haven't seen an issue like that in a while.
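To keep an eye on that counter between crashes, the raw value can be pulled out of `smartctl -A` output and compared over time. SMART attribute 199 generally counts interface transfer errors, which is why it is usually associated with cabling or connectors rather than the disk surface. The sketch below assumes the standard ATA attribute table printed by smartmontools. The function name is made up, and the sample output is invented for illustration, not taken from the diagnostics.

```python
def crc_error_count(smartctl_output):
    """Return the raw value of SMART attribute 199 from `smartctl -A`
    text output, or None if the attribute is not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns; the raw value is the last one.
        if len(fields) >= 10 and fields[0] == "199" and "CRC" in fields[1]:
            return int(fields[9])
    return None


if __name__ == "__main__":
    # Invented sample in the usual smartctl -A layout.
    sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       12
"""
    print(crc_error_count(sample))
```

A raw value that keeps climbing for the same drive bay or cable, regardless of which disk is plugged in, would point at the connection rather than the drive.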

 

I'll also add that rebooting requires holding down the power button on the computer until it shuts off. If I just do a quick press once, nothing happens: the server keeps running, the monitor doesn't wake up, nothing happens to indicate that anything was actually able to capture that ACPI signal.

Fresh Install

Last night, as my last step, I used the USB creator tool to create a brand new boot disk with 6.12.4 (on a factory-sealed flash drive) and copied over only the bare minimum configuration files (like the array config).

 

Again, it crashed.

Unraid 6.12.4 is Fundamentally Broken on Some Systems?

I'm just rolling back to 6.12.2 at this point, because there is nothing abnormal in the diagnostics or logs that would indicate an actual problem.

 

I've attached the diagnostics file from the fresh install from before the array was even started, because this locking up problem happens even when nothing is mounted and the system is just idling. But it also happens when a parity check is running, so it's not just a high-load or low-load issue.

 

tl;dr: There is some issue occurring with Unraid ≥6.12.3 that cannot be detected through any normal logging methods, and has made my local installation totally unusable since August—and I'm apparently not the only one.

 

mediatower-diagnostics-20231027-1338.zip

wug

Members

I rolled back to 6.12.2, and it's already crashed. If I recall correctly, I actually upgraded to 6.12.3 from a lower version (6.12.0 or 6.12.1), so just to be thorough, I've now rolled back all the way to 6.11.5. I have a parity check in progress now, so let's see if I can successfully complete one of these for the first time in almost three months!

bastl

Members

I'm kinda in the same boat as you. Random crashes on all 6.12 releases I tested. I also tried all sorts of combinations with Docker containers started or stopped, and the same with VMs. 14h Memtest with 0 errors. SMART values from the disks show no errors. There is no clear indication of what causes the crash for me. Sometimes it happens during the night when idle, sometimes during the day on low load or even when transcoding a video with Tdarr. On one run it crashes/freezes 30 minutes after a fresh reboot, on the next run the server is stable for 3-4 days, and as you experienced, there's nothing in the logs. It's kinda frustrating. I'm back on 6.11.5 and it was stable for 11 days. I had a power outage 3 days ago, and there's been no crash since then either.

EofChris

Members

I just want to leave a possible solution here that worked for me after days of struggling. I don't know if this is an OS or hardware problem on my side, but turning off the CPU's graphics unit in the BIOS of my Supermicro X11SSH-LN4F completely solved all problems.

However, this means that you can no longer pass the GPU through to your VMs. In my case I also encountered crashes on 6.11.5, so this might not be the same problem.

schale01

Members
6 hours ago, EofChris said:

I just want to leave a possible solution here that worked for me after days of struggling. I don't know if this is an OS or hardware problem on my side, but turning off the CPU's graphics unit in the BIOS of my Supermicro X11SSH-LN4F completely solved all problems.

However, this means that you can no longer pass the GPU through to your VMs. In my case I also encountered crashes on 6.11.5, so this might not be the same problem.

 

Interesting, I also have a similar Supermicro board. The last stable system version that worked for me was 6.10.4. That version was rock solid and I never had any hangs. I've reported this hanging issue myself and seen it reported multiple times elsewhere. At this point I've just moved off of Unraid, as it seems that I will be unable to upgrade to a stable version for my system configuration anytime soon.


wug

Members

It's crashed three more times since yesterday. Each time, it crashed not too long after starting a parity check.

 

As part of my next debugging step, I reset the motherboard to default settings, and I now have an interesting clue. The system booted up just fine and the parity disk was missing. It remained missing after another reboot. All other disks are present and available:

 

[Screenshot: MediaTower Main page, 2023-10-28 18:39]

 

It's not super likely to be a bad cable, because it's a SAS to SATA cable and the other three connected drives are available. I pulled the parity disk out, stuck it in the external USB drive caddy, and it shows up just fine. But now, another disk is missing:

 

[Screenshot: MediaTower Main page, 2023-10-28 18:59]

 

Baffling. When I switched Disk 7 to the SATA connector that parity had been connected to, it was still missing.

 

I wonder if some BIOS setting was causing the system to miss that certain disks are having issues... somehow? But that barely makes sense. I'm gonna do some more fiddling with it to see if I can get it behaving.

wug

Members

OKAY. I got all disks to be available again.

 

[Screenshot: MediaTower Main page, 2023-10-28 19:36]

 

I'm now running with a mobo that was reset to default settings (its firmware hasn't been updated from the version that was stable for a year) and a nearly-fresh install of Unraid 6.11.5.

 

I'll start a new parity check and report back if things have been resolved or if further debugging is needed.

wug

Members

It crashed again with the mobo reset to default settings.

 

I think one interesting clue is that after I rolled back to 6.11.5, when the server crashes, it actually seems to do a full system reset. I keep opening it up to find that it's crashed but is now waiting ready to go once again. This is in contrast to how it would require a long press on the power button to forcibly shut it down because it was TOTALLY unresponsive.

 

[Screenshot: MediaTower Main page, 2023-10-29 15:49]

 

I'm going to try the graphics setting someone else described above, and if it crashes again, I'll try upgrading the BIOS.

wug

Members

It crashed again after upgrading the mobo firmware. I'm officially out of ideas.

 

Interestingly, when it has crashed during these parity checks, the hourly array health notifications suggest that it crashes 4-5 hours in, but then I'll log in the next day to find that the system has restarted and has had uptime for 12+ hours (with the array offline).

 

I'll note that the last few parity checks were in maintenance mode, so it's unlikely to be a filesystem-related issue.

 

I think this evening, I'll run a live Debian environment on this system instead of Unraid, so that I can make sure all of my disks are backed up to tape. I'll report back on system stability when it's just running Debian, because if it's still crashing, then presumably, it's hardware. If it's not crashing, then we can assume that it's Unraid being incompatible with this particular hardware for some reason.
