• [6.12.x] Completely inexplicable, random crashes for 2+ months


    wug
    • Urgent

    I have been having an inexplicable issue where the server hangs and requires a hard reboot to return to normal. This has occurred since upgrading to 6.12.3 (and now 6.12.4).

     

    When I say it's random, I mean it:

    • Sometimes the server has lasted 4+ days after a reboot, sometimes it's become unresponsive within minutes of booting.
    • Sometimes it has crashed while disks are active, sometimes when they are idle.
    • It's crashed with all of these combinations:
      • Docker disabled, VMs disabled
      • Docker enabled, VMs disabled
      • Docker disabled, VMs enabled
      • Docker enabled, VMs enabled
    • It's crashed when it's connected to the usual network, and when it's on its own dedicated subnet.
    • It's crashed with each Ethernet port on the motherboard used as the sole network connection (and those are a 1G port and a 2.5G port, so they aren't even the same hardware or drivers!)
    • It's crashed with the configuration that I've built up over the last eight years of running Unraid, and it's crashed with a fresh configuration on a fresh flash drive.

     

    There is not a single condition that is actually correlated with crashes.

     

    I believe this is related to these issues, but I'm creating a new post because one was marked as closed, and I also went to some pretty significant lengths to try to debug this:

     

    Debugging Process

    This has been going on since August, and I've done absolutely everything possible to eliminate defective hardware as a possibility. That includes:

    • swapping out all PCIe cards with spares
    • running the system with each individual drive disconnected, one at a time (i.e. I remove one disk, see if it still crashes, and if it does, put the disk back in and pull the next one)
    • running every single non-destructive stress test I can think of
    • running fsck on each disk and pool individually (see the sketch after this list)
    • running every maintenance operation I can think of
    • testing various configurations of power and sleep settings in the motherboard BIOS
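
    For reference, here's a sketch of the kind of non-destructive per-disk checks I mean. Device names are only examples, Unraid's md device naming varies by version, and I'm assuming XFS here; substitute the appropriate check for other filesystems or pool types:

        # read-only filesystem check of one array disk while in maintenance mode (-n = no modify)
        xfs_repair -n /dev/md1

        # SMART extended self-test on one physical disk, then review the results
        smartctl -t long /dev/sdb
        smartctl -a /dev/sdb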

    Logging Process

    Here's the wildly frustrating part: I created a syslog configuration that would log basically every single message it could (including marks) to a log file on an ext4-formatted flash drive mounted as an unassigned device. I have at least a dozen log files that don't contain a single error, and before each one ends, there is an unbroken sequence of --MARK-- lines going back for hours before the system locked up.
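
    Roughly, the setup amounts to something like this rsyslog fragment (the mount path is only an example of where the unassigned flash drive sits, and the mark interval is in seconds):

        # emit an "-- MARK --" heartbeat every 10 minutes
        $ModLoad immark
        $MarkMessagePeriod 600

        # write every facility and priority to a file on the ext4 flash drive
        *.*    /mnt/disks/logflash/syslog.log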

     

    I've also tried using various notification methods to try to receive messages that the system is dying, and I've also tried setting up remote logging. None of them has ever surfaced an issue anywhere near the time of the crash, so the crash is definitely also killing outbound networking.
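
    For completeness, the remote-logging attempt was along these lines (the target address and port are placeholders; a single @ would mean UDP instead of TCP):

        # also forward everything to a remote syslog server over TCP
        *.*    @@192.168.1.50:514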

     

    There is one tiny hint of what might be going on: for a period of time, when I rebooted after a crash, I would get a "UDMA CRC error count returned to normal value" notification for a drive (though it never seemed to be the same drive consistently). However, all the components have since been removed from and re-added to the server, and I haven't seen an issue like that in a while.
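
    If anyone wants to watch that counter on their own drives, something like this shows it per disk (the device name is an example); attribute 199, UDMA_CRC_Error_Count, normally points at cabling or link problems rather than the disk itself:

        smartctl -A /dev/sdc | grep -i crc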

     

    I'll also add that rebooting requires holding down the power button on the computer until it shuts off. If I just do a quick press once, the server keeps running, the monitor doesn't wake up, and there is no indication that anything was actually able to capture that ACPI signal.
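
    As a sanity check on the power-button path (assuming acpid and its acpi_listen utility are present on the box), a healthy system should print a button/power event the moment the button is tapped; obviously nothing can be run once it's hung, but this at least confirms the event normally reaches userspace:

        # watch ACPI events in real time, then tap the power button
        acpi_listen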

    Fresh Install

    Last night, as my last step, I used the USB creator tool to create a brand-new boot disk with 6.12.4 (on a factory-sealed flash drive) and copied over only the bare minimum configuration files (like the array config).

     

    Again, it crashed.

    Unraid 6.12.4 is Fundamentally Broken on Some Systems?

    I'm just rolling back to 6.12.2 at this point, because there is nothing abnormal in the diagnostics or logs that would indicate an actual problem.

     

    I've attached the diagnostics file from the fresh install, captured before the array was even started, because this locking-up problem happens even when nothing is mounted and the system is just idling. But it also happens when a parity check is running, so it's not just a high-load or low-load issue.

     

    tl;dr: There is some issue occurring with Unraid ≥6.12.3 that cannot be detected through any normal logging methods, and has made my local installation totally unusable since August—and I'm apparently not the only one.

     

    mediatower-diagnostics-20231027-1338.zip




    User Feedback

    Recommended Comments

    I rolled back to 6.12.2, and it's already crashed. If I recall correctly, I actually upgraded to 6.12.3 from a lower version (6.12.0 or 6.12.1), so just to be thorough, I've now rolled back all the way to 6.11.5. I have a parity check in progress now, so let's see if I can successfully complete one of these for the first time in almost three months!

    Link to comment

    I'm kinda in the same boat as you. Random crashes on all of the 6.12 releases I tested. I also tried all sorts of combinations with Docker containers started or stopped, same with VMs. A 14-hour Memtest ran with 0 errors, and the SMART values from the disks show no errors. There is no clear indication of what causes the crash for me: sometimes during the night when idle, sometimes during the day under low load, or even when transcoding a video with Tdarr. Sometimes it crashes/freezes 30 minutes after a fresh reboot; on the next run the server is stable for 3-4 days, and as you experienced, there's nothing in the logs. It's kinda frustrating. I'm back on 6.11.5, and it was stable for 11 days. I had a power outage 3 days ago, and since then there has also been no crash.

    Link to comment

    I just want to leave a possible solution here that worked for me after days of struggling. I don't know if this is an OS or hardware problem on my side, but turning off the CPU's integrated graphics unit in the BIOS of my Supermicro X11SSH-LN4F completely solved all problems.

    This, however, means that you can no longer pass the GPU through to your VMs. In my case I also encountered crashes on 6.11.5, so this might not be the same problem.

    Link to comment
    6 hours ago, EofChris said:

    I just want to leave a possible solution here that worked for me after days of struggling. I don't know if this is an OS or hardware problem on my side, but turning off the CPU's integrated graphics unit in the BIOS of my Supermicro X11SSH-LN4F completely solved all problems.

    This, however, means that you can no longer pass the GPU through to your VMs. In my case I also encountered crashes on 6.11.5, so this might not be the same problem.

     

    Interesting, I also have a similar Supermicro board. The last stable version that worked for me was 6.10.4; it was rock solid and I never had any hangs. I've reported this hanging issue myself and seen it reported multiple times elsewhere. At this point I've just moved off of Unraid, as it seems I won't be able to upgrade to a stable version for my system configuration anytime soon.

    Edited by schale01
    Link to comment

    It's crashed three more times since yesterday. Each time, it crashed not too long after starting a parity check.

     

    As part of my next debugging step, I reset the motherboard to default settings, and I now have an interesting clue. The system booted up just fine, but the parity disk was missing. It remained missing after another reboot. All other disks are present and available:

     

    [Screenshot: Main tab showing the parity disk missing while all other disks are present]

     

    It's not super likely to be a bad cable, because it's a SAS to SATA cable and the other three connected drives are available. I pulled the parity disk out, stuck it in the external USB drive caddy, and it shows up just fine. But now, another disk is missing:

     

    [Screenshot: Main tab now showing Disk 7 missing]

     

    Baffling. When I switched Disk 7 to the SATA connector that the parity disk was connected to, it was still missing.
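
    In case it helps anyone debugging something similar, a couple of commands along these lines show whether the kernel is even enumerating a "missing" drive (device names are examples):

        # list block devices with model and serial to see if the drive shows up at all
        lsblk -o NAME,MODEL,SERIAL,SIZE,TRAN

        # check recent kernel messages for link resets or detection errors
        dmesg | grep -iE 'ata|sas|link'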

     

    I wonder if some BIOS setting was causing the system to miss that certain disks are having issues... somehow? But that barely makes sense. I'm gonna do some more fiddling with it to see if I can get it behaving.

    Link to comment

    OKAY. I got all disks to be available again.

     

    [Screenshot: Main tab with all disks available again]

     

    I'm now running with a motherboard reset to its default settings (and its firmware hasn't been updated away from the version that was stable for a year) and a nearly fresh install of Unraid 6.11.5.

     

    I'll start a new parity check and report back if things have been resolved or if further debugging is needed.

    Link to comment

    It crashed again with the mobo reset to default settings.

     

    I think one interesting clue is that after I rolled back to 6.11.5, when the server crashes, it actually seems to do a full system reset. I keep checking on it to find that it has crashed but is now sitting there, ready to go once again. This is in contrast to before, when it required a long press on the power button to force a shutdown because it was TOTALLY unresponsive.

     

    [Screenshot: the system back up and waiting after resetting itself]

     

    I'm going to try the graphics setting someone else described above, and if it crashes again, I'll try upgrading the BIOS.

    Link to comment

    It crashed again after upgrading the motherboard firmware. I'm officially out of ideas.

     

    Interestingly, when it has crashed during these parity checks, the hourly array health notifications suggest it crashes within 4-5 hours of starting, but then I'll log in the next day to find that the system has restarted on its own and has had 12+ hours of uptime (with the array offline).

     

    I'll note that the last few parity checks were in maintenance mode, so it's unlikely to be a filesystem-related issue.

     

    I think this evening, I'll run a live Debian environment on this system instead of Unraid, so that I can make sure all of my disks are backed up to tape. I'll report back on system stability when it's just running Debian, because if it's still crashing, then presumably, it's hardware. If it's not crashing, then we can assume that it's Unraid being incompatible with this particular hardware for some reason.
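
    Roughly what I have in mind for that Debian session, with the caveat that the tape device and mount paths below are placeholders for whatever the live environment assigns:

        # back up one disk's contents to tape, then verify the tape is readable
        tar -cvf /dev/st0 /mnt/source-disk
        mt -f /dev/st0 rewind
        tar -tvf /dev/st0 > /dev/null

        # then give the box a sustained workload, comparable to a parity check, to try to provoke the hang
        stress-ng --cpu 0 --vm 2 --vm-bytes 75% --io 4 --timeout 12h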

    Link to comment




  • Status Definitions

     

    Open = Under consideration.

     

    Solved = The issue has been resolved.

     

    Solved version = The issue has been resolved in the indicated release version.

     

    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.

     

    Retest = Please retest in latest release.


    Priority Definitions

     

    Minor = Something not working correctly.

     

    Urgent = Server crash, data loss, or other showstopper.

     

    Annoyance = Doesn't affect functionality but should be fixed.

     

    Other = Announcement or other non-issue.