  • Server freezes in 6.12.6 - not macvlan or realtek related


    thither
    • Annoyance

     I'm seeing server crashes roughly once a day after going from 6.12.4 to 6.12.6. My symptoms are exactly the same as in this bug report:

    I can ping the machine and the console is responsive, but when I try to log in it just freezes after I enter the username. I get "504 Gateway Time-out" nginx errors on the web interface, and none of the dockers are responsive. Diagnostics attached.

     

    As far as I can tell I don't have realtek or adaptec hardware. My docker is running ipvlan, not macvlan (it was set to that in 6.12.4 already, and was working fine). (Edit to add: I don't have any VMs running, unlike in the above bug report, and "Fix common problems" doesn't show anything except a warning about syslog being mirrored to flash.)

     

    I set syslog to mirror to a cache directory, but I don't see any logs there. I've just checked the box to get it to mirror to flash, so if it happens again hopefully I'll have something useful (but I haven't seen anything relevant in the logs I've looked at in the past; it just looks as if the server is working normally).

     

    I have some other very odd and vexing behavior which started happening at the time I upgraded, too, but it's all at the BIOS level, so it kind of seems impossible that the upgrade would have caused it. The first time the server went down, after I restarted, the BIOS wouldn't recognize my flash drive as a bootable device (it just didn't appear in the menus at all). The drive was attached to an internal USB header. Eventually I reinstalled from an online backup onto a new thumb drive, re-registered it (blacklisting the old drive), and was able to boot from it.

     

    Things were working fine, but when the server went down again and I had to cold-restart it, the new thumb drive wasn't available as a bootable device. Through trial and error, and an eventual CMOS reset, I discovered that after rebooting the server I need to physically unplug the USB drive and plug it back in in order for it to be recognized and bootable. I have no idea what that's about. It's happening before the OS loads, so it doesn't seem possible that Unraid could be affecting it. I did verify that I'm running the latest firmware for my motherboard (an ASRock z170 Extreme 7+). I haven't seen this behavior before but I also have rarely needed to cold-boot my server in the past.

     

    At this point I'd like to roll back to Unraid 6.12.4, since it didn't have these issues, but since I'm on a new USB stick, I don't have the option to just roll back to it from Tools / Update OS. Are there files I can copy from the old USB stick to the new one that will let me do that? Or is there some other way to downgrade?

    eurydice-diagnostics-20240101-0940.zip




    User Feedback

    Recommended Comments

    The first thing you should always do in any situation like this is run Memtest from the boot menu for at least a couple of passes.

     

    To roll back, though, all you have to do is download the appropriate .zip file from https://unraid.net/download

     

    Shut down the server, pull out the flash drive, and replace all of the bz* files in the root of the flash drive with those in the archive. Power back up.
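The file swap described above can be sketched in Python, assuming the flash drive is plugged into another machine that has Python installed and the downloaded .zip has already been extracted. The function name and both paths are hypothetical; only the bz* pattern comes from the post.

```python
import shutil
from pathlib import Path


def replace_bz_files(archive_dir, flash_root):
    """Copy every top-level bz* file from an extracted release archive
    over the root of the flash drive.

    Both paths are supplied by the caller (assumptions, not Unraid defaults).
    Returns the sorted names of the files that were copied.
    """
    archive_dir = Path(archive_dir)
    flash_root = Path(flash_root)
    copied = []
    for src in sorted(archive_dir.glob("bz*")):
        if src.is_file():
            # copy2 overwrites an existing file of the same name
            shutil.copy2(src, flash_root / src.name)
            copied.append(src.name)
    return copied
```

It would be called with something like replace_bz_files("/path/to/extracted-archive", "/path/to/flash"), where both paths are placeholders. Backing up the flash drive first is a sensible precaution, since the copy overwrites files in place.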

     

    Alternatively, you can create a folder on the flash drive named "previous" and put all of the files in the root of the archive into that folder, and then Update OS will offer you a rollback option.
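A minimal Python sketch of the "previous" folder approach, under the same assumptions (flash drive reachable as a normal directory, archive already extracted). The function name and paths are hypothetical; the folder name "previous" and the "files in the root of the archive" rule are taken from the post.

```python
import shutil
from pathlib import Path


def stage_previous_release(archive_dir, flash_root):
    """Copy the top-level files of an extracted release archive into
    <flash_root>/previous, creating the folder if needed.

    Subdirectories of the archive are skipped, since the post only
    mentions the files in the archive root.
    Returns the sorted names of the files that were staged.
    """
    previous = Path(flash_root) / "previous"
    previous.mkdir(exist_ok=True)
    staged = []
    for src in sorted(Path(archive_dir).iterdir()):
        if src.is_file():
            shutil.copy2(src, previous / src.name)
            staged.append(src.name)
    return staged
```

Unlike replacing the bz* files directly, this leaves the currently installed release untouched until the rollback is triggered from the web GUI.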


    Thanks! I ran memtest off the boot drive and it seemed to freeze after printing "Loading memtest... ok". I'll poke around to see what that's about; I don't see my exact motherboard listed here but there are some very similar ones that seem to have issues. Copying the 6.12.4 files into a "previous" folder did give me an option to roll back through the web GUI again, but I'm going to hold off to see if I can get useful syslog info out of the current distribution before I do so.

     

    If I do roll back, I assume it will still use my old configs, right? (So I won't need to set up my dockers and array again?)

    Edited by thither

    Well, the system froze again and I restarted it. Here's my syslog and syslog-previous from the flash drive. I restarted at around 4:00am, and then again this morning at 10:30. I personally can't see much that's suspicious here; I have community apps set to auto-update nightly, and that runs. It does seem a little weird that the system boots at 4:08 and the "Unraid API started" message doesn't appear until 7:30, but I don't know how normal that is.

     

    Anyways, I'm going to roll back to 6.12.4 and will report back on whether that seems more stable.

    syslog.txt syslog-previous.txt
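One way to look for quiet periods like the 4:08 to 7:30 gap mentioned above is to scan the log for the largest jumps between consecutive timestamps. This is a rough sketch that assumes the traditional syslog line prefix ("Jan  1 04:08:12 ..."); the attached files may use a different format, and the function name is made up for illustration. The year has to be passed in because that timestamp style omits it, and a log spanning New Year is not handled.

```python
from datetime import datetime


def largest_gaps(lines, year, top=3):
    """Return the `top` largest gaps between consecutive timestamped lines.

    Each result is (gap, line_before, line_after). Lines whose first
    15 characters do not parse as "Mon DD HH:MM:SS" are ignored.
    """
    stamped = []
    for line in lines:
        try:
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped syslog line
        stamped.append((ts, line.rstrip("\n")))

    gaps = [
        (later_ts - earlier_ts, earlier_line, later_line)
        for (earlier_ts, earlier_line), (later_ts, later_line)
        in zip(stamped, stamped[1:])
    ]
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]
```

Feeding it the lines of syslog-previous.txt would show which entries sit on either side of each long silence, which may help narrow down when the system stopped logging.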


    Do you use NFS a lot? I've had a lot of issues with Unraid freezing up with the symptoms you describe (server is still up, but unresponsive and login attempts freeze), which seemed to be related to NFS in my case. The ultimate cause was shfs deadlocking itself for whatever reason, causing most disk I/O to get stuck until it was killed (or the server was power cycled, since it couldn't perform a proper shutdown).

     

    I had the issue on 6.11.5 but it became much worse when I upgraded to 6.12.6 to see if that would fix my problems.

     

    It looks like my issue is related to a long-standing issue.
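If a console session is still usable when the freeze starts, one possible check for stuck I/O of the kind described above is to list processes in uninterruptible sleep (state "D" in /proc/<pid>/stat on Linux). This is only a sketch: whether an shfs deadlock actually shows up this way is an assumption, and the function names are hypothetical.

```python
import os


def parse_stat(text):
    """Split one /proc/<pid>/stat line into (pid, command, state).

    The command is wrapped in parentheses and may itself contain
    spaces or parentheses, so the last ")" is used as the delimiter.
    """
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    command = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, command, state


def blocked_processes(proc_root="/proc"):
    """Return sorted (pid, command) pairs for processes in state "D"."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, command, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the line was unreadable
        if state == "D":
            blocked.append((pid, command))
    return sorted(blocked)
```

A process that stays in this list across several checks a few seconds apart is a candidate for being stuck on I/O; a single appearance is normal on a busy system.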


    I'm not running NFS, no.

     

    Actually I just downgraded back to 6.12.4 and while the system was stable for longer, it's now frozen up again. I'm guessing this points to some kind of hardware fault in my server, which just happened to rear its head after the upgrade.


    Thinking about this a little more, the weird thing is that it seems like something is just eating up all the CPU, whereas I would expect a hardware fault to result in a kernel panic or something.

     

    Is there something I can run that will give me a graph or log of historical CPU load, or maybe load per docker container or something?
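As one possible answer to the question above, a small script can append the system load averages to a file at a fixed interval, so there is a record to look at after a freeze. This is a minimal sketch, not an Unraid feature: it assumes Python is available on the server, the function names and log path are hypothetical, and it records system-wide load only, not per-container usage.

```python
import os
import time
from datetime import datetime


def format_load_sample(now, load):
    """Format a timestamp plus the 1, 5 and 15 minute load averages."""
    return (
        f"{now.isoformat(timespec='seconds')} "
        f"{load[0]:.2f} {load[1]:.2f} {load[2]:.2f}"
    )


def log_load(path, interval_s=60, samples=None):
    """Append one load-average line to `path` every `interval_s` seconds.

    Runs until interrupted when `samples` is None, otherwise stops after
    that many lines. Each line is flushed and fsynced so that it is more
    likely to survive a hard freeze or power cycle.
    """
    count = 0
    while samples is None or count < samples:
        with open(path, "a") as fh:
            fh.write(format_load_sample(datetime.now(), os.getloadavg()) + "\n")
            fh.flush()
            os.fsync(fh.fileno())
        count += 1
        if samples is None or count < samples:
            time.sleep(interval_s)
```

Pointing `path` at storage that survives a reboot matters more than the interval. For a per-container view, the output of the docker CLI's `docker stats --no-stream` could be captured on the same schedule.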


    As a historical note, I downgraded back to 6.11.5 and have had 2 months, 14 days of uninterrupted uptime since then.

    Edited by thither




