  • [6.6.5] Unraid freezes completely


    Mex
    • Retest Urgent

    Over the last couple of days I have had a couple of total freezes of Unraid. The server becomes completely unresponsive: shares, the webGUI and SSH are all inaccessible. I haven't found a way to reproduce it. My guess is that it is connected to some background activity (like a parity check or the mover), but I cannot confirm this as it does not happen every time. And of course I cannot rule out hardware errors.

     

    - The first time it was booted in default mode (no GUI) and the cursor kept blinking, but the keyboard was non-responsive. I had just replaced a HDD and it had begun the rebuild when it froze. When I got it up again it started over and completed successfully.

    - The other time it was booted in GUI mode and the entire screen was just frozen (mouse and keyboard non-responsive). This must have happened during the night, when I wasn't using it for anything.

     

    Both times I had to cut power to get it up again.

     

    I have updated all plugins and dockers, and I've also run a successful memtest. I have no idea what to look for in the logs, but I have included the diagnostics as directed.

     

    Hardware:

    CPUs: 2x Intel Xeon E5645 6 core/12 thread 80W TDP
    RAM: 6x Hynix 4GB DDR3 ECC 1333 MHz

    GPU: Nvidia Quadro K2000 2GB Kepler

    Storage controller: LSI SAS 9201-8i HBA

    Adapter: NVMe PCI-E M.2 SSD to PCI Express 3.0 X4

    Storage: 2x 4TB (1 as parity), 2x 2TB, 2x 750GB
    Cache: Intel 600p 256GB NVMe SSD

     

     

    nova-diagnostics-20181201-1129.zip



    User Feedback


    There appears to be a problem with the docker image, likely a consequence of the lockups; you should delete and recreate it. As for the lockups themselves, the diags were taken just after rebooting, so there's not much to see. Is anything output to the monitor when it locks up?


    Thanks for the reply! 

    Which docker do you see issues with? Or do you mean all of them?

     

    Edit: No messages appeared on the display.

    Edited by Mex


    Okay, so this morning I started up the array. Within half an hour or so of starting the dockers (Plex, Handbrake and MakeMKV) the server froze as before. I did not think to have the log open, and the cursor is still blinking at the login prompt on the monitor.


    After deleting the docker image again, setting up my dockers and rebooting, I have now started a tail on the log on the connected monitor. So far the only warnings I see are these:

     

    Dec 2 10:44:13 Nova root: error: /webGui/include/ProcessStatus.php: wrong csrf_token

    Dec 2 10:44:14 Nova root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token

     

    There was also something odd about this:

    Dec 2 10:49:38 Nova ntpd[2026]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized

     

    I do not know what this means.
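A note on the ntpd line above: the 0x41 appears to be the kernel clock status word that ntpd reads back, so it can be decoded bit by bit. The sketch below assumes the STA_* flag values from Linux's <linux/timex.h>; the function name decode_timex_status is made up for illustration. Under that assumption, 0x41 is just STA_PLL plus STA_UNSYNC, meaning the clock had not been synchronized yet. That is commonly seen shortly after boot and is probably unrelated to the freezes.

```python
# Clock status bits, assumed to match the STA_* constants in <linux/timex.h>.
STA_FLAGS = {
    0x0001: "STA_PLL",        # PLL updates enabled
    0x0002: "STA_PPSFREQ",    # PPS frequency discipline enabled
    0x0004: "STA_PPSTIME",    # PPS time discipline enabled
    0x0008: "STA_FLL",        # FLL mode selected
    0x0010: "STA_INS",        # insert leap second
    0x0020: "STA_DEL",        # delete leap second
    0x0040: "STA_UNSYNC",     # clock unsynchronized
    0x0080: "STA_FREQHOLD",   # hold frequency
    0x0100: "STA_PPSSIGNAL",  # PPS signal present
    0x0200: "STA_PPSJITTER",  # PPS signal jitter exceeded
    0x0400: "STA_PPSWANDER",  # PPS signal wander exceeded
    0x0800: "STA_PPSERROR",   # PPS signal calibration error
    0x1000: "STA_CLOCKERR",   # clock hardware fault
}


def decode_timex_status(status):
    """Return the names of all status bits set in `status` (hypothetical helper)."""
    return [name for bit, name in sorted(STA_FLAGS.items()) if status & bit]


# The value taken from the log line in this thread.
print(decode_timex_status(0x41))  # ['STA_PLL', 'STA_UNSYNC']
```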

     

    Most recent diagnostics included.

    nova-diagnostics-20181202-1053.zip


    It just froze again, and the last log entries were these:

     

    image.png.e9b668c01b574ecb1eae87b01764ecac.png


    Yes, the latest BIOS and microcode tool from Supermicro have been applied.

    32 minutes ago, Mex said:

    And upon reboot the log displayed this:

    Those call traces are there because the docker image is corrupt again. As mentioned, this is likely the result of the crashes, not the reason for them, but you'll still need to recreate the docker image.

     

    You could try running the server in safe mode for a couple of days and see if the crashes persist.

    6 minutes ago, johnnie.black said:

    You could try running the server in safe mode for a couple of days and see if the crashes persist.

    Will dockers be available in safe mode?

     

    9 minutes ago, Mex said:

    Will dockers be available in safe mode?

    Yes, docker is still available.

    You can control docker autostart after reboot by enabling or disabling the service under Docker Settings.


    Ok!

    I have rebuilt the docker image and rebooted. Based on the log, it looks like it was a clean boot.

     

    I think the w83795 warning I got relates to a temperature sensor that System Temp may have been repeatedly trying to read (and then something locked up?). I have (as noted above) uninstalled my plugins and I will monitor the situation. If it freezes again I will try Safe Mode.


    The server has been up since my last post (about 30 hours) without any freezes. This leads me to believe that a plugin was at fault: System Temp, Preclear, Unassigned Devices or NerdPack (for Perl). These were all uninstalled.

     

    There is one thing in the log that puzzles me though. There are a lot of error messages related to the Preclear plugin, which is no longer installed.
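To get a feel for how much of the log comes from one source, the lines can be grouped by program tag and filtered by keyword. This is only a rough sketch: it assumes the standard syslog line layout seen earlier in the thread, the sample lines are the three quoted there, and the helper name count_by_tag is made up.

```python
import re
from collections import Counter

# Matches lines such as "Dec 2 10:49:38 Nova ntpd[2026]: message".
SYSLOG_LINE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<tag>[^:\[\s]+)(?:\[(?P<pid>\d+)\])?:\s*(?P<msg>.*)$"
)


def count_by_tag(lines, keyword=None):
    """Count log lines per program tag, optionally only those mentioning `keyword`."""
    counts = Counter()
    for line in lines:
        match = SYSLOG_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a plain syslog line
        if keyword and keyword.lower() not in match["msg"].lower():
            continue
        counts[match["tag"]] += 1
    return counts


sample = [
    "Dec 2 10:44:13 Nova root: error: /webGui/include/ProcessStatus.php: wrong csrf_token",
    "Dec 2 10:44:14 Nova root: error: /plugins/preclear.disk/Preclear.php: wrong csrf_token",
    "Dec 2 10:49:38 Nova ntpd[2026]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized",
]

print(count_by_tag(sample))
print(count_by_tag(sample, keyword="preclear"))
```

Run against the syslog from a diagnostics zip (read with open(...).readlines()), the keyword filter would show whether the Preclear messages all come from one tag.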

     

     

    image.png

    nova-diagnostics-20181203-1900.zip


    I finally upgraded from 6.5.3 to 6.6.5 around 20 to 30hrs ago.

    Tonight I started playing a show on Plex and 1-2 minutes into the episode the stream stopped.

    I tried to check the Unraid webUI and it was unresponsive.

    Eventually I logged in to the IPMI and took these screenshots before rebooting through the interface.

    My specs:

    ASRock Rack E3C236D2I

    Intel i3 6100

    latest bios (2.60)

     

    I never experienced this on previous builds; it has been rock solid for over 18 months.

    unraiderroripimi.JPG

    unraiderroripimi2.JPG

    5 hours ago, nas_nerd said:

    I finally upgraded from 6.5.3 to 6.6.5 around 20 to 30hrs ago.

    What plugins do you have installed?


    preclear, unassigned devices, nerd tools, system temp, s3 sleep, active streams, ssd trim, clean up app data, controller, system statistics, ipmi support, speed test, tips and tweaks, unbalance, user scripts, CA applications and fix common problems.

     

    On 12/5/2018 at 1:26 PM, nas_nerd said:

    preclear, unassigned devices, nerd tools, system temp, s3 sleep, active streams, ssd trim, clean up app data, controller, system statistics, ipmi support, speed test, tips and tweaks, unbalance, user scripts, CA applications and fix common problems.

    So we have quite a few plugins in common.

     

    I have had no issues since I removed mine. It was up for several days, then I upgraded to 6.6.6, and it has been up for 2 days now. The only logical conclusion is that one (or more) of the plugins had some issue, maybe limited to certain types of hardware.

    5 hours ago, Mex said:

    So we have quite a few plugins in common.

     

    I have had no issues since I removed mine. It was up for several days, then I upgraded to 6.6.6, and it has been up for 2 days now. The only logical conclusion is that one (or more) of the plugins had some issue, maybe limited to certain types of hardware.

    OK, I have uninstalled preclear, nerd pack, unassigned devices and a couple of Dynamix plugins I wasn't using anyway. No issues for more than 24 hours on 6.6.6.

     

    It would be great if someone from @limetech could have a look at the errors we both posted, as they mean little to me.

     

    As I said before, everything was rock solid on 6.5.3.

    6 hours ago, nas_nerd said:

    As I said before, everything was rock solid on 6.5.3

    Some incompatibilities between version 6.6 and plugins did (do) exist.

    Make sure all your plugins are up to date and, in case of doubt, test while your system runs in safe mode.


    All my plugins were up to date before I uninstalled them (on 6.6.5). I have not tried reinstalling any of them in 6.6.6 yet.


    The NerdPack and DevPack plugins install additional (selectable) packages, which may interfere with Unraid itself.

     

    I don't know if all packages are compatible with the latest version of Unraid, hence the advice to test in safe mode to rule out such issues.

     


    I get that, and I do believe that is the case in my instance. Unraid appears to have been rock solid since I removed the plugins (which I assume does much the same job as safe mode).

     

    NerdPack may well be to blame, but I only installed it because System Temp or fan control (I don't remember which) specified that it needed Perl.

     

    I do believe the issue might lie somewhere between NerdPack, Perl and System Temp. When I had the plugins installed, the "sensors" command listed a lot more than it does now. Maybe the issue was a fault in a driver used for reading the sensors?
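One way to narrow that down would be to save the "sensors" output with and without the plugins and compare which chips show up. A rough sketch, assuming the usual lm-sensors text layout (chip name, an "Adapter:" line, then "label: value" lines, with a blank line between chips). The sample text and chip names below are made up for illustration, not taken from this server, and chips_in_sensors_output is a hypothetical helper.

```python
def chips_in_sensors_output(text):
    """Return {chip_name: [reading labels]} parsed from lm-sensors style text."""
    chips = {}
    current = None
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            current = None            # a blank line ends the current chip block
            continue
        if current is None:
            current = line.strip()    # first line of a block is the chip name
            chips[current] = []
        elif line.startswith("Adapter:"):
            continue                  # bus description, not a reading
        elif ":" in line and not line.startswith(" "):
            chips[current].append(line.split(":", 1)[0].strip())
    return chips


# Made-up sample output, for illustration only.
with_plugins = """coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +38.0°C  (high = +81.0°C, crit = +101.0°C)

w83795g-i2c-0-2f
Adapter: SMBus I801 adapter at 0400
temp1:        +35.0°C
fan1:        1200 RPM
"""

without_plugins = """coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +38.0°C  (high = +81.0°C, crit = +101.0°C)
"""

extra = sorted(
    set(chips_in_sensors_output(with_plugins))
    - set(chips_in_sensors_output(without_plugins))
)
print(extra)  # chips that only appear when the plugins are installed
```

Any chip that only appears in the first listing points at a driver that was loaded for the plugins, which would be the first thing to test on its own.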

     

    Edit: Or my motherboard has some bad sensors, which only show up as a problem when the plugins are installed.

    Edited by Mex




  • Status Definitions

     

    Open = Under consideration.

     

    Solved = The issue has been resolved.

     

    Solved version = The issue has been resolved in the indicated release version.

     

    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.

     

    Retest = Please retest in latest release.


    Priority Definitions

     

    Minor = Something not working correctly.

     

    Urgent = Server crash, data loss, or other showstopper.

     

    Annoyance = Doesn't affect functionality but should be fixed.

     

    Other = Announcement or other non-issue.