• unRAID 6.6.5: Total system lockup, unresponsive from console, SSH, and network; parity check stops, no disk activity at all, no VM working, nothing functioning whatsoever.


    Rudder2
    • Solved Urgent

    Total system lockup: unresponsive from console, SSH, and network; parity check stops, no disk activity at all, no VM working, nothing functioning whatsoever.  I suffer this lockup whenever I install 6.6.0+.  I thought it was because of the RealTek NIC problem, so I downgraded to 6.5.3, then updated to 6.6.3 and had worse network issues, so I downgraded to 6.5.3 again.  Then I upgraded to 6.6.5: the RealTek NIC error has been rectified, but the lockup problem remains.

     

    The system will run for about 5 or 6 days and lock up, then run 6 hours and lock up, then 3 hours and lock up, then 30 minutes and lock up.  I can't get system logs from before a lockup because they are not stored on flash.  I have already downgraded to the stable 6.5.3 image.  The system logs from the first boot after lockups are attached (there were 4 lockups in one day): the first was taken after the first lockup and the last after the 4th.  I don't know if they will help because they were taken after a reset button press to get the system back up.

     

    BIOS was updated to the latest BIOS for my motherboard after the second lockup because I saw a hardware error about microcode in the log file 1923.  That same error is in log file 2357.

    rudder2-server-diagnostics-20181114-1923.zip

    rudder2-server-diagnostics-20181114-2357.zip







    I've had this happen a couple times, the cursor on the console will even stop blinking. Seemed to happen whenever I made changes to the hardware such as adding or removing a PCI-E card to test hardware passthrough to a VM. Required a hard reset but never happened again after in my case (until the next mucking about the PCI-E cards).

     

    Ryzen Threadripper, current BIOS.

     

    This is not going to be an easy thing to track down without a means to capture logs or a way to have the syslog written to the flash drive instead of RAM.

    Edited by jbartlett
    Link to comment
    5 minutes ago, jbartlett said:

    This is not going to be an easy thing to track down with no means to capture logs or specifying the syslog to be written to the flash drive instead of RAM. 

    I've been wanting my logs to write to my Cache Drive.  I'm not sure about having continuous writes to the USB Thumb Drive.  Is there a way to change unRAID to write logs to the Cache Drive and have it rotate them, say, every week?

    Link to comment

    Hardware from Info Button.

    M/B: ASRock - Z97 Extreme6

    CPU: Intel® Core™ i7-4790K CPU @ 4.00GHz

    HVM: Enabled

    IOMMU: Enabled

    Cache: 256 kB, 1024 kB, 8192 kB

    Memory: 32 GB (max. installable capacity 32 GB)

    Network: eth0: 1000 Mb/s, full duplex, mtu 1500 (Intel Onboard NIC)
     eth1: not connected (RealTek Onboard NIC usually used for NIC Bonding disconnected for diagnostic reasons)

    Kernel: Linux 4.14.26-unRAID x86_64

    OpenSSL: 1.0.2n

    Link to comment

    You can click on 'Log' button and leave the window open to see if any messages get output.

    Alternately, via console or terminal window you can 'tail' the log:

    tail -f /var/log/syslog

    The machine check messages are suspicious - they should not be there and normally indicate h/w problems.  Have you run a memtest lately?

     

    Your system log shows this on the top line:

    Nov 14 23:48:13 Rudder2-Server kernel: microcode: microcode updated early to revision 0x25, date = 2018-04-02

    I think that should not happen using latest bios.
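
    The 'tail' approach above only shows messages while a window is open. A minimal sketch of keeping a copy that survives a reset, assuming the cache drive is mounted at /mnt/cache; the syslog-mirror directory and both function names are hypothetical and not an Unraid feature:

```shell
#!/bin/bash
# mirror_syslog [SRC] [DEST_DIR]   (hypothetical helper, not part of Unraid)
# Follows SRC in the background and appends every line to a time-stamped
# file in DEST_DIR. Prints the PID of the background tail so it can be stopped.
mirror_syslog() {
    local src="${1:-/var/log/syslog}"
    local dest_dir="${2:-/mnt/cache/syslog-mirror}"   # assumed cache location
    mkdir -p "$dest_dir" || return 1
    local dest
    dest="$dest_dir/syslog-$(date +%Y%m%d-%H%M%S).log"
    # -n +1 copies what is already in the log; -F keeps following by name.
    tail -n +1 -F "$src" >> "$dest" 2>/dev/null &
    echo "$!"
}

# prune_syslog_mirror [DEST_DIR] [KEEP_DAYS]   (hypothetical helper)
# Deletes mirrored log files older than KEEP_DAYS days (default: one week).
prune_syslog_mirror() {
    local dest_dir="${1:-/mnt/cache/syslog-mirror}"
    local keep_days="${2:-7}"
    find "$dest_dir" -maxdepth 1 -type f -name 'syslog-*.log' \
        -mtime +"$keep_days" -delete
}
```

    Calling mirror_syslog once after boot starts the copy, and prune_syslog_mirror could be run from a daily scheduled job. Writes still pass through the page cache, so the last few seconds before a hard lockup may be missing from the copy.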

    Link to comment
    3 minutes ago, limetech said:

    You can click on 'Log' button and leave the window open to see if any messages get output.

    Alternately, via console or terminal window you can 'tail' the log:

    
    tail -f /var/log/syslog

    The machine check messages are suspicious - they should not be there and normally indicate h/w problems.  Have you run a memtest lately?

     

    Your system log shows this on the top line:

    
    Nov 14 23:48:13 Rudder2-Server kernel: microcode: microcode updated early to revision 0x25, date = 2018-04-02

    I think that should not happen using latest bios.

    The latest BIOS is installed.

     

    I ran a memory check back when I updated to 6.6.0 and had the first lockup, and it passed fine.  The weird thing is I'm 100% stable on 6.5.0.  I kept saying 6.5.3 but it's actually 6.5.0 because I missed the .3 update.  I hate to be locked to this version.  I was going to wait for the 6.6.6 update and try again.

     

    That Machine Check message is what made me look at the logs and see that there was a message about the microcode.  After searching and reading that a BIOS update was recommended if 6.6.x locked up, I updated my BIOS.  The message only shows up when running 6.6.x, never on 6.5.0.

    Link to comment

    Any idea what the difference between 6.5.0 and 6.6.x is that can cause the hardware error?  How about giving us an option to easily store logs to cache instead of the ramdisk?  Anything I should try?  Really want to upgrade.

    Link to comment

    In the log of Unraid 6.6.5 it says your motherboard is running BIOS P2.70 05/17/2016

    In the log of Unraid 6.5.0 it says your motherboard is running BIOS P2.80 03/06/2018

     

    Did you upgrade your BIOS afterwards? - ok, I see your 2nd diagnostics has the updated BIOS

    The microcode update which gets loaded in Unraid 6.6.5 cripples your CPU (hardware errors).

    This microcode update is not loaded in Unraid 6.5.0 due to the newer BIOS version (3/6/2018).

    Edited by bonienl
    • Upvote 1
    Link to comment
    5 hours ago, bonienl said:

    In the log of Unraid 6.6.5 it says your motherboard is running BIOS P2.70 05/17/2016

    In the log of Unraid 6.5.0 it says your motherboard is running BIOS P2.80 03/06/2018

     

    Did you upgrade your BIOS afterwards? - ok, I see your 2nd diagnostics has the updated BIOS

    The microcode update which gets loaded in Unraid 6.6.5 cripples your CPU (hardware errors).

    This microcode update is not loaded in Unraid 6.5.0 due to the newer BIOS version (3/6/2018).

    Which log?  I ran 6.5.0 for a long time on BIOS 2.70 without knowing there was an update.  The 6.6.5 log file with time stamp 1923 was from before the BIOS update and the 2357 log was from after the BIOS update.  I uploaded the logs from before and after the BIOS update, each taken post crash.

     

    I just checked the 1923 log and it said I had the old BIOS and the 2357 log has the new BIOS.

    Edited by Rudder2
    Link to comment

    The problem looks to be that 6.6.5 loads the microcode even though my BIOS is up to date in the second log file (2357).  Someone already said above:

    22 hours ago, limetech said:

    Your system log shows this on the top line:

    
    Nov 14 23:48:13 Rudder2-Server kernel: microcode: microcode updated early to revision 0x25, date = 2018-04-02

    I think that should not happen using latest bios.

    I am starting to think this is the smoking gun after reading your comment.  Why is my system loading the microcode even though I have the latest BIOS that has the microcode in it?

    Link to comment

    The 6.5.0 diagnostics you just posted shows you are running BIOS 2.80 and NO microcode update is loaded.

    I am not sure at which point this particular microcode update was added to the Linux kernel.

    Unraid 6.5 runs kernel version 4.14, while Unraid 6.6 runs kernel version 4.18.

     

    The microcode update is dated April 2, 2018 which is newer than your BIOS of March 6, 2018

    Edited by bonienl
    Link to comment
    7 minutes ago, bonienl said:

    The 6.5.0 diagnostics you just posted shows you are running BIOS 2.80 and NO microcode update is loaded.

    I am not sure at which point this particular microcode update was added to the Linux kernel.

    Unraid 6.5 runs kernel version 4.14, while Unraid 6.6 runs kernel version 4.18.

    This is the LOG after the BIOS update on 6.6.5.  It still loads the Microcode.  Look at the Motherboard.txt and the first entry in the log.

     

    6.6.5 is loading the microcode in error then... Why would it do that?

    rudder2-server-diagnostics-20181114-2357.zip

     

    Sorry if posting Before and After diagnostics in the same post confused the issue.  I uploaded both thinking a comparison might help...

    Edited by Rudder2
    Link to comment
    2 minutes ago, Rudder2 said:

    6.6.5 is loading the microcode in error then... Why would it do that?

    See statement I added later to my previous post

    Link to comment
    2 minutes ago, bonienl said:

    See statement I added later to my previous post

    I'm confused about what this has to do with anything.  Are you saying that once the microcode is loaded, the kernel will never stop loading it even after a BIOS update, so I need to reinstall the OS to fix it?  Or are you saying that there is a kernel error and I need to wait for the next kernel update to fix this problem?

    Link to comment

    I am saying the microcode update is loaded because it is seen as newer than your BIOS. In other words, it overwrites the microcode that comes with your BIOS.

    Link to comment

     

    1 minute ago, bonienl said:

    I am saying the microcode update is loaded because it is seen as newer than your BIOS. In other words, it overwrites the microcode that comes with your BIOS. 

    OK, so is there a way to force the kernel not to load it?  Will I be stuck on 6.5.0 forever because of this?

    Link to comment
    4 minutes ago, Rudder2 said:

     

    OK, so is there a way to force the kernel not to load it?  Will I be stuck on 6.5.0 forever because of this?

    Limetech can include or exclude the microcode update, but this requires a recompile.

    According to Intel's information the latest microcode update is from August 2018.

    It would be weird for Intel's own microcode to break your particular processor ... but it looks that way.

    Link to comment
    1 minute ago, bonienl said:

    Limetech can include or exclude the microcode update, but this requires a recompile.

    According to Intel's information the latest microcode update is from August 2018.

    It would be weird that Intel's own microcode breaks your particular processor ... but it looks like that

     

    From LOG:

    Nov 14 23:48:13 Rudder2-Server kernel: microcode: microcode updated early to revision 0x25, date = 2018-04-02

     

    From Motherboard Manufacturer (ASRock):

    BIOS P2.80: Update Haswell CPU Microcode to revision 24 and Broadwell CPU Microcode to revision 1D.

     

    So yes, I think we are on to something... The microcode update in the kernel is breaking the CPU.  I wonder if this is why ASRock didn't update the BIOS to it.  Where do we go from here?  Is this something that must be fixed by Limetech, or is there code I can insert to fix my system until a permanent fix is implemented?  I really don't want to be stuck with no updates to unRAID.
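
    The two revisions quoted above can be compared numerically. A minimal sketch, assuming ASRock's "revision 24" is hexadecimal like the kernel's 0x25 (an assumption; the BIOS changelog does not say), with a hypothetical helper name:

```shell
#!/bin/bash
# is_newer_revision OFFERED CURRENT   (hypothetical helper)
# Succeeds when OFFERED is numerically greater than CURRENT.
# Both arguments are hex revisions written with a 0x prefix.
is_newer_revision() {
    [ "$(( $1 ))" -gt "$(( $2 ))" ]
}

# Revisions from this thread: the kernel offers 0x25,
# BIOS P2.80 ships revision 24 (assumed to be hex, i.e. 0x24).
if is_newer_revision 0x25 0x24; then
    echo "kernel microcode is newer than the BIOS microcode"
else
    echo "BIOS microcode is current"
fi
```

    Under that assumption the kernel's revision is one step ahead of the BIOS, which is consistent with the update being applied at boot.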

     

    Thank you for your help, you shed some light on the situation...

    Link to comment
    4 hours ago, Rudder2 said:

    I wonder if this is why ASRock didn't update the BIOS to it.

    This would be a good question for you to pose to ASRock.

     

    Because of the Meltdown/Spectre debacle we added "early microcode update" into the Linux kernel (other distros do this too).  This is the correct thing to do and we don't want to change that.  It's very possible the microcode update will make no difference at all, but here's what we can do.

     

    We'll publish a 6.6.6 release in the normal manner (with updated linux kernel), and then ask you to test and see if the issue persists.  If so, we can generate a version of 'bzimage' that does not include early microcode update for you to test.  If issue goes away then we have some evidence to report to Intel.  But I'd be surprised if this were the issue because there would be a lot more people complaining about it if that were the case.

    Link to comment
    17 minutes ago, limetech said:

    This would be a good question for you to pose to ASRock.

     

    Because of the Meltdown/Spectre debacle we added "early microcode update" into the Linux kernel (other distros do this too).  This is the correct thing to do and we don't want to change that.  It's very possible the microcode update will make no difference at all, but here's what we can do.

     

    We'll publish a 6.6.6 release in the normal manner (with updated linux kernel), and then ask you to test and see if the issue persists.  If so, we can generate a version of 'bzimage' that does not include early microcode update for you to test.  If issue goes away then we have some evidence to report to Intel.  But I'd be surprised if this were the issue because there would be a lot more people complaining about it if that were the case.

    I'm totally open to this as long as I retain my ability to downgrade to the stable release for me, 6.5.0.  I've been testing every release and downgrading since 6.6.0, because that's when this problem started for me (I only missed 6.6.4 because I never got a notification when it came out; this notification thing has been remedied).  I thought it would go away when you fixed the RealTek NIC problem but it didn't.

     

    Just tell me what you want me to try and I will upgrade my system and try it.  When I first install an upgrade it takes 5-6 days to lock up the first time, then only 6 hours, then only 3 hours, then every 30 minutes, so it might take a week or so to get you the results.  Also, if you could tell me a way to have the logs stored on the cache drive I will do this so we have logs right up to the lockup.  I've been wanting this as an option for a long time, maybe with logs deleted once they are a week or 2 old.

     

    In the meantime I'm putting in an email to ASRock support.

     

    Thank you for your help.  I really appreciate it.

    Link to comment

    That lockup pattern is quite amazing - how do you restart following lockup?  Meaning, via Reset button or do you power down/power up?

    Link to comment
    51 minutes ago, limetech said:

    That lockup pattern is quite amazing - how do you restart following lockup?  Meaning, via Reset button or do you power down/power up

    I know, it's so odd. Since it's hard locked I can't power down properly so I press the reset button. Honestly, I haven't had a computer lockup in so long I forgot powering off and waiting a minute and powering back on was different than reset button.

    Edited by Rudder2
    Link to comment

    Just for my understanding, so correct me if I'm wrong: if the first line doesn't report that the microcode was updated, then Unraid left things as they were and the following lines in the syslog are just reported for informational purposes.

     

    Nov 10 22:04:44 NAS kernel: microcode: CPU0: patch_level=0x08001137 (repeated for each hyperthreaded core)

    Nov 10 22:04:44 NAS kernel: microcode: Microcode Update Driver: v2.2.

    Link to comment

    Correct. These lines report which microcode version is in use. When not updated by the kernel, it is the version supported by the BIOS.
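
    The check described above can be sketched as a small shell function. The function name and its output labels are hypothetical; the matched message text is taken from the syslog line quoted earlier in this thread:

```shell
#!/bin/bash
# microcode_status [LOGFILE]   (hypothetical helper)
# Prints "kernel-updated: <revision> <date>" when the log contains an
# early microcode update message, otherwise prints "bios-provided".
microcode_status() {
    local log="${1:-/var/log/syslog}"
    local line
    # '|| true' keeps a non-match from aborting scripts that use 'set -e'.
    line=$(grep -m1 'microcode: microcode updated early to revision' "$log" || true)
    if [ -n "$line" ]; then
        printf 'kernel-updated: %s\n' \
            "$(printf '%s\n' "$line" |
               sed -E 's/.*revision (0x[0-9a-fA-F]+), date = ([0-9-]+).*/\1 \2/')"
    else
        echo "bios-provided"
    fi
}
```

    Run against a saved syslog from a diagnostics zip, or with no argument to read /var/log/syslog on the running server.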

    Link to comment




