• [6.7.0-rc1] System Hard Lock


    TechnoBabble28
    • Urgent

    I upgraded from 5.6.3 to 6.7.0-rc1 yesterday afternoon. The server first locked up after about 4 hours, with no SSH or browser access, and required a hard reset. I then enabled Troubleshooting Mode in the Fix Common Problems (FCP) plugin, left the dashboard open on my screen, and went to bed. When I woke up this morning, it had locked up again about 12.5 hours in. I am not running any VMs; this is purely a media server with a few Docker containers. The logs are attached below.

    Unraid.PNG

    FCPsyslog_tail.txt

    mediaserver-diagnostics-20190122-0245.zip





    Recommended Comments



    I'm using the same motherboard as the OP, and I haven't had a Ryzen crash in a very long time. First, this bug is pretty much on the first-gen Ryzen chip, and setting the "Typical Current Idle" option in the BIOS should solve it. I had a 1600X and ran for months without a problem. I now have a 2700X, still on the same motherboard, and still no crashes. I'm also on the newest F25 BIOS, which I flashed on Friday; I was on F24 before that with no issues.

    43 minutes ago, limetech said:

    This is a tough one for us because Ryzen on Linux is just plain broken and AMD will not fix it.


    512 posts later (as of this writing) on the main kernel bug report, I still don't see a clear solution:

    https://bugzilla.kernel.org/show_bug.cgi?id=196683

    At least they are still actively troubleshooting it after 17 months. [he says almost convincingly... clinging to hope]


    I had been running 6.5.3 for many months without issues and without any ZenStates or C-state modifications. I never had any problems with my Ryzen build other than some bad RAM sticks early on.

    I updated to the Gigabyte F23 BIOS a few months back but never looked through its settings until the other day. After disabling Global C-state Control and setting the power supply idle option to "Typical", 6.7.0-rc1 was stable for several days. It only started crashing again once I updated to rc2. I am using FCP to gather logs, so when it inevitably crashes in the next 16 hours or so, I can hopefully see what's going on. I also see that Gigabyte released an F25 BIOS a little while ago, but I am hesitant to switch until I figure out why the system is crashing. From what I can see, the biggest changes are some tweaks for Athlon-based systems and an AGESA update from 1.0.0.4 to 1.0.0.6. It also mentions a requirement to update the chipset driver, which I cannot do.

    I know one of you mentioned having problems with a Mellanox card. I am also running a Mellanox card but haven't encountered any issues. However, I do not have any VMs running, and I use the Mellanox card only for direct file transfers between my desktop and server, not for internet access.


    @limetech I think my issue on the latest RC is a Mellanox-related change in the newer kernel. After going back to 6.6.6 with no changes to C-states or power supply idle settings, I have been stable for about 4 days now.


    I've been having issues too, on an Asus X370-Pro / Prime. However, I also had crashes on 6.6.6 and updated to the RC to see if it was any better. In all cases, the crash seems to happen when there is more network/disk I/O than normal. I had an Intel D33682 dual-LAN card installed, which I took out today out of suspicion. It may well be related to the bonding mentioned earlier in the thread, so I'll see how it goes. I didn't have these problems when running Proxmox, nor when running FreeNAS.

    I have a ton of screenshots of the crash screen if anyone wants to see them; nothing seems to turn up in the logs.

    On 1/28/2019 at 9:28 PM, david279 said:

    First this bug is pretty much on the first gen ryzen chip

    That's my understanding of the matter too. The only chip I have with the problem is an early 1700 (which, incidentally, also fails the kill-ryzen test). I've never had the issue with a 2000-series chip. So it would be very useful if people experiencing this problem would specify not just their motherboard but also their processor.
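    Since the processor matters here, a minimal sketch for pulling the identity fields out of /proc/cpuinfo for a report may help. It assumes a Python 3 interpreter is available on the server (it may need to be installed separately), and the helper names are hypothetical. The check for AMD family 23 (17h), model 1 reflects the understanding that this model number covers the original Zen parts such as the Ryzen 1000 desktop chips, while 2000-series desktop chips report model 8; treat it as a rough hint, not a diagnosis.

```python
"""Summarize the CPU identity fields from /proc/cpuinfo text.

Hypothetical helper for posting processor details in a bug report;
not part of Unraid or the FCP plugin.
"""
from pathlib import Path

# Fields worth quoting in a report (names as they appear in /proc/cpuinfo).
WANTED = {"vendor_id", "cpu family", "model", "model name", "stepping", "microcode"}


def summarize_cpuinfo(text: str) -> dict:
    """Return the wanted fields of the first processor entry only."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            if info:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in WANTED:
            info[key] = value.strip()
    return info


def is_family_17h_model_1(info: dict) -> bool:
    """Assumption: AMD family 23 (0x17), model 1 includes first-gen Zen
    desktop parts (Ryzen 1000 series). A rough hint only."""
    return (
        info.get("vendor_id") == "AuthenticAMD"
        and info.get("cpu family") == "23"
        and info.get("model") == "1"
    )


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # exists on Linux only
    if path.exists():
        cpu = summarize_cpuinfo(path.read_text())
        for key, value in cpu.items():
            print(f"{key}: {value}")
        print("family 23, model 1:", is_family_17h_model_1(cpu))
```

    Only the first processor entry is read, since every core on a single-socket system reports the same model, and the output is short enough to paste into a post.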

    4 hours ago, Pauven said:

    What's the current status with 6.7.0 RC4? Are there still stability issues, or has the situation improved?

    I think it's solved. The only remaining problem is drives not spinning down, which will probably be fixed in RC5.

    Edited by nuhll

    I kept trying each RC until I found a stable one. RC4 had been the most stable, with an uptime of about 2 weeks, which usually means I am in the clear. I upgraded to RC5 a couple of days ago, and it has been solid so far as well.


    I am having this exact issue on a Dell R710. I have a Mellanox card, but that is not the one throwing the errors; they're coming from the onboard NIC. Did anyone ever find a fix?


    Joining the Unraid 'me too' on this one.

    My system is a Supermicro board with a Xeon 3460, stable for around three years across all upgrades. Since the 6.7.2 upgrade, I've experienced loss of the webGUI, needing hard resets. Over the last couple of days, the server has crashed while idle (??), and I'm unsure how to capture the syslog when this happens (tail??). I now have a screen and keyboard hooked up but what are the commands needed to allow me to monitor the syslog? I'm sure this will happen again. A parity check is running. Next up is to double up on parity, or maybe go back a release?

    Edited by superloopy1
    7 hours ago, superloopy1 said:

    I now have a screen and keyboard hooked up but what are the commands needed to allow me to monitor the syslog?

    You can click the Log icon on the webGUI menu bar to start a tail of the syslog in your browser. You can also type this command at the console:

    tail -f /var/log/syslog
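    Because Unraid keeps /var/log in RAM, anything a console tail shows is lost on a hard reset unless you photograph the screen. A minimal sketch, similar in spirit to the FCP troubleshooting mode mentioned at the top of this report, is to copy new syslog lines to the flash drive and fsync after each batch. It assumes Python 3 is available (it may need to be installed separately); the destination path and the function names are hypothetical. Frequent writes add wear to the USB drive, so run it only while troubleshooting.

```python
"""Mirror new syslog lines to persistent storage so they survive a hard reset.

Minimal sketch: SOURCE and DEST are assumptions; adjust for your system.
"""
import os
import time

SOURCE = "/var/log/syslog"
DEST = "/boot/logs/syslog-mirror.txt"  # hypothetical path on the flash drive


def mirror_new_bytes(source: str, dest: str, offset: int) -> int:
    """Append bytes written to `source` since `offset` to `dest`.

    Returns the new offset. If the source shrank (e.g. log rotation),
    copying restarts from the beginning of the file.
    """
    size = os.path.getsize(source)
    if size < offset:
        offset = 0  # file was truncated or rotated
    if size == offset:
        return offset  # nothing new
    with open(source, "rb") as src:
        src.seek(offset)
        chunk = src.read(size - offset)
    with open(dest, "ab") as dst:
        dst.write(chunk)
        dst.flush()
        os.fsync(dst.fileno())  # push the data to the device before a lockup
    return offset + len(chunk)


def follow(source: str = SOURCE, dest: str = DEST, interval: float = 1.0) -> None:
    """Poll `source` forever, copying anything new to `dest`."""
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    # Start at the current end of the file so only new lines are copied.
    offset = os.path.getsize(source)
    while True:
        offset = mirror_new_bytes(source, dest, offset)
        time.sleep(interval)
```

    To use it, call follow() from a small script started in the background (for example with nohup), then read the mirror file from the flash drive after the next lockup. Polling with an fsync per batch trades some flash wear for a good chance that the last lines before the crash are on disk.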


    I have been running 6.7.2 for weeks with no freeze issues.

    My hardware is an X470D4U, a Ryzen 7 PRO 1700, and ECC DDR4 RAM. The only change I made was setting the power supply option to "Typical" in the BIOS; nothing else.






  • Status Definitions

    Open = Under consideration.
    Solved = The issue has been resolved.
    Solved version = The issue has been resolved in the indicated release version.
    Closed = Feedback or opinion better posted on our forum for discussion. Also used for reports we cannot reproduce or that need more information. In this case, just add a comment and we will review it again.
    Retest = Please retest in the latest release.


    Priority Definitions

    Minor = Something not working correctly.
    Urgent = Server crash, data loss, or other showstopper.
    Annoyance = Doesn't affect functionality but should be fixed.
    Other = Announcement or other non-issue.