• 6.8.0-rc1 server randomly restarts


    Darqfallen
    • Closed






    Fans are working great: all fans are reporting >5000 rpm, ambient temp in the room is 25C, System Ambient is 30C, CPU1 36C, CPU2 40C. I didn't have this issue in 6.7.0. I had a similar issue in 6.7.2, but didn't have the time to work on it, so I downgraded to 6.7.0 again.

    Edited by Darqfallen
    Link to comment
    38 minutes ago, Darqfallen said:

    3.7.2

    Presuming you mean 6.7.2

     

    Does nothing at all spit out on console?

     

    This is a tell-tale sign of a memory issue. You can try re-seating your RAM and maybe swapping the sticks around while you're at it. Shake the tree and see if the failure changes.

    Link to comment

    Yes, 6.7.2. And nope, the only reason I know it restarts is that I get the email saying a parity check has started.

    I opened her up, pulled every stick of RAM, and swapped each with its opposite bank. Then I went into the BIOS and noticed the PC3-10600 (DDR3-1333) RAM was running at PC3-12800 (DDR3-1600) speed, so I've dropped it back down to the proper speed and will monitor for stability. I will let you know in 12 hours or so if it's still stable.
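For anyone puzzled by the two sets of numbers in the post above: the 10600/12800 figures are module bandwidth labels, while 1333/1600 are the transfer rates. A rough sketch of the arithmetic, assuming standard 64-bit (8-byte) DIMMs; the function name is just for illustration:

```python
# DDR3 module labels encode peak bandwidth: transfers per second times the
# 8-byte (64-bit) width of a standard DIMM data bus.
BUS_WIDTH_BYTES = 8  # assumption: ordinary 64-bit DIMMs


def peak_bandwidth_mb_s(mega_transfers_per_s: int) -> int:
    """Peak theoretical bandwidth in MB/s for one 64-bit DIMM."""
    return mega_transfers_per_s * BUS_WIDTH_BYTES


rated = peak_bandwidth_mb_s(1333)   # 10664, sold under the rounded label PC3-10600
actual = peak_bandwidth_mb_s(1600)  # 12800, sold as PC3-12800
overclock_pct = (1600 - 1333) / 1333 * 100

print(f"rated {rated} MB/s, was running at {actual} MB/s "
      f"(about {overclock_pct:.0f}% over spec)")
```

So the sticks were being driven roughly a fifth faster than their rating, which is a plausible source of instability on its own.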

    Link to comment

    Server ran for 2 hours then spat out these errors and rebooted.

    Oct 13 14:04:31 Dirge kernel: mce: [Hardware Error]: Machine check events logged
    Oct 13 14:04:42 Dirge kernel: mce: [Hardware Error]: Machine check events logged
    Oct 13 14:10:39 Dirge kernel: mce: [Hardware Error]: Machine check events logged
    Oct 13 14:10:52 Dirge kernel: mce: [Hardware Error]: Machine check events logged
    Oct 13 14:34:08 Dirge kernel: mce: [Hardware Error]: Machine check events logged
    Oct 13 15:04:12 Dirge kernel: mce: [Hardware Error]: Machine check events logged
    Oct 13 15:04:14 Dirge kernel: mce: [Hardware Error]: Machine check events logged
    Oct 13 15:09:48 Dirge kernel: mce_notify_irq: 1 callbacks suppressed
    Oct 13 15:09:48 Dirge kernel: mce: [Hardware Error]: Machine check events logged
    Oct 13 15:13:16 Dirge kernel: mce: [Hardware Error]: Machine check events logged
    Oct 13 15:15:25 Dirge kernel: mce: [Hardware Error]: Machine check events logged
    Oct 13 15:19:14 Dirge kernel: mce: [Hardware Error]: Machine check events logged
    Oct 13 15:19:15 Dirge kernel: mce: [Hardware Error]: Machine check events logged
    Oct 13 15:21:30 Dirge kernel: mce_notify_irq: 1 callbacks suppressed
    Oct 13 15:21:30 Dirge kernel: mce: [Hardware Error]: Machine check events logged
    Oct 13 15:35:23 Dirge kernel: mce: [Hardware Error]: Machine check events logged

    So I'm running a memtest to see if any memory modules have failed. I am not sure what those errors mean.
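If it helps to see how often the machine checks fire, here is a minimal sketch that pulls the "Machine check events logged" lines out of a syslog and prints the gaps between them. The function name and the `year` parameter are assumptions for illustration; syslog timestamps carry no year, so the default is an arbitrary placeholder that only matters for computing gaps.

```python
import re
from datetime import datetime

# Matches only the "events logged" lines, not the mce_notify_irq notices.
MCE_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"mce: \[Hardware Error\]: Machine check events logged"
)


def mce_timestamps(lines, year=2000):
    """Return a datetime for every 'Machine check events logged' line.

    `year` is an arbitrary placeholder because syslog stamps omit the year.
    """
    stamps = []
    for line in lines:
        match = MCE_LINE.match(line.strip())
        if match:
            stamps.append(
                datetime.strptime(f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S")
            )
    return stamps


# A few lines copied from the log posted above.
sample = [
    "Oct 13 14:04:31 Dirge kernel: mce: [Hardware Error]: Machine check events logged",
    "Oct 13 14:04:42 Dirge kernel: mce: [Hardware Error]: Machine check events logged",
    "Oct 13 14:10:39 Dirge kernel: mce: [Hardware Error]: Machine check events logged",
    "Oct 13 15:09:48 Dirge kernel: mce_notify_irq: 1 callbacks suppressed",
]

stamps = mce_timestamps(sample)
gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
print(f"{len(stamps)} MCE events, gaps in seconds: {gaps}")
```

To run it over a full log, pass an open file instead of `sample`, e.g. `mce_timestamps(open("syslog"))`. This only counts events; it does not decode which component raised them.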

    Link to comment

    I'm sorry, but I didn't realize that the diagnostics didn't include the whole syslog. I posted my diagnostics from when it was crashing above. Attached is my whole syslog, including the crashing: syslog.zip

    I'm not sure what else to do. If I leave the server running on anything newer than 6.7.0, it crashes and reboots every 1-4 hours.

    Link to comment

    There's likely some hardware issue, which is also supported by the MCE errors above. If it was me, I would start swapping/replacing things around, like RAM, board, PSU, etc.

    Link to comment

    It might well be that the newer Unraid version addresses different parts of your RAM and makes the problem visible.

     

    If you have multiple RAM sticks, try a reduced number or change the RAM completely.

     

    Link to comment

    Memtest is only definitive if it detects errors; a clean pass doesn't prove the RAM is good. Also, I didn't check your hardware, but if you're using ECC RAM, memtest won't detect any errors that are being corrected, though the server will halt when there's an uncorrectable error.

    Link to comment

    So something in the kernel patch in 6.7.2 and 6.8.0-rc1 is causing my hardware to start throwing memory errors, and a memtest will not show bad memory because it is ECC memory. How do you fully test ECC RAM?

    Link to comment

    Most server boards have a system event log where those errors are logged, some have more details than others, e.g., identify or not the slot with problems.
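Besides the board's event log, Linux itself can expose ECC counters through the EDAC sysfs tree, with `ce_count` (corrected) and `ue_count` (uncorrected) files per memory controller. A minimal sketch of reading them follows. It assumes an EDAC driver for the chipset is actually loaded, which may not be the case on a given Unraid build; the function name is made up, and the demo runs against a throwaway fake tree with invented numbers so it works anywhere.

```python
import tempfile
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Map each memory controller (mc0, mc1, ...) to (corrected, uncorrected)."""
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux host
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers whose counters are missing or unreadable
        counts[mc.name] = (ce, ue)
    return counts


# Demo against a fake tree (made-up numbers, not from the server in this thread).
with tempfile.TemporaryDirectory() as tmp:
    fake_mc = Path(tmp) / "mc0"
    fake_mc.mkdir()
    (fake_mc / "ce_count").write_text("17\n")
    (fake_mc / "ue_count").write_text("0\n")
    demo = read_edac_counts(tmp)

for name, (ce, ue) in demo.items():
    print(f"{name}: {ce} corrected, {ue} uncorrected")
```

On the real server you would call `read_edac_counts()` with no argument. A corrected count that keeps climbing between checks points at a marginal DIMM or slot even when memtest passes.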

    Link to comment




