• MCELOG ERROR on AMD systems


    bmartino1
    • Annoyance

    I'm not sure why mcelog is erroring out, nor why a machine check event logs to dmesg on an AMD system. As you can see here, the SGC Unraid box eventually hits a machine check event. mcelog errors out and names a module to use instead, even though that module is available and in use, as seen via Tools > System Drivers. Is there a different mcelog command to see the system error?

    This system was up longer than a week and then errored with a machine check event.

     

    I find that on any AMD system with a Zen 1 to Zen 3 processor, attempting to use mcelog returns an unsupported-processor error telling you to use module xyz...

     

    This continues from a few forum posts: from the system error, to an FCP post, to a Nerd Tools post for mcelog, to General Support, and now this bug report, as something doesn't seem right here.

     

    A reboot seems to fix all issues, except that mcelog still errors out with unsupported processor...

    ... Can't seem to find the old FCP support forum post that needs to be updated ...
     

     

      

    22 hours ago, bmartino1 said:

    So my friend's system started randomly throwing this error today. FCP reports a machine check event, so I decided to run diagnostics.

     

    Rebooted the server and had a server fault kernel panic. Booted to safe mode successfully, fixed mod changes, and rebooted the system to a normal boot, and it's up with no errors...

    Not sure what setting changed, or what about a long-running system on an AMD platform is causing this issue.

     

    My friend also has 3 Unraid licenses and multiple computer systems, but doesn't have a forum account...

    This machine is called SGC, a name from Star Trek.

     

    There seems to be a problem with this kernel module driver...

     

    image.png.1d7a0c256bde65a743cbb968b458cb54.png

    Unraid 6.12.8 > System Drivers:

    image.thumb.png.633f71d35e307c0d9fdc725d025f746c.png

     

    but the terminal shows an error:
    image.thumb.png.a0221404f40ab89e92c3cd6a07994112.png

     

    Thoughts, ideas?

     

    after_reboot_all_systems_go_-_sgc-diagnostics-20240326-1658.zip (191.46 kB)
    sgc-diagnostics-20240326-1157.zip (181.73 kB)

     

     




    User Feedback

    Recommended Comments

    To be clear, this is the bug report issue:

     

    root@BMM-Unraid:~# mcelog
    mcelog: ERROR: AMD Processor family 25: mcelog does not support this processor.  Please use the edac_mce_amd module instead.
    CPU is unsupported
    root@BMM-Unraid:~# 
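
For anyone hitting the same message, here is a minimal sketch of checking whether the module named in the error is actually loaded, and of reading the EDAC memory-controller error counters directly from sysfs. It assumes a standard Linux /proc/modules and /sys/devices/system/edac/mc layout. The function names and the overridable path arguments are made up for illustration, and the counters only exist if an EDAC memory-controller driver has registered on the machine.

```shell
#!/bin/bash
# Sketch only: module_loaded and edac_counts are illustrative helper names,
# not part of mcelog or any other tool.

# Return 0 if the named module appears in a /proc/modules-style file.
module_loaded() {
    local name="$1" modules_file="${2:-/proc/modules}"
    # /proc/modules lists one module per line, with the name in column 1.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

# Print "<controller> ce=<n> ue=<n>" for each EDAC memory controller found.
# ce = corrected errors, ue = uncorrected errors.
edac_counts() {
    local root="${1:-/sys/devices/system/edac/mc}" mc
    for mc in "$root"/mc*; do
        # Skip when the glob matched nothing or the counter is missing.
        [ -r "$mc/ce_count" ] || continue
        echo "$(basename "$mc") ce=$(cat "$mc/ce_count") ue=$(cat "$mc/ue_count")"
    done
}

if module_loaded edac_mce_amd; then
    echo "edac_mce_amd is loaded"
else
    echo "edac_mce_amd is not listed (it may be built into the kernel)"
fi
edac_counts
```

If edac_counts prints nothing, no memory controller has been registered with EDAC, which is worth knowing before trusting any "no errors" result.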

     

    image.thumb.png.4f90c48c5a9d439e7015161cc3d3805c.png


    May need to look into rasdaemon...

     

    https://forums.gentoo.org/viewtopic-p-8805913.html?sid=0f503115472249e88a2a689e2be25864

     

    jeffbee on Jan 3, 2021, on: ECC matters

     

    Newer linux have replaced mcelog with edac-util. I think most shops operating systems at that scale are getting their ECC errors out of band with IPMI SEL, though.


     

    gsvelto on Jan 3, 2021

     

    It's rasdaemon these days: https://www.setphaserstostun.org/posts/monitoring-ecc-memory-on-linux-with-rasdaemon/
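
As a rough sketch of what that looks like in practice, assuming rasdaemon and its ras-mc-ctl companion tool are installed (they may not be available on a stock Unraid install), the recorded errors can be queried as below. show_ras_errors is a hypothetical wrapper name for illustration, and the database only has entries if the rasdaemon daemon has been running and recording.

```shell
#!/bin/bash
# Sketch only: show_ras_errors is a hypothetical helper, not part of rasdaemon.
# Assumes the rasdaemon daemon has been running so its database has entries.
show_ras_errors() {
    if ! command -v ras-mc-ctl >/dev/null 2>&1; then
        echo "ras-mc-ctl not found; rasdaemon does not appear to be installed"
        return 1
    fi
    ras-mc-ctl --status    # reports whether the EDAC drivers are loaded
    ras-mc-ctl --summary   # totals of recorded errors by category
    ras-mc-ctl --errors    # the individual recorded error events
}
```

Calling show_ras_errors on a box without the tools just prints the "not found" message and returns non-zero, so it is safe to try.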

     

    Edited by bmartino1
    5 hours ago, bmartino1 said:

    May need to look into rasdaemon...

    LT is already aware of this; I think they were looking to use rasdaemon for v6.13.


