TechGeek01 Posted December 20, 2020

In the last couple of months, I moved Unraid over from a Dell R510 to a Supermicro build, and since then I have seen occasional warnings about machine check errors.

Dec 19 16:49:16 helium kernel: mce: [Hardware Error]: Machine check events logged
Dec 19 16:49:16 helium kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Dec 19 16:49:16 helium kernel: EDAC sbridge MC0: CPU 6: Machine Check Event: 0 Bank 10: 8c000046000800c1
Dec 19 16:49:16 helium kernel: EDAC sbridge MC0: TSC 51ce458bc87a8
Dec 19 16:49:16 helium kernel: EDAC sbridge MC0: ADDR c5c6ea000
Dec 19 16:49:16 helium kernel: EDAC sbridge MC0: MISC 900100010000c8c
Dec 19 16:49:16 helium kernel: EDAC sbridge MC0: PROCESSOR 0:306f2 TIME 1608418156 SOCKET 1 APIC 10
Dec 19 16:49:16 helium kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xc5c6ea offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:0)

The current system is a Supermicro X10DRi with dual E5-2620 v3 processors and 64GB of RAM, configured as 2x16GB per socket.

This is obviously a memory issue, but what exactly causes it, and how do I go about fixing it? I don't know a lot about these sorts of logs. Presumably this isn't logging of things like ECC corrections, and it indicates a memory issue where I may have to replace the stick, correct?

helium-diagnostics-20201219-2044.zip

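For anyone reading a log like the one above, here is a minimal Python sketch that decodes a few fields of the bank status value and pulls the DIMM location out of the EDAC line. It assumes the architectural layout of IA32_MCi_STATUS described in the Intel SDM (bit 63 valid, bit 62 overflow, bit 61 uncorrected, bits 52:38 corrected error count, bits 15:0 MCA error code, with memory controller errors encoded as 0000 0000 1MMM CCCC). The function names `decode_mci_status` and `parse_edac_line` are made up for illustration and are not part of any tool.

```python
import re

# Memory transaction types for the MMM field of a memory controller error code.
MEM_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrubbing"}

def decode_mci_status(status):
    """Decode a few architectural fields of an IA32_MCi_STATUS value."""
    code = status & 0xFFFF  # MCA error code, bits 15:0
    info = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "mca_code": code,
    }
    # Memory controller errors match the pattern 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        info["mem_op"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        info["channel"] = code & 0xF
    return info

# Matches the summary line the EDAC driver prints, e.g.
# "EDAC MC0: 1 CE memory scrubbing error on <label> (channel:1 slot:0 ..."
EDAC_RE = re.compile(r"(\d+) (CE|UE) .*? on (\S+) \(channel:(\d+) slot:(\d+)")

def parse_edac_line(line):
    """Return count, error kind (CE/UE), DIMM label, channel and slot, or None."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    return {
        "count": int(m.group(1)),
        "kind": m.group(2),
        "label": m.group(3),
        "channel": int(m.group(4)),
        "slot": int(m.group(5)),
    }
```

Fed the status value from the log (`0x8c000046000800c1`), this sketch reports the uncorrected bit as clear, a corrected-error count of 1, and a scrubbing error on channel 1, which matches the `1 CE memory scrubbing error` in the EDAC line. In other words, this particular event was an error that ECC corrected, reported against socket 1, channel 1, slot 0.
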
JorgeB Posted December 20, 2020

This looks like an ECC memory error; the IPMI/system event log might have more info.

TechGeek01 Posted December 20, 2020

The IPMI/system event log shows nothing unusual, unfortunately. Memtest completely freaked out on the bit fade test: millions of errors in the first several hundred MB. I'm currently in the process of finding what I hope is a bad stick, and not a bad slot or something.

gguglielmi Posted January 18, 2021

I got the same error about 6 months ago on an Asus X99 WS IPMI motherboard with a Xeon E5-2698 v3 and 64GB of RAM. Memtest wasn't showing any errors at all, but after some trial and error (aka several crashes) I found out that 2 sticks of RAM were kind of buggy. After removing them, the system went back to normal, with no more errors over the last few months.

To identify the faulty DIMM, I used a terminal command to pair the bank indicated in the MCE error with the serial number of the physical DIMM. If I remember correctly, the command was "dmidecode -t 6", without the quotation marks.

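The pairing step described above can be sketched in Python as follows. One caveat: the post recalls `-t 6` from memory, but SMBIOS type 6 is the obsolete Memory Module Information structure, and the per-DIMM `Locator`, `Bank Locator` and `Serial Number` fields are normally reported under type 17 (Memory Device), so this sketch assumes the output of `dmidecode -t 17`. The helper names are hypothetical, and how a board's locator strings line up with the channel and slot numbers in the EDAC message varies by vendor, so check the result against the motherboard manual.

```python
def parse_memory_devices(text):
    """Split `dmidecode -t 17` output into one dict of fields per Memory Device."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "Memory Device":
            # Start of a new type 17 structure.
            current = {}
            devices.append(current)
        elif not line:
            # A blank line ends the current structure.
            current = None
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return devices

def serials_by_locator(devices):
    """Map each populated slot's Locator string to its Serial Number."""
    return {
        d["Locator"]: d.get("Serial Number", "")
        for d in devices
        if "Locator" in d and d.get("Size") != "No Module Installed"
    }

# Possible usage (needs root, not run here):
#   import subprocess
#   out = subprocess.run(["dmidecode", "-t", "17"],
#                        capture_output=True, text=True, check=True).stdout
#   print(serials_by_locator(parse_memory_devices(out)))
```

With the locator-to-serial map in hand, the stick to pull is the one whose slot corresponds to the socket, channel and slot named in the EDAC line, and the serial number printed on its label confirms it is the right one.
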