[6.8.3] Machine check events logged

December 20, 2020

In the last couple of months, I moved Unraid over from a Dell R510 to a Supermicro build, and since then I have seen occasional warnings about machine check errors.

Dec 19 16:49:16 helium kernel: mce: [Hardware Error]: Machine check events logged
Dec 19 16:49:16 helium kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Dec 19 16:49:16 helium kernel: EDAC sbridge MC0: CPU 6: Machine Check Event: 0 Bank 10: 8c000046000800c1
Dec 19 16:49:16 helium kernel: EDAC sbridge MC0: TSC 51ce458bc87a8
Dec 19 16:49:16 helium kernel: EDAC sbridge MC0: ADDR c5c6ea000
Dec 19 16:49:16 helium kernel: EDAC sbridge MC0: MISC 900100010000c8c
Dec 19 16:49:16 helium kernel: EDAC sbridge MC0: PROCESSOR 0:306f2 TIME 1608418156 SOCKET 1 APIC 10
Dec 19 16:49:16 helium kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xc5c6ea offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:0)

Current system is a Supermicro X10DRi with dual E5-2620 v3 processors, and 64GB of RAM, configured as 2x16GB per socket.

This is obviously a memory issue, but what exactly causes it, and how do I go about fixing it? I don't know a lot about these sorts of logs. Presumably this isn't just logging of things like ECC corrections, and it indicates a memory fault where I may have to replace the stick, correct?

helium-diagnostics-20201219-2044.zip

December 20, 2020

Community Expert

This looks like an ECC memory error; the IPMI/system event log might have more info.
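For anyone who wants to check that against the raw numbers, the status word on the "Bank 10" line can be decoded by hand. Below is a minimal sketch, assuming the value follows the standard Intel IA32_MCi_STATUS layout (bit 63 valid, bit 61 uncorrected, bits 52:38 corrected-error count, bits 15:0 MCA error code) and that this CPU reports the corrected-error count field. The function name `decode_mci_status` is made up for illustration.

```python
# Decode the machine-check status word from the "Bank 10" log line.
# Bit positions are assumed to follow the Intel IA32_MCi_STATUS layout.
STATUS = 0x8C000046000800C1


def bit(value: int, n: int) -> bool:
    """Return True if bit n of value is set."""
    return bool((value >> n) & 1)


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its main fields."""
    return {
        "valid": bit(status, 63),            # register holds a real event
        "overflow": bit(status, 62),         # an earlier event was overwritten
        "uncorrected": bit(status, 61),      # True would mean ECC could NOT fix it
        "misc_valid": bit(status, 59),       # the MISC line is meaningful
        "addr_valid": bit(status, 58),       # the ADDR line is meaningful
        "context_corrupt": bit(status, 57),  # processor state damaged
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


fields = decode_mci_status(STATUS)
for name, value in fields.items():
    print(f"{name}: {value}")
```

With the value from the log, `uncorrected` comes out false and the corrected-error count is 1, which agrees with the "1 CE" (corrected error) in the EDAC line. The two code fields, 0x0008 and 0x00c1, match the `err_code:0008:00c1` that EDAC printed.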

December 20, 2020

Author

The IPMI/system event log shows nothing unusual, unfortunately. Memtest completely freaked out on the bit fade test: millions of errors in the first several hundred MB. I'm currently in the process of finding what I hope is a bad stick, and not a bad slot or something.
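While swapping sticks around, the kernel's own per-DIMM corrected-error counters may help confirm which slot keeps accumulating errors. The sketch below assumes the EDAC driver exposes `dimm_ce_count` and `dimm_label` files under `/sys/devices/system/edac/mc/mc*/dimm*/`; whether they are present depends on the kernel and driver. The function name `read_dimm_ce_counts` is made up for illustration.

```python
from pathlib import Path


def read_dimm_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return a list of (controller, dimm, label, corrected_error_count).

    Assumes the EDAC sysfs layout mcN/dimmM/{dimm_ce_count,dimm_label}.
    Directories without a dimm_ce_count file are skipped.
    """
    results = []
    for dimm_dir in sorted(Path(edac_root).glob("mc*/dimm*")):
        count_file = dimm_dir / "dimm_ce_count"
        if not count_file.is_file():
            continue
        label_file = dimm_dir / "dimm_label"
        label = label_file.read_text().strip() if label_file.is_file() else ""
        count = int(count_file.read_text())
        results.append((dimm_dir.parent.name, dimm_dir.name, label, count))
    return results
```

Calling `read_dimm_ce_counts()` before and after a test run and comparing the counts should show which DIMM entry is growing. The counters reset on reboot, so they only cover the current uptime.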

4 weeks later...

January 18, 2021

I got the same error about 6 months ago on an Asus X99 WS IPMI motherboard with a Xeon E5-2698 v3 and 64GB of RAM.

Memtest wasn't showing any errors at all, but after some trial and error (aka several crashes) I found out that 2 sticks of RAM were faulty.

After removing them, the system went back to normal. No more errors in the months since.

To identify the faulty DIMM I used a terminal command to match the location reported in the MCE error with the serial number of the physical DIMM.

If I remember correctly the command was "dmidecode -t 17", without the quotation marks; type 17 is the SMBIOS Memory Device table, which lists each DIMM's locator and serial number.
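The pairing step can be scripted. This is a rough sketch, assuming `dmidecode -t 17` (the SMBIOS Memory Device table, which normally carries the Locator, Bank Locator and Serial Number fields) and assuming the usual output format of one "Memory Device" block per DIMM, separated by blank lines. The function names are made up for illustration, and `dmidecode` needs root.

```python
import subprocess


def parse_memory_devices(dmi_text):
    """Parse `dmidecode -t 17` text into one dict per Memory Device block."""
    devices = []
    current = None
    for line in dmi_text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            # Start of a new DIMM record.
            current = {}
            devices.append(current)
        elif not stripped:
            # A blank line ends the current record.
            current = None
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return devices


def read_memory_devices():
    """Run dmidecode (requires root) and return the parsed DIMM records."""
    out = subprocess.run(
        ["dmidecode", "-t", "17"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_devices(out)


def print_dimm_table(devices):
    """Print slot, bank locator and serial number for each DIMM record."""
    for dev in devices:
        print(dev.get("Locator", "?"),
              dev.get("Bank Locator", "?"),
              dev.get("Serial Number", "?"),
              sep="\t")
```

Running `print_dimm_table(read_memory_devices())` as root gives one line per slot. How the board's Locator strings line up with the socket/channel/slot in the EDAC message varies by vendor, so the motherboard manual is still needed for that last step.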
