Machine Check Error, but memtest says no errors



I'm getting a machine check error that looks like a memory error. So I took the array down and ran memtest86 over the weekend, and it found no problems. But... it keeps happening.

 

The machine is a Supermicro X9DAi with dual Intel Xeon E5-2695 v2 CPUs and 512 GB of DDR3 ECC RAM. One obvious question is whether the memory controller is correcting the error before memtest86 ever sees it, so that the error does happen but the test can't find it.

 

Jun  1 04:40:07 Tower root: Fix Common Problems: Error: Machine Check Events detected on your server
Jun  1 04:40:07 Tower root: Hardware event. This is not a software error.
Jun  1 04:40:07 Tower root: MCE 0
Jun  1 04:40:07 Tower root: CPU 0 BANK 13 TSC 159a9f0821204 
Jun  1 04:40:07 Tower root: MISC 900000400040c8c ADDR 42c306000 
Jun  1 04:40:07 Tower root: TIME 1622506605 Mon May 31 20:16:45 2021
Jun  1 04:40:07 Tower root: MCG status:
Jun  1 04:40:07 Tower root: MCi status:
Jun  1 04:40:07 Tower root: Corrected error
Jun  1 04:40:07 Tower root: MCi_MISC register valid
Jun  1 04:40:07 Tower root: MCi_ADDR register valid
Jun  1 04:40:07 Tower root: MCA: MEMORY CONTROLLER MS_CHANNEL0_ERR
Jun  1 04:40:07 Tower root: Transaction: Memory scrubbing error
Jun  1 04:40:07 Tower root: MemCtrl: Corrected patrol scrub error
Jun  1 04:40:07 Tower root: STATUS 8c000046000800c0 MCGSTATUS 0
Jun  1 04:40:07 Tower root: MCGCAP 1000c1d APICID 0 SOCKETID 0 
Jun  1 04:40:07 Tower root: MICROCODE 42e
Jun  1 04:40:07 Tower root: CPUID Vendor Intel Family 6 Model 62
Jun  1 04:40:07 Tower root: mcelog: warning: 8 bytes ignored in each record
Jun  1 04:40:07 Tower root: mcelog: consider an update
Jun  1 05:06:58 Tower kernel: mce: [Hardware Error]: Machine check events logged
Jun  1 05:06:58 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jun  1 05:06:58 Tower kernel: EDAC sbridge MC1: CPU 12: Machine Check Event: 0 Bank 10: 8c000042000800c1
Jun  1 05:06:58 Tower kernel: EDAC sbridge MC1: TSC 19f1ad61ad6a1 
Jun  1 05:06:58 Tower kernel: EDAC sbridge MC1: ADDR 45dcace000 
Jun  1 05:06:58 Tower kernel: EDAC sbridge MC1: MISC 90840200020048c 
Jun  1 05:06:58 Tower kernel: EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1622538418 SOCKET 1 APIC 20
Jun  1 05:06:58 Tower kernel: EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 page:0x45dcace offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:255)
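For anyone who wants to check the mcelog interpretation by hand, the STATUS values above can be decoded directly. The following is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM; the function name `decode_mci_status` and the dictionary keys are made up for illustration, and the corrected-error count field is only meaningful on CPUs that report support for it.

```python
# Sketch: decode an IA32_MCi_STATUS value as printed by mcelog / EDAC.
# Assumes the architectural bit layout from the Intel SDM; the helper name
# and the returned keys are hypothetical, chosen for this example only.

# MMM sub-field of a memory controller error code (0000 0000 1MMM CCCC)
MMM_NAMES = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_mci_status(status: int) -> dict:
    mca_code = status & 0xFFFF
    decoded = {
        "valid": bool((status >> 63) & 1),        # VAL
        "overflow": bool((status >> 62) & 1),     # OVER
        "uncorrected": bool((status >> 61) & 1),  # UC (0 means corrected)
        "misc_valid": bool((status >> 59) & 1),   # MISCV
        "addr_valid": bool((status >> 58) & 1),   # ADDRV
        "corrected_count": (status >> 38) & 0x7FFF,      # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": mca_code,                            # bits 15:0
    }
    # Memory controller errors have the form 0000 0000 1MMM CCCC.
    if (mca_code & 0xFF80) == 0x0080:
        mmm = (mca_code >> 4) & 0x7
        cccc = mca_code & 0xF
        decoded["transaction"] = MMM_NAMES.get(mmm, "reserved")
        # CCCC == 1111 means the channel is not specified.
        decoded["channel"] = None if cccc == 0xF else cccc
    return decoded


if __name__ == "__main__":
    for raw in (0x8C000046000800C0, 0x8C000042000800C1):
        print(hex(raw), decode_mci_status(raw))
```

Both values decode as valid, corrected (UC clear) memory scrubbing errors with the MISC and ADDR registers valid, the first on channel 0 and the second on channel 1, which agrees with the mcelog and EDAC text in the log.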

 


The last line of your syslog snippet seems to be indicating which DIMM is at fault. What does the BIOS event log say?
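If the errors keep recurring, it can help to tally them per DIMM label to see whether they always land on the same module. Here is a rough sketch that parses EDAC "CE" lines like the one in the snippet; the regular expression is based only on that single line, so other kernel versions may format it differently, and the function names `parse_edac_line` and `tally_by_dimm` are invented for this example. The address calculation assumes 4 KiB pages.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... kernel: EDAC MC1: 1 CE memory scrubbing error on <label> (key:value ...)
# The pattern is derived from one sample line and is an assumption, not a spec.
EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>\S+) \((?P<fields>.*)\)"
)


def parse_edac_line(line: str):
    """Return a dict describing one corrected-error line, or None if no match."""
    m = EDAC_CE.search(line)
    if m is None:
        return None
    # The parenthesised part is a run of key:value pairs separated by spaces.
    fields = dict(re.findall(r"(\w+):(\S+)", m.group("fields")))
    result = {
        "mc": int(m.group("mc")),
        "count": int(m.group("count")),
        "label": m.group("label"),
        "fields": fields,
    }
    if "page" in fields and "offset" in fields:
        # Assumes 4 KiB pages: physical address = page * 4096 + offset.
        result["address"] = (int(fields["page"], 16) << 12) + int(fields["offset"], 16)
    return result


def tally_by_dimm(lines) -> Counter:
    """Sum corrected-error counts per DIMM label over an iterable of log lines."""
    totals = Counter()
    for line in lines:
        parsed = parse_edac_line(line)
        if parsed is not None:
            totals[parsed["label"]] += parsed["count"]
    return totals


if __name__ == "__main__":
    import sys
    # Usage (hypothetical): python3 edac_tally.py < /var/log/syslog
    for label, total in tally_by_dimm(sys.stdin).most_common():
        print(f"{total:6d}  {label}")
```

For the line in the snippet this gives the label CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 and an address of 0x45dcace000, which matches the ADDR value logged a few lines earlier.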

 

Are you using the very old version of MemTest86 that's included with Unraid? If you go to the MemTest86 website and download the latest free version, you can make a stand-alone bootable USB stick (obviously, a different one from your Unraid USB!) that is able to see through error correction. It's a shame the bundled version can't be updated, but the licensing doesn't allow it.

