rbroberts Posted June 1, 2021

I'm getting a machine check error that looks like a memory error. So I took the array down and had it run memtest86 over the weekend and there were no problems found. But... it keeps happening. The machine is a Supermicro X9DAi, with dual Intel Xeon CPU E5-2695 v2, 512 GB of DDR3 ECC RAM. One obvious question is whether or not the memory controller is hiding the error from memtest86 so that it does happen but I can't find it.

Jun 1 04:40:07 Tower root: Fix Common Problems: Error: Machine Check Events detected on your server
Jun 1 04:40:07 Tower root: Hardware event. This is not a software error.
Jun 1 04:40:07 Tower root: MCE 0
Jun 1 04:40:07 Tower root: CPU 0 BANK 13 TSC 159a9f0821204
Jun 1 04:40:07 Tower root: MISC 900000400040c8c ADDR 42c306000
Jun 1 04:40:07 Tower root: TIME 1622506605 Mon May 31 20:16:45 2021
Jun 1 04:40:07 Tower root: MCG status:
Jun 1 04:40:07 Tower root: MCi status:
Jun 1 04:40:07 Tower root: Corrected error
Jun 1 04:40:07 Tower root: MCi_MISC register valid
Jun 1 04:40:07 Tower root: MCi_ADDR register valid
Jun 1 04:40:07 Tower root: MCA: MEMORY CONTROLLER MS_CHANNEL0_ERR
Jun 1 04:40:07 Tower root: Transaction: Memory scrubbing error
Jun 1 04:40:07 Tower root: MemCtrl: Corrected patrol scrub error
Jun 1 04:40:07 Tower root: STATUS 8c000046000800c0 MCGSTATUS 0
Jun 1 04:40:07 Tower root: MCGCAP 1000c1d APICID 0 SOCKETID 0
Jun 1 04:40:07 Tower root: MICROCODE 42e
Jun 1 04:40:07 Tower root: CPUID Vendor Intel Family 6 Model 62
Jun 1 04:40:07 Tower root: mcelog: warning: 8 bytes ignored in each record
Jun 1 04:40:07 Tower root: mcelog: consider an update
Jun 1 05:06:58 Tower kernel: mce: [Hardware Error]: Machine check events logged
Jun 1 05:06:58 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jun 1 05:06:58 Tower kernel: EDAC sbridge MC1: CPU 12: Machine Check Event: 0 Bank 10: 8c000042000800c1
Jun 1 05:06:58 Tower kernel: EDAC sbridge MC1: TSC 19f1ad61ad6a1
Jun 1 05:06:58 Tower kernel: EDAC sbridge MC1: ADDR 45dcace000
Jun 1 05:06:58 Tower kernel: EDAC sbridge MC1: MISC 90840200020048c
Jun 1 05:06:58 Tower kernel: EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1622538418 SOCKET 1 APIC 20
Jun 1 05:06:58 Tower kernel: EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 page:0x45dcace offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 channel_mask:2 rank:255)
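For anyone who wants to check what mcelog printed against the raw STATUS values in the log above, here is a minimal Python sketch that decodes the architectural bits of an Intel IA32_MCi_STATUS register. The bit positions are taken from the machine-check chapter of the Intel SDM as I understand it; the function and dictionary names are made up for illustration and are not part of mcelog or the kernel.

```python
# Hypothetical helper: decode architectural fields of IA32_MCi_STATUS.
# Bit layout assumed from the Intel SDM machine-check architecture chapter.

MEM_TRANSACTION = {
    0: "generic",
    1: "read",
    2: "write",
    3: "address/command",
    4: "scrub",
}


def decode_mci_status(status: int) -> dict:
    """Return the main flags and error codes packed into a 64-bit STATUS."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory-controller errors use the pattern 000F 0000 1MMM CCCC,
    # so bit 12 (F) is masked out before comparing.
    if (mca_code & 0xEF80) == 0x0080:
        fields["transaction"] = MEM_TRANSACTION.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        fields["channel"] = None if channel == 0xF else channel  # 0xF = unspecified
    return fields


if __name__ == "__main__":
    for raw in (0x8C000046000800C0, 0x8C000042000800C1):
        print(hex(raw), decode_mci_status(raw))
```

Both values from the log decode as valid, corrected (the uncorrected bit is clear) scrub errors with the MISC and ADDR registers valid, which agrees with the mcelog text. The low nibble of the MCA code gives channel 0 for the first event and channel 1 for the second.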
John_M Posted June 2, 2021

The last line of your syslog snippet seems to be indicating which DIMM is at fault. What does the BIOS event log say? Are you using the very old version of MemTest86 that's included with Unraid? If you go to the MemTest86 website and download the latest free version you can make a stand-alone bootable USB stick (obviously, a different one from your Unraid USB!) which is able to see through error correction. It's a shame the bundled version can't be updated but the licensing doesn't allow it.
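To make the point about the last syslog line concrete, the following Python sketch pulls the socket, home agent, channel and DIMM numbers out of an EDAC summary line like the one posted. It is only an illustration: the regular expressions are fitted to this one sample line, the function name is made up, and the 4 KiB page size used to rebuild the address is an assumption. How the numbers map to a physical slot label depends on the motherboard, so check the manual before pulling a module.

```python
import re

# Pattern fitted to the sample sb_edac line from this thread (an assumption,
# not a general EDAC grammar).
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on "
    r"(?P<label>\S+) \((?P<details>[^)]*)\)"
)
LABEL_RE = re.compile(r"SrcID#(\d+)_Ha#(\d+)_Chan#(\d+)_DIMM#(\d+)")


def parse_edac_line(line, page_shift=12):
    """Return location details from an EDAC summary line, or None."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    # key:value pairs inside the parentheses; the lone "-" is skipped
    details = dict(
        kv.split(":", 1) for kv in m["details"].split() if ":" in kv
    )
    result = {
        "controller": int(m["mc"]),
        "count": int(m["count"]),
        "kind": m["kind"],  # CE = corrected, UE = uncorrected
        "label": m["label"],
    }
    loc = LABEL_RE.search(m["label"])
    if loc:
        socket, ha, channel, dimm = (int(x) for x in loc.groups())
        result.update(socket=socket, ha=ha, channel=channel, dimm=dimm)
    if "page" in details and "offset" in details:
        page = int(details["page"], 16)
        offset = int(details["offset"], 16)
        # Assumes 4 KiB pages (page_shift=12)
        result["address"] = (page << page_shift) | offset
    return result


if __name__ == "__main__":
    sample = (
        "Jun 1 05:06:58 Tower kernel: EDAC MC1: 1 CE memory scrubbing error on "
        "CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 page:0x45dcace offset:0x0 "
        "grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 ha:0 "
        "channel_mask:2 rank:255)"
    )
    print(parse_edac_line(sample))
```

For the posted line this yields socket 1, channel 1, DIMM 0, and the rebuilt address equals the ADDR value 45dcace000 reported a few lines earlier in the same log.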
ChatNoir Posted June 2, 2021

6 hours ago, rbroberts said: "So I took the array down and had it run memtest86 over the weekend and there were no problems found."

As stated by John_M, the memtest version included with Unraid cannot detect errors that ECC corrects.
rbroberts Posted June 2, 2021

18 hours ago, John_M said: "Are you using the very old version of MemTest86 that's included with Unraid?"

Yes, that's what I did. I'll pull down the standalone version and give it a try. I'm not sure where to find the BIOS event log.
John_M Posted June 2, 2021

8 minutes ago, rbroberts said: "I'm not sure where to find the BIOS event log."

You can download a motherboard manual from here. Page 4-26.