June 3, 2020
Can anyone please help me decipher this MCE error? This is a brand new build, so it is already acting funny.

[241831.858564] mce: [Hardware Error]: Machine check events logged
[241831.858570] mce: [Hardware Error]: Machine check events logged

root@NAS-UNRAID:~# mcelog
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 5
ADDR 22f43abc0
TIME 1591113315 Tue Jun 2 10:55:15 2020
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR
Transaction: Memory read error
STATUS 9400004000910091 MCGSTATUS 0
MCGCAP 806 APICID 0 SOCKETID 0
MICROCODE 12d
CPUID Vendor Intel Family 6 Model 77
Hardware event. This is not a software error.
MCE 1
CPU 1 BANK 5
ADDR 22f43abc0
TIME 1591113315 Tue Jun 2 10:55:15 2020
MCG status:
MCi status:
Corrected error
Error enabled
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR
Transaction: Memory read error
STATUS 9400004000910091 MCGSTATUS 0
MCGCAP 806 APICID 2 SOCKETID 0
MICROCODE 12d
CPUID Vendor Intel Family 6 Model 77
root@NAS-UNRAID:~#

nas-unraid-diagnostics-20200602-2313.zip
Edited June 14, 2020 by johnwhicker
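For anyone who wants to check mcelog's reading of the STATUS value by hand, here is a minimal Python sketch. It assumes the architectural IA32_MCi_STATUS bit layout from the Intel SDM (flag bits 63 down to 57, model-specific code in bits 31:16, MCA error code in bits 15:0) and the memory controller error code pattern 0000 0000 1MMM CCCC. The function name `decode_mci_status` and the label strings are made up for illustration and are not part of mcelog.

```python
# Flag bits of IA32_MCi_STATUS (assumed architectural layout, Intel SDM).
STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

# MMM field of a memory controller error code (illustrative labels).
MEM_REQUEST = {0: "generic", 1: "read", 2: "write",
               3: "address/command", 4: "scrub"}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its main fields."""
    flags = [name for bit, name in STATUS_FLAGS.items()
             if (status >> bit) & 1]
    mcacod = status & 0xFFFF
    result = {
        "flags": flags,
        "corrected": "UC" not in flags,
        "mcacod": mcacod,
        "mscod": (status >> 16) & 0xFFFF,
    }
    # Memory controller errors match 0000 0000 1MMM CCCC.
    if (mcacod & 0xFF80) == 0x0080:
        result["memory_request"] = MEM_REQUEST.get((mcacod >> 4) & 0x7,
                                                   "reserved")
        channel = mcacod & 0xF
        # 0xF means the channel was not specified.
        result["channel"] = None if channel == 0xF else channel
    return result


if __name__ == "__main__":
    # STATUS value copied from the mcelog output in the post above.
    print(decode_mci_status(0x9400004000910091))
```

For the value in the post, this gives the flags VAL, EN and ADDRV with UC clear, and a memory read request on channel 1, which lines up with the "Corrected error", "Error enabled", "MCi_ADDR register valid" and RD_CHANNEL1_ERR lines that mcelog printed.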
June 3, 2020
That looks like a corrected ECC RAM error. There might be more information in the board's SEL (system event log).
June 3, 2020 Author
Nope, that's pretty much just those 2 lines in the syslog, so I ran mcelog to get more info. They were buried between some USB/UPS issues I am having.

[240021.315170] usb 1-1.1: USB disconnect, device number 10
[240021.487384] usb 1-1.1: new full-speed USB device number 11 using ehci-pci
[240021.572741] hid-generic 0003:0764:0501.0009: hiddev96,hidraw0: USB HID v1.10 Device [CPS CST135XLU] on usb-0000:00:16.0-1.1/input0
[241831.858564] mce: [Hardware Error]: Machine check events logged
[241831.858570] mce: [Hardware Error]: Machine check events logged
[243974.212849] usb 1-1.1: USB disconnect, device number 11
[243974.386787] usb 1-1.1: new full-speed USB device number 12 using ehci-pci
[243974.471684] hid-generic 0003:0764:0501.000A: hiddev96,hidraw0: USB HID v1.10 Device [CPS CST135XLU] on usb-0000:00:16.0-1.1/input0
June 7, 2020 Author
On 6/3/2020 at 8:57 AM, johnnie.black said:
I said the board's system event log, usually accessible in the BIOS or over IPMI.

Thanks much, Sir. I didn't see anything in the BIOS or IPMI logs. I even sent the IPMI logs to syslog and there is nothing on this MCE error. That being said, I did run an extensive memtest86 Pro test for 24 hours straight with no memory errors, so I guess it was just ECC doing its job during this heavy data set copy? Perhaps ECC corrected some corrupted data?
June 8, 2020
Memtest won't show errors with ECC RAM. If you can't find more info on the affected DIMM, just remove one at a time and test for a few days, or disable ECC in the BIOS (if that's an option) and run memtest again.
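One more place that may carry per-DIMM information is the kernel's EDAC sysfs tree. The Python sketch below reads the corrected and uncorrected error counters from it. It assumes an EDAC driver for the board's memory controller is actually loaded, which may not be the case on this system, and that the counters sit under /sys/devices/system/edac/mc. The function name `read_edac_counts` is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Collect EDAC error counters per memory controller and per DIMM.

    Returns an empty dict when no EDAC driver exposes counters under root.
    """
    counts = {}
    for ce_path in sorted(Path(root).glob("mc*/ce_count")):
        mc_dir = ce_path.parent
        ue_path = mc_dir / "ue_count"
        entry = {
            "ce": int(ce_path.read_text().strip()),
            "ue": int(ue_path.read_text().strip()) if ue_path.exists() else None,
            "dimms": {},
        }
        # Per-DIMM counters, present only if the driver provides them.
        for dimm_ce in sorted(mc_dir.glob("dimm*/dimm_ce_count")):
            label_path = dimm_ce.parent / "dimm_label"
            label = (label_path.read_text().strip()
                     if label_path.exists() else dimm_ce.parent.name)
            entry["dimms"][label] = int(dimm_ce.read_text().strip())
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    print(read_edac_counts())
```

A DIMM whose corrected count keeps climbing between runs would be the first one to pull when testing sticks one at a time.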
June 12, 2020 Author
On 6/8/2020 at 3:24 AM, johnnie.black said:
Memtest won't show errors with ECC RAM. If you can't find more info on the affected DIMM, just remove one at a time and test for a few days, or disable ECC in the BIOS (if that's an option) and run memtest again.

Thanks, partner. I ran the memtest Pro version for 2 days and found nothing. I think it is OK, as I haven't seen that error anymore. It only showed up during a heavy copy and md5sum check from drive to drive, about 8 TB of data.
This topic is now archived and is closed to further replies.