MCE Error

June 26, 2020

Should I worry? The only thing going on in the background is a CrashPlan backup and that's it.

[220931.116512] mdcmd (103): spindown 0

[220932.220100] mdcmd (104): spindown 2

[220932.990079] mdcmd (105): spindown 29

[223282.188060] mdcmd (106): spindown 0

[223283.596123] mdcmd (107): spindown 29

[223481.377715] mce: [Hardware Error]: Machine check events logged

[236431.060981] mdcmd (108): spindown 0

[236432.452252] mdcmd (109): spindown 2

[236433.214208] mdcmd (110): spindown 29

[243263.979691] mdcmd (111): spindown 0

[243265.394517] mdcmd (112): spindown 2

[243266.154389] mdcmd (113): spindown 29

[245152.913453] mdcmd (114): spindown 0

[245154.004587] mdcmd (115): spindown 29

[246527.600475] mdcmd (116): spindown 0

[246528.681609] mdcmd (117): spindown 2

[246529.441559] mdcmd (118): spindown 29

[250698.413302] mdcmd (119): spindown 0

[250699.817398] mdcmd (120): spindown 2

[250700.581450] mdcmd (121): spindown 29

root@NAS-UNRAID:~#

root@NAS-UNRAID:~# mcelog

Hardware event. This is not a software error.

MCE 0

CPU 3 BANK 5

ADDR 22f43abd8

TIME 1593140144 Thu Jun 25 21:55:44 2020

MCG status:

MCi status:

Error overflow

Corrected error

Error enabled

MCi_ADDR register valid

MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR

Transaction: Memory read error

STATUS d400008000910091 MCGSTATUS 0

MCGCAP 806 APICID 6 SOCKETID 0

MICROCODE 12d

CPUID Vendor Intel Family 6 Model 77

root@NAS-UNRAID:~#
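For anyone who wants to check the mcelog decode above by hand, here is a minimal Python sketch that unpacks the STATUS value from that output. The flag positions and the memory-controller error-code layout follow the architectural IA32_MCi_STATUS description in the Intel SDM. The names `FLAG_BITS`, `MEM_TRANSACTIONS` and `decode_mci_status` are made up for illustration, and this is not how mcelog itself is implemented.

```python
# Bit positions of the architectural flags in IA32_MCi_STATUS (Intel SDM, Vol. 3B).
FLAG_BITS = {
    "valid": 63,
    "overflow": 62,
    "uncorrected": 61,
    "enabled": 60,
    "misc_valid": 59,
    "addr_valid": 58,
    "context_corrupt": 57,
}

# Transaction types (the MMM field) of a memory-controller error code.
MEM_TRANSACTIONS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_mci_status(status: int) -> dict:
    """Decode the flag bits and the MCA error code of an MCi_STATUS value."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    mca_code = status & 0xFFFF  # low 16 bits hold the MCA error code
    decoded["mca_code"] = mca_code
    # Memory-controller errors are encoded as 0000 0000 1MMM CCCC.
    if mca_code & 0xFF80 == 0x0080:
        mmm = (mca_code >> 4) & 0b111
        decoded["transaction"] = MEM_TRANSACTIONS.get(mmm, "reserved")
        channel = mca_code & 0xF
        # 1111 means the channel is not specified.
        decoded["channel"] = None if channel == 0xF else channel
    return decoded


if __name__ == "__main__":
    # STATUS value copied from the mcelog output in this post.
    for key, value in decode_mci_status(0xD400008000910091).items():
        print(f"{key}: {value}")
```

For the value in this post, the overflow, enabled and addr_valid flags come out set, uncorrected comes out clear, and the error code decodes to a memory read error on channel 1, which agrees with what mcelog printed.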

June 26, 2020

9 minutes ago, johnwhicker said:

Transaction: Memory read error

Bad ECC memory. There may be more information in your BIOS event log.

June 26, 2020

Author

13 minutes ago, Squid said:

Bad ECC memory. There may be more information in your BIOS event log.

I would need to look at the BIOS logs. I did have two of these MCE errors a few weeks back during a heavy transfer between drives. After that I ran an extensive two-day memtest with two different memory-testing applications, and neither reported any errors. Wouldn't a 48-hour memory test catch something?

June 26, 2020

Memtest (depending upon the version) will not catch corrected memory errors, because they are being corrected at the hardware level.
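Since corrected errors are handled in hardware, the place to look for them is whatever the running OS reports rather than a memory tester. As a rough sketch, assuming a Linux kernel with an EDAC driver loaded for the memory controller (nothing in this thread confirms that is the case on this board), the per-controller corrected and uncorrected counters can be read from sysfs. The function name `read_edac_counts` is hypothetical.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from EDAC sysfs counters."""
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        # No EDAC driver loaded (or not a Linux system): nothing to report.
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found")
    for name, (ce, ue) in found.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

If no EDAC driver is available, the function simply returns an empty result, and mcelog remains the source of information.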

June 26, 2020

Community Expert

Also, regular Memtest can still be used if there's an option in the BIOS to disable ECC.

June 26, 2020

Author

I didn't know that ECC errors are not identified during a memtest. Thanks.

I like your suggestion, Johnnie, of disabling ECC in the BIOS, provided my board allows it. Let me look into that and then run another test.

I would hate to have bad memory. These errors are very rare, so should I even worry?

Here is my Syslog record of mce errors since I built this box:

root@splunk:/var/log/NAS-UNRAID# cat * |grep mce

Jun 2 10:55:15 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged

Jun 3 09:35:07 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged

Jun 4 03:58:18 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged

Jun 4 10:42:27 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged

Jun 25 21:55:44 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged

Jun 2 22:48:27 NAS-UNRAID nerdpack: Downloading mcelog-161-x86_64-1.txz package...

Jun 2 22:48:28 NAS-UNRAID nerdpack: mcelog-161-x86_64-1.txz package download sucessful!

Jun 2 22:48:28 NAS-UNRAID nerdpack: Installing mcelog-161 package...

root@splunk:/var/log/NAS-UNRAID#
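Note that `grep mce` also matches the nerdpack lines, because "mcelog" contains "mce". A minimal Python sketch that keeps only the kernel machine-check lines and counts them per day could look like the following. The regular expression is written against the line format shown above, and the names `MCE_LINE`, `count_mce_per_day` and `SAMPLE` are illustrative only.

```python
import re
from collections import Counter

# Matches only the kernel "Machine check events logged" lines, not mcelog installs.
MCE_LINE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+"
    r"kernel: mce: \[Hardware Error\]: Machine check events logged"
)


def count_mce_per_day(lines):
    """Count kernel MCE notifications per calendar day (e.g. 'Jun 4')."""
    counts = Counter()
    for line in lines:
        match = MCE_LINE.match(line)
        if match:
            counts[f"{match['month']} {int(match['day'])}"] += 1
    return counts


# Lines copied from the grep output in this post.
SAMPLE = """
Jun 2 10:55:15 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 3 09:35:07 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 4 03:58:18 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 4 10:42:27 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 25 21:55:44 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 2 22:48:27 NAS-UNRAID nerdpack: Downloading mcelog-161-x86_64-1.txz package...
"""

if __name__ == "__main__":
    for day, count in count_mce_per_day(SAMPLE.strip().splitlines()).items():
        print(f"{day}: {count}")
```

On the lines above this gives one event each on Jun 2, Jun 3 and Jun 25 and two on Jun 4, five in total, with the nerdpack line ignored.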

June 26, 2020

14 minutes ago, johnwhicker said:

I would hate to have bad memory. These errors are very rare, so should I even worry?

You bought ECC memory (and the associated CPU / motherboard) for one of two reasons:

1. So that when an error is detected (and subsequently corrected) you would go and replace the bad DIMM before the errors become uncorrectable.
2. So that when an error is detected (and subsequently corrected) it would buy you some time before the errors become uncorrectable and you then have to buy a replacement DIMM.

They may *seem* rare, but an MCE is only issued when the affected memory is actually accessed. Judging by the number of MCEs issued over those 3 days, this is not a "cosmic-ray" event; the DIMM is actually bad.

June 27, 2020

Author

This is an interesting article on testing ECC RAM. I am in the middle of a drive array swap / rebuild, and as soon as it is done I will look for the BIOS option to disable ECC and test the RAM again. I will report back with the results for sure.

https://www.pugetsystems.com/labs/articles/How-to-Check-ECC-RAM-Functionality-462/

Edited June 27, 2020 by johnwhicker
