MCE Error


Should I worry?  The only thing going on in the background is a CrashPlan backup and that's it.

 

[220931.116512] mdcmd (103): spindown 0

[220932.220100] mdcmd (104): spindown 2

[220932.990079] mdcmd (105): spindown 29

[223282.188060] mdcmd (106): spindown 0

[223283.596123] mdcmd (107): spindown 29

[223481.377715] mce: [Hardware Error]: Machine check events logged

[236431.060981] mdcmd (108): spindown 0

[236432.452252] mdcmd (109): spindown 2

[236433.214208] mdcmd (110): spindown 29

[243263.979691] mdcmd (111): spindown 0

[243265.394517] mdcmd (112): spindown 2

[243266.154389] mdcmd (113): spindown 29

[245152.913453] mdcmd (114): spindown 0

[245154.004587] mdcmd (115): spindown 29

[246527.600475] mdcmd (116): spindown 0

[246528.681609] mdcmd (117): spindown 2

[246529.441559] mdcmd (118): spindown 29

[250698.413302] mdcmd (119): spindown 0

[250699.817398] mdcmd (120): spindown 2

[250700.581450] mdcmd (121): spindown 29

root@NAS-UNRAID:~#

root@NAS-UNRAID:~#

root@NAS-UNRAID:~#

root@NAS-UNRAID:~# mcelog

Hardware event. This is not a software error.

MCE 0

CPU 3 BANK 5

ADDR 22f43abd8

TIME 1593140144 Thu Jun 25 21:55:44 2020

MCG status:

MCi status:

Error overflow

Corrected error

Error enabled

MCi_ADDR register valid

MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR

Transaction: Memory read error

STATUS d400008000910091 MCGSTATUS 0

MCGCAP 806 APICID 6 SOCKETID 0

MICROCODE 12d

CPUID Vendor Intel Family 6 Model 77

root@NAS-UNRAID:~#

root@NAS-UNRAID:~#
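For anyone who wants to cross-check what mcelog printed above, the flag lines and the RD_CHANNEL1_ERR label can be read straight out of the STATUS value. Below is a rough Python sketch, assuming the architectural IA32_MCi_STATUS bit layout and the memory-controller error-code format described in the Intel SDM. The function name `decode_status` and the dictionary keys are made up for illustration, and this is not how mcelog itself is implemented.

```python
# Sketch only: decodes the architectural flag bits and the MCA error code
# of an IA32_MCi_STATUS value. Bit positions are assumed from the Intel SDM.
STATUS_FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Memory-controller error codes have the form 000F 0000 1MMM CCCC,
# where MMM is the request type and CCCC the channel number.
MEMORY_REQUESTS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_status(status: int) -> dict:
    """Return the set flags and, if applicable, the memory-controller fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF  # low 16 bits hold the MCA error code
    result = {
        "flags": flags,
        "corrected": not ((status >> 61) & 1),  # UC clear means corrected
        "mca_code": mca_code,
    }
    # Ignore the filter bit (bit 12) when matching the memory-controller pattern.
    if (mca_code & 0xEF80) == 0x0080:
        mmm = (mca_code >> 4) & 0b111
        cccc = mca_code & 0b1111
        result["memory_request"] = MEMORY_REQUESTS.get(mmm, "reserved")
        result["channel"] = None if cccc == 0b1111 else cccc  # 1111 = unspecified
    return result


if __name__ == "__main__":
    # STATUS value copied from the mcelog output in this post.
    for key, value in decode_status(0xD400008000910091).items():
        print(key, "=", value)
```

Fed the STATUS from this post, it should report OVER, EN and ADDRV set, UC clear (so a corrected error), and a memory read error on channel 1, which lines up with what mcelog printed.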

13 minutes ago, Squid said:

Bad ECC memory.  There may be more information in your BIOS event log.

 

I would need to look at the BIOS logs.  I did have 2 of these MCE errors a few weeks back during a heavy transfer between drives. After that I ran an extensive 2-day memtest with 2 different memory-testing applications, and neither of them reported any errors.  Wouldn't a 48-hour memtest catch something?

 

 


I didn't know that ECC errors are not identified during a memtest.  Thanks.

 

I like your suggestion, Johnnie, of disabling ECC in the BIOS, provided my board allows it.  Let me look into that and then run another test.

 

I would hate to have bad memory :(  It's very rare, so should I even worry?

 

Here is my Syslog record of mce errors since I built this box:

 

root@splunk:/var/log/NAS-UNRAID# cat * |grep mce

Jun  2 10:55:15 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged

Jun  2 10:55:15 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged

Jun  3 09:35:07 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged

Jun  4 03:58:18 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged

Jun  4 03:58:18 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged

Jun  4 10:42:27 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged

Jun  4 10:42:27 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged

Jun 25 21:55:44 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged

Jun  2 22:48:27 NAS-UNRAID nerdpack: Downloading mcelog-161-x86_64-1.txz package...

Jun  2 22:48:28 NAS-UNRAID nerdpack: mcelog-161-x86_64-1.txz package download sucessful!

Jun  2 22:48:28 NAS-UNRAID nerdpack: Installing mcelog-161 package...

root@splunk:/var/log/NAS-UNRAID#

 

 

 

14 minutes ago, johnwhicker said:

I would hate to have bad memory :(  It's very rare, so should I even worry?

You bought ECC memory (and the associated CPU / motherboard) for one of two reasons:

 

  1. So that when an error is detected (and subsequently corrected) you would go and replace the bad DIMM before the errors become uncorrectable.
  2. So that when an error is detected (and subsequently corrected) it would buy you some time before the errors become uncorrectable and you then have to buy a replacement DIMM.

 

They may *seem* rare, but an MCE is only issued when the affected memory is actually accessed.  Judging by the number of MCEs issued over those 3 days, this is not a "cosmic-ray" thing; the DIMM is actually bad.
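To put numbers on "the quantity of MCEs over those 3 days", the syslog lines quoted earlier in the thread can be tallied per day. This is a minimal Python sketch; the helper name `mce_events_per_day` is made up for illustration, and the sample input is simply the grep output copied from the earlier post.

```python
from collections import Counter

# Grep output copied from the earlier post (kernel MCE lines plus the
# unrelated nerdpack/mcelog install lines that the grep also matched).
SYSLOG = """\
Jun  2 10:55:15 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun  2 10:55:15 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun  3 09:35:07 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun  4 03:58:18 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun  4 03:58:18 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun  4 10:42:27 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun  4 10:42:27 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 25 21:55:44 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun  2 22:48:27 NAS-UNRAID nerdpack: Downloading mcelog-161-x86_64-1.txz package...
Jun  2 22:48:28 NAS-UNRAID nerdpack: mcelog-161-x86_64-1.txz package download sucessful!
Jun  2 22:48:28 NAS-UNRAID nerdpack: Installing mcelog-161 package...
"""


def mce_events_per_day(lines):
    """Count kernel 'Machine check events logged' lines per calendar day."""
    counts = Counter()
    for line in lines:
        # Skip anything that is not a kernel MCE report (e.g. package installs).
        if "mce: [Hardware Error]" not in line:
            continue
        month, day = line.split()[:2]
        counts[f"{month} {day}"] += 1
    return counts


if __name__ == "__main__":
    for day, count in mce_events_per_day(SYSLOG.splitlines()).items():
        print(day, count)
```

Note that several of the entries share an identical timestamp, so the number of distinct events may be lower than the raw line count.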
