DivideBy0 Posted June 26, 2020

Should I worry? The only thing going on in the background is a CrashPlan backup and that's it.

[220931.116512] mdcmd (103): spindown 0
[220932.220100] mdcmd (104): spindown 2
[220932.990079] mdcmd (105): spindown 29
[223282.188060] mdcmd (106): spindown 0
[223283.596123] mdcmd (107): spindown 29
[223481.377715] mce: [Hardware Error]: Machine check events logged
[236431.060981] mdcmd (108): spindown 0
[236432.452252] mdcmd (109): spindown 2
[236433.214208] mdcmd (110): spindown 29
[243263.979691] mdcmd (111): spindown 0
[243265.394517] mdcmd (112): spindown 2
[243266.154389] mdcmd (113): spindown 29
[245152.913453] mdcmd (114): spindown 0
[245154.004587] mdcmd (115): spindown 29
[246527.600475] mdcmd (116): spindown 0
[246528.681609] mdcmd (117): spindown 2
[246529.441559] mdcmd (118): spindown 29
[250698.413302] mdcmd (119): spindown 0
[250699.817398] mdcmd (120): spindown 2
[250700.581450] mdcmd (121): spindown 29

root@NAS-UNRAID:~# mcelog
Hardware event. This is not a software error.
MCE 0
CPU 3 BANK 5
ADDR 22f43abd8
TIME 1593140144 Thu Jun 25 21:55:44 2020
MCG status:
MCi status:
Error overflow
Corrected error
Error enabled
MCi_ADDR register valid
MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR
Transaction: Memory read error
STATUS d400008000910091 MCGSTATUS 0
MCGCAP 806 APICID 6 SOCKETID 0
MICROCODE 12d
CPUID Vendor Intel Family 6 Model 77
root@NAS-UNRAID:~#
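For readers wondering where the "Corrected error" and "RD_CHANNEL1_ERR" lines come from: they can be reproduced from the raw STATUS value in the mcelog output. The sketch below is only an illustration, not how mcelog itself is implemented. It assumes the architectural IA32_MCi_STATUS bit layout described in the Intel Software Developer's Manual, and the names `STATUS_FLAGS`, `MEMORY_OPS`, and `decode_status` are made up for this example.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM layout).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # error overflow
    61: "UC",     # uncorrected error (clear means the error was corrected)
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

# MMM field of a memory controller error code (pattern 000F 0000 1MMM CCCC).
MEMORY_OPS = {
    0b000: "generic undefined request",
    0b001: "memory read error",
    0b010: "memory write error",
    0b011: "address/command error",
    0b100: "memory scrubbing error",
}


def decode_status(status):
    """Decode the flag bits and the low 16-bit MCA error code of a STATUS value."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    mca_code = status & 0xFFFF
    decoded = {"flags": flags, "corrected": "UC" not in flags, "mca_code": mca_code}
    # Ignore bit 12 (the F filter bit) when matching the memory controller pattern.
    if (mca_code & 0xEF80) == 0x0080:
        decoded["memory_op"] = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
        decoded["channel"] = mca_code & 0xF  # 0xF would mean "channel not specified"
    return decoded


# STATUS value copied from the mcelog output in the post above.
print(decode_status(0xD400008000910091))
```

For the value in the post, the UC bit is clear while OVER, EN, and ADDRV are set, and the low 16 bits decode to a memory read error on channel 1, which lines up with what mcelog printed.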
Squid Posted June 26, 2020

9 minutes ago, johnwhicker said:
Transaction: Memory read error

Bad ECC memory. There may be more information in your BIOS' event log.
DivideBy0 Posted June 26, 2020 Author

13 minutes ago, Squid said:
Bad ECC memory. May be more information in your BIOS' event log.

I would need to look at the BIOS logs. I did have two of these MCE errors a few weeks back during a heavy transfer between drives. After that I ran about two days of extensive memory testing with two different memory test applications, and neither reported any errors. I would think a 48-hour test would catch something?
Squid Posted June 26, 2020

Memtest (depending upon the version) will not catch corrected memory errors, because they are being corrected at the hardware level.
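Since corrected errors never reach a test program, one place they may still be visible from a running Linux system is the kernel's EDAC counters. The sketch below is a rough illustration under some assumptions: it assumes an EDAC driver for the memory controller is loaded and exposes the standard `ce_count` and `ue_count` files under `/sys/devices/system/edac/mc`, which nobody in this thread has confirmed for this particular box. `read_edac_counts` is a hypothetical helper name.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return corrected (ce) and uncorrected (ue) error counts per memory controller.

    Returns an empty dict when the EDAC directory is missing, for example
    when no EDAC driver is loaded.
    """
    base = Path(root)
    if not base.is_dir():
        return {}
    counts = {}
    for controller in sorted(base.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = controller / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[controller.name] = entry
    return counts
```

Calling `read_edac_counts()` and printing the result gives something like one entry per `mcN` directory. A `ce_count` that keeps climbing between checks would point the same way as the MCE messages, while an empty result only means the counters are not exposed, not that the memory is healthy.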
JorgeB Posted June 26, 2020

Also, regular Memtest can still be used if there's an option in the BIOS to disable ECC.
DivideBy0 Posted June 26, 2020 Author

I didn't know that ECC errors are not identified during a memtest. Thanks.

I like your suggestion, Johnnie, of disabling ECC in the BIOS, provided my board allows it. Let me look into that and run another test.

I would hate to have bad memory. It is very rare, so should I even worry? Here is my syslog record of MCE errors since I built this box:

root@splunk:/var/log/NAS-UNRAID# cat * |grep mce
Jun 2 10:55:15 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 2 10:55:15 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 3 09:35:07 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 4 03:58:18 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 4 03:58:18 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 4 10:42:27 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 4 10:42:27 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 25 21:55:44 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 2 22:48:27 NAS-UNRAID nerdpack: Downloading mcelog-161-x86_64-1.txz package...
Jun 2 22:48:28 NAS-UNRAID nerdpack: mcelog-161-x86_64-1.txz package download sucessful!
Jun 2 22:48:28 NAS-UNRAID nerdpack: Installing mcelog-161 package...
root@splunk:/var/log/NAS-UNRAID#
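To judge how often these events really occur, it can help to tally the kernel lines per day instead of reading the raw grep output. This is a small sketch that uses the lines from the post above as sample input. The names `SYSLOG`, `MCE_LINE`, and `mce_events_per_day` are invented for the example, and the pattern assumes the timestamp layout shown in the post. Matching on `kernel: mce:` also leaves out the nerdpack lines, which only matched the grep because "mcelog" contains "mce".

```python
import re
from collections import Counter

# Sample input: the grep output copied from the post above.
SYSLOG = """\
Jun 2 10:55:15 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 2 10:55:15 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 3 09:35:07 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 4 03:58:18 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 4 03:58:18 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 4 10:42:27 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 4 10:42:27 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 25 21:55:44 NAS-UNRAID kernel: mce: [Hardware Error]: Machine check events logged
Jun 2 22:48:27 NAS-UNRAID nerdpack: Downloading mcelog-161-x86_64-1.txz package...
Jun 2 22:48:28 NAS-UNRAID nerdpack: mcelog-161-x86_64-1.txz package download sucessful!
Jun 2 22:48:28 NAS-UNRAID nerdpack: Installing mcelog-161 package...
"""

# Month, day, time, hostname, then the kernel's machine check message.
MCE_LINE = re.compile(
    r"^(\w{3})\s+(\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+kernel: mce: \[Hardware Error\]"
)


def mce_events_per_day(text):
    """Count kernel MCE log lines per calendar day."""
    counts = Counter()
    for line in text.splitlines():
        match = MCE_LINE.match(line.strip())
        if match:
            counts[f"{match.group(1)} {match.group(2)}"] += 1
    return counts


for day, count in mce_events_per_day(SYSLOG).items():
    print(day, count)
```

Each log line is counted individually, so two lines with the same timestamp count as two. Whether those pairs are separate events or the same message picked up from more than one file by `cat *` is not something the log alone settles.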
Squid Posted June 26, 2020

14 minutes ago, johnwhicker said:
I would hate to have bad memory. It is very rare, so should I even worry?

You bought ECC memory (and the associated CPU / motherboard) for one of two reasons:

1. So that when an error is detected (and subsequently corrected) you would go and replace the bad DIMM before the errors become uncorrectable.
2. So that when an error is detected (and subsequently corrected) it would buy you some time before the errors become uncorrectable and you then have to buy a replacement DIMM.

They may *seem* rare, but an MCE is only issued when the affected memory is actually accessed. Judging by the quantity of MCEs issued over those 3 days, this is not a "cosmic-ray" thing; the DIMM is actually bad.
DivideBy0 Posted June 27, 2020 Author

This is an interesting article on testing ECC RAM. I am in the middle of a drive array swap / rebuild, and as soon as it is done I will look for the BIOS option to disable ECC and test the RAM again. I will get back with the results for sure.

https://www.pugetsystems.com/labs/articles/How-to-Check-ECC-RAM-Functionality-462/

Edited June 27, 2020 by johnwhicker