December 10, 2019 Hi guys, I am a new user on the Unraid platform, and for the past couple of weeks I have been receiving hardware errors. Can anyone help me troubleshoot? I have installed 256 GB of DDR3 ECC RAM but I see errors in my log, and I have no idea how to solve this issue. Thanks, Tom
December 10, 2019 Those are probably due to a bad DIMM. You might want to remove one stick at a time and see which removal makes the errors go away.
December 10, 2019 As testdasi said, it is likely a bad DIMM. I just had to deal with this myself and it isn't fun. The best thing to do would be to run memtest86 with all memory installed until you start seeing memory errors. Once you see these errors, take a picture or write down where they occur (e.g. test 2, test 3, etc.) and how many there are. That way you only have to run up to those tests and don't have to go through the entire test suite to find the bad stick. Once you've found out which test fails, remove all but the minimum amount of memory required by your system and run memtest86 past the test number where you were previously getting errors. Then swap one stick in at a time, retest up to just past that test number, and repeat until you find the bad stick. After you've eliminated the bad stick, run through the full memory test gamut to ensure everything checks out.
December 10, 2019 Been down this road recently. The OS would catch faults in 2 of the 12 sticks in the server, but memtest86 didn't. Got replacement sticks, and no errors since. It also looks like you have a pair of bad ones.
December 13, 2019 Author Hi guys, thanks for the replies. I already ran memtest to 50%, but it takes too much time. Would it be an idea to remove 2 random DIMMs and test until I have no errors left? What I also do not understand is how those DIMMs can be broken. ECC DIMMs are error corrected, right?
December 13, 2019 Community Expert 10 minutes ago, TJOPTJOP said: ECC DIMMs are error corrected, right? ECC DIMMs can still malfunction, but unlike with non-ECC RAM it won't corrupt your data when that happens. The board's system event log might have more info on which DIMMs are the problem; if not, remove them one by one until the errors stop. Also, there's no point in running memtest unless ECC can be disabled in the BIOS.
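One possible way to look at the board's system event log from the command line is sketched below. It assumes the board has a BMC and that the `ipmitool` utility is installed, neither of which is confirmed in this thread; `filter_memory_events` is a hypothetical helper name, and the keywords it matches are a guess at how memory events are worded on a given board.

```shell
# Hypothetical helper: keep only event-log lines that mention memory.
# Reads the log on stdin so it works with any text dump of the event log.
filter_memory_events() {
    # Case-insensitive match; "|| true" keeps the exit status at 0
    # when no line matches, so an empty result is not treated as a failure.
    grep -iE 'memory|ecc|dimm' || true
}

# Example call (assumes ipmitool is installed and a BMC is present):
#   ipmitool sel elist | filter_memory_events
```

If nothing is printed, the log either has no memory events or words them differently, so it is worth reading the unfiltered log once as well.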
December 13, 2019 Author 8 hours ago, johnnie.black said: ECC DIMMs can still malfunction, but unlike with non-ECC RAM it won't corrupt your data when that happens. The board's system event log might have more info on which DIMMs are the problem; if not, remove them one by one until the errors stop. Also, there's no point in running memtest unless ECC can be disabled in the BIOS. Hi, thanks for the support! I have 16 slots filled with 16 GB modules, which makes 256 GB in total. I removed the RAM from every second slot, which left my server with 128 GB, and after booting there were no errors in the Unraid log. I turned the server off, swapped in the other 128 GB, and booted it up: again no errors. So I decided to install all the RAM modules again, turn off ECC checking in my BIOS, and run a full memtest. Any suggestions on how many passes I need to confirm whether my RAM is good or bad? I think that one pass will take 12+ hours.
December 14, 2019 Try running this: grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count and post the results AFTER you confirm you're still having errors. It'll basically list exactly which memory chips are bad, in the order they're installed on the board, per physical processor.
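The grep above prints every counter, including the ones that are zero. A minimal sketch that prints only the non-zero correctable-error counters could look like the following. `list_ce_errors` is a hypothetical name, the optional directory argument exists only so the function can be pointed somewhere other than the real sysfs tree, and the `ch*_dimm_label` files are assumed to sit next to the counters, which depends on the kernel's EDAC driver.

```shell
# Hypothetical helper: show only channels with a non-zero CE counter.
# Optional argument: EDAC root directory (defaults to the sysfs path
# used in the grep command above).
list_ce_errors() (
    root="${1:-/sys/devices/system/edac/mc}"
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue                 # glob matched nothing
        count=$(cat "$f")
        [ "$count" -gt 0 ] 2>/dev/null || continue   # skip zero or non-numeric
        # Assumption: a matching *_dimm_label file may name the slot.
        label_file="${f%_ce_count}_dimm_label"
        label=""
        [ -r "$label_file" ] && label=$(cat "$label_file")
        printf '%s: %s%s\n' "$f" "$count" "${label:+ ($label)}"
    done
)

# Example call against the live system:
#   list_ce_errors
```

A counter only says that errors were corrected on that row and channel; mapping it to a physical slot still relies on the label file or the board manual.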
December 17, 2019 Author On 12/14/2019 at 4:55 AM, sota said: Try running this: grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count and post the results AFTER you confirm you're still having errors. It'll basically list exactly which memory chips are bad, in the order they're installed on the board, per physical processor. So, I ran memtest on my Supermicro server for about 83 hours and finally 1 pass completed with 0 errors. Also, after a clean reboot there are no new errors in my Unraid log. I have absolutely no idea why I had a lot of errors in my log in the beginning and right now everything is working fine!
December 17, 2019 Just to be clear, are you now running with ECC off? That seems like a bad idea. Doesn't that mean that instead of seeing the errors, the memory is just going to fail silently and potentially get corrupted? (Especially if you weren't seeing the errors in memtest in the first place.) Quote: So I decided to install all the RAM modules again, turn off ECC checking in my BIOS, and run a full memtest. Any suggestions on how many passes I need to confirm whether my RAM is good or bad? I think that one pass will take 12+ hours.
December 17, 2019 On 12/13/2019 at 5:07 AM, TJOPTJOP said: What I also do not understand is how those DIMMs can be broken. ECC DIMMs are error corrected, right? Also, this is exactly what the log is telling you. A "CE memory read error" or "CE memory scrubbing error" is a "Correctable Error (CE)". It would be worse if you were getting "Uncorrectable Errors (UE)". My concern is that you've merely turned off the memory scrubbing and whatnot, hiding the errors rather than fixing them.
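To see the CE versus UE distinction in numbers rather than log messages, the per-controller totals can be read from the same EDAC sysfs tree. This is a rough sketch: `summarize_mc` is a hypothetical name, the directory argument is only there so a different path can be supplied, and it assumes each `mc*` directory exposes `ce_count` and `ue_count` files, which depends on the EDAC driver being loaded.

```shell
# Hypothetical helper: one line per memory controller with CE/UE totals.
summarize_mc() (
    root="${1:-/sys/devices/system/edac/mc}"
    for mc in "$root"/mc*; do
        [ -r "$mc/ce_count" ] || continue       # not a controller directory
        ce=$(cat "$mc/ce_count")
        ue=0
        [ -r "$mc/ue_count" ] && ue=$(cat "$mc/ue_count")
        if [ "$ue" -gt 0 ]; then
            verdict="uncorrectable errors present"
        elif [ "$ce" -gt 0 ]; then
            verdict="correctable errors only"
        else
            verdict="clean"
        fi
        printf '%s: CE=%s UE=%s -> %s\n' "${mc##*/}" "$ce" "$ue" "$verdict"
    done
)

# Example call against the live system:
#   summarize_mc
```

These counters reset on reboot, so a clean result right after a restart says little; it is the count after some days of uptime that matters.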
December 18, 2019 Author Today I received CE errors again. So, as asked, I entered the following command on the command line. See pic1 for the results. grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
December 23, 2019 Author Today I got errors again, but I think I found the bad module. I just removed it and will test the machine for the next two days. Hopefully that will solve my Unraid memory problems.
February 17, 2020 Author I replaced all my RAM with new RAM and it looks like my problems are resolved.