TJOPTJOP Posted December 10, 2019 Hi guys, I am a new user on the UNRAID platform, and for the past couple of weeks I have been receiving hardware errors. Can anyone help me troubleshoot this? I have installed 256 GB of DDR3 ECC RAM, but I see some errors in my log. I have no idea how to solve this issue. Thanks, Tom
testdasi Posted December 10, 2019 Those are probably due to a bad DIMM. You might want to remove one stick at a time and see which one causes the error to go away.
GroxyPod Posted December 10, 2019 As testdasi said, it is likely a bad DIMM. I just had to deal with this myself and it isn't fun. The best thing to do would be to run memtest86 with all memory installed until you start seeing memory errors. Once you see these errors, take a picture or write down where they occur (e.g. test 2, test 3, etc.) and how many there are. That way you only have to run up to those tests and don't have to go through the entire test suite to find the bad stick. Once you've found out which test produces the failures, remove all but the minimum amount of memory required by your system and then run the test past the test number where you were previously getting errors. Then you just swap one stick in at a time, retest up to that test number, and repeat until you find the bad stick. After you've eliminated the bad stick, run through the full memory test gamut to ensure everything checks out.
sota Posted December 10, 2019 Been down this road recently. The OS would catch faults in 2 of the 12 sticks in the server, but memtest86 didn't. Got replacement sticks, and no errors since. It also looks like you have a pair of them bad.
TJOPTJOP (Author) Posted December 13, 2019 Hi guys, thanks for the replies. I already ran memtest to 50%, but it takes too much time. Would it be an idea to remove 2 random DIMMs and test until I do not have any errors left? What I also do not understand is how those DIMMs can be broken. ECC DIMMs are error corrected, right?
JorgeB Posted December 13, 2019 10 minutes ago, TJOPTJOP said: "ECC DIMMs are error corrected, right?" ECC DIMMs can still malfunction, but unlike with non-ECC RAM it won't corrupt your data when that happens. The board's system event log might have more info on which DIMMs are the problem; if not, remove them one by one until the errors stop. Also, there is no point in running memtest unless ECC can be disabled in the BIOS.
TJOPTJOP (Author) Posted December 13, 2019 Hi, thanks for the support! I have 16 slots filled with 16 GB modules, which makes 256 GB in total. I removed the RAM from every second slot, which left my server with 128 GB, and after booting there were no errors in the UNRAID log. I then turned the server off, swapped in the other 128 GB and booted it up; again no errors. So I decided to install all the RAM modules again, turn off ECC checking in my BIOS and run a full memtest. Any suggestions on how many passes I need to confirm whether my RAM is good or bad? I think that one pass will take 12+ hours.
sota Posted December 14, 2019 Try running this: grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count and posting the results, AFTER you confirm you're still having errors. It'll basically list exactly which memory chips are bad (the entries with a non-zero count), in the order they're installed on the board, per physical processor.
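The grep above prints every counter, including the ones that are still zero. As a rough sketch of how the output could be narrowed down to the suspicious entries, the function below walks the same sysfs paths and prints only counters greater than zero. The function name list_nonzero_ce and its optional directory argument are made up for this illustration; it assumes the board exposes the csrow/channel layout that the original command relies on.

```shell
# list_nonzero_ce is a hypothetical helper name, not part of any tool.
# It reads the same per-channel correctable-error (CE) counters as the
# grep command and prints only those that are greater than zero.
# Usage: list_nonzero_ce                 (reads the live sysfs tree)
#        list_nonzero_ce /path/to/copy   (reads a copied tree instead)
list_nonzero_ce() {
    root="${1:-/sys/devices/system/edac/mc}"
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        # Skip the unexpanded pattern (no match) and unreadable files.
        [ -r "$f" ] || continue
        count=$(cat "$f")
        case "$count" in
            ''|*[!0-9]*) continue ;;   # ignore anything that is not a plain number
        esac
        if [ "$count" -gt 0 ]; then
            rel=${f#"$root"/}          # e.g. mc0/csrow2/ch1_ce_count
            printf '%s: %s\n' "$rel" "$count"
        fi
    done
}
```

If this prints nothing, either no correctable errors have been counted since boot or the EDAC driver does not expose this layout on the machine; the counters are reset on reboot, which is why the command should be run after the errors have shown up again.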
TJOPTJOP (Author) Posted December 17, 2019 So, I ran memtest on my Supermicro server for about 83 hours and finally 1 pass completed with 0 errors. Also, after a clean reboot there are no new errors in my UNRAID log. I have absolutely no idea why I had a lot of errors showing in my log in the beginning, and right now everything is working fine!
SnickySnacks Posted December 17, 2019 Just to be clear, are you now running with ECC off? That seems like a bad idea. Doesn't that mean that instead of seeing the errors, it's just going to be failing silently and potentially corrupting the memory? (Especially if you weren't seeing the errors in memtest in the first place.) Quote: "So I decided to install all the RAM modules again, turn off ECC checking in my BIOS and run a full memtest. Any suggestions on how many passes I need to confirm whether my RAM is good or bad? I think that one pass will take 12+ hours."
SnickySnacks Posted December 17, 2019 On 12/13/2019 at 5:07 AM, TJOPTJOP said: "What I also do not understand is how those DIMMs can be broken. ECC DIMMs are error corrected, right?" Also, this is exactly what the log is telling you. A "CE memory read error" or "CE memory scrubbing error" is a "Correctable Error (CE)". It would be worse if you were getting "Uncorrectable Errors (UE)". My concern is that you've merely turned off the memory scrubbing and whatnot, hiding the errors rather than fixing them.
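To make the CE versus UE distinction concrete, here is a small sketch that prints both totals for each memory controller. It assumes the EDAC driver exposes the ce_count and ue_count files directly under each mc* directory, which is the usual sysfs layout; the function name edac_summary is invented for this example.

```shell
# edac_summary is a hypothetical helper name used only for this sketch.
# For each memory controller it prints the total number of correctable (CE)
# and uncorrectable (UE) errors that the EDAC driver has counted.
# Usage: edac_summary            (reads the live sysfs tree)
edac_summary() {
    root="${1:-/sys/devices/system/edac/mc}"
    for mc in "$root"/mc*; do
        # Skip controllers (or an unmatched pattern) without both counters.
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        ce=$(cat "$mc/ce_count")
        ue=$(cat "$mc/ue_count")
        printf '%s: CE=%s UE=%s\n' "${mc##*/}" "$ce" "$ue"
    done
}
```

A growing CE value with UE staying at 0 matches the situation in this thread: ECC is still correcting the faults, but the module behind them is suspect. A non-zero UE value would mean errors that ECC could not correct.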
TJOPTJOP (Author) Posted December 18, 2019 Today I received CE errors again. So, as asked, I entered the following command on the command line: grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count See pic1 for the results.
TJOPTJOP (Author) Posted December 23, 2019 Today there was an error again, but I think I found the bad module. I just removed it and will test the machine for the coming two days. Hopefully that will solve my UNRAID memory problems.
TJOPTJOP (Author) Posted February 17, 2020 I replaced all my RAM with new RAM and it looks like my problems are resolved.