Memory Errors


TJOPTJOP

As testdasi said, it is likely a bad DIMM. I just had to deal with this myself and it isn't fun.

 

The best thing to do would be to run memtest86 with all memory installed until you start seeing memory errors. Once you see these errors, take a picture or write down where they occur (e.g. test 2, test 3, etc.) and how many there are. That way you only have to run up to those tests and don't have to go through the entire test suite to find the bad stick.

 

Once you've found out which test you seem to have failures at, remove all but the minimum amount of memory required by your system and run memtest86 until it is past the test number where you were previously getting errors. Then swap one stick in at a time, retest up to just past that test number, and repeat until you find the bad stick.

 

After you've eliminated the last stick, run through the full memory test gamut to ensure everything checks out.

Link to comment
10 minutes ago, TJOPTJOP said:

ECC DIMMs are error corrected, right?

ECC DIMMs can still malfunction, but unlike with non-ECC RAM, your data won't be corrupted when that happens. The board's system event log might have more info on which DIMMs are the problem; if not, remove them one by one until the errors stop.

 

Also, there's no point in running memtest unless ECC can be disabled in the BIOS.

 

 

Link to comment
8 hours ago, johnnie.black said:

ECC DIMMs can still malfunction, but unlike with non-ECC RAM, your data won't be corrupted when that happens. The board's system event log might have more info on which DIMMs are the problem; if not, remove them one by one until the errors stop.

 

Also, there's no point in running memtest unless ECC can be disabled in the BIOS.

Hi, thanks for the support! I have 16 slots populated with 16 GB modules, which makes 256 GB in total. I removed the RAM from every second slot, leaving the server with 128 GB, but after booting there were no errors in the Unraid log. I turned the server off, swapped in the other 128 GB and booted it up: again no errors.

 

So I decided to install all the RAM modules again, turn off ECC checking in my BIOS and run a full memtest. Any suggestions on how many passes I need to confirm whether my RAM is good or bad? I think one pass will take 12+ hours.

Link to comment

try running this:

 

grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

 

and posting the results, AFTER you confirm you're still having errors.

 

It'll basically list exactly which memory chips are bad, in the order they're installed on the board, per physical processor.
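
If the raw grep output is too noisy (it prints every counter, including all the zeros), here is a rough sketch that only shows the counters that have actually recorded something. It assumes the same csrow/channel sysfs layout as the command above; the function name `report_ce_counts` and the optional directory argument are made up for illustration, and some kernels may expose the counters under a different layout, in which case it will simply report nothing.

```shell
# Sketch: print only the non-zero correctable-error (CE) counters.
# report_ce_counts is a hypothetical helper; the optional first argument
# overrides the EDAC directory (handy for trying it against a copy).
report_ce_counts() {
    root="${1:-/sys/devices/system/edac/mc}"
    found=0
    for f in "$root"/mc*/csrow*/ch*_ce_count; do
        [ -r "$f" ] || continue          # glob matched nothing, or file unreadable
        count=$(cat "$f")
        case "$count" in
            ''|*[!0-9]*) continue ;;     # skip anything that is not a plain number
        esac
        if [ "$count" -gt 0 ]; then
            # Drop the directory prefix so the line starts at mcN/csrowN/...
            echo "${f#"$root"/}: $count"
            found=1
        fi
    done
    [ "$found" -eq 1 ] || echo "no correctable errors recorded"
}
```

Calling `report_ce_counts` with no argument reads the live counters. Which physical slot a given csrow/channel pair corresponds to depends on the board, so check the motherboard manual before pulling a stick.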

 

Link to comment
On 12/14/2019 at 4:55 AM, sota said:

try running this:

 

grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

 

and posting the results, AFTER you confirm you're still having errors.

 

It'll basically list exactly which memory chips are bad, in the order they're installed on the board, per physical processor.

 

So, I ran memtest on my Supermicro server for about 83 hours and it finally completed 1 pass with 0 errors. After a clean reboot there are also no new errors in my Unraid log. I have absolutely no idea why I had a lot of errors showing in my log in the beginning, when right now everything is working fine!

Link to comment

Just to be clear, are you now running with ECC off? That seems like a bad idea.

Doesn't that mean that instead of seeing the errors, it's just going to be failing silently and potentially corrupting the memory? (especially if you weren't seeing the errors in memtest in the first place)

 

Quote

So I decided to install all the RAM modules again, turn off ECC checking in my BIOS and run a full memtest. Any suggestions on how many passes I need to confirm whether my RAM is good or bad? I think one pass will take 12+ hours.

 

Link to comment
On 12/13/2019 at 5:07 AM, TJOPTJOP said:

 

What I also do not understand is how those DIMMs can be broken. ECC DIMMs are error corrected, right?


Also, this is exactly what the log is telling you.

A "CE memory read error" or "CE memory scrubbing error" is a "Correctable Error (CE)". It would be worse if you were getting "Uncorrectable Errors (UE)".

My concern is that you've merely turned off the memory scrubbing and whatnot, hiding the errors rather than fixing them.
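
To tell the two cases apart without digging through the log, something like the following could work. It is only a sketch: it assumes the kernel's EDAC driver is loaded and exposes the per-controller `ce_count` and `ue_count` totals under the same sysfs tree used earlier in the thread, and `summarize_edac` is a made-up name.

```shell
# Sketch: compare correctable (CE) and uncorrectable (UE) totals
# for each memory controller. summarize_edac is a hypothetical helper;
# the optional first argument overrides the EDAC directory.
summarize_edac() {
    root="${1:-/sys/devices/system/edac/mc}"
    for mc in "$root"/mc*; do
        # Skip controllers that do not expose both counters.
        [ -r "$mc/ce_count" ] && [ -r "$mc/ue_count" ] || continue
        ce=$(cat "$mc/ce_count")
        ue=$(cat "$mc/ue_count")
        if [ "$ue" -gt 0 ]; then
            status="UNCORRECTABLE errors present"
        elif [ "$ce" -gt 0 ]; then
            status="correctable errors only"
        else
            status="clean"
        fi
        echo "${mc##*/}: CE=$ce UE=$ue ($status)"
    done
}
```

Keep in mind that with ECC switched off in the BIOS these counters would most likely stay at zero (or not exist at all), so a "clean" result only means something once ECC is back on.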
