(SOLVED) Reboots and Memory Errors


Go to solution Solved by JorgeB,


Hello All,

 

I recently started noticing that my server was constantly running parity checks.  It turns out it was randomly rebooting once a day, and I can't figure out why.

 

I looked at the logs, and I see a bunch of errors like this:

 

Jul  4 15:11:48 Dmitri kernel: mce: [Hardware Error]: Machine check events logged
Jul  4 15:11:48 Dmitri kernel: [Hardware Error]: Corrected error, no action required.
Jul  4 15:11:48 Dmitri kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0xdc2041000000011b
Jul  4 15:11:48 Dmitri kernel: [Hardware Error]: Error Addr: 0x00000007c3f5d0c0
Jul  4 15:11:48 Dmitri kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000400040a801202
Jul  4 15:11:48 Dmitri kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jul  4 15:11:48 Dmitri kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x0 offset:0x0 grain:64 syndrome:0x4)
Jul  4 15:11:48 Dmitri kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
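
For anyone reading a log like this, the EDAC line is the one that points at a location: here memory controller 0, csrow 2, channel 1. Below is a minimal sketch, assuming the syslog keeps the same EDAC line layout as the excerpt above, that tallies corrected (CE) and uncorrected (UE) errors per location. The function name and regex are illustrative only and not part of Unraid or the kernel; mapping a csrow/channel pair to a physical slot depends on the board and is not attempted here.

```python
import re
from collections import Counter

# Matches the EDAC summary line layout seen in the excerpt above, e.g.
# "EDAC MC0: 1 CE ... (csrow:2 channel:1 page:0x0 ...)".
# Assumption: other kernels or drivers may format this line differently.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*?"
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+)"
)


def count_edac_errors(lines):
    """Tally EDAC errors keyed by (kind, controller, csrow, channel)."""
    counts = Counter()
    for line in lines:
        match = EDAC_RE.search(line)
        if match is None:
            continue  # not an EDAC summary line
        key = (
            match["kind"],
            int(match["mc"]),
            int(match["csrow"]),
            int(match["channel"]),
        )
        counts[key] += int(match["count"])
    return counts


if __name__ == "__main__":
    sample = [
        "Jul  4 15:11:48 Dmitri kernel: EDAC MC0: 1 CE Cannot decode normalized "
        "address on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x0 offset:0x0 "
        "grain:64 syndrome:0x4)",
        "Jul  4 15:11:48 Dmitri kernel: [Hardware Error]: cache level: L3/GEN, "
        "tx: GEN, mem-tx: RD",
    ]
    for key, total in count_edac_errors(sample).items():
        print(key, total)
```

If every error lands on the same csrow and channel, that is a hint (not proof) that one module or slot is involved.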

 

So I thought maybe it was memory related.  I rebooted my server and ran Memtest for 24 hours with no errors.

 

Hardware:

ASRockRack X470D4U

AMD Ryzen 9 3900X 12-Core @ 3800 MHz
32GB DDR4 ECC

 

This is a relatively new development.  Any idea on what may be going on?

 

Thanks.

dmitri-diagnostics-20220704-1554.zip

Edited by BAlGaInTl
Edited to add diagnostics - Marked as Solved

I'm working on doing that now.

 

I pulled one stick and realized I have a 64GB kit and Unraid is only reporting 32GB with both sticks installed.  Memtest was reporting 64GB.  Could it be a weird issue with Unraid?

 

Seems weird that I could go 24 hours with no errors in Memtest, and then I get memory errors within 10-30 minutes in Unraid.

5 minutes ago, JorgeB said:

Don't see how.

Yeah... I don't know what I was thinking.

 

Looking at the system info that I uploaded, it says 64GB.

 

I pulled the second stick and started getting errors.  Moved the one I had removed into the same slot I was testing, and so far no errors.  Looks like I may have a stick going bad?  Fingers crossed I don't get any more errors.

 

So is Memtest just worthless then?

22 minutes ago, trurl said:

The built-in memtest doesn't work with ECC memory. You have to get the official MemTest86.

 

Seriously?  I wasted a lot of time then.  That's the first time I've seen that.

 

For now, I'm verifying that it wasn't just seated improperly by putting the stick I think may have been failing back.  If I start getting errors again, I may download the full MemTest86, but not sure if I will need to.
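
A sketch of one way to watch whether corrected errors come back after reseating a stick, without grepping the syslog: the Linux EDAC driver (which the log above shows is loaded) normally exposes running counters under sysfs. This assumes the standard `/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count` layout is present on the system; the thread does not confirm that for this Unraid release, and the function name is illustrative.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce_count": int|None, "ue_count": int|None}, ...}.

    Assumption: `base` follows the usual Linux EDAC sysfs layout. A missing
    directory yields an empty dict; an unreadable counter yields None.
    """
    result = {}
    base_path = Path(base)
    if not base_path.is_dir():
        return result
    for mc_dir in sorted(base_path.glob("mc[0-9]*")):
        counts = {}
        for name in ("ce_count", "ue_count"):
            try:
                counts[name] = int((mc_dir / name).read_text().strip())
            except (OSError, ValueError):
                counts[name] = None
        result[mc_dir.name] = counts
    return result


if __name__ == "__main__":
    for controller, counts in read_edac_counts().items():
        print(controller, counts)
```

Running it before and after a few hours of normal load shows whether `ce_count` is still climbing. The counters reset on reboot.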

 

Is it a licensing issue for including the most recent version that supports ECC with Unraid?

Just now, trurl said:

yes

 

I wonder if there is a way that can be clarified in the boot menu, or removed as a boot option?

 

Could have saved me a lot of time.  There is nothing to indicate that it isn't working.  It booted and ran just fine.  It recognized the RAM and did 3+ passes in 24 hours.

 

Also, that should probably be spelled out here since it seems that ECC is actually a good idea for Unraid servers even if it isn't necessary:

 

https://wiki.unraid.net/Manual/Troubleshooting#RAM_Issues

 

That says it doesn't have all the features, but not supporting ECC is a pretty big one.  Especially since MemTest still runs and seems to be doing its thing just fine.

 

Perhaps it would be better if the old version just weren't included anymore, and users were simply directed to download the latest version for troubleshooting?


The built-in memtest will still somewhat exercise the ECC, but it doesn't bypass the ECC functionality, so you have to look in the BIOS for memory error logging to see if the ECC registered a correctable bit error.

 

The way ECC is supposed to work is if the memory error is correctable, it corrects and logs the error, similar to the way Unraid corrects for a read error on a hard drive. If the error can't be corrected, it hard locks the machine to keep silent errors from multiplying and corrupting things, similar to how Unraid drops a hard drive when it gets a write error.

 

I agree more could be done to educate about the limitations of the built-in memtest; it's many years old at this point.
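
To make the "correct and log, or halt" behaviour concrete, here is a toy sketch of single-error-correct, double-error-detect (SECDED) coding using an extended Hamming(8,4) code. It is an illustration under stated assumptions only: ECC DIMMs typically protect 64 data bits with 8 check bits, the exact code a given memory controller uses is not stated in this thread, and the function names are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1, 2 and 4 hold the
    Hamming parity bits; indices 3, 5, 6 and 7 hold the data bits.
    """
    code = [0] * 8
    code[3] = nibble & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set-bit positions is 0 for a valid word
    parity_broken = sum(code) % 2 == 1
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # One bit flipped: the syndrome is its position (0 = overall parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with intact overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1  # simulate one flipped bit
    print(decode(word))
    word[2] ^= 1  # a second flipped bit in the same word
    print(decode(word))
```

A single flipped bit comes back as "corrected" with the original data, which corresponds to the "Corrected error, no action required" lines in the log. Two flipped bits in the same word can only be detected, which is the case where the system halts rather than continue with bad data. This also suggests why a memory tester running with ECC active may see no failures: the corrected value is what software reads back.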

31 minutes ago, JonathanM said:

The built-in memtest will still somewhat exercise the ECC, but it doesn't bypass the ECC functionality, so you have to look in the BIOS for memory error logging to see if the ECC registered a correctable bit error.

 

The way ECC is supposed to work is if the memory error is correctable, it corrects and logs the error, similar to the way Unraid corrects for a read error on a hard drive. If the error can't be corrected, it hard locks the machine to keep silent errors from multiplying and corrupting things, similar to how Unraid drops a hard drive when it gets a write error.

 

I agree more could be done to educate about the limitations of the built-in memtest; it's many years old at this point.

 

That's good info, thanks.

 

So I confirmed that it does look like just one module is causing the issue.  I'll run on one 32GB module for the time being and try to RMA the other.

 

Thanks to all for the help.

  • BAlGaInTl changed the title to (SOLVED) Reboots and Memory Errors
