BAlGaInTl Posted July 4, 2022 (edited)

Hello All,

I recently noticed that my server was constantly running parity checks. It turns out it was randomly rebooting about once a day, and I can't figure out why. I looked at the logs, and I see a bunch of errors like this:

Jul 4 15:11:48 Dmitri kernel: mce: [Hardware Error]: Machine check events logged
Jul 4 15:11:48 Dmitri kernel: [Hardware Error]: Corrected error, no action required.
Jul 4 15:11:48 Dmitri kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|Scrub]: 0xdc2041000000011b
Jul 4 15:11:48 Dmitri kernel: [Hardware Error]: Error Addr: 0x00000007c3f5d0c0
Jul 4 15:11:48 Dmitri kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x000400040a801202
Jul 4 15:11:48 Dmitri kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Jul 4 15:11:48 Dmitri kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x0 offset:0x0 grain:64 syndrome:0x4)
Jul 4 15:11:48 Dmitri kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

So I thought maybe it was memory related. I rebooted my server and ran Memtest for 24 hours with no errors.

Hardware:
ASRockRack X470D4U
AMD Ryzen 9 3900X 12-Core @ 3800 MHz
32GB DDR4 ECC

This is a relatively new development. Any idea what may be going on? Thanks.

dmitri-diagnostics-20220704-1554.zip

Edited July 5, 2022 by BAlGaInTl: Edited to add diagnostics - Marked as Solved
JorgeB Posted July 5, 2022 (Solution)

Try with just one DIMM at a time and see if those hardware errors go away.
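When swapping DIMMs one at a time, it can help to tally the EDAC lines in the syslog after each change rather than eyeballing them. The following is a minimal sketch, not an official Unraid tool: it assumes the log lines look like the `EDAC MC0: 1 CE ... (csrow:2 channel:1 ...)` line quoted in the first post, and the function name `tally_edac_errors` is made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the EDAC summary line quoted in the first post, e.g.
# "EDAC MC0: 1 CE ... (csrow:2 channel:1 page:0x0 ...)".
# Other kernels or drivers may format this line differently (assumption).
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*?"
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+)"
)


def tally_edac_errors(lines):
    """Sum reported errors per (controller, csrow, channel, kind).

    kind is "CE" (corrected error) or "UE" (uncorrected error).
    Lines that do not match the pattern are ignored.
    """
    totals = Counter()
    for line in lines:
        match = EDAC_LINE.search(line)
        if match:
            key = (
                int(match["mc"]),
                int(match["csrow"]),
                int(match["channel"]),
                match["kind"],
            )
            totals[key] += int(match["count"])
    return totals
```

To use it, open the syslog (commonly `/var/log/syslog`, though the path is an assumption here) and pass the file object to `tally_edac_errors`. If the counts stay at zero with one module installed and climb with the other, that points at the second module or its slot.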
BAlGaInTl Posted July 5, 2022

I'm working on doing that now. I pulled one stick and realized I have a 64GB kit, but Unraid is only reporting 32GB with both sticks installed. Memtest was reporting 64GB. Could it be a weird issue with Unraid? It seems weird that I could go 24 hours with no errors in Memtest, and then get memory errors within 10-30 minutes in Unraid.
JorgeB Posted July 5, 2022

7 minutes ago, BAlGaInTl said: "Could it be a weird issue with Unraid?"

Don't see how.
BAlGaInTl Posted July 5, 2022

5 minutes ago, JorgeB said: "Don't see how."

Yeah... I don't know what I was thinking. Looking at the system info that I uploaded, it says 64GB. I pulled the second stick and started getting errors. I then moved the stick I had removed into the same slot I was testing, and so far there are no errors. Looks like I may have a stick going bad? Fingers crossed I don't get any more errors. So is Memtest just worthless then?
trurl Posted July 5, 2022

The built-in memtest doesn't work with ECC memory. You have to get the official MemTest86.
BAlGaInTl Posted July 5, 2022

22 minutes ago, trurl said: "builtin memtest doesn't work with ECC memory. You have to get the official memtest86"

Seriously? I wasted a lot of time then. That's the first time I've seen that. For now, I'm checking that the stick I suspect wasn't just seated improperly by putting it back in. If I start getting errors again, I may download the full MemTest86, but I'm not sure I will need to. Is it a licensing issue that keeps the most recent version, which supports ECC, from being included with Unraid?
trurl Posted July 5, 2022

1 hour ago, BAlGaInTl said: "Is it a licensing issue for including the most recent version that supports ECC with Unraid?"

Yes.
BAlGaInTl Posted July 5, 2022

Just now, trurl said: "yes"

I wonder if there is a way that could be clarified in the boot menu, or if Memtest could be removed as a boot option? It could have saved me a lot of time. There is nothing to indicate that it isn't working. It booted and ran just fine, recognized the RAM, and did 3+ passes in 24 hours.

Also, that should probably be spelled out here, since it seems that ECC is actually a good idea for Unraid servers even if it isn't necessary: https://wiki.unraid.net/Manual/Troubleshooting#RAM_Issues

That page says the built-in version doesn't have all the features, but not supporting ECC is a pretty big one, especially since MemTest still runs and seems to be doing its thing just fine. Perhaps it would be better if the old version just weren't included anymore, and users were directed to download the latest version for troubleshooting?
JonathanM Posted July 5, 2022

The built-in memtest will still somewhat exercise the ECC, but it doesn't bypass the ECC functionality, so you have to look in the BIOS for memory error logging to see if the ECC registered a correctable bit error.

The way ECC is supposed to work is that if the memory error is correctable, it corrects and logs the error, similar to the way Unraid corrects for a read error on a hard drive. If the error can't be corrected, it hard locks the machine to keep silent errors from multiplying and corrupting things, similar to how Unraid drops a hard drive when it gets a write error.

I agree more could be done to educate users about the limitations of the built-in memtest; it's many years old at this point.
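Besides the BIOS log, a running Linux system with the kernel's EDAC driver loaded (which the `EDAC MC0` line in the first post suggests is the case here) usually exposes corrected and uncorrected error counters under sysfs. The sketch below is an illustration under that assumption; the helper name `read_edac_counts` is hypothetical, and the default path is the standard EDAC sysfs location, which may be absent if the driver is not loaded.

```python
from pathlib import Path


def read_edac_counts(base="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} from EDAC sysfs counters.

    "ce" is the corrected-error count and "ue" the uncorrected-error count.
    Controllers with no readable counter files are skipped, and a missing
    base directory simply yields an empty dict.
    """
    counts = {}
    for mc_dir in sorted(Path(base).glob("mc[0-9]*")):
        entry = {}
        for label, filename in (("ce", "ce_count"), ("ue", "ue_count")):
            counter = mc_dir / filename
            if counter.is_file():
                entry[label] = int(counter.read_text().strip())
        if entry:
            counts[mc_dir.name] = entry
    return counts
```

Calling `read_edac_counts()` before and after a period of normal use shows whether the corrected-error count is climbing. A rising "ce" value with "ue" staying at zero matches the behavior described above: errors are being corrected and logged rather than locking the machine.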
BAlGaInTl Posted July 5, 2022

31 minutes ago, JonathanM said: "The built in memtest will still somewhat exercise the ECC, but doesn't bypass the ECC functionality, so you have to look in the BIOS for memory error logging to see if the ECC registered a correctable bit error. The way ECC is supposed to work is if the memory error is correctable, it corrects and logs the error, similar to the way Unraid corrects for a read error on a hard drive. If the error can't be corrected, it hard locks the machine to keep silent errors from multiplying and corrupting things, similar to how Unraid drops a hard drive when it gets a write error. I agree more could be done to educate about the limitations of the built in memtest, it's many years old at this point."

That's good info, thanks. I confirmed that it does look like just one module is causing the issue. I'll run on one 32GB module for the time being and try to RMA the other. Thanks to all for the help.