morytox

ECC just doing its thing, or bad memory?

Hi Folks,

 

I am currently running a small home server based on:

Ryzen 7 2700

Asrock Pro X470D4U 

32 GB Kingston ECC memory rated at 2666 MHz (from the mainboard's supported memory list)

 

First of all, even though the memory is labeled 2666 MHz, it is running at 2400 MHz, and I can't find any setting in the BIOS to change that. Anyway, as long as it runs, that is fine, which is why I am more interested in my second topic.

 

I check my log regularly and always find error entries like these, marked in red:

 

Each block starts with "Corrected error, no action required." But some of the memory addresses come up more often than others. Does that mean I have partially faulty memory in there?

 

Thanks for help in advance!

 

Feb 4 19:38:22 BLACKBOX kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Feb 4 20:00:13 BLACKBOX kernel: mce: [Hardware Error]: Machine check events logged
Feb 4 20:00:13 BLACKBOX kernel: [Hardware Error]: Corrected error, no action required.
Feb 4 20:00:13 BLACKBOX kernel: [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2040000000011b
Feb 4 20:00:13 BLACKBOX kernel: [Hardware Error]: Error Addr: 0x00000002c5c66b00
Feb 4 20:00:13 BLACKBOX kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400103
Feb 4 20:00:13 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Feb 4 20:00:13 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Feb 4 20:00:13 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x59b8cd offset:0x700 grain:1 syndrome:0xa40)
Feb 4 20:00:13 BLACKBOX kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Feb 4 21:05:45 BLACKBOX kernel: mce: [Hardware Error]: Machine check events logged
Feb 4 21:05:45 BLACKBOX kernel: [Hardware Error]: Corrected error, no action required.
Feb 4 21:05:45 BLACKBOX kernel: [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2040000000011b
Feb 4 21:05:45 BLACKBOX kernel: [Hardware Error]: Error Addr: 0x00000002ea867500
Feb 4 21:05:45 BLACKBOX kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400103
Feb 4 21:05:45 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Feb 4 21:05:45 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Feb 4 21:05:45 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x5e50ce offset:0xb00 grain:1 syndrome:0xa40)
Feb 4 21:05:45 BLACKBOX kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Feb 4 22:38:36 BLACKBOX kernel: mce: [Hardware Error]: Machine check events logged
Feb 4 22:38:36 BLACKBOX kernel: [Hardware Error]: Corrected error, no action required.
Feb 4 22:38:36 BLACKBOX kernel: [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2040000000011b
Feb 4 22:38:36 BLACKBOX kernel: [Hardware Error]: Error Addr: 0x000000030d494200
Feb 4 22:38:36 BLACKBOX kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400103
Feb 4 22:38:36 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Feb 4 22:38:36 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Feb 4 22:38:36 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x62a928 offset:0x500 grain:1 syndrome:0xa40)
Feb 4 22:38:36 BLACKBOX kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Feb 4 23:27:45 BLACKBOX kernel: mce: [Hardware Error]: Machine check events logged
Feb 4 23:27:45 BLACKBOX kernel: [Hardware Error]: Corrected error, no action required.
Feb 4 23:27:45 BLACKBOX kernel: [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2040000000011b
Feb 4 23:27:45 BLACKBOX kernel: [Hardware Error]: Error Addr: 0x000000031d924c00
Feb 4 23:27:45 BLACKBOX kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400103
Feb 4 23:27:45 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Feb 4 23:27:45 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Feb 4 23:27:45 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x64b249 offset:0x900 grain:1 syndrome:0xa40)
Feb 4 23:27:45 BLACKBOX kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
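To answer the "do some addresses come up more often" question without eyeballing the log, the EDAC summary lines can be tallied. The following is only a minimal sketch: the function name `summarize_ce` and the regular expression are my own, written against the line format shown above, and other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Matches the "EDAC MCx: n CE on ..." summary line as it appears in the log above.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*?"
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+) "
    r"page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+)"
)

def summarize_ce(log_text):
    """Count corrected errors per (mc, csrow, channel) and per memory page."""
    by_location = Counter()
    by_page = Counter()
    for line in log_text.splitlines():
        m = EDAC_RE.search(line)
        if not m:
            continue  # not an EDAC summary line
        n = int(m["count"])
        by_location[(int(m["mc"]), int(m["csrow"]), int(m["channel"]))] += n
        by_page[m["page"]] += n
    return by_location, by_page

# The four EDAC lines from the excerpt above.
sample = """\
Feb 4 20:00:13 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x59b8cd offset:0x700 grain:1 syndrome:0xa40)
Feb 4 21:05:45 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x5e50ce offset:0xb00 grain:1 syndrome:0xa40)
Feb 4 22:38:36 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x62a928 offset:0x500 grain:1 syndrome:0xa40)
Feb 4 23:27:45 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x64b249 offset:0x900 grain:1 syndrome:0xa40)
"""

by_location, by_page = summarize_ce(sample)
print(by_location)  # errors grouped by memory controller / csrow / channel
print(by_page)      # errors grouped by page address
```

For this excerpt, all four errors land on csrow 3 / channel 1 but on four different pages, which may point at one rank or DIMM rather than a single bad cell. A longer log would be needed to say more.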

 


Those look like ECC-corrected errors. That still means there's a problem: you don't want to see those errors regularly. Try running with one DIMM at a time to find out which one is producing them.


I have the same setup as you:

- Ryzen 7 2700
- Asrock Rack X470D4U
- 2x Kingston 16GB ECC DDR4 (KSM26ED8/16ME)

OS is Debian.

I assembled this system today and have seen three corrected ECC errors in about 4 hours of operation, seems to be the same errors as you've reported here. All three of mine are on different addresses, though. Currently trying to figure out if this is normal.

I also noticed that my RAM is running at 2400 and not 2666. This is according to dmidecode.
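For anyone wanting to check the same thing, the rated and configured speeds can be pulled out of `dmidecode -t memory` output. This is a rough sketch, not a definitive parser: the helper name `dimm_speeds` is made up, the sample text is illustrative rather than copied from either system in this thread, and the field is called "Configured Clock Speed" or "Configured Memory Speed" depending on the dmidecode version.

```python
import re

# "Speed" is the module's rated speed; the "Configured ..." field is what it runs at.
SPEED_RE = re.compile(
    r"^\s*(Speed|Configured (?:Memory|Clock) Speed): (\d+) (?:MT/s|MHz)\s*$"
)

def dimm_speeds(dmidecode_text):
    """Return one dict per populated Memory Device with rated/configured speed."""
    devices = []
    current = None
    for line in dmidecode_text.splitlines():
        if line.strip() == "Memory Device":
            current = {}
            devices.append(current)
            continue
        m = SPEED_RE.match(line)
        if m and current is not None:
            key = "rated" if m.group(1) == "Speed" else "configured"
            current[key] = int(m.group(2))
    # Empty slots report "Speed: Unknown", which never matches, so drop them.
    return [d for d in devices if d]

# Illustrative sample only (assumed layout, not taken from the thread).
sample = """\
Memory Device
	Size: 16 GB
	Speed: 2666 MT/s
	Configured Memory Speed: 2400 MT/s
Memory Device
	Size: No Module Installed
	Speed: Unknown
"""

speeds = dimm_speeds(sample)
for d in speeds:
    if d.get("configured", 0) < d.get("rated", 0):
        print(f"DIMM rated {d['rated']} but configured at {d['configured']}")
```

On a real system the text would come from running `dmidecode -t memory` as root and passing its output to `dimm_speeds`.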

May be interesting to note that all three errors happened just as I was running a command in the terminal. One of them happened when I ran sensors-detect, another when I terminated a running stress-test. I don't recall what I was doing when the third one occurred.

Since then I've been running memory stress tests using stress-ng to see if I can trigger more errors (going on for about one or two hours now), but haven't seen anything yet. I also ran a few loops of memtester and everything passed.

The fact that we seem to have the exact same issue (if it even is an issue, I'm still holding out on that) on virtually identical systems makes me believe it could be related to some configuration issue. I have left all BIOS settings to their defaults for now - maybe the memory configuration is not optimal for these sticks? Maybe a BIOS update is in order.

Log entries:

[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2040000000011b
[Hardware Error]: Error Addr: 0x00000000018f03c0
[Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400103
[Hardware Error]: Unified Memory Controller Extended Error Code: 0
[Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

—————————

[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2040000000011b
[Hardware Error]: Error Addr: 0x00000003e7eb9800
[Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400103
[Hardware Error]: Unified Memory Controller Extended Error Code: 0
[Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

—————————————

[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2040000000011b
[Hardware Error]: Error Addr: 0x0000000000c10980
[Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400102
[Hardware Error]: Unified Memory Controller Extended Error Code: 0
[Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
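The second and third entries differ from the first in the status word (0xdc... instead of 0x9c...), which the kernel prints as the "Over" flag. A small sketch of decoding that register is below. The bit positions are the ones I understand the Linux MCA code to use on AMD, and they are consistent with the flags printed in the log, but treat them as an assumption and check them against your kernel source or AMD's documentation. The name `decode_status` is made up.

```python
# Assumed bit positions in MCx_STATUS (verify against kernel/AMD documentation).
STATUS_BITS = {
    "Val": 63,    # register contains a valid error
    "Over": 62,   # another error arrived before this one was read and cleared
    "UC": 61,     # uncorrected error
    "En": 60,     # error reporting enabled
    "MiscV": 59,  # MISC register valid
    "AddrV": 58,  # address register valid
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # syndrome register valid
    "CECC": 46,   # corrected by ECC
    "UECC": 45,   # uncorrectable ECC error
}

def decode_status(value):
    """Return the names of all flags set in a 64-bit MCA status value."""
    return [name for name, bit in STATUS_BITS.items() if (value >> bit) & 1]

first = decode_status(0x9c2040000000011b)
later = decode_status(0xdc2040000000011b)

print(first)
print(later)
print(set(later) - set(first))  # flags present only in the later entries
```

If the assumed layout is right, the only difference is the overflow flag, meaning at least one additional error was recorded while the previous one was still sitting in the register. So the real error count may be somewhat higher than the number of log entries.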


 

3 minutes ago, plindberg said:

Currently trying to figure out if this is normal.

It's normal for bad memory to do this. Good memory will not.

 

5 minutes ago, plindberg said:

I also ran a few loops of memtester and everything passed.

Memtest won't find any errors, because they are being corrected for the time being.

2 minutes ago, Squid said:

It's normal for bad memory to do this. Good memory will not.

 

Memtest won't find any errors, because they are being corrected for the time being.


That's good to know. If any errors do occur they should be logged to the system log though, so at least there's that.

I've read elsewhere that if the memory is not set up correctly in the BIOS, that could trigger ECC errors. Any thoughts on that?

Seems unlikely that both me and the OP got bad sticks, but I guess there might have been a bad batch. 

Just now, plindberg said:

Seems unlikely that both me and the OP got bad sticks, but I guess there might have been a bad batch. 

Not particularly IMO.  The memory chips themselves are banged out by the millions for all the stick manufacturers, and each contains a few million transistors and it only takes one to be borderline.

 

As far as the BIOS settings go, overclocking (XMP / AMP profiles) only serves to make things worse.


This thread might be able to lead you in the right direction: https://www.ixsystems.com/community/threads/freenas-build-with-10gbe-and-ryzen.77752/

 

Specifically, the last couple of pages of that thread.

 

EDIT** Also, this one is very thorough and covers almost everything...the only caveat being that it's insanely long so you'll really need to take some time to read it or know exactly what you're searching for.

 

https://forum.level1techs.com/t/asrock-rack-has-created-the-first-am4-socket-server-boards-x470d4u-x470d4u2-2t/139490/885

Edited by ramblinreck47

5 minutes ago, Squid said:

Not particularly IMO.  The memory chips themselves are banged out by the millions for all the stick manufacturers, and each contains a few million transistors and it only takes one to be borderline.

 

As far as the BIOS settings go, overclocking (XMP / AMP profiles) only serves to make things worse.


There's always underclocking 😏

Jokes aside, I'm going to keep monitoring this and see how extensive the errors get. If it continues happening on a regular basis I will probably just order some new sticks. If I don't see any more I will just write it off as a random event caused by cosmic rays or whatever, and be happy that the ECC did its job.
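One way to keep an eye on this without grepping the syslog is to read the corrected-error counters that the EDAC driver exposes in sysfs. The sketch below assumes the usual layout under /sys/devices/system/edac/mc with one `ce_count` file per memory controller, which requires the EDAC driver for the platform to be loaded. The function name `read_ce_counts` is made up for illustration.

```python
from pathlib import Path

def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: corrected_error_count} from EDAC sysfs."""
    counts = {}
    for mc in sorted(Path(edac_root).glob("mc[0-9]*")):
        counter = mc / "ce_count"
        if counter.is_file():
            counts[mc.name] = int(counter.read_text().strip())
    return counts

# Example use: call this periodically (e.g. from cron) and compare with the
# previous reading; a steadily climbing count is the signal to act on.
#   print(read_ce_counts())
```

The counters reset on reboot, so comparing readings taken some hours or days apart during one uptime gives a rough error rate.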

1 hour ago, ramblinreck47 said:

EDIT** Also, this one is very thorough and covers almost everything...the only caveat being that it's insanely long so you'll really need to take some time to read it or know exactly what you're searching for.

 

https://forum.level1techs.com/t/asrock-rack-has-created-the-first-am4-socket-server-boards-x470d4u-x470d4u2-2t/139490/885

Nearly missed your edit!

Judging from the posts by Marshy on page 4 of the first thread you linked, it seems that setting the memory speed to 2666 MHz rather than 2400 MHz might actually fix the issue. I'm going to try this and see if I get any more errors. These sticks are supposed to run at that speed anyway, not quite sure why they defaulted to 2400.

If that doesn't help I'll start digging into the second thread. Should be a good Sunday read 😁

Just now, plindberg said:

Nearly missed your edit!

Judging from the posts by Marshy on page 4 of the first thread you linked, it seems that setting the memory speed to 2666 MHz rather than 2400 MHz might actually fix the issue. I'm going to try this and see if I get any more errors. These sticks are supposed to run at that speed anyway, not quite sure why they defaulted to 2400.

If that doesn't help I'll start digging into the second thread. Should be a good Sunday read 😁

I'm glad that you probably found an answer to your problem.

 

The Level1Techs thread is incredibly detailed and goes into all the problems that the motherboard has had from the very beginning and what got fixed (and didn't) along the way. This motherboard has soooo much potential, but I'm still in the Intel camp for my next build until they fix a lot of these little bugs. A Ryzen/UnRAID/X470D4U combination just isn't a perfect set yet. Hopefully, the newer motherboard BIOS they are working on and UnRAID 6.9 get this combination to where it needs to be.


In case anyone else has these same issues in the future, the changes discussed above seem to have done the trick.

 

My system has been running stable for a week now, no signs of any problems.

 

To sum it up, what I did was follow the suggestions by the user Marshy in the thread linked above:

- Set the RAM speed to 2666 MHz
- Disable global C-states
- Set Power Supply Idle Control to Typical

 

I'm guessing the latter two didn't have anything to do with the ECC errors, but a lot of people seem to be reporting better system stability with those settings.

