morytox Posted February 5, 2020

Hi folks,

I am currently running a small home server based on:
- Ryzen 7 2700
- Asrock Pro X470D4U
- 32 GB Kingston ECC 2666 memory (from the mainboard's supported memory list)

First of all, even though the memory is labeled 2666 MHz, it is running at 2400 MHz, and I can't find any setting in the BIOS to change it. As long as it runs, that's fine, so I am more interested in my second topic.

I check my log regularly and always find error entries like these, marked in red. Each block starts with "Corrected error, no action required", but some of the memory addresses come up more often than others. Does that mean part of my memory is faulty?

Thanks in advance for your help!

Feb 4 19:38:22 BLACKBOX kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Feb 4 20:00:13 BLACKBOX kernel: mce: [Hardware Error]: Machine check events logged
Feb 4 20:00:13 BLACKBOX kernel: [Hardware Error]: Corrected error, no action required.
Feb 4 20:00:13 BLACKBOX kernel: [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2040000000011b
Feb 4 20:00:13 BLACKBOX kernel: [Hardware Error]: Error Addr: 0x00000002c5c66b00
Feb 4 20:00:13 BLACKBOX kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400103
Feb 4 20:00:13 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Feb 4 20:00:13 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Feb 4 20:00:13 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x59b8cd offset:0x700 grain:1 syndrome:0xa40)
Feb 4 20:00:13 BLACKBOX kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Feb 4 21:05:45 BLACKBOX kernel: mce: [Hardware Error]: Machine check events logged
Feb 4 21:05:45 BLACKBOX kernel: [Hardware Error]: Corrected error, no action required.
Feb 4 21:05:45 BLACKBOX kernel: [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2040000000011b
Feb 4 21:05:45 BLACKBOX kernel: [Hardware Error]: Error Addr: 0x00000002ea867500
Feb 4 21:05:45 BLACKBOX kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400103
Feb 4 21:05:45 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Feb 4 21:05:45 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Feb 4 21:05:45 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x5e50ce offset:0xb00 grain:1 syndrome:0xa40)
Feb 4 21:05:45 BLACKBOX kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Feb 4 22:38:36 BLACKBOX kernel: mce: [Hardware Error]: Machine check events logged
Feb 4 22:38:36 BLACKBOX kernel: [Hardware Error]: Corrected error, no action required.
Feb 4 22:38:36 BLACKBOX kernel: [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2040000000011b
Feb 4 22:38:36 BLACKBOX kernel: [Hardware Error]: Error Addr: 0x000000030d494200
Feb 4 22:38:36 BLACKBOX kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400103
Feb 4 22:38:36 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Feb 4 22:38:36 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Feb 4 22:38:36 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x62a928 offset:0x500 grain:1 syndrome:0xa40)
Feb 4 22:38:36 BLACKBOX kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
Feb 4 23:27:45 BLACKBOX kernel: mce: [Hardware Error]: Machine check events logged
Feb 4 23:27:45 BLACKBOX kernel: [Hardware Error]: Corrected error, no action required.
Feb 4 23:27:45 BLACKBOX kernel: [Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2040000000011b
Feb 4 23:27:45 BLACKBOX kernel: [Hardware Error]: Error Addr: 0x000000031d924c00
Feb 4 23:27:45 BLACKBOX kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400103
Feb 4 23:27:45 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Extended Error Code: 0
Feb 4 23:27:45 BLACKBOX kernel: [Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
Feb 4 23:27:45 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x64b249 offset:0x900 grain:1 syndrome:0xa40)
Feb 4 23:27:45 BLACKBOX kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

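To check whether errors really cluster on one DIMM or one address, the EDAC summary lines in a log like the one above can be tallied. The following is a minimal sketch, assuming the line format shown in that log; the function and variable names are made up for illustration and the pattern may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Pattern for the kernel's EDAC summary line as it appears in the log above.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE on \S+ "
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+) "
    r"page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+)"
)


def tally_corrected_errors(log_text):
    """Count corrected errors per (mc, csrow, channel) and per memory page."""
    by_location = Counter()
    by_page = Counter()
    for match in EDAC_RE.finditer(log_text):
        count = int(match["count"])
        location = (int(match["mc"]), int(match["csrow"]), int(match["channel"]))
        by_location[location] += count
        by_page[match["page"]] += count
    return by_location, by_page


# The four EDAC lines from the log above.
SAMPLE_LOG = """\
Feb 4 20:00:13 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x59b8cd offset:0x700 grain:1 syndrome:0xa40)
Feb 4 21:05:45 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x5e50ce offset:0xb00 grain:1 syndrome:0xa40)
Feb 4 22:38:36 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x62a928 offset:0x500 grain:1 syndrome:0xa40)
Feb 4 23:27:45 BLACKBOX kernel: EDAC MC0: 1 CE on mc#0csrow#3channel#1 (csrow:3 channel:1 page:0x64b249 offset:0x900 grain:1 syndrome:0xa40)
"""

by_location, by_page = tally_corrected_errors(SAMPLE_LOG)
print(by_location)
print(by_page.most_common(3))
```

In this excerpt all four errors land on the same csrow and channel, but each on a different page, so the excerpt alone does not show a single repeating address.
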
JorgeB Posted February 5, 2020

Those look like ECC corrected errors. That still means there's a problem, and you don't want to see those errors regularly. Try one DIMM at a time.

plindberg Posted February 8, 2020

I have the same setup as you:
- Ryzen 7 2700
- Asrock Rack X470D4U
- 2x Kingston 16GB ECC DDR4 (KSM26ED8/16ME)

OS is Debian.

I assembled this system today and have seen three corrected ECC errors in about 4 hours of operation. They seem to be the same errors as you've reported here. All three of mine are on different addresses, though. Currently trying to figure out if this is normal.

I also noticed that my RAM is running at 2400 and not 2666. This is according to dmidecode.

May be interesting to note that all three errors happened just as I was running a command in the terminal. One of them happened when I ran sensors-detect, another when I terminated a running stress-test. I don't recall what I was doing when the third one occurred.

Since then I've been running memory stress tests using stress-ng to see if I can trigger more errors (going on for about one or two hours now), but haven't seen anything yet. I also ran a few loops of memtester and everything passed.

The fact that we seem to have the exact same issue (if it even is an issue, I'm still holding out on that) on virtually identical systems makes me believe it could be related to some configuration issue. I have left all BIOS settings at their defaults for now. Maybe the memory configuration is not optimal for these sticks? Maybe a BIOS update is in order.

Log entries:

[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[-|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0x9c2040000000011b
[Hardware Error]: Error Addr: 0x00000000018f03c0
[Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400103
[Hardware Error]: Unified Memory Controller Extended Error Code: 0
[Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
----------
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2040000000011b
[Hardware Error]: Error Addr: 0x00000003e7eb9800
[Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400103
[Hardware Error]: Unified Memory Controller Extended Error Code: 0
[Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
----------
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:8:2) MC16_STATUS[Over|CE|MiscV|-|AddrV|-|-|SyndV|-|CECC]: 0xdc2040000000011b
[Hardware Error]: Error Addr: 0x0000000000c10980
[Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x00000a400a400102
[Hardware Error]: Unified Memory Controller Extended Error Code: 0
[Hardware Error]: Unified Memory Controller Error: DRAM ECC error.
[Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

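The two MC16_STATUS values in these entries differ in a single bit, which is why the kernel prints "Over" for the second and third. A minimal sketch of decoding the flags follows. The bit positions are an assumption based on the usual AMD machine-check status layout and should be verified against the processor documentation; the helper name is hypothetical.

```python
# Assumed bit positions in the MCA status register (verify against AMD docs).
STATUS_BITS = {
    "Val": 63,    # register contains a valid error
    "Over": 62,   # an error arrived while a previous one was still logged
    "UC": 61,     # uncorrected error
    "MiscV": 59,  # MISC register is valid
    "AddrV": 58,  # address register is valid
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # syndrome register is valid
    "CECC": 46,   # corrected by ECC
    "UECC": 45,   # ECC detected but could not correct
}


def decode_status(value):
    """Return the set of flag names whose bit is set in a status value."""
    return {name for name, bit in STATUS_BITS.items() if (value >> bit) & 1}


first = decode_status(0x9C2040000000011B)
later = decode_status(0xDC2040000000011B)

print(sorted(first))
print(sorted(later - first))
```

If the assumed layout is right, the overflow flag means at least one earlier event was not recorded separately, so the logged entries may slightly undercount the real number of corrected errors.
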
Squid Posted February 8, 2020

3 minutes ago, plindberg said:
"Currently trying to figure out if this is normal."

It's normal for bad memory to do this. Good memory will not.

5 minutes ago, plindberg said:
"I also ran a few loops of memtester and everything passed."

Memtest won't find any errors, because they are being corrected for the time being.

plindberg Posted February 8, 2020

2 minutes ago, Squid said:
"It's normal for bad memory to do this. Good memory will not. Memtest won't find any errors, because they are being corrected for the time being."

That's good to know. If any errors do occur they should be logged to the system log though, so at least there's that.

I've read elsewhere that ECC errors can be triggered if the memory is not set up correctly in the BIOS. Any thoughts on that? It seems unlikely that both the OP and I got bad sticks, but I guess there might have been a bad batch.

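Besides the system log, Linux also keeps running totals of corrected and uncorrected errors in sysfs when an EDAC driver is loaded. The sketch below reads those counters. It assumes the standard EDAC location `/sys/devices/system/edac/mc`; the function name is made up for illustration, and on a machine without EDAC it simply returns an empty result.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Read ce_count and ue_count for every memory controller under root."""
    counts = {}
    for mc_dir in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter_file = mc_dir / name
            if counter_file.is_file():
                entry[name] = int(counter_file.read_text().strip())
        counts[mc_dir.name] = entry
    return counts


for controller, entry in read_edac_counts().items():
    print(controller, entry)
```

Polling these counters periodically is one way to notice new corrected errors without grepping the log.
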
Squid Posted February 8, 2020

Just now, plindberg said:
"Seems unlikely that both me and the OP got bad sticks, but I guess there might have been a bad batch."

Not particularly, IMO. The memory chips themselves are banged out by the millions for all the stick manufacturers, each contains a few million transistors, and it only takes one to be borderline.

As far as the BIOS settings go, overclocking (XMP / AMP profiles) only serves to make things worse.

ramblinreck47 Posted February 8, 2020 (edited)

This thread might be able to lead you in the right direction: https://www.ixsystems.com/community/threads/freenas-build-with-10gbe-and-ryzen.77752/

Specifically, the last couple of pages of that thread.

EDIT** Also, this one is very thorough and covers almost everything. The only caveat is that it's insanely long, so you'll really need to take some time to read it or know exactly what you're searching for. https://forum.level1techs.com/t/asrock-rack-has-created-the-first-am4-socket-server-boards-x470d4u-x470d4u2-2t/139490/885

Edited February 8, 2020 by ramblinreck47

plindberg Posted February 8, 2020

5 minutes ago, Squid said:
"Not particularly IMO. The memory chips themselves are banged out by the millions for all the stick manufacturers, and each contains a few million transistors and it only takes one to be borderline. So far as the BIOS settings, overclocking (XMP / AMP profiles) only serve to make things worse."

There's always underclocking 😏

Jokes aside, I'm going to keep monitoring this and see how extensive the errors get. If it keeps happening on a regular basis I will probably just order some new sticks. If I don't see any more I will just write it off as a random event caused by cosmic rays or whatever, and be happy that the ECC did its job.

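One way to judge whether errors occur "on a regular basis" is to look at the gaps between them. The sketch below computes the intervals between the four corrected-error timestamps from the first post in this thread. It assumes syslog-style stamps without a year (2020 is taken from the post date) and English month abbreviations; the function name is hypothetical.

```python
from datetime import datetime


def hours_between_errors(stamps, year=2020):
    """Return the gaps, in hours, between consecutive syslog timestamps."""
    times = []
    for stamp in stamps:
        normalized = " ".join(stamp.split())  # collapse padded day-of-month
        times.append(datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S"))
    times.sort()
    return [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]


# Timestamps of the "Corrected error" entries in the first post.
STAMPS = ["Feb 4 20:00:13", "Feb 4 21:05:45", "Feb 4 22:38:36", "Feb 4 23:27:45"]

gaps = hours_between_errors(STAMPS)
print([round(g, 2) for g in gaps])
```

For that excerpt the gaps are all between roughly 50 minutes and an hour and a half, which looks more like a steady trickle than a one-off event.
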
plindberg Posted February 8, 2020

10 minutes ago, ramblinreck47 said:
"This thread might be able to lead you in the right direction: https://www.ixsystems.com/community/threads/freenas-build-with-10gbe-and-ryzen.77752/ Specifically, the last couple of pages of that thread."

Thank you, going to take a look! 🙇

plindberg Posted February 8, 2020

1 hour ago, ramblinreck47 said:
"EDIT** Also, this one is very thorough and covers almost everything...the only caveat being that it's insanely long so you'll really need to take some time to read it or know exactly what you're searching for. https://forum.level1techs.com/t/asrock-rack-has-created-the-first-am4-socket-server-boards-x470d4u-x470d4u2-2t/139490/885"

Nearly missed your edit!

Judging from the posts by Marshy on page 4 of the first thread you linked, it seems that setting the memory speed to 2666 MHz rather than 2400 MHz might actually fix the issue. I'm going to try this and see if I get any more errors. These sticks are supposed to run at that speed anyway, so I'm not quite sure why they defaulted to 2400.

If that doesn't help I'll start digging into the second thread. Should be a good Sunday read 😁

ramblinreck47 Posted February 8, 2020

Just now, plindberg said:
"Nearly missed your edit! Judging from the posts by Marshy on page 4 of the first thread you linked, it seems that setting the memory speed to 2666 MHz rather than 2400 MHz might actually fix the issue. I'm going to try this and see if I get any more errors. These sticks are supposed to run at that speed anyway, not quite sure why they defaulted to 2400. If that doesn't help I'll start digging into the second thread. Should be a good Sunday read 😁"

I'm glad that you probably found an answer to your problem. The Level1Techs thread is incredibly detailed and goes into all the problems that the motherboard has had from the very beginning and what got fixed (and didn't) along the way.

This motherboard has soooo much potential, but I'm still in the Intel camp for my next build until they fix a lot of these little bugs. A Ryzen/UnRAID/X470D4U combination just isn't a perfect set yet. Hopefully, the newer motherboard BIOS they are working on and UnRAID 6.9 get this combination to where it needs to be.

plindberg Posted February 15, 2020

In case anyone else has these same issues in the future, the changes discussed above seem to have done the trick. My system has been running stable for a week now, with no signs of any problems.

To sum it up, what I did was follow the suggestions by the user Marshy in the thread linked above:
- Set the RAM speed to 2666 MHz
- Disable global c-states
- Set Power Supply Idle Control to Typical

I'm guessing the latter two didn't have anything to do with the ECC errors, but a lot of people seem to be reporting better system stability with those settings.

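After changing the RAM speed, the result can be confirmed from the operating system with `dmidecode -t memory`, which earlier in the thread was the source of the 2400 reading. The sketch below compares rated and configured speed per DIMM. It assumes the field names "Speed" and "Configured Memory Speed" (older dmidecode versions print "Configured Clock Speed"); the sample text is illustrative and was not captured from any poster's machine.

```python
import re


def memory_speeds(dmidecode_text):
    """Return (rated, configured) speed pairs for each populated DIMM."""
    results = []
    for block in dmidecode_text.split("\n\n"):
        if "Memory Device" not in block:
            continue
        rated = re.search(r"^\s*Speed:\s*(\d+)", block, re.M)
        configured = re.search(
            r"^\s*Configured (?:Memory|Clock) Speed:\s*(\d+)", block, re.M
        )
        if rated and configured:  # empty slots report "Unknown" and are skipped
            results.append((int(rated.group(1)), int(configured.group(1))))
    return results


# Illustrative, abridged output for one populated and one empty slot.
SAMPLE = """\
Handle 0x0030, DMI type 17, 40 bytes
Memory Device
    Size: 16384 MB
    Speed: 2666 MT/s
    Configured Memory Speed: 2400 MT/s

Handle 0x0031, DMI type 17, 40 bytes
Memory Device
    Size: No Module Installed
    Speed: Unknown
    Configured Memory Speed: Unknown
"""

for rated, configured in memory_speeds(SAMPLE):
    status = "OK" if rated == configured else "running below rated speed"
    print(f"rated {rated}, configured {configured}: {status}")
```

On a real system the text would come from running `dmidecode -t memory` as root and passing its output to `memory_speeds`.
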
JorgeB Posted February 16, 2020

With Power Supply Idle Control correctly set you should have no issues leaving global c-states enabled, and it will help with power/heat.

willi_jan9393 Posted February 22, 2020

On 2/15/2020 at 2:39 PM, plindberg said:
"In case anyone else has these same issues in the future, the changes discussed above seem to have done the trick. My system has been running stable for a week now, no signs of any problems. To sum it up, what I did was follow the suggestions by the user Marshy in the thread linked above:
- Set the RAM speed to 2666 MHz
- Disable global c-states
- Set Power Supply Idle Control to Typical
I'm guessing the latter two didn't have anything to do with the ECC errors, but a lot of people seem to be reporting better system stability with those settings."

Where did you set the RAM speed to 2666 MHz? I can't find this option in the BIOS...

Mohrpheus78 Posted March 2, 2020

Can you please explain how to set the RAM to 2666 MHz? Where in the BIOS settings do I have to configure this? Is it AMD CBS > UMC Common Options > DDR4 Common Options > DRAM Timing Configuration?

sagman Posted August 26, 2020

I have the same problem with the same hardware. Have you been able to fix it?

On 2/8/2020 at 9:45 PM, plindberg said:
"I have the same setup as you, - Ryzen 7 2700 - Asrock Rack X470D4U - 2x Kingston 16GB ECC DDR4 (KSM26ED8/16ME). OS is Debian. I assembled this system today and have seen three corrected ECC errors in about 4 hours of operation, seems to be the same errors as you've reported here. [...]"

bgerm Posted November 7, 2020 (edited)

I was having the same problem with the same hardware. Like @plindberg, I changed the RAM clock speed to 2666 MHz, and this seems to have fixed the issue (although it's only been a few hours). I will continue to monitor my system, but for anyone else like @sagman and @Mohrpheus78, these are the steps I took to adjust the memory speed:
- Enter the UEFI (press F2 during boot).
- Go to Advanced > AMD CBS > UMC Common Options > DDR4 Common Options > DRAM Timing Configuration (click accept on the warning page that pops up after this).
- Select Memory Clock Speed and change it from "Auto" to 1333 MHz (that is, DDR4-2666).
- Press F10 to save and exit.

Edited November 7, 2020 by bgerm: typo

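The 1333 MHz value in the steps above can look like a mistake next to the 2666 printed on the sticks. It is not: DDR memory transfers data on both edges of the clock, so the advertised figure is twice the memory clock. A tiny sketch of that arithmetic, with a made-up function name:

```python
def ddr_data_rate(memory_clock_mhz):
    """Return the DDR data rate in MT/s for a given memory clock in MHz."""
    # Double data rate: one transfer on the rising edge, one on the falling edge.
    return 2 * memory_clock_mhz


print(ddr_data_rate(1333))  # the BIOS setting used above
print(ddr_data_rate(1200))  # the clock behind the "2400" default
```

So a BIOS that lists memory clocks rather than data rates shows 1200 MHz for the 2400 default and 1333 MHz for the rated 2666.
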