Earan Posted December 17, 2021

Hi everyone,

For a while now my Unraid server has been throwing hardware errors every now and then, and they seem to be RAM related. I recently saw this on the screen it's attached to: [screenshot]

Here are the parts I'm using:
Supermicro MBD-H11DSi-NT-B
2x AMD Epyc 7301
8x 16 GB Kingston Server Premier KSM26RD8/16HAI DDR4-2666 regECC

One RAM stick seems to have issues: after a reboot the server sometimes reports 112 GB of memory instead of 128 GB, which is exactly one 16 GB stick missing. Since the errors come up infrequently, how do I find out which stick it is? Are there other issues in the logs on the screen?
JorgeB Posted December 17, 2021

Look at the system event log in the BIOS or in the IPMI event viewer; there should be more info there.
Earan Posted December 17, 2021

I can't see any RAM-related issues, either on the CLI with IPMITool from the Nerdpack or with the IPMI support plugin. In the full syslog I see quite a few events like the one on the screen, but all of them say "[Hardware Error]: Corrected error, no action required." They also don't really state which RAM slot it is, or at least I can't make it out. This is one full event:

Dec 16 21:46:50 itXsvr kernel: mce: [Hardware Error]: Machine check events logged
Dec 16 21:46:50 itXsvr kernel: [Hardware Error]: Corrected error, no action required.
Dec 16 21:46:50 itXsvr kernel: [Hardware Error]: CPU:8 (17:1:2) MC15_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Dec 16 21:46:50 itXsvr kernel: [Hardware Error]: Error Addr: 0x0000000143092400
Dec 16 21:46:50 itXsvr kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x000067100a400401
Dec 16 21:46:50 itXsvr kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Dec 16 21:46:50 itXsvr kernel: EDAC MC2: 1 CE on mc#2csrow#1channel#0 (csrow:1 channel:0 page:0x973092 offset:0x400 grain:64 syndrome:0x6710)
Dec 16 21:46:50 itXsvr kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
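The EDAC line in the event above does carry location fields: a memory controller (mc#2), a chip-select row (csrow#1), and a channel (channel#0). Because the errors are infrequent, it may help to tally those fields across the whole syslog and see whether they always point at the same place. Below is a minimal sketch of that idea in Python. The function name `tally_edac_errors` and the regular expression are illustrative assumptions based only on the single log line quoted here, and the mapping from mc/csrow/channel to a physical DIMM slot is board-specific, so it still has to be checked against the motherboard manual or the SEL.

```python
import re
from collections import Counter

# Pattern modeled on the one EDAC line quoted in this thread (an assumption:
# other kernels or EDAC drivers may format the line differently).
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*?"
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+)"
)

def tally_edac_errors(lines):
    """Count corrected (CE) / uncorrected (UE) errors per (kind, mc, csrow, channel)."""
    counts = Counter()
    for line in lines:
        match = EDAC_RE.search(line)
        if match is None:
            continue  # not an EDAC summary line
        key = (
            match["kind"],
            int(match["mc"]),
            int(match["csrow"]),
            int(match["channel"]),
        )
        counts[key] += int(match["count"])
    return counts

# The EDAC line from the event posted above.
SAMPLE = (
    "Dec 16 21:46:50 itXsvr kernel: EDAC MC2: 1 CE on mc#2csrow#1channel#0 "
    "(csrow:1 channel:0 page:0x973092 offset:0x400 grain:64 syndrome:0x6710)"
)

for (kind, mc, csrow, channel), n in tally_edac_errors([SAMPLE]).most_common():
    print(f"{kind} x{n}: mc={mc} csrow={csrow} channel={channel}")
```

To run it over a downloaded syslog, pass the open file object instead of `[SAMPLE]`, for example `tally_edac_errors(open("syslog.txt", errors="replace"))`, where `syslog.txt` is a placeholder file name. If every event lands on the same mc/csrow/channel, that points at a single module rather than a wider problem.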
JorgeB Posted December 17, 2021

Take a look at the SEL in the BIOS. This is how it appears for one of my SM boards: [screenshot]