Grrrreg Posted December 16, 2020 (edited)

Hello Everyone,

I was noticing some instability and rebooted my server the other day. Today I saw the notice and instructions in Fix Common Problems to install the mcelog plugin. It looks like there are some memory errors. I've got about an hour left on the parity check, but am wondering what my next steps should be.

I could definitely use some recommendations on best practice with server memory. From reading the SuperMicro guide, it seems that simply removing the faulty module is not recommended, but maybe I'm reading that wrong. I was thinking I'd shut down the array, reboot, and run a memory check from the BIOS. SEL logging was turned off in the BIOS; it is now enabled.

Also, in replacing the bad DIMM, I was thinking of getting 4 new DIMMs, same spec but larger capacity (8 or 16GB), and replacing the other three in the same channel/rank etc. for a small upgrade. I'm still trying to make sense of all the complexity of server memory, so all advice is welcome and appreciated.

Thanks in advance for your help!
-Greg

Server
UNRAID 6.8.3
SuperMicro - SuperStorage 6047R-E1R24N
MB: Super X9DRi-LN4F+
Processor 1: Intel Xeon E5-2660 v2 2.2GHz 10 Core 25MB Cache
Processor 2: Intel Xeon E5-2660 v2 2.2GHz 10 Core 25MB Cache
Memory: 64GB (16x4GB) PC3-10600R 1333MHz DDR3 ECC

Errors
Dec 15 16:05:47 Seine kernel: mce: [Hardware Error]: Machine check events logged
Dec 15 16:05:47 Seine kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Dec 15 16:05:47 Seine kernel: EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 7: 8c00004000010090
Dec 15 16:05:47 Seine kernel: EDAC sbridge MC1: TSC d7f4672046556
Dec 15 16:05:47 Seine kernel: EDAC sbridge MC1: ADDR bce285600
Dec 15 16:05:47 Seine kernel: EDAC sbridge MC1: MISC 207e5286
Dec 15 16:05:47 Seine kernel: EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1608077147 SOCKET 1 APIC 20
Dec 15 16:05:47 Seine kernel: EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbce285 offset:0x600 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:0)
Dec 15 20:01:16 Seine kernel: mce: [Hardware Error]: Machine check events logged
Dec 15 20:01:16 Seine kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Dec 15 20:01:16 Seine kernel: EDAC sbridge MC1: CPU 10: Machine Check Event: 0 Bank 7: 8c00004000010090
Dec 15 20:01:16 Seine kernel: EDAC sbridge MC1: TSC d9b8bf06f72be
Dec 15 20:01:16 Seine kernel: EDAC sbridge MC1: ADDR bce285600
Dec 15 20:01:16 Seine kernel: EDAC sbridge MC1: MISC 407c0086
Dec 15 20:01:16 Seine kernel: EDAC sbridge MC1: PROCESSOR 0:306e4 TIME 1608091276 SOCKET 1 APIC 20
Dec 15 20:01:16 Seine kernel: EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbce285 offset:0x600 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:0)

seine-diagnostics-20201216-1138.zip

Edited December 17, 2020 by Grrrreg (typos)
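Editor's sketch, not part of the original post: the "Bank 7" value in the log above is a machine-check status word, and decoding it shows why the kernel reports a corrected (CE) read error. The bit positions below follow the architectural IA32_MCi_STATUS layout as I understand it from Intel's documentation; the function name `decode_mci_status` and the returned dictionary keys are hypothetical, chosen only for this illustration.

```python
# Memory-operation names for the MMM field of a memory-controller error code.
MEMORY_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_mci_status(status: int) -> dict:
    """Decode the architectural fields of a 64-bit machine-check status value."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),            # VAL: register holds an error
        "overflow": bool((status >> 62) & 1),         # OVER: an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),      # UC: 0 means the error was corrected
        "misc_valid": bool((status >> 59) & 1),       # MISCV: MISC register is meaningful
        "addr_valid": bool((status >> 58) & 1),       # ADDRV: ADDR register is meaningful
        "context_corrupt": bool((status >> 57) & 1),  # PCC: processor context corrupted
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory-controller errors use the bit pattern 0000 0000 1MMM CCCC.
    if (mca_code & 0xFF80) == 0x0080:
        fields["memory_op"] = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        fields["channel"] = None if channel == 0xF else channel  # 1111 = unspecified
    return fields


if __name__ == "__main__":
    for name, value in decode_mci_status(0x8C00004000010090).items():
        print(f"{name}: {value}")
```

For the value in the log, the uncorrected bit is clear and the low 16 bits are 0x0090, which lines up with the `err_code:0001:0090` and "CE memory read error" text that EDAC prints for channel 0.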
Squid Posted December 19, 2020

You have a bad memory stick, so the only prudent thing to do is to replace it.
Grrrreg Posted December 20, 2020 (Author)

Thanks Squid, I have two replacement DIMMs on order. I just need to figure out which DIMM to replace. I haven't had any new errors logged since the original event. Once the replacements arrive, I'll run memtest or the Supermicro offline memory test and hope one of them identifies which DIMM to replace. I think I've figured out the Supermicro recommended memory config, and what would happen if I just tried to upgrade the memory, with regard to the reduced speed as the number of DIMMs per channel goes up.

I love my older server config, the number of cores etc.; overall it's worked well. Things have changed a lot since I bought my first 20MB SCSI hard drive in 1990. I think there's a lot of great hardware out there that has lots of life left in it, but at times some of us don't fully understand how best to keep it running. I really appreciate all the advice and help from the forums.
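Editor's sketch, not part of the original post: the EDAC lines in the syslog already name a location for each error (`CPU_SrcID#1_Ha#0_Chan#0_DIMM#0`, channel 0, slot 0), so tallying them per label is one way to narrow down the suspect module before any memory test is run. The regular expression below is written against the two EDAC lines quoted in this thread only, and `count_edac_errors` is a hypothetical helper name. Mapping the EDAC label to a silkscreened slot name on the board is not shown here and would have to come from the motherboard manual.

```python
import re
from collections import Counter

# Matches the summary line EDAC prints per event, for example:
# "EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x...)"
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+)"
)


def count_edac_errors(lines):
    """Return a Counter keyed by (kind, label) with the total number of errors seen."""
    totals = Counter()
    for line in lines:
        match = EDAC_LINE.search(line)
        if match:
            totals[(match["kind"], match["label"])] += int(match["count"])
    return totals


if __name__ == "__main__":
    import sys

    # Usage (assumed path): python edac_count.py /var/log/syslog
    with open(sys.argv[1], errors="replace") as handle:
        for (kind, label), total in count_edac_errors(handle).most_common():
            print(f"{total:6d} {kind} {label}")
```

If every corrected error lands on the same label, as the two events in this thread do, that DIMM is the obvious first candidate to swap.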
Squid Posted December 20, 2020

There's probably more information about this in the server's event log (SEL).