MvL Posted May 18, 2019
Hi, I have found machine check events on my server. Fix Common Problems has warned me three times now, so I guess it's time to find out what the problem is. Can someone help me solve the issue? I have my own thoughts on what the issue is. I have attached my diagnostics file to this post. rackserver-diagnostics-20190518-0936.zip
MvL Posted May 18, 2019
I see this in the syslog:
Quote
May 17 22:35:46 RackServer kernel: mce: [Hardware Error]: Machine check events logged
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 13: 8c000049000800c0
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: TSC ab5c2e16c9629
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: ADDR a58102000
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: MISC 90000008000928c
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: PROCESSOR 0:306f2 TIME 1558125346 SOCKET 0 APIC 0
May 17 22:35:46 RackServer kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xa58102 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:2 rank:0)
Are these memory errors? I've checked my BIOS but I can't find anything related to this.
JorgeB Posted May 18, 2019
Quote: Are these memory errors?
Yes, at least that's what it looks like.
MvL Posted May 18, 2019
Thank you for the reply. The weirdest part is that I don't see any errors in my BIOS log files. Any idea how to diagnose this further?
MvL Posted May 18, 2019
Investigating! I found the memtest86 entry that is shown among the options when unRAID boots. It's running at the moment. I'll keep you informed.
JorgeB Posted May 18, 2019
memtest86 won't detect ECC errors; Passmark memtest might in some cases.
JorgeB Posted May 18, 2019
13 minutes ago, MvL said: So this is an ECC error?
Looks like a corrected error; the board's system event log might also have some more info.
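One way to check the "corrected error" reading is to decode the status value from the `Bank 13: 8c000049000800c0` line. The sketch below assumes the architectural bit layout of the IA32_MCi_STATUS register as described in the Intel SDM (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 to 57, corrected error count in bits 52:38, MCA error code in bits 15:0). The function name `decode_mci_status` is made up for illustration, and the corrected error count field is only meaningful on CPUs that report it.

```python
# Flag bits of IA32_MCi_STATUS (assumed layout, per the Intel SDM).
FLAGS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
         "MISCV": 59, "ADDRV": 58, "PCC": 57}

# MMM field of a memory controller error code (0000 0000 1MMM CCCC).
MEMORY_OPS = {0: "generic undefined request", 1: "memory read error",
              2: "memory write error", 3: "address/command error",
              4: "memory scrubbing error"}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine check status value into named fields."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    fields["corrected_count"] = (status >> 38) & 0x7FFF      # bits 52:38
    fields["model_specific_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    mca_code = status & 0xFFFF                               # bits 15:0
    fields["mca_code"] = mca_code
    # Memory controller errors have bit 7 set and bits 15:8 clear
    # (bit 12 is a filtering bit and is ignored here).
    if (mca_code & 0xEF80) == 0x0080:
        fields["memory_op"] = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
    return fields


decoded = decode_mci_status(0x8C000049000800C0)
for name, value in decoded.items():
    print(f"{name}: {value}")
```

For the value in the syslog excerpt, the UC (uncorrected) bit is clear and the corrected count is 1, which is consistent with the `1 CE memory scrubbing error` line that EDAC printed.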
MvL Posted May 18, 2019
I appreciate your guidance. I've checked the event log via IPMI and there are no new events. The latest event was on 2019-5-2. I have put Passmark memtest on a USB stick and it is now running.
JorgeB Posted May 18, 2019
If memtest shows no errors, your best bet is to remove one or more DIMMs at a time and see if the errors in the log go away.
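To compare the log before and after pulling a DIMM, it can help to tally corrected errors per EDAC label. This is only a sketch: it assumes the `EDAC MCn: <count> CE ... on <label>` line format seen in the syslog excerpt earlier in this thread, and the name `count_corrected_errors` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_..._DIMM#0 (...)
CE_LINE = re.compile(r"EDAC MC\d+: (\d+) CE .*? on (\S+)")


def count_corrected_errors(log_text: str) -> Counter:
    """Sum the corrected error (CE) counts per EDAC DIMM label."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CE_LINE.search(line)
        if match:
            counts[match.group(2)] += int(match.group(1))
    return counts


sample = (
    "May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR\n"
    "May 17 22:35:46 RackServer kernel: EDAC MC0: 1 CE memory scrubbing error on "
    "CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xa58102 offset:0x0 "
    "grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c0 socket:0 ha:0 "
    "channel_mask:2 rank:0)\n"
)
counts = count_corrected_errors(sample)
for label, total in counts.items():
    print(label, total)
```

In practice the text would come from the server's syslog file rather than a hard-coded sample. If the count for a label stops growing after a module is removed, that module is the likely suspect.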
MvL Posted May 18, 2019
Okay. The unRAID log reports Chan#1_DIMM#0. I'm guessing this is DIMM A2: the first DIMM position (A) on the motherboard, and then the second slot of that position. I'm guessing there is also a channel 0, so channel 0 --> slot 1 and channel 1 --> slot 2.
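The location fields in the EDAC line can be pulled apart programmatically, as in the sketch below. It assumes the `key:value` layout inside the parentheses of the EDAC line from the syslog excerpt and a 4 KiB page size; `parse_edac_details` is a made-up helper name. Note that it only extracts what the kernel reported: how channel and slot numbers map to the silkscreen names on the board (A1, A2, and so on) is board-specific and has to come from the motherboard manual.

```python
import re


def parse_edac_details(line: str) -> dict:
    """Extract the DIMM label and the key:value details from an EDAC CE line."""
    fields = {"label": re.search(r" on (\S+) \(", line).group(1)}
    details = re.search(r"\((.*)\)", line).group(1)
    for token in details.split():
        if ":" in token:
            key, value = token.split(":", 1)  # keeps "0008:00c0" intact
            fields[key] = value
    # Physical address = page number * 4096 + offset (assumes 4 KiB pages).
    fields["address"] = (int(fields["page"], 16) << 12) | int(fields["offset"], 16)
    return fields


line = (
    "May 17 22:35:46 RackServer kernel: EDAC MC0: 1 CE memory scrubbing error on "
    "CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xa58102 offset:0x0 "
    "grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c0 socket:0 ha:0 "
    "channel_mask:2 rank:0)"
)
fields = parse_edac_details(line)
print(fields["label"], "channel", fields["channel"], "slot", fields["slot"])
print(hex(fields["address"]))
```

The computed address matches the `ADDR a58102000` line in the same log, which suggests the page and offset fields are being read correctly.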
MvL Posted May 18, 2019
The Passmark memtest detected ECC errors. I'm trying to figure out which module is defective.