May 18, 2019
Hi, I have found machine check events on my server. Fix Common Problems has warned me three times now, so I guess it's time to find out what the problem is. Can someone help me solve the issue? I have some ideas about the cause. I have attached my diagnostics file to this post: rackserver-diagnostics-20190518-0936.zip
May 18, 2019, Author
I see this in the syslog:
Quote
May 17 22:35:46 RackServer kernel: mce: [Hardware Error]: Machine check events logged
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 13: 8c000049000800c0
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: TSC ab5c2e16c9629
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: ADDR a58102000
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: MISC 90000008000928c
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: PROCESSOR 0:306f2 TIME 1558125346 SOCKET 0 APIC 0
May 17 22:35:46 RackServer kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xa58102 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:2 rank:0)
Are these memory errors? I've checked my BIOS but I can't find anything related to this.
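For anyone reading along, here is a minimal sketch of how the last syslog line above can be pulled apart in Python. The regular expression and the function name `parse_edac_line` are made up for this illustration and only cover the line format quoted in this post; other EDAC drivers or kernel versions may print something different. The address calculation assumes 4 KiB pages, which is why page 0xa58102 with offset 0x0 lines up with the ADDR a58102000 line.

```python
import re

# Sample line copied from the syslog quoted in this thread.
LINE = (
    "May 17 22:35:46 RackServer kernel: EDAC MC0: 1 CE memory scrubbing error on "
    "CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xa58102 offset:0x0 "
    "grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c0 socket:0 ha:0 "
    "channel_mask:2 rank:0)"
)

# Hypothetical pattern written for this one line format only.
PATTERN = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) (?P<msg>.+?) on "
    r"(?P<label>\S+) \(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+)"
)


def parse_edac_line(line):
    """Return the interesting fields of an EDAC error line, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("mc", "count", "channel", "slot"):
        fields[key] = int(fields[key])
    fields["page"] = int(fields["page"], 16)
    fields["offset"] = int(fields["offset"], 16)
    # Assumption: 4 KiB pages, so the physical address is page * 4096 + offset.
    fields["address"] = (fields["page"] << 12) | fields["offset"]
    return fields


info = parse_edac_line(LINE)
print(info["kind"], info["label"], info["channel"], info["slot"], hex(info["address"]))
```

CE stands for "corrected error" and UE for "uncorrected error" in EDAC messages, so the `kind` field is the quickest thing to check.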
May 18, 2019, Author
Thank you for the reply. The weirdest part is that I don't see any errors in my BIOS log files. Any idea how to diagnose this further?
Edited May 18, 2019 by MvL
May 18, 2019, Author
Investigating! I found the memtest86 entry that shows up in the unRAID boot menu. It's running at the moment. I'll keep you informed.
May 18, 2019, Community Expert
memtest86 won't detect ECC errors; Passmark memtest might in some cases.
May 18, 2019, Community Expert
Quote (MvL): So this is an ECC error?
Looks like a corrected error; the board's system event log might also have some more info.
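As a rough cross-check of the "corrected error" reading, the Bank 13 status value from the syslog (8c000049000800c0) can be decoded by hand. The sketch below assumes the standard Intel machine-check status register layout (VAL in bit 63, UC in bit 61, and so on); the function name `decode_mci_status` is invented for this example, and the corrected-error count field is only meaningful on CPUs that support corrected error counting.

```python
# Architectural flag bits of an Intel IA32_MCi_STATUS register (assumed layout).
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds extra information
    58: "ADDRV",  # ADDR register holds the failing address
    57: "PCC",    # processor context may be corrupted
}


def decode_mci_status(status):
    """Split a 64-bit machine-check status value into its main fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "uncorrected": bool((status >> 61) & 1),
        # Bits 52:38, valid only where corrected error counting is supported.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


decoded = decode_mci_status(0x8C000049000800C0)
print(decoded["flags"], "uncorrected:", decoded["uncorrected"])
print("err_code: %04x:%04x" % (decoded["model_specific_code"], decoded["mca_error_code"]))
```

With the UC bit clear and a corrected count of 1, this agrees with the "1 CE" in the EDAC line, and the two code fields match the `err_code:0008:00c0` printed there.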
May 18, 2019, Author
Appreciate your guidance. I've checked the event log via IPMI and there are no new events; the latest one is from 2019-5-2. I have put Passmark memtest on a USB stick and it is now running.
May 18, 2019, Community Expert
If memtest shows no errors, the best bet is to remove one or more DIMMs at a time and see if the log errors go away.
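When swapping DIMMs like this, it can help to tally the corrected errors per DIMM label after each test run. The following is only a sketch: the regular expression is written against the one EDAC line quoted earlier in this thread, the function name `count_ce_by_dimm` is made up, and the second sample line is invented purely to show the counting.

```python
import re
from collections import Counter

# Matches lines such as "EDAC MC0: 1 CE memory scrubbing error on <label> (...)".
CE_RE = re.compile(r"EDAC MC\d+: (\d+) CE .*? on (\S+)")


def count_ce_by_dimm(lines):
    """Sum corrected-error counts per DIMM label over an iterable of log lines."""
    totals = Counter()
    for line in lines:
        match = CE_RE.search(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


sample = [
    # Real line from this thread (shortened after the label).
    "May 17 22:35:46 RackServer kernel: EDAC MC0: 1 CE memory scrubbing error on "
    "CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xa58102 offset:0x0)",
    # Invented line, for illustration only.
    "May 17 23:10:02 RackServer kernel: EDAC MC0: 2 CE memory read error on "
    "CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xa58200 offset:0x0)",
    "May 17 23:10:02 RackServer kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR",
]

for label, total in count_ce_by_dimm(sample).items():
    print(label, total)
```

To run it against a live log, pass an open file instead of the sample list, for example `count_ce_by_dimm(open("/var/log/syslog"))`; the path is an assumption and may differ on your system.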
May 18, 2019, Author
Okay. The unRAID log is reporting Chan#1_DIMM#0. I'm guessing this is DIMM A2? That is, the first DIMM position (A) on the motherboard, then the second slot of that position, hence DIMM A2? I'm guessing there is also a channel 0, so channel 0 --> slot 1 and channel 1 --> slot 2.
Edited May 18, 2019 by MvL
May 18, 2019, Author
The Passmark memtest detected ECC errors. I'm trying to figure out which module is defective.
This topic is now archived and is closed to further replies.