MvL Posted May 18, 2019
Hi, I have found machine check events on my server. Fix Common Problems has warned me three times now, so I guess it's time to find out what the problem is. Can someone help me solve the issue? I have my own thoughts about what the issue is. I have attached my diagnostics file to this post: rackserver-diagnostics-20190518-0936.zip
MvL Posted May 18, 2019
I see this in the syslog:
May 17 22:35:46 RackServer kernel: mce: [Hardware Error]: Machine check events logged
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 13: 8c000049000800c0
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: TSC ab5c2e16c9629
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: ADDR a58102000
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: MISC 90000008000928c
May 17 22:35:46 RackServer kernel: EDAC sbridge MC0: PROCESSOR 0:306f2 TIME 1558125346 SOCKET 0 APIC 0
May 17 22:35:46 RackServer kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0xa58102 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:2 rank:0)
Are these memory errors? I've checked my BIOS but I can't find anything related to this.
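For readers who want to see what the `Bank 13: 8c000049000800c0` value says, here is a minimal sketch that decodes the architectural fields of a machine-check status register. It assumes the bit layout of `IA32_MCi_STATUS` as described in the Intel SDM machine-check chapter; the function name, dictionary keys and `MEMORY_OPS` table are illustrative, and the corrected-error count is only meaningful on CPUs that report it.

```python
# Sketch only: decode the architectural bits of an IA32_MCi_STATUS value.
# Bit positions are taken from the Intel SDM; all names here are illustrative.

# "MMM" field of a memory-controller error code (0000 0000 1MMM CCCC).
MEMORY_OPS = {
    0: "generic undefined request",
    1: "memory read",
    2: "memory write",
    3: "address/command",
    4: "memory scrubbing",
}


def decode_mci_status(status: int) -> dict:
    """Return the main flag and code fields of a 64-bit MCi_STATUS value."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bool((status >> 63) & 1),            # VAL
        "overflow": bool((status >> 62) & 1),         # OVER
        "uncorrected": bool((status >> 61) & 1),      # UC
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
    }
    # Memory-controller errors use the compound code 0000 0000 1MMM CCCC.
    if mca_code & 0xFF80 == 0x0080:
        fields["memory_op"] = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
    return fields


if __name__ == "__main__":
    for name, value in decode_mci_status(0x8C000049000800C0).items():
        print(f"{name}: {value}")
```

For the value in the log, the UC bit is clear and the two code fields come out as 0008 and 00c0, which lines up with the `1 CE memory scrubbing error` and `err_code:0008:00c0` text that the EDAC driver printed.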
JorgeB Posted May 18, 2019
Are these memory errors?
Yes, at least that's what it looks like.
MvL Posted May 18, 2019
Thank you for the reply. The weirdest part is that I don't see any errors in my BIOS log files. Any idea how to diagnose this further?
Edited May 18, 2019 by MvL
MvL Posted May 18, 2019
Investigating! I found the memtest86 entry that is offered in the unRAID boot menu. It's running at the moment. I'll keep you informed.
JorgeB Posted May 18, 2019
memtest86 won't detect ECC errors; PassMark memtest might in some cases.
MvL Posted May 18, 2019
So this is an ECC error? Googling PassMark memtest.
JorgeB Posted May 18, 2019
13 minutes ago, MvL said: So this is an ECC error?
It looks like a corrected error. The board's system event log might also have some more info.
MvL Posted May 18, 2019
I appreciate your guidance. I've checked the event log via IPMI and there are no new events; the latest one is from 2019-5-2. I have put PassMark memtest on a USB stick and it is now running.
JorgeB Posted May 18, 2019
If memtest shows no errors, the best bet is to remove one or more DIMMs at a time and see if the errors in the log go away.
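While swapping modules, it can help to watch the kernel's own per-DIMM counters rather than waiting for the next syslog entry. The sketch below is only one possible approach: it assumes the EDAC driver exposes the `dimm_label`, `dimm_location`, `dimm_ce_count` and `dimm_ue_count` files under `/sys/devices/system/edac/mc`, which depends on the kernel version and driver, so missing files are reported as `None`. The function names are made up for illustration.

```python
# Sketch only: list per-DIMM corrected/uncorrected error counters from EDAC sysfs.
# Assumes the mc*/dimm* layout is present; files that are missing come back as None.
from pathlib import Path


def _read(path: Path):
    """Return the stripped contents of a sysfs file, or None if it is absent."""
    return path.read_text().strip() if path.is_file() else None


def _read_int(path: Path):
    """Return a sysfs counter as an int, or None if it is absent or empty."""
    text = _read(path)
    return int(text) if text else None


def read_dimm_counters(edac_root="/sys/devices/system/edac/mc"):
    """Collect one dictionary per DIMM directory found under edac_root."""
    rows = []
    for dimm in sorted(Path(edac_root).glob("mc*/dimm*")):
        rows.append({
            "controller": dimm.parent.name,
            "dimm": dimm.name,
            "label": _read(dimm / "dimm_label"),
            "location": _read(dimm / "dimm_location"),
            "ce_count": _read_int(dimm / "dimm_ce_count"),
            "ue_count": _read_int(dimm / "dimm_ue_count"),
        })
    return rows


if __name__ == "__main__":
    for row in read_dimm_counters():
        print(row)
```

Running it before and after pulling a module shows whether the corrected-error count keeps climbing on the remaining DIMMs.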
MvL Posted May 18, 2019
Okay. The unRAID log is reporting Chan#1_DIMM#0. I'm guessing this is DIMM A2? That is, the first DIMM position (A) on the motherboard and then the second slot of DIMM A, hence DIMM A2? I'm guessing there is also a channel 0, so channel 0 --> slot 1 and channel 1 --> slot 2.
Edited May 18, 2019 by MvL
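To avoid misreading the location by eye, the EDAC line can be split into its fields. This is a rough sketch that assumes the message keeps the `key:value` layout shown in the syslog excerpt earlier in the thread; `EDAC_LINE` and `parse_edac_location` are illustrative names, and the physical address assumes 4 KiB pages. It only reports the zero-based channel and slot numbers that the kernel prints. How those map to the silk-screen labels on the board (A1, A2, B1 and so on) is board-specific and has to come from the motherboard manual.

```python
# Sketch only: pull the location fields out of an EDAC memory error line.
# Assumes the "key:value" layout seen in the thread's syslog excerpt.
import re

EDAC_LINE = (
    "EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#1_DIMM#0 "
    "(channel:1 slot:0 page:0xa58102 offset:0x0 grain:32 syndrome:0x0 - "
    "area:DRAM err_code:0008:00c0 socket:0 ha:0 channel_mask:2 rank:0)"
)


def parse_edac_location(line: str) -> dict:
    """Return the DIMM label plus numeric channel, slot, page and address."""
    label = re.search(r"error on (\S+)", line)
    # Everything after the first "(" is a list of key:value pairs.
    _, _, detail = line.partition("(")
    pairs = dict(re.findall(r"(\w+):(\S+)", detail.rstrip(")")))
    page = int(pairs["page"], 16)
    offset = int(pairs["offset"], 16)
    return {
        "label": label.group(1) if label else None,
        "socket": int(pairs["socket"]),
        "channel": int(pairs["channel"]),  # zero-based, as printed by the kernel
        "slot": int(pairs["slot"]),        # zero-based, as printed by the kernel
        "rank": int(pairs["rank"]),
        "page": page,
        "phys_addr": (page << 12) | offset,  # assumes 4 KiB pages
    }


if __name__ == "__main__":
    loc = parse_edac_location(EDAC_LINE)
    print(loc["label"], "channel", loc["channel"], "slot", loc["slot"])
    print("physical address:", hex(loc["phys_addr"]))
```

For the line in this thread, the computed address matches the `ADDR a58102000` value from the same machine check event, which is a quick sanity check that the fields were read correctly.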
MvL Posted May 18, 2019
The PassMark memtest detected ECC errors. I'm trying to figure out which module is defective.