Machine Check Events detected on your server



Hello: I have been a member since March 2018 but have never had any problems with Unraid or my first server. Technically I am a newbie in the Linux world, although I have worked with Windows for many years. I have just started setting up my 2nd Unraid server, a Supermicro that is similar to my 1st server.

Unraid - Version 6.8.3

Motherboard - Supermicro X9DRi-LN4+/X9DR3-LN4+, Version REV: 1.10 - BIOS AMI Version 3.3

Processor - Intel Xeon CPU E5-2690 0 @ 2.90 GHz

Memory - 64GB DDR3 Multi-bit ECC 

 

So far I have installed 13 hard drives: 2 parity, 1 cache and 10 data. The drives do not contain any data, since they were only formatted last night. The system was so "noisy" that I cut the power off. I started it today to perform the Parity-Sync/Data-Rebuild and discovered the Machine Check problem. I have attached the Diagnostics and Syslog zip files. I do not know how to run the mcelog program noted in the error message. Please help!

Thank you,

Richard

server16-diagnostics-20200805-0914.zip server16-syslog-20200805-1913.zip

Link to comment

Thanks, I will check whether the BIOS is up to date. I also tried running with 32GB instead of 64GB. The first run crashed 4 hours into my parity build. The second run, using the other 32GB, seems okay: it completed the parity build in 15 hours and ran for another 8 before I shut it down, so the second 32GB appears to be good. I have now added 2 sticks back (48GB total) and am testing while loading files to the array. Hopefully I can find out whether there are any bad memory sticks.

Thanks again

Link to comment

Your MCEs are actually memory issues:

 

Aug  5 07:42:21 Server16 kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Aug  5 07:42:21 Server16 kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 5: cc12f00000010092
Aug  5 07:42:21 Server16 kernel: EDAC sbridge MC0: TSC 0 
Aug  5 07:42:21 Server16 kernel: EDAC sbridge MC0: ADDR 85add2080 
Aug  5 07:42:21 Server16 kernel: EDAC sbridge MC0: MISC 40404086 
Aug  5 07:42:21 Server16 kernel: EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1596649337 SOCKET 0 APIC 0
Aug  5 07:42:21 Server16 kernel: EDAC MC0: 19392 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#0 (channel:2 slot:0 page:0x85add2 offset:0x80 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0092 socket:0 ha:0 channel_mask:4 rank:0)

Your system event log in the BIOS will hopefully pinpoint the actual stick, rather than just Channel 2, DIMM 0.
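If the syslog contains many lines like the ones above, it can help to total the corrected errors per channel and slot. The following is only a rough Python sketch, not an Unraid or EDAC tool: the names `parse_ce_line` and `tally` are made up for illustration, and the pattern assumes the "CE memory read error" line format shown in the excerpt above, with the leading number being the corrected-error count.

```python
import re

# Matches the "EDAC MCn: <count> CE ... on <label> (channel:X slot:Y ..." lines
# seen in the excerpt above. The "EDAC sbridge MC0:" lines do not match.
CE_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+)"
)


def parse_ce_line(line):
    """Return the fields of one corrected-error line, or None if it is not one."""
    m = CE_LINE.search(line)
    if m is None:
        return None
    return {
        "mc": int(m["mc"]),
        "count": int(m["count"]),
        "label": m["label"],
        "channel": int(m["channel"]),
        "slot": int(m["slot"]),
    }


def tally(lines):
    """Sum corrected-error counts per (memory controller, channel, slot)."""
    totals = {}
    for line in lines:
        hit = parse_ce_line(line)
        if hit is not None:
            key = (hit["mc"], hit["channel"], hit["slot"])
            totals[key] = totals.get(key, 0) + hit["count"]
    return totals
```

Feeding it the lines of the syslog file from the diagnostics zip (for example `tally(open(path))`, where `path` is wherever you extracted the file) should give one total per channel/slot. It still only narrows things down to a slot, not to a specific stick.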

 

The way most systems with ECC work, AFAIK, is that they keep correcting errors until they are unable to, at which point the system just stops completely. Based on the number of corrections listed, one or more of the sticks are in very bad shape.
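The correction counts can usually also be read from the running system instead of from the log. The sketch below assumes the kernel's EDAC driver exposes per-DIMM counters under `/sys/devices/system/edac/mc/` (files named `dimm_ce_count` and `dimm_label` inside `mc*/dimm*` directories); whether that layout is present depends on the kernel and driver, so treat the paths as an assumption. The function name `read_dimm_ce_counts` is hypothetical.

```python
from pathlib import Path


def read_dimm_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """List (controller, dimm, label, corrected_errors) for each DIMM entry.

    Assumes the EDAC sysfs layout mc*/dimm*/dimm_ce_count; returns an empty
    list if nothing matching is found under edac_root.
    """
    results = []
    for counter in sorted(Path(edac_root).glob("mc*/dimm*/dimm_ce_count")):
        dimm_dir = counter.parent
        label_file = dimm_dir / "dimm_label"
        # The label is optional; fall back to an empty string if it is missing.
        label = label_file.read_text().strip() if label_file.exists() else ""
        count = int(counter.read_text().strip())
        results.append((dimm_dir.parent.name, dimm_dir.name, label, count))
    return results
```

Calling `read_dimm_ce_counts()` now and then while the array is under load, and watching which entry keeps climbing, is one way to see where the corrections are happening without waiting for a freeze.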

Link to comment

Squid - Thanks for your help. I did realize that it was a memory stick problem, as noted in my reply to civic95man. The system froze twice running the parity build with 8 memory sticks. I then ran the parity build twice with 4 sticks each time. The system froze the 1st time and then ran successfully the 2nd time, in 15+ hours with the other 4 sticks and no error messages. I am now trying to find the bad ones by testing them one at a time alongside 3 "good" sticks. Very time-consuming process! I am doing this while copying files to the array from my other system with Beyond Compare. The problem is that memory usage is very low. Is there a faster and better way to find out which sticks are bad?

Thanks again, 

Richard

Link to comment
16 minutes ago, rktomasa said:

Squid - Thanks for your help. I did realize that it was a memory stick problem, as noted in my reply to civic95man. The system froze twice running the parity build with 8 memory sticks. I then ran the parity build twice with 4 sticks each time. The system froze the 1st time and then ran successfully the 2nd time, in 15+ hours with the other 4 sticks and no error messages. I am now trying to find the bad ones by testing them one at a time alongside 3 "good" sticks. Very time-consuming process! I am doing this while copying files to the array from my other system with Beyond Compare. The problem is that memory usage is very low. Is there a faster and better way to find out which sticks are bad?

Thanks again, 

Richard

Why so complicated? Check the RAM sticks one by one in a "known good slot" - that's it.

It only takes a few minutes.

Edited by Zonediver
Link to comment

Zonediver - Do you mean that I can check the sticks one at a time, without any others installed, and boot the system each time? The memory layout for the Supermicro board shows a minimum of 4 sticks (2 CPUs, 2 sticks each). Will it boot safely, or at all, with only 1 stick?

Thanks,

Richard 

Link to comment
On 8/8/2020 at 2:09 AM, rktomasa said:

Zonediver - Do you mean that I can check the sticks one at a time, without any others installed, and boot the system each time? The memory layout for the Supermicro board shows a minimum of 4 sticks (2 CPUs, 2 sticks each). Will it boot safely, or at all, with only 1 stick?

Thanks,

Richard 

OK, that's a special case - check the manual for the minimum number of RAM sticks required.

If you can't test with only one stick, then you need additional hardware where you can check each stick one by one.

Edited by Zonediver
Link to comment
