Unraid keeps crashing: Machine Check Events


Solved by singha

Apologies if I am posting this in the wrong area. This is my first time on the forum.

 

My server has crashed twice, and each time the Fix Common Problems plugin reported "Machine Check Events detected on your server". I've been running this server for almost 1.5 years. The first crash was 3 days ago and the second was today. NerdPack and mcelog are already installed.

 

In the BIOS, "Power Supply Idle Control" is set to "Typical Current Idle", C-states are disabled globally, and DOCP is disabled (it has been set up like this for ~1 year).

 

I would appreciate your technical expertise, and thank you for your time.

 

EDIT: The BIOS has not been updated in over a year. Both times the MCE occurred, the server was under heavy load (22 of 32 threads in use).
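For anyone wanting to check their own logs for the same kind of event, here is a minimal sketch that filters log text for lines that look like machine check reports. The pattern list and the function name `find_mce_lines` are my own assumptions for illustration, not anything produced by mcelog or Unraid, so adjust the patterns to whatever your syslog actually contains.

```python
import re

# Substrings that commonly show up in kernel log lines about machine checks.
# This list is an assumption; extend or narrow it to match your own logs.
MCE_PATTERN = re.compile(r"machine check|mce:|hardware error", re.IGNORECASE)


def find_mce_lines(log_text):
    """Return the lines of log_text that look like machine check reports."""
    return [line for line in log_text.splitlines() if MCE_PATTERN.search(line)]
```

To use it, read a saved syslog into a string (for example `open("syslog.txt", errors="replace").read()`, where the file name is hypothetical) and print whatever `find_mce_lines` returns. The timestamps on the matching lines can then be compared against when the heavy workload was running.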

 

tower-diagnostics-20230320-0829.zip

Edited by singha

Ran memtest86+ (the one included with Unraid) for 48 hours last weekend and am now running it again. No errors.

 

Disabled PBO, enabled SR-IOV, and changed the power mode in the Tips and Tweaks plugin to performance. Let's see if it makes a difference. This error doesn't occur very often, though, so it might take some time to resurface.

  • 3 weeks later...
  • 4 weeks later...

The crashes only happened during one very specific (and thermally demanding) workload. As long as I didn't run that workload, the server didn't crash.

 

Memtest didn't show any errors (I ran it for a week), and given that the system was stable except under that one workload, the chance of a memory problem seemed low.

 

The one thing that stood out was that the system was running very hot. Normally, CPUs and VRMs will throttle if they get too hot. The only thing that won't throttle is RAM.

 

I increased the fan speeds everywhere (I had previously set a custom, quieter fan curve) and the server hasn't crashed for ~1 month now, even when running that specific workload.
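If you want to confirm whether heat lines up with your crashes, something like the sketch below can read temperatures from the Linux hwmon sysfs interface, where each `temp*_input` file holds a value in millidegrees Celsius. The function names and the 80 °C default limit are assumptions chosen for illustration, not a recommendation for any particular hardware, and which sensors appear depends on your board and loaded drivers.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/tempX_input": degrees Celsius} for readable sensors."""
    temps = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        # Fall back to the directory name if the chip does not expose a name.
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for sensor in sorted(chip.glob("temp*_input")):
            try:
                millidegrees = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # skip sensors that fail to read or return junk
            temps[f"{chip.name}:{chip_name}/{sensor.name}"] = millidegrees / 1000.0
    return temps


def too_hot(temps, limit_c=80.0):
    """Return the sensor keys at or above limit_c (the limit is an assumed value)."""
    return sorted(key for key, value in temps.items() if value >= limit_c)
```

Calling `too_hot(read_hwmon_temps())` periodically while the heavy workload runs, and logging the result, would show whether any sensor climbs before a crash. Note that RAM modules often have no sensor exposed here, so this mostly tells you about the CPU, drives, and board.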

 

Hypothetically, it could still be bad RAM. If the RAM overheats, a bit flip could occur in kernel space and cause a crash. Nevertheless, in my case, the solution was to increase cooling.

 

EDIT: MCE can refer to a lot of things, so I'm not sure whether this applies to your problem.

