MCE errors and random freezes


My server was running fine for some time now. Here are the specs:

 

Unraid version: 6.9.2

Asus Prime B350-PLUS

Ryzen 7 1700 @ 3000 MHz

32 GB DDR4 in 4 DIMMs (2x 8 GB @ 3000 MHz + 2x 8 GB @ 3200 MHz), all running at 3000 MHz

 

The CPU was previously overclocked to 3.7 GHz, as I ran my gaming setup as a VM on the server. Since moving to a dedicated gaming rig, I have reset all overclocking settings in the BIOS to stock values.

 

After this the server started to freeze up randomly, usually daily. When this happens it is apparently still running (case lights are on ;) ) but is not accessible in any way, since the network stack just stops working. The only way to bring it back is a hard reset.

 

Since this behavior started, I'm getting the following error messages in the syslog:

Mar  3 08:39:49 Nexus kernel: mce: [Hardware Error]: Machine check events logged
Mar  3 08:39:49 Nexus kernel: mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 5: bea0000000000108
Mar  3 08:39:49 Nexus kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff813c3054 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Mar  3 08:39:49 Nexus kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1646293169 SOCKET 0 APIC 6 microcode 8001138
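The status word in the "Bank 5" line can at least be split into its flag bits. Below is a minimal sketch, assuming the standard x86 machine-check status layout (VAL in bit 63 down to PCC in bit 57, error code in the low 16 bits). The function name `decode_mce_status` is made up for illustration, and the AMD-specific fields are deliberately left out.

```python
# Assumed bit positions, following the architectural x86 MCA status layout.
STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register holds valid data
    "ADDRV": 58,  # ADDR register holds valid data
    "PCC": 57,    # processor context may be corrupt
}


def decode_mce_status(status: int) -> dict:
    """Split a 64-bit machine-check status word into flags and error code."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    flags["error_code"] = status & 0xFFFF  # low 16 bits
    return flags


# Status value copied from the syslog line above.
decoded = decode_mce_status(0xBEA0000000000108)
for name, value in decoded.items():
    print(name, hex(value) if name == "error_code" else value)
```

If the assumed layout is right, this reports UC and PCC as set and OVER as clear for the value above, i.e. an uncorrected error with possibly corrupt processor context, which would at least fit the hard freezes. It does not say which component is at fault.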

After seeing these errors, I ran a memtest check overnight, which did not bring up any errors.

 

I attached diagnostics. Note that they are from a running system, i.e. NOT taken after a crash: as I said, when the server crashes, it crashes for good and I cannot access any logs.

 

The only changes between a perfectly running system and one that crashes often are reverting the CPU to stock settings and replacing the crappy PSU with a good one. Maybe one more thing: I used two of the RAM sticks in my new gaming rig for a short while, before the new RAM arrived. After that the sticks were put back into the server. That said, memtest did not detect any errors. I do know this does not mean there are none, but still.

 

My ideas for further troubleshooting are:
- run the server with only 2 RAM sticks at a time to see if this changes anything

- reset the BIOS settings to default, in case I f***ed something up while removing the overclock

 

Any further ideas? Especially about the error message, as I don't really get what it is trying to tell me ;) 

nexus-diagnostics-20220303-2019.zip

Edited by Namarath
