Unraid 6.7.2 - Freezes Overnight - MCE Hardware Errors


Arizuia

Hello all you fine fellas.

I've recently upgraded my Unraid machine from an old, chunky Q6600 machine to an mATX build. Some specs:

CPU: Ryzen R5 1400 @ stock
RAM: 16GB DDR4 @ 3200MHz
MOBO: Gigabyte B450 Aorus M
PSU: Seasonic 550W (Good Tier)

 

The server is used as a NAS and for running servers in a Linux VM.

 

The first night the server was fine, but the other night it froze and became unresponsive; it didn't even let me type in the terminal. The same thing happened the next night.

I installed the Fix Common Problems plugin and it reported an MCE hardware error. I then looked into the log and found this:

mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 5: bea0000000000108
mce: [Hardware Error]: TSC 0 ADDR 1ffff816560ea MISC d012000100000000 SYND 4d000000 IPID 500b000000000
mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1571423384 SOCKET 0 APIC 0 microcode 8001138
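
For anyone reading a log like the one above, the status word on the "Bank 5" line and the TIME field can be unpacked by hand. Below is a minimal sketch in Python, assuming the standard x86 machine-check status layout for the top flag bits (63 down to 57) and that TIME is Unix epoch seconds. The names STATUS_FLAGS, decode_status and decode_time are made up for illustration, and the AMD-specific bits and the meaning of the low error code are deliberately not interpreted here.

```python
from datetime import datetime, timezone

# Flag bits in the upper part of an x86 machine-check status register.
# Only the architectural bits 63..57 are listed; vendor-specific bits
# below them are intentionally left out of this sketch.
STATUS_FLAGS = {
    63: "VAL (register holds a valid error)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (uncorrected error)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register is valid)",
    58: "ADDRV (ADDR register is valid)",
    57: "PCC (processor context corrupt)",
}


def decode_status(status_hex):
    """Return (list of set flag names, low 16-bit error code)."""
    status = int(status_hex, 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    error_code = status & 0xFFFF  # raw code only; not interpreted here
    return flags, error_code


def decode_time(epoch_seconds):
    """Convert the TIME field (assumed Unix epoch seconds) to a UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    # Values copied from the log lines in the post above.
    flags, code = decode_status("bea0000000000108")
    for flag in flags:
        print(flag)
    print("error code:", hex(code))
    print("logged at:", decode_time(1571423384).isoformat())
```

With the values from this log, the set flags are VAL, UC, EN, MISCV, ADDRV and PCC, and the timestamp works out to 2019-10-18 18:29:44 UTC. So this was an uncorrected error that the CPU flagged as having corrupted processor context, which fits a hard freeze, but on its own it does not say whether the CPU or the RAM is at fault.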

 

It seems to be either a CPU or a RAM issue, but the same components were in my main system for about a year before I upgraded that system, and I had no issues with them during that time. Seems odd.

 

I thank anyone who can help me with this.

4 minutes ago, Squid said:

Make sure that C-States are disabled in the BIOS

Actually, the better solution is to look for a "Power supply idle mode" setting in the BIOS* and set it to "Typical current idle" rather than the default "Auto". That still allows the CPU to enter C states but doesn't allow the power to drop so low that it can't wake up again. The issue only affects 1000-series processors.

 

*Typically under Advanced -> AMD CBS -> Zen Common Options

3 hours ago, trurl said:

Have you done memtest?

 

Set up Syslog Server:

 

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=781601

 

and get us the diagnostics plus whatever syslog captures before the crash.

 

Tools - Diagnostics, attach complete diagnostics zip file to your next post.

 

I've left memtest running for a while. With the RAM overclocked to 3200MHz there were quite a lot of errors, over 700. Removing the overclock has reduced that to 2 so far.

Attaching the diagnostics file as you asked.

kotiservu-diagnostics-20191018-1834.zip

On 10/19/2019 at 5:08 AM, John_M said:

Don't run the CPU's memory controller beyond its spec. You have to set up a server differently from a gaming machine. Two memory errors is a fail.

After reseating the RAM and keeping it at stock speed, everything seems to be fine now. No memtest errors, and the server didn't crash or freeze last night.

 

Very late update: I bought a used Ryzen 3200G and it solved everything. It seems the Ryzen 1400 has some early Zen bugs affecting stability under Linux.

Edited by Arizuia
late update
