Random Shutdown MCE Unraid 6.8.3



Hello everyone,

 

I'm running Unraid 6.8.3 on an AMD Ryzen 7 1700 with 4x 2TB hard drives (1 parity) and 2 SSDs for cache; one SSD is 128GB and the other is 256GB.

For memory I have 32GB of DDR3 2144MHz non-ECC RAM.

The mainboard is a Gigabyte B450 AORUS ELITE (works like a charm).

For GPUs I have an old EVGA GTX 980 Ti Hybrid and an MSI GTX 710.

 

Plugins:

  • Nerdpack
  • ZFS(not in use)
  • NVIDIA UNRAID(not in use)
  • DenyHosts
  • userScripts
  • upnp-monitor
  • fix-common-problems
  • parity-check-tuning
  • gui-links(not in use)
  • dynamixsystemtemp
  • dynamixCacheDirs
  • dynamixActiveStreams
  • community applications
  • disableSecurityAMD
  • turboCache
  • moverTuning
  • dynamixSSDtrim

 

I have had some problems over the past few days and tried many different fixes. I am now at the point where I have to ask you for help.

 

My system had been running great for 85 days.

Some weeks ago it started to crash randomly and become unresponsive: no WebGUI, no CLI, no reaction to the keyboard, and no reaction to the power button on the chassis.

So I did some research and found that with AMD processors the BIOS should be set to "Typical Current Idle". I changed it and the system ran for some days.

Since that change, the server randomly reboots.

So I changed the CPU frequency setting from "Auto" to a fixed value. It then worked for some days, and now it randomly reboots again.

 

I have no other ideas and it is getting a little annoying, so I hope someone can help me out at this point.

 

Please see my Diagnostics attached.

In the syslog you will notice a hardware error reported for the CPU, but I am not quite sure what it means in detail.

 

Greetings and stay healthy.

mps-diagnostics-20200524-1440.zip

Edited by D'n'S137
Some Corrections
Link to comment

Dec  4 22:22:52 MPS kernel: mce: [Hardware Error]: Machine check events logged
Dec  4 22:22:52 MPS kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
Dec  4 22:22:52 MPS kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffff8114fef2 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Dec  4 22:22:52 MPS kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1607116951 SOCKET 0 APIC 2 microcode 8001138
Dec  4 22:22:52 MPS kernel:  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14 #15
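For anyone wondering how to read the status value in the lines above, here is a minimal sketch that pulls the CPU, bank, and status word out of such a syslog line and decodes the top flag bits. It only covers the architectural bits 63 to 57 of the machine check status register, which sit at the same positions on Intel and AMD; the lower, vendor-specific bits are not interpreted. The function names `parse_mce_line` and `decode_status` are made up for this example.

```python
import re

# Architectural flag bits of the machine check status register.
# Bits below 57 are vendor-specific and deliberately not decoded here.
STATUS_FLAGS = {
    63: "VAL (register holds a valid error)",
    62: "OVER (an earlier error was overwritten)",
    61: "UC (error was not corrected by hardware)",
    60: "EN (reporting was enabled for this error)",
    59: "MISCV (MISC register is valid)",
    58: "ADDRV (ADDR register is valid)",
    57: "PCC (processor context may be corrupt)",
}

# Matches the "CPU n: Machine Check: n Bank n: <16 hex digits>" part of the line.
MCE_LINE = re.compile(
    r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]{16})"
)


def decode_status(status):
    """Return the set flag names (highest bit first) and the low 16-bit error code."""
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {"flags": flags, "error_code": status & 0xFFFF}


def parse_mce_line(line):
    """Extract CPU, bank and decoded status from one syslog line, or None."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    status = int(match.group(3), 16)
    result = {
        "cpu": int(match.group(1)),
        "bank": int(match.group(2)),
        "status": status,
    }
    result.update(decode_status(status))
    return result


if __name__ == "__main__":
    sample = (
        "Dec  4 22:22:52 MPS kernel: mce: [Hardware Error]: "
        "CPU 1: Machine Check: 0 Bank 5: bea0000000000108"
    )
    info = parse_mce_line(sample)
    print(f"CPU {info['cpu']}, bank {info['bank']}, "
          f"error code {info['error_code']:#06x}")
    for flag in info["flags"]:
        print("  " + flag)
```

For the value `bea0000000000108` this reports VAL, UC, EN, MISCV, ADDRV and PCC as set and OVER as clear, so it is an uncorrected error rather than a corrected one. That fits a sudden reset, but it does not by itself say which component is at fault.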

Link to comment

I backed up my data from the system (in Safe Mode with GUI) to another drive. To do so I had to take one of the cache drives out of the system, because otherwise I ran into a licence issue.

I noticed that the system ran fine during the whole copying process. So I gave it a try, put the cache drive back in, and rebooted. After about an hour and a half the system crashed again. I removed the cache drive again, rebooted into Safe Mode with GUI, started the array, and let the system run. After 17 hours the system crashed again, right next to me.

 

The screen goes black, it beeps twice, and then it reboots without warning or any sign of a shutdown script running.
Maybe it is a hardware error or a stability issue from the mainboard configuration.

 

I will do some further testing:

  • Run some memtests
  • Play with CPU voltage and clock in an attempt to stabilise it
  • Play with memory clocks (maybe slowing them down?)

 

Open questions for me:

  • Is it even possible that this is a GPU error?
  • If yes, how do I test the GPU?
  • Is there benchmark software or an OS to use for hardware tests?

Maybe someone can help me with one of those questions.

 

Kind regards

Link to comment

I might not be of great help here but I will try to help you with what I would do in your situation ;)

First of all some questions:

 

What Power Supply do you use in the system? Are you sure it works correctly and the power it delivers is sufficient?

 

Is the RAM that you are using a set of 4 sticks, or did you have 2 sticks from one brand and add another pair from another manufacturer?

In your Log-Files I can see the Manufacturer of Channel B RAM but nothing for the Channel A RAM which made me curious.

 

I am not a Guru in these matters but I would try a couple of things.

1. Remove any overclocking set in the BIOS, including XMP profiles for the RAM - test if the error persists

2. Run Memtest for at least 24h - does it run through without errors?

3. Swap the power supply with a replacement if available - test if the error persists
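To check whether the error persists after each of these steps, it can help to tally the machine check lines in the syslog per CPU and bank, and compare the tally before and after a change. The following is only a rough sketch: `count_mce_events` is a made-up helper, it expects the syslog contents as a string, and it assumes the lines look like the ones posted earlier in this thread.

```python
import re
from collections import Counter

# Matches lines such as:
#   mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108
MCE_RE = re.compile(
    r"\[Hardware Error\]: CPU (\d+): Machine Check: \d+ Bank (\d+): [0-9a-fA-F]+"
)


def count_mce_events(syslog_text):
    """Count machine check events per (cpu, bank) pair in syslog text."""
    counts = Counter()
    for line in syslog_text.splitlines():
        match = MCE_RE.search(line)
        if match:
            counts[(int(match.group(1)), int(match.group(2)))] += 1
    return counts


if __name__ == "__main__":
    sample = "\n".join([
        "Dec  4 22:22:52 MPS kernel: mce: [Hardware Error]: Machine check events logged",
        "Dec  4 22:22:52 MPS kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 5: bea0000000000108",
    ])
    for (cpu, bank), n in sorted(count_mce_events(sample).items()):
        print(f"CPU {cpu}, bank {bank}: {n} event(s)")
```

If the same CPU and bank keep showing up after the RAM settings and the power supply have been ruled out, that would point more towards the CPU or board than towards memory, though that is only a rule of thumb.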
