[SOLVED] 6.9.2 Random shutdowns could use an assist


Goobaroo


I recently replaced my motherboard and CPU, going from an i5 to a Ryzen 5700G on a ROG Strix X570-E, and it ran great for over a week.  In the past few days it has not made it past 24 hours of uptime, and sometimes it lasts as little as 30 minutes.

 

What I've checked:

  • Ran a memtest, and it passed with 4 passes
  • Enabled syslog to disk, but I can't spot any indication of why it is shutting down.
  • Updated my BIOS, which reset all my settings, so I'm not 100% sure I've got everything the way it should be.
    • Disabled C-states
    • Enabled virtualization
  • Plugged in the additional 4 pin CPU power.  Running a 750W power supply.
  • Temperature seems okay, but I don't know for sure since the sensors don't load.  It finds Nuvoton NCT6798D, but modprobe doesn't load the driver
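
When the persisted syslog shows no obvious cause, one thing that can still be checked is what was logged immediately before each boot. The sketch below is only an illustration: it assumes every boot writes a kernel line containing "Linux version", and the function name and the example path are hypothetical. If the line before a boot banner is routine activity with no shutdown messages, that would be consistent with an abrupt power loss or hardware trip rather than a shutdown started by the OS.

```shell
#!/bin/bash
# Print the last line logged before each boot banner in a persisted syslog.
# Assumption: every boot logs a kernel line containing "Linux version".
last_lines_before_boot() {
    local logfile="$1"
    awk '
        # A boot banner: emit whatever was logged just before it, if anything.
        /Linux version/ && prev != "" { print prev }
        # Remember the current line for the next iteration.
        { prev = $0 }
    ' "$logfile"
}

# Example (hypothetical path; use wherever your syslog is mirrored):
# last_lines_before_boot /boot/logs/syslog
```

Each output line is the final entry of the session that ended unexpectedly, so its timestamp also gives a rough time of the shutdown.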

 

Any help would be greatly appreciated.  Prior to upgrading to Ryzen, the server was running great for over a year.

lucifer-diagnostics-20210916-1327.zip

 

Solved:

 

TLDR: Heat and power.  Replaced the case with one that has better airflow and replaced the 750W PSU with a 1000W one.

Update: 16 days up, no issues.  Definitely heat and power.

Edited by Goobaroo

I think I have it narrowed down to two issues: heat and a bad SAS breakout cable.  It ran for over a week with the case open and a small desk fan blowing into the case.  I have a new case on order, a Phanteks Enthoo 719, to get more room and better airflow.

 

I suspect the SAS cable because Unraid has marked the same drive location bad twice now, once with the original drive and again with a brand new drive.  It can't actually be the drive; it has to be the cable.
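
One way to look for supporting evidence of a cabling fault is SMART attribute 199 (UDMA_CRC_Error_Count), which SATA drives typically increment on link-level transfer errors. The helper below is a sketch, not something from this thread: the function name is made up, the device name is a placeholder, and it assumes the drive reports attribute 199 in the usual `smartctl -A` table layout.

```shell
#!/bin/bash
# Read `smartctl -A` output on stdin and print the raw value of
# SMART attribute 199 (UDMA_CRC_Error_Count), if the drive reports it.
crc_error_count() {
    # Attribute rows start with the numeric ID; the raw value is the last column.
    awk '$1 == 199 { print $NF }'
}

# Usage (device name is a placeholder):
# smartctl -A /dev/sdX | crc_error_count
```

A count that keeps rising on whichever drive sits in that slot, but not on the same drive in another slot, would point at the cable or port rather than the disk.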

 

The biggest issue I have is that none of the OS-level temperature sensors work, so I can't confirm that things are overheating.


I'm going to throttle the CPU by setting the scaling_governor to conservative.  The server only stayed up 7 hours last night while it was trying to rebuild a drive, which is CPU intensive.  That was after replacing the CPU cooler and putting it all in a much larger case with better airflow.

 

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
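
The command above only reads the governor for cpu0. A minimal sketch for applying a governor to every core follows. The function names are hypothetical, and the optional root argument exists only so the functions can be tried against a scratch directory. On a real system this needs root, the governor has to be one of those listed in scaling_available_governors, and the setting does not survive a reboot.

```shell
#!/bin/bash
# Write the given governor to every core's scaling_governor file.
# $2 defaults to the real sysfs location; override it only for dry runs.
set_governor() {
    local governor="$1"
    local root="${2:-/sys/devices/system/cpu}"
    local f
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue   # skip if the glob matched nothing
        echo "$governor" > "$f"
    done
}

# Print the current governor of every core, one per line.
show_governors() {
    local root="${1:-/sys/devices/system/cpu}"
    local f
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}

# Usage on the server itself:
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
# set_governor conservative
# show_governors
```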

 


Thanks @JonathanM, I did hit the base of the CPU with a laser thermometer, but you're right, that's not a very accurate measurement.  Usually on reboot the BIOS shows the CPU in the 40-50 °C range.

 

It would help if Unraid could detect the thermal sensors like it did on my old setup, but loading the nct6775 driver doesn't return anything, even though that is the driver sensors-detect recommends.
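
To see which sensor chips the kernel actually registered after a modprobe, the hwmon sysfs tree can be read directly. This is a sketch with a hypothetical function name. It assumes the standard hwmon layout, where each chip directory has a name file and temp*_input files holding millidegrees Celsius. Whether nct6775 or the AMD CPU driver k10temp shows up on this board and kernel is not something the thread confirms.

```shell
#!/bin/bash
# List every registered hwmon chip and its temperature inputs.
# $1 defaults to the real sysfs location; override it only for dry runs.
list_hwmon_temps() {
    local root="${1:-/sys/class/hwmon}"
    local dir t
    for dir in "$root"/hwmon*; do
        [ -r "$dir/name" ] || continue
        cat "$dir/name"
        for t in "$dir"/temp*_input; do
            [ -r "$t" ] || continue
            # Values are millidegrees Celsius; integer division rounds down.
            echo "  $(basename "$t"): $(( $(cat "$t") / 1000 )) C"
        done
    done
}

# Usage on the server itself:
# modprobe nct6775
# list_hwmon_temps
# dmesg | grep -i nct6775   # may show why the driver did not bind
```

If nct6775 is missing from the list after the modprobe, the kernel log is the next place to look for the reason.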

 

But the uptimes are all over the place: a week, a day, a few hours.  I'm willing to swap the power supply to see if it fixes it.

 

 


Passive cooling is completely dependent on the airflow patterns in the case. Since you changed motherboards, the airflow patterns have changed. Try temporarily forcing air over the HBA with an extra case fan, or run with the side off with a household fan directed inside. Note, running with the side off WITHOUT an extra fan blowing directly in is asking for trouble, as any case fans will likely just freewheel uselessly as their airflow bypasses everything it was meant to cool.


So, it seems the final answer was a combination of heat and power.

 

I replaced the case with a Phanteks Enthoo 719.  Way better airflow, way more fans.

 

I replaced the 750W PSU with a 1000W one, and the server was able to complete its rebuild of one of the drives.  There are 8 WD Red HDDs in there.

 

So the combination of all the drives spinning and the more powerful CPU probably drawing more power was tripping the old PSU.

 

I'm going to mark this as solved.  Thanks for talking through it with me.

  • Goobaroo changed the title to [SOLVED] 6.9.2 Random shutdowns could use an assist
