Goobaroo Posted September 16, 2021 (edited October 19, 2021)

I recently replaced my motherboard and CPU, moving from an i5 to a Ryzen 5700G on a ROG Strix X570-E, and it ran great for over a week. In the past few days it has not made it past 24 hours of uptime, and sometimes it lasts as little as 30 minutes.

What I've checked:
- Ran a memtest, and it passed with 4 passes.
- Enabled syslog to disk, but I can't spot any indication of why it is shutting down.
- Updated my BIOS, which reset all my settings, so I'm not 100% sure I've got everything the way it should be.
- Disabled C-states.
- Enabled virtualization.
- Plugged in the additional 4-pin CPU power.
- Running a 750 W power supply.
- Temperature seems okay, but I don't know for sure since the sensors don't load. It finds a Nuvoton NCT6798D, but modprobe doesn't load the driver.

Any help would be greatly appreciated. Prior to upgrading to Ryzen, the server was running great for over a year.

Attachment: lucifer-diagnostics-20210916-1327.zip

Solved (TLDR): Heat and power. Replaced the case with one with better airflow and replaced the 750 W PSU with a 1,000 W one.

Update: 16 days up, no issues. Definitely heat and power.
itsalljustdata Posted September 24, 2021 (edited September 24, 2021)

Any progress on this? I'm seeing similar issues with a 5700G (different mobo), on 6.10.0-rc1.

Attachment: skynet-diagnostics-20210924-1400.zip
Goobaroo (Author) Posted September 24, 2021 (edited September 24, 2021)

I think I have it down to two issues: heat and a bad SAS breakout cable.

It ran for over a week with the case open and a small desk fan blowing into the case. I have a new case on order, a Phanteks Enthoo 719, which is bigger and has better airflow.

I suspect the SAS cable because the same drive location has been marked bad by Unraid twice now, once with the original drive and again with a brand new drive. It can't actually be the drive, so it has to be the cable.

The biggest issue I have is that no OS-level temperature sensors work, so I can't confirm that things are overheating.
Goobaroo (Author) Posted September 24, 2021

Of course I say that and it rebooted again today. I'm hoping it is something to do with the flaky SAS cable at this point, and that the replacement coming today will fix the stability.
itsalljustdata Posted September 27, 2021

Heat, or heat caused by CPU load, could well be my issue too. Mine actually ran fine for a week, but none of my docker containers were running (different issue, don't ask!). I recreated them all on Friday.
Goobaroo (Author) Posted September 27, 2021

I'm going to throttle the CPU by setting the scaling_governor to conservative. The server only stayed up 7 hours last night while it was trying to rebuild a drive, which is CPU intensive. That was after replacing the CPU cooler and putting it all in a much larger case with better airflow.

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
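For anyone following along, the governor change described in this post could be scripted roughly as below. This is a minimal sketch, assuming the standard Linux cpufreq sysfs layout. `set_governor` and its optional root-directory argument are hypothetical helper names, not anything shipped with Unraid, and writing to the real sysfs files requires root. Whether `conservative` is offered at all depends on the active cpufreq driver, so the sketch checks `scaling_available_governors` first when that file exists.

```shell
# Hypothetical helper: set the cpufreq governor on every CPU under a sysfs root.
# Usage: set_governor <governor> [cpu_root]
# Prints the number of CPUs whose governor was changed.
set_governor() {
    local governor="$1"
    local cpu_root="${2:-/sys/devices/system/cpu}"
    local dir
    local changed=0

    for dir in "$cpu_root"/cpu[0-9]*/cpufreq; do
        # An unmatched glob stays literal, so skip anything that is not a directory.
        [ -d "$dir" ] || continue

        # Skip CPUs whose driver does not offer the requested governor.
        if [ -r "$dir/scaling_available_governors" ]; then
            if ! grep -qw -- "$governor" "$dir/scaling_available_governors"; then
                echo "skipping $dir: '$governor' not available" >&2
                continue
            fi
        fi

        if echo "$governor" > "$dir/scaling_governor"; then
            changed=$((changed + 1))
        fi
    done

    echo "$changed"
}

# Example (as root):
#   set_governor conservative
#   cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```

The setting does not survive a reboot, so it would have to be reapplied at boot if it turns out to help.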
Goobaroo (Author) Posted September 27, 2021

4 hours online, then a reboot. So no dice. This is really making me regret this purchase: not Unraid, but moving to AMD.
Goobaroo (Author) Posted September 27, 2021

I'm going to replace the power supply and see if that finally clears it. Heat seems fine now with the new case. There is a Noctua NH-D9L on the CPU and it is cool to the touch. Power is the next most likely issue.
JonathanM Posted September 27, 2021

Goobaroo said: "there is a Noctua NH-D9L on the CPU and it is cool to the touch."

You can't judge a heatsink's performance by the temperature of the heatsink, because it's possible the CPU itself is boiling hot but not transferring the heat. You've got to measure the CPU directly; the on-die sensors are a good indicator.
Goobaroo (Author) Posted September 27, 2021

Thanks @JonathanM. I did hit the base of the CPU with a laser thermometer, but you're right, that is not a very accurate measurement. Usually on reboot the BIOS shows the CPU in the 40-50 °C range. It would help if Unraid could detect the thermal sensors like it did on my old setup, but loading the nct6775 driver doesn't return anything, despite that being the recommendation from sensors-detect.

The uptimes are all over the place, though: a week, a day, hours. I'm willing to swap the power supply to see if that fixes it.
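When the sensors tooling shows nothing, the kernel's hwmon sysfs tree can still be read directly to see which sensor drivers actually registered. The following is a minimal sketch, assuming the standard hwmon layout (`name` and `tempN_input` files, with values in millidegrees Celsius). `list_hwmon_temps` is a hypothetical helper name. On Ryzen systems the on-die CPU temperature is normally exposed by the `k10temp` driver rather than by the board's Super I/O chip, but whether that driver supports this particular APU depends on the kernel version, so treat that as an assumption to verify.

```shell
# Hypothetical helper: print every temperature the kernel exposes via hwmon.
# Usage: list_hwmon_temps [hwmon_root]
# Output: one line per sensor, "<driver name> <label> <whole degrees>C".
list_hwmon_temps() {
    local root="${1:-/sys/class/hwmon}"
    local chip name input label raw

    for chip in "$root"/hwmon*; do
        [ -r "$chip/name" ] || continue
        name=$(cat "$chip/name")

        for input in "$chip"/temp*_input; do
            [ -r "$input" ] || continue
            raw=$(cat "$input" 2>/dev/null) || continue
            [ -n "$raw" ] || continue

            # Use the driver's label when present, else the file name stem.
            label=$(basename "$input" _input)
            if [ -r "${input%_input}_label" ]; then
                label=$(cat "${input%_input}_label")
            fi

            # hwmon reports millidegrees Celsius; whole degrees are enough here.
            echo "$name $label $((raw / 1000))C"
        done
    done
}

# Example:
#   list_hwmon_temps
```

If no line for a CPU sensor shows up, no driver has claimed the on-die sensor, which would match the symptom described in this post.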
Goobaroo (Author) Posted September 28, 2021

@JonathanM, your opinion here would be appreciated. Right before the latest shutdown I got the error "Array has 7 disks with read errors", all my disks. I was rebuilding a drive at the time, but other searches seem to point to a power issue.
JonathanM Posted September 28, 2021

Do you have active cooling on your HBA?
Goobaroo (Author) Posted September 28, 2021

No, the LSI 9201-8i card I have is passively cooled. But it was in use on the previous motherboard and CPU (i5-6500) since November with no issues.
JonathanM Posted September 28, 2021

Passive cooling is completely dependent on the airflow patterns in the case. Since you changed motherboards, the airflow patterns have changed. Try temporarily forcing air over the HBA with an extra case fan, or run with the side off with a household fan directed inside.

Note: running with the side off WITHOUT an extra fan blowing directly in is asking for trouble, as any case fans will likely just freewheel uselessly as their airflow bypasses everything it was meant to cool.
Goobaroo (Author) Posted September 29, 2021

So, it seems the end answer was a combination of heat and power. I replaced the case with a Phanteks Enthoo 719: way better airflow, way more fans. I replaced the 750 W PSU with a 1000 W one, and the server was able to complete its rebuild of one of the drives.

There are 8 WD Red HDDs in there. So the combination of all the drives being on and the more powerful CPU, which probably draws more power, was clearly tripping the PSU.

I'm going to mark this as solved. Thanks for talking through it with me.