[Solved] Unraid has been hanging every night for about a week, after running fine for quite a while



Good day Unraid Community!

 

I've built myself a new PC to take advantage of the Nvidia build so I've transformed my old Unraid server into this somewhat new beast:

Motherboard: MS-7B79/X470 GAMING PLUS MAX (MS-7B79), BIOS H.00 08/05/2019

CPU: Ryzen 2700x

Graphics Card: Zotac GTX 1060 6GB

Memory: Corsair 16GB 3200mhz

 

I installed everything about a month and a half ago without many issues. I ran the stock Unraid 6.7.2 for a little while, but I experienced SQL timing issues that were fixed in 6.8.0, so I upgraded. After a while, I switched to the Nvidia build to start using the graphics card. Everything seemed smooth for a while. Then, about a week ago, I noticed my Unraid machine suddenly became unavailable every single night between 2AM and 3AM (not at the same time every night, but within that range). By "hangs" I mean, for example, that I can't open an ssh session, all dockers/VMs are down, and when I physically go to the computer I can't wake it by touching the keyboard or turning the monitor off and on, etc. Basically, nothing revives it other than a hard reset.

 

After a hard reset yesterday, I finally decided to enable the syslog server to keep a copy of the syslog on my flash drive, hoping to find the root cause of the hangs. Luckily, the copy seems to have worked, but when I checked it, I didn't see anything in particular near the end that would explain the hangs. I did see some weird call stack traces around 21:40, but I'm not sure what they are. Perhaps someone with more knowledge will be able to put a finger on it?
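For anyone wanting to skim a mirrored syslog like this one, here is a minimal Python sketch that pulls out lines mentioning a call trace and reports the last timestamp logged before the hang. It assumes the traditional syslog prefix (`Feb  2 21:40:13 host ...`), which carries no year, so the year is passed in. The function names and the "Call Trace" search string are illustrative choices, not anything taken from the attached log.

```python
import re
from datetime import datetime

# Assumed traditional syslog prefix, e.g. "Feb  2 21:40:13 hostname kernel: ..."
STAMP = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})\s")


def parse_stamp(line, year=2020):
    """Return the line's timestamp, or None if the prefix does not match."""
    m = STAMP.match(line)
    if not m:
        return None
    month, day, clock = m.groups()
    text = f"{year} {month} {int(day):02d} {clock}"
    return datetime.strptime(text, "%Y %b %d %H:%M:%S")


def summarize(lines, needle="Call Trace"):
    """Collect lines containing `needle` and the last timestamp seen."""
    hits, last = [], None
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None:
            last = stamp  # the final value approximates when logging stopped
        if needle in line:
            hits.append(line.rstrip("\n"))
    return hits, last
```

Passing an open file object to `summarize` works, since it only iterates over lines. The last timestamp gives a rough idea of when the machine stopped logging, which can then be compared against scheduled jobs.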

 

Note that I've also checked:

  • Leaving the Unraid GUI connected to a monitor to see if I could get more info when it hung... but it still hung.
  • Another night, I left "htop" running in a physical session to get an idea of what was happening when it hangs. I came back and it was the same: no monitor, no keyboard... hard reset.
  • Same with remote ssh sessions. They are disconnected when I come back to them in the morning.
  • Whether I have user scripts or cron jobs running around 2-3AM. Nope... I do have things that run hourly, but it never hangs during the day. Only around that time.
  • My scheduling settings, to see if Mover was running around that time. Nope, it runs at 1AM and there isn't much to copy anyway. Otherwise, nothing seems to run around that time. (Trim was hourly; I moved it to daily at midnight to see if it would help. Nope.)
  • I even went back to stock Unraid a few days ago because I thought this could be related to the Nvidia build I was using. Even with this downgrade, no go.
  • I then upgraded to 6.8.2 hoping this would help. Still hangs.
  • ??
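The cron check in the list above can be sketched in Python: given a standard five-field crontab schedule, test whether its hour field includes the 2AM hour. This is only a rough sketch. It looks at the hour field alone, ignores day/month restrictions and `@hourly`-style shortcuts, and the example schedules in the note below are hypothetical stand-ins for the jobs described, not copied from the server.

```python
def expand_field(field, lo, hi):
    """Expand one numeric cron field ('*', '*/6', '1-5', '0,12') into a set."""
    values = set()
    for part in field.split(","):
        rng, _, step = part.partition("/")
        step = int(step) if step else 1
        if rng == "*":
            start, end = lo, hi
        elif "-" in rng:
            a, b = rng.split("-")
            start, end = int(a), int(b)
        else:
            start = int(rng)
            # "N/step" is treated as "N-max/step" here (an assumption)
            end = hi if "/" in part else start
        values.update(range(start, end + 1, step))
    return values


def fires_in_window(schedule, hours=(2,)):
    """True if the schedule's hour field includes any hour in `hours`."""
    fields = schedule.split()
    if len(fields) < 5:
        raise ValueError("expected a five-field cron schedule")
    return bool(expand_field(fields[1], 0, 23) & set(hours))
```

With these assumptions, a Mover-style schedule of `0 1 * * *` and a daily-at-midnight `0 0 * * *` both fall outside the window, while an hourly `0 * * * *` job does run during it (as it does every other hour of the day).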

 

I'm now wondering what I could have missed. Perhaps someone could shed some light? I've included the syslog from last night's hang plus the diagnostics. Let me know if I'm missing anything.

 

Thanks everyone! :)

syslog.jan.02.2020.hanged.txt freddykrueger-diagnostics-20200203-1413.zip

Edited by alexricher
Typos
Link to comment

Ryzen on Linux can lock up due to issues with C-states. Make sure the BIOS is up to date, then look for "Power Supply Idle Control" (or similar) and set it to "typical current idle" (or similar), or completely disable C-states.

 

More info here:

https://forums.unraid.net/bug-reports/prereleases/670-rc1-system-hard-lock-r354/
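As a side check from the running system, the Linux cpuidle sysfs interface lists the idle states the kernel exposes for each CPU. The Python sketch below reads them for one CPU. Note the assumptions: it only shows what the kernel sees (it cannot read the "Power Supply Idle Control" BIOS setting itself), the default path is the usual sysfs location, and the function name is made up for illustration.

```python
from pathlib import Path


def list_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return (state_dir, name, disabled) for each idle state of one CPU."""
    root = Path(cpu_dir)
    if not root.is_dir():
        return []  # cpuidle not exposed (or not running on Linux)
    states = []
    # Sort numerically so state10 comes after state9
    dirs = sorted(root.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    for state in dirs:
        name = (state / "name").read_text().strip()
        disable_file = state / "disable"
        disabled = (
            disable_file.exists()
            and disable_file.read_text().strip() == "1"
        )
        states.append((state.name, name, disabled))
    return states
```

If C-states were disabled in the BIOS, one would expect fewer or shallower states to show up here than before, but the exact list depends on the platform and idle driver, so treat it as a sanity check only.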

 

11 hours ago, alexricher said:

Corsair 16GB 3200mhz

IMHO it's a very bad idea to use overclocked RAM on a server, especially since that's known to cause instability and even data corruption on some Ryzen servers. Respect the max RAM speed for your config:

 

[Image: 533928825_2ndgen.jpg, maximum supported RAM speeds by memory configuration]

Link to comment
Hi Johnnie.black!


Thanks for taking the time to reply, it's appreciated. You've mentioned two very interesting things that I wasn't aware of, so I'll act on them right now.

  • BIOS Update: My motherboard was updated to the latest version when installed, and I just checked: there is no newer version.
  • Power Supply Idle Control: This was totally new to me. :) I checked this first, as I don't remember ever touching that value, and it was set to "Auto". I've set it to "Typical". Also, you mentioned "or completely disable C-states". There was an option for that, which I've disabled. Should I both disable C-states + put Power Supply Idle Control to Typical? Or only one of them? I did both for now, but I'll await your confirmation.
  • 3200mhz Memory: Also interesting, because I really wanted 3200mhz to take advantage of faster operations, but I wasn't aware this could cause instability with Unraid. I just removed the XMP profile (which brings it back to 2133mhz, I think).
  • CPU Overclock: Should I do the same for the CPU? I think it's a bit overclocked at the moment. To be on the safe side, I've put it back to stock but I was wondering if it's safe to do this.

What I find weird is that it only happens at night between 2AM and 3AM. It did the same again last night. I honestly think too that it could be related to the hardware config. I've made all the suggested changes and I'll report back tomorrow if the issue is still ongoing.

 

Thanks for taking the time to give your feedback. Fingers crossed we found the culprit! :D

Link to comment
7 minutes ago, alexricher said:

Should I both disable C-states + put Power Supply Idle Control to Typical? Or only one of them?

Usually just correctly setting the Power Supply Idle Control is enough, and it's the preferred solution. If it still crashes after doing that, you can try the other options.

 

9 minutes ago, alexricher said:

To be on the safe side, I've put it back to stock but I was wondering if it's safe to do this.

IMHO servers and overclocking don't really go well together, but I'd at least recommend starting at stock speeds. Only once it's stable should you think about overclocking, so you know whether doing that introduces stability issues.

 

Link to comment

Perfect, thanks for clarifying! I've only set the Power Supply Idle Control for now and will try the other C-state option if it's still problematic. I also removed any overclocking on the CPU and memory. Back to stock on the hardware side...

 

I may be pushing my luck, but I saw they just released the Nvidia 6.8.2 build, so I upgraded to it while the server was down. I'll report back tomorrow if things are still not working well. Thanks again for your answers! :D

Link to comment

Good morning everyone! 

 

I just wanted to confirm that my server didn't crash or hang last night. This means removing the overclock and/or changing the BIOS settings fixed it!

 

Thanks a lot Johnnie for your insight, you pointed me in the right direction. Much appreciated!

 

Have a good day everyone! 

 

Link to comment
