Unraid server keeps crashing overnight, Ryzen idle power management suspected.


Posted (edited)

My Unraid server keeps crashing overnight. If it is rebuilding parity or copying data, it will last the night, but if it is idle it will be hard locked by morning, including the IPMI BIOS.

 

I have found many threads indicating that this is a known problem caused by idle power management. Unfortunately, I have tried all the recommended fixes to no avail:

 

  • Adding 'rcu_nocbs=0-11' in syslinux config.
  • Set Global C-state control to disabled.
  • Updated BIOS to latest version.
  • Using newest Unraid release.
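A quick way to confirm that the rcu_nocbs parameter from the list above actually reached the running kernel is to look at /proc/cmdline after a reboot. The sketch below is only an illustration: the function names are made up for this example, and the naive whitespace split assumes no quoted parameter values containing spaces.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {parameter: value, or None for bare flags}."""
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def rcu_nocbs_value(cmdline: str) -> Optional[str]:
    """Return the CPU list given to rcu_nocbs, or None if it is absent."""
    return parse_cmdline(cmdline).get("rcu_nocbs")


def live_rcu_nocbs(path: str = "/proc/cmdline") -> Optional[str]:
    """Read the parameters the running kernel was booted with (Linux only)."""
    return rcu_nocbs_value(Path(path).read_text())
```

Running `live_rcu_nocbs()` on the server should return `'0-11'` if the syslinux change took effect, and `None` if the append line was not picked up.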

 

The only thing I haven't changed is to "Set Power Supply Idle Control to Typical Current Idle" because I cannot find this in the bios.

 

Is there any other name it could be under? Or anything else I can change? I am using an Asrock Rack X570D4I-2T.

Edited by Teg
mentioning ryzen in subject
Link to comment

Updating this thread myself for anyone else who might be struggling. Asrock Rack suggested I try a beta BIOS, and thanks to that update I can now also disable the "hibernate" S3 and S4 options, which could not be disabled before.

 

Hopefully that fixes the issue.

Link to comment

Another update: the server is still crashing. I enabled system logging, and the final entry before the last crash was simply that it was spinning a drive down.

 

I am going to try under-clocking the DDR4 memory, as it appears to be auto-over-clocking and I've read that this can cause problems.

Link to comment

The system wasn't crashing while running Windows; it could even sleep and wake. The symptoms match this description exactly... how could I confirm that this ISN'T the problem?

Link to comment

If you would post your system's diagnostics, it would help. We've been guessing what you have for hardware. Other than the motherboard, I don't even know which Ryzen CPU you have.

 

The early Ryzen CPUs do have lock-up issues in Linux-based systems. The later ones are much better, but there are still infrequent reports of issues. Power management was at the root of one of the issues. Memory speed was another.

 

With power management, this is typically fixed by setting "Power Supply Idle Control" to "Typical Current Idle" in the BIOS. If your motherboard's BIOS has this setting, it will likely be under Advanced > AMD CBS.

 

For memory speed, Ryzen is a bit finicky. The Infinity Fabric that ties the Zen cores to the cache and system RAM will have collisions or missing data if the RAM doesn't sync up with the CPU's timing. People get hung up on the marketing department's numbers printed in bold letters on the package and in the advertisements. "3200" is not the speed that the RAM actually runs at. That's the overclocked XMP value. It actually runs at 2666 (1333 MHz x 2). To make things more confusing, on the earlier CPUs it matters whether your RAM chips are Single Rank or Dual Rank. Below are the Maximum Frequency RAM settings for your motherboard.

 

What I would suggest is:

- Remove whatever mitigations you have added

- Confirm you are on the latest BIOS (2.50 or at least 2.20)

- See if you can locate the Typical Current Idle setting

- Manually set your DRAM settings to be appropriate for your SO-DIMM *AND* no higher than the Maximum Frequency supported

- Sanity check no other over clocking or over/under volting settings in BIOS

- Disable/turn off any devices on the motherboard which you do not plan on using.

 

Capture.JPG
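The "1333 MHz x 2" arithmetic above can be written out as a small sketch. DDR memory transfers data on both edges of the clock, so the advertised figure is twice the memory clock. The function names are illustrative, and the board's supported maximum is left as a parameter because it has to be read from the attached table.

```python
def data_rate_mts(memory_clock_mhz: float) -> float:
    """DDR moves data on both clock edges: transfers per second = 2 x clock."""
    return 2 * memory_clock_mhz


def memory_clock_mhz(data_rate: float) -> float:
    """Inverse of data_rate_mts: the real clock behind an advertised DDR figure."""
    return data_rate / 2


def within_supported_maximum(configured_mts: float, max_supported_mts: float) -> bool:
    """True if the configured DRAM speed does not exceed the supported maximum.

    max_supported_mts must come from the motherboard/CPU support table for
    the installed rank and DIMM count; it is not hard-coded here.
    """
    return configured_mts <= max_supported_mts
```

With the numbers from this post, `data_rate_mts(1333)` gives 2666, and a kit sold as "3200" corresponds to a 1600 MHz memory clock.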

Link to comment

Thank you so much for this response, here is my diagnostic zip, I'm sorry I thought I had attached it.

 

1) Will remove.

2) I was on BIOS 2.50. Asrock Rack suggested I try the new beta BIOS L2.59c, which did enable more C-states to disable.

3) I have checked everywhere and cannot find a setting to change the idle current. I have asked Asrock Rack and they haven't addressed it yet, so I've asked again.

4) RAM speed is 2666 MHz, single rank. It's a Vermeer CPU (Ryzen 5 5600G).

5) Definitely no over or under clocking or volting.

6) I'm confident everything unused on the motherboard is disabled.

fortdox-diagnostics-20240704-2201.zip

Link to comment

Scanning through the logs, I am seeing a few yellow errors like this:

Jul 13 19:05:11 FortDox kernel: pci 0000:27:00.0: BAR 14: no space for [mem size 0x01800000]
Jul 13 19:05:11 FortDox kernel: pci 0000:27:00.0: BAR 14: failed to assign [mem size 0x01800000]
Jul 13 19:05:11 FortDox kernel: pci 0000:28:00.0: BAR 0: no space for [mem size 0x01000000]
Jul 13 19:05:11 FortDox kernel: pci 0000:28:00.0: BAR 0: failed to assign [mem size 0x01000000]
Jul 13 19:05:11 FortDox kernel: pci 0000:28:00.0: BAR 1: no space for [mem size 0x00020000]
Jul 13 19:05:11 FortDox kernel: pci 0000:28:00.0: BAR 1: failed to assign [mem size 0x00020000]
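To see which PCI devices are affected across a whole syslog rather than scanning by eye, the "failed to assign" lines can be grouped by device address. This is a minimal sketch that assumes the line format shown above; the regular expression and function name are written for this example and are not part of any Unraid tooling.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   pci 0000:27:00.0: BAR 14: failed to assign [mem size 0x01800000]
BAR_FAILURE = re.compile(
    r"pci (?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): failed to assign \[mem size (?P<size>0x[0-9a-f]+)\]"
)


def failed_bars(log_text: str) -> dict:
    """Map each PCI address to a list of (BAR index, size in bytes) that failed."""
    failures = defaultdict(list)
    for match in BAR_FAILURE.finditer(log_text):
        size_bytes = int(match["size"], 16)
        failures[match["device"]].append((int(match["bar"]), size_bytes))
    return dict(failures)


SAMPLE = """\
Jul 13 19:05:11 FortDox kernel: pci 0000:27:00.0: BAR 14: no space for [mem size 0x01800000]
Jul 13 19:05:11 FortDox kernel: pci 0000:27:00.0: BAR 14: failed to assign [mem size 0x01800000]
Jul 13 19:05:11 FortDox kernel: pci 0000:28:00.0: BAR 0: no space for [mem size 0x01000000]
Jul 13 19:05:11 FortDox kernel: pci 0000:28:00.0: BAR 0: failed to assign [mem size 0x01000000]
Jul 13 19:05:11 FortDox kernel: pci 0000:28:00.0: BAR 1: no space for [mem size 0x00020000]
Jul 13 19:05:11 FortDox kernel: pci 0000:28:00.0: BAR 1: failed to assign [mem size 0x00020000]
"""

for device, bars in failed_bars(SAMPLE).items():
    print(device, bars)
```

The addresses it reports (0000:27:00.0 and 0000:28:00.0 here) can then be looked up with `lspci -s <address>` to identify the devices involved.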

Link to comment

While waiting for the RAM sticks I have been running memtest 24/7, and the rig has passed dozens of cycles without issue, so I continue to suspect idle power management rather than memory timing, though I know this is not definitive.

 

However, while continuing to hunt through the BIOS for Typical Current Idle, which still doesn't seem to be a supported option even in the new beta BIOS, I did come across something called pstate0. After some research, it appears to be about idle power and CPU/GPU settings, including voltage. Am I crazy to think this could be similar?

 

My options are Custom and Auto, and these are the default settings for Custom. I suspect the 20 might be too low, but I am hesitant to just try random higher values.

 

 

Screenshot2024-07-20at8_09_13AM.thumb.png.4090b74ef9d1aafcf2808d3620a1b8d6.png

Link to comment