Unraid hanging every 3-4 days, giant syslog file while troubleshooting



Hey folks,

 

I'm trialling Unraid, though with my build I've basically committed to using it. Most of the server hangs every few days (usually about three days); this has happened three times now. The system usually remains accessible over ssh, but the web UI either won't load at all or hangs when I click the Docker tab (which is probably a clue), and I can't interact with most subsystems over ssh either.

 

If my assumption is correct that a Docker container is somehow the cause, then I don't understand how it's taking down the other containers and the main Unraid OS with it.

 

I've also hit a quirk while troubleshooting: a syslog file that grew to 975GB in four days, pretty much consuming my cache drive (which doesn't hold much else).

 

Things that might be relevant:

  • I'm using a Ryzen 5600 and a GTX 1080 graphics card, which is passed through to a VM (this works fine when the VM is on). I have 64GB of RAM, a 1TB cache SSD, and a 2TB unassigned NVMe drive hosting the Docker image file, appdata, and VMs. I use LUKS across my data drives.
  • I have a couple of drives (sdf, sdg) with bad SMART states, but I don't use them in any way; they just sit there as unassigned drives and aren't mounted.
  • I have two VMs - one for gaming, and one general purpose one that I use as a landing point if I VPN into the home network. 
  • I have 8 docker containers, 5 of which autostart and are on when it hangs (linuxserver plex, radarr, sonarr, sabnzbd, nzbhydra2).
  • VMs hang with everything else when the system hangs, but in the last cycle I had them turned off for four days and Unraid still hung.
  • In a hung state, the web interfaces of most docker containers don't load, though sometimes one does.
  • In a hung state, I can ssh to the server, which lets me interact with some systems. I can run top and interact with Docker to an extent, but trying to stop or kill containers just hangs and I have to Ctrl-C to keep using the shell. Same with diagnostics, which prints 'Starting Diagnostics Collection...' and then sits in that state indefinitely (or at least for hours).
  • While I couldn't collect diagnostics, I did grab some screenshots while in a hung state; I'll post those below. The Plex memory usage (in Docker settings) and CPU usage (in top) seem high, so maybe something is wrong with that container? Not sure why it would destabilize the rest of the system, though.
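One thing that can be checked over ssh while the box is hung: processes stuck in uninterruptible sleep (state D) would explain why kill and diagnostics never return, since such processes are blocked on I/O and can't be signalled. A sketch using standard procps tools (nothing Unraid-specific assumed):

```shell
# List processes in uninterruptible sleep (STAT starts with "D"); these
# are usually blocked on hung I/O and cannot be killed, even with SIGKILL.
# The WCHAN column hints at which kernel function they are waiting in.
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
```

If that list is full of Docker or filesystem processes, it points at storage (or the loop-mounted docker image) rather than any one container.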

 

(Screenshots attached: Screen Shot 2021-01-23 at 10:42:32 pm, 11:04:47 pm, and 10:44:22 pm.)

 

 

I tried logging everything to the local syslog server, pointed at a share set to prefer the cache drive; this generated the mammoth log file I've been struggling to open. Given the size there's likely a lot of garbage in it, so finding the cause would be a needle in a haystack even if I could open it (I'm giving 'less' a go). I may have created a feedback loop somehow: I see messages in the local syslog saying it's out of space.
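When a log file gets that big it's almost always one message repeating, so sampling the tail is enough to identify it without ever opening all 975GB. A sketch (the log path is a guess; adjust to match the actual share):

```shell
LOG=/mnt/cache/syslog/syslog.log    # assumed path; adjust to the real share

if [ -f "$LOG" ]; then
    # The last few hundred lines usually show whatever was looping
    # when the drive filled.
    tail -n 200 "$LOG"

    # Count the most common messages in the final ~100MB, ignoring the
    # timestamp and host fields (1-4 in the standard syslog layout).
    tail -c 100000000 "$LOG" \
        | awk '{$1=$2=$3=$4=""; print}' \
        | sort | uniq -c | sort -rn | head -20
fi
```

Also worth knowing: `truncate -s 0 "$LOG"` (or `: > "$LOG"`) frees the space immediately while keeping the same inode, so the daemon writing to it carries on; deleting the file instead leaves the space allocated until whatever holds it open is restarted.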

 

(Screenshots attached: Screen Shot 2021-01-24 at 12:39:28 pm and 12:38:50 pm.)

 

Things I've tried:

  • I thought it might be the CPU pinning of the Docker containers, which I originally had pinned to 1 core / 2 threads. I unwound that, so Unraid/Docker now share 3 cores / 6 threads (with the VMs pinned to the others), but the hangs continued.
  • Disabling VMs (it still hangs without them on).
  • Memtest (passed 6 rounds).
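To stop the syslog share from eating the cache again while testing continues, something like this could run from cron. Paths and limits here are made up; the point is keeping only the tail and overwriting the file in place, so the syslog daemon's open file handle stays valid:

```shell
#!/bin/bash
# Hypothetical guard: if the remote-syslog file grows past 1GB, keep
# only the last 10MB. Overwriting with cat (rather than mv) preserves
# the inode, so the daemon writing to it doesn't need a restart.
LOG=/mnt/cache/syslog/syslog.log      # assumed path
MAX=$((1024 * 1024 * 1024))           # 1GB cap
KEEP=$((10 * 1024 * 1024))            # retain the last 10MB

if [ -f "$LOG" ] && [ "$(stat -c %s "$LOG")" -gt "$MAX" ]; then
    tail -c "$KEEP" "$LOG" > /tmp/syslog.tail
    cat /tmp/syslog.tail > "$LOG"     # truncate + rewrite, same inode
    rm -f /tmp/syslog.tail
fi
```

Running it every few minutes caps the damage to roughly the cap size plus whatever accumulates between runs.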

 

Any help would be greatly appreciated!

tower-diagnostics-20210124-1231.zip

17 minutes ago, trurl said:

 

Thanks, I'd assumed it was a Docker issue, so hadn't looked for Ryzen-specific problems.

I've changed the BIOS setting "Power Supply Idle Control" from "Auto" to "Typical Current Idle".

 

While most components in my build are quite new, the PSU is reused from an older system.

On 1/24/2021 at 8:46 PM, JorgeB said:

You are also running the RAM above the max officially supported speed, that's also a known problem.

Thanks, that's their stock speed / what they have a profile for. I can clock them lower if the Typical Current Idle change doesn't fix things. Thanks for the pointer.

1 hour ago, Architect said:

that's their stock speed / what they have a profile for.

That's not the issue: AMD officially supports 3200 MHz as the max for that CPU, and Ryzen has known stability issues (even data corruption) with overclocked RAM under Unraid.
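For anyone checking this on their own box, the speed the DIMMs are actually running at (as opposed to their rated profile) can be read from the DMI tables. A sketch, assuming root on the Unraid console:

```shell
# "Speed" is the current running speed; newer dmidecode versions also
# print a separate "Configured Memory Speed" line. Needs root access
# to /dev/mem, hence the fallback message.
dmidecode -t memory 2>/dev/null | grep -i 'speed' || echo "run as root"
```

If that reports anything above the CPU's officially supported speed, the RAM is effectively overclocked even though it's "stock" for the DIMMs.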

(2 weeks later...)

Just wanted to update in case others have issues:

  • Now stable with 16 days of uptime and counting.
  • Typical Current Idle looks like it was the culprit: I changed that and nothing else. I've left the RAM frequency at 3600 MHz for now, pending further issues.
