Server unresponsive after 1-2 days (started after updating docker remote share to slave, version update). 6.12.10

mtftl · July 26

This is my first issue in over two years that I can't figure out, so hopefully I am sharing the right details. Situation: my Unraid 6.12.10 server goes unresponsive after 1-2 days. The GUI eventually gives 503 errors. The only "resolution" is to power cycle the box. This was the first case of instability in ~2 years of operation behind a UPS. It's been rock solid, with only an occasional clean shutdown/power up during bad storms.

This started happening within a day after I made a change to my Jellyfin docker config. I have a single remote SMB share attached to it. I was given a warning for the first time that I should use slave mode. I made that change and a day later this unresponsive issue happened the first time ever. After repeated issues I've kept the container off and it doesn't fix anything.

I've been through logs and either missed or couldn't find anything amiss other than an SMTP auth error for notifications that I have since fixed. In case it was a docker issue, I deleted the docker img and rebuilt it, adding in all previous apps using the recommended add container feature. Today, for the first time in over a month of these errors, I actually caught an error message from the GUI that the box ran out of memory and was killing low priority processes (it's happened silently before today). I managed to generate the attached diagnostics with the last gasp of my server before it went unresponsive again.

If anyone can see what might be going on, I'd be greatly appreciative. I can't imagine it was that docker config; it was just spooky that my server went from 99.9999... reliability to breaking every couple of days, starting the very next day. I was thinking it had to be hardware, but I can't find what is failing, and the fact that I got an out-of-memory warning today instead of the server just going down has me completely confused. Thanks so much.

tower-diagnostics-20240726-0858.zip

JorgeB · July 26

The server is running out of memory (OOM). It appears to be a container issue spawning endless processes. If you suspect Jellyfin is the problem, see if this helps with that container:

https://forums.unraid.net/bug-reports/stable-releases/61210-cannot-fork-resource-temporarily-unavailable-r3020/?do=findComment&comment=28505

If not, recommend starting the containers one by one until you find the culprit.
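For anyone wanting to check the "endless processes" theory before a crash, here is a minimal Python sketch. It assumes `docker stats --no-stream --format "{{.Name}} {{.PIDs}}"` is available on the host; the helper names and the threshold of 500 are hypothetical and only for illustration, and the sample container names and counts in the usage note are made up.

```python
import subprocess


def parse_pid_counts(output: str) -> dict[str, int]:
    """Parse lines of the form '<container name> <pid count>' into a dict."""
    counts = {}
    for line in output.splitlines():
        parts = line.rsplit(None, 1)  # split off the last whitespace-separated field
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or unexpected lines
        counts[parts[0]] = int(parts[1])
    return counts


def suspects(counts: dict[str, int], threshold: int = 500) -> list[str]:
    """Return container names whose PID count exceeds the (arbitrary) threshold."""
    over = [name for name, pids in counts.items() if pids > threshold]
    return sorted(over, key=lambda name: counts[name], reverse=True)


def docker_pid_counts() -> dict[str, int]:
    """Ask the docker CLI for the current PID count of each running container."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.Name}} {{.PIDs}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_pid_counts(result.stdout)
```

Calling `suspects(docker_pid_counts())` every few minutes (for example from a scheduled script) and logging the result would show whether one container's process count keeps climbing. With made-up input such as `"jellyfin 42\npbs 1873"`, only `pbs` would be reported at the default threshold.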

mtftl · July 26

Thanks, Jorge. I'll give that a shot. Since this is the first time I've recorded the out of memory error, is there any chance that this could have "failed silently" the other times in a way that I would not see in the logs?

JorgeB · July 26

If it's a docker fork bomb issue, it usually crashes before there are OOM issues.

mtftl · July 28

Still testing, but I just had my first unresponsive crash since making the change to Jellyfin, so that didn’t work directly. It’s going to be a long testing cycle since it takes a couple days each time but I’ll have to go one by one on my containers.

mtftl · July 30

Another 2 days, another crash. I do not have logs from before the event, but I have now turned on mirroring the syslog to the flash again.

No out of memory error this time, but I did get an email alert that appdata backup failed, along with docker stop reporting an error:

Event: Appdata Backup
Subject: [AppdataBackup] Error!
Description: Please check the backup log!
Importance: alert

docker stop variant was unsuccessful as well! Docker said:

I'm now planning to leave docker disabled to see if this fixes things.
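Since the syslog is now mirrored to flash, a small script can check after each crash whether the kernel's OOM killer fired before the box went down. This is only a sketch: it assumes the kernel wrote its usual "Out of memory: Killed process <pid> (<name>)" line (older kernels write "Kill process"), and the function name and the log path in the comment are assumptions, not something confirmed in this thread.

```python
import re

# Matches the kernel OOM killer line, e.g.
# "... Out of memory: Killed process 1234 (some-process) total-vm:..."
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines):
    """Return (pid, process name) for every OOM kill found in the given log lines."""
    hits = []
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


# Example usage (the path is an assumption about where the mirrored syslog lives):
# with open("/boot/logs/syslog", errors="replace") as log:
#     for pid, name in find_oom_kills(log):
#         print(pid, name)
```

If the list comes back empty after a crash, the hang probably happened before the OOM killer could act, which would fit the earlier note that a fork bomb usually crashes the server before OOM messages appear.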

mtftl · August 3

After turning docker off completely, I’m at about 4 days uptime, the first time I have gone beyond 2 since this started.

Assuming there is something wrong with docker or an app, what should my next moves be?

- I can move data and reformat my cache drive (where appdata is located).
- I already killed and remade my docker.img, so I doubt that will help.
- I can spend weeks enabling single docker services and wait for a crash. But what can I do if I find one is breaking things?
- I can upgrade or downgrade Unraid. Oddly, it's been nothing but problems with this version, despite it containing a fix seemingly related to my mounted remote SMB share.

I’m still baffled, since nothing in my logs or diagnostics seems to show anything wrong.

JonathanM · August 3

1 hour ago, mtftl said:

I can spend weeks enabling single docker services and wait for a crash. But what can I do if I find one is breaking things?

Shouldn't take weeks if it crashes before 2 days.

Enable half of your normally running containers. If it crashes, divide those in half. If it doesn't, disable that half and enable the other set. I recommend printing a list, noting the start time of each container and the crash times, and keeping track of which containers were running at each crash.

Shouldn't take more than a few cycles to narrow it down, unless it's a combination of containers that only crash when they interact, or you have hundreds of containers.

Bonus is, you get to continue using critical containers.
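The halving procedure above can be sketched in a few lines of Python. This is only an illustration under the assumption that exactly one container is at fault; `bisect_containers` and `crashes_with` are hypothetical names, and in practice `crashes_with` is you running that set of containers for a couple of days and recording whether the server went down.

```python
def bisect_containers(containers, crashes_with):
    """Narrow a list of containers down to the single one that causes crashes.

    crashes_with(subset) must return True if the server crashes while only
    that subset is running. Assumes exactly one culprit acting on its own.
    """
    suspects = list(containers)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        if crashes_with(half):
            suspects = half  # culprit is in the half that was running
        else:
            suspects = suspects[len(half):]  # culprit is in the other half
    return suspects[0] if suspects else None
```

Each pass through the loop is one test cycle, so the number of cycles grows with the base-2 logarithm of the container count: 16 containers need 4 cycles, which at roughly two days per crash is about a week rather than several weeks of one-by-one testing.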

mtftl · August 8

I remain baffled. If docker remains off, everything is okay; if any docker container is up, the system crashes out after 2 days, almost to the minute. I'm now able to see alerts related to app backup failing. This time it was specific to a backup file, not the docker error from before:

Event: Appdata Backup
Subject: [AppdataBackup] Error!
Description: Please check the backup log!
Importance: alert

tar verification failed! Tar said: tar: Removing leading `/' from hard link targets; mnt/user/appdata/pbs/logs/tasks/DF/UPID\:Tower\:00000009\:0AA590DF\:0000000F\:65D05DDB\:backup\:......\:: Contents differ

It seems like something in my appdata is perhaps corrupted. In the absence of other ideas, I guess I will try moving appdata to the array and back to see if this cleans it up. The cache drive seems okay with everything else, which has me confused.
