I've been having issues with docker for the past month, and haven't been able to pin down exactly what the trigger is.
The first time this happened was when I updated all my containers, around the 1st of May. I was forced into an unclean reboot after the oom-killer was triggered; the containers loaded up all right after the OS came back up, albeit with high CPU usage, though I can't remember if memory usage spiked at the time.
The second time was a few days later, on the 4th of May. It was again triggered during an update, and I was forced into another unclean reboot after the oom-killer fired. This time the containers did not come back up; docker choked the system on both CPU and memory.
Symptoms:
Changes to the docker image seem to elicit high CPU usage from docker, along with large spikes in memory usage.
The memory spikes can exhaust memory completely, triggering the oom-killer.
The more containers in use, the more likely this seems to happen. How many containers are needed to trigger the oom-killer is currently unknown.
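For anyone checking their own logs for the same thing: this is roughly how the oom-killer events show up in the syslog inside the diagnostics. The sample line below is illustrative, not copied from my logs.

```shell
# Illustrative syslog line (not from my actual diagnostics) plus the grep
# pattern that finds oom-killer events in a syslog file:
cat <<'EOF' > /tmp/syslog-sample.txt
May  4 13:02:11 Atlantis kernel: Out of memory: Killed process 12345 (dockerd)
EOF
grep -iE 'out of memory|oom-kill' /tmp/syslog-sample.txt
```

Running the same grep over the syslog from the diagnostics zip shows which process the kernel killed and when.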
Things I've tried to resolve this:
Checking the docker image filesystem (Result: No problems).
Deleting and rebuilding the docker image, several times.
Ran a filesystem check on the cache (BTRFS. Result: Errors found.)
Replaced the cache drives completely.
Re-flashing the boot USB and only copying configs back over.
Started shifting almost all my docker container configs to Portainer, as the unRaid docker templates duplicate fields and restore unwanted fields after template updates. Currently (I'm having to rebuild the docker image again as I write this) only 3-4 containers are unRaid-managed.
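For reference, the cache filesystem check was along these lines. The mount point is an example; note that an offline `btrfs check` must only be run on an unmounted filesystem, so only the online scrub and device stats are shown here.

```shell
# Online BTRFS health check, roughly what I ran on the cache pool.
# /mnt/cache is unRaid's usual cache mount point; adjust to your setup.
if command -v btrfs >/dev/null; then
    btrfs scrub start -B /mnt/cache || true   # -B waits and prints stats when done
    btrfs dev stats /mnt/cache || true        # cumulative per-device error counters
    checked=yes
else
    echo "btrfs-progs not installed; skipping check"
    checked=no
fi
```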
Currently it seems that if I only load containers up in small batches, I can get every container running. But only from a clean docker image, or when editing a small number of containers at a time (I'm less sure about this last part).
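The batch loading is nothing fancy; a sketch of what I mean (container names and timings are placeholders, not my actual stack):

```shell
# Start containers a few at a time so the CPU/memory spikes don't stack.
# app1..app5 are placeholder names; on the real box I pause much longer.
BATCH=3
i=0
for c in app1 app2 app3 app4 app5; do
    docker start "$c" 2>/dev/null || echo "could not start $c (missing container or docker unavailable)"
    i=$((i+1))
    if [ $((i % BATCH)) -eq 0 ]; then
        sleep 5   # pause between batches; something like 60s in practice
    fi
done
```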
The first diagnostic log attached (atlantis-diagnostics-20230507-1710) should contain the logs from the 4th of May mentioned above.
It will have the logs from when docker could no longer be started; I ended up leaving docker off after it triggered the oom-killer. The log on the 7th shows that zombie processes from the docker containers were preventing the cache from unmounting, so an unclean reboot was done.
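For anyone hitting the same stuck unmount, this is roughly how I spotted what was holding the cache (the mount point is illustrative):

```shell
# List zombie processes (STAT column starts with Z), then ask what still
# holds the cache mount open. /mnt/cache is an example path.
zombies=$(ps axo pid,stat,comm | awk '$2 ~ /^Z/ {print $1}')
echo "zombie pids: ${zombies:-none}"
if command -v fuser >/dev/null; then
    fuser -vm /mnt/cache 2>/dev/null || true   # processes using the mount
fi
```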
For the second diagnostic log attached (atlantis-diagnostics-20230521-1029), I was working on getting a small stack of 3 containers going. I was still configuring them, meaning they were reloaded several times in a row, seemingly creating a situation where the CPU and memory spikes stacked up enough to trigger the oom-killer.
Marking this as urgent, as docker killing itself via the oom-killer leaves a big risk of data corruption, on top of leaving docker pretty much useless at the moment.
atlantis-diagnostics-20230507-1710.zip atlantis-diagnostics-20230521-1029.zip