CPU stuck at 100% when Docker running, but no containers using up much CPU

I'm at my wits' end on this one, totally at a loss.

Been at this for hours tonight and I think I need to admit I need some help here.


I replaced my cache drives w/ new ones, but found out that they didn't support deterministic TRIM, so I swapped them out for an Asus Hyper M.2 v2 PCIe 3.0 card that holds 4x M.2 NVMe SSDs, and set PCIe bifurcation to x4/x4/x4/x4, which worked great.


I swapped the NVMe drives in to the cache pool one at a time.

It took about 12 hours for each drive to rebuild.

I did notice my CPU pushing 100% usage pretty much the entire time, but I didn't think much of it because it was rebuilding the cache drives. 


At this time I also swapped everything from using /mnt/user/appdata/ to /mnt/cache/appdata/ to eke out some extra speed.

I changed every container touching appdata to this, and also pointed my Docker settings at

/mnt/cache/system/docker (using a folder, not docker.img)

and /mnt/cache/appdata/
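
For anyone making the same switch, the path change described here can be sketched in a few lines of Python. This is only an illustration: the two prefixes come from this post, while the helper names `to_cache_path` and `suspicious_paths` and the list of known roots are assumptions made up for the example. The second helper flags any path that is not under a known root, which is a cheap way to catch a mistyped pool name before saving a template.

```python
from pathlib import PurePosixPath

# Prefixes taken from the post; adjust to your own share and pool names.
USER_SHARE = PurePosixPath("/mnt/user/appdata")
CACHE_POOL = PurePosixPath("/mnt/cache/appdata")


def to_cache_path(host_path: str) -> str:
    """Map /mnt/user/appdata/... to /mnt/cache/appdata/...; leave other paths unchanged."""
    path = PurePosixPath(host_path)
    try:
        relative = path.relative_to(USER_SHARE)
    except ValueError:
        # Not under the user-share appdata folder, so do not touch it.
        return host_path
    return str(CACHE_POOL / relative)


def _is_under(path: PurePosixPath, root: PurePosixPath) -> bool:
    """True if path is root itself or lives somewhere below it."""
    return path == root or root in path.parents


def suspicious_paths(host_paths, known_roots=("/mnt/user", "/mnt/cache")):
    """Return the paths that are not under any known root (hypothetical check for typos)."""
    roots = [PurePosixPath(r) for r in known_roots]
    return [p for p in host_paths
            if not any(_is_under(PurePosixPath(p), r) for r in roots)]
```

Running `suspicious_paths` over every host path in your container templates and Docker settings should return an empty list if everything really points at the pool or the user share.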


I rebooted the server for good measure (glad I did because I found it tried to boot from NVMe and not my flash anymore, whoops!)

and then I realized my MariaDB was corrupted and NextCloud stopped working.

Weird... okay. Moving on, I saw my CPU was stuck at 100% usage when Docker was running.


I stopped ALL containers one by one and CPU was still at 100%, which didn't make much sense; even with every container in a stopped state, the problem was still there.

At a loss, I deleted my docker directory (I'm using a directory, not an img).

Everything started coming back up fine after re-adding them from CA, with the exception of course of Nextcloud/MariaDB.

Thought I was in the clear and it was a corrupt docker folder at first, but while troubleshooting the MariaDB/Nextcloud issue, I saw my CPU shoot back up to 100% again. Very weird. And that's where I'm stuck now...


I ended up removing MariaDB for now entirely and deleting its img but still no luck. 

As you can see I'm stuck at 100%, but overall usage per container is very low. 
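
One way to put numbers on "host at 100% but containers low" is to add up what `docker stats` reports and compare it with the host figure. The sketch below is a rough illustration, not a diagnostic tool: it relies on the standard `docker stats --no-stream --format` options with the `.Name` and `.CPUPerc` placeholders, and the function names are made up for this example. Note that `docker stats` expresses CPU relative to a single core, so the total is divided by the core count before comparing it with a whole-host percentage. It also only covers containers, so CPU burned by dockerd itself or by other host processes will not appear in it.

```python
import subprocess


def parse_cpu_percentages(stats_output: str) -> dict:
    """Parse lines of 'name<TAB>12.34%' into {name: 12.34}."""
    usage = {}
    for line in stats_output.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, percent = line.partition("\t")
        try:
            usage[name] = float(percent.strip().rstrip("%"))
        except ValueError:
            # Skip rows without a numeric CPU value.
            continue
    return usage


def total_as_host_percent(usage: dict, cores: int) -> float:
    """Convert the summed per-core percentages into a share of the whole host."""
    return sum(usage.values()) / cores


def read_container_cpu() -> dict:
    """Take one snapshot of per-container CPU via the docker CLI."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.Name}}\t{{.CPUPerc}}"],
        capture_output=True, text=True, check=True,
    )
    return parse_cpu_percentages(result.stdout)
```

If `total_as_host_percent(read_container_cpu(), os.cpu_count())` comes out far below what the dashboard shows, the load is coming from something `docker stats` does not account for.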


Diagnostics and screenshots attached below.








Edited by CorneliousJD

I left the server on overnight and woke up to much lower CPU usage, but I'm still wondering what could have caused this - the server became so sluggish it took an extremely long time to do anything.


Also, btw, the top command showed 0.0 wa (I/O wait) during all of this; I forgot to include that screenshot.
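
For reference, the "wa" figure in top can be reproduced from two samples of the aggregate `cpu` line in /proc/stat. The sketch below assumes the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal, then the guest counters); the function names are hypothetical and only meant to show the arithmetic.

```python
import time


def cpu_fields(stat_line: str) -> list:
    """Return the numeric counters from the aggregate 'cpu' line of /proc/stat."""
    parts = stat_line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in parts[1:]]


def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = [b - a for a, b in zip(cpu_fields(before), cpu_fields(after))]
    # Sum user..steal only; guest time is already included in user/nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def read_cpu_line() -> str:
    """Read the first line of /proc/stat (Linux only)."""
    with open("/proc/stat") as handle:
        return handle.readline()


def sample_iowait(seconds: float = 5.0) -> float:
    """Measure iowait over a short window."""
    first = read_cpu_line()
    time.sleep(seconds)
    return iowait_percent(first, read_cpu_line())
```

A value near zero from `sample_iowait()` while the CPU is pegged points at real CPU work rather than processes waiting on storage.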

Just now, JorgeB said:

I suspect one of the containers, but difficult to confirm without trying one by one.


I left Docker running and turned them all off one by one while watching my CPU, but nothing changed. I ended up with every single container in a stopped state and CPU still at 100%, with dockerd still being the culprit. Very strange to me.


I'm running a btrfs balance right now and CPU is back to 100% usage, but it was shooting up even before I started that.


Currently working, so it's difficult to dedicate a ton of time to this during the work day, but I'm now starting to regret my decision to go NVMe because everything was absolutely fine before this.

I'm wondering if it's somehow related?

5 minutes ago, JorgeB said:

If you are still using a docker folder I would also try a docker image instead, folder has been known to sometimes cause weird issues.


Interesting, I'll move back to a docker.img file. I figured that without the extra layer of the img file I might see better performance w/ just a folder, plus never hit size limits (as you can see, I have a LOT of containers running).


I'll let the current btrfs balance finish first then kill off the docker folder and re-create a docker.img instead in /mnt/cache/system/docker/docker.img and re-fire up containers and see what happens from there. 

6 hours ago, JorgeB said:

If you are still using a docker folder I would also try a docker image instead, folder has been known to sometimes cause weird issues.


I've completed everything above. While recreating my docker.img file I loaded everything back from Previous Apps in CA and they all ran fine, sitting at around 30% CPU usage.


I set up autostart on the Docker containers I wanted auto-starting, rebooted, and I'm back at 100% CPU again...

I'm at a total and utter loss here. 


Is there anything I can look into? The server becomes unusable after it sits like this for a while.

12 hours ago, JorgeB said:

I still think it must be one of the containers, but possibly it doesn't cause the issue immediately.


I wanted to make sure to post here for anyone else who finds this later, but I think this may be resolved.


So there was definitely an issue somewhere. My MariaDB got corrupted, and that may have been the culprit in the end?


I re-created the docker folder as docker.img and re-added the containers. They all added fine and the server was stable, but upon rebooting w/ docker autostarts on, it went to 100% again.

I was patient this time and let it sit and it stabilized after 10-15 mins or so. 

I know I have a lot of containers and this is likely to be expected.
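
If you want to tell a normal startup burst apart from a server that never settles, a small watcher can time how long it takes to calm down. The sketch below is only an illustration under stated assumptions: it uses the 1-minute load average from `os.getloadavg()` as a stand-in for the dashboard's CPU percentage (a related but different metric), and the threshold, interval, and function name are arbitrary choices for the example.

```python
import os
import time


def wait_until_settled(sample=lambda: os.getloadavg()[0], threshold=4.0,
                       consecutive=3, interval=60.0, max_samples=30,
                       sleep=time.sleep):
    """Return how many samples were taken before the reading stayed below
    `threshold` for `consecutive` samples in a row, or None if it never did."""
    calm = 0
    for taken in range(1, max_samples + 1):
        if sample() < threshold:
            calm += 1
            if calm >= consecutive:
                return taken
        else:
            # A busy reading resets the streak.
            calm = 0
        if taken < max_samples:
            sleep(interval)
    return None
```

With the defaults this checks once a minute for up to half an hour; a `None` result after a reboot would suggest something more than the usual autostart load.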


Whatever was initially going on though, the server wasn't stabilizing and it was sitting at 100% CPU for hours, so perhaps a combination of fixing my MariaDB (restored from backup) and/or re-creating the docker folder as docker.img ended up resolving everything.


I'm at about 24 hours of stability w/out hitting high CPU now.


Thanks JorgeB for the help. (Again!)
