
can't start or stop Dockers and abnormal CPU usage


Go to solution Solved by JorgeB,


Hi there,

 

I'm stuck again.

This morning, my jDownloader Docker behaved a bit strangely: super long load times and such. I also had abnormally high CPU usage; half the cores were at full load, and netdata only showed "iowait" as high usage. I tried to restart the container, but every time I tried I got this error:

https://i.imgur.com/Y0bczk8.png

 

I googled around a bit and found the recommendation to recreate the docker image, so I followed this guide: https://wiki.unraid.net/Manual/Docker_Management#Re-Create_the_Docker_image_file

I made a backup of the original docker.img before I deleted it. I then started to reinstall my Dockers; jDownloader started fine and seemed alright, but shortly after it had the same issues again, along with the high CPU load. Some other issues also occurred, like Nginx getting the wrong ports assigned. I couldn't change this, though, because I can't stop any Docker; I get the same "server error" as above.

 

So I started from scratch, deleted the docker.img and made a new one, but this time with an increased size (from 20 to 60GB). I installed the Dockers again. Nginx now had the right ports assigned, but didn't autostart like the others and, you guessed it, I can't start it because of the same "server error". I also can't stop other Dockers.

 

Now for a test I disabled the Docker services entirely, stopped my VM, and I still have this abnormal CPU usage without anything running.
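
A quick way to confirm that this kind of load is I/O wait rather than real computation is to sample the aggregate "cpu" line of /proc/stat twice and compare the counters. The following is only a minimal sketch, assuming a Linux host where the fifth value on that line is iowait; the function names are made up for this illustration.

```python
import time


def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Order: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already counted in user/nice, so only the first 8 are summed.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


if __name__ == "__main__":
    with open("/proc/stat") as stat_file:
        first = parse_cpu_line(stat_file.read())
    time.sleep(1)
    with open("/proc/stat") as stat_file:
        second = parse_cpu_line(stat_file.read())
    print(f"iowait: {iowait_percent(first, second):.1f}%")
```

A persistently high value with nothing running points at storage that is slow to answer rather than at a busy process.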

All drives are healthy according to unRAID.

 

I attached my diagnostics, I'm not sure what to do.

 

Best,

Rick

lunas-diagnostics-20221203-1203.zip


Hi there Jorge,

 

I ran memtest for 5½h, it made it through 2½ passes without any errors.

 

I rebooted to unRAID to do the scrub operation, but somehow my cache pool (a single SATA3 SSD) does not offer this option. However, on my 2nd SSD, an NVMe drive (not used as cache), I do have this option. What could be the matter here? :o

 

Best,

Rick

7 hours ago, CameraRick said:

Hi itimpi,

 

it seems I made the pool XFS. I think I did this because when I started with unRAID a year ago, I read a few posts recommending this for single-disk caches.

 

Best,
Rick

That would explain why there is no ’scrub’ option as that is specific to BTRFS file systems.   With XFS you will instead have options to use xfs_repair to check and/or repair the file system.

40 minutes ago, JorgeB said:

Run a scrub first on the btrfs pool to see if corruptions are detected, any corrupt files will be listed in the syslog.

Hi there Jorge,

 

my Cache pool (a single SSD) can't be scrubbed, because of XFS. I have a 2nd SSD installed which is BTRFS, but that one I don't use as Cache for unRAID but only for Shares.

I started xfs_repair with -n, which seems to only check; unfortunately there's not much output from that. The help text says it can take long (even hours), but doesn't clarify how I can see when it's actually done, haha :)
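
On the question of when the check is done: it is finished when the xfs_repair process exits. According to the xfs_repair man page, a run with -n exits with status 1 if corruption was detected and 0 if not. Below is a minimal sketch of wrapping that from the command line; the helper names are hypothetical and the device path is whatever partition holds the XFS file system, which must not be mounted while the check runs.

```python
import subprocess


def build_check_command(device):
    """Command for a read-only check: -n means 'no modify'."""
    return ["xfs_repair", "-n", device]


def interpret_check_status(returncode):
    """Map the exit status of 'xfs_repair -n' to a short verdict."""
    if returncode == 0:
        return "no corruption detected"
    if returncode == 1:
        return "corruption detected"
    return f"unexpected exit status {returncode}"


def run_check(device):
    """Run the check and block until xfs_repair exits."""
    result = subprocess.run(
        build_check_command(device), capture_output=True, text=True
    )
    verdict = interpret_check_status(result.returncode)
    return verdict, result.stdout + result.stderr
```

Calling run_check with the real partition path returns only once the check has completed, so the returned verdict doubles as the "done" signal.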

 

Best,

Rick

2 hours ago, CameraRick said:

at this point I guess it's good to ask whether it would be clever to back up my cache drive entirely before doing so?

And, if I'm doing that anyway, would it make sense to back up the Cache, format the disk to BTRFS and use that from now on?

That is a decision you have to make.   If running a single device pool I prefer to use XFS as it seems more robust, but has the downside that you do not have built-in file corruption checking.

1 hour ago, JorgeB said:

Scrub that one, if it's empty see here to reset the stats then keep monitoring for new errors.
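
The monitoring mentioned in the quote can be done by reading the per-device error counters that "btrfs device stats <mount point>" prints, one "[device].counter value" pair per line. The following is a rough sketch of parsing that output and flagging anything non-zero; the function names are invented for this example, and the text is assumed to have been captured from the command beforehand.

```python
import re

# Matches lines such as "[/dev/sdb1].corruption_errs  0".
STAT_LINE = re.compile(
    r"^\[(?P<device>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$"
)


def parse_device_stats(output):
    """Turn 'btrfs device stats' output into {device: {counter: value}}."""
    stats = {}
    for line in output.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            counters = stats.setdefault(match["device"], {})
            counters[match["counter"]] = int(match["value"])
    return stats


def nonzero_counters(stats):
    """Keep only devices and counters that report at least one error."""
    return {
        device: {name: value for name, value in counters.items() if value}
        for device, counters in stats.items()
        if any(counters.values())
    }
```

An empty result from nonzero_counters means no new errors were recorded since the counters were last reset (the -z option of the same command resets them).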

Ok, I will; right now I can't, because xfs_repair wants the array in Maintenance mode and the BTRFS scrub wants the array to be started. So I will have to wait for the check; however, it's taking quite long already (only around 130GB are used, the SSD is 512GB).

 

Just so I understand it, how exactly does the scrubbing help with the initial issue, i.e. Dockers not working properly and not being able to start or stop, when they are not sitting on the BTRFS share SSD?

 

Best,

Rick


Hi Jorge,

 

the scrub was so slow that the estimated finishing date was July next year. Even after a day it hadn't even reached 1%, yet it had already found 12 uncorrectable errors. I attached the diagnostics anyway.

I figured that this drive might just be broken, so I aborted the operation and wanted to copy all data to the array.

 

But something doesn't work.

I wanted to freshly format both drives: my XFS cache SSD as well as the other BTRFS one, which is acting up. For all the shares that sit on the XFS cache I switched "use cache pool" to "YES"; the mover went into action and copied the data. However, 4.8GB remains on the cache SSD.

For the 2nd one it's a mess. It only has one share on it, which I also switched to "YES", but the drive is still 77% full. The mover starts to work, but doesn't seem to move anything.
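
To see what the mover left behind, it can help to list the files still sitting on the pool, largest first. This is only a sketch under assumptions: the helper name is made up, and /mnt/cache is assumed to be where the pool is mounted. Files that are still open, for example because a container is running, are a plausible reason for leftovers, but that is a guess and not something the diagnostics here confirm.

```python
import os


def leftover_files(root):
    """Return (size_in_bytes, path) for regular files under root, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            found.append((os.path.getsize(path), path))
    return sorted(found, reverse=True)


if __name__ == "__main__":
    # Assumed mount point of the pool; adjust to the actual pool name.
    for size, path in leftover_files("/mnt/cache")[:20]:
        print(f"{size / 1e6:10.1f} MB  {path}")
```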

 

Actually I set up a "backup" of this 2nd share, in case it ever broke; every two days it is supposed to do an incremental copy to the array with the DirSyncPro docker. Unfortunately it hasn't done anything since July; great container, huh.

 

Anyway, I guess I will remove the NVMe; it seems obvious that it's the culprit here. It would just be good if I could get the other SSD empty before formatting it again :o

 

Best,

Rick

lunas-diagnostics-20221205-2142.zip


Hi Jorge,

 

thanks for your extended help on my issue.

 

For now I kicked the drive out of my unRAID. All data on it was not that important, everything of real value was saved somewhere else.

The cache SSD is still not used; I do think I will switch it from XFS to BTRFS. I restored the original docker.img that I backed up earlier, and after starting the Docker engine again (for now running from the array) everything worked as expected.

I wanted to get running again; I will deal with the SSDs when I have some time.

 

It's obvious now that this drive was the culprit. I actually did a SMART analysis before, but it came back "healthy". I will do it again and keep you posted.

 

Best,

Rick

Edited by CameraRick

Hi there Jorge,

 

I removed the faulty NVMe and tested it in another computer; every SMART value is fine and all software (including unRAID) claims it is healthy, but it is definitely broken. It is super slow and unresponsive, but shows no errors so far. Oh well, I have to replace it.

 

Anyway, of course the problems don't stop, huh.

I re-formatted my Cache to BTRFS to get better error correction and such, enabled it again and used the Mover; so far so good.

And most Dockers work, but Nginx (which is essential for my Nextcloud to work) just won't start. Same "Server Error" as before. I restarted the machine, but no dice.

It seems there is some trouble with its port, when I switch it from 80 to 81 I can start Nginx, but I can't get into the WebGUI. Switching back to 80 gives an error, "already in use", but there's no other docker using Port 80.
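
An "already in use" message does not have to come from another container; the port can also be held by the host itself or by a leftover process. One hedged way to test this from the server's console is to try binding the port directly, as in the sketch below. The function name is invented for this example, and binding ports below 1024 normally needs root, so without it the probe reports "not free" even when nothing is listening.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Covers "address already in use" as well as missing privileges.
            return False
    return True


if __name__ == "__main__":
    for candidate in (80, 81):
        state = "free" if port_is_free(candidate) else "in use or not permitted"
        print(f"port {candidate}: {state}")
```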

 

I attached some new diagnostics

 

Best,

Rick

lunas-diagnostics-20221206-2017.zip


Hi there Jorge,

 

believe it or not, I figured it out.

So somehow the ports on the Nginx container got messed up. After fixing that, it started but wasn't accessible through the WebUI. Checking the logs, Nginx wasn't able to find the LetsEncrypt certs and therefore wouldn't run. It seems they were not properly transferred by the mover before I reformatted the SSD.

I bit the bullet, installed Nginx from scratch and it does work again now.

 

Thank you so much for putting up with me for so long.

 

May I ask you one last thing: now that the SSD is formatted to BTRFS, would you recommend scheduling an automatic scrub every week or so?

 

Best,
Rick
