[SOLVED] Docker containers become slow/unusable during large data movement



Every so often, unRAID runs an automated process that does a large data movement: a torrent finishing up and moving to its final location, an NZB unpacking a large file, etc. Sometimes it's me copying data from another machine to unRAID. At least, I think that's the cause of the issue.

 

The issue is that all my Docker containers (and maybe even VMs) slow down to a crawl, to the point where they're effectively unusable. I've got monitoring from HetrixTools pinging the services, and it shows this happens several times a day. All of those yellow and red ! marks are this issue occurring.

 

I'll look at the CPU usage and it'll show several threads pegged at 100%. When I look at htop to see what's responsible, it's usually a VM as the highest process. However, for this test I had all my VMs turned off and only my Binhex-Deluge and NginxProxyManager Docker containers running, to verify the issue still occurred with a 'light' workload.

 

For example:

Here's a test I did to hopefully gather more info.

 

Instead of having everything running, I only had the Binhex-Deluge and NginxProxyManager Docker containers up, and I was copying some video files to a user share from my other desktop PC. The copy ran anywhere between just 5 MB/s and 60 MB/s, which feels crazy slow. There also happened to be a parity check running (because unRAID keeps crashing; probably unrelated, since this was happening before the crashes), but that doesn't usually seem to matter.

Deluge kept losing connection to the webserver over and over again. Refreshing the page to reconnect took forever.

 

[screenshot attached]

 

What I don't know here is:

  1. Why doesn't htop match the Dashboard? Which one is correct?
  2. Why does everything come to a crawl? The unRAID interface itself seems fine, just the stuff hosted on it.
    1. Is it just the NginxProxyManager docker container not being able to handle the traffic to the internal services?
      1. I don't think so, because the unavailability occurs even when using local IPs.
    2. Is the CPU actually being pegged hard enough to block web traffic to the internal services?
  3. Is it something completely unrelated, and the CPU stuff is a red herring? Like, is it something to do with the NIC? Some other setting?
    1. I do have my two NICs link-aggregated with 802.3ad, but I didn't have this problem when I first set that up.
  4. Why is it happening with almost everything turned off?
  5. Is there some hardware issue? I don't think I'm too low-specced for what I'm trying to do, but maybe?

 

Honestly, I'm just at a loss as to why everything slows down so much when I'm doing what I feel are basic file copies. Diagnostics taken while the issue was occurring are attached.

 

Thank you for reading. Please let me know if you need any more information.

hathor-diagnostics-20210427-1809.zip

 

Solution: Move the Docker containers and VMs onto their own SSD cache pools, separate from the data-ingestion cache pool. I haven't technically attempted this yet because of a hardware failure, but I feel it's the right answer.

Edited by Majawat
add a bit more info
Link to comment
2 hours ago, Majawat said:

copying over some video files to a user share

 

2 hours ago, Majawat said:

parity check running

Multiple simultaneous writes to the parity array will overload it; as a result, cores get pegged at 100%. The dashboard differs from htop because the dashboard includes IOWAIT.
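A quick way to see this from the unRAID console is with standard Linux tools; a rough sketch only, and note that iostat needs the sysstat tools, which may not be installed by default:

```bash
# The "wa" column is the percentage of CPU time spent waiting on disk I/O.
# High "wa" while "us"/"sy" stay low means the cores look busy on the
# dashboard but are really just blocked waiting on the array.
vmstat 5

# Per-disk view (needs sysstat): %util near 100 on the array disks during a
# transfer or parity check points to the disks, not the CPU, as the bottleneck.
iostat -x 5
```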

 

 

2 hours ago, Majawat said:

I'm just at a loss of why everything slows down so much

Do those Docker containers access the array?

Edited by Vr2Io
Link to comment

Is there any way I can "fix" this? Is it just that my processors aren't fast enough? Something else? How can I find out which resource is the limit? I don't see the whole CPU pegged, just a number of threads; for example, the picture above shows only 47% used. Is that really enough to cause this issue?

 

I really want to be able to have multiple copy and download jobs occurring at the same time to make full use of my system.

Link to comment

You need to understand the bottleneck. As you mention, the Docker containers / VMs access the array; once the array is overloaded, every related service will have problems no matter how much CPU you have.

Usually we set Docker and VMs to run on a dedicated NVMe / SSD mounted with Unassigned Devices as standalone storage.
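In unRAID you would normally change this through the container template's host path mappings, but the plain-docker equivalent looks roughly like the sketch below; the mount point, share names, and image are illustrative examples, not taken from your setup:

```bash
# Sketch: keep the container's config and download paths on a dedicated SSD
# (Unassigned Devices mounts under /mnt/disks/<label>) instead of a /mnt/user
# share that sits on the parity array. All paths and names are illustrative.
docker run -d --name binhex-deluge \
  -v /mnt/disks/fast_ssd/appdata/deluge:/config \
  -v /mnt/disks/fast_ssd/downloads:/data \
  binhex/arch-deluge
```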

 

The array is usually built from mechanical disks, which are really only suitable for a single I/O session per disk. For example, when I transfer (large) files to a non-parity array, I keep it to a single session per disk for maximum transfer efficiency.
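One way to apply that on a non-parity array is one copy job per destination disk, so each spindle sees a single sequential stream; a rough sketch with made-up paths:

```bash
# One writer per mechanical disk: each rsync gets its own spindle, so every
# disk handles a single sequential stream. Paths are illustrative only.
rsync -a --progress /mnt/remote/ingest_a/ /mnt/disk1/media/ &
rsync -a --progress /mnt/remote/ingest_b/ /mnt/disk2/media/ &
wait
```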

 

[screenshot attached: transfer example]

Edited by Vr2Io
  • Like 1
  • Thanks 1
Link to comment

So it's primarily the mechanical array not being able to keep up with the data demands, and I could help this by having a large SSD/NVMe cache pool for ingestion and moving my Dockers/VMs to a separate SSD/NVMe cache pool?

Edited by Majawat
Link to comment

On top of @Vr2Io's suggestions, it can be useful to limit the cores (CPU pinning) for downloaders such as Deluge, particularly to exclude the cores primarily used by unRAID, i.e. cores 0 and 12 in your case.

I've noticed Deluge will use all available resources when moving files; I've seen it spin up 32 threads on my 1950X when moving multiple torrents.
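In unRAID the pinning is normally set from the CPU pinning page or the container template, but the plain-docker equivalent is just the --cpuset-cpus option; the core numbers and container name below are only examples (they assume cores 0 and 12 are the pair left free for unRAID, as mentioned above):

```bash
# Sketch: restrict the Deluge container to a subset of cores so a busy
# download/move can't starve everything else. Core list is illustrative and
# deliberately leaves cores 0 and 12 alone.
docker update --cpuset-cpus="2-5,14-17" binhex-deluge
```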

 

Here is how my deluge instance is now configured

 

[screenshot attached: Deluge container configuration]

 

As mentioned, I also have a separate disk off the array dedicated to downloads.

Only the finished product (i.e. the unzipped movie file) gets moved to the array, while the torrent and its associated files stay on the downloads HDD.

I set this up long ago on an unassigned device, but if you're on 6.9+, setting it up as a pool makes more sense.

 

  • Thanks 1
Link to comment
1 minute ago, Squid said:

Your biggest problem is that the docker.img is stored on the array.

 

During a parity check coupled with a data transfer to the array, performance is definitely going to drop precipitously.

 

Move the image to the cache pool

The performance issues happen without a parity check occurring, but I'm guessing too much other stuff is going on at the same time anyway.

 

But I agree, I need to move the system share to SSD. That's the "Prefer" cache setting, correct?
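Before flipping the setting, it's probably worth confirming where the image actually sits; a quick check from the console (the path below is the usual unRAID default, adjust if yours differs):

```bash
# Default location of the Docker image on unRAID is usually here:
ls -lh /mnt/user/system/docker/docker.img

# See which underlying device really holds it: hits under /mnt/disk* mean it
# is still on the array; a hit under a pool (e.g. /mnt/cache) means it moved.
find /mnt/disk* /mnt/cache* -maxdepth 3 -name docker.img 2>/dev/null
```

As I understand it, with the share set to Prefer and the Docker/VM services stopped, the mover will pull the system share over to the pool on its next run.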

 

I also now plan on getting a few extra SSDs and building a docker/vm cache separate from the data ingestion cache. 

 

Thanks everyone

Link to comment
  • Majawat changed the title to [SOLVED] Docker containers become slow/unusable during large data movement
