High inward network traffic upsets Docker

ScottAS2 · December 28, 2023

Hi all,

I've been facing a problem with my Unraid server for a while whereby a large file transfer to the server will somehow upset Docker and cause problems, including:

- The Docker portions of the webGUI become slow.
- Attempts to stop, remove, or kill Docker containers fail with error messages along the lines of "attempted to kill container but did not receive a stop event from the container".
  - This is regardless of whether I use the webGUI, the command line, or Compose.
- Connections to Docker containers from elsewhere on the network fail.
- (Particularly annoying) My ADS-B receiver Docker container stops working and needs to be manually restarted before it'll work again, even if the offending transfer has already finished.

Examples of big transfers in that can cause this problem:

- A backup coming in to the server from offsite (through rsync directly on Unraid)
- Windows File History backing up to the server (through Unraid's built-in SMB server functionality)
- A large download through Lancache on the server (through the Lancache Docker container)

You'll notice that the first two of those should have nothing to do with Docker.

Frustratingly, the following do not cause the problem:

- A backup going offsite from the server (rsync again)
- macOS backing up to the server (through a Time Machine Docker container)
- Media being served out to other devices (built-in SMB)

Clearly, my server is running out of some resource that Docker needs, but I'm at a loss as to what it is. Despite the problem being associated with a lot of network traffic, I doubt it's bandwidth itself: one of the triggers comes from offsite, and my 100 Mb/s connection (less VPN overheads) can't come close to saturating the server's gigabit Ethernet. While the CPU does run up during problematic transfers, it doesn't seem to be totally overloaded, and there's bags and bags of free memory. Let's take as a specimen what happened yesterday evening and see what Netdata recorded:

[Screenshot: Netdata chart]
As you can see, the ADS-B receiver fell over sometime around 17:00. I believe this was triggered by Windows backing up to Unraid's SMB server. The CPU runs up a bit for about 20 minutes at 16:30, then a bit more an hour later, but it never goes above 60%:

There's some disk IO at the same times:

Nonetheless, there's plenty of free memory throughout:

And there's a network traffic spike, but nothing gigabit Ethernet shouldn't be able to handle:

Can anyone suggest other metrics I should examine? Or is there a way to decrease the general niceness of the Docker daemon so it will just take a bigger share of what resources there are?

Vital statistics:

Unraid v6.12.6
Dell PowerEdge R720XD
Diagnostics are attached

unraid-diagnostics-20231228-1006.zip

tpill90 · February 21

What drives (model numbers are helpful) are you running in the server? Do they have a DRAM cache?

This sounds like your drives simply can't handle the write workload. Once the SLC cache on the drive is exhausted, the remaining writes go at the drive's actual sustained speed, which is much lower. That huge drop in performance causes heavy contention between the applications writing to disk, amplifying the performance issues they are all having.

Reads aren't affected because they don't depend on the SLC write cache. They will generally run at full speed.
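
If you want to check this on your own hardware, here is a minimal Python sketch that writes a large file in fixed-size chunks, forces each chunk to disk, and records the throughput per chunk. The function name `measure_write_throughput` and the default chunk size are my own choices for illustration, not anything from Unraid or Docker, and it assumes Python is available on the server.

```python
import os
import time


def measure_write_throughput(path, total_bytes, chunk_bytes=64 * 1024 * 1024):
    """Write total_bytes to path in chunks and return the MB/s of each chunk.

    Illustrative helper (hypothetical name). Each chunk is fsync'd so the
    timing reflects the drive rather than the Linux page cache.
    """
    chunk = os.urandom(chunk_bytes)  # random data, so compression can't flatter the result
    rates = []
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            start = time.perf_counter()
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push the data to the device before stopping the clock
            elapsed = max(time.perf_counter() - start, 1e-9)
            rates.append(chunk_bytes / elapsed / 1e6)  # MB/s for this chunk
            written += chunk_bytes
    return rates
```

For example, `measure_write_throughput("/mnt/cache/write_test.bin", 40 * 1024**3)` would write 40 GiB (the path is an assumption about where the cache pool is mounted, so adjust it, and delete the file afterwards). A sharp, sustained drop in the per-chunk figures partway through would be consistent with the drive's cache running out.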

ScottAS2 · February 21

14 hours ago, tpill90 said:

What drives (model number is helpful) are you running in the server? Do they have a dram cache?

This sounds like your drives simply can't handle the write workload. The SLC cache on the drive is exhausted, so the remaining writes go at the actual full speed of the drive. The huge drop in performance causes huge contention between the applications writing to disk, amplifying the performance issues they all are having.

The reads aren't affected by this because reads don't suffer from the same kind of issues. They will generally be at full speed.

This chimes with a discovery I've made in the meantime: bypassing the cache and writing directly to the array seems to solve the problem. Both cache drives (BTRFS/RAID1) are Crucial BX500 1TB SATA SSDs; part number CT1000BX500SSD1. I'm not sure how to find out if they have a DRAM cache, although the data sheet makes no mention of it, which probably means "no".
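
For anyone wanting to confirm that the disks are the bottleneck during one of these transfers, one metric worth watching is CPU iowait, which Linux exposes in `/proc/stat`. Below is a minimal Python sketch that samples it twice and reports the share of CPU time spent waiting on I/O in between. The function names are made up for illustration, and the 5-second interval is an arbitrary default.

```python
import time


def parse_cpu_times(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat as ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_fraction(before, after):
    """Fraction of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    # Fields: user nice system idle iowait irq softirq steal (guest times
    # are already counted inside user/nice, so only the first eight are summed).
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return deltas[4] / total


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, interval seconds apart, and return the iowait share."""
    with open(path) as f:
        before = parse_cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_times(f.read())
    return iowait_fraction(before, after)
```

Running `sample_iowait()` while a problematic transfer is in progress, and again when the server is idle, gives a rough comparison. A much higher value during the transfer would point towards storage rather than CPU or network as the resource being exhausted.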
