
High inward network traffic upsets Docker


ScottAS2


Hi all,

I've been facing a problem with my Unraid server for a while whereby a large file transfer to the server will somehow upset Docker and cause problems, including:

  • The Docker portions of the webGUI become slow
  • Attempts to stop, remove, or kill Docker containers fail with error messages along the lines of "attempted to kill container but did not receive a stop event from the container"
    • This is regardless of whether I use the webGUI, the command line, or Compose
  • Connections to Docker containers from elsewhere on the network fail.
  • (particularly annoying) my ADS-B receiver Docker container stops working and needs to be manually restarted before it'll work again, even if the offending transfer has already finished.

 

Examples of big inbound transfers that can cause this problem:

  • A backup coming in to the server from offsite (through rsync directly on Unraid)
  • Windows File History backing up to the server (through Unraid's built-in SMB server functionality)
  • A large download through Lancache on the server (through the Lancache Docker container)

You'll notice that the first two of those should have nothing to do with Docker.

 

Frustratingly, the following do not cause the problem:

  • A backup going offsite from the server (rsync again)
  • macOS backing up (through a Time Machine Docker container)
  • Media being served out to other devices (built-in SMB)

 

Clearly, my server is running out of some resource that Docker needs, but I'm at a loss as to what it is. Despite the problem being associated with a lot of network traffic, I doubt it's bandwidth itself, since one of the triggers comes from offsite, and it's unlikely my 100 Mb/s connection (less VPN overheads) is saturating the server's gigabit Ethernet. While the CPU does run up during problematic transfers, it doesn't seem to be totally overloaded, and there's bags and bags of free memory. Let's take as a specimen what happened yesterday evening and see what Netdata recorded:

 

[Image: image.png]
As you can see, the ADS-B receiver fell over sometime around 17:00. I believe this was triggered by Windows backing up to Unraid's SMB server. The CPU runs up a bit for about 20 minutes at 16:30, then a bit more an hour later, but it never goes above 60%:

[Image: CPU]

There's some disk IO at the same times:

[Image: Disk IO]

Nonetheless, we have all of the memory:

[Image: System RAM]

And there's a network traffic spike, but nothing gigabit Ethernet shouldn't be able to handle:

[Image: Network]

 

Can anyone suggest other metrics I should examine? Or is there a way to decrease the general niceness of the Docker daemon so it will just take a bigger share of what resources there are?

 

Vital statistics:

  • Unraid v6.12.6
  • Dell PowerEdge R720XD
  • Diagnostics are attached

unraid-diagnostics-20231228-1006.zip

  • 1 month later...

What drives (model number is helpful) are you running in the server? Do they have a DRAM cache?

 

This sounds like your drives simply can't handle the write workload. Once the SLC cache on the drive is exhausted, the remaining writes go at the drive's actual sustained speed, which is much slower. That huge drop in performance causes heavy contention between the applications writing to disk, amplifying the performance issues they are all having.

The reads aren't affected by this because reads don't suffer from the same kind of issues. They will generally be at full speed. 
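
One way to check this theory is to watch per-device write latency while a problematic transfer is running. Below is a minimal sketch, assuming a Linux host where `/proc/diskstats` is readable. The helper names (`parse_diskstats`, `write_await_ms`, `monitor`) and the sampling interval are made up for illustration and are not part of Unraid or Netdata.

```python
import time


def parse_diskstats(text):
    """Map device name -> (writes completed, milliseconds spent writing)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or malformed lines
        # Layout: major minor name, then counters; field 7 is writes
        # completed and field 10 is time spent writing in ms (0-indexed).
        stats[fields[2]] = (int(fields[7]), int(fields[10]))
    return stats


def write_await_ms(before, after):
    """Average milliseconds per completed write between two snapshots."""
    result = {}
    for name, (writes_after, ms_after) in after.items():
        if name not in before:
            continue
        writes_before, ms_before = before[name]
        writes = writes_after - writes_before
        result[name] = (ms_after - ms_before) / writes if writes > 0 else 0.0
    return result


def monitor(samples=12, interval=5):
    """Print write latency per device every `interval` seconds (assumed values)."""
    with open("/proc/diskstats") as f:
        prev = parse_diskstats(f.read())
    for _ in range(samples):
        time.sleep(interval)
        with open("/proc/diskstats") as f:
            cur = parse_diskstats(f.read())
        for dev, await_ms in sorted(write_await_ms(prev, cur).items()):
            if await_ms > 0:
                print(f"{dev}: {await_ms:.1f} ms per write")
        prev = cur
```

Calling `monitor()` during a large inbound transfer and again while the server is idle gives two sets of numbers to compare. If the cache devices jump from a few milliseconds per write to hundreds, that would be consistent with the drives being the bottleneck, though it would not prove it.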

 

 

14 hours ago, tpill90 said:

What drives (model number is helpful) are you running in the server?  Do they have a dram cache?

 

This sounds like your drives simply can't handle the write workload. The SLC cache on the drive is exhausted,  so the remaining writes go at the actual full speed of the drive.  The huge drop in performance causes huge contention between the applications writing to disk, amplifying the performance issues they all are having.  

The reads aren't affected by this because reads don't suffer from the same kind of issues. They will generally be at full speed. 

 

This chimes with a discovery I've made in the meantime: bypassing the cache and writing directly to the array seems to solve the problem. Both cache drives (BTRFS/RAID1) are Crucial BX500 1TB SATA SSDs; part number CT1000BX500SSD1. I'm not sure how to find out if they have a DRAM cache, although the data sheet makes no mention of it, which probably means "no".
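
If the data sheet doesn't settle it, the slowdown itself can be measured directly. Here is a rough sketch of a sustained-write test that writes a large file in chunks, forces each chunk to the device, and records the throughput per chunk. The function name, the default sizes, and the `/mnt/cache` path in the usage note are assumptions for illustration. Bear in mind that writing tens of gigabytes adds wear to an SSD and will itself load the pool while it runs.

```python
import os
import time


def sustained_write_test(directory, total_mb=20480, chunk_mb=256):
    """Write total_mb of random data in chunk_mb pieces; return MB/s per chunk."""
    path = os.path.join(directory, "write_test.tmp")
    chunk = os.urandom(chunk_mb * 1024 * 1024)  # incompressible data
    speeds = []
    try:
        with open(path, "wb") as f:
            for _ in range(total_mb // chunk_mb):
                start = time.perf_counter()
                f.write(chunk)
                f.flush()
                os.fsync(f.fileno())  # force the chunk out to the device
                elapsed = time.perf_counter() - start
                speeds.append(chunk_mb / elapsed)
    finally:
        # Always clean up the temporary file, even if the test is interrupted.
        if os.path.exists(path):
            os.remove(path)
    return speeds
```

For example, `sustained_write_test("/mnt/cache")` (assuming that is where the cache pool is mounted) returns a list of speeds. A sharp, lasting fall after the first several gigabytes would fit the exhausted-cache explanation, while a flat line would point elsewhere.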
