[6.8.3] docker image huge amount of unnecessary writes on cache

snoopstah · April 25, 2020

One more data point here: a couple of weeks ago I spotted that my used Intel DC S3500 SSDs had dropped their SMART Media Wearout Indicator values from 95% to 60% in the year or so since I installed them. Based on a 'host writes 32mib' value of 22,840,656, I think that's almost 700TBW.
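For anyone who wants to check that arithmetic, here is a minimal sketch that converts a SMART counter reported in 32 MiB units into TiB. The function name `units32mib_to_tib` is made up for illustration, and it assumes each unit of the raw value really is 32 MiB, as the attribute name suggests.

```shell
# Convert a raw "host writes 32MiB" SMART value to TiB written.
# Assumption: each unit of the raw value is 32 MiB (32 * 1024^2 bytes).
units32mib_to_tib() {
    awk -v n="$1" 'BEGIN { printf "%.2f", n * 32 / 1024 / 1024 }'
}

units32mib_to_tib 22840656   # prints 697.04
echo " TiB"
```

That lands just under the "almost 700TBW" estimate given in the post.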

Looking at the writes column on the main Unraid UI page shows pretty constant writes around 15-20MB/s. I have two drives in a BTRFS RAID 1 pool, no encryption.

mf808 · May 3, 2020

Do we have an update on this issue?

I stumbled on this yesterday and have been analyzing my docker containers' behaviour, and I was able to pinpoint it to a handful of containers that cause the massive amount of writes. (2x 1TB WD Black, RAID 1 btrfs)

pihole
unms
sonarr (latest linuxserver version)
hydra2 (latest linuxserver version)
Plex (official version)

After analyzing with iotop and checking the writes of every container I am running, for 10 minutes each, I calculated ~120 TBW/year. (This is extrapolated from a ~10-minute sample.)
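The extrapolation from a short iotop sample to a yearly figure can be sketched roughly as below. The helper name `project_tb_per_year` and the sample numbers are hypothetical. It uses decimal terabytes (10^12 bytes), which is the usual convention for vendor TBW ratings, and it assumes the sampled rate holds all year, which a 10-minute window cannot guarantee.

```shell
# project_tb_per_year BYTES SECONDS
# Scale BYTES written during SECONDS up to a full year (365 days),
# reported in decimal TB (1e12 bytes). Assumes a constant write rate.
project_tb_per_year() {
    awk -v b="$1" -v s="$2" 'BEGIN {
        year = 365 * 24 * 3600
        printf "%.2f", b / s * year / 1e12
    }'
}

# Hypothetical sample: 1 GB written during one hour
project_tb_per_year 1000000000 3600   # prints 8.76
echo " TB/year"
```

By the same formula, a constant 20 MB/s works out to roughly 630 TB/year.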

What I did to mitigate:

Switching to the binhex version of Sonarr and the linuxserver version of Plex somehow seems to have reduced the writes massively. I also stopped using hydra2 and will migrate Pihole and UNMS to a spare Pi I have lying around.

With these changes I was able to reduce to ~2 TBW/year, which I think is way more acceptable.

I have found a couple of reddit threads which describe the same problem. This is hitting everybody with their docker.img on SSDs.

What logging or other procedures would need 20-30MB/s worth of writes? This is totally unnecessary.

Edited May 3, 2020 by mf808

danielb7390 · May 3, 2020

It seems I got hit with this one also.

Cache is a 1TB Crucial SATA SSD, btrfs, not encrypted afaik.

I seem to have constant 5MB/s+ writes on the SSD. After finding one of the culprits (unifi controller), which was also writing a lot to appdata, my 2nd process with the most writes is now [loop2]...

In a few minutes there's 500MB of writes from [loop2]...

At this rate my ssd will probably die soon 😨

Here is my docker list, in case it helps find common dockers:

linuxserver/sonarr
linuxserver/qbittorrent
nextcloud:latest
gitlab/gitlab-runner
gitlab/gitlab-ce
linuxserver/tvheadend
linuxserver/minisatip
linuxserver/plex
binhex/arch-krusader
linuxserver/mariadb
linuxserver/ombi
linuxserver/bazarr
linuxserver/radarr
phpmyadmin/phpmyadmin
didstopia/tvhproxy
debian8/apcupsd-cgi

nas_nerd · May 4, 2020

I suspect I am also experiencing this issue.

The iotop screenshot is from a 2 hour period where the server was idling for most of the time.

~11GB from loop2 in 2 hours....

2 Samsung 500gb Evo SSD in BTRFS pool, no encryption.

****update****

8 hours later it looks like almost 900gb in writes? I hope I am interpreting this incorrectly?

I need to fix this ASAP otherwise these SSDs will be cooked by the end of the month.

Edited May 6, 2020 by nas_nerd
updated

bastl · May 4, 2020

14 hours ago, mf808 said:

pinpoint it to a handful of containers

I have the same issue and I am not using a single one of these containers. Even with all my containers turned off I see the same 3-5MB/s writes to the cache. The only thing that stops it is to completely disable docker.

nas_nerd · May 4, 2020

Another update.

I stopped all my docker containers overnight (but docker was still enabled), and I barely had any writes to the cache.

This to me suggests potentially a rogue docker application, or having dockers running is causing an issue.

More testing is required on my behalf.

unRAIDuser7 · May 5, 2020

Been following this thread as I believe I'm also having the issue. Just wanted to list what I've come across in the off chance this is of any use to anyone else.

I stopped all the docker containers, leaving docker still enabled like @nas_nerd did. There were no writes of any sort to the cache drive overnight while all the docker containers were stopped.

Docker Containers

binhex nzbHydra2

binhex plexpass

binhex radarr

binhex sonarr

binhex sabnzbdvpn

Process I followed

I rebooted unraid, after all the dockers above were up and running (as I forgot I had them set to auto-start), I had around 30,000 writes to cache. I stopped them all one-by-one, and the writes stopped around 39,000. All docker containers were stopped overnight and no writes occurred. The next morning I enabled the dockers listed above, ~ 44 hours ago, and I'm now sitting at 1,220,283 writes.

My next step is to stop all docker containers, disable auto-start, reboot unraid, then enable one docker at a time (without actually using them) and monitor the number of writes to the cache to see if I can find an offending docker container.

Taddeusz · May 5, 2020

For me the official Plex container was the largest offender. That and my Windows 10 VM rack up the most writes.

I switched to the linuxserver Plex container and that reduced writes by quite a lot. Not much I can do about the VM other than move its drive images.

jfoxwoosh · May 9, 2020

running 6.8.3, and official plex container exhibits this problem

Odessa · May 9, 2020

Running 6.8.3, cache is btrfs - getting between 5-20 MB/s constant writes to my SSD for no apparent reason, with temperature warnings. Running Official Plex docker, some common Binhex media dockers. How can we escalate this to critical since it is potentially causing actual hardware damage?

Edited May 9, 2020 by Odessa

danielb7390 · May 9, 2020

Yeah, I don't get why this is only minor... if my brand new SSD dies by the end of the year because of this I will be pissed!

beneath · May 9, 2020

I believe this issue is much more widespread than it appears. I found this on the unraid subreddit and decided to poke around my server. Currently loop2 is writing over 2GB in under 10 minutes to my unencrypted BTRFS cache pool.

Unraid: 6.8.3

Added a new Samsung 860 1TB SSD to my btrfs pool 4 months ago. Its SMART values now:

Total LBAs written: 22.01 TB (47269069408)

Power-on hours: 3383 (4m, 18d, 23h)

I'd rather not have to run XFS and/or modify unraid beyond what is supported. Hopefully we can get an official update on this and/or a fix soon, as this is causing excessive writes to my SSDs, thus reducing their life and possibly causing unforeseen damage.

Referenced subreddit post:

https://www.reddit.com/r/unRAID/comments/ggbvgv/unraid_is_unusable_for_me_because_of_the_docker/

Edited May 9, 2020 by beneath

chanrc · May 9, 2020

I have the same issue. Testing with all dockers stopped, loop2 by itself would still be writing data at 5-15MB/s in iotop to my single unencrypted BTRFS cache SSD. I tried converting my cache drive to XFS and now it's down to 20MB over the past 10 minutes with no dockers running and 100MB over 10 minutes with all my dockers up (binhex sonarr, radarr, tautulli, sabnzbd, deluge, ombi, pihole, nextcloud). A huge improvement with XFS over BTRFS, though it is still a problem when there is really no usage in any of those dockers.

My month-and-a-half-old cache SSD was already at 66TBW (of the 640TBW my manufacturer rates the drive for) before I noticed this. Can the devs look at this as an urgent instead of a minor issue? It has probably cratered a lot of people's SSDs already.
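To put figures like these in perspective, here is a small sketch that turns "TB written so far" and "rated TBW" into a percentage used and a naive projection. The function names are made up for illustration, and the projection assumes the average write rate so far simply continues. A rated TBW is a warranty figure rather than a guaranteed failure point, so treat the result as an indicator only.

```shell
# endurance_used_pct WRITTEN_TB RATED_TB -> percent of rated TBW consumed
endurance_used_pct() {
    awk -v w="$1" -v r="$2" 'BEGIN { printf "%.1f", w / r * 100 }'
}

# months_to_rated WRITTEN_TB RATED_TB MONTHS_ELAPSED
# Total months from new until the rated TBW is reached, assuming the
# average write rate so far continues (a naive linear projection).
months_to_rated() {
    awk -v w="$1" -v r="$2" -v m="$3" 'BEGIN { printf "%.1f", r / w * m }'
}

# Figures from the post above: 66 TBW of 640 TBW after about 1.5 months
endurance_used_pct 66 640;   echo "% of rated TBW used"              # 10.3
months_to_rated 66 640 1.5;  echo " months to rated TBW at this pace" # 14.5
```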

retrosynthesizer · May 10, 2020

Someone should really make a PSA for this. I purchased a brand new ssd in January, 163 TBW on it now. Like others have mentioned earlier, iotop shows loop2 is constantly writing to the disk. Using df -kh shows /var/lib/docker is mounted on loop2.
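A rough way to put a number on the loop2 writes without iotop is to read the kernel's per-device counters before and after an interval. The sketch below assumes the standard Linux `/sys/block/<dev>/stat` layout, where field 7 is sectors written in 512-byte units. The function names are made up for illustration, and `loop2` is simply the device reported in this thread, so check with `df` which loop device backs /var/lib/docker on your system.

```shell
#!/bin/bash
# Print the "sectors written" counter (field 7) from a stat line on stdin.
sectors_written() {
    awk '{ print $7 }'
}

# Convert a count of 512-byte sectors to MiB with two decimals.
sectors_to_mib() {
    awk -v s="$1" 'BEGIN { printf "%.2f", s * 512 / 1024 / 1024 }'
}

# sample_writes DEVICE SECONDS
# Read the counter, wait, read it again, and report the difference.
sample_writes() {
    local dev=$1 secs=$2 before after
    before=$(sectors_written < "/sys/block/$dev/stat")
    sleep "$secs"
    after=$(sectors_written < "/sys/block/$dev/stat")
    echo "$dev wrote $(sectors_to_mib $((after - before))) MiB in ${secs}s"
}
```

Calling `sample_writes loop2 600` would then report what loop2 wrote over 10 minutes, which can be compared with and without individual containers running.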

I have a 5 disk btrfs encrypted array with a non-encrypted btrfs cache disk (Samsung EVO 860 1TB). Running several dockers, mainly for web hosting (traefik, cloudflare, organizr, etc.) and data storage (ms SQL Server, influxdb). No VMs.

Below is a slightly modified/simplified version of a script to calculate drive TBW and health %. Source here

#!/bin/bash

### Replace sdg below with the device name of the drive you want TBW calculated for ###
device=/dev/sdg

# Assumes 512-byte LBAs; the raw value is column 10 of the smartctl -A output.
sudo smartctl -A "$device" | awk '
/Power_On_Hours/ { poh = $10; printf "%s / %d hours / %d days / %.2f years\n", $2, $10, $10 / 24, $10 / 24 / 365.25 }
/Total_LBAs_Written/ {
    bytes = $10 * 512
    gb = bytes / 1024^3
    tb = bytes / 1024^4
    printf "%s / %.2f gb / %.2f tb\n", $2, gb, tb
    # Skip the hourly mean if Power_On_Hours was missing or zero
    if (poh > 0) printf "mean writes per hour: / %.3f gb / %.3f tb\n", gb / poh, tb / poh
}
/Wear_Leveling_Count/ { printf "%s / %d (%% health)\n", $2, int($4) }
' |
    sed -e 's:/:@:' |
    sed -e "s|^|$device @ |" |
    column -ts@

danielb7390 · May 10, 2020

That script (well, smartctl) gave me wrong numbers: my SSD reports that the power-on hours are only 349 (14 days), which is not correct. I bought it new when I installed Unraid on 14 March, and it has been running 24/7 since then (uptime ~57 days).

Recalculating manually and assuming the total LBAs written is not wrong... 48905653011 LBA=22.77 TB

perDay = 22.77 TB/57days = 0.40 TB/day = 400 GB/day

perHour = 0.40 TB/24h = 0.017 TB/h = 17 GB/hour

It seems I'm not getting hit as hard as I thought I was... I still think it's a bit on the high side though.
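The manual recalculation above can be reproduced with a short sketch like this one. The function name `write_rate` is made up for illustration. It assumes 512-byte LBAs and 1 TB = 1024^4 bytes, the same conventions as the script posted earlier, and it takes the number of days in service from the user rather than trusting the drive's power-on hours.

```shell
# write_rate LBAS DAYS
# Total written plus the average per day and per hour.
# Assumptions: 512-byte LBAs and 1 TB = 1024^4 bytes.
write_rate() {
    awk -v lbas="$1" -v days="$2" 'BEGIN {
        tb = lbas * 512 / 1024^4
        printf "%.2f TB total, %.2f TB/day, %.3f TB/hour\n", tb, tb / days, tb / days / 24
    }'
}

# Figures from the post above: 48905653011 LBAs over ~57 days
write_rate 48905653011 57
# prints: 22.77 TB total, 0.40 TB/day, 0.017 TB/hour
```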

woble · May 10, 2020

Unraid 6.8.3

Cache: Crucial MX500 2TB SSD - BTRFS w/o encryption

Just want to chip in and say that I have a similar issue: writes on the cache drive hover between 5-10MB/s constantly. `iotop` reports a huge amount of writes to `loop2` for no reason, or so it seems.

I disabled all active dockers and started enabling them one by one to see how each affects `loop2`. Out of all the dockers that I have, `nginx-proxy-manager` seems to have the most effect on it. Without it, `loop2` writes around 100MB/min and the cache drive in the UI shows as low as a few KB/s or even 0 for writes. That is for a total of 14 dockers, and arguably some of them aren't that heavy to begin with or don't do much IO in the first place. Those 100MB might be related to the dockerd logs, which are reported as `dockerd -p /var/run/dockerd.pid --log-opt max-size=10m --log-opt max-file=1 --storage-driver=btrfs --log-level=error` in `iotop`, although these report less than 1MB/min. Then there are the `shfs /mnt/user -disks 4095 2048000000 -o noatime,allow_other -o remember=330` processes, which report 2-30MB/min. With the `nginx-proxy-manager` docker enabled (plus all the other dockers), the writes to `loop2` jump to around 400MB/min, the dockerd log processes go to around 2-3MB/min, and the shfs ones jump to 70-90MB/min each.

I used to have 6 Crucial SSDs in RAID10 for cache, which ran for about 2 years. Upon inspecting them with the Crucial tool, all of them reported around 260TBW, which is crazy high for 2 years of really not that intensive load.

I've seen people mention `pihole` and `nzbhydra2`, which I also run, but they don't seem to affect it overall as much as `nginx-proxy-manager` does.

Edited May 10, 2020 by woble

grigsby · May 10, 2020

I'm seeing this behavior, too. New unraid build (6.8.3), with two nvme drives as a cache pool formatted btrfs without encryption. Numerous docker containers (all the fun stuff -- plex, sonarr, radarr, grafana, telegraf, bitwarden, etc.). iotop shows a huge amount of write activity from loop2 (GBs after just a few minutes of watching). I removed the cache pool, removed one of the drives, and formatted one of the nvme drives as xfs to use as a single cache drive, brought everything back online again, and now the i/o is at what I would consider normal levels (a few megabytes in a few minutes).

It would be great to have some resolution to this bug, since my cache is now unprotected, which makes me uncomfortable. Right now I have to make a choice between having an unprotected cache instance, or thrashing my 1TB nvme drives....

grigsby · May 10, 2020

Can we update the title of this report to [6.8.3], since it's still happening with this latest version? And I would personally consider this to be more severe than a "minor" bug -- I think it fits the category of "urgent" since it potentially leads to data loss if a cache pool is not a viable option.

nas_nerd · May 11, 2020

Update from my end:

Converted my 500GB SSD BTRFS cache pool to a single XFS 500GB cache.

Writes to the cache have now dropped significantly. I am running the exact same dockers as previously.

This suggests a BTRFS + Docker combination is contributing to this excessive write problem.

Unfortunately now my cache is unprotected and I have a spare 500gb SSD (I'm sure I'll find a use for this :))

I agree with a few comments about this issue/bug being more significant than "minor".

beneath · May 11, 2020

5 hours ago, grigsby said:

Can we update the title of this report to [6.8.3], since it's still happening with this latest version? And I would personally consider this to be more severe than a "minor" bug -- I think it fits the category of "urgent" since it potentially leads to data loss if a cache pool is not a viable option.

I definitely agree that this bug deserves moving to "urgent". Many users are more than likely affected without knowing that they are burning through their SSDs. Without reading the reddit thread referenced above, I would have been one of them too.

I keep weekly backups outside of my unraid server, but I know many users don't have that luxury. I would hate to see a perfect "storm" and potential data loss.

Edited May 11, 2020 by beneath

S1dney · May 11, 2020

Changed Priority to Urgent

>>

Since I noticed this thread getting more and more attention lately, and more and more people urging it to be urgent instead of minor, I'll raise priority on this one.

Just an FYI, I made/kept it minor initially cause I had a workable workaround that I felt satisfied with. If the Command Line Interface isn't really your thing or you have any other reason to not tweak the OS in an unsupported way I can fully understand this frustration.

In the end... The community decides priority.

Also updated the title to version 6.8.3 as requested.

Cheers

Edited May 11, 2020 by S1dney

itimpi · May 11, 2020

27 minutes ago, S1dney said:

Changed Priority to Urgent

>>

Since I noticed this thread getting more and more attention lately, and more and more people urging it to be urgent instead of minor, I'll raise priority on this one.

Just an FYI, I made/kept it minor initially cause I had a workable workaround that I felt satisfied with. If the Command Line Interface isn't really your thing or you have any other reason to not tweak the OS in an unsupported way I can fully understand this frustration.

In the end... The community decides priority.

Also updated the title to version 6.8.3 as requested.

Cheers

This actually points out that we could do with another intermediate category called something like “Major” meaning it is very important but is not actually stopping the server working or directly causing data loss. I would then put this into the “Major” category rather than “Urgent”. I certainly agree it needs to be more than “Minor”.

S1dney · May 11, 2020

1 hour ago, itimpi said:

This actually points out that we could do with another intermediate category called something like “Major” meaning it is very important but is not actually stopping the server working or directly causing data loss. I would then put this into the “Major” category rather than “Urgent”. I certainly agree it needs to be more than “Minor”.

Agreed!

"Urgent" might be too generic in that it sums up "Server crash", "data loss" and "showstopper" under one label.

For now it seems to be a showstopper for a bunch of people so it's still accurate.

If a new category is created, let me know and I'll adjust 👍

bonienl · May 11, 2020

The priority "Urgent" means something is seriously wrong and prevents the system from working normally.

This is not really the case here...

The priority "Minor" may sound insignificant, but it does mean Limetech is looking into the issue and will address it as appropriate.

goodGame · May 11, 2020

2 minutes ago, bonienl said:

The priority "Urgent" means something is seriously wrong and prevents the system from working normally.

This is not really the case here...

The priority "Minor" may sound insignificant, but it does mean Limetech is looking into the issue and will address it as appropriate.

I strongly disagree with this. My newly purchased $550 cache SSD, which would have lasted 10+ years with my workload, is now at 253TBW out of the warranted 300 after 1 year. That this can be seen as the system working normally is frustrating to me.
