[6.8.3] docker image huge amount of unnecessary writes on cache

bastl · May 16, 2020

@johnnie.black Quick question, how is copy-on-write set for your appdata and system share?

Lignumaqua · May 16, 2020

This thread explains a lot. I had a cache SSD die a couple of months ago after only two years of use. I just assumed it was a random failure and replaced it. Now, after checking with iotop, it looks like I'm seeing this problem, with over 40 GB an hour being written from loop2 to the cache.

Overall, the new NVMe SSD has been running 1,660 hours and has had 70 TB written to it, for an average of 43 GB per hour. Running a single cache drive, btrfs, unencrypted.
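The average quoted above can be checked with a couple of lines of arithmetic. This is only a sketch: the function name is made up for illustration, and whether the drive reports decimal (1000 GB per TB) or binary (1024 per TB) units is an assumption, so both are shown.

```python
def average_write_rate_per_hour(total_written_tb, power_on_hours, gb_per_tb=1000):
    """Average amount written per power-on hour, in GB (or GiB if gb_per_tb=1024)."""
    return total_written_tb * gb_per_tb / power_on_hours


# Figures from the post: 70 TB written over 1,660 power-on hours.
rate_decimal = average_write_rate_per_hour(70, 1660)                 # assumes 1 TB = 1000 GB
rate_binary = average_write_rate_per_hour(70, 1660, gb_per_tb=1024)  # assumes 1 TB = 1024 GB

print(f"{rate_decimal:.1f} GB/h (decimal units)")
print(f"{rate_binary:.1f} GB/h (binary units)")
```

Either way the result lands in the low 40s of GB per hour, in line with the iotop observation.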

Like everyone else here, I'd really appreciate a fix for this. Replacing NVMe drives is expensive. The temporary fix posted by OP is clever, and I thank them for providing it, but it's not something I'd want to try. I have too many dockers in daily use to want to risk going that far off piste...

JorgeB · May 16, 2020

25 minutes ago, bastl said:

Quick question, how is copy-on-write set for your appdata and system share?

Everything VM/docker related is inside the same share on my cache drive, with COW set to auto. That folder is actually a btrfs subvolume, so it can be snapshotted and replicated to a different pool daily by a script. This reminds me that the COW setting can't be the only thing causing the write amplification for the VMs: all 3 vdisks are together and only the Windows Server VM has the high-writes issue, while both other Windows VMs have normal writes according to iotop.

bastl · May 17, 2020

@johnnie.black I asked because I also have my VMs on a BTRFS subvol with daily snapshots to a UD device. No problems with that so far; only 1 VM is always powered on, and sometimes up to 3 are running. COW on the VM share is also set to auto. But in my case the VMs aren't producing the constant writes. Sure, they do some writing, but not that much.

My Appdata share is also set to Auto, but the System share with the docker and libvirt images is set to COW off. Not sure when I set it to off, or if it was off by default back when I installed Unraid a couple of years ago. Even with all dockers turned off, I see the writes. As soon as I disable Docker itself, the writes go down. So I assume that for me it has something to do with the combination of Docker + System share COW No + BTRFS subvol.

boomam · May 20, 2020

Just came across this after it was linked from Reddit.

It's a little disappointing that Limetech themselves have not commented in any way whatsoever on this.

For myself, using iotop for about 30 minutes, with everything other than Plex 'on', I get about 2 GB of loop2 writes. So 4 GB/hour, 96 GB/day, 2.88 TB/month.

For a 1 TB drive that's high, but shouldn't kill it.

Calculating it for my drives, 1 TB MX500s, which are rated at 360 TBW, and using this handy calculator here: https://wintelguy.com/endurance-calc.pl, I would need to hit 300 GB per day to reach a 3-year useful life.

With the price of the drives being $100, that's $33 a year (each) if they last me 3 years.

Since I'm currently at about 1/3 of that per-day figure, I'd be looking at roughly 5-6 years at this rate, so $10-15 a year, I would guess?

That being said, I still want to see a viable solution from Limetech...
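For anyone who wants to redo this estimate with their own numbers, here is a back-of-the-envelope sketch. It assumes decimal units, a constant write rate, and that the rated TBW is the only limit on drive life; the function names are made up for illustration and this is not how the linked calculator works internally.

```python
GB_PER_TB = 1000   # assumption: decimal units
DAYS_PER_YEAR = 365


def daily_budget_gb(rated_tbw, target_years):
    """GB per day that would use up the rated TBW in exactly target_years."""
    return rated_tbw * GB_PER_TB / (target_years * DAYS_PER_YEAR)


def years_until_rated_tbw(rated_tbw, gb_per_day):
    """Years of constant writing at gb_per_day before reaching the rated TBW."""
    return rated_tbw * GB_PER_TB / gb_per_day / DAYS_PER_YEAR


# Figures from the post: 360 TBW rating, about 2 GB written in 30 minutes.
measured_gb_per_day = 2 / 0.5 * 24                 # 96 GB/day
budget = daily_budget_gb(360, 3)                   # budget for a 3-year life
years = years_until_rated_tbw(360, measured_gb_per_day)

print(f"3-year budget: {budget:.0f} GB/day")
print(f"At {measured_gb_per_day:.0f} GB/day: {years:.1f} years to reach 360 TBW")
```

Under these simplified assumptions the 3-year budget comes out near 330 GB/day, and 96 GB/day takes about ten years to reach the rating. Real drives can fail earlier or later than their rated TBW.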

Edited May 20, 2020 by boomam

tjb_altf4 · May 20, 2020

3 hours ago, boomam said:

Just came across this after it was linked from Reddit.

It's a little disappointing that Limetech themselves have not commented in any way whatsoever on this.

They have commented many times in this very thread

JahRuul · May 20, 2020

6 hours ago, tjb_altf4 said:

They have commented many times in this very thread

Well, as you are stating the obvious, I would also like to complete your statement: he has commented 3 times, in December. That is 5 months ago. The issue was marked as Minor, because the people that are affected are losing only hardware and money, not their data.

I don't want to start on the wrong foot here so it's important to state that Limetech is not at fault with anything here. The issue is caused by docker, or a filesystem driver, or the kernel, or a mismatch between some software interfaces. But people are getting a little disgruntled that despite all the efforts to track and isolate this issue (which has been at the top of "Bug Reports for stable releases" for a while), there is no feedback or interaction from anyone with a word to say in it.

boomam · May 20, 2020

7 hours ago, tjb_altf4 said:

They have commented many times in this very thread

49 minutes ago, JahRuul said:

Well, as you are stating the obvious, I would also like to complete your statement: he has commented 3 times, in December. That is 5 months ago.

It's the second reply here that's the crux of the matter.

This long without a comment of 'we're working on it' counts as a lack of comment in most books.

Can you imagine if your electricity went off at home and, other than a "yup, looks broken" when you phoned the electricity company, they didn't comment at all for a week?

Even though you'd done the legwork to diagnose that the main breaker was fried because of a tree falling on the local power masts a mile away, and they still were not acknowledging that they were looking at it?

It's the same thing.

Being reasonable, no one is saying that 6.9 work needs to be stopped to deal with this, although it could be argued as necessary in several ways, but at the very least there should be a comment saying "yup, we know it's an issue still, we'll get on it after 6.9", or any statement whatsoever to acknowledge that the issue is ongoing and a little more severe than minor.

The comments I've read in this thread claiming that 'minor' isn't actually minor are complete nonsense too.

At the end of the day it's all perceptional damage control.

Label it as medium/high/critical, make comments to 'check in' that you are still working on it every few weeks, and most people's ire on the matter will subside, as it shows from a PR perspective that they are listening. Customer service 101!

Whether the issue is with Unraid itself or with the software components it needs to work is by the by; the fact of the matter is that it is happening to Unraid users, who have paid money, and on a key functional part of the product.

At the very least they need to make a statement on the matter, in the same way they did with SSD usage in an array.

Edited May 20, 2020 by boomam

xxxliqu1dxxx · May 20, 2020

@limetech actually follows this thread as well as 53 other people... I think they are aware and would trust they are looking into it.

DasMarx · May 21, 2020

I have also noticed those high writes to the cache drive, and many of them seem to be caused by Plex running the new analysis for the TV show skip-intro feature. While the daily check in Plex is running, I see writes in the area of TBs going through loop2.

Edited May 21, 2020 by DasMarx

-Daedalus · May 21, 2020

It's write amplification. This means that any container that was writing a little bit, will still (relatively) only be writing a little bit. Any container that was writing a lot, will now still be writing a lot. It's not the fault of any one container, but Docker (or something else) itself.

bonienl · May 21, 2020

21 hours ago, xxxliqu1dxxx said:

I think they are aware and would trust they are looking into it.

Exactly

placid09 · May 23, 2020

I'm also suffering from this issue. I'm sure Limetech is looking into it, but some more official communication on the issue would be appreciated. I've shut down my Unraid server for the time being.

mf808 · May 25, 2020

Well there goes another 3TB wasted in under 2 hours.

Issue: While post-processing a 500 MB file, NZBGet got caught in a loop (I still have to figure out what happened) and effectively did 3 TB of writes.

The appdata folder resides on my RAID 1 btrfs cache disk.

I only noticed the problem because I got high-temperature warnings on the NVMe drives.

This issue does not give me the confidence to leave the server running unattended.

Edited May 25, 2020 by mf808

WackyWRZ · May 25, 2020

I've come across this same issue recently and did some testing after I found my 500 GB 850 Pro SSD to have almost 150 TBW on it. Used up 1/2 my warranty in about 6 months! I moved my docker img file from cache to the xfs pool / spinners and noticed the loop2 counter going up like crazy. I can also hear the drive head writing about every 4.5s (on the dot), with loop2 going up 20-30 MB each time as well.

I've since narrowed it down (on my Unraid) to the Plex (official) container. After stopping the container, the massive writes stopped immediately, as did the constant disk noise. My loop2 is sitting at 600 MB of writes after about 30 minutes, where before I could hit 1 GB within 5 minutes.

The only other Dockers I have running are UniFi controller (lsio), NetData, and now Emby (lsio). They run with no problem, but as soon as the Plex docker is started up it goes right back to writing like crazy, even while doing nothing. I plan to try the lsio or binhex Plex containers instead, but this might have been the push I needed to get away from Plex anyway.
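For anyone who wants to take the same kind of loop2 measurement without watching iotop, here is a minimal sketch that compares two snapshots of `/proc/diskstats`. It assumes a Linux host where the docker image is mounted as `loop2`, as in the posts here; the function names are made up for illustration.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def sectors_written(diskstats_text, device):
    """Return the 'sectors written' counter for one device, or None if absent."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Columns: major, minor, name, reads, reads merged, sectors read,
        # ms reading, writes, writes merged, sectors written, ...
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9])
    return None


def bytes_written_between(before_text, after_text, device="loop2"):
    """Bytes written to `device` between two diskstats snapshots."""
    before = sectors_written(before_text, device)
    after = sectors_written(after_text, device)
    if before is None or after is None:
        raise ValueError(f"device {device!r} not found in diskstats")
    return (after - before) * SECTOR_BYTES


def read_diskstats(path="/proc/diskstats"):
    """Read the current kernel block-device counters as text."""
    with open(path) as f:
        return f.read()
```

To use it, call `read_diskstats()` once, wait a known interval (say 30 minutes), call it again, and pass both snapshots to `bytes_written_between`. Dividing the result by the interval gives an average write rate for the loop device.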

Edited May 27, 2020 by WackyWRZ
Clarification

Lignumaqua · May 25, 2020

23 minutes ago, WackyWRZ said:

I've since narrowed it down to the Plex (official) container. After stopping the container, the massive writes stopped immediately, as did the constant disk noise. My loop2 is sitting at 600 MB of writes after about 30 minutes, where before I could hit 1 GB within 5 minutes.

As others have posted here, you can't blame Plex, or any single Docker. Something is taking normal writes and amplifying them massively. In my case stopping Plex makes a difference, but only reduces it by about 25%, and rampant writes continue. About 1 GB a minute as I look at it right now! I don't want anyone to think the problem has been solved and the cause was Plex. That isn't the case. It's much more fundamental than that.

mdsloop · May 27, 2020

I see the same thing here.

Monitoring with Zabbix, I see an average of 300 writes/sec to my 2 cache drives.

WackyWRZ · May 27, 2020

I should have been more specific - wasn't trying to say that's what everyone's issue is, just what I found on my machine.

albertogomcas · May 28, 2020

I believe there is a general problem of writes being amplified (easily x10). Since Plex (especially some flavors of the docker) writes a lot, it can become the most visible. I actually reached the same conclusion as you when I started investigating. But then I had a hard look at the other dockers, and there is still way too much writing. (In your case, half a GB in half an hour may not sound like much, but it is still 4 TB a year for doing... nearly nothing?)

pellen · May 28, 2020

If this issue is due to some write amplification, it could be worth checking how much logging the different containers do, and how big the log files are. If a 10 MB log file is constantly written to, could it be that instead of writing a few bytes on every log update, it rewrites the whole file?

I noticed that I had debug logging enabled in my Plex container (official), so I had 60+ MB of logs just for the "Plex Media Server.X.log" files within the last 24h. It looks like Plex creates a new log file when reaching ~11 MB, and with debugging it seems to have been printing logs every 1-2 seconds. Rewriting the whole log file every 2 seconds when it's a couple of MB will cause a lot of unnecessary wear.
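To put rough numbers on that hypothesis, here is a small sketch comparing the two cases. It is purely illustrative: the 200-byte log line is a made-up figure, the function names are hypothetical, and nothing here shows that Plex, Docker or btrfs actually rewrites whole files.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MB = 1000 * 1000  # assumption: decimal megabytes


def daily_bytes_if_rewritten(file_size_bytes, interval_seconds):
    """Bytes written per day if the whole file is rewritten on every update."""
    updates_per_day = SECONDS_PER_DAY // interval_seconds
    return updates_per_day * file_size_bytes


def daily_bytes_if_appended(line_bytes, interval_seconds):
    """Bytes written per day if only the new log line is appended each time."""
    updates_per_day = SECONDS_PER_DAY // interval_seconds
    return updates_per_day * line_bytes


# A 10 MB log updated every 2 seconds, versus appending a hypothetical 200-byte line.
rewritten = daily_bytes_if_rewritten(10 * MB, 2)
appended = daily_bytes_if_appended(200, 2)

print(f"whole-file rewrite: {rewritten / 1e9:.0f} GB/day")
print(f"append only:        {appended / 1e6:.2f} MB/day")
```

With these inputs the whole-file case works out to hundreds of GB per day against a few MB for plain appends, which is why it seems worth ruling in or out.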

hotio · May 28, 2020

Outputting to docker logs, or rather stopping the containers from writing to docker logs, made a significant impact on my writes. That's why all hotio containers don't output to docker logs by default..... It gave me a fighting chance to see my SSDs in action a wee bit longer....

mdsloop · May 28, 2020

I have looked at all dockers, and no log file locations are on cache; they are in /tmp/.

No VM online

about 15 minutes iotop 😞

caplam · May 29, 2020

I think I made a mistake, as my previous post was deleted.

Anyway, I was writing that I'm concerned too. I'm trying to find a workaround, as Unraid is starting to throw alerts on both my SSDs. SMART attribute 187 (Reported Uncorrect) is growing.

I'm really pissed off with this situation, as I was planning to replace my ProCurve switch with a UniFi one, but now I have to buy SSDs as if they were ink cartridges for my printer. 👹

Docker service is stopped, I have only 2 VMs running, and I still have 6 MB/s of writes on the SSDs.

As Unraid (or probably me) is not doing things right all the time, I take diagnostics from time to time. So I searched the history for the starting point of the problem.

Attached is a spreadsheet with SMART data from one of the SSDs.

Seeing this, it seems pretty obvious that things started going crazy with 6.8.0.

Edited May 29, 2020 by caplam

caplam · May 29, 2020

I'm downgrading to 6.7.2; I hope I'm not making a mistake.

hovee · May 29, 2020

1 hour ago, caplam said:

I'm trying to find a workaround, as Unraid is starting to throw alerts on both my SSDs.

I followed @S1dney's post, which is a workaround. It DRASTICALLY reduced my writes. I would recommend you check it out. I can now sleep easily at night.

One thing that confused me was that after I rebooted I didn't have any docker containers listed. I then had to select Add Container and choose the template from the drop-down. It was quick and only took about 10 minutes to re-add the roughly 20 containers that I had set up previously.
