• [6.8.3] docker image huge amount of unnecessary writes on cache


    S1dney
    • Solved Urgent

    EDIT (March 9th 2021):

    Solved in 6.9 and up. Reformatting the cache to the new partition alignment and hosting docker directly in a cache-only directory brought writes down to a bare minimum.

     

    ###

     

    Hey Guys,

     

    First of all, I know you're all very busy getting version 6.8 out there, something I'm very much waiting on as well. I'm seeing great progress, so thanks so much for that! I'm not expecting this to be at the top of the priority list, but I'm hoping someone on the development team is willing to invest some time in it (perhaps after the release).

     

    Hardware and software involved:

    2 x 1TB Samsung EVO 860, setup with LUKS encryption in BTRFS RAID1 pool.

     

    ###

    TL;DR (but I'd suggest reading on anyway 😀)

    The image file mounted as a loop device is causing massive writes on the cache, potentially wearing out SSDs quite rapidly.

    This appears to happen only on encrypted caches formatted with BTRFS (maybe only in a RAID1 setup, I'm not sure).

    Hosting the Docker files directory on /mnt/cache instead of using the loopdevice seems to fix this problem.

    A possible idea for an implementation is proposed at the bottom.

     

    Grateful for any help provided!

    ###

     

    I had written a topic in the general support section (see link below), but I have done a lot of research lately and think I have gathered enough evidence pointing to a bug. I was also able to build a (kind of) workaround for my situation. More details below.

     

    So to see what was actually hammering the cache I started with all the obvious things, like using a lot of find commands to trace files that were being written to every few minutes, and also used the File Activity plugin. Neither was able to trace down any writes that would explain 400 GB worth of writes a day for just a few containers that aren't even that active.
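
    For reference, the kind of find command I mean is something along these lines (paths and the interval are just an example):

        # list files on the cache modified in the last 10 minutes, largest first
        find /mnt/cache -type f -mmin -10 -printf '%s %p\n' | sort -rn | head -n 20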

     

    Digging further, I moved the docker.img to /mnt/cache/system/docker/docker.img, so directly onto the BTRFS RAID1 mountpoint. I wanted to check whether the unRAID FS layer was causing the loop2 device to write this heavily. No luck either.

    This gave me a situation I was able to reproduce on a virtual machine though, so I started with a recent Debian install (I know, it's not Slackware, but I had to start somewhere ☺️). I created some vDisks, encrypted them with LUKS, bundled them in a BTRFS RAID1 setup, created the loop device on the BTRFS mountpoint (same as /mnt/cache on unRAID) and mounted it on /var/lib/docker. I made sure I had the NoCOW flag set on the IMG file like unRAID does. Strangely this did not show any excessive writes; iotop showed really healthy values for the same workload (I migrated the docker content over to the VM).
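
    Roughly, the setup on the Debian VM looked like this (a sketch from memory; device names and the image size are examples, not the exact commands used):

        # two vDisks, LUKS-encrypted and opened
        cryptsetup luksFormat /dev/vdb && cryptsetup open /dev/vdb crypt1
        cryptsetup luksFormat /dev/vdc && cryptsetup open /dev/vdc crypt2

        # BTRFS RAID1 pool on top of the two mappers, mounted like the unRAID cache
        mkfs.btrfs -d raid1 -m raid1 /dev/mapper/crypt1 /dev/mapper/crypt2
        mkdir -p /mnt/cache && mount /dev/mapper/crypt1 /mnt/cache

        # empty image with the NoCOW attribute set (like unRAID does), formatted BTRFS
        mkdir -p /mnt/cache/system/docker
        touch /mnt/cache/system/docker/docker.img
        chattr +C /mnt/cache/system/docker/docker.img
        truncate -s 20G /mnt/cache/system/docker/docker.img
        mkfs.btrfs -f /mnt/cache/system/docker/docker.img

        # loop-mounted on /var/lib/docker, then dockerd started with the btrfs driver
        mount -o loop /mnt/cache/system/docker/docker.img /var/lib/docker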

     

    After my Debian troubleshooting I went back to the unRAID server, wondering whether the loop device was being created in a weird way, so I took the exact same steps to create a new image and pointed the Docker settings in the GUI at it. Still the same write issues.

     

    Finally I decided to put the whole image out of the equation and took the following steps:

    - Stopped docker from the WebGUI so unRAID would properly unmount the loop device.

    - Modified /etc/rc.d/rc.docker to not check whether /var/lib/docker was a mountpoint

    - Created a share on the cache for the docker files

    - Created a softlink from /mnt/cache/docker to /var/lib/docker

    - Started docker using "/etc/rc.d/rc.docker start"

    - Started my Bitwarden containers.
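
    Roughly, as commands, that boils down to something like this (the rc.docker edit has to be done by hand and the exact lines differ per release; paths as above):

        /etc/rc.d/rc.docker stop                  # or stop Docker from the WebGUI first
        # edit /etc/rc.d/rc.docker and bypass the "is /var/lib/docker a mountpoint" check

        mkdir -p /mnt/cache/docker                # share on the cache that will hold the docker files
        rmdir /var/lib/docker                     # assumes it is now an empty, unmounted directory
        ln -s /mnt/cache/docker /var/lib/docker   # softlink so dockerd still sees /var/lib/docker

        /etc/rc.d/rc.docker start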

     

    Looking at the stats with "iotop -ao" I did not see any excessive writing taking place anymore.

    I had the containers running for about 3 hours and got maybe 1GB of writes in total (note that on the loop device the same workload gave me 2.5GB every 10 minutes!).
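
    (If you want to check the same thing on your system: besides iotop, the cumulative write counter of the loop device can also be read straight from /proc/diskstats, for example:)

        # field 10 of /proc/diskstats = sectors written; multiply by 512 for bytes
        awk '$3 == "loop2" { printf "%.1f GiB written to loop2\n", $10 * 512 / 2^30 }' /proc/diskstats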

     

    Now don't get me wrong, I understand why the loop device was implemented. Dockerd is started with options to make it run with the BTRFS storage driver, and since the image file is formatted with BTRFS this works on every setup: it doesn't matter whether the underlying cache runs XFS, EXT4 or BTRFS, it will just work. In my case I had to point the softlink to /mnt/cache, because pointing it to /mnt/user would not allow me to use the BTRFS driver (obviously the unRAID user share filesystem isn't BTRFS). Also, the WebGUI has commands to scrub the filesystem inside the image; everything is based on the assumption that everyone is running docker on BTRFS (which of course they are, because of the image 😁)
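
    (For context: outside of unRAID the same thing would look roughly like the line below. This is a generic dockerd example, not the exact invocation from rc.docker.)

        # run the docker daemon with the btrfs graph driver on top of /var/lib/docker
        dockerd --storage-driver=btrfs --data-root=/var/lib/docker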

    I must say that my approach also broke when I changed something in the shares: certain services get restarted, causing docker to be turned off for some reason. No big issue, since it wasn't meant to be a long-term solution, just a way to see whether the loop device was causing the issue, which I think my tests did point out.

     

    Now I'm at the point where I definitely need some developer help. I'm currently keeping nearly all docker containers off all day, because 300-400GB worth of writes a day is just a BIG waste of expensive flash storage, especially since I've shown that it's not needed at all. It does defeat the purpose of my NAS and SSD cache though, since its main purpose was hosting docker containers while allowing the HDDs to spin down.

     

    Again, I'm hoping someone on the dev team acknowledges this problem and is willing to invest some time. I did get quite a few hits on the forums and Reddit, but without anyone actually pointing out the root cause of the issue.

     

    I'm missing the technical know-how to troubleshoot the loop device issues at a lower level, but I have been thinking about possible ways to implement a workaround, like adjusting the Docker Settings page to switch off the use of a vDisk and, if all requirements are met (a path on /mnt/cache, BTRFS formatted), start docker on a share on the /mnt/cache partition instead of using the vDisk.

    That way you would still keep all the advantages of the docker.img file (it works across filesystem types) and users who don't care about the writes could still use it, but you'd be massively helping out others who are concerned about them.
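
    Purely as an illustration of what I mean, the logic in rc.docker could look something like this (the variable names here are hypothetical, not actual unRAID settings):

        if [[ "$DOCKER_STORAGE" == "directory" ]]; then
            # only valid when the target really is BTRFS directly on the cache
            if [[ "$(stat -f -c %T /mnt/cache/system/docker)" == "btrfs" ]]; then
                mount --bind /mnt/cache/system/docker /var/lib/docker   # or a softlink
            else
                echo "directory mode requires a BTRFS-formatted cache path" >&2
                exit 1
            fi
        else
            mount -o loop "$DOCKER_IMAGE" /var/lib/docker               # current vDisk behaviour
        fi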

     

    I'm not attaching diagnostic files since they would probably not show anything useful here.

    Also, if this should have been posted in feature requests, I'm sorry. But I feel that, since the current solution is misbehaving in terms of writes, it can also be treated as a bug report.

     

    Thanks for this great product, I have been using it with a lot of joy so far! 

    I'm just hoping we can solve this one so I can keep all my dockers running without the cache wearing out quickly.

     

    Cheers!

     

    • Like 3
    • Thanks 17



    User Feedback

    Recommended Comments



    This thread explains a lot. I had a cache SSD die a couple of months ago after only two years of use. I just assumed it was a random failure and replaced it. Now, after checking with iotop, it looks like I'm seeing this problem too, with over 40 GB an hour being written from loop2 to the cache.

     

    Overall, the new NVMe SSD has been running for 1,660 hours and has had 70TB written to it, for an average of about 43 GB per hour. This is a single cache drive, BTRFS, unencrypted.
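
    (For anyone who wants to check their own totals: the lifetime write counter can be read with something like the command below; the device name is an example. For NVMe, each "data unit" is 512,000 bytes.)

        # NVMe lifetime host writes, as reported by the drive itself
        smartctl -a /dev/nvme0 | grep -i 'data units written'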

     

    Like everyone else here I'd really appreciate a fix for this. Replacing nvme drives is expensive. The temporary fix posted by OP is clever, and I thank them for providing it, but it's not something I'd want to try. I have too many dockers in daily use to want to risk going that far off piste...

    Link to comment
    25 minutes ago, bastl said:

    Quick question, how is copy-on-write set for your appdata and system share?

    Everything VM/docker related is inside the same share on my cache drive with COW set to Auto; that folder is actually a btrfs subvolume so it can be snapshotted and replicated to a different pool daily by a script. This reminds me that the COW setting can't be the only thing causing the write amplification for the VMs, since all 3 vdisks are together and only the Windows Server VM has the high-write issue; the other two Windows VMs have normal writes according to iotop.
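
    (The snapshot/replication part is nothing special, roughly this pattern; paths and the parent snapshot name are examples:)

        # read-only snapshot of the subvolume, then incremental send to the other pool
        btrfs subvolume snapshot -r /mnt/cache/domains /mnt/cache/snaps/domains_$(date +%F)
        btrfs send -p /mnt/cache/snaps/domains_PREVIOUS /mnt/cache/snaps/domains_$(date +%F) \
            | btrfs receive /mnt/pool2/snaps/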

    Link to comment

    @johnnie.black I asked because I also have my VMs on a BTRFS subvolume with daily snapshots to a UD device. No problems with that so far; only 1 VM is always powered on and sometimes up to 3 are running. COW on the VM share is also set to Auto. But in my case the VMs aren't producing the constant writes. Sure, they do some writing, but not that much.

     

    My appdata share is also set to Auto, but the system share with the docker and libvirt images is set to COW off. I'm not sure when I set it to off, or whether it was off by default back when I installed Unraid a couple of years ago. Even with all dockers turned off, I see the writes. As soon as I disable the Docker service itself, the writes go down. So I assume that for me it has something to do with the combination of Docker + system share with COW off + BTRFS subvolume.
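
    (For anyone wanting to check this on their own system: the NOCOW flag shows up as a "C" in lsattr, for example:)

        # a "C" in the attribute column means copy-on-write is disabled for new files in that directory
        lsattr -d /mnt/cache/system /mnt/cache/appdata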

    Link to comment

    Just came across this after it was linked from Reddit.

    It's a little disappointing that Limetech themselves have not commented in any way whatsoever on this.

     

    For myself, using iotop for about 30 minutes, with everything other than Plex running, gets me to about 2GB of loop2 writes. So 4GB/hour, 96GB/day, 2.88TB/month.

    For a 1TB drive that's high, but shouldn't kill it.

     

    Calculating it for my drives, 1TB MX500s, which are rated at 360TBW, and using this handy calculator here: https://wintelguy.com/endurance-calc.pl sees me needing to hit about 300GB per day to end up with a 3-year useful life.

    With the price of the drives being $100, that's $33 a year (each) if they last me 3 years.

    Since I'm currently at about 1/3 of that per-day figure, I'd be roughly 5-6 years in at this rate, so $10-15 a year I would guess?

     

    That being said, I still want to see a viable solution though from Limetech...

    Edited by boomam
    Link to comment
    3 hours ago, boomam said:

    Just came across this after it was linked from Reddit.

    It's a little disappointing that Limetech themselves have not commented in any way whatsoever on this.

    They have commented many times in this very thread

    Link to comment
    6 hours ago, tjb_altf4 said:

    They have commented many times in this very thread

    Well, as you are stating the obvious, I would also like to complete your statement: he has commented 3 times, in December. That is 5 months ago. The issue was marked as Minor, because the people that are affected are losing only hardware and money, not their data.

    I don't want to get off on the wrong foot here, so it's important to state that Limetech is not at fault for anything here. The issue is caused by docker, or a filesystem driver, or the kernel, or a mismatch between some software interfaces. But people are getting a little disgruntled that, despite all the efforts to track and isolate this issue (which has been at the top of "Bug Reports for stable releases" for a while), there is no feedback or interaction from anyone with a say in it.

    Link to comment
    7 hours ago, tjb_altf4 said:

    They have commented many times in this very thread

     

    49 minutes ago, JahRuul said:

    Well, as you are stating the obvious, I would also like to complete your statement: he has commented 3 times, in December. That is 5 months ago.

    It's the second reply here that's the crux of the matter.

    This long without a comment of 'we're working on it' counts as a lack of comment in most books.

     

    Can you imagine if your electricity went off at home and, other than a "yup, looks broken" when you phoned the electricity company, they didn't comment at all for a week?

    Even though you'd done the legwork to diagnose that the main breaker was fried because of a tree falling on the local power masts a mile away, they still wouldn't acknowledge that they were looking at it?

    It's the same thing. 

     

    To be reasonable, no one is saying that 6.9 work needs to be stopped to deal with this, although that could be argued for in several ways. But at the very least we need a comment saying "yup, we know it's still an issue, we'll get on it after 6.9", or any statement whatsoever acknowledging that the issue is ongoing and a little more severe than minor.

     

    The comments I've read in this thread claiming that the 'minor' label doesn't actually mean it's minor are complete nonsense too.

    At the end of the day it's all about perception and damage control.

    Label it as medium/high/critical, make comments to 'check in' that you are still working on it every few weeks, and most people's ire on the matter will subside, as it shows from a PR perspective that they are listening. Customer service 101!

     

    Whether the issue is with Unraid itself or with the software components it needs to work is by the by; the fact of the matter is that it is happening to Unraid users - who have paid money - and on a key functional part of the product.

    At the very least they need to make a statement on the matter, in the same way they did with SSD usage in an array.

    Edited by boomam
    Link to comment

    I have also noticed those high writes to the cache drive, and many of these seem to be caused by Plex running the new analysis for the TV show intro-skip feature. While the daily maintenance task in Plex is running I see writes in the area of terabytes going through loop2.

    Edited by DasMarx
    Link to comment

    It's write amplification. This means that any container that was writing a little bit will still (relatively speaking) only be writing a little bit, and any container that was writing a lot will still be writing a lot - just multiplied. It's not the fault of any one container, but of Docker (or something else) itself.

    Link to comment
    21 hours ago, xxxliqu1dxxx said:

    I think they are aware and would trust they are looking into it.

    Exactly

    Link to comment

    I'm also suffering from this issue. I'm sure Limetech is looking into it, but some more official communication on the issue would be appreciated. I've shut down my Unraid server for the time being.

    • Like 1
    Link to comment

    Well there goes another 3TB wasted in under 2 hours. 

    Issue: while post-processing a 500MB file, NZBGet was caught in a loop (I still have to figure out what happened) and effectively did 3TB of writes. 

     

    The appdata folder resides on my RAID1 BTRFS cache pool.

     

    I only noticed the problem because I got high-temperature warnings on the NVMe drives.

     

    This issue does not give me the confidence to leave the server running unattended.

    Edited by mf808
    Link to comment

    I've come across this same issue recently and did some testing after I found my 500GB 850 Pro SSD had almost 150TB written to it. That used up half my warranty in about 6 months! I moved my docker.img file from the cache to the XFS pool / spinners and noticed loop2 still counting up like crazy. I can also hear the drive head writing about every 4.5s (on the dot), with loop2 going up 20-30MB each time.

     

    I've since narrowed it down (on my Unraid) to the Plex (official) container. After stopping that container, the massive writes stopped immediately, as did the constant disk noise. My loop2 is sitting at 600MB of writes after about 30 minutes - where before I could hit 1GB within 5 minutes.

     

    The only other Dockers I have running are UniFi controller (lsio), NetData, and now Emby (lsio). They run with no problem, but as soon as the Plex docker is started it goes right back to writing like crazy - even while doing nothing. I plan to try the lsio or binhex Plex containers instead, but this might have been the push I needed to get away from Plex anyway.
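
    (For comparing containers, the per-container totals are visible with docker stats - keeping in mind this shows the writes the container itself issues, not the amplified writes that end up on loop2:)

        # cumulative block I/O per running container since it started
        docker stats --no-stream --format 'table {{.Name}}\t{{.BlockIO}}'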

    Edited by WackyWRZ
    Clarification
    Link to comment
    23 minutes ago, WackyWRZ said:

    I've since narrowed it down to the Plex (official) container. After stopping that container, the massive writes stopped immediately, as did the constant disk noise. My loop2 is sitting at 600MB of writes after about 30 minutes - where before I could hit 1GB within 5 minutes.

    As others have posted here, you can't blame Plex, or any single Docker. Something is taking normal writes and amplifying them massively. In my case stopping Plex makes a difference, but only reduces it by about 25%, and rampant writes continue. About 1 GB a minute as I look at it right now! I don't want anyone to think the problem has been solved and the cause was Plex. That isn't the case. It's much more fundamental than that.

    • Like 4
    Link to comment

    I should have been more specific - wasn't trying to say that's what everyone's issue is, just what I found on my machine.

    Link to comment

    I believe there is a general problem of writes being amplified (easily 10x). Since Plex (especially some flavours of the docker) writes a lot, this becomes the most visible. I actually reached the same conclusion as you when I started investigating. But then I had a hard look at the other dockers and there is still way too much writing (in your case half a GB in half an hour may not sound like much, but it's still 4TB a year for doing... nearly nothing?)

    Link to comment

    If this issue is due to some write amplification it could be worth checking how much logging the different containers do, and how big the log files are. If a 10MB log file is constantly written to, could it be that instead of writing a few bytes on every log update, the whole file gets rewritten?

     

    I noticed that I had debug logging enabled in my Plex container (official), so I had 60+MB of logs just for the "Plex Media Server.X.log" files within the last 24h. It looks like Plex creates a new log file when reaching ~11MB, and with debug logging it seems to have been printing log lines every 1-2 seconds. Rewriting a whole log file every 2 seconds when it's a couple of MB in size would cause a lot of unnecessary wear.
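
    (A quick way to spot suspects like this is to look for large, recently-touched log files in appdata; the path and thresholds are just an example:)

        # log files over 10MB that were written to in the last hour, largest first
        find /mnt/cache/appdata -name '*.log' -size +10M -mmin -60 -printf '%s %p\n' | sort -rn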

    Link to comment

    Stopping the containers from writing to the docker logs made a significant impact on my writes; that's why all hotio containers don't output to the docker logs by default... it gave me a fighting chance to see my SSDs in action a wee bit longer...
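
    (For reference, the standard docker options for this are the json-file log limits or disabling the log driver entirely, for example as extra parameters on a container:)

        # cap the container log at one 10MB file...
        docker run -d --log-opt max-size=10m --log-opt max-file=1 <image>
        # ...or turn container logging off completely
        docker run -d --log-driver none <image>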

    Link to comment

    I have looked at all my dockers and no logfile locations are on the cache; they are in /tmp/ instead.

    No VMs online.

    About 15 minutes of iotop 😞:

    [attached screenshot: iotop output]

    Link to comment

    I think I made a mistake, as my previous post was deleted.

    Anyway, I was writing that I'm concerned too. I'm trying to find a workaround, as Unraid is starting to throw alerts on both my SSDs. The 187 Reported_Uncorrect attribute is growing.
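
    (For reference, those counters can be watched with smartctl; the device name is an example and attribute names vary per vendor:)

        # show the error and total-writes related SMART attributes of a SATA SSD
        smartctl -A /dev/sdb | grep -Ei 'reported_uncorrect|total_lbas_written|wear_leveling'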

    I'm really pissed off with this situation, as I was planning to replace my ProCurve switch with a UniFi one, but now I have to buy SSDs as if they were ink cartridges for my printer. 👹 

    The Docker service is stopped, I have only 2 VMs running, and I still have 6MB/s of writes on the SSDs.

    Since Unraid (or, more probably, me) is not doing things right all the time, I take diagnostics from time to time. So I searched through that history for the starting point of the problem.

    Attached is a spreadsheet with the SMART data of one of the SSDs. 

    Seeing this, it seems pretty obvious things started going crazy with 6.8.0.

    [attached screenshot: SMART data spreadsheet, 2020-05-29]

    Edited by caplam
    • Like 1
    • Thanks 1
    Link to comment
    1 hour ago, caplam said:

    I'm trying to find a workaround, as Unraid is starting to throw alerts on both my SSDs.

    I followed @S1dney's post that describes a workaround. It DRASTICALLY reduced my writes. I would recommend you check it out. I can now sleep easily at night.

     

    One thing that confused me was that after I rebooted I didn't have any docker containers listed. I then had to select Add Container and then choose the template from the drop down. It was quick and only took about 10 minutes to re-add about 20 containers that I had set up previously.

     

     

    Link to comment





  • Status Definitions

     

    Open = Under consideration.

     

    Solved = The issue has been resolved.

     

    Solved version = The issue has been resolved in the indicated release version.

     

    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.

     

    Retest = Please retest in latest release.


    Priority Definitions

     

    Minor = Something not working correctly.

     

    Urgent = Server crash, data loss, or other showstopper.

     

    Annoyance = Doesn't affect functionality but should be fixed.

     

    Other = Announcement or other non-issue.