• [6.8.3] docker image huge amount of unnecessary writes on cache


    S1dney
    • Solved Urgent

    EDIT (March 9th 2021):

Solved in 6.9 and up. Reformatting the cache to the new partition alignment and hosting docker directly on a cache-only directory brought writes down to a bare minimum.

     

    ###

     

    Hey Guys,

     

First of all, I know that you're all very busy getting version 6.8 out there, something I'm very much waiting on as well. I'm seeing great progress, so thanks so much for that! I don't expect this to be at the top of the priority list, but I'm hoping someone on the development team is willing to invest some time in it (perhaps after the release).

     

    Hardware and software involved:

2 x 1TB Samsung EVO 860, set up with LUKS encryption in a BTRFS RAID1 pool.

     

    ###

TL;DR (but I'd suggest reading on anyway 😀)

The image file mounted as a loop device is causing massive writes on the cache, potentially wearing out SSDs quite rapidly.

This appears to happen only on encrypted caches formatted with BTRFS (maybe only in a RAID1 setup, but I'm not sure).

Hosting the Docker files directory on /mnt/cache instead of using the loop device seems to fix this problem.

A possible idea for an implementation is proposed at the bottom.

     

    Grateful for any help provided!

    ###

     

I have written a topic in the general support section (see link below), but I have done a lot of research lately and think I have gathered enough evidence pointing to a bug. I was also able to build a (kind of) workaround for my situation. More details below.

     

So to see what was actually hammering on the cache I started doing all the obvious things, like using a lot of find commands to trace files that were written to every few minutes, and I also used the file activity plugin. Neither was able to trace down any writes that would explain 400 GB worth of writes a day for just a few containers that aren't even that active.
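To give an idea of the kind of checks I mean, these are roughly the commands involved (a sketch only; adjust the path and the time window to your setup):

# list files on the cache that were modified in the last 10 minutes
find /mnt/cache -type f -newermt '-10 minutes' 2>/dev/null

# accumulated I/O per process, showing only processes that actually did I/O
iotop -ao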

     

Digging further I moved the docker.img to /mnt/cache/system/docker/docker.img, so directly on the BTRFS RAID1 mountpoint. I wanted to check whether the unRAID FS layer was causing the loop2 device to write this heavily. No luck either.

This did give me a situation I was able to try to reproduce on a virtual machine, so I started with a recent Debian install (I know, it's not Slackware, but I had to start somewhere ☺️). I created some vDisks, encrypted them with LUKS, bundled them in a BTRFS RAID1 setup, created the loop device on the BTRFS mountpoint (same as on /mnt/cache) and mounted it on /var/lib/docker. I made sure I had the NoCOW flag set on the IMG file like unRAID does. Strangely this did not show any excessive writes; iotop showed really healthy values for the same workload (I migrated the docker content over to the VM).
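For anyone who wants to replicate that test setup, the steps boil down to roughly the following (a sketch under assumptions: device names, sizes and paths are examples, not the exact commands I ran):

# two disks, LUKS-encrypted and pooled as BTRFS RAID1
cryptsetup luksFormat /dev/vdb && cryptsetup luksOpen /dev/vdb crypt1
cryptsetup luksFormat /dev/vdc && cryptsetup luksOpen /dev/vdc crypt2
mkfs.btrfs -d raid1 -m raid1 /dev/mapper/crypt1 /dev/mapper/crypt2
mkdir -p /mnt/cache && mount /dev/mapper/crypt1 /mnt/cache

# image file with NoCOW set (the flag has to go on while the file is still empty),
# formatted as BTRFS and loop-mounted the way unRAID does it
mkdir -p /mnt/cache/system/docker /var/lib/docker
touch /mnt/cache/system/docker/docker.img
chattr +C /mnt/cache/system/docker/docker.img
truncate -s 20G /mnt/cache/system/docker/docker.img
mkfs.btrfs /mnt/cache/system/docker/docker.img
mount -o loop /mnt/cache/system/docker/docker.img /var/lib/docker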

     

After my Debian troubleshooting I went back over to the unRAID server, wondering whether the loop device was created weirdly, so I took the exact same steps to create a new image and pointed the settings from the GUI there. Still the same write issues.

     

Finally I decided to take the whole image out of the equation and took the following steps (rough shell equivalents below):

    - Stopped docker from the WebGUI so unRAID would properly unmount the loop device.

    - Modified /etc/rc.d/rc.docker to not check whether /var/lib/docker was a mountpoint

    - Created a share on the cache for the docker files

    - Created a softlink from /mnt/cache/docker to /var/lib/docker

- Started docker using "/etc/rc.d/rc.docker start"

- Started my Bitwarden containers.
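In shell terms that comes down to roughly this (a sketch, not my literal history; paths as described above):

# stop the docker service so unRAID unmounts the loop device
/etc/rc.d/rc.docker stop

# host docker's data directly on the cache instead of inside docker.img
mkdir -p /mnt/cache/docker
rm -rf /var/lib/docker        # only safe once nothing is mounted there anymore
ln -s /mnt/cache/docker /var/lib/docker

# rc.docker was edited beforehand so it no longer insists on /var/lib/docker being a mountpoint
/etc/rc.d/rc.docker start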

     

Looking into the stats with "iotop -ao" I did not see any excessive writing taking place anymore.

    I had the containers running for like 3 hours and maybe got 1GB of writes total (note that on the loopdevice this gave me 2.5GB every 10 minutes!)

     

Now don't get me wrong, I understand why the loop device was implemented. Dockerd is started with options to make it run with the BTRFS driver, and since the image file is formatted with the BTRFS filesystem this works with every setup; it doesn't even matter whether it runs on XFS, EXT4 or BTRFS, it will just work. In my case I had to point the softlink to /mnt/cache, because pointing it to /mnt/user would not allow me to use the BTRFS driver (obviously the unRAID user share filesystem isn't BTRFS). Also the WebGUI has commands to scrub the filesystem inside the image; it's all based on the assumption that everyone is using docker on BTRFS (which of course they are, because of the image file 😁)
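To illustrate what I mean by the driver selection: docker uses the BTRFS storage driver because /var/lib/docker sits on a BTRFS filesystem (the image). In generic docker terms that corresponds to something like the lines below; unRAID builds its own command line in rc.docker, so treat this purely as an illustration:

# explicit selection of the btrfs storage driver on the docker daemon
dockerd --storage-driver=btrfs --data-root=/var/lib/docker

# or the equivalent in /etc/docker/daemon.json:
# { "storage-driver": "btrfs" }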

I must say that my approach also broke when I changed something in the shares: certain services get restarted, causing docker to be turned off for some reason. No big issue, since it wasn't meant to be a long-term solution, just a way to see whether the loop device was causing the issue, which I think my tests did point out.

     

Now I'm at the point where I would definitely need some developer help. I'm currently keeping nearly all docker containers off all day, because 300-400GB worth of writes a day is just a BIG waste of expensive flash storage, especially since I've pointed out that it's not needed at all. It does defeat the purpose of my NAS and SSD cache though, since its main purpose was hosting docker containers while allowing the HDs to spin down.

     

Again, I'm hoping someone on the dev team acknowledges this problem and is willing to invest some time. I did get quite a few hits on the forums and Reddit, but without anyone actually pointing out the root cause of the issue.

     

I'm missing the technical know-how to troubleshoot the loop device issues at a lower level, but I have been thinking about possible ways to implement a workaround, like adjusting the Docker Settings page to switch off the use of a vDisk and, if all requirements are met (pointing to /mnt/cache and BTRFS formatted), start docker on a share on the /mnt/cache partition instead of using the vDisk.

In this way you would still keep all the advantages of the docker.img file (it works across filesystem types), and users who don't care about the writes could still use it, but you'd be massively helping out others who are concerned about these writes.
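As a very rough sketch of the idea (DOCKER_TYPE and DOCKER_IMAGE are made-up names here, purely to illustrate, not actual unRAID settings):

# hypothetical branch in rc.docker
if [ "$DOCKER_TYPE" = "directory" ]; then
    # requirements checked beforehand: the directory lives on /mnt/cache and the cache is BTRFS
    mkdir -p /mnt/cache/system/docker
    mount --bind /mnt/cache/system/docker /var/lib/docker
else
    # current behaviour: loop-mount the BTRFS-formatted vDisk
    mount -o loop "$DOCKER_IMAGE" /var/lib/docker
fi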

     

I'm not attaching diagnostic files since they would probably not show anything useful.

Also, if this should have been in feature requests, I'm sorry, but I feel that, since the current solution is misbehaving in terms of writes, it could also be placed in the bug report section.

     

Thanks though for this great product; I have been using it with a lot of joy so far!

I'm just hoping we can solve this one so I can keep all my dockers running without the cache wearing out quickly.

     

    Cheers!

     




    User Feedback

    Recommended Comments



    Can confirm that the Binhex Plexpass Docker drastically reduces writes in contrast to the official Docker.

Am down from a bit more than 16GB (official Plex + Pihole) an hour to around 2.5GB an hour (Binhex Plexpass + Pihole).

    Am now thinking about running the Pihole on my actual RaspberryPi again^^

     

    Of course, how much the Pihole Docker writes is dependent on the amount of DNS queries at that moment, so take these numbers with a grain of salt. They are however obtained with

    iotop -ao

    over one hour of time, so it should not be too much of a factor.

    Edited by Lucanus
    Added information

    Hey All,

     

It is very much appreciated that you update your findings, but like many have said here, this bug simply amplifies writes and it affects BTRFS, not XFS. So reformatting to XFS or taking the loop device out of the equation works around it, but with XFS you'll lose redundancy, and modifying shell scripts is obviously unsupported.

So please post containers that write like crazy in the accompanying threads for that specific container, and also note that these writes are most likely amplified by some factor, so the real number of writes might not be that bad.

     

    Cheers.

    Edited by S1dney

And again the hint for all readers. Try the following: disable all Dockers and keep docker itself active, and the writes will still happen.

     

    It's not an issue of one specific docker!

     

Each docker more or less produces writes to the disk, that's normal. The issue is that docker itself, in combination with btrfs, produces constant writes for no reason.

    On 6/9/2020 at 12:01 PM, johnnie.black said:

    No, you have 89% left.

    Which is not great either, at that pace that drive is going to bite the dust in less than a year...

    8 hours ago, albertogomcas said:

    Which is not great either, at that pace that drive is going to bite the dust in less than a year...

    Yeah, not great, but considerable relief that it wasn't worse news!

     

     

     

    Do we know if this problem is LIKELY to have a solution, to continue using BTRFS, or is reformatting the cache drive to XFS really the best option at this point?

    1 hour ago, Moz80 said:

    Yeah, not great, but considerable relief that it wasn't worse news!

     

     

     

    Do we know if this problem is LIKELY to have a solution, to continue using BTRFS, or is reformatting the cache drive to XFS really the best option at this point?

We do not. I wish Lime could say whether it's fixed for 6.9 or not. At this point the biggest issue for me isn't whether it will get fixed. It's that I have no clue if it'll be fixed next month in a hot patch, or in 6.9 half a year away, or not at all and I should move away from btrfs, or whatever. Knowing that would give me at least a remote chance of planning my approach.

     

But it is what it is, so I'll go the XFS route; worst case I can just switch back later down the road, I guess.


Crazy thought, just throwing it out there: what if the combo BTRFS (COW) + sqlite (most of our databases) is causing the write amplification... for appdata, COW is IIRC not disabled...

And truth be told, I'm still seeing a constant write pattern on a non-unRAID system, with docker data on a btrfs SSD.

[attached screenshot: 2020-06-11 132802.png]


The last unofficial information I have is that this issue is hopefully fixed and the fix will be in v6.9-rc1, which is expected soon™. No idea if a patch/update for v6.8 will be available, but I very much doubt that, as LT doesn't usually patch older releases once a newer one is in active development, except maybe for a critical security issue.

     

Also, completely by accident, I might have found a "fix" for current users; at least it's working for me and it's repeatable. This is iotop after 5 minutes with just the Plex docker installed (which appears to be one of the most active):

     

[screenshot: iotop output before moving the image]

     

The "fix" is using the mover to move the docker image to the array (the docker service must be stopped first and the share set to cache="yes"), then back to the cache device (set the share to cache="prefer"). Start the docker service back up, and this is the same docker image running for 5 minutes after the move:

     

[screenshot: iotop output after moving the image back]

     

So a 10x decrease in write activity; please try it and let me know if it helps.

     

     

     

     

    2 hours ago, hotio said:

    probably did that 10x times, didn't work for me at all at the time

Maybe it doesn't work with all dockers; like I mentioned, it's repeatable and works consistently for me, but I only had the single Plex docker installed.

    5 hours ago, johnnie.black said:

The last unofficial information I have is that this issue is hopefully fixed and the fix will be in v6.9-rc1, which is expected soon™. No idea if a patch/update for v6.8 will be available, but I very much doubt that, as LT doesn't usually patch older releases once a newer one is in active development, except maybe for a critical security issue.

     

Thanks for the info, and for that work-around. I'm already back to an XFS cache, and I spent a couple of days setting up backups and the like, so I'm not really bothered about moving back to BTRFS at this point, but it's wonderful if this works for more people.

     

    However, we shouldn't have to hear this from you. I'm sure Limetech are working on this, I'm sure there's some sort of fix coming at some point, but radio silence for something this severe really shouldn't be the norm, especially if Limetech is shooting for a more official, polished vibe, as a company.

     

Even something simple, like:

     

    Quote

Hi everyone, we realise this is an issue for a lot of you, and we're very sorry it's happened. We're not entirely sure where the bug originated at this point, but rest assured we are working on a fix for the next release. This should enter the RC phase within the next few weeks, but obviously this is subject to change. We won't be porting this fix to 6.8.x, as too many things have changed already. We advise anyone affected by this to switch to XFS and take appropriate measures regarding backups, or at the very least to be aware of the increased wear on SSDs due to extra writes and plan accordingly.

     

    This actually tells us very little, other than not to expect a patch for 6.8, and that the release is only "soon", but at least it's something reassuring. I'm usually not the guy to advocate being seen to be doing something, rather than actually just doing the thing, but in this case I think a little more communication would have been warranted. 

    3 hours ago, johnnie.black said:

Maybe it doesn't work with all dockers; like I mentioned, it's repeatable and works consistently for me, but I only had the single Plex docker installed.

Thanks for your work @johnnie.black. It does feel strange though that moving the image would work for one docker container but not for another?

    Any ideas what changed after the mover kicked in?

     

    50 minutes ago, -Daedalus said:

    However, we shouldn't have to hear this from you. I'm sure Limetech are working on this, I'm sure there's some sort of fix coming at some point, but radio silence for something this severe really shouldn't be the norm, especially if Limetech is shooting for a more official, polished vibe, as a company.

I think Johnnie is very closely affiliated with Limetech, so I'm sure this is coming first hand. Do remember that the fact that they are not actively responding does not mean they're not working on it.

Although I do kind of agree with you a bit: not hearing anything tends to let users believe that no work is being done. It might be a quick win for the devs to give some info once in a while. I have read some ideas somewhere on the forum about creating some kind of blog (or a newsletter by mail or something) about "What we're doing", which sounded appealing to me.

     

    Nevertheless you have to acknowledge the fact that they do a lot and we all profit from it.

    Keep up the good work guys 🍻

    3 minutes ago, S1dney said:

I think Johnnie is very closely affiliated with Limetech, so I'm sure this is coming first hand. Do remember that the fact that they are not actively responding does not mean they're not working on it.

Although I do kind of agree with you a bit: not hearing anything tends to let users believe that no work is being done. It might be a quick win for the devs to give some info once in a while. I have read some ideas somewhere on the forum about creating some kind of blog (or a newsletter by mail or something) about "What we're doing", which sounded appealing to me.

     

    Nevertheless you have to acknowledge the fact that they do a lot and we all profit from it.

    Keep up the good work guys 🍻

     

    Absolutely. I in no way meant for that to come across as complaining (it may have, I apologise), more "passionate suggestion", shall we say. If anyone from the dev team ever decides to visit Ireland, I'll happily buy them a round. 🍻

    3 minutes ago, -Daedalus said:

     

    Absolutely. I in no way meant for that to come across as complaining (it may have, I apologise), more "passionate suggestion", shall we say. If anyone from the dev team ever decides to visit Ireland, I'll happily buy them a round. 🍻

    Hahaha BIG like! Cheers.

    7 hours ago, johnnie.black said:

The last unofficial information I have is that this issue is hopefully fixed and the fix will be in v6.9-rc1, which is expected soon™. No idea if a patch/update for v6.8 will be available, but I very much doubt that, as LT doesn't usually patch older releases once a newer one is in active development, except maybe for a critical security issue.

     

    Thanks for the update, appreciated

    49 minutes ago, S1dney said:

it does feel strange though that moving the image would work for one docker container but not for another?

I only tested with one, which is why I asked for other users to test.

     

    Also note that like mentioned my information is unofficial, it's not been confirmed by LT, and it might be wrong, or possibly not work for everyone or every situation, and if it is wrong you can only blame me.

     

     

    1 minute ago, johnnie.black said:

    Also note that like mentioned my information is unofficial, it's not been confirmed by LT, and it might be wrong, or possibly not work for everyone or every situation, and if it is wrong you can only blame me.

     

     

    At this point, I'll take a vague rumor over an empty void!


Just going to say I am also having these excessive writes. All dockers are writing way more than they should, but some are really out of control, with a gig written in 15 minutes from a single docker (netdata) and others not far behind.

     

Then some only have a few MB of writes, more than I would expect but not enough to worry about. I systematically went through and tested every docker individually.

     

Also I noticed that "btrfs-transacti" writes quite a bit as well; I was seeing almost 300MB/hr from that alone. So 2.5TB a year just from that, not nearly as bad as the docker issue but still quite excessive IMHO.

     

In my testing I decided to try increasing vm.dirty_expire_centisecs to see if maybe the writes would overwrite themselves in RAM before going to the SSD.

     

My hunch was correct: with most dockers, increasing this to 5 minutes dropped the writes by a factor of 8x-10x, although some were only minorly affected.

     

Now obviously this is longer than I would run daily, so I tried different lengths, and ~2 minutes seems to have been about the sweet spot (I still saw reductions when increasing it further, but I'm not really comfortable going much longer), cutting writes drastically for most dockers while being an acceptable risk short term until a real fix is found.

     

Not recommending anyone set it that long, but here is a little script I made up to make changing this easy and to automate it for testing. I have it set to 12000 (2 minutes) myself at the moment to see how that goes.

     

    #!/bin/bash
#description=This script allows changing the time data sits in ram before being written to disk. This can be useful if you have a lot of small writes to the same file, as it will overwrite in ram and only write to disk once the time is up. Set the desired time in centiseconds (1/100 s) either in the argument box if running directly or after the "argumentDefault" below if scheduling it. 3000 = 30 seconds and is the kernel default value. Careful setting this too long, as anything still in ram when the power is lost will be lost forever.
#argumentDescription= set time in centiseconds here, 3000 = 30 seconds
    #argumentDefault=6000
    
    
    echo
    echo Below is the old config
    echo
    
    sysctl -a | grep dirty
    
    echo
    echo
    
    grep -q 'vm.dirty_expire_centisecs.*' /etc/sysctl.conf && sed -i "s/vm.dirty_expire_centisecs.*/vm.dirty_expire_centisecs = "$1"/g" /etc/sysctl.conf || echo "vm.dirty_expire_centisecs = "$1"" >> /etc/sysctl.conf
                    
    sysctl -p
    
    echo
    echo Below is the new config
    echo
    
    sysctl -a | grep dirty
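
If you just want to try a value once from the shell instead of scheduling the script, the one-off equivalent is below (not persistent across reboots; the script filename is simply whatever you saved it as):

# one-off change, 12000 centiseconds = 2 minutes of holding dirty pages in RAM before writeback
sysctl vm.dirty_expire_centisecs=12000

# or run the script itself with the value as its argument
bash ./dirty_expire.sh 12000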

     

Any timeline for an official fix for this? I need a cache pool, as I only have 128GB SSDs and they are not large enough individually for the files on the cache. The only option I am coming up with right now is tossing an old 2.5" HDD into the array and putting the docker image on there till a fix is released.

    Edited by TexasUnraid

Decided to try the suggestion above to use the mover to move the docker image to the array and back again.

     

    I moved it to the array and decided to try starting it up to see what would happen as I have a BTRFS encrypted array at the moment.

     

Even with all the "bad" dockers running, which would have had me well over 1GB of writes in 10 minutes, I only have 50MB of writes in 10 mins, and I also reverted the above dirty memory tweak. Although this is on loop3 instead of loop2, I assume that is because it's on the array.

     

btrfs-transacti is now writing more than loop3 though, over 200MB in 10 mins vs the 300MB in an hour I saw before.

     

Although I am now getting this strange message in the docker settings that was not there before. I just installed unRAID a few days ago, so I'm not sure how this could be true; guessing it was triggered for some other reason.

     

    "Your existing Docker image file needs to be recreated due to an issue from an earlier beta of Unraid 6. Failure to do so may result in your docker image suffering corruption at a later time. Please do this NOW!"

    Edited by TexasUnraid
    19 minutes ago, TexasUnraid said:

     "Your existing Docker image file needs to be recreated due to an issue from an earlier beta of Unraid 6. Failure to do so may result in your docker image suffering corruption at a later time. Please do this NOW!"

    I knew this message would start popping up for people...

    34 minutes ago, tjb_altf4 said:

    I knew this message would start popping up for people...

    What is it?

     

    The plot thickens.

     

I moved the docker back to the cache pool; I still get the message about the docker image needing to be rebuilt.

     

I started up all the dockers and am watching the writes. The first thing that stands out is that it still says loop3 instead of loop2? Maybe this is what the docker message is complaining about?

     

The writes also appear to be only marginally higher than when it was on the array drive, around 60MB in 10 mins.

     

    This is offset though by btrfs-transacti now writing ~1.6GB per hour (extrapolated) vs the ~300MB I saw before.

     

    It is actually pretty close to the numbers I saw before if you combine the btrfs-transacti and loop3, almost like some of the writes have moved from the "loop" to the btrfs-transacti.

     

Maybe this will give some insight into the root issue?

     

EDIT: I noticed that even with the docker moved to the array and the cache doing exactly nothing (VMs disabled, cache is not set up for anything ATM, etc.) I was still seeing write spikes on the cache. So btrfs-transacti must still be writing to the cache, even though nothing at all is happening on it. That is concerning.
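For anyone who wants to double-check idle writes on their own cache device, the raw kernel counters are enough (replace sdX with the cache device; take two readings a few minutes apart and compare):

# field 10 of /proc/diskstats is sectors written (512-byte sectors)
awk '$3 == "sdX" { printf "%.1f MiB written so far\n", $10 * 512 / 1048576 }' /proc/diskstats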

     

    Turning in for the night.

     

Tomorrow I'm gonna try installing another array drive, formatting it as XFS, and moving it there.

    Edited by TexasUnraid

    I’ve seen this message when my Docker.img file was in a share not set to nocow or did not have the nocow attribute set on it.
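For reference, checking and setting that attribute looks like this (the path is just an example; +C only takes effect on files created while the flag is already on the directory, so the image has to be recreated afterwards):

lsattr /mnt/cache/system/docker/docker.img   # a 'C' in the output means NoCOW is set
chattr +C /mnt/cache/system/docker/          # new files created in this directory inherit NoCOW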

    5 hours ago, TexasUnraid said:

Even with all the "bad" dockers running, which would have had me well over 1GB of writes in 10 minutes, I only have 50MB of writes in 10 mins

That's encouraging and similar to my findings. I was hoping it would work for everyone else before @hotio's comments, but it still might work for most, or even everyone in case he was mistaken.

     

    5 hours ago, TexasUnraid said:

Although this is on loop3 instead of loop2, I assume that is because it's on the array.

    The loop# used can change and it's not important, just make sure it's the one where the docker image is mounted.

     

    5 hours ago, TexasUnraid said:

Although I am now getting this strange message in the docker settings that was not there before. I just installed unRAID a few days ago, so I'm not sure how this could be true; guessing it was triggered for some other reason.

This will happen every time you move the docker image to a btrfs filesystem/share with COW set to "auto" (enabled). It can be ignored for this; you could recreate the image to get rid of the warning, but then it would resume the amplified writing.

     

     

    6 hours ago, johnnie.black said:

That's encouraging and similar to my findings. I was hoping it would work for everyone else before @hotio's comments, but it still might work for most, or even everyone in case he was mistaken.

     

    The loop# used can change and it's not important, just make sure it's the one where the docker image is mounted.

     

This will happen every time you move the docker image to a btrfs filesystem/share with COW set to "auto" (enabled). It can be ignored for this; you could recreate the image to get rid of the warning, but then it would resume the amplified writing.

     

     

     

    Ok, that explains the loop and docker message.

     

    Although sadly I don't know that moving it to another drive is working, at least when that other drive is btrfs. While the loop writes dropped, the "btrfs-transacti" writes increased by almost exactly the same amount. So the total writes are only minimally lower now.

     

Watching the writes again this morning and sure enough, the loop writes are much lower (I set the dirty timeout to 5 mins before bed to help reduce writes), but btrfs-transacti is still going crazy.

     

Over 150MB in 10 mins, 8TB a year just from that, pretty extreme for a drive literally doing nothing at all.

     

    Next step is to install an XFS formatted drive and try moving it there.

     

    Edited by TexasUnraid

So before adding another drive I restarted the server, and now loop2 is back, but so are the writes.

     

    Although btrfs-transacti writes are down by a proportional amount. I am convinced that these are connected somehow.

     

Total writes are lower than before playing musical docker image, but still around ~350MB in 10 mins, or 18TB a year. It is just now split 50/50 between loop2 and btrfs-transacti.

     

The docker is still on the array drive and my cache is set up to do nothing at all, yet I am still getting writes to the cache every ~30 seconds; it should not have any writes at all.

     

    I tried increasing the dirty writeback time and sure enough, the writes mostly fell in line with the new time.

     

This issue is obviously deeper than just the docker image.

     

    Next step, XFS formatted array drive. It would be SO cool to have multiple cache pools right now.





