• [6.8.3] docker image huge amount of unnecessary writes on cache


    S1dney
    • Urgent

    Hey Guys,

     

    First of all, I know that you're all very busy getting version 6.8 out there, something I'm very much waiting on as well. I'm seeing great progress, so thanks so much for that! I don't expect this to be at the top of the priority list, but I'm hoping someone on the development team is willing to invest some time in it (perhaps after the release).

     

    Hardware and software involved:

    2 x 1TB Samsung EVO 860, setup with LUKS encryption in BTRFS RAID1 pool.

     

    ###

    TLDR (but I'd suggest to read on anyway 😀)

    The image file mounted as a loop device is causing massive writes on the cache, potentially wearing out SSDs quite rapidly.

    This appears to happen only on encrypted caches formatted with BTRFS (maybe only in a RAID1 setup, but I'm not sure).

    Hosting the Docker files directory directly on /mnt/cache instead of using the loop device seems to fix this problem.

    A possible idea for an implementation is proposed at the bottom.

     

    Grateful for any help provided!

    ###

     

    I have written a topic in the general support section (see link below), but I have done a lot of research lately and think I have gathered enough evidence pointing to a bug. I was also able to build (a kind of) workaround for my situation. More details below.

     

    To see what was actually hammering the cache I started with all the obvious things, like using a lot of find commands to trace files that were written to every few minutes, and I also used the fileactivity plugin. Neither was able to track down any writes that would explain 400 GB worth of writes a day for just a few containers that aren't even that active.

     

    Digging further I moved the docker.img to /mnt/cache/system/docker/docker.img, so directly on the BTRFS RAID1 mountpoint. I wanted to check whether the unRAID FS layer was causing the loop2 device to write this heavily. No luck either.

    This gave me a situation I was able to reproduce on a virtual machine though, so I started with a recent Debian install (I know, it's not Slackware, but I had to start somewhere ☺️). I created some vDisks, encrypted them with LUKS, bundled them in a BTRFS RAID1 setup, created the loop device on the BTRFS mountpoint (the equivalent of /dev/cache) and mounted it on /var/lib/docker. I made sure I had the NoCOW flag set on the IMG file like unRAID does. Strangely this did not show any excessive writes; iotop showed really healthy values for the same workload (I migrated the docker content over to the VM).

     

    After my Debian troubleshooting I went back over to the unRAID server, wondering whether the loop device was being created in some unusual way, so I took the exact same steps to create a new image and pointed the settings in the GUI at it. Still the same write issues. 

     

    Finally I decided to put the whole image out of the equation and took the following steps:

    - Stopped docker from the WebGUI so unRAID would properly unmount the loop device.

    - Modified /etc/rc.d/rc.docker to not check whether /var/lib/docker was a mountpoint

    - Created a share on the cache for the docker files

    - Created a softlink from /mnt/cache/docker to /var/lib/docker

    - Started docker using "/etc/rc.d/rc.docker start"

    - Started my Bitwarden containers.

     

    Looking into the stats with "iotop -ao" I did not see any excessive writing taking place anymore.

    I had the containers running for like 3 hours and maybe got 1GB of writes total (note that on the loopdevice this gave me 2.5GB every 10 minutes!)
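
    For anyone wanting to reproduce this comparison without iotop, here is a minimal sketch that reads the kernel's per-device counters instead. It assumes the Docker image is attached as loop2, as in this report, and the function names are made up for illustration. Field 7 of /sys/block/<dev>/stat is the number of 512-byte sectors written.

```shell
# Sectors written so far by a block device, read from sysfs.
# Field 7 of /sys/block/<dev>/stat counts written sectors (512 bytes each).
sectors_written() {
    awk '{ print $7 }' "/sys/block/$1/stat"
}

# Convert a sector count to MiB.
sectors_to_mib() {
    awk -v s="$1" 'BEGIN { printf "%.1f\n", s * 512 / 1048576 }'
}

# Print the MiB written to a device during an interval given in seconds.
writes_during() {
    wd_before=$(sectors_written "$1")
    sleep "$2"
    wd_after=$(sectors_written "$1")
    sectors_to_mib $(( wd_after - wd_before ))
}
```

    For example, `writes_during loop2 600` would print the MiB written to loop2 over ten minutes, which is the kind of figure quoted above.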

     

    Now don't get me wrong, I understand why the loop device was implemented. Dockerd is started with options to make it run with the BTRFS driver, and since the image file is formatted with the BTRFS filesystem this works on every setup: it doesn't matter whether it sits on XFS, EXT4 or BTRFS, it will just work. In my case I had to point the softlink to /mnt/cache because pointing it to /mnt/user would not allow me to start Docker with the BTRFS driver (obviously the unRAID filesystem isn't BTRFS). Also, the WebGUI has commands to scrub the filesystem inside the container; everything is based on the assumption that everyone is running docker on BTRFS (which of course they are because of the container 😁)

    I must say that my approach also broke when I changed something in the shares: certain services get restarted, causing docker to be turned off for some reason. No big issue since it wasn't meant to be a long-term solution, just a way to see whether the loop device was causing the issue, which I think my tests did point out.

     

    Now I'm at the point where I definitely need some developer help. I'm currently keeping nearly all docker containers off all day because 300/400GB worth of writes a day is just a BIG waste of expensive flash storage, especially since I've shown that it's not needed at all. It does defeat the purpose of my NAS and SSD cache though, since its main purpose was hosting docker containers while allowing the HDs to spin down.

     

    Again, I'm hoping someone in the dev team acknowledges this problem and is willing to invest time in it. I did get quite a few hits on the forums and reddit, but without anyone actually pointing out the root cause of the issue.

     

    I'm missing the technical know-how to troubleshoot the loop device issues on a lower level, but I have been thinking about possible ways to implement a workaround, like adjusting the Docker Settings page to allow switching off the use of a vDisk and, if all requirements are met (pointing to /mnt/cache and BTRFS formatted), starting docker on a share on the /mnt/cache partition instead of using the vDisk.

    In this way you would still keep all advantages of the docker.img file (cross filesystem type) and users who don't care about writes could still use it, but you'd be massively helping out others that are concerned over these writes.

     

    I'm not attaching diagnostic files since they would probably not show what is needed.

    Also if this should have been in feature requests, I'm sorry. But I feel that, since the solution is misbehaving in terms of writes, this could also be placed in the bugreport section.

     

    Thanks though for this great product, have been using it so far with a lot of joy! 

    I'm just hoping we can solve this one so I can keep all my dockers running without the cache wearing out quickly.

     

    Cheers!

     



    User Feedback

    Recommended Comments



    12 minutes ago, limetech said:

    If you click on the device on Main and look at the SMART data, what's the value "data units written" attribute?  Does it line up with what you are measuring as MB/hour being written?

     

    Yes, the SMART report does correlate correctly with the excessive writes.

     

    I think TexasUnraid has done a lot of helpful testing, but some of his terminology might be a bit confusing. Basically it comes down to this:

     

    SSD cache drive formatted as btrfs = huge (unacceptable) amounts of write operations (gigabytes every hour) by the loop2 device

    SSD cache drive formatted as xfs = works normally

     

    I currently have my cache drive formatted as xfs (so my SSDs don't get trashed) and it's working normally. The problem with this arrangement is that you can't have a cache pool or redundancy with xfs-formatted drives, so I'm giving up redundancy to save wear on my drives.

     

    The ideal solution would be:

     

    1. Fix the bug with cache+btrfs so that the drive writes are reduced to a normal level, and we can go back to having cache pools/redundancy

    2. Somehow make cache pools/RAID1 available with xfs-formatted cache drives

     

     

    37 minutes ago, limetech said:

    If you click on the device on Main and look at the SMART data, what's the value "data units written" attribute?  Does it line up with what you are measuring as MB/hour being written?

    Yes, that is the SMART attribute I am referring to (Samsung drives call it LBAs written), and it is how we are measuring the writes.
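
    As a rough illustration of how such a SMART value turns into the GB figures used in this thread, here is a sketch. It assumes a SATA drive that exposes a Total_LBAs_Written attribute counted in 512-byte LBAs; the attribute name and unit differ between vendors, so check your own drive. Both function names are hypothetical.

```shell
# Raw value of the Total_LBAs_Written attribute for a device (needs root).
# Assumption: the attribute exists under this name; RAW_VALUE is column 10.
raw_lbas_written() {
    smartctl -A "$1" | awk '$2 == "Total_LBAs_Written" { print $10 }'
}

# Convert an LBA count to GiB, assuming 512-byte LBAs.
lbas_to_gib() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 512 / (1024 * 1024 * 1024) }'
}
```

    Something like `lbas_to_gib "$(raw_lbas_written /dev/sdX)"` taken once a day gives a running total that can be compared with the MB/hour measurements.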

    22 minutes ago, grigsby said:

     

    Yes, the SMART report does correlate correctly with the excessive writes.

     

    I think TexasUnraid has done a lot of helpful testing, but some of his terminology might be a bit confusing. Basically it comes down to this:

     

    SSD cache drive formatted as btrfs = huge (unacceptable) amounts of write operations (gigabytes every hour) by the loop2 device

    SSD cache drive formatted as xfs = works normally

     

    I currently have my cache drive formatted as xfs (so my SSDs don't get trashed) and it's working normally. The problem with this arrangement is that you can't have a cache pool or redundancy with xfs-formatted drives, so I'm giving up redundancy to save wear on my drives.

     

    The ideal solution would be:

     

    1. Fix the bug with cache+btrfs so that the drive writes are reduced to a normal level, and we can go back to having cache pools/redundancy

    2. Somehow make cache pools/RAID1 available with xfs-formatted cache drives

     

     

    Yeah, I just refer to XFS vs BTRFS as the location or type of drive made basically no difference in my testing. I have had things on cache, array, unassigned device, you name it. All that seems to really matter is the file system being either XFS or BTRFS.

     

    XFS = everything works fine

    BTRFS = Excessive writes, orders of magnitude higher.

     

    That said, I am late to this thread and am most likely using different terminology than others. My bad there.

    Edited by TexasUnraid

    And it is not only writes to docker.img, seeing amplified writes to appdata (on cache) too. If I put my MariaDB databases in appdata I will get several gigs written every hour. Even with really small changes to the databases. I have had to move the Mariadb dir with databases from appdata to one of my unassigned devices to save some wear and tear on my ssds. 

     

    I would like to move it back to appdata asap but I also want my hardware to live longer. My drives have 5 years of warranty OR 400+ TBW. I really want to retain the warranty for at least 5 years, but if these strange writes continue, people here will have lots of drives out of warranty because they hit the warranty TBW limit fast. My drives are dedicated NAS drives, hence the 400+ TBW limit, but I would guess that most users use more ordinary brands like Samsung, where the limit could be as low as 140 TBW. With this black hole of writing, they could be out of warranty in under a year. That's why I switched from Samsung to Seagate IronWolf SSDs. 

    Edited by Niklas
    Just now, Niklas said:

    And it is not only writes to docker.img, seeing amplified writes to appdata (on cache) too. If I put my MariaDB databases in appdata I will get several gigs written every hour. Even with really small changes to the databases. I have had to move the Mariadb dir with databases to one of my unassigned devices to save some wear and tear on my ssds. 

    Yep, I saw around 100X the writes to appdata when it is on a BTRFS drive vs XFS (500mb/hour vs 5mb/hour)

    54 minutes ago, Niklas said:

    If I put my MariaDB databases in appdata I will get several gigs written every hour. Even with really small changes to the databases. I have had to move the Mariadb dir with databases from appdata to one of my unassigned devices to save some wear and tear on my ssds. 

    Is this MariaDB specific?  That is, with the DB moved onto xfs volume but rest of appdata still on btrfs, does the btrfs  volume go back to "normal"?

    9 minutes ago, limetech said:

    Is this MariaDB specific?  That is, with the DB moved onto xfs volume but rest of appdata still on btrfs, does the btrfs  volume go back to "normal"?

    Not in my testing, I had these issues and the only containers I had running were qbittorrent, mumble, lancache, krusader, jdownloader, tinymediamanager.

     

    None of them are actively being used, simply running. On an XFS volume they wrote around 5mb/hour to appdata. On a BTRFS drive, with exactly the same settings, dockers and appdata, they would write around 500mb/hour.

     

    If I installed a DB container the writes would skyrocket even higher, as much as 800-900mb/hour for a single DB container that should at most be writing ~50-100mb/hour according to docker stats (plus the fact it was not being used)

     

    It seems to be proportionate to the actual writes on BTRFS. Containers that write more, will have even larger writes. Some are reporting 10-20gb an hour or more with containers like plex and DB's.

     

    To the point that SSDs are overheating and all CPU cores are pegged just handling writes.

    Edited by TexasUnraid
    11 minutes ago, limetech said:

    Is this MariaDB specific?  That is, with the DB moved onto xfs volume but rest of appdata still on btrfs, does the btrfs  volume go back to "normal"?

    No. MariaDB gives me the biggest impact because of constant (but very small) changes happening almost all the time. I don't think this is specific to any docker. All writes to the cache SSDs formatted as btrfs result in something like 10-100 times more data written than expected. Containers that write or change more data, like Plex or MariaDB, are affected much more noticeably. 

     

    Writes to docker.img and to /cache both show this write amplification.

     

     

    Edited by Niklas
    8 hours ago, limetech said:

    This topic is tldr but wondering if anyone has tried turning off btrfs COW?  Either on the docker.img file itself (if stored on a btrfs volume) or within the btrfs file system image.

    It doesn't make any difference for me, and I would guess some affected users are using the default system share, which defaults to NOCOW.


    I've been playing with the various btrfs mount options and possibly found one that appears to make a big difference, at least for now. While it doesn't look like a complete fix, for me it decreases writes about 5 to 10 times. This option appears to work both for the docker image on my test server and, more encouragingly, also for the VM problem on my main server. It's done by remounting the cache with the nospace_cache option. From my understanding this is perfectly safe (though there could be a performance penalty) and it will go back to the default (using space cache) at the next array re-start. If anyone else wants to try it just type this:

    mount -o remount -o nospace_cache /mnt/cache

     

    Will let it run for 24 hours and check device stats tomorrow. On average my server does around 2TB of writes per day; the current value is:

     

    [image: screenshot of current device statistics]

     

    But as mentioned it's not a complete fix: I'm still seeing constant writes to cache, but where before they were hovering around 40-60MB/s they are now around 3-10MB/s, so I'll take it for now:

     

    [image: screenshot of cache write activity]

    2 hours ago, johnnie.black said:

     if anyone else wants to try it just type this:

    
    mount -o remount -o nospace_cache /mnt/cache

     

    Before this: 762MB on loop2 over 5 minutes
    After this: 120MB on loop2 over 5 minutes

     

    Good bandaid for now, thanks

    10 hours ago, Niklas said:

    And it is not only writes to docker.img, seeing amplified writes to appdata (on cache) too. If I put my MariaDB databases in appdata I will get several gigs written every hour. Even with really small changes to the databases. I have had to move the Mariadb dir with databases from appdata to one of my unassigned devices to save some wear and tear on my ssds. 

     

    I would like to move it back to appdata asap but I also want my hardware to live longer. My drives have 5 years of warranty OR 400+ TBW. I really want to retain the warranty for at least 5 years, but if these strange writes continue, people here will have lots of drives out of warranty because they hit the warranty TBW limit fast. My drives are dedicated NAS drives, hence the 400+ TBW limit, but I would guess that most users use more ordinary brands like Samsung, where the limit could be as low as 140 TBW. With this black hole of writing, they could be out of warranty in under a year. That's why I switched from Samsung to Seagate IronWolf SSDs. 

     

    10 hours ago, TexasUnraid said:

    Yep, I saw around 100X the writes to appdata when it is on a BTRFS drive vs XFS (500mb/hour vs 5mb/hour)

    I must say that I was reluctant to believe these statements. I tested writing to the BTRFS cache devices in the beginning and could not see the write amplification there.

     

    Now, going back to the fact that my SMART data still shows my drives writing around 40GB a day, this does seem like quite a lot on second thought.

    TBW on 14-06-2020 23:57:01 --> 12.1 TB, which is 12370.2 GB.
    TBW on 15-06-2020 23:57:01 --> 12.1 TB, which is 12392.6 GB.
    TBW on 16-06-2020 23:57:01 --> 12.1 TB, which is 12431.4 GB.
    TBW on 17-06-2020 23:57:01 --> 12.2 TB, which is 12469.0 GB.
    TBW on 18-06-2020 23:57:01 --> 12.2 TB, which is 12507.4 GB.
    TBW on 19-06-2020 23:57:01 --> 12.3 TB, which is 12547.5 GB.

    I'm not really complaining though, because these writes are negligible on drives with a 300 TBW warranty. 

    However, since docker now lives directly on the BTRFS mountpoint, this could well be lower, because my containers aren't very busy.

     

    Still, that is considerably lower than the 300/400GB of daily writes I saw while still using the docker.img file.

    TBW on 11-11-2019 23:59:02 --> 3.8 TB, which is 3941.2 GB.
    TBW on 12-11-2019 23:59:01 --> 4.2 TB, which is 4272.1 GB.
    TBW on 13-11-2019 23:59:01 --> 4.5 TB, which is 4632.5 GB.
    TBW on 14-11-2019 23:59:01 --> 4.9 TB, which is 5044.0 GB.
    TBW on 15-11-2019 23:59:01 --> 5.2 TB, which is 5351.3 GB.
    TBW on 16-11-2019 23:59:01 --> 5.3 TB, which is 5468.8 GB.
    TBW on 17-11-2019 23:59:01 --> 5.5 TB, which is 5646.1 GB.
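
    The per-day figures are easier to compare as differences between consecutive log lines. A small sketch, assuming the log format shown above (the script that produced those lines is not part of this thread, and `daily_deltas` is a made-up name):

```shell
# Read "TBW on <date> <time> --> x TB, which is y GB." lines on stdin and
# print "<date> <GB written since the previous line>" for each line after the first.
daily_deltas() {
    awk '/^ *TBW on / {
        gb = $(NF - 1)                 # second-to-last field is the GB total
        if (seen) printf "%s %.1f\n", $3, gb - prev
        prev = gb
        seen = 1
    }'
}
```

    Fed the June 2020 lines it gives roughly 22 to 40 GB per day; fed the November 2019 lines it gives roughly 118 to 412 GB per day.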

     

    2 hours ago, johnnie.black said:

    Will let it run for 24 hours and check device stats tomorrow,

    It's only been 3 hours, but I can see from my previous stats posts that before doing this the average writes to cache over the last 36 days were 2.78TB per day. Over the last 26 hours it was 3.6TB, which is a little higher than average and works out to 139GB/h. Since the change, over the last 3 hours, it has been writing 28GB/h on average. So while not ideal it's a fivefold decrease, not bad: if the SSD was going to last 5 months before, it's now going to last 2 years. I'll take it.
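
    The arithmetic behind that estimate can be sketched as follows. The function names are made up, decimal units (1 TB = 1000 GB) are an assumption, and the lifetime scaling simply assumes drive life is inversely proportional to the write rate.

```shell
# Average write rate in GB/hour, from a total in TB over a number of hours.
gb_per_hour() {
    awk -v tb="$1" -v h="$2" 'BEGIN { printf "%.0f\n", tb * 1000 / h }'
}

# Scale an expected lifetime by the ratio of the old to the new write rate.
scaled_lifetime() {
    awk -v life="$1" -v old_rate="$2" -v new_rate="$3" \
        'BEGIN { printf "%.1f\n", life * old_rate / new_rate }'
}
```

    With the rounded figures from the post, `gb_per_hour 3.6 26` gives about 138 GB/h, close to the 139 GB/h quoted, and `scaled_lifetime 5 139 28` gives about 24.8 months, which is the "2 years" above.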


    On the fresh reinstall my writes have climbed from 5gb/hour to almost 7gb/hour overnight. Docker stats shows it should actually be closer to 60-70mb/hour, and technically a lot less since a lot of those writes happen at startup.

     

    Yes, I know it is not nearly as bad as for most of you, but to put things in perspective, I am still using 1.5TB drives from 2006 that have 45k+ hours. I just had my first one die at 48k hours (well, it didn't technically die, it just started getting more bad sectors than I could accept).

     

    It is not just possible that I could be using the same hardware for 10+ years but probable since finding money for this stuff ain't easy.

     

    Plus my SSD only has a TBW limit of 72.

    Edited by TexasUnraid
    3 hours ago, johnnie.black said:

    I've been playing with the various btrfs mount options and possibly found one that appears to make a big difference, at least for now. While it doesn't look like a complete fix, for me it decreases writes about 5 to 10 times. This option appears to work both for the docker image on my test server and, more encouragingly, also for the VM problem on my main server. It's done by remounting the cache with the nospace_cache option. From my understanding this is perfectly safe (though there could be a performance penalty) and it will go back to the default (using space cache) at the next array re-start. If anyone else wants to try it just type this:

    
    mount -o remount -o nospace_cache /mnt/cache

     

    Will let it run for 24 hours and check device stats tomorrow. On average my server does around 2TB of writes per day; the current value is:

     

    [image: screenshot of current device statistics]

     

    But as mentioned it's not a complete fix: I'm still seeing constant writes to cache, but where before they were hovering around 40-60MB/s they are now around 3-10MB/s, so I'll take it for now:

     

    [image: screenshot of cache write activity]

    Testing this out now, the first 10 mins looks very promising. Dropped from 5GB/hour down to ~500mb/hour.

     

    Going to let it run a few hours to see what happens. Still inflated over XFS numbers but much better.

     

    Do we know what kind of performance penalty this would cause?

     

    I assume that docker/VMs need to be stopped before running this command? I'm also wondering if this could be scripted to run at array start.

    20 minutes ago, TexasUnraid said:

    Do we know what kind of performance penalty this would cause?

    It can vary from system to system and with how it's used; in some cases it might not be noticeable, or it might even perform better.

     

    22 minutes ago, TexasUnraid said:

    I assume that docker/VM's need to be stopped before running this command?

    No need.
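
    On the question of scripting this: since the option reverts at the next array start, a sketch like the following could reapply it. This is only an illustration under assumptions: the function names are made up, it assumes findmnt is available on the server, and how the script gets triggered at array start is left open. The remount itself is the same operation as the command above, written with comma-separated options.

```shell
# True only for filesystem types that should get the remount.
wants_nospace_cache() {
    [ "$1" = "btrfs" ]
}

# Remount a mount point with nospace_cache, but only if it is btrfs.
remount_nospace_cache() {
    rn_fstype=$(findmnt -n -o FSTYPE "$1")
    if wants_nospace_cache "$rn_fstype"; then
        mount -o remount,nospace_cache "$1"
    else
        echo "skipping $1: filesystem is '${rn_fstype:-not mounted}', not btrfs" >&2
        return 1
    fi
}
```

    Calling `remount_nospace_cache /mnt/cache` after the array has started would then apply the workaround only when the cache is actually btrfs.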

     

     

    5 minutes ago, johnnie.black said:

    It can vary from system to system and with how it's used; in some cases it might not be noticeable, or it might even perform better.

     

    No need.

     

     

     

    Cool, will let it run a few hours and see how things go.

     

    I was doing a bit of reading on this and it seems like it should be using ram for this caching and it should not have any effect on writes. Any guess as to why this could be helping?

    Edited by TexasUnraid
    Just now, TexasUnraid said:

    Any guess as to why this could be helping?

    Honestly no idea, was just trying some of the mount options to see if any of them made any difference.


    Ok, a few hours later with nospace_cache and writes are at 825mb/hour.

     

    Still a whole lot better than the 5-7GB/hour and climbing I saw before, but not the 200-300mb/hour it should be.

     

    For some reason when I tried to remove a drive from the cache pool the whole pool died, so got to reinstall yet again. Fun.

     

    Tried re-creating the raid 0 cache pool and removing the drive again, once again the pool died even though there is plenty of room (only docker and appdata on it right now).

     

    There is only 25GB of data on the cache, so it doesn't make sense.

     

    Strange since it worked earlier when I removed a drive from a raid 0 pool.

    Edited by TexasUnraid
    1 hour ago, TexasUnraid said:

    Strange since it worked earlier when I removed a drive from a raid 0 pool.

    Don't see how, since RAID0 tries to spread the data across all the devices for speed. Normally you would need to tell btrfs you intended to remove the device and allow it to remove the data from the device you want to remove.


    It worked once before, and the FAQ also says that it can be done.

     

    Quote

    You can remove devices from any type of pool (single, raid0/1, raid5/6, raid10, etc) but make sure to only remove one device at a time, i.e., you can't remove 2 devices at the same time from any kind of pool, you can remove them one at a time after waiting for each balance to finish (as long as there's enough free space on the remaining devices).

     

    How could I tell it to remove the device? I don't see any such options?

    31 minutes ago, TexasUnraid said:

    It worked before once,

    You had the other device still connected yes? It needs to be for a non redundant pool to be converted, you just unassign it and start the array.

     

    That reminds me that I should add that to the FAQ since while it should be obvious some users might assume it's not needed.

     

    Other than that, are you using the new beta? Didn't try it there yet, something might be broken.

    10 minutes ago, johnnie.black said:

    You had the other device still connected yes? It needs to be for a non redundant pool to be converted, you just unassign it and start the array.

     

    That reminds me that I should add that to the FAQ since while it should be obvious some users might assume it's not needed.

     

    Other than that, are you using the new beta? Didn't try it there yet, something might be broken.

    Yep, that's exactly what I did, simply unassigned the device and started the array.

     

    It then says cache is not mountable and it needs to format it. I didn't format and put the device back in the cache pool like before and tried to start it but it still said the same thing.

     

    Luckily nothing was on it but the docker image so nothing was lost but a bit scary for any future changes.

    4 minutes ago, TexasUnraid said:

    Yep, thats exactly what I did, simply unassigned the device and started the array.

    If you haven't rebooted yet please post or pm me the diags so I can try to see what happened.

    4 minutes ago, johnnie.black said:

    If you haven't rebooted yet please post or pm me the diags so I can try to see what happened.

    Sadly I did reboot but I can try doing it again to see if it happens again.

     

    Where do I get the diag you want?



