• [6.8.3] docker image huge amount of unnecessary writes on cache


    S1dney
    • Solved Urgent

    EDIT (March 9th 2021):

    Solved in 6.9 and up. Reformatting the cache to the new partition alignment and hosting docker directly on a cache-only directory brought writes down to a bare minimum.

     

    ###

     

    Hey Guys,

     

    First of all, I know you're all very busy getting version 6.8 out there, something I'm very much waiting on as well. I'm seeing great progress, so thanks so much for that! I don't expect this to be at the top of the priority list, but I'm hoping someone on the development team is willing to investigate (perhaps after the release).

     

    Hardware and software involved:

    2 x 1TB Samsung 860 EVO, set up with LUKS encryption in a BTRFS RAID1 pool.

     

    ###

    TLDR (but I'd suggest reading on anyway 😀)

    The image file mounted as a loop device is causing massive writes on the cache, potentially wearing out SSDs quite rapidly.

    This appears to happen only on encrypted caches formatted with BTRFS (maybe only in a RAID1 setup, but I'm not sure).

    Hosting the Docker files directory on /mnt/cache instead of using the loop device seems to fix this problem.

    A possible idea for implementation is proposed at the bottom.

     

    Grateful for any help provided!

    ###

     

    I have written a topic in the general support section (see link below), but I have done a lot of research lately and think I have gathered enough evidence pointing to a bug. I was also able to build (kind of) a workaround for my situation. More details below.

     

    So to see what was actually hammering on the cache I started doing all the obvious things, like using a lot of find commands to trace files that were written to every few minutes, and also used the File Activity plugin. Neither was able to trace down any writes that would explain 400 GB worth of writes a day for just a few containers that aren't even that active.

     

    Digging further, I moved the docker.img to /mnt/cache/system/docker/docker.img, so directly on the BTRFS RAID1 mountpoint. I wanted to check whether the unRAID FS layer was causing the loop2 device to write this heavily. No luck either.

    This gave me a situation I was able to reproduce on a virtual machine though, so I started with a recent Debian install (I know, it's not Slackware, but I had to start somewhere ☺️). I created some vDisks, encrypted them with LUKS, bundled them in a BTRFS RAID1 setup, created the loop device on the BTRFS mountpoint (same as /mnt/cache) and mounted it on /var/lib/docker. I made sure I had the NoCOW flag set on the IMG file like unRAID does. Strangely this did not show any excessive writes; iotop shows really healthy values for the same workload (I migrated the docker content over to the VM).
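
    For anyone wanting to reproduce it, the Debian setup roughly looked like the sketch below (the device and mapper names are just examples from my VM, and the 20G image size is arbitrary):

        # Encrypt two blank vDisks with LUKS
        cryptsetup luksFormat /dev/vdb
        cryptsetup luksFormat /dev/vdc
        cryptsetup open /dev/vdb crypt1
        cryptsetup open /dev/vdc crypt2

        # Bundle them into a BTRFS RAID1 pool and mount it like /mnt/cache
        mkfs.btrfs -d raid1 -m raid1 /dev/mapper/crypt1 /dev/mapper/crypt2
        mkdir -p /mnt/cache && mount /dev/mapper/crypt1 /mnt/cache

        # Create the image with NoCOW set (chattr +C only takes on an empty file)
        mkdir -p /mnt/cache/system/docker
        touch /mnt/cache/system/docker/docker.img
        chattr +C /mnt/cache/system/docker/docker.img
        truncate -s 20G /mnt/cache/system/docker/docker.img
        mkfs.btrfs /mnt/cache/system/docker/docker.img

        # Mount it as a loop device on /var/lib/docker
        mkdir -p /var/lib/docker
        mount -o loop /mnt/cache/system/docker/docker.img /var/lib/docker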

     

    After my Debian troubleshooting I went back over to the unRAID server, wondering whether the loop device was created weirdly, so I took the exact same steps to create a new image and pointed the settings from the GUI there. Still the same write issues.

     

    Finally I decided to take the whole image out of the equation and took the following steps (a rough shell sketch follows the list):

    - Stopped docker from the WebGUI so unRAID would properly unmount the loop device.

    - Modified /etc/rc.d/rc.docker to not check whether /var/lib/docker was a mountpoint

    - Created a share on the cache for the docker files

    - Created a softlink from /mnt/cache/docker to /var/lib/docker

    - Started docker using "/etc/rc.d/rc.docker start"

    - Started my Bitwarden containers.
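
    A minimal sketch of those steps in shell terms (paths as I used them; it assumes docker was already stopped via the WebGUI so the loop device is unmounted):

        # Remove the now-empty mountpoint and link the cache share in its place
        rmdir /var/lib/docker
        mkdir -p /mnt/cache/docker
        ln -s /mnt/cache/docker /var/lib/docker

        # Start docker manually (rc.docker edited beforehand to skip the mountpoint check)
        /etc/rc.d/rc.docker start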

     

    Looking into the stats with "iotop -ao" I did not see any excessive writing taking place anymore.

    I had the containers running for about 3 hours and maybe got 1 GB of writes total (note that on the loop device this gave me 2.5 GB every 10 minutes!).
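
    (For anyone following along: the -a flag makes iotop show accumulated totals since it started, and -o filters it to processes that have actually done I/O, which makes the loop2 writes easy to spot.)

        # Accumulated I/O per process since start, only showing active ones
        iotop -ao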

     

    Now don't get me wrong, I understand why the loop device was implemented. Dockerd is started with options to make it run with the BTRFS driver, and since the image file is formatted with the BTRFS filesystem this works on every setup; it doesn't even matter whether the cache runs on XFS, EXT4 or BTRFS, it will just work. In my case I had to point the softlink to /mnt/cache, because pointing it to /mnt/user would not allow me to use the BTRFS driver (obviously the unRAID user filesystem isn't BTRFS). Also the WebGUI has commands to scrub the filesystem inside the image; everything is based on the assumption that everyone is using docker on BTRFS (which of course they are, because of the image 😁)
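
    (To illustrate, a hedged sketch of what that looks like; I'm not claiming these are unRAID's exact arguments:)

        # dockerd pinned to the BTRFS storage driver -- this only works when
        # /var/lib/docker actually sits on a BTRFS filesystem
        dockerd --storage-driver=btrfs --data-root=/var/lib/docker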

    I must say that my approach also broke when I changed something in the shares: certain services get restarted, causing docker to be turned off for some reason. No big issue, since it wasn't meant to be a long-term solution, just to see whether the loop device was causing the issue, which I think my tests did point out.

     

    Now I'm at the point where I would definitely need some developer help. I'm currently keeping nearly all docker containers off all day, because 300-400 GB worth of writes a day is just a BIG waste of expensive flash storage, especially since I've pointed out that it's not needed at all. It does defeat the purpose of my NAS and SSD cache though, since its main purpose was hosting docker containers while allowing the HDDs to spin down.

     

    Again, I'm hoping someone on the dev team acknowledges this problem and is willing to investigate. I did get quite a few hits on the forums and Reddit, but without anyone actually pointing out the root cause of the issue.

     

    I'm missing the technical know-how to troubleshoot the loop device issues on a lower level, but I have been thinking about possible ways to implement a workaround, like adjusting the Docker Settings page to switch off the use of a vDisk and, if all requirements are met (pointing to /mnt/cache and BTRFS formatted), starting docker on a share on the /mnt/cache partition instead of using the vDisk.

    That way you would still keep all the advantages of the docker.img file (cross-filesystem compatibility) and users who don't care about writes could still use it, but you'd be massively helping out others who are concerned about these writes.
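
    As a very rough sketch of the idea (the setting name is hypothetical, this is not actual rc.docker code):

        # DOCKER_IMAGE_TYPE is a hypothetical setting: "image" keeps the current
        # loop device behaviour, "directory" uses the cache directly
        if [ "$DOCKER_IMAGE_TYPE" = "directory" ] \
           && [ "$(stat -f -c %T /mnt/cache)" = "btrfs" ]; then
            mkdir -p /mnt/cache/system/docker
            ln -sfn /mnt/cache/system/docker /var/lib/docker  # assumes no directory in the way
        else
            mount -o loop "$DOCKER_IMAGE" /var/lib/docker
        fi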

     

    I'm not attaching diagnostic files since they would probably not point out what's needed.

    Also, if this should have been in feature requests, I'm sorry. But I feel that, since the current solution is misbehaving in terms of writes, this could also be placed in the bug report section.

     

    Thanks though for this great product, I have been using it so far with a lot of joy!

    I'm just hoping we can solve this one so I can keep all my dockers running without the cache wearing out quickly.

     

    Cheers!

     

    • Like 3
    • Thanks 17



    User Feedback

    Recommended Comments



    1 hour ago, TexasUnraid said:

    If you go back a ways in this thread, you will find a few pages of me testing every possible scenario.

     

    While the docker image is the main culprit for sure, appdata was not far behind. With just appdata on the BTRFS pool I was still seeing around 800 MB/hour IIRC, vs ~200 MB/hour combined with both on XFS.

    Have you redone the tests on 6.9.0 with the partition aligned to 1MiB?

    It makes a huge difference.

     

    Also you probably missed my point a bit. There is a balance to be struck between the need for endurance and the need for resiliency.

    • The docker image has the lowest need for resiliency (everything in it is reinstallable, so recovering from a complete loss is a mundane mouse-clicking affair), so the need to increase SSD longevity naturally floats to the top.
      • Then you add the loop2 amplification, which is consistently the highest and exclusively affects the docker image. That builds the case for having the docker image on the XFS disk.
    • Appdata does have some need for resiliency, because reconfiguring every app is a pain in the backside, if not impossible in some cases. So one has to debate whether the need to reduce SSD wear trumps the need to protect the appdata against failure.
      • In an ideal scenario you would have a backup to mitigate the risk, but just as parity is not a backup, a backup isn't parity either (note: a mirror i.e. RAID-1 is a special case of parity).

     

    It's like the UK government's misguided effort to promote diesel cars to reduce carbon emissions: the end result was that air quality went down the drain due to the particulate matter and nitrogen oxides in diesel exhaust.

    So people don't die 10 years down the road because of global warming; they die next year because of lung cancer.

     

     

    Link to comment

    Yeah, I tested it on 6.9 as well, and while writes were lower across the board (roughly half vs 6.8 IIRC), it was still many times higher than using an XFS cache.

     

    I am aware of the risks with data loss. I am not worried about it personally for a few reasons.

     

    1: I have never had an SSD I trust die on me (I had one dead when I first plugged it in, but that was DOA).

     

    2: Appdata is backed up with the CA Backup tool on a weekly basis. Now that everything is set up and working, having to fall back a week on the dockers is not something I am worried about. I can manually run a backup if I make a lot of changes. I would much rather have the reduced writes.

     

    Worst case, an hour of work would restore my XFS cache drive.

    Link to comment
    9 hours ago, testdasi said:

    (...) Are you using 6.9.0? Did you also align the partition to 1MiB? That requires wiping the pool, so I would assume very few people have done it.

    No, I'm on 6.8.3 and I did not align the partition to 1MiB (it's MBR: 4K-aligned).

    What is the benefit of aligning it to 1MiB? I must have missed this "tuning" advice...

    Link to comment
    4 minutes ago, vakilando said:

    No, I'm on 6.8.3 and I did not align the partition to 1MiB (it's MBR: 4K-aligned).

    What is the benefit of aligning it to 1MiB? I must have missed this "tuning" advice...

    You can't align it until 6.9; it won't work with 6.8.

     

    It was explained a few pages back, but basically it ensures that each 4 KiB block of the drive is accessed individually. As it is, it might need to access two 4 KiB blocks for every write, possibly doubling the writes (which is almost what I saw).
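
    (If you want to check your own alignment, something along these lines works; /dev/sdb is a placeholder:)

        # Show the partition start sector (in 512-byte sectors);
        # a start of 2048 means 1MiB-aligned
        fdisk -l /dev/sdb

        # Or let parted check partition 1 against optimal alignment
        parted /dev/sdb align-check opt 1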

    Link to comment
    57 minutes ago, vakilando said:

    No, I'm on 6.8.3 and I did not align the partition to 1MiB (it's MBR: 4K-aligned).

    What is the benefit of aligning it to 1MiB? I must have missed this "tuning" advice...

    Yep, 6.9.0 should bring improvement to your situation. But as I said, you need to wipe the drive in 6.9.0 to reformat it with the 1MiB alignment, and needless to say that would make the drive incompatible with Unraid versions before 6.9.0.

    Essentially: back up, stop array, unassign, blkdiscard, assign back, start and format, restore backup. Besides backing up and restoring, the middle part of the process took 5 minutes.
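
    (In shell terms the middle step is basically just this; /dev/sdX is a placeholder, and it wipes the whole device, so triple-check the target:)

        # DESTRUCTIVE: discards every block on the SSD (a near-instant wipe)
        blkdiscard /dev/sdX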

     

    I expect LT to provide more detailed guidance on this, perhaps when 6.9.0 enters RC or at least when it becomes stable.

    Not that 6.9.0-beta isn't stable. I did see some bug reports, but I personally have only seen the virtio / virtio-net thingie, which was fixed by using the Q35-5.0 machine type (instead of 4.2). No need to use virtio-net, which negatively affects network performance.

     

     

     

    PS: I've been running iotop for 3 hours and am still averaging about 345 MB/hr. We'll see if my daily housekeeping affects it tonight.

    Edited by testdasi
    Link to comment
    6 hours ago, testdasi said:

    No need to use virtio-net which negatively affects network performance.

    Doesn't the change from virtio to virtio-net happen automatically when you open and then save a template under the 6.9.0-betas, making it quite difficult to avoid, unless you're aware of it and revert manually?

     

    From the release notes:

    Quote

    You need to edit each VM and change the model type for the Ethernet bridge from "virtio" to "virtio-net".  In most cases this can be accomplished simply by clicking Update in "Form View" on the VM Edit page.
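
    (In practice the change is to the NIC model in the VM's domain XML; a sketch with a hypothetical VM name:)

        # Inspect the current NIC model ("Windows10" is a made-up VM name)
        virsh dumpxml Windows10 | grep -A2 '<interface'

        # The edit amounts to changing
        #   <model type='virtio'/>  ->  <model type='virtio-net'/>
        virsh edit Windows10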

     

    Link to comment

    So I repeated my test overnight for 15 hours:

    • Unraid 6.9.0-beta25
    • 2x Intel 750 1.2TB
    • BTRFS RAID-0 for data chunks, RAID-1 for metadata + system chunks
    • Both partitions aligned to 1MiB
    • 35 dockers running in a business-as-usual (BAU) pattern, i.e. not trying to keep things idle

     

    Still averaging about 350 MB/hr (or ~8.5 GB/day) on loop2, so it sounds like that's my best baseline.

    Loop2 is 5th on the list, at only about 2% of the top entry (which I know for sure has written that much data). So basically negligible.

     

    Link to comment

    So I was planning to add 2x 860 EVO 1TB SSDs this weekend as a RAID-1 BTRFS cache pool... I'm on 6.8.2. Is my best bet just to wait for 6.9 to get released? Is there anything I can do on my current version to get this going now without having to wait?

    Link to comment
    1 hour ago, DerfMcDoogal said:

    So I was planning to add 2x 860 EVO 1TB SSDs this weekend as a RAID-1 BTRFS cache pool... I'm on 6.8.2. Is my best bet just to wait for 6.9 to get released? Is there anything I can do on my current version to get this going now without having to wait?

    Update to 6.9.0-beta25? 😉

     

    Longer answer: what you can do is update to 6.9.0-beta25 now and test your server thoroughly (+ do any necessary tweaks, e.g. the Q35-5.0 machine type / virtio-net etc.). As long as it's stable for you, there's no need to worry about the beta label. Then when you are ready, plop the 2 SSDs into a new pool and format.

    Link to comment
    21 minutes ago, testdasi said:

    Update to 6.9.0-beta25? 😉

     

    Longer answer: what you can do is update to 6.9.0-beta25 now and test your server thoroughly (+ do any necessary tweaks, e.g. the Q35-5.0 machine type / virtio-net etc.). As long as it's stable for you, there's no need to worry about the beta label. Then when you are ready, plop the 2 SSDs into a new pool and format.

    Thanks. So the 1MiB alignment should fix the excessive write issue? Am I good to just allow mover to move my docker image, as it exists, over to the newly created cache? I see a lot of posts about XFS vs BTRFS, .img vs folder. It's all too confusing.

    My goal is to get my appdata moved to a cache pool without destroying $300 worth of SSDs. LOL.

    Link to comment
    On 8/11/2020 at 9:57 PM, testdasi said:

    Yep, 6.9.0 should bring improvement to your situation. But as I said, you need to wipe the drive in 6.9.0 to reformat it with the 1MiB alignment, and needless to say that would make the drive incompatible with Unraid versions before 6.9.0.

    Essentially: back up, stop array, unassign, blkdiscard, assign back, start and format, restore backup. Besides backing up and restoring, the middle part of the process took 5 minutes.

     

    I expect LT to provide more detailed guidance on this, perhaps when 6.9.0 enters RC or at least when it becomes stable.

    Not that 6.9.0-beta isn't stable. I did see some bug reports, but I personally have only seen the virtio / virtio-net thingie, which was fixed by using the Q35-5.0 machine type (instead of 4.2). No need to use virtio-net, which negatively affects network performance.

     

     

     

    PS: I've been running iotop for 3 hours and am still averaging about 345 MB/hr. We'll see if my daily housekeeping affects it tonight.

    Thanks!
    The procedure "back up, stop array, unassign, blkdiscard, assign back, start and format, restore backup" is no problem and not new for me (except for blkdiscard), as I had to do it when my cache disks died because of those ugly unnecessary writes on the BTRFS cache pool...

    As said before, I'm tending towards changing my cache to XFS with a single disk and waiting for the stable 6.9.x release.

    Meanwhile I'll think about a new concept for managing my disks.

    This is my configuration at the moment:

    • Array of two disks with one parity (4+4+4TB WD red)
    • 1 btrfs cache pool (raid1) for cache, docker appdata, docker and folder redirection for my VMs (2 MX500 1 TB)
    • 1 UD for my VMs (1 SanDisk plus 480 GB)
    • 1 UD for Backup data (6 TB WD red)
    • 1 UD for nvr/cams (old 2 TB WD green)

    I still have two 1TB SSDs and one 480 GB SSD lying around here... I have to think about how I could use them with the new disk pools in 6.9.

    Link to comment

    I can confirm this problem on 6.8.3. I managed to minimize the amount of written data by disabling all non-critical logs on VMs and docker containers, but it is still 1-2 GB per hour even though barely any data is actually written.
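
    (If anyone wants to cap container log churn the same way, Docker's json-file log driver can be limited per container; the values below are just examples:)

        # Keep at most three 10 MB log files for this container
        docker run -d --log-opt max-size=10m --log-opt max-file=3 <image>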

     

    I will wait for 6.9.0 RC1 but I'd love to see a hotfix for 6.8.x as well.

    Link to comment

    When these changes hit the release candidate, will we be automatically prompted to re-create our cache pools if needed?  Or is this fix applied without needing to re-create the pool?  I'm a little confused.

    Link to comment
    3 hours ago, Alexstrasza said:

    Or is this fix applied without needing to re-create the pool?

    The revised mount option will be applied automatically (it already is in the latest beta) but if you want to change the partition alignment of your SSD devices or the filesystem type or the docker image format you'll have to do it manually, because not everyone needs to change anything. I expect some guidance will be provided in the release notes.
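
    (You can verify which mount options are actually in effect on your pool with, for example:)

        # Show the active mount options for the cache pool
        findmnt -no OPTIONS /mnt/cache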

    Link to comment
    On 8/17/2020 at 4:00 AM, John_M said:

    The revised mount option will be applied automatically (it already is in the latest beta) but if you want to change the partition alignment of your SSD devices or the filesystem type or the docker image format you'll have to do it manually, because not everyone needs to change anything. I expect some guidance will be provided in the release notes.

    Thanks for the information, that makes a lot more sense to me now.

    Link to comment
    On 8/11/2020 at 9:57 PM, testdasi said:

    Yep, 6.9.0 should bring improvement to your situation. But as I said, you need to wipe the drive in 6.9.0 to reformat it with the 1MiB alignment, and needless to say that would make the drive incompatible with Unraid versions before 6.9.0.

    Essentially: back up, stop array, unassign, blkdiscard, assign back, start and format, restore backup. Besides backing up and restoring, the middle part of the process took 5 minutes.

     

    I expect LT to provide more detailed guidance on this, perhaps when 6.9.0 enters RC or at least when it becomes stable.

    Not that 6.9.0-beta isn't stable. I did see some bug reports, but I personally have only seen the virtio / virtio-net thingie, which was fixed by using the Q35-5.0 machine type (instead of 4.2). No need to use virtio-net, which negatively affects network performance.

     

     

     

    PS: I've been running iotop for 3 hours and am still averaging about 345 MB/hr. We'll see if my daily housekeeping affects it tonight.

    Not that this is really the thread to address this, but anyway:

    I just tried yesterday with 2 newly created Q35-5.0 VMs (Windows 10) on beta25, and I still get "unexpected GSO type" flooding my logs when I use "virtio", so I don't see how using Q35-5.0 would be a solution.

    The only way I get rid of that in the logs is to use "virtio-net", with the severely diminished performance.

     

    Edit: 

     

    Just tried again, still the same results; attaching my diagnostics if you wish to see for yourself.

    unraid-diagnostics-20200821-0848.zip

    Edited by Koenig
    Link to comment
    1 hour ago, Koenig said:

    Not that this is really the thread to address this, but anyway:

    I just tried yesterday with 2 newly created Q35-5.0 VMs (Windows 10) on beta25, and I still get "unexpected GSO type" flooding my logs when I use "virtio", so I don't see how using Q35-5.0 would be a solution.

    The only way I get rid of that in the logs is to use "virtio-net", with the severely diminished performance.

     

    Edit: 

     

    Just tried again, still the same results; attaching my diagnostics if you wish to see for yourself.

    unraid-diagnostics-20200821-0848.zip 182.76 kB · 0 downloads

     

    I suggest you take it to the Beta25 thread:

     

    Link to comment
    2 hours ago, Koenig said:

    Just tried again, still the same results; attaching my diagnostics if you wish to see for yourself.

    unraid-diagnostics-20200821-0848.zip 182.76 kB · 0 downloads

    It looks like switching to 5.0 fixes it for some and not for others (it was suggested somewhere earlier in the topic).

     

    The officially guaranteed method is to switch to virtio-net (or set up a VLAN, or use separate NICs for docker and VMs).

    LT said the next release will allow users to pick between virtio and virtio-net, which I think is better than defaulting to virtio-net as beta25 does, since there are other ways to guarantee no errors.

    Edited by testdasi
    Link to comment

    Hi all,

     

    by chance I stumbled upon this problem and immediately changed my cache drive (Samsung 850 EVO 500GB) from BTRFS encrypted to XFS encrypted. Constant writes dropped from ~30 MB/s to ~700 KB/s, which of course is a great improvement and might just have ensured me a few more months of life for my SSD (although based on the SMART data stating 644631465019 LBAs written, it shouldn't even work anymore).
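
    (For context, that SMART figure works out as below; 512 bytes per LBA is the usual convention for Samsung's Total_LBAs_Written attribute:)

        # 644631465019 LBAs * 512 bytes/LBA ~= 330 TB written
        smartctl -A /dev/sdX | grep -i lbas_written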

     

    Now my questions: 

    The cache drive is XFS encrypted now, but the filesystem within the Docker image is still BTRFS. Is this because I set up the docker image while still using a BTRFS encrypted filesystem on my cache, or is BTRFS the standard docker image filesystem?

     

    As ~700 KB/s is still pretty high in my opinion (only loop writes, according to iotop), maybe I could further reduce writes by switching the filesystem within the docker image to XFS as well!? Or are there any functional reasons why the docker image uses BTRFS?

     

    If I were to recreate my docker image file on the XFS encrypted cache drive, would the filesystem within the docker image still be BTRFS?

    Is switching to 1MiB alignment with Unraid 6.9 also beneficial for XFS formatted drives, or does only BTRFS benefit from this?
     

    Thanks a lot for your input!

    Edited by Stiefmeister
    Link to comment

    Upgrading to the beta and reformatting should indeed help.

     

    Formatting the docker image as XFS has had mixed results: the writes are reduced, but not always enough to make the switch worth it.

    Link to comment
    2 hours ago, Stiefmeister said:

    Or are there any functional reasons why docker image uses btfrs?

    There used to be, but not for a while now.

     

    Probably the least amount of excess writes comes from using a docker folder instead of a docker image file, since that avoids any overhead involved in the loopback device. But you need the beta to utilize either an XFS-formatted image or a directory.

    • Thanks 1
    Link to comment





  • Status Definitions

     

    Open = Under consideration.

     

    Solved = The issue has been resolved.

     

    Solved version = The issue has been resolved in the indicated release version.

     

    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.

     

    Retest = Please retest in latest release.


    Priority Definitions

     

    Minor = Something not working correctly.

     

    Urgent = Server crash, data loss, or other showstopper.

     

    Annoyance = Doesn't affect functionality but should be fixed.

     

    Other = Announcement or other non-issue.