• [6.8.3] docker image huge amount of unnecessary writes on cache


    S1dney
    • Urgent

    Hey Guys,

     

First of all, I know that you're all very busy getting version 6.8 out there, something I'm very much waiting on as well. I'm seeing great progress, so thanks so much for that! I'm not expecting this to be at the top of the priority list, but I'm hoping someone on the development team is willing to invest some time (perhaps after the release).

     

    Hardware and software involved:

    2 x 1TB Samsung EVO 860, setup with LUKS encryption in BTRFS RAID1 pool.

     

    ###

TL;DR (but I'd suggest reading on anyway 😀)

The image file mounted as a loop device is causing massive writes on the cache, potentially wearing out SSDs quite rapidly.

    This appears to be only happening on encrypted caches formatted with BTRFS (maybe only in RAID1 setup, but not sure).

    Hosting the Docker files directory on /mnt/cache instead of using the loopdevice seems to fix this problem.

A possible idea for implementation is proposed at the bottom.

     

    Grateful for any help provided!

    ###

     

I originally posted a topic in the general support section (see link below), but I have since done a lot of research and think I have gathered enough evidence pointing to a bug. I was also able to build a (kind of) workaround for my situation. More details below.

     

So to see what was actually hammering on the cache, I started with the obvious: running a lot of find commands to trace files that were being written every few minutes, and using the File Activity plugin. Neither was able to trace down any writes that would explain 400 GB worth of writes a day for just a few containers that aren't even that active.
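For reference, that kind of tracing boils down to something like the following (the cache path is my setup's, and the time window is an example):

```shell
# Cumulative I/O per process/thread; let it run a few minutes,
# then look at the DISK WRITE column:
iotop -ao

# Files on the cache modified in the last 10 minutes:
find /mnt/cache -type f -mmin -10 2>/dev/null
```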

     

Digging further, I moved the docker.img to /mnt/cache/system/docker/docker.img, so directly on the BTRFS RAID1 mountpoint. I wanted to check whether the unRAID FS layer was causing the loop2 device to write this heavily. No luck either.

This gave me a situation I was able to reproduce in a virtual machine, so I started with a recent Debian install (I know, it's not Slackware, but I had to start somewhere ☺️). I created some vDisks, encrypted them with LUKS, bundled them in a BTRFS RAID1 setup, created the loop device on the BTRFS mountpoint (same as on /mnt/cache) and mounted it on /var/lib/docker. I made sure I had the NoCoW flag set on the IMG file, like unRAID does. Strangely this did not show any excessive writes; iotop showed really healthy values for the same workload (I migrated the docker content over to the VM).
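For anyone wanting to reproduce the Debian-side setup, a rough sketch (device names, sizes and the /mnt/pool mountpoint are examples, not the exact ones I used):

```shell
# Encrypt two virtual disks with LUKS and open them
cryptsetup luksFormat /dev/vdb
cryptsetup luksFormat /dev/vdc
cryptsetup open /dev/vdb crypt1
cryptsetup open /dev/vdc crypt2

# BTRFS RAID1 across both mapped devices, mounted like the cache
mkfs.btrfs -d raid1 -m raid1 /dev/mapper/crypt1 /dev/mapper/crypt2
mount /dev/mapper/crypt1 /mnt/pool

# Image file with the NoCoW flag set before it contains data
# (chattr +C only takes effect on empty files), then loop-mounted
touch /mnt/pool/docker.img
chattr +C /mnt/pool/docker.img
truncate -s 20G /mnt/pool/docker.img
mkfs.btrfs /mnt/pool/docker.img
mount -o loop /mnt/pool/docker.img /var/lib/docker
```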

     

After my Debian troubleshooting I went back to the unRAID server, wondering whether the loop device was being created weirdly, so I took the exact same steps to create a new image and pointed the settings from the GUI there. Still the same write issues.

     

    Finally I decided to put the whole image out of the equation and took the following steps:

    - Stopped docker from the WebGUI so unRAID would properly unmount the loop device.

    - Modified /etc/rc.d/rc.docker to not check whether /var/lib/docker was a mountpoint

    - Created a share on the cache for the docker files

    - Created a softlink from /mnt/cache/docker to /var/lib/docker

- Started docker using "/etc/rc.d/rc.docker start"

- Started my Bitwarden containers.
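In shell terms, the steps above amount to something like this (after patching /etc/rc.d/rc.docker to skip its mountpoint check; the share name is mine):

```shell
/etc/rc.d/rc.docker stop            # let unRAID unmount the loop device cleanly
mkdir -p /mnt/cache/docker          # share created on the cache
ln -s /mnt/cache/docker /var/lib/docker
/etc/rc.d/rc.docker start
```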

     

Looking into the stats with "iotop -ao", I did not see any excessive writing taking place anymore.

I had the containers running for about 3 hours and got maybe 1 GB of writes total (note that with the loop device this was 2.5 GB every 10 minutes!).

     

Now don't get me wrong, I understand why the loop device was implemented. Dockerd is started with options that make it run with the BTRFS storage driver, and since the image file is formatted with the BTRFS filesystem this works on every setup; it doesn't matter whether the cache runs XFS, EXT4 or BTRFS, it will just work. In my case I had to point the softlink to /mnt/cache, because pointing it to /mnt/user would not let me use the BTRFS driver (obviously the unRAID user filesystem isn't BTRFS). Also, the WebGUI has commands to scrub the filesystem inside the image; everything is based on the assumption that everyone is running docker on BTRFS (which of course they are, because of the image 😁).
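One quick way to confirm dockerd really picked up the BTRFS storage driver after a change like this:

```shell
# Prints the active storage driver, e.g. "btrfs" when
# /var/lib/docker sits on a BTRFS filesystem
docker info --format '{{.Driver}}'
```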

I must say that my approach also broke when I changed something in the shares: certain services get restarted, causing docker to be turned off for some reason. No big issue, since it wasn't meant to be a long-term solution, just a way to see whether the loop device was causing the issue, which I think my tests did point out.

     

Now I'm at the point where I definitely need some developer help. I'm currently keeping nearly all docker containers off all day, because 300-400 GB worth of writes a day is just a BIG waste of expensive flash storage, especially since I've shown that it isn't needed at all. It does defeat the purpose of my NAS and SSD cache though, since its main purpose was hosting docker containers while allowing the HDs to spin down.

     

Again, I'm hoping someone on the dev team acknowledges this problem and is willing to investigate. I got quite a few hits on the forums and Reddit, but no one had actually pointed out the root cause of the issue.

     

I'm missing the technical know-how to troubleshoot the loop device issues at a lower level, but I have been thinking of possible ways to implement a workaround, like adjusting the Docker settings page with an option to switch off the use of a vDisk: if all requirements are met (pointing to /mnt/cache and BTRFS formatted), start docker on a share on the /mnt/cache partition instead of using the vDisk.

This way you would still keep all the advantages of the docker.img file (works across filesystem types), and users who don't care about the writes could still use it, but you'd be massively helping out others who are concerned about them.

     

I'm not attaching diagnostics files since they would probably not show anything relevant.

Also, if this should have been in feature requests, I'm sorry, but I feel that, since the current solution is misbehaving in terms of writes, it can also be placed in the bug report section.

     

Thanks for this great product though, I have been using it with a lot of joy so far!

I'm just hoping we can solve this one so I can keep all my dockers running without the cache wearing out quickly.

     

    Cheers!

     



    User Feedback

    Recommended Comments



    13 minutes ago, Dephcon said:

    So that's a 2.5x difference.

    But negligible given absolute amount of data written.

     

    A loopback is always going to incur more overhead because there is the overhead of the file system within the loopback and then there is the overhead of the file system hosting the loopback.  In most cases the benefit of the loopback far outweighs the extra overhead.

    23 minutes ago, limetech said:

    But negligible given absolute amount of data written.

     

    A loopback is always going to incur more overhead because there is the overhead of the file system within the loopback and then there is the overhead of the file system hosting the loopback.  In most cases the benefit of the loopback far outweighs the extra overhead.

In this case, yes; however, I purposely removed some of my higher-IO loads from this test to limit the variability of writes so I could have shorter test periods. This test is purely container appdata; excluded are:

     

    • transcoding
    • download/extract
    • folder caching
    • array backup staging

     

In @johnnie.black's case a huge amount of SSD wear can be avoided, which is on the opposite end of the spectrum from my test case. I still might end up using BTRFS RAID for one or more pool devices; I just wanted to provide a reasonably solid number that other users could apply to their own loads, to decide for themselves whether X times fewer writes is worth switching to XFS.

     

    Either way it was fun to investigate!

    Edited by Dephcon

    I think the beta allows the loopback image to be formatted as XFS?

     

    Might be interesting to test an XFS docker image on a btrfs cache to see the difference.


    Just wanted to circle back to this now that my testing is over and I've finalized my caching config (for now).

     

Previously I was using a 4-SSD BTRFS RAID10 "cache" with 4K-aligned partitions.

     

Now I have a 2-SSD BTRFS RAID1 pool, 1MiB-aligned, for array cache and docker-xfs.img, plus an XFS-formatted pool device to use as scratch space. Currently this includes Plex transcoding and the duplicacy cache. I might move my Usenet download/extract over to this scratch pool as well, but I want to get extended performance data before changing anything further.

     

I'm pretty happy with the reduction in IO from space_cache=v2 and 1MiB partition alignment. All-XFS would have been "better" for disk longevity, but I really like the extra level of protection from BTRFS RAID.

     

last 48hrs: (iotop screenshot attached)

    Edited by Dephcon

I did my own testing as well; still waiting on final overnight numbers for the setup I think I have settled on, but overall it seems the lowest I could get writes on a BTRFS RAID5 cache pool was ~1.5GB/hour or so, give or take depending on what was running at the time (I have more dockers now than when I did my prior testing). It seems that total writes to the pool scale somewhat with how many drives are in the pool.

     

Does the 1MiB partition alignment also affect XFS partitions? My writes on the XFS cache seem lower even with more dockers.

     

When tested with fewer drives the writes to the pool were less; with more drives it would get upwards of 2GB/hour+ total to the pool (I had up to 9 drives in the pool during testing).

     

Versus using an XFS cache pool, where I am consistently getting ~200MB/hour writes or less, and the btrfs pool gets zero writes unless something is actively using it.

     

Overall I think I have settled on moving my docker/appdata and a few other shares over to the XFS drive (all things that could easily be recreated from the regular backups made to the array if it died), then using the BTRFS cache for normal caching duties.

    Edited by TexasUnraid
    On 8/4/2020 at 3:41 PM, Dephcon said:

    currently this includes plex transcoding and duplicacy cache. 

     

Shift your Plex transcoding to memory by putting it in /tmp! I did this not long ago: create a RAM disk in /tmp at boot with 4 GB of space and let Plex use it for transcoding. You can allocate more if you want. There is a post on the forum about it somewhere. Great for reducing wear and tear on the disk.
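A minimal version of that RAM disk (size and path are examples; on unRAID this sort of thing typically goes in /boot/config/go so it runs at boot):

```shell
# Create a tmpfs-backed directory for Plex transcoding
mkdir -p /tmp/plex-transcode
mount -t tmpfs -o size=4g tmpfs /tmp/plex-transcode
```

Then point the container's transcode directory at /tmp/plex-transcode.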

    1 minute ago, nickp85 said:

Shift your Plex transcoding to memory by putting it in /tmp! I did this not long ago: create a RAM disk in /tmp at boot with 4 GB of space and let Plex use it for transcoding. You can allocate more if you want. There is a post on the forum about it somewhere. Great for reducing wear and tear on the disk.

The only caveat is that if you use Plex to record TV, it uses the temp directory while recording. I discovered this myself when recordings were failing due to running out of space. It does this to produce a streamable version of the recording, in case you want to watch it while it's being recorded.

    2 hours ago, nickp85 said:

Shift your Plex transcoding to memory by putting it in /tmp! I did this not long ago: create a RAM disk in /tmp at boot with 4 GB of space and let Plex use it for transcoding. You can allocate more if you want. There is a post on the forum about it somewhere. Great for reducing wear and tear on the disk.

I used to do it in RAM when I had 32GB; when I upgraded I only had 16GB of DDR4 available, so it's a bit tight now.


Damn! My server seems to be affected too...
I had an unencrypted BTRFS RAID1 with two SanDisk Plus 480 GB drives.
Both died in quick succession (more or less 2 weeks apart) after 2 years of use!

So I bought two 1 TB Crucial MX500s.
As I didn't know about the problem, I again made an unencrypted BTRFS RAID1 (01 July 2020).
As I found it strange that they died in quick succession, I did some research and found all those threads about massive writes on BTRFS cache disks.
I made some tests and here are the results.

     

    ### Test 1:

     

running "iotop -ao" for 60 min: 2.54 GB [loop2] (see pic1, attached)

     

    Docker Container running:

The docker containers running during this test are the most important ones for me.
I stopped Pydio and mariadb even though they're also important; see the other tests for the reason...

      - ts-dnsserver
      - letsencrypt
      - BitwardenRS
      - Deconz
      - MQTT
      - MotionEye
      - Homeassistant
      - Duplicacy

     

    shfs writes:

- Look at pic1: are the shfs writes OK? I don't know...

     

    VMs running (all on Unassigned disk):
      - Linux Mint (my primary Client)
      - Win10
      - Debian with SOGo Mail Server

     

    /usr/sbin/smartctl -A /dev/sdg | awk '$0~/LBAs/{ printf "TBW %.1f\n", $10 * 512 / 1024^4 }' => TBW 10.9
    /usr/sbin/smartctl -A /dev/sdh | awk '$0~/LBAs/{ printf "TBW %.1f\n", $10 * 512 / 1024^4 }' => TBW 10.9
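The one-liner converts the SMART raw LBA counter (field 10 of the `smartctl -A` table, assuming 512-byte LBAs) into TiB written. The awk arithmetic can be checked in isolation against a fabricated sample line (the raw value below is made up to land near my reading):

```shell
echo '241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 23438675968' \
  | awk '$0~/LBAs/{ printf "TBW %.1f\n", $10 * 512 / 1024^4 }'
# -> TBW 10.9
```

Taking the difference between two readings an hour apart gives the actual writes hitting the flash during a test, independent of what iotop attributes to loop2.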



    ### Test 2:


running "iotop -ao" for 60 min: 3.29 GB [loop2] (see pic2, attached)

     

    Docker Container running (almost all of my dockers):
      - ts-dnsserver
      - letsencrypt
      - BitwardenRS
      - Deconz
      - MQTT
      - MotionEye
      - Homeassistant
      - Duplicacy
      ----------------
      - mariadb
      - Appdeamon
      - Xeoma
      - NodeRed-OfficialDocker
      - hacc
      - binhex-emby
      - embystat
      - pydio
      - picapport
      - portainer

     

    shfs writes:

- Look at pic2: there are massive shfs writes too!

     

    VMs running (all on Unassigned disk)
      - Linux Mint (my primary Client)
      - Win10
      - Debian with SOGo Mail Server

     

    /usr/sbin/smartctl -A /dev/sdg | awk '$0~/LBAs/{ printf "TBW %.1f\n", $10 * 512 / 1024^4 }' => TBW 11 
    /usr/sbin/smartctl -A /dev/sdh | awk '$0~/LBAs/{ printf "TBW %.1f\n", $10 * 512 / 1024^4 }' => TBW 11 

     

     

    ### Test 3:


running "iotop -ao" for 60 min: 3.04 GB [loop2] (see pic3, attached)

     

    Docker Container running (almost all my dockers except mariadb/pydio!):
      - ts-dnsserver
      - letsencrypt
      - BitwardenRS
      - Deconz
      - MQTT
      - MotionEye
      - Homeassistant
      - Duplicacy
      ----------------
      - Appdeamon
      - Xeoma
      - NodeRed-OfficialDocker
      - hacc
      - binhex-emby
      - embystat
      - picapport
      - portainer

     

    shfs writes:

      - Look at pic3, the shfs writes are clearly less without mariadb!
        (I also stopped pydio as it needs mariadb...)

     

    VMs running (all on Unassigned disk)
      - Linux Mint (my primary Client)
      - Win10
      - Debian with SOGo Mail Server

     

    /usr/sbin/smartctl -A /dev/sdg | awk '$0~/LBAs/{ printf "TBW %.1f\n", $10 * 512 / 1024^4 }' => TBW 11
    /usr/sbin/smartctl -A /dev/sdh | awk '$0~/LBAs/{ printf "TBW %.1f\n", $10 * 512 / 1024^4 }' => TBW 11


     

    ### Test 4:


running "iotop -ao" for 60 min: 6.23 MB [loop2] (see pic4, attached)

     

    Docker Container running:

      - none, but docker service is started

     

    shfs writes:

      - none

     

    VMs running (all on Unassigned disk)
      - Linux Mint (my primary Client)
      - Win10
      - Debian with SOGo Mail Server

     

/usr/sbin/smartctl -A /dev/sdg | awk '$0~/LBAs/{ printf "TBW %.1f\n", $10 * 512 / 1024^4 }'

PLEASE resolve this problem in the next stable release!

Next weekend I will remove the BTRFS RAID1 cache and go with a single XFS cache disk.
     

If I can do more analysis and research, please let me know. I'll do my best!

    Edited by vakilando

Perhaps I should mention that I had my VMs on the cache pool before, but the performance was terrible.

Since moving them to an unassigned disk, their performance is really fine!

Perhaps the poor performance was due to the massive writes on the cache pool...?


Oh... sorry... I did not read the whole thread.

Now I have!

I'll try the fix now and do this:

    mount -o remount -o space_cache=v2 /mnt/cache
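To verify the remount actually took effect, one way is to pull the mount options apart (findmnt here is an assumption about what's available; grepping /proc/mounts works too):

```shell
# List the btrfs mount options one per line and pick out space_cache
findmnt -no OPTIONS /mnt/cache | tr ',' '\n' | grep space_cache
# should show: space_cache=v2
```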

     


    The 6.9 beta fixes all the issues with BTRFS and just leaves the inherent BTRFS write amplification.

     

    In my case I decided to move appdata and docker to an XFS cache pool on 6.9 and leave everything else on the BTRFS pool.

     

This dropped writes down to ~200MB an hour vs 2GB, and should the XFS drive die, I can easily rebuild it from backups / by re-downloading the dockers.

    5 minutes ago, TexasUnraid said:

    The 6.9 beta fixes all the issues with BTRFS and just leaves the inherent BTRFS write amplification.

(Unless I've misread) the BTRFS issue persisting just because it's BTRFS-on-BTRFS is a shame, but for cache redundancy it's kind of a necessity.


I would not say btrfs is a shame; in fact I am really liking it overall. The only issue I have had is the write amplification, which is known and kind of a perfect storm with docker/appdata (a lot of small writes, most to an image file).

     

    During normal cache use I see negligible write amplification and have had no issues. Since switching docker to XFS everything is working great just using btrfs for cache / scratch drive.


    The XFS docker conversion is interesting.

Has anyone done any in-depth comparisons on speed, writes, etc.?


    I was referring to moving the btrfs docker image to an XFS formatted cache pool.

     

    In 6.9 beta you can have multiple cache pools.

    2 minutes ago, TexasUnraid said:

    I was referring to moving the btrfs docker image to an XFS formatted cache pool.

     

    In 6.9 beta you can have multiple cache pools.

    I know ;-)

But I'm asking if anyone has done some proper testing of it, compared to BTRFS.


Yes, I did, and so did others. I was sitting at ~2GB/hour with everything on the BTRFS cache pool (although the actual writes would be much higher with more devices in the pool).

     

Moving docker and appdata to the XFS cache, I am seeing ~200-250MB/hour on the XFS pool and basically zero to the BTRFS pool unless I do something.


Oh, someone posted some results a page or two back for that. I tried it, but the writes were still quite a bit higher than with the XFS option, and the writes were limited to a single SSD I don't care about instead of being spread out over all of them.


OK, after executing the recommended command:

    mount -o remount -o space_cache=v2 /mnt/cache

this is the result after 7 hours of

    iotop -ao

The running dockers were the same as in my "Test 2" (all my dockers, including mariadb and pydio).

     

See the picture (pic5, attached).

     

It's better than before (fewer writes for loop2 and shfs), but it should be even less, or what do you think?

    11 hours ago, TexasUnraid said:

    In my case I decided to move appdata and docker to an XFS cache pool on 6.9 and leave everything else on the BTRFS pool.

Only the docker image would need to be on the XFS cache; appdata isn't subjected to the same loop2 overhead. Which is great, because the docker image doesn't really need protection.

     

    5 hours ago, vakilando said:

    It's better than before (less writes for loop2 and shfs) but it should be even less or what do you think?

Are you using 6.9.0? Did you also align the partition to 1MiB? That requires wiping the pool, so I would assume quite few people would do it.

    Edited by testdasi

    This is my quick test.

    • Unraid 6.9.0-beta25
    • 2x Intel 750 1.2TB
    • BTRFS RAID-0 for data chunks, RAID-1 for metadata + system chunks
    • Both partitions aligned to 1MiB
    • 35 dockers running but mostly idle
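For anyone wanting to check their own alignment, a sketch (replace /dev/sdX with your device; a start sector of 2048 on 512-byte sectors equals 1MiB):

```shell
# Look at the Start column of the partition table
fdisk -l /dev/sdX

# Or have parted judge partition 1 directly
parted /dev/sdX align-check optimal 1
```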

     

    403.41 MB / 70 minutes or 345.78 MB/hr.

About 100MB/hr worse than @TexasUnraid's XFS image, but only about 1/3 of @vakilando's test.

     

    Maybe I'll do an overnight run or something to see if there's any diff.

     

(iotop screenshot attached)

    3 hours ago, testdasi said:

Only the docker image would need to be on the XFS cache; appdata isn't subjected to the same loop2 overhead. Which is great, because the docker image doesn't really need protection.

     

Are you using 6.9.0? Did you also align the partition to 1MiB? That requires wiping the pool, so I would assume quite few people would do it.

    If you go back a ways in this thread, you will find a few pages of me testing every possible scenario.

     

While the docker image is the main culprit for sure, appdata was not far behind. With just appdata on the BTRFS pool I was still seeing around 800MB/hour IIRC, vs ~200MB/hour combined with both on the XFS pool.

     

The issue is that small writes have a very large write amplification on btrfs, and appdata sees a lot of these small writes as well (logs, etc.).

     

This write amplification seems to go up in proportion to the number of drives in the pool as well (the small writes get spread over the drives), so total writes for a 5-drive pool were much higher than for a 2-drive pool. With a 9-device pool, at one point I was seeing 1GB/hour PER DRIVE.

     

I was able to reduce the writes a fair amount by increasing the dirty writeback interval to around 4 minutes, but that is not a practical solution.
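For reference, the tweak mentioned is the kernel's dirty-page flusher settings; something like this (24000 centisecs = 4 minutes; the defaults are much lower, the change is not persistent across reboots, and a crash can lose up to that much buffered data):

```shell
# Wake the flusher threads every 4 minutes instead of every 5 seconds
sysctl -w vm.dirty_writeback_centisecs=24000
# Let dirty pages age up to 4 minutes before they must be written out
sysctl -w vm.dirty_expire_centisecs=24000
```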


    Here's my test with 1 hour of iotop -ao

    • Standard 6.8.3 with no customization
    • 2x 500GB WD Blue SSD

    • BTRFS RAID-1 default

    • Cache, docker.img and libvirt.img with a dozen dockers and 2 VMs, mostly idle

     

    I'd like to know what's responsible for the shfs /mnt/user -disks 7 entries

(iotop screenshot attached)

     





