• [6.8.3] docker image huge amount of unnecessary writes on cache


    S1dney
    • Urgent

    Hey Guys,

     

    First of all, I know that you're all very busy getting version 6.8 out there, something I'm very much waiting on as well. I'm seeing great progress, so thanks so much for that! I don't expect this to be at the top of the priority list, but I'm hoping someone on the developer team is willing to invest time in it (perhaps after the release).

     

    Hardware and software involved:

    2 x 1TB Samsung EVO 860, set up with LUKS encryption in a BTRFS RAID1 pool.

     

    ###

    TLDR (but I'd suggest to read on anyway 😀)

    The image file mounted as a loop device is causing massive writes on the cache, potentially wearing out SSDs quite rapidly.

    This appears to happen only on encrypted caches formatted with BTRFS (maybe only in a RAID1 setup, but I'm not sure).

    Hosting the Docker files directory on /mnt/cache instead of using the loop device seems to fix this problem.

    A possible idea for implementation is proposed at the bottom.

     

    Grateful for any help provided!

    ###

     

    I have written a topic in the general support section (see link below), but I have done a lot of research lately and think I have gathered enough evidence pointing to a bug. I was also able to build a workaround of sorts for my situation. More details below.

     

    So to see what was actually hammering on the cache I started doing all the obvious things, like using a lot of find commands to trace files that were written to every few minutes, and I also used the fileactivity plugin. Neither was able to trace down any writes that would explain 400 GB worth of writes a day for just a few containers that aren't even that active.

     

    Digging further I moved the docker.img to /mnt/cache/system/docker/docker.img, so directly on the BTRFS RAID1 mountpoint. I wanted to check whether the unRAID FS layer was causing the loop2 device to write this heavily. No luck either.

    This gave me a setup I could replicate on a virtual machine though, so I started with a recent Debian install (I know, it's not Slackware, but I had to start somewhere ☺️). I created some vDisks, encrypted them with LUKS, bundled them in a BTRFS RAID1 setup, created the loop device on the BTRFS mountpoint (same as /dev/cache) and mounted it on /var/lib/docker. I made sure I had the NoCOW flag set on the IMG file like unRAID does. Strangely, this did not show any excessive writes; iotop showed really healthy values for the same workload (I migrated the docker content over to the VM).

     

    After my Debian troubleshooting I went back to the unRAID server, wondering whether the loop device was being created in some odd way, so I took the exact same steps to create a new image and pointed the settings in the GUI there. Still the same write issues.

     

    Finally I decided to put the whole image out of the equation and took the following steps:

    - Stopped docker from the WebGUI so unRAID would properly unmount the loop device.

    - Modified /etc/rc.d/rc.docker to not check whether /var/lib/docker was a mountpoint

    - Created a share on the cache for the docker files

    - Created a softlink from /mnt/cache/docker to /var/lib/docker

    - Started docker using "/etc/rc.d/rc.docker start"

    - Started my Bitwarden containers.
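
    The symlink step above could be sketched roughly as follows. This is only an illustration: link_docker_root is a hypothetical helper, not something that ships with unRAID, and it assumes the share directory already exists and Docker has been stopped from the WebGUI.

```shell
#!/bin/bash
# Sketch only: link the Docker data root to a directory on the cache pool.
# link_docker_root is a hypothetical helper, not part of unRAID.
link_docker_root() {
    local target="$1"   # e.g. /mnt/cache/docker
    local link="$2"     # e.g. /var/lib/docker

    # Refuse to continue if the target directory is missing.
    [ -d "$target" ] || { echo "missing target: $target" >&2; return 1; }

    # Only an existing symlink or an empty directory may be replaced.
    if [ -e "$link" ] && [ ! -L "$link" ]; then
        rmdir "$link" 2>/dev/null || { echo "refusing to replace $link" >&2; return 1; }
    fi

    ln -sfn "$target" "$link"
}

# Usage on the server (Docker stopped from the WebGUI first):
#   link_docker_root /mnt/cache/docker /var/lib/docker
#   /etc/rc.d/rc.docker start
```

    The helper refuses to touch a non-empty /var/lib/docker, so it will not hide a loop device that is still mounted there.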

     

    Looking into the stats with "iotop -ao" I did not see any excessive writing taking place anymore.

    I had the containers running for about 3 hours and got maybe 1GB of writes in total (note that on the loop device this gave me 2.5GB every 10 minutes!)
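
    To put those two measurements on the same scale, here is a small sketch that extrapolates each to a full day. The helper name per_day_gb is made up for this illustration, and the extrapolation assumes the write rate stays constant.

```shell
#!/bin/bash
# Extrapolate a measured write volume to a full day (1440 minutes).
# per_day_gb is a hypothetical helper name used only for this illustration.
per_day_gb() {
    local gb="$1" minutes="$2"
    awk -v gb="$gb" -v min="$minutes" 'BEGIN { printf "%.1f\n", gb * 1440 / min }'
}

per_day_gb 2.5 10    # loop device: 2.5 GB per 10 minutes -> 360.0
per_day_gb 1 180     # directory on /mnt/cache: about 1 GB in 3 hours -> 8.0
```

    The 360 GB per day for the loop device lines up with the 300 to 400 GB per day seen on the SSDs.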

     

    Now don't get me wrong, I understand why the loop device was implemented. Dockerd is started with options to make it run with the BTRFS driver, and since the image file is formatted with the BTRFS filesystem this works on every setup: it doesn't matter whether the image sits on XFS, EXT4 or BTRFS, it will just work. In my case I had to point the softlink to /mnt/cache because pointing it to /mnt/user would not allow me to start Docker using the BTRFS driver (obviously the unRAID filesystem isn't BTRFS). Also, the WebGUI has commands to scrub the filesystem inside the container; all of this is based on the assumption that everyone is using docker on BTRFS (which of course they are because of the container 😁)

    I must say that my approach also broke when I changed something in the shares: certain services get restarted, causing docker to be turned off for some reason. No big issue since it wasn't meant to be a long-term solution, just a way to see whether the loop device was causing the issue, which I think my tests did show.

     

    Now I'm at the point where I definitely need some developer help. I'm currently keeping nearly all docker containers off all day because 300 to 400 GB worth of writes a day is just a BIG waste of expensive flash storage, especially since I've shown that it's not needed at all. It does defeat the purpose of my NAS and SSD cache though, since its main purpose was hosting docker containers while allowing the HDs to spin down.

     

    Again, I'm hoping someone on the dev team acknowledges this problem and is willing to invest time in it. I did get quite a few hits on the forums and Reddit, but none of them actually pointed out the root cause of the issue.

     

    I'm missing the technical know-how to troubleshoot the loop device issues at a lower level, but I have been thinking about possible ways to implement a workaround. For example, the Docker Settings page could offer a switch to turn off the vDisk and, if all requirements are met (path on /mnt/cache and BTRFS formatted), start docker from a share on the /mnt/cache partition instead.

    This way you would still keep all advantages of the docker.img file (it works across filesystem types) and users who don't care about writes could still use it, but you'd be massively helping out others who are concerned about these writes.

     

    I'm not attaching diagnostic files since they would probably not show what is needed.

    Also, if this should have been in feature requests, I'm sorry. But I feel that, since the solution is misbehaving in terms of writes, this could also be placed in the bug report section.

     

    Thanks though for this great product, I have been using it with a lot of joy so far!

    I'm just hoping we can solve this one so I can keep all my dockers running without the cache wearing out quickly.

     

    Cheers!

     



    User Feedback

    Recommended Comments



    23 hours ago, limetech said:

    Added several options for dealing with this issue in 6.9.0-beta24.

    Such excellent news! Really looking forward to upgrading!

     

    Thank you so much!

    On 7/7/2020 at 10:05 PM, limetech said:

    Added several options for dealing with this issue in 6.9.0-beta24.

    I read through the beta 24 release notes.  Is the option to help deal with this the one that involves editing config/docker.cfg and keeping the docker image in a directory rather than on a loopback device?  Just trying to see what I need to implement from the beta in order to deal with the excessive writes.

    8 minutes ago, mmag05 said:

    I read through the beta 24 release notes.  Is the option to help deal with this the one that involves editing config/docker.cfg and keeping the docker image in a directory rather than on a loopback device?  Just trying to see what I need to implement from the beta in order to deal with the excessive writes.

    Yes but I suggest you wait for beta25 which corrects a few bugs with this.

    22 minutes ago, mmag05 said:

    I read through the beta 24 release notes.  Is the option to help deal with this the one that involves editing config/docker.cfg and keeping the docker image in a directory rather than on a loopback device?  Just trying to see what I need to implement from the beta in order to deal with the excessive writes.

     

    You can try just converting the image to xfs to see if it helps before going with the folder method.

    1. Back up your /boot/config/docker.cfg
    2. Check your /boot/config/docker.cfg and remove this line if present:

      DOCKER_OPTS="--storage-driver=btrfs"

    3. In the GUI, change Enable Docker to No + Apply to turn off docker
    4. In the Docker vdisk location box, change docker.img to docker-xfs.img + Apply
    5. Change Enable Docker to Yes + Apply

     

    If the xfs image helps then at least you can readily resolve the issue without needing to wait for official GUI support for the folder method.
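
    Steps 1 and 2 could be scripted along these lines. Treat it as a sketch: strip_btrfs_driver and the .bak suffix are names made up for this example, and it only removes the line when it matches the exact text shown in step 2.

```shell
#!/bin/bash
# Sketch of steps 1 and 2: back up docker.cfg, then drop the btrfs driver line.
# strip_btrfs_driver and the .bak suffix are assumptions made for this example.
strip_btrfs_driver() {
    local cfg="${1:-/boot/config/docker.cfg}"

    [ -f "$cfg" ] || { echo "not found: $cfg" >&2; return 1; }

    # Step 1: keep a copy next to the original.
    cp "$cfg" "$cfg.bak" || return 1

    # Step 2: delete only the exact storage-driver line, if present.
    sed -i '/^DOCKER_OPTS="--storage-driver=btrfs"[[:space:]]*$/d' "$cfg"
}

# Usage (steps 3 to 5 are still done in the GUI):
#   strip_btrfs_driver /boot/config/docker.cfg
```

    A DOCKER_OPTS line that carries other options is left untouched on purpose, so it would need to be edited by hand.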

     

     


    I'm curious to know if these "workarounds" in the beta releases are true bugfixes. There's a difference between, "We identified the bug and have fixed it," and "We have not been able to identify the bug, but if we do these non-standard things with dockers/filesystems/etc., things seem to get a little better?"  Sort of just throwing spaghetti at the wall and seeing what sticks. I'm definitely more interested in a true bugfix than some sort of poorly-defined workaround that just appears to make things a little better while the source of the problem remains unknown.

    16 minutes ago, grigsby said:

    I'm curious to know if these "workarounds" in the beta releases are true bugfixes. There's a difference between, "We identified the bug and have fixed it," and "We have not been able to identify the bug, but if we do these non-standard things with dockers/filesystems/etc., things seem to get a little better?"  Sort of just throwing spaghetti at the wall and seeing what sticks. I'm definitely more interested in a true bugfix than some sort of poorly-defined workaround that just appears to make things a little better while the source of the problem remains unknown.

    I assure you that is not the case and not the case with any bug fixes or workarounds in Unraid OS.

    On 6/27/2020 at 3:16 PM, limetech said:

    mount -o remount -o space_cache=v2 /mnt/cache

    This made a massive difference on my RAID-10 cache. What is it doing to make the improvement?

     

    Screenshot from 2020-07-13 15-47-38.png

    Edited by Dephcon
    17 minutes ago, Dephcon said:

    what is it doing to make the improvement?

    Magic.  (actually it's an improved algorithm for maintaining data structures keeping track of free space)

     

    Thanks to @johnnie.black for pointing out this improvement.
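
    To confirm that the option is actually active after mounting, the mount table can be checked. A minimal sketch, assuming the usual /proc/mounts layout; has_space_cache_v2 is a hypothetical helper, and its second argument exists only so it can be pointed at a different file.

```shell
#!/bin/bash
# Report whether a mount point currently lists space_cache=v2 in its options.
# has_space_cache_v2 is a hypothetical helper; the second argument exists only
# so the function can be pointed at a file other than /proc/mounts.
has_space_cache_v2() {
    local mnt="$1" table="${2:-/proc/mounts}"
    # /proc/mounts fields: device, mount point, fs type, options, dump, pass.
    awk -v m="$mnt" '
        $2 == m && $3 == "btrfs" {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "space_cache=v2") found = 1
        }
        END { exit found ? 0 : 1 }
    ' "$table"
}

# Usage:
#   has_space_cache_v2 /mnt/cache && echo "v2 active" || echo "v2 not active"
```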

    25 minutes ago, Dephcon said:

    This made a massive difference on my RAID-10 cache. What is it doing to make the improvement?

     

    Screenshot from 2020-07-13 15-47-38.png

     

    @Dephcon what tool did you use for measuring this? I have only used text tools until now.


    Netdata is great for tracking write speed like that, although it looks like he is using something else.

    Netdata is a must IMHO for a server like this. It has helped me track down several issues already, plus it is nice to be able to see exactly what is happening.

    For tracking total writes, though, the best option is to use the LBAs written SMART metric. There was a script posted earlier that automated logging this over time using user scripts.

    Edited by TexasUnraid
    10 minutes ago, TexasUnraid said:

    Netdata is great for tracking write speed like that, although it looks like he is using something else.

    Netdata is a must IMHO for a server like this. It has helped me track down several issues already, plus it is nice to be able to see exactly what is happening.

    For tracking total writes, though, the best option is to use the LBAs written SMART metric. There was a script posted earlier that automated logging this over time using user scripts.

    I'm using Netdata and tracking total LBAs via SMART.  I write down the number once a day (sometimes more often, sometimes less, depending on free time) and I wrote a nice calculation in Excel that tracks the daily GB written, but the graph posted above looks very detailed and may help to check the influence of different settings quickly.
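
    The same daily calculation can be done in the shell instead of a spreadsheet. A rough sketch, assuming each LBA unit is 512 bytes (common, but drive dependent); lba_delta_gb is a made-up helper name and the readings in the usage line are made-up numbers.

```shell
#!/bin/bash
# Convert the difference between two Total_LBAs_Written readings into GiB.
# Assumes each LBA unit is 512 bytes; some drives report this attribute in
# other units. lba_delta_gb is a hypothetical helper name.
lba_delta_gb() {
    local earlier="$1" later="$2"
    awk -v a="$earlier" -v b="$later" 'BEGIN { printf "%.2f\n", (b - a) * 512 / 1024^3 }'
}

# Usage with two made-up readings taken a day apart:
lba_delta_gb 1000000000 1419430400    # -> 200.00
```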

    6 minutes ago, thecode said:

    I'm using Netdata and tracking total LBAs via SMART.  I write down the number once a day (sometimes more often, sometimes less, depending on free time) and I wrote a nice calculation in Excel that tracks the daily GB written, but the graph posted above looks very detailed and may help to check the influence of different settings quickly.

    Here is a script that someone else made; I tweaked a few things to make it easier to use. It outputs to a Temp share right now. You can update that as desired, it is the last line.

    #!/bin/bash
    #description=Basic script to display the amount of data written to SSD on drives that support this. Set "argumentDefault" to the drive you want if you will schedule this.
    #argumentDescription= Set drive you want to see here
    #argumentDefault=sdc

    ### replace sd? above with label of drive you want TBW calculated for  ###

    device=/dev/"$1"

    # Read the SMART attribute table once and reuse it below.
    smart=$(/usr/sbin/smartctl -A "$device")

    # Print power-on time, total writes, mean writes per hour and wear level.
    # Assumes Total_LBAs_Written counts 512-byte units (drive dependent).
    echo "$smart" | awk '
    $0 ~ /Power_On_Hours/ { poh=$10; printf "%s / %d hours / %d days / %.2f years\n",  $2, $10, $10 / 24, $10 / 24 / 365.25 }
    $0 ~ /Total_LBAs_Written/ {
       bytes=$10 * 512;
       gb= bytes / 1024^3;
       tb= bytes / 1024^4;
       printf "%s / %.2f gb / %.2f tb\n", $2, gb, tb
       # Skip the mean if Power_On_Hours was missing or zero.
       if (poh > 0) printf "mean writes per hour:  / %.3f gb / %.3f tb\n",  gb/poh, tb/poh
    }
    $0 ~ /Wear_Leveling_Count/ { printf "%s / %d (%% health)\n", $2, int($4) }
    ' |
       sed -e 's:/:@:' |
       sed -e "s\$^\$$device @ \$" |
       column -ts@

    # Get the TBW of the selected drive. Matching Total_LBAs_Written only,
    # so drives that also report Total_LBAs_Read yield a single number.
    TBW_TB=$(echo "$smart" | awk '/Total_LBAs_Written/{ printf "%.1f\n", $10 * 512 / 1024^4 }')
    TBW_GB=$(echo "$smart" | awk '/Total_LBAs_Written/{ printf "%.1f\n", $10 * 512 / 1024^3 }')
    TBW_MB=$(echo "$smart" | awk '/Total_LBAs_Written/{ printf "%.1f\n", $10 * 512 / 1024^2 }')

    # Append one timestamped line per run to the log on the Temp share.
    echo "TBW on $(date +"%d-%m-%Y %H:%M:%S") --> $TBW_TB TB, which is $TBW_GB GB, which is $TBW_MB MB." >> /mnt/user/Temp/TBW_"$1".log

    I have it set to run daily now but had it set to hourly when I was actively troubleshooting.

     

    As far as the graph goes, you should be able to get a very similar graph from the hard disks section in Netdata.

    Edited by TexasUnraid
    10 hours ago, thecode said:

    @Dephcon what tool did you use for measuring this? I have only used text tools until now.

    Not Dephcon, but I recognize the graph: it's grafana. To get cool visualizations like that, you'll need telegraf (data collection), influxdb (storage), and grafana (visualization). It's super fun if you're into tinkering with stuff like this and monitoring everything on your network. I've attached a few images here of what some of my dashboards look like. Mine aren't super cool (yet!), but they're always evolving. The top one is part of my pfsense firewall dashboard, the rest (disks and docker containers) are from my Unraid server. (Also, the delta-data disk usage numbers are totally wrong. I'm still trying to figure out how to make those work right. I'm not any kind of database or data visualization person, I'm just learning as I go and copying panels from other people who have posted theirs on the grafana repository.)

    Screen Shot 2020-07-13 at 11.31.27 PM.png

    Screen Shot 2020-07-13 at 11.33.08 PM.png

    Screen Shot 2020-07-13 at 11.33.29 PM.png

    Screen Shot 2020-07-13 at 11.34.45 PM.png

    Edited by grigsby
    18 hours ago, limetech said:

    Magic.  (actually it's an improved algorithm for maintaining data structures keeping track of free space)

     

    Thanks to @johnnie.black for pointing out this improvement.

    I guess I have to thank Facebook for developing it, begrudgingly.

    @limetech Is this going to become a standard (or an option) in 6.9?  While switching to XFS would be nice, I can't afford new NVMe disks and am stuck with BTRFS RAID10 for now.

    *edit* That said, if I had separate cache devices for array caching, appdata, etc., I might not need RAID10 anymore, especially with less overhead from btrfs.  Something worth testing, I guess.

    Edited by Dephcon
    On 7/10/2020 at 7:08 PM, limetech said:

    Yes but I suggest you wait for beta25 which corrects a few bugs with this.

    Noticed that serious efforts have been taken to place docker in its own directory instead of the loop device.

    Very very happy that I can keep using the folder based approach, as it just works very well.

    Thanks so much for listening to the community @limetech!!

    1 minute ago, S1dney said:

    Noticed that serious efforts have been taken to place docker in its own directory instead of the loop device.

    Very very happy that I can keep using the folder based approach, as it just works very well.

    Thanks so much for listening to the community @limetech!!

    The loopback approach is much better from the standpoint of data management.  Once you have a directory dedicated to Docker engine, it's almost impossible to move it to a different volume, especially for the casual user.

    2 minutes ago, limetech said:

    The loopback approach is much better from the standpoint of data management.  Once you have a directory dedicated to Docker engine, it's almost impossible to move it to a different volume, especially for the casual user.

    Agreed, which is why having options for both the loopback image and the folder is the best of both worlds.

    Also if I ever wanted to move the data I would just remove the entire folder and recreate it anyways since it's non-persistent data.

    1 minute ago, S1dney said:

    Also if I ever wanted to move the data I would just remove the entire folder and recreate it anyways since it's non-persistent data.

    That works for most containers and we highly encourage not storing data in image layers for just that reason BUT if someone does store data in the image this is something to be aware of.

    1 minute ago, limetech said:

    That works for most containers and we highly encourage not storing data in image layers for just that reason BUT if someone does store data in the image this is something to be aware of.

    Sure, but you will lose it when you upgrade the containers as well, so you'll find out soon enough.

    Wiping out the directory is essentially recreating the docker image, so that's fine.

    Also, I understand that you're trying to warn people, and I agree with you that for most users the loopback approach will work better and cause less confusion.

    It's great that we can decide this ourselves though, unRAID is so flexible, which is something I like about it.

    22 minutes ago, limetech said:

    The loopback approach is much better from the standpoint of data management.  Once you have a directory dedicated to Docker engine, it's almost impossible to move it to a different volume, especially for the casual user.

    Definitely agree that management of a loopback image is easier.  But, outside of the share requirements on a folder (and that it can't be moved), there are a few pros for the folder.  Not that I have any of the "issues" associated with an image, but the option is a good thing to have.  And many of the cons for each approach (storing information within the image / folder that's not part of the container) affect each method equally.

    29 minutes ago, Squid said:

    Definitely agree that management of a loopback image is easier.  But, outside of the share requirements on a folder (and that it can't be moved), there are a few pros for the folder.  Not that I have any of the "issues" associated with an image, but the option is a good thing to have.  And many of the cons for each approach (storing information within the image / folder that's not part of the container) affect each method equally.

    If the increased writes to the SSD were not an issue, then it would not be worth the management headache, especially for new users, that comes with putting the docker tree, along with all its layers, be they btrfs subvolumes or unionfs overlays, in a volume that can't easily be moved.


    I can do some testing of the various scenarios once we get an RC release.  Can't risk a beta in a prime pandemic Plex period.

    btrfs image on btrfs cache

    xfs image on btrfs cache

    folder on btrfs cache

    btrfs image on xfs cache

    xfs image on xfs cache

    folder on xfs cache

    Luckily I have a drawer full of the same SSD model so I can set up some different cache pools.

    Edited by Dephcon
    6 minutes ago, Dephcon said:

    I can do some testing of the various scenarios once we get an RC release.  Can't risk a beta in a prime pandemic Plex period.

    btrfs image on btrfs cache

    xfs image on btrfs cache

    folder on btrfs cache

    btrfs image on xfs cache

    xfs image on xfs cache

    folder on xfs cache

    Luckily I have a drawer full of the same SSD model so I can set up some different cache pools.

    Thank you, we're doing the same testing as well.  Other combinations would be

    single-device btrfs pool

    multiple-device (x2) btrfs pool

    Also before each test run, it's best to 'blkdiscard' the entire SSD(s) first.
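
    Since blkdiscard works on a block device rather than on a mount point, and it throws away everything on that device, a guard along these lines may be worth having in a test script. It is only a dry-run sketch: plan_blkdiscard is a hypothetical helper that prints the command instead of running it.

```shell
#!/bin/bash
# Dry-run guard: print the blkdiscard command only when the argument is a
# block device. Nothing is discarded by this sketch.
# plan_blkdiscard is a hypothetical helper name.
plan_blkdiscard() {
    local dev="$1"
    if [ ! -b "$dev" ]; then
        echo "not a block device: $dev" >&2
        return 1
    fi
    echo "blkdiscard $dev"
}

# Usage:
#   plan_blkdiscard /dev/sdX      # prints the command for a real device
#   plan_blkdiscard /mnt/cache    # rejected, a mount point is a directory
```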

    1 minute ago, limetech said:

    Thank you, we're doing the same testing as well.  Other combinations would be

    single-device btrfs pool

    multiple-device (x2) btrfs pool

    Also before each test run, it's best to 'blkdiscard' the entire SSD(s) first.

    blkdiscard /dev/sdX or blkdiscard /mnt/cache?

    Can I assume space_cache=v2 is being used for the testing, and will be the default in an upcoming release?




