• [6.8.3] docker image huge amount of unnecessary writes on cache


    S1dney
    • Solved Urgent

    EDIT (March 9th 2021):

Solved in 6.9 and up. Reformatting the cache to the new partition alignment and hosting Docker directly on a cache-only directory brought writes down to a bare minimum.

     

    ###

     

    Hey Guys,

     

First of all, I know that you're all very busy getting version 6.8 out there, something I'm very much waiting on as well. I'm seeing great progress, so thanks so much for that! Furthermore, I'm not expecting this to be at the top of the priority list, but I'm hoping someone on the development team is willing to invest some time in it (perhaps after the release).

     

    Hardware and software involved:

2 x 1TB Samsung EVO 860, set up with LUKS encryption in a BTRFS RAID1 pool.

     

    ###

TLDR (but I'd suggest reading on anyway 😀)

The image file mounted as a loop device is causing massive writes on the cache, potentially wearing out SSDs quite rapidly.

This appears to only be happening on encrypted caches formatted with BTRFS (maybe only in a RAID1 setup, but not sure).

    Hosting the Docker files directory on /mnt/cache instead of using the loopdevice seems to fix this problem.

A possible idea for implementation is proposed at the bottom.

     

    Grateful for any help provided!

    ###

     

I have written a topic in the general support section (see link below), but I have done a lot of research lately and think I have gathered enough evidence pointing to a bug. I was also able to build (kind of) a workaround for my situation. More details below.

     

So, to see what was actually hammering on the cache, I started doing all the obvious things, like using a lot of find commands to trace files that were written to every few minutes, and I also used the File Activity plugin. Neither was able to trace down any writes that would explain 400 GB worth of writes a day for just a few containers that aren't even that active.

     

Digging further, I moved the docker.img to /mnt/cache/system/docker/docker.img, so directly on the BTRFS RAID1 mountpoint. I wanted to check whether the unRAID FS layer was causing the loop2 device to write this heavily. No luck either.

This gave me a situation I was able to reproduce on a virtual machine though, so I started with a recent Debian install (I know, it's not Slackware, but I had to start somewhere ☺️). I created some vDisks, encrypted them with LUKS, bundled them in a BTRFS RAID1 setup, created the loopdevice on the BTRFS mountpoint (same as /mnt/cache) and mounted it on /var/lib/docker. I made sure I had the NoCow flags set on the IMG file like unRAID does. Strangely this did not show any excessive writes; iotop showed really healthy values for the same workload (I migrated the docker content over to the VM).

     

After my Debian troubleshooting I went back over to the unRAID server, wondering whether the loopdevice was created weirdly, so I took the exact same steps to create a new image and pointed the settings from the GUI there. Still the same write issues.

     

    Finally I decided to put the whole image out of the equation and took the following steps:

    - Stopped docker from the WebGUI so unRAID would properly unmount the loop device.

    - Modified /etc/rc.d/rc.docker to not check whether /var/lib/docker was a mountpoint

    - Created a share on the cache for the docker files

    - Created a softlink from /mnt/cache/docker to /var/lib/docker

    - Started docker using "/etc/rc.d/rc.docker start"

    - Started my Bitwarden containers.
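
    In shell terms, the steps above roughly boil down to the following (a sketch rather than an exact transcript - the share name "docker" is just an example, and the rc.docker edit itself is not shown since it depends on the unRAID version):

    # Docker already stopped from the WebGUI, so the loop device is unmounted
    mkdir -p /mnt/cache/docker                  # cache-only share/directory for the Docker files
    # edit /etc/rc.d/rc.docker so it no longer requires /var/lib/docker to be a mountpoint
    rmdir /var/lib/docker                       # remove the (now empty) old mountpoint
    ln -s /mnt/cache/docker /var/lib/docker     # point dockerd's data directory straight at the cache
    /etc/rc.d/rc.docker start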

     

Looking into the stats with "iotop -ao" I did not see any excessive writing taking place anymore.

I had the containers running for about 3 hours and maybe got 1GB of writes total (note that on the loopdevice this gave me 2.5GB every 10 minutes!).

     

Now don't get me wrong, I understand why the loopdevice was implemented. Dockerd is started with options to make it run with the BTRFS driver, and since the image file is formatted with the BTRFS filesystem this works on every setup; it doesn't even matter whether the cache runs XFS, EXT4 or BTRFS, it will just work. In my case I had to point the softlink to /mnt/cache, because pointing it to /mnt/user would not allow me to start using the BTRFS driver (obviously the unRAID user filesystem isn't BTRFS). Also, the WebGUI has commands to scrub the filesystem inside the image; it's all based on the assumption that everyone is using docker on BTRFS (which of course they are, because of the image 😁).
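
    For reference, that roughly means dockerd is launched along these lines (a sketch only - the exact options unRAID passes may differ):

    dockerd --storage-driver=btrfs --data-root=/var/lib/docker
    # equivalently: "storage-driver": "btrfs" in /etc/docker/daemon.json

    So the only real assumption is that whatever ends up at /var/lib/docker is a BTRFS filesystem, which the docker.img guarantees regardless of what the cache itself is formatted with.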

I must say that my approach also broke when I changed something in the shares: certain services get restarted, causing docker to be turned off for some reason. No big issue, since it wasn't meant to be a long-term solution, just a way to see whether the loopdevice was causing the issue, which I think my tests did point out.

     

Now I'm at the point where I would definitely need some developer help. I'm currently keeping nearly all docker containers off all day, because 300-400GB worth of writes a day is just a BIG waste of expensive flash storage, especially since I've pointed out that it's not needed at all. It does defeat the purpose of my NAS and SSD cache though, since its main purpose was hosting docker containers while allowing the HDs to spin down.

     

Again, I'm hoping someone on the dev team acknowledges this problem and is willing to invest some time in it. I did get quite a few hits on the forums and Reddit, but no one actually pointed out the root cause of the issue.

     

I'm missing the technical know-how to troubleshoot the loopdevice issues on a lower level, but I have been thinking about possible ways to implement a workaround, like adjusting the Docker settings page to switch off the use of a vDisk and, if all requirements are met (pointing to /mnt/cache and BTRFS formatted), start docker on a share on the /mnt/cache partition instead of using the vDisk.

In this way you would still keep all the advantages of the docker.img file (cross-filesystem support), and users who don't care about the writes could still use it, but you'd be massively helping out others who are concerned about these writes.

     

I'm not attaching diagnostic files since they would probably not show what's needed.

Also, if this should have been in feature requests, I'm sorry. But I feel that, since the current solution is misbehaving in terms of writes, this could also be placed in the bug report section.

     

Thanks though for this great product, I have been using it with a lot of joy so far!

I'm just hoping we can solve this one so I can keep all my dockers running without the cache wearing out quickly.

     

    Cheers!

     

    • Like 3
    • Thanks 17



    User Feedback

    Recommended Comments



Ok, one more journal entry, then I'm going to let things settle for a few days, and if everything checks out I will post a writeup on this process.

     

Where would I post this so it might attract a plugin dev who would be interested in turning this process into a plugin?

     

I have most of the commands and setup figured out; I just have no clue where to even start with a plugin.

     

Ok, the final step was to figure out how to handle the appdata ramdrive. I am currently doing this, and after several reboots and other tests it seems to be working well and seamlessly.

     

The first step, after figuring out which appdata needs to move to the ramdisk, is to create the ramdisk itself and then copy the appdata into it from the SSD.

     

First create a folder in /mnt/cache/appdata/. It is very important to create the folder on the drive itself and NOT in /user.

     

    mkdir /mnt/cache/appdata/appramdisk
    chmod 777 /mnt/cache/appdata/appramdisk

     

After this I use a very basic user script that is set to "run at array start":

     

    echo ---------------------------------Create ramdisk for appdata----------------------------------
    mount -vt tmpfs -o size=8G appramdisk /mnt/cache/appdata/appramdisk
    
    
    echo ---------------------------------rsync to ramdisk in appdata----------------------------------
    rsync -ah --stats --delete /mnt/user/appdata/binhex-qbittorrentvpn /mnt/user/appdata/appramdisk
    rsync -ah --stats --delete /mnt/user/appdata/binhex-nzbhydra2 /mnt/user/appdata/appramdisk
    rsync -ah --stats --delete /mnt/user/appdata/*arr /mnt/user/appdata/appramdisk

     

I then have a separate script set to run hourly that rsyncs the ramdisk back to the SSD:

     

    rsync -ahv --progress --delete /mnt/user/appdata/appramdisk/* /mnt/user/appdata/

     

    Now for the shutdown, I created a "stop" file on the USB drive at /boot/config. It is called first thing when you click shutdown/reboot in the GUI and the rest of the shutdown will wait until it is finished.

     

    touch /boot/config/stop

    In the stop file I decided to simply redirect it to a script in user scripts called "Run at Shutdown" to make it easier to manage.

     

    #!/bin/bash
    
    #Runs the user script "Run at Shutdown" during shutdown or reboot.
    #it is called before anything else during the shutdown process
    
    # Invoke 'Run at Shutdown' script if present
    if [ -f /boot/config/plugins/user.scripts/scripts/Run\ at\ Shutdown/script ]; then
      echo "Preparing Run at Shutdown script"
      cp /boot/config/plugins/user.scripts/scripts/Run\ at\ Shutdown/script /var/tmp/shutdown
      chmod +x /var/tmp/shutdown
      logger Starting Run at Shutdown script
      /var/tmp/shutdown
    fi

     

    The run at shutdown script itself first stops all running docker containers so they can close out open files. It then rsyncs the appramdisk back to the SSD before clearing the ramdisk and unmounting it.

     

    #!/bin/bash
    #description=This script runs first thing at shutdown or reboot and handles rsyncing appramdisk and unmounting it.
    
    logger Stopping Dockers
    docker stop $(docker ps -q)
    logger Dockers stopped
    
    logger Started appramdisk rsync
    rsync -ah --stats --delete /mnt/user/appdata/appramdisk/* /mnt/user/appdata/ | logger
    logger rsync finished
    
    logger clearing appramdisk data
    rm -r /mnt/user/appdata/appramdisk/* | logger
    
    logger unmounting appramdisk
    umount -v appramdisk | logger

     

And that's it - it seems to be working well, with no hang-ups when rebooting, and everything works automatically.

     

All combined, this has reduced my writes by ~10x it seems, and my writes were pretty mild compared to a lot of guys' to start out with.

     

It is finally low enough that I would be ok putting appdata and docker back onto my main SSDs with redundancy, instead of the single piece-of-junk drive I am using now.

    Edited by TexasUnraid
    Link to comment

    Wow, I am really impressed with how well this is working.

     

    So 2 days ago I figured it would just be a throwaway day for data as I did a fair amount of messing around with the dockers and testing of the ramdisk setup.

     

    Yet I still only had 13gb of writes!

     

    Yesterday I still had a few extra writes from messing with stuff but pretty close to what I expect regularly with it syncing to the SSD every hour.

     

    13GB of writes!

     

    Guessing total writes would be around 11-12gb/day once it all settles down.

     

    Looks like the average sync was around ~300mb x 24 = 7.2gb

     

    So yeah, big fan of this new setup

    Edited by TexasUnraid
    Link to comment

    Ok, I posted a guide on all this here for anyone interested:

     

     

    What would be the protocol for seeing if any plugin devs are interested in turning it into a plugin?

    Link to comment
    On 6/19/2021 at 8:51 AM, TexasUnraid said:

    So it has been a few days now and the results are pretty consistent.

     

    BTRFS formatted drive + BTRFS image = 75-85gb/day

    BTRFS drive + Docker folder = 60-65gb/day

    XFS drive with BTRFS image = 20-25gb/day

     

This is useful information, as I was about to go to the effort of changing over to the folder option this weekend. For such a small difference, I don't think it's worth the effort for me; it's just another workaround, in effect, for the performance issue.

     

One thing I am tempted to do is pick up 2x small SSDs specifically just for the docker.img, and just have it backed up once a day.

    Link to comment
    10 hours ago, boomam said:

This is useful information, as I was about to go to the effort of changing over to the folder option this weekend. For such a small difference, I don't think it's worth the effort for me; it's just another workaround, in effect, for the performance issue.

     

One thing I am tempted to do is pick up 2x small SSDs specifically just for the docker.img, and just have it backed up once a day.

     

Yeah, I am sticking with the image as well. Read through the above thread; with a few tweaks to the docker templates you can vastly reduce the writes by disabling logging and enabling the /tmp ramdisk inside containers that need it.
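
    For example, the kind of flags involved (a sketch - these would go into a container template's Extra Parameters field, and the size limit is just a placeholder):

    --log-opt max-size=1m --log-opt max-file=1    # cap the container's json-file log
    --log-driver none                             # or disable container logging entirely
    --tmpfs /tmp                                  # give the container a RAM-backed /tmp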

     

    If you really want to reduce writes you can add a ramdisk for appdata as well but that is a bit more involved.

     

    A plugin for all this would make it super easy though with very little downsides.

     

I also am not that worried about the docker image itself being backed up; appdata is much more important. The docker image can be re-downloaded pretty easily, it just takes time.

    Link to comment

Other than the common ones that we'd expect, such as Plex (due to a bug with the official container), are there any overarching notes on which containers are known to be big offenders for this bug?

I assume any container that does continuous writes (such as a logging service/poller) would be a big culprit?

     

     

I forgot to pull the trigger and get a small/cheap SSD specifically for containers to help preserve the lifetime of my existing SSDs, and upon taking a quick look at my 2x MX500 1TBs, the life left is 33% and 34% respectively -

     

                 LBAs Written       TB Written    Percent lifetime remaining
    Cache        168821397049       691.49        34%
    Cache 2      167088799024       684.40        35%

     

     

Running iotop -oa for 10-15 mins shows just 69 MB written by loop2, with a larger 105 MB written by 'btrfs-transacti' (with the above types of containers off). Re-running now with them restarted, just as a quick comparison...

    Edited by boomam
    Link to comment

    database dockers are bad as well.

     

That's a lot of writes, how long were the drives in there? I am afraid there are a LOT more people with write issues but they are not even aware of it.

     

    Easiest way to get an idea of your writes is to use the logging command that was posted earlier or in my writeup thread. That will show you the files that are causing writes and you can then trace them to the containers.

     

    The writeup shows how to trace and deal with most of the writes.

     

    I have been sitting at a steady ~11gb of writes a day since finishing the write modifications.

    Link to comment
    1 minute ago, TexasUnraid said:

    database dockers are bad as well.

     

That's a lot of writes, how long were the drives in there? I am afraid there are a LOT more people with write issues but they are not even aware of it.

     

    Easiest way to get an idea of your writes is to use the logging command that was posted earlier or in my writeup thread. That will show you the files that are causing writes and you can then trace them to the containers.

     

    The writeup shows how to trace and deal with most of the writes.

     

    I have been sitting at a steady ~11gb of writes a day since finishing the write modifications.

    Read further up the thread, we've spoken before, quite a few times ;-)

     

    Just retesting now as I've a little time - 

10 mins with all the logging containers back on shows a loop2 usage of 67 MB (so within a margin of error of the earlier test with the 'bad' containers off).

     

    Extrapolated that would mean - 

67 MB per 10 mins.

402 MB per hour.

9.64 GB per day.

282.65 GB per month (30-day month)

3.35 TB per year

     

That's not as bad as I thought it would be. I'd guess the large loss in lifetime I'm seeing probably happened before the 6.9 update and some other changes, and I'm just forgetting the last set of numbers from the last time I measured.

     

Also, I wouldn't discount a container as 'bad' just because it's a database - its characteristics depend on the app and its usage of the database: what is cached in RAM, how it writes to the database itself, etc.

For example, NextCloud (popular with unRAID users) is a low-overhead DB-writing app if the user/client count isn't high and files aren't updated often. It's a great example of disk IO scaling with users/data being changed.

     

    ...Regardless, I think I'll look into some new, smaller SSDs specifically for the containers the next time I go by a Microcenter.

Reserve the larger drives' remaining lifespan for 'normal' caching.

    Link to comment

I would check the LBAs written to truly see how many writes you have.

     

    A bunch of small writes will have write amplification but a few large writes will not.

     

    I log my writes every day with a script and track them that way.
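
    If it helps, something along these lines is enough for daily tracking (a rough sketch - the device names and log path are examples, and the exact SMART attribute name varies by drive model):

    #!/bin/bash
    # append today's total-LBAs-written counters for each SSD to a log on the flash drive
    for dev in /dev/sdb /dev/sdc; do
      echo "$(date +%F) $dev $(smartctl -A $dev | grep -i 'LBAs_Written')" >> /boot/config/ssd-writes.log
    done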

    Link to comment

    Good point.

    I've made note and will check daily.

     

Still though, I'm wondering if it's worth putting together a de facto list of particularly bad apps.

    Link to comment

    It just occurred to me - 

    The loop2 device is literally just the docker.img file - correct?

If so, then the loss of an SSD that just holds the image has no effect on the appdata/what we actually want to recover.

So having an SSD literally just for the container images holds little to no risk to data, just an uptime risk due to no redundancy...

    Link to comment
    3 minutes ago, boomam said:

    just an uptime risk due to no redundancy.

    1-2 minute recovery time when you notice it.

    Link to comment
    2 hours ago, boomam said:

    It just occurred to me - 

    The loop2 device is literally just the docker.img file - correct?

If so, then the loss of an SSD that just holds the image has no effect on the appdata/what we actually want to recover.

So having an SSD literally just for the container images holds little to no risk to data, just an uptime risk due to no redundancy...

Recreating the docker.img and its contents with previous settings is a trivial and quick operation. It is the appdata contents used by each docker container that really matter.

    Link to comment
    1 minute ago, itimpi said:

Recreating the docker.img and its contents with previous settings is a trivial and quick operation. It is the appdata contents used by each docker container that really matter.

    Yup.

    Can likely create a script to monitor, and remediate if needed too...

    Link to comment

So this is odd - I put in 2x 250GB SSDs to act as a second cache pool, strictly for the docker.img file.

    Moved the file, expanded it a little and restarted the service.

     

    ...whilst it does seem to consistently do writes, I'm seeing an equal number of writes, offset, to the main SSD cache pool.

    Strange...

    Link to comment
    7 hours ago, boomam said:

put in 2x 250GB SSDs to act as a second cache pool, strictly for the docker.img file.

I would use one SSD, format it as XFS, and set docker to path mode. That way the write amplification is reduced even further. Disabling the healthcheck on all containers is an additional easy task.
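
    A sketch of what that can look like (e.g. added to a container template's Extra Parameters field):

    --no-healthcheck    # turns off Docker's periodic healthcheck for that container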

    • Thanks 1
    Link to comment

It depends on the containers from my testing; in my case the docker logs caused a lot more writes than the health checks overall. But both are issues for sure.

    Link to comment
    7 hours ago, mgutt said:

I would use one SSD, format it as XFS, and set docker to path mode. That way the write amplification is reduced even further. Disabling the healthcheck on all containers is an additional easy task.

Ordinarily I would, but I wasn't bothered about 2x cheap second-hand SSDs dying due to amplification.

    Having 2x gives me that failover and enough time to react if needed.

     

Regardless, I'm a little confused why I'm still seeing an amount of IO on the main cache, considering loop2/docker.img has moved to the secondary cache.

    Link to comment
    18 hours ago, boomam said:

I'm a little confused why I'm still seeing an amount of IO

Your containers are reading from / writing to files in /mnt/cache/appdata

    Link to comment
    3 hours ago, mgutt said:

Your containers are reading from / writing to files in /mnt/cache/appdata

Yes, this I know.

But there are no containers actively creating that level of disk IO, or at least there shouldn't be.

    Link to comment

Looping back - I'm getting quite a lot of consistent writes from the 'Ghost' container.

It seems to be dumping a few KB of text to its log files every few seconds due to a random bug - I'm investigating the cause and will post a fix here should I find one, for others' reference.

    Link to comment

Not sure what Ghost is, but post the file activity log for the problem. Most of those writes can be dealt with using the methods in the writes thread I made.

     

If it is writing to a log file and the log file is in its own folder, the easiest option is to simply make that folder a ramdrive in the container.
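
    Something along these lines in the container's Extra Parameters would do it (a sketch - the path is only an example of where a container might keep its logs):

    --tmpfs /var/log/ghost    # back the log folder with RAM instead of writing through the image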

    Link to comment

It's a website/CMS, similar to WordPress.

The issue is 'in-app', as the errors it displays that get written out are to do with an issue with its tagging. It's not a docker/loop2 issue, but an in-app issue that's causing more things to be logged than needed.

    Link to comment




