• [6.8.3] docker image huge amount of unnecessary writes on cache


    S1dney
    • Solved Urgent

    EDIT (March 9th 2021):

    Solved in 6.9 and up. Reformatting the cache to new partition alignment and hosting docker directly on a cache-only directory brought writes down to a bare minimum.

     

    ###

     

    Hey Guys,

     

First of all, I know that you're all very busy getting version 6.8 out there, something I'm very much waiting on as well. I'm seeing great progress, so thanks so much for that! I'm not expecting this to be at the top of the priority list, but I'm hoping someone on the development team is willing to investigate (perhaps after the release).

     

    Hardware and software involved:

2 x 1TB Samsung EVO 860, set up with LUKS encryption in a BTRFS RAID1 pool.

     

    ###

TL;DR (but I'd suggest reading on anyway 😀)

The image file mounted as a loop device is causing massive writes on the cache, potentially wearing out SSDs quite rapidly.

This appears to only happen on encrypted caches formatted with BTRFS (possibly only in a RAID1 setup, but I'm not sure).

Hosting the Docker files directory on /mnt/cache instead of using the loop device seems to fix this problem.

A possible implementation idea is proposed at the bottom.

     

    Grateful for any help provided!

    ###

     

I have written a topic in the general support section (see link below), but I have done a lot of research lately and think I have gathered enough evidence pointing to a bug. I was also able to build a (kind of) workaround for my situation. More details below.

     

So to see what was actually hammering on the cache I started doing all the obvious things, like using a lot of find commands to trace files that were written to every few minutes, and also used the File Activity plugin. Neither was able to trace down any writes that would explain 400 GB worth of writes a day for just a few containers that aren't even that active.
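
For reference, the kind of commands I mean are along these lines (paths and timings are just illustrative):

    # List files on the cache that were modified in the last 10 minutes
    find /mnt/cache -type f -mmin -10 -ls
    # Accumulated per-process I/O since the start of the measurement (the [loop2] kernel thread shows up here)
    iotop -ao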

     

Digging further, I moved the docker.img to /mnt/cache/system/docker/docker.img, so directly on the BTRFS RAID1 mountpoint. I wanted to check whether the unRAID FS layer was causing the loop2 device to write this heavily. No luck either.

This did give me a setup I was able to reproduce in a virtual machine though, so I started with a recent Debian install (I know, it's not Slackware, but I had to start somewhere ☺️). I created some vDisks, encrypted them with LUKS, bundled them into a BTRFS RAID1 setup, created the loop device on the BTRFS mountpoint (same as the cache setup) and mounted it on /var/lib/docker. I made sure I had the NoCOW flag set on the IMG file like unRAID does. Strangely this did not show any excessive writes; iotop shows really healthy values for the same workload (I migrated the docker content over to the VM).
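
Roughly, the setup on the Debian VM boiled down to something like this (a sketch; device names and sizes are illustrative, not the exact ones used):

    # Two vDisks, encrypted with LUKS
    cryptsetup luksFormat /dev/vdb && cryptsetup luksOpen /dev/vdb crypt1
    cryptsetup luksFormat /dev/vdc && cryptsetup luksOpen /dev/vdc crypt2
    # Bundle them into a BTRFS RAID1 pool and mount it
    mkfs.btrfs -d raid1 -m raid1 /dev/mapper/crypt1 /dev/mapper/crypt2
    mkdir -p /mnt/cache && mount /dev/mapper/crypt1 /mnt/cache
    # Create the image with the NoCOW attribute set (like unRAID does), format it with BTRFS and loop-mount it
    touch /mnt/cache/docker.img && chattr +C /mnt/cache/docker.img
    truncate -s 20G /mnt/cache/docker.img
    mkfs.btrfs /mnt/cache/docker.img
    mkdir -p /var/lib/docker && mount -o loop /mnt/cache/docker.img /var/lib/docker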

     

After my Debian troubleshooting I went back to the unRAID server, wondering whether the loop device was created in some weird way, so I took the exact same steps to create a new image and pointed the settings in the GUI there. Still the same write issues.

     

Finally I decided to take the whole image out of the equation and took the following steps (a rough sketch of the equivalent commands follows the list):

    - Stopped docker from the WebGUI so unRAID would properly unmount the loop device.

    - Modified /etc/rc.d/rc.docker to not check whether /var/lib/docker was a mountpoint

    - Created a share on the cache for the docker files

    - Created a softlink from /mnt/cache/docker to /var/lib/docker

- Started docker using "/etc/rc.d/rc.docker start"

- Started my Bitwarden containers.
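
For reference, the manual version of those steps boils down to something like this (a sketch, after stopping docker from the WebGUI and editing rc.docker as described above):

    mkdir -p /mnt/cache/docker          # cache-only share for the docker files
    rm -rf /var/lib/docker              # remove the old mountpoint directory
    ln -s /mnt/cache/docker /var/lib/docker
    /etc/rc.d/rc.docker start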

     

Looking into the stats with "iotop -ao" I did not see any excessive writing taking place anymore.

    I had the containers running for like 3 hours and maybe got 1GB of writes total (note that on the loopdevice this gave me 2.5GB every 10 minutes!)

     

Now don't get me wrong, I understand why the loop device was implemented. Dockerd is started with options to make it run with the BTRFS driver, and since the image file is formatted with the BTRFS filesystem this works in every setup; it doesn't matter whether the cache runs XFS, EXT4 or BTRFS, it will just work. In my case I had to point the softlink to /mnt/cache, because pointing it at /mnt/user would not allow me to use the BTRFS driver (obviously the unRAID user share filesystem isn't BTRFS). Also the WebGUI has commands to scrub the filesystem inside the image; it's all based on the assumption that everyone is running docker on BTRFS (which of course they are, because of the image file 😁)
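
For context, the relevant bit is the storage driver dockerd is started with; simplified, it comes down to something like this (not the exact unRAID command line):

    # With the btrfs graph driver the data root has to live on a btrfs filesystem,
    # which is why the symlink must point to /mnt/cache and not to /mnt/user
    dockerd --storage-driver=btrfs --data-root=/var/lib/docker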

I must say that my approach also broke when I changed something in the shares: certain services get restarted, causing docker to be turned off for some reason. No big issue, since it wasn't meant to be a long-term solution, just a way to see whether the loop device was causing the issue, which I think my tests did point out.

     

Now I'm at the point where I definitely need some developer help. I'm currently keeping nearly all docker containers off all day, because 300-400GB worth of writes a day is just a BIG waste of expensive flash storage, especially since I've shown it's not needed at all. It does defeat the purpose of my NAS and SSD cache though, since its main purpose was hosting docker containers while allowing the HDs to spin down.

     

Again, I'm hoping someone on the dev team acknowledges this problem and is willing to investigate. I did get quite a few hits on the forums and Reddit, but no one has actually pointed out the root cause of the issue.

     

I'm missing the technical know-how to troubleshoot the loop device issues at a lower level, but I have been thinking about possible ways to implement a workaround. For example, adjusting the Docker Settings page to switch off the use of a vDisk and, if all requirements are met (path pointing to /mnt/cache and the cache formatted with BTRFS), start docker on a share on the /mnt/cache partition instead of using the vDisk.

In this way you would still keep all the advantages of the docker.img file (it works across filesystem types), and users who don't care about the writes could still use it, but you'd be massively helping out others who are concerned about these writes.
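
As a sketch of what that requirement check could look like (illustrative only, not actual unRAID code):

    # Only skip the vDisk when the chosen path lives on the cache and the cache is btrfs
    DOCKER_DIR="/mnt/cache/docker"
    if [[ "$DOCKER_DIR" == /mnt/cache/* ]] && [[ "$(stat -f -c %T /mnt/cache)" == "btrfs" ]]; then
        echo "Requirements met: host docker directly in $DOCKER_DIR"
    else
        echo "Requirements not met: fall back to the docker.img vDisk"
    fi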

     

I'm not attaching diagnostic files since they would probably not show anything useful here.

Also, if this should have been in feature requests, I'm sorry. But I feel that, since the current solution misbehaves in terms of writes, it can also be placed in the bug report section.

     

Thanks though for this great product, I have been using it with a lot of joy so far!

I'm just hoping we can solve this one so I can keep all my dockers running without wearing out the cache quickly.

     

    Cheers!

     

    • Like 3
    • Thanks 17



    User Feedback

    Recommended Comments



    6 hours ago, hotio said:

    I might go as far as calling it 'Urgent' instead of 'Minor'.

I agree that it feels like a real waste, but the product is working, so no need to mark it urgent and alert the dev team while they're busy with the 6.8 release 🍻

     

I have found a fix for my own host though. I was planning to automate it, but I'm working on backing up my array data offsite on a regular basis, so that takes precedence for now.

     

However, a short while ago I wrote a very simple script that monitors the TBW of the cache disks so I could get a grip on the numbers (a rough sketch of such a script follows the log below):

     

    TBW on 02-11-2019 23:59:01 --> 3.0 TB                                            # Started measuring here
    TBW on 03-11-2019 23:59:01 --> 3.0 TB
    TBW on 04-11-2019 23:59:01 --> 3.1 TB, which is 3147.1 GB.
    TBW on 05-11-2019 23:59:01 --> 3.3 TB, which is 3385.6 GB.
    TBW on 06-11-2019 23:59:01 --> 3.4 TB, which is 3442.8 GB.
    TBW on 07-11-2019 23:59:01 --> 3.4 TB, which is 3520.2 GB.
    TBW on 08-11-2019 23:59:01 --> 3.5 TB, which is 3562.2 GB.
    TBW on 09-11-2019 23:59:01 --> 3.5 TB, which is 3618.7 GB.
    TBW on 10-11-2019 23:59:01 --> 3.6 TB, which is 3721.0 GB.
    TBW on 11-11-2019 23:59:02 --> 3.8 TB, which is 3941.2 GB.             # All containers were on here for a few days, nearly 400 GB daily!
    TBW on 12-11-2019 23:59:01 --> 4.2 TB, which is 4272.1 GB.
    TBW on 13-11-2019 23:59:01 --> 4.5 TB, which is 4632.5 GB.
    TBW on 14-11-2019 23:59:01 --> 4.9 TB, which is 5044.0 GB.
    TBW on 15-11-2019 23:59:01 --> 5.2 TB, which is 5351.3 GB.            # Turned off most containers here, to save writes
    TBW on 16-11-2019 23:59:01 --> 5.3 TB, which is 5468.8 GB.
    TBW on 17-11-2019 23:59:01 --> 5.5 TB, which is 5646.1 GB.
    TBW on 18-11-2019 23:59:01 --> 5.6 TB, which is 5695.8 GB.
    TBW on 19-11-2019 23:59:02 --> 5.6 TB, which is 5738.7 GB.            # Applied fix here, all containers are running
    TBW on 20-11-2019 23:59:01 --> 5.6 TB, which is 5757.7 GB.
    TBW on 21-11-2019 23:59:01 --> 5.6 TB, which is 5778.7 GB.
    TBW on 22-11-2019 23:59:01 --> 5.7 TB, which is 5800.2 GB.
    TBW on 23-11-2019 23:59:01 --> 5.7 TB, which is 5822.8 GB.
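
The script itself is nothing fancy; a minimal sketch of the approach (device name and log path are assumptions) could look like:

    #!/bin/bash
    # Log the total bytes written of a cache SSD, based on SMART attribute 241 (Total_LBAs_Written)
    DISK="/dev/sdb"                                  # one of the cache SSDs
    LBAS=$(smartctl -A "$DISK" | awk '/Total_LBAs_Written/ {print $NF}')
    GB=$(echo "scale=1; $LBAS * 512 / 1024^3" | bc)  # LBAs are 512-byte sectors
    TB=$(echo "scale=1; $LBAS * 512 / 1024^4" | bc)
    echo "TBW on $(date '+%d-%m-%Y %H:%M:%S') --> $TB TB, which is $GB GB." >> /var/log/tbw.log

Run it once a day from cron and the day-to-day deltas give you the daily write volume.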

     

    I'm now at 22GB of writes on a daily basis, which is perfect!

     

    In short, what I did was:

    - Created a share named docker (which has to be cache only or this will break after the mover kicks in!)

    - Modified the start_docker() function in /etc/rc.d/rc.docker to:

     

    # Start docker
        start_docker(){
          if is_docker_running; then
            echo "$DOCKER is already running"
            return 1
          fi
          if mountpoint $DOCKER_ROOT &>/dev/null; then
             echo "Image is mounted, will attempt to unmount it next."
             umount $DOCKER_ROOT 1>/dev/null 2>&1
             if [[ $? -ne 0 ]]; then
               echo "Image still mounted at $DOCKER_ROOT, cancelling cause this needs to be a symlink!"
               exit 1
             else
               echo "Image unmounted succesfully."
             fi
          fi
          # In order to have a soft link created, we need to remove the /var/lib/docker directory or creating a soft link will fail
          if [[ -d $DOCKER_ROOT ]]; then
            echo "Docker directory exists, removing it so we can use it for the soft link."
            rm -rf $DOCKER_ROOT
            if [[ -d $DOCKER_ROOT ]]; then
              echo "$DOCKER_ROOT still exists! Creating a soft link will fail thus refusing to start docker."
              exit 1
            else
              echo "Removed $DOCKER_ROOT. Moving on."
            fi
          fi
          # Now that we know that the docker image isn't mounted, we want to make sure the symlink is active
          if [[ -L $DOCKER_ROOT && -d $DOCKER_ROOT ]]; then
            echo "$DOCKER_ROOT is a soft link, docker is allowed to start"
          else
            echo "$DOCKER_ROOT is not a soft link, will try to create it."
            ln -s /mnt/cache/docker /var/lib 1>/dev/null 2>&1
            if [[ $? -ne 0 ]]; then
              echo "Soft link could not be created, refusing to start docker!"
              exit 1
            else
              echo "Soft link created."
            fi
          fi
          echo "starting $BASE ..."
          if [[ -x $DOCKER ]]; then
            # If there is an old PID file (no docker running), clean it up:
            if [[ -r $DOCKER_PIDFILE ]]; then
              if ! ps axc|grep docker 1>/dev/null 2>&1; then
                echo "Cleaning up old $DOCKER_PIDFILE."
                rm -f $DOCKER_PIDFILE
              fi
            fi
            nohup $UNSHARE --propagation slave -- $DOCKER -p $DOCKER_PIDFILE $DOCKER_OPTS >>$DOCKER_LOG 2>&1 &
          fi
        }

     

This basically checks for a mount of the docker.img file and unmounts it. Then it removes the /var/lib/docker directory and replaces it with a soft link to /mnt/cache/docker (it has to be /mnt/cache, because /mnt/user/docker is not a BTRFS filesystem, so the BTRFS driver option passed to dockerd would complain).

Afterwards docker is started as normal and everything works just as it does within the image; you can still control everything from DockerMan (the GUI). All actions are echoed into /var/log/syslog, so you're able to trace what start_docker() does there.

     

The rc.docker file will be back to default upon server reboot by design, so my next task was to automate this, but like I said I have other priorities at the moment. The automating script will also (most likely) contain sed commands to modify DockerSettings.page so it no longer includes the btrfs commands, because the docker settings page sometimes loads very slowly when those commands don't work on /var/lib/docker, since it's no longer a mountpoint for a btrfs filesystem.
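
To give an idea of the direction (just a sketch, not the finished script): unRAID runs /boot/config/go on every boot, so the modified rc.docker could simply be restored from the flash drive there, e.g.:

    # Added to /boot/config/go (the location of the custom copy is made up)
    cp /boot/config/custom/rc.docker /etc/rc.d/rc.docker
    chmod +x /etc/rc.d/rc.docker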

     

    Hope this helps you out until that time.

I know how frustrated I was, so I wanted to give you a head start.

    Edited by S1dney
    Small addition
    Link to comment

    Hi, I have the same problem. RAM gets 100% consumed, all CPUs go to 100% as per the GUI graph. What is weird though is that htop is not showing full CPU utilization.

One of the two cache SSDs is being constantly read at a speed of 200+ MB/s, and unRAID gives a hot drive warning.

    I have an unencrypted SSD cache pool, two 500GB Samsung drives. Running 6.7.2.

     

    iostat is showing that loop2 is responsible for the massive disk read.

     

    The problem goes away once I disable docker for good. Then after I start the docker service and just one container, after a few hours it goes 100% RAM, 100% CPU and full disk read speed again.

     

I tried isolating all the dockers to a single core, but it did not stop the issue; all cores are at 100%.

     

I don't think this is an isolated case; see the two topics below, which could be linked to the loop2 issue.

     

     

     

    Edited by szymon
    Link to comment
    14 minutes ago, szymon said:

    Hi, I have the same problem. RAM gets 100% consumed, all CPUs go to 100% as per the GUI graph. What is weird though is that htop is not showing full CPU utilization.

    One of the two cache SSDs is being constantly read at the speed of 200+ MB/s and unRAID gives an error of hot drive.

    I have an unencrypted SSD cache pool, two 500GB Samsung drives. Running 6.7.2.

     

    iostat is showing that loop2 is responsible for the massive disk read.

     

    The problem goes away once I disable docker for good. Then after I start the docker service and just one container, after a few hours it goes 100% RAM, 100% CPU and full disk read speed again.

     

    I tried isolating all the dockers only to one core but it did not stop the issue, all cores are 100%.

     

    I think that this is not an isolated case, you can see the below two topics which could be linked to the issue of loop2.

     

     

     

    Have you tried to delete (and thus automatically recreate) the docker.img already?

    Mine was just doing a lot of writes on disk, but none of the errors you're describing (nor the CPU and RAM usage) was happening to me.

Seems like the Reddit thread you mentioned (not sure if you created that) has csum errors, which I assume are checksum errors. This might point to corruption of your docker image; the loop2 device is also formatted with btrfs.
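
If you want to check for that, btrfs can report per-device error counters and verify all checksums, something like:

    btrfs device stats /mnt/cache     # corruption_errs / generation_errs per device
    btrfs scrub start -B /mnt/cache   # read back and verify all data and metadata (runs in the foreground)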

    Link to comment

Not my Reddit threads, but they are all recent and all related to "loop2".

     

I will recreate the docker image when I'm back home; I had to disable the reverse proxy and Guacamole, so I cannot access the VM right now :) Thanks for the tip!

    Link to comment
    Just now, szymon said:

    Not mine Reddit threads but they are all recent and all related to "loop2".

     

    I will recreate the docker image when I'm back at home, I had to disable reverse proxy and guacamole so I cannot access the VM now :) thanks for the tip!

    Sure, you're welcome.

    Let me know how that went, kind of curious ;)

    Link to comment

    Hi guys,

     

    Just to confirm @szymon that you too are running in an encrypted cache pool with btrfs, right?

     

    We will make an effort to recreate this in the lab to see what's going on.

    • Like 1
    Link to comment
    Just now, jonp said:

    Hi guys,

     

    Just to confirm @szymon that you too are running in an encrypted cache pool with btrfs, right?

     

    We will make an effort to recreate this in the lab to see what's going on.

     

    Oh never mind, apparently you are not running in an encrypted pool.  That's even more odd as we have lots of folks running btrfs raid 1 cache pools for Docker and not experiencing this issue.

    Link to comment
    26 minutes ago, jonp said:

    Hi guys,

     

    Just to confirm @szymon that you too are running in an encrypted cache pool with btrfs, right?

     

    We will make an effort to recreate this in the lab to see what's going on.

No, I'm not running an encrypted SSD cache pool, it's unencrypted. I read that some people have problems running an encrypted SSD pool, so I left it alone.

What is weird though is that I had literally zero problems running 6.7.2 for a long time. I only recently decided to encrypt the data array, and this is when the problems started. I just finished encrypting the last disk, and after the parity rebuild is done I will restart the machine to see if it fixes the issue.

For now I deleted the docker image and recreated it. I turned all the dockers on again and after a few minutes both CPU and RAM went up to 100% and the read rate from one of the two SSDs went over 200MB/s until I shut down the docker service. Then it went back to normal. I am now turning the containers on one by one to see when it crashes, and if that doesn't work I'll also try 6.8.0.

    Link to comment
    20 hours ago, jonp said:

    Hi guys,

     

    Just to confirm @szymon that you too are running in an encrypted cache pool with btrfs, right?

     

    We will make an effort to recreate this in the lab to see what's going on.

Great! Thanks for taking the time, I'm curious to see where the investigation leads! :D

I must say that I also tried decrypting my pool once and had the feeling that the excessive writes still took place (I switched back after a short while though, because running unencrypted was not an option for me).

     

    My writes never went to the level @szymon described though.

You could tweak the docker service file and move the docker files out of the docker image like I described, to see if it's really the loop device causing the issues; in my case that solved it (I'm still running directly on the cache mountpoint).

    Link to comment
    21 hours ago, szymon said:

    I read that some people have problems running encrypted SSD pool so I left it alone.

    Where'd you read that?  Should be no problem.

     

    Edit: encryption does not result in any more or any less reads or writes, dm-crypt is a "pass-through" layer.

    Link to comment
    44 minutes ago, limetech said:

    Where'd you read that?  Should be no problem.

     

    Edit: encryption does not result in any more or any less reads or writes, dm-crypt is a "pass-through" layer.

     

That's not what others found with BTRFS and encryption. They say that as soon as they switched to XFS with encryption their writes were normal, while with BTRFS and encryption their writes were insane.

    Edited by BRiT
    Link to comment
    1 hour ago, BRiT said:

     

    That's not what others found with BTRFS and Encryption. They say as soon as they switched to XFS and Encryption their writes were normal, while BTRFS and Encryption their writes were insane. 

    got link?

     

    Could be an anomaly related to how I/O is counted.  I have multiple VM's with vdisks located on encrypted cache pool and see no issues.  Quick google search doesn't really turn up anyone reporting issues with btrfs atop dm-crypt.

    Link to comment

    Yeah, it certainly shouldn't impact things one bit, so some sort of feedback-loop bug? It doesn't seem like just a bug in reporting unless it also impacts the SSD's SMART attribute Total_LBAs_Written too.

     

    Maybe this thread, in particular this post:

     

For anyone who cares... I've re-formatted the cache pool from encrypted-btrfs to just plain btrfs and the writes have dropped by almost 10x. Disk IO for [loop2] has gone from several GBs in a few minutes to maybe 1GB in a couple of hours using iotop -a. Total writes for the SSD using LBA have gone to 1GB/hour, and that is with appdata also on the same device. This will give a lifespan closer to 10 years instead of 1 year 🙂.
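
As a rough sanity check of those lifespan numbers (assuming an endurance rating in the region of 150 TBW, which is an assumption about the drive, not a figure from that post):

    # years of life = endurance (TB) / yearly writes (TB)
    echo "150 / (1  * 24 * 365 / 1000)" | bc -l   # ~17 years at 1 GB/hour
    echo "150 / (10 * 24 * 365 / 1000)" | bc -l   # ~1.7 years at 10 GB/hour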

     

     

     

     

    Link to comment
    2 minutes ago, BRiT said:

    Maybe this thread, in particular this post:

    Thanks for the link.  Maybe going through dm-crypt is changing the rate of page flushes - wouldn't be surprised.  Don't have time atm to check this...

    Link to comment

Here's what a typical hour looks like for me on an idle system (see the picture below); no VMs running, and stopping all docker containers stops all write activity.

    Drives:
    Samsung EVO 850 250GB - btrfs encrypted
    Samsung EVO 860 250GB - btrfs encrypted

Total LBA written for the EVO 860 is 113977210162 (53.07 TB), power-on hours 4973 (6m, 23d, 5h). The EVO 860 has been in no other system, just unRAID.

The amount of data written to the array in that time is roughly 3TB, so I could have written that amount to the cache more than 17 times over, which obviously didn't happen.
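
For anyone who wants to reproduce that conversion: the SMART value counts 512-byte sectors, so (drive letter is just an example):

    smartctl -A /dev/sdX | awk '/Total_LBAs_Written/ {printf "%.2f TB\n", $NF * 512 / 1024^4}'
    # 113977210162 * 512 / 1024^4 ≈ 53.07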

    God only knows what it's constantly writing to the cache disks, but it all goes into a black hole...

[attached screenshot: 2019-12-21 081022.png]

    Link to comment
    10 hours ago, limetech said:

    got link?

     

    Could be an anomaly related to how I/O is counted.  I have multiple VM's with vdisks located on encrypted cache pool and see no issues.  Quick google search doesn't really turn up anyone reporting issues with btrfs atop dm-crypt.

I spent a lot of time trying to find similar situations on Google as well, but wasn't able to find any.

The problem isn't really btrfs atop dm-crypt, as having docker write directly to the encrypted cache only produces 27 GB of daily writes, compared to the 400 GB of daily writes with docker mounted on the loop device, and that was with a lot less activity than now (I've increased the number of docker containers in the meantime).

     

    @hotio's problem matches my case exactly though.

     

    Unfortunately I'm lacking the knowledge at the moment to dive in at the level of looking into/fixing the amount of page flushes etc (like @limetech mentioned).

I was able to pinpoint it to the combination of docker, the loop device and (most likely) dm-crypt. Hence this bug report.

     

    Appreciate the efforts! 🍻

    • Thanks 1
    Link to comment
    1 hour ago, S1dney said:

    to having docker mounted on the loop device

    This is with docker.img loopback file located on cache correct?

    Link to comment
    3 minutes ago, limetech said:

    This is with docker.img loopback file located on cache correct?

    That's correct.

I initially tried setting the docker vDisk path both to the unRAID filesystem (/mnt/user/cache) and directly to the btrfs mountpoint (/mnt/cache) from within the docker settings; neither made a noticeable difference, so I was able to rule out the unRAID filesystem layer.

    Link to comment

    Hi, I managed to isolate my issue to a faulty container as described here

     

    I guess my problem is not caused by the image file mounted as a loop device. Thank you to everyone responding here and for your suggestions!

    Link to comment
    18 hours ago, szymon said:

    Hi, I managed to isolate my issue to a faulty container as described here

     

    I guess my problem is not caused by the image file mounted as a loop device. Thank you to everyone responding here and for your suggestions!

    Good to hear you have it resolved.

    Seemed like a different problem indeed :)

    Link to comment

    I have the same problem on my side as S1dney.

What I also see is that docker is extremely slow compared to 6.6.6, which is where I came from.

From my point of view, everything changed with:

    a. Update to 6.8.0

    OR

    b. Second SSD in Cache Drive

     

    The screenshot shows around 20 minutes.

     

     

[attached screenshot: Unbenannt.png]

    Edited by Karatekid
    Link to comment
    On 12/25/2019 at 3:08 AM, Karatekid said:

    I have the same problem on my side as S1dney.

    What I also see is that docker is extremly slow against 6.6.6 where I come from.

    Everything in my point of view changed with 

    a. Update to 6.8.0

    OR

    b. Second SSD in Cache Drive

     

    The screenshot shows around 20 minutes.

     

     

[attached screenshot: Unbenannt.png]

I cannot recall docker being extremely slow, but judging by the iotop screenshot this is indeed exactly the problem I was facing.

    Is that cache encrypted?

    Link to comment

It was an unencrypted cache pool with 2 single SSDs, but media and "...?!" in RAID1.

I have now changed Docker and the VMs back to UD and a single cache SSD, and everything seems to work normally again.

But it's not a fix for the problem in general.

    Link to comment

I see lots of data being written, but it's not limited to the docker.img file and its loopback device. It has something to do with the cache in some way.

In my later answers in that thread, I talk about mariadb causing 12-20GB of written data every hour. That's extreme for my light use of Nextcloud and Home Assistant (with mariadb as the db). My databases are like 100MB in total. The 12-20GB/h just disappears into some black hole.


    It's a silent SSD killer.
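
If someone wants to narrow this down to a specific container, docker itself keeps cumulative block I/O counters per container, e.g.:

    # The BLOCK I/O column shows cumulative read/write per container since it was started
    docker stats --no-stream --format "table {{.Name}}\t{{.BlockIO}}"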
     

    Edited by Niklas
    Link to comment




