• [6.8.3] docker image huge amount of unnecessary writes on cache


    S1dney
    • Solved Urgent

    EDIT (March 9th 2021):

Solved in 6.9 and up. Reformatting the cache to the new partition alignment and hosting Docker directly in a cache-only directory brought writes down to a bare minimum.

     

    ###

     

    Hey Guys,

     

First of all, I know that you're all very busy getting version 6.8 out there, something I'm very much waiting on as well. I'm seeing great progress, so thanks so much for that! I don't expect this to be at the top of the priority list, but I'm hoping someone on the developer team is willing to look into it (perhaps after the release).

     

    Hardware and software involved:

2 x 1TB Samsung EVO 860, set up with LUKS encryption in a BTRFS RAID1 pool.

     

    ###

TLDR (but I'd suggest reading on anyway 😀)

The image file mounted as a loop device is causing massive writes on the cache, potentially wearing out SSDs quite rapidly.

This appears to only happen on encrypted caches formatted with BTRFS (maybe only in a RAID1 setup, but I'm not sure).

    Hosting the Docker files directory on /mnt/cache instead of using the loopdevice seems to fix this problem.

A possible idea for implementation is proposed at the bottom.

     

    Grateful for any help provided!

    ###

     

I have written a topic in the general support section (see link below), but I have done a lot of research lately and think I have gathered enough evidence pointing to a bug. I was also able to build a (kind of) workaround for my situation. More details below.

     

So to see what was actually hammering the cache I started with all the obvious things, like using a lot of find commands to trace files that were being written to every few minutes, and I also used the File Activity plugin. Neither was able to trace down any writes that would explain 400 GB worth of writes a day for just a few containers that aren't even that active.
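For reference, this was roughly the kind of thing I was running to look for recently written files (just a sketch, the exact invocations varied):

```
# files on the cache modified within the last 10 minutes
find /mnt/cache -type f -mmin -10 -ls

# accumulated I/O per process (-a accumulates totals, -o only shows processes actually doing I/O)
iotop -ao
```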

     

Digging further I moved the docker.img to /mnt/cache/system/docker/docker.img, so directly on the BTRFS RAID1 mountpoint. I wanted to check whether the unRAID FS layer was causing the loop2 device to write this heavily. No luck either.

This gave me a situation I was able to reproduce in a virtual machine though, so I started with a recent Debian install (I know, it's not Slackware, but I had to start somewhere ☺️). I created some vDisks, encrypted them with LUKS, bundled them in a BTRFS RAID1 setup, created the loop device on the BTRFS mountpoint (the same as on the unRAID cache) and mounted it on /var/lib/docker. I made sure I had the NoCOW flag set on the IMG file like unRAID does. Strangely this did not show any excessive writes; iotop showed really healthy values for the same workload (I migrated the docker content over to the VM).
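For anyone wanting to reproduce this outside unRAID, the Debian test setup looked roughly like this (a sketch only; device names, sizes and paths are examples, not the exact commands I ran):

```
# encrypt two vDisks with LUKS and open them
cryptsetup luksFormat /dev/vdb && cryptsetup open /dev/vdb crypt1
cryptsetup luksFormat /dev/vdc && cryptsetup open /dev/vdc crypt2

# build a BTRFS RAID1 pool on the two encrypted devices and mount it
mkfs.btrfs -m raid1 -d raid1 /dev/mapper/crypt1 /dev/mapper/crypt2
mkdir -p /mnt/cache && mount /dev/mapper/crypt1 /mnt/cache

# create a sparse image, set the NoCOW flag (as unRAID does),
# format it with BTRFS and loop-mount it as docker's data root
mkdir -p /mnt/cache/system/docker /var/lib/docker
truncate -s 20G /mnt/cache/system/docker/docker.img
chattr +C /mnt/cache/system/docker/docker.img
mkfs.btrfs /mnt/cache/system/docker/docker.img
mount -o loop /mnt/cache/system/docker/docker.img /var/lib/docker
```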

     

After my Debian troubleshooting I went back to the unRAID server, wondering whether the loop device was being created weirdly, so I took the exact same steps to create a new image and pointed the settings in the GUI there. Still the same write issues.

     

Finally I decided to take the whole image out of the equation and took the following steps (a rough command sketch follows the list):

    - Stopped docker from the WebGUI so unRAID would properly unmount the loop device.

    - Modified /etc/rc.d/rc.docker to not check whether /var/lib/docker was a mountpoint

    - Created a share on the cache for the docker files

    - Created a softlink from /mnt/cache/docker to /var/lib/docker

- Started docker using "/etc/rc.d/rc.docker start"

- Started my Bitwarden containers.
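The workaround itself boils down to something like this (a sketch of the steps above; the edit to /etc/rc.d/rc.docker itself isn't shown, and /var/lib/docker is assumed to be an empty directory once the loop device is unmounted):

```
/etc/rc.d/rc.docker stop                  # or stop Docker from the WebGUI

mkdir -p /mnt/cache/docker                # the cache-only share holding the docker files
rmdir /var/lib/docker                     # remove the now-empty mountpoint...
ln -s /mnt/cache/docker /var/lib/docker   # ...and replace it with a softlink to the cache

/etc/rc.d/rc.docker start
docker start bitwarden                    # container name is just an example
```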

     

Looking at the stats with "iotop -ao" I did not see any excessive writing taking place anymore.

I had the containers running for about 3 hours and got maybe 1GB of writes total (note that with the loop device this gave me 2.5GB every 10 minutes!).

     

Now don't get me wrong, I understand why the loop device was implemented. Dockerd is started with options to make it run with the BTRFS driver, and since the image file is formatted with the BTRFS filesystem this works on every setup; it doesn't even matter whether it runs on XFS, EXT4 or BTRFS, it will just work. In my case I had to point the softlink to /mnt/cache, because pointing it to /mnt/user would not allow me to keep using the BTRFS driver (obviously the unRAID user share filesystem isn't BTRFS). Also, the WebGUI has commands to scrub the filesystem inside the image; it's all based on the assumption that everyone is running docker on BTRFS (which of course they are, because of the image 😁).
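A couple of standard commands to confirm what I'm describing here (nothing unRAID-specific):

```
# show which storage driver dockerd is using; on the stock unRAID setup this prints "btrfs"
docker info --format '{{.Driver}}'

# show the docker.img file attached as a loop device (loop2 in my case)
losetup -l | grep docker.img
```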

I must say that my approach also broke when I changed something in the shares: certain services get restarted, causing docker to be turned off for some reason. No big issue since it wasn't meant to be a long-term solution, just to see whether the loop device was causing the issue, which I think my tests did point out.

     

Now I'm at the point where I would definitely need some developer help. I'm currently keeping nearly all docker containers off all day, because 300-400GB worth of writes a day is just a BIG waste of expensive flash storage, especially since I've pointed out that it's not needed at all. It does defeat the purpose of my NAS and SSD cache though, since its main purpose was hosting docker containers while allowing the HDDs to spin down.

     

Again, I'm hoping someone on the dev team acknowledges this problem and is willing to investigate. I did get quite a few hits on the forums and reddit, but without anyone actually pointing out the root cause of the issue.

     

I'm missing the technical know-how to troubleshoot the loop device issue on a lower level, but I have been thinking about possible ways to implement a workaround, like adjusting the Docker settings page with a switch to turn off the use of a vDisk and, if all requirements are met (pointing to /mnt/cache and BTRFS formatted), start docker on a share on the /mnt/cache partition instead of using the vDisk.

That way you would still keep all the advantages of the docker.img file (cross-filesystem type) and users who don't care about the writes could still use it, but you'd be massively helping out others who are concerned about these writes.
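To illustrate the idea (purely hypothetical pseudocode, not actual unRAID code; the variable names are made up):

```
# hypothetical sketch of the proposed toggle in rc.docker
if [[ "$DOCKER_BACKING" == "directory" && "$DOCKER_PATH" == /mnt/cache/* ]]; then
  # requirements met: a cache-only path on the BTRFS-formatted pool
  mount --bind "$DOCKER_PATH" /var/lib/docker
else
  # current behaviour: loop-mount the vDisk
  mount -o loop "$DOCKER_IMAGE" /var/lib/docker
fi
```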

     

I'm not attaching diagnostic files since they would probably not show what's needed.

Also, if this should have been in feature requests, I'm sorry. But I feel that, since the current solution is misbehaving in terms of writes, this could also be placed in the bug report section.

     

Thanks though for this great product, I have been using it with a lot of joy so far!

I'm just hoping we can solve this one so I can keep all my dockers running without the cache wearing out quickly.

     

    Cheers!

     

    • Like 3
    • Thanks 17



    User Feedback

    Recommended Comments



Hey everyone, I wanted to report that I believe I'm seeing this bug on a 6.9.2 Unraid box.

I had a cache pool of two 480 GB SSDs in RAID 1 that stopped working, which I believe was due to excessive writes. I replaced the hardware just this morning and put only the `appdata`, `domains` and `system` shares on the cache using the setting `prefer`.

Being concerned about the number of writes, I checked these, and with the server being online for ~26 minutes the cache has already experienced 110,519 writes (~55,000 per disk).

    Installing `iotop` with Nerdpack allowed me to run `iotop -ao` which showed that `[loop2]` is responsible for the majority of the writes.

     

```
Linux 5.10.28-Unraid.
root@tower:~# tmux new -s cache

Total DISK READ :       0.00 B/s | Total DISK WRITE :       0.00 B/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:       0.00 B/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
13149 be/0 root        564.00 K    135.61 M  0.00 %  0.44 % [loop2]
```

I've read that some people have made their cache drives unencrypted and experienced fewer writes. That's not something I'd like to do...

     

I searched online for any advice on how to fix this and found these threads:

    which pointed me to this bug report.

    Any advice on how to resolve this?

    Thanks,

    Greg

    Edited by GrehgyHils
    Link to comment

I ended up converting to a single XFS-formatted drive a while back and it greatly reduced the number of writes. Still inflated, but acceptable.

     

Interestingly, I am actually doing a test today of putting it back to BTRFS and seeing what the writes are. It will take a few days to know the results though.

     

You need to log the LBAs written by the SSD to see the real-world writes. Anything else is only a rough guess.
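For example, with smartctl (the attribute name varies by vendor; Samsung drives report Total_LBAs_Written, typically in 512-byte units):

```
smartctl -A /dev/sdb | grep -i lbas_written
# bytes written ≈ Total_LBAs_Written * 512
```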

    Edited by TexasUnraid
    Link to comment

    That's unfortunate to hear. Can you share your results of going back to BTRFS when you have them in a few days?

    Also, what's the thought process behind going to XFS? Additionally, how many cache drives did you have when you were using BTRFS?

    Link to comment
    Just now, GrehgyHils said:

    That's unfortunate to hear. Can you share your results of going back to BTRFS when you have them in a few days?

    Also, what's the thought process behind going to XFS? Additionally, how many cache drives did you have when you were using BTRFS?

If you read back earlier in this thread I go into extensive detail on my testing. I tried everything. Multiple drives made the issue worse; I tried a single BTRFS drive as well, but it was still 50-100x more writes than XFS IIRC.

     

They made some changes in 6.9 that are supposed to help with that, hence why I am testing again. I still expect more writes than XFS, just not sure whether it will be acceptable or not.

     

Some had success changing from a docker image to a docker folder, I might give that a try as well.

    Link to comment
    Just now, TexasUnraid said:

Some had success changing from a docker image to a docker folder,

    This will "solve it" as only the individual (and tiny) files are updated and not "huge" parts of the docker.img file.

    Link to comment

I still see problems with how much space the docker folder uses, as I noted in this thread before. The Unraid GUI does not show the same space used as when you check the docker folder size in the terminal.

    Link to comment
    3 minutes ago, mgutt said:

    This will "solve it" as only the individual (and tiny) files are updated and not "huge" parts of the docker.img file.

Yeah, that's the theory, but when I tested it in the past I didn't see much of a difference, so I went back to the image. I think the system you are using and the type of writes play a role as well.

     

I had 32GB of RAM (rocking 256GB now lol), and increased my dirty writes so that the RAM cache was used more; this had a big effect and I think minimized the effects of moving to a folder.

    Edited by TexasUnraid
    Link to comment
    1 minute ago, Niklas said:

I still see problems with how much space the docker folder uses, as I noted in this thread before. The Unraid GUI does not show the same space used as when you check the docker folder size in the terminal.

     

I fixed this by running a BTRFS scrub of the docker image, and then I have this run a few minutes after array start:

     

    fstrim -av

     

    This knocks the space used down to the actual amount for me.
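In other words, something like this in a script run shortly after array start (a sketch; the scrub target here assumes the docker image is mounted at /var/lib/docker):

```
btrfs scrub start -B /var/lib/docker   # -B waits for the scrub to finish
fstrim -av                             # trim all mounted filesystems that support it
```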

    Link to comment
    1 minute ago, TexasUnraid said:

     

I fixed this by running a BTRFS scrub of the docker image, and then I have this run a few minutes after array start:

     

    
    fstrim -av

     

    This knocks the space used down to the actual amount for me.

     

    Interesting. I'll try that. Thanks 

    Link to comment
    42 minutes ago, TexasUnraid said:

     

I fixed this by running a BTRFS scrub of the docker image, and then I have this run a few minutes after array start:

     

    
    fstrim -av

     

    This knocks the space used down to the actual amount for me.


    No change here. Running trim every night.

[Screenshot attachment: terminal output for /mnt/cache/system/docker]

     

du for /mnt/cache says 89GB used but df says only 21GB used (df is probably correct).
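(For reference, the BTRFS-specific tools usually give more meaningful numbers here than plain du/df; these are standard commands:)

```
du -sh /mnt/cache                  # classic du, counts shared/reflinked extents once per file
df -h /mnt/cache                   # free/used as reported by the filesystem
btrfs filesystem df /mnt/cache     # allocation split into data / metadata / system
btrfs filesystem du -s /mnt/cache  # BTRFS-aware du that accounts for shared extents
```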

    Link to comment

    what is the output from the fstrim -av command?

     

I tried those commands; I get 55G on the docker size, which is about the correct amount used.

     

    /var/lib/docker shows 72g total and 55g used, also correct

     

    My cache drive has other stuff on it but it says 111g used in df and 119g in the GUI.

     

    So off a bit but not enough to bother me.

    Link to comment
    5 minutes ago, TexasUnraid said:

    what is the output from the fstrim -av command?

     

I tried those commands; I get 55G on the docker size, which is about the correct amount used.

     

    /var/lib/docker shows 72g total and 55g used, also correct

     

    My cache drive has other stuff on it but it says 111g used in df and 119g in the GUI.

     

    So off a bit but not enough to bother me.


    fstrim -av
    /mnt/cache: 111.9 GiB (120167636992 bytes) trimmed on /dev/mapper/sdb1

/var/lib/docker shows the same size as /mnt/cache/system/docker.
I use nowhere near 78G. Before the change to a directory my docker.img was 30G with about 50% used.

I have just ignored this discrepancy. I have a lot fewer writes to my SSDs now, and with SSDs rated at over 400 TBW it feels OK. What worries me is that the differing size reports could cause problems in the future. Or not.

    Link to comment

    oh, you are not using the docker image? That is why trim is not working lol.

     

I think the reason the space shows as more is that a lot of the files in docker have empty space for expansion. In the BTRFS image the trim command seems to condense this space, but I am guessing that would not apply to a folder.

    Link to comment
    1 hour ago, TexasUnraid said:

Yeah, that's the theory, but when I tested it in the past I didn't see much of a difference

Ok, the best method is to avoid the writes altogether. That's why I disabled HEALTHCHECK for all my containers.
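Docker's standard way to do this is the --no-healthcheck flag, which on unRAID can typically be added to a container's "Extra Parameters" field. A minimal example (container name and image are just placeholders):

```
docker run -d --name example --no-healthcheck nginx
```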

     

    1 hour ago, TexasUnraid said:

    increased my dirty writes

    You mean the time (vm.dirty_expire_centisecs) until the dirty writes are written to the disk and not the size?

    Link to comment
    4 minutes ago, mgutt said:

Ok, the best method is to avoid the writes altogether. That's why I disabled HEALTHCHECK for all my containers.

     

    You mean the time (vm.dirty_expire_centisecs) until the dirty writes are written to the disk and not the size?

     

Interesting, can you explain the HEALTHCHECK? I have not heard about that, but it would explain why I seem to get way more writes than I think I should.

     

Yes, both the time and the size. I upped centisecs to 12000 and used Tips and Tweaks to increase the dirty write level to 50% IIRC, back when I had 32GB of RAM. Now with 256GB it has plenty no matter what lol.

     

    Come to think of it I am not sure if I re-evaluated the centisecs time after all that testing. Think I was going to reduce it some. This was the sweet spot I found during testing though.
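For reference, these are the standard Linux sysctls being discussed, with the values mentioned above (my assumption is that Tips and Tweaks sets vm.dirty_ratio; adjust to taste):

```
sysctl -w vm.dirty_expire_centisecs=12000   # keep dirty pages in RAM for up to 120 s before writeback
sysctl -w vm.dirty_ratio=50                 # let dirty pages grow to 50% of RAM before forcing writeback
```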

    Edited by TexasUnraid
    Link to comment

Very interesting information. Dang, I wish someone would come up with a docker editor plugin that would allow editing of multiple dockers at once (adding the --no-healthcheck or in particular changing the mapped paths). I have like 50 dockers and it would make things way easier lol.

     

    Guess I will set aside an hour sometime to add that command in.

     

I know the go file exists, but honestly I don't know how it works and prefer not to mess with things that "deep" in the system when I know just enough to get myself into trouble lol.

    Link to comment

I'm amazed that after being 'fixed' in 6.9, this is still an issue.

So is the new advice now to 'fix' it by using Docker folder paths instead of the IMG, at least until the next 'fix' comes around?

    Edited by boomam
    Link to comment

Still testing; hard to say since I have been doing a lot of oddball stuff at this point, but it appears to be more writes than XFS, though not by the 10-100x it was before. Could be a week or so before it calms down enough to know the actual results.

     

    Might try a folder after this.

    Link to comment
    3 hours ago, boomam said:

I'm amazed that after being 'fixed' in 6.9, this is still an issue.

So is the new advice now to 'fix' it by using Docker folder paths instead of the IMG, at least until the next 'fix' comes around?

    This is something which needs to be solved by docker:

    https://github.com/moby/moby/issues/42200

    • Thanks 1
    Link to comment
    4 hours ago, mgutt said:

    Which is absolutely normal as BTRFS is a copy-on-write filesystem with huge write amplification.

     

    Yeah, I expect higher, the question is how much.

     

    I have been doing way too much other stuff to get a clear picture at this point. Once I finish this work and it goes back to normal I will then give a folder a try.

     

Am I correct in thinking that I can just save the image, try the folder, and then switch back to the image later without having to rebuild it?

    Edited by TexasUnraid
    Link to comment
    25 minutes ago, TexasUnraid said:

without having to rebuild it?

I don't know, but who cares? It contains only docker-related files. Rebuilding is done quickly.

     

    • Thanks 1
    Link to comment

Well, when you have almost 100GB of dockers it might not be so fast lol.

     

    I tend to install any docker that looks interesting just to have it on hand. Although I only actually have like 15 running most of the time.

    Link to comment
    35 minutes ago, TexasUnraid said:

Well, when you have almost 100GB of dockers it might not be so fast lol.

    Which container has a size of more than 1GB?!

     

    And most of them re-use the same packages. 100GB is really crazy ^^

    • Thanks 1
    Link to comment




