• [6.8.3] docker image huge amount of unnecessary writes on cache


    S1dney
    • Urgent

    Hey Guys,

     

    First of all, I know that you're all very busy getting version 6.8 out there, something I'm very much waiting on as well. I'm seeing great progress, so thanks so much for that! I don't expect this to be at the top of the priority list, but I'm hoping someone on the developer team is willing to invest some time in it (perhaps after the release).

     

    Hardware and software involved:

    2 x 1TB Samsung EVO 860, set up with LUKS encryption in a BTRFS RAID1 pool.

     

    ###

    TLDR (but I'd suggest to read on anyway 😀)

    The image file mounted as a loop device is causing massive writes on the cache, potentially wearing out SSDs quite rapidly.

    This appears to happen only on encrypted caches formatted with BTRFS (maybe only in a RAID1 setup, but I'm not sure).

    Hosting the Docker files directory on /mnt/cache instead of using the loop device seems to fix this problem.

    A possible idea for an implementation is proposed at the bottom.

     

    Grateful for any help provided!

    ###

     

    I have written a topic in the general support section (see link below), but I have done a lot of research lately and think I have gathered enough evidence pointing to a bug. I was also able to build a (kind of) workaround for my situation. More details below.

     

    So to see what was actually hammering the cache I started with all the obvious things, like using a lot of find commands to trace files that were written to every few minutes, and I also used the fileactivity plugin. Neither was able to track down any writes that would explain 400 GB worth of writes a day for just a few containers that aren't even that active.

     

    Digging further, I moved the docker.img to /mnt/cache/system/docker/docker.img, so directly on the BTRFS RAID1 mountpoint. I wanted to check whether the unRAID FS layer was causing the loop2 device to write this heavily. No luck either.

    This gave me a situation I was able to reproduce on a virtual machine though, so I started with a recent Debian install (I know, it's not Slackware, but I had to start somewhere ☺️). I created some vDisks, encrypted them with LUKS, bundled them in a BTRFS RAID1 setup, created the loop device on the BTRFS mountpoint (same as /dev/cache) and mounted it on /var/lib/docker. I made sure I had the NoCOW flag set on the IMG file like unRAID does. Strangely, this did not show any excessive writes; iotop showed really healthy values for the same workload (I migrated the docker content over to the VM).

     

    After my Debian troubleshooting I went back to the unRAID server, wondering whether the loop device was being created in some odd way, so I took the exact same steps to create a new image and pointed the settings in the GUI to it. Still the same write issues.

     

    Finally I decided to put the whole image out of the equation and took the following steps:

    - Stopped Docker from the WebGUI so unRAID would properly unmount the loop device.

    - Modified /etc/rc.d/rc.docker to not check whether /var/lib/docker was a mountpoint.

    - Created a share on the cache for the Docker files.

    - Created a softlink at /var/lib/docker pointing to /mnt/cache/docker.

    - Started Docker using "/etc/rc.d/rc.docker start".

    - Started my Bitwarden containers.
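
    The link direction in the steps above can be sketched in Python. This is only an illustration under assumptions: the helper name link_docker_root is made up, the demo runs against temporary stand-ins for /mnt/cache/docker and /var/lib/docker, and it covers neither the rc.docker change nor starting Docker.

```python
import os
import tempfile


def link_docker_root(data_dir: str, link_path: str) -> None:
    """Make link_path a symlink to data_dir (hypothetical helper).

    An existing symlink is replaced. An existing empty directory (such as a
    former mountpoint) is removed first; a non-empty one raises OSError.
    """
    os.makedirs(data_dir, exist_ok=True)
    if os.path.islink(link_path):
        os.unlink(link_path)
    elif os.path.isdir(link_path):
        os.rmdir(link_path)  # fails on purpose if the directory holds data
    os.symlink(data_dir, link_path)


# Demo with stand-in paths so nothing on a real system is touched.
with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "mnt", "cache", "docker")   # stands in for /mnt/cache/docker
    link_path = os.path.join(root, "var", "lib", "docker")    # stands in for /var/lib/docker
    os.makedirs(link_path)  # the empty former mountpoint
    link_docker_root(data_dir, link_path)
    print("is symlink:", os.path.islink(link_path))
    print("points to data dir:", os.path.realpath(link_path) == os.path.realpath(data_dir))
```

    The helper refuses to remove a non-empty directory, so an existing Docker directory would have to be moved out of the way by hand first.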

     

    Looking into the stats with "iotop -ao" I did not see any excessive writing taking place anymore.

    I had the containers running for about 3 hours and got maybe 1GB of writes in total (note that on the loop device this gave me 2.5GB every 10 minutes!)
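
    To put those two observations on the same scale, here is a rough extrapolation to a full day. It assumes the write rate stays constant, which a real workload will not do exactly; the function name is just for illustration.

```python
def daily_writes_gb(gb_written: float, minutes: float) -> float:
    """Scale an observed write volume to a 24-hour day (constant rate assumed)."""
    minutes_per_day = 24 * 60
    return gb_written * minutes_per_day / minutes


loop_device = daily_writes_gb(2.5, 10)    # loop device: 2.5 GB every 10 minutes
direct_dir = daily_writes_gb(1.0, 180)    # directory on the cache: about 1 GB in 3 hours

print(f"loop device: {loop_device:.0f} GB/day")
print(f"direct dir:  {direct_dir:.0f} GB/day")
```

    The loop device figure comes out at 360 GB per day, which is in line with the 300/400 GB a day seen on the server, against roughly 8 GB per day for the plain directory.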

     

    Now don't get me wrong, I understand why the loop device was implemented. Dockerd is started with options to make it run with the BTRFS driver, and since the image file is formatted with the BTRFS filesystem this works on every setup; it doesn't matter whether it runs on XFS, EXT4 or BTRFS, it will just work. In my case I had to point the softlink to /mnt/cache because pointing it to /mnt/user would not allow me to start Docker using the BTRFS driver (obviously the unRAID filesystem isn't BTRFS). Also, the WebGUI has commands to scrub the filesystem inside the container; everything is based on the assumption that everyone is using Docker on BTRFS (which of course they are because of the container 😁)

    I must say that my approach also broke when I changed something in the shares: certain services get restarted, causing Docker to be turned off for some reason. No big issue, since it wasn't meant to be a long-term solution, just a way to see whether the loop device was causing the issue, which I think my tests did show.

     

    Now I'm at the point where I definitely need some developer help. I'm currently keeping nearly all Docker containers off all day because 300/400GB worth of writes a day is just a BIG waste of expensive flash storage, especially since I've shown that it's not needed at all. It does defeat the purpose of my NAS and SSD cache though, since its main purpose was hosting Docker containers while allowing the HDs to spin down.

     

    Again, I'm hoping someone on the dev team acknowledges this problem and is willing to invest time in it. I did get quite a few hits on the forums and reddit, but none where someone actually pointed out the root cause of the issue.

     

    I'm missing the technical know-how to troubleshoot the loop device issues at a lower level, but I have been thinking about possible ways to implement a workaround, like adjusting the Docker Settings page to switch off the use of a vDisk and, if all requirements are met (pointing to /mnt/cache and BTRFS formatted), starting Docker on a share on the /mnt/cache partition instead of using the vDisk.

    In this way you would still keep all the advantages of the docker.img file (cross filesystem type) and users who don't care about writes could still use it, but you'd be massively helping out others who are concerned about these writes.
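
    A minimal sketch of what such a requirements check could look like, purely as an illustration of the idea and not how unRAID does or would implement it. The function names are hypothetical, the mount table is passed in as text in the /proc/mounts layout, and the sample entries (device names included) are made up.

```python
from pathlib import PurePosixPath
from typing import Optional


def filesystem_for(path: str, mounts_text: str) -> Optional[str]:
    """Return the fs type of the deepest mount point that contains path.

    mounts_text uses the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    target = PurePosixPath(path)
    best_depth, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point = PurePosixPath(fields[1])
        if target == mount_point or mount_point in target.parents:
            depth = len(mount_point.parts)
            if depth >= best_depth:  # later entries shadow earlier ones
                best_depth, best_type = depth, fields[2]
    return best_type


def can_use_directory(docker_dir: str, mounts_text: str,
                      cache_root: str = "/mnt/cache") -> bool:
    """True if docker_dir is under cache_root and sits on btrfs."""
    target = PurePosixPath(docker_dir)
    root = PurePosixPath(cache_root)
    on_cache = target == root or root in target.parents
    return on_cache and filesystem_for(docker_dir, mounts_text) == "btrfs"


# Made-up mount table for the demo; a real check would read /proc/mounts.
sample_mounts = """\
rootfs / rootfs rw 0 0
/dev/mapper/sdb1 /mnt/cache btrfs rw,noatime 0 0
shfs /mnt/user fuse.shfs rw,nosuid 0 0
"""

print(can_use_directory("/mnt/cache/docker", sample_mounts))
print(can_use_directory("/mnt/user/docker", sample_mounts))
```

    With the sample table the first path qualifies and the /mnt/user path does not, matching the observation that the BTRFS driver would not start through /mnt/user.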

     

    I'm not attaching diagnostic files since they would probably not show what's needed.

    Also, if this should have been in feature requests, I'm sorry. But I feel that, since the solution is misbehaving in terms of writes, this could also be placed in the bug report section.

     

    Thanks though for this great product, I have been using it so far with a lot of joy!

    I'm just hoping we can solve this one so I can keep all my dockers running without the cache wearing out quickly.

     

    Cheers!

     



    User Feedback

    Recommended Comments



    15 minutes ago, Dephcon said:

    blkdiscard /dev/sdX or blkdiscard /mnt/cache?

    blkdiscard /dev/sdX  # that is, on the raw device

     

    16 minutes ago, Dephcon said:

    can i assume space_cache=v2 is being used for the testing/default in an upcoming release?

    correct.  That is default now.

    1 hour ago, limetech said:

    blkdiscard /dev/sdX  # that is, on the raw device

     

    correct.  That is default now.

    i might have to install beta25 sometime this week as I'm very curious now lol

    Edited by Dephcon
    5 hours ago, limetech said:

    The loopback approach is much better from the standpoint of data management.  Once you have a directory dedicated to Docker engine, it's almost impossible to move it to a different volume, especially for the casual user.

    Is upgrading to 6.9.0-beta25 all that is needed in order to fix this bug? I see the change log says this issue has been fixed. I currently have encrypted xfs on my array and my cache drive, but iotop -oa still shows loop2 writing excessively. I'm assuming I am just missing something obvious here....

     

    I see one of the recommendations by @testdasi was to recreate the img as docker-xfs.img as a workaround.
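
    For anyone who wants a number for loop2 without iotop, the kernel's cumulative counters in /proc/diskstats can be compared between two points in time. A minimal sketch, assuming the standard diskstats layout (the tenth field is sectors written, counted in 512-byte sectors); the helper names and the sample lines are made up.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def bytes_written(diskstats_text: str, device: str) -> int:
    """Return the cumulative bytes written for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # major, minor, name, reads completed, reads merged, sectors read,
        # ms reading, writes completed, writes merged, sectors written, ...
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9]) * SECTOR_BYTES
    raise KeyError(device)


def written_between(before: str, after: str, device: str) -> int:
    """Bytes written to device between two diskstats snapshots."""
    return bytes_written(after, device) - bytes_written(before, device)


# Made-up snapshots; on a real system read /proc/diskstats twice instead.
snap_before = "   7       2 loop2 100 0 8000 50 2000 0 1000000 900 0 800 950"
snap_after = "   7       2 loop2 120 0 9000 60 2600 0 1500000 990 0 850 1050"

delta = written_between(snap_before, snap_after, "loop2")
print(f"loop2 wrote {delta / 1e6:.0f} MB between snapshots")
```

    Taking the two snapshots ten minutes apart gives a figure that can be compared directly with the numbers earlier in this thread.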


    9-month-old 2x 1TB Silicon Power NVMe in a BTRFS RAID1.

     

    Data units written: 395,293,581 [202 TB]

    Data units read: 107,756,069 [55.1 TB]

     

    2-3 VMs

    15+ dockers

    lots of unpacking on the drives, however 200 TB does seem excessive...
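
    As a sanity check on those SMART figures: NVMe reports data units in thousands of 512-byte blocks, so the raw counters can be converted by hand. The per-day average below assumes roughly 274 days for "9 months", which is an estimate and not a value from the post.

```python
DATA_UNIT_BYTES = 512 * 1000  # one NVMe SMART data unit = 1000 blocks of 512 bytes


def data_units_to_tb(units: int) -> float:
    """Convert NVMe SMART data units to decimal terabytes."""
    return units * DATA_UNIT_BYTES / 1e12


written_tb = data_units_to_tb(395_293_581)
read_tb = data_units_to_tb(107_756_069)

ASSUMED_DAYS = 274  # assumption: about 9 months of use
written_per_day_gb = written_tb * 1000 / ASSUMED_DAYS

print(f"written: {written_tb:.2f} TB")
print(f"read:    {read_tb:.2f} TB")
print(f"average: {written_per_day_gb:.0f} GB written per day (assumed {ASSUMED_DAYS} days)")
```

    The converted totals agree with the bracketed values above (about 202 TB written and 55 TB read).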

     

     


    Did a test with a Windows VM to see if there was a difference with the new partition alignment. Total bytes written after 16 minutes (VM idling, doing nothing, not even connected to the internet):

     

    space_cache=v1, old alignment - 7.39GB

    space_cache=v2, old alignment - 1.72GB

    space_cache=v2, new alignment - 0.65GB

     

    So that's encouraging, though I guess that, unlike the v2 space cache, the new alignment might work better for some NVMe devices and not make much difference for others. Still worth testing IMHO, since for some it should also give better performance. For this test I used an Intel 600p.
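
    Expressed as reduction factors relative to the first configuration, using only the three measurements above (one 16-minute run each on one device, so treat the ratios as indicative only):

```python
results_gb = {
    "space_cache=v1, old alignment": 7.39,
    "space_cache=v2, old alignment": 1.72,
    "space_cache=v2, new alignment": 0.65,
}


def reduction_factors(results: dict, baseline_key: str) -> dict:
    """Return how many times fewer bytes each run wrote than the baseline."""
    baseline = results[baseline_key]
    return {name: baseline / gb for name, gb in results.items()}


factors = reduction_factors(results_gb, "space_cache=v1, old alignment")
for name, factor in factors.items():
    print(f"{name}: {results_gb[name]:.2f} GB, {factor:.1f}x fewer writes than baseline")

# Effect of the alignment change alone, with space_cache=v2 in both runs.
alignment_only = results_gb["space_cache=v2, old alignment"] / results_gb["space_cache=v2, new alignment"]
print(f"alignment alone: {alignment_only:.1f}x")
```

    In this run space_cache=v2 alone cut writes by about 4.3x, the new alignment added a further 2.6x or so, and the two together came to roughly 11.4x.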

     

     

     

     

     


    interesting that the alignment would make that much of a difference, anyone have a technical explanation on why this is?

    1 hour ago, TexasUnraid said:

    interesting that the alignment would make that much of a difference, anyone have a technical explanation on why this is?

    Here is a nice explanation why alignment is important:
    https://www.minitool.com/lib/4k-alignment.html
    https://www.thomas-krenn.com/en/wiki/Partition_Alignment_detailed_explanation

    However, I did not find a written reference for why 1MiB. I did, however, work on the development of a Linux box with internal eMMC a few years ago, and I remember that the internal controller had a very large erase block size, something between 1-3 MiB.

    This may explain it: if the FS is not aligned with the erase block size, the amount of data written to the flash increases. There is a brief mention of it here:
    https://www.anandtech.com/show/14543/nvme-14-specification-published#:~:text=More Block Size and Alignment,block sizes measured in megabytes.
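
    The misalignment effect can be illustrated by counting how many blocks a single write touches. The 1 MiB block size and the 32 KiB offset below are assumed values for illustration only; real erase block sizes are internal to the drive and vary.

```python
MIB = 1024 * 1024
KIB = 1024


def blocks_touched(offset: int, length: int, block_size: int) -> int:
    """Count the fixed-size blocks overlapped by a write of length bytes at offset."""
    if length <= 0:
        return 0
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return last - first + 1


# A write exactly one block long, starting on a block boundary.
print(blocks_touched(0, MIB, MIB))
# The same write shifted by 32 KiB straddles two blocks.
print(blocks_touched(32 * KIB, MIB, MIB))
```

    The aligned write stays within one block while the shifted one spills into a second, which is the extra work an unaligned filesystem can cause for the flash controller.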
     

    53 minutes ago, thecode said:

    why 1MiB

    Because that's what Microsoft chose.

     

    Good reference to other references:

    https://superuser.com/questions/1483928/why-do-windows-and-linux-leave-1mib-unused-before-first-partition

     

    Theoretically partitions should be aligned on SSD "erase block size", eg:

    https://superuser.com/questions/1243559/is-partition-alignment-to-ssd-erase-block-size-pointless

     

    However "erase block size" is an internal implementation detail of an SSD device and the value is not commonly exported by any transfer protocol.  You can write a program to maybe figure it out:

    https://superuser.com/questions/728858/how-to-determine-ssds-nand-erase-block-size

     

    But, referring back to "that's what Microsoft chose" - SSD designers are going to make sure their products work well with Windows, and they know how Microsoft aligns partitions.  Hence, pretty sure trying to figure out exact alignment is pointless IMHO.
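
    Checking whether a partition starts on a 1MiB boundary is simple arithmetic. A small sketch, assuming start positions given in 512-byte sectors; the start sectors 2048 and 64 are example values chosen for illustration.

```python
MIB = 1024 * 1024


def is_aligned(start_sector: int, sector_bytes: int = 512, boundary: int = MIB) -> bool:
    """True if the partition's starting byte offset is a multiple of boundary."""
    return (start_sector * sector_bytes) % boundary == 0


for start in (2048, 64):  # example start sectors, not taken from this thread
    offset_kib = start * 512 // 1024
    print(f"start sector {start}: offset {offset_kib} KiB, 1MiB aligned: {is_aligned(start)}")
```

    On Linux the start sector of a partition can usually be read from /sys/block/<disk>/<partition>/start, which is expressed in 512-byte sectors.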


    Thanks for the explanation, makes total sense, just never really chased that tail to the conclusion before.

     

    Just ordered a drive to use for parity so looking forward to 6.9 and being able to move dockers back onto the cache.


    Ok, so as I understand it the fix for the excessive writes is the combo of space_cache=v2 and the new alignment.

     

    The space_cache=v2 I can do now but it will still be roughly 2x the writes without the alignment fix.

     

    The alignment fix can not be done until 6.9 as unraid will not recognize the partition.

     

    Am I on the right track there?

     

    Is there any way to use the new alignment with unraid 6.8?

     

    I just got a drive to use for parity but that means I need to remove the SSD from the array that is currently formatted XFS and has docker/appdata.

     

    Debating options now.

    18 minutes ago, TexasUnraid said:

    I need to remove the SSD from the array

    Forgive me if you mentioned it earlier in this 17-page thread, but why did you assign your SSD to the array? Why not simply assign it as the (single) cache drive and format it XFS? I can't think of any advantage in putting it in the array.

     

    Edited by John_M
    typo
    6 minutes ago, John_M said:

    Forgive me if you mentioned it earlier in this 17-page thread, but why did you assign your SSD to the array? Why not simply assign it as the (single) cache drive and format it XFS? I can't think of any advantage in putting it in the array.

     

    Because I already have a cache pool setup and 6.8 does not support multiple cache pools.

     

    This drive is not even supposed to be in the server. I stole it out of a laptop, since using this drive formatted as xfs resulted in 50-100x fewer writes than using the cache pool.

    2 minutes ago, TexasUnraid said:

    Because I already have a cache pool setup and 6.8 does not support multiple cache pools.

    In that case, have you investigated whether the Unassigned Devices plugin would help you in the meantime? It allows devices that are not assigned to the array or the cache to be mounted when the array starts. It's quite possible that that isn't quite early enough to support putting the docker.img there and I haven't tried it, but it might be worth checking the support thread for that plugin.


    I didn't think putting the docker on a UD device would be a good long term option. I always saw UD as a temporary use feature.

     

    That said I could be wrong and it would work perfectly fine for docker and appdata. Anyone have any info on this?

    7 minutes ago, TexasUnraid said:

    Anyone have any info on this?

    It seems like it has worked in the past, then something broke it and it may be fixed again, or not. See this old thread: 

     


    I realized why it won't work, you would have to edit every docker to point to the appdata on UD which would be a real pain and easy to mess up when having to do it x20+.

     

    Although, can symlinks be used on unraid? So could I use a symlink between the cache and a UD device so that I could keep the same paths I have now?

    3 hours ago, TexasUnraid said:

    can symlinks be used on unraid?

    Yes, certainly they can. Why don't you keep your appdata on the cache pool? That's where most people keep it.

    19 minutes ago, John_M said:

    Yes, certainly they can. Why don't you keep your appdata on the cache pool? That's where most people keep it.

    The writes are massively inflated with appdata as well as docker. Now that we understand why, it makes sense, the tiny writes that both make will cause the writing of at least 2 full blocks on the drive + the filesystem overhead with the free space caching. Even if it just wanted to write 1 byte.

     

    Great, so the symlinks won't cause any issues with the fuse file system?

     

    I simply put a symlink in cache pointing towards the UD drive and everything works as expected, the files will be accessible from the /user file system?

     

    That could work, have not actually used symlinks in linux yet but no time like the present to learn lol. Used them a lot in windows.

    Edited by TexasUnraid
    1 hour ago, TexasUnraid said:

    The writes are massively inflated with appdata as well as docker

    I don't think this is true anymore if you repartition the SSD device(s).

    11 hours ago, TexasUnraid said:

    The writes are massively inflated with appdata as well as docker. Now that we understand why, it makes sense, the tiny writes that both make will cause the writing of at least 2 full blocks on the drive + the filesystem overhead with the free space caching. Even if it just wanted to write 1 byte.

     

    Great, so the symlinks won't cause any issues with the fuse file system?

     

    I simply put a symlink in cache pointing towards the UD drive and everything works as expected, the files will be accessible from the /user file system?

     

    That could work, have not actually used symlinks in linux yet but no time like the present to learn lol. Used them a lot in windows.

    No prob with symlinks. I use that to point things everywhere.

    I even made a kill switch for my most important data (a bash script to remove the symlink takes milliseconds to complete and would completely cut off my data from e.g. any cryptovirus doing sinister stuff on the network).
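
    The kill switch idea, sketched in Python rather than bash. This is a guess at the general shape and not the actual script: the function name and the demo paths are made up, and how the switch gets triggered is left out entirely.

```python
import os
import tempfile


def cut_link(link_path: str) -> bool:
    """Remove link_path if it is a symlink; the target data is never touched."""
    if os.path.islink(link_path):
        os.unlink(link_path)
        return True
    return False


# Demo in a temporary directory so no real share is affected.
with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "important_data")    # hypothetical real location
    link_path = os.path.join(root, "exposed_share")    # hypothetical path seen by clients
    os.mkdir(data_dir)
    with open(os.path.join(data_dir, "file.txt"), "w") as fh:
        fh.write("still here")
    os.symlink(data_dir, link_path)

    print("link removed:", cut_link(link_path))
    print("data still present:", os.path.exists(os.path.join(data_dir, "file.txt")))
```

    Because only the link is deleted, restoring access afterwards is a matter of recreating the symlink.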

     

    Edited by testdasi
    10 hours ago, limetech said:

    I don't think this is true anymore if you repartition the SSD device(s).

    Yeah, it was the same root cause as the docker writes. So fix one and you fix them both.

     

    So are you saying that I can repartition the SSDs on 6.8 and they will work? Basically make my cache like 6.9 will be (and hopefully compatible as well so I don't need to convert again later)?

     

    How would I go about doing this?

     

    In the UD thread something was said about the new partition not being backward compatible.

    Edited by TexasUnraid
    8 minutes ago, testdasi said:

    No prob with symlinks. I use that to point things everywhere.

    I even made a kill switch for my most important data (a bash script to remove the symlink takes milliseconds to complete and would completely cut off my data from e.g. any cryptovirus doing sinister stuff on the network).

     

    Good to know, interesting use case as well. How would the script know that an attack is taking place?

     

    So no gotchas with symlinks on unraid? works just like any other linux system (aka, I can look up generic symlink tutorials online)?

    Edited by TexasUnraid
    2 minutes ago, TexasUnraid said:

    So are you saying that I can repartition the SSDs on 6.8 and they will work?

    No, new alignment only works on v6.9.

    2 minutes ago, johnnie.black said:

    No, new alignment only works on v6.9.

    That's what I thought. I am waiting for at least the RC of 6.9 before I consider upgrading, now that the server is in service.

     

    Symlinks / UD sound like a good stopgap, makes it simple to swap over to 6.9 as well since the paths will remain the same.

    Edited by TexasUnraid

    Do we know if there will be a definitive guide created for this issue once 6.9 drops, to help people convert over/transfer data/etc?


