• [6.8.3] docker image huge amount of unnecessary writes on cache


    S1dney
    • Urgent

    Hey Guys,

     

    First of all, I know that you're all very busy getting version 6.8 out there, something I'm very much waiting on as well. I'm seeing great progress, so thanks so much for that! I don't expect this to be at the top of the priority list, but I'm hoping someone on the developer team is willing to invest some time in it (perhaps after the release).

     

    Hardware and software involved:

    2 x 1TB Samsung EVO 860, setup with LUKS encryption in BTRFS RAID1 pool.

     

    ###

    TLDR (but I'd suggest to read on anyway 😀)

    The image file mounted as a loop device is causing massive writes on the cache, potentially wearing out SSDs quite rapidly.

    This appears to happen only on encrypted caches formatted with BTRFS (maybe only in a RAID1 setup, but I'm not sure).

    Hosting the Docker files directory on /mnt/cache instead of using the loop device seems to fix this problem.

    A possible idea for an implementation is proposed at the bottom.

     

    Grateful for any help provided!

    ###

     

    I have written a topic in the general support section (see link below), but I have done a lot of research lately and think I have gathered enough evidence pointing to a bug. I was also able to build a workaround (of sorts) for my situation. More details below.

     

    So to see what was actually hammering the cache I started with all the obvious things, like using a lot of find commands to trace files that were written to every few minutes, and I also used the fileactivity plugin. Neither was able to track down any writes that would explain 400 GB worth of writes a day for just a few containers that aren't even that active.
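File-level tracing misses writes that happen below the filesystem, so a per-device counter can be a useful cross-check. The following is a minimal Python sketch, assuming the standard Linux /proc/diskstats layout (device name in the third column, sectors written in the tenth, counted in 512-byte units). The two snapshots are made-up sample data for a loop2 device, not measurements from this server.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors


def sectors_written(diskstats_text: str, device: str) -> int:
    """Return the cumulative 'sectors written' counter for one block device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Columns: major, minor, name, reads, reads merged, sectors read,
        # ms reading, writes, writes merged, sectors written, ...
        if len(fields) >= 10 and fields[2] == device:
            return int(fields[9])
    raise KeyError(f"device {device!r} not found")


def gib_written_between(before: str, after: str, device: str) -> float:
    """GiB written to `device` between two snapshots of /proc/diskstats."""
    delta = sectors_written(after, device) - sectors_written(before, device)
    return delta * SECTOR_BYTES / 2**30


# Hypothetical snapshots, taken some minutes apart (sample data only).
BEFORE = "   7       2 loop2 1200 0 9600 300 45000 0 1000000 12000 0 8000 12300\n"
AFTER = "   7       2 loop2 1250 0 10000 310 90000 0 6242880 25000 0 16000 25310\n"

written = gib_written_between(BEFORE, AFTER, "loop2")
print(f"loop2 wrote {written:.2f} GiB between snapshots")
```

On a live system the two snapshots would come from reading /proc/diskstats twice, for example with `open("/proc/diskstats").read()`, with a known interval in between.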

     

    Digging further, I moved the docker.img to /mnt/cache/system/docker/docker.img, so directly on the BTRFS RAID1 mountpoint. I wanted to check whether the unRAID FS layer was causing the loop2 device to write this heavily. No luck either.

    This gave me a situation I was able to reproduce on a virtual machine though, so I started with a recent Debian install (I know, it's not Slackware, but I had to start somewhere ☺️). I created some vDisks, encrypted them with LUKS, bundled them in a BTRFS RAID1 setup, created the loop device on the BTRFS mountpoint (same as /dev/cache) and mounted it on /var/lib/docker. I made sure I had the NoCOW flag set on the IMG file like unRAID does. Strangely this did not show any excessive writes; iotop shows really healthy values for the same workload (I migrated the docker content over to the VM).

     

    After my Debian troubleshooting I went back over to the unRAID server, wondering whether the loopdevice is created weirdly, so I took the exact same steps to create a new image and pointed the settings from the GUI there. Still same write issues. 

     

    Finally I decided to put the whole image out of the equation and took the following steps:

    - Stopped docker from the WebGUI so unRAID would properly unmount the loop device.

    - Modified /etc/rc.d/rc.docker to not check whether /var/lib/docker was a mountpoint

    - Created a share on the cache for the docker files

    - Created a softlink from /mnt/cache/docker to /var/lib/docker

    - Started docker using "/etc/rc.d/rc.docker start"

    - Started my Bitwarden containers.

     

    Looking into the stats with "iotop -ao" I did not see any excessive writing taking place anymore.

    I had the containers running for like 3 hours and maybe got 1GB of writes total (note that on the loopdevice this gave me 2.5GB every 10 minutes!)
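To put those two observations on the same scale, here is a small Python sketch that projects each one to a 24-hour figure. It assumes the write rate stays constant over the day, which is a simplification, and the function name is just illustrative.

```python
def gb_per_day(gigabytes: float, minutes: float) -> float:
    """Scale a write volume observed over `minutes` to a 24-hour figure."""
    return gigabytes * (24 * 60) / minutes


# Figures taken from the post above.
loop_device = gb_per_day(2.5, 10)        # 2.5 GB every 10 minutes via the loop device
direct_on_cache = gb_per_day(1.0, 180)   # about 1 GB in roughly 3 hours without the image

print(f"loop device:     {loop_device:.0f} GB/day")
print(f"direct on cache: {direct_on_cache:.0f} GB/day")
print(f"ratio:           {loop_device / direct_on_cache:.0f}x")
```

The loop device projection lands in the same range as the 300 to 400 GB a day mentioned elsewhere in this report.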

     

    Now don't get me wrong, I understand why the loop device was implemented. Dockerd is started with options to make it run with the BTRFS driver, and since the image file is formatted with the BTRFS filesystem this works on every setup; it doesn't matter whether it runs on XFS, EXT4 or BTRFS, it will just work. In my case I had to point the softlink to /mnt/cache, because pointing it to /mnt/user would not allow me to start using the BTRFS driver (obviously the unRAID filesystem isn't BTRFS). Also the WebGUI has commands to scrub the filesystem inside the container; it is all based on the assumption that everyone is using docker on BTRFS (which of course they are because of the container 😁)

    I must say that my approach also broke when I changed something in the shares, certain services get a restart causing docker to be turned off for some reason. No big issue since it wasn't meant to be a long term solution, just to see whether the loopdevice was causing the issue, which I think my tests did point out.

     

    Now I'm at the point where I definitely need some developer help. I'm currently keeping nearly all docker containers off all day because 300-400 GB worth of writes a day is just a BIG waste of expensive flash storage, especially since I've shown that it's not needed at all. It does defeat the purpose of my NAS and SSD cache though, since its main purpose was hosting docker containers while allowing the HDDs to spin down.

     

    Again, I'm hoping someone on the dev team acknowledges this problem and is willing to invest time in it. I did get quite a few hits on the forums and reddit, but without anyone actually pointing out the root cause of the issue.

     

    I'm missing the technical know-how to troubleshoot the loop device issues at a lower level, but I have been thinking about possible ways to implement a workaround, like adjusting the Docker Settings page to allow switching off the use of a vDisk and, if all requirements are met (pointing to /mnt/cache and BTRFS formatted), starting docker on a share on the /mnt/cache partition instead of using the vDisk.

    In this way you would still keep all advantages of the docker.img file (cross filesystem type) and users who don't care about writes could still use it, but you'd be massively helping out others that are concerned over these writes.

     

    I'm not attaching diagnostic files since they would probably not show what's needed.

    Also if this should have been in feature requests, I'm sorry. But I feel that, since the solution is misbehaving in terms of writes, this could also be placed in the bugreport section.

     

    Thanks though for this great product, have been using it so far with a lot of joy! 

    I'm just hoping we can solve this one so I can keep all my dockers running without the cache wearing out quickly.

     

    Cheers!

     



    User Feedback

    Recommended Comments



    From the console: "diagnostics" will create a ZIP in /logs on the boot USB.

    You can also do it from the GUI somewhere in the Tools menu, if memory serves.

    4 minutes ago, -Daedalus said:

    From the console: "diagnostics" will create a ZIP in /logs on the boot USB.

    You can also do it from the GUI somewhere in the Tools menu, if memory serves.

    Thanks, Just tried it again, same result as before. I assume you are talking about the system log in tools? I downloaded it and will send it in a PM.

    1 hour ago, TexasUnraid said:

    Thanks, Just tried it again, same result as before. I assume you are talking about the system log in tools? I downloaded it and will send it in a PM.

    No, Diagnostics in Tools

    5 hours ago, TexasUnraid said:

    Tried re-creating the raid 0 cache pool and removing the drive again, once again the pool died even though there is plenty of room (only docker and appdata on it right now).

    When you remove a device from an existing btrfs pool, part of array Start is to wipefs the device which has been removed.  This is necessary.  If you are using raid0 this will effectively clobber the original pool too since raid0 has no redundancy.

    3 minutes ago, limetech said:

    When you remove a device from an existing btrfs pool, part of array Start is to wipefs the device which has been removed.  This is necessary.  If you are using raid0 this will effectively clobber the original pool too since raid0 has no redundancy.

    Strange that it worked when I tried it before then.

     

    BTRFS supports removing a drive from a raid0 pool as I understand it, how can it be done in unraid?

    15 minutes ago, TexasUnraid said:

    BTRFS supports removing a drive from a raid0 pool as I understand it, how can it be done in unraid?

    command line


    Why can't it be part of the GUI?

     

    I have no clue how to do it from the command line and trying to look it up briefly left me with more questions than answers.

    Just now, TexasUnraid said:

    Why can't it be part of the GUI?

     

    I have no clue how to do it from the command line and trying to look it up briefly left me with more questions than answers.

    You could convert to raid1 using balance, then remove/unassign the device.

    25 minutes ago, limetech said:

    When you remove a device from an existing btrfs pool, part of array Start is to wipefs the device which has been removed.  This is necessary.  If you are using raid0 this will effectively clobber the original pool too since raid0 has no redundancy.

    Hmm, this came up recently and I tested it myself, since btrfs can remove a device from a raid0 pool, and it worked with Unraid. I just retested and went from a 4-device raid0 to a single-device pool, removing one device at a time; the removed device is still mounted by the pool and then deleted after balancing.

     

    Looking at @TexasUnraid's diags, I think it didn't work for him because his pool was encrypted: the unassigned/removed device wasn't decrypted and so was unable to be used during the balance for removing, i.e., the same as if the device was disconnected, and in that case it obviously can't mount a raid0 pool r/w with a missing device:

    Jun 20 13:18:07 NAS kernel: BTRFS warning (device dm-2): devid 5 uuid 6baf6b07-3963-4763-ae97-3e0258cc71a8 is missing
    Jun 20 13:18:07 NAS kernel: BTRFS warning (device dm-2): chunk 11967397888 missing 1 devices, max tolerance is 0 for writeable mount
    Jun 20 13:18:07 NAS kernel: BTRFS warning (device dm-2): writeable mount is not allowed due to too many missing devices
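The reasoning in those three log lines can be read off mechanically. This is a rough Python sketch, assuming the message wording shown above; the regular expressions and the `summarize` helper are illustrative and not part of any Unraid or btrfs tooling.

```python
import re

LOG = """\
Jun 20 13:18:07 NAS kernel: BTRFS warning (device dm-2): devid 5 uuid 6baf6b07-3963-4763-ae97-3e0258cc71a8 is missing
Jun 20 13:18:07 NAS kernel: BTRFS warning (device dm-2): chunk 11967397888 missing 1 devices, max tolerance is 0 for writeable mount
Jun 20 13:18:07 NAS kernel: BTRFS warning (device dm-2): writeable mount is not allowed due to too many missing devices
"""

MISSING_DEV = re.compile(r"devid (\d+) uuid (\S+) is missing")
TOLERANCE = re.compile(r"chunk (\d+) missing (\d+) devices, max tolerance is (\d+)")


def summarize(log: str) -> dict:
    """Collect missing devices and chunks whose missing count exceeds the tolerance."""
    missing = [(int(devid), uuid) for devid, uuid in MISSING_DEV.findall(log)]
    over = [
        (int(chunk), int(n_missing), int(tolerance))
        for chunk, n_missing, tolerance in TOLERANCE.findall(log)
        if int(n_missing) > int(tolerance)
    ]
    return {
        "missing_devices": missing,
        "chunks_over_tolerance": over,
        "writeable_mount_refused": "writeable mount is not allowed" in log,
    }


print(summarize(LOG))
```

With one device missing and a tolerance of 0, the chunk is over the limit, which matches the refusal to mount the raid0 pool read-write.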

     

    2 minutes ago, johnnie.black said:

    Hmm, this came up recently and I tested it myself, since btrfs can remove a device from a raid0 pool, and it worked with Unraid. I just retested and went from a 4-device raid0 to a single-device pool, removing one device at a time; the removed device is still mounted by the pool and then deleted after balancing.

     

    Looking at @TexasUnraid's diags, I think it didn't work for him because his pool was encrypted: the unassigned/removed device wasn't decrypted and so was unable to be used during the balance for removing, i.e., the same as if the device was disconnected, and in that case it obviously can't mount a raid0 pool r/w with a missing device:

    
    Jun 20 13:18:07 NAS kernel: BTRFS warning (device dm-2): devid 5 uuid 6baf6b07-3963-4763-ae97-3e0258cc71a8 is missing
    Jun 20 13:18:07 NAS kernel: BTRFS warning (device dm-2): chunk 11967397888 missing 1 devices, max tolerance is 0 for writeable mount
    Jun 20 13:18:07 NAS kernel: BTRFS warning (device dm-2): writeable mount is not allowed due to too many missing devices

     

    That makes sense, the question is how would you work around this besides converting to a raid1 pool first? Also explains why it worked before but not now, it was not encrypted before.

     

    The keyfile was in place and this drive used the same encryption as the rest of the drives.

    Just now, TexasUnraid said:

    The keyfile was in place and this drive used the same encryption as the rest of the drives.

    Yes, but since the removed disk was unassigned it's not decrypted by Unraid; you'd need to manually decrypt it. If you want I can show you how to remove a member from the pool using the CLI, but it's really off topic here, so please start a new thread and I'll reply there.

    9 minutes ago, johnnie.black said:

    Yes, but since the removed disk was unassigned it's not decrypted by Unraid; you'd need to manually decrypt it. If you want I can show you how to remove a member from the pool using the CLI, but it's really off topic here, so please start a new thread and I'll reply there.

    Ok, I will start a new thread.

     

    In other news, I still need to wait a few more hours to make sure the numbers don't climb, but I currently have a 120GB SSD formatted as XFS in the array and I put appdata and dockers on that.

     

     

    The first hour's write data is in: 237MB of writes with the same dockers and settings as before, where I was getting 7GB of writes/hour and climbing on the BTRFS cache.

     

    In fact I actually have a few other dockers I am playing with installed now (very lightweight, a speedtest docker etc).

     

    It sucks that I have to waste a whole drive just for the 22GB of docker data and, if I were using a parity drive now (I plan to save up for one down the road), possibly break parity to do it.

     

    The nocache option was a big write saver though, that deserves more widespread testing for sure. I might do some more playing around with it later, my window for messing with this is quickly closing.

     

    Edited by TexasUnraid
    1 hour ago, johnnie.black said:

    Hmm, this came up recently and I tested it myself, since btrfs can remove a device from a raid0 pool, and it worked with Unraid. I just retested and went from a 4-device raid0 to a single-device pool, removing one device at a time; the removed device is still mounted by the pool and then deleted after balancing.

     

    Looking at @TexasUnraid's diags, I think it didn't work for him because his pool was encrypted: the unassigned/removed device wasn't decrypted and so was unable to be used during the balance for removing, i.e., the same as if the device was disconnected, and in that case it obviously can't mount a raid0 pool r/w with a missing device:

    
    Jun 20 13:18:07 NAS kernel: BTRFS warning (device dm-2): devid 5 uuid 6baf6b07-3963-4763-ae97-3e0258cc71a8 is missing
    Jun 20 13:18:07 NAS kernel: BTRFS warning (device dm-2): chunk 11967397888 missing 1 devices, max tolerance is 0 for writeable mount
    Jun 20 13:18:07 NAS kernel: BTRFS warning (device dm-2): writeable mount is not allowed due to too many missing devices

     

    Ok to be more precise: if a device is unassigned from an existing pool and moved to another pool, the action of integrating the moved device to a new pool is what triggers 'wipefs' - this is a bug: code should always do a 'wipefs' even if device is removed from pool and not assigned anywhere else.  The fact that it works is really an accident and if the moved device is being used for some other purpose, and gets reformatted before the balance completes, will mess up the balance.  This is because btrfs (and blkdev) thinks this device is still part of the original pool (because the UUID is littered all over within the file system).  This is why it's necessary to wipe the file system.

     

    The correct procedure in the supported raid1 profile is to let balance do its thing.  If you are converting from single or raid0, best to convert first to raid1, then remove device, then let balance again do its thing (which should be pretty fast for a 2-device config).

    On 6/20/2020 at 9:35 AM, johnnie.black said:

     

    image.png.941e38e4613ff25bff2ffb640fd46c28.png

     

     

    After 24 Hours:

    image.png.4a384408736234b46aa14db529e9ae8d.png

     

    735778GB - 735080GB = 698GB in the last 24 Hours, while still a lot it's much better than the 3TB a day or so it was writing before, so not a happy camper, but a happier camper :)

     

     


    Have followed @johnnie.black's temp fix and the results are very encouraging, iotop -a shows writes to be significantly reduced on my raid0 cache.

    I'll try to get some data points over the next week to compare with previous data points.

     

     


    Overnight, with docker and appdata on the XFS-formatted array drive, my writes have actually dropped down to ~175MB/hour.

     

    Nice to see compared to it climbing overnight when on the BTRFS cache to 7GB/hour and beyond.

    Edited by TexasUnraid
    On 6/9/2020 at 7:19 PM, Moz80 said:


    I have just now looked at the SMART data for my SSD in Unraid; it has a line that says:

    
    202	Percent lifetime remain	0x0030	089	089	001	Old age	Offline	Never	11

    ...Does that really mean there is only 11% life left of my ssd that’s less than a month old?


    Popping the LBAs written (57527742008) into a calculator shows me 26.79TB. The Crucial data sheet for the SSD states 180TB written as the endurance of the drive (so I wasn’t as worried) ... but the SMART data says only 11% so I’m freaking out a little now!
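For anyone wanting to redo that calculator step, here is a minimal Python sketch. It assumes 512 bytes per LBA, which is the value that reproduces the 26.79 figure (strictly speaking that number is in binary TiB), and it assumes the datasheet's 180TB endurance rating is in decimal terabytes.

```python
LBAS_WRITTEN = 57_527_742_008   # raw "LBAs written" value from the post
BYTES_PER_LBA = 512             # assumption: this drive reports 512-byte LBAs
RATED_ENDURANCE_TB = 180        # datasheet figure quoted above, assumed decimal TB

bytes_written = LBAS_WRITTEN * BYTES_PER_LBA
tib_written = bytes_written / 2**40    # binary units, matches the 26.79 figure
tb_written = bytes_written / 10**12    # decimal units, as datasheets usually state

print(f"{tib_written:.2f} TiB written")
print(f"{tb_written:.2f} TB written")
print(f"{100 * tb_written / RATED_ENDURANCE_TB:.1f}% of the rated endurance")
```

The share of rated endurance computed this way is only a rough cross-check against the wear indicator, since the drive may account for wear differently than host writes alone.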
     

    Should I be worried? 


     

     

     

    6B7A1292-64A7-473D-A151-185502E48F37.png

     

     

     

    So back on June 9 I was showing 89% life left on my SSD.

     

    Fast forward to today:

     

    9	Power on hours	0x0032	100	100	000	Old age	Always	Never	820 (1m, 3d, 4h)
    
    202	Percent lifetime remain	0x0030	086	086	001	Old age	Offline	Never	14

    And I've lost another 3%. Now down to 86% in 1 month and 3 days of use, after applying the "fixes" of changing a couple of docker containers (I haven't reformatted my cache drive to another filesystem).
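To make the two readings easier to compare, here is a small Python sketch that splits the attribute lines into columns. It assumes the tab-separated layout shown in this thread (ID, name, flag, value, worst, threshold, type, updated, failed, raw) and, for this particular drive, that the normalized value is the percentage remaining while the raw value is the percentage used; the class and function names are made up for illustration.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int    # normalized value column
    worst: int
    thresh: int
    raw: int      # raw value column


def parse_attr(line: str) -> SmartAttr:
    """Parse one tab-separated SMART attribute row as shown in the Unraid GUI."""
    f = line.split("\t")
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), int(f[9]))


june_9 = parse_attr("202\tPercent lifetime remain\t0x0030\t089\t089\t001\tOld age\tOffline\tNever\t11")
today = parse_attr("202\tPercent lifetime remain\t0x0030\t086\t086\t001\tOld age\tOffline\tNever\t14")

for label, attr in (("June 9", june_9), ("today", today)):
    print(f"{label}: {attr.value}% remaining, {attr.raw}% used")
print(f"change: {june_9.value - today.value} percentage points")
```

In both readings the two columns add up to 100, which is consistent with reading the raw 11 and 14 as wear used rather than life left.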

     

    I've tried to keep up-to-date on this thread (reading all the notifications) and I may have missed it, but it would still be REALLY beneficial to know that an actual fix is coming for this one.

    14 minutes ago, Moz80 said:

    After applying the "fixes" of changing a couple of docker containers (I haven't reformatted my cache drive to another filesystem).

    You cannot fix this by changing a couple of docker containers, because docker containers are not the root cause of the problem; a busy container will just show this problem more.

     

    The only "fixes" that have been working for others were:

    1) Formatting to XFS (works always)

    2) Remounting the BTRFS cache with the nospace_cache options, see @johnnie.black's https://forums.unraid.net/bug-reports/stable-releases/683-docker-image-huge-amount-of-unnecessary-writes-on-cache-r733/?do=findComment&comment=9431 (seems to work for everyone so far)

    3) Putting docker directly onto the cache (some have reported no decreased writes, although some have, this is the one I'm using and it's working for me)

     

    I may have missed one, but suggestion 2 is your quickest option here.

    Edited by S1dney
    2 hours ago, S1dney said:

    You cannot fix this by changing a couple of docker containers, because docker containers are not the root cause of the problem; a busy container will just show this problem more.

     

    The only "fixes" that have been working for others were:

    1) Formatting to XFS (works always)

    2) Remounting the BTRFS cache with the nospace_cache options, see @johnnie.black's https://forums.unraid.net/bug-reports/stable-releases/683-docker-image-huge-amount-of-unnecessary-writes-on-cache-r733/?do=findComment&comment=9431 (seems to work for everyone so far)

    3) Putting docker directly onto the cache (some have reported no decreased writes, although some have, this is the one I'm using and it's working for me)

     

    I may have missed one, but suggestion 2 is your quickest option here.

    Thanks for this, I really do appreciate it. It certainly seems I glossed over this one (the remount option) in a hurry and may have missed it.

     

    Also, as he said "But like mentioned it's not a complete fix" ...I'm really kinda hanging on for an actual "complete fix". Without me breaking anything in the meantime.

     

    I was happy to reduce the writes as much as I could by changing a few containers (I shouldn't have described that as a 'fix', I guess), as I was comfortable with this, but I'm not personally very comfortable with many of the other changes proposed beyond that without a guarantee that they won't break something if an actual fix is pushed out. But I'm absolutely not trying to diminish the efforts you and the many others here have made to find these options for other users!

     

    I probably didn't need to make another post I guess, should have continued to observe from the sidelines. Just frustration that my poor little SSD's life is ticking away quite quickly with this issue present. And still no idea if it's been officially acknowledged and a fix in the works?

    [Edit] My (basic) calculations put my SSD's life expectancy at 258 days' worth of "percent lifetime remain".
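One simple way to get an estimate of this kind is a straight-line extrapolation. The Python sketch below is not necessarily the calculation used for the 258-day figure; it assumes wear keeps accumulating at the average rate seen so far and uses the 14% used and 820 power-on hours from the SMART lines posted earlier, so the result depends on which readings and time base are plugged in.

```python
def days_remaining(percent_used: float, power_on_hours: float) -> float:
    """Linear extrapolation: days until the wear indicator reaches 100% used."""
    if percent_used <= 0:
        raise ValueError("need a non-zero wear reading to extrapolate")
    hours_per_percent = power_on_hours / percent_used
    return (100 - percent_used) * hours_per_percent / 24


estimate = days_remaining(percent_used=14, power_on_hours=820)
print(f"about {estimate:.0f} days at the current average rate")
```

Using only the most recent change between two readings instead of the lifetime average would give a different number, so treat any such figure as a rough indication.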

    Edited by Moz80

    I agree, I didn't want to hack stuff apart too much to try and fix this so that any official fix would work properly.

     

    At this point the remount option is your best bet, it can be done with the array running, it does not change anything about the base unraid workings and it reverts to stock after a reboot. So no risk to give it a try.

     

    You will just need to run it again after every reboot.

     

    If you only have a single cache drive, then reformatting as XFS is the only true fix; it dropped my writes from 7GB/hour to less than 200MB/hour.

     

    For those of us with a cache pool, there is no fix, we have to get creative, like me adding an SSD to the array formatted as XFS, this only works since I don't have a parity drive at the moment.


    Sorry for bringing more noise to this issue, but having just built an unRAID Server with 4 HDDs and 2 NVMe SSDs I want to make sure not to thrash them right away. My intention was to run both SSDs in an encrypted BTRFS pool, but as far as I understand I'd be affected by this bug then.

     

    Right now there is one of the SSDs installed and running as encrypted BTRFS. It only has two VMs running on it, one is Home Assistant, the other an older Debian server with a few Docker containers that I some day want to migrate. Home Assistant is around 6GB in size, the older server is approx. 20GB in total, transferred over via rsync. Apart from that I only tested transferring one or two files, not more than a few GBs. According to SMART the NVMe is at: Data Units Written: 368,546 [188 GB]
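As a quick check of how that SMART figure maps to gigabytes, here is a short Python sketch. It relies on the NVMe convention that one "data unit" is 1000 blocks of 512 bytes, which is the same conversion behind the bracketed value in the line above.

```python
DATA_UNITS_WRITTEN = 368_546
BYTES_PER_DATA_UNIT = 1000 * 512   # NVMe reports data units in thousands of 512-byte blocks

bytes_written = DATA_UNITS_WRITTEN * BYTES_PER_DATA_UNIT
gb_written = bytes_written / 10**9

print(f"{gb_written:.1f} GB written")
```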

     

    The bigger VM has only been running for something like 8 hours. Home Assistant has been running since I installed the SSD yesterday, not even 24 hours. I'd have expected something around 50GB written at most, considering both VMs have been running for less than a day, and they surely didn't write 150GB to the disk!

     

    I think I'll reformat the cache as encrypted XFS and keep the other NVMe out of the system for now. It's nothing mission critical on the cache, so backing it up with restic or similar every 6 hours to the array should be fine… though I really wanted to move to RAID1 for quite a while now :(

    On 6/22/2020 at 1:52 AM, Moz80 said:

    And still no idea if it's been officially acknowledged and a fix in the works?

    Yes we are looking into this.


    Hi everyone and thank you all for your continued patience on this issue.  I'm sure it can be frustrating that this has been going on for as long as it has for some of you and yet this one has been a bit elusive for us to track down as we haven't been able to replicate the issue, but we just ordered some more testing gear to see if we can and I will be dedicating some serious time to this in the weeks ahead.  Gear should arrive this weekend so I'll have some fun testing to do during the 4th of July holiday (and my birthday ;-).


    Also, if anyone has seen this issue affect their cache pool using HDDs, can you please reply in this thread and let us know?  I'm fairly certain this is an SSD-only issue, but better to ask than assume.





