• [6.8.3] docker image huge amount of unnecessary writes on cache


    S1dney
    • Urgent

    Hey Guys,

     

    First of all, I know that you're all very busy getting version 6.8 out there, something I'm very much waiting on as well. I'm seeing great progress, so thanks so much for that! Furthermore, I don't expect this to be at the top of the priority list, but I'm hoping someone on the development team is willing to look into it (perhaps after the release).

     

    Hardware and software involved:

    2 x 1TB Samsung EVO 860, set up with LUKS encryption in a BTRFS RAID1 pool.

     

    ###

    TLDR (but I'd suggest to read on anyway 😀)

    The image file mounted as a loop device is causing massive writes on the cache, potentially wearing out SSDs quite rapidly.

    This appears to be happening only on encrypted caches formatted with BTRFS (maybe only in a RAID1 setup, but I'm not sure).

    Hosting the Docker files directory on /mnt/cache instead of using the loop device seems to fix this problem.

    A possible idea for an implementation is proposed at the bottom.

     

    Grateful for any help provided!

    ###

     

    I have written a topic in the general support section (see link below), but I have done a lot of research lately and think I have gathered enough evidence pointing to a bug. I was also able to build a workaround (of sorts) for my situation. More details below.

     

    So, to see what was actually hammering the cache, I started with all the obvious things, like using a lot of find commands to trace files that were written to every few minutes, and I also used the fileactivity plugin. Neither was able to track down any writes that would explain 400 GB worth of writes a day for just a few containers that aren't even that active.

     

    Digging further, I moved the docker.img to /mnt/cache/system/docker/docker.img, so directly on the BTRFS RAID1 mountpoint. I wanted to check whether the unRAID FS layer was causing the loop2 device to write this heavily. No luck either.

    This gave me a situation I was able to reproduce on a virtual machine though, so I started with a recent Debian install (I know, it's not Slackware, but I had to start somewhere ☺️). I created some vDisks, encrypted them with LUKS, bundled them in a BTRFS RAID1 setup, created the loop device on the BTRFS mountpoint (same as /dev/cache) and mounted it on /var/lib/docker. I made sure I had the NoCow flag set on the IMG file like unRAID does. Strangely this did not show any excessive writes; iotop showed really healthy values for the same workload (I migrated the docker content over to the VM).

     

    After my Debian troubleshooting I went back over to the unRAID server, wondering whether the loopdevice is created weirdly, so I took the exact same steps to create a new image and pointed the settings from the GUI there. Still same write issues. 

     

    Finally I decided to put the whole image out of the equation and took the following steps:

    - Stopped docker from the WebGUI so unRAID would properly unmount the loop device.

    - Modified /etc/rc.d/rc.docker to not check whether /var/lib/docker was a mountpoint

    - Created a share on the cache for the docker files

    - Created a softlink at /var/lib/docker pointing to /mnt/cache/docker

    - Started docker using "/etc/rc.d/rc.docker start"

    - Started my Bitwarden containers.

     

    Looking into the stats with "iotop -ao" I did not see any excessive writing taking place anymore.

    I had the containers running for like 3 hours and maybe got 1GB of writes total (note that on the loopdevice this gave me 2.5GB every 10 minutes!)
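To put those two measurements on the same scale, here is a minimal sketch that converts each one to GB per day. The helper name `gb_per_day` is made up for illustration, and the inputs are simply the figures quoted above.

```shell
# Hypothetical helper: scale "GB written in N minutes" to GB per day.
gb_per_day() {
    # $1 = gigabytes written, $2 = minutes observed
    LC_ALL=C awk -v gb="$1" -v min="$2" 'BEGIN { printf "%.0f\n", gb * 1440 / min }'
}

gb_per_day 2.5 10   # loop device: 2.5 GB every 10 minutes
gb_per_day 1 180    # directory on /mnt/cache: about 1 GB in 3 hours
```

The loop device figure works out to roughly 360 GB a day, in line with the 300/400 GB a day seen on the SSDs, while the directory setup comes to about 8 GB a day.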

     

    Now don't get me wrong, I understand why the loop device was implemented. Dockerd is started with options to make it run with the BTRFS driver, and since the image file is formatted with the BTRFS filesystem this works on every setup; it doesn't matter whether the image sits on XFS, EXT4 or BTRFS, it will just work. In my case I had to point the softlink to /mnt/cache, because pointing it to /mnt/user would not allow me to start Docker with the BTRFS driver (obviously the unRAID filesystem isn't BTRFS). Also, the WebGUI has commands to scrub the filesystem inside the image; everything is based on the assumption that everyone is running docker on BTRFS (which of course they are, because of the image file 😁)

    I must say that my approach also broke when I changed something in the shares, certain services get a restart causing docker to be turned off for some reason. No big issue since it wasn't meant to be a long term solution, just to see whether the loopdevice was causing the issue, which I think my tests did point out.

     

    Now I'm at the point where I definitely need some developer help. I'm currently keeping nearly all docker containers off all day, because 300/400GB worth of writes a day is just a BIG waste of expensive flash storage, especially since I've shown that it's not needed at all. It does defeat the purpose of my NAS and SSD cache though, since its main purpose was hosting docker containers while allowing the HDDs to spin down.

     

    Again, I'm hoping someone on the dev team acknowledges this problem and is willing to look into it. I did get quite a few hits on the forums and reddit, but nobody actually pointed out the root cause of the issue.

     

    I'm missing the technical know-how to troubleshoot the loop device issues on a lower level, but I have been thinking about possible ways to implement a workaround, like adjusting the Docker Settings page to switch off the use of a vDisk and, if all requirements are met (pointing to /mnt/cache and BTRFS formatted), start docker on a share on the /mnt/cache partition instead of using the vDisk.

    In this way you would still keep all advantages of the docker.img file (cross filesystem type) and users who don't care about writes could still use it, but you'd be massively helping out others that are concerned over these writes.
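As a rough illustration of the check proposed here, the sketch below tests the two requirements mentioned: the path must sit on /mnt/cache and the filesystem must be BTRFS. The function name `can_skip_vdisk` is hypothetical and this is not how unRAID implements anything; the filesystem type is passed in as an argument so the logic can be tried anywhere.

```shell
# Hypothetical helper: could Docker run from a plain directory instead of
# the docker.img vDisk? Returns 0 (yes) or 1 (no).
# $1 = candidate directory, $2 = filesystem type of that directory
can_skip_vdisk() {
    case "$1" in
        /mnt/cache/*) ;;       # must live directly on the cache mount
        *) return 1 ;;         # e.g. /mnt/user/... does not qualify
    esac
    [ "$2" = "btrfs" ]         # Docker's btrfs driver needs BTRFS underneath
}

if can_skip_vdisk /mnt/cache/docker btrfs; then echo "directory mode"; else echo "vDisk"; fi
if can_skip_vdisk /mnt/user/docker btrfs; then echo "directory mode"; else echo "vDisk"; fi
```

On a live system the second argument could come from something like `stat -f -c %T /mnt/cache/docker` (GNU coreutils), though what that reports on a given unRAID mount is an assumption worth verifying.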

     

    I'm not attaching diagnostic files since they would probably not show what's needed.

    Also if this should have been in feature requests, I'm sorry. But I feel that, since the solution is misbehaving in terms of writes, this could also be placed in the bugreport section.

     

    Thanks though for this great product, have been using it so far with a lot of joy! 

    I'm just hoping we can solve this one so I can keep all my dockers running without the cache wearing out quickly.

     

    Cheers!

     



    User Feedback

    Recommended Comments



    10 minutes ago, TexasUnraid said:

    Agreed, I can't make sense of it.

     

    I think most of you that have the truly extreme write black holes are running things like plex, my best guess is that these fixes help the issue those dockers have but not the underlying issue.

     

    I only run very mild dockers, lancache, krusader, mumble, qbittorrent etc that are not actively doing anything right now.

     

    The difference from putting docker/appdata on an XFS array drive vs the btrfs cache is undeniable though, at around 200-300 MB/hour vs 1000-1500 MB/hour and climbing in most cases.

    I would have loved to have blamed it on your individual Docker containers, but I agree, those don't seem like extravagant containers. PMS is definitely a clunker. A lot of database containers also seem to be particularly bad about cache writes. MongoDB was horrendous for me: 

     

     

    Since you seem to still be experiencing this issue, could I get you to run 

    docker stats

    I'm curious if Block I/O identifies a particular container.

     

    -TorqueWrench

     


    Yeah, these are all pretty well behaved containers in theory, particularly since they are not actively doing anything right now; I'm still just testing things. I have a few other containers like nextcloud and mariadb installed, but they are not set up and I'm not sure if I will use them, so I don't have those dockers running for most of the testing. I did see increased writes with them running but I wanted to keep things consistent.

     

    Here is docker stats, nothing stands out to me. This is after having everything running for over 3 hours and 1.5GB being written per hour according to the LBA logging.

     

    [Screenshot: docker stats output (firefox_nGDbYItDUi.jpg)]

     

    Interestingly, if I add up all the block I/O numbers and divide by 3 hours and change, it works out to almost exactly what I was seeing with docker and appdata on an XFS drive.
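The sum described above can be sketched roughly like this. It assumes the BLOCK I/O column is in Docker's usual `read / write` form with decimal units (B, kB, MB, GB, TB) and that only the write side is counted; `block_writes_mb_per_hour` is a name made up for this example, and the sample values are invented rather than taken from the screenshot.

```shell
# Hypothetical helper: sum the write side of BLOCK I/O values read from
# stdin (one "READ / WRITE" pair per line) and print MB per hour.
block_writes_mb_per_hour() {
    # $1 = hours the containers have been running (defaults to 1)
    LC_ALL=C awk -F' / ' -v hours="${1:-1}" '
        function to_mb(s,    n, u) {
            n = s + 0                       # numeric prefix, e.g. 1.5
            u = s; gsub(/[0-9. ]/, "", u)   # what is left is the unit
            if (u == "B")  return n / 1e6
            if (u == "kB") return n / 1e3
            if (u == "MB") return n
            if (u == "GB") return n * 1e3
            if (u == "TB") return n * 1e6
            return 0                        # unknown unit: ignore
        }
        { total += to_mb($2) }
        END { printf "%.1f\n", total / hours }
    '
}

# Made-up sample values:
printf '%s\n' '1.2MB / 300MB' '0B / 1.5GB' '4kB / 500kB' | block_writes_mb_per_hour 3

# On a live system the input could come from:
#   docker stats --no-stream --format '{{.BlockIO}}' | block_writes_mb_per_hour 3
```

Comparing this number with the rate derived from the drive's LBA counter shows how much of the write volume the containers themselves account for.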

    Edited by TexasUnraid

    Just read this thread today, as I wanted to see if there are any major roadblocks for me when upgrading my system from 6.7.2 to the current stable release.

    After reading through all ten pages of this thread, upgrading to 6.8.x did not seem to be a good idea ;)

     

    Now that I am home and have access to my system, I checked my cache pool drives on an otherwise well behaving unraid install and found out that my two Samsung EVO 860 500 GB SATA drives, which are only 3 months old, have written about 50 TB, if the LBA calculator did not lie to me.

     

    Reading the thread it looked like a 6.8.x issue alone, but as I am on 6.7.2 and I am pretty sure that I have in no way intentionally written that vast amount of data to those poor little SSDs, I might be in the same boat as the rest of the bunch. I can only confirm that it is the same issue after further investigation, though.

     

    At this rate I would chew through the 300TBW warranty limit in about 18 months - phew
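For anyone wanting to redo that estimate with their own numbers, a minimal sketch follows. The function name `months_to_tbw_limit` is made up, and it assumes the write rate stays constant, which this thread suggests is optimistic.

```shell
# Hypothetical helper: total months in service until the rated TBW is
# reached, assuming writes continue at the average rate seen so far.
months_to_tbw_limit() {
    # $1 = TB written so far, $2 = months in service, $3 = rated TBW
    LC_ALL=C awk -v tb="$1" -v months="$2" -v rated="$3" \
        'BEGIN { printf "%.0f\n", rated / (tb / months) }'
}

months_to_tbw_limit 50 3 300   # 50 TB in 3 months against a 300 TBW rating
```

With the figures above this gives 18 months counted from when the drives were new, so about 15 months from now.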

    Edited by Kevek79
    Typo
    1 hour ago, Kevek79 said:

    Reading the thread it looked like a 6.8.x issue alone, but as I am on 6.7.2

     

    The bug was originally reported in 6.7.2. The thread title was changed to 6.8.3 when it was discovered that it still exists in the current release.

     

    Edited by grigsby
    7 minutes ago, grigsby said:

     

    The bug was originally reported in 6.7.2. The thread title was changed to 6.8.3 when it was discovered that it still exists in the current release.

     

    I did not realize that. 
    That makes it more likely that it's the same issue. 

    On 5/14/2020 at 1:59 PM, johnnie.black said:

    While we wait for the fix, anyone reached a PB? I'm not that far:

     

    imagem.png.ff82b3a639f2e773846ed4008294807a.png

    Just because I'm curious: Did you hit the PB yet @johnnie.black

    Edited by Kevek79
    10 hours ago, Kevek79 said:

    Just because I'm curious: Did you hit the PB yet @johnnie.black

    image.png.4a3c98b2348ed525254816f7ffeb12e0.png

     

    Still a few months away, at current pace I estimate hitting 1PB around Halloween, this assuming the NVMe device doesn't give up the ghost, since it's well past its 300TBW rating.


    I see that the topic is referring to docker running on cache, I'm not running any dockers on the cache (at least not ones that I can't stop for few days).
    After reading this thread I checked my setup (it is a new setup, first version used is 6.8.3) and noticed around 40 MB/sec of writes to the cache; a new drive already has 1.5TB written.
    I'm using two NVMes in a RAID1 cache pool. When I tested the system I had only one drive and did not notice high writes, but I might have missed it.
    2 VMs are running on the cache: Windows 10 with Blue Iris, which stores data on the array, and HassOS (which uses MariaDB inside).
    My first assumption was that it is related to the DB inside the HassOS VM, so I installed MariaDB as a docker and let it store on the cache; writes dropped from 40 MB/sec to around 5-6 MB/sec, which is still high. On the other hand, the MariaDB only writes about 100-200 kB/sec on the array. Moving forward, I moved the whole HassOS VM + MariaDB data to an unassigned SSD (xfs), and cache writes dropped down to 1-2 MB/sec, which is still high; the Windows VM has most of its services disabled and I doubt it writes so much data.
    Monitoring the HassOS VM + DB on SSD using LBA showed about 6GB for 12 hours (around 140kB/sec).
    The cache, which has nothing on it besides the Win 10 VM, has already accumulated more than 20GB of writes in the same 12-hour period.
    I am thinking of moving the Win 10 VM to the unassigned SSD as well, but have no idea what the next step should be. My original plan was to use the btrfs mirror on the cache as a sort of fault tolerance, but I doubt it will live long with such a high write rate.
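As a quick sanity check on the two 12-hour totals above, this sketch converts them to an average rate in kB/sec. The helper name `avg_kb_per_sec` is made up, and it assumes decimal units (1 GB = 1,000,000 kB).

```shell
# Hypothetical helper: average write rate in kB/sec from a GB total.
avg_kb_per_sec() {
    # $1 = gigabytes written (decimal GB), $2 = hours observed
    LC_ALL=C awk -v gb="$1" -v hours="$2" \
        'BEGIN { printf "%.0f\n", gb * 1e6 / (hours * 3600) }'
}

avg_kb_per_sec 6 12    # HassOS VM + DB on the unassigned SSD
avg_kb_per_sec 20 12   # cache holding only the Win 10 VM
```

The first call gives about 139 kB/sec, matching the "around 140kB/sec" quoted, and the second about 463 kB/sec for the cache.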

    18 minutes ago, thecode said:

    I see that the topic is referring to docker running on cache

    Yes, the topic is mostly about that, but for example I have the problem on one of my VMs, and only one, despite having 3 on the same device, and no issues with the docker image which also is on the same device, it's kind of a strange issue.

    2 hours ago, johnnie.black said:

    image.png.4a3c98b2348ed525254816f7ffeb12e0.png

     

    Still a few months away, at current pace I estimate hitting 1PB around Halloween, this assuming the NVMe device doesn't give up the ghost, since it's well past its 300TBW rating.

    Good luck for that, but as your numbers are way higher and the nvme is still working I can sleep a bit better with my 50TBW up till now ;)

    As everything else is working great and the issue is still in the current stable release, I might skip 6.8 totally and wait for 6.9.

    So let's hope that 6.9 rc1 is just around the corner and has this one fixed.

    I am eager to test that new release out because of the multiple cache pool options, but as I have only one system available I want to wait at least for a RC version before upgrading.


    Ok, I left the beta running overnight with docker and app data on the cache.

     

    Sure enough, it also started steadily climbing, was up to almost 2GB/hour this morning.

     

    So the beta did not help my write issues and if anything they are worse.

     

    So looks like I need to find another drive to use for docker formatted as XFS in the array.

    Edited by TexasUnraid

     Looks like I've not been impacted?

     

    cache drive is btrfs, with the following dockers: mariadb, kodi-server, duplicati

    # cat /etc/unraid-version; /usr/sbin/smartctl -A /dev/sdb | awk '$0~/Power_On_Hours/{ printf "Days: %.1f\n", $10 / 24} $0~/LBAs/{ printf "TBW: %.1f\n", $10 * 512 / 1024^4 }'
    version="6.8.3"
    Days: 646.9
    TBW: 10.3

     


    Just spit-balling here, but I seem to remember an issue with Samsung drives (mostly 850s at the time). Something to do with a non-standard starting block.

     

    I don't suppose anyone with this issue is using non-Samsung disks?

    4 minutes ago, -Daedalus said:

    Just spit-balling here, but I seem to remember an issue with Samsung drives (mostly 850s at the time). Something to do with a non-standard starting block.

     

    I don't suppose anyone with this issue is using non-Samsung disks?

    No Samsungs here.

    (Seagate IronWolf 110 SATA SSDs)

    Edited by Niklas

    The only Samsung drive here is the one I added to log LBAs, to make logging over time easier. The issue presented itself when I was only using other brands of drives.


    I just reinstalled unraid after all this testing to get a fresh start before I put this server into use.

     

    The writes are all of a sudden extreme. I've been getting 5GB/hour or more over the last few hours with the same dockers and settings as before.

     

    No idea why, going to move things to an XFS drive in the morning but no idea why it is so much worse now, docker stats still show they are all very well behaved like before.

    12 hours ago, johnnie.black said:

    Yes, the topic is mostly about that, but for example I have the problem on one of my VMs, and only one, despite having 3 on the same device, and no issues with the docker image which also is on the same device, it's kind of a strange issue.

    This topic is tldr but wondering if anyone has tried turning off btrfs COW?  Either on the docker.img file itself (if stored on a btrfs volume) or within the btrfs file system image.

    1 minute ago, limetech said:

    This topic is tldr but wondering if anyone has tried turning off btrfs COW?  Either on the docker.img file itself (if stored on a btrfs volume) or within the btrfs file system image.

    I am happy to test if you can tell me how.

    32 minutes ago, limetech said:

    This topic is tldr

     

    Well, I gotta say, LimeTech's response to this bug has been impressive -- in a not good way. This is a major, potentially catastrophic bug that could result in loss of data, time, and hardware/money that was first reported seven months ago, and the only two comments LimeTech makes about it are dismissing it as "tldr"?

     

    I first installed Unraid in May on a new server build and promptly purchased a license for $89. Obviously I don't have much history with Unraid or the company, but their total non-response to this bug report is disheartening.

    33 minutes ago, grigsby said:

    dismissing it as "tldr"

    This is not the only issue or the only thing we are working on.  The 'tldr' was meant as a solicitation for someone to summarize the issue to save time, not being dismissive.  I've seen this kind of thing before where I/O reporting is wildly off vs. what's happening on the media, especially with btrfs.

    1 hour ago, limetech said:

    This topic is tldr but wondering if anyone has tried turning off btrfs COW?  Either on the docker.img file itself (if stored on a btrfs volume) or within the btrfs file system image.

    Would this do it?

    Shut down docker/array

    chattr -R +C /mnt/user/system/docker
    rm -rf /mnt/user/system/docker/docker.img

    start docker/array
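One way to check afterwards whether the No_COW attribute actually took effect is to look for the `C` flag in the `lsattr` output. The sketch below only inspects an attribute string passed to it; the helper name `has_nocow_flag` and the sample strings are made up, and whether `chattr` behaves as hoped on a /mnt/user path is not something this confirms.

```shell
# Hypothetical helper: succeed if an lsattr attribute field contains the
# uppercase C flag (No_COW). Lowercase c means compression, not No_COW.
has_nocow_flag() {
    # $1 = first column of lsattr output, e.g. "---------------C------"
    case "$1" in
        *C*) return 0 ;;
        *)   return 1 ;;
    esac
}

if has_nocow_flag "---------------C------"; then echo "No_COW set"; else echo "No_COW not set"; fi

# On a live system the attribute field could be read like this:
#   has_nocow_flag "$(lsattr -d /mnt/user/system/docker | awk '{print $1}')"
```

Since the attribute only applies to files created after it is set on the directory, the recreated docker.img would be the file to check.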

    6 minutes ago, limetech said:

    This is not the only issue or the only thing we are working on.  The 'tldr' was meant as a solicitation for someone to summarize the issue to save time, not being dismissive.  I've seen this kind of thing before where I/O reporting is wildly off vs. what's happening on the media, especially with btrfs.

    In this case we can assure you that it is not a reporting issue as iotop and the raw LBA's written to the drives both show heavily inflated writes.

     

    For example, on an XFS drive I get around 200-300 MB/hour of writes, which lines up with what docker stats says.

     

    On the cache the LBA's were 1GB/hour and climbing over time, upwards of 2GB/hour when left overnight.

     

    I just reinstalled a few hours ago, now writes as measured by the LBA's of the smart output are upwards of 5GB an hour and climbing.
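For reference, the LBA-based rate mentioned here can be derived from two readings of the drive's written-LBA counter. This is a minimal sketch with a made-up function name, `lba_gb_per_hour`, and it assumes the counter is in 512-byte units, the same assumption the smartctl one-liner earlier in the thread makes.

```shell
# Hypothetical helper: GB written per hour between two LBA counter samples.
lba_gb_per_hour() {
    # $1 = LBAs written at start, $2 = LBAs written at end,
    # $3 = seconds between the two samples
    LC_ALL=C awk -v a="$1" -v b="$2" -v secs="$3" \
        'BEGIN { printf "%.1f\n", (b - a) * 512 / 1e9 * 3600 / secs }'
}

# Made-up sample: 9,765,625 LBAs (5 GB) written over one hour.
lba_gb_per_hour 1000000000 1009765625 3600
```

Some drives report the counter in other units, so the 512-byte multiplier should be checked against the drive's documentation before trusting the result.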

     

    One thing I did on the old setup was move the cache to an XFS drive and back to cache, someone else reported this helped, maybe it helped me as well, just didn't fix the issue.

    13 minutes ago, TexasUnraid said:

    For example on an XFS drive

    You mean an SSD device formatted with xfs?

    12 minutes ago, limetech said:

    You mean an SSD device formatted with xfs?

    Either SSD or HDD, it didn't matter, XFS writes were what they should be.

     

    Any BTRFS drive would have anywhere from 5x-15x+ the writes and it would climb over time. Although the amount of writes would vary some depending on factors we could not understand.

     

    For example, just the appdata being on the cache but docker on an XFS drive would still cause some very inflated writes, 100x more than if reversed.

     

    On my current setup, I should see 200-300 MB/hour of writes. I am actually seeing 5GB/hour of writes and climbing.

     

    At this rate my SSDs will not even last 2 years.

     

    The only fix I have found is to move appdata and docker to an XFS formatted drive in the array. Multiple cache pools would be real handy since I could just make another cache pool for it but can't wait that long. Still got to waste a whole drive just for dockers to keep it from killing my drives.

    Edited by TexasUnraid
    9 minutes ago, TexasUnraid said:

    On my current setup, I should see 200-300 MB/hour of writes. I am actually seeing 5GB/hour of writes and climbing.

    If you click on the device on Main and look at the SMART data, what's the value of the "data units written" attribute?  Does it line up with what you are measuring as MB/hour being written?
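For an NVMe device, the "Data Units Written" counter is reported in units of 1000 512-byte blocks, so it can be converted to TB for comparison with the LBA-based numbers. The sketch below assumes an NVMe drive and uses a made-up function name, `data_units_to_tb`; SATA SSDs expose a different attribute.

```shell
# Hypothetical helper: convert an NVMe "Data Units Written" count to TB.
# One data unit = 1000 * 512 bytes = 512,000 bytes.
data_units_to_tb() {
    # $1 = raw data units written (digits only, no thousands separators)
    LC_ALL=C awk -v units="$1" 'BEGIN { printf "%.2f\n", units * 512000 / 1e12 }'
}

data_units_to_tb 1953125   # made-up sample: exactly 1 TB
```

Taking the value twice, some hours apart, and dividing the difference by the elapsed time gives a rate that can be set against the MB/hour figures above.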





