• [6.8.3] docker image huge amount of unnecessary writes on cache


    S1dney
    • Solved Urgent

    EDIT (March 9th 2021):

    Solved in 6.9 and up. Reformatting the cache to new partition alignment and hosting docker directly on a cache-only directory brought writes down to a bare minimum.

     

    ###

     

    Hey Guys,

     

    First of all, I know you're all very busy getting version 6.8 out there, something I'm very much waiting on as well. I'm seeing great progress, so thanks so much for that! I don't expect this to be at the top of the priority list, but I'm hoping someone from the developer team is willing to invest some time (perhaps after the release).

     

    Hardware and software involved:

    2 x 1TB Samsung EVO 860, set up with LUKS encryption in a BTRFS RAID1 pool.

     

    ###

    TL;DR (but I'd suggest reading on anyway 😀)

    The image file mounted as a loop device is causing massive writes on the cache, potentially wearing out SSDs quite rapidly.

    This appears to be only happening on encrypted caches formatted with BTRFS (maybe only in RAID1 setup, but not sure).

    Hosting the Docker files directory on /mnt/cache instead of using the loopdevice seems to fix this problem.

    A possible idea for an implementation is proposed at the bottom.

     

    Grateful for any help provided!

    ###

     

    I had written a topic in the general support section (see link below), but I have done a lot of research lately and think I have gathered enough evidence pointing to a bug. I was also able to build (kind of) a workaround for my situation. More details below.

     

    So to see what was actually hammering the cache I started with all the obvious things, like using a lot of find commands to trace files that were written to every few minutes, and I also used the File Activity plugin. Neither was able to trace down any writes that would explain 400 GB worth of writes a day for just a few containers that aren't even that active.

     

    Digging further I moved the docker.img to /mnt/cache/system/docker/docker.img, so directly onto the BTRFS RAID1 mountpoint, because I wanted to check whether the unRAID FS layer was causing the loop2 device to write this heavily. No luck either.

    This did give me a setup I could try to reproduce in a virtual machine, so I started with a recent Debian install (I know, it's not Slackware, but I had to start somewhere ☺️). I created some vDisks, encrypted them with LUKS, bundled them into a BTRFS RAID1 setup, created the loop device on the BTRFS mountpoint (same as on /mnt/cache) and mounted it on /var/lib/docker. I made sure I had the NoCOW flag set on the IMG file like unRAID does. Strangely this did not show any excessive writes; iotop showed really healthy values for the same workload (I migrated the docker content over to the VM).
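    For anyone wanting to retrace the VM experiment, the pool setup was roughly the following. This is a sketch with example device names, not my exact commands:

    cryptsetup luksFormat /dev/vdb
    cryptsetup luksFormat /dev/vdc
    cryptsetup open /dev/vdb crypt1
    cryptsetup open /dev/vdc crypt2
    mkfs.btrfs -d raid1 -m raid1 /dev/mapper/crypt1 /dev/mapper/crypt2
    mkdir -p /mnt/cache
    mount /dev/mapper/crypt1 /mnt/cache   # mounting either member brings up the whole BTRFS pool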

     

    After my Debian troubleshooting I went back to the unRAID server, wondering whether the loop device was being created weirdly, so I took the exact same steps to create a new image and pointed the Docker settings in the GUI at it. Still the same write issues.
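    Those image-creation steps come down to roughly the following (a sketch of what I did by hand, not unRAID's actual mount_image script; the size and the loop device name are examples):

    mkdir -p /mnt/cache/system/docker /var/lib/docker
    touch /mnt/cache/system/docker/docker.img
    chattr +C /mnt/cache/system/docker/docker.img        # set NoCOW while the file is still empty
    truncate -s 20G /mnt/cache/system/docker/docker.img
    LOOP=$(losetup -f --show /mnt/cache/system/docker/docker.img)   # e.g. /dev/loop2
    mkfs.btrfs "$LOOP"
    mount "$LOOP" /var/lib/docker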

     

    Finally I decided to take the whole image out of the equation and took the following steps (a rough command sketch follows after the list):

    - Stopped docker from the WebGUI so unRAID would properly unmount the loop device.

    - Modified /etc/rc.d/rc.docker to not check whether /var/lib/docker was a mountpoint

    - Created a share on the cache for the docker files

    - Created a softlink from /mnt/cache/docker to /var/lib/docker

    - Started docker using "/etc/rc.d/rc.docker start"

    - Started my Bitwarden containers.
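    In shell terms the workaround comes down to something like this (a sketch; the "docker" share name is my own choice, and the rc.docker edit still has to be done by hand in the file):

    /etc/rc.d/rc.docker stop        # or stop Docker from the WebGUI first
    # edit /etc/rc.d/rc.docker and bypass the "is /var/lib/docker a mountpoint" check
    mkdir -p /mnt/cache/docker      # cache-only share that will hold the Docker files
    rmdir /var/lib/docker           # empty mount point left behind after the loop image is unmounted
    ln -s /mnt/cache/docker /var/lib/docker
    /etc/rc.d/rc.docker start
    docker start bitwarden          # example container name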

     

    Looking at the stats with "iotop -ao" I did not see any excessive writing taking place anymore.

    I had the containers running for like 3 hours and maybe got 1GB of writes total (note that on the loopdevice this gave me 2.5GB every 10 minutes!)
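    For reference, this is how I've been measuring it (iotop may need to be installed separately, and loop2 is simply whichever loop device hosts docker.img on your system):

    iotop -ao      # accumulated I/O per process since start (-a), only processes doing I/O (-o)

    # or read the kernel's per-device counters directly; field 10 is sectors written:
    awk '$3 == "loop2" { printf "%.1f MiB written\n", $10 * 512 / 1024^2 }' /proc/diskstats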

     

    Now don't get me wrong, I understand why the loop device was implemented. Dockerd is started with options to make it use the BTRFS storage driver, and since the image file is formatted with the BTRFS filesystem this works for every setup; it doesn't matter whether the underlying drive runs XFS, EXT4 or BTRFS, it will just work. In my case I had to point the softlink to /mnt/cache, because pointing it to /mnt/user would not allow me to use the BTRFS driver (obviously the unRAID user share filesystem isn't BTRFS). Also, the WebGUI has commands to scrub the filesystem inside the image; it's all based on the assumption that everyone is running docker on BTRFS (which of course they are, because of the image 😁).
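    To illustrate (a sketch, not the literal rc.docker invocation): the daemon just needs to be told to use the btrfs graph driver and where its data root lives; it doesn't care whether that path is a loop-mounted image or a plain directory on the cache:

    dockerd --storage-driver=btrfs --data-root=/var/lib/docker
    docker info | grep -i 'storage driver'     # should report: btrfs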

    I must say that my approach also broke when I changed something in the shares; certain services get restarted, causing docker to be turned off for some reason. No big issue, since it wasn't meant to be a long-term solution, just a way to see whether the loop device was causing the issue, which I think my tests did point out.

     

    Now I'm at the point where I definitely need some developer help. I'm currently keeping nearly all docker containers off all day, because 300-400GB worth of writes a day is just a BIG waste of expensive flash storage, especially since I've shown that it's not needed at all. It does defeat the purpose of my NAS and SSD cache though, since its main purpose was hosting docker containers while allowing the HDDs to spin down.

     

    Again, I'm hoping someone in the dev team acknowledges this problem and is willing to invest some time. I did get quite a few hits on the forums and reddit, but without anyone actually pointing out the root cause of the issue.

     

    I'm missing the technical know-how to troubleshoot the loop device issues at a lower level, but I have been thinking about possible ways to implement a workaround, like adding a switch on the Docker Settings page to turn off the use of a vDisk and, if all requirements are met (pointing to /mnt/cache and BTRFS formatted), start docker on a share on the /mnt/cache partition instead of using the vDisk.

    This way you would still keep all the advantages of the docker.img file (it works across filesystem types) and users who don't care about the writes could still use it, but you'd be massively helping out others who are concerned about them (a rough sketch of the idea follows below).
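    Purely to illustrate the idea, the startup logic could branch on such a setting. Everything below is hypothetical (variable names and the bind-mount approach are mine, not unRAID code):

    DOCKER_ROOT="/mnt/cache/docker"       # hypothetical "directory mode" path from the Docker settings page
    if [ "$USE_VDISK" = "no" ]; then
        # directory mode: only allow it when the target really is BTRFS
        if [ "$(stat -f -c %T "$DOCKER_ROOT")" = "btrfs" ]; then
            mount --bind "$DOCKER_ROOT" /var/lib/docker
        else
            echo "directory mode requires a btrfs-formatted cache path"
            exit 1
        fi
    else
        mount -o loop /mnt/cache/system/docker/docker.img /var/lib/docker   # current vDisk behaviour (path illustrative)
    fi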

     

    I'm not attaching diagnostic files since they would probably not show anything useful here.

    Also, if this should have been in feature requests, I'm sorry, but I feel that, since the current solution misbehaves in terms of writes, it could also be placed in the bug report section.

     

    Thanks for this great product though, I have been using it with a lot of joy so far!

    I'm just hoping we can solve this one so I can keep all my dockers running without the cache wearing out quickly.

     

    Cheers!

     

    • Like 3
    • Thanks 17



    User Feedback

    Recommended Comments



    1 hour ago, limetech said:

    Yes we are looking into this.

     

    15 minutes ago, jonp said:

    Hi everyone and thank you all for your continued patience on this issue.  I'm sure it can be frustrating that this has been going on for as long as it has for some of you and yet this one has been a bit elusive for us to track down as we haven't been able to replicate the issue, but we just ordered some more testing gear to see if we can and I will be dedicating some serious time to this in the weeks ahead.  Gear should arrive this weekend so I'll have some fun testing to do during the 4th of July holiday (and my birthday ;-).

    Thank you both for this, communication is very much appreciated, as well as your efforts!

     

    Most of us know how busy you all have been so don’t worry about it 🙂

     

    I have not seen anyone reporting this on HDDs (I read all comments actively). @TexasUnraid has been shifting data around a lot, have you tried regular hard drives with btrfs by any chance?


    @jonp hope you’ll find a good challenge here, and also, happy birthday in advance! 🥳

    Edited by S1dney
    Link to comment

    I started noticing this a few weeks ago, when I just happened to look at my cache's TBW and thought it was pretty high. This was a single new 1TB NVMe drive, formatted BTRFS, installed in a brand new unRAID box/install in November 2019. Having tracked the TBW for over a week now, it was writing close to 450GB/day while my system was doing nothing remotely close to that. I found this thread and issued the remount no_cache command some people have suggested, and I'm currently looking at 43GB/day (so a 90% reduction). I'm caught between waiting for a fix or clearing it off and reformatting to XFS. Since I have no plans to expand this drive pool, I'm probably just going to reformat as XFS.

    Link to comment
    22 minutes ago, Scorpionhl said:

    just happened to look at my cache's TBW and thought it was pretty high

    TBW = what? Total Bytes Written?

     

    Where are you looking at this?

    a) Main page under Writes column, or

    b) SMART Attributes, Data units written

    Link to comment
    40 minutes ago, limetech said:

    TBW = what? Total Bytes Written?

     

    Where are you looking at this?

    a) Main page under Writes column, or

    b) SMART Attributes, Data units written

    Sorry, was referring to it in terms of how the SSD manufacturers do for warranty, TBW = Terabytes Written. But yes, this is under the smart data: Data units written  187,917,605 [96.2 TB]

    Edited by Scorpionhl
    typo
    • Thanks 1
    Link to comment
    3 hours ago, S1dney said:

     

    Thank you both for this, communication is very much appreciated, as well as your efforts!

     

    Most of us know how busy you all have been so don’t worry about it 🙂

     

    I have not seen anyone reporting this on HDDs (I read all comments actively). @TexasUnraid has been shifting data around a lot, have you tried regular hard drives with btrfs by any chance?


    @jonp hope you’ll find a good challenge here, and also, happy birthday in advance! 🥳

    Yes, I know I put the docker and appdata on a BTRFS array hard drive at one point and saw the same excessive writes as on the SSD. The filesystem format seemed to be all that mattered. The details should be buried in the thread somewhere.

    Link to comment
    2 hours ago, limetech said:

    TBW = what? Total Bytes Written?

     

    Where are you looking at this?

    a) Main page under Writes column, or

    b) SMART Attributes, Data units written

    The exact wording changes depending on SSD brand and model: LBAs, TBW, bytes written, etc.

     

    The end result is usually the same, just different ways of reporting the amount of data written. On Windows, diskinfo automatically converts the numbers to bytes for you; in unRAID you have to calculate it manually, it seems. Some drives count in 512-byte units vs 4096, for example.
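    For anyone doing the conversion by hand, the two counter types seen in this thread work out like this (the NVMe figure is the one Scorpionhl posted above; the SATA unit size depends on the drive, hence the 512 vs 4096):

    # NVMe: "Data Units Written" is in units of 1000 x 512-byte sectors = 512,000 bytes
    echo $(( 187917605 * 512000 ))    # 96213813760000 bytes, i.e. ~96.2 TB, matching the SMART report

    # SATA: "Total_LBAs_Written" is in logical sectors, usually 512 bytes (4096 on some models)
    echo $(( 123456789 * 512 ))       # example LBA count, not a value from this thread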

    Link to comment
    2 hours ago, TexasUnraid said:

    The details should be buried in the thread somewhere.

    Would appreciate if you could find that and/or repeat experiment to be sure.  One thing we're working on is to define a partition layout for SSD that starts at sector 2048 (like windows) - this aligns a partition on a 1 MiB boundary so there are no issues with SSD blocks misaligned with logical sectors.
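    A quick way to see where an existing partition starts, and the arithmetic behind why sector 2048 matters (device name is an example):

    fdisk -l /dev/sdX                 # the partition table lists the Start sector
    # with 512-byte logical sectors: 2048 * 512 = 1,048,576 bytes = 1 MiB,
    # so a start sector of 2048 puts partition 1 on a 1 MiB boundary and keeps
    # SSD blocks aligned with the filesystem's logical sectors.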

    Link to comment
    10 hours ago, jonp said:

    Also, if anyone has seen this issue affect their cache pool using HDDs, can you please reply in this thread and let us know?  I'm fairly certain this is an SSD-only issue, but better to ask than assume.

    I'm having the issue with a non-encrypted BTRFS cache pool of two 500GB NVMe drives.
    Within a week I had about 1 TB written accumulated on the drives. I have moved the VM and Docker appdata to an XFS SSD and monitored the data closely by taking samples at various points of the day into Excel. To make sure there is no error in the TBW display I calculated the TBW myself, which shows that the reported value is correct.
    I had about 200GB per day on the BTRFS pool and about 14GB per day on the XFS SSD.

    Link to comment
    5 hours ago, limetech said:

    One thing we're working on is to define a partition layout for SSD that starts at sector 2048 (like windows)

    Only did a quick test, but this appears to make a notable difference for me. I tested moving my problem VM to an unassigned device, first using the default partition alignment and then aligned to 1MiB. Watching the I/O stats for 5 minutes, writes decreased by about a factor of three, and performance on that device was also noticeably better. Writes are still high, but this is most likely a step in the right direction.

    Link to comment

    Hi Everyone,

     

    Just adding my experience here to the pool of information.

    I had this issue when I was running the official Plex container; the cache was constantly writing an average of 5MB/s, non-stop.

    I changed to the linuxserver container and noticed that there would just be surges of writes of up to 7MB/s, but not consistently, looking at the 'Main' screen. I then purchased a new SSD and assumed it was fixed.

     

    After double-checking the forums, I thought I would run iotop and check, and I am having the same issue! 15TB written in less than a month on the new SSD, and it was averaging over a GB per 10 minutes! Clearly this looks different from the Plex container issue but is the same underlying problem.

    After running the remount command it's dropped down to around 130 MB per 10 minutes, so a huge change.

    Has anyone swapped to XFS on their cache as a result of this? I'm thinking I might need to!

    Link to comment
    10 hours ago, limetech said:

    Would appreciate if you could find that and/or repeat experiment to be sure.  One thing we're working on is to define a partition layout for SSD that starts at sector 2048 (like windows) - this aligns a partition on a 1 MiB boundary so there are no issues with SSD blocks misaligned with logical sectors.

    It was nothing amazing really. I put appdata and dockers onto basically every drive in my system and watched the writes, and any BTRFS-formatted drive would have greatly inflated writes, be it SSD or HDD. The exact amount of writes would vary a little but was generally in the same ballpark.

     

    Any XFS formatted drive, SSD or HDD would have the correct number of writes.

     

    Didn't matter if the drive was in the array or cache. Encryption or no encryption. Although the numbers did vary, they were always inflated on a BTRFS drive vs XFS.

     

    I think the best I ever got a BTRFS drive down to was ~2-3x the actual writes on an XFS drive.

     

    I did notice that increasing the dirty writes flush interval from 30 seconds to 2-3 minutes seemed to reduce writes a fair amount in some cases. Not sure if that might trigger something (see the sketch below).
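    If anyone wants to try the same, that "dirty writes" setting most likely maps to the kernel writeback sysctls; the 2-minute values below are just what I mean by 2-3 minutes, not a recommendation:

    sysctl vm.dirty_expire_centisecs              # default 3000: dirty pages older than 30 s get flushed
    sysctl vm.dirty_writeback_centisecs           # default 500: the flusher thread wakes every 5 s
    sysctl -w vm.dirty_expire_centisecs=12000     # 2 minutes
    sysctl -w vm.dirty_writeback_centisecs=12000  # 2 minutes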

    Edited by TexasUnraid
    Link to comment

    Please Stop array, open terminal window and type this command:

    sed -i 's/-o loop/-o loop,noatime/' /usr/local/sbin/mount_image

    Then Start array and let me know if that helps reduce the write load (and by how much if any).
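    After the array is back up, one way to confirm the option was picked up (assuming docker.img is mounted at the usual /var/lib/docker):

    findmnt -no OPTIONS /var/lib/docker     # the option list should now include "noatime"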

    Link to comment

    How would I see what is being written? 

     

    Eleven days ago, I replaced my cache with 2 x 1TB NVMe in a BTRFS RAID1. Since then, more than 1TB per day has been written to the drives. That seems excessive since I am only using it for Docker, appdata, and a couple of shares (which have only had ~400GB written to them in that time).

     

    Cache 1:

    smartctl 7.1 2019-12-30 r5022 [x86_64-linux-4.19.107-Unraid] (local build)
    Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Model Number:                       Samsung SSD 970 PRO 1TB
    Serial Number:                      
    Firmware Version:                   1B2QEXP7
    PCI Vendor/Subsystem ID:            0x144d
    IEEE OUI Identifier:                0x002538
    Total NVM Capacity:                 1,024,209,543,168 [1.02 TB]
    Unallocated NVM Capacity:           0
    Controller ID:                      4
    Number of Namespaces:               1
    Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
    Namespace 1 Utilization:            719,635,980,288 [719 GB]
    Namespace 1 Formatted LBA Size:     512
    Namespace 1 IEEE EUI-64:            002538 540150134f
    Local Time is:                      Thu Jun 25 11:50:58 2020 CDT
    Firmware Updates (0x16):            3 Slots, no Reset required
    Optional Admin Commands (0x0037):   Security Format Frmw_DL Self_Test Directvs
    Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
    Maximum Data Transfer Size:         512 Pages
    Warning  Comp. Temp. Threshold:     81 Celsius
    Critical Comp. Temp. Threshold:     81 Celsius
    
    Supported Power States
    St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
     0 +     6.20W       -        -    0  0  0  0        0       0
     1 +     4.30W       -        -    1  1  1  1        0       0
     2 +     2.10W       -        -    2  2  2  2        0       0
     3 -   0.0400W       -        -    3  3  3  3      210    1200
     4 -   0.0050W       -        -    4  4  4  4     2000    8000
    
    Supported LBA Sizes (NSID 0x1)
    Id Fmt  Data  Metadt  Rel_Perf
     0 +     512       0         0
    
    === START OF SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    SMART/Health Information (NVMe Log 0x02)
    Critical Warning:                   0x00
    Temperature:                        47 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    0%
    Data Units Read:                    2,930,203 [1.50 TB]
    Data Units Written:                 21,981,884 [11.2 TB]
    Host Read Commands:                 25,097,587
    Host Write Commands:                411,986,950
    Controller Busy Time:               4,473
    Power Cycles:                       14
    Power On Hours:                     281
    Unsafe Shutdowns:                   6
    Media and Data Integrity Errors:    0
    Error Information Log Entries:      0
    Warning  Comp. Temperature Time:    0
    Critical Comp. Temperature Time:    0
    Temperature Sensor 1:               47 Celsius
    Temperature Sensor 2:               57 Celsius
    
    Error Information (NVMe Log 0x01, max 64 entries)
    No Errors Logged
    

     

    Cache 2:

    smartctl 7.1 2019-12-30 r5022 [x86_64-linux-4.19.107-Unraid] (local build)
    Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
    
    === START OF INFORMATION SECTION ===
    Model Number:                       Samsung SSD 970 PRO 1TB
    Serial Number:                      
    Firmware Version:                   1B2QEXP7
    PCI Vendor/Subsystem ID:            0x144d
    IEEE OUI Identifier:                0x002538
    Total NVM Capacity:                 1,024,209,543,168 [1.02 TB]
    Unallocated NVM Capacity:           0
    Controller ID:                      4
    Number of Namespaces:               1
    Namespace 1 Size/Capacity:          1,024,209,543,168 [1.02 TB]
    Namespace 1 Utilization:            719,635,988,480 [719 GB]
    Namespace 1 Formatted LBA Size:     512
    Namespace 1 IEEE EUI-64:            002538 510150a811
    Local Time is:                      Thu Jun 25 11:53:44 2020 CDT
    Firmware Updates (0x16):            3 Slots, no Reset required
    Optional Admin Commands (0x0037):   Security Format Frmw_DL Self_Test Directvs
    Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
    Maximum Data Transfer Size:         512 Pages
    Warning  Comp. Temp. Threshold:     81 Celsius
    Critical Comp. Temp. Threshold:     81 Celsius
    
    Supported Power States
    St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
     0 +     6.20W       -        -    0  0  0  0        0       0
     1 +     4.30W       -        -    1  1  1  1        0       0
     2 +     2.10W       -        -    2  2  2  2        0       0
     3 -   0.0400W       -        -    3  3  3  3      210    1200
     4 -   0.0050W       -        -    4  4  4  4     2000    8000
    
    Supported LBA Sizes (NSID 0x1)
    Id Fmt  Data  Metadt  Rel_Perf
     0 +     512       0         0
    
    === START OF SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED
    
    SMART/Health Information (NVMe Log 0x02)
    Critical Warning:                   0x00
    Temperature:                        45 Celsius
    Available Spare:                    100%
    Available Spare Threshold:          10%
    Percentage Used:                    0%
    Data Units Read:                    4,319,316 [2.21 TB]
    Data Units Written:                 21,981,076 [11.2 TB]
    Host Read Commands:                 38,573,640
    Host Write Commands:                412,195,982
    Controller Busy Time:               4,469
    Power Cycles:                       27
    Power On Hours:                     278
    Unsafe Shutdowns:                   13
    Media and Data Integrity Errors:    0
    Error Information Log Entries:      5
    Warning  Comp. Temperature Time:    0
    Critical Comp. Temperature Time:    0
    Temperature Sensor 1:               45 Celsius
    Temperature Sensor 2:               47 Celsius
    
    Error Information (NVMe Log 0x01, max 64 entries)
    No Errors Logged
    

     

    Link to comment
    59 minutes ago, StevenD said:

    I suppose I can run that command and see where it sits 24 hours from now.

    Right.  Could probably compare 1 hour run time.

    Link to comment

    Looks like that worked.  I will check it again tomorrow. I would expect to see it over 12TB tomorrow.

     

    Cache1:

    22,040,574 [11.2 TB]

     

    Cache2:

    22,039,620 [11.2 TB]

    Edited by StevenD
    • Thanks 1
    Link to comment
    1 hour ago, StevenD said:

    Looks like that worked. 

    If that's one hour of usage after the change it still wrote about 30GB, that's 720GB per day.

     

    Edited to correct values, had misplaced a 0 before.

    • Like 1
    Link to comment

    I'll share again the nice little script that someone (I don't remember who) made a few months back here:

     

    #!/bin/bash
    
    ### replace sdc below with the device you want TBW calculated for ###
    device=/dev/sdc
    
    sudo smartctl -A $device |awk '
    $0 ~ /Power_On_Hours/ { poh=$10; printf "%s / %d hours / %d days / %.2f years\n",  $2, $10, $10 / 24, $10 / 24 / 365.25 }
    $0 ~ /Total_LBAs_Written/ {
       lbas=$10;
       bytes=$10 * 512;
       mb= bytes / 1024^2;
       gb= bytes / 1024^3;
       tb= bytes / 1024^4;
       #printf "%s / %s  / %d mb / %.1f gb / %.3f tb\n", $2, $10, mb, gb, tb
       printf "%s / %.2f GB / %.2f TB\n", $2, gb, tb
       printf "Mean writes per hour  / %.3f GB / %.3f TB\n",  gb/poh, tb/poh
    }
    $0 ~ /Wear_Leveling_Count/ { printf "%s / %d (%% health)\n", $2, int($4) }
    ' |
       sed -e 's:/:@:' |
       sed -e "s\$^\$$device @ \$" |
       column -ts@

    Output example:
     

    /dev/sdc    Power_On_Hours           22386 hours / 932 days / 2.55 years
    /dev/sdc    Wear_Leveling_Count      91 (% health)
    /dev/sdc    Total_LBAs_Written       282880.40 GB / 276.25 TB
    /dev/sdc    Mean writes per hour     12.636 GB / 0.012 TB

     

    Edited by Andreen
    meh
    Link to comment

    Looks like about 400GB was written yesterday. Nothing was written except normal docker appdata stuff.
     

    22,685,135 [11.6 TB]

    22,687,899 [11.6 TB]

    Link to comment
    1 minute ago, StevenD said:

    Looks like about 400GB was written yesterday.

    So that's still a good result, was close to 1TB a day before, correct?

    Link to comment
    1 minute ago, johnnie.black said:

    So that's still a good result, was close to 1TB a day before, correct?

    Correct. Certainly better. 

    Link to comment
    3 minutes ago, StevenD said:

    Looks like about 400GB

    And looking at the data units written it was more like 332GB (considering the last stats you posted, though they were posted about 20H ago, so still around 400GB per day)

    • Like 1
    Link to comment
    3 minutes ago, StevenD said:

    Correct. Certainly better.

    On top of that tweak you can also try this and see if there's a further decrease; it's perfectly safe to use.

    • Like 1
    Link to comment
    2 hours ago, johnnie.black said:

    On top of that tweak you can also try this and see if there's a further decrease; it's perfectly safe to use.

    Also could use 'space_cache=v2'.

     

    Upcoming -beta23 has these changes to address this issue:

    • set 'noatime' option when mounting loopback file systems
    • include 'space_cache=v2' option when mounting btrfs file systems
    • default partition 1 start sector aligned on 1MiB boundary for non-rotational storage. Will require wiping the partition structure on existing SSD devices first to make use of this.
    • Like 4
    • Thanks 2
    Link to comment
    15 minutes ago, limetech said:

    Also could use 'space_cache=v2'.

     

    Upcoming -beta23 has these changes to address this issue:

    • set 'noatime' option when mounting loopback file systems
    • include 'space_cache=v2' option when mounting btrfs file systems
    • default partition 1 start sector aligned on 1MiB boundary for non-rotational storage. Will require wiping the partition structure on existing SSD devices first to make use of this.

    I have another 2TB NVMe installed, so I can easily back up, wipe and restore the cache pool.

    • Thanks 1
    Link to comment




