[6.8.3] docker image huge amount of unnecessary writes on cache

boomam · June 27, 2020

On 6/25/2020 at 12:13 PM, limetech said:
Please Stop array, open terminal window and type this command:
sed -i 's/-o loop/-o loop,noatime/' /usr/local/sbin/mount_image
Then Start array and let me know if that helps reduce the write load (and by how much if any).

Hi,

What happens if we run this now as a potential workaround, and then 6.9 drops?

Do we need to reverse it?

I have run "sed -i 's/-o loop/-o loop,noatime,space_cache=v2/' /usr/local/sbin/mount_image" on my array and im retest.

Before it was clocking 1,600Mb in 20mins, lets see what it does now once the arrays calmed down from all the container/VM restarts....

Edited June 27, 2020 by boomam

boomam · June 27, 2020

Unfortunately this didn't really make a difference for me - 20mins later and its at 1,500Mb written from loop2.

I'm gonna leave it be for a bit whilst i'm out and retest over a 20min period later to see if its just a result of settling back down after the array came back online, but so i'm somewhat prepared, how would i reverse the command to put it back to 'stock' ?

Poking about, the config in /usr/local/sbin shows the noatime,space_cache=v2 repeated several times and on several lines. Is that normal?

Quote

#!/bin/sh
#Copyright 2005-2016, Lime Technology
#License: GPLv2 only

# mount_image filepath mountpoint size

IMAGE_FILE=$1
IMAGE_SIZE=$3
MOUNTPOINT=$2

# if no image file we'll create one
if [ ! -e "${IMAGE_FILE}" ]; then
echo "Creating new image file: ${IMAGE_FILE} size: ${IMAGE_SIZE}G"
# ensure parent path exists
mkdir -p $(dirname "${IMAGE_FILE}")
# create 0-length file so we can set NOCOW attribute on the new file and then extend size of file
touch "${IMAGE_FILE}"
# if image file is located on a "user share" dereference to get real device path
DISK=`getfattr -n system.LOCATION --only-values --absolute-names "$IMAGE_FILE" 2>/dev/null`
if [ "$DISK" != "" ]; then
IMAGE_FILE="${IMAGE_FILE/user/$DISK}"
fi
# setting NOCOW attribute will fail if not btrfs, but so what?
chattr +C "${IMAGE_FILE}" 2>/dev/null
# try to use fallocate first
if ! fallocate -l ${IMAGE_SIZE}G "${IMAGE_FILE}" 2>/dev/null ; then
# try truncate
if ! truncate -s +${IMAGE_SIZE}G "${IMAGE_FILE}" 2>/dev/null ; then
echo "failed to create image file"
rm -f "${IMAGE_FILE}" 2>/dev/null
exit 1
fi
fi
# set ownership
chown nobody:users "${IMAGE_FILE}"
# create btrfs file system in the image file
if ! mkfs.btrfs "${IMAGE_FILE}" ; then
echo "failed to create btrfs file system"
rm -f "${IMAGE_FILE}" 2>/dev/null
exit 1
fi
# mount
if ! mount -t btrfs -o loop,noatime,space_cache=v2,noatime,space_cache=v2,noatime,space_cache=v2 "${IMAGE_FILE}" "$MOUNTPOINT" ; then
echo "mount error"
rm -f "${IMAGE_FILE}" 2>/dev/null
exit 1
fi
else
# exists, check that it's a regular file
if [ ! -f "${IMAGE_FILE}" ]; then
echo "${IMAGE_FILE} is not a file"
exit 1
fi
# check that the file is not already in-use (or being moved by 'mover')
if /usr/local/sbin/in_use "${IMAGE_FILE}" ; then
echo "${IMAGE_FILE} is in-use, cannot mount"
exit 1
fi
# if image file is located on a "user share" dereference to get real device path
DISK=`getfattr -n system.LOCATION --only-values --absolute-names "$IMAGE_FILE" 2>/dev/null`
if [ "$DISK" != "" ]; then
IMAGE_FILE="${IMAGE_FILE/user/$DISK}"
fi
# (maybe) extend file size
# try to use fallocate first, will never make the file smaller
if ! fallocate -l ${IMAGE_SIZE}G "${IMAGE_FILE}" 2>/dev/null ; then
# try truncate
truncate -s \>${IMAGE_SIZE}G "${IMAGE_FILE}"
fi
# mount
if ! mount -t btrfs -o loop,noatime,space_cache=v2,noatime,space_cache=v2,noatime,space_cache=v2 "${IMAGE_FILE}" "$MOUNTPOINT" ; then
echo "mount error"
exit 1
fi
# (maybe) extend file system size
btrfs filesystem resize max "$MOUNTPOINT"
fi
exit 0

Edited June 27, 2020 by boomam

limetech · June 27, 2020

23 minutes ago, boomam said:

Unfortunately this didn't really make a difference for me - 20mins later and its at 1,500Mb written from loop2.

The most effective work-around until we complete our testing is to type this after array has Started:

mount -o remount -o space_cache=v2 /mnt/cache

(If you are using a different named pool of course use that mount point instead.)

boomam · June 27, 2020

4 minutes ago, limetech said:
The most effective work-around until we complete our testing is to type this after array has Started:
mount -o remount -o space_cache=v2 /mnt/cache
(If you are using a different named pool of course use that mount point instead.)

Thanks for the quick reply.

Do I need to do anything to revert the other command I ran? I assume if the new workaround is post-array start that it won't be affecting the 6.9 update when it drops?

& Is it normal for those commands to be duplicated on the same line in the config I posted above?

limetech · June 27, 2020

26 minutes ago, boomam said:

Do I need to do anything to revert the other command I ran? I assume if the new workaround is post-array start that it won't be affecting the 6.9 update when it drops?

& Is it normal for those commands to be duplicated on the same line in the config I posted above?

Looks like you typed that 'sed' command more than once. In any case, the file reverts to original upon reboot. If you put that in your 'go' file I'd say remove it.

Lignumaqua · June 27, 2020

Caveat - I am in no way any kind of expert on either Unraid or Linux fiiing systems. I'm just trying to join the dots here! I could easily have done so incorrectly so feel free to shoot this down. I won't be offended. You may already all know this, but I didn't, and I feel a little bit more knowledgeable now.

The link below is to an interesting paper from 2017 that compares the write amplification of different file systems. BTRFS is by far the worst with a factor of 32x for small files overwrite and append when COW is enabled. With COW disabled this dropped to 18.6X, which is still pretty significant. This is three years ago, so things may have changed. In particular space_cache V2 could be a reaction to this? BTRFS + writing or amending small files = very high write amplification.

https://arxiv.org/abs/1707.08514

This suggests that BTRFS is a great system for secure storage of data files, but not necessarily a good choice for writing multiple small temporary files, or for log files that are continually being amended. Looking at common uses of the cache in Unraid might lead to the following suppositions. A BTRFS cache using Raid 1 is a good place for downloaded files before they are moved into the array. It's also good for any static data files. However, it's likely not to be the best place for a Docker img file or any kind of temporary storage. Particularly if redundant storage isn't needed. XFS might be a better choice there.

Docker appdata is a tricky one. That's likely data you want to be redundantly stored, but it might also be changing rapidly. Likely to contain databases for example. I can see that an SQLite or MySQL database could be a real issue with BTRFS write amplification.

The Docker img itself also being BTRFS is a further complication that makes my head hurt...

The new cache pools will likely be a great way to help deal with this dilemma! 🙂

limetech · June 27, 2020

3 minutes ago, Lignumaqua said:

The Docker img itself also being BTRFS is a further complication that makes my head hurt...

Thank you for the post. Yes there a very large number of variables, my head hurts too...

boomam · June 27, 2020

1 hour ago, limetech said:

Looks like you typed that 'sed' command more than once. In any case, the file reverts to original upon reboot. If you put that in your 'go' file I'd say remove it.

I don't know what "go file' is sorry, never heard that terminology.

I just rebooted the system, and then ran "mount -o remount -o space_cache=v2 /mnt/cache" in a SSH session.

Re-testing now...

boomam · June 27, 2020

ok.

Another 20min test -

Pre-any change = 1,600Mb in 20 min.

Post "sed" change = 1500Mb in 20 min.

Post "/mnt/cache" change = 300Mb in 20 min

(mount -o remount -o space_cache=v2 /mnt/cache)

Thats the difference between

115Gb/day or 3.34Tb/month.

&

21Gb/day or 648Gb/month

That's a major improvement! Still not perfect, but still noticeably better than it was.

I'll see if i can find a way to script that to run on array start each time until 6.9 rolls around.

Thanks for the input!

StevenD · June 27, 2020

1 hour ago, boomam said:

ok.

Another 20min test -

Pre-any change = 1,600Mb in 20 min.

Post "sed" change = 1500Mb in 20 min.

Post "/mnt/cache" change = 300Mb in 20 min

(mount -o remount -o space_cache=v2 /mnt/cache)

Thats the difference between

115Gb/day or 3.34Tb/month.

&

21Gb/day or 648Gb/month

That's a major improvement! Still not perfect, but still noticeably better than it was.

I'll see if i can find a way to script that to run on array start each time until 6.9 rolls around.

Thanks for the input!

Either put it in your go file (in the config folder on your flash drive), or add it to the User Scripts plugin.

Edited June 27, 2020 by StevenD

boomam · June 27, 2020

1 minute ago, StevenD said:

Either put it in your go file (in the config folder on your flash drive), or add it to the User Scripts plugin.

I went with the latter route for now, slightly easier to turn off pre-6.9. 🙂

nickp85 · June 28, 2020

On 6/14/2020 at 2:17 AM, nickp85 said:

-Data units read [38.7 TB]

-Data units written [107 TB]

In 13 days I'm up to here...

Data units read 92,516,764 [47.3 TB]

Data units written 213,412,436 [109 TB]

This is crazy... my cache is really only used to run dockers and my windows 10 VM. 2 TB in 13 days is nuts

Edited June 28, 2020 by nickp85

tjb_altf4 · June 28, 2020

My usage was 5TB over 20 days (BTRFS RAID0 cache on 6.8.3).

After moving docker.img to array and back again, my daily writes in the last 7 days has dropped from an average of 0.25TB to 0.14TB (daily data units from 442,229 to 254,563).

A near 50% write reduction.

Edited June 28, 2020 by tjb_altf4

TexasUnraid · June 28, 2020

7 minutes ago, tjb_altf4 said:

My usage was 5TB over 20 days (BTRFS RAID0 cache on 6.8.3).

After moving docker.img to array and back again, my daily writes in the last 7 days has dropped from an average of 0.25TB to 0.14TB (daily data units from 442,229 to 254,563).

A near 50% write reduction.

Have you tried the remount command mentioned above?

tjb_altf4 · June 28, 2020

35 minutes ago, TexasUnraid said:

Have you tried the remount command mentioned above?

Haven't tried that yet, want to collect a few more days worth of data before changing anything else.

itimpi · June 28, 2020

12 hours ago, Lignumaqua said:

The link below is to an interesting paper from 2017 that compares the write amplification of different file systems. BTRFS is by far the worst with a factor of 32x for small files overwrite and append when COW is enabled. With COW disabled this dropped to 18.6X, which is still pretty significant. This is three years ago, so things may have changed. In particular space_cache V2 could be a reaction to this? BTRFS + writing or amending small files = very high write amplification.

https://arxiv.org/abs/1707.08514

This suggests that BTRFS is a great system for secure storage of data files, but not necessarily a good choice for writing multiple small temporary files, or for log files that are continually being amended. Looking at common uses of the cache in Unraid might lead to the following suppositions. A BTRFS cache using Raid 1 is a good place for downloaded files before they are moved into the array. It's also good for any static data files. However, it's likely not to be the best place for a Docker img file or any kind of temporary storage. Particularly if redundant storage isn't needed. XFS might be a better choice there.

I found this research article to be of great interest as it indicates that a large amount of write amplification is inherent in using the BTRFS file system.

I guess this raises a few questions worth thinking about:

Is there a specific advantage to having the docker image file formatted internally as BTRFS or could an alternative such as XFS help reduce the write amplification without any noticeable change in capabilities.
This amplification is not specific to SSD's.
The amplification is worse for small files (as are typically found in appdata share).
Are there any BTRFS settings that can be applied at the folder level to reduce write amplification. I am thinking here of the 'system' and 'appdata' folders.
If you have the CA Backup plugin to provide periodic automated backup of the appdata folder is it worth having that share on a single drive pool formatted as XFS to keep amplification to a minimum. The 6.9.0 support for multiple cache pools will help if you need to segregate by file format.

Symon · June 28, 2020

Just some stats to add:

UnRaid 6.8.3

BTRFS Raid 1 500GB NVME Cache

7 Dockers constantly running

Plex (Binhex)

Deluge (Binhex)

Transmission

letsencrypt

mariadb

nextcloud

Teamspeak (Binhex)

VM's

One windows 10 VM (mostly idle)

Loop2:

Pre changes = 20-25GB in 1 Day.

Post "sed" change = 18GB in 1 Day.

Post "/mnt/cache" change = 4GB in 1 Day

(mount -o remount -o space_cache=v2 /mnt/cache)

I recreated the whole cache 3 Weeks ago. (Tried to move to XFS and didn't realise that raid 1 isn't possible so I created BTRFS again)

If you need any configurtions or settings from my current system just let me know.

Edited June 28, 2020 by Symon

tjb_altf4 · June 28, 2020

2 hours ago, itimpi said:

I guess this raises a few questions worth thinking about:

Is there a specific advantage to having the docker image file formatted internally as BTRFS or could an alternative such as XFS help reduce the write amplification without any noticeable change in capabilities.

XFS isn't a supported backend for Docker, overlay2 seems to be the other usual choice.

S1dney · June 28, 2020

On 6/28/2020 at 10:33 AM, itimpi said:

I found this research article to be of great interest as it indicates that a large amount of write amplification is inherent in using the BTRFS file system.

I guess this raises a few questions worth thinking about:

Is there a specific advantage to having the docker image file formatted internally as BTRFS or could an alternative such as XFS help reduce the write amplification without any noticeable change in capabilities.

This amplification is not specific to SSD's.

The amplification is worse for small files (as are typically found in appdata share).

Are there any BTRFS settings that can be applied at the folder level to reduce write amplification. I am thinking here of the 'system' and 'appdata' folders.

If you have the CA Backup plugin to provide periodic automated backup of the appdata folder is it worth having that share on a single drive pool formatted as XFS to keep amplification to a minimum. The 6.9.0 support for multiple cache pools will help if you need to segregate by file format.

Very interesting indeed.

This got me thinking...

I noticed that writing directly onto the BTRFS cache reduced writes by a factor or roughly 10.

Now I did felt like this was still on the high side, as it's still writing 40GB a day.

What if.... this is still amplified by a factor of 10 also.

Could this mean the a BTRFS formatted image on a BTRFS formatted partition results in 10x10=100 times write amplification?

If I recall correctly someone pointed out a 100x write amplification number earlier in the thread?

I think this is well suited for a test 🙂

I've just recreated the loopimage formatted on XFS.

I'll test my TB's written in a few minutes and check again after an hour.

EDIT:

Just noticed your comment @tjb_altf4

23 hours ago, tjb_altf4 said:

XFS isn't a supported backend for Docker, overlay2 seems to be the other usual choice.

The default seems to work already, XFS is formatted nicely with the correct options:

root@Tower:/# docker info
Client:
 Debug Mode: false

Server:
 Containers: 21
  Running: 21
  Paused: 0
  Stopped: 0
 Images: 35
 Server Version: 19.03.5
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true

According to the docker docs this should be fine, which xfs_info /var/lib/docker seems to confirm:

root@Tower:/# xfs_info /var/lib/docker
meta-data=/dev/loop2             isize=512    agcount=4, agsize=1310720 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=5242880, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

I'm just testing it out, cause I'm curious if it matters.

EDIT2, after 1 hour of runtime:

TBW on 28-06-2020 13:44:11 --> 11.8 TB, which is 12049.6 GB.
TBW on 28-06-2020 14:44:38 --> 11.8 TB, which is 12057.8 GB.

1 hour of running on XFS formatted loopdevice equals 8,2GB written, this would translate into 196,8GB a day.

This would most likely be a bit more due to backup tasks at night.

It's still on the high side, compared to running directly onto the BTRFS filesystem, which results in 40GB a day.

In December 2019 I was seeing 400GB's a day though (running without modifications), and my docker count has increased a bit, so 200 is better. Haven't tried any other options, like the mount options specified. I expect these will bring the writes down regardless of the loopdevice being used since they're ran on the entire BTRFS mount, so the amplification with the loopdevice is likely to occur with them also.

Still kind of sad though, I would have expected to see a very minor write amplification instead of 5 times. Guess that theory of 10x10 doesn't check out then..

Rolling back to my previous state, as I'd take 40GB over 200 any day 😄

EDIT3:

Decided to remount the cache with space_cache=v2 set, running directly on the cache this gave me 9GB of writes in the last 9 hours.

When the new unRAID version drops I'll reformat my cache with the new alignment settings. For now that space_cache=v2 setting does its magic

Edited June 29, 2020 by S1dney

Lignumaqua · June 28, 2020

4 hours ago, S1dney said:

Could this mean the a BTRFS formatted image on a BTRFS formatted partition results in 10x10=100 times write amplification?

I had the same thought, and have written to the authors of the paper here in Austin TX to see if they have any knowledge of how the two BTRFS systems interact. One of the authors is a lecturer and researcher on containerization so may well have studied this. I’ll report back here if/when I get a reply. If this were a genuine concern then, with their reported worst case of 32x amplification, that could lead to 32x32 = 1024x! 😲

tjb_altf4 · July 2, 2020

On 6/28/2020 at 10:41 AM, tjb_altf4 said:

My usage was 5TB over 20 days (BTRFS RAID0 cache on 6.8.3).

After moving docker.img to array and back again, my daily writes in the last 7 days has dropped from an average of 0.25TB to 0.14TB (daily data units from 442,229 to 254,563).

A near 50% write reduction.

Average is still pretty consistent with this.

On 6/28/2020 at 10:49 AM, TexasUnraid said:

Have you tried the remount command mentioned above?

I've run the remount command now, after 30 min the only noticeable writes are from a test media file (4.5GB), which increased writes only by the same amount. Loop is only sitting at 160MB, pretty good as I have a dozen or so very active dockers.

I'll continue collecting data points for the next few days to see how it goes.

dEAd0 · July 2, 2020

Hi, the command mount -o remount -o space_cache=v2 /mnt/cache doesn't seem to do anything.

i've checked and mnt/cache should be correct..

Anybody any ideas? Should I

even see something happening?

Thanks in advance

JorgeB · July 2, 2020

8 minutes ago, dEAd0 said:

Hi, the command mount -o remount -o space_cache=v2 /mnt/cache doesn't seem to do anything.

There won't be any output from the command, but there are a couple of ways you can confirm it's working, by looking at the syslog, there will be be something like this:

Jun 26 13:38:23 Tower1 kernel: BTRFS info (device nvme1n1p1): enabling free space tree
Jun 26 13:38:23 Tower1 kernel: BTRFS info (device nvme1n1p1): using free space tree

or checking the output of the "mount" command and looking at your cache device:

/dev/nvme1n1p1 on /mnt/cache type btrfs (rw,noatime,nodiratime,space_cache=v2)

dEAd0 · July 2, 2020

2 minutes ago, johnnie.black said:
There won't be any output from the command, but there are a couple of ways you can confirm it's working, by looking at the syslog, there will be be something like this:
Jun 26 13:38:23 Tower1 kernel: BTRFS info (device nvme1n1p1): enabling free space tree
Jun 26 13:38:23 Tower1 kernel: BTRFS info (device nvme1n1p1): using free space tree
or checking the output of the "mount" command and looking at your cache device:
/dev/nvme1n1p1 on /mnt/cache type btrfs (rw,noatime,nodiratime,space_cache=v2)

Gotcha, the info is indeed being shown in the syslog. Thanks for the quick reply!

tjb_altf4 · July 7, 2020

On 6/28/2020 at 10:41 AM, tjb_altf4 said:

My usage was 5TB over 20 days (BTRFS RAID0 cache on 6.8.3).

After moving docker.img to array and back again, my daily writes in the last 7 days has dropped from an average of 0.25TB to 0.14TB (daily data units from 442,229 to 254,563).

A near 50% write reduction.

Having run the remount command last week, the last 5 days has seen the daily data units write average drop further to 155,949.

Good result for now.

[6.8.3] docker image huge amount of unnecessary writes on cache

User Feedback

Recommended Comments

boomam 15

Link to comment

boomam 15

Link to comment

limetech 3326

Link to comment

boomam 15

Link to comment

limetech 3326

Link to comment

Lignumaqua 9

Link to comment

limetech 3326

Link to comment

boomam 15

Link to comment

boomam 15

Link to comment

StevenD 88

Link to comment

boomam 15

Link to comment

nickp85 17

Link to comment

tjb_altf4 395

Link to comment

TexasUnraid 113

Link to comment

tjb_altf4 395

Link to comment

itimpi 2237

Link to comment

Symon 17

Link to comment

tjb_altf4 395

Link to comment

S1dney 49

Link to comment

Lignumaqua 9

Link to comment

tjb_altf4 395

Link to comment

dEAd0 1

Link to comment

JorgeB 7466

Link to comment

dEAd0 1

Link to comment

tjb_altf4 395

Link to comment

Join the conversation