Reconsider Btrfs NOCOW Default Option on domains Share due to Irrecoverable Corruption Risks



Definitions and Preamble

Skip this section and read the next one if you already understand what COW is and how Btrfs works.

 

For users who are unaware of what COW (and, by extension, NOCOW) is: COW stands for copy-on-write, and NOCOW is simply the absence of it. Btrfs (and ZFS) are COW-based filesystems, in contrast to something like XFS, ReiserFS, EXT4, NTFS, etc. While those filesystems overwrite blocks in place when an overwrite request is made, copy-on-write filesystems always write changes to newly allocated space, even if only a portion of the file is "overwritten". Once the change has been written to new space, the filesystem updates its metadata references to reflect the new state of the file. This is how these modern filesystems achieve "atomicity" without using a "journal". (I recognize this is a simplified definition, but I think it gets the point across; after all, metadata operations also use COW, but this should be enough for most people to understand.)

 

Basically, if any filesystem operation on Btrfs or ZFS is interrupted (by a system crash, power failure, or a flaky disk) before the change has been fully committed to the disk(s), the old state is retained on the disk(s). There is no need to run any fsck or do anything extra, even for metadata, since the change was never fully committed. That's the idea behind atomicity: either the entire operation completes in full, or it didn't happen.

 

Most journaling filesystems (apart from ext4 in certain configurations) do not journal data blocks at all (doing so would kill performance), and there's not really any need to, since as far as these filesystems are concerned they only ever exist on one disk. If they are used on any form of RAID, it is that RAID layer that needs to ensure everything is in sync, whether that's a battery-backed RAID controller or the write-intent bitmaps MD and LVM use. Unraid likewise works to keep its own parity in sync by running a parity sync operation after a system failure for its array devices, and running a scheduled parity check is of course also encouraged in case something goes wrong during normal operation. While Unraid can't avoid the "write hole", at least it can be mitigated.

 

Btrfs and ZFS are different though. As we know, they are the RAID platform in addition to simply being filesystems: these filesystems can exist on pools of multiple disks. Btrfs supports its own unique spin on "RAID1" (among other profiles), as does ZFS. Since they're handling redundancy themselves, extra thought needs to go into ensuring redundant copies of files are always in sync. When you use a RAID1 profile with Btrfs, a file is replicated across 2 devices, so there are effectively two "copies" of the same file.

The Issue

Since a Btrfs Cache pool in Unraid can span multiple devices, it needs to ensure atomicity of the redundant copies on the pool disks to provide reliable redundancy. It is impossible to write to each disk at exactly the same time, since they are still two physically separate drives. If the system crashes after one disk had a change written to it but the other didn't, your redundant copies may now be out of sync, much like how your parity disk(s) can be out of sync with the Unraid array after a crash.

 

Now, thanks to copy-on-write, this problem is properly addressed. If either disk is out of sync, the old copy still exists. Btrfs also provides checksums that can be used to verify each copy is exactly the same; if they are not, it will "self repair", and a scrub triggers the filesystem to read each and every file, verify its checksums, and repair any mismatch using the other copy. ZFS of course does the same thing.

However, when NOCOW is in use, both checksums and atomic updates of the data blocks are gone. While other software RAID solutions do manage to keep things in sync without copy-on-write, Btrfs does not, so NOCOW is not intended to be used on anything other than "disposable data". NOCOW makes Btrfs behave more like a traditional filesystem for data blocks, but along with that come significant gotchas that people seem to be largely unaware of. (ZFS has no concept of NOCOW, so this isn't an issue for it.)
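For anyone who hasn't come across it, the NOCOW setting is just the C file attribute (chattr +C), and it only takes effect on files created after it has been set on a directory (or set on an empty file). A quick sketch, with example paths:

    # Mark a directory so files created in it afterwards skip COW (existing files are unaffected)
    chattr +C /mnt/cache/domains

    # Verify: a capital C in the attribute list means NOCOW is set
    lsattr -d /mnt/cache/domains

    # New files created inside inherit the attribute
    touch /mnt/cache/domains/test.img
    lsattr /mnt/cache/domains/test.img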

 

There is no way to ensure both copies of a NOCOW file are in sync on Btrfs if any of the RAID1, RAID10 or DUP profiles are used. Btrfs does not provide *any* method to fix out-of-sync NOCOW files when they do go out of sync (scrub only verifies checksums, and since NOCOW data has no checksums, scrub doesn't touch it). To make matters worse, something as simple as a power failure can trigger this situation while a VM is in use, and anyone using Unraid with VMs can run into this corruption scenario in the default state.

 

Further, Btrfs doesn't have a concept of "master" and "slave" drives; it decides which copy to read based on the PID of the reading process. To the user, this effectively means the disk a file is read from is "randomly" selected. So in an out-of-sync scenario with VM disk images, even if one copy is valid, the VM may read the stale copy and write back on top of the good one, corrupting it. Then it reads the other copy, hits inconsistent data again, and so on... you're stuck in a vicious cycle.

 

There was a long discussion on the Btrfs mailing list about this a few years ago. The TL;DR is that while patches were submitted to allow btrfsck to identify this issue, they were never committed to master. Not that anyone, let alone Unraid, uses btrfsck in the event of a crash; scrub and balance are the usual recommended course of action, depending on the scenario, and using btrfsck can be dangerous without developer advice anyway.

 

This issue applies to the RAID1, RAID1c3, RAID1c4, RAID10 and DUP data profiles of Btrfs, since all of these profiles involve making "copies". If the user is on a single disk with the SINGLE profile, on multiple disks with RAID0, or ironically, on Btrfs' grossly unstable RAID5 or RAID6 profiles, the issue doesn't really exist, since there is only one copy of the data in those cases. (In the case of RAID5/6, scrub will repair out-of-sync parity since there is no duplicate copy; instead, RAID5/6 suffers from the same actual write hole issue Unraid's parity-protected array can suffer from.)
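If you're not sure which profile your pool uses, it's shown right in the filesystem usage output (the mountpoint here is just the usual Unraid cache path):

    # Shows the data and metadata profiles in use (RAID1, DUP, single, etc.)
    btrfs filesystem df /mnt/cache

    # Same information plus per-device allocation detail
    btrfs filesystem usage /mnt/cache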

 

Now users may be wondering: Why is NOCOW even an option then?

Well, the issue with Btrfs in particular is that it is not well optimized for workloads that involve a lot of small writes to the same file. This is exactly the type of workload that makes it less than ideal for things like VMs, databases, and even BitTorrent downloads. While ZFS can be tuned and has more sophisticated caching schemes to mitigate this, Btrfs simply doesn't scale well when it comes to tiny writes to files.

 

Each time a write is made, the metadata tree needs to be updated to reference a new, rather small extent. As more and more tiny writes land, this tree gets very "top heavy" and becomes fragmented itself. Processing the tree alone can be expensive on system resources, even on an SSD. NOCOW is a way to avoid this fragmentation.

 

I briefly discussed this with @jonp on Discord a while back with regards to performance, and he suggested doing some benchmarks. So I took the time to run some quick ones to show the difference (and also to showcase how sparse allocation isn't the greatest with NOCOW anyway).

 

I ran these benchmarks with the Phoronix Test Suite, using pgbench to represent a workload that involves a lot of small writes. This was all done on Ubuntu 22.04 with ext4 as the guest filesystem (since these are vdisks, Btrfs COW is still at work underneath). The VMs all used VirtIO SCSI as the vdisk controller, with caching disabled. The Btrfs pool is a RAID1 pool with an 860 EVO and a WD Blue SATA SSD. The VM's specs were admittedly limited (only 2 cores on a Ryzen 5 3400G), but I think the results still showcase the difference quite well.

 

I did three tests. The first was a VM on a NOCOW Cache pool using a sparsely allocated image, just as Unraid behaves out of the box with cache pools. filefrag indicated this VM had ~30k extents allocated after a single benchmark pass. Over time, as the vdisk fills, this number will increase, but since it's NOCOW it won't increase further once allocated, unless snapshots or reflink copies are made. Still, it does not make the best use of the NOCOW attribute.

 

The second one was a VM on a NOCOW Cache pool but the VM disk image was preallocated using fallocate to reduce fragmentation as much as possible. The filefrag command indicated this VM only had 5 extents both before and after the benchmark.

 

The third was to just straight up use COW (there was no need to differentiate between preallocation or not in this case, since COW would be used regardless). The filefrag command indicated this VM had ~1.2 million (yikes!) extents after the benchmark ran.

 

[Screenshots: pgbench results for the three runs - NOCOW sparse (Unraid default), NOCOW preallocated, and COW.]

 

You'll notice the difference between the most optimized NOCOW VM and the COW one is roughly double the average latency. The Unraid default state landed somewhere in the middle, so even today Unraid's default isn't the most ideal performance-wise. However, in the case of cache pools, whenever redundancy is used there's no way to escape the corruption scenario.
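For anyone wanting to reproduce the preallocated case, this is roughly all it takes before installing the VM against the image (file name and size are just examples):

    # Fully allocate the vdisk up front on the NOCOW share
    fallocate -l 30G /mnt/cache/domains/testvm/vdisk1.img

    # Report how many extents the image occupies (fewer = less fragmented)
    filefrag /mnt/cache/domains/testvm/vdisk1.img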

 

Why not use autodefrag?

Autodefrag is often proposed as a solution for desktop use cases: it groups up small writes on Btrfs so they are rewritten as larger extents. However, for workloads like VMs, autodefrag ends up causing write amplification, since it is effectively rewriting existing data. It is really only intended for desktop use cases where small databases are used by applications like web browsers, and the writes to these databases are generally negligible.
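For reference, autodefrag is just a Btrfs mount option, so enabling it affects the entire filesystem rather than individual files (mountpoint is just an example):

    # Enable autodefrag on an already-mounted Btrfs filesystem (applies to the whole mount)
    mount -o remount,autodefrag /mnt/cache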

 

So what can be done?

My suggestion is to simply leave COW enabled (AUTO) for the domains share by default going forward, for all new Unraid installs. Users should accept the performance impact it may bring unless they explicitly choose to disable it, especially since the performance is likely to be perfectly acceptable for many home use cases.

 

Btrfs is a COW filesystem, so CoW should be expected unless explicitly disabled. As was mentioned in this "rant" here, regular users don't set NOCOW, "admins" do, and using it "downgrades the stability rating of the storage approximately to the same degree that raid0 does". The current default gives Unraid users a false sense of redundancy (and I would argue Unraid is intended to improve the UX around redundancy, so I think this default option runs contrary to that).

 

If users have a performance complaint, it would be better to clearly note the implications of using NOCOW should they choose it, or they can use an XFS-formatted cache pool; at least in that case there is no false sense of redundancy. And regardless, there have been feature requests and plugins that propose adding snapshots for VMs anyway. Both XFS and Btrfs support reflink copies after all, yet making one triggers copy-on-write, completely negating the gains of the NOCOW attribute.
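For what it's worth, a reflink copy is a single cp flag on either filesystem (file names here are just an example):

    # Near-instant copy that shares extents with the original until either file is modified
    cp --reflink=always vdisk1.img vdisk1-snapshot.img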


I will say, if the decision is not to leave COW enabled going forward, there should at least be a warning included on the option, and if it has been used, Unraid should show a yellow circle on the share to indicate it's not properly protected when any VMs or the directory itself has the NOCOW attribute set.
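Detecting it should be straightforward too; checking for the C attribute is all it takes (paths are just an example):

    # A capital C in the attribute field means NOCOW is set on the share directory
    lsattr -d /mnt/cache/domains

    # List any files under the share that carry the attribute
    lsattr -R /mnt/cache/domains 2>/dev/null | awk '$1 ~ /C/ {print $2}'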


@JSE thank you for taking the time to do the testing and provide this incredibly detailed write-up. As you can probably imagine, we are prepping for the release of 6.10 and I don't think we would want to try to make this change to the default behavior in this release, but this is something that deserves attention and review by our team and I will be sure to make that happen. Please be patient as this will likely require us to block off some time on the schedule to specifically address this, so I will get back to you once we've made that happen. Thanks again!!!

On 5/2/2022 at 8:27 AM, JSE said:

I will say, if the decision is not to leave COW enabled going forward, there should at least be a warning included on the option, and if it has been used, Unraid should show a yellow circle on the share to indicate it's not properly protected when any VMs or the directory itself has the NOCOW attribute set.

 

Thank you for the deep dive into the inner workings of btrfs.  Back when we implemented VM manager we indeed wanted to provide vdisk redundancy via the btrfs raid1 profile.  Our early testing showed a very obvious performance hit with a COW vdisk vs. NoCOW.  This was circa 2016/2017? and we were aware of the discussion and patch set that ultimately arose.  Actually my assumption was that btrfs metadata would keep track of which chunks were successfully committed to storage - apparently this is not the case?  Also it has always bugged me that btrfs would not maintain checksums across NoCOW chunks.  I can't think of a logical reason why this decision would be made in the code. edit: I guess to avoid read/modify/write.

 

Sure, we can change the default to COW for the domains share.  I think your testing shows that best performance in this case happens when vdisk files are also pre-allocated correct?  Also, changing the default will have no effect on an existing domains share.  To get rid of existing NoCOW flags, one must empty the domains share, delete it, and then recreate it.

 

Moving forward into Unraid 6.11 we plan to introduce an "LVM" pool type of up to 3 mirrored devices.  This will be used to create logical volumes to be used as vdisks in VMs.  This should provide near bare-metal storage performance since we completely bypass any intermediate file system where vdisk (loopback) files are stored.
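For the curious, with stock LVM a 3-way mirrored volume handed to a VM looks roughly like the below - purely illustrative, not necessarily how we will wire it up:

    # Illustrative only: build a volume group from three devices and carve out a mirrored LV
    pvcreate /dev/sdb /dev/sdc /dev/sdd
    vgcreate vmpool /dev/sdb /dev/sdc /dev/sdd
    lvcreate --type raid1 -m 2 -L 100G -n vdisk1 vmpool

    # The resulting block device (/dev/vmpool/vdisk1) can be handed to a VM directly, no filesystem in between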

 


Hey, thanks for the prompt replies everyone, glad to see there's interest in addressing it so fast, it's appreciated!

 

On 5/3/2022 at 1:35 PM, limetech said:

Actually my assumption was that btrfs metadata would keep track of which chunks were successfully committed to storage - apparently this is not the case?

Basically not at all. MD and LVM use bitmaps that help them identify whether the array went out of sync during a crash, but Btrfs has no means of doing this for NOCOW data; checksums of the actual data are the only way under the current design. The metadata is always COW and protected with checksums, and any newly allocated chunks will get a new transid, but once an extent is allocated it's sort of set in stone unless you run a defrag or something else that involves rewriting it.

 

On 5/3/2022 at 1:35 PM, limetech said:

Our early testing showed a very obvious performance hit with a COW vdisk vs. NoCOW.

Yep, there certainly is; the "Friends don't let friends use Btrfs for OLTP" article showcases it quite well, better than my quick pgbench runs.

 

On 5/3/2022 at 1:35 PM, limetech said:

Also it has always bugged me that btrfs would not maintain checksums across NoCOW chunks.  I can't think of a logical reason why this decision would be made in the code. edit: I guess to avoid read/modify/write.

Yep exactly, and as long as COW can't be done, there would always be a window of time where the checksum wouldn't match if one went with the RMW option. It would be better to do something like a bitmap, as MD does, but the impression I get from the Btrfs devs just from reading the mailing list is that there's not much interest in working on NOCOW.

 

On 5/3/2022 at 1:35 PM, limetech said:

I think your testing shows that best performance in this case happens when vdisk files are also pre-allocated correct?

Generally yep, as long as NOCOW is used of course, simply because it (usually) means large extents are allocated from the get-go. Depending on how fragmented the free space is, there may be less and less benefit from this, but since a balance operation is effectively a free-space defrag, if the filesystem is well balanced going in, it should be fine.
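To put that in concrete terms, a filtered balance like this is usually enough to consolidate free space without rewriting the entire pool (mountpoint is just an example):

    # Repack data chunks that are less than 50% full, freeing up contiguous space
    btrfs balance start -dusage=50 /mnt/cache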

 

On 5/3/2022 at 1:35 PM, limetech said:

Also, changing the default will have no effect on an existing domains share.  To get rid of existing NoCOW flags, one must empty the domains share, delete it, and then recreate it.

Yea, this is a problem. If you copy the files to a directory that doesn't have the NOCOW flag set, and the copy is not a reflink copy (i.e. done with cp --reflink=never), then Btrfs will generate checksums for the new copy; this is just the inverse of what the Btrfs wiki suggests for "converting" an existing COW file to NOCOW. The downside, of course, is that the data has to be duplicated.
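A rough sketch of that, assuming the VM is shut down first and the destination directory does not have the NOCOW attribute (paths are just examples):

    # Destination must NOT have the C attribute, otherwise the copy stays NOCOW
    mkdir -p /mnt/cache/domains_cow/vm1
    lsattr -d /mnt/cache/domains_cow/vm1

    # Force a real copy (no reflink) so new, checksummed extents are written
    cp --reflink=never /mnt/cache/domains/vm1/vdisk1.img /mnt/cache/domains_cow/vm1/vdisk1.img

    # Confirm the new copy no longer carries the C (NOCOW) attribute
    lsattr /mnt/cache/domains_cow/vm1/vdisk1.img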

 

Simply moving the files within the same volume won't convert them to COW or generate checksums. That's what led to my other post... there is no great option here for existing setups. Perhaps there could be a yellow icon or something to indicate the redundancy is not perfect if the attribute is detected on a share on a cache pool. Better than no indicator at all, I guess. I dunno, lots of options here, but none are perfect lol.

 

The NOCOW flag does make sense on devices in the parity-protected array for VMs, since any array device is a single device anyway, and they're usually spinning rust, so avoiding fragmentation there is especially important. Perhaps there could even be an option to set NOCOW only on the array disks, where it's less risky, while the cache pool NOCOW option is kept separate and defaults to COW, since a pool can span devices (and I would guess most users have SSDs in cache pools anyway).

 

That's on top of warning the user about the option in the share settings when they click on it. You could even go one step further and warn them any time a scrub is run and data without checksums is found.

 

While Btrfs scrub doesn't do anything with NOCOW data, it does still report that it found data without checksums when the -R flag is used, to indicate nothing can really be done with it. A notification could be triggered after a scrub if this is detected, to let the user know unprotected data was found and left untouched, and that they may need to manually intervene if they have corruption. Documentation could further explain that if a user encounters that notification, the most likely cause is that the NOCOW attribute is either in use now or was in the past, and it could offer mitigation strategies if the user wants to act on it.
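To give an idea, something like this already surfaces it today (exact field names can vary between btrfs-progs versions):

    # Run a scrub in the foreground and print the raw per-device statistics
    btrfs scrub start -BR /mnt/cache

    # In the raw output, the no_csum counter is the number of blocks scrubbed that had no checksum to verify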

 

On 5/3/2022 at 1:35 PM, limetech said:

Moving forward into Unraid 6.11 we plan to introduce an "LVM" pool type of up to 3 mirrored devices.  This will be used to create logical volumes to be used as vdisks in VMs.  This should provide near bare-metal storage performance since we completely bypass any intermediate file system where vdisk (loopback) files are stored.

This sounds awesome! Hopefully LVM and LVM RAID can be used for more than just VM vdisks too :)

 

I personally love the ideas behind Btrfs and its flexibility; the flexibility alone is like no other solution, which seems to make it a good fit for Unraid, but the quirks sure are plentiful. Here's to hoping Bcachefs eventually makes its way into the kernel someday, performs well, and is stable. Maybe it could be a good successor in a few years :)

On 5/3/2022 at 11:35 AM, limetech said:

Moving forward into Unraid 6.11 we plan to introduce an "LVM" pool type of up to 3 mirrored devices.  This will be used to create logical volumes to be used as vdisks in VMs.  This should provide near bare-metal storage performance since we completely bypass any intermediate file system where vdisk (loopback) files are stored.

 

 

@limetech quick question on this, if you don't mind - is this implementation of LVM (using a triple mirror) a one-off, or does it mean that we're also getting the standard MD RAID functions underneath (RAID 5 / RAID 6 along with the rest)? Looking at the code, it seems like most of what'd be necessary to do this is a refactor of the unraid array's parity/dual parity to be called <something else>, so that the standard (unmodified) libs can be called when trying to create a new pool using `md`, so I wondered...

 

I'm sincerely hoping so!! It's the last big hurdle I've yet to solve with my unraid servers actually lol - everything else is... well, fan-friggin-tastic at this point 😁

On 5/3/2022 at 11:21 PM, JSE said:

...

I personally love the ideas behind Btrfs and its flexibility; the flexibility alone is like no other solution, which seems to make it a good fit for Unraid, but the quirks sure are plentiful. Here's to hoping Bcachefs eventually makes its way into the kernel someday, performs well, and is stable. Maybe it could be a good successor in a few years :)


Guess what?

https://www.phoronix.com/news/Linux-6.7-Released
 

There ya go. :)

On 3/8/2023 at 6:04 PM, richardm said:

NOCOW has been a libvirt default since 6.6.0 whenever btrfs datastores are used, per libvirt docs. I've just stumbled across this thread while researching a thread over on the btrfs-progs git.  btrfs with NOCOW is a data-eater, good for tmp or swap or scratch space and that's about it.  I'm flabbergasted.

Yep, the issues with NOCOW being used by default go well beyond Unraid; thankfully Unraid has reverted this default in newer versions, it seems. Now on to libvirt. I've also noticed some distros (like Arch) are using systemd-tmpfiles to set the +C attribute on the data directories of common database platforms as well, such as mysql/mariadb and postgresql.

 

 

On 1/17/2024 at 7:29 PM, Ari-McBrown said:

It's nice to see bcachefs finally merged, and I hope to one day see Unraid support it, since it potentially provides the same flexibility as btrfs without the caveats that btrfs has. It too supports NOCOW, and from my testing before it was merged, it was a mkfs option rather than a file attribute, at least at the time I tried it, so in that regard NOCOW won't be an issue since you just wouldn't use it lol.

It does however still need a lot more attention with regards to its RAID functionality, and it is missing features like scrub, rebalance, device monitoring, etc. When it sees improvements in these areas and proves itself to not be a data eater, I'd be happy to migrate over to it one day :)

