
Posts posted by JSE

  1. I would like to see bcachefs pools added to unraid someday, but it's still missing a lot of functionality: namely, the ability to easily monitor an array, scrub, rebalance when adding/removing devices, and a proper process for device replacement. Not to mention there have been several major data loss bugs since the 6.7 merge. The filesystem is still considered experimental for a reason. I would prefer we hold off until it has had time to mature into a better, well-rounded solution before it's considered for inclusion in unraid.


    Especially given we already have ZFS with ARC caching, which is much better than Linux's native caching. 

  2. On 1/24/2024 at 3:09 PM, PeteAsking said:

    On unraid nocow is the default so it's not a problem. If you override that and use cow and snapshots then I agree it's not appropriate. However this is not the default setup for unraid. I think the only issue is if you use snapshots and nocow. (The issue is just that more space will be used, as the filesystem sees each snapshot as a different copy of a file, each taking up space as opposed to sharing space.) 

    As @JorgeB mentioned, NOCOW is no longer the default for any shares, but even if you do set NOCOW, keep in mind that not only is redundancy compromised, compression will not work whatsoever on those files. Compression needs copy-on-write to function. NOCOW means no checksums and no compression.

     

    On 1/24/2024 at 4:13 PM, primeval_god said:

    I would disagree with this. The usefulness of compression is highly dependent on the type of files you have. For media files, which likely make up the vast majority of files stored on unRAID NAS devices, filesystem compression does very little as the files are already highly compressed.

    This is true; however, the zstd algorithm is more than fast enough on most modern CPUs to detect files that can't be compressed and store them normally, while still compressing the files that can be.

    Alternatively, unraid could use `btrfs property set /path/to/share/on/btrfs compression zstd` on a directory or subvolume, which works identically to compress-force but scoped to that path rather than the whole filesystem. This would allow users to set compression on a per-share basis, much like NOCOW. It doesn't allow setting a compression level, but given unraid defaults to level 3 anyway (and this property defaults to 3 as well), I'd argue that's a good default.
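
    As a rough sketch of that per-share idea (the pool path and share name here are just examples, and this assumes the share already exists as a directory or subvolume on a btrfs pool):

    # force zstd compression for anything newly written under this share
    btrfs property set /mnt/cache/myshare compression zstd
    # check what is currently set
    btrfs property get /mnt/cache/myshare compression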

     

    Given that ZFS supports compression on a per-dataset basis, this might be a better long-term solution than a per-filesystem setting; however, the option would need to be set at share creation time to be most effective. And since btrfs should be using subvolumes for shares, this would ideally depend on that feature request being implemented as well (though this option doesn't technically need to be per-subvolume the way it does for ZFS datasets).

    IMO this would be an even better way of handling compression: you could leave compression disabled for media shares so no CPU time is wasted on media at all, but enable it for shares where you know compression would be beneficial, and it would force compression without needing any mount options.

  3. 10 hours ago, DinisR said:

    A btrfs subvolume could be created and deleted without having to reformat the main btrfs volume. Hopefully this will happen soon and make things easier to work with

    Yep, creating and deleting subvolumes is as simple as working with a directory. Currently, a share on unraid is just a directory at the top level (root) of a disk or pool. So if you `mkdir myshare` on a disk, it will create a share called "myshare". Alternatively, a subvolume is created just as simply with `btrfs subv create myshare`, and for all intents and purposes it works just like a directory, but with the added performance benefits and the ability to snapshot it. 

    Deleting one is the same as deleting a directory. You can `rmdir myshare` an empty subvolume and it will be removed much the same as `btrfs subv delete myshare` does (the latter is faster though). 

    No formatting necessary :) 
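
    For anyone who wants to try it from the shell first, a quick sketch (the disk path and share name are just examples):

    # create a share as a subvolume instead of a plain directory
    btrfs subvolume create /mnt/disk1/myshare
    # take a read-only snapshot of it
    btrfs subvolume snapshot -r /mnt/disk1/myshare /mnt/disk1/myshare-backup
    # remove both again when no longer needed
    btrfs subvolume delete /mnt/disk1/myshare-backup /mnt/disk1/myshare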

  4. 24 minutes ago, JorgeB said:

    The initial request was this one; it's been a long time since I've read it, and since I was just starting to use btrfs there may be some inaccuracies there. My last request to LT was to implement something like this; it also mentions zfs since those pools are not monitored either, and while zfs handles a dropped device better than btrfs, the user still needs to be notified:

     

    Basically monitor the output of "btrfs device stats /mountpoint" (or "zpool status poolname" for zfs) and use the existing GUI errors column for pools, which currently is just for show since it's always 0, to display the sum of all the possible device errors (there are 5 for btrfs and 3 for zfs), and if it's non-zero for any pool device, generate a system notification. Optionally, hovering the mouse over the errors would show the type and number of errors, like we have now when hovering over the SMART thumbs-down icon in the dash.

     

    Additionally you'd also need a button or check mark to reset the errors, it would run "btrfs dev stats -z /mountpoint" or "zpool clear poolname".

     

     

    Welcome any suggestions to make this better, just recently had the chance to ask LT about this again for 6.13, but as of now don't know if this or something similar is going to be done.

     

    I've also created a FAQ entry to monitor btrfs and zfs pools with a script; while not perfect, it's better than nothing, but of course most users won't see it. I try to send as many as I can there, but usually it's only after there's already been a problem.

     

     

    This sounds perfect, exactly what we need. My thought was to have the stats appear on the pool page as well, possibly around the balance/scrub options, with a button to clear the stats. But 100% we're on the same page here; we definitely need this type of monitoring for pools. I haven't tested ZFS that much since it was added, but if it's also missing monitoring we need that too :)
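
    In the meantime, here's a minimal sketch of that kind of check that could run from a user script (the pool path and the notify script location are assumptions based on a stock install, so adjust as needed):

    #!/bin/bash
    POOL=/mnt/cache
    # sum every error counter reported for the pool's devices
    ERRORS=$(btrfs device stats "$POOL" | awk '{sum += $NF} END {print sum+0}')
    if [ "$ERRORS" -gt 0 ]; then
        # raise an unraid notification so a degraded pool isn't missed
        /usr/local/emhttp/webGui/scripts/notify -i warning \
            -s "Btrfs pool errors detected" \
            -d "btrfs device stats reports $ERRORS error(s) on $POOL"
    fi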

    • Like 1
  5. Currently, when you format a disk or pool to btrfs in unraid, an option is provided to enable compression.

    While unraid does use the efficient zstd compression at the default level 3, which I think is a sensible default, it's using the `compress` mount option rather than the `compress-force` option on the filesystem. 

     

    Btrfs uses a very rudimentary heuristic with the `compress` mount option: it abandons compression on a file entirely if the first few KiB aren't compressible. This results in a lot of files with compressible portions never getting compressed, and in many cases it behaves as if you didn't have compression enabled at all. It arguably makes the current unraid compression behavior not very useful.

     

    Now, if you were using one of the other algorithms like zlib, forcing compression could actually have a negative impact. With zstd, however, any attempted compression that doesn't yield a storage improvement simply isn't stored compressed.

     

    This request is thus to use the compress-force option instead, so compression actually happens on files whose headers don't compress, or at least to provide a way (such as a check box or an alternative option) for those of us who do want compression to force it. This yields much more space savings for me than the current option, but right now I have to resort to remounting my disks with compress-force via the shell or a script rather than rely on the option unraid provides.
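
    For reference, that remount boils down to something like this (the pool path and compression level are just examples):

    # switch an already-mounted pool from compress to compress-force
    mount -o remount,compress-force=zstd:3 /mnt/cache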

  6.  

    7 minutes ago, JorgeB said:

    Agree, have a feature request for this since 2016.

    High time it gets included then ;). Do you have a link? .... Without this, lots of people can lose data that might have been recoverable and never even know it.

    I'm on a bit of a raid (ha, no pun intended) of recommending changes to improve the reliability of btrfs pools. I do a lot of this stuff manually in the shell, but it really should be included in a more user-friendly way, since most people aren't familiar with or experienced in managing a btrfs pool from the shell, and honestly, they shouldn't need to be.

  7. With ZFS on unraid, if you create a share that lives on a ZFS pool, it's created as a dataset. This makes creating snapshots, rolling back, etc. much easier. This feature request is to extend this behavior to btrfs subvolumes, where the top-level directory (aka a share) should always be a subvolume instead of a regular directory.

     

    A subvolume in btrfs is also its own independent extent tree; each subvolume acts as its own independent filesystem even though it merely appears as a directory. What this means is that, by using one subvolume per share, any filesystem locking behavior is limited to the subvolume in question rather than the filesystem overall (in most cases). This allows for higher levels of concurrency and thus better performance, especially for pools with different shares that have high IO activity.
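
    Existing directory shares could even be migrated; a rough sketch of how that might look from the shell (the paths and share name are just examples, and the share should be idle while doing it):

    cd /mnt/cache
    # create the replacement subvolume next to the existing directory share
    btrfs subvolume create myshare.new
    # reflink-copy the contents so no data is physically duplicated
    cp -a --reflink=always myshare/. myshare.new/
    # swap the subvolume into place
    rm -rf myshare
    mv myshare.new myshare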

    • Like 1
  8. With btrfs, if you have a live running pool and a disk disappears from the system (i.e. you pull it or a cable flakes out), or the disk just outright fails while the array is running, btrfs doesn't provide any indication via most of the monitoring commands that the disk is missing. For example, if you run `btrfs filesystem show` after a disk has dropped from a pool, it will still show a reference to the disk even though it's missing. Even if it's just a flaky cable and the disk reappears to the system, it will remain unused until you fully remount the filesystem (and then a scrub, not a balance, would be all that's necessary to resync, but I digress).

     

    If you unmount the pool and remount it with the disk still missing, you will need the degraded mount option, which unraid handles, but it's only after that degraded remount that `btrfs filesystem show` will indicate any missing devices. Likewise, it's only after stopping the array that unraid will indicate a pool has missing devices.

     

    This means unraid users are in the dark if a disk flakes out or completely fails while the array is running. If the user doesn't stop the array often, they could be unaware that their pool is degraded for months.

     

    Btrfs does, however, provide a means to detect device failures and issues via the `btrfs device stats` command. If any device stat shows a non-zero value, there is an issue with the array and it's possibly degraded. When a disk flakes out or fails, for example, the device stats will show write errors.
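
    For illustration (the device name and mount point are just examples), healthy output has all five counters at zero:

    # btrfs device stats /mnt/cache
    [/dev/sdb1].write_io_errs    0
    [/dev/sdb1].read_io_errs     0
    [/dev/sdb1].flush_io_errs    0
    [/dev/sdb1].corruption_errs  0
    [/dev/sdb1].generation_errs  0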

     

    It is absolutely critical to monitor btrfs device stats to detect a degraded-pool event while the array is running. Thus, this feature request is to have these stats included in the unraid GUI when viewing a pool, and to notify the admin of any non-zero device stats so they can act on them. Since resetting the stats is what makes later errors detectable, we'd also need a GUI option to reset device stats once any issues are addressed.

     

    I have some other ideas to make btrfs pools more resilient and efficient (particularly around the fact that I feel unraid runs balances much more often than necessary), but those are left for a separate feature request; device stat monitoring is the most critical requirement for proper pool monitoring. 

  9. On 3/8/2023 at 6:04 PM, richardm said:

    NOCOW has been a libvirt default since 6.6.0 whenever btrfs datastores are used, per libvirt docs. I've just stumbled across this thread while researching a thread over on the btrfs-progs git.  btrfs with NOCOW is a data-eater, good for tmp or swap or scratch space and that's about it.  I'm flabbergasted.

    Yep, the issues with NOCOW being used by default go well beyond unraid; thankfully unraid has reverted this default in newer versions, it seems. Now on to libvirt. I've also noticed some distros (like Arch) use systemd-tmpfiles to set the +C attribute on common database platforms as well, such as mysql/mariadb and postgresql.

     

     

    On 1/17/2024 at 7:29 PM, Ari-McBrown said:

    It's nice to see bcachefs finally merged and I hope to one day see unraid support it, since it potentially provides the same flexibility as btrfs, without the caveats that btrfs has. It too supports NOCOW, and from my testing before it was merged, it was a mkfs option rather than a file attribute, at least at the time I tried it, so it seems nocow won't be an issue in that regard since you just wouldn't use it lol.

    It does, however, still need a lot more attention with regards to its RAID functionality, and it's missing features like scrub, rebalance, device monitoring, etc. When it sees improvements in these areas and proves itself not to be a data eater, I'd be happy to migrate over to it one day :)

    • Upvote 1
  10. Unraid is open source as far as the storage stack is concerned.

    1. The UnRAID array uses a modified version of MD RAID, and all of the corresponding sources are stored right on the USB drive, which you could use to compile your own kernel with ;)

    2. Pools use Btrfs (and ZFS as of 6.12). A pool created on UnRAID is completely usable on other systems without any tinkering. XFS pools are single-disk and mount like any single-disk filesystem.

    3. Disks inside your UnRAID array are independently formatted individual disks, with a dedicated parity disk. Nothing is striped or stored in any obscure, proprietary format. Even without the custom patches, all the data on the disks is fully available and mountable on any standard linux distro.

     

    Additionally, docker containers use standard docker, which can be used on any standard linux distro. You can quite literally take the variables you set in the UnRAID docker web UI, pass them to docker elsewhere along with the same data/mounts, and everything works perfectly.
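
    Purely as an illustration (the image name, port, and paths are made-up placeholders, not a real UnRAID template):

    docker run -d --name=myapp \
      -e TZ="America/New_York" \
      -p 8080:8080 \
      -v /mnt/user/appdata/myapp:/config \
      ghcr.io/example/myapp:latest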

     

    VMs are the same. They use bog standard KVM via Libvirt and QEMU, also readily available on most common distros.

     

    Rest assured, you're never locked in when it comes to your data with UnRAID. The true magic of UnRAID is the web UI and the ease it provides for managing and monitoring the array. For that, it's very much worth it; even with some of its shortcomings, imo it still gets the closest to what I want in a storage+compute OS for personal use :)

    • Like 2
  11. On 2/25/2023 at 4:43 PM, Kilrah said:

    Huh?

    Anything not going through /mnt/user already bypasses it, so it's always been possible to bypass it by addressing the disk/pool directly / resp. setting up disk shares for smb access.

    I'm aware of that, but what I mean is, I still want to create individual shares at the directory level (or more specifically, the subvolume level) for security reasons (each user gets their own share), but not have them run through shfs... in other words, *not* through /mnt/user. A disk share exposes the entire disk/pool, which is undesirable. 

    My "hacky script" does exactly that. Since I don't use the unraid array at all, I rewrite the /mnt/user path to /mnt/storage (name of my btrfs pool) in samba's config to bypass the shfs bottleneck.

    I also add in shadow copy paths pointing at snapshots so I can have shadow copy support, but at the very least, it would be nice to bypass shfs when pools are used exclusively, to avoid the shfs bottleneck on faster networks. 
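
    Purely as a hedged illustration of the idea (the share name, user, and snapshot settings are examples; /mnt/storage is the pool mentioned above):

    [documents]
        path = /mnt/storage/documents
        browseable = yes
        writeable = yes
        valid users = someuser
        # optional: expose btrfs snapshots as Windows "Previous Versions"
        vfs objects = shadow_copy2
        shadow:snapdir = .snapshots
        shadow:format = %Y-%m-%d-%H%M%S
        shadow:sort = desc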

  12. I'm curious, does unraid 6.12 bypass shfs (Sharefs) if you use ZFS (and potentially btrfs pools)? I find shfs is my biggest bottleneck, and since I exclusively use Btrfs anyway, I'd like a way to bypass it. Currently, I have a user script I wrote that rewrites the samba config at startup and then restarts samba lol. That's hacky imo; I wish there were an official way to bypass it, especially with stuff like ZFS coming.

  13. On 1/11/2023 at 5:39 PM, limetech said:
    • (BTW we could add ext4 but no one has really asked for that).

    Well, in that case, can I ask for it lol? Now that the linux kernel has NTFS3, it'd be nice to support that as well (it has full POSIX ACL support too, for those who need it!!), and it would allow someone to easily move their data over from Windows machines without formatting anything.

    I've also mulled the idea of being able to easily support "no filesystem", or at least, Btrfs pools within the unraid array.

     

    Then you could use Btrfs RAID0 to take advantage of snapshots much more easily and gain the read speed benefits. I could then better take advantage of 10G on the unraid array itself. You'd also have the ability to add, and even remove, one disk at a time, and since btrfs now has "degenerate stripes", even when data is already allocated it can still effectively use all the space, minus the speed improvement. It would never be worse than using independent filesystems speed-wise, and a balance would fix that up.

     

    You'd still lose out on write speeds since the parity disk is the bottleneck, but you'd gain "stable" parity protection in the event of a drive failure. I've hacked this together before, but with the current design of Unraid and its GUI, it's really not ideal.

     

    It's flexibility like this that I'd love to see more of from Unraid, more so than ZFS. That said, very excited to see the inclusion of ZFS support :)

  14. 10 minutes ago, dev_guy said:

     

    No argument but there have been issues with support for the Realtek RTL8125 family of 2.5Gb ethernet interfaces lagging behind in Unraid as documented at the start of this thread. Many popular Linux distros supported the RTL8125 before Unraid did. I was trying to be helpful especially in explaining how Realtek may be a better choice than Intel in some cases. 

    Nah, I get it. The problem is that a lot of those distros are using out-of-tree drivers, which can be a huge burden to maintain and often interfere with other drivers and/or break after a kernel upgrade if they're not well maintained. So it's not as clear cut in Unraid's case, where the maintenance burden of supporting every possible NIC out there is too high. This is really a Realtek-specific issue of not having proper mainline drivers.

    • Like 2
  15. On 11/25/2022 at 6:41 PM, dev_guy said:

     

    Yeah, I assumed as much that's why I mentioned Slackware. Interestingly, I've seen references Intel is dropping support for FreeBSD in some cases including never providing drivers for their I225V 2.5 Gb ethernet interface (the only 2.5 Gb Intel option I've seen used). So FreeBSD based applications, like TrueNAS Core and pfSense, are currently at the mercy of Intel and can only support Realtek for 2.5 Gb. For everyone who thinks Intel is the safer choice, that's changing. At least Realtek still supports FreeBSD which is more than Intel can be bothered to do. I don't know much about the Slackware kernel but do wish Unraid was Debian based like TrueNAS Scale which tends to have the best driver support for consumer and small business hardware in the Linux world.

    Debian and Slackware have nothing to do with it. Limetech maintains its own kernels for unraid, and they are much fresher than both Slackware's and Debian's; the latest in unraid 6.11 is 5.19. It is up to the driver vendors (or the linux community, in the case of community-made drivers) to get their drivers mainlined into the kernel. 

  16. On 8/7/2021 at 4:54 PM, jonp said:

    The reason it isn't on this list for this poll is for reasons that might not be so obvious. As it stands today, there are really 3 ways to do snapshots on Unraid today (maybe more ;-). One is using btrfs snapshots at the filesystem layer. Another is using simple reflink copies which still relies upon btrfs. Another still is using the tools built into QEMU to do this. Each method has pros and cons. 

     

    The qemu method is universal as it works on every filesystem we support because it isn't filesystem dependent. Unfortunately it also performs incredibly slow.

     

    Btrfs snapshots are really great, but you have to first define subvolumes to use them. It also relies on the fact that the underlying storage is formatted with btrfs. 

     

    Reflink copies are really easy because they are essentially a smart copy command (just add --reflink to the end of any cp command). Still requires the source/destination to be on btrfs, but it's super fast, storage efficient, and doesn't even require you to have subvolumes defined to make use of it.

     

    And with the potential for ZFS, we have yet another option as it too supports snapshots!

     

    There are other challenges with snapshots as well, so it's a tougher nut to crack than some other features. Doesn't mean it's not on the roadmap ;-)

    I think it's important to note that XFS, which is what most unraid users are using on their array disks, also has native reflink support. So if you want a filesystem-agnostic way of doing "snapshots" this way, look no further than reflinks. They work with both Btrfs and XFS, and reflinking when possible is indeed now the default behavior of coreutils cp.

     

    Old XFS formats do not support reflinks, but the feature can be enabled when formatting with the "newer" v5 on-disk format (the one with CRCs), which has been around since kernel 3.15, and mkfs.xfs has enabled reflinks by default since xfsprogs 5.1. I'd be willing to bet a large number of unraid users today can support this (if not a majority). 
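
    A couple of hedged examples of checking for and using this (the disk path and file names are placeholders):

    # prints reflink=1 if the XFS filesystem was formatted with reflink support
    xfs_info /mnt/disk1 | grep -o 'reflink=[01]'
    # make an instant, space-efficient "snapshot" copy of a vdisk
    cp --reflink=always /mnt/disk1/domains/win10/vdisk1.img /mnt/disk1/domains/win10/vdisk1.img.snap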

  17. On 7/11/2022 at 7:47 AM, ich777 said:

    That's not entirely true, because there are workarounds to get Windows to work with LXC/LXD (I actually only know a few tutorials, which involve LXD to create such a container).

    This should also be possible on Unraid in combination with QEMU.

    ARM containers should also be theoretically possible in combination with QEMU and LXC on Unraid, but I have to look a little more into that when I have more time, since I think the necessary QEMU libraries are missing on Unraid; that's why I locked the LXC plugin to x86_64 only currently.

     

    Agreed.

    I don't know what you mean by this. What's not true? You can't run windows as a linux "container"; it would need to be virtualized :P

    I guess you could run QEMU or something in a container, but it's still going to be a VM at the end of the day; it would be a VM with extra steps. 

     

    I suppose you could do all the virtualization in software, in a container, but that would be extra slow, so I'm assuming you'd still be using KVM, which is what I mean about a VM with extra steps... might as well just avoid containerizing all that. 

  18. On 6/24/2022 at 3:25 AM, ich777 said:

    Glad to hear that everything is working now for you... :)

    It is not a perfect workaround but it works for now... ;)

     

    Not yet, but it should work (please keep in mind that in the next two months I'm not able to do much here because I'm really busy in real life)... :/

    LXD is only a set of tools/databases, or rather a management system for LXC, so it is not needed in general.

     

    I haven't integrated LXD because it introduces too many dependencies, like Python, and could maybe interfere with other Python installations that may be installed through the NerdPack and so on...

    Just to be clear, LXD also allows managing VMs, since it's a frontend for managing both containers and VMs, so that is how you'd run Windows with LXD. Windows can't run as a container because it's literally a different OS, for those wondering. LXC works by sharing the host kernel and containerizing everything, so think of it like a glorified chroot environment.

     

    Anyway, it seems totally unnecessary for this purpose, as Unraid already has excellent VM management with libvirt and the webUI. If people need windows for something, use the VM platform unraid comes with.

    Hey, thanks for the prompt replies everyone; glad to see there's interest in addressing it so fast, it's appreciated!

     

    On 5/3/2022 at 1:35 PM, limetech said:

    Actually my assumption was that btrfs metadata would keep track of which chunks were successfully committed to storage - apparently this is not the case?

    Basically not at all. MD and LVM use bitmaps that help them identify whether the array is degraded after a crash, but Btrfs has no means of doing this with NOCOW data; checksums of the actual data are the only way under the current design. The metadata is always COW and protected with checksums, and any newly allocated chunk will carry a new transid, but once it's allocated, it's sort of set in stone unless you run a defrag or something else that rewrites it.

     

    On 5/3/2022 at 1:35 PM, limetech said:

    Our early testing showed a very obvious performance hit with a COW vdisk vs. NoCOW.

    Yep, there certainly is; the "Friends don't let friends use Btrfs for OLTP" article showcases it quite well, better than my quick pgbench runs. 

     

    On 5/3/2022 at 1:35 PM, limetech said:

    Also it has always bugged me that btrfs would not maintain checksums across NoCOW chunks.  I can't think of a logical reason why this decision would be made in the code. edit: I guess to avoid read/modify/write.

    Yep, exactly, and as long as COW isn't being done, there would always be a window of time where the checksum wouldn't match if one went with the RMW option. It would be better to do something like the bitmap MD uses, but the impression I get from the Btrfs devs just from reading the mailing list is that there's not much interest in working on NOCOW.

     

    On 5/3/2022 at 1:35 PM, limetech said:

    I think your testing shows that best performance in this case happens when vdisk files are also pre-allocated correct?

    Generally, yep, as long as NOCOW is used of course, simply because preallocation (usually) means allocating large extents from the get-go. Depending on how fragmented the free space is, there may be less benefit from this, but since a balance operation is effectively a free-space defrag, if the fs is well balanced going in, it should be fine.

     

    On 5/3/2022 at 1:35 PM, limetech said:

    Also, changing the default will have no effect on an existing domains share.  To get rid of existing NoCOW flags, one must empty the domains share, delete it, and then recreate it.

    Yeah, this is a problem. If you copy the files to a directory that doesn't have the NOCOW flag set, and the copy is not a reflink copy (i.e. with cp --reflink=never), then checksums are generated for the new copy; this is just the inverse of what the Btrfs wiki suggests for "converting" an existing COW file to NOCOW. The downside, of course, is that the data has to be duplicated while you do it. 
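
    A hedged sketch of that copy-based conversion (the paths are examples, and the VM should be shut down first):

    # a fresh directory that has never had chattr +C applied
    mkdir /mnt/cache/domains_cow
    # a full (non-reflink) copy, so new checksummed extents get written
    cp --reflink=never --sparse=always /mnt/cache/domains/win10/vdisk1.img /mnt/cache/domains_cow/vdisk1.img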

     

    Simply moving files within the same volume won't convert them to COW or generate checksums. That's what led to my other post... there is no great option here for existing setups. Perhaps there could be a yellow icon or something to indicate the redundancy isn't perfect if the attribute is detected on a share on a cache pool. Better than no indicator at all, I guess. I dunno, lots of options here but none are perfect lol.

     

    The NOCOW flag especially makes sense for VMs on devices in the parity-protected array, since each array device is a single-device filesystem anyway, and they're usually spinning rust, so avoiding fragmentation matters more there. Perhaps one could even provide an option to set NOCOW only on the array disks, where it's less risky, while the cache pool NOCOW option could be separate and default to COW since a pool can span devices (and I would guess most users have SSDs in cache pools anyway).
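
    A quick example of how the attribute can be detected (paths are examples); a 'C' in the lsattr output means files created there get no checksums:

    lsattr -d /mnt/cache/domains
    lsattr /mnt/cache/domains/win10/vdisk1.img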

     

    That's on top of warning the user about the option in the share settings when they click on it, and you could even go one step further and warn them any time a scrub is run and data without checksums is found.

     

    While Btrfs scrub doesn't do anything with NOCOW data, with the -R flag it does report when it finds data without checksums, i.e. data nothing can really be done with. A notification could be triggered after a scrub if this is detected, letting the user know unprotected data was found and left untouched and that they may need to intervene manually if they have corruption. Documentation could further explain that the most likely cause of that notification is the NOCOW attribute being used now or in the past, and then offer mitigation strategies if the user wants to act on it.

     

    On 5/3/2022 at 1:35 PM, limetech said:

    Moving forward into Unraid 6.11 we plan to introduce a "LVM" pool type of up to 3 mirrored devices.  This will be used to create logical volumes to be used as vdisks in VM's.  This should provide near bare-metal storage performance since we bypass completely any intermediate file system where vdisk (loopback) files are stored.

    This sounds awesome! Hopefully LVM and LVM RAID can be used for more than just VM vdisks too :)

     

    I personally love the ideas behind Btrfs and its flexibility; the flexibility alone is like no other solution, which seems to make it a good fit for Unraid, but the quirks sure are plentiful. Here's hoping Bcachefs eventually makes its way into the kernel someday, performs well, and is stable. Maybe it could be a good successor in a few years :) 

  20. I'd also like to see some form of LXC or LXD support; it would put the feature set a bit more in line with Proxmox and would definitely be more useful to me than docker. +1

    • Like 1
  21. Definitions and Preamble

    Skip this section and read the next one if you already understand what COW is and how Btrfs works.

     

    For users who are unaware of what COW (and subsequently NOCOW) is: COW stands for copy-on-write; NOCOW is the absence of COW. Btrfs (and ZFS) are COW-based filesystems, in contrast to something like XFS, ReiserFS, EXT4, NTFS, etc. While those filesystems overwrite blocks in place when an overwrite request is made, copy-on-write filesystems always write any changes to newly allocated space, even if a portion of the file is "overwritten". Once the change is written to new space, the filesystem then updates metadata references to reflect the new state of the file. This is how these modern filesystems achieve "atomicity" without using a "journal". (I recognize this is a simplified definition; after all, metadata operations also use COW, but this should be enough for most people to understand.)

     

    Basically, if a filesystem operation on Btrfs or ZFS is interrupted (by a system crash, power failure, or a flaky disk) and the change has not been fully committed to the disk(s), the old state is retained on the disk(s). There is no need to run an fsck or do anything extra, even for metadata, since the change was never fully committed. That's the idea behind atomicity: either the entire operation completes in full, or it didn't happen at all.

     

    Most journaling filesystems (apart from ext4 in certain configurations) do not journal data blocks anyway (doing so would kill performance). There's not really any need for them to, since, as far as they're concerned, these filesystems only ever exist on one disk. If they are used on any form of RAID, it is that RAID platform that needs to ensure everything is in sync, whether that's a battery-backed RAID controller, bitmaps like MD and LVM use, or whatever else. Unraid likewise works to ensure its own parity is in sync for its array devices by running a parity sync after a system failure. Running a scheduled parity check is also encouraged, of course, in case something goes wrong during normal operation. While unraid can't avoid the "write hole", at least it can be mitigated.

     

    Btrfs and ZFS are different, though. As we know, they are the RAID platform in addition to simply being filesystems; these filesystems can exist on pools of multiple disks. Btrfs supports its own unique spin on "RAID1" (among other profiles), as does ZFS. Since they're handling redundancy, extra thought needs to go into ensuring redundant copies of files are always in sync. When you use a RAID1 profile with Btrfs, a file is replicated across 2 devices, so there are effectively two "copies" of the same file.

    The Issue

    Since a Btrfs cache pool in Unraid can span multiple devices, it needs to ensure atomicity of the redundant copies on the pool disks to provide reliable redundancy. It is impossible to write to each disk at the exact same time, since they are still two physically different drives. If the system crashes when one disk has had a change written to it but the other hasn't, your redundant copies may now be out of sync, similar to how the unraid array's parity disk(s) can be out of sync after a crash.

     

    Now, thanks to copy-on-write, this problem is properly addressed: if either disk is out of sync, the old copy still exists. Btrfs additionally provides checksums that can be used to verify each copy is exactly the same. If they are not, they will "self-repair", and a scrub makes the filesystem read each and every file, verify its checksums, and repair any mismatch using the other copy. ZFS of course does the same thing.

    However, when NOCOW is in use, both checksums and atomic updates of the data blocks are gone. While other software RAID solutions do have ways of ensuring things are in sync without copy-on-write, Btrfs does not; as such, NOCOW is not intended to be used on anything other than "disposable data". NOCOW makes Btrfs work more like a traditional filesystem for data blocks, but along with that come significant gotchas that people seem to be quite unaware of. (ZFS has no concept of NOCOW, so this isn't an issue for it.)

     

    There is no possible way to ensure both copies of a NOCOW file are in sync on Btrfs if any of the RAID1, RAID10, or DUP profiles are used. Btrfs does not provide *any* method to fix out-of-sync NOCOW files when they do go out of sync (scrub only verifies checksums, and since NOCOW data has no checksums, scrub doesn't touch it). To make matters worse, something as simple as a power failure can trigger this situation if a VM is in use, and anyone using Unraid with VMs can run into this corruption scenario in the default state.

     

    Further, Btrfs doesn't have a concept of "master" and "slave" drives; it decides which copy to read based on the PID of the process. To the user, this effectively means the disk a file is read from is "randomly" selected. So in an out-of-sync scenario with VM disk images, even if one copy is valid, if it reads the invalid copy it may (and will) end up corrupting the good copy. Then when it reads the other copy, it detects corruption, and so on... you're stuck in a vicious cycle.

     

    There was a long discussion on the Btrfs mailing list about this a few years ago, but the TL;DR is that while patches were submitted to allow btrfsck to identify this issue, they were never committed to master. Not that anyone, let alone Unraid, uses btrfsck in the event of a crash; a scrub and/or balance is the usual recommended course of action, depending on the scenario, and using btrfsck can be dangerous without developer advice anyway.

     

    This issue applies to the RAID1, RAID1c3, RAID1c4, RAID10, and DUP data profiles of Btrfs, since all of these profiles involve making "copies". If the user is using a single disk with the SINGLE profile, multiple disks with RAID0, or, ironically, Btrfs' grossly unstable RAID5 or RAID6 profiles, the issue doesn't really exist, since there is only one copy of the data in these cases (for RAID5/6, scrub will repair out-of-sync parity since there is no duplicate copy; instead, RAID5/6 suffers from the same write-hole issue Unraid's parity-protected array can suffer from).

     

    Now users may be wondering: Why is NOCOW even an option then?

    Well, the issue with Btrfs in particular is that it is not well optimized for workloads that involve a lot of small writes to the same file. This is exactly the type of workload that makes it less than ideal for things like VMs, databases, and even BitTorrent downloads. While ZFS can be tuned and has more sophisticated caching schemes to mitigate this, Btrfs simply doesn't scale well when it comes to tiny writes to files.

     

    Each time a write is made, the metadata tree needs to be updated to reference a new, rather small extent. As more and more tiny writes are made, this tree can get very "top-heavy", and it too can become fragmented. Processing the tree alone can be expensive on system resources, even when an SSD is used. NOCOW is a way to avoid this fragmentation.

     

    I briefly discussed this with @jonp on Discord a while back with regards to performance, and he suggested doing some benchmarks. So I took the time to run some quick benchmarks to show the difference (and also showcase how sparse allocation isn't the greatest with NOCOW anyway). 

     

    I ran these benchmarks with the Phoronix Test Suite. The benchmark was pgbench, to represent a workload that involves a lot of small writes. This was all done on Ubuntu 22.04 using ext4 as the filesystem inside the VM (since these are vdisks, Btrfs COW is still at work underneath). The VMs all used VirtIO SCSI as the vdisk controller, with caching disabled. The pool was a Btrfs RAID1 pool with an 860 EVO and a WD Blue SATA SSD. The system specs for the VM were admittedly limited (only 2 cores on a Ryzen 5 3400G), but I think it still showcases the difference quite well.

     

    I did three tests. The first was a VM on a NOCOW cache pool using a sparsely allocated image, just as unraid behaves out of the box with cache pools. The filefrag command indicated this VM had ~30k extents allocated after a single benchmark pass. Over time, as the vdisk fills, this number will increase, but since it's NOCOW, it won't increase once allocated unless snapshots or reflink copies are made. Still, it does not make the best use of the NOCOW attribute.

     

    The second was a VM on a NOCOW cache pool, but with the disk image preallocated using fallocate to reduce fragmentation as much as possible. The filefrag command indicated this VM had only 5 extents both before and after the benchmark.

     

    The third was to just use COW outright (there was no need to differentiate between preallocated and sparse in this case, since COW would be used for new writes regardless). The filefrag command indicated this VM had ~1.2 million (yikes!) extents after the benchmark ran.
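
    For anyone wanting to reproduce the preallocated case, it boils down to something like this (the size and paths are just examples, not my exact setup):

    # preallocate the vdisk so btrfs hands out a few large extents up front
    fallocate -l 40G /mnt/cache/domains/test/vdisk1.img
    # report how many extents the image currently occupies
    filefrag /mnt/cache/domains/test/vdisk1.img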

     

    [Screenshots: pgbench latency/TPS results for the three setups: NOCOW sparse, NOCOW preallocated, and COW.]

     

    You'll notice the difference between the most optimized NOCOW VM and the COW one is double the average latency. The Unraid default state landed somewhere in the middle, so even then, Unraid's current default isn't ideal performance-wise. However, in the case of cache pools, when redundancy is used, NOCOW leaves no way to escape the corruption scenario described above. 

     

    Why not use autodefrag?

    Autodefrag is often proposed as a solution, since it groups up small writes to Btrfs to be rewritten as larger extents. However, for workloads like VMs, autodefrag ends up causing write amplification, since it is effectively rewriting existing data. It is really only intended for desktop use cases where small databases are used by applications like web browsers, and the writes to those databases are generally negligible.

     

    So what can be done?

    My suggestion is to simply leave COW enabled (Auto) for the domains share by default going forward for all new unraid installs. Users should accept the performance impact it may bring unless they choose to disable it, since the performance will be perfectly acceptable for many use cases, especially home use.

     

    Btrfs is a COW filesystem, so COW should be expected unless explicitly disabled. As was mentioned in this "rant" here, regular users don't set NOCOW, "admins" do, and using it "downgrades the stability rating of the storage approximately to the same degree that raid0 does". Keeping the current default gives Unraid users a false sense of redundancy (and I would argue unraid is intended to improve the UX around redundancy, so this default is contrary to that).

     

    If users have a performance complaint, it would be better to clearly note the implications of choosing NOCOW, or they can use an XFS-formatted cache pool instead; at least then there is no false sense of redundancy. And regardless, there have been feature requests and plugins that propose adding snapshots for VMs anyway (both XFS and Btrfs support reflink copies, after all), yet snapshotting a NOCOW file triggers copy-on-write, completely negating the gains of the NOCOW attribute. 

    • Like 4
    • Thanks 1
  22. To add to this, it's best to remove the plugin until it is potentially addressed, as is warned here: https://btrfs.wiki.kernel.org/index.php/Gotchas#Block-level_copies_of_devices

     

    While these two "copies" are in fact the same thing, the catch with unraid is that the /dev/mdX devices correspond to the /dev/sdX devices. So if the kernel suddenly sees both, it thinks they're separate devices. With normal filesystems this is fine, but Btrfs isn't "normal", so it gets angry because there are two devices with the same device ID (all btrfs devices in a pool share the same UUID). A btrfs pool tracks its member devices in its own metadata with incremental numerical device IDs. If you have a pool, consider running the command:
     

    btrfs device usage /mnt/cache

    It will list the devices in the pool with an ID. However, if you run

    blkid

     on your pool devices, you'll see they all have the same UUID. This is normal. This is how btrfs identifies devices belonging to a pool.

     

    The problem is the device ID in the btrfs metadata: the btrfs driver should *never* see duplicate device IDs. When you run `btrfs device scan`, it looks for these devices; this is how btrfs can work with pool devices without you needing to specify each one at mount time. Instead, mounting any device in a pool mounts all of them. However, when duplicates are found, this warning is triggered on mount, since the btrfs driver talks to devices by their device ID.

     

    It's like trying to mount two XFS filesystems with the same UUID: the kernel will get angry. Only in this case, btrfs has its own internal IDs. It's the same idea.

     

    On old kernels, this would almost certainly cause corruption. While I think there has been an effort to harden against this in later kernel versions, like the one unraid uses today, if the kernel ever did get confused and committed a write to the /dev/sdX device, you might not get filesystem corruption, but it could cause the parity to go out of sync. Or it could cause corruption if there are any race conditions or write barrier issues between the mdX and sdX devices.

    Honestly, I think it's just best to avoid that potential issue. The UD plugin must be doing a device scan, triggering the duplicate device detection.

     

    This is probably intentional on the plugin's part: if you had a multi-device btrfs filesystem that wasn't assigned as a pool in Unraid, the scan is what lets it mount easily without errors (otherwise btrfs will return an error if the kernel hasn't "scanned" all the devices).

     

    So as far as I can tell, since it would probably be undesirable to remove this feature from the plugin, you should either use Btrfs only in Unraid pools if you want to keep using UD, or not use the UD plugin at all.

  23. On 8/7/2021 at 4:54 PM, jonp said:

    The reason it isn't on this list for this poll is for reasons that might not be so obvious. As it stands today, there are really 3 ways to do snapshots on Unraid today (maybe more ;-). One is using btrfs snapshots at the filesystem layer. Another is using simple reflink copies which still relies upon btrfs. Another still is using the tools built into QEMU to do this. Each method has pros and cons. 

     

    The qemu method is universal as it works on every filesystem we support because it isn't filesystem dependent. Unfortunately it also performs incredibly slow.

     

    Btrfs snapshots are really great, but you have to first define subvolumes to use them. It also relies on the fact that the underlying storage is formatted with btrfs. 

     

    Reflink copies are really easy because they are essentially a smart copy command (just add --reflink to the end of any cp command). Still requires the source/destination to be on btrfs, but it's super fast, storage efficient, and doesn't even require you to have subvolumes defined to make use of it.

     

    And with the potential for ZFS, we have yet another option as it too supports snapshots!

     

    There are other challenges with snapshots as well, so it's a tougher nut to crack than some other features. Doesn't mean it's not on the roadmap ;-)

    I would personally ignore QEMU snapshots and just support snapshots in general with btrfs and ZFS, if ZFS is planned to be officially supported. (Performance warnings should be provided for those who snapshot VMs on btrfs, however, since we use NOCOW specifically to avoid fragmentation. Snapshotting a nocow file requires copy-on-write to actually work, which can negate the point of the NOCOW attribute.)
     

    Snapshots are a core feature of both filesystems, after all, and work rather well, so it would be great to take advantage of them in a way that is convenient for the user. The UI could be updated to simply create subvolumes when creating user shares, and there could even be a way to migrate old directory-based shares to subvolumes on supported filesystems. The ability to schedule and rotate snapshots would also be nice, and it has the added benefit of helping protect against ransomware attacks on your SMB shares.


    Now obviously there are a few caveats to this, but let me qualify that; here's how I would approach it:

    I'd start with support for "multiple arrays", but much better and more configurable than it currently is. Btrfs (and ZFS) RAID could then become first-class citizens for users who choose either of those instead of regular unraid parity. This is where snapshots could be easily supported, as they would be an array-specific feature. I wouldn't really focus on supporting them on cache pools or with regular unraid parity, since it's convoluted to pull off (you'd have to snapshot each independent disk, and with any mixed-filesystem setup it wouldn't be possible at all).

     

    Having this expanded "multiple arrays" functionality would thus need to be done before proceeding with full ZFS support.

     

    Both Btrfs and ZFS support self-healing with scrub, but in the current state of the array you can't utilize it if you go with btrfs (apart from metadata, which gets duplicated). Your only option is to use a cache pool (or potentially unassigned devices), which isn't really great for general data storage, because then you can't really use the "cache pool" as a write cache anymore as intended, since it's acting as your array. This is actually how I use unraid right now, btw: I simply assign a USB drive to the array just so I can start it and totally ignore its functionality, relying on btrfs RAID1 for my HDD pool (I don't need the extra space parity would give me, and this protects my data much better than any parity RAID could, apart from ZFS RAIDZ).

     

    With "multiple arrays", you'd get to choose Unraid parity (the default), Btrfs, or potentially ZFS. The UX/UI design for Btrfs could easily be reused for ZFS when it's added, apart from the flexibility to add and remove devices, and you'd still be able to use cache pools for faster SSD writes while having a self-healing hard disk array. It would also allow users of ZFS or Btrfs arrays to easily use SSDs and TRIM in an array, which isn't really possible now without major downsides.


    For snapshotting VMs themselves, that is the only thing I'd use reflink copies for, and it could be supported on cache pools. I wouldn't use them for anything more, as you potentially suggested, because "snapshotting" an entire volume that way (i.e. on xfs, since it also supports reflinks) would not only be slow, but would also use *a lot* of metadata if you have a lot of files, since it's really allocating all those inodes all over again. A real snapshot is basically a glorified IOU, a deferred reflink, and is best suited to a single btrfs or ZFS array where it's easy to manage.

     

    Would really love to see this! Whatever you folks do, I'm sure I'll be excited to see it. Running the 6.10 RC right now :)

    • Like 1