How to implement BTRFS Checksumming with unRAID?


Noob


Also, btrfs in a cache pool is not the same as btrfs on a single disk. So comparing reliability between btrfs pool and xfs single is like comparing zfs to ntfs.  Apples and oranges.

 

You did read that the OP is looking to run BTRFS on the DATA drives? This doesn't have anything to do with a cache pool....

Yes, I know that. I was simply stating this because it has been brought up before as a weak comparison to xfs.

Link to comment

 

 

We need to assemble better docs on our own wiki too, but I'm not sure what else you're looking for as far as docs from btrfs go.

 

I would love to see examples in the wiki of what btrfs drive corruption looks like when it happens in unRAID, and then the typical steps to attempt to fix said corruption (for example, if a drive gets corrupted while writing data and there is a sudden power loss). Also, it would be awesome to see another example of data corruption in unRAID and then the steps needed for btrfs to "rebuild" the corrupt data from its checksums (or however btrfs rebuilds corrupt/bit-rotted data).

 

Sure, I could easily show that. I'll have to set up a lab to demonstrate, but I'll try to make that happen soon. Probably put it in a blog post too.
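
Not to preempt that write-up, but here is a rough sketch of what such a lab test might look like, assuming a throwaway loopback file rather than a real array disk (all paths, sizes, and offsets below are hypothetical):

```bash
# Build a disposable single-device btrfs filesystem inside a loopback file
truncate -s 1G /tmp/btrfs-test.img
LOOP=$(losetup -f --show /tmp/btrfs-test.img)
mkfs.btrfs "$LOOP"
mkdir -p /mnt/btrfs-test
mount "$LOOP" /mnt/btrfs-test

# Write some test data, then unmount and scribble random garbage into the
# middle of the backing file to simulate silent corruption (never do this
# to a real disk!)
dd if=/dev/urandom of=/mnt/btrfs-test/testfile bs=1M count=300
umount /mnt/btrfs-test
dd if=/dev/urandom of=/tmp/btrfs-test.img bs=1M count=64 seek=300 conv=notrunc

# Remount and scrub: on a single-device filesystem the scrub (and any read
# of the damaged file) should report checksum errors rather than silently
# returning bad data
mount "$LOOP" /mnt/btrfs-test
btrfs scrub start -B /mnt/btrfs-test
btrfs scrub status /mnt/btrfs-test
dmesg | grep -i csum | tail
```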

Link to comment

 

 

We need to assemble better docs on our own wiki too, but I'm not sure what else you're looking for as far as docs from btrfs go.

 

I would love to see examples in the wiki of what btrfs drive corruption looks like when it happens in unRAID, and then the typical steps to attempt to fix said corruption (for example, if a drive gets corrupted while writing data and there is a sudden power loss). Also, it would be awesome to see another example of data corruption in unRAID and then the steps needed for btrfs to "rebuild" the corrupt data from its checksums (or however btrfs rebuilds corrupt/bit-rotted data).

 

Sure, I could easily show that. I'll have to set up a lab to demonstrate, but I'll try to make that happen soon. Probably put it in a blog post too.

 

Thank you! Last I checked, LT was using XFS as the default FS when shipping the official servers. Is that still true, or have you guys switched to btrfs now? If you haven't switched to btrfs yet, do you have any idea what circumstances might push you to change?

Link to comment

I don't really care to get into any further discussion with fanboys about whether BTRFS works or not. The simple fact is that a system which works 100% fine using a different filesystem might not work with BTRFS. I'm NOT the only one who has had issues trying to use BTRFS on a single cache drive with no pooling. I posted it so the OP can read it and decide if he wants to follow the warning to test it heavily or just ignore it.

 

Since scrub was questioned: unless I've missed something, the array disks are still used stand-alone even with BTRFS as the filesystem. The info I have read says scrub can only tell you where the issues are on stand-alone drives, but it can't fix them. I don't really care enough to follow developments, but this is what the published info said, and it would explain why scrub can't fix my docker image.

Link to comment

 

 

I don't really care to get into any further discussion with fanboys about whether BTRFS works or not. The simple fact is that a system which works 100% fine using a different filesystem might not work with BTRFS. I'm NOT the only one who has had issues trying to use BTRFS on a single cache drive with no pooling. I posted it so the OP can read it and decide if he wants to follow the warning to test it heavily or just ignore it.

 

Since scrub was questioned: unless I've missed something, the array disks are still used stand-alone even with BTRFS as the filesystem. The info I have read says scrub can only tell you where the issues are on stand-alone drives, but it can't fix them. I don't really care enough to follow developments, but this is what the published info said, and it would explain why scrub can't fix my docker image.

 

Umm, not sure who you are referring to as a fanboy here. All I'm saying is that the number of people who have had issues with btrfs since 6.0 final is extremely small.  Again, a few bad apples don't mean you torch the whole orchard.

And regarding the info you claim about scrub not helping on single devices, I'm curious where you read that as it seems completely contradictory to the fundamentals of it.  The way scrub works is by leveraging copy on write, which is enabled on all btrfs devices in unRAID by default. If a bit was corrupted, the scrub would detect that by comparing to the metadata checksum and repair it by using this previously written bit for that item.

 

Your scrub may have failed to repair errors for one of two reasons: either you didn't take off the -r in the GUI field for it (which tells scrub to perform a check only), or the data that was corrupted was set to not use COW, which is the case for certain file types, including the loopback image itself and any virtual disks for VMs you have.
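
For reference, the pieces described above map to commands along these lines (the mount point is hypothetical; -r is the read-only/check-only flag, and lsattr's 'C' attribute marks NOCOW files):

```bash
# Check-only scrub: detects checksum errors but does not attempt any repair
btrfs scrub start -rB /mnt/cache

# Normal scrub: repairs errors wherever a good copy (mirror or DUP metadata) exists
btrfs scrub start -B /mnt/cache

# Files created NOCOW carry no data checksums, so scrub cannot validate them.
# The 'C' attribute identifies such files/directories:
lsattr /mnt/cache/docker.img
lsattr -d /mnt/cache/domains
```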

 

Anyhow, your feedback to the OP is fine, but I do think you are in the minority camp of people still having issues with btrfs.

Link to comment
And regarding the info you claim about scrub not helping on single devices, I'm curious where you read that as it seems completely contradictory to the fundamentals of it.  The way scrub works is by leveraging copy on write, which is enabled on all btrfs devices in unRAID by default. If a bit was corrupted, the scrub would detect that by comparing to the metadata checksum and repair it by using this previously written bit for that item.

 

Everywhere I've read about the details of what scrub can do says it can't repair anything on a single drive.

 

Where would a stand-alone filesystem get the data required to repair a corruption? It makes perfect sense that it can't, since the checksum is a very small portion of the data. With such a small CRC, it can only tell you that the data block it's protecting is bad, but not which bit is bad. You'd have to dedicate a large portion of a hard drive to a CRC-type checking scheme before the CRC could be used to correct the data, probably 30% or more of the drive capacity. You could compare the CRC to unRAID's parity: a parity check will tell you there is a data issue at a certain location on one of the drives, but not which drive has the issue.
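
A crude, filesystem-agnostic way to see the "detect but can't repair" limitation of a bare checksum (throwaway file, hypothetical paths):

```bash
# Record a checksum of a test file
dd if=/dev/urandom of=/tmp/blob bs=1M count=4
sha256sum /tmp/blob > /tmp/blob.sha256

# Overwrite a single byte somewhere in the middle, as bit rot might
printf 'X' | dd of=/tmp/blob bs=1 count=1 seek=2097152 conv=notrunc

# The stored checksum tells us the file is now bad, but it contains nowhere
# near enough information to say which byte changed or what it used to be
sha256sum -c /tmp/blob.sha256
```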

 

This is why Noob asked if the results from the scrub would be used by an operation that used the parity drive + other data drives to re-create the bad spot on the data drive where the scrub found a data issue.

 

Link to comment

This is why Noob asked if the results from the scrub would be used by an operation that used the parity drive + other data drives to re-create the bad spot on the data drive where the scrub found a data issue.

 

That's correct. I didn't know whether or how BTRFS works at a single-disk level; that was the impetus for my question that you are referencing. I do know, however, how BTRFS works on any RAID array where there is redundancy. It checks the checksum of the file on one disk when the file is scrubbed or the next time it is read. If the checksum doesn't match, it checksums the other copy. If that one matches correctly, the bad file on the first disk is overwritten with the good one from the second disk. In this way you are always repairing corruption in real time, before the user ever notices it or experiences a fatal error.

 

I had to do some real digging to figure out an answer to @jonp's question. If I had done this in the first place I wouldn't have had to drag the forum into a war over BTRFS, but I think this discussion is a useful one to have. Also, I was hoping not to have to do this research because someone would have an answer for me. Without further ado, here is the answer you have all been waiting for, followed by citations:

 

BTRFS without RAID redundancy is able to reference checksums in just the same manner as explained above, but is not able to repair bad data once discovered. It will return an error to let you know that your file is corrupt, rather than unknowingly feeding you or your application junk data. This is clearly still an advantage over dinosaur filesystems, but hardly what I would like to see come out of that powerful native-checksumming capability. Since BTRFS was designed with RAID in mind, this "single disk uncorrectable error" problem was not some kind of an oversight by the devs. It does mean, however, that BTRFS does not provide unRAID with any ability to recover my corrupt data (which I thought it did, that's why I'm using it in the first place). To use BTRFS to its fullest and allow for silent data correction, unRAID will have to write full parity data somewhere.

 

In order to maintain the one-disk-access feature of unRAID, you might do this by implementing RAID 1 and still only spinning up one disk in one array when a file needs to be read. If the checksum is good, then that is all that would be necessary to serve your file. However, if the checksum failed, you could then spin up the corresponding disk in the other array and repair the first disk's file using the known good data. Obviously, this would need to be an optional feature, because it would double the amount of disk space required for every file written to one of those BTRFS/redundant shares, and remove some of the features that make unRAID, well, unRAID. In my opinion, it is not at all clear to users that this is how BTRFS operates in unRAID, and many users believe they are protected when they are not. Take a look around the forum and that becomes very clear. The help/support pages should be updated to explain this as soon as is practicable.

 

Sources:

"Corruption detection and correction. In Btrfs, checksums are verified each time a data block is read from disk. If the file system detects a checksum mismatch while reading a block, it first tries to obtain (or create) a good copy of this block from another device—if mirroring or RAID techniques are in use. If a good copy is found, it is returned instead and the bad block is corrected. This self-healing mechanism does not appear to introduce significant overhead, and it provides a huge benefit: File systems always are consistent. Administrators are notified of repair events, and checksum failures are logged to the syslog facility." http://www.oracle.com/technetwork/articles/servers-storage-admin/advanced-btrfs-1734952.html

 

The quote above implies, but does not explicitly state, that without a mirror copy, there is no ability to repair. However, I was able to find the original blog post of one of the devs testing BTRFS' healing ability back in 2011, which does state this explicitly. He runs his benchtest using RAID 1 and concludes by saying, "Everything got repaired. This happens on both data and metadata. If there was a true IO error reading from one of the 2 sides we'd have handled that in the filesystem as well. If you don't have mirroring then with CRC it would have told you it was bad data and given you an IO error (instead of reading junk)." https://blogs.oracle.com/wim/entry/btrfs_scrub_go_fix_corruptions
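
For anyone who wants to reproduce that benchtest themselves, here is a hedged sketch along the same lines using two loopback devices instead of real disks (all paths, sizes, and offsets are hypothetical):

```bash
# Two backing files, assembled into a btrfs raid1 for both data and metadata
truncate -s 1G /tmp/r1-a.img /tmp/r1-b.img
A=$(losetup -f --show /tmp/r1-a.img)
B=$(losetup -f --show /tmp/r1-b.img)
mkfs.btrfs -d raid1 -m raid1 "$A" "$B"
mkdir -p /mnt/r1
mount "$A" /mnt/r1

# Write data, unmount, and damage ONE of the two mirrors
dd if=/dev/urandom of=/mnt/r1/testfile bs=1M count=200
umount /mnt/r1
dd if=/dev/urandom of=/tmp/r1-a.img bs=1M count=64 seek=300 conv=notrunc

# On remount, a scrub should find the checksum errors on the damaged device
# and rewrite the bad blocks from the intact mirror
mount "$A" /mnt/r1
btrfs scrub start -B /mnt/r1
btrfs scrub status /mnt/r1
```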

Link to comment

When reading about BTRFS, there are all kinds of claims about how it's self-protecting and self-correcting, but that never made sense to me at the single-disk level, since nothing suggests the filesystem greatly reduces the amount of available data storage space. A CRC small enough to have little impact on the data space of a disk isn't large enough to be of any use correcting errors. As an example, a CD/DVD/Blu-ray uses a CRC scheme where about 1/4 of the bits on the disc are CRC/redundancy code to allow correcting for read errors.

 

Limetech has claimed that unRAID has the ability to use parity + the other data disks to write back to a disk when a read fails due to a bad block. The disk hardware is supposed to re-allocate a bad sector on write, which allows it to properly store the data block again. I have no idea if this really works, or if it could be used in combination with scrub or any other features of a BTRFS-formatted disk. But a write to the detected bad part of the disk using reconstructed data is a required operation to make unRAID capable of self-healing or user-initiated healing when using BTRFS-formatted disks.

 

It appears that this is a good thread. Apparently, it points out a hole in how Limetech expected an array to operate when using BTRFS. I'm actually surprised I have to point this out, since I know squat about BTRFS compared to anyone following it. I would have thought LimeTech would understand it fully before supporting its use, i.e. know its capabilities and recovery methods.

 

 

Link to comment

Also, the concept that COW somehow helps with file repair never made sense to me. Here is how I would try to explain, simply, why COW doesn't matter for file repair. I'll greatly simplify the example to a single sector to get the basic premise across.

 

Copy on Write ensures that the system only has one exact copy of the data for a file. You have a sector filled with data; call this sector A. You then modify that sector by writing to it; call the result sector B. Nowhere does the filesystem record the transform needed to go from sector A to sector B. If the system detects corruption in sector B, nothing from sector A can help, unless you know the transform from A to B. The system does not know that transform.

 

What the system can possibly do is give you a snapshot of the file when it was on sector A.

 

Also, if you create an exact copy of a file, the system can just alter the filesystem housekeeping bits to point the second file at the exact same sectors as the original file, so initially you save space.
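
That last point is essentially what a btrfs reflink copy does; a quick way to see it on any btrfs mount (the path is hypothetical):

```bash
# A reflink copy shares the original file's extents instead of duplicating
# the data, so it completes almost instantly and costs no space up front
cp --reflink=always /mnt/disk1/bigfile.iso /mnt/disk1/bigfile-copy.iso

# Space usage barely changes until one of the copies is modified, at which
# point copy-on-write splits the shared extents
btrfs filesystem df /mnt/disk1
```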

Link to comment

Just an FYI, COW is REQUIRED in order for scrub to even have a chance to work, because data checksumming depends on COW.  This was confirmed not from online research, but in e-mails between us and a Fujitsu developer on the btrfs mailing list:

 

2:  Out of curiosity, why is data checksumming tied to COW?

 

There's no safe way to sanely handle checksumming without COW, because there is no way (at least on current hardware) to ensure that the data block and the checksums both get written at the exact same time, and that one of the writes aborting will cause the other to do so as well.  In-place compression is disabled for nodatasum files for essentially the same reason (although compression can cause much worse data loss than a failed checksum).

 

The problem with a lot of the online research is that it can be outdated and/or completely invalid.  E-mail the btrfs mailing list if you want answers, because I don't really trust information from sources other than that or the btrfs wiki (or posts by Chris Mason or other btrfs developers themselves).

 

Another thing to consider is that I, @jonp, am NOT a programmer / developer.  At best, you could call me a designer / engineer, but I don't write code.  I am conversing in these threads with you guys in the same pursuit of clarification of information.  When a topic like this gets as deep as it is, I need to bring in the big guns (the actual devs) to help validate information.

Link to comment

I'm pleased that everyone appreciates this thread. I have been a bit of a rabble-rouser on here if you look at my threads since I started posting last week. Mostly, I feel like I've been annoying the community members who want to do things one way and not consider anything other than what they already know. This thread kind of started that way, but has evolved to be a very useful source of information for other users.

 

So, I will again ask a question now that we have a better picture of what is going on here. Does unRAID use the parity disk to rebuild filesystem errors in BTRFS? If not, this needs to be enabled. This solution didn't occur to me until after I left the forum yesterday, but it is a simple enough way to put BTRFS to work and does not increase data storage overhead at all, since the parity device is already there for the purpose of rebuilding failed disks anyhow. Brilliant!

 

Also, just a note that I VERY MUCH appreciate jonp and all of his helpful responses/inquiries in this thread. I am a die-hard BTRFS supporter and I believe that it is the future of array file storage. I'm glad that someone on the dev team is informed and advocating for it, and I appreciate that jonp was willing to stand up to a crowd in this thread. This is how shit gets done, folks.

Link to comment

Another thing to consider is that I, @jonp, am NOT a programmer / developer.  At best, you could call me a designer / engineer, but I don't write code.  I am conversing in these threads with you guys in the same pursuit of clarification of information.  When a topic like this gets as deep as it is, I need to bring in the big guns (the actual devs) to help validate information.

 

How do we drag them in to the thread? Anyone have an e-mail address for Chris Mason, or can someone send a link to this discussion to the BTRFS devs and see if they care to weigh in?

Link to comment

Another thing to consider is that I, @jonp, am NOT a programmer / developer.  At best, you could call me a designer / engineer, but I don't write code.  I am conversing in these threads with you guys in the same pursuit of clarification of information.  When a topic like this gets as deep as it is, I need to bring in the big guns (the actual devs) to help validate information.

 

How do we drag them in to the thread? Anyone have an e-mail address for Chris Mason, or can someone send a link to this discussion to the BTRFS devs and see if they care to weigh in?

I just dropped a note to the btrfs mailing list to get clarification.

Link to comment

Does unRAID use the parity disk to rebuild filesystem errors in BTRFS? If not, this needs to be enabled.

 

At the present moment, no, it does not work that way.

 

This solution didn't occur to me until after I left the forum yesterday, but it is a simple enough way to put BTRFS to work and does not increase data storage overhead at all, since the parity device is already there for the purpose of rebuilding failed disks anyhow. Brilliant!

 

If btrfs detected an unrecoverable error through a scrub today, I suppose that would be the result of what the cool kids like to call "bitrot", and therefore parity wouldn't have been updated to reflect it. So if you ejected the disk and replaced it with an alternate, a rebuild onto the alternate should solve the problem.  Not the right approach to solve the problem, but it would prove that unRAID's parity disk could be used to correct the issue.  This thread is starting to turn into a feature request, and that's a good thing!

 

In the meantime, even if btrfs scrub on single devices can't correct errors, automatic checksumming of data as it's written to the filesystem is definitely better than being forced to generate your own checksums after the fact as part of a process external to the filesystem.
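
For comparison, the "external to the filesystem" approach looks something like this (hypothetical share path), and it only protects whatever you remember to hash and re-verify, whereas btrfs checksums every data block automatically as it is written:

```bash
# Build a checksum manifest for a share after the data has been written...
find /mnt/user/media -type f -print0 | xargs -0 sha256sum > /boot/media.sha256

# ...and verify it later. A mismatch means the file changed or rotted, but by
# then the original good data may already be gone.
sha256sum -c /boot/media.sha256 | grep -v ': OK$'
```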

 

Also, just a note, that I VERY MUCH appreciate jonp and all of his helpful responses/inquires in this thread. I am a die-hard BTRFS supporter and I believe that it is the future of array file storage. I'm glad that someone on the dev team is informed and advocating for it, and I appreciate that jonp was willing to stand up to a crowd in this thread. This is how shit gets done, folks.

 

You are most welcome.  BTRFS has had quite a journey since it first began, but it's come a long way and is absolutely considered stable.  From the very first words in the very first paragraph when you go to the BTRFS wiki:

 

The filesystem disk format is no longer unstable

 

So I do find it a bit funny when folks try to claim it's not.  Especially when there are so many distributions not only including it, but making it the default.  And more importantly, it's really the only game in town as far as a Linux-native advanced filesystem.  ZFS is NOT Linux-native, as its license precludes it from being so (thanks to Oracle).  We are curiously paying attention to the recently announced bcachefs, but it will be quite a while before that filesystem matures.

 

Also, I just got a reply on the mailing list on this subject:

 

...scrub on "single"  or  "raid0" profiles will detect errors but cannot correct them.

 

By default, on a single rotating disk, data (file content) uses the "single" profile, but metadata (filesystem information, I believe including inodes and dirents) is DUP, so it can be recovered if an error is detected.

On non-rotational media (SSD), a single disk defaults to the "single" allocation profile for both.
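
You can check which allocation profiles a given disk actually ended up with (the mount point is hypothetical), and duplicated metadata can also be requested explicitly at format time:

```bash
# Shows the Data / Metadata / System block-group profiles in use, e.g.
# Data: single, Metadata: DUP on a typical single rotating disk
btrfs filesystem df /mnt/disk1

# Requesting duplicated metadata explicitly when formatting a device
mkfs.btrfs -m dup -d single /dev/sdX1
```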

Link to comment

Just an FYI, COW is REQUIRED in order for scrub to even have a chance to work, because data checksumming depends on COW.  This was confirmed not from online research, but in e-mails between us and a Fujitsu developer on the btrfs mailing list:

 

2:  Out of curiosity, why is data checksumming tied to COW?

 

There's no safe way to sanely handle checksumming without COW, because there is no way (at least on current hardware) to ensure that the data block and the checksums both get written at the exact same time, and that one of the writes aborting will cause the other to do so as well.  In-place compression is disabled for nodatasum files for essentially the same reason (although compression can cause much worse data loss than a failed checksum).

 

That's an interesting implementation limitation, given that COW has no practical purpose for data scrubbing / checksumming in itself. It's just the only means by which they managed to implement checksumming.

Link to comment

 

So, I will again ask a question now that we have a better picture of what is going on here. Does unRAID use the parity disk to rebuild filesystem errors in BTRFS? If not, this needs to be enabled.

 

You're assuming that the corruption was silent corruption at the bit level and not done via normal data writes? If the corruption occurred via writes to the block-device data drive, those same writes will have been reflected in the parity drive. You will not have anything on the parity drive to work from to correct the writes which caused errors on the data disk. You will only be able to exactly replicate the data drive as it stands.

Link to comment

Right, I'm assuming silent data corruption. If you write bad data, that's a different problem than BTRFS checksumming is designed to address.

 

I'd use a backup to fix an accidental overwrite of a file or a program that wrote bad data. This whole thread is, as far as I'm concerned, about repairing bitrot/silent corruption.

Link to comment

@Jonp, quick question that likely does not make any sense but that's never stopped me from asking before...

 

Is there any means of creating the docker image file, which is set up as a virtual BTRFS internally, in a mirrored BTRFS fashion? If so, would that help it recover from data corruption automatically, since it's no longer a single-pool setup?

Link to comment

 

So, I will again ask a question now that we have a better picture of what is going on here. Does unRAID use the parity disk to rebuild filesystem errors in BTRFS? If not, this needs to be enabled.

 

You're assuming that the corruption was silent corruption at the bit level and not done via normal data writes? If the corruption occurred via writes to the block-device data drive, those same writes will have been reflected in the parity drive. You will not have anything on the parity drive to work from to correct the writes which caused errors on the data disk. You will only be able to exactly replicate the data drive as it stands.

 

I'm pretty sure that's true for any filesystem and even hardware RAID.  If you wrote a file to any filesystem, even ZFS, and the corruption occurred during the write itself, how would any filesystem repair that?  The only way to even know that the write was corrupted would be to checksum the file after the write is done and compare it to the source data.  That would not be the responsibility of the filesystem itself to solve, but rather the software / method used to copy the data (such as TeraCopy).  I'm pretty sure filesystem-level checksums are designed to detect and repair bit-level corruption only anyhow.

Link to comment

@Jonp, quick question that likely does not make any sense but that's never stopped me from asking before...

 

Is there any means of creating the docker image file, which is set up as a virtual BTRFS internally, in a mirrored BTRFS fashion? If so, would that help it recover from data corruption automatically, since it's no longer a single-pool setup?

 

Ok, I stand corrected, this thread has now turned into TWO feature requests.  That's not necessarily a bad idea.  So basically two loopback BTRFS images of the same size (and even on the same btrfs pool if you wanted).  Create those as a btrfs raid1, then scrubbing could, in theory, repair btrfs errors like this.

 

I honestly don't know how that would work, but I'm eager to find out!!  Will be trying this next week!!
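
A very rough sketch of what that experiment might look like (hypothetical paths and sizes; this is not how unRAID provisions the docker image today):

```bash
# Two equal-size image files on the cache pool instead of a single docker.img
truncate -s 20G /mnt/cache/docker-a.img /mnt/cache/docker-b.img
A=$(losetup -f --show /mnt/cache/docker-a.img)
B=$(losetup -f --show /mnt/cache/docker-b.img)

# One btrfs filesystem spanning both loop devices, mirrored for data and metadata
mkfs.btrfs -d raid1 -m raid1 "$A" "$B"
mount "$A" /var/lib/docker

# A scrub of this mount could then, in theory, repair checksum errors inside
# the docker filesystem from the other mirror instead of only reporting them
btrfs scrub start -B /var/lib/docker
```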

Link to comment
