How to implement BTRFS checksumming with unRAID?



Would this BTRFS loopback pool work as a substitute for using the parity data to rebuild BTRFS errors, or as an additional source for good data to be drawn from?

 

I prefer using parity data because, as stated, there is already space allocated for parity data. Any RAID 1 config will work just as well, but at the expense of doubling the size of your dataset.

Link to comment

Would this BTRFS loopback pool work as a substitute for using the parity data to rebuild BTRFS errors, or as an additional source for good data to be drawn from?

 

I prefer using parity data because, as stated, there is already space allocated for parity data. Any RAID 1 config will work just as well, but at the expense of doubling the size of your dataset.

 

No, it's for Docker and Docker only.

 

Currently, Docker containers are stored inside a single file, docker.img. Internally, this file is a single-device BTRFS pool. For scrubbing to correct errors on it rather than just detect them, this virtual BTRFS pool would need multiple devices in some sort of mirror mode.
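To picture what that single-device setup looks like, here's a minimal sketch of how a loopback BTRFS image along the lines of docker.img can be created and mounted by hand (the path, size and mount point are illustrative, not unRAID's actual commands):

# Create a sparse image file and format it as a single-device btrfs filesystem
truncate -s 20G /mnt/cache/docker.img
mkfs.btrfs /mnt/cache/docker.img

# Mount it through a loop device; everything inside is checksummed btrfs,
# but with only a single copy of the data, so errors can only be detected
mount -o loop /mnt/cache/docker.img /var/lib/docker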

Link to comment

Would this BTRFS loopback pool work as a substitute for using the parity data to rebuild BTRFS errors, or as an additional source for good data to be drawn from?

 

I prefer using parity data because, as stated, there is already space allocated for parity data. Any RAID 1 config will work just as well, but at the expense of doubling the size of your dataset.

 

No, we're talking about something totally independent of your original request.  In addition to supporting btrfs as a filesystem for array devices, we also use it for Docker.  So say you have a completely non-btrfs setup with unRAID, but want to use Docker.  Well Docker leverages btrfs snapshotting, subvolumes, COW, etc.  To support Docker without forcing folks to use btrfs for their storage devices themselves, we create a virtual disk image on top of an existing device formatted with whatever filesystem you want (that unRAID supports, that is).  Then we format the virtual disk with btrfs.  What BRiT is talking about is the ability to then create another virtual disk image for btrfs, then make those two virtual disks into a btrfs raid1, which would then enable us to do btrfs scrub and repair errors on the docker loopback image.
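For anyone who wants to experiment with that idea, a rough sketch of the loopback raid1 approach could look like this (file names, sizes and mount points are made up for illustration; this is not an official unRAID procedure):

# Two equally sized image files, attached to loop devices
truncate -s 20G /mnt/disk1/docker-a.img
truncate -s 20G /mnt/disk1/docker-b.img
LOOP_A=$(losetup --find --show /mnt/disk1/docker-a.img)
LOOP_B=$(losetup --find --show /mnt/disk1/docker-b.img)

# Format both loop devices as one btrfs with raid1 data and metadata
mkfs.btrfs -d raid1 -m raid1 "$LOOP_A" "$LOOP_B"

# Make sure btrfs knows about both members, then mount the pool
btrfs device scan
mount "$LOOP_A" /var/lib/docker

# A scrub can now repair a bad block from its mirror instead of only reporting it
btrfs scrub start /var/lib/docker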

Link to comment

Perfect. I didn't realize that Docker used BTRFS in its image file, regardless of what filesystem was in use on the disk. I think that's a great request, too, and can save some trouble for people who have corrupted docker images in the future.

 

That said, I think protecting the humongous data array should be a higher priority than protecting the docker image, but both are worthwhile features to add.

 

I like this thread.

Link to comment

@Jonp, quick question that likely does not make any sense but that's never stopped me from asking before...

 

Is there any means of creating the docker image file so that it is set up internally as a virtual BTRFS in a mirrored mode? If so, would that help to recover from data corruption automatically, since it's no longer a single-device pool?

 

Ok, I stand corrected, this thread has now turned into TWO feature requests.  That's not necessarily a bad idea.  So basically two loopback BTRFS images of the same size (and even on the same btrfs pool if you wanted).  Create those as a btrfs raid1, then scrubbing could, in theory, repair btrfs errors like this.

 

I honestly don't know how that would work, but I'm eager to find out!!  Will be trying this next week!!

So, as an extension to this, could we get a feature that software-RAID1's a single physical disk into two volumes to implement data redundancy and correction? I realize this won't be true RAID1 protection, but I think it would go a long way toward making the cache drive more resilient to corruption caused by power interruptions and crashes. Maybe not to the level of two physical disks, but if you have the extra space, it may be a good option.

 

Perhaps also allow carving up multiple cache drives into virtual volumes. Say you have two SSDs, a 250GB and a 500GB. Maybe carve up half the 500 to provide RAID1 for the 250, but still allow the other half of the 500 for non-fault-tolerant use. Or any other combination that would make sense, like perhaps two 500GB drives, each halved, with one pair in RAID1 and the other in RAID0 for speed. You would end up with 250GB fault-tolerant and still have 500GB to use for VMs that don't need real-time redundancy.

Link to comment

@Jonp, quick question that likely does not make any sense but that's never stopped me from asking before...

 

Is there any means of creating the docker image file so that it is set up internally as a virtual BTRFS in a mirrored mode? If so, would that help to recover from data corruption automatically, since it's no longer a single-device pool?

 

Ok, I stand corrected, this thread has now turned into TWO feature requests.  That's not necessarily a bad idea.  So basically two loopback BTRFS images of the same size (and even on the same btrfs pool if you wanted).  Create those as a btrfs raid1, then scrubbing could, in theory, repair btrfs errors like this.

 

I honestly don't know how that would work, but I'm eager to find out!!  Will be trying this next week!!

So, as an extension to this, could we get a feature that software-RAID1's a single physical disk into two volumes to implement data redundancy and correction? I realize this won't be true RAID1 protection, but I think it would go a long way toward making the cache drive more resilient to corruption caused by power interruptions and crashes. Maybe not to the level of two physical disks, but if you have the extra space, it may be a good option.

 

Perhaps also allow carving up multiple cache drives into virtual volumes. Say you have two SSDs, a 250GB and a 500GB. Maybe carve up half the 500 to provide RAID1 for the 250, but still allow the other half of the 500 for non-fault-tolerant use. Or any other combination that would make sense, like perhaps two 500GB drives, each halved, with one pair in RAID1 and the other in RAID0 for speed. You would end up with 250GB fault-tolerant and still have 500GB to use for VMs that don't need real-time redundancy.

 

I actually think that for the array devices, using unRAID's parity disk to correct errors found from btrfs scrub would be a far better and more efficient way to deliver what you're asking.

Link to comment

Just got a response from someone else in the mailing list with some interesting feedback.

 

... And more to the point, expanding on that, on a single-device btrfs, data is single mode by default, so scrub for it (as opposed to metadata) is error-detect-only, as mentioned.

However, while the default mode separates data and metadata, and in that mode data was historically single-only (there's a patch to change this, adding the missing option), mixed-bg mode (the mkfs.btrfs --mixed option) puts data and metadata both in the same shared block-group type, which can then be either dup or single mode.

Obviously, duplicating data as well as metadata means you can only store half as much data, since it's all stored twice, but that will let scrub correct errors in cases where only one of the two copies fails to verify against its checksum and the other one does.

And as mentioned above, there's a patch in process now that will remove the single-device restriction of data (as opposed to metadata) to single mode, allowing the choice of dup mode for data as well as metadata.

Also, in addition to the mixed-mode workaround to get dup data, it's possible, although rather inefficient in performance terms, to partition a physical device so that two equal-sized partitions are made available as logical devices, and then mkfs.btrfs -d raid1 -m raid1 the two logical devices into a single btrfs, raid1 for both data and metadata. btrfs creates two copies that way, again letting scrub correct errors when only one of the two fails to verify against its checksum.

 

Sounds like this --mixed option would let you store redundant copies of the data on a single device, enabling scrub to repair errors.  This sacrifices 50% of the capacity on that device obviously, but provides the ability to perform a repair.
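If anyone wants to try either approach on a scratch device, it would look roughly like this (the device names and mount point are placeholders, and mkfs.btrfs will wipe whatever is on them):

# Option 1: mixed block groups with dup on a single device
mkfs.btrfs --mixed -d dup -m dup /dev/sdX1

# Option 2: split one disk into two equal partitions and mirror them
mkfs.btrfs -d raid1 -m raid1 /dev/sdX1 /dev/sdX2

# Either way, a scrub can then repair a block whose checksum fails,
# as long as the second copy is still good
mount /dev/sdX1 /mnt/test
btrfs scrub start /mnt/test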

Link to comment

Yup, agreed. Parity is positively the best solution for people who are warehousing lots of data and need their capacity to go far.

 

However, because btrfs allows you to select whether you want these functions on or off, adding those options to the GUI of unRAID would give great flexibility. Then, someone could, for example, use disk 5 as a btrfs dup disk and disks 1-4 in btrfs single. Any shares that contain sensitive data that just has to be right could be stored on disk 5, with everything else that's less critical on disks 1-4.

 

I'll probably keep everything in dup mode because that's the feature of btrfs that I like the most, but others will want to be able to choose based on their risk tolerance.
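For reference, switching an already-formatted single-device btrfs disk between profiles is done with a balance; something along these lines (the mount point is illustrative, and dup for data on a single device needs the newer btrfs code mentioned in the mailing-list quote above):

# Convert both data and metadata to dup (two copies on the same disk)
btrfs balance start -dconvert=dup -mconvert=dup /mnt/disk5

# ...or convert data back to single if you'd rather keep the capacity
btrfs balance start -dconvert=single /mnt/disk5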

Link to comment

There is one other possibility as well using reflink.  So for those unaware, file-level snapshotting on btrfs is as easy as this:

 

cp /path/to/source /path/to/dest --reflink

 

Doing this will cause the destination to simply be a snapshot of the source.  However, unlike traditional snapshots that require the source to remain for the destination to work, you can modify/delete the source in btrfs.  Here's the basic test:

 

root@unJON:/mnt/cache/appdata/tmp# echo 123 > source
root@unJON:/mnt/cache/appdata/tmp# cp source dest --reflink
root@unJON:/mnt/cache/appdata/tmp# echo 456 >> dest
root@unJON:/mnt/cache/appdata/tmp# echo 789 >> source
root@unJON:/mnt/cache/appdata/tmp# cat source
123
789
root@unJON:/mnt/cache/appdata/tmp# cat dest
123
456
root@unJON:/mnt/cache/appdata/tmp# rm source
root@unJON:/mnt/cache/appdata/tmp# cat dest
123
456

 

Another article I read (see here) talks about using this to recover from corruption. Here's the example:

 

Let's say there is a file foo. The file is snapshotted nightly with the command "snap.ocfs2 foo foo.$(date)". Today is 2008.11.18. Yesterday the file was corrupted, and so you want to go back to the version from two days ago. It's simple.

 

# unlink foo
# reflink foo.2008.11.18 foo

Now foo is an exact copy of the snapshot, and it takes no extra space to boot. The CoW properties of the refcount tree will take hold when you start modifying foo.

 

So their commands are a little different, but the result is the same.  Not sure if this would protect against bitrot corruption.  Thoughts?
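On btrfs, the equivalent of that recovery would presumably just use cp again; something like this (the snapshot filename follows the article's naming convention and is only illustrative):

# Replace the corrupted file with a reflink copy of last night's snapshot
rm foo
cp foo.2008.11.18 foo --reflink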

Link to comment

There is one other possibility as well using reflink.  So for those unaware, file-level snapshotting on btrfs is as easy as this:

 

cp /path/to/source /path/to/dest --reflink

 

Doing this will cause the destination to simply be a snapshot of the source.  However, unlike traditional snapshots that require the source to remain for the destination to work, you can modify/delete the source in btrfs.  Here's the basic test:

 

root@unJON:/mnt/cache/appdata/tmp# echo 123 > source
root@unJON:/mnt/cache/appdata/tmp# cp source dest --reflink
root@unJON:/mnt/cache/appdata/tmp# echo 456 >> dest
root@unJON:/mnt/cache/appdata/tmp# echo 789 >> source
root@unJON:/mnt/cache/appdata/tmp# cat source
123
789
root@unJON:/mnt/cache/appdata/tmp# cat dest
123
456
root@unJON:/mnt/cache/appdata/tmp# rm source
root@unJON:/mnt/cache/appdata/tmp# cat dest
123
456

 

Another article I read (see here) talks about using this to recover from corruption. Here's the example:

 

Let's say there is a file foo. The file is snapshotted nightly with the command "snap.ocfs2 foo foo.$(date)". Today is 2008.11.18. Yesterday the file was corrupted, and so you want to go back to the version from two days ago. It's simple.

 

# unlink foo
# reflink foo.2008.11.18 foo

Now foo is an exact copy of the snapshot, and it takes no extra space to boot. The CoW properties of the refcount tree will take hold when you start modifying foo.

 

So their commands are a little different, but the result is the same.  Not sure if this would protect against bitrot corruption.  Thoughts?

 

Mind f-ing blown

Link to comment

Mind f-ing blown

 

;-)

 

BTW, yes, this does work with virtual disk images for VMs as well ;-).  And the time it takes to create a reflink copy is measured in seconds, even with a 35GB+ size vdisk.  What was even more surprising is that this worked, even though I disabled COW for my virtual disk image file.
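As a concrete example of that vdisk case (the paths below are hypothetical placeholders), the same one-liner applies:

# Near-instant point-in-time copy of a VM's virtual disk on btrfs
cp /mnt/cache/domains/win10/vdisk1.img /mnt/cache/domains/win10/vdisk1-backup.img --reflink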

Link to comment

;-)

 

BTW, yes, this does work with virtual disk images for VMs as well ;-).  And the time it takes to create a reflink copy is measured in seconds, even with a 35GB+ size vdisk.  What was even more surprising is that this worked, even though I disabled COW for my virtual disk image file.

 

You would have a lot of fun playing around with EMC's XtremIO products [ http://xtremio.com/ ]. It's so fast when creating snapshots of our production servers and databases. Pretty similar concepts.

Link to comment

 

 

;-)

 

BTW, yes, this does work with virtual disk images for VMs as well ;-).  And the time it takes to create a reflink copy is measured in seconds, even with a 35GB+ size vdisk.  What was even more surprising is that this worked, even though I disabled COW for my virtual disk image file.

 

You would have a lot of fun playing around with EMC's XtremIO products [ http://xtremio.com/ ]. It's so fast when creating snapshots of our production servers and databases. Pretty similar concepts.

 

Yup. I used to work for an IT integrator that sold solutions like that from both EMC and NetApp.  Pretty amazing what a couple $100k can get you, eh?  Even nicer when we can bring similar benefits to our own users for a fraction of the cost.

Link to comment

This is a cool development, but it only solves the problem if you happen to run a scrub on your device before your foo.12.18.2008 file gets rotated off of the disk, right?

 

If you don't notice that the checksum fails before the good copy goes bye-bye, then having seven versions of a corrupted file is not going to help anyone. The nice thing about RAID parity is that you always have a second copy somewhere; it's not time-limited. The odds of both files corrupting on two independent disks before you run your next bi-weekly or monthly scrub are impossibly small. I don't have enough space to keep 14 or 30 days' worth of foo.(date) hanging around, and I don't want to run scrubs much more often than bi-weekly because of the consequences for disk life. So, unless I misunderstood this, it is cool, but it is not a replacement for a conventional btrfs setup.
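For what it's worth, if scrub frequency is the worry, scheduling one is trivial; a monthly cron entry might look like this (the pool path is an example, and depending on cron's PATH the btrfs binary may need its full path):

# Start a scrub of the cache pool at 03:00 on the 1st of every month
0 3 1 * * btrfs scrub start /mnt/cache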

Link to comment
  • 2 weeks later...

Seeing as I got no answer, I went and found Jon's discussion with the btrfs devs on the listserv. They claim that doing block-level rebuilds from parity is possible, but that it is a non-trivial task to accomplish.

 

Jon P made a statement about simply using parity data to restore an entire disk when a BTRFS checksum fails. That solution doesn't sound like it would be difficult to implement because that's what the parity in unRAID is already configured to be able to do, albeit under different circumstances. Jon admitted that this is not an ideal solution, but that it would be better than nothing.

 

I have to agree; it is better than nothing. If unRAID were able to rebuild my data for me when it failed a checksum, even if that means rebuilding an entire disk, I would start deploying unRAID servers. This is the only missing feature keeping me from becoming a customer.

 

Updates from the LT team, beyond those weeks-old posts on the listserv, would be appreciated.

Link to comment

I'm going to hazard a guess that while jon might be discussing it on the listserv, and shit who knows maybe even playing a little on a test box, LT's entire effort is being put towards finalizing 6.2 and what we all believe is dual parity support. So I wouldn't expect much movement on this front for a while.

 

In the meantime, if you want data rebuild capability you might look into the Checksum Suite plug-in, which includes PAR2 functionality that can both identify corrupt files and repair them. Being a community plug-in rather than built into unRAID, it would be understandable if you weren't willing to "deploy" this as a solution, if by "deploy" you mean as a VAR rather than for your own home usage.

 

http://lime-technology.com/forum/index.php?topic=43396.0
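For context, the underlying PAR2 tooling works roughly like this from the command line (the redundancy percentage and filenames are arbitrary examples):

# Create recovery data with 10% redundancy alongside the file
par2 create -r10 important.mkv

# Later: verify the file, and repair it from the .par2 data if corruption is found
par2 verify important.mkv.par2
par2 repair important.mkv.par2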

 

EDIT: Bwahaha, I see you are already well versed in Checksum Suite. Please excuse my ignorance :)

 

 

Link to comment

Hi!

Coming from my topic

http://lime-technology.com/forum/index.php?topic=44858

with minimal experience of anything other than Windows!

 

Looking for exactly this: the BTRFS self-healing feature.

 

One easy (I believe) solution would be to add the option for a "Safe Share" with a specified amount of space (with options to shrink/expand), implemented as a virtual RAID1 of two folders, either on the same disk or on another.

That way, we would be able to double-store only the most important stuff and get bit-rot protection plus disk failure protection.

 

On the other hand, this could also be done with other tools, like par2 with 100% redundancy!

 

Just got confused again :D

I think that it's impossible to combine btrfs RAID and unRAID's RAID.

Link to comment
