
How can I detect which files are bad during a read error in a rebuild?


sincero

Recommended Posts

I want to begin adding more disks to my array (currently 3TB x 3, one parity). I'm thinking of adding another 3TB disk, or maybe swapping to 4TB drives soon. My question: if I hit an unrecoverable read error (URE), my understanding is that the affected block will be skipped. In that event, how should I set up unRAID ahead of time so that I can find out which file was affected and restore a valid copy from a backup?

 

I don't like the idea of some random silent corruption, but I don't like the idea of being forced to roll back everything for a single URE, either. From what I understand, btrfs may be able to help me here.

 

Thanks!

Link to comment

Not sure what your question is. If you have an unrecoverable read error from a disk, UnRAID will reconstruct the data from the other disks plus parity, and will attempt to write the data back to the disk that failed. If the write fails, the disk will be "red-balled" (marked as bad and not used by UnRAID). Then you can simply replace the failed disk and UnRAID will rebuild the data for that disk on its replacement.

 

If you're asking how you can determine which file may be corrupted if you encounter a parity error during a parity check, the answer is simply that you can't ... at least not directly. The best way to ensure you can always check your files for validity is to maintain checksums of all your files. Many of us maintain MD5s of all of our files for exactly that reason.
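For example, a minimal way to build such a manifest with standard tools (the helper name and paths are illustrative, not an unRAID feature):

```shell
# Hypothetical helper: write an MD5 manifest of every file under the
# directory $1 into the file $2.  Paths in the manifest are relative
# to $1, so it can later be verified with `md5sum -c` from there.
make_manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + ) > "$2"
}

# Typical unRAID usage (example paths):
#   make_manifest /mnt/disk1 /boot/checksums/disk1.md5
```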

 


... Looking at this again, the title of the question makes it clear that what you're asking is which files might be incorrectly rebuilt if UnRAID encounters an error during a rebuild. Again, there's no direct way to do this; but if you've maintained MD5s (or other checksums) of your files, then you can run a validation of the checksums on the rebuilt disk and determine which file(s) was/were corrupted in the rebuild.
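As a sketch, verifying a saved manifest against the rebuilt disk and listing only the files that no longer match (helper name and paths are illustrative):

```shell
# Hypothetical helper: check the manifest $2 against the files under $1.
# With --quiet, md5sum prints a line only for files that FAIL, so the
# output is exactly the list of files corrupted in the rebuild.
check_manifest() {
    ( cd "$1" && md5sum --quiet -c "$2" )
}

# Typical usage after a rebuild (example paths):
#   check_manifest /mnt/disk1 /boot/checksums/disk1.md5
```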

 

 


Hi,

 

MD5 looks like a decent way to do this. Is there no way to do this with BTRFS? It seems to have checksum support built in.

I believe that BTRFS should automatically handle bitrot (which is one of its plus points).    However I am not sure that many users are yet convinced it is mature and stable enough to be used for all your data drives.
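btrfs does verify its built-in checksums, and you can ask it to check the whole filesystem on demand with a scrub. A rough sketch (the mount point is an example; on a single device a scrub detects checksum errors but cannot repair them):

```shell
# Walk the whole filesystem and verify every data/metadata checksum.
# -B keeps the scrub in the foreground so the exit status is meaningful.
btrfs scrub start -B /mnt/cache

# Summary of any checksum errors found during the scrub.
btrfs scrub status /mnt/cache

# The kernel log names the affected files on checksum failures.
dmesg | grep -i 'csum failed'
```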


Yup, I'm not convinced it's stable or mature. I have never seen anyone fix a corruption issue on BTRFS other than deleting everything and starting over, and a lot of people are having issues with it.


Agree. BTRFS has potential, but it's immature and needs to evolve a good bit before I'd trust it as the primary file system you'd want to use for UnRAID. Its RAID-1 implementation seems to work just fine, so it's a reasonable choice for the file system in a cache pool; but I would not select it for your array drives. I know a few have, but I think it's a bit early in its evolution for that to be a good choice.

 


Yup, I'm not convinced it's stable or mature. I have never seen anyone fix a corruption issue on BTRFS other than deleting everything and starting over, and a lot of people are having issues with it.

Well, I just recovered from a non-mountable btrfs cache drive, but it took me the better part of a day to pull off, and I've been an IT guy by trade since '90. The thing that ticked me off the most was that the corruption was caused by something that reiserfs just shrugs off most of the time: two transactions in the journal were out of sync because of an unclean shutdown. There was no way I could find to replay those transactions and get back in sync, so I was forced to purge the journal records to enable the drive to mount. NONE of the rescue mount commands that I found would work; they would crash the kernel and force me to hard reset the machine each time I tried. I was able to get a mostly clean backup using the btrfs restore command, and after I had my backup I zeroed the log and was able to get the drive to mount.
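For reference, the sequence described here roughly corresponds to these commands. The device and paths are placeholders, and rescue mount option names have varied across kernel versions, so treat this as a sketch rather than a recipe:

```shell
# 1. Try a read-only rescue mount first (older kernels spell this
#    -o recovery; newer ones -o rescue=usebackuproot).
mount -o ro,usebackuproot /dev/sdX1 /mnt/tmp

# 2. If no mount works, copy whatever is readable off the unmounted
#    device to a safe location on the array.
btrfs restore -v /dev/sdX1 /mnt/disk1/rescued/

# 3. Last resort, only after the backup: discard the journal so the
#    filesystem can mount again (this loses the unsynced transactions).
btrfs rescue zero-log /dev/sdX1

# 4. The drive should now mount normally.
mount /dev/sdX1 /mnt/cache
```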

 

Shame on me for not having a current backup of the VM's that were running on the drive, forcing me to go through a day of hell trying to recover data.

 

I'm counting this incident as a definite negative for btrfs, even though I was able to recover. I'm seriously reconsidering using it, even for my SSD cache.


... I'm counting this incident as a definite negative for btrfs, even though I was able to recover. I'm seriously reconsidering using it, even for my SSD cache.

 

Definitely not a good story ... but it certainly reinforces the immaturity of btrfs. The protected cache pool is a really nice feature ... but this makes me wonder if a better choice is to just use a RAID card and a RAID-1 array with XFS. A 4-port RAID card would allow plenty of capacity ... and the only "downside" is the drives would have to be the same size (which isn't a big deal).

 



 

I'm counting this incident as a definite negative for btrfs, even though I was able to recover. I'm seriously reconsidering using it, even for my SSD cache.

I'd be curious about the exact steps you performed to determine you had corruption, what files were corrupted, and what steps you took to attempt repairs (in order). I've never lost data with btrfs in my use cases — both array and cache, SSD and HDD — but everyone's experience may vary depending on what they are trying to accomplish and what hardware they are using.

 

OpenSUSE moved to btrfs as its default filesystem, and there are plenty of other discussions posing the question of when others will follow suit. Btrfs isn't as immature as some believe, but there are definitely features that are further along in development than others. Examples of things still undergoing lots of changes include higher RAID levels and subvolume quotas.

 

There is also a need to apply NOCOW to directories on btrfs that will store virtual disk images. This is why we added the "enable copy on write" setting to user shares.
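A sketch of how that is typically done by hand (the directory path is an example; note that the NOCOW attribute must be set on the directory before image files are created in it):

```shell
# Create the directory that will hold vdisk images and mark it NOCOW.
mkdir -p /mnt/cache/domains
chattr +C /mnt/cache/domains

# Files created in the directory from now on inherit the attribute;
# existing files are NOT converted.
lsattr -d /mnt/cache/domains   # the 'C' flag should be listed
```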

 

I'm counting this incident as a definite negative for btrfs, even though I was able to recover. I'm seriously reconsidering using it, even for my SSD cache.
I'd be curious on the exact steps you performed to determine you had corruption, what files were corrupted, and what steps you took to attempt repairs (in order).
Well, this is a little fuzzy, my memory isn't as good as it used to be, but here goes.

 

SABnzbd was downloading and unraring Dukes of Hazzard to the cache drive at ~7Mb/s, and I was watching Miami Vice on Emby through my Debian HTPC. Emby quit streaming, so I logged into the server to see what was going on, and the CPU load was at 20 and climbing fast. I tried to stop unnecessary services to lessen the load and allow things to process through, but nothing stopped the upward climb. My W7 home automation VM was getting extremely laggy, so I attempted to shut it down cleanly, but it just hung. Telnet and top were still responding, and the load was still climbing (40+), so I left it overnight to see if things would straighten out on their own. Next morning nothing would respond, even the local console was dead, so I rebooted.

 

Unraid booted up fine and started a parity check, but the cache drive came up unmountable. The syslog had an entry about trying to mount the btrfs volume but failing due to transid errors, and there was a crash reported on the local console. Any further mount attempts would just hang with no response, and I couldn't stop the array.

 

Restarted in maintenance mode, attempted a recovery mount, crash, repeat. Tried MANY different mount options; all resulted in a crash.

 

Unassigned the cache drive so I could start the rest of the array without mounting (and crashing) the btrfs volume, and used btrfs restore, which worked to copy at least some of the data to an array location.

 

Issued a zero-log command, and after that I could mount the drive without crashing.

 

Reassigned the drive as cache, started the array, and attempted to start the VM. The W7 VM had boot errors, but I was able to do the normal W7 recovery steps to get it working again.


Archived

This topic is now archived and is closed to further replies.
