Dealing with silent corruption (aka bitrot)



Hi,

First of all, I want to thank everyone in this community working on making unRAID such an awesome product. I had a disk fail on me 2 days ago and swapped it for a brand new one with almost no downtime at all. The GUI made me feel safe during the whole process of rebuilding the disk from parity, and I really appreciate knowing my data will be very safe in the future (with backups too, just in case ;-)).

 

My question is regarding silent data corruption, also known as bit-rot. I've read a little about it on the unRAID forums and on r/unraid on Reddit, and I noticed the Dynamix File Integrity plugin.

 

 

However, while this plugin, when well configured, will help me detect any potential issue, it doesn't fix anything. Two years ago, someone asked a similar question (https://www.reddit.com/r/DataHoarder/comments/537ys9/unraid_users_how_do_you_handle_disk_encryption/), and I see comments saying the devs at LT were considering adding some sort of protection against bit-rot, because if well integrated it could be fixed on-the-fly thanks to the parity disks. Years later, I don't see many discussions about this, and I'm sure it's an important subject to a lot of us data hoarders.

 

Am I missing something, or do we still have to address this issue completely manually in 2019 (i.e. by restoring backups when it happens and hoping the backup isn't corrupted and is up-to-date enough)? If there is anything about this coming soon in unRAID that I'm unaware of, I'd definitely like to hear about it!

 

Edited by dnLL
19 hours ago, dnLL said:

some sort of protection against bit-rot, because if well integrated it could be fixed on-the-fly thanks to the parity disks.

Are you sure you got this right?  One of the greatest strengths of unRAID (in my opinion) is the fact that each disk is a separate file system.  So, as I'm sure you've heard before, if more disks fail than you have parity drives, you can still recover data from the remaining disks since they're each an independent file system.  That's unlike a RAID5 array, for example, which stripes data across multiple disks: lose too many, and all your data is toast.

 

But unRAID's independent disks also mean there can be no "automatic" file integrity fix.  There is no overarching file system that can locate redundant copies of corrupted files (like it does in BTRFS, for example).  So it looks like you can either have one, or the other.  I'm no expert on file systems but given the devs' silence on most things related to BTRFS and file integrity, I doubt anything like this is in the pipeline.  I'd be happy to be proven wrong, of course 😉

Edited by servidude

If you are truly concerned about data corruption, you should have a MB with ECC memory installed.  Many of us feel that most 'bit-rot' problems are actually memory issues (e.g., bit-flipping due to cosmic radiation, static discharge, or power surges), as modern hard drives are designed to detect and correct any disk errors on-the-fly as they occur.  (If the HD cannot correct the issue, it reports a read error.)

Just now, Frank1940 said:

If you are truly concerned about data corruption, you should have a MB with ECC memory installed.  Many of us feel that most 'bit-rot' problems are actually memory issues (e.g., bit-flipping due to cosmic radiation, static discharge, or power surges), as modern hard drives are designed to detect and correct any disk errors on-the-fly as they occur.  (If the HD cannot correct the issue, it reports a read error.)

Good to know. I do have ECC memory installed. What is the purpose of the file integrity plugin then if it just shouldn't happen anymore?

6 hours ago, servidude said:

But unRAID's independent disks also mean there can be no "automatic" file integrity fix. 

I'm no expert (so no, I'm not sure about this) but from my understanding, since you have at least 1 parity disk which has each bit set to the sum of the corresponding bits on all the other disks, if one disk is corrupted, some bits won't add up anymore, thus creating parity check issues. I thought it would be easy to repair then. But I'm probably wrong, I just need someone with better knowledge about this to explain it.

 

With RAID5 it's different: the parity is striped across all disks. So if you lose a disk, you definitely lose some data and some parity, but you can recover with the striped parity and data from the other disks.
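To make the parity idea concrete, here is a minimal Python sketch of single parity as a byte-wise XOR (the "sum" above, taken modulo 2). The disk contents and the helper name `xor_parity` are made up for illustration; this is not unRAID's actual code, only the general technique. It shows the easy case: a disk has failed outright, so we know which one to rebuild.

```python
from functools import reduce

def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, four bytes each (illustrative values only).
disks = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(disks)

# Disk 1 dies completely, so we KNOW which disk is missing.
survivors = [disks[0], disks[2]]

# XOR of the surviving disks and the parity gives back the missing disk.
rebuilt = xor_parity(survivors + [parity])
print(rebuilt == disks[1])  # True
```

The rebuild only works because the failed disk is known in advance; silent corruption does not announce which disk it happened on.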

Edited by dnLL
2 minutes ago, dnLL said:

I'm no expert (so no, I'm not sure about this) but from my understanding, since you have at least 1 parity disk which has each bit set to the sum of the corresponding bits on all the other disks, if one disk is corrupted, some bits won't add up anymore, thus creating parity check issues. I thought it would be easy to repair then. But I'm probably wrong, I just need someone with better knowledge about this to explain it.

It is easy to detect that something is wrong. What is NOT possible with parity is detecting which drive is the faulty one (and thus which one to change to correct the error).
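A small Python sketch of why that is, assuming single XOR parity and made-up values for one byte position on three data disks (the names here are hypothetical, not anything from unRAID). After a silent bit flip, the parity check fails, but changing any one of the devices makes the check pass again, so the check alone cannot say which change is the right one.

```python
from functools import reduce

def xor_all(values):
    """XOR a list of integers together."""
    return reduce(lambda a, b: a ^ b, values)

# One byte position on three hypothetical data disks, plus its parity.
data = [0b1010, 0b0110, 0b0011]
parity = xor_all(data)

# Silent corruption: one bit flips on disk 1 and no read error is reported.
corrupted = list(data)
corrupted[1] ^= 0b0100

# A nonzero syndrome means the parity check fails: detection works.
syndrome = xor_all(corrupted) ^ parity
print(syndrome)  # 4

# "Repairing" ANY single data disk with the syndrome satisfies parity again.
candidates = []
for i in range(len(corrupted)):
    trial = list(corrupted)
    trial[i] ^= syndrome
    candidates.append(trial)

print(all(xor_all(t) == parity for t in candidates))  # True
# Only one of those candidates is the original data.
print(sum(t == data for t in candidates))  # 1
```

Rewriting the parity itself (`parity ^ syndrome`) would also make the check pass, which is yet another consistent but possibly wrong "fix".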

Just now, itimpi said:

It is easy to detect that something is wrong. What is NOT possible with parity is detecting which drive is the faulty one (and thus which one to change to correct the error).

So what advantage does the plugin provide over a simple parity check overall other than performance-related issues?

 

Does bitrot still happen in modern systems, or is it something that should be disregarded altogether? The reason I started worrying about it is that my friend recently lost some JPG files to it and had to recover them from a backup, and he's been using unRAID for a while. He does have a lot of old hard drives, however, and old hardware in general, so that may be an issue I'm not really facing as much.

Just now, dnLL said:

So what advantage does the plugin provide over a simple parity check overall other than performance-related issues?

 

Does bitrot still happen in modern systems, or is it something that should be disregarded altogether? The reason I started worrying about it is that my friend recently lost some JPG files to it and had to recover them from a backup, and he's been using unRAID for a while. He does have a lot of old hard drives, however, and old hardware in general, so that may be an issue I'm not really facing as much.

The file integrity plugin will allow you to pinpoint exactly which file has the problem. It also allows you to do it quickly, without doing binary compares of files. Note that corruption can occur for a wide variety of reasons. This is different from ‘bitrot’, which is a special case where a drive reads data back incorrectly but does not realise the read is bad.
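The general idea of per-file checking can be sketched in Python as follows. This is only an illustration of the technique (store a hash per file, re-hash later, compare), not how the Dynamix plugin is actually implemented; SHA-256, the function names, and the in-memory manifest are all assumptions made for the example.

```python
import hashlib
import os
import tempfile

def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its current digest."""
    manifest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, root)] = file_digest(path)
    return manifest

def find_changed(root, manifest):
    """Return the files whose current digest differs from the stored one."""
    return sorted(rel for rel, digest in manifest.items()
                  if file_digest(os.path.join(root, rel)) != digest)

# Demo on a throwaway directory with two made-up files.
with tempfile.TemporaryDirectory() as root:
    for name, payload in [("a.jpg", b"holiday photo"), ("b.zip", b"archive")]:
        with open(os.path.join(root, name), "wb") as f:
            f.write(payload)
    manifest = build_manifest(root)

    # Simulate corruption of a single byte in one file.
    with open(os.path.join(root, "a.jpg"), "wb") as f:
        f.write(b"holiday phoTo")
    damaged = find_changed(root, manifest)

print(damaged)  # ['a.jpg']
```

Unlike a parity check, this names the affected file, so you know exactly what to restore from backup. It still only detects the problem; it cannot repair it.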

9 hours ago, dnLL said:

So what advantage does the plugin provide over a simple parity check overall other than performance-related issues? 

As @itimpi points out, the File Integrity plugin checks on a per-file basis.  It gets confused (understandably) by constantly changing files like docker.img, so be sure to exclude those from your checks if you don't want a bunch of false positives.

Interestingly, if you use BTRFS for your data disks, you can do a "btrfs scrub" from the web interface, which will provide the same functionality (detect, but not fix, corruption).

 

9 hours ago, dnLL said:

Does bitrot still happen in modern systems, or is it something that should be disregarded altogether? The reason I started worrying about it is that my friend recently lost some JPG files to it and had to recover them from a backup, and he's been using unRAID for a while. He does have a lot of old hard drives, however, and old hardware in general, so that may be an issue I'm not really facing as much.

Good question!  I think the concept of bitrot keeps many of us data hoarders awake at night. 😄  I don't think it's impossible, but I also think it's pretty rare.  Like your friend I have encountered corrupted files before (it's easy to tell with JPG, ZIP etc. - who knows, there might be more) but my hardware (flaky motherboard) and OS (Windows 98 😂) were probably to blame for that.

Either way, nothing can replace backups. Even if you're using something bulletproof like ZFS, you will need them!

  • 9 months later...
On 6/15/2019 at 7:11 AM, Frank1940 said:

If you are truly concerned about data corruption, you should have a MB with ECC memory installed.  Many of us feel that most 'bit-rot' problems are actually memory issues (e.g., bit-flipping due to cosmic radiation, static discharge, or power surges), as modern hard drives are designed to detect and correct any disk errors on-the-fly as they occur.  (If the HD cannot correct the issue, it reports a read error.)

I had bitrot issues with unRAID using ECC (Supermicro server setup). I've since moved to ZFS and get bitrot/corruption issues reasonably regularly due to old disks (same hardware). I feel that you really need bitrot detection in the FS... I'd like to move back to unRAID if they can crack the online bitrot and recovery issue.

