RedCow

Mover (Cache->Array) Error checking


I use TeraCopy to move files from drive<->drive / machine<->machine with "Always Verify" enabled.  This means that after a copy or move, all of the written data gets read back and verified before any files are removed from the source disk/machine.  You would be surprised how many times this has saved my ass due to inconsistencies discovered -- I have moved hundreds of TBs of data.  There have been instances where a write to a hard disk finished happily, but on verification the CRC was different, with no errors reported anywhere.
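To make it concrete, the idea behind "Always Verify" is: copy, flush, re-read the destination from disk, compare hashes, and only then delete the source.  A rough Python sketch of that idea (this is not TeraCopy's actual code, just an illustration):

```python
import hashlib
import os
import shutil

def sha256_of(path, chunk=1024 * 1024):
    """Hash a file by streaming it back from disk in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def verified_move(src, dst):
    """Copy src to dst, re-read dst, and delete src only if the hashes match."""
    src_hash = sha256_of(src)
    shutil.copy2(src, dst)
    with open(dst, "rb") as f:
        os.fsync(f.fileno())  # push the written data out of the OS write cache
    # Note: the re-read may still be served from the page cache; real verify
    # tools go further (drop caches / direct I/O), this only shows the principle.
    if sha256_of(dst) != src_hash:
        raise IOError(f"Verification failed for {dst}; keeping the source.")
    os.remove(src)
```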

 

This leads me to my concern: I currently have no cache in my unRAID setup, so I can be sure the files TeraCopy verifies are the ones actually written to the array disks.  However, if I add a cache pool, that verification essentially becomes worthless unless the unRAID mover, which moves files from cache to array, performs the same kind of verification.

 

Does the unRAID mover perform CRC or hash checking when moving files from Cache to the Array?  If not... this seems like a critical issue for file integrity.

55 minutes ago, RedCow said:

Does the unRAID mover perform CRC or hash checking when moving files from Cache to the Array?

No, but

56 minutes ago, RedCow said:

how many times this has saved my ass due to inconsistencies discovered -- I have moved 100's of TBs of data. 

I've also moved hundreds of TBs and never had a single checksum error.  I don't use TeraCopy, but I create checksums with corz while the data is still on my desktop and from time to time run a complete check on the server -- all checksums still match.
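The workflow is nothing fancy: generate checksums once while the data is still on the desktop, then re-verify them on the server whenever you like.  Something along these lines (corz writes its own .hash files; this Python sketch just illustrates the two steps):

```python
import hashlib
from pathlib import Path

def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            h.update(block)
    return h.hexdigest()

def create_manifest(folder, manifest="checksums.sha256"):
    """Step 1, on the desktop: record a hash for every file under folder."""
    folder = Path(folder)
    with open(folder / manifest, "w") as out:
        for p in sorted(folder.rglob("*")):
            if p.is_file() and p.name != manifest:
                out.write(f"{file_hash(p)}  {p.relative_to(folder)}\n")

def verify_manifest(folder, manifest="checksums.sha256"):
    """Step 2, later on the server: return any files whose hash changed."""
    folder = Path(folder)
    bad = []
    with open(folder / manifest) as f:
        for line in f:
            expected, name = line.rstrip("\n").split("  ", 1)
            if file_hash(folder / name) != expected:
                bad.append(name)
    return bad
```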

 


I have also had many instances of drives that happily write, but when you read back, that's when you find out your drive is going bad.  You hit a bad sector, the read grinds to a halt, your SMART pending sector count goes up, and if you've already deleted the source file, your data is gone.

 

I understand we're writing to an array with parity, but I think an option for the mover to perform read-back verification before deleting anything off the cache is pretty important...

 

 


Right, what I'm saying is that you don't find out if your data is safe until you read it back.  Many times hard disks will write without knowing/reporting any issues, and the problems don't crop up until you actually try to read the data back.  This, IMO, makes it a pretty bad design choice for data integrity that the mover doesn't verify that what it has moved is readable.

1 minute ago, RedCow said:

Many times hard disks will write without knowing/reporting any issues and they don't crop up until you actually try to read it back.

You are correct that disks usually don't error on writes; they mostly error when reading the data back.  But that would be handled by unRAID's redundancy (assuming you have parity).  If the mover had to read back everything it writes, it would take twice as long, and the disks can still fail at any time afterwards, so it would be no guarantee that you could still read the same data tomorrow.


Yes, if the drive errors on reading the data back, you know which drive is the problem, and that drive's data can be rebuilt from parity (I run dual parity).  However, the silent type of data error would not be recoverable, since you would not know which drive(s) hold invalid data -- all a parity check/verification would tell you is that something may be wrong, without being able to tell which bit from which drive is trustworthy.
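To illustrate why parity alone can't pinpoint it, here's a toy single-parity example (real unRAID parity is bit-level XOR across all data disks; this is just the concept):

```python
# Three "data disks" each holding one byte, plus an XOR parity byte.
data = [0b10110010, 0b01101100, 0b11110000]
parity = data[0] ^ data[1] ^ data[2]

# Disk 1 silently flips a bit but reports no read error.
data[1] ^= 0b00000100

# A parity check notices that something is wrong...
assert (data[0] ^ data[1] ^ data[2]) != parity

# ...but the same mismatch could equally be explained by a flipped bit on any
# of the three data disks (or on the parity disk itself), so without a
# per-file checksum you can't tell which copy is the trustworthy one.
print("parity mismatch detected, but the bad disk cannot be identified")
```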

 

I am very surprised to hear you handle large quantities of data without encountering this silent data corruption -- I have run into it myself a fair few times: video files developing corruption out of nowhere, zip and 7z archives that no longer extract at all, and others.  Regardless of your and my personal experience, silent data corruption is a known and studied phenomenon -- it's not something I'm making up.  If you have any doubt, read the details about "silent data corruption" here: https://queue.acm.org/detail.cfm?id=1866298

 

Maybe you value your write speeds, but I would gladly give up 50% of my write speed (or even 75%) to get read-back verification on the cache mover in unRAID.  It would be a checkbox and optional anyway, so you could simply choose not to use it.  Also keep in mind that you wouldn't lose any performance as long as you're not writing more than your cache pool can hold -- the mover would just take some extra time in the middle of the night to move the files.

 

I would really like to see someone from unRAID staff chime in on this -- I want this on the roadmap, or a good reason why it's not needed (which I don't believe exists).  I'm assuming the mover is some kind of simple application like Linux cp or Windows xcopy (which has a /v verify flag); this shouldn't be too hard to implement.
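For the sake of argument, here's roughly what I imagine the option doing, as a hypothetical Python sketch -- the real mover is different, and the /mnt/cache and /mnt/user0 paths are just my assumption about where the shares live: walk the cache, copy each file to the array, read the array copy back, and only delete from the cache on a match.

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical locations -- this is not the actual mover, just the concept.
CACHE = Path("/mnt/cache")
ARRAY = Path("/mnt/user0")  # assumed array-only view of the user shares

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def move_share_with_verify(share):
    """Move one share from cache to array, verifying each file before deleting it."""
    for src in (CACHE / share).rglob("*"):
        if not src.is_file():
            continue
        dst = ARRAY / share / src.relative_to(CACHE / share)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        # Read the array copy back and compare before touching the cache copy.
        if sha256(dst) == sha256(src):
            src.unlink()
        else:
            print(f"verify failed, keeping {src} on the cache")
```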

RedCow said:

the silent type of data error would not be recoverable since you would not know which drive(s) have invalid data

You are talking about what's usually called bit rot, and while that exists it's extremely rare.  All disks have ECC checking and should never return bad data: they either return good data or give a read error, like UNC @ LBA.  There are exceptions, due to bad firmware or onboard DRAM going bad, but as mentioned these are extremely rare and in no way a possibility at the frequency you describe.  You likely have a serious hardware problem -- unless you're using ECC RAM, that would be the prime suspect, but it could also be a bad controller, etc.  It's certainly not the disks, unless maybe it was always the same disk doing it.

 

RedCow said:

I am very surprised to hear you handle large quantities of data without encountering this silent data corruption

I've encountered it once, with a SanDisk SSD that silently corrupted data after a sector remapping.  I have over 200 disks in my servers; besides the manual checksums, all my disks are formatted with btrfs or ZFS, which have their own checksums and can detect silent data corruption on every read or with a scrub, and as mentioned it has never happened (except for that SSD, and that corruption was immediately caught by btrfs).
