RedCow

Mover (Cache->Array) Error checking


I use TeraCopy to move files from drive<->drive / machine<->machine with "Always Verify" enabled.  This means that after a copy or move, all of the written data gets read back and verified before any files are removed from the source disk/machine.  You would be surprised how many times this has saved my ass due to inconsistencies discovered -- I have moved hundreds of TBs of data.  There have been instances where a write to a hard disk finished happily, but on verification the CRC was different, with no errors reported anywhere.
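To make it concrete, the idea behind "Always Verify" is: copy, flush, re-read the destination from disk, compare hashes, and only then delete the source.  A rough Python sketch of that idea (this is not TeraCopy's actual code, just an illustration):

```python
import hashlib
import os
import shutil

def sha256_of(path, chunk=1024 * 1024):
    """Hash a file by streaming it back from disk in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def verified_move(src, dst):
    """Copy src to dst, re-read dst, and delete src only if the hashes match."""
    src_hash = sha256_of(src)
    shutil.copy2(src, dst)
    with open(dst, "rb") as f:
        os.fsync(f.fileno())  # push the written data out of the OS write cache
    # Note: the re-read may still be served from the page cache; real verify
    # tools go further (drop caches / direct I/O), this only shows the principle.
    if sha256_of(dst) != src_hash:
        raise IOError(f"Verification failed for {dst}; keeping the source.")
    os.remove(src)
```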

 

This leads me to my concern: I currently have no cache in my unRAID setup, so I can be sure the files TeraCopy verifies are the ones actually written to the array disks.  However, if I add a cache pool, that verification essentially becomes worthless unless the unRAID mover, which moves files from cache to array, performs the same kind of verification.

 

Does the unRAID mover perform CRC or hash checking when moving files from Cache to the Array?  If not... this seems like a critical issue for file integrity.

55 minutes ago, RedCow said:

Does the unRAID mover perform CRC or hash checking when moving files from Cache to the Array?

No, but

56 minutes ago, RedCow said:

how many times this has saved my ass due to inconsistencies discovered -- I have moved 100's of TBs of data. 

I've also moved hundreds of TBs and never had a single checksum error.  I don't use TeraCopy, but I create checksums with corz while the data is still on my desktop and from time to time run a complete check on the server -- all checksums still match.
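The workflow is nothing fancy: generate checksums once while the data is still on the desktop, then re-verify them on the server whenever you like.  Something along these lines (corz writes its own .hash files; this Python sketch just illustrates the two steps):

```python
import hashlib
from pathlib import Path

def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            h.update(block)
    return h.hexdigest()

def create_manifest(folder, manifest="checksums.sha256"):
    """Step 1, on the desktop: record a hash for every file under folder."""
    folder = Path(folder)
    with open(folder / manifest, "w") as out:
        for p in sorted(folder.rglob("*")):
            if p.is_file() and p.name != manifest:
                out.write(f"{file_hash(p)}  {p.relative_to(folder)}\n")

def verify_manifest(folder, manifest="checksums.sha256"):
    """Step 2, later on the server: return any files whose hash changed."""
    folder = Path(folder)
    bad = []
    with open(folder / manifest) as f:
        for line in f:
            expected, name = line.rstrip("\n").split("  ", 1)
            if file_hash(folder / name) != expected:
                bad.append(name)
    return bad
```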

 


I have also had many instances of drives that happily write, but when you read back, that's when you find out your drive is going bad.  You hit a bad sector, the read grinds to a halt, your SMART pending sector count goes up, and if you've already deleted the source file, your data is gone.

 

I understand we're writing to an array with parity, but I think an option for the mover to perform read-back verification before deleting anything off the cache is pretty important...

 

 


Right, what I'm saying is that you don't find out if your data is safe until you read it back.  Many times hard disks will write without knowing/reporting any issues, and the problems don't crop up until you actually try to read the data back.  This, IMO, makes it a pretty bad design choice for data integrity that the mover doesn't verify that what it has moved is readable.

1 minute ago, RedCow said:

Many times hard disks will write without knowing/reporting any issues and they don't crop up until you actually try to read it back.

You are correct that disks usually don't error on writes; they mostly error when reading the data back.  But that would be handled by unRAID's redundancy (assuming you have parity).  If the mover had to read back everything it writes, it would take twice as long, and the disks can still fail at any time afterwards, so it would be no guarantee that you could still read the same data tomorrow.


Yes, if the drive errors on reading the data back, you know which drive is the problem, and that drive's data can be rebuilt from parity (I run dual parity).  However, the silent type of data error would not be recoverable, since you would not know which drive(s) hold invalid data -- all a parity check/verification would tell you is that something may be wrong, without being able to tell which bit from which drive is trustworthy.
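To illustrate why parity alone can't pinpoint it, here's a toy single-parity example (real unRAID parity is bit-level XOR across all data disks; this is just the concept):

```python
# Three "data disks" each holding one byte, plus an XOR parity byte.
data = [0b10110010, 0b01101100, 0b11110000]
parity = data[0] ^ data[1] ^ data[2]

# Disk 1 silently flips a bit but reports no read error.
data[1] ^= 0b00000100

# A parity check notices that something is wrong...
assert (data[0] ^ data[1] ^ data[2]) != parity

# ...but the same mismatch could equally be explained by a flipped bit on any
# of the three data disks (or on the parity disk itself), so without a
# per-file checksum you can't tell which copy is the trustworthy one.
print("parity mismatch detected, but the bad disk cannot be identified")
```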

 

I am very surprised to hear you handle large quantities of data without encountering this silent data corruption -- I have run into it myself a fair few times: video files developing corruption out of nowhere, zip and 7z archives that no longer extract at all, and others.  Regardless of your and my personal experience, silent data corruption is a known and studied phenomenon -- it's not something I'm making up.  If you have any doubt, read the details about "silent data corruption" here: https://queue.acm.org/detail.cfm?id=1866298

 

Maybe you value your write speeds, but I would gladly give up 50% of my write speed (or even 75%) to get read-back verification on the cache mover in unRAID.  It would be a checkbox and optional anyway, so you could simply choose not to use it.  Also keep in mind that you wouldn't lose any performance as long as you're not writing more than your cache pool can hold -- the mover would just take some extra time in the middle of the night to move the files.

 

I would really like to see someone from unRAID staff chime in on this -- I want this on the roadmap, or a good reason why it's not needed (which I don't believe exists).  I'm assuming the mover is some kind of simple application like Linux cp or Windows xcopy (which has a /v verify flag); this shouldn't be too hard to implement.
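For the sake of argument, here's roughly what I imagine the option doing, as a hypothetical Python sketch -- the real mover is different, and the /mnt/cache and /mnt/user0 paths are just my assumption about where the shares live: walk the cache, copy each file to the array, read the array copy back, and only delete from the cache on a match.

```python
import hashlib
import shutil
from pathlib import Path

# Hypothetical locations -- this is not the actual mover, just the concept.
CACHE = Path("/mnt/cache")
ARRAY = Path("/mnt/user0")  # assumed array-only view of the user shares

def sha256(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def move_share_with_verify(share):
    """Move one share from cache to array, verifying each file before deleting it."""
    for src in (CACHE / share).rglob("*"):
        if not src.is_file():
            continue
        dst = ARRAY / share / src.relative_to(CACHE / share)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        # Read the array copy back and compare before touching the cache copy.
        if sha256(dst) == sha256(src):
            src.unlink()
        else:
            print(f"verify failed, keeping {src} on the cache")
```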

RedCow said:

the silent type of data error would not be recoverable since you would not know which drive(s) have invalid data

You are talking about what's usually called bit rot, and while that exists it's extremely rare.  All disks have ECC checking and should never return bad data: they either return good data or give a read error, like UNC @ LBA.  There are exceptions, due to bad firmware or onboard DRAM going bad, but as mentioned these are extremely rare and in no way a possibility at the frequency you describe.  You likely have a serious hardware problem -- unless you're using ECC RAM, that would be the prime suspect, but it could also be a bad controller, etc.  It's certainly not the disks, unless maybe it was always the same disk doing it.

 

RedCow said:

I am very surprised to hear you handle large quantities of data without encountering this silent data corruption

I've encountered it once, with a SanDisk SSD that silently corrupted data after a sector remapping.  I have over 200 disks in my servers; besides the manual checksums, all my disks are formatted with btrfs or ZFS, which have their own checksums and can detect silent data corruption on every read or with a scrub, and as mentioned it has never happened (except for that SSD, and that corruption was immediately caught by btrfs).
