Checksum Suite



Good points.  So one plan might be to have checksums and backups for files of value (your 'critical' files, photos, music, financials and docs, family videos, system images and backups), and par2 for those of less value and large size (files that can be reacquired or are not a big loss).  I wonder if a cutoff size (perhaps 50MB?) could be set, for users who choose this type of plan, so that all files less than the cutoff would only get checksums, and all files larger would only get par2 processing.  That would make checksumming much faster, no huge files to process, and would also make par2 processing faster, no large numbers of folders and small files to deal with.  The only exception would be personal/family videos, which would get par2 processing, but should also be backed up.
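To make the cutoff idea concrete, here is a minimal sketch of what such a policy could look like, assuming a hypothetical 50MB threshold, a made-up share path, and that the par2 command-line tool is installed; it is not any existing plugin's implementation:

```python
import hashlib
import os
import subprocess

CUTOFF = 50 * 1024 * 1024  # hypothetical 50MB cutoff from the post above

def md5sum(path, chunk=1024 * 1024):
    """Stream the file in chunks so huge files don't need to fit in RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def protect_tree(root):
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) < CUTOFF:
                # Small file: just record a checksum.
                print(md5sum(path), path)
            else:
                # Large file: create ~1% par2 recovery data instead.
                subprocess.run(["par2", "create", "-r1", path], check=True)

protect_tree("/mnt/user/Media")  # made-up path, adjust to your share
```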


... I wonder if a cutoff size (perhaps 50MB?) could be set, for users who choose this type of plan, so that all files less than the cutoff would only get checksums, and all files larger would only get par2 processing ...

 

Interesting idea.


... That would make checksumming much faster, no huge files to process ...

 

I'd think you'd still want MD5's on your large files so you'd at least have a way to KNOW when they'd been modified.
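As a rough illustration, a minimal sketch of that kind of check, assuming the hashes were saved earlier to an md5sum-style text file (the 'checksums.md5' name and its "<hash>  <path>" line format are assumptions, not any particular tool's output):

```python
import hashlib

def md5sum(path, chunk=1024 * 1024):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def verify(manifest="checksums.md5"):
    """A mismatch tells you the file changed -- it can't tell you how to fix it."""
    with open(manifest) as f:
        for line in f:
            expected, path = line.rstrip("\n").split("  ", 1)
            status = "OK" if md5sum(path) == expected else "MODIFIED"
            print(status, path)

verify()
```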

 

I also think that a LOT of folks who think "I'll just re-acquire my media if I lose it" will have a FAR different mindset if/when they actually lose it.    I've seen numerous cases of that on this forum ... and folks quickly come to the conclusion that backups aren't really all that expensive once they've actually lost a few TB (or 10's of TB's) of their data  :)

 

The risk for those who elect to not bother with backups will soon be mitigated a LOT (with dual parity) ... but even that is still not a substitute for backups.

 

 


It seems to me that the problem should be fully constrained under the assumption of bitrot. For example, if just one bit in the file changed and you know it is just one bit, you should be able to recursively modify one bit and recalculate the checksum until it matches the original file. This may not be doable in practice if very similar (but different) files have the same checksum. I forget the term for this situation, but just some food for thought.


... if just one bit in the file changed and you know it is just one bit, you should be able to recursively modify one bit and recalculate the checksum until it matches the original file ...

A movie file of 50 billion bits is not unusual. The expected average of MD5 calculations needed before you hit the correct checksum would be one-half that, or 25 billion MD5 calculations to fix that one file. And of course, the time to calculate a single MD5 of a file would be related to its size in some way. Could be waiting a long time.

... if just one bit in the file changed and you know it is just one bit, you should be able to recursively modify one bit and recalculate the checksum until it matches the original file ...

 

:) :)

 

Assuming you KNOW that only 1 bit has changed in a standard 4.7GB DVD.    That's 4,700,000,000 bytes, or 37,600,000,000 bits.    To recursively check the impact of a single bit change, you'd have to compute the MD5 for EVERY different bit value.    My experience is it takes ~ 30 seconds to calculate the MD5 of a file of that size.  Let's assume you found a spectacular algorithm that could cut that time to ONE second.    That would still take over 1,000 YEARS to find the flipped bit  :)    [37,600,000,000 seconds (at 1 sec/check) /60 = 626,666,666.67 minutes /60 = 10,444,444.44 hours /24 = 435,185.19 days /365 = 1,192.29 years !!]
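For what it's worth, a quick back-of-the-envelope check of that arithmetic, using the optimistic 1-second-per-MD5 assumption from the post:

```python
bits = 4_700_000_000 * 8      # a 4.7GB DVD expressed in bits
seconds_per_md5 = 1           # wildly optimistic assumption from above
years = bits * seconds_per_md5 / 60 / 60 / 24 / 365
print(f"{years:,.0f} years")  # ~1,192 years, and that's the best case
```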

 

AND that assumes you KNOW that only one bit has changed.    It'd be really depressing to wait 1192 years for the result and find out there weren't any matches  8) 8)

 

... and that time is for a single-sided DVD of 4.7GB.    Double-sided DVDs or BluRay discs are FAR larger ... and would take much longer !


Most of my terabytes of information are media files, such as movies and music. I am not going to backup all these files or try to repair them. I use my original DVD/BR disks to restore them if corruption is found.

 

Anyway that is just my way of working, you may completely disagree and do otherwise. :D


... I have no desire to keep backups of terabytes of media files if I can recover them with par2.

 

"... if I can recover them with par2 ..."  ==> The simple fact is you can NOT "recover them with par2."    The error recovery capabilities of Par2 are good ... but just how much you can recover depends on the number of parity blocks you've used and the degree of corruption.    It's certainly better than just using checksums ... but nothing close to an actual backup.

 

Nevertheless, I understand that many are willing to simply take the risk of losing their large media files and re-ripping all of them if they were to lose them.    I could do that as well ... I've got boxes full of DVDs I've acquired over the past 15 years or so (over 3000).    But the time I've spent ripping them; recompressing them; cataloging them; etc. FAR exceeds what I'd ever be willing to do again.    I consider maintaining complete backups cheap insurance.    Most of my backup drives are actually older drives that I've replaced with larger ones and/or have more reallocated sectors than I wanted for actively used drives -- I doubt I've averaged more than $100/year extra to actually buy new backup drives.

 

 

 


... That would still take over 1,000 YEARS to find the flipped bit ... AND that assumes you KNOW that only one bit has changed ...

Fair enough, but do I get points for uniqueness? :)


 

"... if I can recover them with par2 ..."  ==> The simple fact is you can NOT "recover them with par2."    The error recovery capabilities of Par2 are good ... but just how much you can recover depends on the number of parity blocks you've used and the degree of corruption.    It's certainly better than just using checksums ... but nothing close to an actual backup.

 

I wouldn't go so far as to categorically say you can not actually recover from bit rot with par2... as with everything there are trade-offs.

 

I see a few different camps here.  Camp 1 says just re-do all the work you did to get your media onto your system again if it hits the fan...

 

Camp 2 says make 100%-sized backups and checksums and have two copies of everything...

Camp 3 says that literal duplicates of everything is overkill, and that we can achieve a level of protection that is only slightly worse via other means.

 

There are pros and cons to each.  None of these are actually wrong; it's all about what works best for you.

 

 

 


Agree -- every user has to decide what trade-offs they want to make vis-à-vis protecting their data ... just as they decide which of their valuables to insure and which they'd be willing to simply absorb the loss of.

 

With data the options are basically (a) the ability to KNOW that your file(s) have been corrupted ... but no "insurance"  (i.e. just MD5's);  (b)  the ability to know that files have been corrupted, with a limited capability to correct the damage (Par2's);  or (c)  the ability to both know the file has been corrupted (MD5's) AND to restore it if that happens (backups).

 

... and of course with backups, you need to decide just how well you'll be backed up -- locally; off-site; in the cloud; etc.

 


"... if I can recover them with par2 ..."  ==> The simple fact is you can NOT "recover them with par2."  ...  It's certainly better than just using checksums ... but nothing close to an actual backup.

 

I guess the point for me, and maybe a few others too, is that right now a corrupt file means the choice of rebuilding an entire disk (which might not even fix the file), re-ripping / downloading, or restoration from a full backup (which I would rather not maintain in the first place). Unless I misunderstand how par2 works, it adds an option to recover a file that has suffered some level of corruption; for a 1% corruption tolerance it'll only cost 1% of my array, which I believe we've established is probably more than we need for basic bitrot recovery. Anything more than that (head crash, janky cable, power surge during write, etc) is likely to be taken care of by a parity rebuild.

 

Is that not correct?

 

Again, I understand par2 is of no help if the file is deleted or I suffer a catastrophic system loss (fire/flood/theft/etc). But my media files just are NOT important enough for me to set up a geographically distinct backup (because that is the only kind of backup that protects against those), and I'm pretty sure I'll have something more important to worry about at that moment in my life. My critical files are both another story and much smaller in size. For those there are thumbdrives in fire boxes, cloud storage, and if I get paranoid enough, another thumbdrive mailed to another state.

 

I know there are anecdotes here of people who thought they didn't need to backup their media, lost it, and then lamented their decision. Perhaps that will be me someday. But even if it is, I will know in the back of my mind that the option wasn't really within my desired level of effort and that the most important irreplaceable stuff has been backed up.

 

I'll say finally that you have to consider the difference of scale we may be talking about. I don't have movies in the thousands that I would find myself re-ripping. More like under 200, and frankly I don't know that I'd even bother now with Netflix, Amazon Prime, etc. Also, you say a 100% complete backup of your media is "worth it" to you as cheap insurance. I guess I would say that 1% is worth it to me as cheap insurance [shrug]


... I guess the point for me, and maybe a few others too, is that right now a corrupt file means the choice of rebuilding an entire disk (which might not even fix the file), re-ripping / downloading, or restoration from a full backup (which I would rather not maintain in the first place). Unless I misunderstand how par2 works, it adds an option to recover a file that has suffered some level of corruption; for a 1% corruption tolerance it'll only cost 1% of my array, which I believe we've established is probably more than we need for basic bitrot recovery. Anything more than that (head crash, janky cable, power surge during write, etc) is likely to be taken care of by a parity rebuild.

 

Is that not correct?

 

 

I'd say no, it's not correct.

 

e.g.  "... a corrupt file means the choice of rebuilding an entire disk  ..."  => rebuilding a disk isn't likely to fix the file, as most likely the bit errors were propagated to parity -- i.e. you'll just rebuild the same corrupted file.  There ARE cases where that wouldn't be true ... i.e. if it was a case of bit-rot and you had NOT done a parity check since that occurred (which would have "fixed" parity).

 

"... Unless I misunderstand how par2 works, it adds an option to recover a file that has suffered some level of corruption; for a 1% corruption tolerance it'll only cost of 1% of my array "  ==>  What Par2 can correct depends on how many blocks you've saved AND whether the errors are constrained to a single block.    i.e. if you have multiple errors in DIFFERENT blocks then you can't recover ... regardless of what % of the file the errors represent.  It IS definitely better than having no recovery capability at all ... but if you're thinking of it as a replacement for a backup, that's simply not correct.    And, of course, it doesn't help at all with the traditional issues a backup can help you recover from ... accidental deletion; over-writing; virus/malware infection;  failed disk  (in the case of a fault tolerant array, N+1 failed disks, where N = the number of faults the array can tolerate);  etc.

 

"... Anything more than that (head crash, janky cable, power surge during write, etc) is likely to be taken care of by a parity rebuild. " ==>  Providing, of course that ONLY one disk (or two after we get dual parity) is impacted by the event that caused the issue [Likely if it was a head crash;  not so clear that only one drive would be impacted by a power surge.]

 

 

I'll say finally that you have to consider the difference of scale we may be talking about. I don't have movies in the thousands that I would find myself re-ripping. More like under 200, and frankly I don't know that I'd even bother now with Netflix, Amazon Prime, etc. Also, you say a 100% complete backup of your media is "worth it" to you as cheap insurance. I guess I would say that 1% is worth it to me as cheap insurance [shrug]

 

A difference of scale simply means you need less space to back up => e.g. 200 movies, even if they're all 25GB BluRay rips, would all fit on ONE 5TB disk ... these are often on sale for ~ $150 => or if you are upsizing your array you'd likely have that much space on 2 or 3 of the older drives, so the net cost to keep backups is zero.

 

I agree, however, that with modern streaming services a lot of folks will simply use those instead of acquiring more physical media.  My DVD acquisition rate has dropped a LOT in the past couple years since we've been streaming a lot of the movies we watch.  It is, however, nice to be able to watch our collection independent of whether we have a broadband connection => e.g. in a motorhome or at a vacation home.

 

 

A final thought => I completely agree that nobody NEEDS backups ... just as you don't NEED insurance on anything else you're willing to lose.    Many folks don't buy collision coverage on their cars, and some don't even bother with homeowner's insurance.    Just depends on your risk tolerance.    Most of us don't really NEED all of the media we've collected ... so except for the emotional trauma losing it all wouldn't be a big deal.

 


I'd say no, it's not correct.

 

e.g.  "... a corrupt file means the choice of rebuilding an entire disk  ..."  => rebuilding a disk isn't likely to fix the file, as most likely the bit errors were propagated to parity -- i.e. you'll just rebuild the same corrupted file.  There ARE cases where that wouldn't be true ... i.e. if it was a case of bit-rot and you had NOT done a parity check since that occurred (which would have "fixed" parity).

I think that is the point of running regular par2 validation scans and perhaps even just before a parity check. But I'd agree that a rebuild based on parity is not the best way to attempt to fix file corruption since that is not the actual purpose of parity.

 

What Par2 can correct depends on how many blocks you've saved AND whether the errors are constrained to a single block.    i.e. if you have multiple errors in DIFFERENT blocks then you can't recover ... regardless of what % of the file the errors represent.  It IS definitely better than having no recovery capability at all

And that is at least part of the point, better than none at all ... the open question is if it's actually useful. You mention "multiple errors in different blocks", to which I ask:

 

What event will cause that level of corruption that isn't likely to require an actual drive rebuild if not also replacement? Given how rare bitrot really is, how many blocks with multiple errors (multiple means >  % par in this case?) do you expect to happen before we do our next validation scan?

Really I'm trying to better understand this here. Wouldn't a 1% par2 allow for 1% corruption per block?

 

... but if you're thinking of it as a replacement for a backup, that's simply not correct.    And, of course, it doesn't help at all with the traditional issues a backup can help you recover from ... accidental deletion; over-writing; virus/malware infection;  failed disk  (in the case of a fault tolerant array, N+1 failed disks, where N = the number of faults the array can tolerate);  etc.

I believe I mentioned I'm aware of the uselessness of par2 in those situations and that it isn't a backup solution.

 

Providing, of course that ONLY one disk (or two after we get dual parity) is impacted by the event that caused the issue [Likely if it was a head crash;  not so clear that only one drive would be impacted by a power surge.]

By that logic you can say the same thing about dual parity: "Provided of course that ONLY two disks ..." As to power surge, I was mainly trying to invoke a scenario which is typically handled by parity protection. System loss is of course only rectified by backups. I am personally willing to lose my discretionary entertainment media to a power surge the level of which gets past my UPS and power supply to the point that it fries multiple drives or the system. Critical files are being treated differently, as already mentioned.

 

A difference of scale simply means you need less space to backup => e.g. 200 movies, even if they're all 25GB BluRay rips, would all fit on ONE 5TB disk ... these are often on sale for ~ $150 => of if you are upsizing your array you'd likely have that much space on 2 or 3 of the older drives, so the net cost to keep backups is zero.

Your point is spot on, but for me it is a non sequitur. The scenarios from which I feel I need to protect myself beyond parity and par2, require off-site backups. I'm not doing that for 5TB of data with either cloud (cost), another system (cost and effort), or even mailed hard drives (cost and effort). I will however do it with a few GB of data that can fit on a thumbdrive that is cheap and easy to mail, and/or which will fit in a free cloud setup.

 

Deleted media is a grayer area since one simple mistyped command can easily wipe out a share, a disk, or even the whole array for that matter. I just don't consider that likely enough for me to bother with a local backup. Sure it could happen, but I just don't care. My next level of "care" is system loss, and protection from that requires off-site backups. But you know, if I end up with a spare drive sitting around, I just might [shrug]

 

I agree, however, that with modern streaming services a lot of folks will simply use those instead of acquiring more physical media.  My DVD acquisition rate has dropped a LOT in the past couple years since we've been streaming a lot of the movies we watch.  It is, however, nice to be able to watch our collection independent of whether we have a broadband connection => e.g. in a motorhome or at a vacation home.

HA! My acquisition of physical media (movies and music) stopped almost a decade ago. Sheesh my last music acquisition, physical or digital, I can't even remember; I'm not even getting stuff from questionable sources either. If I can't stream it or hear it on the radio I don't care. I probably just made some serious music fans die a little inside [sorry] but music has really just become bubblegum for my ears in the last decade. As for disconnected viewing, for us that happens via a quick copy of desired media to a thumb drive + roku3 or a plex sync to the tablet. But I suspect we also do that less often than you might with a motorhome. Seems like you need an all SSD micro-array with autosync to your main array whenever you have connection >;-)

 

A final thought => I completely agree that nobody NEEDS backups ... just as you don't NEED insurance on anything else you're willing to lose.  Many folks don't buy collision coverage on their cars, and some don't even bother with homeowner's insurance.    Just depends on your risk tolerance.    Most of us don't really NEED all of the media we've collected ... so except for the emotional trauma losing it all wouldn't be a big deal.

Yup, risk tolerance is exactly it, with Risk = likelihood of occurrence * consequences of occurrence. We just differ in the level of effort we're willing to put in for the level of risk. And that is of course a completely fair place to disagree, but probably not worth debating much further :)

 

For the record, I not only have renters insurance where I live, homeowners insurance on my rental property, and collision coverage on my 14yo car (because it is very inexpensive), but I also have umbrella coverage which I have actually had to use and was worth every single penny (and I mean pennies) when I was taken to court. The insurance co's lawyer ate them alive!!! But I also don't buy insurance for electronics or extended warranties because I know they are not worth the cost and I can afford to self-insure. It helps that I like to work on cars too :)


This thread has gone in a very helpful direction. I'm in the "I don't like to keep off-site backups of huge files" group.

 

I think the takeaway here is that we need something (Par2 or otherwise) built into unRAID that performs checksumming, validation, and rebuild. It needs to be easy to use through the GUI and it needs to be well-tested.


Suppose for a moment that I don't want all of the Nerd Pack packages installed... where on the boot drive would I want to drop the Slackware package for inotify-tools so that inotify-tools is auto-installed on reboot?

 

It used to be /boot/extras but now I see /boot/packages and packages in the /boot/config/plugins/Name/Packages folders...

 

So uh, do I need to make an extras folder, or is there a better (AKA best-practices) approach to this?


Really I'm trying to better understand this here. Wouldn't a 1% par2 allow for 1% corruption per block?

 

No.

 

Par2 works with blocks. In every block it can correct 1 error bit, or all bits that are in error. For par2, that is the same. Par2 corrects blocks, not bits.

 

I split my blurays into 32762 (virtual) blocks. Then I create 50 par2 blocks (=0.165% par2 files).

 

This allows me to repair 50 bits all over the bluray. If 51 bits are bad (so 51 blocks are bad), then par2 can't help me.

 

But if in 50 blocks all bits are bad, par2 can still repair this.

 

Hope this helps...
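For anyone who wants to try the same setup, a sketch of creating that recovery data with the par2 command-line tool, called from Python here; 'movie.iso' is a placeholder file name, and -b / -c are par2cmdline's block-count and recovery-block-count options, matching the numbers above:

```python
import subprocess

# Split the source into ~32762 blocks and create 50 recovery blocks,
# so up to 50 damaged blocks can be reconstructed, wherever they fall.
subprocess.run(
    ["par2", "create", "-b32762", "-c50", "movie.iso"],
    check=True,
)
```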


... where on the boot drive would I want to drop the Slackware package for inotify-tools so that it is auto-installed on reboot? ... do I need to make an extras folder or is there a better (AKA best-practices) approach to this?

dmacias has a plugin to let you choose. See the last few pages of the NerdPack thread.

... Wouldn't a 1% par2 allow for 1% corruption per block?

No. Par2 corrects blocks, not bits ... I split my blurays into 32762 (virtual) blocks. Then I create 50 par2 blocks ... This allows me to repair 50 bits all over the bluray. If 51 bits are bad (so 51 blocks are bad), then par2 can't help me.

Yes it does, and it is more in line with what I had originally believed.

 

Based on what you said then, is this true: there can be any number of rotted bits in a virtual block; all that matters is that I have no more bad blocks than I have par2 blocks?

 

Also, checking my understanding of your math ... 50 par2 blocks and 32762 virtual blocks is 0.153%, is it not?


Based on what you said then, is this true: there can be any number of rotted bits in a virtual block; all that matters is that I have no more bad blocks than I have par2 blocks?

 

True. It doesn't matter how many bits are bad inside a bad block. It could be that 1 bit is bad, or it could be that all bits inside that block are bad. Par2 just works with blocks, so for par2 the whole block is bad.

 

Par2 calculates a checksum for each block, if that checksum is not correct, then the whole block is considered bad and needs to be reconstructed.

 

Par2 does not repair that 1 bad bit, it just reconstructs the whole block.

 

That's why, for bitrot, you want many small blocks. With 50 par2 blocks, I can repair 50 bad bits, no matter where they are in the big bluray iso file.

 

Also, if somehow the drive got bad (say bad sector) and the unraid parity got contaminated with that bad sector, then you could repair this with par2 (if you have enough par2 blocks).
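And a matching sketch of the verify/repair side, assuming the .par2 set created earlier sits next to the placeholder 'movie.iso':

```python
import subprocess

# Check the file against its recovery set; exit code 0 means everything is intact.
result = subprocess.run(["par2", "verify", "movie.iso.par2"])

if result.returncode != 0:
    # Damage found: attempt reconstruction from the recovery blocks.
    # This only succeeds if the number of bad blocks does not exceed
    # the number of recovery blocks that were created.
    subprocess.run(["par2", "repair", "movie.iso.par2"], check=True)
```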

 

Also, checking my understanding of your math ... 50 par2 blocks and 32762 virtual blocks is 0.153%, is it not?

 

I just calculated this with the filesizes of the bluray and par2 file. Another example is here (0.161% par2 blocks): https://lime-technology.com/forum/index.php?topic=43396.msg424245#msg424245

 

It is not always exactly the same size (%); an index file (with checksums for each par2 block) is also included with each .par2 file.

 

 


Also, if somehow the drive got bad (say bad sector) and the unraid parity got contaminated with that bad sector, then you could repair this with par2 (if you have enough par2 blocks).

 

That's why I use 5% PAR2s for my media servers; I can't afford to back it all up, and I once had a disk with some read errors while rebuilding a failed disk, which left me with some corrupted files. I had checksums, so I knew what files were damaged, but no way of fixing them.

 

I will probably go back to checksums only once dual parity is available.

 
