Checksum Suite


Squid



Checksums (any checksum, be it MD5, SHA, or BLAKE2) only detect corruption (silent or otherwise).  There is par2 within the plugin, which will repair corruption.  The par2 side of things is functional, but the GUI is a little rough around the edges (real life keeps getting in the way of me finishing it; I should be able to do that over the holidays).



Sorry I keep asking noob questions; I understand the concept of using checksums but have never actually worked with them before.

 

Just to be 100% clear: if I want to have detection and repair of (minor) corruption, I'd choose par2 as my format?

 

Are there pros/cons to the different checksum formats? How big are the files they generate (in general), and where exactly are they stored within the scanned folder?

 

Some of this might be answered by downloading your plugin...  ;)



par2 is not a checksum. It is one or more additional files that allow missing or corrupt files to be reconstructed. Par2 is used extensively with Usenet binaries (NZBs) and the applications that download them, such as SABnzbd and NZBGet. It's something like how parity allows a disk to be reconstructed, but much different in the details.

Just to be 100% clear: if I want to have detection and repair of (minor) corruption, I'd choose par2 as my format?

 

Are there pros/cons to the different checksum formats? How big are the files they generate (in general), and where exactly are they stored within the scanned folder?

 

For repairing minor corruption issues (or major, depending upon how you configure everything), you would choose par2.  https://en.wikipedia.org/wiki/Parchive

 

The default the plugin uses at the moment is the ability to repair up to 10% corruption, totalling 200 corruptions (the important thing here is the 10%).  However, that ability requires an additional 10% of storage space.  For 100% recovery of a complete folder (i.e., every single file got deleted and/or major corruption happened to every file), the par2 set takes up the same amount of space as the originals.

 

A checksum alone stores a file that's maybe 200 bytes.

 

For checking the files, they are both roughly equivalent in speed.  However, for creation, a par2 set takes significantly longer than merely creating a checksum of the file.
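To make the size point concrete, here's a minimal sketch (not the plugin's actual code) of what the checksum-only side amounts to: hash the file in chunks and write a standard "<hash>  <name>" line beside it.

```python
import hashlib
import os

def write_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks (so large files never need to fit in RAM)
    and write a standard "<hash>  <name>" line to <path>.md5."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    digest = h.hexdigest()
    with open(path + '.md5', 'w') as out:
        out.write(digest + '  ' + os.path.basename(path) + '\n')
    return digest
```

The resulting .md5 file is a single short text line, which is why the storage cost is negligible next to a par2 set; verification just re-hashes the file and compares.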

 

par2 is not a checksum. It is one or more additional files that allow missing or corrupt files to be reconstructed. Par2 is used extensively with usenet binaries (NZBs) and the applications that download them, such as SABnzbd, NZBGet.

Mostly correct.  Par2 CAN be used as a checksum device, but that's not generally how it's used.  And this plugin does not generate par2 sets the way you would for Usenet (although it is compatible): par2 sets found on Usenet have multiple files, each with varying amounts of recovery blocks, up to and including the final file, which contains all of the recovery blocks.  I am only generating the file with all of the recovery blocks and forgoing the smaller files, because with the intermediate files a 10% recovery system would actually take up 20% of the hard drive space.  That's useful for Usenet, since a client only downloads the extra par2 files it actually requires (saving bandwidth/time), but useless for this application.

 

Something like how parity allows a disk to be reconstructed, but much different in the details.

No parity system can detect corruption once the parity has been updated to reflect the corruption.

 

The key thing about par2 is that it's really designed for folders with STATIC content.  If changes occur within the folder, they can interfere with the ability to reconstruct any corruption, and those changes could even get reversed if a par2 repair is performed, since any purposely changed file would be detected as corruption against the original par2 set.  E.g., I use par2 on my photos share because that share is basically static and the space requirements are relatively small (and a number of years ago I managed to delete every single photo my wife had ever saved, and I don't ever want to deal with that flak again).  Every once in a while I perform a par2 check against that folder, and if everything is OK I recreate the par2 set to reflect any photos that have been added.
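That check-then-recreate routine can be sketched against the stock par2cmdline tool (this is a hypothetical wrapper, not the plugin's code; `par2 verify` and `par2 create -rN` are standard par2cmdline invocations, and `photos.par2` is a placeholder name):

```python
import subprocess

def verify_cmd(par2_file):
    # Quick integrity check of the folder against its existing par2 set.
    return ['par2', 'verify', par2_file]

def create_cmd(par2_file, files, redundancy=10):
    # Rebuild the set at N% redundancy so newly added files are covered.
    # (par2cmdline won't overwrite an existing set, so in practice the
    # old .par2 files would be deleted first.)
    return ['par2', 'create', '-r%d' % redundancy, par2_file] + list(files)

def refresh(par2_file, files):
    """Verify first; only recreate the set if everything checked out."""
    if subprocess.call(verify_cmd(par2_file)) != 0:
        return False  # corruption found: repair/investigate, don't recreate
    subprocess.call(create_cmd(par2_file, files))
    return True
```

The important design point is the ordering: recreating the set before verifying would silently bake any existing corruption into the new recovery data.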

 

The important thing here is that every tool (RAID 6, unRAID's parity, checksums, par2, btrfs, zfs) has its limitations, and the tools all complement each other.


Hi Squid,

 

Great news you are working on such a plugin!!

 

Will the PAR2 GUI be able to configure:

- only create par2 from .iso files

- split the ISO file into 32762 (virtual) par2 parts

- create 50 par2 parts (so I can fix 50 SMALL errors)

- don't create an index file

 

I'm using a simple Python program (under Windows) to create par2 recovery parts (with MultiPar, which has only a Windows version), but 10% looks like way too much to me :-)

 

It would be great if I could use a plugin instead of my own program on a Windows computer (local processing should be faster).

 

 

Create par files

import glob
import os
import subprocess

# Version 0.2

def _get_dirs(base):
    return [x for x in glob.iglob(os.path.join(base, '*')) if os.path.isdir(x)]

def rglob(base, pattern):
    """Recursively collect all files below base matching pattern."""
    matches = glob.glob(os.path.join(base, pattern))
    for d in _get_dirs(base):
        matches.extend(rglob(d, pattern))   # d is already a full path
    return matches

ROOT_DIR = r'Y:\_Te_Parren'
# ROOT_DIR = r'D:\_Te_Parren'
# ROOT_DIR = 'P:\\'

RESULT_FILE = 'Create_files.txt'

ISO_flist = sorted(rglob(ROOT_DIR, '*.iso'))    # all ISO images
PAR2_flist = rglob(ROOT_DIR, '*.par2')          # all existing par2 files
# os.remove(RESULT_FILE)

with open(RESULT_FILE, 'a', encoding='utf-8') as log:
    log.write('********************* Start Parren voor ' + ROOT_DIR + ' *********************\r\n')
    for iso in ISO_flist:
        if any(iso in p for p in PAR2_flist):   # ISO already has a par2 file: skip it
            log.write('Has PAR2 file already ' + iso + '\r\n')
        else:                                   # no par2 file yet: create one
            log.write('Creating PAR2 file    ' + iso + '\r\n')
            return_code = subprocess.call(
                ['par2j.exe', 'c', '/in', '/rn50', '/sn32762', iso, iso])
            if return_code == 0:
                log.write('     OK               ' + iso + '\r\n')
            else:
                log.write('     FAIL             ' + iso + '\r\n')

 

check par files

import glob
import os
import subprocess

# Version 0.4

def _get_dirs(base):
    return [x for x in glob.iglob(os.path.join(base, '*')) if os.path.isdir(x)]

def rglob(base, pattern):
    """Recursively collect all files below base matching pattern."""
    matches = glob.glob(os.path.join(base, pattern))
    for d in _get_dirs(base):
        matches.extend(rglob(d, pattern))   # d is already a full path
    return matches

# ROOT_DIR = r'D:\Te_Parren'
# ROOT_DIR = r'Y:\_Te_Parren'
# ROOT_DIR = r'S:\Blu-Ray'
ROOT_DIR = 'W:\\'
RESULT_FILE = 'Check_files_Server1_disk7.txt'

ISO_flist = sorted(rglob(ROOT_DIR, '*.iso'))    # all ISO images
PAR2_flist = rglob(ROOT_DIR, '*.par2')          # all existing par2 files
# os.remove(RESULT_FILE)

with open(RESULT_FILE, 'a', encoding='utf-8') as log:
    for iso in ISO_flist:
        if any(iso in p for p in PAR2_flist):   # ISO has a par2 file: verify it
            return_code = subprocess.call(['par2j.exe', 'v', iso + '*.par2'])
            print(return_code)
            if return_code == 0:
                log.write('OK   ' + iso + '\r\n')
            else:
                log.write('FAIL ' + iso + '\r\n')
        else:                                   # no par2 file at all
            log.write('FAIL ' + iso + '\r\n')

 


No index file - I don't see the point, since then you can't do a quick verify on the contents to see if they're corrupted, and the size of that file is minuscule.

Number of recovery blocks - yes

Variable redundancy - yes, but based upon a %

Include / Exclude files - yes

 

This isn't going to be a general par2 creation tool, but rather something purpose-built for that emergency recovery when you just can't live without a certain file, which is why it'll always be a limited subset of what par2 actually offers.  If you require the complete feature set, either continue using Windows, or propose a Docker container for one of the Linux GUIs for par2.


The index file is already included in the par2 file that contains the recovery blocks, so you can always do a quick check. Why would you create a separate index file in this use case? On Usenet it is used as a checksum file (the minimum download if the file is not corrupt); if some parts are missing, then you download the extra par2 blocks.

 

Why do you need a buttload (10%) of large par2 blocks?

I see this as the following:

 

It is first a checksum replacement; no need to use another program/algorithm, because that is already included in par2.

 

Secondly, I don't see this as a disaster recovery tool. If the drive heads crashed, then re-rip the Blu-rays (or rebuild the drive from parity). However, if some sectors/clusters go bad, this is an independent tool to recover the ISO file (separate from the unRAID parity protection). Because I want protection against very small failures (bit rot, a bad sector on a drive, ...), which may get 'picked up' (protected) by unRAID's parity protection, I don't need a very large par2 recovery block size. The smaller the better: then you get more blocks (for the same .par2 file size) to repair a larger number of errors that happen at the same time. Thus I slice a file into 32762 par2 blocks (as many parts as possible) and create 50 recovery blocks.

 

How do you see these par2 blocks fitting into an unRAID-protected server?

 

This isn't going to be a general par2 creation tool, but rather something purpose-built for that emergency recovery when you just can't live without a certain file, which is why it'll always be a limited subset of what par2 actually offers.  If you require the complete feature set, either continue using Windows, or propose a Docker container for one of the Linux GUIs for par2.

 

That's also my view. I just want an extra layer of protection for static files...

 

An example movie folder: I need an extra 0.16% of the size of the file I want to protect. If I went larger, this would quickly get expensive (5% would mean an extra 6 TB drive just for these files; my biggest server is more than 100 TB).
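The arithmetic behind that figure: 50 recovery blocks against 32762 source blocks is roughly 50/32762 redundancy (the on-disk .par2 also carries a little header overhead, and the 40 GiB ISO below is just an assumed example size):

```python
SOURCE_BLOCKS = 32762   # /sn32762: slice the file into this many blocks
RECOVERY_BLOCKS = 50    # /rn50: recovery blocks created

redundancy = RECOVERY_BLOCKS / SOURCE_BLOCKS
print('redundancy: %.2f%%' % (100 * redundancy))   # ~0.15%

iso_bytes = 40 * 1024**3                  # assumed 40 GiB Blu-ray ISO
block_bytes = iso_bytes / SOURCE_BLOCKS   # ~1.25 MiB per block
recovery_bytes = RECOVERY_BLOCKS * block_bytes
print('recovery data: ~%.0f MiB' % (recovery_bytes / 1024**2))
```

So any 50 damaged blocks in that ISO can be repaired for only a few tens of MiB of recovery data.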

 

[Attached screenshot: example movie folder showing the par2 file sizes]


Question for anyone on the thread here, not necessarily Squid.

 

Earlier it was stated that the GUI for Par2 is up and running, although it is a bit rough. I have the Checksum Suite installed. I have checksummed my data on my original device and verified that the data on my backup server matches (Yay!). However, I haven't gotten Par2 up and running and I don't see a button/option for that.

 

Anyone able to point me in the right direction? I know Squid said this would be coming in full-form soon, but I'm anxious to test this so I can get my butt covered in case of bitrot.


The button is in Settings - Squid's Checksum Suite - Par2 Tools

Is there any way to estimate how much hard drive space is needed as overhead when you implement checksumming and par2 (which I have learned are different things)?

 

There is talk about the CPU impact being minimal, since the process runs at a very low priority, but what is the disk space impact?

 

Obviously the more files/space used, the more checksums there will be... but is there a rule of thumb here? If I add a 3 GB file, how much additional space is needed for the checksum/par2?


Yes. It depends on how much corruption you want to be able to correct. As I understand it, the default is 10% per file, so that's 10% overhead. Depending on what your use case is, or what failure you're trying to prevent, you may want more or less par2 overhead. 10% is very conservative for bitrot protection; it's basically impossible to have that much bitrot occur in a 6GB file in one month.

 

If, however, you wanted Par2 to protect you against a total file loss (accidental deletion, crashed disk, etc) it would require 100% overhead, as it would be keeping enough data to reconstruct the entire file.

 

Hope that helps!


For checking the files, they are both roughly equivalent in speed.  However, for creation, a par2 set takes significantly longer than merely creating a checksum of the file.

 

Mostly correct.  Par2 CAN be used as a checksum device, but its not generally how its used.

 

Is there any advantage to using MD5 over par2, other than the fact that I can schedule MD5 checks in the GUI using Checksum Suite?


It doesn't take as much disk space. But it only detects and cannot correct corruption.


That does help, and addresses how Par2 works. What I am still a little confused about is the amount of overhead for Checksumming (assuming you use both Par2 and something else to checksum)


A Checksum hash file will take up ~150 bytes per file hashed


A latecomer to this thread here, but I did read the whole thing; I just want to understand it in my own words...

 

If I have a 7 TB array and it's mostly full, would I need 700 GB free at all times for the checksums?

 

thank you

 

 


The 10% number is for par2, and could probably be set lower and still work. Par2 will allow you to reconstruct corrupt or missing data. If you just want to detect corruption but not reconstruct it, then MD5 or something else will give you that with just a few bytes per file.
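Back-of-envelope numbers for the two approaches on that 7 TB array (the 10,000-file count is just an assumption for illustration):

```python
array_bytes = 7 * 10**12          # 7 TB of data
file_count = 10_000               # assumed number of files on the array
hash_bytes_per_file = 150         # ~150 bytes of hash file per file hashed

# Checksum-only overhead: tiny, regardless of array size.
checksum_total = file_count * hash_bytes_per_file
print('checksum overhead: %.1f MB' % (checksum_total / 10**6))   # 1.5 MB

# Par2 at the plugin's default 10% redundancy: scales with the data.
par2_total = array_bytes * 0.10
print('par2 overhead:     %.0f GB' % (par2_total / 10**9))       # 700 GB
```

So the 700 GB figure only applies if you put par2 sets on everything at 10%; checksums alone cost megabytes, not gigabytes.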

I haven't worked with par2 myself, just read up on it somewhat.  It appears to have been designed first for the communications issues of Usenet, then primarily used for data protection on CDs as they age and degrade.  The 10% figures and similar make sense for those uses, but make no sense at all for ours, on massive hard disks and SSDs.  We're concerned about bitrot on the order of one bit per terabyte, and many of us have never even seen that, as far as we know.  The scale is enormously different.  Even a par2 percentage of 1% seems enormously wasteful and unnecessary: 1% results in files of about 10 GB per TB, a ratio of 10 gigabytes to one bit, roughly 80,000,000,000 to 1!  That seems beyond overkill.  Apparently there are ways to specify par2 files of arbitrary size well below 1%, but since par2 doesn't look like it was designed for very low sizes, it would have to be investigated how reliably it works with them.

 

I can't imagine we're the first to think of this, so somewhere there must be further research on using par2 against the incredibly small amounts of bitrot that we're concerned with.


Par2 can be adjusted to provide varying levels of recovery capability ... but it's never as good as actually having a backup of the data.    Personally, I just use MD5 checksums for all my files, and if I get a mismatch when I do a check I simply copy the file from the backup.

 

I agree bitrot is a VERY small problem (but it does happen) ... but there are plenty of other reasons to maintain current backups.    With a complete set of backups, simply being able to identify any bitrot (via an MD5 mismatch) is all you need.

 

 

 


Looking at it only from a "wasteful" point of view, my intent would be to use par2 only for my "critical" files. Critical = personally created/unique, and thus damn near impossible to re-create, if not actually impossible. That means all photos, personal videos, archived emails, most if not all Office documents, tax records, and the like. The nice thing is those files do not take up that much space, so 1% par2 is hardly a hardship.

 

But I also know that par2 is just another layer of parity-like protection, inasmuch as it does nothing to protect me from system destruction or file deletion. For that reason, those files are also backed up to my PC, backed up to a thumb drive stored in a firebox, and eventually I plan on adding CrashPlan to the mix. But I see par2 as a small price to pay, for those files, for yet another layer of protection.

 

As to my media files... meh... I can always reacquire them. Parity saves me the time and bandwidth if I have a drive crash. Par2 would save me a little more if I were to experience bitrot (for the record, I have very old JPGs that likely suffered from bitrot, though I can't prove it, of course), but I'm still not sure I'd enable it for media that isn't "critical". Perhaps if we could reliably get par2 for those down to 0.1%. [shrug]
