Data Integrity Monitoring

smino · September 11, 2009

Do you know of any utilities that does

"Data Integrity Monitoring"

Ie create a hash list (md5/sha128 or 256). Stores it, and compares the files on a weekly basis, against the original hash?

There are many tools out there to create a hash, but what about comparing the hashes from previous weeks?

Anyone create any scripts or have a solution?

RobJ · September 11, 2009

What I have wanted to do is from Brian's (bjp999) methods of using par2 files, but don't recall where he posted about it. That includes both error detection AND error correction. As I recall, it was very scriptable.

RobJ · September 11, 2009

See also the following: Program to verify multiple file integrity via md5

WeeboTech · September 11, 2009

Don't know much about par2. How dioes it do errror correction?

How can it rebuild a file?

I was considering the use of md5 to create an index of all files and an MD5 hash of them.

Then create a tool to match the prior hash with the current and provide notification if they differ.

Sort of ala tripwire.

It would be interesting to use par in this methodology if a file could be rebuilt somehow, but how does it work to do the rebuild?

How big is the par file?

BRiT · September 12, 2009

This is oversimplifying things, but if you want to be able to recover from a 5% corruption of a file, your parity file would be 5% the size of the file. It works on a data-block size (logical grouping of data, similar to a sector on a harddrive), so let's assume a data-block is 1024K. If you have a 1 byte corruption in the file, that would be 1 bad data-block. At a minimum you would need a 1 data-block sized recovery file (1024K). If you have 1024K byte corruption that falls within the same data-block, you only need 1 data-block sized recovery file (1024K). If you have 2 byte corruption that happens to be within different data-blocks, you would need 2 data-block sized recovery file (2048K).

http://en.wikipedia.org/wiki/Parchive and http://parchive.sourceforge.net/

The technology is based on a 'Reed-Solomon Code' implementation that allows for recovery of any 'X' real data-blocks for 'X' parity data-blocks present. (Data-blocks referring to files OR much smaller virtual slices of files).

Parchive (a contraction of parity archive volume set) is an open source software project that emerged in 2001 to develop a parity file format, as conceived by Tobias Rieper and Stefan Wehlus.[1] These parity files use a forward error correction-style system that can be used to perform data verification, and allow recovery when data is lost or corrupted.

Version 2 files generally use this naming/extension system: filename.vol000+01.PAR2, filename.vol001+02.PAR2, filename.vol003+04.PAR2, filename.vol007+06.PAR2, etc. The +01, +02, etc. in the filename indicates how many blocks it contains, and the vol000, vol001, vol003 etc. indicates the number of the first recovery block within the PAR2 file. If an index file of a download states that 4 blocks are missing, the easiest way to repair the files would be by downloading filename.vol003+04.PAR2. However, due to the redundancy, filename.vol007+06.PAR2 is also acceptable.

Version 2 supports up to 32768 (2^15) recovery blocks. Input files are split into multiple equal-sized blocks so that recovery files do not need to be the size of the largest input file.

WeeboTech · September 12, 2009

So can I have one PAR file for an ISO file and still be reliable.

It seems the command supported it.

root@Atlas /mnt/disk1/bittorrent #ls -l YOGA_BASICS.iso

-rw-r--r-- 1 root root 3080683520 Oct 9 2008 YOGA_BASICS.iso

root@Atlas /mnt/disk1/bittorrent #ls -l ../tmp/YOGA_BASICS.iso.*

-rw-r--r-- 1 root root 40408 Sep 11 20:08 ../tmp/YOGA_BASICS.iso.par2

-rw-r--r-- 1 root root 154361856 Sep 11 20:08 ../tmp/YOGA_BASICS.iso.vol000+100.par2

I could see us actually using this tool as a way of handling reliability It's just with all the multvolume files, it could be come unwieldy.

I was thinking to create a command which would take a fullpath and create a mirror image of that directory and instead of copying the source file, it creates the par files there.

SSD · September 12, 2009

The problem with PAR is that it will not process files in directories. If you keep movies as .ISO files all in a single folder, it is easy to use. But if you keep them in a folder structure, it is impossible.

I considered a plan to create file links to all of the files on a disk in a single folder, and then create the PAR set using that folder. But I never spent any time on it.

WeeboTech · September 12, 2009

Hmm.... according to the usage it seems like it can do multiple files.

Perhaps they all need to be listed on the command line.

Maybe if it cannot recurse, then a tool could be created to recurse down the tree and run par2 for each file.

I was thinking of using a disk directory like a database then when running the wrapper tool with a full command line it would build the par files into a directory oriented database structure.

I.E.

par2wrapper create (or verify or repair) /mnt/disk1/Videos/SOMEVIDEO.ISO

would create verify repair par2 files from

/mnt/diskx/par2db/disk1/Videos/SOMEVIDEO.ISO.par2

etc, etc,

root@Atlas ~ #/mnt/user/pub/slackware/par2

Not enough command line arguments.

par2cmdline comes with ABSOLUTELY NO WARRANTY.

This is free software, and you are welcome to redistribute it and/or modify

it under the terms of the GNU General Public License as published by the

Free Software Foundation; either version 2 of the License, or (at your

option) any later version. See COPYING for details.

Usage:

par2 c(reate) [options] <par2 file> [files] : Create PAR2 files

par2 v(erify) [options] <par2 file> [files] : Verify files using PAR2 file

par2 r(epair) [options] <par2 file> [files] : Repair files using PAR2 files

You may also leave out the "c", "v", and "r" commands by using "parcreate",

"par2verify", or "par2repair" instead.

Options:

-b<n> : Set the Block-Count

-s<n> : Set the Block-Size (Don't use both -b and -s)

-r<n> : Level of Redundancy (%%)

-c<n> : Recovery block count (Don't use both -r and -c)

-f<n> : First Recovery-Block-Number

-u : Uniform recovery file sizes

-l : Limit size of recovery files (Don't use both -u and -l)

-n<n> : Number of recovery files (Don't use both -n and -l)

-m<n> : Memory (in MB) to use

-v [-v]: Be more verbose

-q [-q]: Be more quiet (-q -q gives silence)

-- : Treat all remaining CommandLine as filenames

If you wish to create par2 files for a single source file, you may leave

out the name of the par2 file from the command line.

NAS · September 12, 2009

Yeah its only recursion etc its missing. It can happily make par2 files from 10,000 files if needed

SSD · September 12, 2009

Hmm.... according to the usage it seems like it can do multiple files.

Perhaps they all need to be listed on the command line.

Maybe if it cannot recurse, then a tool could be created to recurse down the tree and run par2 for each file.

You would really lose a lot of the benefit doing it a file at a time.

perfessor101 · September 12, 2009

As an exercise in disk thrashing and data integrity one thing I have been thinking about doing is running unRaid with (dare I say it) Flexraid (snapshot raid at the file system level) for my more sensitive data. (sort of Parchive for directories I guess)

You define your data directories and parity directories and it takes a snapshot raid with a definable raid level of 5 currently (with raid level 6 to N in the works).

I have most 'sensitive' data on one drive currently so I'm not sure if it would give me any more protection than unRaid.

I don't know how feasible or workable this would be

Hope it helps,

Bobby

WeeboTech · September 12, 2009

Hmm.... according to the usage it seems like it can do multiple files.

Perhaps they all need to be listed on the command line.

Maybe if it cannot recurse, then a tool could be created to recurse down the tree and run par2 for each file.

You would really lose a lot of the benefit doing it a file at a time.

How? The purpose is a parity protection to detect heres and/or allow them to be fixed.

The only downside I see is the huge directory it would create, but what benefit am I loosing?

If it's file based, then I should be able to detect or repair any single file in the filesystem.

If it's based on a directory, then I would expect all files in that directory need to be present?

Perhaps I'm missing something that I have not learned about yet.

perfessor101 · September 12, 2009

Par/par2 files are normally used on Usenet for recovery of downloaded binaries with missing parts.

for every block that is missing you need a Par2 block to replace it in order to recover the missing files / parts.

if you ran Par2 on a directory it would calculate parity on every file and say you used a 5% parity factor... if you no longer needed 6% of the files in that directory and deleted them you would have to rerun par to recalculate parity because you wouldn't have enough blocks for a recovery 5% - 6% would mean nothing would be recoverable. If you deleted 4% of the files your parity protection left would be about 1%.

One of the frustrating things was when people would add *.sfv files and *.jpg files to the parity protection for a video download ... and one part was missing. You would need to download a enough Par files to recover the sfv and jpg files as well as the video. For every missing file (sfv or jpg or ...) you would need a 384k par file to recover it and then one more 384k par file to recover the missing block

(most binaries posters on Usenet use 384k as a block size as this minimizes transfer glitches for very large files)

SSD · September 13, 2009

How? The purpose is a parity protection to detect heres and/or allow them to be fixed.

The only downside I see is the huge directory it would create, but what benefit am I loosing?

If it's file based, then I should be able to detect or repair any single file in the filesystem.

If it's based on a directory, then I would expect all files in that directory need to be present?

Perhaps I'm missing something that I have not learned about yet.

PAR breaks files into blocks. One par "block" can repair any block within the "set". If the set is one file, than a recovery block can only repair a problem in that one file.

So lets say you have 100 files and create 10 recovery blocks per file. Then lets say a file is corrupted. If it has more corruption than 10 blocks can repair, yoiu are out of luck.

Then lets say you put all 100 files into a single PAR set, and create 1000 recovery blocks. If one file is corrupted, all 1000 blocks can be used to repair that one file. It will take up the same amount of space as the 10 blocks per file, but provide much better recoverability.

NAS · September 13, 2009

Also par3 (which is on its way) benefits greatly for having more blocks with repairs becoming faster the larger the redundancy you have.

Data Integrity Monitoring

Recommended Posts

smino

Link to comment

RobJ

Link to comment

RobJ

Link to comment

WeeboTech

Link to comment

BRiT

Link to comment

WeeboTech

Link to comment

SSD

Link to comment

WeeboTech

Link to comment

NAS

Link to comment

SSD

Link to comment

perfessor101

Link to comment

WeeboTech

Link to comment

perfessor101

Link to comment

SSD

Link to comment

NAS

Link to comment

Join the conversation