WeeboTech

RFC: MD5 checksum/Hash Software


I've mentioned for quite some time that I'd like a new locate-style program for unRAID.

One which provides a database for locating files and also stores the md5 hash.

 

Since SQLite will now be included with unRAID 6, I've been moving forward with good success on this project.

 

It can catalog whatever files you ask it to, individually and recursively.

You can optionally calculate the md5 hash. I have the md5 logic embedded in the updatedb program.

It can determine if a file changed based on size, mtime, ctime, and (optionally) the hash.

 

What I do not have yet is the ability to prune the database of dangling entries. That's what I'm working on next.

 

There is a companion locate program for extracting files out of the database, à la GNU updatedb/locate (or slocate, mlocate).

For us, I have a field for storing the hash.

 

You can extract the filename, and a new function extracts the hash in md5sum-style format.

What I need to add is a relative root option. Since everything uses full path in the DB, you may want to extract a subset and have the filenames be relative to a specific root.

 

I chose SQLite instead of GDBM because it gives people the ability to access the stored data from the command line or scripts.
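For example, command-line access could look something like this (a sketch: the table and column names here are assumptions based on my current schema, and a throwaway database stands in for the real /var/lib/sqlite/locate.sqlite3):

```shell
# Build a toy database shaped like the locate table, then query it the
# way any script could query the real /var/lib/sqlite/locate.sqlite3.
db=$(mktemp /tmp/locate.XXXXXX)
sqlite3 "$db" "CREATE TABLE locate (name TEXT PRIMARY KEY, size INTEGER, hash TEXT);"
sqlite3 "$db" "INSERT INTO locate (name, size, hash) VALUES
  ('/mnt/disk3/filelist.txt', 1024, '5c2328e6c04083278cdd334e5c0895dd');"
# Emit "hash  name", i.e. md5sum-style output, straight from SQL:
result=$(sqlite3 "$db" "SELECT hash || '  ' || name FROM locate WHERE name LIKE '/mnt/disk3/%';")
echo "$result"
rm -f "$db"
```

With GDBM you'd need a custom reader for the same one-liner.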

The downside is the size and location of the database.

 

1 million files takes up about 550MB in /var/lib/sqlite/locate.sqlite3. You can put that somewhere else, but then performance may suffer.

Inserting 300,000 files took about 30 minutes without hash at about 160MB.

In comparison, the GDBM file for 1 million files was about 200MB, without the hashes, I'm not sure how large the hash GDBM would be.

 

I'll probably make it so you can have multiple databases for the locate command.

This way you can updatedb each individual disk to its own database and have the locate command traverse each.

Or you can have it all in one (like I do).

 

If people have md5sum files with full paths in it, I can probably make the updatedb have the ability to import the hashes.
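Standard md5sum output with absolute paths is exactly the format I'd import. For example (note: the --import flag shown in the comment is purely hypothetical, it doesn't exist yet):

```shell
# Create a hash file in the standard "hash  /full/path" md5sum format.
echo "demo data" > /tmp/demo.txt
md5sum /tmp/demo.txt > /tmp/import.md5
cat /tmp/import.md5
# Hypothetical future usage:  ./fupdatedb --import /tmp/import.md5
h=$(cut -d' ' -f1 /tmp/import.md5)   # first field: the 32-hex-char md5
```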

 

Now that SQLite is compiled into PHP, this will eventually bring us a new webGui plugin that will interface with the SQLite tables directly.

You will be able to search through your locate table to look for files at specific locations.

Possibly verify checksum and/or locate a file and access it directly from the webGui. I haven't even started on this part yet.

 

 

 

 

This is being split off from the original post as an RFC (request for comments).

MD5 checksum software - what do you use?

http://lime-technology.com/forum/index.php?topic=34978.0

 

 

I wanted to start a separate thread to get some input on what people need and find out what some of the gotchas are.


... by the way, WeeboTech's utility to catalog all the files, generate checksums, verify them, etc. sounds like a VERY nice addition !!  Looking forward to when this is completed, primarily because running verifications natively on the UnRAID server will be appreciably faster than across the network.

 

I'm trying to work this out.

Apparently, my internal md5 routine that I grabbed from somewhere is not as fast as the external md5sum.

Then there is the fact that I'm not taking advantage of threads or multiprocessing.

 

I can run the external programs in a pipe; I'm not sure how much that will add or save, and that's my current study.

If it's feasible then you can set a variable or command line argument to your favorite external hasher.

 

However if I do this right, you can run processes in parallel from the database itself.

SQLite supports locking, so you can have a couple of processes walking the tree and summing at the same time.

i.e.

./updatedb /mnt/disk1 &

./updatedb /mnt/disk2 &
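A wrapper around that idea is just background jobs plus wait. A sketch (fupdatedb is stubbed out here so the pattern itself is runnable; the real tool would walk the disk and write to SQLite, whose locking serializes the concurrent writers):

```shell
# Stub standing in for the real fupdatedb; the real one hashes files
# and inserts rows into the shared SQLite database.
fupdatedb() { echo "cataloged $1" >> /tmp/updatedb.log; }

rm -f /tmp/updatedb.log
for d in disk1 disk2 disk3; do
    fupdatedb "/mnt/$d" &      # one worker per disk, in parallel
done
wait                           # block until every worker has finished
cat /tmp/updatedb.log
```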

 

I'll have to start another thread or fork pieces of this one off to get some input on what's really needed by people.

My needs are a simple locate database with stored md5's.


... I'm trying to work this out.

 

It definitely sounds like it'll be really slick when you're done.  I'm actually very happy with my current process ... the Corz utility works great, and how long it takes isn't a big deal => it takes ~ 5 seconds of "my time" to start a full verify, and how long it runs doesn't really matter.    But I'll definitely switch to what you've described once it's completed ... having multiple threads (potentially one per disk) running natively on the UnRAID box will almost certainly be much faster; and the cataloging it provides also sounds really nice.

 

Certainly no rush ... but let us know when it's far enough along you want some testers  :)


Current schema

 

# sqlite3 /var/lib/sqlite/locate.sqlite3 ".schema locate"
CREATE TABLE 'locate' (name TEXT PRIMARY KEY, time INTEGER, size INTEGER, updatetime INTEGER, jobstime INTEGER, jobetime INTEGER, rc INTEGER, lock INTEGER, hash TEXT, stat BLOB);

 

While I could have put the jobstime/jobetime and rc in another table, I think that would complicate things and also make the database larger (i.e., keys have to connect one way or another).

 

The purpose of these fields is to track external jobs operating on a file and prevent collisions.

I.e., if multiple hashing jobs are running in parallel, somehow you have to prevent one from doing the same work as another.

Other ideas are to track rsyncs to some other destination.

 

Time is a single value with the max of mtime / ctime on the file.

updatetime is whenever the record gets updated, so you can track changes easier and write reports.

 

Also another reason to use SQLite is the Firefox SQLite browser extension.

https://addons.mozilla.org/en-US/firefox/addon/sqlite-manager/

 

I.E. you will be able to get access to your data one way or another.

 

The companion locate tool will also be able to export the data as a normal text hash file.


Weeboo, if speed of the md5 hash really is seen as a bottleneck you might consider BLAZE2 as used in checksum. And as Gary says, a process per disk could really scoot things along

 

Gary is also right that user interaction is minimal, so speed in that sense is not critical.  However, if run times can be brought down to something approaching "reasonable", then periodic validations on the order of days / weeks / months become feasible without turning our servers into literal perpetual hash-generating machines :o Not to mention the peace of mind of validating checksums quickly after a parity fail.

 

I'm not as concerned with finding files, but I can see the benefit for others.  Beyond that though I do start to consider that the day we can confidently convert the entire array to btrfs with its built in checksumming is the day many people who whinge about bitrot will be happy :)

 

Feature Request: With an option to isolate per-drive hash lists for validation after a parity fail it would also be nice to have lots of copies; like one on every disk.  Frankly consuming 200MB, or even 400MB per drive sounds trivial to me so long as I know I'll have access to an unmolested database of hashes.  Oh yeah, and a hash of THAT file to validate it's not corrupted ... I'm kidding.  I think?!?!


The companion locate tool will also be able to export the data as a normal text hash file.

Here are some of the examples of how the locate command could work.

 

root@unRAID: # ./flocate filelist.txt

/mnt/disk3/Music/filelist.txt

/mnt/disk3/filelist.txt

 

root@unRAID:# ./flocate --hash filelist.txt

(no data because I never imported the md5)

 

root@unRAID:# ./fupdatedb --md5 /mnt/disk3/Music/filelist.txt

root@unRAID:# ./fupdatedb -S --md5 /mnt/disk3/filelist.txt

1 items processed in 1 seconds, 1 select, 0 inserts, 1 update, 0 duplicates, 0 deletes, 0 busys, 0 errors

 

root@unRAID: # ./flocate --hash filelist.txt                 

0c6ca1e01c575493abfaae4ee9eeb10a */mnt/disk3/Music/filelist.txt

5c2328e6c04083278cdd334e5c0895dd */mnt/disk3/filelist.txt

 

root@unRAID:# ./flocate --hash filelist.txt | md5sum -c

/mnt/disk3/Music/filelist.txt: OK

/mnt/disk3/filelist.txt: OK

 

 


Weeboo, if speed of the md5 hash really is seen as a bottleneck you might consider BLAZE2 as used in checksum. And as Gary says, a process per disk could really scoot things along

 

I can't find much on this. If there is source around, or a simple command like md5sum, sha1sum, etc., I can see how to incorporate its use.

 

I'm not as concerned with finding files, but I can see the benefit for others. 

I have millions of files, I need to find them and find dupes.

I suppose with the hash database, we can now find duplicates pretty easily with SQL.
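For example, something like this GROUP BY should do it (a sketch; the real table has more columns, and sample rows stand in for a real catalog):

```shell
# Toy catalog with two paths sharing a hash; GROUP BY/HAVING flags them.
db=$(mktemp /tmp/dupes.XXXXXX)
sqlite3 "$db" "CREATE TABLE locate (name TEXT PRIMARY KEY, hash TEXT);
INSERT INTO locate VALUES
  ('/mnt/disk1/a.mp3','0c6ca1e01c575493abfaae4ee9eeb10a'),
  ('/mnt/disk2/copy-of-a.mp3','0c6ca1e01c575493abfaae4ee9eeb10a'),
  ('/mnt/disk3/b.mp3','5c2328e6c04083278cdd334e5c0895dd');"
# Any hash appearing more than once is a duplicate candidate:
dupes=$(sqlite3 "$db" "SELECT hash, COUNT(*) AS n, GROUP_CONCAT(name)
                       FROM locate GROUP BY hash HAVING n > 1;")
echo "$dupes"
rm -f "$db"
```

Same-size-then-same-hash filtering would make this cheap even on millions of rows.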

 

Beyond that though I do start to consider that the day we can confidently convert the entire array to btrfs with its built in checksumming is the day many people who whinge about bitrot will be happy :)

 

This is part of the impetus to do this. Plus there is moving files from one area to another.

You can export your md5sum file with locate, move that directory somewhere, and double-check it.

I've done this umpteen times.

 

Also, the whole "parity error, is my data rotting?" question does not give me much peace of mind.


Feature Request: With an option to isolate per-drive hash lists for validation after a parity fail it would also be nice to have lots of copies; like one on every disk.  Frankly consuming 200MB, or even 400MB per drive sounds trivial to me so long as I know I'll have access to an unmolested database of hashes.  Oh yeah, and a hash of THAT file to validate it's not corrupted ... I'm kidding.  I think?!?!

 

What exactly are you requesting?  The same database on multiple disks, or one database per disk containing only the data for that disk?

 

So far I have the --database option so you can specify a specific output database.

 

The normal locate command has the DBPATH variable and I was planning to implement that so you could have one database per disk (wherever you choose to put it).

 

When doing the updatedb you would have to specify the database (it defaults to /var/lib/sqlite)

However, the locate command could iterate through more than one database with the DBPATH variable.
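That DBPATH-style iteration would amount to looping over per-disk databases. A sketch, with temp files standing in for hypothetical /var/lib/sqlite/diskN.sqlite3 databases:

```shell
# Two per-disk databases; the locate step just queries each in turn.
db1=$(mktemp /tmp/disk1.XXXXXX)
db2=$(mktemp /tmp/disk2.XXXXXX)
sqlite3 "$db1" "CREATE TABLE locate (name TEXT);
                INSERT INTO locate VALUES ('/mnt/disk1/a.txt');"
sqlite3 "$db2" "CREATE TABLE locate (name TEXT);
                INSERT INTO locate VALUES ('/mnt/disk2/b.txt');"
found=""
for db in "$db1" "$db2"; do        # DBPATH-style traversal
    found="$found $(sqlite3 "$db" "SELECT name FROM locate WHERE name LIKE '%.txt';")"
done
echo "$found"
rm -f "$db1" "$db2"
```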

 

Keep in mind speed is a concern when updating the table.

If you are hashing a file at a time, the speed of the database update becomes a smaller part of that.

 

I still have to do a study of having the SQLite table on a disk. Last time I did that it was pretty slow to insert.

 

So far I get the following statistics on inserts/updates without hashing.

 

This is where the filesystem was not pre-cached with a find.

308107 items processed in 1920 seconds, 0 selects, 283519 inserts, 0 updates, 0 duplicates, 0 deletes, 0 busys, 0 errors

329035 items processed in 1977 seconds, 2 selects, 303060 inserts, 2 updates, 0 duplicates, 0 deletes, 0 busys, 0 errors

 

And this is moments later, after the filesystem was pre-cached.

root@unRAID:# ./fupdatedb -S --progress 60 /mnt/disk3

11860 items processed in 60 seconds, 11393 selects, 0 inserts, 11393 updates, 0 duplicates, 0 deletes, 0 busys, 0 errors

36986 items processed in 120 seconds, 34531 selects, 16 inserts, 34531 updates, 0 duplicates, 0 deletes, 0 busys, 0 errors

179601 items processed in 180 seconds, 164821 selects, 80 inserts, 164821 updates, 0 duplicates, 0 deletes, 0 busys, 0 errors

329120 items processed in 235 seconds, 303062 selects, 80 inserts, 303062 updates, 0 duplicates, 0 deletes, 0 busys, 0 errors

 

and after a few hours worth of operation.

root@unRAID:/mnt/disk1/cache/home/rcotrone/hercules/src/flocate# time ./fupdatedb -S --progress 60 /mnt/disk3

329171 items processed in 46 seconds, 303174 selects, 16 inserts, 0 updates, 0 duplicates, 0 deletes, 0 busys, 0 errors

 

real    0m45.688s

user    0m24.310s

sys    0m6.450s


Weeboo, if speed of the md5 hash really is seen as a bottleneck you might consider BLAZE2 as used in checksum. And as Gary says, a process per disk could really scoot things along

 

I can't find much on this. If there is source around, or a simple command like md5sum, sha1sum, etc., I can see how to incorporate its use.

 

 

Did you mean blake2?

https://blake2.net/


Basically the option to store synced copies of the database to multiple locations, even so much as one on each disk (containing the entire array's hashes).

 

Sorry, yes, I meant blaKe not blaZe

 

EDIT: I realize I could probably just run a series of rsyncs to accomplish this myself, but thought it might be nice to have as part of the config UI (GUI?)


Basically the option to store synced copies of the database to multiple locations, even so much as one on each disk (containing the entire array's hashes).

 

What about just using rsync?

 

 

The overhead of the multiple SQLite operations to multiple databases is going to really slow things down.

SQLite is fast to read, but slow to write.

 


While the possibility of MD5 hash collisions is extraordinarily rare, you should consider perhaps a concatenation of MD5 and SHA1 (to keep processing/calculation time lower), or just SHA256.  Or maybe just SHA-2/Blake as previously suggested.

 

Are you hashing the file itself bit for bit, or are you incorporating other file information (filename, date stamp, etc.) into the MD5 input?

 


So far I'm only using the file's data for the md5 routines: a hash of the data itself.

This way the hash can be extracted and piped into the md5sum program or other programs.

See the earlier post where I extract the md5 hash with locate, then pipe it into md5sum -c for checking.

 

The time (max of mtime or ctime), size, and inode are stored in the SQLite table.

I forgot to include the inode initially (actually had it, then removed it). Once I saw a note on btrfs checksum reporting, I remembered why I stored the inode in an SQL-searchable way: a btrfs checksum error reports the inode where it found the problem, not the filename.

So you have to use find / -inum (inode) -print to find the file (which could take a pretty long time).

 

The impetus for that is also for my job, since I need to find all files that are hard linked to the same file.

 

I'm debating between internalizing the md5 routines and using a helper with a pipe,

something like

 

--hash-helper '/bin/md5sum -b {}'

 

I want to avoid actually forking a shell, which is what many tools do (even find).

So I would have to break the command down into pieces, substitute the {} with the current filename, and then pipe(), fork(), exec(), wait, and read.

 

Using the internal routines is pretty straightforward for now; it just doesn't run as fast.

 

With the helper program you can substitute your favorite hasher for importing data, as long as it conforms to a standard hash file of

hash filename

 

The issue for me is that it complicates things more, i.e., argument breaking, pipe, fork, exec, read a line, split, and utilize.

I wrote a custom popen routine today so I'm partially there.


While the possibility of MD5 hash collisions is extraordinarily rare, you should consider perhaps a concatenation of MD5 and SHA1 ...

 

FWIW you can do this with Corz' Checksum now -- you can select Blake2, SHA1, or MD5 or any combination thereof ... it will compute & check all 3 if you want  :)


While the possibility of MD5 hash collisions is extraordinarily rare, you should consider perhaps a concatenation of MD5 and SHA1 ...

 

FWIW you can do this with Corz' Checksum now -- you can select Blake2, SHA1, or MD5 or any combination thereof ... it will compute & check all 3 if you want  :)

 

For unix/linux? I thought the corz tool was only windows.

 

I'm really looking to stick with one machine to encompass the archiving / hash functionality.

Eventually I would like to automate it on a live basis using inotify.

 

 

I.e., as soon as a new file is placed on the server's /mnt/disk*, it is cataloged and hashed in real time.


It was developed for Windows, but there's now a "basic Linux/UNIX/BSD version" included in the download.  I have NOT tried this, but I believe jumperalex has, and it's not very complete.    I don't think Corz intends to further develop that, so I suspect it's not very useful.

 

My comment wasn't intended to mean you could/should use that -- just that for those who want multiple hashes for the added security it's already available as long as you do it from Windows.

 


OK that's good news guys!!  I'll check it out and see if I can use it as an engine or helper.

 

....for those who want multiple hashes for the added security it's already available as long as you do it from Windows.

 

I'm not sure that having a hash, or multiple hashes, is added security.  While there are chances of a collision, the purpose of hash tracking is determining whether a file changed. Did a bit flip?  For that purpose any of these will do, and doing multiples is overkill.

 

For real security or peace of mind, it would be useful to have a set of par2 files so that errors can be detected and corrected.

 

FWIW, I had embarked on that based on information for NAS. This is where I first got involved in dealing with md5's.

I had planned to create a set of par2 files per directory. NAS suggested using a larger set of files. So I created one directory with symlinks to every file. The name of each symlink was the md5 hash of the full path. (This is where the internal functions for my md5 were born.)  I then created a set of par2 files for that directory.  This is where the database first became useful.  Should there be corruption or missing files, you would need the source-name-to-md5-hash mapping.  So the database's early incarnation had the full path, the md5-translated path, the md5 hash of the data, and other stat information.

 

This project proved futile in 4.7, as the performance requirements to do this were pretty high.  I had so many files that par2 would always segfault because of malloc failures. There would be OOM-killer errors and the array would be unstable.

 

With 64-bit and larger machines this may now be feasible, but I'm not holding it too high on the list.

 

In my case, one directory of symlinks to 350,000 files just doesn't seem easy to manage.

I may re-explore the par2 files per directory, but that's for another day... and thread.


Cross-post from another thread, here as food for thought: using par2 to create a hash/repair set of files. A per-directory testing tool I'm working on makes a .hash file that can be used by the Windows Corz checksum.

It also makes a set of par2 files so you can validate and repair some damage or missing files (depending on how bad the damage is).

This goes one step further than detecting it, by allowing you to repair it. In the quick preliminary shell I create a file as

 

root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# more folder.hash
# file: folder.hash
# user.dir="/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)"
# user.hash.dir="a59998b6369450c9269203279720b427"
# user.hash.time="1411076621"
60343da3cd9725c218ba522df3024503  18 Michael E - Late Night Dreams.mp3
9060cd879b9870aae4df72e9ae568307  15 Eddie Silverton - White Sand.mp3
cd4d1b45f9b6dfcd0441ca13e91b560d  04 Mazelo Nostra - Smooth Night.mp3
011875f40fea633e0b8a1f94c583ab81  25 Pianochocolate - Long Long Letter.mp3

 

This way the folder.hash file can be used to validate the current files. I'm purposely using relative filenames here.

The corz checksum makes a file as

# made with checksum.. point-and-click hashing for windows (64-bit edition).
# from corz.org.. http://corz.org/windows/software/checksum/#
#md5#folder.par2#2014.09.18@14.37:14
3a23f04cd50f96a1dcc2b10f3406376d *folder.par2
#md5#filelist.lastrun#2014.09.19@01.51:15
f6cd171fd3955f1e9a25010451509652 *filelist.lastrun
#md5#disk3.filelist.2014-37.txt#2014.09.19@01.51:03
62064d427826035e7360cbf1a409aa61 *disk3.filelist.2014-37.txt

In my version, I tried to make the comments look like the exported getfattr -d format.

I'll build a parser in C to allow use of this file to import into SQLite or do other things with it.

I have the parser for the hash line, now need the variable line.

Since I write a folder.hash (and folder.par2) in every folder, user.hash.dir is a hash of the path, so the folder.hash files can be collected to a safer location off the filesystem.

Anyway, with par2create/verify/repair executed within each directory, I can damage a file and repair it.

I can delete a file and have it recreated, depending on how many blocks are available.

Here's an example of the directory.

Cool thing with the .hash naming: you can go onto Windows, right-click on the file, click "check checksum", and get quick validation from the Windows workstation.

root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# ls -l
total 280837
-rw-rw-rw- 1 rcotrone 522 15798176 2014-09-18 11:25 01\ The\ Diventa\ Project\ -\ Serenity\ (Lazy\ Hammock\ Mix).mp3
-rw-rw-rw- 1 rcotrone 522  7648975 2014-09-18 11:25 02\ Stepo\ Del\ Sol\ -\ Fly\ Far\ Away.mp3
-rw-rw-rw- 1 rcotrone 522  9873573 2014-09-18 11:25 03\ Stargazer\ -\ White\ Caps\ (Sylt\ Mix).mp3
-rw-rw-rw- 1 rcotrone 522  9378281 2014-09-18 11:25 04\ Mazelo\ Nostra\ -\ Smooth\ Night.mp3
-rw-rw-rw- 1 rcotrone 522  8689679 2014-09-18 11:25 05\ Billy\ Esteban\ -\ Dream.mp3
-rw-rw-rw- 1 rcotrone 522 16075024 2014-09-18 11:25 06\ C.A.V.O.K\ -\ Night\ Flight.mp3
-rw-rw-rw- 1 rcotrone 522  9431569 2014-09-18 11:25 07\ Nasser\ Shibani\ -\ Time\ Chase.mp3
-rw-rw-rw- 1 rcotrone 522 12205755 2014-09-18 11:25 08\ Dave\ Ross\ -\ Solana.mp3
-rw-rw-rw- 1 rcotrone 522  9747172 2014-09-18 11:25 09\ Gabor\ Deutsch\ -\ Rearrange\ (Feat.\ Harcsa\ Veronika).mp3
-rw-rw-rw- 1 rcotrone 522 11318708 2014-09-18 11:25 10\ Ryan\ KP\ -\ Everythings\ Gonna\ Be\ Alright\ (Feat.\ Melody).mp3
-rw-rw-rw- 1 rcotrone 522 10283170 2014-09-18 11:25 11\ Florzinho\ -\ Primavera\ (Dub\ Mix).mp3
-rw-rw-rw- 1 rcotrone 522 13498323 2014-09-18 11:25 12\ Lazy\ Hammock\ -\ One\ of\ Those\ Days.mp3
-rw-rw-rw- 1 rcotrone 522 10098189 2014-09-18 11:25 13\ Myah\ -\ Falling.mp3
-rw-rw-rw- 1 rcotrone 522 10972802 2014-09-18 11:25 14\ Peter\ Pearson\ -\ I\ Need\ to\ Chill.mp3
-rw-rw-rw- 1 rcotrone 522 10983245 2014-09-18 11:25 15\ Eddie\ Silverton\ -\ White\ Sand.mp3
-rw-rw-rw- 1 rcotrone 522  6619741 2014-09-18 11:25 16\ Ingo\ Herrmann\ -\ Filtron.mp3
-rw-rw-rw- 1 rcotrone 522 14187937 2014-09-18 11:25 17\ DJ\ MNX\ -\ Cosmic\ Dreamer.mp3
-rw-rw-rw- 1 rcotrone 522  9187068 2014-09-18 11:25 18\ Michael\ E\ -\ Late\ Night\ Dreams.mp3
-rw-rw-rw- 1 rcotrone 522  5278124 2014-09-18 11:25 19\ Francois\ Maugame\ -\ Like\ a\ Summer\ Breeze.mp3
-rw-rw-rw- 1 rcotrone 522 12749119 2014-09-18 11:25 20\ Collioure\ -\ Perfect\ Resort.mp3
-rw-rw-rw- 1 rcotrone 522  7050247 2014-09-18 11:25 21\ Leon\ Ard\ -\ Caribbean\ Dreams.mp3
-rw-rw-rw- 1 rcotrone 522 10242401 2014-09-18 11:25 22\ Syusi\ -\ Bright\ Moments.mp3
-rw-rw-rw- 1 rcotrone 522 10783678 2014-09-18 11:25 23\ Thomas\ Lemmer\ -\ Above\ the\ Clouds.mp3
-rw-rw-rw- 1 rcotrone 522  8890313 2014-09-18 11:25 24\ Frame\ by\ Frame\ -\ Borderland.mp3
-rw-rw-rw- 1 rcotrone 522  9443076 2014-09-18 11:25 25\ Pianochocolate\ -\ Long\ Long\ Letter.mp3
-rw-rw-rw- 1 root     522     2049 2014-09-18 17:43 folder.hash
-rw-rw-rw- 1 root     522   132623 2014-09-18 17:03 folder.jpg
-rw-rw-rw- 1 root     522    46760 2014-09-18 17:43 folder.par2
-rw-rw-rw- 1 root     522 26635780 2014-09-18 17:43 folder.vol000+200.par2

 

Food for thought.  Here's a log of what is possible by creating the set of par2 files per folder.

I.e., being able to repair a small file directly, or storing the par2 file somewhere else.

root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# par2verify folder.par2 
Loading "folder.par2".
Loaded 54 new packets
Loading "folder.vol000+200.par2".
Loaded 200 new packets including 200 recovery blocks
There are 26 recoverable files and 0 other files.
The block size used was 131244 bytes.
There are a total of 2000 data blocks.
The total size of the data files is 260566968 bytes.
Verifying source files:
Target: "01 The Diventa Project - Serenity (Lazy Hammock Mix).mp3" - found.
Target: "02 Stepo Del Sol - Fly Far Away.mp3" - found.
Target: "03 Stargazer - White Caps (Sylt Mix).mp3" - found.
Target: "04 Mazelo Nostra - Smooth Night.mp3" - found.
Target: "05 Billy Esteban - Dream.mp3" - found.
Target: "06 C.A.V.O.K - Night Flight.mp3" - found.
Target: "07 Nasser Shibani - Time Chase.mp3" - found.
Target: "08 Dave Ross - Solana.mp3" - found.
Target: "09 Gabor Deutsch - Rearrange (Feat. Harcsa Veronika).mp3" - found.
Target: "10 Ryan KP - Everythings Gonna Be Alright (Feat. Melody).mp3" - found.
Target: "11 Florzinho - Primavera (Dub Mix).mp3" - found.
Target: "12 Lazy Hammock - One of Those Days.mp3" - found.
Target: "13 Myah - Falling.mp3" - found.
Target: "14 Peter Pearson - I Need to Chill.mp3" - found.
Target: "15 Eddie Silverton - White Sand.mp3" - found.
Target: "16 Ingo Herrmann - Filtron.mp3" - found.
Target: "17 DJ MNX - Cosmic Dreamer.mp3" - found.
Target: "18 Michael E - Late Night Dreams.mp3" - found.
Target: "19 Francois Maugame - Like a Summer Breeze.mp3" - found.
Target: "20 Collioure - Perfect Resort.mp3" - found.
Target: "21 Leon Ard - Caribbean Dreams.mp3" - found.
Target: "22 Syusi - Bright Moments.mp3" - found.
Target: "23 Thomas Lemmer - Above the Clouds.mp3" - found.
Target: "24 Frame by Frame - Borderland.mp3" - found.
Target: "25 Pianochocolate - Long Long Letter.mp3" - found.
Target: "folder.jpg" - found.
All files are correct, repair is not required.


root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# rm -vi folder.jpg
rm: remove regular file `folder.jpg'? y
removed `folder.jpg'


root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# par2verify folder.par2 


Loading "folder.par2".
Loaded 54 new packets
Loading "folder.vol000+200.par2".
Loaded 200 new packets including 200 recovery blocks
There are 26 recoverable files and 0 other files.
The block size used was 131244 bytes.
There are a total of 2000 data blocks.
The total size of the data files is 260566968 bytes.
Verifying source files:
Target: "01 The Diventa Project - Serenity (Lazy Hammock Mix).mp3" - found.
Target: "02 Stepo Del Sol - Fly Far Away.mp3" - found.
Target: "03 Stargazer - White Caps (Sylt Mix).mp3" - found.
Target: "04 Mazelo Nostra - Smooth Night.mp3" - found.
Target: "05 Billy Esteban - Dream.mp3" - found.
Target: "06 C.A.V.O.K - Night Flight.mp3" - found.
Target: "07 Nasser Shibani - Time Chase.mp3" - found.
Target: "08 Dave Ross - Solana.mp3" - found.
Target: "09 Gabor Deutsch - Rearrange (Feat. Harcsa Veronika).mp3" - found.
Target: "10 Ryan KP - Everythings Gonna Be Alright (Feat. Melody).mp3" - found.
Target: "11 Florzinho - Primavera (Dub Mix).mp3" - found.
Target: "12 Lazy Hammock - One of Those Days.mp3" - found.
Target: "13 Myah - Falling.mp3" - found.
Target: "14 Peter Pearson - I Need to Chill.mp3" - found.
Target: "15 Eddie Silverton - White Sand.mp3" - found.
Target: "16 Ingo Herrmann - Filtron.mp3" - found.
Target: "17 DJ MNX - Cosmic Dreamer.mp3" - found.
Target: "18 Michael E - Late Night Dreams.mp3" - found.
Target: "19 Francois Maugame - Like a Summer Breeze.mp3" - found.
Target: "20 Collioure - Perfect Resort.mp3" - found.
Target: "21 Leon Ard - Caribbean Dreams.mp3" - found.
Target: "22 Syusi - Bright Moments.mp3" - found.
Target: "23 Thomas Lemmer - Above the Clouds.mp3" - found.
Target: "24 Frame by Frame - Borderland.mp3" - found.
Target: "25 Pianochocolate - Long Long Letter.mp3" - found.
Target: "folder.jpg" - missing.


Scanning extra files:
Repair is required.
1 file(s) are missing.
25 file(s) are ok.
You have 1998 out of 2000 data blocks available.
You have 200 recovery blocks available.
Repair is possible.
You have an excess of 198 recovery blocks.
2 recovery blocks will be used to repair.


root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# par2repair folder.par2  


Loading "folder.par2".
Loaded 54 new packets
Loading "folder.vol000+200.par2".
Loaded 200 new packets including 200 recovery blocks
There are 26 recoverable files and 0 other files.
The block size used was 131244 bytes.
There are a total of 2000 data blocks.
The total size of the data files is 260566968 bytes.
Verifying source files:
Target: "01 The Diventa Project - Serenity (Lazy Hammock Mix).mp3" - found.
Target: "02 Stepo Del Sol - Fly Far Away.mp3" - found.
Target: "03 Stargazer - White Caps (Sylt Mix).mp3" - found.
Target: "04 Mazelo Nostra - Smooth Night.mp3" - found.
Target: "05 Billy Esteban - Dream.mp3" - found.
Target: "06 C.A.V.O.K - Night Flight.mp3" - found.
Target: "07 Nasser Shibani - Time Chase.mp3" - found.
Target: "08 Dave Ross - Solana.mp3" - found.
Target: "09 Gabor Deutsch - Rearrange (Feat. Harcsa Veronika).mp3" - found.
Target: "10 Ryan KP - Everythings Gonna Be Alright (Feat. Melody).mp3" - found.
Target: "11 Florzinho - Primavera (Dub Mix).mp3" - found.
Target: "12 Lazy Hammock - One of Those Days.mp3" - found.
Target: "13 Myah - Falling.mp3" - found.
Target: "14 Peter Pearson - I Need to Chill.mp3" - found.
Target: "15 Eddie Silverton - White Sand.mp3" - found.
Target: "16 Ingo Herrmann - Filtron.mp3" - found.
Target: "17 DJ MNX - Cosmic Dreamer.mp3" - found.
Target: "18 Michael E - Late Night Dreams.mp3" - found.
Target: "19 Francois Maugame - Like a Summer Breeze.mp3" - found.
Target: "20 Collioure - Perfect Resort.mp3" - found.
Target: "21 Leon Ard - Caribbean Dreams.mp3" - found.
Target: "22 Syusi - Bright Moments.mp3" - found.
Target: "23 Thomas Lemmer - Above the Clouds.mp3" - found.
Target: "24 Frame by Frame - Borderland.mp3" - found.
Target: "25 Pianochocolate - Long Long Letter.mp3" - found.
Target: "folder.jpg" - missing.


Scanning extra files:
Repair is required.
1 file(s) are missing.
25 file(s) are ok.
You have 1998 out of 2000 data blocks available.
You have 200 recovery blocks available.
Repair is possible.
You have an excess of 198 recovery blocks.
2 recovery blocks will be used to repair.
Computing Reed Solomon matrix.
Constructing: done.
Solving: done.
Wrote 132623 bytes to disk
Verifying repaired files:
Target: "folder.jpg" - found.
Repair complete.


root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# ls -l folder.jpg


-rw-rw-rw- 1 root 522 132623 2014-09-19 02:20 folder.jpg


root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# md5sum -c folder.hash | grep folder.jpg
folder.jpg: OK


Two things Weebo:

 

1) I hope you plan on charging for this, because I know I'll pay and you'll deserve it

2) I hope you're in talks with LT to incorporate this natively as an official add-on (like Pro+) and let you get your beak wet.


I wasn't planning to charge for it.

 

It's a mlocate/slocate replacement using sqlite.

It's also a bit like Tripwire, in that it will detect changes (that you ask it to).

 

Later on I may add another inotify part of it that will update the locate database with hashes in real time.

For now I'm just trying to validate our files if a problem happens, and possibly detect/correct a 'small' problem should something go awry.

 


Well, that is very generous of you.  I would say at the least you should put up a webpage and offer it as donation-ware, especially if it is generalized beyond unRAID usability.

 

It may seem like "just" something to you, but you know damn well your time is worth something, and not everyone has the skills, time, or both to create their own solution. Believe me, there are things I do professionally (and personally) as well that are just no skin off my back but whose value other people recognize.

 

Anyway, I look forward to seeing what you come up with, maybe a little testing if you're looking for beta testers, and at the very least buying you one of those cups-of-coffee that was bandied about a few months ago.


For those keeping up to date with this thread.

 

I tested QuickPar on the Windows machine to verify the folder.par2 files I made with the command-line par2cmdline.

It works in verify mode.

 

So this is two wins.

We can have unRAID automated batch tools create folder.hash and folder.par2 files in each directory,

then use Windows tools to quickly validate these files, alongside some kind of whole-filesystem batch-validation function on unRAID.

 

So I have some questions for those considering use of tools like this.

 

I plan to write a different tool to sweep the filesystem like find (using ftw), and for each directory execute a program for directory protection/validation.

 

I.E.

generate a folder.hash (which can be verified with corz checksum and/or other unRAID tools)

generate a folder.par2 (which can be verified with par2verify and windows quickpar)

 

What is the maximum number of files you have in any one directory? I think the limit will be a little over 8000.
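One way to measure that yourself (a quick sketch: strip each file path down to its directory, then count; the demo tree just makes the pipeline self-contained; point find at /mnt/disk* for real use):

```shell
# Count files per directory and show the largest count.
rm -rf /tmp/cnt
mkdir -p /tmp/cnt/a /tmp/cnt/b
touch /tmp/cnt/a/1 /tmp/cnt/a/2 /tmp/cnt/a/3 /tmp/cnt/b/1
# sed strips the trailing /filename, leaving one directory per file.
top=$(find /tmp/cnt -type f | sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head -1)
echo "$top"
```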

 

What should I name these? Being the programmer, I'll think of something stupid like

 

ftwdirfolderpar2

ftwdirfolderhash

 

This can be done with find, xargs, md5sum, and par2, but I have uneasy feelings about feeding command-line filenames to md5sum/par2.

 

You have to sweep for directories, then for each directory sweep for files, build command lines, and run them.

There are all kinds of quoting issues that can arise.

Whereas if it's done in a program with opendir, an argument array, and execv of a program, those concerns do not exist the same way.

In other words, it's safer to prevent a bad filename from doing something bad on your filesystem (especially as root).
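For what it's worth, the shell route can be made quoting-safe with NUL separators end to end; the compiled opendir/execv approach is still cleaner, but as a sketch:

```shell
# Hash files with awkward names safely: -print0/-0 keeps filenames
# intact through the pipeline, with no shell re-parsing of the names.
rm -rf /tmp/qtest /tmp/qtest.md5
mkdir -p "/tmp/qtest/dir with spaces"
echo hi > "/tmp/qtest/dir with spaces/it's a file.txt"
find /tmp/qtest -type f -print0 | xargs -0 md5sum > /tmp/qtest.md5
check=$(md5sum -c /tmp/qtest.md5)
echo "$check"
```

Spaces and apostrophes in the names never pass through a shell word-split, which is the failure mode being worried about above.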

 

Thoughts?

Anyone have an example command line to find the maximum file count in directories?

Name suggestions?

Is it worthwhile having a local folder.hash and folder.par2 file in each directory?

 

While Corz uses the name of the directory as the .hash file, I chose to use folder.hash so a program can sweep the filesystem.

Wherever folder.hash exists, check if files are newer than folder.hash and rehash it. Or, if the directory mtime is greater than folder.hash's, verify and/or rehash it.  Same with the folder.par2 file (i.e., to catch deletes).
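That sweep logic can be prototyped in shell. A sketch of just the mtime comparison (the directory names here are made up for the demo; the real tool would rehash rather than merely report):

```shell
# Flag directories whose mtime is newer than their folder.hash.
rm -rf /tmp/sweep
mkdir -p /tmp/sweep/album
echo x > /tmp/sweep/album/track.mp3
( cd /tmp/sweep/album && md5sum *.mp3 > folder.hash )
sleep 1
touch /tmp/sweep/album/new.mp3        # adding a file bumps the dir mtime
stale=""
for h in /tmp/sweep/*/folder.hash; do
    d=$(dirname "$h")
    [ "$d" -nt "$h" ] && stale="$stale $d"   # dir newer than its hash file
done
echo "needs rehash:$stale"
```

Deletes are caught the same way, since removing a file also updates the directory mtime.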

