RFC: MD5 checksum/Hash Software



Has anyone given thought to running snapraid (snapshot) on top of unraid? Wouldn't that give the same sort of protection as par2?

 

I personally have not; a lot of my files change pretty often.

Then there are the others that change less often, but often enough that it would require a daily sync operation.

 

However, I think using SnapRAID as an additional check is a great idea.

 

In my particular case, I store data in large trees and tend to move them to archive areas.

So having folder-by-folder validation is important enough for me to build it.

 

I have no fewer than half a million MP3s in a few trees on a few mini unRAID servers.


Regarding SnapRAID: it does compile under Slackware.

 

So this has possibilities.

Thanks for the idea, Brit.

 

At this point, without a SlackBuild (unless someone finds one), making it available/useful isn't all that easy.

I have to read into this more to see how the parity functionality works; I don't have a spare SATA disk slot to test it out.

It might be worthwhile for alpha/beta/rc testers to help overcome/monitor the gotchas of a bad file system driver.

 

root@slacky:/mnt/disk1/cache/home/rcotrone/slacky/src/snapraid/snapraid-6.3# ./snapraid -h
snapraid v6.3 by Andrea Mazzoleni, http://snapraid.sourceforge.net
Usage: snapraid sync|status|scrub|diff|dup|pool|check|fix [options]

Commands:
  sync   Syncronize the state of the array of disks
  pool   Create or update the virtual view of the array of disks
  diff   Show the changes that needs to be syncronized
  dup    Find duplicate files
  scrub  Scrub the array of disks
  status Print the status of the array
  check  Check the array of disks
  fix    Fix the array of disks

Options:
  -c, --conf FILE         Configuration file
  -f, --filter PATTERN    Process only files matching the pattern
  -d, --filter-dist NAME  Process only files in the specified disk
  -m, --filter-missing    Process only missing/deleted files
  -e, --filter-error      Process only files with errors
  -p, --percentage PERC   Process only a part of the array
  -o, --older-than DAYS   Process only the older part of the array
  -i, --import DIR        Import deleted files
  -l, --log FILE          Log file. Default none
  -a, --audit-only        Check only file data and not parity
  -h, --pre-hash          Pre hash all the new data
  -Z, --force-zero        Force synching of files that get zero size
  -E, --force-empty       Force synching of disks that get empty
  -U, --force-uuid        Force commands on disks with uuid changed
  -D, --force-device      Force commands on disks with same device id
  -N, --force-nocopy      Force commands disabling the copy detection
  -F, --force-full        Force commands requiring a full sync
  -s, --start BLKSTART    Start from the specified block number
  -t, --count BLKCOUNT    Count of block to process
  -v, --verbose           Verbose
  -H, --help              Help
  -V, --version           Version

 

 

I'm still moving forward with my per-folder .hash and .par2 variants.


Regarding SnapRAID: it does compile under Slackware.

 

So this has possibilities.

Thanks for the idea, Brit.

 

At this point, without a SlackBuild (unless someone finds one), making it available/useful isn't all that easy.

 

 

Docker: SnapRAID inside a Docker container with /mnt/disk# or /mnt/user mapped through.

 

Though one would need the ability to pass commands from unRAID into the Docker container for execution, unless SnapRAID has a web UI too.


Regarding SnapRAID: it does compile under Slackware.

 

So this has possibilities.

Thanks for the idea, Brit.

 

At this point, without a SlackBuild (unless someone finds one), making it available/useful isn't all that easy.

 

 

Docker: SnapRAID inside a Docker container with /mnt/disk# or /mnt/user mapped through.

 

Though one would need the ability to pass commands from unRAID into the Docker container for execution, unless SnapRAID has a web UI too.

 

There are some things that lend themselves well to Docker; I'm not sure this is a candidate.

 

 

Frankly, for my low-level apps, I doubt I'll use Docker. I'm not totally sold on the technology for my administrative applications.

Perhaps I don't know enough about Docker.

I get paid to do VMs and administrative programming; I don't gain anything from Docker yet.

 

If it came down to a whole packaged solution with Webmin, I would consider a Docker solution.

For now I don't have the spare hardware or time to play with Docker.


...

 

Thoughts?

Anyone have an example command line to find the maximum file count in directories?

Name suggestions?

Is it worthwhile having a local folder.hash and folder.par2 file in each directory?

 

...

 

By chance, I was looking for this just the other day!  This command has worked for me:

 

find DIR_NAME -type f -print | wc -l 
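That counts the whole tree as one number; if what you're after is the largest per-directory count, something along these lines (GNU find assumed, counting only the files directly inside each directory) should work:

# Print "count <tab> directory" for every directory, largest counts first.
find DIR_NAME -type d -print0 | while IFS= read -r -d '' d; do
    printf '%s\t%s\n' "$(find "$d" -maxdepth 1 -type f | wc -l)" "$d"
done | sort -rn | head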

 

Following this thread, awesome stuff Weebo!


I'm making good headway so far.

 

I have squpdatedb and sqlocate working.

I found a new set of MD5 routines that are a little faster.

The cool thing about this module is that it's SSL-compatible. With one switch I can compile the MD5 code in internally, albeit a little slower.

With another switch I can compile it to require the SSL MD5 routines, and then it's as fast as the current GNU coreutils md5sum.

 

An improvement with the embedded function is that I can drop the cache for the file that was just read for the MD5.

This means it helps prevent directory entries from being pushed out of the buffer cache. I haven't tested it fully.

But for smaller files, which do not overrun all of memory, it does work, and it works well.

 

There's still more to do, such as allowing the calling of any external hashing application that people want to use.

As long as it returns a standard hash line of

 

 

hexdigits(space)(space)full path of file

 

It can be imported.
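For example, the stock coreutils hashers already emit exactly that two-space layout, so their output should import as-is (the paths here are only illustrative):

# Both commands produce "hexdigits  /full/path" lines in the standard coreutils format.
find /mnt/disk1/Music -type f -exec md5sum {} + > /tmp/music.md5
find /mnt/disk1/Music -type f -exec sha256sum {} + > /tmp/music.sha256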

 

What I've been curious about is how people see this kind of hash database being used.

I can export it via locate into a format that can be piped into the hashing application.

 

This is pretty routine.

You can also locate files with the built-in grep.

 

 

What I'm curious about is what time fields people might need.

Currently I store the max of (change time vs. modification time), plus an update time (when the record was last updated in the database).

I was thinking of adding a hash-verified time; this way you can automatically schedule a verification of each file every 7, 15, 30, or 60 days by some calculation.

 

The purpose of the database in itself is to avoid rehashing the files over and over if they haven't changed.

But to detect bitrot or some other kind of corruption, you need to re-verify them without overwriting them (unless it is intentional).
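As a sketch of what that re-verification pass could look like: the table and column names below are purely illustrative (the real schema is whatever squpdatedb ends up using), but the plumbing is stock sqlite3 and coreutils.

# Pick files whose last verification is more than 30 days old and re-check them.
# The two-space separator makes the output directly consumable by md5sum -c.
sqlite3 -separator '  ' /boot/hash.db \
  "SELECT md5, path FROM files WHERE hash_verified < strftime('%s','now') - 30*86400;" \
  | md5sum -c - | grep -v ': OK$'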

 

Thoughts?

  • 1 month later...

In case you are still interested in these comments, here goes.

 

What I'm curious about is what time fields people might need.

Currently I store the max of (change time vs. modification time), plus an update time (when the record was last updated in the database).

I was thinking of adding a hash-verified time; this way you can automatically schedule a verification of each file every 7, 15, 30, or 60 days by some calculation.

 

The purpose of the database in itself is to avoid rehashing the files over and over if they haven't changed.

But to detect bitrot or some other kind of corruption, you need to re-verify them without overwriting them (unless it is intentional).

 

Thoughts?

 

I would be interested in the file's modification time at the time of the last hash scan, and the date and time of the last hash scan. A scan should probably keep both the old and the new checksum of the file if the file has changed but the file's modification timestamp has not. If the file's mod-timestamp has changed, I just want the new timestamp.

 

It sounds interesting with .par2 or .md5 files per directory, though I'm not sure I would use it; I'm generally against files being littered all over in my folders, as I always browse with hidden files visible (in Total Commander, Windows). I might very well enable .par2 files for some specific directories.

 

Best Alex


In case you are still interested in these comments, here goes.

 

What I'm curious about is what time fields people might need.

Currently I store the max of (change time vs. modification time), plus an update time (when the record was last updated in the database).

I was thinking of adding a hash-verified time; this way you can automatically schedule a verification of each file every 7, 15, 30, or 60 days by some calculation.

 

The purpose of the database in itself is to avoid rehashing the files over and over if they haven't changed.

But to detect bitrot or some other kind of corruption, you need to re-verify them without overwriting them (unless it is intentional).

 

Thoughts?

 

I would be interested in the file's modification time at the time of the last hash scan, and the date and time of the last hash scan. A scan should probably keep both the old and the new checksum of the file if the file has changed but the file's modification timestamp has not. If the file's mod-timestamp has changed, I just want the new timestamp.

 

Best Alex

 

I've considered the time fields.

 

mtime is already stored.

Update time is stored, i.e. any time the record is updated this time is updated.

I was going to use update time for the update/hash verification time.

I may split it out into a separate hash time.

 

hashtime will be a new field recording when the hash is changed or verified as matching.

 

I'm not sure of the use of saving the old hash.

 

FWIW, the locate command is capable of exporting the data as a list of files.

It can print output like:

 

/bin/ls -1

/bin/ls -l

/bin/stat

 

and finally, output like that of md5sum:

 

hash filename

 

This lets you export the hashes of some matching file expression at the current point in time.
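As a concrete (hypothetical) use of that export: the md5sum-style output can be checked directly by coreutils. The --format flag below is only my guess at an interface, not a documented option of the tool:

# Export stored hashes for files matching an expression, then verify them on disk.
sqlocate --format=md5sum '/mnt/disk1/Music/*' > /tmp/music.md5   # hypothetical flag and pattern
md5sum -c /tmp/music.md5 | grep -v ': OK$'                       # print only files that no longer match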

 

 

 

It sounds interesting with .par2 or .md5 files per directory, though I'm not sure I would use it; I'm generally against files being littered all over in my folders, as I always browse with hidden files visible (in Total Commander, Windows). I might very well enable .par2 files for some specific directories.

 

I thought the same way at first about files littered all over.

However, when I thought about it: in my MP3 directories I have MP3s, a playlist, and a folder.jpg.

If I have a folder.hash, I can use the corz checksum utility in a flash.

If I have a folder.par2, I can use one of the Windows par2 GUIs to validate the files.

 

Storing all the hashes either in text (an md5sum file), gdbm (key/value), or SQLite is doable.

There's a limit, though, to how many files you can use with the par2 command.

Since corz limits hashes to a folder, it makes sense to do the same with the par2 for that added ability to recover.

Years ago I tried doing par2 on a huge set of files and it caused all sorts of OOM issues with unRAID.

It's not feasible in my array, which has over a million files.


FWIW, I have 3 versions of hash storage I've been working with.

 

1. The SQLite updatedb/locate command.

 

2. A suite of tools to do this with .gdbm files.

.gdbm files are very fast for key/value pairing, and it's a lot smaller than the SQLite variant.

I have been testing this for my own version of the cache_dirs program.

The goal was a C-based caching initiator that would also catalog the files into the .gdbm file.

When files change, they are updated in the .gdbm and written to stdout, which can then be used as a seed to update the md5sums immediately or at some scheduled time. Then you can re-import the smaller subset of md5sums back into the .gdbm.

 

3. A suite of tools that manage hash values in a file's extended attributes.

It's very similar to the bitrot shell script, only I do it in C for speed.

The export creates a file that can be used by md5sum.

The import reads the md5sum file into the extended attributes.

The hashfattr command works like find, md5sum and setfattr all in one.

The delay in its release has been in supporting external hashers.

I've been playing with functionality to allow use of an external hasher like this: --hash-exec '/bin/b2sum -a {}'

so every file found calls this program, whose output is piped back into the hash tool and then stored.
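The same idea can be tried by hand with stock tools; user.md5 is just an attribute name picked for this example (not necessarily what hashfattr uses), and the path is made up:

# Store an MD5 in an extended attribute, then verify it later (the filesystem must support xattrs).
f='/mnt/disk1/Music/album/track01.mp3'
setfattr -n user.md5 -v "$(md5sum < "$f" | awk '{print $1}')" "$f"

# Later: recompute and compare against the stored value.
stored=$(getfattr --only-values -n user.md5 "$f")
current=$(md5sum < "$f" | awk '{print $1}')
[ "$stored" = "$current" ] && echo "OK: $f" || echo "MISMATCH: $f"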

 

It's all very similar and easy to do in bash.

In my case, the overhead of spawning all these extra helper processes on a million files really makes the process longer.

Plus it's an exercise for me in programming.

 

 

I really like the extended attributes method. If you use rsync -X to move files to another server, it preserves the attributes.

Thus you can jump onto another server and verify the hash.
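For example (host and paths are placeholders):

# -a keeps permissions and timestamps; -X carries the user.* extended attributes to the destination.
rsync -aX /mnt/disk1/Music/ backupserver:/mnt/disk1/Music/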

 

 

However, if the file is re-created (as when copied/moved by Windows), the extended attribute is lost.

Thus the reason to cache this data elsewhere as an exportable md5sum file, .gdbm file, or SQLite table.

 

 

I'm making headway; however, I'm now in the process of updating some hardware because the HP MicroServer N40L just takes way too long to test these programs on large datasets. It takes days on end just to md5deep one drive of over 300,000 files.

 

Hence you can see why it's important for me to create functionality to:

1. seed the data

2. only do updates on files that actually change with mtime/size.

3. verify files on demand based on some manageable set of rules.

 

 

Just traversing a drive with 300,000 files takes 30 minutes if the data is not cached.

That's why I started working on my own cache_dirs, which tracks what files change by mtime/size.

 

 

And finally... I'm going to experiment with the newer Seagate hybrid drives to see if they have any impact on caching the directory structures in the MLC/SLC cache. I figure that at the very least, if I continually drop the cache and rescan the drive a few times, it should, in theory, cache the directory LBAs, thus lessening the time it takes to walk a directory tree.

 

 

So while it's been quiet here, I've been hard at work.


I'm not sure of the use of saving the old hash.

 

I'm only interested in the old hash when I fear bitrot, that is, when the timestamp of the file is unaltered. I'll probably have up to thousands of file changes on a monthly basis, and would likely never be interested in the old hashes except when fearing bitrot. If I fear bitrot, I would restore from CrashPlan or an out-of-home backup and check against the old hash, I would think.

 

 

This lets you export the hashes of some matching file expression at the current point in time.

 

 

Awesome

 

I thought the same way at first about files littered all over.

However, when I thought about it: in my MP3 directories I have MP3s, a playlist, and a folder.jpg.

If I have a folder.hash, I can use the corz checksum utility in a flash.

If I have a folder.par2, I can use one of the Windows par2 GUIs to validate the files.

 

Storing all the hashes either in text (an md5sum file), gdbm (key/value), or SQLite is doable.

There's a limit, though, to how many files you can use with the par2 command.

Since corz limits hashes to a folder, it makes sense to do the same with the par2 for that added ability to recover.

Years ago I tried doing par2 on a huge set of files and it caused all sorts of OOM issues with unRAID.

It's not feasible in my array, which has over a million files.

 

Sounds like you've thought about the negative consequences of one big par. I think it's a great possibility to have it add .par2 files recursively to some folders, but I don't want it to do that to all folders. For instance, I probably don't want my Git workspace infiltrated with .par2 files; I could make a .gitignore, but would prefer not to. I suppose you are still planning an SQL database underneath?

 

Best Alex

 

PS: I'm looking forward to it :)


The folder.hash / folder.par2 generator would be a separate program.

 

I.e., I have a separate utility that walks down the tree.

If the directory is newer than the folder.par2 or folder.hash, both of them are re-created.

 

It will not be attached to the other suite of tools. It's a separate tool.

I suppose we could talk about how to set up the ignore function.

I.e., when I walk a tree, if there is a .prognameIgnore file then I would skip that directory.

 

I'm open to suggestions on the filename.

 

 

In thinking about this tool: if the file is 0 bytes, ignore the whole directory, i.e. assume .*.

If it's > 0 bytes, read it as a list of expressions to ignore, e.g. *.o, *~, etc.

 

In any case, it's a separate suite of tools:

one program to walk a tree, comparing mtimes against folder.hash and folder.par2,

and another program to read the files in the directory and create the respective files.

 

This can probably be done with find/xargs and a shell. It will work, but all of the external processes add up to a lot of overhead.

Then you have the issues with filenames and quoting.

By doing it in C, I can build a 'safer' argument array and fork/exec the programs without concern about the shell interpreting the names.

So while all of this can be done in bash with existing tools, I'm paranoid enough about filenames and quoting to do it all in C.
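For reference, the find/xargs-and-shell version being described looks roughly like this; the redundancy level and exclusions are illustrative, and the null-terminated names sidestep most (not all) of the quoting worries:

# Rebuild folder.hash and folder.par2 in any directory that is newer than its folder.hash.
find /mnt/disk1/Music -type d -print0 | while IFS= read -r -d '' dir; do
    if [ ! -e "$dir/folder.hash" ] || [ "$dir" -nt "$dir/folder.hash" ]; then
        ( cd "$dir" || exit
          find . -maxdepth 1 -type f ! -name 'folder.hash' ! -name 'folder*.par2' -print0 \
              | xargs -0 -r md5sum > folder.hash
          find . -maxdepth 1 -type f ! -name 'folder.hash' ! -name 'folder*.par2' -print0 \
              | xargs -0 -r par2 create -r10 -qq folder.par2
          touch folder.hash )   # bump folder.hash past the directory mtime the new files just changed
    fi
done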


I'm not sure of the use of saving the old hash.

 

I'm only interested in the old hash when I fear bitrot, that is, when the timestamp of the file is unaltered. I'll probably have up to thousands of file changes on a monthly basis, and would likely never be interested in the old hashes except when fearing bitrot. If I fear bitrot, I would restore from CrashPlan or an out-of-home backup and check against the old hash, I would think.

 

 

What I was planning is to check mtime/size; if they have not changed, no hash is calculated unless you run a --verify on purpose.

Even then the hash is not updated if it differs.

 

 

The only time the hash is updated is when mtime/size has changed, or when a forced --verify-update (or possibly a --force switch) is used.

 

 

If you had to force an update, there is a --delete command for squpdatedb which will delete the record, allowing the next invocation to reinsert the data.

  • 4 weeks later...

While I know it's been quiet, I've been continuing to work on hashing database tools.

What came to mind recently is the ability to tag specific directory-tree scans with labels.

 

We do have user shares, which would show up in the paths and be searchable, but I'm thinking more about foreign disks: disks that are mounted temporarily.

 

In my case, I have a box of older 1TB hard drives I'm trying to consolidate.

I found myself doing an md5deep -r down the tree to a file named according to the drive's model and serial.
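That workflow looks roughly like this, assuming md5deep and a reasonably recent lsblk (the device and mount point are made up):

# Hash a temporarily attached drive into a list named after its model and serial number.
disk=/dev/sdj                                    # the foreign disk, mounted at /mnt/foreign
id=$(lsblk -dno MODEL,SERIAL "$disk" | tr -s ' ' '_')
md5deep -r /mnt/foreign > "/boot/hashes/${id}.md5"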

This gave me the idea that if I added a label field (any text string you define), the squpdate/locate database functionality would become a wider catalog. Using the drive's serial number, or any other identifying string, you could then search for these files in one locate front end.

 

This would make the database larger, but gives you new functionality.

I suppose if everything has an md5 or other hash, then duplicates can be found across a whole archive.
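As a quick illustration, duplicates can already be pulled out of a set of exported md5sum lists with nothing but coreutils (the paths are examples):

# List every hash that appears more than once, then show the files carrying those hashes.
awk '{print $1}' /boot/hashes/*.md5 | sort | uniq -d > /tmp/dup.hashes
grep -hF -f /tmp/dup.hashes /boot/hashes/*.md5 | sort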

 

Currently the squpdate/locate tools allow you to specify a database, so it doesn't all have to be in one database.

I figured I would throw it out there to see what boomerangs back on the idea of a string label as an additional filter.


It sounds good to be able to locate files on external, non-attached drives, though I would not use it, as my Tower is in the 'attic'. I'm not certain what you mean, but I think you are suggesting creating a manually configured label (a chosen name) for an external hard drive, which your software then maps to its actual 'serial'/identifier number, right?

 

It's a bit difficult to keep up and come up with ideas when I don't have the software to work with. Ideas come from playing with stuff and experiencing it. It's harder to come up with brilliant new ideas when I have to imagine everything, because of the lack of new information: information that could come from actually playing with the software and seeing how it works. I know the software is not yet done, but if you could send out something that could be played with, it would help generate ideas. Ideas you don't necessarily have to implement, but which might spark new good ones. I wouldn't mind having to rescan everything once in a while on new updates if the database changes; I do a full monthly rescan anyway with my current very basic md5sum-to-files strategy. This is not really meant as the nag 'give me that beta already!', but as an excuse for not being very helpful to you :)

 

Best Alex

 

 

 

 


No, md5deep didn't work for me; it crashed. I don't quite remember if I tried hashdeep, but my memory tells me it's the same thing under a different name (?). So I use md5sum with the find command, and I run one command per disk, one on each CPU core.

find /mnt/${DISK} -type f -exec md5sum {} \; > $MD5DIR/MD5_${DATESTAMP}_${DISK}.md5 
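The per-core part can be approximated by backgrounding one job per disk, along these lines (the disk list, MD5DIR, and the datestamp format are placeholders for whatever the script already uses):

# One md5sum pass per data disk, run in parallel; "{} +" batches files into far
# fewer md5sum invocations than "\;" does.
DATESTAMP=$(date +%Y%m%d)
MD5DIR=/boot/md5
for DISK in disk1 disk2 disk3; do
    find "/mnt/${DISK}" -type f -exec md5sum {} + > "${MD5DIR}/MD5_${DATESTAMP}_${DISK}.md5" &
done
wait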

I got most of the program from this forum, I think, and I'm not sure who to credit. Or I made it myself from bits and pieces, I'm not really sure.

 

I have attached the file.

 

Best Alex

md5_array.zip


...

I found myself doing an md5deep -r down the tree to a file named according to the drive's model and serial.

This gave me the idea that if I added a label field (any text string you define), the squpdate/locate database functionality would become a wider catalog. Using the drive's serial number, or any other identifying string, you could then search for these files in one locate front end.

...

 

I am new to this thread, but this comment caught my eye as it's not that far from what I do for cataloging.

 

I put a sticker with a number on the side of every disk I own, and in the root of that drive I create a file called number.key. This file can contain some info on the disk, but that's not relevant here. Then I have a Perl script that essentially does an ls -R to a text file named after the disk ID and stored elsewhere. The reason this is Perl is that I can teach it to understand XBMC movie.nfo and tvshow.nfo files and place the TVDB ID or the IMDB ID in the catalog as well. This is just handy, but it shows how similar an md5 catalog and a general file-collection catalog are.

 

Currently I prefer text files to a database for the catalog, as it allows me to use the power of the shell.

 

Food for thought about where this could perhaps evolve.

