Help Wanted - Script to create and check checksum files


Recommended Posts

Hi,

 

On my Windows platforms, I have for a long time employed a very simple .bat file to create SFV files with checksums for all files in a directory. I would use a sweep command to do this recursively for all subdirectories. With a switch, the bat file would check the SFV files and output any errors to a file in the root of the drive.

 

This allowed me to detect any errors when transferring files from one drive to a new one – and I have to say I've often needed to re-transfer a bunch of files. Yes, I used to be an NF4 owner - omg that chipset sucked (I will never own an nVidia chipset-based product again). Before that, about 10 years ago, I had something else that also generated the occasional bit error.

 

After having experienced my first parity error after a disk rebuild, I'd like to set up a similar system on my unRAID box – this would enable me to go in and see exactly which files, if any, have been impacted by any parity errors I might encounter in the future. Please don't make this topic about those parity errors – but feel free to comment on them over here if you like. Let's keep this topic on creating a suitable script.

http://lime-technology.com/forum/index.php?topic=12884.0

 

Since I'm still very green in the Linux world, I thought I'd ask if tools for this already exist – and if not, perhaps get some pointers on how to go about recreating at least the functionality I had with my previous system, and maybe some more as well.

 

Required features:

- Create an sfv or md5 file containing checksums for all files in each directory, recursively for all subdirectories

- Check sfv/md5 files recursively, outputting any errors along with the full path and filename of the failing file to a specified log file

 

Nice-to-have features:

- Add new files to existing sfv/md5 files, recursively

- Detect missing files and log a warning with the relevant path, filename and size in the log file

- Print a warning if the user tries to create sfv/md5 files on a disk share (because this could result in duplicate files in the unRAID filesystem, filling up your syslog and not achieving what you want if you ever want to check a user share)

 

Obscure super-paranoid extra-nice-to-have features:

- After creating checksum files, perform some step to flush the cache (e.g. cp a large amount of data to /dev/null), then check the created checksum files to make sure the checksums are correct

- After adding new files to existing checksum files, perform the same cache flush, then check the updated checksum files to make sure the checksums are correct

- In case errors are detected during a check, perform the same cache flush and then re-check
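
(As an aside, on Linux the cache flush can apparently be done more directly than copying data to /dev/null. A minimal sketch, assuming it is run as root on the unRAID console:

sync                                  # flush any dirty pages to disk first
echo 3 > /proc/sys/vm/drop_caches     # drop the page cache (plus dentries/inodes) so the re-check reads from the platters

After that, re-running the checksum verification should read the files from disk rather than from RAM.)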

 

Any help in terms of pointers or actual code will be greatly appreciated, and if nothing like this exists, I’d like to think it might benefit other users as well.

 

Thanks!

 

Link to comment

Yeah, I wanted to work on something like this also.  Especially when copying from my computer or another server to the unRAID.  Generate an md5 pre-copy and send the two files over, and then check that md5 against the copied file.  I'd be willing to work on this a bit.  I'm not a Bash expert, but I work a bit with Perl.  I know there is an md5 plugin in the unRAID package manager.  I haven't used it at all though.
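
Something along these lines, perhaps (a rough sketch of the pre-copy/post-copy idea; the file names are made up):

# on the source machine, before copying
md5sum bigfile.mkv > bigfile.mkv.md5

# copy both files over, then on the unRAID side, in the destination directory
md5sum -c bigfile.mkv.md5        # prints "bigfile.mkv: OK" or "bigfile.mkv: FAILED"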

Link to comment

Cool. Yes, there is an 'md5sum' command in unRAID and I'm fine with using that since it comes built in; also, technically I guess MD5 is a bit more robust than an SFV-style CRC32 since the hash is longer.

 

My problem is that I'd have to figure out the commands to sweep the subdirectories, and as for all the nice-to-have features, I wouldn't know where to start.

Link to comment

To start, you could use the 'find' command to recurse through directories and, for each '-type f' (regular file), '-exec' md5sum on that file to generate the sum.  Output that sum to a file with the same name as the file, with a .md5 extension, for example.

 

Then you could create a script that runs nightly, weekly, whatever--to match the md5 against each file, and if the md5 doesn't exist, create a new one.  If it doesn't match, write out to an error file, and maybe email that error file to you.  I'm just letting my mind flow, so I may be missing some things, but that's the basic idea for the first part at least.
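
Something like this, perhaps (a rough sketch of that per-file approach; the share path and log location are only examples borrowed from this thread):

# create one .md5 next to every file, skipping existing .md5 files
find /mnt/user/TV -type f ! -name '*.md5' -exec sh -c 'md5sum "$1" > "$1.md5"' _ {} \;

# verification pass: --quiet prints only mismatches; collect them in a log outside the share
find /mnt/user/TV -type f -name '*.md5' -exec md5sum -c --quiet {} \; > /mnt/disk2/md5-errors.log 2>&1

Because md5sum records the path exactly as it was given, the absolute paths above let the check run from anywhere.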

Link to comment

That would result in too many files, which is not to my liking. I'm looking for a single checksum file in each subdirectory.

 

I will have to look over the features of the md5sum command, and on second thought, it might make sense to check whether other tools exist that support some of the nice-to-have features on my list before settling on md5sum.
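
For reference, the "single checksum file in each subdir" idea could apparently be sketched with plain md5sum like this (checksums.md5 is just a name I made up, and the share path is an example):

# write one checksums.md5 per directory, covering only the files directly in that directory
find /mnt/user/TV -type d | while IFS= read -r dir; do
    ( cd "$dir" || exit
      find . -maxdepth 1 -type f ! -name 'checksums.md5' -exec md5sum {} + > checksums.md5 )
done

Directories with really unusual names (embedded newlines) would need a more careful loop, but this shows the shape of it.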

Link to comment

I will get back here with my solution. The status now is that I have md5deep and hashdeep working. I installed the slightly outdated 3.6 version from unmenu by first installing the cdd compiler, also with unmenu - just follow the instructions and push the buttons. I'm generating hash files now with hashdeep by running:

cd /mnt/user/TV
nohup hashdeep -r * >/mnt/disk2/hashdeep-tv.log

 

The nohup will allow you to close the terminal window and let it run overnight. Make sure the log file is not in the same share being hashed, as that will make it choke on itself (hashing its own output).

 

I haven't gotten around to testing (auditing) yet, and I'm also not 100% sure this is what I want, as it outputs a single log file for all subdirectories (as opposed to one per directory). But I saw mention of a feature that sounded like it would be able to recognize files that have moved directories - if that is true and working, it might negate my wish for log files per subdirectory.

 

The reason I'm using hashdeep is that the hashdeep output includes the file size in addition to the checksum and filename. It actually also outputs more than one checksum per file (default MD5 plus SHA256 or something like that). The md5deep package installs both md5deep and hashdeep.

 

Link to comment

Following up on this, I have familiarized myself with the matching, negative matching and audit modes of hashdeep and found one major shortcoming with all 3 modes. Files whose content (and therefore checksum) has changed are not distinguishable from new files added after the hash file was generated.

 

A bit of background here: the whole philosophy in hashdeep seems to be that files are identified by their hashes. No matter which folder or filename a file resides under, if it matches both of the hashes recorded for a file in the hash file (generated with the command in my previous post), then it is considered the same file, and the program will report that the file was found and whether or not it has moved location/filename. The flip side of this is that if a file changed content but retained the same filename and path, it is treated as a new file (the fact that it has the same filename and path as an existing entry is ignored). The consequence is that files that have changed content (fail the checksum check) are reported as "no match" in the audit report, which is exactly the same as if the file were a new file added to the directory structure after the hash file had been generated.

 

So this is as far as I've come with hashdeep.

 

Generate a hash file covering all folders in the "TV" share:

cd /mnt/user/TV
nohup hashdeep -r -l * >/mnt/disk2/hashdeep-tv.log

Options used are "-r" = Recurse dirs and "-l" = Use relative paths

(Relative paths allow you to audit the files after they've been copied or moved to a new folder or share without all of them being reported as moved files - that list can be quite distracting...)

 

Audit files (either in a new location or at the same location) against the previously generated hashfile for all of the "TV" share's contents:

cd /mnt/user/TV  (or cd /mnt/user/TV-BACKUP)
nohup hashdeep -a -r -l -k /mnt/disk2/hashdeep-tv.log -v -v * >/mnt/disk2/hashdeep-tv.audit

Options are

            -a = Audit Mode

            -r = Recurse dirs

            -l = Use relative paths

            -k = location of hashfile

            -v = verbose (used twice, to print which files fail instead of only stating the audit has failed)

 

In both cases, nohup can be omitted if you want to keep the console window open while hashdeep is working (without the nohup, hashdeep will terminate when the console window is closed).

 

As mentioned, the resulting .audit report does not distinguish between files that have changed and files that were added to the "TV" share after the hashfile.log was generated, so any file failing its checksum will show up just the same as all the new TV shows added since the hash file was generated.

 

Back to the drawing board!

 

Link to comment

I had a bit of a look at md5deep - it's got its own shortcomings. One can generate hashes with:

nohup md5deep -r -l * >/mnt/disk2/md5deep.log

 

The log file can then be used with md5deep in either the matching or negative matching mode. Negative matching mode sounds nice at first - get a list of all the files that didn't match their md5 checksums - but it has the same downside as hashdeep: it can't distinguish between new files and those that have changed contents (both simply fail to produce a match against the md5deep.log file).

 

The negative matching mode can also be used with the -n switch, which sounds nice because now we'll get a list of all the md5 checksums in the md5deep.log file that no match was found for. The problem here is you can't distinguish between TV shows you've deleted and TV shows that fail their checksum (in both cases, there will be an entry in the md5deep.log file with a corresponding md5sum, but no match is found in the folder being scanned).

 

At this point I'm decidedly frustrated that none of these tools are able to clearly alert me to files which have retained their filename and relative path, but that have changed content.
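
For what it's worth, plain md5sum's check mode does seem to make exactly this distinction, provided the checksum list is kept in md5sum's own "hash  path" format (for example one produced with the find/md5sum commands earlier in the thread; tv.md5 below is a hypothetical list):

cd /mnt/user/TV
md5sum -c /mnt/disk2/tv.md5
# changed content -> "file.mkv: FAILED"
# missing file    -> "file.mkv: FAILED open or read" (plus a "No such file" warning)
# new file        -> simply not mentioned, since it has no entry in the list

The trade-off is that md5sum -c knows nothing about new or moved files, so it only covers the "changed in place" case.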

Link to comment

What are your requirements?

What is the goal?

 

I'm building a hybrid tool to do 2 things.

 

1. Catalog every file on the array to a database with its respective stat (ls -l) information.

   This is to assist with my own version of the locate command.

   If I want to find a file, I call locate with a regular expression and it will be listed.

   I use slocate now, but it stats every file it finds, which causes the array to spin up.

   cache_dirs cannot help me, I have way too many "small" files.

 

2. I'm going to md5sum each file into a field in the database.

   This will allow me to extract two fields, hash and name, to build an md5sum file for double-checking.

 

With the stat information and md5sum, I can tell if a file changed or does not pass an integrity check.

I can also find files that might be missing. Doubt I will care if they are moved.

They will show up as missing and then new and I'll make a determination.

 

My use is to find files and, if a problem occurs, verify that those files pass validation, to ensure that an fsck or parity sync did not corrupt them.

 

At some point I was planning to extract the path and create a symlink to each file, with the hash as the symlink's name.

This directory of symlinks could then be used to make par2 files for other integrity checks.
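
Roughly something like this, perhaps (the directory and list names are made up for illustration, and it assumes a "hash  path" list has already been extracted from the database):

LINKDIR=/mnt/cache/hashlinks            # flat directory of hash-named links (example location)
mkdir -p "$LINKDIR"
while read -r hash path; do
    ln -s "$path" "$LINKDIR/$hash"      # link name = hash, target = original file
done < /mnt/disk2/hashes.md5            # example extract in "hash  path" format

# the flat directory could then be fed to par2, e.g.:
# par2 create -r10 "$LINKDIR/array.par2" "$LINKDIR"/*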

 

All of this would be stored in a sqlite table with a tool to update the table and locate the files.

Probably a tool to export in md5sum format, and another to allow direct SQL commands against the database.

 

The reason to store the stat information is to refrain from md5sum calculations unless I know for sure the file changed.

So size and ctime will be used for verification, and if they change, a new md5sum will be calculated and stored.

 

The SQLite part is pretty nifty; I learned how to make a bash extension which allows you to include SQLite commands as part of a shell script pretty easily.
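
Even without the bash extension, the stock sqlite3 command-line tool is enough to sketch the idea (the table layout and paths here are hypothetical, just to illustrate the fast stat phase vs. the slow hash phase):

DB=/mnt/cache/filedb.sqlite     # example database location

sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, size INTEGER, ctime INTEGER, md5 TEXT);'

# fast phase: compare stored size/ctime; slow phase: hash only when they have changed
find /mnt/user/TV -type f | while IFS= read -r f; do
    esc=${f//\'/\'\'}                                     # escape single quotes for SQL
    size=$(stat -c %s "$f"); ctime=$(stat -c %Z "$f")
    stored=$(sqlite3 "$DB" "SELECT size || ' ' || ctime FROM files WHERE path='$esc';")
    if [ "$stored" != "$size $ctime" ]; then
        md5=$(md5sum "$f" | awk '{print $1}')
        sqlite3 "$DB" "INSERT OR REPLACE INTO files VALUES ('$esc', $size, $ctime, '$md5');"
    fi
done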

 

In any case, I would like to learn more about your goals and requirements to see if the needs intersect.

 

I can do all of the database as .gdbm files pretty easily, but I'm not sure many people would be that familiar with them.

 

A cool thing about having the stat and hash information duplicated: if you do an fsck and there are files in lost+found, you may be able to identify them, their original location, and the original permissions to reset on the files.

 

When 5.x comes out, I'll have a search tool that will scan the locate database, return an ls -l of the file and a hyperlink to the file so you can access it.

Link to comment

What you're working on there does sound cool. Way cool.

 

As for my wishes, they do go a bit further - but paradoxically, I have been served well by a much simpler system up until now.

 

I had sort of sketched out some requirements in the first post, but on a higher level, here is what I'd like to achieve:

 

1)  Be able to detect changes to files that stay in place

2)  Be able to detect changes to files after copying or moving to another folder or server (without needing to look at 12000 lines of new files and missing files after moving to a new server with different share or disk names)

3)  Be able to detect changes to files cross-system (after copying files to Windows or other operating systems)

 

If I'm reading your project right, it will be able to achieve the top one.

 

The system I had covered at least the first two, with the potential to write something for unRAID and cover the last one as well. It was so painfully simple I'm almost embarrassed to share it here - it just executed a batch file built around sfv32.exe for DOS under Windows 98, 2K and XP. For files stored on unRAID, it would work across the network (this, and the fact that it doesn't work on W7 without a VM, is the reason I'm looking to modernize).

 

This was my batch file:

@echo off
if "%1"=="T" goto test
if "%1"=="t" goto test

sfv32.exe -C -i -f "!cheksum.sfv" "*.*" <"C:\Program Files (x86)\!Transfered\DOSUTILS\enter.dum"
goto end

:test
sfv32.exe -T -l -n "!bad-crc.txt" -f "!cheksum.sfv" <"C:\Program Files (x86)\!Transfered\DOSUTILS\enter.dum"
cd >>\!bad-all.txt
type !bad-crc.txt >>\!bad-all.txt
echo . >>\!bad-all.txt
echo -------------------------------------------------------------------------------- >>\!bad-all.txt

:end

 

I'd typically execute the batch file with the sweep.com command - "sweep sfv" to create checksum files (thereby overwriting existing checksum files), and "sweep sfv t" to check. The sweep command will recurse all subfolders, executing the batch file in each one.
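
From what I gather, an unRAID-side equivalent of "sweep sfv t" might look something like this (a sketch assuming per-folder checksum files named checksums.md5 - just an example name - and one collecting log like my old !bad-all.txt):

LOG=/mnt/disk2/bad-all.log
find /mnt/user/TV -type f -name 'checksums.md5' | while IFS= read -r sumfile; do
    dir=$(dirname "$sumfile")
    ( cd "$dir" || exit
      md5sum -c --quiet checksums.md5 2>&1 | sed "s|^|$dir: |" ) >> "$LOG"    # only failures are printed
done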

 

This system did have what some may find to be a disadvantage, in that it places a checksum file in every single folder. It never bothered me, and it kept things nice and simple.

 

The features I'd like added are the ability to run on unRAID, and then the other nice-to-haves listed in my first post.

 

--------------

 

In your system, you may be able to achieve goal 2) by doing it this way: let it add new files to the database first. After that, detect missing files, and for each missing file, search the database for that file's checksum (make sure to use long checksums...). If another file with an identical checksum and file size is found, label that particular file as 'moved' instead of 'new'. This way the report may still show 12000 entries after a big move, but as long as the files just moved, it won't report 12000 new files. However, and here's the major difference, if one file changed content during the move or copy to another folder, that file will be labeled as 'new' and will stand out from the horde of files that were successfully moved or copied over ('moved').
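
A hedged sketch of that lookup, written against the hypothetical 'files' table from the sqlite3 sketch earlier in this thread ($missing_md5, $missing_size and $missing_path stand in for the values of a file that has disappeared from its old path):

sqlite3 /mnt/cache/filedb.sqlite \
  "SELECT path FROM files WHERE md5 = '$missing_md5' AND size = $missing_size AND path <> '$missing_path';"

# any row returned  -> label the old entry 'moved' (the same content exists elsewhere)
# no rows returned  -> the file is really gone, and an unknown hash at a new path stays flagged as 'new'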

Link to comment

It does seem like you have more requirements than I do.

I just want a catalog and checksum.

If files move, I know they have moved. I suppose any new files could be double checked by checksum.

I'll have to think about it though.

I wanted to have a fast phase where files are inserted with stat information.

Then a slower phase where the checksums are calculated by another external process.

 

In order to match what you have, the insert and checksum would have to be done at the same time.

I have to think about it more.

 

As far as md5sums in each directory, that's pretty easy to obtain.

However, my thought, and one others have expressed, is: if the disk in question is having issues, do you want your checksum information in the same location or in a different place?

Link to comment

Yeah, my focus on also being able to verify checksums after data has been moved or copied is a leftover from the pre-unRAID years - always moving data onto a newer and larger disk for consolidation and growth, and retiring drives that had seen several years of successful service. With that and my near-insanity NF4 experience, paranoia now seems a normal state of mind.

 

When you say that if a file has moved, you would already know, I'm sure that is correct. But think about cases where you have renamed the category folder for all of your "Heavy Metal" music collection to "Hair Metal" (or renamed a folder to correct a spelling error) - then all of the files under that structure have moved. I'd like to avoid generating false warnings or errors for scenarios such as these.

 

What is the concern you and others have with storing the checksums in the same place as the files? My thinking here is that if you can't read the files, you won't need the checksums anyway.

 

That being said, I have been known to keep a dir /s catalog of all my drives on the C: drive so that I would know what I had lost when a drive did fail. In terms of replacing what was lost: if I had a backup, that backup would contain its own checksum file already, and if I needed to re-download, then I'd trust the new download was OK and just generate new checksums once in a while (a weakness was that before generating new ones for a whole drive, I of course needed to check the existing ones first, to make sure I wasn't creating new checksums for files that had failed).

 

Thanks for offering up the newer version of md5deep - do you happen to know whether it will successfully distinguish between new or missing files and files actually failing the check?

 

Link to comment

Thanks for offering up the newer version of md5deep - do you happen to know whether it will successfully distinguish between new or missing files and files actually failing the check?

 

I had not checked it like that. I'm looking at a whole disk or filesystem catalog environment.

Like locate but with stat and checksum information stored.

I had not considered dealing with moved files; anything that is moved is moved by me, and I would expect to see a

 

deleted message

added message

 

I suppose I could do a check in the SQLite database: if a file is deleted, look for the corresponding size and hash to find a match.

 

But I have so many files, who knows how long it's going to take to checksum all the files.

We're talking about 12,000,000 files so far, and the array is not full.

 

 

 

Link to comment

Hey guys, just a quick post to let you know I'm following this topic. I'm curious about the outcome.

 

A bit of background:

I've been on unRAID for about 8 months; everything has been pain-free during the last 6, but the first 2 months were painful...

This is a short summary of my venture... I had a failing drive that, 10 days before, had passed 5 preclear cycles, and the SMART reports were not indicating anything concerning. When rebuilding to a new one, I swapped the disks around and one of my healthy disks ended up connected to a failing slot of the backplane (I learnt this much later...). To make things worse, I had a power outage during the rebuild... Bottom line, the parity sync showed a zillion errors, which couldn't be right. I recovered almost everything by directly connecting the failing drive to another station and running recovery tools, but I must confess I still cross my fingers every month on the monthly parity check. If an error appears, I wouldn't know whether it is the MB port, the cable, the backplane, the data drive or the parity drive. I was looking for some tool that could perform checksums of the data on the drive, and of the same data reconstructed while ignoring the data drive and using the parity drive instead. Any mismatch would alert me that something is wrong!

Link to comment
perform checksums of the data on the drive, and of the same data reconstructed while ignoring the data drive and using the parity drive instead.

 

The parity drive does not contain actual file data; it holds a parity calculation over the matching sectors of all drives. There's no way to read the parity drive in this manner. When a drive is removed from the array, all of the "other" data drives AND the parity drive are used to reconstruct the missing sectors/blocks/files/disk.
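
For a concrete picture: with single parity, the parity byte is simply the XOR of the bytes at the same offset on every data drive, for example:

# three data drives' bytes at one offset, and the parity byte that would be stored
printf 'parity = 0x%02X\n' $(( 0xA5 ^ 0x3C ^ 0xF0 ))    # prints: parity = 0x69

So parity by itself tells you nothing about any one file; it only becomes useful in combination with all of the other data drives.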

Link to comment

perform checksums of the data on the drive, and of the same data reconstructed while ignoring the data drive and using the parity drive instead.

 

The parity drive does not contain actual file data; it holds a parity calculation over the matching sectors of all drives. There's no way to read the parity drive in this manner. When a drive is removed from the array, all of the "other" data drives AND the parity drive are used to reconstruct the missing sectors/blocks/files/disk.

 

That's what I meant: calculate the file while ignoring the data drive, using the rebuild coming from the parity drive and the rest of the data drives.

Link to comment

perform checksums of the data on the drive, and of the same data reconstructed while ignoring the data drive and using the parity drive instead.

 

The parity drive does not contain actual file data; it holds a parity calculation over the matching sectors of all drives. There's no way to read the parity drive in this manner. When a drive is removed from the array, all of the "other" data drives AND the parity drive are used to reconstruct the missing sectors/blocks/files/disk.

 

That's what I meant: calculate the file while ignoring the data drive, using the rebuild coming from the parity drive and the rest of the data drives.

 

That's not possible to do on demand, i.e. read parity and all other drives except the one where the data actually resides.

The parity check is supposed to do the validation between the actual data drives and the parity drive.

Link to comment
