MD5 Scripting help!


TheDragon


I'm trying to cobble together a script to create MD5 checksums of each of my array disks.  I've been successful for the most part; however, the finishing touch I can't seem to get right is including the disk number that each file relates to.

 

Here is what I have so far:

# Script to Create MD5 Hashes of Data Files on Array Disks (Monthly)
cd /mnt/
find /mnt/ -type d -maxdepth 1 | grep -v cache | grep -v user | grep disk | cut -c 6-11 | xargs -n 1 -I {}  md5deep -re {} > /mnt/cache/Backups/MD5_{}_$(date +"%d_%m_%Y").txt

 

It creates the file, with MD5 hashes as expected; however, the file is being generated with a name as below:

MD5_{}_23_06_2013.txt

 

I was expecting to get files named like this:

MD5_disk1_23_06_2013.txt
MD5_disk2_23_06_2013.txt

 

Can anyone more knowledgeable than myself see where I'm going wrong? Any input would be greatly appreciated!  :)


Here's a scriptlet that may give you a different idea on how to do this.

While I don't have the whole dated thing going on, it's how I created a filelist.disk# and a filelist.disk#.newer.

 

This was a precursor to a larger script to do it with md5sums.

 


#!/bin/bash

# For each /mnt/disk* mount point, write a full file list and a list of
# files newer than the previous list.
find /mnt -maxdepth 1 -type d -name 'disk*' -print | while read DIR
do   DISK="${DIR##*/}"      # strip the path, leaving e.g. disk1
     FILELIST=/mnt/cache/.flocate/filelist.${DISK}
     find ${DIR} -type f -newer ${FILELIST} -print | sort > ${FILELIST}.newer
     find ${DIR} -type f -print | sort > ${FILELIST}
     # ls -l /mnt/cache/.flocate/filelist.${DISK}
done
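
As an aside on why the original one-liner writes a literal {} into the filename: the shell sets up the > redirection before xargs ever runs, so {} is only ever substituted into md5deep's arguments, never into the redirection target. One hedged fix that keeps the xargs approach is to let a child shell do the redirection per disk (a sketch, using GNU find's -printf in place of the cut):

find /mnt/ -maxdepth 1 -type d -name 'disk*' -printf '%f\n' | \
    xargs -I {} sh -c 'md5deep -re /mnt/{} > /mnt/cache/Backups/MD5_{}_$(date +%Y-%m-%d).txt'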

Link to comment

FWIW, you are better off dating your files with YYYY-MM-DD or some derivative of that.

This way, when you sort the file list, the names sort correctly by date.

 

date "+%Y-%m-%d"

2013-06-23

 

Since YYYY-MM-DD is an ISO standard, I usually use that, or the same without the dashes so I can just check the length and know I have a date.

20130623

i.e. 8 digits within very specific limits.
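
A quick illustration with hypothetical filenames - DD_MM_YYYY names sort out of date order, while YYYY-MM-DD names sort chronologically:

root@unRAID:~# printf '%s\n' MD5_disk1_23_06_2013.txt MD5_disk1_05_07_2013.txt | sort
MD5_disk1_05_07_2013.txt
MD5_disk1_23_06_2013.txt

root@unRAID:~# printf '%s\n' MD5_disk1_2013-06-23.txt MD5_disk1_2013-07-05.txt | sort
MD5_disk1_2013-06-23.txt
MD5_disk1_2013-07-05.txt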


Okay.. using both of your suggestions I've managed to cobble something together that seems to have the desired effect!  I can't say I'm entirely sure how/why it works though  :P

 

Any constructive criticism welcome!!

 

This is what I've got:

# Script to Create MD5 Hashes of Data Files on Array Disks (Monthly)
#!/bin/bash 
dt=$(date +"%Y-%m-%d")
find /mnt/ -type d -maxdepth 1 -name disk* -print | while read DIR
cd /mnt/
do   DISK="${DIR##*/}"
     md5deep -re ${DISK} > /mnt/cache/Backups/MD5_${DISK}_${dt}.txt
done

 

I have a couple of questions about how/why this works, if anyone is happy to answer  :)

 

If I type ' find /mnt/ -type d -maxdepth 1 -name disk* -print ' at the console, it returns the disks along with the '/mnt/' prefix.  I can't see where this is removed in the script, since the file names I end up with don't include '/mnt/'.

 

I'm also not 100% sure of the effect of 'while read DIR'. From my googling, I'm guessing this is reordering the disks in alpha/numerical order?

 

Final question: how/why do the disk numbers end up in the ${DISK} variable?  :)

 

 

EDIT: Updated code to include WeeboTech's suggestion re date format - Thank you!!  ;)


As I am not yet fluent in the fine art of scripting, I found an open source Windows Explorer shell extension called HashCheck that accomplishes the same thing.

 

http://code.kliu.org/hashcheck/

 

You can generate MD5 hash files for entire directories that can be edited with your favorite text editor.  I use it as a check on both my unRAID array and backup disks.  Gives me peace of mind that everything is working properly.  It will also generate and display, as a property page, all the different hashes for a given file.  Very easy to use.  I use it from a Win7 machine.


 

Here you are piping the list into a while loop, where read assigns each line to the variable DIR.

find /mnt/ -type d -maxdepth 1 -name disk* -print | while read DIR

 

So for each value of DIR, you set the variable DISK and then run your md5deep on it. When it hits the done, it goes back to the read and takes the next value from the pipe, until there are no more values.

cd /mnt/

do  DISK="${DIR##*/}"

    md5deep -re ${DISK} > /mnt/cache/Backups/MD5_${DISK}_${dt}.txt

done

 

Personally, I would put the cd /mnt/ line just above the find /mnt/ line.

No need to keep doing cd /mnt/ on each iteration.

 


I hadn't twigged that the DIR in caps was indicating a variable - thanks, that makes sense now!

 

As for the cd /mnt/ line, I did try putting that above the find /mnt/ line, but for some reason when I did that the script produced an error - not really sure why! However, by moving it down after the find /mnt/ line, it seemed to work fine.  The error I got was:

find: paths must precede expression: disk2
Usage: find [-H] [-L] [-P] [-Olevel] [-D help|tree|search|stat|rates|opt|exec] [path...] [expression]


DISK="${DIR##*/}"

takes apart the /mnt/disk# path:

root@unRAID:~# DIR=/mnt/disk1
root@unRAID:~# echo ${DIR##*/}
disk1

Incidentally, the "paths must precede expression" error you saw is almost certainly down to the unquoted disk* glob: once the shell's current directory is /mnt, disk* expands to disk1 disk2 ... before find even runs, so find sees stray paths after its expression.  Quoting it as -name 'disk*' avoids that.
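
And for reference, the other expansions in the same family, as a quick sketch:

DIR=/mnt/disk1
echo "${DIR##*/}"    # disk1  - strip everything up to and including the last /
echo "${DIR%/*}"     # /mnt   - strip everything from the last / onward
echo "${DIR#/mnt/}"  # disk1  - strip a literal leading /mnt/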

 

Thank you! That also makes perfect sense now, hopefully one of these days once I've learnt a little more, I might actually be able to make a script from scratch  ;)


Here is a script that I put together from one I found a while back, the threads Joe L posted, and portions of yours, jack0w:

#!/bin/bash
# Script to Create MD5 Hashes of Data Files on Array Disks (Monthly)
dt=$(date +"%Y_%m_%d")
mkdir -p /mnt/cache/Backup/N40L

find /mnt/disk* -type f -exec md5sum {} \; >> /mnt/cache/Backup/N40L/MD5_${dt}.txt

 

Obviously the directories above, which put the text file on the cache drive, would have to change, but it appears to do what I needed.  I think it will be several hours before it gets to my next disk, so I can't confirm that part yet, but it is working on the first disk so far, just like I want.
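
One hedged tweak if speed ever becomes an issue: find can batch many filenames per md5sum invocation with -exec ... + instead of forking one process per file, with the same output format:

find /mnt/disk* -type f -exec md5sum {} + >> /mnt/cache/Backup/N40L/MD5_${dt}.txt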


All operations in a while loop must occur on or after the do

In my case I had:

do DISK=

Just move the DISK= to the line below and indent accordingly.

Add the cd after the DISK= or before it.

 

while read DIR

do

<your lines go here>

done

 

Okay, I think I've got this sussed...

 

I started with:

 

# Script to Create MD5 Hashes of Data Files on Array Disks (Monthly)
#!/bin/bash 
dt=$(date +"%Y-%m-%d")
find /mnt/ -type d -maxdepth 1 -name disk* -print | while read DIR
cd /mnt/
do   DISK="${DIR##*/}"
     md5deep -re ${DISK} > /mnt/cache/Backups/MD5_${DISK}_${dt}.txt
done

 

As per your suggestion I've now changed this to:

#!/bin/bash
# Script to Create MD5 Hashes of Data Files on Array Disks (Monthly)
dt=$(date +"%Y-%m-%d")
find /mnt/ -maxdepth 1 -type d -name 'disk*' -print | while read DIR
do   DISK="${DIR##*/}"
     cd /mnt/
     md5deep -re ${DISK} > /mnt/cache/Backups/MD5_${DISK}_${dt}.txt
done

 

Is what I've got now correct?

 

graywolf seemed to suggest that it was incorrect or bad script etiquette (I'm not sure which) to cd /mnt/ for each iteration.
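
For reference, one way to sidestep the cd question entirely is to hand md5deep the full path that find already produces - a sketch; note the hash file would then record absolute /mnt/diskN/... paths rather than diskN/... ones:

#!/bin/bash
dt=$(date +"%Y-%m-%d")
find /mnt/ -maxdepth 1 -type d -name 'disk*' -print | while read DIR
do   DISK="${DIR##*/}"
     # no cd needed: hash the absolute path directly
     md5deep -re "${DIR}" > /mnt/cache/Backups/MD5_${DISK}_${dt}.txt
done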



I'm glad what I started cobbling together was of some use to somebody else!!  ;)

Was a big help.  Now to get it to execute periodically once it completes.  Don't think that will be a problem, but got to wait until it gets done.  I'm thinking I will zip up the file as well so it takes less space, so I will have to add that, and also add it to my other unRAID servers.

 

If it helps, I did try creating MD5 hashes on a single disk a few days ago, just to get an idea of file size and time taken to complete per disk.  On an almost full 3TB disk, I ended up with a 411.2 kB file containing the hashes, and it took around 7 hours to complete.

 

Have you given much thought to the interval you will run it at?  I am thinking of doing it monthly, although I wonder whether, since it reads the whole disk, it may well contribute to the 'ageing' of the disk. As a result I'm not sure if running it monthly is wise; I did wonder about running it bi-monthly. That's the only reason I haven't immediately added a line to my go file to get crontab to run it regularly.

 

What interval do you run your version of this script at, WeeboTech?


On my Media unRAID servers my current MD5 checksums take 7749KB and 1318KB respectively with my old script, which I had to edit for each disk change.  On my N40L, which contains downloads and backups, I will know sometime tomorrow when it completes.  It should be larger than the media servers because it has many more, smaller files.  Before I deleted the old MD5 sums, the largest drive (by file count) was approximately 33300KB.  I have already added a copy command to my go file to copy my version of the script to cron.monthly so that it will run monthly.  Not worried about the drives, especially since I'm using WD Reds for 2 of my 3 unRAID boxes.  They are designed to be spinning 24x7 at least, and I'm sure reading will not add much more wear.  The only thing I'm worried about is that this will run at the same time as the monthly parity check and slow both down a lot.

 

Edit: First disk on N40L just completed: 295833 files in 786.59G.  The rest of the disks (all 2TB WD Reds) have about the same usage or less free space, but the files are much larger.  I probably only have a couple of thousand files on the other 4 disks of the array, but each disk has between 400G and 950G free.
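
For anyone copying the cron.monthly approach, the go-file lines would look something like this (the script name and flash location are hypothetical):

# in /boot/config/go: install the script into root's monthly cron at boot
# (assumes the script is saved on the flash drive under /boot/custom/)
cp /boot/custom/md5_hashes.sh /etc/cron.monthly/md5_hashes
chmod +x /etc/cron.monthly/md5_hashes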


Ah I see! Well I guess since my array is full of a smaller quantity of larger files, that accounts for why our checksum file sizes are significantly different. I can now understand why you want them zipped!  ;)

 

I'm also using all WD Red drives (bar one), and your argument sounds logical to me. I think I will follow suit and run it monthly.  Thanks for the additional info.

 

Also big thanks to everyone who pitched in to help me get to the point of having a working script! Hopefully it may prove useful, even if only as a starting point, for others in future  :)


I'm resurrecting some old snippets here to give you some ideas since I cannot locate the full script.

I had a daily job that would traverse one disk for each day via cron.

 

It would find today's day of the month.

Look to see if that disk mount point existed.

If so, continue and make a filelist.

 

Here are the snippets.

 

# DD=`date "+%e"|sed -e 's# ##g'`

# [ ! -e /mnt/disk${DD} ] && exit 3

# find /mnt/disk${DD} -type f > /mnt/cache/.flocate/filelist.disk${DD}
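
Assembled into something runnable (a sketch, using the filelist path from the snippets above):

#!/bin/bash
# One disk per day: the disk number is today's day of the month.
DD=$(date "+%e" | sed -e 's# ##g')    # day of month, padding stripped
[ ! -e /mnt/disk${DD} ] && exit 3     # no disk with today's number? bail out
find /mnt/disk${DD} -type f > /mnt/cache/.flocate/filelist.disk${DD}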

 

My script was a bit more extravagant though.

 

The later version created the md5sums, as did the other scriptlet shown in the thread.

 

I then read the md5sum file and checked each file for its respective MTIME.

The latest MTIME was used as a reference to set the TIME of the md5sum.

 

On subsequent iterations, I used find /mnt/disk${DD} -newer (seed md5sum file) >> filelist.disk${DD}.newer

 

From here I would use the newer file as input to make new md5sums of only the updated files.

Then merge them into the original md5sum file.

Then set the timestamp of the md5sum file to the timestamp of the newest file.
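
A sketch of that flow, with hypothetical paths; note this version just appends, whereas a real merge would replace the old entries for re-hashed files:

#!/bin/bash
DD=$(date "+%e" | sed -e 's# ##g')
SUMS=/mnt/cache/.flocate/md5sums.disk${DD}    # assumes this exists from a full run
NEWER=/mnt/cache/.flocate/filelist.disk${DD}.newer

# 1. list only the files changed since the sums file's timestamp
find /mnt/disk${DD} -type f -newer ${SUMS} > ${NEWER}

# 2. hash just those files and append (a real merge would dedupe by filename)
while read -r f; do md5sum "$f"; done < ${NEWER} >> ${SUMS}

# 3. stamp the sums file with the newest data file's mtime (GNU find)
NEWEST=$(find /mnt/disk${DD} -type f -printf '%T@ %p\n' | sort -n | tail -1 | cut -d' ' -f2-)
[ -n "$NEWEST" ] && touch -r "$NEWEST" ${SUMS}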

 

So rather than doing all the md5sums daily or monthly,

I did an MD5SUM for each /mnt/disk${DD} based on the day of the month.

I changed my parity check to run on the 27th day so that there were no collisions.

 

I started parity check on the 27th in case it ran over 24 hours.

 

This allowed disks 1-24 to be processed on their respective days,

and the cache (or a rerun of any other day) to be done overnight in between.

 

Before I lost everything, I had this neat scheduler.

It read a named pipe and would spawn up to one md5sum process per CPU so the job got done quicker.


I plan on putting my monthly MD5 files through BeyondCompare to see if any changes have occurred that were not caused by any file operations I know about (bit rot), then delete the oldest.  As long as I do this regularly, every couple of months, I should see, and maybe be able to correct, any problems - re-record or re-download the files.
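
A command-line alternative for the comparison step, in case it's useful (filenames are hypothetical; lines unique to either file indicate added, deleted, or changed content):

diff <(sort MD5_2013_06_23.txt) <(sort MD5_2013_07_23.txt)

# or, since the file is in md5sum format, verify it against the disks directly:
md5sum -c MD5_2013_07_23.txt | grep -v ': OK$'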
