MD5 Scripting help!


TheDragon


I would probably change it to do one disk a day.

 

 

Using the day of month as the disk number.

 

 

Schedule your monthly parity check for the 27th day of the month and you should not have any conflicts.

 

Thanks for the suggestions, Weebo; finally got round to finishing this off!

 

This is the script I'm now running, as per your suggestion. With the limited testing I've done so far, it seems to work as expected! Thought I'd post my finished version in case it's helpful to anyone else :)

 

#!/bin/bash
logger "MD5 Hash Script Starting..."
dt=$(date +"%Y-%m-%d")
# Day of the month (no leading zero) selects which disk to hash today
DD=$(date +%-d)
logger "Creating MD5 Hash of disk${DD}..."
[ ! -e /mnt/disk${DD} ] && logger "Disk${DD} not found, MD5 Hash Script Stopping..." && exit 3
cd /mnt/ || exit 1
/usr/local/bin/md5deep -re /mnt/disk${DD} > /mnt/cache/Backups/MD5_disk${DD}_${dt}.txt && logger "Creation of MD5 Hash of disk${DD} Completed Successfully"


Should also log the file date, and the md5 create date too; otherwise this script is a winner.  Took 6 1/2 hours to process a 2 TB drive with a quarter-million files on my Xeon 1220, writing to a protected disk on the array.

 

Now I just need to find the code to add it to an SQLite DB.  When it adds to the DB it needs to check whether the file is already in there; if there are no changes, just update a last-seen date and a counter showing how many times it's been compared.  The database should also have a first-seen date.  No changes means the MD5, path and date are all the same.  Otherwise, add a new record.  Would also be nice to have a deleted flag if a file is no longer in the file system.
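
For what it's worth, here's a rough sketch of what that table and update logic could look like using the sqlite3 command-line tool from bash. The table layout and column names (files, first_seen, last_seen, times_compared, deleted) are just my own guess at what's described above, not a finished design, and it skips SQL escaping of unusual paths for brevity:

#!/bin/bash
# Sketch only: maintain a first-seen/last-seen MD5 inventory in SQLite.
DB=/mnt/cache/Backups/md5_inventory.db   # hypothetical location
dt=$(date +"%Y-%m-%d")

sqlite3 "$DB" "CREATE TABLE IF NOT EXISTS files (
    path TEXT PRIMARY KEY,
    md5 TEXT,
    first_seen TEXT,
    last_seen TEXT,
    times_compared INTEGER DEFAULT 0,
    deleted INTEGER DEFAULT 0);"

record_file() {
    local md5="$1" path="$2" old
    old=$(sqlite3 "$DB" "SELECT md5 FROM files WHERE path='$path';")
    if [ -z "$old" ]; then
        # Never seen before: new record with a first-seen date
        sqlite3 "$DB" "INSERT INTO files (path, md5, first_seen, last_seen, times_compared)
                       VALUES ('$path', '$md5', '$dt', '$dt', 1);"
    elif [ "$old" = "$md5" ]; then
        # Unchanged: just bump the last-seen date and the comparison counter
        sqlite3 "$DB" "UPDATE files SET last_seen='$dt', times_compared=times_compared+1
                       WHERE path='$path';"
    else
        # Changed: store the new MD5 (a report line could be written here too)
        sqlite3 "$DB" "UPDATE files SET md5='$md5', last_seen='$dt', times_compared=times_compared+1
                       WHERE path='$path';"
    fi
}

# Feed it an existing md5deep output file ("<hash>  <path>" per line)
while read -r sum file; do
    record_file "$sum" "$file"
done < /mnt/cache/Backups/MD5_disk1_${dt}.txt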

 

This will be awesome for analysis of a working file system....

 

What do enterprise data centers use for auditing this file management process?


I have something I will be releasing soon. It's a bunch of loadable/shared object functions for bash.

 

They will allow you to access SQLite from the bash program itself. It provides the functionality of SQLite as an extension of bash.

I'll have a GDBM module too. This is good for key/value data sets, i.e. the key is a file/path and the value is anything you want.

 

Plus a tool for doing strftime functions from a number or as a file reference.

Think date, but allowing you to use Unix EPOCH time, which lets you do math on it. Not such a big deal, but combined with stat, which can return the mtime as EPOCH time, you can now check a file's age.
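
In plain bash that sort of age check is already doable with stat and date; a rough example (the path and the 30-day threshold are just placeholders):

# Check a file's age using EPOCH seconds from stat and date
file="/mnt/disk1/some/file"             # placeholder path
now=$(date +%s)                         # current time in EPOCH seconds
mtime=$(stat -c %Y "$file")             # file mtime in EPOCH seconds
age_days=$(( (now - mtime) / 86400 ))
[ "$age_days" -gt 30 ] && echo "$file is $age_days days old"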

 

I had just finished the GDBM update the other week. While it worked before, now it can load the gdbm into an associative array. This is what I plan for the lsql module: do a select, get the results in an associative array.

 

I had also started coding an md5 function for bash. We have md5sum as an external command, but when you do MD5=`md5sum` millions of times, the extra pipe and fork add up. By providing an embedded md5sum function inside bash as a loadable, we can avoid the required pipe and fork, thus saving a few cycles.
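
Just to illustrate the difference being described (plain external commands here, not the loadable itself): forking md5sum once per file versus letting one process hash everything in a single pass.

# The costly pattern: a new md5sum process (fork + pipe) for every single file
find /mnt/disk1 -type f -print0 | while IFS= read -r -d '' f; do
    MD5=$(md5sum "$f")
    echo "$MD5"
done

# One long-running process hashing many files avoids the per-file fork
find /mnt/disk1 -type f -print0 | xargs -0 md5sum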

 

Should be ready in a few weeks.

 


Couldn't you just use hashdeep / md5deep to do this? - https://github.com/jessek/hashdeep

 

Sure can.

I think part of what comes into play here is only calculating new MD5s for files that are actually new and/or changed.

 

However, if you just want to recalculate MD5s for every file on the disk, md5deep/hashdeep works great.

 

I tried to get the author to have md5deep store the mtime as a Unix epoch/gmtime, i.e. the same time as on the filesystem, but he resisted and said we should be using Tripwire instead.  I disagree. So eventually I'll store it in SQLite along with other filesystem stat structure information. That way, after a corruption, it could help identify which files (or parts) were moved into lost+found.
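
As a rough sketch of the "only hash what's new or changed" idea, using the stored mtime as the change test. The inventory database and its mtime column are hypothetical (carried over from the sketch earlier in the thread), and SQL escaping of odd paths is again ignored:

#!/bin/bash
# Skip the expensive MD5 when the stored mtime matches the file's current mtime
DB=/mnt/cache/Backups/md5_inventory.db   # hypothetical inventory from the earlier sketch

find /mnt/disk1 -type f -print0 | while IFS= read -r -d '' f; do
    mtime=$(stat -c %Y "$f")                                    # current mtime as EPOCH seconds
    stored=$(sqlite3 "$DB" "SELECT mtime FROM files WHERE path='$f';")
    if [ "$stored" = "$mtime" ]; then
        continue                                                # unchanged since last run: skip it
    fi
    sum=$(md5sum "$f" | awk '{print $1}')                       # new or changed: hash it again
    echo "$sum  $f"                                             # report it (and update the db)
done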


 

So we will be able to calculate MD5s only for files that have changed.  That should make this process much more efficient on an ongoing basis.  I wonder how much more efficient?

 

However, sometimes you will want the system to recalculate all the MD5s again to verify there is no bit rot.  Although I suppose a parity check does the same thing?

 


My goal was to do an initial inventory and md5 of all files on a per disk basis.

Then do the same once a month, per disk.

 

disk# to be processed on the day of the month that matches the disk.

new files will be added to the inventory with md5 calculated and added to a report

changed files will be updated with new md5 and added to a report.

missing files will be added to a report.

 

There should always be a way to check all md5's based on some expression.

There should be a way to extract the sqlite data to an actual md5sum file based on some expression so you can archive or check with regular tools.
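
For example, against the hypothetical inventory table sketched earlier, pulling a normal md5sum-style file back out and checking it could look roughly like this:

# Extract matching rows in md5sum format ("<hash>  <path>") so regular tools can use them
sqlite3 -separator '  ' /mnt/cache/Backups/md5_inventory.db \
    "SELECT md5, path FROM files WHERE path LIKE '/mnt/disk1/%' AND deleted=0;" \
    > /mnt/cache/Backups/disk1_archive.md5

# The stock tool can then re-verify everything that was exported
md5sum -c /mnt/cache/Backups/disk1_archive.md5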

 

Discussed recently is another date field to be updated whenever the md5 is processed from the raw file again.

This way you can age out the md5 and recalculate it on some configured age or just do all of them to check for changed files, missing files, bit rot, etc, etc.

 

My personal goal was

day 1-26 process each respective data disk.

day 27 parity check.

 

It'll probably break for people who end up with more than 26 data disks, but that was my goal, so that it could be run once a day in a reasonable time frame.

 

It was also because I wanted to do a badblocks read test and look for troublesome sectors for each respective disk on the day of the month, then possibly an automated SMART test.

 

How much more efficient you ask?

Well, since our disks are quite large these days, it can save a lot of recalculation when it's not needed.

I know I had millions and millions of files on my array. It would take a long time just to traverse down the directory tree to get a file list.

 

Hence one of the other reasons to inventory your files.

An added goal was to provide a locate tool so that you can search your whole array for files via SQLite and find out exactly where they are, just like locate.  My initial test of this had a neat plugin with an interface so you could locate a file; it would do an ls -l on the file and present a directory view that would allow you to download or operate on the file.
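
A bare-bones version of that locate-style lookup against the same hypothetical inventory table would just be a select:

# "Where on the array is that file?" - search the inventory instead of walking every disk
sqlite3 /mnt/cache/Backups/md5_inventory.db \
    "SELECT path FROM files WHERE path LIKE '%part_of_the_name%' AND deleted=0;"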

 

For me, this was extremely useful: when you have an MP3 collection that spans 12TB, you want to know where files are when building a disk for a DJ party.

 

I had it going in an alpha version, then the plugin architecture changed and all my plugins broke.

I decided to wait until 5.0 final before moving forward again.

 

I am leaning towards SQLite because there's a very basic browser plugin for Firefox that provides GUI access to your database.

 

Plus, MediaMonkey uses SQLite, and I would need to match against the two looking for dupes.

 

A couple of simple SQL selects and you can extract an md5sum file for specific matches and/or use that file to back things up a special way.

 

While I have ESX running on the HP MicroServers and a Slackware dev system there, it's a bit slow to develop on.

What's slowing me down is that I need to build a new ESX server with enough oomph that compiling doesn't frustrate me.  I've been waiting and hoping to score a Lian Li-PCA17B to build a new unRAID server, but I've not had much success with that.  I may go for the latest limetech build for a shell, drop in my other developer server components, and be done with it.

 

On the plus side, I do have an lsql bash loadable working.

I just need to add better array support so it's easier to use from the shell.

 

Currently a select provides an array of each field in successive indexes.

I find that cumbersome to use at the current time. 

It's fine for one row, but when multiple rows are stored in the array, it gets messy (from what I remember at least).


#!/bin/bash
# Script to Create MD5 Hashes of Data Files on Array Disks (Monthly)
dt=$(date +"%Y-%m-%d")
# physical cores
CORES="$(cat /proc/cpuinfo | egrep "core id|physical id" | tr -d "\n" | sed s/physical/\\nphysical/g | grep -v ^$ | sort | uniq | wc -l)"
# logical threads
#CORES="$(nproc)"
ARRAY="$(find /mnt/ -maxdepth 1 -type d -name 'disk*' -print)"
mkdir -p /hash
for DIR in $ARRAY
do
    DISK="${DIR##*/}"
    # Hash each disk in the background, one md5deep job per core
    md5deep -r "${DIR}" > /hash/MD5_${DISK}_${dt}.txt &
    # Throttle: wait for a free job slot before starting the next disk
    while (( $(jobs | wc -l) >= $CORES ))
    do
        # echo $(jobs | wc -l)   # uncomment to watch the running job count
        sleep 0.1
        jobs > /dev/null
    done
done
# Let the remaining background jobs finish before hashing the hash files
wait
md5deep -r /hash > /hash/${dt}_MD5.txt
mkdir -p /boot/hash
cp /hash/* /boot/hash/

 

I tried my hand at improving the code. It's now multithreaded! It writes the hashes to a RAM disk, then copies them to flash since I don't have a cache drive. It also does a hash of the hash files for extra protection.

 

Hopefully WeeboTech comes out with his solution since 5.0 is out now.



I'm a complete noob when it comes to scripting and cron. I'm hoping someone who is good at it will take pity on a beginner and tell me how to set this up. I understand how the cron expressions work but I don't know enough about scripting to know what parts I need to change or where/how to install this in my system.
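
Not an authoritative unRAID answer, just the generic way to wire it up: copy the script somewhere that survives a reboot, make it executable, and give cron a schedule line. The paths and the schedule below are only examples (03:00 on the 1st of every month), and bear in mind the live system runs from RAM, so you may need to re-add the cron entry from your go script at boot:

# Save the script to the flash drive so it persists across reboots (example path and name)
cp md5_hash_disks.sh /boot/custom/md5_hash_disks.sh
chmod +x /boot/custom/md5_hash_disks.sh

# Append a schedule line to root's crontab:
#   minute  hour  day-of-month  month  day-of-week  command
(crontab -l; echo "0 3 1 * * /bin/bash /boot/custom/md5_hash_disks.sh") | crontab -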

