jbartlett Posted September 16, 2014 (Author)

What's the benefit to having an exported hash file as well as the hash in the metadata? Also how often are you running your script, do you have it scheduled?

The exported list is used in the event of a catastrophic error in which you have to low-level scan the sectors and rebuild the file/directory tree in "lost+found". It'll generate a hash on the files in the lost+found directory and move them back to their original location. Though, the last time I had to do this, I ended up with files reporting sizes in the PB range; I may add an exclusion for that.
jumperalex Posted September 17, 2014

I would like to request / suggest the inclusion of BLAKE2 as an optional hash function. You can get the details here: https://blake2.net/ The main reason is that it is capable of multi-threading a single hash generation and is indeed as fast as md5 while not sharing md5's known collision weaknesses. Couple that with running at least as many processes as there are drives (to maximize throughput from the drives as choke points) and you really will have a blazing fast process. While not critical for generating (though nice), it can be very appealing during verifications; like, after a parity fail or [cough] something kernel related [cough] ;-)
jumperalex Posted September 17, 2014

TL;DR - Any hash generating script / plugin created for unRAID should be looking to parallelize the operation across HDDs as much as possible, and then consider which hash algorithm to use for anyone with enough drives to max the CPU threads/cores. Because blake2 is indeed faster than md5 when fed data fast enough.

===========================================

Well I went a little crazy and started running some tests. blake2 is about 2.5x faster than md5 when disk bandwidth is not a limiting factor (0.481s vs 1.237s for the same 1GB file stored in /tmp in ramfs). That is the average over five iterations, as are all timed tests. However, when I ran a single-file test from /mnt/disk1 with a cache flush in between each test, they took about the same time to run. htop clearly showed blake2 spawned multiple processes vs. md5sum's single one, but it didn't help. Neither came close to tapping out a single core, no less all eight of them. Next I used a script to run three simultaneous background processes, one for each of my three disks, first for blake2 and then for md5, with a cache flush in between. Again, blake2 spawned a shedload of processes vs. md5's three, but each finished in nearly the same time (13-16s); though if I had to award a gold medal, md5 won by a nose. I can only guess that it took some extra time to set up the extra processes blake2 spawns, since it is clearly faster when fed lots of data in a single-file head-to-head drag race. Maybe it was also initial I/O contention, since all my drives are sitting on the m/b SATA ports.
This is the code I used:

#!/bin/bash
echo Flushing Cache
sync; echo 3 > /proc/sys/vm/drop_caches
echo Running md5 timed test on disks 1-3
time md5sum /mnt/disk1/Torrents/Hell.on.Wheels.S04E03.720p.HDTV.x264-IMMERSE.mkv &
time md5sum /mnt/disk2/Torrents/Hell.on.Wheels.S04E04.720p.HDTV.x264-KILLERS.mkv &
time md5sum /mnt/disk3/Torrents/Hell.on.Wheels.S04E05.720p.HDTV.X264-DIMENSION.mkv &

#!/bin/bash
echo Flushing Cache
sync; echo 3 > /proc/sys/vm/drop_caches
echo Running Blake2sp timed test on disks 1-3
time b2sum-amd64-linux -a blake2sp /mnt/disk1/Torrents/Hell.on.Wheels.S04E03.720p.HDTV.x264-IMMERSE.mkv &
time b2sum-amd64-linux -a blake2sp /mnt/disk2/Torrents/Hell.on.Wheels.S04E04.720p.HDTV.x264-KILLERS.mkv &
time b2sum-amd64-linux -a blake2sp /mnt/disk3/Torrents/Hell.on.Wheels.S04E05.720p.HDTV.X264-DIMENSION.mkv &

and sample output:

root@Tower:/boot/scripts# blaketest.sh
Flushing Cache
Running Blake2sp timed test on disks 1-3
root@Tower:/boot/scripts# dc4868be21d0c89c7daac0b7c43357216eb18fe3d31f936c55efa99bf49a58ca /mnt/disk2/Torrents/Hell.on.Wheels.S04E04.720p.HDTV.x264-KILLERS.mkv

real 0m10.283s
user 0m7.640s
sys 0m2.040s

980515c91493729e966f533a2e717f01761da7ad60f0ab292ba10e0376f239ad /mnt/disk3/Torrents/Hell.on.Wheels.S04E05.720p.HDTV.X264-DIMENSION.mkv

real 0m13.214s
user 0m7.510s
sys 0m1.960s

cd463de201638f30b87205243918d890f5803b2817402542fc6c3f73aca4ed9f /mnt/disk1/Torrents/Hell.on.Wheels.S04E03.720p.HDTV.x264-IMMERSE.mkv

real 0m13.513s
user 0m7.870s
sys 0m2.310s

The time of the last file to process is what I used as the total time to process all files. It might not be "perfect" since that file is also the largest (barely), but it is good enough to show me that I can process three nearly identically sized files in as little as 13-16 seconds. What happens when I try to process each file in sequence? Does it take the sum of the time for each file (38s for md5)? No it does not. It takes nearly 4x as long as the parallel run!!! 55.9s vs 14.3s for md5.
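The scripts above background one hasher per disk but never wait for them, so the total has to be read off the last file's output. A small generalization (a sketch; `hash_parallel` is a hypothetical name, not part of any posted script) backgrounds one md5sum per path and waits for all of them, so a single `time` reports the true wall-clock total for the whole batch:

```shell
#!/bin/bash
# hash_parallel: run one md5sum per argument in the background, then
# wait for every one of them, so `time hash_parallel ...` measures the
# wall-clock time of the whole parallel batch.
hash_parallel() {
    local f
    for f in "$@"; do
        md5sum "$f" &
    done
    wait   # block until every background hash has exited
}

# usage (after flushing the cache as in the scripts above):
#   time hash_parallel /mnt/disk[1-3]/Torrents/*.mkv
```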
I don't know why running the files in sequence takes longer than the sum of the individual file times, but I do know why it took longer than running three disk streams at the same time. Now imagine how that will scale with 10 disks, or even 20 disks!!! Frankly I expect it to scale fairly linearly until either the CPU is maxed or the SATA path(s) is (are) saturated.

So what does this all mean? IDFK. No seriously, I don't know if blake2 will be worth it regardless of the # of disks run in parallel unless you're running an SSD array!!! But I strongly suspect that at least folks running Atoms or similarly low-end CPUs might benefit from blake2 even with only 5 or 6 drives, heck maybe fewer than that [shrug], and again I suspect it might help on a moderate CPU once enough drives are feeding it data.

Unfortunately I can't test with more disks since I only have three in my array. I also can't easily test with a weaker CPU to see if I can't find the inflection point; I'm just not in the mood to play with my syslinux.cfg to limit cores on my production system. I could add a fourth drive, my cache, to the mix, but at this point I'm tired of messing with it and probably have interview-prep "stuff" to do anyway. This might all have been an attempt at procrastination but I'm not admitting anything. Anyone with a lot of drives and/or a weaker CPU wanna try? You can easily grab the blake2 "fat binaries" from here: https://blake2.net/#dl. It is a single file you can copy to your flash and then over to /tmp.

FYI I used '-a blake2sp' because it produces a 64-hex-digit hash and enables multi-processing. In testing from RAM, the 's' and 'b' versions were about the same speed regardless of the 'p' multiprocessing option. 'b' creates a 128-hex-digit hash, and since we aren't hashing passwords here, I saw no point in creating hash files twice the size.
jbartlett Posted September 18, 2014 (Author)

Here are the times I got running a 4.69 GB file from RAMFS vs disk. Times are in seconds.

md5sum: 9 / 29
sha1deep: 17 / 29
sha256deep: 25 / 28
tigerdeep: 14 / 29
whirlpooldeep: 57 / 58

When I was validating my hashes after upgrading to beta9, I had 8 telnet sessions opened (the max) and had a validate running on 8 drives, one per session. I couldn't perceive any slowdowns, but then again, I wasn't looking for one.
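A comparison like the one above can be scripted so each hasher reads the same file cold. This is a sketch of the method, not the poster's actual benchmark: it loops over a few stock hashers and flushes the page cache between runs (the flush needs root and is silently skipped otherwise).

```shell
#!/bin/bash
# bench_hashers FILE: time each hasher on the same file, flushing the
# page cache between runs so every pass reads from disk rather than RAM.
bench_hashers() {
    local f=$1 h
    for h in md5sum sha1sum sha256sum; do
        sync
        # drop the page cache; needs root, otherwise quietly skipped
        sh -c 'echo 3 > /proc/sys/vm/drop_caches' 2>/dev/null || true
        echo "== $h =="
        time "$h" "$f" > /dev/null
    done
}

# usage:
#   bench_hashers /mnt/disk1/some-large-file
```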
jumperalex Posted September 18, 2014

Did you happen to watch htop to see what the CPU was doing during that 8-way validation?
jumperalex Posted September 18, 2014

I couldn't perceive any slow downs but then again, I wasn't looking for one.

Right, so in theory you could have gotten even more total hash throughput with more than 8 running processes during that validation. I mean, we know there has to be a point where the CPU or SATA I/O becomes the choke point, and it will depend on the CPU and SATA port setup. With little enough memory, even that could become the bottleneck at some point. But one thing is for sure: we want to avoid running each file and each drive in sequence if it can be avoided. What CPU were you running?
jbartlett Posted September 18, 2014 (Author)

Did you happen to watch htop to see what the CPU was doing during that 8-way validation?

CPU: Intel i7-4771 @ 3.50GHz

Ran 7 verifies on 7 drives on large media files, 1 window with htop. CPU1 stayed around 50-60%, the rest at 80-99%. sha256deep had two threads open on each file. Aborting each task had a noticeable effect on the CPU utilization. With only one session running, a single CPU hovered at 35-40%. I stopped cache_dirs for this test.
jumperalex Posted September 18, 2014

Interesting. Thanks. Clearly a bit stronger CPU than mine. Sounds like another drive or two might have pushed you over the limit, so to speak. So, being the author of this wonderful tool and all, and admitting I'm not using it yet: is it able to run per-disk background processes? Or is that at least on your to-do list after everything else shakes out?
WeeboTech Posted September 18, 2014

One of the things I've been working on when dealing with this is avoiding flushing the cache buffers. In doing my own tests on a whole drive (with clearly less CPU), access to the drive becomes hampered. Through the use of the fadvise call and the POSIX_FADV_DONTNEED option, you can drop the cache on the file just read, which helps avoid pushing more data out of the cache buffers. Probably more useful for a large number of small files. If you happen to have a file that is larger than RAM, all bets are off.

I did try to formulate my own embedded md5sum in the SQLDB version. In that version I would do an fadvise on the data already read to help it get dropped from the cache. The only issue I had was that the source implementation I borrowed from wasn't as fast as the md5sum command itself. So I may borrow the GNU implementation, or avoid embedding it and just pipe out to another configurable program. I'm leaning towards the latter. The blake implementation isn't much faster than md5sum on my weaker CPU.

Truth be told, my implementation doesn't lend itself well to a lot of parallel hash processing; however, you can run a scan per disk in parallel, with the SQLDB access being the bottleneck (albeit a small one).

Anyway, use of FADV_DONTNEED shows interesting results. Might be useful in the mover to drop the cache on a file just moved.
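The fadvise binary used below is WeeboTech's own tool, but the same POSIX_FADV_DONTNEED behaviour can be approximated with GNU dd's `nocache` flag (present in coreutils 8.11 and later). A sketch, assuming GNU dd:

```shell
#!/bin/bash
# Approximate the fadvise tool's cache-dropping reads with GNU dd:
# iflag=nocache asks the kernel (via posix_fadvise POSIX_FADV_DONTNEED)
# not to keep the read data in the page cache.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=8 2>/dev/null   # make an 8 MiB demo file

# read the file while advising the kernel to drop it from the cache
dd if="$f" of=/dev/null bs=1M iflag=nocache 2>/dev/null

# advise dropping any cached pages for the whole file without re-reading it
dd if="$f" iflag=nocache count=0 2>/dev/null
```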
root@unRAID:/mnt/disk1/filedb# free -l
             total       used       free     shared    buffers     cached
Mem:       4116784    1573240    2543544          0     103680    1103220
Low:        869096     461380     407716
High:      3247688    1111860    2135828
-/+ buffers/cache:     366340    3750444
Swap:            0          0          0
root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise --read --dontneed --progress 3 testfile.dat
701972480 bytes (669M) read, 4 s, 167M/s from `testfile.dat'
1413455872 bytes (1.3G) read, 8 s, 168M/s from `testfile.dat'
2099748864 bytes (2.0G) read, 12 s, 167M/s from `testfile.dat'
2791284736 bytes (2.6G) read, 16 s, 166M/s from `testfile.dat'
3477708800 bytes (3.2G) read, 20 s, 166M/s from `testfile.dat'
4117962752 bytes (3.8G) read, 24 s, 164M/s from `testfile.dat'
4710400000 bytes (4.4G) read, 27 s, 166M/s from `testfile.dat'
root@unRAID:/mnt/disk1/filedb# free -l
             total       used       free     shared    buffers     cached
Mem:       4116784    1568348    2548436          0      97252    1108960
Low:        869096     450908     418188
High:      3247688    1117440    2130248
-/+ buffers/cache:     362136    3754648
Swap:            0          0          0
root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise --read --cache --progress 3 testfile.dat
536846336 bytes (512M) read, 4 s, 128M/s from `testfile.dat'
1253548032 bytes (1.2G) read, 8 s, 149M/s from `testfile.dat'
1957142528 bytes (1.8G) read, 12 s, 155M/s from `testfile.dat'
2630852608 bytes (2.5G) read, 16 s, 157M/s from `testfile.dat'
3334971392 bytes (3.1G) read, 20 s, 159M/s from `testfile.dat'
4016234496 bytes (3.7G) read, 24 s, 160M/s from `testfile.dat'
4607942656 bytes (4.3G) read, 28 s, 157M/s from `testfile.dat'
4710400000 bytes (4.4G) read, 28 s, 160M/s from `testfile.dat'
root@unRAID:/mnt/disk1/filedb# free -l
             total       used       free     shared    buffers     cached
Mem:       4116784    3550028     566756          0      97256    3094376
Low:        869096     557808     311288
High:      3247688    2992220     255468
-/+ buffers/cache:     358396    3758388
Swap:            0          0          0
root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise --dontneed testfile.dat
root@unRAID:/mnt/disk1/filedb# free -l
             total       used       free     shared    buffers     cached
Mem:       4116784    1552312    2564472          0      97256    1096352
Low:        869096     446900     422196
High:      3247688    1105412    2142276
-/+ buffers/cache:     358704    3758080
Swap:            0          0          0
jumperalex Posted September 18, 2014

Makes sense. After testing, I didn't figure blake2 would help as much as the drag-race numbers suggested until the only bottleneck was pure CPU. And chances are other things like SATA I/O, and now SQLDB writes, would probably come first. But I still think there might be use cases where it could help, like a wide SATA path and SQL writes to an SSD with a lot of drives. 'Cause even single-threaded blake2 is faster than md5.
WeeboTech Posted September 18, 2014

'Cause even single-threaded blake2 is faster than md5.

Not on my HP MicroServer. MD5 processing is as fast as the SATA can deliver: considering the other fadvise raw test reads, it maxes at 160MB/s in 28s. In my case, pre-processing with md5deep allows two processes in parallel for hashing while reading the hard drive. So I may pre-process with md5deep, then use the associated hash file as a seed to import into the SQLdb and/or xattr on the filesystem.

My other thought is to insert all the stat information into the SQLdb as fast as possible, then have other helper co-processor applications do SQL selects on the NULL hash values and calculate/insert them in parallel. That would allow multiple parallel select-and-hash processes, according to as many CPUs as you have. I have not fully worked out my implementation yet, as there are two camps: those who just want it to work and those who want it to work really fast. In my case I need a locate database, so storing hash values there is logical. With over a million files on 3 drives, it can be difficult to remember where I put something. Anyway, I'm deviating the thread a bit here. The core point of my post is: speed on blake depends on the hardware. So if the bitrot program had a configurable hash helper program, that would allow people to choose what to use per hardware.

root@unRAID:/mnt/disk1/filedb# grep bogo /proc/cpuinfo
bogomips : 2994.98
bogomips : 2994.98
root@unRAID:/mnt/disk1/filedb# grep Mhz /proc/cpuinfo
root@unRAID:/mnt/disk1/filedb# grep -i Mhz /proc/cpuinfo
cpu MHz : 1500.000
power management: ts ttp tm stc 100mhzsteps hwpstate
cpu MHz : 1500.000
power management: ts ttp tm stc 100mhzsteps hwpstate
root@unRAID:/mnt/disk1/filedb# time /boot/bin/fadvise --read --verbose -u testfile.dat
0. testfile.dat
4710400000 bytes (4.4G) read, 27 s, 166M/s from `testfile.dat'

real 0m27.951s
user 0m0.210s
sys 0m8.770s

FOR PERSPECTIVE

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time dd if=testfile.dat of=/dev/null bs=8192
575000+0 records in
575000+0 records out
4710400000 bytes (4.7 GB) copied, 27.8765 s, 169 MB/s

real 0m27.883s
user 0m0.140s
sys 0m8.690s

HASH TESTS

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time md5sum testfile.dat
1ff225cfbdbd9474b51a78a2aad416cc  testfile.dat

real 0m27.749s
user 0m18.020s
sys 0m7.900s

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time md5deep testfile.dat
1ff225cfbdbd9474b51a78a2aad416cc  /mnt/disk1/filedb/testfile.dat

real 0m30.636s
user 0m24.090s
sys 0m9.270s

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time b2sum testfile.dat
a867b501da9b199eda858afed353fef1d229894dd331f09c91c3406c3a578b1db17dedbadcca0a5b054b2cf7e07637bd7ff169c457a8a45cb01a22638f99c3fd  testfile.dat

real 0m45.076s
user 0m36.850s
sys 0m7.070s

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time ./b2sum -a blake2sp testfile.dat
97dd9bae3875a85e9516cb9e2b2a990d503e7be6b7d6d6d2b1a28135597d1232  testfile.dat

real 0m57.266s
user 1m3.040s
sys 0m9.180s

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time ./b2sum -a blake2bp testfile.dat
43624a75e181a169630799b5f32a564ab65277e293528cdac71e748f57e7260f7e3dac93deabce048cbfcb84ad2beab84bff087b505d0707bb863803864428b1  testfile.dat

real 0m46.749s
user 0m42.250s
sys 0m9.380s

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time ./b2sum -a blake2s testfile.dat
36155b44c937ddb9237838df4e29557cae272207546642c57d2f1243b377ac8c  testfile.dat

real 1m0.413s
user 0m52.450s
sys 0m7.270s

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time ./b2sum -a blake2b testfile.dat
a867b501da9b199eda858afed353fef1d229894dd331f09c91c3406c3a578b1db17dedbadcca0a5b054b2cf7e07637bd7ff169c457a8a45cb01a22638f99c3fd  testfile.dat

real 0m44.798s
user 0m37.080s
sys 0m6.850s

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time sha256sum testfile.dat
b5aac2d99f5a47e6ff721b26d6bd50b2f3293455a285e04fe9501edfa48d6c3f  testfile.dat

real 1m14.804s
user 1m6.920s
sys 0m7.250s
jumperalex Posted September 18, 2014

Well right, if the drive isn't feeding it data. I'm just saying that with enough data blake2 does run faster, or, from the other side, blake2 will consume less CPU per hashing process. So if you are feeding it 10 drives' worth of data, you are less likely to run into CPU limits.
jbartlett Posted September 18, 2014 (Author)

With my inventory script saving the hashes to a SQL DB, the main issue I had was that sqlite would introduce long delays into the process, considerably so if running on a spinner. Running the insert/update query after every file was not feasible, so I would append any SQL statements to a file and run them in a batch every minute - that seemed to keep the delays down to around 10 seconds or so. Storing the hashes in the extended attributes takes zero time in comparison.
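The batching idea above can be sketched with the sqlite3 CLI: queue the INSERT statements in a plain text file, then apply the whole batch inside one transaction so sqlite commits once per interval instead of once per file. The table and column names here are hypothetical (not the script's actual schema), and the quoting is naive (paths must not contain single quotes).

```shell
#!/bin/bash
# Sketch: batch hash inserts into sqlite instead of one commit per file.
DB=$(mktemp)       # demo database (sqlite treats an empty file as new)
BATCH=$(mktemp)    # queued SQL statements

sqlite3 "$DB" "CREATE TABLE IF NOT EXISTS hashes
               (path TEXT PRIMARY KEY, md5 TEXT, mtime INTEGER);"

# per file: append a statement to the batch instead of touching the DB
queue_hash() {
    printf "INSERT OR REPLACE INTO hashes VALUES('%s','%s',%d);\n" \
        "$1" "$2" "$3" >> "$BATCH"
}

# once a minute (or at end of run): apply the whole batch atomically
flush_batch() {
    { echo "BEGIN;"; cat "$BATCH"; echo "COMMIT;"; } | sqlite3 "$DB"
    : > "$BATCH"   # empty the queue
}
```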
jbartlett Posted September 18, 2014 (Author)

Question: What do you think about storing the file name location in the extended attributes as well? I think I nixed the idea at first in that there is a limited amount of space in the attributes and I didn't want to hog it - but it seems like there are very, very few scripts that store info in the attribute space. Assuming attributes aren't lost when files show up in "lost+found" (something I haven't and hope I never have to test), it would make putting a file back into its original location faster, as the hash value wouldn't need to be computed first and then compared to an external file. If the attributes are lost, then the hash value compared to the exported file list would recover it. If added, I would need to add an option to refresh just the path in the attributes, and it would automatically be refreshed during adding/verifying.
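The idea above is easy to experiment with using the attr tools (setfattr/getfattr). A minimal sketch; the attribute names are hypothetical, not the script's actual keys, and the filesystem must support user.* attributes:

```shell
#!/bin/bash
# tag_file FILE: store the content hash and the file's canonical path
# in user.* extended attributes.
tag_file() {
    local f=$1
    # store the md5 of the file's contents
    setfattr -n user.hash.md5 -v "$(md5sum "$f" | awk '{print $1}')" "$f"
    # store where the file is supposed to live
    setfattr -n user.orig.path -v "$(realpath "$f")" "$f"
}

# After a filesystem rebuild, a file landing in lost+found could then be
# moved straight back without hashing it first:
#   dest=$(getfattr -n user.orig.path --only-values "$file")
```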
RobJ Posted September 18, 2014

Question: What do you think about storing the file name location in the extended attributes as well? ...

It's a very interesting, good idea. But how often would it be useful? I think file system failures causing "lost+found" files are on the order of 1 in 2000 users (ballpark)? Plus, no one is going to install something like this untested, so you WILL have to test it, thoroughly, either yourself or some other willing sucker (er, user)!
WeeboTech Posted September 18, 2014

Well right, if the drive isn't feeding it data. ... So if you are feeding it 10 drives' worth of data, you are less likely to run into CPU limits.

If you have a powerful CPU. Look at my log: I can read and supply data at 160MB/s. That's pretty good for a very full filesystem accessing the inner cylinders. The raw read from fadvise, md5sum and md5deep are very close. Any of the blake functionality takes longer on my CPU, and it's pegged at 100%. The smaller servers may not be starved for input data, but for CPU.
WeeboTech Posted September 18, 2014

With my inventory script saving the hashes to a SQL DB, the main issue I had was that sqlite would introduce long delays into the process... Storing the hashes in the extended attributes takes zero time in comparison.

Using SQLite via the shell has a penalty in the manner of the original script: connecting to the database, then the select and insert/update on an array drive, adds overhead. In my case with the locate database, I store it on the ramdrive; I'll add something to rsync it to spinning storage later. At that point I can add over a million files of stat() data to it in an hour or two (without hashing). There's really a lot that can be done with a bash co-process, but I'm not sure it's worth it. I wrote my utility in C calling SQLite directly. This avoids the fork()/exec() overhead and database connect/disconnect, plus I can have multiple drives ftw()ing and inserting/hashing. I'll probably devise a separate tool to do the select for NULL or expired hashes and hash/rehash them. Those can run in parallel without issue, since it's not a lot of SQLite transactions. With my needs, I sometimes need to sweep the whole file system searching for something, so having it all confined in a SQLite table is important - i.e. like locate and tripwire combined.

In any case, I think storing the hash and time in the xattr is brilliant. My only fear is losing them if there are filesystem issues or the file gets moved. I'll probably make a utility to import/export a .hash file to/from the attributes. While I'm deviating a bit off topic, it's good for ideas.
jbartlett Posted September 19, 2014 (Author)

It's a very interesting, good idea. But how often would it be useful? ... so you WILL have to test it, thoroughly, either yourself or some other willing sucker (er, user)!

I have UNRAID running under VirtualBox - putting a bunch of files under /mnt/disk1, building the hash, and then deleting them would set things up for a reiser rebuild-tree check. I simulated a "lost+found" to test the recovery aspect, though with no keys preserved, by moving a bunch of files to a single directory and running a recover against that directory.
jbartlett Posted September 19, 2014 (Author)

I'll probably make a utility to import/export a .hash file to/from the attributes.

Already in my script.
WeeboTech Posted September 19, 2014

I'll probably make a utility to import/export a .hash file to/from the attributes.

Already in my script.

While the bitrot shell script is really well written, I'm probably going to do it all in C when I can. While it's easier to do it in bash and/or perl, I'm working with millions of files and I have to be very careful of the naming. I also need to import/export them to my SQLite table. Plus I want the export file to work as input to the raw command if need be.

The testing tool segment I'm working on now makes a .hash file that can be used by the Windows corz checksum tool. It also makes a set of par2 files so you can validate and repair some damaged or missing files (dependent on how bad the damage is). This goes one step further than just detecting it, by allowing you to repair it.

From what I've seen, the export file looks like:

echo $filestoadd > "/tmp/filestoadd.$rand.txt"
echo "$currfile|$ShaKey|$ScanDate" >> "$exportfile"

My suggestion would be to exec to a fd for the export file, then print to the fd in a slightly different format - i.e. a standard sha256sum file, using comments for the supplementary data. In the quick preliminary shell I create a file as:

root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# more folder.hash
# file: folder.hash
# user.dir="/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)"
# user.hash.dir="a59998b6369450c9269203279720b427"
# user.hash.time="1411076621"
60343da3cd9725c218ba522df3024503  18 Michael E - Late Night Dreams.mp3
9060cd879b9870aae4df72e9ae568307  15 Eddie Silverton - White Sand.mp3
cd4d1b45f9b6dfcd0441ca13e91b560d  04 Mazelo Nostra - Smooth Night.mp3
011875f40fea633e0b8a1f94c583ab81  25 Pianochocolate - Long Long Letter.mp3

This way the folder.hash file can be used to validate the current files. I'm purposely using relative filenames here.
The corz checksum makes a file as:

# made with checksum.. point-and-click hashing for windows (64-bit edition).
# from corz.org.. http://corz.org/windows/software/checksum/
#
#md5#folder.par2#[email protected]:14
3a23f04cd50f96a1dcc2b10f3406376d *folder.par2
#md5#filelist.lastrun#[email protected]:15
f6cd171fd3955f1e9a25010451509652 *filelist.lastrun
#md5#disk3.filelist.2014-37.txt#[email protected]:03
62064d427826035e7360cbf1a409aa61 *disk3.filelist.2014-37.txt

In my version, I tried to make the comments look like the exported getfattr -d format. I'll build a parser in C to allow use of this file to import into SQLite or do other things with it. I have the parser for the hash line; now I need the variable line. Since I write to folder.hash in every folder (and folder.par2), user.hash.dir is a hash of the path, so the folder.hash files can be collected to a safer location off the filesystem.

Anyway, with par2create/verify/repair executed within each directory, I can damage a file and repair it. I can delete a file and have it recreated, depending on how many blocks are available. Here's an example of the directory. Cool thing with the .hash naming: you can go onto Windows, right-click on the file, click "check checksum" and get quick validation from the Windows workstation.
root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# ls -l
total 280837
-rw-rw-rw- 1 rcotrone 522 15798176 2014-09-18 11:25 01\ The\ Diventa\ Project\ -\ Serenity\ (Lazy\ Hammock\ Mix).mp3
-rw-rw-rw- 1 rcotrone 522  7648975 2014-09-18 11:25 02\ Stepo\ Del\ Sol\ -\ Fly\ Far\ Away.mp3
-rw-rw-rw- 1 rcotrone 522  9873573 2014-09-18 11:25 03\ Stargazer\ -\ White\ Caps\ (Sylt\ Mix).mp3
-rw-rw-rw- 1 rcotrone 522  9378281 2014-09-18 11:25 04\ Mazelo\ Nostra\ -\ Smooth\ Night.mp3
-rw-rw-rw- 1 rcotrone 522  8689679 2014-09-18 11:25 05\ Billy\ Esteban\ -\ Dream.mp3
-rw-rw-rw- 1 rcotrone 522 16075024 2014-09-18 11:25 06\ C.A.V.O.K\ -\ Night\ Flight.mp3
-rw-rw-rw- 1 rcotrone 522  9431569 2014-09-18 11:25 07\ Nasser\ Shibani\ -\ Time\ Chase.mp3
-rw-rw-rw- 1 rcotrone 522 12205755 2014-09-18 11:25 08\ Dave\ Ross\ -\ Solana.mp3
-rw-rw-rw- 1 rcotrone 522  9747172 2014-09-18 11:25 09\ Gabor\ Deutsch\ -\ Rearrange\ (Feat.\ Harcsa\ Veronika).mp3
-rw-rw-rw- 1 rcotrone 522 11318708 2014-09-18 11:25 10\ Ryan\ KP\ -\ Everythings\ Gonna\ Be\ Alright\ (Feat.\ Melody).mp3
-rw-rw-rw- 1 rcotrone 522 10283170 2014-09-18 11:25 11\ Florzinho\ -\ Primavera\ (Dub\ Mix).mp3
-rw-rw-rw- 1 rcotrone 522 13498323 2014-09-18 11:25 12\ Lazy\ Hammock\ -\ One\ of\ Those\ Days.mp3
-rw-rw-rw- 1 rcotrone 522 10098189 2014-09-18 11:25 13\ Myah\ -\ Falling.mp3
-rw-rw-rw- 1 rcotrone 522 10972802 2014-09-18 11:25 14\ Peter\ Pearson\ -\ I\ Need\ to\ Chill.mp3
-rw-rw-rw- 1 rcotrone 522 10983245 2014-09-18 11:25 15\ Eddie\ Silverton\ -\ White\ Sand.mp3
-rw-rw-rw- 1 rcotrone 522  6619741 2014-09-18 11:25 16\ Ingo\ Herrmann\ -\ Filtron.mp3
-rw-rw-rw- 1 rcotrone 522 14187937 2014-09-18 11:25 17\ DJ\ MNX\ -\ Cosmic\ Dreamer.mp3
-rw-rw-rw- 1 rcotrone 522  9187068 2014-09-18 11:25 18\ Michael\ E\ -\ Late\ Night\ Dreams.mp3
-rw-rw-rw- 1 rcotrone 522  5278124 2014-09-18 11:25 19\ Francois\ Maugame\ -\ Like\ a\ Summer\ Breeze.mp3
-rw-rw-rw- 1 rcotrone 522 12749119 2014-09-18 11:25 20\ Collioure\ -\ Perfect\ Resort.mp3
-rw-rw-rw- 1 rcotrone 522  7050247 2014-09-18 11:25 21\ Leon\ Ard\ -\ Caribbean\ Dreams.mp3
-rw-rw-rw- 1 rcotrone 522 10242401 2014-09-18 11:25 22\ Syusi\ -\ Bright\ Moments.mp3
-rw-rw-rw- 1 rcotrone 522 10783678 2014-09-18 11:25 23\ Thomas\ Lemmer\ -\ Above\ the\ Clouds.mp3
-rw-rw-rw- 1 rcotrone 522  8890313 2014-09-18 11:25 24\ Frame\ by\ Frame\ -\ Borderland.mp3
-rw-rw-rw- 1 rcotrone 522  9443076 2014-09-18 11:25 25\ Pianochocolate\ -\ Long\ Long\ Letter.mp3
-rw-rw-rw- 1 root     522     2049 2014-09-18 17:43 folder.hash
-rw-rw-rw- 1 root     522   132623 2014-09-18 17:03 folder.jpg
-rw-rw-rw- 1 root     522    46760 2014-09-18 17:43 folder.par2
-rw-rw-rw- 1 root     522 26635780 2014-09-18 17:43 folder.vol000+200.par2

Food for thought. Here's a log of what is possible by creating the set of par2 files per folder. It might be something to consider for the bitrot program, i.e. being able to repair a small file directly, or storing the par2 file somewhere else.

root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# par2verify folder.par2
par2cmdline version 0.4, Copyright (C) 2003 Peter Brian Clements.
par2cmdline comes with ABSOLUTELY NO WARRANTY.

This is free software, and you are welcome to redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. See COPYING for details.

Loading "folder.par2".
Loaded 54 new packets
Loading "folder.vol000+200.par2".
Loaded 200 new packets including 200 recovery blocks

There are 26 recoverable files and 0 other files.
The block size used was 131244 bytes.
There are a total of 2000 data blocks.
The total size of the data files is 260566968 bytes.

Verifying source files:

Target: "01 The Diventa Project - Serenity (Lazy Hammock Mix).mp3" - found.
Target: "02 Stepo Del Sol - Fly Far Away.mp3" - found.
Target: "03 Stargazer - White Caps (Sylt Mix).mp3" - found. Target: "04 Mazelo Nostra - Smooth Night.mp3" - found. Target: "05 Billy Esteban - Dream.mp3" - found. Target: "06 C.A.V.O.K - Night Flight.mp3" - found. Target: "07 Nasser Shibani - Time Chase.mp3" - found. Target: "08 Dave Ross - Solana.mp3" - found. Target: "09 Gabor Deutsch - Rearrange (Feat. Harcsa Veronika).mp3" - found. Target: "10 Ryan KP - Everythings Gonna Be Alright (Feat. Melody).mp3" - found. Target: "11 Florzinho - Primavera (Dub Mix).mp3" - found. Target: "12 Lazy Hammock - One of Those Days.mp3" - found. Target: "13 Myah - Falling.mp3" - found. Target: "14 Peter Pearson - I Need to Chill.mp3" - found. Target: "15 Eddie Silverton - White Sand.mp3" - found. Target: "16 Ingo Herrmann - Filtron.mp3" - found. Target: "17 DJ MNX - Cosmic Dreamer.mp3" - found. Target: "18 Michael E - Late Night Dreams.mp3" - found. Target: "19 Francois Maugame - Like a Summer Breeze.mp3" - found. Target: "20 Collioure - Perfect Resort.mp3" - found. Target: "21 Leon Ard - Caribbean Dreams.mp3" - found. Target: "22 Syusi - Bright Moments.mp3" - found. Target: "23 Thomas Lemmer - Above the Clouds.mp3" - found. Target: "24 Frame by Frame - Borderland.mp3" - found. Target: "25 Pianochocolate - Long Long Letter.mp3" - found. Target: "folder.jpg" - found. All files are correct, repair is not required. root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# rm -vi folder.jpg rm: remove regular file `folder.jpg'? y removed `folder.jpg' root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# par2verify folder.par2 par2cmdline version 0.4, Copyright (C) 2003 Peter Brian Clements. par2cmdline comes with ABSOLUTELY NO WARRANTY. 
This is free software, and you are welcome to redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. See COPYING for details. Loading "folder.par2". Loaded 54 new packets Loading "folder.vol000+200.par2". Loaded 200 new packets including 200 recovery blocks There are 26 recoverable files and 0 other files. The block size used was 131244 bytes. There are a total of 2000 data blocks. The total size of the data files is 260566968 bytes. Verifying source files: Target: "01 The Diventa Project - Serenity (Lazy Hammock Mix).mp3" - found. Target: "02 Stepo Del Sol - Fly Far Away.mp3" - found. Target: "03 Stargazer - White Caps (Sylt Mix).mp3" - found. Target: "04 Mazelo Nostra - Smooth Night.mp3" - found. Target: "05 Billy Esteban - Dream.mp3" - found. Target: "06 C.A.V.O.K - Night Flight.mp3" - found. Target: "07 Nasser Shibani - Time Chase.mp3" - found. Target: "08 Dave Ross - Solana.mp3" - found. Target: "09 Gabor Deutsch - Rearrange (Feat. Harcsa Veronika).mp3" - found. Target: "10 Ryan KP - Everythings Gonna Be Alright (Feat. Melody).mp3" - found. Target: "11 Florzinho - Primavera (Dub Mix).mp3" - found. Target: "12 Lazy Hammock - One of Those Days.mp3" - found. Target: "13 Myah - Falling.mp3" - found. Target: "14 Peter Pearson - I Need to Chill.mp3" - found. Target: "15 Eddie Silverton - White Sand.mp3" - found. Target: "16 Ingo Herrmann - Filtron.mp3" - found. Target: "17 DJ MNX - Cosmic Dreamer.mp3" - found. Target: "18 Michael E - Late Night Dreams.mp3" - found. Target: "19 Francois Maugame - Like a Summer Breeze.mp3" - found. Target: "20 Collioure - Perfect Resort.mp3" - found. Target: "21 Leon Ard - Caribbean Dreams.mp3" - found. Target: "22 Syusi - Bright Moments.mp3" - found. Target: "23 Thomas Lemmer - Above the Clouds.mp3" - found. Target: "24 Frame by Frame - Borderland.mp3" - found. 
Target: "25 Pianochocolate - Long Long Letter.mp3" - found. Target: "folder.jpg" - missing. Scanning extra files: Repair is required. 1 file(s) are missing. 25 file(s) are ok. You have 1998 out of 2000 data blocks available. You have 200 recovery blocks available. Repair is possible. You have an excess of 198 recovery blocks. 2 recovery blocks will be used to repair. root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# par2repair folder.par2 par2cmdline version 0.4, Copyright (C) 2003 Peter Brian Clements. par2cmdline comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. See COPYING for details. Loading "folder.par2". Loaded 54 new packets Loading "folder.vol000+200.par2". Loaded 200 new packets including 200 recovery blocks There are 26 recoverable files and 0 other files. The block size used was 131244 bytes. There are a total of 2000 data blocks. The total size of the data files is 260566968 bytes. Verifying source files: Target: "01 The Diventa Project - Serenity (Lazy Hammock Mix).mp3" - found. Target: "02 Stepo Del Sol - Fly Far Away.mp3" - found. Target: "03 Stargazer - White Caps (Sylt Mix).mp3" - found. Target: "04 Mazelo Nostra - Smooth Night.mp3" - found. Target: "05 Billy Esteban - Dream.mp3" - found. Target: "06 C.A.V.O.K - Night Flight.mp3" - found. Target: "07 Nasser Shibani - Time Chase.mp3" - found. Target: "08 Dave Ross - Solana.mp3" - found. Target: "09 Gabor Deutsch - Rearrange (Feat. Harcsa Veronika).mp3" - found. Target: "10 Ryan KP - Everythings Gonna Be Alright (Feat. Melody).mp3" - found. Target: "11 Florzinho - Primavera (Dub Mix).mp3" - found. Target: "12 Lazy Hammock - One of Those Days.mp3" - found. Target: "13 Myah - Falling.mp3" - found. 
Target: "14 Peter Pearson - I Need to Chill.mp3" - found. Target: "15 Eddie Silverton - White Sand.mp3" - found. Target: "16 Ingo Herrmann - Filtron.mp3" - found. Target: "17 DJ MNX - Cosmic Dreamer.mp3" - found. Target: "18 Michael E - Late Night Dreams.mp3" - found. Target: "19 Francois Maugame - Like a Summer Breeze.mp3" - found. Target: "20 Collioure - Perfect Resort.mp3" - found. Target: "21 Leon Ard - Caribbean Dreams.mp3" - found. Target: "22 Syusi - Bright Moments.mp3" - found. Target: "23 Thomas Lemmer - Above the Clouds.mp3" - found. Target: "24 Frame by Frame - Borderland.mp3" - found. Target: "25 Pianochocolate - Long Long Letter.mp3" - found. Target: "folder.jpg" - missing. Scanning extra files: Repair is required. 1 file(s) are missing. 25 file(s) are ok. You have 1998 out of 2000 data blocks available. You have 200 recovery blocks available. Repair is possible. You have an excess of 198 recovery blocks. 2 recovery blocks will be used to repair. Computing Reed Solomon matrix. Constructing: done. Solving: done. Wrote 132623 bytes to disk Verifying repaired files: Target: "folder.jpg" - found. Repair complete. root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# ls -l folder.jpg -rw-rw-rw- 1 root 522 132623 2014-09-19 02:20 folder.jpg root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# md5sum -c folder.hash | grep folder.jpg folder.jpg: OK Quote Link to comment
BRiT Posted September 19, 2014 Share Posted September 19, 2014 Something you will want to validate/verify is whether the par2 files themselves have checksums and integrity checks built in. I haven't looked at the format closely enough to know how it handles that. You wouldn't want to be using a possibly corrupt par2 file, or part of one, to do a repair if you ever need to. Quote Link to comment
itimpi Posted September 19, 2014 Share Posted September 19, 2014 Something you will want to validate/verify is whether the par2 files themselves have checksums and integrity checks built in. I haven't looked at the format closely enough to know how it handles that. You wouldn't want to be using a possibly corrupt par2 file, or part of one, to do a repair if you ever need to. Par2 files do have integrity checks built in. They run the same checks on themselves that they run on the files they protect when you try to use them. That is why you can often use even a damaged par2 file for repair: it can work out which of the recovery blocks it contains are still intact and can be used for repair purposes. Quote Link to comment
WeeboTech Posted September 19, 2014 Share Posted September 19, 2014 Question: What do you think about storing the file's name and location in the extended attributes as well? I think I nixed the idea at first because there is a limited amount of space in the attributes and I didn't want to hog it - but it seems very few scripts store info in the attribute space. I thought this relevant: For ext2/3/4 and btrfs, each extended attribute is limited to a filesystem block (e.g. 4 KiB), and in practice in ext2/3/4 all of them must fit together in a single block (including names and values). ReiserFS allows attributes of arbitrary size. In XFS the names can be up to 256 bytes in length, terminated by the first 0 byte, and the values can be up to 64 KB of arbitrary binary data. Quote Link to comment
jbartlett Posted September 19, 2014 Author Share Posted September 19, 2014 Question: What do you think about storing the file's name and location in the extended attributes as well? I think I nixed the idea at first because there is a limited amount of space in the attributes and I didn't want to hog it - but it seems very few scripts store info in the attribute space. I thought this relevant: For ext2/3/4 and btrfs, each extended attribute is limited to a filesystem block (e.g. 4 KiB), and in practice in ext2/3/4 all of them must fit together in a single block (including names and values). ReiserFS allows attributes of arbitrary size. In XFS the names can be up to 256 bytes in length, terminated by the first 0 byte, and the values can be up to 64 KB of arbitrary binary data. Sweet - a non-issue then. Quote Link to comment
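A 32-character md5 plus a path fits comfortably inside even the ext4 single-block limit quoted above. Here is a rough sketch of the idea using the attr tools (setfattr/getfattr, assumed installed); the attribute names `user.md5` and `user.location` and the demo file are invented for this example and are not what bitrot.sh actually uses.

```shell
# Sketch only: store a file's md5 and its original path in user.* xattrs,
# falling back to just printing the hash if the filesystem (or missing attr
# tools) rejects user xattrs. Attribute names here are invented examples.
f=./xattr-demo.bin
printf 'demo payload' > "$f"
sum=$(md5sum "$f" | cut -d' ' -f1)
if setfattr -n user.md5 -v "$sum" "$f" 2>/dev/null; then
    setfattr -n user.location -v "$(realpath "$f")" "$f"
    getfattr -n user.md5 --only-values "$f" 2>/dev/null
    echo
else
    echo "$sum"    # user xattrs unsupported here; hash shown anyway
fi
rm -f "$f"
```

During a lost+found recovery, reading `user.location` back would tell the script where the orphaned file originally lived.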
eroz Posted September 20, 2014 Share Posted September 20, 2014 Thanks for the script. It works great. I have a question: can you run the script with more than one mask? e.g. bitrot.sh -a -p /mnt/user/Movies -m *.mkv -m *.m4v Quote Link to comment
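I haven't checked whether the script's option parsing accepts a repeated -m flag, but if it only takes one mask per run, calling it once per mask is an easy workaround. `run_masks` is a hypothetical helper; the `echo` makes this a dry run that just prints each invocation.

```shell
# Hypothetical workaround, assuming bitrot.sh accepts only a single -m mask
# per run: invoke it once per mask. The echo makes this a dry run; remove
# it to actually execute the script.
run_masks() {
    path="$1"; shift
    for mask in "$@"; do
        echo ./bitrot.sh -a -p "$path" -m "$mask"
    done
}

run_masks /mnt/user/Movies '*.mkv' '*.m4v'
```

Quoting the masks ('*.mkv') keeps the shell from globbing them before the script sees them, which matters regardless of how many -m flags the script supports.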