jbartlett Posted September 16, 2014 (Author)

What's the benefit to having an exported hash file as well as the hash in the metadata? Also how often are you running your script, do you have it scheduled?

The exported list is used in the event of a catastrophic error in which you have to low-level scan the sectors and rebuild the file/directory tree in "lost+found". It'll generate a hash on the files in the lost+found directory and move them back to their original location. Though, the last time I had to do this, I ended up with files reporting sizes in the PB range; I may add an exclusion for that.
jumperalex Posted September 17, 2014

I would like to request / suggest the inclusion of BLAKE2 as an optional hash function. You can get the details here: https://blake2.net/ The main reason is that it is capable of multi-threading a single hash generation and is indeed as fast as md5 while not sharing md5's known collision weaknesses. Couple that with running at least as many processes as there are drives (to maximize throughput from the drives as choke points) and you really will have a blazing fast process. While not critical for generating (though nice), it can be very appealing during verifications; like, after a parity fail or [cough] something kernel related [cough] ;-)
jumperalex Posted September 17, 2014

TL;DR - Any hash generating script / plugin created for unRAID should be looking to parallelize the operation across HDDs as much as possible, and then consider which hash algorithm to use for anyone with enough drives to max the CPU threads/cores. Because blake2 is indeed faster than md5 when fed data fast enough.

===========================================

Well I went a little crazy and started running some tests. blake2 is about 2.5x faster than md5 when disk bandwidth is not a limiting factor (0.481s vs 1.237s for the same 1GB file stored in /tmp in ramfs). That is the average over five iterations, as are all timed tests. However, when I ran a single-file test from /mnt/disk1 with a cache flush in between each test, they took about the same time to run. htop clearly showed blake2 spawned multiple processes vs. md5sum's single one, but it didn't help. Neither came close to tapping out a single core, no less all eight of them. Next I used a script to run three simultaneous background processes, one for each of my three disks, first for blake2 and then for md5, with a cache flush in between. Again, blake2 spawned a shedload of processes vs. md5's three, but each finished in nearly the same time (13-16s); though if I had to award a gold medal, md5 won by a nose. I can only guess that it took some extra time to set up the extra processes blake2 spawns, since it is clearly faster when fed lots of data in a single-file head-to-head drag race. Maybe it was also initial I/O contention, since all my drives are sitting on the m/b SATA ports.
This is the code I used:

#!/bin/bash
echo Flushing Cache
sync; echo 3 > /proc/sys/vm/drop_caches
echo Running md5 timed test on disks 1-3
time md5sum /mnt/disk1/Torrents/Hell.on.Wheels.S04E03.720p.HDTV.x264-IMMERSE.mkv &
time md5sum /mnt/disk2/Torrents/Hell.on.Wheels.S04E04.720p.HDTV.x264-KILLERS.mkv &
time md5sum /mnt/disk3/Torrents/Hell.on.Wheels.S04E05.720p.HDTV.X264-DIMENSION.mkv &

#!/bin/bash
echo Flushing Cache
sync; echo 3 > /proc/sys/vm/drop_caches
echo Running Blake2sp timed test on disks 1-3
time b2sum-amd64-linux -a blake2sp /mnt/disk1/Torrents/Hell.on.Wheels.S04E03.720p.HDTV.x264-IMMERSE.mkv &
time b2sum-amd64-linux -a blake2sp /mnt/disk2/Torrents/Hell.on.Wheels.S04E04.720p.HDTV.x264-KILLERS.mkv &
time b2sum-amd64-linux -a blake2sp /mnt/disk3/Torrents/Hell.on.Wheels.S04E05.720p.HDTV.X264-DIMENSION.mkv &

and sample output:

root@Tower:/boot/scripts# blaketest.sh
Flushing Cache
Running Blake2sp timed test on disks 1-3
root@Tower:/boot/scripts# dc4868be21d0c89c7daac0b7c43357216eb18fe3d31f936c55efa99bf49a58ca /mnt/disk2/Torrents/Hell.on.Wheels.S04E04.720p.HDTV.x264-KILLERS.mkv

real 0m10.283s
user 0m7.640s
sys 0m2.040s

980515c91493729e966f533a2e717f01761da7ad60f0ab292ba10e0376f239ad /mnt/disk3/Torrents/Hell.on.Wheels.S04E05.720p.HDTV.X264-DIMENSION.mkv

real 0m13.214s
user 0m7.510s
sys 0m1.960s

cd463de201638f30b87205243918d890f5803b2817402542fc6c3f73aca4ed9f /mnt/disk1/Torrents/Hell.on.Wheels.S04E03.720p.HDTV.x264-IMMERSE.mkv

real 0m13.513s
user 0m7.870s
sys 0m2.310s

The time of the last file to process is what I used as the total time to process all files. It might not be "perfect" since that file is also the largest (barely), but it is good enough to show me that I can process three nearly identically sized files in as little as 13-16 seconds. What happens when I try to process each file in sequence? Does it take the sum of the time for each file (38s for md5)? No it does not. It takes nearly 4x as long as the parallel run!!! 55.9s vs 14.3s for md5.
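The scripts above background one hasher per disk but never wait for them, so the total has to be read off the last file's output. A small generalization (a sketch; `hash_parallel` is a hypothetical name, not part of any posted script) backgrounds one md5sum per path and waits for all of them, so a single `time` reports the true wall-clock total for the whole batch:

```shell
#!/bin/bash
# hash_parallel: run one md5sum per argument in the background, then
# wait for every one of them, so `time hash_parallel ...` measures the
# wall-clock time of the whole parallel batch.
hash_parallel() {
    local f
    for f in "$@"; do
        md5sum "$f" &
    done
    wait   # block until every background hash has exited
}

# usage (after flushing the cache as in the scripts above):
#   time hash_parallel /mnt/disk[1-3]/Torrents/*.mkv
```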
I don't know why running the files in sequence takes longer than the sum of the individual file times, but I do know why it took longer than running three disk streams at the same time. Now imagine how that will scale with 10 disks, or even 20 disks!!! Frankly I expect it to scale fairly linearly until either the CPU is maxed or the SATA path(s) is (are) saturated.

So what does this all mean? IDFK. No seriously, I don't know if blake2 will be worth it regardless of the # of disks run in parallel unless you're running an SSD array!!! But I strongly suspect that at least folks running Atoms or similarly low-end CPUs might benefit from blake2 even with only 5 or 6 drives, heck maybe fewer than that [shrug], and again I suspect it might help on a moderate CPU once enough drives are feeding it data.

Unfortunately I can't test with more disks since I only have three in my array. I also can't easily test with a weaker CPU to see if I can't find the inflection point; I'm just not in the mood to play with my syslinux.cfg to limit cores on my production system. I could add a fourth drive, my cache, to the mix, but at this point I'm tired of messing with it and probably have interview-prep "stuff" to do anyway. This might all have been an attempt at procrastination but I'm not admitting anything. Anyone with a lot of drives and/or a weaker CPU wanna try? You can easily grab the blake2 "fat binaries" from here: https://blake2.net/#dl. It is a single file you can copy to your flash and then over to /tmp.

FYI I used '-a blake2sp' because it produces a 64-hex-digit hash and enables multi-processing. In testing from RAM, the 's' and 'b' versions were about the same speed regardless of the 'p' multiprocessing option. 'b' creates a 128-hex-digit hash, and since we aren't hashing passwords here, I saw no point in creating hash files twice the size.
jbartlett Posted September 18, 2014 (Author)

Here are the times I got running a 4.69 GB file from RAMFS vs disk. Times are in seconds.

md5sum: 9 / 29
sha1deep: 17 / 29
sha256deep: 25 / 28
tigerdeep: 14 / 29
whirlpooldeep: 57 / 58

When I was validating my hashes after upgrading to beta9, I had 8 telnet sessions opened (the max) and had a validate running on 8 drives, one per session. I couldn't perceive any slowdowns, but then again, I wasn't looking for one.
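A comparison like the one above can be scripted so each hasher reads the same file cold. This is a sketch of the method, not the poster's actual benchmark: it loops over a few stock hashers and flushes the page cache between runs (the flush needs root and is silently skipped otherwise).

```shell
#!/bin/bash
# bench_hashers FILE: time each hasher on the same file, flushing the
# page cache between runs so every pass reads from disk rather than RAM.
bench_hashers() {
    local f=$1 h
    for h in md5sum sha1sum sha256sum; do
        sync
        # drop the page cache; needs root, otherwise quietly skipped
        sh -c 'echo 3 > /proc/sys/vm/drop_caches' 2>/dev/null || true
        echo "== $h =="
        time "$h" "$f" > /dev/null
    done
}

# usage:
#   bench_hashers /mnt/disk1/some-large-file
```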
jumperalex Posted September 18, 2014

Did you happen to watch htop to see what the CPU was doing during that 8-way validation?
jumperalex Posted September 18, 2014

I couldn't perceive any slow downs but then again, I wasn't looking for one.

Right, so in theory you could have gotten even more total hash throughput with more than 8 running processes during that validation. I mean, we know there has to be a point where the CPU or SATA I/O becomes the choke point, and it will depend on the CPU and SATA port setup. With little enough memory, even that could become the bottleneck at some point. But one thing is for sure: we want to avoid running each file and each drive in sequence if it can be avoided. What CPU were you running?
jbartlett Posted September 18, 2014 (Author)

Did you happen to watch htop to see what the CPU was doing during that 8-way validation?

CPU: Intel i7-4771 @ 3.50GHz

Ran 7 verifies on 7 drives on large media files, 1 window with htop. CPU1 stayed around 50-60%, the rest at 80-99%. sha256deep had two threads open on each file. Aborting each task had a noticeable effect on the CPU utilization. With only one session running, a single CPU hovered at 35-40%. I stopped cache_dirs for this test.
jumperalex Posted September 18, 2014

Interesting. Thanks. Clearly a bit stronger CPU than mine. Sounds like another drive or two might have pushed you over the limit, so to speak. So, being the author of this wonderful tool and all, and admitting I'm not using it yet: is it able to run per-disk background processes? Or is that at least on your to-do list after everything else shakes out?
WeeboTech Posted September 18, 2014

One of the things I've been working on when dealing with this is avoiding flushing the cache buffers. In doing my own tests on a whole drive (with clearly less CPU), access to the drive becomes hampered. Through the use of the fadvise call and the POSIX_FADV_DONTNEED option, you can drop the cache on the file just read, which helps avoid pushing more data out of the cache buffers. Probably more useful for a large number of small files. If you happen to have a file that is larger than RAM, all bets are off.

I did try to formulate my own embedded md5sum in the SQLDB version. In that version I would do an fadvise on the data already read to help it get dropped from the cache. The only issue I had was that the source implementation I borrowed from wasn't as fast as the md5sum command itself. So I may borrow the GNU implementation, or avoid embedding it and just pipe out to another configurable program. I'm leaning towards the latter. The blake implementation isn't much faster than md5sum on my weaker CPU.

Truth be told, my implementation doesn't lend itself well to a lot of parallel hash processing; however, you can run a scan per disk in parallel, with the SQLDB access being the bottleneck (albeit a small one).

Anyway, use of FADV_DONTNEED shows interesting results. Might be useful in the mover to drop the cache on a file just moved.
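The fadvise binary used below is WeeboTech's own tool, but the same POSIX_FADV_DONTNEED behaviour can be approximated with GNU dd's `nocache` flag (present in coreutils 8.11 and later). A sketch, assuming GNU dd:

```shell
#!/bin/bash
# Approximate the fadvise tool's cache-dropping reads with GNU dd:
# iflag=nocache asks the kernel (via posix_fadvise POSIX_FADV_DONTNEED)
# not to keep the read data in the page cache.
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1M count=8 2>/dev/null   # make an 8 MiB demo file

# read the file while advising the kernel to drop it from the cache
dd if="$f" of=/dev/null bs=1M iflag=nocache 2>/dev/null

# advise dropping any cached pages for the whole file without re-reading it
dd if="$f" iflag=nocache count=0 2>/dev/null
```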
root@unRAID:/mnt/disk1/filedb# free -l
             total       used       free     shared    buffers     cached
Mem:       4116784    1573240    2543544          0     103680    1103220
Low:        869096     461380     407716
High:      3247688    1111860    2135828
-/+ buffers/cache:     366340    3750444
Swap:            0          0          0
root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise --read --dontneed --progress 3 testfile.dat
701972480 bytes (669M) read, 4 s, 167M/s from `testfile.dat'
1413455872 bytes (1.3G) read, 8 s, 168M/s from `testfile.dat'
2099748864 bytes (2.0G) read, 12 s, 167M/s from `testfile.dat'
2791284736 bytes (2.6G) read, 16 s, 166M/s from `testfile.dat'
3477708800 bytes (3.2G) read, 20 s, 166M/s from `testfile.dat'
4117962752 bytes (3.8G) read, 24 s, 164M/s from `testfile.dat'
4710400000 bytes (4.4G) read, 27 s, 166M/s from `testfile.dat'
root@unRAID:/mnt/disk1/filedb# free -l
             total       used       free     shared    buffers     cached
Mem:       4116784    1568348    2548436          0      97252    1108960
Low:        869096     450908     418188
High:      3247688    1117440    2130248
-/+ buffers/cache:     362136    3754648
Swap:            0          0          0
root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise --read --cache --progress 3 testfile.dat
536846336 bytes (512M) read, 4 s, 128M/s from `testfile.dat'
1253548032 bytes (1.2G) read, 8 s, 149M/s from `testfile.dat'
1957142528 bytes (1.8G) read, 12 s, 155M/s from `testfile.dat'
2630852608 bytes (2.5G) read, 16 s, 157M/s from `testfile.dat'
3334971392 bytes (3.1G) read, 20 s, 159M/s from `testfile.dat'
4016234496 bytes (3.7G) read, 24 s, 160M/s from `testfile.dat'
4607942656 bytes (4.3G) read, 28 s, 157M/s from `testfile.dat'
4710400000 bytes (4.4G) read, 28 s, 160M/s from `testfile.dat'
root@unRAID:/mnt/disk1/filedb# free -l
             total       used       free     shared    buffers     cached
Mem:       4116784    3550028     566756          0      97256    3094376
Low:        869096     557808     311288
High:      3247688    2992220     255468
-/+ buffers/cache:     358396    3758388
Swap:            0          0          0
root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise --dontneed testfile.dat
root@unRAID:/mnt/disk1/filedb# free -l
             total       used       free     shared    buffers     cached
Mem:       4116784    1552312    2564472          0      97256    1096352
Low:        869096     446900     422196
High:      3247688    1105412    2142276
-/+ buffers/cache:     358704    3758080
Swap:            0          0          0
jumperalex Posted September 18, 2014

Makes sense. After testing, I didn't figure blake2 would help as much as the drag-race numbers suggested until the only bottleneck was pure CPU. And chances are other things like SATA I/O, and now SQLDB writes, would probably come first. But I still think there might be use cases where it could help, like a wide SATA path and SQL writes to an SSD with a lot of drives. 'Cause even single-threaded blake2 is faster than md5.
WeeboTech Posted September 18, 2014

'Cause even single-threaded blake2 is faster than md5.

Not on my HP MicroServer. MD5 processing is as fast as the SATA can deliver: considering the other fadvise raw test reads, it maxes at 160MB/s in 28s. In my case, pre-processing with md5deep allows two processes in parallel for hashing while reading the hard drive. So I may pre-process with md5deep, then use the associated hash file as a seed to import into the SQLdb and/or xattr on the filesystem.

My other thought is to insert all the stat information into the SQLdb as fast as possible, then have other helper co-processor applications do SQL selects on the NULL hash values and calculate/insert them in parallel. That would allow multiple parallel select-and-hash processes, according to as many CPUs as you have. I have not fully worked out my implementation yet, as there are two camps: those who just want it to work and those who want it to work really fast. In my case I need a locate database, so storing hash values there is logical. With over a million files on 3 drives, it can be difficult to remember where I put something. Anyway, I'm deviating the thread a bit here. The core point of my post is: speed on blake depends on the hardware. So if the bitrot program had a configurable hash helper program, that would allow people to choose what to use per hardware.

root@unRAID:/mnt/disk1/filedb# grep bogo /proc/cpuinfo
bogomips : 2994.98
bogomips : 2994.98
root@unRAID:/mnt/disk1/filedb# grep Mhz /proc/cpuinfo
root@unRAID:/mnt/disk1/filedb# grep -i Mhz /proc/cpuinfo
cpu MHz : 1500.000
power management: ts ttp tm stc 100mhzsteps hwpstate
cpu MHz : 1500.000
power management: ts ttp tm stc 100mhzsteps hwpstate
root@unRAID:/mnt/disk1/filedb# time /boot/bin/fadvise --read --verbose -u testfile.dat
0. testfile.dat
4710400000 bytes (4.4G) read, 27 s, 166M/s from `testfile.dat'

real 0m27.951s
user 0m0.210s
sys 0m8.770s

FOR PERSPECTIVE

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time dd if=testfile.dat of=/dev/null bs=8192
575000+0 records in
575000+0 records out
4710400000 bytes (4.7 GB) copied, 27.8765 s, 169 MB/s

real 0m27.883s
user 0m0.140s
sys 0m8.690s

HASH TESTS

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time md5sum testfile.dat
1ff225cfbdbd9474b51a78a2aad416cc  testfile.dat

real 0m27.749s
user 0m18.020s
sys 0m7.900s

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time md5deep testfile.dat
1ff225cfbdbd9474b51a78a2aad416cc  /mnt/disk1/filedb/testfile.dat

real 0m30.636s
user 0m24.090s
sys 0m9.270s

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time b2sum testfile.dat
a867b501da9b199eda858afed353fef1d229894dd331f09c91c3406c3a578b1db17dedbadcca0a5b054b2cf7e07637bd7ff169c457a8a45cb01a22638f99c3fd  testfile.dat

real 0m45.076s
user 0m36.850s
sys 0m7.070s

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time ./b2sum -a blake2sp testfile.dat
97dd9bae3875a85e9516cb9e2b2a990d503e7be6b7d6d6d2b1a28135597d1232  testfile.dat

real 0m57.266s
user 1m3.040s
sys 0m9.180s

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time ./b2sum -a blake2bp testfile.dat
43624a75e181a169630799b5f32a564ab65277e293528cdac71e748f57e7260f7e3dac93deabce048cbfcb84ad2beab84bff087b505d0707bb863803864428b1  testfile.dat

real 0m46.749s
user 0m42.250s
sys 0m9.380s

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time ./b2sum -a blake2s testfile.dat
36155b44c937ddb9237838df4e29557cae272207546642c57d2f1243b377ac8c  testfile.dat

real 1m0.413s
user 0m52.450s
sys 0m7.270s

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time ./b2sum -a blake2b testfile.dat
a867b501da9b199eda858afed353fef1d229894dd331f09c91c3406c3a578b1db17dedbadcca0a5b054b2cf7e07637bd7ff169c457a8a45cb01a22638f99c3fd  testfile.dat

real 0m44.798s
user 0m37.080s
sys 0m6.850s

root@unRAID:/mnt/disk1/filedb# /boot/bin/fadvise -u testfile.dat
root@unRAID:/mnt/disk1/filedb# time sha256sum testfile.dat
b5aac2d99f5a47e6ff721b26d6bd50b2f3293455a285e04fe9501edfa48d6c3f  testfile.dat

real 1m14.804s
user 1m6.920s
sys 0m7.250s
jumperalex Posted September 18, 2014

Well right, if the drive isn't feeding it data. I'm just saying that with enough data blake2 does run faster, or, from the other side, blake2 will consume less CPU per hashing process. So if you are feeding it 10 drives' worth of data, you are less likely to run into CPU limits.
jbartlett Posted September 18, 2014 (Author)

With my inventory script saving the hashes to a SQL DB, the main issue I had was that sqlite would introduce long delays into the process, considerably so if running on a spinner. Running the insert/update query after every file was not feasible, so I would append any SQL statements to a file and run them in a batch every minute - that seemed to keep the delays down to around 10 seconds or so. Storing the hashes in the extended attributes takes zero time in comparison.
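The batching idea above can be sketched with the sqlite3 CLI: queue the INSERT statements in a plain text file, then apply the whole batch inside one transaction so sqlite commits once per interval instead of once per file. The table and column names here are hypothetical (not the script's actual schema), and the quoting is naive (paths must not contain single quotes).

```shell
#!/bin/bash
# Sketch: batch hash inserts into sqlite instead of one commit per file.
DB=$(mktemp)       # demo database (sqlite treats an empty file as new)
BATCH=$(mktemp)    # queued SQL statements

sqlite3 "$DB" "CREATE TABLE IF NOT EXISTS hashes
               (path TEXT PRIMARY KEY, md5 TEXT, mtime INTEGER);"

# per file: append a statement to the batch instead of touching the DB
queue_hash() {
    printf "INSERT OR REPLACE INTO hashes VALUES('%s','%s',%d);\n" \
        "$1" "$2" "$3" >> "$BATCH"
}

# once a minute (or at end of run): apply the whole batch atomically
flush_batch() {
    { echo "BEGIN;"; cat "$BATCH"; echo "COMMIT;"; } | sqlite3 "$DB"
    : > "$BATCH"   # empty the queue
}
```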
jbartlett Posted September 18, 2014 (Author)

Question: What do you think about storing the file name location in the extended attributes as well? I think I nixed the idea at first in that there is a limited amount of space in the attributes and I didn't want to hog it - but it seems like there are very, very few scripts that store info in the attribute space. Assuming attributes aren't lost when files show up in "lost+found" (something I haven't and hope I never have to test), it would make putting a file back into its original location faster, as the hash value wouldn't need to be computed first and then compared to an external file. If the attributes are lost, then the hash value compared to the exported file list would recover it. If added, I would need to add an option to refresh just the path in the attributes, and it would automatically be refreshed during adding/verifying.
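The idea above is easy to experiment with using the attr tools (setfattr/getfattr). A minimal sketch; the attribute names are hypothetical, not the script's actual keys, and the filesystem must support user.* attributes:

```shell
#!/bin/bash
# tag_file FILE: store the content hash and the file's canonical path
# in user.* extended attributes.
tag_file() {
    local f=$1
    # store the md5 of the file's contents
    setfattr -n user.hash.md5 -v "$(md5sum "$f" | awk '{print $1}')" "$f"
    # store where the file is supposed to live
    setfattr -n user.orig.path -v "$(realpath "$f")" "$f"
}

# After a filesystem rebuild, a file landing in lost+found could then be
# moved straight back without hashing it first:
#   dest=$(getfattr -n user.orig.path --only-values "$file")
```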
RobJ Posted September 18, 2014

Question: What do you think about storing the file name location in the extended attributes as well? ...

It's a very interesting, good idea. But how often would it be useful? I think file system failures causing "lost+found" files are on the order of 1 in 2000 users (ballpark)? Plus, no one is going to install something like this untested, so you WILL have to test it, thoroughly, either yourself or some other willing sucker (er, user)!
WeeboTech Posted September 18, 2014

Well right, if the drive isn't feeding it data. ... So if you are feeding it 10 drives' worth of data, you are less likely to run into CPU limits.

If you have a powerful CPU. Look at my log: I can read and supply data at 160MB/s. That's pretty good for a very full filesystem accessing the inner cylinders. The raw read from fadvise, md5sum and md5deep are very close. Any of the blake functionality takes longer on my CPU, and it's pegged at 100%. The smaller servers may not be starved for input data, but for CPU.
WeeboTech Posted September 18, 2014

With my inventory script saving the hashes to a SQL DB, the main issue I had was that sqlite would introduce long delays into the process... Storing the hashes in the extended attributes takes zero time in comparison.

Using SQLite via the shell has a penalty in the manner of the original script: connecting to the database, then the select and insert/update on an array drive, adds overhead. In my case with the locate database, I store it on the ramdrive; I'll add something to rsync it to spinning storage later. At that point I can add over a million files of stat() data to it in an hour or two (without hashing). There's really a lot that can be done with a bash co-process, but I'm not sure it's worth it. I wrote my utility in C calling SQLite directly. This avoids the fork()/exec() overhead and database connect/disconnect, plus I can have multiple drives ftw()ing and inserting/hashing. I'll probably devise a separate tool to do the select for NULL or expired hashes and hash/rehash them. Those can run in parallel without issue, since it's not a lot of SQLite transactions. With my needs, I sometimes need to sweep the whole file system searching for something, so having it all confined in a SQLite table is important - i.e. like locate and tripwire combined.

In any case, I think storing the hash and time in the xattr is brilliant. My only fear is losing them if there are filesystem issues or the file gets moved. I'll probably make a utility to import/export a .hash file to/from the attributes. While I'm deviating a bit off topic, it's good for ideas.
jbartlett Posted September 19, 2014 (Author)

It's a very interesting, good idea. But how often would it be useful? ... so you WILL have to test it, thoroughly, either yourself or some other willing sucker (er, user)!

I have UNRAID running under VirtualBox - putting a bunch of files under /mnt/disk1, building the hash, and then deleting them would set things up for a reiser rebuild-tree check. I simulated a "lost+found" to test the recovery aspect, though with no keys preserved, by moving a bunch of files to a single directory and running a recover against that directory.
jbartlett Posted September 19, 2014 (Author)

I'll probably make a utility to import/export a .hash file to/from the attributes.

Already in my script.
WeeboTech Posted September 19, 2014

I'll probably make a utility to import/export a .hash file to/from the attributes.

Already in my script.

While the bitrot shell script is really well written, I'm probably going to do it all in C when I can. While it's easier to do it in bash and/or perl, I'm working with millions of files and I have to be very careful of the naming. I also need to import/export them to my SQLite table. Plus I want the export file to work as input to the raw command if need be.

The testing tool segment I'm working on now makes a .hash file that can be used by the Windows corz checksum tool. It also makes a set of par2 files so you can validate and repair some damaged or missing files (dependent on how bad the damage is). This goes one step further than just detecting it, by allowing you to repair it.

From what I've seen, the export file looks like:

echo $filestoadd > "/tmp/filestoadd.$rand.txt"
echo "$currfile|$ShaKey|$ScanDate" >> "$exportfile"

My suggestion would be to exec to a fd for the export file, then print to the fd in a slightly different format - i.e. a standard sha256sum file, using comments for the supplementary data. In the quick preliminary shell I create a file as:

root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# more folder.hash
# file: folder.hash
# user.dir="/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)"
# user.hash.dir="a59998b6369450c9269203279720b427"
# user.hash.time="1411076621"
60343da3cd9725c218ba522df3024503  18 Michael E - Late Night Dreams.mp3
9060cd879b9870aae4df72e9ae568307  15 Eddie Silverton - White Sand.mp3
cd4d1b45f9b6dfcd0441ca13e91b560d  04 Mazelo Nostra - Smooth Night.mp3
011875f40fea633e0b8a1f94c583ab81  25 Pianochocolate - Long Long Letter.mp3

This way the folder.hash file can be used to validate the current files. I'm purposely using relative filenames here.
The corz checksum makes a file as:

# made with checksum.. point-and-click hashing for windows (64-bit edition).
# from corz.org.. http://corz.org/windows/software/checksum/
#
#md5#folder.par2#[email protected]:14
3a23f04cd50f96a1dcc2b10f3406376d *folder.par2
#md5#filelist.lastrun#[email protected]:15
f6cd171fd3955f1e9a25010451509652 *filelist.lastrun
#md5#disk3.filelist.2014-37.txt#[email protected]:03
62064d427826035e7360cbf1a409aa61 *disk3.filelist.2014-37.txt

In my version, I tried to make the comments look like the exported getfattr -d format. I'll build a parser in C to allow use of this file to import into SQLite or do other things with it. I have the parser for the hash line; now I need the variable line. Since I write to folder.hash in every folder (and folder.par2), user.hash.dir is a hash of the path, so the folder.hash files can be collected to a safer location off the filesystem.

Anyway, with par2create/verify/repair executed within each directory, I can damage a file and repair it. I can delete a file and have it recreated, depending on how many blocks are available. Here's an example of the directory. Cool thing with the .hash naming: you can go onto Windows, right-click on the file, click "check checksum" and get quick validation from the Windows workstation.
root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# ls -l
total 280837
-rw-rw-rw- 1 rcotrone 522 15798176 2014-09-18 11:25 01\ The\ Diventa\ Project\ -\ Serenity\ (Lazy\ Hammock\ Mix).mp3
-rw-rw-rw- 1 rcotrone 522  7648975 2014-09-18 11:25 02\ Stepo\ Del\ Sol\ -\ Fly\ Far\ Away.mp3
-rw-rw-rw- 1 rcotrone 522  9873573 2014-09-18 11:25 03\ Stargazer\ -\ White\ Caps\ (Sylt\ Mix).mp3
-rw-rw-rw- 1 rcotrone 522  9378281 2014-09-18 11:25 04\ Mazelo\ Nostra\ -\ Smooth\ Night.mp3
-rw-rw-rw- 1 rcotrone 522  8689679 2014-09-18 11:25 05\ Billy\ Esteban\ -\ Dream.mp3
-rw-rw-rw- 1 rcotrone 522 16075024 2014-09-18 11:25 06\ C.A.V.O.K\ -\ Night\ Flight.mp3
-rw-rw-rw- 1 rcotrone 522  9431569 2014-09-18 11:25 07\ Nasser\ Shibani\ -\ Time\ Chase.mp3
-rw-rw-rw- 1 rcotrone 522 12205755 2014-09-18 11:25 08\ Dave\ Ross\ -\ Solana.mp3
-rw-rw-rw- 1 rcotrone 522  9747172 2014-09-18 11:25 09\ Gabor\ Deutsch\ -\ Rearrange\ (Feat.\ Harcsa\ Veronika).mp3
-rw-rw-rw- 1 rcotrone 522 11318708 2014-09-18 11:25 10\ Ryan\ KP\ -\ Everythings\ Gonna\ Be\ Alright\ (Feat.\ Melody).mp3
-rw-rw-rw- 1 rcotrone 522 10283170 2014-09-18 11:25 11\ Florzinho\ -\ Primavera\ (Dub\ Mix).mp3
-rw-rw-rw- 1 rcotrone 522 13498323 2014-09-18 11:25 12\ Lazy\ Hammock\ -\ One\ of\ Those\ Days.mp3
-rw-rw-rw- 1 rcotrone 522 10098189 2014-09-18 11:25 13\ Myah\ -\ Falling.mp3
-rw-rw-rw- 1 rcotrone 522 10972802 2014-09-18 11:25 14\ Peter\ Pearson\ -\ I\ Need\ to\ Chill.mp3
-rw-rw-rw- 1 rcotrone 522 10983245 2014-09-18 11:25 15\ Eddie\ Silverton\ -\ White\ Sand.mp3
-rw-rw-rw- 1 rcotrone 522  6619741 2014-09-18 11:25 16\ Ingo\ Herrmann\ -\ Filtron.mp3
-rw-rw-rw- 1 rcotrone 522 14187937 2014-09-18 11:25 17\ DJ\ MNX\ -\ Cosmic\ Dreamer.mp3
-rw-rw-rw- 1 rcotrone 522  9187068 2014-09-18 11:25 18\ Michael\ E\ -\ Late\ Night\ Dreams.mp3
-rw-rw-rw- 1 rcotrone 522  5278124 2014-09-18 11:25 19\ Francois\ Maugame\ -\ Like\ a\ Summer\ Breeze.mp3
-rw-rw-rw- 1 rcotrone 522 12749119 2014-09-18 11:25 20\ Collioure\ -\ Perfect\ Resort.mp3
-rw-rw-rw- 1 rcotrone 522  7050247 2014-09-18 11:25 21\ Leon\ Ard\ -\ Caribbean\ Dreams.mp3
-rw-rw-rw- 1 rcotrone 522 10242401 2014-09-18 11:25 22\ Syusi\ -\ Bright\ Moments.mp3
-rw-rw-rw- 1 rcotrone 522 10783678 2014-09-18 11:25 23\ Thomas\ Lemmer\ -\ Above\ the\ Clouds.mp3
-rw-rw-rw- 1 rcotrone 522  8890313 2014-09-18 11:25 24\ Frame\ by\ Frame\ -\ Borderland.mp3
-rw-rw-rw- 1 rcotrone 522  9443076 2014-09-18 11:25 25\ Pianochocolate\ -\ Long\ Long\ Letter.mp3
-rw-rw-rw- 1 root     522     2049 2014-09-18 17:43 folder.hash
-rw-rw-rw- 1 root     522   132623 2014-09-18 17:03 folder.jpg
-rw-rw-rw- 1 root     522    46760 2014-09-18 17:43 folder.par2
-rw-rw-rw- 1 root     522 26635780 2014-09-18 17:43 folder.vol000+200.par2

Food for thought. Here's a log of what is possible by creating the set of par2 files per folder. It might be something to consider for the bitrot program, i.e. being able to repair a small file directly, or storing the par2 file somewhere else.

root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# par2verify folder.par2
par2cmdline version 0.4, Copyright (C) 2003 Peter Brian Clements.
par2cmdline comes with ABSOLUTELY NO WARRANTY.

This is free software, and you are welcome to redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. See COPYING for details.

Loading "folder.par2".
Loaded 54 new packets
Loading "folder.vol000+200.par2".
Loaded 200 new packets including 200 recovery blocks

There are 26 recoverable files and 0 other files.
The block size used was 131244 bytes.
There are a total of 2000 data blocks.
The total size of the data files is 260566968 bytes.

Verifying source files:

Target: "01 The Diventa Project - Serenity (Lazy Hammock Mix).mp3" - found.
Target: "02 Stepo Del Sol - Fly Far Away.mp3" - found.
Target: "03 Stargazer - White Caps (Sylt Mix).mp3" - found. Target: "04 Mazelo Nostra - Smooth Night.mp3" - found. Target: "05 Billy Esteban - Dream.mp3" - found. Target: "06 C.A.V.O.K - Night Flight.mp3" - found. Target: "07 Nasser Shibani - Time Chase.mp3" - found. Target: "08 Dave Ross - Solana.mp3" - found. Target: "09 Gabor Deutsch - Rearrange (Feat. Harcsa Veronika).mp3" - found. Target: "10 Ryan KP - Everythings Gonna Be Alright (Feat. Melody).mp3" - found. Target: "11 Florzinho - Primavera (Dub Mix).mp3" - found. Target: "12 Lazy Hammock - One of Those Days.mp3" - found. Target: "13 Myah - Falling.mp3" - found. Target: "14 Peter Pearson - I Need to Chill.mp3" - found. Target: "15 Eddie Silverton - White Sand.mp3" - found. Target: "16 Ingo Herrmann - Filtron.mp3" - found. Target: "17 DJ MNX - Cosmic Dreamer.mp3" - found. Target: "18 Michael E - Late Night Dreams.mp3" - found. Target: "19 Francois Maugame - Like a Summer Breeze.mp3" - found. Target: "20 Collioure - Perfect Resort.mp3" - found. Target: "21 Leon Ard - Caribbean Dreams.mp3" - found. Target: "22 Syusi - Bright Moments.mp3" - found. Target: "23 Thomas Lemmer - Above the Clouds.mp3" - found. Target: "24 Frame by Frame - Borderland.mp3" - found. Target: "25 Pianochocolate - Long Long Letter.mp3" - found. Target: "folder.jpg" - found. All files are correct, repair is not required. root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# rm -vi folder.jpg rm: remove regular file `folder.jpg'? y removed `folder.jpg' root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# par2verify folder.par2 par2cmdline version 0.4, Copyright (C) 2003 Peter Brian Clements. par2cmdline comes with ABSOLUTELY NO WARRANTY. 
This is free software, and you are welcome to redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. See COPYING for details. Loading "folder.par2". Loaded 54 new packets Loading "folder.vol000+200.par2". Loaded 200 new packets including 200 recovery blocks There are 26 recoverable files and 0 other files. The block size used was 131244 bytes. There are a total of 2000 data blocks. The total size of the data files is 260566968 bytes. Verifying source files: Target: "01 The Diventa Project - Serenity (Lazy Hammock Mix).mp3" - found. Target: "02 Stepo Del Sol - Fly Far Away.mp3" - found. Target: "03 Stargazer - White Caps (Sylt Mix).mp3" - found. Target: "04 Mazelo Nostra - Smooth Night.mp3" - found. Target: "05 Billy Esteban - Dream.mp3" - found. Target: "06 C.A.V.O.K - Night Flight.mp3" - found. Target: "07 Nasser Shibani - Time Chase.mp3" - found. Target: "08 Dave Ross - Solana.mp3" - found. Target: "09 Gabor Deutsch - Rearrange (Feat. Harcsa Veronika).mp3" - found. Target: "10 Ryan KP - Everythings Gonna Be Alright (Feat. Melody).mp3" - found. Target: "11 Florzinho - Primavera (Dub Mix).mp3" - found. Target: "12 Lazy Hammock - One of Those Days.mp3" - found. Target: "13 Myah - Falling.mp3" - found. Target: "14 Peter Pearson - I Need to Chill.mp3" - found. Target: "15 Eddie Silverton - White Sand.mp3" - found. Target: "16 Ingo Herrmann - Filtron.mp3" - found. Target: "17 DJ MNX - Cosmic Dreamer.mp3" - found. Target: "18 Michael E - Late Night Dreams.mp3" - found. Target: "19 Francois Maugame - Like a Summer Breeze.mp3" - found. Target: "20 Collioure - Perfect Resort.mp3" - found. Target: "21 Leon Ard - Caribbean Dreams.mp3" - found. Target: "22 Syusi - Bright Moments.mp3" - found. Target: "23 Thomas Lemmer - Above the Clouds.mp3" - found. Target: "24 Frame by Frame - Borderland.mp3" - found. 
Target: "25 Pianochocolate - Long Long Letter.mp3" - found. Target: "folder.jpg" - missing. Scanning extra files: Repair is required. 1 file(s) are missing. 25 file(s) are ok. You have 1998 out of 2000 data blocks available. You have 200 recovery blocks available. Repair is possible. You have an excess of 198 recovery blocks. 2 recovery blocks will be used to repair. root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# par2repair folder.par2 par2cmdline version 0.4, Copyright (C) 2003 Peter Brian Clements. par2cmdline comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. See COPYING for details. Loading "folder.par2". Loaded 54 new packets Loading "folder.vol000+200.par2". Loaded 200 new packets including 200 recovery blocks There are 26 recoverable files and 0 other files. The block size used was 131244 bytes. There are a total of 2000 data blocks. The total size of the data files is 260566968 bytes. Verifying source files: Target: "01 The Diventa Project - Serenity (Lazy Hammock Mix).mp3" - found. Target: "02 Stepo Del Sol - Fly Far Away.mp3" - found. Target: "03 Stargazer - White Caps (Sylt Mix).mp3" - found. Target: "04 Mazelo Nostra - Smooth Night.mp3" - found. Target: "05 Billy Esteban - Dream.mp3" - found. Target: "06 C.A.V.O.K - Night Flight.mp3" - found. Target: "07 Nasser Shibani - Time Chase.mp3" - found. Target: "08 Dave Ross - Solana.mp3" - found. Target: "09 Gabor Deutsch - Rearrange (Feat. Harcsa Veronika).mp3" - found. Target: "10 Ryan KP - Everythings Gonna Be Alright (Feat. Melody).mp3" - found. Target: "11 Florzinho - Primavera (Dub Mix).mp3" - found. Target: "12 Lazy Hammock - One of Those Days.mp3" - found. Target: "13 Myah - Falling.mp3" - found. 
Target: "14 Peter Pearson - I Need to Chill.mp3" - found. Target: "15 Eddie Silverton - White Sand.mp3" - found. Target: "16 Ingo Herrmann - Filtron.mp3" - found. Target: "17 DJ MNX - Cosmic Dreamer.mp3" - found. Target: "18 Michael E - Late Night Dreams.mp3" - found. Target: "19 Francois Maugame - Like a Summer Breeze.mp3" - found. Target: "20 Collioure - Perfect Resort.mp3" - found. Target: "21 Leon Ard - Caribbean Dreams.mp3" - found. Target: "22 Syusi - Bright Moments.mp3" - found. Target: "23 Thomas Lemmer - Above the Clouds.mp3" - found. Target: "24 Frame by Frame - Borderland.mp3" - found. Target: "25 Pianochocolate - Long Long Letter.mp3" - found. Target: "folder.jpg" - missing. Scanning extra files: Repair is required. 1 file(s) are missing. 25 file(s) are ok. You have 1998 out of 2000 data blocks available. You have 200 recovery blocks available. Repair is possible. You have an excess of 198 recovery blocks. 2 recovery blocks will be used to repair. Computing Reed Solomon matrix. Constructing: done. Solving: done. Wrote 132623 bytes to disk Verifying repaired files: Target: "folder.jpg" - found. Repair complete. root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# ls -l folder.jpg -rw-rw-rw- 1 root 522 132623 2014-09-19 02:20 folder.jpg root@unRAID:/mnt/disk3/Music/music.mp3/Chill/Various Artists/Diventa Island Guide (Midnight Chill Beach)# md5sum -c folder.hash | grep folder.jpg folder.jpg: OK Quote Link to comment
BRiT Posted September 19, 2014 Share Posted September 19, 2014 Something you will want to validate/verify is whether the par2 files themselves have checksums and integrity checks built in. I haven't looked at the format closely enough to know how it handles that. You wouldn't want to be using a possibly corrupt par2 file, or part of one, to do a repair if you ever need to. Quote Link to comment
itimpi Posted September 19, 2014 Share Posted September 19, 2014 Something you will want to validate/verify is whether the par2 files themselves have checksums and integrity checks built in. I haven't looked at the format closely enough to know how it handles that. You wouldn't want to be using a possibly corrupt par2 file, or part of one, to do a repair if you ever need to. Par2 files do have integrity checks built in. They run the same checks on themselves that they run on the files they protect when you try to use them. That is why you can often use even a damaged par2 file for repair: it can work out which of the recovery blocks it contains are still intact and can be used for repair purposes. Quote Link to comment
WeeboTech Posted September 19, 2014 Share Posted September 19, 2014 Question: What do you think about storing the file's name and location in the extended attributes as well? I think I nixed the idea at first because there is a limited amount of space in the attributes and I didn't want to hog it - but it seems very few scripts store info in the attribute space. I thought this relevant: For ext2/3/4 and btrfs, each extended attribute is limited to a filesystem block (e.g. 4 KiB), and in practice in ext2/3/4 all of them must fit together in a single block (including names and values). ReiserFS allows attributes of arbitrary size. In XFS the names can be up to 256 bytes in length, terminated by the first 0 byte, and the values can be up to 64 KB of arbitrary binary data. Quote Link to comment
jbartlett Posted September 19, 2014 Author Share Posted September 19, 2014 Question: What do you think about storing the file's name and location in the extended attributes as well? I think I nixed the idea at first because there is a limited amount of space in the attributes and I didn't want to hog it - but it seems very few scripts store info in the attribute space. I thought this relevant: For ext2/3/4 and btrfs, each extended attribute is limited to a filesystem block (e.g. 4 KiB), and in practice in ext2/3/4 all of them must fit together in a single block (including names and values). ReiserFS allows attributes of arbitrary size. In XFS the names can be up to 256 bytes in length, terminated by the first 0 byte, and the values can be up to 64 KB of arbitrary binary data. Sweet - a non-issue then. Quote Link to comment
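A 32-character md5 plus a path fits comfortably inside even the ext4 single-block limit quoted above. Here is a rough sketch of the idea using the attr tools (setfattr/getfattr, assumed installed); the attribute names `user.md5` and `user.location` and the demo file are invented for this example and are not what bitrot.sh actually uses.

```shell
# Sketch only: store a file's md5 and its original path in user.* xattrs,
# falling back to just printing the hash if the filesystem (or missing attr
# tools) rejects user xattrs. Attribute names here are invented examples.
f=./xattr-demo.bin
printf 'demo payload' > "$f"
sum=$(md5sum "$f" | cut -d' ' -f1)
if setfattr -n user.md5 -v "$sum" "$f" 2>/dev/null; then
    setfattr -n user.location -v "$(realpath "$f")" "$f"
    getfattr -n user.md5 --only-values "$f" 2>/dev/null
    echo
else
    echo "$sum"    # user xattrs unsupported here; hash shown anyway
fi
rm -f "$f"
```

During a lost+found recovery, reading `user.location` back would tell the script where the orphaned file originally lived.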
eroz Posted September 20, 2014 Share Posted September 20, 2014 Thanks for the script. It works great. I have a question: can you run the script with more than one mask? e.g. bitrot.sh -a -p /mnt/user/Movies -m *.mkv -m *.m4v Quote Link to comment
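I haven't checked whether the script's option parsing accepts a repeated -m flag, but if it only takes one mask per run, calling it once per mask is an easy workaround. `run_masks` is a hypothetical helper; the `echo` makes this a dry run that just prints each invocation.

```shell
# Hypothetical workaround, assuming bitrot.sh accepts only a single -m mask
# per run: invoke it once per mask. The echo makes this a dry run; remove
# it to actually execute the script.
run_masks() {
    path="$1"; shift
    for mask in "$@"; do
        echo ./bitrot.sh -a -p "$path" -m "$mask"
    done
}

run_masks /mnt/user/Movies '*.mkv' '*.m4v'
```

Quoting the masks ('*.mkv') keeps the shell from globbing them before the script sees them, which matters regardless of how many -m flags the script supports.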