BRiT Posted September 20, 2014

Has anyone given thought to running snapraid (snapshot) on top of unraid? Wouldn't that give the same sort of protection as par2?
WeeboTech Posted September 20, 2014

Has anyone given thought to running snapraid (snapshot) on top of unraid? Wouldn't that give the same sort of protection as par2?

I personally have not; a lot of my files change pretty often. Then there are others that change less often, but often enough that they would require a daily sync operation. However, I think snapraid as an additional check is a great idea. In my particular case, I store data in large trees and tend to move them to archive areas, so having folder-by-folder validation is important enough for me to build it. I have no less than half a million mp3s in a few trees on a few mini unRAID servers.
tr0910 Posted September 20, 2014

Re number of files per directory. Most of mine would be less than a thousand. But who can guarantee there isn't a stupid folder somewhere with 10000 tiny files. It would be interesting to run some statistics on our unRaid servers to see what those numbers really are.
WeeboTech Posted September 20, 2014

Regarding snapraid: it does compile under Slackware, so this has possibilities. Thanks for the idea, BRiT. At this point, without a SlackBuild (unless someone finds one), making it available/useful isn't all that easy. I have to read into this more to see how the parity functionality works; I don't have a spare SATA disk slot to test this out. It might be worthwhile for alpha/beta/rc testers to help overcome/monitor gotchas of a bad file system driver.

root@slacky:/mnt/disk1/cache/home/rcotrone/slacky/src/snapraid/snapraid-6.3# ./snapraid -h
snapraid v6.3 by Andrea Mazzoleni, http://snapraid.sourceforge.net
Usage: snapraid sync|status|scrub|diff|dup|pool|check|fix [options]

Commands:
  sync   Syncronize the state of the array of disks
  pool   Create or update the virtual view of the array of disks
  diff   Show the changes that needs to be syncronized
  dup    Find duplicate files
  scrub  Scrub the array of disks
  status Print the status of the array
  check  Check the array of disks
  fix    Fix the array of disks

Options:
  -c, --conf FILE        Configuration file
  -f, --filter PATTERN   Process only files matching the pattern
  -d, --filter-disk NAME Process only files in the specified disk
  -m, --filter-missing   Process only missing/deleted files
  -e, --filter-error     Process only files with errors
  -p, --percentage PERC  Process only a part of the array
  -o, --older-than DAYS  Process only the older part of the array
  -i, --import DIR       Import deleted files
  -l, --log FILE         Log file. Default none
  -a, --audit-only       Check only file data and not parity
  -h, --pre-hash         Pre hash all the new data
  -Z, --force-zero       Force synching of files that get zero size
  -E, --force-empty      Force synching of disks that get empty
  -U, --force-uuid       Force commands on disks with uuid changed
  -D, --force-device     Force commands on disks with same device id
  -N, --force-nocopy     Force commands disabling the copy detection
  -F, --force-full       Force commands requiring a full sync
  -s, --start BLKSTART   Start from the specified block number
  -t, --count BLKCOUNT   Count of block to process
  -v, --verbose          Verbose
  -H, --help             Help
  -V, --version          Version

I'm still moving forward with my per-folder .hash and .par2 variants.
neilt0 Posted September 20, 2014

I trust you are using the multicore par2?
BRiT Posted September 20, 2014

Regarding the snapraid, It does compile under slackware. So this has possibilities. Thanks for the idea Brit. At this point without a slackbuild (unless someone finds one) making it available/useful isn't all that easy.

Docker. Snapraid inside docker with /mnt/disk# or /mnt/user mapped through. Though one would need the ability to feed commands through to execute inside the docker container from unRAID, unless snapraid has a webui too.
WeeboTech Posted September 20, 2014

I trust you are using the multicore par2?

Not at the current time, I'll be compiling and exploring that in the future.
WeeboTech Posted September 20, 2014

Docker. Snapraid inside docker with /mnt/disk# or /mnt/user mapped through. Though one would need the ability to feed commands through to execute inside the docker container from unRAID, unless snapraid has a webui too.

There are some things that lend themselves well to Docker; I'm not sure this is a candidate. Frankly, for my low-level apps, I doubt I'll use Docker. I'm not totally sold on the technology for my administrative applications. Perhaps I don't know enough about Docker. I get paid to do VMs and administrative programming; I don't gain anything from Docker yet. If it came down to a whole packaged solution with webmin, I would consider a Docker solution. For now I don't have the spare hardware or time to play with Docker.
neilt0 Posted September 20, 2014

Par2 creation is limited by CPU, so you'll see a big speedup with multicore.
pengrus Posted September 22, 2014

... Thoughts? Anyone have an example command line to find maximum file count in directories? Name suggestions? Is it worthwhile having a local folder.hash and folder.par2 file in each directory? ...

By chance, I was looking for this just the other day! This command has worked for me:

find DIR_NAME -type f -print | wc -l

Following this thread. Awesome stuff, Weebo!
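Note that the find | wc -l pipeline above counts every file under DIR_NAME in total; the original question asked for the maximum file count per directory. A small sketch of that (assumes GNU find's -printf, which Slackware/unRAID ships):

```shell
#!/bin/sh
# Print the five directories containing the most files (files directly
# inside each directory, not recursive totals).
# DIR is a placeholder; point it at /mnt/disk1, /mnt/user, etc.
DIR="${1:-.}"
find "$DIR" -type f -printf '%h\n' | sort | uniq -c | sort -rn | head -5
```

Each output line is "count directory", largest first, so the top line answers the max-files-in-one-folder question.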
WeeboTech Posted September 23, 2014

I'm making good headway so far. I have squpdatedb and sqlocate working. I found a new set of md5 routines that are a little faster. The cool thing about this module is that it's SSL-compatible. With one switch I can compile it internally, albeit a little slower. With another switch I can compile it to require the SSL MD5 routines, and then it's as fast as the current GNU coreutils md5sum. An improvement with the embedded function is that I can dump the cache on the file that was just read for the md5. This means it aids in preventing directory entries from being pushed out of the buffer cache. I haven't tested it fully, but for smaller files which do not overrun all of memory, it works, and it works well.

There's still more to do, such as allowing the calling of any external hashing application that people want to use. As long as it returns a standard hash line of

hexdigits(space)(space)fullpath of file

it can be imported.

What I've been curious about is how people see this kind of hash database being used. I can export it via locate into a format that can be piped into the hashing application. This is pretty routine. You can also locate files with built-in grep. What I'm curious about is what time fields people might need. Currently I have the max of (change time vs modification time) and the update time (when the record was updated in the database). I was thinking of adding a hash-verified time. This way you can automatically schedule a verification of a file every 7, 15, 30, 60 days by some calculation. The purpose of the database in itself is to avoid rehashing the files over and over if they haven't changed. But to detect bitrot or some kind of corruption, you need to re-verify them without overwriting the stored hash (unless it is intentional). Thoughts?
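For discussion's sake, the time fields being debated could look something like this in SQLite. The table and column names here are assumptions for illustration, not the actual squpdatedb schema:

```shell
# Hypothetical squpdatedb-style table sketch; names are assumptions.
# Shows mtime, record-update time, and a separate hash-verified time,
# plus the kind of query a scheduled re-verification pass might run.
sqlite3 hashes.db <<'EOF'
CREATE TABLE IF NOT EXISTS files (
    path     TEXT PRIMARY KEY,
    size     INTEGER,
    mtime    INTEGER,  -- max(ctime, mtime) seen at scan time
    hash     TEXT,
    updated  INTEGER,  -- when this record last changed
    verified INTEGER   -- when the hash was last re-checked against the file
);
-- Files due for re-verification (last verified more than 30 days ago):
SELECT path FROM files
 WHERE verified < strftime('%s','now') - 30*86400;
EOF
```

The point of the separate verified column is exactly the scheduling idea above: an unchanged mtime skips rehashing, while an old verified timestamp still flags the file for a bitrot check.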
dmacias Posted September 24, 2014

In your first post you mention sqlite being included in php. While unRAID does include sqlite3, "php -m" reveals no PDO module is available to access sqlite3 from php. Just curious cause I have a plg that I needed to access sqlite3 db and had to use bash shell to access.
WeeboTech Posted September 24, 2014

In your first post you mention sqlite being included in php. While unRAID does include sqlite3, "php -m" reveals no PDO module is available to access sqlite3 from php. Just curious cause I have a plg that I needed to access sqlite3 db and had to use bash shell to access.

It's slated for unRAID 6.
Alex R. Berg Posted October 27, 2014

In case you are still interested in these comments, here goes.

What I'm curious about is what time fields people might need. Currently I have the max time of (change time vs modification time) update time (when the record was updated in the database). I was thinking of adding a hash verified time, This way you can automatically schedule a verification of a file every 7, 15, 30, 60 days by some calculation. The purpose of the database in itself is to avoid rehashing the files over and over if they haven't changed. But to detect bitrot or some kind of corruption, you need to re-verify them without overwriting them(unless it is intentional). Thoughts?

I would be interested in the file modification time at the time of the last hash scan, and the date and time of the last hash scan. A scan should probably keep both the old and the new checksum of the file if the file has changed but the file's modification timestamp has not. If the file's mod timestamp has changed, I just want the new timestamp.

It sounds interesting with .par2 or .md5 files per directory, though I'm not sure I would use it; I'm generally against files being littered all over in my folders, as I always browse with hidden files visible (in Total Commander, Windows). I might very well enable .par2 files for some specific directories.

Best Alex
WeeboTech Posted October 27, 2014

In case you are still interested in these comments, here goes.

I would be interested in the file modification time at the time of the last hash scan, and the date and time of the last hash scan. A scan should probably keep both the old and the new checksum of the file if the file has changed but the file's modification timestamp has not. If the file's mod timestamp has changed, I just want the new timestamp.

Best Alex

I've considered the time fields. mtime is already stored. Update time is stored, i.e. any time the record is updated this time is updated. I was going to use update time for the update/hash verification time. I may split it out into a separate hash time: hashtime would be a new field for when the hash is changed or verified as matching. I'm not sure of the use of saving the old hash.

FWIW, the locate command is capable of exporting the data as a list of files. It can print like /bin/ls -1, /bin/ls -l, /bin/stat, and finally output like md5sum (hash filename). This lets you export the hashes of some matching file expression at the current point in time.

It sounds interesting with .par2 or .md5 files per directory, though I'm not sure I would use it; I'm generally against files being littered all over in my folders, as I always browse with hidden files visible (in Total Commander, Windows). I might very well enable .par2 files for some specific directories.

I thought the same way at first about files littered all over. However, when I thought about it: in my mp3 directories I have mp3s, a playlist, a folder.jpg. If I have a folder.hash, I can use the corz checksum utility in a flash. If I have a folder.par2, I can use one of the Windows par2 GUIs to validate the files. While storing all the hashes either in text (md5sum file), gdbm (key/value) or sqlite is doable, there's a limit to how many files you can use with the par2 command. Since corz limits hashes to a folder, it makes sense to do the same with par2 for that added ability to recover. Years ago I tried doing par2 on a huge set of files and it caused all sorts of OOM issues with unRAID. It's not feasible in my array, which has over a million files.
WeeboTech Posted October 27, 2014

FWIW, I have 3 versions of hash storage I've been working with.

1. The SQLite updatedb/locate command.

2. A suite of tools to do this with .gdbm files. .gdbm files are very fast for key/value pairing and it's a lot smaller than the sqlite variant. I have been testing this for my own version of the cache_dirs program. The goal was a C-based caching initiator that would also catalog the files into the .gdbm file. When files change, they are updated in the .gdbm and written to stdout, which can then be used as a seed to update the md5sums immediately or at some scheduled time. Then you can re-import the smaller subset of md5sums back into the .gdbm.

3. A suite of tools to manage hash values in extended attributes of a file. It's very similar to the bitrot shell, only I do it in C for speed. The export creates a file that can be used by md5sum; the import reads the md5sum file into the extended attributes. The hashfattr command works like find, md5sum and setfattr all in one. The delay on its release has been in using external hashers. I've been playing with functionality to allow use of an external hasher like this: --hash-exec '/bin/b2sum -a {}', so every file found calls this program, which is piped back into the hash tool and then stored. It's all very similar and easy to do in bash; in my case, though, the overhead of spawning all these extra helper processes on a million files really makes the process longer. Plus it's an exercise for me in programming.

I really like the extended attributes method. If you use rsync -X to move files to another server, it preserves the attributes, so you can jump onto another server and verify the hash. However, if the file is re-created (as when copied/moved by Windows), the extended attribute is lost. Thus the reason to cache this data elsewhere as an exportable md5sums file, .gdbm or sqlite table.

I'm making headway; however, I'm now in the process of updating some hardware, because the HP MicroServer N40L just takes way too long to test these programs out on large datasets. It takes days on end just to md5deep 1 drive of over 300,000 files. Hence you can see why it's important for me to create functionality to 1. seed the data, 2. only do updates on files that actually change by mtime/size, 3. verify files on demand based on some manageable set of rules. Just traversing a drive with 300,000 files takes 30 minutes if the data is not cached. Hence why I started working on my own cache_dirs, which tracks what files change by mtime/size.

And finally... I'm going to experiment with the newer Seagate hybrid drives to see if they have any impact on caching the directory structures in the MLC/SLC cache. I figure at the very least, if I continually drop the cache and rescan the drive a few times, it should, in theory, cache the directory LBAs, thus lessening the time it takes to walk a directory tree. So while it's been quiet here, I've been hard at work.
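Item 2 in the list above (only update files that actually changed by mtime) can be approximated in plain shell, using the hash file's own timestamp as the cutoff. This is only a sketch of the idea, not the actual tool: it appends new entries rather than replacing stale ones, and compares mtime only, not size.

```shell
#!/bin/sh
# Seed a hash list on the first run; on later runs, hash only files whose
# mtime is newer than the hash file itself. HASHFILE is an assumed name.
DIR="${1:-.}"
HASHFILE="${2:-folder.hash}"
HN=$(basename "$HASHFILE")
if [ -f "$HASHFILE" ]; then
    # Incremental pass: append hashes for files changed since the last run.
    find "$DIR" -type f ! -name "$HN" -newer "$HASHFILE" \
        -exec md5sum {} + >> "$HASHFILE"
else
    # First pass: seed the hash list for the whole tree.
    find "$DIR" -type f ! -name "$HN" -exec md5sum {} + > "$HASHFILE"
fi
```

This is essentially the "seed, then update only what changed" flow: a full md5deep-style pass once, then each subsequent pass touches only the handful of changed files instead of all 300,000.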
Alex R. Berg Posted October 27, 2014

I'm not sure of the use of saving the old hash.

I'm only interested in the old hash when I fear bitrot, that is, when the timestamp of the file is unaltered. I'll probably have up to thousands of file changes on a monthly basis, and would likely never be interested in the old hashes, except when fearing bitrot. If I fear bitrot, I would restore from CrashPlan or an out-of-home backup and check against the old hash, I would think.

This lets you export the hashes of some matching file expression at the current point in time.

Awesome!

I thought the same way at first about files littered all over. However, when I thought about it: in my mp3 directories I have mp3s, a playlist, a folder.jpg. If I have a folder.hash, I can use the corz checksum utility in a flash. If I have a folder.par2, I can use one of the Windows par2 GUIs to validate the files. While storing all the hashes either in text (md5sum file), gdbm (key/value) or sqlite is doable, there's a limit to how many files you can use with the par2 command. Since corz limits hashes to a folder, it makes sense to do the same with par2 for that added ability to recover. Years ago I tried doing par2 on a huge set of files and it caused all sorts of OOM issues with unRAID. It's not feasible in my array, which has over a million files.

Sounds like you've thought about the negative consequences of one big par. I think it's a great possibility to have it add .par2 files recursively to some folders. But I don't want it to do it to all folders. For instance, I probably don't want my git workspace infiltrated with .par2 files. I could make a git-ignore, but would prefer not to. I suppose you are still planning an SQL database underneath?

Best Alex

PS: I'm looking forward to it
WeeboTech Posted October 27, 2014

The folder.hash / folder.par2 generator would be a separate program. I.e., I have a separate utility that walks down the tree; if the directory is newer than the folder.par2 or folder.hash, both of them are re-created. It will not be attached to the other suite of tools; it's a separate tool.

I suppose we could talk about how to set up the ignore function. I.e., when I walk a tree, if there is a .prognameIgnore file then I skip that directory. I'm open to suggestions on the filename. In thinking of this tool: if it's 0 bytes, ignore the whole directory, i.e. assume .*. If it's > 0 bytes, read it as a list of expressions to ignore, i.e. *.o, *~, etc.

In any case, it's a separate suite of tools: one program to walk a tree comparing mtimes on folder.hash/folder.par2, another program to read the files in the directory and make the respective files. This can probably be done with find/xargs and a shell. It will work, but all of the external processes add up to a lot of overhead. Then you have the issues with filenames and quoting. By doing it in C, I can create a 'safer' array and fork/exec the programs without concern of being interpreted by the shell. So while all of this can be done in bash with tools, I'm paranoid enough about filenames and quoting to do it all in C.
WeeboTech Posted October 27, 2014

I'm not sure of the use of saving the old hash.

I'm only interested in the old hash when I fear bitrot, that is, when the timestamp of the file is unaltered. I'll probably have up to thousands of file changes on a monthly basis, and would likely never be interested in the old hashes, except when fearing bitrot. If I fear bitrot, I would restore from CrashPlan or an out-of-home backup and check against the old hash, I would think.

What I was planning: check mtime/size; if they have not changed, no hash is calculated unless you do a --verify on purpose. Even then, the hash is not updated if it differs. The only time the hash is updated is when mtime/size has changed, a forced --verify-update is used, or possibly a --force switch. If you had to force an update, there is a --delete command to squpdatedb which will delete the record, allowing the next invocation to reinsert the data.
Alex R. Berg Posted October 28, 2014

Sounds good that the hash is not overwritten when mtime and size have not changed. I guess it's fine that I need to delete the hash; I can also just touch the file in most cases.

Best Alex
WeeboTech Posted November 19, 2014

While I know it's been quiet, I've been continuing to work on the hashing database tools. What came to mind recently is the ability to tag specific directory tree scans with labels. We do have user shares, which could show up in the paths and would be searchable; I'm thinking more about foreign disks, i.e. disks that are mounted temporarily. In my case, I have a box of older 1TB hard drives I'm trying to consolidate. I found myself doing an md5deep -r down the tree to a file named according to the drive's model and serial.

This gave me the hint: if I added a label field, which is any text string you define, then the squpdate/locate database functionality becomes a wider catalog. Using the drive's serial number, or any other identifying string, you could now search for these files in one locate front end. This would make the database larger, but gives you new functionality. I suppose if everything has an md5 or other hash, then duplicates can be found across a whole archive. Currently the squpdate/locate tools allow you to specify a database, so it doesn't all have to be in one database. I figured I would throw it out there to see what boomerangs back on the idea of a string label as an additional filter.
Alex R. Berg Posted November 20, 2014

It sounds good to be able to locate files on external, non-attached drives, though I would not use it much, as my Tower is in the attic. I'm not certain what you mean, but I think you are suggesting a manually configured label (a chosen name) for an external hard drive, which your software then maps to its actual serial/identifier number, right?

It's a bit difficult to keep up and to come up with ideas when I don't have the software to work with. Ideas come when playing with stuff and experiencing it. It's more difficult to come up with brilliant new ideas when I have to imagine everything, because of the lack of new information: information that could come from actually playing with the software and seeing how it works. I know the software is not yet done, but if you could send out something that could be played with, it would help generate ideas. Ideas you don't necessarily have to implement, but which might spark new good ideas. I wouldn't mind having to rescan everything once in a while on new updates if the database changes. I do a full monthly rescan anyway with my current very basic md5sum-to-files strategy.

This is not really meant as the nag 'give me that beta already!', but as an excuse for not being very helpful to you.

Best Alex
WeeboTech Posted November 20, 2014

my current very basic md5sum-to-files strategy.

What do you do now? Do you just do an md5deep/hashdeep to text files?
Alex R. Berg Posted November 20, 2014

No, md5deep didn't work for me; it crashed. I don't quite remember if I tried hashdeep, but my memory tells me it's the same thing, just a different name (?). So I use md5sum with the find command, and I run one command per disk, on each CPU core:

find /mnt/${DISK} -type f -exec md5sum {} \; > $MD5DIR/MD5_${DATESTAMP}_${DISK}.md5

I got most of the program from this forum, I think, and I'm not sure who to credit. Or I made it myself from bits and pieces; I'm not really sure. I have attached the file.

Best Alex

md5_array.zip
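One tweak worth noting on the command above: -exec md5sum {} \; forks one md5sum process per file, which is exactly the per-file process overhead complained about earlier in the thread, while -exec ... {} + batches many files into each invocation. A sketch of a per-disk parallel run along those lines (BASE, DISKS and MD5DIR are placeholders here, not taken from the attached script):

```shell
#!/bin/sh
# Hash several disks in parallel, one background job per disk, batching
# files with '{} +' instead of spawning one md5sum process per file.
BASE="${BASE:-/mnt}"
DISKS="${DISKS:-disk1 disk2}"
MD5DIR="${MD5DIR:-./md5}"
DATESTAMP=$(date +%Y%m%d)
mkdir -p "$MD5DIR"
for DISK in $DISKS; do
    find "$BASE/$DISK" -type f -exec md5sum {} + \
        > "$MD5DIR/MD5_${DATESTAMP}_${DISK}.md5" &
done
wait
```

The output files are in the same md5sum format as the original script, so anything downstream (md5sum -c, the import tools discussed above) should not care which variant produced them.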
NAS Posted November 20, 2014

... I found myself doing an md5deep -r down the tree to a file named according to the drives model and serial. This gave me the hint if I added a label field which is any text string you define, then the squpdate/locate database functionality now becomes a wider catalog. Using the drives serial number, or any other identifying string, you could now search for these files in one locate front end. ...

I am new to this thread, but this comment caught my eye as it's not that far from what I do for cataloging. I put a sticker with a number on the side of every disk I own, and in the root of that drive I create a file called number.key. This file can contain some info on the disk, but that's not relevant here. Then I have a Perl script that essentially does an ls -R to a text file named after the disk id and stored elsewhere. The reason this is Perl is that I can teach it to understand XBMC movie.nfo and tvshow.nfo files and place the TVDB id or the IMDB id in the catalog as well. This is just handy, but it shows how similar an md5 catalog and a general file-collection catalog are. Currently I prefer txt files to a database for the catalog, as it allows me to use the power of the shell. Food for thought about where this could perhaps evolve to.