February 21, 2011

I'm looking for something that can scan selected shares on my array for duplicates based on file checksum. Does anyone have experience with anything that would handle that? Most of my array is indexed and orderly, but I do have a couple of areas that are more challenging, and the dupe checker [that I'm most comfortable with] isn't over-the-network friendly. Thanks!
February 21, 2011

I would do it entirely on the server, since it is a highly I/O-intensive task. This series of commands will do it. The final output is a text file containing the duplicate files (based on checksum, not name).

First, because you will be using a lot of RAM, make the system less likely to crash by making it less likely to hoard directory entries in memory. You would also want to stop cache_dirs if you have it running.

    sysctl vm.vfs_cache_pressure=200

Next, find the files that are not empty and list them with their sizes. (Make sure you change "your_share" in the next line to the correct name.)

    find "/mnt/user/your_share" ! -empty -type f -links 1 -printf "%s " -exec ls -dQ {} \; >/mnt/disk1/dupes_tmp1

Next, sort the list and keep only those files that share their size with another file. (If a file has a unique size, it is unique; no need to look at a checksum.)

    sort -n /mnt/disk1/dupes_tmp1 | awk '{ printf "%015d %s\n", $1, $0}' | cut -d" " -f1,3- | uniq -D -w 15 | cut -d" " -f2- >/mnt/disk1/dupes_tmp2

Then, for all files that are not unique in their size, get the md5sum of the first 4 MB of the file. If the file is unique in the first 4 MB of its contents, it is unique. No need to check the remainder, as it is not a duplicate.

    sed "s/'/'\\\''/g" < /mnt/disk1/dupes_tmp2 | xargs -n 1 -I FiLeNaMeX sh -c "dd if='FiLeNaMeX' count=1 ibs=4M 2>/dev/null | md5sum -| tr -d '\n'; echo 'FiLeNaMeX'" >/mnt/disk1/dupes_tmp3

Now sort the results and keep only the files whose 4 MB checksum matches that of another file. Those are the ones we'll look at with a full md5 checksum.

    sort /mnt/disk1/dupes_tmp3 | uniq -w32 --all-repeated| cut -c36- | sed -e 's/"/\\\"/g' -e 's/\(.*\)/"\1"/' >/mnt/disk1/dupes_tmp4

Now get the full md5 checksum for the files left as candidates.

    cat /mnt/disk1/dupes_tmp4 | xargs md5sum >/mnt/disk1/dupes_tmp5

Last, sort the md5 checksums, getting rid of the unique files and grouping the remaining ones. The results are in the file /mnt/disk1/dupes_out.txt

    sort /mnt/disk1/dupes_tmp5 | uniq -w32 -d --all-repeated=separate | cut -c35- >/mnt/disk1/dupes_out.txt

Each of the commands above should be typed as a single line with no carriage returns. (It will wrap on the monitor if it gets to the end of a line; just keep typing.)

The line that calculates full md5 checksums for your remaining candidate files will take a long time, so do not expect an answer in seconds unless you only have a few files.

The path to be searched in the second command must be changed to the name of your user-share.

This is not perfect, but it will get the job done in most cases. It might choke on some special characters in file names, but hopefully those will not be the dupes.

Joe L.
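For anyone who would rather not type the steps one at a time, the same three narrowing stages (same size, same first 4 MB, same full md5) can be sketched as a single shell function. This is only an illustration under stated assumptions, not the contents of any attached script: find_dupes is a made-up name, it assumes GNU find, awk, sort, uniq, head and md5sum are available, it uses a scratch directory from mktemp instead of /mnt/disk1, it reads file names in a loop instead of going through xargs, and it assumes file names contain no newlines.

```shell
# find_dupes DIR: print groups of files under DIR whose contents are identical.
# Hypothetical helper; the body runs in a subshell so its variables do not leak.
find_dupes() (
    dir=$1
    tmp=$(mktemp -d) || exit 1

    # Stage 1: record "size<TAB>path" for non-empty, non-hard-linked regular files,
    # then keep only the paths whose size occurs more than once.
    find "$dir" -type f ! -empty -links 1 -printf '%s\t%p\n' > "$tmp/sizes"
    awk -F'\t' 'NR == FNR { n[$1]++; next } n[$1] > 1' "$tmp/sizes" "$tmp/sizes" |
        cut -f2- > "$tmp/same_size"

    # Stage 2: md5 of the first 4 MiB only; keep files whose partial hash repeats.
    while IFS= read -r f; do
        printf '%s  %s\n' "$(head -c 4194304 < "$f" | md5sum | cut -c1-32)" "$f"
    done < "$tmp/same_size" |
        sort | uniq -w32 --all-repeated | cut -c35- > "$tmp/same_head"

    # Stage 3: full md5 of the remaining candidates; print the groups that repeat,
    # with a blank line between groups.
    while IFS= read -r f; do
        printf '%s  %s\n' "$(md5sum < "$f" | cut -c1-32)" "$f"
    done < "$tmp/same_head" |
        sort | uniq -w32 --all-repeated=separate | cut -c35-

    rm -rf "$tmp"
)
```

It would be called as find_dupes "/mnt/user/your_share" >/mnt/disk1/dupes_out.txt, with "your_share" changed as described above. The full-checksum stage is still the slow part.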
February 22, 2011

I've been using "Duplicate Files Finder" at http://doubles.sourceforge.net/ . Its algorithm is:

First, all files are sorted by their size, because files can be only equal, if they have the same size (logically). Then the files are compared with each other, and thus the equal files are determined. If two files are not equal from a given point on, reading is interrupted; no more has to be read for determining that these files are not equal. Because of this the results are determined much faster than in programs which use hashing algorithms, for which all files have to be read completely. Additional caching of the contents of the files additionally improves performance.

(from http://doubles.sourceforge.net/#Algorithm )

I was able to scan 3TB worth of data to identify over 40GB of duplicate files in just a couple of hours, comparing shares over a gigabit network and local hard drives.
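The compare-and-stop-early idea quoted above can be illustrated on the command line with cmp, which reads two files side by side and stops at the first byte that differs. This is a minimal sketch of the idea only, not how Duplicate Files Finder is implemented; same_content is a made-up helper name, and stat -c assumes GNU coreutils.

```shell
# same_content A B: succeed only if files A and B hold identical bytes.
# Hypothetical helper illustrating "check size first, then compare until a mismatch".
same_content() {
    # Files of different sizes can never be equal, so nothing needs to be read.
    [ "$(stat -c %s -- "$1")" = "$(stat -c %s -- "$2")" ] || return 1
    # cmp -s prints nothing and exits non-zero at the first differing byte.
    cmp -s -- "$1" "$2"
}
```

For example, same_content file1 file2 && echo "duplicates" would print only when the two files match byte for byte.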
February 22, 2011

Same basic algorithm I use in the series of Linux commands in my prior post.

It might prove easier, though, for some to use a Windows-based front end, even though it might be slower since it is over the LAN.

Joe L.
February 22, 2011 Author

Thank you both for your replies.

Joe, I am more than impressed with the effort that's obvious in your response. I'm a DOS veteran, so I [am] comforted by seeing C:\ in times of trouble, but I have to admit, when I saw "mnt/disk1/dupes_tmp3 | uniq -w32 --all-repeated| cut -c36- | sed -e 's/"/\\\"/g' -e 's/\(.*\)/"\1"/' >/mnt" I had to laugh! The chances of my correctly typing that, without inserting a space or making an outright mistake... and doing it correctly for each folder I'm interested in parsing... slim.

So, I salute you, Joe: you're hard-core, even in the eyes of a DOS user. And I really do thank you for the time you took in preparing your response. I'm going to try the Windows program first.
February 22, 2011

Even I would not type it in every time. And trust me, I did not get the syntax correct the very first time either.

It is attached in its basic form, zipped so you do not have to type it. It will do its magic on all your disk shares.

Joe L.

dupes.zip
February 24, 2011

Joe,

I just tried your attached script, and it fails on my unRAID 4.7 system. It appears to die with the error:

xargs: xargs.c:445: main: Assertion `bc_ctl.arg_max <= (131072-2048)' failed.

My assumption is that this script is the same as the one you posted in the v4.5 Support topic. I had posted (http://lime-technology.com/forum/index.php?topic=7018.msg97824#msg97824) there with the error as well.

A Google search turns up results that seem to indicate a problem with glibc. Any thoughts?

Thanks,
Dave