Duplicate file search utility

July 16, 2010

Hello All;

Any chance this utility (FSlint: http://www.pixelbeat.org/fslint/) is installed on unRAID, or something similar that can find duplicate files by comparing content, regardless of name?

Every so often I dump files from different PCs onto unRAID and end up with the same content multiple times: pictures, music, docs, etc. that I've downloaded and used independently on the notebooks in the house. From time to time I run a duplicate file search from one of the attached notebooks, but the process takes forever and completely ties up that notebook for as long as it is searching multiple shares.

Thanks for your time.

HG

July 16, 2010

It looks like it is possible, as there is a Slackware install here. Just download the .txz file and run installpkg and you should be good to go.

NOTE: I am away from my machine, so I have not tested this. When I get home tonight and have some time I will install it and see what happens (I will probably make an unMENU package for it too).

July 16, 2010

There is something like this already in unmenu. To quote the text beside the button:

Press the "Dupe Files" button to list the duplicate files that exist in multiple parallel /mnt/disk* folders, but where only one can be shown in the user share. These are logged by the User-share process to /var/log/syslog, but the entries do not give enough information to easily deal with the files.

To locate duplicates, this process must scan for files on all disks with similar directory paths to those in the syslog.

This process may take a while, please be patient and wait for the browser to refresh. If disks are spun down, they may spin up as directories are scanned if their directory entries are not in memory.

I think this should do what you want.

bux

July 16, 2010

I don't think so.. The function in unMENU will find duplicate named files in parallel directories on disks where the duplicate was identified and complained about in the syslog. It does not do anything if they are not in parallel directories in user shares on different disks.

I have a script that will do what the user is looking for... I'll post it in my next post as soon as I add a line or two to it to speed it up.

Joe L.

July 16, 2010

Author

Thanks very much for your prompt reaction to my note.

I think this is a tool to benefit anyone who is using the unraid as a backup server.

HG

July 16, 2010

I've used this set of commands in the past.

They use intermediate temporary files on disk1 since on systems with a large number of files it would run out of memory otherwise.

It is best to disable the "cache_dirs" program if you are using it while this is running, since it would otherwise be competing for the disk buffer cache.

This method of finding duplicate files will work even if you have the same file stored under different names.

It will take a long time to run if you have lots of files, but it will be faster than doing it over the LAN. (It could take hours to run if you have huge numbers of files, since it must read all of the potential duplicates in their entirety)

It may also have a difficult time if some of your file names contain unicode (extended) characters, but it should work.

It uses a series of temporary intermediate files on /mnt/disk1

The final output will be in /mnt/disk1/dupes_out.txt

1. It first gets a list of all the files on your disks and eliminates any file whose length is not the same as another file.

(A unique length indicates it must be a unique file)

2. Then, it generates the md5 checksum for the first 4Meg of every file where their length is not unique.

3. Then, it generates the md5 checksum on the entire file where the md5 checksum on the first 4Meg is not unique.

(If the md5 on the first 4 Meg is unique, it is a unique file, no need to read the remainder)

4. Lastly, it lists the files, grouping them where they have identical md5 checksums, (indicating any identical contents), even though they may be in different folders, or even have different names.
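Step 1 (the size filter) can be tried on its own with a few throwaway files. This is only a sketch: the temporary directory and file names are made up for the demo, nothing under /mnt is touched, and it assumes GNU find, awk and uniq like the script itself.

```shell
#!/bin/bash
# Hypothetical demo directory; the real script scans /mnt/disk* instead.
demo=$(mktemp -d)
printf 'aaaa' > "$demo/one.txt"     # 4 bytes
printf 'bbbb' > "$demo/two.txt"     # 4 bytes, same length as one.txt
printf 'cc'   > "$demo/three.txt"   # 2 bytes, unique length

# Emit "<size> <path>", zero-pad the size to 15 digits, sort, and keep
# only the lines whose first 15 characters (the size) appear more than once.
candidates=$(find "$demo" -type f ! -empty -printf '%s %p\n' |
    awk '{ printf "%015d %s\n", $1, substr($0, length($1) + 2) }' |
    sort | uniq -D -w 15 | cut -d' ' -f2-)

echo "$candidates"
rm -rf "$demo"
```

Only one.txt and two.txt survive the filter; three.txt has a length no other file shares, so it never needs to be read.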

The resulting list will look like this (each group is set of identical files, regardless of their names):

/mnt/disk4/Mp3/KC and the Sunshine Band/KC_The_Sunshine_Band_-_Shake_Your_Booty.mp3
/mnt/disk4/Mp3/KC and the Sunshine Band/Shake Your Booty-Earth Wind and Fire.mp3

/mnt/disk3/data/mg35/mg-kernel/mg-kernel/include/linux/i2c-id.h
/mnt/disk3/data/mg35/mg35tools/firmware/uClinux-2.4/include/linux/i2c-id.h

/mnt/disk1/Pictures/2009 - July 4th/IMG_1374.JPG
/mnt/disk1/Pictures/PictureFrame/DCIM/2009 - July 4th/IMG_1374.JPG

/mnt/disk3/data/mg35/mg-kernel/mg-kernel/fs/nls/nls_cp865.c
/mnt/disk3/data/mg35/mg35tools/firmware/uClinux-2.4/fs/nls/nls_cp865.c

/mnt/disk1/Movies/SD_VIDEO/PRG002/MOV0AB.avi
/mnt/disk1/Pictures/Movies/20080315_135748.MPG
/mnt/disk1/Pictures/PRG002/MOV0AB.MOD

/mnt/disk1/Movies/SD_VIDEO/PRG001/MOV024.avi
/mnt/disk3/data/USADanceShowcase-Sept2006/MOV024.MOD
/mnt/disk3/data/shared/JVC-VIDEO-CAMERA/SD_VIDEO/PRG001/MOV024.MOD

Edit: This script has difficulty with files containing embedded quotes. A correct script is attached in the post a little further on in this thread.

Here: http://lime-technology.com/forum/index.php?topic=7018.msg68073#msg68073

set -v
sysctl vm.vfs_cache_pressure=200
find /mnt/disk* ! -empty -type f -links 1 -printf "%s " -exec ls -dQ {} \; >/mnt/disk1/dupes_tmp1
sort -n /mnt/disk1/dupes_tmp1 | awk '{ printf "%015d %s\n", $1, $0}' | cut -d" " -f1,3- |  uniq -D -w 15 | cut -d" " -f2- >/mnt/disk1/dupes_tmp2
sed "s/'/\\\'/g" < /mnt/disk1/dupes_tmp2 | xargs -n 1 -I FiLeNaMeX sh -c "dd if='FiLeNaMeX' count=1 ibs=4M 2>/dev/null | md5sum -| tr -d '\n'; echo 'FiLeNaMeX'" >/mnt/disk1/dupes_tmp3
sort /mnt/disk1/dupes_tmp3 | uniq -w32 --all-repeated| cut -c36- | sed -e 's/"/\\\"/' -e 's/\(.*\)/"\1"/' >/mnt/disk1/dupes_tmp4
cat /mnt/disk1/dupes_tmp4 | xargs md5sum >/mnt/disk1/dupes_tmp5
sort /mnt/disk1/dupes_tmp5 | uniq -w32 -d --all-repeated=separate | cut -c35- >/mnt/disk1/dupes_out.txt

Edit: Improved performance by fixing how unique file sizes are determined in above script.

Joe L.

July 17, 2010

Author

Thanks much, Joe L. I like the approach. I'll copy it into a script in the root directory and give it a try. All seems to be perfect, and I'll just let it run on the console until it finishes!

hg

July 17, 2010

I had not properly dealt with file names with apostrophes (single quotes)

Note to self: "You cannot escape the special meaning of single-quote with a backslash if the string is already surrounded by single-quotes."
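For anyone puzzled by that note, here is a small sketch of the idiom the corrected script relies on: close the single-quoted string, add a backslash-escaped quote, and reopen it, so that ' becomes '\''. The file name is invented for the demo.

```shell
#!/bin/bash
name="Bob's file.txt"    # hypothetical name containing an apostrophe

# Backslash inside single quotes does NOT work: the shell sees an
# unterminated quote and refuses to run the command.
bad="printf '%s' 'Bob\\'s file.txt'"
sh -c "$bad" >/dev/null 2>&1
bad_status=$?

# Same sed expression as the corrected script: every ' becomes '\''
escaped=$(printf '%s' "$name" | sed "s/'/'\\\''/g")
good="printf '%s' '$escaped'"
result=$(sh -c "$good")

echo "backslash form exit status: $bad_status"
echo "escaped form prints: $result"
```

The second form hands the name to the inner shell unchanged, which is what the xargs ... sh -c step needs.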

In any case, here is the corrected script. (I strongly suggest you download and unzip the attachment, since the single and double quotes in it might be difficult to re-type accurately from the forum post.)

This method of finding duplicate files will work even if you have the same file stored under different names, or in folders that are not in parallel directories on physical disks.

It will take a long time to run if you have lots of files. (It could take hours to run if you have huge numbers of potentially duplicate files, since it must read all of the potential duplicates in their entirety)

It may also have a difficult time if some of your file names contain unicode (extended) characters, but it should work in most cases.

It uses a series of temporary intermediate files on /mnt/disk1 (/mnt/disk1/dupes_tmp1 /mnt/disk1/dupes_tmp2 /mnt/disk1/dupes_tmp3 /mnt/disk1/dupes_tmp4 and /mnt/disk1/dupes_tmp5 )

The final output will be in /mnt/disk1/dupes_out.txt. You can delete any and all of the temporary files when it is done.

This series of commands works like this:

1. It first gets a list of all the files on your disks, and their lengths, and eliminates any file whose length is not the same as another file.

(A unique length indicates it must be a unique file)

2. Then, it generates the md5 checksum for the first 4Meg of every file where their length is not unique.

(A unique MD5 checksum in the first 4 Meg would indicate a unique file, no need to verify by reading the remainder)

3. Then, it generates the md5 checksum on the entire file where the md5 checksum on the first 4Meg is not unique.

(If the md5 on the first 4 Meg was not unique we need to read the remainder)

4. Lastly, it lists the files, grouping them where they have identical md5 checksums, (and identical contents), even though they may be in different folders, or even have different names.
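Steps 2 and 3 can be illustrated with three small files. This is a sketch under stated assumptions, not the script itself: the files and the helper functions partial and full are made up, and a 1024-byte block stands in for the 4 Meg block so the demo stays tiny.

```shell
#!/bin/bash
demo=$(mktemp -d)    # hypothetical scratch directory

# a.bin and b.bin have the same length and the same first 1024 bytes,
# but different endings; c.bin is an exact copy of a.bin.
head -c 1024 /dev/zero > "$demo/a.bin"; printf 'tail-A' >> "$demo/a.bin"
head -c 1024 /dev/zero > "$demo/b.bin"; printf 'tail-B' >> "$demo/b.bin"
cp "$demo/a.bin" "$demo/c.bin"

# md5 of the first block only, and md5 of the whole file
partial() { dd if="$1" bs=1024 count=1 2>/dev/null | md5sum | cut -c1-32; }
full()    { md5sum < "$1" | cut -c1-32; }

pa=$(partial "$demo/a.bin"); pb=$(partial "$demo/b.bin")
fa=$(full "$demo/a.bin"); fb=$(full "$demo/b.bin"); fc=$(full "$demo/c.bin")

echo "partial: a=$pa b=$pb"
echo "full:    a=$fa b=$fb c=$fc"
rm -rf "$demo"
```

The partial checksums of a.bin and b.bin match, so both go on to the full read, where only a.bin and c.bin turn out to be identical.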

The resulting list will look like this (each group is set of identical files, regardless of their names):

/mnt/disk4/Mp3/KC and the Sunshine Band/KC_The_Sunshine_Band_-_Shake_Your_Booty.mp3
/mnt/disk4/Mp3/KC and the Sunshine Band/Shake Your Booty-Earth Wind and Fire.mp3

/mnt/disk3/data/mg35/mg-kernel/mg-kernel/include/linux/i2c-id.h
/mnt/disk3/data/mg35/mg35tools/firmware/uClinux-2.4/include/linux/i2c-id.h

/mnt/disk1/Pictures/2009 - July 4th/IMG_1374.JPG
/mnt/disk1/Pictures/PictureFrame/DCIM/2009 - July 4th/IMG_1374.JPG

/mnt/disk3/data/mg35/mg-kernel/mg-kernel/fs/nls/nls_cp865.c
/mnt/disk3/data/mg35/mg35tools/firmware/uClinux-2.4/fs/nls/nls_cp865.c

/mnt/disk1/Movies/SD_VIDEO/PRG002/MOV0AB.avi
/mnt/disk1/Pictures/Movies/20080315_135748.MPG
/mnt/disk1/Pictures/PRG002/MOV0AB.MOD

/mnt/disk1/Movies/SD_VIDEO/PRG001/MOV024.avi
/mnt/disk3/data/USADanceShowcase-Sept2006/MOV024.MOD
/mnt/disk3/data/shared/JVC-VIDEO-CAMERA/SD_VIDEO/PRG001/MOV024.MOD

Once you get the list it will be up to you to decide which files in each set can be deleted. (Make sure you leave at least one of them; otherwise you will certainly have no copy left.)

The script also sets the vm.vfs_cache_pressure to 200 to encourage release of the cache as this program scans the directories and files. Otherwise it is possible to run out of memory as it is all used for the cache. If you are running cache_dirs on your server you might want to disable it temporarily, as it will only slow the process down. (It too will be scanning all your disks)

set -v
sysctl vm.vfs_cache_pressure=200
find /mnt/disk* ! -empty -type f -links 1 -printf "%s " -exec ls -dQ {} \; >/mnt/disk1/dupes_tmp1
sort -n /mnt/disk1/dupes_tmp1 | awk '{ printf "%015d %s\n", $1, $0}' | cut -d" " -f1,3- |  uniq -D -w 15 | cut -d" " -f2- >/mnt/disk1/dupes_tmp2
sed  "s/'/'\\\''/g" < /mnt/disk1/dupes_tmp2 | xargs -n 1 -I FiLeNaMeX sh -c "dd if='FiLeNaMeX' count=1 ibs=4M 2>/dev/null | md5sum -| tr -d '\n'; echo 'FiLeNaMeX'" >/mnt/disk1/dupes_tmp3
sort /mnt/disk1/dupes_tmp3 | uniq -w32 --all-repeated| cut -c36- | sed -e 's/"/\\\"/g' -e 's/\(.*\)/"\1"/' >/mnt/disk1/dupes_tmp4
cat /mnt/disk1/dupes_tmp4 | xargs md5sum >/mnt/disk1/dupes_tmp5
sort /mnt/disk1/dupes_tmp5 | uniq -w32 -d --all-repeated=separate | cut -c35- >/mnt/disk1/dupes_out.txt

To help, I've attached a zipped copy to this post.

It takes 70 minutes to run on my newer C2SEE based, dual core unRAID server where I have just under 100,000 files to compare for duplicates.

On my older Intel D865GLCLK based unRAID server it took 120 minutes to scan the same exact files to compare for duplicates.

(My two servers have exact mirrors of the same data, the older server is PCI based with mostly older IDE drives, and the newer with 7200RPM SATA drives and onboard SATA controller)

Joe L.

dupes.zip

July 17, 2010

Thanks Joe, now seems to be working perfectly.

July 17, 2010

Did it find your duplicates? How long did it take?

Joe L.

July 17, 2010

Awesome work as always Joe. Is this utility added to the official unofficial user addon packages?

July 17, 2010

No.... not yet...

but you never can tell...

Joe L.

July 17, 2010

Thanks Joe, now seems to be working perfectly.

Did it find your duplicates? How long did it take?

Hmm, about an hour... That's about 11 TB of media, probably 50k files.

July 17, 2010

Author

I ran the script here and found I had not properly dealt with file names with apostrophes (single quotes)

Note to self: "You cannot escape the special meaning of single-quote with a backslash if the string is already surrounded by single-quotes."

....

Joe L.

Hello again and thanks much for the time taken to share this script -

A comment on the speed: my media server runs on a single-core Celeron. The only record I can set is for slowness...

Nevertheless - I've let it run at its pace, and it is interesting that it started complaining of files not found. The absolute path is quite deep, and the names quite long. The ones I caught on the screen were not too important (a Favorites folder and long urls saved there) and I couldn't figure out if there were other (more significant files) also not found. I went into the directory structure through midnight commander, and the last part of the directory listed in the error, was not on the disk. So I'm not entirely sure what is going on, and where that path came from.

Any thoughts?

In the same context, of funny/long names, is it correct to assume that (once I filter my duplicates and decide what to delete), cat final_list | xargs rm will get rid of them? I'm tempted to put an -i there, but I have thousands of duplicates and I'll grow old answering yes to each line...

Thanks for your insight!

July 18, 2010

Hello again and thanks much for the time taken to share this script -

A comment on the speed: my media server runs on a single-core Celeron. The only record I can set is for slowness...

That is the same as my older server.

It too uses a single-core 2.6 GHz Celeron.

Nevertheless - I've let it run at its pace, and it is interesting that it started complaining of files not found. The absolute path is quite deep, and the names quite long. The ones I caught on the screen were not too important (a Favorites folder and long urls saved there) and I couldn't figure out if there were other (more significant files) also not found. I went into the directory structure through midnight commander, and the last part of the directory listed in the error, was not on the disk. So I'm not entirely sure what is going on, and where that path came from.

Any thoughts?

No, but you have all the intermediate temporary files in /mnt/disk1/dupe*, so you might be able to figure out where they are coming from.

In the same context, of funny/long names, is it correct to assume that (once I filter my duplicates and decide what to delete), cat final_list | xargs rm will get rid of them? I'm tempted to put an -i there, but I have thousands of duplicates and I'll grow old answering yes to each line...

Thanks for your insight!

I too have several on one of my two servers, where files with "extended" characters in their names are expanded in the initial "find" to the octal equivalent, but not recognized later by md5sum.

Yes, when you finally get a list of the files you wish to delete you can invoke the "xargs rm" command as you described. Just be sure to leave one copy of each file so you don't delete them in their entirety.
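One caution about long or odd names: plain xargs splits its input on spaces and treats quotes specially, so a path such as "2009 - July 4th/IMG_1374.JPG" would be handed to rm in pieces. A hedged sketch of a safer form follows. It assumes GNU xargs (for the -d option) and a hand-edited list with one unquoted path per line; final_list and the demo files are hypothetical.

```shell
#!/bin/bash
demo=$(mktemp -d)    # hypothetical scratch directory, not a real share
touch "$demo/keep me.jpg" "$demo/copy of keep me.jpg" "$demo/Bob's copy.jpg"

# One path per line, listing ONLY the copies to delete
printf '%s\n' "$demo/copy of keep me.jpg" "$demo/Bob's copy.jpg" > "$demo/final_list"

# Dry run first: print each rm command instead of running it
xargs -d '\n' -n 1 echo rm -- < "$demo/final_list"

# Real run: newline is the only separator, so spaces and quotes are literal
xargs -d '\n' rm -- < "$demo/final_list"

kept=no;     [ -e "$demo/keep me.jpg" ] && kept=yes
removed=yes; [ -e "$demo/copy of keep me.jpg" ] && removed=no
             [ -e "$demo/Bob's copy.jpg" ] && removed=no
echo "kept original: $kept, removed copies: $removed"
rm -rf "$demo"
```

The dry run gives a chance to read the whole list once, which is less tedious than answering rm -i thousands of times.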

October 25, 2010

Hi, I want to try out the FSlint package. I ran installpkg successfully, but I don't know how to start the GUI. I'm a bit of a noob, especially at Linux command-line stuff. Any help is appreciated.

Incidentally, I did try Joe's script, but I got 98,000 dupes in the txt file, which is a big list to tackle. I'd rather not mass-delete them all from the command line, because I'd like to organize things a bit and pare down duplicate directories without just deleting half the files out of each copy of the same directory.

October 25, 2010

FSlint has a GUI but it is not web based, so you will not be able to use it. You can use the command line with FSlint but that is it.

October 25, 2010

OK. I guess that means I can't do it over a telnet session either (that's how I installed it). Thanks for the response!

Now I'm off to restart the tediously slow Windows dup deleter over the LAN again... scanning 9TB over LAN is sloooooooowwwww

October 25, 2010

You can use FSlint on the unRAID server; you just have to use the command-line version that was installed with the GUI version.

We (the community) have tricked at least one application (Crashplan) into working over ssh, but that is another thing entirely and I am not sure if it will work for FSlint. I am really busy with other things right now, but if I get a chance I will take a look and see if something like what was done with Crashplan can be done with FSlint.

The only other thing you can do is contact the developer of FSlint and ask for a WebGui or some type of redirection like what we do with Crashplan.

October 26, 2010

On my Windows Machine I have been using AllDup for a while and it seems to have many options for finding duplicates ...

January 18, 2011

I tried this script tonight, but it fails after a short time:

root@Tower:~# /boot/custom/dupes.sh
sysctl vm.vfs_cache_pressure=200
vm.vfs_cache_pressure = 200
find /mnt/disk* ! -empty -type f -links 1 -printf "%s " -exec ls -dQ {} \; >/mnt/disk1/dupes_tmp1
sort -n /mnt/disk1/dupes_tmp1 | awk '{ printf "%015d %s\n", $1, $0}' | cut -d" " -f1,3- |  uniq -D -w 15 | cut -d" " -f2- >/mnt/disk1/dupes_tmp2
sed  "s/'/'\\\''/g" < /mnt/disk1/dupes_tmp2 | xargs -n 1 -I FiLeNaMeX sh -c "dd if='FiLeNaMeX' count=1 ibs=4M 2>/dev/null | md5sum -| tr -d '\n'; echo 'FiLeNaMeX'" >/mnt/disk1/dupes_tmp3
xargs: xargs.c:445: main: Assertion `bc_ctl.arg_max <= (131072-2048)' failed.
/boot/custom/dupes.sh: line 5:  4435 Broken pipe             sed "s/'/'\\\''/g" </mnt/disk1/dupes_tmp2
      4436 Aborted                 | xargs -n 1 -I FiLeNaMeX sh -c "dd if='FiLeNaMeX' count=1 ibs=4M 2>/dev/null | md5sum -| tr -d '\n'; echo 'FiLeNaMeX'" >/mnt/disk1/dupes_tmp3
sort /mnt/disk1/dupes_tmp3 | uniq -w32 --all-repeated| cut -c36- | sed -e 's/"/\\\"/g' -e 's/\(.*\)/"\1"/' >/mnt/disk1/dupes_tmp4
cat /mnt/disk1/dupes_tmp4 | xargs md5sum >/mnt/disk1/dupes_tmp5
xargs: xargs.c:445: main: Assertion `bc_ctl.arg_max <= (131072-2048)' failed.
/boot/custom/dupes.sh: line 7:  4441 Done                    cat /mnt/disk1/dupes_tmp4
      4442 Aborted                 | xargs md5sum >/mnt/disk1/dupes_tmp5
sort /mnt/disk1/dupes_tmp5 | uniq -w32 -d --all-repeated=separate | cut -c35- >/mnt/disk1/dupes_out.txt

As far as I can tell, the issue involves the line:

xargs: xargs.c:445: main: Assertion `bc_ctl.arg_max <= (131072-2048)' failed.

A Google search returns results that appear to point to glibc. I've installed the '"C" compiler & development tools' package from unMENU with no change in the results.

Does anyone have any thoughts on this? I'd love to put this script to use.
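The assertion is raised inside xargs itself, so one possible workaround is to avoid xargs for the checksum step and read the list in a shell loop instead. This is an untested-on-unRAID sketch, not a fix to the posted script: it assumes a plain list with one unquoted path per line (unlike dupes_tmp4, which holds double-quoted names), and list.txt, sums.txt and the demo files are hypothetical.

```shell
#!/bin/bash
demo=$(mktemp -d)    # hypothetical scratch directory
printf 'same' > "$demo/a file.txt"
printf 'same' > "$demo/b file.txt"
printf '%s\n' "$demo/a file.txt" "$demo/b file.txt" > "$demo/list.txt"

# IFS= and -r keep leading spaces and backslashes in each path intact;
# md5sum is called once per file, with no xargs involved.
while IFS= read -r path; do
    md5sum "$path"
done < "$demo/list.txt" > "$demo/sums.txt"

lines=$(wc -l < "$demo/sums.txt")
unique_sums=$(cut -c1-32 "$demo/sums.txt" | sort -u | wc -l)
echo "checksummed $lines files, $unique_sums distinct checksum(s)"
rm -rf "$demo"
```

Calling md5sum once per file is slower than batching, but on a job that already takes an hour or more the difference is likely to be small.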
