Duplicate mp3 files



any ideas?

yes.

 

The attached script will scan your entire disk array for duplicate files, regardless of their names or paths.  (Only EXACT duplicates are reported... if files differ in size or in content, they are NOT considered duplicates.)

The output file it creates will have content that looks like this:

/mnt/disk1/Pictures/Misc-Pictures/100OLYMP13/PB291591.JPG

/mnt/disk1/Pictures/Misc-Pictures/100OLYMP5/PB291591.JPG

/mnt/disk1/Pictures/Misc-Pictures2/100OLYMP13/PB291591.JPG

/mnt/disk1/Pictures/Misc-Pictures2/100OLYMP5/PB291591.JPG

 

/mnt/disk1/Pictures/Misc-Pictures/100OLYMP11/P2281705.JPG

/mnt/disk1/Pictures/Misc-Pictures/100OLYMP12/P2281705.JPG

/mnt/disk1/Pictures/Misc-Pictures2/100OLYMP11/P2281705.JPG

/mnt/disk1/Pictures/Misc-Pictures2/100OLYMP12/P2281705.JPG

/mnt/disk3/data/shared/100OLYMP/P2281705.JPG

 

/mnt/disk3/data/packages5-10/cpio-unmenu-package.conf

/mnt/disk6/boot/packages/cpio-unmenu-package.conf

/mnt/disk6/boot/packages5-10/cpio-unmenu-package.conf

/mnt/disk6/boot/unmenu/cpio-unmenu-package.conf

 

/mnt/disk3/Pictures/102SANDS/SANY2050.JPG

/mnt/disk6/Pictures/Misc-Pictures/102SANDS/SANY2050.JPG

 

/mnt/disk3/data/packages5-10/libX11-1.1.5-i486-1.tgz

/mnt/disk3/data/packagesSept2009/libX11-1.1.5-i486-1.tgz

/mnt/disk6/boot/packages/libX11-1.1.5-i486-1.tgz

/mnt/disk6/boot/packages5-10/libX11-1.1.5-i486-1.tgz

 

/mnt/disk4/Mp3/Stations/Country/181 FM Classic Hits Home of The 60 s and 70 s.url

/mnt/disk4/Mp3/Stations/Top 50/181 FM Classic Hits Home of The 60 s and 70 s.url

 

/mnt/disk1/Pictures/2007-FamilyReunion - Jackie's Pictures/DSCF0092.JPG

/mnt/disk1/Pictures/PictureFrame/DCIM/2007-FamilyReunion - Jackie's Pictures/DSCF0092.JPG

 

/mnt/disk1/Pictures/2009-Classic/4/VIDEO_TS/VTS_39_0.BUP

/mnt/disk1/Pictures/2009-Classic/4/VIDEO_TS/VTS_39_0.IFO

 

/mnt/disk1/Movies/SD_VIDEO/MOV0AF.avi.nfo

/mnt/disk4/Pictures/SD_VIDEO/MOV0AF.avi.nfo

echo "FINISHED: DUPES ARE IN FILE: /mnt/disk1/dupes_out.txt"

FINISHED: DUPES ARE IN FILE: /mnt/disk1/dupes_out.txt

 

The commands in the attached find_dupes.sh script are:

set -v                                  # verbose mode: echo each command as it runs
sysctl vm.vfs_cache_pressure=200        # let the kernel give up cached memory more readily during the scan
# List every non-empty regular file (not hard-linked, up to 5 levels deep), preceded by its size in bytes
find /mnt/disk* ! -empty -maxdepth 5 -type f -links 1 -printf "%s " -exec ls -dQ {} \; | tee /mnt/disk1/dupes_tmp1
# Keep only files whose size matches another file's (sizes zero-padded to 15 digits so uniq can compare them)
sort -n /mnt/disk1/dupes_tmp1 | awk '{ printf "%015d %s\n", $1, $0}' | cut -d" " -f1,3- |  uniq -D -w 15 | cut -d" " -f2- | tee /mnt/disk1/dupes_tmp2
# MD5 the first 4MB of each remaining file (the sed escapes single quotes in file names)
sed  "s/'/'\\\''/g" < /mnt/disk1/dupes_tmp2 | xargs -n 1 -I FiLeNaMeX sh -c "dd if='FiLeNaMeX' count=1 ibs=4M 2>/dev/null | md5sum -| tr -d '\n'; echo 'FiLeNaMeX'" | tee /mnt/disk1/dupes_tmp3
# Keep only files whose first-4MB checksum matches another file's, and re-quote the names for xargs
sort /mnt/disk1/dupes_tmp3 | uniq -w32 --all-repeated| cut -c36- | sed -e 's/"/\\\"/g' -e 's/\(.*\)/"\1"/' | tee /mnt/disk1/dupes_tmp4
# MD5 the full contents of the remaining candidate files
cat /mnt/disk1/dupes_tmp4 | xargs md5sum | tee /mnt/disk1/dupes_tmp5
# Group the files with identical full-file checksums, one blank-line-separated group per set of dupes
sort /mnt/disk1/dupes_tmp5 | uniq -w32 -d --all-repeated=separate | cut -c35- | tee /mnt/disk1/dupes_out.txt
echo "FINISHED: DUPES ARE IN FILE: /mnt/disk1/dupes_out.txt"

 

Download it and unzip it onto your flash drive, then run:

find_dupes.sh
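If typing its name alone does not start it, invoking it through the shell also works; the path below assumes the flash drive is mounted at /boot, the unRAID default:

sh /boot/find_dupes.sh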

It will create temp files on disk1.  If this is not OK, change the script accordingly.
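For example, a minimal way to redirect them (a sketch, not from the original script; DUPDIR is a variable name introduced here) is to set a variable at the top of find_dupes.sh and use it in place of the hard-coded /mnt/disk1 paths:

DUPDIR=/mnt/disk2    # hypothetical alternate location for the temp and output files
find /mnt/disk* ! -empty -maxdepth 5 -type f -links 1 -printf "%s " -exec ls -dQ {} \; | tee $DUPDIR/dupes_tmp1

The same substitution would apply to every later line that mentions /mnt/disk1.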

There are a number of intermediate temp files it creates when running.  You can delete them all once you get the final result. 

The final set of files that are dupes is in /mnt/disk1/dupes_out.txt

 

As you can see in the above sample output from my newer server, the files can be in different directories, and even have different names, but if they have the same md5 checksum, they are considered the same file. 

 

The actual process runs in several steps.

The first command sets verbose mode so you can see the script running.

The second sets a kernel parameter that lets the kernel reclaim cached memory more readily, making it easier for other processes to grab memory if they need it.

The third, the "find" command, finds all the files and lists them, each preceded by its size in bytes.

The fourth keeps those that do not have a unique size.  (If a file is unique in size, it cannot be a duplicate of another file.)

The fifth keeps those whose MD5 checksum of the first 4 MB of content is not unique.  (If a file is already unique in its first 4 MB, it is unique overall; there is no need to check the balance of the file.)

The sixth computes the MD5 checksum of the entire contents of the files that were not unique in their first 4 MB.

The seventh sorts those, deletes the ones with a unique full-file MD5 checksum, and groups the rest in a readable way.  The output is put in /mnt/disk1/dupes_out.txt.
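As a hypothetical illustration of the size-matching step (names and sizes invented here), two lines in dupes_tmp1 such as

1048576 "/mnt/disk1/Mp3/a.mp3"
1048576 "/mnt/disk2/Mp3/b.mp3"

both get the 15-digit zero-padded key 000000001048576, so "uniq -D -w 15" keeps them as candidates for checksumming, while any file whose size appears only once is dropped from further consideration.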

 

While the processing is occurring, the output is also sent to the terminal being used.  It is interesting to watch.

 

It will take quite a few hours if you have a large number of files to scan.  Oh yes, I limited how deep the scan descends; that is the -maxdepth 5 in the "find" command.  (I had some Windows backups that were nested far deeper and did not want them cluttering the results.)
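If your directory tree is deeper than that, raise the -maxdepth value in the first "find" line; for example, -maxdepth 8 descends eight levels:

find /mnt/disk* ! -empty -maxdepth 8 -type f -links 1 -printf "%s " -exec ls -dQ {} \; | tee /mnt/disk1/dupes_tmp1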

 

It is up to you to delete all but one of the duplicates...  (the process does nothing to delete files.  It will only show you where they are.  Whatever you do, if it shows you have two copies of a file, do NOT delete both unless you want NO copies of the file to remain)

 

It is expected you'll use the opportunity to organize the files as you desire, deleting all but ONE copy of each file you want to keep.

 

Joe L.

find_dupes.zip


Is there a way to make it search only in "\Alldata\music"?

Because under \Alldata\Movies I have a lot of duplicate files, but those may not be deleted..

And this way the list is very long, lol.

If it's not possible, no problem; I already like this script!

 

My output also looks much more hectic; below is a screenshot, everything runs together.

58432185.jpg

 

 


 

You can filter the final output file using the "grep" command

 

grep "\/Alldata\/music" /mnt/disk1/dupes_out.txt > /mnt/disk1/dupes_out_filtered.txt


If you do that, you'll lose the spaces the script puts between the different files.

 

Instead, just modify the very first find command like this:

from

find /mnt/disk* ! -empty -maxdepth 5 -type f -links 1 -printf "%s " -exec ls -dQ {} \; | tee /mnt/disk1/dupes_tmp1

to

find /mnt/disk*/AllData/music ! -empty -maxdepth 5 -type f -links 1 -printf "%s " -exec ls -dQ {} \; | tee /mnt/disk1/dupes_tmp1

 

If your "music" sub-directory is actually named "Music", use the capitalized "Music" in the "find" command instead; otherwise the script will not match the directory name and nothing will print.  (It will run really fast, though, since no files will be found.)  ;)
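Alternatively, if you would rather filter the existing output without re-running the scan, awk's paragraph mode can keep the blank-line group separators that grep throws away (a sketch, not part of the original script):

awk -v RS= -v ORS='\n\n' '/\/Alldata\/music\//' /mnt/disk1/dupes_out.txt > /mnt/disk1/dupes_out_filtered.txt

Setting RS to empty makes awk treat each blank-line-separated group as one record, so a whole group is printed whenever any path in it matches.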

 

Joe L.


I don't get why mp3s are skipped...

I have tons of doubles (I checked) but they aren't shown by the script.

And the directory is correct.

You can check whether they have the same MD5 checksum.  If not, they are not duplicates, even if you think they are.

 

To see the md5 checksum, type:

md5sum /mnt/disk*/AllData/music/path/to/mp3_file.mp3
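For example (these paths are placeholders for your own), checksumming two suspected copies in one command makes the comparison easy:

md5sum "/mnt/disk1/AllData/music/Some Song.mp3" "/mnt/disk2/AllData/music/Some Song.mp3"

If the two 32-character hashes printed are identical, the files are byte-for-byte duplicates.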


 

When I type that in, it says "no such file or directory".

 

Also, when I run the script I definitely see all the mp3s that are duplicates, but afterwards when I open up the text file, none of them are in it.


It's not working for me :(

It could be that some duplicate files are not exactly the same.

For example:

Acdc - Thunderstruck        length is 4:53
Acdc - Thunderstruck        length is 4:52

Those files aren't really exactly the same, but it is the same song...
although they are in a different folder too.

So how do I filter those? The script doesn't bring those up for me.


That is a different request.  Obviously, if the durations differ, they are different recordings of the song, and I can guarantee that the file sizes will be different (and the checksums too).
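For example (hypothetical paths), listing both files shows the byte counts directly; if those differ, the script will never group the pair:

ls -l "/mnt/disk4/Mp3/Acdc - Thunderstruck (4-53).mp3" "/mnt/disk4/Mp3/Acdc - Thunderstruck (4-52).mp3"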

 

 

 

 



 

Yes, sometimes the file sizes are different but the file name is the same;
sometimes the size is the same but the name isn't 100% the same, for example "Acdc" and "Ac Dc",
etc.

So how can I sort those out?
They are different, but not really; you get the picture.


There is no automated system that I can think of that will listen to the songs and make a choice for you. At some point you are going to have to manually delete the versions that you do not care for, as it's a matter of differing content, and differing opinions on which content version is right to keep. Perhaps you should cue up the duplicates in your favorite listening program and keep notes? As long as you can't hear a difference, delete the larger file size, or if you think more data is better, delete the smaller size. If you don't care to take the time to listen to them, then just make an arbitrary decision, because the files must not mean that much to you.


 

It doesn't have to listen to my songs to know they are the same, because the file name contains the same title.

So there's no way to show them?

How come the script shows them while it's running (I see those double songs pass by) but doesn't export them to the txt file?

Any idea, Joe?


I can rename a file anything I want.  Just because the name matches doesn't mean anything.  Duplicate files have the same binary contents, which Joe's script finds just fine.  If the files have different contents, they are different, and you will have to judge for yourself which one to keep.  How you make the decision is up to you.
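If you want to verify that for a specific pair, cmp reports the first byte at which two files differ (the paths here are placeholders):

cmp "/mnt/disk1/AllData/music/song.mp3" "/mnt/disk2/AllData/music/song.mp3"

Identical files produce no output; different files print the offset of the first mismatch.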


 

When I type that in, it says "no such file or directory".

You must put the correct path to YOUR files, not the literal text I gave.  Replace /path/to/mp3_file.mp3 with the path and name of the mp3 YOU think should be found as a dupe.

 

Also, when I run the script I definitely see all the mp3s that are duplicates, but afterwards when I open up the text file, none of them are in it.

You are seeing the very first pass of the process.  It lists EVERY file, preceded by its size in bytes, regardless of contents.
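To see how far a given file gets through the passes, you can grep the intermediate files (the file name here is a placeholder):

grep "mp3_file.mp3" /mnt/disk1/dupes_tmp1 /mnt/disk1/dupes_tmp2

If it appears in dupes_tmp1 but not in dupes_tmp2, no other file on the array has the same size, so it cannot have an exact duplicate.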
  • 4 months later...

Funny, just the other day I was googling for a dupe finder! There are tons of them out there, and I'll be scanning my stuff too. Personally I'm looking for exact dupes of the data, so I will be using something that scans and compares hashes. It will take forever, but I know there will be some to find, especially in my pictures, which are a mess. Different versions of the same song I'll keep, especially ones from different albums.
