June 9, 2012
My music is stored like this:
\Music\Albums
\Music\Singles
\Music\Top 100
Some MP3s are there two or three times... How can I easily find duplicate files (among 6000 MP3s) and delete them on unRAID?
June 9, 2012
> any ideas?

Yes. The attached script will scan your entire disk for duplicate files, regardless of their names or paths. (Only EXACT duplicates are reported... if files differ in size or content, they are NOT considered duplicates.)

The output file it creates will have content that looks like this:

    /mnt/disk1/Pictures/Misc-Pictures/100OLYMP13/PB291591.JPG
    /mnt/disk1/Pictures/Misc-Pictures/100OLYMP5/PB291591.JPG
    /mnt/disk1/Pictures/Misc-Pictures2/100OLYMP13/PB291591.JPG
    /mnt/disk1/Pictures/Misc-Pictures2/100OLYMP5/PB291591.JPG

    /mnt/disk1/Pictures/Misc-Pictures/100OLYMP11/P2281705.JPG
    /mnt/disk1/Pictures/Misc-Pictures/100OLYMP12/P2281705.JPG
    /mnt/disk1/Pictures/Misc-Pictures2/100OLYMP11/P2281705.JPG
    /mnt/disk1/Pictures/Misc-Pictures2/100OLYMP12/P2281705.JPG
    /mnt/disk3/data/shared/100OLYMP/P2281705.JPG

    /mnt/disk3/data/packages5-10/cpio-unmenu-package.conf
    /mnt/disk6/boot/packages/cpio-unmenu-package.conf
    /mnt/disk6/boot/packages5-10/cpio-unmenu-package.conf
    /mnt/disk6/boot/unmenu/cpio-unmenu-package.conf

    /mnt/disk3/Pictures/102SANDS/SANY2050.JPG
    /mnt/disk6/Pictures/Misc-Pictures/102SANDS/SANY2050.JPG

    /mnt/disk3/data/packages5-10/libX11-1.1.5-i486-1.tgz
    /mnt/disk3/data/packagesSept2009/libX11-1.1.5-i486-1.tgz
    /mnt/disk6/boot/packages/libX11-1.1.5-i486-1.tgz
    /mnt/disk6/boot/packages5-10/libX11-1.1.5-i486-1.tgz

    /mnt/disk4/Mp3/Stations/Country/181 FM Classic Hits Home of The 60 s and 70 s.url
    /mnt/disk4/Mp3/Stations/Top 50/181 FM Classic Hits Home of The 60 s and 70 s.url

    /mnt/disk1/Pictures/2007-FamilyReunion - Jackie's Pictures/DSCF0092.JPG
    /mnt/disk1/Pictures/PictureFrame/DCIM/2007-FamilyReunion - Jackie's Pictures/DSCF0092.JPG

    /mnt/disk1/Pictures/2009-Classic/4/VIDEO_TS/VTS_39_0.BUP
    /mnt/disk1/Pictures/2009-Classic/4/VIDEO_TS/VTS_39_0.IFO

    /mnt/disk1/Movies/SD_VIDEO/MOV0AF.avi.nfo
    /mnt/disk4/Pictures/SD_VIDEO/MOV0AF.avi.nfo
    echo "FINISHED: DUPES ARE IN FILE: /mnt/disk1/dupes_out.txt"
    FINISHED: DUPES ARE IN FILE: /mnt/disk1/dupes_out.txt

The commands in the attached find_dupes.sh script are:

    set -v
    sysctl vm.vfs_cache_pressure=200
    find /mnt/disk* ! -empty -maxdepth 5 -type f -links 1 -printf "%s " -exec ls -dQ {} \; | tee /mnt/disk1/dupes_tmp1
    sort -n /mnt/disk1/dupes_tmp1 | awk '{ printf "%015d %s\n", $1, $0}' | cut -d" " -f1,3- | uniq -D -w 15 | cut -d" " -f2- | tee /mnt/disk1/dupes_tmp2
    sed "s/'/'\\\''/g" < /mnt/disk1/dupes_tmp2 | xargs -n 1 -I FiLeNaMeX sh -c "dd if='FiLeNaMeX' count=1 ibs=4M 2>/dev/null | md5sum -| tr -d '\n'; echo 'FiLeNaMeX'" | tee /mnt/disk1/dupes_tmp3
    sort /mnt/disk1/dupes_tmp3 | uniq -w32 --all-repeated| cut -c36- | sed -e 's/"/\\\"/g' -e 's/\(.*\)/"\1"/' | tee /mnt/disk1/dupes_tmp4
    cat /mnt/disk1/dupes_tmp4 | xargs md5sum | tee /mnt/disk1/dupes_tmp5
    sort /mnt/disk1/dupes_tmp5 | uniq -w32 -d --all-repeated=separate | cut -c35- | tee /mnt/disk1/dupes_out.txt
    echo "FINISHED: DUPES ARE IN FILE: /mnt/disk1/dupes_out.txt"

Download the attachment and unzip it on your flash drive, then run find_dupes.sh.

It will create temp files on disk1. If this is not OK, change the script accordingly. There are a number of intermediate temp files it creates when running. You can delete them all once you get the final result. The final set of files that are dupes is in /mnt/disk1/dupes_out.txt

As you can see in the above sample output from my newer server, the files can be in different directories, and even have different names, but if they have the same MD5 checksum, they are considered the same file.

The actual process runs in several steps.
The first line sets verbose mode so you can see the script running.
The second sets a kernel parameter to make it easier for other processes to grab memory if they need it.
The third line, the "find" command, finds all the files and lists them preceded by their size in bytes.
The fourth finds those that do not have a unique size. (If a file is unique in size, it cannot be a duplicate of another file.)
The fifth and sixth find those whose first 4 MB do not have a unique MD5 checksum. (If a file is already unique within its first 4 MB, it is unique, and there is no need to check the rest of it.)
The seventh computes the MD5 checksum over the entire contents of the files that were not unique in their first 4 MB.
The eighth sorts those, groups them in a readable way, and drops the files that have a unique MD5 checksum. The output is put in /mnt/disk1/dupes_out.txt

While the processing is occurring, the output is also sent to the terminal being used. It is interesting to watch. It will take quite a few hours if you have a large number of files to scan.

Oh yes, I limited how deep the scan goes (the "find" line above uses -maxdepth 5). I had some Windows backups that were far deeper and did not want to bother with them in the results.

It is up to you to delete all but one of the duplicates. The process does nothing to delete files; it will only show you where they are. Whatever you do, if it shows you have two copies of a file, do NOT delete both unless you want NO copies of the file to remain. It is expected you'll use the opportunity to organize the files as you desire, deleting all but ONE of each set of duplicates.

Joe L.
find_dupes.zip
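To make the staged idea above concrete, here is a minimal sketch that runs on a throwaway directory instead of /mnt/disk*. It is not Joe's script: the file names are made up, the 4 MB partial-checksum stage is left out for brevity, and it assumes GNU find, xargs, uniq and md5sum (as found on unRAID).

```shell
#!/bin/bash
# Throwaway test data (hypothetical file names, not from the thread).
demo=$(mktemp -d)
mkdir -p "$demo/a" "$demo/b"
printf 'same content\n' > "$demo/a/one.mp3"
printf 'same content\n' > "$demo/b/copy of one.mp3"   # exact duplicate, different name
printf 'same length!\n' > "$demo/a/two.mp3"           # same size, different content
printf 'unique\n'       > "$demo/b/three.mp3"         # unique size, never checksummed

# Stage 1: list "size path" and keep only files whose size occurs more than once.
find "$demo" -type f ! -empty -printf '%s %p\n' | sort -n |
  awk '{ size[NR] = $1; line[NR] = $0; count[$1]++ }
       END { for (i = 1; i <= NR; i++) if (count[size[i]] > 1) print line[i] }' |
  cut -d' ' -f2- > "$demo.candidates"

# Stage 2: checksum only those candidates and keep groups sharing a checksum.
# The first 32 characters of each md5sum line are the hash; the path starts at column 35.
dupes=$(tr '\n' '\0' < "$demo.candidates" | xargs -0 md5sum | sort |
  uniq -w32 --all-repeated=separate | cut -c35-)
printf '%s\n' "$dupes"

rm -rf "$demo" "$demo.candidates"
```

Only one.mp3 and "copy of one.mp3" should be listed: two.mp3 passes the size stage but is dropped at the checksum stage, and three.mp3 is dropped at the size stage, which is why the cheap size comparison runs first.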
June 9, 2012 Author
Wow, this is so great! Very handy! Thanks a lot, Joe! Will try it ASAP tomorrow! Thanks again.
June 10, 2012 Author
Is there a way to make it search only in "\Alldata\music"? Under \Alldata\Movies I have a lot of duplicate files, but those must not be deleted... And this way the list is very long, lol. If it's not possible, no problem, I already like this script!

My output also looks much more hectic; below is a screenshot, everything is run together.
June 10, 2012
> Is there a way to make it search only in "\Alldata\music"? Under \Alldata\Movies I have a lot of duplicate files, but those must not be deleted... And this way the list is very long, lol. If it's not possible, no problem, I already like this script!

You can filter the final output file using the "grep" command:

    grep "\/Alldata\/music" /mnt/disk1/dupes_out.txt > /mnt/disk1/dupes_out_filtered.txt
June 10, 2012
> Is there a way to make it search only in "\Alldata\music"? Under \Alldata\Movies I have a lot of duplicate files, but those must not be deleted... And this way the list is very long, lol. If it's not possible, no problem, I already like this script!
> You can filter the final output file using the "grep" command:
>     grep "\/Alldata\/music" /mnt/disk1/dupes_out.txt > /mnt/disk1/dupes_out_filtered.txt

If you do that, you'll lose the blank lines the script puts between the different groups of files.

Instead, just modify the very first find command like this:

from

    find /mnt/disk* ! -empty -maxdepth 5 -type f -links 1 -printf "%s " -exec ls -dQ {} \; | tee /mnt/disk1/dupes_tmp1

to

    find /mnt/disk*/AllData/music ! -empty -maxdepth 5 -type f -links 1 -printf "%s " -exec ls -dQ {} \; | tee /mnt/disk1/dupes_tmp1

If your "music" sub-directory is actually named "Music", use a capitalized "Music" in the "find" command instead of "music". Otherwise the script will not match the directory name and nothing will print (but it will run really fast, since no files will be found).

Joe L.
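If you would rather filter an existing dupes_out.txt than rescan, one possible sketch is awk's paragraph mode, which keeps the blank lines between groups that grep drops. The sample paths below are made up, and the directory spelling (AllData/music) is an assumption that must match your own share, including upper and lower case.

```shell
#!/bin/bash
# Hypothetical sample in the same layout as dupes_out.txt:
# groups of duplicate paths separated by blank lines.
out=$(mktemp)
cat > "$out" <<'EOF'
/mnt/disk1/AllData/music/Albums/song.mp3
/mnt/disk2/AllData/music/Singles/song.mp3

/mnt/disk1/AllData/Movies/film.avi
/mnt/disk3/AllData/Movies/film.avi
EOF

# RS="" makes awk read each blank-line-separated group as one record;
# a whole group is printed when any path in it contains /AllData/music/.
filtered=$(awk 'BEGIN { RS = ""; ORS = "\n\n" } /\/AllData\/music\//' "$out")
printf '%s\n' "$filtered"

rm -f "$out"
```

Note that a group is kept whole if any one of its paths matches, so a copy that lives outside the music directory would still be listed next to its music twin.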
June 10, 2012 Author
Hmm, strange: it works for everything except MP3s. Somehow it's skipping those files here...
June 10, 2012 Author
I don't get why MP3s are skipped... I have tons of doubles (I checked), but they aren't shown by the script. And the directory is correct.
June 10, 2012
> I don't get why MP3s are skipped... I have tons of doubles (I checked), but they aren't shown by the script. And the directory is correct.

You can check whether they have the same MD5 checksum. If not, they are not duplicates, even if you think they are.

To see the MD5 checksum, type:

    md5sum /mnt/disk*/AllData/music/path/to/mp3_file.mp3
June 10, 2012 Author
> I don't get why MP3s are skipped... I have tons of doubles (I checked), but they aren't shown by the script. And the directory is correct.
> You can check whether they have the same MD5 checksum. If not, they are not duplicates, even if you think they are.
> To see the MD5 checksum, type:
>     md5sum /mnt/disk*/AllData/music/path/to/mp3_file.mp3

When I type that in, it says "No such file or directory".

Also, when I run the script I definitely see all the MP3s that are duplicates, but afterwards, when I open the text file, none of them are in it.
June 10, 2012 Author
It's not working for me. It could be that some duplicate files are not exactly the same, for example:

Acdc - Thunderstruck, length 4:53
Acdc - Thunderstruck, length 4:52

Those files aren't exactly the same, but it is the same song... and they are in different folders too. So how do I filter those out? The script doesn't show me those.
June 10, 2012
> It's not working for me. It could be that some duplicate files are not exactly the same, for example:
> Acdc - Thunderstruck, length 4:53
> Acdc - Thunderstruck, length 4:52
> Those files aren't exactly the same, but it is the same song... and they are in different folders too. So how do I filter those out? The script doesn't show me those.

That is a different request. Obviously, if the durations differ, they are different songs, and I can guarantee that the file sizes (and the checksums) will be different.
June 10, 2012 Author
> That is a different request. Obviously, if the durations differ, they are different songs, and I can guarantee that the file sizes (and the checksums) will be different.

Yes, sometimes the file sizes are different but the file name is the same; sometimes the size is the same but the name isn't 100% the same, for example "Acdc" and "Ac Dc", etc. So how can I sort those out? They are different but not really, you get the picture.
June 10, 2012
> Yes, sometimes the file sizes are different but the file name is the same; sometimes the size is the same but the name isn't 100% the same, for example "Acdc" and "Ac Dc", etc. So how can I sort those out? They are different but not really, you get the picture.

There is no automated system that I can think of that will listen to the songs and make a choice for you. At some point you are going to have to manually delete the versions that you do not care for, as it's a matter of differing content, and differing opinions on which content version is right to keep.

Perhaps you should cue up the duplicates in your favorite listening program and keep notes? As long as you can't hear a difference, delete the larger file, or if you think more data is better, delete the smaller one. If you don't care to take the time to listen to them, then just make an arbitrary decision, because the files must not mean that much to you.
June 11, 2012 Author
> There is no automated system that I can think of that will listen to the songs and make a choice for you. At some point you are going to have to manually delete the versions that you do not care for, as it's a matter of differing content, and differing opinions on which content version is right to keep.
> Perhaps you should cue up the duplicates in your favorite listening program and keep notes? As long as you can't hear a difference, delete the larger file, or if you think more data is better, delete the smaller one. If you don't care to take the time to listen to them, then just make an arbitrary decision, because the files must not mean that much to you.

It doesn't have to listen to my songs to know they are the same, because the file names contain the same title.

So there's no way to show them? How come the script shows them (while it's running, I see those double songs pass by) but doesn't export them to the text file?

Any idea, Joe?
June 11, 2012
> It doesn't have to listen to my songs to know they are the same, because the file names contain the same title.

I can rename a file anything I want. Just because the name matches doesn't mean anything. Duplicate files have the same binary contents, which Joe's script finds just fine. If the files have different contents, they are different, and you will have to judge for yourself which one to keep. How you make the decision is up to you.
June 11, 2012
> To see the MD5 checksum, type:
>     md5sum /mnt/disk*/AllData/music/path/to/mp3_file.mp3
> When I type that in, it says "No such file or directory".

You must put in the correct path to YOUR files, not the text I gave. Replace /path/to/mp3_file.mp3 with the path and name of YOUR MP3 that you think should be found as a dupe.

> Also, when I run the script I definitely see all the MP3s that are duplicates, but afterwards, when I open the text file, none of them are in it.

You are seeing the very first pass of the process. It lists EVERY file preceded by its size in bytes, regardless of its contents.
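As a small illustration of the check described above, the sketch below compares the MD5 checksums of two files. The two files are stand-ins created in a temporary directory, with names borrowed from the example in this thread; on a real server you would put the full paths of your own two MP3s in their place.

```shell
#!/bin/bash
# Stand-in files (hypothetical contents) for two MP3s believed to be duplicates.
dir=$(mktemp -d)
printf 'audio data v1\n' > "$dir/Acdc - Thunderstruck.mp3"
printf 'audio data v2\n' > "$dir/Ac Dc - Thunderstruck.mp3"

# Reading from stdin keeps the file name out of md5sum's output,
# so only the 32-character checksum is left to compare.
sum_a=$(md5sum < "$dir/Acdc - Thunderstruck.mp3" | cut -d' ' -f1)
sum_b=$(md5sum < "$dir/Ac Dc - Thunderstruck.mp3" | cut -d' ' -f1)

if [ "$sum_a" = "$sum_b" ]; then
  verdict="exact duplicates"
else
  verdict="different content"
fi
echo "$verdict"

rm -rf "$dir"
```

Quoting the paths matters here because the file names contain spaces; an unquoted path is one common cause of a "No such file or directory" error.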
June 12, 2012 Author
I see... Anyway, I've found a way to delete the duplicate MP3 files using Mp3tag. I could sort the files by name, and that way I could easily find the doubles.

Thanks anyway.
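For readers who want something like the sort-by-name approach on the command line, here is a rough sketch. It is an assumption about what "same name" should mean, not something from Joe's script: each file name is lowercased and stripped of everything except letters and digits, so "Acdc" and "Ac Dc" end up with the same key. The files are empty placeholders with made-up names, and a name match only suggests the same song; it says nothing about whether the contents are identical.

```shell
#!/bin/bash
# Empty placeholder files with hypothetical names.
dir=$(mktemp -d)
mkdir -p "$dir/Albums" "$dir/Singles"
: > "$dir/Albums/Acdc - Thunderstruck.mp3"
: > "$dir/Singles/Ac Dc - Thunderstruck.mp3"
: > "$dir/Singles/Other Song.mp3"

# Build "key<TAB>path" lines, then print only the lines whose key occurs more than once.
matches=$(find "$dir" -type f -name '*.mp3' | while IFS= read -r path; do
    key=$(basename "$path" .mp3 | tr '[:upper:]' '[:lower:]' | tr -cd '[:alnum:]')
    printf '%s\t%s\n' "$key" "$path"
  done | sort | awk -F'\t' '
    { count[$1]++; lines[NR] = $0; keys[NR] = $1 }
    END { for (i = 1; i <= NR; i++) if (count[keys[i]] > 1) print lines[i] }')
printf '%s\n' "$matches"

rm -rf "$dir"
```

Both Thunderstruck files should be listed under the key acdcthunderstruck, while "Other Song.mp3" is left out. Deciding which copy to keep is still a manual step.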
November 4, 2012
> My music is stored like this:
> \Music\Albums
> \Music\Singles
> \Music\Top 100
> Some MP3s are there two or three times... How can I easily find duplicate files (among 6000 MP3s) and delete them on unRAID?

I always prefer Duplicate Files Deleter for finding duplicate files and deleting them as well. It's a more hassle-free and user-friendly utility than the ones I used before.
November 4, 2012
Funny, just the other day I was googling for a dupe finder! There are tons of them out there, and I'll be scanning my stuff too. Personally, I'm looking for exact dupes of the data, so I will be using something that scans and compares hashes. It will take forever, but I know there will be some to find, especially in my pictures, which are a mess. Different versions of the same song I'll keep, especially when they come from different albums.