itimpi

unRAIDFindDuplicates.sh


It is possible in unRAID to end up with files with the same name on more than one array disk.  In particular this could happen when moving files around the system between disk shares (typically at the command line level) if you make a copy and forget to delete the source.

 

This can be an issue if you are using unRAID user shares (as I expect most unRAID users are), because unRAID only shows the first occurrence in such a case, so it may not be obvious that you have duplicate files on the system.  As well as potentially wasting space, this can lead to unexpected behaviour such as deleting a file on a user share and finding it appears to still be there (because unRAID is now showing you the other copy that was previously hidden).

 

Please note that we are talking about files with the same name that are present in more than one location and are thus wasting space.  This utility does not try to detect files with different names that have the same content.  If you want to detect such files then Joe L. (of pre-clear and cache-dirs fame) has developed a script that will do this, as described in this post.

 

It is possible to see that such duplicate filenames exist by browsing via the GUI, but this has to be done on a case-by-case basis and there is no easy way to get a consolidated list of all duplicates.  To get around this I created the attached script for my own use that is reasonably comprehensive and that others may find useful. The script runs very quickly (as it is working purely off directory information) so it is not much of a chore to run it at regular intervals as part of your system housekeeping.  LimeTech have on the Roadmap an item to include a duplicate checking utility as a standard feature at some point.  I thought that this script might be a useful stopgap (NOTE:  I am more than happy if LimeTech want to include this script (or a derivative) in any standard unRAID release).
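
As a rough illustration of why a name-only check can be quick, here is a minimal sketch of the idea. It is not the attached script: the function name `list_duplicate_names` and the assumption that disks appear as `disk*` directories under one root are mine, and file names containing newlines are not handled. It only reads directory listings and never opens a file.

```shell
# Sketch only: print every relative file path that exists under more than
# one "disk" directory. list_duplicate_names is a hypothetical helper and
# is not taken from unRAIDFindDuplicates.sh.
list_duplicate_names() {
    root="$1"
    for disk in "$root"/disk*; do
        [ -d "$disk" ] || continue
        # Print each file's path relative to its own disk, one per line.
        (cd "$disk" && find . -type f | sed 's|^\./||')
    done | sort | uniq -d
}
```

On an unRAID box this would presumably be called as `list_duplicate_names /mnt`, printing one line per path that is present on two or more disks.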

 

I modelled this on the style of script that is used in cache-dirs.  I hope that Joe L. does not mind that I borrowed some of the coding techniques that he used.

 

The following shows the usage information built into the script.  Hopefully it is enough to get anyone interested started successfully.  I would recommend that you try the -v (verbose) option at least initially.

 

  Usage: ./unRAIDFindDuplicates.sh [-v] [-q] [-b] [-c] [-d exclude_disk] [-o filename] [-i dirname] [-e dirname] [-f|-F] [-z|-Z]

-b      =   If duplicate names found, do a binary compare of the files as well
             If omitted, then only a compare on file name is done
             NOTE.  Using this option slows things down A LOT as it needs to
                    read every byte of files whose names match to compare them
-c      =   Ignore case when checking names.
             This can help with the fact that linux is case sensitive on filenames,
             whereas Windows, and by implication Samba, is case independent. This
             can lead to unexpected filename collisions.
-d exclude_disk  (may be repeated as many times as desired)
             The default behavior is to include all disks in the checks.
             Use this to exclude a specific disk from being included in the checks
-D path     Treat the given path as if it was an array disk (may be repeated as many times as necessary).
             Can be useful to test if files on an extra disk already exist in the array.
-e exclude_dir  (may be repeated as many times as desired)
             Use this to exclude a particular share/directory from being included in the checks
-f          List any empty folders (directories) that are duplicates of a non-empty folder on
             another disk. These can be left behind when you remove duplicate files, but not
             their containing folder.  However empty folders are also possible in normal
             operation so finding these is not necessarily an issue.
-F          List any empty folders even if they are not duplicated on another drive.  This
             may be perfectly valid but at least this helps you decide if this is so.
-i include_dir  (may be repeated as many times as desired)
             Use this to include a particular share/directory to be included in the checks
             If omitted, then all top level folders on each disk (except for those
             specifically excluded via the -e option(s)) will be included in the checks
-o filename Specify the output filename to which a copy of the results should be sent.
             If omitted then the results are sent to the file duplicates.txt on the root
             of the flash drive e.g. /boot/duplicates.txt from linux
-q      =   Quiet mode.
             No console output while running.  Need to see results file for output.
-v      =   verbose mode.  Additional Details are produced as progress proceeds
-V      =   print program version
-x      =   Only report file mismatches on time/size (default) or content (if -b also used)
             Does not simply report the fact that there is a duplicate if they appear identical.
-X path =   Check the array against the given disk and report if files on the array are
             either missing or appear to be different size.  Use the -b option as well if you
             want the file contents checked as well.  Useful for checking whether you have files
             on a backup disk that are not also on the main array.  It is assumed that the path
             specified contains files in the same folder structure as is present on the array.
-z          Report zero length files that are also duplicates.   These are not necessarily
             wrong, but could be a remnant of some previous issue or failed copy
-Z          Report zero length files even when they are not duplicates.

EXAMPLES:

To check all shares on all disks except disk 9
  ./unRAIDFindDuplicates.sh -d 9

To check just the TV share
  ./unRAIDFindDuplicates.sh -i TV

To check all shares except the Archives share
  ./unRAIDFindDuplicates.sh  -e Archives

TIP:  This program runs much faster if all drives are spun up first
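
The -c option above exists because names that differ only in case look like one file over Samba. A minimal sketch of how such collisions could be spotted is shown here. It is an assumption about the general approach, not the script's own code, and `list_case_collisions` is a made-up name.

```shell
# Sketch only: list relative paths that collide when case is ignored.
# list_case_collisions is hypothetical, not part of unRAIDFindDuplicates.sh.
list_case_collisions() {
    root="$1"
    for disk in "$root"/disk*; do
        [ -d "$disk" ] || continue
        (cd "$disk" && find . -type f | sed 's|^\./||')
    done | sort -u |
    # sort -u has removed exact repeats, so any key seen twice below
    # comes from names that differ only in upper/lower case.
    awk '{ k = tolower($0); n[k]++; names[k] = names[k] "\n  " $0 }
         END { for (k in n) if (n[k] > 1) print k ":" names[k] }'
}
```

Each collision is printed as the lower-cased path followed by the differently-cased variants, indented underneath it.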

 

# CHANGE HISTORY
# ~~~~~~~~~~~~~~
# Version 1.0  09 Sep 2014  First Version	
# Version 1.1  10 Sep 2014  Got the -b option working to identify files where
#                           the names are the same but the contents differ.
#                           Added -q option to suppress all console output while running
#                           Added warning if file sizes differ
# Version 1.2  13 Sep 2014  Added the -D option to check extra disk
# Version 1.3  01 Oct 2014  Added -f and -F options to list empty (duplicated) directories.
#                           Added -z and -Z options to list zero length (duplicated) files.
#                           Fix:  Allow for shares that have spaces in their names.
# Version 1.4  07 Oct 2014  Fix:  Use process ID in /tmp filenames to allow multiple copies of
#                           the script to be run in parallel without interfering with each other.
# Version 1.5  07 Mar 2016  Fix:  Incorrect reporting of file size mismatches when sparse
#                           files involved.

 

If you find any issues with the script or have suggestions for improvement please let me know and I will see if I can incorporate the feedback.

 

So far I have only tested this on unRAID v6, but I cannot think of any reason it would not also work on v5.  If anyone uses it on v5 (successfully or not) then feedback on the results would be welcomed.

unRAIDFindDuplicates_v1.5.zip


 

Great ! 

 

My REISERFS - XFS conversion cycle has almost ended (98% of the last disk); after that I will start your tool to find my dupes..


Let me know what results you get.

 

It helped me get rid of duplicates that had crept onto my own system.


I just started it for my Movies share:

 

./unRAIDFindDuplicates.sh -i Movies -v

 

It throws one error directly after starting:

 

./unRAIDFindDuplicates.sh: line 153: [!: command not found

 

The tool appears to run after that though, and also finds results:

 

The first thing I notice is that it appears to be REALLY fast... It scanned my Movies folder (almost 7TB) in a minute. It found only a few dupes, which I have now deleted.

 

I also notice it finds duplicates between the array disks and the cache drive, since files only briefly reside on the cache drive I figure it might be better to exclude the cache drive ?

 

Now scanning series (10.1 TB). Appears to also run quick (also a lot more dupes ;-)

 

I stopped the scan and restarted rerouting the results to a file to analyse later, really a lot of dupes here..

 


Would it be possible to only list the duplicate entry instead of both? With that it would be extremely easy to use an editor to add "rm" in front of each line and simply delete the duplicate entries..


The tool absolutely works, but with a big list of files it is not really doable to work through it by hand (i.e. I am too lazy).

 

So I am taking a different route. I seem to have two disks (both 2TB) that contain the majority of the duplicates.

 

I also have an empty 4TB.

 

I will now move the first 2TB to the 4TB, then I will move the 2nd 2TB to the 4TB telling it to "replace when file size is different", that way I should have an unduped combination of the two drives on the 4TB. Whatever remains on the second 2TB are dupes and can be deleted..


I just started it for my Movies share:

 

./unRAIDFindDuplicates.sh -i Movies -v

 

It throws one error directly after starting:

 

./unRAIDFindDuplicates.sh: line 153: [!: command not found

 

Thanks for pointing that out.  Appears to be a missing space.  However it just means that a -d option is not validated properly so has minimal side-effect.
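
For anyone hitting the same message, this is the general shape of that kind of mistake. I do not know what line 153 of the script actually contains, so the helper below (`is_valid_disk`) is purely illustrative. The point is that `[` is a command, so without a space the shell goes looking for a command literally named `[!`.

```shell
# Sketch only: the class of typo behind "[!: command not found".
# is_valid_disk is a hypothetical helper, not code from the script.
is_valid_disk() {
    # Wrong:  if [! -d "$1" ]; then ...   (shell looks for a command "[!")
    # Right:  a space after "[" and before "]".
    if [ ! -d "$1" ]; then
        return 1
    fi
    return 0
}
```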

 

The tool appears to run after that though, and also finds results:

 

The first thing I notice is that it appears to be REALLY fast... It scanned my Movies folder (almost 7TB) in a minute. It found only a few dupes, which I have now deleted.

I must admit I was also surprised at first by how fast it ran if you have not used the -b option (which can REALLY slow things down), as it is working purely off directory information.  If you have cache_dirs running this is probably mostly already cached in memory.  I do not really expect the -b option would be used very often.  I added it to help me check whether file corruption appeared to be happening in light of the bug in v6 beta 7/8.

 

I also notice it finds duplicates between the array disks and the cache drive, since files only briefly reside on the cache drive I figure it might be better to exclude the cache drive ?

I thought about providing an option to exclude the cache drive.  My initial thinking was that if mover is not running then it probably makes little difference.  However since you mentioned it I will probably make including the cache drive optional.

 

Now scanning series (10.1 TB). Appears to also run quick (also a lot more dupes ;-)

Good - it is doing its job.

 

I stopped the scan and restarted rerouting the results to a file to analyse later, really a lot of dupes here..

It should by default write a 'duplicates.txt' file on the flash drive.  You can provide your own filename instead using the -o option.

 

As one of the first to use this in anger (besides myself) please continue to provide feedback.  I am particularly interested in knowing whether the amount of detail output appears about right (both with and without the -v option).  Tweaking that should be easy enough. 


The tool absolutely works, but with a big list of files it is not really doable to work through it by hand (i.e. I am too lazy).

 

So I am taking a different route. I seem to have two disks (both 2TB) that contain the majority of the duplicates.

 

I also have an empty 4TB.

 

I will now move the first 2TB to the 4TB, then I will move the 2nd 2TB to the 4TB telling it to "replace when file size is different", that way I should have an unduped combination of the two drives on the 4TB. Whatever remains on the second 2TB are dupes and can be deleted..

Fair enough.  At least once you have the output you can look at the scale of the problem and decide on an appropriate course of action.

 

I had thought to add an option to automatically delete duplicates, but certainly did not want it in an initial iteration of the tool.  Deleting data is dangerous as if something went wrong it would lead to data loss.  If I do decide to add it, I will enforce the binary check on such duplicates before doing a delete, and also make the user confirm it at tool start up.  However I am only going to add such an option if users think it would be useful enough and safe enough to do so.  At the moment I am considering that having the list of files is a good start.


I also notice it finds duplicates between the array disks and the cache drive, since files only briefly reside on the cache drive I figure it might be better to exclude the cache drive ?

 

This shouldn't be possible under the normal flow. When you replace a file that already exists, it overwrites the version in the array rather than writing to the cache drive. Only new files get put on the cache drive.

 

I'd verify which one is correct and take the appropriate action and remove the other. I know that if you have dupes in the same path on an array drive, UNRAID presents only the first one found working in numerical ascending order from drive 1 to drive x. Not sure if the cache drive gets plugged in last or first.


A script I wrote a while back to find duplicate files is described in and attached to this post:

http://lime-technology.com/forum/index.php?topic=7018.msg68073;topicseen#msg68073

It is a series of commands that create and use intermediate files on /mnt/disk1.  You can run one line at a time and examine the intermediate files if you wish.

 

It differs from the script in this post in that it does NOT rely on the file names to be identical.  It will find duplicate files even if you've renamed them.

 

It narrows down the files it needs to examine by eliminating files that cannot be duplicates by using this logic:

1. It first gets a list of all the files on your disks, and their lengths, and eliminates any file whose length is not the same as another file. 

    (A unique length indicates it must be a unique file)

2. Then, it generates the md5 checksum for the first 4Meg of every file where their length is not unique.

    (A unique MD5 checksum in the first 4 Meg would indicate a unique file, no need to verify by reading the remainder)

3. Then, it generates the md5 checksum on the entire file where the md5 checksum on the first 4Meg is not unique.

    (If the md5 on the first 4 Meg was not unique we need to read the remainder)

4. Lastly, it lists the files, grouping them where they have identical md5 checksums, (and identical contents), even though they may be in different folders, or even have different names.

The resulting list will look like this (each group is set of identical files, regardless of their names):

/mnt/disk4/Mp3/KC and the Sunshine Band/KC_The_Sunshine_Band_-_Shake_Your_Booty.mp3

/mnt/disk4/Mp3/KC and the Sunshine Band/Shake Your Booty-Earth Wind and Fire.mp3

 

/mnt/disk3/data/mg35/mg-kernel/mg-kernel/include/linux/i2c-id.h

/mnt/disk3/data/mg35/mg35tools/firmware/uClinux-2.4/include/linux/i2c-id.h

 

/mnt/disk1/Pictures/2009 - July 4th/IMG_1374.JPG

/mnt/disk1/Pictures/PictureFrame/DCIM/2009 - July 4th/IMG_1374.JPG

 

/mnt/disk3/data/mg35/mg-kernel/mg-kernel/fs/nls/nls_cp865.c

/mnt/disk3/data/mg35/mg35tools/firmware/uClinux-2.4/fs/nls/nls_cp865.c

 

/mnt/disk1/Movies/SD_VIDEO/PRG002/MOV0AB.avi

/mnt/disk1/Pictures/Movies/20080315_135748.MPG

/mnt/disk1/Pictures/PRG002/MOV0AB.MOD

 

/mnt/disk1/Movies/SD_VIDEO/PRG001/MOV024.avi

/mnt/disk3/data/USADanceShowcase-Sept2006/MOV024.MOD

/mnt/disk3/data/shared/JVC-VIDEO-CAMERA/SD_VIDEO/PRG001/MOV024.MOD

 

Once you get the list it will be up to you to decide which files in each set can be deleted.  (Make sure you leave at least one of them, otherwise for certain, you will not have another copy elsewhere.)
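
The four narrowing steps above (size, then checksum of the first 4 Meg, then full checksum, then grouping) can be sketched roughly as follows. This is not Joe L.'s script, which works through intermediate files on /mnt/disk1. The function names are made up, GNU `find` and `md5sum` are assumed, and file names containing tabs or newlines are not handled.

```shell
# Sketch only: size -> partial md5 -> full md5 narrowing.
# Function names are hypothetical; this is not Joe L.'s script.

# Keep only lines whose first tab-separated field occurs more than once.
keep_repeated_keys() {
    awk -F'\t' '{ n[$1]++; line[NR] = $0; key[NR] = $1 }
         END { for (i = 1; i <= NR; i++) if (n[key[i]] > 1) print line[i] }'
}

find_identical_files() {
    # Step 1: "size<TAB>path" for every file; drop files with a unique size.
    find "$@" -type f -printf '%s\t%p\n' | keep_repeated_keys |
    # Step 2: checksum the first 4 MiB; drop unique (size, checksum) pairs.
    while IFS="$(printf '\t')" read -r size path; do
        sum=$(head -c 4194304 "$path" | md5sum | cut -d' ' -f1)
        printf '%s:%s\t%s\n' "$size" "$sum" "$path"
    done | keep_repeated_keys |
    # Step 3: checksum the whole file for the remaining candidates.
    while IFS="$(printf '\t')" read -r key path; do
        sum=$(md5sum < "$path" | cut -d' ' -f1)
        printf '%s\t%s\n' "$sum" "$path"
    done | keep_repeated_keys |
    # Step 4: sort by checksum; print groups separated by a blank line.
    sort | awk -F'\t' 'NR > 1 && $1 != prev { print "" } { print $2; prev = $1 }'
}
```

Something like `find_identical_files /mnt/disk1 /mnt/disk3` would then print groups of paths in the same shape as the sample list above. Most of the saving comes from step 1, since files with a unique length are never read at all.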

 

Have fun.

 

Joe L.


Joe:  I have added a link to the post describing your utility to my opening post in this thread. 

 

I think your utility satisfies a slightly different need as it may be perfectly legitimate to have files on the system that are identical in content but have different names.  The case of identical content and identical names but on different drives is far more likely to be unintended.  My utility runs much faster than yours so they should both be of use depending on the exact problem you are trying to solve.


Joe:  I have added a link to the post describing your utility to my opening post in this thread. 

 

I think your utility satisfies a slightly different need as it may be perfectly legitimate to have files on the system that are identical in content but have different names.  The case of identical content and identical names but on different drives is far more likely to be unintended.  My utility runs much faster than yours so they should both be of use depending on the exact problem you are trying to solve.

They both have their place.

 

If, for example, you've made a copy of an mp3 file, and edited it in some way (added tags, or changed length, or whatever) and put the copy in a different directory, but with the exact same name, it will be found by your script, but not mine, since its checksum and size will be different.

 

My script will find exactly identical files on different disks or in different directories regardless if they are named the same name or different names.

 

users of your script should NOT just delete  duplicates until they verify which of the files is the one they wish to keep.  They should open each in turn to verify AND check their length.    More than once I've copied a movie file from one disk to another only to run out of space and have the target file truncated and end up much shorter than the original. 

 

It would be horrible if I deleted the original file and kept a similarly named, but corrupted file that was in a different directory found by your script.

 

Joe L.


They both have their place.

That is what I thought.

 

If, for example, you've made a copy of an mp3 file, and edited it in some way (added tags, or changed length, or whatever) and put the copy in a different directory, but with the exact same name, it will be found by your script, but not mine, since its checksum and size will be different.

That is one of the reasons my script will warn if file sizes are different.  If you use the -b option it will also check if contents are identical when file sizes match and warn if contents are different.
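
To make the order of those checks concrete, here is a minimal sketch of a size check followed by an optional byte-for-byte compare. It only illustrates the idea. `compare_copies` and the words it prints are hypothetical and are not the script's actual code or output.

```shell
# Sketch only: classify two same-named copies. The cheap size check comes
# first; the byte-for-byte compare (slow, reads both files) runs only on
# request. compare_copies is a hypothetical helper.
compare_copies() {
    a="$1"; b="$2"; binary="${3:-no}"
    size_a=$(wc -c < "$a")
    size_b=$(wc -c < "$b")
    if [ "$size_a" -ne "$size_b" ]; then
        echo "size-differs"
    elif [ "$binary" = "yes" ] && ! cmp -s "$a" "$b"; then
        echo "content-differs"
    else
        echo "looks-identical"
    fi
}
```

Without the third argument, two files of equal size are reported as `looks-identical` even if their contents differ, which is exactly why a name-and-size match alone should not be treated as proof that a copy is safe to delete.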

 

users of your script should NOT just delete  duplicates until they verify which of the files is the one they wish to keep.

It would be horrible if they deleted the original file and kept a similarly named, but corrupted file that was in a different directory.

I agree.  That is one reason the script currently just gives information about the copies for the user to make a manual selection, and adds a warning if it notices differences between copies.  At the moment the disks where the duplicates are located are only given if the -v option is used - perhaps this detail should always be listed?

 

If I ever add an auto-delete option (which I am very averse to doing) I would only ever delete a copy if they proved to be identical at the binary level.  However because of the risk of large scale data loss if something went wrong (e.g. if one reason there were duplicates was because you were moving files off a failing disk and it mattered which copy of apparently identical files was deleted) any sort of auto-delete would always be dangerous.

 

Joe L.


I would not create an auto-delete option..

 

However if, with the -v switch off, the list only contains the duplicate and not the original (i.e. only one version), then it would be very easy for every user to put "rm" before each line and create a little script.. That would however then be the user's own thing .. Very understandable if you do not want to put this in the tool..


thank you for this!  it was very helpful to me in finding and merging/deleting my duplicates after a disk problem.

 

It took me quite a while to open the different disks to move the files around, but after 3 or 4 runs of this utility, it says I've got them all now.  nice!

 

now I'm running Joe's script to see what I have that may have different names.  Fun stuff :)


This is a great script, can this be modified to report duplicated folder names as well?

I deliberately did not do that as it is quite normal to have duplicate folders if your share settings allow for this.

 

Having said that such a change is definitely a possibility.  I want to clearly understand the Use Case before looking at how something might be coded.  Some questions that occur to me:

  • Are you interested in duplicates even if they are allowed by the share settings?
  • Are you interested primarily in folders with no files.  I had assumed that when tidying up duplicate files you would also do any associated folders.    However I can see that empty folders could easily be missed.
  • Should this be done a separate pass so that folders are reported in their own section of any report?
  • Is there any other specific scenario you are thinking of?


What I would like to do is clean up my system so that each movie is on a single disk. Now I have some duplicated folder names left on other disks that I would like to remove :-)

 

//Peter


What I would like to do is clean up my system so that each movie is on a single disk. Now I have some duplicated folder names left on other disks that I would like to remove :-)

 

//Peter

Yes - but what are the criteria for identifying these?    Are they perhaps ones with no contents?   

 

On a normal unRAID system where users shares span multiple disks there are lots of duplicate folders.  Listing all duplicates is unlikely to be of much use because if no criteria are applied there will be so many the problem ones are likely to be hard to spot.


From what I see, the duplicate folders are empty.

 

//Peter

At the moment I have added an option to my test version of the script that shows any empty folders (regardless of whether they are duplicates) as that was easy and quick.  Interestingly enough it showed some unexpected empty folders on my own system.    I now need to add some additional logic to check if they are duplicates of other non-empty directories, which is most likely to be the case where one wants to tidy up.  On my system some of them were duplicates of non-empty folders and others were not (e.g. I have quite a few empty AUDIO_TS folders from DVD rips and this is valid).
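
The extra check described here (an empty folder that also exists, non-empty, on another disk) could look something like the sketch below. This is a guess at the logic, not the code that ended up in the script. `list_empty_duplicate_dirs` is a made-up name, GNU `find` is assumed, and disks are assumed to appear as `disk*` directories under one root.

```shell
# Sketch only: report empty directories that also exist, non-empty, on
# another disk. list_empty_duplicate_dirs is a hypothetical helper; the
# real script's logic may differ.
list_empty_duplicate_dirs() {
    root="$1"
    for disk in "$root"/disk*; do
        [ -d "$disk" ] || continue
        find "$disk" -mindepth 1 -type d -empty | while IFS= read -r dir; do
            rel="${dir#"$disk"/}"
            for other in "$root"/disk*; do
                [ "$other" = "$disk" ] && continue
                # The other copy is non-empty if ls -A prints anything.
                if [ -d "$other/$rel" ] && [ -n "$(ls -A "$other/$rel")" ]; then
                    echo "$dir"
                    break
                fi
            done
        done
    done
}
```

An empty folder with no counterpart elsewhere (such as an empty AUDIO_TS folder from a DVD rip) is deliberately not reported by this sketch.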

 

I am wondering if there is any value to having an option to report empty directories even when they are not duplicates - what do you think?

 

I think I will have an option with this functionality that I am prepared to upload by sometime tomorrow.


Good job, it doesn't hurt to delete empty folders that are unique, and it's good to know what can be removed.


I noticed lots of empty folders while going thru this process myself.

 

I used this to remove the empty directories; and it seemed to work fine...

 

find . -type d -exec rmdir {} + 2>/dev/null

 

I also tried this, which also seemed to work fine...

 

find . -type d -empty -delete


Getting 2 errors. Any way to fix this?

 

./unRAIDFindDuplicates.sh: line 153: [!: command not found

 

and

 

ls: cannot access /mnt/disk*//mnt/user/.........

 

Please see attached.

 

Duppie

errors.txt

