Duplicate files on unRAID



Hey All

 

Surely by now there should be an EASY solution to this, as it is caused by unRAID itself and not by user error.

 

So does LimeTech have a solution to their problem?

Or is there at least an easy way of finding dupes and getting rid of them?

I hope I am not opening a can of worms with this post, BUT there should be some kind of solution from the company by now, as this problem is as old as the product itself....

 

Thanks for any help

[And yes, I have done searches and found solutions, but there should be an official one from LimeTech by now...]

  • Like 1
Link to comment
7 minutes ago, abs0lut.zer0 said:

Surely by now there should be an EASY solution to this, as it is caused by unRAID itself and not by user error.

What do you mean, caused by unRAID? I have used unRAID for over 10 years and never had any dupe issues, nor do I remember reading on the forums about anything other than user error regarding dupes.

Link to comment

I do not see how unRAID itself would create duplicate files in normal use. In my experience this is always caused by the user manually copying files between drives. Do you have a scenario where unRAID itself manages to do this?

 

As has been mentioned, there are various user-supplied solutions to this problem. How easy they are to use depends on whether you are looking for files with identical paths but on different disks, or whether you are looking for files where the paths differ but the contents are identical.

Link to comment

I ONLY use /mnt/user to copy files to my unRAID box and the various shares I have under that mount point, so I was saying that it is not users creating the duplicates on purpose.

I am not trying to start a blame game here; I am just looking for an easy solution.

 

The Fix Common Problems plugin brought these duplicates to my attention, and I am seeing a lot of them.

 

 

 

Edited by abs0lut.zer0
Link to comment
4 minutes ago, abs0lut.zer0 said:

If that is true, please explain so we can avoid the duplication on our side...

Like I said, I have never had a single duplicate file, and I have over 300TB of data on unRAID servers. To know what you're doing wrong, we'd need more details on how and where you end up with duplicates.

Link to comment
1 minute ago, johnnie.black said:

Like I said, I have never had a single duplicate file, and I have over 300TB of data on unRAID servers. To know what you're doing wrong, we'd need more details on how and where you end up with duplicates.

As I have explained, it's not an isolated case, so I do not know how to replicate the problem unless I manually copy onto a disk share instead of a user share. I am just asking the community if anyone has an easier solution than Excel VBA or the scripts mentioned in the other posts.

 

So I will just monitor this post in case anyone else has some tips for this known problem.

Link to comment
16 minutes ago, abs0lut.zer0 said:

 

 

If that is true, please explain so we can avoid the duplication on our side...

The only way you can get duplicate files is if you work at the disk level (/mnt/diskX or /mnt/cache). Working at that level bypasses the user share level and so means you can create duplicates. If you work purely at the user share level you will not get duplicate files (defining duplicates in this case as files with the same path on different drives). If duplicate files do end up being created, it is almost universally caused by users moving files between drives and making some sort of error during the process. As an example, this can easily be caused by copying files from one disk to another and leaving the originals behind rather than moving them.
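Purely as an illustration (the share and file names below are made up), this is the kind of disk-level operation that ends up being flagged as duplicates, compared with a user-share operation that cannot cause it:

# copying between disk shares and leaving the source behind puts the same path on two disks
rsync -a /mnt/disk1/Media/Movies/ /mnt/disk2/Media/Movies/
# /mnt/user/Media still shows each file only once, but Fix Common Problems will report duplicates

# the same reorganisation done through the user share cannot create a same-path duplicate
mv /mnt/user/Media/Movies/example.mkv /mnt/user/Media/Archive/example.mkv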

Link to comment
1 minute ago, abs0lut.zer0 said:

Thanks a lot for that explanation, itimpi. I get it now, so I will just use the Fix Common Problems output, delete the duplicates at the file system level, and then monitor whether it happens again.

 

Thanks for all the help ....

 

Appreciated.

If you have a lot of duplicates then you may find it easier to get the list of duplicate files using the UnRAIDFindDuplicates script I created some years ago, when I managed to create a lot of duplicates while re-organising my disks. I know you said you preferred not to use a script, but if there are a lot of files this can be a fast way to get a list of them.
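As a rough illustration of the idea (this is not the actual script, just a simplified sketch), a one-liner along these lines lists relative paths that exist on more than one data disk:

# print any relative path that appears on two or more array disks
for d in /mnt/disk[0-9]*; do ( cd "$d" && find . -type f ); done | sort | uniq -d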

  • Like 1
Link to comment
2 minutes ago, itimpi said:

If you have a lot of duplicates then you may find it easier to get the list of duplicate files using the UnRAIDFindDuplicates script I created some years ago, when I managed to create a lot of duplicates while re-organising my disks. I know you said you preferred not to use a script, but if there are a lot of files this can be a fast way to get a list of them.

will do 

thank you

Link to comment

Keep in mind the semantics here can be important. You are stating that the files are duplicates, and if that is the case, then the solution is to delete one of the copies.

 

However... when a copy has been made to a different disk with the same path, those files are binary duplicates, but ONLY if no changes are made to the files.

 

Consider this scenario:

 

/mnt/disk1/folder/document.txt

/mnt/disk2/folder/document.txt

 

If you modify the file through /mnt/user/folder/document.txt, only the document on the first disk is changed, so the two documents are NO LONGER DUPLICATES. They will be reported as duplicate file names, but the contents are different. If you then "clean up duplicates" and delete the file on disk1, your changes will be lost, and the file that shows up at /mnt/user/folder/document.txt will be the original, not the modified copy.

 

Bottom line: you should probably use a clean-up method that employs full binary comparison to determine duplication, not just file name collisions.
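Using the hypothetical paths from the scenario above, a quick manual check before deleting either copy can be as simple as:

# the exit status of cmp -s tells you whether the two copies are byte-for-byte identical
cmp -s /mnt/disk1/folder/document.txt /mnt/disk2/folder/document.txt && echo "identical" || echo "contents differ"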

 

If you continue to have problems with duplicate files being created, you need to make SURE that none of your file creation or management tools reference /mnt/diskX or /mnt/cache, only /mnt/user paths when creating or moving your data.
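One quick way to audit running Docker containers for disk-level host paths is a generic Docker one-liner like this (it only lists candidates; some mappings to /mnt/cache may be intentional and simply worth double-checking):

# show container names whose volume mappings point at a specific disk or the cache pool
docker ps -q | xargs docker inspect --format '{{.Name}}: {{range .Mounts}}{{.Source}} {{end}}' | grep -E '/mnt/(disk[0-9]+|cache)/'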

  • Like 1
  • Upvote 1
Link to comment

The first step is a tool that finds duplicates by content - something that computes a hash for every file, then sorts the hashes and reports duplicates. If two files have the same name/path but are on different disks, then it should be safe to remove one of them. If they have different names, then you need to consider whether any specific program (such as Plex or another media player) may have remembered the path+name.
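A bare-bones sketch of that idea (it hashes every file on every data disk, so expect it to take a very long time on a full array):

# group files with identical MD5 hashes across all array disks
find /mnt/disk[0-9]* -type f -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate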

 

As a second step you can use a tool that searches for the same path/name. Because the first pass removed the files with identical binary content, you now know that you need to look at the actual content and change date of the remaining files to decide which to keep.

Link to comment

The script I linked to has an option to check whether files that have duplicate names also have duplicate contents. It slows down generating the duplicate list somewhat: for files that match on name it first checks whether their sizes are the same (which is fast) and, if they are, it then does a binary compare as well to confirm that their contents are equal. If the copies are different, this is then indicated.

 

Finding duplicates for files which have different file names but identical contents is possible using other tools, but will take much longer unless you already have file hashes generated. However, I got the impression that in this case we were looking at the same filename occurring on multiple disks (probably as a side-effect of a small misstep when re-organising data on the drives).

Link to comment
  • 4 months later...

I had the exact same issue, and I think I have the solution. Use the unBalance plugin and "Gather" all your versions of the file(s) to a different drive. This assumes that there is at least one drive in your array that an individual file is not copied to. I used a filtering setup in Excel and pasted in the output of the extended test in Fix Common Problems to determine which files were on which drives.

 

If you have 3 copies of a 1.5GB file, Gather will say that it will move 4.5GB of data, but in reality the other two copies will be skipped and the old versions removed.

 

Hope that helps

Link to comment
7 minutes ago, TheTinMan said:

If you have 3 copies of a 1.5GB file, Gather will say that it will move 4.5GB of data, but in reality the other two copies will be skipped and the old versions removed.

 

Does unBalance actually bother to decide which to keep or does it just keep whichever it encounters first (or last)?

  • Like 1
Link to comment

Hard to tell in my instance, as they are exact duplicates, but given that it's running

 

rsync -avPR -X

 

in the background, looking at those rsync options should tell you how it will behave. It seems to copy from each disk in numerical order, which makes sense, as that copy is the one visible to the user share, I believe.
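For reference, those switches mean the following (standard rsync behaviour, nothing unBalance-specific):

# -a  archive mode: recurse and preserve permissions, times, ownership and symlinks
# -v  verbose output
# -P  keep partially transferred files and show progress
# -R  use relative paths, so the source directory structure is recreated at the destination
# -X  preserve extended attributes
# by default rsync also skips a file that already exists at the destination with the same
# size and modification time, which matches the "copies get skipped" behaviour described above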

 

Your mileage may vary...

  • Like 1
Link to comment
  • 3 years later...

I know this reply is 3.5 years later, but I was searching for a solution to this problem.

CA Fix Common Problems found hundreds of duplicate files on two of my disks.

I never work at the disk level - only the share level, so I do not know how this happened.

I do a lot with MusicBrainz Picard - and a lot of the dups were FLAC files, so maybe that's an issue?

In any case, I needed a way to find and delete these duplicates.

I found the app Czkawka - and this allowed me to do a binary compare and then delete the duplicates. 

The program suffers from stack overflow errors if you try to compare too many files, but once I figured out the sweet spot, it's been easy to search and find these thousands of dups. 

I'll work on finding out the cause, but I thought I would post this workaround for fixing the problem without having to go through everything manually.

 

  • Thanks 1
  • Confused 1
Link to comment
  • 2 months later...
On 1/3/2022 at 5:04 PM, volcs0 said:

I know this reply is 3.5 years later, but I was searching for a solution to this problem.

CA Fix Common Problems found hundreds of duplicate files on two of my disks.

I never work at the disk level - only the share level, so I do not know how this happened.

I do a lot with MusicBrainz Picard - and a lot of the dups were FLAC files, so maybe that's an issue?

In any case, I needed a way to find and delete these duplicates.

I found the app Czkawka - and this allowed me to do a binary compare and then delete the duplicates. 

The program suffers from stack overflow errors if you try to compare too many files, but once I figured out the sweet spot, it's been easy to search and find these thousands of dups. 

I'll work on finding out the cause, but I thought I would post this workaround for fixing the problem without having to go through everything manually.

 

So, as I said before, this is a multi-user problem. Can there be some sort of effort to look into how to troubleshoot this, please?

I have seen it dismissed as not a "thing" because only a few users are affected, but this is still a problem.

Surely this should be looked into?

 

Link to comment

Duplicates are pretty hard to create under normal circumstances. You have to work with disk shares (or reference a disk share directly in, say, an app template - /mnt/cache/appdata).

 

E.g. the easiest way I can think of for one to be created is to have the appdata share set to "use cache: yes" and apps set to /mnt/cache/appdata. Mover comes running along and dutifully moves the file(s) that aren't actually in use to the array, but then the app goes and decides to recreate the file on the cache drive. Instant dupe, caused by a configuration mistake (referencing /mnt/cache directly while having the share set to use cache: yes).

 

Always referencing things by /mnt/user will never allow a dupe to be created (but any pre-existing will still remain). 
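As a concrete but hypothetical example of the difference for a container mapping (the container and image names here are made up):

# risky when the appdata share is set to "use cache: yes": mover plus a running app can leave copies on both cache and array
docker run -d --name myapp -v /mnt/cache/appdata/myapp:/config myimage
# safe: the path resolves through the user share, so this cannot create a duplicate
docker run -d --name myapp -v /mnt/user/appdata/myapp:/config myimage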

Link to comment
11 hours ago, Squid said:

Duplicates are pretty hard to create under normal circumstances. You have to work with disk shares (or reference a disk share directly in, say, an app template - /mnt/cache/appdata).

 

E.g. the easiest way I can think of for one to be created is to have the appdata share set to "use cache: yes" and apps set to /mnt/cache/appdata. Mover comes running along and dutifully moves the file(s) that aren't actually in use to the array, but then the app goes and decides to recreate the file on the cache drive. Instant dupe, caused by a configuration mistake (referencing /mnt/cache directly while having the share set to use cache: yes).

 

Always referencing things by /mnt/user will never allow a dupe to be created (but any pre-existing will still remain). 

Hey Squid,

Thanks for the answer. I have opened a thread here:

 

 

in the stable bug reports section to continue this.

 

Is it possible to lock and redirect this one, please? [My intention was not to duplicate the thread but to escalate it to the bug thread.]

 

Thank you

Edited by abs0lut.zer0
wrong url
Link to comment
43 minutes ago, abs0lut.zer0 said:

Is it possible to lock and redirect this one, please? [My intention was not to duplicate the thread but to escalate it to the bug thread.]

Not sure why you think this is a bug. As mentioned, it is easy to make this happen by working directly with the disks, and there is no clear, good way to prevent it in that case.

 

If a user creates a filepath on one disk, and the same filepath on another disk, of course these are duplicates from the point of view of user shares. Should the OS prevent the user from doing that? Should it spin up all disks (with the delays that causes) to make sure it doesn't happen, each and every time someone creates a file?

 

Or can you demonstrate that it happens when not working directly with the disks?

Link to comment
