Utility to report Duplicates


Helmonder


Duplicate file logging is no longer available in V6.

 

See thread:

 

http://lime-technology.com/forum/index.php?topic=35037.msg325868#msg325868

 

Tested in b8

 

That's a feature: it prevents the system log from blowing up when there are numerous duplicates. Duplicates are shown for each file (if they exist) when viewed via the 'folder' icon on the Shares page (that is, using the 'indexer' feature).

 

Granted, this is not a good way to find all the duplicates. What's needed is a utility to scan an entire share and produce a report.



Agreed... the option in the syslog was never really handy either. A little link to something on the Settings page (or possibly something that can be scheduled to run) would be a lot better.

Any command-line gurus out there who can think something up?


Can someone make sense of this:

 

http://duff.dreda.org/

 

DUFF is supposed to find duplicates and should run on Slackware. No idea how to get it working from the unRAID console though.

Unfortunately, we are not really talking about duplicates in the normal sense. We're actually talking about a naming collision that happens because of the way multiple disks can participate in a user share. /mnt/disk1/foldera/filea.txt will collide with /mnt/disk2/foldera/filea.txt. The combined view at /mnt/user/foldera/filea.txt will be the one on disk 1, and the file on disk 2 is basically invisible to the user share. The actual contents of the files are irrelevant: they may or may not be duplicates. So a traditional duplicate file finder is not very helpful. It can get really messy if you don't keep track of which applications are writing directly to the disks and which are using the user shares.

 

On the plus side, you can easily hide files if you wish by purposely naming them identically and keeping them on different disks. Anyone using the user share will only see the first copy.
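To make the collision rule concrete, here is a minimal Python sketch of which copy a user share would expose. It assumes, based only on the description above, that the copy on the lowest-numbered disk wins; the function names `disk_number` and `visible_copy` are made up for illustration and are not part of unRAID.

```python
import os
import re


def disk_number(disk_root):
    """Return N for a path ending in diskN, so disk10 sorts after disk2."""
    return int(re.search(r"disk(\d+)$", disk_root).group(1))


def visible_copy(rel_path, disk_roots):
    """Return (visible, hidden) copies of rel_path.

    Assumption: the copy on the lowest-numbered disk is the one the
    user share shows; every other copy is shadowed.
    """
    holders = [
        os.path.join(root, rel_path)
        for root in sorted(disk_roots, key=disk_number)
        if os.path.isfile(os.path.join(root, rel_path))
    ]
    if not holders:
        return None, []
    return holders[0], holders[1:]
```

Calling `visible_copy("foldera/filea.txt", ["/mnt/disk1", "/mnt/disk2"])` on a system where both disks hold that file would return the disk1 path as visible and the disk2 path in the hidden list.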



Mwaa.. that may be true, but I think the majority of users are using user shares, and most of them will sometimes do large disk-to-disk transfers using MC and/or a command-line copy. While doing that you run the risk of getting the same folder on more than one drive. This is what unRAID calls a duplicate.

A tool should be run on the /mnt/disk* shares (therefore excluding /mnt/user) and then show the duplicate files, where a duplicate means a match on the full pathname below the disk. So /mnt/disk1/movies/movie.mkv and /mnt/disk2/movies/movie.mkv would be a duplicate, while /mnt/disk1/movies/movie.mkv and /mnt/disk2/movies/alternate/movie.mkv would not be.

If the term "duplicate" is misleading, you could also call them orphaned files or something like that.

Either way they are important, since they are hogging space and you would never know it.

What I described would be the ideal way, but a simple, standard duplicate scanner would also help a lot. You would have to manually weed out some of the non-duplicates, but that would not be too much of a hassle.
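The scan described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the array disks are taken to be the `diskN` directories directly under `/mnt`, and `find_collisions` is a hypothetical name. It compares paths only and never looks at file contents.

```python
import os
import re
from collections import defaultdict


def find_collisions(mnt="/mnt"):
    """Map each relative path to the disks holding it, for paths on 2+ disks.

    Assumption: array disks are the directories named diskN directly
    under `mnt`; /mnt/user is skipped because it does not match diskN.
    """
    if not os.path.isdir(mnt):
        return {}
    disks = sorted(
        (d for d in os.listdir(mnt) if re.fullmatch(r"disk\d+", d)),
        key=lambda d: int(d[4:]),  # numeric order: disk2 before disk10
    )
    seen = defaultdict(list)
    for disk in disks:
        root = os.path.join(mnt, disk)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                rel = os.path.relpath(os.path.join(dirpath, name), root)
                seen[rel].append(disk)
    return {rel: held for rel, held in seen.items() if len(held) > 1}
```

Printing `sorted(find_collisions().items())` would give one line of report per colliding path. With the example above, movies/movie.mkv would be listed for disk1 and disk2, while movies/alternate/movie.mkv would not appear.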


While it will not find all the duplicates, I posted an Excel-based utility yesterday that helps.

 

http://lime-technology.com/forum/index.php?topic=33689.msg326408#msg326408

 

 

Updated my utility to V3 and am sharing it if anyone needs it. I added a rudimentary capability for detecting duplicate files. It only looks within the share being checked, and the files must have the same directory structure. No other checking is performed, i.e. I don’t look at the size or do any comparison other than the directory structure/filename. For example, suppose you had the same file on two disks as shown below.

These two files would be flagged as duplicates:

\disk3\Movies\Two Weeks Notice (2002)\Two Weeks Notice.mkv

\disk4\Movies\Two Weeks Notice (2002)\Two Weeks Notice.mkv

The following would not be flagged, as they have a different directory structure:

\disk3\Movies\2 Weeks Notice (2002)\Two Weeks Notice.mkv

\disk4\Movies\Two Weeks Notice (2002)\Two Weeks Notice.mkv

 

The duplicate listings are displayed on the tab “Share_File_Listing”.  You can filter on the error column to display only the duplicate
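The matching rule described in this post can be illustrated with a few lines of Python. This is a sketch of the rule as stated, not the Excel utility's actual logic; the helper names are hypothetical, and an exact, case-sensitive comparison is an assumption.

```python
def share_key(listing_path):
    """Drop the leading disk component; return the rest as a tuple."""
    parts = listing_path.strip("\\").split("\\")
    return tuple(parts[1:])


def flagged_as_duplicates(path_a, path_b):
    """True when both entries share the same folders and file name.

    Assumption: exact, case-sensitive match; size and content are ignored.
    """
    return share_key(path_a) == share_key(path_b)


same = flagged_as_duplicates(
    r"\disk3\Movies\Two Weeks Notice (2002)\Two Weeks Notice.mkv",
    r"\disk4\Movies\Two Weeks Notice (2002)\Two Weeks Notice.mkv",
)
different = flagged_as_duplicates(
    r"\disk3\Movies\2 Weeks Notice (2002)\Two Weeks Notice.mkv",
    r"\disk4\Movies\Two Weeks Notice (2002)\Two Weeks Notice.mkv",
)
print(same, different)  # prints: True False
```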


Maybe something like this:

 

<pre>
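<?php
// Report files that exist at the same relative path on more than one array disk.

// Return the path below /mnt/diskN/, e.g. "Movies/a.mkv" for "/mnt/disk3/Movies/a.mkv".
function relativePath($file) {
  // (.*) with the "s" flag keeps colons and newlines in file names intact.
  if (!preg_match('%^/mnt/disk\d+/(.*)$%s', $file, $matches)) return NULL;
  return $matches[1];
}

// Return the copies of $file found at the same relative path on the other disks.
function searchDuplicates($file) {
  global $disks;
  $out = array();
  $rel = relativePath($file);
  if ($rel === NULL) return $out;
  foreach ($disks as $disk) {
    $abs = "$disk/$rel";
    if (is_file($abs) && $abs != $file) {
      $out[] = $abs;
    }
  }
  return $out;
}

// Recursively walk $dir and print each group of colliding files once.
function listDir($dir) {
  global $duplicates;
  $dir = rtrim($dir, '\\/');
  if (!is_dir($dir)) return;
  $entries = scandir($dir);
  if ($entries === false) return;  // unreadable directory
  $files = array_diff($entries, array('..', '.'));
  if (!$files) return;
  natcasesort($files);
  foreach ($files as $f) {
    $path = "$dir/$f";
    if (is_dir($path)) {
      listDir($path);
    } else {
      $dups = searchDuplicates($path);
      if (count($dups) && !in_array($path, $duplicates, true)) {
        // escapeshellarg() keeps quotes and other special characters in names safe.
        echo "\n\nDuplicate:\n   " . shell_exec("ls -lah " . escapeshellarg($path));
        $duplicates[] = $path;
        foreach ($dups as $dup) {
          if (!in_array($dup, $duplicates, true)) {
            echo "   " . shell_exec("ls -lah " . escapeshellarg($dup));
            $duplicates[] = $dup;
          }
        }
      }
    }
  }
}

// Paths already reported, so each group is printed only once.
$duplicates = array();

// Collect the array disks (/mnt/disk1, /mnt/disk2, ...) in natural order.
$disks = array();
foreach (scandir("/mnt") as $entry) {
  if (preg_match('%^disk\d+$%', $entry)) $disks[] = "/mnt/$entry";
}
natsort($disks);

foreach ($disks as $disk) {
  echo "Scanning $disk:";
  listDir($disk);
  echo "\n\n";
}

?>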
</pre>

 

It seems to work. It's a port I just did of a Python script I wrote a long time ago.

 


Ehm.. I see this got moved to UNSCHEDULED. Did I miss something in the release notes of the previous version about this feature being dropped?

I honestly think it's a bit weird that something that was a feature now becomes an unscheduled new feature. And it's not something trivial either, at least I don't think so.
