Project: Duplicate file handling tool



Ok, copied from the other thread...

 

I split it across two lines for readability.  You can put it all on one line.

grep "duplicate object" /var/log/syslog | cut -d" " -f8- | 
    sed -e "s/^\/[^\/]*\/[^\/]*\/\(.*\)/ls -l \/*\/*\/'\1'/" | sort -u | sh -

 

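This should list the duplicate files that the user-share system found in parallel folders on the /mnt/disk?? shares.  It takes the path from each "duplicate object" log line, strips the first two directory levels (for example /mnt/disk3/), and runs ls -l on the remaining relative path under every two-level directory that contains it.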

 

If your syslog is HUGE, you may want to process only the tail end of it, like this:

tail -10000 /var/log/syslog | grep "duplicate object" | cut -d" " -f8- | 
    sed -e "s/^\/[^\/]*\/[^\/]*\/\(.*\)/ls -l \/*\/*\/'\1'/" | sort -u | sh -

 

The trick to regular expressions is all in knowing where to put the backslashes.
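
To illustrate the point about backslashes, here is a minimal sketch, not a replacement for the command above.  sed accepts a delimiter other than / for the s command, so using | means the slashes no longer need escaping.  The sample path is hypothetical and stands in for whatever cut extracts from your own log.

```shell
# Hypothetical path, standing in for what cut would extract from a log line.
sample="/mnt/disk3/Movies/Action/Terminator.mpg"

# Original form: / is the sed delimiter, so every literal slash is escaped.
with_slashes=$(printf '%s\n' "$sample" |
    sed -e "s/^\/[^\/]*\/[^\/]*\/\(.*\)/ls -l \/*\/*\/'\1'/")

# Same substitution with | as the delimiter; only the group needs backslashes.
with_bars=$(printf '%s\n' "$sample" |
    sed -e "s|^/[^/]*/[^/]*/\(.*\)|ls -l /*/*/'\1'|")

# Print the generated commands without running them.
printf '%s\n' "$with_slashes"
printf '%s\n' "$with_bars"
```

Both forms should print ls -l /*/*/'Movies/Action/Terminator.mpg', which is the command the full pipeline hands to sh.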

 

Joe L.

  • 3 months later...
  • 4 months later...

This topic originally began here, in the "Spin down timers - are they in HDD firmware or stored in slackware" thread.

 

The easiest way to identify duplicates is to install the UnMENU addon, and use its Dupe files plugin.

 

If you don't install UnMENU, the syslog gives you *some* information, enough to figure out where the duplicates are.  You can locate them manually: find a file listed in the syslog as a "duplicate object", note its drive and path, then search the syslog for further copies of the same path on other drives.  That gives you a list of every copy except the first.  You can assume the first copy has the same path and is on a drive numbered lower than the lowest drive listed in the syslog for that file.  An example:

 

  /mnt/disk2/Movies/Action/Terminator.mpg  (first one is never a duplicate, will not be in syslog)

  /mnt/disk3/Movies/Action/Terminator.mpg  (found in syslog as "duplicate object")

  /mnt/disk6/Movies/Action/Terminator.mpg  (found in syslog as "duplicate object")

 

The syslog will list Terminator.mpg as a duplicate twice, on Disk 3 and Disk 6.  From that you can conclude that there is a third copy with the same path as the others, and that it is on either Disk 1 or Disk 2.
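
The manual search described above can be sketched as a small shell function.  This is only an illustration: find_copies is a made-up name, and it assumes the data disks are mounted as disk1, disk2 and so on under a single directory (/mnt in the paths above).

```shell
# find_copies ROOT RELPATH
# Print every ROOT/disk*/RELPATH that exists, one per line.
# Hypothetical helper; ROOT would be /mnt for the layout shown above.
find_copies() {
    root=$1
    relpath=$2
    for disk in "$root"/disk*; do
        # Skip disks that do not hold this path.
        if [ -e "$disk/$relpath" ]; then
            printf '%s\n' "$disk/$relpath"
        fi
    done
}
```

Called as find_copies /mnt "Movies/Action/Terminator.mpg", it would print one line for each disk that holds the file, including the first copy that the syslog never mentions.  The disks come out in shell glob order (disk10 sorts before disk2), so check the numbers yourself to see which copy is on the lowest-numbered disk.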

  • 3 months later...

How does the duplicate handling work?  Does it just check file names, or does it do some kind of hash check?

It just checks the names.  If files with the same name are in parallel folders on different disks, only the file on the lowest-numbered disk is accessible in the user-share.  The others are logged to the syslog, but the log entries do not tell you where the first one is located, only the subsequent ones.  The script above in this thread finds the file with the same name in the parallel path on each of the disks.

 

The files can be completely different, or identical... It is up to you to figure out what to do with them.
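
Since only the names are compared, one way to check whether two copies really hold the same data is a byte-for-byte comparison with cmp.  A minimal sketch; same_content is a hypothetical helper name, not something provided by the server software or by UnMENU.

```shell
# same_content FILE_A FILE_B
# Report whether two same-named files match byte for byte.
same_content() {
    # cmp -s prints nothing; exit status 0 = same bytes, 1 = differ, >1 = trouble.
    cmp -s "$1" "$2" && status=0 || status=$?
    case $status in
        0) echo "identical" ;;
        1) echo "different" ;;
        *) echo "error"; return 2 ;;   # e.g. one of the files could not be read
    esac
}
```

For the example above, same_content /mnt/disk2/Movies/Action/Terminator.mpg /mnt/disk3/Movies/Action/Terminator.mpg would tell you whether the copy on Disk 3 is safe to treat as a true duplicate.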
