Utility to show duplicate directories/files & create rsync cmd for drive cleanup



In the past I wanted a utility that would show me how my data was distributed across the various disks on my Unraid server.  As I was never able to find one, I decided to build my own.  I used Excel VBA, as I can usually hack together code to achieve what I need.  This code could be more elegant; feel free to modify it to suit your needs.  I am sharing it as others may find it useful.

 

To use this workbook, you must enable macros in Excel.  All of the shares you want to work with, including disk shares, must be “Exported” via SMB and set to Public on the Unraid server.  You may need to spin up all of your drives to keep it from timing out.

 

This workbook takes a listing of all the directories on a top-level share on your Unraid server that you enter, and builds a list of all of the top-level directories.  It then cycles through all of the disk drives and builds a list of the top-level directories of the selected share on the appropriate disk tab.

 

Starting in column “C”, a formula is written that checks whether the directory to the left exists on the disk tabs.  If it does, it puts a “Y” in the cell under the appropriate disk.  There is no check other than that the directory exists; it can be empty.
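
If it helps to picture what that matrix represents, here is a rough server-side sketch of the same check (my own untested sketch; “Movies” is just an example share name, and the workbook itself does this over SMB from Excel, not with a shell script):

#!/bin/bash
# Print which disks hold a top-level directory for the share - roughly the "Y" matrix.
SHARE="Movies"          # example share name - substitute your own
for d in /mnt/disk{1..24}; do
    if [ -d "$d/$SHARE" ]; then
        echo "$SHARE exists on $d"    # the workbook would put a "Y" under this disk
    fi
done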

 

How to use this tool:

On the “Input Data” tab, enter the Unraid top-level share name starting in cell “B22”. If you need more cells, just insert a row(s) in the middle of the existing rows; this way the range will dynamically expand.  (Edit 2/1/2016: Don't do this.)

Enter the IP address of your Unraid server in cell “B14”.

 

Pick the share you want to work with from the drop-down list in cell “B15”.

 

Turn screen updating on/off in cell “B16”.  Setting it off (Yes) speeds up the macro.

 

Click the “Reload Data” button to run the macro.

 

Hopefully you find this workbook useful.

 

NOTE/Warning: I have noticed that the first time you run the Excel script that builds the Rsync sheet/commands, it has an error: when you select a destination drive, the directory it picks to move is one row off.  You should run the report a second time; it seems to straighten itself out after that.  After you have run the report a second time, save a copy of Indexer and use that version going forward.  I have not been able to find out why it does this.

EDIT NOTE.  I updated the utility to version 4 below and changed the Subject of this thread to reflect the new functionality.

http://lime-technology.com/forum/index.php?topic=33689.msg328265#msg328265

 

EDIT NOTE 10-31-2014.  I updated the utility to version 4.07 below.

http://lime-technology.com/forum/index.php?topic=33689.msg328265#msg328265

 

Indexer v 04.07.02 - https://drive.google.com/folderview?id=0B8FJEotSXLrxQllJdTBrVmxNd2c&usp=sharing  - UPDATED: 3-1-2015 Adds the “undist.sh” script to the zip.

 

Indexer v 04.07-02-06 - https://drive.google.com/file/d/0B8FJEotSXLrxcFptMmVCVTZYN1k/view?usp=sharing - Added formulas to dynamically display the disk with the largest utilization for each directory.  See this post: http://lime-technology.com/forum/index.php?topic=33689.msg415201#msg415201

 

Link to comment

Attached is a revised version of the workbook.

 

I added the capability to retrieve the size of the directory on each disk for the share.  You can select the unit of measure to display: GigaBytes, Gigabits, MegaBytes, Megabits, KiloBytes, Kilobits or Bytes, and you can switch to a different unit whenever you want.
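
For anyone who wants to sanity-check the numbers from the server console, GNU du can report the same directory in different units (this is just a manual cross-check I am suggesting, not what the workbook runs; “Movies” and disk1 are example names):

du -s --block-size=1G /mnt/disk1/Movies    # size in whole GiB (rounded up)
du -s --block-size=1M /mnt/disk1/Movies    # size in MiB
du -s --apparent-size --block-size=1 /mnt/disk1/Movies    # exact bytes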

 

You can also get a listing of all of the files in the share, either on a per-disk basis or for the entire share.

 

The data provided should give enough information to assist in performing disk cleanup/directory consolidation.  You can see the capacity used on a per-disk basis to help decide where to move the files.

 

If you use the workbook and would not mind posting a message in this thread saying so, I would appreciate it.  I am just curious whether anyone else finds it useful.

 

Link to comment
  • 1 month later...

Woohoo, just what I was looking for.  I noticed a stray directory on 2 drives and was curious to see if anything else had gotten split up. Turns out I had the split level wrong for one of my shares.

 

The only improvement I can see would be to add a configuration item for the number of drives to scan.  I only have 12 data drives, so there is a small delay (~12 seconds)  while it attempts to scan drives 13-24.

 

 

Thanks!

...Donovan

Link to comment

The only improvement I can see would be to add a configuration item for the number of drives to scan.  I only have 12 data drives, so there is a small delay (~12 seconds)  while it attempts to scan drives 13-24.

 

I actually thought of this when I was putting it together.  But if you do not assign the drives in order, how do you know which drives to scan?  For example, to spread the drives out in my cages, I added every other drive starting with drive 2, i.e. 2, 4, 6, etc.  Then I filled in the odd drives as I grew.  I did this for heat reasons.

 

In the end I just made it as large as Unraid can grow and decided to live with the extra time hit.

Link to comment

http://lime-technology.com/forum/index.php?topic=2689.msg27208#msg27208

 

Again I point out this is scrappy. However, here's a better version:

 

#!/usr/bin/bash

# Get a list of the disk mount points in use
VALUES=($(mount | awk '/disk[0-9]/ {print $3}'))

# Get the interesting bit of the folder this tool was called from
FOLDER=`pwd | sed 's/\/mnt\/user\(\/.*\)/\1/'`
echo "Folder: ${FOLDER}";

# Loop through the disks that are installed
for ((i=0; i<${#VALUES[@]}; i++))
do
    if [ -d "${VALUES[$i]}$FOLDER" ]
    then
        # Get the size on disk of the folder on this disk
        fld=($(du -sh "${VALUES[$i]}$FOLDER"));
        # Get the space free on this same disk
        dsk=($(df -h "${VALUES[$i]}" | sed -n '2p' | awk '{print $4}'));
        echo "Disk:${VALUES[$i]} Using:${fld} Free:${dsk}";
    fi
done

 

 

Example of usage:

 

root@TOWER [~] 10:22 AM>pwd

/root

root@TOWER [~] 10:22 AM>cd /mnt/user/Movies/

root@TOWER [Movies] 10:22 AM>/boot/scripts/undist.sh

Folder: /Movies

Disk:/mnt/disk7 Using:169G Free:115G

Disk:/mnt/disk8 Using:121G Free:149G

Disk:/mnt/disk3 Using:431G Free:25G

Disk:/mnt/disk15 Using:250G Free:41G

 

 

So I navigated to the folder I want to see the split on and ran the script.  It told me the Movies folder lives on 4 drives, how much of it is on each drive, and how much space each drive has free.

Link to comment
  • 1 month later...

Updated my utility to V3 and am sharing it if anyone needs it. I added a rudimentary capability for detecting duplicate files.  It only looks within the share being checked, and the files must have the same directory structure.  No other checking is performed, i.e. I don’t look at the size or do any comparison other than the directory structure/filename.

 

For example, these two files would be flagged as duplicates:

\disk3\Movies\Two Weeks Notice (2002)\Two Weeks Notice.mkv

\disk4\Movies\Two Weeks Notice (2002)\Two Weeks Notice.mkv

 

The following would not be flagged, as they have a different directory structure:

\disk3\Movies\2 Weeks Notice (2002)\Two Weeks Notice.mkv

\disk4\Movies\Two Weeks Notice (2002)\Two Weeks Notice.mkv
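
If you want to double-check from the server console, something like this one-liner should list the relative paths that exist on more than one disk, i.e. the same directory structure and filename (my own untested sketch, not what the workbook does; “Movies” is an example share):

find /mnt/disk*/Movies -type f | sed 's|^/mnt/disk[0-9]*||' | sort | uniq -d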

 

The duplicate listings are displayed on the tab “Share_File_Listing”.  You can filter on the error column to display only the duplicates.

 

I also cleaned up some other minor stuff.  Status messages are displayed in the lower left hand side of the window in the status bar so you can follow the progress.

 

As always, if you use this utility, please let me know.

 

Link to comment

Attached is version 4 of the Indexer script.

 

The files were too large to attach (they exceed the 192KB per-post limit), so I placed them on my Google Drive.

 

NOTE/Warning: I have noticed that the first time you run the Excel script that builds the Rsync sheet/commands, it has an error: when you select a destination drive, the directory it picks to move is one row off.  You should run the report a second time; it seems to straighten itself out after that.  After you have run the report a second time, save a copy of Indexer and use that version going forward.  I have not been able to find out why it does this.

Indexer v 04.07 -- UPDATED: 10-31-2014 Removed link 6/19/15.

 

Indexer v 04.07.02 - https://drive.google.com/folderview?id=0B8FJEotSXLrxQllJdTBrVmxNd2c&usp=sharing  - UPDATED: 3-1-2015 Adds the “undist.sh” script to the zip.

 

Example 1 Screenshot - https://drive.google.com/file/d/0B8FJEotSXLrxb2ZaNDBiMzctU2M/edit?usp=sharing

Example 2 Screenshot - https://drive.google.com/file/d/0B8FJEotSXLrxU3RtVkFzeGkxb00/edit?usp=sharing

 

I have made the final changes that were my original goal.  It can now assist you in cleaning up your drives by consolidating spread-out directories onto a single disk/directory.  Basically, you select the destination drive you want to move the data to, and an rsync command is built using the appropriate source/destination drives and directory.

 

It can find duplicate directories/files to allow you to clean them up.  You should clean up the duplicates before consolidating the directories.

 

The attached zip file contains the latest Indexer Excel workbook plus a script file.  I have used it with Excel 2007 and 2013.  The script is from NAS, posted earlier in this thread.  The workbook assumes it is at the following location: “/boot/config/my_scripts/undist.sh”.  You can change the contents of cell B56 on the “Input Data” tab to reflect a different path.

 

Setup:

Unzip the file and copy the “undist.sh” script to the appropriate location, “/boot/config/my_scripts/undist.sh”, on your server.  You must have macros enabled in Excel to use this workbook.  When you first open the Indexer workbook, it should open to the “Input Data” tab.  Set the values in the blue/grey cells appropriately for your server and save a copy of the workbook with your settings.  Example 1 shows a screenshot of this area.  Pick the appropriate selections in the green cells for the data you want to retrieve and press the Run Report button at the top to run the macros.  I believe the instructions are pretty self-explanatory.
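
For reference, placing the script from a telnet session might look something like this (the destination path is from the post above; the unzip location and the chmod are my assumptions, and chmod may be a no-op on the FAT-formatted flash drive):

mkdir -p /boot/config/my_scripts
cp /path/to/unzipped/undist.sh /boot/config/my_scripts/undist.sh
chmod +x /boot/config/my_scripts/undist.sh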

 

Note: The shares, including disk shares, on the Unraid server must be exported via SMB and Public.  You may need to spin up all of your drives to avoid a timeout.  I recommend you use the "cache_dirs" script if you do not already do so; it makes a major difference in the time required to retrieve the data.

 

If you open the attached Example 2 in a separate window, I think it will be easier to follow along.  I believe the best workflow is the following; you should be able to proceed at a pace where you wait for each rsync command to complete before you move the next directory.  Feel free to do it any way you want.

 

Step 0

Open two (2) telnet windows to your server along with the Indexer workbook, and arrange them similar to my example, with the two telnet windows on top and the Indexer below.  If you have not already done so, run the report with the “Build Rsync tab” field set to Yes.  You can run all of the reports if you want, but it takes longer.

 

After you have run the report, open the “Rsync Command Build” tab.

 

Step 1

Click the “cp “dsk chk”” button.  It copies the command that runs the undist.sh script with the watch utility.  Paste this command into the left-most telnet session and run it; as set up, it runs the script once every 60 seconds, and you can change the interval if you want.  This way you can watch your drive capacity as you move and/or consolidate files.  To stop the script, press Ctrl-C in the telnet window.
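
The post does not show the exact command the button copies, but based on the description it should be along these lines (the 60-second interval and the script path are from the steps above; the cd and the “Movies” share are my assumptions, since undist.sh works off the current directory):

cd /mnt/user/Movies && watch -n 60 /boot/config/my_scripts/undist.sh    # Ctrl-C stops it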

 

Step 2

Select cell A9 to change the value to either Rsync Copy or Rsync Dry Run.  The default every time you run a report is Dry Run; this option lets you test the command to ensure it is valid without actually moving any data.  The “Rsync Copy” setting builds a real command that will move your data.  In column “F”, use the dropdown to filter the list and uncheck “1”.  This hides any directories that currently occupy only a single drive.

 

Step 3

Scroll your window so the directory you want to work with is on the top-most row.  This helps ensure you don’t copy the wrong rsync command.

 

Step 4

Select the disk you want to use as the destination for all of the data moves in column G.  This builds the appropriate rsync command under each disk where the data resides.  In Example 2, you can see SOA has disk1 set as its destination drive; disks 3, 4 and 5 now show a green rsync command.  In column S, under disk5, you see a “0”.  This does not mean it is empty, merely that the contents are smaller than the unit of measure, which in this case is GigaBytes.

 

Note: I do not move anything off of the cache disk.  I leave that task for the “mover” script to handle.  The command to delete the empty directories does not include the cache drive.

 

Step 5

Right-click the cell containing “rsync” under the disk whose contents you want to move to the destination drive you selected in column G.  Copy the contents and paste them into the telnet window above; this runs the rsync command to move the data based on the options you have selected.  Assuming I wanted to move the data from disk4 to disk1, the command shown below would be generated.  For details on the options, refer to the man pages; you can change the options on the “Input Data” tab, cells B52 and B53.  Before you move any data, I suggest you try the dry run option to make sure there are no issues.

Note: 10-31-2014 - Version 4.07 stacks all of the Rsync commands for the target in a single cell.  Copy and paste that cell into the telnet window to run them all at once.

 

Example Commands:

rsync -avzh  --dry-run  --remove-source-files /mnt/disk4/TV_Shows/"Sons of Anarchy"  /mnt/disk1/TV_Shows

rsync -avzh  --remove-source-files /mnt/disk4/TV_Shows/"Sons of Anarchy"  /mnt/disk1/TV_Shows
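
With version 4.07, the single cell in column E would hold the per-source-disk commands stacked together.  A hypothetical stacked line for the same example (the disk numbers and the “;” separator are my assumptions; check what your copy actually generates) might look like:

rsync -avzh  --remove-source-files /mnt/disk3/TV_Shows/"Sons of Anarchy"  /mnt/disk1/TV_Shows ; rsync -avzh  --remove-source-files /mnt/disk4/TV_Shows/"Sons of Anarchy"  /mnt/disk1/TV_Shows ; rsync -avzh  --remove-source-files /mnt/disk5/TV_Shows/"Sons of Anarchy"  /mnt/disk1/TV_Shows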

 

Step 6

Go to Step 3 and repeat for each directory you want to move.

 

Step 7

Now we need to delete the empty directories left over from the rsync moves.  Click the “cp “dlt emt dir”” button and it copies a list of commands to the clipboard.  Paste them into the telnet window and hit Enter; it will then delete all of the empty directories.

While rsync will move, verify and delete the source files, it will not delete the now-empty directories.  To do that I am using the find command with the directory, empty and delete flags set.  I am also cheating a little: since you can stack multiple commands in a way that does not care whether the prior command was successful, that is the approach I took.  The command for all 24 drives is created with the commands stacked.  When you paste them into the telnet window and execute them, Linux will cycle through each of the drives and delete all of the empty directories in the share you were working with.

 

Example command for disk1-->  find /mnt/disk1/TV_Shows/ -type d -empty -delete ;
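
The full stacked command the button copies covers all 24 drives; trimmed to the first three disks, it would look something like the line below (my illustration of the stacking described above - if a disk does not have the share, find just reports an error and the next command still runs):

find /mnt/disk1/TV_Shows/ -type d -empty -delete ; find /mnt/disk2/TV_Shows/ -type d -empty -delete ; find /mnt/disk3/TV_Shows/ -type d -empty -delete ;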

 

Hopefully you will find the utility useful until such time as an official tool or a similar functionality is available directly on the unraid server.  Please let me know if you have any issues with it.  As always, it’s nice to hear if others find it useful.

 

 

 

Link to comment

Thanks for sharing your work!

 

It can find duplicate directories/files to allow you to clean them up.  You should clean up the duplicates before consolidating the directories.

 

As I have been experimenting with finding duplicates, I know this is not that easy.

In my case I rarely have true 1:1 dupes.

It happens far more often that I have slight differences in the names and you won't find them that way.

e.g. I merged the movie collection of a friend and he had a different naming scheme or different resolution - that's really challenging!

I played with scripts that calculate the Levenshtein-Distance and phonetic comparison.

Find enclosed the VBA code of Levenshtein, Soundex2, Kölner Phonetic (kp).

There are many more phonetic algorithms - each suits a different language.

 

If you want to improve this part of your script, this might be useful.

However, the user will finally have to decide about dupe or not.

Phonetik.zip

Link to comment

Duplicate files are an entire problem within themselves.  That is why my check is a simple direct check, i.e. driveX/path/file vs driveY/path/file.  I was really looking for cases where, in the past, someone may have used the copy command to copy files to another drive and did not delete the source files.

 

I have already found one change I need to make.  I want to color code the rsync cells based on whether you are in dry run vs copy mode. 

Link to comment
  • 1 month later...

Added version 04.07 above.

http://lime-technology.com/forum/index.php?topic=33689.msg328265#msg328265

 

Changes

Color coding on the Rsync tab based on whether you are in Dry Run vs Move mode.

Added column E, where the Rsync commands are now stacked so you can copy and paste all of the commands to move the files to the target directory at once.

Removed the "Compression" option from the Rsync command.  This just added extra overhead not needed in this application.

 

I hope you find the tool useful.

Link to comment
  • 5 months later...

I'm using version 04.07-02 with Excel 2010, macros enabled. I have entered my IP address and the names of the shares I want to check (I left some of them blank). When I run the report I get this error:

Error 1004 (PasteSpecial method of Range class failed) in procedure populate_rsync_command_build_sheet of Module rsync

 

Any thoughts?

 

Thanks, Richard

Link to comment

@RichardU,

 

I do not have Excel 2010, but I have tested with 2013 and 2007 and it works with both.  If you do not have a valid share selected, you will get that error.  Make sure you have entered valid shares in rows 25-31.  Then, using the drop-down in B15, pick the share to import.  It should not be the default "Clear the data"; that just clears my data before I post it and is not a real directory.

 

The grey cells are where you enter your data, and the green cells are pick lists built from your data.

 

 

 

 

 

Link to comment
  • 1 month later...

OK, maybe I'm doing something goofy, but my Input Data settings look just like yours did in your example screenshot in the zip file (other than my IP address and share names in B25 and below).

 

I "Ran" the report, it ran for several minutes and in the "Share_File_Listing" tab, I see all of my folders in my Movie Share (which is what I selected in the Pull down in Input_Data Cell B15.

 

But in my Rsync Command Build tab, I see only the first-level subfolders under my Movie user share (HD, DVD, Converted, Music, BluRay, etc.), but not any of the individual movie folders that I see in the Share_File_Listing tab...

 

So I guess my question is: how do I get it to go two folders deep in the Rsync tab so it will allow me to move \\TOWER\Movies\HD\MovieXYX instead of just \\TOWER\Movies\HD?

 

Link to comment

On the "Input Data" tab, in cells B25 to B32, change the directory structure to add the extra share level.  I just have "Movies", i.e. a top-level-only structure.  Try changing it to "Movies/HD"; make sure you get the slash correct.  I just gave it a quick check and it appears to build the command correctly.

 

You will need to insert rows between B25 and B32.  As named ranges are used, insert the rows in the middle to expand the range; do not insert rows at the first or last row position.  You would then have the following directories to pick in "B15":

 

Movies/HD

Movies/DVD

Movies/Converted

Movies/Music

Movies/BluRay

 

A quick test appears to work, but I have not used it to move a directory; the command it builds appears correct.  Make sure you check the command that is built for accuracy.  I would definitely try a dry run first.
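
For what it's worth, a hypothetical command for the two-level case, following the same pattern as the earlier examples (the disk numbers and the MovieXYX folder from the question above are placeholders), would be something like:

rsync -avzh  --dry-run  --remove-source-files /mnt/disk4/Movies/HD/"MovieXYX"  /mnt/disk1/Movies/HD

Compare that against what the workbook actually puts in the cell before running it.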

 

By editing B52 and B53, you can change the rsync command options if you want.

 

If the directory is spread across multiple drives, using the stacked command in column E will move all of the files to the destination drive.  Make sure you have enough free space on the destination drive.

 

Please let me know if this works for you.

 

NOTE:  By changing the sub-directories, the file paths built on the "Share_File_Listing" and "Diskn" tabs will be incorrect due to the slash used; it is a result of the way the path is built up.  If you want correct paths there, reverse the slash in the directory path, but then the rsync command will be incorrect.  I only built it to go one layer deep, as that is my structure.  You could go in and modify the VBA code if you want it to work correctly in both cases, or just do a search and replace on the "Share_File_Listing" and "Diskn" tabs, or enter both directory path styles on the "Input Data" tab and pick whichever suits the data you want.

 

Also, make sure you are using Indexer v 04.07.02 - https://drive.google.com/folderview?id=0B8FJEotSXLrxQllJdTBrVmxNd2c&usp=sharing

Link to comment

@SuperW2

 

Give the Indexer v 04.07-02-02-01c.xlsm version a try.  I added additional rows on the input tab for more shares.  You should not add rows yourself unless you do not care about the formatting of the sheet; I apply a master format to the tab to make sure there are no issues, and inserting/deleting rows interferes with that.

 

You can now add a single subdirectory to the top level share and it will present the data correctly, as in:

Movies\HD

Movies\DVD

Movies\Converted

Movies\Music

Movies\BluRay

 

This is a quick fix; I still need to add a fix that can support multiple levels.  If you want to go deeper, you need to enter the share with the full path, for example "Movies/DVD/test1".  This will build the rsync command correctly.

 

 

https://drive.google.com/folderview?id=0B8FJEotSXLrxQllJdTBrVmxNd2c&usp=sharing

Link to comment

Ok, give this version a try, Indexer v 04.07-02-02-01d.xlsm.zip.  It should support multiple share levels.

 

Also, there is an error the first time you run it after running a share that does not exist, where the rsync command is one row off for the directory that you want to move.  The fix is to run it a second time; this fixes the output.  I have never found out why it does that.  The share "clear the data" is how I clear my data and does not really exist on my system.

 

So the way to use it is to enter your server IP and your shares, select one of your shares, and run the tool twice with all options set to Yes; it should be fine after that.

 

https://drive.google.com/folderview?id=0B8FJEotSXLrxQllJdTBrVmxNd2c&usp=sharing

 

Let me know how it works for you.

 

WARNING:  Don't use the "cp "dlt emt dir"" button.  The command is incorrect if you use subdirectories.  I will have to fix that.

Link to comment

Thanks so much... this new version seemed to do the trick... I'm using the 01e.zip and it's working great with the one share level down \Movies\HD, etc, which is really as far as I need to go for my use case.

 

Whatever you did seemed to speed up the report build process substantially too.  The first few times I ran it, it was taking 30-40 minutes (and in fact I got an out-of-memory error in my Excel 2013 and moved to the 64-bit version just in case).  That same share report build is just a few minutes now (6-ish), although on 64-bit Excel.

 

One more quick question that I just don't quite get yet: when duplicates are detected, as indicated by "Duplicates found" in cell B2 of the "Share_File_Listing" worksheet, how exactly do I locate the duplicates that it found?  (There are over 100K rows for just the one share on that tab.)

 

Thanks a bunch!

SW2

 

 

Link to comment

By setting these options to either Yes or No, you can run different reports.  By default, I set them all to Yes, which is what can take so long.  If all you want is the rsync report, set the others to "No" and set the rsync option to Yes.  These are all individual reports, though some depend on others for data; if another report is also needed, the script will automatically turn the additional reports on even if they are set to No.  If you don't use cache_dirs, it can take a long time to run; it makes a huge difference in retrieving the data.  Nothing I did should have sped it up, I think.  I did pre-compile the VBA code before I saved it, which could have made a little bit of difference, but not much I would think.  Most likely it was going to the 64-bit version.  Both versions of Excel I have are 64-bit, so I do not know how it runs under the 32-bit version.

 

Get directory size also:  ---->

Check for duplicate files:  ---->

Build Rsync tab :  ---->

Get a listing of the files by share & disk:  ---->

Get a listing of the  directories for the share & disk:  ---->

 

Leave the "Turn screen updating on or off while running macro..    on = No / off = Yes:  ---->" as yes.  Otherwise, it updates the screen as it runs and it will really slow the system down.

 

To see the duplicates, if they exist, enable the duplicates report.  After it runs, go to the "Share_File_Listing" tab and look in column B.  I would apply a filter to the column so you see only the rows set to Yes.  This is a rudimentary duplicate check; specifically, my check is a simple direct check, i.e. driveX/path/file vs driveY/path/file.  I was really looking for cases where, in the past, someone may have used the copy command to copy files to another drive and did not delete the source files.  There are other utilities that may be a better solution for finding duplicates.

 

 

Edit: Now that I think about it, by going a layer lower you have significantly reduced the amount of data you have to retrieve.  I am sure that is why it ran faster.

Link to comment
  • 3 weeks later...

This is really cool. I love the matrix view to understand which folders are on which drives. I am trying to clean up my server and arrange things a bit better now that I have a few larger drives than I had a few years ago.

 

I would love to see this type of functionality built into unRAID. I think this thread should be moved to where it will get higher visibility.

Link to comment
