[DOCKER CONTAINER] DUC - Disk Usage Charts (and duplicate file finding!)



Over the holidays, I built my first unRAID server and set up the CrashPlan docker container. I was interested in learning more about docker AND wanted to interactively browse my array for organization and to reduce duplicates.

 

I stumbled upon DUC which creates awesome interactive Disk Usage Charts like this:

 

example.png

 

I've created a DUC (and Apache) container based on the suggested Phusion baseimage, and created a template repository and template so that you can easily install it in unRAID.

 

I'm looking for others to test my container and provide feedback. I'm still working to reduce its size and to add more features. (I've forked DUC, and will do a pull request for my changes once they're complete.)

 

PLEASE CONSIDER THIS EXPERIMENTAL AND A WORK IN PROGRESS.

 

My template repo:              https://github.com/digitalman2112/docker-templates

(Update: the above URL didn't work for someone, but the following did work: https://github.com/digitalman2112/docker-templates/tree/master/digitalman2112)

 

My duc-docker build repo:  https://github.com/digitalman2112/duc-docker

My DUC fork repo:              https://github.com/digitalman2112/duc

My Docker Hub page:          https://registry.hub.docker.com/u/digitalman2112/duc/

 

When installing into unRAID, map your array (or a portion of it) to /data (this is the default if you use the docker tab and the template). It will default to READ ONLY, as the container has no need for write access to the data just to index it.

 

You also need to map port 80 from the docker container to a host port. The template defaults this to 2112. (updated port based on feedback)
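
Putting the volume and port mappings together, here's a minimal sketch of the equivalent docker run command (the image name comes from my Docker Hub page above; DUC is the container name the template uses, and /mnt/user is just one example of a host path to index):

docker run -d --name DUC \
  -v /mnt/user:/data:ro \
  -p 2112:80 \
  digitalman2112/duc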

 

Once started, you can access the DUC web page with interactive charts via the Web UI menu option for the running container. This works if you leave the port option set to 2112; otherwise, remap it as you see fit and visit http://<container_ip>:80/cgi-bin/duc.cgi

 

You will need to start an index operation by visiting the web interface and clicking Reindex (be patient; it doesn't update the page while doing the reindex...yet). You can add additional indexes, or try my new duplicate file utility in duc, by starting bash in the running container with docker exec. If you do this, look at the duc command-line options and be sure to specify the database location like this: -d /duc/duc.db. Update: this is no longer required, as I changed the HOME dir for the container to /duc and it will automatically place .duc.db there now :)
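
For example, here's a minimal sketch of dropping into the container and poking at the index (this assumes the container is named DUC, as the template sets it; info and ls are standard duc subcommands):

docker exec -it DUC /bin/bash
# inside the container - duc now finds /duc/.duc.db on its own:
duc info
duc ls /data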

 

Given that the container has access to your data, I strongly suggest keeping it internal to your network.

 

I'll be adding the duplicate file functions to the CGI in the coming weeks to avoid the need for command-line use; if you are really interested, ping me and I'll get you started with it. The matching is based on file attributes (name, extension, size), NOT a hash / CRC match, so it is very fast, but you still need to validate that the files are actually duplicates. I also have a mode that finds duplicate FOLDERS - which is extremely handy for finding mass sets of duplicate photos.
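
If you're curious what attribute matching means in practice, here is a rough shell analogue of the name + size idea (this is NOT duc's code - duc dup works from its index, while this one-liner scans the disk directly, so it's only a sketch of the concept):

# print basename and size for every file, then show combos appearing more than once
find /data -type f -printf '%f\t%s\n' | sort | uniq -d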

 

Any & all feedback welcomed as I'm new to unRAID, new to docker, and new to duc - but willing to learn :)

 

Ian

 

(Edits due to build updates)

Link to comment

Any particular reason you decided on defaulting to port 8080? That port is already used by a very popular addon with a long history here. Search for the unMenu thread for more details.

 

Total ignorance on my part. I'll change it to something else and update the template later tonight. Thanks for the tip!

Link to comment

You shouldn't have to rebuild it; we can change the host port when we set it up, or you can just change your XML.

 

Yeah, I'm a little slow to remember all the moving parts. Much easier to just change the xml.

 

One small thing... most of us have updated to phusion *.15

 

Ok, THAT will be a rebuild. :)

Link to comment

Updates based on feedback:

 

1) Updated to phusion *.15

2) Default port mapping changed to 2112

3) Removed the auto-indexing on startup. I'd added that before I added the ability to trigger an index from the website, and you may not want a big index job at startup.

4) Updated template xml description with some getting-started info.

5) Redirected from the webroot to the CGI script to save a little typing & help new users find it more easily

6) Combined some commands in the Dockerfile to reduce # of layers (and removed some old commented-out commands)

 

 

Also: I moved the icons to imgur as I saw in another template - but the icons still aren't working for some reason... Are others seeing the icons?

 

 

Link to comment

If anyone is interested in the duplicate file finding functions I've been adding (still a work in progress, but tests out ok on my data) - here's a screenshot to give you an idea of what it does:

 

76q7koQ.png

 

duc has a number of command-line utils; in this case I'm calling duc dup (which I added) with the following options (a ready-to-run example follows the list):

  • --database to specify the location of the duc index database (in my container it is at /duc/.duc.db) (NOTE: As of 1/9/2015, you no longer need to specify the database - and the location shown in the image is now incorrect.)
  • --megabytes to specify the minimum file or folder size to use for comparisons. I like to work on the biggest items first, and reduce the noise. This also makes it run extremely fast.
  • -f for folderscan (only compare folders, not files).
  • the path to scan, in this case /data. Note that this has to be an indexed path. In my container, we map external data to /data. You can specify a subpath of that index, or if you've used duc index at the command line to index another path, you can specify that. The key is to remember that the duc dup command works on the EXISTING DUC INDEX; it does not read the disk directly.
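
Putting those together, the scan in the screenshot looks roughly like this on a post-1/9/2015 build (no --database flag needed; the 100MB threshold is just an example):

duc dup --megabytes=100 -f /data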

 

Note that this scan took 0.16 seconds  ;D The same scan with --megabytes=1 returns 539 matches, and takes 0.44 secs.

 

All candidate matches are returned as a row (I left a few examples in the screenshot), and then a summary table is listed below. If you enable other match types, you will see the match type on the left - and a summary of matches by those types at the end of the scan.

Match types are:

  • Name + Size: always enabled
  • Extension + Size: enable with -e (not valid with folderscan -f)
  • Size: enable with -s
  • Name: enable with -n

 

You can also enable case insensitivity with -i.
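
For an illustrative file-level scan with every extra match type enabled (note that -f is dropped, since -e is not valid with folderscan):

duc dup --megabytes=10 -e -s -n -i /data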

 

At some point, I'll add this functionality to the web interface for DUC so that you don't need to docker exec into the container to run dup - but until that time, it is available on the command line.

 

Until then - here's how I run it (a complete session sketch follows the steps).

 

1) SSH into your unRAID box.

2) Start a bash shell in the DUC container using: docker exec -it DUC /bin/bash (this assumes you've used the template and the container is called DUC - you can see the name on the unRAID docker tab).

3) Use the command-line instructions above to run dup scans - on older builds, don't forget that database option or you will likely get a "Database corrupt and not usable" message.

4) If you remove duplicates, don't forget you'll need to rerun the index command (on the command line, or in the web interface) to get updated results. To reindex /data from the command line in the container bash shell, you'd use the command duc index /data

5) Exit the bash shell in the container when you are done, then close the SSH session to your unRAID server.
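
End to end, the whole session might look like this (the hostname is illustrative - tower is unRAID's default - and so is the size threshold):

ssh root@tower                     # step 1
docker exec -it DUC /bin/bash      # step 2: shell into the container
duc dup --megabytes=100 -f /data   # step 3: folder-level duplicate scan
duc index /data                    # step 4: reindex after any cleanup
exit                               # step 5: leave the container shell
exit                               #         ...then close the SSH session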

 

NOTE: This duplicate file finding does NOT use a CRC / Hash to compare files. It is returning duplicate CANDIDATES, and you need to personally validate that they are duplicates!
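
A quick way to validate a candidate pair is to hash both files and compare the output (the paths here are hypothetical - substitute two candidates from your own dup results):

md5sum "/data/Photos/2014/IMG_0001.jpg" "/data/Backup/IMG_0001.jpg"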

 

Please alert me if something doesn't work, or if these instructions need to be improved :)

 

If you want help getting started, just ping me.

 

 

Link to comment

Updated container (and xml template) published.

 

New parameters that can be set when installing the container:

  • -m or --maxlevels  Max # of levels shown in chart (web) - defaults to 5
  • -p or --pixels  Size of the chart (web) in pixels - defaults to 1000px
  • -l or --list  Include a directory listing with the chart (web) on / off - defaults to on
  • -i or --index  Ability to trigger an index operation from the web page can be turned off - defaults to on

 

These are specified TOGETHER, as one environment variable named DUC_CGI_OPTIONS. For example:
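
Here's a sketch of setting it at container creation (the exact option string is illustrative - any combination of the flags above should work - and the volume/port mappings are the defaults from earlier in the thread):

docker run -d --name DUC \
  -v /mnt/user:/data:ro \
  -p 2112:80 \
  -e DUC_CGI_OPTIONS="--maxlevels 7 --pixels 1200" \
  digitalman2112/duc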

 

http://i.imgur.com/lUB0ibR.png

 

 

Other changes:

  • Web pages always show the full path now (before, if you drilled in, the URL bar showed the previous path plus some numbers)
  • The indexing page now sends back some info and asks you to be patient - at least you know it is working
  • Fixed a pointer error that was showing up in the apache logs
     

 

Link to comment

I mapped /data to /mnt/user to get all my user shares; is that the basic way?

 

It loads, but just sits at /data with 0 0 0  listed (after hitting reindex).

 

Otherwise, maybe I'm not waiting long enough. What's the "time to index" for something like 6TB? Like 10 mins, or 2 hours, or what?

 

Looking forward to looking it over, and really want to hit up the duplicates feature to see if my "dumping ground" folder has gone crazy or not.

Link to comment


It will take a while. I never timed mine, but with 14TB I'm pretty sure it was only a few minutes.
Link to comment

This operation can take minutes on large paths, please be patient.

 

The indexing should continue even if this window is closed, however there will be no notification.

 

Indexed 21005 files and 1752 directories, (265.8GB total) in 01 minutes, and 31.77 seconds.

 

 

Hmm, I guess I need more CPU power / faster HDDs? :(

Link to comment

Started my docker 120 minutes ago and referenced /mnt/user, and I still cannot see anything - but I have about 37TB of data, so I assume this is expected? Is there a log I can check to see if it's still running? After 6 hours it still is not displaying anything. Now it has been over 9 hrs and still nothing is displayed, so something must have gone wrong somewhere.

Link to comment
