Disk Contents Tracker?

June 24, 2010

What I mean is: is there anything that keeps track of which files and folders are on each disk?

I'm thinking of what happens if disk 1 dies and you want to know what was on that disk. What do you do?

June 24, 2010

If only disk1 dies (or any one disk), it is easy, you type:

find /mnt/disk1 -ls

The entire drive and its contents are virtualized by parity in combination with the other drives. You can still list, read, and even write the files on the failed drive. You don't have to guess, because you can still get to the files.

If two drives fail at the same time, parity can no longer stand in for them, so your best bet is to have run the following beforehand:

find /mnt/disk1 -ls >/boot/disk1_contents.txt

find /mnt/disk2 -ls >/boot/disk2_contents.txt

find /mnt/disk3 -ls >/boot/disk3_contents.txt

find /mnt/disk4 -ls >/boot/disk4_contents.txt

find /mnt/disk5 -ls >/boot/disk5_contents.txt

find /mnt/disk6 -ls >/boot/disk6_contents.txt

etc...

That way, on your flash drive will be files listing the contents of your data drives.

Joe L.

June 24, 2010

Author

Thanks for the reply...

Going to turn it into a script which runs every night!

June 24, 2010

Thanks for the reply...

Going to turn it into a script which runs every night!

The danger there is that one day you accidentally delete all your files without realizing it, and that night the file list is automatically regenerated as "empty," overwriting the list of contents that used to be there. Now you have deleted the files and also overwritten the list of what you used to have.
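One way to soften that risk is to write the new listing to a temporary file first and refuse to replace a non-empty list with an empty one. This is only a minimal sketch, assuming GNU find; the function name snapshot_disk and the .prev suffix are made up for illustration.

```sh
#!/bin/sh
# snapshot_disk DISK_DIR OUT_FILE   (hypothetical helper, not an unRAID tool)
# Writes a fresh "find -ls" listing, keeps the previous one as OUT_FILE.prev,
# and refuses to replace a non-empty listing with an empty one.
snapshot_disk() {
    disk=$1
    out=$2
    tmp="$out.tmp.$$"

    # -mindepth 1 leaves out the mount point itself, so an emptied disk
    # really does produce an empty listing. If find fails, keep the old list.
    find "$disk" -mindepth 1 -ls >"$tmp" || { rm -f "$tmp"; return 1; }

    if [ ! -s "$tmp" ] && [ -s "$out" ]; then
        echo "WARNING: $disk looks empty; keeping existing $out" >&2
        rm -f "$tmp"
        return 2
    fi

    # Keep one previous generation before replacing the listing.
    [ -f "$out" ] && cp "$out" "$out.prev"
    mv "$tmp" "$out"
}
```

A nightly script could then call snapshot_disk /mnt/disk1 /boot/disk1_contents.txt for each disk. It does not protect against a partial deletion, which is why one older copy is kept alongside.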

Joe L.

June 24, 2010

Joe L, I think he meant it mainly for the failed-disk scenario.

June 24, 2010

Author

Joe L, I think he meant it mainly for the failed-disk scenario.

Correct... lol

Love Joe's enthusiasm and knowledge!

June 24, 2010

I've been working on something like this for a little while.

It's sort of a modified slocate/mlocate & updatedb tool.

I frequently use locate to find files on disks, but now I wanted to track when the files changed and/or the md5sum of these files.

I think it's a good way to double-check for missing files and/or corruption.

I'm playing with SQLite to store the data.

When I'm done I'll publish it.

The only problem I've been having is unRAID running out of memory during the find down the whole disk. I had to set vm.vfs_cache_pressure higher than the default or processes would randomly get killed.

June 24, 2010

I had an idea once along those lines, for when you have some disk problem and recover a bunch of files and directories into lost+found: you could use slocate or a similar database to put the recovered stuff back where it belonged.

I was working on a big case, and the drives had been wiped clean as a whistle. But an old hard drive was found at the perp's location, that had been removed from the system a few months before because it crashed. I ran fsck and fixed all the corruption on the image, but had a lot of stuff in lost+found. Because of the nature of the case, I needed to know where the lost files belonged. A utility that could use directory contents for lost/recovered directories would have been useful.

June 24, 2010

A utility that could use directory contents for lost/recovered directories would have been useful.

I had not even thought of it that way. Now that I think of it, using the md5sum would have been a way to track what the original name was and even rename the file.
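To illustrate the idea with plain tools, here is a rough sketch that records an md5sum manifest ahead of time and later looks up a recovered file by its content hash. It assumes GNU md5sum in its default text-mode output ("hash, two spaces, path") and file names without newlines or backslashes; build_manifest and original_name are hypothetical names, not part of the tool being written.

```sh
#!/bin/sh
# build_manifest DIR MANIFEST
# Records "md5  path" for every regular file under DIR (md5sum's own format).
build_manifest() {
    find "$1" -type f -exec md5sum {} + >"$2"
}

# original_name FILE MANIFEST
# Prints the recorded path(s) whose content hash matches FILE.
original_name() {
    sum=$(md5sum "$1" | cut -d' ' -f1)
    # Manifest lines are 32 hex characters, two spaces, then the path,
    # so the path starts at column 35.
    grep "^$sum " "$2" | cut -c35-
}
```

For example, original_name "/mnt/disk1/lost+found/#12345" /boot/disk1.md5 would print where that content used to live, provided the manifest was built before the failure.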

I'll update the priority on it. Originally it was just a way to tell if you had corrupt files after an accidental format, now I see the value in it.

At one point I used Joe's methodology of linking the file to the md5sum value (while also creating an md5sum file at the same time). The next step I started was doing a par2 on all of the md5linked names. The directory just became too large. The par2 kept getting killed with an OOM error. I'll probably revisit that later.

There's more value here than I thought.

With SQLite, after the DB load of filenames, the filename scan was fast enough to be comparable to the standard locate.

Only thing was my sqlite db ended up being 120M.

June 25, 2010

I haven't done much coding in Linux or BSD for a long time, but I used to use a Today date variable a lot when I created directories, since, well, every day is a new day. I don't know all the locations in Slackware, and maybe somebody else can make this technically correct, but I used to use a script that would create a folder named after today's date and then copy everything into that folder. Then I would tar it up and keep it archived. This one doesn't do that, however; it just adds today's date to the front of each file name so you can find it easily. Maybe when I'm up and running I'll tinker with it myself, but my hardware isn't all here yet.

Here's something really basic, and of course I ran it from cron so I wouldn't have to worry about it. I'll probably do something similar, but like I said, I don't know all the directory paths in Slackware, because I used to use BSD and more or less just pieced this together today from what Joe L typed.

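#!/bin/sh

# Date stamp used as a file-name prefix, e.g. 2010-06-25
Today="$(date +%Y-%m-%d)"

# Make sure the target folder exists on the flash drive
mkdir -p /boot/diskcontents

# The braces matter: $Today_disk1_contents would be read as one (unset) variable name
find /mnt/disk1 -ls >"/boot/diskcontents/${Today}_disk1_contents.txt"
find /mnt/disk2 -ls >"/boot/diskcontents/${Today}_disk2_contents.txt"
find /mnt/disk3 -ls >"/boot/diskcontents/${Today}_disk3_contents.txt"
find /mnt/disk4 -ls >"/boot/diskcontents/${Today}_disk4_contents.txt"
find /mnt/disk5 -ls >"/boot/diskcontents/${Today}_disk5_contents.txt"
find /mnt/disk6 -ls >"/boot/diskcontents/${Today}_disk6_contents.txt"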

Then I used something like this, which would go through a folder and delete anything older than 30 days. Of course all the paths would have to be fixed, but it's what I used to clean out old archived things that I thought were too old.

#!/usr/local/bin/perl -w
# (BSD path; Perl usually lives at /usr/bin/perl on Linux)
use strict;

# Folder holding the dated listings; double-check this path before running.
my $reports = "/boot/diskcontents/";

# Regular files only (-type f), so the folder itself can never be matched.
my @file_list = `/usr/bin/find $reports -type f -mtime +30 -print`;
chomp @file_list;

if (@file_list) {
   foreach my $file (@file_list) {
      if (unlink $file) {
         print "Removed $file\n";
      } else {
         warn "Could not remove $file: $!\n";
      }
   }
} else {
   print "No old report files were removed.\n";
}
######################################################################
# Count what is left, skipping dot entries.
opendir(my $dh, $reports) || die "Unable to opendir $reports: $!";
my @numfiles = grep { !/^\./ } readdir($dh);
closedir($dh);
print "There are " . scalar(@numfiles) . " reports left in the report archive.\n";

You just have to make sure that you are pointing it at the right folder, or it will wreak havoc on your files.
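The same cleanup can be sketched in shell with find alone. This is an illustration rather than a drop-in replacement: the function name prune_old_listings is made up, the *_contents.txt pattern assumes the file naming from the script above, and -maxdepth assumes GNU (or BSD) find. Restricting the match to that name pattern is a small guard against pointing it at the wrong folder.

```sh
#!/bin/sh
# prune_old_listings DIR [DAYS]   (hypothetical helper)
# Deletes listing files in DIR that are older than DAYS days (default 30).
prune_old_listings() {
    dir=$1
    days=${2:-30}
    [ -d "$dir" ] || { echo "no such folder: $dir" >&2; return 1; }

    # Only regular files directly inside DIR whose names end in _contents.txt
    # are touched; each removed path is printed first.
    find "$dir" -maxdepth 1 -type f -name '*_contents.txt' \
        -mtime +"$days" -print -exec rm -f {} \;
}
```

From cron this would be called as prune_old_listings /boot/diskcontents 30.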

June 25, 2010

Maybe something like this could be put in the dir_cache program.

That runs all the time and could use the date as a break field to decide when to change the find and save the contents.

I'm still going to move forward with my updatedb/locate replacement using sqlite.

I see the value in it even more after bubba's comments.

I just need to add some search methods for locating by md5.

FWIW, I located md5 subroutines that I can embed right into the updatedb/locate commands to make it easier/more transparent.

June 25, 2010

I have often thought about something like this myself.

Currently I occasionally run ls > text file, which, whilst functional, is not exactly slick.

I like the idea of using SQLite, as you could then expand it to do cool things like "tell XBMC directly about all new TV shows since the last update timestamp" or "list all last-level folders that are on more than one drive" (that is, recover from inefficient high-water splits). The potential uses are huge.

Count me in

Edit: perhaps MySQL might be a more flexible solution. Bad things often happen in SQLite land when two programs try to use the same database. Also, we're very close to proper XBMC SQL support, which I suspect a lot of unRAID users will immediately want to use, and I can see scenarios where other programs start using an XBMC database as a library as well. Food for thought.

June 25, 2010

At the current time my mindset is:

1. I want lightweight easy install, minimal dependency.

2. I need to learn sqlite for some flat file databases at work anyway.

I'm considering some form of middle layer so the DB could be switched out if someone decides to recompile it. In the meantime my goals are:

1. duplicate the functionality of updatedb/locate

2. store mtime/size/md5 to check for changes, duplicates, integrity

3. Allow export in a few mechanisms so that other programs can do things.

I.E. md5sum file, (md5sum can then use this file to check the files directly).

A directory of md5 files symlinked to a source directory, so par2cmdline can be run against the directory. This would allow recovery of corrupt files. md5 can detect the corruption, but this might help fix it. I do not know yet, as I've been having difficulty with this step.
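As a rough picture of the symlink export, the sketch below turns an md5sum manifest into a directory of links named after each hash. It assumes text-mode md5sum output with absolute paths; make_md5_links is a hypothetical name, and whether par2cmdline copes with a directory of symlinks is exactly the open question above.

```sh
#!/bin/sh
# make_md5_links MANIFEST LINKDIR   (hypothetical helper)
# For every "hash  /absolute/path" line, creates LINKDIR/hash -> /absolute/path.
make_md5_links() {
    manifest=$1
    linkdir=$2
    mkdir -p "$linkdir" || return 1

    # read splits off the first field (the hash); the rest of the line,
    # including any spaces inside it, ends up in $path.
    while read -r sum path; do
        [ -n "$path" ] || continue
        # -f replaces an existing link, so duplicate content keeps one link.
        ln -sf -- "$path" "$linkdir/$sum"
    done <"$manifest"
}
```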

As far as SQLite and concurrency issues, I would like to hear more about this.

I've done en masse inserts with two processes. SQLite does not block if the DB is busy, so you have to cycle and retry. But after coding a retry loop and adding 400,000 files with duplicate processes, I was able to extract the same number of records.

I'm considering some form of external lockfile to help with the concurrency issue.

This way I lock a flag file during writes so only one write occurs and the other write processes block.

Reads would not go through this, but I have not tested it yet.
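The external lock idea could look something like the sketch below. It uses a lock directory rather than a flag file, because mkdir is atomic and needs no extra tools; that substitution, and the name with_write_lock, are assumptions for illustration only. A writer that crashes while holding the lock would leave the directory behind, so real use would need stale-lock handling.

```sh
#!/bin/sh
# with_write_lock LOCKDIR COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND while holding an exclusive lock; other writers wait their turn.
with_write_lock() {
    lockdir=$1
    shift

    # mkdir either creates the directory (we own the lock) or fails because
    # another writer already holds it; in that case wait and try again.
    until mkdir "$lockdir" 2>/dev/null; do
        sleep 1
    done

    "$@"
    status=$?

    # Release the lock and pass the command's exit status through.
    rmdir "$lockdir"
    return $status
}
```

Each writer would wrap its insert step, for example with_write_lock /tmp/locate.db.lock some_insert_command, while readers skip the lock entirely.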

I really do not expect this DB to be active enough for it to be of great concern.

In the meantime I would love to hear more about concurrency issues in sqlite.

June 25, 2010

Yeah, a search database would be awesome. Are you thinking of a web interface, or would you run it from the console?

Yeah, from what I can tell XBMC runs very well. I've only had it give me issues once or twice, because the database file somehow got locked up in permissions and I couldn't write to it.

June 25, 2010

Yeah, a search database would be awesome. Are you thinking of a web interface, or would you run it from the console?

I've wanted this for some time now. Since I'm quite seasoned at the command line the priority of it was not high enough, plus I've been waiting for unRAID 5.0 with some kind of plugin interface.

I have incorporated basic code for md5sum and a basic ls -l so the new locate command and probably a check command will have this stuff included.

I don't see a browser interface for some time yet. It could be an unmenu plugin or something for unRAID 5.0. In either case, a browser interface could be designed to use the standard slocate or mlocate commands, so there's no reason it could not happen beforehand.

The initial design of this tool is to get a quick index of all files you choose, allow lookups and assist in verifying integrity.

FWIW, I'm sure there will be a browser interface eventually.

The way I use slocate is to search for a movie, then spin up the disk.

I just have too many disks to be spinning them all up for the usershare search.

I hardly ever use the usershares these days. I have too many files for the dircache to have a positive effect, so the net effect is spinning up all these disks to find a movie.

The way I see a browser interface is a simple google like search screen.

Then the output is the list of files with hyperlinks. (with optional checkbox for ls -l).

When you select the hyperlink it will access the file for ls -l, download, head, tail

This should then spin up the associated disk after the file is accessed.

Perhaps a full read of the file to cache it totally into ram.

June 26, 2010

Interesting tidbit I discovered today.

MySQL can be embedded and linked directly into the application without requiring an external server.

http://dev.mysql.com/tech-resources/articles/embedding-mysql-server.html

One of the reasons I was choosing SQLite is simplicity of installation and to allow people access to the raw data should they need it.

The locate program I'm working on would have SQLite statically linked, but if someone wanted to, they could install sqlite and use the command line sqlite program to access the raw data.

With this new tidbit, I'll play around with it to see how easy it would be to switch from local non server repository to a remote one. In any case I thought this was an interesting point to note if people were following the thread.

Probably more than some want to know about. Sometimes the underlying thought process can be interesting to follow.

June 26, 2010

The only reason I mentioned a browser is that I see unRAID going farther and farther into the use of the browser, which easily translates into a device that is simpler to use and, of course, more attractive to future users.

Users like myself who have used Linux/BSD on and off for the last 10 or so years would understand it and use it, as long as they can remember all the switches, since querying a SQL server can get long depending on access permissions and the complexity of the search.

Of course, in the old days I would have just used the "locate" or "find" command and waited and prayed that something came up. Depending on the build, you would often get "command not found" or be told that the database had not yet been built.

I think it's really cool what you're attempting, and I think it would be a good addition in any form.

June 26, 2010

Users like myself who have used Linux/BSD on and off for the last 10 or so years would understand it and use it, as long as they can remember all the switches, since querying a SQL server can get long depending on access permissions and the complexity of the search.

I don't expect anyone to go this far.

It will be a simple command

locate filename

locate --md5sum (some md5hash value to resolve to filename)

updatedb filename or

updatedb -R /mnt/disk1

updatedb -R --md5 /mnt/disk1

check --md5 /mnt/disk1/filename (or some other command. I am open to suggestions).

(verify? validate? checksum?)

export --md5sum=/mnt/cache/.md5sums/disk1.md5sum (some filename regex like /mnt/disk1/) this will create an md5sum file that can be used by md5sum to verify.

export --md5links=/mnt/cache/.disk1 (some filename regex). (this will create the md5hash to actual filename symlinks).

So a browser interface would just call the locate command.

June 27, 2010

yeah I'm a big fan of

Find transformers

Would it have to be an exact search or could you use wildcards?

find transformers*

June 27, 2010

Just for clarity, what fields/data will the proposed database hold?

June 27, 2010

yeah I'm a big fan of

Find transformers

Would it have to be an exact search or could you use wildcards?

find transformers*

Without a full regular expression it is a partial match.

I have the regular expression library compiled in so it can get intricate if desired.

locate transformers - would match anything with transformers in the name

locate /transformers.iso$ - would anchor the match so it only matches at the end of the name

locate -i filemask is case insensitive match

locate --md5sum 29234lsdlasdkj123 matches on md5sum of data field.

This would then come back with the full filename and would be useful for files in the lost+found directory.

locate -ls filenames - find files and do an ls -l on them (something I've wanted for a long time).
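Until such a tool exists, the same matching rules can be tried against the saved find -ls listings with grep. This is only a stand-in for illustration, not the locate command described here; search_listings is a made-up name, and it prints whole listing lines rather than just the path.

```sh
#!/bin/sh
# search_listings [-i] PATTERN FILE...   (hypothetical helper)
# Searches saved "find -ls" listings with an extended regular expression.
# A bare word is a partial match; a trailing $ anchors it to the end of the path.
search_listings() {
    flags=-E
    if [ "$1" = "-i" ]; then
        flags=-Ei          # case-insensitive, like "locate -i"
        shift
    fi
    pattern=$1
    shift
    # -h drops the listing file name from the output.
    grep "$flags" -h -e "$pattern" -- "$@"
}
```

For example, search_listings -i transformers /boot/disk*_contents.txt matches any path containing the word, while search_listings '/transformers\.iso$' /boot/disk*_contents.txt only matches a file with exactly that name.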

I could possibly add something like find and do an -exec somecommand {} \; but I'm getting ahead of myself.

My use is really to find a file or movie, and know exactly where it is, trigger the spin up and be off. (or to find dups).

June 27, 2010

Just for clarity, what fields/data will the proposed database hold?

So far.

Full path

File type (Dir/Regular File)

md5name (basename of file as an md5hash) - good for finding dups but this field may not even be needed if I write an export tool.

md5sum (of data of file) another way of finding dups

mtime (modification time of file)

size (size of file).
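As an illustration of how the md5sum field helps find dups, the sketch below groups a flat md5sum manifest by hash and prints every entry whose hash occurs more than once. It works on a text manifest ("hash, two spaces, path") rather than the proposed database, and list_duplicates is a hypothetical name.

```sh
#!/bin/sh
# list_duplicates MANIFEST   (hypothetical helper)
# Prints every manifest line whose hash occurs more than once.
list_duplicates() {
    # Sorting keeps the paths for each hash in a stable order.
    sort "$1" | awk '
        { count[$1]++; lines[$1] = lines[$1] $0 "\n" }
        END {
            for (h in count)
                if (count[h] > 1)
                    printf "%s", lines[h]
        }
    '
}
```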

I'm considering adding other stat information

(uid, gid, mode) - sort of useful for tripwire-like functionality.

I'm considering adding a single field for a lock semaphore (this is for my own use in corporate applications).

In my organization I have many programs that monitor for incoming data files.

When a file is found, a lock is set based on name.mtime and a semaphore file flag.

So with this data warehouse, I can lock a file in the application by storing mtime,size and lock flag (meaning I have processed this file and do not process again). The lock is considered stale if mtime & size changes and the flag would be nulled out. This is my own use to prevent multiple sweeps of a file.

CREATE TABLE IF NOT EXISTS files
(path TEXT PRIMARY KEY, type TEXT, md5name TEXT, md5sum TEXT, mtime INTEGER, size INTEGER);

What I need right now is a nice clean way of sweeping a directory that is not memory-intensive.

I'm bouncing between a pipe to find, or the filefind64 call.

In the meantime I'm reading The Definitive Guide to SQLite.

I'm slow with these types of things, so it will be a few weeks before anything materializes, perhaps over my July vacation, which is when I usually get very inspired. It's also a precursor to a job scheduler I plan to write, which will replace a file monitor daemon and a directory monitor daemon at work.

June 27, 2010

Int is not long enough to store file sizes. Need unsigned long ints at the very least. I always use unsigned double-longs to be future-proof.

Definitely keep stat info, if you intend to use the data to replace files that get recovered to lost+found, as sometimes the perms get clobbered.

BTW, *fsck does have a rhyme and reason to its file naming scheme, which relates to the location on disk where the recovered file/dir was found. If you can find the algorithm, that could be helpful too.

Don't forget security.... such a database, once built, could be a massive security hole by identifying exact locations and names of files, and letting someone see that information (and quickly search through it) when they could not see or search in the native directories because they lack permission.

June 27, 2010

This is a very specific usage case but worth knowing about:

http://forums.thetvdb.com/viewtopic.php?f=5&t=1368

A fast hash of video files to allow online lookup (specifically TV only for now).

June 27, 2010

Int is not long enough to store file sizes. Need unsigned long ints at the very least. I always use unsigned double-longs to be future-proof.

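I'm aware of this, but have not yet seen a place to specify that in SQLite. As far as I can tell, its INTEGER type stores signed values of up to 8 bytes, which would cover file sizes.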

Definitely keep stat info, if you intend to use the data to replace files that get recovered to lost+found, as sometimes the perms get clobbered.

I had not thought about that. Thanks, I'll keep it in mind.

I had not thought of the tripwire-like functionality until I remembered another fixperms tool we had on a web hosting service. It would check all these permissions, let us know if something was altered, and set it back.

BTW, *fsck does have a rhyme and reason to its file naming scheme, which relates to the location on disk where the recovered file/dir was found. If you can find the algorithm, that could be helpful too.

I'm not ready to absorb that yet.

Don't forget security.... such a database, once built, could be a massive security hole by identifying exact locations and names of files, and letting someone see that information (and quickly search through it) when they could not see or search in the native directories because they lack permission.

I saw this in the mlocate code and will copy its functionality.
