Native Cache_Dirs Support


jonp


In getting back to a native cache_dirs, I was thinking that for large arrays with many files it might be useful to enable caching at the disk or user-share level, where each item has its own max depth.

 

In my particular structure I have millions of files, but I would probably only want to cache a couple of disks, down to about 4 levels. That way the main distribution of media type / media genre / artist or various artists / collection name is cached, and not every single individual file (this is just an example; I'm sure everyone's layout varies).

 

The point is we may need to set a maxdepth per disk/user share for custom layouts.
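
A rough sketch of what that could look like, driven by a few per-path entries (the paths, depths and helper name here are purely illustrative, not any existing cache_dirs option):

#!/bin/bash
# Hypothetical per-disk / per-share depth limits -- adjust to your own layout.
cache_tree() {          # cache_tree <path> <maxdepth>
    # Walking the tree is what keeps the dentry/inode entries warm;
    # -maxdepth stops it from touching every individual file.
    find "$1" -maxdepth "$2" -noleaf >/dev/null 2>&1 &
}

cache_tree /mnt/disk1        4
cache_tree /mnt/disk2        4
cache_tree /mnt/user/Movies  3
wait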

Link to comment

The real question is: why does a parity check or drive rebuild clear out the caches, causing cache_dirs to fall back to real disk reads again?

 

Do they have to?

 

Shouldn't happen.  All parity/rebuild operations use private buffers allocated by the unRAID driver and do not consume any buffer cache.

Link to comment

The real question is: why does a parity check or drive rebuild clear out the caches, causing cache_dirs to fall back to real disk reads again?

 

Do they have to?

 

Shouldn't happen.  All parity/rebuild operations use private buffers allocated by the unRAID driver and do not consume any buffer cache.

 

Could there be bounce buffers being used on certain motherboards?

 

Other than that, the only other explanation is that all the I/O going on during a parity check/rebuild is causing the dentry and filesystem data to be aged out.

Link to comment

I was writing a reply about how this definitely happens, etc., but then I realized I only assume it happens. Yes, it seems to happen every time, and certainly things like preclear absolutely ruin cache_dirs performance, but again that isn't exactly the same thing.

 

A lot of this comes down to us deciding/working out how to test this thing properly. All tests so far have been pretty ad hoc, with so many variables that conclusions are confusing at best.

 

Ideally we would want to be able to see the cache fill up vs. the inode count on disk vs. the settings chosen, and from there start to play with pressures and loads.
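
For what it's worth, a crude way to watch that on a stock kernel (a sketch only; unRAID's /proc layout should match, but check before trusting the numbers) is to sample the dentry/inode counters and the slab cache while a parity check runs:

#!/bin/bash
# Sample the kernel's dentry/inode cache state once a minute.
# /proc/sys/fs/dentry-state: nr_dentry nr_unused age_limit want_pages ...
# /proc/sys/fs/inode-state:  nr_inodes nr_free_inodes ...
while true; do
    printf '%s dentry: %s inode: %s\n' "$(date '+%F %T')" \
        "$(cat /proc/sys/fs/dentry-state)" \
        "$(cat /proc/sys/fs/inode-state)"
    # /proc/slabinfo (root only) shows how much RAM the caches actually hold.
    grep -E '^(dentry|.*inode_cache)' /proc/slabinfo
    sleep 60
done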

Link to comment

Other than that, the only other explanation is that all the I/O going on during a parity check/rebuild is causing the dentry and filesystem data to be aged out.

If I had to put money somewhere this would be it.

 

I haven't looked at Joe's script for a long time; maybe he disables scanning if a parity operation is in progress.

Link to comment

I'm fairly sure that's not the case.  But the impact on a parity check actually isn't that bad once the cached directories are completed.

 

As I noted earlier, if you reboot a system with Cache_Dirs enabled and then immediately start a parity check, it will be VERY notably slower than if you do the same thing with Cache_Dirs disabled.

 

But after about 30 minutes (on my system; this time will depend on how many disks and files you have) the parity check will start running at normal speed and will continue to do so, unless you use the system for other things and, in particular, access a bunch of directories that weren't in the cache, as these will cause Cache_Dirs to do some rescanning.

 

Link to comment

I think the simplest "solution" is a simple On/Off switch for Cache-Dirs in the Web GUI  :)

 

I suspect most of us would leave it on all the time except when we wanted to do a parity check or drive rebuild.

 

I can't imagine it would be that hard to automatically turn off cache_dirs during those two situations, or otherwise provide the switch.
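
Until it's native, a wrapper along these lines could approximate that (purely a sketch: it assumes `mdcmd status` reports a non-zero mdResyncPos while a check/rebuild is running, which you would want to verify on your unRAID version, and the find line just stands in for whatever cache_dirs actually does):

#!/bin/bash
# Sketch: skip directory scanning while a parity check / rebuild is active.
parity_active() {
    # mdResyncPos is assumed to be non-zero during a sync -- verify first.
    /root/mdcmd status 2>/dev/null | grep -q 'mdResyncPos=[1-9]'
}

while true; do
    if parity_active; then
        sleep 300       # leave the disks to the parity operation
    else
        # Re-walk the cached trees so their dentries stay in memory.
        find /mnt/disk[0-9]* -maxdepth 4 -noleaf >/dev/null 2>&1
        sleep 60
    fi
done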

Link to comment

Continued from

http://lime-technology.com/forum/index.php?topic=34425.msg322248#msg322248

http://lime-technology.com/forum/index.php?topic=34425.msg322275#msg322275

http://lime-technology.com/forum/index.php?topic=34425.msg322293#msg322293

 

While I know this is slightly off topic to cache_dirs itself, I wanted to present some other data for feasibility review.

 

i.e. caching the data via another warehousing mechanism.

This time with SQLite.

Since it's part of the newer unRAID releases, it opens up other doors for tracking our filenames/md5 sums.

 

You can see it takes enormous amounts of time to traverse 3 disks with over a million files.

This uses ftw64(), which does a stat for every file.

Then I check whether the stat block matches what's in SQLite and either update it or insert it.

 

root@unRAID:# /boot/local/bin/sqlite3 /tmp/ftwstatcache.sqlite3 '.schema locate' 
CREATE TABLE locate (name TEXT PRIMARY KEY, updatetime INTEGER, inode INTEGER, mtime INTEGER, size INTEGER, jobstime INTEGER, jobetime INTEGER, rc INTEGER, md5 TEXT, stat BLOB);

NEW DB
./ftwstatcache_sqlite -D /tmp/ftwstatcache.sqlite3 /mnt/disk1 /mnt/disk2 /mnt/disk3 
1154802 files processed in 4515 seconds, selects 0, deletes 0, duplicates 0, inserts 0, updates 1154802, errors 0 

PRE-EXISTING DB
time ./ftwstatcache_sqlite -D /tmp/ftwstatcache.sqlite3 /mnt/disk1 /mnt/disk2 /mnt/disk3 
1154815 files processed in 4477 seconds, selects 1154781, deletes 0, duplicates 1154769, inserts 0, updates 46, errors 0 

real    74m36.815s
user    2m46.970s
sys     1m47.360s

 

We can see the time is about the same for the initial insert and for the subsequent select/match/update pass.
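
For anyone who wants to play with the idea without the C code, the same walk-and-upsert loop can be mocked up in shell against a cut-down version of the schema above (a sketch only, nowhere near as efficient as ftwstatcache_sqlite, and it skips awkward filenames for simplicity):

#!/bin/bash
# Shell mock-up of the walk/compare/update idea: one stat per file via
# find -printf, then INSERT OR REPLACE into a name-keyed table.
DB=/tmp/ftwstatcache-demo.sqlite3

sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS locate
  (name TEXT PRIMARY KEY, updatetime INTEGER,
   inode INTEGER, mtime INTEGER, size INTEGER);'

NOW=$(date +%s)
{
  echo 'BEGIN;'
  find /mnt/disk1 -type f -printf '%i|%T@|%s|%p\n' 2>/dev/null |
  while IFS='|' read -r inode mtime size name; do
      case $name in *"'"*|*'|'*) continue ;; esac   # demo: skip awkward names
      echo "INSERT OR REPLACE INTO locate VALUES
            ('$name', $NOW, $inode, ${mtime%.*}, $size);"
  done
  echo 'COMMIT;'
} | sqlite3 "$DB"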

 

The size of the cached data.

root@unRAID:/mnt/disk1/home/rcotrone/src.slacky/ftwcache# ls -l /tmp/ftwstatcache.sqlite3 
-rw-r--r-- 1 root root 504266752 2014-08-24 22:52 /tmp/ftwstatcache.sqlite3
root@unRAID:/mnt/disk1/home/rcotrone/src.slacky/ftwcache# ls -l --si /tmp/ftwstatcache.sqlite3 
-rw-r--r-- 1 root root 505M 2014-08-24 22:52 /tmp/ftwstatcache.sqlite3

 

This is larger than the gdbm stat block cache because I include other fields.

i.e. a place for the md5sum, tracking of when the md5sum job starts/ends, and selectable fields for mtime, size, and inode.

That's more for command-line scripting. While I store the stat[] block, it would not be easily queryable via SQL.

I may redo this test with just the name and binary stat[] struct just as a size test.

However, keeping in mind that SQLite converts binary data to an X'' character-string equivalent inside the DB, I'm not sure how much smaller it can be.

 

Querying this data is pretty fast via sqlite also.

time /boot/local/bin/sqlite3 /tmp/ftwstatcache.sqlite3 "select name from locate" > /tmp/filelist.txt

real    0m2.377s
user    0m1.720s
sys     0m0.650s

root@unRAID:/mnt/disk1/cache/home/rcotrone/hercules/src/flocate# wc -l /tmp/filelist.txt
1154836 /tmp/filelist.txt

root@unRAID: time /boot/local/bin/sqlite3 /tmp/ftwstatcache.sqlite3 "select name from locate" | grep 'Night Vision' | wc -l  
174

real    0m11.387s
user    0m14.270s
sys     0m0.970s

 

Might be faster when I finish the new locate command.

It might also be slower if the SQLite DB were on a physical disk rather than in root/tmpfs.

There is a big performance penalty when using physical disks.
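
If the database does have to live on a physical disk, SQLite's standard pragmas can claw some of that back, at the cost of a little durability on power loss (the path is only an example; WAL mode needs SQLite 3.7.0 or newer):

sqlite3 /mnt/cache/ftwstatcache.sqlite3 <<'SQL'
PRAGMA journal_mode=WAL;     -- write-ahead log instead of a rollback journal
PRAGMA synchronous=NORMAL;   -- fewer fsyncs per commit
PRAGMA temp_store=MEMORY;    -- keep temporary b-trees off the disk
SQL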

 

This is all precursor to a new updatedb/locate/md5sum database I've been working on.

Since we have SQLite as part of PHP, we can now have a webGui to locate our files (when I get to it).

 

This is about all I'm going to post about this part of the subject.

It's here as an exercise to show the CPU/time/space required to cache stat data for arrays with a large number of files.

 

While it's feasible to keep these data structures on disk, it really would be better to have a slight modification to the kernel's caching of dentry information. What that is, or how to change it, is beyond me at the moment; I only know of a dentry boot-time option to increase the hash table, and I'm not sure how to preserve the dentry data in the kernel longer.
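
For reference, the obvious stock-kernel knobs in this area are sketched below (illustrative values only, not recommendations; whether they go far enough is exactly the open question):

# Bias reclaim toward keeping dentry/inode entries (default 100;
# lower = keep them longer, 0 = never reclaim, which can exhaust RAM).
sysctl vm.vfs_cache_pressure=10

# The boot-time hash table sizing goes on the kernel command line
# (the append line in syslinux.cfg on unRAID), for example:
#   append dhash_entries=1048576 ihash_entries=1048576 initrd=bzroot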

 

 

 

 

EDIT: 2014-08-25

This is the same command re-run with a new schema of just the filename and the stat block, using SQLite.

 

root@unRAID: # /boot/local/bin/sqlite3 /tmp/ftwstatcache.sqlite3 ".schema locate"
CREATE TABLE locate (name TEXT PRIMARY KEY, stat BLOB);


root@unRAID:/mnt/disk1/home/rcotrone/src.slacky/ftwcache# time ./ftwstatcache_sqlite -D /tmp/ftwstatcache.sqlite3 /mnt/disk3 /mnt/disk2 /mnt/disk1 
1155019 files processed in 2296 seconds, selects 0, deletes 0, duplicates 0, inserts 0, updates 1155019, errors 0 

real    38m16.123s
user    3m1.920s
sys     2m45.310s

While the initial database load seems faster, the timing is probably skewed by my heavy usage and repeated scanning of the filesystems today.

 

And as you'll see, the file size is not much smaller. I guess that's the price you pay for SQL access.

root@unRAID:# ls -l /tmp/ftwstatcache.sqlite3 
-rw-r--r-- 1 root root 466719744 2014-08-25 20:21 /tmp/ftwstatcache.sqlite3

root@unRAID:# ls -l --si /tmp/ftwstatcache.sqlite3 
-rw-r--r-- 1 root root 467M 2014-08-25 20:21 /tmp/ftwstatcache.sqlite3

And some other benchmarks:

root@unRAID:/mnt/disk1/cache/home/rcotrone/hercules/src/flocate# time /boot/local/bin/sqlite3 /tmp/ftwstatcache.sqlite3 "select name from locate" | wc -l
1155019

real    0m2.310s
user    0m2.190s
sys     0m1.110s

root@unRAID:/mnt/disk1/cache/home/rcotrone/hercules/src/flocate# time /boot/local/bin/sqlite3 /tmp/ftwstatcache.sqlite3 "select name from locate" | grep '^/mnt/disk3' | wc -l 
301326

real    7m33.861s
user    7m37.080s
sys     0m1.130s

root@unRAID:/mnt/disk1/home/rcotrone/src.slacky/ftwcache# time /boot/local/bin/sqlite3 /tmp/ftwstatcache.sqlite3 "select name from locate" | grep 'Night Vision' | wc -l   
175

real    0m11.798s
user    0m14.450s
sys     0m0.790s

Link to comment
  • 2 weeks later...

Native Cache_Dirs support would really be ideal.  With my media players, navigating through the GUI is horrible without Cache_Dirs, and I would not use unRAID without it.

 

Great idea.

 

craigr

 

Could not agree more.

 

One enhancement to cache_dirs that is needed is the ability to set the depth per folder, i.e. scan folder X 2 levels deep, folder Y 3 levels deep, and folder Z completely.

Link to comment

I'd also like a setting that can force the array to spin up if a certain directory gets viewed.  This would be the best of both worlds, as I have seen media players time out waiting for the disks to spin up.  Having a disk with a root folder of, say, /movies/ spin up when a media player starts viewing the directory would be well handy.

 

This could be further tuned by having cache_dirs react differently for different users, i.e. from my PC I wouldn't need the array spinning up because I could just be looking at what I have, but if a user that I have assigned to the media player(s) starts searching, it's most likely going to want to read content off the disk.
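
Until something native exists, a rough approximation is to watch the share with inotify and poke the backing disk when anything opens it (a sketch only; it assumes the inotify-tools package is installed, and the watched path and device below are examples that will differ per setup):

#!/bin/bash
# Spin up the backing disk when a media player opens the watched folder.
# Requires inotifywait (inotify-tools); path and device are examples only.
WATCH=/mnt/user/Movies
DISK_DEV=/dev/sdb            # the drive holding this share's content

inotifywait -m -e open -e access --format '%w%f' "$WATCH" |
while IFS= read -r hit; do
    # A tiny direct read is enough to wake the drive from spin-down.
    dd if="$DISK_DEV" of=/dev/null bs=512 count=1 iflag=direct 2>/dev/null
done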

Link to comment

I'd also like a setting that can force the array to spin up if a certain directory gets viewed.  This would be the best of both worlds, as I have seen media players time out waiting for the disks to spin up.  Having a disk with a root folder of, say, /movies/ spin up when a media player starts viewing the directory would be well handy.

 

This could be further tuned by having cache_dirs react differently for different users, i.e. from my PC I wouldn't need the array spinning up because I could just be looking at what I have, but if a user that I have assigned to the media player(s) starts searching, it's most likely going to want to read content off the disk.

 

THIS. My Popcorn Hour usually times out on array spin-up and requires a reboot to rediscover the SMB shares.

Link to comment

I'd also like a setting that can force the array to spin up if a certain directory gets viewed.  This would be the best of both worlds, as I have seen media players time out waiting for the disks to spin up.  Having a disk with a root folder of, say, /movies/ spin up when a media player starts viewing the directory would be well handy.

 

This could be further tuned by having cache_dirs react differently for different users, i.e. from my PC I wouldn't need the array spinning up because I could just be looking at what I have, but if a user that I have assigned to the media player(s) starts searching, it's most likely going to want to read content off the disk.

 

THIS. My Popcorn Hour usually times out on array spin-up and requires a reboot to rediscover the SMB shares.

 

Joe L. had a script to spin up the array automatically when an IP address answers a ping.

 

What seems to be asked in these later posts is not 'directory caching', but access activity monitoring.

That's a feature request in its own right.

Link to comment

Maybe I missed something in all of this, and maybe my understanding of the concepts involved isn't good enough, but why wouldn't it be possible to have the cache_dirs info in active memory dumped to disk on command or on a schedule, and then have the capability of reading that file back into memory on boot? I would think you would want this disk to not be an array member, but I don't know.

Link to comment

Maybe I missed something in all of this, and maybe my understanding of the concepts involved isn't good enough, but why wouldn't it be possible to have the cache_dirs info in active memory dumped to disk on command or on a schedule, and then have the capability of reading that file back into memory on boot? I would think you would want this disk to not be an array member, but I don't know.

 

It would be better to increase the dentry/inode hash tables and make the entries stick around longer or permanently, rather than work on the backup/restore part.

 

You have to consider that these hash tables are kernel structures, so every time the kernel changes, you have to modify the application or module responsible for backup/restore.

 

I believe the core issue is that dentry and inode structures age out. If there is a way of preserving them in RAM, that would be the first prudent step.  After that, backup and restore could have some potential value, but then you have the programmer expense of maintaining it.

 

 

In my tests of storing filename-to-stat-structure mappings with gdbm, I've found that a .gdbm file covering over 300,000 files can grow to 90MB.

Traversing it can be very fast; growing it is a bit slow, but accessing a specific record is lightning fast.

 

So the potential to store the data on a fast subsystem exists, but the complexity of doing so could be too overwhelming to make it worthwhile.

 

By adjusting the kernel to prefer keeping inode/dentry structures in RAM, most of the battle is won.

 

That's what cache_dirs tries to do: keep accessing these structures so they do not get aged out.

 

When you think about it, it doesn't make too much sense to cache these structures on disk, when the information originally came from the disk anyway. It makes more sense to keep them around in RAM longer, without them being accidentally flushed out by reading a very large file very fast.

Link to comment
  • 2 weeks later...

Yes please. Without it, accessing a single file wakes up the whole array; with it on, only the drive the file is on wakes up, and just browsing the shares is very smooth and fast. Even though cache_dirs is a weird fix, it does work, so at least have version 6 include it until (or unless) you guys can figure out a proper way to do it.

Link to comment

Maybe I missed something in all of this, and maybe my understanding of the concepts involved isn't good enough, but why wouldn't it be possible to have the cache_dirs info in active memory dumped to disk on command or on a schedule, and then have the capability of reading that file back into memory on boot? I would think you would want this disk to not be an array member, but I don't know.

 

 

So the potential to store the data on a fast subsystem exists, but the complexity of doing so could be too overwhelming to make it worthwhile.

 

 

Worthwhile?  I don't know.  I'm usually very happy with my unRAID, but sometimes it's so slow to traverse complex directory structures spanning a lot of disks that the application I'm trying to use with the file I'm looking for will time out.  I've had applications crash because of this, etc.  Not to mention just being generally pissed off, wondering when it will do whatever it is that it's doing to a particular folder in order to open it.  It makes me wonder how many people have had this problem and then ran away from unRAID to something else, without bothering to take the time to figure out what's really going on.  I don't know; looking at it from that perspective, this would seem important and maybe a worthwhile pursuit.

Link to comment

My comment about "worthwhile" was about designing a mechanism to cache the directory structure to disk and/or reload it.

Writing an application or kernel level mod to do that could prove futile.

 

Expanding/modifying the dentry/inode cache mechanism might be worthwhile, but then you are always modifying a moving target, i.e. the kernel.

 

Issuing a find down a tree you plan to explore would be easier (which is what cache_dirs does).

A page in emhttp to issue the find, thus caching the whole directory structure, would be easier to program and continually supportable in the future.

Writing an emhttp-configurable interface to a daemon that does the file tree walk isn't too difficult either, i.e. a native directory-caching application.

I was planning to do something like this for myself; perhaps a lil flashlight icon at the disk link on the front page.

Clicking that would trigger a background walk down the tree.

Or perhaps in the user share.

 

Also keep in mind that if you jump to the shares page and click on Compute, that has the effect of walking the tree once, reading the stat blocks for each file, and thus caching them for future review.

 

The other option I've been exploring is a duplicate of the updatedb/locate functionality of Linux, where all files are stored in a catalog. When searching for files, you use the locate command to find the files you are interested in.  With a browser interface we can issue a command to read files from that list into RAM, thus causing the appropriate disks to spin up along with the files being cached.  One further option would be to store md5 hashes in the catalog.

 

This comes at a price: using the RAM filesystem for SQLite database storage with 1 million files takes a little over 500MB.

Using GDBM files this drops to about 300MB.

SQLite provides the ability to extract your data in multiple ways: command line, application, PHP.

GDBM is more compact, and possibly faster for single-key access.

 

It's also feasible, at a performance penalty, to do the caching within the user share filesystem, i.e. caching directories in a .gdbm, SQLite, or some in-memory hash structure.

At the cost of performance for the initial load, searching when looking for files, and the memory to warehouse the data.

 

So the answer may be twofold.  First, a native find/file-tree-walk utility with a configurable emhttp interface.

Second, educating the user to use the Compute function before doing any advanced work down a user share, thus caching that directory set on demand.

 

I know that when I plan to do a lot of adds to a disk, I will issue an on-demand find /mnt/disk3 >/dev/null & to cache all the directory information.  In my particular environment, with so many small files, I cannot cache all the disks, so I have to work on demand.  Thus the need for my locate catalog was born.
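
Tying the two together, the catalog can drive the on-demand warm-up: pull the matching names out of SQLite, then stat them so the right disks spin up and the dentries land in cache (a sketch against the schema shown earlier; the search term is just an example):

#!/bin/bash
# Warm the cache for one search term using the locate-style catalog.
DB=/tmp/ftwstatcache.sqlite3
TERM=${1:-"Night Vision"}            # example search term

sqlite3 "$DB" "SELECT name FROM locate WHERE name LIKE '%$TERM%';" |
while IFS= read -r f; do
    # stat touches the dentry/inode without reading file data, which is
    # enough to spin up the owning disk and prime the directory cache.
    stat "$f" >/dev/null 2>&1
done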

Link to comment
  • 5 years later...

Fast forward 6 years. Any progress? Lots of nerdy crap in here, but is there really no way to hook into the low-level filesystem to know what's going on? Of course, at first the cache needs to read everything to learn what's there; after that it could hook into the filesystem so it always knows what changes.

 

I've been struggling for years with the add-on never working right for me. Having low-level kernel support would be such a benefit. I'm surprised that in all the years of Linux it doesn't already have some sort of file-caching option like this. I'm bumping this thread for future development.

Link to comment
