Alex R. Berg Posted September 26, 2018

Oh right, two threads... I'm not sure anybody is maintaining the cache-dirs script, but I am the last one who got their hands dirty in it. Honestly I'm really sick and tired of the bash script; bash is a terrible language for a script this long, and I made it a lot longer by adding a lot of adaptive logic. Still, it would be nice if the script keeps working, and it does seem to work quite well for me.

The dynamix plugin contains the standalone cache-dirs script in an updated version; the plugin itself is just a wrapper to run it and configure it via the GUI. The script in this thread has not been updated for years.

I think it would be best to have the cache-dirs conversation in a thread separate from all the other dynamix plugins, or the conversation will get muddled. This thread seems like a good place, since Joe appears to no longer maintain the original script. Now, I'm not promising that I will maintain it, but I might listen along.

If the problem with the plugin is that it does not start, then the dynamix thread is probably the best place to address it. The cache-dirs script (also in the plugin version) can be run manually. The plugin installs the script to /usr/local/bin/cache_dirs and service files to /etc/rc.d/rc.cachedirs.

Fireball3: it sounds like the issue that it doesn't start could be a plugin issue; I'll add a comment in that thread.

Best Alex
Fireball3 Posted September 26, 2018

10 minutes ago, Alex R. Berg said:
"I think it would be best to have the cache-dirs conversation in a thread separate from all the other dynamix plugins, or the conversation will get muddled."

I agree. I would even suggest having an extra thread for each dynamix plugin...

12 minutes ago, Alex R. Berg said:
"If the problem with the plugin is that it does not start, then the dynamix thread is probably the best place to address the problem."

Starting is one issue. The second is locking up and using 100% CPU on one core after some time (when started manually). I have reverted to v2.1.1, which doesn't have those issues, for now.
Alex R. Berg Posted September 26, 2018

Ah, good to know that 2.1.1 doesn't give you trouble. I'll need the cache-dirs log in /var/log/cache_dirs.log. You need to enable logging in the plugin first (or touch the file) to get it written.

It will also be helpful to see what is spinning, if you want to dig into the issue:

ps -e x -o ppid -o pid -o pgid -o tty -o vsz -o rss -o etime -o cputime -o rgroup -o ni -o fname -o args | grep "cache_dirs\|find\|wc"
pstree -p | grep -2 cache_dirs
Eisi2005 Posted September 26, 2018

@Alex sorry for the late answer, and thanks for your help.
Fireball3 Posted September 26, 2018

4 hours ago, Alex R. Berg said:
"Ah, good to know that 2.1.1 doesn't give you trouble. I'll need the cache-dirs log in /var/log/cache_dirs.log. You need to enable logging in the plugin first (or touch the file) to get it written."

OK, I'll install the new version, enable logging, and give it a reboot. The plugin presumably won't start; then I will start it manually, and after a couple of minutes I will dump the logs.

4 hours ago, Alex R. Berg said:
"It will also be helpful to see what is spinning, if you want to dig into the issue:
ps -e x -o ppid -o pid -o pgid -o tty -o vsz -o rss -o etime -o cputime -o rgroup -o ni -o fname -o args | grep "cache_dirs\|find\|wc"
pstree -p | grep -2 cache_dirs"

Not sure about this one; is it intended to find the process that's keeping the CPU load pinned? If so, I have to wait for the "event" to happen, and then I'll dump that info.
Alex R. Berg Posted September 26, 2018

@Fireball3 yeah, ps gives you processes, and the grep filters the ones I'm interested in. pstree shows processes and subprocesses but does not give much insight into what is going on. You'll have to wait till the 100% CPU problem happens. High CPU is not unusual when all files are cached and during a scan, but it shouldn't scan that often, and I doubt it should hit 100% CPU.

Best Alex
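As a side note (my addition, not part of Alex's commands): when grepping ps output by hand like this, the grep process itself will often show up as a match, since its own command line contains the pattern. A common shell trick is to bracket the first character of the pattern, so grep's command line no longer contains the literal string it searches for:

```shell
#!/bin/bash
# Filter ps output for cache_dirs lines without matching the grep itself.
# The regex [c]ache_dirs still matches the string "cache_dirs", but the
# grep process's own command line contains "[c]ache_dirs", which does not.
filter_cache_dirs() {
  grep '[c]ache_dirs'
}

# Typical use against a live process listing:
# ps -e -o pid,pcpu,etime,args | filter_cache_dirs
```

This avoids needing a second `grep -v grep` stage in the pipeline.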
Fireball3 Posted September 27, 2018

Here you go. Both logs attached:
1. Right after reboot; the plugin doesn't start automatically.
2. Logging enabled, plugin started manually.

Now waiting for the CPU load event to happen.

Attachments: tuerke-diagnostics-20180927-1012_fresh_start_after_plugin_install.zip, tuerke-diagnostics-20180927-1021_logging_enabled_plugin_manually_started.zip
Alex R. Berg Posted September 27, 2018

@Fireball3 Things look great. The cache_dirs log in 20180927-1021\logs\cache_dirs.log indicates it's running each scan in 0 seconds and waiting 10 secs between runs. Did you try the ps + pstree?

Regarding the crontab problem I saw in your syslog: your crontab looks fine. Maybe you have a bad script in one of the dirs /etc/cron.*

Regarding the startup problem, it does seem like cache-dirs starts at system boot as it should. It starts within seconds of mounting the disks. Why do you think it's not running on system boot?

Sep 27 10:01:53 Tuerke kernel: REISERFS (device sdi1): Using r5 hash to sort names
Sep 27 10:01:53 Tuerke emhttp: shcmd (80): set -o pipefail ; mount -t reiserfs -o remount,user_xattr,acl,noatime,nodiratime /dev/sdi1 /mnt/cache |& logger
Sep 27 10:01:53 Tuerke emhttp: shcmd (81): sync
Sep 27 10:01:56 Tuerke emhttp: shcmd (82): mkdir /mnt/user0
Sep 27 10:01:56 Tuerke emhttp: shcmd (83): /usr/local/sbin/shfs /mnt/user0 -disks 4094 -o noatime,big_writes,allow_other |& logger
Sep 27 10:01:56 Tuerke emhttp: shcmd (84): mkdir /mnt/user
Sep 27 10:01:56 Tuerke emhttp: shcmd (85): /usr/local/sbin/shfs /mnt/user -disks 4095 1024000000 -o noatime,big_writes,allow_other -o remember=0 |& logger
Sep 27 10:01:56 Tuerke emhttp: shcmd (86): cat - > /boot/config/plugins/dynamix/mover.cron <<< "# Generated mover schedule:#012 /usr/local/sbin/mover |& logger#012"
Sep 27 10:01:56 Tuerke emhttp: shcmd (87): /usr/local/sbin/update_cron &> /dev/null
Sep 27 10:01:56 Tuerke s3_sleep: ----------------------------------------------
Sep 27 10:01:56 Tuerke s3_sleep: command-args=-C 2 -h 00 -h 17 -h 18 -h 19 -h 20 -h 21 -h 22 -h 23 -a -m 15 -n -e eth0 -N 12500 -i 192.168.178.150 -i 192.168.178.153 -l -L -D 0
Sep 27 10:01:56 Tuerke s3_sleep: action mode=shutdown
Sep 27 10:01:56 Tuerke s3_sleep: check disks status=yes
Sep 27 10:01:56 Tuerke s3_sleep: check network activity=yes
Sep 27 10:01:56 Tuerke s3_sleep: check active devices=192.168.178.150 192.168.178.153
Sep 27 10:01:56 Tuerke s3_sleep: check local login=yes
Sep 27 10:01:56 Tuerke s3_sleep: check remote login=yes
Sep 27 10:01:56 Tuerke s3_sleep: version=3.0.6
Sep 27 10:01:56 Tuerke s3_sleep: ----------------------------------------------
Sep 27 10:01:56 Tuerke s3_sleep: included disks=sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdn
Sep 27 10:01:56 Tuerke s3_sleep: excluded disks=sda
Sep 27 10:01:56 Tuerke s3_sleep: ----------------------------------------------
Sep 27 10:01:56 Tuerke s3_sleep: s3_sleep process ID 4470 started, To terminate it, type: s3_sleep -q
Sep 27 10:01:56 Tuerke emhttp: Starting services...
Sep 27 10:01:56 Tuerke emhttp: nothing to sync
Sep 27 10:01:56 Tuerke sudo: root : TTY=unknown ; PWD=/ ; USER=nobody ; COMMAND=/bin/bash -c /usr/local/emhttp/plugins/unbalance/unbalance -port 6237

Best Alex
Fireball3 Posted September 27, 2018

1 hour ago, Alex R. Berg said:
"The cache_dirs log in 20180927-1021\logs\cache_dirs.log indicates it's running each scan in 0 seconds and waiting 10 secs between runs."

That was after I started it manually.

1 hour ago, Alex R. Berg said:
"Did you try the ps + pstree?"

See attached.

1 hour ago, Alex R. Berg said:
"Regarding the crontab problem I saw in your syslog: your crontab looks fine. Maybe you have a bad script in one of the dirs /etc/cron.*"

In /etc/cron.d is the attached file "root".

1 hour ago, Alex R. Berg said:
"Regarding the startup problem, it does seem like cache-dirs starts at system boot as it should. It starts within seconds of mounting the disks. Why do you think it's not running on system boot?"

Because it says "Status: stopped", and there is nothing cached. Once the drives spin down, I have to wait for them to spin up when browsing the cached SMB shares. Is there a way to check how many entries are actually cached? I'm checking the memory chart in the stats: when using 2.1.1, almost all of my 8 GB of RAM is "cached"; when starting with 2.2.0j, nothing is cached. The only difference is the plugin version. This might not be a good approach, though.

Not sure about the log snippet you posted; I don't see any cache_dirs entry in it.

Also attached a diagnostics dump after 10 hours of uptime. No abnormal CPU load registered yet.

Attachments: root, ps.log, pstree.log, tuerke-diagnostics-20180927-2014.zip
Alex R. Berg Posted September 27, 2018

Here's some relevant info for you, to learn to debug the problem.

pstree:

|-cache_dirs(10139)---sleep(30575)

This shows which subprocesses cache_dirs currently has; right now it's sleeping. Cache-dirs does all its work in the 'find' and the 'wc' (file-counting) subprocesses, though the wc is only executed when the script is first started, to count files at each depth. The ps command gives insight into how subcommands are executed, but there were no subcommands. Why do you think it's running at 100% CPU?

crontab: I didn't know about the existence of /etc/cron.d, good for me to know. The last line is incorrect, hence the complaints in your syslogs. Something on your system must be messing it up. You could debug what it is if you want, by removing other plugins and re-enabling them one by one, or by doing a binary search, enabling half at a time. This is how my file looks:

# Generated cron settings for plugin autoupdates
0 0 * * * /usr/local/emhttp/plugins/ca.update.applications/scripts/updateApplications.php >/dev/null 2>&1
# Generated ssd trim schedule:
0 0 * * * /sbin/fstrim -a -v | logger &> /dev/null
# Generated docker monitoring schedule:
10 0 1 * * /usr/local/emhttp/plugins/dynamix.docker.manager/scripts/dockerupdate.php check &> /dev/null
# Generated mover schedule:
40 3 * * * /usr/local/sbin/mover &> /dev/null
# Generated plugins version check schedule:
10 0 1 * * /usr/local/emhttp/plugins/dynamix.plugin.manager/scripts/plugincheck &> /dev/null
# Generated array status check schedule:
20 0 * * 1 /usr/local/emhttp/plugins/dynamix/scripts/statuscheck &> /dev/null

Quote (Fireball3):
"Because it says 'Status: stopped'. There is nothing cached. Once the drives spin down I have to wait for the drives to spin up when browsing the cached smb shares. Is there a way to check how many entries are actually cached? I'm checking the memory chart within the stats. When using 2.1.1 almost all of my 8GB RAM is 'cached'."

The status could be lying; running the ps or pstree command tells you the truth. It's a good indication, though, that nothing is cached if the drives are spun down and spin up when you access them. Linux caches lots of stuff in memory, so it might be different stuff that is cached. The best indication is disks spinning up when you access dirs that should be cached. Note that only dirs are cached, not file content, but I'm sure you know that.

The cache-dirs log gives you info about what it's doing, and how many files you have cached at each level:

2018.09.27 10:26:06 Executed find in (0s) 00.10s, wavg=00.10s Idle____________ depth 9999 slept 10s Disks idle before/after 1538036766s/1538036766s suc/fail cnt=69/70/0 mode=4 scan_tmo=150s maxCur=9999 maxWeek=9999 isMaxDepthComputed=1 CPU= 4%, filecount[9999]=16460

This line from a previous diagnostics cache-dirs log indicated:

filecount[1]=0 filecount[2]=1213 filecount[3]=7832 filecount[4]=11794 filecount[5]=14207 filecount[6]=15873 filecount[7]=16460 filecount[9999]=16460

This means 1213 files at depth 2 (/mnt/user/share/X) and 15873 files at depth 6 (/mnt/user/share/X/Y/Z/A/B). I'm probably off by one or two on the counts, but this gives you the idea. If the next scan says 'Disks idle before/after 9999s/9999s', then the disks are idle or spun down, and scanning didn't bring them up.

You could reduce Cache Pressure to 1 while debugging, but you might run out of memory and crash the system, especially if you reduce it to 0.

In the last log at 20:14, cache-dirs was also started at startup; the disks were spun down later. But there's no cache log. Maybe you forgot it? I just played with the diagnostics myself, and I can see it's not automatically included. It would be cool if the plugin could tell unRAID to include the log files in the diagnostics, but I don't know if that is possible.

Best Alex
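To get a feel for those filecount[] numbers yourself, a rough hand-rolled equivalent is to run find with an increasing -maxdepth and count the output lines. This is my own sketch of the same idea (find piped to wc), not the script's exact internals, so the counts may differ slightly from what the cache_dirs log reports:

```shell
#!/bin/bash
# Rough equivalent of the per-depth filecount[] numbers in the cache_dirs
# log: count all entries (dirs and files) visible up to a given depth.
count_at_depth() {
  local dir=$1 depth=$2
  find "$dir" -maxdepth "$depth" 2>/dev/null | wc -l
}

# Print counts for depths 1..7, in the same style as the log line.
show_depth_counts() {
  local dir=$1 depth
  for depth in 1 2 3 4 5 6 7; do
    echo "filecount[$depth]=$(count_at_depth "$dir" "$depth")"
  done
}

# Example: show_depth_counts /mnt/user/Video
```

Note that running this against a share will itself touch the disks, so only use it while they are already spun up.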
Fireball3 Posted September 28, 2018 Share Posted September 28, 2018 (edited) Here is a set of new logs. This morning one CPU core was locked @100%. I will comment on your post later. diagnostics_28.09.2018.zip tuerke-diagnostics-20180928-0734.zip Edited September 28, 2018 by Fireball3 Quote Link to comment
Alex R. Berg Posted September 28, 2018

Interesting, the 'defunct' processes. I've seen that myself once, a long time ago. 'defunct' processes are apparently dead processes, kept around because of some crash or because their parent, i.e. cache_dirs, exited. However, those processes should not be running, so they cannot generate CPU load. I need to see what is causing the CPU load. Can you run the following, and keep doing it until you get 100%:

top -b -n1 | head -20 > top-output.txt; cat top-output.txt

I have attached two files that collect all the diagnostics I need and also zip the result. Run cache_dirs_diagnostics_generate_zip.sh and keep running it until you see the 100% CPU process in the list of top processes.

2.2.0j added process management where each disk is scanned in its own process. Apparently there's a problem with that. Did you stop the array at any time before the defunct processes appeared? (Probably not.) How many disks do you have? (Not that it should matter.)

Best Alex

Attachments: cache_dirs_diagnostics.sh, cache_dirs_diagnostics_generate_zip.sh
Alex R. Berg Posted September 28, 2018

I've made some changes to cache_dirs, and attached the file. Download it to somewhere, and then do:

sudo chmod 755 ./cache_dirs; ./cache_dirs -q; ./cache_dirs -i Video -e Audio -e Backup -e Fotos -e Spiele -l on -T

The last -T parameter disables the multithreaded disk scan, and makes it behave more like the old 2.1.0 cache-dirs. See if the problem appears both with and without -T.

To get diagnostics you can now run (on the new attached cache_dirs):

./cache_dirs -L

When you reboot the machine, or start cache_dirs from the plugin, the installed cache_dirs plugin will be running, not the attached script.

Attachment: cache_dirs
Alex R. Berg Posted September 28, 2018

Ah, please use this cache_dirs instead for your tests. I noticed the defunct process happened at 3:42, which is right after the mover started. But I still cannot reproduce the issue.

Attachment: cache_dirs
Alex R. Berg Posted September 28, 2018

You could also try adding -U 100000 to the cache_dirs command to set the memory limit higher. Maybe the system is killing the process, or suspending it because of lack of memory. Just a guess.

./cache_dirs -i Video -e Audio -e Backup -e Fotos -e Spiele -l on -U 100000
Fireball3 Posted September 28, 2018 Share Posted September 28, 2018 (edited) OK, thanks for your effort! Here is the plan: 1. check if the GUI is lying. Execute the ps and pstree right after reboot to have evidence. 2. kill all running instances of cache_dirs ./cache_dirs -q 3. run the one you posted above -without -T (wait until it locks up again, dump diagnostics) ./cache_dirs -i Video -e Audio -e Backup -e Fotos -e Spiele -l on -with -T (see if it locks up again, dump diagnostics) ./cache_dirs -i Video -e Audio -e Backup -e Fotos -e Spiele -l on -T -without -T but with memory limit -U100000 if it still locks up ./cache_dirs -i Video -e Audio -e Backup -e Fotos -e Spiele -l on -U 100000 Quote I noticed the defunct process happened at 3:42 which means right after moving started. But I still cannot reproduce the issue. I have to check the log, but my mover has no timed job. I fire up the move by hand if necessary. Edit: I picked the wrong diagnostic.zip in the hurry. The file is from yesterday evening 20:14... Edit2: I was up late yesterday. By that time, the S3sleep plugin should have shutdown the server but it wasn't. That was suspicious, but the CPU wasn't locked up by that time. I will post the correct diagnostics dumped this morning when I'm home later. Maybe there is some clue within. The S3sleep is logging if a condition is preventing shutdown. Edited September 28, 2018 by Fireball3 Quote Link to comment
Alex R. Berg Posted September 28, 2018

Yep, exactly, try that and we'll see the result. I'm very interested in seeing a snapshot of the 100% CPU spike, and the cache log at the same time. The diagnostics should produce that, but your job is to fire it when it's at 100% CPU:

./cache_dirs -L

I'm also very interested in seeing it at a time when there are defunct processes again, and whether the 100% CPU problem always coincides with the defunct processes.

I found a better way to do the ps command, and have updated the attached cache_dirs so its log diagnostics are better. If you run the attached find_defunct.sh, it will wait and do the scan when it finds defunct processes (if it works).

Attachments: cache_dirs, find_defunct.sh
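For reference, a watcher along those lines could look like the sketch below. This is my own guess at the shape of find_defunct.sh (the actual attached script may well differ): zombie processes show up with state "Z" in ps output, so poll until a cache_dirs-related zombie appears, then dump a diagnostic snapshot once.

```shell
#!/bin/bash
# Hypothetical sketch of a defunct-process watcher like find_defunct.sh;
# the real attached script may differ.

# Succeeds if the given "ps -e -o stat= -o args=" output contains a line
# for a defunct (zombie, state "Z") process mentioning cache_dirs.
has_defunct() {
  grep -q '^Z.*cache_dirs' <<< "$1"
}

watch_for_defunct() {
  while true; do
    if has_defunct "$(ps -e -o stat= -o args=)"; then
      # Capture a one-shot diagnostic snapshot, then stop polling.
      { date
        ps -e x -o ppid,pid,stat,etime,cputime,args
        pstree -p | grep -2 cache_dirs
      } > "/tmp/cache_dirs_defunct_$(date +%s).txt"
      break
    fi
    sleep 30
  done
}

# Run it in the background and leave it: watch_for_defunct &
```

Note the pattern only catches zombies whose args mention cache_dirs; defunct find/sleep children might need a looser match on their parent PID instead.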
Fireball3 Posted September 28, 2018

Here I am. I updated the diagnostics in this morning's post; I can't see anything of use. Attached to this post is the first cache_dirs_diagnostics, together with a screenshot of the cache_dirs GUI.

I'm not sure whether the log generated with -L shows the running plugin or the instance that is pulling the log data!? Therefore I attached the ps and pstree output. cache_dirs -q says it is not running, which means the GUI is not lying.

I fired up the modified version as planned:

./cache_dirs -i Video -e Audio -e Backup -e Fotos -e Spiele -l on

The second file shows the logs right after the script start. The GUI is also showing the plugin status "running". find_defunct.sh is also running.

I think I managed to remove the "mover |& logger" entries spamming the log: the mover settings had logging enabled.

# Generated mover schedule:
0 0 1 * * /usr/local/sbin/mover |& logger

Once it is set to disabled, the file \flash\config\plugins\dynamix\mover.cron contains:

# Generated mover schedule:
0 0 1 * * /usr/local/sbin/mover &> /dev/null

It seems there is no switch to fully disable the mover.

Attachments: cache_dirs_diagnostics.zip, cache_dirs_diagnostics_arberg_mod_28.09.2018_2252.zip
Alex R. Berg Posted September 29, 2018

Yeah, looks good so far. It seems you ran the cache_dirs -L diagnostics immediately after starting cache_dirs. That's too early to get much info, because cache_dirs has only just started. But anyway, you can report back when you have more details.

Best Alex
Fireball3 Posted September 30, 2018

8 hours ago, Alex R. Berg said:
"It seems you ran the cache_dirs -L diagnostics immediately after starting cache_dirs."

No, the dynamix plugin didn't start at start-up; the cache_dirs -L was the only instance running.

Your modified script, invoked with

./cache_dirs -i Video -e Audio -e Backup -e Fotos -e Spiele -l on

didn't block the shutdown. I will confirm with further runs. How do I install the modded version instead of the one from the dynamix suite? And finally, what's wrong with the autostart of the plugin?
Alex R. Berg Posted September 30, 2018 Share Posted September 30, 2018 (edited) It is very surprising to me if the 2.2.0a version I sent does not give the same issues when executed without the -T switch. The only change is how I report idle time of disks, I cap the reporting at 9999s when to avoid the crazy numbers you got when all disks where spun down. I wonder it its somehow related to the spun-down situation. I've create a test-plugin for you and added logging during startup. Copy txz to \\tower\flash\config\plugins\dynamix.cache.dirs and replace the plugins\dynamix.cache.dirs.plg (and backup the originals so you can roll back). The updated plugin has the multithreaded param in the plugin-page http://tower/Settings/FolderCachingSettings I've also attached a history of cache-dirs scripts, so you can test them individually for the defunct tests in case you don't have easy access to the cache_dirs via plugins. Best Alex dynamix.cache.dirs.txz dynamix.cache.dirs.plg cache_dirs-scripts.zip Edited September 30, 2018 by Alex R. Berg Quote Link to comment
Fireball3 Posted September 30, 2018

OK, copied the files and rebooted the server. Checked the plugin status in the GUI and... tadaaaa... running!

Attached a cache_dirs_diagnostics. find_defunct.sh is now also running, just in case.

Attachment: cache_dirs_diagnostics_30.09.2018.zip
Alex R. Berg Posted October 2, 2018

Great to hear it's running now. I guess that means it somehow died with the old version.

I just noticed you are only caching your share 'Video'. The excludes have no effect, because you use an include. This is the list of directories matching your include/exclude criteria, as seen in the syslog. It's probably what you want; I just wanted to let you know, now that I saw it.

Sep 30 23:26:00 Tuerke cache_dirs: ---------- Caching Directories ---------------
Sep 30 23:26:00 Tuerke cache_dirs: Video
Sep 30 23:26:00 Tuerke cache_dirs: ----------------------------------------------

Best Alex
Fireball3 Posted October 2, 2018 Share Posted October 2, 2018 (edited) 1 hour ago, Alex R. Berg said: The excludes have no effect, because you use include. OK, I can adjust that. Is the new version available via the plugin update? A friend of mine also had the plugin updated and then it wouldn't start. I could tell him to update as the bug seems fixed. Edit: I suspect some content is not cached. Is there a way to check what drives are and what not? When I'm accessing the share - just browsing, it will pause to spin up some discs. I checked the GUI and found only 3 disks spinning. The share definitely spans mor than those 3 disks. Edited October 2, 2018 by Fireball3 Quote Link to comment
Alex R. Berg Posted October 2, 2018

No, it's not available yet; I prefer to wait a few weeks to see if it continues to work for you. Maybe I'll commit it one of these days and send it along to dynamix to create a new plugin release. If you just send him the plg and the txz, he can do the same as you, but he can also wait.

Best Alex