Automated Diagnostics File Gathering Script

danioj · March 22, 2016

WARNING - PLEASE CONSIDER THIS A WORK IN PROGRESS. I HAVE TESTED IT ON MY SERVER AND THERE WAS NO ISSUE, BUT THAT DOESN'T MEAN IT IS IN ANY WAY FREE FROM BUGS/ISSUES, SO PLEASE USE AT YOUR OWN RISK UNTIL IT HAS BEEN TESTED FURTHER - WARNING

While trying to support the forums today, a user's situation inspired me to try and do something about it.

He was faced with an unRAID lockup with NO access to Telnet, GUI, Shares, SSH or otherwise. My first thought in trying to help him was "Please post a diagnostics file .." BUT of course he was not able. A Hard-Reset was on the cards and some potentially valuable information was going to be lost.

As well as this particular user, we have all been there. I have been told it. I have told it: ensure you grab a diagnostics file BEFORE resetting, if possible. However, as this particular user's experience reminded me, sometimes this is not possible. How much easier would it be if we could deal with that in some way?

Cue danioj's brain. So, I have spent the last few hours playing with and testing a script to (in conjunction with dynamix cron jobs) "regularly" and automatically grab diagnostics files at intervals (overwriting any previously automatically collected diagnostics file), meaning that there is ALWAYS a recent diagnostics file available for posting in a support request.

Now, I was a coder in a past life 20 years ago but I am FAR from one now. I am sure there are issues with this code, and PLEASE remember that I know it is FAR from perfect - but the making of it has come from a good place. So be gentle!

If anyone more competent in bash scripting is out there and wants to tear it a new one and offer suggestions for improvement then go ahead. There might not even be a need for this and it is dealt with in another way OR has been done before. I don't know. Either way, all feedback is a gift and is welcome. You won't hurt my feelings

Just trying to help ...

There are two files (code below):

- automated-diagnostics.sh is to be placed in /boot/

- automated-diagnostics.cron is to be placed in /boot/config/plugins/dynamix/

Installation

- Place the files as indicated above

- Re-Apply Settings in the Scheduler section of Settings to load the cron file*

- chmod +x /boot/automated-diagnostics.sh

* I have noticed that the dynamix cron files are not loaded just because you create a file BUT you can force this. Just go into the Settings>Scheduler AND make a change (so the Apply Button gets un-greyed out) and change it back (Apply Button remains un-greyed out) and Hit Apply (essentially not changing ANYTHING) then the cron file gets loaded.
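
The file placement and chmod steps above could be scripted roughly as follows. This is only a minimal sketch: the function name install_automated_diagnostics and its two arguments are made up for illustration, and it assumes the two attachments were downloaded (still with their .txt suffix) into one directory. The Scheduler "Apply" step still has to be done by hand in the GUI.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original post): copies the two files
# into place and makes the script executable.
# $1 = directory holding the downloaded attachments (with the .txt suffix)
# $2 = flash mount point; /boot on a real server (a parameter only so the
#      function can be tried out against a scratch directory first)
install_automated_diagnostics() {
    local src_dir="$1"
    local boot_dir="${2:-/boot}"
    local cron_dir="$boot_dir/config/plugins/dynamix"

    mkdir -p "$cron_dir" || return 1
    # the attachments carry a .txt suffix that has to be dropped
    cp "$src_dir/automated-diagnostics.sh.txt" "$boot_dir/automated-diagnostics.sh" || return 1
    cp "$src_dir/automated-diagnostics.cron.txt" "$cron_dir/automated-diagnostics.cron" || return 1
    chmod +x "$boot_dir/automated-diagnostics.sh"
}
```

On the server itself this would be called as install_automated_diagnostics /path/to/downloads, followed by the Scheduler re-apply described above.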

automated-diagnostics.sh

#!/bin/bash


# automated-diagnostics.sh

# Change Log

# 20160322 - changed automated-diagnostics.sh to use the absolute path to the diagnostics utility in unRAID. It errors in cron's non-interactive shell otherwise.


# This simple script grabs the diagnostics zip file and renames it with a unique prefix so it can easily be overwritten.
# The intention is that there is always a "recent" diagnostics zip file available on the flash drive.

# The script is meant to be run "regularly" from a cron job (e.g. I have set it to run every hour) to deal with the
# scenario where you have a lockup and can't grab the diagnostics file to facilitate a support request.


diagnostics_dir="/boot/logs/"
automation_flag="ag-"


# let's make a note in the log indicating what we are doing

echo "automated grab of unRAID diagnostics zip file has commenced"

# first things first, generate a fresh diagnostics file

/usr/local/sbin/diagnostics

echo "deleting previously automatically grabbed diagnostics file(s)"

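# delete any other diagnostics files that were automatically generated
# (-maxdepth goes before -name to avoid a find warning; the pattern is quoted so the shell does not expand it)

find "$diagnostics_dir" -maxdepth 1 -name "${automation_flag}*diagnostics*.zip" -delete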

# get the name of the most recent diagnostics file (the one we just created) in the logs directory

diagnostics_file=$(ls -t "$diagnostics_dir"*diagnostics* | head -1)
diagnostics_file=$(basename "$diagnostics_file")

echo "renaming new diagnostics file to differentiate it from other diagnostics files"

# add the automation prefix to the diagnostics file

mv "$diagnostics_dir$diagnostics_file" "$diagnostics_dir$automation_flag$diagnostics_file"

echo "diagnostics file renamed and is located at $diagnostics_dir$automation_flag$diagnostics_file"

echo "automated grab of unRAID diagnostics zip file has completed"

# end of script

automated-diagnostics.cron

# run the automated-diagnostics.sh file on the hour every hour
0 * * * * /boot/automated-diagnostics.sh 2>&1 | logger > /dev/null 2>&1

I have attached the files as well. Please remember that you will need to remove the .txt from the file extension.

Change Log

20160322 - changed automated-diagnostics.sh to use the absolute path to the diagnostics utility in unRAID. It errors in cron's non-interactive shell otherwise.

automated-diagnostics.cron.txt

automated-diagnostics.sh.txt

wgstarks · March 22, 2016

Not sure how much work would be involved, all my coding experience was on punch cards (what a pita) many many years ago, but if this was setup as a plugin wouldn't it make updating easier? Just an idea.

I'll try to give this a test run tonight.

danioj · March 22, 2016

Quote from wgstarks:
> Not sure how much work would be involved, all my coding experience was on punch cards (what a pita) many many years ago, but if this was setup as a plugin wouldn't it make updating easier? Just an idea.
> I'll try to give this a test run tonight.

Maybe, but I have NO idea how to do that! It's a simple simple script though. I shall look into it tomorrow and see if I can pull one together.

If anyone reading this would like to send me some help to start (e.g. a simple Plugin to work from to get a general idea) via PM I'd appreciate it.

Thanks.

danioj · March 22, 2016

While I investigate how I might turn this into a Plugin => I have been running it for 3 hours (following the update for the diagnostics path - as indicated in OP) and it seems to be working just fine.

It is currently 12:17am (AEST - my time) and let's just say I had an unrecoverable crash 10 mins ago => I would have rebooted and, as expected, a diagnostics file that was automatically taken at 11:50pm (AEST - my time) is available in /boot/logs/ for me to post in a support thread.

Fireball3 · March 22, 2016

With respect to increased wear of the thumb drive,

I would propose to make this an option for the "clean powerdown" script.

Keep the diagnostics.zip rolling in the RAM or generate it on demand (clean powerdown invoked).

This won't cover the events when the server completely freezes, but in that case it still can be set up

as it is shown here - or simply turned on/off in the GUI of the "powerdown" plugin.

bungee91 · March 22, 2016

My question would be, is there a way to invoke this prior to doing a hard shutdown? I understand the powerdown script collects information when used, but this use case is for when we cannot do that. Could we monitor the power button input, for say a double press, which would invoke this?.. Just thinking out loud for a way to trigger it when the system is up, but any normal way to control it has stopped working.

BRiT · March 22, 2016

Quote from bungee91:
> My question would be, is there a way to invoke this prior to doing a hard shutdown? I understand the powerdown script collects information when used, but this use case is for when we cannot do that. Could we monitor the power button input, for say a double press, which would invoke this?.. Just thinking out loud for a way to trigger it when the system is up, but any normal way to control it has stopped working.

Nope, when you're hard locked up you're hard locked up.

Squid · March 22, 2016

Quote from Fireball3:
> With respect to increased wear of the thumb drive,
> I would propose to make this an option for the "clean powerdown" script.
> Keep the diagnostics.zip rolling in the RAM or generate it on demand (clean powerdown invoked).
> This won't cover the events when the server completely freezes, but in that case it still can be set up
> as it is shown here - or simply turned on/off in the GUI of the "powerdown" plugin.

Personally, I wouldn't make it an option, because users would tend to leave it enabled all the time, killing their flash drives. Keep it completely a separate script / plugin

danioj · March 22, 2016

Can I just take the time to KILL what seems to be a suggestion that (given what I am suggesting - which of course CAN be changed):

24 x 100KB writes to the Flash drive per day is going to "Kill the Flash Drive". It is NOT. IMHO it is not even going to have a "Material" impact on the drive's life expectancy compared with not using this script.

That being said, if you simply write data to a USB flash drive and put it away in a safe place for 10 years, it will most likely work again and all the data will still be there. BUT if you continue to use it over and over again, it will definitely wear out eventually. The keyword here is "eventually".

The life expectancy of a USB Flash Drive can be measured by the number of write or erase cycles. Typical USB flash drives can withstand between 10,000 and 100,000 write/erase cycles, depending on the memory technology used (MLC/SLC).

Given the amount of action a typical unRAID USB drive already gets from existing plugins' logs, adding the writes above for the sake of storing an ever "current" diagnostics file is a MORE than acceptable trade-off (if there is even one worth discussing) IMHO.
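
To put rough numbers on the figures quoted above (24 writes a day of about 100KB, 10,000 to 100,000 rated cycles), here is a small arithmetic sketch. The function names are made up for illustration. The second function is a deliberately pessimistic floor that assumes every write lands on the same flash block with no wear levelling at all; real drives spread writes around, so actual life should be considerably longer, but how much longer depends on the drive's controller and is not something this sketch can tell you.

```shell
#!/bin/bash
# Figures from the post: 24 writes a day of roughly 100 KB each.
writes_per_day=24
kb_per_write=100

# Data written to the flash drive per year, in MB (integer arithmetic).
yearly_mb() {
    echo $(( writes_per_day * kb_per_write * 365 / 1000 ))
}

# Pessimistic floor: days until a given cycle rating is used up if every
# write hit the same flash block (i.e. no wear levelling whatsoever).
worst_case_days() {
    local rated_cycles="$1"
    echo $(( rated_cycles / writes_per_day ))
}
```

Running yearly_mb prints 876 (under 1GB written per year). worst_case_days 10000 prints 416 and worst_case_days 100000 prints 4166, i.e. somewhere between a bit over a year and roughly eleven years even under the no-wear-levelling assumption.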

I have however considered an alternative, because I KNOW this is going to get people talking and that is this: Using the script with a spare USB drive AND Unassigned Devices.

My rationale is this. USB Flash Devices are cheap. Almost EVERYONE has a spare USB 2.0 port. I can have a USB drive auto mounted when the system starts using Unassigned Devices. That USB drive is for holding diagnostics files ONLY. A $2 penalty for the USB drive for having "current" diagnostics. HOWEVER, I don't think it is really needed.

Anyway ... open the discussion.

danioj · March 22, 2016

Quote from Squid:
> > With respect to increased wear of the thumb drive,
> > I would propose to make this an option for the "clean powerdown" script.
> > Keep the diagnostics.zip rolling in the RAM or generate it on demand (clean powerdown invoked).
> > This won't cover the events when the server completely freezes, but in that case it still can be set up
> > as it is shown here - or simply turned on/off in the GUI of the "powerdown" plugin.
>
> Personally, I wouldn't make it an option, because users would tend to leave it enabled all the time, killing their flash drives. Keep it completely a separate script / plugin

I have posted my feelings on the "Wear on the Flash drive" issue.

As for where this belongs, I don't think it belongs in the powerdown plugin. This is for managing the times (and they are becoming more and more frequent with the additional "Non-NAS" features of unRAID) where an unRAID system crashes, resets etc resulting in a HARD reset AND no diagnostics were manually taken by the user prior to this happening.

bungee91 · March 22, 2016

Quote from BRiT:
> Nope, when you're hard locked up you're hard locked up.

(thinking about fiber currently) ::)

Ok, maybe I thought about this incorrectly then.

The point here is to (hopefully) catch some good diagnostic information prior to a lockup, and then be able to use that after the hard reset is performed (as it would have been lost normally), right?

So we're hoping whatever is going on is logged prior to it getting to the point of failure.

For some reason I thought cron was still operating some way, and that even though we are locked up from accessing/controlling UnRAID, that if we waited, the cron job would happen and capture all the relevant information. I guess that's not the case, so disregard.

danioj · March 23, 2016

Well, I have grabbed one of dlandon's plugins and examined the files. It "seems" quite straightforward. I am going to have a go at turning this script into one!

JonathanM · March 23, 2016

Quote from danioj:
> Using the script with a spare USB drive AND Unassigned Devices.
> My rationale is this. USB Flash Devices are cheap. Almost EVERYONE has a spare USB 2.0 port.

But, since unraid licenses connected devices instead of array devices, the real cost of that extra device is a lot higher for anyone with less than a pro license.

Hardly a reason not to develop this option, but another thing to keep in mind.

In regards to the flash wear argument, perhaps allow changing the write intervals, something like defaulting to once a day on reboot, and manually switching to shorter intervals if desired. That way if someone installs it without fully understanding the impact, it wouldn't be automatically writing each hour unless they activate hourly logging, but would still have a daily diagnostics to fall back on.

danioj · March 23, 2016

Quote from JonathanM:
> > Using the script with a spare USB drive AND Unassigned Devices.
> > My rationale is this. USB Flash Devices are cheap. Almost EVERYONE has a spare USB 2.0 port.
>
> But, since unraid licenses connected devices instead of array devices, the real cost of that extra device is a lot higher for anyone with less than a pro license.
> Hardly a reason not to develop this option, but another thing to keep in mind.
> In regards to the flash wear argument, perhaps allow changing the write intervals, something like defaulting to once a day on reboot, and manually switching to shorter intervals if desired. That way if someone installs it without fully understanding the impact, it wouldn't be automatically writing each hour unless they activate hourly logging, but would still have a daily diagnostics to fall back on.

A post with excellent points and similarly excellent advice. I take on board everything written. Thank you.

MyKroFt · March 23, 2016

If ppls don't want to write it to their flash drive, since you're going to attempt to make it a plugin - how about sending it to a ftp server?

Myk

danioj · March 23, 2016

Quote from MyKroFt:
> If ppls don't want to write it to their flash drive, since you're going to attempt to make it a plugin - how about sending it to a ftp server?
> Myk

Interesting thought and not a bad "additional" feature BUT it is not a solution to the concern about "wear" on the Flash Drives.

This is because at the moment the location of the .zip file that the diagnostics script produces is "hard coded". This means that come what may when it is run, a diagnostics.zip file is getting written to the host unRAID Flash drive. IF I then sent it to an FTP server (while it might serve as a nice option) it doesn't change the fact that the "damage is done".

I "could" ask LT/ Tom if he would add a destination parameter to the script for me to facilitate saving it to an alternate destination (which I could expose as an option to the user) to deal with those who want to pre-select where it goes. They might go for that as it would be a simple change. That would also allow me to hold the zip file in Memory while I sent it to an FTP Server (to facilitate what you are asking).
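
For what the FTP idea might look like in practice, here is a minimal sketch. Everything specific in it is an assumption: the function name upload_diagnostics, the DRY_RUN switch, and any host, path or credentials you pass in are placeholders, and it assumes curl is available on the server. As explained above, the zip has already been written to the flash drive by the time this runs, so it is an extra copy, not a way around the flash write.

```shell
#!/bin/bash
# Hypothetical helper: uploads one diagnostics zip to an FTP server with curl.
# $1 = file to upload
# $2 = FTP URL ending in / (curl then keeps the local file name)
# $3 = credentials in user:password form
# Set DRY_RUN=1 to print the command instead of running it.
upload_diagnostics() {
    local file="$1" ftp_url="$2" ftp_user="$3"
    if [ ! -f "$file" ]; then
        echo "no such file: $file" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "curl -sS -T $file --user $ftp_user $ftp_url"
        return 0
    fi
    # -T uploads the file; -sS stays quiet but still reports errors
    curl -sS -T "$file" --user "$ftp_user" "$ftp_url"
}
```

A call would look something like upload_diagnostics /boot/logs/ag-somefile.zip ftp://ftp.example.lan/unraid/ user:secret, where the file name, host and credentials are all made-up examples.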

What I DON'T want to do is get into modifying that script OR managing a version of my own. I think the reasons for not wanting to do that are obvious.

I have been working on a Forum Schema today BUT I did think on the drive home about the viability of just storing it on the Array / Cache Drive. There are some pros and cons. Obviously to "access it" in the normal way the Array would have to be active. On most occasions this would not be an issue as most people would be able to access the Array (at least initially) after a crash / reboot BUT then there would be those who would take issue with this script "spinning up" their disks. Then there is the Cache drive (spinning this up seems less of an issue, not least because most people use an SSD as their Cache anyway) which seems a better option BUT there are those who don't have one.

I still think the best option is the Flash drive. That's where LT chose for it to go by default, so why change it?

Either way, I am going to have a fair crack at this Plugin at the weekend. Just thinking out loud for now.

itimpi · March 23, 2016

I think getting a destination parameter added to the existing diagnostics script sounds like an excellent idea, and is a trivial change.

Another possibility is to log to an external drive that is managed by the Unassigned Devices plugin (this could be a USB stick). Someone mentioned the issue with drives counting against the license, but I do not think that is that critical an issue as the license check only applies when the array is started. There is nothing to stop you plugging in a USB device after that and using it for this purpose, and for this sort of investigation it does not seem that unreasonable to have some manual action required to set up the logging device.

Something else that occurs to me is for the plugin to set up a 'tail -f' on the syslog to a device. This is more likely to get the last critical event, but I guess there is a faint chance of the device being used ending up corrupted if the system crashes in the middle of a write. Still seems worth trying? I know we have in the past suggested a user sets up a 'tail' writing to the unRAID USB drive in really intractable situations. However I would be very reluctant to be doing this on a regular basis to the unRAID USB drive as that definitely would not be good for its wear characteristics - but an external one could be fine as that is easier to replace if necessary, and many people are likely to have old ones lying around that are now too small to be of much use on a daily basis for general storage.
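
The 'tail' idea above could be sketched like this. The function name start_syslog_mirror is made up, and the destination path is whatever mount point your external device ends up on (that path is an assumption, not something taken from the plugin). Note that lines may still be sitting in the write cache when a crash hits, so the last few entries are not guaranteed to reach the device.

```shell
#!/bin/bash
# Hypothetical helper: mirror new syslog lines to a file on another device.
# $1 = log to follow (/var/log/syslog on the server)
# $2 = destination file, e.g. on a spare USB stick (mount path is an assumption)
# Prints the PID of the background tail so it can be stopped later with kill.
start_syslog_mirror() {
    local source_log="$1" dest_file="$2"
    # -n 0: only lines written from now on
    # -F:   keep following by name, even if the log is rotated or recreated
    tail -n 0 -F "$source_log" >> "$dest_file" 2>/dev/null &
    echo $!
}
```

Typical use would be pid=$(start_syslog_mirror /var/log/syslog /path/to/external/syslog-copy.log) when you start chasing a problem, and kill "$pid" once it has been caught.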

bonienl · March 23, 2016

One thing in your story isn't clear to me. It was not possible to use console access when the lock-up happened? It is very rare that console freezes.

danioj · March 23, 2016

Quote from bonienl:
> One thing in your story isn't clear to me. It was not possible to use console access when the lock-up happened? It is very rare that console freezes.

The main driver behind this initiative is the case where the server crashes / resets / becomes unavailable and requires a hard reset. It is not really for the case where the user has access to the CLI as they can grab the diagnostics file themselves.

I have noticed that instances of these issues have become frequent over the past year or so. This of course is a largely anecdotal statement and it is in the main my opinion BUT as means of offering some evidence a simple search of the forums can perhaps give some indication of the growing issues:

Search Term: "no telnet" 2015

https://www.google.com.au/#q=%22no+telnet%22+2015+site:lime-technology.com

Search Term: "crash" 2015

https://www.google.com.au/#q=%22crash%22+2015+site:lime-technology.com

Search Term: "hard reset" 2015

https://www.google.com.au/#q=%22hard+reset%22+2015+site:lime-technology.com

RobJ · March 24, 2016

jonathanm has already mentioned it, but I agree that it would be more useful if it was easily adjustable as to how often it runs. Cleaner to do in a plugin, but possible as a command line option too. And each option is associated with a preset cron statement. Options would be something like -

- Never (so the user can disable it when desired)

- Once a month

- Once a week

- Daily

- Twice a day

- 4 times a day

- Hourly

- Every 30 minutes

- Every 15 minutes

- Every 10 minutes

- Every 5 minutes

- Every 2 minutes

- Every minute (may be too fast for data collection if there are huge issues, very large logs)

- Continuous (just keeps looping through the saves)

Why so many options? (Partly because I believe in choice, never been a Steve Jobs fan.) With a long running always-on server, some may only want monthly. Then if the system crashes, you would change it to something like hourly, and if that still can't catch the issue, change it to every 5 minutes or faster, until you catch the pest. Then once problem is resolved, change it back to monthly, or daily if you aren't sure you trust it yet.
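
One way the "each option is associated with a preset cron statement" idea could be sketched is a simple lookup from an option name to a cron schedule. The option names and the function name interval_to_cron are invented for this sketch. "Never" produces no schedule at all, and "Continuous" is left out because a cron schedule cannot express it (it would need a loop instead).

```shell
#!/bin/bash
# Hypothetical mapping from a named interval to a cron schedule.
# Returns 1 for "never" (no cron line wanted) and 2 for an unknown name.
interval_to_cron() {
    case "$1" in
        never)        return 1 ;;
        monthly)      echo "0 0 1 * *" ;;      # midnight on the 1st
        weekly)       echo "0 0 * * 0" ;;      # midnight on Sunday
        daily)        echo "0 0 * * *" ;;
        twice-daily)  echo "0 */12 * * *" ;;
        four-daily)   echo "0 */6 * * *" ;;
        hourly)       echo "0 * * * *" ;;
        30min|15min|10min|5min|2min)
                      echo "*/${1%min} * * * *" ;;  # strip "min", keep the number
        1min)         echo "* * * * *" ;;
        *)            echo "unknown interval: $1" >&2; return 2 ;;
    esac
}
```

The cron file could then be regenerated from the chosen option with something like echo "$(interval_to_cron daily) /boot/automated-diagnostics.sh", followed by the Scheduler re-apply described in the first post.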

Would "auto-" be better than "ag-"?

bonienl · March 24, 2016

Perhaps it can be considered to let the powerdown plugin do a 'diagnostics' next to (or instead of) the syslog file ?

kizer · March 24, 2016

I totally think this should be part of the powerdown script. Hold in my power button as a last resort and have it pull the diagnostics for me.

Squid · March 24, 2016

Quote from kizer:
> I totally think this should be part of the powerdown script. Hold in my power button as a last resort and have it pull the diagnostics for me.

I think you mean push the power button. A hold of the power button will always force the machine off

kizer · March 24, 2016

Yeah what you said. lol

danioj · March 25, 2016

If everyone thinks this is best off in the powerdown plugin then I guess I can contact dlandon. It could be as simple as just executing the script when the button press is captured?

I am surprised that there seems to be consensus for this though, as I would like to point out that this does NOT deal with the case of lockups / resets etc.

Anyway - I'll keep it for myself as a script - I have it running every 2 hours now. I like the idea that no matter what is happening or going / WILL go on I always have recent diagnostics to hand.

EDIT: Cross Post over in the Powerdown Plugin Support Thread:

http://lime-technology.com/forum/index.php?topic=31735.msg458313#msg458313
