Safety

July 17, 2008

So, as far as safety features go, what is or should be implemented? How, and by whom?

<snip>

Joe's script does well, but I think some of the mail part should be externalized.

</snip>

I agree 100%... at the time I wrote the original e-mail alert I did not consider that many other add-on packages could be routing mail externally.

The mail program should be separate, and usable by whatever needs it.

As far as a consistent "subject" I disagree a tiny bit. The e-mail "subject" for an ERROR condition should be very different than the "subject" for a daily "STATUS" message showing usage, etc. That way, I can set up in-box rules to file away the status messages, and flag the errors for more immediate attention. If the subjects look too similar, an error condition might not be noticed.

Joe L.

July 17, 2008

Joe, I think we all agree and mean the same thing.

How my clusters do it is pretty simple:

[Hostname or IP][Classification][Error Code] Some words

so an example would be

[TOWER][INFO][107] Configuration backup

I think we should fork this thread, as I can see lots of discussion on this one.

July 17, 2008

As far as a consistent "subject" I disagree a tiny bit.

By consistent, I meant standardized. I agree messages have to "look" different.

But the identification fields of the messages should be consistent as far as HOST, timing, STATUS, program, and MESSAGE.

Example:

[hostname] DATETIME CONDITION PROGRAM "MESSAGE"

[Tower] 07/16/08 22:22:22 ALERT hddtemp /dev/sda is at 40c.

[Tower] 07/16/08 22:22:22 STATUS smartcheck S.M.A.R.T. Health is OK.

[Tower] 07/16/08 22:22:22 WARNING smartcheck S.M.A.R.T. Health /dev/sda needs review.

[Tower] 07/16/08 22:22:22 ERROR syslogcheck: ide error detected in /var/log/syslog.

[Tower] 07/16/08 22:22:22 WARNING drivespace drive /dev/md1 utilization > 90%.

[Tower] 07/16/08 22:22:22 CRITICAL drivespace drive /dev/md2 utilization > 97%.

[Tower] 07/16/08 22:22:22 ERROR drivemount: /dev/md1 is not mounted.

[Tower] 07/16/08 22:22:22 STATUS array parity sync active progress 60%.

[Tower] 07/16/08 22:22:22 STATUS array parity sync complete.

So we have the following conditions.

STATUS Notification condition

ERROR Notification condition which could cause problems.

ALERT (level 1 condition) requires action.

WARNING (level 2 condition) requires action.

CRITICAL (level 3 condition) Requires action.

July 17, 2008

I think we should fork this thread as i can see lots of discussion on this one.

Where to? user customizations?

July 17, 2008

As far as a consistent "subject" I disagree a tiny bit.

By consistent, I meant standardized. I agree messages have to "look" different.

But the identification fields of the messages should be consistent as far as HOST, timing, STATUS, program, and MESSAGE.

I could live with something like that, but to make it easier to sort the mail messages, you might want to make the date format something that sorts lexically.

[Tower] 2008-02-17 22:22:22 CRITICAL: turbo encabulator prefabulated amulite spurving bearing alignment is required.

I know the date is not in an international format, but it will sort correctly by date/time when you sort by title.
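A minimal sketch of how a script might build a subject line in that layout. The function name format_subject and its argument order are made up for illustration and are not part of any existing unRAID script; only the standard date command is assumed.

```shell
# Build a subject line of the form:
#   [host] YYYY-MM-DD HH:MM:SS CONDITION program: message
# format_subject is a hypothetical helper name used only in this sketch.
format_subject() {
    host=$1
    condition=$2
    program=$3
    message=$4
    # An optional 5th argument overrides the timestamp (handy for testing).
    stamp=${5:-$(date '+%Y-%m-%d %H:%M:%S')}
    printf '[%s] %s %s %s: %s\n' "$host" "$stamp" "$condition" "$program" "$message"
}
```

Called as format_subject "$(hostname)" WARNING drivespace "drive /dev/md1 utilization > 90%", it prints one line that can be used as the mail subject. Because the year comes first, sorting by subject also sorts by date and time within one host.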

Joe L.

July 17, 2008

Darn bearings! Do you have to deal with that problem very often?

July 17, 2008

How about you define the interface for a notifier black box, and separate the notifier from the base applications? The notifier can be self-contained and modularized so that people can add pager, SMS messaging, e-mail, even SNMP traps as modules to the alerter.

Define the interface, and applications will come.

July 17, 2008

NASuser, why do your posts inevitably end up as trolls? I'm really sick and tired of them, particularly the snide swipes at Tom and trying to "call him out" like a gunfighter. When you write your own software application, you can decide what features to add and when. Until then, no one hired you as a project manager for unRAID. ... We don't need the trolls. Bashing Tom and the fact unRAID development is not happening on any individual's preferred schedule or priorities is not helpful. It has gotten old and stale.

Come on bubbaQ, tell us what you really think!

I agree with most of what you say. This is Tom's tugboat, we're just passengers. Sometimes I wish he would chime in more often, but he has been very consistent in the kinds of threads he engages in (users having really difficult problems that even Joe L. and RobJ can't solve), and not in the religious arguments about features and functions.

NASuser, why'd you change your name to NAS?

July 17, 2008

Not a troll. He stated a situation, observation and an opinion.

While not perfectly tactful, he did say "I would love to read Tom's opinion on this subject"

Can we drop it, move along, and discuss the core reason for this thread?

Please

July 17, 2008

Hey guys, sometimes things don't read the same on posting as was intended in original thought. I am extremely proud of the unRAID Community here & couldn't produce the product without it & I am pretty thick-skinned and always give every poster (and support emailer) the benefit of any doubt.

The idea of monitoring temperature is a very good one & we really want to do it as part of a more comprehensive disk health package. Some things complicate this:

a) many older drives don't support SMART, or support it with various quirks, so there needs to be a way to configure this disk-by-disk;

b) the drive temperature reporting can be notoriously inaccurate. A lot depends on where the sensor is located & these sensors are not typically calibrated anyway. One way around this is to develop a temperature "profile" for a disk.

c) needs to be a way to handle "spikes", etc.

d) probably more issues I'm not thinking of.
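One possible way to deal with the "spikes" in (c), sketched here only as an illustration: report a problem only after the temperature has stayed at or above a limit for several readings in a row. The function name sustained_over, the limit, and the count of three readings are all assumptions, not anything unRAID does today.

```shell
# Print "ALERT" only once the temperature has stayed at or above the limit
# for a number of consecutive readings, so a single spike is ignored.
# Reads one integer reading per line on stdin.
# sustained_over is a hypothetical helper name used only in this sketch.
sustained_over() {
    limit=$1
    needed=${2:-3}      # consecutive readings required (assumed default)
    run=0
    while read -r reading; do
        if [ "$reading" -ge "$limit" ]; then
            run=$((run + 1))
        else
            run=0       # any cooler reading resets the count
        fi
        if [ "$run" -ge "$needed" ]; then
            echo "ALERT"
            return 0
        fi
    done
    echo "OK"
}
```

For example, a log with one reading per line could be checked with sustained_over 45 3 < temps.log, where temps.log is a made-up file name.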

So, while this is a great feature, it will take some work & we already are about to be skinned by people wanting other features... so, it has to go onto Ye Olde Laundry List.

Side note: an order of multiple systems always gets your feature request moved to the top

July 18, 2008

Tom,

We've been discussing a more modular approach to monitoring system health. Basically, there are multiple parameters that indicate "health" of a server. Individual drive temperature is only one of them. Other "SMART" related parameters include reallocated sectors, and those pending reallocation.

There are potentially many other parameters, equally important, that a monitor might report and that have nothing to do with "SMART." Those might include array status (if a disk has failed), low free space on one or more disks (time to watch for a good sale), or low free memory (something is filling the syslog and the server is running out of RAM for processes).

We are proposing a modular "black box" approach. I can envision a specific directory on the flash drive with "plug in" scripts, each designed to monitor one aspect of server health. A driving "black box" process can run each "health plug in" in turn, consolidate their output, and then route it as appropriate via an "alert plug in." It might use whatever alert method is best for a given user (e-mail, YAC client, etc.). At minimum, if nothing else is configured, a message can be routed to the system console. It might also be displayed on the emhttp web page interface.
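A rough sketch of what such a driving process could look like, just to make the idea concrete. The directory /boot/health.d, the *.sh naming convention, and the HEALTH_DIR and ALERT_CMD variables are all assumptions invented for this example; nothing like them exists in unRAID today.

```shell
# Run every "health plug in" script in a directory, collect the output,
# and hand the combined report to an "alert plug in".
# HEALTH_DIR, ALERT_CMD and the *.sh convention are hypothetical.
HEALTH_DIR=${HEALTH_DIR:-/boot/health.d}
ALERT_CMD=${ALERT_CMD:-cat}     # default: just print to the console

run_health_checks() {
    dir=${1:-$HEALTH_DIR}
    for plugin in "$dir"/*.sh; do
        [ -f "$plugin" ] || continue      # no matches, or not a regular file
        # Run through sh so the script need not be marked executable.
        sh "$plugin" 2>&1
        status=$?
        # A failing plug in should not stop the others; note it and move on.
        [ "$status" -eq 0 ] || echo "ERROR ${plugin##*/}: exited with status $status"
    done
}

report_health() {
    # ALERT_CMD is left unquoted on purpose so it may carry arguments.
    run_health_checks "$1" | $ALERT_CMD
}
```

Each plug in only has to print lines in the agreed message format; swapping the alert method means changing ALERT_CMD and nothing else.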

I know that so far, most of the functionality has been embedded in emhttp. I think that it cannot be everything to everybody. It can, however, provide the infrastructure needed. We've been looking for feedback on implementation of a series of scripts, invoked if present by emhttp at certain events (before array start, after array start, before array stop, after array stop, etc.). These would give those of us with development skills the ability to work through some of the "laundry list." An additional improvement might be a way to communicate with emhttp through an API other than the web. Since you already have struggled to learn how to read the disk temperature, it would be nice not to have to duplicate that effort. The "health plug in" I referred to could evolve as we better understand how to get a drive's temperature.

Personally, I think that array "Safety" should be high on the laundry list, equally high as NFS file system support. The "Safety" features include e-mail alert and UPS support for a clean shutdown in the event of a prolonged failure. I know there are a million possible UPS devices. You cannot provide support for every one. One at a time, we can post solutions... and "plug in scripts."

Provide the infrastructure and the "Safety" scripts will be developed. Will we get it perfect the first time? no.. but with a bit of polish, the scripts will quickly be written and refined.

You can focus on the features involving the user interface, additional file-system support, the kernel, and the "md" driver. I fear that if emhttp does not provide the hooks needed, it will eventually be bypassed, and a substitute developed to provide the needed hooks. I even experimented with a crude proxy filter to capture the commands sent to emhttp. It might work, but having the hooks built in would be perfect.

The biggest issue will be the safe execution of these scripts without endangering emhttp. You do not want it to block forever waiting for a script to finish. Clearly, it does need to wait... otherwise, the shutdown tasks or startup tasks might not be completed before the emhttp process continues with its task.
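One way to get "wait, but not forever" is to run each event script under a time limit. The sketch below assumes the GNU coreutils timeout command is available on the server, which may not be true of a stock unRAID install. The function name run_hook and the 30-second default are invented for the example.

```shell
# Run an event script if it is present, but give up after a time limit so
# the caller never blocks forever.
# run_hook and the 30-second default are hypothetical; "timeout" is the
# GNU coreutils command and is assumed to be installed.
run_hook() {
    script=$1
    limit=${2:-30}
    [ -f "$script" ] || return 0          # no script present: nothing to do
    timeout "$limit" sh "$script"
    status=$?
    if [ "$status" -eq 124 ]; then        # 124 means timeout stopped the script
        echo "WARNING hook: $script did not finish within ${limit}s" >&2
    fi
    return "$status"
}
```

The caller still waits for a well-behaved script to complete, so startup and shutdown tasks finish in order, but a hung script costs at most the time limit.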

Time to revisit the laundry list with all this in mind. I'd like to see a "server health" screen on the web-interface. I would like it to invoke a specific script on /boot. Initially, the script can just cat syslog. You can just display its output on the "server health" link in a textarea with a scroll bar. We'll refine the script, and argue, and resolve our arguments with the best of everyone's ideas....

Joe L.

July 18, 2008

Joe L... Eloquent and Well Said.

July 18, 2008

Two beers and a movie later ...

I think that I am about the biggest safety advocate in these forums, and have been suggesting features to force the server to shut down for safety's sake under a variety of circumstances, including hot disks, failed disks, and power failures. As Tom mentions, there are a number of real world issues that have to be solved to make these types of features work in the real world.

I am going to stay out of the technical fray. A number of great brains are looking at how to add safety features to unRAID. I am less interested in the how and more interested in the what. My only suggestion is to make the functionality accessible so that the average unRAID user can make it work.

Looking at safety, we really need to closely consider how users have actually lost data, and address these high likelihood, high impact problems.

1. Pressing the restore button at the wrong times. This has bitten more users in the butt than any other issue.

2. Problem rebuilding a failed (or upgraded) disk. One bad sector on any other disk will lead to lost data. Doing a parity check routinely is an easy precaution that few users know to do. An ability to schedule these would be an important safety feature. A post just today indicated a user that had not done a parity check since April, and he had over 1000 parity errors. How many other users think they are protected but aren't? Big safety exposure here!

3. At least one user accidentally assigned a data disk to the parity slot. Deadly. GUI changes to make this hard to do would be a good safety feature.

4. People don't know how hot is too hot and run their disks at or near the edge of the temperature envelope. Perhaps having unRAID color code the temps on the main tab to blue (cool), green (good), yellow (warm), or red (hot) would let users know they have a problem. Fan failures are very rare, but setting up hot-running unRAID servers is not.

I'm glad this is high on Tom's radar and look forward to these features being added to unRAID in the not too distant future.

Anyone about to order multiple systems, please tell Tom to move this feature to the top of the list!

July 18, 2008

Two beers and a movie later ...

<snip>

Looking at safety, we really need to closely consider how users have actually lost data, and address these high likelihood, high impact problems.

An interesting and valid approach to this problem.

1. Pressing the restore button at the wrong times. This has bitten more users in the butt than any other issue.

No argument from me here. Currently, I like "Delete Disk Configuration" with instructions something like "Pressing this button will remove the current system.dat configuration file and create a new one based on the currently assigned and working disks." I like the idea of it first going to a new web-page with additional instructions, warnings, and another "I'm sure" checkbox. I like the idea of only enabling the button if a disk has been marked for removal from the array. Perhaps a new value is needed on the devices page: "Delete Disk from Configuration." It would be equivalent to "unassigned," but unless it is chosen for a currently assigned disk, the "Delete Disk Configuration" button would not be enabled.

Just about any other need for this button can be handled by asking the user to delete or rename the system.dat file from the flash drive.

2. Problem rebuilding a failed (or upgraded) disk. One bad sector on any other disk will lead to lost data. Doing a parity check routinely is an easy precaution that few users know to do. An ability to schedule these would be an important safety feature. A post just today indicated a user that had not done a parity check since April, and he had over 1000 parity errors. How many other users think they are protected but aren't? Big safety exposure here!

An option to schedule this on a weekly/bi-weekly/monthly basis is a great idea. It takes very little time to implement, and will help many discover issues before they are forced to recover their data.

3. At least one user accidentally assigned a data disk to the parity slot. Deadly. GUI changes to make this hard to do would be a good safety feature.

Something like invoking vol_id on the potential "partition" would help here. If any file-system is detected (reiserfs, ext3, ntfs, fat) then a warning and an additional "I'm sure" checkbox is warranted.

Here is the output on my parity drive (sdb1), and one of my data drives (hda1):

root@Tower:/boot# vol_id /dev/sdb1

/dev/sdb1: unknown volume type

root@Tower:/boot# vol_id /dev/hda1

ID_FS_USAGE=filesystem

ID_FS_TYPE=reiserfs

ID_FS_VERSION=3.6

ID_FS_UUID=f0241f95-552b-4488-95b8-28488e09873e

ID_FS_UUID_SAFE=f0241f95-552b-4488-95b8-28488e09873e

ID_FS_LABEL=

ID_FS_LABEL_SAFE=
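The check could be as simple as looking for a non-empty ID_FS_TYPE line in that output. A small sketch follows; the helper names fs_type_from_vol_id and looks_like_data_disk are invented for illustration, and the only thing assumed about vol_id is the output format shown above.

```shell
# Read vol_id-style output on stdin and print the detected file-system
# type, if any. fs_type_from_vol_id is a hypothetical helper name.
fs_type_from_vol_id() {
    sed -n 's/^ID_FS_TYPE=//p' | head -n 1
}

# Succeed (exit 0) when the output on stdin names a file system, meaning
# that assigning the device as parity would overwrite recognizable data.
# looks_like_data_disk is also a hypothetical name.
looks_like_data_disk() {
    fstype=$(fs_type_from_vol_id)
    [ -n "$fstype" ]
}
```

Used as vol_id /dev/hda1 | looks_like_data_disk && echo "WARNING: file system found", it would flag my data drive but stay quiet for the parity drive, which reports no ID_FS_TYPE line.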

4. People don't know how hot is too hot and run their disks at or near the edge of the temperature envelope. Perhaps having unRAID color code the temps on the main tab to blue (cool), green (good), yellow (warm), or red (hot) would let users know they have a problem.

Great idea. Easy for Tom to implement, and it makes it much easier for inexperienced users to determine if a problem exists in their array.
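The mapping itself is only a few comparisons. In the sketch below the cut-off temperatures (30, 40 and 45 degrees C) are placeholders picked purely for illustration; nobody in this thread has proposed actual numbers, and the function name temp_color is made up.

```shell
# Map a whole-number drive temperature in degrees C to a display color.
# The thresholds are illustrative assumptions, not recommended limits.
# temp_color is a hypothetical helper name.
temp_color() {
    t=$1
    if   [ "$t" -lt 30 ]; then echo blue      # cool
    elif [ "$t" -lt 40 ]; then echo green     # good
    elif [ "$t" -lt 45 ]; then echo yellow    # warm
    else                       echo red       # hot
    fi
}
```

Keeping the thresholds in one place like this would also make them easy to adjust per disk if the temperature "profile" idea is pursued.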

Fan failures are very rare, but setting up hot-running unRAID servers is not.

I respectfully disagree here. Fan failures are common, especially after the fans have been spinning for several years. Normally fans start making noise before they completely fail, but since we often have our servers in out-of-the-way places, the noise might not be noticed.

I'm glad this is high on Tom's radar and look forward to these features being added to unRAID in the not too distant future.

Anyone about to order multiple systems, please tell Tom to move this feature to the top of the list!

I agree with your approach. Learn from where most of the support threads require assistance.

July 18, 2008

Rather than the weekly parity check which requires spinning up all drives for an extended time, a read-verify pass, 1 or 2 drives at a time, would be less taxing, and much less impact on regular use of the server, and a good PM test to discover bad sectors.

Also, I don't recall whether there is an ability to do a read-only parity test, rather than the full parity rebuild.... the former would likely be faster, and sufficiently protective for periodic use between full parity rebuilds.

July 18, 2008

Discussion of new safety feature suggestions continued here

July 18, 2008

Rather than the weekly parity check which requires spinning up all drives for an extended time, a read-verify pass, 1 or 2 drives at a time, would be less taxing, and much less impact on regular use of the server, and a good PM test to discover bad sectors.

Also, I don't recall whether there is an ability to do a read-only parity test, rather than the full parity rebuild.... the former would likely be faster, and sufficiently protective for periodic use between full parity rebuilds.

#1. Best you could do is to use "dd" to read every block from each disk in turn... however, if it takes 4 hours to read an entire disk it will take 40 hours to read 10 disks. The parallel read of all disks during a parity check is much faster. Granted, you could schedule one disk to be read a day over 10 days, but it is as easy to schedule a parity check.

#2. There is no concept of a parity build vs. a parity check in the "md" driver. You issue the exact same command to do both. The one difference is that in a rebuild the parity disk is marked as invalid, so it is entirely written and not read at all. Therefore, once parity is valid, pressing the "check parity" button is a read-only operation unless it is determined the value stored on the parity drive is incorrect. If that is determined, it is overwritten with a new calculated parity value.

It has been requested that the parity "check" not perform corrective actions, but instead only report what is found. A second "Parity Calculate" feature would then do the correction. This would potentially be useful in some situations.

Joe L.

July 18, 2008

It has been requested that the parity "check" not perform corrective actions, but instead only report what is found. A second "Parity Calculate" feature would then do the correction. This would potentially be useful in some situations.

OK you answered my questions.

We need

* Parity Validate

* Parity Calculate

* Speed Threshold Control

July 18, 2008

Rather than the weekly parity check which requires spinning up all drives for an extended time, a read-verify pass, 1 or 2 drives at a time, would be less taxing, and much less impact on regular use of the server, and a good PM test to discover bad sectors.

Also, I don't recall whether there is an ability to do a read-only parity test, rather than the full parity rebuild.... the former would likely be faster, and sufficiently protective for periodic use between full parity rebuilds.

The "md" driver code does not currently have a read-only verification process.

Since we are mostly interested in reading every block from every disk so we can take advantage of the SMART system, something like this will do each disk in turn. Remember, if the "read" fails, the "md" driver should calculate the correct data and write it to the failed drive. That should re-allocate the bad sector.

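# Read every block of each array device in turn, discarding the data.
for i in /dev/md*
do
    [ -b "$i" ] || continue    # skip anything that is not a block device
    echo "Full read of $i started $(date)"
    dd if="$i" of=/dev/null bs=4k
done
echo "Full read of all disks completed $(date)"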

It will take a very long time to complete. Days probably if you have a large array.

Joe L
