
SMART notifications


RobJ


The new SMART notifications are a great feature, but they need some changes.  Here's what I suggest:

 

Attribute                     Warning level   Trigger                                Optional note
  5 Reallocated_Sector_Ct     Warning         change in RAW                          "reallocated sector count has increased to %d"
187 Reported_Uncorrect        Warning         increase in RAW                        "Reported Uncorrectable count is %d"
197 Current_Pending_Sector    Critical        RAW > 0                                "Current Pending sectors is %d, MUST be repaired"
197 Current_Pending_Sector    Informational   RAW changed to 0                       "Current Pending sectors is zero again!"
198 Offline_Uncorrectable     Warning         increase in RAW                        "Offline Uncorrectable is %d"
199 UDMA_CRC_Error_Count      Informational   increase in RAW by 1                   "UDMA CRC error count is %d"
199 UDMA_CRC_Error_Count      Warning         increase in RAW more than 1            "Check for bad SATA cable"
Any attribute if (FLAG & 1)   Warning         WORST < 100 and WORST < (THRESH * 1.2) "Critical SMART attribute approaching failure"
Any attribute if (FLAG & 1)   Critical        WORST <= THRESH                        "SMART failure, backup immediately and replace"

 

As you can see, almost all are based on changes in values, not the current values.  So you would have to store the previous number for 5 attributes, although more may be added later, such as for SSDs.  For any given drive, one or more of them may not exist.
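For anyone who wants to experiment with the change-based triggers in the table, here is a minimal Python sketch. It is only an illustration under assumptions: the `Notice` class and `raw_change_notices` function are hypothetical names, the previous and current RAW values are assumed to be available as plain dictionaries keyed by attribute ID, and the "change in RAW" trigger for attribute 5 is treated as an increase to match its note.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Notice:
    attr_id: int
    level: str      # "Informational", "Warning" or "Critical"
    message: str


def raw_change_notices(previous: Dict[int, int], current: Dict[int, int]) -> List[Notice]:
    """Compare stored RAW values with new ones and apply the table's triggers."""
    notices: List[Notice] = []
    for attr_id, raw in sorted(current.items()):
        old = previous.get(attr_id)              # None on the very first poll
        delta = 0 if old is None else raw - old
        if attr_id == 5 and delta > 0:           # assumption: "change" means increase
            notices.append(Notice(5, "Warning",
                                  "reallocated sector count has increased to %d" % raw))
        elif attr_id == 187 and delta > 0:
            notices.append(Notice(187, "Warning",
                                  "Reported Uncorrectable count is %d" % raw))
        elif attr_id == 197:
            if raw > 0:                          # current value, no history needed
                notices.append(Notice(197, "Critical",
                                      "Current Pending sectors is %d, MUST be repaired" % raw))
            elif old is not None and old > 0:    # dropped back to zero
                notices.append(Notice(197, "Informational",
                                      "Current Pending sectors is zero again!"))
        elif attr_id == 198 and delta > 0:
            notices.append(Notice(198, "Warning",
                                  "Offline Uncorrectable is %d" % raw))
        elif attr_id == 199 and delta > 0:
            if delta == 1:                       # a single new CRC error is only informational
                notices.append(Notice(199, "Informational",
                                      "UDMA CRC error count is %d" % raw))
            else:
                notices.append(Notice(199, "Warning", "Check for bad SATA cable"))
    return notices


previous = {5: 0, 197: 2, 199: 10}
current = {5: 8, 197: 0, 199: 11}
for notice in raw_change_notices(previous, current):
    print(notice.level, notice.attr_id, notice.message)
```

Attributes missing from a drive simply never appear in the dictionaries, so no notice is produced for them.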

 

The "Any attribute if (FLAG & 1)" rows apply only to critical attributes, those whose FLAG value has the lowest bit set (put another way, the FLAG is an odd number).  These are the only attributes that cause a SMART test to FAIL.  All other attributes are informational, no matter how serious they may look.  The first critical test is whether the attribute is approaching the failure point, the second is whether it has already reached it.  The first test checks whether WORST is within 20% of THRESH, which could be changed to 10% (THRESH * 1.1).  The (WORST < 100) test is necessary because some drives have a THRESH almost as high as the initial value, which means they are perfect until the value drops by almost any amount, and then it's considered failed.  Previous values don't need to be stored for any of these critical attributes.  Note that the set of critical attributes differs from drive to drive.
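The two critical-attribute tests can be sketched in a few lines of Python. This is only an illustration of the rules as described above; the function name `critical_status` and the `margin` parameter are hypothetical, and the FLAG, WORST and THRESH numbers in the examples are made up.

```python
from typing import Optional

PREFAIL_BIT = 0x0001  # lowest FLAG bit: the attribute is critical (pre-fail)


def critical_status(flag: int, worst: int, thresh: int, margin: float = 1.2) -> Optional[str]:
    """Return "Critical", "Warning" or None for one attribute."""
    if not flag & PREFAIL_BIT:
        return None                    # informational attribute, threshold is ignored
    if worst <= thresh:
        return "Critical"              # SMART failure, backup immediately and replace
    if worst < 100 and worst < thresh * margin:
        return "Warning"               # approaching the failure point
    return None


print(critical_status(0x0033, 40, 36))    # within 20% of THRESH
print(critical_status(0x0033, 100, 99))   # THRESH close to the initial value, still perfect
print(critical_status(0x0032, 1, 36))     # even FLAG, not a critical attribute
```

Passing `margin=1.1` gives the 10% variant mentioned above.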

 

A CRC count that is not changing or only occasionally increases by one is not a problem.

 

196 Reallocated_Event_Count isn't needed, because it would only change when 5 Reallocated_Sector_Ct changed.

Link to comment
  • 5 months later...

It definitely wasn't forgotten. Over the past weeks I've spent considerable time researching SMART and how to improve the current implementation. The biggest struggle is the inconsistency of SMART itself. No two vendors seem to use the SMART attributes in the same way, and I even saw differences between firmware versions from the same vendor.

 

It appears there are two camps: one advocates looking at the normalized values, since this is how each vendor has built its specific implementation. The caveat is that some vendors set the threshold values for some or all attributes to 00, which basically means no threshold is defined, so it isn't possible to generate a notification on such an attribute. This could be solved by looking at different attributes and correlating their values. The challenge: I haven't found any documentation that describes how attributes are "normalized"; apparently vendors keep that information to themselves (e.g. Seagate says not to use other disk tools because only theirs give correct results, without explaining the inner details).

 

The other camp advocates looking at raw values only. Though it is true that many of these values are directly readable and interpretable, it is not always clear how they are incremented or what the exact format of the value is (binary groups, absolute number or whatever). Again, no vendor documentation is available and the SMART specification itself leaves this open. The worst thing is that the specification doesn't mandate any particular attribute or set of attributes; in theory a vendor implementing just one random attribute complies with the specification. BAD.

 

So where does that leave us?

 

The current implementation in unRAID is based on best practice from BackBlaze. They examined their faulty drives and published which SMART attributes contribute most. The outcome is a list of 5 attributes which are mostly common between vendors and "may" predict an upcoming disk failure when their raw values start to increase. There is NO reliable way to predict an imminent disk failure; sorry, but SMART can't do that.

 

The notifications system allows each individual attribute from BackBlaze's list of five to be enabled or disabled, and a custom attribute can even be added. IMHO this is the safest approach for the moment, until vendors adopt SMART better and are more open about their implementations, or a stricter specification is introduced that rules out arbitrary interpretations.

 

Link to comment

Very well said!  I think everyone that has spent much time researching SMART has come to very similar conclusions.  Some of my own thoughts are in the incomplete wiki page Understanding SMART Reports.

 

Unfortunately I have found that some highly respected people, such as our Tom and Steve Gibson and other greats, have looked at all the inconsistency and dismissed SMART as essentially useless.  My experience has been that once you understand the limitations and have examined numerous SMART reports, there is still a lot of useful info.  I have found that a blunt instrument can provide more accurate results once you examine and compare numerous readings, always taking into account the limitations.  For example, I have instruments that provide only 2 digits.  One is a digital temperature gauge, but it is clearly sampling at a high rate.  If I see mostly 34's, but a few 33's and twice as many 35's, I can call it about 34.3.  If I see just as many 34's as 35's, I can say it has risen to about 34.5.  I have gained a little precision by studying its behavior and using 10 times the readings.  I find that the same principles can apply to other blunt tools, such as SMART reports.  I can look for trends in the numbers.  I can compare with equivalent models, and by doing so, eliminate considerable inconsistency.  With care and experience, SMART *is* useful.
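The arithmetic behind the "just as many 34's as 35's" case can be shown with a tiny Python sketch. The readings below are hypothetical, and the trick only works when the real value hovers near a boundary between two displayed digits so that the display flickers between them.

```python
from collections import Counter


def estimate(readings):
    """Average many coarse readings to recover a finer estimate."""
    return sum(readings) / len(readings)


# Hypothetical sample: the display shows 34 and 35 equally often.
readings = [34, 35, 34, 35, 35, 34, 34, 35, 35, 34]
print(Counter(readings))
print(round(estimate(readings), 1))  # 34.5
```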

 

I too was very impressed with BackBlaze's work, but respectfully, I had several issues with their conclusions (of course I could be very wrong!), plus their application is a little different than ours.  I wanted to question them on a couple of things, but never followed through, and forgot about it.  One was their citing and handling of Command Timeout.  They never explain why they see indications of issues with Command Timeout, or how they interpret it, and nothing that I saw of their published material corroborated it.  I have never seen a related issue either, in my own experience.  That doesn't mean there isn't one, just that probably our experience is still too limited.  Certainly, I have less experience than they do!

 

Their application is far more controlled than ours, with only high quality parts (and only a very limited list of them), probably standardized procedures and techniques, and experienced workers.  Ours involves numerous and extremely disparate systems, considerable variance in part quality, and mostly homeowners of very mixed technical abilities and experience.  Cabling is a non-issue for them, but may be the largest single issue for our users, so monitoring the CRC count is very important for us but not for them.  Their set of SMART attributes comes from a limited set of modern drives, ours from almost every drive model on the planet, old and new.

 

We also want to detect 'true' SMART failures, so I added the flag testing, the easiest way to know which attributes are considered 'Critical' for that drive model.  If an attribute is NOT critical, then the threshold is completely immaterial, and can be ignored.  If the vendor sets the threshold to zero, then it can't be reached (well this is SMART, very inconsistent, but I have never seen it reached).  If the vendor sets a numeric threshold for a non-critical attribute, then it's only informational, although it may possibly be useful to the drive owner that the vendor thinks this attribute *might* be indicative of trouble.  But the vendor themselves was not sufficiently convinced of that to make it a critical attribute!

 

I do still stand by my recommendations, although I accept that there may be improvements or tweaks to the list or the triggers.  I welcome discussion.

Link to comment

 

It appears there are two camps: one advocates looking at the normalized values, since this is how each vendor has built its specific implementation. The caveat is that some vendors set the threshold values for some or all attributes to 00, which basically means no threshold is defined, so it isn't possible to generate a notification on such an attribute.

 

 

What makes you say this? A threshold of 00 means that the threshold is zero. Most attributes have a 0 threshold since the scale is arbitrary. As long as the normalized VALUE is above the THRESH, the attribute PASSes.

 

Link to comment

Have a look at the attribute list below.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0000   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0000   100   100   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       86
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       3
160 Uncorrectable_Error_Cnt 0x0000   100   100   000    Old_age   Offline      -       0
161 Valid_Spare_Block_Cnt   0x0000   100   100   000    Old_age   Offline      -       70
163 Initial_Bad_Block_Count 0x0000   100   100   000    Old_age   Offline      -       21
164 Total_Erase_Count       0x0000   100   100   000    Old_age   Offline      -       6814
165 Max_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       15
166 Min_Erase_Count         0x0000   100   100   000    Old_age   Offline      -       0
167 Average_Erase_Count     0x0000   100   100   000    Old_age   Offline      -       4
168 Max_Erase_Count_of_Spec 0x0000   100   100   000    Old_age   Offline      -       2000
169 Remaining_Lifetime_Perc 0x0000   100   100   000    Old_age   Offline      -       100
175 Program_Fail_Count_Chip 0x0000   100   100   000    Old_age   Offline      -       0
176 Erase_Fail_Count_Chip   0x0000   100   100   000    Old_age   Offline      -       0
177 Wear_Leveling_Count     0x0000   100   100   000    Old_age   Offline      -       0
178 Runtime_Invalid_Blk_Cnt 0x0000   100   100   000    Old_age   Offline      -       0
181 Program_Fail_Cnt_Total  0x0000   100   100   000    Old_age   Offline      -       0
182 Erase_Fail_Count_Total  0x0000   100   100   000    Old_age   Offline      -       0
192 Power-Off_Retract_Count 0x0000   100   100   000    Old_age   Offline      -       1
194 Temperature_Celsius     0x0000   100   100   000    Old_age   Offline      -       22
195 Hardware_ECC_Recovered  0x0000   100   100   000    Old_age   Offline      -       37
196 Reallocated_Event_Count 0x0000   100   100   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
198 Offline_Uncorrectable   0x0000   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0000   100   100   000    Old_age   Offline      -       0
232 Available_Reservd_Space 0x0000   100   100   000    Old_age   Offline      -       100
241 Host_Writes_32MiB       0x0000   100   100   000    Old_age   Offline      -       43581
242 Host_Reads_32MiB        0x0000   100   100   000    Old_age   Offline      -       12799
245 Flash_Writes_32MiB      0x0000   100   100   000    Old_age   Offline      -       54512

Link to comment

Hm, no... according to SMART specifications a vendor needs to convert a raw value into a normalized value ranging from 1 to 253. A threshold value of zero is used to indicate that no threshold is present and the associated attribute is not used for failure prediction.

 

Furthermore each attribute in the earlier list has a flag value of zero, which says that none of the attributes is considered critical.

 

It eludes me how to predict a disk failure when there are no critical attributes to observe and none of them will ever reach a failure threshold...
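To make the point concrete, here is a rough Python sketch that parses attribute lines in the layout shown in the list above and checks whether any attribute could ever trip a SMART failure. It follows the reading given in this thread (lowest FLAG bit marks a critical attribute, a zero THRESH can never be reached); `Attribute`, `parse_attributes` and `can_ever_fail` are hypothetical names, and the last sample line with FLAG 0x0033 and THRESH 036 is invented for contrast.

```python
from typing import List, NamedTuple


class Attribute(NamedTuple):
    attr_id: int
    name: str
    flag: int
    value: int
    worst: int
    thresh: int
    when_failed: str
    raw: str


def parse_attributes(report: str) -> List[Attribute]:
    """Parse lines laid out as ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW."""
    attrs = []
    for line in report.splitlines():
        fields = line.split(None, 9)        # RAW_VALUE may itself contain spaces
        if len(fields) < 10 or not fields[0].isdigit():
            continue                        # header or unrelated line
        attrs.append(Attribute(int(fields[0]), fields[1], int(fields[2], 16),
                               int(fields[3]), int(fields[4]), int(fields[5]),
                               fields[8], fields[9]))
    return attrs


def can_ever_fail(attr: Attribute) -> bool:
    """Critical flag bit set and a non-zero threshold to drop to."""
    return bool(attr.flag & 0x0001) and attr.thresh > 0


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0000   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0000   100   100   000    Old_age   Offline      -       0
194 Temperature_Celsius     0x0000   100   100   000    Old_age   Offline      -       22
"""

# Invented line for contrast: a critical attribute with a real threshold.
CONTRAST = "  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0"

print(any(can_ever_fail(a) for a in parse_attributes(SAMPLE)))    # False
print(any(can_ever_fail(a) for a in parse_attributes(CONTRAST)))  # True
```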

 

Link to comment

bonienl is right.  That attribute list is incredible!  I'm sure that that SSD will fail some day, but it will NEVER fail SMART!  Probably a new vendor?  Either the developer had little experience with programming for SMART, or it was a conscious decision to never allow SMART failures.

 

That attribute list does show immaturity.  The values are too uniform, so it is probably fresh programming.  Most SMART reports show very old code mixed with new code for new attributes, which is why many have mixed starting values (some 100 and some 200).  The flag bits are wrong; they should not all be zero.  For example, error rates have a flag bit set for that, which should make the flag value of Raw_Read_Error_Rate non-zero.

Link to comment

Hm, no... according to SMART specifications a vendor needs to convert a raw value into a normalized value ranging from 1 to 253. A threshold value of zero is used to indicate that no threshold is present and the associated attribute is not used for failure prediction.

 

Furthermore each attribute in the earlier list has a flag value of zero, which says that none of the attributes is considered critical.

 

It eludes me how to predict a disk failure when there are no critical attributes to observe and none of them will ever reach a failure threshold...

 

 

Do you have a reference for this?

 

 

How does this prevent notification on the change of a value? I'd expect those 100 VALUES to decrease as the drive ages.

 

 

The vendor is free to implement SMART as they see fit.

Link to comment

Do you have a reference for this?

 

There are several documents you can find on the internet. For example this one

 

How does this prevent notification on the change of a value? I'd expect those 100 VALUES to decrease as the drive ages.

 

The whole point of value and threshold is that a notification is only needed when the value is equal to or below the threshold. A decreasing value as such doesn't mean there is an issue.

 

Link to comment

The whole point of value and threshold is that a notification is only needed when the value is equal to or below the threshold. A decreasing value as such doesn't mean there is an issue.

 

I don't agree with this.

 

Although the drive makers might want you to believe this, and maybe the normalized values are a good guide for some attributes, there are several attributes that indicate a serious drive problem well before the value gets anywhere near the threshold. For spinning drives, these are most notably Reallocated_Sector_Ct and Current_Pending_Sector, but also others like Offline_Uncorrectable and Multi_Zone_Error_Rate that I've often seen associated with failing drives.

 

Then there is UDMA_CRC_Error_Count, which normally rises when there is a cabling problem to a drive. One needs to be notified and take action once the raw value starts to increase, or face drives dropping from the array. And even if the raw attribute went sky high, and the normalized value dropped to or below the threshold, once corrected the drive would likely be fine.

Link to comment

Do you have a reference for this?

 

There are several documents you can find on the internet. For example this one

 

Does the manufacturer of the example drive provide this type of document?

 

How does this prevent notification on the change of a value? I'd expect those 100 VALUES to decrease as the drive ages.

 

The whole point of value and threshold is that a notification is only needed when the value is equal to or below the threshold. A decreasing value as such doesn't mean there is an issue.

 

 

There are several things that can be watched. The FAILED column and the overall health.

SMART overall-health:   Passed

 

 

Watches on the RAW values that RobJ mentioned would be nice. Allowing the User to select attributes for RAW value watch could be helpful.

 

 

We could notify when a normalized VALUE has decreased by more than a configurable percentage over a configurable amount of time, or report the length of time over which the drop occurred. This setting should be available for each attribute.
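A rough Python sketch of such a watch, purely as an illustration: it assumes the normalized VALUE of one attribute is sampled periodically and kept as a list of (timestamp, value) pairs, oldest first. The name `drop_alert` and its parameters are hypothetical, not anything that exists in unRAID.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, int]  # (unix timestamp in seconds, normalized VALUE)


def drop_alert(history: List[Sample], max_drop_pct: float, window_s: float) -> Optional[str]:
    """Return a message if VALUE fell by more than max_drop_pct within window_s."""
    if not history:
        return None
    now, latest = history[-1]
    # Only consider samples that are recent enough.
    in_window = [(t, v) for t, v in history if now - t <= window_s]
    t_peak, peak = max(in_window, key=lambda sample: sample[1])
    if peak <= 0:
        return None
    drop_pct = (peak - latest) / peak * 100.0
    if drop_pct > max_drop_pct:
        hours = (now - t_peak) / 3600.0
        return "VALUE dropped %.1f%% (from %d to %d) in %.1f hours" % (drop_pct, peak, latest, hours)
    return None


DAY = 86400.0
history = [(0 * DAY, 100), (1 * DAY, 98), (2 * DAY, 85)]
print(drop_alert(history, max_drop_pct=10, window_s=7 * DAY))
```

Keeping one history list and one pair of settings per attribute would give the per-attribute configuration suggested above.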

Link to comment

TESTERS REQUIRED

 

I have made some changes and additions to the way SMART attributes are interpreted and the handling of the corresponding notifications.

 

  • RAW or NORMALIZED values
    RAW = the existing implementation, an increase in the RAW value of an attribute results in a notification. It is possible to set a tolerance level, which acts as a margin to the next 'level' before sending a new notification.
    NORMALIZED = read the normalized value of the attribute and generate a notification when it is equal to or below the threshold value (if threshold > 0). Here too a tolerance level can be set to allow notifications to be sent before the actual threshold is reached.
     
     
  • Tolerance level
    As explained above this sets a margin expressed as a percentage to allow filtering on notifications. It is calculated as threshold = value + percentage (by default percentage is zero).
     
     
  • Custom attributes
    A list of comma separated attributes can be added next to the pre-selected attributes. It is even possible to disable all pre-selected attributes and work only with custom attributes.
     

 

Notifications are generated for any of the enabled pre-selected attributes and the specified custom attributes.
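One possible reading of the two modes and the tolerance level, written as a Python sketch. This is an interpretation, not the package's actual code: the post does not spell out exactly how the percentage is applied, so here it is assumed to widen the threshold (NORMALIZED) or the last notified value (RAW) by that percentage. Both function names are hypothetical.

```python
def normalized_alert(value: int, thresh: int, tolerance_pct: float = 0.0) -> bool:
    """NORMALIZED mode: alert when VALUE is at or below THRESH plus the margin."""
    if thresh <= 0:
        return False                      # no threshold defined, nothing to compare against
    return value <= thresh * (1 + tolerance_pct / 100.0)


def raw_alert(raw: int, last_notified_raw: int, tolerance_pct: float = 0.0) -> bool:
    """RAW mode: alert when RAW has grown past the last notified level plus the margin."""
    return raw > last_notified_raw * (1 + tolerance_pct / 100.0)


print(normalized_alert(40, 36))        # False: still above the threshold
print(normalized_alert(40, 36, 20))    # True: inside the 20% margin
print(raw_alert(9, 8))                 # True: any increase notifies by default
print(raw_alert(105, 100, 10))         # False: increase stays within the 10% margin
```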

 

I would like to get some feedback on this approach, and any remarks are welcome. If needed the list of pre-selected attributes can be adjusted, but I need some consensus.

 

Those who would like to test can download the package here. Copy the txz file to your flash drive and install manually:

 

installpkg dynamix.smart.test-2015.11.21.txz

 

After installation go to Settings -> Notification Settings to set the new parameters as desired. You need to have Advanced View enabled.

 

Link to comment

I like it with one exception (one that I've been stating since the beginning of dynamix integration).

 

Any attribute which is FAILING NOW should automatically trigger a failure status for the drive, regardless of whether that attribute is monitored (and regardless of what the SMART overall-health states).

 

Additionally, when the value is less than the threshold (ie: the attribute is FAILING NOW), the dashboard warning level should be a thumbs down, rather than a warning.

 

In conjunction, ideally I would like to see the monitoring settings have the capability to be set individually for each drive.  With your tolerance level, there are going to be times when you're going to want to know when a certain drive has ANY change in its attributes, but on another drive a 10% change will be sufficient, etc.

 

 

Link to comment

I like it with one exception (one that I've been stating since the beginning of dynamix integration).

 

Any attribute which is FAILING NOW should automatically trigger a failure status for the drive, regardless of whether that attribute is monitored (and regardless of what the SMART overall-health states).

 

Now when any attribute has the state "FAILING NOW" it will result in a notification and the corresponding attribute is highlighted on the disk attributes page.

 

Additionally, when the value is less than the threshold (ie: the attribute is FAILING NOW), the dashboard warning level should be a thumbs down, rather than a warning.

 

Any failing attribute is listed on the dashboard page when hovering over the warning sign. I keep the thumbs down sign exclusively for a failed SMART health status.
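The behaviour described here can be summarised in a short Python sketch. It is only an illustration: the status names and the `dashboard_status` function are hypothetical, and the input is assumed to be the overall-health result plus (name, WHEN_FAILED) pairs, where smartctl reports a currently failing attribute as FAILING_NOW.

```python
from typing import Iterable, List, Tuple


def dashboard_status(health_passed: bool,
                     attributes: Iterable[Tuple[str, str]]) -> Tuple[str, List[str]]:
    """Return (icon, names of attributes that are failing now)."""
    failing = [name for name, when_failed in attributes if when_failed == "FAILING_NOW"]
    if not health_passed:
        return "thumbs-down", failing    # reserved for a failed SMART health status
    if failing:
        return "warning", failing        # listed when hovering over the warning sign
    return "ok", failing


print(dashboard_status(True, [("Reallocated_Sector_Ct", "FAILING_NOW"),
                              ("Power_On_Hours", "-")]))
```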

 

An updated package is available. You need to re-install it if you already installed the earlier package.

Link to comment

Good stuff.  I would still like to see individual settings for each drive for alert levels, monitored attributes, etc, but I'm probably just being greedy
Link to comment

I understand your request, but I have to put it on the "todo" list, since it requires additional disk parameters and I need to think of a way to do that (I might need help from LT).

 

Ps. More tweaking done. If already installed, please download and install the txz file again.

 

UPDATE

The latest version allows global SMART settings, which can be found under Settings -> Disk Settings

And... individual SMART settings per disk, which can be found under Main -> Device name -> Device Settings

 

This approach makes it possible to quickly set values for all disks and then adjust only those disks that differ.

 

WAIT...

 

There is more! It is also possible to set the controller type. People using a RAID controller can use this on global and individual level to get the proper SMART command for their hardware.
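For background, smartctl selects the controller with its `-d TYPE` option (for example `sat`, or `megaraid,N` for a disk behind a MegaRAID controller). The Python sketch below only shows how such a command line could be assembled from a global or per-disk setting; the function `smartctl_command` and the way the setting is stored are assumptions, not the package's actual code.

```python
from typing import List, Optional


def smartctl_command(device: str,
                     controller: Optional[str] = None,
                     port: Optional[int] = None) -> List[str]:
    """Build the argument list for reading SMART attributes from a device."""
    cmd = ["smartctl", "-A"]
    if controller:
        # Controllers such as megaraid take the disk number as ",N".
        dev_type = controller if port is None else "%s,%d" % (controller, port)
        cmd += ["-d", dev_type]
    cmd.append(device)
    return cmd


print(" ".join(smartctl_command("/dev/sdb")))
print(" ".join(smartctl_command("/dev/sda", "megaraid", 2)))
```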

 

I am looking for beta-testers to try out these new features, specifically the SMART controller selection.

 

As before it is required to do a manual installation. Download the latest package here

 

installpkg dynamix.smart.test-2015.11.22.txz

 

Good luck :)

global-smart-settings.png.5236acc5cf49b28ac08db3dd02132741.png

disk_smart_settings.png.e25c3e17d23faa6e0ddbfde163da8541.png

Link to comment

Hey, thanks for the update.

 

I have set the "Default SMART notification value:" to "Normalized"

Warning signs no longer show up (as they did with unRAID 6.1.3), although my SMART values have (naturally) stayed the same. Maybe a preference for Normalized would help.

Link to comment

Archived

This topic is now archived and is closed to further replies.
