SMART not so smart sometimes

January 20, 2009

The way the SMART system works is that each attribute is "normalized" to a number between 255 and 0 (255 is always good, 0 is always bad). Based on the raw value of the attribute, a "value" is computed for that attribute. This is compared to a predefined "threshold" to determine failure: if the normalized value falls below the threshold, SMART fails the drive and says it will fail within 24 hours.

Check out the values below. Most of us here would say that this drive is in awful shape. But even with 201 reallocated sectors, the drive is still showing a normalized value of 174. It would have to drop to the threshold (140) for it to fail based on that attribute.

And look at the Current_Pending_Sector line: the normalized value is 193 (with 563 pending sectors), and it would have to go all the way down to 0 to fail the drive! At that rate, there would have to be about 2300 pending sectors to fail the drive!
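For anyone wondering where a figure like "about 2300" can come from, here is a rough sketch of the arithmetic. It assumes the normalized value starts at 255 and falls in a straight line as the raw count grows. That linear model and the function name `raw_needed` are my own assumptions for illustration; the real firmware formula is not published.

```python
def raw_needed(raw, value, threshold, baseline=255):
    """Linearly extrapolate the raw count at which VALUE would reach THRESH.

    Assumption: VALUE drops from `baseline` in a straight line as the raw
    count grows. Real drives may use a completely different curve.
    """
    points_lost = baseline - value
    if points_lost <= 0:
        raise ValueError("value has not dropped below the baseline yet")
    raw_per_point = raw / points_lost          # raw count per normalized point
    return raw_per_point * (baseline - threshold)


# Current_Pending_Sector as quoted in the post: VALUE 193, THRESH 0, raw 563.
estimate = raw_needed(raw=563, value=193, threshold=0)
print(f"Estimated pending sectors needed to hit the threshold: {estimate:.0f}")
```

With those inputs the estimate lands a little over 2300. Picking a different starting point (say 200 instead of 255) changes the answer a lot, which is a good reminder of how little we can infer without the real formula.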

This drive is in terrible shape and should definitely fail the test.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   199   199   051    Pre-fail  Always       -       12493
  3 Spin_Up_Time            0x0003   227   218   021    Pre-fail  Always       -       5641
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       355
  5 Reallocated_Sector_Ct   0x0033   174   174   140    Pre-fail  Always       -       201
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   080   080   000    Old_age   Always       -       15159
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       89
194 Temperature_Celsius     0x0022   253   253   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   156   156   000    Old_age   Always       -       44
197 Current_Pending_Sector  0x0012   194   193   000    Old_age   Always       -       563
198 Offline_Uncorrectable   0x0010   193   193   000    Old_age   Offline      -       584
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   148   148   051    Pre-fail  Offline      -       2639
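
To make the comparison concrete, here is a minimal sketch that parses a few rows of the table above, checks VALUE against THRESH, and separately lists sector counters with a nonzero raw value. It treats "at or below the threshold" as failing, which is my understanding of how the smartctl man page describes it. The watch list (IDs 5, 197, 198) and the helper name `parse_attributes` are my own choices for illustration, not anything SMART defines.

```python
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   174   174   140    Pre-fail  Always       -       201
197 Current_Pending_Sector  0x0012   194   193   000    Old_age   Always       -       563
198 Offline_Uncorrectable   0x0010   193   193   000    Old_age   Offline      -       584
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

WATCH_IDS = {5, 197, 198}  # assumption: counters worth eyeballing by raw value


def parse_attributes(text):
    """Turn attribute table rows into dicts; skips anything that is not a row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            # Raw values are not always plain integers; keep None if not.
            "raw": int(fields[9]) if fields[9].isdigit() else None,
        })
    return rows


rows = parse_attributes(SAMPLE)
failed = [r for r in rows if r["value"] <= r["thresh"]]
watch = [r for r in rows if r["id"] in WATCH_IDS and r["raw"]]

print("Failed by threshold:", [r["name"] for r in failed])
print("Nonzero sector counters:", [(r["name"], r["raw"]) for r in watch])
```

On these rows nothing fails by threshold, yet all three watched counters are nonzero, which is exactly the gap being complained about here.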

January 20, 2009

Well it depends on some other things... like drive size, and number of spare sectors.

A 1TB drive will have a LOT more sectors, and a LOT more spare sectors, than a 10GB drive.

The number of sector errors that you can live with is going to be a lot higher on a 1TB drive than on a 10GB drive, and the number of spare sectors is going to be higher too.

January 21, 2009

He's right. The numbers do have to be interpreted correctly, and from what I have seen, many of them are essentially meaningless, possibly useful only in an advisory manner. This especially applies to the non-critical attributes, meaning those not marked with the PRE_FAIL flag. In those attributes, the only numbers generally of value are the VALUE column or the RAW_VALUE column. A threshold does not mean much if it is not used.

The Current_Pending_Sector count is one of these: primarily a count of suspect sectors that, from my reading, have one last chance for testing and recovery before being written off and remapped. With the right surface test (I forget which), this value will probably return to zero (but with a large increase in the remapped sector count).

The Reallocated_Sector_Ct *is* a critical number that can cause a failing grade, but as BubbaQ said, there are a large number of spare sectors kept in reserve just for this purpose. Although we certainly worry about a drive that begins to use up these reserve sectors, that is what they are there for, and that is how the SMART system is supposed to work. A drive is not really used up until its supply of spare sectors is used up.

January 21, 2009

Author

Didn't quite understand that. Sorry ... By PRE_FAIL do you mean the Pre-fail type?

I think I was pretty accurate in my email about how SMART works (we can debate whether this drive should have failed the overall health assessment).

The RAW_VALUEs may or may not be humanly intelligible. But sometimes they are (i.e., usually counts and temperature values are meaningful).

The manufacturer comes up with some mathematical formula (or algorithm) that takes the RAW_VALUE and converts it into a number between 0 (bad) and 255 (good). (So a Reallocated_Sector_Ct of 0 should lead to a score at or close to 255, and a Reallocated_Sector_Ct of 1,000,000 should lead to a score at or close to 0.) That "normalized" number becomes the VALUE. The manufacturer also defines, for each attribute, what a failing value is, and calls that the THRESH (threshold value). Since we don't know the formula, we can't back into the RAW_VALUE that would cause that THRESH to be met, so we don't know the RAW_VALUE_THRESHOLD, which is what we'd really like to know.

Now this drive has 201 reallocated sectors, and 563 are waiting to be reallocated on the next write. (Why unRAID didn't force that to happen I can't explain. One thought is that 201 is the maximum number of spare sectors available, but that may not be true.)

The tricky part of grading these numbers involves the fact that a disk may have surface defects that, once reallocated, leave the vast majority of the surface perfectly usable. If that is the case, everything is good. 100 bad sectors is a tiny fraction of the surface. Even 1000 sectors (assuming it can reallocate that many) would not be bad. But after that initial check, the values should stabilize. Over time you might have a few marginal sectors go bad, but not dozens and certainly not hundreds. One of the limitations of SMART is that it looks at the values "statically" and not in comparison with prior values.

If I had a drive that had run for 2 years with 1-2 reallocated sectors, and then one day saw this, I'd be taking it out of service.
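
The "compare with prior values" idea can be sketched in a few lines. This is only an illustration: the function name `raw_increases`, the choice of watched IDs, and the earlier snapshot are all hypothetical. Only the latest numbers come from the report in this thread.

```python
def raw_increases(previous, current, watched=(5, 196, 197, 198)):
    """Return {attribute_id: increase} for watched raw counters that grew."""
    changes = {}
    for attr_id in watched:
        before = previous.get(attr_id, 0)
        after = current.get(attr_id, 0)
        if after > before:
            changes[attr_id] = after - before
    return changes


# Hypothetical earlier snapshot: a drive that sat at 2 reallocated sectors.
earlier = {5: 2, 196: 2, 197: 0, 198: 0}
# Raw values taken from the report posted in this thread.
latest = {5: 201, 196: 44, 197: 563, 198: 584}

growth = raw_increases(earlier, latest)
for attr_id, delta in growth.items():
    print(f"attribute {attr_id}: raw value up by {delta}")
```

A static threshold check says nothing here, while a jump of hundreds between two snapshots is hard to miss.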

I guess I'm surprised that you two are in disagreement with me on this. Next time I have a drive that gets to this point and is out of warranty, I'll see if one of you is interested in making a purchase of a slightly used-up drive at a substantial discount.

Otherwise I might just enjoy cutting it open and showing my son how a disk drive works. {Insert sound of chainsaw here}

January 21, 2009

One of the limitations of SMART is that it looks at the values "statically" and not in comparison with prior values.

If I had a drive that had run for 2 years with 1-2 reallocated sectors, and then one day saw this, I'd be taking it out of service.

Which is exactly why I wrote smarthistory. It may not show you much now, but wait 'till you have a year's worth of history data, and then see a graph of SMART parameters.

January 21, 2009

Author

It will be interesting to see.

The problem is that a SMART report only tells you what the drive has seen, not what it would see if you did a complete surface scan at that moment. While the drive is sleeping or spinning unaccessed, the SMART parameters we care about are not going to be updated. We have 10 TB arrays; how much do you think is being accessed per day? A few gigabytes, maybe? Many disks don't get accessed for weeks at a time, sometimes months. Realistically, how likely is it that we're going to start picking up errors except right after a parity check?

Please don't take this negatively. I think this is a good concept. I'm just trying to get my head around how this will work in practice.

One idea I had was to run some type of abbreviated read test on the drive just before running your data-collecting SMART run. Perhaps read a random 1% of the sectors on the drive. Would that give a better chance of finding a problem? Is it worth the wear and tear on the drive? Is it actually good for the drive to get "some exercise" rather than sit spun down for so long? ...
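
To show roughly what a "random 1%" read pass might look like, here is a small sketch. The 512-byte sector size, the function name `sample_read`, and the idea of sorting the sample so the heads sweep in one direction are my assumptions. It also reads through the normal OS path, so some reads could be served from cache rather than the platters.

```python
import os
import random

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def sample_read(path, fraction=0.01, seed=None):
    """Read a random fraction of sectors from `path`.

    Returns (sectors_attempted, failed_sector_numbers).
    """
    rng = random.Random(seed)
    failed = []
    with open(path, "rb", buffering=0) as dev:
        size = dev.seek(0, os.SEEK_END)        # works for files and block devices
        total = size // SECTOR_SIZE
        count = max(1, int(total * fraction)) if total else 0
        # Sorted order keeps the heads moving in one direction.
        for sector in sorted(rng.sample(range(total), count)):
            try:
                dev.seek(sector * SECTOR_SIZE)
                if len(dev.read(SECTOR_SIZE)) != SECTOR_SIZE:
                    failed.append(sector)
            except OSError:
                failed.append(sector)          # unreadable sector
    return count, failed
```

It would be called with a device path (for example `sample_read("/dev/sdX")`, where the path is a placeholder and read access is required), followed by the usual SMART data collection. On a 1TB drive, 1% is still roughly 20 million sectors, so a real version would probably want to read larger chunks than one sector at a time.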

So many questions, so few answers ...

January 22, 2009

If the drive is not spinning, there is little chance for errors.
