January 20, 2009

The way the SMART system works is that each attribute is "normalized" to a number between 255 and 0 (255 is always good, 0 is always bad). Based on the raw value of the attribute, a "value" is computed for that attribute and compared to a predefined "threshold". If the normalized value falls to the threshold, SMART fails the drive and says it will fail within 24 hours.

Check out the values below. Most of us here would say that this drive is in awful shape. But even with 201 reallocated sectors, the drive is still showing a normalized value of 174. It would have to drop to the threshold (140) for it to fail based on that attribute. And look at the current pending sector line: the normalized value is 193 (with 563 pending sectors), and it would have to go all the way down to 0 to fail the drive! At that rate, there would have to be about 2300 pending sectors to fail the drive! This drive is in terrible shape and should definitely fail the test.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   199   199   051    Pre-fail  Always       -       12493
  3 Spin_Up_Time            0x0003   227   218   021    Pre-fail  Always       -       5641
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       355
  5 Reallocated_Sector_Ct   0x0033   174   174   140    Pre-fail  Always       -       201
  7 Seek_Error_Rate         0x000f   200   200   051    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   080   080   000    Old_age   Always       -       15159
 10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       89
194 Temperature_Celsius     0x0022   253   253   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   156   156   000    Old_age   Always       -       44
197 Current_Pending_Sector  0x0012   194   193   000    Old_age   Always       -       563
198 Offline_Uncorrectable   0x0010   193   193   000    Old_age   Offline      -       584
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0009   148   148   051    Pre-fail  Offline      -       2639
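To make the threshold comparison in this post concrete, here is a minimal Python sketch. It checks three rows copied from the report against their thresholds, then estimates the raw count at which each would fail. The estimate rests on two loud assumptions that are not known to be true for this drive: that VALUE starts at 255 and that it falls linearly with the raw count. The names `Attr`, `is_failing` and `raw_at_threshold_linear` are made up for illustration. Under those assumptions the pending-sector estimate lands in the same range as the "about 2300" figure quoted above.

```python
from dataclasses import dataclass


@dataclass
class Attr:
    name: str
    value: int   # normalized VALUE column
    thresh: int  # THRESH column
    raw: int     # RAW_VALUE column


# Three rows copied from the report above.
ATTRS = [
    Attr("Reallocated_Sector_Ct", 174, 140, 201),
    Attr("Current_Pending_Sector", 194, 0, 563),
    Attr("Offline_Uncorrectable", 193, 0, 584),
]


def is_failing(attr):
    """An attribute fails once its normalized value has dropped to the threshold."""
    return attr.value <= attr.thresh


def raw_at_threshold_linear(attr, start=255):
    """Estimate the raw count at which VALUE would reach THRESH.

    ASSUMPTION: VALUE starts at `start` and drops linearly with the raw
    count. The real vendor formula is unknown, so this is only a rough guess.
    """
    points_lost = start - attr.value
    if points_lost <= 0 or attr.raw == 0:
        return None  # nothing to extrapolate from
    raw_per_point = attr.raw / points_lost
    return raw_per_point * (start - attr.thresh)


if __name__ == "__main__":
    for a in ATTRS:
        est = raw_at_threshold_linear(a)
        print(f"{a.name}: failing={is_failing(a)}, "
              f"estimated raw count at threshold={est:.0f}")
```

None of the three attributes is failing by this rule, which is exactly the complaint: the counts look alarming long before the normalized values get anywhere near their thresholds.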
January 20, 2009

Well, it depends on some other things, like drive size and the number of spare sectors. A 1TB drive will have a LOT more sectors, and a LOT more spare sectors, than a 10GB drive. The number of sector errors that you can live with is going to be a lot higher on a 1TB drive than on a 10GB drive, and the number of spare sectors is going to be higher too.
January 21, 2009

He's right. The numbers do have to be interpreted correctly, and from what I have seen, many of the numbers are essentially meaningless, possibly useful only in an advisory manner. This especially applies to the attributes that are not critical; the critical ones are marked with the PRE_FAIL flag. The only numbers generally of value in the attributes without the PRE_FAIL flag are the VALUE column and the RAW_VALUE column. A threshold does not mean much if it is not used.

The Current_Pending_Sector count is one of these. It is primarily a count of suspect sectors that, from my reading, have one last chance for testing and recovery before being written off and remapped. With the right surface test (I forget which), this value will probably return to zero (but with a large increase in the remapped sector count).

The Reallocated_Sector_Ct *is* a critical number that can cause a failing grade, but as BubbaQ said, there are a large number of spare sectors kept in reserve just for this purpose. Although we certainly worry about a drive that begins to use up these reserve sectors, that is what they are there for, and that is how the SMART system is supposed to work. A drive is not really used up until its supply of spare sectors is used up.
January 21, 2009 (Author)

Didn't quite understand that. Sorry ... By PRE_FAIL do you mean the Pre-fail type?

I think I was pretty accurate in my email about how SMART works (we can debate whether this drive should have failed the overall health assessment). The RAW_VALUEs may or may not be humanly intelligible, but sometimes they are (e.g., counts and temperature values are usually meaningful). The manufacturer comes up with some mathematical formula (or algorithm) that takes the RAW_VALUE and converts it into a number between 0 (bad) and 255 (good). (So a reallocated_sector_ct of 0 should lead to a score at or close to 255, and a reallocated_sector_ct of 1,000,000 should lead to a score at or close to 0.) That "normalized" number becomes the VALUE. The manufacturer also defines, for each attribute, what a failing value is and calls that the threshold value (THRESH). Since we don't know the formula, we can't back into the RAW_VALUE that would cause that THRESH to be met, so we don't know the RAW_VALUE_THRESHOLD, which is what we'd really like to know.

Now this drive has 201 reallocated sectors, and 563 are waiting to be reallocated on the next write. (Why unRAID didn't force that to happen I can't explain. One thought is that 201 is the maximum number of spare sectors available, but that may not be true.)

The tricky part of grading these numbers involves the fact that a disk may have surface defects that, once reallocated, leave a perfectly usable majority of the surface. If that is the case, everything is good. 100 bad sectors is a tiny fraction of the surface. Even 1000 sectors (assuming it can reallocate that many) would not be bad. But after that initial check, the values should stabilize. Over time you might have a few marginal sectors go bad, but not dozens and certainly not hundreds.

One of the limitations of SMART is that it looks at the values "statically" and not in comparison with prior values. If I had a drive that had run for 2 years with 1-2 reallocated sectors, and then one day saw this, I'd be taking it out of service.

I guess I'm surprised that you two are in disagreement with me on this. Next time I have a drive that gets to this point and is out of warranty, I'll see if one of you is interested in making a purchase of a slightly used-up drive at a substantial discount. Otherwise I might just enjoy cutting it open and showing my son how a disk drive works. {Insert sound of chainsaw here}
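The "static versus prior values" point can be sketched in a few lines of Python: keep the raw counts from an earlier SMART report and flag the drive when the error counters jump. The earlier snapshot below is hypothetical (it mirrors the "1-2 reallocated sectors" scenario), the later one uses the raw values from the report in this thread, and the `max_growth` limit of 10 is an arbitrary assumption standing in for "a few, but not dozens".

```python
# Raw counters worth watching between two SMART reports.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def growth(previous, current, watched=WATCHED):
    """Return {attribute: increase} for watched raw counters that grew."""
    deltas = {}
    for name in watched:
        delta = current.get(name, 0) - previous.get(name, 0)
        if delta > 0:
            deltas[name] = delta
    return deltas


def should_retire(previous, current, max_growth=10):
    """True if any watched counter grew by more than max_growth.

    ASSUMPTION: max_growth=10 is an illustrative limit, not a vendor figure.
    """
    return any(d > max_growth for d in growth(previous, current).values())


if __name__ == "__main__":
    # Hypothetical earlier snapshot of the same drive.
    earlier = {"Reallocated_Sector_Ct": 2,
               "Current_Pending_Sector": 0,
               "Offline_Uncorrectable": 0}
    # Raw values taken from the report in this thread.
    now = {"Reallocated_Sector_Ct": 201,
           "Current_Pending_Sector": 563,
           "Offline_Uncorrectable": 584}
    print(growth(earlier, now))
    print("retire:", should_retire(earlier, now))
```

A check like this only works if someone actually saves the earlier snapshot, which SMART itself does not do.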
January 21, 2009

"One of the limitations of SMART is that it looks at the values 'statically' and not in comparison with prior values. If I had a drive that had run for 2 years with 1-2 reallocated sectors, and then one day saw this, I'd be taking it out of service."

Which is exactly why I wrote smarthistory. It may not show you much now, but wait 'till you have a year's worth of history data, and then see a graph of SMART parameters.
January 21, 2009 (Author)

It will be interesting to see. The problem is that a SMART report only tells you what the drive has seen, not what it would see if you did a complete surface scan at that moment. While the drive is sleeping or spinning unaccessed, the SMART parameters we care about are not going to be updated. We have 10 TB arrays. How much do you think is being accessed in a day, a few gigabytes maybe? Many disks don't get accessed for weeks at a time, sometimes months. Realistically, how likely is it that we're going to start picking up errors except right after a parity check?

Please don't take this negatively. I think this is a good concept. I'm just trying to get my head around how this will work in practice.

One idea I had was to run some type of abbreviated read test on the drive just before running your data-collecting SMART run. Perhaps read a random 1% of the sectors on the drive. Would that give a better chance of finding a problem? Is it worth the wear and tear on the drive? Is it actually good for the drive to get "some exercise" rather than sit spun down for so long? ... So many questions, so few answers ...
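The "read a random 1% of the sectors" idea could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the 512-byte sector size is assumed, `pick_sectors` and `read_sample` are made-up names, and reads that are satisfied from the operating system's cache never reach the platters, so a real test would need to bypass caching. Pointing it at a raw device needs suitable privileges; it also works on an ordinary file, which is a safe way to try it out.

```python
import os
import random

SECTOR_SIZE = 512  # ASSUMPTION: bytes per sector


def pick_sectors(total_sectors, fraction=0.01, seed=None):
    """Choose a random sample of sector numbers, sorted to limit seeking."""
    if total_sectors <= 0:
        return []
    count = min(total_sectors, max(1, int(total_sectors * fraction)))
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_sectors), count))


def read_sample(path, fraction=0.01, seed=None):
    """Read a random sample of sectors and return those that failed to read."""
    bad = []
    with open(path, "rb", buffering=0) as dev:
        # Seeking to the end reports the size for files and block devices.
        size = dev.seek(0, os.SEEK_END)
        for sector in pick_sectors(size // SECTOR_SIZE, fraction, seed):
            try:
                dev.seek(sector * SECTOR_SIZE)
                dev.read(SECTOR_SIZE)
            except OSError:
                bad.append(sector)  # the read error is what we are hunting for
    return bad


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print("unreadable sectors:", read_sample(sys.argv[1]))
```

Whether the extra seeks are worth the wear is still the open question; the sketch only shows that the sampling itself is cheap to set up.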