Is my disk 'prefailing'?

December 15, 2017

Hi guys, I'm trying to sort through a few pending issues with my unRAID server. Johnnie.black pointed out in another thread that one of my disks had timeouts but recovered. After reading up on SMART and examining the SMART diagnostics, I'm concerned the disk might be 'prefailing' because the WORST value is close to the THRESH value (see attached SMART report):

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   063   044    Pre-fail  Always       -       180107091

According to the link below, "Critical attribute - if its WORST falls below its THRESH, then the drive will be considered FAILED". The cables have been swapped out, and the drive, a Constellation ES.3, is a warranty replacement for one that failed earlier.

https://wiki.lime-technology.com/Understanding_SMART_Reports#1_Raw_Read_Error_Rate

If it is indeed failing, I'd like to get it swapped out to avoid another incident.

ST4000NM0033-9ZM170.txt


December 15, 2017

Community Expert

For Seagates the raw value of that attribute is mostly meaningless; you only need to worry if the normalized value goes below the threshold.
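To make that rule concrete, here is a minimal Python sketch that reads the attribute row from the report in the first post and compares VALUE and WORST against THRESH. The helper names (`parse_attribute_row`, `assess`) are made up for this example, and the column layout is assumed to be the usual `smartctl -A` table. As far as I know, smartctl flags an attribute when the normalized value is less than or equal to the threshold, so the sketch uses the same comparison.

```python
from typing import NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int   # current normalized value
    worst: int   # lowest normalized value recorded so far
    thresh: int  # manufacturer's failure threshold


def parse_attribute_row(row: str) -> SmartAttribute:
    """Parse one row of the attribute table (assumed smartctl -A layout)."""
    fields = row.split()
    # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    return SmartAttribute(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
    )


def assess(attr: SmartAttribute) -> str:
    """Classify an attribute by comparing its normalized values with THRESH."""
    if attr.value <= attr.thresh:
        return "failing now"
    if attr.worst <= attr.thresh:
        return "failed in the past"
    return "ok"


row = "  1 Raw_Read_Error_Rate     0x000f   082   063   044    Pre-fail  Always       -       180107091"
attr = parse_attribute_row(row)
print(attr.name, assess(attr), "margin of WORST over THRESH:", attr.worst - attr.thresh)
```

For the posted row this reports "ok" with a margin of 19 points between WORST (63) and THRESH (44).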


December 15, 2017

52 minutes ago, johnnie.black said:

For Seagates the raw value of that attribute is mostly meaningless; you only need to worry if the normalized value goes below the threshold.

Exactly - some drives don't store a plain counter in the raw value. They may pack the number of errors in sequence and other data into the bit pattern, which gives truly huge raw values.
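As an illustration only: a commonly repeated but unofficial reading of Seagate's Raw_Read_Error_Rate is that the 48-bit raw value holds an error count in the upper 16 bits and an operation count in the lower 32 bits. Seagate does not document this, so treat the layout as an assumption. The function name below is made up for the sketch.

```python
def split_seagate_raw(raw: int) -> tuple[int, int]:
    """Split a 48-bit raw value into (errors, operations).

    ASSUMPTION: upper 16 bits = error count, lower 32 bits = operation count.
    This is a community interpretation, not a documented format.
    """
    errors = (raw >> 32) & 0xFFFF
    operations = raw & 0xFFFFFFFF
    return errors, operations


errors, operations = split_seagate_raw(180107091)
print("errors:", errors, "operations:", operations)
```

Under that assumption, the raw value 180107091 from the report is smaller than 2^32, so the upper 16 bits are all zero and the large number would be an operation count rather than an error count.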

Your WORST value can come from a situation where the drive has been running very, very warm. The current value has bounced back quite a lot.


December 15, 2017

Author

3 hours ago, johnnie.black said:

For Seagates the raw value of that attribute is mostly meaningless; you only need to worry if the normalized value goes below the threshold.

Thanks....wish there was a universal standard for SMART values.


December 15, 2017

Author

2 hours ago, pwm said:

Your WORST value can come from a situation where the drive has been running very, very warm. The current value has bounced back quite a lot.

There are times when some of my drives run hotter than usual and cross the 'hot' threshold of 45°C, but I've never had any cross the critical threshold of 55°C. As I understand it, most HDDs today can handle temperatures up to 60°C. So if my drives have been running between 45°C and 50°C, would that be considered warm enough to affect the WORST values?


December 15, 2017

Community Expert

1 minute ago, Joseph said:

Thanks....wish there was a universal standard for SMART values.

Me too. That attribute is relevant for WD, HGST and Toshiba drives, where a non-zero raw value is never a good sign, except maybe on some WD drives where it stays fixed at 1 without increasing; that seems OK.


December 16, 2017

18 hours ago, Joseph said:

There are times when some of my drives run hotter than usual and cross the 'hot' threshold of 45°C, but I've never had any cross the critical threshold of 55°C. As I understand it, most HDDs today can handle temperatures up to 60°C. So if my drives have been running between 45°C and 50°C, would that be considered warm enough to affect the WORST values?

Some measurements in a SMART drive are hard counters where there should be basically zero events counted - or where any non-zero means "beware".

But some measurements aren't hard counters and are better compared to analog measurements containing noise and natural variance. If you take a multimeter and measure the mains voltage, your multimeter will show fluctuating values short-term. And it will show larger differences if you compare dinner time with the middle of the night.

So some of the performance values you see in the SMART data will be affected by the temperature of the drive, fluctuations in supply voltages, mechanical precision when manufactured, progressing wear etc. So specific values aren't very meaningful. In normal use, you get a smaller or larger variation that you can ignore. But the drive manufacturer has decided on a limit beyond which they no longer consider the variation to be within specification. That's why there is a threshold level. So normally, you needn't put too much thought into the current value or the worst value unless one or both of these values start to get really close to the threshold. Maybe you can get the current value to match your "worst" by running the drive very hot. Or by having the power supply run extra hot, giving a slightly higher or lower supply voltage. But it's not really too important.

So in the end - it's possible the worst samples got taken when the drive was between 45°C and 50°C. But those are still temperatures supported by the drive manufacturer, and your worst value is still within what the manufacturer considers "good enough". So the only reason to care about temperature is that electronics on average show a failure rate that doubles for every 10°C higher temperature, i.e. it's generally better to keep the temperatures down. But in many situations, the increased failures because of higher temperatures can be totally ignored, because the MTBF of the device is so much higher than the expected economic life of the product that it becomes irrelevant to care. If the device is expected to be used for 3 years and the expected MTBF of the electronics drops from 200 years to 100 years, then lower noise levels or lower power consumption from less powerful fans may be more important factors.

In the end, the drive manufacturers themselves aren't all-powerful magicians. They have done a bit of guessing and also evaluated a bit of statistics from test units and from customer return units. But they do not know any "exact" hard limits at which the different SMART metric values become dangerous. The values are just indicators, where history shows that some are more important than others.
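The doubling rule of thumb above can be put into numbers with a small Python sketch. The function names are made up for this example, and the 10°C doubling step is only the rough average mentioned above, not a property of any specific drive.

```python
def relative_failure_rate(delta_t_celsius: float, doubling_step: float = 10.0) -> float:
    """Failure-rate multiplier for a temperature rise, assuming the rate
    doubles for every `doubling_step` degrees (rule of thumb only)."""
    return 2.0 ** (delta_t_celsius / doubling_step)


def scaled_mtbf(mtbf_years: float, delta_t_celsius: float) -> float:
    """MTBF after a temperature rise, under the same rule of thumb."""
    return mtbf_years / relative_failure_rate(delta_t_celsius)


# The example from the post: 10 degrees hotter halves a 200-year MTBF.
print(scaled_mtbf(200.0, 10.0))
# A 5-degree rise, e.g. from 45 to 50 degrees, gives a multiplier of about 1.41.
print(round(relative_failure_rate(5.0), 2))
```

Even with the MTBF halved to 100 years, it stays far above a 3-year service life, which is the point of the example.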


December 16, 2017

Author


Good info! Something is knocking drives offline during monthly parity checks. I'm wondering if my PS is the issue... but it's been replaced once already, and the data cables twice.

Edited December 16, 2017 by Joseph
