Is the HDD faulty



In HD Tune, my Samsung HD204UI (F4) shows a SMART warning: Calibration Retry Count has a data value of 2.

But Samsung's own diagnostic tool says it's fine.

 

Which one should I believe?

Get a SMART report on that drive.

 

smartctl -d ata -a /dev/sdX

where sdX = the device for your disk.

 

Then, look at the normalized value for the calibration retry parameter.  If it is nearing the failure threshold for that parameter, RMA the drive.  If not, stop worrying, use the drive.

 

Joe L.

Link to comment

I am not sure what you mean by normalized value, but it reads:

Attribute                  Current  Worst  Threshold  Data  Status
Calibration retry count    252      252    0          2     warning

 

Joe, can you check my preclear logs? The first HDD in the log passed without any issues.

 

Which hdd would you RMA?

 

Thanks.

preclear_info.txt

Link to comment

The problem HDD has this error, but I am not sure what you mean by normalized value:

Attribute                  Current  Worst  Threshold  Data  Status
Calibration retry count    252      252    0          2     warning

The normalized values are the first two in the list you gave.

 

The "current" normalized value is 252.

The worst ever normalized value for that parameter is 252.

 

The failure threshold for that parameter is 0.  If the current value drops to or below the failure threshold, the disk fails that SMART test and is reported as FAILING_NOW.

 

The "data" column in your list is a "raw" value that has meaning only to the manufacturer in most cases.

 

Here is a sample of a smartctl output (as I suggested you get) so you can see what a failing attribute looks like:

Also note, there is no standard among disks.  This disk has 100 as its starting normalized value for calibration attempts once the drive gets a few hours of use, and a setting of 253 as it leaves the factory.  The only standard is that if the current normalized value is above the failure threshold, the drive is considered good by the SMART report.  The drive shown below has a normalized value of 84 for re-allocated sectors and a failure threshold of 140.  It is FAILING_NOW.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       38319
  3 Spin_Up_Time            0x0027   040   040   021    Pre-fail  Always       -       15000
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       257
  5 Reallocated_Sector_Ct   0x0033   084   084   140    Pre-fail  Always   FAILING_NOW 927
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4019
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       6210
194 Temperature_Celsius     0x0022   122   102   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       550
197 Current_Pending_Sector  0x0032   199   196   000    Old_age   Always       -       338
198 Offline_Uncorrectable   0x0030   200   198   000    Old_age   Offline      -       103
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   179   151   000    Old_age   Offline      -       4355
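If you would rather not eyeball the table, here is a rough Python sketch of the same check: it scans the attribute rows and lists any whose normalized VALUE is at or below THRESH. The function name `failing_attributes` and the regular expression are my own for illustration, and it assumes the column layout shown above; other smartctl versions or output formats may need a different pattern.

```python
import re

# One attribute row of `smartctl -a` output, in the column layout shown above:
# ID, name, flag, then the normalized VALUE, WORST and THRESH columns.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
)

def failing_attributes(report):
    """Return (id, name, value, thresh) for rows whose value is at or below thresh."""
    failing = []
    for line in report.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines and anything that is not an attribute row
        value, thresh = int(m["value"]), int(m["thresh"])
        if value <= thresh:
            failing.append((int(m["id"]), m["name"], value, thresh))
    return failing

# Three rows copied from the report above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   084   084   140    Pre-fail  Always   FAILING_NOW 927
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       550
"""

print(failing_attributes(SAMPLE))
```

Only the re-allocated sector row is returned. The calibration row (100 against a threshold of 0) is not, which is the same situation as the 252 against 0 reported earlier in this thread.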

Link to comment

I am preclearing the "bad" hdd again.

 

Would you RMA it?

I somehow don't think you understand SMART yet.

 

I did not see any reason to RMA any drive.   

 

Did I miss something that you are seeing?  If so, post the line you are concerned about so I don't have to read your mind.

 

Joe L.

Link to comment

I need to clarify. I do understand SMART. Let's go beyond that now.

Once preclear had finished on all 5 HDDs (using Alt-F1, F2, F3, etc.), preclear reported errors and information.

I then precleared the two HDDs with the most errors and information again, one at a time, and this time there were no errors and no information. Nothing. Just a clear screen.

 

This leads me to believe that there could either be something wrong with preclear and multiple disks or something wrong with my motherboard.

Link to comment

The pre-clear process shows you the DIFFERENCES between a SMART report taken at the beginning of the clearing process and one taken at the end.

 

If there are no differences, there is no output.  If there are any differences, you'll get output.  The output does not indicate an error, just a difference.

There are a handful of lines that will always be different; those are filtered out of the output of the "diff" command.  For example, I expect the power-on-hours to be different, so you'll never see it in the "diff" output.

 

Furthermore a drive could be FAILING_NOW in the beginning SMART report and also in the end SMART report and because it did not change it would not show in the "diff" output.

 

Basically, do not use just the "diff" output to determine if a disk is failing.  Use it in combination with the full smart report.
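To make that concrete, here is a small Python sketch of the idea. It is not the pre-clear script itself: the function name `smart_diff`, the list of ignored attributes, and the "after" values are all made up for illustration. It drops the lines that are expected to change and then reports only the rows that differ between the two reports.

```python
import difflib

# Attributes assumed to change on every run; this list is illustrative only.
IGNORED = ("Power_On_Hours", "Temperature_Celsius")

def smart_diff(before, after, ignored=IGNORED):
    """Return the changed rows between two reports, skipping ignored attributes."""
    def keep(text):
        return [l for l in text.splitlines() if not any(n in l for n in ignored)]
    # ndiff marks removed rows with "- " and added rows with "+ ".
    return [l for l in difflib.ndiff(keep(before), keep(after))
            if l.startswith(("- ", "+ "))]

BEFORE = """\
  5 Reallocated_Sector_Ct   0x0033   084   084   140    Pre-fail  Always   FAILING_NOW 927
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4019
197 Current_Pending_Sector  0x0032   199   196   000    Old_age   Always       -       338
"""

# Hypothetical end-of-run report: hours and pending sectors moved, row 5 did not.
AFTER = """\
  5 Reallocated_Sector_Ct   0x0033   084   084   140    Pre-fail  Always   FAILING_NOW 927
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4049
197 Current_Pending_Sector  0x0032   199   196   000    Old_age   Always       -       340
"""

for line in smart_diff(BEFORE, AFTER):
    print(line)
```

Only the pending-sector row shows up. The FAILING_NOW row is identical in both reports, so it never appears in the difference, which is exactly why the full report has to be read as well.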

For each disk there are two separate SMART reports in your /tmp directory.  In the same way, they are also logged in your syslog.

You can decide if your disks are incrementing the smart parameters... it has absolutely nothing to do with the pre-clear processing.

 

All that said... if there is poor quality cabling and you get cross-talk and induced noise because you tightly cabled the drives together, or a noisy power supply, or disks that vibrate and make it harder for adjacent disks to read their platters because of the transmitted vibration, then yes, pre-clearing multiple disks at the same time may uncover a hardware issue with your server.  It may not be a single disk... It may only surface when they are all active together.  It is your hardware.  You get to "defend" it.  Just don't go returning a disk because of a single "read" error.  All disks have read errors... some report them, some do not... They just re-try and re-read the sector.

 

You'll have to experiment and learn how your server performs.  You'll just need to be aware that all your disks will be active when performing an initial parity calc, or when performing a parity check.

 

Pre-clear can handle multiple disks being cleared at the same time... but can your hardware?  It is a reporting tool.  You can analyze the output and decide on your own. (now that you know how to interpret the results)

 

A "disk calibration" error might be caused if the disk temperature changed so drastically during the pre-clear process that the disk platters changed physical size enough that the heads had to re-calibrate.  You have to analyze your own situation.  If a disk is failing and you suspect it only acts up when multiple disks are spinning, re-test it.

 

Joe L.

Link to comment

No we can not move beyond that. You do not understand preclear and SMART.

 

What "errors" did preclear report? It typically reports differences between the SMART report before and after. It is perfectly natural for there to be differences that are NOT errors. A blank report means there are no differences. A full report only means there were differences. It does NOT mean there are errors.

 

For example:

The raw values can change but they do not indicate an error.

The nominal values can change but they do not indicate an error.

The threshold values can change but they do not indicate an error.

The maximum values can change but they do not indicate an error.

The minimum values can change but they do not indicate an error.

 

 

Link to comment
  • 2 weeks later...

As I don't know much about SMART can you check these pre preclear results?

The file you attached is not the pre-clear results.  Those are the initial SMART reports taken at the start of the pre-clearing process.  The pre-clear process takes another smart report at its end and then shows you the differences between the initial report and the post clear report.

 

You would need to post the final results for anyone to be able to know how it did.

 

Joe L.

Link to comment

Does anyone know what this means?

 

Offline data collection status:  (0x80) Offline data collection activity was never started.

Offline data collection status:  (0x84) Offline data collection activity was suspended by an interrupting command from host.

Offline data collection is typically a requested "short" or "long" SMART test, although I've seen disks perform tests on their own when they are otherwise idle.  From what your output is saying, you've never requested either a long or short test of the drive.

 

The "offline" collection is aborted when the disk is spun down  (The interrupting command is the spin-down command).
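For what it's worth, the two status bytes can be pulled apart with a few lines of Python. This is a sketch under assumptions: the table only covers the two codes quoted in this thread, and treating the top bit (0x80) as an "automatic offline collection" flag separate from the activity code is my understanding of how smartctl reports it, not something the output above states. The function name `describe_offline_status` is made up.

```python
# Low 7 bits of the status byte -> activity text.
# Only the two codes quoted in this thread are listed here.
ACTIVITY = {
    0x00: "was never started",
    0x04: "was suspended by an interrupting command from host",
}

def describe_offline_status(status):
    """Split the status byte into (activity text, auto-offline flag)."""
    activity = ACTIVITY.get(status & 0x7F, "not covered by this sketch")
    auto_offline = bool(status & 0x80)  # assumption: top bit is a separate flag
    return activity, auto_offline

for code in (0x80, 0x84):
    print(hex(code), describe_offline_status(code))
```

Read that way, 0x80 and 0x84 differ only in the low bits: the first means nothing was ever started, the second means a collection was under way and got interrupted.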

Link to comment
