Is the HDD faulty

December 15, 2010

In HD Tune, my Samsung HD204UI F4 shows a SMART issue: Calibration Retry Count is flagged with a value of 2.

But Samsung's own diagnostic tool says it's fine.

Which one should I believe?

December 15, 2010

Get a SMART report on that drive.

smartctl -d ata -a /dev/sdX

where sdX = the device for your disk.

Then, look at the normalized value for the calibration retry parameter. If it is nearing the failure threshold for that parameter, RMA the drive. If not, stop worrying, use the drive.

Joe L.

December 15, 2010

Author

HDTune gives you all of that info anyway. 100 is the threshold and the error amount is 2.

December 15, 2010

What is the NORMALIZED value for that parameter? It sounds as if you are telling me the RAW value=2.

Joe L.

December 15, 2010

Author

I am not sure what you mean by normalised value, but it reads:

Attribute                Current  Worst  Threshold  Data  Status
Calibration retry count  252      252    0          2     warning

Joe can you check my preclear logs? The first hdd in the log passed without any issues.

Which hdd would you RMA?

Thanks.

preclear_info.txt

December 15, 2010

I see absolutely nothing wrong with any of those disks.

No RMA would be in order on any of them.

December 15, 2010

Author

The problem hdd has this error, but I am not sure what you mean by normalised value:

Attribute                Current  Worst  Threshold  Data  Status
Calibration retry count  252      252    0          2     warning

December 15, 2010

The normalized values are the first two in the list you gave.

The "current" normalized value is 252.

The worst ever normalized value for that parameter is 252.

The failure threshold for that parameter is 0. If the current value drops to or BELOW the failure threshold, the disk fails that SMART test and is considered as FAILING_NOW.

The "data" column in your list is a "raw" value that has meaning only to the manufacturer in most cases.
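
To make the column meanings concrete, here is a minimal Python sketch of the row HD Tune reported. The class and method names are made up for illustration, and the check assumes the rule described in the smartctl documentation: an attribute has failed when its normalized value is less than or equal to its threshold.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    name: str
    current: int    # normalized value right now
    worst: int      # lowest normalized value ever recorded
    threshold: int  # failure threshold set by the manufacturer
    raw: int        # vendor-specific raw counter ("data" in HD Tune)

    def has_failed(self) -> bool:
        # Assumed rule: failed when the normalized value is at or below
        # the threshold. The raw value plays no part in the decision.
        return self.current <= self.threshold

    def headroom(self) -> int:
        # Distance between the normalized value and the threshold.
        return self.current - self.threshold


# The row reported by HD Tune in this thread.
calibration = SmartAttribute("Calibration Retry Count", 252, 252, 0, 2)
print(calibration.has_failed(), calibration.headroom())
```

With a current value of 252 against a threshold of 0, the attribute is nowhere near failing, whatever the raw value of 2 suggests.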

Note that there is no standard among disks. The disk in the sample below has 100 as its starting normalized value for calibration attempts once the drive gets a few hours of use, and a setting of 253 as it leaves the factory. The only standard is that if the normalized current value is above the failure threshold, the drive is considered good by the SMART report.

Here is a sample of smartctl output (as I suggested you get) so you can see what a failing attribute looks like. The drive shown has a normalized value of 84 for re-allocated sectors and a failure threshold of 140. It is FAILING_NOW.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       38319
  3 Spin_Up_Time            0x0027   040   040   021    Pre-fail  Always       -       15000
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       257
  5 Reallocated_Sector_Ct   0x0033   084   084   140    Pre-fail  Always   FAILING_NOW 927
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4019
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       2
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       6210
194 Temperature_Celsius     0x0022   122   102   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       550
197 Current_Pending_Sector  0x0032   199   196   000    Old_age   Always       -       338
198 Offline_Uncorrectable   0x0030   200   198   000    Old_age   Offline      -       103
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   179   151   000    Old_age   Offline      -       4355
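
If you would rather not scan the table by eye, the same comparison can be scripted. This is a rough Python sketch, not part of smartctl or preclear: the function name and the regular expression are my own, and it assumes the ten-column attribute layout shown above and the "value at or below threshold means failed" rule.

```python
import re

# One attribute row: ID, name, flag, VALUE, WORST, THRESH, type, updated,
# WHEN_FAILED, RAW_VALUE (only the leading integer of the raw field is kept).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def failing_attributes(report: str) -> list[str]:
    """Return names of attributes whose VALUE is at or below THRESH."""
    failing = []
    for line in report.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header lines and anything that is not an attribute row
        name = match.group(2)
        value, thresh = int(match.group(4)), int(match.group(6))
        if value <= thresh:
            failing.append(name)
    return failing


# Three rows copied from the sample report above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   084   084   140    Pre-fail  Always   FAILING_NOW 927
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4019
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
"""
print(failing_attributes(SAMPLE))
```

Only the re-allocated sector attribute is reported, because it is the only row where VALUE (84) has fallen to or below THRESH (140).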

December 15, 2010

Author

Ok. Thanks. I understand SMART now!

So if the current value of 253 starts dropping towards the threshold value of 0, then there is a problem. I get it.

Thanks Joe!

December 15, 2010

Author

I am preclearing the "bad" hdd again.

Would you RMA it?

December 15, 2010

I somehow don't think you understand SMART yet.

I did not see any reason to RMA any drive.

Did I miss something that you are seeing? If so, post the line you are concerned about so I don't have to read your mind.

Joe L.

December 18, 2010

Author

Nope I do understand. However I have noticed something odd. I cleared 5 hdds at the same time and preclear reported errors on all hdds.

I then precleared 2 of the worst hdds from the 5 one at a time and no errors at all!!

Maybe preclear can't handle more than 1 hdd at a time?

December 18, 2010

Preclear can handle multiple simultaneous clears.

I too think you do not understand SMART. What specifically in those SMART reports makes you think the drives are faulty?

December 18, 2010

Author

I need to clarify. I do understand SMART. Let's go beyond that now.

Once preclear had finished on all 5 hdds (using Alt+F1, F2, F3, etc.), preclear reported errors and information.

I precleared two of the hdds with the most errors and information again, one at a time, and this time there were no errors and no information. Nothing. Just a clear screen.

This leads me to believe that there could either be something wrong with preclear and multiple disks or something wrong with my motherboard.

December 18, 2010

The pre-clear process shows you the DIFFERENCES between a SMART report taken at the beginning of the clearing process and one taken at the end.

If there are no differences, no output will result. If there are any differences, you'll get output. The output does not indicate an error, just a difference.

There are a handful of lines that will always be different; those are filtered out from the output of the "diff" command. For example, I expect the power-on-hours to be different, so you'll never see it in the "diff" output.

Furthermore a drive could be FAILING_NOW in the beginning SMART report and also in the end SMART report and because it did not change it would not show in the "diff" output.

Basically, do not use just the "diff" output to determine if a disk is failing. Use it in combination with the full smart report.
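
The behaviour described above can be illustrated with a small Python sketch. It is not the preclear script itself (that uses the "diff" command): the function name is hypothetical, the filter list contains only the power-on-hours example mentioned here, and the "after" numbers are invented for the demonstration.

```python
import difflib

# Lines expected to differ on every run. Power_On_Hours is the example
# given above; the real filter list is not shown in this thread.
ALWAYS_CHANGES = ("Power_On_Hours",)


def smart_diff(before: str, after: str) -> list[str]:
    """Return the lines that differ, prefixed with '-' (old) or '+' (new)."""
    def keep(line: str) -> bool:
        return not any(name in line for name in ALWAYS_CHANGES)

    old = [line for line in before.splitlines() if keep(line)]
    new = [line for line in after.splitlines() if keep(line)]
    changed = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        changed += ["-" + line for line in old[i1:i2]]
        changed += ["+" + line for line in new[j1:j2]]
    return changed


BEFORE = """\
  5 Reallocated_Sector_Ct   0x0033   084   084   140    Pre-fail  Always   FAILING_NOW 927
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4019
197 Current_Pending_Sector  0x0032   199   196   000    Old_age   Always       -       338
"""
# Invented "after" report: hours and pending sectors moved, nothing else.
AFTER = BEFORE.replace("4019", "4050").replace(" 338", " 340")

for line in smart_diff(BEFORE, AFTER):
    print(line)
```

Only the pending-sector line shows up. The re-allocated sector attribute is FAILING_NOW in both reports, but because it did not change it never appears in the output, which is why the full report still has to be read.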

For each disk there are two separate SMART reports in your /tmp directory. In the same way, they are also logged in your syslog.

You can decide if your disks are incrementing the smart parameters... it has absolutely nothing to do with the pre-clear processing.

All that said... if there is poor-quality cabling and you get cross-talk and induced noise because you tightly cabled the drives together, or a noisy power supply, or disks that vibrate and make it harder for adjacent disks to read their platters because of the transmitted vibration, then yes, pre-clearing multiple disks at the same time may uncover a hardware issue with your server. It may not be a single disk... It may only surface when they are all active together. It is your hardware. You get to "defend" it. Just don't go returning a disk because of a single "read" error. All disks have read errors... some report them, some do not... They just re-try and re-read the sector.

You'll have to experiment and learn how your server performs. You'll just need to be aware that all your disks will be active when performing an initial parity calc, or when performing a parity check.

Pre-clear can handle multiple disks being cleared at the same time... but can your hardware? It is a reporting tool. You can analyze the output and decide on your own. (now that you know how to interpret the results)

A "disk calibration" error might be caused if the disk temperature changed so drastically during the pre-clear process that the disk platters changed physical size enough that the heads had to re-calibrate. You have to analyze your own situation. If a disk is failing and you suspect it only acts up when multiple disks are spinning, re-test it.

Joe L.

December 18, 2010

No we can not move beyond that. You do not understand preclear and SMART.

What "errors" did preclear report? It typically reports differences between the SMART report before and after. It is perfectly natural for there to be differences that are NOT errors. A blank report means there are no differences. A full report only means there were differences. It does NOT mean there are errors.

For example:

The raw values can change but they do not indicate an error.

The nominal values can change but they do not indicate an error.

The threshold values can change but they do not indicate an error.

The maximum values can change but they do not indicate an error.

The minimum values can change but they do not indicate an error.

December 18, 2010

Author

Thanks Joe. I have checked and re-checked the hdd with the calibration retry count issue and it seems fine. Only time will tell!

January 1, 2011

Author

As I don't know much about SMART can you check these pre preclear results?

before_preclear_2.txt

January 1, 2011

The file you attached is not the pre-clear results. Those are the initial SMART reports taken at the start of the pre-clearing process. The pre-clear process takes another smart report at its end and then shows you the differences between the initial report and the post clear report.

You would need to post the final results for anyone to be able to know how it did.

Joe L.

January 1, 2011

Author

Hi Joe,

I was aware of that, but one of the HDDs in the initial SMART check before preclear was unusual.

January 1, 2011

Which drive was "unusual"? I saw one re-allocated sector on one disk... I guess I did not look close enough.

What did you see?

Joe L.

January 1, 2011

Author

The one at the end. JK1130YAH8NG5T. It has extra errors on it. I was just wondering what these extra bits meant.

January 2, 2011

Author

I have finished the preclear. They all have the same preclear result. I think this is because I did Ctrl C to cancel a previous preclear.

preclear_BKKT_W31T_NETT_NG5T.txt

January 4, 2011

Author

Does anyone know what this means?

Offline data collection status: (0x80) Offline data collection activity was never started.

Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host.

January 4, 2011

Offline data collection is typically a requested "short" or "long" SMART test, although I've seen disks perform tests on their own when they are otherwise idle. From what your output is saying, you've never requested either a long or short test of the drive.

The "offline" collection is aborted when the disk is spun down (The interrupting command is the spin-down command).
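
For anyone curious how the two hex codes relate, here is a hedged Python sketch. It assumes, based on my reading of how smartmontools prints this field, that the high bit of the status byte is a separate flag and the low seven bits carry the activity code. Only the two codes quoted in this thread are included, and the function name is made up.

```python
# Meanings of the low seven bits. Only the two codes quoted in this
# thread are listed; other codes are deliberately left out.
ACTIVITY = {
    0x00: "was never started",
    0x04: "was suspended by an interrupting command from host",
}


def decode_offline_status(status: int) -> tuple[bool, str]:
    """Split the status byte into (high bit set, activity text)."""
    high_bit = bool(status & 0x80)  # assumed to be a separate flag bit
    activity = ACTIVITY.get(status & 0x7F, "unknown code")
    return high_bit, activity


for code in (0x80, 0x84):
    print(hex(code), decode_offline_status(code))
```

Under that assumption, 0x80 and 0x84 differ only in the activity code (0 versus 4), which matches the two messages in the reports.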
