Drive failing???

February 3, 2009

I received three ST31500341AS last week from Amazon.com. It looks like one might be failing. Comments?

» reallocated_sector_ct=175

» current_pending_sector=55

» offline_uncorrectable=55

» head_flying_hours=2.58944e+13

» attribute_241=1680463781

» attribute_242=1080455958

http://www.aandsspecialty.com/smart.jpg

For some reason the linked image isn't showing up.

February 3, 2009

I wouldn't worry about "head_flying_hours", attribute_241 or attribute_242. Not sure what these are but normally gargantuan values are not very meaningful.

The high_fly_writes are not uncommon, and I wouldn't worry about them unless they start to climb quickly. I've seen logs with values in the low 100s. If these go into the thousands or tens of thousands I'd be worried, but so far the evidence is that they are not a huge problem in reasonably small numbers.

I don't have much experience with "offline_uncorrectable", but it is likely related to current_pending_sector. Pending sectors are ones that have been marked bad but have not yet been remapped. (unRAID normally forces the issue and causes remaps to occur, which is a good thing, but obviously there is a problem here. Maybe offline_uncorrectable means they can't be remapped?)

The reallocated_sector_ct and current_pending_sector values ARE something to be worried about. If this is a new drive and these are increasing seemingly every time you copy data to the drive, I think I would pursue getting a replacement. Conventional wisdom is that if the number stays constant, even if it is relatively high (into the hundreds), the drive is okay. But in my experience anything over 2 is unusual, and anything over 10 is trouble. Joe L. has an old drive with 100 reallocated sectors and the number has not gone up. This is very uncommon in my experience.
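If you want to apply that rule of thumb mechanically, here is a rough Python sketch. It assumes the "name=value" lines as posted at the top of the thread; the function names and verdict labels are made up for illustration, and the cutoffs (over 2 unusual, over 10 trouble, any pending sectors worrying) are just my personal experience from above, not anything from a spec.

```python
# Rule-of-thumb triage for "name=value" attribute lines.
# The thresholds are one poster's experience, not a standard.

def parse_attributes(text):
    """Turn lines like '» reallocated_sector_ct=175' into a dict of ints."""
    attrs = {}
    for line in text.splitlines():
        line = line.lstrip("» ").strip()
        if "=" not in line:
            continue
        name, _, value = line.partition("=")
        try:
            # float() first so values like 2.58944e+13 also parse
            attrs[name.strip()] = int(float(value))
        except ValueError:
            continue  # ignore anything that is not numeric
    return attrs


def triage(attrs):
    """Return (attribute, verdict) pairs for the attributes worth a look."""
    findings = []
    realloc = attrs.get("reallocated_sector_ct", 0)
    if realloc > 10:
        findings.append(("reallocated_sector_ct", "trouble"))
    elif realloc > 2:
        findings.append(("reallocated_sector_ct", "unusual"))
    if attrs.get("current_pending_sector", 0) > 0:
        findings.append(("current_pending_sector", "worrying"))
    return findings


def is_growing(earlier, later, name):
    """True if an attribute increased between two snapshots."""
    return later.get(name, 0) > earlier.get(name, 0)


posted = """
» reallocated_sector_ct=175
» current_pending_sector=55
» offline_uncorrectable=55
"""
print(triage(parse_attributes(posted)))
```

The more useful check is probably is_growing on two reports taken a few days apart, since a count that keeps climbing matters more than the absolute number.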

To give you a comparison, I have 16 drives in my system. 14 have 0 reallocated sectors and 2 have 1 reallocated sector each. That's it. None have current_pending_sectors. I do have high_fly_writes (1 on 1 drive) as well as some spin_retry_counts (11/15 on 2 drives). ALL of these attributes are on my only two 1TB Seagates! Although once a Seagate fanboy in the 7200.10 days, I am now singing the praises of WD GP drives. These have ZERO bad sectors and no other curious attribute values either.

Bottom line, replace disk2.

February 4, 2009

My Seagate 1.5TB is showing the following:

s spin_retry_count=6

s high_fly_writes=357

s udma_crc_error_count=3

Three of my WD 'Green' 1TB drives are all showing udma_crc_error_count values of 5 or under.

The one I'm curious about: one of my WD 1TB drives is showing ata_error_count=24 (in RED. Red worries me!). I can't figure out what this parameter is or whether a value of 24 indicates a problem.

February 4, 2009

That ata_error_count is the number of errors logged against the drive. The list of errors is available in the smartctl report itself.

Frequently errors occur due to cabling problems or other install-time issues. There is no way to reset the error count, so once they show up, they are there to stay.

Feel free to post your smartctl report and I'll have a look.

My understanding of high fly writes is that when they are detected the drive takes corrective action to keep the heads from getting "too high" and causing a problem. I'd not be overly concerned with them, but to be honest 357 is the largest number I've ever seen. (I have seen over 100 though).

The udma_crc_error_count at 3 doesn't sound too bad. I'd keep an eye on it.

I see spin_retry_count incrementing by a few each month on my drives. Eventually the number will likely get large enough to fail the drive and then I can get a new one. It is dangerous when an older drive starts getting spin retries, kind of like a car having trouble starting. But if a new car is always a little slow to start, it doesn't mean it's a problem.

February 4, 2009

Is this what you mean by smartctl report:

Statistics for /dev/sde WDC_WD10EACS-00ZJB0_WD-WCASJ1234212

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10EACS-00ZJB0
Serial Number:    WD-WCASJ1234212
Firmware Version: 01.01B01
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Feb  3 17:25:28 2009 GMT+8
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (27960) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       18
  3 Spin_Up_Time            0x0003   186   178   021    Pre-fail  Always       -       7675
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       830
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   100   253   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6332
 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       218
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       285
193 Load_Cycle_Count        0x0032   150   150   000    Old_age   Always       -       151647
194 Temperature_Celsius     0x0022   129   108   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   193   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   193   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
200 Multi_Zone_Error_Rate   0x0008   200   189   051    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 24 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 24 occurred at disk power-on lifetime: 3101 hours (129 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 b7 64 7b e0  Error: UNC 192 sectors at LBA = 0x007b64b7 = 8086711

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 c0 b7 64 7b 3f 08      05:55:20.113  READ DMA EXT
  27 00 00 00 00 00 00 08      05:55:20.093  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      05:55:20.073  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      05:55:20.053  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 00 08      05:55:20.033  READ NATIVE MAX ADDRESS EXT

Error 23 occurred at disk power-on lifetime: 3101 hours (129 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 b7 64 7b e0  Error: UNC 192 sectors at LBA = 0x007b64b7 = 8086711

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 c0 b7 64 7b 3f 08      05:55:15.948  READ DMA EXT
  27 00 00 00 00 00 00 08      05:55:15.928  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      05:55:15.908  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      05:55:15.888  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 00 08      05:55:15.868  READ NATIVE MAX ADDRESS EXT

Error 22 occurred at disk power-on lifetime: 3101 hours (129 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 b7 64 7b e0  Error: UNC 192 sectors at LBA = 0x007b64b7 = 8086711

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 c0 b7 64 7b 3f 08      05:55:11.782  READ DMA EXT
  27 00 00 00 00 00 00 08      05:55:11.762  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      05:55:11.742  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      05:55:11.722  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 00 08      05:55:11.702  READ NATIVE MAX ADDRESS EXT

Error 21 occurred at disk power-on lifetime: 3101 hours (129 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 b7 64 7b e0  Error: UNC 192 sectors at LBA = 0x007b64b7 = 8086711

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 c0 b7 64 7b 3f 08      05:55:07.617  READ DMA EXT
  27 00 00 00 00 00 00 08      05:55:07.597  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      05:55:07.577  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      05:55:07.557  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 00 08      05:55:07.537  READ NATIVE MAX ADDRESS EXT

Error 20 occurred at disk power-on lifetime: 3101 hours (129 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 b7 64 7b e0  Error: UNC 192 sectors at LBA = 0x007b64b7 = 8086711

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 c0 b7 64 7b 3f 08      05:55:03.451  READ DMA EXT
  27 00 00 00 00 00 00 08      05:55:03.432  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      05:55:03.411  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      05:55:03.411  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 00 08      05:55:03.392  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1389         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

February 4, 2009

Is this what you mean by smartctl report:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       18
  3 Spin_Up_Time            0x0003   186   178   021    Pre-fail  Always       -       7675
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       830
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   100   253   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6332
 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       218
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       285
193 Load_Cycle_Count        0x0032   150   150   000    Old_age   Always       -       151647
194 Temperature_Celsius     0x0022   129   108   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   193   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   193   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
200 Multi_Zone_Error_Rate   0x0008   200   189   051    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 24 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 24 occurred at disk power-on lifetime: 3101 hours (129 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 b7 64 7b e0  Error: UNC 192 sectors at LBA = 0x007b64b7 = 8086711

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 c0 b7 64 7b 3f 08      05:55:20.113  READ DMA EXT
  27 00 00 00 00 00 00 08      05:55:20.093  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      05:55:20.073  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      05:55:20.053  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 00 08      05:55:20.033  READ NATIVE MAX ADDRESS EXT

Yes, that is a SMART report.

I clipped away some of the report to help describe how to look at the ata errors intelligently.

I highlighted several numbers. The top highlight tells you how many hours the disk had been powered on when the SMART report was taken. This drive has been powered on for 6332 hours (~264 24-hour days).

The second shows you where the ata_error_count comes from (it is 24 for this drive).

The third tells you the drive's power-on hours when the most recent error occurred: 3101 hours = 129 days. So about 135 (264 - 129) power-on days ago, your drive got an error.
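To make the arithmetic explicit, here is a small Python sketch using the two numbers from the report (Power_On_Hours and the "disk power-on lifetime" of Error 24). The variable names are my own; the only assumption is that both counters are in hours, which is how smartctl labels them here.

```python
HOURS_PER_DAY = 24

power_on_hours = 6332   # Power_On_Hours raw value in the report
error_at_hours = 3101   # "Error 24 occurred at disk power-on lifetime: 3101 hours"

drive_age_days = power_on_hours / HOURS_PER_DAY
error_age_days = error_at_hours / HOURS_PER_DAY
days_since_error = (power_on_hours - error_at_hours) / HOURS_PER_DAY

print(f"drive age:           {drive_age_days:.1f} power-on days")
print(f"error happened at:   {error_age_days:.1f} power-on days")
print(f"time since the error: {days_since_error:.1f} power-on days")
```

Keep in mind these are power-on days, not calendar days, so a server that is switched off part of the time will have had the error further back on the calendar.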

The fourth section tells you what the error was. This was an UNC error (UNC=Uncorrectable Data: An ECC error in the data field could not be corrected (a media error or read instability)).
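If you are curious how the "UNC" gets read out of that register line, here is a rough Python sketch that decodes the ER and ST bytes from the row "40 51 c0 b7 64 7b e0". The bit names are the usual ATA error and status register bits as I understand them; only a handful are listed, and anything not in the tables is reported as "other", so treat it as an illustration rather than a full decoder.

```python
# Bit masks for the ATA error (ER) and status (ST) registers.
# Only commonly seen bits are listed; this is not a complete table.
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error
    0x40: "UNC",   # uncorrectable data
    0x10: "IDNF",  # requested sector ID not found
    0x04: "ABRT",  # command aborted
}
STATUS_BITS = {
    0x80: "BSY",   # busy
    0x40: "DRDY",  # device ready
    0x20: "DF",    # device fault
    0x10: "DSC",   # seek complete (older ATA revisions)
    0x08: "DRQ",   # data request
    0x01: "ERR",   # error flag, see the error register
}


def decode(value, names):
    """List the names of the bits set in value, plus any unnamed bits."""
    flags = [name for mask, name in names.items() if value & mask]
    unknown = value & ~sum(names)
    if unknown:
        flags.append(f"other(0x{unknown:02x})")
    return flags


row = "40 51 c0 b7 64 7b e0"           # ER ST SC SN CL CH DH from the report
er, st, sc = (int(x, 16) for x in row.split()[:3])

print("ER:", decode(er, ERROR_BITS))
print("ST:", decode(st, STATUS_BITS))
print("sector count:", sc)
```

The sector count byte c0 is 192 in decimal, which matches the "192 sectors" in the error line.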

If you look at the other 4 errors (it only reports the last 5), they all look like they happened at basically the same time and got basically the same error at the same spot on the disk. You have a UDMA_CRC_Error_Count of 4 - not sure if that is related to this error or not. Have you run a full parity check in the last 135 power-on days? If so, this error did not recur and you are probably fine. If not, I'd run one and see if the error continues. At some point unRAID should mark the sector bad and remap a fresh one. In my experience, ata errors are seldom serious unless they occurred very recently and align with a particular observed event (like a server crash), but it is smart to try to understand them. This one could be indicative of a bad spot on your disk that needs to be remapped.
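Rather than comparing the five entries by eye, you could group them with a few lines of Python. This is only a sketch: the regular expressions are written against the wording of the smartctl 5.38 output in this thread and may not match other versions, and summarize_errors is a name I made up. The excerpt below is copied from the two most recent entries in the report.

```python
import re
from collections import Counter

# Patterns matched against the error log text as printed in this thread.
ERROR_HEADER = re.compile(
    r"Error (\d+) occurred at disk power-on lifetime: (\d+) hours")
ERROR_DETAIL = re.compile(
    r"Error: (\w+) (\d+) sectors at LBA = (0x[0-9a-fA-F]+) = (\d+)")


def summarize_errors(log_text):
    """Count logged errors by (power-on hour, error type, LBA)."""
    hours = [int(m.group(2)) for m in ERROR_HEADER.finditer(log_text)]
    details = [(m.group(1), int(m.group(4)))
               for m in ERROR_DETAIL.finditer(log_text)]
    return Counter((hour, kind, lba)
                   for hour, (kind, lba) in zip(hours, details))


excerpt = """
Error 24 occurred at disk power-on lifetime: 3101 hours (129 days + 5 hours)
  40 51 c0 b7 64 7b e0  Error: UNC 192 sectors at LBA = 0x007b64b7 = 8086711
Error 23 occurred at disk power-on lifetime: 3101 hours (129 days + 5 hours)
  40 51 c0 b7 64 7b e0  Error: UNC 192 sectors at LBA = 0x007b64b7 = 8086711
"""

summary = summarize_errors(excerpt)
for (hour, kind, lba), count in summary.items():
    print(f"{count} x {kind} at LBA {lba}, power-on hour {hour}")
```

A single group means the same read kept failing at the same spot, which is what the Powered_Up_Time stamps in the report suggest as well (the retries are a few seconds apart).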

Many times there are signs of bad data. Your error says it is looking at LBA 0x007b64b7, which "feels" valid (although I'm not sure), but I saw a log today where the LBA was 0x0FFFFFFF. This is clearly not a normal value. I believe it was caused by a bad or loose cable. The drive is going to issue an error since this LBA does not exist on the drive.
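One way to check whether an LBA "feels" valid is to compare it against the drive's capacity. The Python sketch below rebuilds the LBA from the SN, CL, CH and DH bytes of the register row and tests it against the "User Capacity" line. Two assumptions are mine: 512-byte sectors, and the 28-bit register layout (which does reproduce the number smartctl printed). Note that 0x0FFFFFFF is the largest value a 28-bit LBA can hold, so the sketch flags it as suspicious on its own, whether or not it falls inside the capacity.

```python
SECTOR_BYTES = 512        # assumption: 512-byte logical sectors
MAX_LBA28 = 0x0FFFFFFF    # all 28 address bits set


def lba_from_registers(sn, cl, ch, dh):
    """Rebuild a 28-bit LBA from the SN, CL, CH and DH register bytes."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def lba_in_range(lba, capacity_bytes, sector_bytes=SECTOR_BYTES):
    """True if the LBA addresses a sector that exists on the drive."""
    return 0 <= lba < capacity_bytes // sector_bytes


def looks_suspicious(lba):
    """Flag the all-ones 28-bit pattern mentioned in the thread."""
    return lba == MAX_LBA28


capacity = 1_000_204_886_016   # "User Capacity" from the report, in bytes
lba = lba_from_registers(0xB7, 0x64, 0x7B, 0xE0)

print(lba, hex(lba))
print("in range:", lba_in_range(lba, capacity))
print("suspicious pattern:", looks_suspicious(lba))
```

For the error in this report the LBA comes out as 8086711, well inside the drive, so a real bad spot on the media is at least plausible.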

Hope this helps!

Cheers!

February 4, 2009

Very interesting and educational analysis. Thank you. I never thought to compare when an error occurred to how long the drive has been in service. Well, I didn't realize there was a timestamp on when errors occurred. These reports make a little more sense to me now.

February 4, 2009

Author

Thanks bjp999! Amazon is sending a new drive. It should arrive tomorrow.
