February 3, 2009
I received three ST31500341AS drives last week from Amazon.com. It looks like one might be failing. Comments?
» reallocated_sector_ct=175
» current_pending_sector=55
» offline_uncorrectable=55
» head_flying_hours=2.58944e+13
» attribute_241=1680463781
» attribute_242=1080455958
http://www.aandsspecialty.com/smart.jpg
For some reason the linked image isn't showing up.

February 3, 2009
I wouldn't worry about head_flying_hours, attribute_241 or attribute_242. I'm not sure what these are, but gargantuan values like these are normally not very meaningful.

The high_fly_writes are not uncommon, and I wouldn't worry about them unless they start to climb quickly. I've seen logs with values in the low 100s. If these go into the thousands or tens of thousands I'd be worried, but so far the evidence is that they are not a huge problem in reasonably small numbers.

I don't have much experience with offline_uncorrectable, but it is likely related to current_pending_sector. These are sectors that have been marked bad but have not been remapped. (unRAID normally forces the issue and causes remaps to occur, which is a good thing, but obviously there is a problem here. Maybe offline_uncorrectable means they can't be remapped?)

The reallocated sectors and current pending sectors ARE something to be worried about. If this is a new drive and these are increasing seemingly every time you copy data to the drive, I think I would pursue getting a replacement. Conventional wisdom is that if the number stays constant, even if it is relatively high (into the hundreds), the drive is okay. But in my experience anything over 2 is unusual, and anything over 10 is trouble. Joe L. has an old drive with 100 reallocated sectors, and the number has not gone up. That is very uncommon in my experience.

To give you a comparison, I have 16 drives in my system. 14 have 0 reallocated sectors. 2 have 1 reallocated sector. That's it. None have current pending sectors. I do have high_fly_writes (1 on 1 drive) as well as some spin_retry_counts (11 and 15 on 2 drives). ALL of these attributes are on my only two 1TB Seagates! Although once a Seagate fanboy in the 7200.10 days, I am now singing the praises of WD GP drives. These have ZERO bad sectors and no other curious attribute values either.

Bottom line, replace disk2.
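The rule of thumb in the reply above (over 2 reallocated sectors is unusual, over 10 is trouble, any pending sectors or a rising count is a worry) can be written down as a small check. This is only a sketch of that poster's personal heuristic, not a vendor threshold; the function name `assess_sector_counts` and its parameters are made up for illustration.

```python
def assess_sector_counts(reallocated, pending, previous_reallocated=None):
    """Apply the forum rule of thumb to raw SMART sector counts.

    reallocated          -- raw value of reallocated_sector_ct
    pending              -- raw value of current_pending_sector
    previous_reallocated -- an earlier reading, if one is available
    """
    notes = []
    if pending > 0:
        notes.append("pending sectors present")
    if reallocated > 10:
        notes.append("reallocated count over 10: trouble")
    elif reallocated > 2:
        notes.append("reallocated count over 2: unusual")
    # A count that keeps growing matters more than a high but stable one.
    if previous_reallocated is not None and reallocated > previous_reallocated:
        notes.append("reallocated count is increasing")
    return notes or ["nothing unusual"]


# Values from the first post in this thread (the suspect 1.5TB drive).
print(assess_sector_counts(reallocated=175, pending=55))
# An old drive whose count is high but has not moved.
print(assess_sector_counts(reallocated=100, pending=0, previous_reallocated=100))
```

A stable reading still trips the "over 10" note here, which matches the poster's view rather than the more relaxed conventional wisdom he mentions.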
February 4, 2009
My Seagate 1.5TB is showing the following:
» spin_retry_count=6
» high_fly_writes=357
» udma_crc_error_count=3
Three of my WD 'Green' 1TB drives are all showing udma_crc_error_count values of 5 or under.

The one I'm curious about: one of my WD 1TB drives is showing ata_error_count=24 (in RED. Red worries me!). I can't figure out what this parameter is or whether a value of 24 indicates a problem.
February 4, 2009
That ata_error_count is the number of errors logged against the drive. The list of errors is available in the smartctl report itself. Errors frequently occur due to cabling problems or other install-time issues. There is no way to reset the error count, so once they show up, they are there to stay. Feel free to post your smartctl report and I'll have a look.

My understanding of high fly writes is that when they are detected, the drive takes corrective action to keep the heads from getting "too high" and causing a problem. I'd not be overly concerned with them, but to be honest 357 is the largest number I've ever seen. (I have seen over 100 though.)

The udma_crc_error_count at 3 doesn't sound too bad. I'd keep an eye on it.

My own spin_retry_counts increment by a few each month. Eventually the number will likely get large enough to fail the drive, and then I can get a new one. It is dangerous when an older drive starts getting spin_retry_counts, kind of like a car having trouble starting. But if a new car is always a little slow to start, it doesn't mean it's a problem.
February 4, 2009
Is this what you mean by smartctl report:

Statistics for /dev/sde WDC_WD10EACS-00ZJB0_WD-WCASJ1234212

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10EACS-00ZJB0
Serial Number:    WD-WCASJ1234212
Firmware Version: 01.01B01
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Feb 3 17:25:28 2009 GMT+8
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (27960) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       18
  3 Spin_Up_Time            0x0003   186   178   021    Pre-fail  Always       -       7675
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       830
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   100   253   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6332
 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       218
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       285
193 Load_Cycle_Count        0x0032   150   150   000    Old_age   Always       -       151647
194 Temperature_Celsius     0x0022   129   108   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   193   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   193   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
200 Multi_Zone_Error_Rate   0x0008   200   189   051    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 24 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 24 occurred at disk power-on lifetime: 3101 hours (129 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 b7 64 7b e0  Error: UNC 192 sectors at LBA = 0x007b64b7 = 8086711

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 c0 b7 64 7b 3f 08      05:55:20.113  READ DMA EXT
  27 00 00 00 00 00 00 08      05:55:20.093  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      05:55:20.073  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      05:55:20.053  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 00 08      05:55:20.033  READ NATIVE MAX ADDRESS EXT

Error 23 occurred at disk power-on lifetime: 3101 hours (129 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 b7 64 7b e0  Error: UNC 192 sectors at LBA = 0x007b64b7 = 8086711

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 c0 b7 64 7b 3f 08      05:55:15.948  READ DMA EXT
  27 00 00 00 00 00 00 08      05:55:15.928  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      05:55:15.908  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      05:55:15.888  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 00 08      05:55:15.868  READ NATIVE MAX ADDRESS EXT

Error 22 occurred at disk power-on lifetime: 3101 hours (129 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 b7 64 7b e0  Error: UNC 192 sectors at LBA = 0x007b64b7 = 8086711

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 c0 b7 64 7b 3f 08      05:55:11.782  READ DMA EXT
  27 00 00 00 00 00 00 08      05:55:11.762  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      05:55:11.742  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      05:55:11.722  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 00 08      05:55:11.702  READ NATIVE MAX ADDRESS EXT

Error 21 occurred at disk power-on lifetime: 3101 hours (129 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 b7 64 7b e0  Error: UNC 192 sectors at LBA = 0x007b64b7 = 8086711

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 c0 b7 64 7b 3f 08      05:55:07.617  READ DMA EXT
  27 00 00 00 00 00 00 08      05:55:07.597  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      05:55:07.577  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      05:55:07.557  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 00 08      05:55:07.537  READ NATIVE MAX ADDRESS EXT

Error 20 occurred at disk power-on lifetime: 3101 hours (129 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 b7 64 7b e0  Error: UNC 192 sectors at LBA = 0x007b64b7 = 8086711

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 c0 b7 64 7b 3f 08      05:55:03.451  READ DMA EXT
  27 00 00 00 00 00 00 08      05:55:03.432  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      05:55:03.411  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      05:55:03.411  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 00 08      05:55:03.392  READ NATIVE MAX ADDRESS EXT

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1389         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
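
For anyone who would rather not read the attribute table by eye, here is a minimal sketch of pulling the rows out of a saved report. It assumes the ten-column layout that smartctl 5.38 printed above; the names `ATTR_ROW`, `parse_attributes` and `SAMPLE` are made up for this example and are not part of smartmontools.

```python
import re

# One row of the "Vendor Specific SMART Attributes" table:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ATTR_ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]{4})\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>.+?)\s*$"
)


def parse_attributes(report_text):
    """Return {attribute_name: fields} for every attribute row in the text."""
    attributes = {}
    for line in report_text.splitlines():
        match = ATTR_ROW.match(line)
        if match:  # header and error-log lines do not match and are skipped
            row = match.groupdict()
            attributes[row["name"]] = {
                "id": int(row["id"]),
                "value": int(row["value"]),
                "worst": int(row["worst"]),
                "thresh": int(row["thresh"]),
                "raw": row["raw"],  # kept as text; some drives add extra fields
            }
    return attributes


# Three rows copied from the report in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6332
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
"""

attrs = parse_attributes(SAMPLE)
print(attrs["Power_On_Hours"]["raw"])
```

The raw value is left as a string on purpose, since its meaning is vendor specific and not every drive reports a plain integer there.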
February 4, 2009
Is this what you mean by smartctl report:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       18
  3 Spin_Up_Time            0x0003   186   178   021    Pre-fail  Always       -       7675
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       830
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   100   253   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6332
 10 Spin_Retry_Count        0x0012   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       218
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       285
193 Load_Cycle_Count        0x0032   150   150   000    Old_age   Always       -       151647
194 Temperature_Celsius     0x0022   129   108   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   193   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   193   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       4
200 Multi_Zone_Error_Rate   0x0008   200   189   051    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 24 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 24 occurred at disk power-on lifetime: 3101 hours (129 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 c0 b7 64 7b e0  Error: UNC 192 sectors at LBA = 0x007b64b7 = 8086711

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  25 00 c0 b7 64 7b 3f 08      05:55:20.113  READ DMA EXT
  27 00 00 00 00 00 00 08      05:55:20.093  READ NATIVE MAX ADDRESS EXT
  ec 00 00 00 00 00 00 08      05:55:20.073  IDENTIFY DEVICE
  ef 03 46 00 00 00 00 08      05:55:20.053  SET FEATURES [set transfer mode]
  27 00 00 00 00 00 00 08      05:55:20.033  READ NATIVE MAX ADDRESS EXT

Yes, that is a SMART report. I clipped away some of the report to help describe how to look at the ATA errors intelligently. I highlighted several numbers.

The top highlight (the Power_On_Hours raw value) tells you how many hours the disk had been powered on when the SMART report was taken. This drive has been powered on for 6332 hours (~264 24-hour days).

The second (the "ATA Error Count" line) shows you where the ata_error_count comes from. It is 24 for this drive.

The third (the "Error 24 occurred at disk power-on lifetime" line) tells you how many power-on hours the disk had when the most recent error occurred. 3101 hours = 129 days. So about 135 (264 - 129) power-on days ago, your drive got an error.

The fourth (the "Error:" line) tells you what the error was. This was a UNC error (UNC = Uncorrectable Data: an ECC error in the data field could not be corrected, meaning a media error or read instability).

If you look at the other 4 errors (it only reports the last 5), they all look like they happened at basically the same time and got basically the same error at the same spot on the disk.

You have a UDMA_CRC_Error_Count of 4. I'm not sure if that is related to this error or not.

Have you run a full parity check in the last 135 power-on days? If so, this error did not recur and you are probably fine. If not, I'd run one and see if the error continues. At some point unRAID should mark the sector bad and remap a fresh one.

In my experience, ATA errors are seldom serious unless they occurred very recently and align with a particular observed event (like a server crash), but it is smart to try to understand them. This one could be indicative of a bad spot on your disk that needs to be remapped.

Many times there are signs of bad data. Your error says it is looking at LBA 0x007b64b7, which "feels" valid (although I'm not sure), but I saw a log today where the LBA was 0x0FFFFFFF. This is clearly not a normal value. I believe it was caused by a bad or loose cable. The drive is going to issue an error since this LBA does not exist on the drive.

Hope this helps!

Cheers!
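
The arithmetic in the walkthrough above can be checked with a few lines. This is only a sketch: it assumes the LBA shown in the error line is built from the three register bytes printed in the log (CH, CL, SN, high to low), and that bit 6 (0x40) of the error register is the UNC flag. The helper names are made up for illustration.

```python
UNC_BIT = 0x40  # assumed: bit 6 of the ATA error register flags uncorrectable data


def lba_from_registers(sn, cl, ch):
    """Combine the three logged register bytes into an LBA (CH:CL:SN)."""
    return (ch << 16) | (cl << 8) | sn


def power_on_days_since(current_hours, error_hours):
    """Power-on days between an error and the time the report was taken."""
    return (current_hours - error_hours) / 24.0


# Register values from "Error 24" in the report: ER=40 SN=b7 CL=64 CH=7b
lba = lba_from_registers(sn=0xB7, cl=0x64, ch=0x7B)
is_unc = bool(0x40 & UNC_BIT)

# Power_On_Hours was 6332; the error was logged at 3101 hours.
age_days = power_on_days_since(6332, 3101)

print(hex(lba), lba, is_unc, round(age_days, 1))
```

The decoded LBA matches the 0x007b64b7 = 8086711 that smartctl printed, and the gap works out to a little under 135 power-on days.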
February 4, 2009
Very interesting and educational analysis. Thank you. I never thought to compare when an error occurred to how long the drive has been in service. Well, I didn't realize there was a timestamp on when errors occurred. These reports make a little more sense to me now.
February 4, 2009 (Author)
Thanks bjp999! Amazon is sending a new drive. It should arrive tomorrow.