November 22, 2013

Hello,

I was doing a random check on my unRAID server and noticed 4306 errors for the parity drive (but the ball is still green). I then ran smartctl on all the drives and also found a high number of errors on one of the data disks.

I'm going to buy two new drives (and switch parity to 3TB). Which drive do you recommend I swap first (parity or disk11)? Should I run a parity check before?

Thanks

PARITY DRIVE

root@babylon:~# smartctl -a -A /dev/sdi
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA4474532
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Nov 22 22:25:13 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (36360) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   198   198   051    Pre-fail  Always       -       9667
  3 Spin_Up_Time            0x0027   167   164   021    Pre-fail  Always       -       6650
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1131
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   072   072   000    Old_age   Always       -       20899
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       140
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       84
193 Load_Cycle_Count        0x0032   152   152   000    Old_age   Always       -       146232
194 Temperature_Celsius     0x0022   129   110   000    Old_age   Always       -       21
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   196   196   000    Old_age   Always       -       1419
198 Offline_Uncorrectable   0x0030   200   197   000    Old_age   Offline      -       30
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   001   001   000    Old_age   Offline      -       148883

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     20899         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

DISK11

Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30b7) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   102   099   006    Pre-fail  Always       -       291896
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       760
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       21171801
  9 Power_On_Hours          0x0032   028   028   000    Old_age   Always       -       63694
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       30
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   079   058   045    Old_age   Always       -       21 (Min/Max 18/31)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       760
194 Temperature_Celsius     0x0022   021   042   000    Old_age   Always       -       21 (0 16 0 0)
195 Hardware_ECC_Recovered  0x001a   037   011   000    Old_age   Always       -       291896
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       26164940768955
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       678912054
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1438427421

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
November 22, 2013

With two "iffy" disks it's a risk no matter which way you do it. I'd do the following ...

=> Run a parity check to confirm you have good parity (you don't want to rebuild a data drive without that). Be sure it's a correcting check, so any errors are fixed ... and if there ARE errors fixed, then run another one after that to confirm everything's now good.

=> Don't do ANYTHING on the array after that. Save the complete contents of the flash drive. Then shut down; replace the parity drive with your new 3TB drive (keeping the old parity drive untouched); and then start the system and let it rebuild parity.

=> If all went well, you can now replace the data drive. If there were problems (i.e. the data drive failed before you got all of that done), you can put the old parity drive back; copy the saved contents back onto the flash drive; and boot the system exactly as it was. Then you can replace the failed data drive and let it rebuild.

Doing it as I just outlined lets you do it BOTH ways ... hopefully parity-first works; but if not, you still have the ability to rebuild your data drive first instead.
November 22, 2013

Sorry to butt in, but may I ask what's up with disk 11? I can't figure out what's failing about it; I always have trouble understanding these results.

  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

I'm concerned that I'm missing something on my own disks.
November 22, 2013 (Author)

So don't the following errors indicate a drive going bad?

  1 Raw_Read_Error_Rate     0x000f   102   099   006    Pre-fail  Always       -       291896
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       21171801
November 22, 2013

http://en.wikipedia.org/wiki/S.M.A.R.T.

I always thought those values were vendor specific and don't really mean anything to us. I may be wrong though. The two I listed are the important ones. Wait for someone with more experience reading these reports to chime in.
November 22, 2013

Different vendors do indeed list a different set of parameters -- and indeed list them under different conditions. Seagate, for example, shows all of the raw read and seek errors, while WD only lists those after certain thresholds are exceeded. The more important number to look at is the "Value" => this starts at either 200 or 100 (depending on both the parameter and the manufacturer) and is then reduced as the data departs from the optimal values. Seagate tends to show more raw data ... and can consequently cause more unfounded worrying than WD.

As for the reallocated sectors and pending reallocations being the "important ones" => that's a matter of opinion. Modern drives are DESIGNED to automatically remap defective sectors to spare areas, so the fact you have a few reallocated sectors is NOT, by itself, a bad sign. What's more important is whether the number of reallocated sectors is changing ... indicating a drive that's not only got some bad sectors, but has likely got a bit of dust or other foreign material inside the sealed enclosure that's causing further degradation.

Your parity drive has a lot of pending reallocations -- meaning the next time those sectors are written to they'll be reallocated. The number is high enough that I would indeed replace that drive.

Drive 11 doesn't have any particularly worrisome values. It's doing a lot of re-seeks and error correction, but they've always been successful, so you're not getting read or write errors that the OS sees. In fact, for a drive with over 7 years of use, it's in fairly good shape. It's true, however, that a drive that old is probably ready to be replaced and relegated to storing backups.
November 22, 2013

Quote: "So don't the following errors indicate a drive going bad?"

  1 Raw_Read_Error_Rate     0x000f   102   099   006    Pre-fail  Always       -       291896
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       21171801

These values appear to be improving over time. As long as they don't cross the threshold and get marked "failing now", you can ignore them.
November 22, 2013

Quote: "So don't the following errors indicate a drive going bad?"

  1 Raw_Read_Error_Rate     0x000f   102   099   006    Pre-fail  Always       -       291896
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       21171801

No, the current normalized value is well above the associated pre-failure threshold. There is nothing wrong at all. All drives have read errors; some report them in the SMART report, most do not.
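For anyone who wants to check this on their own disks, here is a minimal Python sketch of the comparison described above: it splits a row of the smartctl attribute table into fields and reports how far the normalized VALUE sits above THRESH. The function names (parse_attribute, margin) are made up for this illustration, and it assumes the column layout shown in this thread's smartctl 5.40 output; other versions may lay the table out differently.

```python
def parse_attribute(row: str) -> dict:
    """Split one row of the smartctl attribute table into named fields.

    Assumes the 10-column layout seen in this thread:
    ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    # Limit the split so a RAW_VALUE containing spaces stays in one piece.
    fields = row.split(None, 9)
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "type": fields[6],
        "raw": fields[9],
    }


def margin(attr: dict) -> int:
    """Distance between the normalized value and its failure threshold."""
    return attr["value"] - attr["thresh"]


# The two disk11 rows quoted in the thread.
rows = [
    "1 Raw_Read_Error_Rate 0x000f 102 099 006 Pre-fail Always - 291896",
    "7 Seek_Error_Rate 0x000f 073 060 030 Pre-fail Always - 21171801",
]

for row in rows:
    attr = parse_attribute(row)
    state = "above threshold" if margin(attr) > 0 else "at or below threshold"
    print(f"{attr['name']}: value {attr['value']}, thresh {attr['thresh']}, "
          f"margin {margin(attr)} ({state})")
```

Both rows come out with a positive margin (96 and 43), which is the point made above: the large raw numbers on this drive do not by themselves mean the attribute is failing.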
November 23, 2013 (Author)

Big thanks to everybody for providing such valuable information. A pair of 4TB Red drives is on the way. I will only replace the parity drive.

Funny to see that the failure on the parity drive is evolving (in the wrong direction) but the ball is still solid green. I'm marking the thread as Solved for now.

root@babylon:~# smartctl -a -A /dev/sdi
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA4474532
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Nov 23 09:45:40 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (36360) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   171   171   051    Pre-fail  Always       -       105420
  3 Spin_Up_Time            0x0027   167   164   021    Pre-fail  Always       -       6650
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1131
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   072   072   000    Old_age   Always       -       20911
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       140
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       84
193 Load_Cycle_Count        0x0032   152   152   000    Old_age   Always       -       146244
194 Temperature_Celsius     0x0022   127   110   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   196   196   000    Old_age   Always       -       1419
198 Offline_Uncorrectable   0x0030   200   197   000    Old_age   Offline      -       30
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   001   001   000    Old_age   Offline      -       148883

SMART Error Log Version: 1
ATA Error Count: 2
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 20903 hours (870 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 18 d6 02 ef  Error: UNC at LBA = 0x0f02d618 = 251844120

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 70 d5 02 ef 08  40d+16:29:59.194  READ DMA
  c8 00 00 70 cc 02 ef 08  40d+16:29:58.251  READ DMA

Error 1 occurred at disk power-on lifetime: 20901 hours (870 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 88 4b b5 e7  Error: UNC at LBA = 0x07b54b88 = 129321864

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 20 4b b5 e7 08  40d+14:40:35.048  READ DMA
  c8 00 00 20 42 b5 e7 08  40d+14:40:32.954  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     20899         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
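As a side note on reading the error log above: the "UNC at LBA" figure can be reproduced from the register dump. The sketch below is only an illustration, assuming the 28-bit LBA layout that goes with the READ DMA (c8) command shown in the log, where the low four bits of DH hold the top of the address; the helper name lba28_from_registers is made up here.

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Rebuild a 28-bit LBA from the ATA task-file registers.

    Assumed layout (28-bit addressing): SN = bits 0-7, CL = bits 8-15,
    CH = bits 16-23, low nibble of DH = bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register lines copied from the two error entries (ER ST SC SN CL CH DH).
register_dumps = [
    "40 51 00 18 d6 02 ef",
    "40 51 00 88 4b b5 e7",
]

for dump in register_dumps:
    er, st, sc, sn, cl, ch, dh = (int(byte, 16) for byte in dump.split())
    lba = lba28_from_registers(sn, cl, ch, dh)
    print(f"registers {dump} -> LBA 0x{lba:08x} = {lba}")
```

The two results match the LBAs smartctl printed (251844120 and 129321864), so both entries are uncorrectable reads at two different places on the disk.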
November 23, 2013

Quote: "Funny to see that failure on parity drive is evolving (in the wrong direction) but ball is still solid green."

That behaviour is exactly what the drive manufacturer intended, only it's happening much too quickly. As long as there are still spare sectors available, the drive will continue to test "good". Problem is, with the rate of increase you are seeing, you may be out of spares in a matter of hours. Until a write to it fails, unRAID will keep it online and green.
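To see which counters actually moved between the two parity-drive reports in this thread, a small diff of the raw values helps. This is a sketch only: the numbers are copied by hand from the Nov 22 and Nov 23 outputs above, and the function name changed_attributes is made up for this example.

```python
# Raw values copied from the two smartctl reports of the parity drive.
snapshot_nov22 = {
    "Raw_Read_Error_Rate": 9667,
    "Power_On_Hours": 20899,
    "Load_Cycle_Count": 146232,
    "Temperature_Celsius": 21,
    "Current_Pending_Sector": 1419,
    "Offline_Uncorrectable": 30,
    "Multi_Zone_Error_Rate": 148883,
}
snapshot_nov23 = {
    "Raw_Read_Error_Rate": 105420,
    "Power_On_Hours": 20911,
    "Load_Cycle_Count": 146244,
    "Temperature_Celsius": 23,
    "Current_Pending_Sector": 1419,
    "Offline_Uncorrectable": 30,
    "Multi_Zone_Error_Rate": 148883,
}


def changed_attributes(before: dict, after: dict) -> dict:
    """Return {name: delta} for every attribute whose raw value changed."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] != before[name]
    }


changes = changed_attributes(snapshot_nov22, snapshot_nov23)
for name, delta in changes.items():
    print(f"{name}: {delta:+d}")
```

In these two reports the raw read error counter is the one that jumped (by 95753 over 12 power-on hours), while the pending sector count stayed at 1419, so it is worth watching both numbers while waiting for the replacement drive.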