EMKO Posted May 13, 2011

For a week now I have been trying to figure out whether my parity drive is still OK to use. A week ago I ran a parity check and it gave me some errors and around 40 ATA errors. I ran another parity check and this time got fewer ATA errors, but after a few more runs I was still getting errors, and my ATA error count is now at 75. Any ideas what I should do? What does "Warning: ATA error count 75 inconsistent with error log pointer 1" mean?

I am on unRAID 4.6 and have been using it for about a year; this is the first time I am getting these errors.

current_pending_sector=6
offline_uncorrectable=3
ata_error_count=75

syslog: http://pastebin.com/HzMfbw7e

SMART

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD20EADS-00R6B0
Serial Number:    WD-WCAVY2269670
Firmware Version: 01.00A01
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri May 13 09:08:05 2011 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 121) The previous self-test completed having the read element of the test failed.
Total time to complete Offline data collection: (43200) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027 179   154   021    Pre-fail Always  -           8025
  4 Start_Stop_Count        0x0032 099   099   000    Old_age  Always  -           1701
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 100   253   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 087   087   000    Old_age  Always  -           9943
 10 Spin_Retry_Count        0x0032 100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           40
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           29
193 Load_Cycle_Count        0x0032 193   193   000    Old_age  Always  -           21686
194 Temperature_Celsius     0x0022 127   117   000    Old_age  Always  -           25
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           6
198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           3
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           0

SMART Error Log Version: 1
Warning: ATA error count 75 inconsistent with error log pointer 1

ATA Error Count: 75 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.

SMART Self-test log structure revision number 1
Num Test_Description  Status                  Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline  Completed: read failure 90%       9790            30300665
# 2 Short offline     Completed without error 00%       9789            -
# 3 Short offline     Completed without error 00%       8827            -
# 4 Extended offline  Completed: read failure 90%       8816            35253715
# 5 Short offline     Aborted by host         80%       8816            -
# 6 Short offline     Aborted by host         80%       8816            -
# 7 Short offline     Completed without error 00%       8816            -
# 8 Short offline     Aborted by host         90%       8816            -
# 9 Short offline     Aborted by host         90%       8816            -
#10 Short offline     Completed without error 00%       8816            -
#11 Extended offline  Completed: read failure 90%       8459            27516853
#12 Extended offline  Completed: read failure 90%       7671            27516853
#13 Extended offline  Completed: read failure 90%       7671            27516853
#14 Extended offline  Completed: read failure 90%       7671            27516853
#15 Short offline     Completed: read failure 90%       7661            27516853
#16 Short offline     Completed: read failure 30%       2454            3328229
#17 Short offline     Aborted by host         10%       2446            -
#18 Short offline     Aborted by host         90%       2446            -
#19 Short offline     Aborted by host         70%       2446            -
#20 Short offline     Aborted by host         60%       2446            -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
   1       0       0 Not_testing
   2       0       0 Not_testing
   3       0       0 Not_testing
   4       0       0 Not_testing
   5       0       0 Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
EMKO Posted May 13, 2011

OK, I'm confused now. I just started a parity check and saw the errors below. I checked SMART and it says:

SMART Error Log Version: 1
No Errors Logged

I don't see any ATA errors anymore. What I figured out is that those errors from ata3.00 are actually from the disk 1 drive, not parity. So now I have no idea what's going on or whether either of these hard drives is still OK. :(

May 13 11:23:57 Tower kernel: md: recovery thread woken up ... (unRAID engine)
May 13 11:23:57 Tower kernel: md: recovery thread checking parity... (unRAID engine)
May 13 11:23:57 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks. (unRAID engine)
May 13 11:28:33 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
May 13 11:28:33 Tower kernel: ata3.00: irq_stat 0x40000001 (Drive related)
May 13 11:28:33 Tower kernel: ata3.00: failed command: READ DMA EXT (Minor Issues)
May 13 11:28:33 Tower kernel: ata3.00: cmd 25/00:40:e7:53:b3/00:03:02:00:00/e0 tag 0 dma 425984 in (Drive related)
May 13 11:28:33 Tower kernel: res 51/40:3f:db:56:b3/00:00:02:00:00/e0 Emask 0x9 (media error) (Errors)
May 13 11:28:33 Tower kernel: ata3.00: status: { DRDY ERR } (Drive related)
May 13 11:28:33 Tower kernel: ata3.00: error: { UNC } (Errors)
May 13 11:28:33 Tower kernel: ata3.00: configured for UDMA/133 (Drive related)
May 13 11:28:33 Tower kernel: ata3: EH complete (Drive related)
May 13 11:28:36 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
May 13 11:28:36 Tower kernel: ata3.00: irq_stat 0x40000001 (Drive related)
May 13 11:28:36 Tower kernel: ata3.00: failed command: READ DMA EXT (Minor Issues)
May 13 11:28:36 Tower kernel: ata3.00: cmd 25/00:40:e7:53:b3/00:03:02:00:00/e0 tag 0 dma 425984 in (Drive related)
May 13 11:28:36 Tower kernel: res 51/40:3f:db:56:b3/00:00:02:00:00/e0 Emask 0x9 (media error) (Errors)
May 13 11:28:36 Tower kernel: ata3.00: status: { DRDY ERR } (Drive related)
May 13 11:28:36 Tower kernel: ata3.00: error: { UNC } (Errors)
May 13 11:28:36 Tower kernel: ata3.00: configured for UDMA/133 (Drive related)
May 13 11:28:36 Tower kernel: ata3: EH complete (Drive related)
May 13 11:28:38 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
May 13 11:28:38 Tower kernel: ata3.00: irq_stat 0x40000001 (Drive related)
May 13 11:28:38 Tower kernel: ata3.00: failed command: READ DMA EXT (Minor Issues)
May 13 11:28:38 Tower kernel: ata3.00: cmd 25/00:40:e7:53:b3/00:03:02:00:00/e0 tag 0 dma 425984 in (Drive related)
May 13 11:28:38 Tower kernel: res 51/40:3f:db:56:b3/00:00:02:00:00/e0 Emask 0x9 (media error) (Errors)
May 13 11:28:38 Tower kernel: ata3.00: status: { DRDY ERR } (Drive related)
May 13 11:28:38 Tower kernel: ata3.00: error: { UNC } (Errors)
May 13 11:28:38 Tower kernel: ata3.00: configured for UDMA/133 (Drive related)
May 13 11:28:38 Tower kernel: ata3: EH complete (Drive related)
prostuff1 Posted May 13, 2011

There were some pending sectors in the previous SMART report. It looks like a cable/power issue. Make sure all connections are secure (unplug and plug back in) and start again.
EMKO Posted May 13, 2011

OK, I will try that. The pending sector count is always like this: it goes up and down all the time. Same with offline uncorrectable, which never goes below 1. Here is how it looks now; not much difference, except that the ATA errors are gone. I will have to check the cables later.

smartctl -a -d ata /dev/sdc (parity)

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD20EADS-00R6B0
Serial Number:    WD-WCAVY2269670
Firmware Version: 01.00A01
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri May 13 11:46:25 2011 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 121) The previous self-test completed having the read element of the test failed.
Total time to complete Offline data collection: (43200) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027 179   154   021    Pre-fail Always  -           8025
  4 Start_Stop_Count        0x0032 099   099   000    Old_age  Always  -           1701
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 100   253   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 087   087   000    Old_age  Always  -           9945
 10 Spin_Retry_Count        0x0032 100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           40
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           29
193 Load_Cycle_Count        0x0032 193   193   000    Old_age  Always  -           21690
194 Temperature_Celsius     0x0022 124   117   000    Old_age  Always  -           28
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           6
198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           3
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description  Status                  Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline  Completed: read failure 90%       9790            30300665
# 2 Short offline     Completed without error 00%       9789            -
# 3 Short offline     Completed without error 00%       8827            -
# 4 Extended offline  Completed: read failure 90%       8816            35253715
# 5 Short offline     Aborted by host         80%       8816            -
# 6 Short offline     Aborted by host         80%       8816            -
# 7 Short offline     Completed without error 00%       8816            -
# 8 Short offline     Aborted by host         90%       8816            -
# 9 Short offline     Aborted by host         90%       8816            -
#10 Short offline     Completed without error 00%       8816            -
#11 Extended offline  Completed: read failure 90%       8459            27516853
#12 Extended offline  Completed: read failure 90%       7671            27516853
#13 Extended offline  Completed: read failure 90%       7671            27516853
#14 Extended offline  Completed: read failure 90%       7671            27516853
#15 Short offline     Completed: read failure 90%       7661            27516853
#16 Short offline     Completed: read failure 30%       2454            3328229
#17 Short offline     Aborted by host         10%       2446            -
#18 Short offline     Aborted by host         90%       2446            -
#19 Short offline     Aborted by host         70%       2446            -
#20 Short offline     Aborted by host         60%       2446            -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
   1       0       0 Not_testing
   2       0       0 Not_testing
   3       0       0 Not_testing
   4       0       0 Not_testing
   5       0       0 Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
EMKO Posted May 13, 2011

I ran the parity check 3 times, and every time these errors come up at exactly 0.8%. Can this still be a cable/power issue? I can't shut down the server right now; it's being used.

May 13 12:16:51 Tower kernel: md: recovery thread woken up ... (unRAID engine)
May 13 12:16:51 Tower kernel: md: recovery thread checking parity... (unRAID engine)
May 13 12:16:51 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks. (unRAID engine)
May 13 12:20:02 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
May 13 12:20:02 Tower kernel: ata3.00: irq_stat 0x40000001 (Drive related)
May 13 12:20:02 Tower kernel: ata3.00: failed command: READ DMA EXT (Minor Issues)
May 13 12:20:02 Tower kernel: ata3.00: cmd 25/00:90:77:57:ce/00:03:01:00:00/e0 tag 0 dma 466944 in (Drive related)
May 13 12:20:02 Tower kernel: res 51/40:ff:f9:59:ce/00:00:01:00:00/e0 Emask 0x9 (media error) (Errors)
May 13 12:20:02 Tower kernel: ata3.00: status: { DRDY ERR } (Drive related)
May 13 12:20:02 Tower kernel: ata3.00: error: { UNC } (Errors)
May 13 12:20:02 Tower kernel: ata3.00: configured for UDMA/133 (Drive related)
May 13 12:20:02 Tower kernel: ata3: EH complete (Drive related)
May 13 12:20:05 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
May 13 12:20:05 Tower kernel: ata3.00: irq_stat 0x40000001 (Drive related)
May 13 12:20:05 Tower kernel: ata3.00: failed command: READ DMA EXT (Minor Issues)
May 13 12:20:05 Tower kernel: ata3.00: cmd 25/00:90:77:57:ce/00:03:01:00:00/e0 tag 0 dma 466944 in (Drive related)
May 13 12:20:05 Tower kernel: res 51/40:ff:f9:59:ce/00:00:01:00:00/e0 Emask 0x9 (media error) (Errors)
May 13 12:20:05 Tower kernel: ata3.00: status: { DRDY ERR } (Drive related)
May 13 12:20:05 Tower kernel: ata3.00: error: { UNC } (Errors)
May 13 12:20:05 Tower kernel: ata3.00: configured for UDMA/133 (Drive related)
May 13 12:20:05 Tower kernel: ata3: EH complete (Drive related)
May 13 12:20:08 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
May 13 12:20:08 Tower kernel: ata3.00: irq_stat 0x40000001 (Drive related)
May 13 12:20:08 Tower kernel: ata3.00: failed command: READ DMA EXT (Minor Issues)
May 13 12:20:08 Tower kernel: ata3.00: cmd 25/00:90:77:57:ce/00:03:01:00:00/e0 tag 0 dma 466944 in (Drive related)
May 13 12:20:08 Tower kernel: res 51/40:ff:f9:59:ce/00:00:01:00:00/e0 Emask 0x9 (media error) (Errors)
May 13 12:20:08 Tower kernel: ata3.00: status: { DRDY ERR } (Drive related)
May 13 12:20:08 Tower kernel: ata3.00: error: { UNC } (Errors)
May 13 12:20:08 Tower kernel: ata3.00: configured for UDMA/133 (Drive related)
May 13 12:20:08 Tower kernel: ata3: EH complete (Drive related)
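The `res` line in the kernel messages from the 12:20 parity check can be unpacked by hand to see which sector failed. The following is a minimal Python sketch, assuming the field order used by the Linux libata error handler (status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device), 512-byte sectors, and that the md "blocks" in the syslog are 1 KiB each (which is consistent with the drive's 2,000,398,934,016-byte capacity). The names `decode_res` and `percent_of_check` are illustrative and not part of any tool.

```python
import re

# Assumed libata "res" layout (12 hex bytes):
# status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
RES_RE = re.compile(r"\bres\s+((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")

ATA_ERR_UNC = 0x40  # error-register bit 6: uncorrectable media error


def decode_res(line):
    """Return status, error and the 48-bit LBA encoded in a libata 'res' line."""
    match = RES_RE.search(line)
    if match is None:
        raise ValueError("no 'res' taskfile found in line")
    b = [int(x, 16) for x in re.split(r"[/:]", match.group(1))]
    status, error = b[0], b[1]
    lbal, lbam, lbah = b[3], b[4], b[5]
    hob_lbal, hob_lbam, hob_lbah = b[8], b[9], b[10]
    # The high-order ("hob") bytes form LBA bits 24..47, the others bits 0..23.
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    return {"status": status, "error": error, "lba": lba,
            "unc": bool(error & ATA_ERR_UNC)}


def percent_of_check(lba, total_blocks, sector_bytes=512, block_bytes=1024):
    """Position of an LBA as a percentage of the parity check (assumed units)."""
    return 100.0 * lba * sector_bytes / (total_blocks * block_bytes)


line = ("May 13 12:20:02 Tower kernel: res 51/40:ff:f9:59:ce/00:00:01:00:00/e0 "
        "Emask 0x9 (media error)")
info = decode_res(line)
pct = percent_of_check(info["lba"], 1953514552)
print(info["lba"], info["unc"], round(pct, 2))
```

Under those assumptions the line decodes to LBA 30300665 with the UNC bit set, roughly 0.78% of the way through the 1953514552-block check, which lines up with the errors appearing at about 0.8% on every run. A fault that recurs at the same address is what one would expect from an unreadable sector rather than from a random transmission error, though this sketch alone does not prove it.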
Joe L. Posted May 14, 2011

EMKO wrote: "I ran the parity check 3 times, and every time these errors come up at exactly 0.8%. Can this still be a cable/power issue? I can't shut down the server right now; it's being used."

They could be power related, but those are MEDIA errors (translation: unreadable sectors on the disk). UNC errors are almost always caused by bad sectors on the disk that are not readable.

Get a SMART report on the disk:

smartctl -d ata -a /dev/sdX

where sdX is the three-letter designation for your disk. Look for reallocated sectors and sectors pending reallocation. (The counts are in the RAW_VALUE column on the far right.)

Joe L.
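Reading the raw counts Joe L. mentions can be scripted. The following is a minimal Python sketch that pulls the RAW_VALUE column for attributes 5, 197 and 198 out of saved `smartctl -a` output; it assumes the ten-column attribute table layout shown in the reports in this thread, and the function name `raw_values` is made up for illustration.

```python
# Attribute IDs worth watching, per the advice above.
WATCHED_IDS = {5, 197, 198}


def raw_values(smart_text):
    """Map attribute name -> raw count for the watched attribute IDs."""
    found = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) in WATCHED_IDS:
            found[fields[1]] = int(fields[9])
    return found


# Three rows copied from the May 14 report in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033 200 200 140 Pre-fail Always  - 0
197 Current_Pending_Sector  0x0032 200 200 000 Old_age  Always  - 7
198 Offline_Uncorrectable   0x0030 200 200 000 Old_age  Offline - 1
"""

counts = raw_values(SAMPLE)
for name, raw in counts.items():
    print(f"{name}: {raw}")
```

Saving the output of each run and comparing the counts over time is more informative than any single reading, since the pending count in this thread moves up and down between reports.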
EMKO Posted May 14, 2011

Sorry for the stupid question, but how do I identify which hard drive those errors are for? Is it by ata3.00? I was using the myMain syslog entries per disk, and I didn't see these UNC errors under parity but only under disk 1, which I think is wrong, as the SMART report for disk 1 has no errors at all.

I did some more parity check tests, waiting until 1% before canceling:

1st: UNC errors at 0.4%, and current_pending_sector went from 6 to 7
2nd: no errors
3rd: UNC error at 0.8%
4th: UNC errors at 0.9%

Right now my SMART report looks like this:

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD20EADS-00R6B0
Serial Number:    WD-WCAVY2269670
Firmware Version: 01.00A01
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat May 14 08:02:12 2011 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 121) The previous self-test completed having the read element of the test failed.
Total time to complete Offline data collection: (43200) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           0
  3 Spin_Up_Time            0x0027 179   154   021    Pre-fail Always  -           8025
  4 Start_Stop_Count        0x0032 099   099   000    Old_age  Always  -           1705
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 100   253   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 087   087   000    Old_age  Always  -           9965
 10 Spin_Retry_Count        0x0032 100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           40
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           29
193 Load_Cycle_Count        0x0032 193   193   000    Old_age  Always  -           21768
194 Temperature_Celsius     0x0022 126   117   000    Old_age  Always  -           26
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           7
198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           1
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description  Status                  Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline  Completed: read failure 90%       9790            30300665
# 2 Short offline     Completed without error 00%       9789            -
# 3 Short offline     Completed without error 00%       8827            -
# 4 Extended offline  Completed: read failure 90%       8816            35253715
# 5 Short offline     Aborted by host         80%       8816            -
# 6 Short offline     Aborted by host         80%       8816            -
# 7 Short offline     Completed without error 00%       8816            -
# 8 Short offline     Aborted by host         90%       8816            -
# 9 Short offline     Aborted by host         90%       8816            -
#10 Short offline     Completed without error 00%       8816            -
#11 Extended offline  Completed: read failure 90%       8459            27516853
#12 Extended offline  Completed: read failure 90%       7671            27516853
#13 Extended offline  Completed: read failure 90%       7671            27516853
#14 Extended offline  Completed: read failure 90%       7671            27516853
#15 Short offline     Completed: read failure 90%       7661            27516853
#16 Short offline     Completed: read failure 30%       2454            3328229
#17 Short offline     Aborted by host         10%       2446            -
#18 Short offline     Aborted by host         90%       2446            -
#19 Short offline     Aborted by host         70%       2446            -
#20 Short offline     Aborted by host         60%       2446            -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
   1       0       0 Not_testing
   2       0       0 Not_testing
   3       0       0 Not_testing
   4       0       0 Not_testing
   5       0       0 Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Anyway, when I wake up I will have a chance to power down and replug the cables.
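One way to approach the "which drive is ata3.00?" question is to compare the failing address from the kernel log with each drive's self-test log. The kernel `res` line from the 12:20 parity check carries the LBA bytes f9:59:ce (low) and 01:00:00 (high), which, assuming the usual libata field order, is 0x01ce59f9 = 30300665. The Python sketch below, with an illustrative function name and a few rows copied from the report above, checks whether that LBA appears as an LBA_of_first_error.

```python
import re

# "#NN  <description and status>  <remaining>%  <lifetime hours>  <LBA or ->"
ENTRY_RE = re.compile(r"^#\s*(\d+)\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def failing_lbas(selftest_log):
    """Collect every numeric LBA_of_first_error from a SMART self-test log."""
    lbas = set()
    for line in selftest_log.splitlines():
        match = ENTRY_RE.match(line.strip())
        if match and match.group(5).isdigit():
            lbas.add(int(match.group(5)))
    return lbas


# Rows taken from the parity drive's self-test log in this thread.
PARITY_SELFTEST = """\
# 1 Extended offline  Completed: read failure 90% 9790 30300665
# 2 Short offline     Completed without error 00% 9789 -
# 4 Extended offline  Completed: read failure 90% 8816 35253715
#11 Extended offline  Completed: read failure 90% 8459 27516853
#16 Short offline     Completed: read failure 30% 2454 3328229
"""

KERNEL_LBA = 0x01CE59F9  # decoded from the 'res' line, see assumptions above

lbas = failing_lbas(PARITY_SELFTEST)
print(KERNEL_LBA, KERNEL_LBA in lbas)
```

Here the decoded LBA equals entry #1 in the self-test log of the drive with serial WD-WCAVY2269670, which suggests that ata3.00 is this (parity) drive rather than disk 1. That is circumstantial evidence only; the kernel boot messages that list the model and serial for each ataN port are the more direct check.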
SSD Posted May 14, 2011

There are basically two things that can cause drive errors:

1 - The drive itself is failing. If the drive is failing, you will see attribute errors in the SMART report, notably reallocated sectors and pending sectors. (Other attributes have their own failure thresholds, and if those thresholds are approached they can also be indicators of impending drive failure.) The computer / OS does not cause the drive to have these types of errors.

2 - The connection to the drive is bad (e.g., bad cable, bad drive cage, bad port, etc.). If the connection is bad, you will tend to see errors in the unRAID syslog AND see the ATA error count on the drive increase. These types of errors indicate that the data is being garbled in transmission. So if, for example, you have a 1T drive and the computer requests a read of a sector at offset 750G, and the instruction is garbled so that the drive sees it as an instruction to read a sector at offset 1750G (bigger than the drive), the drive will return some error to the OS, likely some type of read error. This type of error is logged in the syslog, and also remembered by the drive as an "ATA error". In this type of scenario, the drive is doing exactly what it should be doing, and the problem is frequently the cable.

It looks like you have run some extensive self-tests on this drive. Note that the spin-down feature of unRAID can cause drive self-tests to fail, so make sure to disable spin-down on a drive before attempting to run a self-test.

I don't have much experience with the "offline uncorrectable" attribute. A raw value of 1 is not affecting the normalized attribute values, so I am assuming it is not a problem worth worrying about. But I am not sure; I personally would not be happy to see offline uncorrectable errors.

Current pending sectors indicate that there was difficulty reading a sector and that it needs to be monitored for possible reallocation at a later time. Frequently, current pending sectors become reallocated after a parity check or preclear cycle. But I've also seen pending sectors, even a hundred of them, clear themselves and go back to 0 with no reallocated sectors. It is hard to interpret why this happens, but there is no evidence that these drives have given problems in future use.

I would recommend running parity checks and watching the attributes, paying particular attention to the ATA error count and the reallocated sector count. If ATA errors increase, check / replace your cables to the drive. If reallocated sectors increase (and don't hold steady for three consecutive parity checks), it is time to RMA the drive.
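The rule of thumb in the last paragraph can be written down as a small check over readings taken after each parity check. This is a minimal Python sketch of one possible reading of that advice, not a diagnostic tool: the dictionary keys, the function name `advise`, and the interpretation of "hold steady" as "the last three readings are identical" are all assumptions made for illustration.

```python
def advise(history):
    """history: one dict per parity check, oldest first, with the keys
    'ata_errors' and 'reallocated' (both hypothetical names)."""
    advice = []
    if len(history) < 2:
        return advice  # nothing to compare yet

    # Rising ATA error count: treated as a connection problem in the advice above.
    if history[-1]["ata_errors"] > history[-2]["ata_errors"]:
        advice.append("check or replace cables")

    # Reallocated sectors that grew and have not been flat for three checks.
    realloc = [h["reallocated"] for h in history]
    grew = realloc[-1] > realloc[0]
    last_three = realloc[-3:]
    steady = len(last_three) == 3 and len(set(last_three)) == 1
    if grew and not steady:
        advice.append("consider RMA")
    return advice


rising_ata = [{"ata_errors": 40, "reallocated": 0},
              {"ata_errors": 75, "reallocated": 0}]
growing_realloc = [{"ata_errors": 0, "reallocated": r} for r in (0, 5, 9)]
settled_realloc = [{"ata_errors": 0, "reallocated": r} for r in (0, 5, 5, 5)]

print(advise(rising_ata))
print(advise(growing_realloc))
print(advise(settled_realloc))
```

The sketch deliberately ignores pending and offline-uncorrectable counts, since the post above treats those as values to watch rather than as a trigger on their own.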
EMKO Posted May 15, 2011

Those 75 ATA errors I had disappeared during a parity check, and they haven't shown up since. I replugged the cables but I am still getting those UNC errors.

Offline_Uncorrectable has ranged from 0 to 4 since Jan 24.
Current_Pending_Sector has ranged from 2 to 9 since Jan 24.

Right now it's at:
current_pending_sector=7
offline_uncorrectable=1

The reallocated sector count has never changed from 0. Shouldn't it have gone up since Jan 24, when I started getting current pending sectors? And the ATA errors are gone.
SSD Posted May 15, 2011

EMKO wrote: "Those 75 ATA errors I had disappeared during a parity check, and they haven't shown up since. I replugged the cables but I am still getting those UNC errors. [...] The reallocated sector count has never changed from 0. Shouldn't it have gone up since Jan 24, when I started getting current pending sectors? And the ATA errors are gone."

The ATA error count should never reset to zero. It is a cumulative count of invalid ATA instructions received. Firmware, like all software, has bugs. If you are seeing the ATA errors disappear, I suspect you are hitting an unforeseen firmware bug triggered by the variety and number of errors the drive is experiencing.

Double-check that it is the same drive, and not a different one, on which you were seeing the ATA error count. If it is the same drive and the ATA error count got reset to 0, I would not trust the SMART monitoring on the drive and would RMA it. Why take chances?
EMKO Posted May 15, 2011

Yes, it's the same drive. SMART said this when it had ATA errors:

SMART Error Log Version: 1
Warning: ATA error count 75 inconsistent with error log pointer 1

Now it's like this:

SMART Error Log Version: 1
No Errors Logged

I did a few more parity checks and still get the UNC error, but current pending sector is still at 7 and uncorrectable is at 1, with still nothing for the reallocated count. Thanks for the help. Hopefully I can just RMA it and not have to worry about my parity.
EMKO Posted June 1, 2011

I just installed 5.0b6a, ran a parity check, and noticed in myMain that the ATA error count went back up to 100. I was going to record the errors, but after I refreshed, the ATA errors were gone again. Is this something that should be happening?

Then I was looking at the nice custom GUI and clicked the SMART log on it, and now I see the errors. Is there a reason they go away in the myMain SMART report but stay in this report? My cache drive has had 22 ATA errors and they never go away.

Is this something I can RMA the drive for? I don't want them to test the drive and tell me it's fine.

http://pastebin.com/e6ipKtrK