February 18, 2010

Hi Guys,

From the Wiki (somewhere, I can't find it now), I think I have a faulty power cable or SATA cable but would like some confirmation before wriggling / replacing things.

I am currently running unRAID 4.4.2 on a full Slackware distribution. I had no real issues until recently, when my newest drive started showing errors. unRAID has marked the drive with a red circle.

Now, I've run short and long S.M.A.R.T. tests several times, and there are 0 issues. So, I pressed the restore button, did a parity sync and all was fine for a few days. It was after this I noticed that some of my files may have disappeared and I didn't think it was PEBKAC.

A few days later, the same issue again. So, I went in the same circle again - and lost some data - again.

To make it easier, I'll post some stats:

Drive: 1TB - ata-ST31000528AS_6VP1PBAY (Disk 3)

Smart Report:

smartctl version 5.38 [i486-slackware-linux-gnu] Copyright © 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST31000528AS
Serial Number:    6VP1PBAY
Firmware Version: CC37
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Fri Feb 19 04:17:18 2010 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 245) Self-test routine in progress...
                                        50% of test remaining.
Total time to complete Offline
data collection:                 ( 600) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 180) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       226697282
  3 Spin_Up_Time            0x0003   097   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       361
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   066   060   030    Pre-fail  Always       -       4873256
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       3028
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       164
183 Unknown_Attribute       0x0032   099   099   000    Old_age   Always       -       1
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   099   000    Old_age   Always       -       100
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   059   045    Old_age   Always       -       29 (Lifetime Min/Max 27/29)
194 Temperature_Celsius     0x0022   029   041   000    Old_age   Always       -       29 (0 19 0 0)
195 Hardware_ECC_Recovered  0x001a   037   023   000    Old_age   Always       -       226697282
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       144976621079867
241 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       3589836420
242 Unknown_Attribute       0x0000   100   253   000    Old_age   Offline      -       630453533

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                         Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Self-test routine in progress  50%        3028             -
# 2  Short offline       Completed without error        00%        2801             -
# 3  Extended offline    Completed without error        00%        2711             -
# 4  Short offline       Completed without error        00%        2699             -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

(This seems to be a perfect drive?)
/var/log/messages:

Feb 14 07:49:06 TANK kernel: ata9: hard resetting link
Feb 14 07:49:08 TANK kernel: ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Feb 15 04:47:41 TANK kernel: sdk:md: disk3 read error
Feb 15 04:47:42 TANK kernel: pe read error: 1205131344/3, count: 1
Feb 15 04:47:43 TANK kernel: pe read error: 1205139208/3, count: 1
Feb 15 04:47:43 TANK kernel: <4pe read error: 1205139216/3, count: 1
Feb 15 04:47:43 TANK kernel: <4pe read error: 1205139248/3, count: 1
Feb 15 04:47:43 TANK kernel: <pe read error: 1205139256/3, count: 1
Feb 15 04:47:43 TANK kernel: pe read error: 1205139264/3, count: 1
Feb 17 22:27:02 TANK kernel: scsi 9:0:0:0: Direct-Access ATA ST31000528AS CC37 PQ: 0 ANSI: 5
Feb 17 22:27:02 TANK kernel: sd 9:0:0:0: [sdi] 1953525168 512-byte hardware sectors (1000205 MB)
Feb 17 22:27:02 TANK kernel: sd 9:0:0:0: [sdi] Write Protect is off
Feb 17 22:27:02 TANK kernel: sd 9:0:0:0: [sdi] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
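For anyone reading the SMART report in the post above and wondering which rows back up the "perfect drive" impression, here is a rough Python sketch. It looks only at the five attributes that are commonly read as signs of media or transfer trouble; that choice of attributes, and the function name `check_rows`, are assumptions made for illustration, not something smartctl or unRAID prescribes. The values are copied from the attribute table in the post.

```python
# Rows copied from the smartctl report in the post:
# ID, name, VALUE, WORST, THRESH, RAW_VALUE (other columns left out).
ROWS = """\
5 Reallocated_Sector_Ct 100 100 036 0
187 Reported_Uncorrect 100 100 000 0
197 Current_Pending_Sector 100 100 000 0
198 Offline_Uncorrectable 100 100 000 0
199 UDMA_CRC_Error_Count 200 200 000 0
"""


def check_rows(text):
    """Return (attribute name, problem) pairs for rows that look suspicious."""
    problems = []
    for line in text.splitlines():
        if not line.strip():
            continue
        _attr_id, name, value, _worst, thresh, raw = line.split()
        # A normalized value at or below a non-zero threshold counts as failing.
        if int(thresh) > 0 and int(value) <= int(thresh):
            problems.append((name, "value at or below threshold"))
        # For these particular counters, any non-zero raw value deserves a look.
        if int(raw) != 0:
            problems.append((name, "non-zero raw count"))
    return problems


found = check_rows(ROWS)
print(found if found else "drive reports no media or CRC errors")
```

A clean result here only says the drive itself has not recorded a problem; it does not rule out a cable, power, or controller fault.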
February 18, 2010
I had to look up "PEBKAC"

There are two issues here... (three actually, but we'll get to the third)

1. For a disk to be taken off-line, a "write" to it failed. Typically, this is a hardware issue... It could be the disk, or a loose connector (SATA or power), or an intermittent backplane or drive tray, or even a flaky drive controller.

2. For files to disappear, they were either removed by somebody ... or more likely ... you have a corrupt file-system which needs repair.

To determine the reason the drive was taken off-line we would need to see a copy of the syslog from after the failure occurred but BEFORE you next rebooted. (It might not be too late; it depends on whether you've rebooted since the initial failure.) So... post a copy of your syslog ... attach it to your next post.

If the physical on-disk file-system suffered some corruption when the "write" to the drive failed, unRAID would still have written the correct parity information.
You might have been able to un-do the corruption by rebuilding the data on the failed drive after possibly re-seating the connectors, etc. Instead, you elected to throw away the existing parity, set a new drive configuration, and rebuild parity from the data drives (including the possibly corrupt file-system on the disk where the "write" error occurred).

So... from now on, unless explicitly advised by an experienced member of this forum to press the button labeled "restore", don't. Pressing it is PEBKAC in most cases. Do not press it unless it is part of the "trust-my-parity" procedure as described in the wiki or you are removing a disk from the array and will not replace it...

If your disk that had the "red" icon had actually failed, by pressing the button labeled "restore" you would have erased its prior contents from parity and there is no way to get it back. It does not restore data, but sets an initial disk configuration.

Now, hopefully it is just an intermittent connection... and you have some file-system corruption. As you use the corrupted file-system, it is possible for it to lose track of files. Stop the array, power down, and check the cables for tightness; then power back up and use the procedure described here in the wiki to check the file system on the disk. Odds are good it will need repair.
http://lime-technology.com/wiki/index.php?title=Check_Disk_Filesystems

Oh yes, read here about the evils of the button labeled "Restore"
http://lime-technology.com/forum/index.php?topic=1833.msg12918#msg12918

In the future, always use the button labeled "Start" to start the array. If the drive goes off-line again, use the procedure described here:
http://lime-technology.com/wiki/index.php?title=FAQ#How_do_I_recover_from_a_hard_disk_failure.3F
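To make the difference between the two paths concrete, here is a toy Python sketch. It assumes single XOR parity across the data disks and uses made-up 4-byte "disks"; the helper `xor_parity` and all the byte values are hypothetical, and this is not unRAID's actual code.

```python
from functools import reduce


def xor_parity(blocks):
    """Byte-wise XOR across equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical array contents before anything went wrong.
disk1 = b"\x01\x02\x03\x04"
disk2 = b"\x10\x20\x30\x40"
disk3_good = b"\xaa\xbb\xcc\xdd"
parity = xor_parity([disk1, disk2, disk3_good])

# A failed write leaves disk3 damaged, but parity still describes the good data.
disk3_corrupt = b"\xaa\xbb\x00\x00"

# Path A: rebuild disk3 from parity plus the other disks. The good data comes back.
rebuilt = xor_parity([disk1, disk2, parity])
print(rebuilt == disk3_good)

# Path B ("restore"): recompute parity from the disks as they are now.
# The old parity is gone, and any later rebuild reproduces the damage.
new_parity = xor_parity([disk1, disk2, disk3_corrupt])
rebuilt_after_restore = xor_parity([disk1, disk2, new_parity])
print(rebuilt_after_restore == disk3_corrupt)
```

Both comparisons come out true, which is the whole point: once parity has been recomputed from a damaged disk, parity can only ever give the damaged contents back.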
February 19, 2010 (Author)

Yep, PEBKAC it is then :-)

I think you're right on the data rebuild process. Basically, I had moved files (which would have gone to disk3 as it had the most free space), found an error, pressed restore (even though it said disk contents are not affected), and did a parity sync. Lesson learned.

I'll check the cables out on the weekend. As for the error message(s), dmesg said it was unable to identify the interface. This is the same message as in the Wiki:

"ata7: hard resetting link
ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata7.00: qc timeout (cmd 0xec)
ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata7.00: revalidation failed (errno=-5)
ata7: failed to recover some devices, retrying in 5 secs"
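If it helps when going through a long syslog before posting it, here is a small Python sketch that picks out the kinds of lines quoted in this thread. The three patterns are assumptions based only on the messages shown here, not a complete list of what the kernel can log, and the names `PATTERNS` and `scan` are made up for the example.

```python
import re

# Patterns modeled on the messages quoted in this thread (an assumption,
# not an exhaustive list of link or disk errors).
PATTERNS = {
    "link reset": re.compile(r"ata\d+: hard resetting link"),
    "identify failure": re.compile(r"ata\d+\.\d+: failed to IDENTIFY"),
    "array read error": re.compile(r"md: disk\d+ read error"),
}


def scan(lines):
    """Return (label, line) pairs for every line matching a known pattern."""
    hits = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.strip()))
    return hits


sample = [
    "Feb 14 07:49:06 TANK kernel: ata9: hard resetting link",
    "Feb 14 07:49:08 TANK kernel: ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300)",
    "Feb 15 04:47:41 TANK kernel: sdk:md: disk3 read error",
    "ata7.00: failed to IDENTIFY (I/O error, err_mask=0x4)",
]

for label, line in scan(sample):
    print(f"{label}: {line}")
```

To run it against a real log, pass an open file instead of the sample list, for example `scan(open("/var/log/messages"))`.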