November 29, 2010

While installing my first 2TB drive (clearly, this has to replace an existing 1TB parity drive), I discovered a single pending sector on one of my data drives. The process I followed was:

1) A parity check - clean.
2) Pre-clear the new drive.
3) Unassign the existing parity drive.
4) Power down and install the new drive in place of the previous parity drive.
5) Power up and assign the parity drive.
6) Build parity - it was at this point that I looked at the SMART reports and found the pending sector on disk2.

Now, I can't be entirely sure when the pending sector occurred. Would there be a log entry recording the occurrence? Is there any way of discovering which file is affected by this pending sector? What would be the best course of action from here?
November 29, 2010

This is how I understand it. If unRAID is reading from a disk and gets a read error, it recreates the data it could not read from parity and the other disks, and tries to write it back to the hard drive. In this manner, a bad sector gets re-allocated and fully re-written. Now, I have no clue if this actually happens or not. However, on a parity build unRAID would not be able to do this, since the parity is not yet valid.

There really is no way to know what file it would affect.

Did you get a read error on that drive? I would think unRAID would show a read error when it hits a bad sector.

If you didn't get a read error yet, then another parity check might be the best course of action, and that should force a re-allocation and repair of the sector. Check whether the pending count goes to 0 and the re-allocated count goes to 1 after it completes.

Also, a few repeated checks might be a very good idea, to see if the disk shows an increasing number of sector failures. If the pending and reallocated counters keep increasing, then the drive must be replaced to avoid data loss.

Peter
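The reconstruction described above can be illustrated with a minimal sketch of generic single-parity XOR. This is not unRAID's actual code; the block contents and the helper names (`xor_blocks`, `build_parity`, `reconstruct`) are made up for illustration. It also shows why a rebuild cannot be trusted while parity is not yet valid.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def build_parity(data_blocks):
    """Parity is the XOR of the same block on every data disk."""
    return xor_blocks(data_blocks)


def reconstruct(surviving_blocks, parity_block):
    """Recover one unreadable block from the surviving blocks plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


# Hypothetical 4-byte "sectors" on three data disks.
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\xaa\xbb\xcc\xdd"
disk3 = b"\x01\x02\x03\x04"

parity = build_parity([disk1, disk2, disk3])

# With valid parity, the unreadable disk2 block comes back exactly.
rebuilt = reconstruct([disk1, disk3], parity)

# With stale parity (here: computed before disk2 held this data),
# the "rebuilt" block is silently wrong.
stale_parity = build_parity([disk1, bytes(4), disk3])
wrong = reconstruct([disk1, disk3], stale_parity)
```

The same property explains the concern about a read error during a parity build: whatever value went into the parity calculation for that sector is the value a later reconstruction would return.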
November 29, 2010
Author

> This is how I understand it. If unRAID is reading from a disk and gets a read error then it recreates the data it could not read and tries to write it back to the hard drive. In this manner, if there is a bad sector then it will get re-allocated and fully re-written. Now, I have no clue if this actually happens or not. However, on a parity build unRAID would not be able to do this since the parity is not valid.

Which is why I suspect that the bad sector was discovered during the parity build.

> There really is no way to know what file it would affect.

That's a shame. A lot of my data could be recovered from elsewhere ... if only I knew which file is involved. I know that the SMART report won't tell me which sector, but I was hoping that an error logged by unRAID might be more informative.

> Did you get a read error on that drive? I would think unRAID would show a read error when it hits a bad sector.

I'm guessing that the parity build started sometime around this entry in the log:

Nov 28 14:06:08 Tower kernel: md: recovery thread woken up ...
Nov 28 14:06:08 Tower kernel: md: recovery thread syncing parity disk ...

and would have finished at the point that the data drives were spun down. I am somewhat disappointed not to find any sign of an error report in the system log during this time.

> If you didn't yet get a read error then another parity check might be the best course of action and that should force a re-allocation and repair of the sector. Check if the pending goes to 0 and the re-allocated goes to 1 after it completes.

Indeed. However, if the read error occurred during the parity build, my concern is that the parity may not be correct and, therefore, the data cannot be reconstructed! As I understand it, if the sector is successfully read on a subsequent attempt, then the pending count would go back to zero without the re-allocated count going up.

> Also, a few repeated checks might be a very good idea to see if the disk shows an increasing number of sector failures. If the pending and reallocated counters keep increasing then the drive must be replaced to avoid data loss.

Indeed! Having just had to rebuild another system after the 3-month-old system disk suddenly started to accumulate pending sectors (1000+ when I pulled the plug) and three key system directories 'disappeared', I am fairly sensitive about this possibility.
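On the question of finding the affected file: one generic approach, sketched below, is simply to read every file on the disk and note any that fail with an I/O error. This is only a sketch under assumptions. The mount point `/mnt/disk2` in the usage comment is assumed, and the function name is made up. It may also find nothing: the pending sector could lie in free space or filesystem metadata, the drive may manage to read it on a retry, or the array may handle the failed read itself before the error ever reaches the file system.

```python
import os


def find_unreadable_files(root, chunk_size=1024 * 1024):
    """Read every file under root; return (path, reason) for each that fails."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    # Read in chunks and discard the data; only errors matter.
                    while handle.read(chunk_size):
                        pass
            except OSError as exc:
                bad.append((path, str(exc)))
    return bad


# Example usage (the mount point is an assumption):
#   for path, reason in find_unreadable_files("/mnt/disk2"):
#       print(path, reason)
```

A full read of a 1TB disk takes hours, so this is best run when the server is otherwise idle.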
November 29, 2010

Did you ever see an error reported on the unRAID GUI? If not, and based on your post, I think the chances are extremely low that you have lost data. Proceed with a parity check (a read-only check would be best) and report the results. Watch especially for errors on the web GUI, sync errors, and disk-related errors in the syslog. Take a fresh SMART report after the parity check and post the results.
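Checking the syslog for disk-related errors can be partly automated. The sketch below filters kernel lines for a few keywords; the keyword list is an assumption chosen for illustration (real messages vary by driver), and the sample lines are the two parity-build entries quoted earlier in the thread.

```python
import re

# Keywords are an assumption for illustration; real kernel messages vary.
SUSPECT = re.compile(r"error|fail|timeout", re.IGNORECASE)


def suspicious_kernel_lines(lines):
    """Return kernel log lines that contain one of the suspect keywords."""
    return [
        line for line in lines
        if " kernel: " in line and SUSPECT.search(line)
    ]


sample = [
    "Nov 28 14:06:08 Tower kernel: md: recovery thread woken up ...",
    "Nov 28 14:06:08 Tower kernel: md: recovery thread syncing parity disk ...",
]

print(suspicious_kernel_lines(sample))  # prints []
```

A keyword filter like this will produce false positives (any kernel line mentioning "error" matches), so it narrows down what to read rather than replacing a look at the log itself.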
November 29, 2010

I replaced a bunch of my 7200 RPM 1.5TB Seagate drives with 2TB Green WD drives. One of them had 3 pending sectors and still does. As I understand it, my pre-clear found those errors (SMART reported 0 to start with). Since the drive is empty and pre-clear did a single cycle which wrote zeros to the drive, they are still listed as pending. I believe that those pending sectors will change into bad sectors or return to good sectors once data is written successfully or unsuccessfully to those sectors. I guess I should have run a second pre-clear cycle to clear those pending sectors one way or another.

As long as you don't have parity errors you should be OK.
November 29, 2010

So would I be correct in saying the web interface did not show any read errors on that drive? If there is no error shown then do another parity check, or as already suggested a no-correct parity check. I'd be curious if a no-correct parity check will still try to fix the sector. Still, it will show if the drive is stable with one bad sector or if it's failing.

Peter
November 30, 2010

> So would I be correct in saying the web interface did not show any read errors on that drive? If there is no error shown then do another parity check, or as already suggested a no-correct parity check. I'd be curious if a no-correct parity check will still try to fix the sector. Still, it will show if the drive is stable with one bad sector or if it's failing.
>
> Peter

If the error count did not increment, then the drive did not return a read error. If the drive did not return a read error, we have no frickin idea what the drive was doing when it marked the sector as pending relocation. One of life's little mysteries.

If you were doing a read-only parity check and encountered a true read error, I expect that unRAID WOULD do its normal thing (read values from the other disks and rewrite the sector). Now if it was able to read all the sectors, but the parity did not match the data, the read-only parity check would not attempt to adjust parity - a normal parity check would.

This highlights the biggest issue with unRAID - ensuring data integrity. unRAID's parity protection is somewhat tenuous. So long as all is working correctly, parity is well maintained. But if a malfunctioning disk were to do something unexpected, like spew some junk in its dying breath, parity can be thrown off the tiniest bit, leaving a corruption you could never find. This is why, instead of arguing for a RAID6-type configuration to protect users from 2 simultaneous disk failures (something none of us would likely face in a lifetime), I'd like to see some ability to maintain PAR2-like sets to allow the system to detect and correct minor corruption. I personally create PAR2 sets on my full disks to protect me from such an occurrence.
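The detection half of that idea can be sketched with plain checksums. To be clear, this is not PAR2: PAR2 also stores recovery data and can repair damage, whereas the sketch below only detects that a file's contents have changed. The function names are made up for illustration.

```python
import hashlib
import os


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, root)] = file_digest(path)
    return manifest


def changed_files(root, manifest):
    """List files whose current digest differs from the stored manifest."""
    current = build_manifest(root)
    return sorted(
        path for path, digest in manifest.items()
        if current.get(path) != digest
    )
```

A manifest built while a disk is known to be good, and stored on a different disk, would at least answer the "which file is affected" question after the fact; restoring the file would still need a backup or real recovery data.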
November 30, 2010
Author

Well, the parity check (no correct) completed without any sync errors.

Nov 30 07:03:19 Tower kernel: mdcmd (49): check NOCORRECT
Nov 30 07:03:19 Tower kernel:
Nov 30 07:03:19 Tower kernel: md: recovery thread woken up ...
Nov 30 07:03:19 Tower kernel: md: recovery thread checking parity...
Nov 30 07:03:19 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks.
Nov 30 08:00:01 Tower logger: mover started
Nov 30 08:00:01 Tower logger: ./.new/readme.txt
Nov 30 08:00:01 Tower logger: ./.new
Nov 30 08:00:01 Tower logger: ./.hide
Nov 30 08:00:01 Tower logger: ./.readme.txt
Nov 30 08:00:01 Tower logger: .
Nov 30 08:00:01 Tower logger: nothing to move
Nov 30 08:00:01 Tower logger: mover finished
Nov 30 09:03:34 Tower emhttp: shcmd (86): /usr/sbin/hdparm -y /dev/sde >/dev/null
Nov 30 10:00:01 Tower logger: mover started
Nov 30 10:00:01 Tower logger: ./.new/readme.txt
Nov 30 10:00:01 Tower logger: ./.new
Nov 30 10:00:01 Tower logger: ./.hide
Nov 30 10:00:01 Tower logger: ./.readme.txt
Nov 30 10:00:01 Tower logger: .
Nov 30 10:00:01 Tower logger: nothing to move
Nov 30 10:00:01 Tower logger: mover finished
Nov 30 10:49:44 Tower kernel: mdcmd (50): spindown 1
Nov 30 10:49:45 Tower kernel: mdcmd (51): spindown 2
Nov 30 11:05:04 Tower kernel: perl[24944]: segfault at 0 ip 0810d070 sp bf9d62a0 error 4 in perl5.10.0[8048000+123000]
Nov 30 11:20:18 Tower kernel: mdcmd (52): spindown 1
Nov 30 12:00:01 Tower logger: mover started
Nov 30 12:00:01 Tower logger: ./.new/readme.txt
Nov 30 12:00:01 Tower logger: ./.new
Nov 30 12:00:01 Tower logger: ./.hide
Nov 30 12:00:01 Tower logger: ./.readme.txt
Nov 30 12:00:01 Tower logger: .
Nov 30 12:00:01 Tower logger: nothing to move
Nov 30 12:00:01 Tower logger: mover finished
Nov 30 14:00:01 Tower logger: mover started
Nov 30 14:00:01 Tower logger: ./.new/readme.txt
Nov 30 14:00:01 Tower logger: ./.new
Nov 30 14:00:01 Tower logger: ./.hide
Nov 30 14:00:01 Tower logger: ./.readme.txt
Nov 30 14:00:01 Tower logger: .
Nov 30 14:00:01 Tower logger: nothing to move
Nov 30 14:00:01 Tower logger: mover finished
Nov 30 14:13:51 Tower kernel: md: sync done. time=25831sec rate=75626K/sec
Nov 30 14:13:51 Tower kernel: md: recovery thread sync completion status: 0
Nov 30 14:29:01 Tower kernel: mdcmd (53): spindown 0

However, the SMART report for disk2 is still showing one pending sector. What is more, it is now also showing a Multi_Zone_Error_Rate raw value of 171.

Statistics for /dev/sdc 00P_WD-WMAVU0236768

smartctl -a -d ata /dev/sdc
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD10EADS-00P8B0
Serial Number:    WD-WMAVU0236768
Firmware Version: 01.00A01
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Nov 30 18:20:15 2010 SGT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (23100) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   177   021    Pre-fail  Always       -       5891
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3976
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6131
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       324
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       133
193 Load_Cycle_Count        0x0032   189   189   000    Old_age   Always       -       35000
194 Temperature_Celsius     0x0022   120   091   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   199   197   000    Old_age   Offline      -       171

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I'm beginning to think that I should use my old parity drive to replace disk2 and then run more tests (like a preclear) on the old disk2.
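When comparing reports like this one over repeated checks, it can help to pull out just the counters discussed in this thread. The sketch below parses attribute rows in the `smartctl -a` layout shown above and reports the watched attributes whose raw value is non-zero. The watch list and the function names are choices made for illustration, and the sample rows are copied from the report above.

```python
import re

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value (only the leading integer of the raw value is kept).
ROW = re.compile(
    r"(\d+)\s+([A-Za-z][\w-]*)\s+0x[0-9a-fA-F]{4}\s+"
    r"\d{3}\s+\d{3}\s+\d{3}\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Attributes worth tracking here; this list is an assumption, not a standard.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Multi_Zone_Error_Rate",
)


def parse_raw_values(report_text):
    """Map attribute name to its integer raw value."""
    return {name: int(raw) for _id, name, raw in ROW.findall(report_text)}


def nonzero_watched(report_text):
    """Return the watched attributes whose raw value is above zero."""
    values = parse_raw_values(report_text)
    return {name: values[name] for name in WATCHED if values.get(name, 0) > 0}


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
200 Multi_Zone_Error_Rate   0x0008   199   197   000    Old_age   Offline      -       171
"""

print(nonzero_watched(SAMPLE))
```

Saving the output after each parity check makes it easy to see whether the pending or reallocated counts are climbing, which is the signal the earlier replies suggest watching for.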
November 30, 2010

A smart precaution. I personally would not be terribly concerned with this one pending sector that is not causing any external symptoms - but also would not be surprised to see it start to display worse symptoms as time goes on.
December 1, 2010
Author

Well, the pre-clear has reached the post-read phase, and the SMART report is now clean, without even a reallocated event:

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD10EADS-00P8B0
Serial Number:    WD-WMAVU0236768
Firmware Version: 01.00A01
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Dec 1 14:23:05 2010 SGT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                                        was aborted by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 241) Self-test routine in progress...
                                        10% of test remaining.
Total time to complete Offline
data collection:                 (23100) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   181   177   021    Pre-fail  Always       -       5908
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3977
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6150
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       325
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       133
193 Load_Cycle_Count        0x0032   189   189   000    Old_age   Always       -       35317
194 Temperature_Celsius     0x0022   111   091   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   197   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      6140         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I think that I now feel confident enough to assign this drive as disk3!

Thank you all for your advice.