A Pending Sector

November 29, 2010

During the exercise of installing my first 2TB drive (clearly, this has to replace an existing 1TB parity drive) I have discovered a single pending sector on one of my data drives.

The process I followed was:

1) A parity check - clean.

2) Pre-clear the new drive

3) Unassign existing parity drive

4) Power down and install new drive in place of previous parity drive.

5) Power up, assign parity drive

6) Build parity

- it was at this point that I looked at SMART reports and found the pending sector on disk2.

Now, I can't be entirely sure when the pending sector occurred.

Would there be a log entry recording the occurrence?

Is there any way of discovering which file is affected by this pending sector?

What would be the best course of action from here?

November 29, 2010

This is how I understand it. If unRAID is reading from a disk and gets a read error then it recreates the data it could not read and tries to write it back to the hard drive. In this manner, if there is a bad sector then it will get re-allocated and fully re-written. Now, I have no clue if this actually happens or not. However, on a parity build unRAID would not be able to do this since the parity is not valid.
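To illustrate the "recreates the data it could not read" step, here is a minimal sketch of single-parity reconstruction using XOR. The block contents and helper names (`xor_blocks`, `build_parity`, `reconstruct`) are made up for illustration; this is not unRAID's actual code, only the general idea of how one unreadable block can be rebuilt from the other disks plus parity.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def build_parity(data_blocks):
    """Parity is the XOR of the same block position on every data disk."""
    return xor_blocks(data_blocks)


def reconstruct(surviving_blocks, parity_block):
    """Rebuild the one unreadable block from the others plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


# Hypothetical 4-byte "sectors" at the same position on three data disks.
disk1 = b"\x10\x20\x30\x40"
disk2 = b"\xaa\xbb\xcc\xdd"  # pretend this one returns a read error
disk3 = b"\x01\x02\x03\x04"

parity = build_parity([disk1, disk2, disk3])
recovered = reconstruct([disk1, disk3], parity)
print(recovered == disk2)
```

The same sketch also shows why this cannot work during a parity build: until `parity` has been computed from good reads of every data disk, there is nothing valid to reconstruct from.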

There really is no way to know what file it would affect.

Did you get a read error on that drive? I would think unRAID would show a read error when it hits a bad sector. If you didn't yet get a read error then another parity check might be the best course of action and that should force a re-allocation and repair of the sector. Check if the pending goes to 0 and the re-allocated goes to 1 after it completes.

Also, a few repeated checks might be a very good idea to see if the disk shows an increasing number of sector failures. If the pending and reallocated counters keep increasing then the drive must be replaced to avoid data loss.
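One way to watch those two counters after each check is to pull the raw values out of the `smartctl -a` attribute table. The following is a rough sketch, assuming the ten-column attribute layout smartctl prints for ATA drives; the function names and the sample rows are only for illustration.

```python
def parse_smart_attributes(report_text):
    """Return {attribute_name: raw_value} from the rows of a smartctl attribute table."""
    attributes = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have ten columns;
        # the last column is the raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attributes[fields[1]] = int(fields[9])
    return attributes


def sector_counters(attributes):
    """Pick out the pending and reallocated counts (0 if a row is missing)."""
    return (
        attributes.get("Current_Pending_Sector", 0),
        attributes.get("Reallocated_Sector_Ct", 0),
    )


# Sample rows in the format smartctl prints; values are illustrative.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
"""

pending, reallocated = sector_counters(parse_smart_attributes(SAMPLE))
print(f"pending={pending} reallocated={reallocated}")
```

Saving the two numbers after every parity check makes it easy to see whether they are stable or climbing.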

Peter

November 29, 2010

Author

This is how I understand it. If unRAID is reading from a disk and gets a read error then it recreates the data it could not read and tries to write it back to the hard drive. In this manner, if there is a bad sector then it will get re-allocated and fully re-written. Now, I have no clue if this actually happens or not. However, on a parity build unRAID would not be able to do this since the parity is not valid.

Which is why I suspect/believe that the bad sector was discovered during the parity build.

There really is no way to know what file it would affect.

That's a shame. A lot of my data could be recovered from elsewhere ... if only I knew which file is involved. I know that the SMART report won't tell me which sector, but I was hoping that an error logged by unRAID might be more informative.

Did you get a read error on that drive? I would think unRAID would show a read error when it hits a bad sector.

I'm guessing that the parity build started sometime around this entry in the log:

Nov 28 14:06:08 Tower kernel: md: recovery thread woken up ...
Nov 28 14:06:08 Tower kernel: md: recovery thread syncing parity disk ...

and would have finished at the point that the data drives were spun down.

I am somewhat disappointed not to find any signs of an error report in the system log during this time.

If you didn't yet get a read error then another parity check might be the best course of action and that should force a re-allocation and repair of the sector. Check if the pending goes to 0 and the re-allocated goes to 1 after it completes.

Indeed. However, if the read error occurred during the parity build, my concern is that the parity may not be correct and, therefore, the data cannot be reconstructed!

As I understand it, if the sector is successfully read on a subsequent attempt, then the pending count would go back to zero without the re-allocated count going up.

Also, a few repeated checks might be a very good idea to see if the disk shows an increasing number of sector failures. If the pending and reallocated counters keep increasing then the drive must be replaced to avoid data loss.

Indeed! Having just had to rebuild another system after the 3 month old system disk suddenly started to accumulate pending sectors (1000+ when I pulled the plug) and three key system directories 'disappeared', I am fairly sensitive about this possibility.

November 29, 2010

Did you ever see an error reported on the unRAID GUI? If not, and based on your post, I think the chances are extremely low that you have lost data. Proceed with a parity check (a read-only check would be best) and report the results. Be especially sensitive to errors on the web GUI, sync errors, and disk-related errors in the syslog. Take a fresh SMART report after the parity check and post the results.

November 29, 2010

I replaced a bunch of my 7200 RPM 1.5TB Seagate drives with 2TB Green WD drives; one of them had 3 pending sectors and still does. As I understand it, my pre-clear found those errors (SMART reported 0 to start with). Since the drive is empty and pre-clear did a single cycle which wrote 0's to the drive, they are still listed as pending. I believe that those pending sectors will change into bad sectors or return to good sectors once data is written successfully or unsuccessfully to those sectors. I guess I should have run a second pre-clear cycle to clear those pending sectors one way or another.

As long as you don't have parity errors you should be ok.

November 29, 2010

So would I be correct in saying the web interface did not show any read errors on that drive?

If there is no error shown then do another parity check, or as already suggested a no-correct parity check. I'd be curious if a no-correct parity check will still try to fix the sector. Still, it will show if the drive is stable with one bad sector or if it's failing.

Peter

November 30, 2010

If the error count did not increment, then the drive did not return a read error. If the drive did not return a read error, we have no frickin idea what the drive was doing when it marked the sector as pending reallocation. One of life's little mysteries.

If you were doing a read-only parity check and encountered a true read error, I expect that unRAID WOULD do its normal thing (read values from other disks and rewrite the sector). Now if it was able to read all the sectors, but the parity did not match the data, the read-only parity check would not attempt to adjust parity - a normal parity check would.

This issue highlights the biggest issue with unRAID - ensuring data integrity. unRAID's parity protection is somewhat tenuous. So long as all is working correctly, parity is well maintained. But if a malfunctioning disk were to do something unexpected, like spew some junk in its dying breath, parity can be thrown off the tiniest bit, leaving a corruption you could never find.

This is why, instead of arguing for a RAID6 type configuration to protect users from 2 simultaneous disk failures (something none of us would likely face in a lifetime), I'd like to see some ability to maintain PAR2-like sets to allow the system to detect and correct minor corruption. I personally create PAR2 sets on my full disks to protect me from such an occurrence.

November 30, 2010

Author

Well, the parity check (no correct) completed without any sync errors.

Nov 30 07:03:19 Tower kernel: mdcmd (49): check NOCORRECT
Nov 30 07:03:19 Tower kernel:
Nov 30 07:03:19 Tower kernel: md: recovery thread woken up ...
Nov 30 07:03:19 Tower kernel: md: recovery thread checking parity...
Nov 30 07:03:19 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks.
Nov 30 08:00:01 Tower logger: mover started
Nov 30 08:00:01 Tower logger: ./.new/readme.txt
Nov 30 08:00:01 Tower logger: ./.new
Nov 30 08:00:01 Tower logger: ./.hide
Nov 30 08:00:01 Tower logger: ./.readme.txt
Nov 30 08:00:01 Tower logger: .
Nov 30 08:00:01 Tower logger: nothing to move
Nov 30 08:00:01 Tower logger: mover finished
Nov 30 09:03:34 Tower emhttp: shcmd (86): /usr/sbin/hdparm -y /dev/sde >/dev/null
Nov 30 10:00:01 Tower logger: mover started
Nov 30 10:00:01 Tower logger: ./.new/readme.txt
Nov 30 10:00:01 Tower logger: ./.new
Nov 30 10:00:01 Tower logger: ./.hide
Nov 30 10:00:01 Tower logger: ./.readme.txt
Nov 30 10:00:01 Tower logger: .
Nov 30 10:00:01 Tower logger: nothing to move
Nov 30 10:00:01 Tower logger: mover finished
Nov 30 10:49:44 Tower kernel: mdcmd (50): spindown 1
Nov 30 10:49:45 Tower kernel: mdcmd (51): spindown 2
Nov 30 11:05:04 Tower kernel: perl[24944]: segfault at 0 ip 0810d070 sp bf9d62a0 error 4 in perl5.10.0[8048000+123000]
Nov 30 11:20:18 Tower kernel: mdcmd (52): spindown 1
Nov 30 12:00:01 Tower logger: mover started
Nov 30 12:00:01 Tower logger: ./.new/readme.txt
Nov 30 12:00:01 Tower logger: ./.new
Nov 30 12:00:01 Tower logger: ./.hide
Nov 30 12:00:01 Tower logger: ./.readme.txt
Nov 30 12:00:01 Tower logger: .
Nov 30 12:00:01 Tower logger: nothing to move
Nov 30 12:00:01 Tower logger: mover finished
Nov 30 14:00:01 Tower logger: mover started
Nov 30 14:00:01 Tower logger: ./.new/readme.txt
Nov 30 14:00:01 Tower logger: ./.new
Nov 30 14:00:01 Tower logger: ./.hide
Nov 30 14:00:01 Tower logger: ./.readme.txt
Nov 30 14:00:01 Tower logger: .
Nov 30 14:00:01 Tower logger: nothing to move
Nov 30 14:00:01 Tower logger: mover finished
Nov 30 14:13:51 Tower kernel: md: sync done. time=25831sec rate=75626K/sec
Nov 30 14:13:51 Tower kernel: md: recovery thread sync completion status: 0
Nov 30 14:29:01 Tower kernel: mdcmd (53): spindown 0
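As a sanity check on the log above, the reported rate can be recomputed from the block total and the elapsed time. This is only a sketch: it assumes the md driver counts 1K blocks (which is what makes the numbers line up with the K/sec figure), and the regular expressions are written against the two log lines quoted here, nothing more general.

```python
import re

# The two relevant lines copied from the syslog excerpt.
LOG = """\
Nov 30 07:03:19 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks.
Nov 30 14:13:51 Tower kernel: md: sync done. time=25831sec rate=75626K/sec
"""


def parse_check_summary(log_text):
    """Return (total_blocks, elapsed_seconds, reported_rate) from the md log lines."""
    total = re.search(r"over a total of (\d+) blocks", log_text)
    done = re.search(r"sync done\. time=(\d+)sec rate=(\d+)K/sec", log_text)
    return int(total.group(1)), int(done.group(1)), int(done.group(2))


blocks, seconds, reported_rate = parse_check_summary(LOG)
computed_rate = blocks // seconds  # blocks per second, assuming 1K blocks
hours, remainder = divmod(seconds, 3600)
minutes, secs = divmod(remainder, 60)

print(f"elapsed {hours}h {minutes}m {secs}s")
print(f"computed {computed_rate}K/sec, reported {reported_rate}K/sec")
```

The elapsed time works out to a little over seven hours, which agrees with the 07:03 start and 14:13 finish timestamps.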

However, the SMART report for disk2 is still showing one pending sector. What is more, it is now also showing a Multi_Zone_Error_Rate raw value of 171.

Statistics for /dev/sdc 00P_WD-WMAVU0236768

smartctl -a -d ata /dev/sdc
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD10EADS-00P8B0
Serial Number:    WD-WMAVU0236768
Firmware Version: 01.00A01
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Nov 30 18:20:15 2010 SGT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (23100) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
 3 Spin_Up_Time            0x0027   182   177   021    Pre-fail  Always       -       5891
 4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3976
 5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
 9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6131
10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       324
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       133
193 Load_Cycle_Count        0x0032   189   189   000    Old_age   Always       -       35000
194 Temperature_Celsius     0x0022   120   091   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   199   197   000    Old_age   Offline      -       171

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I'm beginning to think that I should use my old parity drive to replace disk2 and then run more tests (like a preclear) on the old disk2.

November 30, 2010

A smart precaution. I personally would not be terribly concerned with this one pending sector that is not causing any external symptoms - but also would not be surprised to see it start to display worse symptoms as time goes on.

December 1, 2010

Author

Well, the pre-clear has reached the post-read phase, and the SMART report is now clean, not even a reallocated event:

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD10EADS-00P8B0
Serial Number:    WD-WMAVU0236768
Firmware Version: 01.00A01
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Dec  1 14:23:05 2010 SGT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85)	Offline data collection activity
				was aborted by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 241)	Self-test routine in progress...
				10% of test remaining.
Total time to complete Offline 
data collection: 		 (23100) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   181   177   021    Pre-fail  Always       -       5908
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3977
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6150
10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       325
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       133
193 Load_Cycle_Count        0x0032   189   189   000    Old_age   Always       -       35317
194 Temperature_Celsius     0x0022   111   091   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   197   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      6140         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
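For anyone comparing reports by eye, the change between the two SMART snapshots in this thread can be summarised with a small diff. This is a sketch only: the raw values are copied from the two reports above, and the `smart_deltas` and `interpret` helpers and their wording are my own reading of the counters, not anything the drive or smartctl reports.

```python
WATCHED = (
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def smart_deltas(before, after, watched=WATCHED):
    """Return {attribute: after - before} for the watched raw values."""
    return {name: after[name] - before[name] for name in watched}


def interpret(deltas):
    """Give a cautious one-line reading of the change in sector counters."""
    if deltas["Reallocated_Sector_Ct"] > 0:
        return "sectors were reallocated; keep watching the drive"
    if deltas["Current_Pending_Sector"] < 0:
        return "pending sector cleared without a reallocation"
    if deltas["Current_Pending_Sector"] > 0:
        return "new pending sectors appeared; keep watching the drive"
    return "no change in sector counters"


# Raw values from the Nov 30 and Dec 1 reports for the same drive.
nov30 = {"Reallocated_Sector_Ct": 0, "Reallocated_Event_Count": 0,
         "Current_Pending_Sector": 1, "Offline_Uncorrectable": 0}
dec01 = {"Reallocated_Sector_Ct": 0, "Reallocated_Event_Count": 0,
         "Current_Pending_Sector": 0, "Offline_Uncorrectable": 0}

deltas = smart_deltas(nov30, dec01)
print(deltas)
print(interpret(deltas))
```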

I think that I now feel confident enough to assign this drive as disk3!

Thank you all for your advice.
