[SOLVED] Parity disk and one data disk starting to fail, which one to change first?

November 22, 2013

Hello,

I was doing a random check on my unRAID server and noticed 4306 errors for the parity drive (but the ball is still green).

I then ran smartctl on all the drives and also found a high number of errors on one of the data disks.

I'm going to buy two new drives (and switch parity to 3TB). Which drive do you recommend I swap first (parity or disk11)? Should I run a parity check beforehand?

Thanks

PARITY DRIVE

root@babylon:~# smartctl -a -A /dev/sdi
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA4474532
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri Nov 22 22:25:13 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (36360) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   198   198   051    Pre-fail  Always       -       9667
  3 Spin_Up_Time            0x0027   167   164   021    Pre-fail  Always       -       6650
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1131
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   072   072   000    Old_age   Always       -       20899
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       140
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       84
193 Load_Cycle_Count        0x0032   152   152   000    Old_age   Always       -       146232
194 Temperature_Celsius     0x0022   129   110   000    Old_age   Always       -       21
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   196   196   000    Old_age   Always       -       1419
198 Offline_Uncorrectable   0x0030   200   197   000    Old_age   Offline      -       30
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   001   001   000    Old_age   Offline      -       148883

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     20899         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

DISK11

Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x30b7)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   102   099   006    Pre-fail  Always       -       291896
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       760
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       21171801
  9 Power_On_Hours          0x0032   028   028   000    Old_age   Always       -       63694
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       30
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   079   058   045    Old_age   Always       -       21 (Min/Max 18/31)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       760
194 Temperature_Celsius     0x0022   021   042   000    Old_age   Always       -       21 (0 16 0 0)
195 Hardware_ECC_Recovered  0x001a   037   011   000    Old_age   Always       -       291896
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       26164940768955
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       678912054
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1438427421

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

November 22, 2013

With two "iffy" disks it's a risk no matter which way you do it.

I'd do the following ...

=> Run a parity check to confirm you have good parity (you don't want to rebuild a data drive without that). Be sure it's a correcting check, so any errors are fixed ... and if there ARE errors fixed, then run another one after that to confirm everything's now good.

=> Don't do ANYTHING on the array after that. Save the complete contents of the flash drive. Then shut down; replace the parity drive with your new 3TB drive (saving the old parity drive); and then start the system and let it rebuild parity.

=> If all went well, you can now replace the data drive. If problems were encountered (i.e. the data drive failed before you got all of that done), you can put the old parity drive back in, copy the saved flash drive contents back, and boot the system exactly as it was. Then you can replace the failed data drive and let it rebuild.

Doing it as I just outlined lets you do it BOTH ways ... hopefully the parity-first approach works; but if not, you still have the ability to rebuild your data drive first instead.

November 22, 2013

Sorry to butt in, but may I ask what's wrong with disk 11? I can't figure out what's failing on it; I always have trouble understanding these results.

5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0

I'm concerned that I'm missing something on my own disks.

November 22, 2013

The disk 11 report looks fine. The parity drive should be replaced.

November 22, 2013

Author

So don't the following errors indicate a drive going bad?

  1 Raw_Read_Error_Rate     0x000f   102   099   006    Pre-fail  Always       -       291896
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       21171801

November 22, 2013

http://en.wikipedia.org/wiki/S.M.A.R.T.

I always thought those values were vendor specific and don't really mean anything to us. I may be wrong though; the two I listed are the important ones.

Wait for someone with more experience reading the reports to chime in, though.

November 22, 2013

Different vendors do indeed list a different set of parameters -- and indeed list them under different conditions. Seagate, for example, shows all of the raw read and seek errors; while WD only lists those after certain thresholds are exceeded.

The more important number to look at is the "Value" => this starts at either 200 or 100 (depending on both the parameter and the manufacturer) and then is reduced as the data exceeds the optimal values. Seagate tends to show more raw data ... and can consequently cause more unfounded worrying than WD.

As for the reallocated sectors and pending reallocations being the "important ones" => that's a matter of opinion. Modern drives are DESIGNED to automatically remap defective sectors to spare areas, so the fact you have a few reallocated sectors is NOT, by itself, a bad sign. What's more important is whether the number of reallocated sectors is changing ... that indicates a drive that's not only got some bad sectors, but has likely got a bit of dust or other foreign material inside the sealed enclosure that's causing further degradation.

Your parity drive has a lot of pending reallocations -- meaning those sectors couldn't be read reliably, and the next time they're written to they'll be reallocated (or cleared, if the rewrite succeeds). The number is high enough that I would indeed replace that drive.

Drive 11 doesn't have any particularly worrisome values. It's doing a lot of re-seeks and error correction, but they've always been successful, so you're not getting read or write errors that the OS sees. In fact, for a drive with over 7 years of use, it's in fairly good shape. It's true, however, that a drive that old is probably ready to be replaced and relegated to storing backups.
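
For anyone wanting to pull those sector counters out of a report without eyeballing it, here is a rough Python sketch. It assumes the attribute rows are laid out exactly as in the smartctl 5.40 output pasted in this thread (raw value in the tenth column). The function name `nonzero_sector_counts` and the watch list are my own choices for illustration, not anything smartctl or unRAID provides. The sample rows are copied from the two reports in the first post.

```python
from typing import Dict

# Attribute IDs discussed in this thread: reallocated, pending,
# and offline-uncorrectable sectors.
WATCHED_IDS = {5, 197, 198}


def nonzero_sector_counts(report: str) -> Dict[str, int]:
    """Return {attribute name: raw count} for watched attributes with a nonzero raw value."""
    found: Dict[str, int] = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in WATCHED_IDS and int(fields[9]) > 0:
            found[fields[1]] = int(fields[9])
    return found


PARITY = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   196   196   000    Old_age   Always       -       1419
198 Offline_Uncorrectable   0x0030   200   197   000    Old_age   Offline      -       30
"""

DISK11 = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""

print("parity:", nonzero_sector_counts(PARITY))
print("disk11:", nonzero_sector_counts(DISK11))
```

On these rows the parity drive comes back with 1419 pending and 30 offline-uncorrectable sectors, while disk 11 comes back empty.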

November 22, 2013

So don't the following errors indicate a drive going bad?

  1 Raw_Read_Error_Rate     0x000f   102   099   006    Pre-fail  Always       -       291896
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       21171801

These values appear to be improving over time (the current VALUE is above WORST for both). As long as they don't cross the threshold and get flagged FAILING_NOW in the WHEN_FAILED column, you can ignore them.
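
A small Python sketch of that check, using the VALUE, WORST and THRESH columns from the two rows quoted in this post. The function name `normalized_state` is made up for illustration, and the "at or below threshold means failing" rule is my understanding of how the failing flag is decided, so treat it as an assumption.

```python
def normalized_state(value: int, worst: int, thresh: int) -> str:
    """Describe a normalized SMART value relative to its worst-ever value and threshold."""
    # Assumption: a value at or below the threshold is what gets flagged as failing.
    if value <= thresh:
        return "at or below threshold"
    if value > worst:
        return "above threshold, recovered from worst"
    return "above threshold, currently at its worst"


# (VALUE, WORST, THRESH) copied from the disk 11 rows quoted in this post.
rows = {
    "Raw_Read_Error_Rate": (102, 99, 6),
    "Seek_Error_Rate": (73, 60, 30),
}

for name, (value, worst, thresh) in rows.items():
    print(f"{name}: {normalized_state(value, worst, thresh)} (margin {value - thresh})")
```

Both disk 11 rows land in the "recovered from worst" case, with margins of 96 and 43 above their thresholds.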

November 22, 2013

So don't the following errors indicate a drive going bad?

  1 Raw_Read_Error_Rate     0x000f   102   099   006    Pre-fail  Always       -       291896
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       21171801

No, the current normalized value is well above the associated pre-failure threshold. There is nothing wrong at all.

All drives have read errors; some report them in the SMART report, most do not.

November 23, 2013

Author

Big thanks to everybody for providing such valuable information. A pair of 4TB Red drives is on the way. I will only replace the parity drive. Funny to see that failure on parity drive is evolving (in the wrong direction) but ball is still solid green. I'm marking the thread as Solved for now.

root@babylon:~# smartctl -a -A /dev/sdi
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA4474532
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat Nov 23 09:45:40 2013 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (36360) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   171   171   051    Pre-fail  Always       -       105420
  3 Spin_Up_Time            0x0027   167   164   021    Pre-fail  Always       -       6650
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1131
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   072   072   000    Old_age   Always       -       20911
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       140
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       84
193 Load_Cycle_Count        0x0032   152   152   000    Old_age   Always       -       146244
194 Temperature_Celsius     0x0022   127   110   000    Old_age   Always       -       23
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   196   196   000    Old_age   Always       -       1419
198 Offline_Uncorrectable   0x0030   200   197   000    Old_age   Offline      -       30
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   001   001   000    Old_age   Offline      -       148883

SMART Error Log Version: 1
ATA Error Count: 2
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 20903 hours (870 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 18 d6 02 ef  Error: UNC at LBA = 0x0f02d618 = 251844120

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 70 d5 02 ef 08  40d+16:29:59.194  READ DMA
  c8 00 00 70 cc 02 ef 08  40d+16:29:58.251  READ DMA

Error 1 occurred at disk power-on lifetime: 20901 hours (870 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 88 4b b5 e7  Error: UNC at LBA = 0x07b54b88 = 129321864

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 20 4b b5 e7 08  40d+14:40:35.048  READ DMA
  c8 00 00 20 42 b5 e7 08  40d+14:40:32.954  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     20899         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
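
For readers wondering how the register dump in the error log maps to the LBA on the `Error: UNC at LBA` lines, here is a short Python sketch of the arithmetic. It assumes 28-bit LBA addressing, where the low four bits of DH hold the top of the address; the function name `lba28` is mine. The register values are copied from the two errors in the log above.

```python
def lba28(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA task-file registers into a 28-bit LBA."""
    # SN holds bits 0-7, CL bits 8-15, CH bits 16-23, low nibble of DH bits 24-27.
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# (SN, CL, CH, DH) from the "After command completion" rows in the log.
error_2 = lba28(0x18, 0xD6, 0x02, 0xEF)
error_1 = lba28(0x88, 0x4B, 0xB5, 0xE7)
print(hex(error_2), error_2)  # 0xf02d618 251844120
print(hex(error_1), error_1)  # 0x7b54b88 129321864

# Power-on hours to (days, hours), as printed in the "Error N occurred at" lines.
print(divmod(20903, 24))  # (870, 23)
print(divmod(20901, 24))  # (870, 21)
```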

November 23, 2013

Funny to see that failure on parity drive is evolving (in the wrong direction) but ball is still solid green.

That behaviour is exactly what the drive manufacturer intended, only it's happening much too quickly. As long as there are still spare sectors available, the drive will continue to test "good". Problem is, with the rate of increase you are seeing, you may be out of spares in a matter of hours. Until a write to it fails, unRAID will keep it online and green.
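
To put numbers on how much moved between the two parity-drive reports posted in this thread, here is a rough Python sketch that diffs a few raw values. The figures are copied from those reports (taken at 20899 and 20911 power-on hours); the dictionary layout and the name `raw_deltas` are just for illustration.

```python
from typing import Dict

# Raw values copied from the two parity-drive reports in this thread.
FIRST = {"Power_On_Hours": 20899, "Raw_Read_Error_Rate": 9667,
         "Current_Pending_Sector": 1419, "Offline_Uncorrectable": 30}
SECOND = {"Power_On_Hours": 20911, "Raw_Read_Error_Rate": 105420,
          "Current_Pending_Sector": 1419, "Offline_Uncorrectable": 30}


def raw_deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return the change in each raw value present in both snapshots."""
    return {name: after[name] - before[name] for name in before if name in after}


deltas = raw_deltas(FIRST, SECOND)
hours = deltas.pop("Power_On_Hours")
for name, change in deltas.items():
    print(f"{name}: {change:+d} over {hours} power-on hours")
```

Between these two reports the raw read error count rose by 95753 in 12 power-on hours and the error log gained two UNC entries, while the pending-sector count itself stayed at 1419.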
