parity drive ata error count going up


For a week now I have been trying to figure out whether my parity drive is still OK to use. A week ago I ran a parity check and it gave me some errors and around 40 ATA errors. I ran another parity check and got fewer ATA errors, but after a few more runs I still got more errors and my ATA error count is now at 75. Any ideas what I should do?

 

What does this mean?

Warning: ATA error count 75 inconsistent with error log pointer 1

 

I'm running unRAID 4.6 and have been using it for about a year now; this is the first time I'm getting these errors.

 

current_pending_sector=6

offline_uncorrectable=3

ata_error_count=75

 

syslog

http://pastebin.com/HzMfbw7e

 

SMART

 

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD20EADS-00R6B0
Serial Number:    WD-WCAVY2269670
Firmware Version: 01.00A01
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri May 13 09:08:05 2011 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121)	The previous self-test completed having
				the read element of the test failed.
Total time to complete Offline 
data collection: 		 (43200) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   154   021    Pre-fail  Always       -       8025
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1701
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       9943
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       40
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0032   193   193   000    Old_age   Always       -       21686
194 Temperature_Celsius     0x0022   127   117   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
Warning: ATA error count 75 inconsistent with error log pointer 1

ATA Error Count: 75 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      9790         30300665
# 2  Short offline       Completed without error       00%      9789         -
# 3  Short offline       Completed without error       00%      8827         -
# 4  Extended offline    Completed: read failure       90%      8816         35253715
# 5  Short offline       Aborted by host               80%      8816         -
# 6  Short offline       Aborted by host               80%      8816         -
# 7  Short offline       Completed without error       00%      8816         -
# 8  Short offline       Aborted by host               90%      8816         -
# 9  Short offline       Aborted by host               90%      8816         -
#10  Short offline       Completed without error       00%      8816         -
#11  Extended offline    Completed: read failure       90%      8459         27516853
#12  Extended offline    Completed: read failure       90%      7671         27516853
#13  Extended offline    Completed: read failure       90%      7671         27516853
#14  Extended offline    Completed: read failure       90%      7671         27516853
#15  Short offline       Completed: read failure       90%      7661         27516853
#16  Short offline       Completed: read failure       30%      2454         3328229
#17  Short offline       Aborted by host               10%      2446         -
#18  Short offline       Aborted by host               90%      2446         -
#19  Short offline       Aborted by host               70%      2446         -
#20  Short offline       Aborted by host               60%      2446         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
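
The self-test log in the report above is easier to reason about when the read failures are grouped by LBA. A minimal Python sketch, using a few rows copied from that log; the function name `failing_lbas` and the regular expression are my own illustration, not part of smartctl:

```python
import re
from collections import defaultdict

# A few rows copied from the self-test log in the report above.
SELF_TEST_LOG = """\
# 1  Extended offline    Completed: read failure       90%      9790         30300665
# 2  Short offline       Completed without error       00%      9789         -
# 4  Extended offline    Completed: read failure       90%      8816         35253715
#11  Extended offline    Completed: read failure       90%      8459         27516853
#12  Extended offline    Completed: read failure       90%      7671         27516853
#15  Short offline       Completed: read failure       90%      7661         27516853
#16  Short offline       Completed: read failure       30%      2454         3328229
"""

# Columns: Num, Test_Description, Status, Remaining, LifeTime(hours), LBA_of_first_error
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<test>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\d+|-)\s*$"
)

def failing_lbas(log_text):
    """Map each LBA_of_first_error to the power-on hours at which it was reported."""
    seen = defaultdict(list)
    for line in log_text.splitlines():
        m = ROW.match(line)
        if m and "read failure" in m["status"] and m["lba"] != "-":
            seen[int(m["lba"])].append(int(m["hours"]))
    return dict(seen)

for lba, hours in failing_lbas(SELF_TEST_LOG).items():
    print(f"LBA {lba}: read failure at power-on hours {hours}")
```

In the full log the read failures fall on four distinct LBAs, reported between power-on hours 2454 and 9790, which suggests unreadable sectors in more than one place rather than a single one-off event.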

Link to comment

OK, I'm confused now. I just started a parity check and saw these errors. I checked SMART and it says:

 

SMART Error Log Version: 1

No Errors Logged

 

I don't see any ATA errors anymore. What I figured out is that those errors from ata3.00 are actually from the disk 1 drive, not parity. So now I have no idea what's going on, or whether one of these hard drives is still OK.  :( :(

 

May 13 11:23:57 Tower kernel: md: recovery thread woken up ... (unRAID engine)
May 13 11:23:57 Tower kernel: md: recovery thread checking parity... (unRAID engine)
May 13 11:23:57 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks. (unRAID engine)
May 13 11:28:33 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
May 13 11:28:33 Tower kernel: ata3.00: irq_stat 0x40000001 (Drive related)
May 13 11:28:33 Tower kernel: ata3.00: failed command: READ DMA EXT (Minor Issues)
May 13 11:28:33 Tower kernel: ata3.00: cmd 25/00:40:e7:53:b3/00:03:02:00:00/e0 tag 0 dma 425984 in (Drive related)
May 13 11:28:33 Tower kernel:          res 51/40:3f:db:56:b3/00:00:02:00:00/e0 Emask 0x9 (media error) (Errors)
May 13 11:28:33 Tower kernel: ata3.00: status: { DRDY ERR } (Drive related)
May 13 11:28:33 Tower kernel: ata3.00: error: { UNC } (Errors)
May 13 11:28:33 Tower kernel: ata3.00: configured for UDMA/133 (Drive related)
May 13 11:28:33 Tower kernel: ata3: EH complete (Drive related)
May 13 11:28:36 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
May 13 11:28:36 Tower kernel: ata3.00: irq_stat 0x40000001 (Drive related)
May 13 11:28:36 Tower kernel: ata3.00: failed command: READ DMA EXT (Minor Issues)
May 13 11:28:36 Tower kernel: ata3.00: cmd 25/00:40:e7:53:b3/00:03:02:00:00/e0 tag 0 dma 425984 in (Drive related)
May 13 11:28:36 Tower kernel:          res 51/40:3f:db:56:b3/00:00:02:00:00/e0 Emask 0x9 (media error) (Errors)
May 13 11:28:36 Tower kernel: ata3.00: status: { DRDY ERR } (Drive related)
May 13 11:28:36 Tower kernel: ata3.00: error: { UNC } (Errors)
May 13 11:28:36 Tower kernel: ata3.00: configured for UDMA/133 (Drive related)
May 13 11:28:36 Tower kernel: ata3: EH complete (Drive related)
May 13 11:28:38 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
May 13 11:28:38 Tower kernel: ata3.00: irq_stat 0x40000001 (Drive related)
May 13 11:28:38 Tower kernel: ata3.00: failed command: READ DMA EXT (Minor Issues)
May 13 11:28:38 Tower kernel: ata3.00: cmd 25/00:40:e7:53:b3/00:03:02:00:00/e0 tag 0 dma 425984 in (Drive related)
May 13 11:28:38 Tower kernel:          res 51/40:3f:db:56:b3/00:00:02:00:00/e0 Emask 0x9 (media error) (Errors)
May 13 11:28:38 Tower kernel: ata3.00: status: { DRDY ERR } (Drive related)
May 13 11:28:38 Tower kernel: ata3.00: error: { UNC } (Errors)
May 13 11:28:38 Tower kernel: ata3.00: configured for UDMA/133 (Drive related)
May 13 11:28:38 Tower kernel: ata3: EH complete (Drive related)

Link to comment

OK, I will try that. The pending sector count is always like this: it goes up and down all the time. Same with offline uncorrectable, which never goes below 1. Here is how it looks now; not much difference except that the ATA errors are gone. I will have to check the cables later.

 

smartctl -a -d ata /dev/sdc (parity)

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD20EADS-00R6B0
Serial Number:    WD-WCAVY2269670
Firmware Version: 01.00A01
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri May 13 11:46:25 2011 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121)	The previous self-test completed having
				the read element of the test failed.
Total time to complete Offline 
data collection: 		 (43200) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   154   021    Pre-fail  Always       -       8025
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1701
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       9945
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       40
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0032   193   193   000    Old_age   Always       -       21690
194 Temperature_Celsius     0x0022   124   117   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      9790         30300665
# 2  Short offline       Completed without error       00%      9789         -
# 3  Short offline       Completed without error       00%      8827         -
# 4  Extended offline    Completed: read failure       90%      8816         35253715
# 5  Short offline       Aborted by host               80%      8816         -
# 6  Short offline       Aborted by host               80%      8816         -
# 7  Short offline       Completed without error       00%      8816         -
# 8  Short offline       Aborted by host               90%      8816         -
# 9  Short offline       Aborted by host               90%      8816         -
#10  Short offline       Completed without error       00%      8816         -
#11  Extended offline    Completed: read failure       90%      8459         27516853
#12  Extended offline    Completed: read failure       90%      7671         27516853
#13  Extended offline    Completed: read failure       90%      7671         27516853
#14  Extended offline    Completed: read failure       90%      7671         27516853
#15  Short offline       Completed: read failure       90%      7661         27516853
#16  Short offline       Completed: read failure       30%      2454         3328229
#17  Short offline       Aborted by host               10%      2446         -
#18  Short offline       Aborted by host               90%      2446         -
#19  Short offline       Aborted by host               70%      2446         -
#20  Short offline       Aborted by host               60%      2446         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Link to comment

I ran the parity check 3 times, and every time these errors come up at exactly 0.8%. Can this still be a cable/power issue? I can't shut down the server right now because it's being used.

 

May 13 12:16:51 Tower kernel: md: recovery thread woken up ... (unRAID engine)
May 13 12:16:51 Tower kernel: md: recovery thread checking parity... (unRAID engine)
May 13 12:16:51 Tower kernel: md: using 1152k window, over a total of 1953514552 blocks. (unRAID engine)
May 13 12:20:02 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
May 13 12:20:02 Tower kernel: ata3.00: irq_stat 0x40000001 (Drive related)
May 13 12:20:02 Tower kernel: ata3.00: failed command: READ DMA EXT (Minor Issues)
May 13 12:20:02 Tower kernel: ata3.00: cmd 25/00:90:77:57:ce/00:03:01:00:00/e0 tag 0 dma 466944 in (Drive related)
May 13 12:20:02 Tower kernel:          res 51/40:ff:f9:59:ce/00:00:01:00:00/e0 Emask 0x9 (media error) (Errors)
May 13 12:20:02 Tower kernel: ata3.00: status: { DRDY ERR } (Drive related)
May 13 12:20:02 Tower kernel: ata3.00: error: { UNC } (Errors)
May 13 12:20:02 Tower kernel: ata3.00: configured for UDMA/133 (Drive related)
May 13 12:20:02 Tower kernel: ata3: EH complete (Drive related)
May 13 12:20:05 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
May 13 12:20:05 Tower kernel: ata3.00: irq_stat 0x40000001 (Drive related)
May 13 12:20:05 Tower kernel: ata3.00: failed command: READ DMA EXT (Minor Issues)
May 13 12:20:05 Tower kernel: ata3.00: cmd 25/00:90:77:57:ce/00:03:01:00:00/e0 tag 0 dma 466944 in (Drive related)
May 13 12:20:05 Tower kernel:          res 51/40:ff:f9:59:ce/00:00:01:00:00/e0 Emask 0x9 (media error) (Errors)
May 13 12:20:05 Tower kernel: ata3.00: status: { DRDY ERR } (Drive related)
May 13 12:20:05 Tower kernel: ata3.00: error: { UNC } (Errors)
May 13 12:20:05 Tower kernel: ata3.00: configured for UDMA/133 (Drive related)
May 13 12:20:05 Tower kernel: ata3: EH complete (Drive related)
May 13 12:20:08 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
May 13 12:20:08 Tower kernel: ata3.00: irq_stat 0x40000001 (Drive related)
May 13 12:20:08 Tower kernel: ata3.00: failed command: READ DMA EXT (Minor Issues)
May 13 12:20:08 Tower kernel: ata3.00: cmd 25/00:90:77:57:ce/00:03:01:00:00/e0 tag 0 dma 466944 in (Drive related)
May 13 12:20:08 Tower kernel:          res 51/40:ff:f9:59:ce/00:00:01:00:00/e0 Emask 0x9 (media error) (Errors)
May 13 12:20:08 Tower kernel: ata3.00: status: { DRDY ERR } (Drive related)
May 13 12:20:08 Tower kernel: ata3.00: error: { UNC } (Errors)
May 13 12:20:08 Tower kernel: ata3.00: configured for UDMA/133 (Drive related)
May 13 12:20:08 Tower kernel: ata3: EH complete (Drive related)
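
The `res` line in these syslog entries carries the address of the sector that failed, so it can be compared with the SMART self-test log. A minimal Python sketch that decodes it, assuming the usual libata layout of the result taskfile (status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device); the function name `decode_res` is my own:

```python
import re

# Twelve hex bytes separated by "/" or ":" after the word "res".
RES = re.compile(r"res ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")

def decode_res(line):
    """Return (status, error, lba48) from a libata 'res' syslog line."""
    m = RES.search(line)
    if m is None:
        raise ValueError("no 'res' taskfile found in line")
    b = [int(x, 16) for x in re.split(r"[/:]", m.group(1))]
    status, error = b[0], b[1]
    lbal, lbam, lbah = b[3], b[4], b[5]            # low 24 bits of the LBA
    hob_lbal, hob_lbam, hob_lbah = b[8], b[9], b[10]  # high 24 bits (LBA48)
    lba = (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
           | lbah << 16 | lbam << 8 | lbal)
    return status, error, lba

# The two distinct 'res' lines that appear in this thread.
LINE_1220 = "res 51/40:ff:f9:59:ce/00:00:01:00:00/e0 Emask 0x9 (media error)"
LINE_1128 = "res 51/40:3f:db:56:b3/00:00:02:00:00/e0 Emask 0x9 (media error)"

for line in (LINE_1220, LINE_1128):
    status, error, lba = decode_res(line)
    # Bit 0x40 of the error register is UNC (uncorrectable data).
    print(f"status=0x{status:02x} error=0x{error:02x} UNC={bool(error & 0x40)} LBA={lba}")
```

If the field layout assumed here is right, the 12:20 entries decode to LBA 30300665, which is the same LBA_of_first_error that self-test #1 reports on the parity drive (serial WD-WCAVY2269670). That would point to ata3.00 being the parity drive rather than disk 1.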

Link to comment

They could be power related, but those are MEDIA errors.  (translation, un-readable sectors on the disk)

 

UNC errors are almost always related to bad sectors on the disk that are not readable.

 

Get a "smart" report on the disk.

smartctl -d ata -a /dev/sdX

where sdX = the three letter designation for your disk.

 

Look for re-allocated sectors and sectors pending re-allocation.  (The counts are in the RAW column on the far right)

Joe L.
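
For anyone who wants to pull those RAW counts out of a saved report instead of reading them by eye, here is a minimal Python sketch. It assumes the ten-column attribute table that smartctl 5.39.1 prints in this thread; the function name `raw_values` and the choice of watched IDs are my own. UDMA_CRC_Error_Count (199) is included because, as far as I know, it is the attribute that usually climbs when a cable is at fault.

```python
# Attribute rows copied from the parity drive's report earlier in the thread.
REPORT = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       6
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# IDs to watch: reallocated, pending, offline uncorrectable, CRC errors.
WATCHED = {5, 197, 198, 199}

def raw_values(report_text):
    """Return {attribute_name: raw_value} for the watched attribute IDs."""
    out = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows have ten columns and start with a numeric ID;
        # RAW_VALUE is the last of those ten columns.
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) in WATCHED:
            out[fields[1]] = int(fields[9])
    return out

print(raw_values(REPORT))
```

The text passed in could be the saved output of `smartctl -d ata -a /dev/sdX`. For the parity drive in this thread it shows no reallocated sectors and no CRC errors, but 6 pending and 3 offline uncorrectable sectors.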

Link to comment

Sorry for the stupid question, but how do I identify which hard drive those errors are for? Is it by ata3.00?

I was using the mymain syslog entries for each disk, and I didn't see these UNC errors under parity but only under disk 1, which I think is wrong, as the SMART report for disk 1 has no errors at all.

 

 

I did some more parity check tests, waiting until 1% before canceling:

1st time: UNC errors at 0.4%, and current_pending_sector went from 6 to 7

2nd time: no errors

3rd time: UNC errors at 0.8%

4th time: UNC errors at 0.9%

 

Right now my SMART report looks like this:

smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD20EADS-00R6B0
Serial Number:    WD-WCAVY2269670
Firmware Version: 01.00A01
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat May 14 08:02:12 2011 MDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 121)	The previous self-test completed having
				the read element of the test failed.
Total time to complete Offline 
data collection: 		 (43200) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   154   021    Pre-fail  Always       -       8025
  4 Start_Stop_Count        0x0032   099   099   000    Old_age   Always       -       1705
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   087   087   000    Old_age   Always       -       9965
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       40
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       29
193 Load_Cycle_Count        0x0032   193   193   000    Old_age   Always       -       21768
194 Temperature_Celsius     0x0022   126   117   000    Old_age   Always       -       26
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       7
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%      9790         30300665
# 2  Short offline       Completed without error       00%      9789         -
# 3  Short offline       Completed without error       00%      8827         -
# 4  Extended offline    Completed: read failure       90%      8816         35253715
# 5  Short offline       Aborted by host               80%      8816         -
# 6  Short offline       Aborted by host               80%      8816         -
# 7  Short offline       Completed without error       00%      8816         -
# 8  Short offline       Aborted by host               90%      8816         -
# 9  Short offline       Aborted by host               90%      8816         -
#10  Short offline       Completed without error       00%      8816         -
#11  Extended offline    Completed: read failure       90%      8459         27516853
#12  Extended offline    Completed: read failure       90%      7671         27516853
#13  Extended offline    Completed: read failure       90%      7671         27516853
#14  Extended offline    Completed: read failure       90%      7671         27516853
#15  Short offline       Completed: read failure       90%      7661         27516853
#16  Short offline       Completed: read failure       30%      2454         3328229
#17  Short offline       Aborted by host               10%      2446         -
#18  Short offline       Aborted by host               90%      2446         -
#19  Short offline       Aborted by host               70%      2446         -
#20  Short offline       Aborted by host               60%      2446         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

Anyway, when I wake up I will have a chance to power down and replug the cables.
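
The percentages at which the parity check hits UNC errors can be compared with the LBA_of_first_error values in the self-test log. A rough Python sketch, assuming 512-byte sectors and that parity-check progress is roughly proportional to the LBA being read (both are my assumptions, not something the reports state):

```python
SECTOR_BYTES = 512                   # assumed logical sector size
CAPACITY_BYTES = 2_000_398_934_016   # "User Capacity" from the SMART report
TOTAL_SECTORS = CAPACITY_BYTES // SECTOR_BYTES

def percent_of_disk(lba):
    """Position of an LBA as a percentage of the whole drive."""
    return 100.0 * lba / TOTAL_SECTORS

# LBA_of_first_error values from the self-test log.
for lba in (3328229, 27516853, 30300665, 35253715):
    print(f"LBA {lba:>9}: {percent_of_disk(lba):.2f}% of the disk")
```

Under those assumptions, LBAs 30300665 and 35253715 sit at roughly 0.8% and 0.9% of the drive, close to where the parity checks report UNC errors. That is what one would expect from unreadable sectors on this drive, and less so from a cable problem, which would not normally recur at the same addresses.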

Link to comment

There are basically two things that can cause drive errors:

 

1 - The drive itself is failing.  If the drive is failing, you will see attribute errors in the smart report - notably reallocated sectors and pending sectors.  (Although other attributes have their individual failure thresholds, and if those thresholds are approached they can also be indicators of impending drive failure).  The computer / OS does not cause the drive to have these types of errors.

 

2 - The connection to the drive is bad (e.g., bad cable, bad drive cage, bad port, etc.).  If the connection is bad, you will tend to see errors in the unRAID syslog AND see the ata error count on the drive increase.  These types of errors indicate that the data is being garbled in transmission.  So if, for example, you have a 1T drive and the computer is requesting a read of a sector at offset 750G, and the instruction is garbled and the drive sees it as an instruction to read a sector at offset 1750G (bigger than the drive), the drive will return some error to the OS, likely some type of read error.  This type of error is logged in the syslog, and also remembered by the drive as an "ata error".  In this type of scenario, the drive is doing exactly what it should be doing, and the problem is frequently the cable.

 

It looks like you have run some extensive self-tests on this drive.  Note that the spin down feature of unRAID can cause drive self-tests to fail.  So make sure to disable spin down on a drive before attempting to run a self test.

 

I don't have much experience with the "offline uncorrectable" attribute.  A value of 1 is not affecting the normalized attribute values, so I am assuming it is not a problem worth worrying about.  But I am not sure - I personally would not be happy to see offline uncorrectable errors.

 

Current pending sectors indicate that there was difficulty reading a sector and that it needs to be monitored for possible reallocation at a later time.  Frequently current pending sectors become reallocated after a parity check or preclear cycle.  But I've also seen pending sectors, even a hundred of them, clear themselves and go back to 0, with no reallocated sectors.  It is hard to interpret why this happens, but there is no evidence that these drives have given problems in future use.

 

I would recommend running parity checks and watching the attributes, paying particular attention to the ata error count and the reallocated sector count.  If ata errors increase, check / replace your cables to the drive.  If reallocated sectors increase (and don't hold steady for three consecutive parity checks), it is time to RMA the drive.
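
The "watch the attributes between parity checks" advice can be sketched as a small comparison of two snapshots. This is only an illustration in Python: the function name `advise`, the `ata_error_count` key, and the wording of the notes are my own, and the rule is simplified (it does not track the "three consecutive parity checks" part).

```python
def advise(before, after):
    """Compare two {attribute: raw_value} snapshots taken around a parity check."""
    notes = []
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name, 0), after.get(name, 0)
        if new != old:
            notes.append(f"{name}: {old} -> {new} ({new - old:+d})")
    if after.get("Reallocated_Sector_Ct", 0) > before.get("Reallocated_Sector_Ct", 0):
        notes.append("reallocated sectors increased: watch the next parity checks")
    if after.get("ata_error_count", 0) > before.get("ata_error_count", 0):
        notes.append("ATA errors increased: check or replace the cables")
    return notes

# Raw values taken from the May 13 and May 14 reports in this thread.
may13 = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 6, "Offline_Uncorrectable": 3}
may14 = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 7, "Offline_Uncorrectable": 1}

for note in advise(may13, may14):
    print(note)
```

With the two reports posted so far, only the pending and offline uncorrectable counts move; the reallocated sector count stays at 0.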

Link to comment

Those 75 ATA errors I had disappeared during a parity check, and they haven't shown up since. I replugged the cables but I am still getting those UNC errors.

 

Since Jan 24, Offline_Uncorrectable has ranged from 0 to 4.

Since Jan 24, Current_Pending_Sector has ranged from 2 to 9.

 

Right now it's at:

 

current_pending_sector=7

offline_uncorrectable=1

 

The reallocated sector count has never changed from 0. Shouldn't it have gone up since Jan 24, when I started getting current pending sectors?

And the ATA errors are gone.

Link to comment

The ATA errors should never reset to zero.  This is a cumulative count of invalid ATA instructions received.  Firmware, like all software, has bugs.  If you are seeing the ATA errors disappear, I suspect you are experiencing an unforeseen firmware bug due to the variety and number of errors the drive is experiencing.  Double check that the drive now showing no ATA errors is the same drive on which you saw the ATA error count, and not a different one.  If it is the same drive, and the ATA error count got reset to 0, I would not trust the SMART monitoring on the drive and would RMA it.  Why take chances?

Link to comment

Yes, it's the same drive.

 

SMART said this when it had ATA errors:

 

SMART Error Log Version: 1

Warning: ATA error count 75 inconsistent with error log pointer 1

 

Now it's like this:

 

SMART Error Log Version: 1

No Errors Logged

 

I did a few more parity checks and still get UNC errors, but current pending sector is still at 7 and offline uncorrectable is at 1, with still none for the reallocated count.

 

Thanks for the help. Hopefully I can just RMA it and not have to worry about my parity.

 

 

Link to comment
  • 3 weeks later...

I just installed 5.0b6a, did a parity check, and noticed in mymain that the ATA error count went back to 100. I was going to record the errors, but after I refreshed, the ATA errors were gone again. Is this something that should be happening? I was then looking at the nice custom GUI, clicked SMART log on it, and now I see the errors. Is there a reason they go away in the mymain SMART report but stay on this report? My cache drive has had 22 ATA errors and they never go away. Is this something I can RMA the drive for? I don't want them to test the drive and tell me it's fine.

 

http://pastebin.com/e6ipKtrK

Link to comment
