[SOLVED] Holy Hell, Errors across 3 drives after parity check


My server (4.7 plus) does a monthly parity scan. After I ran it, I saw there were errors on disk 0, 2, and my cache drive.

Here's a snapshot of mymain, syslog and smart reports for all three.

 

I'm not familiar with sector reallocation, usually my drives just straight up fail.  :P

 

SMART says they've all passed, so should I be worried or carry on? I'd love it if someone with a better mind for these kinds of things took a peek. I haven't touched the system since I noticed these problems. Should I shut down and reboot so the sectors are reallocated or...? RMA? Yikes!

 

Thank you!!

 

 

MyMain:

Mymain.png

 

syslog: http://dl.dropbox.com/u/519591/unraid/syslog-2012-02-01.txt

 

Disk0

smartctl -a -d ata /dev/sdd (parity)
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA1020507
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Feb  2 00:09:05 2012 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (40500) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   166   162   021    Pre-fail  Always       -       6691
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       601
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       8928
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       68
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       17
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2371
194 Temperature_Celsius     0x0022   122   115   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       13
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       22

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


 

Disk2:

smartctl -a -d ata /dev/sde (disk2)
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA1050439
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Feb  2 00:09:40 2012 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (38100) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   198   198   051    Pre-fail  Always       -       2543
  3 Spin_Up_Time            0x0027   165   162   021    Pre-fail  Always       -       6741
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       472
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       8908
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       61
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       13
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2198
194 Temperature_Celsius     0x0022   120   113   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       123
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       33
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   199   000    Old_age   Offline      -       248

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

Cache:

smartctl -a -d ata /dev/sdc (cache)
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint T166 series
Device Model:     SAMSUNG HD501LJ
Serial Number:    S0ZFJ1KQ302194
Firmware Version: CR100-12
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 3b
Local Time is:    Thu Feb  2 00:09:42 2012 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
				was completed without error.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (8779) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				No Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 150) minutes.
SCT capabilities: 	       (0x003f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       107
  3 Spin_Up_Time            0x0007   100   100   015    Pre-fail  Always       -       7488
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3478
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   253   253   000    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0024   253   253   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       8508
 10 Spin_Retry_Count        0x0032   253   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   253   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       1087
 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       712510462
187 Reported_Uncorrect      0x0032   253   253   000    Old_age   Always       -       8912896
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       27
190 Airflow_Temperature_Cel 0x0022   074   055   000    Old_age   Always       -       26
194 Temperature_Celsius     0x0022   157   100   000    Old_age   Always       -       27
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       712510462
196 Reallocated_Event_Count 0x0032   253   253   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
202 Data_Address_Mark_Errs  0x0032   253   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1115         -
# 2  Short captive       Completed without error       00%       379         -
# 3  Short captive       Completed without error       00%       379         -
# 4  Short captive       Completed without error       00%       379         -
# 5  Short offline       Completed without error       00%       379         -
# 6  Short offline       Completed without error       00%       379         -
# 7  Short offline       Completed without error       00%         0         -

Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
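
For anyone skimming reports like the three above: the overall PASSED line only says no attribute has crossed its threshold, so the raw values of attributes 5, 197 and 198 are worth reading directly. Here is a minimal Python sketch that pulls them out of pasted `smartctl -a` text. The choice of watched attributes and the helper name `flag_attributes` are assumptions made for illustration, not anything unRAID or smartmontools ships.

```python
import re

# Attributes whose nonzero raw value usually deserves a closer look.
# This selection is an editorial assumption, not an official list.
WATCHED = {5, 197, 198}

# One row of the "Vendor Specific SMART Attributes" table from `smartctl -a`:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

def flag_attributes(report: str) -> dict:
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if not match:
            continue  # header lines and free text are skipped
        attr_id, name, raw = int(match.group(1)), match.group(2), int(match.group(3))
        if attr_id in WATCHED and raw > 0:
            flagged[name] = raw
    return flagged

# Three rows copied from the disk2 report above.
sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       123
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       33
"""
print(flag_attributes(sample))
```

On the disk2 rows this reports 123 pending and 33 offline-uncorrectable sectors even though the health line reads PASSED.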

 

Link to comment

A reboot will do nothing, and it will be tough to fully recover without some data loss. Every other disk must be healthy to properly recover a failed disk, and you have 2 array disks with issues. I'd first copy the data off disk2 if possible. You may want to power down and re-seat all the HDD connections first and then do another parity check.

 

Antec power supply or older/cheaper model by any chance?

 

Peter

Link to comment

Server is actually an HP Proliant ML110 I got for free. Unsure on PSU make, though I wanted to upgrade to a custom setup so I could add more disks (Server is full)

 

c00706191.jpg

 

I could copy the data off disk2 to a disk outside the array, that's not a big deal. I'll do that tonight. What would you recommend I do with the drives? I guess sector reallocation is inevitable sometimes, but am I seeing enough of it that the drives are in danger?

 

Man, what a pain! hah.

Link to comment

I also have 7 disks which vary from about 10 months to 4 years old and have not yet had 1 pending or reallocated sector, so it's not normal to suddenly see that on 3 drives at one time.

 

I would suspect it's a hardware issue, possibly poor cooling during the parity check or a poor power supply that was overstressed during the parity check. See if you can get the specs off the power supply, or swap it for a better-quality supply, even if only for a test.

 

You can either run another parity check and see if the numbers change, or do a preclear on the suspect drives and see if the sectors clear up. Of course, the preclear will break the array and wipe the data off the drives.

 

Peter

 

Link to comment

I suspect it is more likely he never looked at the SMART reports before, never precleared the disks, and it is only after loading unMENU and seeing the errors highlighted that he is panicking.

 

Re-seating connectors is not likely to help.  In fact, a connector accidentally dislodged in the process could make a manageable case go bad very quickly.

 

Basically, you need to work from your worst disk to the least bad.

The un-readable sectors could be unused space, or in critical files... no way to know.

 

I would copy whatever is critical off of disk2 onto other disks in the array that currently have NO errors. 

Once copied off and safely on other disks, delete the files from disk2.  If the un-readable sectors are in files, unRAID should re-construct the contents from parity and the other disks. 

 

You just need to play the odds that the un-readable sector on one of the other disks is not in the file with the bad sector on disk2.  (Odds are in your favor)

 

Then, copy them back.  That should cause disk2 to re-allocate the bad sectors (if they were in files)

 

Then, get a new set of smart reports, rinse, lather, and repeat with the files on the cache drive.

 

Lastly, you'll need to re-check parity.  That should fix the errors there by writing the un-readable sectors.

Don't do that until you fix disk2 though, or it will not be able to reconstruct what is bad there.

 

Lastly, think about RMAing disk2.  And preclear all your drives before putting them in use.
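
As a rough illustration of the worst-to-least-bad ordering described above, here is a small Python sketch using the Current_Pending_Sector raw values from the three reports in the first post. The function name `triage_order` is hypothetical, and putting parity last simply mirrors the advice to re-check parity only after the data disks are dealt with.

```python
# Raw Current_Pending_Sector counts from the SMART reports in the first post.
pending = {"parity": 13, "disk2": 123, "cache": 1}

def triage_order(counts: dict, parity_name: str = "parity") -> list:
    """Data disks from most to fewest pending sectors, with parity handled last."""
    data_disks = sorted(
        (name for name in counts if name != parity_name),
        key=counts.get,
        reverse=True,
    )
    # Parity goes last because it is repaired by the final parity check.
    return data_disks + ([parity_name] if parity_name in counts else [])

print(triage_order(pending))  # ['disk2', 'cache', 'parity']
```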

 

Joe L.

Link to comment

Nah I'm not that bad haha, these drives were precleared 3 times in a row before being put into use along with the other drives in the array. SMART reports were golden pre&post preclear so all these pending sectors are new ones. They were going strong all the way up until the latest parity check, the last being 1/1/12. :(

 

OK so next steps.

1. Copy everything off disk2 to other disks in the array (I assume it's not a problem to copy it OFF the array too; it just puts whatever disk I copy it to at risk if that disk isn't protected, right?)

2. Copy data back onto disk2 after sacrificing goat

3. Check to ensure pending sectors/unrecoverable sectors etc. have not risen since the previous report, else RMA

4. Do the same as above with the cache drive

5. Once both cache and disk2 seem OK, recheck parity, which should sort out the parity drive. If reallocated sectors jump dramatically, consider RMA

 

Thanks for the help guys! I'll follow up after disk2 is moved, deleted and added back.

Link to comment

Just writing to the disks may not clear the bad sectors. You have to write to the bad sector.

 

In theory, unRAID will reconstruct the sector and write it back to the drive when there is a read error due to a bad sector. So, in theory, just running parity checks would clear the bad pending sectors. I believe you need to do a correcting parity check for this to happen though and even then, I've read that it does this in theory but don't recall ever seeing proof that it actually happens.
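
To make the reconstruction idea concrete, here is a minimal Python sketch of single-parity recovery. It assumes the parity block is the byte-wise XOR of the data blocks, which is the usual single-parity scheme; the disk contents are made-up bytes and this is not unRAID's actual code.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Made-up contents of the same sector on three data disks.
disk1 = bytes([0x12, 0x34, 0x56, 0x78])
disk2 = bytes([0xAB, 0xCD, 0xEF, 0x01])
disk3 = bytes([0x00, 0xFF, 0x10, 0x20])

# Parity is computed across all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# Pretend disk2's sector is unreadable: rebuild it from parity plus the others.
rebuilt = xor_blocks([parity, disk1, disk3])
assert rebuilt == disk2
```

The same arithmetic shows why every other disk has to be readable at that sector: if two disks are unreadable at the same position, one XOR equation cannot recover two unknowns.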

 

I'd still be very suspicious of your power supply. I've read a few cases here where pending sectors were occurring on multiple drives and a new power supply cleared the problem. The sectors were not re-allocated either; they were cleared from the SMART data, indicating they began to work correctly and could be read again. It might not help, but you've got 3 drives out of 7 acting up at the same time, which just isn't expected.

 

Peter

 

Link to comment

I can envision a situation where marginal power could cause writes to sectors to be marginal to where they sometimes cannot be read back.  A proper "write" with good power would result in the original sector being used, and not a re-allocation. 
Link to comment

Sounds like I need to fasttrack my new build then.  ;D

 

I'm still in the process of moving everything off disk 2. If I do a preclear it's going to toast the parity validity (I think). Will unRAID just ask if I want to rebuild from parity, and should I do that, or should I just copy everything back over normally and then build a NEW parity?

Link to comment

Sorry, poor choice of words. If I do a preclear on the disk, it's going to involve me taking it out of the array and deleting everything on it, meaning I'd have to recalculate parity. :)

 

 

So I copied everything off, pending sectors went UP by about 5 sectors on disk2. :(

I'm going to shut down the server, see whether the PSU is a standard unit or some proprietary HP thing, and try a swap. From there I'll do a preclear on disk2 and see how it fares.  Anything bad about my next steps?

 

thanks all!

 

Link to comment

Turns out the PSU isn't proprietary, but it's designed in a way where it sits upside down compared to standard PSUs, so any PSU worth its salt with a fan blows air against a piece of aluminum. I'm going to order my parts and do an upgrade before I finish this up, and do preclears in the meantime; this thing is a P4 that really is on its last legs.

 

And the PSU is a no-name clunker at 350 watts. On paper it should be fine, but I'm guessing it just can't cut it anymore due to age. :(

Link to comment

Preclear finished on disk2. Funny, because after the zero-write the pending sector count dropped to 0, but by the end of the test it had jumped to 373.

 

Sounds like it's time to RMA? If SMART still says PASSED, will they take the disk regardless?

 

I'm getting a new PSU today and plan to migrate everything into a new machine and do testing from a stronger platform. As long as I reassign drives in the correct order I shouldn't have any migration issues, right?

 

results below. Thanks!

 

Disk Temperature: 30C, Elapsed Time:  38:22:31
========================================================================1.13
==  WDC WD20EARS-00MVWB0    WD-WCAZA1050439
== Disk /dev/sde has been successfully precleared
== with a starting sector of 63
============================================================================
** Changed attributes in files: /tmp/smart_start_sde  /tmp/smart_finish_sde
                ATTRIBUTE   NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS      RAW_VALUE
      Raw_Read_Error_Rate =   188     198           51        ok          15935
    Reallocated_Sector_Ct =   199     200          140        ok          30
      Temperature_Celsius =   120     122            0        ok          30
  Reallocated_Event_Count =   171     200            0        ok          29
   Current_Pending_Sector =   199     200            0        ok          373
No SMART attributes are FAILING_NOW

132 sectors were pending re-allocation before the start of the preclear.
211 sectors were pending re-allocation after pre-read in cycle 1 of 1.
0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.
373 sectors are pending re-allocation at the end of the preclear,
    a change of 241 in the number of sectors pending re-allocation.
0 sectors had been re-allocated before the start of the preclear.
30 sectors are re-allocated at the end of the preclear,
    a change of 30 in the number of sectors re-allocated.
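
The pending-sector figures in the report above can be checked with a few lines of Python. This is only a sketch; the `summarize` helper is made up for illustration and is not part of the preclear script. It computes the net change and notes whether the count climbed again after its lowest point.

```python
# Pending-sector counts as printed by the preclear report above, in order.
phases = [
    ("before preclear", 132),
    ("after pre-read", 211),
    ("after zeroing", 0),
    ("end of preclear", 373),
]

def summarize(phases):
    """Return (net change, whether the count rose again after its minimum)."""
    counts = [count for _, count in phases]
    net_change = counts[-1] - counts[0]
    regrew = counts[-1] > min(counts)
    return net_change, regrew

print(summarize(phases))  # (241, True)
```

The zero pass brought the count to 0, yet the final read found 373 pending sectors on freshly written data, which is the pattern that prompted the RMA question.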

Link to comment

New server is assembled, upgraded to a light-duty AMD Athlon 4850e processor and mobo combo in an old CM Stacker, but more importantly a Corsair HX650 PSU.  8) I had a Corsair 620 watt PSU in my desktop I wanted to test with, but it has 3 (!) rails, so while it's good for my 3-disk, 1-CD-ROM desktop, not so much for a big server.

 

I have submitted an RMA request for disk2 since it's a big offender on the reallocated sectors. Once I get the new disk, I'll preclear it, load it into the array and copy the data over to it. Then I'll tackle the parity drive and the not-too-important cache drive.

 

Lionelhutz, the issue is the fitment is upside down in the ML110, so the psu was unable to intake since the inlet was flush against the top of the case. Doesn't matter now though, anyone want a used HP ML110 that can handle loads of up to 6 disks but not 7?  ;D

Link to comment

Update: Disk 2 was RMAed, and the new disk was precleared and passed the test, so I added it to the array. There was a data rebuild to it, but since the previous disk was empty my files aren't on there and potentially corrupted. Right now I'm copying all the files back onto disk 2. I still have the issue of the parity drive throwing errors; my pending sector count actually went up by 1 when I did the rebuild.  :-\

 

How do I go about forcing it to refresh? Will a simple parity check do this or would I need to pull it out of the array, preclear it and then introduce it back if it fixes itself?

Link to comment

So after copying about 1.5TB back onto the drive, the parity drive was used heavily and the pending sectors were handled. Here is a revised SMART report. 87 errors are reported on the unRAID page. Should I still RMA this drive? If not, should I just keep doing parity checks until errors are zero?

 

thanks!

 

smartctl -a -d ata /dev/sdd (parity)
smartctl 5.39.1 2010-01-28 r3054 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WCAZA1020507
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Feb 15 01:23:53 2012 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (40500) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   165   162   021    Pre-fail  Always       -       6733
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       632
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       9110
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       90
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       28
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2430
194 Temperature_Celsius     0x0022   122   115   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       6
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       23

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      9082         3511387976

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
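
To see what actually moved on the parity drive between the Feb 2 report and this one, a small Python sketch that compares a handful of raw values copied from the two reports. The `changed` helper is hypothetical and the attribute selection is just the error-related rows.

```python
# Raw values for the parity drive (WD-WCAZA1020507), copied from the two reports.
feb_02 = {
    "Reallocated_Sector_Ct": 0,
    "Current_Pending_Sector": 13,
    "Offline_Uncorrectable": 6,
    "Multi_Zone_Error_Rate": 22,
}
feb_15 = {
    "Reallocated_Sector_Ct": 0,
    "Current_Pending_Sector": 0,
    "Offline_Uncorrectable": 6,
    "Multi_Zone_Error_Rate": 23,
}

def changed(before: dict, after: dict) -> dict:
    """Return {name: (old, new)} for attributes whose raw value differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }

print(changed(feb_02, feb_15))
```

Pending sectors went from 13 to 0 with no reallocations. Offline_Uncorrectable still reads 6, and its UPDATED column says Offline, so that figure may simply not have been refreshed yet.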

Link to comment

 

Antec power supply or older/cheaper model by any chance?

 

Peter

 

Hi Peter,

May I ask why you specifically asked about the Antec PSU? My current server PSU is an Antec 650W, and I've had a few problems with multiple simultaneous failed disks and other errors. I'm switching the whole server to new hardware shortly, including a better PSU, but I just wondered whether you'd encountered inherent problems with the Antecs.

 

Sorry for threadjack, OP.

Link to comment

 

There have been 2 or 3 others who have reported odd disk issues here that went away after replacing their Antec supply.

 

Peter

 

Link to comment

 

Thanks for the information. I'll relegate my Antec to HTPC use eventually, where no real data is actually stored.

Link to comment
