1 Disk going bad?


peter_sm

Hi,

 

I'm concerned about one disk: the main page says it has errors :-(

It looks like I got these errors yesterday when I ran a parity check. I'm running a new parity check right now to see whether those errors can be corrected.

This is what I can see in the report:

196 Reallocated_Event_Count 0x0032  199  199  000    Old_age  Always      -      1

197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      4

 

 

Below is the SMART result. Could someone tell me whether this disk is bad, or whether it is possible to fix those errors?

 

 

 

// Peter

smartctl -a -d ata /dev/sde
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10EAVS-00D7B0
Serial Number:    WD-WCAU40258662
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Jul 25 11:42:26 2010 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85)	Offline data collection activity
				was aborted by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 114)	The previous self-test completed having
				the read element of the test failed.
Total time to complete Offline 
data collection: 		 (23400) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   148   144   021    Pre-fail  Always       -       7558
  4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3922
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12329
10 Spin_Retry_Count        0x0032   100   100   051    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   051    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       820
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       20
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3921
194 Temperature_Celsius     0x0022   126   102   000    Old_age   Always       -       24
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       4
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%     12329         4921505

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.





A disk error reported on the unRAID main page is a "read" error.  Basically the disk reported to the OS it could not read a sector.

 

When unRAID's driver gets this error, it re-constructs the sector that could not be read from the physical disk by reading all the other disks in the array, and presents the re-constructed sector to you.

 

unRAID also writes the sector back to the disk it could not read.  This gives the firmware on the disk a chance to re-allocate the contents of the un-readable sector to one of the spare sectors.  On a typical disk these days there are several thousand spare sectors.  It can re-allocate the sector because the "write" to it supplied the contents.
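To make the reconstruction step concrete, here is a minimal Python sketch of single-parity recovery. It assumes the parity disk holds the byte-wise XOR of the data disks. The 4-byte "sectors" and the names `xor_blocks`, `disk1`, `disk2` and `disk3` are made up for illustration; this is not unRAID's actual driver code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-byte "sectors" at the same position on three data disks.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x01, 0x02, 0x03, 0x04])
disk3 = bytes([0xAA, 0xBB, 0xCC, 0xDD])

# Assumption: the parity disk stores the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# If disk2 reports a read error, its sector is the XOR of everything else.
rebuilt_disk2 = xor_blocks([disk1, disk3, parity])

# The rebuilt sector is what gets handed to the reader and written back.
print(rebuilt_disk2 == disk2)
```

The same XOR is what lets the sector be written back: once the contents are known again, the drive firmware has something to store in the original or a spare sector.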

 

You apparently have several other sectors that have been detected as un-readable at some point.  When they are next read (or written) they too will be re-allocated (or not, since the firmware first tries to write to the original location, just in case it was not written to originally in a way that could be read back.)

 

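The "normalized" value for the re-allocation attributes probably started at 200 (Reallocated_Event_Count is now at 199).  An attribute is considered "failed" when its normalized value drops to the value in the "threshold" column, which in your report is 140 for Reallocated_Sector_Ct and 000 for Reallocated_Event_Count.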
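As a rough illustration of the value-versus-threshold comparison, the sketch below checks three rows copied from the first report. The helper name `is_failing` is hypothetical, and the rule that a threshold of 0 can never trigger a failure is my assumption about how smartctl treats such attributes, so treat this as a sketch and not as smartctl's own logic.

```python
# (id, name, normalized value, worst, threshold) copied from the report above.
ROWS = [
    (5, "Reallocated_Sector_Ct", 200, 200, 140),
    (196, "Reallocated_Event_Count", 199, 199, 0),
    (197, "Current_Pending_Sector", 200, 200, 0),
]


def is_failing(value, thresh):
    """Return True when the normalized value has dropped to the threshold.

    Assumption: a threshold of 0 means the attribute is never flagged.
    """
    return thresh > 0 and value <= thresh


for attr_id, name, value, worst, thresh in ROWS:
    status = "FAILING" if is_failing(value, thresh) else "ok"
    print(f"{attr_id:>3} {name:<24} value={value} thresh={thresh} {status}")

failing = [name for _id, name, value, _worst, thresh in ROWS if is_failing(value, thresh)]
```

None of the three rows is at its threshold, which matches the "PASSED" overall assessment in the report even though the raw counts are non-zero.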

 

You cannot do anything to change the errors.  When you next "write" to those sectors they will be re-allocated.  The only way to "force" a write of the entire disk would be to:

 

1. Run a full parity check (which you've just done) to make sure no other problems exist.
2. Stop the array.
3. Un-assign the disk.
4. Start the array with the disk un-assigned.  This will simulate its failure and cause the array to forget its model/serial number.
5. Stop the array a second time.
6. Re-assign the disk.
7. Start the array once more by pressing "Start".

 

The array will then re-construct the disk by reading the others in the array, and when it gets to those needing re-allocation they will be re-allocated if the original sector is un-writable.

 

You will be without parity protection while the disk is being re-constructed.  The re-construction of a 1TB drive will take at least 8 hours.  If your array is not stable enough to be without parity protection for that long, you should make copies of any critical files elsewhere or on multiple disks.
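For a feel of what "at least 8 hours" means, here is a small back-of-the-envelope calculation in Python. The capacity is the "User Capacity" line from the smartctl report; the function names are hypothetical, and real rebuild speed depends on the slowest disk and the controller, so this is only an estimate.

```python
CAPACITY_BYTES = 1_000_204_886_016  # "User Capacity" from the smartctl report


def implied_rate(capacity_bytes, hours):
    """Average bytes per second needed to cover the disk in the given time."""
    return capacity_bytes / (hours * 3600)


def rebuild_hours(capacity_bytes, bytes_per_second):
    """Hours needed to write the whole disk at a steady rate."""
    return capacity_bytes / bytes_per_second / 3600


# Average rate that corresponds to an 8 hour rebuild of this disk.
rate = implied_rate(CAPACITY_BYTES, 8)
print(f"8 hours corresponds to about {rate / 1e6:.1f} MB/s on average")

# A slower array takes proportionally longer, for example at 25 MB/s:
print(f"at 25 MB/s: about {rebuild_hours(CAPACITY_BYTES, 25e6):.1f} hours")
```

So an 8 hour rebuild corresponds to an average of roughly 34.7 MB/s across the whole disk; anything slower stretches the window without parity protection.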

 

Joe L.

Thanks Joe!

 

Great answer, so I should not be worried :-) I only write to the disk through the disk shares from my Windows PC, never through the user shares.

 

So perhaps it's OK to do the procedure that you suggest.

 

EDIT

 

 

This is the new SMART report. Current_Pending_Sector has changed: it is now 3, before it was 4.

196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3

 

smartctl -a -d ata /dev/sde
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10EAVS-00D7B0
Serial Number:    WD-WCAU40258662
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sun Jul 25 18:12:49 2010 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 114)	The previous self-test completed having
				the read element of the test failed.
Total time to complete Offline 
data collection: 		 (23400) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
 3 Spin_Up_Time            0x0027   148   144   021    Pre-fail  Always       -       7558
 4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3923
 5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x002e   100   253   051    Old_age   Always       -       0
 9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12335
10 Spin_Retry_Count        0x0032   100   100   051    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   051    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       820
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       20
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3922
194 Temperature_Celsius     0x0022   129   102   000    Old_age   Always       -       21
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%     12329         4921505

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
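Comparing two reports by eye is error-prone, so here is a minimal Python sketch that pulls the RAW_VALUE column out of two smartctl attribute tables and lists what changed. The regular expression assumes the ten-column layout shown in these reports, and the names `parse_attributes` and `changed` are hypothetical. Only three rows from each report are included to keep it short.

```python
import re

# Matches one attribute row: ID, name, flag, value, worst, thresh,
# type, updated, when_failed, raw value (layout assumed from the reports).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)\s*$"
)

BEFORE = """
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12329
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       4
"""

AFTER = """
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12335
196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
"""


def parse_attributes(report):
    """Map attribute name to its integer raw value."""
    attrs = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(6))
    return attrs


def changed(before, after):
    """Return {name: (old_raw, new_raw)} for attributes whose raw value differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


diff = changed(parse_attributes(BEFORE), parse_attributes(AFTER))
for name, (old, new) in diff.items():
    print(f"{name}: {old} -> {new}")
```

Rows whose raw value is not a plain integer would be skipped by this pattern, which is fine for the attributes of interest here.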

 

Info from the syslog

Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923248/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923256/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923264/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923272/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923280/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923288/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923296/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923304/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923312/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923320/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923328/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923336/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923344/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923352/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923360/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923368/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923376/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923384/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923392/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923400/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923408/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923416/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923424/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923432/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923440/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923448/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923456/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923464/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923472/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923480/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923488/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923496/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923504/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923512/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923520/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923528/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923536/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923544/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923552/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923560/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923568/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923576/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923584/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923592/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923600/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923608/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923616/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923624/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923632/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923640/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923648/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923656/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923664/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923672/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923680/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923688/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923696/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923704/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923712/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923720/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923728/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923736/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923744/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923752/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923760/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923768/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923776/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923784/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923792/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923800/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923808/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923816/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923824/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923832/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923840/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923848/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923856/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923864/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923872/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923880/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923888/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923896/4, count: 1
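A log this repetitive is easier to read as a summary. The Python sketch below extracts the sector numbers from the `handle_stripe read error` lines and reports the first, last, count and spacing. The name `summarize` is hypothetical, and I am only assuming the number after the slash is the disk slot. Three pairs of lines from the log are used as sample input.

```python
import re

# Captures sector, the number after the slash (assumed disk slot), and count.
STRIPE_ERROR = re.compile(r"handle_stripe read error: (\d+)/(\d+), count: (\d+)")

SAMPLE = """\
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923248/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923256/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923264/4, count: 1
""".splitlines()


def summarize(lines):
    """Summarize the sector numbers found in handle_stripe read error lines."""
    sectors = []
    for line in lines:
        match = STRIPE_ERROR.search(line)
        if match:
            sectors.append(int(match.group(1)))
    if not sectors:
        return None
    sectors.sort()
    # Distinct gaps between neighbouring sectors; one value means one even run.
    steps = sorted({b - a for a, b in zip(sectors, sectors[1:])})
    return {
        "first": sectors[0],
        "last": sectors[-1],
        "count": len(sectors),
        "steps": steps,
    }


summary = summarize(SAMPLE)
print(summary)
```

Run against the full excerpt above, the sectors go from 4923248 to 4923896 in steps of 8, so the errors form one contiguous run and are not scattered across the disk.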

Thanks Joe!

 

Great answer, so I should not be worried :-) I only write to the disk through the disk shares from my Windows PC, never through the user shares.

User shares or disk shares: it makes absolutely no difference how you write to the disks.  You could use a Linux "cp" (copy) command and you would still be writing to the disks; again, no difference.

So perhaps it's OK to do the procedure that you suggest.

 

Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923624/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923632/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923640/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923648/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923656/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923664/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923672/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923680/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923688/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923696/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923704/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923712/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923720/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923728/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923736/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923744/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923752/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923760/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923768/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923776/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923784/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923792/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923800/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923808/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923816/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923824/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923832/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923840/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923848/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923856/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923864/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923872/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923880/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923888/4, count: 1
Jul 25 11:48:19 Tower kernel: md: disk4 read error
Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923896/4, count: 1
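
For anyone facing a syslog flooded like this, here is a minimal Python sketch (not part of unRAID) that condenses the `handle_stripe read error` lines into a per-disk summary. It assumes the number after the `/` is the array disk number, which matches the `disk4` lines here but is not something the log itself states. The function name `summarize_read_errors` is made up for illustration.

```python
import re

# Matches lines such as:
#   "Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923368/4, count: 1"
# Assumption: the number after "/" is the array disk number.
STRIPE_ERROR = re.compile(r"handle_stripe read error: (\d+)/(\d+), count: (\d+)")


def summarize_read_errors(log_text):
    """Group handle_stripe read errors by disk and report the sector span."""
    sectors_by_disk = {}
    for match in STRIPE_ERROR.finditer(log_text):
        sector, disk = int(match.group(1)), int(match.group(2))
        sectors_by_disk.setdefault(disk, []).append(sector)

    summary = {}
    for disk, sectors in sectors_by_disk.items():
        sectors.sort()
        summary[disk] = {
            "errors": len(sectors),
            "first_sector": sectors[0],
            "last_sector": sectors[-1],
        }
    return summary


if __name__ == "__main__":
    sample = (
        "Jul 25 11:48:19 Tower kernel: md: disk4 read error\n"
        "Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923368/4, count: 1\n"
        "Jul 25 11:48:19 Tower kernel: md: disk4 read error\n"
        "Jul 25 11:48:19 Tower kernel: handle_stripe read error: 4923376/4, count: 1\n"
    )
    print(summarize_read_errors(sample))
```

Consecutive entries in the excerpt above are 8 sectors apart, so a summary of first sector, last sector and error count describes the damaged region without the wall of repeated lines.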

I see you have no re-allocated sectors... That indicates that, so far, each time a read error occurred unRAID re-wrote the sector based on parity and the other data disks, and that the actual sector was not re-allocated since the "write" was successful.

 

I'm guessing you did not pre-clear this disk with the preclear_disk.sh script??  Or did you?

 

Joe L.


Hi Joe,

 

I did not clear the disk. Should I move/copy all data to a new disk, and then clear the disk?

 

If that is the case, I will move the data from the console.

 

//Peter

It is not enough to move the data... you must also remove the disk from the array (un-assign it)

 

I would not worry about it.. Just run a smart report every now and again.

 

The pre-clear script would have identified the un-readable sectors in the pre-read phase, then written zeros to them (re-allocating them if appropriate), and then re-read them in the post-read phase to ensure that what was written as zeros is read back as zeros and that no un-readable sectors remain.

 

Since you did not pre-clear the disk, you'll find your un-readable sectors only when attempting to read your files (or during a parity check, which reads all the sectors).  That is why we stress doing a parity check immediately after doing the initial parity calc.
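
To make the three phases concrete, here is a rough Python sketch of a pre-read, zero-fill and post-read verify. It is an illustration only and not the actual preclear_disk.sh: it runs against any seekable file-like object, the names `preclear_sketch` and `BLOCK` are invented, and it does not handle the I/O errors a real failing drive would raise.

```python
import io

BLOCK = 512  # bytes per simulated sector (assumption for illustration)


def preclear_sketch(device, size):
    """Run a pre-read, zero-fill, and post-read verify over `device`.

    `device` is any seekable binary file-like object of `size` bytes.
    Returns the byte offsets of blocks that did not read back as zeros.
    """
    # Pre-read: touch every block so unreadable ones surface early.
    device.seek(0)
    while device.read(BLOCK):
        pass

    # Write: overwrite the whole device with zeros.
    device.seek(0)
    zeros = bytes(BLOCK)
    for _ in range(size // BLOCK):
        device.write(zeros)

    # Post-read: confirm every block reads back as zeros.
    bad = []
    device.seek(0)
    for index in range(size // BLOCK):
        if device.read(BLOCK) != zeros:
            bad.append(index * BLOCK)
    return bad


if __name__ == "__main__":
    fake_disk = io.BytesIO(b"\xff" * 4096)
    print(preclear_sketch(fake_disk, 4096))
```

On a healthy device the returned list is empty. On a real drive the point of the write pass is to give the firmware a chance to re-allocate or re-write sectors that were un-readable in the pre-read.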

 

Joe L.


Hi Joe,

 

I have some questions.

 

What would the steps be if I want to clear the disk? I need to be sure I back up all the data to a new disk first; must that disk be outside the array?

 

You say.....

 

Stop the array

Un-assign the disk

Start the array with the disk un-assigned.  This will simulate its failure and cause the array to forget its model/serial number

Stop the array a second time

Re-assign the disk

"Start" the array once more by pressing "Start"

 

.....but when should I run the pre-clear? And will the above procedure rebuild my data?


You would perform the pre-clear after you un-assign the disk and re-start the array, but before you re-assign the disk to the array.

 

Yes, it will rebuild the data back onto the existing (but pre-cleared) disk.

 

Before you do anything you'll want to copy any critical files off your array or at least copy them to multiple disks, since you'll be without parity protection for more than a day.  Figure 20 hours or so to run the preclear script on the drive, and another 8 or so to rebuild it.

 

Joe L.

 

 


Joe!

 

I did a pre-clear of the above disk, and I'm going to rebuild the disk, but I am very concerned about all the messages I got. Should I trust the disk?

 

//Peter

 

=                unRAID server Pre-Clear disk /dev/sdf
=                       cycle 1 of 1
= Disk Pre-Clear-Read completed                                 DONE
= Step 1 of 10 - Copying zeros to first 2048k bytes             DONE
= Step 2 of 10 - Copying zeros to remainder of disk to clear it DONE
= Step 3 of 10 - Disk is now cleared from MBR onward.           DONE
= Step 4 of 10 - Clearing MBR bytes for partition 2,3 & 4       DONE
= Step 5 of 10 - Clearing MBR code area                         DONE
= Step 6 of 10 - Setting MBR signature bytes                    DONE
= Step 7 of 10 - Setting partition 1 to precleared state        DONE
= Step 8 of 10 - Notifying kernel we changed the partitioning   DONE
= Step 9 of 10 - Creating the /dev/disk/by* entries             DONE
= Step 10 of 10 - Testing if the clear has been successful.     DONE
= Disk Post-Clear-Read completed                                DONE
Disk Temperature: 26C, Elapsed Time:  18:07:09
============================================================================
==
== Disk /dev/sdf has been successfully precleared
==
============================================================================
S.M.A.R.T. error count differences detected after pre-clear
note, some 'raw' values may change, but not be an indication of a problem
64,65c64,65
< 196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1
< 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
---
> 196 Reallocated_Event_Count 0x0032   198   198   000    Old_age   Always       -       2
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       73
71c71,166
< No Errors Logged
---
> Warning: ATA error count 713 inconsistent with error log pointer 1
>
> ATA Error Count: 713 (device log contains only the most recent five errors)
>       CR = Command Register [HEX]
>       FR = Features Register [HEX]
>       SC = Sector Count Register [HEX]
>       SN = Sector Number Register [HEX]
>       CL = Cylinder Low Register [HEX]
>       CH = Cylinder High Register [HEX]
>       DH = Device/Head Register [HEX]
>       DC = Device Command Register [HEX]
>       ER = Error register [HEX]
>       ST = Status register [HEX]
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
>
> Error 713 occurred at disk power-on lifetime: 12384 hours (516 days + 0 hours)
>   When the command that caused the error occurred, the device was active or idle.
>
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   40 51 00 69 9e 06 e0  Error: UNC at LBA = 0x00069e69 = 433769
>
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   c8 00 08 68 9e 06 00 08   1d+12:42:59.687  READ DMA
>   ef 10 02 00 00 00 00 08   1d+12:42:59.686  SET FEATURES [Reserved for Serial ATA]
>   ec 00 00 00 00 00 00 08   1d+12:42:59.686  IDENTIFY DEVICE
>   ef 03 46 00 00 00 00 08   1d+12:42:59.686  SET FEATURES [set transfer mode]
>
> Error 712 occurred at disk power-on lifetime: 12384 hours (516 days + 0 hours)
>   When the command that caused the error occurred, the device was active or idle.
>
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   40 51 00 69 9e 06 e0  Error: UNC at LBA = 0x00069e69 = 433769
>
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   c8 00 08 68 9e 06 00 08   1d+12:42:55.713  READ DMA
>   ef 10 02 00 00 00 00 08   1d+12:42:55.712  SET FEATURES [Reserved for Serial ATA]
>   ec 00 00 00 00 00 00 08   1d+12:42:55.712  IDENTIFY DEVICE
>   ef 03 46 00 00 00 00 08   1d+12:42:55.712  SET FEATURES [set transfer mode]
>
> Error 711 occurred at disk power-on lifetime: 12384 hours (516 days + 0 hours)
>   When the command that caused the error occurred, the device was active or idle.
>
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   40 51 00 69 9e 06 e0  Error: UNC at LBA = 0x00069e69 = 433769
>
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   c8 00 08 68 9e 06 00 08   1d+12:42:51.606  READ DMA
>   ef 10 02 00 00 00 00 08   1d+12:42:51.605  SET FEATURES [Reserved for Serial ATA]
>   ec 00 00 00 00 00 00 08   1d+12:42:51.605  IDENTIFY DEVICE
>   ef 03 46 00 00 00 00 08   1d+12:42:51.605  SET FEATURES [set transfer mode]
>
> Error 710 occurred at disk power-on lifetime: 12384 hours (516 days + 0 hours)
>   When the command that caused the error occurred, the device was active or idle.
>
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   40 51 00 69 9e 06 e0  Error: UNC at LBA = 0x00069e69 = 433769
>
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   c8 00 08 68 9e 06 00 08   1d+12:42:47.333  READ DMA
>   ef 10 02 00 00 00 00 08   1d+12:42:47.332  SET FEATURES [Reserved for Serial ATA]
>   ec 00 00 00 00 00 00 08   1d+12:42:47.332  IDENTIFY DEVICE
>   ef 03 46 00 00 00 00 08   1d+12:42:47.332  SET FEATURES [set transfer mode]
>
> Error 709 occurred at disk power-on lifetime: 12384 hours (516 days + 0 hours)
>   When the command that caused the error occurred, the device was active or idle.
>
>   After command completion occurred, registers were:
>   ER ST SC SN CL CH DH
>   -- -- -- -- -- -- --
>   40 51 00 69 9e 06 e0  Error: UNC at LBA = 0x00069e69 = 433769
>
>   Commands leading to the command that caused the error were:
>   CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
>   -- -- -- -- -- -- -- --  ----------------  --------------------
>   c8 00 08 68 9e 06 00 08   1d+12:42:43.359  READ DMA
>   ef 10 02 00 00 00 00 08   1d+12:42:43.358  SET FEATURES [Reserved for Serial ATA]
>   ec 00 00 00 00 00 00 08   1d+12:42:43.358  IDENTIFY DEVICE
>   ef 03 46 00 00 00 00 08   1d+12:42:43.358  SET FEATURES [set transfer mode]
============================================================================
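
The "UNC at LBA" value in each error entry can be reproduced from the register dump. As a small sketch (function names are made up here), the 28-bit LBA is assembled from the SN, CL and CH registers plus the low nibble of DH:

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Combine ATA task-file registers into a 28-bit LBA.

    sn, cl and ch carry LBA bits 0-7, 8-15 and 16-23; the low nibble
    of dh carries bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def parse_register_line(line):
    """Parse the 'ER ST SC SN CL CH DH' hex bytes of a smartctl error entry."""
    # Drop the leading "> " left over from the diff output, if present.
    fields = line.lstrip("> ").split()
    er, st, sc, sn, cl, ch, dh = (int(f, 16) for f in fields[:7])
    return lba28_from_registers(sn, cl, ch, dh)


if __name__ == "__main__":
    line = ">   40 51 00 69 9e 06 e0  Error: UNC at LBA = 0x00069e69 = 433769"
    print(parse_register_line(line))  # 433769, matching smartctl's own decode
```

All five logged errors decode to the same LBA, 433769, which is consistent with the drive retrying one un-readable sector rather than hitting five different ones.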



 

You need to run another preclear on this drive before you put data back on it and trust it.  The sector count should not be going up and should stabilize.

 

For new disks I always run 2 passes at least, with a third to be determined by the outcome of the first 2.


I agree with the previous post.

 

You have 73 sectors pending re-allocation.  That is not good.  The disk might be usable if a subsequent preclear_disk.sh pass results in no additional un-readable sectors.

 

To have the errors as currently reported, the sectors had to be un-readable in the post-read phase, since the "writing" of zeros to the entire drive should have performed any possible re-allocations of sectors un-readable in the pre-read phase.

 

I would perform another preclear_disk.sh pass on the disk.  Unless the sectors pending re-allocation go away (get re-allocated, or get re-written in place) and the number of sectors pending re-allocation drops to 0, I would not trust this drive.
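
If you want to check this between passes without eyeballing the reports, a rough Python sketch like the following pulls the raw values out of two saved `smartctl -a` outputs and compares them. The regular expression assumes the attribute table layout shown in this thread, and the function names are invented for illustration.

```python
import re

# One attribute row: ID, name, flag, value, worst, thresh, type, updated,
# when_failed, raw value. Leading "<" or ">" diff markers are tolerated.
ATTRIBUTE = re.compile(
    r"^[<> \t]*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}"
    r"\s+\d+\s+\d+\s+\d+\s+\S+\s+\S+\s+\S+\s+(\d+)",
    re.MULTILINE,
)


def raw_values(report):
    """Map attribute ID to raw value for every attribute row in a report."""
    return {int(m.group(1)): int(m.group(3)) for m in ATTRIBUTE.finditer(report)}


def compare_reports(before, after, watched=(5, 196, 197)):
    """Return {attribute ID: (raw before, raw after)} for the watched IDs."""
    old, new = raw_values(before), raw_values(after)
    return {attr: (old.get(attr), new.get(attr)) for attr in watched}


def pending_cleared(after):
    """True only if Current_Pending_Sector (197) is present and reads 0."""
    return raw_values(after).get(197) == 0


if __name__ == "__main__":
    before = (
        "196 Reallocated_Event_Count 0x0032   199   199   000    Old_age   Always       -       1\n"
        "197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3\n"
    )
    after = (
        "196 Reallocated_Event_Count 0x0032   198   198   000    Old_age   Always       -       2\n"
        "197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       73\n"
    )
    print(compare_reports(before, after))
    print(pending_cleared(after))
```

With the numbers from the pre-clear report above, attribute 197 goes from 3 to 73, so `pending_cleared` is false and another pass is called for.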

 

Joe L.


HI,

 

The data rebuild was OK, but when I ran a parity check I got new errors on the main page. Right now I'm copying all the data to my new disk; then I will take this disk out of the array and test it more.

 

EDIT: I'm going to run preclear_disk.sh a few times on this disk and see what happens.

 

My new SMART results look like this:

 

smartctl -a -d ata /dev/sdf
smartctl version 5.38 [i486-slackware-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD10EAVS-00D7B0
Serial Number:    WD-WCAU40258662
Firmware Version: 01.01A01
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Jul 28 15:32:11 2010 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (23400) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
 3 Spin_Up_Time            0x0027   152   144   021    Pre-fail  Always       -       7375
 4 Start_Stop_Count        0x0032   097   097   000    Old_age   Always       -       3941
 5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x002e   200   200   051    Old_age   Always       -       0
 9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       12399
10 Spin_Retry_Count        0x0032   100   100   051    Old_age   Always       -       0
11 Calibration_Retry_Count 0x0032   100   100   051    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       826
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       20
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3940
194 Temperature_Celsius     0x0022   124   102   000    Old_age   Always       -       26
196 Reallocated_Event_Count 0x0032   198   198   000    Old_age   Always       -       2
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       10
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       20%     12329         4921505

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


What is the best way to remove a disk from the array?

Please clarify your question.

 

Do you wish to remove a drive and not install a replacement?

 

Do you wish to remove a drive temporarily, so you can run a preclear on it, and then re-install it?

 

Do you wish to remove it and replace the drive with a different drive?


I was down in the basement, and I heard some "clicking", so it looks like the disk is not OK!

 

Joe: I want to take the disk out of the array and not replace it. If it tests good, I will need to figure out whether I want it back in the array or not (I would then add it as a "new one"). Right now I'm copying the data to my new disk.

 

EDIT

 

Joe asked: Do you wish to remove a drive and not install a replacement? YES


Ok,

 

That is easy.

 

After you finish copying the data to a new disk:

 

1. stop the array

2. un-assign the drive you wish to remove

3. Then, do one of the two following, depending on your version of unRAID.

    3a. if on a version  of unRAID with a "restore" button on the main page, press it after checking the checkbox under it

    3b. if on one of the recent versions of unRAID where the "restore" button has been replaced by a command line equivalent, 

    log in via telnet or on the system console and type:

    initconfig

4. Press "refresh" on your web-browser, all disks should show as "blue"

5. Press "Start"  (A new initial parity calculation will begin.  You'll be without parity protection until it is complete)

 

Joe L.

 


Thanks Joe,

 

I have the 4.5.6 version, and I didn't  know about the new command initconfig  ;)

 

I hope I can copy all my data (so far so good). There is some "funny" noise from the disk  :-\ and I don't know if I can trust my parity. If I could, I would take my new disk and use it to replace the bad disk instead.


Too many users of unRAID mistakenly pressed the button labeled as "restore" thinking it would restore their data. 

It was actually an "Initialize Disk Configuration and Immediately Invalidate Parity" button.  Parity is invalidated since it was based on the prior disk configuration.  It does not affect data on existing data disks, but pressing the "Restore button" at the wrong time (when a disk has failed) would eliminate any ability to re-construct a failed drive.

 

For that reason the button was removed and, for now, replaced by the "initconfig" command on the command line.  It might be put back on the web-interface at some time in the future, but if it is, I hope it is re-named "Initialize Disk Configuration" to confuse fewer users.

 

Joe L.
