
2 drives failed - second one during rebuild of first replacement....


davekeel


unRaid Server Pro 5.0.5

Intel Atom CPU 1.66GHz

9 disks in array (various 3TB and 2TB, plus cache and 3TB Parity)

 

Really could use some tech help/advice… I’ve had an unRaid problem that has totally confused me and need some help on my next steps!

 

Last Monday I awoke to a message letting me know that Disk8 in my array had failed. I removed it and, since it was a 2TB, I took the opportunity to replace it with a 3TB.

 

Started up my unRaid and commenced a rebuild, as per the Wiki instructions:

Stop the array.

Power down the unit.

Replace smaller disk with new bigger disk.

Power up the unit.

Start the array.

 

- this seemed to progress OK at first, but then disk 4 started showing errors during the process. It completed the rebuild of the array parity OK but then Disk 4 was marked as faulty showing millions of read errors. So - since this was a 2TB I ordered another 3TB to replace it.

 

I replaced the second faulty drive and carried out the same instructions as above. I got a message that the drive was formatting and then several hours later the rebuild was complete BUT this new replacement 3TB drive was flagged as unformatted???!!! Without thinking it through sufficiently I clicked on the format option :-( and now I have a full extra 3TB spare BUT have obviously (?) lost all of the data that was on the second faulty almost full 2TB Disk 4.

 

So, my questions are these:

1) Have I lost all of the data that was on the second faulty replaced drive?

2) If so is there anything I can do to replace it?

3) If there is nothing that I can do to replace it, is there anything that I can do to work out what files are lost/incomplete?

 

Thanks in advance - really hope that somebody can help/advise… and stop me worrying/losing sleep!

Thanks

 

Link to comment

It is very likely that the 'faulty' disk 4 is actually OK, and that the error was a transient one (such as a loose cable).  Whatever you do keep that disk intact for the time being as it is likely that the data on it can be recovered.

 

I think that since you were getting read errors on disk4 there is a good chance that the rebuild of disk8 was not actually good, as a rebuild requires all other data disks plus parity to be error free. If disk8 was not rebuilt correctly, then the subsequent attempt to rebuild disk4 would not have restored the correct contents, hence the 'unformatted disk' being reported. Do you still have the 'faulty' disk8 intact, so that you can check whether it is really faulty in case the rebuilt disk8 has invalid data?
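To illustrate why read errors on another disk can spoil a rebuild, here is a toy sketch. It assumes single parity computed as the XOR of all data disks, which is the usual single-parity scheme; the byte values and the xor_all helper are made up for illustration and are not taken from unRaid itself.

```shell
#!/bin/sh
# Toy model: each "disk" holds a single byte (decimal values, invented).
# Parity is the XOR of every data disk.

# XOR all arguments together and print the result.
xor_all() {
    acc=0
    for v in "$@"; do
        acc=$(( acc ^ v ))
    done
    echo "$acc"
}

d1=90
d2=60
d3=240
parity=$(xor_all "$d1" "$d2" "$d3")

# Rebuild d3 from parity plus the surviving disks, all read cleanly.
rebuilt_ok=$(xor_all "$parity" "$d1" "$d2")

# Same rebuild, but d2 hands back one wrong byte because of a read error.
bad_d2=61
rebuilt_bad=$(xor_all "$parity" "$d1" "$bad_d2")

echo "original d3:           $d3"
echo "rebuilt (clean reads): $rebuilt_ok"
echo "rebuilt (d2 misread):  $rebuilt_bad"
```

With clean reads the rebuilt value matches the original (240); with a single misread byte on another disk it comes out as 241. A later rebuild of a different disk that relies on that bad data inherits the error in the same way.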

 

Do you have SMART reports for the 'faulty' disks?  They would help with determining if the disks really are faulty or the failures were caused by external factors.

Link to comment

Hi - Wow - that was quick ... and super helpful too! I did have to do a bit of rough handling to get the replacement disk8 physically installed. Thank You. Yes - I have the "faulty" Disk 4 intact .. have not done anything with it! I can plug it into my Mac (unless you advise otherwise of course)... or even plug it into a spare caddy on the unRaid tower if you think that would be advisable?! Then, I guess, if it is OK, attempt a further rebuild of disk8 - I might need advice on how to do that.

Sorry, I don't have SMART reports - but if I plug it into a spare caddy then perhaps I could generate one?!

Thanks again - any advice on next steps really appreciated!

 

Link to comment

I did have to do a bit of rough handling to get the replacement disk8 physically installed.

 

This is an all too common scenario. You are replacing a failed or failing disk with a new one, and in the process you knock something loose, then another drive fails during the rebuild, and suddenly you are on the verge of data loss.

 

I concur with itimpi that you likely have problems with the rebuilt disk8 as well as disk4, and that your best chance of full recovery of both is with the original 2TB drives.

 

Installing them as non-array devices in your unRaid server is the best path forward. You need to get the SMART reports.

 

BTW, to avoid this in the future, I strongly recommend using drive cages (4in3s or 5in3s) for your drives. With these devices it is very easy to swap disks out, with near zero chance of knocking a cable loose as you extract the old one and install the new one. They are mandatory equipment as far as I am concerned.

Link to comment

Hi - and thanks!

OK - just so I'm clear (before I get myself into even deeper trouble).. can you confirm that I:

a) spin down/shut down

b) replace 2 new 3TB drives with the 2 old 2TB "faulty" drives (one of which may not be faulty)

c) start up

d) first faulty drive (Disk 4) should show as faulty/second "faulty" drive should (fingers crossed) show as OK.

e) do some SMART checking - can you advise re doing this please?!

f) spin down/shut down

g) replace first faulty drive (Disk 4) with one of the new 3TB ones

h) start up and rebuild replacement for faulty drive (Disk 4)

It kind of seems to make sense to me - except that parity data is probably screwed by now due to the drives that I have removed/replaced?!

Thanks in anticipation of response!

 

Link to comment

No, that won't work. If you replace the new drives with the old ones it is just going to see the old drives as "wrong disk" and it won't let you start. You'll have to use them non-array as bjp999 said. Do you have unMenu?
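
For reference, a rough sketch of what using an old disk "non-array" could look like from the console. This is only a dry run that prints commands rather than running them. The device name /dev/sdX, the mount point /mnt/old_disk and the ro_mount_plan helper are placeholders, and ReiserFS on partition 1 is an assumption based on how unRaid 5 normally formats data disks; check the real device letter before running anything.

```shell
#!/bin/sh
# Print (do not run) the commands for mounting an old data disk read-only
# outside the array. DEVICE and MOUNTPOINT are placeholders.
# Usage: ro_mount_plan DEVICE MOUNTPOINT
ro_mount_plan() {
    dev=$1
    mnt=$2
    echo "mkdir -p $mnt"
    # -o ro: read-only, so files cannot be changed through this mount.
    # -t reiserfs, partition 1: assumed layout of an unRaid 5 data disk.
    echo "mount -o ro -t reiserfs ${dev}1 $mnt"
    echo "ls $mnt"
    echo "umount $mnt"
}

ro_mount_plan /dev/sdX /mnt/old_disk
```

If the mount succeeds and the directory listing looks sane, the files can then be copied from the mount point onto the array before unmounting.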
Link to comment

OK - anyone got any more advice? I have both the "faulty" drives, and if I temporarily remove the cache drive then I could create two spare bays to install them into. What now? Unfortunately I have no idea where to go from here to get back to where I was/need to be. Any advice/instructions really appreciated.

Thanks

Link to comment

OK - bjp999 or trurl (or anyone else who can help - please!!!) - I tried plugging each of the suspected faulty drives into the spare caddy slot in my server and rebooting… each time I could not get unRaid to boot. No idea why, but the PC board would start up and unRaid did not start… just a blinking cursor. The BIOS was probably trying to boot from the added drive?!?!

 

So - I shut down, removed the cache drive and replaced it with the faulty drives one at a time, and produced the SMART reports below (first using smartctl -d ata -tshort /dev/sd* and then using smartctl -a -A /dev/sd*). Does this tell you what you need to know in order to help me get back to where I was and recover any lost data?! The first drive that failed is showing a lot of extra messages. The second suspected faulty drive appears, as you suggested, perhaps not to be faulty according to the report/lack of messages as shown below?! So - if that is still intact and unaltered, does it help you to advise re next steps?

 

Smart report for the first “faulty” drive = disk4 (now replaced)

 

root@Tower:~# smartctl -a -A /dev/sdd

smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)

Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

 

=== START OF INFORMATION SECTION ===

Model Family:    Hitachi Deskstar 7K3000

Device Model:    Hitachi HDS723020BLA642

Serial Number:    MN1220F306HKVD

LU WWN Device Id: 5 000cca 369c2f4b2

Firmware Version: MN6OA180

User Capacity:    2,000,398,934,016 bytes [2.00 TB]

Sector Size:      512 bytes logical/physical

Rotation Rate:    7200 rpm

Device is:        In smartctl database [for details use: -P show]

ATA Version is:  ATA8-ACS T13/1699-D revision 4

SATA Version is:  SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)

Local Time is:    Fri Sep 11 17:02:56 2015 BST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x80) Offline data collection activity

was never started.

Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (19665) seconds.

Offline data collection

capabilities: (0x5b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

No Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  1) minutes.

Extended self-test routine

recommended polling time: ( 328) minutes.

SCT capabilities:       (0x003d) SCT Status supported.

SCT Error Recovery Control supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   070   070   016    Pre-fail  Always       -       4854434
  2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       83
  3 Spin_Up_Time            0x0007   135   135   024    Pre-fail  Always       -       420 (Average 430)
  4 Start_Stop_Count        0x0012   099   099   000    Old_age   Always       -       5911
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       83
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   130   130   020    Pre-fail  Offline      -       28
  9 Power_On_Hours          0x0012   095   095   000    Old_age   Always       -       38083
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       150
192 Power-Off_Retract_Count 0x0032   095   095   000    Old_age   Always       -       6109
193 Load_Cycle_Count        0x0012   095   095   000    Old_age   Always       -       6109
194 Temperature_Celsius     0x0002   200   200   000    Old_age   Always       -       30 (Min/Max 10/54)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       100
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       242

 

SMART Error Log Version: 1

ATA Error Count: 242 (device log contains only the most recent five errors)

CR = Command Register [HEX]

FR = Features Register [HEX]

SC = Sector Count Register [HEX]

SN = Sector Number Register [HEX]

CL = Cylinder Low Register [HEX]

CH = Cylinder High Register [HEX]

DH = Device/Head Register [HEX]

DC = Device Command Register [HEX]

ER = Error register [HEX]

ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

 

Error 242 occurred at disk power-on lifetime: 15673 hours (653 days + 1 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 01 d7 00 01 00  Error: ICRC, ABRT at LBA = 0x000100d7 = 65751

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  ca 00 08 d0 00 01 e0 08      21:22:16.323  WRITE DMA

  27 00 00 00 00 00 e0 08      21:22:16.306  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

  ec 00 00 00 00 00 a0 08      21:22:16.284  IDENTIFY DEVICE

  ef 03 42 00 00 00 a0 08      21:22:16.266  SET FEATURES [set transfer mode]

  27 00 00 00 00 00 e0 08      21:22:16.244  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

 

Error 241 occurred at disk power-on lifetime: 15673 hours (653 days + 1 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 01 d7 00 01 00  Error: ICRC, ABRT at LBA = 0x000100d7 = 65751

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  ca 00 08 d0 00 01 e0 08      21:22:15.707  WRITE DMA

  27 00 00 00 00 00 e0 08      21:22:15.668  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

  ec 00 00 00 00 00 a0 08      21:22:15.650  IDENTIFY DEVICE

  ef 03 42 00 00 00 a0 08      21:22:15.628  SET FEATURES [set transfer mode]

  27 00 00 00 00 00 e0 08      21:22:15.610  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

 

Error 240 occurred at disk power-on lifetime: 15673 hours (653 days + 1 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 01 c7 00 00 00  Error: ICRC, ABRT at LBA = 0x000000c7 = 199

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  ca 00 08 c0 00 00 e0 08      21:22:15.071  WRITE DMA

  27 00 00 00 00 00 e0 08      21:22:15.054  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

  ec 00 00 00 00 00 a0 08      21:22:15.031  IDENTIFY DEVICE

  ef 03 42 00 00 00 a0 08      21:22:15.014  SET FEATURES [set transfer mode]

  27 00 00 00 00 00 e0 08      21:22:14.992  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

 

Error 239 occurred at disk power-on lifetime: 15673 hours (653 days + 1 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 01 c7 00 00 00  Error: ICRC, ABRT at LBA = 0x000000c7 = 199

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  ca 00 08 c0 00 00 e0 08      21:22:14.455  WRITE DMA

  27 00 00 00 00 00 e0 08      21:22:14.437  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

  ec 00 00 00 00 00 a0 08      21:22:14.415  IDENTIFY DEVICE

  ef 03 42 00 00 00 a0 08      21:22:14.398  SET FEATURES [set transfer mode]

  27 00 00 00 00 00 e0 08      21:22:14.376  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

 

Error 238 occurred at disk power-on lifetime: 15673 hours (653 days + 1 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  84 51 01 c7 00 00 00  Error: ICRC, ABRT at LBA = 0x000000c7 = 199

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  ca 00 08 c0 00 00 e0 08      21:22:13.839  WRITE DMA

  27 00 00 00 00 00 e0 08      21:22:13.821  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

  ec 00 00 00 00 00 a0 08      21:22:13.799  IDENTIFY DEVICE

  ef 03 42 00 00 00 a0 08      21:22:13.781  SET FEATURES [set transfer mode]

  27 00 00 00 00 00 e0 08      21:22:13.760  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed without error      00%    38083        -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

Smart report for second “Faulty” drive = disk8 (now replaced)

 

root@Tower:~# smartctl -a -A /dev/sdd

smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)

Copyright © 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

 

=== START OF INFORMATION SECTION ===

Model Family:    Western Digital Caviar Green (AF, SATA 6Gb/s)

Device Model:    WDC WD20EARX-008FB0

Serial Number:    WD-WMAZA8037952

LU WWN Device Id: 5 0014ee 2b18edeb3

Firmware Version: 51.0AB51

User Capacity:    2,000,398,934,016 bytes [2.00 TB]

Sector Sizes:    512 bytes logical, 4096 bytes physical

Device is:        In smartctl database [for details use: -P show]

ATA Version is:  ATA8-ACS (minor revision not indicated)

SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)

Local Time is:    Fri Sep 11 17:26:28 2015 BST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status:      (  0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (36180) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: (  2) minutes.

Extended self-test routine

recommended polling time: ( 389) minutes.

Conveyance self-test routine

recommended polling time: (  5) minutes.

SCT capabilities:       (0x30b5) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   199   051    Pre-fail  Always       -       868
  3 Spin_Up_Time            0x0027   187   185   021    Pre-fail  Always       -       5633
  4 Start_Stop_Count        0x0032   095   095   000    Old_age   Always       -       5033
  5 Reallocated_Sector_Ct   0x0033   195   195   140    Pre-fail  Always       -       236
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       27655
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       106
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       32
193 Load_Cycle_Count        0x0032   165   165   000    Old_age   Always       -       105228
194 Temperature_Celsius     0x0022   124   110   000    Old_age   Always       -       26
196 Reallocated_Event_Count 0x0032   001   001   000    Old_age   Always       -       219
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       55

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Short offline      Completed without error      00%    27654        -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.
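
Since the two reports above are long, here is a small sketch for pulling out the handful of attributes that usually get looked at first. It assumes each report has been saved to a text file; smart_summary is a made-up helper name, and the choice of attribute IDs (5, 196, 197, 198, 199) is just a common starting point, not an official list.

```shell
#!/bin/sh
# Print the raw values of a few SMART attributes from a saved
# `smartctl -a` or `smartctl -A` report.
# Usage: smart_summary REPORT_FILE
smart_summary() {
    awk '
        # Attribute rows start with a numeric ID, have a 0x.... flag in
        # field 3, and carry the raw value starting at field 10.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && NF >= 10 {
            if ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198 || $1 == 199)
                printf "%-24s %s\n", $2, $10
        }
    ' "$1"
}
```

For example, after `smartctl -a /dev/sdd > /boot/hitachi.txt` (the file name is only an example), `smart_summary /boot/hitachi.txt` lists the reallocated, pending, uncorrectable and CRC counts on five short lines. In the reports above the Hitachi shows 83 reallocated sectors and 242 UDMA CRC errors, while the WD shows 236 reallocated sectors and 0 CRC errors. CRC errors generally point at cabling or connections rather than the disk surface.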

 

Finally if you are unable to help or point me at a help page that is relevant to getting myself back to where I was then is anyone able to give me a terminal command to, say, search through all my mounted drives for .pdf or .mkv etc files - this way I can find out what files appear to be complete and consequently then attempt to work out what is missing/lost?!
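
A possible starting point for that kind of search, as a sketch only. The helper names list_by_ext and missing_from are invented here, and the /mnt/disk* paths in the usage note assume the usual unRaid layout of one mount point per data disk. Note that a listing only shows that a file exists, not that its contents are intact.

```shell
#!/bin/sh
# List files with the given extension (case-insensitive) under one or more
# directories, one path per line, sorted.
# Usage: list_by_ext EXT DIR [DIR...]
list_by_ext() {
    ext=$1
    shift
    find "$@" -type f -iname "*.$ext" 2>/dev/null | LC_ALL=C sort
}

# Print relative paths of files that exist under OLD_DIR but not NEW_DIR.
# Usage: missing_from OLD_DIR NEW_DIR
missing_from() {
    old=$1
    new=$2
    list_a=$(mktemp)
    list_b=$(mktemp)
    (cd "$old" && find . -type f | LC_ALL=C sort) > "$list_a"
    (cd "$new" && find . -type f | LC_ALL=C sort) > "$list_b"
    # comm -23 keeps only the lines unique to the first list.
    LC_ALL=C comm -23 "$list_a" "$list_b"
    rm -f "$list_a" "$list_b"
}
```

For instance, `list_by_ext mkv /mnt/disk*` would list every .mkv on the data disks. If an old disk can be mounted somewhere (say /mnt/old_disk, a placeholder), `missing_from /mnt/old_disk /mnt/disk4` would show which of its files are absent from the replacement disk.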

 

As ever - thanks in anticipation of absolutely any forthcoming help or advice. Getting desperate! :'( :'( :'( :'(

 

 

 

 

 

Link to comment

Archived

This topic is now archived and is closed to further replies.
