2 drives failed - second one during rebuild of first replacement.... - General Support (V5 and Older)

September 6, 2015

unRaid Server Pro 5.0.5

Intel Atom CPU 1.66GHz

9 disks in array (various 3TB and 2TB, plus cache and 3TB Parity)

Really could use some tech help/advice… I’ve had an unRaid problem that has totally confused me and need some help on my next steps!

Last Monday I awoke to a message letting me know that Disk8 in my array had failed. I removed it and, since it was a 2TB, I took the opportunity to replace it with a 3TB.

Started up my unRaid and commenced a rebuild, as per the Wiki instructions:

Stop the array.

Power down the unit.

Replace smaller disk with new bigger disk.

Power up the unit.

Start the array.

- this seemed to progress OK at first, but then disk 4 started showing errors during the process. It completed the rebuild of the array parity OK but then Disk 4 was marked as faulty showing millions of read errors. So - since this was a 2TB I ordered another 3TB to replace it.

I replaced the second faulty drive and carried out the same instructions as above. I got a message that the drive was formatting and then several hours later the rebuild was complete BUT this new replacement 3TB drive was flagged as unformatted???!!! Without thinking it through sufficiently I clicked on the format option :-( and now I have a full extra 3TB spare BUT have obviously (?) lost all of the data that was on the second faulty almost full 2TB Disk 4.

So - questions are this ….

1) Have I lost all of the data that was on the second faulty replaced drive?

2) If so is there anything I can do to replace it?

3) If there is nothing that I can do to replace it, is there anything that I can do to work out what files are lost/incomplete?

Thanks in advance - really hope that somebody can help/advise…. stop me worrying/losing sleep!

Thanks

September 6, 2015

It is very likely that the 'faulty' disk 4 is actually OK, and that the error was a transient one (such as a loose cable). Whatever you do keep that disk intact for the time being as it is likely that the data on it can be recovered.

I think that since you were getting read errors on disk4 there is a good chance that the rebuild of disk8 was not actually good, as a rebuild requires all other data disks plus parity to be error free. If disk8 was not rebuilt correctly, then the subsequent attempt to rebuild disk4 would not have restored the correct contents - hence the 'unformatted disk' being reported. Do you still have the 'faulty' disk8 intact, so that you can check whether it is really faulty in case the rebuilt disk8 has invalid data?
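To illustrate why one misread disk spoils a rebuild, here is a minimal Python sketch. It assumes plain single-parity XOR across the data disks, which is how unRAID 5 parity is generally described. The disk names, the 4-byte "stripes" and the helper functions are made up for the example and are not unRAID code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical contents of one stripe on three data disks.
disks = {
    "disk1": b"\x10\x20\x30\x40",
    "disk4": b"\x0a\x0b\x0c\x0d",
    "disk8": b"\xde\xad\xbe\xef",
}

# Parity is the XOR of every data disk.
parity = xor_blocks(list(disks.values()))

def rebuild(missing, disks_as_read, parity):
    """Recompute the missing disk from parity plus every other disk as it was read."""
    others = [data for name, data in disks_as_read.items() if name != missing]
    return xor_blocks([parity] + others)

# Clean rebuild: every surviving disk reads back correctly.
rebuilt_ok = rebuild("disk8", disks, parity)
print(rebuilt_ok == disks["disk8"])   # True

# Same rebuild, but disk4 returns one wrong byte (simulated read error).
damaged = dict(disks)
damaged["disk4"] = b"\x0a\xff\x0c\x0d"
rebuilt_bad = rebuild("disk8", damaged, parity)
print(rebuilt_bad == disks["disk8"])  # False
```

In this sketch the wrong byte read from disk4 ends up at the same position on the rebuilt disk8, which is the mechanism by which a flaky disk4 could have left the rebuilt disk8 with invalid data.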

Do you have SMART reports for the 'faulty' disks? They would help with determining if the disks really are faulty or the failures were caused by external factors.

September 6, 2015

Author

Hi - Wow - that was quick ... and super helpful too! I did have to do a bit of rough handling to get the replacement disk8 physically installed. Thank You. Yes - I have the "faulty" Disk 4 intact .. have not done anything with it! Can plug into my mac (unless you advise otherwise of course)... or even plug into a spare caddy on the unRaid tower if you think that this would be advisable?! (Then - I guess - if it is OK - attempt a further rebuild of disk8 - might need advice on how to do that.)

Sorry, don't have SMART reports - but if I plug into spare caddy then perhaps I could generate one?!

Thanks again - any advice on next steps really appreciated!

September 6, 2015

Author

Just re-read your response and realised that I never confirmed that I also have the faulty disk 8 intact too - although I am pretty confident that it is faulty/dead

September 6, 2015

I did have to do a bit of rough handling to get the replacement disk8 physically installed.

This is an all too common scenario. You are replacing a failed or failing disk with a new one, and in the process knock something loose, and then another drive fails during the rebuild, and suddenly you are on the verge of data loss.

I concur with itimpi that you likely have problems with the rebuilt disk8 as well as disk4, and that your best chance of full recovery of both is with the original 2TB drives.

Installing them as non-array devices in your unRaid server is the best path forward. You need to get the SMART reports.

BTW, to avoid this in the future, I strongly recommend using drive cages (4in3s or 5in3s) for your drives. With these devices it is very easy to swap disks out, with near zero chance of knocking a cable loose as you extract the old one and install the new one. They are mandatory equipment as far as I am concerned.

September 6, 2015

Author

Hi - and thanks!

OK - just so I'm clear (before I get myself into even deeper trouble).. can you confirm that I:

a) spin down/shut down

b) replace the 2 new 3TB drives with the 2 old 2TB "faulty" drives (one of which may not be faulty)

c) start up

d) first faulty drive (Disk 4) should show as faulty; second "faulty" drive should (fingers crossed) show as OK

e) do some SMART checking - can you advise re doing this please?!

f) spin down/shut down

g) replace first faulty drive (Disk 4) with one of the new 3TB ones

h) start up and rebuild replacement for faulty drive (Disk 4)

It kind of seems to make sense to me - except that parity data is probably screwed by now due to the drives that I have removed/replaced?!

Thanks in anticipation of response!

September 6, 2015

No, that won't work. If you replace the new drives with the old ones it is just going to see the old drives as "wrong disk" and it won't let you start. You'll have to use them non-array as bjp999 said. Do you have unMenu?

September 6, 2015

Author

Yes - unMenu installed and working!

September 6, 2015

Do you have extra slots/ports you can put the old drives in?

September 6, 2015

Author

I have one spare slot that is empty and I could create another by temp. removing the cache drive I guess?!

September 8, 2015

Author

OK - anyone got any more advice? I have both the "faulty" drives and if I temp remove the cache drive then I could create two spare bays to install them into. What now? Unfortunately I have no idea where to go from here re getting me back to where I was/need to be. Any advice/instructions really appreciated.

Thanks

September 11, 2015

Author

OK - bjp999 or trurl (or anyone else who can help - please!!!) - I tried plugging each of the suspected faulty drives into the spare caddy slot in my server and rebooting … each time I could not get unRaid to boot - no idea why, but the PC board would start up and unRaid did not start… just a blinking cursor. BIOS probably trying to boot from it?!?!

So - I shut down, removed the cache drive, replaced it with the faulty drives one at a time, and produced the SMART reports below (first using smartctl -d ata -tshort /dev/sd* and then using smartctl -a -A /dev/sd*). Does this tell you what you need to know in order to help me get back to where I was and recover any lost data?! The first drive that failed is showing a lot of extra messages. The second suspected faulty drive appears, as you suggested, perhaps not to be faulty, according to the report/lack of messages shown below?! So - if that is still intact and unaltered, does it help you to advise re next steps?

SMART report for the first “faulty” drive = disk4 (now replaced)

root@Tower:~# smartctl -a -A /dev/sdd

smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)

=== START OF INFORMATION SECTION ===

Model Family: Hitachi Deskstar 7K3000

Device Model: Hitachi HDS723020BLA642

Serial Number: MN1220F306HKVD

LU WWN Device Id: 5 000cca 369c2f4b2

Firmware Version: MN6OA180

User Capacity: 2,000,398,934,016 bytes [2.00 TB]

Sector Size: 512 bytes logical/physical

Rotation Rate: 7200 rpm

Device is: In smartctl database [for details use: -P show]

ATA Version is: ATA8-ACS T13/1699-D revision 4

SATA Version is: SATA 2.6, 6.0 Gb/s (current: 3.0 Gb/s)

Local Time is: Fri Sep 11 17:02:56 2015 BST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status: (0x80) Offline data collection activity

was never started.

Auto Offline Data Collection: Enabled.

Self-test execution status: ( 0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (19665) seconds.

Offline data collection

capabilities: (0x5b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

No Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: ( 1) minutes.

Extended self-test routine

recommended polling time: ( 328) minutes.

SCT capabilities: (0x003d) SCT Status supported.

SCT Error Recovery Control supported.

SCT Feature Control supported.

SCT Data Table supported.

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b 070   070   016    Pre-fail Always  -           4854434
  2 Throughput_Performance  0x0005 136   136   054    Pre-fail Offline -           83
  3 Spin_Up_Time            0x0007 135   135   024    Pre-fail Always  -           420 (Average 430)
  4 Start_Stop_Count        0x0012 099   099   000    Old_age  Always  -           5911
  5 Reallocated_Sector_Ct   0x0033 100   100   005    Pre-fail Always  -           83
  7 Seek_Error_Rate         0x000b 100   100   067    Pre-fail Always  -           0
  8 Seek_Time_Performance   0x0005 130   130   020    Pre-fail Offline -           28
  9 Power_On_Hours          0x0012 095   095   000    Old_age  Always  -           38083
 10 Spin_Retry_Count        0x0013 100   100   060    Pre-fail Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           150
192 Power-Off_Retract_Count 0x0032 095   095   000    Old_age  Always  -           6109
193 Load_Cycle_Count        0x0012 095   095   000    Old_age  Always  -           6109
194 Temperature_Celsius     0x0002 200   200   000    Old_age  Always  -           30 (Min/Max 10/54)
196 Reallocated_Event_Count 0x0032 100   100   000    Old_age  Always  -           100
197 Current_Pending_Sector  0x0022 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0008 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x000a 200   200   000    Old_age  Always  -           242

SMART Error Log Version: 1

ATA Error Count: 242 (device log contains only the most recent five errors)

CR = Command Register [HEX]

FR = Features Register [HEX]

SC = Sector Count Register [HEX]

SN = Sector Number Register [HEX]

CL = Cylinder Low Register [HEX]

CH = Cylinder High Register [HEX]

DH = Device/Head Register [HEX]

DC = Device Command Register [HEX]

ER = Error register [HEX]

ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 242 occurred at disk power-on lifetime: 15673 hours (653 days + 1 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

84 51 01 d7 00 01 00 Error: ICRC, ABRT at LBA = 0x000100d7 = 65751

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

ca 00 08 d0 00 01 e0 08 21:22:16.323 WRITE DMA

27 00 00 00 00 00 e0 08 21:22:16.306 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

ec 00 00 00 00 00 a0 08 21:22:16.284 IDENTIFY DEVICE

ef 03 42 00 00 00 a0 08 21:22:16.266 SET FEATURES [set transfer mode]

27 00 00 00 00 00 e0 08 21:22:16.244 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error 241 occurred at disk power-on lifetime: 15673 hours (653 days + 1 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

84 51 01 d7 00 01 00 Error: ICRC, ABRT at LBA = 0x000100d7 = 65751

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

ca 00 08 d0 00 01 e0 08 21:22:15.707 WRITE DMA

27 00 00 00 00 00 e0 08 21:22:15.668 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

ec 00 00 00 00 00 a0 08 21:22:15.650 IDENTIFY DEVICE

ef 03 42 00 00 00 a0 08 21:22:15.628 SET FEATURES [set transfer mode]

27 00 00 00 00 00 e0 08 21:22:15.610 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error 240 occurred at disk power-on lifetime: 15673 hours (653 days + 1 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

84 51 01 c7 00 00 00 Error: ICRC, ABRT at LBA = 0x000000c7 = 199

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

ca 00 08 c0 00 00 e0 08 21:22:15.071 WRITE DMA

27 00 00 00 00 00 e0 08 21:22:15.054 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

ec 00 00 00 00 00 a0 08 21:22:15.031 IDENTIFY DEVICE

ef 03 42 00 00 00 a0 08 21:22:15.014 SET FEATURES [set transfer mode]

27 00 00 00 00 00 e0 08 21:22:14.992 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error 239 occurred at disk power-on lifetime: 15673 hours (653 days + 1 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

84 51 01 c7 00 00 00 Error: ICRC, ABRT at LBA = 0x000000c7 = 199

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

ca 00 08 c0 00 00 e0 08 21:22:14.455 WRITE DMA

27 00 00 00 00 00 e0 08 21:22:14.437 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

ec 00 00 00 00 00 a0 08 21:22:14.415 IDENTIFY DEVICE

ef 03 42 00 00 00 a0 08 21:22:14.398 SET FEATURES [set transfer mode]

27 00 00 00 00 00 e0 08 21:22:14.376 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

Error 238 occurred at disk power-on lifetime: 15673 hours (653 days + 1 hours)

When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:

ER ST SC SN CL CH DH

-- -- -- -- -- -- --

84 51 01 c7 00 00 00 Error: ICRC, ABRT at LBA = 0x000000c7 = 199

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

-- -- -- -- -- -- -- -- ---------------- --------------------

ca 00 08 c0 00 00 e0 08 21:22:13.839 WRITE DMA

27 00 00 00 00 00 e0 08 21:22:13.821 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

ec 00 00 00 00 00 a0 08 21:22:13.799 IDENTIFY DEVICE

ef 03 42 00 00 00 a0 08 21:22:13.781 SET FEATURES [set transfer mode]

27 00 00 00 00 00 e0 08 21:22:13.760 READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

SMART Self-test log structure revision number 1

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

# 1 Short offline Completed without error 00% 38083 -

SMART Selective self-test log data structure revision number 1

SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS

1 0 0 Not_testing

2 0 0 Not_testing

3 0 0 Not_testing

4 0 0 Not_testing

5 0 0 Not_testing

Selective self-test flags (0x0):

After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

SMART report for the second “faulty” drive = disk8 (now replaced)

root@Tower:~# smartctl -a -A /dev/sdd

smartctl 6.2 2013-07-26 r3841 [i686-linux-3.9.11p-unRAID] (local build)

=== START OF INFORMATION SECTION ===

Model Family: Western Digital Caviar Green (AF, SATA 6Gb/s)

Device Model: WDC WD20EARX-008FB0

Serial Number: WD-WMAZA8037952

LU WWN Device Id: 5 0014ee 2b18edeb3

Firmware Version: 51.0AB51

User Capacity: 2,000,398,934,016 bytes [2.00 TB]

Sector Sizes: 512 bytes logical, 4096 bytes physical

Device is: In smartctl database [for details use: -P show]

ATA Version is: ATA8-ACS (minor revision not indicated)

SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)

Local Time is: Fri Sep 11 17:26:28 2015 BST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status: (0x82) Offline data collection activity

was completed without error.

Auto Offline Data Collection: Enabled.

Self-test execution status: ( 0) The previous self-test routine completed

without error or no self-test has ever

been run.

Total time to complete Offline

data collection: (36180) seconds.

Offline data collection

capabilities: (0x7b) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: ( 2) minutes.

Extended self-test routine

recommended polling time: ( 389) minutes.

Conveyance self-test routine

recommended polling time: ( 5) minutes.

SCT capabilities: (0x30b5) SCT Status supported.

SCT Feature Control supported.

SCT Data Table supported.

SMART Attributes Data Structure revision number: 16

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   199   051    Pre-fail Always  -           868
  3 Spin_Up_Time            0x0027 187   185   021    Pre-fail Always  -           5633
  4 Start_Stop_Count        0x0032 095   095   000    Old_age  Always  -           5033
  5 Reallocated_Sector_Ct   0x0033 195   195   140    Pre-fail Always  -           236
  7 Seek_Error_Rate         0x002e 200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 063   063   000    Old_age  Always  -           27655
 10 Spin_Retry_Count        0x0032 100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   100   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           106
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           32
193 Load_Cycle_Count        0x0032 165   165   000    Old_age  Always  -           105228
194 Temperature_Celsius     0x0022 124   110   000    Old_age  Always  -           26
196 Reallocated_Event_Count 0x0032 001   001   000    Old_age  Always  -           219
197 Current_Pending_Sector  0x0032 200   200   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030 200   200   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 200   200   000    Old_age  Offline -           55

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

# 1 Short offline Completed without error 00% 27654 -

SMART Selective self-test log data structure revision number 1

SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS

1 0 0 Not_testing

2 0 0 Not_testing

3 0 0 Not_testing

4 0 0 Not_testing

5 0 0 Not_testing

Selective self-test flags (0x0):

After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.
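For comparing reports like the two above, a small Python sketch can pull out the raw values that people usually look at first. The choice of attributes and the "media" / "link" labels are an assumption based on the common rule of thumb (reallocated, pending and uncorrectable sectors point at the disk surface; UDMA CRC errors point at the cable or connection), not something smartctl itself reports. The function name and the sample lines (copied from the first report) are just for illustration.

```python
# Attributes to watch and a rough label for what a non-zero raw value tends to suggest.
# These labels are a rule of thumb, not a diagnosis.
WATCHED = {
    "Reallocated_Sector_Ct": "media",
    "Current_Pending_Sector": "media",
    "Offline_Uncorrectable": "media",
    "UDMA_CRC_Error_Count": "link",
}

def watched_raw_values(report_text):
    """Return {attribute_name: raw_value} for watched attributes in smartctl -A style output."""
    found = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows look like: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            if fields[9].isdigit():
                found[fields[1]] = int(fields[9])
    return found

sample = """
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 83
194 Temperature_Celsius 0x0002 200 200 000 Old_age Always - 30 (Min/Max 10/54)
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 242
"""

for name, raw in watched_raw_values(sample).items():
    note = "" if raw == 0 else "  (non-zero, possible %s issue)" % WATCHED[name]
    print("%-24s %d%s" % (name, raw, note))
```

The output of smartctl -A could be saved to a text file and passed in as report_text. A non-zero count only says that something happened at some point in the drive's life, so it is worth checking again after a rebuild or an extended self-test to see whether the numbers are still climbing.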

Finally, if you are unable to help or point me at a help page that is relevant to getting myself back to where I was, is anyone able to give me a terminal command to, say, search through all my mounted drives for .pdf or .mkv etc. files? That way I can find out which files appear to be complete and then attempt to work out what is missing/lost.
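One possible way to do such a search is sketched below in Python. It assumes Python 3 is available on whatever machine can see the disks (it is not part of a stock unRAID 5 install as far as I know) and that the data disks are mounted at /mnt/disk1, /mnt/disk2 and so on, which is the usual unRAID layout. The function name find_media and the default extension list are made up for the example.

```python
import glob
import os

def find_media(roots, extensions=(".pdf", ".mkv")):
    """Yield (path, size_in_bytes) for files under roots whose name ends in one of extensions."""
    wanted = tuple(ext.lower() for ext in extensions)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.lower().endswith(wanted):
                    path = os.path.join(dirpath, name)
                    try:
                        size = os.path.getsize(path)
                    except OSError:
                        size = -1  # file is listed but could not be examined
                    yield path, size

if __name__ == "__main__":
    # Assumed mount points for the array's data disks.
    roots = sorted(glob.glob("/mnt/disk[0-9]*"))
    for path, size in find_media(roots):
        flag = "  <-- zero length or unreadable" if size <= 0 else ""
        print(f"{size:>14}  {path}{flag}")
```

A listing like this only shows which files exist and how big they are. A file on a badly rebuilt disk can have the right name and size and still contain bad data, so anything important would still need to be opened or compared against a backup.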

As ever - thanks in anticipation of absolutely any forthcoming help or advice. Getting desperate! :'( :'( :'( :'(
