Upgraded parity to larger drive, another drive died during parity data build



Hi guys,

 

I've just upgraded my parity drive from 3TB to 4TB, and around 1/3 of the way through the generation of the parity, one of the data drives has died. 

It still shows in Unraid, I can see all the files, but it gets a ton of read errors and I can't copy anything off it.

I then tried to pop my original 3TB parity drive in, but it says the replacement drive needs to be the same size, so I can't use it.  I haven't written much to the array since swapping out parity drives.

Is there a way to do the following:

 

Install old 3TB parity drive (all other disks are 3TB or less)

Remove dying data drive

Power up the Unraid box

Get it to "trust" the old parity drive and let me start the array in a "missing disk" scenario, so I can recover the 1.5TB of data using parity information?

After I have recovered the data, I plan to pop the 4TB parity drive back in, and use the old 3TB parity drive to replace the dying data disk.

 

I had a look here:

http://lime-technology.com/wiki/index.php/Make_unRAID_Trust_the_Parity_Drive,_Avoid_Rebuilding_Parity_Unnecessarily

But it doesn't really state that it should be used in this case.

 

Running one of the very last Unraid 5 versions.

Link to comment

...  I haven't written much to the array since swapping out parity drives...

If you have written anything at all to the array since swapping parity drives then the old parity is invalid and cannot be used to rebuild a data disk.

 

It is possible that the drive hasn't really died, but instead you have a connection problem caused when you swapped the parity.

 

Does the drive actually have a redball?

 

Post SMART for the drive and a syslog.

Link to comment

The drive hasn't redballed, but it gets a ton of read errors during parity rebuild.  It's taken all week to get to 50%, and it was running at 200kBytes/sec when I powered down the NAS.  I currently have the dying drive in an external USB enclosure, attempting to get data off using some Linux reader and some other tools.  It's also erroring out when attached to the PC.  I'd acknowledge some of the parity info would not be correct, but the majority should be.  I'd assume the last dozen files might be damaged due to this but the rest of the parity info would still be OK?

I don't have the drive in the NAS right now so I can't post logs.  Just wondering if there is a way to fire up the array with the drive missing and the old parity drive installed. 
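To put the 200 kBytes/sec figure quoted above in perspective, here is a rough back-of-the-envelope sketch. It assumes decimal units, a constant transfer rate, and that the parity build has to cover the full 4TB parity drive with half already done; none of these are confirmed in the thread, and `remaining_days` is just an illustrative helper.

```python
# Rough ETA for the rest of a parity build at a fixed transfer rate.
# Assumptions (not from the thread): 1 TB = 10**12 bytes, 1 kB = 1000 bytes,
# and the rate stays constant for the whole remainder.

def remaining_days(total_bytes: int, fraction_done: float, rate_bytes_per_s: float) -> float:
    """Days needed to process the part of the disk that is still outstanding."""
    remaining_bytes = total_bytes * (1.0 - fraction_done)
    return remaining_bytes / rate_bytes_per_s / 86_400  # 86,400 seconds per day

eta = remaining_days(4 * 10**12, 0.5, 200 * 1000)
print(f"{eta:.0f} days")  # prints "116 days"
```

At that rate the build would effectively never finish, which is consistent with giving up on it and looking for another way to read the data.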

Link to comment

OK I have popped in the old parity drive, temporarily installed the dying drive, followed the instructions on the link above, ie:

 

Take screenshot of current drive arrangement

Telnet in and type:  initconfig

Go to website and reassign all the drives

Tick "trust array"

In the telnet prompt: mdcmd set invalidslot 99

Start array

Close webpage

Yank dying drive out

Copy data off the "virtual" disk14 (the dying drive)

Recovered data appears to be intact so far :)

 

Link to comment

This is a really scary scenario to me, one that I hope I never have to deal with, but "just in case" I'd love to know whether your process worked for you in the end, and whether you'd do anything differently given the same situation.  Please post back once you're done!

 

Unraid is so awesome that I hate to criticize in any way, but it would be really cool to be able to either have a "raid 6" type double parity, or to be able to configure a new parity disk while the original parity disk is still online and usable... 

Link to comment

 

 

Unraid is so awesome that I hate to criticize in any way, but it would be really cool to be able to either have a "raid 6" type double parity, or to be able to configure a new parity disk while the original parity disk is still online and usable...

 

V6.2 supports dual parity.

 

Link to comment

It copied successfully for a few hours and I managed to rescue a stack of files I was unable to before.  I walked back in and now the share for disk 14 is showing as blank.  Maybe it hit some out-of-sync data on the parity drive and crapped itself.

Rebooting the box now to see if I can start the array with the missing disk and recover the remaining data.

 

I've stuck with this version of Unraid and Plex forever, but the double-parity option really appeals to me now!

Link to comment

Rebooting with disk14 unplugged results in disk14 showing as "not installed" rather than missing.  Can't browse to \\nas\disk14, and can't browse disk14 via the web interface either. Might reboot again and yank the drive after the array starts again.  Not sure how to get the array to present the "virtual" disk14 :(

Link to comment

Rebooting just gives the choice of wiping disk14 or rebuilding it.  Might pop to the computer shop tomorrow and grab a 3TB drive and take the rebuild option.  The array just doesn't seem to want to present a "virtual" disk14 unless anyone knows a trick?

Link to comment

Rebooting just gives the choice of wiping disk14 or rebuilding it.  Might pop to the computer shop tomorrow and grab a 3TB drive and take the rebuild option.  The array just doesn't seem to want to present a "virtual" disk14 unless anyone knows a trick?

You might want to post a screenshot to make it clearer what the current state is.  Sounds like the 'virtual' disk may have some file system level corruption that will need fixing.  Rebuilding a disk with file system level corruption at the virtual level just gives a physical disk with the same corruption.
Link to comment

Not really relevant now, but just thought I would comment on this bit:

...  I'd assume the last dozen files might be damaged due to this but the rest of the parity info would still be OK?

If you knew for certain that the files written since the old parity was in sync were only on the damaged disk, then what you say would be true. Any files written to other disks, though, would have an unknown effect on the rebuilt disk, since the bits on those disks would also be involved in reconstructing the data on the damaged disk, and there is no telling where on the damaged disk those out-of-sync bits would be.
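A toy sketch of that point, assuming single parity is a plain byte-wise XOR across the data disks. The disk contents, sizes, and the `xor_blocks` helper are made up for illustration and are not Unraid code.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three toy "data disks" of 8 bytes each (made-up contents).
disk1 = bytes([0x11] * 8)
disk2 = bytes([0x22] * 8)
disk3 = b"FILEDATA"  # the disk that will fail

old_parity = xor_blocks([disk1, disk2, disk3])

# Parity still in sync: the missing disk is reconstructed exactly.
rebuilt_ok = xor_blocks([old_parity, disk1, disk2])

# A write lands on disk2 (offsets 2 and 3) while the OLD parity is not updated.
disk2_new = disk2[:2] + b"\xff\xff" + disk2[4:]

# Stale parity: the reconstruction of disk3 is wrong at those same offsets.
rebuilt_stale = xor_blocks([old_parity, disk1, disk2_new])
bad_offsets = [i for i, (a, b) in enumerate(zip(rebuilt_stale, disk3)) if a != b]

print(rebuilt_ok == disk3, bad_offsets)  # prints "True [2, 3]"
```

The damage shows up at the offsets that were written on the other disk, regardless of which file on the failed disk happens to live there. That is why it cannot be predicted from the list of recently written files alone.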
Link to comment

Hi guys,

 

I understand a few files will be unrecoverable, probably 50GB or so that were written before the parity drive was upgraded.  But I'd assume the parity for the remaining 1.4TB of sectors should still be accurate.  I just can't seem to see a way to get the "virtual" disk14 to show.  Maybe I'll need to do the "trust my array" procedure a second time?

 

Here's a few screenshots:

 

Array with dying disk not installed:

array1.png

Array Start dialog box with disk not installed - doing so doesn't give me a virtual disk 14:

array2.png

Array with dying disk (14) present:

array3.png

Array Start dialog box with options presented.  Notice I can't just start it as is. 

array4.png

SMART test results:

array5.png

 

A quick snippet of the disk test log retrieved when I click "collect":

ATA Error Count: 42265 (device log contains only the most recent five errors)

CR = Command Register [HEX]

FR = Features Register [HEX]

SC = Sector Count Register [HEX]

SN = Sector Number Register [HEX]

CL = Cylinder Low Register [HEX]

CH = Cylinder High Register [HEX]

DH = Device/Head Register [HEX]

DC = Device Command Register [HEX]

ER = Error register [HEX]

ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

 

Error 42265 occurred at disk power-on lifetime: 22116 hours (921 days + 12 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00      00:13:14.151  READ DMA EXT

  25 00 08 ff ff ff ef 00      00:13:14.144  READ DMA EXT

  25 00 08 ff ff ff ef 00      00:13:14.127  READ DMA EXT

  ef 10 02 00 00 00 a0 00      00:13:14.126  SET FEATURES [Enable SATA feature]

  27 00 00 00 00 00 e0 00      00:13:14.126  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

 

Error 42264 occurred at disk power-on lifetime: 22116 hours (921 days + 12 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00      00:13:11.508  READ DMA EXT

  ef 10 02 00 00 00 a0 00      00:13:11.507  SET FEATURES [Enable SATA feature]

  27 00 00 00 00 00 e0 00      00:13:11.507  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]

  ec 00 00 00 00 00 a0 00      00:13:11.506  IDENTIFY DEVICE

  ef 03 46 00 00 00 a0 00      00:13:11.506  SET FEATURES [set transfer mode]

 

Error 42263 occurred at disk power-on lifetime: 22116 hours (921 days + 12 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 00 08 ff ff ff ef 00      00:13:08.764  READ DMA EXT

  ca 00 08 c0 00 00 e0 00      00:13:08.764  WRITE DMA

  c8 00 38 a8 0c 00 e0 00      00:13:08.726  READ DMA

  c8 00 08 a0 0c 00 e0 00      00:13:08.726  READ DMA

  c8 00 08 98 0c 00 e0 00      00:13:08.726  READ DMA

 

Error 42262 occurred at disk power-on lifetime: 22115 hours (921 days + 11 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 03 08 ff ff ff ef 00      02:31:59.657  READ DMA EXT

  25 03 08 ff ff ff ef 00      02:31:59.060  READ DMA EXT

  25 03 08 ff ff ff ef 00      02:31:56.426  READ DMA EXT

  25 03 08 ff ff ff ef 00      02:31:53.798  READ DMA EXT

  25 03 08 ff ff ff ef 00      02:31:53.206  READ DMA EXT

 

Error 42261 occurred at disk power-on lifetime: 22115 hours (921 days + 11 hours)

  When the command that caused the error occurred, the device was active or idle.

 

  After command completion occurred, registers were:

  ER ST SC SN CL CH DH

  -- -- -- -- -- -- --

  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

 

  Commands leading to the command that caused the error were:

  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name

  -- -- -- -- -- -- -- --  ----------------  --------------------

  25 03 08 ff ff ff ef 00      02:31:56.426  READ DMA EXT

  25 03 08 ff ff ff ef 00      02:31:53.798  READ DMA EXT

  25 03 08 ff ff ff ef 00      02:31:53.206  READ DMA EXT

  25 03 08 ff ff ff ef 00      02:31:52.609  READ DMA EXT

  25 03 08 ff ff ff ef 00      02:31:52.017  READ DMA EXT
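A quick sanity check of the figures in the log above, as a small sketch. The arithmetic is certain; the interpretation is an assumption. The 49.710-day wrap matches a 32-bit millisecond counter, and 0x0fffffff is the largest value a 28-bit LBA field can hold, so the identical LBA in every entry may be a saturated placeholder for these 48-bit READ DMA EXT commands rather than the real failing sector.

```python
# Power-on lifetime: 22116 hours expressed as days + hours.
days, rem_hours = divmod(22116, 24)

# Powered_Up_Time wrap: a 32-bit counter of milliseconds, expressed in days.
wrap_days = 2**32 / 1000 / 86_400

# The LBA reported for every error, and whether it equals the 28-bit maximum.
lba = int("0fffffff", 16)
is_28bit_max = lba == 2**28 - 1

print(days, rem_hours, round(wrap_days, 3), lba, is_28bit_max)
# prints "921 12 49.71 268435455 True"
```

If the LBA really is a placeholder, the extended error log (when the drive supports it) would be the place to look for the actual sector numbers.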

Link to comment

OK, I have just run the "trust my array" procedure, but without "mdcmd set invalidslot 99".  Oops.  What does this do?

Either way I am recovering more data, going to run down the road now and try and source another drive if I can find a PC shop open on a Sunday.

Link to comment

Lots of these in the log as I do the recovery, I'm guessing due to the incomplete parity.  Looks like it has frozen so I'll have to reboot and continue the process.

 

Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1

Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?

Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1

Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?

Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1

Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?

Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1

Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?

Link to comment

Looks like my motherboard or CPU has crapped itself just to add to the party, it boots, but at glacial pace (I'm talking 1 hr just to get to the unraid boot screen), almost like the CPU is running at 1MHz or something. 

Even with all SATA, PCI and excess RAM sticks removed it does the same thing :(

Tried a spare motherboard out of an old Acer i3 PC, but it didn't seem to support USB boot.  Any recommendations for a motherboard with lots of (6+) SATA ports?

<insert photo of me tearing hair out here>

Link to comment

I've pulled the CPU off and replaced the i3 540 with an i3 550 off the spare motherboard.  Noticed a bent CPU socket pin from when I originally built the box years ago and a missing CPU socket pin next to it on the motherboard, straightened out the bent pin and it seems to be running at normal speed again, even with the other pin missing.  Might still investigate a new motherboard though.

Link to comment

OK, I managed to get my hands on an old Intel S3420GP with a Xeon X3440 CPU on it.

http://www.intel.com/content/www/us/en/support/server-products/server-boards/single-socket-server-boards/intel-server-board-s3420gp-family.html

 

All up and running, have installed new disk to replace the faulty one. 

Clicked rebuild data.  It is rebuilding the data but I also have a choice to format the disk?  I also can't see disk14 in the list of shares I can connect to.

array6.png

Should I just wait it out?  What happens if it does a data rebuild when the disk isn't formatted first?

Also looks like all the newer data was copied to disk15 and disk16, so I'm hoping the parity disk swap doesn't affect what is on those.

I managed to get 1.4TB out of the 1.5TB of data back. 

Link to comment

Lots of these in the log as I do the recovery, I'm guessing due to the incomplete parity.  Looks like it has frozen so I'll have to reboot and continue the process.

 

Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1

Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?

Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1

Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?

Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1

Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?

Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1

Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?

 

These are file system errors, not disk errors or parity errors, nothing to do with parity.  Once you have access to Disk 14 again, see Check Disk File systems.
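One way to see that these messages point at the file system rather than at many different sectors is to tally them by device and block. A minimal sketch, using two line pairs copied from the excerpt above; the regular expression and the `summarise` helper are illustrative and only match this particular message shape.

```python
import re
from collections import Counter

# Sample lines copied from the syslog excerpt in this thread.
LOG = """\
Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?
Jul 31 10:54:06 NAS kernel: REISERFS warning: reiserfs-5090 is_tree_node: node level 0 does not match to the expected one 1
Jul 31 10:54:06 NAS kernel: REISERFS error (device md14): vs-5150 search_by_key: invalid format found in block 346236761. Fsck?
"""

# Captures the device name and the block number from "REISERFS error" lines.
PATTERN = re.compile(r"REISERFS error \(device (\w+)\): .*? in block (\d+)")

def summarise(log_text):
    """Count REISERFS errors per (device, block) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts

print(summarise(LOG))  # prints "Counter({('md14', 346236761): 2})"
```

Every error in the excerpt names the same block on md14, which fits a single damaged tree node being hit repeatedly.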

 

All up and running, have installed new disk to replace the faulty one. 

Clicked rebuild data.  It is rebuilding the data but I also have a choice to format the disk?  I also can't see disk14 in the list of shares I can connect to.

 

Should I just wait it out?  What happens if it does a data rebuild when the disk isn't formatted first?

 

You don't want to format Disk 14, that would empty the disk, and update the parity accordingly.  Rebuilding a drive rewrites all of the bits on the drive, overwriting whatever was on the drive before.

Link to comment

Even though Disk14 has been replaced?  Doesn't it need to format the disk before it can restore the data to it via parity?

 

I let it run for a few hours and it got to around 700GB in, but there was no disk14 share still.  And it still wanted me to format it.

I hadn't checked your reply, so I cancelled the rebuild, hit Format, and then clicked Rebuild again.  It is rebuilding again now, but the disk14 share is showing as blank.  Disk also shows as blank in the list of disks, see below.

 

array7.png

 

Will the data reappear on that disk when the Rebuild completes?  Or have I just nuked all the data out of parity?  It seems to be writing something to the disk, but who knows what?

 

I have a copy of most of the data just in case, but I was hoping this would restore the data more nicely :)

Link to comment

As RobJ said

You don't want to format Disk 14, that would empty the disk, and update the parity accordingly.  Rebuilding a drive rewrites all of the bits on the drive, overwriting whatever was on the drive before.

The word format means "Write an empty filesystem to this disk". That is what it has always meant on every operating system you have ever used.  unRAID treats this write operation just like it does every other write operation, by updating parity. So now parity agrees that you have an empty filesystem, and now you are rebuilding an empty filesystem. Format is NEVER part of the rebuild process.

 

Link to comment

Damn!  How do I reverse/stop this?  Or have I already lost that parity information?  It's not the end of the world as I recovered most of the data via parity before I popped this new replacement disk in. 

 

Why does it say "Data-Rebuild" when it isn't rebuilding anything?

Link to comment

Damn!  How do I reverse/stop this?  Or have I already lost that parity information?  It's not the end of the world as I recovered most of the data via parity before I popped this new replacement disk in. 

 

Why does it say "Data-Rebuild" when it isn't rebuilding anything?

It is rebuilding the disk. The disk is just a bunch of bits. It will rebuild the entire disk. Unfortunately, the bits on the disk now represent an empty filesystem. The parity information now agrees that the disk has an empty filesystem on it, and that is what it is rebuilding.

 

An empty filesystem is not an empty disk. The only thing that sort of corresponds to the idea of an empty disk is a cleared disk. A cleared disk, such as can be created with preclear, is a disk full of zeros.

 

A filesystem, even an empty filesystem, has the filesystem metadata that represents folders, where files are stored, etc. In the case of a newly formatted disk, which is the empty filesystem you wrote when you told it to format, you basically have the metadata that represents the top level folder, with no files or folders in it.

 

However, the empty filesystem metadata is a pretty small amount of the bits on the disk. All of the other bits that used to be there are still there (and also in the parity array), so if you let the rebuild complete all of those other bits will still be there.
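The explanation above can be sketched with a toy model. It assumes XOR parity, treats all the other data disks as one combined block, and pretends the file system metadata is just the first 4 bytes of the disk; a real ReiserFS format writes more than that, and the names here are made up for illustration.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

other = bytes(range(16))             # stand-in for all other data disks combined
disk14 = b"HDR!" + b"movie-bytes!"   # 4 bytes of "metadata" + 12 bytes of "file data"
parity = xor(other, disk14)

# disk14 is gone, so the array emulates it from parity and the other disks.
emulated = xor(parity, other)

# "Format" is a write: new empty-filesystem metadata replaces the first 4 bytes,
# and parity is updated for exactly the bits that changed.
formatted = b"EMPT" + emulated[4:]
parity = xor(parity, xor(emulated, formatted))

# The rebuild now faithfully reproduces the formatted disk.
rebuilt = xor(parity, other)
print(rebuilt[:4], rebuilt[4:] == disk14[4:])  # prints "b'EMPT' True"
```

The rebuilt disk carries the new, empty metadata, while the old file bytes are still present underneath it. That is why repair tools may still find some of them.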

 

It might be possible to recover some of the files from those other bits using filesystem repair tools. Since you say you have backups, it might be easier and more reliable to just restore from the backups. If that is what you want to do I would say go ahead and let the rebuild complete, so the filesystem will be empty and it will also be consistent with parity. Then you can just copy everything from your backups.

Link to comment

Sounds like a plan, thanks heaps for your help :)

Worst case I'll plug the old dying drive into a caddy, attach to PC and use some tools to recover the remaining 100GB from it, assuming that data is not on bad sectors.

 

Eventually I need to replace all the remaining Seagate ST3000DM001 drives as these have a nightmarish failure rate!

https://www.backblaze.com/blog/hard-drive-reliability-q4-2015/

 

 

Link to comment
