
Tricky problem with replacing a dead drive & parity disk at the same time


isochronous


Recently I came home after a severe thunderstorm to find my unRAID server making a horrible clicking noise from one of the hard drives.  Unfortunately, at the same time, my router had somehow bricked itself during the storm, and I was unable to access the unRAID web GUI.  I tried to soft-power-off the server by hitting its power button, but it just kept spinning the dying drive down and then back up, so I was eventually forced to hold the power button in and shut it off the brute-force way.

 

So now I know that one of my drives is dead, I don't know which one, and I don't really want to power back up the array until I have a replacement drive ready.  I decided that now was as good a time as any to add some capacity, so I decided to buy two 2TB drives, one as a new parity disk, and the other as the replacement for the dead drive.

 

The problem (well, the main problem, now) is that currently my parity disk is 640GB.  Which means that if I want to replace the dead drive (assuming it's not the parity disk) then I have to install the replacement drive, reconstruct it from the parity info, then swap out the parity drive.  I'm sure at least some of you already see the problem: the parity disk has to be at least as big as any drive in the array, which means I'm going to have problems using the 640GB parity disk to reconstruct data on the 2TB drive.  So then I should swap out the parity drive first, but then obviously the problem is that I won't be able to reconstruct the data on the dead drive. 
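To make the size constraint concrete, here is a minimal sketch that assumes simple byte-wise XOR parity (the usual single-parity scheme). The names `xor_parity` and `rebuild` are hypothetical helpers for illustration, not unRAID functions, and the byte strings are stand-ins for whole disks. It shows that parity has to span the largest data disk before every disk can be rebuilt from it.

```python
from functools import reduce


def xor_parity(disks: list[bytes]) -> bytes:
    """Byte-wise XOR across all disks; shorter disks are treated as zero-padded."""
    size = max(len(d) for d in disks)
    padded = [d.ljust(size, b"\x00") for d in disks]
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*padded))


def rebuild(missing_size: int, surviving: list[bytes], parity: bytes) -> bytes:
    """Recover one missing disk from the parity plus every surviving disk."""
    return xor_parity(surviving + [parity])[:missing_size]


disk1 = b"ABCDEFGH"  # stand-in for the largest data disk
disk2 = b"1234"      # stand-in for a smaller data disk
parity = xor_parity([disk1, disk2])

# Parity is as long as the largest disk, so either disk can be rebuilt.
assert len(parity) == len(disk1)
assert rebuild(len(disk2), [disk1], parity) == disk2
assert rebuild(len(disk1), [disk2], parity) == disk1
```

A parity disk shorter than `disk1` would simply have no bytes covering the tail of that disk, which is why a 640GB parity disk cannot protect a fully used 2TB data disk.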

 

My question is: Is there any way to tell unraid to just use the first 640GB of the new data drive, and then later use something like GParted to size up the partition after I've swapped out the parity drive?  Or, failing that, is there a way to copy the parity data from the existing parity drive to the new 2TB drive, then swap out the data disk and rebuild the array?  Is there some other answer to this problem that I haven't thought about?

 

Your help will be very much appreciated.

Link to comment

unRAID has a feature called "swap-disable".  It allows you to replace parity with a larger drive, and then use the parity disk to replace a failed disk.

 

Look at the official unRAID docs (see here) and search for "swap-disable".

 

If you search the forum for swap-disable you'll find lots of people that have done it successfully.

Link to comment
  • 3 weeks later...

I'm just about to do something very similar. Difference being that I will be removing a functional 400GB disk, demoting the 1TB parity disk to data disk, and then upgrading a second 400GB with 2TB.

 

My setup is as follows:

Status  Disk      Mounted     Device    Model/Serial                    Temp  Reads     Writes   Errors  Size     Used     %Used  Free
OK      parity                /dev/sde  SAMSUNG_HD103UJ_S13PJDWQA14001  *     11487363  2253733
OK      /dev/md1  /mnt/disk1  /dev/sdc  00P8B0_WD-WCAVU0248604          34°C  11531262  118555           1.00T    999.78G  100%   394.43M
OK      /dev/md2  /mnt/disk2  /dev/sdd  00M2B0_WD-WMAV50602294          *     11968618  375906           1.00T    999.26G  100%   915.39M
OK      /dev/md3  /mnt/disk3  /dev/sda  00M2B0_WD-WMAV50287415          *     11690864  183695           1.00T    999.82G  100%   355.15M
OK      /dev/md5  /mnt/disk5  /dev/sdf  63U_WD-WCAV58157919             *     9853000   775957           1.00T    999.33G  100%   847.10M
OK      /dev/md6  /mnt/disk6  /dev/sdg  SAMSUNG_HD401LJ_S0HVJ1CLB04953  *     3616025   68784            400.08G  399.10G  100%   976.47M
OK      /dev/md7  /mnt/disk7  /dev/sdi  SAMSUNG_HD401LJ_S0HVJ13P110541  *     3464275   106243           400.08G  400.05G  100%   24.88M
OK      /dev/md8  /mnt/disk8  /dev/sdb  SAMSUNG_HD403LJ_S0NFJ1KLC05846  *     4869871   153833           400.08G  399.11G  100%   966.93M
OK      /dev/md9  /mnt/disk9  /dev/hda  HDS724040KLAT80_KRFA06RAGV3A2C  *     3401023   366583           400.08G  399.37G  100%   705.01M

 

So if someone could confirm the plan:

 

  • Run parity check
  • Add jumper to the two WD20EARS drives
  • Shut down server and replace the 400GB with one of the 2TB ones
  • Start server and select swap-disable option
  • System will transfer parity and then rebuild 400GB data on the 1TB disk previously used for parity.
  • At some point this is done
  • Check different data elements with focus on the data that was originally on the 400GB disk
  • Shut down
  • Replace another 400GB disk with the other 2TB disk
  • Restart and rebuild the second 400GB's data on the other 2TB
  • Enjoy newly added space

 

Is there any reason or purpose to do a preclear (other than identifying if one of the 2TBs is dead)?

 

Any other suggestions to the plan?

 

TIA

 

/Niels

Link to comment

I'm just about to do something very similar. Difference being that I will be removing a functional 400GB disk, demoting the 1TB parity disk to data

It does not work the way you described.  There is no "swap-disable" option to choose.  It is implied by your actions.

 

To perform the task you are attempting you must

1.  Stop the array

2.  Power down.  Remove the existing 400Gig drive.  Install the new 2TB drive. Power up.

3.  Assign the new 2TB drive as parity.

4.  Assign the previous parity drive in place of the 400Gig drive you removed.

When you then power up and "Start" the array the contents of the old parity drive (currently in the data slot that once held the 400Gig drive) will be copied to the new parity drive.  (This will take somewhere between 3 and 4 hours, during which your array will be off-line as 1TB must be copied)  You'll probably need to check the box under "Start" to enable it.

Then, after the old parity is copied to the new parity drive, the array will come online and the parity disk in combination with the other data disks will be used to re-construct the contents of the old 400Gig drive onto the 1TB drive.

When this is completed you'll have a 2TB parity drive, and a 1TB data drive (in place of the 400Gig data drive)

I'd then do a parity check to make sure the new  disks can all be read correctly.
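As a rough sanity check on the 3 to 4 hour copy estimate above, this small sketch works out the sustained throughput that estimate implies. It assumes 1TB means 10^12 bytes and that the copy runs at a constant rate; `implied_throughput_mb_s` is a hypothetical helper name.

```python
def implied_throughput_mb_s(size_bytes: float, hours: float) -> float:
    """Average MB/s needed to move size_bytes in the given number of hours."""
    return size_bytes / (hours * 3600) / 1e6


ONE_TB = 1e12  # assumption: decimal terabyte, as drive vendors count it

fast = implied_throughput_mb_s(ONE_TB, 3)  # about 92.6 MB/s
slow = implied_throughput_mb_s(ONE_TB, 4)  # about 69.4 MB/s
print(f"1TB in 3-4 hours needs roughly {slow:.1f} to {fast:.1f} MB/s sustained")
```

If the disks or controller cannot sustain that rate, the parity copy will take correspondingly longer.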

 

I do not recommend moving the drives physically until AFTER you have re-established parity and subsequently checked parity.  Then you can stop the array, physically move them, and then power back up.  As long as you use the same ports the server should figure it out.  If you use different slots you might need to use the "Devices" page to assign the disks to their correct slots in the array first.

 

Joe L.

Link to comment

Is there any reason or purpose to do a preclear (other than identifying if one of the 2TBs is dead)?

Absolutely. In fact, it is highly desirable in this situation, or you risk data loss.

 

The preclear will identify any un-readable sectors before you trust your data to those new disks.

 

Or you can copy your existing "parity" data to the new disk and learn about the un-readable sectors when it tries to use them to re-construct your failed drive.  By then it will be too late.  Your reconstructed data will be corrupted.
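The risk described above can be illustrated with a toy example. This is a sketch only: it assumes plain XOR parity, uses short byte strings in place of disks, and simulates an unreadable sector by flipping one byte of the copied parity. `xor_bytes` is a hypothetical helper.

```python
def xor_bytes(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)


disk_a = b"family photos"  # surviving data disk
disk_b = b"tax documents"  # the disk that will need rebuilding
parity = xor_bytes(disk_a, disk_b)

# Simulate one bad sector on the new parity disk: byte 4 comes back wrong.
bad_parity = bytearray(parity)
bad_parity[4] ^= 0xFF

rebuilt = xor_bytes(disk_a, bytes(bad_parity))

# The rebuild completes, but the data at the bad position is silently wrong.
assert rebuilt != disk_b
assert rebuilt[:4] == disk_b[:4] and rebuilt[5:] == disk_b[5:]
```

Finding the bad sector before the disk holds parity, which is what a preclear pass is meant to do, avoids ending up in this state.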

 

Once you are back with good parity, and have checked it by doing a full parity check, you can stop the array once more, install the other 2T drive in place of the other 400Gig drive and again press "Start" to begin the re-construction.

 

Joe L.

Link to comment

Thank you very much for the clarification Joe  :) . That was very helpful.

 

I will use your posts as my agenda for the weekend.

 

I will make another machine do the preclearing and then carefully follow the steps as described.

 

Maybe this elaboration could go into the Wiki at some point to avoid future questions and doubts down these lines? The current description is maybe a little too high level.

Link to comment

Almost 24 hours ago I started the Preclear, and things are looking good (I think) so far. It's around 10% into step 10 reading the zeros. I have attached the logs anyway.

 

When they are done, I will have to decide which of the 400GB drives I should retire. If there is no other indicator, I will simply take the ones with the most hours on the clock. Do the SMART reports suggest anything different?

 

Preclear-mid-step10.zip

SMART-400G-drives.txt

Link to comment

Finally - after 58 hours  :o - the preclear of the two 2TB drives including postread has finished.

 

Could someone please reconfirm that these drives are OK? I believe it looks right, as everything with "Reallocated" or "Pending" is 0. But I'm a bit unsure about the UDMA count; might that point towards issues?

 

It should be noted, that the preclear was done on a Jetway board that usually resides in my car,  but was the only available hardware with SATA ports for the purpose. It has been doing the preclear partly separated in almost a birds-nest of cables ::)

NiA-2TB-Preclear.zip

Link to comment

The drives look fine.

 

As for the UDMA errors: the current normalized value is 200 and the worst is 200, both of which are the factory-initialized values, and the failure threshold is 000.

The "raw" value has meaning only to the manufacturer.  There is no indication of any problem at all.

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1916
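The reasoning above (compare the normalized VALUE against THRESH and set the raw number aside) can be sketched in a few lines. This assumes the ten-column attribute layout shown in the line above; `SmartAttribute`, `parse_attribute` and `is_failing` are hypothetical names, not part of any SMART tool.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    attr_id: int
    name: str
    value: int   # current normalized value
    worst: int   # lowest normalized value seen so far
    thresh: int  # failure threshold
    raw: str     # vendor-specific raw value, kept as text


def parse_attribute(line: str) -> SmartAttribute:
    """Parse one whitespace-separated attribute row in the layout shown above."""
    f = line.split()
    return SmartAttribute(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), f[9])


def is_failing(attr: SmartAttribute) -> bool:
    """An attribute is flagged once its normalized value drops to the threshold."""
    return attr.value <= attr.thresh


row = "199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1916"
attr = parse_attribute(row)
print(attr.name, "failing" if is_failing(attr) else "ok")
```

For the row quoted in this post the normalized value (200) is well above the threshold (000), so the check reports no failure, matching the conclusion above.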

 

Link to comment

I did the procedure to the letter, but now I get this message

 

Too many wrong and/or missing disks

 

I shut down the server. Took the SATA cable from the disk to be replaced (disk 6), plugged it into the new disk.

 

Power on, and the system complains as expected that the parity disk is not the largest one. The 2TB disk is located in the array as disk 6, where the other one was sitting.

 

Unassigned disk 6. Assigned the 2TB to the parity slot. Assigned the old 1TB parity disk as disk 6.

 

Then, back on the unRAID main web page, the message Too many wrong and/or missing disks is presented next to the Start button  ???  :'(

 

What should I do now?

 

Below it can be seen, that the parity disk is now located in slot 6.

 

Status      Disk      Mounted  Device    Model/Serial                                                    Temp  Reads  Writes  Errors  Size  Used  %Used  Free
DISK_WRONG  parity             /dev/sdh  SAMSUNG_HD103UJ_S13PJDWQA14001 <-- was old disk in this slot
                                         WDC WD20EARS-00M_WD-WCAZA1026118 <-- current disk in this slot  38°C  38     0
OK          /dev/md1           /dev/sdc  00P8B0_WD-WCAVU0248604                                          33°C  70     0
OK          /dev/md2           /dev/sdd  00M2B0_WD-WMAV50602294                                          32°C  70     0
OK          /dev/md3           /dev/sda  00M2B0_WD-WMAV50287415                                          34°C  55     0
OK          /dev/md5           /dev/sdg  63U_WD-WCAV58157919                                             35°C  63     0
DISK_WRONG  /dev/md6           /dev/sdf  SAMSUNG_HD401LJ_S0HVJ1CLB04953 <-- was old disk in this slot
                                         SAMSUNG HD103UJ _S13PJDWQA14001 <-- current disk in this slot   29°C  44     0
OK          /dev/md7           /dev/sdi  SAMSUNG_HD401LJ_S0HVJ13P110541                                  30°C  67     0
OK          /dev/md8           /dev/sdb  SAMSUNG_HD403LJ_S0NFJ1KLC05846                                  31°C  70     0
OK          /dev/md9           /dev/hda  HDS724040KLAT80_KRFA06RAGV3A2C                                  41°C  54     0

 

(had to change another disk, as the original candidate [/dev/hda HDS724040KLAT] actually was a PATA-drive)

Link to comment

And the expected swap is now building. <phew>.

 

So, the steps are as follows (taken from the previous link, slightly modified):

 

1. Shutdown

2. Remove drive to be changed.

3. Boot, observe webGui says disk is missing, but Start array anyway.

4. Array should start, will use parity reconstruct to satisfy I/O requests to now disabled disk.

5. Shutdown

6. Move old parity to disk1 slot, install new (bigger) disk in parity slot.

7. Boot, observe webGui shows 'parity-swap' situation present.

8. Start array - wait for copy to finish (probably several hours) - when done, parity rebuild should be taking place

 

I actually managed without booting. Simply starting the array, stopping it again (took actually approx. 15 minutes) and then reassigning parity to drive6 (in my case) and the 2TB to parity.

 

Now it is copying to the tune of 1% per 7 minutes... Waiting patiently...  :)
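For a rough idea of what that rate means in total, here is a quick back-of-the-envelope calculation. It assumes the copy keeps the same pace from start to finish, which real disks may not do.

```python
MINUTES_PER_PERCENT = 7  # observed rate quoted above

total_minutes = 100 * MINUTES_PER_PERCENT
hours, minutes = divmod(total_minutes, 60)
print(f"Estimated full copy time: {hours} h {minutes} min")  # 11 h 40 min
```

At a steady 1% per 7 minutes the whole copy would take about 11 hours 40 minutes.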

Link to comment

I will propose something that you probably may not like.

You are using hardware (motherboards - this one and the one used for preclearing) that strays from the norm.

 

I personally have had so many problems with nVidia boards in the past that I won't use any motherboard based on an nVidia chipset (or for that matter any SiS, VIA or ALi based one) for unRAID. Granted, that particular board may have been reported as working, but did that person have the exact same configuration as you...?

 

And if I see a person with the "N" in the motherboard model number posting about problems I usually will just skip the reply - as in most cases these are tech enthusiasts that are reading about, using these boards and being passionate regarding their choice.

 

But for a critical application (and my data is critical) I won't touch anything other than a relatively recent Intel or AMD chipset.

 

And I believe another knowledgeable user here shares the same opinion regarding the nVidia.

 

So if you have another computer with Intel or AMD based chipset I will suggest to at least temporarily use that one for the data recovery. And you will need another machine with internet access to be able to read here for solutions that will come your way.

 

Good luck

 

Link to comment

Thanks for the feedback.

 

However, when I originally got the MoBo, it was already Level 2 tested. Somewhere even Tom suggested it to someone, and it was used by several others with success.

 

I too have used it with success for a year.

 

I believe the problem with losing drive1 after the parity swap may come from power supply issues (most likely) or maybe the SATA cable being loosely attached after working inside the cabinet. At least, it's an issue that occurred during the upgrade process, pointing to a problem triggered by that. You could of course argue that it is MoBo related.

 

The reason I suspect the power supply is that I had problems turning the server on again after doing the disk swap.

 

I will however take your recommendations into consideration for what will probably be a Christmas rebuild of the unRAID server. I will start the hunt for a low-power consuming board, and then probably redelegating the Asus board for secondary services (Weather server, SubSonic, ...)

 

Still, I'm looking for immediate advice on how to proceed with the existing disks and setup in order to have the highest likelihood of preserving as much data as possible.

 

 

Link to comment

Archived

This topic is now archived and is closed to further replies.
