Help removing some failed/failing disks

ratmice · December 16, 2017

So I have a situation where there are 2 disks I want to remove from my array. It started the other day when I noticed 2 disks were showing lots of reallocated sectors. At the same time I had trouble shutting down the server, and when I restarted a forced parity check was initiated. During this parity check one of the failing drives dropped, and the parity check aborted.

So the current situation is one failed drive, and one about to fail. I have backed up all information from both drives. I also have a pre-cleared, larger drive ready to deploy into the array. My question is, can the 2 drives be removed while keeping parity? I know I can get the failing drive, that is currently mountable, out of the array by emptying, or formatting, it first. But, is there a way to get rid of the faulty drive, as well, while keeping parity intact? I'm extra paranoid right now about another drive failing during the normal parity build, if I were to just go with a new config. I do not have dual parity yet, but that is the next thing on the list, as soon as I get the array back in shape.

Some help, from someone who understands the finer nuances of these kind of procedures would be greatly appreciated

SSD · December 16, 2017

(Having backups definitely reduces your risk!)

You should post a diagnostics file. This will allow us to see the smart reports from the two failing disks and judge just how bad they are, and see if anything else is concerning in the syslog.

Generally to remove two drives and transfer their contents to a third, I'd add the new disk to the server - partition and format the disks for unRAID (I usually try to get unRAID to do this, but do believe that UD will now partition and format disks to unRAID standard). One this disk is formatted and mounted in UD, you can copy the data from each of the two failing disks to the new drive, and once you are satisfied that data transfer is complete and accurate, you can then unmount the UD, stop the array, do a new config, omit the two failed disks from your drive config, include the new disk in the drive config, and start the "new" array configuration which will rebuild parity.

But let's start with the diagnostic (Tools -> Diagnostics)

ratmice · December 16, 2017

So , just to reiterate, all the info on these 2 drives (#1 and #16) has been copied elsewhere. So, I can reformat the failing one, and thus remove it from the array , I think by utilizing the trust parity procedure. The drive should be empty after the format and won't have any effect on parity. For the red-balled drive, can you follow a similar procedure by removing all files from the (emulated) disk and than removing it from the array, and trust parity again? I'm just not sure if that works with emulated drives that are offline.

I understand (mostly) your procedure using UD. However, the data is already copied elsewhere, so I don't need to retain any data currently on them.

I have attached the diagnostics to this post. redballed drive: (sdv) ST3000DM001-9YN166_S1F0KX93

Failing drive: (sdl) ST2000DM001-9YN164_S1E06XWM

tower-diagnostics-20171216-1711.zip

Edited December 16, 2017 by ratmice

JorgeB · December 16, 2017

5 minutes ago, ratmice said:

So, I can reformat the failing one, and thus remove it from the array , I think by utilizing the trust parity procedure.

No, formatting a drive is not the same as clearing, you'd need to start the array in maintenance mode and write zeros to both the remaining disk you want to remove and the emulated disk, one at a time:

https://wiki.lime-technology.com/Shrink_array#The_.22Clear_Drive_Then_Remove_Drive.22_Method

SSD · December 16, 2017

4 minutes ago, ratmice said:

So , just to reiterate, all the info on these 2 drives (#1 and #16) has been copied elsewhere. So, I can reformat the failing one, and thus remove it from the array , I think by utilizing the trust parity procedure. The drive should be empty after the format and won't have any effect on parity. For the red-balled drive, can you follow a similar procedure by removing all files from the (emulated) disk and than removing it from the array, and trust parity again? I'm just not sure if that works with emulated drives that are offline.

I understand (mostly) your procedure using UD. However, the data is already copied elsewhere, so I don't need to retain any data currently on them.

I have attached the diagnostics to this post.

tower-diagnostics-20171216-1711.zip

You should read up more on how parity works. Formatting a drive does not allow you to remove a disk as you describe, but ZEROing the disk would allow you to do what you are suggesting.

I'd prefer to the get the data copied across from the array and save the backup to be a backup. Otherwise, you'll be unprotected as you recover from the backup.

The UD approach is what I recommend.

Looking at the syslog, it appears you have a Marvell controller that is causing mischief. @johnnie.black can better assist with this, but believe you need a new controller - like the LSI SAS9201-9i.

ratmice · December 16, 2017

Johnnie, in the write up you link to, above, the following is in the procedure:

One quick way to clean a drive is reformat it! To format an array drive, you stop the array
and then on the Main page click on the link for the drive and change the file system type 
to something different than it currently is, then restart the array. You will then be 
presented with an option to format it. Formatting a drive removes all of its data, and 
the parity drive is updated accordingly, so the data cannot be easily recovered.

is this not way quicker than writing all zeros?

edit: OK I see the issue, still needs to be cleared by writing all zeros, but could save a bunch of time by not having to erase all files.

Edited December 16, 2017 by ratmice

JorgeB · December 16, 2017

5 minutes ago, ratmice said:

Johnnie, in the write up you link to, above, the following is in the procedure:


One quick way to clean a drive is reformat it! To format an array drive, you stop the array
and then on the Main page click on the link for the drive and change the file system type 
to something different than it currently is, then restart the array. You will then be 
presented with an option to format it. Formatting a drive removes all of its data, and 
the parity drive is updated accordingly, so the data cannot be easily recovered.

is this not way quicker than writing all zeros?

That means clear as in empty the drive, though a different term should be used, you can see that is just part of the procedure.

ratmice · December 16, 2017

I was editing above as you were replying. Thanks, I see that now.

JorgeB · December 16, 2017

8 minutes ago, SSD said:

Looking at the syslog, it appears you have a Marvell controller that is causing mischief.

It may be, but it affected a single drive, I'd like to see the SMART report for that disk and it dropped offline, a reboot/power cycle and new diags are needed.

ratmice · December 16, 2017

here is a new set of diagnostics after reboot.

tower-diagnostics-20171216-1752.zip

JorgeB · December 16, 2017

It's definitely the disk, another ST3000DM001 goes boom...

ratmice · December 16, 2017

Thank you for taking the time to help me out. Happy Holidays!

Just confirming that it is OK to zero the emulated drive (in maintenance mode) and then use a trust parity procedure to remove the disk without borking parity. I think that's what you said above.

SSD · December 16, 2017

A little confused what you are trying to do.

You can zero a disk (any number of disks actually) and remove them all from the array (by doing a new config, omitting them from the config, and then trusting parity). Always a good idea to do a parity check to confirm soon after. I actually do a short parity check - 5-10 mins - just to confirm.

But you had said you wanted to add a different larger disk. Would you just want to replace the disk with the larger disk and have it rebuild?

JorgeB · December 16, 2017

It is OK, it will take a while, I believe the script isn't working with latest unRAID, but you can do it manually, also like SSD mentioned the SASLPs have known issues, it wasn't one of them this time but it likely be in the future, LSI are the recommended controllers for v6, would also recommend dual parity for an array of that size.

To zero a disk (actual or emulated) you can follow this:

1. If disable enable reconstruct write (aka turbo write): Settings -> Disk Settings -> Tunable (md_write_method)

2. Start array in Maintenance mode. (array will not be accessible during the clearing)

3. Identify which disk you're removing

4. From the command line type:

dd bs=1M if=/dev/zero of=/dev/mdX status=progress

replace X with the correct disk number, disk1=md1, disk2=md2 and so on

5. Wait, this will take a long time, about 2 to 3 hours per TB.

6. When the command completes, Stop array, go to Tools -> New Config -> Retain current configuration: All -> Apply

7. Go back to Main page, unassign the cleared device. * with dual parity disk order has to be maintained, including empty slot(s) *

8. Click checkbox "Parity is already valid.", and start the array

ratmice · December 16, 2017

5 minutes ago, SSD said:

A little confused what you are trying to do.

You can zero a disk (any number of disks actually) and remove them all from the array (by doing a new config, omitting them from the config, and then trusting parity). Always a good idea to do a parity check to confirm soon after. I actually do a short parity check - 5-10 mins - just to confirm.

But you had said you wanted to add a different larger disk. Would you just want to replace the disk with the larger disk and have it rebuild?

I'm not surprised, as I'm a bit confused, as well. What I really want to do is first get rid of the 2 flaky disks, hopefully while still having parity protection. I think part of my (infectious) confusion is unfamiliarity with the trusting parity. I think Johnnie has helped in that regard. Adding the new larger disk will happen soon, but doesn't have to be at the same time.

I will start with getting rid of the emulated disk, and thus regain parity protection for the other disks in the array once it's gone. That way if anything gets FUBAR'd, I at least will have parity protection for the other disks. Then I can get to removing the failing disk and adding the new disk.

13 minutes ago, johnnie.black said:

It is OK, it will take a while, I believe the script isn't working with latest unRAID, but you can do it manually, also like @SSDmentioned the SASLPs have known issues, it wasn't one of them this time but it likely be in the future, LSI are the recommended controllers for v6, would also recommend dual parity for an array of that size, to zero a disk (actual or emulated) you can follow this:

Thank you for this, I thought I had read the script was having trouble. Just to be extra sure do you have to remove all files from the disk before manual zeroing, or was that just to get the script to run? (not that I'm using the script)

Good info about the controllers, I will definitely look into it. I have a drive ready to become my second parity, I was waiting for the array to get into some semblance of stability before I deploy it.

JorgeB · December 16, 2017

Just now, ratmice said:

Just to be extra sure do you have to remove all files from the disk before manual zeroing, or was that just to get the script to run? (not that I'm using the script)

Just for the script.

P.S.: forgot to mention it will take longer to clear the emulated disk compared to an actual disk, since all disks need to be read.

JonathanM · December 16, 2017

1 minute ago, johnnie.black said:

P.S.: forgot to mention it will take longer to clear the emulated disk compared to an actual disk, since all disks need to be read.

Assuming the actual disk is relatively healthy.

JorgeB · December 16, 2017

3 minutes ago, jonathanm said:

Assuming the actual disk is relatively healthy.

Yep, and I wouldn't recommend trying to clear a disk with issues.

P.S.: good to see you back from your posting sabbatical

JorgeB · December 17, 2017

4 minutes ago, johnnie.black said:

Yep, and I wouldn't recommend trying to clear a disk with issues.

Just to be clear, when I say issues I mean a disk with pending sectors or similar, your other disk has a considerable number of reallocated sectors but the rest seems mostly OK, so it should clear fine, especially since disks are much more likely to error on reads than on writes.

JorgeB · December 17, 2017

One more thought, since the failing drive is more likely to fail on read, it would be safer (though possibly a little slower) to first zero and remove that one and only after that remove the emulated disk, because writing to the emulated drive will require that all others disks are read.

ratmice · December 17, 2017

So, this brings up an interesting dilemma. In terms of probability of having another issue while extricating myself from the current situation, what would be the best way to proceed?

1. zero and remove the emulated drive first - worried about the stress on the pre-failing while zeroing emulated drive making it actually fail

2. zero and remove the pre-failing drive first - worried about the stress of zeroing on the drive making it actually fail

3. just bite the bullet and pull both, add the new one, new config and live without parity protection for a day.

4. I suppose if 1 or 2 fails I can always fall back to 3.

am I waaaaaaay overthinking this?

pwm · December 17, 2017

I would zero the prefailing drive.

Then take care of the emulated drive.

SSD · December 17, 2017

With backups in place your risk is minimal regardless of what you do.

In general, though, I like to access "fragile" disks as little as possible. Doing full disk operations is not fragile. And the problem with full disk operations is that unless they complete successfully, you are not confident what was recovered and what was not. Finishing in the middle is worse than nothing, as you could spend the rest of your life trying to figure out what is good and what is not.

Copying data file by file is different. Each file that copies is a success. If the drive dies in the middle, you have what you have and you don't what you don't. Puts you in the drivers seat to copy the most valuable data first.

I would pursue adding a new disk as a UD, copying to data from the problematic "real" drive to it, and then coping the data from the emulated disk to it. Then rebuild parity without the two bad disks but including the new good disk.

Again, your backups save the day, so whatever you do it low risk. But for people without the backups, that's what I would do.

ratmice · December 17, 2017

Overnight the pre-failing disk was zeroed, I removed it from the array this morning . All went well. Interesting side note is that now the red-balled disk is back online like nothing is wrong? Not sure what to make of that.

JorgeB · December 17, 2017

2 hours ago, ratmice said:

Interesting side note is that now the red-balled disk is back online like nothing is wrong? Not sure what to make of that.

Disk had dropped offline, it's normal to come back online after a reboot or power cycle, disk is still bad though.

Help removing some failed/failing disks

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation