replacing disks without any risk



When using single parity, wiki says the following about replacing a single disk:

 

https://lime-technology.com/wiki/UnRAID_6/Storage_Management#Replacing_disks

 

> Array cannot tolerate a disk failure without potential data loss to

> both the disk being replaced and the additional disk that has failed.

 

I believe this is a serious and unnecessary limitation of the current unRAID disk replacement implementation. As long as both the new disk and the to-be-replaced disk are available and healthy, unRAID should do the following:

 

1) Unmount the array.

2) Copy all sectors from the old disk to the new one.

3) After step 2) has completed, simply swap the old and new disks in the array.

4) Done.
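Step 2) can be sketched roughly as follows. This is only an illustration in Python, not how unRAID does or would implement it: the function name `clone_disk`, the chunk size and the use of plain file paths are all assumptions made for the example. It simply copies the old device to the new one chunk by chunk and leaves anything beyond the copied range untouched.

```python
import os

CHUNK = 1024 * 1024  # copy 1 MiB at a time (arbitrary choice for this sketch)


def clone_disk(src_path, dst_path, chunk=CHUNK):
    """Copy every byte of src_path onto the start of dst_path.

    dst_path must already exist and be at least as large as src_path
    (as a block device or image file would be). Returns the number of
    bytes copied. Hypothetical helper, not part of unRAID.
    """
    copied = 0
    # "r+b" opens the destination for in-place writing without truncating it.
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        while True:
            block = src.read(chunk)
            if not block:
                break  # end of the old disk reached
            dst.write(block)
            copied += len(block)
        dst.flush()
        os.fsync(dst.fileno())  # make sure the data really reached the device
    return copied
```

Only after `clone_disk` has returned successfully would step 3) be carried out. Try it on image files first; pointing it at a real device overwrites that device.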

 

This approach has many advantages over the current approach:

 

- much faster, because all other disks (including Parity) stay idle!

- no problem if any 1 disk fails during the replacement procedure!

- you could even replace multiple disks at once, without losing Parity protection

 

The one and only disadvantage of this solution (that I can see) is that the array should not be mounted (it could be mounted in read-only mode, though), because otherwise sectors of the old disk might be changed by file system accesses while the sector copy operation is in progress. Of course it would be possible to solve this problem by mirroring any sector write accesses to the old drive onto the new one at driver level.

Edited by madshi
Link to comment
1 hour ago, madshi said:

The one and only disadvantage of this solution (that I can see)

 

Not quite. Here is the way it currently works.

 

unRAID only disables a disk when a write to it fails (but see below*). That failed write, and any subsequent writes or reads of the disk, are emulated by the rest of the parity array. It is important that the failed write be emulated or filesystem corruption could be the result, since the write might have been updating filesystem information.

 

---added--- Once the disk is disabled unRAID will not use it again until rebuilt, but instead it will emulate the disk for all subsequent operations.

 

 

So, as you can see, the actual physical disk is now out-of-sync with parity, its contents are not up-to-date, and are possibly even invalid from a file integrity perspective.

 

The emulated disk, however, is in-sync with parity, its contents are up-to-date, and most likely it is completely valid from a file integrity perspective. The array can still be used and the data from the emulated disk can be accessed for whatever purpose.

 

 

* It is also possible that unRAID will try to get the data from a failed read by calculating it from the rest of the parity array, and then attempt to write it back to the disk. If that write fails the disk is disabled. So a failed read could result in a failed write that disables the disk.
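To make "emulated by the rest of the parity array" concrete, here is a toy sketch. It assumes single parity is a bytewise XOR across all data disks; the disk contents and the helper names `xor_blocks` and `emulate` are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three tiny "data disks" and the parity computed from them.
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
disk3 = bytes([0x01, 0x02, 0x03, 0x04])
parity = xor_blocks([disk1, disk2, disk3])


def emulate(surviving_disks, parity):
    """Reconstruct the one missing disk from the survivors plus parity."""
    return xor_blocks(surviving_disks + [parity])


# disk2 is disabled: its contents are derived from everything else.
emulated_disk2 = emulate([disk1, disk3], parity)
```

Here `emulated_disk2` equals `disk2`. The same arithmetic also shows why a second missing disk cannot be emulated with single parity: one XOR equation cannot be solved for two unknowns.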

Link to comment

Thanks for your feedback. But to be honest, I'm not sure how your comment relates to my suggestion.

 

You're talking about failed writes and disk emulation. My suggestion is about replacing a perfectly healthy drive in a perfectly healthy array. These two things don't have anything in common, as far as I can see. Or am I missing something?

Link to comment

Your suggestion would not work if there is not a spare slot.  There are already relatively simple ways to do what you suggest without the need for a spare slot.   All that cannot be handled is more disks failing than you have parity drives.   It is just that there are some manual steps where you want an automated solution?

Link to comment
Just now, itimpi said:

Your suggestion would not work if there is not a spare slot.  There are already relatively simple ways to do what you suggest without the need for a spare slot.   All that cannot be handled is more disks failing than you have parity drives.   It is just that there are some manual steps where you want an automated solution?

 

You're right that my suggestion requires a spare slot.

 

Which "relatively simple ways" to do what I suggest are available? The key thing I'm looking for is the ability to replace a perfectly healthy disk without losing Parity protection during the replacement process. As far as I understand, the standard unRAID method to replace a smaller healthy disk with a larger new disk is the following:

 

1) Stop the array. (And power down if there's no spare slot.)

2) Replace the old disk in the array with the new one.

3) Start + rebuild the array.

 

Correct? Let's say disk 5 (a completely unrelated disk) dies during step 3). If I only have 1 Parity disk, all data of disk 5 will be completely lost! Correct? Don't you consider this a serious unRAID limitation? It's currently not possible (at least not using the recommended approach) to replace a perfectly healthy disk with another one without risking data loss if only 1 disk fails during step 3).

 

My suggestion allows replacing a (healthy) disk with a new one, in a perfectly safe way - unlike unRAID's current recommended approach, which is *not* safe.

Link to comment
4 minutes ago, remotevisitor said:

Your current suggestion also requires the replacement disk to be the same size as the disk it is replacing.

 

if the replacement disk is larger you would also have to arrange for the additional space to be zeroed otherwise parity would be invalidated.

 

True, but zeroing the additional sectors should be a piece of cake.
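A small sketch of why zeroing is enough, assuming single parity is a bytewise XOR and that sectors beyond the end of a smaller disk count as zeros. All values and the helper name `xor_parity` are invented for the illustration.

```python
def xor_parity(disks, length):
    """Single parity over disks of unequal size; missing bytes count as zero."""
    parity = bytearray(length)
    for disk in disks:
        for i, byte in enumerate(disk):
            parity[i] ^= byte
    return bytes(parity)


small_disk = bytes([0x12, 0x34])              # the disk being replaced
other_disk = bytes([0xAB, 0xCD, 0xEF, 0x01])  # another, larger data disk
parity_before = xor_parity([small_disk, other_disk], 4)

# Bigger replacement disk, extra space zeroed: parity stays exactly the same.
zero_filled = small_disk + bytes(2)
parity_zeroed = xor_parity([zero_filled, other_disk], 4)

# Bigger replacement disk with leftover junk in the extra space: parity breaks.
stale_tail = small_disk + bytes([0xFF, 0xFF])
parity_stale = xor_parity([stale_tail, other_disk], 4)
```

`parity_zeroed` is identical to `parity_before`, because XOR with zero changes nothing, while `parity_stale` differs from it.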

 

Just now, bonienl said:

 

That's why dual parity was introduced.

 

 

Dual parity is nice and all. But it also costs more (hardware).

 

Nobody likes my suggestion? It would make it possible to safely replace healthy hard disks even when using only 1 Parity disk. Doesn't it bother you guys at all that (when using only 1 Parity disk) if another disk fails while trying to replace a *healthy* disk, data loss occurs?

Link to comment

I reacted because you talk about a serious limitation in unRAID, which isn't true in my view. Fact: a single parity disk can replace a single failed disk, there is a certain risk when rebuilding the newly installed disk, but this can be mitigated by using dual parity.

 

I have done many disk upgrades over the years, and it never happened to me that a second disk failed during a rebuild. Perhaps I am lucky? But I believe it is the general experience with most users.

 

Link to comment
21 minutes ago, madshi said:

 

You're right that my suggestion requires a spare slot.

 

Which "relatively simple ways" to do what I suggest are available? The key thing I'm looking for is the ability to replace a perfectly healthy disk without losing Parity protection during the replacement process. As far as I understand, the standard unRAID method to replace a smaller healthy disk with a larger new disk is the following:

 

1) Stop the array. (And power down if there's no spare slot.)

2) Replace the old disk in the array with the new one.

3) Start + rebuild the array.

 

Correct? Let's say disk 5 (a completely unrelated disk) dies during step 3). If I only have 1 Parity disk, all data of disk 5 will be completely lost! Correct? Don't you consider this a serious unRAID limitation? It's currently not possible (at least not using the recommended approach) to replace a perfectly healthy disk with another one without risking data loss if only 1 disk fails during step 3).

 

My suggestion allows replacing a (healthy) disk with a new one, in a perfectly safe way - unlike unRAID's current recommended approach, which is *not* safe.

Not necessarily. As long as you have the previous disk intact, you can use the New Config option to put it back to the original configuration and then use the “Parity is valid” option. You then tell unRAID the newly failed disk is bad and you can now rebuild it onto a good disk. Only if you get yet another failure will you end up losing data.

Link to comment
31 minutes ago, madshi said:

Nobody likes my suggestion? It would make it possible to safely replace healthy hard disks even when using only 1 Parity disk. Doesn't it bother you guys at all that (when using only 1 Parity disk) if another disk fails while trying to replace a *healthy* disk, data loss occurs?

 

This method requires taking the array offline. I think most users won't accept the array being offline for a long time, and it needs another port/slot for the additional hard disk (which may break the licence).

In fact, advanced users can do the same thing by running in maintenance mode and syncing the new disk; the array still tolerates 1 or 2 failed disks the whole time (with 1 or 2 parity disks).

 

So I don't think there is much difference, and every method carries risk when you plug/unplug power or SATA cables, or insert/reinsert disks.

Edited by Benson
Link to comment
50 minutes ago, madshi said:

If I only have 1 Parity disk, all data of disk 5 will be completely lost!

 

No. Disk 5 cannot be emulated with the current setup (a half-mirrored new drive). But switch the configuration back and restore the original disk that you wanted to replace, and unRAID is back in a state where it can emulate the broken disk 5.

 

All this debate about doing it one way or another depends on whether it's important that unRAID can be online and able to perform actual tasks with the array while replacing a disk.

Edited by pwm
Link to comment
41 minutes ago, madshi said:

Thanks for your feedback. But to be honest, I'm not sure how your comment relates to my suggestion.

 

You're talking about failed writes and disk emulation. My suggestion is about replacing a perfectly healthy drive in a perfectly healthy array. These two things don't have anything in common, as far as I can see. Or am I missing something?

 

Since this request was linked from another thread in which you did have a missing disk I didn't notice the distinction. There is also this in the OP

 

2 hours ago, madshi said:

> Array cannot tolerate a disk failure without potential data loss to

> both the disk being replaced and the additional disk that has failed.

 

In the "normal" situation, I'm not sure how the disk being replaced is at risk since it wouldn't be part of the array and possibly not even installed. As itimpi noted, the original disk can be used to recover, and in fact, we often recommend to replace a disabled disk even if it looks good, just so it will still be available if something goes wrong with the rebuild.

 

The one thing I would be concerned about with your idea, and which possibly needs further elaboration: what happens if the process is interrupted for some reason, such as a power failure? unRAID is to some extent stateless, in that it doesn't remember much from boot to boot except for what is on the flash. And if power failed it couldn't save any state to flash.

 

 

Link to comment
9 minutes ago, bonienl said:

I reacted because you talk about a serious limitation in unRAID, which isn't true in my view. Fact: a single parity disk can replace a single failed disk, there is a certain risk when rebuilding the newly installed disk, but this can be mitigated by using dual parity.

 

I have done many disk upgrades over the years, and it never happened to me that a second disk failed during a rebuild. Perhaps I am lucky? But I believe it is the general experience with most users.

 

Of course it's somewhat unlikely that a disk failure happens exactly during the short time period when you do the array rebuild. But it's still possible, and somewhat scary (to me, at least). The ugly truth is that you *can* get data loss even if only one disk fails. Which is exactly what unRAID is supposed to protect us from!

 

I thought the main purpose of Dual Parity was to protect against having 2 drives fail at once? If actually 2 drives fail at once while replacing a healthy disk with a new disk, once again there will be data loss even when using Dual Parity, while my suggestion would have no data loss (when using Dual Parity).

 

6 minutes ago, itimpi said:

Not necessarily. As long as you have the previous disk intact, you can use the New Config option to put it back to the original configuration and then use the “Parity is valid” option. You then tell unRAID the newly failed disk is bad and you can now rebuild it onto a good disk. Only if you get yet another failure will you end up losing data.

 

1 minute ago, Benson said:

This method requires taking the array offline. I think most users won't accept the array being offline for a long time, and it needs another port/slot for the additional hard disk (which may break the licence).

In fact, advanced users can do the same thing by running in maintenance mode and syncing the new disk; the array still tolerates 1 or 2 failed disks the whole time (with 1 or 2 parity disks).

 

Yeah, doing the rebuild in maintenance mode sounds like a decent workaround. But if a disk dies in this situation, you'd still have to do some manual tweaking to get the array back without data loss.

 

Does nobody worry about how the average user can handle any of this? We're probably all power users here in this thread. But isn't unRAID supposed to be a viable option for the average user, too? The average user won't replace healthy disks in maintenance mode, and he won't be careful not to modify the array while the array rebuilds. And he won't know how to use "New Config" and "Parity is valid" to manually tweak the array back into a usable state if a disk fails during rebuild.

 

If replacing a disk offline is too much of a problem for usability, at least unRAID could offer my suggestion as a "safe" option and if the user insists on having the array up and running immediately, unRAID should warn that it's not safe.

 

And as already mentioned in my starting post: It should be possible to implement my suggestion in such a way that it even works with the array mounted and written to, by mirroring sector writes from the old disk to the new one. But this would be more difficult to implement, of course.
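The write-mirroring idea can be illustrated with a toy model. This is not driver code: the disks are plain `bytearray` objects, and the class name `MirroredCopy` and its methods are hypothetical. The point is only that if every write goes to both disks while the background copy runs, the two end up identical regardless of whether a write lands in the already-copied or the not-yet-copied region.

```python
class MirroredCopy:
    """Toy model: copy old -> new in chunks while mirroring incoming writes."""

    def __init__(self, old, new, chunk=4):
        if len(new) < len(old):
            raise ValueError("new disk must be at least as large as old disk")
        self.old = old        # bytearray standing in for the old disk
        self.new = new        # bytearray standing in for the new disk
        self.chunk = chunk    # bytes copied per background step
        self.copied = 0       # everything below this offset is already copied

    def copy_step(self):
        """Copy the next chunk; return True while there is more to copy."""
        end = min(self.copied + self.chunk, len(self.old))
        self.new[self.copied:end] = self.old[self.copied:end]
        self.copied = end
        return self.copied < len(self.old)

    def write(self, offset, data):
        """A write arriving during the copy is applied to both disks."""
        end = offset + len(data)
        if end > len(self.old):
            raise ValueError("write beyond end of old disk")
        self.old[offset:end] = data
        self.new[offset:end] = data


old = bytearray(b"AAAAAAAAAAAA")
new = bytearray(len(old))
job = MirroredCopy(old, new)

job.copy_step()         # first 4 bytes copied
job.write(2, b"xy")     # lands in the already-copied region
job.write(9, b"z")      # lands in the region still to be copied
while job.copy_step():  # finish the background copy
    pass
```

After the loop, `new` matches `old` byte for byte. A real implementation would also have to serialize a write against a copy step touching the same sectors, which this single-threaded sketch sidesteps.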

Link to comment
2 minutes ago, trurl said:

In the "normal" situation, I'm not sure how the disk being replaced is at risk since it wouldn't be part of the array and possibly not even installed. As itimpi noted, the original disk can be used to recover

 

Yes, with a bit of luck it could be used to recover. However, if the user has already written to the new disk in the meantime, the Parity disk has changed while the "disk being replaced" has not. Which means the Parity no longer matches the "disk being replaced".
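A toy example of that mismatch, assuming single parity is a bytewise XOR. The two-byte "disks" and the helper `xor_blocks` are invented for the illustration.

```python
def xor_blocks(*blocks):
    """Bytewise XOR of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


old_disk1 = bytes([0x10, 0x20])   # the healthy disk that was pulled
disk5 = bytes([0x0A, 0x0B])       # an unrelated data disk

# The user writes to the replacement disk 1; parity is updated accordingly.
new_disk1 = bytes([0x10, 0x99])   # second byte was modified
parity = xor_blocks(new_disk1, disk5)

# Disk 5 now dies. Going back to the old disk 1 with "Parity is valid":
recovered_with_old = xor_blocks(old_disk1, parity)

# For comparison, reconstruction with the up-to-date new disk 1:
recovered_with_new = xor_blocks(new_disk1, parity)
```

`recovered_with_new` equals `disk5`, but `recovered_with_old` is wrong in exactly the position that was written after the swap. The untouched first byte still comes back correctly, which is why the recovery works as long as nothing has been written.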

 

5 minutes ago, trurl said:

The one thing I would be concerned about with your idea, and which possibly needs further elaboration: what happens if the process is interrupted for some reason, such as a power failure? unRAID is to some extent stateless, in that it doesn't remember much from boot to boot except for what is on the flash. And if power failed it couldn't save any state to flash.

 

Good point. I suppose the manual tweaking options suggested earlier in this thread might still be able to put the array back into a working state. But as I mentioned above, those manual tweaks are not for the faint-hearted, and certainly not for the average user.

 

My suggestion has the big benefit that it doesn't do anything to the array. Only once the old disk was completely copied to the new disk, the array config is modified to replace the disks. So if anything goes wrong (disk failure, power outage, whatever) while copying the old disk to the new disk, no harm was done at all, and the array will still automatically be in a working state. So my suggestion is safer, faster and even if something goes wrong, much more user friendly for the average user.
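The "only touch the array config at the very end" part could look something like the sketch below. Everything here is an assumption made for illustration: I don't know unRAID's real config format, so the example uses a made-up JSON file mapping slots to disk serials, and `commit_disk_swap` is a hypothetical name. The technique shown is the usual write-to-temp-file-then-rename pattern, so an interruption leaves either the old config or the new one on disk, never a half-written file.

```python
import json
import os
import tempfile


def commit_disk_swap(config_path, slot, new_serial):
    """Point `slot` at `new_serial`, replacing the config file atomically.

    Call this only after the sector copy has completed and been verified.
    The JSON layout is invented for this example.
    """
    with open(config_path) as f:
        config = json.load(f)
    config[slot] = new_serial

    # Write the new config next to the old one, then swap it in with a
    # single rename. os.replace is atomic when both paths are on the same
    # file system.
    directory = os.path.dirname(os.path.abspath(config_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(config, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, config_path)
```

If power is lost during the copy, `commit_disk_swap` was never called and the array still references the old disk, which has not been modified.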

 

But yes, it requires an additional SATA port, and if we want to support mounting the array during the disk copy process, unRAID would have to mirror sectors at driver level, which might cost quite a bit of development time. Alternatively, my suggestion should be doable with much less development time, if we could accept as a limitation that the array can't be started while the disk is being copied.

Link to comment

For the average user, I think the current LT implementation is direct and straightforward. I haven't found another method or idea better than this.

 

As for rebuild speed, that may be true if the replaced disk is the fastest one; if it is the slowest one, the reconstruct-write method may be faster.

Link to comment
5 minutes ago, bonienl said:

How would you handle the case where the new disk is bigger than the existing disk (most people doing an upgrade will replace it with a bigger disk)?

 

 

unRAID has the same situation when using the officially recommended approach of replacing the disk physically and then doing an array rebuild. What does unRAID currently do in that situation? I think it probably copies all sectors of the smaller disk to the larger one, and then zeroes out the remaining sectors of the larger disk? In any case, whatever unRAID uses with the current official solution, the same approach should work with my suggestion.

 

2 minutes ago, Benson said:

For the average user, I think the current LT implementation is direct and straightforward. I haven't found another method or idea better than this.

 

The current LT implementation is not safe, as explained above. If there's a disk failure while the array rebuilds, and if the user has already written to the new disk, then the Parity drive is no longer valid (when used together with the old disk).

Link to comment
3 minutes ago, Benson said:

For the average user, I think the current LT implementation is direct and straightforward. I haven't found another method or idea better than this.

 

When nothing goes wrong (the usual case), it is a very straightforward process: shut down the server, replace the disk, start the system and assign the new disk.

Link to comment
11 minutes ago, madshi said:

my suggestion should be doable with much less development time, if we could accept as a limitation that the array can't be started while the disk is being copied.

 

Not sure but I think the parity swap procedure takes the array offline while parity is being copied. And of course clearing a disk for a new slot used to take the array offline (but doesn't on recent versions), so there is some precedent for this.

 

As for needing an additional port, of course the current procedure would still work just fine if this other approach were an option. After all, that is how a missing disk is replaced.

 

I agree that the risk with the current method isn't that great if you do it correctly.

 

On the other hand, I have seen users do all sorts of things when they need to replace a disk. They come on the forum after they have done unnecessary or even wrong things, often not even remembering exactly what they did do, and then ask us what to do next.

Link to comment

FWIW, I just started a "parity swap". As part of this procedure, unRAID first copies the old parity disk to the new parity disk (sector by sector, I assume), and it does so while the array is *not* mounted. Nobody seems to complain that this procedure requires the array to be unmounted for several hours, so it doesn't seem to be a real problem for anyone?

 

This approach is very very similar to what I'm suggesting when replacing a healthy data disk with a bigger one: First copy the smaller disk to the bigger one, exactly as the parity swap procedure does, then update the array references. Done. Should be very fast and 100% fail safe.

Link to comment
