
Creation of 3rd Disk Type for UnRaid Array (Spare)



Posted

It's been discussed in a few threads, but I don't see an actual FR thread for this.

 

I'd like to propose the creation of a 3rd disk type for an UnRaid array. Presently there are data and parity disks, but I'd like to see "spare" added as a type. This would be a disk which must be equal to or smaller than parity, and it would sit idle/spun down until needed. My request is for a single spare, but I suspect the logic to do 2 x spare wouldn't be vastly different. This would also coincide with the array supporting 2 x parity.

 

Upon a drive failure within the UnRaid array:

  • The system would perform two checks:
    • the spare's capacity is equal to or greater than the failed disk's
    • the spare's capacity is equal to or less than the installed parity disk(s)
  • The system then modifies the array config to remove the failed drive and replace it with the spare. Something like this already happens today when a disk fails while the array is up, with emulation taking over automatically.
  • The system begins a data disk rebuild, and the user can then decide whether to install a replacement drive as the new spare, reinstall the dead disk as the spare, or do nothing. (A rough sketch of the check logic follows below.)
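
To be clear about what I mean by the two checks, here's a minimal sketch of the promotion decision (plain Python; the Disk type and pick_spare helper are invented purely for illustration, nothing like this exists in Unraid today):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Disk:
    slot: str
    capacity_tb: float  # capacities simplified to whole TB for the sketch

def pick_spare(failed: Disk, parity: List[Disk], spares: List[Disk]) -> Optional[Disk]:
    """Return a spare that passes both proposed checks, or None.

    Check 1: spare capacity >= failed disk capacity (it must hold the rebuild).
    Check 2: spare capacity <= smallest parity disk (no data disk may exceed parity).
    """
    parity_limit = min(p.capacity_tb for p in parity)
    for s in spares:
        if failed.capacity_tb <= s.capacity_tb <= parity_limit:
            return s
    return None  # no usable spare: fall back to today's behaviour (emulate + alert the user)

# Example: an 8 TB data disk drops, 10 TB parity is installed, an 8 TB spare sits idle.
spare = pick_spare(Disk("disk3", 8), [Disk("parity", 10)], [Disk("spare1", 8)])
print(spare)  # Disk(slot='spare1', capacity_tb=8) -> reassign the slot and start the rebuild
```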

 


Posted

Sounds more like a hot spare RAID concept.
The main array is the product's namesake, Unraid, and I don't think a hot spare keeps with the intent of the product label, although it's not a bad idea. If you want that extra data protection, I think you're better off just adding a second parity disk, that is, if you're going to use the main array.
If you were going to use a ZFS pool, though, the new 7.0 might have a feature close to what you're looking for, to a degree. You can set up a hot spare, from what I saw in a video.

But a ZFS pool is not the main array, so it's very different. Someone else should comment; I'm not really that much of an expert. I've just been using and playing with Unraid for a few years now.

Posted

The issue with this is that the vast majority of dropped drives aren't actually bad; the issue lies with the controller, power, or interface cables. Automatically doing anything without assessing why the failure happened risks turning a recovery into a data loss situation. Better to alert the operator to allow intelligent assessment and proper mitigation before causing harm.

Posted (edited)
36 minutes ago, JonathanM said:

The issue with this is that the vast majority of dropped drives aren't actually bad; the issue lies with the controller, power, or interface cables. Automatically doing anything without assessing why the failure happened risks turning a recovery into a data loss situation. Better to alert the operator to allow intelligent assessment and proper mitigation before causing harm.

 

That's understandable, but I'd counter with the scenario that UnRaid auto-emulates contents when a disk drops. The fact is that UnRaid already takes measures into its own hands when a disk drops, so providing an enhanced means of data continuity would be beneficial where access or means of repair involve extended lead times. Just like it's not mandatory to run parity disk(s), this wouldn't be a mandatory feature but rather an additional option for those who cannot access the system and want to maintain full operation. My scenario is this: I routinely find myself traveling for 16-32 hours without access due to being on an airplane. If a drive fails within those 16-32 hours, I could be back at full operation by the time I land or reacquire internet access. For this reason I already keep a warm spare disassociated from the array.

 

45 minutes ago, Bizquick said:

Sounds more like a hot spare RAID concept.
The main array is the product's namesake, Unraid, and I don't think a hot spare keeps with the intent of the product label, although it's not a bad idea. If you want that extra data protection, I think you're better off just adding a second parity disk, that is, if you're going to use the main array.
If you were going to use a ZFS pool, though, the new 7.0 might have a feature close to what you're looking for, to a degree. You can set up a hot spare, from what I saw in a video.

But a ZFS pool is not the main array, so it's very different. Someone else should comment; I'm not really that much of an expert. I've just been using and playing with Unraid for a few years now.

Dual parity simply allows two disks to fail simultaneously before data loss occurs. To remedy that and bring the array back to fully operational, the dead drives would need to be assessed and either rebuilt onto themselves or replaced. I'm proposing eliminating the manual step so that the sense of urgency or emergency is mitigated.

Edited by DiscoverIt
Posted
4 minutes ago, DiscoverIt said:

if a drive fails

How will unraid know the difference between a drive failure and the much more common communication failure? Both result in a write failing, which drops the drive, but communication failures often affect more than one drive, and trying to rebuild without correcting the issue could cause the rebuild to be corrupt.

 

8 minutes ago, DiscoverIt said:

For this reason I already keep a warm spare disassociated from the array.

Automatically starting a rebuild without investigating the cause is risky; at least with your current method you can look at the syslog and SMART reports to see if it's a clean drive failure, as well as monitor the rebuild process.

 

I get it, you want the option to add risk to gain convenience, but that's not normally the path taken with adding unraid features. You have to balance the time spent on developing and testing a feature against the amount of gain by adding it. Development time is a hot commodity, and since ZFS already supports hot spares it's going to be a hard sell to modify the traditional unraid array.

Posted

Yeah, those are some good points.
It's strange that you're having drives fall offline. I'm guessing you have the power management settings set up to spin drives down, rather than having them all spin up and run until needed all the time.
If that's the case it could be a number of things. Hardware compatibility is what most people will jump to first.

My guess is that something in power management is not sending the command to bring the drive back online in a way the Unraid OS recognizes. Maybe when it spins back up it changes drive position; if so, that would make me lean towards a SATA chipset compatibility issue (so yes, hardware), but someone might have a workaround for a case like that.

Your idea sounds pretty good, but I think JonathanM is right. I don't see the devs being interested in adding more functions like this to the main array now that OpenZFS is getting more support.

 

 

Posted (edited)
29 minutes ago, JonathanM said:

How will unraid know the difference between a drive failure and the much more common communication failure? Both result in a write failing, which drops the drive, but communication failures often affect more than one drive, and trying to rebuild without correcting the issue could cause the rebuild to be corrupt.

 

Automatically starting a rebuild without investigating the cause is risky; at least with your current method you can look at the syslog and SMART reports to see if it's a clean drive failure, as well as monitor the rebuild process.

 

I get it, you want the option to add risk to gain convenience, but that's not normally the path taken with adding unraid features. You have to balance the time spent on developing and testing a feature against the amount of gain by adding it. Development time is a hot commodity, and since ZFS already supports hot spares it's going to be a hard sell to modify the traditional unraid array.

 

I'm not certain how the drive drops is relevant, as a dropped drive, regardless of cause, must be rebuilt and cannot simply be re-added to the array. Yes, it would be up to the user to determine whether the drive is reusable or should be replaced. This proposal is simply about automating the process of getting back to operational by using a known good drive, pausing the urgency clock. I'm admittedly a little confused how a hot-spare concept increases risk. Can you give an example where rolling out a hot spare in this scenario would negatively impact or harm the data integrity of the array?

 

While ZFS does support spares, it has two crucially detrimental annoyances: no single-drive expansion and no mixed-size vdevs. I mean this with all positive respect, but the core functionality of an UnRaid array has not been improved or enhanced in years. I think it's actually closer to a decade now since dual parity was introduced, and since then I cannot recall a notable feature improvement. There's plenty of meat (such as multiple arrays, larger device-count arrays, multi-threaded parity calculation, etc.) being left on the bone in the UnRaid model, imo.

Edited by DiscoverIt
Posted (edited)
25 minutes ago, Bizquick said:

Yeah, those are some good points.
It's strange that you're having drives fall offline. I'm guessing you have the power management settings set up to spin drives down, rather than having them all spin up and run until needed all the time.
If that's the case it could be a number of things. Hardware compatibility is what most people will jump to first.

My guess is that something in power management is not sending the command to bring the drive back online in a way the Unraid OS recognizes. Maybe when it spins back up it changes drive position; if so, that would make me lean towards a SATA chipset compatibility issue (so yes, hardware), but someone might have a workaround for a case like that.

Your idea sounds pretty good, but I think JonathanM is right. I don't see the devs being interested in adding more functions like this to the main array now that OpenZFS is getting more support.

 

 

Knock on wood, my configuration is rock stable, but as they say, an ounce of prevention is worth a pound of cure. I have my own comments about ZFS that I'll keep to myself, but most converts from TN we see end up moving to the UnRaid model because it's far superior for their use case ;)

Edited by DiscoverIt
Posted
54 minutes ago, DiscoverIt said:

That's understandable, but I'd counter with the scenario that UnRaid auto-emulates contents when a disk drops.

Except that when a disk drops for non-drive-related reasons, which seems to be the most common problem, the likelihood of another drive dropping during the rebuild would be high.

Posted
27 minutes ago, DiscoverIt said:

 

I'm not certain how the drive drops is relevant, as a dropped drive, regardless of cause, must be rebuilt and cannot simply be re-added to the array. Yes, it would be up to the user to determine whether the drive is reusable or should be replaced. This proposal is simply about automating the process of getting back to operational by using a known good drive, pausing the urgency clock. I'm admittedly a little confused how a hot-spare concept increases risk. Can you give an example where rolling out a hot spare in this scenario would negatively impact or harm the data integrity of the array?

Not sure I can agree with that statement. The reason the drive dropped is very relevant, because it goes back to what I was talking about.
You could be facing hardware compatibility and SATA chipset support issues. If it's dropping and coming back online on a different interface, say it's sdg and comes back up as sdi or something to that effect, then you might be able to get someone to help write a script to force certain serial numbers to always come back on the same interface. I think Unraid tries to do this by default already, but for some reason maybe it's not able to. If you're seeing issues like this and you're looking for a workaround so you can support yourself or another customer while you're in the air, then to me it sounds like you have something else you need to work on, rather than just buying more time to address drive issues.
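
Just as an illustration of the kind of lookup such a script would do, something like this maps each drive's serial to whatever letter it got on this boot (plain Python shelling out to lsblk; this is only a sketch, not anything Unraid actually ships):

```python
import json
import subprocess

def serial_to_device() -> dict:
    """Map each whole disk's serial number to its current /dev node."""
    # -d: whole disks only (no partitions), -J: JSON output
    out = subprocess.run(
        ["lsblk", "-dJ", "-o", "NAME,SERIAL"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {
        dev["serial"]: f"/dev/{dev['name']}"
        for dev in json.loads(out)["blockdevices"]
        if dev.get("serial")
    }

if __name__ == "__main__":
    # e.g. {'WD-WCC7K1234567': '/dev/sdg', ...} regardless of which sdX
    # letter the drive happened to get after a reconnect
    print(serial_to_device())
```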
I just haven't experienced the kind of case you're presenting. The case feels really odd and seems like you have something else going on.

I've done a lot of my own hardware testing and building with a lot of strange old parts. What you've described sounds to me like a SATA multiplexer interface is in use and the Unraid OS is not bringing the drive back after it went into a sleep mode or something. If I saw what you describe happen, I would be looking at my hardware and the logs to find something. Something I found recently myself: HBA cables and fan noise can actually cause drive events and SMART errors. I've had to do really good cable management recently to fix some of it, as well as replace some HBA cables. Ignore all that, I'm just talking about myself, sorry about that.
 

Posted
52 minutes ago, Bizquick said:

Not sure I can agree with that statement. The reason the drive dropped is very relevant, because it goes back to what I was talking about.
You could be facing hardware compatibility and SATA chipset support issues. If it's dropping and coming back online on a different interface, say it's sdg and comes back up as sdi or something to that effect, then you might be able to get someone to help write a script to force certain serial numbers to always come back on the same interface. I think Unraid tries to do this by default already, but for some reason maybe it's not able to. If you're seeing issues like this and you're looking for a workaround so you can support yourself or another customer while you're in the air, then to me it sounds like you have something else you need to work on, rather than just buying more time to address drive issues.
I just haven't experienced the kind of case you're presenting. The case feels really odd and seems like you have something else going on.

I've done a lot of my own hardware testing and building with a lot of strange old parts. What you've described sounds to me like a SATA multiplexer interface is in use and the Unraid OS is not bringing the drive back after it went into a sleep mode or something. If I saw what you describe happen, I would be looking at my hardware and the logs to find something. Something I found recently myself: HBA cables and fan noise can actually cause drive events and SMART errors. I've had to do really good cable management recently to fix some of it, as well as replace some HBA cables. Ignore all that, I'm just talking about myself, sorry about that.
 

 

I'm not certain where you get the idea that there is an issue at hand. This is about preventing or addressing future potential problems and enhancing the core functionality of the Unraid array.

 

1 hour ago, Kilrah said:

Except that when a disk drops for non-drive-related reasons, which seems to be the most common problem, the likelihood of another drive dropping during the rebuild would be high.

 

Possibly, and we can probably agree that of all communication problems the most common is caused by a faulty cable. The danger of losing another drive on neighboring communication channels is quite unsubstantiated in my experience here. If a backplane or breakout cable fails, you will have an n+1 scenario where you will likely see 4+ drives drop, which this feature would not help with, as you can only rebuild from, at absolute most, a two-drive failure. I couldn't say whether all 4+ drop immediately or whether a delay occurs; in either case the outcome would be the same, and you would then have to take a different approach by forcing a new config, rebuilding parity, and lastly hoping no FS corruption occurred. However, with a single or two-drive failure the spare could be deployed to get the array operational while the user troubleshoots the communication problem.
