
Raid 5 Doomed Article


It makes you wonder... According to these articles, anything 12TB and over, e.g. a 16-drive unRAID environment with 1TB drives, is doomed too.

 


They failed statistics 101... like the other article.

The gist is that 12 TB is roughly 1x10^14 bits.  If the unrecoverable read error rate for each drive is 1 in 1x10^14 bits, they think you will get one error during the rebuild of a 12TB array.

But having 12 drives that each have a 1 in 1x10^14 bit error rate is NOT the same as 1 drive with a 1 in 1x10^14 bit error rate.
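
For what it's worth, the arithmetic is easy to check. This is only a minimal sketch, assuming every bit read fails independently and that the spec is 1 error per 1x10^14 bits; both are assumptions, and the function name is made up for illustration. It shows that an expected error count near one is not the same thing as a guaranteed error:

```python
import math

def p_at_least_one_ure(bytes_read: float, ure_per_bit: float) -> float:
    """Probability of one or more unrecoverable read errors, assuming
    every bit fails independently with probability ure_per_bit."""
    bits = bytes_read * 8
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

TB = 1e12     # decimal terabyte, as drive vendors count it
spec = 1e-14  # assumed spec: 1 error per 1x10^14 bits read

expected_errors = 12 * TB * 8 * spec
p = p_at_least_one_ure(12 * TB, spec)
print(f"expected errors: {expected_errors:.2f}")
print(f"P(at least one): {p:.2f}")
```

Under these assumptions the chance of hitting at least one error while reading 12 TB comes out at roughly 62%, not 100%, and it drops further if the drives do better than their published spec.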


There's also the striping-all-or-nothing approach of RAID 5 to take into account. By definition, unRAID is not vulnerable to the catastrophic failures that can happen with a striped RAID.


They failed statistics 101... like the other article.

The gist is that 12 TB is roughly 1x10^14 bits.  If the unrecoverable read error rate for each drive is 1 in 1x10^14 bits, they think you will get one error during the rebuild of a 12TB array.

But having 12 drives that each have a 1 in 1x10^14 bit error rate is NOT the same as 1 drive with a 1 in 1x10^14 bit error rate.

100% true. They also don't consider that a read error rate of 1 in 1x10^14 bits is a manufacturer's value, which means that to justify it, the 95% confidence bound on the error rate in the manufacturer's statistical analysis (subject to further refinement through safety factors) must be no worse than 1 in 1x10^14.  Most of the drives on the market would therefore do better than the marketed value, i.e. read more bits between errors (unless they have some legal way around this, the engineers have to "prove" the values are legit).

 

Cheers,

Matt

There's also the striping-all-or-nothing approach of RAID 5 to take into account.

 

Which brings me back to my question from several weeks ago: what does unRAID do when it encounters an uncorrectable read error while operating in degraded mode (i.e. with a failed data drive)?

 

It needs to continue, and not mark the drive as bad.

 

 


Which brings me back to my question from several weeks ago: what does unRAID do when it encounters an uncorrectable read error while operating in degraded mode (i.e. with a failed data drive)?

 

It needs to continue, and not mark the drive as bad.

 

I agree.

 

unRAID does not take a drive out of service for a read error, only for a write error.  In this scenario (rebuilding a drive), I don't think that unRAID would stop for a read error.  It would likely treat the sector as all zeros and continue on.  This is just a guess.

 

TOM IF YOU ARE READING COULD YOU CONFIRM OR DENY?  INQUIRING MINDS WANT TO KNOW!


If array parity is valid, then for an unrecoverable...

Write error: the drive is 'disabled' but parity is updated (so that drive contents can be reconstructed);

Read error: block is 'reconstructed' by reading all other drives plus parity.  Result is then re-written to bad block (if this subsequent write fails then see 'write error' case above).

 

If array parity is not valid, then all unrecoverable errors are 'passed up' to the caller - this will result in the originating application getting an I/O error, or possible loss of data if we're talking about a cache flush write.
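
To illustrate the read-error case above, here is a minimal sketch of rebuilding one block from the other drives plus parity. It assumes plain XOR single parity and tiny made-up blocks; the helper name `xor_blocks` is hypothetical, and this is not unRAID's driver code:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data drives, one 4-byte block each
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Drive 1 returns a read error: rebuild its block from the others plus parity
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
```

The same property explains the invalid-parity case: without a trustworthy parity block there is nothing to XOR against, so the error can only be passed up to the caller.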

 

A future feature would be to disable all write-behind, from Samba all the way to the driver, if array parity is not valid.  But this would really slow down writes.


But the question is --- if a drive is being rebuilt, and you get a read error from one of the drives (parity or data) during the rebuild, would unRAID terminate the reconstruct of the drive?  Or would it just go on to the next sector and complete the reconstruction on a best effort basis?


But the question is --- if a drive is being rebuilt, and you get a read error from one of the drives (parity or data) during the rebuild, would unRAID terminate the reconstruct of the drive?  Or would it just go on to the next sector and complete the reconstruction on a best effort basis?

 

The reconstruct will continue.


Thanks!  That's what I thought but good to know for sure.

 

Added to the "Best of the Forums", "Hail to the Chief" section here.

The reconstruct will continue.

 

Thank you.

 

Will there be any indication the unrecoverable error was experienced?


The reconstruct will continue.

 

Thank you.

 

Will there be any indication the unrecoverable error was experienced?

 

There will be the original error posted in the system log, but otherwise no.  In looking at the code & thinking about this, we should increment the 'Sync Errors' counter when this happens (as we do for errors detected during Parity Check).  This will be in the next release.
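
As a rough illustration of that behavior, here is a sketch of a rebuild loop that counts the error and carries on. Everything in it is hypothetical: the `rebuild` function, the 4-byte sectors, `None` standing in for an unreadable sector, and the all-zeros placeholder (which is only the guess made earlier in this thread, not confirmed behavior):

```python
from functools import reduce

SECTOR = 4  # tiny sector size, for illustration only

def rebuild(surviving_drives, num_sectors):
    """Rebuild a missing drive sector by sector; never abort on a read error.

    Each surviving drive is a list of sectors (bytes), with None standing in
    for a sector that returns an unrecoverable read error.
    Returns (rebuilt_sectors, sync_errors).
    """
    rebuilt, sync_errors = [], 0
    for i in range(num_sectors):
        sectors = [drive[i] for drive in surviving_drives]
        if any(s is None for s in sectors):
            sync_errors += 1               # count it, then keep going
            rebuilt.append(bytes(SECTOR))  # placeholder: zeros (a guess)
            continue
        # XOR the surviving sectors together to recover the missing one
        rebuilt.append(bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*sectors)))
    return rebuilt, sync_errors

d0 = [b"\x01\x01\x01\x01", b"\x02\x02\x02\x02"]
parity = [b"\x0f\x0f\x0f\x0f", None]  # second parity sector is unreadable
rebuilt, errors = rebuild([d0, parity], 2)
```

Here the first sector is recovered normally, the second is filled with the placeholder, and the counter ends at 1 so the user can at least see that something went wrong.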


Some recovery tools will remap bad sectors and fill the remapped sector with some searchable string like "UNRECOVERABLE DATA UNRECOVERABLE DATA ..." so that, after recovery, the user could search the files for that string and figure out what file(s) were impacted. 

 

Having unRAID do something similar during a drive rebuild would be a nice enhancement.  It would allow a user to figure out what got corrupted, rather than just knowing something got trampled with no means to figure out what it was.  No real harm in it - if you get a bad read you know that sector is not going to rebuild correctly - might just as well put something identifiable in there.

 

You'd likely want to do this on both the restored disk AND the disk that gave the read error (unless it was parity).  Corresponding info should be in the syslog to guide a person to the affected drives.
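
Here is a small sketch of how such a marker could work. It is purely hypothetical and not an existing unRAID feature; the 512-byte sector size and the names `marker_sector` and `find_marked_sectors` are assumptions made for the example:

```python
MARKER = b"UNRECOVERABLE DATA "
SECTOR = 512  # assumed sector size

def marker_sector(size=SECTOR):
    """Build one sector filled with the repeating marker, cut to size."""
    repeats = size // len(MARKER) + 1
    return (MARKER * repeats)[:size]

def find_marked_sectors(image: bytes, size=SECTOR):
    """Return the indexes of sectors that consist entirely of the marker fill."""
    fill = marker_sector(size)
    return [i // size for i in range(0, len(image), size) if image[i:i + size] == fill]

# A three-sector "disk": good data, one marked sector, more good data
image = bytes(SECTOR) + marker_sector() + b"\x42" * SECTOR
print(find_marked_sectors(image))
```

Scanning by sector index like this finds the damaged locations on the raw disk; mapping those back to file names would still need filesystem-level tools, or a plain text search of the files for the marker string.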

 

I think that this would be a great advertising point!  "Stripe kill" is such a hot topic of criticism of RAID-5.  A robust story about how unRAID gracefully handles this deadly (and relatively common) occurrence, giving the user the ability to recover almost all of their data and the tools to figure out what, if anything, got corrupted, would be a great selling point IMO.

