P+Q or dual parity in detail



There were many discussions on the forum during the pre-release phase, but now that it's released, how is it performing?

I have been browsing the wiki for a while now and couldn't find any info on this new feature.

 

http://www.lime-technology.com/wiki/index.php/UnRAID_Manual_6#Parity-Protected_Array

http://lime-technology.com/wiki/index.php/FAQ#How_does_parity_work.3F

 

Actually I found this:

P (Parity1) is calculated like this: P = d1 + d2 + ... + dn

Q (Parity2) is calculated like this: Q = d1² + d2² + ... + dn²
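
To make the math above a bit more concrete, here is a minimal sketch in Python of how dual parity is commonly computed (the standard RAID-6 style construction: P is a plain XOR across the data disks, Q is a weighted sum over GF(2^8) where each disk slot gets a different power of the generator). This is only an illustration of the principle, not unRAID's actual code, and the exact weighting unRAID uses may differ:

```python
# Illustration only: standard RAID-6 style P and Q over GF(2^8).
# Not unRAID's actual implementation.

def gf_mul2(x: int) -> int:
    """Multiply a byte by 2 in GF(2^8) using the usual 0x11D polynomial."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x & 0xFF

def compute_parity(data_blocks: list[bytes]) -> tuple[bytes, bytes]:
    """data_blocks[i] holds one byte-run from data disk i+1 (all equal length)."""
    n = len(data_blocks[0])
    p = bytearray(n)
    q = bytearray(n)
    # Horner's rule over the disks in reverse order gives
    # Q = d1 + g*d2 + g^2*d3 + ...  (each slot has a distinct power of g),
    # while P = d1 + d2 + ... is a plain XOR and ignores slot order.
    for block in reversed(data_blocks):
        for i in range(n):
            p[i] ^= block[i]
            q[i] = gf_mul2(q[i]) ^ block[i]
    return bytes(p), bytes(q)

# Toy example with two 4-byte "disks":
d1, d2 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"
p, q = compute_parity([d1, d2])
```

Losing any one data disk can be recovered from P alone (XOR of everything else); losing two requires solving the P and Q equations together, which is why Q needs the Galois-field arithmetic.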

 

But what about the facts?

 

Requirements:

File system?

CPU power?

Comparison to only P-parity?

 

New behavior:

Having drives spinning all the time?

During read?

During write?

 

Adding the Q-drive? Rebuild necessary?

Remove Q-drive? Rebuild of P-drive necessary?

Will a parity error be traced back/reported to the drive that caused it?

 

Anything else that is worth mentioning?

 

I consider this feature a USP (unique selling point), or at least special compared to the competition, and therefore it should be highlighted and explained in more detail.

Link to comment

But what about the facts?

 

Requirements:

File system?

CPU power?

Comparison to only P-parity?

File system independent

Most users have reported little change although there seem to be some exceptions.

Not quite sure what you mean by comparison to only P-parity, other than that you get protection against 2 disks failing.

 

New behavior:

Having drives spinning all the time?

During read?

During write?

No difference in drives spinning compared to single parity.

 

Adding the Q-drive? Rebuild necessary?

Remove Q-drive? Rebuild of P-drive necessary?

Will a parity error be traced back/reported to the drive that caused it?

Adding the Q drive means parity then needs building on it (as it did for the P parity drive).

Removing the Q drive is possible at any time - it just reverts to single-disk protection.

No tracing of the disk causing a parity error. Far too compute-heavy to be practical.

 

Anything else that is worth mentioning?

Not really.

It does introduce one restriction in that the position of the data drives is now relevant to maintaining parity, so you can no longer shuffle them after a New Config and keep parity valid, as was possible with P parity alone.

 

For most users it is just an additional parity disk to provide protection against 2 disks failing simultaneously with nothing much else to think about.

Link to comment

It does introduce one restriction in that the position of the data drives is now relevant to maintaining parity, so you can no longer shuffle them after a New Config and keep parity valid, as was possible with P parity alone.

Correct, the "Parity 2", or "Q" parity disk content is dependent on which "disk" positions the devices are in when Parity 2 is calculated.  The "Parity", or "P" parity disk still operates the same (no positional dependency).

 

Also, in case this isn't obvious:

- An array with a single "Disk 1" data disk, along with either P or Q is actually a 2-way mirror (that is, contents of both devices is identical).

- An array with a single "Disk 1" data disk, along with both P and Q is actually a 3-way mirror.

But

- In an array with, say, a single "Disk 2" data disk, along with both P and Q, Disk 2 and P are identical, but Q is different.
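
A tiny illustration of both points, reusing the same toy GF(2^8) construction as the sketch earlier in the thread (again only a sketch with hypothetical data, not unRAID code):

```python
# Toy demo: Q mirrors a lone Disk 1 but not a lone Disk 2, and Q (unlike P)
# changes if two data disks swap slots. Illustration only.

def gf_mul2(x: int) -> int:
    x <<= 1
    return (x ^ 0x11D) & 0xFF if x & 0x100 else x & 0xFF

def q_parity(slots: list[bytes]) -> bytes:
    """slots[0] is Disk 1, slots[1] is Disk 2, ...; unused slots are all zeros."""
    q = bytearray(len(slots[0]))
    for block in reversed(slots):
        for i in range(len(q)):
            q[i] = gf_mul2(q[i]) ^ block[i]
    return bytes(q)

zeros = bytes(4)
data = b"\xAA\xBB\xCC\xDD"

print(q_parity([data]) == data)          # True:  lone Disk 1 -> Q is an exact mirror
print(q_parity([zeros, data]) == data)   # False: lone Disk 2 -> Q differs from the data

a, b = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"
print(q_parity([a, b]) == q_parity([b, a]))  # False: swapping slots changes Q
# P, being a plain XOR, is unaffected by slot position in every case above.
```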

Link to comment

Thank you so far.

As for the CPU requirements, it seems there can be significant differences in parity check times.

http://lime-technology.com/forum/index.php?topic=51036.msg491331#msg491331

 

jonnie.black and  Frank1940 have been carrying out a lot of tests.

Did you come to a conclusion, or a rule of thumb, for when it is reasonable to use Q-parity without sacrificing parity check times?

File servers in particular are designed to be energy efficient and thus have low-performance CPUs.

Most budget builds have the same characteristic.

Link to comment

Thank you so far.

As for the CPU requirements, it seems there can be significant differences in parity check times.

http://lime-technology.com/forum/index.php?topic=51036.msg491331#msg491331

 

jonnie.black and  Frank1940 have been carrying out a lot of tests.

Did you come to a conclusion, or a rule of thumb, for when it is reasonable to use Q-parity without sacrificing parity check times?

File servers in particular are designed to be energy efficient and thus have low-performance CPUs.

Most budget builds have the same characteristic.

 

Sure "budget build" NAS might take longer to check/generate P+Q, though these days a very powerful CPU can be had on a budget.

 

If you want to talk about energy efficiency, consider this: say it does take significantly longer to check/generate P+Q vs. just P.  Further, let's say you can measure this difference and calculate the yearly energy cost.  Now consider if you lose 2 devices in a P-only array.  How much "energy" (cost) will you expend trying to find all the files that are now unrecoverable?

 

Whether to use P only or P+Q, or even no P and no Q based on parity check times is irrelevant IMHO.  In other words, having P is nothing more than "insurance"; having P+Q is better, though more expensive insurance.  Everyone has to judge for themselves how much insurance they need and if they are willing to pay the cost.

 

The reason it took a "long time" for P+Q to find it's way as a feature in unRAID is that for home NAS storage you generally have two kinds of data: stuff you absolutely can't lose, and stuff that would be inconvenient to lose.  For the first set of stuff, P+Q is not good enough, and P+Q+R would not be good enough - instead you must replicate this vital data to multiple independent destinations (aka, backups).  For the second set of stuff, having only P is probably good enough 99% of the time since data that might get lost you might be able to restore from somewhere else.  Having said this, due to sheer size of storage devices, and the expanding functionality and market of unRAID, clearly P+Q is a necessary feature now.

Link to comment

I agree with all of your statements Tom!

I'm not speaking of new builds, although those guys may also need to know the requirements to be able to run certain features so they can pick hardware accordingly.

 

I just want to avoid grief on the user side, in particular for users of existing systems, built with certain intentions and expectations of their equipment. They should know what they can and cannot load onto their hardware!

 

For example, introducing v6 pushed me to replace my single-core CPUs, as I noticed that they were under 100% load during a parity check and I couldn't use the server during that time for other tasks. This was a surprise to me but luckily, swapping the CPUs was easy.

 

Looking at P+Q parity, one should know that under certain circumstances it may require significantly more effort than just adding one more drive! Many of us may be tempted (in fact a friend of mine already asked me to get him a new disk) to jump into the +Q adventure without thinking of more reasonable things like a backup server or the like.

Link to comment

...  asked me to get him a new disk ... without thinking of more reasonable things like a backup server or the like.

 

Absolutely an issue => I fear a lot of folks who don't bother with backups will now feel that with dual parity they don't need them.    But as Tom noted, for important data (i.e. "...  stuff you absolutely can't lose"), dual parity -- or even triple parity or more -- "... is not good enough."

 

... but a lot of folks simply don't think about this until they lose a disk with the last 20 years worth of family pictures on it (or some similarly important data).

 

Simple fact is a NAS is NOT a backup.    No matter how fault tolerant it is.

 

Link to comment

No tracing of the disk causing a parity error. Far too compute-heavy to be practical.

 

 

I was hoping that this would be possible. It's always a big issue to determine where a parity error is coming from, and going through multiple SMART reports manually to determine/test the possible cause is difficult.

 

What other alternative is there to avoid "bit-rot" (if I am using the correct term)? On that same topic, will Reiser or XFS be recommended as a more stable file system?

 

Link to comment

What other alternative is there to avoid "bit-rot" (if I am using the correct term)? On that same topic, will Reiser or XFS be recommended as a more stable file system?

Genuine bit-rot is actually very rare, so many people do not bother to protect against it. If it does concern you then the normal recommendation would be to use checksums (see the File Integrity plugin) to detect files that might suffer from bit-rot, and then, if any is detected, to restore the affected files from your backups.
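
For anyone who wants to roll their own rather than use the plugin, here is a minimal sketch of the checksum idea (hash every file once, re-verify later, restore any mismatches from backup). The paths below are just examples, and the actual File Integrity plugin works differently; this only shows the principle:

```python
# Minimal checksum manifest: build once, verify later, restore mismatches from backup.
import hashlib, json, os

def file_hash(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(root: str, manifest_path: str) -> None:
    manifest = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            p = os.path.join(dirpath, name)
            manifest[p] = file_hash(p)
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)

def verify_manifest(manifest_path: str) -> list[str]:
    with open(manifest_path) as f:
        manifest = json.load(f)
    # Any file whose current hash differs may have suffered bit-rot (or a normal edit).
    return [p for p, digest in manifest.items()
            if os.path.exists(p) and file_hash(p) != digest]

# Example usage (hypothetical paths):
# build_manifest("/mnt/disk1", "/boot/config/disk1.hashes.json")
# suspect_files = verify_manifest("/boot/config/disk1.hashes.json")
```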

 

 

Link to comment

No tracing of the disk causing a parity error. Far too compute-heavy to be practical.

 

 

I was hoping that this would be possible. It's always a big issue to determine where a parity error is coming from, and going through multiple SMART reports manually to determine/test the possible cause is difficult.

 

What other alternative is there to avoid "bit-rot" (if I am using the correct term)? On that same topic, will Reiser or XFS be recommended as a more stable file system?

If you are concerned about bit-rot then use btrfs and ECC RAM.  If not so much, use xfs.  These days I think it's wise to avoid ReiserFS.  (Meaning: I don't think anyone needs to rush and convert all their existing reiserfs volumes - I still have a large number of them - but for all new storage devices, btrfs or xfs is the way to go.)

Link to comment

No tracing of the disk causing a parity error. Far too compute-heavy to be practical.

 

 

I was hoping that this would be possible. It's always a big issue to determine where a parity error is coming from, and going through multiple SMART reports manually to determine/test the possible cause is difficult.

 

What other alternative is there to avoid "bit-rot" (if I am using the correct term)? On that same topic, will Reiser or XFS be recommended as a more stable file system?

If you are concerned about bit-rot then use btrfs and ECC RAM.  If not so much, use xfs.  These days I think it's wise to avoid ReiserFS.  (Meaning: I don't think anyone needs to rush and convert all their existing reiserfs volumes - I still have a large number of them - but for all new storage devices, btrfs or xfs is the way to go.)

 

Thank you sir! Yes, that makes sense (backups first, then file integrity, but also improving the hardware and file system seems to be a way to battle it).

 

I recently encountered a few parity errors (after having 0 for the past year), but I think it's probably due to a power failure while writing to the array (the mover was running).

I will investigate with a few repeat checks to make sure the hardware is still cooperating.

Link to comment

One way you can provide better data integrity is to maintain current checksums of all of your files.

 

Then, if you have a parity error and suspect that it was on a specific disk (perhaps because there were errors reported on that disk), you can run a checksum validation on the files on that disk to identify any files that have changed.

 

MOST detected parity errors are actually on the parity disk(s) ... which is why UnRAID's default behavior is to simply correct the parity disk --> this is almost always the right thing to do.    I can say from experience that the few times I've had parity errors over the past 6-8 years, I've always run a full checksum validation (or, before I had checksums, a complete comparison with my backups) and in EVERY case there were no errors in any of the data files ... i.e. it was ALWAYS the parity disk itself which had been in error.

 

It would, of course, be nice if the dual parity system could at least identify which disk the error(s) were likely on by cross-referencing the parity calculations; but this would indeed require a fairly intensive computational check and could still not be completely accurate => I suspect that's why Limetech didn't include that as part of their implementation of dual parity.

 

Link to comment

I wrote a fairly detailed outline of why it's not easy to identify the exact error here:

https://lime-technology.com/forum/index.php?topic=51962.msg500479#msg500479

 

As Tom has noted, you could easily identify the error position *IF* you knew there was only one error in the parity stripe; but if there are multiple errors, attempting to correct the specific disk could result in propagating even more errors.

 

It IS possible to isolate the exact error location and confirm it's the only bit error (I outlined it in the post I referenced above), but it's a fairly involved process and wouldn't help if it found multiple error locations (except you'd then KNOW there were multiple locations ... just not which ones they were).
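
For the curious, here is a rough sketch of the single-error case using the standard RAID-6 math (generator g = 2 over GF(2^8)); unRAID's internal layout may differ, so treat the indexing as purely illustrative, and as noted above it only gives a meaningful answer if the single-error assumption actually holds:

```python
# Rough sketch of single-error location with dual parity (standard RAID-6
# construction, generator g = 2 over GF(2^8)); illustration only, not unRAID code.

def gf_mul2(x: int) -> int:
    x <<= 1
    return (x ^ 0x11D) & 0xFF if x & 0x100 else x & 0xFF

# Log/antilog tables for GF(2^8) with generator 2.
EXP = [0] * 255
LOG = [0] * 256
v = 1
for i in range(255):
    EXP[i] = v
    LOG[v] = i
    v = gf_mul2(v)

def locate_single_error(p_syndrome: int, q_syndrome: int) -> int:
    """
    p_syndrome = stored P byte XOR recomputed P byte (equals the error value e)
    q_syndrome = stored Q byte XOR recomputed Q byte (equals g^z * e)
    Returns the 0-based data-disk index z -- valid ONLY if exactly one disk is
    wrong at this byte position. If only one syndrome is non-zero, the error is
    on the P or Q drive itself rather than on a data disk.
    """
    if p_syndrome == 0 or q_syndrome == 0:
        raise ValueError("not a single-data-disk error at this position")
    return (LOG[q_syndrome] - LOG[p_syndrome]) % 255
```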

 

I think simply maintaining checksums, and using verification of those to confirm whether or not there were any actual data errors after you've had an identified parity error, is the best approach.

 

Link to comment

Interesting.  If you're right, and it is possible to "isolate the exact error location and confirm it's the only bit error", then I'd say it would almost certainly be worth the computation time to fix the actual problem rather than assume the parity bit is wrong.  But Tom made it sound like this wasn't possible?

Link to comment

Note that you can only do that  *IF* there is only a single bit error in that parity stripe.    The computations aren't all that complex, but there's no way to be CERTAIN that there's only a single error in the stripe => and the consequences if there are multiple errors would be that you could corrupt good data by "correcting" the wrong bit.

 

Tom's logic is that correcting parity ensures that at worst you have one file corrupted by the error; but you've preserved the ability to rebuild any other failed disk (including the one that has the corrupted bit if the error wasn't really on the parity drive).    It IS true that the parity disk is the one most likely in error -- UNLESS the system shows that there was an error reported on one of the data disks ... and if that was true, it should have been corrected by UnRAID when the error was encountered.

 

Link to comment

...  asked me to get him a new disk ... without thinking of more reasonable things like a backup server or the like.

 

Absolutely an issue => I fear a lot of folks who don't bother with backups will now feel that with dual parity they don't need them.    But as Tom noted, for important data (i.e. "...  stuff you absolutely can't lose"), dual parity -- or even triple parity or more -- "... is not good enough."

 

... but a lot of folks simply don't think about this until they lose a disk with the last 20 years worth of family pictures on it (or some similarly important data).

 

Simple fact is a NAS is NOT a backup.    No matter how fault tolerant it is.

 

I looked at the use of dual parity some time ago and calculated that the likelihood of two disks failing within a one-month period was about the same as your house burning down.  Point being:  if you are worried enough about disk failures that you are willing to consider (or use) dual parity, you should also be concerned about other physical dangers to your server.  Your first concern should be to develop a plan for an offsite backup of all irreplaceable files!

 

I have three portable hard disks and two of them are always in a safety deposit box.  I actually have five copies of my most important files!  (And those three are never exposed to the Internet, which is another potential source of data loss.)

 

Plus, if you are considering dual parity to protect you against your laziness, you are going down the wrong path.  Recent versions of unRAID provide automated parity checks and e-mail notifications of issues.  Set them up and read the messages.  Log onto your server occasionally and look at any errors on the Main tab.  Look at the attributes tab for all of your disks to see if any errors have occurred.

 

One more thing.  If the worst should happen, NEVER PANIC.  Many problems have been made much, much worse by doing the wrong thing.  IF you are not sure what to do in a situation, ask first on the forum.  At the very least, the time you spend waiting for a reply will allow you to calm down so that any actions you take will be well thought out...

Link to comment

... I looked at the use of dual parity some time ago and calculated that the likelihood of two disks failing within a one-month period was about the same as your house burning down.

 

:)  I don't agree.  With large arrays, especially if the bulk of the disks are essentially the same age, the likelihood of a bit error occurring in one of the "other" disks during a rebuild is not that low.  It IS low, and in general it's not likely -- but it CAN happen, and is more likely than your "house burning down"  :)    That's exactly why major data centers use RAID-6 instead of RAID-5 ... i.e. not so they can wait for two disks to fail before replacing the failed disks; but so the replacement of a single failed disk will succeed even if another disk fails in the process.

 

I DO agree that it's important to have multiple copies of your data, and that at least one copy should be off site.  ALL of my data is stored in at least 3 places [e.g. for media the UnRAID server; backup UnRAID server; and a set of backup disks kept in a fireproof, waterproof, data-rated safe];  and all of my "important" data is stored in 7 places [main PC, backup drive in that PC, backup drive in wife's PC, UnRAID server, backup UnRAID server, backup disk stored in a fireproof, waterproof, data-rated safe, and in a cloud backup].

 

Fault-tolerance on my servers isn't a necessity, it's a convenience -- so if I have a drive failure I don't have to reload anything from my backups.  Dual parity makes that even less likely, so I've added it to my two larger servers (14 & 18 drives).

 

 

Link to comment

Attached is a zipped file of an Excel spreadsheet that will show you the probability of drive failures.  (It was zipped to allow me to attach it.)  You can enter any assumed annual drive failure rate that you want in cell C5.  If you use the % sign when you enter the failure rate (e.g. 5%), you can use the actual percentage; otherwise you have to enter it as .05.

Parity_Failure_Probabilities.zip
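
For anyone who can't open the spreadsheet, here is a minimal sketch of the kind of calculation it presumably performs, assuming independent failures and a constant annual failure rate (a simple binomial model; both assumptions are debatable, as the follow-up post points out):

```python
# Probability of at least k of n drives failing within a given window,
# assuming independent failures and a constant annual failure rate.
from math import comb

def prob_at_least_k_failures(n_drives: int, annual_rate: float,
                             window_days: float, k: int) -> float:
    p = annual_rate * window_days / 365.0        # per-drive failure prob. in the window
    return sum(comb(n_drives, i) * p**i * (1 - p)**(n_drives - i)
               for i in range(k, n_drives + 1))

# Example: 12 drives, 5% annual failure rate, chance of >= 2 failures in one month
print(prob_at_least_k_failures(12, 0.05, 30, 2))   # roughly 0.001 (about 0.1%)
```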

Link to comment

Remember that real life can often "bite" you re: probabilities.  We bought our current home in 1998.  Three weeks later there was a major "100 year" flood.  (Fortunately we had minimal damage -- all exterior.)  Three years later, in 2001, we had another, even worse, "100 year" flood.  So much for the probabilities!

 

Also, remember that GIVEN that you have had a failure, your array is no longer a dual-parity array.    The probability of another failure at that point is the probability of a failure in an N-1 disk array with single parity.

 

Further, the issue isn't another complete disk failure -- it's a simple uncorrectable bit error in another disk during the rebuild.  And the probability of an uncorrectable bit error is FAR higher than that of a complete failure of the disk.
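
A quick back-of-the-envelope illustration of that last point, assuming the commonly quoted consumer-drive spec of one unrecoverable read error (URE) per 10^14 bits read (real drives usually do considerably better than spec, so treat this as a worst case):

```python
# Probability of hitting at least one unrecoverable read error (URE) while
# reading a given amount of data, assuming the quoted spec of 1 URE per 1e14 bits.

def prob_ure(tb_read: float, ure_per_bit: float = 1e-14) -> float:
    bits = tb_read * 1e12 * 8                 # terabytes -> bits
    return 1 - (1 - ure_per_bit) ** bits      # chance of at least one URE

# Reading a single 8 TB drive end to end:
print(prob_ure(8))    # roughly 0.47 by spec -- and a rebuild reads EVERY surviving disk
```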

 

Link to comment
