Dual or Single Parity: It's your choice


Frank1940


Now that version 6.2 offers the choice of dual parity, many of you will be facing the decision of whether to implement dual parity or remain with single parity.  It might seem a 'no-brainer' to move to dual parity, but as most of you will realize, there is considerable expense involved in that move!  At that point, you might well be wondering what the risk/benefit trade-off truly is.

 

Data on the failure rates of hard drives is rather difficult to come by.  Most manufacturers consider it highly confidential information, as most of them have had one or more drive models with high failure rates as reported by users in their reviews.  Plus, true failure-rate data only becomes available about the time that the drive is about to be discontinued.  About the only source of information is Backblaze, which publishes regular reports on the failures of the hard drives in their arrays.  You can find the latest report here:

 

          https://www.backblaze.com/blog/hard-drive-reliability-stats-q1-2016/

 

They are using 'consumer'-type hard drives in their arrays rather than 'server'-class drives, so that is a plus.  But, on the flip side, they really pack the drives in tight and they are spun up and in use 24/7/365 so they are operating in the worst possible environment.  So I would surmise that their failure rates are much higher than would be expected on most unRAID servers where the drives are allowed to be spun down and the temperatures are probably more moderate. 

 

I decided to build a spreadsheet to calculate the probability of drive failure(s) in a one-month period.  I picked one month because most diligent users run monthly parity checks and will receive a report on the health of their arrays at least that often.  (Of course, with ver 6.2 and later, the concerned user should be setting up Notifications to alert them to problems on a more frequent basis!  But that is another story...)

 

I have attached three tables of failure probabilities --- 11%, 5% and 2%.  You can pick the one that best suits your risk tolerance.  You will also note that I constructed the tables for the number of DATA drives, NOT the total number of drives in the array.  Adding a second parity drive to an existing array actually increases the probability of a given number of drives failing in the same time period!  The red columns are the probability of actual data loss, while the black columns are the probability of having a failure where the data is recoverable.
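
For anyone who wants to sanity-check numbers of this sort without building a spreadsheet, here is a minimal Python sketch of the same idea (it is not my actual spreadsheet).  It assumes drive failures are independent and binomially distributed, and that a yearly failure rate converts to a monthly probability as 1 - (1 - rate)^(1/12); the function names and the 12-drive example are purely illustrative.

```python
# Minimal sketch (not the actual spreadsheet): probability of k or more of
# n data drives failing in one month, assuming independent failures and a
# given yearly failure rate.  All names and numbers are illustrative.
from math import comb

def monthly_prob(yearly_rate: float) -> float:
    """Convert a yearly failure rate to an equivalent per-month probability."""
    return 1 - (1 - yearly_rate) ** (1 / 12)

def prob_at_least(n_drives: int, k: int, yearly_rate: float) -> float:
    """Probability that at least k of n_drives fail within one month."""
    p = monthly_prob(yearly_rate)
    return sum(comb(n_drives, i) * p ** i * (1 - p) ** (n_drives - i)
               for i in range(k, n_drives + 1))

# Example: 12 data drives at a 5% yearly failure rate.
# Single parity tolerates 1 failure; dual parity tolerates 2.
for k in (1, 2, 3):
    print(f"P(>= {k} failures in one month): {prob_at_least(12, k, 0.05):.6f}")
```

Whether the decimals match the attached tables exactly depends on how the yearly rate is converted to a monthly one, but the relative sizes of the numbers are the point.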

 

Now, as to what the numbers mean.  A probability of .000104 means that you have (approximately) 1 chance in 10,000 of having that number of drives fail in a one-month period.  A probability of .012448 means you have (approximately) 1 chance in 100 of having that number of drives fail in a one-month period.

 

Do be a bit careful when using statistics.  If you had a drive fail this month, the probability of a drive failing next month is exactly the same as it was for this month.  However, the probability of having a drive fail next month and another one the month after that is much, much higher than intuition suggests.  What seems to be common sense is often mathematically untrue!
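
One way to see why intuition misleads here, with assumed numbers (a single drive, a 5% yearly failure rate, months treated as independent): the chance of a failure in any given month never changes, yet the chance of having seen a failure keeps climbing the longer you look.

```python
# Illustration only, with assumed numbers: per-month risk is constant, but
# cumulative risk over several months keeps growing.
p_month = 1 - (1 - 0.05) ** (1 / 12)   # monthly probability from a 5% yearly rate
print(f"Chance of a failure in any single month: {p_month:.4f}")

for months in (1, 6, 12, 24):
    p_by_then = 1 - (1 - p_month) ** months
    print(f"Chance of at least one failure within {months} months: {p_by_then:.4f}")
```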

 

Now, how to make a bit of sense of your risk.  Let's look at another potential source of data loss -- a house fire.  One source that I found indicated that the probability (for the USA) of having a fire in your house in the next month was .000264 (approximately 3 in 10,000) and of having the house declared a total loss was .0000278 (approximately 3 in 100,000).

 

If you look at the probability tables for dual parity, you are more likely to lose your data in a house fire than to a three-disk failure on your array!  If you consider dual parity an absolute necessity, do you have a plan in place for off-site backup of your data?

 

 

 

Parity_Failure_Probabilities11%.pdf

Parity_Failure_Probabilities5%.pdf

Parity_Failure_Probabilities2%.pdf

 

Here is a link with Backblaze data for the 3rd Quarter of 2016:

 

    https://www.backblaze.com/blog/hard-drive-failure-rates-q3-2016/


Great work.

 

But I'm going to add my usual comment about Backblaze (since you mentioned it): since they don't define what a "failure" is, their stats are meaningless.  On top of that, they use consumer-class drives in a pure RAID environment, and consumer and enterprise drives behave differently with regard to TLER, which would cause a consumer drive to drop out of a RAID array far more often (note that unRAID is NOT RAID) and possibly be counted as a failure.


You are correct about Backblaze, but they are the only source that anyone (apparently) has found that gives any indication of the failure rates of hard drives over an extended period of time.  (Most of the manufacturers' reliability data comes from running a large number of drives for a fairly short period of time and counting the number of failures: 1,000 disks for 1,000 hours with one failure = 1 failure per million hours of operation!)

 

If anything (I feel), Backblaze's 'failure' rates are on the high side.  If we could somehow get enough data from unRAID users, I would suspect that the annual failure rate of all hard drives is under 2%.  (But even I would be hard-pressed to figure out how many disks of mine have failed and when...)  That is not to say that someone could not have purchased six of their ten hard drives from a lot of disks with an exceptionally high failure rate.  (Which is why it is better to purchase and add disks over a period of time rather than making a single purchase to completely fill an array.  Incidentally, Backblaze does not do this, because they want the advantage of a bulk purchase discount.)

 

By the way, I seem to recall that Backblaze actually runs their servers with failed disks in them (their parity/redundancy scheme must be able to handle multiple failed disks) and changes out several disks at the same time.  And apparently, it is like RAID 5 in that all of the disks have to be the same size, so they evaluate which server to remove from service, and when, so they can add a replacement server with a larger storage capacity.

 

I really don't want to be one who says that dual parity should not be used, or that it is an absolute necessity.  My personal feeling is that if you have eight or fewer data drives, it will provide very little reduction in the risk of data loss.  Between eight and, say, twelve, it may have some benefit.  With more than about twelve drives, the benefit is substantial.  BUT a careless, happy-go-lucky unRAID administrator will never be out of danger.  Without careful observation of the warning signs, data loss is a certainty regardless of the number of disk failures that the system can tolerate before data is lost...

 

 


Sorry to post in this old thread, but it was referenced in a recent thread and I thought I would chime in.

 

I agree that the value of dual parity as a guard against a second failed disk is quite low, computed as it has been here.  But I put forth the following use case, which is more likely and for which dual parity could help.

 

A user has a disk fail (or thinks it did, anyway).  He opens up the case and replaces the drive with a new one.  While in there, he touches some of the cables to other drives and unknowingly knocks something slightly askew.  When trying to rebuild the disk, the rebuild fails as a second disk drops offline.  Although this is not a deadly situation, a user's reaction can be deadly.  With dual parity, the user would have had to have two disks with cables knocked askew.  If it was only one, the rebuild would complete successfully, and then the user would be able to address the disk that dropped from the array as a second event.

 

This, IMO, is the biggest benefit of dual parity, and it is a pretty good reason, especially for a novice user.  Without locking cables and drive cages, and a lot of hours of burn-in, it is extremely common to have disks drop when rebuilding due not to a failed disk, but to a dislodged cable.  I'd put the percentage north of 15%, maybe higher, for novice users with 5+ disks without cages (my guess only).  (Of course, better still is using locking cables and disk cages, which manage the same risk much more effectively, IMO.)

On 9/24/2016 at 0:45 PM, Squid said:

Great work.

 

But I'm going to add my usual comment about Backblaze (since you mentioned it): since they don't define what a "failure" is, their stats are meaningless.  On top of that, they use consumer-class drives in a pure RAID environment, and consumer and enterprise drives behave differently with regard to TLER, which would cause a consumer drive to drop out of a RAID array far more often (note that unRAID is NOT RAID) and possibly be counted as a failure.

 

BackBlaze has posted a small set of SMART values that they track.  The exact criteria may not be clear, but, like most of us, they pull the trigger as the values start to degrade.  I think it's the best set of data points we have, at least for drives where they have a significant number.

17 minutes ago, bjp999 said:

(Of course better is using locking cables and disk cages which manage the same risk much better IMO).

One problem is that WD redesigned some of their drives in such a fashion that the metal tab on the locking cable has nothing to 'lock' against.  Plus, most of the locking cables are missing the little plastic 'nub' that forces the two conducting surfaces of the connectors together.  This means that the cable end of the connector is basically floating inside the HD connector.  (You can actually tell if this is the case by gently pulling back on the cable after inserting it.  If there is no friction, there is no 'nub'.  I have long maintained that both SATA connectors -- power and data -- are a poster child for how NOT to design a connector!)

 

Here is  a link to the WD site about this issue:

http://support.wdc.com/knowledgebase/answer.aspx?ID=10477

 

10 hours ago, bjp999 said:

A user has a disk fail (or thinks it did, anyway).  He opens up the case and replaces the drive with a new one.  While in there, he touches some of the cables to other drives and unknowingly knocks something slightly askew.  When trying to rebuild the disk, the rebuild fails as a second disk drops offline.  Although this is not a deadly situation, a user's reaction can be deadly.  With dual parity, the user would have had to have two disks with cables knocked askew.  If it was only one, the rebuild would complete successfully, and then the user would be able to address the disk that dropped from the array as a second event.

 

 

Thanks for your comment, as you have a very valid point.  It is something that folks should be aware of and weigh when they are making a decision as to whether to move to dual parity.  One thing to realize is that there is always some way that data can be lost, and panicking is the hardest one to prevent because of its very nature.  At the end of the day, about the only way to have an almost completely fool-proof plan against data loss is to have a minimum of three complete backups, with one of them in a secure offsite (and offline) location.  (And I do have a minimum of three backups of my irreplaceable files, and one of them is locked in a safety deposit box, because I do worry about possible physical destruction of my home through some sort of natural disaster.)

  • 2 years later...
On 9/24/2016 at 6:35 PM, Frank1940 said:

But, on the flip side, they really pack the drives in tight and they are spun up and in use 24/7/365 so they are operating in the worst possible environment.  So I would surmise that their failure rates are much higher than would be expected on most unRAID servers where the drives are allowed to be spun down and the temperatures are probably more moderate. 

Okay.. I have always prevented my drives from spinning down, because I heard that the fewer start/stops you have to do, the better it is for longevity. Is this not accurate for UnRaid? Because then I should be changing some settings ASAP..

Cheers.

1 hour ago, Froberg said:

Okay.. I have always prevented my drives from spinning down, because I heard that the fewer start/stops you have to do, the better it is for longevity. Is this not accurate for UnRaid? Because then I should be changing some settings ASAP..

Cheers.

Depends on your use pattern. If you access all your drives many times a day, leave them spun up, or set a very long spin down time. I would not use a spin down delay that resulted in more than 2 spin down cycles a day. Keep in mind the mover schedule.

2 hours ago, Froberg said:

Okay.. I have always prevented my drives from spinning down, because I heard that the fewer start/stops you have to do, the better it is for longevity. Is this not accurate for UnRaid? Because then I should be changing some settings ASAP..

Cheers.

To be honest, the only real-life data that we have is the Backblaze data.  The design life of a hard drive is very long, and a drive is a complex system of different sub-systems.  There are so many different failure mechanisms -- electrical, mechanical, shock and vibration, environmental.  Each one has its own failure curve, and each one is fatal to the drive.  The only sure thing that we know is that all hard drives will fail.

 

Backblaze's data shows their experience with their particular set of conditions.  To fully test some other set of conditions, you need this basic setup: a large number of drives from different manufacturers, various models from each, and a long period of time (several years) to conduct the test.  And when you are done, you need to start again, because all of those drive models will have been discontinued by their manufacturers.

 

(By the way, the manufacturers' reliability data is extrapolated by testing a large number of drives for a short period of time -- say, 5,000 drives for 1,000 hours.  If there are 5 failures in this testing period, then they tend to report that the MTTF is 1,000,000 hours!)
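
For anyone curious about the arithmetic, here is a quick sketch of that extrapolation, plus a common rule-of-thumb conversion from MTTF to an annualized failure rate (the constant-failure-rate assumption is mine; it is not necessarily how any particular manufacturer does it).

```python
from math import exp

# Sketch of the extrapolation described above.  The MTTF-to-AFR conversion
# assumes a constant failure rate; all numbers are the hypothetical ones
# from the post.
drives, test_hours, failures = 5000, 1000, 5
mttf_hours = drives * test_hours / failures      # 5,000,000 drive-hours / 5 = 1,000,000 h

hours_per_year = 24 * 365
afr = 1 - exp(-hours_per_year / mttf_hours)      # roughly 0.9% per year
print(f"MTTF = {mttf_hours:,.0f} hours, implied annual failure rate = {afr:.2%}")
```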

 

Probably the best you can say is that at least 80% of all hard drives will last for five years regardless of their operating conditions, as long as those conditions are within the manufacturer's recommended limits.  And this comes with the standard warning -- "Your experience may vary."   🙄

 

As to your more direct question, @jonathanm's response is correct.  You have to look at your usage and your conditions.  If you can keep your drives cool and your usage pattern hits the array hundreds of times a day, throughout the day, then leave the drives spun up.  If you only access the array once or twice a day, spin them down after a short period of inactivity.  Of course, most of us tend to fall between these two extremes.  One other consideration is that each drive will consume an extra 5-6 W of power if it is left spun up all the time.  This can be a considerable amount of electricity over a long period of time.  It is probably better for the earth's environment and your pocketbook if the drives are spun down on servers with moderate usage patterns.
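
As a rough back-of-the-envelope for that last point (the 6 W per drive, ten-drive array and $0.15/kWh rate below are assumed numbers; substitute your own):

```python
# Back-of-the-envelope cost of leaving drives spun up around the clock.
# All inputs are assumptions -- adjust drive count, wattage and electricity rate.
watts_per_drive = 6
drive_count = 10
cost_per_kwh = 0.15                                               # USD, assumed

kwh_per_year = watts_per_drive * drive_count * 24 * 365 / 1000    # ~526 kWh
print(f"{kwh_per_year:.0f} kWh per year, roughly ${kwh_per_year * cost_per_kwh:.0f} per year")
```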

 

PS -- I spin the drives down on my servers after 45 minutes of inactivity.  In case you did not know, this means that if I am watching a two-hour movie on one disk, after 45 minutes all of the other hard drives will be spun down.


In my case I use my media server most of the day.  It's running some dockers, but those are contained on the SSDs.  As for the storage, well, it depends on the day, but we usually listen to music via the Plex server on the Sonos system, or the kids will be watching cartoons on an iPad or something.

 

I guess they could technically be spun down.. hmm. Won't it increase storage access times when whatever system requires whichever file?

43 minutes ago, Froberg said:

Won't it increase storage access times when whatever system requires whichever file?

Yes, by a second or two.  If that latency bothers you, by all means keep them spun up.  That may sound smart-aleck, but it's truly not.  The possible benefit of spinning them down is minuscule based on a random all-day usage pattern.

 

The true beauty of unRAID is the ability to keep like files on individual disks, so some can stay inactive.  I have a few disks which stay spun up almost all the time, and several which can stay spun down for days or weeks at a time.  Some stuff you just don't need to access regularly, but you want it available at a moment's notice vs. in cold storage.

Just now, jonathanm said:

Yes, by a second or two.  If that latency bothers you, by all means keep them spun up.  That may sound smart-aleck, but it's truly not.  The possible benefit of spinning them down is minuscule based on a random all-day usage pattern.

 

The true beauty of unRAID is the ability to keep like files on individual disks, so some can stay inactive.  I have a few disks which stay spun up almost all the time, and several which can stay spun down for days or weeks at a time.  Some stuff you just don't need to access regularly, but you want it available at a moment's notice vs. in cold storage.

I can see your point. Unfortunately I took the "just throw stuff wherever" approach when setting up UnRaid, so the cold-storage stuff is scattered all over the place along with the more active stuff.

Not sure I can change that at this point in time.

16 minutes ago, jonathanm said:

You can, but it's a little tedious. Do you have a good handle on what exactly a user share is with respect to the individual disks?

Are you referring to split-levels and allocation settings?

Mine's set up to just throw data wherever with high-water and split any directory.

I kind of wish I hadn't gone that route initially, now, since in the event of data loss due to a drive failure or something else, I'd rather have a good idea of what data was lost, but I didn't really understand how UnRaid functioned back then.  Not really.


Once you wrap your head around how all the array and cache disks interact to form the user shares, you can safely move files from disk to disk to get your least used files consolidated onto drives by themselves.

 

It's not super complicated, but until you understand it, don't start playing around with the individual disks, as you can cause anything from hiding files from yourself to outright irretrievable data loss.

 

This thread is one place to start learning if you wish.

https://forums.unraid.net/topic/42152-disk-vs-user-share/

 

