
How Reliable is an unRAID array


SSD


I started thinking about the effect of unRAID parity on overall reliability versus JBOD, so I decided to do the math.  The results are interesting.  Please help me double-check the math.  If anyone has any ideas on how I can upload an .XLS file, I will; the forum will not accept it.

 

My assumption was that an individual drive is 99% reliable per year.

 

The following assumes 10 equal sized disks in the array, plus parity.

 

The chances that ALL the 10 disks would stay good (parity not needed) would equal 99% ^ 10 = 90.44%  (^ = to the power of)

 

The chances of one particular disk failing and the rest staying good would be 1% * (99% ^ 9) = 0.914%

 

Chance of any one of the disks failing while all others stay good would be 10 * 0.914% = 9.14%

 

(Although not needed for this analysis, if you take 100% - 90.44% (chance all would stay good) - 9.14% (chance of ONE drive failure) = 0.43%.  This is the chance of multiple drive failures in the array.)

 

If you had a single drive failure, your parity disk would save you only if it too stays good, i.e. 99% of the time:  9.14% * 99% = 9.04%

 

So your overall reliability is 90.44% + 9.04% = 99.48%.

 

Your chances of failure due to multiple drive failures are 0.52%.

 

Said another way, your chances of data loss are about 18.5 times as high (1846%) without unRAID: (100% - 90.44%) / (100% - 99.48%).
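If you'd rather check the arithmetic in code than in a spreadsheet, here is a minimal sketch of the model above (the function names are mine, not from my spreadsheet; it assumes independent drive failures and no corrective action during the year):

[pre]
# Minimal sketch of the annual model above (names are mine).
# Assumes independent drive failures and no corrective action within the year.

def jbod_reliability(r, n):
    """Chance that all n drives, each r reliable per year, survive the year."""
    return r ** n

def unraid_reliability(r, n):
    """n data drives plus one parity drive: data survives the year if no data
    drive fails, or if exactly one fails and the parity drive stays good."""
    all_good = r ** n
    one_fails_parity_ok = n * (1 - r) * r ** (n - 1) * r
    return all_good + one_fails_parity_ok

r, n = 0.99, 10
print(jbod_reliability(r, n))     # ~0.9044
print(unraid_reliability(r, n))   # ~0.9948
print((1 - jbod_reliability(r, n)) / (1 - unraid_reliability(r, n)))  # ~18.5
[/pre]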

 

[pre]
RELIABILITY WITH UNRAID
(rows = number of data disks, plus one parity disk; columns = per-drive annual reliability)

Disks  99.99%   99.9%    99.5%    99%      98%      95%
  2    100.0%   100.0%   100.0%   100.0%   99.9%    99.3%
  3    100.0%   100.0%   100.0%   99.9%    99.8%    98.6%
  4    100.0%   100.0%   100.0%   99.9%    99.6%    97.7%
  5    100.0%   100.0%   100.0%   99.9%    99.4%    96.7%
  6    100.0%   100.0%   99.9%    99.8%    99.2%    95.6%
  7    100.0%   100.0%   99.9%    99.7%    99.0%    94.3%
  8    100.0%   100.0%   99.9%    99.7%    98.7%    92.9%
  9    100.0%   100.0%   99.9%    99.6%    98.4%    91.4%
 10    100.0%   100.0%   99.9%    99.5%    98.0%    89.8%
 11    100.0%   100.0%   99.8%    99.4%    97.7%    88.2%
 12    100.0%   100.0%   99.8%    99.3%    97.3%    86.5%
 13    100.0%   100.0%   99.8%    99.2%    96.9%    84.7%
 14    100.0%   100.0%   99.7%    99.0%    96.5%    82.9%
 15    100.0%   100.0%   99.7%    98.9%    96.0%    81.1%

RELIABILITY WITHOUT UNRAID
(rows = number of disks; columns = per-drive annual reliability)

Disks  99.99%   99.9%    99.5%    99%      98%      95%
  2    99.98%   99.80%   99.00%   98.01%   96.04%   90.25%
  3    99.97%   99.70%   98.51%   97.03%   94.12%   85.74%
  4    99.96%   99.60%   98.01%   96.06%   92.24%   81.45%
  5    99.95%   99.50%   97.52%   95.10%   90.39%   77.38%
  6    99.94%   99.40%   97.04%   94.15%   88.58%   73.51%
  7    99.93%   99.30%   96.55%   93.21%   86.81%   69.83%
  8    99.92%   99.20%   96.07%   92.27%   85.08%   66.34%
  9    99.91%   99.10%   95.59%   91.35%   83.37%   63.02%
 10    99.90%   99.00%   95.11%   90.44%   81.71%   59.87%
 11    99.89%   98.91%   94.64%   89.53%   80.07%   56.88%
 12    99.88%   98.81%   94.16%   88.64%   78.47%   54.04%
 13    99.87%   98.71%   93.69%   87.75%   76.90%   51.33%
 14    99.86%   98.61%   93.22%   86.87%   75.36%   48.77%
 15    99.85%   98.51%   92.76%   86.01%   73.86%   46.33%

DELTA (ABSOLUTE % INCREASE)
(rows = number of data disks; columns = per-drive annual reliability)

Disks  99.99%   99.9%    99.5%    99%      98%      95%
  2    0.02%    0.20%    0.99%    1.96%    3.84%    9.03%
  3    0.03%    0.30%    1.48%    2.91%    5.65%    12.86%
  4    0.04%    0.40%    1.96%    3.84%    7.38%    16.29%
  5    0.05%    0.50%    2.44%    4.75%    9.04%    19.34%
  6    0.06%    0.60%    2.91%    5.65%    10.63%   22.05%
  7    0.07%    0.70%    3.38%    6.52%    12.15%   24.44%
  8    0.08%    0.79%    3.84%    7.38%    13.61%   26.54%
  9    0.09%    0.89%    4.30%    8.22%    15.01%   28.36%
 10    0.10%    0.99%    4.76%    9.04%    16.34%   29.94%
 11    0.11%    1.09%    5.20%    9.85%    17.62%   31.28%
 12    0.12%    1.19%    5.65%    10.64%   18.83%   32.42%
 13    0.13%    1.28%    6.09%    11.41%   19.99%   33.37%
 14    0.14%    1.38%    6.53%    12.16%   21.10%   34.14%
 15    0.15%    1.48%    6.96%    12.90%   22.16%   34.75%

NUMBER OF TIMES MORE RELIABLE WITH UNRAID
(rows = number of data disks; columns = per-drive annual reliability)

Disks  99.99%    99.9%    99.5%    99%      98%      95%
  2    6666.8X   666.8X   133.4X   66.8X    33.4X    13.4X
  3    5000.2X   500.2X   100.2X   50.2X    25.2X    10.2X
  4    4000.2X   400.2X   80.2X    40.2X    20.2X    8.2X
  5    3333.6X   333.6X   66.9X    33.6X    16.9X    6.9X
  6    2857.4X   286X     57.4X    28.8X    14.5X    6X
  7    2500.3X   250.3X   50.3X    25.3X    12.8X    5.3X
  8    2222.5X   222.5X   44.7X    22.5X    11.4X    4.7X
  9    2000.3X   200.3X   40.3X    20.3X    10.3X    4.3X
 10    1818.5X   182.1X   36.6X    18.5X    9.4X     3.9X
 11    1666.9X   166.9X   33.6X    17X      8.6X     3.6X
 12    1538.7X   154.1X   31.1X    15.7X    8X       3.4X
 13    1428.9X   143.1X   28.9X    14.6X    7.4X     3.2X
 14    1333.6X   133.6X   27X      13.6X    7X       3X
 15    1250.3X   125.3X   25.3X    12.8X    6.6X     2.8X
[/pre]

Link to comment

I think your analysis is slightly flawed.

 

I can accept the chance of one particular 99%-reliable drive failing (with the rest staying good) in a set of 10 as 0.914%, or a 9.14% chance of any one of the 10 drives failing.

 

Where I see a flaw in your logic is in the multiple-disk percentage.  Your numbers might be accurate if you did not replace the first failing disk and then waited for a second disk to fail, but I do not think they take into consideration the fact that all 10 of my disks could fail multiple times in a year and, as long as I replace each failed drive as soon as practical and no two drives have failed at the same time, I've lost no data.

 

Did I miss something?  Did you account for simultaneous failure, or just two disks failing within the same year?

 

Let's pretend it takes 5 days to replace a failed drive.  What are the odds another drive will fail in the same five-day window of time?  If there is a 0.914% chance of a drive failure in a year's time, then 0.914% * (5/365) = 0.01252% is the chance of a single drive failure in a 5-day window of time.

 

The chance of 1 drive in an array of 10 having failed within a 5-day window of time would then be 0.1252%.

 

I don't know how to calculate the odds of two disks failing in the SAME 5-day window of time.  I know I could do as you did and subtract 100% - 0.1252% - 0.01252% to get the odds of two disks failing in a 5-day window, but I'm not sure whether that would be the simultaneous failure of the two drives, or just that two will each be unavailable for 5 days some time during the year.  Since there are 365/5 = 73 windows of 5 days, the chance of it being in the same 5 days (or overlapping the same 5 days) is about 72 times less.

 

Did I miss something...  It has been a very long time since I took a math class, and I know I'm rusty.  I think reliability is much higher than you predict; otherwise, all RAID arrays would be losing data a lot more often than I've read about.

 

The following link uses math as I describe; it uses the MEAN-TIME-TO-REPAIR to calculate the odds of a second drive failure in a RAID-5 array:

http://www.northriversolutions.com/downloads/Calculating%20the%20True%20Reliability%20of%20RAID%20-%20A%20White%20Paper%20from%20NRS-2.pdf
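As a rough back-of-the-envelope version of that MTTR-based idea (my own simplification, not the paper's exact formula), you can treat the repair time as a fixed window and ask how likely it is that any of the remaining drives fails inside it:

[pre]
# Back-of-the-envelope: chance of a second failure inside the repair window.
# My own simplification; assumes a constant failure rate, so annual
# reliability rescales to a window of h hours as r ** (h / 8760).
r_year = 0.99        # per-drive annual reliability from the thread
n = 10               # drives in the array
mttr_hours = 5 * 24  # assumed 5-day repair window

r_window = r_year ** (mttr_hours / 8760.0)   # per-drive reliability over 5 days

# Given one drive has already failed, chance that at least one of the
# other n-1 drives also fails before the first one is rebuilt.
p_overlap = 1 - r_window ** (n - 1)

# Rough annual chance of data loss: some drive fails during the year AND
# a second failure lands inside its repair window.
p_any_first = 1 - r_year ** n
print(p_overlap)                # ~0.0012 (about 0.12%)
print(p_any_first * p_overlap)  # ~0.00012, roughly 1 in 8,000 per year
[/pre]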

 

 

Joe L.

Link to comment

Thanks Joe.

 

Great catch.  You are right that the model does not acknowledge the ability to take corrective action within the 1-year time interval.  In order to take that into account, I need to shrink the time interval to a period (say 8 hours on average) during which a failure would not be noticed and therefore a corrective action would not occur.  I would need a correspondingly higher reliability percentage for 8 hours rather than a year.  Once I get these results, I'd have to multiply the failure rate by 365*3 (the number of 8-hour periods in a year).  I think this will raise the reliability.  I'll try to do this and post the results.

 

Interesting link.  I'll give that a careful read as well.

Link to comment

I think five days is a bit more realistic than eight hours; you might even want to up it by a few days.  The typical user won't have a spare drive as big as or bigger than the largest drive in the array just lying around gathering dust.  You first have to assume it'll take up to a day to notice (and that's assuming the user checks the array every day, or has gone to the trouble to create a script to notify them of such happenings).  Then the user has to order the new drive (and as luck has it, you just missed the cut-off for same-day shipping, so it won't ship until the next day).  Overnight shipping is outrageous, so you have it shipped by an economically priced three-day method.  So, the drive can fail Wednesday, you find out Thursday, select three-day shipping, it ships out Friday, and it gets to you the next Wednesday (can't count Saturday and Sunday for delivery times, or in the case of FedEx Home you can't count on Sunday or Monday).  Once you get it you install it and wait for the array to rebuild.  You should be back up by that night.

 

Oh, and don't forget not every company ships right away, and not every package arrives on time (I've had FedEx Home deliveries take about a week longer to get to me than they should have), plus not every user will even select three-day (but there's some who'll get it faster because they're so close to the warehouse, plus there's those who will pay to have it overnight, etc.).  Now, once unRAID supports hot spares, then you can figure in eight hours or so, as that'd be the amount of time it'd take to rebuild the array.

Link to comment

Don't forget that the MEAN-TIME-TO-REPAIR includes both the time it takes for you to notice the failed drive AND the time it takes to get a new one AND the time it takes for the array to rebuild it once installed.  You are not safe from a concurrent failure of a second drive until the first failed drive is rebuilt and the array is healthy once more.

 

For my current array it could take 7 hours to rebuild parity.  Probably about 5-6 to rebuild a replaced data drive (my biggest data drive is 500G, my parity drive is 750G).

My 5 day estimate is probably pretty close to what it would take to get a new drive up and running again from the time it first fails and I notice it.

 

To make sure I notice it, my server checks the array status once an hour and will e-mail me if a drive fails.  It might be 8-10 hours between a failure and when I read the mail (assuming I sleep once in a while), and I might not be able to get to a store any quicker than ordering over the web.  Frys is very fast at delivery and usually has decent pricing; I usually get their deliveries within 2 or 3 days.  My MTTR might be as little as 3 days or as much as 5.

 

Joe L.

Link to comment

Now, once unRAID supports hot spares, then you can figure in eight hours or so, as that'd be the amount of time it'd take to rebuild the array.

It supports hot spares now...  Install a disk equal in size, or bigger than your parity disk, and leave it unassigned in the devices assignment page.

 

If you have a drive failure, un-assign the failed drive and assign the "hot spare" to the same logical slot in your array and you will be up and running in 8 hours or so. 

Yes, the process is manual, but you can do it all from the existing unRaid management page.

 

Joe L.

Link to comment

It supports hot spares now...  Install a disk equal in size, or bigger than your parity disk, and leave it unassigned in the devices assignment page.

 

If you have a drive failure, un-assign the failed drive and assign the "hot spare" to the same logical slot in your array and you will be up and running in 8 hours or so. 

Yes, the process is manual, but you can do it all from the existing unRaid management page.

 

Well, you'd still have to figure in at least a day, plus however long it takes to rebuild (maybe 36 hours tops?).  Plus, if hot spares was a feature right now, I'd be more likely to go ahead and order another drive.  If I'm gone for say, three weeks and a drive fails, it's going to take me about a month to get that drive replaced.  If I had a hot spare, it'd take about 8-12 hours or so.  I might buy another drive for that peace of mind.

Link to comment

Now, once unRAID supports hot spares, then you can figure in eight hours or so, as that'd be the amount of time it'd take to rebuild the array.

It supports hot spares now...  Install a disk equal in size, or bigger than your parity disk, and leave it unassigned in the devices assignment page.

 

If you have a drive failure, un-assign the failed drive and assign the "hot spare" to the same logical slot in your array and you will be up and running in 8 hours or so. 

Yes, the process is manual, but you can do it all from the existing unRaid management page.

 

Joe L.

 

With all appropriate respect, that's not a hot spare.  The definition of a hot spare is that it is live and ready and comes up automatically if a drive fails.

 

Equally, it is not a cold spare, so I suppose it could be a "warm spare" :)

Link to comment

Here is a calculator that you can use.

 

http://www.bgdsoftware.com/freeTools/storage_availability_calc_tool.htm

 

It also has a link to a paper explaining it all.

 

On the Seagate website they spec the 1TB drive with a 750,000-hour MTBF.  Using 500,000 hours should be a good conservative number to use.

 

 

Again my math skills are very rusty, but I have a difficult time with their calculations on that spreadsheet.  They predict "downtime," not chances of data loss.  Even with parity AND two hot spares they predict I will have up to 0.6 minutes of downtime with my 10 disk array.  I find that very hard to believe.

 

In my mind, if a drive is rated for 500,000 hours MTBF and I have 10 drives, my reliability is 50,000 hours MTBF.  My down-time window for any given drive is 5x24 hours (5 days), or 120 hours.  120/50,000 = 0.24% downtime.

I think I need to multiply that by itself to get the downtime potential of a second drive failing in the same 120-hour window... (I told you my math skills are rusty...)

 

Here is another discussion, more in line with how my poor little mind is thinking.  This time from adaptec.com  (they might have the math more correct)

http://storageadvisors.adaptec.com/2005/11/01/raid-reliability-calculations

and

http://storageadvisors.adaptec.com/2005/11/02/actual-reliability-calculations-for-raid/

 

And here is another spreadsheet. Also from adaptec. 

http://graphics.adaptec.com/us/TT_SA/MTTDL.xls

 

For a 12-disk array of 750GB disks, with an MTBF of 1,000,000 hrs, a bit error rate of 1.00E-14, a data rate of 80MB/s, and an MTTR of 8 hours to rebuild a disk + 120 hours to detect and replace the failed disk (a total MTTR of 128 hours), the MTTDL (mean time to data loss) is listed as 19.12 years.  That is much closer to what I'm thinking a RAID array can offer.  The nice feature of unRaid is that sometime in the next 19 years, unlike RAID-5, if I have a two-drive failure that loses data, the remaining working disks will not lose their data too.
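For the curious, here is a rough reconstruction (mine, not Adaptec's actual formula) of how a figure in that ballpark can come out, assuming the spreadsheet combines the chance of a second whole-drive failure during the repair window with the chance of an unrecoverable read error while rebuilding:

[pre]
import math

# Rough MTTDL reconstruction using the numbers above (my own formula,
# not necessarily what the Adaptec spreadsheet does internally).
mtbf_hours  = 1_000_000   # per-drive MTBF
n_drives    = 12
drive_bytes = 750e9       # 750 GB per disk
ber         = 1e-14       # unrecoverable bit error rate
mttr_hours  = 128         # 8 h rebuild + 120 h to detect and replace

# Chance one of the other 11 drives fails during the repair window.
p_second = 1 - math.exp(-(n_drives - 1) * mttr_hours / mtbf_hours)   # ~0.14%

# Chance of an unrecoverable read error while reading the 11 surviving
# drives end to end during the rebuild.
bits_read = (n_drives - 1) * drive_bytes * 8
p_ure = 1 - math.exp(-ber * bits_read)                               # ~48%

array_failure_rate = n_drives / mtbf_hours   # "first" failures per hour
mttdl_hours = 1 / (array_failure_rate * (p_second + p_ure))
print(mttdl_hours / 8760)   # ~19-20 years, close to the listed 19.12
[/pre]

Interestingly, at these capacities it is the bit error rate during the rebuild, not a second whole-drive failure, that dominates the result.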

 

Joe L.

Link to comment

 

For a 12-disk array of 750GB disks, with an MTBF of 1,000,000 hrs, a bit error rate of 1.00E-14, a data rate of 80MB/s, and an MTTR of 8 hours to rebuild a disk + 120 hours to detect and replace the failed disk (a total MTTR of 128 hours), the MTTDL (mean time to data loss) is listed as 19.12 years.  That is much closer to what I'm thinking a RAID array can offer.  The nice feature of unRaid is that sometime in the next 19 years, unlike RAID-5, if I have a two-drive failure that loses data, the remaining working disks will not lose their data too.

 

 

Those odds are still just a bit scary at first, I don't want to lose any data EVER, even if it is just a loss every twenty-ish years.  The reality though is that we're not going to keep any of these drives for twenty years.  Those nice shiny 1TB drives I have now will be eclipsed within ten years (though I still have some drives that are about ten years old).  I think I'll be safe by keeping two copies of anything really important (the chance that those two exact drives going down at the same time would be slim), and then make occasional copies to another format (i.e. DVD).

Link to comment

 

For a 12-disk array of 750GB disks, with an MTBF of 1,000,000 hrs, a bit error rate of 1.00E-14, a data rate of 80MB/s, and an MTTR of 8 hours to rebuild a disk + 120 hours to detect and replace the failed disk (a total MTTR of 128 hours), the MTTDL (mean time to data loss) is listed as 19.12 years.  That is much closer to what I'm thinking a RAID array can offer.  The nice feature of unRaid is that sometime in the next 19 years, unlike RAID-5, if I have a two-drive failure that loses data, the remaining working disks will not lose their data too.

 

 

Those odds are still just a bit scary at first, I don't want to lose any data EVER, even if it is just a loss every twenty-ish years.  The reality though is that we're not going to keep any of these drives for twenty years.  Those nice shiny 1TB drives I have now will be eclipsed within ten years (though I still have some drives that are about ten years old).  I think I'll be safe by keeping two copies of anything really important (the chance that those two exact drives going down at the same time would be slim), and then make occasional copies to another format (i.e. DVD).

You are right on target.  A RAID array is not a substitute for backups of your critical data.  All it would take is one fire/flood/direct hit by lightning and your RAID disks might all be damaged beyond repair.  Off-site backup of critical files is the only solution.  A bank safe-deposit box holds a lot of writable DVDs and is about as safe as you can get for fire/flood/etc.

 

In the past 15 years or so I've had at least 4 or 5 disk failures.  About 1 every three years...  In almost every case, I lost some data.  I'll bet you have had similar experiences.  These were all before I had an unRaid server, but I still have about 12 disks spinning in various machines/laptops/media players/dvr's around this house that are not protected from a failure.  Odds are high one or more disks will fail within a few years.  I now back up my windows PCs to my unRaid server using Acronis TrueImage.  I use removable USB drives for off-site storage of critical data.  Can't get too paranoid...  ;)  Odds are I'll still lose data, but only that from after the last backup. (reminder to self... do another set of backups)

 

 

Joe L.

Link to comment

 

In the past 15 years or so I've had at least 4 or 5 disk failures.  About 1 every three years...  In almost every case, I lost some data.   I'll bet you have had similar experiences.   These were all before I had an unRaid server, but I still have about 12 disks spinning in various machines/laptops/media players/dvr's around this house that are not protected from a failure.   Odds are high one or more disks will fail within a few years.  I now back up my windows PCs to my unRaid server using Acronis TrueImage.  I use removable USB drives for off-site storage of critical data.   Can't get too paranoid...  ;)  Odds are I'll still lose data, but only that from after the last backup. (reminder to self... do another set of backups)

 

 

To be honest, I can't remember a true drive failure where I wasn't able to get the important data off.  I've had a few develop the click of death, which means an immediate backup and replace.  Well, except for one drive, 250 or 300GB; it's had the click of death for quite a few years now and, other than the occasional stutter and even total disappearance (just have to reboot and there it is again), it's been working fine.  With all this consolidation I'm doing, it's going in the trash.  Had one drive that I wanted to back up but it wouldn't complete a dd, so I used the freezer trick.  Worked just fine.  I think in the unlikely case that you do have two drives go bad, there's a halfway decent chance you can recover the data off of one of them yourself.  If you can, then nothing's lost.

 

Worst luck I had with drives was a bunch of Maxtors I bought for TiVo upgrades.  For whatever reason they'd stutter just long enough to make the TiVo reboot.  Worked fine as data drives though.  I ended up with a lot of PC storage after that fiasco.

 

Anyways...with data now including digital pictures and video of my son, I'm having to worry more about keeping the data around.  A lot of pictures are backed up on my webserver (with the side effect that family can view them whenever they want).  Videos are still on the mini-dv tapes.  I think I had about four copies of the pictures on various drives and computers.  Now I just need to make some backups to DVD and hand them out to family, they can act as multiple safety deposit boxes.

Link to comment

Thanks for your comments and references.  Please double-check my math again to see if I am closer to being accurate now...

 

Remember, these statistics are based PURELY on drive reliability.  They don't consider the chances of your PSU blowing, your MB failing, your controller going bonkers, your house catching on fire, nuclear war, a plane crashing into your house... you get the idea!

 

My prior approach computed things on an annualized basis, and as a result considered 2 failures in a YEAR a data loss situation.  I shrunk the interval to an adjustable number of hours, and multiplied to get to an annualized percentage.  So now you'd have to have 2 failures in a day (or 5 days), instead of 2 failures in a year, to create data loss.  (Thanks Joe!)

 

I tried to take into consideration that a rebuild would fail if any of the drives in the array were to fail during the rebuild.  I multiplied the failure rate during a rebuild interval by 11 (the 10 data drives plus parity, i.e. 1 more than the number of data drives in the array).  (A rebuild is assumed to be completed within 24 hours.)

 

I used the MTBF (1,000,000 hrs) to try to compute single-drive reliability, to see if my 99% reliability figure was reasonable.  Based on my reading, an MTBF of 1,000,000 hours means that if a million drives were running simultaneously, they would average about 1 drive failure per hour.  So if there will be one failure per hour for a million drives, there would be one millionth of a failure per hour for a single drive.  That means you'd have an hourly failure rate of 0.0001%.  If you multiply that by the number of hours in a year (8,760), you get an annualized failure rate of roughly 0.88% (about 99.1% reliable).  1% is pretty reasonable!
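As a quick sanity check on that conversion (my arithmetic, assuming a constant failure rate):

[pre]
import math

# Converting a 1,000,000-hour MTBF to an annual failure chance (my arithmetic).
mtbf_hours = 1_000_000
hours_per_year = 24 * 365        # 8,760

# Simple proportion: one-millionth of a failure per hour, times hours in a year.
annual_failure_simple = hours_per_year / mtbf_hours              # ~0.88%

# Exponential (constant failure rate) model gives nearly the same answer.
annual_failure_exp = 1 - math.exp(-hours_per_year / mtbf_hours)  # ~0.87%

print(annual_failure_simple, annual_failure_exp)
[/pre]

Either way it lands just under 1%, so the 1% working figure is on the conservative side.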

 

Highlights:

A 10-drive unprotected array will likely experience data loss about 1 year in 10 (roughly a 50-50 chance of a failure within a 5-year period)

A 10-drive parity-protected array will likely experience data loss about 4 times in 100,000 per year (IF you detect a single drive failure within 1 day)

A 10-drive parity-protected array will likely experience data loss about 2 times in 10,000 per year (IF you detect any single drive failure within 5 days)

(This is the same as your chance of having a second drive fail within 5 days of an initial failure)

Your chances of NOT being able to recover AFTER a single drive failure (assuming recovery within a day) are about 15 in 10,000 per year.

 

My assumption was that an individual drive is this reliable per year.

99.00%

 

Failure Interval (assume that if a failure occurs within this many hours that it would be detected and corrected)

24

 

Disk reliability per failure interval

99.9973%

 

Number of disks in the array

10

 

The chances that ALL the disks would stay good (parity not needed)

99.9726% (89.994384% annualized)

 

The chances of one particular disk failing and the rest staying good

0.00274% (1.000438% annualized)

 

Chance of any one of the disks failing while all others stay good

0.027391% (10.00438% annualized)

 

Adjusted for recoverability (if there is a failure of any disk during the rebuild, recovery is not possible)

0.027382% (10.00137% annualized)

 

So your overall reliability with unRAID is

99.9999884% (99.995751% annualized)

 

Your chances of failure due to multiple drive failures

0.00001163% (0.0042486% annualized)

 

Your chances of data loss are this many times higher without unRAID

2355.1X
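For anyone who wants to reproduce these figures, here is a compact sketch of the per-interval model described above (my own names and rounding, and it folds the rebuild exposure into the same interval, so the results land close to, but not exactly on, the spreadsheet's numbers):

[pre]
# Sketch of the per-interval model above (my names; results land near, but
# not exactly on, the posted figures because of rounding and annualization).
def annual_loss_chance(r_year=0.99, interval_h=24, n=10):
    r_int = r_year ** (interval_h / 8760.0)   # per-drive reliability per interval
    all_good = r_int ** n                     # parity never needed
    one_fails = n * (1 - r_int) * r_int ** (n - 1)
    # Recovery fails if any of the 11 drives (10 data + parity) fails again
    # before the rebuild completes (simplified to the same interval).
    recoverable = one_fails * (1 - (n + 1) * (1 - r_int))
    fail_per_interval = 1 - (all_good + recoverable)
    return fail_per_interval * (8760.0 / interval_h)   # annualized failure chance

print(annual_loss_chance())                            # ~0.004%  (about 4 in 100,000)
print(annual_loss_chance(interval_h=120))              # ~0.02%   (about 2 in 10,000)
print(annual_loss_chance(r_year=0.90, interval_h=120)) # ~2% with 90%-reliable drives
[/pre]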

 

 

Link to comment

Very nice work; someone else will have to check the numbers (as my eyes kinda glossed over just a tiny bit).  It looks like you're calculating the odds of ANY data loss though, and the numbers would be the same with unRAID vs any RAID5 out there.  The main reason why I chose unRAID over just rolling my own is that having two or more drives fail will NOT result in the loss of all data.  Heck, if something is important enough, I can make a copy of it on every single one of the drives.  Can't do that with RAID5.

 

I think it'd be interesting if you take into account a few different installs.  Maybe we can assume one case where we have three drives (Basic version), six drives (Plus version), and sixteen drives (Pro version).  Personally I'd like to see something around ten drives as well.

Now, what are the odds that a PARTICULAR drive will lose all its data?  In the case of a three-drive setup (with the third drive acting as parity), if two drives randomly fail, what are the odds that we will lose the data on drive #1?  We have about a 66% chance of losing all of it (if my math is correct), since the three possibilities are that drives 1 and 3 fail (you lose all data on drive #1), drives 1 and 2 fail (you lose all data on drive #1), and drives 2 and 3 fail (all data on drive #1 is intact).

Another way to look at the numbers would be: what percentage of data am I likely to lose if two drives fail at the same time?  In the three-drive setup, you're likely to lose about 66% of your data.  If drives 1 and 3 fail, you still have 50% of your data.  If drives 1 and 2 fail, you have 0% of your data.  If drives 2 and 3 fail, you still have 50% of your data.  So, 66% of the time we still have 50% of our data.  My numbers are probably a bit off, but you get the picture.

The numbers look a LOT better with more drives.  If you have 16 drives and two fail, at worst you lost two data drives out of fifteen.  You still have data intact on thirteen drives.  The odds of losing all data on a particular drive are very slim.
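To put a rough number on that last point, here is a small sketch (my own names and assumptions: equal-sized drives, exactly two of the drives, counting parity, fail at once, and all pairs are equally likely):

[pre]
from itertools import combinations

def expected_fraction_lost(n_data):
    """Average fraction of data lost if exactly two of the n_data data drives
    plus one parity drive fail at the same time (all pairs equally likely)."""
    drives = list(range(n_data)) + ["parity"]
    losses = []
    for a, b in combinations(drives, 2):
        failed_data = [d for d in (a, b) if d != "parity"]
        losses.append(len(failed_data) / n_data)   # each data drive holds 1/n_data
    return sum(losses) / len(losses)

for n_data in (2, 5, 9, 15):       # i.e. 3, 6, 10, 16 drives total
    print(n_data + 1, expected_fraction_lost(n_data))
# prints: 3 -> ~0.67, 6 -> ~0.33, 10 -> 0.20, 16 -> 0.125
[/pre]

On average that works out to 2/(N+1) of the data for N data drives, which matches the ~66% in the three-drive example and drops to 12.5% for sixteen drives; with RAID-5 the same double failure always costs 100%.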

Link to comment

The main reason why I chose unRAID over just rolling my own, is that having two or more drives fail will NOT result in an entire loss of all data.  Heck if something is important enough, I can make a copy of it on every single of the drives.  Can't do that with RAID5.

Very good point, and one I did not think of at all.  It does not solve the issue of needing off-site backups, but it sure reduces the chance of losing critical files even further... (A direct hit by lightning might melt most of the disks, so an off-site disk drive or CD copy of critical files is STILL a good idea.)

 

Joe L.

Link to comment

Your approach appears sound, but I question the annual 99% drive reliability factor.  From personal experience (my drives and those of people I know), it is nowhere near that.  90% is more like it.  That may not agree with the stated MTBFs, but remember that our drives live in the real-world, with poor power, excess vibration/shock, and often uncontrolled temperatures that make the theoretical MTBFs laughable.

 

That means that your unprotected and protected values are too rosy, making the need to have the protected system far greater.  For example, I would expect an unprotected 10 drive array to have a problem virtually guaranteed in five years, not the 50/50 chance you give it.

 

Again, the approach is sound so I see no need to alter the spreadsheet, just change the variables.

 

 

Bill

Link to comment

Interesting.  My experience with drives has been good to excellent.  Occasionally I get a sector that goes bad, but have had good luck getting SMART to correct them.  I do use active cooling on my drives. I'd say that I am getting at least 99% reliability.  (Now that I've said this, Murphy is going to come along and take out my entire array!)

 

I ran the model lowering the reliability percentage to 90% and setting the "failure interval" to 5 days.  It came out to 97.898% reliable with unRAID (compared to 99.9957% using 99% and a 1-day interval).

Link to comment

I don't know, I've had my 12-drive array up since the very earliest 1.x release, with 11 recertified Maxtor 300GB 5400rpm drives and a new Seagate 7200rpm 300GB parity drive.  10 drives are crammed into 5-in-3 brackets, and I taped a cardboard box over the front of the computer with an 8" AC-powered fan hot-glued in place... judicious cutting with a jigsaw allowed me to mount two 400-watt power supplies in there... sounds and looks REALLY ugly.  The UPS battery has needed replacing for a while now, so probably once a month or so I get a long enough power outage to trigger a "spontaneous reboot", so I get an automatic parity check (LOL) whether I need it or not.  In the first few weeks I was getting some errors, which I tracked down to cables and the motherboard's "pickiness" about memory... removed one of the 2 512MB sticks and it's been running that way ever since... certainly a scenario for MTBF credibility.  At any rate, I log into the webpage of the server periodically out of curiosity (no errors in years); other than that, I never give it any thought  :D  As to the future, well, I found an old SCSI-based ATX server chassis that I'm re-doing with an Abit AB9pro, and I'm going to bestbuy this morning to score 6 of those $190 1TB Seagates...

 

 

Link to comment

If you look at Google's vast experience with disk failures, drive temp, utilization, and power-on hours are practically irrelevant.

 

http://labs.google.com/papers/disk_failures.pdf

 

 

 

Thanks for linking - despite my following paragraphs, I find the study enlightening and informative.  I just don't find it conclusive or necessarily relevant to us.

 

There are several problems with this study.  Among them: it does not differentiate between manufacturers, and it doesn't vary drive temps, just reports them.  A good study of drive temps would take several batches of drives, note their manufacturers and mfg dates, and intentionally load them at different temps, then see if like mfgs/dates have different failure rates.  Additionally, this is a professional environment where they would never allow drives to get hot enough for the differentiation to be an issue (the study's max was 50C).

 

As a silly example of this last point, if we correlated the temps in the state of Florida with human lifespan, we may find no correlation.  But I am willing to bet that none of us would survive in 80C weather for very long.  Similar with drives.  The temps reported by Google are all "in the range" where the lack of correlation would only be interesting if we ASSUME that correlated observations follow some sort of smooth function (i.e. 1C of temp increase = 1% higher failure rate) as opposed to a more complex function (i.e. anything within 30-50C = no correlation, each 1C from 50-60C = 1% higher failure rate, but every 1C above 60C = 25% higher failure rate).

 

All drives in this study are spinning continuously.  Most of us power down our drives despite the understanding that power cycling is a highly abusive factor.  Could the heat left over from when it was spinning build up in pockets that no longer have internal air flow and damage the electronics?  Perhaps.  It is like shutting down your car's engine when it is hot.  Many cars let their electronic fans cool the car for many minutes after it is shut off (two of my three cars, for example).

 

Another issue is installation and handling throughout the lifecycle of the drive.  How many folks have ever moved their computers when the drive was spinning?  I have.  How about bumping it with a knee or elbow?  Again, I have.  Did you bump it against the chassis while installing it?  Again, I am guilty and I am fairly careful.

 

I can go on.

 

Again, I am not dismissing the importance of the google study.  The immense size of the study's population and the careful compilation of data make it far more powerful than my earlier "in my experience" anecdotal reference so we need to put our collective experiences in the back seat compared to this one, but let's not assume being in the front seat means it is necessarily representative of real-world behaviors in our non-commercial environments.

 

 

Bill

 

 

Link to comment

Thanks Bubbaq for posting!

 

I found the chart at the top of the 4th page very informative.  It shows that drives are getting more reliable, but a 2-3 year old drive is only 91 - 92% reliable, whereas a 1 year old drive is 98 - 99% reliable.  Given that I have replaced almost every drive in my system with a new drive in the past year, I think I'm closer to the better stat - but have to say I am surprised that the older drives have such a (relatively) high failure rate.  I think my personal experience has been better, but this seems closer to Billped's experience.

 

It seems that reallocated sectors are a significant indicator of future (short-term) failure.

 

One of the more interesting aspects is "what is a failure".  The study defines failure as anything that might happen to cause a person to replace a drive (other than to upgrade its capacity).  So if one person gets a bad sector and uses a tool to relocate it, and another person has exactly the same thing happen but replaces the drive as a result, one would be a failure and the other would not be.  You can see why getting good statistics is hard.

Link to comment
