New build, need to figure out some components...


joeyke87


Since I'm looking for more space, I'm considering the following components:

 

Case: Lian Li PC343B

Bays: IcyDock MB455SPF

CPU: E5400

RAM: DDR2-1066, 2GB

 

With these components I would be able to use a total of 30 drives. Therefore I need a motherboard that will support three Supermicro AOC-SASLP-MV8 cards plus six internal SATA connections.

 

Someone told me that my current motherboard (P5Q Deluxe) would support this; however, that board doesn't have onboard video (which I prefer). Then there's the PSU: I'm looking at a Corsair 850 HX right now, but it only supports a maximum of 24 drives. Are there any bigger ones on the market, or is the only solution to buy a second PSU for the other six?

 

And finally, which 2TB drives to get... I was looking at the new Samsung F4 drives (very cheap here), but I've read about some problems with the 4K-sector thing.

 

I could really use some advice on this. Thanks in advance  :)

 

 

Link to comment

Am I the only one who feels like having a huge number of disks with a single parity drive is less than safe? For the money you're spending on that case, a motherboard with at least 3 PCIe x4 slots, and all those backplanes, you could build another server.

 

Also, UnRAID currently only supports 22 drives, of which only 20 are data.

 

Drive-wise, I just got the Seagate "green" drive and have been very happy with it. Others prefer the WD Green. The Seagate has lower power consumption at idle, and the WD is better at load. I figure a drive spends more time idling, so I picked the Seagate.

Link to comment

I wouldn't consider it unsafe, but it is definitely more risky than a 20 drive server.

 

However, the point is moot as unRAID only supports 22 drives anyway (1 parity + 20 data + 1 cache).

 

Of course that may change in the future, but I doubt we'll get to 30 drives any time soon.

 

As for drives, I would recommend the WD Green drives as they are the easiest to 'fix' (just add a jumper).  The rest of the drives are a pain to work with.  Definitely don't get the Samsungs, as there is no known fix at the moment.

Link to comment

As for drives, I would recommend the WD Green drives as they are the easiest to 'fix' (just add a jumper).  The rest of the drives are a pain to work with.  Definitely don't get the Samsungs, as there is no known fix at the moment.

 

Not to derail the thread, but, what do you mean by 'fix'?

Link to comment

Thanks for all the replies so far. First of all, I understand that some of you might think this isn't safe. However, I'm giving my old server to my parents, so I'll always have a backup (though space on it is limited, so I'll have to upgrade that too in a few months). Also, I know that 30 drives aren't supported right now, but I don't think I'll reach 30 drives for maybe 5 years. That's why I bought a case that is future-proof.

 

Anyway, as for the motherboard: I'm looking at an Asus M4A89GTD PRO right now with an AMD Sempron 140 processor. The expansion slots on this motherboard:

 

2 x PCIe 2.0 x16 support ATI CrossFireX™ technology(@dual x8 speed)

1 x PCIe 2.0 x4

1 x PCIe 2.0 x1

2 x PCI

 

So I should be able to use three Supermicro cards, right?

 

As for the drives, I'll take a look at the Seagate and WD drives.

Link to comment

As for drives, I would recommend the WD Green drives as they are the easiest to 'fix' (just add a jumper).  The rest of the drives are a pain to work with.  Definitely don't get the Samsungs, as there is no known fix at the moment.

 

Not to derail the thread, but, what do you mean by 'fix'?

 

Basically what Traxxus said - advanced format drives don't work natively with unRAID at the moment, so each one requires some sort of manipulation to make them work.  I feel that physically installing a jumper (which takes 2 seconds and costs about 2 cents) is far easier than messing around with firmware updates (which took me about 30 minutes to try several times and finally give up, although it was free).

 

~~~~~~~~~~~~~~~~~

 

joeyke87: That motherboard looks good, and yes it should be able to support three SuperMicro cards.  Just understand that you'll be breaking new ground with this build - I don't know of anyone else who has used more than 2 of these cards.  To be extra safe, you may want to double check with Asus and make sure that this will work.  And by the way, that PCIe x1 slot can support another 2 drives if you want to take your build to the limit ;)

 

The Sempron 140 is a great CPU and perfectly tailored to unRAID, as long as you don't plan on doing anything too CPU intensive on the server (such as encoding video).

 

In the spirit of 'future proofing', I would recommend that you buy all three SuperMicro cards now, or at least within the next few months.  You'll likely waste a bit of money by doing this, but I wouldn't take their continued availability for granted.  Hopefully they will be around for a long time, but who knows - SuperMicro could come out with something new in 6 months and these cards could become hard to find.  Normally it is best to space out purchases of identical components as much as possible, but in this case you might want to buy them all now.

Link to comment

Alright, I'll buy the three cards soon then. As for the drives, the old Samsung drives are still available, so I buy one of those every now and then. As soon as they run out I'll buy the WD drives. By the way, I've also contacted Asus about support for the three Supermicro cards, hoping to hear from them soon. And what do you mean by "the PCIe x1 slot can support another 2 drives"? Do you mean another SATA card which holds 2 ports?

 

Also, I won't be encoding anything with the Sempron 140. It's just for streaming 1080p movies to my HTPC.

 

Once I get a reply from Asus I'll post it here; it might be useful for someone else some day.

Link to comment

I also personally think a 30 drive system is too risky.  Heck, I think a 20 drive system is too risky.  Predicting failure rate in an unRAID array is difficult because it is so different than RAID, where most of the research has been done.

 

First, consider that there is an estimated 3.1% industry average for premature drive failure (things like faulty electricals, bad bearings, etc. within 1 year).  Your odds of losing a drive to these causes are 1 - .969^(number of drives).  So a new 3 drive array has a 9% chance of one drive suffering premature failure.  12 drives is a 31% chance.  A new 20 drive array has a 46.7% chance of one drive failing prematurely.  30 drives would have a 61% chance.  The odds of two drives prematurely failing in a 3 disk array are .8%; 12 disks, 10%; 20 disks, 22%; 30 disks, 37%.  Just hope both don't fail at once, or you will have data loss!  My personal risk limit is a 12 disk array (10% chance of a double drive failure, potentially resulting in lost data).  This is why it is not recommended to buy all your drives from the same lot.  The odds of two drives failing at the same time, with the same failure mode, are greatly increased!
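
To make those single-failure percentages concrete, here's a minimal Python sketch of the calculation (the 3.1% annual rate is the figure quoted above, treated here as an assumption rather than a vendor spec):

```python
# Probability that at least one drive fails prematurely in the first year,
# assuming each drive independently has a 3.1% chance of premature failure.
def p_at_least_one_failure(num_drives, annual_failure_rate=0.031):
    return 1 - (1 - annual_failure_rate) ** num_drives

for n in (3, 12, 20, 30):
    print(f"{n} drives: {p_at_least_one_failure(n):.1%} chance of a premature failure")
# 3 drives: 9.0%, 12 drives: 31.5%, 20 drives: 46.7%, 30 drives: 61.1%
```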

 

The next mode of failure to consider is usage.  Here you are concerned with the unrecoverable read error (URE) spec of a hard drive.  There are two different consumer drive ratings on the market: one URE per 10^14 bits read, and one per 10^15 bits.  So, by that rating, a 10^14 drive will lose data after about 11.37 TB has been read from it.  A 10^15 drive will fail after about 114TB has been read.
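
For reference, those TB figures come straight from converting the bit ratings into binary terabytes (TiB):

```python
# Convert a URE rating given in bits to the equivalent amount of data read, in TiB.
def bits_to_tib(bits):
    return bits / 8 / 2**40

print(f"10^14 bits = {bits_to_tib(1e14):.2f} TiB")  # about 11.37 TiB
print(f"10^15 bits = {bits_to_tib(1e15):.1f} TiB")  # about 113.7 TiB
```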

 

So again, let's look at the worst case.  Let's say you buy all 2TB Seagate LP drives and you start with 3 of them.  Those drives have a 10^14 URE rating.  Now let's look at unRAID.  First, let's say you follow common practice and run a preclear on each disk before adding it to the array.  Well, that means you read every bit on the drive, write a 0 to every bit on the drive, and then read every bit on the drive again.

 

So each of your three disks has read 4TB before you even start your array.  Over 1/3 of the way to an error before you even get started.  Now you bring your array online for the first time and unRAID builds parity.  During the parity build, it reads from each data disk and writes to the parity disk, so that is another 2TB per data disk.  Now your data disks are at 6TB, with parity at 4TB read.  Next you fill each data drive in your array.  To fill each data drive completely, you write 2TB from an external source (PC, external HDD, etc.).  Then unRAID reads each drive again, recalculates parity (which requires a read from the parity drive) and writes to parity.  So filling the drives means 2TB of data read from each data drive, plus 2TB read from parity for each data drive, so 4TB read from parity.  This brings you up to 8TB read for each of the three drives, and you've done nothing but preclear, build the array, and copy data over to fill it.
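
As a sanity check on that accounting, here's a rough read tally for the scenario above (three 2TB drives, two data plus one parity; only reads are counted, since that's what the URE rating is about):

```python
# Rough read tally, in TB, for the steps described above: preclear, initial parity
# build, then filling both data drives from an external source.
SIZE = 2  # TB per drive

reads = {"data1": 0, "data2": 0, "parity": 0}

# Preclear: read the whole drive, zero it, read it again -> two full reads each.
for d in reads:
    reads[d] += 2 * SIZE

# Initial parity build: each data drive is read once (parity is only written).
for d in ("data1", "data2"):
    reads[d] += SIZE

# Filling the array: each data drive is read once, and parity is read once per data drive.
for d in ("data1", "data2"):
    reads[d] += SIZE
    reads["parity"] += SIZE

print(reads)  # {'data1': 8, 'data2': 8, 'parity': 8} -> 8TB read per drive
```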

 

Now a month has gone by and it is time for a monthly parity check.  That requires reading 2TB from each data drive and 2TB from the parity drive, which puts each disk at 10TB read.  At your second month's parity check, you will statistically have an unrecoverable read.  This will cause a parity error, which will require a parity rebuild.  This rebuild will statistically cause another unrecoverable error on your data drives.  And since your parity drive is corrupt too, you can't recover the lost bit.  This means you've lost data, as the file containing that bit will be corrupt.

 

This will leave you with a major headache, as you can potentially still recover the lost bit on the data drive, assuming that was not the corrupted bit on the parity drive.  But it will require you to isolate the locations of all corrupted bits, basically manually calculate parity, and correct the missing bit by hand.  Not something most people could do, I think.  Most people will just lose that file.

 

Or, if after filling your disks in the first month you add a fourth disk, I think you would statistically fail your parity drive, leading to all the same problems as above.  So almost before you even get started, you end up with a mess on your hands.  All this with only three disks, so imagine 30!

 

Now, your array will last a little longer if you don't run preclear.  A regular clear is not as read intensive (I think).  And if you don't have a backlog of 4TB to dump on your array, it will obviously take you longer to fill it.  But still, you are probably looking at UREs within the first 1-1.5 years, even if lightly used (minimal data, no preclear, just regular parity checks).

 

So what is the first lesson to be learned here?  Don't use any drives with a URE rating of 10^14.  As much as people will probably disagree, that means no Seagate LP drives and no Seagate XT drives.  Basically, no consumer 2TB Seagate drives at all.  Only the 2TB Constellation drives pass muster.  You might be able to get away with lower capacity drives (500GB - 1TB), as your reads in the above process will be much lower.  From a 2TB perspective, the WD Green drives (but not the Black) and the 2TB Samsungs are the only ones rated 10^15.  Enterprise drives are even better at 10^16.

Link to comment

Wow, I must say that is very interesting, KYThrill; it really gives me something to think about. I understand that the chance of 2 drives failing at the same time gets bigger. However, I have the exact same data on another server at my parents' house, so all I have to do is recover the data on one disk, consider the other one a loss, and transfer the data back from the other server. At least that's what I think; please let me know if I'm wrong. And if I understand this right, after a while the parity drive will crash because it has lots of reading to do for all the drives, but can't I then just change the parity drive? Sorry, I'm a bit confused and especially scared right now, because in the end all I want to do is expand my hobby, which is collecting movies.

 

Anyway here's the reply from Asus:

 

Dear Customer,

 

Thank you for your email.

 

The PCIe slots on the mainboard are not only made for graphics cards.

So you can use add-on cards on the PCIe slots without problems.

 

I think that the SATA cards you want to use will work with this mainboard, as looking at all the specifications, the mainboard has all the necessary connections for these cards and should support them.

 

However as we have never tested any of these cards on our mainboard we cannot say for certain if this will work.

So I would advise you to contact the manufacturer of the SATA cards to ask them if they know more about this configuration.

 

So that's what I'm going to do for now :)

Link to comment

Well, UREs don't mean you have to throw away the drive.  It just means the drive had an error reading one bit of data, and the ECC couldn't correct the error.  The drive could actually determine that the bit location is still working (just a random botched read for some other reason), or the drive may mark the sector as bad, etc.  There are a number of outcomes for a URE depending on your configuration and HDD manufacturer.

 

I think the general advice around here though is that if one sector reports bad, start shopping for a new drive.

 

But your description sounded like you had a server full of stuff, and you wanted to build a new server and copy over your data.  If you do that with three or more 10^14 rated drives, and you have 4TB or more of data, then you will likely find yourself running into failed parity checks, corrupted drives, etc. much more quickly than you want, if you follow the best practice guidelines (pre-clear drives, monthly parity checks, etc).

 

In my above scenario, a 10^14 rated parity drive has a URE after a few months.  The drive could decide it is still fine, or it could mark the sector as bad.  In the former case you might keep using the drive, and it might then be quite a while before you get another 14TB of reads on it.  But if the latter occurs, it may mean replacing it.  So then you stick in a new parity drive (or rebuild the old one).  While rebuilding your parity, a data drive then fails because of a URE.

 

At that point, the data drive may not be a total loss.  If it was, you could put in a new drive and copy again from your old server.  But anything not on your old server could potentially be lost (ie, the new movies you've "collected").

 

I personally think it is less risky to start with a lower density configuration.  Say you need 4TB of storage.  Don't start with three 2TB drives (2 data + 1 parity).  Start with five 1TB drives.  Even if you get 10^14 URE drives, it will be longer before you encounter a URE.  If you are smart and get 10^15 1TB drives, it may be years before you have a URE problem.

 

My risk approach is to expand to 12 drives, using 500GB-1TB data drives.  In theory I can get my first 12TB this way, and URE is much less likely (but pre-mature drive failure odds do go up).  And in the future when I may need more than 12TB, I have the option of making it 24TB easily.  And by then, hopefully all 2TB drives will be 10^15 (or I can buy 10^16 enterprise drives).

Link to comment

My understanding of the URE rate specification is that it is a measure of how often, on average, the drive will report an (unrecovered) read error.  It doesn't mean that the drive will be okay until 10^15 (115TB?) of data have been read.  You may get a URE on the very first read, but then, on average, not get another URE for another 115TB.  I also didn't believe that a URE is a hard error, but just a read which didn't succeed, despite error recovery processing performed by the drive controller (ECC, re-reads, etc) - a subsequent attempt to read the same data could still succeed.

 

In any case, with SMART, if it is a hard error, the sector would (could?) simply be marked for replacement on the next occasion it is written.

 

The other thing to bear in mind is that URE is a measure of how many bits are read before an error occurs, and the spec (certainly for the WD green drives) is the same whether it is a 320GB drive, or a 2TB drive.  Given the same quantity of data read, errors will occur just as frequently whether spread across several 320GB drives, or on a smaller number of 2TB drives.

 

Further, it is not clear from the simple published specifications, whether the specified URE relates to ideal environmental conditions, or at worst case extremes of the specified environmental conditions.  Neither of those conditions is likely to pertain to the environment in which your drives are operating, so the URE rate may well be a lot worse (or a lot better) than your theoretical figure.  Certainly, the WD spec sheet stipulates the following in relation to the environmental specifications: 'NO non-recoverable errors during operating tests or after non-operating tests'.

 

Anyway, I think that it is too easy to get tied up in theoretical limitations, based on published specifications, rather than just getting on with the realities of life!

Link to comment

My understanding of the URE rate specification is that it is a measure of how often, on average, the drive will report an (unrecovered) read error.  It doesn't mean that the drive will be okay until 10^15 (115TB?) of data have been read.  You may get a URE on the very first read, but then, on average, not get another URE for another 115TB.  I also didn't believe that a URE is a hard error, but just a read which didn't succeed, despite error recovery processing performed by the drive controller (ECC, re-reads, etc) - a subsequent attempt to read the same data could still succeed.

 

In any case, with SMART, if it is a hard error, the sector would (could?) simply be marked for replacement on the next occasion it is written.

 

The other thing to bear in mind is that URE is a measure of how many bits are read before an error occurs, and the spec (certainly for the WD green drives) is the same whether it is a 320GB drive, or a 2TB drive.  Given the same quantity of data read, errors will occur just as frequently whether spread across several 320GB drives, or on a smaller number of 2TB drives.

 

Further, it is not clear from the simple published specifications, whether the specified URE relates to ideal environmental conditions, or at worst case extremes of the specified environmental conditions.  Neither of those conditions is likely to pertain to the environment in which your drives are operating, so the URE rate may well be a lot worse (or a lot better) than your theoretical figure.  Certainly, the WD spec sheet stipulates the following in relation to the environmental specifications: 'NO non-recoverable errors during operating tests or after non-operating tests'.

 

Anyway, I think that it is too easy to get tied up in theoretical limitations, based on published specifications, rather than just getting on with the realities of life!

 

My understanding of a URE before the published rating is that it would be considered a premature drive failure by the manufacturer.  True, you could have one on the first read, but that would constitute a premature failure.  You could also power it up and have it not spin, also a premature failure.  In a normal drive, you should not encounter a URE until you are at least close to the published spec.  Basically, it's a minimum number of reads on a typical drive before a URE.  Your drive is atypical if it throws a URE significantly before the published number.

 

I'm not certain if ideal operating conditions or worst case operating conditions apply.  I would work under the assumption that operating condition is irrelevant.  A drive is also rated for failure based on the number of load/unload cycles, and that spec does not reference any environmental conditions.  At least at WD, these two specs are listed together, so I would say URE does not reference any environmental conditions.  And we have plenty of evidence of high load cycle counts causing premature drive failure in WD drives in typical, non-hostile environments.  So that would lead me to believe that if anything, these failure specs are for ideal operating conditions.

 

It is also my understanding (from WD white papers) that a URE will always be treated as a hard error on WD drives.  WD claims that 100% of soft errors can be corrected by ECC, if not on one attempt, then after several.  So a WD drive does not consider it a URE unless it is truly unreadable by any method and can't be corrected (I believe the same applies to all drive manufacturers).  So if a WD drive throws a URE, it is a hard error and that sector is marked bad and the contents reallocated (but on a full drive, as in my example, there is no room to reallocate a full sector).  In some drives, the drive will throw a URE and that bit is lost, but then the drive's hardware will try to read or write random bits to the same spot afterwards.  If those subsequent reads and writes are all good, the drive will not mark the entire sector as bad, but if another read fails the test, the sector will be marked as bad.

 

Then the data loss could be a single file (bad read in the file), a whole directory (bad read on a folder), or potentially the whole drive (bad read on the MBR or partition table).

 

As for the spec being the same on a 500GB drive as a 2TB drive, that is true, but the rest of your conclusion isn't always true.  It depends on how a drive is used.  Back to my example of three 2TB drives in unRAID vs five 1TB drives in unRAID.  First I pre-clear all my discs.  That means each 2TB drive has 4TB worth of reads before being assigned to an array.  Each 1TB drive will only have 2TB of reads.  Both drives have the same 10^14 URE rating, so I am 33% of the way there on my 2TB drive, but only 17% on my 1TB drives.

 

Then I mount my drives and unRAID builds parity.  Well, on the 2TB drives that requires a read from each data disk.  So each data disk will then have a total of 6TB worth of reads.  With the five 1TB drives, each data drive will only have 3TB of total reads.  So I am now 50% of the way to a URE on the 2TB drives and only 25% on the 1TB drives.

 

Then I have data on another server that I migrate over.  Let's say 4TB worth.  Now each data drive is read, and the parity drive is read for each data drive.  So with 2TB drives, all three drives will now have had 8TB worth of reads, just to preclear, start parity, and then copy the existing data.  You're 67% of the way to a URE on each drive.  However, after copying all the data over, the 1TB drives will have 4TB of reads on each data drive and my parity drive will have 6TB worth of reads.  So my data drives are 33% of the way to a URE and my parity drive is 50%, all better conditions than with 2TB drives.  Then when I start reading data back from my drives (playing movies, in the OP's case), all things being equal, my reads will be shared across 4 drives instead of 2, resulting in fewer UREs on the server with more drives.
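
To put the comparison side by side, here's a small sketch that generalizes the same hand accounting to any drive size and data-drive count (same assumptions as above: preclear, initial parity build, then migrating 4TB of data):

```python
# Compare cumulative reads (in TB) per drive after preclear, initial parity build,
# and migrating `migrate_tb` of data, for two array layouts. This simply mirrors
# the accounting in the posts above; it is not a drive-reliability model.
def read_tally(drive_tb, num_data_drives, migrate_tb):
    data_read = 2 * drive_tb                      # preclear: two full reads of each data drive
    parity_read = 2 * drive_tb                    # preclear of the parity drive
    data_read += drive_tb                         # parity build reads each data drive once
    data_read += migrate_tb / num_data_drives     # each data drive is read as it is filled
    parity_read += migrate_tb                     # parity is read once per data-drive fill pass
    return data_read, parity_read

for drive_tb, n_data in ((2, 2), (1, 4)):
    d, p = read_tally(drive_tb, n_data, migrate_tb=4)
    print(f"{n_data}x {drive_tb}TB data drives: {d:g}TB read per data drive, {p:g}TB read from parity")
# 2x 2TB data drives: 8TB per data drive, 8TB from parity
# 4x 1TB data drives: 4TB per data drive, 6TB from parity
```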

 

So your statement that your chance for error is the same given the same amount of data is wrong in the unRAID application.  A smaller number of larger drives will encounter a URE more frequently than a larger number of smaller drives, assuming the URE ratings are the same on all drives.

 

And the real world observations I've had (not just with unRAID) is that Seagate drives fail on bad sectors much more often than WD drives.  I also see many people on these forums asking for help because one Seagate went bad, and then another followed.  Or that their drives are just a year old, but their parity has failed, or their SMART report is showing bad sectors.  I think some of this can be traced back to the URE problems I'm describing.

Link to comment

KYThrill, I understand the logic behind your argument; however, I'm betting that your definition of usage and the OEM's definition of usage are different.  There is a distinct difference between reading the entire drive several times (say 6 times for a 2TB 10^14 drive) and getting a failure, and traditional HDD usage where a high number of reads are concentrated on a small subset of the total bits.  Your scenario is at one edge of the spectrum.  The other edge is to assume that we pick just one bit on the drive and read it over and over again until we experience a URE.  Theoretically we should be able to read that single bit more than 10^14 times before we encounter an error.  However, I'm betting that it will fail much sooner.  You really need to take a bit-level perspective.  Your scenario has the drive failing after all the bits on the drive have been read 6 times, but the example I just laid out has a single bit being read 10^14 times before failure.

There are so many different ways to interpret the OEMs' nebulous URE rate specifications.  In my view, a URE would be attributed to a bad bit on the physical disk, which would only be encountered after a high number of read cycles on that individual bit, and most likely there would be read errors (not unrecoverable ones) before the URE occurred.  The OEM's URE specification probably assumes that the bit reads are concentrated on certain areas of the drive (i.e., a usage profile), not spread evenly across the drive as during parity checks and preclear cycles.

 

As evidence that your conclusion is faulty I submit to you the posts on this forum.  If failures occurred at the rate that you suggest we would see hundreds if not thousands of posts concerning failed disks during parity checks and unRAID would not be as popular as it is.

Link to comment

As evidence that your conclusion is faulty I submit to you the posts on this forum.  If failures occurred at the rate that you suggest we would see hundreds if not thousands of posts concerning failed disks during parity checks and unRAID would not be as popular as it is.

 

I disagree with this statement.  In the past, a 10^14 drive was more than suitable for smaller drives.  2TB drives are only a few years old and unRAID is older than that.  So when you are talking 200 GB and 500 GB drives, you can read the entire content of a drive for years before any problem is developed.  Many people still use systems with the smaller drives.  They may have a 20 drive system of all 500GB drives.

 

Pre-clearing a drive is also a recent practice, which adds significantly to the wear of 2TB drives.  This pre-clearing script is less than two years old.  Also, I would say that many, if not most, don't pre-clear, don't run monthly parity checks, etc.  Heck, I've heard of people using unRAID without a parity disk.

 

Also, many drives are rated 10^15 or higher, which means you would need to read 114TB, which will take a much, much greater period of time to achieve.  And no one will complain if a disk fails three years after you get it.

 

No, the problem of quick failure is specifically with 2TB drives rated with a 10^14 URE.  To date, the WD Green and all Samsung 2TB drives are 10^15.  Only the Caviar Black, all consumer Seagates, and the Hitachi drives are both 2TB and 10^14.  Of those, no one is spending $180 a pop to deck out an unRAID server with all Caviar Blacks; they just aren't used widely in unRAID.  Hitachi has less than a 5% market share, so again, they aren't widely popular.  That leaves Samsung, Seagate, and WD Green as the most widely used 2TB drives.  Two out of three of those are rated 10^15, so they can run for years before the theory says they should throw errors.  Only Seagate 2TB drives are candidates for early failure.

 

And just search the forum; there are hundreds of reports of drive failures.  There are dozens of reasons why there would not be reports of widespread problems.  Mostly because the problems I "theorize" about only concern a very specific group, which is those who use 2TB drives with a 10^14 URE rating.

 

To your point, I agree that there is no definition of how URE is determined.  It could be reading a single spot a trillion times, or reading every spot on the disk 6 times.  But again, where information is lacking, one must apply logic.  Which drive is easier and cheaper for a vendor to make?  One where each bit must be readable a trillion times, or one where each bit need only be readable 6 times?  Whichever is cheaper and easier is most likely the method used by a modern manufacturer.  After all, these giant 2TB drives are intended for storage only, not really for OSes, applications, and such (even though people use them that way).  And when you store something, you aren't accessing it all that often.

 

But you are correct, most likely there is some sort of profile used and we don't know what that is, and it may vary from vendor to vendor, which means the numbers from one drive to another may not even be comparable.  However, I'm betting that whatever the test method, it is the least strenuous as possible to make the drives look as good on paper as they can.  Call me a cynic...

Link to comment

My understanding of the URE rate specification is that it is a measure of how often, on average, the drive will report an (unrecovered) read error.  It doesn't mean that the drive will be okay until 10^15 (115TB?) of data have been read.  You may get a URE on the very first read, but then, on average, not get another URE for another 115TB.  I also didn't believe that a URE is a hard error, but just a read which didn't succeed, despite error recovery processing performed by the drive controller (ECC, re-reads, etc) - a subsequent attempt to read the same data could still succeed.

 

In any case, with SMART, if it is a hard error, the sector would (could?) simply be marked for replacement on the next occasion it is written

This is my understanding of unRAID's logic in handling a "read" error.

 

When a drive reports a "read" error (and what I'm assuming you and the manufacturer refer to as an URE) unRAID will re-construct the data that could not be read by reading parity and all the other data disks and supply the re-constructed data to the process attempting to read the disk.   (You did not lose data)

 

In addition, unRAID will write that same data back to the disk that reported the error.   This will allow the SMART firmware on that disk to first attempt to re-write the original sector where the error occurred, and if not successful, re-allocate the sector to one of its pool of spare sectors.   Now, the disk that reported a URE has the correct data in either the original sector (if possible) or a re-allocated sector.  Again.. no data was lost.

 

 

If, by chance, the write to the disk with the URE fails, the disk is immediately taken out of service.  From that point until you re-establish parity protection by replacing the drive, or fixing the loose connection, or whatever, you can experience a URE and have no ability to correct it.  During that same window of time you have a statistical chance of suffering a second concurrent disk failure.  Again, unRAID has characteristics that make it hard to compare statistics with other RAID arrays.  Most of the disks might be sleeping, so if you have a 12 disk array with 11 sleeping, is the risk the same as with all of them spinning?  I don't think so.
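
As a concrete illustration of the reconstruction step described above, here is a toy simulation (not unRAID's actual code; it just shows the single-parity XOR math on one tiny sector):

```python
# Illustrative simulation of the recovery flow described above: a sector that
# can't be read from one data disk is rebuilt from parity plus the other data
# disks, then written back so the drive can rewrite or reallocate the sector.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Toy "array": two data disks and a parity disk, one 4-byte sector each.
data1 = b"\x01\x02\x03\x04"
data2 = b"\xaa\xbb\xcc\xdd"
parity = xor_blocks([data1, data2])

# Suppose data2 reports a URE on this sector. Reconstruct it from everything else:
reconstructed = xor_blocks([data1, parity])
assert reconstructed == data2  # the reader still gets correct data; nothing is lost

# unRAID then writes the reconstructed sector back to the failing disk; if that
# write also fails, the disk is taken out of service until the fault is fixed.
```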

 

Personally, I welcome a URE over a disk that randomly reports the wrong value to the OS but thinks it is correct and shows no sign of error at all.  I know I've read of two or three users whose disks acted in that fashion.  It is exactly that symptom that made me add the additional step of verifying that the data read back in the post-clear is all zeros.  On several disks reported over the past year or so, the values read back were not all zeros, but there was no URE either.  Those errors frighten me more.

 

I have no  qualms reading my disks 10^14 times... or more... as long as I do not ignore disk "read" errors, or disks taken out of service because a "write" failed.

 

Joe L.

Link to comment

My understanding of the URE rate specification is that it is a measure of how often, on average, the drive will report an (unrecovered) read error.  It doesn't mean that the drive will be okay until 10^15 (115TB?) of data have been read.  You may get a URE on the very first read, but then, on average, not get another URE for another 115TB.  I also didn't believe that a URE is a hard error, but just a read which didn't succeed, despite error recovery processing performed by the drive controller (ECC, re-reads, etc) - a subsequent attempt to read the same data could still succeed.

 

In any case, with SMART, if it is a hard error, the sector would (could?) simply be marked for replacement on the next occasion it is written

This is my understanding of unRAID's logic in handling a "read" error.

 

When a drive reports a "read" error (and what I'm assuming you and the manufacturer refer to as an URE) unRAID will re-construct the data that could not be read by reading parity and all the other data disks and supply the re-constructed data to the process attempting to read the disk.   (You did not lose data)

 

In addition, unRAID will write that same data back to the disk that reported the error.   This will allow the SMART firmware on that disk to first attempt to re-write the original sector where the error occurred, and if not successful, re-allocate the sector to one of its pool of spare sectors.   Now, the disk that reported a URE has the correct data in either the original sector (if possible) or a re-allocated sector.  Again.. no data was lost.

 

I didn't realize unRAID attempted to reconstruct data on the fly.  I just thought that if it encountered an error, you had to run a parity sync or rebuild to correct it.  Nice to know that it does it on the fly.  It sounds more like a traditional RAID array in this regard (RAID would read the bad bit from parity on another disk, attempt to write it to the bad location, and if unsuccessful, reallocate it, and then pass the new "good" data on to whatever process was calling it).

 

 

Personally, I welcome a URE over a disk that randomly reports the wrong value to the OS but thinks it is correct and shows no sign of error at all.  I know I've read of two or three users whose disks acted in that fashion.  It is exactly that symptom that made me add the additional step of verifying that the data read back in the post-clear is all zeros.  On several disks reported over the past year or so, the values read back were not all zeros, but there was no URE either.  Those errors frighten me more.

 

Yep, miscorrection is rare, but it sucks.  Supposedly it can occur every 10^21 bits, but that is much less frequent than a URE (6-7 orders of magnitude less frequent).

 

Link to comment

unRAID is RAID4, but without striping of the data.  The striping of data was traditionally done to improve performance, something not as necessary with today's faster drives.  The striping of data is what causes all data to be lost in traditional RAID4/5 arrays if two drives concurrently fail.  Not striping the data allows unRAID to not lose all the data (unless, of course, you only have two drives)

 

RAID would read the bad bit from parity on another disk, attempt to write it to the bad location, and if unsuccessful, reallocate it, and then pass the new "good" data on to whatever process was calling it.

And that is exactly how unRAID works too.  It is why a URE in either RAID5 or unRAID is not the issue it is when the drive is used singly.

 

Joe L.

Link to comment

Interesting thread. Just wanted to insert my $.02 worth and say that the preclear script is extremely important and should NOT be skipped. It does not wear out the drive. The truth is that brand new drives fail at a relatively high rate compared to drives with some usage. Preclear helps weed out the bad ones. But even more common is that newly introduced drives have some cabling problem that gets detected as poor performance or logged errors. Detecting these problems BEFORE a disk is added to the array is well worth the time and minuscule wear and tear on the drive.

Link to comment

Interesting thread. Just wanted to insert my $.02 worth and say that the preclear script is extremely important and should NOT be skipped. It does not wear out the drive.

Well, if I follow what was written above, running a single preclear cycle on any 2TB Seagate, Hitachi, or WD Black HDD and then adding it to the array will put it about 50% of the way towards having a read failure and possibly being identified as a failed drive by the array.

 

I happen to agree with you, bjp999, and I'll keep running preclear on all my new drives.  It does weed out the bad ones and makes sure you have everything squared away before adding a drive to the array.  I believe the benefits of running preclear far outweigh any possible "wear" that the drive may experience.

Link to comment

It's not like the URE happens automatically at a certain number of I/Os. It is a probability. And there is zero evidence of UREs happening and causing any type of problem. Otherwise we'd be seeing 1 or 2 parity errors on every few parity checks. We have NEVER seen this type of unexplained parity sync error.

 

Run preclear, keep your drive temps in the 20s or 30s, don't buy large batches of disks at the same time, and run monthly parity checks. Those are the best ways to have a healthy array.

Link to comment

Many of us feel KYThrill's wear analysis is incorrect.   You will not wear out a drive 50% in 40 hours.   If your drive fails in that time-frame, you do not want it in your array.

 

Now, all that said, all drives will eventually fail.  I just expect it to be many many years.   I've probably had 1 disk failure every 5 years in my household.  (But then I've got a LOT of disks spinning here and some have been spinning for more than 5 years)

 

 

Link to comment
So I am now 50% of the way to a URE on the 2TB drives and only 25% on the 1TB drives.

 

I think that this is a rather spurious conclusion!

 

Then I have data on another server that I migrate over.  Let's say 4TB worth.  Now each data drive is read, and the parity drive is read for each data drive.  So with 2TB drives, all three drives will now have had 8TB worth of reads, just to preclear, start parity, and then copy the existing data.  You're 67% of the way to a URE on each drive.  However, after copying all the data over, the 1TB drives will have 4TB of reads on each data drive and my parity drive will have 6TB worth of reads.  So my data drives are 33% of the way to a URE and my parity drive is 50%, all better conditions than with 2TB drives.  Then when I start reading data back from my drives (playing movies, in the OP's case), all things being equal, my reads will be shared across 4 drives instead of 2, resulting in fewer UREs on the server with more drives.

 

It sounds as though your parity drive should already have curled up its toes!  Clearly, by your argument, I should never run a parity check!

 

I think that you're also overlooking the problems of running 4KB-sector drives in 512-byte mode.  Each time you write one logical sector, the physical sector is read first!  If the writes are random, the physical sector is read eight times in order to populate it with data!
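
For illustration, here is the arithmetic behind that read-modify-write penalty (a hypothetical worst case, not a measurement of any particular drive):

```python
# Read-modify-write cost of 512-byte logical writes on a 4KB physical sector.
LOGICAL = 512            # bytes per logical (emulated) sector
PHYSICAL = 4096          # bytes per physical sector
sectors_per_block = PHYSICAL // LOGICAL  # 8 logical sectors per physical sector

# Worst case: 8 random 512-byte writes land in the same 4KB physical sector one at a
# time, so the drive reads that physical sector before each write -- 8 reads in total.
print(f"{sectors_per_block} logical sectors per physical sector; "
      f"up to {sectors_per_block} reads of the same 4KB block to fill it with random writes")
```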

 

And the real world observations I've had (not just with unRAID) is that Seagate drives fail on bad sectors much more often than WD drives.  I also see many people on these forums asking for help because one Seagate went bad, and then another followed.

 

Well, I've never trusted Seagate drives, and I've been buying HDDs for 25 years (my first HDD was a huge 5MB!) - I've never bought a Seagate for personal use!

Link to comment
