
How to know a new disk is "good"?


jhyler


It ain't preclear.  Or at least preclear ain't enough.  I posted another thread today where it was determined that a brand new disk bought straight from the manufacturer was put through a full 2-day run of preclear, passed with flying colors, and immediately failed with read errors when added to the array as a parity disk.

 

I didn't do that preclear because I wanted the disk zeroed - it was going to be a parity disk, after all.  I ran preclear because I was under the impression that it also acted as a sort of stress test that would get the disk to fail if it was going to.  And maybe it does for some kinds of errors.  But obviously it's not enough.  Certainly it's not impossible that the disk failed after the preclear completed, but the overwhelming likelihood is that there was some kind of manufacturing error that preclear didn't find but that showed up almost instantly when the disk went into the array.

 

So.  Let's say I want to do something to a new disk when I acquire it.  Some test or series of tests such that, if the disk passes, there is a strong likelihood that it will work in the array.  Preclear is part of it - because usually I do want the disk zeroed, and it may find some errors - but what other test steps should I be taking?

 

Anybody else feel this way?  What do you do?

Edited by jhyler
Link to comment

This is not a definitive answer to your question!  It is only my opinion...

 

A quote from you in another thread:

Quote

Though now that I'm looking I see other people in the forum having issues with 22T drives.

 

A possibility:  There is a design flaw or a defective manufacturing process in this model of hard drive that is causing an excessive number of early drive failures. (Remember, Preclear was not initially designed to force failures in drives but rather to find drives that were presumably defective when first received.  The detection of infant mortality in the drives was a side benefit.  Infant mortality failures are, by definition, assumed to occur in the first few hours of operation.  A random disk with the issue could fail at 5 hours or 150 hours.  Several years ago, it was recommended that a new disk be subjected to three preclear cycles.  I don't know if this was to put more hours on the disk to get through the infant mortality period or if it was found that there were disks that failed a basic read-write operation on subsequent cycles; probably a combination of the two.)

 

This would not be the first time that this has happened.  In fact, Seagate was sued over the high failure rates on its ST3000DM001 drives back in 2016.  You can read about that here:

 

        https://www.extremetech.com/extreme/222267-seagate-faces-lawsuit-over-3tb-hard-drive-failure-rates

 

The drive in question was introduced about four years prior to the date of this article.  I actually used several of these drives at the time in my servers and I did have a couple fail.  But the 'survivors' ran for another ten years.  I just pulled the last one about a month ago.  It was not because it failed but because I needed more storage space.  (I don't know the actual running hours of that last drive because the SMART "Power on hours" counter had rolled over!)

 

When you buy on the leading edge of technology, there is always a higher risk of encountering issues that were not fully appreciated when the product was first introduced.  (Remember that for years NASA selected CPUs based on roughly 10-year-old technology for its deep-space craft, to avoid problems many years after launch!)

 

It may be several years until the reliability of your 22TB model is finally determined.  (I am not singling out Seagate.  There have been other HD models from other companies that don't have good quality marks.  The ST3000DM001 was just a standout because its modest cost and other features made it very attractive to all types of users and, thus, it had a very high and long production run.)

Edited by Frank1940
Link to comment

Thanks for the reply, Frank1940. What you describe is indeed a possibility, hard to know except in retrospect.   My main concern now is data integrity in emergencies.  I, like I hope a lot of other people, keep a spare, pre-cleared disk on hand.  When a disk starts showing signs of failure I can swap it in immediately before things get out of hand.  Now I am increasingly concerned that my hot spare could be ready to fail me too because I didn't stress test it sufficiently.  In my situation that could be bad, bad, bad. 

 

So how can I regain confidence? My immediate thought is that in preparing a standby replacement disk, I will first use it to replace one of my "archive" array disks that I know isn't being written to anymore, and let Unraid rebuild it on the as-received new disk.  Afterwards I'll put the original disk back and trust the array.  Then - if the rebuild succeeded - I'll preclear the new disk and set it aside as a spare.

 

The downside of doing that is it puts a lot of I/O on all the other disks, enough to make me think that keeping two ready-spare disks might be wise.  In other words that solution may be overkill.  Which is why I asked the community here what they do to gain confidence in their new disks - which, unfortunately, remains unanswered.

Edited by jhyler
Link to comment

Short answer:  Buy drives with a known good track record from a reliable vendor, and have a solid backup strategy.

 

Long answer:  This is a rabbit hole many have found themselves lost in.  Don't go too far down the rabbit hole, else you may also get lost.

 

It is all about risk assessment.  One needs to look at a number of things, then make a decision for themselves.  How valuable is the data?  Can it be recovered/replaced from other currently available sources?  What kind of failures might I see?  How much money/time/resources am I willing to spend to mitigate the risk?  How much down time is acceptable if I need to perform data recovery?

 

The first thing to remember is that parity is not a backup strategy.  It does make recovering from a drive failure more convenient as well as minimize down time by emulating the data on the problem drive.  A backup is a duplicate copy of data which you have high confidence can be read and restored without error, be it back onto the original server or to another system (in the case where the original server has a catastrophic failure).

 

It is too bad that you had an issue with the new drive.  But stuff happens.  New hardware is pretty reliable, and failures typically fall along what is called a bathtub curve - a drive will either fail very early in its operational life or very late (many years).  Running a preclear typically is a good burn-in test.  At over 2 hours per TB, a drive is run through its paces and stressed a bit.  There are more aggressive tests available which do various data patterns, random reads/writes, etc.  It comes down to you deciding good enough is good enough and trusting the drive.
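
For anyone curious what a "data pattern" test looks like in principle, here is a minimal sketch in Python.  It is not what preclear does internally, and it is no substitute for a real surface test: it writes to an ordinary scratch file on a mounted filesystem (the file path and sizes are made up for illustration), and the operating system may serve the read-back from its cache rather than from the platters.  The four byte patterns are the ones the badblocks write-mode test is documented as using; everything else (function names, chunk size) is my own assumption.

```python
import os
import tempfile

# Byte patterns commonly used by destructive write/verify tests.
PATTERNS = (0xAA, 0x55, 0xFF, 0x00)
CHUNK = 1024 * 1024  # 1 MiB per write/read; an arbitrary choice


def write_pattern(path, value, size_bytes, chunk_bytes=CHUNK):
    """Fill the file at `path` with `size_bytes` copies of the byte `value`."""
    with open(path, "wb") as f:
        remaining = size_bytes
        while remaining > 0:
            n = min(chunk_bytes, remaining)
            f.write(bytes([value]) * n)
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device


def verify_pattern(path, value, size_bytes, chunk_bytes=CHUNK):
    """Return the starting offsets of chunks that do not match the pattern."""
    bad_offsets = []
    offset = 0
    with open(path, "rb") as f:
        while offset < size_bytes:
            expected_len = min(chunk_bytes, size_bytes - offset)
            data = f.read(expected_len)
            if data != bytes([value]) * expected_len:
                bad_offsets.append(offset)
            if len(data) < expected_len:
                break  # file is shorter than expected; recorded above
            offset += expected_len
    return bad_offsets


def pattern_test(path, size_bytes, chunk_bytes=CHUNK):
    """Run every pattern in turn; return a list of (pattern, offset) mismatches."""
    failures = []
    for value in PATTERNS:
        write_pattern(path, value, size_bytes, chunk_bytes)
        for offset in verify_pattern(path, value, size_bytes, chunk_bytes):
            failures.append((value, offset))
    return failures


if __name__ == "__main__":
    # Demo on a small temporary file; point it at a scratch file on the
    # new disk (and use a far larger size) for anything meaningful.
    with tempfile.TemporaryDirectory() as scratch_dir:
        scratch_file = os.path.join(scratch_dir, "pattern.bin")
        print(pattern_test(scratch_file, 8 * 1024 * 1024))
```

An empty list means every chunk read back as written.  The purpose-built tools work on the raw device and take steps to avoid the cache, which is why they are the better choice for an actual burn-in.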

 

In the end, with a backup of your data in hand, a drive failure becomes an inconvenience and not a stressful data recovery event.

Link to comment
1 hour ago, ConnerVT said:

The first thing to remember is that parity is not a backup strategy. 

 

Thanks @ConnerVT!   That is a most important point.  Any data that is truly irreplaceable (and valuable to you) should be duplicated and stored off-site.  There are many, many more ways to lose data than the failure of a hard drive: fire, theft, storms, vandalism, floods, lightning, etc.   Read this forum thread from a few years ago:

 

       https://forums.unraid.net/topic/50504-dual-or-single-parity-its-your-choice/

 

Be careful with statistics.  They predict what will happen in large populations.  If you are concerned about a single event occurring, they are useless.  (If you are the only person in the USA to be hit and killed by a meteor in 2024, you will still be dead.  Devil hang the probability of that happening to you!)

 

@jhyler, what we don't know at this point is what the infant mortality rate is for these drives.  (The manufacturers consider this to be proprietary information.) Is it 1 in 2000 or 1 in 1,000,000?   As a matter of fact, it does not matter!  You got a defective drive.  Period.  Were you simply unlucky and got the 1 in 1,000,000 drive, or a drive from a production lot with a failure rate of 1 in 1000?  We simply don't know...
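
To put rough numbers on that, here is a small Python sketch of the arithmetic.  The rates are just the hypothetical ones mentioned above, not real figures for any drive, and the formula assumes each drive fails independently of the others.  That assumption is exactly what breaks down if a whole production lot is bad, so treat the output as an illustration only.

```python
def p_at_least_one_bad(defect_rate, drives_bought):
    """Chance that at least one of `drives_bought` drives is defective,
    assuming each drive is independently defective with `defect_rate`."""
    return 1.0 - (1.0 - defect_rate) ** drives_bought


if __name__ == "__main__":
    # Hypothetical rates taken from the discussion, not manufacturer data.
    rates = {
        "1 in 1,000": 1 / 1_000,
        "1 in 2,000": 1 / 2_000,
        "1 in 1,000,000": 1 / 1_000_000,
    }
    for label, rate in rates.items():
        for count in (1, 10, 50):
            chance = p_at_least_one_bad(rate, count)
            print(f"{label:>15}, {count:>2} drives: {chance:.6%}")
```

Even at the worst of these rates the odds for a home user buying a handful of drives are small, which is the point: the number describes the population, and it says nothing about the one drive on your desk.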

 

All you can do is read the user reviews, particularly the one-star reviews, on vendor websites and make a judgment based on what is happening there.

 

In recent years, I have purchased HDs only from vendors who have a 30-day no-questions-asked refund policy.

Edited by Frank1940
Link to comment

Thanks for the replies, folks.  Not sure how we got onto the subject of backups, but that really isn't the point for me.  Like most people I suppose, my array contains data I could lose with no issue, data which if I lost would be expensive or difficult to recreate, and essential data that must be preserved.  My backups are planned and managed accordingly.  Data survivability is not the issue.  Data availability is what can keep me up nights.  I won't go into the details of me and my business; suffice it to say it's very important to have near-real-time access to a lot of data, and having to get a backup retrieved and reloaded would be looked at as a serious failure.  Unfair, maybe, but such is life. (Digression - you may be tempted at this point to wonder why I don't have multiple servers, then.  I have definitely thought about it, and it may eventually happen.  For now I just keep spare parts. Redundant servers create their own set of issues, and even if I did have them, I'd still be faced with this question.  End digression.)

 

It's interesting to note that in back-to-back posts we have (1) make decisions based on product track records (i.e., rely on statistics) and (2) "be careful with statistics".  Both statements are correct, if seemingly contradictory.  As Frank points out, statistics describe populations, not elements of a population.  I buy drives based on track records because it's better than not doing so.  But statistics don't tell me if that particular 4.5-star disk I just bought is any good.  Which is what happened here.  And it's why I asked for comments on how other people address the issue of whether the individual disks they buy out of good-track-record populations are individually good ones.  Not for proof positive, which is of course impossible, but I was hoping for "I run this script and it's weeded out most of my clunkers".  

 

Not having gotten any replies like that, I'm starting to think most people do rely on statistics without realizing it.  Which I suppose is probably fine for many if not most Unraid users.  

 

Thanks again for the input.

Edited by jhyler
Link to comment
19 hours ago, jhyler said:

Thanks for the replies, folks.  Not sure how we got onto the subject of backups, but that really isn't the point for me.

 

Sorry if I took your thread a bit off-topic.  Concerns about failing drives, replacing drives and such usually include issues of data loss and system downtime.  For me, that ties the subject to data backup as well. 

 

Earlier in the thread, keeping two cold spare drives on hand was mentioned, in case the first one fails quickly during replacement.  Having a drive on hand is good (I have one as well).  But having two (or more) in cold storage ties up several hundred dollars, just as insurance against something that very rarely happens.

Link to comment
On 12/27/2023 at 4:35 PM, jhyler said:

failed with read errors when added to the array as a parity disk

Your other post says

Quote

read errors on one of my older data disks. 

 

SATA connectors are not (and have never been) known for being robust, and most times the non-locking connectors work better than the so-called locking ones.

 

But assuming that your original posting is correct and not this one, then you slightly dislodged the SATA cabling to the other drive when replacing the parity drive.  Reseat all the cabling at both ends.
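
One way to tell a cabling problem from a media problem is to compare the drive's SMART counters before and after a test run.  Below is a rough Python sketch of that comparison.  It assumes you have already copied the raw values from the drive's SMART attribute table (for example from `smartctl -A` output) into two plain dictionaries; the attribute names are the ones smartctl commonly prints for ATA drives, but they vary by vendor, and the "media" versus "cabling" labels are my own simplification.

```python
# Attributes worth watching, with a rough hint at what a rising count suggests.
WATCHED = {
    "Reallocated_Sector_Ct": "media",
    "Current_Pending_Sector": "media",
    "Offline_Uncorrectable": "media",
    "UDMA_CRC_Error_Count": "cabling",
}


def compare_snapshots(before, after):
    """Return (attribute, hint, increase) for each watched counter that rose.

    `before` and `after` map attribute names to raw integer values;
    attributes missing from a snapshot are treated as zero.
    """
    findings = []
    for name, hint in WATCHED.items():
        delta = after.get(name, 0) - before.get(name, 0)
        if delta > 0:
            findings.append((name, hint, delta))
    return findings


if __name__ == "__main__":
    # Made-up example values, purely for illustration.
    before_test = {"Reallocated_Sector_Ct": 0, "UDMA_CRC_Error_Count": 0}
    after_test = {"Reallocated_Sector_Ct": 0, "UDMA_CRC_Error_Count": 14}
    for name, hint, delta in compare_snapshots(before_test, after_test):
        print(f"{name} rose by {delta} (suggests a {hint} problem)")
```

A rising CRC count with clean sector counters usually points at the cable or connector rather than the disk itself, which is when reseating the cabling is the first thing to try.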

Link to comment
