Disk health (time to replace some of my disks?)


rickardk

Yep - definitely have some issue.

 

(Assuming this is two separate arrays)

 

Top array:

 

Looks like disk5 is failing.  You could run another parity check and see if the numbers creep up.  If not, maybe ok.  But my guess is that this is only going to get worse.

 

Disk6 has some logged errors.  You should get a SMART report and look at the errors that have been logged.  Frequently these log messages indicate a bad or loose cable.

 

Bottom array:

 

Disk 1 is showing only a very small number of reallocated sectors. I'd keep an eye on it.  If it creeps upward after every parity check, RMA it.

 

Disk 6 looks bad.  Like disk5 in the top array, you could run a parity check and watch it, but likely it is going to continue to get worse.
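
The "watch whether it creeps up" advice can be sketched as a tiny check. This is only a minimal Python sketch, assuming you note the raw reallocated-sector count by hand after each parity check; the function name and the sample readings are made up for illustration and are not taken from any of the disks above.

```python
def is_creeping(readings):
    """Return True if the raw count rose between any two consecutive checks."""
    return any(later > earlier for earlier, later in zip(readings, readings[1:]))

# Hypothetical raw Reallocated_Sector_Ct values noted after each parity check.
stable = [3, 3, 3]
growing = [3, 5, 12]

print(is_creeping(stable))   # False
print(is_creeping(growing))  # True
```

A count that stays flat across several checks is the "maybe ok" case; one that keeps rising is the "RMA it" case.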

 

Glad to see you're using myMain.  This smart view across all disks was one of the motivations I had to create it in the first place.  Glad to see people are using it!

Link to comment

bjp999, what do the multi zone and ata errors mean?  I understand the rest.

 

Also, I thought you had to actually write to a disk to get it to reallocate sectors?  Since a parity check is read-only, will that work to test out a dodgy disk?  I figured you would have to do some test transfers with large files...

Link to comment

According to the Wiki linked on the bottom of MyMain:

Multi-Zone Error Rate - The number of errors found when writing a sector. The higher the value, the worse the disk's mechanical condition is.

 

I'm guessing my WD20EARS is not looking good:

 

» reallocated_sector_ct=442

» reallocated_event_count=3

» current_pending_sector=115

» offline_uncorrectable=109

» multi_zone_error_rate=123

 

I just built this server and set up unRAID and there's only one drive in it so far. It's only got 202 power on hours. I just ordered it from Newegg on 10/26.

 

I do have the jumper on pins 7-8.

Link to comment

That drive needs to be replaced.  No question about it.
Link to comment

So would:

 

» current_pending_sector=10

» offline_uncorrectable=10

» multi_zone_error_rate=221

 

Just be a case of watching the disk, or should I swap it out?  After noticing those numbers initially, I ran a parity check (non-correcting, from unMenu) that reported no sync errors, and the SMART error values did not change.

 

Edit:  The normalized values for all three of those show: 200 200 000

Link to comment

I did run preclear on this one. It took about 30 hours which it sounds like is normal for a 2 TB drive, based on what I've read.

 

Do you still have the smart output from after the preclear?  If it was clean after the preclear, this is the first disk I've heard of that died an early death after preclear.

Link to comment

bjp999, what do the multi zone and ata errors mean?  I understand the rest.

 

Also, I thought you had to actually write to a disk to get it to reallocate sectors?  Since a parity check is read-only, will that work to test out a dodgy disk?  I figured you would have to do some test transfers with large files...

 

I answered the first part of this already, but wanted to answer the second part.

 

My understanding is that the drive WILL reallocate a sector on its own if it detects that the sector is failing (SMART just reports it).  Obviously if the sector has failed and can't be read, it cannot do this invisibly.  But if the sector is readable and the drive detects that it is going bad, it can reallocate it without putting it in pending state.  unRAID would never see it.

 

But if a sector has really failed, and unRAID gets an error from the disk, you're right that only a write of that sector would force the remapping.  But the way unRAID works, if it can't read a sector, it will read the corresponding sectors on all other disks to determine what should have been in the bad sector.  It goes one step further and WRITES that sector value back to the problematic drive.  This act will cause the reallocation to occur.
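
To make the "read the corresponding sectors on all other disks" step concrete, here is a minimal Python sketch. It assumes simple XOR parity across the data disks, and the 4-byte "sectors" and the helper name are invented for illustration; it is not the actual unRAID "md" driver code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 4-byte "sectors" at the same position on three data disks.
disk1 = bytes([0x10, 0x20, 0x30, 0x40])
disk2 = bytes([0x01, 0x02, 0x03, 0x04])
disk3 = bytes([0xAA, 0xBB, 0xCC, 0xDD])

# The parity disk holds the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# Suppose disk2 returns a read error: rebuild its sector from the others plus parity.
rebuilt = xor_blocks([disk1, disk3, parity])
print(rebuilt == disk2)  # True
```

Writing `rebuilt` back to the failing disk is the step that would give the drive the chance to remap the sector.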

 

Given this, it is hard to understand how we get these "pending sectors".  Equally confusing is that we've seen examples where a bunch of pending sectors suddenly disappear with no sector reallocations at all.  My theory on this is that the drive is putting the sector in the pending status and not raising an error back to the calling program.  On the next access, if the drive is able to read the sector without a problem, it will return it to service. But if it gets an error, it will either attempt to reallocate the sector itself or return the error (which unRAID handles).  Can anyone confirm or deny this is what is happening?

Link to comment

I did run preclear on this one. It took about 30 hours which it sounds like is normal for a 2 TB drive, based on what I've read.

 

Do you still have the smart output from after the preclear?  If it was clean after the preclear, this is the first disk I've heard of that died an early death after preclear.

 

I have now had two drives die early deaths after completing the preclear.  The first one only completed one cycle.  The drive I just posted about completed 3 preclear cycles and then ran for less than a month.  It died on the 3rd parity check.  The SMART values I posted above (or any others for that matter) did not change at all during the first 2 parity checks.

Link to comment

Given this, it is hard to understand how we get these "pending sectors".  Equally confusing is that we've seen examples where a bunch of pending sectors suddenly disappear with no sector reallocations at all.  My theory on this is that the drive is putting the sector in the pending status and not raising an error back to the calling program.  On the next access, if the drive is able to read the sector without a problem, it will return it to service. But if it gets an error, it will either attempt to reallocate the sector itself or return the error (which unRAID handles).  Can anyone confirm or deny this is what is happening?

Easy: in addition to the sectors we explicitly request, there are several other times when sectors are accessed.  One guess is that the disk performs its own self-tests when it is otherwise idle.  It is then that it discovers the un-readable sectors.  Those are referred to as "off-line" tests.  Some disks perform these tests as long as they are spinning but otherwise idle.

 

In addition, you, as end users of the drives, can request either a "short" offline test or a "long" offline test.  The short test asks the disk to read a sample of its sectors when not otherwise busy.  The long test asks it to read all its sectors.  Either test can produce sectors pending re-allocation.  (Sectors found to be un-readable this way are not reported to the OS as a read error, so they cannot be re-constructed and re-written by the unRAID "md" driver.)  I suspect that some disks also perform these types of tests on their own as part of reading entire cylinders and read-ahead buffering of data.  The extra buffered data (and any extra errors regarding un-readable sectors) might not be sent back to the OS, as it was not requested.

 

In the same way, during any initial parity calculation, when a new disk configuration is established, every block on every disk is read.  There is no way to re-construct a sector then, since no parity exists yet.  That could also result in sectors pending re-allocation.

 

Joe L.

Link to comment

I did run preclear on this one. It took about 30 hours which it sounds like is normal for a 2 TB drive, based on what I've read.

 

Do you still have the smart output from after the preclear?  If it was clean after the preclear, this is the first disk I've heard of that died an early death after preclear.

 

Unfortunately, I ran it before I learned of the mail utility (that I installed from unMenu) and the mail options in preclear.

 

I haven't had a WD drive fail on me yet. This one may be bad from the start, but it is still working. I'm not sure how to proceed on an RMA. I can return it to Newegg, since it's still within the 30 days or have WD do an Advance Replacement. I'm not sure what information they need. I don't have any data on it that isn't on another disk, but it would be easier to do an Advance Replacement.

 

I loaded FreeDOS and the WD Data Lifeguard Diagnostics on a USB drive with unetbootin and was able to boot that up on the unRAID box and run the diagnostics program but it said "Western Digital drives were found, however, none of them are supported in this version of the program."

 

I'm not sure if that's because of the different formatting that unRAID does or that the diagnostics program is not really finding the 2TB drive. I did search the forum and wiki to see if there was any info on running the DLD program within unRAID and didn't find any.

Link to comment

Can't really advise you on the best way to get it replaced. I'd probably use Newegg, having never gone the RMA route.

 

The WD tools may not be finding your drives on add-on controllers.

 

If you are having multiple drives fail after successful preclears, then you are either unlucky or you have some problem with your setup. It might be a bad PSU. My father, years ago, went through 3 monitors in a year. He had an electrician come to the house, who found a problem in the house wiring, and the next monitor lasted years. Most users have found preclear very effective at weeding out bad drives.

Link to comment

I may have assumed the most recent disk was bad too early.  When it was invalidated it stopped responding (hdparm, smartctl).  I replaced it with a precleared drive I keep in case of failure.  I also put the assumed-failed drive back into the 4220 at a different location.  At the new location it was responding again (unfortunately I didn't reboot at all with the drive in the original location).  With the drive responding and the array rebuilt with the new drive (I waited for this process to complete to be safe), I ran a preclear on the assumed-failed drive.  At the conclusion of the preclear, the errors had dropped to 5 (except the multi_zone_error_rate, which was still sitting at the same value).  I left the drives in and checked SMART again last night.  All the errors had dropped to zero and the normal values are showing like nothing ever happened.  Reallocated_sector_ct is also still 0, so I am assuming none of the sectors were actually reallocated.  I am currently running 2 more preclear cycles on the drive.

Link to comment

I recently RMA'd 2 x WD 1.5TB EARS drives, which I purchased via Newegg. Even though I was within the 30 days, I decided to deal directly with WD. It was a very straightforward process. The only snag I found is that if you want advance replacements, you can only return one drive at a time, so I decided to take the other option and returned both drives via UPS. It took 5 days for WD to receive them, and they sent out replacements on the same day they received them. I expect the whole process to take around 10 days.

 

I have no idea if the replacements will be new or refurbished, but they are different s/n to the ones I returned. I had previously aligned the drives using WDalign, without jumper for use with XP, this screwed up using them for unRAID (see my posts), so I just sent them back with a vague message saying they didn't work in my PC.

 

See http://lime-technology.com/forum/index.php?topic=8333.15 for comments on RMA process.

Link to comment

Replacement drives are due to be delivered today. They have replaced them with WD20EARS; I returned WD15EARS, so not so bad. Very impressed with the WD warranty system, as technically the drives were not faulty.

That's a nice little upgrade.  I just did an advanced replacement on a WD15EARS, but I got the same model back. 

Link to comment

Bit of a lottery I guess, probably depends on stock levels etc. I don't often luck out like this :). Generally, I've been impressed with the WD green series drives (I also have a couple of 640EADS); for the low heat, noise output and spindle speed they perform pretty well.

 

I know some people like to pour scorn on WD ( see http://forums.storagereview.com/index.php/topic/28363-is-western-digital-trustful-hdd-vendor-today/ ), but all manufacturers have problems at some time or other and 4kB sector technology is something they all face sooner or later.

Link to comment

So after running 4 individual (didn't use -c 4) preclear cycles on the misbehaving disk:

 

a)  On the second cycle all the errors went away.  The pending sectors did not become reallocated sectors and all the normalized values seemed to go to default after first run values

b)  After the 4th cycle I got the following:

 

< 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0

---

> 7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0

 

which doesn't look bad at all, but also isn't exactly comforting (it gives me the impression that there is still something odd going on, considering the disk's history).
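
For anyone comparing before/after reports like this by hand, here is a minimal Python sketch that splits the two attribute rows into fields. It assumes the usual smartctl attribute column order (ID, name, flag, normalized value, worst, threshold, type, updated, when-failed, raw); the class and function names are made up for illustration.

```python
from typing import NamedTuple

class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized value
    worst: int   # worst normalized value recorded
    thresh: int  # failure threshold
    raw: str     # raw value, kept as text

def parse_attr(line):
    """Parse one attribute row; a leading '<' or '>' diff marker is ignored."""
    fields = line.lstrip("<> ").split()
    return SmartAttr(int(fields[0]), fields[1], int(fields[3]),
                     int(fields[4]), int(fields[5]), " ".join(fields[9:]))

before = parse_attr("< 7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0")
after = parse_attr("> 7 Seek_Error_Rate 0x002e 100 253 000 Old_age Always - 0")

print(before.value, "->", after.value)  # 200 -> 100
print(after.value > after.thresh)       # True
print(after.raw)                        # 0
```

Here the normalized value halved while the raw value stayed at 0 and remained above the threshold of 0, which matches the "not bad, but not comforting" reading.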

Link to comment

I recently RMA'd 2 x WD 1.5TB EARS drives, which I purchased via Newegg. Even though I was within the 30 days, I decided to deal directly with WD. It was a very straightforward process. The only snag I found is that if you want advance replacements, you can only return one drive at a time, so I decided to take the other option and returned both drives via UPS. It took 5 days for WD to receive them, and they sent out replacements on the same day they received them. I expect the whole process to take around 10 days.

 

I have no idea if the replacements will be new or refurbished, but they are different s/n to the ones I returned. I had previously aligned the drives using WDalign, without jumper for use with XP, this screwed up using them for unRAID (see my posts), so I just sent them back with a vague message saying they didn't work in my PC.

 

See http://lime-technology.com/forum/index.php?topic=8333.15 for comments on RMA process.

 

I'm guessing you have to pay for shipping back to WD and they pay for shipping to you?

 

I requested an RMA from Newegg and it looks like it's going to be $12 to ship it back to them. I assume they pay for shipping the replacement to me.

Link to comment
