Multi_Zone_Error_Rate (again)



One of my drives, a WD10EADS, which was redeployed from my media player into unRAID, and is now 18 months old, is showing a 'Multi_Zone_Error_Rate' of over 2000.

 

For the first three months this drive was in my unRAID box, the value for this parameter was zero.  Over the following three months, the value rose to around 200.  Now, in the last three months it has increased to 2133.

 

The only other point of note in the SMART history for this drive is that it did show one 'Current_Pending_Sector' around five months ago, but this returned to zero and has stayed at zero ever since.

 

I understand that manufacturers give little weight to the Multi_Zone_Error_Rate, but 2000 seems abnormally high. Should I be worried?

Link to comment

Errr .. okay.  I was merely looking at what the SMART History in unMENU was telling me, and the pretty graph it draws.  It clearly tells me: 'WD-WMAVU0236768: OK - Multi_Zone_Error_Rate is 2133' and shows a graph which rises steeply over the last 2-3 months.

 

However, here is what the basic SMART report shows:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   185   177   021    Pre-fail  Always       -       5725
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       4254
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       8999
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       355
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       136
193 Load_Cycle_Count        0x0032   186   186   000    Old_age   Always       -       44150
194 Temperature_Celsius     0x0022   119   091   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   187   187   000    Old_age   Offline      -       2133

 

So, are you suggesting that this is nothing to worry about and I should ignore it?

Link to comment

I'd keep an eye on it.

 

Apparently the current value is 187, down from 200.  If it keeps heading toward the failure threshold of zero, then it might be a candidate for replacement. (but it has a long way to go to get to zero)
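
To put a number on "a long way to go", here is a minimal sketch that parses one row of the attribute table and reports how far the normalized value sits above its threshold. The names `Attr`, `parse_row` and `headroom` are made up for illustration, and the column layout is assumed to match the smartctl 5.40 output quoted in this thread.

```python
from typing import NamedTuple

class Attr(NamedTuple):
    attr_id: int
    name: str
    flag: int
    value: int
    worst: int
    thresh: int
    raw: int

def parse_row(line: str) -> Attr:
    """Split one row of the smartctl attribute table into typed fields."""
    f = line.split()
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    return Attr(int(f[0]), f[1], int(f[2], 16),
                int(f[3]), int(f[4]), int(f[5]), int(f[9]))

def headroom(a: Attr) -> int:
    """Points between the current normalized value and the failure threshold."""
    return a.value - a.thresh

row = "200 Multi_Zone_Error_Rate   0x0008   187   187   000    Old_age   Offline      -       2133"
attr = parse_row(row)
print(attr.name, "is", headroom(attr), "points above its threshold")
```

This only looks at the normalized value, which is the manufacturer's own scale, so it says nothing about whether the raw count of 2133 is reasonable.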

 

I am not all that trusting of the manufacturers and their "normalized" values.  They are, after all, interested in sold drives staying sold and not coming back for service.  But the only parameters we KNOW to look out for are the reallocated and pending sectors.  If they start heading north, we know we have a problem.

 

If I see one of the other parameters starting to head north, much higher than other drives of the same model, I am inclined to start tracking the value over time.  I'd also run some benchmarks on the drive and see if I see some correlation between the drive with the high values and poor performance or some other indicator of a problem. If you said that the multizone error rate was 2000 and the performance of the drive was half that of other same-model drives in the array that have a value of 0 for that attribute, it would be a very different conversation.

 

I, personally, would take a value of 2000 on multi-zone error rate as something to watch carefully, even if SMART normalized values are telling you it is a long way from failure.
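
Tracking the value over time can be as simple as comparing the rate of increase between readings. The sketch below uses the approximate figures from the first post (zero after three months, about 200 after six, 2133 after nine); the exact sample times are an assumption read from that description, and `growth_per_month` is a hypothetical helper.

```python
# Approximate samples: (months in the unRAID box, raw Multi_Zone_Error_Rate)
samples = [(3, 0), (6, 200), (9, 2133)]

def growth_per_month(points):
    """Average raw-value increase per month for each consecutive pair of samples."""
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        rates.append((v1 - v0) / (t1 - t0))
    return rates

rates = growth_per_month(samples)
for rate in rates:
    print(f"{rate:.0f} per month")

# A later interval growing faster than an earlier one is the pattern to watch for.
accelerating = rates[-1] > rates[0]
print("accelerating:", accelerating)
```

Comparing the same figures against other drives of the same model would need their histories too, which are not available in this thread.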

Link to comment
  • 1 month later...

Okay, this drive red-balled yesterday, so I purchased a replacement (I already have a 2TB drive on order, but it's going to take the shop a couple of weeks to obtain, so it's lucky that I found a shop with a 1TB drive on the shelf - most here don't stock anything bigger than 500GB).

 

Pre-clearing (single pass) and re-building has taken 21 hours, but at least I'm back up and running now.

 

I could do with a little assistance in interpreting logs etc.

 

The drive in question is sdf/drive3, and I attach the syslog.

 

I ran a pre-clear on the drive, which took more than 24 hours (read rates at around 80% of the disk slowed to less than 5MB/s on both pre and post reads, before speeding up to 60MB/s again).

 

Here is the final screen of the preclear:

========================================================================1.11
==  WDC WD10EADS-00P8B0    WD-WMAVU0236768
== Disk /dev/sdf has been successfully precleared
== with a starting sector of 64
============================================================================
** Changed attributes in files: /tmp/smart_start_sdf  /tmp/smart_finish_sdf
               ATTRIBUTE   NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS      RAW_VALUE
     Raw_Read_Error_Rate =   199     200           51        ok          84668
        Start_Stop_Count =    94      96            0        ok          6397
     Temperature_Celsius =   116     114            0        ok          34
No SMART attributes are FAILING_NOW

2 sectors were pending re-allocation before the start of the preclear.
3 sectors were pending re-allocation after pre-read in cycle 1 of 1.
0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.
0 sectors are pending re-allocation at the end of the preclear,
   a change of -2 in the number of sectors pending re-allocation.
0 sectors had been re-allocated before the start of the preclear.
0 sectors are re-allocated at the end of the preclear,
   the number of sectors re-allocated did not change. 
root@Tower:~#

 

The starting SMART report:

Disk: /dev/sdf
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD10EADS-00P8B0
Serial Number:    WD-WMAVU0236768
Firmware Version: 01.00A01
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon May 23 18:25:15 2011 SGT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (23100) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   177   021    Pre-fail  Always       -       5875
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       4778
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10261
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       383
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       143
193 Load_Cycle_Count        0x0032   184   184   000    Old_age   Always       -       48432
194 Temperature_Celsius     0x0022   114   091   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   095   001   000    Old_age   Offline      -       16863

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      10%      6160         -
# 2  Extended offline    Completed without error       00%      6140         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

and the finishing SMART report:

 

Disk: /dev/sdf
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green family
Device Model:     WDC WD10EADS-00P8B0
Serial Number:    WD-WMAVU0236768
Firmware Version: 01.00A01
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue May 24 21:31:28 2011 SGT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (23100) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303f)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       84668
  3 Spin_Up_Time            0x0027   182   177   021    Pre-fail  Always       -       5875
  4 Start_Stop_Count        0x0032   094   094   000    Old_age   Always       -       6397
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10287
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       383
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       143
193 Load_Cycle_Count        0x0032   184   184   000    Old_age   Always       -       50052
194 Temperature_Celsius     0x0022   116   091   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   095   001   000    Old_age   Offline      -       16863

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      10%      6160         -
# 2  Extended offline    Completed without error       00%      6140         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

Note the Raw values for Raw_Read_Error_Rate and Multi_Zone_Error_Rate.

 

Also note that the current pending sector count went from 2 to 3 and back to zero, without the Reallocated_Sector_Ct changing.  This might indicate that the drive is now 'repaired' ... but I don't like the fact that the read rate slows at around 80%, at the same time that the raw value of the Raw_Read_Error_Rate increases rapidly at that point.
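
For anyone comparing the two reports by hand, a minimal sketch of the same comparison in Python follows. It only includes a few rows copied from the start and finish reports above; `raw_values` and `raw_changes` are hypothetical helper names, and the parsing assumes the ten-column layout smartctl 5.40 prints.

```python
START = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       4778
193 Load_Cycle_Count        0x0032   184   184   000    Old_age   Always       -       48432
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
200 Multi_Zone_Error_Rate   0x0008   095   001   000    Old_age   Offline      -       16863
"""
FINISH = """\
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       84668
  4 Start_Stop_Count        0x0032   094   094   000    Old_age   Always       -       6397
193 Load_Cycle_Count        0x0032   184   184   000    Old_age   Always       -       50052
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   095   001   000    Old_age   Offline      -       16863
"""

def raw_values(table):
    """Map attribute name to raw value for each attribute row in the table."""
    out = {}
    for line in table.splitlines():
        f = line.split()
        if len(f) >= 10 and f[0].isdigit():
            out[f[1]] = int(f[9])
    return out

def raw_changes(before, after):
    """Return {name: (old_raw, new_raw)} for attributes whose raw value changed."""
    b, a = raw_values(before), raw_values(after)
    return {n: (b[n], a[n]) for n in b if n in a and a[n] != b[n]}

changes = raw_changes(START, FINISH)
for name, (old, new) in changes.items():
    print(f"{name}: {old} -> {new} ({new - old:+d})")
```

With these rows the Multi_Zone_Error_Rate raw value is the same in both reports (16863), so it does not show up as a change, while Raw_Read_Error_Rate goes from 0 to 84668.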

 

This drive is still in warranty, but I would have to ship it internationally from the Philippines to Singapore - what factors should I be pointing to when I request an RMA?

syslog-20110523-110143.zip

Link to comment

Is that the drive you just bought?  I take it it was used?

 

If so, I would return it to where I bought it.  It looks like it is close to death.

 

Nothing in your SMART report looks RMA'able except maybe the Multi Zone Error Rate.  Everything else looks typical for a used drive.  You may be able to point out that the worst value on the MZER was previously a 01.  That is very close to failure.  Since MZER is supposed to determine the overall health of the physical mechanisms of the HDD, having been that low says the drive previously had some sort of severe mechanical problem.

 

If any SMART attribute fails, WD will replace your drive under warranty, but yours technically hasn't failed yet, so I don't know if you can arm twist them or not.  Run their data lifeguard tool on the drive.  If it throws any errors, you can get an RMA under warranty.

Link to comment

I hope you did not expect it to be brand new.  It has extremely high power on hour counts, over 10K! It's been powered up for nearly 1.2 years!

 

Personally, I would return that drive but then again I do not buy used drives.

 

   9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10261
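
As a rough check of the "nearly 1.2 years" figure, the conversion from the Power_On_Hours raw value is just a division. The helper name `power_on_years` is made up for this sketch, and it assumes the raw value really is in hours, which is how smartctl labels it for this drive.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap days

def power_on_years(hours: int) -> float:
    """Convert a Power_On_Hours raw value to years of continuous operation."""
    return hours / HOURS_PER_YEAR

print(f"{power_on_years(10261):.2f} years")
```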

Link to comment

I hope you did not expect it to be brand new.  It has extremely high power on hour counts, over 10K! It's been powered up for nearly 1.2 years!

 

Personally, I would return that drive but then again I do not buy used drives.

 

  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10261

 

No, it's not new ... from my first post in this thread:

One of my drives, a WD10EADS, which was redeployed from my media player into unRAID, and is now 18 months old, is showing a 'Multi_Zone_Error_Rate' of over 2000.

 

Nothing in your SMART report looks RMA'able except maybe the Multi Zone Error Rate.  Everything else looks typical for a used drive.

 

That was my worry ... nothing RMAable, but yet it suffered write failure ... I'm not keen to continue using it in my unRAID array.  Perhaps I should preclear it to death?

 

 

You may be able to point out that the worst value on the MZER was previously a 01.  That is very close to failure.  Since MZER is supposed to determine the overall health of the physical mechanisms of the HDD, having been that low says the drive previously had some sort of severe mechanical problem.

 

Ah, I've never been able to discover what MZER represented.  This drive has certainly never been subject to any mechanical abuse.  After about a year, the MZER started showing small values then, in March this year, started to advance more rapidly.  In the last couple of weeks it's shot up.

 

What about the Raw_Read_Error_Rate - what does that represent?  I'm convinced that, particularly around the 80% mark (in preclear read test) the drive does lots of retries (hence the read rate dropping below 5MB/s).

 

I ran a long SMART test overnight, which reports 'Completed without error'.

 

If it's not RMAable, perhaps I should just throw it in the bin?  I'm loath to pay international shipping on it just to be told that there's no fault!

 

I will investigate the 'data lifeguard tool'.

Link to comment

Okay,

 

If you are saying that the drive in Reply #2 is the same as the drive in Reply #6, then in the past 1000 hours of use your MZER has gone from 187 to 95, and at some point spiked as low as 01.  I would guess that if you used it another 1000 hours, it would probably fail.
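
That guess can be checked with a straight-line extrapolation between the two reports: normalized MZER 187 at 8999 power-on hours and 95 at 10261 hours. This is only a sketch; `hours_until` is a hypothetical helper, and it assumes the decline is linear, which the 01 worst value already shows it is not.

```python
def hours_until(value_now, hours_now, value_then, hours_then, target=0):
    """Linearly extrapolate how many more hours until a normalized value hits target."""
    slope = (value_now - value_then) / (hours_now - hours_then)  # points per power-on hour
    if slope >= 0:
        return None  # not declining, so no crossing is predicted
    return (target - value_now) / slope

remaining = hours_until(95, 10261, 187, 8999)
print(f"about {remaining:.0f} more power-on hours at the current rate")
```

On those two data points the straight line reaches zero after roughly 1300 more hours, which is in the same range as the 1000-hour guess.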

 

I would definitely run WD's Data Lifeguard tool.  I have found two instances on the web of people reporting that the Data Lifeguard tool considers a value below 51 on MZER to be a failure.  Since your low value is 01, Data Lifeguard would consider your drive failed.  It would return a failure code, and that is all you would need for an RMA.

 

Now, why the Threshold on the SMART data says zero and not 51 (like the Data Lifeguard software), I don't know.  Maybe this is WD's mechanism of ensuring you use their diagnostic software, and not someone else's.  Only Hitachi and WD use this SMART parameter, so I can maybe understand why smartctl still reports a healthy drive (it probably only compares to the threshold values, as there is no industry-wide standard).
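
To show what difference the two thresholds would make, here is a small sketch comparing this drive's current (95) and worst (01) MZER values against each one. The 51 figure is only the unverified number from those two web reports, not anything from WD documentation, and `failed` is a hypothetical helper using the usual "at or below threshold" convention.

```python
SMARTCTL_THRESH = 0        # THRESH column for attribute 200 in the reports above
REPORTED_DLG_THRESH = 51   # unverified figure from web reports (assumption)

def failed(value: int, thresh: int) -> bool:
    """A normalized value at or below its threshold counts as failed."""
    return value <= thresh

for label, value in [("current", 95), ("worst", 1)]:
    print(label,
          "smartctl:", failed(value, SMARTCTL_THRESH),
          "reported DLG:", failed(value, REPORTED_DLG_THRESH))
```

Only the worst value trips the reported 51 threshold; the current value of 95 passes under both, so the outcome depends on whether the tool looks at the worst value or the current one.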

 

Data Lifeguard is definitely your next step.  You can download it from WD's website.

Link to comment

Well, I've booted my unRAID server from a DOS USB stick.  I ran the DLG tool short test, which completed in 7 1/2 minutes and told me the drive is healthy.  I started the long test, which initially estimated it would run for 2 1/2 hours.  It's now been running for more than 5 hours, the last 3 of which it's been telling me that it will complete in 1 hour and 59 seconds.  The elapsed time keeps increasing but the remaining time estimate doesn't change.

 

I'm wondering how long to leave it before hitting abort.

Link to comment

I'm wondering how long to leave it before hitting abort.

 

I would take that as a sign it is having problems.

I would go as long as it takes so you can get an RMA for it.

 

I am assuming it is still under warranty. If it is not under warranty, just toss it out. It is at its end of life.

Link to comment

I'm wondering how long to leave it before hitting abort.

 

I would take that as a sign it is having problems.

 

Indeed - it seems to have great difficulty reading in an area which is somewhere around 80% of maximum capacity. Anyway, the long test eventually completed and reported no errors!

 

I would go as long as it takes so you can get an RMA for it.

 

Yep, that's what I'm doing - it's now back on preclear cycles.

 

I am assuming it is still under warranty. If it is not under warranty, just toss it out. It is at its end of life.

 

Indeed!  According to WD website, the warranty expires October 17, 2012.

 

I paid the equivalent of USD130 for this drive when new, although the replacement Hitachi I bought on Monday only cost USD58.

Link to comment


You may want to run the DLG tool as much as possible.  A preclear takes longer and does probably wear the drive more.  But it wears the drive equally, all over.  You seem to think it may be one portion of the disk that is bad (always slows down reads/writes near the end).  But the DLG tool seemed to move more quickly than preclear through the good parts of the disk, then bogged down on the bad parts, spending a couple of hours just hashing through the bad parts.  Maybe the DLG tool would focus the wear on the already bad areas. 

 

Do you have another SMART report?  Has the MZER decreased further?

Link to comment
You may want to run the DLG tool as much as possible.  A preclear takes longer and does probably wear the drive more.  But it wears the drive equally, all over.  You seem to think it may be one portion of the disk that is bad (always slows down reads/writes near the end).  But the DLG tool seemed to move more quickly than preclear through the good parts of the disk, then bogged down on the bad parts, spending a couple of hours just hashing through the bad parts.  Maybe the DLG tool would focus the wear on the already bad areas.

 

Perhaps, but I can run preclear while unRAID is active.  To run DLG, I have to boot under DOS.  I may have a look at the preclear script to see whether it can be adapted to concentrate on the 70% - 90% area.
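
Without touching the preclear script, the 70% - 90% span can be worked out from the capacity in the smartctl report and handed to other tools. This is a sketch only: it assumes 512-byte logical sectors, `lba_range` is a hypothetical helper, and it just prints the commands rather than running them. The `dd` line is a read-only pass; the `smartctl -t select` line uses the selective self-test that the report lists as supported.

```python
CAPACITY_BYTES = 1_000_204_886_016  # "User Capacity" from the smartctl report
SECTOR_BYTES = 512                  # assumed logical sector size

def lba_range(start_pct, end_pct, capacity=CAPACITY_BYTES, sector=SECTOR_BYTES):
    """Return (first_lba, last_lba) covering start_pct..end_pct of the disk."""
    total = capacity // sector
    first = total * start_pct // 100
    last = total * end_pct // 100 - 1
    return first, last

first, last = lba_range(70, 90)

# Read-only pass over just that span; dd's skip and count are in units of bs.
print(f"dd if=/dev/sdf of=/dev/null bs={SECTOR_BYTES} skip={first} count={last - first + 1}")

# Selective self-test over the same LBA span.
print(f"smartctl -t select,{first}-{last} /dev/sdf")
```

A block size of 512 bytes keeps the arithmetic simple but will be slow; a larger `bs` with `skip` and `count` scaled to match would read the same region faster.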

 

Do you have another SMART report?  Has the MZER decreased further?

It currently reads:

200 Multi_Zone_Error_Rate  0x0008  198  001  000    Old_age  Offline      -      427

 

I think it gets reset every time I run a SMART test.
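
One thing that may be related: the UPDATED column for attribute 200 says Offline, and that comes from the FLAG value 0x0008. Below is a minimal sketch that decodes the flag bits, using the bit meanings as I understand them from smartctl's brief attribute format (P, O, S, R, C, K); the label strings and `decode_flag` are made up for illustration.

```python
# Bit meanings as shown by smartctl's brief attribute format (POSRCK).
FLAG_BITS = {
    0x01: "prefail-warning",
    0x02: "updated-online",
    0x04: "performance",
    0x08: "error-rate",
    0x10: "event-count",
    0x20: "self-preserving",
}

def decode_flag(flag: int):
    """Return the names of the bits set in a SMART attribute flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

for name, flag in [("Multi_Zone_Error_Rate", 0x0008), ("Current_Pending_Sector", 0x0032)]:
    bits = decode_flag(flag)
    when = "Always" if "updated-online" in bits else "Offline"
    print(f"{name}: {bits} -> updated {when}")
```

Since the online-update bit is not set for MZER, the drive presumably only refreshes it during offline data collection or self-tests. That would fit the value changing after a test, though it does not prove the counter is being reset.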

 

This is extremely frustrating - I know that the disk is bad, but it passes all of the standard tests.

Link to comment

 

This is extremely frustrating - I know that the disk is bad, but it passes all of the standard tests.

 

I had one of those.

 

I can't read or write to part of it, and it takes 3 days to format. It took a ton of bitching to get them to RMA it. I have not gotten it back yet.

It has only been a week so far.

Link to comment
