PeterB Posted March 30, 2011
One of my drives, a WD10EADS, which was redeployed from my media player into unRAID and is now 18 months old, is showing a 'Multi_Zone_Error_Rate' of over 2000. For the first three months this drive was in my unRAID box, the value for this parameter was zero. Over the following three months, the value rose to around 200. Now, in the last three months, it has increased to 2133. The only other point of note in the SMART history for this drive is that it did show one 'Current_Pending_Sector' around five months ago, but this returned to zero and has stayed at zero ever since. I understand that manufacturers give little weight to the Multi_Zone_Error_Rate, but 2000 seems abnormally high. Should I be worried?
BRiT Posted March 30, 2011
Once again, you're reading the SMART report incorrectly. You need to look at the normalized current value (VALUE) and the normalized threshold value (THRESH) for Multi_Zone_Error_Rate, and ignore the RAW field.
PeterB Posted March 31, 2011
Errr .. okay. I was merely looking at what the SMART History in unMENU was telling me, and the pretty graph it draws. It clearly tells me: 'WD-WMAVU0236768: OK - Multi_Zone_Error_Rate is 2133' and shows a graph which rises steeply over the last 2-3 months. However, here is what the basic SMART report shows:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   185   177   021    Pre-fail  Always       -       5725
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       4254
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   088   088   000    Old_age   Always       -       8999
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       355
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       136
193 Load_Cycle_Count        0x0032   186   186   000    Old_age   Always       -       44150
194 Temperature_Celsius     0x0022   119   091   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   187   187   000    Old_age   Offline      -       2133

So, are you suggesting that this is nothing to worry about and I should ignore it?
Joe L. Posted March 31, 2011
I'd keep an eye on it. Apparently the current value is 187, down from 200. If it keeps heading toward the failure threshold of zero, then it might be a candidate for replacement (but it has a long way to go to get to zero).
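The comparison described here (normalized VALUE against THRESH, rather than the raw number) can be sketched in a few lines of Python. This is only an illustration, assuming the ten-column attribute layout shown in the smartctl report above; the names `SmartAttr`, `parse_attr_line` and `headroom` are made up for this example and are not part of smartmontools or unMENU.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # normalized current value
    worst: int   # lowest normalized value recorded so far
    thresh: int  # failure threshold for the normalized value
    raw: str     # vendor-specific raw value, kept as text


def parse_attr_line(line: str) -> SmartAttr:
    """Parse one row of the attribute table (hypothetical helper)."""
    # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    fields = line.split(None, 9)
    return SmartAttr(
        attr_id=int(fields[0]),
        name=fields[1],
        value=int(fields[3]),
        worst=int(fields[4]),
        thresh=int(fields[5]),
        raw=fields[9].strip(),
    )


def headroom(attr: SmartAttr) -> int:
    """How far the normalized value still sits above its threshold."""
    return attr.value - attr.thresh


# Row copied from the report posted on March 31, 2011.
mzer = parse_attr_line(
    "200 Multi_Zone_Error_Rate 0x0008 187 187 000 Old_age Offline - 2133"
)
print(mzer.name, "value", mzer.value, "threshold", mzer.thresh,
      "headroom", headroom(mzer))
```

With these inputs the headroom is simply VALUE minus THRESH; whether that number is a trustworthy health indicator is exactly what the rest of the thread debates.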
SSD Posted March 31, 2011
Joe L. wrote: "I'd keep an eye on it. Apparently the current value is 187, down from 200. If it keeps heading toward the failure threshold of zero, then it might be a candidate for replacement. (but it has a long way to go to get to zero)"

I am not all that trustful of the manufacturers and their "normalized" values. They are, after all, interested in sold drives staying sold and not coming back for service. But the only parameters we KNOW to look out for are the reallocated and pending sectors. If they start heading north, we know we have a problem.

If I see one of the other parameters starting to head north, much higher than other drives of the same model, I am inclined to start tracking the value over time. I'd also run some benchmarks on the drive and look for a correlation between the high values and poor performance or some other indicator of a problem. If you said that the multi-zone error rate was 2000 and the performance of the drive was half that of other same-model drives in the array that have a value of 0 for that attribute, it would be a very different conversation.

I, personally, would take a value of 2000 on multi-zone error rate as something to watch carefully, even if SMART normalized values are telling you it is a long way from failure.
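Tracking a raw value over time, as suggested here, can be as simple as comparing the increase in each period with the one before. The sketch below uses the three readings PeterB gave in his first post (0, roughly 200, then 2133, about three months apart); the function names and the `factor` cut-off of 2.0 are arbitrary assumptions for illustration, not an established rule.

```python
def period_increases(readings):
    """Increase between each pair of consecutive readings."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]


def is_accelerating(readings, factor=2.0):
    """True if the latest increase exceeds `factor` times the previous one.

    `factor` is an assumed, illustrative cut-off.
    """
    steps = period_increases(readings)
    if len(steps) < 2:
        return False
    return steps[-1] > factor * steps[-2]


# Raw Multi_Zone_Error_Rate at the end of each three-month period,
# as reported in the first post (the 200 is approximate).
raw_mzer = [0, 200, 2133]

print("increase per period:", period_increases(raw_mzer))
print("accelerating:", is_accelerating(raw_mzer))
```

The same check could be run against readings from other drives of the same model to see whether one drive stands out.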
PeterB Posted March 31, 2011
Obviously I've been aware of this change in parameter value for some time, and have been keeping a watch on that drive. I will continue to monitor it but, perhaps, I should start saving for a replacement drive.
PeterB Posted May 24, 2011
Okay, this drive red-balled yesterday, so I purchased a replacement (I already have a 2TB drive on order, but it's going to take the shop a couple of weeks to obtain, so it's lucky that I found a shop with a 1TB drive on the shelf - most here don't stock anything bigger than 500GB). Pre-clearing (single pass) and re-building has taken 21 hours, but at least I'm back up and running now.

I could do with a little assistance in interpreting logs etc. The drive in question is sdf/drive3, and I attach the syslog. I ran a pre-clear on the drive, which took more than 24 hours (read rates at around 80% of the disk slowed to less than 5MB/s on both pre and post reads, before speeding up to 60MB/s again). Here is the final screen of the preclear:

========================================================================1.11
== WDC WD10EADS-00P8B0 WD-WMAVU0236768
== Disk /dev/sdf has been successfully precleared
== with a starting sector of 64
============================================================================
** Changed attributes in files: /tmp/smart_start_sdf /tmp/smart_finish_sdf
ATTRIBUTE             NEW_VAL  OLD_VAL  FAILURE_THRESHOLD  STATUS  RAW_VALUE
Raw_Read_Error_Rate = 199      200      51                 ok      84668
Start_Stop_Count    = 94       96       0                  ok      6397
Temperature_Celsius = 116      114      0                  ok      34
No SMART attributes are FAILING_NOW

2 sectors were pending re-allocation before the start of the preclear.
3 sectors were pending re-allocation after pre-read in cycle 1 of 1.
0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.
0 sectors are pending re-allocation at the end of the preclear, a change of -2 in the number of sectors pending re-allocation.
0 sectors had been re-allocated before the start of the preclear.
0 sectors are re-allocated at the end of the preclear, the number of sectors re-allocated did not change.
root@Tower:~#

The starting SMART report:

Disk: /dev/sdf
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green family
Device Model: WDC WD10EADS-00P8B0
Serial Number: WD-WMAVU0236768
Firmware Version: 01.00A01
User Capacity: 1,000,204,886,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon May 23 18:25:15 2011 SGT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (23100) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   177   021    Pre-fail  Always       -       5875
  4 Start_Stop_Count        0x0032   096   096   000    Old_age   Always       -       4778
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10261
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       383
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       143
193 Load_Cycle_Count        0x0032   184   184   000    Old_age   Always       -       48432
194 Temperature_Celsius     0x0022   114   091   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   095   001   000    Old_age   Offline      -       16863

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Interrupted (host reset)  10%        6160             -
# 2  Extended offline  Completed without error   00%        6140             -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
and the finishing SMART report:

Disk: /dev/sdf
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Green family
Device Model: WDC WD10EADS-00P8B0
Serial Number: WD-WMAVU0236768
Firmware Version: 01.00A01
User Capacity: 1,000,204,886,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Tue May 24 21:31:28 2011 SGT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (23100) seconds.
Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       84668
  3 Spin_Up_Time            0x0027   182   177   021    Pre-fail  Always       -       5875
  4 Start_Stop_Count        0x0032   094   094   000    Old_age   Always       -       6397
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10287
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       383
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       143
193 Load_Cycle_Count        0x0032   184   184   000    Old_age   Always       -       50052
194 Temperature_Celsius     0x0022   116   091   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   095   001   000    Old_age   Offline      -       16863

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline  Interrupted (host reset)  10%        6160             -
# 2  Extended offline  Completed without error   00%        6140             -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Note the Raw values for Raw_Read_Error_Rate and Multi_Zone_Error_Rate.
Also note that the current pending sector count went from 2 to 3 and back to zero, without the Reallocated_Sector_Ct changing. This might indicate that the drive is now 'repaired' ... but I don't like the fact that the read rate slows at around 80%, at the same point where the raw value of the Raw_Read_Error_Rate increases rapidly.

This drive is still in warranty, but I would have to ship it internationally from the Philippines to Singapore - what factors should I be pointing to when I request RMA?

Attachment: syslog-20110523-110143.zip
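The before/after comparison in this post can be reproduced by diffing the raw values of the two reports. The sketch below does that for a handful of attributes, with the numbers copied from the starting and finishing reports above; `raw_changes` is a made-up helper for illustration and is not how the preclear script itself is implemented.

```python
def raw_changes(before, after):
    """Return {attribute: (old_raw, new_raw)} for raw values that differ."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


# Raw values from the starting report (Mon May 23) ...
start = {
    "Raw_Read_Error_Rate": 0,
    "Start_Stop_Count": 4778,
    "Reallocated_Sector_Ct": 0,
    "Power_On_Hours": 10261,
    "Load_Cycle_Count": 48432,
    "Current_Pending_Sector": 2,
    "Multi_Zone_Error_Rate": 16863,
}
# ... and from the finishing report (Tue May 24).
finish = {
    "Raw_Read_Error_Rate": 84668,
    "Start_Stop_Count": 6397,
    "Reallocated_Sector_Ct": 0,
    "Power_On_Hours": 10287,
    "Load_Cycle_Count": 50052,
    "Current_Pending_Sector": 0,
    "Multi_Zone_Error_Rate": 16863,
}

changes = raw_changes(start, finish)
for name, (old, new) in changes.items():
    print(f"{name}: {old} -> {new}")
```

Seen this way, the pending sectors cleared without any reallocation being recorded, while Raw_Read_Error_Rate, Start_Stop_Count and Load_Cycle_Count all climbed during the roughly 26 power-on hours of the preclear.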
KYThrill Posted May 24, 2011
Is that the drive you just bought? I take it it was used? If so, I would return it to where I bought it. It looks like it is close to death.

Nothing in your SMART report looks RMA'able except maybe the Multi Zone Error Rate. Everything else looks typical for a used drive.

You may be able to point out that the worst value on the MZER was previously a 01. That is very close to failure. Since MZER is supposed to determine the overall health of the physical mechanisms of the HDD, having been nearly at zero says the drive previously had some sort of severe mechanical problem.

If any SMART attribute fails, WD will replace your drive under warranty, but yours technically hasn't failed yet, so I don't know if you can arm twist them or not. Run their Data Lifeguard tool on the drive. If it throws any errors, you can get an RMA under warranty.
BRiT Posted May 24, 2011
I hope you did not expect it to be brand new. It has extremely high power on hour counts, over 10K! It's been powered up for nearly 1.2 years! Personally, I would return that drive but then again I do not buy used drives.

  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10261
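The "nearly 1.2 years" figure is just the raw Power_On_Hours value divided by the hours in a year. A quick check, assuming a 365-day year:

```python
HOURS_PER_YEAR = 24 * 365  # assumes a non-leap year


def power_on_years(hours: int) -> float:
    """Convert a raw Power_On_Hours reading to years of continuous running."""
    return hours / HOURS_PER_YEAR


# Raw Power_On_Hours from the May 23 report.
print(f"{power_on_years(10261):.2f} years")
```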
PeterB Posted May 25, 2011
BRiT wrote: "I hope you did not expect it to be brand new. It has extremely high power on hour counts, over 10K! It's been powered up for nearly 1.2 years! Personally, I would return that drive but then again I do not buy used drives. 9 Power_On_Hours 0x0032 086 086 000 Old_age Always - 10261"

No, it's not new ... from my first post in this thread: "One of my drives, a WD10EADS, which was redeployed from my media player into unRAID, and is now 18 months old, is showing a 'Multi_Zone_Error_Rate' of over 2000."

KYThrill wrote: "Nothing in your SMART report looks RMA'able except maybe the Multi Zone Error Rate. Everything else looks typical for a used drive."

That was my worry ... nothing RMAable, but yet it suffered write failure ... I'm not keen to continue using it in my unRAID array. Perhaps I should preclear it to death?

KYThrill wrote: "You may be able to point out that the worst value on the MZER was previously a 01. That is very close to failure. Since MZER is supposed to determine the overall health of the physical mechanisms of the HDD, having been nearly at zero says the drive previously had some sort of severe mechanical problem."

Ah, I've never been able to discover what MZER represented. This drive has certainly never been subject to any mechanical abuse. After about a year, the MZER started showing small values then, in March this year, started to advance more rapidly. In the last couple of weeks it's shot up.

What about the Raw_Read_Error_Rate - what does that represent? I'm convinced that, particularly around the 80% mark (in preclear read test), the drive does lots of retries (hence the read rate dropping below 5MB/s).

I ran a long SMART test overnight, which reports 'Completed without error'. If it's not RMAable, perhaps I should just throw it in the bin? I'm loath to pay international shipping on it just to be told that there's no fault! I will investigate the 'data lifeguard tool'.
KYThrill Posted May 25, 2011
Okay, if you are saying that the drive in Reply #2 is the same as the drive in Reply #6, then in the past 1000 hours of use your MZER has gone from 187 to 95, and at some point spiked as low as 01. I would guess that if you used it another 1000 hours, it would probably fail.

I would definitely run WD's Data Lifeguard tool. I have found two instances on the web of people reporting that the Data Lifeguard tool considers a value below 51 on MZER to be a failure. Since your low value is 01, Data Lifeguard would consider your drive failed. It would return a failure code, and that is all you would need for an RMA.

Now, why the threshold in the SMART data says zero and not 51 (like the Data Lifeguard software), I don't know. Maybe this is WD's mechanism of ensuring you use their diagnostic software, and not someone else's. Only Hitachi and WD use this SMART parameter, so I can maybe understand why smartctl still reports a healthy drive (it probably only compares to the threshold values, as there is no industry-wide standard).

Data Lifeguard is definitely your next step. You can download it from WD's website.
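The "another 1000 hours" guess can be compared with a straight-line extrapolation from the two reports in this thread: normalized MZER 187 at 8999 power-on hours and 95 at 10261 hours. The sketch below is only a back-of-the-envelope estimate. It assumes the decline is linear, which nothing in the thread establishes, and the target of 51 is the Data Lifeguard cut-off reported second-hand above, not a documented WD figure.

```python
def hours_until(target, earlier, later):
    """Linear estimate of power-on hours until the value reaches `target`.

    `earlier` and `later` are (power_on_hours, normalized_value) pairs.
    Returns None when the value is not declining.
    """
    (h0, v0), (h1, v1) = earlier, later
    rate = (v0 - v1) / (h1 - h0)  # normalized points lost per hour
    if rate <= 0:
        return None
    return (v1 - target) / rate


march = (8999, 187)   # Power_On_Hours and MZER VALUE, March 31 report
may = (10261, 95)     # same attributes, May 23 report

print("hours to reach 51:", round(hours_until(51, march, may)))
print("hours to reach 0:", round(hours_until(0, march, may)))
```

Note that the interval between the two reports is 1262 power-on hours rather than 1000. On these assumptions the value would reach 51 in about 600 further hours and zero in about 1300.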
PeterB Posted May 25, 2011
Well, I've booted my unRAID server from a DOS USB stick. I ran the DLG tool short test, which completed in 7 1/2 minutes and told me the drive is healthy. I started the long test, which initially estimated it would run for 2 1/2 hours. It's now been running for more than 5 hours, the last 3 of which it's been telling me that it will complete in 1 hour and 59 seconds. The elapsed time keeps increasing but the remaining time estimate doesn't change. I'm wondering how long to leave it before hitting abort.
Johnm Posted May 25, 2011
PeterB wrote: "I'm wondering how long to leave it before hitting abort."

I would take that as it is having problems. I would go as long as it takes so you can get an RMA for it. I am assuming it is still under warranty. If it is not under warranty, just toss it out. It is at its end of life.
PeterB Posted May 25, 2011
Johnm wrote: "I would take that as it is having problems."

Indeed - it seems to have great difficulty reading in an area which is somewhere around 80% of maximum capacity. Anyway, the long test eventually completed and reported no errors!

Johnm wrote: "I would go as long as it takes so you can get an RMA for it."

Yep, that's what I'm doing - it's now back on preclear cycles.

Johnm wrote: "I am assuming it is still under warranty. If it is not under warranty, just toss it out. It is at its end of life."

Indeed! According to WD website, the warranty expires October 17, 2012. I paid the equivalent of USD130 for this drive when new, although the replacement Hitachi I bought on Monday only cost USD58.
KYThrill Posted May 25, 2011
PeterB wrote: "Indeed - it seems to have great difficulty reading in an area which is somewhere around 80% of maximum capacity. Anyway, the long test eventually completed and reported no errors! [...] Yep, that's what I'm doing - it's now back on preclear cycles. [...] Indeed! According to WD website, the warranty expires October 17, 2012. I paid the equivalent of USD130 for this drive when new, although the replacement Hitachi I bought on Monday only cost USD58."

You may want to run the DLG tool as much as possible. A preclear takes longer and does probably wear the drive more. But it wears the drive equally, all over. You seem to think it may be one portion of the disk that is bad (always slows down reads/writes near the end). But the DLG tool seemed to move more quickly than preclear through the good parts of the disk, then bogged down on the bad parts, spending a couple of hours just hashing through the bad parts. Maybe the DLG tool would focus the wear on the already bad areas.

Do you have another SMART report? Has the MZER decreased further?
BRiT Posted May 26, 2011
PeterB wrote: "No, it's not new ... from my first post in this thread"

Oh, my bad. I read about you finding a replacement drive and thought the SMART reports were for that replacement drive.
PeterB Posted May 26, 2011
KYThrill wrote: "You may want to run the DLG tool as much as possible. A preclear takes longer and does probably wear the drive more. But it wears the drive equally, all over. You seem to think it may be one portion of the disk that is bad (always slows down reads/writes near the end). But the DLG tool seemed to move more quickly than preclear through the good parts of the disk, then bogged down on the bad parts, spending a couple of hours just hashing through the bad parts. Maybe the DLG tool would focus the wear on the already bad areas."

Perhaps, but I can run preclear while unRAID is active. To run DLG, I have to boot under DOS. I may have a look at the preclear script to see whether it can be adapted to concentrate on the 70% - 90% area.

KYThrill wrote: "Do you have another SMART report? Has the MZER decreased further?"

It currently reads:

200 Multi_Zone_Error_Rate   0x0008   198   001   000    Old_age   Offline      -       427

I think it gets reset every time I run a SMART test. This is extremely frustrating - I know that the disk is bad, but it passes all of the standard tests.
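To concentrate testing on the 70% - 90% area, the first step is to turn those percentages into sector numbers. The sketch below does that from the User Capacity in the smartctl report, assuming 512-byte logical sectors (the report does not state the sector size); `lba_span` is a hypothetical helper, not part of the preclear script.

```python
CAPACITY_BYTES = 1_000_204_886_016  # "User Capacity" from the smartctl report
SECTOR_BYTES = 512                  # assumed logical sector size
TOTAL_SECTORS = CAPACITY_BYTES // SECTOR_BYTES


def lba_span(total_sectors, start_pct, end_pct):
    """First and last LBA of the region between two whole-number percentages."""
    first = total_sectors * start_pct // 100
    last = total_sectors * end_pct // 100 - 1
    return first, last


first_lba, last_lba = lba_span(TOTAL_SECTORS, 70, 90)
print(f"total sectors: {TOTAL_SECTORS}")
print(f"70%-90% region: LBA {first_lba} to {last_lba}")
```

A span like this could be handed to a read loop in an adapted script or, since the drive reports "Selective Self-test supported", to smartctl's selective self-test (`-t select,N-M`), though whether that test would flag anything the long test missed is untested here.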
Johnm Posted May 26, 2011
PeterB wrote: "This is extremely frustrating - I know that the disk is bad, but it passes all of the standard tests."

I had one of those. I can't read or write to part of it and it takes 3 days to format. It took a ton of bitching to get them to RMA it. I have not gotten it back yet. It has only been a week so far.
PeterB Posted May 26, 2011
Johnm wrote: "It took a ton of bitching to get them to RMA it. I have not gotten it back yet."

This is my fear - especially when I will have to pay international postage to return it.