tr0910 Posted April 14, 2012

Yesterday Disk1 was disabled on 5b13: SAMSUNG_HD203WI_S1UYJ1WZ405310 (sdd). This happened after using the disk for 6 months in unRAID, 3 months of which have been on 5b13 without complaint. It had lived a previous life for up to one year in a Buffalo NAS. Trouble is, Smarthistory doesn't think it is gone.

I had just made some changes to the server, replaced some 2 TB disks with 3 TB disks, done some parity checks, and everything was fine. Now I was just buttoning things up to let her run. She started up fine, but 4 hours later we had a red-balled disk1. I have unassigned the disk, started the array, then reassigned the disk, and it is doing a data rebuild on disk1 right now. Am I wasting my time? Is she likely dead?

The syslog from the time of failure, the next syslog after the failure, and Smarthistory are attached. (Note: This is a backup server that is a complete rsync of the main server, so no worries about losing data. We can play around with her without worry.)

Latest Smarthistory from Unmenu after the failure (full Smarthistory attached):

2012-04-14 querytime 1334411485
2012-04-14 health PASSED
2012-04-14 ATA_Error_Count
2012-04-14 Raw_Read_Error_Rate 49
2012-04-14 Spin_Up_Time 12134
2012-04-14 Start_Stop_Count 353
2012-04-14 Spin_Retry_Count 0
2012-04-14 Calibration_Retry_Count 0
2012-04-14 Power_Cycle_Count 94
2012-04-14 Reallocated_Sector_Ct 0
2012-04-14 Seek_Error_Rate 0
2012-04-14 Temperature_Celsius 24 (Min/Max 12/41)
2012-04-14 Reallocated_Event_Count 0
2012-04-14 Current_Pending_Sector 0
2012-04-14 Power_On_Hours 718
2012-04-14 Offline_Uncorrectable 0
2012-04-14 UDMA_CRC_Error_Count 0
2012-04-14 Multi_Zone_Error_Rate 34155

syslog-smarthistory.zip
Johnm Posted April 14, 2012

Multi_Zone_Error_Rate = "The count of errors found when writing a sector. The higher the value, the worse the disk's mechanical condition is."

Well, since you said that you expanded your system, there is a slight chance of bad cabling. There is also a chance that you are under-powering the drive. Is your PSU large enough for the server?
tr0910 (Author) Posted April 14, 2012

The SATA cabling to this drive was replaced before I started the rebuild. The power supply is a Corsair CX430 Builder Series powering 6 drives, all green except one of the new 3 TB drives, which is a Seagate ST33000651AS 7200 rpm drive. (One of the drives I replaced was an old Hitachi 1 TB 7200 rpm toaster, so power requirements are likely lower with this new config, but I can't be sure.)

From rajahal's notes on this PS:

CORSAIR Builder Series CX430 CMPSU-430CX 430W - 28A, supports up to 12 green drives or 7 7200 rpm drives - recommended for small servers, 10 drives or less
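The figures quoted from rajahal's notes can be turned into a rough sanity check. This is only a back-of-the-envelope sketch: it assumes the 28A figure is the 12V rail rating, and it treats 28A/12 and 28A/7 as the per-drive allowances implied by the quoted limits for green and 7200 rpm drives. Those per-drive numbers are derived, not measured, and the function names are made up for illustration.

```python
# Back-of-the-envelope 12 V budget, using only the limits quoted above.
RAIL_AMPS = 28.0        # assumed to be the CX430's 12 V rail rating
MAX_GREEN_DRIVES = 12   # quoted limit for green drives
MAX_7200_DRIVES = 7     # quoted limit for 7200 rpm drives


def implied_amps_per_drive(max_drives: int) -> float:
    """Per-drive allowance implied by a 'supports up to N drives' rating."""
    return RAIL_AMPS / max_drives


def estimated_load(green: int, fast: int) -> float:
    """Estimated 12 V draw in amps for a mix of green and 7200 rpm drives."""
    return (green * implied_amps_per_drive(MAX_GREEN_DRIVES)
            + fast * implied_amps_per_drive(MAX_7200_DRIVES))


# The server described above: 6 drives, one of them 7200 rpm.
load = estimated_load(green=5, fast=1)
print(f"estimated load: {load:.1f} A of {RAIL_AMPS:.0f} A")
```

By this rough measure the 5 green + 1 fast configuration lands near 15.7A, well inside the quoted rating, which suggests (but does not prove) that the PSU is not the obvious culprit.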
tr0910 (Author) Posted April 15, 2012

We have rebuilt the drive, followed by a fresh parity check, followed by a complete rsync to resync this drive to its mother, and it is performing wonderfully. (Knock on wood.)

Smarthistory shows the multi-zone error rate creeping up, though:

2012-02-06 Multi_Zone_Error_Rate 30008
2012-04-14 Multi_Zone_Error_Rate 34155
2012-04-15 Multi_Zone_Error_Rate 34405

Of the other drives, none has a multi-zone error rate higher than 500, and that is a drive with over 5000 power-on hours, while this one has less than 800 hours. I suppose it's only a matter of time..... (sigh)
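To put a number on "creeping up", the three Smarthistory readings above can be differenced. A minimal sketch: the `growth` helper is hypothetical, and it works in calendar days, which overstates idle time since the drive was not necessarily powered on throughout.

```python
from datetime import date

# Multi_Zone_Error_Rate raw values as listed in the Smarthistory excerpt.
readings = [
    (date(2012, 2, 6), 30008),
    (date(2012, 4, 14), 34155),
    (date(2012, 4, 15), 34405),
]


def growth(samples):
    """Return (days, increase, increase_per_day) between consecutive readings."""
    out = []
    for (d0, v0), (d1, v1) in zip(samples, samples[1:]):
        days = (d1 - d0).days
        out.append((days, v1 - v0, (v1 - v0) / days))
    return out


for days, delta, per_day in growth(readings):
    print(f"{days:3d} days: +{delta} ({per_day:.0f} per day)")
```

The first interval works out to 4147 counts over 68 days (about 61 per day), while the last single day alone added 250, so the raw counter did jump around the time of the rebuild, parity check and rsync.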
S80_UK Posted April 15, 2012

If it's accumulating errors at that rate, I would be investigating replacement / warranty options and I would remove it from my server.
tr0910 (Author) Posted April 15, 2012

Yes, RMA created. Warranty surprisingly still valid.

At what level of multi-zone errors is it time to replace a drive? All of my Samsung 2 TB 203WI or 204UI drives have some level of error (though none like this one).
dgaschk Posted April 15, 2012

The raw value only has meaning to the vendor. Compare the VALUE to the THRESHOLD. The drive is considered failing when the value drops below threshold in the SMART report. There are only a handful of metrics for which the raw value has a standard meaning.
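The VALUE-versus-THRESHOLD comparison can be sketched in a few lines of Python. This assumes the whitespace-separated column layout of the attribute table in the smartctl report posted in this thread (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE); `SmartAttr`, `parse_row` and `is_failing` are illustrative names, not part of smartmontools. The check uses "less than or equal to", which is how the smartctl man page describes a failed attribute.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int    # normalized VALUE column
    worst: int    # normalized WORST column
    thresh: int   # THRESH column
    raw: str      # RAW_VALUE column, kept as text (vendor specific)


def parse_row(row: str) -> SmartAttr:
    """Parse one row of the smartctl attribute table."""
    parts = row.split()
    return SmartAttr(int(parts[0]), parts[1], int(parts[3]),
                     int(parts[4]), int(parts[5]), " ".join(parts[9:]))


def is_failing(attr: SmartAttr) -> bool:
    """An attribute counts as failed when VALUE <= THRESH."""
    return attr.value <= attr.thresh


# Two rows copied from the SMART report posted in this thread.
rows = [
    "  1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 49",
    "200 Multi_Zone_Error_Rate 0x002a 001 001 000 Old_age Always - 34155",
]
for attr in map(parse_row, rows):
    status = "FAILING" if is_failing(attr) else "ok"
    print(f"{attr.name}: value={attr.value} thresh={attr.thresh} "
          f"raw={attr.raw} -> {status}")
```

Note that with a threshold of 000 and a normalized value that bottoms out at 001, Multi_Zone_Error_Rate can never trip this check, however large the raw number gets.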
tr0910 (Author) Posted April 16, 2012

Quote (dgaschk): Compare the VALUE to the THRESHOLD. The drive is considered failing when the value drops below threshold in the SMART report. There are only a handful of metrics for which the raw value has a standard meaning.

Wikipedia definition for Multi-Zone-Error-Rate: The count of errors found when writing a sector. The higher the value, the worse the disk's mechanical condition is.

The threshold is 0, and the multi zone error rate I have on this drive is 34000+ and steadily increasing, not decreasing. Are you suggesting this drive is improving with age? :-)

Surprisingly, per Wikipedia, this isn't one of the failure-imminent codes. I am just wondering why all my other Samsung drives have some low level of multi zone error rate (100-500), and my Seagate, Hitachi and WD drives all show zeros. (At 34000+, this Samsung isn't considered low level!!)

Apr 14 07:21:14 Tower2 smartctl[3620]: === START OF INFORMATION SECTION ===
Apr 14 07:21:14 Tower2 smartctl[3620]: Device Model: SAMSUNG HD203WI
Apr 14 07:21:14 Tower2 smartctl[3620]: Serial Number: S1UYJ1WZ405310
Apr 14 07:21:14 Tower2 smartctl[3620]: Firmware Version: 1AN10002
Apr 14 07:21:14 Tower2 smartctl[3620]: User Capacity: 2,000,398,934,016 bytes
Apr 14 07:21:14 Tower2 smartctl[3620]: Device is: Not in smartctl database [for details use: -P showall]
Apr 14 07:21:14 Tower2 smartctl[3620]: ATA Version is: 8
Apr 14 07:21:14 Tower2 smartctl[3620]: ATA Standard is: ATA-8-ACS revision 6
Apr 14 07:21:14 Tower2 smartctl[3620]: Local Time is: Sat Apr 14 07:21:12 2012 MDT
Apr 14 07:21:14 Tower2 smartctl[3620]: SMART support is: Available - device has SMART capability.
Apr 14 07:21:14 Tower2 smartctl[3620]: SMART support is: Enabled
Apr 14 07:21:14 Tower2 smartctl[3620]: Power mode is: ACTIVE or IDLE
Apr 14 07:21:14 Tower2 smartctl[3620]:
Apr 14 07:21:14 Tower2 smartctl[3620]: === START OF READ SMART DATA SECTION ===
Apr 14 07:21:14 Tower2 smartctl[3620]: SMART overall-health self-assessment test result: PASSED
Apr 14 07:21:14 Tower2 smartctl[3620]:
Apr 14 07:21:14 Tower2 smartctl[3620]: General SMART Values:
Apr 14 07:21:14 Tower2 smartctl[3620]: Offline data collection status: (0x00) Offline data collection activity
Apr 14 07:21:14 Tower2 smartctl[3620]: was never started.
Apr 14 07:21:14 Tower2 smartctl[3620]: Auto Offline Data Collection: Disabled.
Apr 14 07:21:14 Tower2 smartctl[3620]: Self-test execution status: ( 0) The previous self-test routine completed
Apr 14 07:21:14 Tower2 smartctl[3620]: without error or no self-test has ever
Apr 14 07:21:14 Tower2 smartctl[3620]: been run.
Apr 14 07:21:14 Tower2 smartctl[3620]: Total time to complete Offline
Apr 14 07:21:14 Tower2 smartctl[3620]: data collection: (25860) seconds.
Apr 14 07:21:14 Tower2 smartctl[3620]: Offline data collection
Apr 14 07:21:15 Tower2 smartctl[3620]: capabilities: (0x5b) SMART execute Offline immediate.
Apr 14 07:21:15 Tower2 smartctl[3620]: Auto Offline data collection on/off support.
Apr 14 07:21:15 Tower2 smartctl[3620]: Suspend Offline collection upon new
Apr 14 07:21:15 Tower2 smartctl[3620]: command.
Apr 14 07:21:15 Tower2 smartctl[3620]: Offline surface scan supported.
Apr 14 07:21:15 Tower2 smartctl[3620]: Self-test supported.
Apr 14 07:21:15 Tower2 smartctl[3620]: No Conveyance Self-test supported.
Apr 14 07:21:15 Tower2 smartctl[3620]: Selective Self-test supported.
Apr 14 07:21:15 Tower2 smartctl[3620]: SMART capabilities: (0x0003) Saves SMART data before entering
Apr 14 07:21:15 Tower2 smartctl[3620]: power-saving mode.
Apr 14 07:21:15 Tower2 smartctl[3620]: Supports SMART auto save timer.
Apr 14 07:21:15 Tower2 smartctl[3620]: Error logging capability: (0x01) Error logging supported.
Apr 14 07:21:15 Tower2 smartctl[3620]: General Purpose Logging supported.
Apr 14 07:21:15 Tower2 smartctl[3620]: Short self-test routine
Apr 14 07:21:15 Tower2 smartctl[3620]: recommended polling time: ( 2) minutes.
Apr 14 07:21:15 Tower2 smartctl[3620]: Extended self-test routine
Apr 14 07:21:15 Tower2 smartctl[3620]: recommended polling time: ( 255) minutes.
Apr 14 07:21:15 Tower2 smartctl[3620]: SCT capabilities: (0x003f) SCT Status supported.
Apr 14 07:21:15 Tower2 smartctl[3620]: SCT Error Recovery Control supported.
Apr 14 07:21:15 Tower2 smartctl[3620]: SCT Feature Control supported.
Apr 14 07:21:15 Tower2 smartctl[3620]: SCT Data Table supported.
Apr 14 07:21:15 Tower2 smartctl[3620]:
Apr 14 07:21:15 Tower2 smartctl[3620]: SMART Attributes Data Structure revision number: 16
Apr 14 07:21:15 Tower2 smartctl[3620]: Vendor Specific SMART Attributes with Thresholds:
Apr 14 07:21:15 Tower2 smartctl[3620]: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
Apr 14 07:21:15 Tower2 smartctl[3620]: 1 Raw_Read_Error_Rate 0x002f 100 100 051 Pre-fail Always - 49
Apr 14 07:21:15 Tower2 smartctl[3620]: 2 Throughput_Performance 0x0026 252 252 000 Old_age Always - 0
Apr 14 07:21:15 Tower2 smartctl[3620]: 3 Spin_Up_Time 0x0023 060 035 025 Pre-fail Always - 12138
Apr 14 07:21:15 Tower2 smartctl[3620]: 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 352
Apr 14 07:21:15 Tower2 smartctl[3620]: 5 Reallocated_Sector_Ct 0x0033 252 252 010 Pre-fail Always - 0
Apr 14 07:21:15 Tower2 smartctl[3620]: 7 Seek_Error_Rate 0x002e 252 252 051 Old_age Always - 0
Apr 14 07:21:15 Tower2 smartctl[3620]: 8 Seek_Time_Performance 0x0024 252 252 015 Old_age Offline - 0
Apr 14 07:21:15 Tower2 smartctl[3620]: 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 718
Apr 14 07:21:15 Tower2 smartctl[3620]: 10 Spin_Retry_Count 0x0032 252 252 051 Old_age Always - 0
Apr 14 07:21:15 Tower2 smartctl[3620]: 11 Calibration_Retry_Count 0x0032 252 252 000 Old_age Always - 0
Apr 14 07:21:15 Tower2 smartctl[3620]: 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 93
Apr 14 07:21:15 Tower2 smartctl[3620]: 191 G-Sense_Error_Rate 0x0022 001 001 000 Old_age Always - 5536277
Apr 14 07:21:15 Tower2 smartctl[3620]: 192 Power-Off_Retract_Count 0x0022 252 252 000 Old_age Always - 0
Apr 14 07:21:15 Tower2 smartctl[3620]: 194 Temperature_Celsius 0x0002 064 059 000 Old_age Always - 21 (Min/Max 12/41)
Apr 14 07:21:15 Tower2 smartctl[3620]: 195 Hardware_ECC_Recovered 0x003a 100 100 000 Old_age Always - 0
Apr 14 07:21:15 Tower2 smartctl[3620]: 196 Reallocated_Event_Count 0x0032 252 252 000 Old_age Always - 0
Apr 14 07:21:15 Tower2 smartctl[3620]: 197 Current_Pending_Sector 0x0032 252 252 000 Old_age Always - 0
Apr 14 07:21:15 Tower2 smartctl[3620]: 198 Offline_Uncorrectable 0x0030 252 252 000 Old_age Offline - 0
Apr 14 07:21:15 Tower2 smartctl[3620]: 199 UDMA_CRC_Error_Count 0x0036 200 200 000 Old_age Always - 0
Apr 14 07:21:15 Tower2 smartctl[3620]: 200 Multi_Zone_Error_Rate 0x002a 001 001 000 Old_age Always - 34155
Apr 14 07:21:15 Tower2 smartctl[3620]: 223 Load_Retry_Count 0x0032 252 252 000 Old_age Always - 0
Apr 14 07:21:15 Tower2 smartctl[3620]: 225 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 2167
Apr 14 07:21:15 Tower2 smartctl[3620]:
Apr 14 07:21:15 Tower2 smartctl[3620]: SMART Error Log Version: 1
Apr 14 07:21:15 Tower2 smartctl[3620]: No Errors Logged
Apr 14 07:21:15 Tower2 smartctl[3620]:
Apr 14 07:21:15 Tower2 smartctl[3620]: SMART Self-test log structure revision number 1
Apr 14 07:21:15 Tower2 smartctl[3620]: Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
Apr 14 07:21:15 Tower2 smartctl[3620]: # 1 Short offline Completed without error 00% 143 -
Apr 14 07:21:15 Tower2 smartctl[3620]:
Apr 14 07:21:15 Tower2 smartctl[3620]: Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
Apr 14 07:21:15 Tower2 smartctl[3620]: SMART Selective self-test log data structure revision number 0
Apr 14 07:21:15 Tower2 smartctl[3620]: Note: revision number not 1 implies that no selective self-test has ever been run
Apr 14 07:21:15 Tower2 smartctl[3620]: SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
Apr 14 07:21:16 Tower2 smartctl[3620]: 1 0 0 Completed [00% left] (0-65535)
Apr 14 07:21:16 Tower2 smartctl[3620]: 2 0 0 Not_testing
Apr 14 07:21:16 Tower2 smartctl[3620]: 3 0 0 Not_testing
Apr 14 07:21:16 Tower2 smartctl[3620]: 4 0 0 Not_testing
Apr 14 07:21:16 Tower2 smartctl[3620]: 5 0 0 Not_testing
Apr 14 07:21:16 Tower2 smartctl[3620]: Selective self-test flags (0x0):
Apr 14 07:21:16 Tower2 smartctl[3620]: After scanning selected spans, do NOT read-scan remainder of disk.
Apr 14 07:21:16 Tower2 smartctl[3620]: If Selective self-test is pending on power-up, resume after 0 minute delay.
Apr 14 07:21:16 Tower2 smartctl[3620]:
Apr 14 07:21:16 Tower2 smartctl[3620]: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Apr 14 07:21:16 Tower2 smartctl[3620]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Apr 14 07:21:16 Tower2 smartctl[3620]:
dgaschk Posted April 16, 2012

200 Multi_Zone_Error_Rate 0x002a 001 001 000 Old_age Always - 34155

The VALUE is 001 and the THRESHOLD is 000. These may be the normal initial values for these drives. 34155 is a raw value that only has meaning to the manufacturer. The SMART report indicates that there is nothing wrong with this drive.

The raw values of the following metrics have standard meanings:

Start_Stop_Count
Reallocated_Sector_Ct *
Power_On_Hours
Power-Off_Retract_Count
Temperature_Celsius
Reallocated_Event_Count *
Current_Pending_Sector *
Offline_Uncorrectable
Load_Retry_Count
Load_Cycle_Count

* These are the ones to watch. They may cause problems in unRAID regardless of the normalized VALUEs.
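As a rough illustration of the "ones to watch" advice, the sketch below flags any starred attribute whose raw value is non-zero. The `flagged` helper and the dictionary layout are hypothetical; the attribute names and the disk1 raw values come from the SMART report in this thread, while the second sample dictionary is invented purely to show a hit.

```python
# The three starred attributes from the list above.
WATCH = ("Reallocated_Sector_Ct", "Reallocated_Event_Count",
         "Current_Pending_Sector")


def flagged(raw_values: dict) -> list:
    """Return the watched attributes whose raw count is above zero."""
    return [name for name in WATCH if raw_values.get(name, 0) > 0]


# Raw values for disk1 as reported in this thread.
disk1 = {
    "Reallocated_Sector_Ct": 0,
    "Reallocated_Event_Count": 0,
    "Current_Pending_Sector": 0,
    "Multi_Zone_Error_Rate": 34155,  # large, but not on the watch list
}

# Hypothetical drive with one pending sector, for contrast.
example = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 1}

print("disk1:", flagged(disk1))
print("example:", flagged(example))
```

By this measure disk1 raises no flags at all, which matches the point that the SMART report alone gives no reason to condemn the drive.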
tr0910 (Author) Posted April 16, 2012

Quote (dgaschk): The raw values of the following metrics have standard meanings: Reallocated_Sector_Ct *, Reallocated_Event_Count *, Current_Pending_Sector *. * These are the ones to watch. They may cause problems in unRAID regardless of the normalized VALUEs.

Thanks for the reply. So you don't think she is dead at all, and you don't think I need to worry about her? I get the standard meanings vs. non-standard, but this is the only Samsung 203WI drive that is going crazy with the multi-zone error rate. The fact that it red-balled and went disabled was what triggered my concern.
dgaschk Posted April 17, 2012

If it can pass a pre-clear it should be ok. It would be better if the value had stayed at zero. Do you have another Samsung so you can compare the normalized value?
tr0910 (Author) Posted April 17, 2012

As you were typing that, she just red-balled on me again. I will replace her with a spare, and then try to preclear her. We'll see what happens....
tr0910 (Author) Posted April 18, 2012

Funny, but she precleared fine. No sectors are showing problems, but the G-Sense_Error_Rate and the Multi_Zone_Error_Rate keep incrementing and she has red-balled twice. Does the near_thresh mean anything?

** Changed attributes in files: /tmp/smart_start_sdc /tmp/smart_finish_sdc
ATTRIBUTE NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS RAW_VALUE
G-Sense_Error_Rate = 1 1 0 near_thresh 9067487
Multi_Zone_Error_Rate = 1 1 0 near_thresh 36481
No SMART attributes are FAILING_NOW
dgaschk Posted April 18, 2012

I would run several pre-clear cycles and see what happens. If the replacement drive does not have any problems in the same location then RMA.
razmajazz Posted April 19, 2012

For what it's worth, I have a Samsung HD103 which had similar characteristics. I had it in the array for months, and one morning it showed a red ball for no apparent reason. Short and long SMART runs were fine, so I put it back in the array and rebuilt it. After another few months the same thing happened, with the same solution, but I moved it to a different SATA port. It worked fine for a few more months, but then a 3rd red ball convinced me to remove it from the array. I used SNAP to add it as a scratch disk for non-critical data. In over a year it's failed once outside the array, but a simple reboot restored access with no data loss.

I never did figure out the problem, but I'm not concerned at this point as it is outside the array. I've had great results with Samsung drives in general, and the 8 currently in my array have worked fine, including 2 other HD103s.

My multi-zone errors are only 3. It shows 124 calibration retries, but I don't think either one of those is the root cause.
Johnm Posted April 19, 2012

I also had a drive that passed all tests and would fully write zeros (Samsung 1.5 TB F2). It always had issues and eventually lost data. I ended up getting an RMA. Samsung took it back, no questions asked.
tr0910 (Author) Posted April 20, 2012

She is now installed in a non-critical slot on a different server to see if she red-balls again. I have an RMA for her, should she continue to misbehave. Now we just wait.....
JonathanM Posted April 20, 2012

Unless the entire server contains nothing but non-critical data, there is no such thing as a "non-critical slot". Remember that unRAID relies on every drive except the one bad drive to recover, so having a questionable drive in an array puts the rest of the drives at risk of lost data if one of them fails. That's why it's so critical to stay on top of any failures and act quickly to put only trusted drives back into the array.
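The "all drives minus one" point can be shown with a toy model. This is only a sketch that assumes plain single XOR parity over tiny byte strings standing in for whole disks; it is not unRAID's actual code, and `xor_blocks` is a made-up helper.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three toy "data disks" and the parity computed across them.
data = [b"\x10\x20", b"\x0f\xf0", b"\xaa\x55"]
parity = xor_blocks(data)

# One disk lost: XOR of every survivor plus parity rebuilds it exactly.
rebuilt = xor_blocks([data[0], data[2], parity])
print("single failure recovered:", rebuilt == data[1])

# Two disks lost: the survivors only yield the XOR of the two missing
# disks, which is not enough to separate them again.
leftover = xor_blocks([data[0], parity])
print("double failure recovered:", leftover == data[1])
```

The rebuild of one disk needs a correct read from every other disk, which is why a single flaky drive undermines the protection of all the others.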
tr0910 (Author) Posted April 21, 2012

Quote (JonathanM): Unless the entire server contains nothing but non-critical data, there is no such thing as a "non-critical slot".

Good point. Although I have a spare 2 TB drive and a spare 3 TB drive to replace any bad drives, if I were to have 2 drives go bad at the same time, it would be bad news. I am giving her one last chance to see how she behaves. Hopefully I don't get burned....
tr0910 (Author) Posted April 24, 2012

Quote (JonathanM): Unless the entire server contains nothing but non-critical data, there is no such thing as a "non-critical slot".

Quote (tr0910): Good point. Although I have a spare 2 TB drive and a spare 3 TB drive to replace any bad drives, if I were to have 2 drives go bad at the same time, it would be bad news. I am giving her one last chance to see how she behaves. Hopefully I don't get burned....

Well, I almost did get burned. I had another drive (SAMSUNG HD204UI S2H7J1BZA27926) red-ball on me in the server where I put this drive for testing. I immediately put the spare 2 TB into use. Now we are good again, and this second problem Samsung has been precleared once and is on its second pass.

In 20 years I haven't had drive failures like this. It never rains but it pours....

============================================================================
** Changed attributes in files: /tmp/smart_start_sdf /tmp/smart_finish_sdf
ATTRIBUTE NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS RAW_VALUE
Current_Pending_Sector = 252 100 0 ok 0
No SMART attributes are FAILING_NOW

1 sector was pending re-allocation before the start of the preclear.
1 sector was pending re-allocation after pre-read in cycle 1 of 1.
0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.
0 sectors are pending re-allocation at the end of the preclear,
a change of -1 in the number of sectors pending re-allocation.
0 sectors had been re-allocated before the start of the preclear.
0 sectors are re-allocated at the end of the preclear,
the number of sectors re-allocated did not change.
============================================================================
mbryanr Posted April 24, 2012

I haven't had good luck with the Samsung HD204UI either: one RMA, and one creeping towards RMA with the Multi-Zone Error Rate.
dgaschk Posted April 24, 2012

If two drives have failed in the same physical bay I would suspect a power issue.
tr0910 (Author) Posted April 24, 2012

Quote (dgaschk): If two drives have failed in the same physical bay I would suspect a power issue.

No, not the same bay. One drive failed in a totally separate server that had been totally problem free for months. That server was also problem free until this other drive failed, just after I loaded the new questionable drive. Random event, or connected?