Is she really dead Jim??



Yesterday Disk1 was disabled on 5b13.  SAMSUNG_HD203WI_S1UYJ1WZ405310 sdd.

 

This is after using this disk for 6 months in unRaid, 3 months of which have been using 5b13 without complaint.  It had lived a previous life for up to one year in a Buffalo NAS.  Trouble is Smarthistory doesn't think it is gone.  I had just made some changes to the server, replaced some 2tb disks with 3 tb disks, done some parity checks, and everything was fine.  Now I was just buttoning things up to let her run.  She started up fine, but 4 hours later, we had a red balled disk1.

 

I have unassigned the disk, started the array, then reassigned the disk, and it is doing a data rebuild on disk1 right now.  Am I wasting my time?  Is she likely dead?

 

Syslog of the time of failure and then the next syslog after the failure and Smarthistory attached.

 

(Note: This is a backup server that is a complete rsync of the main server so no worries about potential to lose data.  We can play around with her without worry.)

 

Latest Smarthistory from Unmenu after the failure (full smarthistory attached)
2012-04-14	querytime	1334411485
2012-04-14	health	PASSED
2012-04-14	ATA_Error_Count	
2012-04-14	Raw_Read_Error_Rate	49
2012-04-14	Spin_Up_Time	12134
2012-04-14	Start_Stop_Count	353
2012-04-14	Spin_Retry_Count	0 
2012-04-14	Calibration_Retry_Count	0 
2012-04-14	Power_Cycle_Count	94
2012-04-14	Reallocated_Sector_Ct	0 
2012-04-14	Seek_Error_Rate	0 
2012-04-14	Temperature_Celsius	24 (Min/Max 12/41)
2012-04-14	Reallocated_Event_Count	0 
2012-04-14	Current_Pending_Sector	0 
2012-04-14	Power_On_Hours	718
2012-04-14	Offline_Uncorrectable	0 
2012-04-14	UDMA_CRC_Error_Count	0 
2012-04-14	Multi_Zone_Error_Rate	34155

syslog-smarthistory.zip

Link to comment

Multi_Zone_Error_Rate = "The count of errors found when writing a sector. The higher the value, the worse the disk's mechanical condition is."...

 

Well, since you said that you expanded your system, there is a slight chance of bad cabling. There is also a chance that you are under-powering the drive. Is your PSU large enough for the server?

 

 

 

 

Link to comment

The SATA cabling to this drive was replaced before I started the rebuild.  The power supply is a Corsair CX430 Builder Series powering 6 drives, all green, except that one of the new 3tb drives is a Seagate ST33000651AS 7200 rpm drive.  (One of the drives I replaced was an old Hitachi 1tb 7200 rpm toaster, so power requirements are likely lower with this new config, but I can't be sure.)

 

From rajahal's notes on this PS:

CORSAIR Builder Series CX430 CMPSU-430CX 430W

- 28A, supports up to 12 green drives or 7 7200 rpm drives

- recommended for small servers, 10 drives or less
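
For a rough sanity check on those numbers, here is a minimal sketch of the 12V budget. It assumes the "12 green or 7 7200 rpm" figures simply divide the 28A rail evenly, which is my reading of the notes and not a measured spin-up current, and it assumes the 6 drives are 5 green plus the one 7200 rpm Seagate.

```python
# Per-drive 12V budgets implied by the notes above (assumption: the
# "12 green or 7 7200 rpm" figures divide the 28A rail evenly).
RAIL_AMPS = 28.0
GREEN_AMPS = RAIL_AMPS / 12      # about 2.33A per green drive
FAST_AMPS = RAIL_AMPS / 7        # 4.0A per 7200 rpm drive

def spinup_load(green_drives, fast_drives):
    """Estimated worst-case 12V draw in amps if every drive spins up at once."""
    return green_drives * GREEN_AMPS + fast_drives * FAST_AMPS

# This server: 6 drives, one of them the 7200 rpm Seagate.
load = spinup_load(green_drives=5, fast_drives=1)
print(f"{load:.1f}A of {RAIL_AMPS:.0f}A")
```

By this estimate the drives use a little over half of the rail, so on paper the PSU should not be the limiting factor, though a marginal or aging unit could still misbehave.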

 

Link to comment

We have rebuilt the drive, followed by a fresh parity check, followed by a complete rsync to resync this drive to its mother, and it is performing wonderfully. (knock on wood)

 

Smarthistory shows the multi-zone error rate creeping up though:

 

2012-02-06	Multi_Zone_Error_Rate	30008
2012-04-14	Multi_Zone_Error_Rate	34155
2012-04-15	Multi_Zone_Error_Rate	34405

 

Compared with the other drives, none has a multi-zone error rate higher than 500, and that is a drive with over 5000 power on hours, and this one is less than 800 hours.  I suppose it's only a matter of time..... (sigh)
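
To put a number on "creeping up", here is a small sketch that turns the three Smarthistory samples above into a per-day increase. The `samples` list and the `daily_rates` helper are just names I made up for illustration.

```python
from datetime import date

# Multi_Zone_Error_Rate raw values from the Smarthistory lines above.
samples = [
    (date(2012, 2, 6), 30008),
    (date(2012, 4, 14), 34155),
    (date(2012, 4, 15), 34405),
]

def daily_rates(points):
    """Return (end_date, raw increase per day) for each consecutive pair."""
    rates = []
    for (d0, v0), (d1, v1) in zip(points, points[1:]):
        days = (d1 - d0).days
        rates.append((d1, (v1 - v0) / days))
    return rates

for end, rate in daily_rates(samples):
    print(f"up to {end}: {rate:.0f} per day")
```

The last interval covers the rebuild, parity check and rsync, so the jump there may simply reflect how hard the drive was being worked that day.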

Link to comment

Compare the VALUE to the THRESHOLD. The drive is considered failing when the value drops to or below the threshold in the SMART report. There are only a handful of metrics for which the raw value has a standard meaning.
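
As a rough illustration of that comparison, here is a minimal sketch that splits one row of the smartctl attribute table and checks VALUE against THRESH. It assumes the syslog prefix (date, host, `smartctl[pid]:`) has already been stripped from the line; `parse_attribute` and `is_failing` are hypothetical helper names, not part of smartmontools.

```python
def parse_attribute(line):
    """Split one row of the smartctl attribute table into a dict."""
    # RAW_VALUE may contain spaces (e.g. the temperature row), so keep it whole.
    f = line.strip().split(None, 9)
    return {
        "id": int(f[0]),
        "name": f[1],
        "value": int(f[3]),
        "worst": int(f[4]),
        "thresh": int(f[5]),
        "type": f[6],
        "raw": f[9],
    }

def is_failing(attr):
    # An attribute counts as failed when the normalized VALUE <= THRESH.
    return attr["value"] <= attr["thresh"]

row = "200 Multi_Zone_Error_Rate   0x002a   001   001   000    Old_age   Always       -       34155"
attr = parse_attribute(row)
print(attr["name"], attr["value"], attr["thresh"], is_failing(attr))
```

With a threshold of 000, the normalized value of 001 can never trip this check, which is why the report still says PASSED no matter how large the raw number gets.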

 

Wikipedia definition for Multi-Zone-Error-Rate: The count of errors found when writing a sector. The higher the value, the worse the disk's mechanical condition is.

 

The threshold is 0, and the multi zone error rate I have on this drive is 34000+ and steadily increasing, not decreasing.  Are you suggesting this drive is improving with age?  :-)

 

Surprisingly per Wikipedia, this isn't one of the failure imminent codes.  I am just wondering why all my other Samsung drives have some low level of multi zone error rate (100-500), and my Seagate, Hitachi and WD drives all show zeros.  (At 34000+, this Samsung isn't considered low level!!)

 

Apr 14 07:21:14 Tower2 smartctl[3620]: === START OF INFORMATION SECTION ===
Apr 14 07:21:14 Tower2 smartctl[3620]: Device Model:     SAMSUNG HD203WI
Apr 14 07:21:14 Tower2 smartctl[3620]: Serial Number:    S1UYJ1WZ405310
Apr 14 07:21:14 Tower2 smartctl[3620]: Firmware Version: 1AN10002
Apr 14 07:21:14 Tower2 smartctl[3620]: User Capacity:    2,000,398,934,016 bytes
Apr 14 07:21:14 Tower2 smartctl[3620]: Device is:        Not in smartctl database [for details use: -P showall]
Apr 14 07:21:14 Tower2 smartctl[3620]: ATA Version is:   8
Apr 14 07:21:14 Tower2 smartctl[3620]: ATA Standard is:  ATA-8-ACS revision 6
Apr 14 07:21:14 Tower2 smartctl[3620]: Local Time is:    Sat Apr 14 07:21:12 2012 MDT
Apr 14 07:21:14 Tower2 smartctl[3620]: SMART support is: Available - device has SMART capability.
Apr 14 07:21:14 Tower2 smartctl[3620]: SMART support is: Enabled
Apr 14 07:21:14 Tower2 smartctl[3620]: Power mode is:    ACTIVE or IDLE
Apr 14 07:21:14 Tower2 smartctl[3620]: 
Apr 14 07:21:14 Tower2 smartctl[3620]: === START OF READ SMART DATA SECTION ===
Apr 14 07:21:14 Tower2 smartctl[3620]: SMART overall-health self-assessment test result: PASSED
Apr 14 07:21:14 Tower2 smartctl[3620]: 
Apr 14 07:21:14 Tower2 smartctl[3620]: General SMART Values:
Apr 14 07:21:14 Tower2 smartctl[3620]: Offline data collection status:  (0x00)        Offline data collection activity
Apr 14 07:21:14 Tower2 smartctl[3620]:                                         was never started.
Apr 14 07:21:14 Tower2 smartctl[3620]:                                         Auto Offline Data Collection: Disabled.
Apr 14 07:21:14 Tower2 smartctl[3620]: Self-test execution status:      (   0)        The previous self-test routine completed
Apr 14 07:21:14 Tower2 smartctl[3620]:                                         without error or no self-test has ever 
Apr 14 07:21:14 Tower2 smartctl[3620]:                                         been run.
Apr 14 07:21:14 Tower2 smartctl[3620]: Total time to complete Offline 
Apr 14 07:21:14 Tower2 smartctl[3620]: data collection:                  (25860) seconds.
Apr 14 07:21:14 Tower2 smartctl[3620]: Offline data collection
Apr 14 07:21:15 Tower2 smartctl[3620]: capabilities:                          (0x5b) SMART execute Offline immediate.
Apr 14 07:21:15 Tower2 smartctl[3620]:                                         Auto Offline data collection on/off support.
Apr 14 07:21:15 Tower2 smartctl[3620]:                                         Suspend Offline collection upon new
Apr 14 07:21:15 Tower2 smartctl[3620]:                                         command.
Apr 14 07:21:15 Tower2 smartctl[3620]:                                         Offline surface scan supported.
Apr 14 07:21:15 Tower2 smartctl[3620]:                                         Self-test supported.
Apr 14 07:21:15 Tower2 smartctl[3620]:                                         No Conveyance Self-test supported.
Apr 14 07:21:15 Tower2 smartctl[3620]:                                         Selective Self-test supported.
Apr 14 07:21:15 Tower2 smartctl[3620]: SMART capabilities:            (0x0003)        Saves SMART data before entering
Apr 14 07:21:15 Tower2 smartctl[3620]:                                         power-saving mode.
Apr 14 07:21:15 Tower2 smartctl[3620]:                                         Supports SMART auto save timer.
Apr 14 07:21:15 Tower2 smartctl[3620]: Error logging capability:        (0x01)        Error logging supported.
Apr 14 07:21:15 Tower2 smartctl[3620]:                                         General Purpose Logging supported.
Apr 14 07:21:15 Tower2 smartctl[3620]: Short self-test routine 
Apr 14 07:21:15 Tower2 smartctl[3620]: recommended polling time:          (   2) minutes.
Apr 14 07:21:15 Tower2 smartctl[3620]: Extended self-test routine
Apr 14 07:21:15 Tower2 smartctl[3620]: recommended polling time:          ( 255) minutes.
Apr 14 07:21:15 Tower2 smartctl[3620]: SCT capabilities:                (0x003f)        SCT Status supported.
Apr 14 07:21:15 Tower2 smartctl[3620]:                                         SCT Error Recovery Control supported.
Apr 14 07:21:15 Tower2 smartctl[3620]:                                         SCT Feature Control supported.
Apr 14 07:21:15 Tower2 smartctl[3620]:                                         SCT Data Table supported.
Apr 14 07:21:15 Tower2 smartctl[3620]: 
Apr 14 07:21:15 Tower2 smartctl[3620]: SMART Attributes Data Structure revision number: 16
Apr 14 07:21:15 Tower2 smartctl[3620]: Vendor Specific SMART Attributes with Thresholds:
Apr 14 07:21:15 Tower2 smartctl[3620]: ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
Apr 14 07:21:15 Tower2 smartctl[3620]:   1 Raw_Read_Error_Rate     0x002f   100   100   051    Pre-fail  Always       -       49
Apr 14 07:21:15 Tower2 smartctl[3620]:   2 Throughput_Performance  0x0026   252   252   000    Old_age   Always       -       0
Apr 14 07:21:15 Tower2 smartctl[3620]:   3 Spin_Up_Time            0x0023   060   035   025    Pre-fail  Always       -       12138
Apr 14 07:21:15 Tower2 smartctl[3620]:   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       352
Apr 14 07:21:15 Tower2 smartctl[3620]:   5 Reallocated_Sector_Ct   0x0033   252   252   010    Pre-fail  Always       -       0
Apr 14 07:21:15 Tower2 smartctl[3620]:   7 Seek_Error_Rate         0x002e   252   252   051    Old_age   Always       -       0
Apr 14 07:21:15 Tower2 smartctl[3620]:   8 Seek_Time_Performance   0x0024   252   252   015    Old_age   Offline      -       0
Apr 14 07:21:15 Tower2 smartctl[3620]:   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       718
Apr 14 07:21:15 Tower2 smartctl[3620]:  10 Spin_Retry_Count        0x0032   252   252   051    Old_age   Always       -       0
Apr 14 07:21:15 Tower2 smartctl[3620]:  11 Calibration_Retry_Count 0x0032   252   252   000    Old_age   Always       -       0
Apr 14 07:21:15 Tower2 smartctl[3620]:  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       93
Apr 14 07:21:15 Tower2 smartctl[3620]: 191 G-Sense_Error_Rate      0x0022   001   001   000    Old_age   Always       -       5536277
Apr 14 07:21:15 Tower2 smartctl[3620]: 192 Power-Off_Retract_Count 0x0022   252   252   000    Old_age   Always       -       0
Apr 14 07:21:15 Tower2 smartctl[3620]: 194 Temperature_Celsius     0x0002   064   059   000    Old_age   Always       -       21 (Min/Max 12/41)
Apr 14 07:21:15 Tower2 smartctl[3620]: 195 Hardware_ECC_Recovered  0x003a   100   100   000    Old_age   Always       -       0
Apr 14 07:21:15 Tower2 smartctl[3620]: 196 Reallocated_Event_Count 0x0032   252   252   000    Old_age   Always       -       0
Apr 14 07:21:15 Tower2 smartctl[3620]: 197 Current_Pending_Sector  0x0032   252   252   000    Old_age   Always       -       0
Apr 14 07:21:15 Tower2 smartctl[3620]: 198 Offline_Uncorrectable   0x0030   252   252   000    Old_age   Offline      -       0
Apr 14 07:21:15 Tower2 smartctl[3620]: 199 UDMA_CRC_Error_Count    0x0036   200   200   000    Old_age   Always       -       0
Apr 14 07:21:15 Tower2 smartctl[3620]: 200 Multi_Zone_Error_Rate   0x002a   001   001   000    Old_age   Always       -       34155
Apr 14 07:21:15 Tower2 smartctl[3620]: 223 Load_Retry_Count        0x0032   252   252   000    Old_age   Always       -       0
Apr 14 07:21:15 Tower2 smartctl[3620]: 225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       2167
Apr 14 07:21:15 Tower2 smartctl[3620]: 
Apr 14 07:21:15 Tower2 smartctl[3620]: SMART Error Log Version: 1
Apr 14 07:21:15 Tower2 smartctl[3620]: No Errors Logged
Apr 14 07:21:15 Tower2 smartctl[3620]: 
Apr 14 07:21:15 Tower2 smartctl[3620]: SMART Self-test log structure revision number 1
Apr 14 07:21:15 Tower2 smartctl[3620]: Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
Apr 14 07:21:15 Tower2 smartctl[3620]: # 1  Short offline       Completed without error       00%       143         -
Apr 14 07:21:15 Tower2 smartctl[3620]: 
Apr 14 07:21:15 Tower2 smartctl[3620]: Note: selective self-test log revision number (0) not 1 implies that no selective self-test has ever been run
Apr 14 07:21:15 Tower2 smartctl[3620]: SMART Selective self-test log data structure revision number 0
Apr 14 07:21:15 Tower2 smartctl[3620]: Note: revision number not 1 implies that no selective self-test has ever been run
Apr 14 07:21:15 Tower2 smartctl[3620]:  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
Apr 14 07:21:16 Tower2 smartctl[3620]:     1        0        0  Completed [00% left] (0-65535)
Apr 14 07:21:16 Tower2 smartctl[3620]:     2        0        0  Not_testing
Apr 14 07:21:16 Tower2 smartctl[3620]:     3        0        0  Not_testing
Apr 14 07:21:16 Tower2 smartctl[3620]:     4        0        0  Not_testing
Apr 14 07:21:16 Tower2 smartctl[3620]:     5        0        0  Not_testing
Apr 14 07:21:16 Tower2 smartctl[3620]: Selective self-test flags (0x0):
Apr 14 07:21:16 Tower2 smartctl[3620]:   After scanning selected spans, do NOT read-scan remainder of disk.
Apr 14 07:21:16 Tower2 smartctl[3620]: If Selective self-test is pending on power-up, resume after 0 minute delay.
Apr 14 07:21:16 Tower2 smartctl[3620]: 
Apr 14 07:21:16 Tower2 smartctl[3620]: smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Apr 14 07:21:16 Tower2 smartctl[3620]: Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
Apr 14 07:21:16 Tower2 smartctl[3620]: 

Link to comment

 200 Multi_Zone_Error_Rate   0x002a   001   001   000    Old_age   Always       -       34155

 

The VALUE is 001 and the THRESHOLD is 000. These may be the normal initial values for these drives. 34155 is a raw value that only has meaning to the manufacturer. The SMART report indicates that there is nothing wrong with this drive.

 

The Raw values of the following metrics have standard meanings:

 

Start_Stop_Count

Reallocated_Sector_Ct *

Power_On_Hours

Power-Off_Retract_Count

Temperature_Celsius   

Reallocated_Event_Count *

Current_Pending_Sector *

Offline_Uncorrectable

Load_Retry_Count

Load_Cycle_Count

 

* These are the ones to watch. They may cause problems in unRAID regardless of the normalized VALUEs.
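
If you want to automate that check, something along these lines might do. It is only a sketch: the `WATCH` tuple and `watch_warnings` are names invented here, and the rule "any non-zero raw count deserves a look" is my simplification of the advice above.

```python
# Attributes singled out above as the ones to watch by raw value.
WATCH = ("Reallocated_Sector_Ct", "Reallocated_Event_Count", "Current_Pending_Sector")

def watch_warnings(raw_values):
    """Return the watched attributes whose raw count is non-zero."""
    return {name: raw_values[name] for name in WATCH if raw_values.get(name, 0) > 0}

# Raw values from the 2012-04-14 report earlier in the thread.
disk1 = {
    "Reallocated_Sector_Ct": 0,
    "Reallocated_Event_Count": 0,
    "Current_Pending_Sector": 0,
    "Multi_Zone_Error_Rate": 34155,
}
print(watch_warnings(disk1))
```

For this drive the result is empty: the large Multi_Zone_Error_Rate raw value is deliberately ignored because it is not on the watch list.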

Link to comment

The Raw values of the following metrics have standard meanings:

Reallocated_Sector_Ct *

Reallocated_Event_Count *

Current_Pending_Sector *

* These are the ones to watch. They may cause problems in unRAID regardless of the normalized VALUEs.

Thanks for the reply.  So you don't think she is dead at all, and you don't think I need to worry about her?

 

I get the standard meanings vs non-standard, but this is the only Samsung 203WI drive that is going crazy with the multi-zone error rate.  The fact that it red-balled and went disabled was what triggered me. 

Link to comment

Funny, but she precleared fine.  No sectors are showing problems but the g-sense-error-rate and the multi-zone-error-rates keep incrementing and she has red-balled twice.

 

Does the near thresh mean anything?

 

** Changed attributes in files: /tmp/smart_start_sdc  /tmp/smart_finish_sdc

                ATTRIBUTE  NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS      RAW_VALUE

      G-Sense_Error_Rate =    1      1            0        near_thresh 9067487

    Multi_Zone_Error_Rate =    1      1            0        near_thresh 36481

No SMART attributes are FAILING_NOW

 

 

Link to comment

For what it's worth, I have a Samsung HD103 which had similar characteristics. I had it in the array for months and one morning a red ball for no apparent reason. Short and long SMART runs were fine so I put it back in the array and rebuilt it. After another few months the same thing and the same solution but I moved it to a different SATA port. Worked fine for a few more months but then a 3rd red ball convinced me to remove it from the array. I used SNAP to add it as a scratch disk for non-critical data. In over a year it's failed once outside the array but a simple reboot restored access with no data loss.

 

I never did figure out the problem but I'm not concerned at this point as it is outside the array. I've had great results with Samsung drives in general and the 8 currently in my array have worked fine, including 2 other HD103s. My multi-zone errors are only 3. It shows 124 calibration retries but I don't think either one of those is the root cause.

Link to comment

Unless the entire server contains nothing but non-critical data, there is no such thing as a "non-critical slot". Remember unRAID relies on every drive except the failed one to recover, so having a questionable drive in an array puts the data on the rest of the drives at risk if one of them fails. That's why it's so critical to stay on top of any failures and act quickly to put only trusted drives back into the array.
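
To see why every remaining drive matters, here is a toy sketch of single-parity recovery. It assumes the parity drive behaves like a plain byte-wise XOR across the data drives, and the three tiny "drives" are made-up data, not anything from a real array.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data drives, four bytes each.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Lose drive 1: XOR of parity with every surviving drive restores it.
rebuilt = xor_blocks([parity, data[0], data[2]])
print(rebuilt == data[1])

# Lose drives 1 and 2: only the XOR of the two missing drives is recoverable.
combined = xor_blocks([parity, data[0]])
print(combined == xor_blocks([data[1], data[2]]))
```

With one drive missing the rebuild is exact, but with two missing all you can get back is their XOR, which is not enough to separate them. A flaky drive that drops out during someone else's rebuild is exactly that second case.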

Link to comment

Unless the entire server contains nothing but non-critical data, there is no such thing as a "non-critical slot".

 

Good point.  Although I have a spare 2tb drive, and a spare 3 tb drive to replace any bad drives, if I was to have 2 drives go bad at the same time, it would be bad news.  I am giving her one last chance to see how she behaves.  Hopefully I don't get burned....

Link to comment


Well, I almost did get burned.  I had another drive (SAMSUNG HD204UI    S2H7J1BZA27926) red ball on me in the server where I put this drive for testing.  Immediately, I put the spare 2tb into use.  Now we are good again, and this second problem Samsung has been precleared once and is on its second pass.  In 20 years I haven't had drive failures like this.  It never rains but it pours....

 

============================================================================
** Changed attributes in files: /tmp/smart_start_sdf  /tmp/smart_finish_sdf
                ATTRIBUTE   NEW_VAL OLD_VAL FAILURE_THRESHOLD STATUS      RAW_VALUE
   Current_Pending_Sector =   252     100            0        ok          0
No SMART attributes are FAILING_NOW

1 sector was pending re-allocation before the start of the preclear.
1 sector was pending re-allocation after pre-read in cycle 1 of 1.
0 sectors were pending re-allocation after zero of disk in cycle 1 of 1.
0 sectors are pending re-allocation at the end of the preclear,
    a change of -1 in the number of sectors pending re-allocation.
0 sectors had been re-allocated before the start of the preclear.
0 sectors are re-allocated at the end of the preclear,
    the number of sectors re-allocated did not change. 
============================================================================

Link to comment

If two drives have failed in the same physical bay I would suspect a power issue.

 

No, not the same bay.  One drive failed in a totally separate server that had been totally problem free for months.  This one was also problem free, until this other drive failed, just after I loaded the new questionable drive.  Random event, or connected????

Link to comment
