[solved] SMART long tests interrupted - bad drive?



So I had a drive go disabled... I was able to move all data off the drive, rebuild parity.  Array is up and running without the drive.

 

Initial short smartctl test was OK, so I assumed something just got messed up, and I attempted to add the drive back to the array.  The normal Unraid clearing seemed to be taking a very long time: 7% in about 5 hours.  Speeds were being reported at first, but by the 5-hour mark they were not, so I cancelled the process and brought the array back online.

 

I then attempted to run a couple of SMART long tests... both seemed to stop with an "Interrupted (host reset)" message at 90% remaining.

 

No errors were logged, but the long test being interrupted makes me wonder about the health of the drive.  I am going to look for additional HDD tests; just wondering if anyone else has an opinion.
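For what it's worth, the "(41)" on the "Self-test execution status" line of the report below packs two facts into one byte: per the ATA spec, the upper nibble is the status code and the lower nibble is tens-of-percent of the test remaining. A quick sketch of the decode:

```shell
# Decode the SMART self-test execution status byte (41 in the report below).
status=41
code=$(( status >> 4 ))          # 2 = interrupted by host with a hard or soft reset
left=$(( (status & 0xF) * 10 ))  # 90 = percent of the test still remaining
echo "code=$code remaining=${left}%"
```

Status code 2 is "interrupted by the host with a hard or soft reset", and the 9 means 90% of the test was still remaining, which matches the self-test log entries exactly.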

 

root@Tower:~# smartctl -a /dev/sde

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

=== START OF INFORMATION SECTION ===

Model Family:    Seagate Barracuda 7200.11 family

Device Model:    ST31500341AS

Serial Number:    9VS1G1VS

Firmware Version: CC1H

User Capacity:    1,500,301,910,016 bytes

Device is:        In smartctl database [for details use: -P show]

ATA Version is:  8

ATA Standard is:  ATA-8-ACS revision 4

Local Time is:    Tue Aug  6 13:13:59 2013 CDT

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

 

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

See vendor-specific Attribute list for marginal Attributes.

 

General SMART Values:

Offline data collection status:  (0x82) Offline data collection activity

                                        was completed without error.

                                        Auto Offline Data Collection: Enabled.

Self-test execution status:      (  41) The self-test routine was interrupted

                                        by the host with a hard or soft reset.

Total time to complete Offline

data collection:                ( 617) seconds.

Offline data collection

capabilities:                    (0x7b) SMART execute Offline immediate.

                                        Auto Offline data collection on/off support.

                                        Suspend Offline collection upon new

                                        command.

                                        Offline surface scan supported.

                                        Self-test supported.

                                        Conveyance Self-test supported.

                                        Selective Self-test supported.

SMART capabilities:            (0x0003) Saves SMART data before entering

                                        power-saving mode.

                                        Supports SMART auto save timer.

Error logging capability:        (0x01) Error logging supported.

                                        General Purpose Logging supported.

Short self-test routine

recommended polling time:        (  1) minutes.

Extended self-test routine

recommended polling time:        ( 255) minutes.

Conveyance self-test routine

recommended polling time:        (  2) minutes.

SCT capabilities:              (0x103f) SCT Status supported.

                                        SCT Error Recovery Control supported.

                                        SCT Feature Control supported.

                                        SCT Data Table supported.

 

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE

  1 Raw_Read_Error_Rate    0x000f  120  099  006    Pre-fail  Always      -      241219340

  3 Spin_Up_Time            0x0003  100  092  000    Pre-fail  Always      -      0

  4 Start_Stop_Count        0x0032  099  099  020    Old_age  Always      -      1248

  5 Reallocated_Sector_Ct  0x0033  100  100  036    Pre-fail  Always      -      0

  7 Seek_Error_Rate        0x000f  071  060  030    Pre-fail  Always      -      14349190

  9 Power_On_Hours          0x0032  058  058  000    Old_age  Always      -      37570

10 Spin_Retry_Count        0x0013  100  100  097    Pre-fail  Always      -      2

12 Power_Cycle_Count      0x0032  100  037  020    Old_age  Always      -      155

184 End-to-End_Error        0x0032  100  100  099    Old_age  Always      -      0

187 Reported_Uncorrect      0x0032  100  100  000    Old_age  Always      -      0

188 Command_Timeout        0x0032  037  017  000    Old_age  Always      -      691502391322

189 High_Fly_Writes        0x003a  066  066  000    Old_age  Always      -      34

190 Airflow_Temperature_Cel 0x0022  062  042  045    Old_age  Always  In_the_past 38 (0 127 40 29)

194 Temperature_Celsius    0x0022  038  058  000    Old_age  Always      -      38 (0 14 0 0)

195 Hardware_ECC_Recovered  0x001a  042  024  000    Old_age  Always      -      241219340

197 Current_Pending_Sector  0x0012  100  100  000    Old_age  Always      -      0

198 Offline_Uncorrectable  0x0010  100  100  000    Old_age  Offline      -      0

199 UDMA_CRC_Error_Count    0x003e  200  194  000    Old_age  Always      -      1860

240 Head_Flying_Hours      0x0000  100  253  000    Old_age  Offline      -      255722352811197

241 Total_LBAs_Written      0x0000  100  253  000    Old_age  Offline      -      2061402732

242 Total_LBAs_Read        0x0000  100  253  000    Old_age  Offline      -      794048154

 

SMART Error Log Version: 1

No Errors Logged

 

SMART Self-test log structure revision number 1

Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error

# 1  Extended offline    Interrupted (host reset)      90%    37570        -

# 2  Extended offline    Interrupted (host reset)      90%    37568        -

# 3  Short offline      Completed without error      00%    37542        -

 

SMART Selective self-test log data structure revision number 1

SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS

    1        0        0  Not_testing

    2        0        0  Not_testing

    3        0        0  Not_testing

    4        0        0  Not_testing

    5        0        0  Not_testing

Selective self-test flags (0x0):

  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

 

root@Tower:~#


I wouldn't scrap it because it's 4 years old (the hours); but I would scrap it due to the history you've just had with it => it's failing the long SMART test, has been dropped from the array, and did not clear at normal speeds with UnRAID.

 

This is a good indication of how SMART doesn't always identify bad drives.  Your SMART report is actually okay.  Looking at a few numbers that may seem troublesome:

 

=>  The large raw read-error count is simply because Seagate reports the raw reads; other vendors do not.  The important number is the normalized value of 120 ... which is fine.

 

=>  Seek errors are fairly high, but this is also typical of Seagate's reporting.  The normalized value of 71 isn't bad ... the failure threshold for this is 30.  I wouldn't be concerned unless it drops below 50.
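To illustrate the Seagate quirk: the 48-bit raw values for attributes 1 and 7 are widely reported to pack an error count in the upper 16 bits and an operation count in the lower 32 bits. That split is an assumption about Seagate's encoding, not something smartctl guarantees, but as a sketch:

```shell
# Hypothetical decode of Seagate's packed raw values for attributes 1 and 7:
# upper 16 bits = errors, lower 32 bits = operations (assumed layout).
decode() {
  printf '%d errors in %d operations\n' $(( $1 >> 32 )) $(( $1 & 0xFFFFFFFF ))
}
decode 14349190     # Seek_Error_Rate raw value from the report
decode 241219340    # Raw_Read_Error_Rate raw value from the report
```

Both raw values fit entirely in the lower 32 bits, so under this reading the drive has logged zero actual seek or read errors.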

 

=>  The very high Command_Timeout raw value likely explains why the drive is failing.  The normalized value of 37 is pretty low for a SMART parameter, even though it's above the failure threshold.
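One common reading of attribute 188's raw value is three packed 16-bit counters rather than one huge number (again, this is an assumption about the vendor encoding rather than anything smartctl promises):

```shell
# Hypothetical split of the Command_Timeout raw value into three 16-bit counters.
raw=691502391322   # raw value of attribute 188 from the report
printf 'counters: %d / %d / %d\n' \
  $(( raw & 0xFFFF )) $(( (raw >> 16) & 0xFFFF )) $(( raw >> 32 ))
```

That splits into 8218 / 193 / 161, i.e. thousands of command timeouts rather than 691 billion, which is still plenty to explain the drive's behavior.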

 

=>  The high ECC recovery count, together with the fairly low resulting normalized value (42), is another area that could explain the very long time the drive takes for some operations, resulting in effective failures.

 

Everything else looks fine -- but the bottom line is that even though the SMART report technically "passes", there are a few areas that clearly indicate problems ==> and that, particularly coupled with the issues you've had with the drive, is a good reason to scrap the drive.

 


37,500 power-on hours.  Personally, I would be scrapping it.

 

lol most of my drives are that old... it's taken a while to fill them up.  I actually have a few 750GB drives in there that are older  :o

 

Thanks for the other evaluation Gary, most of it was over my head.  As for the clearing time, I have since read it can take days, so an estimated 50 hours for a 1.5 TB drive could be normal?  I am not sure what to think; I saw the speed was 80 MB/s (when it was still reporting), which shouldn't take that long... at those speeds I can't see a disk of any size taking 2+ days.

 

I still have plenty of space, and as I have been reminded, my disks are getting pretty old, so I should probably just play it safe and leave the drive out...

 

I am running the SeaTools long test now... I will probably find out more in a few hours, but it is pretty much moot.


Clearing time can vary a good bit based on your controller, CPU, and a variety of other factors ... but a full pre-clear cycle [pre-read, clear, post-read] with Joe L's preclear script can easily take 10 hrs/TB or a bit longer.  I would NOT expect it to take 50 hrs for a 1.5TB drive, however.
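A back-of-envelope check against the 80 MB/s figure reported earlier (assuming that rate is sustained for the whole pass):

```shell
# Rough clearing-time estimate at the reported transfer rate.
bytes=1500301910016        # drive capacity from the smartctl output
rate=$(( 80 * 1000000 ))   # 80 MB/s, as reported before it stopped updating
pass=$(( bytes / rate ))   # seconds for one full pass over the disk
echo "one pass: ~$(( pass / 3600 )) h; 3-pass preclear: ~$(( pass * 3 / 3600 )) h"
```

So at a healthy 80 MB/s a single clearing pass should take roughly 5 hours, and a full 3-pass preclear around 15 hours; 7% in 5 hours is far off that pace.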

 


Thanks for all the info and replies guys... Figured I would update this post for anyone else searching in the future.

 

A little recap:

 

Messing around, trying to clean up my Unraid (old shares for plugins no longer in use, etc.), I had to run the permissions script, but it hung on a drive, locking up the server.  Hard shutdown, reboot... the drive was disabled.  Data was accessible, but some SMART results showed odd, infrequent errors.  A short SMART test was fine; a long one would interrupt/stop after a minute or so.

 

I ended up copying the data off the disk in question, removing it, and then using the New Config utility to rebuild parity.  It took a while, but it got the array protected again.

 

Since tests like SeaTools said the drive was OK, I tried to pre-clear it and re-add it to the array.  The preclear also stalled/froze, along with continued SMART test failures...  As per the conversation here, the drive is probably going down the tubes, or at least risky enough to ditch.

 

Got a new drive... the pre-clear still hung/stalled at 0%, but it didn't hang the server.  Swapped out the SAS breakout cable and boom, much better results.  Looks like I had the beginnings of a failing drive AND a spotty cable.  Pretty sure the old drive was going, because though the symptoms were similar, they were slightly better with the new drive (it failed with more responsiveness lol).

 

 

 

  • 6 years later...

I know this is a very old thread, but just in case it's helpful: I notice the UDMA CRC count is non-zero.  These are errors on the link between the drive and the controller.  The cable or connectors might have become unreliable, which could also conceivably explain the spurious drive resets.
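Attribute 199 is a cumulative counter that never resets, so after swapping a cable the thing to watch is whether it keeps climbing. A small sketch of pulling it out of a saved smartctl dump (the sample line is copied from the report above):

```shell
# Extract the UDMA CRC error count (attribute 199) from a smartctl report line.
smart_line='199 UDMA_CRC_Error_Count    0x003e   200   194   000    Old_age   Always       -       1860'
echo "$smart_line" | awk '$1 == 199 { print "CRC errors so far:", $NF }'
```

Run this against successive `smartctl -A` dumps; if the number stays at 1860 after the cable swap, the link errors have stopped.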

