dandirk Posted August 6, 2013

So I had a drive go disabled... I was able to move all the data off the drive and rebuild parity. The array is up and running without the drive. The initial short smartctl test was OK, so I assumed something just got messed up, and I attempted to add the drive back to the array. Normal unRAID clearing seemed to be taking a very long time: 7% in about 5 hours. Speeds were being reported at first, but by the 5-hour mark they were not, so I cancelled the process and brought the array back online.

I then attempted to run a couple of SMART long tests... both seemed to stop with an "Interrupted (host reset)" message at 90% remaining. No errors, but the long test being interrupted is causing me to wonder about the health of the drive. I am going to look for additional HDD tests; just wondering if anyone else has an opinion.

root@Tower:~# smartctl -a /dev/sde
smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright © 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST31500341AS
Serial Number:    9VS1G1VS
Firmware Version: CC1H
User Capacity:    1,500,301,910,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Aug 6 13:13:59 2013 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  41) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                 ( 617) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x103f) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   120   099   006    Pre-fail  Always       -       241219340
  3 Spin_Up_Time            0x0003   100   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1248
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   071   060   030    Pre-fail  Always       -       14349190
  9 Power_On_Hours          0x0032   058   058   000    Old_age   Always       -       37570
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       2
 12 Power_Cycle_Count       0x0032   100   037   020    Old_age   Always       -       155
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   037   017   000    Old_age   Always       -       691502391322
189 High_Fly_Writes         0x003a   066   066   000    Old_age   Always       -       34
190 Airflow_Temperature_Cel 0x0022   062   042   045    Old_age   Always  In_the_past  38 (0 127 40 29)
194 Temperature_Celsius     0x0022   038   058   000    Old_age   Always       -       38 (0 14 0 0)
195 Hardware_ECC_Recovered  0x001a   042   024   000    Old_age   Always       -       241219340
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   194   000    Old_age   Always       -       1860
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       255722352811197
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2061402732
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       794048154

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                    Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      90%       37570         -
# 2  Extended offline    Interrupted (host reset)      90%       37568         -
# 3  Short offline       Completed without error       00%       37542         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

root@Tower:~#
Chris Pollard Posted August 6, 2013

37500 power on hours. Personally I would be scrapping it.
garycase Posted August 6, 2013

I wouldn't scrap it because it's 4 years old (the hours); but I would scrap it due to the history you've just had with it => it's failing the long SMART test; has been dropped out of the array; and did not clear at normal speeds with unRAID. This is a good indication of how SMART doesn't always identify bad drives. Your SMART report is actually okay. Looking at a few numbers that may seem troublesome:

=> The large raw read count is simply because Seagate reports the raw reads; other vendors do not. The important number is the normalized value of 120 ... which is fine.

=> Seek errors are fairly high, but this is also typical of Seagate's reporting. The normalized value of 71 isn't bad ... the failure threshold for this is 30 - I wouldn't be concerned unless it drops below 50.

=> The very high command timeout value likely explains why the drive is failing. The normalized value of 37 is pretty low for a SMART parameter, even though it's above the failure threshold.

=> The high ECC recovery count, together with the fairly low resulting normalized value (42), is another area that could explain the very long time the drive takes for some operations, resulting in effective failures.

Everything else looks fine -- but the bottom line is that even though the SMART report technically "passes", there are a few areas that clearly indicate problems ==> and that, particularly coupled with the issues you've had with the drive, is a good reason to scrap it.
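For anyone curious how those scary-looking Seagate raw numbers break down, here is a rough Python sketch. The bit-field layouts below are community-documented conventions for how Seagate firmware packs counters into the 48-bit raw value, not an official specification, so treat the decoded numbers as indicative only:

```python
# Decode Seagate-style packed SMART raw values (48-bit).
# Field layouts follow community documentation of Seagate firmware
# conventions; they are NOT an official Seagate specification.

def decode_words16(raw):
    """Split a 48-bit raw value into three 16-bit words (low, mid, high).
    For attribute 188 (Command_Timeout) the low word is commonly read as
    the total timeout count, with the upper words counting longer timeouts."""
    return (raw & 0xFFFF, (raw >> 16) & 0xFFFF, (raw >> 32) & 0xFFFF)

def decode_count_over_ops(raw):
    """For attributes 1/7/195 on Seagate drives, the upper 16 bits are
    often read as an error count and the low 32 bits as total operations."""
    return (raw >> 32, raw & 0xFFFFFFFF)

# Raw values taken straight from the report above:
print(decode_words16(691502391322))      # Command_Timeout   -> (8218, 193, 161)
print(decode_count_over_ops(14349190))   # Seek_Error_Rate   -> (0, 14349190)
print(decode_count_over_ops(241219340))  # Raw_Read_Error_Rate -> (0, 241219340)
```

Read this way, the seek and raw-read attributes show zero actual errors (consistent with garycase's point that the big numbers are just Seagate's verbose reporting), while Command_Timeout decodes to thousands of timed-out commands, which is the genuinely worrying one.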
dandirk (Author) Posted August 6, 2013

37500 power on hours. Personally I would be scrapping it.

lol, most of my drives are that old... it's taken a while to fill them up. I actually have a few 750GB drives in there that are older.

Thanks for the other evaluation, Gary; most of it was over my head. As for the clearing time, I have since read it can take days? So an estimated 50 hours for a 1.5TB drive could be normal? I am not sure what to think. I saw the speed was 80 MB/s (when it was still being reported), which shouldn't take that long... at those speeds I can't see any size disk taking 2+ days.

I still have plenty of space, and as I have been reminded, my disks are getting pretty old, so I probably should just play it safe and leave the drive out... I am running the SeaTools long test now... will probably find out more in a few hours, but it is pretty much moot.
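A quick back-of-the-envelope check on those numbers (assuming the reported 80 MB/s held steady across the whole 1.5 TB drive, both in decimal units as unRAID reports them):

```python
# Sanity-check the clearing-time math: one sequential pass over a
# 1.5 TB drive at a sustained 80 MB/s.
capacity_bytes = 1.5e12   # 1.5 TB (decimal)
speed_bytes_s = 80e6      # 80 MB/s (decimal)

pass_hours = capacity_bytes / speed_bytes_s / 3600
print(round(pass_hours, 1))  # -> 5.2 hours per pass

# Extrapolating the observed pace instead: 7% done in 5 hours.
observed_hours = 5 / 0.07
print(round(observed_hours, 1))  # -> 71.4 hours for one pass
```

So at a steady 80 MB/s a single clearing pass is only about 5 hours, while the observed 7%-in-5-hours pace extrapolates to roughly 71 hours, which supports the suspicion that the drive (or its link) was struggling rather than clearing normally.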
garycase Posted August 6, 2013 Share Posted August 6, 2013 Clearing time can vary a good bit based on your controller, CPU, and a variety of other factors ... but a full pre-clear cycle [pre-read, clear, post-read] with Joe L's preclear script can easily take 10 hrs/TB or a bit longer. I would NOT expect it to take 50 hrs for a 1.5TB drive, however. Quote Link to comment
Chris Pollard Posted August 6, 2013 Share Posted August 6, 2013 Not just because its old, if it can't pass self tests.... and its old, then chances are its on the way out. Better safe than sorry in my book. Quote Link to comment
dandirk (Author) Posted August 9, 2013

Thanks for all the info and replies, guys... Figured I would update this post for anyone else searching in the future. A little recap:

I was messing around, trying to clean up my unRAID install: old shares for plugins no longer in use, etc. I had to run the permission script, but it hung on a drive, locking up the server. Hard shutdown, reboot... and the drive was disabled. Data was accessible, but some SMART results showed odd, infrequent errors. The short SMART test was fine; the long test would interrupt/stop after a minute or so.

I ended up copying the data off the disk in question, removing it, and then using the new config utility to rebuild parity. Took a while, but it worked to get the array protected again. Since tests like SeaTools said the drive was OK... I tried to pre-clear and re-add it to the array. Preclear also stalled/froze, along with continued SMART test failures... As per the conversation here, the drive is probably going down the tubes, or at least risky enough to ditch.

Got a new drive... pre-clear still hung/stalled at 0%, but didn't hang the server. Swapped out the SAS breakout cable and boom, much better results. Looks like I had the beginnings of a failing drive AND a spotty cable. Pretty sure the drive was going, because though the symptoms were similar, they were slightly better with the new drive (it failed with more responsiveness lol).
Phil Karn Posted July 10, 2020

I know this is a very old thread, but just in case it's helpful: I notice the UDMA CRC count is non-zero. These are errors on the link between the drive and the controller. The cable or connectors might have become unreliable, which could also conceivably explain the spurious drive resets.
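Attribute 199 only ever counts up over the drive's life, so what matters is whether it is still increasing after a cable swap. A tiny parser for the smartctl attribute table is enough to track it; this is a minimal sketch against the exact line format in the report above (newer smartctl releases can also emit machine-readable JSON output, which is the better choice for real tooling):

```python
# Parse one line of the smartctl "Vendor Specific SMART Attributes" table.
# Minimal sketch matching the whitespace-separated format shown in this
# thread's report; not robust against all smartctl output variants.

SAMPLE = ("199 UDMA_CRC_Error_Count    0x003e   200   194   000    "
          "Old_age   Always       -       1860")

def parse_attr(line):
    f = line.split()
    # Columns: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED,
    #          WHEN_FAILED, RAW_VALUE
    return {
        "id": int(f[0]),
        "name": f[1],
        "value": int(f[3]),
        "worst": int(f[4]),
        "thresh": int(f[5]),
        # Raw kept as a string: some attributes append extras like "(0 14 0 0)".
        "raw": f[9],
    }

attr = parse_attr(SAMPLE)
print(attr["name"], attr["raw"])  # UDMA_CRC_Error_Count 1860
```

Logging that raw value before and after re-seating or replacing the cable makes it easy to confirm Phil's diagnosis: a count that stops climbing after the swap points squarely at the old cable or connector.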