November 3, 2014
Hi all, for the first time I see errors on the main page for one HDD, 1467 so far. What should I do to avoid data loss? Is there a check I can run, or should I shut the server down and replace the HDD? Please help, or point me to some how-tos. It's a production 12TB server. Thanks in advance!
November 3, 2014 Author
Thanks for the hint and for the quick reaction. Here are the logs.
Attachment: logs.zip
November 3, 2014
One of the disks is having issues. It looks to be sde, but the .zip has the SMART report for sdc, which shows no errors. Please post the SMART report for sde.

Oct 31 11:30:33 r5-server kernel: mdcmd (3): import 2 8,64 1953514552 WDC_WD20EARS-11J99B1_WD-WCAWZ0931907
Oct 31 11:30:33 r5-server kernel: md: import disk2: [8,64] (sde) WDC_WD20EARS-11J99B1_WD-WCAWZ0931907 size: 1953514552
Oct 31 15:46:01 r5-server kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 31 15:46:01 r5-server kernel: ata4.00: irq_stat 0x40000001
Oct 31 15:46:01 r5-server kernel: ata4.00: failed command: READ DMA EXT
Oct 31 15:46:01 r5-server kernel: ata4.00: cmd 25/00:00:78:1f:b9/00:04:bb:00:00/e0 tag 5 dma 524288 in
Oct 31 15:46:01 r5-server kernel: res 51/40:2f:48:21:b9/00:02:bb:00:00/e0 Emask 0x9 (media error)
Oct 31 15:46:01 r5-server kernel: ata4.00: status: { DRDY ERR }
Oct 31 15:46:01 r5-server kernel: ata4.00: error: { UNC }
Oct 31 15:46:01 r5-server kernel: ata4.00: configured for UDMA/133
Oct 31 15:46:01 r5-server kernel: sd 4:0:0:0: [sde] Unhandled sense code
Oct 31 15:46:01 r5-server kernel: sd 4:0:0:0: [sde]
Oct 31 15:46:01 r5-server kernel: Result: hostbyte=0x00 driverbyte=0x08
Oct 31 15:46:01 r5-server kernel: sd 4:0:0:0: [sde]
Oct 31 15:46:01 r5-server kernel: Sense Key : 0x3 [current] [descriptor]
Oct 31 15:46:01 r5-server kernel: Descriptor sense data with sense descriptors (in hex):
Oct 31 15:46:01 r5-server kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Oct 31 15:46:01 r5-server kernel: bb b9 21 48
Oct 31 15:46:01 r5-server kernel: sd 4:0:0:0: [sde]
Oct 31 15:46:01 r5-server kernel: ASC=0x11 ASCQ=0x4
Oct 31 15:46:01 r5-server kernel: sd 4:0:0:0: [sde] CDB:
Oct 31 15:46:01 r5-server kernel: cdb[0]=0x28: 28 00 bb b9 1f 78 00 04 00 00
Oct 31 15:46:01 r5-server kernel: end_request: I/O error, dev sde, sector 3149472072
Oct 31 15:46:01 r5-server kernel: ata4: EH complete
Oct 31 15:46:01 r5-server kernel: md: disk2 read error, sector=3149472008
Oct 31 15:46:01 r5-server kernel: md: multiple disk errors, sector=3149472008
...<SNIP>...
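
For anyone reading the log above, the hex bytes and the decimal sector numbers describe the same location. The sketch below decodes them in Python, using only values copied from the log. The helper name be_bytes_to_int is made up for illustration, and reading the 64-sector gap between the sde sector and the md sector as a partition offset is an assumption; the log itself does not say so.

```python
def be_bytes_to_int(hex_bytes: str) -> int:
    """Interpret space-separated hex bytes as one big-endian integer."""
    return int.from_bytes(bytes.fromhex(hex_bytes.replace(" ", "")), "big")


SECTOR_SIZE = 512  # logical sector size, as reported by smartctl for this drive

# Last four bytes of the descriptor sense data: the LBA that could not be read.
failed_lba = be_bytes_to_int("bb b9 21 48")

# READ(10) CDB: bytes 2-5 hold the starting LBA, bytes 7-8 the sector count.
cdb = "28 00 bb b9 1f 78 00 04 00 00".split()
start_lba = be_bytes_to_int(" ".join(cdb[2:6]))
sector_count = be_bytes_to_int(" ".join(cdb[7:9]))

md_sector = 3149472008  # from the "md: disk2 read error" line

print("failed LBA:       ", failed_lba)                  # 3149472072
print("request start LBA:", start_lba)                   # 3149471608
print("request size:     ", sector_count * SECTOR_SIZE)  # 524288 bytes
print("offset in request:", failed_lba - start_lba)      # 464 sectors
# Assumption: this gap is the start of the data partition on the disk.
print("sde minus md:     ", failed_lba - md_sector)      # 64 sectors
```

The decoded LBA matches the "I/O error, dev sde, sector 3149472072" line, and the request size matches the "dma 524288 in" field of the failed command.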
November 3, 2014 Author
Sorry, I looked at the wrong line. sde is the one throwing errors:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.16.3-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-11J99B1
Serial Number:    WD-WCAWZ0931907
LU WWN Device Id: 5 0014ee 2060bcdac
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Nov 3 16:13:27 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
                                        was suspended by an interrupting command from host.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (35760) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 345) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       14
  3 Spin_Up_Time            0x0027   143   133   021    Pre-fail  Always       -       9816
  4 Start_Stop_Count        0x0032   090   090   000    Old_age   Always       -       10661
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       23344
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       54
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       34
193 Load_Cycle_Count        0x0032   139   139   000    Old_age   Always       -       185097
194 Temperature_Celsius     0x0022   130   079   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       24
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
November 3, 2014
Here is your issue:

197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 24

If you have a spare precleared drive, you can try rebuilding onto that. Pending sectors only go away during a write operation. There's a chance the drive could be saved in a last-ditch effort by writing to those sectors, but they have to be identified first, and it's a long and tedious process: using ddrescue to copy the drive to another one, then running badblocks in read/write mode, or something like SpinRite.

For example, I have a drive with pending sectors. I've been running badblocks in read/write test mode for 145 hours and it's only 60% of the way through a 2TB drive. My drive happens to hold no important data, which is why I'm doing this test.

You may want to take the array offline and issue a SMART long test so it can scan the whole hard drive. This could take over 10 hours; smartctl will give you an ETA for when it should be complete.

Rebuilding onto a spare drive would be prudent. It might also be worth reviewing what was updated on that drive in the past few days.
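
To put the durations above in perspective, here is a rough back-of-the-envelope calculation in Python. It extrapolates the badblocks run linearly, which assumes throughput stays constant (a drive with weak sectors may well slow down), and it converts the 345-minute extended self-test polling time from the SMART report posted earlier into hours. The helper estimate_total_hours is hypothetical and not part of badblocks or smartctl.

```python
def estimate_total_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Linear extrapolation: total runtime if progress continues at the same rate."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


# badblocks read/write run: 145 hours elapsed at 60% of a 2TB drive.
total = estimate_total_hours(145, 0.60)
remaining = total - 145
print(f"badblocks: about {total:.0f} h in total, about {remaining:.0f} h still to go")

# Extended self-test polling time from the posted SMART report, in minutes.
long_test_hours = 345 / 60
print(f"SMART long test: about {long_test_hours:.2f} h according to the drive")
```

The drive's own figure is for an otherwise idle disk, so a test that competes with array activity can be expected to run longer.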
November 3, 2014 Author
Thank you. I will shut it down, replace the 2TB drive with a precleared one, and let the system reconstruct the disk. Do you think it's worth trying to rescue the faulty 2TB drive afterwards (with badblocks or SpinRite)?
November 3, 2014
A regular preclear of a few passes with Joe L.'s script may get those pending sectors reallocated, which is what you want here. I happen to use badblocks manually and it works for me. Here's a recent post with examples:
http://lime-technology.com/forum/index.php?topic=32564.msg335021#msg335021
You still need to run Joe L.'s script afterwards to put the preclear signature on the drive.
November 10, 2014 Author
Hi WeeboTech, I installed a new precleared HDD as a replacement for the old one, and unRAID is running fine. After that I connected the old HDD with the pending sectors and ran a preclear with Joe's script. How many times should I do this? After one normal run, smartctl --all showed the same output, with a value of 200 for Current_Pending_Sector. What exactly should I do to try to rescue the HDD? I wanted to keep it as a spare for emergencies. Or is that not a good idea?
November 10, 2014
Any disk where the value for Pending Sectors does not go down to zero after a pre-clear cycle is unreliable and should not be used with unRAID.
November 10, 2014
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 24

The SMART report you provided shows 24 pending sectors. The count is the raw value in the last column, not the 200 in the VALUE column.

If you run the preclear a few times and that number does not go to 0, the drive is not a good candidate for unRAID. If pending sectors go to 0 and reallocated sectors go from 0 to 24, i.e. the pending sectors have been reallocated, you can use the drive.

Pending sectors need to be 0. Otherwise the drive could fail during a recovery effort.
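
Since the VALUE column and the raw count are easy to mix up, here is a minimal Python sketch that pulls the raw values out of the attribute rows and applies the "pending sectors must be 0" rule from this post. The sample text is copied from the report earlier in the thread; the function names raw_values and pending_cleared are made up for illustration, and the regular expression only assumes the ten-column row layout shown above with a plain integer in the last column.

```python
import re

# One attribute row: ID, name, flag, VALUE, WORST, THRESH, type, updated,
# when_failed, and finally the raw value (assumed here to be a plain integer).
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(\d+)"
)


def raw_values(report: str) -> dict:
    """Map attribute name -> raw value for every attribute row in the text."""
    values = {}
    for line in report.splitlines():
        match = ROW.match(line)
        if match:
            values[match.group(2)] = int(match.group(6))
    return values


def pending_cleared(report: str) -> bool:
    """True only if Current_Pending_Sector is present and its raw value is 0."""
    return raw_values(report).get("Current_Pending_Sector") == 0


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       24
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

print(raw_values(SAMPLE)["Current_Pending_Sector"])  # 24, not 200
print(pending_cleared(SAMPLE))                       # False
```

A report with no Current_Pending_Sector row is treated as "not cleared", which errs on the side of not trusting the drive.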
November 13, 2014 Author
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       14
  3 Spin_Up_Time            0x0027   144   133   021    Pre-fail  Always       -       9775
  4 Start_Stop_Count        0x0032   090   090   000    Old_age   Always       -       10668
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       23575
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       58
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       34
193 Load_Cycle_Count        0x0032   139   139   000    Old_age   Always       -       185120
194 Temperature_Celsius     0x0022   122   079   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       72
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     23517         -
# 2  Extended offline    Completed without error       00%     23496         -

After two runs of preclear I checked SMART twice and the pending sectors were at 0, but the Reallocated_Sector_Ct didn't go up. So is the drive good for later use in an emergency, or in a production environment?
November 13, 2014
The only concerning thing I see is the 72 for Offline_Uncorrectable:

198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 72

According to Wikipedia, this attribute indicates "the total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem."

I am not used to seeing offline uncorrectables without reallocated sectors, which would give you some confidence that the uncorrectable errors are on sectors that have been taken out of service. But I have seen numerous cases where pending sectors spontaneously go away with no corresponding reallocation. Surprisingly, in my experience, disks that do this have not tended to give more trouble.

You've already run a SMART long test, which would have been my first suggestion. You could continue to run preclear for a few more cycles, looking carefully for new signs of SMART issues. If it seems to be solid, I (personally) would use it in my array, but would monitor it closely, using it heavily for reading and writing and watching the SMART reports. Treat it as if it is on probation: one more hiccup and I'd replace it. But I think you have a good chance of the disk behaving itself.
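
One way to do that kind of probation monitoring is to keep a baseline of the raw counters and compare each new report against it. The Python sketch below uses the raw values from the two reports posted in this thread (November 3 and November 13). The choice of watched attributes and the function names changed_counters and still_on_probation are assumptions for illustration, not an unRAID or smartmontools feature.

```python
# Counters that should not rise on a drive that is "on probation".
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

# Raw values copied from the two SMART reports in this thread.
nov_3 = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 24,
         "Offline_Uncorrectable": 0, "Power_On_Hours": 23344}
nov_13 = {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0,
          "Offline_Uncorrectable": 72, "Power_On_Hours": 23575}


def changed_counters(before: dict, after: dict) -> dict:
    """Return {name: (old, new)} for every watched counter that changed."""
    return {name: (before[name], after[name])
            for name in WATCHED if before[name] != after[name]}


def still_on_probation(baseline: dict, latest: dict) -> bool:
    """True if no watched counter has risen above its baseline value."""
    return all(latest[name] <= baseline[name] for name in WATCHED)


print(changed_counters(nov_3, nov_13))
# {'Current_Pending_Sector': (24, 0), 'Offline_Uncorrectable': (0, 72)}
print(nov_13["Power_On_Hours"] - nov_3["Power_On_Hours"])  # 231 hours between reports

# With November 13 as the new baseline, any later rise is the "one more hiccup".
later = dict(nov_13, Current_Pending_Sector=1)  # hypothetical future report
print(still_on_probation(nov_13, later))        # False
```

Taking the November 13 report as the baseline accepts the existing 72 as a known condition and flags only new growth.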
November 13, 2014
Consider this link also:
http://daemon-notes.com/articles/system/smartmontools/offline-uncorrectable

Offline_Uncorrectable is the number of sectors that the drive has attempted to correct itself, but failed. Running the offline self-test should cause the drive to test the sectors and attempt to fix them. Not all drives support this though.

I've never attempted this. I usually run the drive through badblocks for a few pattern tests.