1st time Errors on HDD - what should i do - the right steps? - General Support (V5 and Older)

November 3, 2014

Hi all,

For the first time, I'm seeing errors on the main page for one HDD: 1467 so far.

What should I do to avoid data loss? Is there a check I can run, or should I shut the server down and replace the HDD?

Please help, or point me to some how-tos. It's a production 12 TB server.

Thanks in advance!

November 3, 2014

Post a syslog and a SMART log for the drive in question.

November 3, 2014

Author

Thanks for the hint. Here are the logs, and thanks for the quick reply.

logs.zip

November 3, 2014

One of the disks is having issues. It looks to be sde; the .zip has the SMART report for sdc, which shows no errors.

Please post the SMART report for sde.

Oct 31 11:30:33 r5-server kernel: mdcmd (3): import 2 8,64 1953514552 WDC_WD20EARS-11J99B1_WD-WCAWZ0931907
Oct 31 11:30:33 r5-server kernel: md: import disk2: [8,64] (sde) WDC_WD20EARS-11J99B1_WD-WCAWZ0931907 size: 1953514552

Oct 31 15:46:01 r5-server kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 31 15:46:01 r5-server kernel: ata4.00: irq_stat 0x40000001
Oct 31 15:46:01 r5-server kernel: ata4.00: failed command: READ DMA EXT
Oct 31 15:46:01 r5-server kernel: ata4.00: cmd 25/00:00:78:1f:b9/00:04:bb:00:00/e0 tag 5 dma 524288 in
Oct 31 15:46:01 r5-server kernel:         res 51/40:2f:48:21:b9/00:02:bb:00:00/e0 Emask 0x9 (media error)
Oct 31 15:46:01 r5-server kernel: ata4.00: status: { DRDY ERR }
Oct 31 15:46:01 r5-server kernel: ata4.00: error: { UNC }
Oct 31 15:46:01 r5-server kernel: ata4.00: configured for UDMA/133
Oct 31 15:46:01 r5-server kernel: sd 4:0:0:0: [sde] Unhandled sense code
Oct 31 15:46:01 r5-server kernel: sd 4:0:0:0: [sde]  
Oct 31 15:46:01 r5-server kernel: Result: hostbyte=0x00 driverbyte=0x08
Oct 31 15:46:01 r5-server kernel: sd 4:0:0:0: [sde]  
Oct 31 15:46:01 r5-server kernel: Sense Key : 0x3 [current] [descriptor]
Oct 31 15:46:01 r5-server kernel: Descriptor sense data with sense descriptors (in hex):
Oct 31 15:46:01 r5-server kernel:        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Oct 31 15:46:01 r5-server kernel:        bb b9 21 48 
Oct 31 15:46:01 r5-server kernel: sd 4:0:0:0: [sde]  
Oct 31 15:46:01 r5-server kernel: ASC=0x11 ASCQ=0x4
Oct 31 15:46:01 r5-server kernel: sd 4:0:0:0: [sde] CDB: 
Oct 31 15:46:01 r5-server kernel: cdb[0]=0x28: 28 00 bb b9 1f 78 00 04 00 00
Oct 31 15:46:01 r5-server kernel: end_request: I/O error, dev sde, sector 3149472072
Oct 31 15:46:01 r5-server kernel: ata4: EH complete
Oct 31 15:46:01 r5-server kernel: md: disk2 read error, sector=3149472008
Oct 31 15:46:01 r5-server kernel: md: multiple disk errors, sector=3149472008
...<SNIP>...
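For a long syslog it can help to pull the failing sectors out automatically instead of reading every line. The following is only a minimal sketch, assuming the kernel and md driver messages look like the ones in the excerpt above; the function name `collect_read_errors` and the sample list are made up for illustration. Note that the last four sense bytes in the excerpt (`bb b9 21 48`) are the same sector number in hex.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   end_request: I/O error, dev sde, sector 3149472072
IO_ERROR = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")
# Matches md driver lines such as:
#   md: disk2 read error, sector=3149472008
MD_ERROR = re.compile(r"md: (disk\d+) read error, sector=(\d+)")


def collect_read_errors(lines):
    """Return {device or md slot: sorted list of failing sectors}."""
    errors = defaultdict(set)
    for line in lines:
        for pattern in (IO_ERROR, MD_ERROR):
            match = pattern.search(line)
            if match:
                errors[match.group(1)].add(int(match.group(2)))
    return {name: sorted(sectors) for name, sectors in errors.items()}


# Lines copied from the syslog excerpt above.
SAMPLE = """\
Oct 31 15:46:01 r5-server kernel: end_request: I/O error, dev sde, sector 3149472072
Oct 31 15:46:01 r5-server kernel: ata4: EH complete
Oct 31 15:46:01 r5-server kernel: md: disk2 read error, sector=3149472008
Oct 31 15:46:01 r5-server kernel: md: multiple disk errors, sector=3149472008
""".splitlines()

if __name__ == "__main__":
    for name, sectors in collect_read_errors(SAMPLE).items():
        print(name, sectors)
    # The sense data bytes "bb b9 21 48" decode to the same LBA.
    print(int("bbb92148", 16))
```

Running it over the whole syslog (for example with `open("syslog").read().splitlines()`) would show whether the errors cluster in one region of the disk or are spread out.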

November 3, 2014

Author

Sorry, I picked the wrong line. sde is the one throwing errors:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.16.3-unRAID] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF)
Device Model:     WDC WD20EARS-11J99B1
Serial Number:    WD-WCAWZ0931907
LU WWN Device Id: 5 0014ee 2060bcdac
Firmware Version: 80.00A80
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Nov  3 16:13:27 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		(35760) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 345) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       14
  3 Spin_Up_Time            0x0027   143   133   021    Pre-fail  Always       -       9816
  4 Start_Stop_Count        0x0032   090   090   000    Old_age   Always       -       10661
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       23344
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       54
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       34
193 Load_Cycle_Count        0x0032   139   139   000    Old_age   Always       -       185097
194 Temperature_Celsius     0x0022   130   079   000    Old_age   Always       -       22
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       24
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

November 3, 2014

Here is your issue.

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       24
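If you want to check these counters without eyeballing the table, something like the following could work. It is a rough sketch, assuming the attribute table has the ten-column layout shown in the report above; `parse_smart_attributes` and `WATCHED` are names I made up. The number that matters is RAW_VALUE in the last column, not the normalized VALUE of 200.

```python
def parse_smart_attributes(report):
    """Map attribute name -> raw value (int) from smartctl attribute rows."""
    raw_values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have ten columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            name, raw = fields[1], fields[9]
            if raw.isdigit():
                raw_values[name] = int(raw)
    return raw_values


# Rows copied from the SMART report posted in this thread.
REPORT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       24
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

# The three counters discussed in this thread.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

if __name__ == "__main__":
    attributes = parse_smart_attributes(REPORT)
    for name in WATCHED:
        print(f"{name}: {attributes[name]}")
```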

If you have a spare precleared drive, you can try rebuilding on that.

Pending sectors only go away when they are rewritten.

There's a chance the drive could be saved, as a last-ditch effort, by writing to those sectors, but they have to be identified first.

It's a long and tedious process: use ddrescue to copy the drive to another one, then run badblocks in read/write mode, or something like SpinRite.

For example, I have a drive with pending sectors.

I've been running badblocks in read/write test mode for 145 hours and it's only 60% of the way through for a 2tb drive.
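To put those numbers in perspective, here is a back-of-the-envelope estimate of the full run. It assumes progress is linear, which is only an approximation (throughput usually changes across the platter and bad regions slow things down); `estimate_total_hours` is a made-up helper name.

```python
def estimate_total_hours(elapsed_hours, fraction_done):
    """Extrapolate total run time, assuming progress is linear in time."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done


if __name__ == "__main__":
    elapsed = 145          # hours so far, from the post above
    total = estimate_total_hours(elapsed, 0.60)
    remaining = total - elapsed
    # Roughly 242 hours in total, so about 97 hours (4 days) still to go.
    print(f"estimated total: {total:.0f} h, remaining: {remaining:.0f} h")
```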

My drive happens to have no important data (which is why I'm doing this test).

You may want to take the array offline and start a SMART long test so it can scan the whole drive.

This could take over 10 hours. smartctl will tell you how long with an ETA of when it should be complete.
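The drive's own estimate is also in the capabilities section of the SMART report. For the WD20EARS report posted above it is 345 minutes, so a bit under 6 hours if the disk is otherwise idle. A small sketch for reading that figure and turning it into a finish time, assuming the two-line layout shown in that report (`extended_test_minutes` and `expected_finish` are made-up helper names):

```python
import re
from datetime import datetime, timedelta

# smartctl prints the label and the value on two separate lines.
POLLING = re.compile(
    r"Extended self-test routine\s+recommended polling time:\s*\(\s*(\d+)\)\s*minutes"
)


def extended_test_minutes(capabilities_text):
    """Return the drive's estimate for the long test in minutes, or None."""
    match = POLLING.search(capabilities_text)
    return int(match.group(1)) if match else None


def expected_finish(start, minutes):
    """Return the time at which the test should be done if started at `start`."""
    return start + timedelta(minutes=minutes)


# Copied from the SMART report in this thread.
SAMPLE = (
    "Extended self-test routine\n"
    "recommended polling time: \t ( 345) minutes.\n"
)

if __name__ == "__main__":
    minutes = extended_test_minutes(SAMPLE)
    start = datetime(2014, 11, 3, 16, 13)  # "Local Time" from the report
    print(minutes, expected_finish(start, minutes))
```

Treat the result as a lower bound: any other disk activity during the test will stretch it.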

Rebuilding onto a spare drive would be prudent. It would also be worth reviewing what was written to that drive in the past few days.

November 3, 2014

Author

Thank you. I will shut it down, replace the 2 TB drive with a precleared one, and let the system reconstruct the disk.

Do you think it's worth trying to rescue the faulty 2 TB drive afterwards (with badblocks or SpinRite)?

November 3, 2014

A regular preclear of a few passes with Joe L.'s script may be enough to get those pending sectors reallocated.

I happen to use badblocks manually and it works for me.

Here's a recent post with examples.

http://lime-technology.com/forum/index.php?topic=32564.msg335021#msg335021

You still need to use Joe L.'s script afterwards to put the signature on the drive.

November 10, 2014

Author

Hi WeeboTech,

I installed a new precleared HDD as a replacement for the old one, and unRAID is running fine. After that I connected the old HDD with the pending sectors and ran a preclear with Joe's script. How many times should I do this? After one normal run, smartctl --all showed the same output, with a value of 200 for Current_Pending_Sector.

What exactly should I do to try to rescue the HDD? I wanted to keep it as a spare for emergencies.

Or is that not a good idea?

November 10, 2014

Any disk where the value for Pending Sectors does not go down to zero after a pre-clear cycle is unreliable and should not be used with unRAID.

November 10, 2014

197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       24

There are 24 pending sectors in the SMART report provided. That is the RAW_VALUE in the last column; the 200 is only the normalized value.

If you run the preclear a few times and that number does not go to 0, then the drive is not a good candidate for unRAID.

If pending sectors go to 0 and reallocated sectors go from 0 to 24, i.e. the pending sectors are reallocated, you can use the drive.

Pending sectors need to be 0.

Otherwise the drive can possibly fail during a recovery effort.
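The rule above can be written down as a small check on the raw values before and after the preclear runs. This is only a sketch of that rule of thumb, not anything unRAID itself does; `preclear_verdict` is a made-up name, and the third outcome (pending cleared with no reallocation) is my own addition for a case the rule does not cover.

```python
def preclear_verdict(before, after):
    """Compare raw SMART counters from before and after preclear."""
    if after["Current_Pending_Sector"] > 0:
        return "reject: pending sectors remain"
    moved = after["Reallocated_Sector_Ct"] - before["Reallocated_Sector_Ct"]
    if moved > 0:
        return f"usable: {moved} sectors reallocated"
    # Assumption: not covered by the rule above, so flag it for a closer look.
    return "usable with caution: pending cleared without reallocation"


if __name__ == "__main__":
    before = {"Current_Pending_Sector": 24, "Reallocated_Sector_Ct": 0}
    # The case described above: all 24 pending sectors get reallocated.
    print(preclear_verdict(before, {"Current_Pending_Sector": 0, "Reallocated_Sector_Ct": 24}))
    # Preclear finished but the pending count did not change.
    print(preclear_verdict(before, {"Current_Pending_Sector": 24, "Reallocated_Sector_Ct": 0}))
```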

November 13, 2014

Author

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       14
  3 Spin_Up_Time            0x0027   144   133   021    Pre-fail  Always       -       9775
  4 Start_Stop_Count        0x0032   090   090   000    Old_age   Always       -       10668
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   068   068   000    Old_age   Always       -       23575
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       58
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       34
193 Load_Cycle_Count        0x0032   139   139   000    Old_age   Always       -       185120
194 Temperature_Celsius     0x0022   122   079   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       72
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     23517         -
# 2  Extended offline    Completed without error       00%     23496         -

After two preclear runs I checked SMART twice, and the pending sectors were at 0.

But the Reallocated_Sector_Ct didn't go up.

So is the drive good for later use in an emergency, or in a production environment?

November 13, 2014

The only concerning thing I see is the 72 for Offline_Uncorrectable.

According to Wikipedia, this attribute indicates ...

The total count of uncorrectable errors when reading/writing a sector. A rise in the value of this attribute indicates defects of the disk surface and/or problems in the mechanical subsystem.

I am not used to seeing Offline_Uncorrectable counts without reallocated sectors, which would give you some confidence that the uncorrectable errors are on sectors that have been taken out of service. But I have seen numerous cases where pending sectors spontaneously go away with no corresponding reallocation. Surprisingly, in my experience, disks that do this have not tended to give more trouble.

You've already run a smart long test, which would have been my first suggestion. You could continue to run preclear for a few more cycles carefully looking for new signs of smart issues. If it seems to be solid, I (personally) would use it in my array, but would monitor it closely, using it heavily for reading and writing and monitoring the smart reports. Treat it as if it is on probation, and one more hiccup and I'd replace it. But I think you have a good chance of the disk behaving itself.
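One way to keep a drive "on probation" is to save the raw counters after every preclear cycle or heavy workload and flag any that went up. A minimal sketch, assuming you collect the values yourself as plain dictionaries; `find_regressions` and `WATCHED` are made-up names, and the two snapshots are the reports posted in this thread.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def find_regressions(snapshots):
    """Return (index, name, old, new) for every watched counter that rose."""
    regressions = []
    for i in range(1, len(snapshots)):
        previous, current = snapshots[i - 1], snapshots[i]
        for name in WATCHED:
            old, new = previous.get(name, 0), current.get(name, 0)
            if new > old:
                regressions.append((i, name, old, new))
    return regressions


# Raw values from the November 3 and November 13 reports in this thread.
HISTORY = [
    {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 24, "Offline_Uncorrectable": 0},
    {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0, "Offline_Uncorrectable": 72},
]

if __name__ == "__main__":
    for index, name, old, new in find_regressions(HISTORY):
        print(f"snapshot {index}: {name} rose from {old} to {new}")
```

With the two reports from this thread it flags only the Offline_Uncorrectable jump; any further entry after that would count as the "one more hiccup".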

November 13, 2014

Consider this link also.

http://daemon-notes.com/articles/system/smartmontools/offline-uncorrectable

Offline_Uncorrectable is the number of sectors that the drive has attempted to correct itself, but failed. Running the offline self-test should cause the drive to test the sectors and attempt to fix them. Not all drives support this though.

I've never attempted this. I usually run the drive through badblocks for a few pattern tests.
