
[Third episode] Red-balled Hitachi 2Tb (5b10) - see end of thread



Posted

For details of this latest (third!) event, see http://lime-technology.com/forum/index.php?topic=17023.msg157208#msg157208 (in this same thread).

 

Yesterday evening, I began receiving trouble emails from my server.  On checking, I found that drive 5 (a Hitachi 2Tb with about 1Tb used) was red, with the following messages repeating almost constantly through my syslog:

 

Dec  1 23:36:59 Tower kernel: md: disk224: ATA_OP 5 ioctl error: -5

Dec  1 23:37:09 Tower emhttp: mdcmd: write: Input/output error

Dec  1 23:37:09 Tower kernel: mdcmd (5425): spindown 5

Dec  1 23:37:09 Tower kernel: md: disk224: ATA_OP 5 ioctl error: -5

Dec  1 23:37:18 Tower emhttp: mdcmd: write: Input/output error

Dec  1 23:37:18 Tower kernel: mdcmd (5426): spindown 5

Dec  1 23:37:18 Tower kernel: md: disk224: ATA_OP 5 ioctl error: -5

Dec  1 23:37:18 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO

Dec  1 23:37:18 Tower last message repeated 3 times

Dec  1 23:37:19 Tower emhttp: mdcmd: write: Input/output error
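A note for anyone reading the log above: the "-5" in the ioctl lines looks like a negated Linux errno, which would make it the same "Input/output error" that emhttp reports. Here is a minimal Python sketch of that lookup, assuming the kernel's usual negative-errno convention applies to this message (the function name is mine, not anything from unRAID):

```python
import errno
import os
import re

def explain_ioctl_error(line):
    """Return (code, symbolic name, message) for an 'ioctl error: -N' syslog line."""
    match = re.search(r"ioctl error: -(\d+)", line)
    if match is None:
        return None
    code = int(match.group(1))
    # errno.errorcode maps numbers to symbolic names such as 'EIO'.
    return code, errno.errorcode.get(code, "unknown"), os.strerror(code)

line = "Dec  1 23:36:59 Tower kernel: md: disk224: ATA_OP 5 ioctl error: -5"
print(explain_ioctl_error(line))
```

So the kernel line and the emhttp line appear to be two views of the same failed request, not two separate problems.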

 

The array successfully completed the first-of-the-month check earlier the same day....FWIW

 

Any advice would be appreciated.

Thanks!

Posted

Thanks for responding.

 

I read that at 1am last night, but I can't tell from the logs whether I have a failed drive or a failed controller, or, if you read through the text far enough, possibly neither: "Some times, it is just a disk controller that went offline, making it impossible to access the drive."

 

So I figured best to ask before I start trying things and potentially adding complications to a situation I already don't understand.

Posted

I turned the server back on to see if I could get a smart report.  The Main tab tells me 'the array will be unprotected' if I do this.

 

Is that normal?  (Possibly because the system's already simulating the failed disk 5?)

Posted

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     Hitachi HDS5C3020ALA632
Serial Number:    ML2220F30Z3MPE
Firmware Version: ML6OA580
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Dec  3 08:54:42 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (21608) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				No Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
 2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       94
 3 Spin_Up_Time            0x0007   136   136   024    Pre-fail  Always       -       404 (Average 402)
 4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       117
 5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
 8 Seek_Time_Performance   0x0005   146   146   020    Pre-fail  Offline      -       29
 9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       2222
10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       6
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       117
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       117
194 Temperature_Celsius     0x0002   253   253   000    Old_age   Always       -       18 (Min/Max 15/39)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       6

SMART Error Log Version: 1
ATA Error Count: 6 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 6 occurred at disk power-on lifetime: 2056 hours (85 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 40 50 9c 1c 02  Error: ICRC, ABRT 64 sectors at LBA = 0x021c9c50 = 35429456

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 35 00 40 4f 9c 1c e2 ff   5d+17:28:21.091  WRITE DMA EXT
 35 00 00 90 9b 1c e0 08   5d+17:28:21.090  WRITE DMA EXT
 25 00 80 40 ac 1c e0 08   5d+17:28:21.089  READ DMA EXT
 25 00 00 40 a8 1c e0 08   5d+17:28:21.087  READ DMA EXT
 25 00 00 40 a4 1c e0 08   5d+17:28:21.085  READ DMA EXT

Error 5 occurred at disk power-on lifetime: 2056 hours (85 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 d9 67 0d 1c 02  Error: ICRC, ABRT 217 sectors at LBA = 0x021c0d67 = 35392871

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 35 00 f8 48 0c 1c e0 08   5d+17:28:18.415  WRITE DMA EXT
 35 00 00 48 08 1c e0 08   5d+17:28:18.413  WRITE DMA EXT
 35 00 00 48 04 1c e0 08   5d+17:28:18.410  WRITE DMA EXT
 35 00 00 48 00 1c e0 08   5d+17:28:18.408  WRITE DMA EXT
 35 00 08 30 da 1b e0 08   5d+17:28:18.408  WRITE DMA EXT

Error 4 occurred at disk power-on lifetime: 2056 hours (85 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 c8 e0 c9 1b 02  Error: ICRC, ABRT 200 sectors at LBA = 0x021bc9e0 = 35375584

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 35 00 d8 d0 c9 1b e0 08   5d+17:28:17.507  WRITE DMA EXT
 35 00 08 c8 c9 1b e0 08   5d+17:28:17.506  WRITE DMA EXT
 35 00 08 c0 c9 1b e0 08   5d+17:28:17.506  WRITE DMA EXT
 35 00 08 b8 c9 1b e0 08   5d+17:28:17.503  WRITE DMA EXT
 35 00 08 b0 c9 1b e0 08   5d+17:28:17.502  WRITE DMA EXT

Error 3 occurred at disk power-on lifetime: 2056 hours (85 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 a0 c0 65 16 02  Error: ICRC, ABRT 160 sectors at LBA = 0x021665c0 = 35022272

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 35 00 00 60 63 16 e0 08   5d+17:27:55.547  WRITE DMA EXT
 35 00 00 60 5f 16 e0 08   5d+17:27:55.545  WRITE DMA EXT
 35 00 00 60 5b 16 e0 08   5d+17:27:55.543  WRITE DMA EXT
 25 00 f0 70 6f 16 e0 08   5d+17:27:55.540  READ DMA EXT
 25 00 00 70 6b 16 e0 08   5d+17:27:55.536  READ DMA EXT

Error 2 occurred at disk power-on lifetime: 2056 hours (85 days + 16 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 20 58 88 8d 0d  Error: ICRC, ABRT 32 sectors at LBA = 0x0d8d8858 = 227379288

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 35 00 00 78 86 8d e0 08   5d+17:15:04.376  WRITE DMA EXT
 35 00 00 78 82 8d e0 08   5d+17:15:04.374  WRITE DMA EXT
 35 00 00 78 7e 8d e0 08   5d+17:15:04.372  WRITE DMA EXT
 35 00 00 78 7a 8d e0 08   5d+17:15:04.370  WRITE DMA EXT
 35 00 00 78 76 8d e0 08   5d+17:15:04.367  WRITE DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

Posted

The disk was taken off-line when a write to it failed.  It will not just restore itself to service, since unRAID knows it must be reconstructed to have the correct data.

 

The errors seem to be CRC (checksum) related.  That could be a bad cable, or a cable picking up induced noise from adjacent cabling.  (Is it bundled with other wires?  With power supply wires?)  It could be power supply related: a noisy or improperly regulated supply line could cause all kinds of electrical issues in a drive that is sensitive to poor regulation, or the power supply could be over its capacity.  It could be a bad port on the disk controller.  Or it could be a disk drive that is actually failing.
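To show where the "CRC related" reading comes from, here is a small Python sketch that decodes one register row from the SMART error log in the report. The bit names follow the abbreviations smartctl itself prints; the function name and the choice of which bits to list are mine, so treat it as an illustration and not as smartctl's own logic.

```python
# Bit names for the ATA error register, using smartctl's abbreviations.
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error during a UDMA transfer
    0x40: "UNC",   # uncorrectable data error
    0x10: "IDNF",  # requested sector ID not found
    0x04: "ABRT",  # command aborted
}

def decode_error_row(row):
    """Decode one 'ER ST SC SN CL CH DH' row from the SMART error log."""
    er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in row.split()[:7])
    flags = [name for bit, name in ERROR_BITS.items() if er & bit]
    # 28-bit LBA: the low nibble of DH holds bits 24-27, then CH, CL, SN.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {"flags": flags, "sectors": sc, "lba": lba}

row = "84 51 40 50 9c 1c 02  Error: ICRC, ABRT 64 sectors at LBA = 0x021c9c50 = 35429456"
print(decode_error_row(row))
```

Error register 0x84 is ICRC plus ABRT, with no UNC or IDNF bit set, which is why the log points at the link between controller and drive and not at the platters.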

 

At this point, since the drive seems to be responding to the smart commands you can:

stop the array

un-assign the failed drive

power down

re-seat the cables to that drive, being careful to not dislodge cables to the other drives

power up

start the array without the failed drive assigned.  It will emulate it, as it is now, but it will forget the model/serial number of the drive so it will think of it as its own replacement when you re-assign it next.

 

Then, stop the array

re-assign the failed drive

start the array once more. 

 

It will re-construct the failed drive based on parity and all the other disks.

 

If it succeeds, fine.  (It might have just been the cabling) 
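On the rebuild step: the idea of reconstructing a drive "based on parity and all the other disks" can be sketched in a few lines of Python. This assumes a single parity drive computed as a plain XOR across the data disks; the disk names and 4-byte contents are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical 4-byte "disks"; a real array does the same thing sector by sector.
disks = {
    "disk1": bytes([0x11, 0x22, 0x33, 0x44]),
    "disk5": bytes([0xDE, 0xAD, 0xBE, 0xEF]),
    "disk6": bytes([0x01, 0x02, 0x03, 0x04]),
}
parity = xor_blocks(list(disks.values()))

# Disk 5 drops out: rebuild it from parity plus every surviving disk.
survivors = [data for name, data in disks.items() if name != "disk5"]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == disks["disk5"])  # prints True
```

The same arithmetic shows why a second failure is fatal while one disk is already being emulated: with two unknowns, one parity value per position is no longer enough to solve for either.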

 

Posted

Machine was on, but the array wasn't started.

Selected drive and chose "no device"

Turned off the array- safely from menu.

Restarted machine.

Array isn't started.  Menu says "Start will bring the array on-line (array will be unprotected)."

Drive is still listed with red ball.  I expected it to be gone.  ?

REPEATED above.  It's still there.

??

Posted

Is it safe to start the array with disk 5 set to "no device" while the "start" button warns 'the array will be unprotected'

 

Despite setting the drive as instructed, I've cycled power to the system 3x and each time, the array doesn't start, but drive 5 IS still listed and red-balled....so the system doesn't appear to be forgetting anything...

 

 

Posted

"Is it safe to start the array with disk 5 set to "no device" while the "start" button warns 'the array will be unprotected'"

Yes.  It is safe.  Just do NOT use the management utility to set a new disk configuration.  That would invalidate parity.  The "unprotected" warning is accurate: at that point, if you were to lose a second drive, you would lose data.  That is why you want to get the failure resolved as soon as possible.

 

"Despite setting the drive as instructed, I've cycled power to the system 3x and each time, the array doesn't start, but drive 5 IS still listed and red-balled....so the system doesn't appear to be forgetting anything..."

It won't forget the disk model/serial number until you start the array without it.  At that time, all it will forget is the original model/serial number of the disk.  It will not forget the data, or that you have a disk5.  In fact, with the disk un-assigned you'll still be able to access the files on it as re-constructed by the other remaining disks and the parity drive.

 

 

Posted

Thank you, Joe.

 

Previously:  Following Joe's advice, I added the drive back, and after about 5 hours of rebuilding everything appeared normal again.

 

The drive red-balled again today.  This time, a smart report terminates, saying:

 

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)

Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

 

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

 

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

 

Ideas?

Posted

Without a syslog, it is hard to be definitive, but I can say generally that when a drive suddenly can't return a SMART report, and there is nothing physically wrong with the drive, a reboot will almost always restore communications to the drive, and enable SMART reports and normal operation again.  A syslog covering the faulty period might help to reveal what is causing the drive to lose contact.

 

Peter's advice is the standard recommendation for this kind of intermittent drive problem, especially with your recent evidence of CRC issues (and no sector issues).

Posted

Added controller card.

Moved drive to the card, abandoning port 6 on Mb.

Allowed system to rebuild it (again)

It's been running fine for a few days now, and I've since written a few hundred Gb to it.

 

Time will tell.  Thanks to everyone that responded.

Posted

This evening, the same drive red-balled again.  This time, in the middle of watching a standard def movie- nothing demanding.  I swear I'm really beginning to hate this Hitachi.

 

Recap:  This is the third failure event for this drive in about a month.  It's now on a different controller.  Its cable isn't bundled, and its connections seem absolutely solid.

 

The server sent the 'unraid ok' email this morning at 7am.  I've included an excerpt of the syslog starting just before the event, below.

 

The initial smart report is empty- as before.  (Prior to restarting server)

 

The second smart report is also pasted below.  It's more interesting.

Any advice appreciated...

First SMART report (prior to restarting server)

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

 

Second SMART report (after safely cycling power on server)

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model:     Hitachi HDS5C3020ALA632
Serial Number:    ML2220F30Z3MPE
Firmware Version: ML6OA580
User Capacity:    2,000,398,934,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Sat Dec 17 20:39:56 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)	Offline data collection activity
				was suspended by an interrupting command from host.
				Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
				without error or no self-test has ever 
				been run.
Total time to complete Offline 
data collection: 		 (21608) seconds.
Offline data collection
capabilities: 			 (0x5b) SMART execute Offline immediate.
				Auto Offline data collection on/off support.
				Suspend Offline collection upon new
				command.
				Offline surface scan supported.
				Self-test supported.
				No Conveyance Self-test supported.
				Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
				power-saving mode.
				Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
				General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 255) minutes.
SCT capabilities: 	       (0x003d)	SCT Status supported.
				SCT Error Recovery Control supported.
				SCT Feature Control supported.
				SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
 1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
 2 Throughput_Performance  0x0005   136   136   054    Pre-fail  Offline      -       94
 3 Spin_Up_Time            0x0007   137   137   024    Pre-fail  Always       -       400 (Average 400)
 4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       156
 5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
 7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
 8 Seek_Time_Performance   0x0005   146   146   020    Pre-fail  Offline      -       29
 9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       2542
10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       156
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       156
194 Temperature_Celsius     0x0002   193   193   000    Old_age   Always       -       31 (Min/Max 15/39)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       17

SMART Error Log Version: 1
ATA Error Count: 17 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 17 occurred at disk power-on lifetime: 2389 hours (99 days + 13 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 c8 b0 41 19 02  Error: ICRC, ABRT 200 sectors at LBA = 0x021941b0 = 35209648

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 35 00 d8 a0 41 19 e0 08   1d+12:07:53.812  WRITE DMA EXT
 35 00 e8 b8 40 19 e0 08   1d+12:07:53.812  WRITE DMA EXT
 35 00 08 b0 40 19 e0 08   1d+12:07:53.812  WRITE DMA EXT
 35 00 08 a8 40 19 e0 08   1d+12:07:53.811  WRITE DMA EXT
 35 00 28 80 3e 19 e0 08   1d+12:07:53.806  WRITE DMA EXT

Error 16 occurred at disk power-on lifetime: 2389 hours (99 days + 13 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 e8 48 25 18 02  Error: ICRC, ABRT 232 sectors at LBA = 0x02182548 = 35136840

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 35 00 18 18 24 18 e0 08   1d+12:07:48.659  WRITE DMA EXT
 35 00 e8 30 23 18 e0 08   1d+12:07:48.659  WRITE DMA EXT
 35 00 f0 40 22 18 e0 08   1d+12:07:48.658  WRITE DMA EXT
 35 00 08 38 22 18 e0 08   1d+12:07:48.658  WRITE DMA EXT
 35 00 08 30 22 18 e0 08   1d+12:07:48.658  WRITE DMA EXT

Error 15 occurred at disk power-on lifetime: 2389 hours (99 days + 13 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 78 90 7a 13 02  Error: ICRC, ABRT 120 sectors at LBA = 0x02137a90 = 34830992

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 35 00 e8 20 7a 13 e0 08   1d+12:07:16.655  WRITE DMA EXT
 35 00 f8 28 79 13 e0 08   1d+12:07:16.655  WRITE DMA EXT
 35 00 08 20 79 13 e0 08   1d+12:07:16.654  WRITE DMA EXT
 35 00 08 18 79 13 e0 08   1d+12:07:16.654  WRITE DMA EXT
 35 00 28 f0 76 13 e0 08   1d+12:07:16.649  WRITE DMA EXT

Error 14 occurred at disk power-on lifetime: 2389 hours (99 days + 13 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 70 a0 1e 0c 02  Error: ICRC, ABRT 112 sectors at LBA = 0x020c1ea0 = 34348704

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 35 00 00 10 1e 0c e0 08   1d+12:06:26.847  WRITE DMA EXT
 35 00 00 10 1a 0c e0 08   1d+12:06:26.845  WRITE DMA EXT
 35 00 00 10 16 0c e0 08   1d+12:06:26.843  WRITE DMA EXT
 35 00 00 10 12 0c e0 08   1d+12:06:26.840  WRITE DMA EXT
 35 00 00 10 0e 0c e0 08   1d+12:06:26.838  WRITE DMA EXT

Error 13 occurred at disk power-on lifetime: 2388 hours (99 days + 12 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 84 51 50 e0 65 4d 01  Error: ICRC, ABRT 80 sectors at LBA = 0x014d65e0 = 21849568

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 35 00 b0 80 65 4d e0 08   1d+11:44:36.042  WRITE DMA EXT
 35 00 18 68 64 4d e0 08   1d+11:44:36.042  WRITE DMA EXT
 35 00 18 50 63 4d e0 08   1d+11:44:36.041  WRITE DMA EXT
 35 00 08 48 63 4d e0 08   1d+11:44:36.041  WRITE DMA EXT
 35 00 08 40 63 4d e0 08   1d+11:44:36.041  WRITE DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
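Since this is the second full report in the thread, it may help to diff the attributes that matter against the Dec 3 report. Below is a rough Python sketch that pulls the raw values out of the attribute table and shows which watched attributes moved. The regex assumes the column layout shown in these two reports, and the function names and the watch list are my own choices.

```python
import re

# ID, name, flag, six columns we skip, then the raw value (first integer only).
ATTR_LINE = re.compile(r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-f]{4}\s+(?:\S+\s+){6}(\d+)")

def raw_values(report):
    """Map attribute name -> raw value for each attribute row in a report."""
    values = {}
    for line in report.splitlines():
        match = ATTR_LINE.match(line)
        if match:
            values[match.group(2)] = int(match.group(3))
    return values

def changed(before, after, names):
    """Return {name: (old, new)} for the watched attributes that moved."""
    return {n: (before[n], after[n]) for n in names if before[n] != after[n]}

# Three rows copied from each report in this thread.
dec_3 = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       6
"""
dec_17 = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       17
"""
WATCH = ["Reallocated_Sector_Ct", "Current_Pending_Sector", "UDMA_CRC_Error_Count"]
print(changed(raw_values(dec_3), raw_values(dec_17), WATCH))
```

Only UDMA_CRC_Error_Count moves (6 to 17); the reallocated and pending sector counts stay at 0 in both reports.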

 

 

System log (excerpt- was too large to post here)

Dec 17 17:39:55 Tower kernel: mdcmd (119): spindown 0
Dec 17 17:48:47 Tower kernel: mdcmd (120): spindown 1
Dec 17 17:49:49 Tower kernel: mdcmd (121): spindown 2
Dec 17 17:50:42 Tower kernel: mdcmd (122): spindown 3
Dec 17 17:51:12 Tower kernel: mdcmd (123): spindown 4
Dec 17 18:19:01 Tower crond[1111]: ignoring /var/spool/cron/crontabs/root- (non-existent user) 
Dec 17 18:30:44 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x6 frozen
Dec 17 18:30:44 Tower kernel: ata7: SError: { Handshk }
Dec 17 18:30:44 Tower kernel: ata7.00: failed command: READ DMA EXT
Dec 17 18:30:44 Tower kernel: ata7.00: cmd 25/00:b8:50:7d:aa/00:03:75:00:00/e0 tag 0 dma 487424 in
Dec 17 18:30:44 Tower kernel:          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 17 18:30:44 Tower kernel: ata7.00: status: { DRDY }
Dec 17 18:30:44 Tower kernel: ata7: hard resetting link
Dec 17 18:30:44 Tower kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Dec 17 18:30:44 Tower kernel: ata7.00: configured for UDMA/133
Dec 17 18:30:44 Tower kernel: ata7.00: device reported invalid CHS sector 0
Dec 17 18:30:44 Tower kernel: ata7: EH complete
Dec 17 18:38:18 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x6 frozen
Dec 17 18:38:18 Tower kernel: ata7: SError: { Handshk }
Dec 17 18:38:18 Tower kernel: ata7.00: failed command: READ DMA EXT
Dec 17 18:38:18 Tower kernel: ata7.00: cmd 25/00:e0:d0:a4:aa/00:00:75:00:00/e0 tag 0 dma 114688 in
Dec 17 18:38:18 Tower kernel:          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 17 18:38:18 Tower kernel: ata7.00: status: { DRDY }
Dec 17 18:38:18 Tower kernel: ata7: hard resetting link
Dec 17 18:38:18 Tower kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Dec 17 18:38:18 Tower kernel: ata7.00: configured for UDMA/133
Dec 17 18:38:18 Tower kernel: ata7.00: device reported invalid CHS sector 0
Dec 17 18:38:18 Tower kernel: ata7: EH complete
Dec 17 18:39:36 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x6 frozen
Dec 17 18:39:36 Tower kernel: ata7: SError: { Handshk }
Dec 17 18:39:36 Tower kernel: ata7.00: failed command: READ DMA EXT
Dec 17 18:39:36 Tower kernel: ata7.00: cmd 25/00:60:d0:b8:aa/00:00:75:00:00/e0 tag 0 dma 49152 in
Dec 17 18:39:36 Tower kernel:          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 17 18:39:36 Tower kernel: ata7.00: status: { DRDY }
Dec 17 18:39:36 Tower kernel: ata7: hard resetting link
Dec 17 18:39:46 Tower kernel: ata7: softreset failed (1st FIS failed)
Dec 17 18:39:46 Tower kernel: ata7: hard resetting link
Dec 17 18:39:56 Tower kernel: ata7: softreset failed (1st FIS failed)
Dec 17 18:39:56 Tower kernel: ata7: hard resetting link
Dec 17 18:40:31 Tower kernel: ata7: softreset failed (1st FIS failed)
Dec 17 18:40:31 Tower kernel: ata7: limiting SATA link speed to 1.5 Gbps
Dec 17 18:40:31 Tower kernel: ata7: hard resetting link
Dec 17 18:40:36 Tower kernel: ata7: softreset failed (1st FIS failed)
Dec 17 18:40:36 Tower kernel: ata7: reset failed, giving up
Dec 17 18:40:36 Tower kernel: ata7.00: disabled
Dec 17 18:40:36 Tower kernel: ata7.00: device reported invalid CHS sector 0
Dec 17 18:40:36 Tower kernel: ata7: EH complete
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg]  Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x28: 28 00 75 aa b8 d0 00 00 60 00
Dec 17 18:40:36 Tower kernel: end_request: I/O error, dev sdg, sector 1974122704
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg]  Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x28: 28 00 75 aa b9 b0 00 03 a0 00
Dec 17 18:40:36 Tower kernel: end_request: I/O error, dev sdg, sector 1974122928
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg]  Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x28: 28 00 75 aa bd 50 00 02 f8 00
Dec 17 18:40:36 Tower kernel: end_request: I/O error, dev sdg, sector 1974123856
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg]  Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x28: 28 00 75 9c 80 30 00 00 08 00
Dec 17 18:40:36 Tower kernel: end_request: I/O error, dev sdg, sector 1973190704
Dec 17 18:40:36 Tower kernel: md: disk5 read error
Dec 17 18:40:36 Tower kernel: handle_stripe read error: 1974122640/5, count: 1
Dec 17 18:40:36 Tower kernel: md: disk5 read error
Dec 17 18:40:36 Tower kernel: handle_stripe read error: 1974122648/5, count: 1
Dec 17 18:40:36 Tower kernel: md: disk5 read error
Dec 17 18:40:36 Tower kernel: handle_stripe read error: 1974122656/5, count: 1
Dec 17 18:40:36 Tower kernel: md: disk5 read error
Dec 17 18:40:36 Tower kernel: handle_stripe read error: 1974122664/5, count: 1

====this repeats MANY MANY TIMES=====

Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg]  Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x2a: 2a 00 75 aa b8 d0 00 00 08 00
Dec 17 18:40:46 Tower kernel: end_request: I/O error, dev sdg, sector 1974122704
Dec 17 18:40:46 Tower kernel: md: disk5 write error
Dec 17 18:40:46 Tower kernel: handle_stripe write error: 1974122640/5, count: 1
Dec 17 18:40:46 Tower kernel: md: recovery thread woken up ...
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg]  Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x2a: 2a 00 75 aa b8 d8 00 00 08 00
Dec 17 18:40:46 Tower kernel: end_request: I/O error, dev sdg, sector 1974122712
Dec 17 18:40:46 Tower kernel: md: disk5 write error
Dec 17 18:40:46 Tower kernel: handle_stripe write error: 1974122648/5, count: 1
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg]  Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x2a: 2a 00 75 aa b8 e0 00 00 08 00
Dec 17 18:40:46 Tower kernel: end_request: I/O error, dev sdg, sector 1974122720
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg]  Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x2a: 2a 00 75 aa b8 e8 00 00 08 00
Dec 17 18:40:46 Tower kernel: end_request: I/O error, dev sdg, sector 1974122728
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg]  Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x2a: 2a 00 75 aa b8 f0 00 00 40 00
Dec 17 18:40:46 Tower kernel: end_request: I/O error, dev sdg, sector 1974122736

==== I think you get the idea =====
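
For anyone cross-checking a log like this by hand: the hex bytes after `CDB:` carry the sector address, so each `end_request` line can be verified against its CDB. Below is a minimal sketch, assuming the standard 10-byte READ(10)/WRITE(10) layout (opcode 0x28 or 0x2a, logical block address in bytes 2 to 5, most significant byte first). The function name `cdb_lba` is made up for illustration and is not part of unRAID or any tool.

```shell
#!/bin/sh
# cdb_lba: hypothetical helper (assumption, not an existing tool).
# Takes the hex bytes of a 10-byte READ(10)/WRITE(10) CDB, as printed by the
# kernel, and prints the starting logical block address in decimal.
cdb_lba() {
    # Word-split the hex string into positional parameters $1..$10.
    set -- $1
    # Bytes 2..5 (counting from 0) are $3..$6 here; join them into one hex number.
    printf '%d\n' "0x$3$4$5$6"
}

cdb_lba "28 00 75 9c 80 30 00 00 08 00"   # the failed read at 18:40:36
cdb_lba "2a 00 75 aa b8 d0 00 00 08 00"   # the first failed write at 18:40:46
```

These print 1973190704 and 1974122704, which match the `end_request` sectors in the log. The md stripe number for that write (1974122640) is 64 lower, presumably the partition start offset. As far as I know, `hostbyte=0x04` is `DID_BAD_TARGET` in the Linux SCSI layer, meaning the kernel could no longer address the device at all, which is different from the drive reporting an unreadable sector.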

Posted

Have you changed the SATA cable?

Rerouted it?

Is this a newly added drive? It could be a PSU issue: sometimes, when you reach a certain number of drives, they intermittently go offline like this. (I had that happen when I upgraded to a 9th drive on a 600W PSU that was not single-rail.)

 

Also, I did not see a SMART short or long test in your report:

 

SMART Self-test log structure revision number 1

No self-tests have been logged.  [To run self-tests, use: smartctl -t]

 

After the array is in a clean state, set your go script so that it does not run emhttp.

Reboot.

 

Run a smart short test.

 

smartctl -t short /dev/sd?

 

wait 2 minutes, capture a smart log on your flash.

 

do a smart long test

 

smartctl -t long /dev/sd?

Wait as long as it says (usually something like 255 minutes or so), then capture a smart log again on your flash.

 

 

If you want you can do a badblocks readonly test

 

badblocks -vs -o /tmp/badblocks.out /dev/sd?

 

See if there are trouble spots.

 

 

 

If this were my drive, the first thing I would do is a SMART short test (smartctl -t short), since it's very fast.

After that I would consider the long test or swap out the sata cable.

 

Sometimes the contacts are not tight enough and the movement of the head vibrates the cable.
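
The short-then-long sequence described above can be sketched as a small script. This is only an illustration under stated assumptions: `DEV` must be replaced with the real device, `LOGDIR` is a made-up directory on the flash drive (unRAID mounts the flash at /boot), and the helper names `run`, `start_test` and `capture_log` are hypothetical. Only `smartctl -t short`, `smartctl -t long` and `smartctl -a` are real smartctl invocations.

```shell
#!/bin/sh
# Sketch only. DEV and LOGDIR are assumptions: substitute the real device
# (for example the sdg seen in the syslog) and a directory on the flash.
DEV="${DEV:-/dev/sdX}"
LOGDIR="${LOGDIR:-/boot/smartlogs}"

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# start_test TYPE: start a "short" or "long" self-test on $DEV.
start_test() {
    run smartctl -t "$1" "$DEV"
}

# capture_log LABEL: save a full SMART report to $LOGDIR/smart-LABEL.txt.
capture_log() {
    out="$LOGDIR/smart-$1.txt"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "smartctl -a $DEV > $out"
    else
        mkdir -p "$LOGDIR" && smartctl -a "$DEV" > "$out"
    fi
}

# Typical sequence (nothing runs until you call these yourself):
#   start_test short; sleep 120; capture_log short
#   start_test long;  # wait as long as smartctl says, then:
#   capture_log long
```

Setting `DRY_RUN=1` first shows what would be executed without touching the drive, which is a reasonable way to check the device name before starting a multi-hour test.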

Posted

Thanks for the response.

 

Cable has been re-routed previously.

Drive is a few months old.

P/S is 550w, with 6-drives total, but only two should have been spinning during tonight's failure (was 1-hour into the movie w/ no other activity)

Will change the cable tonight, just to be sure. (I have extras)

What do you mean by "in a clean state"  (re-add the drive and let it rebuild?)

How do I set my go script "so that it does not run emhttp"

(won't this prevent me from accessing unRAID via browser?  I'll caution you- I know next-to-nothing about Linux)

 

Posted

What do you mean by "in a clean state"  (re-add the drive and let it rebuild?)

Yes.

 

How do I set my go script "so that it does not run emhttp"

(won't this prevent me from accessing unRAID via browser?  I'll caution you- I know next-to-nothing about Linux)

 

Yes, it would prevent you from accessing the server via browser, since this is a more advanced way of accessing the system.

Let's skip that part.

 

Perhaps you can install unmenu, and use the smart tools there to run a smart test (just to be sure it's not a surface issue).

However, if you are going to do the long test, I would suggest turning the spindown timer off for this particular drive first.

Then issue the smart long test.

Do not access this drive for the duration of the long test.

 

It really seems like a power or cable issue, but if it's a surface issue, the short and long test may reveal it.

Posted

Dec 17 18:40:36 Tower kernel: ata7.00: disabled

I'll just add that once you see the line above, indicating that the drive has been marked 'disabled' by the kernel, then you can completely ignore the disk and stripe errors that follow.

 

I very much agree with WeeboTech, sounds like a cable (power or SATA) or PSU issue, although I would add that so far, there has been no evidence of any surface issues.
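
To look only at the part of the syslog that matters, everything after the first `disabled` line can be cut off. A minimal sketch, assuming the kernel message has the form `ataN.NN: disabled` as in the line quoted above; the function name `until_disabled` is hypothetical.

```shell
#!/bin/sh
# until_disabled: hypothetical helper (assumption, not an existing tool).
# Reads a syslog on stdin and prints everything up to and including the
# first "ataN.NN: disabled" line, dropping the error flood that follows.
until_disabled() {
    awk '{ print } /ata[0-9]+\.[0-9]+: disabled/ { exit }'
}

# Example: until_disabled < /var/log/syslog
```

The lines just before the `disabled` message (link resets, timeouts and the like) are usually the ones worth posting.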

Posted

I had a similar problem with my 3TB Hitachis in my Norco.

 

After a few months of being problem-free, two or three of the drives vibrated loose while in the hot-swap bays. They had power lights and worked for the most part; I could even rebuild the red-balled drive. Then they would red-ball again within the next day or two.

 

I had to push on them a bit more to get a solid connection and to get them to stop redballing.

 

no problems since.

Posted

P/S is the CoolerMaster 550 that came with the case.  (I'm not inclined to suspect it.  It has, after all, had zero problems running ALL the drives thru three rebuilds [15+ hours, at least] and monthly parity checks...and when this last failure occurred, only two drives should have been running....just saying)

 

Since my last post: 

I've allowed the array to rebuild.  (I'm getting good at that)

Made backup copies of the files on this drive, and have even copied 30-40Gb of new data onto it.

Server's been powered on 24/7 for a few days, now- with default drive parking enabled.

 

But I haven't yet run the short/long SMART tests.  Are they destructive? 

 

Posted

But I haven't yet run the short/long SMART tests.  Are they destructive? 

They are both read-only tests and not destructive.

 

The "long" test reads all the sectors on the disk, the short test reads a much smaller sample of sectors.  The short completes in about 5 minutes or so on most disks, the "long" test can take 5 or more hours on a large disk.  (Be sure to disable spin-down timers, as spinning down the disk will abort the test)

 

After submitting either the short or long test, you must wait sufficient time and then submit a normal smart status report request.  It will let you know if it is still running.
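
The "is it still running" check can be scripted instead of read by eye. A minimal sketch, assuming the status report for an ATA drive contains the phrase "Self-test routine in progress" while a test is running (that is the wording smartctl uses as far as I know; adjust the pattern if your version prints something different). The function name `selftest_running` is hypothetical.

```shell
#!/bin/sh
# selftest_running: hypothetical helper (assumption, not an existing tool).
# Reads `smartctl -a` output on stdin and exits 0 if the report says a
# self-test is still in progress, non-zero otherwise.
selftest_running() {
    grep -q 'Self-test routine in progress'
}

# Example (replace /dev/sdX with the real device):
#   smartctl -a /dev/sdX | selftest_running && echo "still running"
```

Once this stops matching, the full report's self-test log section should show the completed test and its result.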

 

Joe L.

Archived

This topic is now archived and is closed to further replies.
