johnny121b Posted December 2, 2011

For details of this latest event (third!!) see http://lime-technology.com/forum/index.php?topic=17023.msg157208#msg157208 (in this same thread).

Yesterday evening, I began receiving trouble emails from my server. Checking, I found my drive 5 (Hitachi 2TB with about 1TB used) was red, with the following messages repeating basically constantly through my syslog:

Dec 1 23:36:59 Tower kernel: md: disk224: ATA_OP 5 ioctl error: -5
Dec 1 23:37:09 Tower emhttp: mdcmd: write: Input/output error
Dec 1 23:37:09 Tower kernel: mdcmd (5425): spindown 5
Dec 1 23:37:09 Tower kernel: md: disk224: ATA_OP 5 ioctl error: -5
Dec 1 23:37:18 Tower emhttp: mdcmd: write: Input/output error
Dec 1 23:37:18 Tower kernel: mdcmd (5426): spindown 5
Dec 1 23:37:18 Tower kernel: md: disk224: ATA_OP 5 ioctl error: -5
Dec 1 23:37:18 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Dec 1 23:37:18 Tower last message repeated 3 times
Dec 1 23:37:19 Tower emhttp: mdcmd: write: Input/output error

The array successfully completed the first-of-the-month check earlier the same day, FWIW.

Any advice would be appreciated. Thanks!
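A side note on the `-5` in the `ioctl error: -5` lines: Linux kernel code conventionally reports failures as negated errno values, and errno 5 is EIO, which lines up with the `Input/output error` text emhttp logs right afterwards. A minimal Python sketch for looking up such a code follows; the helper name `describe_kernel_errno` is made up for illustration, and the message text comes from the local C library, so its wording can vary by platform.

```python
import errno
import os


def describe_kernel_errno(code: int) -> str:
    """Map a negative kernel-style return code (e.g. -5) to its errno name and message."""
    number = abs(code)
    # errno.errorcode maps numeric codes to symbolic names such as 'EIO'.
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


print(describe_kernel_errno(-5))
```

This only tells you that the I/O request failed at a low level. It does not say whether the drive, cable, controller, or power is at fault.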
dgaschk Posted December 3, 2011

http://lime-technology.com/wiki/index.php?title=FAQ#What_does_the_Red_Ball_mean.3F
johnny121b Posted December 3, 2011

Thanks for responding. I read that at 1am last night, but I can't tell from the logs whether I have a failed drive or a failed controller. Or, if you read through the text far enough, it might be neither: "Some times, it is just a disk controller that went offline, making it impossible to access the drive."

So I figured it best to ask before I start trying things and potentially adding complications to a situation I already don't understand.
johnny121b Posted December 3, 2011

I turned the server back on to see if I could get a SMART report. The Main tab tells me 'the array will be unprotected' if I start the array. Is that normal? (Possibly because the system is already simulating the failed disk 5?)
dgaschk Posted December 3, 2011

See here: http://lime-technology.com/wiki/index.php?title=Troubleshooting#Obtaining_a_SMART_report
johnny121b Posted December 3, 2011

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: Hitachi HDS5C3020ALA632
Serial Number: ML2220F30Z3MPE
Firmware Version: ML6OA580
User Capacity: 2,000,398,934,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Sat Dec 3 08:54:42 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (21608) seconds.
Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
SCT capabilities: (0x003d) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 136 136 054 Pre-fail Offline - 94
3 Spin_Up_Time 0x0007 136 136 024 Pre-fail Always - 404 (Average 402)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 117
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 146 146 020 Pre-fail Offline - 29
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 2222
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 6
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 117
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 117
194 Temperature_Celsius 0x0002 253 253 000 Old_age Always - 18 (Min/Max 15/39)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 6

SMART Error Log Version: 1
ATA Error Count: 6 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 6 occurred at disk power-on lifetime: 2056 hours (85 days + 16 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 40 50 9c 1c 02 Error: ICRC, ABRT 64 sectors at LBA = 0x021c9c50 = 35429456

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 40 4f 9c 1c e2 ff 5d+17:28:21.091 WRITE DMA EXT
35 00 00 90 9b 1c e0 08 5d+17:28:21.090 WRITE DMA EXT
25 00 80 40 ac 1c e0 08 5d+17:28:21.089 READ DMA EXT
25 00 00 40 a8 1c e0 08 5d+17:28:21.087 READ DMA EXT
25 00 00 40 a4 1c e0 08 5d+17:28:21.085 READ DMA EXT

Error 5 occurred at disk power-on lifetime: 2056 hours (85 days + 16 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 d9 67 0d 1c 02 Error: ICRC, ABRT 217 sectors at LBA = 0x021c0d67 = 35392871

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 f8 48 0c 1c e0 08 5d+17:28:18.415 WRITE DMA EXT
35 00 00 48 08 1c e0 08 5d+17:28:18.413 WRITE DMA EXT
35 00 00 48 04 1c e0 08 5d+17:28:18.410 WRITE DMA EXT
35 00 00 48 00 1c e0 08 5d+17:28:18.408 WRITE DMA EXT
35 00 08 30 da 1b e0 08 5d+17:28:18.408 WRITE DMA EXT

Error 4 occurred at disk power-on lifetime: 2056 hours (85 days + 16 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 c8 e0 c9 1b 02 Error: ICRC, ABRT 200 sectors at LBA = 0x021bc9e0 = 35375584

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 d8 d0 c9 1b e0 08 5d+17:28:17.507 WRITE DMA EXT
35 00 08 c8 c9 1b e0 08 5d+17:28:17.506 WRITE DMA EXT
35 00 08 c0 c9 1b e0 08 5d+17:28:17.506 WRITE DMA EXT
35 00 08 b8 c9 1b e0 08 5d+17:28:17.503 WRITE DMA EXT
35 00 08 b0 c9 1b e0 08 5d+17:28:17.502 WRITE DMA EXT

Error 3 occurred at disk power-on lifetime: 2056 hours (85 days + 16 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 a0 c0 65 16 02 Error: ICRC, ABRT 160 sectors at LBA = 0x021665c0 = 35022272

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 00 60 63 16 e0 08 5d+17:27:55.547 WRITE DMA EXT
35 00 00 60 5f 16 e0 08 5d+17:27:55.545 WRITE DMA EXT
35 00 00 60 5b 16 e0 08 5d+17:27:55.543 WRITE DMA EXT
25 00 f0 70 6f 16 e0 08 5d+17:27:55.540 READ DMA EXT
25 00 00 70 6b 16 e0 08 5d+17:27:55.536 READ DMA EXT

Error 2 occurred at disk power-on lifetime: 2056 hours (85 days + 16 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 20 58 88 8d 0d Error: ICRC, ABRT 32 sectors at LBA = 0x0d8d8858 = 227379288

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 00 78 86 8d e0 08 5d+17:15:04.376 WRITE DMA EXT
35 00 00 78 82 8d e0 08 5d+17:15:04.374 WRITE DMA EXT
35 00 00 78 7e 8d e0 08 5d+17:15:04.372 WRITE DMA EXT
35 00 00 78 7a 8d e0 08 5d+17:15:04.370 WRITE DMA EXT
35 00 00 78 76 8d e0 08 5d+17:15:04.367 WRITE DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
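For anyone reading the error-log entries above by hand, the register bytes can be decoded with a few lines of Python. This is only a sketch: `decode_ata_error` is a hypothetical helper, it assumes the 28-bit register layout that this summary error log prints (low nibble of DH, then CH, CL, SN), and it checks only the two error-register bits that appear in this report (bit 7 = ICRC, bit 2 = ABRT).

```python
def decode_ata_error(registers: str) -> dict:
    """Decode the 'ER ST SC SN CL CH DH' hex bytes from a smartctl error-log entry."""
    er, st, sc, sn, cl, ch, dh = (int(byte, 16) for byte in registers.split())
    flags = []
    if er & 0x80:
        flags.append("ICRC")  # bit 7: interface CRC error during a DMA transfer
    if er & 0x04:
        flags.append("ABRT")  # bit 2: command aborted
    # 28-bit LBA: low nibble of DH holds bits 27..24, followed by CH, CL, SN.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {"flags": flags, "sectors": sc, "lba": lba}


# Register bytes copied from "Error 6" in the report above.
print(decode_ata_error("84 51 40 50 9c 1c 02"))
# {'flags': ['ICRC', 'ABRT'], 'sectors': 64, 'lba': 35429456}
```

The decoded values match what smartctl itself printed for that entry (64 sectors at LBA 35429456). An ICRC flag means data was corrupted on the link between controller and drive, which is consistent with a cabling, port, or power explanation and does not by itself point at the platters.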
Joe L. Posted December 3, 2011

The disk was taken off-line when a write to it failed. It will not just restore itself to service, since unRAID knows it must be reconstructed to have the correct data.

The errors seem to be CRC related (checksums). That could be a bad cable, or a cable picking up induced noise from adjacent cabling. (Is it bundled with other wires? With power supply wires?) Or it could be power supply related: a noisy, improperly regulated supply line could cause all kinds of electrical issues in a drive that is sensitive to poor power supply regulation, or the power supply could be over its capacity. Or it could be a bad port on the disk controller. Or it could be a disk drive that is actually failing.

At this point, since the drive seems to be responding to the SMART commands, you can:

1. stop the array
2. un-assign the failed drive
3. power down
4. re-seat the cables to that drive, being careful not to dislodge cables to the other drives
5. power up
6. start the array without the failed drive assigned. It will emulate it, as it is now, but it will forget the model/serial number of the drive, so it will think of it as its own replacement when you re-assign it next.

Then:

1. stop the array
2. re-assign the failed drive
3. start the array once more. It will re-construct the failed drive based on parity and all the other disks.

If it succeeds, fine. (It might have just been the cabling.)
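To illustrate what "re-construct the failed drive based on parity and all the other disks" means, here is a toy sketch in Python. It assumes simple single-parity XOR, which is the usual way a single parity drive is described; the thread does not show unRAID's actual driver code, and the byte values and the `xor_blocks` helper are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy data standing in for the same stripe on three data disks (made-up values).
disk1 = bytes([0x10, 0x20, 0x30])
disk2 = bytes([0x0F, 0xF0, 0xAA])
disk5 = bytes([0x55, 0x01, 0xFF])  # the disk that will "fail"

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk5])

# With disk5 gone, XOR-ing parity with the surviving disks yields its contents.
rebuilt = xor_blocks([parity, disk1, disk2])
print(rebuilt == disk5)  # True
```

The same arithmetic shows why the array counts as unprotected while one disk is missing: the rebuild needs every other disk, so losing a second one leaves nothing to reconstruct from.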
johnny121b Posted December 3, 2011

Machine was on, but the array wasn't started. Selected the drive and chose "no device". Powered down safely from the menu. Restarted the machine.

Array isn't started. Menu says "Start will bring the array on-line (array will be unprotected)." Drive is still listed with a red ball. I expected it to be gone?

REPEATED the above. It's still there??
johnny121b Posted December 4, 2011

Is it safe to start the array with disk 5 set to "no device" while the "start" button warns 'the array will be unprotected'?

Despite setting the drive as instructed, I've cycled power to the system 3x, and each time the array doesn't start, but drive 5 IS still listed and red-balled, so the system doesn't appear to be forgetting anything...
Joe L. Posted December 4, 2011

> Is it safe to start the array with disk 5 set to "no device" while the "start" button warns 'the array will be unprotected'

Yes. It is safe. Just do NOT use the management utility to set a new disk configuration. That would invalidate parity.

"Unprotected" is true: at that point, if you were to lose a second drive, you will lose data. That is why you want to get the failure resolved as soon as possible.

> Despite setting the drive as instructed, I've cycled power to the system 3x and each time, the array doesn't start, but drive 5 IS still listed and red-balled....so the system doesn't appear to be forgetting anything...

It won't forget the disk model/serial number until you start the array without it. At that time, all it will forget is the original model/serial number of the disk. It will not forget the data, or that you have a disk5. In fact, with the disk un-assigned you'll still be able to access the files on it as re-constructed by the other remaining disks and the parity drive.
johnny121b Posted December 4, 2011

Thank you, Joe.

Previously: following Joe's advice, the drive was added back and, after about 5 hours of rebuilding, everything appeared normal again.

The drive red-balled again today. This time, a SMART report terminates, saying:

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

Ideas?
lionelhutz Posted December 9, 2011

Have you changed the cables, both power and SATA, connected to the drive? Connect it to another SATA port if you have one available. Then try a rebuild again.

Peter
RobJ Posted December 10, 2011

Without a syslog, it is hard to be definitive, but I can say generally that when a drive suddenly can't return a SMART report, and there is nothing physically wrong with the drive, a reboot will almost always restore communications to the drive and enable SMART reports and normal operation again. A syslog covering the faulty period might help to reveal what is causing the drive to lose contact.

Peter's advice is the standard recommendation for this kind of intermittent drive problem, especially with your recent evidence of CRC issues (and no sector issues).
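The "CRC issues and no sector issues" reading can be checked mechanically against the attribute table from the first SMART report. The sketch below is an illustration only: `summarize` is a hypothetical helper, the choice of attributes 5, 197 and 198 as the "sector" indicators is an assumption about what is meant by sector issues, and the rows are copied from the December 3 report in this thread.

```python
SECTOR_IDS = {5, 197, 198}  # reallocated, pending, offline-uncorrectable (assumed set)
CRC_ID = 199                # UDMA_CRC_Error_Count


def summarize(attribute_lines):
    """Return (sector_problem_total, crc_error_count) from smartctl attribute rows."""
    sector_total, crc = 0, 0
    for line in attribute_lines:
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip the header row and anything that is not an attribute row
        attr_id, raw = int(fields[0]), int(fields[9])  # column 10 is RAW_VALUE
        if attr_id in SECTOR_IDS:
            sector_total += raw
        elif attr_id == CRC_ID:
            crc = raw
    return sector_total, crc


rows = [
    "ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE",
    "5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0",
    "197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0",
    "198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0",
    "199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 6",
]
print(summarize(rows))  # (0, 6)
```

Zero on the sector side with a non-zero CRC count is the pattern that points toward the link (cable, port, power) and away from the disk surface.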
johnny121b Posted December 14, 2011

Added a controller card. Moved the drive to the card, abandoning port 6 on the motherboard. Allowed the system to rebuild it (again). It's been running fine for a few days now, and I've since written a few hundred GB to it. Time will tell.

Thanks to everyone who responded.
johnny121b Posted December 18, 2011

This evening, the same drive red-balled again. This time, in the middle of watching a standard-def movie, nothing demanding. I swear I'm really beginning to hate this Hitachi.

Recap: This is the third failure event for this drive in about a month. It's now on a different controller. Its cable isn't bundled, and its connections seem absolutely solid. The server sent the 'unraid ok' email this morning at 7am.

I've included an excerpt of the syslog, starting just before the event, below. The initial SMART report (prior to restarting the server) is empty, as before. The second SMART report is also pasted below. It's more interesting.

Any advice appreciated...

First SMART report (prior to restarting server)

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

Smartctl: Device Read Identity Failed (not an ATA/ATAPI device)

A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

Second SMART report (after safely cycling power on server)

smartctl 5.40 2010-10-16 r3189 [i486-slackware-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Device Model: Hitachi HDS5C3020ALA632
Serial Number: ML2220F30Z3MPE
Firmware Version: ML6OA580
User Capacity: 2,000,398,934,016 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Sat Dec 17 20:39:56 2011 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x84) Offline data collection activity was suspended by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run.
Total time to complete Offline data collection: (21608) seconds.
Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported. General Purpose Logging supported.
Short self-test routine recommended polling time: ( 1) minutes.
Extended self-test routine recommended polling time: ( 255) minutes.
SCT capabilities: (0x003d) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 016 Pre-fail Always - 0
2 Throughput_Performance 0x0005 136 136 054 Pre-fail Offline - 94
3 Spin_Up_Time 0x0007 137 137 024 Pre-fail Always - 400 (Average 400)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 156
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 146 146 020 Pre-fail Offline - 29
9 Power_On_Hours 0x0012 100 100 000 Old_age Always - 2542
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 156
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 156
194 Temperature_Celsius 0x0002 193 193 000 Old_age Always - 31 (Min/Max 15/39)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 17

SMART Error Log Version: 1
ATA Error Count: 17 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 17 occurred at disk power-on lifetime: 2389 hours (99 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 c8 b0 41 19 02 Error: ICRC, ABRT 200 sectors at LBA = 0x021941b0 = 35209648

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 d8 a0 41 19 e0 08 1d+12:07:53.812 WRITE DMA EXT
35 00 e8 b8 40 19 e0 08 1d+12:07:53.812 WRITE DMA EXT
35 00 08 b0 40 19 e0 08 1d+12:07:53.812 WRITE DMA EXT
35 00 08 a8 40 19 e0 08 1d+12:07:53.811 WRITE DMA EXT
35 00 28 80 3e 19 e0 08 1d+12:07:53.806 WRITE DMA EXT

Error 16 occurred at disk power-on lifetime: 2389 hours (99 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 e8 48 25 18 02 Error: ICRC, ABRT 232 sectors at LBA = 0x02182548 = 35136840

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 18 18 24 18 e0 08 1d+12:07:48.659 WRITE DMA EXT
35 00 e8 30 23 18 e0 08 1d+12:07:48.659 WRITE DMA EXT
35 00 f0 40 22 18 e0 08 1d+12:07:48.658 WRITE DMA EXT
35 00 08 38 22 18 e0 08 1d+12:07:48.658 WRITE DMA EXT
35 00 08 30 22 18 e0 08 1d+12:07:48.658 WRITE DMA EXT

Error 15 occurred at disk power-on lifetime: 2389 hours (99 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 78 90 7a 13 02 Error: ICRC, ABRT 120 sectors at LBA = 0x02137a90 = 34830992

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 e8 20 7a 13 e0 08 1d+12:07:16.655 WRITE DMA EXT
35 00 f8 28 79 13 e0 08 1d+12:07:16.655 WRITE DMA EXT
35 00 08 20 79 13 e0 08 1d+12:07:16.654 WRITE DMA EXT
35 00 08 18 79 13 e0 08 1d+12:07:16.654 WRITE DMA EXT
35 00 28 f0 76 13 e0 08 1d+12:07:16.649 WRITE DMA EXT

Error 14 occurred at disk power-on lifetime: 2389 hours (99 days + 13 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 70 a0 1e 0c 02 Error: ICRC, ABRT 112 sectors at LBA = 0x020c1ea0 = 34348704

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 00 10 1e 0c e0 08 1d+12:06:26.847 WRITE DMA EXT
35 00 00 10 1a 0c e0 08 1d+12:06:26.845 WRITE DMA EXT
35 00 00 10 16 0c e0 08 1d+12:06:26.843 WRITE DMA EXT
35 00 00 10 12 0c e0 08 1d+12:06:26.840 WRITE DMA EXT
35 00 00 10 0e 0c e0 08 1d+12:06:26.838 WRITE DMA EXT

Error 13 occurred at disk power-on lifetime: 2388 hours (99 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
84 51 50 e0 65 4d 01 Error: ICRC, ABRT 80 sectors at LBA = 0x014d65e0 = 21849568

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
35 00 b0 80 65 4d e0 08 1d+11:44:36.042 WRITE DMA EXT
35 00 18 68 64 4d e0 08 1d+11:44:36.042 WRITE DMA EXT
35 00 18 50 63 4d e0 08 1d+11:44:36.041 WRITE DMA EXT
35 00 08 48 63 4d e0 08 1d+11:44:36.041 WRITE DMA EXT
35 00 08 40 63 4d e0 08 1d+11:44:36.041 WRITE DMA EXT

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

System log (excerpt; the full log was too large to post here)

Dec 17 17:39:55 Tower kernel: mdcmd (119): spindown 0
Dec 17 17:48:47 Tower kernel: mdcmd (120): spindown 1
Dec 17 17:49:49 Tower kernel: mdcmd (121): spindown 2
Dec 17 17:50:42 Tower kernel: mdcmd (122): spindown 3
Dec 17 17:51:12 Tower kernel: mdcmd (123): spindown 4
Dec 17 18:19:01 Tower crond[1111]: ignoring /var/spool/cron/crontabs/root- (non-existent user)
Dec 17 18:30:44 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x6 frozen
Dec 17 18:30:44 Tower kernel: ata7: SError: { Handshk }
Dec 17 18:30:44 Tower kernel: ata7.00: failed command: READ DMA EXT
Dec 17 18:30:44 Tower kernel: ata7.00: cmd 25/00:b8:50:7d:aa/00:03:75:00:00/e0 tag 0 dma 487424 in
Dec 17 18:30:44 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 17 18:30:44 Tower kernel: ata7.00: status: { DRDY }
Dec 17 18:30:44 Tower kernel: ata7: hard resetting link
Dec 17 18:30:44 Tower kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Dec 17 18:30:44 Tower kernel: ata7.00: configured for UDMA/133
Dec 17 18:30:44 Tower kernel: ata7.00: device reported invalid CHS sector 0
Dec 17 18:30:44 Tower kernel: ata7: EH complete
Dec 17 18:38:18 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x6 frozen
Dec 17 18:38:18 Tower kernel: ata7: SError: { Handshk }
Dec 17 18:38:18 Tower kernel: ata7.00: failed command: READ DMA EXT
Dec 17 18:38:18 Tower kernel: ata7.00: cmd 25/00:e0:d0:a4:aa/00:00:75:00:00/e0 tag 0 dma 114688 in
Dec 17 18:38:18 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 17 18:38:18 Tower kernel: ata7.00: status: { DRDY }
Dec 17 18:38:18 Tower kernel: ata7: hard resetting link
Dec 17 18:38:18 Tower kernel: ata7: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Dec 17 18:38:18 Tower kernel: ata7.00: configured for UDMA/133
Dec 17 18:38:18 Tower kernel: ata7.00: device reported invalid CHS sector 0
Dec 17 18:38:18 Tower kernel: ata7: EH complete
Dec 17 18:39:36 Tower kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x400000 action 0x6 frozen
Dec 17 18:39:36 Tower kernel: ata7: SError: { Handshk }
Dec 17 18:39:36 Tower kernel: ata7.00: failed command: READ DMA EXT
Dec 17 18:39:36 Tower kernel: ata7.00: cmd 25/00:60:d0:b8:aa/00:00:75:00:00/e0 tag 0 dma 49152 in
Dec 17 18:39:36 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 17 18:39:36 Tower kernel: ata7.00: status: { DRDY }
Dec 17 18:39:36 Tower kernel: ata7: hard resetting link
Dec 17 18:39:46 Tower kernel: ata7: softreset failed (1st FIS failed)
Dec 17 18:39:46 Tower kernel: ata7: hard resetting link
Dec 17 18:39:56 Tower kernel: ata7: softreset failed (1st FIS failed)
Dec 17 18:39:56 Tower kernel: ata7: hard resetting link
Dec 17 18:40:31 Tower kernel: ata7: softreset failed (1st FIS failed)
Dec 17 18:40:31 Tower kernel: ata7: limiting SATA link speed to 1.5 Gbps
Dec 17 18:40:31 Tower kernel: ata7: hard resetting link
Dec 17 18:40:36 Tower kernel: ata7: softreset failed (1st FIS failed)
Dec 17 18:40:36 Tower kernel: ata7: reset failed, giving up
Dec 17 18:40:36 Tower kernel: ata7.00: disabled
Dec 17 18:40:36 Tower kernel: ata7.00: device reported invalid CHS sector 0
Dec 17 18:40:36 Tower kernel: ata7: EH complete
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x28: 28 00 75 aa b8 d0 00 00 60 00
Dec 17 18:40:36 Tower kernel: end_request: I/O error, dev sdg, sector 1974122704
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x28: 28 00 75 aa b9 b0 00 03 a0 00
Dec 17 18:40:36 Tower kernel: end_request: I/O error, dev sdg, sector 1974122928
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x28: 28 00 75 aa bd 50 00 02 f8 00
Dec 17 18:40:36 Tower kernel: end_request: I/O error, dev sdg, sector 1974123856
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:36 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x28: 28 00 75 9c 80 30 00 00 08 00
Dec 17 18:40:36 Tower kernel: end_request: I/O error, dev sdg, sector 1973190704
Dec 17 18:40:36 Tower kernel: md: disk5 read error
Dec 17 18:40:36 Tower kernel: handle_stripe read error: 1974122640/5, count: 1
Dec 17 18:40:36 Tower kernel: md: disk5 read error
Dec 17 18:40:36 Tower kernel: handle_stripe read error: 1974122648/5, count: 1
Dec 17 18:40:36 Tower kernel: md: disk5 read error
Dec 17 18:40:36 Tower kernel: handle_stripe read error: 1974122656/5, count: 1
Dec 17 18:40:36 Tower kernel: md: disk5 read error
Dec 17 18:40:36 Tower kernel: handle_stripe read error: 1974122664/5, count: 1

==== this repeats MANY MANY TIMES ====

Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x2a: 2a 00 75 aa b8 d0 00 00 08 00
Dec 17 18:40:46 Tower kernel: end_request: I/O error, dev sdg, sector 1974122704
Dec 17 18:40:46 Tower kernel: md: disk5 write error
Dec 17 18:40:46 Tower kernel: handle_stripe write error: 1974122640/5, count: 1
Dec 17 18:40:46 Tower kernel: md: recovery thread woken up ...
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x2a: 2a 00 75 aa b8 d8 00 00 08 00
Dec 17 18:40:46 Tower kernel: end_request: I/O error, dev sdg, sector 1974122712
Dec 17 18:40:46 Tower kernel: md: disk5 write error
Dec 17 18:40:46 Tower kernel: handle_stripe write error: 1974122648/5, count: 1
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x2a: 2a 00 75 aa b8 e0 00 00 08 00
Dec 17 18:40:46 Tower kernel: end_request: I/O error, dev sdg, sector 1974122720
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x2a: 2a 00 75 aa b8 e8 00 00 08 00
Dec 17 18:40:46 Tower kernel: end_request: I/O error, dev sdg, sector 1974122728
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Unhandled error code
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] Result: hostbyte=0x04 driverbyte=0x00
Dec 17 18:40:46 Tower kernel: sd 7:0:0:0: [sdg] CDB: cdb[0]=0x2a: 2a 00 75 aa b8 f0 00 00 40 00
Dec 17 18:40:46 Tower kernel: end_request: I/O error, dev sdg, sector 1974122736

==== I think you get the idea ====
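The `CDB:` lines in the syslog excerpt are 10-byte SCSI command blocks, and decoding one shows how it relates to the `end_request` sector number that follows it. A minimal sketch, assuming the standard READ(10)/WRITE(10) layout (opcode, flags, 4-byte big-endian LBA, group, 2-byte transfer length, control); `decode_cdb10` is a hypothetical helper name.

```python
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}


def decode_cdb10(cdb_hex: str):
    """Decode a 10-byte SCSI CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    operation = OPCODES.get(cdb[0], hex(cdb[0]))
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: starting block
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length
    return operation, lba, blocks


# CDB copied from the first "Unhandled error code" entry at 18:40:36.
print(decode_cdb10("28 00 75 aa b8 d0 00 00 60 00"))
# ('READ(10)', 1974122704, 96)
```

The decoded LBA matches the `end_request: I/O error, dev sdg, sector 1974122704` line, and 96 blocks of 512 bytes is the same 49152 bytes shown in the failed `ata7.00: cmd` line at 18:39:36.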
WeeboTech Posted December 18, 2011

Have you changed the SATA cable? Re-routed it? Is this a newly added drive?

It could be a PSU issue. Sometimes when you reach a certain number of drives, they intermittently go offline like this. (I had that happen when I upgraded to a 9th drive on a 600W PSU that was not single rail.)

Also, I did not see a SMART short or long test:

SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]

After the array is in a clean state, set your go script so that it does not run emhttp. Reboot. Run a SMART short test:

smartctl -t short /dev/sd?

Wait 2 minutes, then capture a SMART log on your flash. Then do a SMART long test:

smartctl -t long /dev/sd?

Wait as long as it says (usually something like 255 minutes or so), then capture a SMART log again on your flash.

If you want, you can do a badblocks read-only test:

badblocks -vs -o /tmp/badblocks.out /dev/sd?

See if there are trouble spots.

If this were my drive, the first thing I would do is a smartctl -t short test, since it's very fast. After that I would consider the long test or swapping out the SATA cable. Sometimes the contacts are not tight enough and the movement of the head vibrates the cable.
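On "wait as long as it says": the recommended wait times are printed in the SMART report itself. As a small sketch (not part of any tool mentioned in this thread), the Python below pulls them out of saved smartctl output; `polling_minutes` is a made-up helper name and the sample text is copied from the reports posted earlier.

```python
import re

# Matches e.g. "Extended self-test routine recommended polling time: ( 255) minutes."
POLL_RE = re.compile(
    r"(Short|Extended) self-test routine\s+recommended polling time:\s*\(\s*(\d+)\)\s*minutes"
)


def polling_minutes(report: str) -> dict:
    """Return the recommended polling times, in minutes, keyed by test type."""
    return {kind.lower(): int(minutes) for kind, minutes in POLL_RE.findall(report)}


report = (
    "Short self-test routine recommended polling time: ( 1) minutes. "
    "Extended self-test routine recommended polling time: ( 255) minutes."
)
print(polling_minutes(report))  # {'short': 1, 'extended': 255}
```

For this drive that means roughly a minute for the short test and a bit over four hours for the long one, during which the drive should be left alone.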
johnny121b Posted December 18, 2011 Author Posted December 18, 2011 Thanks for the response. Cable has been re-routed previously. Drive is a few months old. P/S is 550w, with 6-drives total, but only two should have been spinning during tonight's failure (was 1-hour into the movie w/ no other activity) Will change the cable tonight, just to be sure. (I have extras) What do you mean by "in a clean state" (re-add the drive and let it rebuild?) How do I set my go script "so that it does not run emhttp" (won't this prevent me from accessing unRAID via browser? I'll caution you- I know next-to-nothing about Linux)
WeeboTech Posted December 18, 2011 Posted December 18, 2011 Thanks for the response. Cable has been re-routed previously. Drive is a few months old. P/S is 550w, with 6-drives total, but only two should have been spinning during tonight's failure (was 1-hour into the movie w/ no other activity) Will change the cable tonight, just to be sure. (I have extras) What do you mean by "in a clean state" (re-add the drive and let it rebuild?) Yes. How do I set my go script "so that it does not run emhttp" (won't this prevent me from accessing unRAID via browser? I'll caution you- I know next-to-nothing about Linux) Yes, it would prevent you from accessing it via browser. Since this is a more advanced way of accessing the system, let's skip that part. Perhaps you can install unmenu and use the SMART tools there to run a SMART test (just to be sure it's not a surface issue). However, if you are going to do the long test, I would suggest turning the spindown timer off on this particular drive first. Then issue the SMART long test. Do not access this drive for the duration of the long test. It really seems like a power or cable issue, but if it's a surface issue, the short and long tests may reveal it.
RobJ Posted December 18, 2011 Posted December 18, 2011 Dec 17 18:40:36 Tower kernel: ata7.00: disabled I'll just add that once you see the line above, indicating that the drive has been marked 'disabled' by the kernel, then you can completely ignore the disk and stripe errors that follow. I very much agree with WeeboTech, sounds like a cable (power or SATA) or PSU issue, although I would add that so far, there has been no evidence of any surface issues.
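One way to apply this when reading a long syslog is to cut it off at the first 'disabled' line, so only the events leading up to the drop remain. A minimal sketch, assuming the kernel message looks like the "ata7.00: disabled" line quoted above; the function name log_until_disabled is made up for illustration.

```shell
# Hypothetical helper: print a syslog file up to and including the first
# "ataN.NN: disabled" kernel line, dropping everything after it.
#   $1 = path to the syslog file
log_until_disabled() {
    # Print each line; stop right after printing the first line that matches.
    awk '{ print } /kernel: ata[0-9]+\.[0-9]+: disabled/ { exit }' "$1"
}
```

For example, log_until_disabled /var/log/syslog | tail -n 40 would show the last 40 lines before the kernel gave up on the drive. The path is an assumption, so point it at wherever your syslog copy lives.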
Johnm Posted December 18, 2011 Posted December 18, 2011 I had a similar problem with my 3TB Hitachis in my Norco. After a few months of being problem-free, two or three of the drives vibrated loose while in the hot-swap bays. They had power lights and worked for the most part, and I could even rebuild after the red ball, but then they would red-ball again within a day or two. I had to push on them a bit more to get a solid connection and to get them to stop red-balling. No problems since.
lionelhutz Posted December 19, 2011 Posted December 19, 2011 What make of power supply? It wouldn't happen to be an Antec?
johnny121b Posted December 19, 2011 Author Posted December 19, 2011 P/S is the CoolerMaster 550 that came with the case. (I'm not inclined to suspect it. It has, after all, had zero problems running ALL the drives thru three rebuilds [15+ hours, at least] and monthly parity checks...and when this last failure occurred, only two drives should have been running....just saying) Since my last post: I've allowed the array to rebuild. (I'm getting good at that) Made backup copies of the files on this drive, and have even copied 30-40GB of new data onto it. Server's been powered on 24/7 for a few days now, with default drive parking enabled. But I haven't yet run the short/long SMART tests. Are they destructive?
Joe L. Posted December 19, 2011 Posted December 19, 2011 But I haven't yet run the short/long SMART tests. Are they destructive? They are both read-only tests and not destructive. The "long" test reads all the sectors on the disk; the "short" test reads a much smaller sample of sectors. The short test completes in about 5 minutes or so on most disks; the long test can take 5 or more hours on a large disk. (Be sure to disable spin-down timers, as spinning down the disk will abort the test.) After submitting either the short or long test, you must wait sufficient time and then submit a normal SMART status report request. It will let you know if the test is still running. Joe L.
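The "wait, then ask for a status report" step could be sketched as a small polling loop. This is a sketch under assumptions, not a tested unRAID recipe: the function name wait_for_selftest is made up, /dev/sdX is a placeholder, and it assumes that smartctl -c reports a running test on an ATA drive with the words "Self-test routine in progress", which may differ on other drive types or smartmontools versions.

```shell
# Hypothetical helper: wait until the drive no longer reports a self-test
# in progress, then print the self-test log.
#   $1 = device, e.g. /dev/sdX (placeholder)
#   $2 = seconds between checks (default 600)
wait_for_selftest() {
    local dev="$1" interval="${2:-600}"

    # smartctl -c prints the drive's capabilities, including the current
    # self-test execution status. Keep checking while a test is running.
    while smartctl -c "$dev" | grep -q "Self-test routine in progress"; do
        sleep "$interval"
    done

    # Show the self-test log so the result of the finished test is visible.
    smartctl -l selftest "$dev"
}
```

After starting a long test with smartctl -t long /dev/sdX, running wait_for_selftest /dev/sdX would check every 10 minutes and print the log once the drive stops reporting a test in progress. The spin-down timer still needs to be disabled as described above, since the loop does nothing to prevent a spin-down from aborting the test.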