April 21, 2011

I'm trying to fix my HPA drives, and while rebuilding the drive I've started getting a lot of these messages. The error used to happen only a few times during a full parity check, but right now every time I do a rebuild or parity check my syslog looks like someone got murdered.

The result of the rebuild: Parity updated 2196 times to address sync errors.

My syslog:

Apr 20 21:10:04 Tower kernel: ata1.00: failed command: READ DMA EXT (Minor Issues)
Apr 20 21:10:04 Tower kernel: ata1.00: cmd 25/00:00:7f:63:00/00:04:3a:00:00/e0 tag 0 dma 524288 in (Drive related)
Apr 20 21:10:04 Tower kernel: res 51/40:cf:9f:65:00/40:01:3a:00:00/e0 Emask 0x9 (media error) (Errors)
Apr 20 21:10:04 Tower kernel: ata1.00: status: { DRDY ERR } (Drive related)
Apr 20 21:10:04 Tower kernel: ata1.00: error: { UNC } (Errors)
Apr 20 21:10:04 Tower kernel: ata1.00: configured for UDMA/133 (Drive related)
Apr 20 21:10:04 Tower kernel: ata1: EH complete (Drive related)
Apr 20 21:10:06 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Apr 20 21:10:06 Tower kernel: ata1.00: BMDMA stat 0x25 (Drive related)
Apr 20 21:10:06 Tower kernel: ata1.00: failed command: READ DMA EXT (Minor Issues)
Apr 20 21:10:06 Tower kernel: ata1.00: cmd 25/00:00:7f:63:00/00:04:3a:00:00/e0 tag 0 dma 524288 in (Drive related)
Apr 20 21:10:06 Tower kernel: res 51/40:cf:9f:65:00/40:01:3a:00:00/e0 Emask 0x9 (media error) (Errors)
Apr 20 21:10:06 Tower kernel: ata1.00: status: { DRDY ERR } (Drive related)
Apr 20 21:10:06 Tower kernel: ata1.00: error: { UNC } (Errors)
Apr 20 21:10:06 Tower kernel: ata1.00: configured for UDMA/133 (Drive related)
Apr 20 21:10:06 Tower kernel: sd 1:0:0:0: [sda] Unhandled sense code (Drive related)
Apr 20 21:10:06 Tower kernel: sd 1:0:0:0: [sda] Result: hostbyte=0x00 driverbyte=0x08 (System)
Apr 20 21:10:06 Tower kernel: sd 1:0:0:0: [sda] Sense Key : 0x3 [current] [descriptor] (Drive related)
Apr 20 21:10:06 Tower kernel: Descriptor sense data with sense descriptors (in hex):
Apr 20 21:10:06 Tower kernel: 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Apr 20 21:10:06 Tower kernel: 3a 00 65 9f
Apr 20 21:10:06 Tower kernel: sd 1:0:0:0: [sda] ASC=0x11 ASCQ=0x4 (Drive related)
Apr 20 21:10:06 Tower kernel: sd 1:0:0:0: [sda] CDB: cdb[0]=0x28: 28 00 3a 00 63 7f 00 04 00 00 (Drive related)
Apr 20 21:10:06 Tower kernel: end_request: I/O error, dev sda, sector 973104543 (Errors)
Apr 20 21:10:06 Tower kernel: ata1: EH complete (Drive related)
Apr 20 21:10:06 Tower kernel: md: disk0 read error (Errors)
Apr 20 21:10:06 Tower kernel: handle_stripe read error: 973104480/0, count: 1 (Errors)
Apr 20 21:10:06 Tower kernel: md: disk0 read error (Errors)
Apr 20 21:10:06 Tower kernel: handle_stripe read error: 973104488/0, count: 1 (Errors)
Apr 20 21:10:06 Tower kernel: md: disk0 read error (Errors)
Apr 20 21:10:06 Tower kernel: handle_stripe read error: 973104496/0, count: 1 (Errors)

The handle_stripe read errors go on for hundreds of lines, and the first bunch of errors repeats several times. I'm not sure whether my drive or the controller itself is causing the issue. Does ata1 refer to the drive on the controller or on my motherboard?
sda is my parity drive, and it doesn't show any errors in the SMART test. Here are the attribute results for the drive:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 200   200   051    Pre-fail Always  -           35747
  3 Spin_Up_Time            0x0027 040   040   021    Pre-fail Always  -           15000
  4 Start_Stop_Count        0x0032 100   100   000    Old_age  Always  -           551
  5 Reallocated_Sector_Ct   0x0033 200   200   140    Pre-fail Always  -           0
  7 Seek_Error_Rate         0x002e 200   200   000    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 089   089   000    Old_age  Always  -           8191
 10 Spin_Retry_Count        0x0032 100   100   000    Old_age  Always  -           0
 11 Calibration_Retry_Count 0x0032 100   253   000    Old_age  Always  -           0
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           35
192 Power-Off_Retract_Count 0x0032 200   200   000    Old_age  Always  -           25
193 Load_Cycle_Count        0x0032 190   190   000    Old_age  Always  -           32762
194 Temperature_Celsius     0x0022 099   081   000    Old_age  Always  -           53
196 Reallocated_Event_Count 0x0032 200   200   000    Old_age  Always  -           0
197 Current_Pending_Sector  0x0032 196   196   000    Old_age  Always  -           1425
198 Offline_Uncorrectable   0x0030 197   194   000    Old_age  Offline -           1111
199 UDMA_CRC_Error_Count    0x0032 200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x0008 165   001   000    Old_age  Offline -           7022

Any help would be appreciated.
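For readers following along, the syslog lines in this post can be cross-checked against each other. The sketch below is a minimal illustration, not part of the original thread. It decodes the READ(10) CDB from the `CDB:` line (opcode 0x28, bytes 2-5 the starting LBA, bytes 7-8 the transfer length) and the last four sense bytes, which appear to hold the low 32 bits of the failing LBA. The function name `decode_read10` is hypothetical, and 512-byte logical sectors are an assumption.

```python
SECTOR_BYTES = 512  # assumption: the drive reports 512-byte logical sectors


def decode_read10(cdb_hex):
    """Return (start LBA, sector count) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, count


# CDB copied from the "CDB: cdb[0]=0x28" syslog line.
start, count = decode_read10("28 00 3a 00 63 7f 00 04 00 00")

# Last four bytes of the descriptor sense data from the syslog.
failed = int.from_bytes(bytes.fromhex("3a 00 65 9f"), "big")

print("request start LBA:", start)
print("request length:", count, "sectors =", count * SECTOR_BYTES, "bytes")
print("failing LBA:", failed, "(offset", failed - start, "into the request)")
```

Under those assumptions the request length works out to 524288 bytes, the same figure as `dma 524288` in the `cmd` line, and the decoded failing LBA is 973104543, the same sector named in the `end_request: I/O error` line. Sense key 0x3 is the SCSI "medium error" key, which agrees with the `(media error)` note in the `res` line.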
April 21, 2011

These two attributes in the SMART report show that there are many unreadable sectors on the parity disk. Basically, it is failing, and pretty badly.

197 Current_Pending_Sector 0x0032 196 196 000 Old_age Always - 1425
198 Offline_Uncorrectable 0x0030 197 194 000 Old_age Offline - 1111

On modern drives there are usually about 2000 spare sectors. You've got 1425 sectors pending re-allocation, which will be attempted the next time they are written. What you are seeing in the syslog are the errors logged every time an attempt is made to read them.

The drive needs to be replaced. Do not wait for it to say it has failed. Start the RMA process now. With so many read errors on the parity disk, you would not be able to recover from a data disk failing.

Joe L.
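The check described in this reply can be sketched in a few lines of Python: scan a saved SMART attribute table and report any nonzero raw value for the two attributes singled out above. This is only an illustration. The helper name `bad_sector_counts` is hypothetical, the sample rows are copied from the report earlier in the thread, and the parser assumes the ten-column attribute layout shown there.

```python
# Sample rows copied from the SMART report posted earlier in the thread.
REPORT = """\
  5 Reallocated_Sector_Ct   0x0033 200 200 140 Pre-fail Always  - 0
197 Current_Pending_Sector  0x0032 196 196 000 Old_age  Always  - 1425
198 Offline_Uncorrectable   0x0030 197 194 000 Old_age  Offline - 1111
199 UDMA_CRC_Error_Count    0x0032 200 200 000 Old_age  Always  - 0
"""

# The two attributes called out in the reply above.
WATCHED = {"Current_Pending_Sector", "Offline_Uncorrectable"}


def bad_sector_counts(report):
    """Return {attribute name: raw value} for watched attributes with a nonzero raw value."""
    flagged = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have ten columns and start with a numeric ID;
        # anything else (headers, blank lines) is skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


print(bad_sector_counts(REPORT))
```

An empty result would mean neither attribute is above 0 for that drive. Running the same function over each drive's saved report is one way to repeat the comparison across the whole array.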
April 21, 2011 (Author)

Thanks Joe, I checked the rest of the drives and none of them are above 0 for the two attributes you posted, whew. I suspected it was my parity drive, but I wasn't sure what the deal was with the controller resetting.
April 21, 2011

And you are also potentially cooking this one at 53 degrees C:

194 Temperature_Celsius 0x0022 099 081 000 Old_age Always - 53
April 21, 2011 (Author)

> And you are also potentially cooking this one at 53 degrees C:
> 194 Temperature_Celsius 0x0022 099 081 000 Old_age Always - 53

This was right after the rebuild so 15 drives making heat... It's at 29 C right now.
April 21, 2011

> This was right after the rebuild so 15 drives making heat... It's at 29 C right now.

You probably need to take a serious look at your cooling if you're seeing temperature swings that large, from 29 C to 53 C. If one of my drives were that hot, I wouldn't just be getting pretty worried, I'd crap my pants. That may not be what caused this specific failure, but temps that high, even if only during parity calculations, certainly are not helping you get the most life out of your drives.