Parity sync errors..

April 21, 2011

I'm trying to fix my HPA drives, and while rebuilding the drive I've started getting a lot of these messages. The error used to happen only a few times during a full parity check, but now every time I do a rebuild or parity check my syslog looks like someone got murdered.

The result of the rebuild:

Parity updated 2196 times to address sync errors.

My syslog:

Apr 20 21:10:04 Tower kernel: ata1.00: failed command: READ DMA EXT (Minor Issues)
Apr 20 21:10:04 Tower kernel: ata1.00: cmd 25/00:00:7f:63:00/00:04:3a:00:00/e0 tag 0 dma 524288 in (Drive related)
Apr 20 21:10:04 Tower kernel:          res 51/40:cf:9f:65:00/40:01:3a:00:00/e0 Emask 0x9 (media error) (Errors)
Apr 20 21:10:04 Tower kernel: ata1.00: status: { DRDY ERR } (Drive related)
Apr 20 21:10:04 Tower kernel: ata1.00: error: { UNC } (Errors)
Apr 20 21:10:04 Tower kernel: ata1.00: configured for UDMA/133 (Drive related)
Apr 20 21:10:04 Tower kernel: ata1: EH complete (Drive related)
Apr 20 21:10:06 Tower kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0 (Errors)
Apr 20 21:10:06 Tower kernel: ata1.00: BMDMA stat 0x25 (Drive related)
Apr 20 21:10:06 Tower kernel: ata1.00: failed command: READ DMA EXT (Minor Issues)
Apr 20 21:10:06 Tower kernel: ata1.00: cmd 25/00:00:7f:63:00/00:04:3a:00:00/e0 tag 0 dma 524288 in (Drive related)
Apr 20 21:10:06 Tower kernel:          res 51/40:cf:9f:65:00/40:01:3a:00:00/e0 Emask 0x9 (media error) (Errors)
Apr 20 21:10:06 Tower kernel: ata1.00: status: { DRDY ERR } (Drive related)
Apr 20 21:10:06 Tower kernel: ata1.00: error: { UNC } (Errors)
Apr 20 21:10:06 Tower kernel: ata1.00: configured for UDMA/133 (Drive related)
Apr 20 21:10:06 Tower kernel: sd 1:0:0:0: [sda] Unhandled sense code (Drive related)
Apr 20 21:10:06 Tower kernel: sd 1:0:0:0: [sda] Result: hostbyte=0x00 driverbyte=0x08 (System)
Apr 20 21:10:06 Tower kernel: sd 1:0:0:0: [sda] Sense Key : 0x3 [current] [descriptor] (Drive related)
Apr 20 21:10:06 Tower kernel: Descriptor sense data with sense descriptors (in hex):
Apr 20 21:10:06 Tower kernel:         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Apr 20 21:10:06 Tower kernel:         3a 00 65 9f 
Apr 20 21:10:06 Tower kernel: sd 1:0:0:0: [sda] ASC=0x11 ASCQ=0x4 (Drive related)
Apr 20 21:10:06 Tower kernel: sd 1:0:0:0: [sda] CDB: cdb[0]=0x28: 28 00 3a 00 63 7f 00 04 00 00 (Drive related)
Apr 20 21:10:06 Tower kernel: end_request: I/O error, dev sda, sector 973104543 (Errors)
Apr 20 21:10:06 Tower kernel: ata1: EH complete (Drive related)
Apr 20 21:10:06 Tower kernel: md: disk0 read error (Errors)
Apr 20 21:10:06 Tower kernel: handle_stripe read error: 973104480/0, count: 1 (Errors)
Apr 20 21:10:06 Tower kernel: md: disk0 read error (Errors)
Apr 20 21:10:06 Tower kernel: handle_stripe read error: 973104488/0, count: 1 (Errors)
Apr 20 21:10:06 Tower kernel: md: disk0 read error (Errors)
Apr 20 21:10:06 Tower kernel: handle_stripe read error: 973104496/0, count: 1 (Errors)

The handle_stripe read errors go on for hundreds of lines, and the first bunch of errors repeats several times. I'm not sure whether the drive or the controller itself is causing the issue. Does ata1 refer to the drive on the controller or on my motherboard? sda is my parity drive, and it doesn't show any errors in the SMART test.
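One way to see that all of these lines point at the same spot on the disk is to decode the "res" line by hand. The sketch below is only an illustration: it assumes the usual libata layout for that dump (status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device), and the helper name decode_res_lba is made up for this example. For the line in the log above it gives 973104543, the same sector the end_request line reports for sda.

```python
import re

# Copied from the syslog above (timestamp and viewer tags dropped).
RES_LINE = "res 51/40:cf:9f:65:00/40:01:3a:00:00/e0 Emask 0x9 (media error)"


def decode_res_lba(line):
    """Return the 48-bit LBA from a libata 'res' taskfile dump.

    Assumed field order (not taken from the thread):
    status/error:nsect:lbal:lbam:lbah/
    hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    match = re.search(r"res ((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})", line)
    if match is None:
        raise ValueError("no taskfile dump found in line")
    fields = [int(byte, 16) for byte in re.split(r"[/:]", match.group(1))]
    (_status, _error, _nsect, lbal, lbam, lbah,
     _hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah, _device) = fields
    # The three "hob" (high order byte) registers hold LBA bits 24..47.
    return (hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
            | lbah << 16 | lbam << 8 | lbal)


failing_lba = decode_res_lba(RES_LINE)
print(failing_lba)  # 973104543, matching "I/O error, dev sda, sector 973104543"
```

The error field in the same dump is { UNC } with "media error", so the drive itself is reporting that it could not read that sector.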

Here are the SMART attributes for the drive:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       35747
  3 Spin_Up_Time            0x0027   040   040   021    Pre-fail  Always       -       15000
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       551
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       8191
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       35
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       25
193 Load_Cycle_Count        0x0032   190   190   000    Old_age   Always       -       32762
194 Temperature_Celsius     0x0022   099   081   000    Old_age   Always       -       53
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   196   196   000    Old_age   Always       -       1425
198 Offline_Uncorrectable   0x0030   197   194   000    Old_age   Offline      -       1111
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   165   001   000    Old_age   Offline      -       7022

Any help would be appreciated.

April 21, 2011

See this: http://lime-technology.com/forum/index.php?topic=9880.0

April 21, 2011

These two attributes in the SMART report show that there are many unreadable sectors on the parity disk. Basically, it is failing, and pretty badly.

197 Current_Pending_Sector 0x0032 196 196 000 Old_age Always - 1425

198 Offline_Uncorrectable 0x0030 197 194 000 Old_age Offline - 1111

On modern drives there are usually about 2000 spare sectors. You've got 1425 sectors pending reallocation, which will happen the next time they are written. What you are seeing in the syslog are the errors logged every time an attempt is made to read one of them.

The drive needs to be replaced. Do not wait for it to say it has failed. Start the RMA process now. With so many read errors on the parity disk, you would not be able to recover if a data disk failed as well.
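If you want to run the same check on the other drives, something like the sketch below would do it. It is only an illustration: it assumes the attribute table has the same ten-column layout as the report posted above, and the names WATCHED and nonzero_watched are made up for this example. It flags the two attributes quoted above whenever their raw value is above zero.

```python
# A few rows copied from the SMART report posted earlier in the thread.
SMART_REPORT = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   196   196   000    Old_age   Always       -       1425
198 Offline_Uncorrectable   0x0030   197   194   000    Old_age   Offline      -       1111
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# The two attributes singled out above; extend the set if you want more.
WATCHED = {"Current_Pending_Sector", "Offline_Uncorrectable"}


def nonzero_watched(report):
    """Return {attribute_name: raw_value} for watched attributes above 0."""
    flagged = {}
    for line in report.splitlines():
        parts = line.split()
        # Attribute rows have 10 columns and start with the numeric ID.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        name, raw = parts[1], parts[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


flagged = nonzero_watched(SMART_REPORT)
for name, raw in flagged.items():
    print(f"{name}: raw value {raw}")
```

On a healthy drive the result should be empty. For this parity disk it reports both attributes.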

Joe L.

April 21, 2011

Author

Thanks Joe. I checked the rest of the drives and none of them are above 0 for the two attributes you posted, whew. I suspected it was my parity drive, but I wasn't sure what the deal was with the controller resetting.

April 21, 2011

And you are also potentially cooking this one at 53 degrees C:

194 Temperature_Celsius 0x0022 099 081 000 Old_age Always - 53

April 21, 2011

Author

And you are also potentially cooking this one at 53 degrees C:

194 Temperature_Celsius 0x0022 099 081 000 Old_age Always - 53

This was right after the rebuild so 15 drives making heat... It's at 29 C right now.

April 21, 2011

This was right after the rebuild so 15 drives making heat... It's at 29 C right now.

You probably need to take a serious look at your cooling if you're seeing temperature swings that large, from 29 C to 53 C. If one of my drives were that hot, I think I'd be getting pretty worried. That may not be what caused this specific failure, but temps that high, even if only during parity calculations, certainly are not helping you get the most life out of your drives.
