RC3 red-balls then blue-balls parity



I have 3 servers, one of which is running RC3, with no problems until last night.  This is a backup server that rsyncs files from the main server, which is running 5b14.  During a 100 GB rsync, the parity drive red-balled.  After restarting the server, the parity drive was blue-balled.

 

This may be a random occurrence, and I'll be upgrading to the latest RC after the parity drive rebuilds, but during the long rsync I completely lost connection to the parity drive, at which point it was red-balled, yet the server completed the task and remained responsive.  Then, after a successful reboot, the same parity drive showed up as a new drive (blue-balled).  It started rebuilding parity and is almost complete.

 

Here is part of the syslog from the time of the failure; the full syslog is attached.

 

Jun  9 23:48:08 Tower1 kernel: ata2: exception Emask 0x10 SAct 0x0 SErr 0x90202 action 0xe frozen
Jun  9 23:48:08 Tower1 kernel: ata2: irq_stat 0x00400000, PHY RDY changed
Jun  9 23:48:08 Tower1 kernel: ata2: SError: { RecovComm Persist PHYRdyChg 10B8B }
Jun  9 23:48:08 Tower1 kernel: ata2: hard resetting link
Jun  9 23:48:18 Tower1 kernel: ata2: softreset failed (1st FIS failed)
Jun  9 23:48:18 Tower1 kernel: ata2: hard resetting link
Jun  9 23:48:28 Tower1 kernel: ata2: softreset failed (1st FIS failed)
Jun  9 23:48:28 Tower1 kernel: ata2: hard resetting link
Jun  9 23:49:03 Tower1 kernel: ata2: softreset failed (1st FIS failed)
Jun  9 23:49:03 Tower1 kernel: ata2: limiting SATA link speed to 1.5 Gbps
Jun  9 23:49:03 Tower1 kernel: ata2: hard resetting link
Jun  9 23:49:08 Tower1 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jun  9 23:49:08 Tower1 kernel: ata2.00: link online but device misclassifed
Jun  9 23:49:13 Tower1 kernel: ata2.00: qc timeout (cmd 0xec)
Jun  9 23:49:13 Tower1 kernel: ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun  9 23:49:13 Tower1 kernel: ata2.00: revalidation failed (errno=-5)
Jun  9 23:49:13 Tower1 kernel: ata2: hard resetting link
Jun  9 23:49:23 Tower1 kernel: ata2: softreset failed (1st FIS failed)
Jun  9 23:49:23 Tower1 kernel: ata2: hard resetting link
Jun  9 23:49:33 Tower1 kernel: ata2: softreset failed (1st FIS failed)
Jun  9 23:49:33 Tower1 kernel: ata2: hard resetting link
Jun  9 23:50:08 Tower1 kernel: ata2: softreset failed (1st FIS failed)
Jun  9 23:50:08 Tower1 kernel: ata2: hard resetting link
Jun  9 23:50:14 Tower1 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jun  9 23:50:14 Tower1 kernel: ata2.00: link online but device misclassifed
Jun  9 23:50:24 Tower1 kernel: ata2.00: qc timeout (cmd 0xec)
Jun  9 23:50:24 Tower1 kernel: ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun  9 23:50:24 Tower1 kernel: ata2.00: revalidation failed (errno=-5)
Jun  9 23:50:24 Tower1 kernel: ata2: hard resetting link
Jun  9 23:50:34 Tower1 kernel: ata2: softreset failed (1st FIS failed)
Jun  9 23:50:34 Tower1 kernel: ata2: hard resetting link
Jun  9 23:50:44 Tower1 kernel: ata2: softreset failed (1st FIS failed)
Jun  9 23:50:44 Tower1 kernel: ata2: hard resetting link
Jun  9 23:51:19 Tower1 kernel: ata2: softreset failed (1st FIS failed)
Jun  9 23:51:19 Tower1 kernel: ata2: hard resetting link
Jun  9 23:51:24 Tower1 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jun  9 23:51:24 Tower1 kernel: ata2.00: link online but device misclassifed
Jun  9 23:51:54 Tower1 kernel: ata2.00: qc timeout (cmd 0xec)
Jun  9 23:51:54 Tower1 kernel: ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jun  9 23:51:54 Tower1 kernel: ata2.00: revalidation failed (errno=-5)
Jun  9 23:51:54 Tower1 kernel: ata2.00: disabled
Jun  9 23:51:54 Tower1 kernel: ata2: hard resetting link
Jun  9 23:51:54 Tower1 kernel: sd 1:0:0:0: rejecting I/O to offline device
Jun  9 23:51:54 Tower1 kernel: sd 1:0:0:0: [sdb] killing request
Jun  9 23:51:54 Tower1 kernel: sd 1:0:0:0: rejecting I/O to offline device
Jun  9 23:51:54 Tower1 last message repeated 12 times
Jun  9 23:51:54 Tower1 kernel: sd 1:0:0:0: [sdb] Unhandled error code
Jun  9 23:51:54 Tower1 kernel: sd 1:0:0:0: [sdb]  Result: hostbyte=0x01 driverbyte=0x00
Jun  9 23:51:54 Tower1 kernel: sd 1:0:0:0: [sdb] CDB: cdb[0]=0x28: 28 00 1a bd 6c 80 00 00 08 00
Jun  9 23:51:54 Tower1 kernel: end_request: I/O error, dev sdb, sector 448621696
Jun  9 23:51:54 Tower1 kernel: md: disk0 read error
Jun  9 23:51:54 Tower1 kernel: handle_stripe read error: 448621640/0, count: 1
Jun  9 23:51:54 Tower1 kernel: md: disk0 read error
Jun  9 23:51:54 Tower1 kernel: handle_stripe read error: 448621648/0, count: 1
Jun  9 23:51:54 Tower1 kernel: md: disk0 read error
Jun  9 23:51:54 Tower1 kernel: handle_stripe read error: 448621656/0, count: 1
Jun  9 23:51:54 Tower1 kernel: md: disk0 read error
Jun  9 23:51:54 Tower1 kernel: handle_stripe read error: 448621664/0, count: 1
Jun  9 23:51:54 Tower1 kernel: md: disk0 read error
Jun  9 23:51:54 Tower1 kernel: handle_stripe read error: 448621672/0, count: 1
Jun  9 23:51:54 Tower1 kernel: md: disk0 read error
Jun  9 23:51:54 Tower1 kernel: handle_stripe read error: 448621680/0, count: 1
Jun  9 23:51:54 Tower1 kernel: md: disk0 read error
Jun  9 23:51:54 Tower1 kernel: handle_stripe read error: 448621688/0, count: 1
Jun  9 23:51:54 Tower1 kernel: md: disk0 read error
Jun  9 23:51:54 Tower1 kernel: handle_stripe read error: 448621696/0, count: 1
Jun  9 23:51:54 Tower1 kernel: md: disk0 read error
Jun  9 23:51:54 Tower1 kernel: handle_stripe read error: 449030392/0, count: 1
Jun  9 23:51:54 Tower1 kernel: md: disk0 read error
Jun  9 23:51:54 Tower1 kernel: handle_stripe read error: 449030400/0, count: 1
Jun  9 23:51:54 Tower1 kernel: md: disk0 read error
Jun  9 23:51:54 Tower1 kernel: handle_stripe read error: 449030408/0, count: 1
Jun  9 23:51:54 Tower1 kernel: md: disk0 read error
Jun  9 23:51:54 Tower1 kernel: handle_stripe read error: 449030416/0, count: 1
Jun  9 23:51:54 Tower1 kernel: md: disk0 read error
Jun  9 23:51:54 Tower1 kernel: handle_stripe read error: 449030424/0, count: 1
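
For reference, pulling just these events back out of the full syslog is simple enough.  A rough Python sketch of the idea (the file name and the marker strings are only illustrative, not anything unRAID provides):

# Sketch: print only the lines that make up the ATA failure sequence.
markers = ("hard resetting link", "softreset failed", "failed to IDENTIFY",
           "disabled", "rejecting I/O", "md: disk0 read error")

with open("syslog-20120610.txt") as f:   # hypothetical file name
    for line in f:
        if any(m in line for m in markers):
            print(line.rstrip())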

syslog-20120610-052021.zip

Link to comment

I moved this to General Support because, in my opinion, it appears to be a hardware issue, in no way related to any particular version of the Linux kernel.  If for any reason it does turn out to be directly or indirectly related to an RC, I'll be happy to move it back.

 

As you noted, the kernel completely lost contact with the drive, and after trying to reset the channel it was only able to re-establish the low-level SATA link layer, not the higher-level drive communications.  After repeated attempts were unsuccessful, the kernel gave up on the drive and marked it 'disabled', and so far I have yet to see a drive correctly recover once disabled.  In fact, the only way I know to recover one is to reboot the system, and often that includes powering off and back on.  As you saw, after booting, the drive was found and set up without issue, but unRAID now considers it a new drive, because it considered it removed (it could not see it before), so it gets a blue ball.  In these cases there is usually nothing at all wrong with the drive, just a temporary loss of contact.

 

Unfortunately, the error messages in cases like this never tell you precisely what went wrong, but the possible causes are fairly limited: a loose cable that partially disconnected, a backplane or drive tray that is loose and subject to vibration, a bad or failing power connection or a power spike, or a faulty SATA port.  I would check the cable connections at both ends, SATA and power, and make sure they are seated well and have clean contacts.  If the drive is in a backplane or tray, make sure it cannot vibrate loose.  And last, try a different SATA port.  The cables themselves look good; if they were not, we would have seen different errors than the ones that appear here.

Link to comment

Well, she did it again.  Rebuilt parity, upgraded to RC4 just to make sure.  I was deleting about 50 GB via a Windows SMB connection, and we got another red-ball then blue-ball.  Rebuilding parity now.  Syslog attached again...

 

Here is the SMART history file from just after the last failure.

 

2012-06-10	querytime	1339371725
2012-06-10	health	PASSED
2012-06-10	ATA_Error_Count	
2012-06-10	Raw_Read_Error_Rate	229317346
2012-06-10	Spin_Up_Time	0 
2012-06-10	Start_Stop_Count	44
2012-06-10	Spin_Retry_Count	0 
2012-06-10	Power_Cycle_Count	26
2012-06-10	Reallocated_Sector_Ct	0 
2012-06-10	Seek_Error_Rate	4127147
2012-06-10	Temperature_Celsius	34 (0 19 0 0)
2012-06-10	Current_Pending_Sector	0 
2012-06-10	Power_On_Hours	574
2012-06-10	Offline_Uncorrectable	0 
2012-06-10	UDMA_CRC_Error_Count	0 
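
That history file is just tab-separated date / attribute / value lines, so it is easy to scan programmatically if you want to.  A rough Python sketch (the file name and the list of attributes to watch are my own choices, not anything unRAID defines):

# Sketch: flag the SMART attributes that would actually indicate trouble.
watch = ("Reallocated_Sector_Ct", "Current_Pending_Sector",
         "Offline_Uncorrectable", "UDMA_CRC_Error_Count")

with open("smart-history.txt") as f:     # hypothetical file name
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue                     # skip blank or incomplete lines
        date, attribute, value = fields[0], fields[1], fields[2].strip()
        if attribute in watch and value not in ("", "0"):
            print(date, attribute, value)

With the values above, nothing gets flagged.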

 

syslog-20120612-165305.zip

Link to comment

This second occurrence is EXACTLY the same as the first.  In fact, a file compare of the two syslogs found no differences at all in the error behavior and flags, so my comments above still stand.  That small piece of the SMART report is very incomplete, but from what I can see it looks very good, with no apparent issues.  That matches my experience with error sequences like the ones you have seen: something causes a sudden complete or partial loss of communication with the drive, and the drive itself is usually completely OK, probably still listening fruitlessly for commands to execute.  Only a reboot restores the communication.
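
The comparison itself is easy to reproduce: strip the timestamps off the ATA error lines and diff what is left.  A rough Python sketch of that idea (the file names are placeholders, and the marker strings are just the ones visible in the excerpt above):

# Sketch: compare the ATA error sequences of two syslogs, ignoring the
# timestamps and the sector-level read errors, which would differ anyway.
import difflib

def error_lines(path):
    markers = ("exception Emask", "SError", "hard resetting link",
               "softreset failed", "failed to IDENTIFY", "disabled")
    with open(path) as f:
        # Drop the "Jun  9 23:48:08 Tower1 kernel:" prefix before comparing.
        return [" ".join(line.split()[5:]) for line in f
                if any(m in line for m in markers)]

diff = list(difflib.unified_diff(error_lines("syslog-first.txt"),
                                 error_lines("syslog-second.txt"), lineterm=""))
print("\n".join(diff) if diff else "no differences in the error sequence")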

 

Post a SMART report for that drive.  I'd expect either CRC errors or pending sectors... maybe more.

 

Checking the SMART status of the drive was a good idea, just to be sure the drive itself is fine.  In cases like this I've always found the drive to be completely OK, so I expect that to be the case again, but I tend to forget that the user is still wondering whether the drive works at all!  The SMART report shows the user that their drive is still in good shape.

 

As for CRC errors, they not only increase the drive's CRC error count, they also appear in the syslog with either the BadCRC or ICRC flag raised.  No CRC errors appeared at all.

 

Pending sectors are not likely either, because if the issue were bad-sector related there would be media errors visible in the syslog, and there are none.  In addition, drive communication would have remained completely fine, and it did not in this case.
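
Both checks are easy to make concrete, since those flags show up as plain strings in the kernel messages.  A rough Python sketch (the file name is a placeholder, and the exact media-error wording varies between kernels, so treat those marker strings as approximate):

# Sketch: look for the markers that would point at CRC or bad-sector trouble.
crc_markers   = ("BadCRC", "ICRC")                      # interface CRC errors
media_markers = ("media error", "Medium Error", "UNC")  # approximate; wording varies by kernel

with open("syslog-20120612.txt") as f:                  # hypothetical file name
    text = f.read()

print("CRC error markers found:  ", any(m in text for m in crc_markers))
print("Media error markers found:", any(m in text for m in media_markers))

Neither turns up in the syslogs here, which is the point.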

Link to comment

Thanks for the in-depth analysis.

 

Unfortunately, the error messages in cases like this never tell you precisely what went wrong, but the possible causes are fairly limited: a loose cable that partially disconnected, a backplane or drive tray that is loose and subject to vibration, a bad or failing power connection or a power spike, or a faulty SATA port.  I would check the cable connections at both ends, SATA and power, and make sure they are seated well and have clean contacts.  If the drive is in a backplane or tray, make sure it cannot vibrate loose.  And last, try a different SATA port.  The cables themselves look good; if they were not, we would have seen different errors than the ones that appear here.

 

We have no backplane; the drives are directly connected to the six motherboard SATA ports of an Asus 475M board (there is a two-port card that one of the drives is attached to, but not the failed parity drive).  Once parity is recalculated, I'll reseat all the cables to this drive.

Link to comment

Once parity is recalculated, I'll reseat all the cables to this drive.

 

Tower1 parity completed successfully, and reseating all cables to this drive resulted in the drive not coming alive at all after the next reboot.  Swapped the motherboard SATA port, and still no joy.  Swapped the power connection, and we are back in business.  So the last lead on that power connection was going bad, and the wiggling involved in reseating the cable broke it completely...

Link to comment
