Failed command & Hard Resetting Link - What is happening?



Hi,

 

I'm getting a lot of "failed command" and "hard resetting link" errors during a parity rebuild. They occur on both the motherboard SATA ports and the SATA cards. Could it be a faulty SATA cable or the power supply? I did have a SATA cable die on me this morning...

 

System: 2x SI-PEX40064 SATA cards and a Sandy Bridge motherboard

 

Feb 17 18:16:12 kernel: virbr0: port 1(virbr0-nic) entered disabled state
Feb 17 18:17:22 kernel: ata14.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 17 18:17:22 kernel: ata14.00: failed command: IDENTIFY DEVICE
Feb 17 18:17:22 kernel: ata14.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 3 pio 512 in
Feb 17 18:17:22 kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 17 18:17:22 kernel: ata14.00: status: { DRDY }
Feb 17 18:17:22 kernel: ata14: hard resetting link
Feb 17 18:17:23 kernel: ata14: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 17 18:17:23 kernel: ata14.00: configured for UDMA/133
Feb 17 18:17:23 kernel: ata14: EH complete
Feb 17 18:17:44 kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 17 18:17:44 kernel: ata7.00: failed command: IDENTIFY DEVICE
Feb 17 18:17:44 kernel: ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 13 pio 512 in
Feb 17 18:17:44 kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 17 18:17:44 kernel: ata7.00: status: { DRDY }
Feb 17 18:17:44 kernel: ata7: hard resetting link
Feb 17 18:17:45 kernel: ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 17 18:17:45 kernel: ata7.00: configured for UDMA/133
Feb 17 18:17:45 kernel: ata7: EH complete
Feb 17 18:18:07 kernel: ata11.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Feb 17 18:18:07 kernel: ata11.00: failed command: SMART
Feb 17 18:18:07 kernel: ata11.00: cmd b0/d1:01:01:4f:c2/00:00:00:00:00/00 tag 16 pio 512 in
Feb 17 18:18:07 kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Feb 17 18:18:07 kernel: ata11.00: status: { DRDY }
Feb 17 18:18:07 kernel: ata11: hard resetting link
Feb 17 18:18:08 kernel: ata11: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 17 18:18:08 kernel: ata11.00: configured for UDMA/133
Feb 17 18:18:08 kernel: ata11: EH complete

 

Thanks!

 


SATA cable problems. Three different ports (ata14, ata7 and ata11) show errors in that syslog snippet alone, though there isn't enough information to identify which cables they correspond to. Check (or, preferably, replace) them all.
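
To get a per-port overview from a longer syslog, something like the following could help. It is only a rough sketch: the function name `summarize` and the two regular expressions are my own, written against the message format in the snippet above, and other kernel versions may word these messages differently.

```python
import re
from collections import defaultdict

# Patterns modelled on the lines posted above, e.g.
#   "kernel: ata14.00: failed command: IDENTIFY DEVICE"
#   "kernel: ata14: hard resetting link"
FAILED = re.compile(r"kernel: (ata\d+)\.\d+: failed command: (.+)$")
RESET = re.compile(r"kernel: (ata\d+): hard resetting link")


def summarize(lines):
    """Return {port: {"failed": [commands], "resets": count}} for syslog lines."""
    ports = defaultdict(lambda: {"failed": [], "resets": 0})
    for line in lines:
        line = line.rstrip()
        match = FAILED.search(line)
        if match:
            ports[match.group(1)]["failed"].append(match.group(2))
            continue
        match = RESET.search(line)
        if match:
            ports[match.group(1)]["resets"] += 1
    return dict(ports)


# A few lines copied from the snippet in this thread.
sample = [
    "Feb 17 18:17:22 kernel: ata14.00: failed command: IDENTIFY DEVICE",
    "Feb 17 18:17:22 kernel: ata14: hard resetting link",
    "Feb 17 18:17:44 kernel: ata7.00: failed command: IDENTIFY DEVICE",
    "Feb 17 18:17:44 kernel: ata7: hard resetting link",
    "Feb 17 18:18:07 kernel: ata11.00: failed command: SMART",
    "Feb 17 18:18:08 kernel: ata11: EH complete",
]

for port, info in sorted(summarize(sample).items()):
    print(port, "resets:", info["resets"], "failed:", ", ".join(info["failed"]))
```

An open file object can be passed to `summarize` in place of the list. The result only tells you which ataN ports are affected; matching a port to a physical cable still has to be done by hand.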

 

You mention your power supply too. Do you have doubts about it? If so, consider replacing it. Faulty power supplies cause all sorts of obscure problems. A quality power supply with a single (and therefore fairly high-current) +12 V rail of adequate capacity is pretty much essential for unRAID.

 


Hey John,

 

I did a bit more research, and it appears that if I plug the drives directly into the motherboard, everything works fine (same SATA and power cables). If I plug them into the PCIe cards (SI-PEX40064), I periodically get these SMART and IDENTIFY DEVICE errors. Could it be because the drives are WD Greens (slow to spin up), or is the card just slow?


That card is not ideal because it connects up to four disks to a single PCIe lane, but in normal use you shouldn't really notice the difference, except during a parity check. So use up the motherboard ports first, and put your parity and cache disks on the motherboard. The error messages are about the controller failing to communicate with the disks and resetting the SATA link, so the first thing to look at is the cables. However, your card is based on a Marvell chip and is susceptible to a somewhat obscure bug; without further information about your system I can't tell whether it's affected. Post your diagnostics (Tools -> Diagnostics).
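
To put a rough number on the single-lane limit, here is a back-of-the-envelope sketch. It assumes the card runs as PCIe 2.0 x1 (5 GT/s with 8b/10b encoding); that generation is an assumption on my part, so check the card's specifications. The function name is made up for illustration, and real throughput will be lower once protocol overhead is counted.

```python
def per_disk_ceiling_mb_s(gt_per_s, payload_bits, line_bits, disks):
    """Theoretical per-disk share of one PCIe lane, in MB/s.

    gt_per_s     -- raw lane rate in GT/s (5 for PCIe 2.0, an assumption here)
    payload_bits -- data bits per encoded symbol (8 for 8b/10b)
    line_bits    -- bits on the wire per encoded symbol (10 for 8b/10b)
    disks        -- number of disks transferring at the same time
    """
    # GT/s -> megabits/s of payload -> megabytes/s for the whole lane
    lane_mb_s = gt_per_s * 1000 * payload_bits / line_bits / 8
    return lane_mb_s / disks


# Four disks sharing one assumed PCIe 2.0 lane during a parity check.
print(per_disk_ceiling_mb_s(5, 8, 10, 4))
```

Under those assumptions each of four disks gets at most about 125 MB/s when all are read at once, which is why the card matters mainly during a parity check or rebuild and hardly at all in normal use.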

 


I went and bought another PEX40064 from a store instead of eBay, and I'm not running into the same issue. My guess is that the eBay cards are either fakes or from a batch with different firmware. I'll keep an eye on things for the next 48 hours, but so far I haven't seen the timeout errors I got with the previous two cards from eBay.
