ATA hard resetting link: ata1 and ata8 failing differently. How to diagnose hardware issues?



Hi,

 

I upgraded my drives to 12TB Seagate IronWolfs and WD Reds, but it appears two of the Seagate drives don't enjoy the experience so far. At least I believe it's the drives, not the cables.

 

I only found one similar thread, and in that case the cause was probably power splitters.

I get similar messages at the moment for ata1 and ata8. Parity checks can't complete (speeds drop to kilobytes per second), so something's not right.

 

Using `lshw -class disk -short` and `ls -l /sys/class/ata_port/`, I believe I identified the culprits:
 

* ata1: /dev/sdb ST12000VN0008-2Y  disk1
* ata8: /dev/sdg ST12000VN0008-2Y  parity

 

The spoiler contains the raw output that led me to these conclusions.

 

Spoiler

root@NASminion:~# ls -l /sys/class/ata_port/
total 0
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata1 -> ../../devices/pci0000:00/0000:00:12.0/ata1/ata_port/ata1/
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata2 -> ../../devices/pci0000:00/0000:00:12.0/ata2/ata_port/ata2/
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata3 -> ../../devices/pci0000:00/0000:00:13.0/0000:01:00.0/ata3/ata_port/ata3/
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata4 -> ../../devices/pci0000:00/0000:00:13.0/0000:01:00.0/ata4/ata_port/ata4/
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata5 -> ../../devices/pci0000:00/0000:00:13.0/0000:01:00.0/ata5/ata_port/ata5/
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata6 -> ../../devices/pci0000:00/0000:00:13.0/0000:01:00.0/ata6/ata_port/ata6/
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata7 -> ../../devices/pci0000:00/0000:00:13.3/0000:04:00.0/ata7/ata_port/ata7/
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata8 -> ../../devices/pci0000:00/0000:00:13.3/0000:04:00.0/ata8/ata_port/ata8/

root@NASminion:~# ls -l /dev/disk/by-path/
total 0
lrwxrwxrwx 1 root root  9 Jan  8 12:40 pci-0000:00:15.0-usb-0:6:1.0-scsi-0:0:0:0 -> ../../sda
lrwxrwxrwx 1 root root 10 Jan  8 12:40 pci-0000:00:15.0-usb-0:6:1.0-scsi-0:0:0:0-part1 -> ../../sda1

root@NASminion:~# ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root  9 Jan  8 12:40 ata-ST12000VN0008-2YS101_ZRT0N4MB -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan  8 12:40 ata-ST12000VN0008-2YS101_ZRT0N4MB-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 ata-ST12000VN0008-2YS101_ZRT0NAYD -> ../../sdd
lrwxrwxrwx 1 root root 10 Jan  8 12:40 ata-ST12000VN0008-2YS101_ZRT0NAYD-part1 -> ../../sdd1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 ata-ST12000VN0008-2YS101_ZRT0NB3G -> ../../sdg
lrwxrwxrwx 1 root root 10 Jan  8 12:40 ata-ST12000VN0008-2YS101_ZRT0NB3G-part1 -> ../../sdg1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 ata-SanDisk_SDSSDH3_1T00_21410P459108 -> ../../sdc
lrwxrwxrwx 1 root root 10 Jan  8 12:40 ata-SanDisk_SDSSDH3_1T00_21410P459108-part1 -> ../../sdc1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 ata-WDC_WD120EFBX-68B0EN0_D7H7R62N -> ../../sde
lrwxrwxrwx 1 root root 10 Jan  8 12:40 ata-WDC_WD120EFBX-68B0EN0_D7H7R62N-part1 -> ../../sde1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 ata-WDC_WD120EFBX-68B0EN0_D7H8MTAN -> ../../sdf
lrwxrwxrwx 1 root root 10 Jan  8 12:40 ata-WDC_WD120EFBX-68B0EN0_D7H8MTAN-part1 -> ../../sdf1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 usb-Samsung_Flash_Drive_0376020110009682-0:0 -> ../../sda
lrwxrwxrwx 1 root root 10 Jan  8 12:40 usb-Samsung_Flash_Drive_0376020110009682-0:0-part1 -> ../../sda1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 wwn-0x5000c500e6c74c64 -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan  8 12:40 wwn-0x5000c500e6c74c64-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 wwn-0x5000c500e6c7a241 -> ../../sdg
lrwxrwxrwx 1 root root 10 Jan  8 12:40 wwn-0x5000c500e6c7a241-part1 -> ../../sdg1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 wwn-0x5000c500e6c7ae94 -> ../../sdd
lrwxrwxrwx 1 root root 10 Jan  8 12:40 wwn-0x5000c500e6c7ae94-part1 -> ../../sdd1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 wwn-0x5000cca2dfd1996c -> ../../sde
lrwxrwxrwx 1 root root 10 Jan  8 12:40 wwn-0x5000cca2dfd1996c-part1 -> ../../sde1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 wwn-0x5000cca2dfd204be -> ../../sdf
lrwxrwxrwx 1 root root 10 Jan  8 12:40 wwn-0x5000cca2dfd204be-part1 -> ../../sdf1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 wwn-0x5001b444a7f24540 -> ../../sdc
lrwxrwxrwx 1 root root 10 Jan  8 12:40 wwn-0x5001b444a7f24540-part1 -> ../../sdc1
 

root@NASminion:~# lshw -class disk -short
H/W path           Device     Class          Description
========================================================
/0/100/12/0        /dev/sdb   disk           12TB ST12000VN0008-2Y
/0/100/12/1        /dev/sdc   disk           1TB SanDisk SDSSDH3
/0/100/13/0/0      /dev/sdd   disk           12TB ST12000VN0008-2Y
/0/100/13/0/0/0    /dev/sdd   disk           12TB
/0/100/13/0/1      /dev/sde   disk           12TB WDC WD120EFBX-68
/0/100/13/0/1/0    /dev/sde   disk           12TB
/0/100/13.3/0/0    /dev/sdf   disk           12TB WDC WD120EFBX-68
/0/100/13.3/0/1    /dev/sdg   disk           12TB ST12000VN0008-2Y
/0/6/0.0.0         /dev/sda   disk           32GB Flash Drive
/0/6/0.0.0/0       /dev/sda   disk           32GB
 

root@NASminion:~# ls -l /sys/block
total 0
[...]
lrwxrwxrwx 1 root root 0 Jan  8 12:51 sda -> ../devices/pci0000:00/0000:00:15.0/usb1/1-6/1-6:1.0/host0/target0:0:0/0:0:0:0/block/sda/
lrwxrwxrwx 1 root root 0 Jan  9 09:25 sdb -> ../devices/pci0000:00/0000:00:12.0/ata1/host1/target1:0:0/1:0:0:0/block/sdb/
lrwxrwxrwx 1 root root 0 Jan  8 12:40 sdc -> ../devices/pci0000:00/0000:00:12.0/ata2/host2/target2:0:0/2:0:0:0/block/sdc/
lrwxrwxrwx 1 root root 0 Jan  9 09:25 sdd -> ../devices/pci0000:00/0000:00:13.0/0000:01:00.0/ata3/host3/target3:0:0/3:0:0:0/block/sdd/
lrwxrwxrwx 1 root root 0 Jan  9 09:25 sde -> ../devices/pci0000:00/0000:00:13.0/0000:01:00.0/ata4/host4/target4:0:0/4:0:0:0/block/sde/
lrwxrwxrwx 1 root root 0 Jan  9 09:25 sdf -> ../devices/pci0000:00/0000:00:13.3/0000:04:00.0/ata7/host7/target7:0:0/7:0:0:0/block/sdf/
lrwxrwxrwx 1 root root 0 Jan  9 09:25 sdg -> ../devices/pci0000:00/0000:00:13.3/0000:04:00.0/ata8/host8/target8:0:0/8:0:0:0/block/sdg/
 

 

 

It's quite tricky to map the ataX numbers to the physical drives I wired up 😬
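In case it helps anyone else with the mapping: a minimal sketch that walks /sys/block and pulls the ataX port out of each resolved sysfs path (sda is the USB stick and has no ATA port, so it falls through to the default):

for dev in /sys/block/sd*; do
  path=$(readlink -f "$dev")                      # resolve the symlink to the full device path
  ata=$(grep -o 'ata[0-9]*' <<< "$path" | head -n1)  # extract the first ataN component, if any
  echo "${ata:-no-ata} -> /dev/$(basename "$dev")"
done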

 

Parity fails a bit differently than disk1, with "found unknown device (class 0)" errors at the end of a failure cycle.

 

  • I've removed disk1 from the Global Shares so no more data gets put onto it.
  • I've checked the SATA cables. One actually did 'click' in a bit deeper, but that didn't change anything.
  • SMART reports show no errors.
  • I've run `xfs_repair`, and it found issues on disk1, but on no other data drive.
    • Parity can't be checked that way because it has no file system. But I'd love to! Any ideas? (One read-only approach is sketched right after this list.)
  • I've used the unBALANCE app to move all but 100 GB of data off the 12TB drive (it was very new and not very full). During the process, the log filled with errors like these again, and the drive audibly clicks and seems to spin up again and again.
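On the parity question: since the parity drive has no file system, the closest thing to a check I can think of is a raw, read-only surface scan of the whole device (device name taken from the listings above, i.e. parity = /dev/sdg; I'd run this only with the array stopped):

badblocks -b 4096 -sv /dev/sdg                      # read-only by default; -s shows progress, -v is verbose
dd if=/dev/sdg of=/dev/null bs=1M status=progress   # or just stream the device and watch the log for I/O errors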

 

Someone suggested the failing part might also be the SATA card in the PCIe slot. I wouldn't know how to test that, though, without 'blindly' buying replacement hardware.
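The sysfs listing above at least shows which controller each port hangs off, so a cheap first step would be to look up the three controllers by their PCI addresses; the vendor/model that lspci prints should reveal which one is the add-in card. If I'm reading the paths right, ata1 and ata8 are even behind different controllers (00:12.0 vs 04:00.0), so a single bad card wouldn't explain both, but I may be misreading it:

lspci -s 00:12.0   # controller behind ata1/ata2 (disk1 is on ata1)
lspci -s 01:00.0   # controller behind ata3-ata6
lspci -s 04:00.0   # controller behind ata7/ata8 (parity is on ata8)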

 

Anything else I might try to diagnose? 

 

The log is in the spoiler below:

 

Spoiler
Jan  9 22:45:00 NASminion kernel: ata1.00: exception Emask 0x50 SAct 0x100020 SErr 0x4890800 action 0xe frozen
Jan  9 22:45:00 NASminion kernel: ata1.00: irq_stat 0x0c400040, interface fatal error, connection status changed
Jan  9 22:45:00 NASminion kernel: ata1: SError: { HostInt PHYRdyChg 10B8B LinkSeq DevExch }
Jan  9 22:45:00 NASminion kernel: ata1.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata1.00: cmd 60/00:28:18:3a:d1/01:00:89:01:00/40 tag 5 ncq dma 131072 in
Jan  9 22:45:00 NASminion kernel:         res 40/00:28:18:3a:d1/00:00:89:01:00/40 Emask 0x50 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata1.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata1.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata1.00: cmd 60/00:a0:18:3b:d1/01:00:89:01:00/40 tag 20 ncq dma 131072 in
Jan  9 22:45:00 NASminion kernel:         res 40/00:28:18:3a:d1/00:00:89:01:00/40 Emask 0x50 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata1.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata1: hard resetting link
Jan  9 22:45:00 NASminion kernel: ata8.00: exception Emask 0x10 SAct 0x1ffc00 SErr 0x10002 action 0xe frozen
Jan  9 22:45:00 NASminion kernel: ata8.00: irq_stat 0x00400000, PHY RDY changed
Jan  9 22:45:00 NASminion kernel: ata8: SError: { RecovComm PHYRdyChg }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 60/40:50:90:43:cb/05:00:53:02:00/40 tag 10 ncq dma 688128 in
Jan  9 22:45:00 NASminion kernel:         res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 60/c8:58:d0:48:cb/03:00:53:02:00/40 tag 11 ncq dma 495616 in
Jan  9 22:45:00 NASminion kernel:         res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 60/20:60:98:4c:cb/04:00:53:02:00/40 tag 12 ncq dma 540672 in
Jan  9 22:45:00 NASminion kernel:         res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 60/c8:68:b8:50:cb/02:00:53:02:00/40 tag 13 ncq dma 364544 in
Jan  9 22:45:00 NASminion kernel:         res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 60/28:70:80:53:cb/01:00:53:02:00/40 tag 14 ncq dma 151552 in
Jan  9 22:45:00 NASminion kernel:         res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 60/18:78:a8:54:cb/04:00:53:02:00/40 tag 15 ncq dma 536576 in
Jan  9 22:45:00 NASminion kernel:         res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 61/40:80:38:3a:cb/01:00:53:02:00/40 tag 16 ncq dma 163840 out
Jan  9 22:45:00 NASminion kernel:         res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 60/78:88:c0:58:cb/01:00:53:02:00/40 tag 17 ncq dma 192512 in
Jan  9 22:45:00 NASminion kernel:         res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 61/08:90:78:3b:cb/00:00:53:02:00/40 tag 18 ncq dma 4096 out
Jan  9 22:45:00 NASminion kernel:         res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 61/38:98:80:3b:cb/05:00:53:02:00/40 tag 19 ncq dma 684032 out
Jan  9 22:45:00 NASminion kernel:         res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 61/d8:a0:b8:40:cb/02:00:53:02:00/40 tag 20 ncq dma 372736 out
Jan  9 22:45:00 NASminion kernel:         res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8: hard resetting link
Jan  9 22:45:06 NASminion kernel: ata1: link is slow to respond, please be patient (ready=0)
Jan  9 22:45:06 NASminion kernel: ata8: link is slow to respond, please be patient (ready=0)
Jan  9 22:45:10 NASminion kernel: ata8: found unknown device (class 0)
Jan  9 22:45:10 NASminion kernel: ata8: found unknown device (class 0)
Jan  9 22:45:10 NASminion kernel: ata8: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan  9 22:45:10 NASminion kernel: ata1: softreset failed (device not ready)
Jan  9 22:45:10 NASminion kernel: ata1: hard resetting link

 

 

I'd appreciate any other hints. How would you interpret these errors, for example?

Solution (posted by JorgeB):
Nothing so far suggests a disk issue to me. The drives are on different controllers, so it's not that either; it looks more like a power/connection issue. I would swap the cables (both power and SATA) between a known good disk, like disk3, and parity, then see where the problem follows.


Thanks for looking @JorgeB

 

So to summarize: the kernel-reported ATA errors are more about unresponsiveness and timeouts than about actual disk errors? I found researching the codes a bit tricky (ChatGPT helped a bit, though).
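One thing I did pick up along the way: the interface CRC error counter in SMART is supposed to increment on cable/connection-level corruption rather than media faults, so it's a quick thing to compare across drives (assuming smartctl, which Unraid already ships):

smartctl -A /dev/sdb | grep -i crc   # disk1
smartctl -A /dev/sdg | grep -i crc   # parity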

 

Did you infer that they use different controllers from the IDs/paths like 13:0..., etc.?

 

I'll check the power cables and swap both them and the SATA cables next.

 

I do wonder whether disk2/3/4 being good are false positives: the errors only occur when I try to access certain parts of disk1, not all of it (90% of the unBALANCE run worked at ~200 MB/s until it slowed to a crawl). I wonder why that would be if it's a power issue.

 

Since each change to disk1 also needs to update parity, is there a way to figure out whether access to disk1 or to parity is the real culprit, or whether both really are affected independently?

 

I would imagine that moving multiple TB of data from a 'good' disk, maybe disk2, should then produce ata8/parity errors but no errors for disk2 (ata7). Would that be a sensible way to verify it's parity on ata8?
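If I run that test, I'd follow the kernel log live while the copy runs, filtered to the two ports involved:

dmesg -wT | grep -E --line-buffered 'ata[78]'   # ata7 = disk2, ata8 = parity; -w follows, -T adds readable timestamps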

7 minutes ago, ctietze said:

the kernel-reported ATA errors are more about unresponsiveness and timeouts than about actual disk errors?

That's what it looks like to me.

 

7 minutes ago, ctietze said:

Did you infer that they use different controllers from the IDs/paths like 13:0..., etc.?

From the diags.

 

 


I replaced the cables (the one to disk1 was awkwardly bent, I noticed) and tried to move the remaining files off the disk with unBALANCE, since accessing them used to trigger the errors.

 

Some files moved overnight, but in the morning XFS warned me to run `xfs_repair`, and it turns out the file system is now unmountable.

 

Jan 18 08:31:38 NASminion kernel: XFS (sdh1): Corruption detected. Unmount and run xfs_repair
Jan 18 08:31:38 NASminion kernel: XFS (sdh1): Internal error xfs_trans_cancel at line 1097 of file fs/xfs/xfs_trans.c.  Caller xfs_efi_item_recover+0x16a/0x1a8 [xfs]
Jan 18 08:31:38 NASminion kernel: CPU: 3 PID: 14957 Comm: mount Tainted: P           O       6.1.64-Unraid #1
Jan 18 08:31:38 NASminion kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J5040-ITX, BIOS P1.60 01/17/2020

 

 

I love the `Hardware name: To Be Filled By O.E.M.` :D 

 

I'm installing a replacement drive in disk1's place now and will try to repair and recover what I can from the old one.


xfs_repair can't find a valid superblock at the moment anyway, so my current plan is to rebuild the replacement disk from parity instead. If the former disk1 spits out the remaining files, that'd be a bonus, but in the worst case we can live without what was left on the disk.


I checked my shell history after the process finished; you're right, I forgot the trailing `1`!
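For anyone finding this later, the mistake reconstructed (device names as in the log above):

xfs_repair /dev/sdh      # wrong: the bare disk has no XFS superblock, so repair finds nothing to work with
xfs_repair -n /dev/sdh1  # right: the partition that holds the file system; -n first for a report-only dry run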

 

Overall, some files ended up in lost+found, which is annoying to sort back out.

 

The new cable to disk1's location in the case seems to have fixed it. So I probably just broke the SATA cable 😵💫

