ctietze Posted January 9

Hi, I upgraded my drives to 12 TB Seagate IronWolfs and WD Reds, but it appears two of the Seagate drives don't enjoy the experience so far. At least I believe it's the drives, not the cables. I only found one similar thread, and that one turned out to be power splitters back then.

I currently get similar messages for ata1 and ata8, and parity checks can't complete (they crawl along at kilobytes per second), so something's not right. I believe I identified the culprits via `lshw -class disk -short` and `ls -l /sys/class/ata_port/`:

* ata1: /dev/sdb ST12000VN0008-2Y disk1
* ata8: /dev/sdg ST12000VN0008-2Y parity

The spoiler contains the raw output that led me to these conclusions.

Spoiler

```
root@NASminion:~# ls -l /sys/class/ata_port/
total 0
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata1 -> ../../devices/pci0000:00/0000:00:12.0/ata1/ata_port/ata1/
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata2 -> ../../devices/pci0000:00/0000:00:12.0/ata2/ata_port/ata2/
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata3 -> ../../devices/pci0000:00/0000:00:13.0/0000:01:00.0/ata3/ata_port/ata3/
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata4 -> ../../devices/pci0000:00/0000:00:13.0/0000:01:00.0/ata4/ata_port/ata4/
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata5 -> ../../devices/pci0000:00/0000:00:13.0/0000:01:00.0/ata5/ata_port/ata5/
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata6 -> ../../devices/pci0000:00/0000:00:13.0/0000:01:00.0/ata6/ata_port/ata6/
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata7 -> ../../devices/pci0000:00/0000:00:13.3/0000:04:00.0/ata7/ata_port/ata7/
lrwxrwxrwx 1 root root 0 Jan  9 09:11 ata8 -> ../../devices/pci0000:00/0000:00:13.3/0000:04:00.0/ata8/ata_port/ata8/

root@NASminion:~# ls -l /dev/disk/by-path/
total 0
lrwxrwxrwx 1 root root  9 Jan  8 12:40 pci-0000:00:15.0-usb-0:6:1.0-scsi-0:0:0:0 -> ../../sda
lrwxrwxrwx 1 root root 10 Jan  8 12:40 pci-0000:00:15.0-usb-0:6:1.0-scsi-0:0:0:0-part1 -> ../../sda1

root@NASminion:~# ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root  9 Jan  8 12:40 ata-ST12000VN0008-2YS101_ZRT0N4MB -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan  8 12:40 ata-ST12000VN0008-2YS101_ZRT0N4MB-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 ata-ST12000VN0008-2YS101_ZRT0NAYD -> ../../sdd
lrwxrwxrwx 1 root root 10 Jan  8 12:40 ata-ST12000VN0008-2YS101_ZRT0NAYD-part1 -> ../../sdd1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 ata-ST12000VN0008-2YS101_ZRT0NB3G -> ../../sdg
lrwxrwxrwx 1 root root 10 Jan  8 12:40 ata-ST12000VN0008-2YS101_ZRT0NB3G-part1 -> ../../sdg1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 ata-SanDisk_SDSSDH3_1T00_21410P459108 -> ../../sdc
lrwxrwxrwx 1 root root 10 Jan  8 12:40 ata-SanDisk_SDSSDH3_1T00_21410P459108-part1 -> ../../sdc1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 ata-WDC_WD120EFBX-68B0EN0_D7H7R62N -> ../../sde
lrwxrwxrwx 1 root root 10 Jan  8 12:40 ata-WDC_WD120EFBX-68B0EN0_D7H7R62N-part1 -> ../../sde1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 ata-WDC_WD120EFBX-68B0EN0_D7H8MTAN -> ../../sdf
lrwxrwxrwx 1 root root 10 Jan  8 12:40 ata-WDC_WD120EFBX-68B0EN0_D7H8MTAN-part1 -> ../../sdf1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 usb-Samsung_Flash_Drive_0376020110009682-0:0 -> ../../sda
lrwxrwxrwx 1 root root 10 Jan  8 12:40 usb-Samsung_Flash_Drive_0376020110009682-0:0-part1 -> ../../sda1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 wwn-0x5000c500e6c74c64 -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan  8 12:40 wwn-0x5000c500e6c74c64-part1 -> ../../sdb1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 wwn-0x5000c500e6c7a241 -> ../../sdg
lrwxrwxrwx 1 root root 10 Jan  8 12:40 wwn-0x5000c500e6c7a241-part1 -> ../../sdg1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 wwn-0x5000c500e6c7ae94 -> ../../sdd
lrwxrwxrwx 1 root root 10 Jan  8 12:40 wwn-0x5000c500e6c7ae94-part1 -> ../../sdd1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 wwn-0x5000cca2dfd1996c -> ../../sde
lrwxrwxrwx 1 root root 10 Jan  8 12:40 wwn-0x5000cca2dfd1996c-part1 -> ../../sde1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 wwn-0x5000cca2dfd204be -> ../../sdf
lrwxrwxrwx 1 root root 10 Jan  8 12:40 wwn-0x5000cca2dfd204be-part1 -> ../../sdf1
lrwxrwxrwx 1 root root  9 Jan  8 12:40 wwn-0x5001b444a7f24540 -> ../../sdc
lrwxrwxrwx 1 root root 10 Jan  8 12:40 wwn-0x5001b444a7f24540-part1 -> ../../sdc1

root@NASminion:~# lshw -class disk -short
H/W path         Device     Class  Description
========================================================
/0/100/12/0      /dev/sdb   disk   12TB ST12000VN0008-2Y
/0/100/12/1      /dev/sdc   disk   1TB SanDisk SDSSDH3
/0/100/13/0/0    /dev/sdd   disk   12TB ST12000VN0008-2Y
/0/100/13/0/0/0  /dev/sdd   disk   12TB
/0/100/13/0/1    /dev/sde   disk   12TB WDC WD120EFBX-68
/0/100/13/0/1/0  /dev/sde   disk   12TB
/0/100/13.3/0/0  /dev/sdf   disk   12TB WDC WD120EFBX-68
/0/100/13.3/0/1  /dev/sdg   disk   12TB ST12000VN0008-2Y
/0/6/0.0.0       /dev/sda   disk   32GB Flash Drive
/0/6/0.0.0/0     /dev/sda   disk   32GB

root@NASminion:~# ls -l /sys/block
total 0
[...]
lrwxrwxrwx 1 root root 0 Jan  8 12:51 sda -> ../devices/pci0000:00/0000:00:15.0/usb1/1-6/1-6:1.0/host0/target0:0:0/0:0:0:0/block/sda/
lrwxrwxrwx 1 root root 0 Jan  9 09:25 sdb -> ../devices/pci0000:00/0000:00:12.0/ata1/host1/target1:0:0/1:0:0:0/block/sdb/
lrwxrwxrwx 1 root root 0 Jan  8 12:40 sdc -> ../devices/pci0000:00/0000:00:12.0/ata2/host2/target2:0:0/2:0:0:0/block/sdc/
lrwxrwxrwx 1 root root 0 Jan  9 09:25 sdd -> ../devices/pci0000:00/0000:00:13.0/0000:01:00.0/ata3/host3/target3:0:0/3:0:0:0/block/sdd/
lrwxrwxrwx 1 root root 0 Jan  9 09:25 sde -> ../devices/pci0000:00/0000:00:13.0/0000:01:00.0/ata4/host4/target4:0:0/4:0:0:0/block/sde/
lrwxrwxrwx 1 root root 0 Jan  9 09:25 sdf -> ../devices/pci0000:00/0000:00:13.3/0000:04:00.0/ata7/host7/target7:0:0/7:0:0:0/block/sdf/
lrwxrwxrwx 1 root root 0 Jan  9 09:25 sdg -> ../devices/pci0000:00/0000:00:13.3/0000:04:00.0/ata8/host8/target8:0:0/8:0:0:0/block/sdg/
```

It's quite tricky to identify the ataX numbers, and to match them to the drives as I wired them 😬 Parity fails a bit differently than disk1, with "found unknown device (class 0)" errors at the end of each failure cycle. I've removed disk1 from the Global Shares so no more data lands on it. I've checked the SATA cables.
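Matching ataX ports to /dev/sdX names can be scripted instead of eyeballed. A minimal sketch, assuming a standard Linux sysfs layout like the one above (the loop and its output format are mine, not an Unraid tool):

```shell
#!/bin/sh
# Map each SATA block device to the ATA port embedded in its sysfs path.
# The ataN component appears in the resolved symlink target, e.g.
# .../0000:00:12.0/ata1/host1/target1:0:0/1:0:0:0/block/sdb
for link in /sys/block/sd*; do
    [ -e "$link" ] || continue
    target=$(readlink -f "$link")
    # Pull the first ataN component out of the path, if there is one
    # (USB sticks like sda have none and are skipped).
    port=$(printf '%s\n' "$target" | grep -oE 'ata[0-9]+' | head -n1)
    [ -n "$port" ] && printf '%s -> %s\n' "${link##*/}" "$port"
done
```

On the system above this would print lines like `sdb -> ata1` through `sdg -> ata8`, which is exactly the mapping worked out by hand in the spoiler.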
One did 'click' a bit deeper when reseated, actually, but that didn't change anything. The SMART reports show no errors. I've run `xfs_repair` and it found issues on disk1, but on no other data drive. Parity can't be checked that way because it has no file system, but I'd love to! Any ideas?

I've used the unBALANCE app to move all but 100 GB of data off the 12 TB drive (it was very new and not very full). During the process, the log spawned problems like this again, and the device audibly clicks and seems to spin up a disk again and again.

Someone suggested this might also be a failing SATA card in the PCIe slot. I wouldn't know how to test that, though, without 'blindly' buying replacement equipment. Anything else I might try to diagnose? Log in the spoiler:

Spoiler

```
Jan  9 22:45:00 NASminion kernel: ata1.00: exception Emask 0x50 SAct 0x100020 SErr 0x4890800 action 0xe frozen
Jan  9 22:45:00 NASminion kernel: ata1.00: irq_stat 0x0c400040, interface fatal error, connection status changed
Jan  9 22:45:00 NASminion kernel: ata1: SError: { HostInt PHYRdyChg 10B8B LinkSeq DevExch }
Jan  9 22:45:00 NASminion kernel: ata1.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata1.00: cmd 60/00:28:18:3a:d1/01:00:89:01:00/40 tag 5 ncq dma 131072 in
Jan  9 22:45:00 NASminion kernel:          res 40/00:28:18:3a:d1/00:00:89:01:00/40 Emask 0x50 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata1.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata1.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata1.00: cmd 60/00:a0:18:3b:d1/01:00:89:01:00/40 tag 20 ncq dma 131072 in
Jan  9 22:45:00 NASminion kernel:          res 40/00:28:18:3a:d1/00:00:89:01:00/40 Emask 0x50 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata1.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata1: hard resetting link
Jan  9 22:45:00 NASminion kernel: ata8.00: exception Emask 0x10 SAct 0x1ffc00 SErr 0x10002 action 0xe frozen
Jan  9 22:45:00 NASminion kernel: ata8.00: irq_stat 0x00400000, PHY RDY changed
Jan  9 22:45:00 NASminion kernel: ata8: SError: { RecovComm PHYRdyChg }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 60/40:50:90:43:cb/05:00:53:02:00/40 tag 10 ncq dma 688128 in
Jan  9 22:45:00 NASminion kernel:          res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 60/c8:58:d0:48:cb/03:00:53:02:00/40 tag 11 ncq dma 495616 in
Jan  9 22:45:00 NASminion kernel:          res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 60/20:60:98:4c:cb/04:00:53:02:00/40 tag 12 ncq dma 540672 in
Jan  9 22:45:00 NASminion kernel:          res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 60/c8:68:b8:50:cb/02:00:53:02:00/40 tag 13 ncq dma 364544 in
Jan  9 22:45:00 NASminion kernel:          res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 60/28:70:80:53:cb/01:00:53:02:00/40 tag 14 ncq dma 151552 in
Jan  9 22:45:00 NASminion kernel:          res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 60/18:78:a8:54:cb/04:00:53:02:00/40 tag 15 ncq dma 536576 in
Jan  9 22:45:00 NASminion kernel:          res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 61/40:80:38:3a:cb/01:00:53:02:00/40 tag 16 ncq dma 163840 out
Jan  9 22:45:00 NASminion kernel:          res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: READ FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 60/78:88:c0:58:cb/01:00:53:02:00/40 tag 17 ncq dma 192512 in
Jan  9 22:45:00 NASminion kernel:          res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 61/08:90:78:3b:cb/00:00:53:02:00/40 tag 18 ncq dma 4096 out
Jan  9 22:45:00 NASminion kernel:          res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 61/38:98:80:3b:cb/05:00:53:02:00/40 tag 19 ncq dma 684032 out
Jan  9 22:45:00 NASminion kernel:          res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8.00: failed command: WRITE FPDMA QUEUED
Jan  9 22:45:00 NASminion kernel: ata8.00: cmd 61/d8:a0:b8:40:cb/02:00:53:02:00/40 tag 20 ncq dma 372736 out
Jan  9 22:45:00 NASminion kernel:          res 40/00:98:80:3b:cb/00:00:53:02:00/40 Emask 0x10 (ATA bus error)
Jan  9 22:45:00 NASminion kernel: ata8.00: status: { DRDY }
Jan  9 22:45:00 NASminion kernel: ata8: hard resetting link
Jan  9 22:45:06 NASminion kernel: ata1: link is slow to respond, please be patient (ready=0)
Jan  9 22:45:06 NASminion kernel: ata8: link is slow to respond, please be patient (ready=0)
Jan  9 22:45:10 NASminion kernel: ata8: found unknown device (class 0)
Jan  9 22:45:10 NASminion kernel: ata8: found unknown device (class 0)
Jan  9 22:45:10 NASminion kernel: ata8: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Jan  9 22:45:10 NASminion kernel: ata1: softreset failed (device not ready)
Jan  9 22:45:10 NASminion kernel: ata1: hard resetting link
```

I'd appreciate any other hints: how would you interpret these issues, for example?
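To get a quick overview of which ports are throwing these errors, and how often, one approach is to tally the ataN tokens on the bus-error lines in the kernel log. A sketch under assumptions: the log path and the grep patterns are mine; on Unraid the messages may live in `/var/log/syslog` or only be reachable via `dmesg`.

```shell
#!/bin/sh
# Count "ATA bus error" kernel messages per ATA port.
# /var/log/syslog is an assumption; substitute your syslog path,
# or pipe `dmesg` output in instead.
LOG=/var/log/syslog
[ -r "$LOG" ] || { echo "no readable $LOG, try: dmesg | ..." >&2; exit 0; }
grep 'ATA bus error' "$LOG" \
    | grep -oE 'ata[0-9]+' \
    | sort | uniq -c | sort -rn
```

A port that dominates this tally (here it would be ata8, then ata1) is the place to start swapping cables.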
JorgeB Posted January 10

Please post the complete diagnostics.
ctietze (Author) Posted January 10

Sure thing: nasminion-diagnostics-20240110-1552.zip. Thanks for looking!
JorgeB Posted January 10 (Solution)

Nothing so far suggests a disk issue to me, and the drives are on different controllers, so it's not that either. It looks more like a power/connection issue. I would swap cables (both power and SATA) between a known-good disk, like disk3, and parity, then see where the problem follows.
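A complementary way to watch whether the problem follows a cable swap, without waiting for another multi-hour parity check, is SMART attribute 199 (UDMA_CRC_Error_Count): it counts CRC errors on the SATA link itself, so it rises with bad cables/connectors/ports rather than bad platters. A sketch, assuming smartmontools is installed and the device range matches the system above:

```shell
#!/bin/sh
# Print the raw UDMA_CRC_Error_Count for each array drive.
# If this counter keeps climbing for a drive after a cable swap,
# the fault moved with the cable or stayed with the port/controller,
# not with the drive.
command -v smartctl >/dev/null 2>&1 || { echo "smartmontools not installed" >&2; exit 0; }
for d in /dev/sd[b-g]; do
    [ -e "$d" ] || continue
    printf '%s: ' "$d"
    smartctl -A "$d" | awk '/UDMA_CRC_Error_Count/ {print $NF}'
done
```

Record the counts before and after the swap; they never reset, so only the delta matters.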
ctietze (Author) Posted January 10

Thanks for looking @JorgeB! So to summarize: the kernel-reported ATA errors are more about unresponsiveness and timeouts, not about disk errors? I found researching the codes a bit tricky. (ChatGPT helped a bit, though.) Did you infer that they use different controllers from the IDs/codes like 13:0... etc.?

I'll check the power cables and swap them and the SATA cables next. I do wonder whether disk2/3/4 being good are false positives: the errors only occur when I'm trying to access certain parts of disk1, not all of it (90% of the unbalancing worked at ~200 MB/s until it slowed to a crawl). I wonder why that would be, if it's a power thing.

Since each change to disk1 also has to update parity, is there a way to figure out whether the access to disk1 or to parity is the real culprit, or whether both are affected independently? I would imagine that moving multiple TB of data off a 'good' disk, maybe disk2, should then produce ata8/parity errors, but should not report errors for disk2 (ata7). Would that make sense as a way to verify it's parity on ata8?
JorgeB Posted January 10

> 7 minutes ago, ctietze said: the kernel-reported ATA errors are more about unresponsiveness and timeouts, not about disk errors?

That's what it looks like to me.

> 7 minutes ago, ctietze said: Did you infer that they use different controllers from the IDs/codes like 13:0... etc?

From the diags.
ctietze (Author) Posted January 18 (edited)

Replaced cables (the one to disk1 was awkwardly bent, I noticed) and tried to move the remaining files off the disk with unBALANCE, because accessing them used to trigger the errors. Some files moved overnight, but in the morning XFS warned me I should run `xfs_repair`, and it turns out the file system is now unmountable.

```
Jan 18 08:31:38 NASminion kernel: XFS (sdh1): Corruption detected. Unmount and run xfs_repair
Jan 18 08:31:38 NASminion kernel: XFS (sdh1): Internal error xfs_trans_cancel at line 1097 of file fs/xfs/xfs_trans.c. Caller xfs_efi_item_recover+0x16a/0x1a8 [xfs]
Jan 18 08:31:38 NASminion kernel: CPU: 3 PID: 14957 Comm: mount Tainted: P O 6.1.64-Unraid #1
Jan 18 08:31:38 NASminion kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J5040-ITX, BIOS P1.60 01/17/2020
```

I love the `Hardware name: To Be Filled By O.E.M.` part. I'm installing a replacement drive in disk1's place now and will try to repair and recover what I can from the old one.

Edited January 18 by ctietze
JorgeB Posted January 18

> 3 hours ago, ctietze said: XFS (sdh1)

I assume this is the cache device? If yes, check its filesystem; if not, post new diags.
ctietze (Author) Posted January 18

Actually, no: mounting the former disk1 drive via USB put it into the sdh slot. I should have copied the same log message from before removing the disk. Sorry for the confusion!
JorgeB Posted January 18

Note that if you mount/repair disk1 outside the array, parity will no longer be valid.
ctietze (Author) Posted January 18

`xfs_repair` can't find any block at the moment anyway, so my current plan is to rebuild the replacement disk from parity instead. If the former disk1 spits out the remaining files, that'd be a bonus; in the worst case, we can live without what was left on the disk.
JorgeB Posted January 18

> 2 minutes ago, ctietze said: xfs_repair can't find any block at the moment anyway

How are you running it? If you point it at the whole device it won't work; you need to specify the partition, /dev/sdh1.
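JorgeB's point in command form: `xfs_repair` operates on the partition that holds the filesystem, not on the whole disk, because the whole device has no XFS superblock at offset 0. A hedged sketch; the partition-name guard is my addition, not part of xfs_repair:

```shell
#!/bin/sh
dev=/dev/sdh1   # the partition -- not /dev/sdh, the whole device
# Crude guard against pointing xfs_repair at a whole device by accident:
# SATA partition names end in a digit (this heuristic doesn't cover
# nvme-style names, where the bare device also ends in a digit).
case "$dev" in
    *[0-9]) ;;
    *) echo "refusing: $dev looks like a whole device, not a partition" >&2; exit 1 ;;
esac
# -n is a read-only dry run: report problems without modifying anything.
# Drop the -n for the real repair once the dry run output looks sane.
if command -v xfs_repair >/dev/null 2>&1; then
    xfs_repair -n "$dev"
fi
```

Running the dry run first is cheap insurance: on a disk this damaged, the report tells you how much would land in lost+found before anything is rewritten.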
ctietze (Author) Posted January 24

I checked the history after the process finished -- you're right, I forgot the trailing `1`! Overall, there's been some stuff in lost+found that's annoying to sort back, but the new cable to disk1's location in the case seems to have fixed it. So I probably just had a broken SATA cable 😵💫