May 12, 20233 yr So I upgraded my parity disk to a 16TB Seagate Exos about a week ago, along with expanding my array and adding another cache drive. I both pre-cleared w/ pre-read and post-read, as well as did an extended SMART test on the drive to ensure it was healthy before rebuilding parity on it and also added a second 16TB Exos drive to the array. Tonight I saw a notification that my parity disk is disabled, and checking the unraid dashboard is says it's "reading" at 2.5-3GB/s and had accumulated billions of errors Here is the section of log where the drive gets disabled: May 11 19:26:34 Tower emhttpd: spinning down /dev/sdh May 11 19:29:16 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x190000 SErr 0x0 action 0x6 frozen May 11 19:29:16 Tower kernel: ata3.00: failed command: WRITE FPDMA QUEUED May 11 19:29:16 Tower kernel: ata3.00: cmd 61/c0:80:80:34:dd/00:00:8c:00:00/40 tag 16 ncq dma 98304 out May 11 19:29:16 Tower kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) May 11 19:29:16 Tower kernel: ata3.00: status: { DRDY } May 11 19:29:16 Tower kernel: ata3.00: failed command: WRITE FPDMA QUEUED May 11 19:29:16 Tower kernel: ata3.00: cmd 61/40:98:40:35:dd/05:00:8c:00:00/40 tag 19 ncq dma 688128 out May 11 19:29:16 Tower kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) May 11 19:29:16 Tower kernel: ata3.00: status: { DRDY } May 11 19:29:16 Tower kernel: ata3.00: failed command: WRITE FPDMA QUEUED May 11 19:29:16 Tower kernel: ata3.00: cmd 61/40:a0:80:3a:dd/05:00:8c:00:00/40 tag 20 ncq dma 688128 out May 11 19:29:16 Tower kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) May 11 19:29:16 Tower kernel: ata3.00: status: { DRDY } May 11 19:29:16 Tower kernel: ata3: hard resetting link May 11 19:29:16 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300) May 11 19:29:21 Tower kernel: ata3.00: qc timeout (cmd 0xec) May 11 19:29:21 Tower kernel: ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4) May 11 19:29:21 Tower kernel: ata3.00: revalidation failed (errno=-5) May 11 19:29:21 Tower kernel: ata3: hard resetting link May 11 19:29:22 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300) May 11 19:29:32 Tower kernel: ata3.00: qc timeout (cmd 0xec) May 11 19:29:32 Tower kernel: ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4) May 11 19:29:32 Tower kernel: ata3.00: revalidation failed (errno=-5) May 11 19:29:32 Tower kernel: ata3: limiting SATA link speed to 3.0 Gbps May 11 19:29:32 Tower kernel: ata3: hard resetting link May 11 19:29:32 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 320) May 11 19:30:03 Tower kernel: ata3.00: qc timeout (cmd 0xec) May 11 19:30:03 Tower kernel: ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4) May 11 19:30:03 Tower kernel: ata3.00: revalidation failed (errno=-5) May 11 19:30:03 Tower kernel: ata3.00: disable device May 11 19:30:03 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 320) May 11 19:30:03 Tower kernel: ata3: EH complete May 11 19:30:03 Tower kernel: sd 3:0:0:0: [sdc] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=77s May 11 19:30:03 Tower kernel: sd 3:0:0:0: [sdc] tag#2 CDB: opcode=0x8a 8a 00 00 00 00 00 8c dd 3a 80 00 00 05 40 00 00 May 11 19:30:03 Tower kernel: I/O error, dev sdc, sector 2363308672 op 0x1:(WRITE) flags 0x0 phys_seg 168 prio class 0 May 11 19:30:03 Tower kernel: md: disk0 write error, sector=2363308608 May 11 19:30:03 Tower kernel: md: disk0 write error, sector=2363308616 How should I proceed? For now, I have just tried to stop the array, but it seems like it hung, the "stop" button is still greyed out My `syslog` share for syslog server has an 85GB syslog file in it, so I think it's been spamming errors into the log file for several hours. Coudl that be what's preventing the array from stopping? tower-diagnostics-20230511-2341.zip Edited May 12, 20233 yr by veri745
May 12, 20233 yr Author My array now seems to be is some weird limbo state. Some places (like the Docker tab) say the array needs to be started, but the share filesystem still seems to be navigable from the web UI and console. There's a `rsyslogd` process that is chewing up ~200% of a cpu, although it appears the 85GB syslog file is no longer growing. Should I try restart and hope for the best?, or should I try to kill some processes first to make sure the array is shut down? Edited May 12, 20233 yr by veri745
May 12, 20233 yr Author Interestingly, the drives besides parity that are reporting errors (disk2 and 4), are the same drives that were reporting errors a couple months ago:
May 12, 20233 yr Community Expert All 3 disks connected to this controller are having issues: 05:00.0 SATA controller [0106]: ASMedia Technology Inc. Device [1b21:1064] (rev 02) Subsystem: ZyDAS Technology Corp. Device [2116:2116] Kernel driver in use: ahci Kernel modules: ahci Could be a bad controller, you can also try a different PCIe slot if available.
May 12, 20233 yr Author 6 hours ago, JorgeB said: All 3 disks connected to this controller are having issues: 05:00.0 SATA controller [0106]: ASMedia Technology Inc. Device [1b21:1064] (rev 02) Subsystem: ZyDAS Technology Corp. Device [2116:2116] Kernel driver in use: ahci Kernel modules: ahci Could be a bad controller, you can also try a different PCIe slot if available. Controller is brand-new (from since linked thread). Of course it could be the controller, but seems unlikely given the similar scenario I was in a couple months ago before I got it. My immediately concern is how to deal with my system state as it stands. The array is currently still stuck "Stopping..." I killed the `rsyslogd` process so it would stop spamming writes to my cache disk. The web UI is still responsive, but it won't response to a "reboot" or `powerdown -r` from the terminal Do I hard-reset it? Edited May 12, 20233 yr by veri745
May 12, 20233 yr Author Alright, forced a hard reset. System rebooted, started array in maintenance mode, ran filesystem checks and smart tests on the disks that had errors Nothing obvious found. tower-diagnostics-20230512-0906.zip
May 14, 20233 yr Author Solution So I pulled open my server's case tonight to swap out to all-new SATA cables and to move my pci-e SATA card to a different port. I discovered something interesting that may point to a potential failure-mode that I've been experiencing: My parity drive and disks 2 and 4, the disks that had errors on them, are in a 4-bay hot-swappable drive cage, and they're also all connected to my 4-port SATA card. One of the molex power connectors that powers the hotswap cage backplane board had come loose, so it was being powered by only two molex connectors instead of three. I'm thinking that under certain load conditions, there wasn't enough juice getting to the drives in that drive cage
May 15, 20233 yr Community Expert On 5/14/2023 at 4:05 AM, veri745 said: One of the molex power connectors that powers the hotswap cage backplane board had come loose, so it was being powered by only two molex connectors instead of three. I'm thinking that under certain load conditions, there wasn't enough juice getting to the drives in that drive cage That's a strong possibility, if that's another thing all disks have in common other than the controller
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.