Disabled parity disk, stuck "reading" at 2.5GB/s

May 12, 20233 yr

So I upgraded my parity disk to a 16TB Seagate Exos about a week ago, along with expanding my array and adding another cache drive. I both pre-cleared w/ pre-read and post-read, as well as did an extended SMART test on the drive to ensure it was healthy before rebuilding parity on it and also added a second 16TB Exos drive to the array.

Tonight I saw a notification that my parity disk is disabled, and checking the unraid dashboard is says it's "reading" at 2.5-3GB/s and had accumulated billions of errors

Here is the section of log where the drive gets disabled:

May 11 19:26:34 Tower  emhttpd: spinning down /dev/sdh
May 11 19:29:16 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x190000 SErr 0x0 action 0x6 frozen
May 11 19:29:16 Tower kernel: ata3.00: failed command: WRITE FPDMA QUEUED
May 11 19:29:16 Tower kernel: ata3.00: cmd 61/c0:80:80:34:dd/00:00:8c:00:00/40 tag 16 ncq dma 98304 out
May 11 19:29:16 Tower kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 11 19:29:16 Tower kernel: ata3.00: status: { DRDY }
May 11 19:29:16 Tower kernel: ata3.00: failed command: WRITE FPDMA QUEUED
May 11 19:29:16 Tower kernel: ata3.00: cmd 61/40:98:40:35:dd/05:00:8c:00:00/40 tag 19 ncq dma 688128 out
May 11 19:29:16 Tower kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 11 19:29:16 Tower kernel: ata3.00: status: { DRDY }
May 11 19:29:16 Tower kernel: ata3.00: failed command: WRITE FPDMA QUEUED
May 11 19:29:16 Tower kernel: ata3.00: cmd 61/40:a0:80:3a:dd/05:00:8c:00:00/40 tag 20 ncq dma 688128 out
May 11 19:29:16 Tower kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
May 11 19:29:16 Tower kernel: ata3.00: status: { DRDY }
May 11 19:29:16 Tower kernel: ata3: hard resetting link
May 11 19:29:16 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 11 19:29:21 Tower kernel: ata3.00: qc timeout (cmd 0xec)
May 11 19:29:21 Tower kernel: ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May 11 19:29:21 Tower kernel: ata3.00: revalidation failed (errno=-5)
May 11 19:29:21 Tower kernel: ata3: hard resetting link
May 11 19:29:22 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 11 19:29:32 Tower kernel: ata3.00: qc timeout (cmd 0xec)
May 11 19:29:32 Tower kernel: ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May 11 19:29:32 Tower kernel: ata3.00: revalidation failed (errno=-5)
May 11 19:29:32 Tower kernel: ata3: limiting SATA link speed to 3.0 Gbps
May 11 19:29:32 Tower kernel: ata3: hard resetting link
May 11 19:29:32 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
May 11 19:30:03 Tower kernel: ata3.00: qc timeout (cmd 0xec)
May 11 19:30:03 Tower kernel: ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
May 11 19:30:03 Tower kernel: ata3.00: revalidation failed (errno=-5)
May 11 19:30:03 Tower kernel: ata3.00: disable device
May 11 19:30:03 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 320)
May 11 19:30:03 Tower kernel: ata3: EH complete
May 11 19:30:03 Tower kernel: sd 3:0:0:0: [sdc] tag#2 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=77s
May 11 19:30:03 Tower kernel: sd 3:0:0:0: [sdc] tag#2 CDB: opcode=0x8a 8a 00 00 00 00 00 8c dd 3a 80 00 00 05 40 00 00
May 11 19:30:03 Tower kernel: I/O error, dev sdc, sector 2363308672 op 0x1:(WRITE) flags 0x0 phys_seg 168 prio class 0
May 11 19:30:03 Tower kernel: md: disk0 write error, sector=2363308608
May 11 19:30:03 Tower kernel: md: disk0 write error, sector=2363308616

How should I proceed? For now, I have just tried to stop the array, but it seems like it hung, the "stop" button is still greyed out

My `syslog` share for syslog server has an 85GB syslog file in it, so I think it's been spamming errors into the log file for several hours. Coudl that be what's preventing the array from stopping?

tower-diagnostics-20230511-2341.zip

Edited May 12, 20233 yr by veri745

Quote

May 12, 20233 yr

Author

My array now seems to be is some weird limbo state. Some places (like the Docker tab) say the array needs to be started, but the share filesystem still seems to be navigable from the web UI and console.

There's a `rsyslogd` process that is chewing up ~200% of a cpu, although it appears the 85GB syslog file is no longer growing.

Should I try restart and hope for the best?, or should I try to kill some processes first to make sure the array is shut down?

Edited May 12, 20233 yr by veri745

Quote

May 12, 20233 yr

Author

Interestingly, the drives besides parity that are reporting errors (disk2 and 4), are the same drives that were reporting errors a couple months ago:

Quote

May 12, 20233 yr

Community Expert

All 3 disks connected to this controller are having issues:

05:00.0 SATA controller [0106]: ASMedia Technology Inc. Device [1b21:1064] (rev 02)
    Subsystem: ZyDAS Technology Corp. Device [2116:2116]
    Kernel driver in use: ahci
    Kernel modules: ahci

Could be a bad controller, you can also try a different PCIe slot if available.

Quote

May 12, 20233 yr

Author

6 hours ago, JorgeB said:
All 3 disks connected to this controller are having issues:
05:00.0 SATA controller [0106]: ASMedia Technology Inc. Device [1b21:1064] (rev 02)
    Subsystem: ZyDAS Technology Corp. Device [2116:2116]
    Kernel driver in use: ahci
    Kernel modules: ahci
Could be a bad controller, you can also try a different PCIe slot if available.

Controller is brand-new (from since linked thread). Of course it could be the controller, but seems unlikely given the similar scenario I was in a couple months ago before I got it.

My immediately concern is how to deal with my system state as it stands. The array is currently still stuck "Stopping..."

I killed the `rsyslogd` process so it would stop spamming writes to my cache disk.

The web UI is still responsive, but it won't response to a "reboot" or `powerdown -r` from the terminal

Do I hard-reset it?

Edited May 12, 20233 yr by veri745

Quote

May 12, 20233 yr

Author

Alright, forced a hard reset. System rebooted, started array in maintenance mode, ran filesystem checks and smart tests on the disks that had errors

Nothing obvious found.

tower-diagnostics-20230512-0906.zip

Quote

May 14, 20233 yr

Author
Solution

So I pulled open my server's case tonight to swap out to all-new SATA cables and to move my pci-e SATA card to a different port.

I discovered something interesting that may point to a potential failure-mode that I've been experiencing:

My parity drive and disks 2 and 4, the disks that had errors on them, are in a 4-bay hot-swappable drive cage, and they're also all connected to my 4-port SATA card.

One of the molex power connectors that powers the hotswap cage backplane board had come loose, so it was being powered by only two molex connectors instead of three.

I'm thinking that under certain load conditions, there wasn't enough juice getting to the drives in that drive cage

Quote

May 15, 20233 yr

Community Expert

On 5/14/2023 at 4:05 AM, veri745 said:

One of the molex power connectors that powers the hotswap cage backplane board had come loose, so it was being powered by only two molex connectors instead of three.

I'm thinking that under certain load conditions, there wasn't enough juice getting to the drives in that drive cage

That's a strong possibility, if that's another thing all disks have in common other than the controller

Quote

Disabled parity disk, stuck "reading" at 2.5GB/s

Featured Replies

Solved by veri745

Join the conversation

Account

Navigation

Search

Configure browser push notifications

Chrome (Android)

Chrome (Desktop)

Safari (iOS 16.4+)

Safari (macOS)

Edge (Android)

Edge (Desktop)

Firefox (Android)

Firefox (Desktop)