I've been rolling along for years with 2 cache, 2 parity, and 22 data disks. The data and parity disks run through two SASLP cards and one SAS2LP card, and the cache drives connect directly to the mobo SATA ports. About one time in three, during a monthly parity check, one finicky SASLP (the one with data disks 1-8) would start spitting back errors for all 8 of its drives. I assumed it was just overheating, since the issue became very sporadic once I added better ventilation.
On Monday, disk 5 on that SASLP was disabled. I/O errors, old drive, simple. I swapped it out and felt great as I started the rebuild. 5 hours later, the SASLP drops offline. As is tradition, I shut down and try again, still feeling fine. Another 4-5 hours and the SASLP is offline again. Now, when I reboot, disks 2 and 5 both show up as Unmountable. I try adding more fans, hoping to just get the data fixed before I start swapping hardware, and now the card fails after just 5 minutes. I ordered a SAS2LP to replace it, thinking all of my issues would be resolved without that bad SASLP.
Now, with the brand new SAS2LP installed, the parity sync runs at ~2 MB/s before it eventually fails. Looking at the logs, disk 8 is now throwing tons of I/O errors. FWIW, disk 8 is from the same batch of Seagate 4TB hard drives as the failed disk 5, but so is disk 6, and it's fine.
What are my next steps here? I feel like I need to buy better HBA cards and trash some drives, but I don't know where to begin.
pangu-diagnostics-20200221-1859.zip