Hi guys, I have another data point for this problem. I'm using LSI 2008 HBAs into HP SAS expanders with an assortment of drives (including Seagate Archive and Exos) with spindown set, and am seeing exactly this thread's problem on drive spin-up. Although the server is running a different Linux distro, I think the root cause might be the same one later versions of Unraid are hitting.
I have narrowed the problem down to Linux kernel 5.17 and onwards. I've tested 5.17, 5.18, 5.19 and 6.1.6, all of which behave identically when trying to spin up drives (some Hitachi and Western Digital drives are also affected). Kernel 5.16 and prior work fine. I'm almost positive this is the same issue you're having with Unraid versions past the 6.8.* series, perhaps due to a backported patch (Unraid 6.10 runs a late 5.15 kernel). I really hope a dev stumbles upon this and it helps narrow down the root cause.
For the sake of searchability, I'll include kernel logs from both working and non-working situations.
First up is 5.16 (and prior), which works perfectly. It produces the following errors on spin-up but otherwise operates normally. I'm using bcache on these drives, and kernel readahead errors out with I/O errors due to the spin-up timeout; once the drive is spun up, everything operates normally and user-space processes never receive any I/O errors. Processes trying to access the drive do hang until it's online, but that's expected. The contained BTRFS filesystem never complains, which (it being a checksummed filesystem) confirms data integrity is good. Note the cmd_age=10s on the first line, indicating spin-up is taking longer than 10 seconds.
[Jan18 05:08] sd 0:0:13:0: [sdn] tag#705 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=10s
[ +0.000011] sd 0:0:13:0: [sdn] tag#705 Sense Key : 0x2 [current]
[ +0.000006] sd 0:0:13:0: [sdn] tag#705 ASC=0x4 ASCQ=0x2
[ +0.000006] sd 0:0:13:0: [sdn] tag#705 CDB: opcode=0x88 88 00 00 00 00 02 3c 46 f5 f0 00 00 03 e0 00 00
[ +0.000003] I/O error, dev sdn, sector 9601218032 op 0x0:(READ) flags 0x80700 phys_seg 124 prio class 0
[ +0.123623] bcache: bch_count_backing_io_errors() sdn1: Read-ahead I/O failed on backing device, ignore
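For anyone decoding the sense data above: Sense Key 0x2 is NOT READY, and ASC/ASCQ 04h/02h is "Logical unit not ready, initializing command required", i.e. the drive is simply still spinning up. The code names come straight from the SCSI spec; the snippet itself is just my own illustration, covering only the handful of codes relevant here:

```python
# Tiny decoder for the sense fields printed in the log above.
# Only the codes relevant to spin-up are included.
SENSE_KEYS = {0x2: "NOT READY", 0x4: "HARDWARE ERROR", 0xB: "ABORTED COMMAND"}
ASC_ASCQ = {
    (0x04, 0x01): "Logical unit is in process of becoming ready",
    (0x04, 0x02): "Logical unit not ready, initializing command required",
}

def decode(sense_key: int, asc: int, ascq: int) -> str:
    key = SENSE_KEYS.get(sense_key, f"key {sense_key:#x}")
    detail = ASC_ASCQ.get((asc, ascq), f"ASC/ASCQ {asc:#04x}/{ascq:#04x}")
    return f"{key}: {detail}"

print(decode(0x2, 0x4, 0x2))
```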
Next up is 5.17 (and onwards), which produces the following on drive spin-up (the affected drive hangs until it comes online):
[Jan16 08:49] sd 1:0:13:0: attempting task abort!scmd(0x0000000052c3e39b), outstanding for 10028 ms & timeout 60000 ms
[ +0.000015] sd 1:0:13:0: [sdak] tag#1552 CDB: opcode=0x1b 1b 00 00 00 01 00
[ +0.000004] scsi target1:0:13: handle(0x0017), sas_address(0x5001438018df2cd5), phy(21)
[ +0.000007] scsi target1:0:13: enclosure logical id(0x5001438018df2ce5), slot(54)
[ +1.449183] sd 1:0:13:0: task abort: SUCCESS scmd(0x0000000052c3e39b)
[ +0.000010] sd 1:0:13:0: attempting device reset! scmd(0x0000000052c3e39b)
[ +0.000005] sd 1:0:13:0: [sdak] tag#1552 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
[ +0.000003] scsi target1:0:13: handle(0x0017), sas_address(0x5001438018df2cd5), phy(21)
[ +0.000004] scsi target1:0:13: enclosure logical id(0x5001438018df2ce5), slot(54)
[Jan16 08:50] sd 1:0:13:0: device reset: FAILED scmd(0x0000000052c3e39b)
[ +0.000006] scsi target1:0:13: attempting target reset! scmd(0x0000000052c3e39b)
[ +0.000006] sd 1:0:13:0: [sdak] tag#1552 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
[ +0.000003] scsi target1:0:13: handle(0x0017), sas_address(0x5001438018df2cd5), phy(21)
[ +0.000003] scsi target1:0:13: enclosure logical id(0x5001438018df2ce5), slot(54)
[ +3.000146] scsi target1:0:13: target reset: SUCCESS scmd(0x0000000052c3e39b)
[ +0.248868] sd 1:0:13:0: Power-on or device reset occurred
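Worth noting what's actually being aborted in that trace: opcode 0x1b is START STOP UNIT, and the CDB `1b 00 00 00 01 00` has the START bit set in byte 4, so it's the spin-up command itself that times out; opcode 0x35 is SYNCHRONIZE CACHE(10). The opcode names are per the SCSI spec; this little parser is only my illustration for reading the kernel's CDB dumps:

```python
# Name the commands from the "CDB: opcode=..." lines in the 5.17+ trace.
OPCODES = {
    0x1B: "START STOP UNIT",
    0x35: "SYNCHRONIZE CACHE(10)",
    0x88: "READ(16)",
}

def decode_cdb(hex_bytes: str) -> str:
    """Decode a space-separated CDB hex dump as printed by the kernel."""
    cdb = [int(b, 16) for b in hex_bytes.split()]
    name = OPCODES.get(cdb[0], f"opcode {cdb[0]:#04x}")
    # For START STOP UNIT, bit 0 of byte 4 is the START bit (1 = spin up).
    if cdb[0] == 0x1B and cdb[4] & 0x01:
        name += " (START bit set: spin-up)"
    return name

print(decode_cdb("1b 00 00 00 01 00"))
```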
And then occasionally (only on 5.17 onwards) I'm seeing this:
[Jan16 10:15] mpt2sas_cm1: sending diag reset !!
[ +0.480071] EDAC PCI: Signaled System Error on 0000:05:00.0
[ +0.000011] EDAC PCI: Master Data Parity Error on 0000:05:00.0
[ +0.000003] EDAC PCI: Detected Parity Error on 0000:05:00.0
[ +0.466972] mpt2sas_cm1: diag reset: SUCCESS
[ +0.057979] mpt2sas_cm1: CurrentHostPageSize is 0: Setting default host page size to 4k
[ +0.044426] mpt2sas_cm1: LSISAS2008: FWVersion(18.00.00.00), ChipRevision(0x03), BiosVersion(07.39.02.00)
[ +0.000009] mpt2sas_cm1: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
[ +0.000078] mpt2sas_cm1: sending port enable !!
[ +7.814105] mpt2sas_cm1: port enable: SUCCESS
[ +0.000314] mpt2sas_cm1: search for end-devices: start
[ +0.001358] scsi target1:0:0: handle(0x000a), sas_addr(0x5001438018df2cc0)
[ +0.000011] scsi target1:0:0: enclosure logical id(0x5001438018df2ce5), slot(35)
[ +0.000117] scsi target1:0:1: handle(0x000b), sas_addr(0x5001438018df2cc1)
[ +0.000006] scsi target1:0:1: enclosure logical id(0x5001438018df2ce5), slot(34)
[ +0.000086] scsi target1:0:2: handle(0x000c), sas_addr(0x5001438018df2cc3)
etc (all drives reset and reconnect)
which causes problems with MD arrays etc., as all drives are hung while this is going on and heaps of timeouts occur. Between these two scenarios, it might explain the dropouts you're seeing in Unraid.
Hope this helps! I'm looking into changing my HBA to one of the ones quoted in this thread to work around the problem; I don't want to be stuck on kernel 5.16 forever. I will be back with updates!