unraidwok

Members · Posts: 3

  1. OK, I've gotten somewhere with my problem. I'm now running Linux 6.2.13, have gotten rid of the SCSI timeouts on spin-up that have plagued me since 5.17, and have re-enabled EPC. The hint for me was this line in my logs:

     sd 1:0:13:0: attempting task abort!scmd(0x0000000052c3e39b), outstanding for 10028 ms & timeout 60000 ms

     I'd set the timeout for every drive to 60s, but it looks like there's an additional 10s timeout somewhere that's triggering consistently. After much research and searching through /sys, I found that the scsi_device for each disk also has a timeout called eh_timeout. Here was the fix for me, applied via a local startup script on every boot (a slightly expanded sketch follows after these posts):

     # increase SCSI eh_timeout from 10s to 20s
     # (prevents the HBA resetting the link when disks take more than 10s to spin up)
     # this is the solution to the HBA resets that were happening on 5.17+ with SAS2008 and SAS2308
     for scsi_device in /sys/bus/scsi/devices/*/eh_timeout ; do
         echo 20 > "$scsi_device"
     done

     There's now only one other error that I'm consistently getting. It involves WD and Hitachi drives on spin-up only, and looks like an LSI-specific timeout. I haven't found where to increase this yet, but it's non-fatal: the I/O error is only for read-ahead and is not passed down to userspace:

     [ +11.033292] sd 1:0:11:0: [sdah] tag#259 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=11s
     [ +0.000012] sd 1:0:11:0: [sdah] tag#259 Sense Key : 0x2 [current]
     [ +0.000004] sd 1:0:11:0: [sdah] tag#259 ASC=0x4 ASCQ=0x2
     [ +0.000006] sd 1:0:11:0: [sdah] tag#259 CDB: opcode=0x88 88 00 00 00 00 00 84 5e ae f8 00 00 03 80 00 00
     [ +0.000004] I/O error, dev sdah, sector 2220797688 op 0x0:(READ) flags 0x80700 phys_seg 109 prio class 2

     That error can therefore be treated as a warning, not an error. I would love to know what UNKNOWN(0x2003) means, though; it seems to be vendor-specific. Thanks to everyone in this thread who has helped me get to this outcome: 100% resolved. I wish everyone the best of luck with the ongoing Unraid + LSI + spin-up problems.
  2. Hi all, I previously posted about spin-ups causing dropouts on SAS2008 using kernel 5.17 or newer. I recently upgraded to SAS2308 cards to see if the issue is solved. It's not: I'm seeing precisely the same issue in the logs, and the new cards are getting reset due to timeouts related to spin-up. You can see the logs in my previous post; exactly the same thing is happening. Back to kernel 5.16, and time to order some SAS3008 cards to try.
  3. Hi guys, I have another data point for this problem. I'm using LSI 2008 HBAs into HP SAS expanders with an assortment of drives (including Seagate Archive and EXOS), with spin-down set, and am seeing exactly this thread's problem on drive spin-up. Although the server is running a different Linux distro, I think the root cause might be the same one that later versions of Unraid are seeing.

     I have narrowed the problem down to Linux kernel 5.17 and onwards. I've tested 5.17, 5.18, 5.19 and 6.1.6, all of which behave identically when trying to spin up drives (some Hitachi and Western Digital drives are also affected). Linux kernel 5.16 and prior work fine. I'm almost positive this is the same issue you're having with versions of Unraid past the 6.8.* series, perhaps due to a backported patch (6.10 is running a late version of 5.15). I really hope a dev stumbles upon this and it helps narrow down the root cause. For the sake of searchability, I'll include kernel logs from both working and non-working situations.

     First up is 5.16 (and prior), which works perfectly. It produces the following errors on spin-up but operates normally. I'm using bcache on these drives, and the kernel read-ahead seemingly errors out with I/O errors due to the spin-up timeout, but once the drive is spun up everything operates normally and user-space processes never receive any I/O errors. Processes trying to access the drive are hung until it's online, however, which is normal. The contained BTRFS filesystem never complains, confirming data integrity is good, as it's a checksummed filesystem. Note the cmd_age=10s on the first line, indicating spin-up is taking longer than 10s.

     [Jan18 05:08] sd 0:0:13:0: [sdn] tag#705 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=10s
     [ +0.000011] sd 0:0:13:0: [sdn] tag#705 Sense Key : 0x2 [current]
     [ +0.000006] sd 0:0:13:0: [sdn] tag#705 ASC=0x4 ASCQ=0x2
     [ +0.000006] sd 0:0:13:0: [sdn] tag#705 CDB: opcode=0x88 88 00 00 00 00 02 3c 46 f5 f0 00 00 03 e0 00 00
     [ +0.000003] I/O error, dev sdn, sector 9601218032 op 0x0:(READ) flags 0x80700 phys_seg 124 prio class 0
     [ +0.123623] bcache: bch_count_backing_io_errors() sdn1: Read-ahead I/O failed on backing device, ignore

     Next up is 5.17 (and onwards), which produces this upon drive spin-up (and hangs the affected drive until it comes online):

     [Jan16 08:49] sd 1:0:13:0: attempting task abort!scmd(0x0000000052c3e39b), outstanding for 10028 ms & timeout 60000 ms
     [ +0.000015] sd 1:0:13:0: [sdak] tag#1552 CDB: opcode=0x1b 1b 00 00 00 01 00
     [ +0.000004] scsi target1:0:13: handle(0x0017), sas_address(0x5001438018df2cd5), phy(21)
     [ +0.000007] scsi target1:0:13: enclosure logical id(0x5001438018df2ce5), slot(54)
     [ +1.449183] sd 1:0:13:0: task abort: SUCCESS scmd(0x0000000052c3e39b)
     [ +0.000010] sd 1:0:13:0: attempting device reset! scmd(0x0000000052c3e39b)
     [ +0.000005] sd 1:0:13:0: [sdak] tag#1552 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
     [ +0.000003] scsi target1:0:13: handle(0x0017), sas_address(0x5001438018df2cd5), phy(21)
     [ +0.000004] scsi target1:0:13: enclosure logical id(0x5001438018df2ce5), slot(54)
     [Jan16 08:50] sd 1:0:13:0: device reset: FAILED scmd(0x0000000052c3e39b)
     [ +0.000006] scsi target1:0:13: attempting target reset! scmd(0x0000000052c3e39b)
     [ +0.000006] sd 1:0:13:0: [sdak] tag#1552 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
     [ +0.000003] scsi target1:0:13: handle(0x0017), sas_address(0x5001438018df2cd5), phy(21)
     [ +0.000003] scsi target1:0:13: enclosure logical id(0x5001438018df2ce5), slot(54)
     [ +3.000146] scsi target1:0:13: target reset: SUCCESS scmd(0x0000000052c3e39b)
     [ +0.248868] sd 1:0:13:0: Power-on or device reset occurred

     And then occasionally (only on 5.17 onwards) I'm seeing this:

     [Jan16 10:15] mpt2sas_cm1: sending diag reset !!
     [ +0.480071] EDAC PCI: Signaled System Error on 0000:05:00.0
     [ +0.000011] EDAC PCI: Master Data Parity Error on 0000:05:00.0
     [ +0.000003] EDAC PCI: Detected Parity Error on 0000:05:00.0
     [ +0.466972] mpt2sas_cm1: diag reset: SUCCESS
     [ +0.057979] mpt2sas_cm1: CurrentHostPageSize is 0: Setting default host page size to 4k
     [ +0.044426] mpt2sas_cm1: LSISAS2008: FWVersion(18.00.00.00), ChipRevision(0x03), BiosVersion(07.39.02.00)
     [ +0.000009] mpt2sas_cm1: Protocol=(Initiator,Target), Capabilities=(TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
     [ +0.000078] mpt2sas_cm1: sending port enable !!
     [ +7.814105] mpt2sas_cm1: port enable: SUCCESS
     [ +0.000314] mpt2sas_cm1: search for end-devices: start
     [ +0.001358] scsi target1:0:0: handle(0x000a), sas_addr(0x5001438018df2cc0)
     [ +0.000011] scsi target1:0:0: enclosure logical id(0x5001438018df2ce5), slot(35)
     [ +0.000117] scsi target1:0:1: handle(0x000b), sas_addr(0x5001438018df2cc1)
     [ +0.000006] scsi target1:0:1: enclosure logical id(0x5001438018df2ce5), slot(34)
     [ +0.000086] scsi target1:0:2: handle(0x000c), sas_addr(0x5001438018df2cc3)
     etc. (all drives reset and reconnect)

     This causes some problems with MD arrays and so on, as all drives are hung while this is going on and heaps of timeouts occur. Between these two scenarios, it might explain the dropouts you're seeing in Unraid. Hope this helps! I'm looking into changing the HBA to one of the ones quoted in this thread to work around the problem; I don't want to be stuck on Linux kernel 5.16 forever. I will be back with updates! (A small triage sketch for spotting these log signatures follows after these posts.)
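
A slightly expanded sketch of the startup-script fix from the first post above, assuming the standard sysfs layout (/sys/block/sd*/device/timeout for the per-command timeout, /sys/bus/scsi/devices/*/eh_timeout for the error-handler timeout). The 60 s and 20 s values mirror the post; this is a sketch, not the author's exact script, so adjust values and placement (e.g. Unraid's go file or rc.local) to your setup:

    #!/bin/bash
    # Raise the block-layer command timeout to 60s per drive
    # (the post's setting; the kernel default is 30s).
    for t in /sys/block/sd*/device/timeout ; do
        echo 60 > "$t"
    done

    # Raise the SCSI error-handler timeout from its 10s default to 20s so the
    # HBA link is not reset while a disk is still spinning up.
    for t in /sys/bus/scsi/devices/*/eh_timeout ; do
        echo 20 > "$t"
    done

    # Print the applied values as a quick sanity check.
    grep -H . /sys/block/sd*/device/timeout /sys/bus/scsi/devices/*/eh_timeout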
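
For anyone trying to confirm they are hitting the spin-up issue described in the third post, a quick triage sketch built from the log signatures quoted above. dmesg -T and hdparm -C are standard tools; /dev/sdX is a placeholder for whichever drive you are testing, not a value taken from the posts:

    #!/bin/bash
    # Pull out the mpt2sas/mpt3sas error-handling messages that mark this problem.
    dmesg -T | grep -E 'attempting task abort|device reset|target reset|sending diag reset' | tail -n 20

    # Check whether the drive is currently spun down (standby) or active
    # before timing a spin-up test against it.
    hdparm -C /dev/sdX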