SAS host is non-operational !!!!



Over the last couple of days I've started seeing drive errors, which I don't always notice straight away. Sometimes it's 4 drives, other times all of them. Rebooting the server brings everything back as it should be.

 

Jul  5 12:23:02 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul  5 12:23:03 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul  5 12:23:04 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul  5 12:23:05 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul  5 12:23:06 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!
Jul  5 12:23:07 Tower kernel: sd 9:0:0:0: [sdn] Synchronizing SCSI cache
Jul  5 12:23:07 Tower kernel: sd 9:0:0:0: [sdn] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jul  5 12:23:07 Tower kernel: sd 9:0:1:0: [sdo] Synchronizing SCSI cache
Jul  5 12:23:07 Tower kernel: sd 9:0:1:0: [sdo] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jul  5 12:23:07 Tower kernel: sd 9:0:4:0: [sdr] tag#731 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=4s
Jul  5 12:23:07 Tower kernel: sd 9:0:4:0: [sdr] tag#731 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Jul  5 12:23:07 Tower kernel: sd 9:0:4:0: [sdr] tag#732 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul  5 12:23:07 Tower kernel: sd 9:0:4:0: [sdr] tag#732 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00
Jul  5 12:23:07 Tower kernel: sd 9:0:5:0: [sds] tag#733 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul  5 12:23:07 Tower kernel: sd 9:0:5:0: [sds] tag#733 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Jul  5 12:23:07 Tower kernel: sd 9:0:5:0: [sds] tag#734 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul  5 12:23:07 Tower kernel: sd 9:0:5:0: [sds] tag#734 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00
Jul  5 12:23:07 Tower kernel: sd 9:0:3:0: [sdq] tag#735 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul  5 12:23:07 Tower kernel: sd 9:0:3:0: [sdq] tag#735 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Jul  5 12:23:07 Tower kernel: sd 9:0:3:0: [sdq] tag#736 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul  5 12:23:07 Tower kernel: sd 9:0:3:0: [sdq] tag#736 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00
Jul  5 12:23:07 Tower kernel: sd 9:0:2:0: [sdp] Synchronizing SCSI cache
Jul  5 12:23:07 Tower kernel: sd 9:0:2:0: [sdp] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jul  5 12:23:07 Tower emhttpd: read SMART /dev/sdr
Jul  5 12:23:07 Tower emhttpd: read SMART /dev/sds
Jul  5 12:23:07 Tower emhttpd: read SMART /dev/sdq
Jul  5 12:23:07 Tower kernel: sd 9:0:3:0: [sdq] Synchronizing SCSI cache
Jul  5 12:23:07 Tower kernel: sd 9:0:3:0: [sdq] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jul  5 12:23:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jul  5 12:23:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jul  5 12:23:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jul  5 12:23:07 Tower kernel: sd 9:0:4:0: [sdr] Synchronizing SCSI cache
Jul  5 12:23:07 Tower kernel: sd 9:0:4:0: [sdr] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jul  5 12:23:07 Tower unassigned.devices: Warning: Can't get rotational setting of '/dev/sdq'.
Jul  5 12:23:07 Tower unassigned.devices: Warning: Can't get rotational setting of '/dev/sdq'.
Jul  5 12:23:07 Tower unassigned.devices: Warning: Can't get rotational setting of '/dev/sdr'.
Jul  5 12:23:07 Tower unassigned.devices: Warning: Can't get rotational setting of '/dev/sdr'.
Jul  5 12:23:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jul  5 12:23:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jul  5 12:23:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jul  5 12:23:07 Tower kernel: sd 9:0:5:0: [sds] Synchronizing SCSI cache
Jul  5 12:23:07 Tower kernel: sd 9:0:5:0: [sds] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221100000000)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: removing handle(0x0009), sas_addr(0x4433221100000000)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: enclosure logical id(0x5b8ca3a0f0160c00), slot(3)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221101000000)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: removing handle(0x000a), sas_addr(0x4433221101000000)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: enclosure logical id(0x5b8ca3a0f0160c00), slot(2)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221102000000)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: removing handle(0x000b), sas_addr(0x4433221102000000)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: enclosure logical id(0x5b8ca3a0f0160c00), slot(1)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221104000000)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: removing handle(0x000c), sas_addr(0x4433221104000000)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: enclosure logical id(0x5b8ca3a0f0160c00), slot(7)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221105000000)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: removing handle(0x000d), sas_addr(0x4433221105000000)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: enclosure logical id(0x5b8ca3a0f0160c00), slot(6)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221103000000)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: removing handle(0x000e), sas_addr(0x4433221103000000)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: enclosure logical id(0x5b8ca3a0f0160c00), slot(0)
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: unexpected doorbell active!
Jul  5 12:23:07 Tower kernel: mpt2sas_cm1: sending diag reset !!
Jul  5 12:23:08 Tower kernel: mpt2sas_cm1: Invalid host diagnostic register value
Jul  5 12:23:08 Tower kernel: mpt2sas_cm1: System Register set:
Jul  5 12:23:08 Tower kernel: 00000000: ffffffff
Jul  5 12:23:08 Tower kernel: 00000004: ffffffff
Jul  5 12:23:08 Tower kernel: 00000008: ffffffff
Jul  5 12:23:08 Tower kernel: 0000000c: ffffffff
Jul  5 12:23:08 Tower kernel: 00000010: ffffffff
Jul  5 12:23:08 Tower kernel: 00000014: ffffffff
Jul  5 12:23:08 Tower kernel: 00000018: ffffffff
Jul  5 12:23:08 Tower kernel: 0000001c: ffffffff
<REPEATED>
Jul  5 12:23:08 Tower kernel: 000000f8: ffffffff
Jul  5 12:23:08 Tower kernel: 000000fc: ffffffff
Jul  5 12:23:08 Tower kernel: mpt2sas_cm1: diag reset: FAILED
Jul  5 12:26:40 Tower root: Total Spundown: 8
Jul  5 12:31:41 Tower root: Total Spundown: 8
Jul  5 12:33:00 Tower kernel: md: disk5 read error, sector=9102416
Jul  5 12:33:00 Tower kernel: md: disk2 read error, sector=9102416
Jul  5 12:33:00 Tower kernel: md: disk4 read error, sector=9102416
Jul  5 12:33:00 Tower kernel: md: disk6 read error, sector=9102416
Jul  5 12:33:10 Tower emhttpd: read SMART /dev/sdj
Jul  5 12:33:10 Tower emhttpd: read SMART /dev/sdk
Jul  5 12:33:10 Tower emhttpd: read SMART /dev/sdg
Jul  5 12:33:10 Tower kernel: XFS (md5): metadata I/O error in "xfs_da_read_buf+0x9e/0xfe [xfs]" at daddr 0x8ae450 len 8 error 5
Jul  5 12:33:10 Tower kernel: XFS (md5): metadata I/O error in "xfs_da_read_buf+0x9e/0xfe [xfs]" at daddr 0x8ae450 len 8 error 5
Jul  5 12:33:10 Tower emhttpd: read SMART /dev/sdf
Jul  5 12:33:10 Tower emhttpd: read SMART /dev/sdl
Jul  5 12:33:10 Tower emhttpd: read SMART /dev/sdi

 

I have 2 SAS controller cards. I'd initially thought one of them might be failing, but when all drives went I thought it must be motherboard related. Nothing has changed on the machine recently, and it's been running quite nicely for a while. I'm a bit stuck as to where to look next. Diagnostics attached.

tower-diagnostics-20210705-1244.zip
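
For anyone trying to tell whether one HBA or both dropped out, here is a rough Python sketch that counts the "SAS host is non-operational" messages per controller in a syslog excerpt like the one above. The function name `summarize_faults` and the regular expression are my own assumptions based on the line format shown in this post, not anything provided by Unraid or the mpt3sas driver.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the excerpt in this thread:
# "Jul  5 12:23:02 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!"
FAULT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(?P<ctrl>mpt\dsas_cm\d+): SAS host is non-operational"
)


def summarize_faults(lines):
    """Return {controller: (first_stamp, last_stamp, count)} for fault lines."""
    seen = defaultdict(list)
    for line in lines:
        match = FAULT_RE.match(line)
        if match:
            seen[match.group("ctrl")].append(match.group("stamp"))
    # Lines are assumed to be in chronological order, as in a normal syslog.
    return {ctrl: (stamps[0], stamps[-1], len(stamps)) for ctrl, stamps in seen.items()}
```

Feeding it the lines of the syslog from the diagnostics zip (for example `summarize_faults(open(path))`, where `path` is wherever you extracted the log) gives one entry per controller that reported the fault.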

  • 2 weeks later...

Thank you. I took the server down and reseated the card, and thought all was sorted, but it's gone again tonight. This time it was both cards and all drives. Could they both fail at pretty much the same time? That seems unlikely. I don't really know what else to look for in the logs; there was nothing else in the log immediately before both cards failed. Again, they're all back after a reboot.

 

Jul 15 19:05:44 Tower root: Total Spundown: 1
Jul 15 19:05:44 Tower root: Entering Turbo Mode
Jul 15 19:05:44 Tower kernel: mdcmd (160): set md_write_method 1
Jul 15 19:05:44 Tower kernel: 
Jul 15 19:10:44 Tower root: Total Spundown: 1
### [PREVIOUS LINE REPEATED 4 TIMES] ###
Jul 15 19:30:46 Tower emhttpd: spinning down /dev/sds
Jul 15 19:31:06 Tower emhttpd: spinning down /dev/sdl
Jul 15 19:31:18 Tower emhttpd: spinning down /dev/sdj
Jul 15 19:31:29 Tower emhttpd: spinning down /dev/sdo
Jul 15 19:31:39 Tower emhttpd: spinning down /dev/sdp
Jul 15 19:31:43 Tower emhttpd: spinning down /dev/sdk
Jul 15 19:31:57 Tower emhttpd: spinning down /dev/sdh
Jul 15 19:31:57 Tower emhttpd: spinning down /dev/sdi
Jul 15 19:35:44 Tower root: Total Spundown: 9
Jul 15 19:35:44 Tower root: Entering Normal Mode
Jul 15 19:35:44 Tower kernel: mdcmd (161): set md_write_method 0
Jul 15 19:35:44 Tower kernel: 
Jul 15 19:40:44 Tower root: Total Spundown: 9
### [PREVIOUS LINE REPEATED 8 TIMES] ###
Jul 15 20:23:48 Tower emhttpd: read SMART /dev/sdo
Jul 15 20:25:44 Tower root: Total Spundown: 8
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Jul 15 20:36:11 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 15 20:36:11 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 15 20:36:12 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 15 20:36:12 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 15 20:36:13 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 15 20:36:13 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 15 20:36:14 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 15 20:36:14 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 15 20:36:15 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 15 20:36:15 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 15 20:36:16 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 15 20:36:16 Tower kernel: mpt2sas_cm0: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!
Jul 15 20:36:16 Tower kernel: sd 8:0:0:0: [sdf] Synchronizing SCSI cache
Jul 15 20:36:16 Tower kernel: sd 8:0:0:0: [sdf] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jul 15 20:36:16 Tower kernel: sd 8:0:4:0: [sdj] tag#803 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=5s
Jul 15 20:36:16 Tower kernel: sd 8:0:4:0: [sdj] tag#803 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Jul 15 20:36:16 Tower kernel: sd 8:0:4:0: [sdj] tag#804 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 15 20:36:16 Tower kernel: sd 8:0:4:0: [sdj] tag#804 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00
Jul 15 20:36:16 Tower kernel: sd 8:0:5:0: [sdk] tag#805 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 15 20:36:16 Tower kernel: sd 8:0:5:0: [sdk] tag#805 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Jul 15 20:36:16 Tower kernel: sd 8:0:5:0: [sdk] tag#806 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 15 20:36:16 Tower kernel: sd 8:0:5:0: [sdk] tag#806 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00
Jul 15 20:36:16 Tower kernel: sd 8:0:2:0: [sdh] tag#807 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 15 20:36:16 Tower kernel: sd 8:0:2:0: [sdh] tag#807 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Jul 15 20:36:16 Tower kernel: sd 8:0:2:0: [sdh] tag#808 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 15 20:36:16 Tower kernel: sd 8:0:2:0: [sdh] tag#808 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00

 

tower-diagnostics-20210715-2051.zip
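
On the "could they both fail at the same time" question, a minimal sketch of how the timing could be checked from the log: it takes the first fault timestamp per controller and tests whether they all fall inside a short window. The helper names, the 2-second window, and the default year (taken from the diagnostics file name, since syslog lines carry no year) are assumptions of mine. Note that `%b` parsing depends on the locale using English month names.

```python
import re
from datetime import datetime, timedelta

# Assumed line format, based on the excerpts posted in this thread.
FAULT_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(mpt\dsas_cm\d+): SAS host is non-operational"
)


def first_fault_times(lines, year=2021):
    """Return {controller: datetime of its first non-operational message}."""
    first = {}
    for line in lines:
        match = FAULT_RE.match(line)
        if match and match.group(2) not in first:
            # Syslog stamps omit the year, so an assumed one is prepended.
            first[match.group(2)] = datetime.strptime(
                f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
            )
    return first


def failed_together(first, window=timedelta(seconds=2)):
    """True if two or more controllers first faulted within `window`."""
    if len(first) < 2:
        return False
    times = sorted(first.values())
    return times[-1] - times[0] <= window
```

Two controllers reporting the fault within the same second or two would at least be consistent with a shared cause (power, PCIe, motherboard) rather than two independent card failures, though it doesn't prove it.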

Edited by michaelmcq

Thanks. I think there might be a correlation between this happening and me hammering the SSDs in there at the same time. They’re not connected via the HBA but directly to the motherboard (Z490-A-PRO).

 

So I was thinking it’s either power draw, or something motherboard related when its onboard drives are working hard?

Off to research power consumption!
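
To test the "heavy SSD load" theory against the fault times, one option is to log SSD throughput periodically and compare it with the syslog timestamps afterwards. Below is a rough sketch that works on two snapshots of `/proc/diskstats` text. The function names are hypothetical, and which device names are the onboard SSDs is something you'd have to fill in yourself.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text):
    """Map device name -> (sectors_read, sectors_written) from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or truncated lines
        # Field 3 is the device name, field 6 sectors read, field 10 sectors written.
        stats[fields[2]] = (int(fields[5]), int(fields[9]))
    return stats


def throughput_mib_s(before, after, seconds):
    """Return {device: (read MiB/s, write MiB/s)} between two snapshots."""
    rates = {}
    for dev, (read_after, written_after) in after.items():
        if dev not in before:
            continue  # device appeared between snapshots
        read_before, written_before = before[dev]
        rates[dev] = (
            (read_after - read_before) * SECTOR_BYTES / seconds / 2**20,
            (written_after - written_before) * SECTOR_BYTES / seconds / 2**20,
        )
    return rates
```

In practice you'd read `/proc/diskstats` twice a known number of seconds apart, pass both through `parse_diskstats`, and append the resulting rates to a file along with the current time, so they can be lined up with the next "non-operational" event.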

  • 1 month later...
  • 11 months later...
  • 5 months later...

I just encountered the same issue with a 9300-16i card I bought on eBay and Linux kernel 5.19. I took the HBA out of service, went back to a known stable 5.17 kernel, and put in a different model of HBA (9400-16i) to see if the problem can be isolated to the HBA card. It's a new motherboard/CPU/RAM combo, new Ubuntu 22.04 install, and new HBA, in the same SC846B chassis. There are lots of potential fault areas, so I'll try to isolate one thing at a time.

  • 9 months later...

Sorry to necro the thread, but I have now encountered the same issue. Any long-term solutions? I just tried adding 2 fans to the HBA (2x 120mm Noctuas pointed right at the heatsink, and the heatsink isn't even hot to the touch), but the issue still occurs.

 

I'm using a 9300-16i, and it seems every time I run a parity check or drive clear, my entire system freezes, the dashboard becomes unresponsive, and I have to do a hard power down/restart.

 

If I don't run parity check or add a drive, it seems to work completely fine. 
