michaelmcq Posted July 5, 2021

Over the last couple of days I've started seeing drive errors, which I don't always notice straight away. Sometimes it's 4 drives, other times all drives. Rebooting the server brings everything back as it should be.

Jul 5 12:23:02 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 5 12:23:03 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 5 12:23:04 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 5 12:23:05 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 5 12:23:06 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!
Jul 5 12:23:07 Tower kernel: sd 9:0:0:0: [sdn] Synchronizing SCSI cache
Jul 5 12:23:07 Tower kernel: sd 9:0:0:0: [sdn] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jul 5 12:23:07 Tower kernel: sd 9:0:1:0: [sdo] Synchronizing SCSI cache
Jul 5 12:23:07 Tower kernel: sd 9:0:1:0: [sdo] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jul 5 12:23:07 Tower kernel: sd 9:0:4:0: [sdr] tag#731 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=4s
Jul 5 12:23:07 Tower kernel: sd 9:0:4:0: [sdr] tag#731 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Jul 5 12:23:07 Tower kernel: sd 9:0:4:0: [sdr] tag#732 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 5 12:23:07 Tower kernel: sd 9:0:4:0: [sdr] tag#732 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00
Jul 5 12:23:07 Tower kernel: sd 9:0:5:0: [sds] tag#733 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 5 12:23:07 Tower kernel: sd 9:0:5:0: [sds] tag#733 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Jul 5 12:23:07 Tower kernel: sd 9:0:5:0: [sds] tag#734 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 5 12:23:07 Tower kernel: sd 9:0:5:0: [sds] tag#734 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00
Jul 5 12:23:07 Tower kernel: sd 9:0:3:0: [sdq] tag#735 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 5 12:23:07 Tower kernel: sd 9:0:3:0: [sdq] tag#735 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Jul 5 12:23:07 Tower kernel: sd 9:0:3:0: [sdq] tag#736 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 5 12:23:07 Tower kernel: sd 9:0:3:0: [sdq] tag#736 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00
Jul 5 12:23:07 Tower kernel: sd 9:0:2:0: [sdp] Synchronizing SCSI cache
Jul 5 12:23:07 Tower kernel: sd 9:0:2:0: [sdp] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jul 5 12:23:07 Tower emhttpd: read SMART /dev/sdr
Jul 5 12:23:07 Tower emhttpd: read SMART /dev/sds
Jul 5 12:23:07 Tower emhttpd: read SMART /dev/sdq
Jul 5 12:23:07 Tower kernel: sd 9:0:3:0: [sdq] Synchronizing SCSI cache
Jul 5 12:23:07 Tower kernel: sd 9:0:3:0: [sdq] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jul 5 12:23:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jul 5 12:23:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jul 5 12:23:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jul 5 12:23:07 Tower kernel: sd 9:0:4:0: [sdr] Synchronizing SCSI cache
Jul 5 12:23:07 Tower kernel: sd 9:0:4:0: [sdr] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jul 5 12:23:07 Tower unassigned.devices: Warning: Can't get rotational setting of '/dev/sdq'.
Jul 5 12:23:07 Tower unassigned.devices: Warning: Can't get rotational setting of '/dev/sdq'.
Jul 5 12:23:07 Tower unassigned.devices: Warning: Can't get rotational setting of '/dev/sdr'.
Jul 5 12:23:07 Tower unassigned.devices: Warning: Can't get rotational setting of '/dev/sdr'.
Jul 5 12:23:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jul 5 12:23:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jul 5 12:23:07 Tower kernel: program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Jul 5 12:23:07 Tower kernel: sd 9:0:5:0: [sds] Synchronizing SCSI cache
Jul 5 12:23:07 Tower kernel: sd 9:0:5:0: [sds] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221100000000)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: removing handle(0x0009), sas_addr(0x4433221100000000)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: enclosure logical id(0x5b8ca3a0f0160c00), slot(3)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221101000000)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: removing handle(0x000a), sas_addr(0x4433221101000000)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: enclosure logical id(0x5b8ca3a0f0160c00), slot(2)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221102000000)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: removing handle(0x000b), sas_addr(0x4433221102000000)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: enclosure logical id(0x5b8ca3a0f0160c00), slot(1)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221104000000)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: removing handle(0x000c), sas_addr(0x4433221104000000)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: enclosure logical id(0x5b8ca3a0f0160c00), slot(7)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221105000000)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: removing handle(0x000d), sas_addr(0x4433221105000000)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: enclosure logical id(0x5b8ca3a0f0160c00), slot(6)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: mpt3sas_transport_port_remove: removed: sas_addr(0x4433221103000000)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: removing handle(0x000e), sas_addr(0x4433221103000000)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: enclosure logical id(0x5b8ca3a0f0160c00), slot(0)
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: unexpected doorbell active!
Jul 5 12:23:07 Tower kernel: mpt2sas_cm1: sending diag reset !!
Jul 5 12:23:08 Tower kernel: mpt2sas_cm1: Invalid host diagnostic register value
Jul 5 12:23:08 Tower kernel: mpt2sas_cm1: System Register set:
Jul 5 12:23:08 Tower kernel: 00000000: ffffffff
Jul 5 12:23:08 Tower kernel: 00000004: ffffffff
Jul 5 12:23:08 Tower kernel: 00000008: ffffffff
Jul 5 12:23:08 Tower kernel: 0000000c: ffffffff
Jul 5 12:23:08 Tower kernel: 00000010: ffffffff
Jul 5 12:23:08 Tower kernel: 00000014: ffffffff
Jul 5 12:23:08 Tower kernel: 00000018: ffffffff
Jul 5 12:23:08 Tower kernel: 0000001c: ffffffff
<REPEATED>
Jul 5 12:23:08 Tower kernel: 000000f8: ffffffff
Jul 5 12:23:08 Tower kernel: 000000fc: ffffffff
Jul 5 12:23:08 Tower kernel: mpt2sas_cm1: diag reset: FAILED
Jul 5 12:26:40 Tower root: Total Spundown: 8
Jul 5 12:31:41 Tower root: Total Spundown: 8
Jul 5 12:33:00 Tower kernel: md: disk5 read error, sector=9102416
Jul 5 12:33:00 Tower kernel: md: disk2 read error, sector=9102416
Jul 5 12:33:00 Tower kernel: md: disk4 read error, sector=9102416
Jul 5 12:33:00 Tower kernel: md: disk6 read error, sector=9102416
Jul 5 12:33:10 Tower emhttpd: read SMART /dev/sdj
Jul 5 12:33:10 Tower emhttpd: read SMART /dev/sdk
Jul 5 12:33:10 Tower emhttpd: read SMART /dev/sdg
Jul 5 12:33:10 Tower kernel: XFS (md5): metadata I/O error in "xfs_da_read_buf+0x9e/0xfe [xfs]" at daddr 0x8ae450 len 8 error 5
Jul 5 12:33:10 Tower kernel: XFS (md5): metadata I/O error in "xfs_da_read_buf+0x9e/0xfe [xfs]" at daddr 0x8ae450 len 8 error 5
Jul 5 12:33:10 Tower emhttpd: read SMART /dev/sdf
Jul 5 12:33:10 Tower emhttpd: read SMART /dev/sdl
Jul 5 12:33:10 Tower emhttpd: read SMART /dev/sdi

I have 2 SAS controller cards. I'd initially thought one of them might be failing, but when all the drives went I thought it must be motherboard related. Nothing has changed on the machine recently, and it's been running quite nicely for a while. I'm a bit stuck as to where to look next.

Diagnostics attached: tower-diagnostics-20210705-1244.zip
JorgeB Posted July 5, 2021
You have 2 HBAs, mpt2sas_cm0 and mpt2sas_cm1, and cm1 is the one with the problem. You can see which card that is by the disks connected to it. Check that it's well seated and sufficiently cooled; you can also try a different slot if available.
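One way to see which disks hang off which controller is to group the kernel's `sd H:C:T:L: [sdX]` messages by their SCSI host number (the first field). The following is a minimal sketch, not something posted in the thread: the function name `disks_by_scsi_host` is made up for illustration, and it only groups by host number. Matching a host number to `mpt2sas_cm0` or `mpt2sas_cm1` still has to be inferred from which controller logged faults at the same moment, or confirmed from the full diagnostics.

```python
import re
from collections import defaultdict

# Matches kernel lines such as "sd 9:0:4:0: [sdr] ..." and captures the
# SCSI host number (first field of H:C:T:L) and the block device name.
SD_LINE = re.compile(r"\bsd (\d+):\d+:\d+:\d+: \[(sd[a-z]+)\]")


def disks_by_scsi_host(log_text):
    """Return {scsi_host_number: sorted device names} seen in the log text.

    Hypothetical helper for illustration; it does not query the system,
    it only parses syslog text that is passed in.
    """
    hosts = defaultdict(set)
    for host, dev in SD_LINE.findall(log_text):
        hosts[int(host)].add(dev)
    return {host: sorted(devs) for host, devs in hosts.items()}


# A few lines in the same shape as the syslog excerpts in this thread.
sample = """\
Jul 5 12:23:07 Tower kernel: sd 9:0:0:0: [sdn] Synchronizing SCSI cache
Jul 5 12:23:07 Tower kernel: sd 9:0:4:0: [sdr] tag#731 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=4s
Jul 5 12:23:07 Tower emhttpd: read SMART /dev/sdr
Jul 15 20:36:16 Tower kernel: sd 8:0:0:0: [sdf] Synchronizing SCSI cache
"""

if __name__ == "__main__":
    for host, devs in disks_by_scsi_host(sample).items():
        print(f"SCSI host {host}: {', '.join(devs)}")
```

Feeding it the whole syslog from the diagnostics zip instead of `sample` would list every disk that reported errors, grouped per host adapter.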
michaelmcq Posted July 15, 2021

Thank you. I took the server down and reseated the card and thought all was sorted, but it's gone again tonight, and this time it was both cards/all drives. Could they both fail at pretty much the same time? That seems unlikely. I don't really know what else to look for in the logs; there was nothing else in the log immediately before both cards failed. Again, they're all back after a reboot.

Jul 15 19:05:44 Tower root: Total Spundown: 1
Jul 15 19:05:44 Tower root: Entering Turbo Mode
Jul 15 19:05:44 Tower kernel: mdcmd (160): set md_write_method 1
Jul 15 19:05:44 Tower kernel:
Jul 15 19:10:44 Tower root: Total Spundown: 1
### [PREVIOUS LINE REPEATED 4 TIMES] ###
Jul 15 19:30:46 Tower emhttpd: spinning down /dev/sds
Jul 15 19:31:06 Tower emhttpd: spinning down /dev/sdl
Jul 15 19:31:18 Tower emhttpd: spinning down /dev/sdj
Jul 15 19:31:29 Tower emhttpd: spinning down /dev/sdo
Jul 15 19:31:39 Tower emhttpd: spinning down /dev/sdp
Jul 15 19:31:43 Tower emhttpd: spinning down /dev/sdk
Jul 15 19:31:57 Tower emhttpd: spinning down /dev/sdh
Jul 15 19:31:57 Tower emhttpd: spinning down /dev/sdi
Jul 15 19:35:44 Tower root: Total Spundown: 9
Jul 15 19:35:44 Tower root: Entering Normal Mode
Jul 15 19:35:44 Tower kernel: mdcmd (161): set md_write_method 0
Jul 15 19:35:44 Tower kernel:
Jul 15 19:40:44 Tower root: Total Spundown: 9
### [PREVIOUS LINE REPEATED 8 TIMES] ###
Jul 15 20:23:48 Tower emhttpd: read SMART /dev/sdo
Jul 15 20:25:44 Tower root: Total Spundown: 8
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Jul 15 20:36:11 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 15 20:36:11 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 15 20:36:12 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 15 20:36:12 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 15 20:36:13 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 15 20:36:13 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 15 20:36:14 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 15 20:36:14 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 15 20:36:15 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 15 20:36:15 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 15 20:36:16 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 15 20:36:16 Tower kernel: mpt2sas_cm0: _base_fault_reset_work: Running mpt3sas_dead_ioc thread success !!!!
Jul 15 20:36:16 Tower kernel: sd 8:0:0:0: [sdf] Synchronizing SCSI cache
Jul 15 20:36:16 Tower kernel: sd 8:0:0:0: [sdf] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
Jul 15 20:36:16 Tower kernel: sd 8:0:4:0: [sdj] tag#803 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=5s
Jul 15 20:36:16 Tower kernel: sd 8:0:4:0: [sdj] tag#803 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Jul 15 20:36:16 Tower kernel: sd 8:0:4:0: [sdj] tag#804 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 15 20:36:16 Tower kernel: sd 8:0:4:0: [sdj] tag#804 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00
Jul 15 20:36:16 Tower kernel: sd 8:0:5:0: [sdk] tag#805 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 15 20:36:16 Tower kernel: sd 8:0:5:0: [sdk] tag#805 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Jul 15 20:36:16 Tower kernel: sd 8:0:5:0: [sdk] tag#806 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 15 20:36:16 Tower kernel: sd 8:0:5:0: [sdk] tag#806 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00
Jul 15 20:36:16 Tower kernel: sd 8:0:2:0: [sdh] tag#807 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 15 20:36:16 Tower kernel: sd 8:0:2:0: [sdh] tag#807 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Jul 15 20:36:16 Tower kernel: sd 8:0:2:0: [sdh] tag#808 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=0s
Jul 15 20:36:16 Tower kernel: sd 8:0:2:0: [sdh] tag#808 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 98 00

tower-diagnostics-20210715-2051.zip

Edited July 15, 2021 by michaelmcq
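To check whether both controllers really dropped out at the same moment, the "SAS host is non-operational" lines can be tallied per controller together with the first time each one appears. This is a rough sketch under stated assumptions: `summarize_faults` is a hypothetical helper, and it assumes the syslog line layout seen in the excerpts above (month, day, time, hostname, `kernel:`, then the controller name).

```python
import re

# Assumed line layout, taken from the excerpts in this thread:
#   "Jul 15 20:36:11 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!"
FAULT = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"(?P<ctrl>mpt\dsas_cm\d+): SAS host is non-operational"
)


def summarize_faults(log_text):
    """Return {controller: {"first_seen": timestamp, "count": n}}.

    Hypothetical helper: it only parses text, so timestamps stay as the
    raw strings found in the log (syslog lines carry no year).
    """
    summary = {}
    for line in log_text.splitlines():
        match = FAULT.match(line.strip())
        if not match:
            continue
        entry = summary.setdefault(
            match["ctrl"], {"first_seen": match["ts"], "count": 0}
        )
        entry["count"] += 1
    return summary


sample = """\
Jul 15 20:25:44 Tower root: Total Spundown: 8
Jul 15 20:36:11 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 15 20:36:11 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
Jul 15 20:36:12 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!
Jul 15 20:36:12 Tower kernel: mpt2sas_cm1: SAS host is non-operational !!!!
"""

if __name__ == "__main__":
    for ctrl, info in summarize_faults(sample).items():
        print(f"{ctrl}: first fault {info['first_seen']}, {info['count']} messages")
```

If both controllers show the same first timestamp, as they do in the July 15 excerpt, that points away from a single bad card and toward something the two cards share.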
JorgeB Posted July 16, 2021
If they are both failing it could be a board/power problem; it's unlikely for both to fail at the same time on their own.
michaelmcq Posted July 16, 2021
Thanks. I think there might be a correlation between this happening and me hammering the SSDs in there at the same time. They’re not on the HBAs but on the motherboard (Z490-A-PRO). So I was thinking either power draw or something motherboard related when its onboard drives are working hard? Off to research power consumption!
JorgeB Posted July 16, 2021
40 minutes ago, michaelmcq said: So I was thinking either power draw or something motherboard related when its onboard drives are working hard?
Yes, both are possible; in that case I would suspect the board first.
michaelmcq Posted July 16, 2021
Could it be PCIe lanes? I don’t understand them well enough, but I wonder whether running 2 HBAs (14 drives) plus 4 SSDs could cause this problem?
JorgeB Posted July 16, 2021
If you mean a lack of lanes, no, or the HBA wouldn't be detected. The bottom slot is only x4, but that could at most be a performance issue, not a reliability issue.
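For anyone who wants to see what link width a card actually negotiated in a given slot, Linux exposes it through sysfs attributes such as `current_link_width` and `max_link_width` under `/sys/bus/pci/devices/<address>/`. The sketch below simply reads those files. The helper name `pcie_link_status` and the PCI address `0000:01:00.0` are placeholders invented for illustration; the real address of each HBA has to be looked up on the machine (for example in `lspci` output).

```python
from pathlib import Path


def pcie_link_status(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Read the negotiated and maximum PCIe link width/speed for one device.

    Hypothetical helper: returns None for any attribute that is missing,
    for example when the address does not exist on this machine.
    """
    device_dir = Path(sysfs_root) / pci_address

    def read(name):
        path = device_dir / name
        return path.read_text().strip() if path.is_file() else None

    return {
        "current_width": read("current_link_width"),
        "max_width": read("max_link_width"),
        "current_speed": read("current_link_speed"),
        "max_speed": read("max_link_speed"),
    }


if __name__ == "__main__":
    # Placeholder address; replace with the HBA's real PCI address.
    status = pcie_link_status("0000:01:00.0")
    print(status)
    if status["current_width"] and status["current_width"] != status["max_width"]:
        print("Link is running narrower than the device supports.")
```

A card that supports x8 but reports a current width of 4 would match the x4 bottom slot described above, which affects throughput rather than stability.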
michaelmcq Posted September 11, 2021
After a month or so of this not happening I’ve had it 3 times this week, so I’m back to investigating 😞 Any suggestions for the best way to identify the cause? I don’t really want to replace parts that are working, and I suspect one of: backplane, motherboard, or PSU.
jcabello7 Posted August 20, 2022
Did you finally solve it? I have the same problem! Thanks
Almighty Posted February 17, 2023
I just encountered the same issue with a 9300-16i card I bought on eBay and Linux kernel 5.19. I took the HBA out of service, went back to a known-stable 5.17 kernel, and put in a different model of HBA, a 9400-16i, to see whether the problem can be isolated to the HBA card. It's a new motherboard/CPU/RAM combo, a new Ubuntu 22.04 install, and a new HBA, in the same SC846B chassis. Lots of potential fault areas, so I'll try to isolate one thing at a time.
am1racle Posted November 30, 2023
Sorry to necro the thread, but I have now encountered the same issue. Any long-term solutions? I just tried adding 2 fans to the HBA (2x 120mm Noctuas pointed right at the heatsink, and the heatsink isn't even hot to the touch), but the issue still occurs. I'm using a 9300-16i, and it seems that every time I run a parity check or drive clear, my entire system freezes, the dashboard becomes unresponsive, and I have to do a hard power down/restart. If I don't run a parity check or add a drive, it seems to work completely fine.
JorgeB Posted December 1, 2023
Try a different PCIe slot if possible.