apazzy Posted June 29, 2021 (edited)

Hi all, I'm having trouble pinning down a disk issue I'm experiencing; I think it might be the HBA, but I don't have another to test. I've rebuilt the disk(s), replaced the disk(s), re-seated the cable(s), and replaced the cable(s). Previously, two disks had been disabled, and both were rebuilt and replaced. Everything had been running fine for over a week, maybe a few weeks, but now one of the disks was disabled again.

Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#504 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=3s
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#504 Sense Key : 0x2 [current]
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#504 ASC=0x4 ASCQ=0x0
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#504 CDB: opcode=0x88 88 00 00 00 00 00 e9 1a 6d 68 00 00 00 08 00 00
Jun 29 16:03:27 ghost kernel: blk_update_request: I/O error, dev sdk, sector 3910823272 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 29 16:03:27 ghost kernel: md: disk2 read error, sector=3910823208
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#505 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#505 Sense Key : 0x2 [current]
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#505 ASC=0x4 ASCQ=0x0
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#505 CDB: opcode=0x8a 8a 00 00 00 00 00 e9 1a 6d 68 00 00 00 08 00 00
Jun 29 16:03:27 ghost kernel: blk_update_request: I/O error, dev sdk, sector 3910823272 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 29 16:03:27 ghost kernel: md: disk2 write error, sector=3910823208

An extended SMART test shows all clear; I've attached diagnostics here if anyone can take a look. Some people have mentioned disabling spin-down, but that's already disabled in the global disk settings, and the currently failing disk is set to 'use default.' Would anyone have any ideas on what to try next?
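(Aside for anyone reading logs like these: the failing LBA is embedded in the CDB bytes the kernel prints. For READ(16)/WRITE(16) commands, opcodes 0x88/0x8a, bytes 2 through 9 of the 16-byte CDB are the big-endian logical block address, and it should match the sector in the matching blk_update_request line. In these logs the md sector is 64 lower, consistent with the data partition starting at sector 64. A minimal sketch; the helper name is mine, not from any tool:)

```python
# Pull the LBA out of a SCSI READ(16)/WRITE(16) CDB as printed by the kernel.
# Bytes 2..9 of the 16-byte CDB hold the big-endian logical block address.
def cdb_lba(cdb_hex: str) -> int:
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    assert cdb[0] in (0x88, 0x8A), "expects READ(16) or WRITE(16)"
    return int.from_bytes(cdb[2:10], "big")

# CDB from the log above:
lba = cdb_lba("88 00 00 00 00 00 e9 1a 6d 68 00 00 00 08 00 00")
print(lba)       # device sector in blk_update_request: 3910823272
print(lba - 64)  # md sector (partition offset of 64): 3910823208
```

This confirms both error reports refer to the same on-disk location, just expressed relative to the device versus the partition.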
My next thought would be to move drives around (swap SATA cables) and see whether new drives on the same HBA fail, or the same disk fails again. However, I really don't want to rebuild this disk again if possible (or cause other issues). Thanks in advance for any ideas.

EDIT: I think this is resolved. I replaced my drive power cables. Thanks everyone for the help. Hopefully I didn't destroy my drives with all these rebuilds...

Edited November 14, 2021 by apazzy: Resolution added.
trurl Posted June 30, 2021

Which controller is that disk using?

2 hours ago, apazzy said: "don't want to rebuild this disk again"

You will have to rebuild the disk again since it is disabled.
apazzy Posted June 30, 2021 Author (edited)

11 minutes ago, trurl said: "Which controller is that disk using?"

[1000:0087] 07:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
mpt2sas_cm1: LSISAS2308: FWVersion(20.00.07.00), ChipRevision(0x05), BiosVersion(07.39.02.00)

I don't believe any disks have failed on this other card, but unfortunately I didn't keep track:

[1000:0087] 01:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
mpt2sas_cm0: LSISAS2308: FWVersion(20.00.07.00), ChipRevision(0x05), BiosVersion(07.39.02.00)

Quote: "You will have to rebuild the disk again since it is disabled."

Sorry, I meant more than once: is there anything I can do now, before rebuilding the disk, to prevent having to rebuild it yet again?

Edited June 30, 2021 by apazzy
apazzy Posted June 30, 2021 Author

Annnnnd another disk failed.

Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9111 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9111 Sense Key : 0x2 [current]
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9111 ASC=0x4 ASCQ=0x0
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9111 CDB: opcode=0x88 88 00 00 00 00 00 17 45 10 c0 00 00 00 08 00 00
Jun 30 11:30:20 ghost kernel: blk_update_request: I/O error, dev sdf, sector 390402240 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 30 11:30:20 ghost kernel: md: disk1 read error, sector=390402176
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9116 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9116 Sense Key : 0x2 [current]
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9116 ASC=0x4 ASCQ=0x0
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9116 CDB: opcode=0x8a 8a 00 00 00 00 00 17 45 10 c0 00 00 00 08 00 00
Jun 30 11:30:20 ghost kernel: blk_update_request: I/O error, dev sdf, sector 390402240 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 30 11:30:20 ghost kernel: md: disk1 write error, sector=390402176

Looks like it's not just the one HBA: sdk failed yesterday and sdf failed today.
IOMMU group 33: [1000:0087] 01:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
[5:0:0:0] disk ATA WDC WD121KFBX-68 0A83 /dev/sdb 12.0TB
[5:0:1:0] disk ATA HGST HUS728T8TAL W414 /dev/sde 8.00TB
[5:0:2:0] disk ATA WDC WD80EMAZ-00W 0A83 /dev/sdf 8.00TB
[5:0:3:0] disk ATA WDC WD120EMAZ-11 0A81 /dev/sdg 12.0TB

IOMMU group 36: [1000:0087] 07:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
[6:0:0:0] disk ATA WDC WD120EMAZ-11 0A81 /dev/sdh 12.0TB
[6:0:1:0] disk ATA WDC WD80EMAZ-00W 0A83 /dev/sdi 8.00TB
[6:0:2:0] disk ATA WDC WD80EFAX-68L 0A83 /dev/sdj 8.00TB
[6:0:3:0] disk ATA HGST HUH721212AL W925 /dev/sdk 12.0TB

I have to rebuild one of these anyway. Is it better to rebuild them both at the same time (fewer reads?), or is it just as read-intensive as doing them separately?
Michael_P Posted June 30, 2021

Your drives are pretty warm. Do you have enough airflow for the HBA?
apazzy Posted July 1, 2021 Author

10 hours ago, Michael_P said: "Your drives are pretty warm, do you have enough airflow for the HBA?"

Great question: apparently one of my front fans was unplugged... That lines up with the failing disks almost exactly. Interestingly, the temps didn't seem that much higher; I initially attributed it to the outside temperature. Time to rebuild the disks and see... thank you!
Nexius2 Posted July 1, 2021

For what it's worth, when I had a similar issue it was the case backplane dropping the HDDs. I removed it and plugged the disks directly into the HBA, and there were no more errors.
apazzy Posted July 1, 2021 Author

I'm not using a backplane, but that's good to know since I'm looking to get a case with a backplane soon. In other news, I think one of my SSDs died during this. Fun. Remember to keep your drives cool: even in the same hot room, with the additional cooling I'm down from ~45-50C to ~35C on all drives.
apazzy Posted September 7, 2021 Author (edited)

This happened again to two disks, one on each controller and in different sections of my case. I rebuilt them both, and now one has failed yet again. I'm not sure what else to check here. Motherboard SATA is not reliable, so I'd prefer not to touch anything there. I'm not opposed to replacing the HBAs if anyone has recommendations. Please let me know if you have any thoughts or need any info. Thanks in advance.

Sep 7 13:56:53 ghost kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Sep 7 13:56:53 ghost kernel: scsi_io_completion_action: 1 callbacks suppressed
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8547 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=3s
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8547 Sense Key : 0x2 [current]
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8547 ASC=0x4 ASCQ=0x0
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8547 CDB: opcode=0x88 88 00 00 00 00 01 81 37 e3 08 00 00 04 00 00 00
Sep 7 13:56:53 ghost kernel: print_req_error: 1 callbacks suppressed
Sep 7 13:56:53 ghost kernel: blk_update_request: I/O error, dev sdk, sector 6462890760 op 0x0:(READ) flags 0x0 phys_seg 128 prio class 0
...
Sep 7 13:56:53 ghost kernel: md: disk2 read error, sector=6462891712
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8548 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=3s
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8548 Sense Key : 0x2 [current]
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8548 ASC=0x4 ASCQ=0x0
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8548 CDB: opcode=0x88 88 00 00 00 00 01 81 37 db 08 00 00 04 00 00 00
Sep 7 13:56:53 ghost kernel: blk_update_request: I/O error, dev sdk, sector 6462888712 op 0x0:(READ) flags 0x0 phys_seg 128 prio class 0
...
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8549 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=3s
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8549 Sense Key : 0x2 [current]
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8549 ASC=0x4 ASCQ=0x0
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8549 CDB: opcode=0x88 88 00 00 00 00 01 81 37 df 08 00 00 04 00 00 00
Sep 7 13:56:53 ghost kernel: blk_update_request: I/O error, dev sdk, sector 6462889736 op 0x0:(READ) flags 0x0 phys_seg 128 prio class 0

Edited November 14, 2021 by apazzy: removed diagnostics
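(Side note for anyone triaging logs like these: a quick way to see which devices, and therefore which controller and cable paths, are throwing errors is to tally the blk_update_request lines per device. A rough sketch, assuming the standard kernel log format shown above; the function name is mine:)

```python
import re
from collections import Counter

def tally_io_errors(log_text: str) -> Counter:
    """Count kernel I/O errors per block device in syslog-style text."""
    pat = re.compile(r"blk_update_request: I/O error, dev (\w+), sector \d+")
    return Counter(pat.findall(log_text))

# Sample lines taken from the posts in this thread:
log = """\
Jun 29 16:03:27 ghost kernel: blk_update_request: I/O error, dev sdk, sector 3910823272 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 30 11:30:20 ghost kernel: blk_update_request: I/O error, dev sdf, sector 390402240 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 30 11:30:20 ghost kernel: blk_update_request: I/O error, dev sdf, sector 390402240 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
"""
print(tally_io_errors(log))  # Counter({'sdf': 2, 'sdk': 1})
```

Run against the full syslog, a tally concentrated on devices sharing one power or data cable run can help point at the common component.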
Vr2Io Posted September 7, 2021

Are the two SAS2308s physically two 8i HBAs, or one 16i HBA?
apazzy Posted September 7, 2021 Author

33 minutes ago, Vr2Io said: "Are the two SAS2308s physically two 8i HBAs, or one 16i HBA?"

Two SAS2308s, with one failing disk on each: one in the top middle and one in the bottom middle of the front of the case.
Vr2Io Posted September 7, 2021 (edited)

2 hours ago, apazzy said: "Sep 7 13:56:53 ghost kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)"

Decoding that error code with https://github.com/baruch/lsi_decode_loginfo indicates the SATA link went down. I would suggest you mount the disabled disk with UD, then check whether the file system and files can be read without problems, to verify that the disk was only dropped by a link-down event.

lsi_decode_loginfo.py 0x31110d00
Value 31110D00h
Type: 30000000h SAS
Origin: 01000000h PL
Code: 00110000h PL_LOGINFO_CODE_RESET See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code: 00000D00h PL_LOGINFO_SUB_CODE_SATA_LINK_DOWN

Edited September 7, 2021 by Vr2Io
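(The decode above is just bitfield slicing of the 32-bit log_info value. A minimal sketch of the same split, with the field layout as used by the MPT Fusion driver headers; the symbolic labels in the comments cover only the values seen here, not the full tables:)

```python
def decode_loginfo(value: int) -> dict:
    """Split an mpt2sas log_info value into its bitfields.
    Layout: bits 31..28 type, 27..24 originator, 23..16 code, 15..0 sub-code."""
    return {
        "type": (value >> 28) & 0xF,        # 0x3 = SAS
        "originator": (value >> 24) & 0xF,  # 0x1 = PL (protocol layer)
        "code": (value >> 16) & 0xFF,       # 0x11 = RESET
        "sub_code": value & 0xFFFF,         # 0x0D00 = SATA_LINK_DOWN
    }

info = decode_loginfo(0x31110D00)
print({k: hex(v) for k, v in info.items()})
# {'type': '0x3', 'originator': '0x1', 'code': '0x11', 'sub_code': '0xd00'}
```

The fields match the tool's output above: SAS origin, PL layer, reset code, SATA-link-down sub-code.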
apazzy Posted September 8, 2021 Author

Good catch, thank you for that. I'm very confused about why it's happening to two different drives on two different controllers, but I've ordered replacement cables to see if that fixes it. I appreciate your help.
Vr2Io Posted September 8, 2021

3 hours ago, apazzy said: "I'm very confused why it's happening for two different drives on two different controllers"

Yes, that makes it quite hard to determine the root cause. Please also check and reseat the power cables.