apazzy Posted June 29, 2021 (edited)

Hi all, I'm having trouble pinning down a disk issue I'm experiencing; I think it might be the HBA, but I don't have another to test. I've rebuilt the disk(s), replaced the disk(s), re-seated the cable(s), and replaced the cable(s). Previously, two disks had been disabled, and both were rebuilt and replaced. Everything had been running fine for over a week, maybe a few weeks, but now one of the disks was disabled again.

Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#504 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=3s
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#504 Sense Key : 0x2 [current]
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#504 ASC=0x4 ASCQ=0x0
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#504 CDB: opcode=0x88 88 00 00 00 00 00 e9 1a 6d 68 00 00 00 08 00 00
Jun 29 16:03:27 ghost kernel: blk_update_request: I/O error, dev sdk, sector 3910823272 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 29 16:03:27 ghost kernel: md: disk2 read error, sector=3910823208
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#505 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#505 Sense Key : 0x2 [current]
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#505 ASC=0x4 ASCQ=0x0
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#505 CDB: opcode=0x8a 8a 00 00 00 00 00 e9 1a 6d 68 00 00 00 08 00 00
Jun 29 16:03:27 ghost kernel: blk_update_request: I/O error, dev sdk, sector 3910823272 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 29 16:03:27 ghost kernel: md: disk2 write error, sector=3910823208

An extended SMART test shows all clear; I've attached diagnostics here if anyone can take a look. Some people have mentioned disabling spin-down, but that's already disabled in the global disk settings, and the currently failing disk is set to 'use default.' Would anyone have any ideas on what to try next?
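(Aside for anyone reading logs like these: the failing LBA is embedded in the CDB bytes the kernel prints. For READ(16)/WRITE(16) commands, opcodes 0x88/0x8a, bytes 2 through 9 of the 16-byte CDB are the big-endian logical block address, and it should match the sector in the matching blk_update_request line. In these logs the md sector is 64 lower, consistent with the data partition starting at sector 64. A minimal sketch; the helper name is mine, not from any tool:)

```python
# Pull the LBA out of a SCSI READ(16)/WRITE(16) CDB as printed by the kernel.
# Bytes 2..9 of the 16-byte CDB hold the big-endian logical block address.
def cdb_lba(cdb_hex: str) -> int:
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    assert cdb[0] in (0x88, 0x8A), "expects READ(16) or WRITE(16)"
    return int.from_bytes(cdb[2:10], "big")

# CDB from the log above:
lba = cdb_lba("88 00 00 00 00 00 e9 1a 6d 68 00 00 00 08 00 00")
print(lba)       # device sector in blk_update_request: 3910823272
print(lba - 64)  # md sector (partition offset of 64): 3910823208
```

This confirms both error reports refer to the same on-disk location, just expressed relative to the device versus the partition.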
My next thought would be to move drives around (swap SATA cables) and see whether new drives on the same HBA fail, or the same disk fails again. However, I really don't want to rebuild this disk again if possible (or cause other issues). Thanks in advance for any ideas.

EDIT: I think this is resolved. I replaced my drive power cables. Thanks everyone for the help. Hopefully I didn't destroy my drives with all these rebuilds...

Edited November 14, 2021 by apazzy: Resolution added.
trurl Posted June 30, 2021

Which controller is that disk using?

2 hours ago, apazzy said: "don't want to rebuild this disk again"

You will have to rebuild the disk again since it is disabled.
apazzy Posted June 30, 2021 Author (edited)

11 minutes ago, trurl said: "Which controller is that disk using?"

[1000:0087] 07:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
mpt2sas_cm1: LSISAS2308: FWVersion(20.00.07.00), ChipRevision(0x05), BiosVersion(07.39.02.00)

I don't believe any disks have failed on this other card, but unfortunately I didn't keep track:

[1000:0087] 01:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
mpt2sas_cm0: LSISAS2308: FWVersion(20.00.07.00), ChipRevision(0x05), BiosVersion(07.39.02.00)

Quote: "You will have to rebuild the disk again since it is disabled."

Sorry, I meant more than once: is there anything I can do now, before rebuilding the disk, to prevent having to rebuild it yet again?

Edited June 30, 2021 by apazzy
apazzy Posted June 30, 2021 Author

Annnnnd another disk failed.

Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9111 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9111 Sense Key : 0x2 [current]
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9111 ASC=0x4 ASCQ=0x0
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9111 CDB: opcode=0x88 88 00 00 00 00 00 17 45 10 c0 00 00 00 08 00 00
Jun 30 11:30:20 ghost kernel: blk_update_request: I/O error, dev sdf, sector 390402240 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 30 11:30:20 ghost kernel: md: disk1 read error, sector=390402176
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9116 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9116 Sense Key : 0x2 [current]
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9116 ASC=0x4 ASCQ=0x0
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9116 CDB: opcode=0x8a 8a 00 00 00 00 00 17 45 10 c0 00 00 00 08 00 00
Jun 30 11:30:20 ghost kernel: blk_update_request: I/O error, dev sdf, sector 390402240 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 30 11:30:20 ghost kernel: md: disk1 write error, sector=390402176

Looks like it's not just the one HBA: sdk failed yesterday and sdf failed today.
IOMMU group 33: [1000:0087] 01:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
[5:0:0:0] disk ATA WDC WD121KFBX-68 0A83 /dev/sdb 12.0TB
[5:0:1:0] disk ATA HGST HUS728T8TAL W414 /dev/sde 8.00TB
[5:0:2:0] disk ATA WDC WD80EMAZ-00W 0A83 /dev/sdf 8.00TB
[5:0:3:0] disk ATA WDC WD120EMAZ-11 0A81 /dev/sdg 12.0TB

IOMMU group 36: [1000:0087] 07:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
[6:0:0:0] disk ATA WDC WD120EMAZ-11 0A81 /dev/sdh 12.0TB
[6:0:1:0] disk ATA WDC WD80EMAZ-00W 0A83 /dev/sdi 8.00TB
[6:0:2:0] disk ATA WDC WD80EFAX-68L 0A83 /dev/sdj 8.00TB
[6:0:3:0] disk ATA HGST HUH721212AL W925 /dev/sdk 12.0TB

I have to rebuild one of these anyway. Is it better to rebuild them both at the same time (fewer reads?), or is it just as read-intensive as doing them separately?
Michael_P Posted June 30, 2021

Your drives are pretty warm. Do you have enough airflow for the HBA?
apazzy Posted July 1, 2021 Author

10 hours ago, Michael_P said: "Your drives are pretty warm, do you have enough airflow for the HBA?"

Great question: apparently one of my front fans was unplugged... That lines up with the failing disks almost exactly. Interestingly, the temps didn't seem that much higher; I initially attributed it to the outside temperature. Time to rebuild the disks and see... thank you!
Nexius2 Posted July 1, 2021

For what it's worth, when I had a similar issue it was the case backplane dropping the HDDs. I removed it and plugged the disks directly into the HBA, and there were no more errors.
apazzy Posted July 1, 2021 Author

I'm not using a backplane, but that's good to know since I'm looking to get a case with a backplane soon. In other news, I think one of my SSDs died during this. Fun. Remember to keep your drives cool: even in the same hot room, with the additional cooling I'm down from ~45-50C to ~35C on all drives.
apazzy Posted September 7, 2021 Author (edited)

This happened again to two disks, one on each controller and in different sections of my case. I rebuilt them both, and now one has failed yet again. I'm not sure what else to check here. Motherboard SATA is not reliable, so I'd prefer not to touch anything there. I'm not opposed to replacing the HBAs if anyone has recommendations. Please let me know if you have any thoughts or need any info. Thanks in advance.

Sep 7 13:56:53 ghost kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Sep 7 13:56:53 ghost kernel: scsi_io_completion_action: 1 callbacks suppressed
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8547 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=3s
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8547 Sense Key : 0x2 [current]
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8547 ASC=0x4 ASCQ=0x0
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8547 CDB: opcode=0x88 88 00 00 00 00 01 81 37 e3 08 00 00 04 00 00 00
Sep 7 13:56:53 ghost kernel: print_req_error: 1 callbacks suppressed
Sep 7 13:56:53 ghost kernel: blk_update_request: I/O error, dev sdk, sector 6462890760 op 0x0:(READ) flags 0x0 phys_seg 128 prio class 0
...
Sep 7 13:56:53 ghost kernel: md: disk2 read error, sector=6462891712
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8548 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=3s
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8548 Sense Key : 0x2 [current]
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8548 ASC=0x4 ASCQ=0x0
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8548 CDB: opcode=0x88 88 00 00 00 00 01 81 37 db 08 00 00 04 00 00 00
Sep 7 13:56:53 ghost kernel: blk_update_request: I/O error, dev sdk, sector 6462888712 op 0x0:(READ) flags 0x0 phys_seg 128 prio class 0
...
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8549 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=3s
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8549 Sense Key : 0x2 [current]
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8549 ASC=0x4 ASCQ=0x0
Sep 7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8549 CDB: opcode=0x88 88 00 00 00 00 01 81 37 df 08 00 00 04 00 00 00
Sep 7 13:56:53 ghost kernel: blk_update_request: I/O error, dev sdk, sector 6462889736 op 0x0:(READ) flags 0x0 phys_seg 128 prio class 0

Edited November 14, 2021 by apazzy: removed diagnostics
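(Side note for anyone triaging logs like these: a quick way to see which devices, and therefore which controller and cable paths, are throwing errors is to tally the blk_update_request lines per device. A rough sketch, assuming the standard kernel log format shown above; the function name is mine:)

```python
import re
from collections import Counter

def tally_io_errors(log_text: str) -> Counter:
    """Count kernel I/O errors per block device in syslog-style text."""
    pat = re.compile(r"blk_update_request: I/O error, dev (\w+), sector \d+")
    return Counter(pat.findall(log_text))

# Sample lines taken from the posts in this thread:
log = """\
Jun 29 16:03:27 ghost kernel: blk_update_request: I/O error, dev sdk, sector 3910823272 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 30 11:30:20 ghost kernel: blk_update_request: I/O error, dev sdf, sector 390402240 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 30 11:30:20 ghost kernel: blk_update_request: I/O error, dev sdf, sector 390402240 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
"""
print(tally_io_errors(log))  # Counter({'sdf': 2, 'sdk': 1})
```

Run against the full syslog, a tally concentrated on devices sharing one power or data cable run can help point at the common component.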
Vr2Io Posted September 7, 2021

Are the two SAS2308s physically two 8i HBAs, or one 16i HBA?
apazzy Posted September 7, 2021 Author

33 minutes ago, Vr2Io said: "Are the two SAS2308s physically two 8i HBAs, or one 16i HBA?"

Two SAS2308s, with one failing disk on each: one in the top middle and one in the bottom middle of the front of the case.
Vr2Io Posted September 7, 2021 (edited)

2 hours ago, apazzy said: "Sep 7 13:56:53 ghost kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)"

Decoding that error code with https://github.com/baruch/lsi_decode_loginfo indicates the SATA link went down. I would suggest you mount the disabled disk with UD, then check whether the file system and files can be read without problems, to verify that the disk was only dropped by a link-down event.

lsi_decode_loginfo.py 0x31110d00
Value 31110D00h
Type: 30000000h SAS
Origin: 01000000h PL
Code: 00110000h PL_LOGINFO_CODE_RESET See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code: 00000D00h PL_LOGINFO_SUB_CODE_SATA_LINK_DOWN

Edited September 7, 2021 by Vr2Io
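(The decode above is just bitfield slicing of the 32-bit log_info value. A minimal sketch of the same split, with the field layout as used by the MPT Fusion driver headers; the symbolic labels in the comments cover only the values seen here, not the full tables:)

```python
def decode_loginfo(value: int) -> dict:
    """Split an mpt2sas log_info value into its bitfields.
    Layout: bits 31..28 type, 27..24 originator, 23..16 code, 15..0 sub-code."""
    return {
        "type": (value >> 28) & 0xF,        # 0x3 = SAS
        "originator": (value >> 24) & 0xF,  # 0x1 = PL (protocol layer)
        "code": (value >> 16) & 0xFF,       # 0x11 = RESET
        "sub_code": value & 0xFFFF,         # 0x0D00 = SATA_LINK_DOWN
    }

info = decode_loginfo(0x31110D00)
print({k: hex(v) for k, v in info.items()})
# {'type': '0x3', 'originator': '0x1', 'code': '0x11', 'sub_code': '0xd00'}
```

The fields match the tool's output above: SAS origin, PL layer, reset code, SATA-link-down sub-code.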
apazzy Posted September 8, 2021 Author

Good catch, thank you for that. I'm very confused about why it's happening to two different drives on two different controllers, but I've ordered replacement cables to see if that fixes it. I appreciate your help.
Vr2Io Posted September 8, 2021

3 hours ago, apazzy said: "I'm very confused why it's happening for two different drives on two different controllers"

Yes, that makes it quite hard to determine the root cause. Please also check and reseat the power cables.