Disk repeatedly failing - 6.9.2 / 6.10-rc1



Hi all,

 

I'm having trouble pinning down a disk issue I'm experiencing. I think it might be the HBA, but I don't have another one to test with.

 

I've rebuilt the disk(s), replaced the disk(s), re-seated the cable(s), replaced the cable(s).

 

Previously, two disks had been disabled and they were both rebuilt and replaced.

 

Everything had been running fine for over a week, maybe a few weeks, but one of the disks has been disabled again.

 

Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#504 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=3s
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#504 Sense Key : 0x2 [current] 
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#504 ASC=0x4 ASCQ=0x0 
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#504 CDB: opcode=0x88 88 00 00 00 00 00 e9 1a 6d 68 00 00 00 08 00 00
Jun 29 16:03:27 ghost kernel: blk_update_request: I/O error, dev sdk, sector 3910823272 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 29 16:03:27 ghost kernel: md: disk2 read error, sector=3910823208
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#505 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#505 Sense Key : 0x2 [current] 
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#505 ASC=0x4 ASCQ=0x0 
Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#505 CDB: opcode=0x8a 8a 00 00 00 00 00 e9 1a 6d 68 00 00 00 08 00 00
Jun 29 16:03:27 ghost kernel: blk_update_request: I/O error, dev sdk, sector 3910823272 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 29 16:03:27 ghost kernel: md: disk2 write error, sector=3910823208
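
For anyone reading these lines later, here is a minimal sketch of how the CDB bytes can be decoded. It assumes the standard 16-byte READ(16)/WRITE(16) layout (opcode in byte 0, big-endian LBA in bytes 2-9, transfer length in bytes 10-13); the function name `decode_cdb16` is made up for illustration. Separately, sense key 0x2 with ASC 0x04 / ASCQ 0x00 reads as "NOT READY, logical unit not ready, cause not reportable" in the SCSI tables, which suggests the drive stopped responding rather than reporting a bad sector.

```python
def decode_cdb16(cdb_hex: str) -> dict:
    """Decode a 16-byte READ(16)/WRITE(16) CDB given as space-separated hex."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16:
        raise ValueError("expected a 16-byte CDB")
    # Only the two opcodes seen in the log above are named here.
    opcodes = {0x88: "READ(16)", 0x8A: "WRITE(16)"}
    return {
        "op": opcodes.get(cdb[0], hex(cdb[0])),
        "lba": int.from_bytes(cdb[2:10], "big"),       # starting logical block
        "blocks": int.from_bytes(cdb[10:14], "big"),   # transfer length in blocks
    }


read_cmd = decode_cdb16("88 00 00 00 00 00 e9 1a 6d 68 00 00 00 08 00 00")
print(read_cmd)
```

The decoded LBA is 3910823272, the same sector that blk_update_request reports. The md sector (3910823208) is 64 lower, which would be consistent with the array partition starting at sector 64, though that offset is an inference from these two numbers only.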

 

The SMART extended test shows all clear. I've attached diagnostics here if anyone can take a look.

 

Some people have mentioned disabling spin-down. That's already disabled in the global disk settings, and the currently failing disk is set to 'use default.'

 

Would anyone have any ideas on what to try next? My next thought would be to move drives around (SATA cables) and see whether the new drives on the same HBA fail or whether the same disk fails. However, I really don't want to rebuild this disk again if possible (or cause other issues).

 

Thanks in advance for any ideas.

 

EDIT: I think this is resolved. Replaced my drive power cables. Thanks everyone for the help. Hopefully didn't destroy my drives with all these rebuilds....

 

Edited by apazzy
Resolution added.

  

11 minutes ago, trurl said:

Which controller is that disk using?

[1000:0087] 07:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)

mpt2sas_cm1: LSISAS2308: FWVersion(20.00.07.00), ChipRevision(0x05), BiosVersion(07.39.02.00)

 

I don't believe any disks have failed on this other card but I didn't keep track unfortunately.

[1000:0087] 01:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)

mpt2sas_cm0: LSISAS2308: FWVersion(20.00.07.00), ChipRevision(0x05), BiosVersion(07.39.02.00)

 

Quote

You will have to rebuild the disk again since it is disabled.

 

Sorry, I meant more than one time. Is there anything I can do now, before rebuilding the disk, to avoid having to rebuild it yet again?

Edited by apazzy

Annnnnd another disk failed.

 

Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9111 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9111 Sense Key : 0x2 [current]
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9111 ASC=0x4 ASCQ=0x0
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9111 CDB: opcode=0x88 88 00 00 00 00 00 17 45 10 c0 00 00 00 08 00 00
Jun 30 11:30:20 ghost kernel: blk_update_request: I/O error, dev sdf, sector 390402240 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jun 30 11:30:20 ghost kernel: md: disk1 read error, sector=390402176
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9116 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=0s
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9116 Sense Key : 0x2 [current]
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9116 ASC=0x4 ASCQ=0x0
Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9116 CDB: opcode=0x8a 8a 00 00 00 00 00 17 45 10 c0 00 00 00 08 00 00
Jun 30 11:30:20 ghost kernel: blk_update_request: I/O error, dev sdf, sector 390402240 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
Jun 30 11:30:20 ghost kernel: md: disk1 write error, sector=390402176

 

Looks like it's not just the one HBA: sdk failed yesterday and sdf failed today.

 

IOMMU group 33:			 	[1000:0087] 01:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
				[5:0:0:0]    disk    ATA      WDC WD121KFBX-68 0A83  /dev/sdb   12.0TB
				[5:0:1:0]    disk    ATA      HGST HUS728T8TAL W414  /dev/sde   8.00TB
				[5:0:2:0]    disk    ATA      WDC WD80EMAZ-00W 0A83  /dev/sdf   8.00TB
				[5:0:3:0]    disk    ATA      WDC WD120EMAZ-11 0A81  /dev/sdg   12.0TB
IOMMU group 36:			 	[1000:0087] 07:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
				[6:0:0:0]    disk    ATA      WDC WD120EMAZ-11 0A81  /dev/sdh   12.0TB
				[6:0:1:0]    disk    ATA      WDC WD80EMAZ-00W 0A83  /dev/sdi   8.00TB
				[6:0:2:0]    disk    ATA      WDC WD80EFAX-68L 0A83  /dev/sdj   8.00TB
				[6:0:3:0]    disk    ATA      HGST HUH721212AL W925  /dev/sdk   12.0TB
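
To check which controller the errors land on, the syslog lines can be tallied by SCSI host number (the first field of `5:0:2:0` / `6:0:3:0`, which the listing above maps to the 01:00.0 and 07:00.0 cards). This is only a rough sketch: the helper name `errors_by_host` and the choice to count "Sense Key" lines are assumptions for illustration, and the sample lines are copied from the logs in this thread.

```python
import re
from collections import Counter

# Matches kernel lines such as: "sd 6:0:3:0: [sdk] tag#504 ..."
SD_LINE = re.compile(r"sd (\d+):(\d+):(\d+):(\d+): \[(sd[a-z]+)\]")


def errors_by_host(lines):
    """Count 'Sense Key' lines per (SCSI host number, device name)."""
    counts = Counter()
    for line in lines:
        m = SD_LINE.search(line)
        if m and "Sense Key" in line:
            counts[(int(m.group(1)), m.group(5))] += 1
    return counts


sample = [
    "Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#504 Sense Key : 0x2 [current] ",
    "Jun 29 16:03:27 ghost kernel: sd 6:0:3:0: [sdk] tag#505 Sense Key : 0x2 [current] ",
    "Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9111 Sense Key : 0x2 [current]",
    "Jun 30 11:30:20 ghost kernel: sd 5:0:2:0: [sdf] tag#9116 Sense Key : 0x2 [current]",
]
print(errors_by_host(sample))
```

With the sample lines, host 6 (sdk) and host 5 (sdf) each show two errors, so both cards are affected. Running it over a full syslog would show whether one host dominates.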

 

I have to rebuild one of these. Is it better to rebuild them both at the same time (fewer reads?), or is that just as read-intensive as doing them separately?

10 hours ago, Michael_P said:

Your drives are pretty warm, do you have enough airflow for the HBA?

 

Great question - apparently one of my front fans was unplugged... Aligns with the failing disks pretty much exactly.

 

Interestingly, the temps didn't seem that much higher. I initially attributed it to the outside temperature.

 

Time to rebuild the disks and see... thank you!


I'm not using a backplane, but that's good to know since I'm looking to get a case with a backplane soon.

 

In other news, I think one of my SSDs died during this. Fun. Remember to keep your drives cool. Even in the same hot room, with additional cooling I'm down from ~45-50°C to ~35°C on all drives.


I had this happen again to two disks, one on each controller in different sections of my case. I've rebuilt them both and one more failed again.

 

I'm not sure what else to check here. Motherboard SATA is not reliable, so I'd prefer not touching anything there. I'm not opposed to replacing the HBAs, if anyone has recommendations.

 

Please let me know if you have any thoughts or need any info.

 

Thanks in advance.

 

Sep  7 13:56:53 ghost kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
### [PREVIOUS LINE REPEATED 2 TIMES] ###
Sep  7 13:56:53 ghost kernel: scsi_io_completion_action: 1 callbacks suppressed
Sep  7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8547 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=3s
Sep  7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8547 Sense Key : 0x2 [current] 
Sep  7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8547 ASC=0x4 ASCQ=0x0 
Sep  7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8547 CDB: opcode=0x88 88 00 00 00 00 01 81 37 e3 08 00 00 04 00 00 00
Sep  7 13:56:53 ghost kernel: print_req_error: 1 callbacks suppressed
Sep  7 13:56:53 ghost kernel: blk_update_request: I/O error, dev sdk, sector 6462890760 op 0x0:(READ) flags 0x0 phys_seg 128 prio class 0
...
Sep  7 13:56:53 ghost kernel: md: disk2 read error, sector=6462891712
Sep  7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8548 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=3s
Sep  7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8548 Sense Key : 0x2 [current] 
Sep  7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8548 ASC=0x4 ASCQ=0x0 
Sep  7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8548 CDB: opcode=0x88 88 00 00 00 00 01 81 37 db 08 00 00 04 00 00 00
Sep  7 13:56:53 ghost kernel: blk_update_request: I/O error, dev sdk, sector 6462888712 op 0x0:(READ) flags 0x0 phys_seg 128 prio class 0
...
Sep  7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8549 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=3s
Sep  7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8549 Sense Key : 0x2 [current] 
Sep  7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8549 ASC=0x4 ASCQ=0x0 
Sep  7 13:56:53 ghost kernel: sd 6:0:3:0: [sdk] tag#8549 CDB: opcode=0x88 88 00 00 00 00 01 81 37 df 08 00 00 04 00 00 00
Sep  7 13:56:53 ghost kernel: blk_update_request: I/O error, dev sdk, sector 6462889736 op 0x0:(READ) flags 0x0 phys_seg 128 prio class 0

 

 

Edited by apazzy
removed diagnostics
2 hours ago, apazzy said:
Sep  7 13:56:53 ghost kernel: mpt2sas_cm1: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)

 

Decoding the error code with https://github.com/baruch/lsi_decode_loginfo indicates that the SATA link went down.

I would suggest mounting the disabled disk with UD (Unassigned Devices) and then checking whether the file system and files can be read without problems. That would verify whether the disk was dropped only because the link went down.

 

lsi_decode_loginfo.py 0x31110d00
Value           31110D00h
Type:           30000000h       SAS 
Origin:         01000000h       PL 
Code:           00110000h       PL_LOGINFO_CODE_RESET See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code:       00000D00h       PL_LOGINFO_SUB_CODE_SATA_LINK_DOWN 
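
For reference, the field split shown above can be reproduced with a few bit masks. This is a minimal sketch, not the decoder script itself: the bit positions are inferred from the masks printed in the output (30000000h, 01000000h, 00110000h, 00000D00h), and `split_loginfo` is a hypothetical name. It only splits the value; it does not look up the symbolic names.

```python
def split_loginfo(value: int) -> dict:
    """Split a 32-bit mpt2sas log_info value into its four fields."""
    return {
        "type": (value >> 28) & 0xF,      # 0x3 in the output above (SAS)
        "origin": (value >> 24) & 0xF,    # 0x1 in the output above (PL)
        "code": (value >> 16) & 0xFF,     # 0x11, as in the kernel line
        "sub_code": value & 0xFFFF,       # 0x0d00, as in the kernel line
    }


fields = split_loginfo(0x31110D00)
print({name: hex(v) for name, v in fields.items()})
```

The code and sub_code values match what the kernel already prints in the log line (code(0x11), sub_code(0x0d00)); the decoder's contribution is the lookup of the sub code name.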

 

Edited by Vr2Io