Disk failure - advice please



The system went down this evening with, eventually, a disabled parity disk. I don't have the full log, but I would appreciate any advice on how to interpret the section below, which I was able to capture. Was this actually an HDD failure, or could it have been a failure of something else?

 

(I shut down the system and am now rebuilding parity onto the same disk. No SMART errors are reported.)

 

Feb 14 21:53:17 Tower kernel: sd 5:0:3:0: attempting task abort!scmd(0x000000004a167f29), outstanding for 15259 ms & timeout 15000 ms
Feb 14 21:53:17 Tower kernel: sd 5:0:3:0: [sde] tag#245 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Feb 14 21:53:17 Tower kernel: scsi target5:0:3: handle(0x000c), sas_address(0x4433221102000000), phy(2)
Feb 14 21:53:17 Tower kernel: scsi target5:0:3: enclosure logical id(0x500605b001048b70), slot(1) 
Feb 14 21:53:48 Tower kernel: mpt2sas_cm0: In func: mpt3sas_scsih_issue_tm
Feb 14 21:53:48 Tower kernel: mpt2sas_cm0: Command Timeout
Feb 14 21:53:48 Tower kernel: mf:
Feb 14 21:53:48 Tower kernel: #011
Feb 14 21:53:48 Tower kernel: 0100000c 
Feb 14 21:53:48 Tower kernel: 00000100 
Feb 14 21:53:48 Tower kernel: 00000000 
Feb 14 21:53:48 Tower kernel: 00000000 
Feb 14 21:53:48 Tower kernel: 00000000 
Feb 14 21:53:48 Tower kernel: 00000000 
Feb 14 21:53:48 Tower kernel: 00000000 
Feb 14 21:53:48 Tower kernel: 00000000 
Feb 14 21:53:48 Tower kernel: 
Feb 14 21:53:48 Tower kernel: #011
Feb 14 21:53:48 Tower kernel: 00000000 
Feb 14 21:53:48 Tower kernel: 00000000 
Feb 14 21:53:48 Tower kernel: 00000000 
Feb 14 21:53:48 Tower kernel: 00000000 
Feb 14 21:53:48 Tower kernel: 000000f6 
Feb 14 21:53:48 Tower kernel: 
Feb 14 21:53:58 Tower kernel: mpt2sas_cm0: sending diag reset !!
Feb 14 21:53:59 Tower kernel: mpt2sas_cm0: diag reset: SUCCESS
Feb 14 21:53:59 Tower kernel: mpt2sas_cm0: CurrentHostPageSize is 0: Setting default host page size to 4k
Feb 14 21:54:14 Tower kernel: mpt2sas_cm0: config_request: manufacturing(0), action(0), form(0x00000000), smid(3428)
Feb 14 21:54:14 Tower kernel: mpt2sas_cm0: _config_request: command timeout
Feb 14 21:54:14 Tower kernel: mpt2sas_cm0: Command Timeout
Feb 14 21:54:14 Tower kernel: mf:
Feb 14 21:54:14 Tower kernel: #011
Feb 14 21:54:14 Tower kernel: 04000000 
Feb 14 21:54:14 Tower kernel: 00000000 
Feb 14 21:54:14 Tower kernel: 00000000 
Feb 14 21:54:14 Tower kernel: 00000000 
Feb 14 21:54:14 Tower kernel: 00000000 
Feb 14 21:54:14 Tower kernel: 09000000 
Feb 14 21:54:14 Tower kernel: 00000000 
Feb 14 21:54:14 Tower kernel: d3000000 
Feb 14 21:54:14 Tower kernel: 
Feb 14 21:54:14 Tower kernel: #011
Feb 14 21:54:14 Tower kernel: ffffffff 
Feb 14 21:54:14 Tower kernel: ffffffff 
Feb 14 21:54:14 Tower kernel: 00000000 
Feb 14 21:54:14 Tower kernel: 
Feb 14 21:54:14 Tower kernel: mpt2sas_cm0: mpt3sas_base_hard_reset_handler: FAILED
Feb 14 21:54:14 Tower kernel: sd 5:0:3:0: task abort: FAILED scmd(0x000000004a167f29)
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: attempting task abort!scmd(0x00000000bf95fe9f), outstanding for 72192 ms & timeout 30000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: [sdb] tag#160 CDB: opcode=0x88 88 00 00 00 00 03 48 e6 4f a8 00 00 00 20 00 00
Feb 14 21:54:14 Tower kernel: scsi target5:0:0: handle(0x000b), sas_address(0x4433221103000000), phy(3)
Feb 14 21:54:14 Tower kernel: scsi target5:0:0: enclosure logical id(0x500605b001048b70), slot(0) 
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: No reference found at driver, assuming scmd(0x00000000bf95fe9f) might have completed
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: task abort: SUCCESS scmd(0x00000000bf95fe9f)
Feb 14 21:54:14 Tower kernel: sd 5:0:1:0: attempting task abort!scmd(0x00000000b17000f0), outstanding for 72192 ms & timeout 30000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:1:0: [sdc] tag#161 CDB: opcode=0x88 88 00 00 00 00 03 48 e6 4f a8 00 00 00 20 00 00
Feb 14 21:54:14 Tower kernel: scsi target5:0:1: handle(0x0009), sas_address(0x4433221100000000), phy(0)
Feb 14 21:54:14 Tower kernel: scsi target5:0:1: enclosure logical id(0x500605b001048b70), slot(3) 
Feb 14 21:54:14 Tower kernel: sd 5:0:1:0: No reference found at driver, assuming scmd(0x00000000b17000f0) might have completed
Feb 14 21:54:14 Tower kernel: sd 5:0:1:0: task abort: SUCCESS scmd(0x00000000b17000f0)
Feb 14 21:54:14 Tower kernel: sd 5:0:2:0: attempting task abort!scmd(0x00000000f8eb2847), outstanding for 70142 ms & timeout 30000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:2:0: [sdd] tag#247 CDB: opcode=0x88 88 00 00 00 00 02 ef c6 e7 18 00 00 01 00 00 00
Feb 14 21:54:14 Tower kernel: scsi target5:0:2: handle(0x000a), sas_address(0x4433221101000000), phy(1)
Feb 14 21:54:14 Tower kernel: scsi target5:0:2: enclosure logical id(0x500605b001048b70), slot(2) 
Feb 14 21:54:14 Tower kernel: sd 5:0:2:0: No reference found at driver, assuming scmd(0x00000000f8eb2847) might have completed
Feb 14 21:54:14 Tower kernel: sd 5:0:2:0: task abort: SUCCESS scmd(0x00000000f8eb2847)
Feb 14 21:54:14 Tower kernel: sd 5:0:2:0: attempting task abort!scmd(0x00000000bc8dfafc), outstanding for 70142 ms & timeout 30000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:2:0: [sdd] tag#246 CDB: opcode=0x88 88 00 00 00 00 02 ef c6 e6 18 00 00 01 00 00 00
Feb 14 21:54:14 Tower kernel: scsi target5:0:2: handle(0x000a), sas_address(0x4433221101000000), phy(1)
Feb 14 21:54:14 Tower kernel: scsi target5:0:2: enclosure logical id(0x500605b001048b70), slot(2) 
Feb 14 21:54:14 Tower kernel: sd 5:0:2:0: No reference found at driver, assuming scmd(0x00000000bc8dfafc) might have completed
Feb 14 21:54:14 Tower kernel: sd 5:0:2:0: task abort: SUCCESS scmd(0x00000000bc8dfafc)
Feb 14 21:54:14 Tower kernel: sd 5:0:2:0: attempting task abort!scmd(0x0000000010fc6b1f), outstanding for 70142 ms & timeout 30000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:2:0: [sdd] tag#162 CDB: opcode=0x88 88 00 00 00 00 02 ef c6 e5 18 00 00 01 00 00 00
Feb 14 21:54:14 Tower kernel: scsi target5:0:2: handle(0x000a), sas_address(0x4433221101000000), phy(1)
Feb 14 21:54:14 Tower kernel: scsi target5:0:2: enclosure logical id(0x500605b001048b70), slot(2) 
Feb 14 21:54:14 Tower kernel: sd 5:0:2:0: No reference found at driver, assuming scmd(0x0000000010fc6b1f) might have completed
Feb 14 21:54:14 Tower kernel: sd 5:0:2:0: task abort: SUCCESS scmd(0x0000000010fc6b1f)
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: attempting task abort!scmd(0x000000009d34caf0), outstanding for 61718 ms & timeout 30000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: [sdb] tag#163 CDB: opcode=0x88 88 00 00 00 00 00 00 01 c1 98 00 00 00 08 00 00
Feb 14 21:54:14 Tower kernel: scsi target5:0:0: handle(0x000b), sas_address(0x4433221103000000), phy(3)
Feb 14 21:54:14 Tower kernel: scsi target5:0:0: enclosure logical id(0x500605b001048b70), slot(0) 
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: No reference found at driver, assuming scmd(0x000000009d34caf0) might have completed
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: task abort: SUCCESS scmd(0x000000009d34caf0)
Feb 14 21:54:14 Tower kernel: sd 5:0:1:0: attempting task abort!scmd(0x00000000491b1367), outstanding for 61718 ms & timeout 30000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:1:0: [sdc] tag#164 CDB: opcode=0x88 88 00 00 00 00 00 00 01 c1 98 00 00 00 08 00 00
Feb 14 21:54:14 Tower kernel: scsi target5:0:1: handle(0x0009), sas_address(0x4433221100000000), phy(0)
Feb 14 21:54:14 Tower kernel: scsi target5:0:1: enclosure logical id(0x500605b001048b70), slot(3) 
Feb 14 21:54:14 Tower kernel: sd 5:0:1:0: No reference found at driver, assuming scmd(0x00000000491b1367) might have completed
Feb 14 21:54:14 Tower kernel: sd 5:0:1:0: task abort: SUCCESS scmd(0x00000000491b1367)
Feb 14 21:54:14 Tower kernel: sd 5:0:5:0: attempting task abort!scmd(0x00000000868fe54a), outstanding for 50176 ms & timeout 30000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:5:0: [sdg] tag#170 CDB: opcode=0x88 88 00 00 00 00 03 2f 11 4d 68 00 00 00 08 00 00
Feb 14 21:54:14 Tower kernel: scsi target5:0:5: handle(0x000e), sas_address(0x4433221107000000), phy(7)
Feb 14 21:54:14 Tower kernel: scsi target5:0:5: enclosure logical id(0x500605b001048b70), slot(4) 
Feb 14 21:54:14 Tower kernel: sd 5:0:5:0: No reference found at driver, assuming scmd(0x00000000868fe54a) might have completed
Feb 14 21:54:14 Tower kernel: sd 5:0:5:0: task abort: SUCCESS scmd(0x00000000868fe54a)
Feb 14 21:54:14 Tower kernel: sd 5:0:5:0: attempting task abort!scmd(0x000000002aa71d1e), outstanding for 50176 ms & timeout 30000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:5:0: [sdg] tag#169 CDB: opcode=0x88 88 00 00 00 00 01 d1 fb c5 c0 00 00 00 20 00 00
Feb 14 21:54:14 Tower kernel: scsi target5:0:5: handle(0x000e), sas_address(0x4433221107000000), phy(7)
Feb 14 21:54:14 Tower kernel: scsi target5:0:5: enclosure logical id(0x500605b001048b70), slot(4) 
Feb 14 21:54:14 Tower kernel: sd 5:0:5:0: No reference found at driver, assuming scmd(0x000000002aa71d1e) might have completed
Feb 14 21:54:14 Tower kernel: sd 5:0:5:0: task abort: SUCCESS scmd(0x000000002aa71d1e)
Feb 14 21:54:14 Tower kernel: sd 5:0:5:0: attempting task abort!scmd(0x000000008cf2aa2f), outstanding for 50176 ms & timeout 30000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:5:0: [sdg] tag#168 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 c0 00 00 00 20 00 00
Feb 14 21:54:14 Tower kernel: scsi target5:0:5: handle(0x000e), sas_address(0x4433221107000000), phy(7)
Feb 14 21:54:14 Tower kernel: scsi target5:0:5: enclosure logical id(0x500605b001048b70), slot(4) 
Feb 14 21:54:14 Tower kernel: sd 5:0:5:0: No reference found at driver, assuming scmd(0x000000008cf2aa2f) might have completed
Feb 14 21:54:14 Tower kernel: sd 5:0:5:0: task abort: SUCCESS scmd(0x000000008cf2aa2f)
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: attempting task abort!scmd(0x0000000049c7fb2c), outstanding for 50176 ms & timeout 30000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: [sdb] tag#167 CDB: opcode=0x88 88 00 00 00 00 03 2f 11 4d 68 00 00 00 08 00 00
Feb 14 21:54:14 Tower kernel: scsi target5:0:0: handle(0x000b), sas_address(0x4433221103000000), phy(3)
Feb 14 21:54:14 Tower kernel: scsi target5:0:0: enclosure logical id(0x500605b001048b70), slot(0) 
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: No reference found at driver, assuming scmd(0x0000000049c7fb2c) might have completed
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: task abort: SUCCESS scmd(0x0000000049c7fb2c)
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: attempting task abort!scmd(0x000000008ffdff4a), outstanding for 50176 ms & timeout 30000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: [sdb] tag#166 CDB: opcode=0x88 88 00 00 00 00 01 d1 fb c5 c0 00 00 00 20 00 00
Feb 14 21:54:14 Tower kernel: scsi target5:0:0: handle(0x000b), sas_address(0x4433221103000000), phy(3)
Feb 14 21:54:14 Tower kernel: scsi target5:0:0: enclosure logical id(0x500605b001048b70), slot(0) 
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: No reference found at driver, assuming scmd(0x000000008ffdff4a) might have completed
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: task abort: SUCCESS scmd(0x000000008ffdff4a)
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: attempting task abort!scmd(0x00000000025932ab), outstanding for 50176 ms & timeout 30000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: [sdb] tag#165 CDB: opcode=0x88 88 00 00 00 00 00 00 00 00 c0 00 00 00 20 00 00
Feb 14 21:54:14 Tower kernel: scsi target5:0:0: handle(0x000b), sas_address(0x4433221103000000), phy(3)
Feb 14 21:54:14 Tower kernel: scsi target5:0:0: enclosure logical id(0x500605b001048b70), slot(0) 
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: No reference found at driver, assuming scmd(0x00000000025932ab) might have completed
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: task abort: SUCCESS scmd(0x00000000025932ab)
Feb 14 21:54:14 Tower kernel: sd 5:0:3:0: attempting device reset! scmd(0x000000004a167f29)
Feb 14 21:54:14 Tower kernel: sd 5:0:3:0: [sde] tag#245 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Feb 14 21:54:14 Tower kernel: scsi target5:0:3: handle(0x000c), sas_address(0x4433221102000000), phy(2)
Feb 14 21:54:14 Tower kernel: scsi target5:0:3: enclosure logical id(0x500605b001048b70), slot(1) 
Feb 14 21:54:44 Tower kernel: mpt2sas_cm0: In func: mpt3sas_scsih_issue_tm
Feb 14 21:54:44 Tower kernel: mpt2sas_cm0: Command Timeout
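
One rough way to read a capture like this is to count which targets the task aborts were issued against. In the section above they hit five different targets on host 5 (sdb, sdc, sdd, sde and sdg) within about a minute, and the driver also logs a diag reset followed by a hard reset handler reported as FAILED. That pattern points more toward the controller or its link than toward one disk, though the excerpt alone cannot prove it. Below is a minimal Python sketch of that tally. The function name `summarize_aborts` and the regular expressions are assumptions written against the message format shown here; they are not part of any existing tool.

```python
import re
from collections import Counter

# Hypothetical patterns, written to match the kernel messages quoted in this post.
ABORT_RE = re.compile(
    r"sd (?P<hctl>\d+:\d+:\d+:\d+): attempting task abort!\s*"
    r"scmd\((?P<scmd>0x[0-9a-f]+)\), "
    r"outstanding for (?P<ms>\d+) ms & timeout (?P<timeout>\d+) ms"
)
RESULT_RE = re.compile(
    r"sd (?P<hctl>\d+:\d+:\d+:\d+): task abort: (?P<result>SUCCESS|FAILED)"
)


def summarize_aborts(log_text):
    """Count abort attempts and abort outcomes per SCSI target (host:channel:id:lun)."""
    attempts = Counter()
    results = Counter()
    for line in log_text.splitlines():
        m = ABORT_RE.search(line)
        if m:
            attempts[m.group("hctl")] += 1
            continue
        m = RESULT_RE.search(line)
        if m:
            results[(m.group("hctl"), m.group("result"))] += 1
    return attempts, results


# Four lines copied from the capture above, used only as sample input.
SAMPLE = """\
Feb 14 21:53:17 Tower kernel: sd 5:0:3:0: attempting task abort!scmd(0x000000004a167f29), outstanding for 15259 ms & timeout 15000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:3:0: task abort: FAILED scmd(0x000000004a167f29)
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: attempting task abort!scmd(0x00000000bf95fe9f), outstanding for 72192 ms & timeout 30000 ms
Feb 14 21:54:14 Tower kernel: sd 5:0:0:0: task abort: SUCCESS scmd(0x00000000bf95fe9f)
"""

if __name__ == "__main__":
    attempts, results = summarize_aborts(SAMPLE)
    for hctl, count in sorted(attempts.items()):
        print(hctl, "abort attempts:", count)
    for (hctl, result), count in sorted(results.items()):
        print(hctl, result, count)
```

Fed the whole syslog instead of `SAMPLE`, the same function would show whether timeouts cluster on one target or spread across everything attached to the card.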

 


Thanks. I suspect it was a problem with the LSI card, but a rather more serious one than the similar spin-down-related messages that have been posted a number of times on this forum, because it actually caused a major shutdown. I have recently rebuilt this server, and this is a new (to me!) LSI card, so it could be the cause. I think I'll disable spin-down for now and see what happens.
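
For what it's worth, the CDB bytes in the log fit the spin-down idea without proving it. The first command to time out (on sde) has opcode 0x85, which is SCSI ATA PASS-THROUGH (16); in that CDB layout byte 14 carries the ATA command, and here it is 0xe5, CHECK POWER MODE, the kind of command used to ask a drive whether it is spun down. The later timeouts are all opcode 0x88, READ (16). A small Python sketch for decoding those lines follows. The helper name `decode_cdb` and the two lookup tables are assumptions for illustration, and they cover only the opcodes that appear in this capture.

```python
# Only the opcodes and ATA command seen in this capture; not a complete table.
OPCODES = {0x85: "ATA PASS-THROUGH (16)", 0x88: "READ (16)"}
ATA_COMMANDS = {0xE5: "CHECK POWER MODE"}


def decode_cdb(hex_bytes):
    """Decode a 16-byte CDB given as space-separated hex, as printed after 'CDB: opcode=0x..'."""
    cdb = bytes.fromhex(hex_bytes)
    if len(cdb) != 16:
        raise ValueError("expected a 16-byte CDB")
    opcode = cdb[0]
    info = {"opcode": opcode, "name": OPCODES.get(opcode, "unknown")}
    if opcode == 0x85:
        # ATA PASS-THROUGH (16): byte 14 holds the ATA command register value.
        info["ata_command"] = cdb[14]
        info["ata_name"] = ATA_COMMANDS.get(cdb[14], "unknown")
    elif opcode == 0x88:
        # READ (16): bytes 2-9 are the big-endian LBA, bytes 10-13 the block count.
        info["lba"] = int.from_bytes(cdb[2:10], "big")
        info["blocks"] = int.from_bytes(cdb[10:14], "big")
    return info


if __name__ == "__main__":
    # Both CDBs are copied from the log posted above.
    print(decode_cdb("85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00"))
    print(decode_cdb("88 00 00 00 00 03 48 e6 4f a8 00 00 00 20 00 00"))
```

If the power-mode poll is what the card choked on, turning spin-down off should remove that trigger, which makes it a reasonable first test.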


That is an excellent question. I’ve used these cards before without issues, but this is a new case, so the airflow will be different. There is a case ventilation extractor fan above the card, and the case runs very cool overall, but I’m not specifically cooling the card. I’ll measure its temperature.
