Disk Failure - Need advice - General Support

December 5, 2015

Hi folks,

I recently had a disk fail, and during the rebuild another disk (Disk 8, sdl) started throwing errors. I thought that, while hunting for the failed disk, I might not have seated the second disk properly when I put it back. I cancelled the rebuild, reseated all disks, and rebooted the array. The rebuild started again, this time with no errors. However, later that night the same disk started throwing errors again. The rebuild of the first failed disk completed, and now Disk 8 shows as failed. Can you tell from the logs whether this disk is truly dead? The SMART report on the disk seems fine. I'm attaching the diagnostics file. Any help would be greatly appreciated.

Doug

tower-diagnostics-20151204-1941.zip

December 5, 2015

Author

Adding the SMART report. I ran a short test, which passed, and the extended test is in progress.

tower-smart-20151204-2107.zip

December 5, 2015

Author

The extended SMART test passed as well. So I'm guessing the drive is okay. I'll try resetting the cables. How do I get the disk to no longer be disabled to test it out?

December 5, 2015

Author

I found the procedure to rebuild the drive onto itself. That's running now.

December 5, 2015

Community Expert

Quoting the author: "The extended SMART test passed as well. So I'm guessing the drive is okay. I'll try resetting the cables. How do I get the disk to no longer be disabled to test it out?"

To get the disk to rebuild onto itself (which will clear the disabled state):

1. Stop the array and unassign the disk.
2. Start the array. It will warn you that the array is unprotected, but it will start with the disk missing. This step makes unRAID 'forget' the current assignment.
3. Stop the array and reassign the disk. unRAID will now warn you that starting the array will start the rebuild of the disk.
4. Start the array and let the disk rebuild. If everything is working fine, this should complete without error.
5. When the rebuild completes, run a parity check to verify the rebuild. If the rebuild was error free, this should complete with 0 errors.

The only difference between this process and that for rebuilding onto a new disk is the addition of the step that makes unRAID 'forget' the current assignment (although doing that does not hurt even when using a new disk).

December 5, 2015

Author

Okay, so the rebuild is in progress and now I'm getting read errors from a different disk (Disk 9). I had this problem the last time a drive failed and I tried to rebuild, which was right after I upgraded to v6. I'm attaching another set of diagnostics taken after the new read errors started.

What is going on here?

Thanks,

Doug

tower-diagnostics-20151205-0953.zip

December 6, 2015

Author

In the log, right before the read errors start near the end, I see some SAS/SCSI errors. Does this mean my controller is failing? Or does this look like a cabling issue?

Dec  5 09:24:08 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 0000004C,  slot [1].
Dec  5 09:24:39 Tower kernel: sas: Enter sas_scsi_recover_host busy: 3 failed: 3
Dec  5 09:24:39 Tower kernel: sas: trying to find task 0xffff8800b9347000
Dec  5 09:24:39 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff8800b9347000
Dec  5 09:24:39 Tower kernel: sas: sas_scsi_find_task: task 0xffff8800b9347000 is aborted
Dec  5 09:24:39 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff8800b9347000 is aborted
Dec  5 09:24:39 Tower kernel: sas: trying to find task 0xffff8800b9346500
Dec  5 09:24:39 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff8800b9346500
Dec  5 09:24:39 Tower kernel: sas: sas_scsi_find_task: task 0xffff8800b9346500 is aborted
Dec  5 09:24:39 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff8800b9346500 is aborted
Dec  5 09:24:39 Tower kernel: sas: trying to find task 0xffff8800b9346c00
Dec  5 09:24:39 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff8800b9346c00
Dec  5 09:24:39 Tower kernel: sas: sas_scsi_find_task: task 0xffff8800b9346c00 is aborted
Dec  5 09:24:39 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff8800b9346c00 is aborted
Dec  5 09:24:39 Tower kernel: sas: ata14: end_device-1:7: cmd error handler
Dec  5 09:24:39 Tower kernel: sas: ata7: end_device-1:0: dev error handler
Dec  5 09:24:39 Tower kernel: sas: ata8: end_device-1:1: dev error handler
Dec  5 09:24:39 Tower kernel: sas: ata9: end_device-1:2: dev error handler
Dec  5 09:24:39 Tower kernel: sas: ata10: end_device-1:3: dev error handler
Dec  5 09:24:39 Tower kernel: sas: ata11: end_device-1:4: dev error handler
Dec  5 09:24:39 Tower kernel: sas: ata12: end_device-1:5: dev error handler
Dec  5 09:24:39 Tower kernel: sas: ata13: end_device-1:6: dev error handler
Dec  5 09:24:39 Tower kernel: sas: ata14: end_device-1:7: dev error handler
Dec  5 09:24:39 Tower kernel: ata14.00: exception Emask 0x0 SAct 0x3200 SErr 0x0 action 0x6 frozen
Dec  5 09:24:39 Tower kernel: ata14.00: failed command: READ FPDMA QUEUED
Dec  5 09:24:39 Tower kernel: ata14.00: cmd 60/00:00:f0:24:b8/04:00:03:00:00/40 tag 9 ncq 524288 in
Dec  5 09:24:39 Tower kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Dec  5 09:24:39 Tower kernel: ata14.00: status: { DRDY }
Dec  5 09:24:39 Tower kernel: ata14.00: failed command: READ FPDMA QUEUED
Dec  5 09:24:39 Tower kernel: ata14.00: cmd 60/00:00:f0:30:b8/04:00:03:00:00/40 tag 12 ncq 524288 in
Dec  5 09:24:39 Tower kernel:         res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec  5 09:24:39 Tower kernel: ata14.00: status: { DRDY }
Dec  5 09:24:39 Tower kernel: ata14.00: failed command: READ FPDMA QUEUED
Dec  5 09:24:39 Tower kernel: ata14.00: cmd 60/00:00:f0:34:b8/04:00:03:00:00/40 tag 13 ncq 524288 in
Dec  5 09:24:39 Tower kernel:         res 40/00:60:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Dec  5 09:24:39 Tower kernel: ata14.00: status: { DRDY }
Dec  5 09:24:39 Tower kernel: ata14: hard resetting link
Dec  5 09:24:41 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1428:mvs_I_T_nexus_reset for device[3]:rc= 0
Dec  5 09:24:41 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Dec  5 09:24:41 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Dec  5 09:24:41 Tower kernel: ata14.00: both IDENTIFYs aborted, assuming NODEV
Dec  5 09:24:41 Tower kernel: ata14.00: revalidation failed (errno=-2)
Dec  5 09:24:46 Tower kernel: ata14: hard resetting link
Dec  5 09:24:51 Tower kernel: ata14.00: qc timeout (cmd 0xec)
Dec  5 09:24:51 Tower kernel: ata14.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Dec  5 09:24:51 Tower kernel: ata14.00: revalidation failed (errno=-5)
Dec  5 09:24:51 Tower kernel: ata14: hard resetting link
Dec  5 09:24:51 Tower kernel: sas: sas_form_port: phy3 belongs to port7 already(1)!
Dec  5 09:24:53 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1428:mvs_I_T_nexus_reset for device[3]:rc= 0
Dec  5 09:24:54 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 00000006,  slot [0].
Dec  5 09:24:59 Tower kernel: ata14.00: qc timeout (cmd 0x27)
Dec  5 09:24:59 Tower kernel: ata14.00: failed to read native max address (err_mask=0x4)
Dec  5 09:24:59 Tower kernel: ata14.00: HPA support seems broken, skipping HPA handling
Dec  5 09:24:59 Tower kernel: ata14.00: revalidation failed (errno=-5)
Dec  5 09:24:59 Tower kernel: ata14.00: disabled
Dec  5 09:24:59 Tower kernel: ata14.00: device reported invalid CHS sector 0
Dec  5 09:24:59 Tower kernel: ata14.00: device reported invalid CHS sector 0
Dec  5 09:24:59 Tower kernel: ata14: hard resetting link
Dec  5 09:24:59 Tower kernel: sas: sas_form_port: phy3 belongs to port7 already(1)!
Dec  5 09:25:01 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1428:mvs_I_T_nexus_reset for device[3]:rc= 0
Dec  5 09:25:01 Tower kernel: ata14: EH complete
Dec  5 09:25:01 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
Dec  5 09:25:01 Tower kernel: sd 1:0:7:0: [sdn] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Dec  5 09:25:01 Tower kernel: sd 1:0:7:0: [sdn] tag#0 CDB: opcode=0x28 28 00 03 b8 34 f0 00 04 00 00
Dec  5 09:25:01 Tower kernel: blk_update_request: I/O error, dev sdn, sector 62403824
Dec  5 09:25:01 Tower kernel: md: disk9 read error, sector=62403760
Dec  5 09:25:01 Tower kernel: md: disk9 read error, sector=62403768
Dec  5 09:25:01 Tower kernel: md: disk9 read error, sector=62403776

Any help would be greatly appreciated.

Doug
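For anyone reading the excerpt above, a minimal Python sketch of how the failing request can be decoded from the `CDB:` line. It assumes the standard 10-byte SCSI READ(10) layout (opcode 0x28, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) and 512-byte sectors; `decode_read10` is a name made up for this illustration and is not part of any unRAID tool.

```python
def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Return (start_lba, sector_count) for a SCSI READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, count


# CDB bytes copied from the "[sdn] tag#0 CDB:" line in the log above.
lba, count = decode_read10("28 00 03 b8 34 f0 00 04 00 00")

# First sector reported by "md: disk9 read error" in the same log.
first_md_sector = 62403760

print(lba, count, count * 512, lba - first_md_sector)
# prints: 62403824 1024 524288 64
```

The decoded LBA matches the `blk_update_request` sector, and the 1024-sector length matches the `ncq 524288` byte count on the failed commands. The constant difference of 64 between the device sector and the md sector would be consistent with the data partition starting 64 sectors into the drive, but that is an inference from these numbers, not something the log states.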

December 6, 2015

Author

Okay, so after Disk 8 rebuilt and I started getting the read errors on Disk 9, I shut down the server and reseated everything (disk cables, controller card). I started it back up and began rebuilding Disk 9 onto itself. This morning I woke up and it looks like Disk 9 rebuilt and the drive is green, but it says it's unmountable, and I'm back to read errors on Disk 8. So I'm guessing that means I've lost whatever data was on Disk 9 (which was very little), but now I'm concerned about being able to do anything with Disk 8.

December 7, 2015

Author

Anyone have any thoughts about what those errors might mean? I'm really stuck here and not sure what to do short of ordering parts.

Thanks,

Doug

December 7, 2015

Community Expert

Since both failed disks are on the same controller, one thing that comes to mind is the issue a few SAS2LP users have had with good drives being redballed on v6. It could also be a cabling, power, or backplane issue, but if the SAS2LP is new or you only recently upgraded to v6, the first thing I'd try is a different controller.
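One way to check the "same controller" observation from a syslog is to group the ata ports by SCSI host and count failed commands per port. The sketch below is only an illustration: the sample lines are copied from the log excerpt posted earlier in this thread, the function name `summarize` is hypothetical, and it assumes the first number in `end_device-H:N` and in the `sd H:C:T:L` address both identify the SCSI host.

```python
import re
from collections import Counter, defaultdict

# A few lines copied from the log excerpt earlier in the thread.
SAMPLE = """\
Dec  5 09:24:39 Tower kernel: sas: ata14: end_device-1:7: cmd error handler
Dec  5 09:24:39 Tower kernel: sas: ata7: end_device-1:0: dev error handler
Dec  5 09:24:39 Tower kernel: sas: ata13: end_device-1:6: dev error handler
Dec  5 09:24:39 Tower kernel: ata14.00: failed command: READ FPDMA QUEUED
Dec  5 09:24:39 Tower kernel: ata14.00: failed command: READ FPDMA QUEUED
Dec  5 09:24:39 Tower kernel: ata14.00: failed command: READ FPDMA QUEUED
Dec  5 09:25:01 Tower kernel: sd 1:0:7:0: [sdn] tag#0 CDB: opcode=0x28 28 00 03 b8 34 f0 00 04 00 00
"""

PORT_RE = re.compile(r"sas: (ata\d+): end_device-(\d+):(\d+):")
FAIL_RE = re.compile(r"(ata\d+)\.\d+: failed command")
SD_RE = re.compile(r"sd (\d+):\d+:(\d+):\d+: \[(sd[a-z]+)\]")


def summarize(log_text):
    """Group ata ports by host, count failed commands, map sdX to (host, target)."""
    ports_by_host = defaultdict(set)
    failures = Counter()
    devices = {}
    for line in log_text.splitlines():
        if m := PORT_RE.search(line):
            ports_by_host[int(m[2])].add(m[1])
        if m := FAIL_RE.search(line):
            failures[m[1]] += 1
        if m := SD_RE.search(line):
            devices[m[3]] = (int(m[1]), int(m[2]))
    return ports_by_host, failures, devices


ports_by_host, failures, devices = summarize(SAMPLE)
print(sorted(ports_by_host[1]), dict(failures), devices)
```

Running the same function over the full syslog from the diagnostics zip would show whether every port that logged failures sits behind one host, which is the pattern that points at the controller (or its cabling and power) rather than at individual drives.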

December 7, 2015

Author

Thanks for that information. That really helps. I didn't think about searching for the issue by the controller. After reading through a bit of those threads it sounds like exactly the issue I'm having.

Hopefully replacing that card will resolve this for good.

Doug
