December 5, 2015
Hi folks, I recently had a disk fail, and during the rebuild another disk (Disk 8 - sdl) started throwing errors. I thought that, while trying to find the failed disk, I might not have properly seated the second disk when I put it back. I cancelled the rebuild, reseated all disks and rebooted the array. The rebuild started again, this time with no errors. However, later that night the same disk started throwing errors again. The rebuild of the first failed disk completed, and now Disk 8 is showing as failed. Can you tell from the logs whether this disk is truly dead? The SMART report on the disk seems fine. I'm enclosing the diagnostics file. Any help would be greatly appreciated.
Doug
tower-diagnostics-20151204-1941.zip
December 5, 2015 (Author)
Adding SMART report. I did a short test, which passed, and the extended test is in progress.
tower-smart-20151204-2107.zip
December 5, 2015 (Author)
The extended SMART test passed as well, so I'm guessing the drive is okay. I'll try reseating the cables. How do I get the disk to no longer be disabled so I can test it out?
December 5, 2015 (Author)
I found the procedure to rebuild the drive onto itself. That's running now.
December 5, 2015 (Community Expert)
To get the disk to rebuild onto itself (which will clear the disabled state):
1. Stop the array and unassign the disk.
2. Start the array. It will warn you that the array is unprotected but will start the array with the disk missing. This step makes unRAID 'forget' the current assignment.
3. Stop the array and reassign the disk. unRAID will now warn you that starting the array will start the rebuild of the disk.
4. Start the array and let the disk rebuild. If everything is working fine this should complete without error.
5. When the rebuild completes, run a parity check to check out the rebuild. If the rebuild was error free this should complete with 0 errors.
The only difference between this process and that for rebuilding onto a new disk is the addition of the step that makes unRAID 'forget' the current assignment (although doing that does not hurt even when using a new disk).
December 5, 2015 (Author)
Okay, so the rebuild is in progress and now I'm getting read errors from a different disk (Disk 9). I had this problem the last time a drive failed and I tried to rebuild, which was right after I upgraded to v6. I'm enclosing another set of diagnostics taken after the new read errors started. What is going on here?
Thanks, Doug
tower-diagnostics-20151205-0953.zip
December 6, 2015 (Author)
In the log, right before the read errors start near the end, I see some SAS/SCSI errors. Does this mean my controller is failing? Or does this look like a cabling issue?
Dec 5 09:24:08 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 0000004C, slot [1].
Dec 5 09:24:39 Tower kernel: sas: Enter sas_scsi_recover_host busy: 3 failed: 3
Dec 5 09:24:39 Tower kernel: sas: trying to find task 0xffff8800b9347000
Dec 5 09:24:39 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff8800b9347000
Dec 5 09:24:39 Tower kernel: sas: sas_scsi_find_task: task 0xffff8800b9347000 is aborted
Dec 5 09:24:39 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff8800b9347000 is aborted
Dec 5 09:24:39 Tower kernel: sas: trying to find task 0xffff8800b9346500
Dec 5 09:24:39 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff8800b9346500
Dec 5 09:24:39 Tower kernel: sas: sas_scsi_find_task: task 0xffff8800b9346500 is aborted
Dec 5 09:24:39 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff8800b9346500 is aborted
Dec 5 09:24:39 Tower kernel: sas: trying to find task 0xffff8800b9346c00
Dec 5 09:24:39 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff8800b9346c00
Dec 5 09:24:39 Tower kernel: sas: sas_scsi_find_task: task 0xffff8800b9346c00 is aborted
Dec 5 09:24:39 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff8800b9346c00 is aborted
Dec 5 09:24:39 Tower kernel: sas: ata14: end_device-1:7: cmd error handler
Dec 5 09:24:39 Tower kernel: sas: ata7: end_device-1:0: dev error handler
Dec 5 09:24:39 Tower kernel: sas: ata8: end_device-1:1: dev error handler
Dec 5 09:24:39 Tower kernel: sas: ata9: end_device-1:2: dev error handler
Dec 5 09:24:39 Tower kernel: sas: ata10: end_device-1:3: dev error handler
Dec 5 09:24:39 Tower kernel: sas: ata11: end_device-1:4: dev error handler
Dec 5 09:24:39 Tower kernel: sas: ata12: end_device-1:5: dev error handler
Dec 5 09:24:39 Tower kernel: sas: ata13: end_device-1:6: dev error handler
Dec 5 09:24:39 Tower kernel: sas: ata14: end_device-1:7: dev error handler
Dec 5 09:24:39 Tower kernel: ata14.00: exception Emask 0x0 SAct 0x3200 SErr 0x0 action 0x6 frozen
Dec 5 09:24:39 Tower kernel: ata14.00: failed command: READ FPDMA QUEUED
Dec 5 09:24:39 Tower kernel: ata14.00: cmd 60/00:00:f0:24:b8/04:00:03:00:00/40 tag 9 ncq 524288 in
Dec 5 09:24:39 Tower kernel: res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Dec 5 09:24:39 Tower kernel: ata14.00: status: { DRDY }
Dec 5 09:24:39 Tower kernel: ata14.00: failed command: READ FPDMA QUEUED
Dec 5 09:24:39 Tower kernel: ata14.00: cmd 60/00:00:f0:30:b8/04:00:03:00:00/40 tag 12 ncq 524288 in
Dec 5 09:24:39 Tower kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 5 09:24:39 Tower kernel: ata14.00: status: { DRDY }
Dec 5 09:24:39 Tower kernel: ata14.00: failed command: READ FPDMA QUEUED
Dec 5 09:24:39 Tower kernel: ata14.00: cmd 60/00:00:f0:34:b8/04:00:03:00:00/40 tag 13 ncq 524288 in
Dec 5 09:24:39 Tower kernel: res 40/00:60:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Dec 5 09:24:39 Tower kernel: ata14.00: status: { DRDY }
Dec 5 09:24:39 Tower kernel: ata14: hard resetting link
Dec 5 09:24:41 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1428:mvs_I_T_nexus_reset for device[3]:rc= 0
Dec 5 09:24:41 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Dec 5 09:24:41 Tower kernel: sas: sas_ata_task_done: SAS error 8a
Dec 5 09:24:41 Tower kernel: ata14.00: both IDENTIFYs aborted, assuming NODEV
Dec 5 09:24:41 Tower kernel: ata14.00: revalidation failed (errno=-2)
Dec 5 09:24:46 Tower kernel: ata14: hard resetting link
Dec 5 09:24:51 Tower kernel: ata14.00: qc timeout (cmd 0xec)
Dec 5 09:24:51 Tower kernel: ata14.00: failed to IDENTIFY (I/O error, err_mask=0x5)
Dec 5 09:24:51 Tower kernel: ata14.00: revalidation failed (errno=-5)
Dec 5 09:24:51 Tower kernel: ata14: hard resetting link
Dec 5 09:24:51 Tower kernel: sas: sas_form_port: phy3 belongs to port7 already(1)!
Dec 5 09:24:53 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1428:mvs_I_T_nexus_reset for device[3]:rc= 0
Dec 5 09:24:54 Tower kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 00000006, slot [0].
Dec 5 09:24:59 Tower kernel: ata14.00: qc timeout (cmd 0x27)
Dec 5 09:24:59 Tower kernel: ata14.00: failed to read native max address (err_mask=0x4)
Dec 5 09:24:59 Tower kernel: ata14.00: HPA support seems broken, skipping HPA handling
Dec 5 09:24:59 Tower kernel: ata14.00: revalidation failed (errno=-5)
Dec 5 09:24:59 Tower kernel: ata14.00: disabled
Dec 5 09:24:59 Tower kernel: ata14.00: device reported invalid CHS sector 0
Dec 5 09:24:59 Tower kernel: ata14.00: device reported invalid CHS sector 0
Dec 5 09:24:59 Tower kernel: ata14: hard resetting link
Dec 5 09:24:59 Tower kernel: sas: sas_form_port: phy3 belongs to port7 already(1)!
Dec 5 09:25:01 Tower kernel: drivers/scsi/mvsas/mv_sas.c 1428:mvs_I_T_nexus_reset for device[3]:rc= 0
Dec 5 09:25:01 Tower kernel: ata14: EH complete
Dec 5 09:25:01 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
Dec 5 09:25:01 Tower kernel: sd 1:0:7:0: [sdn] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Dec 5 09:25:01 Tower kernel: sd 1:0:7:0: [sdn] tag#0 CDB: opcode=0x28 28 00 03 b8 34 f0 00 04 00 00
Dec 5 09:25:01 Tower kernel: blk_update_request: I/O error, dev sdn, sector 62403824
Dec 5 09:25:01 Tower kernel: md: disk9 read error, sector=62403760
Dec 5 09:25:01 Tower kernel: md: disk9 read error, sector=62403768
Dec 5 09:25:01 Tower kernel: md: disk9 read error, sector=62403776
Any help would be greatly appreciated.
Doug
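A long syslog like the one above is easier to reason about once it is reduced to counts. The following is a minimal sketch, not part of the original post, that tallies failed ATA commands, link resets, and md read errors per port or disk. The function name `summarize` and the regular expressions are assumptions written against the line shapes quoted in this thread; other kernel versions may format these messages differently.

```python
import re
from collections import Counter

# Patterns written against the message shapes quoted in this thread.
ATA_FAILED = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: failed command: (.+)$")
ATA_RESET = re.compile(r"kernel: (ata\d+): hard resetting link")
MD_READ_ERR = re.compile(r"kernel: md: (disk\d+) read error, sector=(\d+)")


def summarize(log_text):
    """Count failed commands and link resets per ATA port, and read errors per md disk."""
    failed = Counter()
    resets = Counter()
    md_errors = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        m = ATA_FAILED.search(line)
        if m:
            failed[m.group(1)] += 1
            continue
        m = ATA_RESET.search(line)
        if m:
            resets[m.group(1)] += 1
            continue
        m = MD_READ_ERR.search(line)
        if m:
            md_errors[m.group(1)] += 1
    return {"failed": failed, "resets": resets, "md_errors": md_errors}


# A few lines copied verbatim from the log in this thread.
SAMPLE = """\
Dec 5 09:24:39 Tower kernel: ata14.00: failed command: READ FPDMA QUEUED
Dec 5 09:24:39 Tower kernel: ata14: hard resetting link
Dec 5 09:24:46 Tower kernel: ata14: hard resetting link
Dec 5 09:25:01 Tower kernel: md: disk9 read error, sector=62403760
Dec 5 09:25:01 Tower kernel: md: disk9 read error, sector=62403768
"""

if __name__ == "__main__":
    for name, counts in summarize(SAMPLE).items():
        print(name, dict(counts))
```

Run against the full syslog from a diagnostics zip, a summary like this makes it easier to see whether timeouts and resets cluster on one port or are spread across several ports on the same card.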
December 6, 2015 (Author)
Okay, so after Disk 8 rebuilt and I started getting the read errors on Disk 9, I shut down the server and reseated everything (disk cables, controller card). I started it back up and began rebuilding Disk 9 onto itself. This morning I woke up and it looks like Disk 9 rebuilt and the drive is green, but it says it's unmountable, and I'm back to read errors on Disk 8. So I'm guessing that means I've lost whatever data was on Disk 9 (which was very little), but now I'm concerned about being able to do anything with Disk 8.
December 7, 2015 (Author)
Anyone have any thoughts about what those errors might mean? I'm really stuck here and not sure what to do short of ordering parts.
Thanks, Doug
December 7, 2015 (Community Expert)
Since both failed disks are on the same controller, one thing that comes to mind is the issue a few SAS2LP users have had with good drives being redballed on V6. It could also be a cabling, power, or backplane issue, but if the SAS2LP is new or you only recently upgraded to V6, the first thing I'd try is a different controller.
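Whether two disks sit behind the same controller can often be read from the syslog itself. The sketch below is an illustration, not something from the original reply: it groups device names by the SCSI host number in kernel lines of the form `sd H:C:T:L: [sdX]`. The function name `devices_by_host` is hypothetical. Devices that share a host number are behind the same SCSI host, which for a SAS HBA usually means the same card; onboard SATA ports typically each get their own host number, so different numbers do not by themselves prove different controllers.

```python
import re
from collections import defaultdict

# Matches kernel lines such as: "sd 1:0:7:0: [sdn] tag#0 ..."
# The four numbers are SCSI host:channel:target:lun.
SD_LINE = re.compile(r"kernel: sd (\d+):(\d+):(\d+):(\d+): \[(sd[a-z]+)\]")


def devices_by_host(log_text):
    """Return {host_number: sorted list of device names} seen in the log."""
    hosts = defaultdict(set)
    for line in log_text.splitlines():
        m = SD_LINE.search(line)
        if m:
            hosts[int(m.group(1))].add(m.group(5))
    return {host: sorted(devs) for host, devs in hosts.items()}


# Two lines copied verbatim from the log earlier in this thread.
SAMPLE = """\
Dec 5 09:25:01 Tower kernel: sd 1:0:7:0: [sdn] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Dec 5 09:25:01 Tower kernel: sd 1:0:7:0: [sdn] tag#0 CDB: opcode=0x28 28 00 03 b8 34 f0 00 04 00 00
"""

if __name__ == "__main__":
    print(devices_by_host(SAMPLE))
```

The excerpt in this thread only mentions sdn, so the sample shows a single host; feeding in the complete syslog would show whether sdl (Disk 8) and sdn (Disk 9) appear under the same host number.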
December 7, 2015 (Author)
Thanks for that information. That really helps. I hadn't thought to search for the issue by controller. After reading through some of those threads, it sounds like exactly the issue I'm having. Hopefully replacing that card will resolve this for good.
Doug