October 1, 2020

This happened a few weeks ago, and in my attempt to resolve it myself I ended up in quite a mess and lost a large amount of data. I'm trying not to go through that again.

I had a disk go bad and I initiated a replacement. The new disk is in, and it was writing all the contents as normal. I ignored it for 6 hours. I came back and now we have this:

Total size: 4 TB
Elapsed time: 5 minutes
Current position: 343 GB (8.6 %)
Estimated speed: 67.5 KB/sec
Estimated finish: 635 days, 21 hours, 42 minutes

It shouldn't take that long. I looked in the syslog and found:

Oct 1 13:18:39 alucard kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Oct 1 13:18:39 alucard kernel: sas: trying to find task 0x00000000e2711c5b
Oct 1 13:18:39 alucard kernel: sas: sas_scsi_find_task: aborting task 0x00000000e2711c5b
Oct 1 13:18:39 alucard kernel: sas: sas_scsi_find_task: task 0x00000000e2711c5b is aborted
Oct 1 13:18:39 alucard kernel: sas: sas_eh_handle_sas_errors: task 0x00000000e2711c5b is aborted
Oct 1 13:18:39 alucard kernel: sas: ata20: end_device-10:3: cmd error handler
Oct 1 13:18:39 alucard kernel: sas: ata17: end_device-10:0: dev error handler
Oct 1 13:18:39 alucard kernel: sas: ata18: end_device-10:1: dev error handler
Oct 1 13:18:39 alucard kernel: sas: ata19: end_device-10:2: dev error handler
Oct 1 13:18:39 alucard kernel: sas: ata20: end_device-10:3: dev error handler
Oct 1 13:18:39 alucard kernel: sas: ata25: end_device-10:4: dev error handler
Oct 1 13:18:39 alucard kernel: ata20.00: exception Emask 0x0 SAct 0x200 SErr 0x0 action 0x6 frozen
Oct 1 13:18:39 alucard kernel: sas: ata22: end_device-10:5: dev error handler
Oct 1 13:18:39 alucard kernel: sas: ata23: end_device-10:6: dev error handler
Oct 1 13:18:39 alucard kernel: ata20.00: failed command: READ FPDMA QUEUED
Oct 1 13:18:39 alucard kernel: sas: ata24: end_device-10:7: dev error handler
Oct 1 13:18:39 alucard kernel: ata20.00: cmd 60/00:00:e0:e2:ca/04:00:27:00:00/40 tag 9 ncq dma 524288 in
Oct 1 13:18:39 alucard kernel: res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Oct 1 13:18:39 alucard kernel: ata20.00: status: { DRDY }
Oct 1 13:18:39 alucard kernel: ata20: hard resetting link
Oct 1 13:18:39 alucard kernel: sas: sas_form_port: phy3 belongs to port3 already(1)!
Oct 1 13:18:41 alucard kernel: drivers/scsi/mvsas/mv_sas.c 1434:mvs_I_T_nexus_reset for device[3]:rc= 0
Oct 1 13:18:41 alucard kernel: ata20.00: configured for UDMA/133
Oct 1 13:18:41 alucard kernel: ata20: EH complete
Oct 1 13:18:41 alucard kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

The next line is me, 6 hours later, ssh'ing in to check on things.

I know which drive seems to be causing the strife, but I also know that if I were to just remove it, it would get flagged as a failed drive and then I'd have 2. I have 2 parity disks, so that's not so bad, but last time this happened it chained into Unraid believing I had 5 failed drives and I ended up with some data loss.

Is there anything I can do, short of replacing the controller card, to kick this back off? What should the process be? I can cancel the parity check, but I'm not 100% sure whether I should pull the drive that's holding things up or not.

Suggestions welcomed. Diags attached. Thanks for reading.

alucard-diagnostics-20201001-1910.zip
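For anyone reading a similar syslog: every port on the controller runs its "dev error handler" during recovery, so the port that actually timed out is easiest to spot from the "failed command" lines. The snippet below is only a minimal sketch of that idea, not an Unraid tool. The function name `count_failed_commands` and the `SAMPLE` list are made up for illustration, the sample lines are copied from the log above, and mapping a port such as ata20 back to a physical disk is left out because nothing in this excerpt shows it.

```python
import re
from collections import Counter

# Matches kernel lines such as "ata20.00: failed command: READ FPDMA QUEUED".
FAILED_CMD = re.compile(r"kernel: (ata\d+)\.\d+: failed command: (.+)$")


def count_failed_commands(lines):
    """Count failed commands per (ATA port, command) pair in syslog lines.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for line in lines:
        match = FAILED_CMD.search(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# A few lines copied from the syslog excerpt in the post above.
SAMPLE = [
    "Oct 1 13:18:39 alucard kernel: sas: ata19: end_device-10:2: dev error handler",
    "Oct 1 13:18:39 alucard kernel: ata20.00: exception Emask 0x0 SAct 0x200 SErr 0x0 action 0x6 frozen",
    "Oct 1 13:18:39 alucard kernel: ata20.00: failed command: READ FPDMA QUEUED",
    "Oct 1 13:18:39 alucard kernel: ata20: hard resetting link",
]

if __name__ == "__main__":
    for (port, command), n in count_failed_commands(SAMPLE).most_common():
        print(f"{port}: {n} x {command}")
```

Run against a full syslog (for example by passing the lines of an opened file), a port that dominates the count is a reasonable first suspect, though on a flaky controller the errors may move between ports.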
October 2, 2020
Community Expert

2 hours ago, Klainn said:
Is there anything I can do, short of replacing the controller card to kick this back off?

Probably the best thing is to replace the Marvell controller(s).
October 2, 2020

3 hours ago, trurl said:
Probably the best thing is to replace the Marvell controller(s)

To clarify: replace it, but not with another Marvell controller. You should ask the specialists here for advice on a replacement device that suits your needs and your system.