More problems with Supermicro server - three failed drives, on parity one data,


OK, I've got my new cable and plugged it in, the server's booted up, I've done a new config and assigned all the drives correctly, and I am about to commit.

 

Do I check 'Parity is correct', then start the array, and then do a parity check? I don't want to start the array and erase my parity drives, do I?

 

Thanks.

Link to comment

You can do it either way. P parity should be mostly in sync, but Q should have a few sync errors. The difference is that if there are a lot of sync errors, a correcting check will be slower than a sync. In this case I would probably try the correcting check first; if it starts finding many errors and slowing down, then abort and do a sync (only on Q parity, if that's where most of the errors are).
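For background on why P and Q can be out of sync independently, here is a minimal sketch. It assumes RAID-6-style math (P as a plain XOR, Q as a Reed-Solomon sum over GF(2^8) with generator 2 and polynomial 0x11d); the thread itself does not say which scheme is in use, and the function names are made up for illustration.

```python
# Illustrative only: P/Q parity for one byte position across the data disks.
# Assumes RAID-6-style math; not taken from unRAID's source.

def gf_mul2(x):
    """Multiply a byte by 2 in GF(2^8) with the polynomial 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x & 0xFF

def p_parity(data_bytes):
    """P is the XOR of all data bytes, so disk order does not matter."""
    p = 0
    for b in data_bytes:
        p ^= b
    return p

def q_parity(data_bytes):
    """Q weights each byte by 2**slot in GF(2^8), so disk order matters."""
    q = 0
    for slot, b in enumerate(data_bytes):
        weighted = b
        for _ in range(slot):
            weighted = gf_mul2(weighted)
        q ^= weighted
    return q

stripe = [0x01, 0x02, 0x03]
swapped = [0x02, 0x01, 0x03]  # same data, first two slots exchanged

print(p_parity(stripe) == p_parity(swapped))  # P is unchanged by the swap
print(q_parity(stripe) == q_parity(swapped))  # Q differs after the swap
```

Under these assumptions the two parities are computed differently, which is why one can be largely valid while the other needs correcting.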

Link to comment

So should I just leave it now, or do a parity sync? I am uploading diags. I am seeing these errors in the system log, only in one place since boot-up, I think. Should I be concerned about them?

 

Jan 17 13:36:50 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Jan 17 13:36:50 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 17 13:36:50 Tower kernel: sas: ata28: end_device-1:1:17: cmd error handler
Jan 17 13:36:50 Tower kernel: sas: ata1: end_device-1:0:12: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata2: end_device-1:0:13: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata3: end_device-1:0:14: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata4: end_device-1:0:15: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata5: end_device-1:0:16: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata6: end_device-1:0:17: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata7: end_device-1:0:18: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata8: end_device-1:0:19: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata9: end_device-1:0:20: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata10: end_device-1:0:21: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata11: end_device-1:0:22: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata12: end_device-1:0:23: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata13: end_device-1:0:24: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata14: end_device-1:0:25: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata16: end_device-1:0:27: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata15: end_device-1:0:26: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata18: end_device-1:0:29: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata17: end_device-1:0:28: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata19: end_device-1:0:30: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata20: end_device-1:0:31: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata21: end_device-1:0:32: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata22: end_device-1:0:33: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata23: end_device-1:0:34: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata24: end_device-1:0:35: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata25: end_device-1:1:13: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata26: end_device-1:1:14: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata27: end_device-1:1:16: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata28: end_device-1:1:17: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata29: end_device-1:1:20: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata30: end_device-1:1:23: dev error handler
Jan 17 13:36:50 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
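In the excerpt above only one port (ata28) has a "cmd error handler" line; every other line is a "dev error handler" entry logged once per port. A rough Python sketch for pulling that out of a saved syslog follows. The regular expression is written against the lines shown here only, and the function name is hypothetical; reading the "cmd" line as the port whose command actually failed is an assumption, not something the log states.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: sas: ata28: end_device-1:1:17: cmd error handler"
HANDLER_RE = re.compile(
    r"sas: (ata\d+): (end_device-[\d:]+): (cmd|dev) error handler"
)

def summarize_handlers(lines):
    """Count handler lines by kind and collect ports with a 'cmd' entry."""
    counts = Counter()
    cmd_ports = []
    for line in lines:
        match = HANDLER_RE.search(line)
        if match is None:
            continue
        port, device, kind = match.groups()
        counts[kind] += 1
        if kind == "cmd":
            cmd_ports.append((port, device))
    return counts, cmd_ports

sample = [
    "Jan 17 13:36:50 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1",
    "Jan 17 13:36:50 Tower kernel: sas: ata28: end_device-1:1:17: cmd error handler",
    "Jan 17 13:36:50 Tower kernel: sas: ata1: end_device-1:0:12: dev error handler",
    "Jan 17 13:36:50 Tower kernel: sas: ata28: end_device-1:1:17: dev error handler",
]

counts, cmd_ports = summarize_handlers(sample)
print(dict(counts))
print(cmd_ports)
```

Feeding it the full syslog instead of `sample` would show whether the same port keeps turning up across boots.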

 

I found this article, but I am not sure if it applies, since I am not resuming from suspend to RAM:

 

http://unix.stackexchange.com/questions/182834/kernel-ata-exception-after-resume-from-suspend-to-ram

tower-diagnostics-20170117-1348.zip

Link to comment

Why do you say that? Because we are still seeing the errors?

 

Let me explain why I think it was the cable. There was so much tension on the cable, because it barely reached the HBA, that I think it was interrupting communications between the HBA and the rear backplane, to the point that drives on the rear backplane were dropping out of the array. It could have also been affecting the HBA in other ways; I'm not sure. I know we are still seeing those error messages, but I believe they have always been there.

Link to comment

Why do you say that? Because we are still seeing the errors?

 

Yes, although that error alone is not really serious, more like a delayed read. If you get some like these, then they are serious, and they usually lead to read errors/dropped disks:

 

Jan 15 13:39:08 Tower kernel: sas: Enter sas_scsi_recover_host busy: 62 failed: 62
Jan 15 13:39:08 Tower kernel: sas: trying to find task 0xffff88087b475900
Jan 15 13:39:08 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff88087b475900
Jan 15 13:39:08 Tower kernel: sas: sas_scsi_find_task: task 0xffff88087b475900 is aborted
Jan 15 13:39:08 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff88087b475900 is aborted
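The distinction drawn above (a lone `failed: 1` recovery versus many failed commands with aborted tasks) can be roughly automated. The sketch below is only a heuristic: the `failed_threshold` value and the function name are assumptions made for illustration, not anything defined by the kernel or by unRAID.

```python
import re

RECOVER_RE = re.compile(r"Enter sas_scsi_recover_host busy: (\d+) failed: (\d+)")

def classify_sas_errors(lines, failed_threshold=2):
    """Flag a log excerpt as serious if tasks were aborted or many commands failed.

    failed_threshold is an arbitrary cut-off chosen for this example.
    """
    max_failed = 0
    aborted_tasks = 0
    for line in lines:
        match = RECOVER_RE.search(line)
        if match:
            max_failed = max(max_failed, int(match.group(2)))
        if "aborting task" in line:
            aborted_tasks += 1
    return {
        "max_failed": max_failed,
        "aborted_tasks": aborted_tasks,
        "serious": aborted_tasks > 0 or max_failed >= failed_threshold,
    }

mild = [
    "Jan 17 13:36:50 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1",
    "Jan 17 13:36:50 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1",
]
severe = [
    "Jan 15 13:39:08 Tower kernel: sas: Enter sas_scsi_recover_host busy: 62 failed: 62",
    "Jan 15 13:39:08 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff88087b475900",
]

print(classify_sas_errors(mild))
print(classify_sas_errors(severe))
```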

Link to comment

Started a parity check and it's been going for seven minutes now. What's interesting is that when I refresh the page a bunch of times, the speed doesn't drop below 51MB/s. Before, after I moved it from the 16x slot to the 8x slot and with the old cable, it was bouncing around between the 30s, the 40s and the 50s; it was nowhere near as consistent and fast as it is now, so that's odd. Maybe the cable was flaky? Maybe the tension on the HBA before was causing the issue?

 

Oh, and it's only found six errors so far.

 

 

And I spoke too soon: we are down to speeds of 36MB/s and higher. Also, we are up to 9 sync errors corrected so far.
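To put those speeds in perspective, here is a back-of-the-envelope estimate of how long a full check would take at a constant rate. It assumes the check has to cover an 8TB parity drive (the size mentioned later in the thread), decimal units (1TB = 10^12 bytes), and a steady speed, which a real check will not have; the function name is hypothetical.

```python
def hours_for_full_check(size_tb, speed_mb_s):
    """Hours to read size_tb terabytes at a constant speed_mb_s megabytes/second."""
    total_bytes = size_tb * 1e12           # decimal terabytes
    bytes_per_second = speed_mb_s * 1e6    # decimal megabytes
    return total_bytes / bytes_per_second / 3600

for speed in (36, 51):
    print(f"{speed} MB/s -> {hours_for_full_check(8, speed):.1f} hours")
```

Under these assumptions the run time is on the order of two days, so even small changes in sustained speed shift the finish time by many hours.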

Link to comment

So after a day and a half of running a parity check, with about 15hrs left, it cancelled, telling me I have issues with my second parity drive and disk 24, both 8TB drives.

 

Please advise next steps, this is getting tiresome.... I don't mind going down to one parity drive for now if the second parity drive is indeed bad.

 

Diags attached.

 

The only thing I can think of is that these drives are both on the rear backplane, and their sleds are different from those of the drives in the front, which are all the same. I can try putting these disks in different slots; there are at least six empty ones. I find it hard to believe it could be the sleds: the disks are fine up until about 60-70% of the way through the parity check, and then they fail. Weird.

tower-diagnostics-20170118-2006.zip

Link to comment

The errors are very similar to last time. Both disks dropped offline, so there's no SMART data. Reboot and upload new diagnostics, but it's very unlikely that both failed at the same time.

 

If SMART looks good on both, I would try a different controller, one with an LSI chipset, which is the same brand used by the backplane expanders, so there's less chance of some incompatibility/problem.

Link to comment

OK, rebooted and saw I/O errors for ata27, which could not be identified. That looks to be my second parity drive; it's not available in the unRAID GUI. Disk 24 is available and looks OK, although it still has a red X. I don't mind proceeding with only one parity drive at this point. Should I remove disk 24, start the array, stop the array, put disk 24 back, and start the array again for a parity rebuild? Also, should I physically pull the failed Parity 2 drive before I start the parity rebuild?

 

Diags after the reboot attached.

tower-diagnostics-20170119-1029.zip

Link to comment
