More problems with Supermicro server - three failed drives, on parity one data,

ashman70 · January 16, 2017

Cool, thanks for looking into that. It would seem kind of odd if the SASLP supported dual link and the SAS2LP didn't?

JorgeB · January 16, 2017

It would seem kind of odd if the SASLP supported dual link and the SAS2LP didn't?

Yes, but I've learned a long time ago never to assume anything in these situations.

ashman70 · January 16, 2017

Absolutely, I agree 100%

ashman70 · January 17, 2017

OK, I've got my new cable and got it plugged in, the server's booted up, I've done a new config and assigned all the drives correctly, and I am about to commit.

Do I check 'Parity is correct' then start the array and then do a parity check? I don't want to start the array and erase my parity drives do I?

Thanks.

JorgeB · January 17, 2017

You can do it either way. P parity should be mostly in sync, but Q should have a few sync errors. The difference is that if there are a lot of sync errors, a correcting check will be slower than a sync. In this case I would probably try the correcting check first; if it starts finding many errors and slowing down, abort and do a sync (only on Q parity, if that's where most errors are).

ashman70 · January 17, 2017

What do you mean by P and Q?

JorgeB · January 17, 2017

P=parity

Q=parity2
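For readers wondering what separates the two parities: in RAID-6-style dual parity, P is conventionally a plain XOR across the data disks, while Q is a weighted sum over the finite field GF(2^8). The sketch below illustrates that general scheme. Whether unRAID computes parity2 in exactly this way is an assumption, and the function names (`compute_p`, `compute_q`, and the helpers) are made up for illustration.

```python
from functools import reduce


def gf_mul2(x):
    """Multiply by the generator 2 in GF(2^8), reduction polynomial 0x11d."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
    return x & 0xFF


def gf_mul(a, b):
    """Multiply two GF(2^8) elements by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = gf_mul2(a)
        b >>= 1
    return result


def gf_pow2(n):
    """Return 2**n in GF(2^8)."""
    value = 1
    for _ in range(n):
        value = gf_mul2(value)
    return value


def compute_p(disks):
    """P parity: byte-wise XOR of all data disks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))


def compute_q(disks):
    """Q parity: byte-wise sum of (2**index * data) over GF(2^8)."""
    out = bytearray(len(disks[0]))
    for index, data in enumerate(disks):
        coeff = gf_pow2(index)
        for pos, byte in enumerate(data):
            out[pos] ^= gf_mul(coeff, byte)
    return bytes(out)


# Toy "disks" of two bytes each (hypothetical data).
disks = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
p = compute_p(disks)
q = compute_q(disks)

# With P alone, one missing disk is the XOR of P and the surviving disks.
rebuilt_disk1 = compute_p([p, disks[0], disks[2]])
```

Because P is just an XOR, any one missing disk can be rebuilt from P and the survivors; Q adds a second independent equation so that two missing disks can be recovered. A sync error during a check simply means the stored parity byte differs from the value recomputed this way.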

ashman70 · January 17, 2017

So should I just leave it now, or do a parity sync? I am uploading diags and I am seeing these errors in the system log, only in one area I think since boot up, should I be concerned about them?

Jan 17 13:36:50 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Jan 17 13:36:50 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 17 13:36:50 Tower kernel: sas: ata28: end_device-1:1:17: cmd error handler
Jan 17 13:36:50 Tower kernel: sas: ata1: end_device-1:0:12: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata2: end_device-1:0:13: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata3: end_device-1:0:14: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata4: end_device-1:0:15: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata5: end_device-1:0:16: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata6: end_device-1:0:17: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata7: end_device-1:0:18: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata8: end_device-1:0:19: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata9: end_device-1:0:20: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata10: end_device-1:0:21: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata11: end_device-1:0:22: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata12: end_device-1:0:23: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata13: end_device-1:0:24: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata14: end_device-1:0:25: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata16: end_device-1:0:27: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata15: end_device-1:0:26: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata18: end_device-1:0:29: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata17: end_device-1:0:28: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata19: end_device-1:0:30: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata20: end_device-1:0:31: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata21: end_device-1:0:32: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata22: end_device-1:0:33: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata23: end_device-1:0:34: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata24: end_device-1:0:35: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata25: end_device-1:1:13: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata26: end_device-1:1:14: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata27: end_device-1:1:16: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata28: end_device-1:1:17: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata29: end_device-1:1:20: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata30: end_device-1:1:23: dev error handler
Jan 17 13:36:50 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

I found this article, but I am not sure if it applies since I am not doing a resume from suspend to ram

http://unix.stackexchange.com/questions/182834/kernel-ata-exception-after-resume-from-suspend-to-ram

tower-diagnostics-20170117-1348.zip

JorgeB · January 17, 2017

Never good, but it recovered quickly. Let it run for a while, but it looks like the cable is not the problem.

ashman70 · January 17, 2017

Why do you say that? Because we are still seeing the errors?

Let me explain why I think it was the cable. There was so much tension on the cable, because it barely reached the HBA, that I think it was interrupting communications between the HBA and the rear backplane, to such an extent that we had drives on the rear backplane dropping out of the array. It could have also been affecting the HBA in other ways, I'm not sure. I know we are still seeing those error messages, but I believe they have always been there.

JorgeB · January 17, 2017

Why do you say that? Because we are still seeing the errors?

Yes, although that error alone is not really serious, more like a delayed read. If you get some like these, then they are serious and usually lead to read errors/dropped disks:

Jan 15 13:39:08 Tower kernel: sas: Enter sas_scsi_recover_host busy: 62 failed: 62
Jan 15 13:39:08 Tower kernel: sas: trying to find task 0xffff88087b475900
Jan 15 13:39:08 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff88087b475900
Jan 15 13:39:08 Tower kernel: sas: sas_scsi_find_task: task 0xffff88087b475900 is aborted
Jan 15 13:39:08 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff88087b475900 is aborted
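The distinction above (a lone recovery with one failed command versus a recovery with many failed commands and aborted tasks) can be checked mechanically. The following is only a rough sketch: the patterns are taken from the log lines quoted in this thread, while the function name `summarize_sas_errors` and the `failed_threshold` default are hypothetical choices, not anything unRAID or the kernel defines.

```python
import re

# Patterns modelled on the kernel log lines quoted in this thread.
RECOVER_RE = re.compile(
    r"sas: Enter sas_scsi_recover_host busy: (\d+) failed: (\d+)"
)
ABORT_RE = re.compile(r"sas: sas_scsi_find_task: aborting task 0x[0-9a-f]+")


def summarize_sas_errors(lines, failed_threshold=10):
    """Count SAS error-handler events and flag the more worrying kind.

    failed_threshold is an arbitrary illustrative cutoff (an assumption).
    """
    summary = {"recoveries": 0, "max_failed": 0, "aborted_tasks": 0}
    for line in lines:
        match = RECOVER_RE.search(line)
        if match:
            summary["recoveries"] += 1
            summary["max_failed"] = max(
                summary["max_failed"], int(match.group(2))
            )
        elif ABORT_RE.search(line):
            summary["aborted_tasks"] += 1
    summary["serious"] = (
        summary["aborted_tasks"] > 0
        or summary["max_failed"] >= failed_threshold
    )
    return summary


mild_log = [
    "Jan 17 13:36:50 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1",
    "Jan 17 13:36:50 Tower kernel: sas: ata28: end_device-1:1:17: cmd error handler",
    "Jan 17 13:36:50 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1",
]
serious_log = [
    "Jan 15 13:39:08 Tower kernel: sas: Enter sas_scsi_recover_host busy: 62 failed: 62",
    "Jan 15 13:39:08 Tower kernel: sas: trying to find task 0xffff88087b475900",
    "Jan 15 13:39:08 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff88087b475900",
    "Jan 15 13:39:08 Tower kernel: sas: sas_scsi_find_task: task 0xffff88087b475900 is aborted",
    "Jan 15 13:39:08 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff88087b475900 is aborted",
]

mild = summarize_sas_errors(mild_log)
serious = summarize_sas_errors(serious_log)
```

To run it against a real log, pass the lines of the syslog file from the diagnostics zip instead of the sample lists.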

ashman70 · January 17, 2017

I looked back at one of my first logs when I set this up three days ago and they were there.

How long should I leave it running, do you reckon, before doing a parity check?

JorgeB · January 17, 2017

Start it now, because that's when there's more stress and errors are more likely.

ashman70 · January 17, 2017

Started a parity check and it's been going for seven minutes now. What's interesting is that when I refresh the page a bunch of times, the speed doesn't drop below 51MB/s. Before, after I moved the HBA from the 16x slot to the 8x slot and with the old cable, it was bouncing around from the 30s to the 40s and the 50s; it was nowhere near as consistent and fast as it is now, so that's odd. Maybe the cable was flaky? Maybe the tension on the HBA before was causing the issue?

Oh, and it's only found six errors so far.

And I spoke too soon, we are down to speeds of 36MB/s and higher, also we are up to 9 sync errors corrected so far.

ashman70 · January 17, 2017

Up to 6305 sync errors corrected so far :-\ :-X :'( ::)

JorgeB · January 17, 2017

The sync errors are perfectly normal after what happened, as long as there are no sas/ata log errors I would be happy.

ashman70 · January 17, 2017

Sometimes 'normal' doesn't feel very normal.

But thanks, I just want it to complete the parity sync without any drama. ::) :-\

ashman70 · January 17, 2017

Up to almost 1 million sync errors corrected now...wow!

ashman70 · January 18, 2017

Currently sitting at 53.3% and 1,445,071 sync errors corrected. Elapsed time is 23hrs 39min with 17hrs 25min to go.
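As a quick sanity check on those numbers, a straight-line extrapolation from progress so far gives a slightly longer remaining time than the GUI's figure, a little under 21 hours. That gap would be consistent with the GUI estimating from the current speed rather than the average, though that is an assumption. The helper name `linear_eta_minutes` is made up for this sketch.

```python
def linear_eta_minutes(elapsed_minutes, fraction_done):
    """Remaining minutes, assuming the average speed so far continues."""
    if not 0 < fraction_done < 1:
        raise ValueError("fraction_done must be between 0 and 1 (exclusive)")
    projected_total = elapsed_minutes / fraction_done
    return projected_total - elapsed_minutes


# Figures from the post above: 23hrs 39min elapsed at 53.3% complete.
elapsed = 23 * 60 + 39
remaining = linear_eta_minutes(elapsed, 0.533)
print(f"about {int(remaining // 60)}h {int(remaining % 60)}m remaining")
```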

ashman70 · January 19, 2017

So after a day and a half of doing a parity check, with about 15hrs left, it cancels, telling me I have issues with my second parity drive and disk 24, both 8TB drives.

Please advise next steps, this is getting tiresome.... :-\ I don't mind going down to one parity drive for now if the second parity drive is indeed bad.

Diags attached.

The only thing I can think of is that these drives are both on the rear backplane and their sleds are different from those of the drives in the front, which are all the same. I can try putting these disks in different slots; there are at least six empty ones. I find it hard to believe it could be the sleds: the disks are fine up until about 60-70% of the way through the parity check and then they fail. Weird.

tower-diagnostics-20170118-2006.zip

ashman70 · January 19, 2017

Anyone? Need advice on how to proceed.

JorgeB · January 19, 2017

Errors are very similar to the last time. Both disks dropped offline, so there's no SMART data; reboot and upload new diagnostics, but it's very unlikely that both failed at the same time.

If SMART looks good on both, I would try a different controller, one with an LSI chipset, which is the same brand used by the backplane expanders, so there's less chance of some incompatibility/problem.

ashman70 · January 19, 2017

OK, rebooted and saw I/O errors for ata27, which could not be identified; that looks to be my second parity drive, and it's not available in the unRAID GUI. Disk 24, however, is available and looks OK, although it still has a red X. I don't mind proceeding with only one parity at this point. Should I remove disk 24, start the array, stop the array, put disk 24 back, and start the array again for a parity rebuild? Also, should I physically pull the failed Parity 2 drive before I start the parity rebuild?

Diags after the reboot attached.

tower-diagnostics-20170119-1029.zip

JorgeB · January 19, 2017

The parity sync didn't finish, so it's not a good idea to rebuild, you can do another new config like last time and this time leave parity2 unassigned, but I suspect the results will be the same.

ashman70 · January 19, 2017

What do you mean you suspect the results will be the same? Same as what? And why do you expect the results will be the same?

Are you saying I shouldn't proceed until I replace the controller? Would an IBM M1015 in IT mode be sufficient?
