ashman70 Posted January 16, 2017 Author
Cool, thanks for looking into that. It would seem kind of odd if the SASLP supported dual link and the SAS2LP didn't?
JorgeB Posted January 16, 2017
Quoting ashman70: "It would seem kind of odd if the SASLP supported dual link and the SAS2LP didn't?"
Yes, but I learned a long time ago never to assume anything in these situations.
ashman70 Posted January 16, 2017 Author
Absolutely, I agree 100%
ashman70 Posted January 17, 2017 Author
OK, I've got my new cable plugged in, the server's booted up, I've done a new config and assigned all the drives correctly, and I am about to commit. Do I check 'Parity is correct', then start the array, and then do a parity check? I don't want to start the array and erase my parity drives, do I? Thanks.
JorgeB Posted January 17, 2017
You can do it either way. P parity should be mostly in sync, but Q should have a few sync errors. The difference is that if there are a lot of sync errors, a correcting check will be slower than a sync. In this case I would probably try the correcting check first; if it starts finding many errors and slowing down, abort and do a sync (only on Q parity if that's where most of the errors are).
ashman70 Posted January 17, 2017 Author
What do you mean by P and Q?
JorgeB Posted January 17, 2017
P = parity, Q = parity2
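For readers new to the terms, here is a minimal Python sketch of what a correcting parity check does for P. It assumes P is plain byte-wise XOR parity across the data disks, which the thread itself does not state; Q uses a different calculation and is not modelled here. The function names `xor_parity` and `check_and_correct` are made up for illustration, and sync errors are counted per byte purely for simplicity.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized data blocks (assumed P parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def check_and_correct(data_blocks, stored_parity):
    """Compare stored parity against parity recomputed from the data.

    Returns (corrected_parity, sync_errors), where sync_errors is the number
    of byte positions at which the stored parity disagreed with the data.
    """
    expected = xor_parity(data_blocks)
    sync_errors = sum(1 for s, e in zip(stored_parity, expected) if s != e)
    return expected, sync_errors


if __name__ == "__main__":
    disks = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xff\x00\xff"]
    stale_parity = b"\xee\x22\x00"  # last byte is out of sync
    print(check_and_correct(disks, stale_parity))
```

A full sync would simply write `xor_parity(disks)` without comparing first, which is why it does not slow down when the old parity is badly out of date.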
ashman70 Posted January 17, 2017 Author
So should I just leave it now, or do a parity sync? I am uploading diags, and I am seeing these errors in the system log. I think they appear in only one place since boot-up. Should I be concerned about them?

Jan 17 13:36:50 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1
Jan 17 13:36:50 Tower kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
Jan 17 13:36:50 Tower kernel: sas: ata28: end_device-1:1:17: cmd error handler
Jan 17 13:36:50 Tower kernel: sas: ata1: end_device-1:0:12: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata2: end_device-1:0:13: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata3: end_device-1:0:14: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata4: end_device-1:0:15: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata5: end_device-1:0:16: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata6: end_device-1:0:17: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata7: end_device-1:0:18: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata8: end_device-1:0:19: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata9: end_device-1:0:20: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata10: end_device-1:0:21: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata11: end_device-1:0:22: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata12: end_device-1:0:23: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata13: end_device-1:0:24: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata14: end_device-1:0:25: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata16: end_device-1:0:27: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata15: end_device-1:0:26: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata18: end_device-1:0:29: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata17: end_device-1:0:28: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata19: end_device-1:0:30: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata20: end_device-1:0:31: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata21: end_device-1:0:32: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata22: end_device-1:0:33: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata23: end_device-1:0:34: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata24: end_device-1:0:35: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata25: end_device-1:1:13: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata26: end_device-1:1:14: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata27: end_device-1:1:16: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata28: end_device-1:1:17: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata29: end_device-1:1:20: dev error handler
Jan 17 13:36:50 Tower kernel: sas: ata30: end_device-1:1:23: dev error handler
Jan 17 13:36:50 Tower kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 1 tries: 1

I found this article, but I am not sure it applies, since I am not resuming from suspend to RAM: http://unix.stackexchange.com/questions/182834/kernel-ata-exception-after-resume-from-suspend-to-ram
tower-diagnostics-20170117-1348.zip
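With thirty-odd near-identical entries, it can be easier to tally the excerpt than to read it. The following is a rough Python sketch, assuming the syslog text looks like the lines quoted in the post; the function name `count_error_handlers` and the regular expression are illustrative, not part of any unRAID tool.

```python
import re
from collections import Counter

# Matches entries such as:
#   "kernel: sas: ata28: end_device-1:1:17: cmd error handler"
HANDLER_RE = re.compile(
    r"sas: (ata\d+): (end_device-[\d:]+): (cmd|dev) error handler"
)


def count_error_handlers(log_text):
    """Count 'cmd' and 'dev' error-handler entries per ATA port."""
    counts = {"cmd": Counter(), "dev": Counter()}
    for port, _device, kind in HANDLER_RE.findall(log_text):
        counts[kind][port] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "Jan 17 13:36:50 Tower kernel: sas: ata28: end_device-1:1:17: cmd error handler\n"
        "Jan 17 13:36:50 Tower kernel: sas: ata1: end_device-1:0:12: dev error handler\n"
        "Jan 17 13:36:50 Tower kernel: sas: ata28: end_device-1:1:17: dev error handler\n"
    )
    print(count_error_handlers(sample))
```

In the excerpt above, the only `cmd error handler` entry is for ata28, so a per-port tally over a longer log would show whether the same port keeps reappearing.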
JorgeB Posted January 17, 2017
Never good, but it recovered quickly. Let it run for a while, but it looks like the cable is not the problem.
ashman70 Posted January 17, 2017 Author
Why do you say that? Because we are still seeing the errors? Let me explain why I think it was the cable. There was so much tension on the cable, because it barely reached the HBA, that I think it was interrupting communication between the HBA and the rear backplane, to the extent that drives on the rear backplane were dropping out of the array. It could have also been affecting the HBA in other ways, I'm not sure. I know we are still seeing those error messages, but I believe they have always been there.
JorgeB Posted January 17, 2017
Quoting ashman70: "Why do you say that? Because we are still seeing the errors?"
Yes, although that error alone is not really serious, more like a delayed read. If you get some like these, then they are serious, and they usually lead to read errors or dropped disks:

Jan 15 13:39:08 Tower kernel: sas: Enter sas_scsi_recover_host busy: 62 failed: 62
Jan 15 13:39:08 Tower kernel: sas: trying to find task 0xffff88087b475900
Jan 15 13:39:08 Tower kernel: sas: sas_scsi_find_task: aborting task 0xffff88087b475900
Jan 15 13:39:08 Tower kernel: sas: sas_scsi_find_task: task 0xffff88087b475900 is aborted
Jan 15 13:39:08 Tower kernel: sas: sas_eh_handle_sas_errors: task 0xffff88087b475900 is aborted
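The distinction drawn in this post can be turned into a quick log triage. The sketch below is only an encoding of that rule of thumb, not an official severity scale from the kernel: it treats any `aborting task` entry as serious and a bare recovery pass as minor. The function name `classify_sas_log` and the labels are assumptions made for illustration.

```python
import re

# "Enter sas_scsi_recover_host busy: N failed: M" marks the start of a recovery pass.
RECOVER_RE = re.compile(r"Enter sas_scsi_recover_host busy: (\d+) failed: (\d+)")
# Substring present in the excerpt that the post describes as serious.
ABORT_MARKER = "sas_scsi_find_task: aborting task"


def classify_sas_log(log_text):
    """Label a syslog excerpt as 'serious', 'minor', or 'none'.

    Also reports the largest 'failed' count seen on any recovery pass.
    """
    failed_counts = [int(failed) for _busy, failed in RECOVER_RE.findall(log_text)]
    max_failed = max(failed_counts, default=0)
    if ABORT_MARKER in log_text:
        severity = "serious"
    elif failed_counts:
        severity = "minor"
    else:
        severity = "none"
    return {"severity": severity, "max_failed": max_failed}
```

Run against the two excerpts in this thread, the January 17 lines (failed: 1, no aborted tasks) fall in the minor bucket, while the January 15 lines (failed: 62, task aborted) fall in the serious one.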
ashman70 Posted January 17, 2017 Author
I looked back at one of my first logs from when I set this up three days ago, and they were there. How long should I leave it running, do you reckon, before doing a parity check?
JorgeB Posted January 17, 2017
Start it now, because that's when there's more stress and errors are more likely.
ashman70 Posted January 17, 2017 Author
Started a parity check and it's been going for seven minutes now. What's interesting is that when I refresh the page a bunch of times, the speed doesn't drop below 51MB/s. Before, after I moved the HBA from the 16x slot to the 8x slot and with the old cable, it was bouncing around between the 30s, the 40s and the 50s; it was nowhere near as consistent and fast as it is now, so that's odd. Maybe the cable was flaky? Maybe the tension on the HBA before was causing the issue? Oh, and it's only found six errors so far.
And I spoke too soon: we are down to speeds of 36MB/s and higher, and we are up to 9 sync errors corrected so far.
ashman70 Posted January 17, 2017 Author
Up to 6305 sync errors corrected so far :'(
JorgeB Posted January 17, 2017
The sync errors are perfectly normal after what happened. As long as there are no SAS/ATA errors in the log, I would be happy.
ashman70 Posted January 17, 2017 Author
Sometimes 'normal' doesn't feel very normal. But thanks, I just want it to complete the parity sync without any drama.
ashman70 Posted January 17, 2017 Author
Up to almost 1 million sync errors corrected now...wow!
ashman70 Posted January 18, 2017 Author
Currently sitting at 53.3% and 1,445,071 sync errors corrected. Elapsed time is 23hrs 39min with 17hrs 25min to go.
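As a sanity check on those numbers, the average speed and a naive remaining-time estimate can be worked out from the percentage and the elapsed time. This Python sketch assumes the check covers an 8 TB parity drive (the drive size mentioned elsewhere in the thread, taken here as 8e12 bytes) and that progress is linear; `parity_check_stats` is a hypothetical helper, not something unRAID provides.

```python
def parity_check_stats(percent_done, elapsed_seconds, parity_bytes):
    """Return (average bytes per second, naive remaining seconds).

    The remaining time is a straight-line extrapolation from the average
    rate so far, so it ignores any change in speed later in the check.
    """
    fraction = percent_done / 100
    average_rate = fraction * parity_bytes / elapsed_seconds
    remaining_seconds = elapsed_seconds * (1 - fraction) / fraction
    return average_rate, remaining_seconds


if __name__ == "__main__":
    elapsed = 23 * 3600 + 39 * 60  # 23hrs 39min
    rate, remaining = parity_check_stats(53.3, elapsed, 8e12)
    print(f"average: {rate / 1e6:.1f} MB/s, remaining: {remaining / 3600:.1f} h")
```

Under those assumptions the average works out to about 50 MB/s, in line with the speeds reported earlier, and the straight-line estimate is about 20.7 hours. The GUI's shorter figure of 17hrs 25min is presumably based on the current speed rather than the average so far.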
ashman70 Posted January 19, 2017 Author
So after a day and a half of doing a parity check, with about 15hrs left, it cancels, telling me I have issues with my second parity drive and disk 24, both 8TB drives. Please advise next steps, this is getting tiresome.... I don't mind going down to one parity drive for now if the second parity drive is indeed bad. Diags attached.
The only thing I can think of is that these drives are both on the rear backplane and their sleds are different from those of the drives in the front, which are all the same. I can try putting these disks in different slots; there are at least six empty ones. I find it hard to believe it could be the sleds: the disks are fine up until about 60-70% of the way through the parity check, and then they fail. Weird.
tower-diagnostics-20170118-2006.zip
ashman70 Posted January 19, 2017 Author
Anyone? Need advice on how to proceed.
JorgeB Posted January 19, 2017
The errors are very similar to last time. Both disks dropped offline, so there's no SMART data. Reboot and upload new diagnostics, but it's very unlikely that both failed at the same time. If SMART looks good on both, I would try a different controller, one with an LSI chipset, which is the same brand used by the backplane expanders, so there is less chance of some incompatibility or problem.
ashman70 Posted January 19, 2017 Author
OK, rebooted and saw I/O errors for ATA:27, which could not be identified. That looks to be my second parity drive, and it's not available in the unRAID GUI. Disk 24 is available and looks OK, although it still has a red X. I don't mind proceeding with only one parity drive at this point. Should I remove disk 24, start the array, stop the array, put disk 24 back, and start the array again for a parity rebuild? Also, should I physically pull the failed Parity 2 drive before I start the parity rebuild? Diags after the reboot attached.
tower-diagnostics-20170119-1029.zip
JorgeB Posted January 19, 2017
The parity sync didn't finish, so it's not a good idea to rebuild. You can do another new config like last time, and this time leave parity2 unassigned, but I suspect the results will be the same.
ashman70 Posted January 19, 2017 Author
What do you mean, you suspect the results will be the same? Same as what? And why do you expect the results will be the same? Are you saying I shouldn't proceed until I replace the controller? Would an IBM M1015 in IT mode be sufficient?