jang430

Quick, need help. Array showing absurd # of reads and writes on several drives. Shares not accessible

33 posts in this topic

Posted (edited)

Reads are 22,617,230,667,232, Writes are 22,617,230,667,232.  And of course, thousands of errors.

 

Aug 10 21:12:27 Tower kernel: md: disk1 read error, sector=39195248
Aug 10 21:12:27 Tower kernel: md: disk2 read error, sector=39195248
Aug 10 21:12:27 Tower kernel: md: disk3 read error, sector=39195248
Aug 10 21:12:27 Tower kernel: md: disk4 read error, sector=39195248
Aug 10 21:12:27 Tower kernel: md: disk7 read error, sector=39195248
Aug 10 21:12:27 Tower kernel: md: disk8 read error, sector=39195248
Aug 10 21:12:27 Tower kernel: md: disk9 read error, sector=39195248
Aug 10 21:12:27 Tower kernel: md: disk0 read error, sector=39195248
Aug 10 21:12:27 Tower kernel: md: disk1 read error, sector=39195256
Aug 10 21:12:27 Tower kernel: md: disk2 read error, sector=39195256
Aug 10 21:12:27 Tower kernel: md: disk3 read error, sector=39195256
Aug 10 21:12:27 Tower kernel: md: disk4 read error, sector=39195256
Aug 10 21:12:27 Tower kernel: md: disk7 read error, sector=39195256
Aug 10 21:12:27 Tower kernel: md: disk8 read error, sector=39195256

 

Not sure if it's my HBA controller, which I removed and installed back again.  Before that, it was OK.  I have no spare HBA.  As for the SFF-8087 cables, I'm using 2 cables for 7 drives.  It can't be possible that all drives are failing if 1 of the cables is defective.
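To make the pattern in the log excerpt easier to see, here is a minimal Python sketch that groups the `md: diskN read error` lines by sector. The function name `disks_per_sector` and the sample string are illustrative (the sample lines are copied from the excerpt above); it is not part of any Unraid tooling. Many disks reporting the same sector in the same second is the kind of pattern that points at something shared (controller, cabling, power) rather than at every drive failing at once.

```python
import re
from collections import defaultdict

# Matches the md driver lines quoted in the post, e.g.
# "md: disk1 read error, sector=39195248"
MD_READ_ERROR = re.compile(r"md: disk(\d+) read error, sector=(\d+)")

def disks_per_sector(log_text):
    """Map each failing sector to the sorted list of disk slots reporting it."""
    sectors = defaultdict(set)
    for match in MD_READ_ERROR.finditer(log_text):
        disk, sector = int(match.group(1)), int(match.group(2))
        sectors[sector].add(disk)
    return {sector: sorted(disks) for sector, disks in sectors.items()}

# A few lines copied from the syslog excerpt in this post.
MD_SAMPLE = """\
Aug 10 21:12:27 Tower kernel: md: disk1 read error, sector=39195248
Aug 10 21:12:27 Tower kernel: md: disk2 read error, sector=39195248
Aug 10 21:12:27 Tower kernel: md: disk0 read error, sector=39195248
Aug 10 21:12:27 Tower kernel: md: disk1 read error, sector=39195256
"""

md_summary = disks_per_sector(MD_SAMPLE)
for sector, disks in sorted(md_summary.items()):
    print(f"sector {sector}: disks {disks}")
```

To run it against a real log, read the syslog file from the diagnostics zip into a string and pass that to `disks_per_sector` instead of `MD_SAMPLE`.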

tower-diagnostics-20190810-1324.zip

Edited by jang430


It might be worth checking that the HBA is well seated in the motherboard slot.   I once had similar problems with one that was not properly seated.

Posted (edited)

BTW, I forgot to mention that it starts out OK.  No errors.  I can even access shares, then after a few minutes, those errors show up.

 

I've removed it and reseated it properly once again.  But OK, I will move slots this time.  Any other suggestions from other members?  Hope this is not the end of my HBA :D  Thanks itimpi

Edited by jang430


@itimpi, reseating the HBA controller didn't help, but unrolling my SFF-8087 breakout cables did.  Now I'm able to access my array without problems.  It also performed a parity check and found 0 errors.  Could both my SFF-8087 cables have failed at the same time?  To be honest, I highly doubt it, but I don't have additional cables to troubleshoot with.  My cables have been a bit rolled up for a long time, and it was never a problem before.  All I did was remove and reseat my HBA controller.  I don't know why this happened.


No idea why unrolling the cables worked unless they were putting some strain on the connectors so that vibration could make them momentarily break contact.     I personally got relatively short cables to reduce clutter.    I do know that it is recommended you do NOT try and tidy SATA cables by taping them together as this tends to increase cross-talk but not sure how true this really is.

16 hours ago, itimpi said:

 I do know that it is recommended you do NOT try and tidy SATA cables by taping them together as this tends to increase cross-talk but not sure how true this really is.

Personally I think the issue is the extremely poor retention at the connector, and any attempt to bundle the cables is more likely to pull one of the ends out of alignment. The connector must be completely square in all dimensions to make a proper link that will stay connected during normal system vibration.


So I got a new set of cables, which just arrived today.  No mess.  I connected the cables, and now only 1 drive says "mounting".  Nothing is happening and the array has not started.  Please help.  Attaching diagnostics.  BTW, I've already moved the HBA to a different PCIe slot.

tower-diagnostics-20190814-1218.zip


Lots of errors on parity. It's difficult to say if it's a disk problem since it's not generating a complete SMART report; try connecting it to a different cable/controller to rule that out.


OK.  I'm using a new cable already, but I will remove the drive from the case and try again.  Is it possible that it's a PSU problem?


Could be, but unlikely since only the parity disk has a problem; it's also suspicious that the SMART report is incomplete.
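As a rough way to check whether a saved SMART report is "complete" in the sense discussed here, the sketch below looks for the attribute table that `smartctl -A` prints for ATA drives and lists the attribute names it finds. The function name `smart_attributes` is hypothetical, and both sample strings are invented for illustration; they are not taken from the diagnostics attached to this thread. A report with no attribute table at all returns an empty list, which is only a hint that the drive did not answer fully, not proof of a bad disk.

```python
import re

# Header of the attribute table printed by `smartctl -A` for ATA drives.
ATTR_HEADER = re.compile(r"^ID#\s+ATTRIBUTE_NAME", re.MULTILINE)
# An attribute row: numeric ID, attribute name, then a hex flag field.
ATTR_ROW = re.compile(r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s", re.MULTILINE)

def smart_attributes(report_text):
    """Return attribute names found in the report, or [] if the table is missing."""
    if not ATTR_HEADER.search(report_text):
        return []
    return [m.group(2) for m in ATTR_ROW.finditer(report_text)]

# Invented example of a report that contains the attribute table.
SMART_COMPLETE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""
# Invented example of a report that stops before the attribute table.
SMART_TRUNCATED = "Device identity section only, no attribute table\n"

print(smart_attributes(SMART_COMPLETE))
print(smart_attributes(SMART_TRUNCATED))
```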

24 minutes ago, johnnie.black said:

difficult to say if it's a disk problem since it's not generating a complete SMART report,

 


I've done some further testing.  I've switched power between drives, and still have the same problem.  I've connected the parity drive directly to the motherboard instead of the HBA controller (using a regular SATA cable), and I'm still having the same problems.  I'm attaching a screenshot here.

 

My Drive 3 is also disabled since I think it died first before the parity.

 

Won't the array start without parity?  Any suggestions on what to do next?

Annotation 2019-08-14 234828.png


So as you can see, disk 3 is currently disabled because the drive failed.  What steps should I start with to recover as much as possible? :D


Disk 3 also appears to be failing. Best options IMHO are manually copying everything you can from old disk3 to a new disk, or using ddrescue to clone it. When that's done, also get a new disk for parity, do a new config with the new disk3 and the remaining disks, and sync parity.
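For the ddrescue route, a common approach with GNU ddrescue is two passes sharing one mapfile: a quick first pass that skips the slow scraping phase (`-n`), then a second pass with direct disk access (`-d`) and a few retries (`-r3`). The sketch below only builds and prints those command lines; it runs nothing. The device names `/dev/sdX` and `/dev/sdY` and the mapfile name `disk3.map` are placeholders, and `-f` is needed because the output is a block device. Double-check source and target before running anything, since swapping them overwrites the disk you are trying to save.

```python
import shlex

def ddrescue_passes(source, target, mapfile, retries=3):
    """Return two GNU ddrescue command lines as argument lists (nothing is executed)."""
    # Pass 1: copy the easy areas first, skipping the scraping phase.
    first = ["ddrescue", "-f", "-n", source, target, mapfile]
    # Pass 2: go back to the bad areas with direct access and limited retries.
    second = ["ddrescue", "-d", "-f", f"-r{retries}", source, target, mapfile]
    return [first, second]

# Placeholder devices: old disk3 as source, the new disk as target.
for cmd in ddrescue_passes("/dev/sdX", "/dev/sdY", "disk3.map"):
    print(shlex.join(cmd))
```

The mapfile is what lets the second pass resume where the first one left off, so keep it on a disk other than the two involved in the clone.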


While rebuilding parity, I am seeing a lot of errors, though the array is accessible.  Is this normal?

 

image.thumb.png.c767f196448096e565f4d161033ce319.png

 

Could it finally be my Dell Perc H310 that is causing problems?  Any way to find out?  I'm using new SFF-8087 cables already.  

tower-diagnostics-20190818-0437.zip

Edited by jang430


I have accessed shows from the array. Without any heavy transfer or access, the array seems to be working fine. I am afraid to do anything heavy right now. Hope you can help point out the possible cause of these issues.

Edited by jang430

On 8/18/2019 at 5:38 AM, jang430 said:

Could it finally be my Dell Perc H310 that is causing problems?

Looks like it:

Aug 18 09:55:38 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!

Try it in a different slot, also make sure it's sufficiently cooled.

 

You're also having problems with the parity disk, and that one is on the onboard SATA ports. It dropped offline so there's no SMART report, but it looks more like a connection problem.
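If you want to see when the controller dropped out relative to the disk errors, a small Python sketch like the one below can pull the "SAS host is non-operational" lines out of the syslog. The function name `controller_faults` is made up for this example, and the pattern is only assumed to cover driver prefixes of the form `mpt2sas_cm0` / `mpt3sas_cm0`; the sample line is the one quoted above.

```python
import re

# Assumed pattern for the HBA driver fault line quoted in this post.
CONTROLLER_FAULT = re.compile(r"(mpt\dsas_cm\d+): SAS host is non-operational")

def controller_faults(log_text):
    """Return (timestamp, controller) pairs for each controller fault line."""
    hits = []
    for line in log_text.splitlines():
        match = CONTROLLER_FAULT.search(line)
        if match:
            # Classic syslog lines start with "Mon DD HH:MM:SS".
            timestamp = " ".join(line.split()[:3])
            hits.append((timestamp, match.group(1)))
    return hits

# The line quoted from the diagnostics in this post.
FAULT_SAMPLE = "Aug 18 09:55:38 Tower kernel: mpt2sas_cm0: SAS host is non-operational !!!!\n"

print(controller_faults(FAULT_SAMPLE))
```

If the fault timestamps line up with the start of the read errors on the HBA-attached disks, that supports the controller (or its slot or cooling) as the cause rather than the drives themselves.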

