Multiple Disk Failures in Array


galways


Hi, ever since upgrading to 6.1.0 I have had multiple disks reported as faulty. I rebuild the faulty disk, and soon after I start a parity check another disk shows as faulty. I have now rebuilt 3 different disks, and sure enough another faulty disk has shown up. I rebuilt the first disk reported as faulty onto a new precleared disk that I had on hand. For the subsequent rebuilds I reused the disks I had pulled, since they appeared to be fine after preclearing. The last disk to show as faulty, disk 5, was one of those previously faulty disks that I had precleared. I have obviously screwed something up in what I have been doing, so I am now seeking advice on how to get the array stable.

 

Running on an Asus X99-A motherboard, Intel i7-5820K, 32 GB RAM, Corsair HX1000i power supply. I checked the SATA cables when the first disk went and they appear to be OK; power should be more than adequate. Attaching diagnostics and the preclear report of the disk that just reported as faulty.

 

Thanks in advance

preclear_start_5XW14C4P_2015-09-02.txt

tower-diagnostics-20150903-2151.zip

preclear_rpt_5XW14C4P_2015-09-02.txt

preclear_finish_5XW14C4P_2015-09-02.txt


I don't think you have or have had any faulty disks; I think it's the system itself that is unstable. Something is really wrong, with numerous kernel crashes and drives completely dropping out then coming back. One drive (sdc) was seen, identified, and set up, then suddenly it completely lost contact with the system, as if its cables had been disconnected. Then later it showed up again, cables reconnected! That would seem to be either severe vibration, loose backplane connections, or unreliable power. The kernel crashes and other instability could be memory, so that's the first test: start Memtest from the unRAID boot menu and run it for several passes. Check all connections, and make sure they are tight and can't vibrate loose. Check the power and SATA cable connectors, and if there are power splitters, make sure there are no loose connections at all.
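
For anyone wanting to spot this kind of drop-out pattern themselves, here is a rough Python sketch that counts link-down, reset, and exception events per ATA port in a syslog. The patterns are modeled on common libata kernel messages and are an assumption; the sample lines are invented for illustration and are not taken from the attached diagnostics, so adjust both to what your own syslog actually contains.

```python
import re
from collections import Counter

# Assumed patterns, modeled on typical libata kernel messages.
# They may need adjusting for a particular controller or driver.
EVENT_PATTERNS = {
    "link_down": re.compile(r"\b(ata\d+): SATA link down"),
    "hard_reset": re.compile(r"\b(ata\d+): hard resetting link"),
    "exception": re.compile(r"\b(ata\d+)(?:\.\d+)?: exception Emask"),
}


def count_ata_events(lines):
    """Return a Counter keyed by (port, event_name) for matching log lines."""
    counts = Counter()
    for line in lines:
        for name, pattern in EVENT_PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(match.group(1), name)] += 1
    return counts


# Invented sample lines, for illustration only.
SAMPLE = [
    "Sep  3 21:40:01 Tower kernel: ata5: SATA link down (SStatus 0 SControl 300)",
    "Sep  3 21:40:06 Tower kernel: ata5: hard resetting link",
    "Sep  3 21:40:07 Tower kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "Sep  3 21:41:10 Tower kernel: ata7: hard resetting link",
    "Sep  3 21:41:11 Tower kernel: ata5: hard resetting link",
]

if __name__ == "__main__":
    for (port, event), n in sorted(count_ata_events(SAMPLE).items()):
        print(f"{port:6s} {event:10s} {n}")
```

A port that racks up many resets while the others stay quiet points at that cable, backplane slot, or controller rather than at the disk itself.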

 

The motherboard seems odd.  There are 2 onboard SATA controllers, a special 4 port controller with 2 ports unusable, and the normal 6 port controller with the first 2 ports unusable.  You have been provided with SATA controllers that support 10 SATA ports, yet only have 6 usable ones!  That seems strange.

 

The drive you Precleared seems fine. I wouldn't bother Preclearing anything more until the system is stable. If you have any overclocking, turn it off. Set any BIOS settings to safe defaults. With those kernel crashes, I would not trust the system at all until after a reboot. And do not run any parity checks, parity builds, or drive rebuilds until the system can be trusted. They are only causing additional problems.


Rob, I removed and reinserted every cable connection. While they all appeared to be connected, the 24-pin power connector didn't have its clip fully engaged, so it may have been the problem. I haven't run a memory test yet. I rebuilt the supposedly faulty disk and restarted the parity check at 19:39. At 19:59 it was back to a failed parity check. I have attached the diagnostic report. Are you able to advise whether it appears to be the same issue? If so I'll run the memory test. I use an SAS2LP-MV8, which was feeding the drive in question. Could that possibly be the problem? I have another and could swap it out if warranted. Please advise, thanks.

tower-diagnostics-20150904-2006.zip


Ended up doing a new config. After 12 hours the parity sync completed with disk two showing 117 errors. All drives were green. As per the advice in http://lime-technology.com/forum/index.php?topic=40106.0 I didn't attempt to rebuild the disk until parity was verified. I ran a parity check; it failed and now shows disk 5 as faulty. I'm almost at the end of my rope. I am now going to run the memtest, which I hadn't done because I had convinced myself that the cause was a loose power cable. The fact that the parity sync seemed to run fine gave me the false hope that I was out of the woods. Guess not!

Reports attached, any further advice appreciated.

tower-diagnostics-20150906-0720.zip


Determined that the errors were being caused by my controller, an AOC-SAS2LP-MV8, dropping out. I pulled the controller and connected my parity and 8 data drives to the motherboard (no room for dual cache). Parity sync completed with no errors, and the system is now halfway through a parity check without errors. It was definitely the SAS2LP: before pulling it I tried running my array on two SAS2LPs to see whether the problem was my motherboard ports. The sync ran less than 20 minutes before it slowed to a crawl, indicating 600-plus days to complete.
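
For a sense of scale, here is a back-of-the-envelope Python sketch of the throughput that a 600-day estimate implies. The parity size used here (4 TB) is a hypothetical figure picked purely for illustration; the actual drive sizes aren't stated in this thread, so substitute your own.

```python
SECONDS_PER_DAY = 86_400


def implied_speed_bytes_per_s(remaining_bytes, eta_days):
    """Average speed (bytes/s) needed to finish remaining_bytes in eta_days."""
    if eta_days <= 0:
        raise ValueError("eta_days must be positive")
    return remaining_bytes / (eta_days * SECONDS_PER_DAY)


# Hypothetical 4 TB parity drive; an assumption, not a value from this thread.
ASSUMED_PARITY_BYTES = 4 * 10**12

if __name__ == "__main__":
    speed = implied_speed_bytes_per_s(ASSUMED_PARITY_BYTES, 600)
    print(f"{speed / 1000:.0f} kB/s")
```

Under that assumption the sync would be moving at roughly 77 kB/s, orders of magnitude below what a healthy controller delivers, so an estimate like this signals a stalled or failing I/O path rather than a merely slow one.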

 


There are quite a few reports of trouble with the SAS2LP and v6, including yours. The problems differ and can be separated (I think) into 3 categories.

 

* Marvell disk controller chipsets and virtualization - no attached drives available on affected controller unless IOMMU is disabled

 

* slow parity checks with the SAS2LP - long thread of attempts to understand the problem

 

* other misbehavior, such as randomly dropping drives - reports like yours and bkastner's
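
For the first category, one quick way to see whether an IOMMU is active on a Linux box is to look for numbered groups under `/sys/kernel/iommu_groups`. The Python sketch below does that; the function name is made up for illustration, and treating an empty or missing directory as "IOMMU not active" is a simplifying assumption, not an unRAID-specific check.

```python
from pathlib import Path


def list_iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return IOMMU group names found under root, numerically sorted.

    An empty list is taken to mean no IOMMU groups are active
    (a simplifying assumption for this sketch).
    """
    base = Path(root)
    if not base.is_dir():
        return []
    names = [p.name for p in base.iterdir() if p.is_dir()]
    # Numeric group names first in numeric order, anything else after.
    return sorted(names, key=lambda n: (0, int(n), "") if n.isdigit() else (1, 0, n))


if __name__ == "__main__":
    groups = list_iommu_groups()
    if groups:
        print(f"IOMMU appears active: {len(groups)} group(s)")
    else:
        print("No IOMMU groups found")
```

The root path is a parameter so the function can be pointed at a test directory instead of the live sysfs tree.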

 

I think you will find the reports and experiences of bkastner especially interesting, as they bear some resemblance to yours.

 

Because this may be a driver or firmware problem, I would hang onto the card.  When/if the fix appears, the card may be a great option again.
