One drive failed, then one failed during rebuild --HELP



I've had 3 drives fail in the past few months, replaced them, rebuilt them and moved on. After a fourth failed and I thought about the issue, I decided that my M1015 controller was going bad, so I exchanged it for a 2008 PIKE controller and started a data rebuild on my currently red-balled drive...

 

Stupidly, I forgot to turn off all my Dockers and VMs, and I had a preclear in progress in the background, smh... I know... stupid.

 

Now the GUI is saying that neither drive is mountable and that neither has a filesystem. Honestly, at this point I'm lost and caught with my trousers down.

 

help?

 

current syslog and smart reports for failed drives attached

godzilla-syslog-20150812-0128.zip

godzilla-smart-report-disk5-20150812-0113.txt

godzilla-smart-report-disk6-20150812-0113.txt

Link to comment

Is this the server in your sig? What is the power supply?

 

Smart looks good. How did those other drives fail?

 

Preclear shouldn't impact the rebuild in any way, and technically, other reads/writes to the array should only have the effect of slowing things down.

 

Was it rebuilding disk6 when disk5 was disabled?

 

Check your cables.

Link to comment


Trurl thanks for the reply,

It's my sig, I have to update it. It's now 18TB of WD Reds (all 3TB) with one 1TB WD Black cache.

 

The power supply is a 500W Corsair

http://m.newegg.com/Product/index?itemnumber=N82E16817139050&nm_mc=KNC-GoogleAdwords-Mobile&cm_mmc=KNC-GoogleAdwords-Mobile-_-pla-_-Power+Supplies-_-N82E16817139050&gclid=Cj0KEQjw3auuBRDj1LnQyLjy-4sBEiQAKPU_vU9Z6E355pPAEtYMTKSarEMw5ZCNHcsTVwoi6YnmdykaAuA58P8HAQ&gclsrc=aw.ds

 

The other drives failed during normal operation, first was drive 4, then drive 6, then drive 5, and again drive 6. Now drive 5 again

 

Yes drive 5 failed while rebuilding drive 6

 

All the cables are brand new from Newegg with clips on both sides, all of which are clipped and secure... maybe I'll try replacing them just in case.

 

Since the GUI is saying that they're not mountable, does that mean that my data on drive 5 is toast? Is there any way to save the data from disk 6 still?

 

Link to comment

The power supply is a 500W Corsair

I think that is rather borderline for that number of drives. A power supply that is slightly under-specced can cause random drives to appear to fail as they temporarily get starved of power.

The other drives failed during normal operation, first was drive 4, then drive 6, then drive 5, and again drive 6. Now drive 5 again

This is exactly the sort of behaviour that one can expect if the power supply is borderline for the number of drives.

 

I had a Corsair 550W power supply with 20 drives and was getting this sort of behaviour. When I swapped it for a 750W one the problems disappeared.
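A rough way to sanity-check a supply is to add up the 12 V current the drives pull at spin-up and compare it with what the 12 V rail is rated for. The sketch below only illustrates that arithmetic. The 2.0 A per-drive spin-up figure, the 10 A allowance for the rest of the system, and the rail ratings in the example calls are all assumed placeholder values, not measurements from any server in this thread, so take the real numbers from the drive datasheet and the PSU label.

```python
def spinup_headroom(num_drives, rail_amps_12v,
                    amps_per_drive=2.0, other_load_amps=10.0):
    """Return the spare 12 V current (in amps) if every drive spins up at once.

    amps_per_drive and other_load_amps are assumed placeholder values;
    replace them with datasheet figures for the actual hardware.
    """
    demand = num_drives * amps_per_drive + other_load_amps
    return rail_amps_12v - demand


# Hypothetical rail ratings, for illustration only.
print(spinup_headroom(20, 41.0))  # -9.0: the rail would be overloaded at spin-up
print(spinup_headroom(8, 34.0))   # 8.0: some margin left
```

A negative or near-zero result suggests either a larger supply or staggered spin-up. A positive result does not rule the supply out, since an ageing or poor-quality unit may not deliver its rated current.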

 

Since the GUI is saying that they're not mountable, does that mean that my data on drive 5 is toast? Is there any way to save the data from disk 6 still?

It is very rare that a disk being unmountable means that there is data loss unless the disk has physically failed.  It normally means that a write went wrong and there is some sort of file system corruption that can easily be repaired using the appropriate recovery program.
Link to comment

My apologies for the slow response (at work) and the confusing state of my sig. I haven't updated it in a while. Currently I only have in my array...

 

1 parity    3TB WD Red

1 cache     1TB WD Black

6 data      3TB WD Red

8 disks total

 

All my Dockers and VMs reside on the cache.

The appdata share is cache-only.

And all data goes to the cache as well, with mover doing its magic in the wee hours of the night.

Link to comment

Tell me if this sounds accurate. This failure has occurred across several separate drives and three separate drive bays, and the SATA cables have been replaced in the interim, so I'm thinking that the culprit is either the PSU or the enclosures that I'm using (Norco SS-500). I'll catalog the drives tonight. If the failures are in both enclosures (I use two), then it's the PSU, and if they're all in one enclosure, then it's most likely the enclosure?
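The reasoning above can be written down as a small check. This is only a sketch of that rule of thumb: the function name, the disk-to-enclosure mapping and the cage labels are hypothetical, and a failure pattern that spans both cages points at something shared (PSU, controller, cabling) rather than proving it is the PSU.

```python
def likely_culprit(failures):
    """Guess where to look first from a list of (disk, enclosure) failures.

    `failures` is a hypothetical catalog, e.g. [("disk5", "cage A"), ...].
    """
    enclosures = {enclosure for _, enclosure in failures}
    if not enclosures:
        return "no failures recorded"
    if len(enclosures) == 1:
        # Every failure sits in the same cage, so suspect that cage first.
        return "enclosure: " + enclosures.pop()
    # Failures in more than one cage point at a component they share.
    return "shared component (PSU, controller or cabling)"


# Made-up catalog, for illustration only.
print(likely_culprit([("disk4", "cage A"), ("disk5", "cage B"), ("disk6", "cage B")]))
```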

 

Also, I'm pretty sure the data is gone on disk 6 since I was in the middle of a rebuild, correct? (It was about 60% done when disk 5 failed.)

 

Disk 5 & 6 were both btrfs... how would I go about checking and repairing the filesystem on disk 5?

Link to comment

Also, I'm pretty sure the data is gone on disk 6 since I was in the middle of a rebuild, correct? (It was about 60% done when disk 5 failed.)

Were you rebuilding the failed disk back to itself? If so, then there is an excellent chance that the majority of the data is intact, as most of the time you would have been writing out sectors with the same contents that they already had.
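To see why, here is a toy model of a single-parity rebuild. It assumes simple XOR parity across the data disks, that parity was in sync when the disk was disabled, and that nothing was written to the emulated disk afterwards. The disk contents and sizes are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy array: three data disks of 10 one-byte "sectors" each, plus parity.
disk1 = bytes(range(10))
disk2 = bytes(range(50, 60))
disk3 = bytes(range(100, 110))
parity = xor_blocks([disk1, disk2, disk3])

# Rebuild disk3 onto itself, but get interrupted after 60% of the sectors.
target = bytearray(disk3)
stop = 6
target[:stop] = xor_blocks([disk1[:stop], disk2[:stop], parity[:stop]])

# The rebuilt sectors match what was already there, and the rest is untouched.
print(bytes(target) == disk3)  # True
```

If the other disks returned bad data during the rebuild (for example, because a second drive was dropping out), the sectors written in that window could differ, which is why a filesystem check is still worthwhile afterwards.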

Link to comment

 

 


 

You sir, just made my day. Yes... Thank god, yes I was.

Link to comment

Just cataloged my drives, and it turns out that the failed drives span both of my hot-swap drive cages.

 

So my plan is to pick up a new PSU and cable the drives directly to the motherboard... and test from there using preclears while adding the hot-swap bays back one by one.

 

Just picked up a Corsair HX750i. Once I get it here and installed, I'll run the tests on a spare disk I have, THEN try to save the drives and re-add them.

 

This is so much fun I can't wait to get started (insert sarcasm here)

 

Thank you guys so much I'll keep updating as this progresses

Link to comment
  • 2 weeks later...

Update:

 

1. New PSU is installed.

2. All HDs removed from the hot-swap bays and connected directly to the motherboard.

3. Installed the Unassigned Devices plugin and verified the data is good on disks 5 and 6.

4. Shrunk the array to 4 known 'good' disks.

5. Rebuilt parity.

6. Precleared a spare WD Red 3TB.

7. Added the new drive to the array and migrated disk 5's data to it via the Unassigned Devices plugin.

8. Using the old disk 5 as my spare, repeated steps 6 and 7 for disk 6.

 

Now I'm back in business with the old disk 6 sitting in standby as my spare.

 

One question: I'd like to test the hot-swap cages for bad power and SATA links. What's the best way to go about this? Would a 2-pass preclear on a 120GB drive in each bay be enough to test these cages? The cages are both Norco SS-500.

 

@trurl @itimpi thanks so much guys, you helped me save over 5k family pictures and home movies

 

Link to comment
