
[SOLVED] 1 red ball + 1 disk missing - probably only 1 drive damaged



Hi guys,

 

I have 1 disk with a red ball + 1 disk missing in my UnRAID. This is the first time this has happened to me (I have recovered other disks in the past when only 1 disk had a red ball), and I can't find similar cases in the forum, so I am posting a new topic here hoping someone will be able to help me out.

 

I am using 5.0 Final Pro and my setup includes 17 disks (1 parity, 15 data and 1 cache). A month ago I started considering adding more drives and figured it would be prudent to replace my old PSU (Thermaltake 500W, not single-rail) with a better one, so I got a Corsair CX600.

 

Right before I replaced the PSU, disk15 started giving problems: UnRAID was showing read errors, but the drive was never marked as disabled (red ball). A SMART report was showing more than 600 "Reallocated_Sector_Ct" plus nonzero Current_Pending_Sector and Offline_Uncorrectable values, but the system was still working fine without any red balls (please don't ask me why I didn't replace this disk earlier).
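
For anyone reading along, here is a minimal sketch of how one might flag those three attributes in `smartctl -A` style output. The sample rows are invented for illustration (they are not taken from the attached reports), and the function name `flag_attributes` is hypothetical.

```python
# Illustrative only: these rows are made up, not copied from the poster's disks.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   085   085   036    Pre-fail  Always       -       612
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       26500
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""

# Attributes the post calls out as warning signs.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def flag_attributes(report: str) -> dict:
    """Return the watched attributes whose raw value is nonzero."""
    flagged = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


if __name__ == "__main__":
    for name, raw in flag_attributes(SAMPLE).items():
        print(f"{name}: raw value {raw}")
```

A nonzero raw value here does not prove the disk is dying, but a count that keeps growing is usually a reason to plan a replacement before a second disk has trouble.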

 

Today, disk13 showed up with a red ball, but after running SMART everything seems to be OK with the disk (no serious errors besides 3 Command_Timeout). At this point, before taking further action, I decided to replace the PSU (I thought disk13 could have been suffering because of my previous PSU).

 

Right now (with the new PSU in place), disk15 is not even showing up in the UnRAID interface (it says "disk missing") and I have no idea how to proceed.

 

I have the impression that disk13 is just fine and that I could replace/rebuild disk15 but since I now have disk13 disabled and disk15 missing I don't know what to do.

 

Any idea how I should proceed here? Is my theory workable, i.e. "trust my array" and rebuild disk15 from my other drives, including disk13, which seems to be OK? If not, what can I do?

 

Together with the new PSU I also got an AOC-SASLP-MV8 to replace my old PCI SATA card. I also have a spare Seagate 2TB drive.

 

Here's the link to my current syslog:

https://drive.google.com/file/d/0B0mYGQLWndqzRnF6U1lNZEVxSVk/edit?usp=sharing

 

Here's the one from earlier today (when disk15 was still showing up):

https://drive.google.com/file/d/0B0mYGQLWndqzRUpEQmNUX0dON0U/edit?usp=sharing

 

from yesterday:

https://drive.google.com/file/d/0B0mYGQLWndqzTHhscDVsTXNsbkU/edit?usp=sharing

 

and finally from 2 days ago (when there were no red balls or missing disks):

https://drive.google.com/file/d/0B0mYGQLWndqzYUlwaUtvUVZZVUk/edit?usp=sharing

 

SMART from disk13 also attached.

Link to comment

Find out why disk 15 is missing!

Check if disk 15 is running.

Check the cabling.

 

I am just about to finish preclearing a new disk that's connected to the same cables/ports disk15 was using...

 

Maybe my post was too big and confusing, just to recap:

 

1) disk13 is disabled, but that was probably due to a cable or PSU problem. Its SMART report (see the attached excerpt from the syslog) looks OK and I am 100% sure I wasn't writing to the disk when it failed (I only use disk shares instead of user shares when writing).

 

2) disk15 is completely unusable; the BIOS doesn't even recognize it anymore. Its SMART report looks horrible (see the attached excerpt from the syslog).

 

So disk13 is good but disabled and disk13 is damaged and missing (here's a screenshot of my Main page)

https://docs.google.com/file/d/0B0mYGQLWndqzU1JleHg2MDZZTW8/edit?usp=drivesdk

 

What if I use the 'New Config' button and reassign all drives to their proper positions (especially parity), replacing the damaged disk15 with the new drive I am about to finish preclearing? Would that work? Would disk15 be reconstructed while disk13 stays intact, so that I have no data loss?

disk13-smart-report.txt

disk15-smart-report.txt

Link to comment

In order to reconstruct 15 you need to cause it to fail. This can be done either by using the old drive in the new config and then changing to the new drive or starting the array with the disk unassigned and then assigning it. After a new config is started, with the "parity is correct" checkbox selected, I think a parity check may commence. You'll need to stop this check as quickly as possible and then cause disk 15 to fail. After the disk is rebuilt do a parity check. There may be a few errors from disk 13. It was disabled because a write to it failed. Likely a file system metadata write. Run a second parity check and no errors should remain.
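
To make the reasoning concrete, here is a toy sketch of single-parity reconstruction, assuming plain XOR parity across the data disks. The disk contents and the helper name `xor_blocks` are made up for illustration; this is not unRAID's actual code. It shows that one missing disk can be recomputed from parity plus every other disk, and that a stale block on a surviving disk carries straight into the rebuilt data.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy array: three data "disks" holding one 4-byte block each (invented values).
disks = {
    "disk13": b"\x10\x20\x30\x40",
    "disk14": b"\x01\x02\x03\x04",
    "disk15": b"\xaa\xbb\xcc\xdd",
}
parity = xor_blocks(disks.values())

# Case 1: disk15 is gone, every other disk matches what parity was computed from.
surviving = [disks["disk13"], disks["disk14"]]
rebuilt = xor_blocks(surviving + [parity])

# Case 2: disk13's first byte on the platter is stale (a write never landed),
# but it is treated as valid during the rebuild.
stale_disk13 = b"\x11\x20\x30\x40"
rebuilt_stale = xor_blocks([stale_disk13, disks["disk14"], parity])

if __name__ == "__main__":
    print("clean rebuild matches:", rebuilt == disks["disk15"])
    print("stale rebuild matches:", rebuilt_stale == disks["disk15"])
```

In the second case only the byte that lines up with the stale data comes back wrong, which is why a file system check on the rebuilt disk is worth running afterwards.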

 

Finally, run reiserfsck on the disks to check the file systems. See my sig.

 

Link to comment

If you set a new disk configuration AND allow ANY parity calc to start, you lose the ability to reconstruct ANY drive.

 

Be very careful. You have two failed disks (as far as unRAID is concerned), and under 5.0 there is no well-documented way to recover by forcing one of the disks to be considered valid.

 

Ask for guidance from lime-technology before proceeding with ANY new-config followed immediately by reconstruction of a failed drive. Do not be fooled into thinking you can easily recover using parity. I do not know how to accomplish that under 5.0. Do not just listen to forum users. I doubt any of us can properly guide you under 5.0 with multiple failed disks.

 

  Send e-mail to [email protected] for guidance.

 

 

Link to comment
So disk13 is good but disabled and disk15 is damaged and missing (here's a screenshot of my Main page)

You have a typo at "disk13 is damaged".

 

So, in principle you "only" need to mark disk 13 as good and do a rebuild on the former disk15.

At the moment you have only 1 loss which should be recoverable with parity.

The only thing is: how to mark disk13 as good?

Let LT guide you through this!  ;)

 

Edit:

Just curious:

Wouldn't a backup of the config be useful in this case?

What file would be necessary? Where is the array setup stored? super.dat?

I imagine you restore the file(s) and start the array. It should detect the failed disk15 and there you go?

 

Link to comment

OK, so my problem was solved with direct help from Tom/LimeTech.

 

I am updating this so other users know what happened and how it was solved; however, I can't vouch for this being a solution for others. It worked for me under direct supervision from LimeTech.

 

Here are the steps:

 

0. Disable all add-ons (I renamed the plugins/extra folders and removed the extra lines from the go script).

1. Install a new disk that will replace the old disk15.  Don’t do anything to the config, just physically hook up the disk.

2. Go to Utils page and click “New Config”.  Check the checkbox and click Apply.

3. Go to the Main page and assign all your disks again.  Refer to your screenshots (https://docs.google.com/file/d/0B0mYGQLWndqzR29HVFFWMVFYWFU/edit?usp=drivesdk) and make sure you have it right.  When you get to disk15, you should be assigning the new disk.

4. From a telnet session or the console, type this command:

 

mdcmd set invalidslot 15

 

5. Important: after typing the above command, do NOT do anything else in the webGui except this: on the Main page, check the ‘Maintenance mode’ checkbox and then click ‘Start’ (i.e., do not click refresh or reload the browser, just check the checkbox and click Start).

 

You should see all disks ‘green’ except disk15 which should be ‘orange’ and you should see a data rebuild taking place.
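
Step 3 is where a slip hurts most, since every disk has to go back into the slot it held before. A minimal sketch of a sanity check one could run on paper or in code before clicking Start, assuming you wrote down a serial number per slot from your screenshot. The serials and the function name `check_assignments` are hypothetical, and nothing here talks to unRAID itself.

```python
def check_assignments(recorded, planned, replaced_slot):
    """Compare a planned slot layout with the recorded one.

    Returns a list of problems; an empty list means the plan matches,
    apart from the one slot that is supposed to get a new disk.
    """
    problems = []
    for slot, serial in recorded.items():
        new_serial = planned.get(slot)
        if new_serial is None:
            problems.append(f"{slot}: nothing assigned")
        elif slot == replaced_slot:
            if new_serial == serial:
                problems.append(f"{slot}: still the old disk, expected the replacement")
        elif new_serial != serial:
            problems.append(f"{slot}: expected {serial}, got {new_serial}")
    for slot in sorted(planned.keys() - recorded.keys()):
        problems.append(f"{slot}: not in the recorded layout")
    return problems


# Hypothetical serial numbers, shortened to three slots for the example.
recorded = {"parity": "SER-P", "disk13": "SER-13", "disk15": "SER-15-OLD"}
good_plan = {"parity": "SER-P", "disk13": "SER-13", "disk15": "SER-15-NEW"}
bad_plan = {"parity": "SER-13", "disk13": "SER-P", "disk15": "SER-15-NEW"}

if __name__ == "__main__":
    print(check_assignments(recorded, good_plan, "disk15"))
    print(check_assignments(recorded, bad_plan, "disk15"))
```

The second plan swaps parity and disk13, which is exactly the kind of mistake the screenshot comparison is meant to catch.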

 

Here's my Main page during the rebuild: https://docs.google.com/file/d/0B0mYGQLWndqzWXBBbXUwcnRMN0k/edit?usp=drivesdk

 

Once it was completed I had some filesystem corruption that was fixed by following dgaschk's suggestions. Big thanks!

 

Thanks to Tom for his quick response and all the help.

 

I am now going to install my new SASLP-MV8  :) Go UnRAID!!!!

Link to comment

Archived

This topic is now archived and is closed to further replies.
