[SOLVED] Two drives dropped from six-drive array



Following on from this thread so as not to cause confusion :- http://lime-technology.com/forum/index.php?topic=33710.15

 

One of my drives is not being detected at all and another has come up as unformatted. I'm currently moving files off a spare 2TB drive so it can replace the undetected one (hoping it's not a dead port; I know it's not a cable issue), but I'm not sure how to proceed after that. I know I will need to preclear any replacement disk(s), but then what? I'm still on 4.7 Plus, and there doesn't seem to be a sub-forum for that version.

Link to comment

Can do, just waiting for a surface scan on the undetected drive to complete so I can pop it back into the UnRAID box. I have popped it into my Windows PC temporarily and it is visible, although obviously it has an unrecognised partition. SMART data does show it as having 4 spin retries but no reallocated sectors.

 

I currently have a different drive plugged into the same slot in my UnRaid box and it was picked up but I have not tried to preclear or assign it.

 

ETA: I have re-installed the undetected drive into the Tardis and it still won't detect.  The BIOS sits on AHCI port 3 for ages then says device undetected.  If I look at the device list in the UnRAID GUI Disk 3 is listed as "no device" and the drop down box only has "unassigned". 

 

Disk 4 is also still coming up as unformatted and there are errors showing on the main status window.  I can also hear a lot of clicking when I start the array.

 

I have attached a Syslog dump and a copy of the SMART Status Report for Disk 4 (can't get one for Disk 3 since it won't detect).

J3DErrors.txt

syslog-2014-06-14.txt

Link to comment

Sorry to be pushy but can anyone help?  I need to get the failed drives back to the vendor for RMA tomorrow as they are technically out of warranty but only by 2 weeks so they have agreed to try for me.  In the meantime should I preclear my only suitable spare drive and attempt to rebuild the undetected drive? 

 

Then there is the issue of the other drive which has come up as unformatted. Although a reformat might fix it, there seem to be lots of read errors in the Syslog dump, so I'm reluctant to return it to service.

 

If I go and buy another drive for this 2nd replacement I may as well buy a larger drive, which means replacing the Parity drive first. That seems a bit risky given the current state of the system. I could use the swap-disable procedure to replace the parity drive with a new larger disk, I guess, but logic tells me I have to at least get the 1st undetected drive replaced first.

 

I have consulted the Wiki but I can't seem to find a procedure for when 2 drives fail at the same time.

 

UPDATE:  This just officially got ugly.  I plugged my spare drive into the same port as the undetected drive and it's not being detected either.  I thought it was being detected as it didn't hang up on that port during the BIOS checks.  Looking at the Device list the only options for that port are "no device" and "unassigned".  Have tried doing a "preclear_disk.sh -l" command from the console and it reports "no un-assigned disks detected". 

 

My UnRAID box is an HP Microserver and the affected port is on a backplane connected to the mobo by a mini SAS cable, so I can't swap out the cable. The other ports are still working so I doubt the cable is the issue here.

 

 

Link to comment

While you doubt the cable, it would still be worth swapping the SATA connection between the faulty drive and another good one - see if the same drive has issues or the new one does.

 

As to your other questions... as you are likely aware, if you have 2 dead drives you are toast as far as recovering data from either. If you swap SATA cables around and still have the same disks reporting issues I don't know if there is much you can do at that point. You can then go buy a nice big drive, set it as parity and install 5.0.5 and move on. However, I would test every combo of ports on the drives to make 100% sure before you go that route.

 

Is the clicking happening with only the unformatted disk in there, or is it when both drives with issues are in? Clicking is never a good sign, and likely indicates a physical issue within the disk, but from your messages I am not sure which disk is causing that for you.

Link to comment

"While you doubt the cable, it would still be worth swapping the SATA connection between the faulty drive and another good one - see if the same drive has issues or the new one does."

It's been suggested to me that the molex power connectors to the backplane could be the issue. I'm about to pull the box apart and re-do all the connections so wish me luck.

"As to your other questions... as you are likely aware, if you have 2 dead drives you are toast as far as recovering data from either. If you swap SATA cables around and still have the same disks reporting issues I don't know if there is much you can do at that point."

Not what I wanted to hear. Hopefully the undetected one is just a dud connection and the drive itself is OK.

"You can then go buy a nice big drive, set it as parity and install 5.0.5 and move on."

Does that migration require a new UnRAID licence key?

"Is the clicking happening with only the unformatted disk in there, or is it when both drives with issues are in?"

Both, and I cannot pin it down to which one, but I will try removing just one of the drives and see what happens.

Link to comment

You need to try to get one of the two up if you want to be able to recover any data. That said, have you checked all ports on the motherboard? Can you add a SATA PCI or PCIe card? Have you checked power connections? Can you try going direct to the motherboard with no backplane?

 

 

Thornwood

Link to comment

I'm still in the process of pulling it all apart Thornwood, it's all packed in pretty tight. I have all the boards and drives extracted from the casing and have blasted all the dust out with my air compressor (water trap equipped). I just have to work out how to detach the backplane to check its power connection now.

Before disassembling the box I did try the undetected drive in another slot on the backplane and it was detected, but obviously the array would not start due to the invalid config caused by rearranging the drives.

I only have 1 SATA port on the mainboard, everything else is hanging off a PCIe card or the backplane.

Link to comment

Well the box has been re-assembled after a thorough cleaning and check of all power and drive connections.  I am happy to report that all my disks are again visible via the GUI.  I don't know if a loose connection or dust across a circuit was the issue but I cleaned all the dust out and pulled every power and data connection apart as part of the troubleshooting process.

 

Issue 1:  The disk that wasn't being detected is now blue balled.  Is it simply a matter of ticking the "I'm sure I want to do this" box and then the Start button to initiate the rebuild?  The disk itself appears to be OK, I have attached the SMART values (file labelled 71DSmartData).

 

Issue 2: The disk that was coming up as unformatted is now showing as normal, although I have yet to start the array. I still have some concerns about this drive because there are errors being reported in the SMART logs; I have attached the file (labelled J3DErrors).

 

71DSmartData15062014.txt

J3DErrors15062014.txt

Link to comment

Update: After consulting the Wiki I decided to proceed with a restart of the array. The previously undetected disk is currently being rebuilt; however, the other disk that was coming up as unformatted has done that again (I guess the array has to be started for the format status to show up). Should I stop the rebuild process or just let it run and then deal with the unformatted disk?

Link to comment

If I have interpreted your smartctl reports correctly, then disk4 is reporting lots of pending sectors. This is not always a fatal error, as such errors can clear themselves the next time such a sector is written, but it does indicate that reads of that disk might be unreliable. If that is the case, the rebuild of the other disk might also be suspect, as one of the requirements for a completely successful rebuild is that all the other disks + parity can be read reliably.
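
To illustrate why that matters, here is a minimal Python sketch of single-parity reconstruction, assuming parity is a plain byte-wise XOR across the data disks. The disk contents and the `xor_blocks` helper are made up for illustration and are not unRAID code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Toy array: three data disks, one small block each (contents are invented).
disk1 = b"\x10\x20\x30\x40"
disk3 = b"\xaa\xbb\xcc\xdd"  # stands in for the disk being rebuilt
disk4 = b"\x01\x02\x03\x04"  # stands in for the disk with read problems
parity = xor_blocks([disk1, disk3, disk4])

# Clean rebuild: parity and every surviving disk read back correctly.
rebuilt_ok = xor_blocks([parity, disk1, disk4])

# Same rebuild, but one byte of disk4 comes back wrong during the read.
disk4_bad_read = b"\x01\xff\x03\x04"
rebuilt_bad = xor_blocks([parity, disk1, disk4_bad_read])

print(rebuilt_ok == disk3)   # clean reads reproduce the missing disk
print(rebuilt_bad == disk3)  # a bad read elsewhere corrupts the rebuilt disk
```

In this toy case the clean rebuild returns the original contents of the missing disk, while the single bad byte read from disk4 turns into a wrong byte at the same offset on the rebuilt disk.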

 

On that basis it is difficult to recommend whether to let the rebuild complete or not.  It really depends on how critical the data on that disk is or whether it would not be too much of an issue to lose it.

 

Trying to recover data off the disk with sectors pending reallocation is a different issue. You can try to recover it by using reiserfsck on the unRAID system, or by plugging it into a Windows PC and using a Linux reader. Not sure if this latter approach will work or not if the drive is showing as unformatted in unRAID - it depends on how good the Windows software is at handling potential corruption of the ReiserFS file system.

Link to comment

I think Disk4 is rather unhealthy, it is now showing in excess of 15,000 errors in the main status window.

 

The server houses our media collection but I can't be sure which parts of it are on each individual drive as I haven't forced the shares to use particular drives. I can't see the 2 problem drives as individual disks under Windows currently but I can see the overall share structure. While it wouldn't be the end of the world to lose the data on those 2 disks, a lot of time and effort has been spent on creating their contents, and I would rather not have to break the bad news to my other half.

 

Can reiserfsck be run on a drive while the others are online?  Have never used it before so don't know how it works.

Link to comment

Your stated goal of having the problem disks returned tomorrow precludes taking the time to do a proper recovery attempt. I suspect if you return either of the drives, you will lose the ability to recover. At this point I think you need to make a decision, either return the drives tomorrow or work on getting the data back and postpone the drive return for as long as it takes. You need 2 blank tested good drives to make the best effort at getting your data back. How much is your data (time) worth?

 

As far as moving from 4.7 to 5.0.5, you do not need a new key for that, your same key with the license file will work just fine.

 

A drive that shows as unformatted just means that it isn't currently mounted, it could even be just a matter of time before it shows up. I have had drives take more than half an hour to fully mount after a bad crash, and since your array isn't healthy it could take even longer. Capture a full syslog from the current session and post it.

Link to comment

"Your stated goal of having the problem disks returned tomorrow precludes taking the time to do a proper recovery attempt."

This deadline was only because the drives are both just out of warranty but the seller has agreed to submit them for RMA for me. They said they need them brought in fairly soon so as to make their best case to the distributor.

"You need 2 blank tested good drives to make the best effort at getting your data back. How much is your data (time) worth?"

I have 1 drive I can use here (not part of the current array) but it requires preclear/formatting. Should I stop the current rebuild and swap the one throwing all the read errors out? The data isn't precious family memories and could be retrieved from other sources if needed, but it would be messy and, yes, time consuming.

"As far as moving from 4.7 to 5.0.5, you do not need a new key for that, your same key with the license file will work just fine."

OK thanks for confirming, think I should concentrate on restoring the current setup first.

"A drive that shows as unformatted just means that it isn't currently mounted, it could even be just a matter of time before it shows up."

It's pretty plain from the logs that it's having issues. I have attached a zipped copy for you, it's not pretty. It's just an extract as the full dump was too large to attach even zipped. Console is showing over 232,000 errors on that drive now, and the logs list lots of "handle stripe read errors".

 

Have also attached the current SMART status for the drive throwing read errors (file labelled J3DSmartData); it's reporting imminent failure.

Have also attached the current SMART status for the drive listed as currently rebuilding (file labelled 71DSmartData); it seems to be OK. I did notice that the Spin Retry Count has reset to 0 for this drive; yesterday it was reporting 4 (possibly power related?).
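
For anyone wanting to compare two saved reports in the same way, a rough Python sketch follows. It assumes the attribute table layout printed by `smartctl -A` (ID in the first column, attribute name in the second, raw value in the tenth). The two sample tables are invented for illustration and are not taken from the attached files, and the `raw_values` and `changed` helpers are hypothetical names.

```python
def raw_values(report):
    """Map attribute name -> raw value for each row of a smartctl -A style table."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns;
        # rows whose raw value is not a plain integer are skipped.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def changed(before, after):
    """Return {name: (old, new)} for attributes whose raw value differs."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }


HEADER = (
    "ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH "
    "TYPE      UPDATED  WHEN_FAILED RAW_VALUE\n"
)

# Invented sample rows, not the contents of the real attachments.
YESTERDAY = HEADER + (
    "  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0\n"
    " 10 Spin_Retry_Count        0x0013   100   100   067    Pre-fail  Always       -       4\n"
)
TODAY = HEADER + (
    "  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0\n"
    " 10 Spin_Retry_Count        0x0013   100   100   067    Pre-fail  Always       -       0\n"
)

diff = changed(raw_values(YESTERDAY), raw_values(TODAY))
print(diff)
```

With these sample tables only the Spin_Retry_Count row is reported, going from 4 to 0. Real reports would be read from the saved text files instead of the inline strings.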

syslog-extract-2014-06-16.zip

J3DSmartData16062014.txt

71DSmartData16062014.txt

Link to comment

I really think I need to kill this rebuild now.  The drive being rebuilt appears to be OK but the 2nd unformatted drive is just drowning in read errors now.

 

I'm a bit confused since the drive showing all the read errors is not the one currently marked as being rebuilt.  Is it because the drive is being read as part of the rebuild process on the other drive? 

 

If I kill the current rebuild and replace only the drive throwing read errors will the rebuild process on the other drive kick off again upon a restart of the array?  I realise I need to preclear any replacement drive first, all of which is going to take some time.

 

Have attached another SMART report on the failing drive, I don't think it's long for this Earth.

J3DSmartDataV216062014.txt

Link to comment

I could be wrong, but I'd be careful trying to rebuild the drive with parity since you have 2 drives with issues. As you know, I am currently going through a similar situation. I ended up removing the drives and rebuilding parity completely. I'm currently in the process of copying the data on those drives back to the array.

So far so good.

Link to comment

Update: I was able to successfully recover all the files (at least I think it was all of them) from one of the drives using DiskInternals Linux Reader, and have saved those files to an external drive for now. The other drive was toast; it looks like the motor may have been on the way out. Both drives have now been sent in for RMA. I am going to update to 5.0.5 using the remaining good drives plus 2 new larger replacements for the failed drives. Will update here on my success or otherwise with that process.

Link to comment
