Recovering failed and failing drives


Hi guys,

 

Wondered if you'd be able to help me?

 

I've been running Unraid a few years now and been nothing but impressed with how easy it is to use and maintain. I've had old drives die and replaced them very easily in the past but have a slightly odd situation now.

 

I noticed a Docker app struggling and discovered in the logs that the root cause was a failing disk. Unfortunately I had not noticed the failed SMART status on the Dashboard. (I'll be setting up alerts from now on!)

 

I was about to shut the server down to replace the disk when I noticed another disk showing the orange triangle, saying the contents are emulated. I'm 99% sure this is from moving the server earlier in the day and a cable coming loose. I've read the instructions for re-enabling the drive on the assumption that parity is OK, but decided to rebuild the disk first just to be safe. (Its SMART status is OK.)

 

The problem is that the rebuild was estimated at 17 hrs but has taken 2 days 12 hrs so far, as the speed keeps dropping to 40 KB/sec. I expect this is due to the genuinely failed disk, as errors are still populating the logs from failed reads/writes.

 

Can anyone suggest what I should do? Should I just be patient and wait a week for the rebuild to finish, or is there something else I can do to move past this and replace the failed disk?

 

I've tried to export my diagnostics to help but it doesn't complete. 

 

Thanks in advance for any guidance you can offer!

Link to comment

Possibly the rebuild will be bad anyway even if you let it complete, due to read errors on the other disk. Probably was a bad idea to try to rebuild a disk with another known bad disk in the array.

 

We really need the diagnostics, or at least the SMART for all disks and syslog. Diagnostics is the simple way to get those but if you can't get diagnostics then try to get syslog and SMART for each disk.
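If the diagnostics export will not complete, the same pieces can be gathered by hand. This is only a rough sketch: it assumes `smartctl` (from smartmontools) is on the PATH and that the live syslog is at `/var/log/syslog`; the function name, the output directory, and the device names in the usage line are placeholders, not values from this thread.

```shell
# Sketch only: save one SMART report per drive plus the current syslog.
collect_reports() {
    outdir="$1"; shift
    mkdir -p "$outdir" || return 1
    for dev in "$@"; do
        # smartctl -a prints identity, health status, attributes and error logs.
        # Its exit status is non-zero for unhealthy disks, so it is ignored here.
        smartctl -a "$dev" > "$outdir/smart-$(basename "$dev").txt" 2>&1 || true
    done
    # Assumed syslog location; skipped quietly if it is not readable there.
    if [ -r /var/log/syslog ]; then
        cp /var/log/syslog "$outdir/syslog.txt"
    fi
}

# Hypothetical usage (replace with your own devices and destination):
# collect_reports /boot/manual-diag /dev/sdb /dev/sdc /dev/sdd
```

The resulting text files can then be zipped and attached to a post in place of the diagnostics archive.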

 

Link to comment

It would have been best to see the diags from before the reboot, but going by your description disk4 was likely corrupted by trying to rebuild it while disk1 was failing. The best bet for recovery would have been to re-enable disk4 and then replace disk1. Now there aren't many good options: you can still try re-enabling disk4 to replace disk1, but there will likely be corrupted data on both. The other option is to use ddrescue on disk1, but that will also likely result in some corrupted data on both disks.
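To see why a rebuild goes wrong when another disk returns bad data, here is a toy model. It assumes a single parity disk computed as the XOR of the data disks (the thread does not say how many parity disks are in use), and it shrinks each disk to one made-up byte. All values are invented for illustration.

```shell
# Toy model: one byte per disk, single parity = XOR of all data disks.
d1=$((0x5A)); d2=$((0x3C)); d3=$((0xF0)); d4=$((0x0F))
parity=$(( d1 ^ d2 ^ d3 ^ d4 ))

# Rebuilding disk4 needs a correct read from parity and every other data disk.
rebuilt_ok=$(( parity ^ d1 ^ d2 ^ d3 ))

# Now pretend the failing disk hands back a wrong byte for d1 (hypothetical value).
d1_bad=$((0x5B))
rebuilt_bad=$(( parity ^ d1_bad ^ d2 ^ d3 ))

printf 'original disk4 byte:  0x%02X\n' "$d4"
printf 'rebuilt, clean reads: 0x%02X\n' "$rebuilt_ok"
printf 'rebuilt, bad read:    0x%02X\n' "$rebuilt_bad"
```

With clean reads the rebuilt byte matches the original; with one wrong input it does not, and nothing in the rebuilt disk marks which bytes are affected.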

Link to comment
10 hours ago, Baph0metal said:

Thank you. I've attached them. They are after cancelling the rebuild, turning off auto start, powering down, checking cables and powering up - array not started. 

30 minutes ago, Baph0metal said:

I'm pretty confident Disk 4 was the result of a loose cable when I physically moved the server. Is it worth me copying as much as I can from Disk 4 to a new disk then using another new drive to replace Disk 2 that the SMART checks deem as faulty?

 

If you start the array I assume it will begin rebuilding disk4 again. Post a screenshot of Main - Array Operations.

Link to comment
3 hours ago, johnnie.black said:

you have two failing disks

Might be useful to consider how you got into this state.

 

Do you have Notifications set up to alert you immediately by email or other agent when a problem is detected? You must address a problem immediately so you don't wind up with multiple problems you can't recover from.

3 hours ago, johnnie.black said:

I would try using ddrescue on disks 1 and 2 and copying everything you can from disk4.

So your array is pretty much gone except for disk3 and all you can hope for is to rescue / copy files from the other disks to somewhere else.

 

Do you have backups of anything important and irreplaceable? Do you have a backup plan? Many of us can't afford to back up everything from our large-capacity servers, but you must always have another copy of anything important and irreplaceable, preferably on another system.

Link to comment

Thanks for all your help so far, so very much appreciated! Yes, definitely a lesson learned here, I expected to notice the drives failing far more obviously before it was too late. I do have cloud backups of most of the important data using Code42 but would still prefer to salvage as much as I can here :/

 

I've installed another drive and copied Disk4 entirely using ddrescue (sudo ddrescue -f -r3 /dev/sdg /dev/sdi logfile), which completed without any errors.
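The third argument in that command (`logfile`) is the file where ddrescue records the status of each block, so it can be checked for bad areas after the run. The sketch below assumes the layout described in the GNU ddrescue manual: comment lines starting with `#`, one current-status line, then `pos size status` lines where `+` means rescued and `-` means bad sector. The function name is made up for this example.

```shell
# Sketch: total up the bytes per status in a ddrescue mapfile/logfile.
summarize_mapfile() {
    finished=0; bad=0; other=0; seen_status=0
    while read -r pos size status _rest || [ -n "$pos" ]; do
        case "$pos" in
            ''|'#'*) continue ;;          # skip blank lines and comments
        esac
        if [ "$seen_status" -eq 0 ]; then
            seen_status=1                 # first data line is the current-status line
            continue
        fi
        case "$status" in
            '+') finished=$(( finished + $size )) ;;   # rescued
            '-') bad=$(( bad + $size )) ;;             # bad sectors
            *)   other=$(( other + $size )) ;;         # not tried / trimmed / scraped yet
        esac
    done < "$1"
    echo "finished=$finished bad=$bad other=$other"
}

# Usage: summarize_mapfile logfile
```

If `bad` and `other` both come out as 0, every block of the source was read successfully. That only says the disk was readable, not that what was on it was correct.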

 

What would the next steps be? Until I receive a new disk, is it worth adding the old disk4 back into the array, bypassing the parity rebuild, so I can focus on disk2, which is the one with the read/write errors in the log?

Link to comment
21 hours ago, trurl said:

I am not sure I understand. What would be the purpose of adding the old disk4, which you have already copied, back to the array? In a certain sense (no valid parity) you don't really have an array, just a bunch of separate disks.

 

Sorry, I used ddrescue to copy to a new disk that I mounted using the Unassigned Devices plugin. It copied without any errors so I'm assuming the source disk4 is actually OK and could be added back to the array? I know ideally I'd just keep the newer disk but I'm thinking about time to get the system back to as close to original state as possible and avoid another £200 on another new drive. (I'll pay it if I need to)

 

12 hours ago, johnnie.black said:

You can use ddrescue on disk2 and disk1.

 

Currently using ddrescue on disk2, the really damaged one, to another new drive.

 

You're right that at this point I just have a mess of disks, some on the array, some not. What would be the steps to get back to a clean state? The really valuable data on the array is a small proportion of the capacity.  

Link to comment
I don't think there is any reason to assume parity would be valid whatever combination of disks you try to start with, so ultimately you would use New Config to create a new array with whichever disks you want and let it rebuild parity. You could include the UD rescue copy of disk2 in the new array. Similarly for the rescue destination of disk1. If disk4 is truly OK, and the UD rescue copy of disk4 is truly the same, then it wouldn't matter which of these disks you include in the new array.

 

Not sure why you would think the 2 different "disk4" are the same though. The original went through a bad rebuild. The rescue tried to get what it could from that bad rebuild. I would be surprised if they are identical at the bit level.
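One way to settle that rather than guess is a byte-for-byte comparison with `cmp`. A minimal sketch follows; the function name is made up, the device paths in the usage line are placeholders, and both disks should be unmounted (or mounted read-only) while it runs. If the two devices are different sizes, `cmp` will report a difference at the end of the shorter one even when the common part matches.

```shell
# Sketch: report whether two files or block devices hold identical bytes.
compare_images() {
    # cmp -s prints nothing; exit status 0 = identical, 1 = differ, >1 = read trouble
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "differ or unreadable"
    fi
}

# Hypothetical usage (replace with the original disk4 and its rescue copy):
# compare_images /dev/sdX /dev/sdY
```

On large drives this reads both disks end to end, so expect it to take hours.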

 

If the original disk itself has no SMART issues then maybe it would make sense to use it as the destination for rescuing disk1.

Link to comment
