
Drive Failure During Data Rebuild -- What Next?



Hello. I was replacing D11 on my HOME server, and during the data rebuild / parity sync a different drive (D6), which had shown no issues before, decided to take a crap. Now at the end, D6 shows as disabled with contents emulated, but there's no actual data visible (just a few empty folders) when opening either D6 or the new D11 through my Windows desktop, even though the Unraid Main page shows 5+ TB used on both. My hope is that there is still data on both drives and that maybe D6 just had a SATA or power connector come loose early in the op? It was a green ball like all the rest when I started, and when the errors started piling up I did notice there was no temperature reading for the drive, even though it stayed green until the end.

Anyway, what to do next? Everything is backed up, though some of what was on D6 is offsite and would be a little harder to recover. I expect I'll have to do a New Config to completely rebuild parity once I assess what's really still there, but how do I even find out what's actually on the drives now without risking (further) data loss? Diagnostics attached; any help appreciated.

home-diagnostics-20231209-1109.zip


It looks as if disk6 dropped offline, which is why there is no SMART information for the drive. It is possible that the drive would come back online if the server is power-cycled. If that happens, it is worth posting new diagnostics so we can get an idea of disk6's health. You might want to check the power and SATA cabling to the drive, as cabling issues are far more common than actual drive failures. You could also run an extended SMART test on disk6, as that is a good test of disk health.
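For reference, a minimal sketch of what that looks like from the console, assuming the drive comes back as /dev/sdX (hypothetical; check the Main page or the diagnostics for the actual device letter):

# kick off an extended (long) self-test; it runs in the drive's firmware and can take many hours
smartctl -t long /dev/sdX

# once it finishes, review the overall SMART report and the self-test log
smartctl -a /dev/sdX
smartctl -l selftest /dev/sdX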

 

If this happened while rebuilding disk11, then the rebuild of disk11 will not have been fully successful and is likely to have some data corruption.


Thanks. Okay, I powered down, checked all connections, and when I booted up I had D5 showing as missing as well as D6 (they sit next to each other and share the same SATA power connector via a Y-splitter). So I powered down again, and this time I took both drives out and cleaned all the contact points on all the cables and the drives themselves with an eraser + rubbing alcohol (though nothing really appeared to be dirty). Now both drives are there again, but D6 is still showing as disabled, and an extended SMART test ended with "Completed: Read Failure". New diagnostics and a SMART report for D6 are attached.

So what now? On Main it looks like I can start the array either with D6 disabled or changed to "no device" without starting a new parity operation either way. But is there any risk to that vs. assuming D6 is not going to have any data emulated that can still be backed up and just replacing it right away? And if I do replace it, would it be better to do it as a New Config to completely re-write parity from scratch? Or does doing it as a regular drive replacement / data rebuild (even though there's probably nothing there) do exactly the same thing?

Not being 100% confident now in the physical connections to the drives, I'm most concerned about the possibility of another drive falling off during a new parity operation and suffering even further data loss. But I suppose there'd be the same risk of that either way? Even though D6 is in error now, I can't help thinking there was nothing actually wrong with it until the last parity op ran through with the bad connection, and as much as I can, I just want to make sure it doesn't happen again.

home-diagnostics-20231210-1229.zip home D6-smart-20231210-1229.zip


In either case, when you start the array in normal mode, disk6 should be emulated and you can see if the emulated disk6 mounts OK; if so, you can rebuild its contents to another drive. You might want to post new diagnostics after starting the array in normal mode.
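For reference, a quick way to check this from the console once the array is started in normal mode, using the standard Unraid per-disk mount point (adjust the disk number as needed):

# if the emulated disk6 mounted, it shows up here with its used space
df -h /mnt/disk6

# spot-check that the expected top-level folders are present
ls /mnt/disk6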

 

The fact that the current disk6 failed the extended SMART test means it should be replaced.

4 hours ago, itimpi said:

In either case, when you start the array in normal mode, disk6 should be emulated and you can see if the emulated disk6 mounts OK; if so, you can rebuild its contents to another drive. You might want to post new diagnostics after starting the array in normal mode.

 

The fact that the current disk6 failed the extended SMART test means it should be replaced.

Started the array with D6 enabled and it shows on the Main page as "Unmountable: Unsupported or no file system", and there is no disk6 visible at all when I try to open any emulated data through Windows. New diagnostics attached. Also, D11 has folders but they are all empty, even though Main still shows 5+ TB used on the drive. So what do I do with that now too?

home-diagnostics-20231210-1853.zip

3 hours ago, JorgeB said:

Disk11's rebuild will likely be corrupt due to the previous errors with disk6. You can try to rebuild disk6 using the old disk11, if parity remains valid, or, possibly easier, clone disk6 with ddrescue; most data should be recoverable. Then re-sync parity with the clone and the old disk11.

 

 

Thanks Jorge, I'll look into that. Meantime, if I go the New Config route, I'm also starting to think about just leaving the new D11 out as a last data drive and adding it as a 2nd parity instead. My only worry is that this is a low-power server with an extremely weak CPU, and parity checks on it already take 3 days, during which I can't do anything else with it. So would a second parity drive make that take even twice as long? I like the idea of not losing data in a scenario like this again, but I wonder if it's just not advisable with the hardware I have. (This is my HOME server if you want to see the specs in my sig below.)
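As an aside, a rough sketch of the kind of ddrescue clone JorgeB describes above, with hypothetical device names (double-check /dev/sdX and /dev/sdY against the Main page before running anything, since writing to the wrong device is destructive):

# first pass: copy everything that reads cleanly; the mapfile records progress
ddrescue -f /dev/sdX /dev/sdY /boot/ddrescue.map

# optional second pass: retry the bad areas a few more times
ddrescue -f -r3 /dev/sdX /dev/sdY /boot/ddrescue.map

The mapfile is what lets ddrescue pick up where it left off if a run is interrupted, so it's worth keeping it somewhere persistent like the flash drive.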

59 minutes ago, ElJimador said:

parity checks on it already take 3 days, during which I can't do anything else with it

You might want to consider installing the Parity Check Tuning plugin. That way, although the elapsed time would be longer, you could perhaps avoid the check running during prime time when you are likely to actually want to use the server.

 

As was mentioned, adding a second parity drive is quite likely to have no measurable effect on the parity check speed, but the only way to be certain is to try. You can always change your mind later if you decide the second parity drive slows things down too much.

On 12/11/2023 at 3:34 AM, JorgeB said:

Disk11's rebuild will likely be corrupt due to the previous errors with disk6. You can try to rebuild disk6 using the old disk11, if parity remains valid, or, possibly easier, clone disk6 with ddrescue; most data should be recoverable. Then re-sync parity with the clone and the old disk11.

 

 

Hi Jorge. Just coming back to this since I've been tied up with work this week. I want to make sure I understand and am doing this correctly before I attempt it, because having thought it through, my preference now is to do a New Config dropping the new D11 and adding it as a 2nd parity instead, and replacing the now failed and unmountable D6 with its clone.

So check me on this, please. I shut down and replace D11 with a good 6TB drive (same size as D6), at which point I won't be able to restart the array (I assume even in maintenance mode?) due to too many failed/wrong disks. But I can run ddrescue to clone D6 to the new drive. Then once that completes, I shut down again, pull the failed D6, replace it with the clone, and put back the new D11, which I assign in the new config as the 2nd parity. Whatever did or did not copy successfully to the new D6 I'll see when I start the array and parity is re-written for the new config. Correct? Or does the new D6 get formatted first, so that whatever was cloned is just wiped right away? If it's the latter, then how would I actually back up the data off the clone so I can copy it back to the array afterwards?

 

I hadn't mentioned it before, at least in this thread, but another part of my reason for wanting to do a New Config is that there is another drive (D8) which I'd prefer to replace now too, because although SMART still shows it as healthy, there were some errors on it during the last parity sync (before the snafu replacing D11 and starting this thread). So I'd prefer to just swap it for a new 12TB now and use it in my desktop instead before it gets worse. D8 and D11 I have full local backups for, which I'll copy back to the new config afterwards. Anyway, not sure if that makes any difference to what I'm trying to do cloning/recovering D6, or to the new config in general, but I thought I'd mention it for you and anyone else who might have input, in case it brings to mind any additional issues I'm not thinking of. Thanks again.

43 minutes ago, JorgeB said:

Is this old disk11?

 

 

No. Old D11 had a SMART warning after a ton of errors on it during the last parity sync, before I made any changes, which is why I was replacing it in the first place. I don't want to put it back in the array, especially since everything that was on it is already backed up locally and can easily be copied back once the new config is up and running.

 

And actually it turns out the other 6TB I had in mind for the D6 clone isn't ideal either. I'm scanning it with StableBit on my desktop and, 98% in, it's showing sectors it can't read, even though there is no SMART warning yet.

 

How about if I back up another 6TB drive already in the array (D7) and let ddrescue clone the old D6 to that instead? Wouldn't that allow me to restart the array afterwards to see what copied successfully, since D7 is already mounted?

35 minutes ago, ElJimador said:

No. Old D11 had a SMART warning after a ton of errors on it during the last parity sync, before I made any changes,

I initially mentioned using old disk11 since I assumed it was good:

 

2 hours ago, ElJimador said:

then re-sync parity with the clone and old disk11.

 

If old disk11 is not available, your options are much more limited. You can still clone disk6, and you can also try to clone disk11; if not, the data on that disk will be lost.

8 hours ago, JorgeB said:

I initially mentioned using old disk11 since I assumed it was good:

 

 

If old disk11 is not available, your options are much more limited. You can still clone disk6, and you can also try to clone disk11; if not, the data on that disk will be lost.

Everything that was on D11 is backed up locally, so it won't be any problem copying it back to the new config once it's up and parity has been re-written. D6 is the one I want to clone, because most of it is backed up remotely to a server I have at my parents' place 2,000 miles away, so recovery would be much more of a chore.

 

So my question is how to clone D6 and actually recover that data (as much as can be) before I do the new config. I don't want to try to recover anything from parity or do anything that requires a parity sync before then. I just want to understand: if I clone D6 to a same-size drive already in the array (backing that drive up first to free up the space), should I be able to start the array afterwards exactly as it is now, since that drive is already mounted, to see how much successfully copied to it? If so, that seems likely to be the best option, and I can figure out exactly how I want to map everything in the new config and then write back from local backups after that.

On 12/16/2023 at 4:05 AM, JorgeB said:

 

 

Click the link.

I did, thanks. I tried the option to copy the failed D6 to an array disk (D7) in maintenance mode and got the following. Is my command string incorrect? Since I had D7 backed up, I tried again, only this time I started the array first and deleted everything off D7 before restarting in maintenance mode, with the same result. Since the destination disk is empty, I'm not sure why it's saying there's no space left on it?

 

 

root@HOME:~# ddrescue -f /dev/sdj1 /dev/md7 /boot/ddrescue.log
GNU ddrescue 1.27
Press Ctrl-C to interrupt
Initial status (read from mapfile)
rescued: 8388 kB, tried: 0 B, bad-sector: 0 B, bad areas: 0

Current status
     ipos:    8388 kB, non-trimmed:        0 B,  current rate:       0 B/s
     opos:    8388 kB, non-scraped:        0 B,  average rate:       0 B/s
non-tried:    6001 GB,  bad-sector:        0 B,    error rate:       0 B/s
  rescued:    8388 kB,   bad areas:        0,        run time:          0s
pct rescued:    0.00%, read errors:        0,  remaining time:         n/a
                              time since last successful read:         n/a
Copying non-tried blocks... Pass 1 (forwards)
ddrescue: /dev/md7: Write error: No space left on device
root@HOME:~# 
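Not sure it's the cause here, but since ddrescue reports "No space left on device" when it runs out of room on the output, one quick sanity check before retrying is to compare the size of the source and destination (same device paths as in the command above):

# sizes in bytes; the destination needs to be at least as large as the source
blockdev --getsize64 /dev/sdj1
blockdev --getsize64 /dev/md7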


Hey Jorge (or anyone else who can help). After running ddrescue for days, the desktop where I was running the terminal session got accidentally unplugged. Is there a command string I can use in a new session to see where the process is at this point, or just to end it safely? I assume it's still running on the HOME server even though I no longer have any visibility into it. From what I was seeing throughout the operation right up until a day or 2 ago, it was just counting up endless read errors and not saving anything anyway, so at this point I'm resigned to having lost D6 entirely and recovering in the new config as best I can from offsite backups. Thanks.

35 minutes ago, ElJimador said:

Is there a command string I can use in a new session to see where the process is at this point, or just to end it safely?

I believe that if the terminal was closed the operation would have aborted, unless you were using something like screen, but if you type the ddrescue command again using the same log file it will resume from where it left off.
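To illustrate both points, reusing the same devices and mapfile as the earlier command (assuming screen is available on the server):

# start a named screen session so a closed terminal doesn't take the copy down with it
screen -S rescue
ddrescue -f /dev/sdj1 /dev/md7 /boot/ddrescue.log
# detach with Ctrl-A then d, and reattach later with:
screen -r rescue

# or simply rerun the same command in a new terminal; because the mapfile is reused,
# ddrescue resumes from where the previous run stopped instead of starting over
ddrescue -f /dev/sdj1 /dev/md7 /boot/ddrescue.log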

