Parity drive failure, data drive failure and another data drive with SMART errors - can the data be saved?

LFletcher · November 23, 2022

Hi,

Helping a friend who has some issues with his unRAID server.

My understanding is that one of the parity drives (Parity 2) had issues first and went offline.

Then there were issues with a data drive (disk 3, sdd), which appears to have UDMA CRC errors.

[Attached image: image.png]

I did SMART checks on the other drives and it appears Disk 5 (sdh) has issues as well, although it hasn't failed the drive (yet).

I've got three 14TB drives, which I had originally planned to use to replace both parity drives and disk 3.

Now that disk 5 also has potential issues I can get another 14TB drive to replace that.

My question is, can the data from disk 3 be recovered or is it lost?

If it can be recovered, what's the correct order to do things in?

Do I need to copy the data from disk 5 before that has any more issues?

I've attached the diagnostics file as well.

Thanks in advance.

tower-diagnostics-20221123-1310.zip

JorgeB · November 23, 2022

Parity2 also appears to be failing, but disk3 looks OK. Do you know if something was written to the emulated disk3 after it got disabled?

LFletcher · November 23, 2022

Parity 2 passes a SMART test, but unRAID isn't happy with the results

Disk 3 was unassigned when I booted the server up.

When I allocated the drive, that's when the UDMA CRC message popup box came up.

The drives are on a miniSAS backplane so it's unlikely to be a cable issue causing the crc errors.

To the best of my knowledge, nothing has been written to the emulated disk 3.

Thanks

trurl · November 23, 2022

1 hour ago, LFletcher said:

Disk 3 was unassigned when I booted the server up.

When I allocated the drive, that's when the UDMA CRC message popup box came up.

Instead of "allocated" do you really mean you reassigned the disk? If you reassigned the disk and started the array it would have started rebuilding the disk, which seems to agree with your first screenshot since it was showing the drive invalid instead of disabled.

Your diagnostics and that screenshot indicate the array is not started currently. Since you still have parity, it should be able to emulate disk3.

Unassign disk3, start the array, and post new diagnostics.

LFletcher · November 23, 2022

27 minutes ago, trurl said:

Instead of "allocated" do you really mean you reassigned the disk? If you reassigned the disk and started the array it would have started rebuilding the disk, which seems to agree with your first screenshot since it was showing the drive invalid instead of disabled.

Yes, the disk was unassigned when I started the server, so I reassigned it, but I hadn't started the array up until now.

I have unassigned disk 3 and started the array.

Disk 3 now resides in the unassigned devices section

Disk 5 isn't happy though and states it's unmountable.

I've attached the updated diagnostics file.

tower-diagnostics-20221123-1623.zip

JorgeB · November 23, 2022

Check the filesystem on disk5, but note that xfs_repair will abort if there's a read error.

trurl · November 23, 2022

3 hours ago, LFletcher said:

Disk 3 now resides in the unassigned devices section

The point of that exercise was to see that emulated disk3 mounts, which it does, and has 2.5TB of data, which is what it will rebuild, at least ideally. But rebuild of disk3 requires disk5 to be working well. It could be that you need to rebuild disk5 to a new disk instead, which is more complicated. How well that can work depends on

5 hours ago, JorgeB said:

if something was written to the emulated disk3 after it got disabled

Do you have backups of anything important and irreplaceable? Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

LFletcher · November 23, 2022

OK, so I restarted the array in maintenance mode and ran the check with -nv and this was the output

I assume I now need to run:

xfs_repair -v /dev/md5

LFletcher · November 23, 2022

1 hour ago, trurl said:

Do you have backups of anything important and irreplaceable? Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

It's not my machine (I'm trying to sort it out for a friend), but it's safe to assume there won't be any backups.

I know there are photos on the array, but I don't know where they are, or whether they are likely to be on any of the impacted drives.

Obviously in an ideal world we'll be able to restore all of the drives without losing any data, but in an ideal world he would have paid more attention when the box started having issues (and given it to me sooner).

What options do we have, assuming we have no backups to rely on and I need to try and save as much of the data as possible?

All of the assistance I have been given so far is very much appreciated.

Edited November 23, 2022 by LFletcher

trurl · November 24, 2022

Check the filesystem as before, but without -n. If it asks for it, also use -L.

You don't necessarily need to know which disk anything is on if you have some idea of what user shares contain important data. If you think anything needs to be backed up before proceeding, best to copy the data somewhere off the array so nothing is changed on the array.
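
The check-then-repair sequence discussed in this thread can be sketched roughly as below. This is a minimal sketch, assuming the array is started in maintenance mode and that disk5 is exposed as /dev/md5 (the device named earlier in the thread). The `run` helper and the `DRY_RUN` variable are hypothetical conveniences, not part of Unraid or xfs_repair; by default the script only prints the commands so nothing on the disk is touched.

```shell
#!/bin/sh
# Sketch only: commands are printed unless DRY_RUN=0 is set explicitly.
DRY_RUN="${DRY_RUN:-1}"
DEV=/dev/md5   # disk5 with the array in maintenance mode, per the thread

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: read-only check (-n = no modify, -v = verbose).
check_fs() { run xfs_repair -nv "$DEV"; }

# Step 2: actual repair, the same command without -n.
repair_fs() { run xfs_repair -v "$DEV"; }

# Step 3: only if xfs_repair asks for it. -L zeroes the log, which can
# discard the most recent metadata changes.
repair_fs_zero_log() { run xfs_repair -vL "$DEV"; }

check_fs
```

Copying important data off the array first, as suggested above, is the safer order, since steps 2 and 3 write to the disk.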

LFletcher · November 24, 2022

I have copied the important stuff onto another (external) drive.

Ran the -v command and got this:

[Attached image: image.png]

So I then ran -vL and got this:

[Attached image: image.png]

And also these notifications:

[Attached image: image.png]

JorgeB · November 24, 2022

Disk problem. IMHO the best bet is to clone that disk with ddrescue, then run xfs_repair again. You can then use the cloned disk with the old disk3, since that one looks healthy, and re-sync parity.
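
A clone along these lines might look like the sketch below. The device names /dev/sdX (failing source) and /dev/sdY (new destination) are placeholders, not taken from the diagnostics, and must be replaced with the real devices; the mapfile path /boot/ddrescue.log is the one that appears later in this thread. The `run` helper and `DRY_RUN` variable are hypothetical and only there so the script prints the command by default instead of executing it.

```shell
#!/bin/sh
# Sketch only: the ddrescue command is printed unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
SRC=/dev/sdX             # placeholder: the failing disk
DST=/dev/sdY             # placeholder: the new disk, which will be overwritten
MAP=/boot/ddrescue.log   # mapfile that records which areas were copied

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

clone_disk() {
  # Guard against pointing both arguments at the same device.
  if [ "$1" = "$2" ]; then
    echo "source and destination must differ" >&2
    return 1
  fi
  # -f is needed because the output is a block device, not a regular file.
  # Re-running the same command with the same mapfile resumes the copy.
  run ddrescue -f "$1" "$2" "$3"
}

clone_disk "$SRC" "$DST" "$MAP"
```

Once the clone finishes, xfs_repair would be run against the clone rather than against the failing original.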

LFletcher · November 25, 2022

Is there any way to speed up the ddrescue process?

It's been running for about 30 hours and is less than 70% of the way through pass 1 (ignore the run time on the screenshot; I had to restart it after 24 hours, so this is the second run).

[Attached image: image.png]

JorgeB · November 25, 2022

That will depend on the state of the disk; there's not much more you can do other than wait.

LFletcher · November 27, 2022

ddrescue has now finished.

[Attached image: rescue.JPG]

I then ran xfs_repair against the cloned drive:

[Attached image: repair.JPG]

I've now run the following commands from the ddrescue FAQ:

printf "unRAID " >~/fill.txt

ddrescue -f --fill=- ~/fill.txt /dev/sdd /boot/ddrescue.log

find /mnt/disks/Z2GBNVET -type f -exec grep -l "unRAID" '{}' ';'

which is still in the process of running.
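
As a side note on the search step, here is a small variation on the find command above, offered as a sketch rather than as the FAQ's method. The function name `list_marked_files` is made up for illustration. It uses `-exec ... +` instead of `';'` so that grep is started once per batch of files rather than once per file, and `-F` so the marker is treated as a fixed string. The demonstration runs on a throwaway directory, not on the real clone.

```shell
#!/bin/sh
# List files under a directory that contain a given marker string.
list_marked_files() {
  dir="$1"
  marker="$2"
  # -l prints only the names of matching files, -F disables regex matching.
  # '{}' + passes many files to each grep invocation.
  find "$dir" -type f -exec grep -lF -- "$marker" '{}' +
}

# Demonstration on a temporary directory with one clean and one marked file.
demo=$(mktemp -d)
printf 'intact data' > "$demo/good.mkv"
printf 'data unRAID unRAID data' > "$demo/damaged.mkv"
list_marked_files "$demo" "unRAID"
rm -rf "$demo"
```

On the real clone, the call would be something like `list_marked_files /mnt/disks/Z2GBNVET "unRAID"`, with the output redirected to a file of your choosing so the list of damaged files survives after the terminal is closed.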

When looking at the data on the mounted cloned drive everything appears to now be in a lost+found directory

[Attached image: image.png]

Shouldn't the cloned drive have a directory structure that mirrored the original disk?

I assumed that after the check I would be able to unassign the old bad drive (Disk 5), assign the cloned drive in its place, restart the array, and this part of the issue would be resolved.

I guess with just the lost+found folder that is not going to be the case, or am I missing something?

trurl · November 27, 2022

Filesystem repair will often result in some files in lost+found that it couldn't figure out. Could be a lot of lost+found depending on how bad the corruption is. Have you examined the files/folders in lost+found? Since it is a top level folder it is also a user share you can access on the network.

LFletcher · November 27, 2022

I've been having a look.

[Attached image: image.png]

[Attached image: image.png]

In reality it shouldn't be too difficult to work out what goes where, as it's either a movie or a TV series/episode.

I suppose I didn't expect everything to be in a lost+found directory, but as I've never done this before - you live and learn.

I'll wait for the scan of corrupt files to finish, but what is my next step?

Is it to unassign Disk 5 (physically remove it from the server), and then assign this cloned disk in its place?

If I do that will unRaid recreate the old folder structure and I'll just have to manually move things into the correct place or will I have to do something else?

Also what are the next steps to sorting out the issues with both Disk 3 (which we unassigned earlier) and the Parity 2 drive which also has issues still?

Thanks

Edited November 27, 2022 by LFletcher

trurl · November 27, 2022

1 hour ago, LFletcher said:

will unRaid recreate the old folder structure

There is nothing Unraid can do to improve your disk5 repair results.

On 11/24/2022 at 6:12 AM, JorgeB said:

best bet is to clone that disk with ddrescue then run xfs_repair again, you can then use the cloned disk with old disk3 since that one looks healthy and re-sync parity.

That seems best to me also. New Config with all disks assigned including the cloned disk5, and rebuild parity.
