Parity drive failure, data drive failure and another data drive with SMART errors - can the data be saved?



Hi,

Helping a friend who has some issues with his unRAID server.

 

My understanding is that the parity drive had issues first of all and went offline (Parity 2).

Then there were issues with a data drive (disk 3 - sdd), which appears to have UDMA CRC errors.

 

image.png.7d7ca267ab1acb1bf34e0e3a9bc7696d.png

 

I did SMART checks on the other drives and it appears Disk 5 (sdh) has issues as well, although it hasn't failed the drive (yet).
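
For anyone following along, the same SMART attributes can be pulled out on the command line rather than read off screenshots. This is only a rough sketch: it assumes the usual `smartctl -A` table layout from smartmontools, where the raw value is the 10th column, and `smart_raw_value` is a helper name made up for this example.

```shell
#!/usr/bin/env bash
# Read `smartctl -A` output on stdin and print the raw value of one attribute.
# Assumes the standard table layout: ID# is column 1, RAW_VALUE is column 10.
smart_raw_value() {
    # $1 = attribute ID, e.g. 199 (UDMA_CRC_Error_Count) or 5 (Reallocated_Sector_Ct)
    awk -v id="$1" '$1 == id { print $10 }'
}

# Usage (device names as in this thread, adjust to your system):
#   smartctl -A /dev/sdd | smart_raw_value 199
#   smartctl -A /dev/sdh | smart_raw_value 5
```

A CRC count that keeps climbing usually points at the link (cable, connector, backplane), whereas reallocated or pending sectors point at the disk itself, so it is worth checking both on each suspect drive.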

 

image.thumb.png.cbe4751abbc723cefc90097b294fce58.png

 

I've got three 14TB drives, which I had originally planned to use to replace both parity drives and disk 3.

Now that disk 5 also has potential issues I can get another 14TB drive to replace that.

 

My question is, can the data from disk 3 be recovered or is it lost?

If it can be recovered, what's the correct order to do things in?

 

Do I need to copy the data from disk 5 before that has any more issues?

 

I've attached the diagnostics file as well.

 

Thanks in advance.

 

 

 

tower-diagnostics-20221123-1310.zip

Link to comment

Parity 2 passes a SMART test, but unRAID isn't happy with the results

 

image.thumb.png.ab04d031ba3decdc791399ef224f9c60.png

 

Disk 3 was unassigned when I booted the server up.

When I allocated the drive, that's when the UDMA CRC message popup box came up.

image.thumb.png.27e4d02ed2952647b82eb186c2bbeca3.png

 

The drives are on a mini-SAS backplane, so it's unlikely to be a cable issue causing the CRC errors.

 

To the best of my knowledge I don't believe anything has been written to the emulated disk 3.

 

Thanks

 

 

Link to comment
1 hour ago, LFletcher said:

Disk 3 was unassigned when I booted the server up.

When I allocated the drive, that's when the UDMA CRC message popup box came up.

Instead of "allocated" do you really mean you reassigned the disk? If you reassigned the disk and started the array it would have started rebuilding the disk, which seems to agree with your first screenshot since it was showing the drive invalid instead of disabled.

 

Your diagnostics and that screenshot indicate the array is not started currently. Since you still have parity, it should be able to emulate disk3.

 

Unassign disk3, start the array, and post new diagnostics.

Link to comment
27 minutes ago, trurl said:

Instead of "allocated" do you really mean you reassigned the disk? If you reassigned the disk and started the array it would have started rebuilding the disk, which seems to agree with your first screenshot since it was showing the drive invalid instead of disabled.

Yes, the disk was unassigned when I started the server, so I reassigned it, but I hadn't started the array up until now.

 

I have unassigned disk 3 and started the array.

 

image.thumb.png.94061a093e54b5cca13a8f4de31403db.png

 

Disk 3 now resides in the unassigned devices section

image.thumb.png.e6755a848415fbaf5162af8e5ce6f4e2.png

 

Disk 5 isn't happy though and states it's unmountable

image.thumb.png.76f1923daeeb2453aa5e8ad2bbd9a972.png

 

I've attached the updated diagnostics file.

 

tower-diagnostics-20221123-1623.zip

Link to comment
3 hours ago, LFletcher said:

Disk 3 now resides in the unassigned devices section

The point of that exercise was to see that emulated disk3 mounts, which it does, and has 2.5TB of data, which is what it will rebuild, at least ideally. But rebuild of disk3 requires disk5 to be working well. It could be that you need to rebuild disk5 to a new disk instead, which is more complicated. How well that can work depends on

5 hours ago, JorgeB said:

 if something was written to the emulated disk3 after it got disabled

 

Do you have backups of anything important and irreplaceable? Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

Link to comment
1 hour ago, trurl said:

Do you have backups of anything important and irreplaceable? Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

It's not my machine (I'm trying to sort it out for a friend), but it's safe to assume there won't be any backups.

I know there are photos on the array, but I don't know where they are, or whether they are likely to be on any of the impacted drives.

 

Obviously in an ideal world we'll be able to restore all of the drives without losing any data, but in an ideal world he would have paid more attention when the box started having issues (and given it to me sooner).

 

What options do we have, assuming we have no backups to rely on and I need to try and save as much of the data as possible?

 

All of the assistance I have been given so far is very much appreciated.

Link to comment

ddrescue has now finished.

 

rescue.JPG.60e0d9c1016aea48ef8725404af6c75f.JPG

 

I then ran the xfs_repair against the cloned drive; 

 

repair.JPG.f78ebc2761d64e777c625153d283fc8f.JPG
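
For reference, the clone-then-repair sequence up to this point looks roughly like the sketch below. It only prints the commands so they can be checked before anything is run. The device names are placeholders, `print_clone_plan` is a made-up helper name, and the trailing `1` assumes the filesystem sits on the first partition of the clone, which should be confirmed on the actual disk.

```shell
#!/usr/bin/env bash
# Print (do NOT execute) the clone-and-repair commands for review.
# $1 = failing source disk, $2 = new destination disk, $3 = ddrescue mapfile.
# All three are placeholders: getting source and destination the wrong way
# round would overwrite the disk you are trying to save.
print_clone_plan() {
    local src=$1 dst=$2 map=$3
    # Clone the failing disk onto the new one (-f allows writing to a device)
    echo "ddrescue -f $src $dst $map"
    # Dry-run check of the cloned filesystem first (-n makes no changes)
    echo "xfs_repair -n ${dst}1"
    # Actual repair of the clone (assumes partition 1 holds the XFS filesystem)
    echo "xfs_repair -v ${dst}1"
}

# Usage:
#   print_clone_plan /dev/sdX /dev/sdY /boot/ddrescue.log
```

Running the repair against the clone rather than the original means the failing disk is left untouched in case the clone needs to be redone.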

 

I've now run the following commands from the ddrescue faq;

 

printf "unRAID " >~/fill.txt

ddrescue -f --fill=- ~/fill.txt /dev/sdd /boot/ddrescue.log

find /mnt/disks/Z2GBNVET -type f -exec grep -l "unRAID" '{}' ';'

 

which is still in the process of running. 
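
While the grep is running, the ddrescue mapfile itself gives a quick idea of how much of the source could not be read. The sketch below adds up the sizes of all blocks marked `-` (bad sector). It assumes the GNU ddrescue mapfile layout of `pos size status` per data line with hexadecimal values, and `bad_bytes_in_mapfile` is a name made up for this example.

```shell
#!/usr/bin/env bash
# Sum the sizes (in bytes) of all blocks marked '-' (bad sector) in a
# GNU ddrescue mapfile. Assumes data lines look like: 0xPOS 0xSIZE STATUS
bad_bytes_in_mapfile() {
    local mapfile=$1 total=0 pos size status rest
    while read -r pos size status rest; do
        # Skip blank lines and comment lines
        case $pos in
            ''|\#*) continue ;;
        esac
        # The current-position line has its status in field 2, not field 3,
        # so it never matches here; only bad-sector data lines are counted.
        [ "$status" = "-" ] || continue
        # Shell arithmetic understands the 0x... hex sizes directly
        total=$(( total + size ))
    done < "$mapfile"
    echo "$total"
}

# Usage (mapfile path as used in the commands above):
#   bad_bytes_in_mapfile /boot/ddrescue.log
```

A small total means only a handful of files should turn up in the grep for the fill marker.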

 

When looking at the data on the mounted cloned drive, everything now appears to be in a lost+found directory.

 

image.png.1baa10a62c886f00860b2183b81978da.png

 

Shouldn't the cloned drive have a directory structure that mirrored the original disk? 

I assumed that after the check I would be able to unassign the old bad drive (Disk 5), assign the cloned drive in its place, restart the array, and this part of the issue would be resolved.

I guess with just the lost+found folder that is not going to be the case, or am I missing something?

 

 

 

 

Link to comment

I've been having a look. 

 

image.png.7ce3116e70dfd565fd92c6554330e869.png

 

image.png.ef4ed0b94d87e7bbce33e9a0d6809c46.png

 

In reality it shouldn't be too difficult to work out what goes where, as it's either a movie or a TV series/episode.
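
A first pass at that movie/TV split could be scripted along the lines of the sketch below. It is purely illustrative: it assumes episode files carry a two-digit `SxxExx` tag in their name, which may not hold for everything in lost+found (entries that lost their names will need checking by hand), and `classify_media` is a made-up helper name. It only prints a label and never moves anything.

```shell
#!/usr/bin/env bash
# Guess whether a recovered file is a TV episode or a movie from its name.
# Assumption: episodes contain a tag like S01E02 (two digits each);
# anything without such a tag is labelled "movie".
classify_media() {
    local name=${1##*/}   # strip the directory part, keep the file name
    case $name in
        *[Ss][0-9][0-9][Ee][0-9][0-9]*) echo tv ;;
        *) echo movie ;;
    esac
}

# Usage (read-only listing; mount point as used earlier in the thread;
# file names containing newlines would confuse this simple loop):
#   find /mnt/disks/Z2GBNVET/lost+found -type f | while IFS= read -r f; do
#       printf '%s\t%s\n' "$(classify_media "$f")" "$f"
#   done
```

Reviewing the printed list before moving anything keeps the recovered data untouched until the split looks right.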

I suppose I didn't expect everything to be in a lost+found directory, but as I've never done this before - you live and learn. 

 

I'll wait for the scan of corrupt files to finish, but what is my next step? 

Is it to unassign Disk 5 (physically remove it from the server), and then assign this cloned disk in its place? 

If I do that will unRaid recreate the old folder structure and I'll just have to manually move things into the correct place or will I have to do something else? 

 

Also what are the next steps to sorting out the issues with both Disk 3 (which we unassigned earlier) and the Parity 2 drive which also has issues still? 

 

Thanks

 

 

 

Link to comment
1 hour ago, LFletcher said:

will unRaid recreate the old folder structure

There is nothing Unraid can do to improve your disk5 repair results.

On 11/24/2022 at 6:12 AM, JorgeB said:

best bet is to clone that disk with ddrescue then run xfs_repair again, you can then use the cloned disk with old disk3 since that one looks healthy and re-sync parity.

That seems best to me also. New Config with all disks assigned including the cloned disk5, and rebuild parity.

Link to comment
