
Recover ANY Data from Failed Cache drive RAID0?



I came home this morning to my Windows VM not working. I believe this is due to one of my two cache drives failing. The cache pool is two 500GB SSDs in RAID0, totaling 1TB. I noticed errors on one of the cache drives. This is the first error in the log for the drive, and it only appears once.

Jun 21 09:48:04 Tower kernel: print_req_error: I/O error, dev sdc, sector 89165888

 

This error repeats over and over when I attempt to start the VM. 

Jun 21 09:48:04 Tower kernel: print_req_error: critical medium error, dev sdc, sector 66458640

 

If I'm understanding this correctly, the drive has a bad sector, and because of how I mistakenly set it up, that single bad 512 bytes of data means the well is poisoned and ALL 216GB of data on the cache is ruined? It has my Docker stuff and my VMs saved on it. I care VERY much about the data that's on the 200GB vdisk for the Windows VM. There's over a year's worth of a project I've been working on. Is there no way to recover ANY of it?

 

I thought BTRFS meant something could be done. I know I messed up by having critical data on RAID0-striped drives, but I just set it up and forgot about that being a potential issue. It CAN'T be completely ruined, I would think, because there's another Linux VM that still appears to be working fine, and most of the Dockers appear to be working fine. It's just the Windows VM and the Krusader docker that appear to not be working. I get the Blue Screen of Death when booting the Windows VM.

 

Attached are my diagnostics and the SMART report for the drive that has errors.

tower-diagnostics-20210621-0848.zip tower-smart-20210621-0720.zip

9 minutes ago, Dwiman89 said:

I care VERY much about the data thats on the 200GB Vdisk for the windows VM. Theres over a years worth of a project ive been working on. Is there no way to recover ANY of it?

I'm sorry, but you were using raid0 and didn't have any backups? Even with redundant storage you should still have backups of anything important.

 

You can try cloning the bad SSD with ddrescue. It's not optimized for flash-based devices, but it generally works. Then mount the clone together with the other pool member.
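
As a rough sketch only (not the exact wiki procedure quoted later in the thread): a ddrescue clone to a spare disk looks something like the snippet below. The device names sdc and sdj are assumptions taken from later posts; a wrong target device destroys data, so the command is printed rather than run:

```shell
#!/bin/sh
# Hypothetical device names -- verify with lsblk before doing anything real.
SRC=/dev/sdc            # failing pool member (assumed from the syslog above)
DST=/dev/sdj            # replacement SSD, at least as large as SRC
MAP=/boot/ddrescue.map  # mapfile lets an interrupted run resume where it left off

# Preview only: remove the echo once the device names are confirmed correct.
echo "ddrescue -f $SRC $DST $MAP"
```

The `-f` flag forces ddrescue to write to a block device; the mapfile is what makes retries of the bad regions possible without re-reading the good ones.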

16 minutes ago, JorgeB said:

then mount together with the other one.

Forgot to mention, if you do this make sure the bad SSD is physically disconnected from the server before mounting the clone with the other pool member, and for Unraid to accept the new pool config you need to do this:

 

1. Stop the array; if the Docker/VM services are using the cache pool, disable them.
2. Unassign all cache devices.
3. Start the array to make Unraid "forget" the current cache config.
4. Stop the array.
5. Assign the clone together with the other pool member (there can't be an "All existing data on this device will be OVERWRITTEN when array is Started" warning for any cache device).
6. Re-enable Docker/VMs if needed, then start the array.

 

 

48 minutes ago, JorgeB said:

Forgot to mention, if you do this make sure the bad SSD is physically disconnected from the server before mounting the clone with the other pool member.

Geez, looking at that wiki, it looks like there's a big margin for error, and it's close to being over my head. Is it safe for me to pull both cache drives, or just one, out of the array before messing with anything, or is it okay to just start the whole array in maintenance mode? I'm worried about making it worse.

 

1 hour ago, JorgeB said:

I'm sorry but you were using raid0 and didn't have any backups? Even using redundant storage you should still have backups of anything important.

 

You can try cloning the bad SSD with ddrescue, it's not optimized for flash based devices, but it generally works, then mount together with the other one.

Well, the most important file, I thought, was my vdisk.img, since it's everything on the Windows VM on the cache, which is 200GB. That's a huge file to keep backed up all the time, and a lot of writes, I would think. To have it backed up even on a weekly basis means copying a 200GB file to my array every week?

 

When I first set it up, I did pick raid0, but initially it was just scratch drive space, with data periodically moved to the array once a week to reduce wear on the array. The VM was an afterthought and didn't initially have importance to me. It became important, and I didn't think about it being on raid0 to begin with. I forgot it was saved to it.

1 hour ago, JorgeB said:

Forgot to mention, if you do this make sure the bad SSD is physically disconnected from the server before mounting the clone with the other pool member.

Also, do I need to preclear and format the replacement drive before attempting the clone? I was also thinking about manually copying the files using a file explorer first, just in case, even though if they are corrupted that may mean nothing in the end.

10 minutes ago, Dwiman89 said:

Well the most important file I thought was my Vdisk.img since its everything on the windows VM on the cache which is 200GB. Thats a huge file to keep backed up all the time and a lot of writes I would think. To even have it backed up on say a weekly basis, that means I have to be copying a 200GB file to my array every week? 

I have a script that takes snapshots of my VMs every day during the night, then sends them incrementally to another pool; I have about a month's worth if needed. There's some info on how to do this in the VM FAQ thread.
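
The actual script isn't posted here, but the idea is roughly this: a cheap read-only btrfs snapshot each night, then an incremental send of just the changed blocks to the other pool. All paths and the parent snapshot name below are hypothetical, and it assumes the VM directory is a btrfs subvolume; commands are printed rather than executed:

```shell
#!/bin/sh
# Hypothetical paths -- adjust to your own pools. Preview only (echo).
SRC=/mnt/cache/domains       # assumed: a btrfs subvolume holding the vdisks
SNAPDIR=/mnt/cache/snaps     # where read-only snapshots are kept
DEST=/mnt/backup_pool/snaps  # second pool that receives the stream
TODAY=$(date +%Y-%m-%d)
PARENT=2021-06-20            # yesterday's snapshot, the base for the incremental diff

# 1. Read-only snapshot of the VM subvolume (copy-on-write, near-instant).
echo "btrfs subvolume snapshot -r $SRC $SNAPDIR/$TODAY"
# 2. Send only the blocks changed since the parent snapshot.
echo "btrfs send -p $SNAPDIR/$PARENT $SNAPDIR/$TODAY | btrfs receive $DEST"
```

This is why a 200GB vdisk doesn't mean 200GB of writes per backup: after the first full send, each nightly transfer is only the delta against the previous snapshot.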

 

12 minutes ago, Dwiman89 said:

Is it safe for me to bring both cache drives or just one out of the array before messing with anything or is all okay to just start the whole array in maintenance mode? Im worried about making it worse. 

You can remove both devices from the server, or leave the other one; just unassign both and you can start the server normally.

 

10 minutes ago, Dwiman89 said:

Also, do I need to preclear and format the replacement drive first before attempting the clone?

Nope.

2 hours ago, JorgeB said:

I have a script taking snapshots of my VMs everyday during the night, then send them incrementally to another pool, I have like a month worth if needed, there's some info on how to do this in the VM FAQ thread.

 

You can remove both devices from the server, or leave the other one, just unassign both and you can start the server normally.

 

Nope.

ddrescue -f /dev/sdX1 /dev/md# /boot/ddrescue.log

"Replace X with source disk (note the 1 in the source disk identifier), # with destination disk number, recommend enabling turbo write first or it will take much longer."

 

With this, the damaged source drive is "sdc", but what is the source disk identifier denoted by the 1? If I look in the /dev/ directory, there exist both "sdc" and "sdc1".

 

For the destination disk, where do I get the destination disk number? I see several files under /dev/ that start with "md". The destination SSD is listed as "sdj" under unassigned devices in my main tab. 

6 hours ago, JorgeB said:

I have a script taking snapshots of my VMs everyday during the night, then send them incrementally to another pool, I have like a month worth if needed, there's some info on how to do this in the VM FAQ thread.

 

You can remove both devices from the server, or leave the other one, just unassign both and you can start the server normally.

 

Nope.

I think I successfully cloned it; it says 99.99% recovered. Those instructions say there is a way to output which files specifically are damaged. I used the commands, but nothing with the string "unRAID" is output.

 

find /mnt/cache -type f -exec grep -l "unRAID" '{}' ';'

This is what I used in place of path/to/disk, since I assume there is no individual mount point when it's in the cache pool. I tried to mount the cloned drive as an unassigned device so I could try to get the output that way, but it just hangs on "mounting"; it's been about 20 minutes now.

10 hours ago, Dwiman89 said:

I see several files under /dev/ that start with "md".

That is to clone to an array device; use the first option to clone to an unassigned device.

 

6 hours ago, Dwiman89 said:

path/to/disk

This is the destination path after mounting it; in this case you can only do this after mounting the pool.

2 hours ago, JorgeB said:

That is to clone to an array device, use the first option to clone to an unassigned device.

 

This is the destination path after mounting it, in this case you could only do this after mounting the pool.

Okay, it's mounted to the pool. I received notifications that the cache has returned to normal operation. I believe all is well with the cloning, and it is mounted.

 

I can't seem to get the process to work to see which files specifically are corrupted. Under the /mnt/ directory, I have this:

root@Tower:/mnt# ls
cache/  disk1/  disk2/  disk3/  disk4/  disk5/  disks/  remotes/  user/  user0/

 

Thus, I'm using this to try to get the output with the "unRAID" string that it says indicates which files are corrupted. I followed the instructions with the first two commands:

printf "unRAID " >~/fill.txt

 

And then:

ddrescue -f --fill=- ~/fill.txt /dev/sdY /boot/ddrescue.log

 

I then try this command, and this is the result. I still don't see anything that says "unRAID" indicating a corrupt file.

root@Tower:/# find /mnt/cache/ -type f -exec grep -l "unRAID" '{}' ';'
/mnt/cache/appdata/binhex-plex/Plex Media Server/Logs/Plex DLNA Server.5.log
/mnt/cache/appdata/binhex-plex/Plex Media Server/Logs/Plex DLNA Server.4.log
/mnt/cache/appdata/binhex-plex/Plex Media Server/Logs/Plex DLNA Server.3.log
/mnt/cache/appdata/binhex-plex/Plex Media Server/Logs/Plex DLNA Server.2.log
/mnt/cache/appdata/binhex-plex/Plex Media Server/Logs/Plex DLNA Server.1.log
/mnt/cache/appdata/binhex-plex/Plex Media Server/Plug-in Support/Caches/com.plexapp.system/HTTP.system/CacheInfo
/mnt/cache/appdata/binhex-krusader/supervisord.log
/mnt/cache/domains/Hassos/hassos_ova-2.12.qcow2
/mnt/cache/domains/Windows 10/vdisk1.img
root@Tower:/#

 

 

I appreciate all of your help with this. I didn't expect to get this far and assumed all was lost. Hopefully once I know which files are damaged, I can work on replacing them, and then set up something to keep everything on the cache backed up, like the script you mentioned! Thank you.

15 minutes ago, Dwiman89 said:

I still don't see anything that says "unRAID" indicating a corrupt file.

The listed files are the corrupt files:

 

15 minutes ago, Dwiman89 said:

/mnt/cache/appdata/binhex-plex/Plex Media Server/Logs/Plex DLNA Server.5.log

/mnt/cache/appdata/binhex-plex/Plex Media Server/Logs/Plex DLNA Server.4.log

/mnt/cache/appdata/binhex-plex/Plex Media Server/Logs/Plex DLNA Server.3.log

/mnt/cache/appdata/binhex-plex/Plex Media Server/Logs/Plex DLNA Server.2.log

/mnt/cache/appdata/binhex-plex/Plex Media Server/Logs/Plex DLNA Server.1.log

/mnt/cache/appdata/binhex-plex/Plex Media Server/Plug-in Support/Caches/com.plexapp.system/HTTP.system/CacheInfo

/mnt/cache/appdata/binhex-krusader/supervisord.log

/mnt/cache/domains/Hassos/hassos_ova-2.12.qcow2

/mnt/cache/domains/Windows 10/vdisk1.img
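
To see why the grep lists exactly these files: ddrescue's fill mode stamped the "unRAID" marker over every region it couldn't read, so any file that now greps positive overlaps a bad area. A self-contained demo of the search half, using a throwaway directory instead of a real pool (file names here are made up for illustration):

```shell
#!/bin/sh
# Demo only: simulate a marker-filled pool in a temp directory.
DEMO=$(mktemp -d)
mkdir -p "$DEMO/appdata" "$DEMO/domains"

printf 'healthy file contents\n' > "$DEMO/appdata/ok.log"
# Pretend this file overlapped a bad region that --fill stamped with the marker.
printf 'some data unRAID more data\n' > "$DEMO/domains/vdisk1.img"

# Same search used in the thread: list every file containing the marker.
HITS=$(find "$DEMO" -type f -exec grep -l "unRAID" '{}' ';')
echo "$HITS"   # only the marked file, .../domains/vdisk1.img
```

The log files in the list above are harmless losses; the vdisk and qcow2 images are the ones worth restoring from inside the VMs.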

 

20 hours ago, Dwiman89 said:

Well the most important file I thought was my Vdisk.img since its everything on the windows VM on the cache which is 200GB. Thats a huge file to keep backed up all the time and a lot of writes I would think. To even have it backed up on say a weekly basis, that means I have to be copying a 200GB file to my array every week?

 

Use something like UrBackup to make incremental backups regularly to the array.

On 6/21/2021 at 1:43 PM, JorgeB said:

I have a script taking snapshots of my VMs everyday during the night, then send them incrementally to another pool, I have like a month worth if needed, there's some info on how to do this in the VM FAQ thread.

 

You can remove both devices from the server, or leave the other one, just unassign both and you can start the server normally.

 

Nope.

I'm working on setting up backups. The script the FAQ points to for automatic backups says it's experimental, and that was 5 years ago... Is this the script you use? Does it copy the entire vdisk over and over? That's a lot of writes for me with 200GB worth of vdisk.

40 minutes ago, Dwiman89 said:

I installed that docker, but I'm woefully confused about how to set it up..

In a nutshell, the container is the UrBackup host; it provides an endpoint for any UrBackup clients that you add. After you have it set up, log in to your UrBackup site, add a client to the backup list on the status page, download the executable it generates, and run it on the machine you want to back up.

