Cache Drive Failed (1 of 2): Not sure how I can recover the data



So my Unraid machine went down unexpectedly due to an external power issue (it was behind a surge/overload protector but didn't have a UPS), and when it came back up, I'm seeing some issues with the cache drives (obviously I can't load my VMs since I can't mount the cache drives).

 

So, my build: I have Unraid set up with 5 data drives, 1 parity drive, 2 cache drives (2x 250GB SSDs mirrored in btrfs, giving me 250GB of space), and some other drives in Unassigned Devices.

 

Let's talk about the cache drives. They store my Windows VM, which I would very much like to get back if at all possible. The drives are 1x 250GB Samsung 850 Evo and 1x 250GB OCZ. It seems that after the power outage the OCZ has errors, and the Samsung says "unmountable" (or is that the entire cache pool?). I restarted a couple of times because at one point the OCZ was not visible, then the Samsung was not visible, and so on, but after disconnecting the cables, reconnecting and restarting, I now see both drives.

 

Now, with both connected and the array stopped, I can click each drive and see its stats/settings.

 

OCZ: says "partition format: error" (screenshot attached)

Samsung: under "btrfs filesystem show:" it says "warning, device 2 is missing" (screenshot attached)

 

When I assign the drives back to the cache and then start the array, the cache says it is "unmountable" and I am given the option to format the Samsung drive (screenshot attached). Also, on the main page, I can see that the OCZ drive shows up with a blue icon to the left, which says "no data". I really don't mind if one of the drives is broken beyond repair (I can easily replace them), but I would very much like to be able to recover the data that was on them if at all possible.

 

Can someone help me and give some advice? If you need any more info, please ask (I'm a noob, so you might need to give me step-by-step instructions).

 

ATTACHMENTS: http://imgur.com/a/G231c (sorry, they failed to attach here, so I had to host them externally; I blacked out some identifying codes)

 

PS: this is Unraid 6.1.x

Link to comment

Also, I tried unassigning the OCZ drive, assigning only the Samsung drive as the cache drive, and starting the array, and the outcome is the same: it says unmountable and asks me to format the Samsung drive at the bottom of the page.

Link to comment

More strange behaviour:

 

I turned the PC off, unplugged the OCZ and plugged it into a different SATA port just to check, then powered the PC up again, assigned it as the 2nd cache drive (the Samsung is still the first), then started the array. The cache remains "unmountable", but this time the blue circle by the OCZ is now green, and the OCZ stats page doesn't show any error (though starting the array this time took much longer than every other time I have started it, more like a minute or so vs 5 secs). However, when I stop the array, the OCZ is gone: it doesn't show up in any of the dropdowns, and I can no longer assign it because it's not showing up. I'm really confused now. I bet if I restart or plug it into a different SATA port, it will show up again.

Link to comment

Physically unplugging the OCZ: I assigned the Samsung as the cache drive and started the array, and the outcome is the same as the original screenshot. The cache says unmountable and it asks me to format the Samsung.

 

----------------

 

 

Physically unplugging the Samsung:

 

Attempt 1: assigned the OCZ as the only cache drive. The blue circle still remains next to the drive, saying it's a new drive, and the error is still on the drive's status page (from the original screenshot). Started the array, and the cache is unmountable and asks me to format the OCZ drive at the bottom of the page.

 

Attempt 2: the Samsung remains unplugged, but I plugged the OCZ into the SATA port the Samsung had been plugged into. Green circle now; I assigned it as the cache drive and started the array. It took a long time to start the array again (every other time literally takes 5 secs, this took 1-2 minutes). The cache remains unmountable and asks me to format the OCZ (green dot still next to it). Stopping the array makes the green dot disappear and a blue dot appear...

Link to comment

If neither cache disk mounts by itself you can try to fix the filesystem, again using one disk at a time, see the wiki; unfortunately there's not much experience in the forum fixing BTRFS disks.

 

Hmm, I had a quick look at the wiki, but I can't seem to find any actual instructions on fixing the file system. I came across the "Fix Common Problems" plugin (http://lime-technology.com/forum/index.php?topic=48972), installed it, ran a scan, and it suggests the following:

 

cache (OCZ-ARC100_XXXXXXXXXXXXXXX) has file system errors (No file system (no btrfs UUID))

 

If the disk if XFS / REISERFS, stop the array, restart the Array in Maintenance mode, and run the file system checks. If the disk is BTRFS, then just run the file system checks (unRaid Main <-- Button) If the disk is listed as being unmountable, and it has data on it, whatever you do do not hit the format button. Seek assistance HERE

 

The button above brings me to the main page, which is not helpful. Obviously I need to run filesystem checks, but I don't know how to do this. =/

Link to comment

Ran the same plugin scan with the Samsung drive plugged in; I'm getting a slightly different error:

 

cache (Samsung_SSD_850_EVO_250GB_XXXXXXXXXXX) has file system errors (No file system (32))

If the disk if XFS / REISERFS, stop the array, restart the Array in Maintenance mode, and run the file system checks. If the disk is BTRFS, then just run the file system checks (unRaid Main) If the disk is listed as being unmountable, and it has data on it, whatever you do do not hit the format button. Seek assistance HERE

Link to comment

If neither cache disk mounts by itself you can try to fix the filesystem, again using one disk at a time, see the wiki; unfortunately there's not much experience in the forum fixing BTRFS disks.

 

Hmm, I had a quick look at the wiki, but I can't seem to find any actual instructions on fixing the file system. I came across the "Fix Common Problems" plugin (http://lime-technology.com/forum/index.php?topic=48972), installed it, ran a scan, and it suggests the following:

 

cache (OCZ-ARC100_XXXXXXXXXXXXXXX) has file system errors (No file system (no btrfs UUID))

 

If the disk if XFS / REISERFS, stop the array, restart the Array in Maintenance mode, and run the file system checks. If the disk is BTRFS, then just run the file system checks (unRaid Main <-- Button) If the disk is listed as being unmountable, and it has data on it, whatever you do do not hit the format button. Seek assistance HERE

 

The button above brings me to the main page, which is not helpful. Obviously I need to run filesystem checks, but I don't know how to do this. =/

 

I think you are looking for this:

From the main page, click on the device name "cache" to get to the device settings and press the Scrub button to check the file system. I believe you would want both drives attached before doing so. If you post the results, somebody who knows more than me would likely be able to recommend the next step.
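
For reference, the GUI Scrub button roughly corresponds to running btrfs's own scrub from the command line. This is a sketch only, and it assumes the pool is actually mounted at /mnt/cache (a scrub needs a mounted filesystem, so it won't help if the pool stays unmountable):

btrfs scrub start -B /mnt/cache   # -B runs in the foreground and prints a summary when finished
btrfs scrub status /mnt/cache     # or check progress/results separately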

Link to comment

Just a side note: be careful with the drives' garbage collection too.

The SSD garbage collection routine may delete your data beyond any potential recovery because it incorrectly thinks there's no data there. It is possible.

So if you are not actively in the process of trying to restore (e.g. while you're still researching), I would suggest not leaving the two SSDs plugged in just because it's more convenient.

Link to comment

I forgot the wiki only mentions scrub, which is not useful for an unmountable pool; you can try restore:

 

btrfs restore -v /dev/sdX1 /mnt/disk1/folder

 

This will try to copy all data; it's non-destructive. Replace X with the correct letter. In the above example it will try to copy to disk1/folder (create the folder first); the cache disk has to be unassigned/unmounted and the array started.
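
As a concrete sketch, with the device letter and destination as placeholders (check your own device letter on the Main page or with lsblk first):

lsblk                                                  # identify the right /dev/sdX for the cache SSD
mkdir -p /mnt/disk1/cache_restore                      # create the destination folder on an array disk first
btrfs restore -v /dev/sdX1 /mnt/disk1/cache_restore    # only reads the source; copies whatever it can recover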

 

If it doesn't work, as a last resort you can try btrfs repair; this is destructive and it is possible (though unlikely) that it damages the file system even more.

 

btrfs check --repair /dev/sdX1

 

Replace X with the correct letter; the disk can't be mounted while you run it.
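
If in doubt, a read-only check first will report problems without writing anything. A sketch, with sdX again a placeholder:

btrfs check /dev/sdX1            # read-only, only reports errors
btrfs check --repair /dev/sdX1   # destructive, last resort only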

 

 

 

Link to comment

I think you are looking for this:

From the main page, click on the device name "cache" to get to the device settings and press the Scrub button to check the file system. I believe you would want both drives attached before doing so. If you post the results, somebody who knows more than me would likely be able to recommend the next step.

 

Well, sadly, I can't do this. When both drives are selected and I start the array, the "cache" section says "unmountable" and at the bottom of the page it asks me to format the Samsung drive (refer to the original post screenshot, still the same). When I click the "cache" link to look at the settings/status page, the "Scrub"/"Balance" buttons are disabled (enabling help shows "Scrub is only available when the Device is Mounted", obviously implying that I can't scrub while it says "unmountable").

 

...back to square one?

 

PS: I tried starting the array in both normal mode and maintenance mode; exact same outcome.

Link to comment

When I tried the above, just above the Scrub button under the pool information, I see the following (maybe this is helpful to someone who knows what it means):

 

btrfs filesystem show:

warning, device 2 is missing

Label: none  uuid: cb097edd-62ab-4c2d-b7a7-XXXXXXXXXX

Total devices 2 FS bytes used 116.30GiB

devid    1 size 232.89GiB used 225.03GiB path /dev/sdb1

*** Some devices missing

 

btrfs-progs v4.1.2

PS: UUID partially hidden, not that anyone cares.

Link to comment

I forgot the wiki only mentions scrub, which is not useful for an unmountable pool; you can try restore:

 

btrfs restore -v /dev/sdX1 /mnt/disk1/folder

 

This will try to copy all data; it's non-destructive. Replace X with the correct letter. In the above example it will try to copy to disk1/folder (create the folder first); the cache disk has to be unassigned/unmounted and the array started.

 

If it doesn't work, as a last resort you can try btrfs repair; this is destructive and it is possible (though unlikely) that it damages the file system even more.

 

btrfs check --repair /dev/sdX1

 

Replace X with the correct letter; the disk can't be mounted while you run it.

 

Ah oops, I didn't see your response, sorry; you must have posted while I was writing. I'm going to try your method now, let's see how this goes. Will update shortly.

Link to comment

btrfs restore -v /dev/sdX1 /mnt/disk1/folder

 

This will try to copy all data; it's non-destructive. Replace X with the correct letter. In the above example it will try to copy to disk1/folder (create the folder first); the cache disk has to be unassigned/unmounted and the array started.

 

I used the above command to copy the Samsung drive (the OCZ was left unplugged; the Samsung drive shows sdb next to it, so I used sdb1 in the above command: "btrfs restore -v /dev/sdb1 /mnt/disks/sg4tb/ssd_backup/") onto one of my other unused HDDs mounted by the "Unassigned Devices" plugin. As soon as I ran the command, I got a lot of lines saying "Trying another mirror", followed by this confirmation message:

 

We seem to be looping a lot on /mnt/disks/sg4tb/ssd_backup/inti_main/Inti: Windows 10 Media Manager/vdisk1.img, do you want to keep going on ? (y/N/a):

 

After entering Y, it finished, but only 5GB was used on the new disk. The VM that was on the cache drive alone had at least 80+GB on it, so I'm a bit confused now. When I use the Unraid GUI to browse this drive, I can find the vdisk1.img on this new drive, which says 120GB, but only 5GB of the drive is being used... hmm, any other suggestions?
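
For what it's worth, a vdisk image is usually a sparse file, so its nominal size and the space it actually occupies on disk can differ. One way to compare the two (a sketch; the path is only an example of where the restore put the file):

ls -lh /mnt/disks/sg4tb/ssd_backup/vdisk1.img   # apparent (logical) size of the image
du -h  /mnt/disks/sg4tb/ssd_backup/vdisk1.img   # space actually allocated on disk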

Link to comment

I ran the above restore command with the -D flag (a dry run, apparently) and noticed a whole bunch of lines similar to the following:

 

parent transid verify failed on 139152146432 wanted 126699 found 75597
parent transid verify failed on 139152424960 wanted 126699 found 75597
parent transid verify failed on 139152424960 wanted 126699 found 75597
parent transid verify failed on 139152424960 wanted 126699 found 75597
parent transid verify failed on 139152424960 wanted 126699 found 75597
parent transid verify failed on 139138514944 wanted 126695 found 75597
parent transid verify failed on 139135254528 wanted 126695 found 75597
parent transid verify failed on 139130388480 wanted 126695 found 75597

 

Is this normal?

Link to comment

btrfs check --repair /dev/sdX1

 

Tried this on the Samsung drive, but no luck... the output ended with the following:

 

 

 

parent transid verify failed on 139120934912 wanted 126695 found 75703
... lots of lines like this ^ ^ ...
parent transid verify failed on 139152408576 wanted 126699 found 75597
cmds-check.c:3771: check_owner_ref: Assertion `rec->is_root` failed.
btrfs[0x41c616]
btrfs[0x42568e]
btrfs[0x4263c9]
btrfs[0x42705e]
btrfs(cmd_check+0x149d)[0x42a0ad]
btrfs(main+0x153)[0x40a447]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2ad52e635d05]
btrfs[0x409fa9]

 

Any more suggestions?

Link to comment

Tried the repair command on the OCZ drive as well, and the output was pretty much the same as above.

 

Mounting them both (or one at a time) as the cache drives: it still says "unmountable" and asks me to format.

 

I suppose I should have taken scheduled backups of the VM (though I don't believe there is a built-in way to do this), and I would have been happy to scrap all the recovery attempts and restore the backup. Does this mean the data is beyond recovery? Just to clarify, the only data I'm concerned about is the data within the Windows VM that was installed on the cache drives (I would be fairly satisfied if I could fetch a few settings/config/DB files that were saved on this Windows VM's C drive; that would save me a lot of setup time).

Link to comment

... I suppose I should have taken scheduled backups of the VM ...

 

ESPECIALLY if you are choosing to run without a UPS. Your first move should be to add a UPS to your setup. But then you absolutely want to periodically back up your VMs to the protected array.

 

Link to comment

ESPECIALLY if you are choosing to run without a UPS. Your first move should be to add a UPS to your setup. But then you absolutely want to periodically back up your VMs to the protected array.

Yup, loud and clear. Sadly it's hard for me to find an APC UPS, but I've been looking around and will get one soon. A week or two ago I did look into backing up the VM, but I didn't see any useful information regarding an easy way to back up and restore a VM; maybe I was searching badly (please link me if anyone knows of a good method).

 

For the time being though, is there any immediate solution or direction I should be heading in? My array is offline, and I'm sitting here refreshing this page every few minutes hoping to see suggestions and advice on what to do next. If at least someone can confirm that recovery is probably pointless and that I should cut my losses and format/restart the cache pool from scratch, at least I can start heading in that direction.

Link to comment

On another note though, I'm really surprised at how easy it was to break the pool. You would think that since btrfs essentially gives you a mirror across the cache drives, whatever happens you would still have two drives with the same data and could easily recover from a loss. But something like this happens where there is no physical issue with the drives (just some file system error?), and it leaves both drives useless, purely because of a software issue. It kind of makes me wish I had installed my VM on a different drive (not on the btrfs cache pool). In the future, I will most definitely try to install my VM on an unassigned device using that plugin if at all possible (as I now see that the risk of software corruption is greater than that of a hardware fault; then again, you save space since you will be taking backups anyway).

Link to comment

BTRFS seems to be very sensitive to power failures/unclean shutdowns; IMO it should not be used without a UPS.

 

There's one more thing you can try if there are errors like this in your syslog when trying to mount the cache disks:

 

BTRFS: failed to read log tree

 

If there are, try running:

 

btrfs rescue zero-log /dev/sdX1
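
A sketch of how that might look; the grep is just one way to look for the message, and sdX is a placeholder for the unmounted cache device:

grep -i "failed to read log tree" /var/log/syslog   # confirm the error is actually present first
btrfs rescue zero-log /dev/sdX1                     # clears the btrfs log tree; only helps with this specific failure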

 

Other than that, no more ideas; like I said before, there's not much experience here fixing BTRFS, and the tools appear not to be as good as those for XFS and especially REISER.

 

 

 

 

Link to comment

If there are, try running:

btrfs rescue zero-log /dev/sdX1

 

No luck, I get the following:

 

# btrfs rescue zero-log /dev/sdb1
warning, device 2 is missing
warning devid 2 not found already
Clearing log on /dev/sdb1, previous log_root 0, level 0
Unable to find block group for 0
extent-tree.c:289: find_search_start: Assertion `1` failed.
btrfs[0x441dd1]
btrfs(btrfs_reserve_extent+0x8a4)[0x4472e5]
btrfs(btrfs_alloc_free_block+0x57)[0x447633]
btrfs(__btrfs_cow_block+0x163)[0x4384f7]
btrfs(btrfs_cow_block+0xd0)[0x438e04]
btrfs[0x43ddba]
btrfs(btrfs_commit_transaction+0xeb)[0x43f940]
btrfs(cmd_rescue_zero_log+0x29d)[0x42f971]
btrfs(handle_command_group+0x5d)[0x40a2ef]
btrfs(cmd_rescue+0x15)[0x42f9a1]
btrfs(main+0x153)[0x40a447]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b32a58bcd05]
btrfs[0x409fa9]

 

Link to comment

As Johnnie noted, btrfs is very sensitive to power loss => you'd likely have been fine if you'd had a UPS, but that ship has clearly sailed. At this point you simply need to start over and rebuild your cache and your VMs. I'd definitely add a UPS before starting that process... and of course also make backups of your VMs to the protected array. Backing up a VM is simple -- just shut down the VM, then copy the VM files to a folder on your array.
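
A minimal sketch of such a backup; the share and VM names are only examples, and it assumes the VM has been shut down first:

mkdir -p /mnt/user/backups/vms/MyVM                                # example destination on the protected array
rsync -avS /mnt/cache/domains/MyVM/ /mnt/user/backups/vms/MyVM/    # -S keeps the vdisk image sparse while copying
virsh dumpxml MyVM > /mnt/user/backups/vms/MyVM/MyVM.xml           # optionally save the VM definition as well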

 

Link to comment
