Did Unraid just purge all data from my cache pool?



Hi everyone,

 

I had to open up my Unraid box to replace a fan. Upon closing it back up, I forgot to re-attach the SATA cable of one of my cache drives. I only realised this because of a strange CPU spike at idle directly after boot, which turned out to be a btrfs balance. That was when I knew something was afoot, and upon checking I saw that one of the cache drives was missing. So I shut Unraid down again and re-attached the cable (easy enough).

 

However, contrary to my expectation, the disk was not simply added back to the cache pool. Instead, it was listed under unassigned devices as a btrfs device. So I stopped the array and assigned the drive back into the cache pool. It was then listed as inaccessible, and Unraid said it would have to be reformatted. I thought it might just be a hiccup, so I rebooted again. Alas, after booting it still said the drive needed to be formatted. I wondered why on Earth it would not simply be added back into the pool, since all I did was boot with a disconnected SATA cable, but I gave up and reformatted it. Except Unraid went ahead and formatted the WHOLE CACHE POOL, including the drive that was still fine and contained all the data! I immediately stopped the array.
 

I had a look at the respective section in the FAQ: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=543490

 

But none of the steps could restore any data. I also tried the btrfs-undelete script, but to no avail: https://gist.github.com/Changaco/45f8d171027ea2655d74
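For reference, what those attempts boiled down to is roughly the following (the device name is the one from my pool, the root bytenr and the restore target directory are just placeholders):

btrfs-find-root /dev/sdd1                                              # list candidate tree roots on the wiped device
btrfs restore -t <root_bytenr> -v -i /dev/sdd1 /mnt/disk1/restore/     # try to copy files out using one of those roots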

 

Is there anything else I can do to restore the data, or should I proceed to rebuild the VMs from scratch? Also, if anyone can shed light on what just transpired and why my mirrored cache pool simply imploded in this situation, that would be highly appreciated.

 

Thanks a lot in advance!


Hi @johnnie.black, thanks for your quick response! Is it a general rule that one should not act "too fast", i.e. before the balance is finished?

 

Also, I attached the diags. The wipe of both cache devices is clearly visible in the syslog. But maybe you can see more in there that might help recover the data? Thanks anyway for having a look, it is highly appreciated!

 

Edited by ledon
removed diags
7 minutes ago, ledon said:

Is this a general rule, that one should not act "too fast", i.e. before the balance is finished?

If a btrfs balance is running, the Stop button will be inhibited, with the reason why shown:

 

[screenshot: array Stop button inhibited while the balance is running]
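If you'd rather check from the command line, the balance state can also be queried directly (mount point shown is the usual Unraid cache path, adjust if yours differs):

btrfs balance status /mnt/cache    # reports whether a balance is running and how far along it is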

 

8 minutes ago, ledon said:

The wipe of both cache devices is clearly visible in the syslog.

It is, but your pool wasn't redundant, and that is why it was unmountable the first time, possibly the result of being created on v6.7.x due to a bug:

 

Aug  3 17:09:29 CubeZero kernel: BTRFS warning (device sdd1): devid 1 uuid 82740355-53fc-4d7a-8aaf-0ec4de6f38ce is missing
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Aug  3 17:09:29 CubeZero kernel: BTRFS warning (device sdd1): chunk 402246860800 missing 1 devices, max tolerance is 0 for writeable mount
Aug  3 17:09:29 CubeZero kernel: BTRFS warning (device sdd1): writeable mount is not allowed due to too many missing devices

You then formatted the pool, and yes, by doing that it wiped both devices. It still might be possible to recover the pool using a backup superblock, see here, but I can't really help with this since I've never used it; you can ask for help on the #btrfs IRC channel like mentioned in that thread.
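Since I've never used it myself, take this only as a rough sketch of the kind of commands that thread is talking about (device name taken from your syslog, mount point is just an example):

btrfs rescue super-recover -v /dev/sdd1              # try to repair the primary superblock from one of its backup copies
mount -o ro,usebackuproot /dev/sdd1 /mnt/recovery    # or attempt a read-only mount falling back to a backup tree root

There's also btrfs-select-super -s 1 <device>, which overwrites the primary superblock with backup copy #1, but that's more drastic, so better ask on #btrfs before trying it.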

 


Hm, thanks, interesting to know that the pool actually was not redundant. That actually raises more questions than it answers, since it only had the capacity of one drive instead of both, but if it was a bug, I guess that might account for it. Anyway, thanks again, I will have a look at the superblock option!

1 minute ago, ledon said:

That raises actually more questions than it answers since it only had the capacity of one instead of both drives

The bug is that only the data is redundant, the metadata isn't. Metadata takes very little space, but if a device fails or is missing, the whole pool is lost / won't mount.
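You can check the profiles per chunk type yourself; a quick sketch (cache mount point is the usual Unraid one, the output lines are only illustrative):

btrfs filesystem df /mnt/cache
#   Data, RAID1: ...           <- data is mirrored
#   Metadata, single: ...      <- metadata only on one device, which is the bug

On a pool that is still healthy this can be fixed by converting the metadata to raid1 as well:

btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache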

 

 

2 hours ago, johnnie.black said:

The bug is that only the data is redundant, the metadata isn't. Metadata takes very little space, but if a device fails or is missing, the whole pool is lost / won't mount.

 

 

Alright, thanks, that explains the behaviour I saw!

