Btrfs cache lost due to stupid error on my part


Endy
Solved by JorgeB

Generally I have a rule that I don't work on things when I am really tired, so that I don't make stupid mistakes.

Last night I was really tired, ignored that rule, and now I am paying the price.

 

I just swapped the motherboard/processor/RAM in my Unraid server yesterday, and that went well. I also added 2 NVMe drives to use for my cache pool, replacing the 1 NVMe drive I had been using, so that I would have a little redundancy there. (I know it's not a proper backup; I am actually in the process of setting up a more proper backup system.)

 

I added 1 of the new NVMe drives to the cache pool; that seemed fine and the balance completed. This is where I should have stopped and waited until after I had slept. Instead I was impatient and thought it was a good idea to just swap the old NVMe drive for the other new one directly in the GUI. Bad idea. The btrfs pool is now gone. I turned off the server, went to bed, and now here I am the next day.
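For anyone who finds this thread later: the btrfs-native way to swap one device for another, without the add-then-remove dance, is `btrfs replace`. A sketch only; the device paths and mount point below are hypothetical, and on Unraid you would normally let the GUI drive pool changes:

```shell
OLD=/dev/nvme1n1p1   # device leaving the pool (hypothetical path)
NEW=/dev/nvme2n1p1   # replacement device (hypothetical path)
MNT=/mnt/cache       # pool mount point (hypothetical)
# Built and printed rather than executed, since this rewrites a live pool;
# progress would afterwards be checked with 'btrfs replace status'.
CMD="btrfs replace start -f $OLD $NEW $MNT"
echo "$CMD"
```

The point is that replace copies data directly onto the new device while the pool stays mounted, so there is never a moment where the pool has too many missing devices.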

 

I've tried the restore options in the FAQ up until the last option. Currently I have the original NVMe drive and the one I added first yesterday assigned to the cache pool, like it was when it was working before I made the mistake.

 

I have a feeling that I am screwed, but is there any way to restore from here?

 

 

root@Turtle:~# btrfs fi show
Label: none  uuid: 849f7ded-6fbb-4f5f-9627-fa781a175567
        Total devices 2 FS bytes used 714.96GiB
        devid    1 size 931.51GiB used 826.48GiB path /dev/sdd1
        devid    2 size 931.51GiB used 826.48GiB path /dev/sdb1

Label: none  uuid: db14842a-e91c-4154-bdd8-fbe89dfc7ce3
        Total devices 1 FS bytes used 340.00KiB
        devid    1 size 20.00GiB used 536.00MiB path /dev/loop2

Label: none  uuid: 6331b51d-add2-421b-a015-22c674718eb5
        Total devices 1 FS bytes used 412.00KiB
        devid    1 size 1.00GiB used 126.38MiB path /dev/loop3

 

sdd and sdb are part of another btrfs pool, unrelated to the cache.

 

turtle-diagnostics-20231212-1040.zip

root@Turtle:~# fdisk -l /dev/nvme2n1
Disk /dev/nvme2n1: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: CT1000P1SSD8                            
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
root@Turtle:~# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: WD_BLACK SN850X 2000GB                  
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

root@Turtle:~# sfdisk /dev/nvme2n1

Welcome to sfdisk (util-linux 2.38.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Checking that no-one is using this disk right now ... OK

Disk /dev/nvme2n1: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: CT1000P1SSD8                            
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

sfdisk is going to create a new 'dos' disk label.
Use 'label: <name>' before you define a first partition
to override the default.

Type 'help' to get more information.

>>> 2048
Created a new DOS disklabel with disk identifier 0xfc45e933.
Created a new partition 1 of type 'Linux' and of size 931.5 GiB.
/dev/nvme2n1p1 :         2048   1953525167 (931.5G) Linux
/dev/nvme2n1p2: 

root@Turtle:~# sfdisk /dev/nvme2n1

Welcome to sfdisk (util-linux 2.38.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Checking that no-one is using this disk right now ... OK

Disk /dev/nvme2n1: 931.51 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: CT1000P1SSD8                            
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

sfdisk is going to create a new 'dos' disk label.
Use 'label: <name>' before you define a first partition
to override the default.

Type 'help' to get more information.

>>> 64
Created a new DOS disklabel with disk identifier 0xe90b059d.
Created a new partition 1 of type 'Linux' and of size 931.5 GiB.
Partition #1 contains a btrfs signature.

Do you want to remove the signature? [Y]es/[N]o: 

 

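The interactive session above can also be expressed non-interactively: "64," tells sfdisk to start partition 1 at sector 64 and use the rest of the disk, which is why the old btrfs signature reappears. A sketch using the device from this thread (answer "No" when asked about removing the signature, since that signature is the filesystem being recovered):

```shell
DEV=/dev/nvme2n1
SPEC="64,"   # start sector 64, default size = rest of the disk
# Built and printed, not executed, because writing a partition table is destructive.
CMD="printf '%s\n' '$SPEC' | sfdisk $DEV"
echo "$CMD"
```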

That worked. 

 

/dev/nvme2n1p2: write

New situation:
Disklabel type: dos
Disk identifier: 0xe90b059d

Device         Boot Start        End    Sectors   Size Id Type
/dev/nvme2n1p1         64 1953525167 1953525104 931.5G 83 Linux

The partition table has been altered.
Calling ioctl() to re-read partition table.
Syncing disks.
root@Turtle:~# btrfs fi show
Label: none  uuid: 849f7ded-6fbb-4f5f-9627-fa781a175567
        Total devices 2 FS bytes used 714.96GiB
        devid    1 size 931.51GiB used 826.48GiB path /dev/sdd1
        devid    2 size 931.51GiB used 826.48GiB path /dev/sdb1

Label: none  uuid: db14842a-e91c-4154-bdd8-fbe89dfc7ce3
        Total devices 1 FS bytes used 340.00KiB
        devid    1 size 20.00GiB used 536.00MiB path /dev/loop2

Label: none  uuid: 6331b51d-add2-421b-a015-22c674718eb5
        Total devices 1 FS bytes used 412.00KiB
        devid    1 size 1.00GiB used 126.38MiB path /dev/loop3

warning, device 2 is missing
Label: none  uuid: 9bceeb35-e26e-47d8-9ef3-3534abbaa204
        Total devices 2 FS bytes used 383.72GiB
        devid    1 size 931.51GiB used 884.05GiB path /dev/nvme2n1p1
        *** Some devices missing

*edited to show the part after 'write' as well

Edited by Endy
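As a sanity check, the geometry sfdisk printed is internally consistent: a partition from sector 64 through 1953525167 covers 1953525104 sectors, and at 512 bytes per sector that is the 931.5 GiB shown:

```shell
START=64
END=1953525167
SECTOR=512
SECTORS=$(( END - START + 1 ))
echo "$SECTORS"                # matches the 'Sectors' column sfdisk printed
BYTES=$(( SECTORS * SECTOR ))
# Tenths of a GiB, using integer arithmetic only.
GIB_X10=$(( BYTES * 10 / (1024 * 1024 * 1024) ))
echo "$GIB_X10"                # 9315 -> 931.5 GiB
```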
  • Solution

I assume nvme2n1, the Crucial device, was the original cache? If yes, we can try to mount it alone; if needed, you can then still try to recover the other device.

 

To mount that device alone, unassign both pool members from the pool, start array, stop array, re-assign only the Crucial device, start array, post new diags.
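Outside the Unraid GUI, the manual equivalent of mounting a single member of a two-device pool would be a degraded (ideally read-only) mount. A sketch, with an assumed scratch mount point:

```shell
DEV=/dev/nvme2n1p1   # the Crucial device's partition
MNT=/mnt/test        # hypothetical scratch mount point
# Built and printed rather than run; 'degraded' lets btrfs mount with a device missing.
CMD="mount -o degraded,ro $DEV $MNT"
echo "$CMD"
```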


It's alive! As far as I can tell, everything seems to be there. Docker started up and seems to be working.

 

Am I out of the woods now?

If so, since I plan on using just the 2 new drives for cache and removing the Crucial drive, would the next step be to temporarily move all data off the cache pool, then delete and recreate the cache pool using just the 2 new drives, and move the data back?

turtle-diagnostics-20231213-0735.zip

Link to comment

Ok, it finished.

 

If I am planning on removing the Crucial drive, do I want/need to add one of the new drives? Just want to make sure I don't mess up again.

 

Also, thank you so much for your help, JorgeB. While I wouldn't have lost any irreplaceable data, you have saved me countless hours recreating and setting everything up. I truly appreciate it.

turtle-diagnostics-20231213-0936.zip

Dec 13 09:35:19 Turtle kernel: BTRFS error (device nvme2n1p1): bdev /dev/nvme2n1p1 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
Dec 13 09:35:33 Turtle kernel: BTRFS info (device nvme2n1p1): balance: ended with status: -5

 

The balance didn't finish because some data corruption was detected, which is possibly also why you had issues before adding the device. In this case you should not add the new device; instead, create a new pool with the new device(s), copy everything you can, then remove the old device.
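The suggested plan, sketched as commands (device paths and mount points are hypothetical; on Unraid the pool creation itself happens in the GUI):

```shell
NEW1=/dev/nvme0n1p1   # hypothetical new device 1
NEW2=/dev/nvme1n1p1   # hypothetical new device 2
SRC=/mnt/cache        # old pool with the corruption
DST=/mnt/newpool      # hypothetical new pool mount point
# raid1 mirrors both metadata and data across the two new devices.
MKFS="mkfs.btrfs -m raid1 -d raid1 $NEW1 $NEW2"
COPY="rsync -a $SRC/ $DST/"
# Printed rather than executed, since mkfs would wipe the named devices.
printf '%s\n' "$MKFS" "$COPY"
```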

 

23 minutes ago, Endy said:

Also, thank you so much for your help JorgeB.

You're welcome.


Hopefully a small hiccup... I stopped the array and created a new pool for the 2 new drives, and when I tried to start the array I got a message saying

 

"Wrong Pool State

cache - too many missing/wrong devices"

 

I didn't touch the original cache pool. I tried deleting the new pool and got the same message.

1 hour ago, Endy said:

I tried deleting the new pool and same message.

Missed this part, so this is about the old pool, possibly because the missing device failed to be removed. Try re-importing the pool: unassign the old cache device, assign the new ones to a new pool now (because you may have the same issue later after an array stop), start the array, stop the array, re-assign the old cache device, start the array.

