Problem replacing cache pool drives

hawgorn · July 14, 2023

Short version:

I had a bad SMART readings on my cache 500GB SSD (A-drive), and decided to replace both drives with 1TB drives. I replaced the bad drive with a new one, and the pool (apparently) rebuilded. But when I replaced the second drive (B-drive) the result was “Unmountable: Too many missing/misplaced device” on both drives. I managed to return to working pool by replacing the second original drive (B) back [1].

But now, after a lot of fiddling around with the old and new cache drives (yeah, right… bad move) I have “Unmountable: No pool uuid” on both of the old drives in place. Am I foobar?

Long version:

After I got the pool working [1, above] I decided to play safe and try moving everything to array first. The mover did not move everything, I’d say that VM’s and Docker did not move (yes, they were shutdown and unenabled in settings). I tried copying the content of cache to a backup-share with Dynamix file manager. I let it run overnight, but when I went to bed it copying status was like 120% and growing?? That copy process did not work, all of cache was still not copied (when comparing Calculate Occupied space). I then tried using MC, straight to disk where Mover had moved some files, but it stalled on several VM-images and there was also some errors (didn’t record the error, silly me). At this stage I still had a working cache pool by using both of the original drives.

I then cleared both partition tables of the 1TB disks with gparted on a linux machine , and I then tried the way of the FAQ to replace the drives. I put one new disk to a spare sata port and assigned it to cache to but received “Unmountable: Too many missing/misplaced device” straight away. I then tried clearing the cache config, changing different drive combinations (but always keeping old A & B-drives in their respective slots). Nothing worked, either I got “too many missing..” or “No pool uuid”. And now I have both of the old cache drives in place with “Unmountable: No pool uuid”. How screwed am I?

I do have recent backups of everything (appdata, docker.img, VM’s), should I just give in, restore from those, or can the cache be saved?

ziggy-nas-diagnostics-20230714-1326.zip

JorgeB · July 14, 2023

Post the output of

btrfs fi show

sdb and sdd were the original pool members correct?

hawgorn · July 14, 2023

Yes, sdb and sdd were the orinal pool members.

----------

warning, device 4 is missing
warning, device 3 is missing
ERROR: cannot read chunk root
Label: none uuid: f7747906-ca16-4858-b91f-8f600a5b448d
        Total devices 3 FS bytes used 108.99GiB
        devid    5 size 476.94GiB used 17.00GiB path /dev/sdb1
        *** Some devices missing

JorgeB · July 14, 2023

Pool has 2 missing devices, I understood that original pool was 2 devices?

Post output of

btrfs-select-super -s 1 /dev/sdd1

And again

btrfs fi show

hawgorn · July 14, 2023

11 minutes ago, JorgeB said:
btrfs-select-super -s 1 /dev/sdd1

warning, device 4 is missing
using SB copy 1, bytenr 67108864

11 minutes ago, JorgeB said:
btrfs fi show

warning, device 4 is missing
Label: none uuid: f7747906-ca16-4858-b91f-8f600a5b448d
        Total devices 3 FS bytes used 108.99GiB
        devid    3 size 465.76GiB used 153.03GiB path /dev/sdd1
        devid    5 size 476.94GiB used 17.00GiB path /dev/sdb1
        *** Some devices missing

----------------

Original pool was 2 devices: samsung EVO 500GB and Pro 512GB

JorgeB · July 14, 2023

48 minutes ago, hawgorn said:

Original pool was 2 devices:

Then something went wrong with the replacement, because it now thinks there should be three devices, lets see if it mounts with these alone two but doubt it, since it doesn't look like a fully redundant pool based on data distribution:

stop array

unassign all pool devices

start array to reset the pool

stop array

reassign both pool devices

start array and post new diags.

hawgorn · July 14, 2023

I'll be damned, the two original disks came online as a pool. And I did the same procedure once or twice before calling for help. Third time is the charm?

Here's the new diags.

ziggy-nas-diagnostics-20230714-2041.zip

JorgeB · July 14, 2023

Not out of the woods yet:

             Data      Metadata  System                              
Id Path      RAID1     RAID1     RAID1    Unallocated Total     Slack
-- --------- --------- --------- -------- ----------- --------- -----
 3 /dev/sdd1 150.00GiB   3.00GiB 32.00MiB   312.73GiB 465.76GiB     -
 4 missing   133.00GiB   3.00GiB 32.00MiB   340.91GiB 476.94GiB     -
 5 /dev/sdb1  17.00GiB         -        -   459.94GiB 476.94GiB     -

sdd has most of the data and it appears to be failing, lets see if it can finish the removing the missing device.

hawgorn · July 14, 2023

31 minutes ago, JorgeB said:

Not out of the woods yet:

             Data      Metadata  System                              
Id Path      RAID1     RAID1     RAID1    Unallocated Total     Slack
-- --------- --------- --------- -------- ----------- --------- -----
 3 /dev/sdd1 150.00GiB   3.00GiB 32.00MiB   312.73GiB 465.76GiB     -
 4 missing   133.00GiB   3.00GiB 32.00MiB   340.91GiB 476.94GiB     -
 5 /dev/sdb1  17.00GiB         -        -   459.94GiB 476.94GiB     -

sdd has most of the data and it appears to be failing, lets see if it can finish the removing the missing device.

sdb is/was the one that SMART outed.

How should I proceed, what do you suggest?

JorgeB · July 14, 2023

For now wait some time to see if it can finish the balance.

hawgorn · July 15, 2023

No balancing happening, stll sitting at same numbers. And you are right, sdd IS the failing one, I misread.

             Data      Metadata System
Id Path      RAID1     RAID1     RAID1    Unallocated Total     Slack
-- --------- --------- --------- -------- ----------- --------- -----
3 /dev/sdd1 150.00GiB   3.00GiB 32.00MiB   312.73GiB 465.76GiB     -
4 missing   133.00GiB   3.00GiB 32.00MiB   340.91GiB 476.94GiB     -
5 /dev/sdb1 17.00GiB         -        -   459.94GiB 476.94GiB     -
-- --------- --------- --------- -------- ----------- --------- -----

JorgeB · July 15, 2023

Best option might be trying to copy anything you can from the pool and then recreate.

hawgorn · July 15, 2023

Would adding a third drive to the pool and hoping it would balance be asking for more headache?

JorgeB · July 15, 2023

It won't help, the problem is that the failed drive cannot rebuild the new drive correctly due to read errors, the same would happen if you added a 3rd.

hawgorn · July 15, 2023

2 hours ago, JorgeB said:

Best option might be trying to copy anything you can from the pool and then recreate.

ok, thanks. How is this for a plan?

-delete old pool

-create new pool with the new disks.

-point cache shares to the new pool

cache shares are are now cache:yes, so everything is in theory in array

-copy backups to cache shares in array

-change cache:prefer

-invoke mover

-cross fingers

JorgeB · July 15, 2023

If I'm understanding correctly what you mean it should work.

hawgorn · July 15, 2023

Jorge, you are awesome, thank you!! I'm up and running again.

Two of the Debian VM's don't work, but atleast one of thme should just be a matter of fixing grub.

hawgorn · July 18, 2023

Apparently I'm still not out of the woods. Last night I started working on the VM's. I got "cache is full or read only". I let the server sit for the night. This morning the primary cache drive was in standby mode. I stopped the array, changed spin down delay to 'never'. After starting array the cache disko were 'unmountable: no file system'. Clearing pool config did not work.

ziggy-nas-diagnostics-20230718-0944.zip

JorgeB · July 18, 2023

Is cache the new pool or the old one? Try running

btrfs rescue zero-log /dev/sdb1

Then re-start array and post new diags.

hawgorn · July 18, 2023

It's the new pool.

Clearing the log brought the pool back online. Here's the new diags after re-starting array.

ziggy-nas-diagnostics-20230718-1218.zip

JorgeB · July 18, 2023

Not normal for the pool to have issues so soon, so there might be an underlying issue, but now looks again.

hawgorn · July 18, 2023

14 minutes ago, JorgeB said:

Not normal for the pool to have issues so soon, so there might be an underlying issue, but now looks again.

I agree. I was planning to change the innards to a new chassis soon, but might now just buy a new MB, CPU and memory to install to the new chassis (as I'm running intel 4th and DDR3 now).

hawgorn · July 18, 2023

Working on the pfsense xml brought the pool/disks down again, and clearing the btrfs log made them online.

Could this be a memory problem? Maybe I'll run memtest later this week.

JorgeB · July 18, 2023

Could be, but often when you get this log tree issue, it re-occurs, so run memtest but might still be a good idea to re-create the pool anyway.

hawgorn · July 18, 2023

3 hours ago, JorgeB said:

Could be, but often when you get this log tree issue, it re-occurs, so run memtest but might still be a good idea to re-create the pool anyway.

Ok, I'll do that. For future reference and as I'd like to learn a bit, what might be causing this log tree issue, and what would I be looking for in the diags?

Problem replacing cache pool drives

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation