
Problem replacing cache pool drives


Solved by JorgeB


Short version:

I had bad SMART readings on my 500GB cache SSD (A-drive) and decided to replace both pool drives with 1TB drives. I replaced the bad drive with a new one, and the pool (apparently) rebuilt. But when I replaced the second drive (B-drive), the result was “Unmountable: Too many missing/misplaced device” on both drives. I managed to get back to a working pool by putting the second original drive (B) back [1].

 

But now, after a lot of fiddling around with the old and new cache drives (yeah, right… bad move), I have “Unmountable: No pool uuid” on both of the old drives in place. Am I foobar?

 

 

Long version:

After I got the pool working [1, above] I decided to play it safe and try moving everything to the array first. The mover did not move everything; I’d say the VMs and Docker did not move (yes, they were shut down and disabled in settings). I then tried copying the contents of the cache to a backup share with Dynamix File Manager. I let it run overnight, but when I went to bed the copy status was at something like 120% and growing?? That copy did not work either; not all of the cache was copied (when comparing the calculated occupied space). I then tried MC, copying straight to the disk where the mover had already moved some files, but it stalled on several VM images and there were also some errors (didn’t record them, silly me). At this stage I still had a working cache pool using both of the original drives.
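
In hindsight, a plain rsync from the console would probably have been a cleaner way to get the cache contents onto the array. A rough sketch, where the destination share name is just a placeholder for whatever the backup share is actually called:

# /mnt/cache is the pool mount; /mnt/user0/cache-backup is a hypothetical array-only share path
# -a preserves permissions and ownership; -vh --progress just makes the output readable
rsync -avh --progress /mnt/cache/ /mnt/user0/cache-backup/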

 

I then cleared both partition tables of the 1TB disks with GParted on a Linux machine and tried replacing the drives the way the FAQ describes. I put one new disk on a spare SATA port and assigned it to the cache, but got "Unmountable: Too many missing/misplaced device" straight away. I then tried clearing the cache config and different drive combinations (but always keeping the old A and B drives in their respective slots). Nothing worked; I either got "too many missing.." or "No pool uuid". And now I have both of the old cache drives in place with "Unmountable: No pool uuid". How screwed am I?
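
For reference, the same wipe can also be done from the command line. A minimal sketch, with /dev/sdX standing in for whichever 1TB disk is being cleared (double-check the device letter first, this is destructive):

# Removes partition-table and filesystem signatures from the whole disk
wipefs -a /dev/sdX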

 

I do have recent backups of everything (appdata, docker.img, VMs). Should I just give in and restore from those, or can the cache be saved?

ziggy-nas-diagnostics-20230714-1326.zip


Yes, sdb and sdd were the original pool members.

 

----------

warning, device 4 is missing
warning, device 3 is missing
ERROR: cannot read chunk root
Label: none  uuid: f7747906-ca16-4858-b91f-8f600a5b448d
        Total devices 3 FS bytes used 108.99GiB
        devid    5 size 476.94GiB used 17.00GiB path /dev/sdb1
        *** Some devices missing

 

 

11 minutes ago, JorgeB said:
btrfs-select-super -s 1 /dev/sdd1

warning, device 4 is missing
using SB copy 1, bytenr 67108864

 

11 minutes ago, JorgeB said:
btrfs fi show

warning, device 4 is missing
Label: none  uuid: f7747906-ca16-4858-b91f-8f600a5b448d
        Total devices 3 FS bytes used 108.99GiB
        devid    3 size 465.76GiB used 153.03GiB path /dev/sdd1
        devid    5 size 476.94GiB used 17.00GiB path /dev/sdb1
        *** Some devices missing
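
Side note on what the two commands above are doing: btrfs keeps backup copies of the superblock at fixed offsets, and btrfs-select-super -s 1 writes copy 1 (the one at bytenr 67108864) back over the primary. A backup copy can be sanity-checked before it is used with something along these lines, reusing the same /dev/sdd1 as above:

# Print backup superblock copy 1 and confirm its uuid/generation look sane
btrfs inspect-internal dump-super -s 1 /dev/sdd1 | grep -E 'fsid|bytenr|generation'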

 

----------------

Original pool was 2 devices: Samsung EVO 500GB and Pro 512GB

48 minutes ago, hawgorn said:

Original pool was 2 devices:

Then something went wrong with the replacement, because it now thinks there should be three devices. Let's see if it mounts with just these two alone, though I doubt it, since it doesn't look like a fully redundant pool based on the data distribution:

 

stop array

unassign all pool devices

start array to reset the pool

stop array

reassign both pool devices

start array and post new diags.
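
For reference, once the pool mounts again the per-device allocation can be checked from the console with something like the command below, assuming Unraid has the pool mounted at /mnt/cache; that is the kind of tabular output shown in the next post:

# Per-device data/metadata/system allocation in table form
btrfs filesystem usage -T /mnt/cache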


Not out of the woods yet:

 


 

             Data      Metadata  System                              
Id Path      RAID1     RAID1     RAID1    Unallocated Total     Slack
-- --------- --------- --------- -------- ----------- --------- -----
 3 /dev/sdd1 150.00GiB   3.00GiB 32.00MiB   312.73GiB 465.76GiB     -
 4 missing   133.00GiB   3.00GiB 32.00MiB   340.91GiB 476.94GiB     -
 5 /dev/sdb1  17.00GiB         -        -   459.94GiB 476.94GiB     -

 

sdd has most of the data and it appears to be failing, let's see if it can finish removing the missing device.
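
For reference, the underlying btrfs operation here is removing the missing member, roughly the equivalent of the command below against the mounted pool (assuming /mnt/cache); it has to relocate that device's chunks onto the remaining drives before it can drop it:

# Migrates the absent device's chunks to the other pool members, then removes it
btrfs device remove missing /mnt/cache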

31 minutes ago, JorgeB said:

Not out of the woods yet:
sdd has most of the data and it appears to be failing, let's see if it can finish removing the missing device.

sdb is/was the one that SMART outed.

 

How should I proceed, what do you suggest?


No balancing happening, still sitting at the same numbers. And you are right, sdd IS the failing one, I misread.

 

             Data      Metadata  System                              
Id Path      RAID1     RAID1     RAID1    Unallocated Total     Slack
-- --------- --------- --------- -------- ----------- --------- -----
 3 /dev/sdd1 150.00GiB   3.00GiB 32.00MiB   312.73GiB 465.76GiB     -
 4 missing   133.00GiB   3.00GiB 32.00MiB   340.91GiB 476.94GiB     -
 5 /dev/sdb1  17.00GiB         -        -   459.94GiB 476.94GiB     -
-- --------- --------- --------- -------- ----------- --------- -----
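
One way to confirm whether any chunk relocation is actually running is the kernel log; btrfs prints a line per block group while it is moving data, so something like this should show activity if there is any:

dmesg | grep -i 'relocating block group'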
                      

2 hours ago, JorgeB said:

Best option might be trying to copy anything you can from the pool and then recreate.

Ok, thanks. How is this for a plan?

 

-delete old pool

-create new pool with the new disks.

-point cache shares to the new pool

 

cache shares are now cache:yes, so in theory everything is in the array

 

-copy backups to cache shares in array (rough rsync sketch after this list)

-change cache:prefer

-invoke mover

-cross fingers
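
For the copy-backups step, something along these lines is what I have in mind; the source path is just a placeholder for wherever the backups actually sit:

# Hypothetical backup location; /mnt/user/appdata is the share the data goes back into
rsync -avh --progress /mnt/user/backups/appdata/ /mnt/user/appdata/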

 


Apparently I'm still not out of the woods. Last night I started working on the VMs and got "cache is full or read only". I let the server sit for the night. This morning the primary cache drive was in standby mode. I stopped the array and changed the spin down delay to 'never'. After starting the array the cache disks were 'Unmountable: no file system'. Clearing the pool config did not work.

 

ziggy-nas-diagnostics-20230718-0944.zip

14 minutes ago, JorgeB said:

Not normal for the pool to have issues so soon, so there might be an underlying issue, but it looks OK again for now.

I agree. I was planning to move the innards to a new chassis soon, but might now just buy a new MB, CPU and memory to install in the new chassis (as I'm running Intel 4th gen and DDR3 now).

 

3 hours ago, JorgeB said:

Could be, but often when you get this log tree issue it re-occurs, so run memtest, but it might still be a good idea to re-create the pool anyway.

Ok, I'll do that. For future reference and as I'd like to learn a bit, what might be causing this log tree issue, and what would I be looking for in the diags?
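
For future reference, a rough way to look: the btrfs errors end up in the syslog inside the diagnostics zip, typically as messages like "failed to read log tree" or "parent transid verify failed", so a grep along these lines should surface them (the file name inside the zip is an assumption and may differ):

# Run inside the extracted diagnostics zip; adjust the path if the syslog is named differently
grep -iE 'btrfs.*(log tree|log replay|transid|csum)' logs/syslog.txt

And if the log tree itself turns out to be the corrupt part, btrfs rescue zero-log is the usual way to clear it, at the cost of the last few seconds of writes.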
