Removed cache pool device, added two cache pool devices, Wrong Pool State, cache - invalid expansion - General Support

March 31

Hi, I removed one cache drive because it was reporting errors. I tried to add two drives to replace it, but the pool now seems to be stuck in a bad state.
I can't start the array with the new drives and I can't remove them either: setting them to 'no device' pops up the same 'Wrong Pool State, cache - invalid expansion' message.

kuutio-diagnostics-20260331-2013.zip

March 31

Community Expert

Please post the output of btrfs fi show

March 31

Author

Here you go:

root@kuutio:~# btrfs fi show
warning, device 1 is missing
Label: none  uuid: 88139840-0e20-48a6-b8dc-b69dcea7bed9
        Total devices 3 FS bytes used 196.12GiB
        devid    2 size 894.25GiB used 73.00GiB path /dev/sdg1
        devid    3 size 953.87GiB used 132.03GiB path /dev/nvme0n1p1
        *** Some devices missing
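
For anyone reading along, here is a minimal Python sketch of how that output can be read: it compares the "Total devices" count with the devid lines actually listed, which is how the one missing member shows up. The function name summarize_fi_show and the regular expressions are my own assumptions for illustration, not part of btrfs or Unraid, and the parsing is only matched against the output pasted above.

```python
import re

# Output copied from the post above.
FI_SHOW = """\
warning, device 1 is missing
Label: none  uuid: 88139840-0e20-48a6-b8dc-b69dcea7bed9
        Total devices 3 FS bytes used 196.12GiB
        devid    2 size 894.25GiB used 73.00GiB path /dev/sdg1
        devid    3 size 953.87GiB used 132.03GiB path /dev/nvme0n1p1
        *** Some devices missing
"""

# Hypothetical pattern for the "devid N size ... used ... path ..." lines.
DEVID_RE = re.compile(r"devid\s+(\d+)\s+size\s+\S+\s+used\s+\S+\s+path\s+(\S+)")


def summarize_fi_show(text):
    """Return the expected device count, the devices present, and the gap."""
    match = re.search(r"Total devices\s+(\d+)", text)
    total = int(match.group(1)) if match else None
    present = {int(devid): path for devid, path in DEVID_RE.findall(text)}
    missing = total - len(present) if total is not None else None
    return {"total": total, "present": present, "missing_count": missing}


summary = summarize_fi_show(FI_SHOW)
print(summary)
```

The filesystem expects three members but only devid 2 and devid 3 are listed, so one device is missing.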

sidenote: isn't that in btrfs-usage.txt?

April 1

Community Expert

It is, but that's not typically the case, so I didn't check.

Try reimporting the pool with just those two devices, assuming it was redundant:

on main click on the first device for that pool and then "remove pool"

back on main, create a new pool with the same name and 2 slots

assign just those two devices, leave the filesystem set to auto

start the array to import the pool and post new diags.

April 1

Author

kuutio-diagnostics-20260401-1249.zip

edit: Huh, message disappeared.
That let the array start.

Edited April 1 by korhojoa

April 1

Community Expert

Diags after array start in normal mode please, not maintenance mode.

April 1

Author

Understood, normal mode:
kuutio-diagnostics-20260401-1300.zip

April 1

Community Expert

With the array running, type

btrfs dev remove missing /mnt/cache

Then stop the array and reimport the pool once more with the two devices. After that, you can leave the pool as is with those two devices or add/replace them.

April 1

Author

kuutio-diagnostics-20260401-1650.zip

I'm attaching one more diagnostics archive because, while btrfs dev remove missing /mnt/cache was running (it was close to finishing and exited soon after with exit code 0), I was watching the syslog and started seeing link drops, timeouts and btrfs errors from one of the cache drives. Am I still OK to reimport (remove, then recreate with the same name and devices)?

April 1

Community Expert

2 hours ago, korhojoa said:
Am I still OK to reimport (remove, recreate with same name and devices)?

Yep, missing device is gone.

April 1

Author

Okay, tried it.
kuutio-diagnostics-20260401-1919.zip
After the reimport and array start, the cache pool now says:
Unmountable: unsupported or no filesystem

April 1

Community Expert

sdg dropped offline, check/replace its cables and post new diags after array start:

Apr 1 16:45:51 kuutio kernel: ata5: softreset failed (device not ready)
Apr 1 16:45:51 kuutio kernel: ata5: hard resetting link
Apr 1 16:45:57 kuutio kernel: ata5: link is slow to respond, please be patient (ready=0)
Apr 1 16:46:02 kuutio kernel: ata5: softreset failed (device not ready)
Apr 1 16:46:02 kuutio kernel: ata5: hard resetting link
Apr 1 16:46:07 kuutio kernel: ata5: link is slow to respond, please be patient (ready=0)
### [PREVIOUS LINE REPEATED 1 TIMES] ###
Apr 1 16:46:37 kuutio kernel: ata5: softreset failed (device not ready)
Apr 1 16:46:37 kuutio kernel: ata5: limiting SATA link speed to 1.5 Gbps
Apr 1 16:46:37 kuutio kernel: ata5: hard resetting link
Apr 1 16:46:42 kuutio kernel: ata5: softreset failed (device not ready)
Apr 1 16:46:42 kuutio kernel: ata5: softreset failed
Apr 1 16:46:42 kuutio kernel: ata5: reset failed, giving up
Apr 1 16:46:42 kuutio kernel: ata5.00: disable device
Apr 1 16:46:42 kuutio kernel: ata5: EH complete
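
If you want to pick this pattern out of a long syslog yourself, a rough Python sketch along these lines may help. It tallies the kernel messages per ATA port and flags ports where the kernel gave up on the link. The helper names (count_ata_events, ports_given_up) and the regular expression are assumptions made up for this example, and the sample is just the tail of the excerpt above.

```python
import re
from collections import Counter

# Tail of the syslog excerpt quoted above.
LOG = """\
Apr 1 16:46:37 kuutio kernel: ata5: softreset failed (device not ready)
Apr 1 16:46:37 kuutio kernel: ata5: limiting SATA link speed to 1.5 Gbps
Apr 1 16:46:37 kuutio kernel: ata5: hard resetting link
Apr 1 16:46:42 kuutio kernel: ata5: softreset failed (device not ready)
Apr 1 16:46:42 kuutio kernel: ata5: softreset failed
Apr 1 16:46:42 kuutio kernel: ata5: reset failed, giving up
Apr 1 16:46:42 kuutio kernel: ata5.00: disable device
Apr 1 16:46:42 kuutio kernel: ata5: EH complete
"""

# Matches "kernel: ata5: msg" and "kernel: ata5.00: msg"; both map to port "ata5".
ATA_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.+)")


def count_ata_events(log_text):
    """Count each distinct kernel message per ATA port."""
    counts = {}
    for line in log_text.splitlines():
        match = ATA_RE.search(line)
        if match:
            port, message = match.groups()
            counts.setdefault(port, Counter())[message] += 1
    return counts


def ports_given_up(counts):
    """Ports where the kernel stopped retrying the link reset."""
    return {port for port, msgs in counts.items() if msgs["reset failed, giving up"] > 0}


events = count_ata_events(LOG)
print(ports_given_up(events))
```

Repeated resets followed by "giving up" and "disable device" on a single port, as here on ata5, usually point at that port's cable, power or drive rather than at the filesystem.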

April 2

Author

Checked the cables: the SATA cable was slightly unplugged, even though it has a locking connector.
Here you go.
kuutio-diagnostics-20260402-1650.zip

April 2

Community Expert

Scrub the pool and post the results from the GUI.

April 2

Author

Scrub complete:

UUID:             88139840-0e20-48a6-b8dc-b69dcea7bed9
Scrub started:    Thu Apr  2 18:42:40 2026
Status:           finished
Duration:         0:13:48
Total to scrub:   392.35GiB
Rate:             485.23MiB/s
Error summary:    verify=93024 csum=791170
  Corrected:      884194
  Uncorrectable:  0
  Unverified:     0
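
As a quick sanity check on those numbers, here is a small Python sketch that parses the error summary and compares the detected errors with the corrected count. The function name parse_scrub_summary and the parsing approach are assumptions for illustration; the equality checked at the end is an observation about this particular report, not a documented guarantee of btrfs scrub.

```python
import re

# Error section copied from the scrub report above.
SCRUB = """\
Error summary:    verify=93024 csum=791170
  Corrected:      884194
  Uncorrectable:  0
  Unverified:     0
"""


def parse_scrub_summary(text):
    """Split the report into per-type error counts and the outcome totals."""
    summary_line = next(
        line for line in text.splitlines() if line.startswith("Error summary:")
    )
    detected = {kind: int(n) for kind, n in re.findall(r"(\w+)=(\d+)", summary_line)}
    outcome = {
        name.lower(): int(n)
        for name, n in re.findall(
            r"^\s*(Corrected|Uncorrectable|Unverified):\s+(\d+)", text, re.MULTILINE
        )
    }
    return detected, outcome


detected, outcome = parse_scrub_summary(SCRUB)
total_detected = sum(detected.values())
print(total_detected, outcome["corrected"], outcome["uncorrectable"])
```

In this report the verify and csum errors add up to 93024 + 791170 = 884194, which matches the corrected count, with nothing left uncorrectable.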

April 2

Community Expert

All errors were corrected, so that's good. Recommend resetting the stats and monitoring for any further issues.

April 3

Author

kuutio-diagnostics-20260403-1329.zip

edit: I don't know what happens to the messages I type.

It looks like the scrub operation fixed the 'errors' by writing the corrupt data from 'sdg1' (the disk with the loose cable) to 'nvme0n1p1', essentially corrupting the valid data.

VM disks and docker image are corrupt, so they won't run. Is there a way to get back at least that data from the drive I removed from the pool earlier (Samsung_SSD_870_QVO_2TB_S5RPNF0T615547D)?

Edited April 3 by korhojoa

April 3

Community Expert

Were there any new issues?

April 3

Author

After attaching a file, the input box sometimes seems to lose any text I type afterwards.

I edited in the text soon after posting:

It looks like the scrub operation fixed the 'errors' by writing the corrupt data from 'sdg1' (the disk with the loose cable) to 'nvme0n1p1', essentially corrupting the valid data.

VM disks and docker image are corrupt, so they won't run. Is there a way to get back at least that data from the drive I removed from the pool earlier (Samsung_SSD_870_QVO_2TB_S5RPNF0T615547D)?

April 3

Community Expert

2 hours ago, korhojoa said:
It looks like the scrub operation fixed the 'errors' by writing the corrupt data from 'sdg1' (the disk with the loose cable) to 'nvme0n1p1', essentially corrupting the valid data.

btrfs always knows which device has the correct data, based on the checksum and transid, and it uses that copy to update the stale device; it cannot repair in the wrong direction.

2 hours ago, korhojoa said:
VM disks and docker image are corrupt, so they won't run.

If these are corrupt, either there were other issues, or the domains share is set to NOCOW, which disables checksums. In that case btrfs cannot correct the data, and when you access it, it reads from either disk, basically at random, so it can return stale data.
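
To check whether a directory is NOCOW, you can look for the "C" flag in the output of lsattr -d. Below is a minimal Python sketch of that check. The path /mnt/cache/domains is an assumption based on the pool mount point and share name mentioned in this thread, the sample lsattr lines are illustrative rather than taken from the diagnostics, and the helper names are made up for this example.

```python
import subprocess


def has_nocow(lsattr_line):
    """True if the flags field of an lsattr output line contains 'C' (No_COW).

    The check is case-sensitive: lowercase 'c' is the compression flag.
    """
    parts = lsattr_line.split(maxsplit=1)
    return bool(parts) and "C" in parts[0]


def dir_has_nocow(path):
    """Run `lsattr -d` on a directory and report whether No_COW is set."""
    result = subprocess.run(
        ["lsattr", "-d", path], capture_output=True, text=True, check=True
    )
    return has_nocow(result.stdout)


# Illustrative lines only; real flag strings vary in length between versions.
print(has_nocow("---------------C------ /mnt/cache/domains"))
print(has_nocow("---------------------- /mnt/cache/appdata"))

# On the server itself (hypothetical path, adjust to your setup):
# dir_has_nocow("/mnt/cache/domains")
```

Files created inside a NOCOW directory inherit the attribute, so VM disk images stored there would have no checksums for a scrub to verify.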
