Cache Pool Drive Replacement - Not gone to plan

extrobe · September 6, 2020

I have a 4xSSD Cache Pool on BTRFS (3x480GB, 1x500GB)

A few days ago, one of the drives (the 500GB) was flagged for replacement.

The new one arrived, and I read through the FAQ post.

The only thing I did differently was that I didn't have a spare port, so instead I did the following...

- Stopped the array

- Pulled out the faulty disk caddy

- Replaced the disk with the new one

- Selected the new disk in the pool

(which is pretty much the same process I use on the main disks)

But... whilst the offending disk shows as a 'new device', one of the other 3 disks is now showing as Unmountable: No File System

I've tried stopping the array again, and removing/re-inserting the disk. I've also tried putting the old disk back in, but I don't seem to be able to progress from here.

Is this recoverable? I have partial backups, so not all is lost, but annoyingly I think my PLEX instances were on my exclude list, and they are probably my biggest 'loss'.

extrobe · September 6, 2020

Although, perhaps I'm just being a dunce - is the Unmountable message just reflecting that the pool as an entirety can't be mounted?

It's prompting me to format the 'lead' disk in the cache - it was cache 4 which I swapped out

extrobe · September 6, 2020

Ok...

I did the following...

Stopped Array

Started Array in Maintenance Mode

Ran

mkdir /x
mount -o degraded,usebackuproot,ro /dev/sdh1 /x

Realised I should probably run that not in maintenance mode

Stopped Array

Started Array (normal)

and the cache pool seemingly is back online

Not sure if I've lost data or not, but I can't get docker to start "Docker Service failed to start"

EDIT: Did a restart, and back to being unmountable

EDIT: Repeating the previous steps, this time copying to the array using Midnight Commander - but getting a lot of copy errors (it keeps saying [stalled]), and I'm pretty sure some directories are missing. Would putting the old disk in again, and using the above command, be a sensible next step?
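One way to see exactly which paths fail is to copy from the read-only degraded mount one file at a time and log each failure instead of stopping. This is only a minimal sketch: it assumes the pool is mounted at /x as above, and the helper name copy_tree, the destination and the log path are all hypothetical. It does not repair anything or work around I/O stalls; it only records which files could not be read.

```shell
# Copy every regular file under SRC to DST, one at a time.
# Files that fail to copy are listed in LOG instead of aborting the run.
# copy_tree is a hypothetical helper; file names containing newlines are not handled.
copy_tree() {
  src="$1"; dst="$2"; log="$3"
  : > "$log"                      # start with an empty failure log
  (cd "$src" && find . -type f) | while IFS= read -r f; do
    mkdir -p "$dst/$(dirname "$f")"
    cp -p "$src/$f" "$dst/$f" 2>/dev/null || echo "$f" >> "$log"
  done
}

# Usage sketch (all paths are assumptions):
# copy_tree /x /mnt/disk1/cache_rescue /tmp/copy_failures.txt
```

Afterwards the log file lists the relative paths that could not be copied, which can be compared against the backups.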

EDIT: Adding the old disk back just gives the warning 'all data will be overwritten when you start the array', so doesn't feel like this will work

Edited September 6, 2020 by extrobe

JorgeB · September 6, 2020

2 hours ago, extrobe said:

The only thing I did differently was that I didn't have a spare port, so instead I did the following...

The FAQ says to use a different procedure if you don't have a spare port.

Please post the diagnostics: Tools -> Diagnostics

extrobe · September 6, 2020

demeter-diagnostics-20200906-1810.zip

Diagnostics attached.

I did follow that link for no spare port, but it went back to the single-disk procedure, and I wasn't sure if that was the right one to follow. As the multi-disk procedure was to just select a new disk (seemingly like for a standard disk), I thought I could just hot-swap them instead 😕

Edit: References to disk Crucial_CT500MX200 = Old Cache 4, Crucial_CT500MX500 = New Cache 4

Edited September 6, 2020 by extrobe

JorgeB · September 6, 2020

According to the syslog you're missing two devices, devid 3 and 4, and it's detecting 2 new devices.

Sep  6 17:31:02 DEMETER emhttpd: cache uuid: c8f42191-039e-41d8-894d-bdd878c15864
Sep  6 17:31:02 DEMETER emhttpd: cache TotDevices: 4
Sep  6 17:31:02 DEMETER emhttpd: cache NumDevices: 4
Sep  6 17:31:02 DEMETER emhttpd: cache NumFound: 2
Sep  6 17:31:02 DEMETER emhttpd: cache NumMissing: 1
Sep  6 17:31:02 DEMETER emhttpd: cache NumMisplaced: 0
Sep  6 17:31:02 DEMETER emhttpd: cache NumExtra: 2
Sep  6 17:31:02 DEMETER emhttpd: cache LuksState: 0
Sep  6 17:31:02 DEMETER emhttpd: shcmd (408): mount -t btrfs -o noatime,nodiratime,degraded -U c8f42191-039e-41d8-894d-bdd878c15864 /mnt/cache
Sep  6 17:31:02 DEMETER kernel: BTRFS info (device sdh1): allowing degraded mounts
Sep  6 17:31:02 DEMETER kernel: BTRFS info (device sdh1): disk space caching is enabled
Sep  6 17:31:02 DEMETER kernel: BTRFS info (device sdh1): has skinny extents
Sep  6 17:31:02 DEMETER kernel: BTRFS warning (device sdh1): devid 3 uuid 91c09af4-c319-4ef1-a89e-17b8b9080b28 is missing
Sep  6 17:31:02 DEMETER kernel: BTRFS warning (device sdh1): devid 4 uuid 20de7db3-546e-45a6-b08b-919ab79effeb is missing
Sep  6 17:31:02 DEMETER kernel: BTRFS warning (device sdh1): chunk 1009532010496 missing 2 devices, max tolerance is 1 for writeable mount

If you just replaced one, there's a problem with another one, which is not being detected as a pool member. Any idea why that would happen? Did you do anything else?
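The missing pool members in a syslog excerpt like the one above can be pulled out with a short filter. A minimal sketch, assuming the kernel's "devid N uuid ... is missing" wording shown in the log; the function name list_missing_devids is hypothetical:

```shell
# Print "devid uuid" for each device btrfs reports as missing.
# Reads syslog text on stdin; list_missing_devids is a hypothetical helper.
list_missing_devids() {
  sed -n 's/.*BTRFS warning.*devid \([0-9][0-9]*\) uuid \([0-9a-f-]*\) is missing.*/\1 \2/p'
}

# Usage sketch (assumes the log is at /var/log/syslog):
# list_missing_devids < /var/log/syslog
```

More than one line of output means more than one member is absent, which is the case the "max tolerance is 1 for writeable mount" warning refers to.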

extrobe · September 6, 2020

I did swap 2 of the disks around (the 2x Samsungs) - that was because I thought the 'unmountable filesystem' message was specific to that disk, so swapped the bays over to check it wasn't a connection issue. But I checked the assignments still matched before moving on.

JorgeB · September 6, 2020

If by swapping you mean just changing slots, that wouldn't be a problem. Something else must have happened, or the pool was already missing a device. You can try the old cache device if you still have it untouched; if not, you can't recover the pool with 2 missing devices.

extrobe · September 6, 2020

Yes, just changing slots.

I'm trying to get it to read off the original disk (the disk itself should have still been physically ok, it was just approaching EoL), but struggling to get it to include it in the pool.

Working my way through some of the BTRFS troubleshooting steps, but starting to look like a lost cause

EDIT: Looks like it's the other Crucial disk which is not showing up - but there was nothing to suggest it was an issue beforehand - in fact, I checked the SMART data before I started the replacement as I wanted to see how much life that one had left.

When I try to mount it, it says the Special Device doesn't exist - any diagnostics I can do on this to work out why that might be / confirm it's damaged?
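A "special device does not exist" error from mount means there is no device node at the path that was given, so a first check is whether the disk is visible at all and what signature it carries. A minimal sketch, assuming standard Linux tools (lsblk, blkid, btrfs-progs) are available and run as root; the helper name inspect_device and the example device path are hypothetical:

```shell
# Report whether DEV exists as a block device and, if so, what is on it.
# inspect_device is a hypothetical helper; run it as root.
inspect_device() {
  dev="$1"
  if [ ! -b "$dev" ]; then
    # This is the condition behind mount's "special device ... does not exist":
    # no block device node at the given path.
    echo "no block device at $dev"
    return 1
  fi
  lsblk -f "$dev"                 # partitions and detected filesystem types
  blkid "$dev"                    # filesystem signature and UUID, if any
  btrfs filesystem show "$dev"    # pool membership as btrfs sees it
}

# Usage sketch (the device name is an assumption):
# inspect_device /dev/sdX1
```

If the device exists but blkid reports no btrfs signature, that would be consistent with a "wrong fs type" error on mount.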

Looks like the data on the original disk has also already gone - when I try to mount it, it says wrong FS type

Edited September 6, 2020 by extrobe
