Cache Pool Drive Replacement - Not gone to plan


extrobe


I have a 4xSSD Cache Pool on BTRFS (3x480GB, 1x500GB)

 

A few days ago, one of the drives (the 500GB) was flagged for replacement.

The new one arrived, and I read through the FAQ post.

 

The only thing I did differently was that, as I didn't have a spare port, I did the following instead...

- Stopped the array

- Pulled out the faulty disk caddy

- Replaced the disk with the new one

- Selected the new disk in the pool

 

(which is pretty much the same process I use on the main disks)

 

But... whilst the offending disk shows as a 'new device', one of the other 3 disks is now showing as Unmountable: No File System

 

I've tried stopping the array again, and removing/re-inserting the disk. I've also tried putting the old disk back in, but I don't seem to be able to progress from here.

Is this recoverable? I have partial backups, so not all is lost, but annoyingly I think my PLEX instances were on my backup exclude list, and they're probably my biggest 'loss'.

 

 


Ok...

I did the following...

Stopped Array

Started Array in Maintenance Mode

Ran 

mkdir /x
mount -o degraded,usebackuproot,ro /dev/sdh1 /x

Realised I should probably not run that in maintenance mode

Stopped Array

Started Array (normal)

and the cache pool is seemingly back online

 

Not sure if I've lost data or not, but I can't get Docker to start: "Docker Service failed to start"

 

EDIT: Did a restart, and it's back to being unmountable

EDIT: Repeated the previous steps, this time copying to the array using Midnight Commander, but I'm getting a lot of copy errors (it keeps saying [stalled]), and I'm pretty sure some directories are missing. Would putting the old disk in again, and using the above command, be a sensible next step?

 

EDIT: Adding the old disk back just gives the warning 'all data will be overwritten when you start the array', so doesn't feel like this will work
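
For the copy off the read-only mount, a small loop that carries on past errors and records what failed may be easier to follow than watching a file manager stall. This is only a sketch: `rescue_copy` and `failed.txt` are names made up for illustration, `/x` is the mount point used above, and the destination path in the usage comment is an assumption about where there is free space on the array.

```shell
#!/bin/sh
# rescue_copy SRC DEST (hypothetical helper): copy each top-level entry of SRC
# into DEST, continuing past failures and listing them in DEST/failed.txt.
# Note: the "*" glob skips top-level dotfiles.
rescue_copy() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    : > "$dest/failed.txt"
    for entry in "$src"/*; do
        [ -e "$entry" ] || continue
        # cp -a preserves ownership, permissions and timestamps where possible
        cp -a "$entry" "$dest"/ 2>/dev/null || echo "$entry" >> "$dest/failed.txt"
    done
}

# Example (destination path is an assumption):
# rescue_copy /x /mnt/disk1/cache_rescue
```

Anything listed in `failed.txt` copied only partially or not at all, so those entries are the ones to check against the backups.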


demeter-diagnostics-20200906-1810.zip

 

Diagnostics attached.

I did follow that link for the no-spare-port case, but it pointed back to the single-disk procedure, and I wasn't sure if that was the right one to follow. As the multi-disk procedure was just to select a new disk (seemingly the same as for a standard disk), I thought I could just hot-swap them instead 😕

 

Edit: In the diagnostics, Crucial_CT500MX200 = old Cache 4 and Crucial_CT500MX500 = new Cache 4


According to the syslog you're missing two devices, devid 3 and 4, and it's detecting 2 new devices.

 

Sep  6 17:31:02 DEMETER emhttpd: cache uuid: c8f42191-039e-41d8-894d-bdd878c15864
Sep  6 17:31:02 DEMETER emhttpd: cache TotDevices: 4
Sep  6 17:31:02 DEMETER emhttpd: cache NumDevices: 4
Sep  6 17:31:02 DEMETER emhttpd: cache NumFound: 2
Sep  6 17:31:02 DEMETER emhttpd: cache NumMissing: 1
Sep  6 17:31:02 DEMETER emhttpd: cache NumMisplaced: 0
Sep  6 17:31:02 DEMETER emhttpd: cache NumExtra: 2
Sep  6 17:31:02 DEMETER emhttpd: cache LuksState: 0
Sep  6 17:31:02 DEMETER emhttpd: shcmd (408): mount -t btrfs -o noatime,nodiratime,degraded -U c8f42191-039e-41d8-894d-bdd878c15864 /mnt/cache
Sep  6 17:31:02 DEMETER kernel: BTRFS info (device sdh1): allowing degraded mounts
Sep  6 17:31:02 DEMETER kernel: BTRFS info (device sdh1): disk space caching is enabled
Sep  6 17:31:02 DEMETER kernel: BTRFS info (device sdh1): has skinny extents
Sep  6 17:31:02 DEMETER kernel: BTRFS warning (device sdh1): devid 3 uuid 91c09af4-c319-4ef1-a89e-17b8b9080b28 is missing
Sep  6 17:31:02 DEMETER kernel: BTRFS warning (device sdh1): devid 4 uuid 20de7db3-546e-45a6-b08b-919ab79effeb is missing
Sep  6 17:31:02 DEMETER kernel: BTRFS warning (device sdh1): chunk 1009532010496 missing 2 devices, max tolerance is 1 for writeable mount
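
To pull just the missing device IDs out of a syslog like the one above, a short filter along these lines should do. It is a sketch only: `missing_devids` is a made-up name, and the syslog path in the usage comment is an assumption.

```shell
#!/bin/sh
# missing_devids (hypothetical helper): read syslog text on stdin and print
# each btrfs devid the kernel reported as missing, one per line.
missing_devids() {
    sed -n 's/.*BTRFS warning.*devid \([0-9][0-9]*\) uuid [^ ]* is missing.*/\1/p' | sort -un
}

# Example (log path is an assumption):
# missing_devids < /var/log/syslog
```

Run against the lines above it prints 3 and 4, the two devids the kernel warns about.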

 

If you just replaced one, there's a problem with another one, which is not being detected as a pool member. Any idea why that would happen? Did you do anything else?


Yes, just changing slots.

 

I'm trying to get it to read off the original disk (the disk itself should still have been physically OK, it was just approaching EoL), but I'm struggling to get it included in the pool.

Working my way through some of the BTRFS troubleshooting steps, but starting to look like a lost cause :( 

 

EDIT: Looks like it's the other Crucial disk which is not showing up, but there was nothing to suggest it had an issue beforehand. In fact, I checked its SMART data before I started the replacement, as I wanted to see how much life that one had left.

When I try to mount it, it says the special device doesn't exist. Are there any diagnostics I can run on this to work out why that might be, or to confirm it's damaged?

 

Looks like the data on the original disk has also already gone - when I try to mount it, it says wrong FS type
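
For reference, "special device does not exist" from mount means the device path it was given isn't there, so a first check is whether the kernel exposes a block device at that path at all. A minimal sketch, where `check_block_dev` is a made-up helper name and `/dev/sdX` is a placeholder rather than one of the actual devices:

```shell
#!/bin/sh
# check_block_dev PATH (hypothetical helper): succeed only if PATH is a
# block device node that currently exists.
check_block_dev() {
    if [ -b "$1" ]; then
        echo "$1: block device present"
    else
        echo "$1: no such block device"
        return 1
    fi
}

# Read-only commands that could be run next (device names are placeholders):
# lsblk -o NAME,SIZE,MODEL,SERIAL,FSTYPE   # disks and partitions the kernel sees
# blkid /dev/sdX1                          # filesystem signature on the partition, if any
# btrfs filesystem show                    # btrfs filesystems and their member devices
# smartctl -a /dev/sdX                     # SMART health report for the drive
```

If the disk doesn't appear in `lsblk` at all, the problem is below the filesystem (drive, cable, caddy or port). If it appears but `blkid` reports no btrfs signature, that would fit the "wrong FS type" message.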

