Add cache disk to pool & array will no longer start

Mike S · May 25, 2021

I tried to add a second disk to the cache pool. The original disk was formatted as BTRFS, so it should be compatible with the pool. Once I added the disk, the array was stuck at mounting for 5-6 hours and the server was eventually rebooted. Since rebooting, it gets past mounting but gets stuck at Starting Services.

It seems to start fine in maintenance mode but will not start up normally. Safe mode also doesn't seem to help. I can't gather diagnostics because the download gets stuck. All I can see in the syslog:

May 24 20:29:36 Tower kernel: CPU: 3 PID: 58 Comm: kworker/u16:2 Tainted: G        W         5.10.28-Unraid #1
May 24 20:29:36 Tower kernel: Hardware name: BASE_BOARD_MANUFACTURER MODEL_NAME/132-SE-E775, BIOS 4.6.5 04/09/2018
May 24 20:29:36 Tower kernel: Workqueue: events_unbound btrfs_async_reclaim_data_space
May 24 20:29:36 Tower kernel: Call Trace:
May 24 20:29:36 Tower kernel: <IRQ>
May 24 20:29:36 Tower kernel: dump_stack+0x6b/0x83
May 24 20:29:36 Tower kernel: ? lapic_can_unplug_cpu+0x8e/0x8e
May 24 20:29:36 Tower kernel: nmi_cpu_backtrace+0x7d/0x8f
May 24 20:29:36 Tower kernel: nmi_trigger_cpumask_backtrace+0x56/0xd3
May 24 20:29:36 Tower kernel: rcu_dump_cpu_stacks+0x9f/0xc6
May 24 20:29:36 Tower kernel: rcu_sched_clock_irq+0x1ec/0x543
May 24 20:29:36 Tower kernel: ? _raw_spin_unlock_irqrestore+0xd/0xe
May 24 20:29:36 Tower kernel: update_process_times+0x50/0x6e
May 24 20:29:36 Tower kernel: tick_sched_timer+0x36/0x64
May 24 20:29:36 Tower kernel: __hrtimer_run_queues+0xb7/0x10b
May 24 20:29:36 Tower kernel: ? tick_sched_do_timer+0x39/0x39
May 24 20:29:36 Tower kernel: hrtimer_interrupt+0x8d/0x15b
May 24 20:29:36 Tower kernel: __sysvec_apic_timer_interrupt+0x5d/0x68
May 24 20:29:36 Tower kernel: asm_call_irq_on_stack+0x12/0x20
May 24 20:29:36 Tower kernel: </IRQ>
May 24 20:29:36 Tower kernel: sysvec_apic_timer_interrupt+0x71/0x95
May 24 20:29:36 Tower kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
May 24 20:29:36 Tower kernel: RIP: 0010:btrfs_async_reclaim_data_space+0x40/0xf4
May 24 20:29:36 Tower kernel: Code: 4c 8d a5 b8 00 00 00 48 89 ef e8 2d 9a 42 00 48 8b 85 b8 00 00 00 49 39 c4 74 62 48 89 ef 4c 8b ad d0 00 00 00 e8 56 f3 ff ff <f6> 45 40 01 75 1e 4c 89 f7 b9 08 00 00 00 48 83 ca ff 48 89 ee e8
May 24 20:29:36 Tower kernel: RSP: 0018:ffffc90000233e78 EFLAGS: 00000287
May 24 20:29:36 Tower kernel: RAX: ffffc90000a43d58 RBX: ffff8881003ca000 RCX: 0000000000000000
May 24 20:29:36 Tower kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff888104baa000
May 24 20:29:36 Tower kernel: RBP: ffff888104baa000 R08: 0000000000000001 R09: 0000646e756f626e
May 24 20:29:36 Tower kernel: R10: 8080808080808080 R11: fefefefefefefeff R12: ffff888104baa0b8
May 24 20:29:36 Tower kernel: R13: 0000000000000000 R14: ffff88814205d000 R15: 0000000000000000
May 24 20:29:36 Tower kernel: ? btrfs_async_reclaim_data_space+0x40/0xf4
May 24 20:29:36 Tower kernel: process_one_work+0x13c/0x1d5
May 24 20:29:36 Tower kernel: worker_thread+0x18b/0x22f
May 24 20:29:36 Tower kernel: ? process_scheduled_works+0x27/0x27
May 24 20:29:36 Tower kernel: kthread+0xe5/0xea
May 24 20:29:36 Tower kernel: ? __kthread_bind_mask+0x57/0x57
May 24 20:29:36 Tower kernel: ret_from_fork+0x22/0x30

Any thoughts?

I've attached a full syslog

tower-syslog-20210525-0135.zip

Edited May 25, 2021 by Mike S

JorgeB · May 25, 2021

According to the log there's a new cache device but there's also a device missing:

May 24 20:25:31 Tower emhttpd: /mnt/cache TotDevices: 2
May 24 20:25:31 Tower emhttpd: /mnt/cache NumDevices: 2
May 24 20:25:31 Tower emhttpd: /mnt/cache NumFound: 1
May 24 20:25:31 Tower emhttpd: /mnt/cache NumMissing: 1
May 24 20:25:31 Tower emhttpd: /mnt/cache NumMisplaced: 0
May 24 20:25:31 Tower emhttpd: /mnt/cache NumExtra: 1
May 24 20:25:31 Tower emhttpd: /mnt/cache LuksState: 0
May 24 20:25:31 Tower emhttpd: shcmd (332): mount -t btrfs -o noatime,space_cache=v2,discard=async,degraded -U 8af1fbf7-4e95-4aa6-aa41-91cf4fcabeec /mnt/cache
May 24 20:25:31 Tower kernel: BTRFS info (device sdf1): turning on async discard
May 24 20:25:31 Tower kernel: BTRFS info (device sdf1): allowing degraded mounts
May 24 20:25:31 Tower kernel: BTRFS info (device sdf1): using free space tree
May 24 20:25:31 Tower kernel: BTRFS info (device sdf1): has skinny extents
May 24 20:25:31 Tower kernel: BTRFS warning (device sdf1): devid 2 uuid 34efdbf3-fad8-42bb-acdc-bba1ed3bcaf4 is missing
May 24 20:25:31 Tower kernel: BTRFS info (device sdf1): enabling ssd optimizations
May 24 20:25:31 Tower kernel: BTRFS error (device sdf1): balance: invalid convert data profile raid1
May 24 20:25:31 Tower kernel: BTRFS warning (device sdf1): Skipping commit of aborted transaction.

Do you still have the missing device?

Mike S · May 25, 2021

The only device that changed was the new one I added, but it crashed during the original mount, so maybe it's recognizing that as another device that is missing? It looks like it's expecting 2 total but found 1 and 1 extra?

I also see this in that log:

May 24 20:25:31 Tower kernel: BTRFS error (device sdf1): balance: invalid convert data profile raid1

Does that have anything to do with it?

Edited May 25, 2021 by Mike S

JorgeB · May 25, 2021

TotDevices means the pool has 2 devices, and NumDevices means 2 devices are assigned, but the most important part is this one:

May 24 20:25:31 Tower emhttpd: /mnt/cache NumFound: 1
May 24 20:25:31 Tower emhttpd: /mnt/cache NumMissing: 1
May 24 20:25:31 Tower emhttpd: /mnt/cache NumMisplaced: 0
May 24 20:25:31 Tower emhttpd: /mnt/cache NumExtra: 1

This means only 1 pool device was found, there's 1 missing and 1 extra (new device).
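For anyone reading these counters in their own syslog, here is a minimal sketch that pulls them out and flags a degraded pool. It assumes the `emhttpd: /mnt/<pool> Key: value` line format quoted above; the function name `pool_summary` is made up for illustration and is not an Unraid tool.

```shell
#!/bin/bash
# Summarize emhttpd's device counters for one pool from syslog text on stdin.
# Assumption: lines look like
#   "May 24 20:25:31 Tower emhttpd: /mnt/cache NumFound: 1"
# pool_summary is a hypothetical helper name, not part of Unraid.
pool_summary() {
    local pool="${1:-/mnt/cache}"
    awk -v pool="$pool" '
        $5 == "emhttpd:" && $6 == pool &&
        $7 ~ /^(TotDevices|NumDevices|NumFound|NumMissing|NumMisplaced|NumExtra):$/ {
            key = $7
            sub(/:$/, "", key)
            val[key] = $8      # a later mount attempt overwrites an earlier one
        }
        END {
            printf "found=%s missing=%s extra=%s\n",
                   val["NumFound"], val["NumMissing"], val["NumExtra"]
            if (val["NumMissing"] > 0)
                print "pool is degraded: a member device is missing"
        }'
}
```

Usage would be something like `pool_summary /mnt/cache < /var/log/syslog`. If the log contains several mount attempts, only the counters from the last one are reported.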

1 hour ago, Mike S said:

Does that have anything to do with it?

Yes, it can't convert to raid1 because of the missing device.

Mike S · May 25, 2021

Hmm, so if that's the case, how can I get rid of the ghost device?

JorgeB · May 25, 2021

See if the pool starts with only the existing device assigned. If it does, post new diagnostics after the array starts.

Mike S · May 25, 2021

Still stuck at "Starting Services" with only the original device assigned. I see this in the syslog now instead:

May 25 09:56:08 Tower emhttpd: shcmd (356): mkdir -p /mnt/cache
May 25 09:56:08 Tower emhttpd: /mnt/cache uuid: 8af1fbf7-4e95-4aa6-aa41-91cf4fcabeec
May 25 09:56:08 Tower emhttpd: /mnt/cache TotDevices: 2
May 25 09:56:08 Tower emhttpd: /mnt/cache NumDevices: 1
May 25 09:56:08 Tower emhttpd: /mnt/cache NumFound: 1
May 25 09:56:08 Tower emhttpd: /mnt/cache NumMissing: 1
May 25 09:56:08 Tower emhttpd: /mnt/cache NumMisplaced: 0
May 25 09:56:08 Tower emhttpd: /mnt/cache NumExtra: 0
May 25 09:56:08 Tower emhttpd: /mnt/cache LuksState: 0
May 25 09:56:08 Tower emhttpd: shcmd (357): mount -t btrfs -o noatime,space_cache=v2,discard=async,degraded -U 8af1fbf7-4e95-4aa6-aa41-91cf4fcabeec /mnt/cache

JorgeB · May 25, 2021

Please post the complete syslog to see the error/crash.

Mike S · May 25, 2021

tower-syslog-20210525-1507.zip

JorgeB · May 25, 2021

You'll need to recreate the pool, but if there's still important data there you can try to recover with this:

First create a temp dir:

mkdir /x

then try to mount with skip balance:

mount -o degraded,skip_balance /dev/sdf1 /x

If that doesn't work try read-only:

mount -o degraded,ro /dev/sdf1 /x

If either works you can browse /x and copy any important data to the array.
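The two attempts above can be wrapped in a small script that stops at the first mount that succeeds. This is only a sketch: `/dev/sdf1` and `/x` are the device and mount point from this thread and will differ on other systems, and the function name and `DRY_RUN` switch are made up for illustration. `skip_balance` is relevant here because the log shows an interrupted balance, and that option stops btrfs from resuming it at mount time.

```shell
#!/bin/bash
# Try the recovery mounts suggested above, in order, stopping at the first
# one that works. Must run as root unless DRY_RUN=1.
# Assumptions: the default device and mount point are the ones from this
# thread; try_recovery_mount and DRY_RUN are hypothetical names.
try_recovery_mount() {
    local dev="${1:-/dev/sdf1}" mnt="${2:-/x}" opts
    for opts in degraded,skip_balance degraded,ro; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            # Only show what would be run.
            echo "mount -o $opts $dev $mnt"
            continue
        fi
        mkdir -p "$mnt"
        if mount -o "$opts" "$dev" "$mnt"; then
            echo "mounted $dev on $mnt with -o $opts"
            return 0
        fi
    done
    if [ "${DRY_RUN:-0}" = 1 ]; then
        return 0
    fi
    echo "neither mount attempt succeeded" >&2
    return 1
}
```

Running `DRY_RUN=1 try_recovery_mount` first prints the two mount commands without touching the disk, which is a cheap way to confirm the device name before trying for real.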

Mike S · May 25, 2021

Welp, neither of those worked, unfortunately. My cache is sitting at 240 GB/240 GB after removing the one device. If I have to rebuild the pool, how do I go about doing that? I wasn't seeing anything in the docs: https://wiki.unraid.net/Manual/Storage_Management#Adding_disks_to_a_pool

JorgeB · May 25, 2021

If there's no data there you just need to re-format.

Mike S · May 25, 2021

Oh, I meant it's full (240 GB of 240 GB used), not empty.

JorgeB · May 25, 2021

If it's completely full, that's possibly one of the reasons it's crashing; a COW (copy-on-write) filesystem should never be completely full. See here for some more recovery options, then re-format.
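To catch a pool before it fills up, a rough check along these lines could be run from a scheduled script. It is a sketch only: the 90% limit is an arbitrary assumption, the function names are made up, and `df` figures can be imprecise on multi-device btrfs pools, where `btrfs filesystem usage /mnt/cache` gives a more accurate picture.

```shell
#!/bin/bash
# Warn when a mounted filesystem is close to full.
# Assumptions: the 90% default limit is arbitrary; fs_use_percent and
# check_pool_space are hypothetical helper names.

# Print the "Capacity" column of POSIX-format df output as a bare integer.
fs_use_percent() {
    df -P "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

check_pool_space() {
    local path="${1:-/mnt/cache}" limit="${2:-90}" used
    used="$(fs_use_percent "$path")"
    case "$used" in
        ''|*[!0-9]*)
            echo "could not read usage for $path" >&2
            return 2
            ;;
    esac
    if [ "$used" -ge "$limit" ]; then
        echo "WARNING: $path is ${used}% full (limit ${limit}%)"
        return 1
    fi
    echo "OK: $path is ${used}% full"
}
```

For example, `check_pool_space /mnt/cache 85` returns a non-zero status once the pool reaches 85% of capacity, so it can be chained with a notification command.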

Mike S · May 25, 2021

Awesome, thanks so much for the help. I was finally able to get a copy of the cache and am working on rebuilding the pool now. Appreciate the help!!
