Mike S Posted May 25, 2021 (edited)
I tried to add a second disk to the cache pool. The original disk was formatted as BTRFS, so it should be compatible with the pool. Once I added the disk, it was stuck at mounting for 5-6 hours and the server was eventually rebooted. Since rebooting, it gets past mounting but gets stuck at Starting Services. It seems to start fine in maintenance mode but will not start up normally. Safe mode also doesn't seem to help. I cannot gather diagnostics, as the download seems to get stuck. All I can see in the syslog:
May 24 20:29:36 Tower kernel: CPU: 3 PID: 58 Comm: kworker/u16:2 Tainted: G W 5.10.28-Unraid #1
May 24 20:29:36 Tower kernel: Hardware name: BASE_BOARD_MANUFACTURER MODEL_NAME/132-SE-E775, BIOS 4.6.5 04/09/2018
May 24 20:29:36 Tower kernel: Workqueue: events_unbound btrfs_async_reclaim_data_space
May 24 20:29:36 Tower kernel: Call Trace:
May 24 20:29:36 Tower kernel: <IRQ>
May 24 20:29:36 Tower kernel: dump_stack+0x6b/0x83
May 24 20:29:36 Tower kernel: ? lapic_can_unplug_cpu+0x8e/0x8e
May 24 20:29:36 Tower kernel: nmi_cpu_backtrace+0x7d/0x8f
May 24 20:29:36 Tower kernel: nmi_trigger_cpumask_backtrace+0x56/0xd3
May 24 20:29:36 Tower kernel: rcu_dump_cpu_stacks+0x9f/0xc6
May 24 20:29:36 Tower kernel: rcu_sched_clock_irq+0x1ec/0x543
May 24 20:29:36 Tower kernel: ? _raw_spin_unlock_irqrestore+0xd/0xe
May 24 20:29:36 Tower kernel: update_process_times+0x50/0x6e
May 24 20:29:36 Tower kernel: tick_sched_timer+0x36/0x64
May 24 20:29:36 Tower kernel: __hrtimer_run_queues+0xb7/0x10b
May 24 20:29:36 Tower kernel: ? tick_sched_do_timer+0x39/0x39
May 24 20:29:36 Tower kernel: hrtimer_interrupt+0x8d/0x15b
May 24 20:29:36 Tower kernel: __sysvec_apic_timer_interrupt+0x5d/0x68
May 24 20:29:36 Tower kernel: asm_call_irq_on_stack+0x12/0x20
May 24 20:29:36 Tower kernel: </IRQ>
May 24 20:29:36 Tower kernel: sysvec_apic_timer_interrupt+0x71/0x95
May 24 20:29:36 Tower kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
May 24 20:29:36 Tower kernel: RIP: 0010:btrfs_async_reclaim_data_space+0x40/0xf4
May 24 20:29:36 Tower kernel: Code: 4c 8d a5 b8 00 00 00 48 89 ef e8 2d 9a 42 00 48 8b 85 b8 00 00 00 49 39 c4 74 62 48 89 ef 4c 8b ad d0 00 00 00 e8 56 f3 ff ff <f6> 45 40 01 75 1e 4c 89 f7 b9 08 00 00 00 48 83 ca ff 48 89 ee e8
May 24 20:29:36 Tower kernel: RSP: 0018:ffffc90000233e78 EFLAGS: 00000287
May 24 20:29:36 Tower kernel: RAX: ffffc90000a43d58 RBX: ffff8881003ca000 RCX: 0000000000000000
May 24 20:29:36 Tower kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff888104baa000
May 24 20:29:36 Tower kernel: RBP: ffff888104baa000 R08: 0000000000000001 R09: 0000646e756f626e
May 24 20:29:36 Tower kernel: R10: 8080808080808080 R11: fefefefefefefeff R12: ffff888104baa0b8
May 24 20:29:36 Tower kernel: R13: 0000000000000000 R14: ffff88814205d000 R15: 0000000000000000
May 24 20:29:36 Tower kernel: ? btrfs_async_reclaim_data_space+0x40/0xf4
May 24 20:29:36 Tower kernel: process_one_work+0x13c/0x1d5
May 24 20:29:36 Tower kernel: worker_thread+0x18b/0x22f
May 24 20:29:36 Tower kernel: ? process_scheduled_works+0x27/0x27
May 24 20:29:36 Tower kernel: kthread+0xe5/0xea
May 24 20:29:36 Tower kernel: ? __kthread_bind_mask+0x57/0x57
May 24 20:29:36 Tower kernel: ret_from_fork+0x22/0x30
Any thoughts? I've attached a full syslog: tower-syslog-20210525-0135.zip
Edited May 25, 2021 by Mike S
JorgeB Posted May 25, 2021
According to the log there's a new cache device, but there's also a device missing:
May 24 20:25:31 Tower emhttpd: /mnt/cache TotDevices: 2
May 24 20:25:31 Tower emhttpd: /mnt/cache NumDevices: 2
May 24 20:25:31 Tower emhttpd: /mnt/cache NumFound: 1
May 24 20:25:31 Tower emhttpd: /mnt/cache NumMissing: 1
May 24 20:25:31 Tower emhttpd: /mnt/cache NumMisplaced: 0
May 24 20:25:31 Tower emhttpd: /mnt/cache NumExtra: 1
May 24 20:25:31 Tower emhttpd: /mnt/cache LuksState: 0
May 24 20:25:31 Tower emhttpd: shcmd (332): mount -t btrfs -o noatime,space_cache=v2,discard=async,degraded -U 8af1fbf7-4e95-4aa6-aa41-91cf4fcabeec /mnt/cache
May 24 20:25:31 Tower kernel: BTRFS info (device sdf1): turning on async discard
May 24 20:25:31 Tower kernel: BTRFS info (device sdf1): allowing degraded mounts
May 24 20:25:31 Tower kernel: BTRFS info (device sdf1): using free space tree
May 24 20:25:31 Tower kernel: BTRFS info (device sdf1): has skinny extents
May 24 20:25:31 Tower kernel: BTRFS warning (device sdf1): devid 2 uuid 34efdbf3-fad8-42bb-acdc-bba1ed3bcaf4 is missing
May 24 20:25:31 Tower kernel: BTRFS info (device sdf1): enabling ssd optimizations
May 24 20:25:31 Tower kernel: BTRFS error (device sdf1): balance: invalid convert data profile raid1
May 24 20:25:31 Tower kernel: BTRFS warning (device sdf1): Skipping commit of aborted transaction.
Do you still have the missing device?
Mike S Posted May 25, 2021 (edited)
The only change was adding the new device, but it crashed during the original mount, so maybe it's recognizing that as another device that is missing? It looks like it's looking for 2 total but found 1 and 1 extra? I also see this in that message:
May 24 20:25:31 Tower kernel: BTRFS error (device sdf1): balance: invalid convert data profile raid1
Does that have anything to do with it?
Edited May 25, 2021 by Mike S
JorgeB Posted May 25, 2021
Total devices means the pool has 2 devices, and 2 devices are assigned (num devices), but the most important part is this one:
May 24 20:25:31 Tower emhttpd: /mnt/cache NumFound: 1
May 24 20:25:31 Tower emhttpd: /mnt/cache NumMissing: 1
May 24 20:25:31 Tower emhttpd: /mnt/cache NumMisplaced: 0
May 24 20:25:31 Tower emhttpd: /mnt/cache NumExtra: 1
This means only 1 pool device was found, there's 1 missing and 1 extra (new device).
1 hour ago, Mike S said: Does that have anything to do with it?
Yes, it can't convert to raid1 because of the missing device.
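For anyone reading the same counters out of their own syslog, here is a minimal sketch that pulls the last reported value of each one for a given pool. The function name `pool_counters` is made up for illustration, and the log path in the usage comment is an assumption about where the syslog lives on your system; the line format is taken from the excerpts in this thread.

```shell
#!/bin/bash
# Sketch only: print the most recent emhttpd device counters for one pool.
# Reads syslog text on stdin. pool_counters is a hypothetical helper name.
pool_counters() {
    local pool="${1:-/mnt/cache}"
    awk -v pool="$pool" '
        # Match lines such as: "... emhttpd: /mnt/cache NumFound: 1"
        $0 ~ ("emhttpd: " pool " (TotDevices|NumDevices|NumFound|NumMissing|NumMisplaced|NumExtra): ") {
            name = $(NF - 1)          # e.g. "NumFound:"
            sub(/:$/, "", name)       # strip the trailing colon
            value[name] = $NF         # keep the latest value seen
            if (!(name in seen)) { order[++n] = name; seen[name] = 1 }
        }
        END { for (i = 1; i <= n; i++) print order[i] "=" value[order[i]] }
    '
}

# Example (path is an assumption):
#   pool_counters /mnt/cache < /var/log/syslog
```

A non-zero NumMissing in the output is the condition described above: the pool expects a device that was not found.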
Mike S Posted May 25, 2021
Hmm, so if that is the case, how can I get rid of the ghost device?
JorgeB Posted May 25, 2021
See if the pool starts with only the existing device; if yes, post new diags after array start.
Mike S Posted May 25, 2021
Still stuck at "Starting Services" with only the original device. I see this now in the syslog instead:
May 25 09:56:08 Tower emhttpd: shcmd (356): mkdir -p /mnt/cache
May 25 09:56:08 Tower emhttpd: /mnt/cache uuid: 8af1fbf7-4e95-4aa6-aa41-91cf4fcabeec
May 25 09:56:08 Tower emhttpd: /mnt/cache TotDevices: 2
May 25 09:56:08 Tower emhttpd: /mnt/cache NumDevices: 1
May 25 09:56:08 Tower emhttpd: /mnt/cache NumFound: 1
May 25 09:56:08 Tower emhttpd: /mnt/cache NumMissing: 1
May 25 09:56:08 Tower emhttpd: /mnt/cache NumMisplaced: 0
May 25 09:56:08 Tower emhttpd: /mnt/cache NumExtra: 0
May 25 09:56:08 Tower emhttpd: /mnt/cache LuksState: 0
May 25 09:56:08 Tower emhttpd: shcmd (357): mount -t btrfs -o noatime,space_cache=v2,discard=async,degraded -U 8af1fbf7-4e95-4aa6-aa41-91cf4fcabeec /mnt/cache
JorgeB Posted May 25, 2021
Please post the complete syslog to see the error/crash.
Mike S Posted May 25, 2021
tower-syslog-20210525-1507.zip
JorgeB Posted May 25, 2021
You'll need to recreate the pool, but if there's still important data there you can try to recover with this:
First create a temp dir:
mkdir /x
Then try to mount with skip balance:
mount -o degraded,skip_balance /dev/sdf1 /x
If that doesn't work, try read-only:
mount -o degraded,ro /dev/sdf1 /x
If either works, you can browse /x and copy any important data to the array.
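The same sequence can be wrapped in a small script so the read-only attempt only runs if the skip-balance attempt fails. This is a sketch, not an official procedure: `try_recovery_mount` and the `MOUNT_CMD` override are hypothetical names added here so the logic can be dry-run, and the device and directory are the ones from the post above, which may differ on your system.

```shell
#!/bin/bash
# Sketch only: try the two mount attempts from the post, in order, and stop
# at the first one that succeeds.
# MOUNT_CMD is a hypothetical override for dry runs (e.g. MOUNT_CMD=echo).
try_recovery_mount() {
    local dev="$1" dir="$2" opts
    mkdir -p "$dir" || return 1
    for opts in degraded,skip_balance degraded,ro; do
        if ${MOUNT_CMD:-mount} -o "$opts" "$dev" "$dir"; then
            echo "mounted with -o $opts"
            return 0
        fi
    done
    echo "neither mount attempt worked" >&2
    return 1
}

# Example, using the device and temp dir from the post (run as root):
#   try_recovery_mount /dev/sdf1 /x
```

If the function reports success, browse the directory and copy data to the array as described above; if it returns non-zero, neither option mounted the filesystem.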
Mike S Posted May 25, 2021
Whelp, neither of those worked, unfortunately. My cache is sitting at 240 GB/240 GB after removing the one device. If I have to rebuild the pool, how do I go about doing that? I wasn't seeing anything in the docs: https://wiki.unraid.net/Manual/Storage_Management#Adding_disks_to_a_pool
JorgeB Posted May 25, 2021
If there's no data there you just need to re-format.
Mike S Posted May 25, 2021
Oh, I meant 240 of 240 full, not empty.
JorgeB Posted May 25, 2021
If it's completely full, that's possibly one of the reasons it's crashing; a COW filesystem should never be completely full. See here for some more recovery options, then re-format.
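To keep an eye on how full a mounted pool is before it reaches that point, something along these lines may help. It is a rough sketch: `used_percent` is a made-up helper, the 90% threshold in the usage comment is an arbitrary example value, and `df` figures can be imprecise on btrfs, so treat the number as an approximation (`btrfs filesystem usage` gives a more detailed breakdown).

```shell
#!/bin/bash
# Sketch only: print the used-space percentage (without the % sign) of the
# filesystem that contains the given path, based on POSIX-format df output.
# used_percent is a hypothetical helper name.
used_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Example (path and threshold are assumptions):
#   pct=$(used_percent /mnt/cache)
#   if [ "$pct" -ge 90 ]; then
#       echo "warning: /mnt/cache is ${pct}% full"
#   fi
```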
Mike S Posted May 25, 2021
Awesome, thanks so much for the help. I was finally able to get a copy of the cache and am working on rebuilding the pool now. Appreciate the help!!