jeradc Posted January 31

The system won't fully boot up; I think it's hung bringing up the file shares. The lower left corner of the UI says "Array Starting - Starting Services..." and Docker won't start. Reading up on the link provided in the log to the OpenZFS GitHub, I've run the recommended command. I don't care about anything in that folder ("Cache/Scratch Space") and don't mind blowing it away completely. But I don't know how to repair the ZFS filesystem so Unraid can finish booting and get Docker up again. Any next steps would be appreciated.

tower-diagnostics-20240131-0009.zip
JorgeB Posted January 31

Any snapshots on that dataset? That error usually means a file that is now gone but can still be referenced by a snapshot, or metadata corruption. You can also try scrubbing the pool.
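A minimal way to check for snapshots and start a scrub from the console (assuming the pool is named 'cache', as it is later in this thread):

zfs list -t snapshot -r cache    # list any snapshots on the pool and its datasets
zpool scrub cache                # start a scrub of the pool
zpool status -v cache            # shows scrub progress and any errors found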
jeradc Posted January 31

I don't know anything about snapshots, and I don't know how to check for metadata corruption. I assume I just click this scrub button here?
jeradc Posted February 1

Clicking the scrub button has not changed anything in three hours now. Not sure what to do.
JorgeB Posted February 1

Sorry for the delay; likely different time zones. Post the output of:

zfs list -t all
jeradc Posted February 1

6 hours ago, JorgeB said: "post the output of: zfs list -t all"

When I said "3 hours", I was only referring to the system not giving me feedback on the success or failure of the scrub, and I wasn't sure how long a scrub could take. Here is the output:
JorgeB Posted February 1

22 hours ago, jeradc said: "I don't care about anything in that folder ("Cache/Scratch Space")"

If there's nothing important in that share, destroy that dataset with:

zfs destroy 'cache/SCRATCH SPACE'

Then reboot and run a new scrub.
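Since there was earlier confusion about scrub feedback, a quick way to watch progress after the reboot (pool name 'cache' assumed; if the dataset has snapshots, zfs destroy needs the -r flag):

zpool scrub cache        # start the scrub once the pool is mounted again
zpool status -v cache    # reports scan progress, estimated time remaining, and any errors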
JorgeB Posted February 1

Something is using that dataset; having an SSH session or an Explorer window open to it is enough. If you don't see anything, reboot, leave the Docker and VM services disabled, and try again.
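A couple of ways to see what is holding the dataset open (the mount path below matches the diagnostics later in the thread, but treat it as an illustration):

lsof +D '/mnt/cache/SCRATCH SPACE' 2>/dev/null    # list processes with files open under the mount
fuser -vm '/mnt/cache/SCRATCH SPACE'              # show processes using the mounted filesystem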
jeradc Posted February 1

Setting Docker Enabled -> No, rebooting, and running the 'zfs destroy' command was successful. Then running a scrub was successful. However, the array never finishes coming completely online now after re-enabling Docker, and obviously Docker does not start. There is nothing really in the log for me to go on to troubleshoot and get Docker working again.

tower-diagnostics-20240201-1410.zip
JorgeB Posted February 2

You have something creating that dataset again:

cache/SCRATCH SPACE    124G    128K    124G    1%    /mnt/cache/SCRATCH SPACE

Likely a container mapping. Also try recreating the docker image:
https://docs.unraid.net/unraid-os/manual/docker-management/#re-create-the-docker-image-file

Also see below if you have any custom docker networks:
https://docs.unraid.net/unraid-os/manual/docker-management/#docker-custom-networks
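One way to spot which container has the mapping even while Docker won't start is to search the saved container templates; the path below is the usual Unraid template location and is an assumption here:

grep -il 'SCRATCH SPACE' /boot/config/plugins/dockerMan/templates-user/*.xml    # list templates that reference the path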
jeradc Posted February 2

It's a temp folder for several of my Docker containers (tdarr, sonarr, etc.). I recreated the share before trying to restart Docker so the apps wouldn't fail.
jeradc Posted February 2

"Docker Service failed to start."

Disabled Docker, deleted the vdisk, started Docker... won't start.
Disabled Docker, rebooted, deleted the vdisk, started Docker... won't start.

The system just keeps saying "Array Started... Starting services" in the corner, and when I click on the Docker tab: "Docker Service failed to start."
jeradc Posted February 2

Attached: tower-diagnostics-20240202-1312.zip
JorgeB Posted February 2

There's a ZFS-related call trace when the pool mounts:

Feb 2 12:43:23 Tower kernel: Call Trace:
Feb 2 12:43:23 Tower kernel: <TASK>
Feb 2 12:43:23 Tower kernel: dump_stack_lvl+0x44/0x5c
Feb 2 12:43:23 Tower kernel: spl_panic+0xd0/0xe8 [spl]
Feb 2 12:43:23 Tower kernel: ? sysvec_apic_timer_interrupt+0x92/0xa6
Feb 2 12:43:23 Tower kernel: ? bt_grow_leaf+0xc3/0xd6 [zfs]
Feb 2 12:43:23 Tower kernel: ? zfs_btree_insert_leaf_impl+0x21/0x44 [zfs]
Feb 2 12:43:23 Tower kernel: ? zfs_btree_add_idx+0xee/0x1c5 [zfs]
Feb 2 12:43:23 Tower kernel: range_tree_remove_impl+0x77/0x406 [zfs]
Feb 2 12:43:23 Tower kernel: ? zio_wait+0x1ee/0x1fd [zfs]
Feb 2 12:43:23 Tower kernel: space_map_load_callback+0x70/0x79 [zfs]
Feb 2 12:43:23 Tower kernel: space_map_iterate+0x2d3/0x324 [zfs]
Feb 2 12:43:23 Tower kernel: ? spa_stats_destroy+0x16c/0x16c [zfs]
Feb 2 12:43:23 Tower kernel: space_map_load_length+0x93/0xcb [zfs]
Feb 2 12:43:23 Tower kernel: metaslab_load+0x33b/0x6e3 [zfs]
Feb 2 12:43:23 Tower kernel: ? _raw_spin_unlock+0x14/0x29
Feb 2 12:43:23 Tower kernel: ? raw_spin_rq_unlock_irq+0x5/0x10
Feb 2 12:43:23 Tower kernel: ? finish_task_switch.isra.0+0x140/0x218
Feb 2 12:43:23 Tower kernel: ? __schedule+0x5ba/0x612
Feb 2 12:43:23 Tower kernel: ? __wake_up_common_lock+0x88/0xbb
Feb 2 12:43:23 Tower kernel: metaslab_preload+0x4c/0x97 [zfs]
Feb 2 12:43:23 Tower kernel: taskq_thread+0x266/0x38a [spl]
Feb 2 12:43:23 Tower kernel: ? wake_up_q+0x44/0x44
Feb 2 12:43:23 Tower kernel: ? taskq_dispatch_delay+0x106/0x106 [spl]
Feb 2 12:43:23 Tower kernel: kthread+0xe4/0xef
Feb 2 12:43:23 Tower kernel: ? kthread_complete_and_exit+0x1b/0x1b
Feb 2 12:43:23 Tower kernel: ret_from_fork+0x1f/0x30
Feb 2 12:43:23 Tower kernel: </TASK>

This suggests the pool is corrupt. I recommend backing up what you can and re-formatting; it may also be a good idea to run memtest.
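A minimal example of copying data off the pool to the array before re-formatting (source and destination paths are illustrative, not specific to this server):

rsync -a --progress /mnt/cache/appdata/ /mnt/disk1/backup/appdata/    # copy the appdata share to an array disk, preserving permissions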
jeradc Posted February 3

Copied the AppData folder to the main array for backup.
Disabled Docker.
Rebooted with the array not started, removed disks 1 and 2 from the cache pool, set the cache pool drive count to 0.
Started the array with no cache pool.
Shut down, rebooted, and ran memtest86+; it passes two runs.
Started Unraid, started the array, everything is happy.
Stopped the array, added the pool, added both drives to the pool.
Started the array... and we get a kernel panic / crash in the event log.

I can't run/download a diagnostics bundle as it freezes on this line and won't complete:
"/usr/sbin/zpool status 2>/dev/null|todos >>'/tower-diagnostics-20240202-1946/system/zfs-info.txt'"
JorgeB Posted February 3

You need to wipe the devices and re-format the pool.
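For reference, a rough sequence for wiping the old pool members from the console (sdX/sdY are placeholders; confirm the device letters first, because wipefs is destructive):

lsblk -o NAME,SIZE,MODEL,FSTYPE    # confirm which devices were in the pool
wipefs -af /dev/sdX                # erase all filesystem/ZFS signatures on the first device
wipefs -af /dev/sdY                # and on the second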
jeradc Posted February 3

I assumed recreating the pool would do that. If you have specific commands for me, that would be appreciated; otherwise, off to Google I go.
jeradc Posted February 4 (Solution)

Stopped the array, removed the drives from the cache pool, set the pool drive count to 0, rebooted, and started the array.
Ran this command: "wipefs -af /dev/sdx"
Stopped the array, created the pool, added both devices, and started the array. The cache came up, but the drives were unformatted. Selected the box to format, and Unraid then set it up as a BTRFS cache pool by default. I enabled a daily balance and a weekly scrub on this pool; not sure what best practice is.
Started Docker, moved back my AppData directory, recreated my "Scratch Space" share and enabled visibility on it, and reinstalled all my Docker containers from Apps -> Previous Apps... and so far, I think I'm back and functional.

This is my second failure on this cache pool in six months. The first failure this past summer was on a BTRFS pool, so when I rebuilt it, I chose ZFS. I guess they both suck. I had no errors for years before setting it up as a pool for redundancy last year. Wish there was more stability to this system.
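For anyone wanting the manual equivalents of those scheduled jobs, a sketch (the /mnt/cache mount point and the usage filter value are assumptions, not taken from this setup):

btrfs scrub start /mnt/cache               # verify checksums on the new BTRFS pool
btrfs scrub status /mnt/cache              # check scrub progress and results
btrfs balance start -dusage=75 /mnt/cache  # rewrite data chunks that are less than 75% full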
JorgeB Posted February 4

4 hours ago, jeradc said: "This is my 2nd failure on this cache pool in (6) months. First failure this past summer was on a btrfs pool, so when I rebuilt it, I chose ZFS. I guess they both suck."

IMHO the more likely reason is that there's an underlying hardware issue.
jeradc Posted February 5

On 2/4/2024 at 6:26 AM, JorgeB said: "IMHO the more likely reason is that there's an underlying hardware issue."

How do I find that? Is there anything indicating that in the logs? Any tests you suggest?
JorgeB Posted February 6

11 hours ago, jeradc said: "Anything indicating that in the logs?"

Not directly, but two different filesystems getting corrupt is very suspicious. It could still be RAM; memtest is only definitive if it finds errors. It could also be another issue, like the devices themselves. The problem is that if it takes some months to happen, it won't be easy to troubleshoot.
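If the drives are suspected, one starting point is a SMART report and self-test for each pool member (device names are placeholders):

smartctl -a /dev/sdX       # full SMART attributes and the drive's error log
smartctl -t long /dev/sdX  # start an extended self-test; results show up in a later smartctl -a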