
Unraid stopped working while out of town. Boots in safe mode only. ZFS storage



I went on a trip out of town recently and we had a lot of storms here. There were no power outages longer than a minute or two, and the server is on backup power, so I don't think it shut down. During boot-up I get a panic and it doesn't go past that point. It looks like the drives are all being found and tested, but that's as far as I'm able to get. All of my storage disks are in a ZFS pool, and no ZFS pools are being found, though I'm not sure if that's related to the error. Below is what I'm getting on a normal boot, along with the trace I found in the logs.

 

It freezes at the panic.

If anyone could point me in the right direction, that would help a lot.

This is not on the new Unraid release with built-in ZFS.

Jun  2 00:38:03 Tower kernel: PANIC: zfs: removing nonexistent segment from range tree (offset=1f10c3514000 size=2000)
Jun  2 00:38:03 Tower kernel: Showing stack for process 57252
Jun  2 00:38:03 Tower kernel: CPU: 81 PID: 57252 Comm: z_wr_iss Tainted: P           O      5.19.17-Unraid #2
Jun  2 00:38:03 Tower kernel: Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.13.0 05/14/2021
Jun  2 00:38:03 Tower kernel: Call Trace:
Jun  2 00:38:03 Tower kernel: <TASK>
Jun  2 00:38:03 Tower kernel: dump_stack_lvl+0x44/0x5c
Jun  2 00:38:03 Tower kernel: vcmn_err+0x86/0xc3 [spl]
Jun  2 00:38:03 Tower kernel: ? pn_free+0x2a/0x2a [zfs]
Jun  2 00:38:03 Tower kernel: ? bt_grow_leaf+0xc3/0xd6 [zfs]
Jun  2 00:38:03 Tower kernel: ? zfs_btree_insert_leaf_impl+0x21/0x44 [zfs]
Jun  2 00:38:03 Tower kernel: ? pn_free+0x2a/0x2a [zfs]
Jun  2 00:38:03 Tower kernel: ? zfs_btree_find_in_buf+0x4b/0x97 [zfs]
Jun  2 00:38:03 Tower kernel: zfs_panic_recover+0x6d/0x88 [zfs]
Jun  2 00:38:03 Tower kernel: range_tree_remove_impl+0xd3/0x416 [zfs]
Jun  2 00:38:03 Tower kernel: space_map_load_callback+0x70/0x79 [zfs]
Jun  2 00:38:03 Tower kernel: space_map_iterate+0x2ec/0x341 [zfs]
Jun  2 00:38:03 Tower kernel: ? spa_stats_destroy+0x16c/0x16c [zfs]
Jun  2 00:38:03 Tower kernel: space_map_load_length+0x94/0xd0 [zfs]
Jun  2 00:38:03 Tower kernel: metaslab_load+0x34d/0x6f5 [zfs]
Jun  2 00:38:03 Tower kernel: ? spl_kmem_alloc_impl+0xc6/0xf7 [spl]
Jun  2 00:38:03 Tower kernel: ? __kmalloc_node+0x1b4/0x1df
Jun  2 00:38:03 Tower kernel: metaslab_activate+0x3b/0x1f4 [zfs]
Jun  2 00:38:03 Tower kernel: metaslab_alloc_dva+0x7e2/0xf39 [zfs]
Jun  2 00:38:03 Tower kernel: ? spl_kmem_cache_alloc+0x4a/0x608 [spl]
Jun  2 00:38:03 Tower kernel: metaslab_alloc+0xfd/0x1f6 [zfs]
Jun  2 00:38:03 Tower kernel: zio_dva_allocate+0xe8/0x738 [zfs]
Jun  2 00:38:03 Tower kernel: ? spl_kmem_alloc_impl+0xc6/0xf7 [spl]
Jun  2 00:38:03 Tower kernel: ? preempt_latency_start+0x2b/0x46
Jun  2 00:38:03 Tower kernel: ? _raw_spin_lock+0x13/0x1c
Jun  2 00:38:03 Tower kernel: ? _raw_spin_unlock+0x14/0x29
Jun  2 00:38:03 Tower kernel: ? tsd_hash_search+0x74/0x81 [spl]
Jun  2 00:38:03 Tower kernel: zio_execute+0xb2/0xdd [zfs]
Jun  2 00:38:03 Tower kernel: taskq_thread+0x277/0x3a5 [spl]
Jun  2 00:38:03 Tower kernel: ? wake_up_q+0x44/0x44
Jun  2 00:38:03 Tower kernel: ? zio_taskq_member.constprop.0.isra.0+0x4f/0x4f [zfs]
Jun  2 00:38:03 Tower kernel: ? taskq_dispatch_delay+0x115/0x115 [spl]
Jun  2 00:38:03 Tower kernel: kthread+0xe7/0xef
Jun  2 00:38:03 Tower kernel: ? kthread_complete_and_exit+0x1b/0x1b
Jun  2 00:38:03 Tower kernel: ret_from_fork+0x22/0x30
Jun  2 00:38:03 Tower kernel: </TASK>
Jun  2 00:41:36 Tower kernel: md: sync done. time=729sec
Jun  2 00:41:36 Tower kernel: md: recovery thread: exit status: 0
Jun  2 00:53:25 Tower kernel: mdcmd (37): nocheck cancel

 


I noticed that one of the lights on a hard drive tray isn't lighting up, but the disk is spinning up. Bad light or something more, not sure. I tried booting without that disk in to see if I'd get a different error, but it was the same thing. I've looked over all the connections and nothing seemed off.


I just did this: if I remove the ZFS plugin, it boots. I also went a step further, unmounted all my drives, and installed them one at a time to see if a single disk was causing this. What I've found is that I have a 4-disk backplane inside the server for the 4th vdev, and if I plug one of those disks into it I get the error -.-  When I run without the plugin I can check the connected disks in the console and it sees all my disks, but with the plugin installed I can't boot with one of those disks plugged in. Could this be a backplane failure, or possibly a vdev failure of those 4 disks?

 

I didn't check the diagnostics while doing all this, but I can reinstall the drives and grab them if it would help.
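For reference, the console check was just confirming that every disk on that backplane shows up; something along these lines (treat it as a sketch with placeholder device names, not the exact commands I ran):

lsblk -o NAME,SIZE,MODEL,SERIAL    # list every detected disk with its model and serial
smartctl -H /dev/sdX               # quick SMART health check on one disk (substitute the real device)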


Note that some data loss can occur with these:

 

zpool import -F

"Recovery mode for a non-importable pool. Attempt to return the pool to an importable state by discarding the last few transactions. Not all damaged pools can be recovered by using this option. If successful, the data from the discarded transactions is irretrievably lost. This option is ignored if the pool is importable or already imported."

 

zpool import -FX

"Used with the -F recovery option. Determines whether extreme measures to find a valid txg should take place. This allows the pool to be rolled back to a txg which is no longer guaranteed to be consistent. Pools imported at an inconsistent txg may contain uncorrectable checksum errors. For more details about pool recovery mode, see the -F option, above. WARNING: This option can be extremely hazardous to the health of your pool and should only be used as a last resort."


I'm wondering if the fault could be with my cache. I was reading over this post: https://forums.unraid.net/topic/129408-solved-read-only-file-system-after-crash-of-cache-pool/
This is the very first error I got, before the ZFS error. I only have a picture of it, so I'll sum it up.

 

btrfs critical device nvme unable to find logical device

page cache invalidation failure on direct io

file /var/cache/netdata/dbengine/datafile-

Then there was a second invalidation failure in the same folder, but for a different file.

The first file was datafile-1-0000002124.ndf (PID 5903).

The second was 2177 (PID 59017).

 

I had already tried the -F import before. I'm wondering if the problem could possibly be with the NVMe?
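Would something like this be the right way to check it (assuming the cache pool mounts at /mnt/cache and the drive shows up as /dev/nvme0)?

btrfs filesystem show              # is the cache pool missing a device?
btrfs device stats /mnt/cache      # per-device error counters for the pool
smartctl -a /dev/nvme0             # SMART/health data for the NVMe itself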


I'm running a hardware diagnostic on the server to see if it finds anything as well. I'm also going to boot up another OS in a few days and try to connect the ZFS pool to it. I've seen some people with these panic errors who were able to open the pool on another computer, or in TrueNAS (or FreeNAS, whatever it's called now), by changing the ZFS commands to bypass checks and start the pool. I haven't found a way to do that here. I'm pretty much grabbing at straws to see what I can do before a rebuild.

6 minutes ago, myths said:

on another computer, or in TrueNAS (or FreeNAS, whatever it's called now), by changing the ZFS commands to bypass checks and start the pool. I haven't found a way to do that here.

The commands are the same in any OS; they are ZFS-specific, like the examples I posted above.
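If the "bypass checks" those guides describe is the zfs_recover tunable, it's a kernel module parameter you can flip from the console once the module is loaded; as a last-resort sketch (placeholder pool name, ideally combined with a read-only import):

echo 1 > /sys/module/zfs/parameters/zfs_recover   # downgrade this class of panic to a warning
zpool import -o readonly=on -F poolname           # then retry the import with your actual pool name

Also note that the extreme rewind option is an uppercase X and is only valid together with -F, i.e. zpool import -F -X poolname.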


I didn't see anywhere to put them. The guides say to edit the ZFS boot files and add command lines to them. Unraid only just started supporting ZFS officially, so maybe it's hiding somewhere I've not looked yet. I did the scans with zero errors, so the last thing left is to try that. I'll try the -FX, but I think in the boot file I saw commands to bypass the fail-safe checks and other checks before loading ZFS, as in not to check for any corruption. Not sure, I'm half asleep right now. Two days of reading up on all this XD. Thanks for the help.

 

It said X was an invalid option.
