zfs Cache pool stuck on Start (doesn't start), PANIC error, UNMOUNTABLE: WRONG OR NO FILE SYSTEM (cache pool)

kintamanate · September 19, 2023

I have tried to fix this on my own by checking my error code against other people's problems, but I think I'm just making things worse.

I think it started when I updated from 6.9 to 6.12. I didn't think much of it, but my CPU threads would spike and stay at 100%, making my Docker services unusable. The only solution was a dirty reboot/shutdown. My terminal would say it was initiating a clean shutdown but not actually do it. Other times I'd notice a complete lockup: the GUI wouldn't connect and the terminal wouldn't respond to key presses. I was using a Crucial MX500 1TB SSD as my cache in btrfs, set up from the SpaceInvaderOne YouTube videos from a couple of years ago, and at some point on 6.12 it made itself read-only. So I tried to back up as much of my cache to the array as I could and replaced it with a pair of Samsung 870 EVO 1TB SSDs in a mirror for my cache pool. Thing is, I didn't know my backup plugin was deprecated after upgrading from 6.9 to 6.12, and I lost about 3 months' worth of backups... I mean, just nothing backed up. Great. My fault for not verifying backups and not reading the "what's new in this update" text.

After installing and setting up my cache pool in ZFS, it worked for about a week or two before I started to notice the same sort of CPU thread lockups and crashes. I turned off most of my Docker containers except for Plex, PiHole and a handful of other active ones. The issue persisted. Usually a reboot would fix the issue: I'd be able to start the array, back up my cache data and then swap the cache pool for a new one. This time the backup didn't work and I had to rely on my previous backup. I went with a pair of 1TB NVMe drives that were left over from upgrades to another system. I installed them and redid my cache pool, in ZFS again, but only installed Plex, PiHole and a couple of others which were mostly off. It worked for a week until the same symptoms came back. Prior to the update I was able to go months, almost a year, without a reboot; now it's a week or two, tops.

I ran a few passes on my RAM with memtest86; all passed.

When I boot up now I cannot start the array: the Start button greys out and the array never fully starts. After that, I can no longer shut down or reboot cleanly, and I cannot get a diagnostic output either.

I took some pictures of my startup terminal, and a few times it output: VERIFY3(size <= rt->rt_space) failed (281442911768576 <= 8497270784) PANIC at range_tree.c:436:range_tree_remove_impl()
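To put the two numbers in that message in perspective, here is a small arithmetic sketch. It assumes both values are byte counts, which the message itself does not state; the helper name `to_gib` is made up for illustration. Under that assumption the left-hand value is roughly 256 TiB and the right-hand value about 7.9 GiB, so the size being removed is far larger than anything a pair of 1TB drives could hold.

```shell
#!/bin/bash
# Assumption: both values from the VERIFY3 message are sizes in bytes.
size=281442911768576     # left-hand value in the panic message
rt_space=8497270784      # right-hand value in the panic message

# Convert a byte count to whole GiB (integer division, rounds down).
to_gib() {
  echo $(( $1 / 1024 / 1024 / 1024 ))
}

echo "size:     $(to_gib "$size") GiB"
echo "rt_space: $(to_gib "$rt_space") GiB"
```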

I found 3 threads that seemed to have the same issue I am having and I attempted to do this one.

I rebooted into safe mode and ran the zpool import command; the output looked like this.

When I ran zpool import -o readonly=on cache and then started my array my cache disks said UNMOUNTABLE: WRONG OR NO FILE SYSTEM.

I decided to turn off the array before I made things worse. On the flip side, I was finally able to save a diagnostic file.

tower-diagnostics-20230919-0411.zip

JorgeB · September 19, 2023

2 hours ago, kintamanate said:

When I ran zpool import -o readonly=on cache and then started my array my cache disks said UNMOUNTABLE: WRONG OR NO FILE SYSTEM.

This is normal, but according to the diags the pool is online. There will be some data corruption though: you can use -v to see which files are affected, but they will fail to copy. Copy anything else you can from /mnt/cache elsewhere and re-format. BTW, so many issues suggest a hardware problem.

pool: cache
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 1.07M in 00:00:00 with 0 errors on Tue Aug 29 22:44:11 2023
config:

    NAME           STATE     READ WRITE CKSUM
    cache          ONLINE       0     0     0
      mirror-0     ONLINE       0     0     0
        nvme0n1p1  ONLINE       0     0     0
        nvme1n1p1  ONLINE       0     0     0

errors: 2 data errors, use '-v' for a list
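
If you want to check that last line from a script, a minimal sketch could look like this. The function name `data_error_count` is hypothetical; it only reads `zpool status` text from standard input and does not touch the pool.

```shell
#!/bin/bash
# Print the number of data errors reported on the "errors:" line of
# `zpool status` output read from stdin. Prints 0 for the
# "errors: No known data errors" form and nothing if no such line exists.
data_error_count() {
  awk '
    $1 == "errors:" && $3 == "data" { print $2 }
    $1 == "errors:" && $2 == "No"   { print 0 }
  '
}
```

For example, `zpool status cache | data_error_count` would feed it the status of the pool from this thread.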

kintamanate · September 19, 2023

So write the command zpool import -v cache

to see which files are corrupt?

If I am unable to mount the cache drives how can I copy the data off?

JorgeB · September 20, 2023

9 hours ago, kintamanate said:

So write the command zpool import -v cache

to see which files are corrupt?

zpool status -v
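
With -v, the status output ends with a "Permanent errors have been detected in the following files:" section. As a rough sketch, the entries under that header can be pulled out like this (`list_damaged` is a made-up helper name, and it only parses text from standard input):

```shell
#!/bin/bash
# Print every non-empty line that follows the "Permanent errors" header
# in `zpool status -v` output, with leading whitespace removed.
# Entries are not always plain paths; they are printed as-is.
list_damaged() {
  awk '
    /Permanent errors have been detected/ { in_list = 1; next }
    in_list && NF { sub(/^[ \t]+/, ""); print }
  '
}
```

Usage would be `zpool status -v cache | list_damaged`, which gives a list of files to leave out of (or expect failures on during) the copy.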

9 hours ago, kintamanate said:

If I am unable to mount the cache drives how can I copy the data off?

The pool was mounted in the diags posted. Just import read-only again, then start the array; the GUI will still show unmountable, but the data will be under /mnt/cache.
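
The sequence described above could be sketched roughly as follows. The pool name and /mnt/cache come from this thread; the destination /mnt/disk1/cache_rescue, the `DRY_RUN` switch and the function names are assumptions for illustration and need adjusting to a real setup. By default the sketch only prints the commands rather than running them.

```shell
#!/bin/bash
# Sketch only. POOL and SRC are taken from the thread; DEST is a
# hypothetical location on the array and must be changed to a real one.
POOL="cache"
SRC="/mnt/cache"
DEST="${DEST:-/mnt/disk1/cache_rescue}"

# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: import the pool read-only, as was done earlier in the thread.
import_readonly() {
  run zpool import -o readonly=on "$POOL"
}

# Step 2 is manual: start the array from the GUI. The pool will still be
# shown as unmountable there, but the data should be under /mnt/cache.

# Step 3: copy everything readable to the destination. Files with
# corrupted blocks are expected to fail; rsync should report them and
# carry on with the rest, finishing with a non-zero exit status.
copy_off() {
  run mkdir -p "$DEST"
  run rsync -a "$SRC/" "$DEST/"
}
```

After sourcing the file, calling `import_readonly` and `copy_off` shows what would be run; setting `DRY_RUN=0` in the same shell makes the functions run the commands for real.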

kintamanate · September 23, 2023

Thank you for your help. I am back up and running. I backed up the cache pool elsewhere (the two corrupted files were of no consequence; I was able to retrieve them from a local copy), and then formatted the pool. I restored the backups and everything is working again.
