
6.12.6 - ZFS cache pool error "status: One or more devices has experienced an error resulting in data corruption"


Solved by jeradc


The system won't fully boot up; I think it's hung bringing up the file shares.

  • Lower left corner of the UI says "Array Starting - Starting Services..."
  • Docker won't start

 

Reading up on the link provided in the log to the OpenZFS GitHub docs, I've run the recommended command:
[screenshot: ZFSCommand.PNG]
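For context, the OpenZFS guidance for this status message generally comes down to something like the following (a rough sketch, assuming the pool is named cache; the exact command in the screenshot isn't reproduced here):

zpool status -v cache    # list the files affected by the corruption
# restore or delete the affected files, then clear the error counters:
zpool clear cache
zpool scrub cache        # re-check the pool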

 

I don't care about anything in that folder ("Cache/Scratch Space"), and don't mind blowing it away completely.

 

But... I don't know how to repair the ZFS filesystem so Unraid can finish booting and get Docker up again.

 

Any next steps would be appreciated.

tower-diagnostics-20240131-0009.zip

6 hours ago, JorgeB said:

Sorry for the delay, likely different time zones, post the output of:

zfs list -t all

 

When I said "3 hours", I was only referring to the system not giving me feedback on the success or failure of the scrub command; I wasn't sure how long a scrub could take.

 

Here is the output:

[screenshot: ZFS-List.PNG (output of zfs list -t all)]


Setting Docker Enabled -> No, rebooting, and then running the 'zfs destroy' command was successful.

Then, running a scrub was successful.
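For anyone following along, the commands were roughly as follows (a sketch, assuming the dataset name shown later in the thread, "cache/SCRATCH SPACE"):

zfs destroy -r "cache/SCRATCH SPACE"   # remove the damaged dataset (and any children)
zpool scrub cache                      # then scrub the pool
zpool status -v cache                  # confirm the scrub finished without errors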

 

[screenshot: ZPool_Healthy.PNG]

 

However, the array now never finishes coming fully online after re-enabling Docker

 

[screenshot: UNRAIDStatus.PNG]

 

and obviously Docker does not start. There is really nothing in the log for me to go on to troubleshoot and get Docker working again.

tower-diagnostics-20240201-1410.zip

Edited by jeradc

You have something creating that dataset again:

 

cache/SCRATCH SPACE  124G  128K  124G   1% /mnt/cache/SCRATCH SPACE

 

Likely a container mapping; also try recreating the docker image:

 

https://docs.unraid.net/unraid-os/manual/docker-management/#re-create-the-docker-image-file

Also see below if you have any custom docker networks:

https://docs.unraid.net/unraid-os/manual/docker-management/#docker-custom-networks
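One way to track down which container recreates that path (a sketch; the grep pattern and the Unraid template location are assumptions) is to list each container's host-path mounts and search for the dataset name:

# print every container's name and host-path mounts, then filter:
for c in $(docker ps -aq); do
  docker inspect --format '{{.Name}}: {{range .Mounts}}{{.Source}} {{end}}' "$c"
done | grep -i "scratch"
# or search the saved container templates for that host path:
grep -ril "SCRATCH SPACE" /boot/config/plugins/dockerMan/templates-user/ 2>/dev/null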


"Docker Service failed to start."

  • Disabled Docker, deleted the vdisk, started Docker... it won't start.
  • Disabled Docker, rebooted, deleted the vdisk, started Docker... it won't start.

The system just keeps saying "Array Started..... Starting services" in the corner, and when I click on the Docker tab it says "Docker Service failed to start."


There's a ZFS-related call trace when the pool mounts:

 

Feb  2 12:43:23 Tower kernel: Call Trace:
Feb  2 12:43:23 Tower kernel: <TASK>
Feb  2 12:43:23 Tower kernel: dump_stack_lvl+0x44/0x5c
Feb  2 12:43:23 Tower kernel: spl_panic+0xd0/0xe8 [spl]
Feb  2 12:43:23 Tower kernel: ? sysvec_apic_timer_interrupt+0x92/0xa6
Feb  2 12:43:23 Tower kernel: ? bt_grow_leaf+0xc3/0xd6 [zfs]
Feb  2 12:43:23 Tower kernel: ? zfs_btree_insert_leaf_impl+0x21/0x44 [zfs]
Feb  2 12:43:23 Tower kernel: ? zfs_btree_add_idx+0xee/0x1c5 [zfs]
Feb  2 12:43:23 Tower kernel: range_tree_remove_impl+0x77/0x406 [zfs]
Feb  2 12:43:23 Tower kernel: ? zio_wait+0x1ee/0x1fd [zfs]
Feb  2 12:43:23 Tower kernel: space_map_load_callback+0x70/0x79 [zfs]
Feb  2 12:43:23 Tower kernel: space_map_iterate+0x2d3/0x324 [zfs]
Feb  2 12:43:23 Tower kernel: ? spa_stats_destroy+0x16c/0x16c [zfs]
Feb  2 12:43:23 Tower kernel: space_map_load_length+0x93/0xcb [zfs]
Feb  2 12:43:23 Tower kernel: metaslab_load+0x33b/0x6e3 [zfs]
Feb  2 12:43:23 Tower kernel: ? _raw_spin_unlock+0x14/0x29
Feb  2 12:43:23 Tower kernel: ? raw_spin_rq_unlock_irq+0x5/0x10
Feb  2 12:43:23 Tower kernel: ? finish_task_switch.isra.0+0x140/0x218
Feb  2 12:43:23 Tower kernel: ? __schedule+0x5ba/0x612
Feb  2 12:43:23 Tower kernel: ? __wake_up_common_lock+0x88/0xbb
Feb  2 12:43:23 Tower kernel: metaslab_preload+0x4c/0x97 [zfs]
Feb  2 12:43:23 Tower kernel: taskq_thread+0x266/0x38a [spl]
Feb  2 12:43:23 Tower kernel: ? wake_up_q+0x44/0x44
Feb  2 12:43:23 Tower kernel: ? taskq_dispatch_delay+0x106/0x106 [spl]
Feb  2 12:43:23 Tower kernel: kthread+0xe4/0xef
Feb  2 12:43:23 Tower kernel: ? kthread_complete_and_exit+0x1b/0x1b
Feb  2 12:43:23 Tower kernel: ret_from_fork+0x1f/0x30
Feb  2 12:43:23 Tower kernel: </TASK>

 

This suggests the pool is corrupt; I recommend backing up what you can and re-formatting. It may also be a good idea to run memtest.
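If the pool still imports, a read-only import is a relatively safe way to copy data off before re-formatting (a sketch, assuming the pool is named cache, the pool is not already auto-mounted at the time, and a backup share exists on the array; those names are assumptions):

zpool import -o readonly=on cache                # import without writing anything to the pool
rsync -avh /mnt/cache/ /mnt/user/backup/cache/   # copy off what you can (backup share path is an assumption)
zpool export cache                               # release the pool before wiping it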

  • copy AppData folder to main array for backup.
  • disable docker
  • reboot
  • With the array not started, remove disks 1 and 2 from the cache pool and set the cache pool drive count to 0.
  • Start array, no cache pool.
  • Shut down, reboot, and run memtest86+; it passes 2 runs:
  • [photo: memtest86+ results]

Start Unraid, start the array, everything is happy.

Stop array, add pool, add both drives to pool.

Start the array... and we get a kernel panic / crash in the event log:

  • [screenshot: kernel panic in the event log]

 

I can't run/download a diagnostics bundle, as it freezes on this line and won't complete:

  • "/usr/sbin/zpool status 2>/dev/null|todos >>'/tower-diagnostics-20240202-1946/system/zfs-info.txt'"
Edited by jeradc
  • Solution
  • Stopped the array, removed the drives from the cache pool, set the cache pool drive count to 0.
  • rebooted.
  • started array
  • Ran this command: "wipefs -af /dev/sdx" (see the sketch after this list)
  • [screenshot: Unraid_Wipe.png]
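For reference, the wipe step is essentially the following (a sketch; /dev/sdx and /dev/sdy are placeholders, so double-check the device letters first, since wipefs is destructive):

wipefs -af /dev/sdx    # wipe all filesystem/RAID signatures from one former pool member
wipefs -af /dev/sdy    # repeat for the other pool device (placeholder letter)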

 

  • stopped array, created pool, added both devices, started array.
     
  • The cache came up, but the drives were unformatted.
  • Selected the box to format, and Unraid then set it up as a btrfs cache pool by default.
  • I enabled a daily balance and a weekly scrub on this pool; not sure what best practice is (see the sketch below).
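The equivalent manual commands for a btrfs pool mounted at /mnt/cache would be something like the following (a sketch; the usage filter value is an arbitrary example, not an Unraid recommendation):

btrfs scrub start /mnt/cache                 # verify checksums on all data and metadata
btrfs scrub status /mnt/cache                # check progress and results
btrfs balance start -dusage=75 /mnt/cache    # compact data chunks that are under 75% full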

 

  • Started docker.
  • moved back my AppData directory.
  • Recreated my "Scratch Space" share, and enabled visibility on it.
  • Reinstalled all my docker containers from "Apps -> Previous Apps"

 

..... and so far, I think I'm back and functional.

 

This is my 2nd failure on this cache pool in 6 months. The first failure, this past summer, was on a btrfs pool, so when I rebuilt it I chose ZFS. I guess they both suck. I had no errors for years before setting it up as a pool for redundancy last year. Wish there was more stability to this system.

Edited by jeradc
4 hours ago, jeradc said:

This is my 2nd failure on this cache pool in 6 months. The first failure, this past summer, was on a btrfs pool, so when I rebuilt it I chose ZFS. I guess they both suck.

IMHO the more likely reason is that there's an underlying hardware issue.

11 hours ago, jeradc said:

Anything indicating that in the logs?

Not directly, but two different filesystems getting corrupted is very suspicious. It could still be RAM (memtest is only definitive if it finds errors), or it could be another issue, like the devices themselves. The problem is that if it takes months to happen, it won't be easy to troubleshoot.

