
6.12.6 - ZFS cache pool error "status: One or more devices has experienced an error resulting in data corruption"


Solved by jeradc


The system won't fully boot up; I think it's hung bringing up the file shares.

  • Lower left corner of the UI says "Array Starting - Starting Services..."
  • Docker won't start

 

Reading up on the link provided in the log to the OpenZFS GitHub docs, I've run the recommended command:
[screenshot: ZFSCommand.PNG]
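For context, the OpenZFS guidance for this status message generally comes down to something like the following (a rough sketch, assuming the pool is named cache; the exact command in the screenshot isn't reproduced here):

zpool status -v cache    # list the files affected by the corruption
# restore or delete the affected files, then clear the error counters:
zpool clear cache
zpool scrub cache        # re-check the pool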

 

I don't care about anything in that folder ("Cache/Scratch Space"), and don't mind blowing it away completely.

 

But... I don't know how to repair the ZFS filesystem so Unraid can finish booting and get Docker up again.

 

Any next steps would be appreciated.

tower-diagnostics-20240131-0009.zip

6 hours ago, JorgeB said:

Sorry for the delay, likely different time zones, post the output of:

zfs list -t all

 

When I said "3 hours", I was only referring to the system not giving me feedback on the success or failure of the scrub command; I wasn't sure how long a scrub could take.

 

Here is the output:

[screenshot: ZFS-List.PNG (output of zfs list -t all)]


Setting Docker Enabled -> No, rebooting, and then running the 'zfs destroy' command was successful.

Then, running a scrub was successful.
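For anyone following along, the commands were roughly as follows (a sketch, assuming the dataset name shown later in the thread, "cache/SCRATCH SPACE"):

zfs destroy -r "cache/SCRATCH SPACE"   # remove the damaged dataset (and any children)
zpool scrub cache                      # then scrub the pool
zpool status -v cache                  # confirm the scrub finished without errors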

 

[screenshot: ZPool_Healthy.PNG]

 

However, the array now never finishes coming fully online after re-enabling Docker

 

[screenshot: UNRAIDStatus.PNG]

 

and obviously Docker does not start. There is really nothing in the log for me to go on to troubleshoot and get Docker working again.

tower-diagnostics-20240201-1410.zip

Edited by jeradc

You have something creating that dataset again:

 

cache/SCRATCH SPACE  124G  128K  124G   1% /mnt/cache/SCRATCH SPACE

 

Likely a container mapping; also try recreating the docker image:

 

https://docs.unraid.net/unraid-os/manual/docker-management/#re-create-the-docker-image-file

Also see below if you have any custom docker networks:

https://docs.unraid.net/unraid-os/manual/docker-management/#docker-custom-networks
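One way to track down which container recreates that path (a sketch; the grep pattern and the Unraid template location are assumptions) is to list each container's host-path mounts and search for the dataset name:

# print every container's name and host-path mounts, then filter:
for c in $(docker ps -aq); do
  docker inspect --format '{{.Name}}: {{range .Mounts}}{{.Source}} {{end}}' "$c"
done | grep -i "scratch"
# or search the saved container templates for that host path:
grep -ril "SCRATCH SPACE" /boot/config/plugins/dockerMan/templates-user/ 2>/dev/null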


"Docker Service failed to start."

  • Disabled Docker, deleted the vdisk, started Docker... it won't start.
  • Disabled Docker, rebooted, deleted the vdisk, started Docker... it won't start.

The system just keeps saying "Array Started..... Starting services" in the corner, and when I click on the Docker tab it says "Docker Service failed to start."


There's a ZFS-related call trace when the pool mounts:

 

Feb  2 12:43:23 Tower kernel: Call Trace:
Feb  2 12:43:23 Tower kernel: <TASK>
Feb  2 12:43:23 Tower kernel: dump_stack_lvl+0x44/0x5c
Feb  2 12:43:23 Tower kernel: spl_panic+0xd0/0xe8 [spl]
Feb  2 12:43:23 Tower kernel: ? sysvec_apic_timer_interrupt+0x92/0xa6
Feb  2 12:43:23 Tower kernel: ? bt_grow_leaf+0xc3/0xd6 [zfs]
Feb  2 12:43:23 Tower kernel: ? zfs_btree_insert_leaf_impl+0x21/0x44 [zfs]
Feb  2 12:43:23 Tower kernel: ? zfs_btree_add_idx+0xee/0x1c5 [zfs]
Feb  2 12:43:23 Tower kernel: range_tree_remove_impl+0x77/0x406 [zfs]
Feb  2 12:43:23 Tower kernel: ? zio_wait+0x1ee/0x1fd [zfs]
Feb  2 12:43:23 Tower kernel: space_map_load_callback+0x70/0x79 [zfs]
Feb  2 12:43:23 Tower kernel: space_map_iterate+0x2d3/0x324 [zfs]
Feb  2 12:43:23 Tower kernel: ? spa_stats_destroy+0x16c/0x16c [zfs]
Feb  2 12:43:23 Tower kernel: space_map_load_length+0x93/0xcb [zfs]
Feb  2 12:43:23 Tower kernel: metaslab_load+0x33b/0x6e3 [zfs]
Feb  2 12:43:23 Tower kernel: ? _raw_spin_unlock+0x14/0x29
Feb  2 12:43:23 Tower kernel: ? raw_spin_rq_unlock_irq+0x5/0x10
Feb  2 12:43:23 Tower kernel: ? finish_task_switch.isra.0+0x140/0x218
Feb  2 12:43:23 Tower kernel: ? __schedule+0x5ba/0x612
Feb  2 12:43:23 Tower kernel: ? __wake_up_common_lock+0x88/0xbb
Feb  2 12:43:23 Tower kernel: metaslab_preload+0x4c/0x97 [zfs]
Feb  2 12:43:23 Tower kernel: taskq_thread+0x266/0x38a [spl]
Feb  2 12:43:23 Tower kernel: ? wake_up_q+0x44/0x44
Feb  2 12:43:23 Tower kernel: ? taskq_dispatch_delay+0x106/0x106 [spl]
Feb  2 12:43:23 Tower kernel: kthread+0xe4/0xef
Feb  2 12:43:23 Tower kernel: ? kthread_complete_and_exit+0x1b/0x1b
Feb  2 12:43:23 Tower kernel: ret_from_fork+0x1f/0x30
Feb  2 12:43:23 Tower kernel: </TASK>

 

This suggests the pool is corrupt; I recommend backing up what you can and re-formatting. It may also be a good idea to run memtest.
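If the pool still imports, a read-only import is a relatively safe way to copy data off before re-formatting (a sketch, assuming the pool is named cache, the pool is not already auto-mounted at the time, and a backup share exists on the array; those names are assumptions):

zpool import -o readonly=on cache                # import without writing anything to the pool
rsync -avh /mnt/cache/ /mnt/user/backup/cache/   # copy off what you can (backup share path is an assumption)
zpool export cache                               # release the pool before wiping it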

  • copy AppData folder to main array for backup.
  • disable docker
  • reboot
  • With the array not started, remove disks 1 and 2 from the cache pool and set the cache pool drive count to 0.
  • Start array, no cache pool.
  • Shut down, reboot, and run memtest86+; it passes 2 runs:
  • [photo: memtest86+ results]

Start Unraid, start the array, everything is happy.

Stop array, add pool, add both drives to pool.

Start the array... and we get a kernel panic / crash in the event log:

  • [screenshot: kernel panic in the event log]

 

I can't run/download a diagnostics bundle, as it freezes on this line and won't complete:

  • "/usr/sbin/zpool status 2>/dev/null|todos >>'/tower-diagnostics-20240202-1946/system/zfs-info.txt'"
Edited by jeradc
  • Solution
  • Stopped the array, removed the drives from the cache pool, set the cache pool drive count to 0.
  • rebooted.
  • started array
  • Ran this command: "wipefs -af /dev/sdx" (see the sketch after this list)
  • [screenshot: Unraid_Wipe.png]
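For reference, the wipe step is essentially the following (a sketch; /dev/sdx and /dev/sdy are placeholders, so double-check the device letters first, since wipefs is destructive):

wipefs -af /dev/sdx    # wipe all filesystem/RAID signatures from one former pool member
wipefs -af /dev/sdy    # repeat for the other pool device (placeholder letter)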

 

  • stopped array, created pool, added both devices, started array.
     
  • The cache came up, but the drives were unformatted.
  • Selected the box to format, and Unraid then set it up as a btrfs cache pool by default.
  • I enabled a daily balance and a weekly scrub on this pool; not sure what best practice is (see the sketch below).
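The equivalent manual commands for a btrfs pool mounted at /mnt/cache would be something like the following (a sketch; the usage filter value is an arbitrary example, not an Unraid recommendation):

btrfs scrub start /mnt/cache                 # verify checksums on all data and metadata
btrfs scrub status /mnt/cache                # check progress and results
btrfs balance start -dusage=75 /mnt/cache    # compact data chunks that are under 75% full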

 

  • Started docker.
  • moved back my AppData directory.
  • Recreated my "Scratch Space" share, and enabled visibility on it.
  • Reinstalled all my docker containers from "Apps -> Previous Apps"

 

..... and so far, I think I'm back and functional.

 

This is my 2nd failure on this cache pool in 6 months. The first failure, this past summer, was on a btrfs pool, so when I rebuilt it I chose ZFS. I guess they both suck. I had no errors for years before setting it up as a pool for redundancy last year. Wish there was more stability to this system.

Edited by jeradc
4 hours ago, jeradc said:

This is my 2nd failure on this cache pool in 6 months. The first failure, this past summer, was on a btrfs pool, so when I rebuilt it I chose ZFS. I guess they both suck.

IMHO the more likely reason is that there's an underlying hardware issue.

11 hours ago, jeradc said:

Anything indicating that in the logs?

Not directly, but two different filesystems getting corrupted is very suspicious. It could still be RAM (memtest is only definitive if it finds errors), or it could be another issue, like the devices themselves. The problem is that if it takes months to happen, it won't be easy to troubleshoot.

