
ZFS experienced an unrecoverable error


Solved by d3m3zs


Hi! Today I ran into some very weird issues:

  pool: backup_zfs
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Sat May 25 12:13:28 2024
config:

	NAME           STATE     READ WRITE CKSUM
	backup_zfs     ONLINE       0     0     0
	  mirror-0     ONLINE       0     0     0
	    /dev/sdd1  ONLINE       0     0     0
	    /dev/sdb1  ONLINE       0   490     0

errors: No known data errors
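For reference, the two actions suggested by ZFS-8000-9P map to these commands (a sketch using the pool and device names from the output above; the replacement device name is hypothetical):

# Once the suspected cause (cabling, power, controller) is addressed,
# reset the error counters:
zpool clear backup_zfs /dev/sdb1

# Or, if sdb itself is failing, swap in a new disk and let ZFS resilver:
zpool replace backup_zfs /dev/sdb1 /dev/sdX1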

Another disk in the array:

  pool: disk2
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Sat May 25 12:03:21 2024
	1.65T scanned at 1.28G/s, 255G issued at 197M/s, 5.63T total
	0B repaired, 4.41% done, 07:57:55 to go
config:

	NAME          STATE     READ WRITE CKSUM
	disk2         ONLINE       0     0     0
	  /dev/md2p1  ONLINE       0     0     0

errors: 5 data errors, use '-v' for a list
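The per-file damage referred to in the last line can be listed with the verbose flag (pool name taken from the output above):

# Print the list of files with unrecoverable errors:
zpool status -v disk2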

Despite verifying the health of my disks and recreating the pool, I continue to encounter errors.

 

Surprisingly, Unraid notifications remained silent, and I stumbled upon the issue while inspecting the pool status. I even ruled out data cable problems related to motherboard connectors or the LSI controller.

 

Considering the persistent ZFS bug, I’m contemplating creating a Btrfs pool as an alternative.

What should I do?

21 hours ago, bmartino1 said:

It looks like you made a single-disk pool, and it uses a partition instead of the whole disk...


Yesterday I ran a ZFS scrub on disk2, and after 14 hours it shows no issues:


  pool: disk2
 state: ONLINE
  scan: scrub repaired 0B in 14:56:12 with 0 errors on Sun May 26 02:59:33 2024
config:

	NAME          STATE     READ WRITE CKSUM
	disk2         ONLINE       0     0     0
	  /dev/md2p1  ONLINE       0     0     0

errors: No known data errors
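For reference, the scrub was run and checked with the standard commands (pool name as above):

zpool scrub disk2    # start the scrub
zpool status disk2   # watch progress and the final result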

Today I got errors on disk1 during the parity sync (which is still running), and according to the screenshots the size is 0 and I cannot open any files on this disk:



  pool: disk1
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
  scan: scrub repaired 0B in 13:30:27 with 0 errors on Wed May  1 12:30:38 2024
config:

	NAME          STATE     READ WRITE CKSUM
	disk1         UNAVAIL      0     0     0  insufficient replicas
	  /dev/md1p1  FAULTED      3     0     0  too many errors

errors: 16 data errors, use '-v' for a list
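Per ZFS-8000-HC, once the device is reachable again the pool can be resumed (a sketch, pool name from the output above):

# After reseating/replacing cables, clear the fault and resume I/O:
zpool clear disk1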

The ZFS mirror "backup_zfs" still has the same status:


  pool: backup_zfs
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
	invalid.  Sufficient replicas exist for the pool to continue
	functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Sat May 25 12:13:28 2024
config:

	NAME           STATE     READ WRITE CKSUM
	backup_zfs     DEGRADED     0     0     0
	  mirror-0     DEGRADED    19 6.63K     0
	    /dev/sdd1  ONLINE      21 7.53K     0
	    /dev/sdb1  UNAVAIL     21 21.7K     0

errors: No known data errors
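Per ZFS-8000-4J, the UNAVAIL mirror member would be replaced with something like this (a sketch; the new device name is hypothetical):

# Replace the bad mirror leg and resilver from the healthy /dev/sdd1:
zpool replace backup_zfs /dev/sdb1 /dev/sdX1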

 

I switched to ZFS a few months ago and this only started happening yesterday. What I also don't like is that Unraid only notified me about the disk1 errors during the parity sync, nothing about the ZFS DEGRADED or SUSPENDED statuses.

 

I tried to download the diagnostics file, but the download failed in the Chrome tab.

I'll try to get it from /boot/logs if it was created there.
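If the browser download keeps failing, the same archive can apparently be generated from a console session (a sketch, assuming Unraid's diagnostics CLI helper):

# Run over SSH or at the local console; writes the zip to /boot/logs:
diagnostics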

 


Thank you.
So the mods may be of more assistance here.

It looks like you used ZFS as the array file system, per the main picture...

I've had a terrible experience with Unraid doing that. The array doesn't act or work right with ZFS in an Unraid array setup.
See docs: https://docs.unraid.net/unraid-os/manual/storage-management/

If you're doing ZFS, it should be on pool devices only. At least, the recommendation is to keep the array formatted as XFS/Btrfs.

Some of the weirdness can be explained by the array and parity not talking/working as they should, given the file system type.

15 minutes ago, bmartino1 said:

It looks like you used ZFS as the array file system, per the main picture...

Exactly, and it worked smoothly before...

16 minutes ago, bmartino1 said:

I've had terrible experience with unraid doing that.

Looks like it's now my turn to get this experience 👌

17 minutes ago, bmartino1 said:

If you're doing ZFS, it should be on pool devices only.

I thought I'd leave at least one ZFS drive in the array as a target for snapshots from the NVMe pool.

 

On 5/27/2024 at 10:42 AM, JorgeB said:

First thing is to try to fix the disk1 issues; it looks more like a power/connection issue. Replace both cables for that disk and post new diags after array start.

I tried: rebooted and changed the cables. After the first reboot I couldn't reach the GUI, but I was able to log in via SSH and remove the network config; after rebooting again from SSH I could see the GUI.

But the array didn't start; the status stayed at "Starting" and that was it. I created a diagnostics archive (attached) and rebooted the server again from the console.


And now, again, no GUI, but I can log in via SSH:


 

Any idea what I should do? I've already copied the whole config folder off the server; maybe I need to remove all the files from the config folder and reboot one more time?

Also, I checked the logs from the diagnostics archive and found a few errors on the sda drive, so maybe it's a USB issue?

May 28 13:42:12 BlackBox kernel: critical medium error, dev sda, sector 2057 op 0x0:(READ) flags 0x0 phys_seg 7 prio class 2
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 9, async page read
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 10, async page read
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 11, async page read
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 12, async page read
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 13, async page read
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 14, async page read
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 15, async page read
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 8, async page read
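If the flash drive is suspect, a non-destructive whole-device read test should reproduce any medium errors in the syslog (a sketch; sda as identified above):

# Read-only pass over the entire stick; errors show up in dmesg/syslog:
dd if=/dev/sda of=/dev/null bs=1M status=progress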

 


blackbox-diagnostics-20240528-1405.zip

Just now, JorgeB said:

sda is the flash drive; if it's logging errors you need to replace it, or you can try reformatting it first to see if that helps.

I'll try reformatting, but what happens to the license, and how do I avoid losing all my settings? Copy the /config folder back to the drive?
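A minimal backup sketch before reformatting, assuming the flash is still readable and there is space on the array (paths are illustrative):

# Copy everything off the stick first; the license .key file lives
# under /boot/config:
mkdir -p /mnt/user/backups/flash
cp -a /boot/. /mnt/user/backups/flash/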


The array has been trying to start for almost an hour.

sdb = disk1; sde and sdi are both in the "backup_zfs" pool.

What should I do?

  pool: disk1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 348M in 1 days 03:05:31 with 7992612 errors on Tue May 28 10:56:43 2024
config:

	NAME          STATE     READ WRITE CKSUM
	disk1         ONLINE       0     0     0
	  /dev/md1p1  ONLINE       0     0 15.3M

errors: 7992612 data errors, use '-v' for a list
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 24 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 24 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb, logical block 3, async page read
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770560 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770560 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770496, async page read
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770561 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770497, async page read
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770562 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770498, async page read
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770563 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770499, async page read
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770564 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770500, async page read
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770565 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770501, async page read
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770566 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770502, async page read
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770503, async page read
May 28 16:07:58 BlackBox emhttpd: status: One or more devices has experienced an error resulting in data
May 28 16:09:54 BlackBox kernel: I/O error, dev sdi, sector 592 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:09:54 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sdi1 error=5 type=1 offset=270336 size=8192 flags=b08c1
May 28 16:09:54 BlackBox kernel: I/O error, dev sdi, sector 35156654672 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:09:54 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sdi1 error=5 type=1 offset=18000207159296 size=8192 flags=b08c1
May 28 16:09:54 BlackBox kernel: I/O error, dev sdi, sector 35156655184 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:09:54 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sdi1 error=5 type=1 offset=18000207421440 size=8192 flags=b08c1
May 28 16:09:54 BlackBox kernel: I/O error, dev sdi, sector 592 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:09:54 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sdi1 error=5 type=1 offset=270336 size=8192 flags=b08c1
May 28 16:09:54 BlackBox kernel: I/O error, dev sdi, sector 35156654672 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:09:54 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sdi1 error=5 type=1 offset=18000207159296 size=8192 flags=b08c1
May 28 16:09:54 BlackBox kernel: I/O error, dev sdi, sector 35156655184 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:09:54 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sdi1 error=5 type=1 offset=18000207421440 size=8192 flags=b08c1
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 3597621920 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=1841982390272 size=4096 flags=1808a0
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 3597607896 op 0x0:(READ) flags 0x700 phys_seg 3 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=1841975209984 size=12288 flags=40080ca0
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 3597616616 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=1841979674624 size=4096 flags=1808a0
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 3597627120 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=1841985052672 size=4096 flags=1808a0
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 592 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=270336 size=8192 flags=b08c1
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 35156654672 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=18000207159296 size=8192 flags=b08c1
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 35156655184 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=18000207421440 size=8192 flags=b08c1
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 2062909672 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=1056209719296 size=4096 flags=1808a0
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 2062950344 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=1056230543360 size=8192 flags=1808a0
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 2062952920 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=1056231862272 size=4096 flags=1808a0

 


I don't see errors for disk1 in that log, but I do see errors on both pool devices, causing this:

 

May 28 16:10:05 BlackBox kernel: WARNING: Pool 'backup_zfs' has encountered an uncorrectable I/O failure and has been suspended.

 

A pool is suspended when ZFS can no longer access it. Since there were errors on both disks, this will make Linux get stuck, and Unraid will never continue without a reboot.

 

You will need to force a shutdown, then check/replace cables for both pool disks and post new diags/syslog after array start.

13 minutes ago, JorgeB said:

I don't see errors with disk1 on that log

There are many of these records:

May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770560 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: sd 7:0:0:0: [sdb] tag#851 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK 

sdb is disk1.
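One way to double-check which device letter maps to which physical drive (standard Linux tooling; a sketch):

# Match sdb to a drive model/serial:
ls -l /dev/disk/by-id/ | grep sdb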

 

15 minutes ago, JorgeB said:

You will need to force a shutdown, then check/replace cables for both pool disks and post new diags/syslog after array start.

Thank you, on my way. 

  • Solution
On 5/29/2024 at 11:04 AM, JorgeB said:

You are having constant errors with multiple disks, suggesting some underlying issue, like a bad PSU or a bad HBA, for example. If you have some spares, try swapping some parts around; you need to fix those errors first.

I want to say thank you. It seems the issue is resolved, and now I know how to back up the flash drive and which logs are important.

The solution was to add active cooling to the LSI adapter.

I will keep an eye on my drives for a few days to be sure.

