
ZFS experienced an unrecoverable error


Solved by d3m3zs


Hi! Today I ran into some very weird issues:

  pool: backup_zfs
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Sat May 25 12:13:28 2024
config:

	NAME           STATE     READ WRITE CKSUM
	backup_zfs     ONLINE       0     0     0
	  mirror-0     ONLINE       0     0     0
	    /dev/sdd1  ONLINE       0     0     0
	    /dev/sdb1  ONLINE       0   490     0

errors: No known data errors
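For reference, the two actions suggested by ZFS-8000-9P map to these commands (a sketch using the pool and device names from the output above; the replacement device name is hypothetical):

# Once the suspected cause (cabling, power, controller) is addressed,
# reset the error counters:
zpool clear backup_zfs /dev/sdb1

# Or, if sdb itself is failing, swap in a new disk and let ZFS resilver:
zpool replace backup_zfs /dev/sdb1 /dev/sdX1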

Another disk in the array:

  pool: disk2
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Sat May 25 12:03:21 2024
	1.65T scanned at 1.28G/s, 255G issued at 197M/s, 5.63T total
	0B repaired, 4.41% done, 07:57:55 to go
config:

	NAME          STATE     READ WRITE CKSUM
	disk2         ONLINE       0     0     0
	  /dev/md2p1  ONLINE       0     0     0

errors: 5 data errors, use '-v' for a list
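The per-file damage referred to in the last line can be listed with the verbose flag (pool name taken from the output above):

# Print the list of files with unrecoverable errors:
zpool status -v disk2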

Despite verifying the health of my disks and recreating the pool, I continue to encounter errors.

 

Surprisingly, Unraid notifications remained silent, and I stumbled upon the issue while inspecting the pool status. I even ruled out data cable problems related to motherboard connectors or the LSI controller.

 

Considering the persistent ZFS bug, I’m contemplating creating a Btrfs pool as an alternative.

What should I do?

21 hours ago, bmartino1 said:

It looks like you made a single-disk pool, and it uses a partition instead of the whole disk...


Yesterday I ran a ZFS scrub on disk2, and after 14 hours it shows no issues:


  pool: disk2
 state: ONLINE
  scan: scrub repaired 0B in 14:56:12 with 0 errors on Sun May 26 02:59:33 2024
config:

	NAME          STATE     READ WRITE CKSUM
	disk2         ONLINE       0     0     0
	  /dev/md2p1  ONLINE       0     0     0

errors: No known data errors
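For reference, the scrub was run and checked with the standard commands (pool name as above):

zpool scrub disk2    # start the scrub
zpool status disk2   # watch progress and the final result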

Today I got errors on disk1 during the parity sync (which is still running), and according to the screenshots the size is 0 and I cannot open any files on this disk:



  pool: disk1
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
  scan: scrub repaired 0B in 13:30:27 with 0 errors on Wed May  1 12:30:38 2024
config:

	NAME          STATE     READ WRITE CKSUM
	disk1         UNAVAIL      0     0     0  insufficient replicas
	  /dev/md1p1  FAULTED      3     0     0  too many errors

errors: 16 data errors, use '-v' for a list
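Per ZFS-8000-HC, once the device is reachable again the pool can be resumed (a sketch, pool name from the output above):

# After reseating/replacing cables, clear the fault and resume I/O:
zpool clear disk1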

The ZFS mirror "backup_zfs" still has the same status:


  pool: backup_zfs
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
	invalid.  Sufficient replicas exist for the pool to continue
	functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:00:01 with 0 errors on Sat May 25 12:13:28 2024
config:

	NAME           STATE     READ WRITE CKSUM
	backup_zfs     DEGRADED     0     0     0
	  mirror-0     DEGRADED    19 6.63K     0
	    /dev/sdd1  ONLINE      21 7.53K     0
	    /dev/sdb1  UNAVAIL     21 21.7K     0

errors: No known data errors
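Per ZFS-8000-4J, the UNAVAIL mirror member would be replaced with something like this (a sketch; the new device name is hypothetical):

# Replace the bad mirror leg and resilver from the healthy /dev/sdd1:
zpool replace backup_zfs /dev/sdb1 /dev/sdX1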

 

I switched to ZFS a few months ago and this only started happening yesterday. What I also don't like is that Unraid only notified me about the disk1 errors during the parity sync, nothing about the ZFS DEGRADED or SUSPENDED statuses.

 

I tried to download the diagnostics file, but the download failed in the Chrome tab.

I'll try to get it from /boot/logs if it was created there.
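If the browser download keeps failing, the same archive can apparently be generated from a console session (a sketch, assuming Unraid's diagnostics CLI helper):

# Run over SSH or at the local console; writes the zip to /boot/logs:
diagnostics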

 


Thank you.
So the mods may be of more assistance here.

It looks like you used ZFS as the array file system, per the main picture...

I've had a terrible experience with Unraid doing that. The array doesn't act or work right with ZFS in an Unraid array setup.
See docs: https://docs.unraid.net/unraid-os/manual/storage-management/

If you're doing ZFS, it should be on pool devices only. At least, the recommendation is to keep the array formatted as XFS/Btrfs.

Some of the weirdness can be explained by the array and parity not talking/working as they should, given the file system type.

15 minutes ago, bmartino1 said:

It looks like you used ZFS as the array file system, per the main picture...

Exactly, and it worked smoothly before...

16 minutes ago, bmartino1 said:

I've had terrible experience with unraid doing that.

Looks like it's now my turn to get this experience 👌

17 minutes ago, bmartino1 said:

If you're doing ZFS, it should be on pool devices only.

I thought I'd leave at least one ZFS drive in the array as a target for snapshots from the NVMe pool.

 

On 5/27/2024 at 10:42 AM, JorgeB said:

First thing is to try to fix the disk1 issues; it looks more like a power/connection issue. Replace both cables for that disk and post new diags after array start.

I tried: rebooted and changed the cables. After the first reboot I couldn't reach the GUI, but I was able to log in via SSH and remove the network config; after rebooting again from SSH I could see the GUI.

But the array didn't start; the status stayed at "Starting" and that was it. I created a diagnostics archive (attached) and rebooted the server again from the console.


And now, again, no GUI, but I can log in via SSH:


 

Any idea what I should do? I've already copied the whole config folder off the server; maybe I need to remove all the files from the config folder and reboot one more time?

Also, I checked the logs from the diagnostics archive and found a few errors on the sda drive, so maybe it's a USB issue?

May 28 13:42:12 BlackBox kernel: critical medium error, dev sda, sector 2057 op 0x0:(READ) flags 0x0 phys_seg 7 prio class 2
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 9, async page read
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 10, async page read
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 11, async page read
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 12, async page read
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 13, async page read
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 14, async page read
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 15, async page read
May 28 13:42:12 BlackBox kernel: Buffer I/O error on dev sda1, logical block 8, async page read
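If the flash drive is suspect, a non-destructive whole-device read test should reproduce any medium errors in the syslog (a sketch; sda as identified above):

# Read-only pass over the entire stick; errors show up in dmesg/syslog:
dd if=/dev/sda of=/dev/null bs=1M status=progress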

 


blackbox-diagnostics-20240528-1405.zip

Just now, JorgeB said:

sda is the flash drive; if it's logging errors you need to replace it, or you can try reformatting it first to see if that helps.

I'll try reformatting, but what happens to the license, and how do I avoid losing all my settings? Copy the /config folder back to the drive?
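A minimal backup sketch before reformatting, assuming the flash is still readable and there is space on the array (paths are illustrative):

# Copy everything off the stick first; the license .key file lives
# under /boot/config:
mkdir -p /mnt/user/backups/flash
cp -a /boot/. /mnt/user/backups/flash/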


The array has been trying to start for almost an hour.

sdb = disk1; sde and sdi are both in the "backup_zfs" pool.

What should I do?

  pool: disk1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 348M in 1 days 03:05:31 with 7992612 errors on Tue May 28 10:56:43 2024
config:

	NAME          STATE     READ WRITE CKSUM
	disk1         ONLINE       0     0     0
	  /dev/md1p1  ONLINE       0     0 15.3M

errors: 7992612 data errors, use '-v' for a list
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 24 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 24 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb, logical block 3, async page read
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770560 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770560 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770496, async page read
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770561 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770497, async page read
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770562 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770498, async page read
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770563 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770499, async page read
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770564 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770500, async page read
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770565 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770501, async page read
May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770566 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770502, async page read
May 28 16:06:27 BlackBox kernel: Buffer I/O error on dev sdb1, logical block 23437770503, async page read
May 28 16:07:58 BlackBox emhttpd: status: One or more devices has experienced an error resulting in data
May 28 16:09:54 BlackBox kernel: I/O error, dev sdi, sector 592 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:09:54 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sdi1 error=5 type=1 offset=270336 size=8192 flags=b08c1
May 28 16:09:54 BlackBox kernel: I/O error, dev sdi, sector 35156654672 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:09:54 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sdi1 error=5 type=1 offset=18000207159296 size=8192 flags=b08c1
May 28 16:09:54 BlackBox kernel: I/O error, dev sdi, sector 35156655184 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:09:54 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sdi1 error=5 type=1 offset=18000207421440 size=8192 flags=b08c1
May 28 16:09:54 BlackBox kernel: I/O error, dev sdi, sector 592 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:09:54 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sdi1 error=5 type=1 offset=270336 size=8192 flags=b08c1
May 28 16:09:54 BlackBox kernel: I/O error, dev sdi, sector 35156654672 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:09:54 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sdi1 error=5 type=1 offset=18000207159296 size=8192 flags=b08c1
May 28 16:09:54 BlackBox kernel: I/O error, dev sdi, sector 35156655184 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:09:54 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sdi1 error=5 type=1 offset=18000207421440 size=8192 flags=b08c1
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 3597621920 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=1841982390272 size=4096 flags=1808a0
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 3597607896 op 0x0:(READ) flags 0x700 phys_seg 3 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=1841975209984 size=12288 flags=40080ca0
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 3597616616 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=1841979674624 size=4096 flags=1808a0
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 3597627120 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=1841985052672 size=4096 flags=1808a0
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 592 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=270336 size=8192 flags=b08c1
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 35156654672 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=18000207159296 size=8192 flags=b08c1
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 35156655184 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=18000207421440 size=8192 flags=b08c1
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 2062909672 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=1056209719296 size=4096 flags=1808a0
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 2062950344 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=1056230543360 size=8192 flags=1808a0
May 28 16:10:05 BlackBox kernel: I/O error, dev sde, sector 2062952920 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2
May 28 16:10:05 BlackBox kernel: zio pool=backup_zfs vdev=/dev/sde1 error=5 type=1 offset=1056231862272 size=4096 flags=1808a0

 


I don't see errors for disk1 in that log, but I do see errors on both pool devices, causing this:

 

May 28 16:10:05 BlackBox kernel: WARNING: Pool 'backup_zfs' has encountered an uncorrectable I/O failure and has been suspended.

 

A pool is suspended when ZFS can no longer access it. Since there were errors on both disks, this will make Linux get stuck, and Unraid will never continue without a reboot.

 

You will need to force a shutdown, then check/replace cables for both pool disks and post new diags/syslog after array start.

13 minutes ago, JorgeB said:

I don't see errors with disk1 on that log

There are many of these records:

May 28 16:06:27 BlackBox kernel: I/O error, dev sdb, sector 23437770560 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
May 28 16:06:27 BlackBox kernel: sd 7:0:0:0: [sdb] tag#851 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK 

sdb is disk1.
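One way to double-check which device letter maps to which physical drive (standard Linux tooling; a sketch):

# Match sdb to a drive model/serial:
ls -l /dev/disk/by-id/ | grep sdb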

 

15 minutes ago, JorgeB said:

You will need to force a shutdown, then check/replace cables for both pool disks and post new diags/syslog after array start.

Thank you, on my way. 

  • Solution
On 5/29/2024 at 11:04 AM, JorgeB said:

You are having constant errors with multiple disks, suggesting some underlying issue, like a bad PSU or a bad HBA, for example. If you have some spares, try swapping some parts around; you need to fix those errors first.

I want to say thank you. It seems the issue is resolved, and now I know how to back up the flash drive and which logs are important.

The solution was to add active cooling to the LSI adapter.

I will keep an eye on my drives for a few days to be sure.

