bunkermagnus Posted September 26, 2023

I'm a bit lost here, trying to determine whether the ZFS mirror implementation is still a bit shaky or whether I'm suffering from a hardware error, and how to proceed.

A couple of weeks ago I bought two Samsung 2 TB 970 EVO Plus devices to replace my pretty worn 512 GB SSD of the same brand and model. My reasoning was that I wanted a mirrored cache to hold some shares with semi-important data, so the spun-down array disks wouldn't be disturbed unnecessarily. Everything went smoothly and worked as planned for a couple of weeks.

Today I noticed that my mirrored ZFS pool with these devices was degraded and that one of the SSDs had been removed from the pool, as shown below:

Quote

  pool: cache_zfs_mirror
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Online the device using 'zpool online' or replace the device with 'zpool replace'.
  scan: scrub repaired 0B in 00:00:44 with 0 errors on Tue Sep 26 13:38:20 2023
config:

        NAME                STATE     READ WRITE CKSUM
        cache_zfs_mirror    DEGRADED     0     0     0
          mirror-0          DEGRADED     0     0     0
            /dev/nvme0n1p1  REMOVED      0     0     0
            /dev/nvme1n1p1  ONLINE       0     0     0

errors: No known data errors

I hadn't noticed anything, and Unraid had not alerted me to this error, neither through e-mail nor Pushover.
I consulted my remote syslog:

Quote

Sep 22 20:13:24 192.168.1.250 kernel: nvme nvme0: I/O 883 (I/O Cmd) QID 2 timeout, aborting
Sep 22 20:13:29 192.168.1.250 kernel: nvme nvme0: I/O 884 (I/O Cmd) QID 2 timeout, aborting
Sep 22 20:13:30 192.168.1.250 kernel: nvme nvme0: I/O 885 (I/O Cmd) QID 2 timeout, aborting
Sep 22 20:13:40 192.168.1.250 kernel: nvme nvme0: I/O 886 (I/O Cmd) QID 2 timeout, aborting
Sep 22 20:13:47 192.168.1.250 kernel: nvme nvme0: I/O 887 (I/O Cmd) QID 2 timeout, aborting
Sep 22 20:13:52 192.168.1.250 kernel: nvme nvme0: I/O 22 QID 0 timeout, reset controller
Sep 22 20:13:54 192.168.1.250 kernel: nvme nvme0: I/O 883 QID 2 timeout, reset controller
Sep 22 20:15:24 192.168.1.250 kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Sep 22 20:15:24 192.168.1.250 kernel: nvme nvme0: Abort status: 0x371
Sep 22 20:15:24 192.168.1.250 kernel: nvme nvme0: Abort status: 0x371
Sep 22 20:15:24 192.168.1.250 kernel: nvme nvme0: Abort status: 0x371
Sep 22 20:15:24 192.168.1.250 kernel: nvme nvme0: Abort status: 0x371
Sep 22 20:15:24 192.168.1.250 kernel: nvme nvme0: Abort status: 0x371
Sep 22 20:15:55 192.168.1.250 kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Sep 22 20:15:55 192.168.1.250 kernel: nvme nvme0: Removing after probe failure status: -19
Sep 22 20:16:25 192.168.1.250 kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Sep 22 20:16:25 192.168.1.250 kernel: nvme0n1: detected capacity change from 3907029168 to 0
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=2 offset=962088488960 size=12288 flags=180880
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=2 offset=962088452096 size=12288 flags=180880
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=2 offset=962088415232 size=4096 flags=180880
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=2 offset=962088378368 size=12288 flags=180880
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=2 offset=962088341504 size=4096 flags=180880
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=2 offset=962088304640 size=16384 flags=180880
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=2 offset=962088267776 size=16384 flags=180880
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=2 offset=963697836032 size=69632 flags=180880
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=5 offset=0 size=0 flags=100480
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=5 offset=0 size=0 flags=100480
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=5 offset=0 size=0 flags=100480
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=5 offset=0 size=0 flags=100480
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=5 offset=0 size=0 flags=100480
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=5 offset=0 size=0 flags=100480
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=5 offset=0 size=0 flags=100480
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=5 offset=0 size=0 flags=100480
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=5 offset=0 size=0 flags=100480
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=5 offset=0 size=0 flags=100480
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=5 offset=0 size=0 flags=100480
Sep 22 20:16:25 192.168.1.250 kernel: zio pool=cache_zfs_mirror vdev=/dev/nvme0n1p1 error=5 type=5 offset=0 size=0 flags=100480
Sep 22 20:27:51 192.168.1.250 emhttpd: spinning down /dev/nvme0n1
Sep 22 20:27:51 192.168.1.250 emhttpd: sdspin /dev/nvme0n1 down: 2
Sep 22 20:36:58 192.168.1.250 monitor: Stop running nchan processes
Sep 22 20:46:00 192.168.1.250 kernel: Plex Transcoder[29069]: segfault at 28 ip 000014a68d44ad02 sp 00007ffdc5058fe0 error 4 in libavformat.so.59[14a68d2ca000+1fe000] likely on CPU 1 (core 1, socket 0)
Sep 22 20:46:00 192.168.1.250 kernel: Code: 89 ef e8 f1 a8 07 00 45 31 ed 85 c0 41 0f 94 c5 45 85 e4 74 60 83 bb 80 05 00 00 00 74 57 48 8b ab 70 05 00 00 48 89 6c 24 08 <48> 8b 7d 28 e8 65 a5 07 00 eb 62 48 8b bb 80 00 00 00 8b 93 7c 05

So the "spin-down" of one of the SSDs in my new mirrored cache pool happened a couple of days back. Now to my questions:

1. I chose to make a ZFS pool, but I now have doubts and am preparing to rebuild the pool as a BTRFS mirror instead. Are there any downsides to doing that compared to keeping ZFS?

2. What would be the best way to rule out that the disabled disk is actually faulty? I'm thinking of running diagnostics from the server's BIOS to test the NVMe, and maybe a SMART test. Any other suggestions?

3. What does the log tell those of you who are a bit more hardware-savvy than I am? To me it sounds like a potential hardware I/O failure.

Thanks in advance, and sorry for the lengthy post.
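[Editor's aside: kernel floods like the one above can be triaged mechanically. The sketch below is not from this thread; it is a hypothetical helper that counts the NVMe failure signatures seen in the log (I/O timeouts, controller resets, device removal, ZFS zio errors), assuming the syslog has been exported to a text file.]

```python
import re
from collections import Counter

# Patterns for the kernel messages typically emitted when an NVMe device drops.
EVENT_PATTERNS = {
    "io_timeout": re.compile(r"nvme nvme\d+: I/O \d+ .*timeout, aborting"),
    "controller_reset": re.compile(r"nvme nvme\d+: I/O \d+ .*timeout, reset controller"),
    "not_ready": re.compile(r"nvme nvme\d+: Device not ready; aborting reset"),
    "removed": re.compile(r"nvme nvme\d+: Removing after probe failure"),
    "zio_error": re.compile(r"zio pool=\S+ vdev=\S+ error=\d+"),
}

def summarize_nvme_events(lines):
    """Count NVMe failure-related events in an iterable of syslog lines.

    A line that happens to match several patterns is counted once per pattern.
    """
    counts = Counter()
    for line in lines:
        for name, pattern in EVENT_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Run against the full export, a burst of `io_timeout` followed by `controller_reset` and `removed` within a few minutes, as in this log, points at the device (or its slot) going away entirely rather than isolated media errors.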
-Daedalus Posted September 26, 2023

I can't add to this, except to put a question 4 in there: Why was there not a big red flashing alert in the UI for this?
JorgeB Posted September 26, 2023

51 minutes ago, bunkermagnus said: I chose to make a ZFS pool, but I now have doubts and am preparing to rebuild the pool as a BTRFS mirror instead. Are there any downsides to doing that compared to keeping ZFS?

The device dropped offline; the same thing would have happened with btrfs.

52 minutes ago, bunkermagnus said: What would be the best way to rule out that the disabled disk is actually faulty? I'm thinking of running diagnostics from the server's BIOS to test the NVMe, and maybe a SMART test. Any other suggestions?

Please post the complete diagnostics.

52 minutes ago, bunkermagnus said: I hadn't noticed anything, and Unraid had not alerted me to this error, neither through e-mail nor Pushover.

Currently the GUI doesn't warn of pool device issues (XFS, btrfs, or ZFS). I have a very old feature request for this, but for now see here for better pool monitoring for btrfs and ZFS.
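[Editor's aside: since the GUI does not yet alert on pool device issues, one common workaround is a small scheduled script that checks the pool state and notifies on anything other than ONLINE. The sketch below is a hypothetical example, not the monitoring script JorgeB links to; the pool name is this thread's, and any notification hook is left to the reader (Unraid ships a notify script, commonly at /usr/local/emhttp/webGui/scripts/notify, but verify the path on your install).]

```python
import subprocess

def pool_state(zpool_status_output: str) -> str:
    """Extract the overall pool state from `zpool status` output.

    The output contains a line like ' state: DEGRADED'; return its value,
    or 'UNKNOWN' if no state line is present.
    """
    for line in zpool_status_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("state:"):
            return stripped.split(":", 1)[1].strip()
    return "UNKNOWN"

def check_pool(pool: str) -> str:
    """Run `zpool status <pool>` and return the pool's state string."""
    result = subprocess.run(["zpool", "status", pool],
                            capture_output=True, text=True)
    return pool_state(result.stdout)

if __name__ == "__main__":
    state = check_pool("cache_zfs_mirror")
    if state != "ONLINE":
        # Hook your alert mechanism here (e-mail, Pushover, Unraid notify, ...).
        print(f"WARNING: pool state is {state}")
```

Scheduled via cron (or the User Scripts plugin) every few minutes, this would have flagged the DEGRADED state within minutes instead of days.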
bunkermagnus Posted September 26, 2023 (Author)

13 minutes ago, JorgeB said: The device dropped offline; the same thing would have happened with btrfs. Please post the complete diagnostics.

Thank you for your reply; the diagnostics file is attached: unraid-diagnostics-20230926-1926.zip
bunkermagnus Posted September 26, 2023 (Author)

It would seem this is most likely a Gigabyte X570 hardware issue: many have reported NVMe issues with M.2 slot one (the PCIe 4.0 slot), where the motherboard just suddenly removes the drive from that slot. Reddit thread on Gigabyte boards dropping NVMe SSD
JorgeB Posted September 26, 2023

If it's a board problem it likely won't help, but it's possibly worth a try, as it helps in some similar cases: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Note that you will likely need to power cycle the server to get the device back (just rebooting is usually not enough), then see if the above makes a difference.
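[Editor's aside: for reference, after that edit the default boot stanza in syslinux/syslinux.cfg on the flash drive would look something like the sketch below. This is an assumed typical Unraid layout; label names and menu lines vary by install, and only the "append" line is what actually changes.]

```
default menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
```

The first parameter disables NVMe autonomous power state transitions (APST); the second disables PCIe Active State Power Management. Both are power-saving features that have been implicated in drives dropping off the bus on some platforms.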
bunkermagnus Posted September 26, 2023 (Author)

31 minutes ago, JorgeB said: If it's a board problem it likely won't help, but it's possibly worth a try, as it helps in some similar cases: on the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot": nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Thank you, I have edited the Syslinux configuration and will reboot now. A power cycle brought the ZFS pool back online like nothing ever happened; all I had to do was run a "zpool clear". I have also updated to the latest motherboard BIOS (F38f), and I have removed all USB devices from the "blue" USB ports wired directly to the CPU, as some reported that their NVMe problems disappeared when doing that. I will give it a try; too bad it's a sneaky problem that can occur after 1 hour or 2 months. Thank you anyway!