[7.2.5] NVME Drive Failure

May 13

Overview:

One of my NVME cache drives failed. I built the server last summer with all brand-new components (aside from the hard drives, which are manufacturer re-certified). I have two 2TB Samsung 990 Pro drives in a RAID 1 configuration for my cache. The one that is still working shows 89% endurance remaining. I have not performed any restarts since upgrading from 7.2.4 to 7.2.5 approximately 12 days ago. I am very surprised this happened so soon, considering it has not even been a year.

Hardware Info:

CPU: Intel Core i5-12600K
Motherboard: ASRock Z790 PG SONIC ATX LGA 1700
Memory: Patriot Viper Venom 64GB (4 x 16GB) DDR5-6400 CL32
Cache Storage: 2 x Samsung 990 Pro 2TB M.2-2280 PCIe 4.0 X4 NVME SSD (both with a heat sink provided by the motherboard)
PSU: Corsair RM850 850W 80+ Gold

Cache Configuration:

The two NVME drives are in a mirrored (RAID 1) ZFS pool.

Error/Warning Log Snippet from May 1st - Today:

# NOTE: May 1st 10am - I was trying to add a new drive to the storage pool and ran into some issues, but was eventually able to add it successfully.
May  1 10:43:43 Enterprise kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
May  1 10:43:43 Enterprise kernel: ACPI: Early table checksum verification disabled
May  1 10:43:43 Enterprise kernel: floppy0: no floppy controllers found
May  1 10:43:43 Enterprise kernel: i915 0000:00:02.0: [drm] [ENCODER:240:DDI A/PHY A] failed to retrieve link info, disabling eDP
May  1 10:43:59 Enterprise mcelog: failed to prefill DIMM database from DMI data
May  1 10:44:08 Enterprise rc.local: # array and lets you kill those tasks.
May  1 10:44:47 Enterprise root: error log : /var/log/graphql-api.log
May  1 10:45:31 Enterprise root: mount: /mnt/disk6: wrong fs type, bad option, bad superblock on /dev/md6p1, missing codepage or helper program, or other error.
May  1 10:45:31 Enterprise root:        dmesg(1) may have more information after failed mount system call.
May  1 10:45:31 Enterprise emhttpd: disk6: mount error: wrong or no file system
May  3 21:20:48 Enterprise php-fpm[11985]: [WARNING] [pool www] server reached max_children setting (50), consider raising it
May  6 20:39:15 Enterprise php-fpm[11985]: [WARNING] [pool www] server reached max_children setting (50), consider raising it
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891230208 size=122880 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996918018048 size=12288 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891475968 size=131072 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=1588870750208 size=4096 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891607040 size=118784 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891099136 size=131072 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996890968064 size=131072 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891860992 size=118784 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891353088 size=118784 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891729920 size=131072 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891983872 size=122880 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=1588870754304 size=4096 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996918878208 size=20480 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996894081024 size=49152 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996894130176 size=114688 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
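The repeated zio lines above can be condensed into a per-device summary. The following is a minimal Python sketch, not part of the original post: the function name `summarize_zio_errors` is hypothetical, and the type-number mapping is an assumption taken from the OpenZFS `zio_type` enum (2 = write, 5 = ioctl/flush) that should be checked against the installed ZFS version. On Linux, `error=5` corresponds to EIO (input/output error).

```python
import re
from collections import Counter

# Assumed mapping based on the OpenZFS zio_type enum; verify for your ZFS version.
ZIO_TYPES = {0: "null", 1: "read", 2: "write", 3: "free",
             4: "claim", 5: "ioctl/flush", 6: "trim"}

ZIO_RE = re.compile(
    r"zio pool=(?P<pool>\S+) vdev=(?P<vdev>\S+) "
    r"error=(?P<error>\d+) type=(?P<type>\d+)"
)

def summarize_zio_errors(lines):
    """Count zio error lines grouped by (vdev, operation type, errno)."""
    counts = Counter()
    for line in lines:
        m = ZIO_RE.search(line)
        if m:
            op = ZIO_TYPES.get(int(m["type"]), "unknown")
            counts[(m["vdev"], op, int(m["error"]))] += 1
    return counts

# Three lines copied from the syslog snippet above.
sample = [
    "May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891230208 size=122880 flags=3145856",
    "May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304",
    "May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304",
]

summary = summarize_zio_errors(sample)
for (vdev, op, err), n in sorted(summary.items()):
    print(f"{vdev}: {n} x {op} failed with errno {err}")
```

Feeding the full syslog through the same function would show whether every failure is confined to a single device and a single second, which points toward the device dropping off the bus rather than gradual media wear.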

Pool Status Pulled From Working NVME Drive:

  pool: cache
 state: DEGRADED
status: One or more devices have been removed.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using zpool online' or replace the device with
	'zpool replace'.
  scan: scrub repaired 0B in 00:03:54 with 0 errors on Fri May  1 10:03:56 2026
config:

	NAME                STATE     READ WRITE CKSUM
	cache               DEGRADED     0     0     0
	  mirror-0          DEGRADED     0     0     0
	    /dev/nvme0n1p1  ONLINE       0     0     0
	    /dev/nvme1n1p1  REMOVED      0     0     0

errors: No known data errors
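For anyone who wants to catch this state automatically, here is a rough Python sketch that scans `zpool status` text for devices that are not ONLINE. It is an illustration only: `unhealthy_devices` is a hypothetical helper, and it assumes the config table layout shown above (a `NAME STATE ...` header followed by one device per line, ended by a blank line).

```python
def unhealthy_devices(status_text):
    """Return (name, state) pairs for config entries that are not ONLINE."""
    bad = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True          # the device table starts after this header
            continue
        if in_config:
            if not fields:            # a blank line ends the device table
                break
            if len(fields) >= 2 and fields[1] != "ONLINE":
                bad.append((fields[0], fields[1]))
    return bad

# Config section copied from the pool status above.
sample_status = """\
config:

	NAME                STATE     READ WRITE CKSUM
	cache               DEGRADED     0     0     0
	  mirror-0          DEGRADED     0     0     0
	    /dev/nvme0n1p1  ONLINE       0     0     0
	    /dev/nvme1n1p1  REMOVED      0     0     0

errors: No known data errors
"""

for name, state in unhealthy_devices(sample_status):
    print(f"{name}: {state}")
```

On a live system the text could come from running `zpool status cache` and capturing its output; that wiring is left out here to keep the sketch self-contained.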

Pool Information:

NAME            SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
cache          1.81T   346G  1.47T        -         -    29%    18%  1.00x  DEGRADED  -
  mirror-0     1.81T   346G  1.47T        -         -    29%  18.6%      -  DEGRADED
    nvme0n1p1  1.82T      -      -        -         -      -      -      -    ONLINE
    nvme1n1p1  1.82T      -      -        -         -      -      -      -   REMOVED

Attributes of working drive:

Note that the SSD endurance remaining is 89% and the power-on hours amount to only about 8 months.

-	Critical warning	0x00
-	Temperature	41 Celsius
-	Available spare	100%
-	Available spare threshold	10%
-	Percentage used	11%
-	Data units read	169,230,056 [86.6 TB]
-	Data units written	316,196,765 [161 TB]
-	Host read commands	729,658,723
-	Host write commands	2,016,696,505
-	Controller busy time	10,418
-	Power cycles	24
-	Power on hours	6,140 (8m, 12d, 20h)
-	Unsafe shutdowns	2
-	Media and data integrity errors	0
-	Error information log entries	0
-	Warning comp. temperature time	0
-	Critical comp. temperature time	0
-	Temperature sensor 1	41 Celsius
-	Temperature sensor 2	45 Celsius
-	SSD endurance remaining	89 %
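As a rough cross-check of these attributes, the data-unit counters can be converted to terabytes and compared against the drive's endurance rating. This is a sketch under stated assumptions: an NVMe data unit is 1000 x 512 bytes per the NVMe specification, and the 1200 TBW rating for the 2TB 990 Pro is an assumed figure that should be confirmed against Samsung's datasheet.

```python
BYTES_PER_DATA_UNIT = 512 * 1000  # NVMe reports data units in thousands of 512-byte blocks

def data_units_to_tb(units):
    """Convert NVMe SMART data units to decimal terabytes."""
    return units * BYTES_PER_DATA_UNIT / 1e12

# Values copied from the attribute list above.
written_tb = data_units_to_tb(316_196_765)
read_tb = data_units_to_tb(169_230_056)
power_on_days = 6_140 / 24

# Assumption: rated endurance of the 2TB model; confirm against the datasheet.
ASSUMED_RATED_TBW = 1200

tb_per_day = written_tb / power_on_days
pct_of_rating = 100 * written_tb / ASSUMED_RATED_TBW

print(f"written: {written_tb:.1f} TB, read: {read_tb:.1f} TB")
print(f"average writes: {tb_per_day:.2f} TB per power-on day")
print(f"share of assumed rating used: {pct_of_rating:.1f}%")
```

If the assumed rating is right, the written volume is in the same range as the 11% "Percentage used" the drive reports, which suggests wear alone is an unlikely explanation for a sudden drop-out.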

Diagnostics Report:

https://drive.google.com/file/d/1v99BdIZxzgppSFI23dvH14MjrNuyCjJ_/view?usp=drive_link

Conclusion/Questions:

I find it hard to believe that a drive with ~89% life left and a runtime of ~8 months would fail prematurely like this. What could cause this, and why only the one NVME and not the other, since they're mirrored?
The pool information shows the drive as REMOVED. Could this be a motherboard issue instead? I am unable to view any information on the drive within the web GUI.
I updated to 7.2.5 12 days ago. Could this be a side effect?
I have not restarted the server since discovering the error. The next step, from what I've read, would be to run a memory test.

Any advice on how to proceed would be much appreciated, thank you!

Edited May 13 by Muzek
Minor edits for clarification

May 14

Author

Update:

I restarted the server and went into the BIOS. Both NVME drives showed up, so that should theoretically rule out a motherboard or NVME seating issue. Running a memtest now before booting Unraid.

Memtest passed.

Edited May 14 by Muzek
Update from memtest

May 14

Author

Update 2:

Restarted the server, and the array booted just fine. I'm still concerned about what caused this issue in the first place. What else should I look into to narrow down the cause?

May 14

Community Expert

If it happens again, post the diagnostics before rebooting.

May 14

Author

9 hours ago, JorgeB said:
If it happens again, post the diagnostics before rebooting.

The diagnostics from before restarting are linked in the original post. I couldn't figure out how to add attachments (maybe because this was my first post?). Anyway, here they are:

diagnostics-20260513-1511.zip

May 14

Community Expert

NVMe device is dropping offline:

May 11 13:07:16 Enterprise kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1

May 11 13:07:16 Enterprise kernel: nvme nvme1: Disabling device after reset failure: -19

See if this helps:

On Main, click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view", and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

Reboot (or power cycle the server if just a reboot doesn't bring the device back) and then see if it makes a difference.
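After the reboot, it can be worth confirming that the kernel actually picked up the three options. The sketch below is one way to check, assuming the running kernel's command line is available as text (on Linux it can be read from `/proc/cmdline`); the helper names `parse_cmdline` and `missing_params` are hypothetical, and the sample strings stand in for real output.

```python
# The three options suggested above and the values they should carry.
EXPECTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
    "pcie_port_pm": "off",
}

def parse_cmdline(cmdline):
    """Split a kernel command line into a {key: value} dict (flags get '')."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params

def missing_params(cmdline, expected=EXPECTED):
    """Return the expected options that are absent or set to another value."""
    found = parse_cmdline(cmdline)
    return {k: v for k, v in expected.items() if found.get(k) != v}

# Sample command lines; on the server, use open("/proc/cmdline").read() instead.
before = "BOOT_IMAGE=/bzimage initrd=/bzroot"
after = ("BOOT_IMAGE=/bzimage initrd=/bzroot "
         "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off")

print("missing before edit:", sorted(missing_params(before)))
print("missing after edit:", sorted(missing_params(after)))
```

An empty result for the live command line means all three options were applied; it does not by itself prove the drop-outs are fixed, only that the settings are in effect.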

May 14

Author

4 hours ago, JorgeB said:
NVMe device is dropping offline:
May 11 13:07:16 Enterprise kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
May 11 13:07:16 Enterprise kernel: nvme nvme1: Disabling device after reset failure: -19
See if this helps:
on Main click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" and add this to your default boot option, after "append initrd=/bzroot"
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off
e.g.:
append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off
Reboot (or power cycle the server if just a reboot doesn't bring the device back) and then see if it makes a difference.

Thank you for the suggestion. After looking into what these options actually do, this is a no-brainer for my setup, considering all my containers run off my NVME drives 24/7. I also found that the firmware on the drives was out of date, so I updated that as well. Hoping this problem won't come up again. Thanks!
