May 13

Overview:
One of my NVMe cache drives failed. I built the server last summer with all brand-new components (aside from the hard drives, which are manufacturer re-certified). I have two Samsung 990 Pro 2TB drives in a RAID 1 configuration for my cache. The one that is still working shows 89% endurance remaining. I have not performed any restarts since upgrading from 7.2.4 to 7.2.5 approximately 12 days ago. I am very surprised this happened so soon, considering it has not even been a year.

Hardware Info:
CPU: Intel Core i5-12600K
Motherboard: ASRock Z790 PG SONIC ATX LGA 1700
Memory: Patriot Viper Venom 64GB (4 x 16GB) DDR5-6400 CL32
Cache Storage: 2 x Samsung 990 Pro 2TB M.2-2280 PCIe 4.0 X4 NVMe SSD (both with a heat sink provided by the motherboard)
PSU: Corsair RM850 850W 80+ Gold

Cache Configuration:
The two NVMe drives are in a RAID 1 ZFS pool.

Error/Warning Log Snippet from May 1st - Today:
# NOTE: May 1st 10am - I was trying to add a new drive to the storage pool and ran into some issues, but was eventually able to successfully add it.
May 1 10:43:43 Enterprise kernel: x86/split lock detection: #AC: crashing the kernel on kernel split_locks and warning on user-space split_locks
May 1 10:43:43 Enterprise kernel: ACPI: Early table checksum verification disabled
May 1 10:43:43 Enterprise kernel: floppy0: no floppy controllers found
May 1 10:43:43 Enterprise kernel: i915 0000:00:02.0: [drm] [ENCODER:240:DDI A/PHY A] failed to retrieve link info, disabling eDP
May 1 10:43:59 Enterprise mcelog: failed to prefill DIMM database from DMI data
May 1 10:44:08 Enterprise rc.local: # array and lets you kill those tasks.
May 1 10:44:47 Enterprise root: error log : /var/log/graphql-api.log
May 1 10:45:31 Enterprise root: mount: /mnt/disk6: wrong fs type, bad option, bad superblock on /dev/md6p1, missing codepage or helper program, or other error.
May 1 10:45:31 Enterprise root: dmesg(1) may have more information after failed mount system call.
May 1 10:45:31 Enterprise emhttpd: disk6: mount error: wrong or no file system
May 3 21:20:48 Enterprise php-fpm[11985]: [WARNING] [pool www] server reached max_children setting (50), consider raising it
May 6 20:39:15 Enterprise php-fpm[11985]: [WARNING] [pool www] server reached max_children setting (50), consider raising it
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891230208 size=122880 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996918018048 size=12288 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891475968 size=131072 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=1588870750208 size=4096 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891607040 size=118784 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891099136 size=131072 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996890968064 size=131072 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891860992 size=118784 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891353088 size=118784 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891729920 size=131072 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891983872 size=122880 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=1588870754304 size=4096 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996918878208 size=20480 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996894081024 size=49152 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996894130176 size=114688 flags=3145856
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304
May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304

Pool Status Pulled From Working NVMe Drive:
  pool: cache
 state: DEGRADED
status: One or more devices have been removed.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0B in 00:03:54 with 0 errors on Fri May 1 10:03:56 2026
config:

        NAME                STATE     READ WRITE CKSUM
        cache               DEGRADED     0     0     0
          mirror-0          DEGRADED     0     0     0
            /dev/nvme0n1p1  ONLINE       0     0     0
            /dev/nvme1n1p1  REMOVED      0     0     0

errors: No known data errors

Pool Information:
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
cache         1.81T   346G  1.47T        -         -    29%    18%  1.00x  DEGRADED  -
  mirror-0    1.81T   346G  1.47T        -         -    29%  18.6%      -  DEGRADED
    nvme0n1p1 1.82T      -      -        -         -      -      -      -    ONLINE
    nvme1n1p1 1.82T      -      -        -         -      -      -      -   REMOVED

Attributes of working drive:
Note that the SSD endurance remaining is 89% and the power-on time is only about 8 months.
- Critical warning 0x00
- Temperature 41 Celsius
- Available spare 100%
- Available spare threshold 10%
- Percentage used 11%
- Data units read 169,230,056 [86.6 TB]
- Data units written 316,196,765 [161 TB]
- Host read commands 729,658,723
- Host write commands 2,016,696,505
- Controller busy time 10,418
- Power cycles 24
- Power on hours 6,140 (8m, 12d, 20h)
- Unsafe shutdowns 2
- Media and data integrity errors 0
- Error information log entries 0
- Warning comp. temperature time 0
- Critical comp. temperature time 0
- Temperature sensor 1 41 Celsius
- Temperature sensor 2 45 Celsius
- SSD endurance remaining 89 %

Diagnostics Report:
https://drive.google.com/file/d/1v99BdIZxzgppSFI23dvH14MjrNuyCjJ_/view?usp=drive_link

Conclusion/Questions:
I find it hard to believe that a drive with ~89% life left in it and a runtime of ~8 months fails prematurely like this.
What could cause this, and why only the one NVMe drive and not the other, since they're mirrored?
The pool information shows the drive as REMOVED; could this be a motherboard issue instead? I am unable to view any information on the drive within the web GUI.
I updated to 7.2.5 12 days ago; could this be a side effect?
I have not restarted the server since discovering the error. Next steps, from what I've read, would be to run a memtest.
Any advice on how to proceed would be much appreciated, thank you!

Edited May 13 by Muzek: Minor edits for clarification
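As a rough sanity check on the wear figures above, here is a minimal Python sketch of the arithmetic. It relies on the NVMe convention that one "data unit" is 1,000 blocks of 512 bytes, which matches the TB values shown next to the counters. The linear wear-out projection is purely my assumption for illustration, and the helper names are hypothetical; it says nothing about sudden controller or link failures, which is what a REMOVED device more likely points to.

```python
# One NVMe "data unit" is 1,000 x 512-byte blocks (512,000 bytes).
BYTES_PER_DATA_UNIT = 512 * 1000


def data_units_to_tb(units: int) -> float:
    """Convert NVMe data units to decimal terabytes."""
    return units * BYTES_PER_DATA_UNIT / 1e12


def projected_hours_to_full_wear(percent_used: float, power_on_hours: float) -> float:
    """Assumption: wear grows linearly, so extrapolate to 100% 'Percentage used'."""
    return power_on_hours * 100 / percent_used


# Values copied from the attributes of the working drive.
written_tb = data_units_to_tb(316_196_765)
read_tb = data_units_to_tb(169_230_056)
days_on = 6_140 / 24
years_to_full_wear = projected_hours_to_full_wear(11, 6_140) / (24 * 365)

print(f"written: {written_tb:.1f} TB, read: {read_tb:.1f} TB")
print(f"average writes: {written_tb / days_on:.2f} TB/day over {days_on:.0f} days")
print(f"linear wear-out projection: {years_to_full_wear:.1f} years of power-on time")
```

If that projection is even roughly right, normal flash wear is an unlikely explanation for a drive disappearing after about 8 months.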
May 14 Author

Update:
I restarted the server and went into the BIOS. Both NVMe drives showed up, so that should theoretically rule out a motherboard or NVMe seating issue. Running a memtest now before booting Unraid.

Memtest passed.

Edited May 14 by Muzek: Update from memtest
May 14 Author

Update 2:
Restarted the server, and the array booted just fine. I'm still concerned about what caused this issue in the first place. What else should I look into to narrow down the cause?
May 14 Author

9 hours ago, JorgeB said:
If it happens again, post the diagnostics before rebooting.

The diagnostics from before restarting are attached to the original post. I couldn't figure out how to add attachments (maybe it's because this was my first post?). Anyway, here they are: diagnostics-20260513-1511.zip
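For anyone skimming a syslog like the one in these diagnostics, a small Python sketch along these lines can tally the ZFS `zio` errors per device. Everything here is illustrative: the function name is hypothetical, the sample lines are copied from this thread, and the mapping of `type=2` to write and `type=5` to ioctl/flush reflects my understanding of OpenZFS's `zio_type` enum, so treat it as an assumption. `error=5` is errno 5, i.e. EIO.

```python
import errno
import re
from collections import Counter

# Assumption: values taken from OpenZFS's zio_type enum as I understand it.
ZIO_TYPES = {1: "read", 2: "write", 5: "ioctl/flush"}

ZIO_RE = re.compile(
    r"zio pool=(?P<pool>\S+) vdev=(?P<vdev>\S+) "
    r"error=(?P<error>\d+) type=(?P<type>\d+)"
)


def tally_zio_errors(lines):
    """Count zio error lines, keyed by (vdev, errno name, I/O type)."""
    counts = Counter()
    for line in lines:
        m = ZIO_RE.search(line)
        if not m:
            continue  # not a zio line, e.g. the nvme controller messages
        code = int(m["error"])
        err_name = errno.errorcode.get(code, str(code))
        io_type = ZIO_TYPES.get(int(m["type"]), f"type {m['type']}")
        counts[(m["vdev"], err_name, io_type)] += 1
    return counts


sample = [
    "May 11 13:07:16 Enterprise kernel: nvme nvme1: Disabling device after reset failure: -19",
    "May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996891230208 size=122880 flags=3145856",
    "May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=2 offset=996918018048 size=12288 flags=3145856",
    "May 11 13:07:16 Enterprise kernel: zio pool=cache vdev=/dev/nvme1n1p1 error=5 type=5 offset=0 size=0 flags=2098304",
]

counts = tally_zio_errors(sample)
for key, n in sorted(counts.items()):
    print(key, n)
```

A burst of write and flush failures on a single vdev within the same second fits a device that vanished from the bus better than it fits gradual media wear.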
May 14 Community Expert

The NVMe device is dropping offline:
May 11 13:07:16 Enterprise kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
May 11 13:07:16 Enterprise kernel: nvme nvme1: Disabling device after reset failure: -19

See if this helps: on Main, click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view", and add this to your default boot option, after "append initrd=/bzroot":
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

e.g.:
append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

Reboot (or power cycle the server if just a reboot doesn't bring the device back) and then see if it makes a difference.
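After rebooting, one way to confirm that the kernel actually picked up those three parameters is to look at `/proc/cmdline`. The Python sketch below does that; the function name is hypothetical, and it assumes a Linux system where `/proc/cmdline` is readable (it simply skips the live check elsewhere).

```python
from pathlib import Path

# The three parameters suggested above.
WANTED = (
    "nvme_core.default_ps_max_latency_us=0",
    "pcie_aspm=off",
    "pcie_port_pm=off",
)


def missing_params(cmdline: str, wanted=WANTED):
    """Return the wanted parameters that do not appear on the kernel command line."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        missing = missing_params(proc_cmdline.read_text())
        if missing:
            print("not active:", ", ".join(missing))
        else:
            print("all three parameters are active")
```

If anything is reported as not active, the usual suspect is that the line was added to a boot entry other than the default one.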
May 14 Author

4 hours ago, JorgeB said:
The NVMe device is dropping offline:
May 11 13:07:16 Enterprise kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
May 11 13:07:16 Enterprise kernel: nvme nvme1: Disabling device after reset failure: -19
See if this helps: on Main, click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view", and add this to your default boot option, after "append initrd=/bzroot":
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off
e.g.:
append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off
Reboot (or power cycle the server if just a reboot doesn't bring the device back) and then see if it makes a difference.

Thank you for the suggestion. After looking into what it actually does, this is a no-brainer for my setup, considering all my containers run off my NVMe drives 24/7. I also found that the firmware on the drives was out of date, so I updated that as well. Hoping this problem won't come up again. Thanks!