June 11

I didn't want to hijack @Elmojo's thread about his failing cache drive, https://forums.unraid.net/topic/199306-cache-pool-errors-failing-drive-or/

But it seems I have a very similar problem, at a similar time, with a similar NVMe SSD - a Samsung 990 Pro. It passes all tests and doesn't directly report any read, CRC, or other errors.

Last week I came back from vacation and Unraid reported the drive was offline. I forget the exact wording, as I thought it was just a glitch and rebooted the server, only to find my cache gone. I rebooted a few times and checked a couple of things; nothing changed.

The server was kept off for a couple of days while I ordered an NVMe-to-USB dock just to check the drive - because isn't that what we'd all do? :D The disk seemed good, so it went back into the server and I booted up - everything worked. Hmm.. ok, weird, right?

Fast forward to tonight: about an hour ago the cache share disappeared again, with no messages from Unraid other than console logs such as:

UNRAID kernel: nvme nvme1: I/O tag 735 (c2df) opcode 0x2 (I/O Cmd) QID 14 timeout, aborting req_op:READ(0) size:131072
UNRAID kernel: I/O error, dev nvme1n1, sector 1954447249 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2
UNRAID kernel: I/O error, dev loop3, sector 75840 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 2
UNRAID kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 0, rd 7, flush 0, corrupt 0, gen 0
UNRAID emhttpd: device /dev/nvme1n1 has size zero

This time I rebooted, and everything is fine again. Preemptively, I'm going to start backing up the cache drive to the array like Elmojo is doing, but does this all point to a failing SSD, or should I be looking at some other cause?

The diagnostics file doesn't have anything of interest relating to this drive other than the logs above, and I've been running 7.2.4 for quite some time, so it's not a recent upgrade issue.

Edited June 11 by Energen
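For anyone wanting to check how often these messages show up between drops, here is a minimal Python sketch that counts the four kinds of log lines quoted above. The regular expressions are derived only from those sample lines, and the names `PATTERNS`, `classify`, and `summarize` are made up for illustration; other kernel versions may word the messages differently.

```python
import re

# Patterns modelled on the log lines quoted in the post above (assumption:
# the wording is the same on other systems and kernel versions).
PATTERNS = {
    "nvme_timeout": re.compile(r"nvme (nvme\d+): I/O tag \d+ .* timeout"),
    "io_error": re.compile(r"I/O error, dev (\S+), sector (\d+)"),
    "btrfs_error": re.compile(r"BTRFS error \(device (\S+)\)"),
    "size_zero": re.compile(r"device (\S+) has size zero"),
}


def classify(line):
    """Return the names of all patterns that match a single log line."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]


def summarize(lines):
    """Count how many lines match each pattern."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name in classify(line):
            counts[name] += 1
    return counts
```

To use it, pass an open log file to `summarize`, for example `summarize(open("/var/log/syslog", errors="replace"))`. The log path is an assumption and may differ on your system.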
June 11 Community Expert

Try this: on Main, click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view", and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

Reboot and then see if it makes a difference; if it still drops, post the diagnostics next time it happens.
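After the reboot it may be worth confirming that the kernel actually picked up the options. A minimal Python sketch follows, assuming the running kernel's command line can be read from `/proc/cmdline` (the standard Linux location). The names `REQUIRED`, `missing_params`, and `read_cmdline` are hypothetical helpers, not part of Unraid.

```python
# Parameters suggested in the post above.
REQUIRED = (
    "nvme_core.default_ps_max_latency_us=0",
    "pcie_aspm=off",
    "pcie_port_pm=off",
)


def missing_params(cmdline, required=REQUIRED):
    """Return the required parameters that are absent from a command line string."""
    present = set(cmdline.split())
    return [param for param in required if param not in present]


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()
```

Calling `missing_params(read_cmdline())` on the server should return an empty list if all three options are active. Anything it returns was not applied, which usually means a different boot entry was edited.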
June 11 Author

The only change from what's currently there is the pcie_port_pm=off

Original:

label Unraid OS
menu default
kernel /bzimage
append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

I'll give this a go and see what happens. Thanks!

It's very peculiar that this has only just started, and that it has occurred a week apart. If the SSD were dying I would expect a much more consistent problem.

Edited June 11 by Energen
June 18 Author

@JorgeB So things have been running fine for the last week, until today. The same problem occurred out of nowhere.

Disk Location: nvme0n1
Alert - Device failure
Samsung SSD 990 PRO with Heatsink 2TB

Jun 18 18:49:04 UNRAID kernel: I/O error, dev loop3, sector 76160 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 2
Jun 18 18:49:04 UNRAID kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0

I find it a little concerning/coincidental that there have been a few threads recently about NVMe cache drive problems, all occurring around the same time out of nowhere. If the drive were failing I would expect it to be failing daily, not running fine for a week without issue. It's as if something on the system is causing a breakdown. I have 2 identical NVMe drives that were installed at the same time; granted, they're used for different purposes and the cache drive gets used with more activity, but this is very strange. I ran SATA SSD cache drives for years without problems, and now the NVMe for ~2 years without a problem, so I'm not sure what's going on. Is it the drive or something in Unraid?

unraid-diagnostics-20260618-1848.zip
June 18 Author

I rebooted the server and had the same problem upon start: drive missing.

I shut down the server (power off) and turned it back on, and the drive is online.
June 19 Community Expert

Jun 18 03:11:30 UNRAID kernel: nvme nvme0: Abort status: 0x371
Jun 18 03:11:51 UNRAID kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1
Jun 18 03:11:51 UNRAID kernel: nvme nvme0: Disabling device after reset failure: -19

NVMe is still dropping offline; if the kernel options don't help, best bet is to use a different brand/model device (or a different board).

7 hours ago, Energen said:
I rebooted the server and had the same problem upon start, drive missing. I shut down the server (power off) and turned it back on and drive is online.

This is normal when this happens; a power cycle is required to bring the device back, not just a reboot.
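For readers curious about the `CSTS=0x1` value in the log above, here is a small Python sketch that splits the NVMe Controller Status register into its fields. The bit layout is my reading of the NVMe base specification (bit 0 RDY, bit 1 CFS, bits 3:2 SHST, bit 4 NSSRO, bit 5 PP), and `decode_csts` is a made-up helper name, so check it against the spec before relying on it.

```python
def decode_csts(value):
    """Split an NVMe CSTS register value into named fields.

    Bit positions are an assumption based on the NVMe base specification.
    """
    return {
        "RDY": bool(value & 0x01),    # controller ready
        "CFS": bool(value & 0x02),    # controller fatal status
        "SHST": (value >> 2) & 0x3,   # shutdown status (0 = normal operation)
        "NSSRO": bool(value & 0x10),  # NVM subsystem reset occurred
        "PP": bool(value & 0x20),     # processing paused
    }
```

Under that layout, `decode_csts(0x1)` gives RDY set and everything else clear: the controller still reported ready and did not flag a fatal status when the reset was abandoned. That does not identify the root cause by itself.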