cache drive dying?

June 11

I didn't want to hijack @Elmojo's thread about his failing cache drive: https://forums.unraid.net/topic/199306-cache-pool-errors-failing-drive-or/

But it seems I have a very similar problem, at a similar time, with a similar NVMe SSD - a Samsung 990 Pro. It passes all tests and doesn't directly report any read, CRC, or other errors.

Last week I came back from vacation and Unraid reported the drive was offline - I forget the wording as I thought it was just a glitch and rebooted the server - to find my cache gone. Rebooted a few times, checked a couple things, nothing changed.

The server was kept off for a couple of days while I ordered an NVMe-to-USB dock to just check the drive - because isn't that what we'd all do? :D The disk seemed good, went back into the server, and booted up - everything worked. Hmm.. ok, weird, right?

Fast forward to tonight: about an hour ago the cache share disappeared again, with no messages from Unraid other than console logs such as:

UNRAID kernel: nvme nvme1: I/O tag 735 (c2df) opcode 0x2 (I/O Cmd) QID 14 timeout, aborting req_op:READ(0) size:131072

UNRAID kernel: I/O error, dev nvme1n1, sector 1954447249 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2

UNRAID kernel: I/O error, dev loop3, sector 75840 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 2

UNRAID kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 0, rd 7, flush 0, corrupt 0, gen 0

UNRAID emhttpd: device /dev/nvme1n1 has size zero

This time I rebooted, and everything is fine again. Preemptively I'm going to start backing up the cache drive to the array like Elmojo is doing, but does this all point to a failing SSD, or should I be looking at some other cause?

Diagnostic file doesn't have anything of interest relating to this drive other than the logs above, and I've been running 7.2.4 for quite some time so it's not a recent upgrade issue.
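For anyone comparing notes across these threads, here is a minimal Python sketch that tallies the kinds of kernel messages quoted above per device. The regular expressions are modeled only on the lines in this post, and the names `PATTERNS`, `SAMPLE`, and `summarize` are made up for illustration; other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns modeled on the log lines quoted in this post (an assumption:
# other kernels or drivers may format these messages differently).
PATTERNS = {
    "nvme_timeout": re.compile(r"nvme (nvme\d+): I/O tag \d+ .* timeout"),
    "io_error": re.compile(r"I/O error, dev (\S+), sector \d+"),
    "size_zero": re.compile(r"device /dev/(\S+) has size zero"),
}

# Sample lines copied from the post, used here instead of a real syslog file.
SAMPLE = [
    "UNRAID kernel: nvme nvme1: I/O tag 735 (c2df) opcode 0x2 (I/O Cmd) QID 14 timeout, aborting req_op:READ(0) size:131072",
    "UNRAID kernel: I/O error, dev nvme1n1, sector 1954447249 op 0x1:(WRITE) flags 0x29800 phys_seg 1 prio class 2",
    "UNRAID kernel: I/O error, dev loop3, sector 75840 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 2",
    "UNRAID emhttpd: device /dev/nvme1n1 has size zero",
]

def summarize(lines):
    """Count matching events, keyed by (event kind, device name)."""
    counts = Counter()
    for line in lines:
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, match.group(1))] += 1
    return counts

for (kind, dev), n in sorted(summarize(SAMPLE).items()):
    print(f"{kind:14} {dev:10} {n}")
```

`summarize` accepts any iterable of lines, so an open syslog file from a diagnostics zip could be passed in place of `SAMPLE`.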

Edited June 11 by Energen

June 11

Community Expert

Try this: on Main, click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view", and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

Reboot and then see if it makes a difference; if it still drops, post the diagnostics next time it happens.
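After the reboot it can be worth confirming that the options actually reached the kernel. A minimal Python sketch, assuming Python is available on the box and that the running kernel's command line is readable at /proc/cmdline (standard on Linux); the helper names `missing_params` and `read_cmdline` are hypothetical:

```python
# The three options suggested above.
REQUIRED = (
    "nvme_core.default_ps_max_latency_us=0",
    "pcie_aspm=off",
    "pcie_port_pm=off",
)

def missing_params(cmdline, required=REQUIRED):
    """Return the required options that are absent from a kernel command line."""
    present = set(cmdline.split())
    return [param for param in required if param not in present]

def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as fh:
        return fh.read()

# Example with a command line that lacks the last option.
example = "initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
print(missing_params(example))
```

On the server itself, `missing_params(read_cmdline())` should return an empty list once the edited boot entry is in use.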

June 11

Author

The only change from what's currently there is pcie_port_pm=off.

Original:

label Unraid OS

menu default

kernel /bzimage

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

I'll give this a go and see what happens. Thanks!

It's very peculiar that this has only just started, and that the two occurrences were a week apart. If the SSD were dying I would expect a much more consistent problem.

Edited June 11 by Energen

June 18

Author

@JorgeB So things have been running fine for the last week, until today. Same problem occurred out of nowhere.

Disk Location: nvme0n1 Alert - Device failure

Samsung SSD 990 PRO with Heatsink 2TB

Jun 18 18:49:04 UNRAID kernel: I/O error, dev loop3, sector 76160 op 0x0:(READ) flags 0x1000 phys_seg 4 prio class 2

Jun 18 18:49:04 UNRAID kernel: BTRFS error (device loop3): bdev /dev/loop3 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0

I find it a little concerning/coincidental that there have been a few threads recently about NVMe cache drive problems all occurring around the same time out of nowhere. If the drive were failing I would expect it to be failing daily and not run fine for a week without issue. It's as if something on the system is causing a breakdown. I have 2 identical NVMe drives installed at the same time, granted used for different purposes, and the cache drive gets used with more activity, but this is very strange. I ran SATA SSD cache drives for years without problems, and now the NVMe for ~2 years without a problem, so I'm not sure what's going on. Is it the drive or something in Unraid?

unraid-diagnostics-20260618-1848.zip

June 18

Author

I rebooted the server and had the same problem upon start: drive missing.

I shut down the server (power off) and turned it back on, and the drive is online.

June 19

Community Expert

Jun 18 03:11:30 UNRAID kernel: nvme nvme0: Abort status: 0x371

Jun 18 03:11:51 UNRAID kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x1

Jun 18 03:11:51 UNRAID kernel: nvme nvme0: Disabling device after reset failure: -19

The NVMe is still dropping offline; if the kernel options don't help, the best bet is to use a different brand/model device (or a different board).
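For readers wondering what the numbers in those lines mean, here is a minimal Python sketch that decodes the CSTS (Controller Status) value and the trailing error code. The bit layout is my reading of the NVMe base specification (RDY in bit 0, CFS in bit 1, SHST in bits 3:2, NSSRO in bit 4, PP in bit 5) and should be checked against the spec revision in question; `decode_csts` is a hypothetical helper, and the errno lookup assumes Linux numbering.

```python
import errno

def decode_csts(value):
    """Split an NVMe CSTS register value into its named fields.

    Bit positions are an assumption based on the NVMe base specification.
    """
    return {
        "RDY": bool(value & 0x1),     # controller ready
        "CFS": bool(value & 0x2),     # controller fatal status
        "SHST": (value >> 2) & 0x3,   # shutdown status (2-bit field)
        "NSSRO": bool(value & 0x10),  # NVM subsystem reset occurred
        "PP": bool(value & 0x20),     # processing paused
    }

# CSTS=0x1 from the log: only the ready bit is set.
print(decode_csts(0x1))

# "-19" from the log is a negative errno; the symbolic name is looked up here.
print(errno.errorcode[19])
```

Under those assumptions, CSTS=0x1 shows the controller still reporting ready with no fatal-status bit set, and -19 corresponds to ENODEV, which fits a device that stopped responding during reset rather than one that reported an internal fault.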

7 hours ago, Energen said:
I rebooted the server and had the same problem upon start, drive missing.
I shut down the server (power off) and turned it back on and drive is online.

This is normal when this happens; a power cycle is required to bring the device back, not just a reboot.
