crazykidguy Posted June 18, 2020 I bought an ADATA SX8100NP a while back to replace my smaller-capacity Samsung 850 EVO. The ADATA drive has been running fine, but recently it will just disappear from the array and won't show up again until I completely shut down and restart the server. I tried re-seating it the first time it happened, but the issue keeps coming back. I forgot to grab the full diagnostics from the last occurrence before restarting the server, but I've attached the SMART report. I've been reading mixed information about SMART tests on NVMe drives, so I'm not sure whether this indicates something is wrong with the drive itself. The drive is plugged directly into my motherboard (B450 Tomahawk Max) via the M.2 slot. ADATA_SX8100NP_2J3620078167-20200618-0908.txt Quote
JorgeB Posted June 18, 2020 Diags could give more clues, but this can sometimes help: There's also one report of a similar issue being caused by overheating: Quote
crazykidguy Posted November 28, 2020 Author Picking this old thread back up, since the problem still persists and has kept appearing here and there since my last post. I was able to grab the diagnostics the most recent time it happened, and it seems to start with I/O errors on the NVMe drive, similar to the linked post. However, I don't think the drive temperature ever exceeded 40°C in my case. The system log is unfortunately cluttered with a lot of FTP entries, so the first error starts around line 8980. My array is XFS, and I've since switched the cache drive in question from btrfs to XFS. However, I do have XFS, NTFS, and btrfs drives in use through UD. tower-diagnostics-20201127-1739.zip Quote
JorgeB Posted November 28, 2020 Did you try what's in the first link? If not, do that first. Some NVMe devices have issues with power states on Linux, so also try this: on the main GUI page click on the flash device, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append" and before "initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

Reboot and see if it makes a difference. Quote
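For anyone unsure about the placement, the edited boot stanza in syslinux.cfg would look something like the sketch below. The label and kernel names are typical Unraid defaults and may differ on your install; the key point is that the parameter sits after "append" and before "initrd=/bzroot":

```text
label Unraid OS
  menu default
  kernel /bzimage
  append nvme_core.default_ps_max_latency_us=0 initrd=/bzroot
```

Setting the maximum latency to 0 tells the nvme_core driver not to use any APST (autonomous power state transition) states, which works around drives that fail to wake from a low-power state.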
crazykidguy Posted November 30, 2020 Author Thanks. I added it the last time I brought this up, but I realize now that I had it in the wrong order, after the initrd section. I'll make the fix and see if the issue resolves. Quote
onufry Posted January 21, 2021 Hey, any updates on this? I've experienced the same "missing NVMe cache drive" situation: one of the two cache drives in a pool goes missing. I'm using 2x Samsung 980 PRO 1TB. It happens often enough to be annoying. From the log I can see these lines when it happened:

Jan 20 18:32:54 JJ-NAS kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x2010
Jan 20 18:33:51 JJ-NAS login[402]: ROOT LOGIN on '/dev/pts/0'
Jan 20 18:33:56 JJ-NAS kernel: nvme nvme0: I/O 16 QID 0 timeout, disable controller
Jan 20 18:34:16 JJ-NAS kernel: nvme nvme0: Device not ready; aborting reset
Jan 20 18:34:16 JJ-NAS kernel: nvme nvme0: Identify Controller failed (-4)
Jan 20 18:34:16 JJ-NAS kernel: nvme nvme0: Removing after probe failure status: -5
Jan 20 18:34:16 JJ-NAS kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme0n1p1 errs: wr 75281, rd 7, flush 15132, corrupt 0, gen 0
Jan 20 18:34:16 JJ-NAS kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme0n1p1 errs: wr 75282, rd 7, flush 15132, corrupt 0, gen 0
errors continue...

After a drive goes missing I can still see the cache as "active" and SMART reporting healthy, but no temperature is shown. The cache self-test page says the disk is not available and must be spun up, and clicking "Spin Up" does nothing. A reboot does not bring the drive back; it looks like a full shutdown and a hard press of the power button does (though I did enter the BIOS and exit without saving). Thanks in advance for any help/suggestions.

Mobo: ASRock X570M Pro4 (BIOS P3.30)
CPU: Ryzen 3400G
RAM: 32GB
Cache pool: 2x Samsung NVMe 980 PRO 1TB
Array: 4x 8TB Seagate IronWolf
Parity: 1x 12TB Seagate IronWolf
Quote
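As a quick way to spot this failure pattern in a log, a grep along these lines works; this is just an illustrative sketch using the log lines from the post saved to a temp file, and on a live server you would point it at /var/log/syslog instead:

```shell
# Save a few sample kernel log lines (from the post above) for illustration.
cat > /tmp/nvme_sample.log <<'EOF'
Jan 20 18:32:54 JJ-NAS kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x2010
Jan 20 18:33:56 JJ-NAS kernel: nvme nvme0: I/O 16 QID 0 timeout, disable controller
Jan 20 18:34:16 JJ-NAS kernel: nvme nvme0: Removing after probe failure status: -5
EOF

# Count NVMe controller resets/timeouts/removals in the log.
# On a live server, replace /tmp/nvme_sample.log with /var/log/syslog.
grep -cE 'nvme nvme[0-9]+: (controller is down|I/O .* timeout|Removing after probe failure)' /tmp/nvme_sample.log
```

If the count is non-zero and clustered around the time the drive vanished, the controller dropped off the bus, which is the same signature discussed in this thread.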
JorgeB Posted January 21, 2021 7 hours ago, onufry said: thanks in advance for any help / suggestions Look for a BIOS update and/or try the latest Unraid -rc; if that doesn't help, there's not much more you can do other than using different NVMe devices or a different board. Quote
crazykidguy Posted January 21, 2021 Author 14 hours ago, onufry said: hey, any updates on this? i've experienced same "missing NVMe cache drive" situation... Yeah, it sounds like the same kind of error I was seeing. Rebooting the entire array multiple times can sometimes bring the cache back; otherwise I usually had to do a hard reset. Even adding nvme_core.default_ps_max_latency_us=0 as suggested by JorgeB didn't help.
I also tried switching the file system on the NVMe from btrfs to XFS, with no success. It could have been a power-state issue: it would sometimes pop up without any activity on the drive (so the drive would just be at idle temperature). Perhaps a BIOS update could help, as JorgeB mentioned. My motherboard is an MSI B450 Tomahawk MAX (BIOS 7C02v35) paired with a Ryzen 2700X, but I never tried updating the BIOS to resolve this. Eventually I gave up trying to diagnose the problem and switched over to a WD NVMe drive, and it has been solid ever since. Interestingly, I don't have a heat sink on this one, whereas I did on the ADATA, but temps have been fine. Quote
onufry Posted January 21, 2021 11 hours ago, JorgeB said: Look for a BIOS update and or try the latest Unraid -rc... Thanks. My BIOS is already on the latest version, and I already tried switching to a different NVMe: originally I used a Sabrent drive and now Samsung, with the same issues. It seems trying the Unraid 6.9 RC is my next option. Quote
onufry Posted January 29, 2021 (edited) On 1/21/2021 at 2:31 PM, onufry said: thanks, my bios has the latest version... Yet another update. For the past few days I've been monitoring the dashboard, and I noticed that my CPU spikes to 70% and hovers around there without dropping. I checked top and saw that ffmpeg was the culprit. After restarting Jellyfin, CPU usage went back to my system average of under 15%, and obviously the CPU and motherboard temps dropped as well. I have not had the NVMe go missing in these few days. So, in my very ignorant opinion, I suspect that (as someone mentioned elsewhere) the missing cache NVMe has to do with the drive controller's temperature sensor triggering a fail-safe. Thoughts? Edited January 29, 2021 by onufry Quote
Pushy Posted March 29, 2021 I'm facing this too recently, whenever my son starts playing any game on a Windows VM (GPU pass-through); otherwise everything works fine for days! Latest BIOS installed, Unraid updated to 6.9.1. I used two suggestions from discussion topics: amd_iommu=pt and nvme_core.default_ps_max_latency_us=0, but still no difference. I'd occasionally get messages about the cache disk temperature rising to 56-57°C, and sometimes the cache disk went missing when it crossed 60°C. Now, though, there is no temperature indication whatsoever, and within 3 minutes of a game being started the cache disk goes missing. Please help. kraken-diagnostics-20210329-1303.zip Quote
JorgeB Posted March 29, 2021 3 minutes ago, Pushy said: I'm facing this too recently when my son starts playing any game on Windows VM (GPU Pass-Through). The NVMe controller is going down: Mar 29 13:02:29 Kraken kernel: nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10 If you're already on the latest BIOS there's not much else you can do; try a different M.2 slot if available, and you can also try disabling the PCIe ACS override if it's not needed. Quote
NerdyGriffin Posted July 29, 2021 On my system I have always had PCIe ACS override set to "Both" and never had issues with the NVMe cache, but today I tried setting it to "Disabled" and also to "Downstream", and in those modes I got the "disk missing" error from the cache drive every time I tried to start the array. This makes some sense, since NVMe is a PCIe device, so I just wanted to add this in case it's helpful to someone. In summary, if you get those "device ... missing" errors with an NVMe drive as a cache drive, it might help to try each of the possible options for the "PCIe ACS override" setting to see if one of them fixes the cache issue. The results will probably differ for everyone depending on your motherboard and the arrangement of your other PCIe devices. Quote
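For reference, the GUI setting maps to a kernel parameter on the syslinux append line. The mapping below is my understanding of how Unraid translates the menu choices (the ACS override patch's pcie_acs_override parameter) and should be double-checked against your own syslinux.cfg:

```text
# PCIe ACS override GUI choices as kernel parameters (verify in your syslinux.cfg):
#   Disabled       -> (no pcie_acs_override parameter)
#   Downstream     -> pcie_acs_override=downstream
#   Multi-function -> pcie_acs_override=multifunction
#   Both           -> pcie_acs_override=downstream,multifunction
```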
hunter13wright Posted April 14, 2022 (edited) I know this is an old post, but I wanted to pass on my thanks and confirm that the posted solution worked for me. In my case, I couldn't even boot into Unraid until I disconnected the drives, added nvme_core.default_ps_max_latency_us=0, and then rebooted with the drives reconnected. Edited April 14, 2022 by hunter13wright Quote
wacko37 Posted March 11, 2023 (edited) @JorgeB I've been having a very similar issue with 2x 1TB Samsung 970 EVO Plus NVMe drives that I have in a mirrored cache on my Gigabyte Z590 Elite motherboard. My server will run fine for a month or more with no issues until I have to do a random reboot for whatever reason; upon reboot, one of the two drives will appear in the Dashboard tab, but its temperature is not reported and is just blank. There is always a huge list of errors related to that NVMe in the system log, and the cache is unusable. It's also worth mentioning that 9 times out of 10 the server reboots just fine; it seems to be a totally random issue. The only method that clears it is a full power-down of the system, as a reboot does not reset it at all. After powering back up, the mirrored cache balances out its errors and everything returns to normal. My question is: will the "nvme_core.default_ps_max_latency_us=0" boot option clear my issue, and are there any implications to using this boot option permanently that would cause further problems? Thanks Edited March 12, 2023 by wacko37 Quote
JorgeB Posted March 12, 2023 15 hours ago, wacko37 said: My question is will the "nvme_core.default_ps_max_latency_us=0" boot option clear my issue and is there any implications to using this boot option permanently that would cause further problems. It might help, and it will not cause any other problems. Also add pcie_aspm=off, so it should be:

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Quote
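After rebooting, it's worth confirming that both parameters actually made it onto the kernel command line. On a live server you would inspect /proc/cmdline; the snippet below is a self-contained sketch using a sample command line string:

```shell
# Sample kernel command line; on a real server use: cmdline=$(cat /proc/cmdline)
cmdline='nvme_core.default_ps_max_latency_us=0 pcie_aspm=off initrd=/bzroot'

# Check that both workaround parameters are present.
echo "$cmdline" | grep -q 'nvme_core.default_ps_max_latency_us=0' && \
echo "$cmdline" | grep -q 'pcie_aspm=off' && \
echo "both parameters present"
```

If either grep fails, the parameters were likely placed in the wrong spot in syslinux.cfg (e.g. after initrd=/bzroot, as happened earlier in this thread) or a different boot menu entry was used.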
wacko37 Posted March 14, 2023 On 3/12/2023 at 10:35 PM, JorgeB said: It might help and it will not cause any other problem... Thanks so much for your help, and sorry for the late reply. I will apply those boot options and test over the next month, then report back the outcome for others in the same boat. Quote
wacko37 Posted March 24, 2023 On 3/15/2023 at 7:02 AM, wacko37 said: Thanks so much for your help & sorry for the late reply... OK, after many reboots this seems to have fixed my problem... fingers crossed. I'll post back if this changes. Thanks @JorgeB for all that you do, mate! 1 Quote