
NVMe drive falling out of pool and system event


Recommended Posts

Looks like my Google-Fu has failed me, but maybe someone can shed some insight.

 

My XPG S70 Blade seems to have connection dropouts with no rhyme or reason that I can find. It'll be fine for a few hours then all of a sudden I get a notification that the device is missing.

 

Mar 29 11:00:27 TheRedQueen kernel:  nvme nvme2: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xFFFF

 

I tried passing it through to a VM to check its firmware and see if there's an update, but to no avail. Also, since the passthrough, the error has shifted to:

 

TheRedQueen kernel: vfio-pci 0000:02:00.0: VPD access failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update

 

I should mention that in the VM the drive never actually drops out of the OS, but in Unraid it definitely does, and a reboot is necessary to get it back on track.
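 
One thing I may try before the next reboot: from what I've read, a PCI remove/rescan from the console can sometimes bring a dropped controller back without a full reboot, though once CSTS reads 0xffffffff it may not help. Roughly, using the 02:00.0 address from the vfio log:

echo 1 > /sys/bus/pci/devices/0000:02:00.0/remove   # detach the dead controller
echo 1 > /sys/bus/pci/rescan                        # re-enumerate the bus and re-probe the drive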

 

I've reseated it once just in case, but the issue seems to happen about once every 24 hours. Thoughts?

 

Supermicro M12SWA-TF 

AMD Ryzen Threadripper PRO 3955WX

NVIDIA GTX 1060 6GB (For Transcoding Purposes)

2x LSI 9202-16e HBAs

LSI 9272-8i HBA

2x T-Force Cardea 1TB (Cache) in an ASUS Hyper M.2 Expansion (Bifurcated x4x4x4x4)

Seasonic PRIME 1000W Platinum PSU.

 

 

theredqueen-diagnostics-20220329-1129.zip

Link to comment

The below might help and it's worth a shot; if not, the best bet is a different NVMe device (or a different board).

 

Some NVMe devices have issues with power states on Linux. Try this: on the main GUI page click on Flash, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (on the top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0


Reboot and see if it makes a difference.
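 
After rebooting you can confirm the setting took effect, e.g. (assuming the drive still comes up as nvme2 and nvme-cli is present on your release):

cat /proc/cmdline                                                 # should now include the new parameter
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us   # should report 0
nvme get-feature /dev/nvme2 -f 0x0c -H                           # APST (feature 0x0c) should show as disabled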

Link to comment

Looks like I'm still getting this while the SSD is passed through, after the Syslinux Configuration change. I'm going to poke around in my BIOS and turn off C-States if possible when I get home, to see if that's the root cause. Definitely feels power-management related.

 

vfio-pci 0000:02:00.0: VPD access failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update
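 
In the meantime I can at least check the link power-management state on that slot from the console (02:00.0 is the address from the log above); ASPM being active on the link would fit the theory:

lspci -vv -s 02:00.0 | grep -i aspm   # LnkCap lists supported ASPM states, LnkCtl shows what's currently enabled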

Link to comment

Didn't find much regarding C-States, but in the BIOS there's an option under the NVMe configuration for AMI firmware versus vendor firmware. Selecting AMI Firmware seems to have resolved the issue (at least over the past 12+ hours). I'll test it with some less important stuff over the next few days just in case and will post an update.

Link to comment
  • 1 year later...

I don't have the drives (a BTRFS RAID1 pool) being dropped by the Unraid OS, but I do see a lot of these in my dmesg logs. I'm also going to try the kernel option to see if it helps.

 


 

 

[320250.699810] nvme 0000:03:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
[320250.829812] nvme 0000:06:00.0: VPD access failed.  This is likely a firmware bug on this device.  Contact the card vendor for a firmware update
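 
Since the message points at drive firmware, I'll also note the installed firmware revision and keep an eye on dmesg after adding the kernel option. Rough sketch, assuming the two drives come up as nvme0/nvme1 and nvme-cli is available:

nvme id-ctrl /dev/nvme0 | grep -E '^(mn|fr) '   # model number and firmware revision
nvme id-ctrl /dev/nvme1 | grep -E '^(mn|fr) '
dmesg -T | grep -iE 'nvme|vpd'                  # watch whether the VPD errors recur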

 

Link to comment
