  • [6.12.13] NVMe no longer working


    hawihoney
    • Closed Minor

    My system has two NVMe drives attached to the motherboard. This combo has worked through the recent releases without any problems. Since installing 6.12.13, one of the two NVMe drives throws errors:

     

    Aug 27 08:19:17 Tower unassigned.devices: Mounting partition 'nvme1n1p1' at mountpoint '/mnt/disks/NVMe2'...
    Aug 27 08:19:17 Tower unassigned.devices: Mount cmd: /sbin/mount -t 'xfs' -o rw,relatime,discard '/dev/nvme1n1p1' '/mnt/disks/NVMe2'
    Aug 27 08:19:17 Tower kernel: XFS (nvme1n1p1): Mounting V5 Filesystem
    Aug 27 08:19:17 Tower kernel: XFS (nvme1n1p1): Ending clean mount
    Aug 27 08:19:17 Tower unassigned.devices: Successfully mounted '/dev/nvme1n1p1' on '/mnt/disks/NVMe2'.
    Aug 27 08:19:21 Tower emhttpd: shcmd (197): /usr/local/sbin/mount_image '/mnt/pool_nvme/system/docker/' /var/lib/docker 10
    Aug 27 08:19:24 Tower emhttpd: shcmd (214): /usr/local/sbin/mount_image '/mnt/pool_nvme/system/libvirt.img' /etc/libvirt 1
    Aug 27 08:35:19 Tower emhttpd: spinning down /dev/nvme1n1
    Aug 27 08:35:19 Tower emhttpd: sdspin /dev/nvme1n1 down: 25
    Aug 27 12:34:49 Tower kernel: nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
    Aug 27 12:34:49 Tower kernel: nvme nvme1: Does your device have a faulty power saving mode enabled?
    Aug 27 12:34:49 Tower kernel: nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off" and report a bug
    Aug 27 12:34:49 Tower kernel: nvme 0000:09:00.0: Unable to change power state from D3cold to D0, device inaccessible
    Aug 27 12:34:49 Tower kernel: nvme nvme1: Removing after probe failure status: -19
    Aug 27 12:34:49 Tower kernel: nvme1n1: detected capacity change from 1953525168 to 0
    Aug 27 13:48:39 Tower kernel: XFS (nvme1n1p1): metadata I/O error in "xfs_imap_to_bp+0x50/0x70 [xfs]" at daddr 0x3ae0bca0 len 32 error 5
    Aug 27 13:48:39 Tower kernel: XFS (nvme1n1p1): metadata I/O error in "xfs_imap_to_bp+0x50/0x70 [xfs]" at daddr 0x3ae0bca0 len 32 error 5
    Aug 27 13:48:39 Tower kernel: XFS (nvme1n1p1): metadata I/O error in "xfs_imap_to_bp+0x50/0x70 [xfs]" at daddr 0x3ae0bca0 len 32 error 5
    Aug 27 13:48:39 Tower kernel: XFS (nvme1n1p1): metadata I/O error in "xfs_imap_to_bp+0x50/0x70 [xfs]" at daddr 0x3ae0bca0 len 32 error 5
    Aug 27 13:48:39 Tower kernel: XFS (nvme1n1p1): metadata I/O error in "xfs_imap_to_bp+0x50/0x70 [xfs]" at daddr 0x3ae0bca0 len 32 error 5
    Aug 27 13:49:03 Tower kernel: XFS (nvme1n1p1): log I/O error -5
    Aug 27 13:49:03 Tower kernel: XFS (nvme1n1p1): Filesystem has been shut down due to log error (0x2).
    Aug 27 13:49:03 Tower kernel: XFS (nvme1n1p1): Please unmount the filesystem and rectify the problem(s).
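    For anyone who wants to check their own syslog for the same signature, here is a minimal sketch. It assumes the syslog line layout shown above; the function name `find_controller_drops` and the `sample` list are made up for illustration. Register values that read back as all ones usually mean the device no longer answers on the PCIe bus, which fits the "device inaccessible" message.

    ```python
    import re

    # Matches the kernel message logged when the NVMe driver finds the controller unresponsive.
    # Assumes the "Mon DD HH:MM:SS host kernel: ..." layout seen in the log above.
    CONTROLLER_DOWN = re.compile(
        r"^(?P<when>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
        r"nvme (?P<ctrl>nvme\d+): controller is down; will reset: "
        r"CSTS=(?P<csts>0x[0-9a-fA-F]+), PCI_STATUS=(?P<pci>0x[0-9a-fA-F]+)"
    )


    def find_controller_drops(lines):
        """Return one dict per 'controller is down' message found in the given syslog lines."""
        drops = []
        for line in lines:
            m = CONTROLLER_DOWN.search(line.strip())
            if m:
                drops.append({
                    "when": m.group("when"),
                    "controller": m.group("ctrl"),
                    # True when both registers read back as all ones.
                    "all_ones": m.group("csts").lower() == "0xffffffff"
                    and m.group("pci").lower() == "0xffff",
                })
        return drops


    # Two lines copied from the log above, used as illustrative input.
    sample = [
        "Aug 27 08:35:19 Tower emhttpd: spinning down /dev/nvme1n1",
        "Aug 27 12:34:49 Tower kernel: nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff",
    ]
    print(find_controller_drops(sample))
    ```

    On a live server the same function could be fed the lines of the syslog file instead of `sample`.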

     

    Should I go back to the previous Unraid release?

     

    Diagnostics attached.

     

    Thanks in advance.

     

    tower-diagnostics-20240827-1549.zip




    User Feedback

    Recommended Comments

    The device is dropping offline. Try adding the following to syslinux.cfg, after /bzroot:

     

    Quote

    Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off"
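    To illustrate where these parameters end up, here is a rough sketch that appends them to the `append` line of a syslinux.cfg. The helper `add_kernel_params` and the `example` excerpt are assumptions made for illustration and are not taken from the diagnostics in this thread; a real file may contain several boot labels, and editing it by hand works just as well.

    ```python
    def add_kernel_params(cfg_text, params):
        """Append any missing kernel parameters to each 'append' line that loads /bzroot."""
        out = []
        for line in cfg_text.splitlines():
            tokens = line.split()
            loads_bzroot = any(t.startswith("initrd=/bzroot") for t in tokens)
            if tokens and tokens[0] == "append" and loads_bzroot:
                # Only add parameters that are not already on the line.
                missing = [p for p in params if p not in tokens]
                if missing:
                    line = line.rstrip() + " " + " ".join(missing)
            out.append(line)
        return "\n".join(out) + "\n"


    # The three parameters suggested by the kernel message quoted above.
    PARAMS = [
        "nvme_core.default_ps_max_latency_us=0",
        "pcie_aspm=off",
        "pcie_port_pm=off",
    ]

    # Hypothetical syslinux.cfg excerpt, used only as example input.
    example = """label Unraid OS
      menu default
      kernel /bzimage
      append initrd=/bzroot
    """
    print(add_kernel_params(example, PARAMS))
    ```

    Running the function twice leaves the file unchanged the second time, so it should be safe to re-apply. A reboot is needed before the kernel picks up the new parameters.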

     

    Link to comment

    You have your motherboard-mounted NVMe drive mounted using the unassigned devices plugin. Is that intentional? Just want to make sure that isn't part of your issue.

    Link to comment
    1 hour ago, T0rqueWr3nch said:

    Is that intentional?

    Yes. I always wanted both NVMe drives as a 2-device pool. But I don't trust BTRFS, and ZFS seems way too complicated to me. So I replicate the first pool device to the second one, which is mounted through Unassigned Devices, via User Scripts. With multi-array support, whenever it arrives, I will change that to a 2-device Unraid array.

     

    Link to comment
    15 hours ago, JorgeB said:

    nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off

     

    The interesting part is that this device was available after reboot for a short period of time. And after that it was gone silently. This combo was running happily for years with previous Unraid versions.

     

    Is this correct?

     

    image.thumb.png.22c7a332e641c34990d1942d76e188da.png

     

    I will reboot then and report.

     

    Thanks.

     

    Edited by hawihoney
    Link to comment

    A kernel change can sometimes make this issue visible, or help with it. See if those settings help.
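    One way to confirm after the reboot that the settings are actually active is to compare them against the running kernel's command line. This is only a sketch: `missing_kernel_params` is a made-up helper and `sample` is an invented command line, not output from this server.

    ```python
    def missing_kernel_params(cmdline, wanted):
        """Return the wanted parameters that are absent from a kernel command line string."""
        active = set(cmdline.split())
        return [p for p in wanted if p not in active]


    WANTED = [
        "nvme_core.default_ps_max_latency_us=0",
        "pcie_aspm=off",
        "pcie_port_pm=off",
    ]

    # Invented command line for illustration; on a live Linux system the real one
    # can be read with open("/proc/cmdline").read().
    sample = "BOOT_IMAGE=/bzimage initrd=/bzroot pcie_aspm=off"
    print(missing_kernel_params(sample, WANTED))
    ```

    An empty list means all three parameters were passed to the kernel at boot.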

    Link to comment

    With these settings in syslinux.cfg the second NVMe was missing completely. After removing the settings, the NVMe is still not visible. It looks like a hardware error.

     

    I remember these kinds of errors from years ago. Back then I could solve them with Shutdown/Restart instead of Reboot. Next time I will try Shutdown/Restart again.

     

    ***EDIT*** Couldn't wait. I did a complete shutdown and restarted with the power button on the Supermicro case. Bingo. NVMe2 is back again. I must remember this: never reboot my system, shut it down completely instead.

     

    Edited by hawihoney
    Link to comment
    1 hour ago, hawihoney said:

    After removing this setting this NVMe is still not visible.

    You usually need to power cycle the server to get the device back, just rebooting is not enough.
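    To check whether a controller came back after the power cycle, one option is to look at which NVMe controllers the kernel currently exposes. A small sketch, assuming the Linux sysfs directory `/sys/class/nvme` and the controller names from the log above (`nvme0`, `nvme1`); the helper `missing_controllers` is hypothetical.

    ```python
    import os


    def missing_controllers(expected, sysfs_dir="/sys/class/nvme"):
        """Return the expected controller names that have no entry under sysfs_dir."""
        try:
            present = set(os.listdir(sysfs_dir))
        except FileNotFoundError:
            # No NVMe class directory at all: treat every controller as missing.
            present = set()
        return [name for name in expected if name not in present]


    if __name__ == "__main__":
        # Controller names assumed from the log in the first post.
        print(missing_controllers(["nvme0", "nvme1"]))
    ```

    An empty list means every expected controller is visible to the kernel again.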

    Link to comment
    17 hours ago, JorgeB said:

    You usually need to power cycle the server to get the device back, just rebooting is not enough.

     

    That's exactly what I found out. See above.

     

    Link to comment




