
NVME pool keeps erroring / dropping


Solved by JorgeB


I recently added a third NVMe SSD to my machine (Sandisk Corp WD Black SN850X NVMe SSD, 4 TB). My two existing NVMe SSDs, each a WD_BLACK SN750 SE 1TB, have been running for more than a year with no issues.


After a few days without problems, I'm now getting errors in the syslog (excerpt below). On the Main tab of the Unraid GUI, the drives are also listed under Unassigned Devices.

 

I've added the following to the syslinux configuration: nvme_core.default_ps_max_latency_us=0. I'll see if that stops the issue. I am also moving everything off the new NVMe drive and will check whether there's a firmware update for it. The WD firmware update must be done in Windows, so I will pass the drive through to one of my Windows VMs.

While I'm working through the firmware update, is there anything else I should look at to determine the cause of the problem? My signature has some of my system information.

 

Dec 31 10:09:14 Tower kernel: nvme 0000:03:00.0: platform quirk: setting simple suspend
Dec 31 10:09:14 Tower kernel: nvme nvme3: pci function 0000:03:00.0
Dec 31 10:09:14 Tower kernel: nvme 0000:04:00.0: platform quirk: setting simple suspend
Dec 31 10:09:14 Tower kernel: nvme nvme4: pci function 0000:04:00.0
Dec 31 10:09:14 Tower kernel: nvme nvme4: 16/0/0 default/read/poll queues
Dec 31 10:09:14 Tower kernel: nvme4n2: p1
Dec 31 10:09:14 Tower kernel: nvme nvme3: 16/0/0 default/read/poll queues
Dec 31 10:09:14 Tower kernel: nvme3n2: p1
Dec 31 10:09:16 Tower kernel: XFS (nvme0n1p1): log I/O error -5
Dec 31 10:09:16 Tower kernel: XFS (nvme0n1p1): Filesystem has been shut down due to log error (0x2).
Dec 31 10:09:16 Tower kernel: XFS (nvme0n1p1): Please unmount the filesystem and rectify the problem(s).
Dec 31 10:09:16 Tower rsyslogd: file '/mnt/user/system/syslog-127.0.0.1.log'[13] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: Input/output error [v8.2102.0 try https://www.rsyslog.com/e/2027 ]
Dec 31 10:09:16 Tower rsyslogd: file '/mnt/user/system/syslog-127.0.0.1.log': open error: Input/output error [v8.2102.0 try https://www.rsyslog.com/e/2433 ]
Dec 31 10:09:16 Tower kernel: docker0: port 11(veth7a24e9e) entered disabled state
Dec 31 10:09:16 Tower kernel: veth78e2538: renamed from eth0
Dec 31 10:09:35 Tower kernel: I/O error, dev loop2, sector 305624 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Dec 31 10:09:35 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Dec 31 10:09:35 Tower kernel: I/O error, dev loop2, sector 284456 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Dec 31 10:09:35 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
Dec 31 10:09:35 Tower kernel: I/O error, dev loop2, sector 284456 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Dec 31 10:09:35 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
Dec 31 10:09:35 Tower kernel: I/O error, dev loop2, sector 284456 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Dec 31 10:09:35 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 4, flush 0, corrupt 0, gen 0
Dec 31 10:09:35 Tower kernel: I/O error, dev loop2, sector 284456 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Dec 31 10:09:35 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 5, flush 0, corrupt 0, gen 0
Dec 31 10:09:35 Tower kernel: I/O error, dev loop2, sector 284456 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Dec 31 10:09:35 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 6, flush 0, corrupt 0, gen 0
Dec 31 10:09:35 Tower kernel: I/O error, dev loop2, sector 284456 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Dec 31 10:09:35 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 7, flush 0, corrupt 0, gen 0
Dec 31 10:09:35 Tower kernel: I/O error, dev loop2, sector 284456 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Dec 31 10:09:35 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 8, flush 0, corrupt 0, gen 0
Dec 31 10:09:35 Tower kernel: I/O error, dev loop2, sector 284456 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Dec 31 10:09:35 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 9, flush 0, corrupt 0, gen 0
Dec 31 10:09:35 Tower kernel: I/O error, dev loop2, sector 284456 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Dec 31 10:09:35 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 10, flush 0, corrupt 0, gen 0
Dec 31 10:09:50 Tower kernel: blk_print_req_error: 1408 callbacks suppressed
Dec 31 10:09:50 Tower kernel: I/O error, dev loop2, sector 285536 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Dec 31 10:09:50 Tower kernel: btrfs_dev_stat_inc_and_print: 1408 callbacks suppressed
Dec 31 10:09:50 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 1419, flush 0, corrupt 0, gen 0
Dec 31 10:09:50 Tower kernel: I/O error, dev loop2, sector 285536 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Dec 31 10:09:50 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 1420, flush 0, corrupt 0, gen 0
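
For anyone triaging a similar excerpt, here is a minimal Python sketch that tallies the kernel "I/O error" lines per device and operation. The regular expression is an assumption based only on the line format shown above, and `count_io_errors` is a hypothetical helper name, not part of any Unraid tooling. Because the kernel rate-limits these messages ("callbacks suppressed"), the tally is a lower bound rather than an exact count.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpt above:
#   "... kernel: I/O error, dev loop2, sector 305624 op 0x0:(READ) flags ..."
IO_ERROR = re.compile(r"I/O error, dev (\S+), sector (\d+) op \S+?:\((\w+)\)")


def count_io_errors(lines):
    """Count kernel I/O error lines per (device, operation) pair."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            device, _sector, operation = match.groups()
            counts[(device, operation)] += 1
    return counts


if __name__ == "__main__":
    # Three lines copied from the excerpt above; only two are I/O error lines.
    sample = [
        "Dec 31 10:09:35 Tower kernel: I/O error, dev loop2, sector 305624 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2",
        "Dec 31 10:09:35 Tower kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0",
        "Dec 31 10:09:35 Tower kernel: I/O error, dev loop2, sector 284456 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2",
    ]
    for (device, operation), n in count_io_errors(sample).items():
        print(f"{device} {operation}: {n}")
```

To run it against a saved log, pass an open file object instead of the sample list; the function only needs an iterable of lines.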

 

 

 

tower-diagnostics-20231231-1011.zip

2 hours ago, konaboy said:
Dec 31 10:09:14 Tower kernel: nvme 0000:03:00.0: platform quirk: setting simple suspend
Dec 31 10:09:14 Tower kernel: nvme 0000:04:00.0: platform quirk: setting simple suspend

Does the system use suspend / resume?

Posted (edited)
3 hours ago, Vr2Io said:

Does the system use suspend / resume?

No. In fact, I was in the GUI at the time that was logged. I was able to update the firmware on that NVMe drive. I'm going to put back some of the shares that were on it and see if it occurs again.

Edited by konaboy
  • Solution

Try this: on the Main GUI page, click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off


Reboot and see if it makes a difference.
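
After the reboot, one way to confirm that both options actually reached the kernel is to read `/proc/cmdline`. The following is a minimal Python sketch under stated assumptions: `parse_cmdline` and `missing_params` are hypothetical helper names, Python 3.9 or newer is assumed to be available on the machine, and the parser splits on whitespace only, so it does not handle quoted values that contain spaces.

```python
from pathlib import Path

# The two settings suggested in this thread.
WANTED = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
}


def parse_cmdline(cmdline: str) -> dict[str, str]:
    """Split a kernel command line into key/value pairs (bare flags map to '')."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def missing_params(cmdline: str, wanted: dict[str, str] = WANTED) -> list[str]:
    """Return the wanted 'key=value' settings that are absent or set differently."""
    params = parse_cmdline(cmdline)
    return [f"{key}={value}" for key, value in wanted.items() if params.get(key) != value]


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")  # exists on Linux systems
    if proc_cmdline.exists():
        missing = missing_params(proc_cmdline.read_text())
        print("all set" if not missing else "missing: " + ", ".join(missing))
```

If the script reports a missing setting, the edit was probably made to a boot entry other than the default one, or the server has not been rebooted since the change.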

On 1/2/2024 at 4:24 AM, JorgeB said:

Try this: on the Main GUI page, click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off


Reboot and see if it makes a difference.

Thanks. Between the above and updating the firmware on the NVMe drive, I'm all set. PS: After updating the firmware, I removed the above and ran for a few days, and the firmware update alone also seemed to fix the problem. I put the parameters back in as a belt-and-suspenders measure, just to be safe.
