NVME on UnRaid 6.12.8 having issues


Solved by JorgeB


So I upgraded from 6.12.4 to the latest UnRaid again after replacing my LSI card, since I figured my issues with UnRaid 6.12.8 were due to that card. Since the upgrade I've had a couple of issues. The first was Docker not port forwarding correctly, and I used the workaround for that. Now, when the NVMe goes under load in a Windows virtual machine, it drops offline and the entire WebUI freezes for a bit until the drive comes back.

I will post my diags! Thanks!
jlw-unraid-diagnostics-20240313-1847.zip
 

Edited by jlw_4049
Link to comment

Update: another error.

 

Mar 13 18:48:16 jlw-unRaid elogind-daemon[1821]: Removed session c1.
Mar 13 21:13:30 jlw-unRaid kernel: BTRFS error (device nvme1n1p1): parent transid verify failed on logical 461406208 mirror 1 wanted 172467 found 169668
Mar 13 21:13:30 jlw-unRaid kernel: BTRFS error (device nvme1n1p1): parent transid verify failed on logical 461406208 mirror 2 wanted 172467 found 169668
Mar 13 21:13:30 jlw-unRaid kernel: BTRFS: error (device nvme1n1p1: state A) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
Mar 13 21:13:30 jlw-unRaid kernel: BTRFS info (device nvme1n1p1: state EA): forced readonly

 

Link to comment

The second error is from the filesystem, the NVMe device is timing out:

 

Mar 13 18:36:58 jlw-unRaid kernel: nvme nvme1: I/O 517 (I/O Cmd) QID 1 timeout, aborting
Mar 13 18:36:58 jlw-unRaid kernel: nvme nvme1: I/O 518 (I/O Cmd) QID 1 timeout, aborting
Mar 13 18:36:58 jlw-unRaid kernel: nvme nvme1: I/O 519 (I/O Cmd) QID 1 timeout, aborting
Mar 13 18:36:58 jlw-unRaid kernel: nvme nvme1: I/O 520 (I/O Cmd) QID 1 timeout, aborting
Mar 13 18:37:28 jlw-unRaid kernel: nvme nvme1: I/O 517 QID 1 timeout, reset controller
Mar 13 18:37:29 jlw-unRaid kernel: nvme nvme1: Abort status: 0x371

 

Post new diags to see the filesystem errors, they could be related.
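To check whether the NVMe timeouts really precede the filesystem errors, the syslog lines can be sorted into a few event types and listed in the order they first appear. The following is only a rough sketch: the event labels and regular expressions are assumptions made up for this illustration, and the sample text is just a handful of lines copied from the excerpts in this thread, not the full diagnostics.

```python
import re

# A few lines copied from the syslog excerpts posted in this thread.
SYSLOG = """\
Mar 13 18:36:58 jlw-unRaid kernel: nvme nvme1: I/O 517 (I/O Cmd) QID 1 timeout, aborting
Mar 13 18:37:28 jlw-unRaid kernel: nvme nvme1: I/O 517 QID 1 timeout, reset controller
Mar 13 18:37:29 jlw-unRaid kernel: nvme nvme1: Abort status: 0x371
Mar 13 21:13:30 jlw-unRaid kernel: BTRFS error (device nvme1n1p1): parent transid verify failed on logical 461406208 mirror 1 wanted 172467 found 169668
Mar 13 21:13:30 jlw-unRaid kernel: BTRFS info (device nvme1n1p1: state EA): forced readonly
"""

# Hypothetical labels mapped to patterns; adjust them to the messages in your own log.
PATTERNS = {
    "nvme_timeout": re.compile(r"nvme nvme\d+: I/O \d+ .*timeout"),
    "nvme_abort_status": re.compile(r"nvme nvme\d+: Abort status"),
    "btrfs_error": re.compile(r"BTRFS:? error"),
    "btrfs_readonly": re.compile(r"BTRFS info .*forced readonly"),
}

# "Mar 13 18:36:58 <host> kernel: <message>"
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: (.*)$")


def classify(log_text):
    """Return (timestamp, label) pairs for every recognised kernel line, in log order."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        stamp, message = match.groups()
        for label, pattern in PATTERNS.items():
            if pattern.search(message):
                events.append((stamp, label))
                break
    return events


def first_seen(events):
    """Map each label to the timestamp of its first occurrence."""
    seen = {}
    for stamp, label in events:
        seen.setdefault(label, stamp)
    return seen


for label, stamp in first_seen(classify(SYSLOG)).items():
    print(f"{stamp}  {label}")
```

On the lines above, the timeout and abort entries show up before the first BTRFS error, which fits the idea that the device problem comes first and the filesystem errors follow.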

 

 

Link to comment
Posted (edited)

 

3 hours ago, JorgeB said:

The second error is from the filesystem, the NVMe device is timing out.

Post new diags to see the filesystem errors, they could be related.

 

 

It seems like both issues happened just like before. When it happens, the VM is doing something that hits the RAM and NVMe pretty hard for about a minute (it's doing some image generation on a video while compressing the images).

I just upgraded the LSI card, added fans, and cleaned everything up. Today I will reseat the RAM and run memtest86 to see if the errors are potentially coming from that. I did pull the diagnostics to send to you before doing that: jlw-unraid-diagnostics-20240314-0844.zip

I did run a short SMART test on the NVMe and it shows no errors or issues, so I'm wondering if I knocked the RAM loose or something when moving the case and cleaning the machine.

After waking up and pulling the diags, I checked the logs again and I see LOTS of these:
 

Mar 14 02:53:17 jlw-unRaid kernel: verify_parent_transid: 2909 callbacks suppressed
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 0 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 0 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: verify_parent_transid: 3188 callbacks suppressed
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
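Because the same message repeats thousands of times, it can help to collapse the flood into one row per distinct block. In these messages "wanted" is the transaction ID the parent node expects and "found" is the one actually read back, so a "found" value that is lower than "wanted" means the block on disk is older than it should be. The sketch below, with a function name invented for this example and a sample made of three lines from this thread, counts the repeats and reports how far behind each block is.

```python
import re
from collections import Counter

# Matches the tail of the kernel's "parent transid verify failed" message.
TRANSID_RE = re.compile(
    r"parent transid verify failed on logical (\d+) mirror (\d+) "
    r"wanted (\d+) found (\d+)"
)

# Three lines copied from the log excerpts in this thread.
SAMPLE = """\
Mar 13 21:13:30 jlw-unRaid kernel: BTRFS error (device nvme1n1p1): parent transid verify failed on logical 461406208 mirror 1 wanted 172467 found 169668
Mar 13 21:13:30 jlw-unRaid kernel: BTRFS error (device nvme1n1p1): parent transid verify failed on logical 461406208 mirror 2 wanted 172467 found 169668
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
"""


def summarize_transid_errors(log_text):
    """Collapse repeated transid errors into one row per (logical, wanted, found)."""
    counts = Counter()
    for match in TRANSID_RE.finditer(log_text):
        logical, _mirror, wanted, found = map(int, match.groups())
        counts[(logical, wanted, found)] += 1
    return [
        {
            "logical": logical,
            "wanted": wanted,
            "found": found,
            "behind_by": wanted - found,  # how many transactions the block lags
            "lines": lines,
        }
        for (logical, wanted, found), lines in counts.items()
    ]


for row in summarize_transid_errors(SAMPLE):
    print(row)
```

For the sample lines, the two blocks are 2799 and 2789 transactions behind what their parents expect, which is consistent with writes that never reached the drive.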

 

Edited by jlw_4049
Link to comment
  • Solution
Mar 13 01:04:54 jlw-unRaid kernel: nvme1n1: I/O Cmd(0x2) @ LBA 879300256, 224 blocks, I/O Error (sct 0x3 / sc 0x71)
Mar 13 01:04:54 jlw-unRaid kernel: nvme1n1: I/O Cmd(0x2) @ LBA 879300768, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
Mar 13 01:04:54 jlw-unRaid kernel: nvme1n1: I/O Cmd(0x2) @ LBA 886718400, 512 blocks, I/O Error (sct 0x3 / sc 0x71)
Mar 13 01:04:54 jlw-unRaid kernel: I/O error, dev nvme1n1, sector 879300768 op 0x0:(READ) flags 0x80700 phys_seg 18 prio class 2
Mar 13 01:04:54 jlw-unRaid kernel: I/O error, dev nvme1n1, sector 879300256 op 0x0:(READ) flags 0x80700 phys_seg 16 prio class 2
Mar 13 01:04:54 jlw-unRaid kernel: I/O error, dev nvme1n1, sector 886718400 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 2
Mar 13 01:04:54 jlw-unRaid kernel: nvme1n1: I/O Cmd(0x2) @ LBA 879300512, 96 blocks, I/O Error (sct 0x3 / sc 0x71)
Mar 13 01:04:54 jlw-unRaid kernel: I/O error, dev nvme1n1, sector 879300512 op 0x0:(READ) flags 0x80700 phys_seg 7 prio class 2
Mar 13 01:04:54 jlw-unRaid kernel: nvme1n1: I/O Cmd(0x2) @ LBA 879300480, 32 blocks, I/O Error (sct 0x3 / sc 0x71)
Mar 13 01:04:54 jlw-unRaid kernel: I/O error, dev nvme1n1, sector 879300480 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
Mar 13 01:04:54 jlw-unRaid kernel: nvme1n1: I/O Cmd(0x2) @ LBA 879300000, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
Mar 13 01:04:54 jlw-unRaid kernel: I/O error, dev nvme1n1, sector 879300000 op 0x0:(READ) flags 0x80700 phys_seg 17 prio class 2


 

 

After aborting, the NVMe device is giving errors and losing writes, hence the btrfs errors afterwards. I would suggest trying a different one if possible.
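The "Abort status: 0x371" line from the earlier excerpt and the "(sct 0x3 / sc 0x71)" in these I/O errors appear to be the same value printed two ways. In the NVMe status field, the low 8 bits are the Status Code (SC) and the next 3 bits are the Status Code Type (SCT). The small sketch below splits the value accordingly; the function name is made up for this example, and the table of type names is my reading of the NVMe base specification rather than anything taken from the diagnostics. As far as I know, SC 0x71 under the path related type is listed as "Command Aborted By Host", which would match the driver aborting the commands after the timeout.

```python
# Status Code Type names, as I understand them from the NVMe base specification.
SCT_NAMES = {
    0x0: "Generic Command Status",
    0x1: "Command Specific Status",
    0x2: "Media and Data Integrity Errors",
    0x3: "Path Related Status",
    0x7: "Vendor Specific",
}


def split_nvme_status(status):
    """Split a status value into (sct, sc): SC is bits 7:0, SCT is bits 10:8."""
    sc = status & 0xFF
    sct = (status >> 8) & 0x7
    return sct, sc


def describe(status):
    """Format a status value the way the kernel's I/O error lines show it."""
    sct, sc = split_nvme_status(status)
    name = SCT_NAMES.get(sct, "Reserved or unknown")
    return f"sct 0x{sct:x} / sc 0x{sc:x} ({name})"


print(describe(0x371))
```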

Link to comment
1 hour ago, JorgeB said:
After aborting, the NVMe device is giving errors and losing writes, hence the btrfs errors afterwards. I would suggest trying a different one if possible.

Do you think it's worth checking the RAM, or purely just the NVMe?

Link to comment

Does the NVMe disk have a heatsink?

 

I have had issues in the past using a number of mini-PCs where the NVMe disks had very little cooling and would experience issues under heavy load - cheap aluminium heatsinks from Amazon resolved these issues.  I always fit these now unless the motherboard or expansion card is supplied with a heatsink for NVMe disks.  
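One way to test the overheating theory is to watch the drive temperature while the heavy workload runs. On Linux kernels built with NVMe hwmon support, each controller normally registers a hwmon device named "nvme" whose temp1_input file reports the composite temperature in millidegrees Celsius. The sketch below reads those files. The function name is invented for this example, and it assumes the running kernel exposes these sysfs entries, which I have not verified for Unraid 6.12.8.

```python
from pathlib import Path


def nvme_temperatures(hwmon_root="/sys/class/hwmon"):
    """Return {hwmon device name: temperature in degrees C} for NVMe sensors.

    Assumes the kernel exposes NVMe drives as hwmon devices named "nvme".
    Returns an empty dict if none are found.
    """
    readings = {}
    for hwmon in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = hwmon / "name"
        temp_file = hwmon / "temp1_input"
        if not name_file.is_file() or not temp_file.is_file():
            continue
        if name_file.read_text().strip() != "nvme":
            continue  # skip CPU, motherboard and other sensors
        # temp1_input is reported in millidegrees Celsius.
        readings[hwmon.name] = int(temp_file.read_text().strip()) / 1000.0
    return readings
```

Calling nvme_temperatures() every few seconds during the VM workload and printing the result would show whether the temperature climbs sharply just before the timeouts appear in the log.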

Edited by PPH
Link to comment
12 minutes ago, JorgeB said:

Not seeing anything so far that suggests a RAM issue, but if you don't have ECC RAM and it's been a while since a memtest, it won't hurt to do one.

Thanks, I'll grab a drive today and report back. I'm near a Best Buy.

 

Thanks for the help

Link to comment
Posted (edited)
6 hours ago, PPH said:

Does the NVMe disk have a heatsink?

 

I have had issues in the past using a number of mini-PCs where the NVMe disks had very little cooling and would experience issues under heavy load - cheap aluminium heatsinks from Amazon resolved these issues.  I always fit these now unless the motherboard or expansion card is supplied with a heatsink for NVMe disks.  


Thanks for the response. It was an Inland 3D TLC NAND drive that I had to RMA once after having it for about 2 months. They sent me a "fixed" one, and it turns out it just did the same thing again (only the first time the issues arose in Windows, so it was a very different error). It's not very performant, so I don't think it would get too hot, and it has a lot of airflow on it.
 

6 hours ago, JorgeB said:

Not seeing anything so far that suggests a RAM issue, but if you don't have ECC RAM and it's been a while since a memtest, it won't hurt to do one.


Went ahead and grabbed a Samsung 980 Pro to replace the NVMe, fired it up, put it under stress, and it's working flawlessly. Thanks again for the help!

Edited by jlw_4049
Link to comment
