jlw_4049 Posted March 13 (edited)

I upgraded from 6.12.4 to the latest Unraid again after replacing my LSI card, since I figured my issues with Unraid 6.12.8 were caused by that card. Since the upgrade I've had a couple of issues. The first was Docker not port forwarding correctly, which I fixed with the workaround. Now the NVMe drops offline when it goes under load in a Windows virtual machine, and the entire WebUI freezes for a bit until the drive comes back.

I will post my diags. Thanks!

jlw-unraid-diagnostics-20240313-1847.zip
jlw_4049 Posted March 14 (Author)

Update, another error:

r 13 18:48:16 jlw-unRaid elogind-daemon[1821]: Removed session c1.
Mar 13 21:13:30 jlw-unRaid kernel: BTRFS error (device nvme1n1p1): parent transid verify failed on logical 461406208 mirror 1 wanted 172467 found 169668
Mar 13 21:13:30 jlw-unRaid kernel: BTRFS error (device nvme1n1p1): parent transid verify failed on logical 461406208 mirror 2 wanted 172467 found 169668
Mar 13 21:13:30 jlw-unRaid kernel: BTRFS: error (device nvme1n1p1: state A) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
Mar 13 21:13:30 jlw-unRaid kernel: BTRFS info (device nvme1n1p1: state EA): forced readonly
JorgeB Posted March 14

The second error is from the filesystem; the NVMe device is timing out:

Mar 13 18:36:58 jlw-unRaid kernel: nvme nvme1: I/O 517 (I/O Cmd) QID 1 timeout, aborting
Mar 13 18:36:58 jlw-unRaid kernel: nvme nvme1: I/O 518 (I/O Cmd) QID 1 timeout, aborting
Mar 13 18:36:58 jlw-unRaid kernel: nvme nvme1: I/O 519 (I/O Cmd) QID 1 timeout, aborting
Mar 13 18:36:58 jlw-unRaid kernel: nvme nvme1: I/O 520 (I/O Cmd) QID 1 timeout, aborting
Mar 13 18:37:28 jlw-unRaid kernel: nvme nvme1: I/O 517 QID 1 timeout, reset controller
Mar 13 18:37:29 jlw-unRaid kernel: nvme nvme1: Abort status: 0x371

Post new diags to see the filesystem errors, they could be related.
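A quick way to see how often the controller timed out is to scan the syslog for these messages. The sketch below is only an illustration: the regular expression is written against the exact lines quoted in this post, the name `summarize_timeouts` is made up here, and other kernel versions may word the messages differently.

```python
import re
from collections import Counter

# Written against the two message shapes quoted in the thread (an assumption,
# not a complete grammar of the nvme driver's output):
#   "... kernel: nvme nvme1: I/O 517 (I/O Cmd) QID 1 timeout, aborting"
#   "... kernel: nvme nvme1: I/O 517 QID 1 timeout, reset controller"
TIMEOUT_RE = re.compile(
    r"kernel: nvme (?P<ctrl>nvme\d+): I/O (?P<tag>\d+)(?: \([^)]*\))? "
    r"QID (?P<qid>\d+) timeout, (?P<action>aborting|reset controller)"
)


def summarize_timeouts(lines):
    """Count timeout messages per (controller, action) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match["ctrl"], match["action"])] += 1
    return counts


# The six lines quoted above, copied verbatim.
SAMPLE = [
    "Mar 13 18:36:58 jlw-unRaid kernel: nvme nvme1: I/O 517 (I/O Cmd) QID 1 timeout, aborting",
    "Mar 13 18:36:58 jlw-unRaid kernel: nvme nvme1: I/O 518 (I/O Cmd) QID 1 timeout, aborting",
    "Mar 13 18:36:58 jlw-unRaid kernel: nvme nvme1: I/O 519 (I/O Cmd) QID 1 timeout, aborting",
    "Mar 13 18:36:58 jlw-unRaid kernel: nvme nvme1: I/O 520 (I/O Cmd) QID 1 timeout, aborting",
    "Mar 13 18:37:28 jlw-unRaid kernel: nvme nvme1: I/O 517 QID 1 timeout, reset controller",
    "Mar 13 18:37:29 jlw-unRaid kernel: nvme nvme1: Abort status: 0x371",
]

for (ctrl, action), n in sorted(summarize_timeouts(SAMPLE).items()):
    print(f"{ctrl}: {n} x {action}")
```

On a live system the same function could be fed the lines of the syslog file instead of `SAMPLE`; the exact path depends on the setup.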
jlw_4049 Posted March 14 (Author, edited)

3 hours ago, JorgeB said: The second error is from the filesystem; the NVMe device is timing out [...] Post new diags to see the filesystem errors, they could be related.

It seems like both issues happened just like before. When it happens, the VM is doing something that hits the RAM and NVMe pretty hard for about a minute (it's generating images from a video while compressing them). I just upgraded the LSI card, added fans, and cleaned everything up. Today I will reseat the RAM and run memtest86 to see whether the errors are coming from that. I did pull the diagnostics to send to you before doing that.

jlw-unraid-diagnostics-20240314-0844.zip

I did a short SMART test on the NVMe and it shows no errors or issues, so I'm wondering if I knocked the RAM loose when moving the case and cleaning the machine.
After waking up and pulling the diags, I checked the logs again and I see LOTS of these:

Mar 14 02:53:17 jlw-unRaid kernel: verify_parent_transid: 2909 callbacks suppressed
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 0 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 0 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: verify_parent_transid: 3188 callbacks suppressed
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684
Mar 14 02:54:06 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684
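Because the same message repeats thousands of times, it can help to collapse it to one entry per metadata block and look at how far apart the two generation numbers are. In these messages "wanted" is the generation the parent node expects and "found" is the generation stored in the block that was actually read. The sketch below is a rough illustration written against the lines posted in this thread; `transid_gaps` is a made-up helper name and the pattern is an assumption about the message layout, not a general btrfs log parser.

```python
import re

# Handles both shapes seen in the thread:
#   "BTRFS error (device nvme1n1p1): parent transid verify failed ..."
#   "BTRFS error (device nvme1n1p1: state EA): parent transid verify failed ..."
TRANSID_RE = re.compile(
    r"BTRFS error \(device (?P<dev>[^):]+)(?:: state (?P<state>\w+))?\): "
    r"parent transid verify failed on logical (?P<logical>\d+) "
    r"mirror (?P<mirror>\d+) wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def transid_gaps(lines):
    """Map (device, logical address) to (wanted, found, wanted - found).

    Repeats of the same block are collapsed; only the first hit is kept.
    """
    gaps = {}
    for line in lines:
        match = TRANSID_RE.search(line)
        if not match:
            continue
        key = (match["dev"], int(match["logical"]))
        wanted, found = int(match["wanted"]), int(match["found"])
        gaps.setdefault(key, (wanted, found, wanted - found))
    return gaps


# One line from the earlier update post and two from the log above.
SAMPLE = [
    "Mar 13 21:13:30 jlw-unRaid kernel: BTRFS error (device nvme1n1p1): parent transid verify failed on logical 461406208 mirror 1 wanted 172467 found 169668",
    "Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 1 wanted 172473 found 169684",
    "Mar 14 02:53:17 jlw-unRaid kernel: BTRFS error (device nvme1n1p1: state EA): parent transid verify failed on logical 490471424 mirror 2 wanted 172473 found 169684",
]

for (dev, logical), (wanted, found, gap) in transid_gaps(SAMPLE).items():
    print(f"{dev} logical {logical}: wanted {wanted}, found {found}, behind by {gap}")
```

For the lines in this thread the blocks are 2799 and 2789 generations behind. A "found" value that is older than "wanted" generally points to metadata writes that never made it to the device, although the log alone does not say why.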
JorgeB Posted March 14 (Solution)

Mar 13 01:04:54 jlw-unRaid kernel: nvme1n1: I/O Cmd(0x2) @ LBA 879300256, 224 blocks, I/O Error (sct 0x3 / sc 0x71)
Mar 13 01:04:54 jlw-unRaid kernel: nvme1n1: I/O Cmd(0x2) @ LBA 879300768, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
Mar 13 01:04:54 jlw-unRaid kernel: nvme1n1: I/O Cmd(0x2) @ LBA 886718400, 512 blocks, I/O Error (sct 0x3 / sc 0x71)
Mar 13 01:04:54 jlw-unRaid kernel: I/O error, dev nvme1n1, sector 879300768 op 0x0:(READ) flags 0x80700 phys_seg 18 prio class 2
Mar 13 01:04:54 jlw-unRaid kernel: I/O error, dev nvme1n1, sector 879300256 op 0x0:(READ) flags 0x80700 phys_seg 16 prio class 2
Mar 13 01:04:54 jlw-unRaid kernel: I/O error, dev nvme1n1, sector 886718400 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 2
Mar 13 01:04:54 jlw-unRaid kernel: nvme1n1: I/O Cmd(0x2) @ LBA 879300512, 96 blocks, I/O Error (sct 0x3 / sc 0x71)
Mar 13 01:04:54 jlw-unRaid kernel: I/O error, dev nvme1n1, sector 879300512 op 0x0:(READ) flags 0x80700 phys_seg 7 prio class 2
Mar 13 01:04:54 jlw-unRaid kernel: nvme1n1: I/O Cmd(0x2) @ LBA 879300480, 32 blocks, I/O Error (sct 0x3 / sc 0x71)
Mar 13 01:04:54 jlw-unRaid kernel: I/O error, dev nvme1n1, sector 879300480 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
Mar 13 01:04:54 jlw-unRaid kernel: nvme1n1: I/O Cmd(0x2) @ LBA 879300000, 256 blocks, I/O Error (sct 0x3 / sc 0x71)
Mar 13 01:04:54 jlw-unRaid kernel: I/O error, dev nvme1n1, sector 879300000 op 0x0:(READ) flags 0x80700 phys_seg 17 prio class 2

After aborting, the NVMe device is giving errors and losing writes, hence the btrfs errors afterwards. I would suggest trying a different one if possible.
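The "(sct 0x3 / sc 0x71)" in these lines and the "Abort status: 0x371" from the earlier timeout log are the same status code written two ways. A minimal sketch of the split, assuming the combined value keeps the Status Code in its low 8 bits and the Status Code Type in the 3 bits above that (`split_nvme_status` is a hypothetical helper, not a kernel or nvme-cli function):

```python
def split_nvme_status(status):
    """Split a combined NVMe status value into (sct, sc).

    Assumes the layout described in the lead-in: bits 7:0 are the
    Status Code, bits 10:8 are the Status Code Type.
    """
    sct = (status >> 8) & 0x7
    sc = status & 0xFF
    return sct, sc


sct, sc = split_nvme_status(0x371)
print(f"sct 0x{sct:x} / sc 0x{sc:x}")  # prints "sct 0x3 / sc 0x71"
```

As far as I recall, the Linux headers name 0x371 `NVME_SC_HOST_ABORTED_CMD`, a path-related status meaning the host aborted the command. If that is right, these lines record the host cancelling outstanding commands around the timeout and controller reset, rather than the drive reporting a media error.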
jlw_4049 Posted March 14 (Author)

1 hour ago, JorgeB said: [...] After aborting, the NVMe device is giving errors and losing writes, hence the btrfs errors afterwards. I would suggest trying a different one if possible.

Do you think it's worth checking the RAM, or purely just the NVMe?
JorgeB Posted March 14

Not seeing anything so far that suggests a RAM issue, but if you don't have ECC RAM and it's been a while since a memtest, it won't hurt to do one.
PPH Posted March 14 (edited)

Does the NVMe disk have a heatsink? I have had issues in the past using a number of mini-PCs where the NVMe disks had very little cooling and would experience issues under heavy load - cheap aluminium heatsinks from Amazon resolved these issues. I always fit these now unless the motherboard or expansion card is supplied with a heatsink for NVMe disks.
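One way to check whether heat is a factor is to read the drive's own health log while the VM workload is running. The sketch below shells out to `smartctl --json -a` and picks a few fields out of the result. It assumes a smartmontools version with JSON output and that the NVMe health data appears under the `nvme_smart_health_information_log` key; the helper names and the `/dev/nvme1` path are illustrative only.

```python
import json
import subprocess


def read_smart_report(device):
    """Run smartctl on a device and return its JSON report as a dict.

    smartctl can exit non-zero even when it prints a usable report,
    so the exit code is deliberately not checked here.
    """
    result = subprocess.run(
        ["smartctl", "--json", "-a", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return json.loads(result.stdout)


def summarize_health(report):
    """Pick out the fields most relevant to a thermal or media problem."""
    log = report.get("nvme_smart_health_information_log", {})
    return {
        "temperature_c": log.get("temperature"),
        "critical_warning": log.get("critical_warning"),
        "media_errors": log.get("media_errors"),
        # Time the drive reports having spent above its warning temperature.
        "warning_temp_time": log.get("warning_temp_time"),
    }
```

Calling `summarize_health(read_smart_report("/dev/nvme1"))` a few times under load, and comparing the temperature with the idle value, would show whether the drive is heating up before it drops out. A clean result here does not rule out a faulty drive.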
jlw_4049 Posted March 14 (Author)

12 minutes ago, JorgeB said: Not seeing anything so far that suggests a RAM issue, but if you don't have ECC RAM and it's been a while since a memtest, it won't hurt to do one.

Thanks, I'll grab a drive today and report back. I'm near a Best Buy. Thanks for the help.
jlw_4049 Posted March 14 (Author, edited)

6 hours ago, PPH said: Does the NVMe disk have a heatsink? [...]

Thanks for the response. It was an Inland 3D TLC NAND drive that I had to RMA once after having it for about 2 months. They sent me a "fixed" one, and it turns out it just did the same thing again (the first time the issues arose in Windows, so it was a very different error). It's not very performant, so I don't think it would get too hot, and it has a lot of airflow over it.

6 hours ago, JorgeB said: Not seeing anything so far that suggests a RAM issue, but if you don't have ECC RAM and it's been a while since a memtest, it won't hurt to do one.

I went ahead and grabbed a Samsung 980 Pro to replace the NVMe, fired it up, and put it under stress. It's working flawlessly. Thanks again for the help!