Is my NVMe drive dead, or is it a RAM problem?


Hi! 

One of my NVMe drives has suddenly started giving me a lot of BTRFS errors. See the attached syslog.

 

Jan 30 07:35:27 MONSTERSERVERN kernel: I/O error, dev loop2, sector 37325840 op 0x1:(WRITE) flags 0x100000 phys_seg 4 prio class 2
Jan 30 07:35:27 MONSTERSERVERN kernel: I/O error, dev loop2, sector 37300584 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 2
Jan 30 07:35:27 MONSTERSERVERN kernel: loop: Write error at byte offset 16593756160, length 4096.
Jan 30 07:35:27 MONSTERSERVERN kernel: I/O error, dev loop2, sector 32409680 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 2
Jan 30 07:35:27 MONSTERSERVERN kernel: loop: Write error at byte offset 19110830080, length 4096.
Jan 30 07:35:27 MONSTERSERVERN kernel: I/O error, dev loop2, sector 37325840 op 0x1:(WRITE) flags 0x100000 phys_seg 4 prio class 2
Jan 30 07:35:30 MONSTERSERVERN kernel: btrfs_dev_stat_inc_and_print: 330006 callbacks suppressed
[...]
Jan 30 07:35:30 MONSTERSERVERN kernel: BTRFS error (device nvme0n1p1: state EA): bdev /dev/nvme0n1p1 errs: wr 172, rd 2403805, flush 0, corrupt 0, gen 0
Jan 30 07:35:30 MONSTERSERVERN kernel: BTRFS error (device nvme0n1p1: state EA): bdev /dev/nvme0n1p1 errs: wr 172, rd 2403807, flush 0, corrupt 0, gen 0
Jan 30 07:35:30 MONSTERSERVERN kernel: BTRFS error (device nvme0n1p1: state EA): bdev /dev/nvme0n1p1 errs: wr 172, rd 2403809, flush 0, corrupt 0, gen 0
[...]
Jan 30 07:35:37 MONSTERSERVERN kernel: I/O error, dev loop2, sector 37430928 op 0x0:(READ) flags 0x80700 phys_seg 4 prio class 2
[...]
Jan 30 09:28:14 MONSTERSERVERN kernel: nvme0n1: I/O Cmd(0x2) @ LBA 1066408400, 8 blocks, I/O Error (sct 0x3 / sc 0x71) 
Jan 30 09:28:14 MONSTERSERVERN kernel: I/O error, dev nvme0n1, sector 1178887112 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 2
[...]
Jan 30 09:28:14 MONSTERSERVERN kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Jan 30 09:28:14 MONSTERSERVERN kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 3, flush 0, corrupt 0, gen 0
Jan 30 09:28:14 MONSTERSERVERN kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 4, flush 0, corrupt 0, gen 0
Jan 30 09:28:14 MONSTERSERVERN kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 1, rd 4, flush 0, corrupt 0, gen 0
Jan 30 09:28:14 MONSTERSERVERN kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 2, rd 4, flush 0, corrupt 0, gen 0
Jan 30 09:28:14 MONSTERSERVERN kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 3, rd 4, flush 0, corrupt 0, gen 0
Jan 30 09:28:14 MONSTERSERVERN kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Jan 30 09:28:14 MONSTERSERVERN kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 3, rd 6, flush 0, corrupt 0, gen 0
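To make an excerpt like the one above easier to read, the per-device counters can be pulled out of the `BTRFS error ... errs:` lines. This is only a rough sketch: the function name `latest_btrfs_counters` is made up for illustration, and the regular expression assumes the line layout shown in this log.

```python
import re

# Matches the counter part of kernel lines such as:
#   BTRFS error (device nvme0n1p1: state EA): bdev /dev/nvme0n1p1 errs: wr 172, rd 2403805, flush 0, corrupt 0, gen 0
# The layout is assumed from the excerpt above.
BTRFS_ERRS = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def latest_btrfs_counters(lines):
    """Return {device: {counter: value}}, keeping the last line seen per device.

    Hypothetical helper, not part of any btrfs tooling.
    """
    latest = {}
    for line in lines:
        match = BTRFS_ERRS.search(line)
        if match:
            fields = match.groupdict()
            dev = fields.pop("dev")
            latest[dev] = {name: int(value) for name, value in fields.items()}
    return latest


if __name__ == "__main__":
    import sys

    # Usage: python script.py < syslog
    for dev, counters in latest_btrfs_counters(sys.stdin).items():
        print(dev, counters)
```

In the lines pasted here, only the `wr` and `rd` counters rise; `flush`, `corrupt` and `gen` stay at 0.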

Do you think this means that the drive has gone bad or are the errors caused by a problem with my RAM? I have non-ECC DDR4, and recently applied an XMP profile to run it at its native speed (3200MHz). Maybe that was too stressful for the RAM? Previously I ran it at 2133MHz for stability. My cache pool and another NVMe drive are also using BTRFS so I want to know whether there is a risk that they might fail as well. 

Is there a way to recover data on the drive or should I just format it? 

 

Thank you in advance! 

monsterservern-syslog-20240130-0635.zip

monsterservern-diagnostics-20240130-0938.zip

Edited by eribob
Link to comment
Jan 30 09:28:14 MONSTERSERVERN kernel: nvme0n1: I/O Cmd(0x2) @ LBA 1066408400, 8 blocks, I/O Error (sct 0x3 / sc 0x71) 
Jan 30 09:28:14 MONSTERSERVERN kernel: I/O error, dev nvme0n1, sector 1178887112 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 2
Jan 30 09:28:14 MONSTERSERVERN kernel: nvme0n1: detected capacity change from 3907029168 to 0

 

The NVMe device dropped offline. Try a different M.2 slot if one is available; if the same thing happens there, it may be a device issue.
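The `detected capacity change from 3907029168 to 0` line quoted above is the sign of the drop. As a rough sketch, a syslog can be scanned for that pattern like this. The function name `find_dropped_devices` is hypothetical, and the size conversion assumes the kernel reports the capacity in 512-byte sectors.

```python
import re

# Matches lines such as:
#   ... kernel: nvme0n1: detected capacity change from 3907029168 to 0
CAPACITY_CHANGE = re.compile(
    r"kernel: (?P<dev>\S+): detected capacity change "
    r"from (?P<old>\d+) to (?P<new>\d+)"
)

# Assumption: the numbers in the message are 512-byte sectors.
SECTOR_BYTES = 512


def find_dropped_devices(lines):
    """Return (device, previous_size_in_bytes) for each capacity change to 0.

    Hypothetical helper for illustration only.
    """
    dropped = []
    for line in lines:
        match = CAPACITY_CHANGE.search(line)
        if not match:
            continue
        old, new = int(match["old"]), int(match["new"])
        if old > 0 and new == 0:
            dropped.append((match["dev"], old * SECTOR_BYTES))
    return dropped


if __name__ == "__main__":
    import sys

    # Usage: python script.py < syslog
    for dev, size in find_dropped_devices(sys.stdin):
        print(f"{dev} dropped offline (was {size / 1e12:.2f} TB)")
```

Under that assumption, 3907029168 sectors works out to roughly 2 TB, so the whole device disappeared rather than a single partition.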

Link to comment

Thank you for the reply! 

24 minutes ago, JorgeB said:

The NVMe device dropped offline. Try a different M.2 slot if one is available; if the same thing happens there, it may be a device issue.

So that would mean that the NVMe slot on my motherboard suddenly stopped working? The drive has been in it for 1-2 years without issues. In that case, doesn't it sound more likely that the drive itself is the problem? Moving the NVMe drive is not trivial, hehe; I have to disassemble the server...

 

I also tried a suggestion from another thread:

 

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

But this did not help; if anything, I got even more problems afterwards. Maybe that is just a coincidence, though. Can these boot parameters make things worse in some circumstances?
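For reference, `nvme_core.default_ps_max_latency_us=0` disables NVMe autonomous power state transitions, and `pcie_aspm=off` disables PCIe Active State Power Management. Before drawing conclusions, it may be worth confirming that the running kernel actually received them. A minimal sketch that reads `/proc/cmdline`; the helper names are made up for illustration, and values containing quoted spaces are not handled.

```python
def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None.

    Simplified: splits on whitespace, so quoted values with spaces
    are not supported.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_running_cmdline(path="/proc/cmdline"):
    """Parse the command line of the currently running kernel (Linux only)."""
    with open(path) as handle:
        return parse_cmdline(handle.read())


if __name__ == "__main__":
    running = read_running_cmdline()
    for name in ("nvme_core.default_ps_max_latency_us", "pcie_aspm"):
        print(name, "=", running.get(name, "<not set>"))
```

If either name prints `<not set>` after a reboot, the edit to the append line did not take effect.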

 

 

Edited by eribob