NVMe cache corruption


Solved by JorgeB

Hello everyone,


I know this is not a new subject. I have seen similar issues on the forum, but I am really not an expert in btrfs or cache corruption, and I want to correct the issue the right way.

I recently upgraded my server with parts from my old personal PC: I changed the motherboard, CPU, and RAM. When I booted the machine, I had an issue with the cache drive: it was corrupted. At the time I had the impression the SSD was simply dead, because it was a cheap WD and I had an issue with a WD SSD before, so I decided to replace the single cache drive with 2 Crucial NVMe SSDs to add parity. After a few days I started having issues with the Docker btrfs image (some corruption). I rebuilt it, and it was fine for a few days. Now I have an error on the cache pool:

Jun 15 20:59:12 INSTALLATION-00 kernel: BTRFS error (device nvme0n1p1): block=178208145408 write time tree block corruption detected
Jun 15 20:59:12 INSTALLATION-00 kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2460: errno=-5 IO failure (Error while writing out transaction)
Jun 15 20:59:12 INSTALLATION-00 kernel: BTRFS: error (device nvme0n1p1: state EA) in cleanup_transaction:1958: errno=-5 IO failure
Jun 15 20:59:13 INSTALLATION-00 kernel: loop: Write error at byte offset 433504256, length 4096.
Jun 15 20:59:13 INSTALLATION-00 kernel: loop: Write error at byte offset 165068800, length 4096.
Jun 15 20:59:13 INSTALLATION-00 kernel: loop: Write error at byte offset 433487872, length 4096.
Jun 15 20:59:13 INSTALLATION-00 kernel: loop: Write error at byte offset 165052416, length 4096.
Jun 15 20:59:13 INSTALLATION-00 kernel: I/O error, dev loop2, sector 846688 op 0x1:(WRITE) flags 0x1800 phys_seg 4 prio class 2
Jun 15 20:59:13 INSTALLATION-00 kernel: I/O error, dev loop2, sector 322400 op 0x1:(WRITE) flags 0x1800 phys_seg 4 prio class 2
Jun 15 20:59:13 INSTALLATION-00 kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 1, rd 0, flush 0, corrupt 2, gen 0
Jun 15 20:59:13 INSTALLATION-00 kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 2, rd 0, flush 0, corrupt 2, gen 0
Jun 15 20:59:13 INSTALLATION-00 kernel: I/O error, dev loop2, sector 846656 op 0x1:(WRITE) flags 0x1800 phys_seg 4 prio class 2
Jun 15 20:59:13 INSTALLATION-00 kernel: I/O error, dev loop2, sector 322368 op 0x1:(WRITE) flags 0x1800 phys_seg 4 prio class 2
Jun 15 20:59:13 INSTALLATION-00 kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 3, rd 0, flush 0, corrupt 2, gen 0
Jun 15 20:59:13 INSTALLATION-00 kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 4, rd 0, flush 0, corrupt 2, gen 0
Jun 15 20:59:13 INSTALLATION-00 kernel: loop: Write error at byte offset 163217408, length 4096.
Jun 15 20:59:13 INSTALLATION-00 kernel: loop: Write error at byte offset 431652864, length 4096.
Jun 15 20:59:13 INSTALLATION-00 kernel: loop: Write error at byte offset 163315712, length 4096.
Jun 15 20:59:13 INSTALLATION-00 kernel: I/O error, dev loop2, sector 318784 op 0x1:(WRITE) flags 0x1800 phys_seg 11 prio class 2
Jun 15 20:59:13 INSTALLATION-00 kernel: loop: Write error at byte offset 431751168, length 4096.
Jun 15 20:59:13 INSTALLATION-00 kernel: loop: Write error at byte offset 163397632, length 4096.
Jun 15 20:59:13 INSTALLATION-00 kernel: loop: Write error at byte offset 431833088, length 4096.
Jun 15 20:59:13 INSTALLATION-00 kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 5, rd 0, flush 0, corrupt 2, gen 0
Jun 15 20:59:13 INSTALLATION-00 kernel: I/O error, dev loop2, sector 843072 op 0x1:(WRITE) flags 0x1800 phys_seg 11 prio class 2
Jun 15 20:59:13 INSTALLATION-00 kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 6, rd 0, flush 0, corrupt 2, gen 0
Jun 15 20:59:13 INSTALLATION-00 kernel: I/O error, dev loop2, sector 318976 op 0x1:(WRITE) flags 0x1800 phys_seg 12 prio class 2
Jun 15 20:59:13 INSTALLATION-00 kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 7, rd 0, flush 0, corrupt 2, gen 0
Jun 15 20:59:13 INSTALLATION-00 kernel: I/O error, dev loop2, sector 843264 op 0x1:(WRITE) flags 0x1800 phys_seg 12 prio class 2
Jun 15 20:59:13 INSTALLATION-00 kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 8, rd 0, flush 0, corrupt 2, gen 0
Jun 15 20:59:13 INSTALLATION-00 kernel: I/O error, dev loop2, sector 319136 op 0x1:(WRITE) flags 0x1800 phys_seg 20 prio class 2
Jun 15 20:59:13 INSTALLATION-00 kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 9, rd 0, flush 0, corrupt 2, gen 0
Jun 15 20:59:13 INSTALLATION-00 kernel: I/O error, dev loop2, sector 843424 op 0x1:(WRITE) flags 0x1800 phys_seg 20 prio class 2
Jun 15 20:59:13 INSTALLATION-00 kernel: BTRFS error (device loop2): bdev /dev/loop2 errs: wr 10, rd 0, flush 0, corrupt 2, gen 0
Jun 15 20:59:13 INSTALLATION-00 kernel: BTRFS: error (device loop2) in btrfs_commit_transaction:2460: errno=-5 IO failure (Error while writing out transaction)
Jun 15 20:59:13 INSTALLATION-00 kernel: BTRFS: error (device loop2: state EA) in cleanup_transaction:1958: errno=-5 IO failure
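
For anyone comparing logs like the one above, the per-device counters in the `bdev ... errs:` lines can be pulled out with a short script. This is only a sketch: `latest_btrfs_counters` and the regular expression are illustrative names of my own, not part of any Unraid or btrfs tool, and the pattern assumes the message format matches the lines shown here.

```python
import re

# Assumed format, taken from the log lines above:
# "BTRFS error (device loop2): bdev /dev/loop2 errs: wr 10, rd 0, flush 0, corrupt 2, gen 0"
ERRS_RE = re.compile(
    r"BTRFS error \(device (?P<dev>[^)\s:]+)[^)]*\): bdev \S+ errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def latest_btrfs_counters(lines):
    """Return {device: {counter: value}}, keeping the last report seen per device."""
    latest = {}
    for line in lines:
        match = ERRS_RE.search(line)
        if match:
            fields = match.groupdict()
            dev = fields.pop("dev")
            latest[dev] = {name: int(value) for name, value in fields.items()}
    return latest


# Two of the lines from the log above (first and last counter reports).
sample = [
    "Jun 15 20:59:13 INSTALLATION-00 kernel: BTRFS error (device loop2): "
    "bdev /dev/loop2 errs: wr 1, rd 0, flush 0, corrupt 2, gen 0",
    "Jun 15 20:59:13 INSTALLATION-00 kernel: BTRFS error (device loop2): "
    "bdev /dev/loop2 errs: wr 10, rd 0, flush 0, corrupt 2, gen 0",
]
print(latest_btrfs_counters(sample))
```

In the log above, the write counter on loop2 climbs from 1 to 10 within a second while the corruption counter stays at 2, so the write failures follow the transaction abort rather than adding new corruption reports.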

 

I know I can just rebuild the cache drive and it will probably work for 2 or 3 weeks before failing again, but I am trying to find the root cause of this problem so I can fix it for good.


Also, I am not sure if this is related, but when I updated to 6.12 these errors appeared:

Jun 15 10:50:36 INSTALLATION-00 root: moving obsolete plugin gpustat.plg version 2022.11.30a to /boot/config/plugins-error
Jun 15 11:02:07 INSTALLATION-00 root: Fix Common Problems: Error: Multiple NICs on the same IPv4 network ** Ignored
Jun 15 11:57:36 INSTALLATION-00 kernel: CPU: 12 PID: 1239 Comm: kworker/u64:8 Tainted: P           O       6.1.33-Unraid #1
Jun 15 11:57:36 INSTALLATION-00 kernel: Call Trace:
Jun 15 14:02:45 INSTALLATION-00 kernel: CPU: 3 PID: 2199 Comm: smartctl_type Tainted: P        W  O       6.1.33-Unraid #1
Jun 15 14:02:45 INSTALLATION-00 kernel: Call Trace:
Jun 15 14:02:45 INSTALLATION-00 kernel: CPU: 3 PID: 2199 Comm: smartctl_type Tainted: P    B   W  O       6.1.33-Unraid #1
Jun 15 14:02:45 INSTALLATION-00 kernel: Call Trace:
Jun 15 14:03:46 INSTALLATION-00 kernel: CPU: 22 PID: 2987 Comm: smartctl_type Tainted: P    B   W  O       6.1.33-Unraid #1
Jun 15 14:03:46 INSTALLATION-00 kernel: Call Trace:
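
The `Tainted:` field in these lines is a set of single-letter flags, and reading them can help narrow things down. Below is a minimal sketch that decodes only the four letters seen here, with meanings taken from the Linux kernel's "Tainted kernels" documentation. `decode_taint`, `TAINT_FLAGS`, and the regular expression are illustrative names, and the parsing assumes the flag field is followed by the kernel version as in the lines above.

```python
import re

# Only the letters that appear in the log above; the kernel defines more.
TAINT_FLAGS = {
    "P": "proprietary module was loaded",
    "B": "bad page referenced or unexpected page flags",
    "W": "kernel issued a warning",
    "O": "externally built (out-of-tree) module was loaded",
}

# Captures runs of single capital letters separated by padding spaces,
# stopping at the kernel version string that follows (assumed layout).
TAINT_RE = re.compile(r"Tainted:\s+((?:[A-Z]\s+)+)")


def decode_taint(line):
    """Return a list of (letter, meaning) pairs for the taint flags in a log line."""
    match = TAINT_RE.search(line)
    if not match:
        return []
    letters = [c for c in match.group(1) if not c.isspace()]
    return [(c, TAINT_FLAGS.get(c, "not covered by this sketch")) for c in letters]


line = ("Jun 15 14:02:45 INSTALLATION-00 kernel: CPU: 3 PID: 2199 "
        "Comm: smartctl_type Tainted: P    B   W  O       6.1.33-Unraid #1")
for letter, meaning in decode_taint(line):
    print(letter, "-", meaning)
```

The `B` flag (bad page) first shows up at 14:02:45. On its own it does not prove anything, but it relates to memory pages, which is at least consistent with the memory test errors reported later in this thread.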

 

Let me know if you need more information. I am open to any solution.

Thanks, everyone,

 

installation-00-diagnostics-20230616-0742.zip


I never thought about it because it was working on my old PC, but I am already seeing errors from the test, so I guess I will need to shop for RAM now.

Would ECC memory help mitigate these issues, or is regular memory fine as long as I test it before using it?

 

Thanks for your help. I am learning every day,

22 minutes ago, nanouke said:

Would ECC memory help mitigate these issues, or is regular memory fine as long as I test it before using it?

ECC RAM (on a supported platform) should either correct the error and keep working normally, or halt the computer if it cannot correct it, thereby preventing any data corruption.
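
To illustrate the "correct it or halt" behaviour, here is a toy single-error-correcting, double-error-detecting (SECDED) code in Python. It is a sketch under stated assumptions, not how any particular memory controller is implemented: it protects only 4 data bits with an extended Hamming(8,4) code, whereas ECC DIMMs commonly protect 64 data bits with 8 check bits, and the exact code used is platform-specific. The function names `encode` and `decode` are my own.

```python
def encode(nibble):
    """Encode 4 data bits (0..15) as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indexes 1..7 are a Hamming(7,4)
    codeword with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # make the parity of all 8 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall_odd = sum(code) % 2 == 1

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd parity: treat as a single flipped bit and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped. The error
        # is detected but cannot be repaired (the "halt" case).
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)

one_flip = list(word)
one_flip[6] ^= 1          # simulate a single bad bit

two_flips = list(word)
two_flips[3] ^= 1         # simulate two bad bits in the same word
two_flips[5] ^= 1

print(decode(word))       # ('ok', 11)
print(decode(one_flip))   # ('corrected', 11)
print(decode(two_flips))  # ('uncorrectable', None)
```

Non-ECC memory has no check bits at all, so a flipped bit is simply returned as if it were valid data, which is how bad RAM can end up written to disk as filesystem corruption.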
