/var/log is getting full - BTRFS error (device nvme1n1p1)


Solved by JorgeB

Hello, 

 

I am having a cache pool problem, and the great Fix Common Problems plugin found that my /var/log/syslog is filling up.

 

The 2 NVMe drives are a few weeks old.  When I went to replace my old cache pool (2 SSDs), the replace-one-at-a-time method did not work, and I ended up wiping my cache pool and starting fresh.  This error first surfaced, from what I can tell, on 11/19/2022.

 

I found this post, but it does not show what to do if it's not a cabling problem (there are no cables for NVMe).

 

Here are the results of btrfs scrub status /mnt/cache:

UUID:             92b897fd-c2ab-43b4-8adc-7c53792bcd7a
        no stats available
Total to scrub:   280.83GiB
Rate:             0.00B/s
Error summary:    no errors found

 

Here are the results of btrfs dev stats -z /mnt/cache (a note on the -z flag follows the output):

[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0
[/dev/nvme2n2p1].write_io_errs    35301385
[/dev/nvme2n2p1].read_io_errs     2140829
[/dev/nvme2n2p1].flush_io_errs    1644451
[/dev/nvme2n2p1].corruption_errs  0
[/dev/nvme2n2p1].generation_errs  0
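
For reference, the -z flag prints the counters and then resets them to zero, so a follow-up run will show all zeros even if nothing has changed. A minimal sketch for checking without clearing, assuming the pool is mounted at /mnt/cache:

# print the current counters without resetting them
btrfs device stats /mnt/cache

# -z prints the counters and then zeroes them
btrfs device stats -z /mnt/cache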

 

Last few lines of syslog:

Nov 21 21:20:16 freddie kernel: btrfs_dev_stat_print_on_error: 42 callbacks suppressed
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250544, rd 2106592, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250545, rd 2106592, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250545, rd 2106593, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250546, rd 2106593, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250547, rd 2106593, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250548, rd 2106593, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250549, rd 2106593, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250550, rd 2106593, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250551, rd 2106593, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250551, rd 2106594, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:17 freddie kernel: BTRFS warning (device nvme1n1p1): lost page write due to IO error on /dev/nvme2n2p1 (-5)
Nov 21 21:20:17 freddie kernel: BTRFS error (device nvme1n1p1): error writing primary super block to device 2

 

unraid-diagnostics-20221121-2118.zip


I rebooted my server, and when it came back online, I lost my docker.img file and my /mnt/cache/system/libvirt/libvirt.img file.

 

The system said that the Docker service could not start and the VMs service could not start.

 

I zeroed the errors on the pool using btrfs dev stats -z /mnt/cache.

 

When I deleted the docker.img file and recreated it, the corruption_errs value started climbing from 0.  After I recovered my libvirt.img file and started the VMs again, corruption_errs continued to climb (a quick way to watch it is sketched after the stats below).

 

[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0
[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  103148
[/dev/nvme0n1p1].generation_errs  0
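
A rough way to watch those counters climb in real time, assuming the pool is mounted at /mnt/cache:

# refresh the full counter list every 10 seconds
watch -n 10 'btrfs device stats /mnt/cache'

# or show only the non-zero counters
watch -n 10 'btrfs device stats /mnt/cache | grep -v " 0$"'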

 

unraid-diagnostics-20221121-2238.zip

  • Solution

One of the devices dropped offline; you should run a scrub to bring it back up to date. Corruption errors are normal in this case, one for every synced block, and you can reset them when done.
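
From the command line that would be something like this, assuming the pool is mounted at /mnt/cache (the scrub can also be started from the pool's settings page in the GUI):

# start a scrub; it runs in the background by default
btrfs scrub start /mnt/cache

# check progress until it reports the scrub finished
btrfs scrub status /mnt/cache

# once the scrub completes, print and reset the error counters
btrfs device stats -z /mnt/cache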

 

Make sure the system share is set to COW; the old default was NOCOW, and for existing files that cannot be corrected in place.
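
To check from the command line, a sketch, assuming the share lives at /mnt/cache/system:

# a capital C in the attribute list means NOCOW is set on the directory
lsattr -d /mnt/cache/system

# clearing it on the directory only affects files created afterwards;
# existing files keep their NOCOW attribute, which is why old data
# cannot simply be corrected in place
chattr -C /mnt/cache/system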

 

The below might help with the dropping device.

 

On the main GUI page, click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off


Reboot and see if it helps.
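
After the reboot, the running kernel's command line can be checked to confirm the parameters took effect:

# both parameters should appear in the output
cat /proc/cmdline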


Hi @JorgeB

 

Thank you. Before today I was not aware of the utilities under "Cache Settings" to rebalance, scrub, etc.  I will have to research a bit to see when to use them.  All schedules for balance and scrub are disabled.

 

Regarding the System share, I have updated the "Enable Copy-on-write" setting to AUTO; it was on "NO".  What do you mean this cannot be corrected? Also, do you know what the recommendation is for "Use cache pool" for the System share?  Should it be "ONLY"?

 

Under "Syslinux Configuration" this is my new setting.  I believe I added "" for GPU passthrough some time ago.

 

unRAID OS Label (Syslinux Configuration)

kernel /bzimage
append pcie_acs_override=multifunction initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

 

Thank you, I will be rebooting shortly (waiting for the Parity-Check to finish).

14 minutes ago, MickMorley said:

What do you mean this cannot be corrected?

NOCOW disables checksums, so with raid1, if one of the devices drops offline and then comes back online, btrfs has no way of knowing which device has the latest and correct data; it will just read from both alternately, and since the dropped device has stale data, this can result in data corruption, e.g.:

 

8 hours ago, MickMorley said:

I lost my docker.img file and my /mnt/cache/system/libvirt/libvirt.img file.

 


Hi @JorgeB, I appreciate the explanations!

 

So far everything is working normally.  I recreated all of my dockers using the Previous Apps feature, selecting all at once.  I had a backup of the libvirt.img file.

 

I put in your recommendations and all is good.  The syslog looks OK.

 

A btrfs dev stats -c /mnt/cache reports no errors:

 

[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0
[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0

 

