MickMorley Posted November 22, 2022

Hello, I am having a cache pool problem. The great Fix Common Problems plugin found an error: my /var/log/syslog was filling up. The two NVMe drives are a few weeks old. When I went to replace my old cache pool (2 SSDs), the replace-one-at-a-time method did not work and I ended up wiping my cache pool and starting fresh. From what I can tell, this error first surfaced on 11/19/2022. I found this post, but it does not show what to do if it's not a cabling problem (there are no cables for NVMe).

Here are the results of btrfs scrub status /mnt/cache:

UUID: 92b897fd-c2ab-43b4-8adc-7c53792bcd7a
no stats available
Total to scrub: 280.83GiB
Rate: 0.00B/s
Error summary: no errors found

Here are the results of btrfs dev stats -z /mnt/cache:

[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].read_io_errs 0
[/dev/nvme1n1p1].flush_io_errs 0
[/dev/nvme1n1p1].corruption_errs 0
[/dev/nvme1n1p1].generation_errs 0
[/dev/nvme2n2p1].write_io_errs 35301385
[/dev/nvme2n2p1].read_io_errs 2140829
[/dev/nvme2n2p1].flush_io_errs 1644451
[/dev/nvme2n2p1].corruption_errs 0
[/dev/nvme2n2p1].generation_errs 0

Last few lines of syslog:

Nov 21 21:20:16 freddie kernel: btrfs_dev_stat_print_on_error: 42 callbacks suppressed
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250544, rd 2106592, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250545, rd 2106592, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250545, rd 2106593, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250546, rd 2106593, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250547, rd 2106593, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250548, rd 2106593, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250549, rd 2106593, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250550, rd 2106593, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250551, rd 2106593, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:16 freddie kernel: BTRFS error (device nvme1n1p1): bdev /dev/nvme2n2p1 errs: wr 35250551, rd 2106594, flush 1643107, corrupt 0, gen 0
Nov 21 21:20:17 freddie kernel: BTRFS warning (device nvme1n1p1): lost page write due to IO error on /dev/nvme2n2p1 (-5)
Nov 21 21:20:17 freddie kernel: BTRFS error (device nvme1n1p1): error writing primary super block to device 2

unraid-diagnostics-20221121-2118.zip
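Those per-device counters are the quickest health check here: any nonzero value points at the failing device. As a small self-contained sketch (the sample lines are copied from the output above; in practice you would pipe btrfs dev stats /mnt/cache in directly), the nonzero counters can be filtered out with awk:

```shell
# Sample of the `btrfs dev stats` output from above, embedded so the
# sketch runs without a live btrfs pool.
stats='[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].read_io_errs 0
[/dev/nvme2n2p1].write_io_errs 35301385
[/dev/nvme2n2p1].read_io_errs 2140829
[/dev/nvme2n2p1].corruption_errs 0'

# Print only the counters that are nonzero; anything printed here
# points at a sick device.
echo "$stats" | awk '$2 != 0 { print "nonzero:", $1, $2 }'
```

For the sample above this prints only the two nonzero nvme2n2p1 counters, which matches the pattern in the post: one member of the pool is clean and the other is racking up errors.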
MickMorley Posted November 22, 2022 (Author)

I rebooted my server, and when it came back online I had lost my docker.img file and my /mnt/cache/system/libvirt/libvirt.img file. The system said that neither the Docker service nor the VMs service could start. I zeroed the errors on the pool using btrfs dev stats -z /mnt/cache. When I deleted the docker.img file and recreated it, the corruption_errs value started climbing from 0. After I recovered my libvirt.img file and started the VMs again, corruption_errs continued to climb.

[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].read_io_errs 0
[/dev/nvme1n1p1].flush_io_errs 0
[/dev/nvme1n1p1].corruption_errs 0
[/dev/nvme1n1p1].generation_errs 0
[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 103148
[/dev/nvme0n1p1].generation_errs 0

unraid-diagnostics-20221121-2238.zip
JorgeB Posted November 22, 2022 (Solution)

One of the devices dropped offline. You should run a scrub to bring it back up to date; corruption errors are normal in this case, one for every block that gets synced, and you can reset them when done. Make sure the system share is set to COW; the old default was NOCOW, and that cannot be corrected.

The below might help with the dropping device. On the main GUI page click on the flash drive, scroll down to "Syslinux Configuration", make sure it's set to "menu view" (top right), and add this to your default boot option, after "append initrd=/bzroot":

nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

e.g.:

append initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Reboot and see if it helps.
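For reference, the scrub-and-reset sequence described above comes down to a few console commands. This is a sketch of the procedure, not something to run blindly; it assumes the pool is mounted at /mnt/cache as in this thread:

```shell
# Start a scrub on the pool. With btrfs raid1 this re-reads every block,
# repairs bad copies from the healthy device, and increments
# corruption_errs once per block it fixes (expected, per the post above).
btrfs scrub start /mnt/cache

# Check progress; repeat until it reports finished.
btrfs scrub status /mnt/cache

# Once the scrub completes cleanly, print and zero the error counters
# so any genuinely new errors stand out afterwards.
btrfs dev stats -z /mnt/cache
```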
MickMorley Posted November 22, 2022 (Author)

Hi @JorgeB, thank you. Before today I was not aware of the utilities under "Cache Settings" to rebalance, scrub, etc. I will have to research a bit to see when to use them. All schedules for balance and scrub are disabled.

Regarding the system share, I have updated the "Enable Copy-on-write" setting to AUTO; it was on "NO". What do you mean this cannot be corrected? Also, do you know what the recommendation is for "Use cache pool" for the system share? Should it be "ONLY"?

Under "Syslinux Configuration" this is my new setting. I believe I added "pcie_acs_override=multifunction" for GPU passthrough some time ago.

unRAID OS (Syslinux Configuration)
kernel /bzimage
append pcie_acs_override=multifunction initrd=/bzroot nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Thank you. I will be rebooting shortly (waiting for the parity check to finish).
JorgeB Posted November 22, 2022

14 minutes ago, MickMorley said: "What do you mean this cannot be corrected?"

NOCOW disables checksums, so with raid1, if one of the devices drops offline and then comes back online, btrfs has no way of knowing which device has the latest and correct data. It will just read from both alternately, and since the dropped device has stale data, this can result in data corruption, e.g.:

8 hours ago, MickMorley said: "I lost my docker.img file and my /mnt/cache/system/libvirt/libvirt.img file."
MickMorley Posted November 23, 2022 (Author)

Hi @JorgeB, I appreciate the explanations! So far everything is working normally. I recreated all of my Dockers using the Previous Apps feature, selecting all at once. I had a backup of the libvirt.img file. I put in your recommendations and all is good. Syslog looks OK, and btrfs dev stats -c /mnt/cache reports no errors:

[/dev/nvme1n1p1].write_io_errs 0
[/dev/nvme1n1p1].read_io_errs 0
[/dev/nvme1n1p1].flush_io_errs 0
[/dev/nvme1n1p1].corruption_errs 0
[/dev/nvme1n1p1].generation_errs 0
[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 0
[/dev/nvme0n1p1].generation_errs 0