Simom Posted October 26, 2022

Hey, I have a BTRFS-related problem and I am unsure what to do, so any help is appreciated!

I noticed some of my Docker containers weren't running properly, so I looked into the log and found these lines, repeated about every five seconds:

BTRFS error (device loop2): bdev /dev/loop2 errs: wr 39, rd 0, flush 0, corrupt 20124274, gen 0
Oct 26 03:10:23 Turing kernel: BTRFS warning (device loop2): csum hole found for disk bytenr range [412303360, 412307456)
Oct 26 03:10:23 Turing kernel: BTRFS warning (device loop2): csum failed root 1370 ino 1038 off 0 csum 0x42b31ff3 expected csum 0x00000000 mirror 1

Since my Docker image is stored on a BTRFS RAID1 cache pool called "cache" (two 1 TB NVMe drives), I am guessing this is somehow the root of the problem. I also ran "btrfs dev stats /mnt/cache", with the following result:

root@Turing:~# btrfs dev stats /mnt/cache
[/dev/nvme1n1p1].write_io_errs 364810
[/dev/nvme1n1p1].read_io_errs 272
[/dev/nvme1n1p1].flush_io_errs 32498
[/dev/nvme1n1p1].corruption_errs 115
[/dev/nvme1n1p1].generation_errs 0
[/dev/nvme0n1p1].write_io_errs 0
[/dev/nvme0n1p1].read_io_errs 0
[/dev/nvme0n1p1].flush_io_errs 0
[/dev/nvme0n1p1].corruption_errs 0
[/dev/nvme0n1p1].generation_errs 0

I read in the FAQ that all of these counters should be zero.
As these are NVMe drives, cables are out of the equation, so I just tried to start a scrub. It aborts immediately, and the following lines appear in the log:

Oct 26 03:45:18 Turing ool www[20751]: /usr/local/emhttp/plugins/dynamix/scripts/btrfs_scrub 'start' '/mnt/cache' '-r'
Oct 26 03:45:18 Turing kernel: BTRFS info (device nvme1n1p1): scrub: started on devid 2
Oct 26 03:45:18 Turing kernel: BTRFS info (device nvme1n1p1): scrub: not finished on devid 2 with status: -30
Oct 26 03:45:18 Turing kernel: BTRFS info (device nvme1n1p1): scrub: started on devid 3
Oct 26 03:45:18 Turing kernel: BTRFS info (device nvme1n1p1): scrub: not finished on devid 3 with status: -30

Unfortunately, I am at a dead end; if you have any ideas, please let me know! (I have also attached the diagnostics.)

turing-diagnostics-20221026-0310.zip
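A side note on the "status: -30" in the scrub log: kernel functions report failures as negative errno values, so the code can be decoded with a quick lookup. This is a sketch, not part of the original thread; the `errno_name` helper is a hypothetical name, and python3 is only used for its errno table:

```shell
# Decode a negative kernel status code into its errno name.
# (Hypothetical helper for illustration.)
errno_name() {
  python3 -c "import errno; print(errno.errorcode.get(abs(int('$1')), 'unknown'))"
}

errno_name -30   # EROFS: the scrub aborted because the filesystem had gone read-only
```

EROFS here is consistent with btrfs flipping the filesystem read-only after detecting the errors, which is why the scrub could not proceed until the underlying problem was addressed.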
Solution JorgeB Posted October 26, 2022

The write errors suggest one of the cache devices dropped offline at some point; use the script in the FAQ to monitor the pool going forward. But there are other issues:

write time tree block corruption detected

This usually indicates a RAM problem or other kernel memory corruption. You are running the RAM above the officially supported speeds, and that is known to corrupt data; see here and adjust accordingly.
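The monitoring idea boils down to periodically checking `btrfs dev stats` and alerting on any non-zero counter. A minimal sketch of that check (this is not the actual FAQ script; the pool path is an assumption, and any notification command would be added where the output is produced):

```shell
# Minimal sketch of a pool-error check, not the actual FAQ script.
# Prints only the btrfs device-stats counters that are non-zero.
check_btrfs_stats() {
  # expects `btrfs dev stats <pool>` output on stdin;
  # field 1 is the counter name, field 2 is its value
  awk '$2 != 0 { print $1, $2 }'
}

# Normally run from cron against the live pool, e.g.:
#   btrfs dev stats /mnt/cache | check_btrfs_stats
# Demo with two lines of captured output from the post above:
printf '%s\n' \
  '[/dev/nvme1n1p1].write_io_errs 364810' \
  '[/dev/nvme0n1p1].write_io_errs 0' | check_btrfs_stats
```

An empty result means the pool is healthy; any printed line is a counter worth investigating (and `btrfs dev stats -z` can reset the counters once the cause is fixed, so new errors stand out).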
Simom Posted October 26, 2022

Thank you for the reply! I rebooted and dropped down to the default DDR4 speeds, and the scrub finished with some errors that were corrected. I was thinking of switching to an Intel-based system; is there a similar list of supported RAM configurations, or is it mainly a Ryzen issue?
JorgeB Posted October 26, 2022

11 minutes ago, Simom said:

I was thinking of switching to an Intel-based system, is there a similar list for supported RAM configurations or is it mainly a Ryzen issue?

Intel usually specifies the max speed with all sockets populated, so for example CPUs that support DDR4-3200 support it with 4 DIMMs. For Alder Lake with DDR5 this is no longer true: the supported speed drops depending on how many DIMMs (and ranks per DIMM) are installed. I cannot find the equivalent table for Raptor Lake, but I would assume there will also be limitations, since the max speed with DDR5 is now 5600 MT/s.
Simom Posted October 26, 2022

Thanks for the help, I really appreciate it!