btrfs cache pool errors and a corrupted docker image...again?

Hollandex · November 6, 2022

My docker image got corrupted again. Probably due to cache drive errors but I'm still unsure. I recreated the docker image and, so far, everything is good.

I ran btrfs dev stats against my cache drives and both showed a high number of corruption errors. I have no idea how old they were, but I cleared the counters. Then I ran a scrub against the pool and got 5 checksum errors (all but 1 fixed, explained below).

UUID:             f0eb0645-ca4a-418e-bc12-95393fa57c50
Scrub started:    Sun Nov  6 12:42:38 2022
Status:           finished
Duration:         0:02:09
Total to scrub:   762.89GiB
Rate:             5.91GiB/s
Error summary:    csum=5
  Corrected:      0
  Uncorrectable:  0
  Unverified:     0

Also, corruption errors are starting to come back on one of the drives. Both drives had corruption errors before, but only one does now, since I cleared the counters an hour or so ago.

[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  2
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0

On to the questions!

First, I have a user script running hourly that should have warned me about the btrfs errors, but it didn't. Is there something wrong with it?

#!/bin/bash
if mountpoint -q /mnt/cache; then
    btrfs dev stats /mnt/cache
    if [[ $? -ne 0 ]]; then
        /usr/local/emhttp/webGui/scripts/notify -i warning -s "ERRORS on cache pool"
    fi
fi
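
One likely reason the script stays silent: as far as I know, plain btrfs dev stats exits with 0 whenever the command itself succeeds, even if the counters are non-zero, so the $? test never fires. A minimal sketch of one way around that is to inspect the counter values directly. The helper names stats_have_errors and check_pool are made up for illustration, and the mount point and notify path are simply copied from the script above. Recent btrfs-progs also document a -c/--check option for dev stats that returns non-zero when any counter is non-zero, which may be simpler if your version has it.

```shell
#!/bin/bash
# Hypothetical helper: reads `btrfs dev stats` output on stdin and succeeds
# (exit 0) if any counter is non-zero, so it can be tried without a real pool.
stats_have_errors() {
    awk 'NF == 2 && $2 + 0 > 0 { found = 1 } END { exit (found ? 0 : 1) }'
}

# Hypothetical wrapper: checks one pool and sends an Unraid notification.
# The notify path and message are taken from the original script.
check_pool() {
    local pool="$1"
    # Skip quietly if the pool is not mounted.
    mountpoint -q "$pool" || return 0
    if btrfs dev stats "$pool" | stats_have_errors; then
        /usr/local/emhttp/webGui/scripts/notify -i warning -s "ERRORS on cache pool"
    fi
}

check_pool /mnt/cache
```

Since the counters only ever increase until they are reset, this would keep notifying every hour until they are zeroed again.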

Second question: how do I fix the checksum error? The system log named the corrupted files. They weren't critical, so I deleted them and scrubbed again. I still get 1 checksum error, and nothing in the syslog points to a file. Any idea what this could be and/or how to fix it?

Nov  6 13:38:26 Sanctuary  ool www[7491]: /usr/local/emhttp/plugins/dynamix/scripts/btrfs_scrub 'start' '/mnt/cache' '-r'
Nov  6 13:38:26 Sanctuary kernel: BTRFS info (device nvme0n1p1): scrub: started on devid 2
Nov  6 13:38:26 Sanctuary kernel: BTRFS info (device nvme0n1p1): scrub: started on devid 1
Nov  6 13:38:32 Sanctuary kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Nov  6 13:39:04 Sanctuary kernel: BTRFS info (device nvme0n1p1): scrub: finished on devid 2 with status: 0
Nov  6 13:39:05 Sanctuary kernel: BTRFS info (device nvme0n1p1): scrub: finished on devid 1 with status: 0

And last question: is there any way to diagnose the corruption issues? This is the third time the docker image has been corrupted, and every time there have been some cache drive errors, too. One time, the drives were completely borked and I had to format them. I ran memtest for over 12 hours with no errors.

I doubt the drives are going bad; they were both bought a couple of years ago when I built this system. I know btrfs can have issues if the RAM is overclocked on Ryzen systems. I do have XMP turned on, but I'm not overclocking the RAM beyond its rated speed, and this is an Intel system. No idea if any of that matters.

Edited November 7, 2022 by Hollandex

JorgeB · November 7, 2022

Please post the diagnostics.
