Jellepepe Posted May 7

Hi all, thanks for taking the time. I noticed some errors in a Docker container this morning and went to restart it; on trying to start it I was greeted with an 'Execution error 403'. Since then, all Docker containers have failed to start, and I seem to have read-only access to my cache pool.

The following is visible in the logs:

May 7 01:39:34 Tower kernel: BTRFS error (device loop2: state EAL): bdev /dev/loop2 errs: wr 988, rd 0, flush 0, corrupt 2, gen 0
May 7 01:39:39 Tower kernel: loop: Write error at byte offset 3140636672, length 4096.
May 7 01:39:39 Tower kernel: I/O error, dev loop2, sector 6134056 op 0x1:(WRITE) flags 0x100000 phys_seg 4 prio class 2
May 7 01:39:39 Tower kernel: BTRFS error (device loop2: state EAL): bdev /dev/loop2 errs: wr 989, rd 0, flush 0, corrupt 2, gen 0
May 7 01:39:44 Tower kernel: loop: Write error at byte offset 3140636672, length 4096.
May 7 01:39:44 Tower kernel: I/O error, dev loop2, sector 6134056 op 0x1:(WRITE) flags 0x100000 phys_seg 1 prio class 2
May 7 01:39:44 Tower kernel: BTRFS error (device loop2: state EAL): bdev /dev/loop2 errs: wr 990, rd 0, flush 0, corrupt 2, gen 0
May 7 01:39:45 Tower kernel: loop: Write error at byte offset 3140636672, length 4096.
May 7 01:39:45 Tower kernel: I/O error, dev loop2, sector 6134056 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
May 7 01:39:45 Tower kernel: BTRFS error (device loop2: state EAL): bdev /dev/loop2 errs: wr 991, rd 0, flush 0, corrupt 2, gen 0

I have seen no other errors or warnings. Running Unraid 6.12.4.

My main array consists of 3x 20TB Exos drives: 2x data, 1x parity. The cache pool with the issues is 2x 2TB Lexar NM790 in a btrfs RAID 1. There is also a secondary cache pool with a single SATA SSD; this seems to be fine (it's also empty). All disks have plenty of free space.

My understanding is that this is some sort of filesystem failure on the 2x 2TB cache pool? How would I go about recovering from here?
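For reference, the `errs: wr …, rd …, flush …, corrupt …, gen …` fields in those BTRFS lines appear to be cumulative per-device error counters (on a mounted pool, `btrfs device stats` reports the same numbers). A quick throwaway sketch for pulling them out of a saved syslog line, using the last line above as the sample:

```shell
# Sample line copied from the syslog above.
line='May 7 01:39:45 Tower kernel: BTRFS error (device loop2: state EAL): bdev /dev/loop2 errs: wr 991, rd 0, flush 0, corrupt 2, gen 0'

# Extract the cumulative write-error and corruption counters.
wr=$(printf '%s\n' "$line" | sed -n 's/.*errs: wr \([0-9]*\),.*/\1/p')
corrupt=$(printf '%s\n' "$line" | sed -n 's/.*corrupt \([0-9]*\),.*/\1/p')
echo "write errors: $wr, corruption events: $corrupt"
# -> write errors: 991, corruption events: 2
```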
I've already started copying all non-replicated data off the cache disks; nothing is critical, of course. Appreciate the help, as I'm a bit lost on what the actual issue is or the way forward.
Jellepepe Posted May 7 (Author)

I think this should be it? tower-diagnostics-20240507-1217.zip
JorgeB Posted May 7 (Solution)

May 6 22:48:27 Tower kernel: BTRFS error (device nvme0n1p1): block=1336104894464 write time tree block corruption detected
May 6 22:48:27 Tower kernel: BTRFS: error (device nvme0n1p1) in btrfs_commit_transaction:2466: errno=-5 IO failure (Error while writing out transaction)
May 6 22:48:27 Tower kernel: BTRFS info (device nvme0n1p1: state E): forced readonly

The pool filesystem went read-only. This can be caused by a hardware issue like bad RAM, or by a filesystem problem. I would start by running memtest; if nothing is found, back up and recreate the pool, using btrfs again or trying ZFS.
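"Back up and recreate" just means copying everything off the cache shares before reformatting the pool. A minimal self-contained sketch of the copy-and-verify step; the temp directories are hypothetical stand-ins for real mount points such as /mnt/cache and /mnt/disk1/backup, so adjust the paths to your own shares:

```shell
# Temp dirs stand in for the real mount points so this is safe to run anywhere;
# on the server they would be something like /mnt/cache and /mnt/disk1/backup.
src=$(mktemp -d)
dst=$(mktemp -d)
echo "appdata contents" > "$src/example.txt"

# cp -a preserves ownership, permissions and timestamps; on a live server
# 'rsync -a --progress' is a more common choice for large copies.
cp -a "$src/." "$dst/"

# Verify the copy before wiping and recreating the pool.
diff -r "$src" "$dst" && echo "backup verified"
```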
Jellepepe Posted May 7 (Author)

7 minutes ago, JorgeB said:
if nothing is found, back up and recreate the pool, using btrfs again or trying ZFS.

I appreciate it. Is it OK if I leave this marked unsolved until that is done, in case I encounter any other issues during the process? Are there any downsides to switching to ZFS I should keep in mind?
JorgeB Posted May 7

33 minutes ago, Jellepepe said:
Are there any downsides to switching to ZFS I should keep in mind?

Not usually, and ZFS tends to be more robust.
Jellepepe Posted May 7 (Author)

10 hours ago, JorgeB said:
I would start by running memtest; if nothing is found, back up and recreate the pool, using btrfs again or trying ZFS.

Quick update: I finished the backups and started running memtest; it's currently 4 passes deep, but I will leave it running overnight. Assuming it doesn't find anything, is there any way to identify what could have caused the corruption, or general best practices to keep in mind? I've been super happy with Unraid, running for almost exactly 6 months without any issues, but this is somewhat worrying, as I never had any corruption issues on my old Hyper-V setup. Wondering if I might be doing something wrong 😅
JorgeB Posted May 8

While this error has usually meant a RAM problem, some users see it after upgrading from v6.11 to v6.12, so it is possibly a kernel issue. In those cases, re-formatting the btrfs filesystem sometimes solves the issue; for others it doesn't, and then it's best to use ZFS. But if you also get problems with ZFS, then it's likely a hardware issue.
Jellepepe Posted May 8 (Author)

14 hours ago, JorgeB said:
While this error has usually meant a RAM problem, some users see it after upgrading from v6.11 to v6.12, so it is possibly a kernel issue.

Alright: after running memtest overnight with no errors, I rebooted Unraid and found the btrfs cache pool mounted read-write again with 0 errors. A check and scrub also completed with no failures, same for the docker disk; quite odd. Either way, I reformatted it as a ZFS pool, and that all went without issues. I ended up updating to 6.12.10 while I was at it too.

I did notice a few call traces that seemed to be related to macvlan networking. I'm not sure why macvlan was enabled, since I installed on 6.12.4 originally and I read it should default to ipvlan; either way, I have switched to ipvlan now, which does not seem to have affected anything. I wonder if a crash/error related to that might have triggered a btrfs failure state even though no actual filesystem issues were present?

For now I will keep an eye on things, but it seems to be fine with no issues. Marked as solved.
JorgeB Posted May 9

10 hours ago, Jellepepe said:
I wonder if a crash/error related to that might have triggered a btrfs failure state

It should be unrelated.