unRAID unstable a few hours after upgrading to 6.12.6, BTRFS error with cache pool


Solved by snowboardjoe

I upgraded from 6.12.4 to 6.12.6 this morning and everything seemed to be running fine. At 6:36pm this evening everything became unstable with Docker. Plex stopping was the first sign, and I started digging into the issue. These were the very first events at that time in my syslog:

 

[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 0 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 1 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 0 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 0 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 0 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 2 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 0 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state A) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1: state A): parent transid verify failed on logical 2151677952 mirror 0 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1: state EA): parent transid verify failed on logical 2151677952 mirror 0 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS info (device sdg1: state EA): forced readonly
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1: state EA): parent transid verify failed on logical 2151677952 mirror 1 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state EA) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state EA) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state EA) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state EA) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state EA) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state EA) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state EA) in btrfs_finish_ordered_io:3319: errno=-5 IO failure

 

I was able to reboot and the system came back online. However, I immediately started seeing lots of BTRFS errors from /dev/sdg1. That drive is one of two in my cache pool. According to the dashboard, everything is fine. I'm still going through the logs to see what is happening; I've not dealt with BTRFS errors like this before.

 

Going to get the basic array stable again with Docker disabled. I'm thinking this is just a timing coincidence, with an SSD failing around the same time I chose to upgrade? Open to suggestions on how to diagnose and restore services here.


I tested the RAM with two sticks and it passed. Ran the check with the other two sticks; that passed too. When all four sticks are installed, memtest fails. Ugh! I just looked up the RAM compatibility list and discovered that four sticks of RAM are not supported. I don't know how it worked for almost a year like this.

 

New game plan:

  1. Let memtest run through the night with the existing ram and verify no errors.
  2. Boot unraid in safe mode.
  3. Run btrfs scrub against cache pool and check results.
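For step 3, a scrub can be started and monitored roughly like this (a sketch; /mnt/cache is the pool mount point on my system, yours may differ):

```shell
# Start a scrub on the mounted pool and block until it finishes (-B).
# Scrub reads every block and verifies checksums against both mirrors.
btrfs scrub start -B /mnt/cache

# Alternatively, run it in the background and poll progress:
btrfs scrub start /mnt/cache
btrfs scrub status /mnt/cache

# Per-device error counters (read/write/corruption) for the pool:
btrfs device stats /mnt/cache
```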

Open to other suggestions in the hope I can recover the data. Since I'm not well versed with btrfs, I want to make sure I'm following best practices here to prevent further damage. Again, my logs only reported a problem with /dev/sdg1, so I'm hoping the damage is limited to that device and I can use the other mirror to get things back in shape.


Memtest ran all night, nine passes, and came back clean. So I'm now operating with 50% of the RAM I had before this whole mess started. Ran a scrub and everything came back with no errors. I rebooted again, but Docker won't come up because the filesystem is in RO mode:

 

Dec 18 07:19:31 laffy emhttpd: shcmd (91): /usr/local/sbin/mount_image '/mnt/cache/appdata/docker.img' /var/lib/docker 32
Dec 18 07:19:31 laffy root: truncate: cannot open '/mnt/cache/appdata/docker.img' for writing: Read-only file system
Dec 18 07:19:31 laffy kernel: loop2: detected capacity change from 0 to 67108864
Dec 18 07:19:31 laffy kernel: BTRFS: device fsid 51f116da-f032-424a-9b4c-da5dd26d4a6f devid 1 transid 4251808 /dev/loop2 scanned by mount (10258)
Dec 18 07:19:31 laffy kernel: BTRFS info (device loop2): using crc32c (crc32c-intel) checksum algorithm
Dec 18 07:19:31 laffy kernel: BTRFS info (device loop2): using free space tree
Dec 18 07:19:31 laffy kernel: BTRFS info (device loop2): enabling ssd optimizations
Dec 18 07:19:31 laffy kernel: BTRFS info (device loop2): start tree-log replay
Dec 18 07:19:31 laffy kernel: BTRFS warning (device loop2): log replay required on RO media
Dec 18 07:19:31 laffy root: mount: /var/lib/docker: can't read superblock on /dev/loop2.
Dec 18 07:19:31 laffy root:        dmesg(1) may have more information after failed mount system call.
Dec 18 07:19:31 laffy root: mount error
Dec 18 07:19:31 laffy kernel: BTRFS error (device loop2): open_ctree failed
Dec 18 07:19:31 laffy emhttpd: shcmd (91): exit status: 1

 

Looking at dmesg:

 

[Mon Dec 18 07:19:25 2023] BTRFS info (device sdg1): using crc32c (crc32c-intel) checksum algorithm
[Mon Dec 18 07:19:25 2023] BTRFS info (device sdg1): using free space tree
[Mon Dec 18 07:19:25 2023] BTRFS info (device sdg1): enabling ssd optimizations
[Mon Dec 18 07:19:25 2023] BTRFS info (device sdg1): start tree-log replay
[Mon Dec 18 07:19:25 2023] BTRFS info (device sdg1): checking UUID tree
[Mon Dec 18 07:19:25 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 1 wanted 14578466 found 14578474
[Mon Dec 18 07:19:25 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 2 wanted 14578466 found 14578474
[Mon Dec 18 07:19:25 2023] BTRFS: error (device sdg1: state A) in do_free_extent_accounting:2845: errno=-5 IO failure
[Mon Dec 18 07:19:25 2023] BTRFS info (device sdg1: state EA): forced readonly
[Mon Dec 18 07:19:25 2023] BTRFS error (device sdg1: state EA): failed to run delayed ref for logical 2465001472 num_bytes 12288 type 178 action 2 ref_mod 1: -5
[Mon Dec 18 07:19:25 2023] BTRFS: error (device sdg1: state EA) in btrfs_run_delayed_refs:2149: errno=-5 IO failure
[Mon Dec 18 07:19:25 2023] BTRFS info (device sdg1: state EMA): turning on async discard
[Mon Dec 18 07:19:25 2023] BTRFS error (device sdg1: state EMA): Remounting read-write after error is not allowed
[Mon Dec 18 07:19:30 2023] loop2: detected capacity change from 0 to 67108864
[Mon Dec 18 07:19:30 2023] BTRFS: device fsid 51f116da-f032-424a-9b4c-da5dd26d4a6f devid 1 transid 4251808 /dev/loop2 scanned by mount (10258)
[Mon Dec 18 07:19:30 2023] BTRFS info (device loop2): using crc32c (crc32c-intel) checksum algorithm
[Mon Dec 18 07:19:30 2023] BTRFS info (device loop2): using free space tree
[Mon Dec 18 07:19:30 2023] BTRFS info (device loop2): enabling ssd optimizations
[Mon Dec 18 07:19:30 2023] BTRFS info (device loop2): start tree-log replay
[Mon Dec 18 07:19:30 2023] BTRFS warning (device loop2): log replay required on RO media
[Mon Dec 18 07:19:30 2023] BTRFS error (device loop2): open_ctree failed

 

So it looks like some replay of a journal is needed here? Researching that error. Attaching diagnostics in their current state.
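From what I can tell, the "log replay required on RO media" warning means the image has an unreplayed write-ahead log; btrfs won't mount it while the backing filesystem is read-only unless it's told to skip the replay. For just inspecting the image, something like this should work (a sketch; the loop device is whatever losetup assigns, and /mnt/docker-inspect is a hypothetical mount point):

```shell
# Attach docker.img to a free loop device; prints e.g. /dev/loop2
losetup -f --show /mnt/cache/appdata/docker.img

# Mount read-only and skip tree-log replay (nologreplay requires ro):
mount -o ro,nologreplay /dev/loop2 /mnt/docker-inspect
```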

laffy-diagnostics-20231218-0728.zip


/mnt/cache is still in a RO state:
 

root@laffy:/var/lib/btrfs# btrfs scrub start -B /mnt/cache 
ERROR: scrubbing /mnt/cache failed for device id 1: ret=-1, errno=30 (Read-only file system)
ERROR: scrubbing /mnt/cache failed for device id 2: ret=-1, errno=30 (Read-only file system)
scrub canceled for 8bdd3d07-cbb0-4d53-a9f2-da67099186ea
Scrub started:    Mon Dec 18 07:40:35 2023
Status:           aborted
Duration:         0:00:00
Total to scrub:   0.00B
Rate:             0.00B/s
Error summary:    no errors found

 

I don't know how to clear this condition.
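As far as I can tell, once btrfs forces a filesystem readonly after an error, the kernel refuses to flip it back (the dmesg line above literally says "Remounting read-write after error is not allowed"). The only way back to RW is a full unmount and a fresh mount, which in Unraid terms means stopping and restarting the array. Roughly (a sketch against my device names):

```shell
# A rw remount is rejected after a forced-readonly error:
mount -o remount,rw /mnt/cache   # expected to fail on an errored fs

# A clean unmount and fresh mount is required instead
# (on Unraid: stop the array, then start it again):
umount /mnt/cache
mount /dev/sdg1 /mnt/cache
```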


Making a copy of all the data I can to a separate disk. Turns out I do not have full backups of everything as I thought. Then I'm going to take a look at this item I found and see if there is a way to recover from an older metadata root (usebackuproot).

 

https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/FAQ.html#How_do_I_recover_from_a_.22parent_transid_verify_failed.22_error.3F

 

I fear this entire filesystem may be unrecoverable, but I'm still fighting a little longer.
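Per the FAQ linked above, there are two low-risk things to try before any repair that writes to the disk (a sketch; /mnt/recovery and the external target path are hypothetical, device name is from my system):

```shell
# Try mounting read-only from one of the older tree roots kept in the
# superblock. Harmless to attempt; it fails if no usable backup root exists.
mkdir -p /mnt/recovery
mount -o ro,usebackuproot /dev/sdg1 /mnt/recovery

# If even that fails, btrfs restore copies files out without mounting
# or writing anything to the damaged filesystem:
btrfs restore /dev/sdg1 /mnt/disks/external/cache-rescue
```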

  • Solution

Went through several other steps but did not make any progress. Ran btrfs restore to an external drive and that seemed to grab everything just fine. Then, as a last-ditch effort, I ran the repair command:

 

root@laffy:/etc# btrfs check --repair --force /dev/sdg1
enabling repair mode
Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on /dev/sdg1
UUID: 8bdd3d07-cbb0-4d53-a9f2-da67099186ea
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
parent transid verify failed on 2151677952 wanted 14578466 found 14578474
parent transid verify failed on 2151677952 wanted 14578466 found 14578474
parent transid verify failed on 2151677952 wanted 14578466 found 14578474
Ignoring transid failure
super bytes used 258022117376 mismatches actual used 258022100992
parent transid verify failed on 2151677952 wanted 14578466 found 14578474
Ignoring transid failure
No device size related problem found
[3/7] checking free space tree
[4/7] checking fs roots
parent transid verify failed on 2151677952 wanted 14578466 found 14578474
Ignoring transid failure

...identical responses deleted for brevity...

[5/7] checking only csums items (without verifying data)
parent transid verify failed on 2151677952 wanted 14578466 found 14578474
Ignoring transid failure
[6/7] checking root refs
Recowing metadata block 2151677952
parent transid verify failed on 2151677952 wanted 14578466 found 14578474
Ignoring transid failure
[7/7] checking quota groups skipped (not enabled on this FS)
found 516044218368 bytes used, no error found
total csum bytes: 415007296
total tree bytes: 1490927616
total fs tree bytes: 743276544
total extent tree bytes: 180305920
btree space waste bytes: 374203586
file data blocks allocated: 943655108608
 referenced 490961231872

 

Ran the check one more time and it came back clean. Remounted the filesystem as RW and it remained stable (previously it would go RO within about 30 seconds). Did a full reboot and all services are back online, including all containers. Watching for stability at this point.
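For anyone landing here later, these are the read-only sanity checks I'm using to watch for a recurrence (a sketch; device and mount point are from my system):

```shell
# Read-only consistency check; run against the raw device while it is
# unmounted (array stopped) so the results are meaningful:
btrfs check --readonly /dev/sdg1

# While mounted: scrub the pool and check per-device error counters.
btrfs scrub start -B /mnt/cache
btrfs device stats /mnt/cache
```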

