snowboardjoe Posted December 18, 2023

I upgraded from 6.12.4 to 6.12.6 this morning and everything seemed to be running fine. At 6:36pm this evening everything became unstable with Docker. Plex stopping was the first sign, and I started digging into the issue. These were the very first events at that time in my syslog:

[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 0 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 1 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 0 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 0 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 0 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 2 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 0 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state A) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1: state A): parent transid verify failed on logical 2151677952 mirror 0 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1: state EA): parent transid verify failed on logical 2151677952 mirror 0 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS info (device sdg1: state EA): forced readonly
[Sun Dec 17 18:36:57 2023] BTRFS error (device sdg1: state EA): parent transid verify failed on logical 2151677952 mirror 1 wanted 14578466 found 14578474
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state EA) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state EA) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state EA) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state EA) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state EA) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state EA) in btrfs_finish_ordered_io:3319: errno=-5 IO failure
[Sun Dec 17 18:36:57 2023] BTRFS: error (device sdg1: state EA) in btrfs_finish_ordered_io:3319: errno=-5 IO failure

I was able to reboot and the system came back online. However, I immediately started seeing lots of BTRFS errors on /dev/sdg1. That drive is one of two drives in my cache pool. According to the dashboard, everything is fine. Still going through the logs to see what is going on. I've not dealt with BTRFS errors like this before. For now I'm going to get the basic array stable again with Docker disabled. I'm thinking this is just a timing coincidence, an SSD failing around the same time I chose to upgrade today? Open to suggestions on how to diagnose and restore services here.
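(For anyone wanting to run the same triage, this is the kind of first pass I'd start with; device names match my setup, adjust for yours:)

# Per-device BTRFS error counters for the pool (read/write errors, corruption, generation mismatches)
btrfs device stats /mnt/cache

# SMART health of the suspect SSD; look at reallocated sectors, CRC errors and wear level
smartctl -a /dev/sdg

# Recent kernel messages filtered to btrfs
dmesg | grep -i btrfs | tail -50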
snowboardjoe (Author) Posted December 18, 2023

Ran memtest and it failed. Trying to isolate which stick is bad. All of the earlier errors point at sdg. My theory is that only that drive's data is corrupted and I should remove it from the pool; I should then be able to restore everything. Does that sound like a viable plan?
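(One way to test that theory before pulling the drive, as a sketch assuming the pool is mounted at /mnt/cache, is to check whether the error counters are confined to one device. One caveat from the log above: the transid failures were reported for mirrors 0, 1 and 2, which could mean the bad metadata was written to both pool members, not just sdg.)

# Show error counters per pool member; -c makes the exit status non-zero if any counter is non-zero
btrfs device stats -c /mnt/cache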
snowboardjoe (Author) Posted December 18, 2023

I tested RAM with two sticks and it passed. Ran the check with the other two sticks; that passed too. When all four sticks are installed, memtest fails. Ugh! Just looked up the RAM compatibility list and discovered that four sticks of RAM are not supported. I don't know how it worked for almost a year like this.

New game plan:

1. Let memtest run through the night with the existing RAM and verify no errors.
2. Boot Unraid in safe mode.
3. Run a btrfs scrub against the cache pool and check the results (a sketch of the scrub commands follows below).

Open to other suggestions in the hope I can recover data, but since I'm not well versed in btrfs, I want to make sure I'm following best practices here to prevent further damage. Again, my logs only reported a problem with /dev/sdg1, so I'm hoping the damage is limited to that drive and I can use the other mirror to get things back in shape.
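(A sketch of step 3, assuming the pool mounts at /mnt/cache:)

# Start a scrub and wait for it to finish (-B); it re-reads all data and metadata against
# checksums, and on a two-device raid1 pool repairs bad blocks from the good mirror where possible
btrfs scrub start -B /mnt/cache

# If started without -B, progress and results can be checked with:
btrfs scrub status /mnt/cache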
snowboardjoe (Author) Posted December 18, 2023

Memtest ran all night, nine passes, and was clean. So for now I'm operating with half the RAM I had before this whole mess started. Ran a scrub and everything came back with no errors. I rebooted again, but Docker won't come up because the cache filesystem is in read-only mode:

Dec 18 07:19:31 laffy emhttpd: shcmd (91): /usr/local/sbin/mount_image '/mnt/cache/appdata/docker.img' /var/lib/docker 32
Dec 18 07:19:31 laffy root: truncate: cannot open '/mnt/cache/appdata/docker.img' for writing: Read-only file system
Dec 18 07:19:31 laffy kernel: loop2: detected capacity change from 0 to 67108864
Dec 18 07:19:31 laffy kernel: BTRFS: device fsid 51f116da-f032-424a-9b4c-da5dd26d4a6f devid 1 transid 4251808 /dev/loop2 scanned by mount (10258)
Dec 18 07:19:31 laffy kernel: BTRFS info (device loop2): using crc32c (crc32c-intel) checksum algorithm
Dec 18 07:19:31 laffy kernel: BTRFS info (device loop2): using free space tree
Dec 18 07:19:31 laffy kernel: BTRFS info (device loop2): enabling ssd optimizations
Dec 18 07:19:31 laffy kernel: BTRFS info (device loop2): start tree-log replay
Dec 18 07:19:31 laffy kernel: BTRFS warning (device loop2): log replay required on RO media
Dec 18 07:19:31 laffy root: mount: /var/lib/docker: can't read superblock on /dev/loop2.
Dec 18 07:19:31 laffy root: dmesg(1) may have more information after failed mount system call.
Dec 18 07:19:31 laffy root: mount error
Dec 18 07:19:31 laffy kernel: BTRFS error (device loop2): open_ctree failed
Dec 18 07:19:31 laffy emhttpd: shcmd (91): exit status: 1

Looking at dmesg:

[Mon Dec 18 07:19:25 2023] BTRFS info (device sdg1): using crc32c (crc32c-intel) checksum algorithm
[Mon Dec 18 07:19:25 2023] BTRFS info (device sdg1): using free space tree
[Mon Dec 18 07:19:25 2023] BTRFS info (device sdg1): enabling ssd optimizations
[Mon Dec 18 07:19:25 2023] BTRFS info (device sdg1): start tree-log replay
[Mon Dec 18 07:19:25 2023] BTRFS info (device sdg1): checking UUID tree
[Mon Dec 18 07:19:25 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 1 wanted 14578466 found 14578474
[Mon Dec 18 07:19:25 2023] BTRFS error (device sdg1): parent transid verify failed on logical 2151677952 mirror 2 wanted 14578466 found 14578474
[Mon Dec 18 07:19:25 2023] BTRFS: error (device sdg1: state A) in do_free_extent_accounting:2845: errno=-5 IO failure
[Mon Dec 18 07:19:25 2023] BTRFS info (device sdg1: state EA): forced readonly
[Mon Dec 18 07:19:25 2023] BTRFS error (device sdg1: state EA): failed to run delayed ref for logical 2465001472 num_bytes 12288 type 178 action 2 ref_mod 1: -5
[Mon Dec 18 07:19:25 2023] BTRFS: error (device sdg1: state EA) in btrfs_run_delayed_refs:2149: errno=-5 IO failure
[Mon Dec 18 07:19:25 2023] BTRFS info (device sdg1: state EMA): turning on async discard
[Mon Dec 18 07:19:25 2023] BTRFS error (device sdg1: state EMA): Remounting read-write after error is not allowed
[Mon Dec 18 07:19:30 2023] loop2: detected capacity change from 0 to 67108864
[Mon Dec 18 07:19:30 2023] BTRFS: device fsid 51f116da-f032-424a-9b4c-da5dd26d4a6f devid 1 transid 4251808 /dev/loop2 scanned by mount (10258)
[Mon Dec 18 07:19:30 2023] BTRFS info (device loop2): using crc32c (crc32c-intel) checksum algorithm
[Mon Dec 18 07:19:30 2023] BTRFS info (device loop2): using free space tree
[Mon Dec 18 07:19:30 2023] BTRFS info (device loop2): enabling ssd optimizations
[Mon Dec 18 07:19:30 2023] BTRFS info (device loop2): start tree-log replay
[Mon Dec 18 07:19:30 2023] BTRFS warning (device loop2): log replay required on RO media
[Mon Dec 18 07:19:30 2023] BTRFS error (device loop2): open_ctree failed

So it looks like some replay of a journal is needed here? Researching that error. Attaching diagnostics in current state.

laffy-diagnostics-20231218-0728.zip
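(Side note while researching: the "log replay required on RO media" warning means docker.img has an unreplayed btrfs tree log, and replaying it needs write access, which it can't get because the backing file sits on a filesystem that was forced read-only. So fixing the pool's read-only state comes first. For anyone who just wants to peek inside the image without replaying the log, something like this should work; /mnt/tmp-docker is a placeholder mountpoint:)

# Mount the image read-only, skipping tree-log replay entirely (nologreplay requires ro)
mkdir -p /mnt/tmp-docker
mount -o loop,ro,nologreplay /mnt/cache/appdata/docker.img /mnt/tmp-docker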
itimpi Posted December 18, 2023

The loop2 device is the docker.img file, and it looks like that is corrupt. I would suggest recreating this file and re-installing your containers, with previous settings intact, via Apps -> Previous Apps. Before doing so, run a scrub on the cache pool to check that there is no corruption at that level, as that would need fixing first.
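(For reference, the file-level part of that procedure amounts to something like this; the path matches the default location shown in the logs above, and the surrounding steps are done in the web GUI:)

# 1. Settings -> Docker -> set "Enable Docker" to No (stops the service and releases loop2)
# 2. Delete the corrupt image; only the containers' writable layers live in it,
#    appdata and the container templates are untouched
rm /mnt/cache/appdata/docker.img
# 3. Settings -> Docker -> set "Enable Docker" back to Yes (recreates an empty image)
# 4. Apps -> Previous Apps -> reinstall the containers with their saved settings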
snowboardjoe (Author) Posted December 18, 2023

/mnt/cache is still in a read-only state:

root@laffy:/var/lib/btrfs# btrfs scrub start -B /mnt/cache
ERROR: scrubbing /mnt/cache failed for device id 1: ret=-1, errno=30 (Read-only file system)
ERROR: scrubbing /mnt/cache failed for device id 2: ret=-1, errno=30 (Read-only file system)
scrub canceled for 8bdd3d07-cbb0-4d53-a9f2-da67099186ea
Scrub started:    Mon Dec 18 07:40:35 2023
Status:           aborted
Duration:         0:00:00
Total to scrub:   0.00B
Rate:             0.00B/s
Error summary:    no errors found

I don't know how to clear this condition.
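(From the dmesg output a couple of posts up, the kernel already refused this: "Remounting read-write after error is not allowed". Once btrfs forces a filesystem read-only after an error, the flag can't be cleared with a remount; it takes a full unmount and mount cycle, which on Unraid means stopping and restarting the array. In other words, this will fail with that same error:)

# Rejected on a btrfs filesystem that went read-only due to an error
mount -o remount,rw /mnt/cache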
snowboardjoe (Author) Posted December 18, 2023

I ran the following:

btrfs check --force /dev/sdg1

It printed lots of errors like:

parent transid verify failed on 2151677952 wanted 14578466 found 14578474

but it did finish. While that error appears many times, it's always the same values.
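(A note for anyone repeating this: btrfs check normally refuses to run on a mounted filesystem, and --force overrides that, so output from a live mount can include false positives. The safer variant is to stop the array first and run the check read-only:)

# With the filesystem unmounted; --readonly is the default mode but makes the intent explicit
btrfs check --readonly /dev/sdg1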
snowboardjoe (Author) Posted December 18, 2023

Making a copy of all the data that I can to a separate disk. Turns out I do not have full backups of everything as I thought. Then I'm going to take a look at this item I found and see if there is a way to fall back to an older tree root (usebackuproot):

https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/FAQ.html#How_do_I_recover_from_a_.22parent_transid_verify_failed.22_error.3F

I fear this entire filesystem may be unrecoverable, but still fighting a little more.
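(A sketch of both steps from that FAQ; /mnt/backup-target and /mnt/recovery are placeholder paths:)

# Copy files off the damaged filesystem without mounting it at all
btrfs restore -v /dev/sdg1 /mnt/backup-target

# Or try a read-only mount that falls back to one of the older saved tree roots
mount -o ro,usebackuproot /dev/sdg1 /mnt/recovery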
JorgeB Posted December 18, 2023

The pool is also going read-only; I suggest backing up what you can and reformatting.
snowboardjoe (Author) Posted December 18, 2023 (Solution)

Went through several other steps but did not make any progress. Ran restore to an external drive and that seemed to grab everything just fine. Then, as a last-ditch effort, I ran the repair command:

root@laffy:/etc# btrfs check --repair --force /dev/sdg1
enabling repair mode
Opening filesystem to check...
WARNING: filesystem mounted, continuing because of --force
Checking filesystem on /dev/sdg1
UUID: 8bdd3d07-cbb0-4d53-a9f2-da67099186ea
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
parent transid verify failed on 2151677952 wanted 14578466 found 14578474
parent transid verify failed on 2151677952 wanted 14578466 found 14578474
parent transid verify failed on 2151677952 wanted 14578466 found 14578474
Ignoring transid failure
super bytes used 258022117376 mismatches actual used 258022100992
parent transid verify failed on 2151677952 wanted 14578466 found 14578474
Ignoring transid failure
No device size related problem found
[3/7] checking free space tree
[4/7] checking fs roots
parent transid verify failed on 2151677952 wanted 14578466 found 14578474
Ignoring transid failure
...identical responses deleted for brevity...
[5/7] checking only csums items (without verifying data)
parent transid verify failed on 2151677952 wanted 14578466 found 14578474
Ignoring transid failure
[6/7] checking root refs
Recowing metadata block 2151677952
parent transid verify failed on 2151677952 wanted 14578466 found 14578474
Ignoring transid failure
[7/7] checking quota groups skipped (not enabled on this FS)
found 516044218368 bytes used, no error found
total csum bytes: 415007296
total tree bytes: 1490927616
total fs tree bytes: 743276544
total extent tree bytes: 180305920
btree space waste bytes: 374203586
file data blocks allocated: 943655108608
 referenced 490961231872

Ran check one more time and it came back clean. Remounted the filesystem read-write and it remained stable (previously it would go read-only again within about 30 seconds). Did a full reboot and all services are back online, including all containers. Watching for stability at this point.
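(If anyone lands here with the same "parent transid verify failed" error, a reasonable post-repair sanity check looks something like this; it verifies metadata, re-reads all data against checksums, and zeroes the error counters so any new errors stand out:)

# Read-only consistency check (array stopped / filesystem unmounted)
btrfs check --readonly /dev/sdg1

# Full data and metadata verification once mounted again
btrfs scrub start -B /mnt/cache

# Reset per-device error counters after confirming a clean state
btrfs device stats -z /mnt/cache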