convergence Posted May 11, 2022

More bad luck with my machine: BTRFS reports corruption on both cache SSDs.

~# btrfs dev stats -c /mnt/cache
[/dev/sdf1].write_io_errs    0
[/dev/sdf1].read_io_errs     0
[/dev/sdf1].flush_io_errs    0
[/dev/sdf1].corruption_errs  1043
[/dev/sdf1].generation_errs  0
[/dev/sdg1].write_io_errs    0
[/dev/sdg1].read_io_errs     0
[/dev/sdg1].flush_io_errs    0
[/dev/sdg1].corruption_errs  710
[/dev/sdg1].generation_errs  0

I do have a Ryzen CPU, but the RAM is not overclocked as far as I know. I noticed these errors because my Windows VM crashed and the cache pool was mounted read-only after a reboot.

I would like to know what my recovery options are. First, I want to find out which files are affected. It is possible (perhaps likely) that my appdata backup includes errors from the cache pool; identifying which files have known corruption would let me check the quality of my backup on the array.

A BTRFS scrub unfortunately aborts immediately without doing anything useful:

~# btrfs scrub start /mnt/cache
scrub started on /mnt/cache, fsid 6b069b9c-0b3d-42b3-ab4c-3828d20b9396 (pid=14271)
~# btrfs scrub status /mnt/cache
UUID:            6b069b9c-0b3d-42b3-ab4c-3828d20b9396
Scrub started:   Wed May 11 21:51:54 2022
Status:          aborted
Duration:        0:00:00
Total to scrub:  274.67GiB
Rate:            0.00B/s
Error summary:   no errors found

The syslog does list an exit status of -30 for my scrubbing attempt via the webgui. In my SSH scrubbing attempts, however, the exit status of `btrfs scrub start` is 0.
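As an aside, the -30 can be decoded directly: a negative kernel status like this is a negated errno value, and errno 30 on Linux is EROFS ("Read-only file system"), which is consistent with the pool having been remounted read-only. A quick one-liner to look it up on any machine with python3:

```shell
# Decode a negative kernel status such as -30: drop the sign and look up the
# errno. On Linux, errno 30 is EROFS ("Read-only file system").
python3 -c 'import errno, os; print(errno.errorcode[30], "-", os.strerror(30))'
```

This suggests the scrub aborted simply because the filesystem was mounted read-only at the time, not because of a separate failure.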
May 11 21:10:44 : /usr/local/emhttp/plugins/dynamix/scripts/btrfs_scrub 'start' '/mnt/cache' '-r'
May 11 21:10:44 : BTRFS info (device sdf1): scrub: started on devid 1
May 11 21:10:44 : BTRFS info (device sdf1): scrub: not finished on devid 1 with status: -30
May 11 21:10:44 : BTRFS info (device sdf1): scrub: started on devid 2
May 11 21:10:44 : BTRFS info (device sdf1): scrub: not finished on devid 2 with status: -30

My next step is to run memtest after the parity check completes. Beyond buying new RAM if it turns out to be bad, I do not know how to proceed after that.

I had issues with both the array and cache on this machine a month ago, described here:

I didn't do much to recover from those earlier errors. Most functionality could be restored after a reboot, and I updated the BIOS as advised. Unfortunately I didn't think of running a btrfs scrub back then; it might have already uncovered some corruption.

System/hardware:
- motherboard: Gigabyte AX370-Gaming K5
- BIOS: American Megatrends Inc., version F51d, dated 12/13/2021
- CPU: AMD Ryzen 7 1700 Eight-Core @ 2725 MHz
- RAM: 32 GiB DDR4
- parity: Toshiba 8 TB
- data disks: 3x HGST 4 TB
- cache: Samsung SSD 840 Pro 256 GB + 860 Evo 250 GB
- Unraid version 6.9.2
- Radeon 5700 XT used with IOMMU passthrough in the Windows VM

Any advice would be much appreciated.

pierre-diagnostics-20220511-2217.anon.zip
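On the question of which files are affected: once a scrub can actually run, the kernel typically logs a line per checksum failure that includes the file path, in the form `BTRFS warning (device ...): checksum error at logical ... (path: ...)`. A rough sketch for pulling the unique paths out of a syslog-style file; the function name and sample log location are illustrative, not Unraid-specific:

```shell
# Scan a syslog-style file for BTRFS checksum-error warnings and print the
# unique file paths the kernel reported. Sketch only; the message format is
# the usual scrub warning with a trailing "(path: ...)" component.
list_corrupt_paths() {
  sed -n 's/.*checksum error.*(path: \([^)]*\)).*/\1/p' "$1" | sort -u
}

# Example (assuming the kernel log lands in /var/log/syslog):
#   list_corrupt_paths /var/log/syslog
```

The resulting list can then be compared file by file against the backup on the array.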
JorgeB Posted May 12, 2022

8 hours ago, convergence said:
> My next step is to run memtest

Do that, run at least a couple of passes.
convergence Posted May 14, 2022 (Author)

OK, I ran 9 full passes, so I'm pretty confident that my RAM is good. I let memtest run for 3 full passes, then tried to change the configuration to use SMP, which didn't work because I couldn't get memtest to recognize my USB keyboard. After that I let it run for another 6 full passes. No errors at all.

After booting into Unraid again, I got a couple of different BTRFS errors: IO failures this time, all on device sdf1. I'm going to see if swapping the SATA cable has any effect.
JorgeB Posted May 14, 2022

This type of error still suggests a hardware issue. Maybe try with just one DIMM, re-format the pool, then monitor for new errors.
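The "monitor for new errors" step can be semi-automated: `btrfs dev stats` counters persist across reboots, so they can be cleared once after rebuilding the pool and then checked periodically for any nonzero value. A minimal sketch, written as a function that reads the stats output from stdin so it can also parse saved output (the function name is made up):

```shell
# Succeed (exit 0) if any "btrfs dev stats" error counter on stdin is nonzero.
# Feed it live output, e.g.:  btrfs dev stats /mnt/cache | stats_have_errors
stats_have_errors() {
  awk '$2 > 0 { bad = 1 } END { exit !bad }'
}
```

On the real pool you would first clear the historical counters with `btrfs dev stats -z /mnt/cache`, then run a check like this from cron and alert whenever it succeeds.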
convergence Posted March 15, 2023 (Author, Solution)

I recovered from the BTRFS corruption by formatting both cache drives, adding them back to the pool, and restoring my data from a backup. Everything was successfully recovered except the corrupted file. Only one docker container was affected, and I'm fairly sure the file got corrupted by a dirty shutdown due to power loss.

I kept the system running on unchanged hardware without any problems until early February 2023. Then -- one fateful morning -- I rebooted the machine and was notified of cache pool corruption shortly after. A scrub revealed that the same file had gotten corrupted as in 2022, albeit in an entirely different container, one that had only recoverable errors during the previous incident. Both containers (corrupted in '22 and '23) were running an Urbit image. The main difference was that I thought I had managed to do a clean reboot the second time.

The similarity between the incidents allowed me to figure out that the default shutdown timeout settings were insufficient for my setup. I have now increased the relevant timeouts and haven't had problems since. I'm still not entirely confident that every shutdown will be clean from now on, and I hope to find the time to investigate what takes so long in my shutdown sequence.

I also don't understand how docker containers getting killed would cause BTRFS corruption. I understand that a file on the container's volume might get corrupted, but not how different data could be written to the disks of the pool (and without updating the checksums).