BTRFS cache pool corrupt (both drives) [v6.9.2]



More bad luck with my machine: BTRFS reports corruption on both cache SSDs.

 

~# btrfs dev stats -c /mnt/cache
[/dev/sdf1].write_io_errs    0
[/dev/sdf1].read_io_errs     0
[/dev/sdf1].flush_io_errs    0
[/dev/sdf1].corruption_errs  1043
[/dev/sdf1].generation_errs  0
[/dev/sdg1].write_io_errs    0
[/dev/sdg1].read_io_errs     0
[/dev/sdg1].flush_io_errs    0
[/dev/sdg1].corruption_errs  710
[/dev/sdg1].generation_errs  0
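(For anyone else reading along: the -c flag just makes the command exit non-zero when any counter is non-zero, and if I understand the man page correctly the counters are cumulative, i.e. they survive reboots until explicitly reset. Once the underlying problem is fixed they can be zeroed again with:

~# btrfs dev stats -z /mnt/cache

which prints the counters one last time and then resets them.)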

 

I do have a Ryzen CPU but the RAM is not overclocked afaik. I noticed these errors because my Windows VM crashed and the cache pool was mounted as read-only after a reboot.

 

I would like to know what my recovery options are. In the first place I want to find out which files are affected. It is possible (perhaps likely) that my appdata backup includes errors from the cache pool. Identifying which files have known corruption may allow me to check the quality of my backup on the array.
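From what I've read, btrfs logs the inode of every file whose checksum verification fails, so grepping the kernel log should at least give a partial list of affected files:

~# dmesg | grep -i 'csum failed'

Each of those messages contains a root and inode number, and (assuming I'm reading the docs correctly) an inode can be mapped back to a path with btrfs inspect-internal, e.g. for inode 257 (just an example number):

~# btrfs inspect-internal inode-resolve 257 /mnt/cache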

 

BTRFS scrub unfortunately aborts immediately without doing anything useful:

~# btrfs scrub start /mnt/cache
scrub started on /mnt/cache, fsid 6b069b9c-0b3d-42b3-ab4c-3828d20b9396 (pid=14271)

~# btrfs scrub status /mnt/cache
UUID:             6b069b9c-0b3d-42b3-ab4c-3828d20b9396
Scrub started:    Wed May 11 21:51:54 2022
Status:           aborted
Duration:         0:00:00
Total to scrub:   274.67GiB
Rate:             0.00B/s
Error summary:    no errors found

 

The syslog does show a status of -30 for the scrub started via the webgui. The `btrfs scrub start` command I ran over ssh, however, exited with status 0.

May 11 21:10:44 : /usr/local/emhttp/plugins/dynamix/scripts/btrfs_scrub 'start' '/mnt/cache' '-r'
May 11 21:10:44 : BTRFS info (device sdf1): scrub: started on devid 1
May 11 21:10:44 : BTRFS info (device sdf1): scrub: not finished on devid 1 with status: -30
May 11 21:10:44 : BTRFS info (device sdf1): scrub: started on devid 2
May 11 21:10:44 : BTRFS info (device sdf1): scrub: not finished on devid 2 with status: -30
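If I'm not mistaken, -30 is -EROFS ("read-only file system"), which would match the pool having come up read-only after the crash: scrub apparently can't do its work in that state and bails out immediately. The mount flags are easy to double-check:

~# mount | grep /mnt/cache

If the options start with "ro", that would explain the aborted scrub; I assume the pool needs to be mounted read-write again before a scrub can complete.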

 

My next step is to run memtest after the parity check completes. I do not know how to proceed after that, other than purchasing new RAM if it turns out to be bad. I had issues with both the array and the cache on this machine a month ago, described here:

I didn't do much to recover from these earlier errors. Most functionality could be restored after a reboot, and I updated the BIOS as advised. Unfortunately I didn't think of doing a btrfs scrub back then. It might have already uncovered some corruption.

 

System/hardware:

 

- motherboard: Gigabyte AX370-Gaming K5

- bios: American Megatrends Inc. Version F51d. Dated: 12/13/2021

- cpu: AMD Ryzen 7 1700 Eight-Core @ 2725 MHz

- ram: 32 GiB DDR4

 

- parity: Toshiba 8 TB

- data disks: 3x HGST 4 TB

- cache: Samsung SSD 840 Pro 256 GB + 860 EVO 250 GB

 

- Unraid version 6.9.2

- Radeon RX 5700 XT passed through to the Windows VM (IOMMU)

 

Any advice would be much appreciated.

 

pierre-diagnostics-20220511-2217.anon.zip


Ok, I ran 9 full passes. I'm pretty confident that my RAM is good.

 

I let memtest run for 3 full passes, then tried to change the configuration to use SMP, which didn't work because I couldn't get memtest to recognize my USB keyboard. After that I let it run for another 6 full passes. No errors at all.

 

After booting into Unraid again, I got a couple of different BTRFS errors: IO failures this time, all on device sdf1. I'm going to see if swapping the SATA cable has any effect.
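Before and after swapping the cable I'll also look at the SMART data of that SSD (smartctl ships with Unraid), since CRC errors usually point at cabling while reallocated or pending sectors would point at the drive itself. Something like this should do as a rough filter (attribute names vary a bit per vendor):

~# smartctl -a /dev/sdf | grep -iE 'reallocated|pending|uncorrect|crc'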

memtest-220514.jpg

  • 10 months later...
  • Solution

I recovered from the BTRFS corruption by formatting both cache drives, re-adding them to the pool, and restoring my data from a backup. Everything was recovered successfully except for a single corrupted file. Only one docker container was affected, and I'm fairly sure the file got corrupted by a dirty shutdown due to power loss.
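For anyone ending up in the same situation: the restore itself was nothing fancy, essentially just stopping the Docker and VM services and copying the backup from the array back onto the freshly formatted pool, along these lines (the backup path is only an example, adjust it to wherever your backup actually lives):

~# rsync -avh --progress /mnt/user/backups/appdata/ /mnt/cache/appdata/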

 

I kept the system running on unchanged hardware without any problems until early February 2023. Then, one fateful morning, I rebooted the machine and was notified of cache pool corruption shortly after. A scrub revealed that the same file had been corrupted as in 2022, albeit in an entirely different container, one that had only had recoverable errors during the previous incident. Both containers (corrupted in '22 and '23) were running an Urbit image. The main difference was that I thought I had managed to do a clean reboot the second time.

 

The similarity between the incidents allowed me to figure out that the default shutdown timeout settings were insufficient for my setup. I have now increased the relevant timeouts and haven't had problems since. I'm still not entirely confident that every shutdown will be clean from now on, and I hope to find the time to investigate what takes so long in my shutdown sequence. I also don't understand how docker containers getting killed would cause BTRFS corruption... I understand that a file on the container's volume might get corrupted, but not how different data could be written to the disks of the pool (and without updating the checksums).
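A crude way to narrow that down would be to time the container stops by hand and see which one blows through its budget (Docker's default stop timeout is only 10 seconds before it sends SIGKILL), roughly like this:

~# for c in $(docker ps --format '{{.Names}}'); do echo "== $c"; time docker stop -t 120 "$c"; done

Whichever container needs anywhere near the shutdown timeout to stop would be the prime suspect.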
