convergence Posted May 11, 2022

More bad luck with my machine: BTRFS reports corruption on both cache SSDs.

~# btrfs dev stats -c /mnt/cache
[/dev/sdf1].write_io_errs    0
[/dev/sdf1].read_io_errs     0
[/dev/sdf1].flush_io_errs    0
[/dev/sdf1].corruption_errs  1043
[/dev/sdf1].generation_errs  0
[/dev/sdg1].write_io_errs    0
[/dev/sdg1].read_io_errs     0
[/dev/sdg1].flush_io_errs    0
[/dev/sdg1].corruption_errs  710
[/dev/sdg1].generation_errs  0

I do have a Ryzen CPU, but the RAM is not overclocked as far as I know. I noticed these errors because my Windows VM crashed and the cache pool was mounted read-only after a reboot.

I would like to know what my recovery options are. First, I want to find out which files are affected. It is possible (perhaps likely) that my appdata backup includes errors from the cache pool; identifying which files have known corruption would let me check the quality of my backup on the array.

A BTRFS scrub unfortunately aborts immediately without doing anything useful:

~# btrfs scrub start /mnt/cache
scrub started on /mnt/cache, fsid 6b069b9c-0b3d-42b3-ab4c-3828d20b9396 (pid=14271)
~# btrfs scrub status /mnt/cache
UUID:            6b069b9c-0b3d-42b3-ab4c-3828d20b9396
Scrub started:   Wed May 11 21:51:54 2022
Status:          aborted
Duration:        0:00:00
Total to scrub:  274.67GiB
Rate:            0.00B/s
Error summary:   no errors found

The syslog does list an exit status of -30 for my scrubbing attempt via the webgui. In my SSH scrubbing attempts, however, the exit status of `btrfs scrub start` is 0.
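As an aside, the -30 can be decoded directly: a negative kernel status like this is a negated errno value, and errno 30 on Linux is EROFS ("Read-only file system"), which is consistent with the pool having been remounted read-only. A quick one-liner to look it up on any machine with python3:

```shell
# Decode a negative kernel status such as -30: drop the sign and look up the
# errno. On Linux, errno 30 is EROFS ("Read-only file system").
python3 -c 'import errno, os; print(errno.errorcode[30], "-", os.strerror(30))'
```

This suggests the scrub aborted simply because the filesystem was mounted read-only at the time, not because of a separate failure.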
May 11 21:10:44 : /usr/local/emhttp/plugins/dynamix/scripts/btrfs_scrub 'start' '/mnt/cache' '-r'
May 11 21:10:44 : BTRFS info (device sdf1): scrub: started on devid 1
May 11 21:10:44 : BTRFS info (device sdf1): scrub: not finished on devid 1 with status: -30
May 11 21:10:44 : BTRFS info (device sdf1): scrub: started on devid 2
May 11 21:10:44 : BTRFS info (device sdf1): scrub: not finished on devid 2 with status: -30

My next step is to run memtest after the parity check completes. Beyond buying new RAM if it turns out to be bad, I do not know how to proceed after that.

I had issues with both the array and cache on this machine a month ago, described here:

I didn't do much to recover from those earlier errors. Most functionality could be restored after a reboot, and I updated the BIOS as advised. Unfortunately I didn't think of running a btrfs scrub back then; it might have already uncovered some corruption.

System/hardware:
- motherboard: Gigabyte AX370-Gaming K5
- BIOS: American Megatrends Inc., version F51d, dated 12/13/2021
- CPU: AMD Ryzen 7 1700 Eight-Core @ 2725 MHz
- RAM: 32 GiB DDR4
- parity: Toshiba 8 TB
- data disks: 3x HGST 4 TB
- cache: Samsung SSD 840 Pro 256 GB + 860 Evo 250 GB
- Unraid version 6.9.2
- Radeon 5700 XT used with IOMMU passthrough in the Windows VM

Any advice would be much appreciated.

pierre-diagnostics-20220511-2217.anon.zip
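On the question of which files are affected: once a scrub can actually run, the kernel typically logs a line per checksum failure that includes the file path, in the form `BTRFS warning (device ...): checksum error at logical ... (path: ...)`. A rough sketch for pulling the unique paths out of a syslog-style file; the function name and sample log location are illustrative, not Unraid-specific:

```shell
# Scan a syslog-style file for BTRFS checksum-error warnings and print the
# unique file paths the kernel reported. Sketch only; the message format is
# the usual scrub warning with a trailing "(path: ...)" component.
list_corrupt_paths() {
  sed -n 's/.*checksum error.*(path: \([^)]*\)).*/\1/p' "$1" | sort -u
}

# Example (assuming the kernel log lands in /var/log/syslog):
#   list_corrupt_paths /var/log/syslog
```

The resulting list can then be compared file by file against the backup on the array.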
JorgeB Posted May 12, 2022

8 hours ago, convergence said:
> My next step is to run memtest

Do that, run at least a couple of passes.
convergence Posted May 14, 2022 (Author)

OK, I ran 9 full passes, so I'm pretty confident that my RAM is good. I let memtest run for 3 full passes, then tried to change the configuration to use SMP, which didn't work because I couldn't get memtest to recognize my USB keyboard. After that I let it run for another 6 full passes. No errors at all.

After booting into Unraid again, I got a couple of different BTRFS errors: IO failures this time, all on device sdf1. I'm going to see if swapping the SATA cable has any effect.
JorgeB Posted May 14, 2022

This type of error still suggests a hardware issue. Maybe try with just one DIMM, re-format the pool, then monitor for new errors.
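The "monitor for new errors" step can be semi-automated: `btrfs dev stats` counters persist across reboots, so they can be cleared once after rebuilding the pool and then checked periodically for any nonzero value. A minimal sketch, written as a function that reads the stats output from stdin so it can also parse saved output (the function name is made up):

```shell
# Succeed (exit 0) if any "btrfs dev stats" error counter on stdin is nonzero.
# Feed it live output, e.g.:  btrfs dev stats /mnt/cache | stats_have_errors
stats_have_errors() {
  awk '$2 > 0 { bad = 1 } END { exit !bad }'
}
```

On the real pool you would first clear the historical counters with `btrfs dev stats -z /mnt/cache`, then run a check like this from cron and alert whenever it succeeds.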
convergence Posted March 15, 2023 (Author, Solution)

I recovered from the BTRFS corruption by formatting both cache drives, adding them back to the pool, and restoring my data from a backup. Everything was successfully recovered except the corrupted file. Only one docker container was affected, and I'm fairly sure the file got corrupted by a dirty shutdown due to power loss.

I kept the system running on unchanged hardware without any problems until early February 2023. Then -- one fateful morning -- I rebooted the machine and was notified of cache pool corruption shortly after. A scrub revealed that the same file had gotten corrupted as in 2022, albeit in an entirely different container, one that had only recoverable errors during the previous incident. Both containers (corrupted in '22 and '23) were running an Urbit image. The main difference was that I thought I had managed to do a clean reboot the second time.

The similarity between the incidents allowed me to figure out that the default shutdown timeout settings were insufficient for my setup. I have now increased the relevant timeouts and haven't had problems since. I'm still not entirely confident that every shutdown will be clean from now on, and I hope to find the time to investigate what takes so long in my shutdown sequence.

I also don't understand how docker containers getting killed would cause BTRFS corruption. I understand that a file on the container's volume might get corrupted, but not how different data could be written to the disks of the pool (and without updating the checksums).