Hollandex Posted May 3, 2022 (edited)

I posted a while back about my BTRFS cache pool getting corrupted. At the time, the solution was to reformat the drives and set them up again. I did that, and all was well for a couple of months. Then, corruption again. So I did my usual thing: format the drives, set them up as a pool again, use the CA Backup plugin to restore my appdata, and I was off to the races. I also added a User Script that checks the BTRFS pool hourly (as suggested by JorgeB), and within a few hours I started getting corruption errors on both NVMe drives.

root@Sanctuary:~# btrfs dev stats /mnt/cache
[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  4
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  2
[/dev/nvme1n1p1].generation_errs  0

root@Sanctuary:~# btrfs fi usage -T /mnt/cache
Overall:
    Device size:           3.64TiB
    Device allocated:    310.06GiB
    Device unallocated:    3.33TiB
    Device missing:          0.00B
    Used:                 95.50GiB
    Free (estimated):      1.77TiB  (min: 1.77TiB)
    Free (statfs, df):     1.77TiB
    Data ratio:               2.00
    Metadata ratio:           2.00
    Global reserve:      173.05MiB  (used: 64.00KiB)
    Multiple profiles:          no

                   Data       Metadata   System
Id Path            RAID1      RAID1      RAID1     Unallocated
-- --------------  ---------  ---------  --------  -----------
 1 /dev/nvme0n1p1  153.00GiB    2.00GiB  32.00MiB      1.67TiB
 2 /dev/nvme1n1p1  153.00GiB    2.00GiB  32.00MiB      1.67TiB
-- --------------  ---------  ---------  --------  -----------
   Total           153.00GiB    2.00GiB  32.00MiB      3.33TiB
   Used             47.56GiB  195.88MiB  48.00KiB

I read somewhere that a "scrub" might be in order. I assume that's "btrfs scrub /mnt/cache"? I didn't want to start blindly typing commands, though. Should I run a scrub? And if so, do I scrub the cache pool or an individual drive?

I realized there's an option to scrub in the cache settings. I did that, and no errors were found. But the size doesn't look right at all. My cache is two 2TB drives, and the scrub only covered about 96GiB. Shouldn't it have scrubbed nearly the full 2TB?

UUID:             f0eb0645-ca4a-418e-bc12-95393fa57c50
Scrub started:    Tue May 3 13:45:16 2022
Status:           finished
Duration:         0:00:18
Total to scrub:   95.76GiB
Rate:             5.32GiB/s
Error summary:    no errors found

Any other suggestions? Or any other output that might be helpful? Thanks!
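For context, here is a minimal sketch of what an hourly check and a manual scrub can look like. The script body and the notification helper path are assumptions for illustration, not taken from the original post. Note that scrub runs against the mounted pool rather than an individual device, and it only reads the data and metadata actually in use, which is why "Total to scrub" is far smaller than the raw 2TB capacity.

#!/bin/bash
# Hourly check sketch: -c makes `btrfs device stats` exit non-zero
# when any error counter is above zero.
stats=$(btrfs device stats -c /mnt/cache) || \
    # Hypothetical notification; the notify script path can differ between Unraid versions.
    /usr/local/emhttp/webGui/scripts/notify -i warning \
        -s "BTRFS errors on cache pool" \
        -d "$(echo "$stats" | grep -v ' 0$')"

# Manual scrub of the whole pool, then check progress and results:
btrfs scrub start /mnt/cache
btrfs scrub status /mnt/cache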
Hollandex Posted May 3, 2022 (Author)

I wanted to mention that, last time this happened, I ran MemTest for about 12 hours with no errors, and I ran extended SMART tests on both NVMe drives. No errors there either.

Is there any way to get RAID functionality out of cache drives without BTRFS? I'd be curious to see if this issue is specific to BTRFS. If not, I may go to a single XFS cache drive with nightly backups.
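If the pool does get swapped for a single XFS cache drive, the nightly-backup half of that plan can be a scheduled rsync job. A minimal sketch, assuming hypothetical share names and that the destination share does not itself live on the cache:

#!/bin/bash
# Nightly backup sketch: mirror appdata from the cache drive to an array share.
# Source and destination paths are illustrative examples only.
rsync -a --delete /mnt/cache/appdata/ /mnt/user/backups/appdata/

As the reply below points out, this only protects against drive failure; silently corrupted files would be copied into the backup without any warning.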
JorgeB Posted May 4, 2022

RAM would still be the main suspect. According to the last diags you were overclocking the RAM, so I would start by stopping that. If it still happens, try with one DIMM at a time (without overclocking). If it's not RAM, it's likely another hardware issue, but my money is still on the RAM.

You can change to XFS, but if there's data corruption it will continue; you just won't be warned about it.
Hollandex Posted May 4, 2022 (Author)

I'll pull back the XMP profile on the RAM and see if that does the trick. Thanks for your help!