Cache Pool got corrupted....again. What next?


I posted a while back about my BTRFS cache pool getting borked. At the time, the solution was to reformat the drives and set them up again. I did that, and all was well for a couple of months. Then, corruption again.

  

So, I did my usual thing: formatted the drives, set them up as a pool again, used the CA Backup plugin to restore my AppData, and I was off to the races.

  

I added a UserScript that checks the BTRFS pool hourly (as suggested by JorgeB), and within a few hours I started to get corruption errors on both of the NVMe drives.

  

root@Sanctuary:~# btrfs dev stats /mnt/cache
[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  4
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    0
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  2
[/dev/nvme1n1p1].generation_errs  0
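
For anyone wanting a similar hourly check, it can be sketched as a small shell function that filters `btrfs dev stats` output down to the counters that are not zero. This is only a rough sketch and not necessarily the script JorgeB suggested; the function name `nonzero_stats` is made up for illustration, and scheduling it hourly (for example through the User Scripts plugin) is assumed.

```shell
#!/bin/bash
# Hypothetical helper: read `btrfs dev stats` output on stdin and print
# only the counters whose value is not zero, as "<counter> <value>".
nonzero_stats() {
    # NF == 2 skips blank or malformed lines; $2 + 0 forces a numeric compare.
    awk 'NF == 2 && $2 + 0 != 0 { print $1, $2 }'
}

# Example use against the pool from this thread (needs root and a mounted pool):
#   btrfs dev stats /mnt/cache | nonzero_stats
```

Fed the output above, this would print only the two `corruption_errs` lines, one per NVMe device. An empty result means every counter is still at zero.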

root@Sanctuary:~# btrfs fi usage -T /mnt/cache
Overall:
    Device size:                   3.64TiB
    Device allocated:            310.06GiB
    Device unallocated:            3.33TiB
    Device missing:                  0.00B
    Used:                         95.50GiB
    Free (estimated):              1.77TiB      (min: 1.77TiB)
    Free (statfs, df):             1.77TiB
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              173.05MiB      (used: 64.00KiB)
    Multiple profiles:                  no

                  Data      Metadata  System              
Id Path           RAID1     RAID1     RAID1    Unallocated
-- -------------- --------- --------- -------- -----------
 1 /dev/nvme0n1p1 153.00GiB   2.00GiB 32.00MiB     1.67TiB
 2 /dev/nvme1n1p1 153.00GiB   2.00GiB 32.00MiB     1.67TiB
-- -------------- --------- --------- -------- -----------
   Total          153.00GiB   2.00GiB 32.00MiB     3.33TiB
   Used            47.56GiB 195.88MiB 48.00KiB            

 

I read somewhere that maybe a "scrub" is in order. I assume that's "btrfs scrub /mnt/cache"? I didn't want to start blindly typing commands, though. Should I run a scrub? And, if so, do I scrub the cache pool or an individual drive?
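
On the command itself: as far as I know the subcommand is `btrfs scrub start`, not plain `btrfs scrub`, and it accepts either a mount point (which scrubs every device in the pool) or a single device. A minimal sketch, where the `BTRFS` variable and the `scrub_target` function are hypothetical names added only so the command can be dry-run:

```shell
#!/bin/bash
# BTRFS is a hypothetical override (not a btrfs feature): set BTRFS=echo
# to print the command instead of running it.
BTRFS="${BTRFS:-btrfs}"

# Scrub a whole pool (pass the mount point) or one device (pass the device node).
scrub_target() {
    local target="$1"
    # -B keeps the scrub in the foreground so the summary prints when it ends.
    "$BTRFS" scrub start -B "$target"
}

# Whole pool from this thread:   scrub_target /mnt/cache
# Single device only:            scrub_target /dev/nvme0n1p1
```

Scrubbing the mount point is usually what you want for a RAID1 pool, since both copies get verified. Without `-B` the scrub runs in the background and `btrfs scrub status /mnt/cache` reports progress.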

 

I realized there's an option to scrub in the cache settings. I did that, no errors found. But the size doesn't look right at all. My cache is two 2TB drives. And there is only about 300MB used. It should have been scrubbing nearly the full 2TB, right?

 

UUID:             f0eb0645-ca4a-418e-bc12-95393fa57c50
Scrub started:    Tue May  3 13:45:16 2022
Status:           finished
Duration:         0:00:18
Total to scrub:   95.76GiB
Rate:             5.32GiB/s
Error summary:    no errors found
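
On the size question: a scrub only reads data and metadata that are actually in use, not unallocated space, so the total should track the "Used" figure rather than the drive size. The arithmetic can be checked from the `btrfs fi usage` table above. This is a sketch; the function name `raid1_used_gib` is made up, and the factor of 2 assumes every profile is RAID1, as the table shows.

```shell
#!/bin/bash
# Hypothetical helper: total bytes in use across both RAID1 copies, in GiB.
# Arguments: data used (GiB), metadata used (MiB), system used (KiB).
raid1_used_gib() {
    awk -v d="$1" -v m="$2" -v s="$3" 'BEGIN {
        copies = 2   # RAID1 keeps two copies of every block
        printf "%.2f\n", copies * (d + m / 1024 + s / 1024 / 1024)
    }'
}

# Values from the "Used" row of the table in this thread:
#   raid1_used_gib 47.56 195.88 48    # prints 95.50
```

That matches the 95.50GiB "Used" line, and the scrub's 95.76GiB total is close to it. The small gap is presumably data written between the two commands, though that is a guess. The duration is consistent too: 95.76GiB in 18 seconds is the reported 5.32GiB/s.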

 

Any other suggestions? Or any other output that might be helpful?

 

Thanks!

Edited by Hollandex

I wanted to mention that, last time this happened, I ran MemTest for about 12 hours with no errors. And I ran extended SMART tests on both NVMe drives. No errors.  

  

Is there any way to get RAID functionality out of cache drives without BTRFS? I'd be curious to see if this issue is specific to BTRFS. If not, I may go to a single XFS cache drive with nightly backups.


RAM would still be the main suspect. According to the last diags you were overclocking the RAM, so I would start by stopping that. If it still happens, try with one DIMM at a time (without overclocking). If it's not RAM, it's likely another hardware issue, but my money is still on RAM.

 

You can change to XFS, but if there's data corruption it will continue; you just won't be warned, since XFS doesn't checksum file data.
