BTRFS / Cache reliability concerns



Hi, I'm wondering whether BTRFS is the right choice for a resilient cache / storage solution.

 

I run two Unraid servers: the primary has 40TB of disk plus a 2TB cache (4 x 1TB SSD), the secondary 26TB of disk plus a 2TB cache (4 x 1TB SSD). On two occasions I've lost my entire cache volume due to one of the drives "failing". I say failing, but both times it was really my own fault: I didn't want to shut down, I pulled the wrong drive, and I immediately plugged it back in. Still, this is no different from a drive failing or a connection failing. Pulling disks is a perfectly good test when certifying large resilient storage systems.

 

One would expect the loss of 1 disk in a 4-disk BTRFS RAID10 config to be a non-issue. Not so. First the log started showing BTRFS corruption errors, and it seemed they were not being fixed automatically. I ran a cache scrub, which reported no errors, yet the errors kept appearing in the log. I then ran a scrub with repair, which reported that it had repaired them. After that I started getting docker write failures: it seems my cache had become read-only and the BTRFS filesystem was corrupt.

In both cases I resorted to rebuilding the cache from scratch and restoring the appdata backups. I lost the VMs (unlike docker, where you can stop, back up, and restart, there is no easy way to back up VMs).
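
For anyone retracing the sequence above (log errors, a scrub that reports nothing, then a read-only pool), the per-device error counters are worth checking alongside the scrub result. Below is a minimal Python sketch, not anything Unraid ships. It assumes the pool is mounted at `/mnt/cache` and that `btrfs device stats` prints lines of the form `[/dev/sdb1].corruption_errs   3`. The function names are made up for illustration.

```python
import re
import subprocess
from collections import defaultdict

# Matches lines such as "[/dev/sdb1].corruption_errs   3"
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$")


def parse_device_stats(text):
    """Turn `btrfs device stats` output into {device: {counter: value}}."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[match["dev"]][match["counter"]] = int(match["value"])
    return dict(stats)


def devices_with_errors(stats):
    """Keep only the devices (and counters) whose value is non-zero."""
    return {
        dev: {name: value for name, value in counters.items() if value > 0}
        for dev, counters in stats.items()
        if any(value > 0 for value in counters.values())
    }


def read_device_stats(mount_point="/mnt/cache"):
    """Run `btrfs device stats` on the pool (mount point is an assumption)."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mount_point],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_device_stats(result.stdout)
```

Calling `devices_with_errors(read_device_stats())` on the server (as root) would list only the pool members with non-zero counters. That can help tell which SSD the filesystem thinks misbehaved after a pull-and-replug.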

 

I've run hardware RAID for a long time, including hardware that uses SSD caching. I've lost disks and pulled disks, but in all cases the array eventually came back on its own.

I simply do not have the same trust in Unraid's cache. I think it is fragile, and unreliable to the point where it needs to be backed up constantly.

 

I'd like to see Unraid/Limetech publish their resiliency and performance test plans. What is tested for? What are the known failure scenarios? What are the known recoverable scenarios? Are my expectations of resiliency and performance unfounded?

 

And this is not about BTRFS, this is about Unraid. I don't care what Unraid uses for the cache volume: it could have supported SSDs in data volumes so that no cache would be required, or it could have used ZFS and we would have different problems. BTRFS was an Unraid choice, and I find the result fragile.

 

What are your experiences with cache resiliency?
