
BTRFS Cache drive failed, how to restore and update to ZFS?



Hi everyone,

 

Server (6.11.5) has been running like a dream for so long I forgot it exists. Anyway, I hopped on today and found a bunch of errors on one of my cache drives. I ran btrfs dev stats and noticed nvme1n1p1 had a bunch of errors; I zeroed the counters out, ran a btrfs scrub, and still have numerous errors. From what I've seen in other forum posts it looks like my drive is on the way out. I ordered a new 980 PRO 2TB NVMe and it should be here tomorrow, but I'd like some advice on how to migrate over.
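For reference, this is roughly the set of commands involved (assuming the pool is mounted at /mnt/cache; substitute your pool's mount point):

    # per-device error counters for the pool (read/write/flush/corruption/generation)
    btrfs device stats /mnt/cache

    # reset ("zero out") the counters so any new errors stand out
    btrfs device stats -z /mnt/cache

    # kick off a scrub, then check on its progress/result
    btrfs scrub start /mnt/cache
    btrfs scrub status /mnt/cache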


 

For right now it looks like all my Docker containers and VMs are running without any issues and seem to be writing to Cache 2 (the drive without any errors). I am planning on leaving everything running until tomorrow evening when the new drive arrives from Amazon (I assume this is okay since the other drive is operating fine).

 

Once I receive it, I am planning on doing the following:

  1. Stop all VMs/Docker containers.
  2. Change every share that has Cache set to Only or Prefer to "Yes" (I am still on 6.11.5).
  3. Run the mover to get everything off the cache drives and onto the array, and double-check the pool is actually empty (see the sketch after this list).
  4. Shut down the server.
  5. Replace the bad drive with the new one from Amazon.
  6. Start the array and assign the new drive to the pool.
  7. Let the pool rebuild, then revert the share changes from step 2 and run the mover again.
  8. Restart VMs/Docker containers.
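For step 3, this is the kind of check I'm planning to run before shutting down, just to make sure the mover really emptied the pool (again assuming it's mounted at /mnt/cache):

    # anything still sitting on the pool after the mover finishes?
    du -sh /mnt/cache/* 2>/dev/null
    ls -la /mnt/cache

Anything left behind I'd copy off manually before pulling the drive.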

 

Does this seem like an appropriate checklist to get my cache pool back up and running without any data loss?

 

I've also been reading up on ZFS and using it for the cache pool, but it looks like for stable support I'd be best off upgrading to 6.12.x and then setting up the cache pool as ZFS instead of btrfs. I assume it's worth getting my existing cache pool back up and running before I do any migration of that sort, but if ZFS is less prone to errors for a cache pool, would it make sense to stop at step 6 and do the following instead?

 

     6. Start the array and assign the new drive to the pool
     7. Upgrade Unraid OS to 6.12.x
     8. Set up the new cache pool as ZFS (with a quick sanity check, see the sketch after this list)
     9. Revert the share changes and then run the mover
    10. Restart VMs/Docker containers
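For step 8, I'd probably sanity-check the new ZFS pool before moving data back onto it, something like this (assuming the pool ends up named "cache"; adjust to whatever name it actually gets):

    # confirm the pool is ONLINE, both devices are healthy, and there are no read/write/checksum errors
    zpool status cache

    # quick look at capacity and usage
    zpool list cache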

 

My only hesitation is that it seems like a major jump from 6.11 to 6.12. Should I wait until my array is in a stable state before I do the upgrade? Is it even worth upgrading if the only thing I want out of it is ZFS for my cache pool?

aincrad-diagnostics-20240423-2306.zip

20 hours ago, JorgeB said:

One of the NVMe devices dropped offline. Power cycle the server (don't just reboot) to see if it comes back, and post new diags after array start.

Hi @JorgeB

 

I did do a restart, but once I re-enabled anything reading/writing to the cache pool it started throwing errors in syslog, and I'm seeing errors again in btrfs dev stats after wiping the counters. I think that specific drive might be shot? Is there any issue if I were to swap it out with a new 2TB NVMe SSD, or should I try to wipe and reformat the bad drive?
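In case it helps narrow it down, these are the checks I can run on the suspect drive (nvme1n1 here is taken from my earlier output; adjust as needed):

    # controller-level health: media errors, error log entries, percentage used
    smartctl -a /dev/nvme1n1

    # watch syslog for new btrfs/nvme errors while the pool is in use
    tail -f /var/log/syslog | grep -iE 'btrfs|nvme'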

 

I also tried scrubbing that drive and it aborts with an error code of -30. I also see a Fix Common Problems error saying it is unable to write to the cache. I feel like after the reboot I'm in a worse state, since before it was still able to read/write from the cache pool.


 

I do, however, see the device back in the dashboard as accessible and reporting a temperature again.


 

What are your thoughts on migrating to 6.12.X to use ZFS instead of btrfs?

 

aincrad-diagnostics-20240424-2120.zip

 

 

 


7 hours ago, JorgeB said:

If the scrub is aborting, the best bet is to copy what you can from the pool and then reformat. I do recommend ZFS over btrfs for mirrors, since it's better at recovering from a dropped device.

Since only one of the drives is failing, can I just swap out that one drive and see if it rebuilds from the good drive to the new one? I did do an rsync onto my array, but it completed with some file errors, so I'm not sure I have a perfect copy. I do use CA Backup, but I never realized it was wiping all of the previous runs, so I only have a backup from this past Monday, and I'm not sure when the corruption started. I'd like to treat wiping and reformatting as the last resort in case my backup is corrupted. Will swapping out the bad drive with the new one and letting the system rebuild the pool work?
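For the record, the copy was roughly this (paths are examples from my setup; I've added a log file here so the failing files can be tracked down on a re-run):

    # copy everything off the pool to a share on the array, keeping a log of errors
    # (/mnt/user/cache_backup is just an example destination)
    rsync -av --progress --log-file=/boot/rsync-cache.log /mnt/cache/ /mnt/user/cache_backup/

    # list the files rsync couldn't read cleanly
    grep -i error /boot/rsync-cache.log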

2 hours ago, JorgeB said:

It doesn't look to me like a device problem, it just dropped offline, but you can try removing it to see if it's better with just the other one.

Even if it dropped offline, the restart should've fixed it, right? It also seems like that one bad drive is consistently failing to read and write in syslog. If I reformat, there's a strong chance I'll lose all the data on both drives, so if data preservation is my main goal, would swapping it out and letting it rebuild be the best option?

 

It also looks like the cache pool is mounted in read-only mode, since I can't run a btrfs scrub; it just instantly aborts whether run through the GUI or the command line.
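This is how I'm checking it, in case I'm misreading things, and if I understand the error codes right, -30 corresponds to EROFS (read-only filesystem), which would line up:

    # "ro" in the mount options would confirm the pool has gone read-only
    grep /mnt/cache /proc/mounts

    # scrub state and error counts from the last attempt
    btrfs scrub status /mnt/cache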

 

Are you saying that if I stop the array, remove the drive that's having issues from the pool, and then restart the array, it'll resume just fine since the good drive is still in the pool? I thought I'd need at least two drives for it to be a cache pool, and that if I remove one it won't start up since it's supposed to be mirrored.

12 hours ago, 97WaterPolo said:

Even if it dropped offline, the restart should've fixed it, right?

Usually a power cycle is required for NVMe devices, not just a restart.

 

12 hours ago, 97WaterPolo said:

Are you saying that if I stop the array, remove the drive that's having issues from the pool, and then restart the array, it'll resume just fine since the good drive is still in the pool?

Instead of that, which will wipe the other device, I would recommend physically removing the device, in case you still need it.

 


I did what was suggested and did a full power cycle: shut down fully, then brought it back up. The cache pool still mounted in read-only mode since one of the disks is failing, and btrfs dev stats showed the same drive with corruption errors even after wiping the counters.

 

So I did an rsync to copy as much as I could off the cache pool to the array. I had a tail running on the syslog and it was constantly spitting out errors about corruption, but thankfully it finished, and spot-checking the data it looked good.

 

I then shut down the server again and took it out to be worked on. I gave the whole thing a good dusting and replaced the 970 Evo Plus with the 980 Pro. When I restarted, the system gave me an error stating that the drive was missing, so I assigned the new drive to the cache pool and started the array. Once started, it kicked off a balance on its own, reading a bunch from the good drive and writing to the new drive. It took under an hour to rebuild and balance (unfortunately the new drive "overheated", so I got a notification, but it all seems fine). I had some issues bringing my Docker containers back up because the br0 network interface was in use, but I just brought it down and up again and everything was working great!
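For anyone who runs into the same thing, these are roughly the commands I used to keep an eye on the rebuild, plus a sketch of the br0 bounce (I may well have done that last part through the network settings page instead; adjust the interface and mount point to your setup):

    # watch the balance Unraid kicked off after the replacement device was added
    btrfs balance status /mnt/cache

    # confirm both devices are in the pool and the error counters stay at zero
    btrfs filesystem show /mnt/cache
    btrfs device stats /mnt/cache

    # sketch of bringing the docker bridge down and back up
    ip link set br0 down
    ip link set br0 up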

 

Thank you!
