Cache drive BTRFS Errors

November 24, 2023

Woke up to non-responsive dockers and VMs. I'm seeing a ton of cache pool disk errors. I have two mirrored NVMe drives. I tried to copy a VM disk image to the array and the copy failed, so now I'm in a frustrated panic and not thinking clearly. I think I might have rebooted before I captured the first diagnostics. After rebooting I recreated the docker.img, since it was clearly corrupted, then rebuilt some dockers. Now I'm thinking even that was a bad idea. Really hoping we can save my cache drive data. Please assist.

unraid-diagnostics-20231124-0533.zip unraid-diagnostics-20231124-0736.zip

Edited November 24, 2023 by WashingtonMatt

November 24, 2023

Author

Does this mean I've got a bad drive, or is it some other BTRFS unhappiness?

November 24, 2023

Author

I deleted a large VM disk image I didn't care about, then ran a scrub, which reported a ton of corrections. My VM disk images all seem to be corrupted. I was able to repair my critical VM enough that I can focus on it later.

Are there BTRFS options that may help save my VMs?

Edit: the errors seem to have stopped, except for the loop3 error, which I think occurred when I attempted to start a VM.

Edited November 24, 2023 by WashingtonMatt

November 27, 2023

Community Expert

Those errors suggest one of the NVMe devices dropped offline in the past. Run a correcting scrub and post the results, and also take a look here for better pool monitoring.
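For anyone reading along, a rough sketch of what such monitoring could look like: btrfs keeps per-device error counters that `btrfs device stats <mountpoint>` prints, and a device that dropped offline typically shows up there as non-zero write or read errors. The Python below only parses that output and reports the non-zero counters. The function name and the sample text are made up for illustration, and the numbers are not taken from the diagnostics in this thread.

```python
import re

# Matches lines such as "[/dev/nvme0n1p1].write_io_errs    0"
STAT_LINE = re.compile(
    r"^\[(?P<device>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)$"
)


def nonzero_counters(stats_text):
    """Return {device: {counter: value}} for every counter above zero."""
    problems = {}
    for line in stats_text.splitlines():
        match = STAT_LINE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognized lines
        value = int(match.group("value"))
        if value > 0:
            device = match.group("device")
            problems.setdefault(device, {})[match.group("counter")] = value
    return problems


# Illustrative sample only (hypothetical devices and values).
SAMPLE = """\
[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    1523
[/dev/nvme1n1p1].read_io_errs     87
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  0
[/dev/nvme1n1p1].generation_errs  0
"""

if __name__ == "__main__":
    for device, counters in nonzero_counters(SAMPLE).items():
        print(device, counters)
```

On a live system the text would come from running `btrfs device stats` against the pool mount point (`/mnt/cache` is assumed here) instead of from `SAMPLE`. Any non-zero counter is the cue to check the device and run a scrub.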

November 27, 2023

Author

So when the original event occurred, the cache pool had gone read-only. After some reading here, even though the UI showed I had tons of free space, I deleted a large test VM I didn't care about and then ran a scrub. The scrub reported it made a bunch of corrections (posted above). The errors seemed to stop after that. Now I'm getting the loop3 errors when I try to start some of the VMs. The GUI appears to be complaining about TPM.

Anyway, I disabled Docker and VM and ran a scrub again which came back clean. Re-enabled Docker and VM. Tried to start a problem VM and still get an error.

unraid-diagnostics-20231127-0932.zip

November 27, 2023

Community Expert

Libvirt.img is corrupt, restore from a backup if available.

December 4, 2023

Author

Just to follow-up, I have recreated Libvirt.img and all errors have cleared up. Lost one VM to corruption, but was able to recover my important ones.

I'm still feeling uneasy about the original issue. At this point I'm assuming it was due to not having Scrub scheduled, and if that is the case, shouldn't it be enabled by default? I could have sworn I had a dynamix plugin for that at some point.

I did upgrade cache drives from mirrored 500GB to mirrored 2TB a few months back, had plenty of free space. And more recently upgraded from 6.11 to 6.12.

If anyone can offer any insight as to what might have happened, I'd appreciate it. Just want to make sure I'm covering the bases now.

Edited December 4, 2023 by WashingtonMatt

December 4, 2023

Community Expert

12 minutes ago, WashingtonMatt said:

I'm still feeling uneasy about the original issue. At this point I'm assuming it was due to not having Scrub scheduled,

This was basically caused by the device dropping offline. Take a look at the link I posted above: if you monitor the pool, you will know when a scrub is needed. Also note the info about NOCOW shares; those cannot be fixed if they still exist, and a few releases ago system shares were NOCOW by default.
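On the NOCOW point, a small sketch of how one might check which folders still carry the flag. btrfs does not checksum NOCOW data, so a scrub cannot verify or repair it. `lsattr -d <path>` prints an attribute field per path, and an uppercase `C` in that field means No_COW (lowercase `c` is compression, so the check is case-sensitive). The function name and the sample paths below are illustrative assumptions, not output from this server.

```python
def nocow_paths(lsattr_text):
    """Return the paths whose lsattr attribute field contains 'C' (No_COW)."""
    flagged = []
    for line in lsattr_text.splitlines():
        parts = line.split(None, 1)  # attribute field, then the path
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        flags, path = parts
        if "C" in flags:
            flagged.append(path.strip())
    return flagged


# Illustrative sample only (hypothetical share paths).
SAMPLE = """\
---------------C------ /mnt/cache/system
---------------------- /mnt/cache/appdata
---------------C------ /mnt/cache/domains
"""

if __name__ == "__main__":
    for path in nocow_paths(SAMPLE):
        print("NOCOW:", path)
```

The flag on a directory only affects files created in it afterwards, so existing files would need the same check individually.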
