November 24, 2023 Woke up to non-responsive Docker containers and VMs. I'm seeing a ton of cache pool disk errors. I have two mirrored NVMe drives. I tried to copy a VM disk image to the array and the copy failed, so now I'm in a frustrated panic and not thinking clearly. I think I might have rebooted before I captured the first diagnostics. After rebooting I recreated the docker.img since it was clearly corrupted, then rebuilt some containers. Now I'm thinking even that was a bad idea. Really hoping we can save my cache drive data. Please assist. unraid-diagnostics-20231124-0533.zip unraid-diagnostics-20231124-0736.zip Edited November 24, 2023 by WashingtonMatt
November 24, 2023 Author Does this mean I've got a bad drive, or is it some other BTRFS unhappiness?
November 24, 2023 Author I deleted a large VM disk image I didn't care about, then ran a scrub, which reported a ton of corrections. My VM disk images all seem to be corrupted. I was able to repair my critical VM enough that I can focus on it later. Are there BTRFS options that may help save my VMs? Edit: the errors seem to have stopped, except for the loop3 error, which I think occurred when I attempted to start a VM. Edited November 24, 2023 by WashingtonMatt
November 27, 2023 Community Expert Those errors suggest one of the NVMe devices dropped offline in the past. Run a correcting scrub and post the results. Also take a look here for better pool monitoring.
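As a rough illustration of the kind of pool monitoring suggested above, here is a minimal Python sketch that reads the per-device error counters printed by `btrfs device stats` and reports any that are nonzero. The mount point `/mnt/cache`, the function names, and the sample device names and counts are assumptions made up for the example; they are not taken from the diagnostics in this thread.

```python
import re
import subprocess

# `btrfs device stats <mount>` prints one counter per line, for example:
#   [/dev/nvme0n1p1].write_io_errs    0
STAT_LINE = re.compile(r"^\[(?P<dev>[^\]]+)\]\.(?P<name>\w+)\s+(?P<count>\d+)$")


def parse_device_stats(text):
    """Return {device: {counter: value}} containing only nonzero counters."""
    nonzero = {}
    for line in text.splitlines():
        m = STAT_LINE.match(line.strip())
        if m and int(m["count"]) > 0:
            nonzero.setdefault(m["dev"], {})[m["name"]] = int(m["count"])
    return nonzero


def check_pool(mount_point="/mnt/cache"):
    """Run `btrfs device stats` on the pool (assumed mount point) and parse it."""
    result = subprocess.run(
        ["btrfs", "device", "stats", mount_point],
        capture_output=True, text=True, check=True,
    )
    return parse_device_stats(result.stdout)


# Illustrative output only: hypothetical devices and counts.
SAMPLE = """\
[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    12
[/dev/nvme1n1p1].read_io_errs     0
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  3
[/dev/nvme1n1p1].generation_errs  0
"""

if __name__ == "__main__":
    for dev, counters in parse_device_stats(SAMPLE).items():
        print(f"{dev}: {counters}")
```

Any nonzero counter on one member of a mirrored pool is a hint that the device dropped or returned bad data at some point, and that a scrub is worth running. A script like this could be run on a schedule and wired to a notification, but how that is done on a given server is left open here.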
November 27, 2023 Author So when the original event occurred, the cache pool had gone read-only. After some reading here, even though the UI showed I had tons of free space, I deleted a large test VM I didn't care about and then ran a scrub. The scrub reported that it made a bunch of corrections (posted above). The errors seemed to stop after that. Now I'm getting the loop3 errors when I try to start some of the VMs. The GUI appears to be complaining about TPM. Anyway, I disabled Docker and VM and ran a scrub again, which came back clean. Re-enabled Docker and VM. Tried to start a problem VM and still get an error. unraid-diagnostics-20231127-0932.zip
December 4, 2023 Author Just to follow up, I have recreated libvirt.img and all errors have cleared up. Lost one VM to corruption, but was able to recover my important ones. I'm still feeling uneasy about the original issue. At this point I'm assuming it was due to not having a scrub scheduled, and if that is the case, shouldn't it be enabled by default? I could have sworn I had a Dynamix plugin for that at some point. I did upgrade the cache drives from mirrored 500GB to mirrored 2TB a few months back and had plenty of free space. More recently I upgraded from 6.11 to 6.12. If anyone can offer any insight as to what might have happened, I'd appreciate it. Just want to make sure I'm covering the bases now. Edited December 4, 2023 by WashingtonMatt
December 4, 2023 Community Expert 12 minutes ago, WashingtonMatt said: "I'm still feeling uneasy about the original issue. At this point I'm assuming it was due to not having Scrub scheduled," This was basically caused by the device dropping offline. Take a look at the link I posted above: if you monitor the pool, you will know when a scrub is needed. Also note the info about NOCOW shares. If you still have any, data on them cannot be fixed, and until a few releases ago the system shares were NOCOW by default.
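To see whether any shares on the pool still carry the NOCOW attribute mentioned above, one option is to look for the `C` flag in `lsattr` output. The sketch below is a hedged example and not an Unraid-specific tool: the function names are invented for illustration, and the paths under `/mnt/cache` in the sample are assumptions about a typical layout.

```python
import subprocess


def nocow_paths(lsattr_output):
    """Return the paths whose lsattr flag field contains 'C' (No_COW)."""
    paths = []
    for line in lsattr_output.splitlines():
        parts = line.split(None, 1)  # flags field, then the path
        if len(parts) == 2 and "C" in parts[0]:
            paths.append(parts[1])
    return paths


def list_nocow(directory):
    """Run `lsattr` on the entries of a directory and return the NOCOW ones."""
    result = subprocess.run(
        ["lsattr", directory], capture_output=True, text=True,
    )
    return nocow_paths(result.stdout)


# Illustrative output only: hypothetical share paths.
SAMPLE = """\
---------------C------ /mnt/cache/domains
---------------------- /mnt/cache/appdata
---------------C------ /mnt/cache/system
"""

if __name__ == "__main__":
    for path in nocow_paths(SAMPLE):
        print("NOCOW:", path)
```

The check uses an uppercase `C` on purpose, since lowercase `c` in `lsattr` output means compression rather than NOCOW. BTRFS keeps no data checksums for NOCOW files, which is why a scrub cannot detect or repair corruption in them.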