[6.9.2] Cache drive won't mount - bad tree block start

iamatesla · July 1, 2021

I've been trying to get a Win10 VM to work on unRAID all day and just encountered a major issue out of nowhere with the cache drive.

I had just got the VM working with a pass-through 5700XT GPU. I was in the win10 VM (it lives on a brand new 2TB nvme drive) and I had just resized it (using unRAID gui button at VM tab) from 30GB to 80GB so I could install the AMD GPU drivers - I used win10 disk manager inside the VM to expand into the new storage - I installed GPU drivers - all good.

Restarted the VM and boom - the entire box crashed and showed a segfault on the separate unRAID terminal monitor I had connected. I hard re-booted and the system came up, but now cache drive won't mount and is giving the following errors:

FYI - pic is somewhat confusing. I manually tied to play with the fs type as I thought my cache might have been xfs and unRAID was trying to mount it with btrfs, but changing and re-spinning up the array had no effect on the core problem.

unraidProblem.PNG.2d45a56d7e0edbba66e539e1ae4c5ae3.PNG

I'm attaching the system logs i have. Any help is greatly appreciated!

tower-diagnostics-20210630-2101.zip

JorgeB · July 1, 2021

There are some recovery options here, first one to try for this would be the one with "ro,notreelog,nologreplay".

iamatesla · July 1, 2021

Thanks! that was helpful.

I *think* my initial drive corruption issue is resolved. I was able to mount the problem nvme drive with the command you recommended. I ended up installing an extra nvme drive in my server, mounting it, and was able to copy over the data from the old drive. I then installed the new drive as the only cache drive and it's working as expected with no errors. My dockers are back up luckily!

I might go back and re-format the original 2TB nvme, copy the data back over, re-install, and try it as a cache drive again when I have more time. I don't think the drive itself caused this problem.

I'm still hitting just about every problem I can imagine trying to get my Win10 VM to actually work with a passed through GPU but that may necessitate a separate post. Core issue there seems to be unraid not wanting to release the GPU to the VM completely? Tried moving to motherboard 2nd GPU slot, etc... with no luck. Win10 worked with the passed-through GPU up until I installed actual AMD drivers.... I've got a Nvidia Quadro in slot 1 that should be being used by unraid for everything else...

JorgeB · July 1, 2021

2 minutes ago, iamatesla said:

get my Win10 VM to actually work with a passed through GPU but that may necessitate a separate post.

Yeah, can't really help with that, you should create a new post about it in the KVM subforum.

iamatesla · July 1, 2021

Unfortunately I spoke too soon.

Things were mostly good, and I was watching the dockers and all of the sudden I started getting errors about "read-only file system" in the docker logs and they wouldn't start back up.

I tried to reboot and now the completely different NVME drive is acting the exact same way as the first - won't mount. Nothing working again...

moreErrors1.PNG.e9a21237b7aa00e4398352942e6e6e65.PNG

A new log should be attached.

tower-diagnostics-20210701-1136.zip

No clue what to do now. I could keep re-building the drive as earlier, but I'm afraid this problem is going to keep appearing. I was trying to get the win10 VM to work... maybe there is something crazy wrong there that keeps causing this, and if I just nuke the VM and start from scratch it might be ok?

JorgeB · July 1, 2021

There fs corruption again:

Jul  1 11:33:13 Tower kernel: BTRFS error (device dm-5): bad tree block start, want 204523241472 have 7362525906269925244

This is a new filesystem? If so you likely have a hardware issue, start by running memtest.

iamatesla · July 1, 2021

This is neat - I Tried to run Memtest86+ from the GRUB screen before unraid boots... and I get a blank screen for 10s and it loops back to the boot screen as if nothing was clicked...

Unraid does boot.

Gonna try to rebuild the NVMe as I did at the start, but gonna leave VMs disabled and see if I see the problem again. If I don't, I think that means my VM is somehow causing the corruption...

and yes - I fresh formatted that 2nd NVMe just an hour ago before I moved the corrupted drive's data over to it.

Edited July 1, 2021 by iamatesla

JorgeB · July 1, 2021

1 minute ago, iamatesla said:

and I get a blank screen for 10s and it loops back to the boot screen as if nothing was clicked...

Make sure you're booting CSM/legacy, memtest doesn't support UEFI boot.

iamatesla · July 1, 2021

I enabled CSM/legacy in my BIOS - still has the same problem - maybe there's another hidden bios setting i'm missing but who knows.

I did run the memory check built into my MSI motherboard BIOS - it found no errors on any of my drives.

I have a 3rd NVME in the box now with the copied/recovered cache drive data on it. It's mounted and working again. I'm NOT going to enable the win10 vm this time and make sure the system is stable for a few hours before more testing.

My current theory is that somehow increasing the win10 VM disk size from 30GB to 80GB somehow led to a memory/corruption issue. Is there any way just changing 30 to 80 in the VM GUI could have a corrupting effect on the cache file system where the VM disk lives?

I think the next thing to do if the system stays stable is to burn my current win10 vm image and config to the ground and start over.

JorgeB · July 2, 2021

12 hours ago, iamatesla said:

Is there any way just changing 30 to 80 in the VM GUI could have a corrupting effect on the cache file system where the VM disk lives?

Don't see how this could happen.

itimpi · July 2, 2021

17 hours ago, iamatesla said:

Is there any way just changing 30 to 80 in the VM GUI could have a corrupting effect on the cache file system where the VM disk lives?

Only if doing so meant that you ran out of space on the cache drive as BTRFS file system seems prone to corruption if the free space is exhausted. I can see it not being immediately obvious a vdisks are created as ‘sparse’ files which means they do not use all the space allocated until the code in the VM writes to parts of the vdisk file not currently being used. This is just a theory though so no idea if it could apply in your scenario.

[6.9.2] Cache drive won't mount - bad tree block start

Recommended Posts

iamatesla

Link to comment

JorgeB

Link to comment

iamatesla

Link to comment

JorgeB

Link to comment

iamatesla

Link to comment

JorgeB

Link to comment

iamatesla

Link to comment

JorgeB

Link to comment

iamatesla

Link to comment

JorgeB

Link to comment

itimpi

Link to comment

Join the conversation