
[6.9.2] Cache drive won't mount - bad tree block start


Recommended Posts

I've been trying to get a Win10 VM to work on unRAID all day and just encountered a major issue out of nowhere with the cache drive.

 

I had just gotten the VM working with a passed-through 5700 XT GPU. I was in the Win10 VM (it lives on a brand new 2TB NVMe drive) and had just resized it (using the unRAID GUI button on the VM tab) from 30GB to 80GB so I could install the AMD GPU drivers. I used Win10 Disk Management inside the VM to expand into the new storage, installed the GPU drivers, and all was good.
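(For reference, I believe the GUI resize basically just grows the vdisk image file in place; from the terminal it would be roughly equivalent to something like the following, where the path is only an example for a default domains share and the VM has to be shut down first:)

    # check the current virtual size of the vdisk (example path, adjust for your VM)
    qemu-img info /mnt/user/domains/Windows10/vdisk1.img
    # grow the vdisk to 80G - only do this with the VM shut down
    qemu-img resize /mnt/user/domains/Windows10/vdisk1.img 80G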

 

Restarted the VM and boom - the entire box crashed and showed a segfault on the separate unRAID terminal monitor I had connected. I hard rebooted and the system came up, but now the cache drive won't mount and is giving the following errors:

 

FYI - the pic is somewhat confusing. I manually tried to play with the fs type because I thought my cache might have been XFS and unRAID was trying to mount it as btrfs, but changing it and re-spinning up the array had no effect on the core problem.
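In hindsight, rather than guessing at the fs type I could have just checked what is actually on the partition from the terminal; something like this should show it (device name is an example, yours may differ):

    # show the filesystem type (btrfs/xfs) actually on the cache partition
    blkid /dev/nvme0n1p1
    # or list every block device with its detected filesystem
    lsblk -f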

 

[Screenshot attached: unraidProblem.PNG]

 

I'm attaching the system logs I have. Any help is greatly appreciated!

 

tower-diagnostics-20210630-2101.zip

Link to comment

Thanks! That was helpful.

 

I *think* my initial drive corruption issue is resolved. I was able to mount the problem NVMe drive with the command you recommended. I ended up installing an extra NVMe drive in my server, mounting it, and copying the data over from the old drive. I then installed the new drive as the only cache drive and it's working as expected with no errors. Luckily my dockers are back up!
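For anyone who finds this thread later: I don't want to misquote the exact command I was given, but the general recovery approach was along these lines (device names and mount points are examples, adjust for your system):

    # read-only mount of the damaged btrfs cache, falling back to an older tree root
    mkdir -p /temp_mount
    mount -o ro,usebackuproot /dev/nvme0n1p1 /temp_mount
    # copy everything off to the replacement drive (here mounted via Unassigned Devices)
    rsync -av /temp_mount/ /mnt/disks/new_nvme/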

 

I might go back and reformat the original 2TB NVMe, copy the data back over, reinstall it, and try it as a cache drive again when I have more time. I don't think the drive itself caused this problem.

 

I'm still hitting just about every problem I can imagine trying to get my Win10 VM to actually work with a passed-through GPU, but that may necessitate a separate post. The core issue there seems to be unRAID not wanting to completely release the GPU to the VM. I tried moving it to the motherboard's second GPU slot, etc., with no luck. Win10 worked with the passed-through GPU right up until I installed the actual AMD drivers... I've got an Nvidia Quadro in slot 1 that should be the one unRAID uses for everything else...

 

[Screenshot attached: newError.PNG]

 

 

Link to comment

Unfortunately I spoke too soon.

 

Things were mostly good, and I was watching the dockers when all of a sudden I started getting "read-only file system" errors in the docker logs and they wouldn't start back up.
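If anyone is chasing the same symptom: btrfs flips the filesystem read-only when it hits corruption, and the kernel log plus the per-device error counters are where it shows up. Something like this should surface it (the mount point is my cache pool, adjust as needed):

    # look for btrfs errors in the kernel log
    dmesg | grep -i btrfs
    # per-device error counters for the cache pool
    btrfs device stats /mnt/cache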

 

I tried to reboot, and now this completely different NVMe drive is acting exactly the same way as the first - it won't mount. Nothing is working again...

 

[Screenshot attached: moreErrors2.PNG]

 

[Screenshot attached: moreErrors1.PNG]

 

A new log should be attached.

tower-diagnostics-20210701-1136.zip

 

No clue what to do now. I could keep rebuilding the drive as before, but I'm afraid this problem is going to keep coming back. I was trying to get the Win10 VM to work... maybe there is something badly wrong there that keeps causing this, and if I just nuke the VM and start from scratch it might be OK?

 

 

Link to comment

This is neat - I tried to run Memtest86+ from the GRUB screen before unRAID boots... I get a blank screen for about 10 seconds and then it loops back to the boot screen as if nothing was clicked...

 

Unraid does boot.

 

Gonna try to rebuild the NVMe as I did at the start, but gonna leave VMs disabled and see whether the problem shows up again. If it doesn't, I think that means my VM is somehow causing the corruption...

 

And yes - I fresh-formatted that 2nd NVMe just an hour ago, before I moved the corrupted drive's data over to it.
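While I wait to see whether the problem comes back, I'll probably sanity-check the rebuilt cache every so often with something like this (the mount point assumes the default cache pool path):

    # verify checksums across the cache pool (runs in the background)
    btrfs scrub start /mnt/cache
    # check progress / results later
    btrfs scrub status /mnt/cache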

Edited by iamatesla
Link to comment

I enabled CSM/legacy boot in my BIOS - it still has the same problem - maybe there's another hidden BIOS setting I'm missing, but who knows.

 

I did run the memory check built into my MSI motherboard's BIOS - it found no errors on any of my RAM sticks.

 

I have a 3rd NVMe in the box now with the copied/recovered cache drive data on it. It's mounted and working again. I'm NOT going to enable the Win10 VM this time, and I'll make sure the system stays stable for a few hours before more testing.

 

My current theory is that increasing the Win10 VM disk size from 30GB to 80GB somehow led to a memory/corruption issue. Is there any way just changing 30 to 80 in the VM GUI could have a corrupting effect on the cache file system where the VM disk lives?

 

I think the next thing to do, if the system stays stable, is to burn my current Win10 VM image and config to the ground and start over.

Link to comment
17 hours ago, iamatesla said:

Is there any way just changing 30 to 80 in the VM GUI could have a corrupting effect on the cache file system where the VM disk lives?

Only if doing so meant that you ran out of space on the cache drive, as the BTRFS file system seems prone to corruption if free space is exhausted. It may not be immediately obvious, but vdisks are created as 'sparse' files, which means they do not use all of the space allocated until the code in the VM writes to parts of the vdisk file not currently being used. This is just a theory though, so no idea if it applies in your scenario.
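If you want to check whether that could have happened, you could compare the apparent size of the vdisk with what it actually occupies on disk, and look at how full the pool really is; for example (the vdisk path is just an illustration):

    # apparent (allocated) size vs real space used by the sparse vdisk
    ls -lh /mnt/user/domains/Windows10/vdisk1.img
    du -h /mnt/user/domains/Windows10/vdisk1.img
    # how much free space the btrfs cache pool actually has
    btrfs filesystem usage /mnt/cache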

Link to comment
