VM Randomly Refuses to Start


Hi, I've been using Unraid for a while now and have always wanted to get an Ubuntu VM running reliably. However, whenever I reboot the VM to change a setting, it often fails to come back up. Sometimes the GPU simply refuses to be passed through: I can SSH into Ubuntu, so I know it's booting, but the screen stays black. Other times Ubuntu doesn't boot at all.

 

I really wish I could do more concrete debugging and test one variable at a time, but between my unusual setup and Unraid only sometimes booting the VM, I can't really, so I'm just going to give you everything that has happened thus far.

 

I do have two GPUs installed. However, the GPU I want to use with the VM (a Gigabyte 780 GHz Edition) is also the one Unraid boots on. I've tried every option for moving the second card to a different slot, but I can't move the Gigabyte card itself because it is hardline watercooled (one of my biggest regrets with this build). Still, I know GPU passthrough is possible, since the VM has started just fine on multiple occasions.

 

Also, it seems that my vm debugging always follows this pattern:

Create the VM and everything works -> decide to restart it, and after either (a) modifying the VM XML or (b) rebooting the VM itself, the VM is broken.

Now, a couple of scenarios have already happened.

1. Only a black screen, but I can still ssh into the vm.

2. The VM doesn't boot at all, and I can't SSH into it. This typically happens after a few failed reboot attempts, so maybe I'm just breaking something by rebooting so many times.

 

At this point, to try to fix it, I go into the XML and switch the graphics from the GPU to VNC. Then either:

1. I can vnc into the machine

    a. I just shut it off, change back to my Gigabyte card, and everything is dandy

    b. Nothing changes, and I still can't fix anything.

2. VNC is stuck on waiting for guest display (or something like that)

 

The most reliable "fix" I have is to create a new VM template that points at the exact same virtual drive location. This is pretty much my go-to if all else fails. I did this a couple of days ago but didn't delete the broken VM template. Today the new template broke, and I "fixed" it by switching back to the old template that wasn't working a couple of days ago.

 

Also, it really does seem that 

 

 

So here are my final questions really:

1. If I've been able to boot with my Gigabyte card previously, is it fair to assume that I've passed it through successfully? Sometimes I don't think I've even given it a ROM file to boot from, and it still starts, even as the primary GPU.

2. Is it common to have a VM not start after a shutdown?

3. Where can I get more detailed logs for the KVM VMs?

4. What can I read up on to help me fix these issues?

 

I really do want to get this thing working and use it daily, and not have to worry about whether it will turn on or not the next day. 

 

 

 

8 hours ago, martinpetrov1568 said:

2. Is it common to have a vm not start after a shutdown?

What you describe is often an issue on newer AMD cards, which have a reset problem: if a VM uses the card and then shuts down or restarts, the card can't be reset correctly and ends up in an unrecoverable state. The only solution in that case is to restart the whole server. On Nvidia cards this is rare; it mostly happens if not all devices of the card are passed through. Always hand over the HDMI audio portion of the card as well, plus the USB devices if the card has them. If the VM won't start anymore, check the logs and post them. Also try restarting the server in that case. Does the VM boot up again?
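As a rough way to check the "pass through every device of the card" advice, here is a minimal Python sketch that reads a libvirt domain XML and reports functions on the same PCI slot that are not listed as `<hostdev>` entries. The sample XML, the address list, and the function names are hypothetical; on a real host the address list could come from the entry names in `/sys/bus/pci/devices`.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def passed_through_functions(domain_xml):
    """Map each host PCI slot (domain, bus, slot) to the set of functions
    that appear as <hostdev type='pci'> sources in the domain XML."""
    root = ET.fromstring(domain_xml.strip())
    slots = defaultdict(set)
    for addr in root.findall("./devices/hostdev[@type='pci']/source/address"):
        key = tuple(int(addr.get(k, "0"), 16) for k in ("domain", "bus", "slot"))
        slots[key].add(int(addr.get("function", "0"), 16))
    return dict(slots)


def missing_functions(domain_xml, host_addresses):
    """Return host PCI addresses (e.g. '0000:01:00.1') that share a slot
    with a passed-through device but are not passed through themselves."""
    passed = passed_through_functions(domain_xml)
    missing = []
    for name in host_addresses:
        dom, bus, rest = name.split(":")
        slot, func = rest.split(".")
        key = (int(dom, 16), int(bus, 16), int(slot, 16))
        if key in passed and int(func, 16) not in passed[key]:
            missing.append(name)
    return sorted(missing)


# Hypothetical example: only function 0 (the GPU) is in the VM definition,
# while the host also exposes function 1 (typically the HDMI audio device).
SAMPLE_XML = """
<domain type='kvm'>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
  </devices>
</domain>
"""
HOST_ADDRESSES = ["0000:01:00.0", "0000:01:00.1", "0000:02:00.0"]

print(missing_functions(SAMPLE_XML, HOST_ADDRESSES))
```

An empty list would suggest that every function of the passed-through slot is handed to the VM; it says nothing about whether the card resets cleanly.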

 

A wrong VBIOS can also be an issue. Make sure you use the one for the exact model and revision of your card, or dump it yourself and remove the header as SpaceInvader described in his video.
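For a quick sanity check of a dumped ROM file, here is a small Python sketch. It relies on one fact, that a PCI expansion ROM image begins with the bytes `0x55 0xAA`, and only reports where that pair first appears. Treat the result as a hint: the first match may be coincidental, and the sketch does not reproduce the header-removal procedure from the video. The function names and the example bytes are made up for illustration.

```python
ROM_SIGNATURE = b"\x55\xaa"  # PCI expansion ROM images start with these two bytes


def rom_signature_offset(data):
    """Return the offset of the first 0x55 0xAA pair, or -1 if there is none."""
    return data.find(ROM_SIGNATURE)


def describe_rom(data):
    """Give a short, human-readable hint about the start of a ROM dump."""
    offset = rom_signature_offset(data)
    if offset == 0:
        return "starts with the 0x55AA signature"
    if offset > 0:
        return f"{offset} bytes precede the first 0x55AA pair (possibly a vendor header)"
    return "no 0x55AA signature found"


# Hypothetical data: 4 filler bytes standing in for a header, then the signature.
example = b"NVGI" + ROM_SIGNATURE + b"\x00" * 8
print(describe_rom(example))
```

To inspect a real dump, read it with `open(path, "rb").read()` and pass the bytes to `describe_rom`; the sketch never modifies the file.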

 

Always have a display or an HDMI dummy plug connected to your card so the card thinks a display is attached. Some cards won't initialise without one.

 

8 hours ago, martinpetrov1568 said:

3. Where can I get more detailed logs for the kvm's. 

For each VM you can access the logs of the most recent boots. Simply click your VM to open the submenu with start/stop/VNC etc.; there is a log entry there. You can also find the logs in the full diagnostics (Tools >>> Diagnostics).
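If you prefer to look at the same logs from a terminal, a minimal Python sketch like the one below can pull out the lines that look like errors. It assumes the per-VM QEMU log lives at `/var/log/libvirt/qemu/<vm name>.log`, which is libvirt's usual default but is an assumption for any particular server; the keyword pattern and function names are likewise only illustrative.

```python
import re
from pathlib import Path

# Assumed keywords; adjust to whatever shows up in your own logs.
ERROR_PATTERN = re.compile(r"error|fail|vfio", re.IGNORECASE)


def interesting_lines(log_text, pattern=ERROR_PATTERN):
    """Return the lines of a log that match the pattern, in original order."""
    return [line for line in log_text.splitlines() if pattern.search(line)]


def scan_vm_log(vm_name, log_dir="/var/log/libvirt/qemu"):
    """Read <log_dir>/<vm_name>.log (assumed location) and filter its lines."""
    path = Path(log_dir) / f"{vm_name}.log"
    return interesting_lines(path.read_text(errors="replace"))


# Example with an in-memory log snippet instead of a real file.
sample = "starting up\ninternal error: End of file from qemu monitor\nshutting down"
for line in interesting_lines(sample):
    print(line)
```

Calling `scan_vm_log("Ubuntu")` would do the same for a VM named "Ubuntu" (a placeholder name), provided the log file exists at the assumed path.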

 

Yeah, so I've seen the logs on the VM page, but didn't know about the diagnostics logs. I'm assuming the relevant logs would be the libvirt logs? Right now it only has logs from the past 2 hours, but there is one error that keeps showing up:

qemuMonitorIO:720 : internal error: End of file from qemu monitor

 

However, my VM hasn't been restarted in the past 12 hours and is working just fine, so this error might be completely unrelated.

 

You mention using an HDMI dummy plug so that the card thinks a display is attached. This might actually be very promising. Currently I have an LG 34UC97, which I use for both the VM and my laptop (via Thunderbolt). It also has a USB hub that my mouse and keyboard are connected to. This way, if the monitor is on the Thunderbolt input, I can use it with my laptop, and if it's on the DisplayPort input, it goes to my VM. I know for a fact that the USB hub won't be picked up by my VM while I'm on Thunderbolt. (I passed through a separate USB port to the VM, so the VM losing mouse/keyboard when I switch to Thunderbolt isn't the issue.)

 

I guess my point is that the monitor might behave like its built-in USB hub: when it is on a different input, the card may not see a display as connected. This might actually be what's causing my VM to fail to boot, since I usually try to start the VM while still using my laptop with the display on the Thunderbolt input. However, once my VM fails to boot, I usually unplug my laptop, keep the display on the DisplayPort input, and try to restart the VM, and it still fails.

 

tldr:

Logs aren't showing much, but will keep an eye out for the libvirt logs

I am passing through the entire IOMMU group of the GPU (just the GPU itself as well as the sound card)

Usually when I power on my VM, I'm using my laptop, which is connected to the same monitor, so the monitor is on a different input than the one the VM uses and the VM's GPU has no active display. Once the VM fails to start, I typically switch the monitor to the input the VM uses, but subsequent restarts of the VM are sometimes still unsuccessful. Could the GPU have become stuck in some limbo state because no display was connected the first time?
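To double-check the IOMMU group point above, a short Python sketch like this can list which PCI devices share a group, using the Linux sysfs layout `/sys/kernel/iommu_groups/<group>/devices/<address>`. The function names are hypothetical, and the address in the usage line is only a placeholder for the GPU's real address.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group number: sorted list of PCI addresses} read from sysfs.
    Returns an empty dict if the directory does not exist (IOMMU disabled
    or not a Linux host)."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        return groups
    for group_dir in base.iterdir():
        devices = group_dir / "devices"
        if group_dir.name.isdigit() and devices.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices.iterdir())
    return groups


def group_of(address, groups):
    """Return the group number containing the given PCI address, or None."""
    for number, members in groups.items():
        if address in members:
            return number
    return None


if __name__ == "__main__":
    groups = iommu_groups()
    gpu_group = group_of("0000:01:00.0", groups)  # placeholder address
    if gpu_group is not None:
        print(gpu_group, groups[gpu_group])
```

If the printed group contains anything besides the GPU and its audio function, those extra devices would normally need to be passed through (or otherwise isolated) as well.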

 

I do agree that it could be something with the VBIOS of the card. But could it really be the result of using a wrong VBIOS? If the VBIOS were wrong, wouldn't the VM never boot successfully? I think it has sometimes worked even when I hadn't provided a VBIOS at all.
