rbroberts Posted December 28, 2021

I have an Ubuntu 20.04 VM I use for image processing. It would benefit from GPU passthrough, and I have two GPUs just sitting in the machine doing nothing right now. I've seen a number of trouble reports about getting GPU passthrough to work correctly with 6.9 (I'm running 6.9.2), but I thought I'd give it a shot.

I was not particularly surprised when the VM didn't seem to boot; I couldn't ssh into the host. Then a normal shutdown of the VM failed, so I had to do a force stop. I removed the GPU passthrough and rebooted, planning to check the host logs. But... it never came up. Connecting via VNC, I saw only "no bootable device". The obvious check, that the vdisk is still configured, was the first thing I did. It is, and the file is there.

It's not a horrible loss if I have to rebuild from scratch; the software for image processing lives on that VM, but the data lives in the array. But I'd like to understand what exactly happened here and how to recover, not least because, without being able to log in to check the host logs, I don't see how I can diagnose what failed with the GPU passthrough.
rbroberts (Author) Posted December 29, 2021

Well, the only thing I'm certain of at this point is that attempting GPU passthrough is the kiss of death for your VM. The vdisk is still usable, but something ends up scrambled in the XML. I may try this one more time with a before-and-after snapshot of the XML to see what gets broken there.
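For the record, this is roughly the snapshot-and-diff workflow I have in mind. It's just a sketch; "imageproc" is a placeholder VM name, and `virsh` is the standard libvirt CLI available from the unraid console:

```shell
#!/bin/sh
# Sketch: capture the VM's libvirt XML before and after a change, then diff.
# "imageproc" is a placeholder name; substitute your VM's actual name.
snapshot_xml() {
  # $1: VM name, $2: output file
  virsh dumpxml "$1" > "$2"
}

snapshot_xml imageproc before.xml   # before touching the GPU settings
# ...toggle GPU passthrough in the VM template and apply...
snapshot_xml imageproc after.xml
diff -u before.xml after.xml        # whatever got scrambled shows up here
```

Whatever unraid rewrites behind the scenes should show up in the diff, even if I can't interpret all of it.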
rbroberts (Author) Posted December 29, 2021

Because I can't quite leave this alone, I kept poking. It looks like this is down to which VGA card I selected, or at least where in the box it sits.

My motherboard is a Supermicro X9DAi with 2 Xeon E5-2695 CPUs and 512 GB RAM. The board has no onboard video. I have three GPUs installed: two Nvidia GTX 1060s and one GT 710. Alas, I have no control over which GPU the motherboard picks as primary; it seems to pick the lowest slot number and I can't override that. I had planned on setting up the GT 710 as the video for unraid and using the other two for the dockers, but the 710 is in a higher-numbered slot and the spacing on the motherboard is such that I can't move it to a lower one. To have full access to all the PCIe slots you must have both CPUs installed, which I do.

I had been using the nvidia module in unraid because I had dockers that were using the GPUs. That's no longer true, so I removed it and rebooted. Alas, that didn't seem to matter.

Here's what I found by experiment:
- If I pass through the GTX 1060 at 03:00.0, it works. This is the one unraid has also selected as the primary.
- If I select the GTX 1060 at 82:00.0, the VM starts but hangs someplace before the network comes up, and there is no logging to figure out what's happening.
- If I select the GT 710 at 04:00.0 and the GTX 1060 at 82:00.0, then both cards work.

Now the part that confuses me: I believe physical CPU0 controls three of the slots and physical CPU1 controls the other three. The VM has assigned cores from both physical CPUs, but it doesn't seem to matter whether I pick cores only from CPU0 or only from CPU1; the VM still won't access the card at 82:00.0 unless it is also accessing the card at 04:00.0. And it really is that card: if I give the VM both of the 1060s, it still hangs and I can't connect. It also seems to matter that the GT 710 is the first GPU.

I also get an error when swapping things around.
It pops up once when trying to start the VM; the second attempt to start the VM is fine.

2021-12-29 21:04:48.813+0000: 20242: error : qemuProcessReportLogError:2097 : internal error: qemu unexpectedly closed the monitor: 2021-12-29T21:04:48.770188Z qemu-system-x86_64: -device pcie-pci-bridge,id=pci.8,bus=pci.1,addr=0x0: Bus 'pci.1' not found

There really seems to be something stateful about the changes: it seems to remember not just the current setting, but something about previous settings. After the last hour of messing with this, my UEFI VM is in fact booting, but it has lost its network connection; connecting now via VNC, it has no network devices.

I've spent most of today going through permutations, but I don't understand QEMU well enough to read through the configs and figure out what got changed between these iterations and what it means. I've got a configuration I can make work, and I think I'm going to call it quits 😕
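For anyone trying to decode that error: the `pci.N` bus names come from the libvirt domain XML. Each `<controller type='pci' index='N'>` element becomes bus `pci.N` on the qemu command line, and other devices attach to it via their `<address>` element's `bus` attribute. The fragment below is illustrative, not my actual config, but the element names follow the libvirt schema:

```xml
<!-- Illustrative libvirt fragment: index='1' here is what qemu calls
     bus pci.1 on its command line. -->
<controller type='pci' index='1' model='pcie-root-port'/>

<!-- This bridge claims to plug into bus 0x01 (= pci.1). If the controller
     above gets dropped during an edit while this element survives, qemu
     fails with exactly "Bus 'pci.1' not found". -->
<controller type='pci' index='8' model='pcie-to-pci-bridge'>
  <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</controller>
```

That would also explain the apparent statefulness: editing the VM can leave a device or bridge pointing at a controller that no longer exists until the whole topology gets regenerated.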
rbroberts (Author) Posted December 29, 2021

More searching, and I stumbled across this. I suspect I need to do the same thing but, near as I can tell, there is no equivalent setting in my BIOS 😞
SimonF Posted December 30, 2021

Not sure if it is enabled by default, but do you have NUMA enabled? https://www.thegeekdiary.com/centos-rhel-how-to-find-if-numa-configuration-is-enabled-or-disabled/

It's in the ACPI settings in the BIOS. I have an X9DR3 with 2x E5-2697 v2.
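Two quick checks from the unraid shell can answer both questions (is NUMA on, and which CPU really owns the card at 82:00.0) without rebooting into the BIOS. This is a sketch, not unraid-specific tooling; the helpers are parameterized only so they can be exercised against canned input, and the PCI addresses are the ones from the earlier posts:

```shell
#!/bin/sh
# Check 1: how many NUMA nodes does the kernel see?  With both sockets
# populated and NUMA enabled in the BIOS ACPI settings, this should be 2;
# a single node suggests NUMA (or node interleaving) is configured away.
numa_node_count() {
  # $1: text of `lscpu` output
  printf '%s\n' "$1" | awk -F: '/^NUMA node\(s\)/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Check 2: which NUMA node owns a given PCI device?  A value of -1 means
# the kernel got no locality info for it.  SYSFS_ROOT is overridable
# purely so the function can be tested against a fake sysfs tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/bus/pci/devices}"
pci_numa_node() {
  # $1: full PCI address, e.g. 0000:82:00.0
  cat "$SYSFS_ROOT/$1/numa_node" 2>/dev/null || echo "unknown"
}

numa_node_count "$(lscpu)"
for dev in 0000:03:00.0 0000:04:00.0 0000:82:00.0; do
  echo "$dev is on NUMA node $(pci_numa_node "$dev")"
done
```

If 82:00.0 reports node 1 while 03:00.0 and 04:00.0 report node 0, that would at least confirm the problem card really does hang off the second CPU.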