Win 11 VM w. GPU passthrough - randomly loses GPU / NIC - won't boot - new IP


WoRie


Hi,

 

I've set up (multiple times now) a win 11 gaming vm, and passed through an Nvidia GPU.

 

During initial setup, everything works fine, even through restarts of the VM or host. However, after shutting down unraid completely and starting it up again, I have the following issues:

 

- The VM won't come up at all. The logs show no issues and the status reads "started", but I can't connect or see any display output. The (installed) guest agent reports no IP.

- After some forced reboots, the guest agent suddenly reports a new IP (despite being set to static inside the VM). Upon RDPing in, a new network is detected inside the VM, with the IP again assigned by DHCP. Also, the GPU is missing from the VM completely. Everything else seems to work fine.

 

Any ideas what is happening here? I've already set up this same VM three times. Today, upon booting up my NAS again, everything is broken again.

 

As far as I can tell, the logs show no issues. 

wonas-diagnostics-20230208-1056.zip


1.

Feb  8 01:37:18 WoNas kernel: pci 0000:01:00.0: BAR 1: assigned to efifb

You need video=efifb:off in your syslinux config.
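For reference, the kernel append line in syslinux.cfg would then look something like this (the label and the rest of the append line are illustrative; keep whatever other parameters your config already has):

```
label Unraid OS
  kernel /bzimage
  append video=efifb:off initrd=/bzroot
```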


2. Since your Nvidia card is marked as boot VGA:

Feb  8 01:37:18 WoNas kernel: pci 0000:01:00.0: vgaarb: setting as boot VGA device (overriding previous)

you need to pass a vBIOS.

 

3. Multifunction for the GPU:

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <rom file='/PATH/TO/YOUR/VBIOS.rom'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0' multifunction='on'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
    </hostdev>

 

Edited by ghost82

Hi @ghost82

 

thanks for the swift response! 

 

I've now added the parameter to syslinux, and I also set my boot VGA to the internal one, resulting in the following log entries (00:02.0 is the iGPU, 01:00.0 the Nvidia GPU, and 05:00.0 the BMC/IPMI GPU):

 

root@WoNas:~# dmesg | grep vgaarb
[    0.798180] pci 0000:00:02.0: vgaarb: setting as boot VGA device
[    0.798180] pci 0000:00:02.0: vgaarb: bridge control possible
[    0.798180] pci 0000:00:02.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[    0.798180] pci 0000:01:00.0: vgaarb: bridge control possible
[    0.798180] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[    0.798180] pci 0000:05:00.0: vgaarb: setting as boot VGA device (overriding previous)
[    0.798180] pci 0000:05:00.0: vgaarb: bridge control possible
[    0.798180] pci 0000:05:00.0: vgaarb: VGA device added: decodes=io+mem,owns=io+mem,locks=none
[    0.798180] vgaarb: loaded
[   36.494784] vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[   37.046855] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[   37.046859] vfio-pci 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none

 

With this, do I still need to add a vbios? 

 

I've also added multifunction='on'.

 

The VM will still not boot. I'll try to reinstall it again, dump the vBIOS, and then see if these settings did the trick.

 

Thanks again for your help!


Ok, now it's somehow even worse...

 

As stated, I added the entry in syslinux, I bound the GPU to VFIO (and rebooted, of course), I added the multifunction='on' part to the XML file and hit update, and I even extracted the vBIOS (150 KB in size, so it should be valid) from the card.

 

Extracting the vBIOS with SpaceinvaderOne's script was only possible when the card was not bound to VFIO, btw.
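As a quick sanity check on a dump like this: a valid PCI option ROM starts with the magic bytes 0x55 0xAA. A minimal sketch (the `check_rom` helper and the /tmp paths are hypothetical, not part of SpaceinvaderOne's script):

```shell
# Hypothetical helper: verify a dumped vBIOS file begins with the
# PCI option-ROM magic bytes 0x55 0xAA before pointing the VM XML at it.
check_rom() {
    [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "55aa" ]
}

# Demo on a throwaway file so the snippet is self-contained
# (run it against your real dump instead):
printf '\125\252' > /tmp/demo.rom   # octal for bytes 0x55 0xAA
if check_rom /tmp/demo.rom; then
    echo "ROM header OK"
else
    echo "ROM header invalid - dump may be bad"
fi
```

If the dump also contains Nvidia's UEFI header junk at the start, the magic bytes won't be first and the file usually needs trimming before use.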

 

But now, as soon as I add the card to the vm, either as primary or as secondary, the VNC console will just show "The guest has not initialized the display (yet)" and as far as I can tell, the VM won't boot. No IP is being assigned.

 

Do I need to mess with unsafe interrupts and ACS overrides? I already tried both settings and each individually, sadly to no avail.

 

There is a new GPU in the mail, but it is the same chipset, although a different vendor. 

 

 


So, to update on this:

 

- The "The guest has not initialized the display (yet)" error went away after I reset my BIOS to defaults and disabled Resizable BAR.

 

Afterwards, everything worked, even without VFIO binding, the syslinux parameter, and multifunction. I could reinstall the VM, add the graphics card, and even played some games yesterday. Even after cold booting my unraid box with the VM, it still worked fine.

 

However, I just booted my unraid box again, booted the VM, and had normal video output. So everything was fine and the same as yesterday. I had forgotten to pass through my mouse and keyboard to the VM, so I shut it down through unraid, changed the VM settings, and added the mouse and keyboard.

 

And now everything is fubar again :( The VM won't boot at all, mouse/keyboard present or not. A cold reboot of the box also didn't help. Changing the machine type from Q35 v7.1 down to v7.0 lets the VM boot, but again with a new NIC (and therefore a new IP) and no passed-through GPU. Going back to v7.1, the VM won't boot at all.

 

So it is as if the whole hardware subsystem gets messed up. I don't understand why the NIC doesn't persist. It's virtio, but with the same MAC address in all instances, so Windows should normally apply the same IP config inside the VM. But to Windows, it's as if I installed a new NIC. My guess is therefore that the hardware config is somehow not persistent, and in the process, passthrough ceases to work.
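One possible explanation (an assumption on my part, not something I've confirmed): Windows identifies a NIC by its PCI location as well as its MAC, so if the virtio NIC gets a different PCI address between boots, Windows treats it as brand-new hardware and falls back to DHCP. Pinning the `<address>` element of the interface in the XML would rule that out; the MAC and bus/slot values below are illustrative:

```
<interface type='bridge'>
  <mac address='52:54:00:aa:bb:cc'/>   <!-- keep your existing MAC -->
  <source bridge='br0'/>
  <model type='virtio'/>
  <!-- pin the PCI address so Windows sees the same NIC every boot -->
  <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
</interface>
```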

 

Do you have any further ideas where I could look? If I set up the Win 11 VM with i440fx, I can't proceed through setup because the Win 11 requirements are not met.


FWIW, I have this exact issue with both Windows 11 and Windows 10 VMs. I have tried a 1050 Ti and a 1080 Ti. The issue persists identically regardless of hardware. I tried all your steps above as well.

Question for you: when you are able to get your VM to 'work', are you sure it's running off your GPU? When you check Device Manager, is it there and error-free?

Edited by whamp
11 hours ago, whamp said:

FWIW, I have this exact issue with both Windows 11 and Windows 10 VMs. I have tried a 1050 Ti and a 1080 Ti. The issue persists identically regardless of hardware. I tried all your steps above as well.

Question for you: when you are able to get your VM to 'work', are you sure it's running off your GPU? When you check Device Manager, is it there and error-free?

Great to hear that I'm not alone in this, even if there is no solution so far. :)

 

Concerning your question: I passed through an RTX 4090, so the difference in framerate was noticeable ;)

 

I've since narrowed down a process to fix the VM:

 

- Set primary display as VNC, secondary as your GPU. Save the settings

- Edit the VM config again, switch to XML mode, and look up the qxl string, which denotes the VNC virtual GPU. Make sure the bus used is 0x00 and not something else. If it is something else, change it to 0x00 and save. If a message comes up that the slot is already in use, change the corresponding value until you can save and exit.

- Upon booting the VM, you should have video out through VNC

- Inside Windows, it is as if the hardware subsystem detects all hardware anew (again supporting my guess that the hardware setup somehow changes in between). After everything is done, the GPU should again show up in Device Manager (alongside the qxl GPU, which you can throw out afterwards).
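The qxl step above boils down to the video device in the XML looking something like this (the ram/vram and slot values are illustrative; as noted, libvirt will complain if the slot is already taken):

```
<video>
  <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</video>
```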

 

And at least for me, afterwards everything worked as it should. 

 

Something else I tried was to delete the VM (but not the disks) and create a new one referencing the original disk. I was then also able to change the machine type to i440fx. So far, this has also worked.

 

We'll see how long it takes until everything suddenly breaks again. I don't know if a future unraid version with a newer kernel would work better, since with kernel 5.19 or 6.x, hardware passthrough should be more stable (at least that is what I read in the Proxmox changelog).


So, I just had to reset the GPU setup again. The VM wouldn't boot; switching to the virtio GPU with the passed-through one as secondary let me boot into Windows, with the NIC again switching to DHCP. Shutting down the VM, removing the virtio GPU, and switching back to the passed-through GPU works as well.

 

But this is really not something that I would describe as "user friendly" or "rock solid". 

3 hours ago, WoRie said:

removing the virtio and switching to the passed-through GPU works as well

As a general rule, once you use a passthrough GPU, you shouldn't be using the VNC driver at all. If you need a virtual GPU, then you would tend to use RDP etc. from within the VM itself.

 

3 hours ago, WoRie said:

NIC again switching to DHCP.

What happens if, instead of using the virtio network model, you use E1000?
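In the XML, that suggestion amounts to swapping the model line of the existing interface while leaving the MAC and bridge alone (values here are illustrative):

```
<interface type='bridge'>
  <mac address='52:54:00:aa:bb:cc'/>
  <source bridge='br0'/>
  <!-- e1000 emulates an Intel NIC instead of relying on the virtio driver -->
  <model type='e1000'/>
</interface>
```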


Have you had any more luck? I have the same issue: GPU passthrough fails and causes the NIC to fail. Squid, I tried to use E1000 after the issue popped up on my most recent attempt, but it made no difference. I can't find the VM on my network anymore.

I'm sort of at my wits' end here; I feel like I've tried everything. I'm not looking for a gaming VM, I just want Blue Iris to be able to use my GPU for AI object detection...

The only thing I've noticed is that AFTER I switch from VNC to GPU, the VM gets assigned an address outside my home network's range. The VM is now getting assigned

169.254.222.198

while I run 192.168.1.0/24 (169.254.x.x being the APIPA link-local range Windows falls back to when DHCP fails).

Edited by whamp

Out of the blue, the VM has ceased to work twice already.

 

Adding the virtio GPU as primary with the passed-through one as secondary let me boot again. And after shutting down the VM, removing the virtio GPU, and setting the passed-through one as primary, the VM works for some time and through some restarts, until it breaks again.

 

If I can provide any logs that could help, please let me know. 

 

 

