Windows 10 VM crashes/breaks during Nvidia driver install for GPU passthrough.


Go to solution Solved by MustardTiger,

Hi everyone. I've been having an issue where my Windows 10 VM crashes during the install of Nvidia drivers for the GPU I'm passing through. At this point I can boot into a Windows 10 VM with the GPU passed through with the output working (albeit with limited functionality) via HDMI on a monitor. 


Initially, just after Windows 10 has been installed, the display adapter shows as 'Microsoft Basic Display Adapter'. Soon after, the driver automatically updates to an Nvidia driver, but quite an out-of-date one.
Once this driver is installed, I now get this error in Device Manager: "Windows has stopped this device because it has reported problems (Code 43)"

 

I'm still getting output from the GPU. However, not much else works: I can't change the resolution, there's no sound, and GPU-Z, for example, doesn't show all of the information it should. When I have this GPU in the same machine, same PCIe slot, but booted into baremetal, the memory size and other fields are filled in correctly and it functions correctly.

XZFda4L.png

 

It's then at this point I try to install newer Nvidia drivers, and every single time it fails a few minutes into the install. The screen goes black and the VM enters a weird boot cycle that can never start up properly. I've had a look at the logs, but I can't see much that happens when the VM crashes.

 

My machine is an HP DL80 Gen9. My cache drive, where the VMs and libvirt.img are located, is an M.2 NVMe drive attached via a PCIe adapter. I have the HP RMRR patch enabled. I've also found that I needed video=efifb:off appended to be able to pass through the GPU. I've confirmed that VT-d is enabled in the BIOS.

 

Here's a list of things I've tried to fix this:

  • Enabled/Disabled PCIe ACS override + VFIO allow unsafe interrupts
  • Made sure my GPU (+ audio part) is bound to vfio at boot (confirmed in vfio-pci.log)
  • Disabled the embedded video on my motherboard in my BIOS, keeping only the Nvidia GPU enabled.
  • Removed the Nvidia plugin from unraid (in case there was a conflict)
  • Disabled docker.img (again, in case of conflicts)
  • I've tried two different GPUs, and I've also installed both those GPUs on a baremetal Windows 10 install on the same machine that's also running unraid, so I don't think there's a compatibility issue hardware-wise.
  • I've tried the GPU in different PCIe slots:

           I've got a dual CPU setup, with 3 physical PCIe slots for CPU1 and 2 for CPU2, so I've tried different combinations of having the GPU in a PCIe slot which is connected to a specific CPU (which is then pinned to the VM) and vice-versa.
           I've also tried this but with a whole CPU isolated just for the VM. [I'll attach a pic of my topology below]

  • I've tried several different vBIOSes, including ones I've dumped using the SpaceinvaderOne script here (https://github.com/SpaceinvaderOne/Dump_GPU_vBIOS) and also using GPU-Z when booted into a baremetal Windows installation where the GPU is working completely.

 

I'm sure there are other things I have tried but I've just forgotten, so when I remember I will edit this post and add more information. 

 

 

Each time I've installed the Windows VM, I install the VirtIO drivers before I try to install the Nvidia driver. I do notice that in Device Manager (I think when I click 'show hidden devices') there's this:

 vbPbfnI.png

I'm not sure if that could be anything to do with this issue?

 

Also, I've read about MSI interrupts here: https://forums.guru3d.com/threads/windows-line-based-vs-message-signaled-based-interrupts-msi-tool.378044/ and here: https://wiki.unraid.net/Manual/VM_Guest_Support. However, when I follow the instructions and go to the registry key, it does not have any subkeys, so I can't continue. I have not yet tried MSI utility v3, so I'll try that next.

 

My Diagnostics and lstopo topology are attached. If it helps to locate any errors in the logs, I recreated the crash scenario today (15 November): I started the Nvidia install at 14:41 and it crashed at around 14:43.

 

Here my Nvidia GPU is 84:00.0.

The VMs, libvirt, etc. are on PCI 06:00.0 (nvme0n1).
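
To double-check from the Unraid terminal that both GPU functions really are bound to vfio-pci before the VM starts, something like the following could be used. It is only a minimal sketch: the function name pci_driver is made up for illustration, and the optional second argument exists only so the function can be pointed at a different directory than the real sysfs tree.

```shell
# Print the kernel driver bound to a PCI function, or "none" if it is unbound.
#   $1: PCI address including the domain, e.g. 0000:84:00.0
#   $2: directory holding the PCI device entries (optional, defaults to sysfs)
pci_driver() {
    local addr="$1"
    local root="${2:-/sys/bus/pci/devices}"
    local link="$root/$addr/driver"
    if [ -L "$link" ]; then
        # "driver" is a symlink; its last path component is the driver name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

With the addresses from this post, pci_driver 0000:84:00.0 and pci_driver 0000:84:00.1 should both report vfio-pci if the binding worked.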

topology.png

tower-diagnostics-20221115-1447.zip

 

Many thanks in advance!

Edited by MustardTiger
Link to comment

Hi, I had a look at your setup, and from what I saw you set everything up correctly; you are one of the few users who got things right.

Unfortunately I don't have a solution for your use case, but I can say the following:

1. the fwcfg device is not an issue; I have that too (invalid data, code 10) and it isn't causing any problems. You don't need it: that device is used by qemu when you pass kernel parameters or files into the guest (you would add an extra block in the xml for this, which isn't the case here). Maybe if the libvirt xml were set with fwcfg parameters, the data would no longer be incorrect and the device would stop reporting invalid data

2. make sure ASUSGK208edited3.rom was dumped from your card, and if it was dumped with GPU-Z, make sure to remove the nvflash header (I think you already did it); the vbios must start with 55 AA (hex)

3. shared interrupts should not be a blocker: if the gpu is using irqs and these irqs are shared with other devices, all you should see is lag in digital audio or video output. That's why you switch to msi, but it doesn't cause issues with installing drivers

4. 

Quote

Once this driver is installed, I now get this error in Device Manager: "Windows has stopped this device because it has reported problems (Code 43)"

Error 43 usually stands for "nvidia found that you are passing through a gpu to a vm, this is not allowed"; this was the case with old nvidia drivers and consumer gpus. The nvidia drivers installed by windows software update may be an old version, so use the newest nvidia drivers from the nvidia website. Nvidia now allows users to pass through consumer gpus, so no more error 43

5. although you are correctly using video=efifb:off as a kernel parameter to prevent the host from attaching efifb to the gpu, if you check your syslog you can read:

Nov 15 14:30:11 Tower kernel: pci 0000:84:00.0: BAR 1: assigned to efifb

I read about a few cases and I think the issue comes from here.

Unfortunately I don't have a solution for this, could be a kernel bug, a bios bug, or who knows....

You can try to manually detach the gpu from efiframebuffer before starting the vm, with these commands:

echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

But I'm not sure it will work....because if you run 'cat /proc/iomem' efifb should not be reported...

 

Or...you can try to boot unraid in legacy mode instead of uefi: this way you rule out efifb entirely. I would definitely try this..

Note that running unraid in legacy mode doesn't prevent you from running vms in uefi mode, so your vm setup is still correct.
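
A quick way to see which devices the kernel handed to efifb is to search the syslog for the message quoted in point 5. This is a minimal sketch only: the function name efifb_devices is hypothetical, and /var/log/syslog is assumed to be where the host keeps its log.

```shell
# List the PCI addresses that the kernel log reports as "assigned to efifb".
#   $1: log file to scan (optional, assumed default /var/log/syslog)
efifb_devices() {
    local log="${1:-/var/log/syslog}"
    # Matches lines such as:
    #   kernel: pci 0000:84:00.0: BAR 1: assigned to efifb
    # and prints only the PCI address, once per device.
    sed -n 's/.*pci \([0-9a-fA-F:.]*\): BAR [0-9]*: assigned to efifb.*/\1/p' "$log" | sort -u
}
```

If the passthrough gpu shows up in that list, the host framebuffer is still touching it at boot despite video=efifb:off.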

Edited by ghost82
Link to comment
Thank you very much for the detailed response! 

 

With regard to the .rom, it has had the header removed with a hexeditor. I've done it a couple of times actually just to be sure.
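
For anyone who would rather verify this from the terminal than eyeball it in a hex editor, the check for the 55 AA signature can be sketched as below. The function name rom_has_valid_header is made up for illustration, and the check looks only at the first two bytes; it says nothing about whether the rest of the ROM is intact.

```shell
# Succeed (exit status 0) if the file starts with the option ROM signature 55 AA.
#   $1: path to the vBIOS file
rom_has_valid_header() {
    local rom="$1"
    local sig
    # Read the first two bytes and render them as hex with no separators.
    sig="$(head -c 2 "$rom" | od -An -tx1 | tr -d ' \n')"
    [ "$sig" = "55aa" ]
}
```

For example: rom_has_valid_header ASUSGK208edited3.rom && echo "signature ok". A GPU-Z dump that still carries the nvflash header would fail this check, because the 55 AA bytes would not be at the very start of the file.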

 

On 5: I can enter the first line without errors, but for the other two lines I get these errors:

 

root@Tower:~# echo 0 > /sys/class/vtconsole/vtcon0/bind
root@Tower:~# echo 0 > /sys/class/vtconsole/vtcon1/bind
bash: /sys/class/vtconsole/vtcon1/bind: No such file or directory
root@Tower:~# echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind
bash: echo: write error: No such device
root@Tower:~# 
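
Those two errors are most likely just what you get when vtcon1 doesn't exist on the host and efi-framebuffer.0 isn't currently bound to the driver. A guarded version of the same three commands could look like the sketch below. The function name release_efifb is hypothetical, and the optional argument is only there so the function can be tried against a directory other than the real /sys; on the real system it has to run as root.

```shell
# Unbind the virtual consoles and the EFI framebuffer, skipping whatever is absent.
#   $1: sysfs root (optional, defaults to /sys)
release_efifb() {
    local sysroot="${1:-/sys}"
    local con
    # Unbind every virtual console that actually exists (vtcon1 may be missing).
    for con in "$sysroot"/class/vtconsole/vtcon*/bind; do
        [ -e "$con" ] || continue
        echo 0 > "$con"
    done
    local drv="$sysroot/bus/platform/drivers/efi-framebuffer"
    # A bound device shows up as an entry inside the driver directory.
    if [ -e "$drv/efi-framebuffer.0" ]; then
        echo efi-framebuffer.0 > "$drv/unbind"
    else
        echo "efi-framebuffer.0 is not bound; nothing to unbind"
    fi
}
```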

 

Here is the output of 'cat /proc/iomem' in case it's useful:

root@Tower:~# cat /proc/iomem
00000000-00000fff : Reserved
00001000-00092fff : System RAM
00093000-00093fff : Reserved
00094000-0009ffff : System RAM
000a0000-000bffff : PCI Bus 0000:80
000c4000-000cbfff : PCI Bus 0000:00
000f0000-000fffff : System ROM
00100000-6ae80fff : System RAM
  03000000-03c01e07 : Kernel code
  03e00000-041d2fff : Kernel rodata
  04200000-043a463f : Kernel data
  04839000-049fffff : Kernel bss
6ae81000-6b480fff : Reserved
6b481000-6b481fff : System RAM
6b482000-6b502fff : Reserved
6b503000-717b7fff : System RAM
717b8000-717e8fff : Reserved
717e9000-76feb017 : System RAM
76feb018-7700b657 : System RAM
7700b658-7700c017 : System RAM
7700c018-77011457 : System RAM
77011458-77012017 : System RAM
77012018-77052e57 : System RAM
77052e58-77053017 : System RAM
77053018-77093e57 : System RAM
77093e58-77094017 : System RAM
77094018-7709e257 : System RAM
7709e258-7709f017 : System RAM
7709f018-770a4457 : System RAM
770a4458-784fefff : System RAM
784ff000-791fefff : Reserved
  791c9004-791c902f : APEI ERST
  791ca000-791d9fff : APEI ERST
791ff000-7b5fefff : ACPI Non-volatile Storage
7b5ff000-7b7fefff : ACPI Tables
7b7ff000-7b7fffff : System RAM
7b800000-7bffffff : RAM buffer
80000000-8fffffff : PCI MMCONFIG 0000 [bus 00-ff]
  80000000-8fffffff : Reserved
90000000-c7ffbfff : PCI Bus 0000:00
  90000000-92afffff : PCI Bus 0000:01
    90000000-9000ffff : 0000:01:00.2
    92800000-928fffff : 0000:01:00.2
    92900000-929fffff : 0000:01:00.2
    92a00000-92a7ffff : 0000:01:00.2
    92a80000-92a87fff : 0000:01:00.2
    92a8c000-92a8c0ff : 0000:01:00.2
    92a8d000-92a8d1ff : 0000:01:00.0
  92b00000-92dfffff : PCI Bus 0000:02
    92b00000-92bfffff : 0000:02:00.1
      92b00000-92bfffff : igb
    92c00000-92cfffff : 0000:02:00.0
      92c00000-92cfffff : igb
    92d00000-92d03fff : 0000:02:00.1
      92d00000-92d03fff : igb
    92d04000-92d07fff : 0000:02:00.0
      92d04000-92d07fff : igb
    92d80000-92dfffff : 0000:02:00.0
  92e00000-92efffff : PCI Bus 0000:06
    92e00000-92e03fff : 0000:06:00.0
      92e00000-92e03fff : nvme
  92f00000-92f007ff : 0000:00:1f.2
    92f00000-92f007ff : ahci
  92f01000-92f013ff : 0000:00:1d.0
    92f01000-92f013ff : ehci_hcd
  92f02000-92f023ff : 0000:00:1a.0
    92f02000-92f023ff : ehci_hcd
  92f04000-92f047ff : 0000:00:11.4
    92f04000-92f047ff : ahci
  92f05000-92f05fff : 0000:00:05.4
c7ffc000-c7ffcfff : dmar1
c8000000-fbffbfff : PCI Bus 0000:80
  c8000000-d1ffffff : PCI Bus 0000:84
    c8000000-cfffffff : 0000:84:00.0
    d0000000-d1ffffff : 0000:84:00.0
  d2000000-d30fffff : PCI Bus 0000:84
    d2000000-d2ffffff : 0000:84:00.0
    d3000000-d3003fff : 0000:84:00.1
    d3080000-d30fffff : 0000:84:00.0
  d3100000-d3100fff : 0000:80:05.4
fbffc000-fbffcfff : dmar0
fec00000-fecfffff : PNP0003:00
  fec00000-fec003ff : IOAPIC 0
  fec01000-fec013ff : IOAPIC 1
  fec40000-fec403ff : IOAPIC 2
fed00000-fed003ff : HPET 0
  fed00000-fed003ff : PNP0103:00
fed12000-fed1200f : pnp 00:01
fed12010-fed1201f : pnp 00:01
fed1b000-fed1bfff : pnp 00:01
fed1c000-fed3ffff : pnp 00:01
fed45000-fed8bfff : pnp 00:01
fee00000-feefffff : pnp 00:01
  fee00000-fee00fff : Local APIC
ff000000-ffffffff : pnp 00:01
100000000-a7fffffff : System RAM
38000000000-39fffffffff : PCI Bus 0000:00
  39fffe00000-39fffefffff : PCI Bus 0000:02
    39fffe00000-39fffe1ffff : 0000:02:00.1
    39fffe20000-39fffe3ffff : 0000:02:00.1
    39fffe40000-39fffe5ffff : 0000:02:00.0
    39fffe60000-39fffe7ffff : 0000:02:00.0
  39ffff00000-39ffff0ffff : 0000:00:14.0
    39ffff00000-39ffff0ffff : xhci-hcd
  39ffff10000-39ffff13fff : 0000:00:04.7
  39ffff14000-39ffff17fff : 0000:00:04.6
  39ffff18000-39ffff1bfff : 0000:00:04.5
  39ffff1c000-39ffff1ffff : 0000:00:04.4
  39ffff20000-39ffff23fff : 0000:00:04.3
  39ffff24000-39ffff27fff : 0000:00:04.2
  39ffff28000-39ffff2bfff : 0000:00:04.1
  39ffff2c000-39ffff2ffff : 0000:00:04.0
  39ffff31000-39ffff310ff : 0000:00:1f.3
3a000000000-3bfffffffff : PCI Bus 0000:80
  3bffff00000-3bffff03fff : 0000:80:04.7
  3bffff04000-3bffff07fff : 0000:80:04.6
  3bffff08000-3bffff0bfff : 0000:80:04.5
  3bffff0c000-3bffff0ffff : 0000:80:04.4
  3bffff10000-3bffff13fff : 0000:80:04.3
  3bffff14000-3bffff17fff : 0000:80:04.2
  3bffff18000-3bffff1bfff : 0000:80:04.1
  3bffff1c000-3bffff1ffff : 0000:80:04.0

 

I will try your suggestion of booting unraid into legacy mode later on today and I'll report back. Thanks!

Link to comment
5 hours ago, ghost82 said:

As expected, there's no trace of efifb in memory, and from the output of the last command it also seems efifb is not in use, but the syslog is reporting BAR 1 assigned to efifb.

Try the legacy mode; if it doesn't work I really don't know..
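
To make the same check less of an eyeball exercise, the regions of one device and whatever has claimed them can be pulled out of /proc/iomem. A minimal sketch, assuming the indentation-based layout shown in the output above; the function name iomem_claims is made up for illustration.

```shell
# Print the /proc/iomem regions of one PCI function plus any nested claimants.
#   $1: PCI address including the domain, e.g. 0000:84:00.0
#   $2: iomem file to read (optional, defaults to /proc/iomem; run as root)
iomem_claims() {
    local addr="$1"
    local iomem="${2:-/proc/iomem}"
    awk -v addr="$addr" '
        {
            match($0, /^ */)      # RLENGTH is the number of leading spaces
            depth = RLENGTH
            # Lines indented deeper than a matched region are its claimants.
            if (inside && depth > base) { print; next }
            inside = 0
            if (index($0, " : " addr)) { print; inside = 1; base = depth }
        }
    ' "$iomem"
}
```

In the output posted above, the 0000:06:00.0 region has a nested nvme line, while the 0000:84:00.0 regions have nothing nested under them, so nothing like efifb appears to be holding the gpu's memory.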

When you say legacy mode, do you mean change both the setting in unraid and in my BIOS?

 

I changed both, but halfway into booting into unraid I get some critical errors in my iLO log about the GPU: 

Uncorrectable PCI Express Error (Slot 7, Bus 128, Device 3, Function 0, Error status 0x0000002C),
Unrecoverable System Error (NMI) has occurred.  System Firmware will log additional details in a separate IML entry if possible,
PCI Bus Error (Slot 7, Bus 128, Device 3, Function 0).

 

Unraid gets stuck at this point:

fpk65NR.jpg

 

I'm assuming it's probably to do with the extra options I've got in my syslinux config. It's strange because previously when using UEFI the video output would freeze at the very first boot screen (which I guess is a good sign because I don't want unraid using the GPU), but now it shows a lot of the output. I'll have a mess around and see what I can do.

Link to comment

As far as I know that screenshot is normal if you bound the gpu to vfio: from that moment the gpu is not available to the host (the last line logged is vfio-pci getting loaded for the gpu), but unraid will (hopefully) complete the boot.

Try to connect to unraid gui from another device in the lan.

Edited by ghost82
Link to comment
  • Solution

SUCCESS!

 

I've finally got it working 100% now.

To fix the problem with legacy boot mode I had to re-enable the embedded graphics on the motherboard in the BIOS, which then allowed me to boot into unraid. However, I still had the same issues with the VM. I tried a bunch of different things, including enabling Message Signaled Interrupts (MSI), but nothing worked.

 

I eventually found a solution from someone with a similar Gen9 HP server, except they're using Proxmox: https://forum.proxmox.com/threads/gpu-passthrough-issue.109074/post-469825

 

They suggest putting this in syslinux config:

video=simplefb:off

 

I put that in along with intel_iommu=relax_rmrr video=efifb:off video=vesafb:off. I then installed the latest Nvidia drivers whilst in safe mode and they seemed to install fine. For the first time the Nvidia audio drivers had installed. When I rebooted I encountered the same problem I had before, just an endless bootloop. 
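
Since it's easy to edit the syslinux config and then boot a different menu entry, it can be worth confirming that the running kernel actually received each option. A minimal sketch, where the function name has_boot_arg is made up for illustration:

```shell
# Succeed (exit status 0) if the kernel command line contains the exact token.
#   $1: boot argument to look for, e.g. video=simplefb:off
#   $2: command line file (optional, defaults to /proc/cmdline)
has_boot_arg() {
    local arg="$1"
    local cmdline="${2:-/proc/cmdline}"
    local tok
    # Word-split the command line and compare whole tokens only, so that
    # "video=efifb" does not count as a match for "video=efifb:off".
    for tok in $(cat "$cmdline"); do
        [ "$tok" = "$arg" ] && return 0
    done
    return 1
}
```

For example: has_boot_arg video=simplefb:off && echo "simplefb is disabled on this boot".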

 

I realised I'd forgotten to re-enable MSI for the GPU, so I did that in safe mode, rebooted and it's all done! I'm going to see if I can revert back to using UEFI, however, and I'll post back here whether it works or not, in case anyone else is having the same issue.

 

Thank you @ghost82 for helping me out and pointing me in the right direction!

 

Update: I switched back to UEFI in BIOS and in Unraid, and it's all still working fine.

Edited by MustardTiger
Added update
Link to comment

I kind of have the same problem. I have the Asus Z10PE-D16 mobo with an Aspeed GPU onboard. I was using this for my unraid graphical output.

I also have one Nvidia GPU for my Windows VM and one AMD GPU for my macOS VM. Both are bound to vfio.

Now I added a third GPU, a Quadro P600, for my Plex transcoding, and I had to install the Nvidia drivers. I don't have any display connected to the P600. But after rebooting, the output of my Aspeed GPU was gone and had switched to the P600, which I don't like. I am booting in CSM mode, so no UEFI, but I would like to have my unraid output back on my Aspeed GPU.

Link to comment
23 hours ago, ghost82 said:

Can you share a new diagnostics booted with uefi?

I'm just curious to see the syslog and boot arguments.

Sure, here you go: 

tower-diagnostics-20221121-0945.zip

 

EDIT: I decided to try and remove video=vesafb:off from boot config, and GPU passthrough still works fine. Here's updated diagnostics just in case anything changed.

tower-diagnostics-20221121-1109.zip

Edited by MustardTiger
Link to comment
On 11/20/2022 at 9:27 AM, Benedict Eich said:

I kind of have the same problem. I have the Asus Z10PE-D16 mobo with an Aspeed GPU onboard. I was using this for my unraid graphical output.

I also have one Nvidia GPU for my Windows VM and one AMD GPU for my macOS VM. Both are bound to vfio.

Now I added a third GPU, a Quadro P600, for my Plex transcoding, and I had to install the Nvidia drivers. I don't have any display connected to the P600. But after rebooting, the output of my Aspeed GPU was gone and had switched to the P600, which I don't like. I am booting in CSM mode, so no UEFI, but I would like to have my unraid output back on my Aspeed GPU.

Hmm...that sounds like it's a setting you need to change in the mobo BIOS. I've had a quick look at the manual, go to IntelRCSetup --> Miscellaneous Configuration --> Active Video [Offboard Device]. See if you can change a setting there to set your primary output as the Aspeed VGA, rather than the P600. If there's no setting there, have a look through all the other BIOS settings.

 

Also, just checking: do you have the correct pins set on the VGA jumper (page 2-28 in the manual)? https://dlcdnets.asus.com/pub/ASUS/mb/Socket2011-R3/Z10PE-D16/Manual/E13695_Z10PE-D16_Series_UM_V4_WEB.pdf

 

By the way, it's probably a good idea to start your own thread to get some more help if that doesn't work!

Link to comment
On 11/16/2022 at 10:44 PM, MustardTiger said:

To fix the problem with legacy boot mode I had to re-enable the embedded graphics on the motherboard in the bios

I think this was what fixed things: now the embedded gpu is primary, assigned as boot vga by the os.

And efifb is assigned to the internal gpu 01:00.1 (it seems efifb is still used despite the boot arg and I don't know why..), so 84:00.x is really free:
 

Nov 21 10:36:00 Tower kernel: pci 0000:01:00.1: BAR 0: assigned to efifb

I think you can also remove boot arg video=simplefb:off

 

It still remains a mystery to me why, with only one gpu and video=efifb:off as boot arg, efifb attaches to the gpu anyway...

Edited by ghost82
Link to comment
Yep, you're correct. I removed "video=simplefb:off" and all is still working fine.

 

I've just done a test where I've kept all my settings as they are now, so my embedded graphics is enabled, Nvidia GPU bound, my Syslinux cfg is this: intel_iommu=relax_rmrr video=efifb:off isolcpus=4-19,24-39. I installed a new Windows 10 VM, and I still end up getting the error 43, and then when I install the latest Nvidia drivers it crashes again, getting the exact same issue as before. My other VM with working passthrough still works fine, however. 

 

So, I think just having the embedded GPU enabled doesn't fix the problem.

 

So, this leads me to believe that my problem was fixed mainly by installing the Nvidia drivers in safe mode (which I didn't actually try until the time I fixed it), and whilst I'm in safe mode enabling MSI for both functions of the GPU. I'm not sure whether I succeeded in doing this by changing values in regedit or by using MSI utility v3 (which I found linked within the wiki here: https://wiki.unraid.net/Manual/VM_Guest_Support). I'm going to keep messing around till I find out exactly what fixed the problem, because I've seen many people with GPU passthrough problems with HP machines and never a real solution, so hopefully I can pass this information on. 
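
One host-side way to tell which interrupt mode the guest actually ended up with, regardless of whether regedit or the MSI utility made the change, is to look at /proc/interrupts on unraid while the VM is running. This is a minimal sketch: the function name vfio_irq_mode is made up, and it assumes the vfio-pci driver labels its interrupts as vfio-intx(...), vfio-msi[...](...) or vfio-msix[...](...), which is what I'd expect from the Linux driver but haven't confirmed on this exact unraid kernel.

```shell
# Report which interrupt type vfio-pci has set up for a passed-through function.
#   $1: PCI address including the domain, e.g. 0000:84:00.0
#   $2: interrupts file (optional, defaults to /proc/interrupts)
vfio_irq_mode() {
    local addr="$1"
    local file="${2:-/proc/interrupts}"
    local lines
    # Keep only the interrupt lines that mention this device.
    lines="$(grep -F "($addr)" "$file" || true)"
    case "$lines" in
        *vfio-msix*) echo "msix" ;;
        *vfio-msi*)  echo "msi" ;;
        *vfio-intx*) echo "intx" ;;
        *)           echo "unknown" ;;
    esac
}
```

Running it for both 0000:84:00.0 and 0000:84:00.1 would show whether the GPU and its audio function each switched away from line-based interrupts.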

Link to comment
  • 9 months later...

Thanks for this article. I was able to resolve my error 43 issues by using your guide!

 

One note is that I'm seeing stability issues when playing a video game, and it can be any video game.
I've left it idling overnight and it doesn't crash. I tried watching YouTube and it doesn't crash. It only crashes when starting a game and getting into the gameplay.

 

Have you seen this before?

Link to comment
  • 1 month later...
Found that I had to force the PCIe slot to 3.0 instead of Auto, because my board is PCIe 3.0. The GPU kept trying to switch to 4.0 even though it was in a 3.0 slot.
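
To see which link speed the card has actually negotiated on the host, the LnkSta line from lspci -vv can be translated into a PCIe generation. A minimal sketch, with the function name pcie_link_gen made up for illustration; it reads the lspci output on stdin so it can be used with any device address.

```shell
# Read `lspci -vv` output for one device on stdin and print the negotiated
# PCIe generation taken from the first LnkSta line.
pcie_link_gen() {
    awk '
        BEGIN { gen["2.5"] = 1; gen["5"] = 2; gen["8"] = 3; gen["16"] = 4; gen["32"] = 5 }
        /LnkSta:/ && match($0, /Speed [0-9.]+GT\/s/) {
            # Strip the leading "Speed " and the trailing "GT/s".
            speed = substr($0, RSTART + 6, RLENGTH - 10)
            if (speed in gen) print "Gen" gen[speed]
            else print "unknown (" speed "GT/s)"
            exit
        }
    '
}
```

Usage would be along the lines of lspci -vv -s 84:00.0 | pcie_link_gen, run as root so the capability block is readable, and with the address replaced by your own GPU's. Keep in mind that many GPUs drop to a lower link speed when idle, so it's best checked while the card is under load.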

Link to comment
