
[SOLVED] Not able to passthrough GPU (RTX2060-super)



Hi there, 

 

I ran a Win10 VM with my RTX 2060 Super for about 5-10 days without any problems. Playing Battlefield 5 for hours - no problem!!

Then things suddenly got weird. It all started with the GPU fans running without any VM running.

The server was completely idle - nothing was running - but the GPU fans spun up like crazy.

 

As time went by, the VM became more and more inaccessible with every day that passed.

It ended yesterday with a completely useless VM where nothing works. I can't keep the VM running for more than 5 minutes - sometimes it doesn't even start.

As a simple user, I just can't make sense of it.

I swear I changed nothing. No updates to the BIOS, Windows or the GPU. I'm always using the latest released Nvidia driver.

 

Yesterday I spent about 5 hours trying to get it working again: using other machine types (Q35_4.2), formatting the NVMe drive and installing Win10 from scratch.

I also updated Unraid to 6.9 beta25 to see if a kernel update or i440fx_5.0/q35_5.0 had any effect on this weird behavior - but none of it helped.

 

Now I'm completely lost, I don't have a clue, and I only have 5 days left until my return period for the RTX 2060 ends.

Maybe the GPU itself has a problem and I need to take action as soon as possible.

 

It always crashes with:

qemu-system-x86_64: vfio: Unable to power on device, stuck in D3

 

And the dmesg output says:

[ 1013.401059] vfio-pci 0000:01:00.0: enabling device (0000 -> 0003)
[ 1013.509310] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
[ 1013.509336] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
[ 1013.661372] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x19@0x168
[ 1013.661382] vfio-pci 0000:06:00.0: vfio_ecap_init: hiding ecap 0x1e@0x190
[ 1061.488944] vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1064.267343] vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 1064.989023] vfio-pci 0000:01:00.0: timed out waiting for pending transaction; performing function level reset anyway
[ 1066.213019] vfio-pci 0000:01:00.0: not ready 1023ms after FLR; waiting
[ 1067.301028] vfio-pci 0000:01:00.0: not ready 2047ms after FLR; waiting
[ 1069.412907] vfio-pci 0000:01:00.0: not ready 4095ms after FLR; waiting
[ 1073.828999] vfio-pci 0000:01:00.0: not ready 8191ms after FLR; waiting
[ 1082.533098] vfio-pci 0000:01:00.0: not ready 16383ms after FLR; waiting
[ 1099.428947] vfio-pci 0000:01:00.0: not ready 32767ms after FLR; waiting
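
For reference, the card's runtime power state can be read straight out of sysfs while this is happening. This is just a rough check I pieced together (the 0000:01:00.0 address is taken from the dmesg above - adjust it for your own card):

# show the current PCI power state of the GPU (D0 = powered on, D3hot/D3cold = powered down)
cat /sys/bus/pci/devices/0000:01:00.0/power_state

# show the link status the host currently sees for the GPU
lspci -s 01:00.0 -vv | grep -iE 'lnksta|status'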

 

Please find my diagnostics zip attached.

 

Additional information:

I run my VM as a bare-metal installation: Win10 is installed exclusively on my NVMe drive.

My LCD is connected directly via HDMI to the RTX 2060. I'm not using Parsec or any other streaming client.

v1ew-s0urce-diagnostics-20200731-0957.zip


As of today, I can start my VM but the GPU fans immediately start spinning up heavily, leaving the VM in an inaccessible state - only forcing a shutdown and rebooting my Unraid server brings the GPU fans back down.


Update: about 2-3 times I also got a message on my LCD telling me "please power down and connect the PCIe power cable for this graphics card"

😳

But in the meantime the VM started fine - for about 3-5 minutes - and then crashed again, spinning up the GPU fans and becoming unresponsive 😭 (screenshot IMG_3700.jpeg is what I see when I'm connected via RDP and the VM crashes)

 

Please - anyone got an idea?!? 

Thanks 

 

 

IMG_3700.jpeg

IMG_3686.jpeg


1. Do you use a vbios file? Did you dump it yourself or download it from TechPowerup? A wrong vbios can cause all sorts of strange issues.

 

2. Does Unraid boot on the Intel iGPU?

 

3. Please attach a current diagnostics. The one you attached doesn't have any xml in it.

 

4. Maybe an obvious question, but do you have PCIe power (aka the VGA / GPU / graphics card power) connected from your PSU to the card? If so, check that all connections are tight (unplugging and replugging is probably a good idea). Try different cables as well, just in case.


1) I used three different ones - the untouched one from TPU, a "modded" one following THIS GUIDE, and one that I dumped myself.

The dumped one is the one that causes the "please power down and connect the PCIe power cable for this graphics card" error.

The untouched one from TPU ended up with the typical Error 43.

With the modded one the VM won't even boot.

 

2) What do you mean exactly? I'm pretty sure it does, because I deactivated the dGPU in the BIOS, so Unraid can only use the iGPU.

Update: I connected my LCD via HDMI to my mainboard and immediately saw the Unraid console output. So this should be OK.

3) Sorry - I added/removed about 34397 VMs within the last 2 days trying to find a solution :( Here's a new one.

4) I'll double-check it. But IMHO it can't be the problem, because if the GPU didn't have power it wouldn't show up in my VM's Device Manager at all (it does show up, with error 43).

v1ew-s0urce-diagnostics-20200731-1100.zip


1. Did you dump your vbios using the Linux command line? (i.e. it's NOT done using GPUZ).

If so, I would trust that file as the right vbios. No point trying anything else.

But just in case, watch SpaceInvader One tutorial and double check that you have done the right step (link at end of this post) - obviously ignore the primary / secondary portion since you boot Unraid on the iGPU (which would be the "primary").
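In case it's useful, the command line method boils down to reading the card's ROM through sysfs - roughly like this, with the 0000:01:00.0 address assumed from your logs, the dump path taken from your xml, and the VM shut down first (if the read fails, unbinding the card from vfio-pci beforehand usually helps):

# allow the ROM to be read, copy it out, then lock it again
echo 1 > /sys/bus/pci/devices/0000:01:00.0/rom
cat /sys/bus/pci/devices/0000:01:00.0/rom > /mnt/user/isos/rtx2060s.dump
echo 0 > /sys/bus/pci/devices/0000:01:00.0/rom

An Nvidia ROM dumped this way may still need its header trimmed (that's what the "modded" vbios guides are about), which is why the tutorial is worth following step by step.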

 

2a. Did you connect a monitor to the onboard graphics output to be 100% sure Unraid booted on the iGPU?

It's critical to confirm this point. You may have to go as far as making sure the onboard graphics is always connected to a monitor / a dummy HDMI/DP plug.

 

2b. In addition, are you booting Unraid in legacy mode (i.e. Main -> Flash -> "Permit UEFI boot mode" is NOT ticked)?

Booting Unraid in legacy mode and making sure it boots with the iGPU (or a non-passed-through GPU) are both very important.
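
A quick way to double-check which mode the server actually booted in, from the Unraid terminal - /sys/firmware/efi only exists when the kernel was started via UEFI:

# prints UEFI if booted via UEFI, legacy otherwise
[ -d /sys/firmware/efi ] && echo UEFI || echo legacy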

 

3. I made a few edits to the Q35 own vbios xml (presumably that's the one you dumped yourself).

Just use that one and don't bother trying anything else. In fact, your xml is fine on its own.

 

4. So having said the above, I can only conclude that you have a hardware issue.

  • The fact that the card - with what by all indications is the right vbios - tells you something isn't right with the PCIe power points to either the card having gone kaput or (hopefully) a PSU / power connector issue.
    • This is consistent with your description on the other topic - that is things worked for you for a few days and then stopped working. That is how a power issue tends to manifest.
    • I would suggest trying a different cable / different PSU.
  • Error 43 is generic - basically the driver fails to load. It can be because the card itself completely doesn't work, or because the driver refuses to load (i.e. it detects a VM environment). It won't show up if the card is merely unstable (because the driver still loads successfully).

 

i7-6700 + boot Unraid legacy mode + boot Unraid with the iGPU should work fine with the RTX 2060.

Something else is going on and it's a b**** to diagnose hardware issues.

 

SIO tutorial

 

XML:

<domain type='kvm'>
  <name>Windows 10q35_5_own_vBIOS</name>
  <uuid>2c7af116-09ba-6622-ac03-5e73fefb2f3e</uuid>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>16777216</memory>
  <currentMemory unit='KiB'>16777216</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>8</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='4'/>
    <vcpupin vcpu='2' cpuset='1'/>
    <vcpupin vcpu='3' cpuset='5'/>
    <vcpupin vcpu='4' cpuset='2'/>
    <vcpupin vcpu='5' cpuset='6'/>
    <vcpupin vcpu='6' cpuset='3'/>
    <vcpupin vcpu='7' cpuset='7'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-q35-5.0'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/2c7af116-09ba-6622-ac03-5e73fefb2f3e_VARS-pure-efi.fd</nvram>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vendor_id state='on' value='a0123456789b'/>
    </hyperv>
    <kvm>
      <hidden state='on'/>
    </kvm>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' dies='1' cores='4' threads='2'/>
    <cache mode='passthrough'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='hypervclock' present='yes'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/user/isos/virtio-win-0.1.185.iso'/>
      <target dev='hdb' bus='sata'/>
      <readonly/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <controller type='usb' index='0' model='qemu-xhci' ports='15'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-to-pci-bridge'>
      <model name='pcie-pci-bridge'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0x9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0xa'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0xb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x3'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0xc'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0xd'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x5'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='8' port='0xe'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x6'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='9' port='0xf'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x7'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:be:94:98'/>
      <source bridge='br0'/>
      <model type='virtio-net'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <rom file='/mnt/user/isos/rtx2060s.dump'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0' multifunction='on'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x2'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x2'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x3'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x3'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
    </hostdev>
    <memballoon model='none'/>
  </devices>
</domain>

 


@testdasi - first of all - thank you very much for spending your time on my personal mess :) I appreciate it a lot!! 👍

 

1) Yes - I did it exactly as SpaceInvader showed. Except for the primary/secondary part, because I don't have a second dGPU.

But as you said - it shouldn't make any difference for me because my iGPU is always primary!

 

2a) Yes - I added an update to my post above. I connected the mainboard HDMI directly to my LCD and saw the Unraid command line while booting.

Then I detached the HDMI cable and put the HDMI dummy plug back in, as it was before.

 

2b) Yes - I disabled UEFI boot in my BIOS to make sure it always boots in legacy mode.
But as you can see in my screenshot, I had ticked the box. I made a translation error - "permit" is not "prohibit" 🙈 - English is, as you may have realized, not my native language :)

But - spoiler - my VM is running again... I'll explain that below. And I've unticked it now!

 

Bildschirmfoto 2020-07-31 um 12.48.01.png

 

3) Thanks for that. But can you tell me what exactly you edited? I want to understand as much as possible.

BTW - what's the difference between i440fx and Q35?!? I don't get it ;-)

 

4) ...seems you got it!!!
I opened the case, reconnected the cables on the GPU side AND used the other 6+2 connector on my PSU.

Booted up Unraid, started the VM "Windows 10_i440fx_4.2_own_Vbios" - and it's running fine, for about 30 minutes now without any problems (see screenshot).

Maybe the cable had a loose connection, because the connector on the GPU side also sits right next to the side wall of the case.

Later I'll try some gaming to make sure it's stable and reliable, and then give final feedback for the community.

 

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 

IMG_3702.jpg


i440fx (aka Intel 440FX) was released in 1996 (Pentium II era). Q35 was released in 2007 (Core 2 era), so it's much more modern.

The key difference in my experience is that Q35 has better (native) support for PCIe devices, while i440fx essentially emulates PCIe devices as PCI devices.

 

When the original guide for Unraid VMs came out in the 6.0 wiki, the Q35 machine type didn't work well at all (rather unstable IME), so i440fx was the default choice for Windows VMs. It was almost "pick i440fx unless your OS requires Q35".

However, since about Q35-3.0, instability hasn't been an issue anymore, but there was a bug requiring the "PCIe root port patch" to make PCIe run at full speed (otherwise it runs at x1).

The x1 speed bug was fixed in Q35-4.0+, so basically there isn't really a reason to default to i440fx over Q35 anymore.
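
If you ever want to see which machine type versions your QEMU build actually offers, it can list them itself. /usr/local/sbin/qemu is the emulator path from the xml above - the path may differ on other setups, and plain qemu-system-x86_64 works too if it's in the PATH:

# list the q35 and i440fx machine type versions this QEMU supports
/usr/local/sbin/qemu -machine help | grep -E 'q35|i440fx'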

 

Well, there is sort of an unintended reason to use i440fx. Some graphics cards may work better in a VM precisely because i440fx doesn't have full support for PCIe. However, it's now more of a "pick Q35 unless your hardware requires i440fx" situation.

 

Those who started out early with Unraid and have already picked i440fx generally don't have an incentive to change to Q35.

However, for those starting a new template, especially with the latest graphics cards (e.g. RTX), there is really no reason to pick i440fx.

 

In terms of what I edited in your template:

  • Change vendor id to an actual value instead of none, as none sometimes doesn't work
  • Add kvm hidden tag (used to help with some cards but is now generally not needed) - a quick way to double-check both are applied is sketched below
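
A rough way to confirm both edits are active on the running VM is to look at the QEMU process itself. The exact option names vary a bit between QEMU versions, so this just greps for the values (a0123456789b being the vendor id from the xml above):

# with the VM running, check the QEMU command line for the fake vendor id and the kvm-hidden flag
ps -ef | grep [q]emu | tr ',' '\n' | grep -iE 'vendor|kvm=off'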

Thank you for this detailed description - very useful. So I’ll try to use your template when I’m back at home. 
 

9 hours ago, testdasi said:

Change vendor id to an actual value instead of none as none sometimes doesn't work

Ahhh - I noticed a warning in the VFIO log when I updated to 6.9. The warnings say something like "no valid vendor id" and appear for every device I bound via VFIO.
 

@testdasi - as promised - here's the update:

The VM (built on your edits) runs without any problems so far. I just gamed for about 2 hours - flawless :)

Thanks again and have a nice weekend.

 

BTW - here is the exact warning from the VFIO log. I'll give feedback after the next reboot on whether it's gone thanks to your edits.

Loading config from /boot/config/vfio-pci.cfg
BIND=01:00.0 01:00.1 01:00.2 01:00.3 04:00.0 06:00.0
---
Processing 01:00.0
Warning: You did not supply a PCI domain, assuming 0000:01:00.0
Warning: You did not specify a Vendor:Device (vvvv:dddd), unable to validate 0000:01:00.0

IOMMU group members (sans bridges):
/sys/bus/pci/devices/0000:01:00.0/iommu_group/devices/0000:01:00.0

Binding...
Successfully bound the device at 0000:01:00.0 to vfio-pci
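
(For reference, the Vendor:Device (vvvv:dddd) pair the script is asking about is just the ID that lspci prints in brackets. For the GPU at 01:00.0 that would be something like:

lspci -nn -s 01:00.0

The last [xxxx:yyyy] on that output line is the value the warning refers to - the other functions and devices in the BIND list have their own IDs.)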

 

