QEMU PCIe Root Port Patch



Could the following patch please be applied to QEMU (until QEMU 4.0 is bundled with Unraid, as this fix is already present in master)?

 

PCIe root ports are only exposed to VM guests as x1 links, which degrades GPU pass-through performance, and in some cases on higher-end NVIDIA cards the driver doesn't initialise some features of the card.

 

https://patchwork.kernel.org/cover/10683043/

 

Once applied, the following would be added to the VM's XML to change the PCIe root ports into x16 ports:

 

<qemu:commandline>
  <qemu:arg value='-global'/>
  <qemu:arg value='pcie-root-port.speed=8'/>
  <qemu:arg value='-global'/>
  <qemu:arg value='pcie-root-port.width=16'/>
</qemu:commandline>
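
For this to take effect, the QEMU namespace also has to be declared on the root <domain> element of the VM's XML, otherwise libvirt won't accept the <qemu:commandline> block. A minimal sketch of the opening tag (the rest of the domain definition stays exactly as it is):

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  ...
</domain>

As I read the patch series, speed=8 requests an 8 GT/s (PCIe Gen 3) link and width=16 a 16-lane link, which is what the NVIDIA driver wants to see.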

The patch is also well documented over here: https://forum.level1techs.com/t/increasing-vfio-vga-performance/133443

 

This would also improve performance for any other passed-through PCIe devices that need more bandwidth than an x1 port provides (NVMe drives, 10Gb NICs, etc.).

 

If we could have QEMU compiled from master instead of the releases though... that would be even better!

Edited by billington.mark
3 hours ago, billington.mark said:

[…]

Inside this thread he talks about how to do it without the patch. It was a big pain in the damn ass, but I think I got it working.

2 hours ago, 1812 said:

how about you share with the class?

Changed the Windows 10 machine to the q35-3.1 machine type.
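
For reference, the machine type sits in the <os> block of the VM's XML; on a Q35 3.1 machine it would read roughly like this (the arch and loader lines will differ per setup):

<os>
  <type arch='x86_64' machine='pc-q35-3.1'>hvm</type>
  ...
</os>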

Inserted this below the rest of the controllers:

    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='8' port='0x1f'/>
      <alias name='pci.8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1c' function='0x0' multifunction='on'/>
    </controller>

Went to where my hostdevs are and inserted this. Change the address in <source></source> to your GPU's on the first and your GPU audio's on the second, and it should be more or less plug and play.

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <rom file='/mnt/user/domains/1080ti.rom'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x41' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>
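
One note on the addressing, which is my reading of how libvirt wires this up rather than something from the post above: the bus in each hostdev's guest-side <address> selects the PCI controller whose index matches it, so bus='0x03' and bus='0x04' put the GPU and its audio function on the root ports with index 3 and 4. To hang the GPU off the ioh3420 port added above (index 8, alias pci.8) instead, the guest-side address would look something like this:

      <!-- hypothetical example: bus 0x08 attaches the device to the controller with index 8 (pci.8) -->
      <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>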


6 minutes ago, Jerky_san said:

[…]

Thanks for this. I'm wondering if this is a Windows-only issue, as on macOS it reports the correct lane width (at least on one machine I have) and hits at or near bare-metal benchmarks for the GPU. I'll try to play around with it over the next few days!


That's still not fixed (as much as I'd like it to have been that easy!).

Have a look in the NVIDIA Control Panel under system information at the bus in use (I'd put money on it being x1!).

 

[Screenshot from the Level1Techs forum: NVIDIA Control Panel system information]

 

(The image is from the Level1Techs forum, as I'm not at home and can't take a screenshot at the moment.)

 

You can also do a speed test using the EVGA utility: https://forums.evga.com/PCIE-bandwidth-test-cuda-m1972266.aspx

 

 

The patch that adds the ability to set PCIe root port speeds wasn't present in the 3.1 release (which is what we're on as of 6.7.0-rc2).

 

Edited by billington.mark
2 hours ago, billington.mark said:

[…]

Ah shit, you're right... still x1 #_# that's depressing.

 

[Screenshot: NVIDIA Control Panel showing an x1 bus]


Yep, and because of that, the NVIDIA driver is reining in performance.

I don't use macOS, so I'm not sure whether you're able to see this info in the driver... but either way, x1 root ports will be presented to the VM guest regardless of the OS it's running. Depending on what checks the driver does on macOS, it might have different performance implications than on Windows.

2 hours ago, billington.mark said:

[…]

Welp, hope we get it then, or maybe a way to just run the RC with QEMU 4.0 and have a switch that turns it on and off or something.


I am confused

 

PCIe 3.0 x16 should give about 15,500 MByte/s.

 

I ran the linked tool in my Win10 VM guest and it shows me the following speed:

 

[Screenshot: result from the EVGA PCIe bandwidth test]

 

It's off by some degree, but probably OK since it's a VM guest using VT-d?

 

GPU-Z in Win10 reports PCIe 3.0 x16, and when I use the built-in test it switches to x1 when I pause and back to x16 when I continue.

 

In NVIDIA System Information it says:

 

[Screenshot: NVIDIA System Information]

 

Does that mean passthrough works as it should, with the GPU on PCIe 3.0 x16?! 😕

 

Edited by unrateable
2 hours ago, unrateable said:

[…]

I read the whole thing he posted (it was a hell of a lot). Basically, when the VM boots it sets a bunch of registers. Those registers affect how the driver and Windows interact with the card; latency and many other things are impacted, not just speed. Basically, the patch tells the card when it boots, "hey, you're in an x16 slot, so set the registers accordingly!" and so it does.

Edited by Jerky_san
12 hours ago, unrateable said:

[…]

Are you using Q35 or i440fx?

The issue here is that the NVIDIA driver behaves differently if the reported bus width is anything less than x8. Also, latency on the VM as a whole is greatly improved when using Q35 with the patches. It's a long read, but you can see the evolution of these changes in the Level1Techs forum thread I linked in the original post.

6 hours ago, billington.mark said:

[…]

i440fx gives me x0 while Q35 gives me x1, so I assume his is the same.

9 hours ago, billington.mark said:

[…]

I am using i440fx-2.7 without any patch. Still puzzling. Is there somebody here who can run the CLI tool and show their results for a working-as-it-should PCIe 3.0 x16 GPU?

 

43 minutes ago, unrateable said:

[…]

When the VM starts the card, it does a rate negotiation. The rate negotiation eventually works its way up to the proper speed, but because the card only sees x1 when it starts, it sets its registers to that effect. Please read the post he linked if you'd like further detail; the guys in those posts go into great depth about it.

 

4 hours ago, unrateable said:

[…]

i440fx doesn't have any PCIe 'slots' as such; it presents the GPU to the OS on a PCI slot, which again causes latency and a performance hit compared to bare metal.

The CLI tool is to show that when you use Q35, the PCIe root ports are x1, not x16. 

The issue here is that the NVIDIA driver doesn't correctly initialise the card (on Windows, anyway) unless it detects it's in an x8 or x16 slot.

 

The comments on the patch do a good job of explaining what's going on and what's being changed here: https://patchwork.kernel.org/cover/10683043/

 

 

I'm by no means complaining, but if there's a way to improve performance and get as close to bare metal as possible, I think it's worth implementing. 👍

 

Edited by billington.mark
  • 2 weeks later...

I tried a lot of things to improve the performance of my VMs over the last couple of days and stumbled across that Level1Techs forum thread, as I guess everybody here has. Great in-depth information, and I hope Limetech is able to push that fix to us Unraid users as soon as possible 😉

 

GIVE US THE FIX NOOOOOOW

 

Just kidding. Don't push features that aren't tested in your product. In all the time I've been using Unraid, every RC build I've tested (every public RC since early 2018) has been stable for my needs. Sure, there are always performance improvements possible, often on the edge of stability. Always using bleeding-edge technology is fun, and nice for a techie to play with, but for the general user it's often hard to handle. It's hard for @limetech and any other tech company to find a good middle way. I believe in you guys 👍


The original topic of this post was to highlight a particular problem I was having (and still am), but the main underlying point here is that over the last couple of years, development on QEMU, the introduction of new hardware from AMD, and the general love for virtualisation on workstation hardware have meant development in this space is moving at quite a pace.

Short term, a build that includes virtualisation modules from master would make a lot of people happy, but the same thing is inevitably going to happen when 3rd-gen Ryzen, 3rd-gen Threadripper, PCIe 4.0, PCIe 5.0, etc. drop in the coming months.

 

Personally, I think the long-term holy grail here is the ability to choose which branch we run key modules like QEMU, libvirt, and Docker from... and then be able to update and get the latest patches/performance improvements independently of an Unraid release.

 

Short term though... a build to keep us all quiet would be lovely :)
