QEMU PCIe Root Port Patch

billington.mark · January 29, 2019

Please can the following patch be applied to QEMU (until QEMU 4.0 is bundled with Unraid, as this fix is already present in master)?

PCIe root ports are only exposed to VM guests as x1, which degrades GPU pass-through performance, and in some cases on higher-end NVIDIA cards the driver doesn't initialise some features of the card.

https://patchwork.kernel.org/cover/10683043/

Once applied, the following would be added to the VM's XML to turn the PCIe root ports into x16 ports:

<qemu:commandline>
	<qemu:arg value='-global'/>
	<qemu:arg value='pcie-root-port.speed=8'/>
	<qemu:arg value='-global'/>
	<qemu:arg value='pcie-root-port.width=16'/>
</qemu:commandline>
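For anyone scripting this change rather than pasting it by hand, here is a minimal sketch in Python. It assumes the property names `speed` and `width` exactly as written in the post (check them against your own QEMU build), and the function name `add_root_port_globals` and the tiny example domain are made up for illustration. One detail worth knowing: libvirt only accepts `<qemu:commandline>` when the `<domain>` element declares the `qemu` XML namespace, which the serializer below adds automatically.

```python
import xml.etree.ElementTree as ET

# Namespace libvirt requires on <domain> for qemu:commandline passthrough.
QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"
ET.register_namespace("qemu", QEMU_NS)


def add_root_port_globals(domain_xml, speed=8, width=16):
    """Return the domain XML with -global pcie-root-port.* args appended.

    Hypothetical helper: property names follow the forum post, not a
    verified QEMU interface.
    """
    root = ET.fromstring(domain_xml)
    cmdline = root.find(f"{{{QEMU_NS}}}commandline")
    if cmdline is None:
        cmdline = ET.SubElement(root, f"{{{QEMU_NS}}}commandline")
    for prop, value in (("speed", speed), ("width", width)):
        # Each -global switch and its value are separate <qemu:arg> elements.
        ET.SubElement(cmdline, f"{{{QEMU_NS}}}arg", {"value": "-global"})
        ET.SubElement(
            cmdline,
            f"{{{QEMU_NS}}}arg",
            {"value": f"pcie-root-port.{prop}={value}"},
        )
    return ET.tostring(root, encoding="unicode")


# Made-up minimal domain, just enough to exercise the function.
example = "<domain type='kvm'><name>win10</name></domain>"
patched = add_root_port_globals(example)
print(patched)
```

The helper is not idempotent: calling it twice on the same XML appends the arguments twice, so run it once per VM definition.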

Patch is well documented over here too: https://forum.level1techs.com/t/increasing-vfio-vga-performance/133443

This would also increase the performance of any other passed-through PCIe devices that need more bandwidth than an x1 port provides (NVMe, 10Gb NICs, etc.).

If we could have QEMU compiled from master instead of the releases though... that would be even better!

Edited January 29, 2019 by billington.mark

Jerky_san · January 29, 2019

Second on this

Jerky_san · January 29, 2019

3 hours ago, billington.mark said:
Patch is well documented over here too: https://forum.level1techs.com/t/increasing-vfio-vga-performance/133443

Inside this thread he talks about how to do it without the patch. It was a big pain in the damn ass, but I think I got it working...

1812 · January 29, 2019

15 minutes ago, Jerky_san said:

It was a big pain in the damn ass, but I think I got it working...

how about you share with the class?

Jerky_san · January 29, 2019

2 hours ago, 1812 said:

how about you share with the class?

Changed the Windows 10 VM's machine type to q35-3.1.

Inserted this below the rest of the controllers:

    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='ioh3420'/>
      <target chassis='8' port='0x1f'/>
      <alias name='pci.8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1c' function='0x0' multifunction='on'/>
    </controller>

Went to where my hostdevs are and inserted this. Change the address in <source></source> to your GPU's in the first hostdev and to your GPU audio's in the second, and it should be more or less plug and play.

    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <rom file='/mnt/user/domains/1080ti.rom'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x41' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>


1812 · January 29, 2019

Thanks for this. I'm wondering if this is a Windows-only issue, as macOS reports the correct lane width (at least on one machine I have) and hits at or near bare-metal benchmarks for the GPU. I'll try to play around with it over the next few days!

billington.mark · January 29, 2019

That's still not fixed (as much as I'd like for it to have been that easy!).

Have a look in the NVIDIA control panel under system info at the bus in use (I'd put money on it being x1!).

(The image is from the Level1 forum, as I'm not at home and can't take a screenshot currently.)

You can also do a speed test by using the EVGA utility: https://forums.evga.com/PCIE-bandwidth-test-cuda-m1972266.aspx

The patch that adds the ability to set PCIe root port speeds wasn't present in the 3.1 release (which is what we're on, as of 6.7.0-rc2).

Edited January 29, 2019 by billington.mark

GHunter · January 29, 2019

I'd like this as well if this really does work!! This could possibly fix many of the problems some people have with GPU passthrough performance.

Jerky_san · January 29, 2019

Ah shit, you're right... still x1 #_# That's depressing.


billington.mark · January 29, 2019

Yep, and because of that, the NVIDIA driver is reining in performance.

I don't use macOS, so I'm not sure if you're able to see this info in the driver... but in either case, x1 root ports will be presented to the VM guest, regardless of the OS it's running. Depending on what checks the driver does on macOS, it might have different performance implications than on Windows.

Jerky_san · January 29, 2019

Welp, hope we get it then, or maybe a way to just run the RC with QEMU 4.0 and have a switch that turns it on and off or something.

unrateable · January 29, 2019

I am confused

PCIe 3.0 x16 should give about 15,500 MB/s.
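As a rough check on that figure, the theoretical one-direction throughput can be worked out from the per-lane transfer rate and the line encoding. The sketch below is only arithmetic, not a measurement: PCIe 1.x and 2.x use 8b/10b encoding, PCIe 3.0 uses 128b/130b, and real transfers lose more to protocol overhead. The function name `pcie_bandwidth_mb_s` is made up for illustration.

```python
# (transfer rate in GT/s per lane, encoding efficiency) per PCIe generation.
GENERATIONS = {
    1: (2.5, 8 / 10),     # 8b/10b encoding
    2: (5.0, 8 / 10),     # 8b/10b encoding
    3: (8.0, 128 / 130),  # 128b/130b encoding
}


def pcie_bandwidth_mb_s(gen, lanes):
    """Theoretical one-direction payload bandwidth in MB/s (10^6 bytes)."""
    gt_per_s, efficiency = GENERATIONS[gen]
    # GT/s -> Mbit/s of usable data per lane, then bits -> bytes, then lanes.
    return gt_per_s * 1000 * efficiency / 8 * lanes


for gen, lanes in ((3, 16), (3, 1), (1, 1)):
    print(f"PCIe gen{gen} x{lanes}: {pcie_bandwidth_mb_s(gen, lanes):.0f} MB/s")
```

By this arithmetic a gen3 x16 link tops out at roughly 15,754 MB/s, while a gen3 x1 link is about 985 MB/s, so a benchmark result far above 1,000 MB/s means the data path itself is not limited to x1 even if the virtual root port reports it.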

I ran the linked tool in my Win10 VM guest and it shows me the following speed:

[screenshot not preserved]

It's off by some degree, but probably OK since it's a VM guest and VT-d?

GPU-Z in Win10 reports PCIe 3.0 x16, and when I use the built-in test it switches to x1 when I pause and back to x16 when I continue.

In the NVIDIA system info it says:

[screenshot not preserved]

Does that mean passthrough works as it should, with the GPU on PCIe 3.0 x16?! 😕

Edited January 29, 2019 by unrateable

Jerky_san · January 29, 2019

I read the whole thing he posted (it was a hell of a lot). Basically, when the VM boots it sets a bunch of registers. Those registers affect how the driver and Windows interact with the card. Latency and many other things are affected, not just speed. The patch tells the card when it boots, "hey, you're in an x16 slot, so set the registers accordingly!" and so it does.

Edited January 29, 2019 by Jerky_san

m0ngr31 · January 29, 2019

Would be great to have this.

billington.mark · January 30, 2019

Are you using Q35 or i440fx?

The issue here is that the NVIDIA driver behaves differently if the bus reported is anything less than x8. Also, latency on the VM as a whole is greatly improved when using Q35 with the patches. It's a long read, but you can see the evolution of these changes on the Level1Techs forum I linked in the original post.

Jerky_san · January 30, 2019

i440fx gives me x0 while Q35 gives me x1, so I assume his is the same.

unrateable · January 30, 2019

I am using i440fx-2.7 without any patch. Still puzzling. Can somebody here run the CLI tool and show their results for a GPU that is working as intended at PCIe 3.0 x16?

Jerky_san · January 30, 2019

When the VM starts the card, it does a rate negotiation. The negotiation eventually works its way up to the proper rate, but when the card starts it only sees x1, so it sets its registers to that effect. Please read the post he linked if you'd like further detail. The guys in that thread go into great depth about it.
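On a Linux guest (or on the host), the difference between what a link is capable of and what it actually negotiated shows up in `lspci -vv` as the `LnkCap` and `LnkSta` lines. The following is a minimal sketch that pulls speed and width out of such text. The sample input is illustrative only and was not captured from a real system, and `parse_links` is a made-up helper name.

```python
import re

# Matches "LnkCap:" / "LnkSta:" lines only; "LnkCap2:" and "LnkCtl:" are skipped.
LINK_RE = re.compile(r"(LnkCap|LnkSta):.*?Speed ([\d.]+)GT/s.*?Width x(\d+)")


def parse_links(lspci_text):
    """Map 'LnkCap'/'LnkSta' to (speed in GT/s, lane count)."""
    return {
        match.group(1): (float(match.group(2)), int(match.group(3)))
        for match in LINK_RE.finditer(lspci_text)
    }


# Illustrative text in the layout lspci -vv uses for one device.
SAMPLE = """\
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1
        LnkSta: Speed 2.5GT/s, Width x1
"""

links = parse_links(SAMPLE)
print(links)
```

To try it on a real machine, feed it the output of `lspci -vv -s <address>` for a single device (usually as root, since capability details are hidden from unprivileged users). With several devices in the input, later entries overwrite earlier ones.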

billington.mark · January 30, 2019

i440fx doesn't have any PCIe 'slots' as such. It's presenting the GPU to the OS on a PCI slot, which again causes latency and a performance hit compared to bare metal.

The CLI tool is to show that when you use Q35, the PCIe root ports are x1, not x16.

The issue here is that the NVIDIA driver doesn't correctly initialise the card (on Windows anyway) unless it detects it's in an x8 or x16 slot.

The comments on the patch do a good job of explaining what's going on and what's being changed here: https://patchwork.kernel.org/cover/10683043/

I'm by no means complaining, but if there's a way to improve performance and get as close to bare metal as possible, I think it's worth implementing. 👍

Edited January 30, 2019 by billington.mark

bastl · February 13, 2019

I tried a lot of things to improve the performance of my VMs over the last couple of days and stumbled across that Level1Techs forum thread, as I guess everybody here did. Great in-depth information, and I hope limetech is able to push that fix to us Unraid users as soon as possible 😉

GIVE US THE FIX NOOOOOOW

Just kidding. Don't push features if they aren't tested in your product. Since I started using Unraid, all the RC builds I tested (every public RC since early 2018) have been stable for my needs. Sure, there are always performance improvements possible, often on the edge of stability. Always using bleeding-edge technology is fun, and nice for a techie to play with, but for the general user it's often hard to handle. It's hard for @limetech and any other tech company to find a good middle way. I believe in you guys 👍

Tritech · February 13, 2019

Just chiming in to say I am waiting for this and any other threadripper performance related changes as well.

jordanmw · February 13, 2019

Also hoping we get this ported in SOON.

billington.mark · February 14, 2019

The original topic of this post was to highlight a particular problem I was having (and still am), but the main underlying point here is that over the last couple of years, development on QEMU, the introduction of new hardware from AMD, and the general love for virtualisation on workstation hardware have meant development in this space is moving at quite a pace.

Short term, a build which includes virtualisation modules from master would make a lot of people happy, but the same is inevitably going to happen when 3rd gen Ryzen, 3rd gen Threadripper, PCIe 4, PCIe 5, etc. drop in the coming months.

Personally, I think the long-term holy grail here is the ability to choose which branch we run key modules like QEMU, libvirt, and Docker from, and then to update and get the latest patches/performance improvements independently of an Unraid release.

Short term though... a build to keep us all quiet would be lovely.

limetech · February 15, 2019

qemu 3.1.0 with aforementioned patch will be available starting with 6.7.0-rc4.

m0ngr31 · February 15, 2019

That is amazing news. Thanks!
