Low GPU (2080ti) Performance on Win10 VM (Unraid 6.2.7)

sit_rp · August 9, 2019

I recently got a 2080 Ti and have been seeing poor performance compared with bare metal, about 20% less. The only difference is that the bare-metal machine runs a 6700K and the Unraid box a 1700X. I have been reading a bunch of different things but can't narrow it down. I thought I might be bottlenecked by the 1700X, but I noticed frame dips when both CPU and GPU are only at about 65% utilization. When I test with Superposition, minimum frames are much lower on the VM, and Superposition doesn't use much CPU at all. Hoping somebody here can point me in the right direction. I read about the NPT bug, but I thought that was resolved a while back in Linux kernel 4.15. My XML file is below.

EDIT:

Some more information. I am passing through 10 vCPUs (6-15), and all of those cores are isolated. The remaining cores (for Docker and such) are confirmed to be at low utilization during all my tests, so Unraid shouldn't be short of CPU resources for itself. I tried both i440fx and Q35 machine types and get slightly better performance on i440fx. I have only tried the OVMF BIOS; not sure if SeaBIOS will help, but if you guys think it's worth a try, let me know. I made several changes inside the OS: enabled MSI (only for video; I couldn't find audio among the PCI registry entries), set the High Performance power plan, and disabled indexing. The only two things running in the background during my tests are Steam and MSI Afterburner with its overlay. I didn't see any other program using resources; the machine usually idles at about 3% CPU utilization. The drive the VM is on is shared with two Docker containers, but those should put very little load on the disk. The Nvidia card is running the latest driver. Windows is fully up to date (the new anniversary update is not installed).

xml.txt

Edited August 9, 2019 by sit_rp

sit_rp · August 9, 2019

<?xml version='1.0' encoding='UTF-8'?>
<domain type='kvm' id='33'>
  <name>Windows 10</name>
  <uuid>6b147bb8-d6f6-d1f4-a397-eb12977dddeb</uuid>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>12582912</memory>
  <currentMemory unit='KiB'>12582912</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>10</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='6'/>
    <vcpupin vcpu='1' cpuset='7'/>
    <vcpupin vcpu='2' cpuset='8'/>
    <vcpupin vcpu='3' cpuset='9'/>
    <vcpupin vcpu='4' cpuset='10'/>
    <vcpupin vcpu='5' cpuset='11'/>
    <vcpupin vcpu='6' cpuset='12'/>
    <vcpupin vcpu='7' cpuset='13'/>
    <vcpupin vcpu='8' cpuset='14'/>
    <vcpupin vcpu='9' cpuset='15'/>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-3.1'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/6b147bb8-d6f6-d1f4-a397-eb12977dddeb_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vendor_id state='on' value='none'/>
    </hyperv>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='10' threads='1'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='hypervclock' present='yes'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/disks/Samsung_SSD_860_EVO_M2_500GB_S414NB0K708612T/domains/Windows 10/vdisk1.img'/>
      <backingStore/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <alias name='virtio-disk2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/user/isos/virtio-win-0.1.160-1.iso'/>
      <backingStore/>
      <target dev='hdb' bus='ide'/>
      <readonly/>
      <alias name='ide0-0-1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <alias name='usb'/>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <alias name='usb'/>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <alias name='usb'/>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='ide' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:df:10:27'/>
      <source bridge='br0.5'/>
      <target dev='vnet1'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/1'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/1'>
      <source path='/dev/pts/1'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/domain-33-Windows 10/org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'>
      <alias name='input0'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input1'/>
    </input>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x2d' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x2d' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x2d' slot='0x00' function='0x2'/>
      </source>
      <alias name='hostdev2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x2e' slot='0x00' function='0x3'/>
      </source>
      <alias name='hostdev3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x2d' slot='0x00' function='0x3'/>
      </source>
      <alias name='hostdev4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </hostdev>
    <memballoon model='none'/>
  </devices>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+0:+100</label>
    <imagelabel>+0:+100</imagelabel>
  </seclabel>
</domain>

bastl · August 9, 2019

Check your Windows power plan. Switch to High performance instead of Balanced (the default). What CPU clocks does Unraid report under load?

watch grep \"cpu MHz\" /proc/cpuinfo

sit_rp · August 9, 2019

Yeah, my profile is already set to High Performance. I just ran another round of Superposition. Base clock is 2.2GHz and it looks like it boosts to 3.5GHz; hard to tell the average. Superposition doesn't really use that much CPU to begin with, which is why I am mainly testing with it. My 3DMark Fire Strike graphics score is 30000 on the VM and 37000 on bare metal.

bastl · August 9, 2019

@sit_rp Stupid question, I know, but are you using the Nvidia driver or the Windows driver for the card?

sit_rp · August 9, 2019

No such thing as a stupid question when narrowing down an issue.

I am using latest Nvidia (DCH, no geforce experience) driver.

I am really at a loss here. Not sure what to turn off or turn on next. I will try to re-install the driver, but it is usually a pain in the ass, since I am using a TV and not a monitor to game on this thing.

bastl · August 9, 2019

@sit_rp Did you run both tests, bare metal and virtualised, with the same monitor/TV?

testdasi · August 9, 2019

Let's set the expectation straight first.

The 6700K base clock is 4GHz and the 1700X base clock is 3.4GHz. That is a significant difference. GPU-intensive benchmarks (and games) benefit more from a higher base clock.
Bare metal can turbo boost higher since there's a lower load on all cores.
You then have to add virtualisation overhead to the equation. Again, it's not insignificant.

So basically, your bare-metal 6700K vs VM 1700X comparison may not be entirely oranges vs apples, but it's pretty close.

Now in terms of optimisation:

You need to appreciate that a Ryzen CPU contains 2 CCX "glued" together.
- Don't use an odd number of physical cores (e.g. your 10 logical = 5 physical) on Ryzen. An odd number of physical cores ensures that 1 CCX is always overloaded, reducing overall performance. Based on my testing, the lost performance can be as much as 1 core (e.g. 3+3 is just as fast as 3+4).
- Spreading an even number of physical cores evenly across both CCX also helps (so don't do 2+4, do 3+3).
- Check your CPU core numbering scheme so you don't accidentally pin the wrong hyperthreaded pair. BIOS changes have been known to change the numbering scheme (e.g. 0+1 = 1 pair becomes 0+8).
- When you made your Q35 machine type template, did you add the qemu tag (see below) at the end so your emulated PCIe slots run at full PCIe x16 speed?

You need to add this above </domain> for Q35 machine type.

  <qemu:commandline>
    <qemu:arg value='-global'/>
    <qemu:arg value='pcie-root-port.speed=8'/>
    <qemu:arg value='-global'/>
    <qemu:arg value='pcie-root-port.width=16'/>
  </qemu:commandline>
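
A note on placement, as a minimal sketch rather than a tested template: libvirt only accepts a `qemu:commandline` block when the QEMU XML namespace is declared on the root `<domain>` element, so without that declaration the block may be rejected or dropped when the XML is saved. The namespace URI below is the standard libvirt one. The elided middle stands for whatever the VM already contains, and the machine type is assumed to be a Q35 one (for example `pc-q35-3.1`), since `pcie-root-port` devices do not exist on i440fx.

```xml
<!-- Sketch only: shows where the block sits, not a complete domain. -->
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <!-- ... existing name, memory, cputune, os (Q35 machine type assumed),
       features, cpu, clock and devices sections stay unchanged ... -->
  <qemu:commandline>
    <qemu:arg value='-global'/>
    <qemu:arg value='pcie-root-port.speed=8'/>
    <qemu:arg value='-global'/>
    <qemu:arg value='pcie-root-port.width=16'/>
  </qemu:commandline>
</domain>
```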

sit_rp · August 9, 2019

1 minute ago, bastl said:

@sit_rp You did both tests, bare metal and virtualised with the same monitor/tv?

Unfortunately no. Physical machine was connected to a monitor and Unraid server to the TV.

sit_rp · August 9, 2019

Thank you for the reply.

Yes, I understand that single-threaded performance will be better on the 6700K. This is the reason I am leaning towards testing only with Superposition, since it uses very little CPU.

I did notice that I was getting a slightly higher graphics score in 3DMark with 4 physical cores passed through versus 5. The only issue is that most of the games I play do like having an extra 2 threads since, like you mentioned, the 1700X's 3.5GHz boost is rather low (keeping fingers crossed for that 3900X one of these days).

I am usually running on the i440fx machine type, but when I did test Q35, I didn't apply the template above (first time reading about it), so this might be worth a try.

Another quick thing I forgot to mention: ACS Override is off. When I had it enabled before, I think performance was worse, but I don't remember anymore since I have been making a lot of changes, including re-installing Win10 from scratch.

bastl · August 9, 2019

ACS Override is not something you set to gain performance. Its only use case is to split up your IOMMU groups so devices can be separated from each other. 30k vs 37k is a huge difference for the graphics score alone. With the overhead of virtualisation, 1k, maybe 2k, is what you can expect. Disk I/O, for example, shouldn't be the issue: benchmarks and game engines load most of their data at the start. Maybe memory speed is what's causing the difference for you. Are you using the same DIMMs and the same XMP profile for both tests?

sit_rp · August 9, 2019

The tests were performed on two different machines.

As for the Unraid box, I don't recall enabling XMP; I will have to double check. RAM should be running at 3000MHz.

bastl · August 9, 2019

As you may have already noticed, it's kind of hard to compare two systems when every spec is different. Memory speed and latency are a huge thing on first-gen Ryzen. As testdasi already mentioned, the CCX design and the communication between the CCXs is the next thing you have to factor in: different cores for a VM can lead to different memory speeds and latency. Next, which slot are you using for the GPU? Some aren't connected directly to the CPU, and a slot wired to the chipset can limit the speed of the PCIe lanes. x16 vs x8 slots shouldn't be an issue, but using only 4 chipset lanes shared with other devices (USB, network, storage) will bottleneck the GPU.

sit_rp · August 9, 2019

Yeah, I know it's hard to troubleshoot this way. Unfortunately I don't have a choice at the moment.

The GPU is in the first PCIe slot. It should be running at x16, but I will have to double check.

sit_rp · August 12, 2019

So I am making some progress here. I added the PCI Express values for the Q35 machine but didn't see any increase in performance. I did notice that the Nvidia Control Panel used to report PCI Express x1, but after adding the block above to the XML file it reports x16 Gen 3. So that's good.

Now CPU pinning. I have been reading some threads and experimenting with CPU assignment. For this test I turned off all my other VMs and Docker containers. If I use the 8 even cores, my 3DMark graphics score goes up to 32000, which is an extra 2000. No change in CPU score.

If I use the 8 odd cores, my 3DMark graphics score stays at 32000, but my CPU score goes up by 2000 points. That is a better score than when I had 10 vCPUs assigned, but this time with only 8.

At this point I am trying to determine a proper way to figure out physical core numbering and their relative logical cores. I will be doing some reading, but so far my motherboard manual has no such information.

If you guys know a decent way to figure this out please let me know. I am going to continue experiment with core assignment...
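
One way to read the pairing without trial and error, sketched under assumptions: libvirt reports the host topology in the output of `virsh capabilities`, where each logical CPU carries a `siblings` attribute listing the threads that share its physical core. The same information is exposed by the kernel in `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`. The excerpt below only shows the shape of that section; the ids and sibling values are illustrative and were not read from the 1700X in this thread.

```xml
<!-- Illustrative shape of the host topology section of `virsh capabilities`.
     All values are hypothetical; other child elements are omitted. -->
<topology>
  <cells num='1'>
    <cell id='0'>
      <cpus num='16'>
        <cpu id='0' socket_id='0' core_id='0' siblings='0-1'/>
        <cpu id='1' socket_id='0' core_id='0' siblings='0-1'/>
        <cpu id='2' socket_id='0' core_id='1' siblings='2-3'/>
        <cpu id='3' socket_id='0' core_id='1' siblings='2-3'/>
        <!-- ... the remaining logical CPUs follow the same pattern ... -->
      </cpus>
    </cell>
  </cells>
</topology>
```

Two ids that share the same `siblings` value are the two threads of one physical core, so they should either be pinned together or only one of each pair should be used.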

Also, I purchased another 1tb M.2 drive. I will install fresh copy of windows on it so I can do some proper testing. This way we don't have to compare 6700k 4.5ghz 3200mhz RAM machine with 1700x 3.5ghz 3000mhz RAM one.

At the end of the day, I am pretty set on getting a 3900X when the new BIOS comes out with VFIO fixes. That's if I can find one in stock...

I will update this thread as I make more progress.

Edited August 12, 2019 by sit_rp

testdasi · August 12, 2019

If I understand what you are doing:

Your 8 even cores = 0 2 4 etc. That covers core 0 which Unraid tends to use for its own things so naturally it will have a bit lower performance than 1 3 5 etc.
Your 8 even / odd cores should by itself be more powerful than your previous 10 cores because your 10 cores = 5 physical cores with hyper threading and your 8 even / odd cores = 8 physical cores without hyper threading. The latter should always be faster than the former, especially since you turned off all other activities i.e. dockers and VMs.

Based on what you reported, it looks like your pairing is 0 + 1 = 1 pair. You can double check that on the Unraid dashboard.

Btw, what you did is one way to maximize performance. I did a similar thing (pinning 3 out of every 4 odd cores to my workstation VM) + use Process Lasso (app recommended by Wendell the Level1Tech guy) to ensure that my games and Plex and work stuff generally don't overlap.
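
For reference, the "one thread per physical core" layout described above could look like this in the domain XML. This is a sketch, not a tested config: it assumes the pairing really is 0+1, 2+3 and so on, and that the odd-numbered threads are the ones isolated from Unraid, so the numbers should be checked against the actual host before use.

```xml
<!-- Sketch: 8 vCPUs, each pinned to the odd thread of a different physical core.
     Assumes sibling pairs are 0+1, 2+3, ... 14+15 on this host. -->
<vcpu placement='static'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='1'/>
  <vcpupin vcpu='1' cpuset='3'/>
  <vcpupin vcpu='2' cpuset='5'/>
  <vcpupin vcpu='3' cpuset='7'/>
  <vcpupin vcpu='4' cpuset='9'/>
  <vcpupin vcpu='5' cpuset='11'/>
  <vcpupin vcpu='6' cpuset='13'/>
  <vcpupin vcpu='7' cpuset='15'/>
</cputune>
<!-- The guest sees 8 cores with no SMT, matching what is actually pinned. -->
<cpu mode='host-passthrough' check='none'>
  <topology sockets='1' cores='8' threads='1'/>
</cpu>
```

Declaring `threads='1'` keeps the guest scheduler from treating two vCPUs as siblings when they in fact sit on separate physical cores.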

sit_rp · August 12, 2019

The only thing that bothers me is that, according to Unraid, all odd cores are supposed to be hyper-threaded and all even cores physical. Looking at the testing I did, it appears that the odd cores are in fact physical and the even ones hyper-threaded.

What I don't understand is why I am getting better performance using all even cores (hyper-threaded) versus using 4 physical cores along with their 4 logical cores. So to summarize:

Low performance (physical along with their logical, according to the Unraid dashboard): 8-9, 10-11, 12-13, 14-15

Better performance (all logical, in theory): 0-2, 4-6, 8-10, 12-14

Best so far (all physical, in theory): 1-3, 5-7, 9-11, 13-15

Something about this doesn't make sense to me. I am just hoping that cores are not assigned in a manner where physical core 0 is matched with logical core 3, for example. Assigning all threaded cores to the VM (all even) should give lower performance than assigning physical cores plus their corresponding threads (even+odd).

Is there any specific way to figure this out, or only by testing each pair?

sit_rp · August 12, 2019

On top of that, I just noticed that even though I have all odd cores isolated, I still see some utilization on those cores while the machine assigned to them is down. Would this be a sign of the odd cores being hyper-threaded? I am getting more and more confused here...

testdasi · August 12, 2019

You misunderstood physical vs logical cores.

All of your cores 0 to 15 as you see them in Unraid are logical cores, i.e. with hyper-threading.
The physical cores are not numbered. They are inferred from the logical core pairing, e.g. 0 + 1 = 1 physical core.

Think of your CPU as 8 cars chained together in a train.

They can be front wheel drive (FWD), rear wheel drive (RWD) or all wheel drive (AWD).

Each car is a physical core. Each pair of wheels is a logical core.
So your low performance config is equivalent to having only 4 cars driving the train, with each car running in AWD mode. You deliver a lot of power per car but only half of the overall maximum theoretical power, since only half of the cars run.
In the better config, all 8 cars run but in FWD mode. Naturally you get more power. However, the front wheels of the 1st car are also used for steering, so you lose a bit of power.
In the best config, all 8 cars run in RWD mode. Same amount of power as the better config, but because the rear wheels are not used for steering, you use the maximum power.
Of course, bare metal is all 8 cars running in AWD mode.

Edited August 12, 2019 by testdasi

sit_rp · August 12, 2019

Hmm, interesting. I was always under the impression that 4 cars in AWD should be faster than 8 cars in RWD. I guess I was wrong.

It sucks that I can't fully isolate odd cores to be used in Win10 VM only. I have them isolated now and definitely see other VMs using them even though they are isolated and other VMs are pinned to even cores.

testdasi · August 12, 2019

Hyper threading is essentially just glorified smart queue management to increase the chance that once the (physical) core is done on its current task, another task is already primed up and available to work on. It does NOT mean both tasks are done in parallel.

The automobile analogy may not be immediately clear to non-petrolheads so maybe it's easier to imagine your CPU as 8 workers, each having 2 apprentices.

Each apprentice collects the necessary materials and puts them in a basket for the worker to assemble.
The assembling takes more time than collection.
A worker immediately works on assembling if the materials are readily collected but has to wait if the apprentice is still collecting material (or because there's nothing to collect).

I think it probably is more obvious why having 4 workers with 8 apprentices is slower than 8 workers with 8 apprentices.

With regards to isolation, it does not mean 0% load all of the time (but rather 0% load most of the time). Back to the apprentice analogy.

You isolate the odd apprentice just to deal with VM work, but the apprentice is dumb: he doesn't know whether the materials handed to him are "VM" or not.
So if the odd apprentice is handed some non-VM materials, he still hands it over to the worker, who then looks at it and says "yo odd apprentice, you aren't supposed to deal with this, hand it back to someone else" (Digression: I have to do this with my interns all the time)
Now if the worker is overloaded because the even apprentice keeps on giving him work to do, then the odd apprentice will have to wait for his turn, which will record on the CPU usage measurement as "load".

The above is the reason why if you assign an isolated core to a docker, you will get 100% load only on that single core and nothing on other cores. It's because the apprentice keeps on handing non-VM stuff to the worker who keeps on having to tell the apprentice to hand it to someone else.

Of course, the above assumes you have assigned and isolated cores correctly.

sit_rp · August 13, 2019

I think I forgot to mention that I didn’t dump gpu bios and it’s not configured in the XML file. I never had any errors passing 2080ti, so I never bothered with that. Can that be the problem? Is it necessary to configure?

testdasi · August 13, 2019

From my experience, the vbios only improves stability but not performance but then I don't exactly have a top range GPU like yours.

The general consensus seems to consider vbios as a fix and not a necessity.

I recommend that, if you can, you dump your own vbios and use it regardless of whether there's any issue. It's better to remove a variable than to constantly wonder whether the variable affects you or not.
However, I do NOT recommend downloading a vbios from the web and editing it, unless you know exactly what you are doing (no, following SpaceInvaderOne's guide is not knowing exactly what you are doing). I have noticed people causing their own pass-through problems through incorrect application of the SIO method.
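
For anyone who does dump their own vbios, this is roughly how it is wired into the domain XML: a `<rom>` element inside the GPU's `<hostdev>`. The sketch below reuses the first `hostdev` entry from the XML posted earlier in the thread, on the assumption that function `0x0` is the video function; the file path and name are hypothetical and must point at the dump taken from this exact card.

```xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x2d' slot='0x00' function='0x0'/>
  </source>
  <!-- Hypothetical path: replace with the location of your own dumped vbios. -->
  <rom file='/mnt/user/isos/vbios/rtx2080ti-dump.rom'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</hostdev>
```

Only the video function needs the `<rom>` line; the audio and USB functions of the card stay as they are.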

sit_rp · August 17, 2019

Alright. So I was able to run some benchmarks on bare metal and it appears that I am already getting pretty close to the native performance. It's crazy how much my old 6700k is better for gaming than 1700x.

At this point I am pretty positive that the CPU is the bottleneck. Playing around with pinning and isolation is definitely worth it. I have been thoroughly reading the thread below. Exactly the same behavior on my machine.

Thank you for your help! Hopefully 3900x will help me to get more power out of my 2080 ti.
