Huge performance drop off compared to baremetal.



I am trying to get more CPU performance out of my Win10 VM. I have noticed that in CPU-intensive games performance is quite lackluster, albeit the system is quite old. The CPU is a mobile Sandy Bridge i7, 4 cores/8 threads, 3.2 GHz all-core turbo and 3.5 GHz single-core. I have passed 12 GB of RAM, dual-channel DDR3 1600 MHz. The GPU is a 980 Ti 6 GB with the NVIDIA 446 driver.

 

On baremetal with Spectre mitigations off, I get a CPU-Z single-thread benchmark score of ~330. The PassMark v9 CPU Mark total score is ~7600. This is the GTA5 benchmark result at 1600x900, low settings:

Frames Per Second (Higher is better) Min, Max, Avg
Pass 0, 62.167404, 119.561455, 102.000427
Pass 1, 94.056564, 165.343918, 139.509125
Pass 2, 77.531998, 155.506470, 125.293236
Pass 3, 89.976601, 162.171799, 136.439087
Pass 4, 48.572926, 200.867737, 125.084503
Time in milliseconds(ms). (Lower is better). Min, Max, Avg
Pass 0, 8.363899, 16.085600, 9.803881
Pass 1, 6.048000, 10.631900, 7.167990
Pass 2, 6.430601, 12.897901, 7.981277
Pass 3, 6.166300, 11.114000, 7.329278
Pass 4, 4.978400, 20.587601, 7.994596

In the VM I am using Q35-v4.2 with OVMF, CPU host passthrough with cache passthrough, Hyper-V = yes, and Spectre mitigations off on both host and VM. I get ~260-270 CPU-Z single thread. Interestingly, the PassMark CPU Mark score only drops to ~7300. These are the GTA5 benchmark results at the same settings:

Frames Per Second (Higher is better) Min, Max, Avg
Pass 0, 16.969919, 86.502945, 72.574181
Pass 1, 46.607010, 125.070351, 101.227242
Pass 2, 47.106037, 136.561646, 94.093735
Pass 3, 64.095169, 130.890060, 99.548531
Pass 4, 35.704082, 161.464798, 88.457596
Time in milliseconds(ms). (Lower is better). Min, Max, Avg
Pass 0, 11.560300, 58.927799, 13.779005
Pass 1, 7.995500, 21.455999, 9.878764
Pass 2, 7.322701, 21.228701, 10.627700
Pass 3, 7.639999, 15.601800, 10.045352
Pass 4, 6.193300, 28.008001, 11.304852

In the VM I can only allocate 3 cores and their HT pairs, since I have noticed that passing all cores to the VM gives quite bad performance. I reserve core 0 for the host and its HT sibling (thread 4) for the VM emulator. This is the CPU assignment that I found gives the best performance:

<vcpu placement='static'>6</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='1'/>
  <vcpupin vcpu='1' cpuset='5'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='6'/>
  <vcpupin vcpu='4' cpuset='3'/>
  <vcpupin vcpu='5' cpuset='7'/>
  <emulatorpin cpuset='4'/>
</cputune>
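
Pinning vCPUs in core/HT pairs like this generally only helps the guest scheduler if the guest also sees vCPUs 0/1, 2/3 and 4/5 as hyperthread siblings. The `<cpu>` block of this VM is not shown in the post, so the following is only a sketch of what a matching block could look like. It assumes host CPUs 1/5, 2/6 and 3/7 really are sibling pairs on this machine (worth confirming with `lscpu -e` on the host) and uses the host/cache passthrough mode mentioned above:

```xml
<!-- Sketch only: this block is an assumption, not copied from the original VM -->
<cpu mode='host-passthrough' check='none'>
  <!-- 3 cores x 2 threads = 6 vCPUs; consecutive vCPUs (0/1, 2/3, 4/5) appear as HT siblings -->
  <topology sockets='1' cores='3' threads='2'/>
  <!-- pass the host cache layout through to the guest -->
  <cache mode='passthrough'/>
</cpu>
```

If the topology instead declared `threads='1'`, the guest would treat all six vCPUs as separate cores, and the pair-wise pinning would no longer match what the guest believes about its own layout.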

 

Since it is a headless server, I use NVIDIA GameStream for remote access, which hurts performance further: the CPU-Z single-thread score drops to ~230-240 while streaming the desktop. The GTA5 results above were taken without any streaming. Online mode in this game is very unoptimized, so it can cost another 20-50% in performance, and I see drops to 30 fps in game quite often. I don't think the loss can be attributed entirely to having one fewer core, especially given the large drop in the CPU-Z single-thread result.

 

I have tried all of the following, but nothing significantly bridges the gap between baremetal and VM CPU performance.

1. Changed to cpu model "Sandybridge" instead of cpu host passthrough. Resulted in significantly lower performance.

2. Passed through 2nd NIC instead of using virtual NIC. Resulted in slightly more performance.

3. Checked cpu turbo speeds on host. It does hit 3.2ghz all core in game on VM.

4. Isolated cpu cores used by VM, no noticeable improvement.

5. Changed cpu pinning and emulator pinning, but above config gives the best performance.

6. Updated kvm and virtio drivers.

7. Changed to i440fx. Resulted in slightly less performance.

 

I am out of ideas. Does anyone know what else I could try, or have experience with this? Is this the expected performance drop from baremetal to VM for a Sandy Bridge era CPU?

Link to comment

I've found some configurations that bridge about half of the gap to baremetal. The CPU core assignments that give the best performance are somewhat perplexing.

 

Passing these Hyper-V features improved single-thread performance noticeably, and I am not sure why. I found a blog whose author said this Hyper-V XML config gave him the best results. From the description of each feature, it is not obvious to me why it performs better. The CPU-Z single-thread score went up by ~20-30. More importantly, the performance loss while streaming is not as bad with these features on: before, I would see a CPU-Z single-thread score 30-40 lower while streaming the desktop; now it is only ~15 lower. GTA5 benchmark results also improved a bit with these features on.

<vpindex state='on'/> 
<synic state='on'/> 
<stimer state='on'/> 
<frequencies state='on'/>
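
For context, these four lines belong inside the `<hyperv>` element under `<features>` in the domain XML. Below is a sketch of how the surrounding fragments might look. The `relaxed`, `vapic` and `spinlocks` entries and the `<clock>` settings are assumptions about what the Hyper-V = yes setting already generates, not something shown in the post. Per the QEMU Hyper-V enlightenment documentation, `synic` depends on `vpindex`, and `stimer` depends on `synic`, `vpindex` and the Hyper-V reference clock, which is why the `hypervclock` timer is included:

```xml
<features>
  <acpi/>
  <apic/>
  <hyperv>
    <!-- assumed to already be present when Hyper-V is enabled in the VM template -->
    <relaxed state='on'/>
    <vapic state='on'/>
    <spinlocks state='on' retries='8191'/>
    <!-- the four additions from this post -->
    <vpindex state='on'/>
    <synic state='on'/>
    <stimer state='on'/>
    <frequencies state='on'/>
  </hyperv>
</features>
<clock offset='localtime'>
  <!-- stimer relies on the Hyper-V reference clock being exposed to the guest -->
  <timer name='hypervclock' present='yes'/>
</clock>
```

A plausible reason for the improvement is that synthetic timers let Windows avoid more expensive emulated timer paths, but that is a guess and was not measured here.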

I've found that passing only the primary core threads 0,1,2,3 to the VM gives the best overall performance and the best single-thread performance. I also pin the emulator to HT 4 and the iothread to HT 5, though I haven't noticed any improvement from iothread pinning. I get ~320 CPU-Z single thread, pretty close to the 330 on baremetal, and the best GTA5 benchmark results with this config: about halfway between my baremetal and VM results from earlier. I figure the remaining difference is that baremetal has 2 threads per core while this VM config has only 1 thread per core.

Frames Per Second (Higher is better) Min, Max, Avg 
Pass 0, 19.245239, 106.438461, 89.463310 
Pass 1, 80.630203, 154.502197, 128.269775 
Pass 2, 61.948658, 142.904099, 109.883057 
Pass 3, 5.634600, 159.941147, 119.195976 
Pass 4, 35.715050, 166.900330, 103.768700 
Time in milliseconds(ms). (Lower is better). Min, Max, Avg 
Pass 0, 9.395100, 51.960903, 11.177767 
Pass 1, 6.472400, 12.402301, 7.796069 
Pass 2, 6.997700, 16.142399, 9.100584 
Pass 3, 6.252300, 177.474899, 8.389544 
Pass 4, 5.991600, 27.999401, 9.636817

I was under the assumption that passing each primary core thread together with its HT pair is optimal, but I am not seeing that. Originally I passed core threads 1,2,3 and their HT pairs 5,6,7, with the emulator pinned to HT 4. That gave much lower single- and multi-thread performance. It seems that having all four cores is critical to getting the best performance: even though that config has 6 threads compared to 4, it performs worse because it only has 3 cores.

 

I have also passed all CPU threads 0,1,2,3,4,5,6,7 to the VM, which gives a CPU-Z single-thread result of ~300 and the best multi-thread result of ~1500. But performance in GTA is not as good as when passing only the primary core threads 0-3. If I define an emulator pin with this config, I get absolutely atrocious performance, so I didn't define an emulator or iothread pin. I don't know why this is. This config should give me the closest performance to baremetal, but it doesn't.

 

 

So TL;DR: I get the closest-to-baremetal performance in GTA5 and the best CPU-Z single-thread score in the VM with the following:

- add the Hyper-V features mentioned above to the XML

- pass only the primary core threads 0,1,2,3 and none of the HT pairs to the VM; pin the emulator to HT 4

- turn off Spectre/Meltdown mitigations in both host and VM (if baremetal also had them off)

- pass through a physical NIC instead of using a virtual NIC; it performs better and takes some load off the CPUs
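
Put together as libvirt XML, the pinning from the recap would look roughly like the sketch below. It is a reconstruction from the description, not a copy of the actual config: the one-to-one vCPU-to-core ordering, the single iothread, and the 4-core/1-thread topology are assumptions.

```xml
<vcpu placement='static'>4</vcpu>
<iothreads>1</iothreads>
<cputune>
  <!-- one vCPU per physical core, no HT siblings given to the guest -->
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
  <!-- emulator and I/O threads on HT siblings the guest does not use -->
  <emulatorpin cpuset='4'/>
  <iothreadpin iothread='1' cpuset='5'/>
</cputune>
<cpu mode='host-passthrough' check='none'>
  <!-- assumed topology: 4 cores, 1 thread each, matching the pinning above -->
  <topology sockets='1' cores='4' threads='1'/>
  <cache mode='passthrough'/>
</cpu>
```

Note that with this layout the emulator and iothread share physical cores 0 and 1 with vCPUs 0 and 1, so heavy I/O could still compete with the guest on those cores.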

 

 

Edited by kakashisensei
