Performance Improvements in VMs by adjusting CPU pinning and assignment



I see many people commenting that they change their VM CPU topology from the first form below to the second to improve system performance. I've done the same, but Microsoft's "Coreinfo" utility revealed that the VM wasn't actually seeing any hyper-threaded CPUs. So I decided to benchmark the following topologies on an existing Win10 VM.

Iteration 1:  <topology sockets='1' cores='32' threads='1'/>

Iteration 2: <topology sockets='1' cores='16' threads='2'/>

 

Based on my testing, I do not see any improvement. I used Cinebench R20 & PerformanceTest 9.0, running each ten times and recording the max & average scores, discarding any run that scored too far off the high or low end until the variance (the difference between the high & low scores) fell to an acceptable value based on my experience running these benchmarks dozens of times. I picked these two because I needed a couple of benchmarks that were quick & easy to set up (I've lost count of how many Win10 VMs I've set up in the past month).

 

For the initial test, I ran with cores=32 threads=1 and then tried to get a similar or better variance on the cores=16 threads=2 test. In my scenario, I'm running a 2nd-gen Threadripper 2990WX with host CPUs 0-15 & 32-47 pinned (NUMA nodes 0 & 2, which have the PCIe & RAM attached) and the emulator pinned to CPU 16 (NUMA node 1). This config matches my intended use case: video broadcasting with Livestream Studio, outputting a 1080p@60fps stream to YouTube plus at least two NDI streams of a 1080p@60fps feed consumed by other VMs/PCs. The OS is Windows 10 fully updated, and the GPU is a Quadro P2000.
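
For reference, that pinning boils down to a cputune block along these lines (abbreviated; the actual vCPU-to-host mapping is whatever the unRAID CPU pinning page generates for the selected CPUs):

<vcpu placement='static'>32</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <!-- ...one vcpupin per remaining guest vCPU across host CPUs 0-15 & 32-47... -->
  <emulatorpin cpuset='16'/>
</cputune>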

 

Benchmark  CB R20  PT CPU  PT RAM | CB R20  PT CPU  PT RAM
CPU Topo   1/32/1  1/32/1  1/32/1 | 1/16/2  1/16/2  1/16/2
Average      6572   20944    1261 |   6515   20831    1257
Highest      6620   21085    1263 |   6543   20873    1258
Lowest       6537   20810    1255 |   6484   20805    1254
Variance       83     275       8 |     59      68       4

The average scores are lower in the 1/16/2 config, but they're also tighter. I'm currently running this test again, passing in a NUMA configuration that matches the host. Past test runs have shown marked improvements.

Edited by jbartlett
  • Thanks 1
Link to comment

Just learned something new. Using Microsoft's "Coreinfo" utility, Intel CPUs that support hyper-threading will show up as hyper-threaded CPUs to the guest regardless of whether you have threads=1 or threads=2. AMD CPUs, notably the Threadripper series (the only ones I have), do not. I was clued in when I was viewing the VM logs and saw a warning that AMD doesn't support the feature (which doesn't show up all the time either).

 

For the unRAID GUI, it seems to have auto-flagged my Intel CPU for threads=2, because I don't recall making that change myself (my mileage will vary).

Link to comment

My testing shows that setting up a NUMA configuration in your guest benefits memory speed but not really CPU performance on an AMD system. Best to leave threads=1 for slightly better CPU performance than threads=2.

 

Edit: ARGH! Somehow, the CPU Mode ended up set to Emulated instead of Passthrough. I didn't make that change, but let me redo these tests yet again.

Edited by jbartlett
  • Haha 1
Link to comment
2 minutes ago, BRiT said:

Out of sick curiosity, how does it perform with Threads=4 and CPU=Half ?

The core count is already halved with threads=2. Did you mean a quarter? I did test whether such a thing would even boot with cores & threads reversed (it did), but I didn't run any benchmarks with it. That was back when I was trying to figure out how to get the AMD guest OS to see hyper-threaded CPUs, before I discovered that AMD doesn't support it.

Link to comment
18 minutes ago, BRiT said:

Yeah, I meant Half of Half. Yeah, that's it! 

I'll give that a shot.
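
For reference, I'd expect that "half of half" topology to be something along these lines, still presenting 32 logical CPUs to the guest (untested as of this post):

<topology sockets='1' cores='8' threads='4'/>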

 

For informational purposes, having an emulated CPU gave a 9% boost in CPU performance with PerformanceTest 9.0 on 1/16/2, but only on every OTHER test. On the odd tests, it scored the same as 1/32/1. Twenty tests over two runs showed the same pattern. Cinebench R20 showed comparable scores between 1/32/1 & 1/16/2.

Link to comment

CPU is a Threadripper 2990WX with a VM running Windows 10 fully patched; RAM is G.SKILL Ripjaws 4 Series 64GB (8 x 8GB) DDR4 2133 (PC4 17000).

Motherboard is an ASUS ROG Zenith Extreme Alpha X399. MB & RAM are at stock settings, CPU governor set to Performance. The VM is pinned to NUMA nodes 0 & 2 (which have the PCIe & RAM attached), utilizing all of their CPUs, and the emulator is pinned to NUMA node 1, CPU 16. Total guest memory assigned is 12GB.

No NUMA                                                    || NUMA
Benchmark  CB R20  PT CPU  PT RAM | CB R20  PT CPU  PT RAM || Benchmark  CB R20  PT CPU  PT RAM | CB R20  PT CPU  PT RAM
CPU Topo   1/32/1  1/32/1  1/32/1 | 1/16/2  1/16/2  1/16/2 || CPU Topo   1/32/1  1/32/1  1/32/1 | 1/16/2  1/16/2  1/16/2
Average      6572   20944    1261 |   6515   20831    1257 || Average      6408   20617    1389 |   6539   20958    1300
Highest      6620   21085    1263 |   6543   20873    1258 || Highest      6525   20728    1391 |   6589   21144    1306
Lowest       6537   20810    1255 |   6484   20805    1254 || Lowest       6438   20455    1385 |   6511   20746    1297
Variance       83     275       8 |     59      68       4 || Variance       87     273       6 |     78     398       9

The left set has no NUMA node configuration. The 1/16/2 pairing shows a roughly 0.7% drop in CPU performance and a negligible difference in RAM performance. However, the variance (the difference between the high & low scores) was much lower, which indicates more consistent processing speeds at a slight performance loss.

 

With a NUMA configuration, things were different. Cinebench showed roughly the same performance, but PerformanceTest 9.0 showed a larger variance in scores: several much higher scores had to be dropped in order to bring the variance down below a thousand. Where the NUMA configuration clearly helped is the RAM test scores, if you create NUMA nodes in the guest OS that match the host.

<cpu mode='host-passthrough' check='none'>
  <topology sockets='1' cores='32' threads='1'/>
  <numa>
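    <!-- two 6 GiB cells (6291456 KiB each), splitting the VM's 12 GB to match the two host nodes with RAM attached -->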
    <cell id='0' cpus='0-15' memory='6291456' unit='KiB'/>
    <cell id='1' cpus='16-31' memory='6291456' unit='KiB'/>
  </numa>
</cpu>

In short, since AMD CPUs do not expose hyper-threaded CPUs to the guest OS, setting cores=16/threads=2 shows mixed results depending on whether or not you specify NUMA nodes. It's probably best to always use threads=1 so the topology matches what the guest OS sees.

 

 

Edited by jbartlett
  • Thanks 1
Link to comment

@jbartlett Did you by any chance set a strict RAM allocation to the node whose cores you're using? If not, you might have to test this again. Without this setting, unRAID will use RAM from all nodes.

  <numatune>
    <memory mode='strict' nodeset='1'/>
  </numatune>

The following shows you which node the VMs are taking their RAM from:

numastat qemu

 

Link to comment
2 hours ago, bastl said:

@jbartlett Did you by any chance set a strict RAM allocation to the node whose cores you're using? If not, you might have to test this again. Without this setting, unRAID will use RAM from all nodes.

Ya know, I had a feeling someone would pop in and tell me all my tests were invalid because there was another optimization. Ha! Sa'right. I'm in the process of recreating the VM with the NUMA setting in place from the start, and I'll retest the NUMA config with the memory pinned to nodes 0 & 2. It was grabbing all 12GB of RAM from node 0.
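
Presumably that's bastl's strict example with both memory nodes listed, something like this (a sketch I haven't run yet; as I understand it, strict means allocations that can't be satisfied from those nodes fail rather than spilling over to other nodes):

<numatune>
  <memory mode='strict' nodeset='0,2'/>
</numatune>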

Link to comment
14 minutes ago, jbartlett said:

This is what it took to get it to divide up the memory between the nodes.


<numatune>
  <memory mode='interleave' nodeset='0,2'/>
</numatune>

Couldn't use any of the "auto" methods because numad isn't part of the unRAID package.

Already seeing a 19% improvement in the memory score with this set. I'll also test hugepages.
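
For the hugepages test, my understanding is that the guest side is just the memoryBacking element below; the host also needs hugepages reserved ahead of time, which I'm not covering here:

<memoryBacking>
  <hugepages/>
</memoryBacking>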

Link to comment

I found a domain feature which may expose hyper-threading on AMD CPUs where the guest doesn't otherwise see it.

<feature policy='require' name='topoext'/>

 

I will experiment with it.

 

Utilizing the following gave a 19% improvement in RAM scores with negligible differences between 1/32/1 & 1/16/2. The VM had 12 GB of RAM assigned, and this evenly spread it between the two NUMA nodes that have the memory attached. The VM was likewise pinned to those nodes so the guest layout matched the physical one.

<numatune>
  <memory mode='interleave' nodeset='0,2'/>
</numatune>
<cpu mode='host-passthrough' check='none'>
  <topology sockets='1' cores='32' threads='1'/>
  <numa>
    <cell id='0' cpus='0-15' memory='6291456' unit='KiB'/>
    <cell id='1' cpus='16-31' memory='6291456' unit='KiB'/>
  </numa>
</cpu>

 

  • Thanks 1
Link to comment
<cpu mode='host-passthrough' check='none'>
  <topology sockets='1' cores='4' threads='2'/>
  <feature policy='require' name='topoext'/>
</cpu>

Topoext is automatically enabled in guests if the host supports it. Adding it to an AMD host that doesn't support it doesn't force it on. The CoreInfo utility still showed no hyper-threading.

 

Link to comment
7 hours ago, jbartlett said:

Ya know, I had a feeling someone would pop in and tell me all my tests were invalid because there was another optimization. Ha! Sa'right. I'm in the process of recreating the VM with the NUMA setting in place from the start, and I'll retest the NUMA config with the memory pinned to nodes 0 & 2. It was grabbing all 12GB of RAM from node 0.

I know that feeling. There is always that one guy that has another little tweak 😂

 

7 hours ago, jbartlett said:

This is what it took to get it to divide up the memory between the nodes.



<numatune>
  <memory mode='interleave' nodeset='0,2'/>
</numatune>

Couldn't use any of the "auto" methods because numad isn't part of the unRAID package.

Already seeing a 19% improvement in the memory score with this set. I'll also test hugepages.

By using "interleave" you spread the RAM accross all memory controllers from all nodes, even the ones from the node you're maybe not using in the VM. On first gen TR4 this was a big issue, because it added a lot of RAM latency. Sure you get the higher memory bandwith by using "quad channel" but in most scenarios in my tests the lower latency was the preferred option. Not exactly sure how big of a difference it is on second gen TR4, but using "Preferred" or "Strict" was the better choice for me. Every program, game or benchmark is more or less affected by the lower bandwith by basically turning the RAM into a dual channel configuration. The bigger impact I saw by reducing the latency by using the "strict" setting. Maybe have a look into the "Cache & Memory Benchmark" which comes with AIDA64 to test this.

 

5 hours ago, jbartlett said:

<feature policy='require' name='topoext'/>

This is part of the extra CPU flags I've been using for a while now.

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC</model>
    <topology sockets='1' cores='7' threads='2'/>
    <cache level='3' mode='emulate'/>
    <feature policy='require' name='topoext'/>
    <feature policy='disable' name='monitor'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='disable' name='svm'/>
    <feature policy='disable' name='x2apic'/>
  </cpu>

By forcing Windows into recognizing the CPU as an EPYC with these tweaks, it also recognizes the correct L1, L2 and L3 cache sizes that the node has to offer. Without them, it showed wrong cache sizes and wrong mapping numbers. Without these tweaks and the correct readings, starting up 3DMark, for example, always crashed or froze the VM completely at the point where it gathers the system info. Not sure which other software might be affected, but this helped me in this scenario.

[screenshot: guest CPU info showing the emulated EPYC with correct cache sizes]

Obviously the vcore is reported wrong, but the cache info is reported correctly with this tweak. One core is used for the iothread and emulatorpin:

    <emulatorpin cpuset='8,24'/>
    <iothreadpin iothread='1' cpuset='8,24'/>

and the rest are dedicated specifically to this one VM. One of the two 8-core dies of the 1950X is dedicated to this VM only, and adding up the numbers exactly matches AMD's specs.

[screenshot: per-core cache numbers adding up to AMD's published specs]
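
For completeness, the iothreadpin only has something to pin because the domain also defines an iothread and the VM's disk is attached to it, roughly like this (the driver attributes below are just placeholders, yours may differ):

<iothreads>1</iothreads>

<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native' iothread='1'/>
  <!-- source/target elements unchanged -->
</disk>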

 

BUT this isn't the complete list of tweaks. There are way more you can play around with 😂😂😂

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC-IBPB</model>
    <vendor>AMD</vendor>
    <topology sockets='1' cores='4' threads='2'/>
    <feature policy='require' name='tsc-deadline'/>
    <feature policy='require' name='tsc_adjust'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='cmp_legacy'/>
    <feature policy='require' name='perfctr_core'/>
    <feature policy='require' name='virt-ssbd'/>
    <feature policy='require' name='skip-l1dfl-vmentry'/>
    <feature policy='require' name='invtsc'/>
  </cpu>

At some point I stopped, because I had no time back then to fiddle around with it any further and the system was stable enough anyway. The main programs run fine and games performed great.

 

 

Edit:

5 hours ago, jbartlett said:

Topoext is automatically enabled in guests if the host supports it. Adding it to an AMD host that doesn't support it doesn't force it on. The CoreInfo utility still showed no hyper-threading.

Forgot to mention it reports "Hyperthreaded" for me in CoreInfo.

[screenshot: Coreinfo output reporting the CPUs as Hyperthreaded]

 

Edited by bastl
  • Thanks 2
Link to comment



Hi, has anyone on Ryzen 3000 tried to activate hyper-threading (as jbartlett mentions) and checked the impact on performance? Supposedly it is deactivated since QEMU 3.1.
Link to comment

It looks like it only works if you pass through extra cache info like I did.

 

https://git.qemu.org/?p=qemu.git;a=commit;h=7210a02c58572b2686a3a8d610c6628f87864aed

https://www.reddit.com/r/VFIO/wiki/known_issues#wiki_enabling_smt_on_amd_processors_with_qemu_3.1.2B

Quote

In order to use SMT on AMD cpus you need to manually enable it and provide caching information:


<cpu mode='host-passthrough' check='none'>
	<topology sockets='1' cores='8' threads='2'/>
	<cache mode='passthrough'/>
	<feature policy='require' name='topoext'/> 
</cpu>

Link to comment

CPU pinning with isolcpus (https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF)

 

Alternatively, make sure that you have isolated CPUs properly. In this example, let us assume you are using CPUs 4-7. Use the kernel parameters isolcpus nohz_full rcu_nocbs to completely isolate the CPUs from the kernel. For example:

isolcpus=4-7 nohz_full=4-7 rcu_nocbs=4-7

Then, run qemu-system-x86_64 with taskset and chrt:

# chrt -r 1 taskset -c 4-7 qemu-system-x86_64 ...

The chrt command will ensure that the task scheduler will round-robin distribute work (otherwise it will all stay on the first cpu).

 

Is it necessary to do this?

 

Link to comment
7 hours ago, bastl said:

By using "interleave" you spread the RAM accross all memory controllers from all nodes, even the ones from the node you're maybe not using in the VM. On first gen TR4 this was a big issue, because it added a lot of RAM latency. Sure you get the higher memory bandwith by using "quad channel" but in most scenarios in my tests the lower latency was the preferred option. Not exactly sure how big of a difference it is on second gen TR4, but using "Preferred" or "Strict" was the better choice for me. Every program, game or benchmark is more or less affected by the lower bandwith by basically turning the RAM into a dual channel configuration. The bigger impact I saw by reducing the latency by using the "strict" setting. Maybe have a look into the "Cache & Memory Benchmark" which comes with AIDA64 to test this.

The 2990WX has four NUMA nodes; RAM is attached to nodes 0 & 2. I pinned all the CPUs on nodes 0 & 2 to the VM so I didn't have to worry about getting memory from other nodes. Here's the per-node breakdown (values in MB):

         Node 0 Node 1 Node 2 Node 3 Total
         ------ ------ ------ ------ -----
Huge          0      0      0      0     0
Heap          0      0      0      0     0
Stack         0      0      0      0     0
Private    7171      0   5199      0 12370
-------  ------ ------ ------ ------ -----
Total      7171      0   5199      0 12370

I can't get CPU-Z to work at all if I don't use the stock template; it gets to the CPU detection part at 10% and just hangs. 3DMark was questionable in the past, as it would hang at the collecting-system-info step after a test or two, but I'll try it again. I'll download AIDA64 and retry 3DMark.

 

I don't know how my CPU will handle your 1950X tweaks since the 1950X only has one NUMA node. It's promising, though, because my main NAS is running a 1950X and I have a Win10 VM there that is my main driver for email & Plex.

Link to comment
2 minutes ago, jbartlett said:

the 1950X only has one NUMA node

That's wrong. All the first-gen TR4 chips have 2 nodes, each with its own memory controller, to get quad-channel memory working on that platform: 4 cores per node (1900X), 6 cores per node (1920X), and 8 cores per node (1950X). 4 nodes were only available on first-gen EPYC and were introduced on second-gen TR4 like yours.

Link to comment
13 minutes ago, luca2 said:

isolcpus=4-7 nohz_full=4-7 rcu_nocbs=4-7

I had my CPUs isolated (CPU Pinning in the Settings makes it easy), but I didn't know about nohz_full & rcu_nocbs. Just added those. Good point on the task scheduler wanting CPU 0, because that's pinned to the main VM.

 

8 hours ago, bastl said:

<emulatorpin cpuset='8,24'/>
<iothreadpin iothread='1' cpuset='8,24'/>

I used to pin the emulator to the whole core, but I've never seen the Dashboard show any usage on the second thread.

Link to comment
1 minute ago, bastl said:

That's wrong. All the first-gen TR4 chips have 2 nodes, each with its own memory controller, to get quad-channel memory working on that platform: 4 cores per node (1900X), 6 cores per node (1920X), and 8 cores per node (1950X). 4 nodes were only available on first-gen EPYC and were introduced on second-gen TR4 like yours.

Looks like I'll be installing Topo before I open my mouth again. haha Unless you already have the PNG....

Link to comment
