Performance Improvements in VMs by adjusting CPU pinning and assignment



The basic architecture didn't really change between first and second gen TR4. They only added two more nodes and improved on the interconnect latencies of the first gen. The rest is basically the same.

 

5 minutes ago, jbartlett said:

Unless you already have the PNG

[attached image: topology 2019]

 

3 minutes ago, jbartlett said:

Is there another section?

Inside the cputune tag.

  <vcpu placement='static'>14</vcpu>
  <iothreads>1</iothreads>
  <cputune>
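    <!-- Pin each guest vCPU to a host thread; pairs like 9/25 and 10/26 are SMT siblings of the same physical core, so each guest core maps onto one host core -->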
    <vcpupin vcpu='0' cpuset='9'/>
    <vcpupin vcpu='1' cpuset='25'/>
    <vcpupin vcpu='2' cpuset='10'/>
    <vcpupin vcpu='3' cpuset='26'/>
    <vcpupin vcpu='4' cpuset='11'/>
    <vcpupin vcpu='5' cpuset='27'/>
    <vcpupin vcpu='6' cpuset='12'/>
    <vcpupin vcpu='7' cpuset='28'/>
    <vcpupin vcpu='8' cpuset='13'/>
    <vcpupin vcpu='9' cpuset='29'/>
    <vcpupin vcpu='10' cpuset='14'/>
    <vcpupin vcpu='11' cpuset='30'/>
    <vcpupin vcpu='12' cpuset='15'/>
    <vcpupin vcpu='13' cpuset='31'/>
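    <!-- Keep the QEMU emulator and the I/O thread off the vCPUs, on host core 8 and its SMT sibling 24 -->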
    <emulatorpin cpuset='8,24'/>
    <iothreadpin iothread='1' cpuset='8,24'/>
  </cputune>
  <numatune>
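    <!-- Allocate guest memory only from NUMA node 1, the node the pinned cores belong to -->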
    <memory mode='strict' nodeset='1'/>
  </numatune>
  <resource>

 

8 hours ago, bastl said:

This is part of the extra CPU flags I've been using for a while now.


  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC</model>
    <topology sockets='1' cores='7' threads='2'/>
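    <!-- 1 socket x 7 cores x 2 threads = 14 logical CPUs, matching <vcpu placement='static'>14</vcpu> in the pinning example above -->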
    <cache level='3' mode='emulate'/>
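    <!-- topoext exposes AMD's topology extensions so the guest detects SMT (2 threads per core) -->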
    <feature policy='require' name='topoext'/>
    <feature policy='disable' name='monitor'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='disable' name='svm'/>
    <feature policy='disable' name='x2apic'/>
  </cpu>

 

 

Holy CPUs, Batman! This got me hyper-threading in my VM! Of course, it had to go through a Blue Screen first and then set up my devices again, but yuh!

 

Another set of benchmarks to compare apples to apples coming up....


With hyper-threading detected in the VM, Cinebench R20 gave a 1.89% improvement in average scores (1/16/2), but the scores were a lot tighter. I was able to get the variance down to 50, with the high/low differing by 23 & 27, after dropping the three scores that varied farthest from the average (which were on the high side), leaving only three runs.

10 hours ago, bastl said:

By using "interleave" you spread the RAM accross all memory controllers from all nodes, even the ones from the node you're maybe not using in the VM. On first gen TR4 this was a big issue, because it added a lot of RAM latency. Sure you get the higher memory bandwith by using "quad channel" but in most scenarios in my tests the lower latency was the preferred option. Not exactly sure how big of a difference it is on second gen TR4, but using "Preferred" or "Strict" was the better choice for me. Every program, game or benchmark is more or less affected by the lower bandwith by basically turning the RAM into a dual channel configuration. The bigger impact I saw by reducing the latency by using the "strict" setting. Maybe have a look into the "Cache & Memory Benchmark" which comes with AIDA64 to test this.

  <numatune>
    <memory mode='interleave' nodeset='0,2'/>
  </numatune>
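
For comparison, the lower-latency modes bastl mentions bind memory to a single node instead. A minimal sketch (the node number is illustrative):

  <numatune>
    <!-- 'strict' fails allocations outside the nodeset; 'preferred' falls back to other nodes when node 1 is full -->
    <memory mode='preferred' nodeset='1'/>
  </numatune>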

You're right, spreading the memory over both nodes did increase the average latency from 92.7 to 125.6 (26% increase).

 

However, the read/write/copy scores improved under this "RAID"-like effect by 27.9% / 45.5% / 31.2% respectively. Seems like a good trade-off.


@jbartlett I haven't found any use case where I benefit from higher memory bandwidth. Not in gaming, not in my general workflow, and not in any CAD software I use. Less latency in most situations should result in quicker and, let's say, snappier work with any software, at least in my case. I don't have any bandwidth-heavy tasks. I don't know if it makes any difference in decoding/encoding video, for example, but in most cases faster access to RAM is the preferred scenario. Correct me if I'm wrong.

20 minutes ago, bastl said:

@jbartlett I haven't found any use case where I benefit from higher memory bandwidth. Not in gaming, not in my general workflow, and not in any CAD software I use. Less latency in most situations should result in quicker and, let's say, snappier work with any software, at least in my case. I don't have any bandwidth-heavy tasks. I don't know if it makes any difference in decoding/encoding video, for example, but in most cases faster access to RAM is the preferred scenario. Correct me if I'm wrong.

I'll have to ponder this. My use case is video overlay, which Livestream Studio seems to do in memory vs the GPU, and exporting video over NDI.

 

Once I'm done building my Ubuntu VM on the same box, which takes that NDI feed, comparing average CPU utilization with the memory spread out vs. on the same node should be telling.


A couple of observations:

1. The AMD hyper-threading workaround of setting the VM to emulate the EPYC processor doesn't always work on every boot. With one VM, CoreInfo does not always detect hyper-threading but does after a reboot (OS reboot, VM still active).

2. If you change anything in the GUI editor, you have to go back and manually edit the XML again to reset your CPU topology (set the threads back to 2).
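
For reference, the line that gets reset and has to be restored is the topology line inside the <cpu> block, e.g. for the 14-vCPU example above:

  <topology sockets='1' cores='7' threads='2'/>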


Question: I created VM "A" and then copied the qcow2 directory over to VM "B" to try to save on installation time (or just moved it to a different directory). SeaBIOS does not detect a bootable drive even though the drive reference is correct. Anyone know why?

 

I still have roughly 5 more Windows 10 installs to do after the current one. :)

2 hours ago, jbartlett said:

Question: I created VM "A" and then copied the qcow2 directory over to VM "B" to try to save on installation time (or just moved it to a different directory). SeaBIOS does not detect a bootable drive even though the drive reference is correct. Anyone know why?

 

I still have roughly 5 more Windows 10 installs to do after the current one. :)

The only VM I've tested this with is a SeaBIOS Windows 7 VM, and I can copy it without problems and use it with a clean new VM template. Never had any issues with it before. I had to do it a couple of times during my testing because of qcow2 corruptions on 6.8 RC1-5. I also have an old Windows 7 SeaBIOS VM as a template, also a qcow2 file, which I use for testing, and I've never had that issue with it either. Unfortunately I don't have an i440fx Win10 vdisk anymore to test with. Maybe in an old backup, if I'm lucky. I'll have a look tomorrow.

 

2 hours ago, jbartlett said:

copied the qcow2 directory over

I usually create the new VM first without booting it up (the size of the vdisk doesn't matter), and after that I copy over only the old vdisk, overwriting the newly created one. Maybe try it this way.
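
For reference, the "drive reference" lives in the VM's <disk> block, and the <source file=...> path has to point at the copied vdisk. A typical i440fx/SeaBIOS entry looks something like this (the path is illustrative, not from either VM here):

  <disk type='file' device='disk'>
    <driver name='qemu' type='qcow2' cache='writeback'/>
    <source file='/mnt/user/domains/VM-B/vdisk1.img'/>
    <target dev='hdc' bus='virtio'/>
    <boot order='1'/>
  </disk>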

 

2 hours ago, jbartlett said:

1. The AMD hyper-threading workaround of setting the VM to emulate the EPYC processor doesn't always work on every boot. With one VM, CoreInfo does not always detect hyper-threading but does after a reboot (OS reboot, VM still active).

Will check this tomorrow.

2 hours ago, jbartlett said:

2. If you change anything in the GUI editor, you have to go back and manually edit the XML again to reset your CPU topology (set the threads back to 2).

Unfortunately, you have to reconfigure almost all of your custom settings every time after using the GUI. 😥

 

It would be nice to have a custom field in the GUI where you could enter custom lines for, let's say, extra CPU settings, devices, interfaces etc., and have the GUI keep them and re-add them automatically when saving via the editor. In theory this should be possible, but I'm not the coding guy who could provide such a feature 🙄

7 minutes ago, bastl said:

 

2 hours ago, jbartlett said:

copied the qcow2 directory over

I usually create the new VM first without booting it up (the size of the vdisk doesn't matter), and after that I copy over only the old vdisk, overwriting the newly created one. Maybe try it this way.

I'll give this a shot tonight.

 

8 minutes ago, bastl said:

 

2 hours ago, jbartlett said:

1. The AMD hyper-threading workaround of setting the VM to emulate the EPYC processor doesn't always work on every boot. With one VM, CoreInfo does not always detect hyper-threading but does after a reboot (OS reboot, VM still active).

Will check this tomorrow.

I just finished setting up a new Win10 install and then applied the EPYC settings to it. Hyper-threading wasn't visible on the first boot after the EPYC settings but was after a Start > Reboot. The big 32-thread VM never had an issue with it. The similarities are that the VMs have 4 cores/2 threads on the same NUMA node (no numa settings) and 4GB of RAM allocated. Both SeaBIOS i440fx. Same CPU settings you gave, but emulator/iothread pinned to a single CPU.


I had Livestream Studio ingesting two Brio webcams at 1080p@60fps and outputting those two cams to two VMs via NDI. With the numatune memory mode set to interleave, the 32-CPU VM was hovering between 30% and 31%. Without the numatune, it hovered around 29% to 30%. So while the memory read/write/copy speeds are significantly higher, the increased latency @bastl mentioned did have a small impact on CPU time.

 

I've run into an annoying issue. I was adjusting the assigned CPU cores down to see what I could get away with on an NDI-into-OBS VM, and for some reason, no matter what I have set in the XML, the VM only sees 2 cores/2 threads. Created a new VM template with 20 CPUs assigned and the VM still only saw 4 total. Had to restore the backup of the VDrive. Not sure if it's related or not, but the Win 10 versions aren't registered yet as I'm still setting them up/testing.

 

 


@jbartlett I'm testing RC6 right now. I can't reproduce the issue you mentioned where, on first boot with the custom EPYC tweak, the CPU isn't reported as hyper-threaded. As soon as I add the tweaks and boot it up, the CPU is detected as it should be. I used the following for testing on a default W10 template. Nothing special: 4GB of RAM, 30GB vdisk, currently on the XFS array. Btw, no corruptions so far.

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC</model>
    <topology sockets='1' cores='2' threads='2'/>
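    <!-- 2 cores x 2 threads = 4 vCPUs; the <vcpu> count in the template should match sockets x cores x threads -->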
    <cache level='3' mode='emulate'/>
    <feature policy='require' name='topoext'/>
    <feature policy='disable' name='monitor'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='disable' name='svm'/>
    <feature policy='disable' name='x2apic'/>
  </cpu>

 

4 hours ago, jbartlett said:

Created a new VM template with 20 CPUs assigned and the VM still only saw 4 total.

Did you check in the device manager whether the CPU with the new tweaks is completely initialized? Remember, when applying these tweaks for the first time, Windows recognizes it as a new CPU and has to install the drivers first. Maybe the combination of changing the CPU type AND the core count produces this issue. Did you scan for newly installed hardware in the device manager to trigger a rescan manually? Maybe that helps it detect the new cores. Just an idea.

5 hours ago, bastl said:

 

9 hours ago, jbartlett said:

Created a new VM template with 20 CPUs assigned and the VM still only saw 4 total.

Did you check in the device manager whether the CPU with the new tweaks is completely initialized? Remember, when applying these tweaks for the first time, Windows recognizes it as a new CPU and has to install the drivers first. Maybe the combination of changing the CPU type AND the core count produces this issue. Did you scan for newly installed hardware in the device manager to trigger a rescan manually? Maybe that helps it detect the new cores. Just an idea.

Turns out the VDrive needed scanning for errors. When I viewed the device manager, it reported one unknown PCI device (balloon driver). When I restarted the VM with the driver ISO, Windows didn't pick it up. So I decided to go bare bones, down to 1 CPU and 2GB of RAM, to see if it picked up that change. It did, but it also detected drive issues at boot and did a scan & fix. No issues after that.

 

I believe I found the cause of the random no-hyperthreading: the GUI editor stripping out the <cpu> attributes and removing the fallback tag. I saw the other optional cpu tags, so my brain gave it a green checkmark.
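
To illustrate what that stripping looks like (a reconstruction based on the description above, not an exact capture of what the GUI writes): the custom block starts out as

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC</model>

while after a GUI save it can be reduced to something like

  <cpu>
    <model>EPYC</model>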


I've started working on enhancing the VM Form Editor in my spare time to add extra support for the CPU block, so these tweaks we're making can be applied without manually editing the XML.

 

Unfortunately, spare time is kinda non-existent.

 

So far, I've added NUMA detection & grouping to the form editor. (I got tired of counting CPUs to find the nodes)

 

[screenshot: NUMA grouping in the VM form editor]


@jbartlett This looks really promising and might help a lot of people get a better understanding of the topology of their CPUs. Let's hope this makes it into a final release in the future 👍

 

Edit:

Now you just have to add more info from lstopo to show where the RAM and all the PCIe devices are connected. 😂

Just a small button for running lstopo that presents a popup with the result would be enough for me. Or even better: parse the PCI addresses, look them up in the devices list, and show the names next to the corresponding blocks 😁

29 minutes ago, bastl said:

Now you just have to add more info from lstopo to show where the RAM and all the PCIe devices are connected. 😂

Just a small button for running lstopo that presents a popup with the result would be enough for me. Or even better: parse the PCI addresses, look them up in the devices list, and show the names next to the corresponding blocks 😁

I thought of that too, but the problem is that lstopo has to be manually installed first, following the spaceinvaderone videos. Though it could probably be added to a "VM Helper Plugin" with SI1's permission.

22 minutes ago, bastl said:

@jbartlett Is there a reason why lstopo is only available via the Unraid GUI boot? I can't remember which Unraid version first implemented it. I think Limetech added it shortly after SIO's video.

Ah, I thought it was numad that was put in the GUI version and then later apparently removed. I missed that it was lstopo. This gives me a starting point, but it would require booting in GUI mode to enable.

 

I believe it was to keep the size of the bzroot image down. Looks like it would add 15.9MB to the compressed file.

  • 3 weeks later...

The question, as always, is what do you need more for your VMs and your workflow? It's kinda easy for a VM which has only one use case, let's say video encoding: check the specific programs you are using and test them in both scenarios. Each program behaves differently, and I think for daily use the lower latency is the way to go. Have you seen any users in the forums yet with the newer TRX4 platform and the 3rd gen chips? @jbartlett The new chip layout shouldn't have these latency issues.

