Basic VM tuning guide for multi socket motherboards


mkfelidae


I see questions on Reddit and here on the forums asking about VM performance on dual (or multi) socket motherboards, so I figured I’d write up a basic tuning guide for NUMA environments. In this guide I will discuss three things: CPU tuning, memory tuning and IO tuning. The three intertwine quite a bit, so while I will try to cover them separately, they really should be considered one combined tuning exercise. Also, if there is one rule I recommend you follow, it is this: don’t cross NUMA nodes for performance-sensitive VMs unless you have no choice.

 

For CPU tuning, it is fairly simple to determine from the command line which CPUs are connected to which node. Issue the following command:

numactl -H

This will give you a readout of which CPUs are connected to which nodes, and it should look like this:

[Screenshot: output of numactl -H]

(Yes, they are hyperthreaded 10-core CPUs. They are from 2011, and your first warning should be that they are only $25 USD a pair on eBay: Xeon E7-2870, LGA1567.)

This shows us that CPUs 0-9 and their hyperthreads 20-29 are on node 0. It also shows us the memory layout of the system, which will be useful later. With this layout, pinning a VM to CPUs 0-7 and 20-27 would give the VM 8 cores and their 8 hyperthreads, all from one physical CPU. If you were to pin CPUs 7-14 and 27-34, your VM would still have 8 cores and 8 hyperthreads, but now it has a problem: without tuning the XML, the VM has no idea that the CPU it was given is really spread across two physical CPUs. One other thing that you can do to help with latency is to isolate an entire CPU in the unRAID settings (Settings > CPU Pinning). That basically reserves the CPU for that VM and helps reduce unnecessary cache misses by the VM.
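If you want to double-check a planned pinning before touching the VM, the same node-to-CPU mapping that numactl -H prints is also exposed under /sys/devices/system/node. The following is only a rough sketch: the function names are made up for illustration, and the optional second argument exists purely so the functions can be pointed at a test directory instead of the real sysfs tree.

```shell
#!/bin/bash
# Expand a kernel-style CPU list such as "0-9,20-29" into one CPU number per line.
expand_cpulist() {
    local part
    local IFS=','
    for part in $1; do
        if [[ $part == *-* ]]; then
            seq "${part%-*}" "${part#*-}"
        else
            echo "$part"
        fi
    done
}

# Print the NUMA node that owns the given CPU number.
# The second argument (node directory) is only a hook for testing.
node_of_cpu() {
    local cpu=$1 root=${2:-/sys/devices/system/node} dir
    for dir in "$root"/node[0-9]*; do
        if expand_cpulist "$(cat "$dir/cpulist")" | grep -qx "$cpu"; then
            echo "${dir##*/node}"
            return 0
        fi
    done
    return 1
}

# Print the distinct nodes touched by a pinning such as "7-14,27-34".
nodes_for_pinning() {
    local cpu
    for cpu in $(expand_cpulist "$1"); do
        node_of_cpu "$cpu" "${2:-/sys/devices/system/node}"
    done | sort -un | paste -sd' ' -
}
```

Calling nodes_for_pinning 0-7,20-27 should print a single node number if the pinning stays on one socket. More than one number in the output means the pinning crosses nodes.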

 

 

For memory tuning, you will need to add some XML to the VM to force allocation of memory from the correct node. That XML will look something like:

 

<numatune>
    <memory mode='strict' nodeset='0'/>
</numatune>

 

In this snippet of XML, mode='strict' means that if the node does not have enough free memory for the VM's full allocation, the VM will fail to start. You can change this to 'preferred' if you would like it to start anyway, with some of its memory allocated on another NUMA node.
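Since a strict policy makes the VM refuse to start when the node is short on memory, it can be handy to check how much is free on that node first. Here is a minimal sketch that reads the per-node meminfo file from sysfs. The function names are hypothetical, and the trailing directory argument is only there for testing. MemFree is a snapshot and does not count reclaimable cache, so treat the result as a conservative hint rather than a guarantee.

```shell
#!/bin/bash
# Print the free memory (in kB) currently reported for a NUMA node.
node_free_kb() {
    local node=$1 root=${2:-/sys/devices/system/node}
    # Lines look like: "Node 0 MemFree:        20971520 kB"
    awk '$3 == "MemFree:" { print $4 }' "$root/node$node/meminfo"
}

# Succeed only if the node reports at least the requested amount (in kB) free.
node_can_fit() {
    local node=$1 need_kb=$2 root=${3:-/sys/devices/system/node}
    local free_kb
    free_kb=$(node_free_kb "$node" "$root") || return 1
    [ -n "$free_kb" ] && [ "$free_kb" -ge "$need_kb" ]
}
```

For example, node_can_fit 0 $((16 * 1024 * 1024)) && echo "node 0 has room for 16 GiB" checks node 0 for a 16 GiB VM.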

 

 

Lastly, IO tuning is a bit different from the last two. Before, we were choosing CPUs and memory to assign to the VM based on their node, but for IO tuning, the node of the device you want to pass through (be it PCI or USB) is fixed, and you may not have the same kind of resource (a graphics card, for example) on the other node. This means that ultimately the IO devices you want to pass through will, in most cases, determine which node your VM should prefer to be assigned to. To determine which node a PCI device is connected to, you will first need that device's bus address, which should look like this: 0000:0e:00.0. To find the device's address in the unRAID webGUI, go to Tools > System Devices, then search for your device in the PCI Devices and IOMMU Groups box. Then open a terminal and issue the following commands:

 

cd /sys/bus/pci/devices/[PCI bus address]

cat numa_node

 

The output should look like this:

[Screenshot: output of cat numa_node, showing 0]

For my example here you can see that my device is assigned to NUMA node 0. I will point out that if you are passing through multiple devices (GPU, USB controller, NVMe drive), they might not all be on the same node. In that case I would prioritize which node you ultimately assign your VM to based on the use of that VM. For gaming I would personally prioritize the GPU being on the same node, but YMMV.
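When several devices are being passed through, the same numa_node lookup can be done for all of them in one go. This is only a sketch: pci_numa_nodes is a made-up helper name, and the SYSFS_PCI_ROOT variable is an assumption added so the function can be tested against a fake directory. A value of -1 means the kernel has no node information for that device.

```shell
#!/bin/bash
# Print "<PCI address> <NUMA node>" for each address given on the command line.
pci_numa_nodes() {
    local root=${SYSFS_PCI_ROOT:-/sys/bus/pci/devices} addr
    for addr in "$@"; do
        if [ -r "$root/$addr/numa_node" ]; then
            echo "$addr $(cat "$root/$addr/numa_node")"
        else
            # Address not present in sysfs (typo, or device not visible).
            echo "$addr unknown"
        fi
    done
}
```

Usage would look like pci_numa_nodes 0000:0e:00.0 followed by whatever other addresses you copied from Tools > System Devices.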

 

 

Another thing that you can do to help with latency is to isolate an entire CPU for a VM if it is for something like gaming. That basically reserves the CPU for that VM and helps reduce unnecessary cache misses by the VM.

 

It can help to visualize NUMA nodes as their own computers. Each node may have its own CPU, RAM and even IO devices. The nodes are connected through high-speed interconnects, but if one node wants memory or IO from a device on another node, it has to ask the other node for it rather than address it directly. This request adds latency and costs performance. In the CPU section of this guide we issued the command “numactl -H”. That command also shows, in abstract units, the distance from each node to every other node, laid out as a grid. The greater the distance, the longer the turnaround time for cross-node requests and the higher the latency.
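The same distance grid can also be read straight from sysfs, one row per node, which is convenient in scripts. This is a small sketch with a hypothetical function name. The optional argument only exists so it can be run against a test directory. By convention a node's distance to itself is 10, and larger numbers mean a longer trip.

```shell
#!/bin/bash
# Print one line per NUMA node with its distance to every node, in node order.
print_node_distances() {
    local root=${1:-/sys/devices/system/node} dir
    for dir in "$root"/node[0-9]*; do
        echo "node${dir##*/node}: $(cat "$dir/distance")"
    done
}
```

On a two-socket board you would expect two rows, each with a 10 in the node's own column and a larger value in the other column.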

 

 

Advanced tuning:

It is possible, if you have need of it, to craft the XML for a VM in such a way as to make the VM NUMA aware, so that the VM is able to properly use two or more socketed CPUs. This can be done by changing both the <cputune> and <vcpu> tags. This is outside the scope of a basic tuning guide, so I will just include a link to https://libvirt.org/formatdomain.html, which contains the entire specification for libvirt Domain XML, the format the Unraid VMs are written in.


Thanks for this guide @mkfelidae, it's appreciated!  I thought I had tweaked my VMs for ultimate performance, but your guide allowed me to eke out a bit more performance!

 

I just have 2 comments / questions.  The first is that you have a typo in your guide.  For numatune, the proper format (I believe) is -->

 

<numatune>
    <memory mode='strict' nodeset='0'/>
</numatune>

 

So single vs double quotes.  The other thing is a question.  How do I know that the numatune settings are being applied correctly?

 

Thanks again!


~Spritz

7 minutes ago, Spritzup said:

How do I know that the numatune settings are being applied correctly?

I feel kind of foolish for not including the answer to this. The command you are looking for is:

 

numastat -c qemu

 

This will list all of the processes whose command line matches qemu (like qemu-system-x86) and display their NUMA node memory usage. If you have multiple VMs started, you may need to differentiate VMs by process ID (PID) instead (figuring out which "qemu-system-x86" process is which specific VM can be frustrating). In that case the command would look something like:

 

numastat -p [PID]

 

It is normal for a process whose memory is bound to a node other than 0 with the <numatune> tag to have a small amount of memory allocated from node 0. I suspect that this is due to the qemu emulation process itself being bound to node 0. I did not find a way to eliminate allocation from node 0 entirely.
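To take some of the frustration out of matching a qemu process to a VM, one option is to read the PID file that libvirt normally writes for each running domain. This is a sketch under assumptions: the /var/run/libvirt/qemu location and the <VM name>.pid naming are what libvirt typically uses, but check that they exist on your system, and vm_pid is just an illustrative helper name.

```shell
#!/bin/bash
# Print the PID of a running VM, looked up by its libvirt domain name.
# The second argument (PID directory) is an assumption and a hook for testing.
vm_pid() {
    local name=$1 piddir=${2:-/var/run/libvirt/qemu}
    if [ ! -r "$piddir/$name.pid" ]; then
        echo "no PID file for '$name' in $piddir" >&2
        return 1
    fi
    cat "$piddir/$name.pid"
}
```

With that in place, something like numastat -p "$(vm_pid "Windows 10")" would show the per-node memory for just that VM (the VM name here is only an example).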

 

Hope this helps

  • 4 weeks later...
On 11/23/2020 at 12:47 PM, mkfelidae said:

I see questions on Reddit and here on the forums asking about VM performance for Dual(or multi) socket motherboards. So I figured I’d write up a basic tuning guide for NUMA environments. In this guide I will discuss three things, CPU tuning, memory tuning and IO tuning.

Thanks for this!!!

 

I was looking to use some winter break downtime to tune VMs.  I already reserve and pin core pairs for VMs, but I don't strictly reserve VM memory on the pinned socket or make sure passed-through video cards go to VMs that are pinned to the socket the video card is wired to...

 

But I can't even get out of the gate.

 

root@mu:~# numactl -H
available: 1 nodes (0)

 

For some reason Unraid sees 1 NUMA node.   I have two Xeon E5-2660 v2 CPUs.   Is this to be expected on this old hardware?   Thanks.


To be honest, I have found that nothing is to be expected when dealing with old hardware.

 

That said, check inside the BIOS to make sure that any NUMA-associated options are enabled or selected. I have seen a board that allowed you (for some reason) to configure the memory profile as UMA, the opposite of NUMA, if you wanted, and all it did was hide the node structure from the OS without actually helping with memory access at all.

 

Do you have memory attached to both sockets? And if so, is Unraid detecting all of it?  There may be a mode in the BIOS that controls memory access policies at the hardware level and dictates how the system appears to the OS above.

 

I would offer more help, but I would ideally need a copy of the first 50-100 lines of your syslog and a look at your whole BIOS. The BIOS screens would probably be quite difficult to collect due to the number of anticipated submenus, but just the top stub of the syslog would allow me to see whether Linux configured your system as NUMA to start with or detected it as a single node.
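If you want to look at that part of the syslog yourself first, a quick filter for NUMA-related boot messages usually narrows it down. This is a minimal sketch: numa_boot_lines is a hypothetical helper, /var/log/syslog is assumed to be where your syslog lives, and the exact wording of kernel messages varies by version. A line such as "No NUMA configuration found" means the kernel came up as a single node.

```shell
#!/bin/bash
# Print syslog lines that mention NUMA or the ACPI SRAT table (case-insensitive).
# The log path is an assumption; pass another file as the first argument.
numa_boot_lines() {
    local log=${1:-/var/log/syslog}
    grep -i -E 'numa|srat' "$log"
}
```

No output at all, or only the single-node message, points back at the BIOS settings rather than at anything inside Unraid.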


I could never find information in my system documentation about which slots were wired to which socket.   Using your tutorial, all 3 video cards come up as node 0.   Kind of a bummer.   So now do I isolate all but 2 cores on node 0 and see if most of the system will run on node 1 with all or most of the cores?   It seems crazy to have something like Plex cross NUMA nodes, but if the OS is NUMA aware maybe it'll head to node 1 and stay there.

 

Another general question I'm unsure of: what is the best practice for the emulatorpin cpuset?

 

Thanks.

On 12/22/2020 at 12:17 PM, mkfelidae said:

To be honest, I have found that nothing is to be expected when dealing with old hardware.

 


Emulator pinning and iothread pinning can help further improve performance, and unless you are crossing NUMA nodes they too should be pinned to the same node your CPU is pinned to. That said, they do not need to be isolated in most cases.  You can use numactl to pin processes like Plex to node 1 if you choose.  Ultimately Plex is pretty lightweight in terms of CPU usage, and crossing NUMA nodes might not really change Plex performance.
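For the numactl part, binding a command's CPUs and memory to one node could be wrapped up roughly like this. It is only a sketch: run_on_node is a made-up name, and the NUMACTL variable is an assumption added so the wrapper can be dry-run with echo. It applies to processes you launch yourself from a shell, so a containerized Plex would need its CPU set configured on the container instead.

```shell
#!/bin/bash
# Run a command with both its CPUs and its memory bound to a single NUMA node.
# NUMACTL can be overridden (for example with "echo") to preview the command.
run_on_node() {
    local node=$1
    shift
    "${NUMACTL:-numactl}" --cpunodebind="$node" --membind="$node" -- "$@"
}
```

For example, NUMACTL=echo run_on_node 1 sleep 5 prints the numactl arguments that would be used, and dropping the NUMACTL=echo prefix actually runs the command on node 1.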
