I see questions on Reddit and here on the forums asking about VM performance for dual (or multi) socket motherboards, so I figured I'd write up a basic tuning guide for NUMA environments. In this guide I will discuss three things: CPU tuning, memory tuning, and IO tuning. The three intertwine quite a bit, so while I will try to write them up separately, they really should be considered as one combined tuning effort. Also, if there is one rule I recommend you follow, it is this: don't cross NUMA nodes for performance-sensitive VMs unless you have no choice.
So for CPU tuning, it is fairly simple to determine, from the command line, which CPUs are connected to which node. Issuing the following command:

numactl -H

will give you a readout of which CPUs are connected to which nodes, and it should look like this:
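available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 65536 MB
node 0 free: 61034 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 65536 MB
node 1 free: 59667 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

(A reconstructed example for this dual-socket box: the CPU numbering matches what is described below, but the memory sizes, free amounts, and distance values are illustrative and will differ on your hardware.)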
(Yes, they are hyperthreaded 10-cores; they are from 2011, and your first warning should be that they are only $25 USD a pair on eBay: Xeon E7-2870, LGA1567.)
This shows us that CPUs 0-9 and their hyperthreads 20-29 are on node 0; it also shows us the memory layout of the system, which will be useful later. With this layout, pinning a VM to cores 0-7 and 20-27 would give the VM 8 cores and 8 threads, all from one physical CPU. If you were instead to pin cores 7-14 and 27-34, your VM would still have 8 cores and 8 threads, but now you have a problem: that selection straddles both sockets, and without tuning the XML the VM has no idea that the processor it was given is really split across two physical CPUs. One other thing that you can do to help with latency is to isolate an entire CPU in the unRAID settings (Settings > CPU Pinning). That basically reserves the CPU for that VM and helps reduce unnecessary cache misses. The pinning itself ends up in the VM's XML, as sketched below.
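For reference, here is a minimal sketch of what that 8-core/8-thread, node-0 pinning could look like in the domain XML. The guest vcpu numbering is an assumption on my part, pairing each core with its hyperthread sibling; the unRAID GUI may order them differently:

<vcpu placement='static'>16</vcpu>
<cputune>
  <!-- assumption: guest vcpus paired core/thread; host CPUs 0-7 and 20-27 are node 0 per numactl -H -->
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='20'/>
  <vcpupin vcpu='2' cpuset='1'/>
  <vcpupin vcpu='3' cpuset='21'/>
  <vcpupin vcpu='4' cpuset='2'/>
  <vcpupin vcpu='5' cpuset='22'/>
  <vcpupin vcpu='6' cpuset='3'/>
  <vcpupin vcpu='7' cpuset='23'/>
  <vcpupin vcpu='8' cpuset='4'/>
  <vcpupin vcpu='9' cpuset='24'/>
  <vcpupin vcpu='10' cpuset='5'/>
  <vcpupin vcpu='11' cpuset='25'/>
  <vcpupin vcpu='12' cpuset='6'/>
  <vcpupin vcpu='13' cpuset='26'/>
  <vcpupin vcpu='14' cpuset='7'/>
  <vcpupin vcpu='15' cpuset='27'/>
</cputune>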
For memory tuning, you will need to add some XML to the VM to force allocation of memory from the correct node. That XML will look something like:
<numatune>
  <memory mode='strict' nodeset='0'/>
</numatune>
This block goes at the top level of the domain XML. In this snippet, mode='strict' means that if there isn't enough free memory on that node for the VM to allocate all of its memory there, the VM will fail to start. You can change this to 'preferred' if you would like it to start anyway with some of its memory allocated on another NUMA node.
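The softer variant would look like this (same placement preference, but allowed to spill over to the other node):

<numatune>
  <memory mode='preferred' nodeset='0'/>
</numatune>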
Lastly, IO tuning is a bit different from the previous two. Before, we were choosing which CPUs and memory to assign to the VM based on their node, but for IO tuning the node of the device you want to pass through (be it PCI or USB) is fixed, and you may not have the same kind of resource (a graphics card, say) on the other node. This means that, in most cases, the IO devices you want to pass through will ultimately determine which node your VM should be assigned to. To determine which node a PCI device is connected to, you will first need that device's bus address, which should look like this: 0000:0e:00.0. To find the device's address in the unRAID webGUI, go to Tools > System Devices and search for your device in the PCI Devices and IOMMU Groups box. Then open a terminal and issue the following commands:
cd /sys/bus/pci/devices/[PCI bus address]
cat numa_node

The output should look like this:

0
For my example here you can see that my device is assigned to NUMA node 0. I will point out that if you are passing through multiple devices (GPU, USB controller, NVMe drive), they might not all be on the same node; in that case I would prioritize which node you ultimately assign your VM to based on the use of that VM. For gaming I would personally prioritize having the GPU on the same node, but YMMV. If you want to check several devices at once, a quick loop like the one below does the job.
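Here is a small shell sketch; the two bus addresses are hypothetical placeholders, so substitute the addresses you found under Tools > System Devices:

# print the NUMA node for each pass-through candidate (addresses are examples only)
for dev in 0000:0e:00.0 0000:0f:00.0; do
  echo "$dev: NUMA node $(cat /sys/bus/pci/devices/$dev/numa_node)"
done

A value of -1 means the platform did not report a node for that device.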
It can be easy to visualize NUMA nodes as their own computers. Each node may have its own CPU, RAM, and even IO devices. The nodes are connected through high-speed interconnects, but if one node wants memory or IO from a device on another node, it has to ask the other node for it rather than address it directly. That request adds latency and costs performance. The "numactl -H" command we issued in the CPU section of this guide also shows this, abstractly: the node distances table at the bottom of its output lays the nodes out in a grid showing the relative distance from each node to every other node. The greater the distance, the longer the turnaround time for cross-node requests and the higher the latency.
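From the example output earlier, the relevant part is the table at the bottom:

node distances:
node   0   1
  0:  10  21
  1:  21  10

By convention a node's distance to itself is 10 (local access); the off-diagonal values (21 in that illustrative output, though the exact figure is platform-specific) tell you how much more expensive a remote access is relative to a local one.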
It is possible, if you have need of it, to craft the XML for a VM in such a way as to make the VM NUMA-aware, so that the guest can properly use two or more socketed CPUs. This involves defining a guest NUMA topology under the <cpu> element in addition to adjusting the <vcpu> and <cputune> tags. This is outside the scope of a basic tuning guide, so I will just include a link to https://libvirt.org/formatdomain.html, which contains the entire specification for the libvirt domain XML that unRAID VMs are written in.
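If you are curious what that looks like, here is a minimal sketch of a two-node guest topology for a 16-vcpu VM, split evenly across two cells. The cpu ranges and memory sizes are placeholder assumptions, not values from my system:

<cpu>
  <numa>
    <!-- placeholder values: 8 vcpus and 8 GiB of guest RAM per cell -->
    <cell id='0' cpus='0-7' memory='8388608' unit='KiB'/>
    <cell id='1' cpus='8-15' memory='8388608' unit='KiB'/>
  </numa>
</cpu>

For this to pay off, each cell's vcpus should then be pinned (via <cputune>) to host cores on the matching physical node, with its memory bound by <numatune>, so the guest topology actually lines up with the hardware.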