Ryzen/Threadripper PSA: Core Numberings and Assignments



A person on Reddit told me the answer to my problem. With the XML below, QEMU presents the CPU as EPYC instead and the cache topology comes through correctly. It dropped latency across the board: L3 is down to 13 ns, L1 to ~1 ns, and L2 to 2-3 ns. The machine feels MUCH more responsive. I should also mention they said that after updating their kernel, a QEMU patch made theirs report the topology properly without this XML, so hopefully we will see that land in unRAID as well.

 

<cpu mode='custom' match='exact' check='partial'>

<model fallback='allow'>EPYC-IBPB</model>

<topology sockets='1' cores='8' threads='2'/>

<feature policy='require' name='topoext'/>

</cpu>
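If you want to confirm the override actually reached the guest, one host-side check (a sketch only; the VM name "Windows10" is a placeholder for whatever your VM is called) is to look at the -cpu argument libvirt generated for the QEMU process, which should list EPYC-IBPB with topoext:

pid=$(pgrep -f 'qemu.*Windows10' | head -n1)             # PID of the VM's QEMU process
tr '\0' '\n' < /proc/$pid/cmdline | grep -A1 -- -cpu     # print the -cpu argument and its value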

 

To give an idea of how much of a change this is: if you're wondering why the cache read/write/copy numbers are "higher" in the old screenshots, it's because I had cores bound from multiple NUMA nodes in an attempt to make things faster. The new results are from NUMA node 0 only, with cores 1-7 and their SMT siblings bound.

 

Edit:

An update: this decreased latency across the board, but you cannot span multiple NUMA nodes. So far, declaring a NUMA topology in the guest fails, and the OS is unaware when you cross a NUMA boundary. I'm working on that part, since once I fix it I should be able to raise my read/write/copy speeds a lot.
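Before pinning, it helps to see how the host lays out its nodes so a VM's cores all come from one of them. A quick sketch, assuming the numactl and util-linux packages are available on the host:

numactl --hardware                        # nodes, which CPUs and how much RAM each one owns
lscpu --extended=CPU,CORE,SOCKET,NODE     # per logical CPU: its core, socket and NUMA node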

 

Edit2:

 

I can report substantial increases in FPS for my games and the stutter thus far has been eliminated.

 

Old

[cache/memory benchmark screenshot]

 

New

[cache/memory benchmark screenshot]

 

Physical Baremetal performance

[cache/memory benchmark screenshot]

 

CPUZ

Baremetal

[CPU-Z screenshot]

 

Virtual Old

[CPU-Z screenshot]

 

Virtual New

[CPU-Z screenshot]

 

Another user posted this to me:

So the stock QEMU code does not allow the cache topology to be passed through.

Here is a patch for QEMU 3.0.0 that allows the cache topology to be passed through:

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 723e02221e..2912867872 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -4221,6 +4221,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         break;
     case 0x8000001D:
         *eax = 0;
+        if (cpu->cache_info_passthrough) {
+            host_cpuid(index, count, eax, ebx, ecx, edx);
+            break;
+        }
         switch (count) {
         case 0: /* L1 dcache info */
             encode_cache_cpuid8000001d(env->cache_info_amd.l1d_cache, cs,

Use at your own risk.
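For anyone wanting to try it, the usual way to apply and rebuild is sketched below; the patch file name and paths are placeholders, and swapping the result in for unRAID's bundled QEMU is left to you.

cd qemu-3.0.0                                    # unpacked QEMU 3.0.0 source tree
patch -p1 < cache-topology-passthrough.patch     # apply the diff above
./configure --target-list=x86_64-softmmu
make -j"$(nproc)"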

 

Edited by Jerky_san

I just tried the settings and I still get the same results as the Virtual Old. My memory and L1 latencies are close to yours, but the write/copy numbers are about half. This VM only has 8 vCPUs (4c/8t). The L2 and L3 tests don't even run; I'm on the trial version of AIDA64, so maybe that's why they don't show?

[cache/memory benchmark screenshot]

 

 

11 minutes ago, TType85 said:

I just tried the settings and I still get the same results as the Virtual Old. My memory and L1 latencies are close to yours, but the write/copy numbers are about half. This VM only has 8 vCPUs (4c/8t). The L2 and L3 tests don't even run; I'm on the trial version of AIDA64, so maybe that's why they don't show?

[cache/memory benchmark screenshot]

 

 

The biggest change for me was L3. The memory latency improvement came from something in the VM itself plus some tweaks (I purged the whole VM and rebuilt it). L3 saw a 6x decrease in latency. L1 for me went from 3 ns to 1 ns, which is massive since L1 is used non-stop, but the biggest thing is that the cache is now properly allocated. Before, if you look, my L3 showed as 5x16, which is literally impossible, and L1 was reported 2x larger than it should be and only 2-way instead of 8-way associative.

 

Also, with this change you can't span NUMA nodes. If you have any processors allocated outside a single NUMA node, it will add memory latency. If we can get it to identify Threadripper properly, then we can cross NUMA nodes and be fine.

 

I should also state that I have a 2990WX and I changed my memory setting from "Auto" to "Channel" in the BIOS; that provided a large increase in performance as well.

"In my case, under Advanced -> CBS -> DF there is a Memory model item that has choices like auto | distribute | channel. In our case, we are interested in channel mode, as it will expose NUMA information to the host once more."

  <vcpu placement='static'>14</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='33'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='34'/>
    <vcpupin vcpu='4' cpuset='3'/>
    <vcpupin vcpu='5' cpuset='35'/>
    <vcpupin vcpu='6' cpuset='4'/>
    <vcpupin vcpu='7' cpuset='36'/>
    <vcpupin vcpu='8' cpuset='5'/>
    <vcpupin vcpu='9' cpuset='37'/>
    <vcpupin vcpu='10' cpuset='6'/>
    <vcpupin vcpu='11' cpuset='38'/>
    <vcpupin vcpu='12' cpuset='7'/>
    <vcpupin vcpu='13' cpuset='39'/>
    <emulatorpin cpuset='1-7'/>
  </cputune>
  <numatune>
    <memory mode='interleave' nodeset='0'/>
  </numatune>
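The vCPU pairs above (1/33, 2/34, and so on) are meant to put each guest core's two threads onto one physical core's two SMT threads. To confirm the sibling numbering on your own host before copying the pinning, something like this works (a sketch):

grep . /sys/devices/system/cpu/cpu*/topology/thread_siblings_list   # each line: a logical CPU and its SMT sibling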

Edited by Jerky_san
1 hour ago, Jerky_san said:


I set mine to Channel this weekend and noticed an increase in performance in WoW. I just added the numatune to my configs pointing at the correct nodeset (my wife's VM is on node 1, mine is on node 0). I'll have to wait until I get home to check gaming performance, since I'm RDP'd in right now.

 

If I can get it to read the cache right maybe it will get even better.

Edited by TType85
3 minutes ago, TType85 said:

I set mine to Channel this weekend and noticed an increase in performance in WoW. I just added the numatune to my configs pointing at the correct nodeset (my wife's VM is on node 1, mine is on node 0). I'll have to wait until I get home to check gaming performance, since I'm RDP'd in right now.

 

If I can get it to read the cache right maybe it will get even better.

So I probably should have set that to "strict" instead of "interleave", since interleave is for when I'm spanning two NUMA nodes.
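If you want to flip that without editing the XML by hand, virsh can do it; the VM name below is a placeholder.

virsh numatune Windows10                                      # show the current mode and nodeset
virsh numatune Windows10 --mode strict --nodeset 0 --config   # bind allocations to node 0 (add --live for a running VM)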

34 minutes ago, Jerky_san said:

So I probably should have set that to "strict" instead of "interleave", since interleave is for when I'm spanning two NUMA nodes.

Odd: on my wife's VM, memory bandwidth is half and latency is 50% higher. All settings are the same except she is on the other side of the chip. CPU-Z scores are the same (better than they were before, in the 410 range instead of the 360 range).

Edited by TType85
29 minutes ago, TType85 said:

Odd: on my wife's VM, memory bandwidth is half and latency is 50% higher. All settings are the same except she is on the other side of the chip. CPU-Z scores are the same (better than they were before, in the 410 range instead of the 360 range).

So the "nodeset" setting tells it where to pull memory from. It should only pull memory from the node its die is assigned to, though it always pulls a little RAM from the other side. 64081 is my main gaming VM. Also, is your RAM populated across at least 4 DIMMs?


Per-node process memory usage (in MBs)
PID              Node 0 Node 1 Node 2 Node 3 Total
---------------  ------ ------ ------ ------ -----
52912 (qemu-syst      4      0   1510      0  1514
64081 (qemu-syst  26427      0   1833      0 28260
---------------  ------ ------ ------ ------ -----
Total             26431      0   3343      0 29773
 

Edited by Jerky_san
2 hours ago, Jerky_san said:


Currently there are 4 DIMMs installed. What command shows the per-node usage?

7 minutes ago, Jerky_san said:

 numastat -c qemu
 

[numastat output screenshot]

 

My wife's VM is set to strict, nodeset 1; mine was strict, nodeset 0, but I rebooted the server and my VM wouldn't start because it ran out of memory. I had to take out the numatune to get my VM up.

 

Edit: both are set to 8 GB of RAM.
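One thing worth checking when mode='strict' refuses to start a VM: the nodeset you picked has to have enough free RAM for the whole guest, since strict forbids falling back to the other node. A quick sketch for checking before starting (assuming numastat is installed on the host):

numastat -m                                           # per-node MemTotal / MemFree summary
grep MemFree /sys/devices/system/node/node*/meminfo   # same numbers straight from the kernel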

Edited by TType85
1 minute ago, Jerky_san said:

I don't know why it wouldn't start, but since you have your VMs split between the nodes, it looks like it's using the other node's RAM. Still strange, though.

Yeah, really odd. My wife's VM usually starts first; I would assume the numatune setting would push it over to node 1, and when mine starts it would be in node 0. I'll have to wait until tomorrow to play with it more. numastat -c output below; look at that numa_miss...

[numastat -c output screenshot]
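For reference, the numa_miss column in that output comes from the kernel's per-node allocation counters; you can watch them from the CLI too (a sketch, assuming numastat is installed on the host):

numastat              # numa_hit / numa_miss / numa_foreign per node since boot
watch -n 5 numastat   # re-run every 5 s while the VMs are loaded, if watch is available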

1 minute ago, TType85 said:

Yeah, really odd. My wife's VM usually starts first; I would assume the numatune setting would push it over to node 1, and when mine starts it would be in node 0. I'll have to wait until tomorrow to play with it more. numastat -c output below; look at that numa_miss...

[numastat -c output screenshot]

How about your VM, does it feel any better?


I emergency-purchased a 2950X and an MSI X399 SLI Plus mobo on Tuesday when my mobo died - I hadn't done any research, I just kind of knew that my next system would be a TR for the value.

 

As the purchase wasn't planned, I haven't done my TR research, although even without any tweaking I'm very happy with the performance boost on the first day of running - I just want to make sure I'm not driving a Ferrari in the slow lane.

 

I've been reading this thread today and I'm not sure how I'm best supposed to assign cores to my VMs. At the moment I've got 3 VMs with 3 cores each and 1 with 2 cores, and I've just gone 10-13, 14-19, 20-25 and 26-31 (most of my dockers are on 4-9, my VM emulator pins are on 2-3, and 0-1 is left for unRAID). lstopo said I only had one die, so I mistakenly thought NUMA was only a WX thing, but now I guess my mobo is set to UMA, even though I can't see a setting to change in the BIOS?

 

My questions are,

 

i) Do I need to enable NUMA somehow in the BIOS and then assign cores to the VMs from the same die as the PCIe devices, or is UMA OK?

ii) If I'm supposed to enable NUMA, where's the setting on the SLI Plus, as I can't find it?

iii) If I stick with UMA, are my core assignments OK?

iv) Are there any other unRAID/BIOS changes I should be making?

 

Thanks in advance


If you're running in NUMA mode you gain memory bandwidth, because you're accessing both memory controllers at the same time, one per die. But you will have slightly higher memory access latency. In everything I tested for normal VM use (browsing, office stuff, gaming) I couldn't see any big differences. Moving the GPU to another slot, or using an NVMe in another one, so the device isn't connected to the die I pass through directly, looks the same for me. No big noticeable differences. I don't really care about +/-5 fps in games as long as everything runs smooth, and it does. Read and write speeds of the NVMe are also roughly the same. You might see some hiccups if the device is connected to the other die while that die is under heavy load. For me it never happens that my dockers or Unraid itself get to the point of using all the CPU resources. It might be different with a Plex container or other transcoding dockers pushing the CPU to the limit. It all depends on how you're using your rig.
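If you want to know which die a passed-through device actually hangs off, the kernel exposes it per PCI device. A sketch; the PCI address below is an example only, substitute your own from lspci:

lspci | grep -i vga                               # find the GPU's PCI address
cat /sys/bus/pci/devices/0000:0a:00.0/numa_node   # 0 or 1 = the node it's attached to, -1 = no affinity reported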

 

I don't really know where you can find the BIOS setting on an MSI board. On ASRock you can find it under CBS / DF common options. The default memory interleaving setting for me is Auto; Channel switches it to NUMA. Apart from setting my XMP profile for the RAM, I only enabled hardware virtualization support (SVM Mode and SR-IOV) and enabled IOMMU. Not really sure what the last one is called exactly. Something with IOMMU.

 

No matter how you set your memory, you should always use cores/threads from the same die for a single VM and not mix the cores, to reduce latency. On my 1950X it looks like this:

 

[screenshot of 1950X core/thread assignments per die]

 

Die 1 is only for my main VM with GPU and NVMe passthrough. Cores 8 and 24 are the emulatorpins and the rest is isolated and only used by a Win10 VM. On die 0 I have all my dockers running and a couple of VMs I use from time to time. I'm currently playing around with the emulatorpin setting, but I can't see any difference compared to leaving it out. Isolating the emulator pins, not isolating them, no pinning at all: for me I can't really see a difference. I haven't checked using the emulatorpin on the other die yet. Worth a try.
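For experimenting with that without restarting the VM, virsh can query and move the emulator threads on the fly (the VM name is a placeholder):

virsh emulatorpin Win10                 # show where the emulator threads are pinned now
virsh emulatorpin Win10 8,24 --live     # move them to host CPUs 8 and 24 on the running VM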

 


Thanks, that was useful. My mobo apparently can't switch NUMA on, but it looks like I'm not missing anything.

 

I've switched the cores for my VMs to get them lined up with the dies even though I'm UMA - is the layout the same for the 2950X?

 

I agree about the emulator pins - I do it out of habit, even though I haven't seen any evidence to say it is worthwhile

52 minutes ago, DZMM said:

My mobo apparently can't switch NUMA on

I bet you have this setting too. All the X399 BIOSes I've seen so far have tons of settings, more or less well structured. If you search long enough you might find it 😁

 

54 minutes ago, DZMM said:

is the layout the same for the 2950X?

Should be pretty similar. Both have 2 dies each with 8 cores. 

1 minute ago, bastl said:

I bet you have this setting too. All the X399 BIOSes I've seen so far have tons of settings, more or less well structured. If you search long enough you might find it 😁

 

I even emailed MSI and they said no, although the reply was a bit short, so I'm not 100% certain the person understood the question.

3 hours ago, DZMM said:

I even emailed MSI and they said no, although the reply was a bit short, so I'm not 100% certain the person understood the question.

Mine wasn't called NUMA; the options were Auto, Die, and Channel. Channel is best on mine.

 

From a guide I used:

In my case, under Advanced -> CBS -> DF there is a Memory model item that has choices like auto | distribute | channel. In our case, we are interested in channel mode, as it will expose NUMA information to the host once more.

Edited by Jerky_san