Ryzen/Threadripper PSA: Core Numberings and Assignments



4 hours ago, Jerky_san said:

So... it's important to set the pinning so you don't go outside a core that has access to the memory, otherwise it will introduce stuttering into games with intense graphics. The one I test on is Dying Light. Before I did this little patch I was getting 80 fps max; this patch skyrocketed my fps to a nearly consistent 120. Your L3 cache latency gets the largest boost, from 50-ish down to 10-11, and your memory latency usually drops from the low 100s to very close to bare metal. It's amazing.

 

 

Thanks for all the explanations, it's coming together slowly in my head.

 

I'm not sure how I would know which core has access to the memory, though. Any chance you could explain that bit for me?

Link to comment
On 11/21/2018 at 3:36 PM, Nooke said:

Go into BIOS -> OC -> Advanced DRAM Configuration.

Scroll down to "Misc Item" and look for "memory interleaving". Change this from "auto" to "channel" and you are in NUMA mode.

Brilliant - it worked. The results were interesting. Organising my 4 VMs the way I want looks easy, although the SM961 M.2 drive they share is on the wrong NUMA node for two of them. Luckily, they are the two less important ones (kids):

[image: topology2.png]
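
(For anyone following along: a quick way to confirm the host really exposes two NUMA nodes after the BIOS change, and which cores belong to which node, is sketched below; a minimal sketch, assuming the numactl package is available on the host.)

# Count of NUMA nodes and which host CPUs belong to each
numactl --hardware | grep -E 'available|cpus'

# The same information via lscpu
lscpu | grep -i numa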

 

Does NUMA 'tuning' also apply to HDDs? My cache pool drives sdl and sdk are on different NUMA nodes. Should I swap sdl's connector with one from sdc-sdj?
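
(For reference, one rough way to check which node a disk's controller sits on is via sysfs, sketched below; the device name is illustrative, and some controllers report -1 if the firmware doesn't expose the information.)

# Resolve the block device to its PCI controller, then read that controller's NUMA node.
# The last PCI address in the sysfs path is the storage controller itself.
dev=sdl    # illustrative device name
ctrl=$(readlink -f "/sys/block/$dev" | grep -oE '[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]' | tail -n1)
cat "/sys/bus/pci/devices/$ctrl/numa_node"    # prints 0 or 1 (-1 if not reported)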

 


Once someone confirms where numatune goes in the XML file, I'll start experimenting!

 

Edit: found it

<numatune>
    <memory mode='interleave' nodeset='1'/>
</numatune>

 

Edited by DZMM
found nodeset location
Link to comment
numastat qemu

Shows the currently running VMs and their use of RAM. As I said earlier, you can tell the VM with the numatune option which node to use the memory from, but it looks like it doesn't stick. There are always a couple of megs used from the other node.

 

[image: numastat.jpg]

First VM is set to use 8GB and 4 cores from node0, second VM is set to use 16GB and 14 cores from node1. 

<memory mode='strict' nodeset='1'/>

Strict should limit the VM to a specific node only, but it doesn't. "preferred" or "interleave" doesn't change anything for me.
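
(To look at a single VM rather than every QEMU process, something like the sketch below works; the VM name is illustrative.)

# Per-node memory usage of one specific VM (libvirt starts QEMU with "-name guest=<vmname>")
pid=$(pgrep -f 'qemu.*guest=Windows10' | head -n1)
numastat -p "$pid"

# Which physical CPUs the VM's threads last ran on (compare against "numactl --hardware")
ps -Lo psr,comm -p "$pid"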

Link to comment
11 minutes ago, bastl said:

numastat qemu

Shows the currently running VMs and their use of RAM. As I said earlier, you can tell the VM with the numatune option which node to use the memory from, but it looks like it doesn't stick. There are always a couple of megs used from the other node.

 

[image: numastat.jpg]

First VM is set to use 8GB and 4 cores from node0, second VM is set to use 16GB and 14 cores from node1. 


<memory mode='strict' nodeset='1'/>

Strict should limit the VM to a specific node only, but it doesn't. "preferred" or "interleave" doesn't change anything for me.

I've just done 3 of my 4 VMs and my memory usage looks the same with interleave:

 

Per-node process memory usage (in MBs)
PID                               Node 0          Node 1           Total
-----------------------  --------------- --------------- ---------------
18388 (qemu-system-x86)          4143.74          167.77         4311.51
29504 (qemu-system-x86)          4864.92         3407.02         8271.94
36634 (qemu-system-x86)          4881.64         3376.45         8258.08
115105 (qemu-system-x86)         8236.66           25.16         8261.83
-----------------------  --------------- --------------- ---------------
Total                           22126.96         6976.40        29103.36

18388 is the VM I haven't tweaked yet as it's my pfSense firewall, so I have to do that when the family aren't doing stuff. Unfortunately all the memory is on Node 0 while the cores are on Node 1.

 

I'm wondering if the memory mode isn't working because the memory is already in use, e.g. by unRAID, Dockers etc., so if there's not enough free on the node when the VM is created, unRAID takes memory from the other node? Otherwise, to allocate say 8GB when there's only 6GB available, it would have to move other blocks around?

 

Link to comment

For the 16GB VM I tried to use RAM and cores from node0 only, and it shows the same: it always uses a couple of megs from the other node. Don't know why; something is off. That's with 64GB in total and without even running any Docker containers or other stuff.
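
(If someone wants to see where those stray megabytes actually come from, the kernel exposes a per-mapping breakdown in /proc; a rough sketch, with the PID lookup purely illustrative.)

# Each line of numa_maps describes one mapping, with N<node>=<pages> fields showing
# how many pages sit on each node. Mappings with an N0= field are the ones on node0.
pid=$(pgrep -f 'qemu.*guest=Windows10' | head -n1)
grep 'N0=' "/proc/$pid/numa_maps" | head -n 20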

 

Can you check how one of your VMs behaves with "preferred" instead of "interleave"? @DZMM Also, I don't know why you are using "interleave" anyway. This forces QEMU to spread the RAM across all nodes. The point of pinning the RAM to a specific node/die is to reduce latency and improve the performance of a VM.

Link to comment
26 minutes ago, bastl said:

@DZMM Also, I don't know why you are using "interleave" anyway. This forces QEMU to spread the RAM across all nodes. The point of pinning the RAM to a specific node/die is to reduce latency and improve the performance of a VM.

I just followed @Symon's earlier post blindly ;-)

 

All good now, will set 19218 to Node 1 strict later:

Per-node process memory usage (in MBs)
PID                               Node 0          Node 1           Total
-----------------------  --------------- --------------- ---------------
10236 (qemu-system-x86)          8269.48            0.00         8269.48
19218 (qemu-system-x86)          4146.88          181.60         4328.48
124935 (qemu-system-x86)           19.52         8257.12         8276.64
127416 (qemu-system-x86)         8264.96            0.00         8264.97
-----------------------  --------------- --------------- ---------------
Total                           20700.85         8438.73        29139.58

 

Link to comment
1 hour ago, bastl said:

numastat qemu

Shows the currently running VMs and their use of RAM. As I said earlier, you can tell the VM with the numatune option which node to use the memory from, but it looks like it doesn't stick. There are always a couple of megs used from the other node.

 

[image: numastat.jpg]

First VM is set to use 8GB and 4 cores from node0, second VM is set to use 16GB and 14 cores from node1. 


<memory mode='strict' nodeset='1'/>

Strict should limit the VM to a specific node only, but it doesn't. "preferred" or "interleave" doesn't change anything for me.

 

I guess you assigned too much RAM to your VM.

 

I'm running with strict memory mode and my gaming VM is only using memory from the node I set it to.

[image]

 

 

What you have to keep in mind: if you have, say, 32 GB overall with 4x8 GB sticks and have those in normal quad-channel mode (see your motherboard manual), you can't take exactly 16 GB for a VM.

Because even without running any VM there is some small amount of memory already used from each node.

[image]

 

So just check how much memory is available on each node before starting the VM with "numactl --hardware", and then use that much and not more.
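
(Illustrative two-node output, e.g. from a 2950X; the exact CPU lists, sizes and free amounts will differ per system.)

numactl --hardware
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
# node 0 size: 32136 MB
# node 0 free: 28064 MB
# node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
# node 1 size: 32253 MB
# node 1 free: 30905 MB
# node distances:
# node   0   1
#   0:  10  16
#   1:  16  10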

 

 

cheers

Link to comment
1 hour ago, DZMM said:

I just followed @Symon's earlier post blindly ;-)

For me it works this way :).

If I use interleave, it uses the configured node for most of the RAM (10 MB always seems to get taken from the other node).

If I use strict, the VM takes longer to boot but all of the RAM is on one node.

Same as you, I have the problem that my cache disks are on a different node than my GPU. For now I decided to go with the node that the GPU is connected to for the gaming VM.

 

My configuration currently looks like this:

(Cores 1-7 and 18-23 are isolated from unRAID)

  <vcpu placement='static'>12</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='2'/>
    <vcpupin vcpu='1' cpuset='18'/>
    <vcpupin vcpu='2' cpuset='3'/>
    <vcpupin vcpu='3' cpuset='19'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='5' cpuset='20'/>
    <vcpupin vcpu='6' cpuset='5'/>
    <vcpupin vcpu='7' cpuset='21'/>
    <vcpupin vcpu='8' cpuset='6'/>
    <vcpupin vcpu='9' cpuset='22'/>
    <vcpupin vcpu='10' cpuset='7'/>
    <vcpupin vcpu='11' cpuset='23'/>
    <emulatorpin cpuset='1,17'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.10'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/18da3180-7fb3-6e5c-7013-963dfa89ec0a_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vendor_id state='on' value='none'/>
    </hyperv>
  </features>
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>EPYC-IBPB</model>
    <topology sockets='1' cores='6' threads='2'/>
    <feature policy='require' name='topoext'/>
    <feature policy='disable' name='monitor'/>
    <feature policy='require' name='x2apic'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='disable' name='svm'/>
  </cpu>

On the second node I run 6 Windows Server 2016 VMs and my wife's Win 10 VM (with numatune).

 

Edited by Symon
Link to comment

Tried it with "strict". The VM takes quite a bit longer to start, but now the RAM is completely allocated on one node.

[image]

 

I will have to do some tests :)

Update: After doing a Cinebench test, memory has been moved to the other node again :)

[image]

Update 2: Did a Cinebench test on my wife's VM (which is on the same node as the GPU / cache disks) and the memory stayed on the same node. So my guess is that 'strict' only works if the node is also connected to the storage (cache/passthrough disks).

[image]

Edited by Symon
Link to comment

@Nooke I have 4x16GB, 2x16 on each die, and if I have nothing running except one VM with 16GB, please explain to me why I

1 hour ago, Nooke said:

assigned too much RAM to my VM.

The only reason could be unRAID claiming more than 16GB from node1.

1 hour ago, Symon said:

So my guess is that 'strict' only works if the node is also connected to the storage (cache/passthrough disks).

My GPU and the NVMe controller I pass through to the VM are connected to node1, whose cores and RAM I assigned to the VM, and still it uses RAM from node0 ^^

Link to comment

Guys, I need some help. Having read all of the above, I'm not convinced my setup is optimal.

 

Hardware: MSI X399 MEG Creation (latest BIOS as of today) - TR 2950X - HyperX 3200 32GB (2x16)

BIOS: memory interleaving set to Channel (from Auto)

UNRAID 6.6.5

VM: W10, 12GB mem, 6 cores (12 threads) CPU

 

I haven't made any modifications (numa, EPYC) to the XML for my W10 VM and ran an AIDA64 memory benchmark. I think the stats are poor, right?

I know I'm only using 12 threads, but still, shouldn't my memory scores be better?

 

I see that the topic mainly relates to the 2990WX, but I have a 2950X.

Any tips on what I need to check or change would be appreciated.

 

UNRAID

[image: Image 1.png]

 

BARE METAL

[image: Untitled.png]

Edited by mikeyosm
added physical mem bench
Link to comment
8 hours ago, mikeyosm said:

Guys, I need some help. Having read all of the above, I'm not convinced my setup is optimal.

 

Hardware: MSI X399 MEG Creation (latest BIOS as of today) - TR 2950X - HyperX 3200 32GB (2x16)

BIOS: memory interleaving set to Channel (from Auto)

UNRAID 6.6.5

VM: W10, 12GB mem, 6 cores (12 threads) CPU

 

I haven't made any modifications (numa, EPYC) to the XML for my W10 VM and ran an AIDA64 memory benchmark. I think the stats are poor, right?

I know I'm only using 12 threads, but still, shouldn't my memory scores be better?

 

I see that the topic mainly relates to the 2990WX, but I have a 2950X.

Any tips on what I need to check or change would be appreciated.

 

UNRAID

[image: Image 1.png]

 

BARE METAL

[image: Untitled.png]

 

So you may be allocating RAM from the other die. If your CPUs are all on a single die, then you need to use the nodeset parameter and set it to the node the VM is running on; on a 2950X I'm guessing it's just 0 or 1. You should also try the EPYC fix, though your latency isn't nearly as high as mine was, so you probably want to see if your cache is out of whack first. Mine definitely was.

<numatune>
    <memory mode='static' nodeset='0'/>
</numatune>

 

Please read the below two posts from me earlier in the thread.

 

 

Link to comment
1 hour ago, Jerky_san said:

 

So you may be allocating RAM from the other die. If your CPUs are all on a single die, then you need to use the nodeset parameter and set it to the node the VM is running on; on a 2950X I'm guessing it's just 0 or 1. You should also try the EPYC fix, though your latency isn't nearly as high as mine was, so you probably want to see if your cache is out of whack first. Mine definitely was.

<numatune>
    <memory mode='static' nodeset='0'/>
</numatune>

 

Please read the below two posts from me earlier in the thread.

 

 

I'll try that, thank you. BTW, should that be 'strict', not 'static', for numatune?

Link to comment
8 hours ago, TType85 said:

I have been having issues getting the cache to show up right using this tweak. I ended up creating a new VM and the cache showed up correctly, so I created a new VM XML for my existing Windows install and now it shows correctly. I will have to test games when I get home.

I hope it turns out great for you.

Link to comment

 

9 hours ago, Jerky_san said:

Yeah sorry on that one

So I added the EPYC section and numatune strict; however, performance was no better. I then changed to interleave and there is an improvement in the L3 benchmark.

Still not happy with memory read/write/copy being half the bare metal performance.

[image: numa.png]

Edited by mikeyosm
Link to comment

Don't use the numatune option if you want the bandwidth of both channels. Or set it to "interleave" and specify both nodes.

<numatune>
    <memory mode='interleave' nodeset='0-1'/>
</numatune>

It's up to you what works better for you. It depends on the application and the games; some benefit from the higher bandwidth, some from the lower latency.
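
(One way to compare the two cases on the host side is to pin a benchmark with numactl itself; a minimal sketch, where the benchmark binary is just a placeholder.)

# Memory and CPUs strictly on one node (lowest latency):
numactl --cpunodebind=0 --membind=0 ./mem_benchmark

# Same CPUs, but memory interleaved across both nodes (more bandwidth, higher latency):
numactl --cpunodebind=0 --interleave=0,1 ./mem_benchmark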

Edited by bastl
typo
Link to comment
14 minutes ago, bastl said:

Don't use the numatune option if you want the bandwidth of both channels. Or set it to "interleave" and specify both nodes.


<numatune>
    <memory mode='interleave' nodeset='0-1'/>
</numatune>

It's up to you what works better for you. It depends on the application and the games; some benefit from the higher bandwidth, some from the lower latency.

OK, makes sense. However, before I made any mods to the XML or set my BIOS to channel from auto, memory read/write/copy in AIDA64 were exactly the same (half the bare-metal memory performance). Adding numatune and EPYC has not impacted read/write/copy; it only improved the L3 cache a bit.

Link to comment

The numatune setting only works if you set your BIOS settings for the RAM to channel. On all TR4 boards I've seen so far, the default setting is auto and the CPU is presented as one node to the OS. You might have to check your board's manual for how to set up the RAM slots correctly for 2 DIMMs. You're the first one I've seen here in the forum with only 2 DIMMs on the TR4 platform.

Link to comment
