AMD EPYC 7551P (Naples): assigning logical CPUs to VMs to keep them on a single die


reyo


Hi!

 

I just remembered that Naples is 4x 8c/16t dies on one SoC, and that a VM spanning dies is bad for performance. RAM sharing across dies isn't good for performance either, but I don't think that can be controlled.

 

PS. I'm kinda new to all this EPYC stuff :)

 

Thanks in advance



I think the best you can do is to pass cores corresponding to whole CCXs, and set the number of dies in the XML to match what is passed to the VM. Maybe also set emulatorpin to 0/32 and iothreadpin to 1/33.

The 7551P should have 4 dies.

https://developer.amd.com/wp-content/resources/56308-NUMA Topology for AMD EPYC™ Naples Family Processors.PDF

Avoid passing core 0/32 to the VM.

Cores per CCX should be organized like this:

-----------DIE1-----------

0/32, 1/33, 2/34, 3/35 --> CCX 1

4/36, 5/37, 6/38, 7/39 --> CCX 2

-----------DIE1-----------

-----------DIE2-----------

8/40, 9/41, 10/42, 11/43 --> CCX 3

12/44, 13/45, 14/46, 15/47 --> CCX 4

-----------DIE2-----------

-----------DIE3-----------

16/48, 17/49, 18/50, 19/51 --> CCX 5

20/52, 21/53, 22/54, 23/55 --> CCX 6

-----------DIE3-----------

-----------DIE4-----------

24/56, 25/57, 26/58, 27/59 --> CCX 7

28/60, 29/61, 30/62, 31/63 --> CCX 8

-----------DIE4-----------

 

for a total of 8 CCXs, with 4 cores per CCX.
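If you want to double-check that layout on the host rather than trusting the table, the kernel already exposes it. A minimal sketch, assuming the usual sysfs paths are present on Unraid:

# every logical CPU reports which CPUs share its L3 cache;
# on Naples one L3 = one CCX, so each unique line is one CCX
cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u

With the layout above you should see eight unique lines, e.g. 0-3,32-35 for CCX 1.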

 

As an example, for a 16-core/32-thread VM, I would pass:

16/48, 17/49, 18/50, 19/51 --> CCX 5

20/52, 21/53, 22/54, 23/55 --> CCX 6

24/56, 25/57, 26/58, 27/59 --> CCX 7

28/60, 29/61, 30/62, 31/63 --> CCX 8

 

<vcpu placement='static'>32</vcpu>
  <iothreads>2</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='16'/>
    <vcpupin vcpu='1' cpuset='48'/>
    <vcpupin vcpu='2' cpuset='17'/>
    <vcpupin vcpu='3' cpuset='49'/>
    <vcpupin vcpu='4' cpuset='18'/>
    <vcpupin vcpu='5' cpuset='50'/>
    <vcpupin vcpu='6' cpuset='19'/>
    <vcpupin vcpu='7' cpuset='51'/>
    <vcpupin vcpu='8' cpuset='20'/>
    <vcpupin vcpu='9' cpuset='52'/>
    <vcpupin vcpu='10' cpuset='21'/>
    <vcpupin vcpu='11' cpuset='53'/>
    <vcpupin vcpu='12' cpuset='22'/>
    <vcpupin vcpu='13' cpuset='54'/>
    <vcpupin vcpu='14' cpuset='23'/>
    <vcpupin vcpu='15' cpuset='55'/>
    <vcpupin vcpu='16' cpuset='24'/>
    <vcpupin vcpu='17' cpuset='56'/>
    <vcpupin vcpu='18' cpuset='25'/>
    <vcpupin vcpu='19' cpuset='57'/>
    <vcpupin vcpu='20' cpuset='26'/>
    <vcpupin vcpu='21' cpuset='58'/>
    <vcpupin vcpu='22' cpuset='27'/>
    <vcpupin vcpu='23' cpuset='59'/>
    <vcpupin vcpu='24' cpuset='28'/>
    <vcpupin vcpu='25' cpuset='60'/>
    <vcpupin vcpu='26' cpuset='29'/>
    <vcpupin vcpu='27' cpuset='61'/>
    <vcpupin vcpu='28' cpuset='30'/>
    <vcpupin vcpu='29' cpuset='62'/>
    <vcpupin vcpu='30' cpuset='31'/>
    <vcpupin vcpu='31' cpuset='63'/>
    <emulatorpin cpuset='0,32'/>
    <iothreadpin iothread='2' cpuset='1,33'/>
  </cputune>
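If you don't want to type all 32 vcpupin lines by hand, a small shell loop can print them for you. Just a sketch, assuming core N is paired with SMT sibling N+32 as in the table above (adjust the 16-31 range for whichever dies you pass):

# print vcpupin lines for host cores 16-31 plus their SMT siblings 48-63
vcpu=0
for core in $(seq 16 31); do
  echo "    <vcpupin vcpu='$vcpu' cpuset='$core'/>"
  echo "    <vcpupin vcpu='$((vcpu+1))' cpuset='$((core+32))'/>"
  vcpu=$((vcpu+2))
done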

 

and then in the cpu section:

  <cpu mode='host-passthrough' check='none' migratable='on'>
    <topology sockets='1' dies='2' cores='8' threads='2'/>
    <cache mode='passthrough'/>
    <feature policy='require' name='topoext'/>
  </cpu>
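Note that the topology line has to multiply out to the vCPU count: 1 socket x 2 dies x 8 cores per die x 2 threads = 32. Once the VM is up you can also confirm the pinning landed where you expect; 'Windows10' below is just a placeholder for your VM name:

# show the current vCPU-to-host-CPU pinning and the emulator pinning
virsh vcpupin Windows10
virsh emulatorpin Windows10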

 


Thank you! But wouldn't CCX 5-8 cross dies? Wouldn't it be faster with just CCX 5-6? I read somewhere that the best performance comes from 4x 8-core VMs (or whatever configuration, as long as no VM crosses a die), probably because of the L3 cache.

 

This die / CCX configuration is much appreciated!

 

Also, how much will this die crossing affect performance/latency? I'm running a Docker-based web server, currently on the host itself, but I'm thinking of making 3x 8c/16t VMs (3 dies). Would the performance be better, or would the difference be negligible (the drive backend is RAID 1, NVMe)? My thought is that the extra latency virtualization adds might outweigh the gains from staying within a die.


There are some discussions on the Unraid forums and on Reddit about this, just search for 'ccx and dies'. Unfortunately I don't want to say anything more, simply because I'm not sure whether it's correct or not.

But the first thing I can say for sure is that you need the layout of the CPU, so you know which cores are inside which CCX and which CCXs belong to which die.
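On the host side you can get the node (die) part with numactl --hardware or, per logical CPU, with lscpu; on Naples each NUMA node is one die:

# one row per logical CPU: NODE is the die, CORE groups SMT siblings
lscpu --extended=CPU,NODE,SOCKET,CORE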

 

6 minutes ago, reyo said:

But wouldn't CCX 5-8 cross dies?

Apart from the '5' and '8' numbers, if those CCXs belong to different dies, yes! But if you want a VM with more than 8 cores/16 threads you have no option other than crossing dies, since each die has 8 cores.

 

8 minutes ago, reyo said:

Wouldn't it be faster with just CCX 5-6?

Again, apart from the '5' and '6' numbers: yes, for a VM with 8 cores/16 threads.

 

The easy way, if I were you, would be to run some tests pinning different CPUs / with different settings in the XML and see how the results change.
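If you have sysbench (or any memory benchmark) available, one quick A/B test is to keep the cores fixed and only move the memory, which isolates the cross-die penalty; the node numbers here are just an example:

# same cores (node 3), local memory vs memory forced onto node 0
numactl --cpunodebind=3 --membind=3 sysbench memory run
numactl --cpunodebind=3 --membind=0 sysbench memory run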

 

 

2 hours ago, ghost82 said:

The easy way, if I were you, would be to run some tests pinning different CPUs / with different settings in the XML and see how the results change.

Do they show up as NUMA nodes?

 

root@unraid:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 64397 MB
node 0 free: 436 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64505 MB
node 1 free: 218 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

 


So @ghost82 was exactly right. I didn't know you could query information like this:

This is the output:

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 15987 MB
node 0 free: 159 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 16124 MB
node 1 free: 51 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 16124 MB
node 2 free: 50 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 16122 MB
node 3 free: 49 MB
node distances:
node   0   1   2   3 
  0:  10  16  16  16 
  1:  16  10  16  16 
  2:  16  16  10  16 
  3:  16  16  16  10 

 

So memory is taken from the nearest memory controller for the die/CCX? And when a process runs on node 3 (the way Docker spawns processes), the memory could still be allocated from node 0, and that's where the latency comes from? It's probably best to make dedicated VMs, because then the nearest memory is allocated. Sorry for my bad English and my lack of understanding at the chip level :), I haven't really looked at CPU diagrams or tried to make sense of how things work at the die level.
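If you do keep the web server in Docker on the host instead of moving it into VMs, you can get much of the same locality by pinning the containers; --cpuset-cpus and --cpuset-mems are standard docker run flags, and the values below are just node 3 from your output (the image name is a placeholder):

# keep the container's processes and its memory allocations on node 3
docker run -d --cpuset-cpus="24-31,56-63" --cpuset-mems="3" your-webserver-image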

 

 

