AMD EPYC 7551P (Naples): assigning logical CPUs to VMs to keep them on a single die


reyo


Hi!

 

I just remembered that Naples is 4x 8c/16t dies on one SoC, and that a VM spanning dies is bad for performance. RAM sharing across dies isn't good for performance either, but I don't think that can be controlled.

 

PS. I'm kinda new to all this EPYC stuff :)

 

Thanks in advance



I think the best you can do is to pass cores corresponding to whole CCXs, and set the number of dies in the XML to match what is passed to the VM. Maybe also set emulatorpin to 0/32 and iothreadpin to 1/33.

The 7551P should have 4 dies.

https://developer.amd.com/wp-content/resources/56308-NUMA Topology for AMD EPYC™ Naples Family Processors.PDF

Avoid passing core 0/32 to the VM.

Cores per CCX should be organized like this:

-----------DIE1-----------

0/32, 1/33, 2/34, 3/35 --> CCX 1

4/36, 5/37, 6/38, 7/39 --> CCX 2

-----------DIE1-----------

-----------DIE2-----------

8/40, 9/41, 10/42, 11/43 --> CCX 3

12/44, 13/45, 14/46, 15/47 --> CCX 4

-----------DIE2-----------

-----------DIE3-----------

16/48, 17/49, 18/50, 19/51 --> CCX 5

20/52, 21/53, 22/54, 23/55 --> CCX 6

-----------DIE3-----------

-----------DIE4-----------

24/56, 25/57, 26/58, 27/59 --> CCX 7

28/60, 29/61, 30/62, 31/63 --> CCX 8

-----------DIE4-----------

 

for a total of 8 CCXs, with 4 cores per CCX.
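If you want to double-check that layout on the host rather than trusting the table, the kernel already exposes it. A minimal sketch, assuming the usual sysfs paths are present on Unraid:

# every logical CPU reports which CPUs share its L3 cache;
# on Naples one L3 = one CCX, so each unique line is one CCX
cat /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list | sort -u

With the layout above you should see eight unique lines, e.g. 0-3,32-35 for CCX 1.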

 

As an example, for a 16-core/32-thread VM, I would pass:

16/48, 17/49, 18/50, 19/51 --> CCX 5

20/52, 21/53, 22/54, 23/55 --> CCX 6

24/56, 25/57, 26/58, 27/59 --> CCX 7

28/60, 29/61, 30/62, 31/63 --> CCX 8

 

<vcpu placement='static'>32</vcpu>
  <iothreads>2</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='16'/>
    <vcpupin vcpu='1' cpuset='48'/>
    <vcpupin vcpu='2' cpuset='17'/>
    <vcpupin vcpu='3' cpuset='49'/>
    <vcpupin vcpu='4' cpuset='18'/>
    <vcpupin vcpu='5' cpuset='50'/>
    <vcpupin vcpu='6' cpuset='19'/>
    <vcpupin vcpu='7' cpuset='51'/>
    <vcpupin vcpu='8' cpuset='20'/>
    <vcpupin vcpu='9' cpuset='52'/>
    <vcpupin vcpu='10' cpuset='21'/>
    <vcpupin vcpu='11' cpuset='53'/>
    <vcpupin vcpu='12' cpuset='22'/>
    <vcpupin vcpu='13' cpuset='54'/>
    <vcpupin vcpu='14' cpuset='23'/>
    <vcpupin vcpu='15' cpuset='55'/>
    <vcpupin vcpu='16' cpuset='24'/>
    <vcpupin vcpu='17' cpuset='56'/>
    <vcpupin vcpu='18' cpuset='25'/>
    <vcpupin vcpu='19' cpuset='57'/>
    <vcpupin vcpu='20' cpuset='26'/>
    <vcpupin vcpu='21' cpuset='58'/>
    <vcpupin vcpu='22' cpuset='27'/>
    <vcpupin vcpu='23' cpuset='59'/>
    <vcpupin vcpu='24' cpuset='28'/>
    <vcpupin vcpu='25' cpuset='60'/>
    <vcpupin vcpu='26' cpuset='29'/>
    <vcpupin vcpu='27' cpuset='61'/>
    <vcpupin vcpu='28' cpuset='30'/>
    <vcpupin vcpu='29' cpuset='62'/>
    <vcpupin vcpu='30' cpuset='31'/>
    <vcpupin vcpu='31' cpuset='63'/>
    <emulatorpin cpuset='0,32'/>
    <iothreadpin iothread='2' cpuset='1,33'/>
  </cputune>
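If you don't want to type all 32 vcpupin lines by hand, a small shell loop can print them for you. Just a sketch, assuming core N is paired with SMT sibling N+32 as in the table above (adjust the 16-31 range for whichever dies you pass):

# print vcpupin lines for host cores 16-31 plus their SMT siblings 48-63
vcpu=0
for core in $(seq 16 31); do
  echo "    <vcpupin vcpu='$vcpu' cpuset='$core'/>"
  echo "    <vcpupin vcpu='$((vcpu+1))' cpuset='$((core+32))'/>"
  vcpu=$((vcpu+2))
done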

 

and then in the cpu section:

  <cpu mode='host-passthrough' check='none' migratable='on'>
    <topology sockets='1' dies='2' cores='8' threads='2'/>
    <cache mode='passthrough'/>
    <feature policy='require' name='topoext'/>
  </cpu>
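Note that the topology line has to multiply out to the vCPU count: 1 socket x 2 dies x 8 cores per die x 2 threads = 32. Once the VM is up you can also confirm the pinning landed where you expect; 'Windows10' below is just a placeholder for your VM name:

# show the current vCPU-to-host-CPU pinning and the emulator pinning
virsh vcpupin Windows10
virsh emulatorpin Windows10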

 


Thank you! But wouldn't CCX 5-8 cross dies? Wouldn't it be faster with just CCX 5-6? I read somewhere that the best performance comes from 4x 8-core VMs (or whatever configuration, as long as no VM crosses a die), probably because of the L3 cache.

 

This die / CCX configuration is much appreciated!

 

Also, how much will this die crossing affect performance/latency? I'm running a Docker-based web server, currently on the host itself, but I'm thinking of making 3x 8c/16t VMs (3 dies). Would the performance be better, or would the difference be negligible (the drive backend is RAID 1, NVMe)? My thought is that the extra latency virtualization adds might outweigh the gains from staying within a die.


There are some discussions on the Unraid forums and on Reddit about this, just search for 'ccx and dies'. Unfortunately I don't want to say anything more, simply because I'm not sure whether it's correct or not.

But the first thing I can say for sure is that you need the layout of the CPU, so you know which cores are inside which CCX and which CCXs belong to which die.
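On the host side you can get the node (die) part with numactl --hardware or, per logical CPU, with lscpu; on Naples each NUMA node is one die:

# one row per logical CPU: NODE is the die, CORE groups SMT siblings
lscpu --extended=CPU,NODE,SOCKET,CORE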

 

6 minutes ago, reyo said:

But wouldn't CCX 5-8 cross dies?

Apart from the '5' and '8' numbers, if those CCXs belong to different dies, yes! But if you want a VM with more than 8 cores/16 threads you have no option other than crossing dies, since each die has 8 cores.

 

8 minutes ago, reyo said:

Wouldn't it be faster with just CCX 5-6?

Again, apart from the '5' and '6' numbers: yes, for a VM with 8 cores/16 threads.

 

The easy way, if I were you, would be to run some tests pinning different CPUs / with different settings in the XML and see how the results change.
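If you have sysbench (or any memory benchmark) available, one quick A/B test is to keep the cores fixed and only move the memory, which isolates the cross-die penalty; the node numbers here are just an example:

# same cores (node 3), local memory vs memory forced onto node 0
numactl --cpunodebind=3 --membind=3 sysbench memory run
numactl --cpunodebind=3 --membind=0 sysbench memory run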

 

 

2 hours ago, ghost82 said:

The easy way, if I were you, would be to run some tests pinning different CPUs / with different settings in the XML and see how the results change.

Do they show up as NUMA nodes?

 

root@unraid:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 64397 MB
node 0 free: 436 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64505 MB
node 1 free: 218 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

 


So @ghost82 was exactly right. I didn't know you could query information like this:

This is the output:

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 15987 MB
node 0 free: 159 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 16124 MB
node 1 free: 51 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 16124 MB
node 2 free: 50 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 16122 MB
node 3 free: 49 MB
node distances:
node   0   1   2   3 
  0:  10  16  16  16 
  1:  16  10  16  16 
  2:  16  16  10  16 
  3:  16  16  16  10 

 

So memory is taken from the nearest memory controller for the die/CCX? And when a process runs on node 3 (the way Docker spawns processes), the memory could still be allocated from node 0, and that's where the latency comes from? It's probably best to make dedicated VMs, because then the nearest memory is allocated. Sorry for my bad English and my lack of understanding at the chip level :), I haven't really looked at CPU diagrams or tried to make sense of how things work at the die level.
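If you do keep the web server in Docker on the host instead of moving it into VMs, you can get much of the same locality by pinning the containers; --cpuset-cpus and --cpuset-mems are standard docker run flags, and the values below are just node 3 from your output (the image name is a placeholder):

# keep the container's processes and its memory allocations on node 3
docker run -d --cpuset-cpus="24-31,56-63" --cpuset-mems="3" your-webserver-image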

 

 

