reyo Posted January 17, 2022

Hi! I just remembered that Naples is 4x 8c/16t dies on one SoC, and sharing a die between VMs is bad for performance. RAM sharing isn't good for performance either, but I don't think that can be controlled. PS. I'm kinda new to all this EPYC stuff. Thanks in advance!
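A quick way to see how the host exposes those dies is lscpu; on Naples each die typically shows up as its own NUMA node, though BIOS memory-interleave settings can change this (the two numactl outputs later in this thread show exactly that variation):

lscpu | grep -Ei 'numa|socket|core'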
ghost82 Posted January 17, 2022

I think the best you can do is to pass cores that correspond to whole CCXs, and set the number of dies in the XML to match what you pass to the VM. Maybe also set emulatorpin on 0/32 and iothreadpin on 1/33. The 7551P should have 4 dies:

https://developer.amd.com/wp-content/resources/56308-NUMA Topology for AMD EPYC™ Naples Family Processors.PDF

Avoid passing core 0/32 to the VM. Cores per CCX should be organized like this:

-----------DIE1-----------
0/32, 1/33, 2/34, 3/35 --> CCX 1
4/36, 5/37, 6/38, 7/39 --> CCX 2
-----------DIE1-----------
-----------DIE2-----------
8/40, 9/41, 10/42, 11/43 --> CCX 3
12/44, 13/45, 14/46, 15/47 --> CCX 4
-----------DIE2-----------
-----------DIE3-----------
16/48, 17/49, 18/50, 19/51 --> CCX 5
20/52, 21/53, 22/54, 23/55 --> CCX 6
-----------DIE3-----------
-----------DIE4-----------
24/56, 25/57, 26/58, 27/59 --> CCX 7
28/60, 29/61, 30/62, 31/63 --> CCX 8
-----------DIE4-----------

for a total of 8 CCXs with 4 cores per CCX.

As an example, for a 16-core/32-thread VM, I would pass:

16/48, 17/49, 18/50, 19/51 --> CCX 5
20/52, 21/53, 22/54, 23/55 --> CCX 6
24/56, 25/57, 26/58, 27/59 --> CCX 7
28/60, 29/61, 30/62, 31/63 --> CCX 8

<vcpu placement='static'>32</vcpu>
<iothreads>2</iothreads>
<cputune>
  <vcpupin vcpu='0' cpuset='16'/>
  <vcpupin vcpu='1' cpuset='48'/>
  <vcpupin vcpu='2' cpuset='17'/>
  <vcpupin vcpu='3' cpuset='49'/>
  <vcpupin vcpu='4' cpuset='18'/>
  <vcpupin vcpu='5' cpuset='50'/>
  <vcpupin vcpu='6' cpuset='19'/>
  <vcpupin vcpu='7' cpuset='51'/>
  <vcpupin vcpu='8' cpuset='20'/>
  <vcpupin vcpu='9' cpuset='52'/>
  <vcpupin vcpu='10' cpuset='21'/>
  <vcpupin vcpu='11' cpuset='53'/>
  <vcpupin vcpu='12' cpuset='22'/>
  <vcpupin vcpu='13' cpuset='54'/>
  <vcpupin vcpu='14' cpuset='23'/>
  <vcpupin vcpu='15' cpuset='55'/>
  <vcpupin vcpu='16' cpuset='24'/>
  <vcpupin vcpu='17' cpuset='56'/>
  <vcpupin vcpu='18' cpuset='25'/>
  <vcpupin vcpu='19' cpuset='57'/>
  <vcpupin vcpu='20' cpuset='26'/>
  <vcpupin vcpu='21' cpuset='58'/>
  <vcpupin vcpu='22' cpuset='27'/>
  <vcpupin vcpu='23' cpuset='59'/>
  <vcpupin vcpu='24' cpuset='28'/>
  <vcpupin vcpu='25' cpuset='60'/>
  <vcpupin vcpu='26' cpuset='29'/>
  <vcpupin vcpu='27' cpuset='61'/>
  <vcpupin vcpu='28' cpuset='30'/>
  <vcpupin vcpu='29' cpuset='62'/>
  <vcpupin vcpu='30' cpuset='31'/>
  <vcpupin vcpu='31' cpuset='63'/>
  <emulatorpin cpuset='0,32'/>
  <iothreadpin iothread='2' cpuset='1,33'/>
</cputune>

and:

<cpu mode='host-passthrough' check='none' migratable='on'>
  <topology sockets='1' dies='2' cores='8' threads='2'/>
  <cache mode='passthrough'/>
  <feature policy='require' name='topoext'/>
</cpu>

Note that cores is per die in libvirt, so 1 socket x 2 dies x 8 cores x 2 threads matches the 32 vCPUs above.
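If you want to verify that per-CCX layout on the host itself rather than taking the diagram on faith: on Zen 1 each CCX has its own L3 cache, so the L3 sharing lists in sysfs should reproduce the groups above. A minimal sketch, assuming the usual Linux layout where cache index3 is the L3:

# print each distinct L3 cache domain; on Naples, one L3 = one CCX
for f in /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list; do
    cat "$f"
done | sort -u

On a 7551P this should print 8 unique lines, e.g. 16-19,48-51 for CCX 5 above.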
ghost82 Posted January 17, 2022

But I may be totally wrong... I think @testdasi has clearer ideas about this.
reyo Posted January 17, 2022

Thank you! But wouldn't CCX 5-8 cross dies? Wouldn't it be faster with just CCX 5-6? I read somewhere that the best performance would come from 4x 8-core VMs, or whatever configuration, as long as no cross-die configuration is used, probably because of the L3 cache. The die/CCX breakdown is much appreciated!

Also, how much does this die crossing affect performance/latency? I'm running a docker-based web server, currently on the host itself, but I'm thinking of making 3x 8c/16t VMs (3 dies). Would the performance be better, or would the difference be negligible (the drive backend is RAID 1 NVMe)? My thought is that the extra latency virtualization adds might outweigh the gains from sticking within a die.
ghost82 Posted January 17, 2022

There are some discussions on the unraid forum and on reddit about this; just search for 'ccx and dies'. Unfortunately I don't want to say anything more, simply because I'm not sure whether it's correct or not. But the first thing I can say for sure is that you need the layout of the CPU, so you know which cores are inside which CCX and which CCXs belong to which die.

6 minutes ago, reyo said:
But wouldn't CCX 5-8 cross dies?

Apart from the '5' and '8' numbers, if those cores belong to different CCXs, yes! But if you want a VM with more than 8 cores/16 threads, you have no choice but to cross dies; each die includes 8 cores.

8 minutes ago, reyo said:
Wouldn't it be faster with just CCX 5-6?

Again, apart from the '5' and '6' numbers, yes, for a VM with 8 cores/16 threads. An easy way, if I were you, would be to make some tests pinning different CPUs with different settings in the XML and see how the results change.
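One convenience for that kind of testing: the vCPU pinning of a running VM can be changed on the fly with virsh, so you don't have to edit the XML and reboot between runs ('MyVM' below is just a placeholder domain name):

# show the current vcpu-to-host-cpu pinning
virsh vcpupin MyVM

# move vCPU 0 onto host CPU 20 for the next run; --live changes the
# running domain only and is not written back to the XML
virsh vcpupin MyVM 0 20 --live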
SimonF Posted January 17, 2022

2 hours ago, ghost82 said:
An easy way, if I were you, would be to make some tests pinning different CPUs with different settings in the XML and see how the results change.

Do they show as NUMA nodes?

root@unraid:~# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
node 0 size: 64397 MB
node 0 free: 436 MB
node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 64505 MB
node 1 free: 218 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10
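Those node distances (10 local vs 21 remote) can be felt directly by forcing a test workload's memory onto the local or the remote node with numactl; './my_benchmark' below is a placeholder for whatever test you run:

# CPUs and memory both on node 0: every access is local (distance 10)
numactl --cpunodebind=0 --membind=0 ./my_benchmark

# CPUs on node 0 but memory forced onto node 1: every access is remote (distance 21)
numactl --cpunodebind=0 --membind=1 ./my_benchmark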
reyo Posted January 18, 2022

So @ghost82 was exactly right. I didn't know you could request information like this. This is the output:

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 15987 MB
node 0 free: 159 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 16124 MB
node 1 free: 51 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 16124 MB
node 2 free: 50 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 16122 MB
node 3 free: 49 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10

So memory is taken from the memory controller nearest to the die/CCX? And when a process runs on node 3 (the way docker spawns processes), its memory could still be allocated from node 0, and then the latency happens? It's probably best to make dedicated VMs, because then the nearest memory is allocated.

Sorry for my bad English and my lack of understanding at the chip level :) I haven't really looked at CPU diagrams or tried to make sense of how stuff works at the die level.
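For the docker web server case, a middle ground worth noting: containers can be confined to one node without a VM at all, since docker exposes the same cpuset controls. A sketch using the node 3 CPUs from the output above (the nginx image is just an example):

docker run -d \
  --cpuset-cpus="24-31,56-63" \
  --cpuset-mems="3" \
  nginx

--cpuset-mems keeps the container's allocations on node 3's own memory controller, so the cross-node allocation described above shouldn't happen.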
ghost82 Posted January 18, 2022

This may help in understanding the layout:
reyo Posted January 18, 2022

This is my current layout. Very, very good stuff! Supermicro H11SSL-i.

Thank you @ghost82 and @SimonF