Ryzen/Threadripper PSA: Core Numberings and Assignments



Hi everyone,

 

For those of you using a Threadripper build, or any recent AMD CPU, you may have concerns about the Infinity Fabric-induced latency between CCXs, and the further latency between dies.  I dug through multiple posts trying to find exactly which cores were associated with which CCX/die, and after finding insufficient info, decided to ask AMD and my motherboard supplier, Gigabyte.

 

For those of you who have not seen the informative post on AMD's latency tests: it has been found that base Ryzen uses sequential numbering (no interleaving) for its core assignments.  Lodging a ticket with AMD support confirmed this for Threadripper as well.  Gigabyte returned the same info for physical cores, but gave no info on logical ones.

 

Ultimately, that means cores 0 and 1 are logically paired, cores 0-7 are on the same CCX, and (on Threadripper) cores 0-15 are on the same die.
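If anyone wants to sanity-check the numbering on their own box, the kernel exposes the topology directly.  A rough sketch, assuming lscpu is available on your Unraid install (the sysfs path should work either way):

# list each logical CPU with its physical core, socket and NUMA node
lscpu --extended=CPU,CORE,SOCKET,NODE
# show which logical CPU shares a physical core with CPU 0 (its SMT sibling)
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list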

 

This will hopefully help someone else come across the information more quickly and reduce their struggle by some small amount.  I hate to cop out, but I'm at work now, so if anyone has an article they feel will supplement this info, please post it below.

  • 2 weeks later...

Hi everyone,

 

First of all, let me say hello to everyone. This is my first post after nearly a year of using Unraid. So far I have had no big issues, and for the smaller ones it was easy to find a fix in the forums. I fiddled around with the topic of core assignments when I started with the TR4 platform at the end of last year, and I thought I had figured out back then which die corresponds to which core shown in the Unraid GUI.

 

First of all my specs:

CPU:        1950X locked at 3.9GHz @ 1.15V
Mobo:       ASRock Fatal1ty X399 Professional Gaming
RAM:        4x 16GB TridentZ 3200MHz
GPU:        EVGA 1050 Ti
            Asus Strix 1080 Ti
Storage:    Samsung 960 EVO 500GB NVMe (cache drive)
            Samsung 850 EVO 1TB SSD (Steam library)
            Samsung 960 Pro 512GB NVMe (passthrough Win10 gaming VM)
            3x WD Red 3TB (storage)
            
After reading your post @thenonsense I was kind of confused, so I decided to do some more testing. Here are my results, which basically confirm your findings. I ran some benchmarks with Cinebench (3 times in a row) inside a Win10 VM that I have been using since the end of last year for gaming and video editing. I also did some cache and memory benchmarks with AIDA64.

 

Specs Win10 VM:

8 cores + 8 threads
16GB RAM
Asus Strix 1080 Ti
960 Pro 512GB NVMe passthrough

 

TEST 1

initial cores assigned:

[Image: initial Win10 core assignments]

 

Cinebench Scores:
run1:    1564
run2:    1567
run3:    1567

 

[Image: AIDA64 results, Win10 VM with the old core assignments]


Next I did the exact same tests with the core assignments you suggested, @thenonsense

 

TEST 2

[Image: current Win10 core assignments]


Cinebench Scores:
run1:    2226
run2:    2224
run3:    2216

 

[Image: AIDA64 results, Win10 VM with the new core assignments]


Both the CPU and the memory score improved. The memory performance almost doubled!! A clear sign that in the second test only one die was used and the performance wasn't limited by the communication between the dies over the Infinity Fabric, as it was with my old settings.
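As a side note: as far as I know, the core checkboxes in the Unraid GUI just end up as vcpupin entries in the VM's libvirt XML, so you can also check what a VM actually got from the command line. A quick sketch (the VM name "Windows 10" is only an example, adjust it to yours):

# show the current vcpu-to-host-core pinning of the VM
virsh vcpupin "Windows 10"
# or dump the XML and look at the <vcpupin> lines directly
virsh dumpxml "Windows 10" | grep vcpupin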

After that I decided to do some more testing, this time with a Windows 7 VM with only 4 cores and 4GB of RAM, to check which cores are the physical ones and which are the corresponding SMT threads.

 

first test:

assigned cores 4 5 6 7 (physical cores only)

Cinebench Scores:
run1:    558
run2:    558
run3:    557

 

[Image: AIDA64 results, Win7 VM with cores 4 5 6 7]


second test:

assigned cores 12 13 14 15 (SMT cores only)

Cinebench Scores:
run1:    540
run2:    542
run3:    541

 

[Image: AIDA64 results, Win7 VM with cores 12 13 14 15]


third test:
assigned cores 4 5 12 13 (physical + corresponding SMT cores)

Cinebench Scores:
run1:    561
run2:    563
run3:    560

 

[Image: AIDA64 results, Win7 VM with cores 4 5 12 13]


And again, a clear sign your statement is correct @thenonsense. Cores 0-7 are the physical cores and cores 8-15 are the SMT cores. The second test only used the SMT cores and clearly shows that the performance is worse than using physical cores as in the first test.
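A quick way to cross-check this against what the kernel itself reports (presumably what Unraid bases its pairing display on) is a small shell loop; just a sketch, nothing Threadripper-specific:

# print the SMT sibling list for every logical CPU
for c in /sys/devices/system/cpu/cpu[0-9]*; do echo "$(basename $c): $(cat $c/topology/thread_siblings_list)"; done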

 

[Image: Threadripper 1950X core numbering]

 

I was really sure, based on my first tests last year, that I had configured my Win10 VM to only use cores from one die and all other VMs to use the correct corresponding core pairs. Clearly not. Did Unraid change something in how the cores are presented in the webGUI in one of the last versions? I never checked whether anything changed. All my VMs run smoothly without any hiccups or freezes, but as the tests showed, the performance wasn't optimal.

 

@limetech

It would be nice if you guys could find a way to detect whether the CPU is a Ryzen/Threadripper-based system and present the user with the correct core pairing in the webUI. Overall, I have had no major issues in the time I have used your product. Let me say thank you for providing us with Unraid.

 

 

Greetings from Germany

 

 

And sorry for my bad English.

 

 

Edited by bastl
On 8/25/2018 at 3:11 PM, eschultz said:

Those are pretty interesting results, @bastl! I'll need to confirm those thread pairings on my Threadripper -- it's possible Linux is confused about the actual thread pairings for Threadripper.

Would more data points from other users be useful to you, @eschultz?  Since bastl posted his methods I could try reproducing the experiment (same CPU but different mobo) and post results.  Or would posting results at this stage just be noise?

 

EDIT: Well, I'm noise simply by asking, so I may as well have some data.  Ran the same tests as @bastl and was able to reproduce similar benchmark numbers (yay science!); I learned my current VM pin-outs are the least optimized. Easy enough to change and try out for a while. :D

 

cpu8-31.JPG

test2.JPG

test3.JPG

test4.JPG

test5.JPG

Edited by Jcloud

Time for a BIOS update, you said?

 

Ok, here we go. Until today I ran Unraid on BIOS version 2.00 from November last year, the latest stable version until they released 3.20 and 3.30 with support for second-generation Threadripper a couple of days ago. I upgraded the BIOS to the latest version 3.30 without any issues, reconfigured all the BIOS settings as I had them before (virtualization enabled, IOMMU support, OC and fan settings etc.) and did the same tests again. The results are basically the same as in my earlier tests. The only noticeable difference is a slightly improved L3 cache latency: all tests showed an improvement of 10-15ns. Everything else performed as before.

 

Also the core pairings presented by Unraid are the same. 

[Image: core pairings shown by Unraid]

 

So an updated BIOS didn't make any difference for me. It would be nice to know how @thenonsense's core pairings are shown in the Unraid GUI. Maybe Gigabyte reports the core pairs to the OS differently than other manufacturers. As @Jcloud's and my tests showed, ASUS and ASRock are kind of doing the same thing here. Maybe the 2990WX is the reason your core pairs are different. Who knows?! Is there a chance you have a 1950X lying around to test your board with, @testdasi? Sorry, stupid question, but I want a solution that works for everyone 😁

 

If there are any tests we can do to find a solution @eschultz, let us know. Btw, have you had any time yet to check the behaviour on your Threadripper system?

 

 


@testdasi I don't really get why you are asking whether memory interleaving is on or off. If you ask because of the 10-15ns L3 cache performance bump I got, I can tell you this really has nothing to do with the main memory configuration. I think the reason is the newer microcode that comes with the new BIOS.

 

The point of all these tests is to figure out which core pairs shown by Unraid are on the same die. My 1950X has 2 dies, each with its own memory controller addressing 2 channels. As soon as there is communication between the dies, the memory bandwidth is reduced. I used this behaviour to find out in which configuration Unraid is only using one die and in which it uses cores on two different dies. The way I configured the RAM is the same as before: load the XMP profile, done. The XMP profile is stored on the memory sticks themselves and the settings are exactly the same as before. No extra tweaking from me, and the settings exactly match the old BIOS settings.

 

For your 2990WX it gets even more interesting. The second-gen Threadripper still uses quad-channel memory; the difference from an EPYC is that the memory controllers are active on only 2 dies out of 4. Let's say you configure a VM to use a complete die (8 cores, 16 threads): you will see a difference depending on which die you give to the VM. As soon as you give your Windows VM a full die without an active memory controller, you will see the exact same bandwidth decrease I showed before, because that die has to communicate across the Infinity Fabric with the neighbouring die that has a memory controller. You might want to check your current config and do a couple of tests on your own to find the best performance ;)

 

Maybe the core pairs shown (0-1 etc.) aren't actually the correct ones for you. Just sayin'.

 

13 minutes ago, bastl said:

@testdasi I don't really get why you are asking whether memory interleaving is on or off. If you ask because of the 10-15ns L3 cache performance bump I got, I can tell you this really has nothing to do with the main memory configuration. I think the reason is the newer microcode that comes with the new BIOS.

 


 

Memory interleaving may be the difference because it relates to the Threadripper design. A Threadripper CPU is essentially equivalent to a dual-CPU / quad-CPU setup in the server world, which leads to the UMA / NUMA distinction. When the CPU is in UMA mode, memory is interleaved and exposed to both dies, prioritising throughput. When in NUMA mode, there's no interleaving and each die accesses its own memory bus first and then the other die's, i.e. prioritising lower latency. In other words, UMA treats the CPU as one unit and NUMA treats each die as its own CPU.

 

For the 1950X, UMA / NUMA can be selected. For the 2990WX, for the same reasons that you mentioned, only NUMA mode is available.

 

So when it comes to pairing logical cores to physical cores, it might be done incorrectly in UMA if the numbering is based on NUMA. It would also make sense that the 2990WX has a different numbering scheme, since NUMA is its only option.

 

Of course, that's just my hypothesis since I can't turn on interleaving on my 2990WX to test. :)


I did a couple more tests with all available memory interleaving settings. In the ASRock BIOS under AMD CBS / DF Common Options I found 5 available options (auto, die, channel, socket, none). Auto is what I used before in all of my tests. This time I only tested with the Win10 VM using cores 16-31. The die and socket options produced pretty much the same results as auto

 

[Image: AIDA64 results, Win10 VM cores 16-31, interleaving set to auto/die/socket]

 

and, as expected, choosing channel or none for the interleaving showed the worst performance.

 

[Image: AIDA64 results, Win10 VM cores 16-31, interleaving set to none/channel]

 

If I had accidentally chosen cores from both dies, I guess the results with the die option for memory interleaving would have been different. I searched around and tested a bit in the BIOS for an option that might force it to report the cores differently to Unraid, but without luck. I couldn't find any option to specifically select UMA or NUMA either. I know you can set it in Ryzen Master, but that software doesn't work inside a VM. Maybe I will test it tomorrow with a bare-metal install and check what else changes in the BIOS after choosing the NUMA/UMA setting in Ryzen Master.

 

Enough for today. Good night to everyone 😑


BIOS is up to date now with version 3.30. Same results: core pairings still show up wrong in 6.6.0-rc1. I played around a bit and tested a couple of things. On first boot it came up with tons of PCI reset errors, but it looks fine now after the second reboot. I can disable the ACS override now and get most devices split up into their own groups; only the network interfaces are still grouped together.

On 9/1/2018 at 8:16 AM, eschultz said:

6.6.0-rc1 was just released.  We removed the threadripper reset patch because a BIOS update will fix the issue better than the reset patch would.  The author of the hack said it was just a bad hack anyways.  So update your BIOS before upgrading.

Do you know which AGESA the proper fix was in? That would probably help the TR peeps know for sure the minimum BIOS to use.

15 minutes ago, testdasi said:

Do you know which AGESA the proper fix was in? That would probably help the TR peeps know for sure the minimum BIOS to use.

With AGESA 1.0.0.4 you needed some sort of extra patch. AGESA 1.0.0.6 was never released as stable, at least not from ASRock; only a beta version was available, which I never tested. I think it mainly addressed memory incompatibilities for the AM4 Ryzen chips and came with some microcode updates to fix security issues. AGESA 1.1.0.0 should be the first version that includes the fix.

 

Edited by bastl

Apparently there are already commands to tell which core is on which die.

@bastl @Jcloud Perhaps you guys can try to see what shows up?

~# numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 48208 MB
node 0 free: 350 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 size: 48354 MB
node 2 free: 4680 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10

 

Apparently you can even check how much RAM each VM is using from which node:

~# numastat qemu

Per-node process memory usage (in MBs)
PID                               Node 0          Node 1          Node 2
-----------------------  --------------- --------------- ---------------
33117 (qemu-system-x86)          1751.71            0.00         2442.32
33297 (qemu-system-x86)          2840.03            0.00         1326.58
82938 (qemu-system-x86)         28445.78            0.00        20757.30
91591 (qemu-system-x86)           182.21            0.00         8052.15
-----------------------  --------------- --------------- ---------------
Total                           33219.73            0.00        32578.35

PID                               Node 3           Total
-----------------------  --------------- ---------------
33117 (qemu-system-x86)             0.00         4194.02
33297 (qemu-system-x86)             0.00         4166.61
82938 (qemu-system-x86)             0.00        49203.09
91591 (qemu-system-x86)             0.00         8234.37
-----------------------  --------------- ---------------
Total                               0.00        65798.09

 


As reported earlier, for the 1950X on an ASRock Fatal1ty X399 Professional Gaming something is reported differently. Looks like the same happened for Jcloud on his Asus board. Currently I'm on 6.6-rc2. I couldn't really find a BIOS setting to change how the dies are reported to the OS; it is always reported as 1 node.

 

[Image: numactl --hardware output]

 

[Image: numastat qemu output]

 

Edit:

@testdasi

It looks like the RAM usage for your VMs isn't optimized either. If I understand the shown scheme correctly, your VM with PID 33117, for example, takes half its RAM from each of the 2 nodes that have a memory controller built in. If you have more than 1 die assigned to the VM that's ok, but if you use, let's say, 4 cores from 1 die, it should take the 4GB of RAM from the same node and not from another node.
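If you want to force a VM's memory onto one node, there should be a numatune setting for it. I can't test this myself since my board only exposes one node, so treat it as a sketch (the VM name is just an example):

# show the current NUMA memory policy of the VM
virsh numatune Win10
# strictly bind its memory to node 0 (applies to the next start of the VM)
virsh numatune Win10 --mode strict --nodeset 0 --config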

Edited by bastl
Linux 4.18.6-unRAID.
root@HYDRA:~# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 128796 MB
node 0 free: 27195 MB
node distances:
node   0
  0:  10
root@HYDRA:~# numastat qemu

Per-node process memory usage (in MBs) for PID 42539 (qemu-system-x86)
                           Node 0           Total
                  --------------- ---------------
Huge                         0.00            0.00
Heap                         0.05            0.05
Stack                        0.05            0.05
Private                  36961.88        36961.88
----------------  --------------- ---------------
Total                    36961.98        36961.98
root@HYDRA:~#

Per request.


Ok, now it gets interesting. I have already watched almost all of Wendell's videos, but thanks for mentioning it here for people stumbling across this thread. @tjb_altf4

 

I might have overlooked something in all my tests, and the presented core pairings are alright. I assumed that the better memory performance depends on the cores and which die they are from. By switching between the options auto, die, channel, socket and none in the BIOS under the AMD CBS settings, I should have already noticed that as soon as I limit a VM to only 1 die, I get the memory bandwidth of that specific memory controller. I basically cut the bandwidth in half, from quad-channel (both dies) to dual-channel. Makes perfect sense. How could I miss that?

 

If you need memory bandwidth for your applications, UMA mode is the way to go. For me, I have to set it to Auto, Socket or Die for the memory to get interleaved over all 4 channels, and the CPU then gets reported as only 1 node. By choosing the Channel option (NUMA mode) I basically limit the memory access to the 2 channels of the specific die. The latency in this case should be lower, because you remove the hop to the other die. The None option limits it to single-channel memory and cuts the bandwidth even further, as shown in the pictures above. I'm actually not sure what the difference between Auto, Die and Socket is; they all show similar results in the tests. It should also be mentioned that Cinebench appears to depend on memory bandwidth more than most people report.

 

Wendell mentioned in that video using lstopo to check which PCIe slots are directly connected to which die. Is there a way to check this without lstopo, which isn't available on Unraid? Right now my 1080 Ti sits in the third x16 PCIe slot (1st slot 1050 Ti x16, second slot empty x8) and I'm not sure whether it's directly attached to the correct die for my gaming VM. Maybe there is something already implemented in Unraid for listing the topology the way lstopo does.

 

Any ideas?
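One thing I still want to try is reading the numa_node attribute from sysfs, which is supposed to tell you which node a PCI device hangs off. On my board it may well just return -1 as long as only one node is exposed, so take this as an untested sketch:

# find the PCI address of the card
lspci | grep -i vga
# ask sysfs which NUMA node that device sits on (replace the address with your GPU's; -1 means no NUMA info)
cat /sys/bus/pci/devices/0000:0a:00.0/numa_node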

 

Edit:

Another thing I should have checked earlier is the behaviour of the clock speeds. Damn, I feel so stupid right now.

 

watch grep \"cpu MHz\" /proc/cpuinfo

 

Running this command during the tests would have shown that as soon as I choose cores from both dies for a VM, the clocks on all cores ramp up. If I assign the core pairs Unraid gives me, only one die ramps up to full speed and the other stays at idle clocks.  🙄

 

Edited by bastl

@bastl You can try installing hwloc (package containing lstopo) and all the dependencies manually. There's a guide here. In addition to the listed dependencies, you will need harfbuzz and libxshmfence as well.

 

This is a pain in the ass though and it's probably better not to mess with unRAID.

 

I think the better way is to boot into a Linux distro from USB and install hwloc using the package manager.
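On Manjaro, for example, it should just be a couple of commands (package name from memory, so double-check it):

# install hwloc, which provides lstopo, then render the topology
sudo pacman -S hwloc
lstopo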

 

I have 2 GPUs (1070 and 1080 Ti) in the two x16 PCIe slots, and it shows that the farther GPU (1080 Ti) is tied to node 0.

 

IOMMU group 14:	[10de:1b06] 08:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
IOMMU group 30:	[10de:1b81] 41:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)

 

lstopo on unRAID:

[Image: lstopo output on unRAID]

 

lstopo on Manjaro USB:

[Image: lstopo output on Manjaro USB]

 

1070 on top of 1080ti:

[Image: 1070 mounted on top of the 1080 Ti]

Edited by 3flappp
  • 2 weeks later...
  • 1 month later...

So I noticed something today... it finally hit me. Why is latency so high? Because the VM isn't getting the correct information. Look at the pics below. The left is what a 2990WX looks like in Windows on bare metal; the right is my VM. Now granted, I'm only passing 20 cores, so I expect certain things, but look at the cache sizes. L1 data is double the size and only 2-way instead of 8-way, L1 instruction is 2-way instead of 4-way, L2 is the right size but 16-way instead of 8-way, and L3 is 5x 16MB instead of 8x 8MB. Can anyone confirm this?

 

[Images: CPU-Z on bare metal (left) vs. inside the VM (right)]

Edited by Jerky_san

@Jerky_san I wouldn't trust CPU-Z inside a VM in the first place, same as other software. Cinebench never reads the correct core clock, nor does any tool show me the right Vcore. CPU-Z also always shows the core clock at its max speed, while in the background on Unraid you can see it's running at idle speeds as it should be. I guess that's just the way QEMU/KVM presents/emulates the CPU to the guest OS.

3 hours ago, bastl said:

@Jerky_san I wouldn't trust CPU-Z inside a VM in the first place, same as other software. Cinebench never reads the correct core clock, nor does any tool show me the right Vcore. CPU-Z also always shows the core clock at its max speed, while in the background on Unraid you can see it's running at idle speeds as it should be. I guess that's just the way QEMU/KVM presents/emulates the CPU to the guest OS.

Even if I don't 100% trust CPU-Z, and setting aside all my experience with Hyper-V and VMware (over 10 years and going now) where it was accurate, even Linux reads the wrong cache sizes with lscpu.

Ubuntu VM

L1d cache:           64K
L1i cache:           64K
L2 cache:            512K
L3 cache:            16384K
 

The L1d cache is wrong and the L3 cache is double the size it should be.

 

Unraid:

L1d cache:           32K
L1i cache:           64K
L2 cache:            512K
L3 cache:            8192K
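One thing that might be worth trying (I haven't confirmed it changes what the guest reports) is libvirt's cache passthrough, which is supposed to forward the host's real cache topology to the guest instead of the default emulated layout. A sketch only, the VM name is just an example:

virsh edit Win10
# then, inside the <cpu mode='host-passthrough'> element, add:
#   <cache mode='passthrough'/>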

 

