Ryzen/Threadripper PSA: Core Numberings and Assignments



I've been following this topic with interest but can't say I understand more than half of it yet.

 

Things are a little different for me with the Threadripper 1900X. I've set my CPUs as follows:

 

[screenshot: CPU pinning assignments]

 

This did deliver a marked improvement in AIDA64 over my previous setup (which used 6 pairs for 12 cores):

[screenshot: AIDA64 results, previous setup]

 

New setup:

[screenshot: AIDA64 results, new setup]

 

Not entirely sure how the NUMA settings would affect me on this CPU, as it has fewer cores (and perhaps fewer dies). Would it be worth changing the BIOS setting to Channel as well, considering that AMD "confines the 1900X's active cores to a single CCX inside each die"?

 

I see some comments above suggesting that none of this may actually make any real-world difference, or am I misreading? I did get a bit lost after page 1. The AIDA64 numbers do look markedly improved, though!

 


@Chamzamzoo Every first-gen Threadripper has 2 dies, with a max of 8 cores per die. The smallest TR4 chip (1900X) has only 4 cores per die enabled: 0-3 + HT are on one die and 4-7 + HT on the second die. The increase in memory bandwidth you see in AIDA is due to the fact that you're using both dies, each with its own memory controller with 2 channels. So with 2 Ryzen dies you get quad channel.

 

There is no actual "best setting"; it all depends on your needs. If you need the memory bandwidth for your applications, use both dies. If you need the lower latency, use only cores from one die and set your memory in the BIOS to NUMA (or Channel, in most cases). There are still some quirks with KVM and the memory setting. You can "strictly" set your VM to use only memory from a specific node, but a couple of people reported that a bit of memory is still allocated from the other die, which increases the latency again. Also, tweaking your XML to present the CPU as an EPYC to the VM can improve performance a bit; in that case the actual CPU cache is presented to the VM in the correct way. Unraid itself, for some reason, changes the L1, L2 and L3 cache sizes that the VM sees with its standard settings for CPU model passthrough.
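For anyone who wants to try the "strict" binding, a minimal sketch (the nodeset value is a placeholder; check which node your VM's cores sit on with lstopo):

  <numatune>
    <!-- mode='strict' allocates guest RAM only from node 0; as noted above, a bit may still land on the other node -->
    <memory mode='strict' nodeset='0'/>
  </numatune>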

 

Also noted: you're only passing the HT cores through to the VM. The usual way is to pass through the main core + its hyperthread. I can't really tell if this makes any big difference in performance; I never tested it the way you have it set up.
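A hedged sketch of that usual pairing in the XML (the cpuset numbers are placeholders; check lstopo or the Unraid pinning page for your actual sibling pairs):

<cputune>
	<vcpupin vcpu='0' cpuset='0'/>	<!-- main core -->
	<vcpupin vcpu='1' cpuset='8'/>	<!-- its HT sibling (example number) -->
</cputune>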

On 11/16/2018 at 9:05 PM, DZMM said:

I even emailed MSI and they said no, although the reply was a bit short, so I'm not 100% certain the person understood the question

Go into BIOS -> OC -> Advanced DRAM Configuration.

Scroll down to "Misc Item" and look for "memory interleaving". Change this from "auto" to "channel" and you are in NUMA mode.

 

I have the MSI X399 SLI Plus myself; you should definitely check whether your GPU and additional M.2 devices are running at PCIe Gen 3 (GPU-Z etc.).

Mine always fell back to PCIe Gen 1/2 for those devices, which drastically reduced my performance.

 

I had a support ticket open with MSI, and after 4 different BIOS versions they were able to fix the issue.

 

 

cheers


Thanks @bastl, that's very helpful.

 

I'm mostly interested in performance for the gaming side of things; my Plex and other Dockers don't see a huge amount of usage. Would gaming benefit more from the increased bandwidth or from the lower latency, or is it game dependent?

 

I will try the EPYC code and see how it runs in some benchmarks; looks interesting.

 

I tried this with a more traditional main core + HT core assignment, but it was back to dual-channel mode I think; scores were down 50% for the L1-L3 caches.

[screenshot: AIDA64 cache results]

 

Sorry for the large images; they come out like this on Mac. Must be the Retina screen.

28 minutes ago, Nooke said:

Go into BIOS -> OC -> Advanced DRAM Configuration.

Scroll down to "Misc Item" and look for "memory interleaving". Change this from "auto" to "channel" and you are in NUMA mode.

 

I have the MSI X399 SLI Plus myself; you should definitely check whether your GPU and additional M.2 devices are running at PCIe Gen 3 (GPU-Z etc.).

Mine always fell back to PCIe Gen 1/2 for those devices, which drastically reduced my performance.

 

I had a support ticket open with MSI, and after 4 different BIOS versions they were able to fix the issue.

 

 

cheers

Thanks for finding this - the manual is awful. I'll make this change the next time I reboot. Do you recommend NUMA mode? I'm hoping my lstopo layout doesn't make it hard for me to assign cores. If I match cores to a die, does unRAID automatically assign RAM from the same die, or do I need to make other changes?

 

Are there any other settings I should have enabled in my BIOS?

 

I just checked my GTX 1060 and it's running at PCIe 3.0 - I'll check the other two VMs later.


Trying to optimize my setup; maybe you guys will have some suggestions. Here is the setup:

1920X with 4x GTX 960, set up as 4 gaming machines and 1 game server. The performance is decent, but it seems like it could be better. Here is my current config; what changes do you think I should make to optimize it? Also, I saw all the info about the EPYC cache tweaks and am wondering if that is something I should do too. Can someone chime in with their best guess? Brown is the game server; all other colors are individual machines with the same color graphics card as the CPUs.

[lstopo topology diagram]

17 minutes ago, jordanmw said:

Trying to optimize my setup; maybe you guys will have some suggestions. Here is the setup:

1920X with 4x GTX 960, set up as 4 gaming machines and 1 game server. The performance is decent, but it seems like it could be better. Here is my current config; what changes do you think I should make to optimize it? Also, I saw all the info about the EPYC cache tweaks and am wondering if that is something I should do too. Can someone chime in with their best guess?

[lstopo topology diagram]

For me the EPYC tweaks helped a lot, especially under high load. If the VM thinks the cache is much bigger than it actually is, that might cause issues. Anyway, all you can do is try and see.

On 11/14/2018 at 2:51 PM, Jerky_san said:

A person on reddit told me the answer to my problem. If you do the below, QEMU presents EPYC instead and all the cache is right. It dropped latency across the board: L3 is down to 13 ns, L1 to 1 ns, L2 to 2-3 ns. The machine seems MUCH more responsive. I should also mention they said they had updated their kernel, and a patch on QEMU made theirs see it properly without this code, so hopefully we will see it on Unraid as well.

 

<cpu mode='custom' match='exact' check='partial'>
	<model fallback='allow'>EPYC-IBPB</model>
	<topology sockets='1' cores='8' threads='2'/>
	<feature policy='require' name='topoext'/>
</cpu>

I was just about to try this on one of my VMs and I'm a bit confused now about cores and threads. I have a 3-core VM:

 

[screenshot: VM settings page showing 3 cores assigned]

 

But in my XML it says 6 cores, 1 thread:

 <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='6' threads='1'/>
  </cpu>

Has unRAID got confused, or is it (more likely) me?

 

If I make the EPYC change, should my config be:

<cpu mode='custom' match='exact' check='partial'>
	<model fallback='allow'>EPYC-IBPB</model>
	<topology sockets='1' cores='3' threads='2'/>
	<feature policy='require' name='topoext'/>
</cpu>

 

1 hour ago, DZMM said:

But in my XML it says 6 cores, 1 thread:

 <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='6' threads='1'/>
  </cpu>

yeah, I'm seeing

<cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='4' threads='1'/>
  </cpu>

when assigning 

[screenshot: assigned CPUs]

I'm sure I read something about this, though.

1 hour ago, jordanmw said:

yeah, I'm seeing

<cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='4' threads='1'/>
  </cpu>

when assigning

[screenshot: assigned CPUs]

I'm sure I read something about this, though.

You have to manually adjust the cores/threads, and even then I don't believe QEMU plays nicely with SMT cores on Threadripper yet, but I've not noticed any ill effects.

23 minutes ago, Jerky_san said:

You have to manually adjust the cores/threads, and even then I don't believe QEMU plays nicely with SMT cores on Threadripper yet, but I've not noticed any ill effects.

Should this be submitted as a bug to the unRAID team? Maybe it's something they can fix.


It doesn't matter performance-wise if you set it to cores='4' threads='1' or cores='2' threads='2'; in my tests it always shows the same performance. I did a couple of tests on the current 6.6.5 with different benchmarks (Cinebench, AIDA, CPU-Z) and games (GTA, BF1, Rust), and all the scores are nearly the same.

 

As for the issue with the L1, L2 and L3 cache being reported wrong to the VM: I don't know if this is an Unraid-specific thing that @limetech can fix, or whether it has to be implemented in the Linux kernel, libvirt or QEMU.

1 hour ago, bastl said:

It doesn't matter performance-wise if you set it to cores='4' threads='1' or cores='2' threads='2'; in my tests it always shows the same performance. I did a couple of tests on the current 6.6.5 with different benchmarks (Cinebench, AIDA, CPU-Z) and games (GTA, BF1, Rust), and all the scores are nearly the same.

 

As for the issue with the L1, L2 and L3 cache being reported wrong to the VM: I don't know if this is an Unraid-specific thing that @limetech can fix, or whether it has to be implemented in the Linux kernel, libvirt or QEMU.

A guy on reddit told me this code would fix it. I also requested the fix in the bug fixes section and made a few posts. Basically, we just need the option to turn on this patch.

 

On 11/21/2018 at 1:20 PM, Jerky_san said:

For me the EPYC tweaks helped a lot, especially under high load. If the VM thinks the cache is much bigger than it actually is, that might cause issues. Anyway, all you can do is try and see.

So how exactly should I be adding this tweak? Is it in the XML of the individual machines? Where do I add it, and what should I add for a 1920X?

1 hour ago, jordanmw said:

So how exactly should I be adding this tweak?  Is it in the XML of the individual machines?  Where do I add it, and what should I add for a 1920x?

Change the below:

 

<cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='4' threads='1'/>
  </cpu>

 

into this, but remember to set cores to half of what you assigned (if you're using SMT). So 16 assigned cores (8 of them SMT) would be 8 cores, 2 threads. Also, if you ever change anything in your config, you'll have to re-apply this, as Unraid will revert it back to the above.

 

<cpu mode='custom' match='exact' check='partial'>
	<model fallback='allow'>EPYC-IBPB</model>
	<topology sockets='1' cores='3' threads='2'/>
	<feature policy='require' name='topoext'/>
</cpu>
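For example, on a 1920X with 12 cores assigned to the VM (6 main cores plus their 6 SMT threads; a hypothetical pinning, adjust to your own), that rule gives:

<cpu mode='custom' match='exact' check='partial'>
	<model fallback='allow'>EPYC-IBPB</model>
	<topology sockets='1' cores='6' threads='2'/>
	<feature policy='require' name='topoext'/>
</cpu>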

 

37 minutes ago, bastl said:

@Jerky_san Whatever I set the topology to, 8 cores 1 thread or 4 cores 2 threads, benchmarks show the same performance. Even Windows shows 8 virtual CPUs no matter what I set. There is no difference with these settings.

As I just stated, it is broken for Threadripper. The version of QEMU we have is supposed to include the cache fixes and SMT fixes, but it appears they are both missing for some reason. No idea why. There honestly is way too much confusion around it. It took me days of research on the EPYC tweak, and apparently others claim they don't need it anymore, so it may just come down to which build of QEMU 3.0 we are running.

26 minutes ago, jordanmw said:

Are you seeing differences with the EPYC CPU tweaks?

 

With the EPYC CPU tweaks I get substantially less latency across the board; please see my previous posts in this thread if you'd like to see more.

 

Also, it's the closest to bare metal I've ever gotten, btw.


Tried these points on my computer and finally solved the stutter problems while gaming (TR1950) :)

What I did:

Changed BIOS RAM settings (ASUS ROG Zenith Extreme):

Advanced > DF common options > Memory interleaving: Auto > Channel

Added CPU patch:

<cpu mode='custom' match='exact' check='partial'>
	<model fallback='allow'>EPYC-IBPB</model>
	<topology sockets='1' cores='3' threads='2'/>
	<feature policy='require' name='topoext'/>
</cpu>

And Numatune:

  <numatune>
    <memory mode='interleave' nodeset='1'/>
  </numatune>

Thanks for your help guys! 👍

15 minutes ago, Symon said:

Tried these points on my computer and finally solved the stutter problems while gaming (TR1950) :)

What I did:

Changed BIOS RAM settings (ASUS ROG Zenith Extreme):

Advanced > DF common options > Memory interleaving: Auto > Channel

Added CPU patch:


<cpu mode='custom' match='exact' check='partial'>
	<model fallback='allow'>EPYC-IBPB</model>
	<topology sockets='1' cores='3' threads='2'/>
	<feature policy='require' name='topoext'/>
</cpu>

And Numatune:


  <numatune>
    <memory mode='interleave' nodeset='1'/>
  </numatune>

Thanks for your help guys! 👍

I'm going to try this out tomorrow - I didn't get a chance this week as I've been busy.

 

Has anyone seen any benefit from pinning the emulator (emulatorpin) to the same NUMA node?

4 hours ago, DZMM said:

I'm going to try this out tomorrow - I didn't get a chance this week as I've been busy.

 

Has anyone seen any benefit from pinning the emulator (emulatorpin) to the same NUMA node?

So, it's important to set it so you don't go outside a core that has access to the memory; otherwise it will introduce stuttering into games with intense graphics. The one I test on is Dying Light. Before I did this little patch I was getting max 80 fps; the patch skyrocketed my fps to a nearly consistent 120. Your L3 cache gets the largest boost, from 50-ish ns down to 10-11 ns, and your memory latency usually drops from the low 100s to very close to bare metal. It's amazing.
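For anyone trying it, a minimal emulatorpin sketch (the cpuset value is a placeholder; use host threads on the same die as your pinned vCPUs):

  <cputune>
    <!-- keep the QEMU emulator threads on the VM's own NUMA node -->
    <emulatorpin cpuset='2-3'/>
  </cputune>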

 

4 hours ago, Symon said:

Tried these points on my computer and finally solved the stutter problems while gaming (TR1950) :)

What I did:

Changed BIOS RAM settings (ASUS ROG Zenith Extreme):

Advanced > DF common options > Memory interleaving: Auto > Channel

Added CPU patch:


<cpu mode='custom' match='exact' check='partial'>
	<model fallback='allow'>EPYC-IBPB</model>
	<topology sockets='1' cores='3' threads='2'/>
	<feature policy='require' name='topoext'/>
</cpu>

And Numatune:


  <numatune>
    <memory mode='interleave' nodeset='1'/>
  </numatune>

Thanks for your help guys! 👍

I'm very glad you got better performance. I struggled with this for weeks/months. I tried literally everything I could think of and it really started to bother me. When I realized the cache was hosed, I was baffled that I had missed it. I went back and reread a guide I had read in the beginning:

 

https://tripleback.net/post/chasingvfioperformance/ <- this guy. I realized he had stuck a patch on it, and then I dug in to find out what that patch did. I found some stuff related to it, which I requested be added in the feature upgrades (if you could go over there and upvote them so we can get more attention). I was super happy that the person on Reddit provided me the CPU information like they did, and that the other person provided the code snippets (I honestly don't know how to apply them in Unraid, though).

Anyway, I'm really glad you guys are getting benefits. I'm a fairly senior server admin, but my focus is ESXi/Windows/NetScalers/NetApp. It boggled my mind that I couldn't get the performance I was searching for out of this proc with all my ESXi experience. Let's keep digging and chasing that performance. We are within striking distance of bare metal, but sadly, please keep in mind: with this "fix" you CANNOT cross NUMA nodes or you will get a pretty bad latency penalty, because the OS no longer understands there is a NUMA topology. All my tweaking hasn't been able to make it understand it has one, so it attempts to access memory on the other controller all the time.

Once we get a legit fix in and can use only name='topoext', then we can pass NUMA information again, I believe, but SMT may still be broken. I know QEMU has 3.1.0-rc2 out. Hopefully they've rolled all the fixes we need into it, and when 3.1.0 comes out limetech will integrate it.
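For the curious, exposing guest NUMA cells in the libvirt XML would look roughly like this (a sketch only; per the above it didn't behave on this QEMU build, and the cpu ranges and memory sizes are placeholders):

<cpu mode='custom' match='exact' check='partial'>
	<model fallback='allow'>EPYC-IBPB</model>
	<topology sockets='1' cores='8' threads='2'/>
	<numa>
		<!-- one cell per die: vCPUs 0-7 with 8 GiB on node 0, vCPUs 8-15 with 8 GiB on node 1 -->
		<cell id='0' cpus='0-7' memory='8' unit='GiB'/>
		<cell id='1' cpus='8-15' memory='8' unit='GiB'/>
	</numa>
	<feature policy='require' name='topoext'/>
</cpu>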

 

Specifically 

 and but I need to word this one better I believe.

 

