Chamzamzoo Posted November 21, 2018 I've been following this topic with interest but can't say I understand more than half of it yet. Things are a little different for me with the Threadripper 1900X. I've set my CPUs as follows: This did deliver a marked improvement in AIDA64 over my previous setup (which had 6 pairs for 12 cores): New setup: Not entirely sure how the NUMA settings would affect me on this CPU, as it has fewer cores (and perhaps fewer dies). Would it be worth changing the BIOS setting to Channel as well, considering that AMD "confines the 1900X's active cores to a single CCX inside each die"? I see some comments above suggesting that any of this may not actually make any real-world difference, or am I misreading? I did get a bit lost after page 1. The AIDA64 numbers do look markedly improved, though!
bastl Posted November 21, 2018 @Chamzamzoo Every first-gen Threadripper has 2 dies, with a maximum of 8 cores per die. The smallest TR4 chip (1900X) has only 4 cores per die enabled: cores 0-3 + their HT siblings are on one die, and 4-7 + HT on the second die. The increase in memory bandwidth you see in AIDA is due to the fact that you're using both dies, each with its own memory controller with 2 channels; with 2 Ryzen dies you get quad channel. There is no single "best setting"; it all depends on your needs. If you need the memory bandwidth for your applications, use both dies. If you need the lower latency, use cores from only one die and set your memory in the BIOS to NUMA ("Channel", in most cases). There are still some quirks with KVM and the memory setting: you can "strictly" set your VM to use only memory from a specific node, but a couple of people have reported that a bit of memory is still allocated from the other die, which increases the latency again. Also, tweaking your XML to present the CPU as an EPYC to the VM can improve performance a bit; in that case the actual CPU cache is presented to the VM in the correct way. Unraid itself, for some reason, changes the L1, L2 and L3 cache sizes the VM sees with its standard settings for CPU model passthrough. Also noted: you're only passing through the HT cores to the VM. The usual way is to pass through the main core + its hyperthread. I can't really tell whether this makes any big difference in performance; I never tested it set up the way you have it.
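The usual "main core + its hyperthread" pairing can be expressed with vcpupin entries in the VM's XML. A minimal sketch for a 4-vCPU VM on die 0 of a 1900X; the sibling numbering here (core 0 paired with CPU 8) is an assumption based on the 0-7 + HT layout described above, so check `lscpu -e` for your actual pairings:

```xml
<vcpu placement='static'>4</vcpu>
<cputune>
  <!-- pin each vCPU to a physical core plus its SMT sibling on die 0 -->
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='8'/>  <!-- assumed HT sibling of core 0 -->
  <vcpupin vcpu='2' cpuset='1'/>
  <vcpupin vcpu='3' cpuset='9'/>  <!-- assumed HT sibling of core 1 -->
</cputune>
```

Keeping each core/sibling pair on the same die is what lets the guest scheduler benefit from the shared L1/L2 between siblings.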
Nooke Posted November 21, 2018 On 11/16/2018 at 9:05 PM, DZMM said: I even emailed MSI and they said no, although the reply was a bit short so I'm not 100% certain the person understood the question Go into BIOS -> OC -> Advanced DRAM Configuration. Scroll down to "Misc Item" and look for "Memory Interleaving". Change this from "Auto" to "Channel" and you are in NUMA mode. I have the MSI X399 SLI Plus myself. You should definitely check whether your GPU and additional M.2 devices are running at PCIe Gen 3 (GPU-Z etc.). Mine always fell back to PCIe Gen 1/2, which drastically reduced my performance. I had a support ticket open with MSI, and after 4 different BIOS versions they fixed the issue. cheers
Chamzamzoo Posted November 21, 2018 Thanks @bastl, that's very helpful. I'm mostly interested in performance on the gaming side of things; my Plex and other dockers don't see a huge amount of usage. Would gaming benefit more from increased bandwidth or from lower latency, or is it game-dependent? I will try the EPYC code and see how it runs in some benchmarks; looks interesting. I tried this as a more traditional main core + HT core assignment, but it was back to dual-channel mode I think; scores were down 50% for L1-L3 cache. Sorry for the large images, they come out like this on Mac... must be the Retina screen.
DZMM Posted November 21, 2018 28 minutes ago, Nooke said: Go into BIOS -> OC -> Advanced DRAM Configuration. Scroll down to "Misc Item" and look for "Memory Interleaving". Change this from "Auto" to "Channel" and you are in NUMA mode. I have the MSI X399 SLI Plus myself. You should definitely check whether your GPU and additional M.2 devices are running at PCIe Gen 3 (GPU-Z etc.). Thanks for finding this - the manual is awful. I'll make this change the next time I reboot. Do you recommend NUMA mode? I'm hoping my lstopo layout doesn't make it hard for me to assign cores. If I match cores to a die, does unRAID automatically add RAM from the same die, or do I need to make other changes? Are there any other settings I should have enabled in my BIOS? I just checked my GTX 1060 and it's in PCIe 3.0 - I'll check the other two VMs later.
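On the "does unRAID automatically add RAM from the same die" question: as bastl noted above, you can ask libvirt to do this explicitly with a numatune block. A sketch, assuming the VM's cores are pinned to NUMA node 0 (adjust the nodeset to wherever your cores actually live), and keeping in mind the caveat above that KVM has been reported to still allocate a little memory from the other node:

```xml
<numatune>
  <!-- allocate guest RAM only from NUMA node 0, where the pinned cores live -->
  <memory mode='strict' nodeset='0'/>
</numatune>
```

`mode='strict'` fails allocations that can't be satisfied from that node; `preferred` is a softer alternative if you'd rather spill to the other die than fail.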
jordanmw Posted November 21, 2018 Share Posted November 21, 2018 (edited) Trying to optimize my setup, maybe you guys will have some suggestions. Here is the setup: 1920x with 4x gtx 960 setup as 4 gaming machines and 1 game server. The performance is decent, but seems like it could be better. Here is my current config, what changes do you think I should make to optimize it? Also I saw all the info about the EYPC cache tweaks and wondering if that is something I should do also. Can someone chime in with their best guess? Brown is the game server- all other colors are individual machines with the same color graphics card as the CPUs. Edited November 21, 2018 by jordanmw Quote Link to comment
Jerky_san Posted November 21, 2018 17 minutes ago, jordanmw said: Trying to optimize my setup, maybe you guys will have some suggestions. Here is the setup: 1920X with 4x GTX 960 set up as 4 gaming machines and 1 game server. For me the EPYC tweaks helped a lot, especially under high load. If the VM thinks the cache is much bigger than it actually is, it might cause issues. Anyways, all you can do is try and see.
DZMM Posted November 21, 2018 On 11/14/2018 at 2:51 PM, Jerky_san said: A person on Reddit told me the answer to my problem. If you do the below, QEMU presents an EPYC instead and all the cache is right. It dropped latency across the board: L3 is down to 13ns, L1 ~1ns, L2 2-3ns. The machine seems MUCH more responsive. I should also mention they said they had updated their kernel, and a patch in QEMU made theirs see it properly without this code, so hopefully we will see it in Unraid as well.
<cpu mode='custom' match='exact' check='partial'>
  <model fallback='allow'>EPYC-IBPB</model>
  <topology sockets='1' cores='8' threads='2'/>
  <feature policy='require' name='topoext'/>
</cpu>
I was just about to try this on one of my VMs and I'm a bit confused now about cores and threads. I have a 3-core VM: But in my XML it says 6 cores, 1 thread:
<cpu mode='host-passthrough' check='none'>
  <topology sockets='1' cores='6' threads='1'/>
</cpu>
Has unRAID got confused, or is it (more likely) me? If I do the EPYC change, should my config be:
<cpu mode='custom' match='exact' check='partial'>
  <model fallback='allow'>EPYC-IBPB</model>
  <topology sockets='1' cores='3' threads='2'/>
  <feature policy='require' name='topoext'/>
</cpu>
jordanmw Posted November 21, 2018 1 hour ago, DZMM said: But in my xml it says 6 cores, 1 thread: yeah, I'm seeing
<cpu mode='host-passthrough' check='none'>
  <topology sockets='1' cores='4' threads='1'/>
</cpu>
when assigning. I'm sure I read something about this, though.
Jerky_san Posted November 21, 2018 1 hour ago, jordanmw said: yeah, I'm seeing
<cpu mode='host-passthrough' check='none'>
  <topology sockets='1' cores='4' threads='1'/>
</cpu>
when assigning. You have to manually adjust the cores/threads, and even then I don't believe QEMU plays nicely with SMT cores on Threadripper yet, but I've not noticed any ill effects.
DZMM Posted November 22, 2018 23 minutes ago, Jerky_san said: You have to manually adjust the cores/threads, and even then I don't believe QEMU plays nicely with SMT cores on Threadripper yet, but I've not noticed any ill effects. Should this be submitted as a bug to the unRAID team? Maybe it's something they can fix.
bastl Posted November 22, 2018 It doesn't matter performance-wise whether I set it to cores='4' threads='1' or cores='2' threads='2'; in my tests it always shows the same performance. I did a couple of tests on the current 6.6.5 with different benchmarks (Cinebench, AIDA, CPU-Z) and games (GTA, BF1, Rust), and all scores are nearly the same. As for the issue with the L1, L2 and L3 cache being reported wrongly to the VM: I don't know if this is an Unraid-specific thing that @limetech can fix, or whether it has to be implemented in the Linux kernel, libvirt or QEMU.
Jerky_san Posted November 22, 2018 1 hour ago, bastl said: As for the issue with the L1, L2 and L3 cache being reported wrongly to the VM: I don't know if this is an Unraid-specific thing that @limetech can fix, or whether it has to be implemented in the Linux kernel, libvirt or QEMU. A guy on Reddit told me this code would fix it. I also requested the fix in the bug-fix forum and posted a few posts. Basically we just need the option to turn on this patch.
jordanmw Posted November 23, 2018 On 11/21/2018 at 1:20 PM, Jerky_san said: For me the EPYC tweaks helped a lot, especially under high load. So how exactly should I be adding this tweak? Is it in the XML of the individual machines? Where do I add it, and what should I add for a 1920X?
Jerky_san Posted November 23, 2018 1 hour ago, jordanmw said: So how exactly should I be adding this tweak? Is it in the XML of the individual machines? Where do I add it, and what should I add for a 1920X? Yes, in each VM's XML. Change this:
<cpu mode='host-passthrough' check='none'>
  <topology sockets='1' cores='4' threads='1'/>
</cpu>
into this:
<cpu mode='custom' match='exact' check='partial'>
  <model fallback='allow'>EPYC-IBPB</model>
  <topology sockets='1' cores='3' threads='2'/>
  <feature policy='require' name='topoext'/>
</cpu>
Remember to change cores to half of what you assigned (if you're using SMT), so 16 assigned CPUs (8 cores + 8 SMT threads) would be cores='8' threads='2'. Also, if you ever change anything in your config, you'll have to re-apply this, as Unraid will revert it to the host-passthrough form above.
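The halving rule above is easy to get backwards, so here it is as a tiny sketch of the arithmetic (the helper name is my own for illustration, not anything in Unraid or QEMU):

```python
def epyc_topology(assigned_cpus, smt=True):
    """Return (cores, threads) for the <topology> element.

    Following the rule above: with SMT, cores is half the assigned
    vCPUs and threads is 2; without SMT, every vCPU is its own core.
    """
    if smt:
        if assigned_cpus % 2:
            raise ValueError("with SMT, assign vCPUs in core+thread pairs")
        return assigned_cpus // 2, 2
    return assigned_cpus, 1

# 16 assigned vCPUs (8 cores + their SMT siblings) -> cores='8' threads='2'
print(epyc_topology(16))  # (8, 2)
# DZMM's 3-core VM with 6 vCPUs -> cores='3' threads='2'
print(epyc_topology(6))   # (3, 2)
```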
bastl Posted November 23, 2018 The core topology doesn't matter.
Jerky_san Posted November 23, 2018 56 minutes ago, bastl said: The core topology doesn't matter. It does; it simply doesn't work on Threadripper yet. It's yet another thing that is broken in QEMU; there is a patch for that as well to make SMT work properly.
jordanmw Posted November 23, 2018 Does anyone know if the next RC or version will include some of this stuff, and maybe lstopo and hwloc? I saw lstopo mentioned.
bastl Posted November 23, 2018 @Jerky_san Whatever I set the topology to, 8 cores/1 thread or 4 cores/2 threads, the benchmarks show the same performance. Windows even shows 8 virtual CPUs no matter what I set. There is no difference with these settings.
jordanmw Posted November 23, 2018 10 minutes ago, bastl said: @Jerky_san Whatever I set the topology to, 8 cores/1 thread or 4 cores/2 threads, the benchmarks show the same performance. Are you seeing differences with the EPYC CPU tweaks?
Jerky_san Posted November 23, 2018 Share Posted November 23, 2018 (edited) 37 minutes ago, bastl said: @Jerky_san Whatever i set the topology to, 8cores 1 thread or 4 cores 2 threads benchmarks showing the same performance. Even windows shows 8 virtual CPUs no matter what i set. There is no difference with these setting. As I just stated it is broken for threadripper. The version of QEMU we have are supposed to have the cache fixes and SMT fixes but it appears they are both missing for some reason. No idea why. There honestly is way to much confusion around it. It took me days of research on the EPYC tweak thing and apparently others claim they don't need it anymore so it just what version of QEMU 3.0 we are running.. 26 minutes ago, jordanmw said: Are you seeing differences with the eypc CPU tweaks? With the EPYC cpu tweaks it gives me substantially less latency across the board please see my previous posts in this thread if you'd like to see more. Also its the closet to bare metal I've ever gotten btw. Edited November 23, 2018 by Jerky_san 2 Quote Link to comment
bastl Posted November 23, 2018 45 minutes ago, jordanmw said: Are you seeing differences with the EPYC CPU tweaks? With those tweaks the VM shows the correct L1-L3 caches of the CPU in CPU-Z. It feels a bit smoother, which might be due to the lower latency I see in AIDA, same as @Jerky_san reported earlier.
Symon Posted November 24, 2018 Share Posted November 24, 2018 (edited) Tried these points on my computer and finally solved the stutter problems while gaming (TR1950) What I did: Changed Bios Ram settings (Asus Rog Zenith Extreme): Advanced > DF common options > Memory interleaving: Auto > Channel Added CPU patch: <cpu mode='custom' match='exact' check='partial'> <model fallback='allow'>EPYC-IBPB</model> <topology sockets='1' cores='3' threads='2'/> <feature policy='require' name='topoext'/> </cpu> And Numatune: <numatune> <memory mode='interleave' nodeset='1'/> </numatune> Thanks for your help guys! 👍 Edited November 24, 2018 by Symon 1 Quote Link to comment
DZMM Posted November 25, 2018 15 minutes ago, Symon said: Tried these points on my computer and finally solved the stutter problems while gaming (TR 1950X). I'm going to try this out tomorrow; I didn't get a chance this week as I've been busy. Has anyone seen any benefit from pinning the emulator (emulatorpin) to the same NUMA node?
Jerky_san Posted November 25, 2018 4 hours ago, DZMM said: Has anyone seen any benefit from pinning the emulator (emulatorpin) to the same NUMA node? So... it's important to set it so you don't end up outside the cores that have access to the VM's memory; otherwise it will introduce stuttering into games with intense graphics. The one I test with is Dying Light. Before this little patch I was getting 80fps max; the patch skyrocketed my fps to a nearly consistent 120. Your L3 cache latency gets the largest boost, from 50-ish ns down to 10-11, and your memory latency usually drops from the low 100s to very close to bare metal. It's amazing. 4 hours ago, Symon said: Tried these points on my computer and finally solved the stutter problems while gaming (TR 1950X). I'm very glad you got better performance. I struggled with this for weeks/months; I tried literally everything I could think of, and it really started to bother me. When I realized the cache was hosed, I was baffled that I had missed it. I went back and read a guide I had read in the beginning, https://tripleback.net/post/chasingvfioperformance/ <- this guy, and realized he had stuck a patch on it, and then I dug in to find out what that patch did. I found some stuff related to that, which I requested be added in the feature upgrades (if you could go over there and upvote them so we can get more attention).
I was super happy that the person on Reddit provided the CPU information like they did, and that the other provided the code snippets (I honestly don't know how to apply them in Unraid, though). Anyways, I really am glad you guys are getting benefits. I'm a senior server admin, but my focus is ESXi/Windows/NetScalers/NetApp; it boggled my mind that I couldn't get the performance I was searching for out of this proc with all my ESXi experience. Anyways, let's keep digging and chasing that performance. We are within striking distance of bare metal, but please keep in mind: with this "fix" you CANNOT cross NUMA nodes, or you will get a pretty bad latency penalty, because the OS no longer understands there is a NUMA topology. All my tweaking hasn't been able to get it to understand that, so it attempts to access memory on the other controller all the time. Once we get a legit fix in and can use name='topoext' on its own, then we can pass NUMA information again, I believe, but SMT may still be broken. I know QEMU has 3.1.0-rc2 out; hopefully they've rolled all the fixes we need into it, and when it reaches 3.1.0, limetech will integrate it.
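On DZMM's emulatorpin question: the emulator/IO threads can be pinned inside the same cputune block as the vCPUs. A sketch, assuming the VM's vCPUs are pinned to die 0 (CPUs 0-3 and their SMT siblings 8-11 in the numbering discussed earlier in the thread); adjust the cpuset to your own layout from lstopo:

```xml
<cputune>
  <!-- keep QEMU's emulator threads on the same die as the pinned vCPUs,
       so they never touch the far memory controller -->
  <emulatorpin cpuset='0-3,8-11'/>
</cputune>
```

Pinning the emulator to a different die than the guest's memory would reintroduce exactly the cross-NUMA latency penalty described above.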