Jerky_san Posted November 14, 2018

A person on Reddit told me the answer to my problem. If you use the config below, QEMU presents an EPYC CPU instead and all the cache is reported correctly. It dropped latency across the board: L3 is down to 13 ns, L1 to about 1 ns, and L2 to 2-3 ns. The machine seems MUCH more responsive. I should also mention they said they had updated their kernel, and a patch in QEMU made theirs see it properly without this code, so hopefully we will see that in Unraid as well.

<cpu mode='custom' match='exact' check='partial'>
  <model fallback='allow'>EPYC-IBPB</model>
  <topology sockets='1' cores='8' threads='2'/>
  <feature policy='require' name='topoext'/>
</cpu>

To give an idea of how much of a change: if you're wondering why the read/write/copy figures for the caches are "higher" in the old results, it's because I had cores bound from multiple NUMA nodes in an attempt to make things faster. The new results are from NUMA node 0, with cores 1-7 and their SMT siblings bound.

Edit: An update to this: it decreased latency across the board, but you cannot have multiple NUMA nodes. So far, stating a NUMA topology fails and the OS is unaware when you cross NUMA nodes. Working on that part, since once I fix it I will be able to raise my read/write/copy speeds a lot.

Edit 2: I can report substantial increases in FPS in my games, and the stutter has thus far been eliminated.

[Benchmark screenshots: Old / New / Physical bare-metal CPU-Z / Virtual Old / Virtual New]

Another user posted this to me, since the QEMU code does not allow passing the cache topology through.
Here is a patch for QEMU 3.0.0 that allows the cache topology to be passed through:

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 723e02221e..2912867872 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -4221,6 +4221,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
         break;
     case 0x8000001D:
         *eax = 0;
+        if (cpu->cache_info_passthrough) {
+            host_cpuid(index, count, eax, ebx, ecx, edx);
+            break;
+        }
         switch (count) {
         case 0: /* L1 dcache info */
             encode_cache_cpuid8000001d(env->cache_info_amd.l1d_cache, cs,

Use at your own risk.
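As a side note (not from the thread): the guest is told it has sockets × cores × threads logical CPUs by the `<topology>` element, so it's worth checking that product against your `<vcpu>` count before hand-editing the XML. A minimal Python sketch, with helper names of my own invention, purely illustrative:

```python
def implied_vcpus(sockets: int, cores: int, threads: int) -> int:
    """vCPU count implied by a libvirt <topology> element."""
    return sockets * cores * threads

def epyc_cpu_xml(sockets: int, cores: int, threads: int) -> str:
    """Build the <cpu> block from the post for a given guest topology."""
    return (
        "<cpu mode='custom' match='exact' check='partial'>\n"
        "  <model fallback='allow'>EPYC-IBPB</model>\n"
        f"  <topology sockets='{sockets}' cores='{cores}' threads='{threads}'/>\n"
        "  <feature policy='require' name='topoext'/>\n"
        "</cpu>"
    )

# The config in the post advertises 1 socket x 8 cores x 2 threads = 16 vCPUs.
```

Note that libvirt versions differ in how strictly they enforce the match between the topology product and the maximum vCPU count, so treat this as a sanity check, not a guarantee.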
TType85 Posted November 14, 2018

I just tried the settings and I still get the same as the Virtual Old. My latency for memory and L1 is close to yours, but the write/copy figures are about half; this VM only has 8 vCPUs (4c/8t). The L2 and L3 tests don't even run. I am on the trial version of AIDA64, so maybe that is why they don't show?
Jerky_san Posted November 14, 2018 Share Posted November 14, 2018 (edited) 11 minutes ago, TType85 said: I just tried the settings and I still get the same as the Virtual Old. My latency for Memory and L1 are close to yours but the write/copy are about half. This VM only has 8 vCpu (4c/8t). The L2 and L3 don't even run, I am on the trial version of aida64 so maybe that is why they don't show? The biggest change for me was L3. The memory latency was something in the VM itself and some tweaks.(I purged the whole VM) The L3 was a 6x decrease in latency. L1 for me was 3ns now 1ns and thats massive for L1 as its used non stop but the biggest thing is that the cache is properly allocated. Before if you look i had 5x16 on my L3 which is literally impossible and L1 was 2x larger than it was supposed to be and only 2 way instead of 8 way. Also with this change you can't span numa. So if you have any procs allocated outside a single numa it will cause more latency on memory. If we can get it to identify threadripper properly then we can do the NUMA cross and be fine. Should also state I have 2990wx and I have my memory set to channel from "auto" in the bios that provided a large increase in performance as well. "In my case, under Advanced -> CBS -> DF there is a Memory model item that has choices like auto | distribute | channel. In our case, we are interested in channel mode, as it will expose NUMA information to the host once more." 
<vcpu placement='static'>14</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='1'/>
  <vcpupin vcpu='1' cpuset='33'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='34'/>
  <vcpupin vcpu='4' cpuset='3'/>
  <vcpupin vcpu='5' cpuset='35'/>
  <vcpupin vcpu='6' cpuset='4'/>
  <vcpupin vcpu='7' cpuset='36'/>
  <vcpupin vcpu='8' cpuset='5'/>
  <vcpupin vcpu='9' cpuset='37'/>
  <vcpupin vcpu='10' cpuset='6'/>
  <vcpupin vcpu='11' cpuset='38'/>
  <vcpupin vcpu='12' cpuset='7'/>
  <vcpupin vcpu='13' cpuset='39'/>
  <emulatorpin cpuset='1-7'/>
</cputune>
<numatune>
  <memory mode='interleave' nodeset='0'/>
</numatune>
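The pinning above follows a regular pattern: each vCPU pair maps a physical core and its SMT sibling, alternating 1, 33, 2, 34, and so on. A small Python sketch of generating those lines; the sibling offset of 32 is an assumption for a 32-core/64-thread 2990WX where Linux numbers thread siblings as core + 32, so verify your own layout with `lscpu -e` first:

```python
def vcpupin_lines(phys_cores, smt_offset=32):
    """Emit <vcpupin> lines that alternate each physical core with its
    SMT sibling, matching the pattern in the XML above (1,33,2,34,...).

    smt_offset=32 assumes a 32-core/64-thread part whose thread
    siblings are numbered core+32; adjust for your CPU.
    """
    lines = []
    vcpu = 0
    for core in phys_cores:
        for host_cpu in (core, core + smt_offset):
            lines.append(f"<vcpupin vcpu='{vcpu}' cpuset='{host_cpu}'/>")
            vcpu += 1
    return lines
```

For example, `vcpupin_lines(range(1, 8))` reproduces the 14 pin lines of the config above.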
TType85 Posted November 14, 2018

1 hour ago, Jerky_san said: The biggest change for me was L3. [...]

I set mine to Channel this weekend and noticed an increase in performance in WoW. I just added the numatune to my configs, pointing each to the correct nodeset (my wife's VM is on node 1, mine is on node 0).
I'll have to wait till I get home to check gaming performance, since I'm RDP'd in right now. If I can get it to read the cache right, maybe it will get even better.
Jerky_san Posted November 14, 2018 Share Posted November 14, 2018 3 minutes ago, TType85 said: I set mine to Channel this weekend and noticed an increase in performance in WoW. I just added the numatune to my configs pointing to the correct nodeset (wifes VM is on node 1, mine is on node 0). I'll have to wait till I get home to check gaming performance. since I am RDP'd in right now. If I can get it to read the cache right maybe it will get even better. So I probably should of set that to "strict" instead of "interleave" as that is for when I am spanning two numas. Quote Link to comment
TType85 Posted November 14, 2018

34 minutes ago, Jerky_san said: So I probably should have set that to "strict" instead of "interleave" [...]

Oddness: on my wife's VM, memory bandwidth is half and latency is 50% higher. All settings are the same except she is on the other side of the chip. CPU-Z scores are the same (better than they were before, in the 410 range instead of the 360 range).
Jerky_san Posted November 14, 2018

29 minutes ago, TType85 said: Oddness: on my wife's VM, memory bandwidth is half and latency is 50% higher. [...]

So the "nodeset" setting tells it where to pull memory from. It should only pull memory from the node it's assigned to, die-wise. It always pulls a little RAM from the other side. 64081 is my main gaming VM. Also, is your RAM populated across at least 4 DIMMs?

Per-node process memory usage (in MBs)
PID               Node 0  Node 1  Node 2  Node 3  Total
----------------  ------  ------  ------  ------  -----
52912 (qemu-syst       4       0    1510       0   1514
64081 (qemu-syst   26427       0    1833       0  28260
----------------  ------  ------  ------  ------  -----
Total              26431       0    3343       0  29773
TType85 Posted November 15, 2018

2 hours ago, Jerky_san said: So the "nodeset" setting tells it where to pull memory from. [...]

Currently there are 4 DIMMs installed. What command shows the per-node usage?
Jerky_san Posted November 15, 2018

1 minute ago, TType85 said: Currently there are 4 DIMMs installed. What command shows the per-node usage?

numastat -c qemu
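Not from the thread: a small Python sketch of how one might parse `numastat -c`-style output programmatically to spot cross-node allocations. The column layout is assumed from the table Jerky_san pasted above (PID, truncated process name, one column per node, Total last), so treat it as illustrative:

```python
def parse_numastat(text):
    """Parse 'numastat -c <pattern>'-style rows into
    {pid: [node0_mb, node1_mb, ...]}."""
    usage = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or not parts[0].isdigit():
            continue  # skip the header, separator rows, and the Total row
        # keep only the numeric fields: pid, per-node MBs, and the total
        nums = [int(p) for p in parts if p.lstrip('-').isdigit()]
        pid, nodes = nums[0], nums[1:-1]  # last numeric column is Total
        usage[pid] = nodes
    return usage

def offnode_mb(nodes):
    """MB allocated outside the node holding the most memory,
    i.e. the 'little RAM from the other side' mentioned above."""
    return sum(nodes) - max(nodes)
```

Running it over the pasted table would report 1833 MB off-node for the gaming VM (PID 64081).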
TType85 Posted November 15, 2018

7 minutes ago, Jerky_san said: numastat -c qemu

Wife's VM is set to strict, nodeset 1; mine was strict, nodeset 0, but I rebooted the server and my VM wouldn't start because it ran out of memory. I had to take out the numatune to get my VM up.

Edit: both are set to 8 GB of RAM.
Jerky_san Posted November 15, 2018

1 minute ago, TType85 said: Wife's VM is set to strict, nodeset 1; mine was strict, nodeset 0 [...] I had to take out the numatune to get my VM up.

How much RAM do you have?
TType85 Posted November 15, 2018

Just now, Jerky_san said: How much RAM do you have?

32 GB total, 8 GB per VM, 2 VMs. I do have a bunch of Dockers too.
Jerky_san Posted November 15, 2018

3 minutes ago, TType85 said: 32 GB total, 8 GB per VM, 2 VMs. I do have a bunch of Dockers too.

That's strange, it should have started then. Run sudo dmidecode -t memory | grep -i size and post the output.
TType85 Posted November 15, 2018

6 minutes ago, Jerky_san said: That's strange, it should have started then. [...]

Unraid sees all 32 GB, currently at 62% used. A 64 GB upgrade is on the list but will have to wait till after the holidays.
Jerky_san Posted November 15, 2018

2 minutes ago, TType85 said: Unraid sees all 32 GB, currently at 62% used. [...]

I don't know why it wouldn't start, but since you have your VMs split between the nodes, it looks like it's using the other node's RAM. Still strange, though.
TType85 Posted November 15, 2018

1 minute ago, Jerky_san said: I don't know why it wouldn't start [...]

Yeah, really odd. The wife's VM usually starts first; I would assume the numatune setting would push it over to node 1, and when mine starts it would be in node 0. Will have to wait till tomorrow to play with it more. numastat -c below; look at that numa_miss....
Jerky_san Posted November 15, 2018

1 minute ago, TType85 said: Yeah, really odd. [...]

How about your VM, does it feel any better?
TType85 Posted November 15, 2018

Just now, Jerky_san said: How about your VM, does it feel any better?

Mine feels great. I had some weirdness running WoW and YouTube in Chrome at the same time, but now that seems OK. Will have to play with it more, though.
Jerky_san Posted November 15, 2018

I would like to add that I no longer know when people are using my Plex server. I hope that this patch gets added in, or QEMU gets updated, and then I can span NUMA nodes, but damn, it's great so far. Can't believe what cache does for me.
DZMM Posted November 15, 2018

I emergency-purchased a 2950X and an MSI X399 SLI Plus mobo on Tuesday when my mobo died. I hadn't done any research; I just kind of knew that my next system would be a TR for the value. As the purchase wasn't planned, I haven't done my TR research, although even without any tweaking I'm very happy with the performance boost on the first day of running; I just want to make sure I'm not driving a Ferrari in the slow lane. I've been reading this thread today and I'm not sure how I'm best supposed to assign cores to my VMs. At the moment I've got 3 VMs with 3 cores each and 1 with 2 cores, and I've just gone 10-13, 14-19, 20-25 and 26-31 (I've got most of my Dockers on 4-9, with my VM emulator pins on 2-3, and 0-1 left for unRAID). Since lstopo said I only had one die, I mistakenly thought NUMA was for the WX parts, but now I guess my mobo is set to UMA, even though I can't see a setting to change in the BIOS? My questions are:

i) Do I need to enable NUMA somehow in the BIOS and then assign cores to the VMs from the same die as the PCIe devices, or is UMA OK?
ii) If I'm supposed to enable NUMA, where's the setting on the SLI Plus, as I can't find it?
iii) If I stick with UMA, are my core assignments OK?
iv) Are there any other unRAID/BIOS changes I should be making?

Thanks in advance
bastl Posted November 16, 2018

If you're running in NUMA mode, you gain memory bandwidth because you're accessing both memory controllers at the same time, one on each die, but you will have slightly higher memory-access latency. In all my testing, for normal use of a VM (browsing, office stuff, gaming), I couldn't see any big differences. Switching the GPU to another slot, or using an NVMe in another one, so the device isn't connected to the die I pass through directly, shows the same for me: no big noticeable differences. I don't really care about ±5 fps in games as long as everything runs smoothly, and it does. Read and write speeds of the NVMe are also roughly the same. You might see some hiccups if the device is connected to the other die while that die is under heavy load. For me, it never happens that my Dockers or Unraid itself get to the point of using all the CPU resources; it might be different with a Plex container or some other transcoding Dockers pushing the CPU to the limit. It all depends on how you're using your rig. I don't really know where you can find the BIOS setting on an MSI board; on ASRock you can find it under CBS / DF common options. The default memory interleaving setting for me is Auto; Channel switches it to NUMA. Besides setting my XMP profile for the RAM, I only enabled the hardware virtualization support (SVM Mode and SR-IOV) and enabled IOMMU; I'm not really sure what the last one is exactly called, something with IOMMU. No matter how you set your memory, you should always use cores/threads from the same die for a single VM and not mix up the cores, to reduce latency. On my 1950X it looks like this: die 1 is only for my main VM with GPU and NVMe passthrough. Cores 8 and 24 are the emulator pins, and the rest is isolated and only used by a Win10 VM. On die 0 I have all my Dockers running and a couple of VMs which I use from time to time.
I am currently playing around with the emulatorpin setting, but I can't see any difference compared to going without it. Isolating the emulator pins, not isolating them, or no pinning at all: for me, there is no real difference. Using the emulatorpin on the other die I haven't checked yet. Worth a try.
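bastl's layout above (die 1 = cores 8-15 with SMT threads 24-31 for the gaming VM) can be sketched as a helper. The core numbering convention here is an assumption about how Linux typically enumerates a 1950X (physical cores first, then their SMT siblings), so verify with `lscpu -e` before pinning:

```python
def die_cpus(die, cores_per_die=8, n_dies=2):
    """Host CPU numbers (physical cores plus SMT siblings) on one die,
    assuming Linux numbers cores 0..N-1 first and SMT siblings
    N..2N-1 after. On a 16-core 1950X under that assumption, die 1 is
    cores 8-15 plus threads 24-31, matching the layout described above.
    """
    n_cores = cores_per_die * n_dies
    phys = list(range(die * cores_per_die, (die + 1) * cores_per_die))
    return phys + [c + n_cores for c in phys]
```

So `die_cpus(1)` yields cores 8-15 and 24-31, the set bastl isolates for his main VM, while `die_cpus(0)` yields the Docker/utility side.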
DZMM Posted November 16, 2018

Thanks, that was useful. My mobo apparently can't switch NUMA on, but it looks like I'm not missing anything. I've switched the cores for my VMs to get them lined up with the dies, even though I'm UMA. Is the layout the same for the 2950X? I agree about the emulator pins; I do it out of habit, even though I haven't seen any evidence to say it is worthwhile.
bastl Posted November 16, 2018

52 minutes ago, DZMM said: My mobo apparently can't switch NUMA on

I bet you have this setting too. All the X399 BIOSes I've seen so far have tons of settings, more or less well structured. If you search long enough you might find it 😁

54 minutes ago, DZMM said: is the layout the same for 2950x?

Should be pretty similar. Both have 2 dies, each with 8 cores.
DZMM Posted November 16, 2018

1 minute ago, bastl said: I bet you have this setting too. [...]

I even emailed MSI and they said no, although the reply was a bit short, so I'm not 100% certain the person understood the question.
Jerky_san Posted November 16, 2018

3 hours ago, DZMM said: I even emailed MSI and they said no [...]

Mine wasn't called NUMA; mine was auto, die, and channel. Channel is best on mine. From a guide I used: "In my case, under Advanced -> CBS -> DF there is a Memory model item that has choices like auto | distribute | channel. In our case, we are interested in channel mode, as it will expose NUMA information to the host once more."