
Jerky_san

Members
  • Content Count

    173
  • Joined

  • Last visited

Community Reputation

6 Neutral

About Jerky_san

  • Rank
    Advanced Member



  1. Mine wasn't called NUMA; the options were auto, die, and channel, and channel works best on mine. From a guide I used: "In my case, under Advanced -> CBS -> DF there is a Memory Model item that has choices like auto | distribute | channel. In our case, we are interested in channel mode, as it will expose NUMA information to the host once more." You can sanity-check it with the command below.
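     A quick way to double-check that the setting took effect, assuming numactl is installed on the host:

         # After switching the BIOS memory model to channel, the host should
         # report more than one NUMA node with its own local memory
         numactl --hardware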
  2. <cpu mode='custom' match='exact' check='partial'>
       <model fallback='allow'>EPYC-IBPB</model>
       <topology sockets='1' cores='8' threads='2'/>
       <feature policy='require' name='topoext'/>
     </cpu>
     That's the XML, and yes, it's unRAID-only stuff. Apparently QEMU has patches for 3.0 that fix a lot of this and make SMT work properly for Threadripper, but I don't think we can apply them on unRAID. We will have to wait for Unraid to apply them; the problem is that some aren't "official" while others are. If you're editing the XML by hand, see the sketch below.
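     If you're adding the block by hand rather than through the unRAID form view, a minimal sketch (the VM name is just a placeholder):

         # Open the libvirt domain XML and replace the existing <cpu> block
         virsh edit Windows10
         # Shut the VM down fully and start it again so the new definition takes effect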
  3. I found out that the cache is reported wrong on Threadripper, and if you tell it to emulate an EPYC CPU the cache all passes through correctly and BAM, no more lag or stuttering or anything. 'Tis amazing. See the post below, but I'm hoping QEMU or Limetech will apply the patch I posted, since I don't know how, so I can cross NUMA again. For now I get a substantial FPS increase and all my cache levels are very close to bare metal. If you try to cross NUMA to the other die with a memory controller, though, it doesn't pass the NUMA info to the VM, so it gets poor memory performance.
  4. Would like to add that I no longer notice when people are using my Plex server. I hope this patch gets added in, or QEMU gets updated, so I can span NUMA nodes, but damn, it's great so far. Can't believe what the cache does for me.
  5. How about your VM? Does it feel any better?
  6. Don't know why it wouldn't start, but since you have your VMs split between the nodes, it looks like it's using the other node's RAM. Still strange though.
  7. That's strange, it should have started then. Run sudo dmidecode -t memory | grep -i size and post the output; a sketch of what to look for is below.
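     A sketch of that check:

         # List the size reported for each memory slot (needs root)
         sudo dmidecode -t memory | grep -i size
         # Populated slots show a size such as "Size: 16384 MB"; empty slots
         # typically show "Size: No Module Installed"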
  8. How much RAM do you have?
  9. Change qemu priority

     renice -20 -p <process id>

     The -20 is the highest priority possible; after -p goes the process id. It's what I do, but others might know better. A fuller example is below.
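     A minimal sketch of how I'd wire that up, assuming the VM's name shows up on the QEMU command line (the name here is a placeholder):

         # Find the QEMU process for the VM and bump it to the highest priority
         pid=$(pgrep -f 'qemu.*guest=Windows10' | head -n1)
         sudo renice -n -20 -p "$pid"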
  10. So the "nodeset" setting tells it where to pull memory from. It should only pull memory from the node its die is assigned to, but it always pulls a little RAM from the other side. 64081 is my main gaming VM. Also, is your RAM populated across at least 4 DIMMs? Here's my per-node view (the command is sketched below):

     Per-node process memory usage (in MBs)
     PID                Node 0  Node 1  Node 2  Node 3   Total
     -----------------  ------  ------  ------  ------  ------
     52912 (qemu-syst        4       0    1510       0    1514
     64081 (qemu-syst    26427       0    1833       0   28260
     -----------------  ------  ------  ------  ------  ------
     Total               26431       0    3343       0   29773
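     That table comes from numastat; a sketch of the command, assuming the numactl package is installed on the host:

         # Per-node memory usage, in MB, for every process matching "qemu"
         numastat -c qemu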
  11. So I probably should have set that to "strict" instead of "interleave", since "interleave" is for when I am spanning two NUMA nodes. Something like the command below would switch it.
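     A minimal sketch of making that change without hand-editing the XML (the VM name is a placeholder):

         # Pin the VM's memory allocation strictly to NUMA node 0 in the saved config
         virsh numatune Windows10 --mode strict --nodeset 0 --config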
  12. The biggest change for me was L3. The memory latency was something in the VM itself plus some tweaks (I purged the whole VM). L3 saw a 6x decrease in latency. L1 for me was 3ns and is now 1ns, and that's massive for L1 since it's used non-stop, but the biggest thing is that the cache is properly allocated. Before, if you look, I had 5x16 on my L3, which is literally impossible, and L1 was 2x larger than it was supposed to be and only 2-way instead of 8-way.

     Also, with this change you can't span NUMA, so if you have any processors allocated outside a single NUMA node it will cause more memory latency. If we can get it to identify Threadripper properly, then we can do the NUMA cross and be fine. Should also state that I have a 2990WX and I changed my memory from "auto" to channel in the BIOS, which provided a large performance increase as well: "In my case, under Advanced -> CBS -> DF there is a Memory Model item that has choices like auto | distribute | channel. In our case, we are interested in channel mode, as it will expose NUMA information to the host once more."

     My pinning and NUMA tuning (host CPUs 1-7 plus their SMT siblings on node 0; see the host-topology check below):

     <vcpu placement='static'>14</vcpu>
     <cputune>
       <vcpupin vcpu='0' cpuset='1'/>
       <vcpupin vcpu='1' cpuset='33'/>
       <vcpupin vcpu='2' cpuset='2'/>
       <vcpupin vcpu='3' cpuset='34'/>
       <vcpupin vcpu='4' cpuset='3'/>
       <vcpupin vcpu='5' cpuset='35'/>
       <vcpupin vcpu='6' cpuset='4'/>
       <vcpupin vcpu='7' cpuset='36'/>
       <vcpupin vcpu='8' cpuset='5'/>
       <vcpupin vcpu='9' cpuset='37'/>
       <vcpupin vcpu='10' cpuset='6'/>
       <vcpupin vcpu='11' cpuset='38'/>
       <vcpupin vcpu='12' cpuset='7'/>
       <vcpupin vcpu='13' cpuset='39'/>
       <emulatorpin cpuset='1-7'/>
     </cputune>
     <numatune>
       <memory mode='interleave' nodeset='0'/>
     </numatune>
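     To pick cpuset values like these on your own box, a quick way to see which core and NUMA node each host CPU belongs to:

         # One row per logical CPU, so core/SMT-sibling pairs on a single node are easy to spot
         lscpu -e=CPU,CORE,NODE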
  13. A person on Reddit told me the answer to my problem. If you do the below, QEMU presents an EPYC CPU instead and all the cache is right. It dropped latency across the board: L3 is down to 13ns, L1 to 1ns, L2 to 2-3ns. The machine seems MUCH more responsive. Should also mention they said they had updated their kernel, and a patch in QEMU made theirs see it properly without this code, so hopefully we will see it on unRAID as well.

     <cpu mode='custom' match='exact' check='partial'>
       <model fallback='allow'>EPYC-IBPB</model>
       <topology sockets='1' cores='8' threads='2'/>
       <feature policy='require' name='topoext'/>
     </cpu>

     To give an idea of how much of a change: if you're wondering why the read/write/copy of the cache levels is "higher" on the old setup, it's because I had cores bound from multiple NUMA nodes in an attempt to make things faster. The new setup is NUMA node 0 with cores 1-7 and their SMT siblings bound.

     Edit: An update to this is that it decreased latency across the board, but you cannot have multiple NUMA nodes. So far, specifying the NUMA topology fails and the OS is unaware when you cross NUMA. Working on that part, since once I fix it I will be able to raise my read/write/copy speeds a lot.

     Edit 2: I can report substantial FPS increases in my games, and the stutter has thus far been eliminated.

     [Screenshots: old vs. new results, physical bare-metal performance, CPU-Z bare metal, virtual old, virtual new]

     Another user posted this to me: "So the QEMU code does not allow the passing of the cache topology. Here is a patch for 3.0.0 that allows cache topology to be passed through:"

     diff --git a/target/i386/cpu.c b/target/i386/cpu.c
     index 723e02221e..2912867872 100644
     --- a/target/i386/cpu.c
     +++ b/target/i386/cpu.c
     @@ -4221,6 +4221,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count,
              break;
          case 0x8000001D:
              *eax = 0;
     +        if (cpu->cache_info_passthrough) {
     +            host_cpuid(index, count, eax, ebx, ecx, edx);
     +            break;
     +        }
              switch (count) {
              case 0: /* L1 dcache info */
                  encode_cache_cpuid8000001d(env->cache_info_amd.l1d_cache, cs,

     Use at your own risk. (A rough sketch of rebuilding QEMU with a patch like this is below.)
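     For completeness, a rough sketch of applying a patch like that and rebuilding QEMU 3.0.0 from source (the patch filename is a placeholder, and this isn't something stock unRAID gives you a way to do):

         # From an unpacked QEMU 3.0.0 source tree, with the patch saved alongside it
         cd qemu-3.0.0
         patch -p1 < cache-passthrough.patch
         ./configure --target-list=x86_64-softmmu
         make -j"$(nproc)"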
  14. Even if I don't 100% trust CPU-Z, or if I ignore all my experience with Hyper-V and VMware (over 10 years and going now) where it has been accurate, even Linux reads the wrong cache sizes with lscpu (see the command sketch below).

     Ubuntu VM:
     L1d cache: 64K
     L1i cache: 64K
     L2 cache:  512K
     L3 cache:  16384K

     The L1d cache is wrong and the L3 cache is double the size it should be.

     Unraid host:
     L1d cache: 32K
     L1i cache: 64K
     L2 cache:  512K
     L3 cache:  8192K
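     Those numbers come from lscpu; a quick way to pull just the cache lines on either side for comparison:

         # Run inside the guest and on the host; the sizes should match when the
         # cache topology is passed through correctly
         lscpu | grep -i cache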