• [6.6.0-rc2] VM memory allocation across NUMA boundaries


    Rhynri
    • Closed Minor

    So, after getting into RC2, I was trying to optimize my pinning using the new interface.  I looked up my NUMA boundaries in the process: 

     

    numactl --hardware
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
    node 0 size: 32040 MB
    node 0 free: 256 MB # <<< Make note of this value
    node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
    node 1 size: 32243 MB
    node 1 free: 19974 MB
    node distances:
    node   0   1
      0:  10  16
      1:  16  10

    I was running two VMs when I captured this output.

     

    VM 1: 16gb RAM, CPUs 4-7, 20-23 (So, numa node 0, in CPU pairs)

    VM 2: 16gb RAM, CPUs 8-11, 24-27 (Numa node 1)
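    For reference, this pinning is expressed with libvirt's <cputune>. A minimal sketch for VM 1, with cpuset values matching the core/thread pairs above:

    <cputune>
      <vcpupin vcpu='0' cpuset='4'/>
      <vcpupin vcpu='1' cpuset='20'/>
      <vcpupin vcpu='2' cpuset='5'/>
      <vcpupin vcpu='3' cpuset='21'/>
      <vcpupin vcpu='4' cpuset='6'/>
      <vcpupin vcpu='5' cpuset='22'/>
      <vcpupin vcpu='6' cpuset='7'/>
      <vcpupin vcpu='7' cpuset='23'/>
    </cputune>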

     

    But as you'll note, all the RAM is being allocated to node 0.  Uh oh. Let's check:

    numastat qemu
    
    Per-node process memory usage (in MBs)
    PID                               Node 0          Node 1           Total
    -----------------------  --------------- --------------- ---------------
    13479 (qemu-system-x86)         16473.43            0.25        16473.68
    27148 (qemu-system-x86)         13259.18         3204.48        16463.66
    -----------------------  --------------- --------------- ---------------
    Total                           29732.60         3204.74        32937.34

    Well, crap.  That's no good.  I then tried to force it using the <numatune> tags.  This works fine for VM 1, which sits completely in its own node, but VM 2 takes forever to start up, because libvirt tries to force the second qemu instance onto node 1 (where it should be) and you get a bunch of NUMA misses when the memory gets allocated to node 0 anyway.  This can also cause NVRAM corruption in combination with other NUMA optimizations and XML configuration settings, though I can't remember exactly which one borked the VM so badly that I had to restore the .img file, NVRAM and XML to get the Nvidia drivers working again.
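    For reference, the <numatune> element in question looks like this (a minimal sketch; the nodeset value is illustrative, here binding VM 2 to node 1):

    <numatune>
      <memory mode='strict' nodeset='1'/>
    </numatune>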

     

    I imagine this will be extra important for 2990WX users, since two of the four dies have significantly better memory access than the others, and you'd want to keep VMs neatly in line with these boundaries for optimum performance.

     

    Obviously we don't want this boundary crossing to happen on other processors (like my 1950X) for performance reasons as well.

    Bonus bug: the WebTerminal is really slow in this release once there is some text in the window, compared to the last release.

     

    Bonus question: Any chance of getting 'numad' baked in so we can use "auto" in numatune?





    Recommended Comments

    Any suggestion on how to fix this?  Also, if you can point to the 'numad' source, that would be helpful.

     

    What do you mean by slow WebTerminal? It seems to function the same as always for me.

    Link to comment

    NUMA daemon source

     

    As for the webterminal, once it has enough text to get a decent scrollback, the scrolling gets choppy and the typing lags a little. I do use a fairly old MacBook Air and Chrome to access Unraid, but it's not something I noticed last build.  It's possible it's just that machine being goofy too.

     

    I haven’t had time to research the issue fully, but I’ll look into it tomorrow and let you know if I find any suggestions. 

    Link to comment

    I've been looking into this, and I think it may have something to do with which NUMA node the GPU is on. I was able to force correct NUMA allocations by changing the memory size of my node0 VM to neatly fill the available memory on that node, then booting the remaining two, but that results in a super lopsided memory allocation (28/16/8 GB), and it's a very manual process.

     

    I'm going to be asking around the VFIO community to see if there is anything I've been overlooking.

     

    I've been trying to install hwloc (slackbuild link) into unraid so I can have access to the very useful

    lstopo

    which would let me know which node(s) my PCIe devices are on.  I keep running into compilation issues, however, so I'm going to keep working on that.  The lstopo output on its own would be very useful to have on the Tools page, as it gives you a very good idea of which devices are nested for passthrough... it's arguably as useful as anything on the [Tools]>[System Devices] page in terms of passthrough usage.  I've also attached an image of what the lstopo GUI output looks like.

     

    Example (not my system):

    # lstopo
    Machine (256GB)
      NUMANode L#0 (P#0 128GB)
        Socket L#0 + L3 L#0 (20MB)
          L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
          L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#2)
          L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#4)
          L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#6)
          L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#8)
          L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#10)
          L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#12)
          L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#14)
        HostBridge L#0
          PCIBridge
            PCI 1000:005d
              Block L#0 "sda"
          PCIBridge
            PCI 14e4:16a1
              Net L#1 "eth0"
            PCI 14e4:16a1
              Net L#2 "eth1"
            PCI 14e4:16a1
              Net L#3 "eth2"
            PCI 14e4:16a1
              Net L#4 "eth3"
          PCI 8086:8d62
          PCIBridge
            PCIBridge
              PCIBridge
                PCIBridge
                  PCI 102b:0534
          PCI 8086:8d02
            Block L#5 "sr0"
      NUMANode L#1 (P#1 128GB)
        Socket L#1 + L3 L#1 (20MB)
          L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#1)
          L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#3)
          L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#5)
          L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#7)
          L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#9)
          L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#11)
          L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#13)
          L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
        HostBridge L#7
          PCIBridge
            PCI 15b3:1003
              Net L#6 "eth4"
              Net L#7 "eth5"

     

    [attached image: lstopo GUI output]

    Link to comment
    4 hours ago, testdasi said:

    @Rhynri: so does <numatune> work at all?

    It looks like it's trying to work.  It slows startup down significantly and makes the NUMA misses skyrocket. I've since discovered that only one of my VMs behaves this way.  I'm wondering if I can move that one to the other node it keeps trying to allocate memory on and see if that fixes the issue.  Does anyone know if it matters which cores are isolated?  Say, if I want to move my isolated cores to the beginning (0-11 physical) instead of the end (4-15 physical), does Unraid care at all?
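    For context, core isolation is set with the isolcpus kernel parameter in syslinux.cfg. A sketch of the two layouts, assuming threads 16-31 are the SMT siblings of cores 0-15 as in the numactl output above:

    # current: isolate physical cores 4-15 plus their SMT siblings
    append isolcpus=4-15,20-31 initrd=/bzroot

    # alternative: isolate cores 0-11 plus their siblings instead
    append isolcpus=0-11,16-27 initrd=/bzroot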

    Link to comment

    I wrote a rather in-depth reply then accidentally deleted it and there is no undelete. 

     

    Suffice it to say, moving the VM to the other NUMA node reduced the incidence of the problem and improved the rendering performance of the VM in question.  It's still not gone, but I think a lot of the remaining NUMA misses are related to Unraid caching things, which is hardly a priority operation:

     

    numastat
                               node0           node1
    numa_hit              2773556844      1684914320
    numa_miss                6233397       193845232
    numa_foreign           193845232         6233397
    interleave_hit             84430           84643
    local_node            2773481539      1684881326
    other_node               6308702       193878226

    The output above is after 8 days of uptime.  Starting from a clean boot and watching numastat while booting the two important VMs yields very few numa_miss events relative to the previous configuration.
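    A simple way to check this yourself is to snapshot the counters around a VM boot (a sketch; the VM name and temp paths are hypothetical):

    numastat > /tmp/numa.before
    virsh start "Windows 10"        # or start the VM from the webUI
    numastat > /tmp/numa.after
    diff /tmp/numa.before /tmp/numa.after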

     

    @limetech - If you could please include lstopo in a future release, I'd greatly appreciate it.  I linked a Slackware build for hwloc in a previous post in this thread if that helps.  There are a few BIOS settings relating to IOMMU allocation in relation to the CCXs on Threadripper, and I'd like to do some A/B testing with lstopo to see what difference, if any, they make.  As I mentioned in that reply, it would also potentially be a useful addition to the System Devices page.  Please and thank you for your time and effort in making Unraid OS awesome.

    Link to comment

    FYI: The 'hwloc' package which includes 'lstopo' command is included in Unraid OS 6.6.1 but only available in Desktop GUI boot mode.
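    Once booted into GUI mode, the topology can also be dumped from a terminal (a sketch, assuming the build ships the usual hwloc output backends):

    lstopo-no-graphics                 # plain-text topology tree in the console
    lstopo /boot/config/topology.png   # write a PNG to the flash drive (needs graphical output support)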

    Link to comment

    Thank you very much for this.  I completely understand if it's only available in GUI-boot.  Just gives me an excuse to go see the GUI!  Hopefully other people find it useful as well.

    Link to comment
    On 9/30/2018 at 9:42 AM, Rhynri said:

    Thank you very much for this.  I completely understand if it's only available in GUI-boot.  Just gives me an excuse to go see the GUI!  Hopefully other people find it useful as well.

    Hey, I was wondering if you ever figured anything out? I've been fighting latency, lag, and other problems constantly with this 2990WX. I've been trying numatune and many other things, and I can't seem to wipe the latency out. Plus my CPU single-threaded score can't even beat a stock 1700, even though I have watercooling and PBO level 3 enabled: only 68% of a 1700's single-threaded score. I've gone through so many iterations of trying things that I'm starting to think I'm going in circles. It doesn't help that the way SMT works on Zenith boards appears entirely different from others.

    Link to comment

    By the way, I second the numad request. Red Hat's documentation shows it's quite an amazing little program that makes creating a VM across multiple NUMA nodes much easier: you simply tell it "auto" and it goes out and determines the best place to put the VM based on the CPUs you gave it.

    Link to comment

    Yeah, having access to numad would be great. I have a dual-CPU system and I have horrible lag in a lot of things.

    I can see now that my VM's memory is being split between the memory pools/NUMA nodes.

     

    @limetech?

     

    [screenshot: VM memory split across both NUMA nodes]

     

    If I go by the instructions at https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html-single/virtualization_tuning_and_optimization_guide/index and add 

    <numatune>
    	<memory mode='strict' placement='auto'/>
    </numatune>

    or

    <numatune>
    	<memory mode='strict' nodeset='1'/>
    </numatune>

    to my XML file, I get this error:

    [screenshot: libvirt error message]

    Link to comment

    "placement='auto'" uses NUMAD which isn't available yet in Unraid. 

     

    <memory mode='strict' nodeset='1'/>
    or
    <memory mode='preferred' nodeset='1'/>

    Should work in theory. I have a VM set to strict, using RAM from node 1 only, and "numastat -c qemu" shows that with this setting it only uses RAM from node 0. Weird.

     

    Maybe it starts counting the nodes at 1?

    Nope!

    With "nodeset='2'" it complains that there isn't a node 2. 

     

    Link to comment

    I was just looking for this info; I noticed really bad RAM performance on my dual-Xeon server. Running numactl --hardware and shutting the machine down showed that all the RAM was allocated from NUMA node 0 while all my CPU cores were pinned to NUMA node 1. I added...
     

    <numatune>
        <memory mode='strict' nodeset='1'/>
      </numatune>

    and it almost worked: it took all the RAM from node 1 down to 2.5GB and then pinched the remaining 6GB from node 0. I suppose that's better than not starting at all. Since my main VM uses the entire second CPU [I have isolated that CPU from Unraid], can I persuade Unraid to keep away from that RAM? It seems to use it up pretty fast from boot.
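    One way to keep the host off that RAM (a sketch, assuming 2MB hugepages; the page count is illustrative) is to reserve the VM's memory as hugepages on node 1 before the host's page cache grows into it:

    # reserve 8GB of 2MB hugepages on node 1 (4096 pages x 2MB)
    echo 4096 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

    and then back the guest with hugepages in its XML:

    <memoryBacking>
      <hugepages/>
    </memoryBacking>

    The trade-off is that the reservation stays pinned and unavailable to the host even while the VM is off.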

    Link to comment

    Hi all,

     

    I'm a newbie in the Linux world; my first contact with Linux was with Unraid...

    Looking for a virtualization platform, I assembled a server with a Threadripper 1950X, 4x16GB RAM, a GeForce GTX 1080 for the main VM (with screen, keyboard and mouse) and two GeForce GTX 1050s for secondary VMs with remote access (one is passed through to the same VM all the time, and the other is frequently passed through to whichever other VM I want to use).

    I have 2 NVMe SSDs for cache and 3 HDDs for the array; the main VM runs from a passed-through NVMe SSD while the other VMs run from another (unassigned) NVMe SSD.

    After playing around for some days, tweaking my Unraid while learning about Linux architecture, virtualization, networking, and NUMA, my platform became more and more stable; every issue was sorted out one by one, and today I can say something about Threadripper and NUMA.

    I've done it every possible way and failed several times... then I started to understand that my RAM allocations were spilling into the wrong node, causing my latency to spike.

    Today I have all my dockers waiting to start AFTER all the VMs are started... and that made a huge difference, even with CPUs 4-31 isolated. 

    The memory mode='strict' setting is tricky because it only works right if you match the right cores with their respective memory channel: each NUMA node has 2 memory channels, and each channel maps to 4 cores (8 threads)...

    For that reason, in 'strict' mode my VM performance was only right if I set it up to use 4 cores with <16GB RAM or 8 cores with <32GB RAM; if the RAM values were exceeded, the latency would spike.

    The GPU placement will only matter if all the cores of your VM are located on the wrong NUMA node...

    If the VM has cores from 2 memory channels in the same node, you get double the memory bandwidth with almost the same latency as single-channel placement, as long as they stay in the same node.

    So... for my main VM I use 8 cores and 16GB RAM in interleave mode; I get good latency and good memory bandwidth.

    For secondary VMs I use 4 cores and 8GB RAM in strict mode; the latency is even better, but the memory bandwidth is half, as expected.
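    A minimal sketch of what such a secondary VM's libvirt config could look like (core numbers and nodeset are illustrative; match them to your own numactl --hardware output):

    <memory unit='GiB'>8</memory>
    <vcpu placement='static'>8</vcpu>
    <numatune>
      <memory mode='strict' nodeset='1'/>
    </numatune>
    <cputune>
      <!-- 4 cores plus their SMT siblings, all from one CCX / memory channel -->
      <vcpupin vcpu='0' cpuset='8'/>
      <vcpupin vcpu='1' cpuset='24'/>
      <vcpupin vcpu='2' cpuset='9'/>
      <vcpupin vcpu='3' cpuset='25'/>
      <vcpupin vcpu='4' cpuset='10'/>
      <vcpupin vcpu='5' cpuset='26'/>
      <vcpupin vcpu='6' cpuset='11'/>
      <vcpupin vcpu='7' cpuset='27'/>
    </cputune>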

    I have my server working with S3 sleep (now that I've tweaked it a bit, it works fine), waking from keyboard/mouse or WOL, but the dockers need to be stopped before sleep; that way, when the server wakes up, the sleep plugin makes them wait to start and the VMs get plenty of memory for their needs. 

    Another thing I noticed... every time I transfer large amounts of data between network and cache, or cache and array, I get NUMA spills and the available RAM on each node gets crippled.

    That makes the latency spike too.

    For that reason I tweaked the mover script to drop caches after the move process ends; that way the RAM gets freed again and the VMs can use it.
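    Something along these lines at the end of the mover run (a sketch; echoing 1 drops only the page cache, 3 would also drop dentries and inodes):

    sync                                  # flush dirty pages first
    echo 1 > /proc/sys/vm/drop_caches     # release clean page cache back to the nodes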

    Another weird event: every time the server wakes from S3 sleep, the CPU clock goes nuts... and that causes bad performance in the VMs too.

    To solve that problem I added the command "echo 1 > /sys/module/processor/parameters/ignore_ppc" to the S3 post-run script.

    Forgive my English (I'm from Portugal) and forgive me if I said something inaccurate; I just tried to share my small experience.

    I'm loving Unraid and I've already assembled a secondary backup server (old parts) with Unraid too.

    I would like to recommend an upgrade to the CPU pinning page... add different colours for the CPUs of each memory channel; that way it would be much more intuitive to set up NUMA nodes correctly.

     

    My best regards

    Bruno Gomes 

     

     

    Link to comment

    Wow, I never even considered how the memory channels could play into all of this. Thanks so much for the detailed response. 👍

    Link to comment
    9 hours ago, xrstokes said:

    Wow, I never even considered how the memory channels could play into all of this. Thanks so much for the detailed response. 👍

    Well... It depends on how you set your NUMA mode in the BIOS... You can select none/auto/die/socket/channel... I get the best results when I set it to channel, which means each CPU will talk to its respective memory channel before talking to the other channels. The 1950X has 16 cores (32 threads) per socket, 8 cores (16 threads) per die, and 4 cores (8 threads) per memory channel...

    If you set NUMA to "die", the 2 memory channels from the same die are handled as one channel... The latency rises as data is transferred through the Infinity Fabric to the adjacent memory channel; still, that latency won't be as bad as having the cores of one die talking to the memory of the other die... 

    Link to comment

    To better understand the particularities of NUMA assignment, this diagram helps a lot...

    Every time data travels through the Infinity Fabric, it picks up latency.

    Inter-CCX latency is tolerable.

    Inter-die latency is bad.

    For the 1950X:

    If a VM gets 4 cores from 1 CCX, it gets 1 memory channel and the best latency you can get from Threadripper (69ns).

    If a VM gets 8 cores from 2 CCXs in the same die, it gets 2 memory channels; that means some added latency (73ns) but double the memory bandwidth.

    If a VM gets more than 8 cores, or has cores from 2 different dies, you quadruple the memory bandwidth but the latency gets terrible (130ns).

    If you plan to use cores from 2 different dies in the same VM, you'll get better results by setting your memory mode to AUTO in the BIOS...

    That way the system manages the memory in use and you get bittersweet latency (100ns).

     

    [diagram: AMD Threadripper dies, CCXs and Infinity Fabric links]

     

    For the "interleave", "strict" and "prefered" modes...

    If you select "strict" your VM will only boot after all the memory gets alocated (if you have the ram full than it must dump it before the vm starts) and i'm almost sure that if it cannot get enough memory it wont boot.

    The "interleave" mode and the "prefered" mode are almost the same (interleave uses roundrobin)... they try to get memory from designated node but if they cannot get it they will spill to the other node, increasing latency.

    Sure the strict mode is nicer but then some times the VM cannot boot... 

    When you use strict mode some times the machines get paused and wont resume (as they cannot acess all the alocated memory).
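    For a VM that deliberately spans both nodes, the interleave setting looks something like this (a sketch):

    <numatune>
      <memory mode='interleave' nodeset='0-1'/>
    </numatune>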

     

     

    Link to comment

    I recently switched from disk images to passing through NVMe controllers.  This can drag your VM across nodes if the drive is on a different node from the rest of the VM hardware.  I had a well-behaved VM (memory-wise) that now apportions a bit of its memory across nodes.

    Link to comment
    3 hours ago, Rhynri said:

    I recently switched from disk images to passing through NVMe controllers.  This can drag your VM across nodes if the drive is on a different node from the rest of the VM hardware.  I had a well-behaved VM (memory-wise) that now apportions a bit of its memory across nodes.

    Yeah, no other choice but to redo the CPU pinning. That's why SpaceInvaderOne's topology video is extremely useful.

    Link to comment

    Just saying: majorly interesting thread, and I'd never thought about this. I have a dual-socket Xeon motherboard, and it's interesting to see where this leads and how I can optimise VM memory usage this way myself.

    Thanks for that.

    Link to comment

    I'm still saying there should be some colour differentiation in the CPU pinning menu (based on the NUMA architecture obtained from numactl --hardware) so anyone could understand which CPUs are closest to each memory channel.

    I know it won't be easy to implement, but it would be a wonderful feature in the Threadripper era.

     

    I'm still battling the memory spills caused by data transfers; in my case the memory spills create interference noise on my onboard sound card and cause micro-stutters on my GPU. It only happens after memory spills to the other node, and if I drop caches regularly it doesn't happen, but that would kill the purpose of the cache and could put my cache data at risk...

     

     

     

    Link to comment



