
**VIDEO GUIDE** How To Use lstopo for Better VM Performance on Multi-CPU and Threadripper Systems


SpaceInvaderOne


Hi guys. This video is a tutorial on how to examine the topology of a multi-CPU or Threadripper server that has more than one NUMA node. This is useful because it lets us pin vCPUs to cores in the same NUMA node as the GPU we want to pass through, which gives better performance. The video shows how to download and install hwloc and all of its dependencies using a script and @Squid's great User Scripts plugin. Hope you find it useful :)

 

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

**Note:** If you are using the new 6.6.0-rc versions of Unraid (or above), before running the lstopo command you will need to create a symlink using this command first:

 ln -s /lib64/libudev.so.1 /lib64/libudev.so.0 

**Don't run the above command unless you are on Unraid 6.6.0 or above!**
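If you want to guard against running it twice (or on the wrong version), here is a minimal sketch that only creates the symlink when libudev.so.0 is actually missing:

# Create the compatibility symlink only if the old library name is absent
if [ ! -e /lib64/libudev.so.0 ]; then
    ln -s /lib64/libudev.so.1 /lib64/libudev.so.0
fi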

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

EDIT--

Unraid now has lstopo built in! You will need to boot your server in GUI mode for it to work. Once you are in GUI mode, just open a terminal, run the command, and you are good to go. Much easier than messing with loading it manually like in my video.
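For reference, once you are booted in GUI mode you can render the diagram straight to the flash drive with something like the following (the output path is just an example; lstopo picks the output format from the file extension):

# Write the topology diagram as a PNG to the flash drive
lstopo /boot/topology.png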

 

 

 

Link to comment

Thanks Spaceinvader. Reposting from my comment on the video in case other forum users run into the same issue. Running Unraid 6.6.0-rc3, I get "lstopo: error while loading shared libraries: libudev.so.0: cannot open shared object file: No such file or directory" when trying to generate the PNG. Any thoughts?

Edited by jspence27
Link to comment
5 hours ago, jspence27 said:

Thanks Spaceinvader. Reposting from my comment on the video in case other forum users run into the same issue. Running Unraid 6.6.0-rc3, I get "lstopo: error while loading shared libraries: libudev.so.0: cannot open shared object file: No such file or directory" when trying to generate the PNG. Any thoughts?

Same error on RC4

Link to comment

Thanks @SpaceInvaderOne, the symlink fixed it for me.

 

Why the hell is the first PCIe slot connected to the second die and the third slot to the first die? In the first slot I have a 1050 Ti which is used by a Linux VM that uses some cores from the first die. The 1080 Ti in the third slot is mainly used for a gaming VM and uses all the cores (8-15; 24-31, isolated) on the second die. I wish I could flip a switch in the BIOS to reverse that. I guess there is no chance of such an option, right?

 

topology-Kopie.png

Edited by bastl
Link to comment

Great vid! I was super stoked realising that my first slot is connected to node 2. That means I can still leave unRAID to do unRAID things on node 0 like it prefers, and have my workstation VM isolated on node 2. Now if only I could force it to only use the RAM connected to node 2.

 

On 9/17/2018 at 8:06 AM, bastl said:

Thanks @SpaceInvaderOne, the symlink fixed it for me.

 

Why the hell is the first PCIe slot connected to the second die and the third slot to the first die? In the first slot I have a 1050 Ti which is used by a Linux VM that uses some cores from the first die. The 1080 Ti in the third slot is mainly used for a gaming VM and uses all the cores (8-15; 24-31, isolated) on the second die. I wish I could flip a switch in the BIOS to reverse that. I guess there is no chance of such an option, right?

You can always flip the cards and pass them through via the vBIOS method. For a gaming VM, it's better to lock everything to the same node (although you are running in UMA, not NUMA, so I'm not sure how much benefit it's going to get).
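If you'd rather not edit the VM's XML, pinning can also be checked or changed at runtime with virsh; a quick sketch, assuming a domain named "GamingVM" and host core 8 on the target node (both names and numbers are hypothetical):

# Show the current vCPU-to-host-core pinning for the domain
virsh vcpupin GamingVM

# Pin vCPU 0 of the domain to host core 8 (repeat for each vCPU)
virsh vcpupin GamingVM 0 8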

 

15 hours ago, thenonsense said:

I've been experimenting with numactl and pinning properly, and at least in terms of memory the benefits are immediate. This tool not only helps with memory, but checks PCIe slots as well? Awesome

How do you do that, please?

It's always a "hope and pray" situation for me to get the RAM assigned correctly.

Now that I have isolated the entire node 2 for my VM, it allocates 99% of my RAM to node 2, which is good (enough) - but ONLY after a clean reboot.

The longer I wait after the reboot, the more it allocates to node 0 (sometimes it goes nuts and allocates 100% to node 0, even though there's nothing else running). I don't know how to do it without rebooting.
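For what it's worth, you can at least watch where the RAM is landing; a small sketch using numactl/numastat (these tools may need to be installed separately on Unraid - that part is an assumption):

# Show free and total memory per NUMA node
numactl --hardware

# Per-node memory usage of the running VM's QEMU process
# (assumes a single qemu-system-x86_64 process is running)
numastat -p $(pidof qemu-system-x86_64)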

 

 

 

 

Link to comment

@testdasi

Yesterday I reduced my gaming VM to 6 cores + 6 threads on node 1 with all cores isolated, and did a couple of benchmarks without running anything else on that die. Then I switched all my Dockers and other VMs from node 0 to node 1, isolated the last 6 of the 8 cores and their threads on node 0 from Unraid, and switched the gaming VM over to node 0, where my 1080 Ti should still be attached (if lstopo is correct). I haven't flipped the cards around yet, because for now I don't need any vBIOS to pass through. The performance is basically the same, except for small stutters/hiccups and sound bugs every 30-40 seconds. Every game I tested (BF5, Far Cry 5, DayZ, Superposition + Heaven benchmark) gave me nearly the same performance as on node 1, plus that weird stuttering. I don't know exactly why. I never had that issue when I isolated the second die and used only those cores. This brings me back to my initial idea that maybe the BIOS is reporting the core pairings wrong to the OS. Why should I get stutters when the GPU is connected to the cores being used directly, and no stutters across the Infinity Fabric? Weird!

 

I didn't retest in NUMA mode. I did that before, and as long as I don't mix up the dies for one VM it makes no difference in gaming performance. Using UMA mode showed me in my tests that I get higher memory bandwidth with no real performance loss.

Link to comment

Just tried this, but sadly my performance seems to be a lot worse when only using CPU cores from the NUMA node where my GPU and M.2 device are connected. FPS in Dying Light dropped from 100 to barely 60, with a lot more stuttering and hitching. Very strange. My board is kind of stupid as well, since it appears to map the CPUs differently than other people's boards on this forum with a 2990WX.

 

I should say that previously I was only mapping odd cores and leaving all the even cores out. For some reason that improves performance even when it crosses dies.

 

The second die contains my M.2 drive and 1070. The other die has all the drives attached to it for unRAID.

topology.png

Capture.PNG

Edited by Jerky_san
Link to comment
1 hour ago, Jerky_san said:

Asus Zenith X399

I would hypothesize that your motherboard reports the cores incorrectly as a "display problem", i.e. underneath it's still a 0 + 1 = 1 pair. Hence, when you assign all the odd cores, you are effectively running your VM in non-SMT mode (e.g. 16 non-SMT cores = 16 physical cores > 16 SMT cores = 8 physical cores).

I made the above educated guess based on the fact that I see exactly the same result, i.e. assigning odd cores to the VM gives me better performance.

 

I think we'll need to have the right expectation. This is not a magical cure.

 

Assigning all cores within the same die only improves latency (assuming memory is also allocated on the same NUMA node). The processing power of a core (+ its SMT sister) doesn't change.

  • If your workload scales well with multiple cores (e.g. transcoding), having more cores spread across multiple dies will almost always help more than having fewer (physical) cores trying to ring-fence the process within the same node.
  • If your workload doesn't scale well and/or is bottlenecked by a passed-through device (e.g. gaming), then you will benefit from more consistent performance (e.g. less stuttering, i.e. more consistent fps) by ring-fencing the node with the connected device.

What we lacked previously was the ability to make educated tuning decisions for the 2nd use case.
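One way to verify the real SMT pairings, independent of how the BIOS labels the cores, is to ask the kernel directly; a minimal sketch:

# For each CPU, print which hardware threads share its physical core.
# On a 0+1-paired layout you would see "0-1", "2-3", and so on.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    echo "$cpu: $(cat $cpu/topology/thread_siblings_list)"
done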

Link to comment
  • 2 weeks later...

Awesome video. I'd like to note that through "independent research" I found hwloc/lstopo included with the GUI boot in Unraid 6.6.1. So that's another option, requiring about the same number of reboots as the script method, i.e. reboot into GUI mode, take a snapshot, reboot back to CLI. Of course, if you run the GUI all the time, this is just a bonus for you.

 

Also, here is a labeled version of the Asus x399 ZE board in NUMA mode.  Enjoy, and thanks @SpaceInvaderOne!

(Note: this is with all M.2 slots and the U.2 4x/PCIe 4x split enabled, with media installed. Slot numbers count full-length slots in order of physical closeness to the CPU socket... so top down for most installs.)

Asus-ZE_LSTOPO_Labled.png

Edited by Rhynri (Clarification)
Link to comment
On 9/19/2018 at 8:51 AM, testdasi said:

I would hypothesize that your motherboard reports the cores incorrectly as a "display problem", i.e. underneath it's still a 0 + 1 = 1 pair. Hence, when you assign all the odd cores, you are effectively running your VM in non-SMT mode (e.g. 16 non-SMT cores = 16 physical cores > 16 SMT cores = 8 physical cores).

I made the above educated guess based on the fact that I see exactly the same result, i.e. assigning odd cores to the VM gives me better performance.

 

I think we'll need to have the right expectation. This is not a magical cure.

 

Assigning all cores within the same die only improves latency (assuming memory is also allocated on the same NUMA node). The processing power of a core (+ its SMT sister) doesn't change.

  • If your workload scales well with multiple cores (e.g. transcoding), having more cores spread across multiple dies will almost always help more than having fewer (physical) cores trying to ring-fence the process within the same node.
  • If your workload doesn't scale well and/or is bottlenecked by a passed-through device (e.g. gaming), then you will benefit from more consistent performance (e.g. less stuttering, i.e. more consistent fps) by ring-fencing the node with the connected device.

What we lacked previously was the ability to make educated tuning decisions for the 2nd use case.

I turned off SMT for an experiment. The even/odd cores when SMT is off are physical cores, but they are very weirdly broken up. There is a BIOS programmer over on the overclockers' forum who works on my board, so I asked him if he'd look at it, but so far not much of an answer.

topology1.png

Edited by Jerky_san
Link to comment
  • 2 weeks later...

I just need some advice on how to separate the GPUs so that each is on a separate CPU. As I understand it, everything has landed on NUMANode P0.

I have 2 VMs and two GPUs.

On the other hand, I looked at the mainboard manual and I see this:

54 PCI Express 3.0 x16 Slot (PCIE2, Blue) from CPU_BSP1
56 PCI Express 3.0 x16 Slot (PCIE3, Blue) from CPU_BSP1 
57 PCI Express 3.0 x16 Slot (PCIE4, Blue) from CPU_BSP1 
58 PCI Express 3.0 x16 Slot (PCIE5, Blue) from CPU_AP1 
60 PCI Express 3.0 x 4 Slot (PCIE6, White) from CPU_BSP1

Am I correct in saying that CPU_AP1 has only one PCI Express bus?
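(Going by that table, yes - only PCIE5 hangs off CPU_AP1.) One way to double-check which node a given card is attached to is to ask the kernel; a sketch (the PCI address 0000:41:00.0 is just a placeholder for your GPU's address from lspci):

# Find your GPUs' PCI addresses
lspci | grep -i vga

# Ask which NUMA node the device sits on (-1 means unknown/single node)
cat /sys/bus/pci/devices/0000:41:00.0/numa_node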

topo.png

 

And I did some PCI card shuffling; now it looks like this. The main VM is on NUMANode 0. Just curious what effect all the other devices, like eth etc., have on performance.

 

 

topo1.png

Edited by mucflyer
Link to comment

I keep getting

 

root@Tower:~# lstopo /media/user/downloads/topo.png
Failed to open /media/user/downloads/topo.png for writing (No such file or directory)

 

but it does exist - the directory, that is. I dunno wtf I'm doing

 

This is after running "ln -s /lib64/libudev.so.1 /lib64/libudev.so.0".

 

Edit: just running "lstopo" spits out

root@Tower:~# lstopo
Machine (16GB)
  Package L#0
    L3 L#0 (8192KB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (64KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (64KB) + Core L#2
        PU L#4 (P#4)
        PU L#5 (P#5)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (64KB) + Core L#3
        PU L#6 (P#6)
        PU L#7 (P#7)
    L3 L#1 (8192KB)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (64KB) + Core L#4
        PU L#8 (P#8)
        PU L#9 (P#9)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (64KB) + Core L#5
        PU L#10 (P#10)
        PU L#11 (P#11)
      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6
        PU L#12 (P#12)
        PU L#13 (P#13)
      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7
        PU L#14 (P#14)
        PU L#15 (P#15)
  HostBridge L#0
    PCIBridge
      PCI 144d:a808
    PCIBridge
      PCI 1022:43c8
        Block(Disk) L#0 "sdb"
        Block(Disk) L#1 "sdg"
        Block(Disk) L#2 "sde"
        Block(Disk) L#3 "sdc"
        Block(Disk) L#4 "sdf"
        Block(Disk) L#5 "sdd"
      PCIBridge
        PCIBridge
          PCI 8086:1539
            Net L#6 "eth0"
    PCIBridge
      PCI 10de:1c03
    PCIBridge
      PCI 1022:7901

so it kinda works I guess.

 

Edit edit: calling it top.png works, wtf.

root@Tower:~# lstopo /mnt/user/appdata/top.png
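For anyone hitting the same thing: Unraid mounts user shares under /mnt/user, not /media/user, which is most likely why the second command succeeds - the path, not the filename, was the problem. The equivalent of the original attempt would be (assuming a "downloads" share exists):

lstopo /mnt/user/downloads/topo.png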

 

Edited by BomB191 (extra stuff)
Link to comment
  • 4 weeks later...

The "16 - 16" etc. isn't really telling you the slot speed, but what it is currently running at. To get the correct speed you really need to put a load onto the card.

It's to do with the power saving of the GPU when it's not being used.

It's quite easy to see this in a Windows VM using GPU-Z. You will see the speed the card can run at under bus speed. Hover over that and it will tell you the speed the card is currently running at. Then to the right, if you click on the question mark, you get the option to put some stress on the GPU and you will see the number change.
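If you want to see the same thing from the Unraid side, the kernel exposes the live and maximum link speeds; a sketch (the PCI address is a placeholder for your GPU's):

# Negotiated link status straight from lspci
lspci -s 01:00.0 -vv | grep LnkSta

# Or via sysfs: current vs. maximum link speed
cat /sys/bus/pci/devices/0000:01:00.0/current_link_speed
cat /sys/bus/pci/devices/0000:01:00.0/max_link_speed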

 

Link to comment
  • 5 months later...
