
**VIDEO GUIDE** How To Use lstopo for Better VM Performance on Multi-CPU and Threadripper Systems


SpaceInvaderOne


Hi guys. This video is a tutorial on how to examine the topology of a multi-CPU or Threadripper server that has more than one NUMA node. This is useful because it lets us pin vCPUs to cores in the same NUMA node as the GPU we want to pass through, which gives better performance. The video shows how to download and install hwloc and all of its dependencies using a script and @Squid's great User Scripts plugin. Hope you find it useful :)

 

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

**Note:** If you are using the new 6.6.0-rc versions of Unraid (or above), before running the lstopo command you will need to create a symlink using this command first:

 ln -s /lib64/libudev.so.1 /lib64/libudev.so.0 

**Don't run the above command unless you are on Unraid 6.6.0 or above!**
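If you want to guard against running it twice (or on the wrong version), here is a minimal sketch that only creates the symlink when libudev.so.0 is actually missing:

# Create the compatibility symlink only if the old library name is absent
if [ ! -e /lib64/libudev.so.0 ]; then
    ln -s /lib64/libudev.so.1 /lib64/libudev.so.0
fi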

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

EDIT--

Unraid now has lstopo built in! You will need to boot your server in GUI mode for it to work. Once you are in GUI mode, just open a terminal, run the command, and you are good to go. Much easier than messing with loading it manually like in my video.
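For reference, once you are booted in GUI mode you can render the diagram straight to the flash drive with something like the following (the output path is just an example; lstopo picks the output format from the file extension):

# Write the topology diagram as a PNG to the flash drive
lstopo /boot/topology.png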

 

 

 

Link to comment

Thanks Spaceinvader. Reposting from my comment on the video in case other forum users run into the same issue. Running Unraid 6.6.0-rc3, I get "lstopo: error while loading shared libraries: libudev.so.0: cannot open shared object file: No such file or directory" when trying to generate the PNG. Any thoughts?

Edited by jspence27
Link to comment
5 hours ago, jspence27 said:

Thanks Spaceinvader. Reposting from my comment on the video in case other forum users run into the same issue. Running Unraid 6.6.0-rc3, I get "lstopo: error while loading shared libraries: libudev.so.0: cannot open shared object file: No such file or directory" when trying to generate the PNG. Any thoughts?

Same error on RC4

Link to comment

Thanks @SpaceInvaderOne, the symlink fixed it for me.

 

Why the hell is the first PCIe slot connected to the second die and the third slot to the first die? In the first slot I have a 1050 Ti which is used by a Linux VM that uses some cores from the first die. The 1080 Ti in the third slot is mainly used for a gaming VM and uses all the cores (8-15; 24-31, isolated) on the second die. I wish I could flip a switch in the BIOS to reverse that. I guess there is no chance of such an option, right?

 

topology-Kopie.png

Edited by bastl
Link to comment

Great vid! I was super stoked realising that my first slot is connected to node 2. That means I can still leave unRAID to do unRAID things on node 0 like it prefers, and have my workstation VM isolated on node 2. Now if only I could force it to only use the RAM connected to node 2.

 

On 9/17/2018 at 8:06 AM, bastl said:

Thanks @SpaceInvaderOne, the symlink fixed it for me.

 

Why the hell is the first PCIe slot connected to the second die and the third slot to the first die? In the first slot I have a 1050 Ti which is used by a Linux VM that uses some cores from the first die. The 1080 Ti in the third slot is mainly used for a gaming VM and uses all the cores (8-15; 24-31, isolated) on the second die. I wish I could flip a switch in the BIOS to reverse that. I guess there is no chance of such an option, right?

You can always flip the cards and pass them through via the vBIOS method. For a gaming VM, it's better to lock everything to the same node (although you are running in UMA, not NUMA, so I'm not sure how much benefit it's going to get).
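If you'd rather not edit the VM's XML, pinning can also be checked or changed at runtime with virsh; a quick sketch, assuming a domain named "GamingVM" and host core 8 on the target node (both names and numbers are hypothetical):

# Show the current vCPU-to-host-core pinning for the domain
virsh vcpupin GamingVM

# Pin vCPU 0 of the domain to host core 8 (repeat for each vCPU)
virsh vcpupin GamingVM 0 8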

 

15 hours ago, thenonsense said:

I've been experimenting with numactl and pinning properly, and at least in terms of memory the benefits are immediate. This tool not only helps with memory, but checks PCIe slots as well? Awesome

How do you do that, please?

It's always a "hope and pray" situation for me to get the RAM assigned correctly.

Now that I have isolated the entire node 2 for my VM, it allocates 99% of my RAM to node 2, which is good (enough) - but ONLY after a clean reboot.

The longer I wait after the reboot, the more it allocates to node 0 (sometimes it goes nuts and allocates 100% to node 0, even though there's nothing else running). I don't know how to do it without rebooting.
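For what it's worth, you can at least watch where the RAM is landing; a small sketch using numactl/numastat (these tools may need to be installed separately on Unraid - that part is an assumption):

# Show free and total memory per NUMA node
numactl --hardware

# Per-node memory usage of the running VM's QEMU process
# (assumes a single qemu-system-x86_64 process is running)
numastat -p $(pidof qemu-system-x86_64)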

 

 

 

 

Link to comment

@testdasi

Yesterday I reduced my gaming VM to 6 cores + 6 threads on node 1 with all cores isolated, and did a couple of benchmarks without running anything else on that die. Then I switched all my Dockers and other VMs from node 0 to node 1, isolated the last 6 of the 8 cores and their threads on node 0 from Unraid, and switched the gaming VM over to node 0, where my 1080 Ti should still be attached (if lstopo is correct). I haven't flipped the cards around yet, because for now I don't need any vBIOS to pass through. The performance is basically the same, except for small stutters/hiccups and sound bugs every 30-40 seconds. Every game I tested (BF5, Far Cry 5, DayZ, Superposition + Heaven benchmark) gave me nearly the same performance as on node 1, plus that weird stuttering. I don't know exactly why. I never had that issue when I isolated the second die and used only those cores. This brings me back to my initial idea that maybe the BIOS is reporting the core pairings wrong to the OS. Why should I get stutters when the GPU is connected to the cores being used directly, and no stutters across the Infinity Fabric? Weird!

 

I didn't retest in NUMA mode. I did that before, and as long as I don't mix up the dies for one VM it makes no difference in gaming performance. Using UMA mode showed me in my tests that I get higher memory bandwidth with no real performance loss.

Link to comment

Just tried this, but sadly my performance seems to be a lot worse when only using CPU cores from the NUMA node where my GPU and M.2 device are connected. FPS in Dying Light dropped from 100 to barely 60, with a lot more stuttering and hitching. Very strange. My board is kind of stupid as well, since it appears to map the CPUs differently than other people's boards on this forum with a 2990WX.

 

I should say that previously I was only mapping odd cores and leaving all the even cores out. For some reason that improves performance even when it crosses dies.

 

The second die contains my M.2 drive and 1070. The other die has all the drives attached to it for unRAID.

topology.png

Capture.PNG

Edited by Jerky_san
Link to comment
1 hour ago, Jerky_san said:

Asus Zenith X399

I would hypothesize that your motherboard reports the cores incorrectly as a "display problem", i.e. underneath it's still a 0 + 1 = 1 pair. Hence, when you assign all the odd cores, you are effectively running your VM in non-SMT mode (e.g. 16 non-SMT cores = 16 physical cores > 16 SMT cores = 8 physical cores).

I made the above educated guess based on the fact that I see exactly the same result, i.e. assigning odd cores to the VM gives me better performance.

 

I think we'll need to have the right expectation. This is not a magical cure.

 

Assigning all cores within the same die only improves latency (assuming memory is also allocated on the same NUMA node). The processing power of a core (+ its SMT sister) doesn't change.

  • If your workload scales well with multiple cores (e.g. transcoding), having more cores spread across multiple dies will almost always help more than having fewer (physical) cores trying to ring-fence the process within the same node.
  • If your workload doesn't scale well and/or is bottlenecked by a passed-through device (e.g. gaming), then you will benefit from more consistent performance (e.g. less stuttering, i.e. more consistent fps) by ring-fencing the node with the connected device.

What we lacked previously was the ability to make educated tuning decisions for the 2nd use case.
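One way to verify the real SMT pairings, independent of how the BIOS labels the cores, is to ask the kernel directly; a minimal sketch:

# For each CPU, print which hardware threads share its physical core.
# On a 0+1-paired layout you would see "0-1", "2-3", and so on.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    echo "$cpu: $(cat $cpu/topology/thread_siblings_list)"
done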

Link to comment
  • 2 weeks later...

Awesome video. I'd like to note that through "independent research" I found hwloc/lstopo included with the GUI boot in Unraid 6.6.1. So that's another option, requiring about the same number of reboots as the script method, i.e. reboot into GUI mode, take a snapshot, reboot back to CLI. Of course, if you run the GUI all the time, this is just a bonus for you.

 

Also, here is a labeled version of the Asus x399 ZE board in NUMA mode.  Enjoy, and thanks @SpaceInvaderOne!

(Note: this is with all M.2 slots and the U.2 4x/PCIe 4x split enabled, with media installed. Slot numbers count full-length slots in order of physical closeness to the CPU socket... so top down for most installs.)

Asus-ZE_LSTOPO_Labled.png

Edited by Rhynri (Clarification)
Link to comment
On 9/19/2018 at 8:51 AM, testdasi said:

I would hypothesize that your motherboard reports the cores incorrectly as a "display problem", i.e. underneath it's still a 0 + 1 = 1 pair. Hence, when you assign all the odd cores, you are effectively running your VM in non-SMT mode (e.g. 16 non-SMT cores = 16 physical cores > 16 SMT cores = 8 physical cores).

I made the above educated guess based on the fact that I see exactly the same result, i.e. assigning odd cores to the VM gives me better performance.

 

I think we'll need to have the right expectation. This is not a magical cure.

 

Assigning all cores within the same die only improves latency (assuming memory is also allocated on the same NUMA node). The processing power of a core (+ its SMT sister) doesn't change.

  • If your workload scales well with multiple cores (e.g. transcoding), having more cores spread across multiple dies will almost always help more than having fewer (physical) cores trying to ring-fence the process within the same node.
  • If your workload doesn't scale well and/or is bottlenecked by a passed-through device (e.g. gaming), then you will benefit from more consistent performance (e.g. less stuttering, i.e. more consistent fps) by ring-fencing the node with the connected device.

What we lacked previously was the ability to make educated tuning decisions for the 2nd use case.

I turned off SMT for an experiment. The even/odd cores when SMT is off are physical cores, but they are very weirdly broken up. There is a BIOS programmer over on the overclockers' forum who works on my board, so I asked him if he'd look at it, but so far not much of an answer.

topology1.png

Edited by Jerky_san
Link to comment
  • 2 weeks later...

I just need some advice on how to separate the GPUs so that each is on a separate CPU. As I understand it, everything has landed on NUMANode P0.

I have 2 VMs and two GPUs.

On the other hand, I looked at the mainboard manual and I see this:

54 PCI Express 3.0 x16 Slot (PCIE2, Blue) from CPU_BSP1
56 PCI Express 3.0 x16 Slot (PCIE3, Blue) from CPU_BSP1 
57 PCI Express 3.0 x16 Slot (PCIE4, Blue) from CPU_BSP1 
58 PCI Express 3.0 x16 Slot (PCIE5, Blue) from CPU_AP1 
60 PCI Express 3.0 x 4 Slot (PCIE6, White) from CPU_BSP1

Am I correct in saying that CPU_AP1 has only one PCI Express bus?
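(Going by that table, yes - only PCIE5 hangs off CPU_AP1.) One way to double-check which node a given card is attached to is to ask the kernel; a sketch (the PCI address 0000:41:00.0 is just a placeholder for your GPU's address from lspci):

# Find your GPUs' PCI addresses
lspci | grep -i vga

# Ask which NUMA node the device sits on (-1 means unknown/single node)
cat /sys/bus/pci/devices/0000:41:00.0/numa_node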

topo.png

 

And I did some PCI card shuffling; now it looks like this. The main VM is on NUMANode 0. Just curious what effect all the other devices, like eth etc., have on performance.

 

 

topo1.png

Edited by mucflyer
Link to comment

I keep getting

 

root@Tower:~# lstopo /media/user/downloads/topo.png
Failed to open /media/user/downloads/topo.png for writing (No such file or directory)

 

but it does exist - the directory, that is. I dunno wtf I'm doing

 

This is after running "ln -s /lib64/libudev.so.1 /lib64/libudev.so.0".

 

Edit: just running "lstopo" spits out

root@Tower:~# lstopo
Machine (16GB)
  Package L#0
    L3 L#0 (8192KB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (64KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (64KB) + Core L#2
        PU L#4 (P#4)
        PU L#5 (P#5)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (64KB) + Core L#3
        PU L#6 (P#6)
        PU L#7 (P#7)
    L3 L#1 (8192KB)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (64KB) + Core L#4
        PU L#8 (P#8)
        PU L#9 (P#9)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (64KB) + Core L#5
        PU L#10 (P#10)
        PU L#11 (P#11)
      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6
        PU L#12 (P#12)
        PU L#13 (P#13)
      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7
        PU L#14 (P#14)
        PU L#15 (P#15)
  HostBridge L#0
    PCIBridge
      PCI 144d:a808
    PCIBridge
      PCI 1022:43c8
        Block(Disk) L#0 "sdb"
        Block(Disk) L#1 "sdg"
        Block(Disk) L#2 "sde"
        Block(Disk) L#3 "sdc"
        Block(Disk) L#4 "sdf"
        Block(Disk) L#5 "sdd"
      PCIBridge
        PCIBridge
          PCI 8086:1539
            Net L#6 "eth0"
    PCIBridge
      PCI 10de:1c03
    PCIBridge
      PCI 1022:7901

so it kinda works I guess.

 

Edit edit: calling it top.png works, wtf.

root@Tower:~# lstopo /mnt/user/appdata/top.png
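For anyone hitting the same thing: Unraid mounts user shares under /mnt/user, not /media/user, which is most likely why the second command succeeds - the path, not the filename, was the problem. The equivalent of the original attempt would be (assuming a "downloads" share exists):

lstopo /mnt/user/downloads/topo.png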

 

Edited by BomB191 (extra stuff)
Link to comment
  • 4 weeks later...

The "16 - 16" etc. isn't really telling you the slot speed, but what it is currently running at. To get the correct speed you really need to put a load onto the card.

It's to do with the power saving of the GPU when it's not being used.

It's quite easy to see this in a Windows VM using GPU-Z. You will see the speed the card can run at under bus speed. Hover over that and it will tell you the speed the card is currently running at. Then to the right, if you click on the question mark, you get the option to put some stress on the GPU and you will see the number change.
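If you want to see the same thing from the Unraid side, the kernel exposes the live and maximum link speeds; a sketch (the PCI address is a placeholder for your GPU's):

# Negotiated link status straight from lspci
lspci -s 01:00.0 -vv | grep LnkSta

# Or via sysfs: current vs. maximum link speed
cat /sys/bus/pci/devices/0000:01:00.0/current_link_speed
cat /sys/bus/pci/devices/0000:01:00.0/max_link_speed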

 

Link to comment
  • 5 months later...
