SpaceInvaderOne Posted September 16, 2018

Hi guys. This video is a tutorial on how to examine the topology of a multi-CPU or Threadripper server which has more than one NUMA node. This is useful so we can pin vCPU cores from the same NUMA node as the GPU we want to pass through, and therefore get better performance. The video shows how to download and install hwloc and all of its dependencies using a script and @Squid's great User Scripts plugin. Hope you find it useful.

---

**Note** If using the 6.6.0 RC versions of Unraid (or above), before running the lstopo command you will need to create a symlink using this command first:

ln -s /lib64/libudev.so.1 /lib64/libudev.so.0

Don't run the above command unless on Unraid 6.6.0 or above!

---

EDIT -- Unraid now has lstopo built in! You will need to boot your server in GUI mode for it to work. Then once you are in GUI mode just open the terminal, run the command, and you are good to go. Much easier than messing with loading it manually like in my video.
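For reference, once lstopo is available (either via the script or the GUI-mode boot), generating the topology is a single command. A minimal sketch - the output path is just an example, any writable path on the server should work:

lstopo                       # print the topology as text in the terminal
lstopo /boot/topology.png    # write it out as a PNG (lstopo picks the format from the file extension)

Then open the PNG from the flash drive or a share and note which NUMA node your GPU hangs off.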
jonp Posted September 16, 2018 Another great guide by our friendly Space Invader!!
revilo951 Posted September 17, 2018 Thanks for all you do, friendly space-man!
jspence27 Posted September 17, 2018

Thanks Spaceinvader. Reposting from my comment on the video in case other forum users run into the same issue. Running Unraid 6.6.0-rc3, and when trying to generate the PNG I get:

lstopo: error while loading shared libraries: libudev.so.0: cannot open shared object file: No such file or directory

Any thoughts?
1812 Posted September 17, 2018 Great vid, but curious why no benchmarks?
bastl Posted September 17, 2018

5 hours ago, jspence27 said: Running Unraid 6.6.0-rc3 and when trying to generate the PNG I get: lstopo: error while loading shared libraries: libudev.so.0: cannot open shared object file: No such file or directory

Same error on RC4.
SpaceInvaderOne Posted September 17, 2018

39 minutes ago, bastl said: Same error on RC4.

Sorry, I never tested this on the new RC Unraid versions. This happens because libudev has been updated from .so.0 to .so.1, so lstopo can't find libudev.so.0. To fix it, just create a symlink:

ln -s /lib64/libudev.so.1 /lib64/libudev.so.0
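If you want to check before touching anything, something like this should do it (assuming the libraries live in /lib64 as above - it only creates the link if the old name really is missing):

ls -l /lib64/libudev*                                                        # see which versions are present
[ -e /lib64/libudev.so.0 ] || ln -s /lib64/libudev.so.1 /lib64/libudev.so.0  # link only if .so.0 is absent

Like most changes to the Unraid root filesystem, the link lives in RAM, so it should disappear again on reboot.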
bastl Posted September 17, 2018

Thanks @SpaceInvaderOne, the symlink fixed it for me.

Why the hell is the first PCIe slot connected to the second die and the third slot to the first die? In the first slot I have a 1050 Ti which is used by a Linux VM that uses some cores from the first die. The 1080 Ti in the 3rd slot is mainly used for a gaming VM using all the cores (8-15; 24-31 isolated) on the second die. I wish I could flip a switch in the BIOS to reverse that. I guess there is no chance of such an option, right?
thenonsense Posted September 17, 2018 I've been experimenting with numactl and pinning properly, and at least in terms of memory the benefits are immediate. This tool not only helps with memory, but checks PCIe slots as well? Awesome.
testdasi Posted September 18, 2018

Great vid! I was super stoked realising that my first slot is connected to node 2. That means I can still leave unRAID to do unRAID things in node 0 like it prefers and have my workstation VM isolated on node 2. Now if only I could force it to only use the RAM connected to node 2.

On 9/17/2018 at 8:06 AM, bastl said: Why the hell is the first PCIe slot connected to the second die and the third slot to the first die? ... I wish I could flip a switch in the BIOS to reverse that. I guess there is no chance of such an option, right?

You can always flip the cards and pass them through via the vBIOS method. For a gaming VM it's better to lock everything to the same node (although you are running in UMA, not NUMA, so not sure how much benefit it's going to get).

15 hours ago, thenonsense said: I've been experimenting with numactl and pinning properly, and at least in terms of memory the benefits are immediate.

How do you do that, please? It's always a "hope and pray" situation with me to get RAM assigned correctly. Now that I have isolated the entire node 2 for my VM, it will allocate 99% of my RAM to node 2, which is good (enough) - but ONLY after a clean reboot. The longer I wait after the reboot, the more it allocates to node 0 (sometimes it goes nuts and allocates 100% to node 0, even though there's nothing else running). Don't know how to do it without rebooting.
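One thing that seems to help with the RAM side is pinning the VM's memory to a node explicitly instead of relying on the allocator. A rough sketch only, untested across Unraid versions - the VM name and node number below are placeholders for your own:

numactl --hardware                                                 # shows how much free RAM each node has right now
virsh numatune "Windows 10" --mode strict --nodeset 1 --config     # bind that VM's memory to node 1 from its next start

The same thing can be done by hand-editing a <numatune> block into the VM's XML. Note that strict mode will refuse to start the VM if the chosen node doesn't have enough free memory; "preferred" is the softer option.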
bastl Posted September 18, 2018

@testdasi Yesterday I reduced my gaming VM to 6 cores + 6 threads on node 1 with all cores isolated and did a couple of benchmarks without running anything else on that die. Then I switched all my dockers and other VMs from node 0 to node 1, isolated the last 6 out of 8 cores and their threads on node 0 from Unraid, and switched the gaming VM over to node 0, where my 1080 Ti should still be attached (if lstopo is correct). I haven't flipped the cards around yet, because for now I don't need any vBIOS to pass through. The performance is basically the same, except for small stutters/hiccups and sound bugs every 30-40 seconds. Every game (BF5, Far Cry 5, DayZ, Superposition + Heaven benchmark) I tested gave me nearly the same performance as on node 1, plus that weird stuttering. I don't know exactly why. I never had that issue when I isolated the second die and used only those cores. This gets me back to my initial idea that maybe the BIOS is reporting the core pairings wrong to the OS. Why should I get stutters when the GPU is connected to the cores being used directly, and no stutters across the Infinity Fabric? Weird!

I haven't retested in NUMA mode. I did that before, and as long as I don't mix up the dies for one VM it makes no difference in gaming performance. Using UMA mode showed me in my tests that I get higher memory bandwidth with no real performance loss.
ars92 Posted September 18, 2018 Might try this on my 1920X. Wonder if it'll help with my RPCS3 performance; compared to bare metal, RPCS3 on KVM is just so much worse for me.
Jerky_san Posted September 18, 2018

Just tried this, but sadly my performance seems to be a lot worse when only using CPU cores from the NUMA node where my GPU and M.2 device are connected. FPS in Dying Light dropped from 100 to barely 60, with a lot more stuttering and hitching. Very strange. My board is kind of stupid as well, since it appears to map the CPUs differently than other people's boards on this forum with a 2990WX. I should say previously I was only mapping odd cores and leaving all even cores out. For some reason that improves performance even when it crosses dies. The second die has my M.2 drive and 1070 attached; the other die has all the drives attached to it for Unraid.
bastl Posted September 19, 2018 @Jerky_san what board are you using?
Jerky_san Posted September 19, 2018

6 hours ago, bastl said: @Jerky_san what board are you using?

Asus Zenith X399
testdasi Posted September 19, 2018

1 hour ago, Jerky_san said: Asus Zenith X399

I would hypothesise that your motherboard reports the cores incorrectly as a "display problem", i.e. underlying it's still a 0 + 1 = 1 pair. Hence when you assign all the odd cores, you are effectively running your VM in non-SMT mode (e.g. 16 non-SMT cores = 16 physical cores, whereas 16 SMT cores = 8 physical cores). I made the above educated guess based on the fact that I see the exact same result, i.e. assigning odd cores to the VM gives me better performance.

I think we'll need to have the right expectations. This is not a magical cure. Assigning all cores in the same die only improves latency (assuming memory allocation is also on the same NUMA node). The processing power of a core (+ its SMT sister) doesn't change. If your workload scales well with multiple cores (e.g. transcoding), having more cores spread across multiple dies will almost always help more than having fewer (physical) cores trying to ring-fence the process within the same node. If your workload doesn't scale well and/or is bottlenecked by a passed-through device (e.g. gaming), then you will benefit from better consistency of performance (e.g. less stuttering, i.e. more consistent fps) by ring-fencing the node with the connected device. What we lacked previously was the ability to make educated tuning decisions for the 2nd use case.
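One way to check whether the pairings really are 0+1, 2+3, etc. (rather than how the GUI happens to display them) is to ask the kernel directly. A quick sketch, run from the Unraid terminal - extend the list to cover all your cores:

for c in 0 1 2 3; do
  echo -n "cpu$c siblings: "
  cat /sys/devices/system/cpu/cpu$c/topology/thread_siblings_list   # logical CPUs sharing one physical core
done

If cpu0 reports "0,1" the hyperthread pairs are adjacent; if it reports something like "0,16" the siblings are split into the upper half of the numbering, and the pinning should be adjusted to match.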
thenonsense Posted September 25, 2018 Does anyone else have an issue identifying between multiple GPUs of the same make/model? I used Cinebench runs via OpenGL to judge which was which, but I wasn't able to identify them without it.
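One thing that might avoid the benchmarking is matching the cards by PCI bus address rather than by name. A rough sketch - the grep patterns may need tweaking for some cards, and recent hwloc versions print the busid in verbose mode:

lspci -nn | grep -Ei 'vga|3d controller'     # lists each GPU with its bus address, e.g. 0a:00.0
lstopo -v | grep -i busid                    # lstopo's verbose output carries the same busid= per PCI device

The bus address is also what appears in Tools > System Devices and in the VM's XML, so once you know which address sits on which node you can tell identical cards apart without loading them.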
AnnabellaRenee87 Posted October 4, 2018 So I did this on my ProLiant DL370 G6 and got this back. What's up with these thread numberings?
Rhynri Posted October 7, 2018

Awesome video. I'd like to note that in "independent research" I found hwloc/lstopo included with the GUI boot in Unraid 6.6.1. So that's another option, requiring about the same number of reboots as the script method, i.e. reboot into GUI, take a snapshot, reboot back to CLI. Of course, if you run GUI mode all the time, this is just a bonus for you.

Also, here is a labeled version of the Asus X399 ZE board in NUMA mode. Enjoy, and thanks @SpaceInvaderOne! (Note: this is with all M.2 slots and the U.2 x4/PCIe x4 split enabled with installed media. Slot numbers count full-length slots in order of physical closeness to the CPU socket, so top down for most installs.)
Jerky_san Posted October 10, 2018

On 9/19/2018 at 8:51 AM, testdasi said: I would hypothesise that your motherboard reports the cores incorrectly as a "display problem", i.e. underlying it's still a 0 + 1 = 1 pair. ...

I turned off SMT for an experiment. The even/odd cores when SMT is off are physical cores, but they are very weirdly broken up. There is a BIOS programmer over on the overclockers' forum who works on my board, so I asked him if he'd look at it, but so far not much of an answer.
mucflyer Posted October 21, 2018

I just need advice on how to separate the GPUs so each is on a separate CPU. As I understand it, everything landed on NUMANode P0. I have 2 VMs and two GPUs. On the other hand, I took the mainboard manual and I see this:

54 PCI Express 3.0 x16 Slot (PCIE2, Blue) from CPU_BSP1
56 PCI Express 3.0 x16 Slot (PCIE3, Blue) from CPU_BSP1
57 PCI Express 3.0 x16 Slot (PCIE4, Blue) from CPU_BSP1
58 PCI Express 3.0 x16 Slot (PCIE5, Blue) from CPU_AP1
60 PCI Express 3.0 x4 Slot (PCIE6, White) from CPU_BSP1

Am I correct in saying that CPU_AP1 has only one PCI Express slot? I did some PCI card shuffling, and now it looks like this: the main VM is on NUMANode 0. Just curious what effect all the other devices, like the Ethernet controller, have on performance.
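For anyone wanting to double-check which node a given card actually hangs off, independent of the manual, you can ask the kernel. A sketch - the PCI address is a placeholder for your own, and a result of -1 means the platform isn't reporting locality (e.g. when the CPU is in UMA/memory-interleave mode):

lspci | grep -i vga                                # find each GPU's address, e.g. 41:00.0
cat /sys/bus/pci/devices/0000:41:00.0/numa_node    # 0, 1, ... = the NUMA node that slot is wired to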
BomB191 Posted October 28, 2018

I keep getting:

root@Tower:~# lstopo /media/user/downloads/topo.png
Failed to open /media/user/downloads/topo.png for writing (No such file or directory)

but it exists - the directory, that is. I dunno wtf I'm doing. This is after doing the "ln -s /lib64/libudev.so.1 /lib64/libudev.so.0".

Edit: just "lstopo" spits out:

root@Tower:~# lstopo
Machine (16GB)
  Package L#0
    L3 L#0 (8192KB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (512KB) + L1d L#1 (32KB) + L1i L#1 (64KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
      L2 L#2 (512KB) + L1d L#2 (32KB) + L1i L#2 (64KB) + Core L#2
        PU L#4 (P#4)
        PU L#5 (P#5)
      L2 L#3 (512KB) + L1d L#3 (32KB) + L1i L#3 (64KB) + Core L#3
        PU L#6 (P#6)
        PU L#7 (P#7)
    L3 L#1 (8192KB)
      L2 L#4 (512KB) + L1d L#4 (32KB) + L1i L#4 (64KB) + Core L#4
        PU L#8 (P#8)
        PU L#9 (P#9)
      L2 L#5 (512KB) + L1d L#5 (32KB) + L1i L#5 (64KB) + Core L#5
        PU L#10 (P#10)
        PU L#11 (P#11)
      L2 L#6 (512KB) + L1d L#6 (32KB) + L1i L#6 (64KB) + Core L#6
        PU L#12 (P#12)
        PU L#13 (P#13)
      L2 L#7 (512KB) + L1d L#7 (32KB) + L1i L#7 (64KB) + Core L#7
        PU L#14 (P#14)
        PU L#15 (P#15)
  HostBridge L#0
    PCIBridge
      PCI 144d:a808
    PCIBridge
      PCI 1022:43c8
        Block(Disk) L#0 "sdb"
        Block(Disk) L#1 "sdg"
        Block(Disk) L#2 "sde"
        Block(Disk) L#3 "sdc"
        Block(Disk) L#4 "sdf"
        Block(Disk) L#5 "sdd"
    PCIBridge
      PCIBridge
        PCI 8086:1539
          Net L#6 "eth0"
    PCIBridge
      PCI 10de:1c03
    PCIBridge
      PCI 1022:7901

So it kinda works, I guess.

Edit edit: calling it top.png works, wtf:

root@Tower:~# lstopo /mnt/user/appdata/top.png
dadarara Posted November 24, 2018 Guys, what do the small numbers on the diagram mean? I see 10de:1b81, which is the 1070, and it has the number 4.0 - 4.0, while 1002:67df is the ATI RX 480 and it has 16 - 16. Is that the PCI bus speed? If so, does it mean the 1070 is connected to a slow slot?
SpaceInvaderOne Posted December 1, 2018

The 16 - 16 etc. isn't really telling you the slot speed but what the card is currently running at. To get the correct speed you really need to put a load onto the card; it's to do with the power saving of the GPU when it's not being used. It's quite easy to see this in a Windows VM using GPU-Z. Under Bus Interface you will see the speed the card can run at. Hover over that and it will tell you the speed the card is currently running at. Then, if you click on the question mark to the right, you get the option to put some stress on the GPU, and you will see the number change.
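The same check can be done from the Unraid terminal without a VM. A sketch - the PCI address is a placeholder for your own card's:

lspci -s 0a:00.0 -vv | grep -E 'LnkCap|LnkSta'

LnkCap is what the slot/card can negotiate, and LnkSta is what it is running at right now; expect LnkSta to show a lower speed while the card is idle, for the same power-saving reason.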
Darnshelm Posted May 26, 2019

Hi, I get this error:

lstopo: pci-common.c:180: hwloc_pci_try_insert_siblings_below_new_bridge: Assertion `comp == HWLOC_PCI_BUSID_INCLUDED' failed.
Aborted