OS X Performance: CPU pinning, benchmarks, discussion, and Windows comparison


1812


-WARNING - LONG POST-  ;)

 

I had a project cancel this AM, so I was able to get to this a little sooner than expected. And since this got long, I decided to move it out of the OS X Sierra install thread and give the discussion its own space.

 

I believe I figured out why topology settings did not affect performance in my testing a few months ago: I tested topology settings on a Windows VM. And as I discuss later, you'll see why I thought it did nothing (spoiler: it really doesn't…)

 

The following tests were completed to determine optimal CPU performance settings in relation to CPU pinning and topology in OS X. They were not done to test lag, video/audio responsiveness, or any other element.

 

For the previous tests in the Sierra install discussion, no topology was ever defined for the VM. So for these tests, I swapped to the Linux XML that gridrunner provided, which defines topology. There were some noticeable changes between showing the VM a grouping of individual single-thread cores vs. cores with a virtual hyper-threaded processor, and an interesting revelation about OS X in virtual environments.

 

 

 

SO, on to the scores!

 

 

 

Note: no tests were done with audio/video

Modified Linux XML (previous tests used modified Ubuntu XML)

8GB ram

Cinebench R15

Same fresh copy of un-updated os x

emulator pin 1-2

All set to host passthrough

Proc 2 isolated from unRaid
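
For reference, the "emulator pin 1-2" in the setup above is declared in libvirt XML roughly like this (a sketch; the cpuset values are this rig's host threads 1-2):

```xml
<cputune>
  <!-- keep QEMU's emulator threads off the threads assigned to the guest's vCPUs -->
  <emulatorpin cpuset='1-2'/>
</cputune>
```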

 

CPU Thread Pairings
Proc 1
cpu 0 <===> cpu 12
cpu 1 <===> cpu 13
cpu 2 <===> cpu 14
cpu 3 <===> cpu 15
cpu 4 <===> cpu 16
cpu 5 <===> cpu 17

Proc 2
cpu 6 <===> cpu 18
cpu 7 <===> cpu 19
cpu 8 <===> cpu 20
cpu 9 <===> cpu 21
cpu 10 <===> cpu 22
cpu 11 <===> cpu 23
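
The pairing table above can be read straight from Linux sysfs on the host. This is a sketch (the sysfs path is standard on Linux, but the actual pairs depend on your CPU):

```python
from pathlib import Path

def parse_siblings(text):
    """Parse a thread_siblings_list entry like '0,12' or '0-1' into a sorted tuple."""
    ids = []
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return tuple(sorted(ids))

def host_thread_pairs(sysfs="/sys/devices/system/cpu"):
    """Collect the unique hyper-thread sibling sets for every online CPU."""
    pairs = set()
    for sib in Path(sysfs).glob("cpu[0-9]*/topology/thread_siblings_list"):
        pairs.add(parse_siblings(sib.read_text()))
    return sorted(pairs)
```

Calling `host_thread_pairs()` on the test rig above should print pairs like `(0, 12)` through `(11, 23)`; `lscpu -e` shows the same information.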

 

 

 

 

(test numbers correlate to the attached image of processor utilization in unRaid)

 

 

1. Topology 3 cores : 2 threads, processors 6-11 (see image 1), not all cores utilized to 100%

482

485

486

 

 

 

2. Topology 3 cores : 2 threads, processors 6,18  7,19  8,20 (HT pairs on physical CPUs), same issue, VM HT cores not pushed to 100% utilization

351

350

350

 

 

 

 

3. Topology 3 cores : 2 threads, processors 6, 7, 8, 18, 19, 20 (HT pairs on physical CPUs), same issue, HT cores not pushed to 100% utilization

345

351

348

 

 

 

4. Topology 6 cores: 1 thread, processors 6-11, 100% cpu usage

541

535

540

 

 

 

5. Topology 6 cores: 1 thread, processors 6,7,8, 18,19, 20, 100% cpu usage

359

356

360

 

 

 

6. Topology 6 cores : 1 thread, processors 6,18  7,19  8,20 (HT pairs on physical CPUs), image 4, 100% cpu utilization

360

363

361

 

 

and for fun:

 

7. Topology 2 sockets, 3 cores, 1 thread, processors 6-11, 100% cpu utilization

540

536

542

 

 

8. Topology 2 sockets, 3 cores, 1 thread, processors 6,18  7,19  8,20, 100% cpu utilization

365

364

363

 

 

———————

The info you don’t care about -

 

Setting the topology to 2 processors with 3 cores, 1 thread each showed no real change in benchmarks vs. showing the VM 1 processor of 6 cores with 1 thread each. Not that this was ever debated, but I figured it wouldn't take too long just to see what happens.

 

In any test utilizing HT cores/threaded pairs on a physical processor, scores were consistently lower than using non-paired cores. Not that that was ever in question either, but shows consistency in testing.

 

 

 

Now something interesting-

 

Declaring a topology that presents an OS X VM a hyper-threaded processor results in degraded performance numbers. This performance loss is not improved regardless of whether you use only physical cores or physical threaded/HT core pairs. It is consistently lower vs. either setting a topology of 6 cores, 1 thread, or no topology definition at all (which defaults to 6:1).

 

 

And I think this is why-

 

Watching per-core CPU utilization in unRaid while presenting OS X a virtual HT processor created via topology declaration (3:2), assigned to 6 physical single-thread cores, showed that every other core was not achieving 100% utilization (see image 1). Even as the core assignment changed to match the defined virtual processor (3:2), the degradation of performance on every other core remained (see benchmarks/images 2 & 3). I believe this is because OS X does not push "HT threads" (a second thread on a perceived core) to full utilization. This also means that OS X sees VM HT pairs assigned in horizontal groups, not the vertical rows in the XML, as is currently believed to be the case with Windows.

 

OS X HT pairing using this method would be defined in XML as

 

<cputune>
    <vcpupin vcpu='0' cpuset='6'/>
    <vcpupin vcpu='1' cpuset='18'/>
    <vcpupin vcpu='2' cpuset='7'/>
    <vcpupin vcpu='3' cpuset='19'/>
    <vcpupin vcpu='4' cpuset='8'/>
    <vcpupin vcpu='5' cpuset='20'/>
  </cputune>

 

 

while the accepted standard way for Windows is

 

<cputune>
    <vcpupin vcpu='0' cpuset='6'/>
    <vcpupin vcpu='1' cpuset='7'/>
    <vcpupin vcpu='2' cpuset='8'/>
    <vcpupin vcpu='3' cpuset='18'/>
    <vcpupin vcpu='4' cpuset='19'/>
    <vcpupin vcpu='5' cpuset='20'/>
  </cputune>
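
For completeness, a `<vcpupin>` layout like these only takes effect together with a `<cpu>` topology element. The 3 cores : 2 threads virtual processor used in these tests would be declared roughly like this (a sketch, with host-passthrough as in the test setup):

```xml
<cpu mode='host-passthrough'>
  <!-- presents the guest 1 socket x 3 cores x 2 threads = 6 vCPUs -->
  <topology sockets='1' cores='3' threads='2'/>
</cpu>
```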

 

 

 

I needed to determine whether this behavior was indicative of my hardware or of OS X. At this point, I booted up a Windows 10 VM and proceeded with the same testing methodology.

 

Windows 10

8GB ram

no emulator pin (forgot, but inconsequential for testing reasons)

Cinebench R15

host passthrough

 

 

A. Topology 1 socket, 3 cores, 2 threads, processors 6-11, 100% cpu utilization

535

527

526

 

 

 

B. Topology 1 socket, 6 cores, 1 thread, processors 6-11, 100% cpu utilization

551

549

551

 

 

 

C. Topology 1 socket, 3 cores, 2 threads, processors 6,7,8, 18, 19, 20, (accepted windows pairing) 100% cpu utilization

376

369

368

 

 

 

D. Topology 1 socket, 6 cores, 1 thread, processors 6,7,8, 18,19,20 (accepted windows pairing), 100% cpu utilization

368

368

371

 

 

E. Topology 1 socket, 3 cores, 2 threads, processors 6,18,  7,19,  8, 20, (os x thread pairing) 100% cpu utilization

365

367

367

 

 

 

F. Topology 1 socket, 6 cores, 1 thread, processors 6,18  7,19  8,20 (os x thread pairing), 100% cpu utilization

367

368

367

 

 

 

Windows testing results were as expected in terms of a VM running 2 threads on 1 core performing worse than one using 1 thread per core. Also evident was that Windows does not show the reduction in core utilization regardless of topology declaration (I did not attach images because they all showed the same result). As such, I am unable to determine whether the HT pairings that Windows uses are actually different from OS X's. And this may be for good reason.

 

 

I don’t think virtualized windows cares about HT pairings-

 

When testing a topology declaration that would seem to match OS X HT pairing (grouped) vs. the accepted Windows HT pairing (row), the results showed almost no difference. This is also interesting because it implies that Windows treats what it believes are paired threads on a physical core the same either way. This is demonstrated by the close benchmark results when using the same physical CPUs with different topologies. And if you look at Task Manager > Performance > CPU, it lists a total count of virtual processors, with no topology definition. Windows is self-aware…

 

So trying to pair threads by defining topology for a Windows VM is, in terms of performance, essentially an exercise in futility, as you reap no tangible gains vs. presenting the VM a grouping of cores. And as the tests showed, even mixing up what could be considered HT pairs and putting the "wrong" pair together produced essentially no change in results (compare tests C:E, D:F.)

 

 

How does this relate to os x-

 

Since OS X is not optimized for virtual environments, it still "believes" it is on bare metal and tries to compensate accordingly by not forcing 100% CPU usage on what it perceives as the second thread of a physical core, since there is no way to achieve 100% utilization of both threads of a pair at the same time. This is demonstrated by the tests that use 3 physical cores with 2 threads each scoring lower than 6 physical cores with 1 thread each.

 

 

Bottom line-

 

Based on these cumulative test results, if you want to maximize OS X CPU performance, give it a group of single-threaded CPUs and don't present it a virtual processor with HT pairs defined in topology.
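
Put together, the layout these tests favor for an OS X VM looks roughly like this (a sketch based on the pinning used above: six single-thread cores, no virtual HT pairs):

```xml
<vcpu placement='static'>6</vcpu>
<cputune>
    <vcpupin vcpu='0' cpuset='6'/>
    <vcpupin vcpu='1' cpuset='7'/>
    <vcpupin vcpu='2' cpuset='8'/>
    <vcpupin vcpu='3' cpuset='9'/>
    <vcpupin vcpu='4' cpuset='10'/>
    <vcpupin vcpu='5' cpuset='11'/>
</cputune>
<cpu mode='host-passthrough'>
    <!-- 6 cores, 1 thread each: no virtual HT pairs presented to OS X -->
    <topology sockets='1' cores='6' threads='1'/>
</cpu>
```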

 

 

 

I would ask anyone else to replicate this process and see if their results match.

 

 

Benchmark_unRaid_CPU_usage.jpg


OK I will bite..

 

Aren't tests C-D-E-F all identical layouts, so they will yield identical results!??!

 

Doesn't your testing also show the expected results... 6 physical cores better than 3+3HT... Your test 1 is actually 6 cores, not 3+3. Tests 2 + 3 are the same.. Not sure what the diff between 1 and 4 is...

 

This just confuses me ?!?!

 

 

 


OK I will bite..

 

Aren't tests C-D-E-F all identical layouts, so they will yield identical results!??!

 

The logical threads used are identical. How they are presented to the vm is what changes.

 

C presents the vm a 3:2 virtual processor on a 3:2 physical

D presents the vm a 6:1 virtual processor on a 3:2 physical

 

This illustrates the differences in presenting the same paired physical cores to a VM as either a virtual HT processor (which is accepted as how you should do it) vs. presenting the virtual processor as 6 non-hyper-threaded cores. As it turns out, there are none.

 

 

E presents the vm a 3:2 virtual processor on a 3:2 physical

F presents the vm a 6:1 virtual processor on a 3:2 physical

 

The difference with E & F is that the order of cores is mixed vs. C & D, and not in what is generally accepted as the optimal topological layout.

 

To view differences between the accepted Windows HT pairing format and the implied optimal one for OS X when using HT pairs, you would compare tests C & E, and D & F, which also show essentially no change.

 

Everyone says "You have to put your HT pair on the HT thread of the physical CPU." Well.... testing implies otherwise. Regardless of how the virtual CPU was presented to the VM or even the core order, mixing what we perceived as Windows HT cores doesn't matter in terms of performance/CPU benchmarks.

 

And your observation is correct, they yielded similar scores, which is why I wrote "I don’t think virtualized windows cares about HT pairings."

 

Windows testing implies that defining a topology really doesn't matter for benchmark scores in virtualized Windows, as it appears to treat all cores presented to it the same, unlike OS X.

 

 

 

 

Doesn't your testing also show the expected results... 6 physical cores better than 3+3HT... Your test 1 is actually 6 cores, not 3+3. Tests 2 + 3 are the same.. Not sure what the diff between 1 and 4 is...

 

Yes, expected results that 6 cores are better than 3 thread pairs, which is why I wrote: "In any test utilizing HT cores/threaded pairs on a physical processor, scores were consistently lower than using non-paired cores. Not that that was ever in question either, but shows consistency in testing."

 

 

Test 2 & 3 are not the same in terms of assignment:

 

Test 2 presented the VM a 3:2 virtual CPU consisting of 3 physical CPUs and their HT pairs, laid out with the VM HT pair on the physical CPU's HT thread. (see image 2 in attachment)

 

Test 3 presented the VM a 3:2 virtual CPU consisting of 3 physical CPUs and their HT pairs, laid out with the accepted Windows HT pairing scheme, which then placed OS X HT threads in alternating positions of the vertical row. (see number 3 in attachment) I suspect the results are similar because regardless of where the HT thread is placed in these 2 tests, both logical threads on a core were still utilized to the same percentage, because they each contained 1 thread at 100% and 1 thread at less than 100%.

 

 

Test 1 presented the vm a 3:2 virtual cpu consisting of 6 physical cores (no pairs used)

Test 4 presented the vm a 6:1 virtual cpu consisting of 6 physical cores (no pairs used)

 

Differences in benchmark scores of these tests show how the OS X VM performs when presented a 3:2 virtual CPU vs. a 6:1. And this is actually the most important test for OS X, as it led to the discovery that if you give the VM a virtual HT processor (3:2 in this case), then half of the cores will not be utilized to 100%, reducing overall performance.

 

This is the key difference between Windows and OS X on KVM. And as I proposed earlier, I believe it is because Windows "knows" it is running in a virtual state and does not throttle down any cores as hyper-threads. OS X, on the other hand, believes it is a real boy (computer) and won't push its HT cores to 100%, because in a real-world setting you cannot achieve 100% utilization on 2 threads of a physical core. If you could, then the benchmark score of a single thread of a given core would double when you ran the test again utilizing both its threads. But that does not happen. If it did, then every test run in this benchmark across 6 physical cores would have a similar score to the run on the 3:2 physical CPU.

 

There is roughly an 8-12% difference between benchmark scores of the physical CPU at 3:2 vs. 6:1, with the latter scoring higher (as expected.) And if you look at the OS X images (attached) where the VM was presented a virtual HT processor, unRaid shows CPU usage on the HT cores to be 6-8% less than 100%. I believe the variance of a few points between these two sets of numbers (benchmark differences and displayed CPU utilization) can be explained by the unRaid dashboard not being 100% precise, since it is not a realtime graph but a snapshot, combined with the VM also using CPU resources to run itself as well as the benchmark.
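
That 8-12% figure can be checked directly from the averaged scores above, comparing test 1 (3:2 topology) with test 4 (6:1), both pinned to the same six physical cores 6-11:

```python
# Cinebench R15 runs from tests 1 and 4 above, same six physical cores
topo_3x2 = sum([482, 485, 486]) / 3  # 3 cores : 2 threads (test 1)
topo_6x1 = sum([541, 535, 540]) / 3  # 6 cores : 1 thread  (test 4)

loss_pct = (topo_6x1 - topo_3x2) / topo_6x1 * 100
print(f"3:2 topology scores {loss_pct:.1f}% lower than 6:1")  # roughly 10%
```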

 

 

Does that 10% difference matter in terms of overall performance on a VM? On the small-scale side of things, probably not. If you're scoring a 300 and reach a 330, it's not a big deal. But if you're running a 24-core VM and intend to use an application that will push your cores to full utilization, then there is a definable difference in how long it takes to complete a given task if you're benching at 1420 vs. 1291.

 

 

This just confuses me ?!?!

 

Tell me about it.....

2 weeks later...

Thanks 1812 for such detailed tests!

 

I wonder if having a mix between CPU HT pairs and sequential assignment is a good compromise for achieving both better overall performance and some pressure valve for when there is heavy load on the entire unRaid box...

 

For example in a 6/12 CPU, with 2 VM's under heavy load, to assign:

VM1: 1,7 + 2-4

VM2: 5,11 + 8-10

(assuming 0, 6 for unraid)

 

I mean, with sequential assignment only (e.g. VM1: 1-5, VM2: 7-11), if both VMs are under heavy load you'll probably end up with some responsiveness issues?

Whereas if you assign in pairs, you won't utilize the CPUs to the max.

 

just a thought... will have to try :)

 


If I know that 2 VMs are going to be using 80% or more of a given thread constantly, performance-wise it really makes little to no difference whether they are sharing a core or on cores isolated from one another (when using a minimum of 2 threads per VM). Virtualized Windows 10 will always try to get 100% utilization of the thread, regardless of whether it is sharing a core or not. It won't actually achieve it, and the performance hit will then be the same in either situation. If you're running VMs on top of each other, both will be unable to achieve full utilization. If you give a VM 3 cores, 6 threads, then by virtue of sharing cores with itself, none of those paired threads will be able to hit 100% utilization at the same time. The unRaid dashboard and Windows will show 100%, but benchmarking is consistently lower, because the dashboard does not really show the loss that occurs when trying to run a thread pair through a single core. In other words: you can't cram 10lbs of **** in an 8lb sack, regardless of where the **** comes from.

 

 

I've done some testing on the latency of VMs using isolated cores (no HT sharing among VMs) vs. shared cores, but haven't put it up since the info above is already quite a bit on its own. I found that two 6-core VMs run on 3 HT thread pairs vs. on top of each other (sharing physical cores) have very similar latency scores. That makes me start to think that a user's hardware often comes into play more than pinning assignment. In all my tests, using mixed core assignments, putting VMs on shared cores, etc., I could never get any real difference in latency or cause any audio/video sync issues. The only time I was able to degrade latency/responsiveness and cause audio/video issues was when I either put 2 VMs on the exact same threads (so 2 of them on 6-11) or when a VM shared a thread with unRaid/dockers, and even then audio issues only occurred when the thread was utilized over about 70-80%.

 

Much of this testing was done on enterprise hardware built for VMs, so that may play a big part in my inability to produce latency issues when taking similar thread-pinning actions. The one thing I would suggest, if someone is considering sharing cores between VMs, is to flip the order of the core assignment for the second VM. That way both VMs' "cpu 0" is not on the same physical core. A few times I've seen Windows and other operating systems rely on their first core for single-core tasks. Often they cycle through the threads available to them, but some applications seem intent on staying on "core 0." (Also, on boot, some operating systems will max "cpu 0" before utilizing other cores in the startup process.) So if 2 VMs are pushing a single core, you have a better chance of them both not trying to utilize the same HT pair at 100% at the same time.
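
As a sketch, that flipped assignment for two VMs sharing the same three host threads might look like this in each VM's XML (the cpuset values here are illustrative, not from the tests above):

```xml
<!-- VM1: its vcpu 0 lands on host thread 1 -->
<cputune>
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='2'/>
    <vcpupin vcpu='2' cpuset='3'/>
</cputune>

<!-- VM2: same host threads, reversed, so its vcpu 0 lands on host thread 3 -->
<cputune>
    <vcpupin vcpu='0' cpuset='3'/>
    <vcpupin vcpu='1' cpuset='2'/>
    <vcpupin vcpu='2' cpuset='1'/>
</cputune>
```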

 

Would be nice to see what other users experience, and what hardware they are doing it on.

