Preempting lowlatency kernel performance



I've just gotten over the last challenge in my setup- a harsh drop in performance and stability when running 2 gaming VMs at once. Running KVM with PCIe passthrough, a game in one VM always seemed to affect a game in the other VM. It got worse with heavy disk access, so it definitely looked like some kind of scheduling/priority issue.

 

The general verdict is that running the 'generic' 100-tick kernel is more than enough for games, but no matter what there seemed to be a constant battle between the VMs. I'm pinning vCPUs, but both VMs ultimately share all the cores. I'm sure this is the root of my issue, as there wouldn't be scheduling contention between the VMs if they didn't share cores.

 

So I installed a low latency kernel ('lowlatency', 1000-tick w/ kernel-space preemption)... and it kinda fixed everything. The general verdict is that the lowlatency kernel will most likely be a performance drain in gaming situations, but I haven't noticed any FPS loss (when running just one VM), so it must not be that much overhead (might be a different story when gaming in a KVM guest). I'm also not sure I could have noticed FPS loss, because beforehand the games were so jumpy and jittery I really couldn't gauge it. Afterwards everything is butter smooth... but with, I'm assuming, some lost CPU efficiency.

 

Has anyone else tried a high-tick preempting kernel? I keep expecting something horrible to happen, but so far it seems to have been a great decision. I read that preemption can drop frame rates up to 10%, so I just overclocked the CPU/cache 10%... fixed, right?


Doing the 1:1 pinning like that just makes sure that Linux doesn't try to move a VM's vCPU thread to another physical CPU (like it would with normal threads, which it schedules based on a bunch of rules).
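
For reference, the 1:1 pin is just a <cputune> block in the domain XML. A minimal sketch for a hypothetical 4-vCPU guest on a 4-core host (the cpuset numbers are illustrative, match them to your own topology):

```xml
<vcpu placement='static'>4</vcpu>
<cputune>
  <!-- vcpu N runs only on host cpu N, so it never migrates between cores -->
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
</cputune>
```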

 

Just for more backstory- I wanted to allow the Windows VM to take advantage of more resources when the other VM is off. I've got a pretty simple script that solves that- it watches libvirt and repins the vCPUs depending on which box(es) are running, roughly like the sketch below.
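
The actual script is hacked together, but the gist is something like this (a sketch using the libvirt-python bindings- the domain names and CPU sets here are made up, and error handling is stripped):

```python
#!/usr/bin/env python3
# Sketch: repin vCPUs whenever a domain starts or stops.
# Domain names and CPU sets are hypothetical - edit for your setup.
import libvirt

HOST_CPUS = 4
ALONE = {0, 1, 2, 3}                       # a lone VM gets every core
SPLIT = {'win-gaming': {0, 1},             # both running: split 2+2
         'linux-gaming': {2, 3}}

def cpumap(cpus):
    # libvirt wants one boolean per host CPU
    return tuple(i in cpus for i in range(HOST_CPUS))

def repin(conn):
    active = [d for d in conn.listAllDomains() if d.isActive()]
    for dom in active:
        cpus = SPLIT.get(dom.name(), ALONE) if len(active) > 1 else ALONE
        for vcpu in range(dom.maxVcpus()):
            dom.pinVcpu(vcpu, cpumap(cpus))

def lifecycle_cb(conn, dom, event, detail, opaque):
    if event in (libvirt.VIR_DOMAIN_EVENT_STARTED,
                 libvirt.VIR_DOMAIN_EVENT_STOPPED):
        repin(conn)

libvirt.virEventRegisterDefaultImpl()      # must precede open() for events
conn = libvirt.open('qemu:///system')
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                            lifecycle_cb, None)
repin(conn)
while True:
    libvirt.virEventRunDefaultImpl()
```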

 

Trying to run both VMs across all cores is... more of an experiment. With a single VM I was already having kinda bad latency issues: audio pops, stuttering graphics when disk access was high, kinda jerky FPS. Then as an experiment I tried running another game on the other VM with the overlapped CPU pinning, and things got 1000x worse. So this definitely looked like a scheduling issue. My CPU wasn't 'pegged', so it just wasn't getting to the right task fast enough. The general 'solution' to this seems to be CPU pinning- you can even get crazy enough to move all IRQs to a 'system' core so the VM cores stay almost 100% clean, running nothing but the VM. What I noticed with just the 1 VM already pointed at scheduling, so I kinda went with the overlapped setup as a 'worst case scenario' to diagnose it.
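
For anyone wondering what 'move all IRQs to a system core' looks like, a rough sketch (run as root with irqbalance stopped first, otherwise it just re-spreads them- CPU 0 as the system core is just an example):

```python
#!/usr/bin/env python3
# Sketch: steer every movable IRQ onto one housekeeping core so the
# VM cores stay quiet. CPU 0 here is just an example choice.
import os

SYSTEM_CPU = '0'

for entry in os.listdir('/proc/irq'):
    if not entry.isdigit():
        continue
    try:
        with open(f'/proc/irq/{entry}/smp_affinity_list', 'w') as f:
            f.write(SYSTEM_CPU)
    except OSError:
        pass  # per-CPU and other unmovable IRQs reject new affinities
```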

 

So, low latency kernel- in theory this does 2 things for me: a) increases the kernel tick rate, which is used for process preemption- basically it subdivides the CPU time more (in this case, from 100 tick to 1000 tick); b) allows the kernel to preempt kernel-space work (drivers and the like). I think 'a' here is really what's making the difference- if you're split-pinning vCPUs, a VM really is free to use basically as much CPU time as it wants, since there's no other demand for its cores. But unless you make Linux stay away from a core entirely, it's never really 'clean': irqbalance might fill it with interrupts, the Linux scheduler may put tasks there, and kernel tasks may run there which normally can't be preempted/stopped. So unless great care is taken, a VM still might have to 'wait' on a real CPU before it can do anything.
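
An easy way to see how 'clean' a core really is- add up the per-CPU interrupt counts from /proc/interrupts (a quick sketch; run it twice and diff the totals to get a live rate):

```python
#!/usr/bin/env python3
# Sketch: sum interrupt counts per CPU from /proc/interrupts to see
# how much IRQ traffic is landing on a supposedly-idle core.
with open('/proc/interrupts') as f:
    lines = f.read().splitlines()

cpus = lines[0].split()                     # header row: CPU0 CPU1 ...
totals = [0] * len(cpus)
for line in lines[1:]:
    fields = line.split()[1:1 + len(cpus)]  # skip the "NN:" label
    for i, field in enumerate(fields):
        if field.isdigit():                 # ERR/MIS rows have fewer columns
            totals[i] += int(field)

for cpu, total in zip(cpus, totals):
    print(f'{cpu}: {total}')
```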

 

Basically, the low latency kernel seems to have done a lot to fix things for me. Wondering if anyone else has experimented with preemption and/or high-tick kernels before? Hypothetically the lowlatency kernel should actually be rather inefficient and offer lower throughput than the normal kernel (due to the extra context switching). Anyone have any experience with that? This is also a Skylake CPU, which is supposed to have optimizations around context switching. Are context switches potentially just not that expensive on this architecture?


Very neat. Any chance you had DPC latency measurements before & after the kernel change? Obviously your situation improved, but numbers always help illustrate the point.

 

I run a single gaming VM with isolcpus. Obviously if I ever wanted to run another on my i5 I'd be stuck- not enough cores. Throwing more hardware at the problem isn't practical (League of Legends & such just isn't that intensive), so a new kernel would be nice.


Whoops, sorry I let this go dark for a second.

 

I didn't do any technical measurements unfortunately :( The only numbers I was able to observe were a solid 10ms latency drop in Steam streaming (40ms -> 30ms), plus more stable latency over time (Steam will graph the latency for you), so far fewer spikes in the end.

 

@squark yeah, even with a single VM (no isolcpus for Linux or core isolation for guests) disk access would always give me the biggest 'pops' in latency. With lowlatency on a single VM, it's butter smooth. No task really seems to delay any other noticeably (at the cost of context switching, I'm sure). The same goes with another VM- even with CPU resources shared equally between 2 guests and the host, it's really smooth.

 

As far as the kernel... I'm actually doing this on a reference setup (Ubuntu Xenial) before I switch over to unRAID full time. Roughly the same package versions as the unRAID 6.2.0 beta- kernel 4.4, QEMU 2.5, libvirt 1.3.

 

Kernel configs I tried:

'linux-generic' - 4.4, 250 tick, vol_preempt

'linux-lowlatency' - 4.4, 1000 tick, invol_preempt, forced irq threads
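
If you want to double check what your running kernel was actually built with, the build config ships in /boot on Ubuntu- something like:

```python
#!/usr/bin/env python3
# Sketch: grep the running kernel's build config for the tick rate
# and preemption options. Assumes a distro that ships /boot/config-*.
import os

release = os.uname().release
wanted = ('CONFIG_HZ=', 'CONFIG_PREEMPT=', 'CONFIG_PREEMPT_VOLUNTARY=',
          'CONFIG_IRQ_FORCED_THREADING=')
with open(f'/boot/config-{release}') as f:
    for line in f:
        if line.startswith(wanted):
            print(line.strip())
```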

 

Lowlatency definitely gave me the latency I wanted, but I'm not sure which of the features did it =\ According to what I've read, CONFIG_PREEMPT will allow any higher priority process to interrupt and preempt any LOWER priority process (including kernel 'threads'). So technically, my normal priority VMs shouldn't actually be preempting anything except each other. Why this makes a difference with a single VM is a good question, but it's most likely just due to the tick rate and probably doesn't have much to do with preemption (maybe at an I/O device level).

 

The tick makes sense, as you are just subdividing the time more between procs- afaik the normal tick-based preemption works similarly, but takes time allocation into account: higher prio procs will be given more 'ticks' than lower ones. So in theory it should be switching between running drivers/kernel work and the VM process rather often, but with priority still given to the drivers. This alone could improve latency by letting procs do even 'small' amounts of work rather quickly.

 

The other interesting thing is the 'forced irq threads' option- this basically gets a lot of soft/hard IRQ handling out of the kernel proper- afaik it subjects those kernel threads to the same scheduling principles as regular procs. This alone may also improve latency, just by making the bulk of the kernel work follow the same 'tick' the rest of the procs do and time share more evenly.
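
You can actually watch this happen- with IRQ threading in force, handlers show up as kernel threads named irq/<number>-<device>. A quick sketch to list them:

```python
#!/usr/bin/env python3
# Sketch: list threaded-IRQ kernel threads (named "irq/<num>-<device>").
import os

for pid in filter(str.isdigit, os.listdir('/proc')):
    try:
        with open(f'/proc/{pid}/comm') as f:
            name = f.read().strip()
    except OSError:
        continue                    # task exited while we were scanning
    if name.startswith('irq/'):
        print(pid, name)
```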

 

So all in all I do really need to figure out which one of those 3 features gives me the massive boost =\ Maybe it's truly the combo of all three, but I suspect CONFIG_PREEMPT might be unnecessary here. Plus the recommended 'tick' rate for gaming kernels is a multiple of 60 (600 I hear is good), to roughly match the timing needed for frame generation. 600 would give the gaming VM thread a theoretical minimum of 10 scheduling opportunities per frame. Not bad.

 

The default 100 is pretty terrible- 10ms ticks mean you may waste up to 10ms at the start of generating a frame. You can also end up with a 'beat frequency'- interference between two frequencies (60 and 100 here) that causes regular/periodic latency changes (latency/stability 'pulsing' on a regular interval)- so you would theoretically get interference at 40Hz. The interference frequency should be either really close to 0, or really high so it's not noticeable. If it's close to 0 but not 0, you'll get really long alternating periods of bad/good latency.
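
Toy arithmetic behind picking a multiple of 60 (a sanity check, not a measurement):

```python
# If HZ is an exact multiple of the frame rate, every frame spans a
# whole number of ticks and the tick/frame phase never drifts; any
# leftover fraction wanders frame to frame and shows up as periodic
# jitter (the 'beat' described above).
FPS = 60
for hz in (100, 250, 300, 600, 1000):
    print(f'HZ={hz:4}: {hz / FPS:6.2f} ticks per frame'
          + ('  <- whole number, no phase drift' if hz % FPS == 0 else ''))
```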


@goober07 - also, I'm in the same boat as you, i5 lol. This approach has definitely let me run 2 games on the 2 VMs and saturate the CPU without noticeable latency issues. Overall performance does go down tho, because the CPU is maxed out.

 

My next improvement after getting the kernel sorted is switching to an i7 for HT. I've done extensive research on this, and in theory HT should be amazing for 100% overcommitting exactly 2 VMs- 8 logical host cores, 4 logical cores per VM.

 

The whole goal here with my kernel tuning is to let the kernel effectively share CPU resources 'equally'. The 2 problems being: Linux has to context switch procs in order to 'time share' a core, and a VM might not get to start executing immediately.

 

So, with HT, this actually takes a bunch of load off of the kernel. With the 2 VMs, Linux will spend most of its time alternating between just 2 VM threads on a single core (some other threads here and there, but mostly the VMs). With HT, we can assign a vCPU thread from each VM to each 'virtual core' of a single physical core. So the kernel has to do less, it can leave the threads in place longer, and the physical CPU itself will handle executing both threads BASICALLY AT THE SAME TIME. Big misconception with HT- both hardware threads really are equal peers. Worst case scenario is each thread performs at half the speed of the physical core. Compare that to non-HT with 4 physical cores and 8 VM threads: still roughly half the speed of a physical core each, PLUS all the timing/scheduling/context-switch issues of time dividing your 8 VM threads.
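
Finding the sibling pairs is easy from sysfs. A sketch of how I'd derive that pinning (the VM A/VM B split is just an illustration):

```python
#!/usr/bin/env python3
# Sketch: read hyperthread sibling pairs from sysfs and propose giving
# one sibling of each physical core to VM A and the other to VM B.
import glob

pairs = set()
for path in glob.glob('/sys/devices/system/cpu/cpu[0-9]*'
                      '/topology/thread_siblings_list'):
    with open(path) as f:
        pairs.add(f.read().strip())   # both siblings report the same list

for sibs in sorted(pairs):
    # lists look like "0,4" or "0-1"; on a non-HT chip each is one cpu
    cpus = sibs.replace('-', ',').split(',')
    print(f'core siblings {sibs}: VM A -> cpu{cpus[0]}, VM B -> cpu{cpus[-1]}')
```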

 

So ultimately I think HT is going to give me the boost I want, but I'll still need a high-tick kernel. We need to give Linux every ability it has to start executing a VM thread *immediately*. CPUs/cores/hyperthreads don't always perform the same- the CPU may clock down for thermal reasons, a core might suffer from intense cache misses or something, so performance is never stable.

 

What is stable, tho, is that execution starts immediately when requested. AFAIK this guarantee can only be met with either CPU pinning or a lowlatency/RT kernel.

