Win10 VM MicroStutter



Alright guys, I am experiencing microstuttering in games/benchmarks. It happens about every 1-2 seconds and does not last very long, but it will drop frames to 20fps in games. To give you an idea: I have to play games like Dark Souls 3 and Monster Hunter World at 1080p with every setting at low just to hit 55-60fps, and I still get hard fps drops. *Dark Souls 3 is capped at 60fps, but this machine should be able to run it at 4K 60fps, and if not that, then max settings at 1080p/60fps with zero drops*

 

In 3DMark Time Spy with the EVGA 2080 I get 8489, which puts it in the 82nd percentile of tested computers.

In PassMark 9 the total score is 4050, in the 79th percentile of tested computers.

PassMark's low scores show terrible 2D results: "Simple Vectors: 18, Fonts and Text: 199, Image Filters: 611, Direct 2D: 34, Complex Vectors: 58, Windows Interface: 50, Image Rendering: 750"

and a few terrible memory scores: "Database Operations: 57, Memory Latency: 39"

 

I am out of answers from searching and testing. Hopefully I can give all the information that would be helpful.

 

Build

UnRaid OS: 6.7.0-rc5

 

BASE

CPUs: Intel Xeon E5-2687W v2 3.4GHz 8-Core, QTY 2 (dual socket)

Motherboard : AsRock Rack EP2C602-4L/D16

RAM : 128GB Kingston DDR3 ECC 1866 PC3 14900

PSU : EVGA P2 1000Watt

 

STORAGE

SSD CACHE: Mushkin Source 250GB (MKNSSDSR250GB), QTY 4 in RAID 10 *I assume that is how UnRaid manages it*

HDD Stor/Par: WD Red 8TB NAS hard drives, 5400 RPM class, SATA 6Gb/s, 256MB cache, 3.5" (WD80EFAX), QTY 2 parity, QTY 1 storage

 

PCI

EVGA 2080 XC Hybrid Gaming 08G-P4-2184-KR 8GB

NVIDIA Quadro P4000 8GB

Gigabyte GeForce GTX 1070 Mini ITX OC 8GB

FebSmart 4 Ports USB 3.0 Super Fast 5Gbps PCI Express

 

Temperatures of components

CPUs are liquid cooled and run 30-35C under load

Storage drives show about 30-31C

EVGA 2080: 35C idle, 68C load

Nvidia Quadro P4000: unknown, because it is only used for hardware transcoding and doesn't see much load

Gigabyte 1070 Mini: 40C idle, 75-78C load

 

BIOS settings

All power settings are set to performance

C-states are turned off

PCIe slots are set to Gen 3

 

Unraid Settings

Tips and Tweaks

CPU Governor: Performance

Enable Intel Turbo Boost: Yes

SSD TRIM *I don't know what the settings are on there*

 

I use a cache-only share to store the disk images.

Each image is 235GB.

 

Now to tell you what I have tried.

 

Attempts at VMs:

 

I'm just going to explain what I have used and attempted *which is A LOT*. I currently DO NOT have a VM made for the 2080, because I am waiting for direction from someone with more knowledge than I have. I have made, remade, and retested around 10 clean VMs and always end up with microstuttering.

 

CPU mode: host passthrough

Logical CPUs: I have used 2 cores + 2 HTs, 4 cores + 4 HTs, and 6 cores + 6 HTs, in this pattern:

Pinned: 2/18, 3/19, 4/20, 5/21, 6/22, 7/23

Isolated: 2/18, 3/19, 4/20, 5/21, 6/22, 7/23

 

Memory: ranging from initial 2GB / max 8GB up to initial 50GB / max 50GB

 

Machine: I have used i440fx-3.1 and Q35-3.1

 

BIOS: OVMF

 

Hyper-V: tried enabled and disabled

 

VirtIO driver ISO: virtio-win-0.1.160-1.iso

vDisk bus: VirtIO

I have tried with and without a graphics ROM BIOS.

 

In Windows 10 I have all power settings set to performance, plus most of the basic stuff from Space Invader One's VM install and performance videos on YouTube.

 

I install Win10, update everything, install the latest Nvidia drivers *everything in the package*, change all PCI devices to MSI mode, and disable the Spectre and Meltdown patches *Windows patches that cause a lot of usage issues*,

make the standard recommended NVCP settings for max performance,

run a game/benchmark: hello, microstuttering.

 

Or the bare minimum:

 

I install Win10, zero updates, install the Nvidia driver only, change PCI devices to MSI mode,

make the standard recommended NVCP settings for max performance,

run a game/benchmark: hello, microstuttering.

 

I have also run these commands, to no avail:

bcdedit /set disabledynamictick yes

bcdedit /set useplatformclock true

bcdedit /set tscsyncpolicy Enhanced
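
 

*For reference, these can be undone with bcdedit's standard deletevalue form:*

bcdedit /deletevalue disabledynamictick
bcdedit /deletevalue useplatformclock
bcdedit /deletevalue tscsyncpolicy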

 

 


Add this to your XML file:

  <cputune>
	...
    ...
    ...
    <vcpupin vcpu='6' cpuset='7'/>
    <emulatorpin cpuset='0,4'/>
  </cputune>

and adjust it to your setup; this will move the QEMU overhead to whatever CPU core you wish...  By default the Linux kernel and the Windows kernel try to use the same CPU cores, and the forced handoffs and the inability of either kernel to reliably schedule tasks on those cores is what causes this sort of thing...
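
For example, a fully filled-in version for a 4-core/8-thread chip like mine might look like this (just a sketch; match the cpuset numbers to your own core/HT pairs):

  <cputune>
    <!-- guest gets cores 1-3 plus their hyperthreads 5-7 -->
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='5'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='6'/>
    <vcpupin vcpu='4' cpuset='3'/>
    <vcpupin vcpu='5' cpuset='7'/>
    <!-- keep the QEMU emulator overhead on core 0 and its HT -->
    <emulatorpin cpuset='0,4'/>
  </cputune>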

 

The extreme example of this is to also add:

... isolcpus=1-3,5-7 ....

(Again, adjust to your setup...  Any cores set this way will be unavailable to the Linux kernel, and only usable if something specifically requests them...)

to your syslinux config...  UnRaid will use CPU 0 no matter what you put here, but any cores you isolate will only be used if something like QEMU comes along and specifically requests them afterward...  This forces a separation between the Linux kernel and the Windows kernel...
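
On UnRaid that means the append line in /boot/syslinux/syslinux.cfg; a sketch (your existing append line may carry other parameters, keep those):

label unRAID OS
  menu default
  kernel /bzimage
  append isolcpus=1-3,5-7 initrd=/bzroot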

 

On my setup, an i7-7700K with 64GB RAM, I at first ran with the first core and its hyperthread reserved for UnRaid, used emulatorpin to pin QEMU to them as well (possibly redundant after CPU isolation), then ran Windows on the remaining 3 cores and their HTs...  While this worked, I realized I had about 25% of my CPU sitting idle...

Currently I have removed the isolation, which allows Linux access to all cores, set emulatorpin to only core 0 (without its HT), and run Windows on 3 full cores and one HT...  This allows Windows and Linux to share the primary core while not actually sharing it...  I've been running this way for 6 or 7 months, and think it works much better...

 

This however means that if I start using the Linux side more (or Docker as well), Linux has the ability to max out all 8 threads and kill my VM...  I currently use the Linux side sparingly, so for me, this works...


I will need an explanation of how to utilize the emulator pin correctly, so if the cpuset values look odd throughout my testing, it's because of a lack of knowledge in this area. If you don't want to read through the scores and tests, I can explain. Each test is a different emulatorpin setup, until at the very end I settled on "emulatorpin cpuset=0-4" *not that it matters what I chose, I just stuck with it*. I also made changes to the MSI priority for the RTX 2080 and other interrupts, testing overall performance and whether or not I experienced microstuttering.

 

Final result: performance increased with the RTX 2080's MSI priority set to High along with emulator pinning *regardless of selected cores*. None of the changes I have made have resolved the microstuttering.

 

Upon reinstalling the VM and testing benchmarks, I noted decreases in performance. Previous PassMark: 4050 or higher. Now: 3370 avg.

All tests show pretty consistent CPU, 3D graphics, and 2D graphics scores *albeit still low 2D graphics*.


No emulatorpin (MSI priority Normal): apparent microstutter, 3373 *memory 1467.6 disk 11086*
No emulatorpin (MSI priority High): no apparent microstutter, 3373 *memory 1525 disk 9214*

 

<emulatorpin cpuset='0,4'/>
MSI priority Normal
PassMark 3593.1 *memory 1777.4 disk 11730.1*, apparent microstutter

 

<emulatorpin cpuset='2-3,18-19'/>
MSI priority Normal
PassMark 3449.2 *memory 1760.7 disk 6517.1*, apparent microstutter

 

<emulatorpin cpuset='0-4'/>
MSI priority Normal
PassMark 3515 *memory 1770 disk 7887*, apparent microstutter
3DMark Time Spy 8863 (82nd percentile), apparent microstutter

 

<emulatorpin cpuset='0-4'/>
MSI priority High
PassMark 3633 *memory 1700 disk 10949*, apparent microstutter *even though I wished it weren't true and MSI priority High were the fix*
3DMark Time Spy 8887, apparent microstutter

 

<emulatorpin cpuset='0-4'/>
MSI priority High; changed a few other interrupts from undefined priority to normal
PassMark 3680 *memory 1678 disk 10822*, apparent microstutter
3DMark Time Spy could not run; it caused instability, so I reverted

 

<emulatorpin cpuset='0-4'/>
MSI priority High; uninstalled most Windows 10 bloatware
PassMark 3572 *memory 1788 disk 9566*, apparent microstutter
3DMark Time Spy 8679, apparent microstutter

On 3/16/2019 at 7:58 PM, member378807 said:

CPUs : Intel Xeon E5-2687W v2 3.4GHz 8 Core

I noticed your board is a dual socket board. In case you have 2x CPUs, please post the output of the following commands while you have some VMs running:

numastat qemu

and

numactl --hardware

To me it sounds like an issue mostly reported by Threadripper users. The way the CPUs are presented to the OS is different from how they actually are. In other words, the 2 separate dies of an AMD 1950X, each with 8 cores, are presented to the OS as one single CPU with 16 cores by default. Each die has its own memory controller, and the OS isn't aware of this in a UMA configuration. In that case, let's say you have a VM running on cores of node 1 and it uses memory attached to the other node: you get latency issues. In a NUMA configuration, the dies are presented individually to the OS, and you can force qemu/kvm to use only one specific node and its resources, such as memory or PCIe devices. You also have to make sure that the graphics card you pass through to a VM is placed in a slot that's directly attached to the cores the VM is using. You can check the topology with lstopo, available in GUI mode only, I think.
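
If you want to check from the command line, hwloc's lstopo can dump the picture to a file (a sketch; I'm not sure the CLI tool is included outside of GUI mode):

lstopo topology.png

Each box in the output shows which cores, memory, and PCIe devices sit on which node.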

 

UMA example

*(lstopo screenshot)*

 

NUMA example

*(lstopo screenshot)*

 

This is an example of how it looks on my system. As soon as I use more than 32GB of the 64GB total in one VM, I notice it inside the VM as higher memory access latency and also as small stutters in games.

 

EDIT:

Here is part of my tweaked VM XML, set to only use cores and RAM from a specific node.

  <vcpu placement='static'>14</vcpu>
  <iothreads>1</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='9'/>
    <vcpupin vcpu='1' cpuset='25'/>
    <vcpupin vcpu='2' cpuset='10'/>
    <vcpupin vcpu='3' cpuset='26'/>
    <vcpupin vcpu='4' cpuset='11'/>
    <vcpupin vcpu='5' cpuset='27'/>
    <vcpupin vcpu='6' cpuset='12'/>
    <vcpupin vcpu='7' cpuset='28'/>
    <vcpupin vcpu='8' cpuset='13'/>
    <vcpupin vcpu='9' cpuset='29'/>
    <vcpupin vcpu='10' cpuset='14'/>
    <vcpupin vcpu='11' cpuset='30'/>
    <vcpupin vcpu='12' cpuset='15'/>
    <vcpupin vcpu='13' cpuset='31'/>
    <emulatorpin cpuset='8,24'/>
    <iothreadpin iothread='1' cpuset='8,24'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='1'/>
  </numatune>

 


numastat qemu

 

Per-node process memory usage (in MBs)
PID                               Node 0          Node 1           Total
-----------------------  --------------- --------------- ---------------
830 (qemu-system-x86)           15740.48          735.10        16475.59
4107 (qemu-system-x86)             10.62        32321.86        32332.48
21028 (qemu-system-x86)         38827.25        11422.36        50249.61
-----------------------  --------------- --------------- ---------------
Total                           54578.35        44479.32        99057.67

 

---------------------------------------------------------------------------------------------------------------------------------------------------------------------

 

numactl --hardware

 

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 64456 MB
node 0 free: 473 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 64508 MB
node 1 free: 231 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

 

5 hours ago, member378807 said:

I will need an explanation of how to utilize the emulator pin correctly... Final result: performance increased with the RTX 2080's MSI priority set to High along with emulator pinning *regardless of selected cores*. None of the changes I have made have resolved the microstuttering. [full test results quoted in the post above]
 

Just to be clear, the English translation of

<emulatorpin cpuset='0-4'/>

is

"Pin the QEMU emulator to only use cores zero through 4 of the host" (a total of 5 cores, BTW; only 1, or maybe 2 to be safe, should be needed)

Most likely you will want to avoid using a hyphen when doing this designation...

 

The English version of:

<emulatorpin cpuset='0,4'/>

is

"Pin the QEMU emulator to only use cores zero and 4 of the host" (a total of 2, or really 1 + its HT, so about 1.25 cores)

That comma vs. hyphen makes a big difference...

 

The name of your video card threw me for a bit, because MSI actually does have something to do with this issue, but MSI has 2 different meanings in this context:

 

MSI = Micro-Star International

and

MSI = Message Signaled Interrupts (https://en.wikipedia.org/wiki/Message_Signaled_Interrupts)

 

Message Signaled Interrupts was actually the next thing I was going to have you try; I have attached a utility that may help with that...  It is a small, simple utility that reads the Windows registry settings for the PCI devices on your system and displays their MSI status, as well as letting you enable MSI if needed...  MSI basically lets the device signal its interrupts as memory writes instead of over legacy interrupt lines, which is faster and behaves far better in a VM...  Most likely, if you passed in your GPU it is not set to use MSI interrupts; this utility will let you enable it, then reboot...  I have to use it every time the drivers are installed or updated for my GTX 1070...

MSI_util.exe
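
For reference, what that utility toggles is the commonly documented MSISupported registry value; roughly like this (the device instance path is a placeholder here, not your actual one):

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\PCI\<your-GPU-instance-path>\Device Parameters\Interrupt Management\MessageSignaledInterruptProperties
    MSISupported (DWORD) = 1

Set it to 1 and reboot, and the device should switch from line-based interrupts to MSI...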


I did need context on the emulator pin. Before I got a response from bastl, I put two and two together. I didn't know if it was tied to the XML's vcpu numbering or to the host CPU numbering, but I have it set up now. I'll post what I am seeing. It seems to have made a significant difference. Is it all resolved? I need more testing, but I am on call and my time at home is limited. *I try to be as thorough as I can for all of you who are so generous with your time and knowledge*

 

As far as MSI goes: I was very aware of MSI, because I was previously getting IRQ-disable errors crashing my server.

 

I'm sure you thought "Damn, this guy got one of those Alibaba deals: an MSI EVGA 2080" lol

 

I'll post my update in a moment.


 

 

So, I made the changes to the XML and loaded the VM, and from the start it appeared that the stuttering was lessened and performance was better, but the results don't exactly express what I was seeing in PassMark. When I ran 3DMark Time Spy, the microstuttering was exactly the same.

 

What I noticed in PassMark: the CPU score went down about 2000 points, the GPU score went up 700 points, and the memory score went up 250 points, but the disk score went down 5000 points.

 

In 3DMark my score went down 1000 points.

 

XML for RTX 2080 VM

 

<vcpu placement='static'>12</vcpu>
  <iothreads>1</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='2'/>
    <vcpupin vcpu='1' cpuset='18'/>
    <vcpupin vcpu='2' cpuset='3'/>
    <vcpupin vcpu='3' cpuset='19'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='5' cpuset='20'/>
    <vcpupin vcpu='6' cpuset='5'/>
    <vcpupin vcpu='7' cpuset='21'/>
    <vcpupin vcpu='8' cpuset='6'/>
    <vcpupin vcpu='9' cpuset='22'/>
    <vcpupin vcpu='10' cpuset='7'/>
    <vcpupin vcpu='11' cpuset='23'/>
    <emulatorpin cpuset='0-1,16-17'/>
    <iothreadpin iothread='1' cpuset='0-1,16-17'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0'/>
  </numatune>

58 minutes ago, Warrentheo said:

It is also possible to manually edit the XML for the VM to create whatever configuration of NUMA nodes you wish; basically, you can make whatever kind of processor you want, as long as the Windows kernel knows what you want...

 

https://libvirt.org/formatdomain.html#elementsNUMATuning

 

I would recommend searching for "NUMA" on that page and seeing all the different settings you can configure...

I'll be on this for months lol.

Just now, Warrentheo said:

Hazards of having a server-grade dual socket board...  You need to become an I.T. admin to be able to run all the cool things on it 😛

Sigh... I figured as much. It is a lot to learn. I am not even familiar with Linux in general; I've only used it for about a month, so "ls", "cd", "mkdir", etc. is about all I've got. I really just got this so my wife and I could play games from one unit and stream HD content at the same time. I knew it came at the price of headaches, knowledge, and, well... price. But frankly, I just want to be able to play Sekiro without microstutters when it drops on the 22nd.

 

I'm all for going about learning on my own but I hit a brick wall, whether it be my own knowledge of the questions to ask or the specifics of the situation being available online for ease of access.

20 hours ago, member378807 said:

Sigh... I figured as much. It is a lot to learn. [...] But frankly, I just want to be able to play Sekiro without microstutters when it drops on the 22nd.

Here is some consolation...  The new AMD Threadripper and Epyc chips are basically multi-socket CPUs in a single package, and I am sure Intel is not too far off from the same sort of thing...  The rest of us will need to learn this stuff soon anyway; you will just be ahead of us 😛

On 3/19/2019 at 5:04 PM, Warrentheo said:

Here is some consolation...  The new AMD Threadripper and Epyc chips are basically multi-socket CPUs in a single package [...]

I might have it solved. I'm not going to say 100% yet, because I am still in initial testing. There is a possibility of two solves coming from you on a separate thread: another one you responded to about gaming performance, and then another video about overall performance in gaming. But back to your content. You posted something about the basic specs of your XML; one part had to do with passing through the host BIOS, as well as discard=unmap. I initially tried to pass through the BIOS on its own and could not boot. Removed the code and restarted: no dice. Deleted the VM and remade it, and it still wouldn't boot video. *I can't remember exactly, but I remember the last time I had that problem it took me a while to get it working again, so I decided to wipe the vdisk.*

 

*Side note: you recommended SCSI for the vdisk and I could not get it to boot.* Small fish atm.

 

I started fresh with these in the XML:

iothreadpin

emulatorpin

numatune strict nodeset 0

BIOS: host

discard unmap *see the sketch below*
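
For reference, this is roughly what the discard part looks like in the vdisk section of the XML *the paths and the controller line here are placeholders, not gospel; as I understand it, discard passthrough needs the SCSI bus with a virtio-scsi controller*:

    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
      <source file='/mnt/user/domains/Windows10/vdisk1.img'/>
      <target dev='hdc' bus='scsi'/>
    </disk>
    <controller type='scsi' index='0' model='virtio-scsi'/>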

 

I got everything going and uninstalled all the bloatware in Win10.

Ran PassMark: +1300 points overall score

+2000 CPU

+100 2D

3D score fluctuates about the same

+150-200 Memory

+11000 Disk score *this is not a typo*

 

The disk score was something to the tune of 2656:

Disk Sequential Read: 381
Disk Random Seek + RW: 194
Disk Sequential Write: 158

 

These are cache drives, SATA3, in RAID 10, so to me that makes no sense.

After I remade the VM it's about 1200MB/s read and 500-750MB/s write *NovaBench shows 2000MB/s read, 500MB/s write*

 

But for whatever reason, while I was testing before I remade this new VM, the disk performance kept dropping and dropping. I have no answers.

 

OK, so performance appears to be better in games. Not the best, but better. I still get audio popping, but that takes me to my next finding. Many sources talk about increasing the bitrate, updating the driver, or disabling applications' exclusive control of the audio device within Windows 10. The guy in the video was saying that the standard Windows audio driver did better than the Nvidia driver. There are two audio devices: the Nvidia Virtual Audio Device and the Nvidia High Definition Audio Device. I disabled the virtual audio device: no change in audio popping, clicking, or microstuttering. I installed the Microsoft audio driver, and the pops and clicks went down substantially. I played a 1kHz test tone while running a benchmark, and the only time it ever popped was while transitioning between tests. Whereas before, I could just let it play without a benchmark running and it would pop intermittently every couple of seconds.

 

So, that's the direction I'm heading, but I have more questions and fewer paths to take. As far as audio went, I made the same attempt on the last VM and experienced the same result: less popping. I even went as far as disabling all audio and removing the card's audio controller within UnRaid, and it only lowered the popping/clicking to the same point that the Microsoft audio driver did. So, I may make an assumption that if there is no audio device, then it cannot have an effect on microstutter/fps drops.

 

Also, why "WAS" my read/write performance getting crushed on my VM? I have 4x250GB cache drives in Raid10. Vdisk is on the cache. I wipe the vdisk make a new one with the same parameters except Bios =Host and Discard=unmap. Is this a permanent solution?

 

Also, I have 13 text documents of the tests I've done and the changes I've made, working through BIOS/UnRaid/Windows 10. I need to look more into UnRaid XML editing, but I honestly don't know what more I can do.

 

5 hours ago, bastl said:

Yes, sir. Well, not by means of a *manual* regedit. I used an app called MSI_util_v2. When I use it and restart the VM, I can run lspci -v in the terminal and see that MSI is enabled.
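
*For anyone checking the same thing, the output looks roughly like this; the 03:00.0 bus address is just an example, substitute your GPU's:*

lspci -v -s 03:00.0 | grep MSI
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+

"Enable+" means MSI mode is active; "Enable-" would mean the device is still using legacy line-based interrupts.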

