Kronos69 Posted June 20, 2018 (edited)

UPDATE: While I managed to greatly reduce the symptoms, the problem is still there. See page 2 for the latest updates.

ORIGINAL POST: I've got a single frustrating problem: my VMs (not Unraid, only the VMs) stutter during file transfers and high disk usage in general, originating inside the VM (Synology Cloud Station syncing from the main NAS, manual file transfers, CrystalDiskMark runs, Lightroom reading 100+ MB raw images, etc.). I've got three workstations in my office that are managed by Unraid: three W10 VMs with dedicated GPUs (two small Quadros and a 1070 FTW) on an X399/1920X/48GB-ECC machine. When disk usage spikes, one of the logical CPUs spikes to 100% usage and the whole system stutters for a few seconds (the mouse pointer freezes for a few seconds, etc.). It happens for transfers on the C: volume (a dedicated SSD for the VM, but not passed through), on the array network shares, and on other shares mounted on unassigned SSDs. Using CrystalDiskMark inside a VM on a share mounted on an unassigned M.2 device, a physically different disk from the one where the VM image is saved, the problem is still there. Today I tried isolating the CPUs (already assigned in physical core pairs to the VMs from the beginning) and using emulatorpin, pointing it to the first physical core for all VMs (the same two logical cores that I left for Unraid). Nothing changed: the load on the first CPU increased significantly, obviously, sometimes reaching 100%, but the same VM core that was reaching 100% usage before still does, and the stuttering is still there. I've read that changing the machine type from i440fx to q35 may solve this issue, but I'm worried that the licenses I've activated on the VMs may be invalidated. I can replicate the issue 100% of the time, and I'm attaching the latest diagnostics. Thanks in advance for your help.
rnzows0-diagnostics-20180620-1630.zip
Edited December 14, 2018 by Kronos69
Kronos69 (Author) Posted June 22, 2018

I'm banging my head against a wall with this :( I've searched the whole forum for days, but I can't seem to find a solution. The fact that it regularly happens on all three of my W10 VMs, and not on a bare-metal Windows setup, makes me think it could be a wrong VM configuration on my part inside Unraid, or maybe an Unraid bug. Any guesses as to what I could do to isolate the source of the problem? This is a small test configuration to see if Unraid may be suited for the rest of the office workstations, but at the moment I see it's impossible to work smoothly while the disk is under heavy load.
1812 Posted June 22, 2018

Maybe I didn't read clearly, but are all the VMs on the same SSD? What guide did you follow for the VM creation?
Kronos69 (Author) Posted June 22, 2018 (edited)

5 hours ago, 1812 said: "maybe I didn't read clearly, but are all the vm's on the same ssd? what guide did you follow with the vm creation?"

The VM FAQ and Gridrunner's YouTube videos, at least those were my main sources. Three different disks:

VM1 - unassigned M.2 SSD, XFS / raw / SCSI / unmap=discard
VM2 - unassigned SATA SSD, XFS / raw / SCSI / unmap=discard
VM3 - cache SATA SSD, btrfs / qcow2 / SCSI / unmap=discard

The SSD TRIM plugin is installed and enabled, running once a day.

The VM1 and VM2 images initially resided on the cache (I snapshotted VM2 and VM3 from VM1 at the beginning) and were later converted to XFS/raw and moved to dedicated drives. Later I changed virtio to SCSI so the SSD drives are seen as thin-provisioned and can be trimmed. I also have a 7200rpm HDD for the array and another unassigned M.2 SSD where I mounted two shares to be used as scratch disks by VM1 and VM2 with DaVinci Resolve and other media-editing software.

If, from a VM, I start a file transfer from the main/central office NAS to C: or to any share, one logical CPU immediately jumps to 100% usage and the VM starts to stutter (the screen, pointer included, completely and repeatedly freezes for a few seconds) until the transfer completes and CPU usage returns to normal. If I run CrystalDiskMark on the scratch disk, the same thing regularly happens.

Edited June 22, 2018 by Kronos69
1812 Posted June 22, 2018

5 hours ago, Kronos69 said: "VM Faq and Gridrunner's YouTube videos, at least those were my main sources. Three different disks. [snip]"

Try giving each VM one individual emulatorpin vCPU (not shared and not reserved for Unraid). I've found that sometimes, when doing video editing or heavy I/O work, the emulatorpin function can hammer a single core. What does your pinning/assignment look like for the entire CPU?
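Concretely, this lives in the <cputune> block of each VM's XML. A sketch (the cpuset values here are hypothetical; the point is that each VM gets its own spare core pair, not shared with Unraid or with another VM's vCPUs):

```xml
<cputune>
  <!-- existing vcpupin lines for this guest stay unchanged -->
  <vcpupin vcpu='0' cpuset='1'/>
  <vcpupin vcpu='1' cpuset='13'/>
  <!-- hypothetical spare core pair, used only by this VM's emulator threads -->
  <emulatorpin cpuset='6,18'/>
</cputune>
```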
Also, a few errors in the syslog, but I have no idea if they mean anything in regards to your problem:

Jun 20 16:17:22 RnzOWS0 kernel: pcieport 0000:00:01.1: AER: Corrected error received: id=0000
Jun 20 16:17:22 RnzOWS0 kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
Jun 20 16:17:22 RnzOWS0 kernel: pcieport 0000:00:01.1: device [1022:1453] error status/mask=00001000/00006000
Jun 20 16:17:22 RnzOWS0 kernel: pcieport 0000:00:01.1: [12] Replay Timer Timeout
Jun 20 16:22:00 RnzOWS0 login[9119]: ROOT LOGIN on '/dev/pts/2'
Jun 20 16:26:22 RnzOWS0 kernel: pcieport 0000:00:01.1: AER: Corrected error received: id=0000
Jun 20 16:26:22 RnzOWS0 kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
Jun 20 16:26:22 RnzOWS0 kernel: pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
Jun 20 16:26:22 RnzOWS0 kernel: pcieport 0000:00:01.1: [ 6] Bad TLP
Jun 20 16:27:27 RnzOWS0 kernel: pcieport 0000:00:01.1: AER: Corrected error received: id=0000
Jun 20 16:27:27 RnzOWS0 kernel: pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
Jun 20 16:27:27 RnzOWS0 kernel: pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
Jun 20 16:27:27 RnzOWS0 kernel: pcieport 0000:00:01.1: [ 6] Bad TLP
Jun 20 16:27:33 Rn

There are also some Threadripper-specific "things" that folks are doing to fix/improve performance. Have you looked into any of those threads?
Kronos69 (Author) Posted June 22, 2018 (edited)

1 hour ago, 1812 said: "Try giving each vm 1 individual emulator pin vcpu (non shared and not reserved for unraid) I've found that sometimes when doing video editing or heavy io work, the emulator pin function can hammer a single core."

I'll try, thanks. At the moment, when doing an intensive file transfer without an emulatorpin assigned, one of the VM's isolated vCPUs skyrockets. When the emulatorpin vCPU is assigned, it does jump to high usage (near 100% when it's shared between VMs and multiple VMs are heavily accessing the disks), so I think the emulatorpin is taking some of the load; BUT the same isolated VM vCPU that was skyrocketing before still goes to 100% anyway, in the same way.

Quote: "what does your pinning/assignments look like for the entire cpu?"

2 vCPUs: Unraid & emulatorpin (x3 VMs)
12 vCPUs: VM1 (video editing, rendering)
8 vCPUs: VM2 (light video editing, raw photo editing)
2 vCPUs: VM3 (AutoCAD, MS Office)

Nothing else is running, only the three VMs at the same time.
The RAM usage when the only Docker application (Krusader) is running alongside the VMs (with no ballooning) is between 97 and 98%.

Quote: "also, a few errors in the syslog, but I have no idea if they mean anything in regards to your problem: [syslog snipped] There are also some thread ripper specific 'things' that folks are doing to fix/improve performance. Have you looked into any of those threads?"

At the moment I've disabled deep C-states in the BIOS and enabled ZenStates. The CPU is liquid-cooled with a closed loop and a 360mm radiator, and we've got AC in the office, so it's not suffering this summer.
The CPU turbo (or whatever it's called) isn't shown inside Unraid, but when it's enabled or disabled in the BIOS I notice a replicable difference in benchmarks: even though Unraid always shows a max clock of 3.5GHz, I'm guessing the turbo is somehow working anyway.

Edited June 23, 2018 by Kronos69
1812 Posted June 23, 2018

25 minutes ago, Kronos69 said: "The ram usage when the only docked application (krusader) is running alongside the VMs (with no ballooning), is between 97 and 98%"

Have you changed your dirty cache amount? I'd also try to leave a little more headroom for Unraid to deal with caching writes to disk. You clearly don't have any OOM problems, but a little more headroom is always nice, I've found.

28 minutes ago, Kronos69 said: "2 vcpu: Unraid & emulatorpin (x3 VMs) / 12 vcpu: VM1 / 8 vcpu: VM2 / 2 vcpu: VM3"

My assumption is that none of these are stacked on each other's cores, correct?

29 minutes ago, Kronos69 said: "ATM, when doing intensive file transfer, without an emulator pin assigned, one of the VM isolated vcpu skyrockets. [snip]"

This is odd behavior. Are you using the virtio driver or the e1000 for the network?
Kronos69 (Author) Posted June 23, 2018 (edited)

18 minutes ago, 1812 said: "have you changed your dirty cache amount? I'd also try and leave a little more headroom for unraid to deal with caching writes to disk. [snip]"

First time I've read about the dirty cache; I'll search for it, thanks. How much should be OK with 48GB and everything running? 95%? This is from the diagnostics:

        total   used   free   shared  buff/cache  available
Mem:      47G    44G   467M     517M        2.1G       1.5G
Swap:      0B     0B     0B
Total:    47G    44G   467M

Quote: "my assumption is that none of these are stacked on each other's cores, correct?"

Correct, all different cores, pinning vCPUs in physical pairs. The only core that may be "shared" is the first one: the two vCPUs I've not isolated from Unraid are the same two vCPUs I'm now pointing the emulatorpin to.

Quote: "this is odd behavior. are you using the virtio driver or the e1000 for network?"

I'll check tomorrow morning when I'm in the office, but I guess it's the virtio one.

Edited June 23, 2018 by Kronos69
1812 Posted June 23, 2018

4 minutes ago, Kronos69 said: "First time I read about the dirty cache, I'll search for it. Thanks. How much should be ok, with 48GB and everything running? 95%? [snip]"

There's no correct number, but when I'm running my video editor (64 cores, 48-64GB RAM) I tend to keep system RAM usage around 90% to allow for file-transfer caching, overhead, and misc. It's more room than I need, but I know it won't cause problems. I've run into issues sitting at 96% and up.
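The "dirty cache amount" is the pair of Linux writeback thresholds, tunable via sysctl from the Unraid console. A quick way to inspect them (the write commands are commented out; the values shown are hypothetical starting points for experimenting, not recommendations):

```shell
# Current thresholds, as a percentage of RAM, at which Linux starts
# flushing dirty (cached, not-yet-written) pages out to disk
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_ratio

# To experiment with smaller values (needs root, resets on reboot):
# sysctl -w vm.dirty_background_ratio=2
# sysctl -w vm.dirty_ratio=5
```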
david279 Posted June 23, 2018

Try setting cache='none' for your vdisk. It helps with the stuttering as well.
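In the VM's XML, that is the cache attribute on the vdisk's <driver> element. A sketch (the source path is a placeholder; the attribute combination matches the XML posted later in this thread):

```xml
<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none' discard='unmap'/>
  <source file='/mnt/disks/EXAMPLE_SSD/vm/vdisk1.img'/>
  <target dev='hdc' bus='scsi'/>
</disk>
```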
DZMM Posted June 23, 2018 (edited)

5 hours ago, david279 said: "Try setting cache='none' for your vdisk. It helps with the stuttering as well."

Is this a recommended change? I just tried it on my W10 and pfSense VMs and they felt sluggish. I went back to a @gridrunner video and saw it was mentioned there as well.

Edited June 23, 2018 by DZMM
Kronos69 (Author) Posted June 24, 2018

Changed cache='writeback' to cache='none', but it's not helping; I still have immense stuttering. Should I set io to 'threads' or 'native'? Might that help? Attaching the vCPU situation inside the VM during a file transfer (sorry about the quality, screenshotted with TeamViewer on the phone). I'm guessing the stuttering is due to that high load on the CPU during file transfers: why is that, and how can I avoid it, seeing that it happens on all my VMs? With a bare-metal W10 installation on the same machine I don't see similar spikes in CPU usage during file transfers.
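For anyone testing the io question: libvirt accepts an io attribute ('threads' or 'native') on the same <driver> line as cache. A sketch of one combination to try (cache='none' uses O_DIRECT, which is the mode usually paired with io='native'):

```xml
<driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
```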
Kronos69 (Author) Posted June 24, 2018

On 6/23/2018 at 2:29 AM, 1812 said: "this is odd behavior. are you using the virtio driver or the e1000 for network?"

Checked, it should be the virtio one (see attachment). In the meantime, this is the VM1 XML (I don't know if it's visible inside the diagnostics). Maybe it can be useful. I'm betting there is some setting I've overlooked that forces an unnecessary load on the CPU.

<domain type='kvm' id='3'>
  <name>RnzOWS1</name>
  <uuid>censored</uuid>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>20447232</memory>
  <currentMemory unit='KiB'>20447232</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>10</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='13'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='14'/>
    <vcpupin vcpu='4' cpuset='3'/>
    <vcpupin vcpu='5' cpuset='15'/>
    <vcpupin vcpu='6' cpuset='4'/>
    <vcpupin vcpu='7' cpuset='16'/>
    <vcpupin vcpu='8' cpuset='5'/>
    <vcpupin vcpu='9' cpuset='17'/>
    <emulatorpin cpuset='0,12'/>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.11'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/censored_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='10' threads='1'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' discard='unmap'/>
      <source file='/mnt/disks/INTEL_SSDcensored/RnzOWS/vdisk1.img'/>
      <backingStore/>
      <target dev='hdc' bus='scsi'/>
      <boot order='1'/>
      <alias name='scsi0-0-0-2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='2'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' discard='unmap'/>
      <source file='/mnt/disks/Samsung_SSD_960_EVO_250GBcensored/RnzOWS/Scratch1.img'/>
      <backingStore/>
      <target dev='hdd' bus='scsi'/>
      <alias name='scsi0-0-0-3'/>
      <address type='drive' controller='0' bus='0' target='0' unit='3'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/user/isos/Windows.iso'/>
      <backingStore/>
      <target dev='hda' bus='sata'/>
      <readonly/>
      <boot order='2'/>
      <alias name='sata0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/user/isos/virtio-win-0.1.141-1.iso'/>
      <backingStore/>
      <target dev='hdb' bus='sata'/>
      <readonly/>
      <alias name='sata0-0-1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <controller type='usb' index='0' model='qemu-xhci' ports='15'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </controller>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <alias name='scsi0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='sata' index='0'>
      <alias name='sata0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:f7:dc:4b'/>
      <source bridge='br0'/>
      <target dev='vnet2'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/2'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/2'>
      <source path='/dev/pts/2'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/domain-3-RnzOWS1/org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'>
      <alias name='input0'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input1'/>
    </input>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x43' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <rom file='/mnt/disk1/drivers/vBIOS/GP104.rom'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x43' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='usb' managed='no'>
      <source>
        <vendor id='0x046d'/>
        <product id='0xc52b'/>
        <address bus='1' device='2'/>
      </source>
      <alias name='hostdev2'/>
      <address type='usb' bus='0' port='1'/>
    </hostdev>
    <hostdev mode='subsystem' type='usb' managed='no'>
      <source>
        <vendor id='0x2516'/>
        <product id='0x0048'/>
        <address bus='5' device='3'/>
      </source>
      <alias name='hostdev3'/>
      <address type='usb' bus='0' port='2'/>
    </hostdev>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+0:+100</label>
    <imagelabel>+0:+100</imagelabel>
  </seclabel>
</domain>
1812 Posted June 24, 2018

4 minutes ago, Kronos69 said: "Checked, it should be the virtio one (see attachment). In the meantime, this the WM1 XML [XML snipped]"

This is probably the wrong direction, but to eliminate a virtio driver issue causing load, change the network to <model type='e1000-82545em'/>. Did you also try reducing the RAM usage on the server? Alternatively, how does it act when only one VM is running? Have you tried (as suggested) using only one isolated emulatorpin per VM?
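That is, in the <interface> block of the VM's XML, only the <model> line is swapped; a sketch using the bridge name from the XML posted above:

```xml
<interface type='bridge'>
  <source bridge='br0'/>
  <!-- emulated Intel NIC instead of <model type='virtio'/> -->
  <model type='e1000-82545em'/>
</interface>
```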
Kronos69 (Author) Posted June 24, 2018

1 minute ago, 1812 said: "This is probably the wrong direction, but to eliminate a virtio driver issue causing load, change the network to <model type='e1000-82545em'/> [snip]"

I'll try, thanks! I just tested with VM3 turned off and less than 75% occupied memory; no difference, unfortunately. The isolated emulatorpin is the only test I'm still missing, because I couldn't reboot the server. In a few hours the PC won't be needed in the office and I'll try.
Kronos69 (Author) Posted June 24, 2018 (edited)

54 minutes ago, 1812 said: "This is probably the wrong direction, but to eliminate a virtio driver issue causing load, change the network to <model type='e1000-82545em'/> [snip]"

OK, I've now tried both:

- Changed the adapter model type.
- Pinned a CPU for emulatorpin that is both isolated from Unraid and not assigned to any VM.

Test conducted at 79% total RAM usage in Unraid. The situation seems a little bit better during sequential reads (maybe placebo), but it still stutters heavily during other activities. The CPU that spikes to 100% is always CPU 0 or CPU 3 inside Task Manager, even if I change the physical CPU pinnings.

Edited June 24, 2018 by Kronos69
1812 Posted June 24, 2018

1 hour ago, Kronos69 said: "Ok I now tried both [snip] The cpu that boosts to 100% is always cpu0 or cpu3 inside task manager, even if I change the physical cpu pinnings."

Is there a Windows update available?
Kronos69 (Author) Posted June 24, 2018 (edited)

7 minutes ago, 1812 said: "Is there a windows update available?"

No, I'm running 1803 with the latest updates. If only I could isolate what is causing that CPU load, I think we'd be getting really close. Unfortunately, as you can see above, Task Manager is not helping me. It seems like the process causing that load isn't "inside" Windows (and I guess this is why Windows stutters), but the load is still detected by Task Manager. I'm wondering if there is another way in Unraid to see what is causing the load.

Edited June 24, 2018 by Kronos69
1812 Posted June 24, 2018

3 hours ago, Kronos69 said: "No, I'm running on 1803 with the latest updates. [snip] I'm wondering if there is another way in unraid to see what is causing the load."

You can install the Glances plugin, which is like a version of top; it will show you what on the server is pulling high CPU load. Another general one is Netdata: it looks nicer, but in some ways gives less nitty-gritty detail.
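If you'd rather not install a plugin, a one-liner from the Unraid console gives a similar per-thread view; each qemu vCPU runs as its own host thread, so the hot one should float to the top during a transfer (this is plain procps, nothing Unraid-specific):

```shell
# Show the busiest threads system-wide, sorted by CPU usage
ps -eLo pcpu,pid,tid,comm --sort=-pcpu | head -n 11
```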
Kronos69 (Author) Posted June 25, 2018 (edited)

11 hours ago, 1812 said: "you can install the glances plugin which is like a version of top, it will show you what on the server is pulling high cpu load. [snip]"

I'm attaching three screenshots made during a CrystalDiskMark run on VM1 and VM2: the first on VM1 with a single, isolated emulatorpin, targeting the scratch SSD; the second on the same VM in the same conditions but targeting C:; the third on VM2 with two non-isolated emulatorpins, targeting C:. All of them caused stuttering, but the tests conducted on C: always bring more of it.

Edited June 25, 2018 by Kronos69
1812 Posted June 25, 2018 (edited) These are snapshots of each other from bare metal installs, right? If so, can you try creating a new Windows 10 install completely on the server, following Gridrunner's video? It doesn't have to be a huge disk image. Then, on the fresh install, try a network transfer and see if the problem persists. Then fully update Windows 10 and try again. I know this is a pain, but I'm trying to isolate whether it is the VM OS or the server (because I'm running out of ideas and am not a Threadripper owner). Also, is the board BIOS/firmware up to date? Edited June 25, 2018 by 1812
Kronos69 Posted June 25, 2018 55 minutes ago, 1812 said: These are snapshots of each other from bare metal installs, right? If so, can you try creating a new Windows 10 install completely on the server, following Gridrunner's video? It doesn't have to be a huge disk image. Then, on the fresh install, try a network transfer and see if the problem persists. Then fully update Windows 10 and try again. I know this is a pain, but I'm trying to isolate whether it is the VM OS or the server (because I'm running out of ideas and am not a Threadripper owner). Also, is the board BIOS/firmware up to date? I installed VM1 following Gridrunner's video but on btrfs qcow2, to test VM snapshots (all three were initially running on the cache), then converted the VM1 and VM2 vdisks to raw on xfs and moved them to their own dedicated drive. I'll try. I'm using a Gigabyte Aorus 7 with the second-latest BIOS, because I tried the latest one and it's bugged like hell. Not happy with this manufacturer. Just one thing: the problem doesn't happen only during network transfers, but also when generating disk load on the OS partition with CrystalDiskMark, as in screenshots 2 and 3. That is actually the worst case for stuttering.
Kronos69 Posted June 25, 2018 11 hours ago, 1812 said: can you try and create a new windows 10 install completely Ok, I've done that: I freshly installed VM4. The XML follows below. With or without emulatorpin, with or without cache='none', or changing i440fx to q35 (see link below) after the install, nothing changes about the stuttering or the CPU usage. The only difference was that with cache set to 'none', CrystalDiskMark speeds were significantly lower. Also, I've got Hyper-V enabled this time. At the beginning it seemed that the fresh install had solved the problem, but I left the PC updating for a few hours, rebooted, and the stuttering is back in all its glory. VMs 1-2-3 have the latest Windows updates, while VM4 is still on a 17xx build, so I doubt it's a recent update that messed things up (and I've had this problem for months, since the beginning, across various Windows and unRAID versions). P.s.: unrelated, but with Hyper-V enabled, even though the stuttering doesn't go away, disk performance seems better. However, when I try to enable it on the old VMs that have it disabled (due to previous compatibility issues with the Nvidia passthrough), Windows repeatedly fails to boot (error INTERRUPT_EXCEPTION_NOT_HANDLED). This could be a question for another topic.
<domain type='kvm' id='21'>
  <name>RnzOWS4</name>
  <uuid>nope</uuid>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='10'/>
    <vcpupin vcpu='1' cpuset='22'/>
    <vcpupin vcpu='2' cpuset='11'/>
    <vcpupin vcpu='3' cpuset='23'/>
    <emulatorpin cpuset='13'/>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.11'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/uuid_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vendor_id state='on' value='none'/>
    </hyperv>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='4' threads='1'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='hypervclock' present='yes'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' discard='unmap'/>
      <source file='/mnt/cache/domains/RnzOWS/vdisk4.img'/>
      <backingStore/>
      <target dev='hdc' bus='scsi'/>
      <boot order='1'/>
      <alias name='scsi0-0-0-2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='2'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/user/isos/Windows.iso'/>
      <backingStore/>
      <target dev='hda' bus='ide'/>
      <readonly/>
      <boot order='2'/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/user/isos/virtio-win-0.1.141-1.iso'/>
      <backingStore/>
      <target dev='hdb' bus='ide'/>
      <readonly/>
      <alias name='ide0-0-1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <controller type='usb' index='0' model='qemu-xhci' ports='15'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </controller>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <alias name='scsi0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='ide' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:30:63:1f'/>
      <source bridge='br0'/>
      <target dev='vnet1'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/1'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/1'>
      <source path='/dev/pts/1'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/domain-21-RnzOWS4/org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'>
      <alias name='input0'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input1'/>
    </input>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x08' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='usb' managed='no'>
      <source>
        <vendor id='0x062a'/>
        <product id='0x4101'/>
        <address bus='5' device='2'/>
      </source>
      <alias name='hostdev2'/>
      <address type='usb' bus='0' port='1'/>
    </hostdev>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+0:+100</label>
    <imagelabel>+0:+100</imagelabel>
  </seclabel>
</domain>
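For reference, the vcpupin pairs in the XML above (10/22 and 11/23) follow the usual Linux CPU enumeration on a 12-core/24-thread 1920X, where logical CPU n and n+12 are SMT siblings on the same physical core (this can be confirmed on the host with lscpu -e). A small sketch of that mapping (the helper name is mine, just for illustration):

```python
def smt_siblings(core: int, physical_cores: int = 12):
    """Logical CPU pair for one physical core, assuming the common
    Linux enumeration where the sibling thread is core + physical_cores."""
    if not 0 <= core < physical_cores:
        raise ValueError("core out of range")
    return (core, core + physical_cores)

# cpusets for a VM pinned to two physical cores, e.g. cores 10 and 11:
pins = [cpu for c in (10, 11) for cpu in smt_siblings(c)]
# pins == [10, 22, 11, 23]
```

Keeping each vCPU pair on one physical core like this avoids splitting a core between two VMs, which matters on Threadripper where cross-die scheduling adds latency.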
Kronos69 Posted June 27, 2018 (edited) Guys, @1812, I'm really, really stuck on this. I also tried both io options (native and threads) without noticing any difference. The things that seem to somewhat mitigate the issue are pinning the emulator to an isolated CPU and having Hyper-V enabled (I did that on a new VM, because I'm not able to enable it on the old ones), but maybe it's placebo. cache='none' also seems to help, but it really impacts performance (I've got a huge UPS, so power losses don't scare me that much). I still suffer from bad stuttering and 100% CPU load during disk I/O, especially random I/O. The problem doesn't seem to reside inside the guests but in the unRAID host (maybe some misconfiguration on my part). What other steps can I take to help you/anyone isolate what may be causing this issue on every VM I create? Could it be that my cache is btrfs, and that all my vdisks were first created on a btrfs filesystem and later transferred onto an xfs-formatted drive? I'm really running out of ideas. I'd really like to use unRAID to manage more workstations in my office, but I've got to get rid of this problem first. Attaching the updated diagnostics again. rnzows0-diagnostics-20180628-0007.zip Edited June 27, 2018 by Kronos69
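On the cache='none' slowdown: with writeback caching the host page cache absorbs the guest's writes, while cache='none' opens the vdisk with O_DIRECT so each write waits for the actual device, so lower benchmark numbers are expected and don't by themselves indicate a misconfiguration. A rough stand-alone illustration of the same effect, comparing a buffered write against one forced to stable storage with fsync (this is not the VM datapath, just the page-cache behaviour it relies on):

```python
import os
import tempfile
import time

def timed_write(data: bytes, sync: bool) -> float:
    """Write data to a temp file and return the elapsed seconds.
    With sync=True, fsync forces the data to the device before returning,
    roughly the per-request guarantee that cache='none'/'writethrough' give."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.perf_counter()
        os.write(fd, data)
        if sync:
            os.fsync(fd)  # block until the data reaches stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return elapsed

buffered = timed_write(b"x" * (1 << 20), sync=False)
durable = timed_write(b"x" * (1 << 20), sync=True)
# on a real disk, durable is typically much larger than buffered
```

The gap between the two numbers is the host cache doing the work that CrystalDiskMark was previously measuring.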
1812 Posted June 27, 2018 (edited) 3 hours ago, Kronos69 said: Guys, @1812, I'm really, really stuck on this. I also tried both io options (native and threads) without noticing any difference. The things that seem to somewhat mitigate the issue are pinning the emulator to an isolated CPU and having Hyper-V enabled (I did that on a new VM, because I'm not able to enable it on the old ones), but maybe it's placebo. cache='none' also seems to help, but it really impacts performance (I've got a huge UPS, so power losses don't scare me that much). I still suffer from bad stuttering and 100% CPU load during disk I/O, especially random I/O. The problem doesn't seem to reside inside the guests but in the unRAID host (maybe some misconfiguration on my part). What other steps can I take to help you/anyone isolate what may be causing this issue on every VM I create? Could it be that my cache is btrfs, and that all my vdisks were first created on a btrfs filesystem and later transferred onto an xfs-formatted drive? I'm really running out of ideas. I'd really like to use unRAID to manage more workstations in my office, but I've got to get rid of this problem first. Attaching the updated diagnostics again. rnzows0-diagnostics-20180628-0007.zip I'm running out of ideas, more so since I don't own any Threadripper servers. Have you seen this: https://lime-technology.com/forums/topic/59744-ryzen-build-progress-with-gpu-nvme-pass-through/?do=findComment&comment=588245 It's older, but they also discuss stuttering in that thread. Edited June 28, 2018 by 1812