Freeze/Stuttering - Threadripper 1920x - Windows 10 VM



57 minutes ago, 1812 said:

 

I'm running out of ideas, more so since I don't own any AMD servers.

 

Have you seen this: https://lime-technology.com/forums/topic/59744-ryzen-build-progress-with-gpu-nvme-pass-through/?do=findComment&comment=588245

 

 

It's older, but they also discuss stuttering in the thread.

 

I'll try what's inside there, but you also just gave me another idea:

 

My "old" home PC is very similar as far as storage options go, but Intel based: I've got a 7700K on a Z270 board with a 1070 FTW identical to the one I'm passing through to VM1, an M.2 960 Evo as the system disk, an M.2 970 Evo as a scratch disk, and a 7200rpm HDD as "intermediate" storage between the PC and the very same central NAS (thanks to a UBNT radio link I share my network - and fiber connection - with the office).

 

I'll try a test Unraid installation there with identical settings (minus the AMD-specific ones), using the 970 Evo to host the VM vdisk, to see if changing platforms changes anything at all.

 

If it doesn't change anything, I'm guessing it must be a misplaced server setting or an Unraid/KVM bug.

Link to comment

@1812 thanks to SSDs it was quick to set up - unfortunately the CPU spikes and the stuttering do seem to be there on the Intel platform, too.

 

1 vCPU for Unraid, one for the emulator pin, two for the VM (the 7700K isn't generous with cores). 8GB of RAM to the VM, out of 16GB total.

Vdisk mounted via SCSI, discard=unmap, cache=none or writeback, io=native or threads.
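For clarity, the relevant piece of the vdisk definition I'm describing looks roughly like this - just a sketch with a placeholder path and target, and note that the SCSI bus needs a virtio-scsi controller defined alongside it:

<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
  <!-- cache='writeback' and io='threads' are the other combinations I'm testing -->
  <source file='/mnt/user/domains/Win10test/vdisk1.img'/>
  <target dev='sdb' bus='scsi'/>
  <boot order='1'/>
</disk>
<controller type='scsi' index='0' model='virtio-scsi'/>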

 

I'll do other tests tomorrow morning to be 100% sure, but I think it's not an AMD-related issue. With intensive random I/O on SSDs with CrystalDiskMark, it seems to always happen. The CPU spike starts and stops with the disk access, and it doesn't happen on the bare metal Windows install on the same machine.

Edited by Kronos69
Link to comment
12 hours ago, Kronos69 said:

Could it be that my cache is BTRFS, and that all my vdisks were created first on a BTRFS filesystem and later transferred onto an XFS-formatted drive? I'm really running out of ideas.

 

I doubt this would be an issue but you never know. I could run a few tests on my setup this afternoon to see if I have similar problems. I'm in the middle of a parity check right now.

 

My Win 10 VMs were all created new a few months ago on Q35 and UEFI, each with its own GPU. They all use a raw image for the vdisk, which resides on an SSD cache drive formatted as XFS. My hardware is in my signature. Maybe this provides useful info, I don't know. But I'll try it in a few hours.

 

My VMs are mostly used to run Emby Theater in other rooms of the house, and they have no problem streaming Blu-ray quality movies at the same time.

Link to comment
15 hours ago, GHunter said:

As a follow-up, I transferred 10 gigs of large files and then 10 gigs of small files back and forth between my VM and Unraid. My CPU usage showed between 34 and 44 percent. No stuttering in my VM during the file transfers.

 

Any visible spikes of CPU activity during those transfers?
May I ask you for your VM XML and Unraid syslinux config?
May I also ask you to run one test with CrystalDiskMark and see if you notice stuttering of the pointer during the random read test (the second one after launching "all" tests)?

Edited by Kronos69
Link to comment

Another thing I could try: following one of gridrunner's tutorials for Ryzen systems, I have rcu_nocbs=0-23 in my syslinux as an added measure to avoid system hangs, but I'm not fully aware of what that option does. I'll try disabling it and see if it changes anything.
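As far as I understand, rcu_nocbs=0-23 tells the kernel to offload RCU callback processing from those CPUs onto dedicated kernel threads, so the cores reserved for the VMs aren't interrupted by that housekeeping. In the syslinux config it just sits on the append line, something like this (other flags left out here):

label unRAID OS
  menu default
  kernel /bzimage
  append rcu_nocbs=0-23 initrd=/bzroot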

Link to comment

No spikes during file transfers. CPU usage graph in resource monitor just bounces around 34 to 44 percent. I tried to run Crystaldiskmark the other day and it could not detect a hard drive. Probably because I use a vdisk instead of passing through an SSD like you are doing. If you have something else you'd like me to try, let me know and I'll give it a shot.

 

My VM XML only has a manual edit for emulator pinning. Syslinux has pins isolated for VM use only.

 

LT is working on v6.6, which will be on the latest Linux kernel and have updated QEMU and libvirt. This might give you better performance and stability on the latest AMD CPUs. I'd be anxiously waiting for the RC to see if it helps you.

 

VM XML

<domain type='kvm'>
  <name>Emby Theater - Living Room</name>
  <uuid>7ea7321d-aaea-4d52-a39f-4c86eb882ba3</uuid>
  <description>Living room VM running Windows 10 and Emby Theater.</description>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>2</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='2'/>
    <vcpupin vcpu='1' cpuset='6'/>
    <emulatorpin cpuset='0,4'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-q35-2.11'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/7ea7321d-aaea-4d52-a39f-4c86eb882ba3_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vendor_id state='on' value='none'/>
    </hyperv>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='1' threads='2'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='hypervclock' present='yes'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/user/VMs/Emby Theater - Living Room/vdisk1.img'/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/user/Programs/DreamSpark/Windows 10/Windows10All.iso'/>
      <target dev='hda' bus='sata'/>
      <readonly/>
      <boot order='2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/user/Programs/Virtualization ISOs/Stable/virtio-win-0.1.141.iso'/>
      <target dev='hdb' bus='sata'/>
      <readonly/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <controller type='usb' index='0' model='qemu-xhci'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0xa'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0xb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0xc'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0xd'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x5'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0xe'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x6'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:c0:96:af'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x02' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='usb' managed='no'>
      <source>
        <vendor id='0x147a'/>
        <product id='0xe042'/>
      </source>
      <address type='usb' bus='0' port='1'/>
    </hostdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </memballoon>
  </devices>
</domain>

 

syslinux.cfg

default menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 2
label unRAID OS
  menu default
  kernel /bzimage
  append isolcpus=1,2,3,5,6,7 pcie_acs_override=downstream initrd=/bzroot
label unRAID OS GUI Mode
  kernel /bzimage
  append isolcpus=1,2,3,5,6,7 pcie_acs_override=downstream initrd=/bzroot,/bzroot-gui
label unRAID OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append pcie_acs_override=downstream initrd=/bzroot unraidsafemode
label unRAID OS GUI Safe Mode (no plugins)
  kernel /bzimage
  append pcie_acs_override=downstream initrd=/bzroot,/bzroot-gui unraidsafemode
label Memtest86+
  kernel /memtest

 

Link to comment
  • 3 weeks later...
On 6/30/2018 at 8:13 AM, Kronos69 said:

Another thing I could try: following one of gridrunner's tutorials for Ryzen systems, I have rcu_nocbs=0-23 in my syslinux as an added measure to avoid system hangs, but I'm not fully aware of what that option does. I'll try disabling it and see if it changes anything.

 

OK, I found I had this issue with my VM as well: all sorts of CPUs would ramp up when doing a disk speed test. I'm using an R7 2700X on the Gigabyte Gaming 5 board and everything else works perfectly. I picked up one of those Samsung NVMe drives on sale at Best Buy to test; with the drive passed through to the VM and the VM cloned onto it... I no longer have the issue. I passed the entire controller through as well, no block device passthrough. I see no CPUs going crazy or anything.

 

beware 4k screenshot

samsung  nvme samsung driver.jpg

Edited by david279
details
  • Like 1
Link to comment
  • 2 weeks later...
On 7/19/2018 at 10:03 PM, david279 said:

 

OK, I found I had this issue with my VM as well: all sorts of CPUs would ramp up when doing a disk speed test. I'm using an R7 2700X on the Gigabyte Gaming 5 board and everything else works perfectly. I picked up one of those Samsung NVMe drives on sale at Best Buy to test; with the drive passed through to the VM and the VM cloned onto it... I no longer have the issue. I passed the entire controller through as well, no block device passthrough. I see no CPUs going crazy or anything.


I finally had the time to try this, and a few blue screens of despair later, it WORKS! Even without applying any of the optimizations I tried earlier in this topic, the overhead that was causing the stuttering is gone. Thanks!

--

I'll try to describe the process I followed here to help someone else.

I first tried to pass through with this method: SSD Passthrough (I had to change the ata- prefix to nvme-, everything else the same), but I noticed no real difference because, as you suggested, the entire controller needs to be passed through. So I then followed this method: NVMe controller passthrough, including not stubbing the controller but using the hostdev XML provided in the video description, with a few differences:


1- I used MiniTool Partition Wizard to migrate the OS, selecting "copy partitions without resize" to avoid the recovery partition being unnecessarily stretched, and immediately afterwards stretched the C: partition, leaving 10% for overprovisioning.


2- With the most recent Unraid version it seems the modified Clover isn't necessary: you simply stub the controller in the syslinux configuration (a sketch of that stub line is below, after these steps) or add the hostdev to the VM XML and click update, then you specify the boot order by adding <boot order='1'/> after the source, so that it looks something like this:

<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>
  </source>
  <boot order='1'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
</hostdev>


Then the device should be visible and selectable inside the GUI editor. You then simply have to select "none" as the primary vdisk location, update again, check that the boot order is still there inside the XML, and then boot the VM up.

I had to reboot a few times: in the Windows recovery options that followed the first blue screen (telling me "no boot device" or something like that), I selected the "boot recovery" option (I'm not sure that's the exact name because my interface isn't in English), rebooted twice more, and it worked. I simply had to reinstall my NVIDIA drivers again, don't ask me why :)
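For completeness, the syslinux stub alternative mentioned in step 2 would be an append line along these lines - the vendor:device ID here is only an example, the real one comes from Tools > System Devices or lspci -nn:

label unRAID OS
  menu default
  kernel /bzimage
  append vfio-pci.ids=144d:a804 pcie_acs_override=downstream initrd=/bzroot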

 


--

With my configuration, since I wanted to pass through the same SSD that the vdisk was sitting on, I had to move the vdisk onto another disk with Krusader and then select the new location inside the GUI editor. Unlike me, make TWO copies on the other drive, one as a backup, because something might simply go wrong and corrupt your vdisk.

--

It works with the NVMe drives, and now I want to try this method with the SATA SSDs, too. The problem is that isolating the SATA controller in its own IOMMU group isn't that easy.
With the second-to-last stable BIOS of my X399 Aorus, F3g (F3j was bugged as hell), it simply isn't possible, even with the ACS override patch enabled: the SATA controllers are always grouped with something else. After updating to the latest F10 BIOS with the new AGESA, it seems to be feasible.

The obstacle I'm trying to overcome now is understanding which SATA controller I need to pass through without messing everything up.

I installed a plugin to run scripts inside the GUI via Community Applications, then ran an IOMMU script I found on Reddit to figure out which SATA controller I need to pass through.
It seems that right now every SATA drive is under the same SATA controller, but later I'll try changing connectors.
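In essence that script just walks /sys/kernel/iommu_groups and prints what shares a group with what; a minimal equivalent looks like this:

#!/bin/bash
# Print every IOMMU group with the PCI devices it contains,
# to see what would get dragged along with the SATA controller.
for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group ${g##*/}:"
  for d in "$g"/devices/*; do
    echo -e "\t$(lspci -nns "${d##*/}")"
  done
done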

I'll keep this topic updated!

  • Upvote 1
Link to comment
  • Kronos69 changed the title to [SOLVED] Windows 10 VMs stuttering during file transfers
  • 1 month later...

I never managed to pass through the SATA SSD controller.


Now the stuttering on the SSD is gone, as written above, but if I try to write from inside the VM to an Unraid share, the system hangs in the same way (if not worse).
High single-thread CPU usage shows up during all intensive network usage, and again it's not the emulator pin but one of the VM cores.

I'll open another topic about this problem, but I feel that the bug is somehow related.

Link to comment
10 minutes ago, Kronos69 said:

I never managed to pass through the SATA SSD controller.


Now the stuttering on the SSD is gone, as written above, but if I try to write from inside the VM to an Unraid share, the system hangs in the same way (if not worse).
High single-thread CPU usage shows up during all intensive network usage, and again it's not the emulator pin but one of the VM cores.

I'll open another topic about this problem, but I feel that the bug is somehow related.

 

Have you tried the 6.6 RC yet? It has QEMU 3.0, an updated kernel and a bunch of other updates - try it and see if it's still happening.

Link to comment
3 hours ago, david279 said:

 

Have you tried the 6.6 RC yet? It has QEMU 3.0, an updated kernel and a bunch of other updates - try it and see if it's still happening.

OK, now I tried - I updated to 6.6.0 RC2 and installed the new virtio drivers. It's probably better with passed-through SSDs on non-passed-through controllers (passed-through SSDs with passed-through controllers were already OK), but still the same with network shares.

 

I then tried changing the Ethernet adapter from virtio to e1000, and flagging/unflagging it in the MSI interrupts tool - nothing, maybe worse.

 

To easily trigger the stuttering/hangs, I just need to perform a sequential read test with CrystalDiskMark on ANY network folder, even folders that aren't Unraid-related (for example, an SMB share on the central NAS).

 

Bypassing emulation by passing through the SSD controller solved the issue with the SSDs, but the problem remains because I cannot pass through the Ethernet adapter, since I have to share it between 3+ VMs and Unraid.

Because of this, any time I move files between the VM and the network, I still experience heavy stuttering.
_
Edited the title instead of creating a new topic, because I think that the two issues are related.

Edited by Kronos69
Link to comment
  • Kronos69 changed the title to [Partially SOLVED] Windows 10 VMs stuttering during file transfers
  • 3 weeks later...

Update, hoping this is helpful to other Ryzen/Threadripper users.

 

Now I’m pretty sure that the problem with network file transfers and the hiccups is due to system interrupts.

 

The single-core, almost-100% utilization spikes are correlated with a spike in "system interrupts" resource usage inside the Windows resource monitor (it's less than 10% usage overall, but with this many cores that's enough to saturate one core).

 

After upgrading to 6.6.1 and creating a new VM from scratch again, using Q35 QEMU 3.0 as the machine type, enabling Hyper-V and using the MSI interrupts tool I downloaded from one of gridrunner's videos to select MSI interrupts for the video card, the sound card (!) and the virtio eth adapter (a registry sketch follows below), the problem SEEMS to have vanished

 

On a W10 VM

With only two cores pinned

Even without an emulator core pinned

Even using a standard vdisk image on the cache drive, and not passing through anything beyond the gpu
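For anyone without the tool: as far as I know, it just flips a single registry value per device under that device's instance key (the instance path below is a placeholder), followed by a reboot of the VM:

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\PCI\<device-instance-path>\Device Parameters\Interrupt Management\MessageSignaledInterruptProperties]
"MSISupported"=dword:00000001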

 

I’ll test it further, no news = good news.

 

I'm now trying to convert at least one of the old VMs, to see if I can avoid reinstalling every VM from scratch.

 

I managed to change the machine type to Q35 without deactivating Windows - I created a new VM in the GUI with Q35, assigned the passthrough NVMe where the VM was installed, and copied the old UUID into the new XML (but left the new UUID in the file path just below that).
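Roughly, the parts of the new XML I touched look like this (UUIDs replaced with placeholders, paths as in the stock template):

<name>VM1</name>
<uuid>OLD-UUID-COPIED-FROM-THE-ORIGINAL-VM</uuid>
<!-- ...rest of the template as generated by the GUI... -->
<os>
  <type arch='x86_64' machine='pc-q35-3.0'>hvm</type>
  <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
  <nvram>/etc/libvirt/qemu/nvram/NEW-UUID-LEFT-AS-GENERATED_VARS-pure-efi.fd</nvram>
</os>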

 

Windows automatically updated the missing drivers after a while, and after a few reboots it’s now working.

 

But the stuttering isn’t gone.

 

So I guess the culprit is Hyper-V, which needs to be activated (I had it turned off because older Unraid versions had problems with Hyper-V enabled and non-Quadro GPU passthrough).

 

I'm not able to enable Hyper-V without running into an endless boot loop of blue screens of defeat - does anyone know how to do that, or do I have to reinstall all the VMs from scratch?

Edited by Kronos69
Link to comment
  • Kronos69 changed the title to Freeze/Stuttering - Threadripper 1920x - Windows 10 VM

I was about to throw in the towel last week, accept the failure and replace the TR server with 3 separate physical workstations, but I don't want to "go gentle into that good night", so let's see if putting together all the info I've collected so far gives some expert an idea of how to solve this.

 

Quick recap:
 

  • I'm experimenting with using Unraid to serve 3 nearby workstations at work, repurposing "old" gaming and office components, but the stuttering has been plaguing us for months; we have to locally pause the file sync with the main server while working.
     
  • When moving/accessing files inside a W10 VM on this Threadripper 1920X build, strong stuttering occurs. It comes in bursts of continuous micro/macro freezes, ranging from a fraction of a second to 4-5 seconds, and it's really noticeable because even the pointer gets stuck.
     
  • Passing a storage device through by-id via SCSI helps a lot (see the sketch right after this list), and an NVMe controller "bare metal" passthrough almost solves the problem for local files (latency is still high but there's no perceivable stuttering). I still can't for the life of me solve the issue with network file transfers (LAN via Ethernet AND local Unraid shares mounted via SMB).
     
  • The issue is apparently NOT related to writes, only to reads. Random reads usually give more trouble than sequential reads.
     
  • It's apparently NOT an environmental issue (broadcast storms, or something like that) because I can replicate it with the Ethernet cable disconnected, simply by running CrystalDiskMark pointed at an Unraid SMB share on the array.
     
  • Reading from an SMB share with CrystalDiskMark, one logical core almost always hovers around 50% and another logical core hovers around 90% (see attached screenshots), with a peak of "system interrupts" activity inside the task manager. Otherwise, they idle normally.
     
  • The faster the share, the worse the stuttering, apparently.
     
  • Q35 3.0 seems to help.
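For reference, the by-id SCSI passthrough mentioned in the list above boils down to a disk entry along these lines (only a sketch - the by-id serial is a placeholder, and it needs a virtio-scsi controller alongside it):

<disk type='block' device='disk'>
  <driver name='qemu' type='raw' cache='none' discard='unmap'/>
  <source dev='/dev/disk/by-id/ata-Crucial_CT120BX300SSD1_SERIALNUMBER'/>
  <target dev='sdb' bus='scsi'/>
</disk>
<controller type='scsi' index='0' model='virtio-scsi'/>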

 

Test hardware, updated:
 

  • Mobo - Gigabyte X399 Aorus Gaming 7 rev 1.0
  • Processor - Threadripper 1920X
  • RAM - 48GB (6x8GB) Kingston KVR24E17S8/8MA 8 GB, DDR4 2400 MHz, CL17, ECC Unbuffered - To be expanded to 64GB (8x8GB) in a few days.
  • 1 x EVGA 1070 FTW3 - VM1
  • 1 x Quadro P400 - VM2
  • 1 x Quadro K620 - VM3
  • 3 x 120GB SATA SSDs (2x Crucial BX300, 1x KingDian cheapo) - VM OS disks
  • 1 x 2.5" 7200rpm 1TB HDD - HGST - only and lonely local array disk
  • 1 x 120GB NVME M.2 PCIe - Intel 600p - VM2 scratch disk
  • 1 x 250GB NVME M.2 PCIe - Samsung 960 evo - VM1 scratch disk
  • 1 x StarTech PEXUSB3S42 4-Port PCI Express USB 3.0 Card - 1 USB controller, 3 ports (+1 internal) passed through to VM1 or VM2
  • 1200W PSU
     

Environment:
 

  • 1Gbe LAN, nearest router an old Netgear R7000 with dd-wrt custom firmware, 2 unmanaged switches between the Unraid tower and the main server.
  • Main office server, a 918+ with 4xWD REDs in RAID 10 + RAID 1 NVME R/W cache
     

Test software, updated:
 

  • MB BIOS F11e - AGESA 1.1.0.1a - latest
  • Windows 10 Pro 1809 - latest
  • Unraid 6.6.6 - latest
  • CrystalDiskMark 6.0.2 - latest
  • Office server OS: DSM 6.2.1-23824 Update 2 - latest
  • Office server sync software: Synology Drive 1.1.1-10562 - latest
  • latest NVIDIA drivers for GTX and Quadros
  • virtio-win-0.1.160-1 - latest
     

Configuration (see attached pictures and diagnostics) :
 

  • BIOS memory interleaving setting: channel
  • ZenStates ON
  • CPU scaling governor: performance (no clear difference from previous "on demand"), turbo boost enabled
  • PCIe ACS: downstream
  • NIC flow control and offload: disabled
  • isolcpus - every cpu, except 0,1 pair
  • rcu_nocbs removed from the last configuration because it should already be implemented in newer Unraid versions
  • Q35 3.0, OVMF, Hyper-V enabled, USB controller 3.0 QEMU XHCI
  • VM1 (4K video editing): 19456MB RAM, GTX 1070 (with ROM) + 1 SATA SSD by-id SCSI cache=none discard=unmap + 1 NVMe SSD bare metal + 1 PCIe-USB adapter, vcpupin from 12 to 23, numatune memory mode='preferred' nodeset='1' (sketched below, after this list)
  • VM2 (RAW picture editing): 15360MB RAM, Quadro P400 + 1 SATA SSD by-id SCSI cache=none discard=unmap + 1 NVMe SSD, vcpupin from 6 to 11, numatune memory mode='preferred' nodeset='0'
  • VM3 (light CAD): 8192MB RAM, Quadro K620 + 1 SATA SSD by-id SCSI cache=none discard=unmap, vcpupin from 2 to 5, no numatune at the moment (waiting to get more RAM)
  • the NVMes are not isolated via vfio-pci.ids, only via an XML hostdev entry
  • using the MSI interrupts software I found in one of gridrunner's tutorials, I enabled MSI interrupts on everything listed there
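To make the vcpupin/numatune notation in the VM1 entry above concrete, here is roughly what it translates to in the XML (a trimmed sketch - the middle vcpupin entries are omitted):

<vcpu placement='static'>12</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='12'/>
  <vcpupin vcpu='1' cpuset='13'/>
  <!-- ...vcpupin entries for cpusets 14 to 22... -->
  <vcpupin vcpu='11' cpuset='23'/>
</cputune>
<numatune>
  <memory mode='preferred' nodeset='1'/>
</numatune>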
     

Result:

DAMNED STUTTERING :(

I've seen in a recent video tutorial that @SpaceInvaderOne has experimented with both the 1950X and the 2990WX - are you experiencing something like this, or do you have an idea of how to circumvent it? To replicate the issue 100% of the time with the above setup, it's sufficient to launch CrystalDiskMark against an Unraid SMB share from a W10 VM.

 

-


P.s.: in the green/red scheme below (made by Gigabyte support) that I found somewhere some time ago (maybe here, maybe on the Gigabyte forum), the DIE 0 and 1 denominations are inverted compared to lstopo.

 

rnzows0-diagnostics-20181214-0553.zip

 

IMG-7655.JPG

Latencymon after load.JPG

Latencymon cpus.JPG

Latencymon drivers.JPG

Load taskmanager.JPG

Lstopo topology.jpg

Random read.jpg

Random write.jpg

Sequential read.jpg

x399 Aorus 7 block diagram.jpg

X399 Aorus 7 layout.JPG

X399 AORUS Gaming 7 - PCIe Lane Allocation by Gigabyte support.jpg

X399 Topology by AMD infographics.jpg

Edited by Kronos69
Link to comment

@Kronos69 How many cores does each machine have? I assume all 3 are running at the same time? Do you leave the 0/12 core pair free and not assigned to anything?

 

Can you try the below instead of host-passthrough on at least one machine, if not all 3, making sure to set the cores and threads properly?

 <cpu mode='custom' match='exact' check='partial'>
    <model fallback='allow'>EPYC-IBPB</model>
    <topology sockets='1' cores='8' threads='2'/>
    <feature policy='require' name='topoext'/>
  </cpu>

 

Did you allocate your RAM in such a way that the cores assigned to each VM only receive RAM from that die/memory controller?

Edited by Jerky_san
Link to comment
2 hours ago, Jerky_san said:

@Kronos69 How many cores does each machine have? I assume all 3 are running at the same time? Do you let cores 0/12 pair be free and not assigned to anything?

At the moment

 

Core 0,1 free for Unraid to use

Core 2-23 isolated from Unraid

 

Core 2-5 VM3

Core 6-11 VM2

Core 12-23 VM1

 

“-“ meaning “to”

 

Given the topology, the core/hyperthread pairing should be 0 and 1, 2 and 3, etc., and it's correctly picked up by recent Unraid versions.
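A quick way to double-check the pairing from the Unraid console (generic commands, nothing specific to my box):

# Show the logical CPU -> core / NUMA node mapping
lscpu --extended=CPU,CORE,NODE

# Or list the hyperthread sibling(s) of a given CPU
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list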

 

Regarding the stuttering, it doesn’t differ if used together or alone.

 

2 hours ago, Jerky_san said:

 

Can you try the below instead of passthrough on at least one machine if not all 3 making sure to set the core and threads properly?


 <cpu mode='custom' match='exact' check='partial'>
    <model fallback='allow'>EPYC-IBPB</model>
    <topology sockets='1' cores='8' threads='2'/>
    <feature policy='require' name='topoext'/>
  </cpu>

I’ll try tomorrow and report back

 

2 hours ago, Jerky_san said:

Did you allocate your ram in such a way that the cores assigned to the CPU only receive ram from that die/memory controller?

 

As reported above, the RAM BIOS interleaving setting is set to “channel”, and every VM except VM3 had numatune set as preferred on its die.

 

Today I installed the last DIMM modules, totaling 64GB, so I gave VM3 the “preferred” setting, too.

 

Before, “strict” wasn't really assigning everything to the correct die; a small part of the other memory banks was always still used. I'll try “strict” again with the new configuration to see if something has changed.

Link to comment
20 minutes ago, Kronos69 said:

At the moment

 

Core 0,1 free for Unraid to use

Core 2-23 isolated from Unraid

 

Core 2-5 VM3

Core 6-11 VM2

Core 12-23 VM1

 

“-“ meaning “to”

 

Seen the topology the core/multithreaded pairing should be 0 and 1, 2 and 3, etc, and it’s correctly picked up by recent unraid versions.

 

Regarding the stuttering, it doesn’t differ if used together or alone.

 

I’ll try tomorrow and report back

 

 

As reported above, the RAM BIOS interleaving setting is set to “channel”, and every VM except VM3 had numatune set as preferred on their die.

 

Today I installed the last dimm modules, totaling 64G, so I gave VM3 the “preferred” setting, too.

 

Before, “strict” wasn’t really assigning everything to the correct die, a little part of the other memory banks were still used, always. I’ll try again with strict with the new configuration to see if something changed.

OK. The CPU thing is because the cache on the processor isn't passed through properly. By doing the CPU thing it will properly pass the cache and make the system much more responsive. Memory latency drops drastically, as do the L1-L3 cache latencies. If you'd like to see what it does, see the two posts by me below.

 

 

 

Link to comment

Tested on VM1: it does not completely solve the stuttering, but it works wonders, perceivably reducing the latency. Thanks!
Let's hope it gets included in a future release.

I used this variation:

<cpu mode='custom' match='exact' check='partial'>
  <model fallback='allow'>EPYC-IBPB</model>
  <topology sockets='1' cores='6' threads='2'/>
  <feature policy='require' name='topoext'/>
</cpu>

because I have 12 logical cores (6 cores x 2 threads) assigned.

To find the root cause, may I ask you to test your system for this very specific problem by running CrystalDiskMark on a network share capable of saturating a gigabit connection, or against an Unraid SMB share? Do you notice any of the above problems (increased latency, CPU spikes, stuttering)?

P.s.: I've still not changed the cpu mode on the other VMs, I'll update as soon as I have news.

EPYC CPU-Z.JPG

EPYC under load.JPG

Link to comment

I had a lot better luck removing the isolation on CPUs and changing my RAM settings back to auto in the BIOS. I have 4 VMs that are gaming machines, plus a virtual game server, and I am not running into any stuttering while playing or doing SMB activity while playing. Granted, I have an ASRock X399 instead of an MSI, but it seems similar enough. I also have 64GB of RAM, so maybe that is helping with bandwidth - but when I isolated cores in addition to pinning them, I got similar results. It was only after throwing out the idea of pinning and isolating, and letting the RAM settings go back to auto, that I was getting good results. I did do the EPYC change also - and the MSI fix, but that was all.

Link to comment
8 hours ago, Kronos69 said:

Tested on VM1: It does not competely solve the stuttering, but it works wonders, perceivably reducing the latency. Thanks!
Let's hope it'll be included in the next code.

I used this variation:


<cpu mode='custom' match='exact' check='partial'> <model fallback='allow'>EPYC-IBPB</model> <topology sockets='1' cores='6' threads='2'/> <feature policy='require' name='topoext'/> </cpu>

because I have 12 cores assigned.

To find the root cause, I would ask you if you wish to test your system for this very specific problem, running crystaldiskmark on a network share capable of saturating a gigabit connection or pointing towards an unraid SMB network share. Do you notice any of the above problems (increased latency, cpu spikes, stuttering)?

P.s.: I've still not changed the cpu mode on the other VMs, I'll update as soon as I have news.

EPYC CPU-Z.JPG

EPYC under load.JPG

I'll try later tonight, but I can tell you already that I saturate my network all the time. I download via JDownloader at near-gigabit speeds, and once all the files land it begins to unrar them. I've seen the cache SSD hit over 200 megabytes a second doing that. I have my whole drive passed through and the machine completely lives on it. The cache holds the virt image and the Dockers. I used to have a problem whenever anything was using that cache drive, but since I passed my entire drive through, that is all gone. I also run a Factorio server on another VM and had to pass a USB drive through to it so it wouldn't lag on saves while I was running JDownloader.

Link to comment
  • 4 months later...

@Kronos69 Many thanks for all your hard work on this - not completely solved, but it gets me a lot further and confirms my current thinking - others on here just assume it is a misconfiguration. I was just moving towards running the VM on an NVMe drive, as I had strong suspicions this is linked to storage and/or network issues.

 

Using your thread I have installed netdata, which is working wonders in that it sends me alerts on poorly performing areas - I don't think you installed it, based on the recommendations online - however I'd say it's better than Glances for what we are doing. One particularly concerning area for me is that during copies to the VM the network bridge is seeing packet loss of up to 10% - though this does not appear to be happening on the raw interface, only the internal one. More testing to be done, but this might help.
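In case it helps anyone reading along: netdata can be run as a Docker container, roughly like this (simplified - a Community Applications template fills in the exact mounts and port mapping for you):

docker run -d --name=netdata \
  -p 19999:19999 \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  --cap-add SYS_PTRACE \
  netdata/netdata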

 

I'm running a 1950X, an Asus X399-A and a marvellous 128GB of RAM, along with a stack of internal SATA HDDs and SSDs, plus a 128GB Intel NVMe. I also have an NVIDIA 1070 Ti. Because of my suspicions about the network I've put in a second network card - though this has not been easy - Unraid doesn't seem to like me doing this when it comes to setting default cards - so there may be something in that - though it's still a point of contention between me and someone else on here - it's on me to validate it with screenshots really, which could prove me wrong!

 

So where does it stand for you now? Not solved, and you gave up on Unraid, I'd guess, given the time since your last post?

 

Thanks.

Link to comment
