VM/network issues.


Hanfufu


So I have spent the past 2 months ripping out my remaining hair and pondering whether to toss my server out the window...

 

I made the unfortunate decision to switch from WS2016 to UNRAID. Very bad decision apparently.

 

My VMs become completely unusable when I download or copy files from a VM on UNRAID to a share on the same UNRAID server (I tried both disk and user shares, no difference).

 

Quick view of my server:

 

Ryzen 9 3950X, ASRock X570 Steel Legend

48GB RAM

2x 1TB M.2 as cache pool for VMs

2x 4TB SATA SSD as cache pool for misc files

1x 240GB SSD for misc projects

Intel I211 LAN card

 

I attached some screenshots, showing the problems as best I can.

 

As soon as there is network traffic to any share/cache pool from my VM (currently running WS2016, but I have also tried WS2012, Windows 10 and even a Linux Mint VM), everything starts to lock up and freeze totally, sometimes for several minutes.

If I try to navigate anything while downloading, nothing responds, and clicking a button might take several minutes to actually open anything.

Download speeds occasionally go up to 60MB/sec (500Mbit connection), but only for a second, then they drop to 0 and the server freezes once again. It is completely unusable and I have tried EVERYTHING I can think of or find online. Nothing works, if anything it just gets worse.

If I manage to pause the download, everything immediately runs fine with no hiccups. As soon as I start the download again, the speed goes up to around 30-40MB/sec, then after a few seconds the freezes start once again and the DL speed drops to 0KB/sec.

 

Most of the time, the machine is unresponsive when the DL speed goes to 0, but on rare occasions it still responds fine while the DL speed sits at 0, spikes up to 20MB/sec for a few seconds and then drops again.

 

Sometimes it locks up completely for a long time (10+ mins), and when I check again, all running downloads have stopped with an error: couldn't write to file, no space on disk, or just a network error.

 

Copying a file from the VM to a share results in some short lockups, and the speed goes up and down and sometimes stops completely for a few seconds. It tops out at around 50MB/sec, drops below 10MB/sec and then goes up again, over and over.

Copying the SAME file from my gaming PC to the same share gives a rock solid 113-114MB/sec all through the transfer, exactly as it is supposed to.

Using WinRAR to unpack files to a share also makes the VM lock up/freeze constantly.
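For anyone trying to reproduce the up-and-down copy speeds described above with numbers instead of screenshots, here is a minimal Python sketch. It is only an illustration, not something used in this thread: `timed_write` and `find_stalls` are made-up helper names, the chunk sizes and the 10MB/sec stall threshold are arbitrary assumptions, and the target path is whatever share you want to test.

```python
import os
import time


def timed_write(path, total_mb=64, chunk_mb=4):
    """Write total_mb of random data to path and return the MB/sec of each chunk."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    rates = []
    with open(path, "wb") as f:
        for _ in range(total_mb // chunk_mb):
            start = time.perf_counter()
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push the chunk out instead of timing the local cache
            elapsed = time.perf_counter() - start
            rates.append(chunk_mb / elapsed if elapsed > 0 else float("inf"))
    return rates


def find_stalls(rates, threshold_mb_s=10.0):
    """Return the indexes of chunks that were written slower than the threshold."""
    return [i for i, rate in enumerate(rates) if rate < threshold_mb_s]


# Hypothetical usage from inside a Windows VM (the file name is an assumption):
#   rates = timed_write(r"\\tower\media\Temp\speedtest.bin", total_mb=512)
#   print(find_stalls(rates))
```

Running the same script against the vdisk (C drive) and against the share should make the difference between the two targets easy to compare.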

 

Also, it is ONLY the VMs that become unresponsive. Everything else on the server (Docker containers etc.) runs fine while the VM is frozen.

 

Checking CPU usage when it happens shows the VM is doing pretty much nothing, with sub 5% load on its cores.

Yes, I am using CPU PINNING, so the cores for the VM are used only for that.

 

I am also running Plex in Docker, and it runs VERY well with 10+ transcodes with no issues at all, even when using Handbrake to convert x264 -> x265 on 12 cores/24 threads.

 

Running qBittorrent in Docker as a test - rock solid 60MB/secs speeds with no issues at all.

 

I tried running the VMs from all 3 of my drive pools - same result, the VM freezes and becomes unusable.

Tried changing the VM network model to virtio or virtio-net - same results.

Tried changing everything under the Tips and tricks plugin, with no success whatsoever:

Disable NIC Flow Ctr - yes/no

Disable NIC Offload - yes/no

Ethernet NIC Rx/Tx buffer: 256,512.......4096

 

Also tried to change DirectIO to yes under Global Share settings - does absolutely nothing.

 

So I am 100% certain that this is an issue between the VM networking and connections to UNRAID shares, as using qBittorrent in Docker works perfectly fine, and copying files to/from the UNRAID shares from physical machines on my network runs totally fine with no issues at all.

Another important note here is that if I download to the C drive on the VM (aka the vdisk file), there are absolutely NO PROBLEMS at all: a solid 55MB/sec download and NO slowdowns. Again, it's ONLY when transferring/downloading to/from UNRAID network shares from within the VMs.

 

I am totally stumped, and this issue is making my server completely useless, and I even paid money for that.....

I sincerely hope someone has an idea, as I am ready to throw in the towel....

 

 

VMdownload-cdrive.png

VMdownload-share.png

tower-diagnostics-20220219-1115.zip

Link to comment
3 hours ago, Hanfufu said:

Running qBittorrent in Docker as a test - rock solid 60MB/secs speeds with no issues at all.

 

3 hours ago, Hanfufu said:

Another important note here, is that if i download to the C drive on the VM (aka the vdisk file), there are absolutely NO PROBLEMS at all

 

Are all of the destinations off the array?

BTW, I think most of the problems will be solved if the array is out of the equation.

Link to comment
33 minutes ago, Vr2Io said:

 

 


Are all of the destinations off the array?

 

Nothing goes to the array. I have 2x1TB SSDs as a cache pool, and I can see the files being created and downloaded onto the cache pool, so this is not the issue.

And I was downloading the exact same file in the VM and in qBittorrent in Docker, to the same path, to make sure that the only difference was VM or Docker (Docker: /user/media/Temp, VM: \\tower\media\Temp). Docker runs perfectly, the VMs run terribly.

 

 

BTW, I think most of the problems will be solved if the array is out of the equation.

 

As stated, it is out of the equation, as the cache pool works fine and as it should.

Even downloading (from the VM) to the cache-ssd disk share exhibits the same strange freezing behaviour, and I have read that going through disk shares can be faster than going through user shares. No difference at all.

Edited by Hanfufu
Link to comment

Update:

 

Tried a few other things today from various other forum posts, none of which had any effect at all.

It seems that a lot of people experiencing this behaviour have problems with 1 of the CPU cores hitting 100% - this is NOT happening to me. None of the cores on my WS2016 server are doing anything when it stutters. Usage hovers around a few %, both in Task Manager in the VM and on the Dashboard in the UNRAID web UI.


This topic and the fixes in it, did nothing: 

 

Also, I tried setting up a brand new VM, using both Q35-5.1/5.2 and i440fx-5.1/4.x.
The new VM still jumps up and down in speed when downloading, HOWEVER it DOES NOT stutter and become unresponsive all the time.

Tried setting CPU Scaling Governor to On Demand instead of Power Save, no effect from that either.

Tried changing USB to USB 3 (both options), no effect at all.

 

Tried everything I could find here, which also did nothing..

 


Nothing here either:

 

 

Edited by Hanfufu
Link to comment

Update once again:

 

When connecting to the server through the UNRAID GUI VNC, the stutter does not happen, but the download speed still jumps up and down and stays mostly down in the sub 10KB/sec range. As mentioned, though, the server keeps responding fine even when downloads drop to 0.

Not sure what this means, but connecting via RDP to the server reintroduces the stuttering and freezing.

I'm seriously getting more and more perplexed by this problem, and am baffled that it can even exist in paid software that costs $100+...

Link to comment
2 hours ago, Squid said:

Can you retry all this stuff without running what appears to be folding at home?  In your diagnostics it's taking at the time the ps was run 850%, and at the time top was run 1076%

Not sure what you mean, I was not running folding@home or anything. I may have been running some Handbrake encoding, but that does not affect anything: my VMs run fine until there is I/O between them and the network/user shares on the UNRAID server, no matter how many Plex transcodes are running.

Shutting down all other VMs/Docker containers doesn't affect this behaviour, it still happens.

Link to comment
16 hours ago, jonp said:

Also, please share with us your VM settings and CPU pinning.  You can take a screenshot of the Tools > CPU Pinning tab to give us an easy way to see what is assigned to what.

Thanks a lot for the response! :)

 

I posted a few more screenshots of CPU usage while the VM is frozen, connected via RDP.

And as I mentioned earlier, I discovered that when connecting via the UNRAID web UI VNC (VNC Remote), the stuttering is nonexistent, but the download speed still shows the same behaviour, sitting below 1Mbit/s 98% of the time.

 

 

2_cpu-pinning.png

2_downloading-cpu-usage-unraid-dashboard.png

2_downloading-cpu-usage-vm-ws2016.png

tower-diagnostics-20220224-0953.zip

Link to comment
9 minutes ago, Squid said:

I'm wondering if something like NetData (via Apps) would show anything weird happening at the time.

Holy sh*t that gives a lot of information, didn't know that app, thanks!

I don't know what to look for, but I found a lot of packet drops. Check the screenshot, no idea if it's relevant.

I will check if I can see anything if the network error in qBittorrent shows up again soon.
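If you want to watch the drop counters directly on the server while a download is running, a minimal Python sketch could look like the one below. It assumes a Linux host that exposes the usual per-interface counters under `/sys/class/net/<iface>/statistics/`, and it uses `br0` only because that is the bridge shown in the NetData screenshot. The function names are made up for this example.

```python
import time
from pathlib import Path

# Standard Linux sysfs counter names; which ones matter here is an assumption.
COUNTERS = ("rx_dropped", "tx_dropped", "rx_errors", "tx_errors")


def read_counters(iface, root="/sys/class/net"):
    """Read the drop/error counters of one network interface from sysfs."""
    stats = Path(root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def counter_delta(before, after):
    """Return how much each counter grew between two readings."""
    return {name: after[name] - before[name] for name in before}


def watch(iface="br0", interval=1.0, samples=10):
    """Print the per-interval growth of the counters a few times in a row."""
    previous = read_counters(iface)
    for _ in range(samples):
        time.sleep(interval)
        current = read_counters(iface)
        print(counter_delta(previous, current))
        previous = current
```

Calling `watch("br0")` once with the download paused and once with it running would show whether the drops actually line up with the stalls.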

3_netdata-packets-br0.png

Link to comment

I also saw this: under IPv6 there are lots of discarded packets.

I also disabled IPv6 on the VM a while back, to make sure that it didn't cause problems.

I also posted a screenshot of Network Settings showing something odd with the IPv6 route, but I'm not a network expert so I don't know if it's relevant.

 

3_ipv6_packet_loss.png

4_ipv6-route.png

Edited by Hanfufu
Link to comment
5 hours ago, Hanfufu said:

2_cpu-pinning.png

 

The amount of VMs that you have overlapping core assignments with is problematic.  Unraid (by default) doesn't work like a traditional hypervisor.  In a traditional setting, you just assign a quantity of CPUs to the VM and the hypervisor takes care of deciding where actual "work" needs to go on the physical CPUs.  So if you have a VM in that setting with say 4 CPUs assigned (vCPU1, vCPU2, vCPU3, and vCPU4), those 4 vCPUs can technically "roam" across any of the physical CPUs in the system (pCPU0, pCPU1, pCPU2, pCPU3, pCPU4, pCPU5, pCPU...).  In Unraid, the way you have this configured, that won't happen.  Looking at your Linux Mint VM, vCPU0-5 are hard bound to pCPU 1, 17, 2, 18, 3, and 19.  Those vCPUs will never shift.  And if any of your numerous other VMs have "work" that needs to happen on the same pinned cores, you're going to create massive context switching issues that will bring performance to a crawl.  Then on top of that you are not even fully isolating all the CPUs that VMs are using and you are overlapping your CPU pinning not only with multiple VMs, but with your docker containers as well.

 

Shut down all your containers and VMs except for 1 VM.  Now try your copy test.  Does the issue persist?
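To make the overlap problem described above concrete, here is a minimal Python sketch that lists every physical CPU claimed by more than one VM or container. Only the Linux Mint pinning (pCPU 1, 17, 2, 18, 3, 19) comes from this thread. The other two entries and the `find_overlaps` helper are made up for illustration, so substitute whatever your own CPU Pinning page shows.

```python
from collections import defaultdict


def find_overlaps(pinning):
    """Map each physical CPU to the consumers pinned to it, keeping only shared CPUs."""
    users = defaultdict(list)
    for name, cpus in pinning.items():
        for cpu in cpus:
            users[cpu].append(name)
    return {cpu: sorted(names) for cpu, names in sorted(users.items()) if len(names) > 1}


# Only the Linux Mint list is taken from the thread; the other entries are hypothetical.
pinning = {
    "Linux Mint": {1, 17, 2, 18, 3, 19},
    "example-vm": {2, 18, 4, 20},
    "example-container": {3, 19},
}

for cpu, names in find_overlaps(pinning).items():
    print(f"pCPU {cpu}: shared by {', '.join(names)}")
```

An empty result means no two consumers are pinned to the same physical CPU, which is the situation the single-VM test above tries to approximate.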

Link to comment

Thanks for your clarification on the CPU pinning system!

I had however considered some misconfig of these settings, but everywhere I found topics related to my problem, there was always 1 core that was maxing out and bottlenecking the VM, and I am not experiencing anything of the sort.

No load of any kind on any of the cores. CPU usage is only a few %, with no cores showing full or even high usage.

 

Only 2 VMs are running. The "Windows Server 2016 old" VM is running IIS/SQL Server and cannot be shut down atm.

And "Windows Server 2016 SATA" is the one I am currently testing on.

Linux Mint was only there to test if the problem would also occur in a Linux VM, just to cross that off the list.

I just tried shutting down all Docker containers, still running only the 2 VMs I mentioned above.

Same result: the download spends 95% of the time at around 0KB/sec and the VM is unresponsive via RDP :(

It would really baffle me if it is not related to the virtualization network stuff, since downloading to the vdisk file (aka the C drive inside the VM) runs perfectly smooth at full speed. It is only when I download/copy to the array/cache from inside the VM that the problems occur.

 

Screenshot_20220224-160905_Chrome.jpg

Screenshot_20220224-160914_Chrome.jpg

Edited by Hanfufu
Link to comment
On 2/24/2022 at 8:19 PM, JonathanM said:

so what happens if you shut down the one you are testing with and run the tests in the 2016 old VM?

Sorry for the delay in posting, have been pretty busy the past few days.

Sadly it's still the same behaviour when testing on that VM with the other one shut down :(

I have, however, discovered that I misinterpreted the stuttering.

It is only RDP/VNC over Ethernet that chokes. VNC via the VM Manager in the UNRAID GUI does not stutter, it is only the download speeds that are affected. This also makes sense, since there is no CPU usage whatsoever, so I am confident that it is NOT a CPU issue.

Also, I found out that limiting the number of open connections per torrent to 6 peers, and only downloading 1 torrent at a time, runs A LOT better. The download still dips way down to sub 5MBps, but generally stays in the 50MBps range, with an average download speed of around 35-45MBps.

It is NOT a permanent fix, but makes it possible to download at somewhat respectable speeds.

Starting 2 downloads at once makes it choke the LAN completely once again, but as stated without any stuttering per se. It is just network related, and the more open connections there are, the more it chokes.

Link to comment

Ok, I can definitely tell you that the issues you're experiencing are hardware-specific in nature.  These are not widespread nor being reported by anyone else.  I've thoroughly reviewed your diagnostics and haven't seen anything in the OS configuration itself that would be causing this problem, but you are right to be suspicious of those dropped packets.  Another key indicator is that when you connect to VNC using the Unraid GUI itself, you don't have these performance issues.  That is VERY interesting because even though it's local to the system, it is still using a virtual network interface to connect from the browser to the VNC session.  This means that when your physical network/cables are out of the equation, things work as they should.

 

My belief now is that you have either a bad network cable or improperly configured network environment.  I would try replacing the cables connecting your server to the network.  In addition, what can you tell us about your network environment?  What kind of router(s)/switch(es) do you have in place?  Any advanced routing?

 

Another question is that your server appears to have two Ethernet adapters: an Intel and an ASRock branded controller.  Which of these are currently being used to connect your server to the network?  Maybe try switching to the other one?

Link to comment

Hey Jonp, thanks again for your time, it's invaluable! :)

My main router is an ASUS RT-AX86U. I have a cable running from that to a no-name 5 port gigabit switch, which then connects to my server and a wifi extender setup.

I just tried removing the switch as a possible cause, so the cable was running straight from my router to the server. The results are extremely confusing...

With that switch removed, the problems still persist, but the download speed seemed capped at 30MB/sec - makes absolutely no sense to me.

 

Also tried:

using just ethernet 1 (onboard Intel I211 something)

using just ethernet 2 (PCIe Intel Gigabit generic adapter)

using just ethernet 1 without the switch

using just ethernet 2 without the switch

using both ethernet ports simultaneously

using another brand new cable

 

Still the issue persists.

 

About it being a bad cable/setup, we have to remember that:

Downloading to the C drive (vdisk file) in the VM runs PERFECTLY @ 60MB/sec - it is only when I save to the \\tower\media network share from within the VM that the problems show up.

Downloading to the array through qBittorrent running in Docker also runs perfectly @ a stable 60MB/sec.

Nothing has been changed in my setup going from WS2016 to UNRAID: same cables, same setup etc., and it always ran perfectly.

So if this was a bad cable etc., I'm sure it would have come up as a problem in the past 2.5 years, during which my setup (LAN wise) has been exactly the same.

But anyway, I have now tried a different cable etc. with no luck, though it was indeed worth a shot.

Also, as mentioned before, the problem appeared on my old Intel Xeon X99 board as well, which only had a 3m cable directly into my router, so it is very strange indeed :(

 

 

Link to comment

I have to say, I'm at a loss here.  You have quite an exhaustive setup, and for us to figure this out, we'd probably need to work with you 1:1.  We do offer paid support services for such an occasion which you can purchase from here:  https://unraid.net/services.  Unfortunately there are no guarantees that this will fix the issue, but it's probably the best chance you have at this point to diagnose.

 

I would again try shutting down all containers and leave only 1 VM running.  Attempt your copy jobs and see if things work ok.  If so, then your issue is related to all those services running in concert.  But as far as this being related to the virtual networking stack, I have to say, I don't think that's the case.  There are too many users leveraging that exact same virtual software stack without issue.  If it was software-related, I would expect to see a much larger number of users complaining about this very issue as this is a fairly common scenario.

 

You also could try manually adjusting the XML for the VM to adjust the virtual network device in use.  Take a look at the libvirt docs for more information:  https://libvirt.org/formatdomain.html#network-interfaces

Link to comment
  • 11 months later...

I just came across this thread with the exact same issue.

After reading this thread and a bit of trial and error I found a solution that works for me:

Changing the network model in my VM configuration from virtio-net to vmxnet3.

I haven't experienced any stuttering or other issues since.
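For anyone who prefers to script the model change described above instead of using the VM settings form, here is a minimal Python sketch of the XML edit. It assumes the NIC is declared the way the libvirt domain XML format describes it, as a `<model type='...'/>` element inside `<devices>/<interface>`. The sample XML, the VM name and the `set_nic_model` helper are made up for illustration and are not taken from anyone's actual configuration.

```python
import xml.etree.ElementTree as ET


def set_nic_model(domain_xml, new_model, old_model=None):
    """Return (updated_xml, changed_count) after rewriting the NIC model type."""
    root = ET.fromstring(domain_xml)
    changed = 0
    for model in root.findall("./devices/interface/model"):
        # With old_model set, only interfaces using that model are touched.
        if old_model is None or model.get("type") == old_model:
            model.set("type", new_model)
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


# Hypothetical, heavily trimmed domain XML; a real one has many more elements.
SAMPLE = """<domain type='kvm'>
  <name>example-vm</name>
  <devices>
    <interface type='bridge'>
      <source bridge='br0'/>
      <model type='virtio-net'/>
    </interface>
  </devices>
</domain>"""

updated, count = set_nic_model(SAMPLE, "vmxnet3", old_model="virtio-net")
print(f"changed {count} interface(s)")
print(updated)
```

This only produces the edited XML text. Getting it back into the VM definition (and making sure the guest has a driver for the new adapter) is left out on purpose, and whether the change helps will depend on the setup.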

Link to comment
