Troubleshooting Nvidia GPU Pass-through



Hi All,

 

I've been experimenting with passing an Nvidia GTX 760 GPU through to a Windows 10 VM.  I originally used a Supermicro X99 motherboard, which gave acceptable results but had a bug with its USB controllers, so I switched to a different motherboard.  The USB controller ordeal is over, but now I'm having trouble with the Nvidia GPU.

 

Hardware:

(1) MSI X99A-SLI motherboard (unRAID flash drive on its own USB controller, while 2 other USB controllers are passed through)

(2) Intel Xeon E5-2658 V3

(3) 480GB and 60GB SSDs (cache and data, respectively.  No parity as this box is purely for Virtualization)

(4) AMD HD7770 (Gigabyte)

(5) AMD HD7850 (Sapphire)

(6) Nvidia GTX 760 (Asus)

(7) EVGA 80+ Platinum 650W (can safely rule out PSU issue :))

 

Software:

(1) UnRaid 6.2 Beta 18 (trial)

(2) Windows 10 Pro

 

With either of the AMD GPUs, the Windows 10 VMs are very happy: there are occasional DPC latency issues (detected by LatencyMon), yet the VM feels like a fast machine.  USB controller pass-through is behaving well.

 

As soon as I pass through the Nvidia GPU, things slow almost to a halt.  There is no picture from either the HDMI or the DVI output, so I have been using Remote Desktop to connect to the machine.  Over Remote Desktop, mouse response is extremely slow.  I cannot even change the resolution, which is stuck at 640x480, and Device Manager shows Code 43 for the Nvidia GPU.

 

I've gone through many combinations of vCPU assignment, RAM assignment, Hyper-V on/off, i440fx/Q35 machine types, and PCIe slot swaps.  I cannot use OVMF, as I believe none of my GPUs has UEFI support.  With Hyper-V off, I get a picture on the monitor, but the machine is still crippled: no reasonable work can be done on the VM.
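One commonly reported cause of Code 43 on GeForce cards is the Nvidia driver refusing to start once it detects a hypervisor; whether that applies to this box is an assumption, not something established in this thread. The sketch below only inspects a libvirt domain XML and reports whether the KVM "hidden" flag and any Hyper-V enlightenments are set. The `SAMPLE` XML is a made-up fragment for illustration, not this VM's actual configuration.

```python
import xml.etree.ElementTree as ET

# Hypothetical domain XML fragment, NOT taken from the poster's VM.
SAMPLE = """
<domain type='kvm'>
  <features>
    <acpi/>
    <hyperv>
      <relaxed state='on'/>
    </hyperv>
  </features>
</domain>
"""


def kvm_hidden(domain_xml):
    """True if <features><kvm><hidden state='on'/> is present."""
    root = ET.fromstring(domain_xml)
    hidden = root.find("./features/kvm/hidden")
    return hidden is not None and hidden.get("state") == "on"


def hyperv_enabled(domain_xml):
    """True if any Hyper-V enlightenment under <features><hyperv> is on."""
    root = ET.fromstring(domain_xml)
    return any(
        child.get("state") == "on"
        for child in root.findall("./features/hyperv/*")
    )


if __name__ == "__main__":
    print("KVM hidden from guest:", kvm_hidden(SAMPLE))
    print("Hyper-V enlightenments on:", hyperv_enabled(SAMPLE))
```

To check a real VM, the XML shown in the VM's XML view (or printed by `virsh dumpxml <name>`) could be passed to the same two functions in place of `SAMPLE`.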

 

I would therefore like to ask this forum for help; hopefully the collaboration will benefit the community.  Feel free to ask questions or request logs, XML, etc.

 

Have a good day!

 


I did try swapping the GPU slots, but I have always used the first and the second-to-last slots.  One thing I didn't try is the slot beside the first slot.  Since all of my GPUs are dual-width, I always put them as far apart as possible.

 

The motherboard BIOS doesn't have an option to choose the default GPU, but I notice that it always defaults to the slot closest to the CPU (the first slot, I believe).

 

What PCIe slot is the Nvidia GTX 760 (Asus) in?

If it's in the first x16 slot, try moving it to another one.

Also, which GPU is the BIOS set to use as the primary GPU?
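When the BIOS offers no setting for the primary GPU, the host can still report which card the firmware initialized at boot. The sketch below assumes a Linux host that exposes the standard `boot_vga` attribute under `/sys/bus/pci/devices`; the function name and the `sysfs_root` parameter are illustrative, and the parameter exists only so the path can be swapped out for testing.

```python
from pathlib import Path


def find_boot_vga(sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI addresses whose boot_vga attribute reads 1.

    Only VGA-class devices expose boot_vga; a value of 1 marks the
    adapter the firmware used as the primary display at boot.
    """
    primary = []
    for dev in sorted(Path(sysfs_root).glob("*")):
        flag = dev / "boot_vga"
        if flag.is_file() and flag.read_text().strip() == "1":
            primary.append(dev.name)
    return primary


if __name__ == "__main__":
    # Prints a list of PCI addresses (empty if none is flagged).
    print(find_boot_vga())
```

The address it prints can be matched against `lspci` output to see which physical card is the boot GPU, and so which card is better left out of pass-through or moved to another slot.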


There is good news.  Installing the GTX 760 in the middle x16 PCIe slot allows a SeaBIOS + i440fx Windows 10 VM to boot and run without the issues I've had.  The machine seems quite responsive (I gamed a bit, did a CAD render, and shut down a few times).  LatencyMon still reports some ISR and DPC times in the red.

 

The other Win10 VM is also running fine (as always) with the AMD HD7770, which now sits in the x16 PCIe slot right beside the CPU.  It is responsive as well.

 

My hope for the unRAID virtualization endeavor is now revived.  I plan to do more tests before migrating my working unRAID key (and the HDDs) to this system.  One incident I had before: I left a Windows 10 VM on (with auto-sleep turned off), and the next morning nothing would respond (mouse, keyboard, monitor, or webUI).  I had to power-cycle the box, which of course resulted in 10+ hours of parity re-sync.  I plan to test this dual-VM rig further without the HDDs, and once the behavior is satisfactory, I'll migrate the production unRAID to it.

 

Any suggested tests to run on the VMs?

 


[Attached image: latmon.PNG]


I have been letting the two Windows VMs idle for many hours now.  So far so good.

 

One thing I do find weird is that the VM with the GTX 760 consumes unrealistically high CPU when I connect to it over Remote Desktop.

 

This is the CPU usage (from the unRAID telnet console) when both VMs are idle:

 

top - 12:40:28 up  7:55,  1 user,  load average: 2.76, 2.39, 3.03
Tasks: 348 total,   2 running, 346 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.4 us,  8.0 sy,  0.0 ni, 90.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32909372 total,   207412 free, 22091376 used, 10610584 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 10344156 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  781 root      20   0 16.752g 0.016t  22596 S 115.0 52.6 438:34.73 qemu-system-x86
30747 root      20   0 4912788 4.351g  22412 S  94.0 13.9 414:43.00 qemu-system-x86
16814 root      20   0  163864  11916  10232 R   0.7  0.0   0:00.02 php
   52 root      rt   0       0      0      0 S   0.3  0.0   0:00.12 migration/11

 

When I connect to both VMs over Remote Desktop, the one with the GTX 760 suddenly sees a huge jump in CPU usage under unRAID (also evident from the jump in power consumption reported by the UPS):

 

top - 12:43:14 up  7:58,  1 user,  load average: 4.60, 3.20, 3.23
Tasks: 349 total,   2 running, 347 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.4 us, 60.0 sy,  0.0 ni, 33.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32909372 total,   217976 free, 22080272 used, 10611124 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 10355544 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  781 root      20   0 16.752g 0.016t  22596 R  1436 52.6 443:09.63 qemu-system-x86
30747 root      20   0 4912788 4.351g  22412 S 104.7 13.9 417:40.27 qemu-system-x86
    7 root      20   0       0      0      0 S   0.3  0.0   0:35.69 rcu_preempt
  787 root      20   0       0      0      0 S   0.3  0.0   0:03.84 vhost-781

 

What's troubling me is that the VM with the AMD GPU doesn't see much of a jump in CPU usage.

 

Any idea how to debug this?
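As a first step, the two `top` snapshots above can be compared per process rather than by eye. The sketch below parses the process rows and prints the change in %CPU for each PID present in both snapshots. The `BEFORE` and `AFTER` strings are the rows copied from the output above; the function names are illustrative, and the regular expression assumes the default `top` column layout shown in this post.

```python
import re

# One process row: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
ROW = re.compile(
    r"^\s*(\d+)\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S\s+"
    r"([\d.]+)\s+([\d.]+)\s+\S+\s+(.+?)\s*$"
)

BEFORE = """\
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  781 root      20   0 16.752g 0.016t  22596 S 115.0 52.6 438:34.73 qemu-system-x86
30747 root      20   0 4912788 4.351g  22412 S  94.0 13.9 414:43.00 qemu-system-x86
16814 root      20   0  163864  11916  10232 R   0.7  0.0   0:00.02 php
"""

AFTER = """\
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  781 root      20   0 16.752g 0.016t  22596 R  1436 52.6 443:09.63 qemu-system-x86
30747 root      20   0 4912788 4.351g  22412 S 104.7 13.9 417:40.27 qemu-system-x86
  787 root      20   0       0      0      0 S   0.3  0.0   0:03.84 vhost-781
"""


def parse_top(text):
    """Return {pid: (command, cpu_percent)} for each process row in text."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            pid, cpu, _mem, command = m.groups()
            rows[int(pid)] = (command, float(cpu))
    return rows


def cpu_deltas(before, after):
    """%CPU change per PID seen in both snapshots, largest increase first."""
    a, b = parse_top(before), parse_top(after)
    deltas = [
        (pid, b[pid][0], round(b[pid][1] - a[pid][1], 1))
        for pid in a
        if pid in b
    ]
    return sorted(deltas, key=lambda d: d[2], reverse=True)


if __name__ == "__main__":
    for pid, command, delta in cpu_deltas(BEFORE, AFTER):
        print(f"{pid:>6} {command:<18} {delta:+.1f}")
```

On these two snapshots, PID 781 goes from 115.0% to 1436% while PID 30747 moves from 94.0% to 104.7%. The summary line also shows that most of the extra load is system time (`sy` rises from 8.0 to 60.0), so a per-thread view of the busy qemu process, for example `top -H -p 781`, would be a reasonable next thing to look at.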
