VM Freeze with GPU passthrough



Recently, whenever I try booting one of my Win 10 VMs with GPUs passed through, the VM OS tab freezes and won't load anymore. Nothing has changed hardware-wise or software-wise; it just started randomly. They all functioned perfectly up until a few weeks ago, then randomly started freezing. The only way to clear this is to reset the machine.

 

Unraid 6.8.3

 

Any ideas?

 

 

Feb 3 20:40:51 Ultron kernel: ------------[ cut here ]------------
Feb 3 20:40:51 Ultron kernel: WARNING: CPU: 46 PID: 17363 at /tmp/SBo/NVIDIA-Linux-x86_64-440.59/kernel/nvidia/nv-pci.c:577 nv_pci_remove+0xe9/0x2fc [nvidia]
Feb 3 20:40:51 Ultron kernel: Modules linked in: arc4 ecb md4 nvidia_uvm(O) sha512_ssse3 sha512_generic cmac cifs ccm xt_nat veth xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap macvlan ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod ipmi_devintf nct6775 hwmon_vid bonding igb(O) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) crc32_pclmul intel_rapl_perf intel_uncore pcbc aesni_intel aes_x86_64 glue_helper crypto_simd ghash_clmulni_intel cryptd kvm_intel kvm drm_kms_helper intel_cstate coretemp drm crct10dif_pclmul intel_powerclamp mxm_wmi crc32c_intel syscopyarea sysfillrect sb_edac sysimgblt fb_sys_fops x86_pkg_temp_thermal ipmi_si i2c_i801 agpgart ipmi_ssif i2c_core ahci libahci button wmi pcc_cpufreq
Feb 3 20:40:51 Ultron kernel: [last unloaded: igb]
Feb 3 20:40:51 Ultron kernel: CPU: 46 PID: 17363 Comm: libvirtd Tainted: P O 4.19.107-Unraid #1
Feb 3 20:40:51 Ultron kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./EP2C612 WS, BIOS P2.50 03/22/2018
Feb 3 20:40:51 Ultron kernel: RIP: 0010:nv_pci_remove+0xe9/0x2fc [nvidia]
Feb 3 20:40:51 Ultron kernel: Code: aa 01 00 00 00 75 2c 8b 95 70 04 00 00 48 c7 c6 7b 15 7d a1 bf 04 00 00 00 e8 bd 7d 00 00 48 c7 c7 c2 15 7d a1 e8 31 c6 9e e0 <0f> 0b e8 c2 82 00 00 eb f9 4c 8d b5 50 04 00 00 4c 89 f7 e8 f7 62
Feb 3 20:40:51 Ultron kernel: RSP: 0018:ffffc900073c7d50 EFLAGS: 00010246
Feb 3 20:40:51 Ultron kernel: RAX: 0000000000000024 RBX: ffff88905a3250a8 RCX: 0000000000000007
Feb 3 20:40:51 Ultron kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff88a05fb964f0
Feb 3 20:40:51 Ultron kernel: RBP: ffff88a0597cc800 R08: 0000000000000003 R09: 000000000006d000
Feb 3 20:40:51 Ultron kernel: R10: 0000000000000000 R11: 0000000000000044 R12: ffff8890f018b008
Feb 3 20:40:51 Ultron kernel: R13: ffff88905a325000 R14: 0000000000000060 R15: ffff889df6dd3fc0
Feb 3 20:40:51 Ultron kernel: FS: 000014594921b700(0000) GS:ffff88a05fb80000(0000) knlGS:0000000000000000
Feb 3 20:40:51 Ultron kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 3 20:40:51 Ultron kernel: CR2: 0000145949217378 CR3: 00000010580fe003 CR4: 00000000001626e0
Feb 3 20:40:51 Ultron kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 3 20:40:51 Ultron kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 3 20:40:51 Ultron kernel: Call Trace:
Feb 3 20:40:51 Ultron kernel: pci_device_remove+0x36/0x8e
Feb 3 20:40:51 Ultron kernel: device_release_driver_internal+0x144/0x225
Feb 3 20:40:51 Ultron kernel: unbind_store+0x6b/0xae
Feb 3 20:40:51 Ultron kernel: kernfs_fop_write+0xf3/0x135
Feb 3 20:40:51 Ultron kernel: __vfs_write+0x32/0x13a
Feb 3 20:40:51 Ultron kernel: vfs_write+0xc7/0x166
Feb 3 20:40:51 Ultron kernel: ksys_write+0x60/0xb2
Feb 3 20:40:51 Ultron kernel: do_syscall_64+0x57/0xf2
Feb 3 20:40:51 Ultron kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb 3 20:40:51 Ultron kernel: RIP: 0033:0x14594b0bc48f
Feb 3 20:40:51 Ultron kernel: Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 49 fd ff ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 44 24 08 e8 7c fd ff ff 48
Feb 3 20:40:51 Ultron kernel: RSP: 002b:000014594921a530 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Feb 3 20:40:51 Ultron kernel: RAX: ffffffffffffffda RBX: 000000000000000c RCX: 000014594b0bc48f
Feb 3 20:40:51 Ultron kernel: RDX: 000000000000000c RSI: 0000145934033910 RDI: 000000000000001e
Feb 3 20:40:51 Ultron kernel: RBP: 0000145934033910 R08: 0000000000000000 R09: 0000000000000000
Feb 3 20:40:51 Ultron kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001e
Feb 3 20:40:51 Ultron kernel: R13: 000000000000001e R14: 0000000000000000 R15: 000014593402faf0
Feb 3 20:40:51 Ultron kernel: ---[ end trace 92e1e3438dbde051 ]---
Feb 3 20:42:51 Ultron nginx: 2021/02/03 20:42:51 [error] 24286#24286: *2569470 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 200.200.1.101, server: , request: "GET /plugins/gpustat/gpustatus.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "200.200.1.111", referrer: "http://200.200.1.111/Dashboard"

 


Try to execute something like this:

 

ps aux | grep nvidia-persistenced

 

It might also be that there are processes on your host utilizing the card. Try running this and see if any processes are bound to the GPU:

nvidia-smi

Otherwise I recommend removing the host NVIDIA drivers and checking whether GPU passthrough works without them.
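
If nvidia-smi comes back clean but the card still won't release, it can also help to see what has the device nodes open. A quick sketch, assuming the standard /dev/nvidia* nodes and that fuser/lsof are installed on the host:

fuser -v /dev/nvidia*
lsof /dev/nvidia* 2>/dev/null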


The only card in use is the 1060, which is dedicated to Emby transcoding.

root@Ultron:~# ps aux | grep nvidia-persistenced
root       461  0.0  0.0   3916  2188 pts/3    S+   17:51   0:00 grep nvidia-persistenced
root@Ultron:~# nvidia-smi
Wed Feb 10 17:51:43 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 440.59       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  On   | 00000000:02:00.0 Off |                  N/A |
| 40%   28C    P2    30W / 120W |     82MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 750 Ti  On   | 00000000:81:00.0 Off |                  N/A |
| 29%   16C    P8     1W /  38W |      0MiB /  2002MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 105...  On   | 00000000:82:00.0 Off |                  N/A |
| 29%    9C    P8    N/A /  75W |      0MiB /  4040MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4669      C   /bin/ffmpeg                                   72MiB |
+-----------------------------------------------------------------------------+
root@Ultron:~# 

I'm not sure I can reinstall the drivers as I'm using the [PLUGIN] LINUXSERVER.IO - UNRAID NVIDIA plugin, which is now deprecated.


Okay, partial success.

 

I'm on 6.9.1 and using the Nvidia plugin, which lets Docker containers use all three of my GPUs; only one is used for Emby. Now the issue is that when I assign one to a VM I get this error...

 

Mar 15 23:37:01 Ultron kernel: NVRM: Attempting to remove minor device 1 with non-zero usage count!

 

So from what I can gather, the Nvidia plugin captures the GPUs and won't allow them to be used by VMs. Using the VFIO option reserves the spare two GPUs and enables their use by VMs, which is great, but it means the VMs must be running, otherwise the cards sit in full P0 mode (full power).
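
For reference, the same reservation can also be done by hand by binding the cards to vfio-pci on the kernel command line. A rough sketch (the IDs below are placeholders; use the vendor:device pairs lspci reports for the GPU and its HDMI audio function):

lspci -nn | grep -i nvidia
# then add the reported IDs to the append line in /boot/syslinux/syslinux.cfg, e.g.
# vfio-pci.ids=10de:1380,10de:0fbc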

Is there a way to boot without VFIO using the Nvidia plugin, use the spare cards with the VMs, and then return them to the Nvidia plugin pool when the VM shuts down, for power management?
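
In principle that hand-off could be scripted through sysfs. A rough, untested sketch (0000:81:00.0 is just the example address of one of my spare cards, and the unbind can hang with exactly the nv_pci_remove warning shown above if anything still has the card open):

# before VM start: detach from the nvidia driver and hand the card to vfio-pci
echo 0000:81:00.0 > /sys/bus/pci/drivers/nvidia/unbind
echo vfio-pci > /sys/bus/pci/devices/0000:81:00.0/driver_override
echo 0000:81:00.0 > /sys/bus/pci/drivers_probe

# after VM shutdown: give it back to the nvidia driver
echo 0000:81:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
echo nvidia > /sys/bus/pci/devices/0000:81:00.0/driver_override
echo 0000:81:00.0 > /sys/bus/pci/drivers_probe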

EDIT: The drivers do indeed capture the cards, but only if persistence mode is activated. This enables the cards to toggle their respective power states; HOWEVER, it means that if you try to use them with VMs you'll get a hardware lock. According to ich777 (the maker of the plugin) it is possible to enable persistence mode per card from the CLI:
 

nvidia-smi -i <target gpu> -pm ENABLED
    Enabled persistence mode for GPU <target gpu>.
    All done.

The <target gpu> should be the HW ID I think (something like 0000:01:00.0).
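
Either the index from nvidia-smi -L or the bus ID appears to be accepted for -i. To confirm the setting actually took effect, something along these lines should work (a sketch, not verified on every driver version):

nvidia-smi -i 0 --query-gpu=index,name,persistence_mode --format=csv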

I discovered my cards 'stubbed' by VFIO would indeed power throttle correctly whether the VM was in use or not; however, according to ich777 this is very much a card-specific behaviour and not all will behave the same, as it depends on the card's BIOS and how the manufacturer set it up. See the post here for more info.

In conclusion, I'll keep my transcoding card assigned to the Nvidia plugin and the VM cards stubbed in VFIO; this way everything power throttles correctly. It does mean my Docker options are limited to one card without unbinding from VFIO and restarting, but I only use one for Emby transcoding anyway.

