Mr_Jay84 — Posted February 3, 2021 (edited)

Recently, whenever I try booting one of my Win 10 VMs with GPUs passed through, the VM OS tab freezes and won't load anymore. Nothing has changed hardware-wise or software-wise; it just started randomly. They all functioned perfectly until a few weeks ago, then randomly started freezing. The only way to clear this is to reset the machine.

unRAID 6.8.3

Any ideas?

Feb 3 20:40:51 Ultron kernel: ------------[ cut here ]------------
Feb 3 20:40:51 Ultron kernel: WARNING: CPU: 46 PID: 17363 at /tmp/SBo/NVIDIA-Linux-x86_64-440.59/kernel/nvidia/nv-pci.c:577 nv_pci_remove+0xe9/0x2fc [nvidia]
Feb 3 20:40:51 Ultron kernel: Modules linked in: arc4 ecb md4 nvidia_uvm(O) sha512_ssse3 sha512_generic cmac cifs ccm xt_nat veth xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap macvlan ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod ipmi_devintf nct6775 hwmon_vid bonding igb(O) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) crc32_pclmul intel_rapl_perf intel_uncore pcbc aesni_intel aes_x86_64 glue_helper crypto_simd ghash_clmulni_intel cryptd kvm_intel kvm drm_kms_helper intel_cstate coretemp drm crct10dif_pclmul intel_powerclamp mxm_wmi crc32c_intel syscopyarea sysfillrect sb_edac sysimgblt fb_sys_fops x86_pkg_temp_thermal ipmi_si i2c_i801 agpgart ipmi_ssif i2c_core ahci libahci button wmi pcc_cpufreq
Feb 3 20:40:51 Ultron kernel: [last unloaded: igb]
Feb 3 20:40:51 Ultron kernel: CPU: 46 PID: 17363 Comm: libvirtd Tainted: P O 4.19.107-Unraid #1
Feb 3 20:40:51 Ultron kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./EP2C612 WS, BIOS P2.50 03/22/2018
Feb 3 20:40:51 Ultron kernel: RIP: 0010:nv_pci_remove+0xe9/0x2fc [nvidia]
Feb 3 20:40:51 Ultron kernel: Code: aa 01 00 00 00 75 2c 8b 95 70 04 00 00 48 c7 c6 7b 15 7d a1 bf 04 00 00 00 e8 bd 7d 00 00 48 c7 c7 c2 15 7d a1 e8 31 c6 9e e0 <0f> 0b e8 c2 82 00 00 eb f9 4c 8d b5 50 04 00 00 4c 89 f7 e8 f7 62
Feb 3 20:40:51 Ultron kernel: RSP: 0018:ffffc900073c7d50 EFLAGS: 00010246
Feb 3 20:40:51 Ultron kernel: RAX: 0000000000000024 RBX: ffff88905a3250a8 RCX: 0000000000000007
Feb 3 20:40:51 Ultron kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff88a05fb964f0
Feb 3 20:40:51 Ultron kernel: RBP: ffff88a0597cc800 R08: 0000000000000003 R09: 000000000006d000
Feb 3 20:40:51 Ultron kernel: R10: 0000000000000000 R11: 0000000000000044 R12: ffff8890f018b008
Feb 3 20:40:51 Ultron kernel: R13: ffff88905a325000 R14: 0000000000000060 R15: ffff889df6dd3fc0
Feb 3 20:40:51 Ultron kernel: FS: 000014594921b700(0000) GS:ffff88a05fb80000(0000) knlGS:0000000000000000
Feb 3 20:40:51 Ultron kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 3 20:40:51 Ultron kernel: CR2: 0000145949217378 CR3: 00000010580fe003 CR4: 00000000001626e0
Feb 3 20:40:51 Ultron kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 3 20:40:51 Ultron kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 3 20:40:51 Ultron kernel: Call Trace:
Feb 3 20:40:51 Ultron kernel: pci_device_remove+0x36/0x8e
Feb 3 20:40:51 Ultron kernel: device_release_driver_internal+0x144/0x225
Feb 3 20:40:51 Ultron kernel: unbind_store+0x6b/0xae
Feb 3 20:40:51 Ultron kernel: kernfs_fop_write+0xf3/0x135
Feb 3 20:40:51 Ultron kernel: __vfs_write+0x32/0x13a
Feb 3 20:40:51 Ultron kernel: vfs_write+0xc7/0x166
Feb 3 20:40:51 Ultron kernel: ksys_write+0x60/0xb2
Feb 3 20:40:51 Ultron kernel: do_syscall_64+0x57/0xf2
Feb 3 20:40:51 Ultron kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb 3 20:40:51 Ultron kernel: RIP: 0033:0x14594b0bc48f
Feb 3 20:40:51 Ultron kernel: Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 49 fd ff ff 48 8b 54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 44 24 08 e8 7c fd ff ff 48
Feb 3 20:40:51 Ultron kernel: RSP: 002b:000014594921a530 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Feb 3 20:40:51 Ultron kernel: RAX: ffffffffffffffda RBX: 000000000000000c RCX: 000014594b0bc48f
Feb 3 20:40:51 Ultron kernel: RDX: 000000000000000c RSI: 0000145934033910 RDI: 000000000000001e
Feb 3 20:40:51 Ultron kernel: RBP: 0000145934033910 R08: 0000000000000000 R09: 0000000000000000
Feb 3 20:40:51 Ultron kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001e
Feb 3 20:40:51 Ultron kernel: R13: 000000000000001e R14: 0000000000000000 R15: 000014593402faf0
Feb 3 20:40:51 Ultron kernel: ---[ end trace 92e1e3438dbde051 ]---
Feb 3 20:42:51 Ultron nginx: 2021/02/03 20:42:51 [error] 24286#24286: *2569470 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 200.200.1.101, server: , request: "GET /plugins/gpustat/gpustatus.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "200.200.1.111", referrer: "http://200.200.1.111/Dashboard"

Edited February 3, 2021 by Mr_Jay84
zeus83 — Posted February 5, 2021

Hi, do you have nvidia-persistenced running on your host machine?
Mr_Jay84 (Author) — Posted February 5, 2021

Not even sure what that is, lol. How would I check?
zeus83 — Posted February 5, 2021

Try executing something like this:

ps aux | grep nvidia-persistenced

It might also be that there are processes on your host utilizing the card. Run this and see if any processes are bound to the GPU:

nvidia-smi

Otherwise I recommend removing the host NVIDIA drivers and checking whether GPU passthrough works in that case.
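Besides the two commands above, one more way to see exactly which host processes hold a GPU open is to ask `fuser` about the driver's device nodes. A minimal sketch, assuming `fuser` (from psmisc) is installed and the proprietary driver has created `/dev/nvidia*` nodes; it prints a fallback message on machines without them:

```shell
#!/bin/bash
# Sketch: list any host processes holding NVIDIA device nodes open.
# Assumes /dev/nvidia* exists (created by the proprietary driver);
# prints a fallback message otherwise.
gpu_users() {
  # -v shows the user, PID and command name for each process with a node open
  fuser -v /dev/nvidia* 2>/dev/null || echo "no processes are using /dev/nvidia*"
}

gpu_users
```

A VM cannot cleanly take over a GPU while any such process (an X server, a transcoder, nvidia-persistenced itself) still has the device open.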
Mr_Jay84 (Author) — Posted February 10, 2021 (edited)

The only card in use is the 1060, which is dedicated to Emby transcoding.

root@Ultron:~# ps aux | grep nvidia-persistenced
root       461  0.0  0.0   3916  2188 pts/3    S+   17:51   0:00 grep nvidia-persistenced
root@Ultron:~# nvidia-smi
Wed Feb 10 17:51:43 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 440.59       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  On   | 00000000:02:00.0 Off |                  N/A |
| 40%   28C    P2    30W / 120W |     82MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 750 Ti  On   | 00000000:81:00.0 Off |                  N/A |
| 29%   16C    P8     1W /  38W |      0MiB /  2002MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 105...  On   | 00000000:82:00.0 Off |                  N/A |
| 29%    9C    P8   N/A /  75W  |      0MiB /  4040MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      4669      C   /bin/ffmpeg                                   72MiB |
+-----------------------------------------------------------------------------+
root@Ultron:~#

I'm not sure I can reinstall the drivers, as I'm using the [PLUGIN] LINUXSERVER.IO - UNRAID NVIDIA plugin, which is now deprecated.

Edited February 10, 2021 by Mr_Jay84
Mr_Jay84 (Author) — Posted March 16, 2021 (edited)

Okay, partial success. I'm on 6.9.1 and using the Nvidia plugin, which lets Docker use all three of my GPUs; only one is used for Emby. Now the issue is that when I assign one to a VM I get this error:

Mar 15 23:37:01 Ultron kernel: NVRM: Attempting to remove minor device 1 with non-zero usage count!

So from what I can gather, the Nvidia plugin captures the GPUs and won't allow them to be used by VMs. Using the VFIO option reserves the spare two GPUs, enabling their use by VMs. Okay, great, but this means the VMs must be running, otherwise the cards sit in full P0 mode (full power). Is there a way to boot without VFIO using the Nvidia plugin, use the spare cards with the VMs, and then return them to the Nvidia plugin pool when the VM shuts down, for power management?

EDIT: The drivers do indeed capture the cards, but only if persistence mode is activated. This lets the cards toggle their respective power states; HOWEVER, it means that if you try to use them with VMs you'll get a hardware lock. According to Ich777 (the maker of the plugin), it is possible to enable persistence mode per card in the CLI:

nvidia-smi -i <target gpu> -pm ENABLED
Enabled persistence mode for GPU <target gpu>.
All done.

The <target gpu> should be the HW ID, I think (something like 0000:01:00.0).

I discovered my cards 'stubbed' by VFIO would indeed power throttle correctly whether or not the VM was in use; however, according to Ich777 this is very much card-specific behaviour and not all cards will behave the same, as it depends on the card's BIOS and how the manufacturer set it up. See post here for more info.

In conclusion, I'll keep my transcoding card assigned to the Nvidia plugin and the VM cards stubbed in VFIO; this way everything power throttles correctly. It does mean my Docker options are limited to one card without unbinding from VFIO and restarting, but I only use one for Emby transcoding anyway.
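The per-card persistence setting and the VFIO handoff described above can be sketched as two small shell helpers. This is a hedged sketch of the general mechanism, not the plugin's own code: the PCI address in the comments is an example, and the sysfs paths assume the stock vfio-pci and nvidia driver names.

```shell
#!/bin/bash
# Sketch of the per-card workflow described above. The PCI address shown
# in comments (0000:02:00.0) is an example, not a value from this system.

# Enable persistence mode for one card only, so it can manage its power
# states while idle on the host. -i accepts an index, UUID or PCI bus ID.
enable_persistence() {
  local gpu="$1"            # e.g. "0" or "0000:02:00.0"
  nvidia-smi -i "$gpu" -pm ENABLED
}

# Hand a card back from vfio-pci to the host nvidia driver once its VM has
# shut down (run the two steps in reverse to stub it again before VM start).
rebind_to_nvidia() {
  local dev="$1"            # e.g. "0000:02:00.0"
  echo "$dev" > /sys/bus/pci/drivers/vfio-pci/unbind
  echo "$dev" > /sys/bus/pci/drivers/nvidia/bind
}
```

Unbinding while the card still has a non-zero usage count is exactly what triggers the NVRM warning quoted above, so the rebind should only run after the VM has fully released the device.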
Edited March 16, 2021 by Mr_Jay84