Another GPU Passthrough Problem Thread


So, I've gone through many different GPU passthrough problem threads, and none of them fix my issue. Thanks to SpaceInvaderOne, I managed to find a workaround, though it's not that great and causes issues of its own. Here's my particular situation, which I can't seem to find any info on.

 

I have a Ryzen Threadripper 1950X on an ASRock X399 Taichi on firmware version P3.90, with multiple GPUs that I'm trying to set up for passthrough. I honestly don't think the GPU models matter, as the issue affects all of them equally, but in case they do: one is an ASUS ROG STRIX Radeon Vega 64 8GB, one is a MacVidCards-flashed EVGA Nvidia GTX 980 4GB, and one is a Zotac Nvidia GT 710 1GB PCIe x1 card.

 

I have virtualization set up properly in the BIOS/UEFI settings. I have my IOMMU groups mapped out using the ACS Override "Both" option (needed to separate out the PCIe x1 slot and the M.2 wifi slot). I'm running Unraid 6.9.2, a clean install with no upgrades from previous versions. And I have the GPUs and wifi card stubbed/bound to VFIO at boot.
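For anyone wanting to compare groupings, here is a minimal sketch of listing which PCI devices land in which IOMMU group by walking sysfs. The function name `list_iommu_groups` and its optional root argument are hypothetical and only there for illustration; the `/sys/kernel/iommu_groups` layout is the standard Linux one.

```shell
#!/bin/bash
# Sketch: print every PCI device together with its IOMMU group number.
# The optional first argument (sysfs root) is an assumption added so the
# function can be pointed at a different tree; it defaults to /sys.
list_iommu_groups() {
    local root="${1:-/sys}"
    local dev group
    for dev in "$root"/kernel/iommu_groups/*/devices/*; do
        # Skip the unexpanded glob when no IOMMU groups exist
        [ -e "$dev" ] || continue
        group="${dev%/devices/*}"   # strip "/devices/<address>"
        group="${group##*/}"        # keep only the group number
        echo "group $group: ${dev##*/}"
    done
}

list_iommu_groups
```

Each output line has the form `group <number>: <PCI address>`; piping an address through `lspci -nn -s` would show what the device actually is.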

 

So, all is working great, except for GPU passthrough. I have passed through an M.2 NVMe drive to a VM without issues. Booted from it, even. Credit to SpaceInvaderOne for that. But right now, I'm not doing that. I'm just trying to pass through a GPU and nothing else.

 

I dumped the vBIOS using SpaceInvaderOne's tutorial. In order to do so, I had to modify the script to force a reset, which puts the server to sleep in the middle of the script in order to reset the card. Nothing else worked for any of the GPUs in any of the slots. Each and every one required the forced reset, which according to his documentation should only be needed when dumping the primary GPU (in my case, the Vega 64).

 

If I pass through a GPU, I have to first reset the card using the sleep method from SpaceInvaderOne. I even have a special script for it, literally reverse engineered from SpaceInvaderOne's vBIOS dump script.

#!/bin/bash
# PCI address of the GPU to reset (bus:device.function, as shown by lspci)
gpuid="45:00.0"
# Trim trailing and leading spaces from the address
gpuid=$(echo "$gpuid" | sed 's/ *$//')
gpuid=$(echo "$gpuid" | sed 's/^ *//g')
# Full address including the PCI domain, as used under /sys/bus/pci/devices
dumpid="0000:$gpuid"
echo "Disconnecting the graphics card"
# Remove the device from the PCI bus
echo "1" | tee -a "/sys/bus/pci/devices/$dumpid/remove"
echo "Entering suspended (sleep) state ......"
echo
echo " PRESS POWER BUTTON ON SERVER TO CONTINUE"
echo
# Suspend to RAM; execution continues here after the server wakes up
echo -n mem > /sys/power/state
echo "Rescanning pci bus"
echo "1" | tee -a /sys/bus/pci/rescan
echo "Graphics card has now been disconnected and reconnected"
echo

 

However, every time the VM is done with the GPU, I have to run the reset, which requires sleeping the server, if I want to use that GPU again. For any and all of the GPUs. Safe shutdowns of the VMs don't seem to release the cards properly. And if I reboot the whole server, I also have to run the reset before I can use any of the GPUs in passthrough for the first time.
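One way to see what state a card is left in after a VM shuts down is to check which kernel driver it is bound to. This is only a diagnostic sketch, not a fix: the function name `bound_driver` and the optional root argument are hypothetical, while the `driver` symlink under `/sys/bus/pci/devices/<address>` is standard sysfs.

```shell
#!/bin/bash
# Sketch: report which kernel driver a PCI device is currently bound to.
# Arguments: full PCI address (e.g. 0000:45:00.0) and, as an assumption
# for illustration, an optional sysfs root that defaults to /sys.
bound_driver() {
    local addr="$1" root="${2:-/sys}"
    local link="$root/bus/pci/devices/$addr/driver"
    if [ -L "$link" ]; then
        # The symlink points at the driver directory, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Example (address taken from the reset script above):
# bound_driver 0000:45:00.0
```

A card stubbed at boot would be expected to report `vfio-pci` both before the VM starts and after it stops.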

 

One issue that has been really hard to get around is installing the GPU drivers inside the VM. When the installer resets the GPU from inside the VM, the GPU stops sending a signal. I then have to force stop the VM, run the reset script that sleeps the server, and only then can I go back in and try again. But again, the driver install process resets the card (I think), which causes the card to need yet another reset.

 

I have started messing with the reset command on different higher-level PCIe devices, hoping to maybe reset the slot. That hasn't worked at all either. And a remove/rescan only seems to work if I sleep the server in between the remove and the rescan. Without the sleep, it's like the GPU isn't powering down to be ready for use again, so the rescan just adds it back, but the GPU is still locked by whatever used it last, be it a VM or the boot process.
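For reference, a minimal sketch of the kind of sysfs reset mentioned above, assuming the kernel exposes a `reset` file for the device (it only does so when it knows a reset method for that device). The function name `reset_pci_device` and the optional root argument are hypothetical, and nothing here is known to fix the problem described in this thread.

```shell
#!/bin/bash
# Sketch: ask the kernel to reset a single PCI device through sysfs.
# Arguments: full PCI address (e.g. 0000:45:00.0) and, as an assumption
# for illustration, an optional sysfs root that defaults to /sys.
reset_pci_device() {
    local addr="$1" root="${2:-/sys}"
    local dev="$root/bus/pci/devices/$addr"
    # The reset file is absent when no reset method is available
    if [ ! -e "$dev/reset" ]; then
        echo "no reset method exposed for $addr" >&2
        return 1
    fi
    echo 1 > "$dev/reset"
}

# Example (run as root, with no VM using the card):
# reset_pci_device 0000:45:00.0
```

A non-zero return with the "no reset method" message at least distinguishes a card the kernel cannot reset from one where the reset runs but does not help.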

 

My problem is this: how can I reset the GPUs without having to sleep the server? And how can I make it so that the cards properly release, or whatever they are supposed to do, to be ready for use again, without having to reset them?

Edited by tyrelius