tyrelius

  1. So, I've gone down many different GPU passthrough threads, and none of them seem to fix my issue. Through SpaceInvaderOne I managed to find a workaround, though it's not great and presents issues of its own. Here's my particular situation, which I can't seem to find any info on: I have a Ryzen Threadripper 1950X on an ASRock X399 Taichi (firmware P3.90), with multiple GPUs that I'm trying to set up for passthrough. I honestly don't think the GPU models matter, as the issue affects all of them equally, but in case it does: one is an ASUS ROG STRIX Radeon Vega 64 8GB, one is a MacVidCards-flashed EVGA Nvidia GTX 980 4GB, and one is a Zotac Nvidia GT 710 1GB PCIe x1 card. Virtualization is set up properly in the BIOS/UEFI settings, my IOMMU groups are mapped out using the ACS Override "Both" option (needed to separate out the PCIe x1 slot and the M.2 Wi-Fi slot), I'm running Unraid 6.9.2 on a clean install (no upgrades from previous versions), and the GPUs and Wi-Fi card are stubbed/bound to VFIO at boot.

     Everything works great except passthrough. I have passed an M.2 NVMe drive through to a VM without issues, and even booted from it (SpaceInvaderOne to credit for that), but that's not what I'm doing right now; I'm just trying to pass through a GPU. I dumped the vBIOS using SpaceInvaderOne's tutorial, but to do so I had to modify the script to force a reset, which puts the server to sleep in the middle of the script in order to reset the card. No other method worked for any of the GPUs in any of the slots. Each and every one required the forced reset, which according to his documentation should only be needed for the primary GPU (in my case, the Vega 64).

     If I want to pass through a GPU, I first have to reset the card using the sleep method from SpaceInvaderOne. I even have a dedicated script for it, reverse engineered from his vBIOS dump script:

     #!/bin/bash
     # PCI address of the GPU to reset
     gpuid="45:00.0"
     # Trim any stray whitespace from the address
     gpuid=$(echo "$gpuid" | sed 's/ *$//;s/^ *//')
     dumpid="0000:$gpuid"
     vganame=$(lspci -s "$gpuid")

     echo "Disconnecting the graphics card: $vganame"
     # Remove the device from the PCI bus (sysfs paths take plain, unescaped colons)
     echo 1 | tee /sys/bus/pci/devices/$dumpid/remove

     echo "Entering suspended (sleep) state ......"
     echo
     echo " PRESS POWER BUTTON ON SERVER TO CONTINUE"
     echo
     echo -n mem > /sys/power/state

     echo "Rescanning PCI bus"
     echo 1 | tee /sys/bus/pci/rescan
     echo "Graphics card has now successfully been disconnected and reconnected"

     However, every time a VM is done with a GPU, I have to run the reset, which requires sleeping the server, before I can use that GPU again. This goes for any and all of the GPUs. Safe shutdowns of the VMs don't seem to release the cards properly, and if I reboot the whole server, I also have to run the reset before I can use any of the GPUs for passthrough the first time. One issue that has been really hard to get around: when I install the drivers for the GPU inside the VM, the driver install resets the GPU from inside the guest, which makes the GPU stop sending a signal. I then have to force stop the VM, run the reset script that sleeps the server, and go back in to try again; but the driver install process resets the card again (I think) and puts it right back into needing another reset.

     I have also started messing with issuing the reset command on higher-level PCIe devices, hoping to maybe reset the slot. That hasn't worked at all either. And a remove/rescan only seems to work if I sleep the server between the remove and the rescan. Without the sleep, it's as if the GPU isn't powering down to be ready for use again, so the rescan just adds it back while the GPU is still locked by whatever last used it, be it a VM or the boot process.

     My problem is this: how can I reset the GPUs without having to sleep the server? And how can I make the cards properly release (or whatever they are supposed to do) so they are ready for use again without having to reset them? (A sketch of a per-device reset attempt follows this post.)
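A minimal sketch of the per-device reset attempt mentioned above, assuming the same 0000:45:00.0 address used in the script (adjust to your own lspci output). The kernel only exposes the sysfs reset file when it has found some reset method for the device (FLR, a power-management reset, or a secondary-bus reset), and many consumer GPUs, Vega 10 in particular, are known not to reset cleanly this way, which is what the third-party vendor-reset kernel module tries to address. Treat this as a diagnostic, not a guaranteed fix.

#!/bin/bash
# Sketch: try a kernel-driven function reset instead of suspending the host.
# The address below is an assumption taken from the reset script above.
dev="0000:45:00.0"

# Show which reset mechanisms the device advertises (look for "FLReset+"
# in the DevCap line of the PCI Express capability).
lspci -vv -s "$dev" | grep -i -e 'FLReset' -e 'DevCap:'

# Ask the kernel to reset the function. The reset file only exists if the
# kernel found a usable reset method for this device.
if [ -w "/sys/bus/pci/devices/$dev/reset" ]; then
    echo 1 > "/sys/bus/pci/devices/$dev/reset"
    echo "Issued a function reset on $dev"
else
    echo "No usable reset method exposed for $dev"
fi

If the reset file is missing for a given card, that by itself says the kernel has no sanctioned way to reset that function, which would be consistent with only a full suspend/resume cycle bringing the card back.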
  2. Mkay, so long delay here for the update. After lots more troubleshooting, I finally decided to plug in a monitor and watch the POST. Well, that pointed to an even bigger problem: when the GTX 1650 is plugged into the server, I see an error during boot, which I think is why I can't get IOMMU to work with it:

     Plug & Play Configuration Error: Memory Allocation
     Embedded I/O Bridge Device 71
     Bus#40/Dev#14/Func#0: Embedded I/O Bridge Device 71

     And when I move the card to another slot that is x16 length, I get this instead:

     Plug & Play Configuration Error: Memory Allocation
     Embedded I/O Bridge Device 44
     Bus#40/Dev#03/Func#0: Embedded I/O Bridge Device 44

     So, for some reason, my BIOS just can't initialize this GPU, and I don't fully understand why. When I looked up PCIe memory allocation, it seemed like all GPUs should request the same amount of memory on the PCIe bus, which makes me wonder why the old GT 120 works and the GTX 1650 doesn't. So maybe not all GPUs require the same amount of memory allocation on the PCIe bus after all? (A sketch for comparing the two cards' BAR sizes follows this post.)

     I also found that Dell says the PowerEdge R910 only supports 25W of power to PCIe devices, which I know is incorrect, since my GT 120 pulls 50W. (The PCIe spec allows up to 75W from the slot itself, with anything above that provided by 6-pin or 8-pin auxiliary power, as on more powerful graphics cards.) But I don't think power consumption is behind the memory allocation error anyway, because it's a memory allocation error, not a power error or the outright crash you normally see when a PCIe device doesn't get enough power.

     So now I'm down to figuring out more about PCIe memory allocation: where it is allocated from (memory, CPU cache, etc.), how PCIe devices are initialized, and whether Dell even implemented PCIe properly in the R910, given that they claim it can only push 25W to a PCIe device (which I've already disproved).
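On the question of whether all GPUs request the same amount of address space: the lshw output quoted in another post in this thread shows the GT 120's large prefetchable window as e8000000-efffffff (128 MB) and the GTX 1650's as e0000000-efffffff (256 MB), so the BAR sizes do differ. A minimal sketch for comparing them directly, assuming the cards land at 0000:44:00.x as in that output (run once per card, with that card installed):

#!/bin/bash
# Sketch: list each BAR ("Region") and its size for the GPU and its audio
# function, so the two cards' MMIO requirements can be compared.
for dev in 0000:44:00.0 0000:44:00.1; do
    echo "== $dev =="
    # Lines look like: "Region 1: Memory at e0000000 (64-bit, prefetchable) [size=256M]"
    lspci -vv -s "$dev" 2>/dev/null | grep 'Region'
done

If the BIOS's 32-bit MMIO window cannot fit the larger BAR, a "Memory Allocation" error during POST is a plausible symptom; whether the R910's final 2.12.0 BIOS offers any way to enlarge that window (for example an above-4G decoding option) is something I'm not certain of.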
  3. Mkay. I'll try this and report back.
  4. I can't even add a second GPU, or change the VM's GPU to the real one. Without IOMMU working, it won't let me pass the card in, so the virtual machine manager doesn't even give me the option; as a result, I can't select the vbios file for it either.

     I've seen this solution already. The problem is, I don't want Fedora installed, I want Unraid installed. And Unraid is already set up in 64-bit mode, or should be; I don't think it even runs in 32-bit mode.

     My BIOS settings are already set to enable all virtualization features. That's the problem: virtualization is already enabled, but IOMMU isn't working, and only when I have the GTX 1650 installed.

     I would love to figure out how to change the boot configuration on Unraid. I'm assuming it uses grub? There doesn't appear to be any system setting for it in the GUI, so I'm hesitant to change it manually through the terminal; although, if that's the correct way, I'm willing to try it (a sketch of what that change might look like follows this post). Will enabling this affect my HBA card? If so, is there a risk of it affecting the drives with my data on them? (Transferring the data to another server for backup just to change this setting would take 10 days to complete.) Also, will it even do anything at all, since IOMMU isn't being enabled in the kernel in the first place?
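On the boot-configuration question: as far as I know (an assumption, not something confirmed in this thread), Unraid boots through syslinux rather than grub, and kernel parameters go on the "append" line of syslinux.cfg on the flash drive, which the web GUI exposes when you click the flash device on the Main tab. A minimal sketch of checking the current line, with an example edit shown as a comment:

#!/bin/bash
# Sketch: inspect the kernel command line Unraid boots with.
# Path assumed from the usual Unraid flash layout (flash mounted at /boot).
cfg="/boot/syslinux/syslinux.cfg"

# Show the current append line(s).
grep -n 'append' "$cfg"

# An edited entry forcing the Intel IOMMU on might look like:
#   append initrd=/bzroot intel_iommu=on iommu=pt
# (made through the GUI's Syslinux Configuration editor rather than by hand)

That said, if the kernel is already refusing to enable the IOMMU because of the firmware, adding intel_iommu=on may change nothing, which is essentially the last question in the post above.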
  5. Okay, more info. I have dug deeper into this, and what seems to be going on is that the new GPU isn't getting an IRQ. I don't know if that is normal or not. Below are the hardware profile logs for each GPU, plugged into the same physical slot (labeled slot 7 on the motherboard). You can see they both end up at PCI 44:00.x, so they really are mapping onto the same bus. (This thing takes forever to reboot between GPU swaps, so it has not been a fun troubleshooting process.)

     GeForce 9500 GT (Apple GeForce GT 120):

     <node id="display" claimed="true" class="display" handle="PCI:0000:44:00.0">
       <description>VGA compatible controller</description>
       <product>G96C [GeForce 9500 GT]</product>
       <vendor>NVIDIA Corporation</vendor>
       <physid>0</physid>
       <businfo>pci@0000:44:00.0</businfo>
       <version>a1</version>
       <width units="bits">64</width>
       <clock units="Hz">33000000</clock>
       <configuration>
         <setting id="driver" value="vfio-pci" />
         <setting id="latency" value="0" />
       </configuration>
       <capabilities>
         <capability id="pm">Power Management</capability>
         <capability id="msi">Message Signalled Interrupts</capability>
         <capability id="pciexpress">PCI Express</capability>
         <capability id="vga_controller" />
         <capability id="cap_list">PCI capabilities listing</capability>
         <capability id="rom">extension ROM</capability>
       </capabilities>
       <resources>
         <resource type="irq" value="15" />
         <resource type="memory" value="f7000000-f7ffffff" />
         <resource type="memory" value="e8000000-efffffff" />
         <resource type="memory" value="f8000000-f9ffffff" />
         <resource type="ioport" value="dc80(size=128)" />
         <resource type="memory" value="f6000000-f607ffff" />
       </resources>
     </node>

     GTX 1650:

     <node id="display" class="display" handle="PCI:0000:44:00.0">
       <description>VGA compatible controller</description>
       <product>TU117 [GeForce GTX 1650]</product>
       <vendor>NVIDIA Corporation</vendor>
       <physid>0</physid>
       <businfo>pci@0000:44:00.0</businfo>
       <version>a1</version>
       <width units="bits">64</width>
       <clock units="Hz">33000000</clock>
       <configuration>
         <setting id="latency" value="0" />
       </configuration>
       <capabilities>
         <capability id="pm">Power Management</capability>
         <capability id="msi">Message Signalled Interrupts</capability>
         <capability id="pciexpress">PCI Express</capability>
         <capability id="vga_controller" />
         <capability id="cap_list">PCI capabilities listing</capability>
       </capabilities>
       <resources>
         <resource type="memory" value="fa000000-faffffff" />
         <resource type="memory" value="e0000000-efffffff" />
         <resource type="memory" value="de000000-dfffffff" />
         <resource type="ioport" value="dc80(size=128)" />
         <resource type="memory" value="f9000000-f907ffff" />
       </resources>
     </node>
     <node id="multimedia" class="multimedia" handle="PCI:0000:44:00.1">
       <description>Audio device</description>
       <product>NVIDIA Corporation</product>
       <vendor>NVIDIA Corporation</vendor>
       <physid>0.1</physid>
       <businfo>pci@0000:44:00.1</businfo>
       <version>a1</version>
       <width units="bits">32</width>
       <clock units="Hz">33000000</clock>
       <configuration>
         <setting id="latency" value="0" />
       </configuration>
       <capabilities>
         <capability id="pm">Power Management</capability>
         <capability id="msi">Message Signalled Interrupts</capability>
         <capability id="pciexpress">PCI Express</capability>
         <capability id="bus_master">bus mastering</capability>
         <capability id="cap_list">PCI capabilities listing</capability>
       </capabilities>
       <resources>
         <resource type="memory" value="f9ffc000-f9ffffff" />
       </resources>
     </node>

     As you can see, the old GPU ends up on IRQ 15, but the new GPU doesn't seem to get an IRQ at all. So I dug deeper into the syslog, and found this when the GTX 1650 is plugged in, but not when the GT 120 is:

     kernel: DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
     kernel: BIOS vendor: Dell Inc.; Ver: 2.12.0; Product Version:

     I've done a lot of research on this error, and from the looks of it, Dell is the only one who can fix it, unless I can patch the BIOS myself. Can I? The next question I began asking myself is: why does one GPU work while the other triggers this error? If DMAR is screwed up in the BIOS, wouldn't IOMMU fail regardless of which card is plugged in? (A quick way to pull the DMAR/IOMMU lines out of the kernel log is sketched after this post.)
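For reference, pulling the IOMMU-related lines out of the kernel log with each card installed is a quick way to compare the DMAR state between boots; nothing beyond a stock shell is assumed here. The raw ACPI DMAR table the firmware hands to the kernel is also visible under /sys/firmware/acpi/tables/DMAR for anyone who wants to inspect what Dell actually reports.

#!/bin/bash
# Sketch: gather the IOMMU-related kernel messages so the DMAR state can be
# compared between the GT 120 boot and the GTX 1650 boot.
dmesg | grep -i -e 'dmar' -e 'iommu' -e 'remapping'

# The firmware-provided DMAR table (binary) lives here when ACPI exposes it;
# comparing its presence and size between the two boots may be informative.
ls -l /sys/firmware/acpi/tables/DMAR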
  6. I've researched this into the ground. I don't know if I just missed the right thread, or if it has yet to exist, but I'm at my wit's end.

     I have a Dell PowerEdge R910 with quad Intel Xeon E7-4870 CPUs on BIOS 2.12.0 with virtualization enabled. IOMMU groups work exactly as they should, as long as I have the old Apple Nvidia GT 120 plugged in: I can pass the GPU to any VM I wish, the IOMMU groups show up properly, and everything works as expected. However, as soon as I plug in this new GTX 1650 running a TU117 core, it all fails. Unraid shows that IOMMU is no longer working, even though the BIOS still shows virtualization as enabled. It's like this card is somehow breaking my system, and I can't fathom why or how. As soon as I plug the old GT 120 back in, everything starts working properly again. But I want the GTX 1650 plugged in instead, because it's the card I want for the virtual machine I plan on passing it into.

     Please, what can I do to troubleshoot this? What can I do to make this work? I have changed PCIe slots. I have tried the card in another machine (a Windows machine that loads the drivers and plays games on it just fine). I have checked the BIOS settings for anything that may differ or have changed between the cards. Nothing I have done changes the fact that Unraid refuses to utilize the GTX 1650 in a way that lets me pass it to virtual machines; instead the card completely breaks IOMMU functionality. (The IOMMU group check I'm relying on is sketched after this post.)
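For completeness, the IOMMU group check mentioned above is the standard sysfs walk (nothing Unraid-specific assumed): when IOMMU initialization fails, /sys/kernel/iommu_groups is simply empty.

#!/bin/bash
# Sketch: list every device the kernel placed into an IOMMU group.
# An empty /sys/kernel/iommu_groups means the IOMMU never came up, which
# matches what Unraid reports as IOMMU not working.
shopt -s nullglob
for group in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${group##*/}:"
    for dev in "$group"/devices/*; do
        # One-line lspci description for each device in the group.
        lspci -nns "${dev##*/}"
    done
done

With the GT 120 installed this should list every device as Unraid shows them; with the GTX 1650 installed, an empty listing would confirm the failure happens at the kernel/firmware level rather than inside Unraid.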