IOMMU Works But Doesn't


tyrelius


I've researched this into the ground. I don't know if I just missed the right thread or if it has yet to exist, but I'm at wit's end. I have a Dell PowerEdge R910 with quad Intel Xeon E7-4870s on BIOS 2.12.0 with virtualization enabled. As expected, IOMMU groups work exactly as they should, as long as I have the old Apple Nvidia GT 120 plugged in. I can pass the GPU to any VM I wish, the IOMMU groups show up properly, and everything works as expected.

 

However, as soon as I plug in this new GTX 1650 (TU117 core), it all fails. Unraid reports that IOMMU is no longer working, even though the BIOS still shows virtualization enabled. It's like this card is somehow breaking my system, and I can't fathom why or how. As soon as I plug the old GT 120 back in, everything starts working properly again. But I want the GTX 1650 in this machine, because it's the card I want for the VM I plan on passing it into.

 

Please, what can I do to troubleshoot this? What can I do to make this work? I have changed PCIe slots. I have tried the card in another machine (a Windows box that loads the drivers and plays games on it just fine). I have checked the BIOS settings for anything that might differ between the cards. Nothing I have done changes the outcome: Unraid refuses to use the GTX 1650 in a way that lets me pass it to virtual machines, and instead IOMMU functionality breaks completely.
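For anyone who wants to check the same thing, a loop like this over sysfs (the standard IOMMU-group walk, nothing Unraid-specific) lists every group and the devices in it; when IOMMU fails to come up, /sys/kernel/iommu_groups is simply empty:

# List each IOMMU group and the devices inside it.
# An empty /sys/kernel/iommu_groups means IOMMU never initialized.
for dev in /sys/kernel/iommu_groups/*/devices/*; do
    group=$(basename "$(dirname "$(dirname "$dev")")")
    printf 'IOMMU group %s: ' "$group"
    lspci -nns "${dev##*/}"
done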


Okay, more info. I have dug deeper into this, and what seems to be going on is that the new GPU isn't getting an IRQ. I don't know if this is normal or not. Below are the hardware profile logs for each GPU, plugged into the same physical slot, labeled slot 7 on the motherboard. Both cards end up at PCI address 44:00.x, so I can tell they're mapping onto the same bus as well. (This thing takes forever to reboot between GPU swaps, so this has not been a fun troubleshooting process.)
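(These dumps are lshw's XML output. If anyone wants to pull the same thing on their own box, something along these lines should reproduce it, though I'm going from memory on the exact flags; the GPU shows up under the display class, and the 1650's HDMI audio function under multimedia:

lshw -xml -C display -C multimedia

)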

 

GeForce 9500 GT (Apple GeForce GT 120):

      <node id="display" claimed="true" class="display" handle="PCI:0000:44:00.0">
       <description>VGA compatible controller</description>
       <product>G96C [GeForce 9500 GT]</product>
       <vendor>NVIDIA Corporation</vendor>
       <physid>0</physid>
       <businfo>pci@0000:44:00.0</businfo>
       <version>a1</version>
       <width units="bits">64</width>
       <clock units="Hz">33000000</clock>
       <configuration>
        <setting id="driver" value="vfio-pci" />
        <setting id="latency" value="0" />
       </configuration>
       <capabilities>
        <capability id="pm" >Power Management</capability>
        <capability id="msi" >Message Signalled Interrupts</capability>
        <capability id="pciexpress" >PCI Express</capability>
        <capability id="vga_controller" />
        <capability id="cap_list" >PCI capabilities listing</capability>
        <capability id="rom" >extension ROM</capability>
       </capabilities>
       <resources>
        <resource type="irq" value="15" />
        <resource type="memory" value="f7000000-f7ffffff" />
        <resource type="memory" value="e8000000-efffffff" />
        <resource type="memory" value="f8000000-f9ffffff" />
        <resource type="ioport" value="dc80(size=128)" />
        <resource type="memory" value="f6000000-f607ffff" />
       </resources>
      </node>

 

GTX 1650:

      <node id="display" class="display" handle="PCI:0000:44:00.0">
       <description>VGA compatible controller</description>
       <product>TU117 [GeForce GTX 1650]</product>
       <vendor>NVIDIA Corporation</vendor>
       <physid>0</physid>
       <businfo>pci@0000:44:00.0</businfo>
       <version>a1</version>
       <width units="bits">64</width>
       <clock units="Hz">33000000</clock>
       <configuration>
        <setting id="latency" value="0" />
       </configuration>
       <capabilities>
        <capability id="pm" >Power Management</capability>
        <capability id="msi" >Message Signalled Interrupts</capability>
        <capability id="pciexpress" >PCI Express</capability>
        <capability id="vga_controller" />
        <capability id="cap_list" >PCI capabilities listing</capability>
       </capabilities>
       <resources>
        <resource type="memory" value="fa000000-faffffff" />
        <resource type="memory" value="e0000000-efffffff" />
        <resource type="memory" value="de000000-dfffffff" />
        <resource type="ioport" value="dc80(size=128)" />
        <resource type="memory" value="f9000000-f907ffff" />
       </resources>
      </node>
      <node id="multimedia" class="multimedia" handle="PCI:0000:44:00.1">
       <description>Audio device</description>
       <product>NVIDIA Corporation</product>
       <vendor>NVIDIA Corporation</vendor>
       <physid>0.1</physid>
       <businfo>pci@0000:44:00.1</businfo>
       <version>a1</version>
       <width units="bits">32</width>
       <clock units="Hz">33000000</clock>
       <configuration>
        <setting id="latency" value="0" />
       </configuration>
       <capabilities>
        <capability id="pm" >Power Management</capability>
        <capability id="msi" >Message Signalled Interrupts</capability>
        <capability id="pciexpress" >PCI Express</capability>
        <capability id="bus_master" >bus mastering</capability>
        <capability id="cap_list" >PCI capabilities listing</capability>
       </capabilities>
       <resources>
        <resource type="memory" value="f9ffc000-f9ffffff" />
       </resources>
      </node>

 

As you can see, the old GPU ends up on IRQ 15, but the new GPU doesn't seem to get an IRQ at all. So I dug deeper into the syslog, and found this when the GTX 1650 is plugged in, but not when the GT 120 is:

kernel: DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!
kernel: BIOS vendor: Dell Inc.; Ver: 2.12.0; Product Version: 

 

I've done a lot of research on this error, and from the looks of it, Dell is the only one who can fix it, unless I can patch the BIOS myself. Can I? The next question I began asking myself is this: why does one GPU work while the other triggers this error? If the DMAR table in the BIOS is broken, wouldn't IOMMU fail regardless of which card is plugged in?
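For anyone trying to reproduce this, a straightforward way to compare the two boots is to dump the IOMMU-related kernel messages each time and diff them. The file names here are just examples; /boot is the flash drive on Unraid, so the logs survive a reboot:

# With the GT 120 installed:
dmesg | grep -iE 'DMAR|IOMMU' > /boot/dmar-gt120.log
# After swapping in the GTX 1650:
dmesg | grep -iE 'DMAR|IOMMU' > /boot/dmar-gtx1650.log
# Compare what changed between the two boots:
diff /boot/dmar-gt120.log /boot/dmar-gtx1650.log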


Hi, have you tried this? Also, have you tried adding a vBIOS to the VM setup?

1 hour ago, tyrelius said:

kernel: DMAR: [Firmware Bug]: Your BIOS is broken; DMAR reported at address 0!

 

From the Fedora write-up on this message:

Your BIOS is broken; DMAR reported at address zero!

Please note that if you are using a system with such a broken BIOS, the kernel message will always appear, even if the kernel in fact handles your case correctly, or you have successfully worked around the issue. So don't worry that you still see the message once you have worked around the problem.

There are several ways to work around this issue. In most cases (see above), installing the 64-bit edition of Fedora 12 would be enough. If your BIOS has an option for it, enabling virtualization features in the BIOS should also work around this problem. Finally, you can work around this issue by appending the kernel parameter iommu=soft to your boot configuration.

 

Or try unsafe interrupts in VM settings.
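For reference, my understanding is that the unsafe-interrupts toggle just sets the standard vfio_iommu_type1 module option, so if the GUI setting doesn't take, you can also pass it at boot:

# As a kernel boot argument (the persistent route on Unraid, since /etc lives in RAM):
vfio_iommu_type1.allow_unsafe_interrupts=1

# Equivalent modprobe form on a stock Linux install:
echo 'options vfio_iommu_type1 allow_unsafe_interrupts=1' >> /etc/modprobe.d/vfio.conf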

19 minutes ago, SimonF said:

Hi have you tried this. Also have you tried adding a vbios to the vm setup

I can't even add a second GPU or switch the VM's GPU to the physical one. Without IOMMU working, Unraid won't let me pass the card in, so the virtual machine manager doesn't even give me the option. As such, I can't select a vBIOS file for it either.

 

20 minutes ago, SimonF said:

There are several ways to work around this issue. In most cases (see above), installing the 64-bit edition of Fedora 12 would be enough. If your BIOS has an option for it, enabling virtualization features in the BIOS should also work around this problem. Finally, you can work around this issue by appending the kernel parameter iommu=soft to your boot configuration.

I've seen this solution already. The problem is, I don't want Fedora installed; I want Unraid installed. And Unraid is already 64-bit; as far as I know it doesn't even ship a 32-bit build. My BIOS settings already have every virtualization feature enabled. That's the problem: virtualization is enabled, yet IOMMU isn't working, but only when the GTX 1650 is installed.
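A quick sanity check on the architecture from the terminal:

uname -m    # prints x86_64 on a 64-bit kernel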

I would love to figure out how to change the boot configuration on Unraid. I'm assuming it uses grub? There doesn't appear to be any system setting for it in the GUI, so I'm hesitant to change it manually through the terminal. Although, if that's the correct way, I'm willing to try it.
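Edit: from what I can tell, Unraid actually boots via syslinux rather than grub, and the kernel arguments live in syslinux.cfg on the flash drive (I believe the Main > Flash page in the GUI exposes the same file). Adding the iommu=soft workaround would look roughly like this, assuming a stock config; parameter order on the append line shouldn't matter:

# /boot/syslinux/syslinux.cfg (only the relevant label shown)
label Unraid OS
  menu default
  kernel /bzimage
  append iommu=soft initrd=/bzroot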

 

22 minutes ago, SimonF said:

Or try unsafe interupts in vm settings

Will enabling this affect my HBA card? If so, is there any risk to the drives holding my data? (Transferring the data to another server for backup just to change this setting would take 10 days.) Also, will it even do anything at all, given that IOMMU isn't being enabled in the kernel in the first place?


Mkay, so long delay here for the update. After lots more troubleshooting, I finally decided to plug a monitor in and watch the POST. Well, that pointed to an even bigger problem. When the GTX 1650 is plugged into the server, I see an error during boot, which I think is the reason I can't get IOMMU to work with it:

Plug & Play Configuration Error:
Memory Allocation
Embedded I/O Bridge Device 71
 Bus#40/Dev#14/Func#0: Embedded I/O Bridge Device 71

And when I move the card to another slot that is physically x16, I get this:

Plug & Play Configuration Error:
Memory Allocation
Embedded I/O Bridge Device 44
 Bus#40/Dev#03/Func#0: Embedded I/O Bridge Device 44

 

So... for some reason, my BIOS just can't initialize this GPU, and I don't fully understand why. When I first looked up PCIe memory allocation, I assumed all GPUs would request roughly the same amount of address space on the PCIe bus, which made me wonder why the old GT 120 works and the GTX 1650 doesn't. So maybe not all GPUs require the same amount of memory allocation on the PCIe bus after all?
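Comparing the two lshw dumps above actually points that way: the GT 120's largest memory window is 128 MB (e8000000-efffffff), while the GTX 1650 requests 256 MB (e0000000-efffffff), so the new card wants twice the address space. The BAR sizes can also be read straight out of lspci; the output should look along these lines (the sizes below come from the dump above, but the 32/64-bit and prefetchable flags are my guess):

lspci -vs 44:00.0 | grep -i memory
#   Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
#   Memory at e0000000 (64-bit, prefetchable) [size=256M]
#   Memory at de000000 (64-bit, prefetchable) [size=32M]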

 

I also found that Dell says the PowerEdge R910 only supports 25W of power to PCIe devices, which I know is incorrect, since my GT 120 pulls 50W. (A full x16 slot should be able to deliver 75W on its own, with anything above that coming from 6-pin or 8-pin auxiliary power, like you see on more powerful graphics cards.) But I don't think power consumption is behind this anyway, because it's a memory allocation error, not a power error or the flat-out crash you normally see when a PCIe device is underpowered.

 

So now I'm down to figuring out more about PCIe memory allocation: where it's allocated from (system memory, CPU address space, etc.) and how PCIe devices are initialized. And then whether Dell even implemented PCIe properly in the R910, given their claim that it can only push 25W to a PCIe device (which I've already disproved).

