• [6.9.1] Issues with GPU passthrough


    Voydz
    • Minor

    Hi there!

     

    There appears to be an issue with GPU passthrough. I had GPU passthrough working fine on 6.9.0-rc2, using the latest approach to vfio binding I could find (using /boot/config/vfio-pci.cfg). For some information about the vfio setup, please have a look at the screenshots. Unfortunately, I cannot provide much detail on how I set up the VM itself, because it has been running for quite some time and I never touched it.

     

    After upgrading to 6.9.1 I am able to launch the VM but I can not access it. My unRAID server logs are spammed with the following entries:

    ...
    Mar 15 09:15:56 Prime kernel: vfio-pci 0000:26:00.0: BAR 1: can't reserve [mem 0xc0000000-0xcfffffff 64bit pref]
    Mar 15 09:15:56 Prime kernel: vfio-pci 0000:26:00.0: BAR 1: can't reserve [mem 0xc0000000-0xcfffffff 64bit pref]
    Mar 15 09:15:56 Prime kernel: vfio-pci 0000:26:00.0: BAR 1: can't reserve [mem 0xc0000000-0xcfffffff 64bit pref]
    Mar 15 09:15:56 Prime kernel: vfio-pci 0000:26:00.0: BAR 1: can't reserve [mem 0xc0000000-0xcfffffff 64bit pref]
    Mar 15 09:15:56 Prime kernel: vfio-pci 0000:26:00.0: BAR 1: can't reserve [mem 0xc0000000-0xcfffffff 64bit pref]
    ...

     

    My VM outputs:

    ...
    2021-03-15T08:15:56.564833Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2c0, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564842Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2c1, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564850Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2c2, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564859Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2c3, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564868Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2c4, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564876Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2c5, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564885Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2c6, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564894Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2c7, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564902Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2c8, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564911Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2c9, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564920Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2ca, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564936Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2cb, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564948Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2cc, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564958Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2cd, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564968Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2ce, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564978Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2cf, 0x0,1) failed: Device or resource busy
    2021-03-15T08:15:56.564987Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2d0, 0x0,1) failed: Device or resource busy
    ...

     

    "Resource busy" seems quite odd to me. I searched for this issue first but could not find any hints, which makes me believe that the issue is related to the unRAID update.
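One way to investigate a "can't reserve" / "Device or resource busy" pair is to check what already claims the memory range named in the kernel log. The sketch below parses text in the format of /proc/iomem and lists every entry that overlaps the BAR 1 range from the log above. The name `find_claimants` is hypothetical, and the sample text is illustrative only (it is not taken from this server). On a real system the text would come from reading /proc/iomem as root, since unprivileged reads show zeroed addresses.

```python
import re

# Matches one /proc/iomem line, e.g. "  c0000000-c02fffff : efifb"
_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def find_claimants(iomem_text, start, end):
    """Return (start, end, name) for every entry overlapping [start, end]."""
    hits = []
    for line in iomem_text.splitlines():
        m = _LINE.match(line)
        if not m:
            continue
        lo, hi = int(m.group(1), 16), int(m.group(2), 16)
        if lo <= end and hi >= start:
            hits.append((lo, hi, m.group(3).strip()))
    return hits


# Illustrative sample only (NOT taken from the server in this report).
SAMPLE = """\
00000000-00000fff : Reserved
b0000000-dfffffff : PCI Bus 0000:00
  c0000000-cfffffff : 0000:26:00.0
    c0000000-c02fffff : efifb
e0000000-efffffff : PCI MMCONFIG 0000
"""

# BAR 1 range from the kernel log above
for lo, hi, name in find_claimants(SAMPLE, 0xC0000000, 0xCFFFFFFF):
    print(f"{lo:08x}-{hi:08x} : {name}")
```

If an entry such as a boot framebuffer appears nested inside the GPU's range, that would be consistent with the host still holding the card when the VM starts.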

     

    Thanks for any help and your hard work on this!

     

    best regards

    Bildschirmfoto 2021-03-15 um 09.31.47.png

    Bildschirmfoto 2021-03-15 um 09.32.14.png

    prime-diagnostics-20210315-0921.zip







    UPDATE: I noticed those errors in the vfio log, so I tried using the unRAID "Tools -> System Devices" page to reconfigure my vfio bindings. After that all warnings are gone, but the problem persists. For good measure, the new diagnostics are attached.

    prime-diagnostics-20210315-1013.zip

    Link to comment

    UPDATE: I tried to diagnose further by connecting a display to the (otherwise headless) server. I use a UEFI/GOP-compatible GPU as the primary card for the unRAID OS and a secondary one to pass through. After rebooting with the display connected to the "primary GPU", passthrough seems to work now.

     

    Now I am even more confused, as it feels like poor support on the motherboard/BIOS side of things. But this makes no sense to me, as I cannot recall that happening before the upgrade. The server was "reboot safe".

     

    (I hope it's okay that I just comment as I make progress on the issue; I feel it's the best way to keep it tidy.)

    Link to comment

    I have a similar issue. My setup:

    - GTX 1080Ti that I want to pass through to a Windows VM, in the top (first) slot, viewed as 0B:00... and marked as "bound to vfio at boot"

    - GTX 1650 that I want to use as an Unraid display and in Docker containers, in the bottom slot, viewed as 05:00

    Both have displays attached

     

    At boot, both primary displays (from both cards) display the startup menu. They are mirrored until "Loading /bzroot ...ok" shows up.

    At that point, the display connected to the 1650 freezes and the boot sequence continues on the primary display of the 1080Ti, until the vfio driver is loaded. Then, the display on the 1080Ti also freezes, which kinda makes sense.

     

    However, when I start the Windows VM, it starts up and the screen goes black, and the VM's log is filled with "2021-04-03T14:48:00.897611Z qemu-system-x86_64: vfio_region_write(0000:0b:00.0:region1+0x13550a, 0x0,1) failed: Device or resource busy"

    And very soon after, I get a warning about Unraid logs being full.
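Because every failed write is logged on its own line, it can help to collapse the flood into counts before reading the log. A minimal sketch, assuming the QEMU message format quoted in this thread; `summarize` is a hypothetical helper, and the sample lines are copied from the first post.

```python
import re
from collections import Counter

# Matches the QEMU message quoted in this thread; only the offset varies per line.
_VFIO_WRITE = re.compile(
    r"vfio_region_write\((?P<dev>[0-9a-f:.]+):region(?P<region>\d+)\+0x[0-9a-f]+, "
)


def summarize(lines):
    """Count 'Device or resource busy' write failures per (device, region)."""
    counts = Counter()
    for line in lines:
        m = _VFIO_WRITE.search(line)
        if m and line.rstrip().endswith("Device or resource busy"):
            counts[(m.group("dev"), int(m.group("region")))] += 1
    return counts


# Sample lines copied from the first post in this thread.
SAMPLE_LOG = [
    "2021-03-15T08:15:56.564833Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2c0, 0x0,1) failed: Device or resource busy",
    "2021-03-15T08:15:56.564842Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2c1, 0x0,1) failed: Device or resource busy",
    "2021-03-15T08:15:56.564850Z qemu-system-x86_64: vfio_region_write(0000:26:00.0:region1+0xbd2c2, 0x0,1) failed: Device or resource busy",
]

for (dev, region), n in summarize(SAMPLE_LOG).items():
    print(f"{dev} region{region}: {n} failed writes")
```

A single (device, region) pair with a large count, as here, points at one BAR being unavailable rather than at many unrelated faults.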

     

    Note: If I flip the GPUs, it works fine, but the 1080Ti gets poor airflow and blows directly on my NVMe drive and heats it up to 67°C, which makes me very uncomfortable...

     

    Link to comment

    @DaNoob Which video card do you have set in the BIOS to be your primary? You probably need to make sure the 1650 is set to be the primary video card if not already set.

    Link to comment
    15 minutes ago, Taddeusz said:

    @DaNoob Which video card do you have set in the BIOS to be your primary? You probably need to make sure the 1650 is set to be the primary video card if not already set.

     

    I was unable to find such an option in my BIOS. I'm running an Asus TUF GAMING X570-PLUS.

    I found something about enabling CSM that switches the primary GPU (no idea why).

    I'll give that a try and keep you posted.

     

    Link to comment
    4 hours ago, DaNoob said:

     

    I was unable to find such an option in my BIOS. I'm running an Asus TUF GAMING X570-PLUS.

    I found something about enabling CSM that switches the primary GPU (no idea why).

    I'll give that a try and keep you posted.

     

    It might be located under the PCI Express settings.

    Link to comment

    Sadly no, PCI express settings only offer me the option to choose the PCIe gen (1-4) for each port.

    They are all on auto by default and seem to handle my graphics cards (gen3) and sata controller (gen2) fine.

     

    I can confirm the trick of enabling CSM does not work on X570. I think it was more of a side effect on X370/B350 that was fixed with the later chipsets...

     

    Maybe it can be worked around by passing options to the kernel through grub? Telling the kernel to only load its framebuffer on the cards that are not marked for vfio use?

    Link to comment

    Hey there,

     

    I can confirm that this is a bit of a pain with MSI motherboards. On my old ASUS I could directly select the primary GPU. (This becomes especially interesting if you want to use a primary GPU in the secondary PCIe slot, as the primary slot is most likely the only one supporting x16 speed. In some rare cases a GPU could be bottlenecked a bit on slower slots, as far as I understand.)

     

    Currently on my MSI I am running with CSM DISABLED (full UEFI mode only; I don't remember exactly what the setting is called), which requires a GOP-compatible card in the secondary PCIe slot as the primary GPU. As far as I understand, the motherboard firmware checks the highest-order slots first and picks the first card it finds. This is nice as it leaves the one in the primary slot for passthrough.

     

    For now my system works flawlessly like this (besides the issue mentioned in my third update).

     

    best regards

    Link to comment

    I've checked my old ASRock's BIOS and it does indeed have an option to pick the 'primary' GPU. This seems to be an X570 issue as outlined by the Reddit thread I linked to in my previous post.

     

    So, I have decided to sidestep the issue by ordering a vertical GPU mounting riser from Fractal Design, so at least the GPU won't blow directly on my NVMe drive: it reaches 67°C at full load (thanks Dyson Sphere Program 😉), which is way too hot for comfort (the rated range is 0-70°C according to Samsung)...

    According to Unraid, that second NVMe is 10 to 15°C hotter than the other one, with the GPU under load, and 6-7° hotter when idling.

    Yes the bottom slot is "only" 8X, but it is PCIe gen4, and the GPU is gen3. I'm not sure if the conversion is done, giving me 16X gen 3 or not. But I have honestly never noticed a difference in performance. Maybe in synthetic benchmarks or other applications, but for gaming and everyday work, 8X seems to be fine...

    Link to comment
    On 4/4/2021 at 2:53 PM, DaNoob said:

    Yes the bottom slot is "only" 8X, but it is PCIe gen4, and the GPU is gen3. I'm not sure if the conversion is done, giving me 16X gen 3 or not.

     

    No, it doesn't work like that. It works by lowest common denominator. The GPU can't work at gen 4 speeds, and the slot can only provide 8 lanes, so you end up with gen 3 x8, which is enough for most purposes.
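The "lowest common denominator" rule can be sketched as taking the minimum of each property on both sides of the link. The function name `negotiated_link` and the lookup table are assumptions for illustration; the per-lane figures are approximate usable throughput after line-encoding overhead.

```python
# Approximate usable throughput per lane in GB/s, after line-encoding overhead.
PER_LANE_GBPS = {1: 0.25, 2: 0.5, 3: 0.985, 4: 1.969}


def negotiated_link(slot_gen, slot_lanes, card_gen, card_lanes):
    """Return (gen, lanes, approx GB/s); each property is the lower of the two sides."""
    gen = min(slot_gen, card_gen)
    lanes = min(slot_lanes, card_lanes)
    return gen, lanes, round(PER_LANE_GBPS[gen] * lanes, 2)


# Gen 4 x8 slot with a gen 3 x16 card, as described above.
print(negotiated_link(4, 8, 3, 16))
```

With those inputs the link comes out as gen 3 x8, roughly half the throughput of the same card in a gen 3 x16 slot.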

    Link to comment

    You are right. I read the Asus TUF Gaming X570-Plus manual (page 21), and it appears that the top 16X connector is controlled by the CPU, while the bottom 16X and the three 1X connectors are all controlled by the chipset. That means, with Ryzen 3rd gen, the top slot has 16 lanes (whatever gen), and the bottom slot has only 4, at best. So there would indeed be a benefit if I could plug my 'main' GPU in the top slot.

     

    I'm still going with the vertical mounting of the 1080Ti, because whichever slot I use, I'm blowing hot air right at one of my M.2s. The bracket and extension should arrive by the end of the week and greatly improve my airflow... and allow for easier switches/troubleshooting.

     

    So we are back to the issue of being unable to select a primary GPU in the X570 BIOS (damn you Asus), and the VM refusing to start with the "Device or resource busy" error message in the logs.

     

    Maybe I don't understand how the 'System devices' menu works in 6.9.1? I assumed that having the device marked for vfio-pci was all I had to do. Apparently, it is not. Does it work differently in 6.9? Or do I still have to dump/reload the vbios like in older documentation/videos (thanks @SpaceInvaderOne btw, you taught me a lot! But I still have a lot to learn apparently...)?

     

    Also, a way to tell the kernel to use the 1650 for its framebuffer/GUI would be a big help since it would allow me to see the console and manage the server locally, while having the nvidia driver loaded, hence reducing the power consumption...

    Link to comment

    Alright, I keep trying to get my 1080ti to work in the first slot, to no avail. I've been able to extract a vbios with @SpaceInvaderOne's script. I've also removed the header from one I found on techpowerup. I've tried both and get the same result when I boot the VM:

    • Screen goes black for a couple of seconds, then displays the kernel output that was on the screen previously but kinda "zoomed in"
    • The VM logs fill with "Device or resource busy" errors, and quickly fill the unraid server logs too.

     

    This only happens when the GPU is plugged in the first slot (marked for vfio or not). It works fine in the second one.

    Any ideas on what else I could try?

    Link to comment

    Following as I am running into a similar if not the same exact issue.  I have the same mobo as you.  Using a 3900x cpu and MSI 2070 Super Ventus OC.  If I manage to get this to work in the primary slot, I'll let you know.

    Link to comment

    Yes, I still have the issue. What did you change? As a last resort, I was starting to look at the new Ryzen 5700G with integrated graphics. It should be compatible with my mobo and fix the issue. But that is a decent chunk of change for a relatively small upgrade. It would be a different story if they added a 12-core SKU, but that seems unlikely...

    Link to comment

    I just registered to say that I'm having the same issue on a WS X570 ACE mobo. For me it seems that Unraid is grabbing my GPU in the first x16 slot (with a dummy HDMI adapter) even though I bind it to vfio through:

     

    Tools -> System Devices -> PCI Devices and IOMMU Groups

     

    I experimented with adding video=efifb:off to syslinux/syslinux.cfg and this seems to work around the issue, but now I'm not able to have a terminal console as a fallback for unraid.
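The edit described above amounts to appending the parameter to the `append` line of one boot entry. A minimal sketch, assuming a config layout like the sample below (the sample and the helper name `add_kernel_param` are hypothetical; check the actual file before changing it and keep a backup):

```python
def add_kernel_param(cfg_text, label, param):
    """Add `param` to the `append` line of the boot entry named `label`."""
    out = []
    in_entry = False
    for line in cfg_text.splitlines():
        words = line.split()
        if words and words[0].lower() == "label":
            # A new boot entry starts; check whether it is the one we want.
            in_entry = " ".join(words[1:]) == label
        elif in_entry and words and words[0].lower() == "append":
            # Only add the parameter if it is not already present.
            if param not in words[1:]:
                line = line.rstrip() + " " + param
        out.append(line)
    return "\n".join(out) + "\n"


# Assumed layout for illustration; the real file may differ.
SAMPLE_CFG = """\
default menu.c32
label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot
label Unraid OS GUI Mode
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui
"""

print(add_kernel_param(SAMPLE_CFG, "Unraid OS", "video=efifb:off"))
```

Only the named entry is changed, so another boot entry without the parameter stays available as a fallback.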

    Edited by tantris
    Link to comment
    2 hours ago, tantris said:

    ... I bind it to vfio ... but now I'm not able to have terminal console as a fallback for unraid.

     

    Once a device has been bound to vfio it is completely hidden from Unraid. Therefore there can be no terminal console to that device.

    Link to comment
    On 6/4/2021 at 3:57 PM, DaNoob said:

    Yes, I still have the issue. What did you change? As a last resort, I was starting to look at the new Ryzen 5700G with integrated graphics. It should be compatible with my mobo and fix the issue. But that is a decent chunk of change for a relatively small upgrade. It would be a different story if they added a 12-core SKU, but that seems unlikely...

    I didn't have to edit syslinux.cfg at all. I had my GPU in slot one. I made sure it was isolated into its own IOMMU group. I didn't bind it to vfio. I followed the SpaceInvaderOne video on dumping the vbios from within unraid using a script he had written. I set the card and vbios in the VM template and Bob's your uncle. I've been gaming on it for weeks now with no issue. We can compare settings if you want.

     

    Beforehand I did encounter the same issue of the logs filling up, but since I did the above, this has not happened.
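Whether a card sits alone in its IOMMU group can be checked from sysfs, where each group is a directory listing its member devices. A minimal sketch; the helper names `iommu_groups` and `is_isolated` are hypothetical, and "isolated" is taken here to mean that the group holds only functions of the same card (for example its video and audio functions).

```python
import os


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map IOMMU group number -> sorted list of PCI addresses in that group."""
    groups = {}
    for name in os.listdir(root):
        dev_dir = os.path.join(root, name, "devices")
        if name.isdigit() and os.path.isdir(dev_dir):
            groups[int(name)] = sorted(os.listdir(dev_dir))
    return groups


def is_isolated(groups, gpu_prefix):
    """True if the group holding the GPU contains only functions of that GPU."""
    for devices in groups.values():
        if any(d.startswith(gpu_prefix) for d in devices):
            return all(d.startswith(gpu_prefix) for d in devices)
    return False  # GPU not found in any group
```

For the 1080Ti mentioned earlier in the thread, a call might look like `is_isolated(iommu_groups(), "0000:0b:00.")`.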

    Edited by Mobius71
    Additional info
    Link to comment
    19 hours ago, tantris said:

    That's strange, because I still get terminal output on a GPU bound to vfio via the GUI.

     

    You said something about adding video=efifb:off to syslinux to make vfio binding work? I don't know what that does, where did you find it?

     

    What I am saying is that if vfio binding is working properly, Unraid should not be able to access the video card once the vfio code executes. So during boot you may see some text on the screen, but once the vfio code executes I would expect it to stop. Just trying to help set your expectations about what it means to bind a device to vfio-pci.
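Whether the binding actually took effect can be checked from sysfs: a bound PCI device has a `driver` symlink whose target is named after the driver. A minimal sketch, with `bound_driver` as a hypothetical helper name:

```python
import os


def bound_driver(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI device, or None if unbound."""
    link = os.path.join(sysfs_root, pci_addr, "driver")
    if not os.path.islink(link):
        return None
    # The symlink target ends with the driver's name, e.g. ".../drivers/vfio-pci".
    return os.path.basename(os.readlink(link))
```

For the card in the original report, `bound_driver("0000:26:00.0")` would be expected to return `"vfio-pci"` once binding has succeeded. Note that this only shows which driver owns the device; it does not show whether a boot framebuffer is still mapped onto the card's memory.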

    Link to comment

    @ljm42 What video card is set to be primary in your BIOS? If the boot screen shows on the video card you intend to pass through it’s not going to work no matter how vfio is set. On my motherboard I must have a monitor connected to the integrated video for it to work as primary. Otherwise it automatically boots to my vm’s video card and fails to pass through even though the integrated is set as primary.

    Edited by Taddeusz
    Link to comment

    @Taddeusz No disrespect, but according to your logic single GPU passthrough wouldn't be possible. As far as I know there is only one mainboard brand that allows you to select the primary GPU in a system without an iGPU/APU, and that's Gigabyte. I'm running an ASRock X570M Pro4 and I believe I've seen other unraid users with this board.

     

    @ljm42 I found the video=efifb:off flag in several places. Like here and here.

     

     

     

    Edited by tantris
    Link to comment

    Thanks. Will try that at the next shutdown opportunity. The longer I use unraid, the more it does... I'm at the point where it is the only computer in the house (excl. router&phone&laptop): media server, file server with remote replication, personal cloud, work VM, gaming VM, dev environment... I had so much fun playing with unraid, I created the most massive SPoF of my career, in my own house...

    Link to comment




