Unraid has kernel panic moments after passthrough GPU is utilized by VM


ogi


Hello Unraid community, I'm hoping that I can get some help because I am stumped! 

 

I'm trying to pass through a GTX 670 GPU to a Windows VM for the purpose of running an NVIDIA GameStream host.  Unfortunately, when I try to pass the GPU through to the VM, Unraid has a kernel panic with the following message.  (I should note that when using Splashtop Personal, or with the GPU connected to a monitor, the graphics card is clearly recognized and the resolution starts adjusting accordingly.)

Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler

[Screenshot: kernel panic console output]

Using VNC without the GPU, everything works as expected.

 

Hardware in question:

 

Motherboard: Supermicro X9DRi-LN4F+

GPU: Gigabyte GTX 670, flashed with a UEFI-capable vBIOS

BIOS and IPMI/BMC: updated to the latest versions posted on Supermicro's website

 

I've attached the XML config for the VM to this post.  

 

I have a vfio-pci.cfg file with the following content (those are the PCI addresses of the GPU and its audio function).

 

BIND=83:00.0 83:00.1
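
For sanity, the binding can be verified after boot by checking which kernel driver claims each function (what you'd hope to see is "Kernel driver in use: vfio-pci" for both 83:00.0 and 83:00.1):

lspci -nnk -s 83:00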

 

The IOMMU group consists strictly of those two devices:

 

[Screenshot: IOMMU group listing]

 

I've tried various options in syslinux.cfg and haven't found anything that worked, but for this last run where I collected data, this is what the file looked like:

 

default menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label Unraid OS
  menu default
  kernel /bzimage
  append isolcpus=16-19,36-39 pcie_aspm=off video=vesafb:off,efifb:off initrd=/bzroot
label Unraid OS GUI Mode
  kernel /bzimage
  append isolcpus=16-19,36-39 initrd=/bzroot,/bzroot-gui
label Unraid OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
label Unraid OS GUI Safe Mode (no plugins)
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui unraidsafemode
label Memtest86+
  kernel /memtest
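
(One variant still on my list to try: some VFIO guides write the framebuffer options as separate video= parameters rather than comma-joined, i.e.

append isolcpus=16-19,36-39 pcie_aspm=off video=vesafb:off video=efifb:off initrd=/bzroot

I haven't confirmed whether the kernel treats the two forms the same.)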

Before I started the VM, I made sure syslog mirroring to the USB was running, so I captured the syslog and have attached it to this post as well.  Lastly, the diagnostics are attached too.

 

I was having a lot of Error Code 43 issues, but once I flashed the GPU with the UEFI-capable vBIOS, it seemed to work for a bit (the resolution adjusted to the monitor's native resolution... then seconds later, kernel panic).

 

Iterating on this has been difficult, as each time I attempt to start the VM, the panic forces an unclean shutdown and a parity check afterward.

 

If anyone here has any suggestions, or something sticks out to them in the diagnostics or syslog, please let me know; I'm all out of ideas.  I should also add that I'm not above going out and getting a different GPU if there is evidence to suggest it's the culprit.

 

Attachments: tower-diagnostics-20200311-1839.zip, vfio-pci.cfg, windows-vm.xml, syslog

 


Digging through the syslog myself, these appear to be the last two entries before the kernel panic:

 

Mar 11 17:55:20 Tower kernel: vfio_bar_restore: 0000:83:00.0 reset recovery - restoring bars
Mar 11 17:55:24 Tower kernel: vfio_bar_restore: 0000:83:00.0 reset recovery - restoring bars

 

The device 83:00.0 is the GPU I'm trying to pass through.  From what I can tell, vfio_bar_restore fires when vfio-pci notices the device was reset behind its back and has to restore the BARs, so the card seems to be falling over during reset.


@ogi Did you try your GPU in another slot on the board? If not, give that a try. You might have to adjust the syslinux config. Some boards have issues passing through a GPU plugged into the first slot. Adding a cheap GPU in the first slot and using the GPU in the second or third slot is also an option. On my X399 board I have 5 PCIe slots, and having only one card in slot 1 won't work for me. By adding a second card I'm able to pass both of them through, no matter which slot I use.

 

Also keep in mind, if you're referencing a vBIOS in your XML, that you use the right one for your card; you'll mostly find different revisions on TechPowerUp. Maybe try a different vBIOS. If it's an Nvidia card you want to pass through, you have to manually hex-edit the vBIOS and remove part of the header, as SpaceInvaderOne described in one of his videos.
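
For reference, all the VM settings template does is add a <rom> line to the GPU's hostdev entry in the XML. It should look something like this (the bus address and file path here are placeholders, check yours):

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x83' slot='0x00' function='0x0'/>
  </source>
  <rom file='/mnt/user/isos/vbios/GTX670-noheader.rom'/>
</hostdev>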

9 hours ago, bastl said:

@ogi Did you try your GPU in another slot on the board? If not, give that a try. You might have to adjust the syslinux config. Some boards have issues passing through a GPU plugged into the first slot. Adding a cheap GPU in the first slot and using the GPU in the second or third slot is also an option. On my X399 board I have 5 PCIe slots, and having only one card in slot 1 won't work for me. By adding a second card I'm able to pass both of them through, no matter which slot I use.

 

Also keep in mind, if you're referencing a vBIOS in your XML, that you use the right one for your card; you'll mostly find different revisions on TechPowerUp. Maybe try a different vBIOS. If it's an Nvidia card you want to pass through, you have to manually hex-edit the vBIOS and remove part of the header, as SpaceInvaderOne described in one of his videos.

Thanks for the reply @bastl

 

As this card is two slots wide, there is only one other slot I can try plugging it into.  That's second on my list of things to try, after booting Unraid via UEFI, which was suggested in some other forums for this sort of error.

 

Regarding the vBIOS: I did grab a vBIOS from TechPowerUp, but... I flashed that vBIOS to my card, then dumped the vBIOS shortly after (because why not) and modified the header per SpaceInvaderOne's video.  Flashing the TechPowerUp BIOS is what allowed the GPU to "work" (for a few seconds, anyway), because the BIOS that was on the card to begin with did not support UEFI booting, which, as I saw on the wiki, is a requirement for OVMF.  The reason I can tell it was working momentarily is that the monitor I had the GPU connected to showed the login screen at the correct resolution, and the Splashtop streaming client also adjusted to an "appropriate" resolution.
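
(For anyone following along: one way to do the dump from the Unraid side is the sysfs rom attribute — a sketch below, using my card's PCI address; a Windows tool like GPU-Z on a desktop works too.  The read only succeeds if nothing is actively using the card.)

cd /sys/bus/pci/devices/0000:83:00.0
echo 1 > rom           # enable reading the expansion ROM
cat rom > /boot/gtx670-dump.rom
echo 0 > rom           # turn it back off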

 

I'll be following up a bit more.  With the COVID-19 pandemic, my Plex server has been getting used a little more heavily than usual, so it's tough to come up with downtime that won't impact folks.


Hmm... booted in UEFI mode; getting different behavior.

 

The array starts fine, I start the VM, and it registers as running, but the VM never becomes available via Splashtop.  Eventually I connect a monitor, and the monitor says:

 

"The current input timing is not supported by the monitor display.  Please change the input timing to <resolution> @ <refresh_rate> ..."

 

Guess the next thing to try is removing the GPU from the VM's passthrough devices and seeing what happens.

 

EDIT: I was unable to stop the VM normally; I had to Force Stop it.

 

EDIT2: Force-stopped the VM, removed the GPU, booted with VNC fine, stopped the VM normally, re-added the GPU, same kernel panic (IOW, UEFI boot made no difference).  Trying the other available slot now (yay for the baby sleeping for a long stretch, letting me try out these different configs!)

 

EDIT3:  No luck with moving the GPU.  Moved it to the adjacent slot, used the VFIO plugin to bind the devices in that IOMMU group, started the VM with the non-header-removed vBIOS and got an Error Code 43; then I referenced the vBIOS with the header removed and got the same kernel panic as earlier.  I came across this page, which has some other suggestions I may try: https://github.com/intel/nemu/wiki/Testing-VFIO-with-GPU

2 hours ago, ogi said:

I flashed that vBIOS to my card

Where did you get the information to flash the BIOS directly to your card? NO ONE in the forums will ever give you that advice!!! Selecting the downloaded BIOS in the VM settings template is all you have to do. I've never heard of someone flashing the BIOS directly to the card to get GPU passthrough working. Let's hope you haven't bricked your card.

4 minutes ago, bastl said:

Where did you get the information to flash the BIOS directly to your card? NO ONE in the forums will ever give you that advice!!! Selecting the downloaded BIOS in the VM settings template is all you have to do. I've never heard of someone flashing the BIOS directly to the card to get GPU passthrough working. Let's hope you haven't bricked your card.

The card is not bricked; it works fine in the desktop (that's where I did the flashing, and where I verified the card now has UEFI capability).  The vBIOS I flashed is meant for my exact model; the only difference from the vBIOS that was on the card is the UEFI capability, which is needed.  Here is the post where another user discovered the same thing:

 

 


From my googling, it sounds like 600-series cards could be UEFI compatible, but generally weren't initially; from the looks of things, manufacturers were distributing UEFI-capable vBIOSes on request through their forums.  I confirmed in my desktop via GPU-Z that my Gigabyte GTX 670 did not have UEFI capability.

 

I should note that before I flashed the BIOS, no matter what configuration I used in the VM setup, I would always get Error Code 43; it wasn't until I read the bit on the wiki — that devices passed through under OVMF must support UEFI booting — that I got anywhere.  I tried a SeaBIOS config, but could never get the VM to start.


Another oddity I'm discovering is that sometimes the GTX 670 is not even listed in Tools -> System Devices, or visible in `lspci`:

root@Tower:/boot/config# lspci | grep 670
root@Tower:/boot/config#

On reboot the GPU is usually shown, but I've had to reboot on a few occasions now to ensure the device is visible in System Devices.  I suppose I should try starting the VM like this, without the device being visible in Unraid... the server is being utilized somewhat heavily right now, so this will have to wait :(


Well, when the GPU goes invisible the way I described earlier, all I need to do is plug it into my desktop, power it up, then put it back into the server, and voilà, it's visible again.
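
If it disappears again, I might first try forcing a PCI bus rescan from the console before physically reseating it; no idea whether the card will answer, but the sysfs knob is there:

echo 1 > /sys/bus/pci/rescan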

 

Anyway, I tried adding the following options when booting, but still got the same result.

append isolcpus=16-19,36-39 pcie_acs_override=downstream,multifunction intel_iommu=on rd.driver.pre=vfio-pci video=vesafb:off,efifb:off initrd=/bzroot
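
For anyone who wants to check the grouping from the console rather than Tools -> System Devices, the IOMMU groups can be listed straight from sysfs:

for g in /sys/kernel/iommu_groups/*; do echo "IOMMU group ${g##*/}:"; ls "$g/devices"; done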

This time I had `dmesg -wH` running.  I don't think it gave me any more meaningful information, but I'll post the screenshot regardless:

[Screenshot: dmesg output at the time of the panic]

At this point, I'm starting to suspect that this GPU just plain won't work with passthrough.  I'd certainly welcome other things to try if anyone has suggestions.

 


Hi,

 

I do indeed have 2 GPUs; the monitor output I described above was through the onboard VGA connector.  Before I describe the PCIe layout, it's probably best you look at the photo of the motherboard here:

 

https://www.supermicro.com/products/motherboard/Xeon/C600/X9DRi-LN4F_.cfm

 

The first slot, closest to the CPUs, is an x4 slot in an x8 connector (occupied by an NVMe adapter for my cache drive).  The slot furthest from the CPUs holds a Quadro P2000 GPU, which I use primarily for Plex transcoding (running in Docker); that slot is up against the chassis, so there is no way a dual-width card can fit there.  Adjacent to it is an actual x8 connector, which has my HBA attached.  That leaves the three PCIe x16 slots, the 2nd through 4th from the CPUs.  As the GTX 670 is a dual-slot-width card, I can only use two of those slots, and I've tried both at this point.

 

Thanks for chiming in, I really do appreciate a second set of eyes on this issue!


Hi there,

 

Saw your email in to support, but after coming here and reading the thread, it looks like our community has already stepped in and suggested all the things we would have told you to try.  The tricky part with GPU passthrough is that it can be very hardware specific; sometimes the same model GPU from different OEMs, like EVGA vs. ASUS, can even have different results.  Unfortunately, when things just plain don't work and you've tried all these various options, there really isn't anything else we can do except suggest changing to a more current, EVGA-branded device.  As an FYI, I had a GTX 650 Ti from EVGA that originally didn't have support for UEFI, but upon contacting them, they provided me an updated BIOS to flash to the card that added proper UEFI support.  That's why we have such a love for EVGA devices: their support is really great.


Thanks for chiming in @jonp, I didn't expect you to answer support tickets on a weekend!  I bought this GPU ages ago; to say I got my money's worth out of it would be an understatement.  I'm okay with retiring it.

 

I suppose before I buy a new GPU, I should re-purpose the P2000 into one of those slots to make sure passthrough works as intended and that it is in fact this GPU causing the issue.  I'll update this thread if I turn up anything else of interest.

