Windows VM GPU Losing Signal, Requires Server Reboot



I have a Windows 10 HTPC/gaming VM set up on my Unraid server. It has a dedicated Nvidia RTX 2070 Super. It worked fine for months, but lately it has been having issues where it suddenly stops outputting a signal to the TV. It seems to mostly happen when the GPU is under load or when an application is starting up, though I have had it happen as soon as Windows starts.

 

Sometimes after the VM has crashed the GPU fans ramp to 100% and stay there until the server is rebooted. Also, after a VM crash, the Unraid GUI reports that all CPU threads allocated to the VM are at 100% for a couple of seconds, then just the first thread sits at 100% until the VM is stopped. The VM will not stop cleanly; it has to be force stopped. After this it cannot be started again until the entire server is rebooted. Trying to start the VM before rebooting gives this error (the PCI address is that of the GPU):

internal error: Unknown PCI header type '127' for device '0000:01:00.0'
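For context on that error: 127 is 0x7f, which is what a header type byte of 0xFF becomes once the multi-function bit is masked off. A device whose config space reads back as all 0xFF has usually dropped off the PCIe bus, so the error is most likely a symptom of the failed reset rather than a separate fault. Below is a minimal sketch for checking this from the Unraid console. It assumes Python 3 has been installed on the host (it is not part of stock Unraid), and the function names are made up for illustration.

```python
from pathlib import Path

HEADER_TYPE_OFFSET = 0x0E   # offset of the Header Type register in PCI config space
HEADER_TYPE_MASK = 0x7F     # bit 7 is the multi-function flag, bits 0-6 are the type


def header_type(config: bytes) -> int:
    """Return the header type field (0 = endpoint, 1 = bridge, 2 = CardBus)."""
    return config[HEADER_TYPE_OFFSET] & HEADER_TYPE_MASK


def looks_dead(config: bytes) -> bool:
    """True if the standard 64-byte header reads back as all 0xFF."""
    head = config[:64]
    return len(head) == 64 and all(b == 0xFF for b in head)


def read_config(address: str, sysfs_root: str = "/sys/bus/pci/devices") -> bytes:
    """Read raw config space for an address such as '0000:01:00.0'."""
    return (Path(sysfs_root) / address / "config").read_bytes()
```

After a crash, `looks_dead(read_config("0000:01:00.0"))` returning True would suggest the card is no longer responding on the bus at all, which would explain why only a full reboot brings it back.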

 

The VM issue does not appear to affect any other component running on the server. All Docker applications and my other VMs continue running as normal after the Windows VM has crashed.

 

Some searching indicated this might be a VBIOS issue. I was originally using a VBIOS from techpowerup that was modified as explained in SpaceInvaderOne's 2017 GPU passthrough video. I tried the userscript technique from SpaceInvaderOne's newer video to dump my own VBIOS. Using this VBIOS file did not fix the issue.

 

Finally, I tried swapping out the 2070 for the 3070 I have in my desktop machine. I used SpaceInvaderOne's script to dump the VBIOS and did a clean GPU driver reinstall on the VM. At first it seemed like the issue had been resolved, but after a few minutes running the Unigine Heaven benchmark the VM crashed exactly as it had with the 2070.

 

I am now out of ideas. Any advice would be much appreciated. Thank you.

Link to comment

Keep in mind that I am by no means an expert, but I think that Windows is crashing and then the gpu has not reset properly due to the crash. I would be trying to troubleshoot the cause of the crash, gpu temp, cpu temp, windows logs.

 

There are a couple forum threads about switching slots and downgrading mobo bios for similar problems. I googled "internal error: Unknown PCI header type '127' for device" and then sorted to show only results from unraid.net

Link to comment
21 hours ago, joecool169 said:

Are you monitoring temps when the problem occurs?

 

I don't have a way to monitor the passed-through GPU's temps from within Unraid (I didn't think that was possible, please correct me if I'm wrong). Within Windows, I haven't noticed any abnormal temperatures leading up to a crash. As I mentioned, it seems to happen most reliably when the GPU is under load, but it has happened other times as well.
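One way to see what the card was doing right before a crash is to log readings to disk from inside the VM, so the last sample survives the crash. This is a rough sketch, not a tested tool. It assumes Python is installed in the Windows VM and that `nvidia-smi` (installed with the Nvidia driver) is on the PATH. The log file name and one-second interval are arbitrary choices.

```python
import os
import subprocess
import time

QUERY = "timestamp,temperature.gpu,utilization.gpu,power.draw"


def parse_sample(line: str) -> dict:
    """Split one CSV line from nvidia-smi into named fields."""
    timestamp, temp, util, power = [field.strip() for field in line.split(",")]
    return {"timestamp": timestamp, "temp_c": int(temp), "util": util, "power": power}


def read_sample() -> str:
    """Ask nvidia-smi for one CSV line (no header, no units) for GPU 0."""
    result = subprocess.run(
        ["nvidia-smi", "-i", "0", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def log_forever(path: str = "gpu_log.csv", interval_s: float = 1.0) -> None:
    """Append one sample per interval and force it to disk before sleeping."""
    with open(path, "a", encoding="utf-8") as log:
        while True:
            log.write(read_sample() + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the line out of the OS cache
            time.sleep(interval_s)
```

Start `log_forever()` before launching the benchmark. After the next crash, the tail of `gpu_log.csv` shows the temperature, load and power draw from the last second or so before the signal dropped.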

 

 

21 hours ago, joecool169 said:

Keep in mind that I am by no means an expert, but I think that Windows is crashing and then the gpu has not reset properly due to the crash. I would be trying to troubleshoot the cause of the crash, gpu temp, cpu temp, windows logs.

 

There are a couple forum threads about switching slots and downgrading mobo bios for similar problems. I googled "internal error: Unknown PCI header type '127' for device" and then sorted to show only results from unraid.net

 

I really appreciate your help. I agree with your diagnosis. The root problem is whatever is causing the crash. From what I can tell, the "internal error: ..." message seems like a reasonable error to get after an unclean force-stop of the VM. I included that info because I have been having a hard time finding any other error messages or indicators in the logs of either Unraid or the VM to help diagnose the issue. I will try to recreate the crash tonight and check the logs again to see if I missed anything.

 

Assuming no other suggestions, I'm going to try fully reinstalling Windows on the VM. I thought for sure it was a hardware issue with the 2070S until I got the same error on the 3070, so now I am guessing it is a software issue. I suppose it could still be a hardware error at the motherboard level. For the record I am running an Asus X99 WS/IPMI LGA2011 with an Intel Xeon E5-2680 v3. The GPU is in the second slot, which appears to be the primary slot according to the MB manual. I am on the latest BIOS AFAIK. If a Windows reinstall doesn't change anything I'll try an older BIOS, though I would be surprised if that was the issue since the problem only started happening recently.

Link to comment

Update:

I found the relevant messages in the logs when a crash happens:

 

Apr 27 21:59:49 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
...
Apr 27 21:59:52 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 21:59:53 Tower kernel: vfio-pci 0000:02:00.0: timed out waiting for pending transaction; performing function level reset anyway
Apr 27 21:59:54 Tower kernel: vfio-pci 0000:02:00.0: not ready 1023ms after FLR; waiting
Apr 27 21:59:55 Tower kernel: vfio-pci 0000:02:00.0: not ready 2047ms after FLR; waiting
Apr 27 21:59:57 Tower kernel: vfio-pci 0000:02:00.0: not ready 4095ms after FLR; waiting
Apr 27 22:00:01 Tower kernel: vfio-pci 0000:02:00.0: not ready 8191ms after FLR; waiting
Apr 27 22:00:10 Tower kernel: vfio-pci 0000:02:00.0: not ready 16383ms after FLR; waiting
Apr 27 22:00:27 Tower kernel: vfio-pci 0000:02:00.0: not ready 32767ms after FLR; waiting
Apr 27 22:01:01 Tower kernel: vfio-pci 0000:02:00.0: not ready 65535ms after FLR; giving up
Apr 27 22:01:01 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:01 Tower kernel: vfio-pci 0000:02:00.0: can't change power state from D0 to D3hot (config space inaccessible)
Apr 27 22:01:01 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:09 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:09 Tower kernel: vfio-pci 0000:02:00.0: can't change power state from D0 to D3hot (config space inaccessible)
Apr 27 22:01:09 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:22 Tower kernel: vfio-pci 0000:02:00.3: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:22 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:22 Tower kernel: vfio-pci 0000:02:00.1: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:22 Tower kernel: vfio-pci 0000:02:00.2: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:24 Tower kernel: vfio-pci 0000:02:00.3: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:24 Tower kernel: vfio-pci 0000:02:00.2: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:24 Tower kernel: vfio-pci 0000:02:00.1: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:24 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:01:25 Tower kernel: vfio-pci 0000:02:00.0: timed out waiting for pending transaction; performing function level reset anyway
Apr 27 22:01:26 Tower kernel: vfio-pci 0000:02:00.0: not ready 1023ms after FLR; waiting
Apr 27 22:01:27 Tower kernel: vfio-pci 0000:02:00.0: not ready 2047ms after FLR; waiting
Apr 27 22:01:29 Tower kernel: vfio-pci 0000:02:00.0: not ready 4095ms after FLR; waiting
Apr 27 22:01:34 Tower kernel: vfio-pci 0000:02:00.0: not ready 8191ms after FLR; waiting
Apr 27 22:01:42 Tower kernel: vfio-pci 0000:02:00.0: not ready 16383ms after FLR; waiting
Apr 27 22:01:59 Tower kernel: vfio-pci 0000:02:00.0: not ready 32767ms after FLR; waiting
Apr 27 22:02:35 Tower kernel: vfio-pci 0000:02:00.0: not ready 65535ms after FLR; giving up
Apr 27 22:02:40 Tower kernel: vfio-pci 0000:02:00.2: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:02:40 Tower kernel: vfio-pci 0000:02:00.3: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:02:40 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:02:40 Tower kernel: vfio-pci 0000:02:00.1: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:02:40 Tower kernel: vfio-pci 0000:02:00.2: vfio_bar_restore: reset recovery - restoring BARs
Apr 27 22:02:40 Tower kernel: vfio-pci 0000:02:00.3: vfio_bar_restore: reset recovery - restoring BARs
...
Apr 27 22:02:41 Tower kernel: vfio-pci 0000:02:00.0: vfio_bar_restore: reset recovery - restoring BARs

 

The most useful message seems to be the "vfio_bar_restore: reset recovery - restoring BARs" (note: this message is repeated many times, I clipped most for readability). Googling this returns a decent number of results, though it seems to mostly be with AMD cards from what I can tell. Most of the solutions appear to involve the motherboard.
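Reading the excerpt in order, the kernel attempts a function level reset (FLR), waits with a delay that doubles each time (1023 ms up to 65535 ms), gives up, and then reports the config space as inaccessible. That would fit the card having dropped off the bus. Because the repeated lines make this hard to see, here is a small sketch that collapses the log into per-device counts. It assumes the lines come from the host syslog (on Unraid that should be /var/log/syslog) and follow the format shown above. The function name is made up.

```python
import re
from collections import Counter

# Matches lines like:
# "Apr 27 21:59:53 Tower kernel: vfio-pci 0000:02:00.0: <message>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s\S+\skernel:\svfio-pci\s"
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s(?P<msg>.*)$"
)


def summarize(lines):
    """Count vfio-pci messages per device and note when a reset was abandoned."""
    counts = Counter()
    gave_up = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip "..." markers and unrelated lines
        # Collapse the changing millisecond figure so FLR retries group together.
        msg = re.sub(r"\d+ms", "<N>ms", match["msg"])
        counts[(match["dev"], msg)] += 1
        if "giving up" in msg:
            gave_up.append((match["stamp"], match["dev"]))
    return counts, gave_up
```

Running `summarize(open("/var/log/syslog"))` after a crash gives one count per device and message, plus the timestamps at which the kernel abandoned a reset. That makes it easier to compare one crash against the next.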

 

I tried changing which slot the 2070S was in. Originally I had it in the primary (#2) slot with my M2000 (for Plex transcoding) in the #4 slot. I also have an LSI card in the #5 slot. I tried having the M2000 in the primary slot and the 2070S in the #3 slot. (This is the layout the MB manual recommends for three PCIe devices; I originally had the M2000 a slot lower for better airflow to the 2070S.) Unfortunately this did not fix the problem. I thought it had, because the system was stable for a few hours, but it eventually did crash. (The error log above is from that crash, which is why the GPU is now device 02:00.0.)

 

I have not yet tried downgrading to an older MB BIOS. That will be my next step, unless there are any other suggestions.

 

If anyone has had experience with any of the above errors with Nvidia cards, please let me know.

 

Thank you

 

Edited by Team_Dango
Link to comment
On 4/28/2021 at 7:00 PM, joecool169 said:

Have you tried "video=efifb:off" in the syslinux config?

 

Thank you for the suggestion. I gave that a shot and it seemed to help. The VM did not crash for several hours. I even for a moment thought it may have been fixed. But eventually it crashed again same as before, much to my disappointment. After that initial success I was not able to achieve the same level of stability on subsequent reboots.

 

I also tried adding both "video=vesafb:off" and "video=efifb:off" to the syslinux config, which is something I saw suggested in a few places. This did not help at all. If anything it was less stable.

 

I should perhaps mention that I already have one extra parameter in my syslinux config: "pcie_aspm=off" which solved an error I started getting after adding the LSI card. I do not know if it could somehow have anything to do with the other issue.
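Since syslinux changes only take effect after a reboot, and a typo on the append line fails silently, it can be worth confirming which parameters the running kernel actually received. Here is a minimal sketch that reads /proc/cmdline. The function names are made up for illustration, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Map each kernel parameter to a list of its values ('' for bare flags)."""
    params = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        # A list is used because a key such as video= may appear more than once.
        params.setdefault(key, []).append(value)
    return params


def read_cmdline(path: str = "/proc/cmdline") -> dict:
    """Parse the command line of the currently running kernel."""
    with open(path, encoding="utf-8") as handle:
        return parse_cmdline(handle.read())
```

For example, `read_cmdline().get("video")` shows whether both `vesafb:off` and `efifb:off` made it onto the command line, and `read_cmdline().get("pcie_aspm")` confirms the ASPM setting.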

 

I have by now also tried downgrading my motherboard BIOS. The latest is 4001 (what I was using) so I tried the two previous releases, 3901 and 3803. Neither caused any change in behavior that I could see. After those tests I reset back to 4001.

 

I tried a fresh install of a new Windows 10 VM with the same GPU. This proved difficult as I was getting more errors.

Apr 30 15:02:53 Tower kernel: vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
Apr 30 15:02:53 Tower kernel: vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
Apr 30 15:02:53 Tower kernel: vfio-pci 0000:01:00.0: No more image in the PCI ROM
Apr 30 15:02:53 Tower kernel: vfio-pci 0000:04:00.0: vfio_ecap_init: hiding ecap 0x19@0x168
Apr 30 15:02:53 Tower kernel: vfio-pci 0000:04:00.0: vfio_ecap_init: hiding ecap 0x1e@0x190

Device 01:00.0 is the GPU and 04:00.0 is the USB controller passed through to the VM to which I connect the keyboard and mouse.

 

The errors made it so the keyboard and mouse were not recognized inside the VM, which made installing Windows impossible. I was able to work around this by passing through the keyboard and mouse directly as USB devices. This allowed me to get through the Windows installation, and after a few reboots I was able to pass through the USB controller without error. However, after installing the graphics drivers and the Heaven benchmark, the VM again crashed as soon as the benchmark started.

 

I am again very much out of ideas. As always, any help would be very much appreciated.

 

Thank you.

Link to comment

I have one last suggestion for you, and I have no idea if this is relevant. My mobo bios allows me to enable or disable resizable bar, which I think is a fairly new technology. I'm on AMD. Mine is disabled. Any chance your mobo has that setting and have you checked it?
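If the BIOS does not expose the setting, the host can still show how large the card's memory regions are. As far as I know, with resizable BAR off the large prefetchable region on an Nvidia card is typically 256 MiB, while with it on it can grow to the size of the VRAM. Here is a small sketch that reads the sizes from the sysfs `resource` file, in which the first six lines correspond to BAR0 through BAR5. The function name is made up for illustration.

```python
def parse_resource(text: str):
    """Yield (line_index, size_in_bytes) for each populated region."""
    for index, line in enumerate(text.splitlines()):
        start, end, _flags = (int(field, 16) for field in line.split())
        if start == 0 and end == 0:
            continue  # unused regions are listed as all zeros
        yield index, end - start + 1
```

On the host this could be used as `list(parse_resource(open("/sys/bus/pci/devices/0000:01:00.0/resource").read()))`, substituting the GPU's current address.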

 

It might be completely irrelevant or may not even be there, just thought I'd ask because sometimes someone gives me an off suggestion but it leads me to the right place.

Link to comment
  • 2 months later...
Just started to get the same error as you today. The only thing I changed was swapping my GT 710 for a GT 730, oh, and I also updated the Nvidia drivers to the latest version. What's really annoying is that it can happen at any time, load or no load. Very odd.

Link to comment
On 6/30/2021 at 10:11 PM, mikeyosm said:

Just started to get the same error as you today. The only thing I changed was swapping my GT 710 for a GT 730, oh, and I also updated the Nvidia drivers to the latest version. What's really annoying is that it can happen at any time, load or no load. Very odd.

So it seems that setting "PCIe ACS override" to Downstream and enabling "VFIO allow unsafe interrupts" under Settings > VM Manager fixed it for me.
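For anyone trying the same fix: the ACS override changes how the kernel splits devices into IOMMU groups, so it helps to compare the grouping before and after the change. Here is a minimal sketch that lists the groups from sysfs. It assumes Python 3 is available on the host, and the function name is made up for illustration.

```python
from pathlib import Path


def iommu_groups(root: str = "/sys/kernel/iommu_groups") -> dict:
    """Return {group_number: [pci_address, ...]} as exposed by the kernel."""
    groups = {}
    for devices_dir in Path(root).glob("*/devices"):
        number = int(devices_dir.parent.name)
        groups[number] = sorted(dev.name for dev in devices_dir.iterdir())
    return dict(sorted(groups.items()))
```

Ideally the GPU's functions (for example 01:00.0 through 01:00.3 on an RTX card) end up in a group that contains no unrelated devices. If the GPU shares a group with something that stays bound to a host driver, it cannot be handed to the VM on its own.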

Link to comment