Going crazy trying to figure out GPU Passthrough stability issues



I've had this UnRAID system forever, but I've done a lot of hardware swapping over the years. The last hardware changes were over the holidays, and things ran stable for 4-6 weeks before breaking in the last 10-14 days.

 

Current HW:

  • Gigabyte TRX40 board w/ 3970X CPU
  • 3070 GPU for OS
  • 4090 GPU for VFIO
  • NVMe boot drive for VM passed via VFIO

 

Currently, if I boot my Windows 11 VM with the 4090 attached, it will run fine until I put any load on the GPU, then it will crash in 3-5 minutes. Load can be as little as GPU acceleration for remote access via Parsec, running Kombustor, or running the Valley benchmark. Windows will reboot and UnRAID will log "vfio-pci 0000:4a:00.0: vfio_bar_restore: reset recovery - restoring BARs". Sometimes the UnRAID system will lock up shortly afterwards; other times it will keep on running.
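When chasing these resets, it helps to quantify how often they fire before and after each change. A minimal sketch — the helper name is made up, and on a live host you would point it at /var/log/syslog rather than the sample file used here:

```shell
# Hypothetical helper: count vfio_bar_restore events in a syslog extract.
# On a live Unraid host you would pass /var/log/syslog instead.
count_bar_restores() {
  grep -c 'vfio_bar_restore' "$1"
}

# Self-contained demo against a sample snippet (address from the post above):
cat > /tmp/sample_syslog.txt <<'EOF'
Mar  6 16:14:29 kernel: vfio-pci 0000:4a:00.0: vfio_bar_restore: reset recovery - restoring BARs
Mar  6 16:14:32 kernel: vfio-pci 0000:4a:00.0: vfio_bar_restore: reset recovery - restoring BARs
EOF
count_bar_restores /tmp/sample_syslog.txt
```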

 

So far I have tried the following changes to UnRAID/HW with no difference:

  • Enabled/Disabled CSM
  • Enabled/Disabled Above 4G Decoding and Resizable BAR
  • Removed memory overclock/XMP
  • Deleted VM and re-created it from scratch - Both Q35 and 440fx
  • Fresh windows 11 install w/ current nvidia drivers
  • Reseated CPU/ram/GPUs/etc.

 

Booting the Windows 11 NVMe drive bare metal works fine, and the GPU can run Kombustor for hours without any issues. I've also pulled the GPU and tried it in a different box, and it runs stable.

 

I have a 'new to me' TRX40 board coming in a couple of days, so hoping it's a weird HW issue, but I can't figure out what the heck is going on.

 

Any ideas, or do I just need to continue to replace hardware until the problem goes away?


I am having the same problem. Asus WRX80 with two RTX 4090s. The primary GPU is responsible for host duties, transcoding, analytics and some AI, including console video access. The second GPU, in the 5th PCIe slot, is passed through to the VM.

 

What I am seeing is the following in the logs: Tower kernel: vfio-pci 0000:61:00.0: vfio_bar_restore: reset recovery - restoring BARs

 

The VM loses video (black screen), the entire VM crashes, Windows 11 restarts, and the process happens all over again.

 

No issues with IOMMU groups; all devices are in their own groups and are passed through to the VM.

 

I have enabled IOMMU in the BIOS, NOT in unRAID; i.e. PCIe ACS override is disabled. Having said that, I have tried combinations of Downstream, Multifunction and Both, and this doesn't seem to fix the issue.
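For reference, a quick way to confirm the grouping the kernel actually built is to walk /sys/kernel/iommu_groups on the host. This sketch takes the root directory as a parameter so it can be demonstrated against a mock tree; the function name and the mock group/address are made up:

```shell
# Sketch: list PCI devices per IOMMU group. On a real Unraid host you would
# call this with /sys/kernel/iommu_groups as the root.
list_iommu_groups() {
  root="$1"
  for dev in "$root"/*/devices/*; do
    [ -e "$dev" ] || continue
    grp=$(basename "$(dirname "$(dirname "$dev")")")
    echo "group $grp: $(basename "$dev")"
  done
}

# Mock tree mirroring the sysfs layout (group number and address illustrative):
mkdir -p /tmp/iommu_mock/28/devices/0000:61:00.0
list_iommu_groups /tmp/iommu_mock
```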

 

CSM is disabled, and both Resizable BAR and Above 4G Decoding are enabled in the BIOS.

 

Here is my syslinux.cfg: append amd_iommu=on amd_iommu=pt pci=noaer pci=acpi acpi_irq_balance apic apm=off rcu_nocbs=24-63 isolcpus=24-31,56-63 vga=extended pcie_acs_override=downstream,multifunction vfio_iommu_type1.allow_unsafe_interrupts=1 initrd=/bzroot
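With a long append line like that, it's worth double-checking that every flag actually reached the running kernel (note: the usual passthrough flag is `iommu=pt`; `amd_iommu=pt` in the line above may be a typo). A small sketch, using a sample file in place of /proc/cmdline so it's self-contained; `check_param` is a made-up helper:

```shell
# Hypothetical helper: verify a kernel parameter is present on the active
# command line. On a live host, pass /proc/cmdline as the first argument.
check_param() {
  if grep -qwF "$2" "$1"; then echo "$2: active"; else echo "$2: MISSING"; fi
}

# Self-contained demo against a sample command line:
printf 'amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction\n' \
  > /tmp/cmdline_sample
check_param /tmp/cmdline_sample amd_iommu=on
check_param /tmp/cmdline_sample iommu=pt
```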

 

This has got me stumped. 

Edited by Enver

Oh hello, I'm not alone. I've got an ASRock TRX40 Creator, 3970X, and two Nvidia 2070 Supers, and I've been chasing this for weeks or months (lost track) on a Win10/11 VM.

Unraid works fine when the VM has crashed, but if I try to start it again I get:

Execution error internal error: Unknown PCI header type '127' for device '0000:4e:00.0'

The only way to get it running again is to reboot Unraid. As @dcoulson says, it always happens under higher load, playing games.
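Header type 0x7f (127) means config-space reads are coming back all-0xFF, i.e. the device has fallen off the bus. Before a full host reboot, it can be worth trying a PCI remove + rescan. This is a sketch (the helper name is mine; the address is the one from the error above), guarded so it no-ops when the device path isn't present; run it as root on the host:

```shell
# Sketch: try to recover a hung passthrough GPU via PCI remove + rescan.
# If the card has truly lost power, this won't help and a reboot is needed.
remove_and_rescan() {
  if [ -e "/sys/bus/pci/devices/$1" ]; then
    echo 1 > "/sys/bus/pci/devices/$1/remove"   # detach the dead device
    sleep 1
    echo 1 > /sys/bus/pci/rescan                # re-enumerate the bus
    echo "rescan issued for $1"
  else
    echo "device $1 not present; skipping"
  fi
}

remove_and_rescan 0000:4e:00.0   # address from the error message above
```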

 

I'm thinking it's Nvidia driver changes (currently on the latest, 531.18), as the VM had been stable for months before this issue.

I've recreated the VM, tried unsafe interrupts, and updated the Nvidia drivers; Windows is up to date...

I was thinking of maybe trying the ASRock beta BIOS 1.83 (2022/5/30, AMD CastlePeakPI-SP3r3-1.0.0.7); I'm currently on the last stable one, 1.70 (2020/6/30, AMD CastlePeakPI-SP3r3-1.0.0.4), but I'm not sure I want to open that can of worms at the moment.🙃

 

I can't remember exactly, but I think it started maybe around Nvidia driver 527.37. I haven't tried an earlier version yet; haven't had time!

 

Mar  6 16:14:29 kernel: vfio-pci 0000:4e:00.0: vfio_bar_restore: reset recovery - restoring BARs
Mar  6 16:14:32 kernel: vfio-pci 0000:4e:00.0: vfio_bar_restore: reset recovery - restoring BARs
Mar  6 16:14:32 kernel: vfio-pci 0000:4e:00.0: vfio_bar_restore: reset recovery - restoring BARs
Mar  6 16:14:32 kernel: vfio-pci 0000:4e:00.0: Unable to change power state from D0 to D3hot, device inaccessible
Mar  6 16:14:32 kernel: vfio-pci 0000:4e:00.0: vfio_bar_restore: reset recovery - restoring BARs

 

8 hours ago, turnipisum said:


 

My situation: the host is stable and I can restart the VM as many times as I like, which is good for troubleshooting. However, when I log in to Windows 11 it's stable anywhere from 30 seconds to 5 minutes until the screen goes black and the Windows 11 machine reboots. So basically unusable. No load needs to be applied to the VM; it can be just idling at the desktop for it to crash.

 

I am also now seeing the following in the unRAID logs:

 

Mar  7 18:17:34 Tower kernel: SVM: kvm [64308]: vcpu0, guest rIP: 0xfffff841bcc7bf99 unimplemented wrmsr: 0xc0010115 data 0x0
Mar  7 18:17:35 Tower kernel: SVM: kvm [64308]: vcpu1, guest rIP: 0xfffff841bcc7bf99 unimplemented wrmsr: 0xc0010115 data 0x0
Mar  7 18:17:35 Tower kernel: SVM: kvm [64308]: vcpu2, guest rIP: 0xfffff841bcc7bf99 unimplemented wrmsr: 0xc0010115 data 0x0
Mar  7 18:17:35 Tower kernel: SVM: kvm [64308]: vcpu3, guest rIP: 0xfffff841bcc7bf99 unimplemented wrmsr: 0xc0010115 data 0x0
Mar  7 18:17:35 Tower kernel: SVM: kvm [64308]: vcpu4, guest rIP: 0xfffff841bcc7bf99 unimplemented wrmsr: 0xc0010115 data 0x0
Mar  7 18:17:35 Tower kernel: SVM: kvm [64308]: vcpu5, guest rIP: 0xfffff841bcc7bf99 unimplemented wrmsr: 0xc0010115 data 0x0
Mar  7 18:17:35 Tower kernel: SVM: kvm [64308]: vcpu6, guest rIP: 0xfffff841bcc7bf99 unimplemented wrmsr: 0xc0010115 data 0x0
Mar  7 18:17:36 Tower kernel: SVM: kvm [64308]: vcpu7, guest rIP: 0xfffff841bcc7bf99 unimplemented wrmsr: 0xc0010115 data 0x0
Mar  7 18:17:36 Tower kernel: SVM: kvm [64308]: vcpu8, guest rIP: 0xfffff841bcc7bf99 unimplemented wrmsr: 0xc0010115 data 0x0
Mar  7 18:17:36 Tower kernel: SVM: kvm [64308]: vcpu9, guest rIP: 0xfffff841bcc7bf99 unimplemented wrmsr: 0xc0010115 data 0x0

 

This seems to indicate some sort of Hyper-V / QEMU emulation error. I am sure this is related to the Windows 11 VM / Nvidia driver crashing, but I haven't seen any guidance on how to fix it. I do have "kvm-amd.avic=1" in my syslinux.cfg file, which is apparently supposed to suppress this error?

 

When you say trying an earlier Nvidia driver, do you mean inside the Windows VM or on the host via the Nvidia plugin?

 

FYI, my host is running Nvidia driver 530.30.02. My Windows 11 VM is running Nvidia driver 531.18.

 

Things I am going to try tonight:

 

  • Revert the Windows 11 VM Nvidia driver to an earlier version.
  • Revert the unRAID host to an earlier version of the Nvidia driver.
  • Try an Ubuntu VM and pass through the same GPU, just to see what happens.

 

I did see some posts on the Proxmox forums about disabling Above 4G Decoding, which apparently fixed this for two users, BUT that is a no-go for me; the host won't even POST to the BIOS screen (i.e. initialise the GPU) with Above 4G Decoding turned off -> CMOS clear and reset.

Edited by Enver

Hello @turnipisum and @dcoulson

 

Success! For me this has been resolved. The root cause was the CoreFreq plugin being enabled and set to autostart.

 

Once I disabled autostart and set the plugin to disabled, I was able to boot the VM with the latest Nvidia driver, and it's been stable for the last few hours, with round after round of 3DMark stress tests to prove it!

 

Happy to share my BIOS settings and syslinux.cfg; along the way I enabled many HPC optimisations as recommended by AMD for the TRX/WRX platform.

 

In my case the SVM:kvm entries are harmless and will most probably be rectified in a new Linux kernel.

 

@ich777 This is just an FYI. No idea why the CoreFreq plugin would cause GPU resets; happy to share logs if it helps. I also accept that the plugin is experimental in nature.

 

  • 3 weeks later...
  • 2 months later...

@dcoulson Did you ever figure out your issue? I have had the same problem since probably the end of last year (2022) and have been trying to fix it for several months.

Everything was working perfectly up to a certain point, when the VM started crashing whenever I try to play any video game. It runs OK for a few minutes, and sometimes it doesn't even crash, but usually it does. When it crashes I see the following error in Unraid:

vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs

 

I had Windows 10 when this started happening, and tried migrating to Windows 11, which did not solve it. These are the things I tried:

- Change the XMP RAM profile in the BIOS
- Lower the in-game graphics config
- Update the Nvidia driver
- Downgrade the Nvidia driver to 522.25 and 526.86
- Run games as administrator
- Run games as DX11 only
- Check that the CPU is in the performance profile in Windows and Unraid
- Remove any software that has overlays (Razer and so on)
- Update Windows
- Enable MSI interrupts under Windows
- Use a vBIOS in the VM config
- Leave the case open (in case it was a temperature issue); also monitored the temps during play
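For the "Enable MSI interrupts" step, the usual approach in the guest is a registry value. A sketch of the .reg form; `<device-instance>` is a placeholder since the instance path varies per system:

```
Windows Registry Editor Version 5.00

; <device-instance> is a placeholder; find the GPU's actual instance path
; under HKLM\SYSTEM\CurrentControlSet\Enum\PCI in the guest.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum\PCI\<device-instance>\Device Parameters\Interrupt Management\MessageSignaledInterruptProperties]
"MSISupported"=dword:00000001
```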

 

None of these worked. My suspicion is that this has to be an issue with the Nvidia drivers. I had not changed anything in the VM, as far as I remember, when this started happening. The only thing that likely changed was the driver, but a downgrade to a previous driver didn't work. So I'm fully stuck.

 

Any ideas?

 

 

 

  • 6 months later...

**Possible Solution**

 

I had this exact same issue and it was driving me nuts too. I see this thread is a little old and there's not much on here, so I figured I'd share my experience and solution, but your mileage may differ.

 

I was building a gaming VM, passing through a GTX 1080 Ti GPU, and could play for around 10-20 mins before the VM would just "crash" and Unraid would fill the logs with this error:

 

vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs
...

 

Literally thousands of those until I would force-stop the VM, after which I could no longer restart it – I would get a 127 error, not able to access the GPU device (see below as to my assumption). My GPU was on 0000:01:00.0 as well; others might be in the #2 slot and so forth, which your addressing should follow.

 

I didn’t have the  CoreFreq Plugin as suggested above, so that’s ruled out.

 

So to cut right to it – after hours and yes, DAYS of ruling out configuration settings, trying vBIOS files, motherboard BIOS settings, VM graphics drivers, etc, etc: IT WAS THE DAMN PSU (power supply). 🤬 I only came across this once in one other thread… somewhat similar but not 100% the same setup and experience.

 

So in my particular case – I added the GPU above (used, from eBay), and since my server case is a more server-grade case, there are no 6- or 8-pin power outputs for GPUs. Yes, I could "adapt" to those, but I only have two main power lines: one powering my disk array and the other powering the SATA drive enclosure. I didn't want to steal power from "unraid"… Thus I bought a secondary 400W power supply (flat/modular) and wired it up to power on with the primary PSU. (You can google how to do all this – there are "multiple power supply adapters".) DON'T just jump it with a paper clip… 🤦‍♂️

 

Anyway, after swapping the GPU to my desktop computer – I could play for "hours" without any issues, no crashes, etc. (using its primary 500W PSU, which has two main cables with 6/8-pin plugs). This ruled out any GPU card issues… it's fine. 😁

 

I pulled the secondary PSU (from the Unraid server) and used it to "replicate" my setup. And BINGO – 10-20 mins into the game it CRASHed! I looked over at the GPU, which now had blinking LEDs on the power ports… 🤔 Looking at the secondary PSU – the only tell was that the fan was not spinning. Yep, it "died". 🤦‍♂️ So I'm guessing it has some thermal/overheat protection, or it just can't supply 250W long-term… which is what my GPU needs at 95-100% usage. So don't just buy a crappy PSU for GPUs – you have to have a PSU that can drive 300W+ long-term, for hours, JUST for the GPU. That would mean a system PSU of 600W+, just to be clear. Again, I'm just driving the GPU – so 400W should have been enough, but that PSU just can't handle the load… "junk" and getting returned.

 

This is also why I had to power-cycle / reboot Unraid every time this crashed: since there was no power to the GPU, Unraid could not reset the board (GPU). I would also get these errors… (clues) sometimes, not every time, in the logs.

 

vfio-pci 0000:01:00.1: Unable to change power state from D0 to D3hot, device inaccessible
vfio-pci 0000:01:00.1: Unable to change power state from D3cold to D0, device inaccessible
vfio-pci 0000:01:00.1: Unable to change power state from D3cold to D0, device inaccessible

 

Your experience could be slightly different. If you're using a single primary PSU, it might just drop momentarily in voltage, because the 250W (or more – very GPU-specific here) the GPU needs continuously is too much of a long-term demand, and the voltage sags just enough that the GPU "crashes"… at which time the motherboard / OS tries to reset the card – the logging above. Once the card is reset, some people are then able to restart their VMs after the GPU crashed. Unless the primary PSU also dies, you'd never know it dropped in voltage for a moment, just enough for the GPU to crash, as Unraid and the rest of your system would seem "fine". (But they could crash too.)
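If you suspect a sagging rail, logging the GPU's reported board power during a stress run can make the drop visible. A sketch to run inside the guest (or on the bare-metal test box); the wrapper function is mine, and it degrades gracefully where nvidia-smi isn't installed:

```shell
# Sketch: sample the GPU's reported power draw vs. its configured limit.
# Wrap nvidia-smi so machines without it still produce output.
report_gpu_power() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=power.draw,power.limit --format=csv
  else
    echo "nvidia-smi not found on this machine"
  fi
}

report_gpu_power
```

Run it in a loop (e.g. under `watch`) while Kombustor or a game is loading the card, and compare the peak draw against what the PSU can actually sustain.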

 

For me the secondary PSU was off/dead – so there was no way to reset and repower the GPU other than to reboot. I guess I probably could have just pulled the power plug on the secondary PSU and plugged it back in, now that I think about it. But whatever, a full reboot was necessary for me, which "reset" the secondary PSU so it powered back up. (And I only rebooted like 100+ times trying to figure all this out, for DAYS.) AHHHH 🤬

 

Overall, you get the idea… this post is now long enough. I would very, very strongly recommend you rule out the PSU if you get these or similar errors – especially if the VM/GPU is fine during normal operation, but only crashes, with these errors showing up, when it's under load.

 

Good luck and hope this helps someone else.  

 

Edited by howiser1
