April 5

System specs

- Unraid 7.2.4, kernel 6.12.54
- AMD Threadripper / Matisse platform (ASUS board)
- QEMU 9.2.3 / libvirt 11.7.0
- RTX 4060 Gigabyte (AD107, 01:00.0), passed through to a Windows 10 VM
- RTX 2080 Ti (TU102, 4d:00.0), passed through to a separate Windows 11 VM (working fine, no issues)
- The RTX 4060 is the primary/boot VGA on the host, so console access must be preserved

---

Problem

My Windows 10 VM with RTX 4060 passthrough fails to start after the VM has been stopped or has crashed, with this error:

```
Execution error
internal error: Unknown PCI header type '127' for device '0000:01:00.0'
```

A full Unraid server reboot always recovers it: the VM starts fine the first time after a host boot. The problem only happens when I try to restart the VM without rebooting the host.

---

IOMMU grouping (clean, no sharing issues)

```
IOMMU Group 74 01:00.0 NVIDIA Corporation AD107 [GeForce RTX 4060] [10de:2882] (rev a1)
IOMMU Group 74 01:00.1 NVIDIA Corporation AD107 HD Audio Controller [10de:22be] (rev a1)
```

Both functions are isolated in their own group. No ACS override is needed.

---

Current boot args

```
BOOT_IMAGE=/bzimage vfio-pci.ids=1022:148c,10de:1ad6,10de:1ad7 video=efifb:off isolcpus=1-14,25-38 initrd=/bzroot kvm_amd.nested=1
```

Note: 10de:2882 and 10de:22be (the RTX 4060) are NOT in vfio-pci.ids. The 2080 Ti's 10de:1ad6 and 10de:1ad7 functions are listed, but its GPU function is not; those two IDs were added for its USB/UCSI controllers.

---

Driver state after VM shutdown

```
ls: cannot access '/sys/bus/pci/devices/0000:01:00.0/driver': No such file or directory
ls: cannot access '/sys/bus/pci/devices/0000:01:00.1/driver': No such file or directory
```

No driver is bound to the GPU between VM sessions.
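
For reference, the driver state above can be collected for both functions in one go. This is only a minimal sketch: `bound_driver` and its optional second argument are names I made up (the second argument exists purely so the function can be pointed at a test directory), while the `/sys/bus/pci/devices/<BDF>/driver` symlink is the standard sysfs layout.

```shell
#!/bin/sh
# Print the driver bound to a PCI function, or "none" if no driver symlink exists.
# Usage: bound_driver <BDF> [sysfs_root]
# bound_driver is a hypothetical helper; sysfs_root defaults to /sys.
bound_driver() {
    bdf=$1
    root=${2:-/sys}
    link="$root/bus/pci/devices/$bdf/driver"
    if [ -L "$link" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Report both functions of the RTX 4060.
for fn in 0000:01:00.0 0000:01:00.1; do
    printf '%s: %s\n' "$fn" "$(bound_driver "$fn")"
done
```

Running it before starting the VM, while the VM is up, and after shutdown should show whether the functions ever return to a host driver or stay unbound.
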
On boot the GPU is detected correctly:

```
[ 0.392439] pci 0000:01:00.0: [10de:2882] type 00 class 0x030000 PCIe Legacy Endpoint
[ 0.435513] pci 0000:01:00.0: vgaarb: setting as boot VGA device
[ 2.086958] vfio_pci: add [1022:148c[ffffffff:ffffffff]] class 0x000000/00000000
```

The RTX 4060 is being picked up as boot VGA by the host, and is NOT being claimed by vfio-pci at boot (only the 2080 Ti USB controllers are).

---

dmesg after VM shutdown (GPU reset failure)

```
vfio-pci 0000:01:00.0: vfio_bar_restore: reset recovery - restoring BARs [repeats ~30 times]
vfio-pci 0000:01:00.0: timed out waiting for pending transaction; performing function level reset anyway
vfio-pci 0000:01:00.1: Unable to change power state from D0 to D3hot, device inaccessible
pcieport 0000:00:01.1: broken device, retraining non-functional downstream link at 2.5GT/s
pcieport 0000:00:01.1: retraining failed
pcieport 0000:00:01.1: Data Link Layer Link Active not set in 100 msec
vfio-pci 0000:01:00.0: Unable to change power state from D3cold to D0, device inaccessible
vfio-pci 0000:01:00.1: Unable to change power state from D3cold to D0, device inaccessible
```

The PCIe link to the GPU drops entirely after the VM stops. On the next VM start the config space reads back as all 0xFF (device unreachable), which is reported as header type 127: the header type is the low 7 bits of that byte, and 0xFF & 0x7F = 127.

---

VM XML (relevant passthrough section)

```xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
  <rom file='/mnt/user/isos/vbios/GeForceRTX4060.rom'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0' multifunction='on'/>
</hostdev>
<hostdev mode='subsystem' type='pci' managed='yes'>
  <driver name='vfio'/>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
</hostdev>
```

The custom vBIOS ROM file is only 61KB, which seems far too small for an RTX 4060 (I'd expect 200KB+).
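
Since the ROM size looks suspicious, a quick sanity check on the file might be useful. The sketch below only reports the file size and whether the file starts with the `55 aa` signature that a PCI expansion ROM image begins with. `check_rom` is a name I made up, and passing this check does not prove the ROM is the right one for this card; a file that fails it may simply carry a vendor header in front of the image.

```shell
#!/bin/sh
# Print the size of a ROM file and its first two bytes, and return 0 only if
# those bytes are the PCI expansion ROM signature 0x55 0xAA.
# check_rom is a hypothetical helper name.
check_rom() {
    rom=$1
    [ -r "$rom" ] || { echo "cannot read $rom" >&2; return 2; }
    size=$(wc -c < "$rom" | tr -d ' ')
    # First two bytes as lowercase hex with no separators, e.g. "55aa"
    sig=$(od -An -tx1 -N2 "$rom" | tr -d ' \n')
    echo "size=${size} bytes, first bytes=${sig}"
    [ "$sig" = "55aa" ]
}

# Path taken from the VM XML above; adjust as needed.
check_rom /mnt/user/isos/vbios/GeForceRTX4060.rom \
    || echo "signature check failed or file unreadable"
```
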
I'm not sure whether the ROM size is contributing.

---

What I think is happening

The RTX 4060 (AD107) appears to have a reset bug similar to the AMD reset bug: after VM shutdown the FLR fails, the PCIe link drops, and the card is completely inaccessible until a full power cycle. The 2080 Ti on the same system passes through without any such issues.

---

Questions

1. Is there a known fix for this reset bug?
2. Would adding pci=noaer to the boot args help suppress the PCIe error cascade that drops the link? I don't want to brick my Unraid server, and I still need to be able to reach it over the console.
3. Is my vBIOS ROM OK? I'm not sure how I got it, since it was long ago. Should I remove it or switch to <rom bar='off'/>? Where can I download a working 4060 vBIOS?
4. Can I safely add 10de:2882,10de:22be to vfio-pci.ids without losing host console/display output, given that this GPU is also the host's primary display output?

Thanks in advance. Happy to run any additional diagnostics.

Edited April 5 by macmus