GPU passthrough fails when rebooting a VM [edit: added Call Trace]


scud133b


Edit: Syslog with call trace here

 

My GPU (AMD RX480) passes through perfectly when I first boot the server, but it fails upon reboot of any VM that uses it. So rebooting a Win10, Ubuntu, or any other VM requires a full power cycle of the server to get the GPU working again.

 

I noticed the following errors in the Windows VM log after a reboot:

2017-12-25T16:41:52.220501Z qemu-system-x86_64: warning: Unknown firmware file in legacy mode: etc/msr_feature_control
2017-12-25T16:17:56.679351Z qemu-system-x86_64: -device vfio-pci,host=01:00.0,id=hostdev0,bus=pci.0,addr=0x5: vfio: Error: Failed to setup INTx fd: Device or resource busy
2017-12-25T16:17:58.613261Z qemu-system-x86_64: -device vfio-pci,host=01:00.0,id=hostdev0,bus=pci.0,addr=0x5: Device initialization failed
2017-12-25 16:17:58.662+0000: shutting down, reason=failed

 

And the following in the Unraid log, which looks like it might be related:

Dec 25 10:41:54 Tower kernel: vfio-pci 0000:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff

 

This is a recent problem -- earlier in 2017 I could reboot a VM without any issue. I use the auto-updates plugin so I'm not really able to isolate the change in behavior to any one update.

 

Hoping someone here has a suggestion of what to try.


Thanks!

  • 2 weeks later...

Also catching this error occasionally, which again points to the GPU:

 

Jan 7 19:12:09 Tower kernel: vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
Jan 7 19:12:09 Tower kernel: vfio-pci 0000:01:00.0: Refused to change power state, currently in D3
Jan 7 19:12:09 Tower kernel: vfio-pci 0000:01:00.1: Refused to change power state, currently in D3

 


I remember vaguely having an issue similar to this on one of my servers. I had to change a power setting in the bios... ACPI, or change it from performance to OS control... or turn the low power state off on PCIe... I'm sorry, I don't remember exactly. And the fact that you said you changed nothing and it just started happening is perplexing. I don't think anything in the auto-update would trigger the change unless you changed the Unraid OS version.

2 hours ago, 1812 said:

I remember vaguely having an issue similar to this on one of my servers. I had to change a power setting in the bios... ACPI, or change it from performance to OS control... or turn the low power state off on PCIe... I'm sorry, I don't remember exactly. And the fact that you said you changed nothing and it just started happening is perplexing. I don't think anything in the auto-update would trigger the change unless you changed the Unraid OS version.

 

I definitely haven't changed anything in the BIOS, but that seems like a good place to start. It appears the GPU is getting stuck in the "D3" power state and not resetting properly; the VM then fails because its GPU isn't working right and refuses to boot. I'll check for a relevant BIOS setting tomorrow.
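
For reference, the card's power state can be read straight from sysfs on the Unraid console (this is a generic Linux check rather than anything Unraid-specific, using my card's PCI addresses):

cat /sys/bus/pci/devices/0000:01:00.0/power_state
cat /sys/bus/pci/devices/0000:01:00.1/power_state

If those print D3hot after the VM shuts down, that lines up with the "Refused to change power state, currently in D3" messages above.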


Couldn't find anything that seemed relevant in my BIOS... the only PCIe-related power options were for wake-on-LAN types of settings. But while I was messing with Windows and safe mode and whatnot, I actually witnessed the GPU crash. The GPU fan immediately jumped to full power and the screen went blank. Here's the log from that moment:

 

Jan  9 18:22:51 Tower kernel: br0: port 2(vnet0) entered disabled state
Jan  9 18:22:51 Tower kernel: device vnet0 left promiscuous mode
Jan  9 18:22:51 Tower kernel: br0: port 2(vnet0) entered disabled state
Jan  9 18:22:53 Tower kernel: vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=io+mem,decodes=io+mem:owns=none
Jan  9 18:22:53 Tower kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
Jan  9 18:22:53 Tower kernel: CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.9.30-unRAID #1
Jan  9 18:22:53 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./C236M WS, BIOS P2.60 09/25/2017
Jan  9 18:22:53 Tower kernel: ffff88067dd03ec8 ffffffff813a4a1b ffff8806615c3600 ffff8806615c3600
Jan  9 18:22:53 Tower kernel: ffff88067dd03ef0 ffffffff810864d1 ffff8806615c3600 0000000000000000
Jan  9 18:22:53 Tower kernel: 0000000000000010 ffff88067dd03f28 ffffffff8108679a ffff8806615c3600
Jan  9 18:22:53 Tower kernel: Call Trace:
Jan  9 18:22:53 Tower kernel: <IRQ> 
Jan  9 18:22:53 Tower kernel: [<ffffffff813a4a1b>] dump_stack+0x61/0x7e
Jan  9 18:22:53 Tower kernel: [<ffffffff810864d1>] __report_bad_irq+0x2b/0xb4
Jan  9 18:22:53 Tower kernel: [<ffffffff8108679a>] note_interrupt+0x1a0/0x22e
Jan  9 18:22:53 Tower kernel: [<ffffffff81084384>] handle_irq_event_percpu+0x3d/0x46
Jan  9 18:22:53 Tower kernel: [<ffffffff810843c3>] handle_irq_event+0x36/0x54
Jan  9 18:22:53 Tower kernel: [<ffffffff810872e6>] handle_fasteoi_irq+0x90/0xf8
Jan  9 18:22:53 Tower kernel: [<ffffffff81020585>] handle_irq+0x17/0x1b
Jan  9 18:22:53 Tower kernel: [<ffffffff8101ffda>] do_IRQ+0x46/0xc2
Jan  9 18:22:53 Tower kernel: [<ffffffff8167fec2>] common_interrupt+0x82/0x82
Jan  9 18:22:53 Tower kernel: <EOI> 
Jan  9 18:22:53 Tower kernel: [<ffffffff815533e4>] ? cpuidle_enter_state+0xfe/0x156
Jan  9 18:22:53 Tower kernel: [<ffffffff8155345e>] cpuidle_enter+0x12/0x14
Jan  9 18:22:53 Tower kernel: [<ffffffff8107c545>] call_cpuidle+0x33/0x35
Jan  9 18:22:53 Tower kernel: [<ffffffff8107c727>] cpu_startup_entry+0x13a/0x1b2
Jan  9 18:22:53 Tower kernel: [<ffffffff81035482>] start_secondary+0xf5/0xf8
Jan  9 18:22:53 Tower kernel: handlers:
Jan  9 18:22:53 Tower kernel: [<ffffffffa039ed39>] i801_isr [i2c_i801]
Jan  9 18:22:53 Tower kernel: Disabling IRQ #16
Jan  9 18:23:00 Tower kernel: vgaarb: device changed decodes: PCI:0000:01:00.0,olddecodes=io+mem,decodes=io+mem:owns=none
Jan  9 18:23:00 Tower kernel: br0: port 2(vnet0) entered blocking state
Jan  9 18:23:00 Tower kernel: br0: port 2(vnet0) entered disabled state
Jan  9 18:23:00 Tower kernel: device vnet0 entered promiscuous mode
Jan  9 18:23:00 Tower kernel: br0: port 2(vnet0) entered blocking state
Jan  9 18:23:00 Tower kernel: br0: port 2(vnet0) entered forwarding state
Jan  9 18:23:02 Tower kernel: vfio_ecap_init: 0000:01:00.0 hiding ecap 0x19@0x270
Jan  9 18:23:02 Tower kernel: vfio_ecap_init: 0000:01:00.0 hiding ecap 0x1b@0x2d0
Jan  9 18:23:02 Tower kernel: vfio_ecap_init: 0000:01:00.0 hiding ecap 0x1e@0x370
Jan  9 18:23:02 Tower kernel: vfio-pci 0000:02:00.0: enabling device (0400 -> 0402)
Jan  9 18:23:12 Tower kernel: kvm: zapping shadow pages for mmio generation wraparound
Jan  9 18:23:12 Tower kernel: kvm: zapping shadow pages for mmio generation wraparound
Jan  9 18:30:32 Tower kernel: nvme nvme0: async event result 00020101
Jan  9 18:31:37 Tower emhttp: cmd: /usr/local/emhttp/plugins/dynamix/scripts/tail_log libvirt/qemu/Windows 10.log
Jan  9 18:31:40 Tower emhttp: cmd: /usr/local/emhttp/plugins/dynamix/scripts/tail_log syslog
Jan  9 18:37:23 Tower emhttp: cmd: /usr/local/emhttp/plugins/dynamix/scripts/tail_log syslog

 


ok, first the bad news: I am terrible with call traces.

 

now the good news: edit the title of this thread to include "Now with Call Trace" and hopefully someone who can decipher it for you will.

 

you could experiment by changing the GPU to a different IRQ in your bios in the meantime. it eventually disables IRQ 16, so if you check and it is set to that, change it. And if it does it again on a different IRQ, then  ¯\_(ツ)_/¯


I think it's related to this https://forum.level1techs.com/t/linux-host-windows-guest-gpu-passthrough-reinitialization-fix/121097 .

 

I have an R9 370 which behaves roughly the same. If I'm using the Intel igd as primary gpu, most of the time the AMD R9 pass-through will fail with "Refused to change power state, currently in D3". If the R9 is the primary gpu, I don't have any issues with "Refused to change power...", but I get "Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff" and I can only boot Seabios VMs. For OVMF, it only works for the first VM start, then I have to power down and unplug the unRaid server to make it work again. After a lot of testing, I managed to get the OVMF VM to always start by providing the video rom file details only to the HDMI part (0000:01:00.1) in the XML file.
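
Roughly, the hostdev entry in the VM XML ends up looking like this (just a sketch of my setup; the rom path and the guest-side address are only examples, point the file at your own dumped vbios):

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
  </source>
  <rom file='/mnt/user/isos/vbios/r9-370.rom'/>  <!-- example path to the dumped vbios -->
  <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
</hostdev>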

5 minutes ago, thomas said:

I think it's related to this https://forum.level1techs.com/t/linux-host-windows-guest-gpu-passthrough-reinitialization-fix/121097 .

 

I have an R9 370 which behaves roughly the same. If I'm using the Intel igd as primary gpu, most of the time the AMD R9 pass-through will fail with "Refused to change power state, currently in D3". If the R9 is the primary gpu, I don't have any issues with "Refused to change power...", but I get "Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff" and I can only boot Seabios VMs. For OVMF, it only works for the first VM start, then I have to power down and unplug the unRaid server to make it work again. After a lot of testing, I managed to get the OVMF VM to always start by providing the video rom file details only to the HDMI part (0000:01:00.1) in the XML file.

 

Ok I'll try passing the GPU BIOS and see how that works. You said you had to pass it *only* to 0000:01:00.1 --- so not to 0000:01:00.0 ?

On 1/10/2018 at 6:55 AM, thomas said:

In my case, it only works if the bios rom is passed to the HDMI part. If it's passed to 0000:01:00.0, the screen remains blank...

 

Well unfortunately that doesn't seem to help. I tried passing the rom to both the main GPU (0000:01:00.0) and the HDMI audio part (0000:01:00.1), separately and together, and it still fails on reboot.

 

What I have discovered after messing with this for several days is that the VM boots into W10 startup repair and refuses to boot normally back into W10. Then the GPU failure (the original point of this thread) happens after sitting on the startup repair screen for several minutes; it's actually not instant upon reboot. I am able to get back to W10 in safe mode only; if I want to boot as normal, I have to resort to a full power cycle of the entire server.


Another thing you can try is to not pass the HDMI audio to the VM: pass only the graphics card and see if that helps. I have seen that, most of the time, if I try to shut down and start my VM in quick succession it will complain that the HDMI part is still in use by the previously running VM, even though the shutdown was successful and the screen is blank...


Just completed the update to 6.4. Also tried activating the new UEFI boot option on the unraid flash drive -- no change.

 

Still having failures when I reboot my Windows VM; every time it tries to load into W10 startup repair, which is the same behavior that will lead to the GPU crash after 5-10 minutes of trying to boot. The startup repair screen is at low-res (like 800x600 or something).

 

So everything seems to be pointing to the GPU not properly initializing to pass-through upon reboot, then Windows freaks and gets into a boot loop with startup repair.

 

Edit: The Fix Common Problems plugin just notified that the server is still reporting errors for IRQ 16... confirmed with cat /proc/interrupts that this is the GPU at 0000:01:00.0
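
The check itself is just standard commands, roughly:

grep ' 16:' /proc/interrupts
lspci -v -s 01:00.0 | grep IRQ

The first shows the interrupt counts and registered handlers for IRQ 16, the second shows which IRQ the GPU reports.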

 

>> Diagnostics attached.

tower-diagnostics-20180113-1118.zip


 

interesting, your gpu isn't alone in its iommu group. It's together with two pci bridges and a usb controller.

 

/sys/kernel/iommu_groups/1/devices/0000:00:01.1
/sys/kernel/iommu_groups/1/devices/0000:01:00.1
/sys/kernel/iommu_groups/1/devices/0000:00:01.0
/sys/kernel/iommu_groups/1/devices/0000:02:00.0
/sys/kernel/iommu_groups/1/devices/0000:01:00.0

 

00:01.1 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x8) [8086:1905] (rev 05)
	Kernel driver in use: pcieport
00:01.0 PCI bridge [0604]: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) [8086:1901] (rev 05)
	Kernel driver in use: pcieport

01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/580] [1002:67df] (rev c7)
	Subsystem: PC Partner Limited / Sapphire Technology Radeon RX 470/480 [174b:e347]
01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 580] [1002:aaf0]
	Subsystem: PC Partner Limited / Sapphire Technology Device [174b:aaf0]
02:00.0 USB controller [0c03]: VIA Technologies, Inc. VL805 USB 3.0 Host Controller [1106:3483] (rev 01)
	Subsystem: VIA Technologies, Inc. VL805 USB 3.0 Host Controller [1106:3483]
	Kernel driver in use: vfio-pci

 

 

 

Let's try a few things.

 

first, can you move your gpu to a different slot and retry? you may need to reassign the gpu.

 

If you get the same error, please note the iommu groups (system > system devices > iommu groups), then try the following:

 

Modify the append initrd=/bzroot line in your syslinux.cfg to this and reboot:

 

append pcie_acs_override=1002:67df,1002:aaf0 initrd=/bzroot
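
In context, the whole stanza in /boot/syslinux/syslinux.cfg ends up looking something like this (stock Unraid layout assumed; only the append line changes):

label unRAID OS
  menu default
  kernel /bzimage
  append pcie_acs_override=1002:67df,1002:aaf0 initrd=/bzroot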

Then look and see if the radeon and its sound component are in their own iommu group and try to restart. Note whether the error is the same or different afterwards. The pci bridges in the group may be causing the issue here.

 

if that doesn't work, read more below?

 

maybe try here:

Using the ACS override line did not change the IOMMU groups -- both AMD devices (GPU/audio) are still showing in group one along with the PCI bridges and my PCIE USB3 card:

 

[screenshot of the IOMMU groups listing]

 

The second linked thread sounds just like my problem: 

So just like the linked thread, I am using a PCIe USB3 card in order to pass through hot-swappable USB ports to Windows 10. What's weird is that my hardware configuration hasn't changed in that regard; I've had that PCIe USB3 card since long before this problem came up.... also that thread doesn't have a real solution to the issue. I don't have 3 monitors to connect to solve it. I am also using the DisplayPort output, not HDMI.

 

One thing I did notice that was kind of weird is that the PCIe controller in my IOMMU group 1 is showing as Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x8) [8086:1905] (rev 05). My CPU is an i7-7700K, not a Xeon. Is there maybe some conflict there with the C236 chipset on the motherboard?

  • 2 weeks later...
On 2/4/2018 at 5:30 AM, Siwat2545 said:

use pcie_acs_override=downstream,multifunction

 

So using this parameter did separate the AMD components into their own IOMMU groups. The GPU is now in group 12 and the audio in group 13 (and the USB3 controller is in group 14).
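
To double-check the grouping from the console I used a variant of the usual VFIO-guide snippet (nothing Unraid-specific):

for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group ${g##*/}:"
  for d in "$g"/devices/*; do
    lspci -nns "${d##*/}"
  done
done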

 

But I'm still getting the IRQ errors and of course can't successfully reboot the VM.

 

Feb 16 20:21:57 Tower kernel: irq 16: nobody cared (try booting with the "irqpoll" option)
Feb 16 20:21:57 Tower kernel: CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.14.13-unRAID #1
Feb 16 20:21:57 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./C236M WS, BIOS P2.60 09/25/2017
Feb 16 20:21:57 Tower kernel: Call Trace:
Feb 16 20:21:57 Tower kernel: <IRQ>
Feb 16 20:21:57 Tower kernel: dump_stack+0x5d/0x79
Feb 16 20:21:57 Tower kernel: __report_bad_irq+0x32/0xac
Feb 16 20:21:57 Tower kernel: note_interrupt+0x1d4/0x225
Feb 16 20:21:57 Tower kernel: handle_irq_event_percpu+0x39/0x3f
Feb 16 20:21:57 Tower kernel: handle_irq_event+0x31/0x4f
Feb 16 20:21:57 Tower kernel: handle_fasteoi_irq+0x8c/0xe7
Feb 16 20:21:57 Tower kernel: handle_irq+0x16/0x19
Feb 16 20:21:57 Tower kernel: do_IRQ+0x3b/0xb5
Feb 16 20:21:57 Tower kernel: common_interrupt+0x98/0x98
Feb 16 20:21:57 Tower kernel: </IRQ>
Feb 16 20:21:57 Tower kernel: RIP: 0010:cpuidle_enter_state+0xde/0x130
Feb 16 20:21:57 Tower kernel: RSP: 0018:ffffc900031e3ef8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff6d
Feb 16 20:21:57 Tower kernel: RAX: ffff88067dd20900 RBX: 0000000000000000 RCX: 000000000000001f
Feb 16 20:21:57 Tower kernel: RDX: 000000e38118d684 RSI: 0000000000020140 RDI: 0000000000000000
Feb 16 20:21:57 Tower kernel: RBP: ffff88067dd28800 R08: 000003d7d2c46568 R09: 0000000000000018
Feb 16 20:21:57 Tower kernel: R10: ffffc900031e3ed8 R11: 0000000000000000 R12: 0000000000000001
Feb 16 20:21:57 Tower kernel: R13: 000000e38118d684 R14: ffffffff81c59138 R15: 000000e38118ca28
Feb 16 20:21:57 Tower kernel: ? cpuidle_enter_state+0xb6/0x130
Feb 16 20:21:57 Tower kernel: do_idle+0x11a/0x179
Feb 16 20:21:57 Tower kernel: cpu_startup_entry+0x18/0x1a
Feb 16 20:21:57 Tower kernel: secondary_startup_64+0xa5/0xb0
Feb 16 20:21:57 Tower kernel: handlers:
Feb 16 20:21:57 Tower kernel: [<ffffffffa0066cd2>] i801_isr [i2c_i801]
Feb 16 20:21:57 Tower kernel: Disabling IRQ #16

 


Ok I think @dvd.collector has linked us to the root of the problem. It seems that the GPU is expecting to have its power cut on a reboot, which isn't happening with the VM.

 

I created a batch script that will force the GPU to be disabled/enabled on startup and shutdown according to this post on L1. This only seems to work on manual reboots, however, and anything automatic (e.g., Windows Update) still results in failures.
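
The script itself is tiny; mine is roughly the following (a sketch, not the exact version from the L1 post: devcon.exe comes from the Windows Driver Kit, the C:\Tools path is just where I happened to put it, and the hardware ID matches the 1002:67df RX 480 from the lspci output above). The disable half runs as a Group Policy shutdown script and the enable half as a startup script:

rem ---- disable-gpu.cmd, run as a shutdown script ----
@echo off
rem release the RX 480 so the host can reset it cleanly
"C:\Tools\devcon.exe" disable "PCI\VEN_1002&DEV_67DF*"

rem ---- enable-gpu.cmd, run as a startup script ----
@echo off
rem bring the GPU back once Windows is up
"C:\Tools\devcon.exe" enable "PCI\VEN_1002&DEV_67DF*"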

 

Basically I'm good to go so long as I never have an automatic reboot...

 

Now the tricky question: any ideas why this would only be happening in recent months? I've had unraid on this exact machine for a while, but these reboot problems only started in the last few months...

 

 

  • 3 months later...
23 hours ago, planetwilson said:

Anyone found a solution for this yet?

 

I've been following the Level 1 Techs post I mentioned above. It seems to be a wider problem with Windows VMs with GPU passthrough on a Linux host.

 

Any time a Windows Update comes down, I have to manually install it and manually reboot (so that my VM will properly disable and reinitialize the GPU). If a reboot or shutdown happens automatically, the Windows VM will not come back and I'll have to power-cycle the entire Unraid server.

