GPU not resetting after graceful shutdown


Recommended Posts

Hello

I'm a new user of unraid but I've tried to conduct my research as best I can, I'm grateful for any advise given in this matter.

 

Specs: 

Threadripper 2920X

Gigabyte x399 Designare Ex

32GB RAM, 4x8GB

AMD Vega 64

Nvidia GTX 970

Nvidia GT 710

PCIEx4 SATA x8 card

 

Issue:

I'm having an issue with my AMD GPU passthrough, it cannot be used after a reboot of the guest VM. It requires a reboot of Unraid. Seemingly a common issue looking around related to the GPU not been reset or de-initialised correctly.

I've seen many reports of this happening when the VM is not powered down gracefully. However for me it this happens when shutting down the guest normally. After which, attempting to use the card again results in error 127. Requiring a host reboot. However if I use the Force Stop option on the VM, the card can then be used again without issue. So it seems that Unraid or the guest UEFI is doing something to de-init the card but this isn't happening during a graceful shutdown.

I'm using Manjaro (latest Plasma) on the guest.

 

The Nvidia GPU works as expected. The cards are to be configured to have Unraid use the GT710 and then 2 VM, one with the AMD and the other the Nvidia card.

I've noted that there are PCIE Bridge devices in the GPU's IOMMU group but read that these dummy and bridge devices shouldn't cause an issue.

I get the same issue on both I440FX and Q35

I'm passing through both the Display and audio devices of the card.

 

Thank you for any advise!

 

IOMMU - Latest BIOS with ACS disabled. (Same issue with it enabled):

IOMMU group 0:	[1022:1452] 00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
[1022:1453] 00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
[1022:43ba] 01:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset USB 3.1 xHCI Controller (rev 02)
[1022:43b6] 01:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset SATA Controller (rev 02)
[1022:43b1] 01:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset PCIe Bridge (rev 02)
[1022:43b4] 02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
[1022:43b4] 02:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
[1022:43b4] 02:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
[1022:43b4] 02:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
[1022:43b4] 02:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
[8086:1539] 04:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
[8086:24fd] 05:00.0 Network controller: Intel Corporation Wireless 8265 / 8275 (rev 78)
[8086:1539] 06:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
[10de:128b] 07:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)
[10de:0e0f] 07:00.1 Audio device: NVIDIA Corporation GK208 HDMI/DP Audio Controller (rev a1)
IOMMU group 1:	[1022:1452] 00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
IOMMU group 2:	[1022:1452] 00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
[1022:1453] 00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
[10de:13c2] 08:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1)
[10de:0fbb] 08:00.1 Audio device: NVIDIA Corporation GM204 High Definition Audio Controller (rev a1)
IOMMU group 3:	[1022:1452] 00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
IOMMU group 4:	[1022:1452] 00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
[1022:1454] 00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
[1022:145a] 09:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
[1022:1456] 09:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
[1022:145f] 09:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller
IOMMU group 5:	[1022:1452] 00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
[1022:1454] 00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
[1022:1455] 0a:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
[1022:7901] 0a:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
[1022:1457] 0a:00.3 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
IOMMU group 6:	[1022:790b] 00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 59)
[1022:790e] 00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
IOMMU group 7:	[1022:1460] 00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
[1022:1461] 00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
[1022:1462] 00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
[1022:1463] 00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
[1022:1464] 00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
[1022:1465] 00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
[1022:1466] 00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
[1022:1467] 00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
IOMMU group 8:	[1022:1460] 00:19.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
[1022:1461] 00:19.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
[1022:1462] 00:19.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
[1022:1463] 00:19.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
[1022:1464] 00:19.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
[1022:1465] 00:19.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
[1022:1466] 00:19.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
[1022:1467] 00:19.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
IOMMU group 9:	[1022:1452] 40:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
[1022:1453] 40:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
[1b4b:9235] 41:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9235 PCIe 2.0 x2 4-port SATA 6 Gb/s Controller (rev 11)
IOMMU group 10:	[1022:1452] 40:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
IOMMU group 11:	[1022:1452] 40:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
[1022:1453] 40:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
[1022:1470] 42:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1470 (rev c1)
[1022:1471] 43:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1471
[1002:687f] 44:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] (rev c1)
[1002:aaf8] 44:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64]
IOMMU group 12:	[1022:1452] 40:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
IOMMU group 13:	[1022:1452] 40:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
[1022:1454] 40:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
[1022:145a] 45:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Raven/Raven2 PCIe Dummy Function
[1022:1456] 45:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
[1022:145f] 45:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller
IOMMU group 14:	[1022:1452] 40:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge
[1022:1454] 40:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
[1022:1455] 46:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function
[1022:7901] 46:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)

 

Link to comment

 

10 minutes ago, Pducharme said:

Have a look at this excellent video :

 

https://www.youtube.com/watch?v=0uZODoPQH9c

Hello,

Thank you for this link, I've come across this video previously but was unsure if it was applicable as at first he stated that the reset bug happens when the machine is not powered down gracefully (force stop). Where as for me it's the other way around.

Regardless I shall look at trying this.

Edited by LightWrath
typo
Link to comment

Thanks for this!

I do feel a little silly having not tried that when I found based on my issue been inverted. But it makes perfect sense that it would resolve it regardless.

If this was just as described in the video where I'd only have to sleep upon an improper shutdown then that would be great, but doing it every time I power off or reboot a VM normally is going to be a little annoying. I'm a little reluctant to consider it resolved based on that, but I am grateful of the assistance. 

Link to comment
  • 1 year later...

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.
Note: Your post will require moderator approval before it will be visible.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.