gelmi

AMD GPU linux driver breaks VM


Hi,

I have been tinkering with Unraid for almost two weeks on my Ryzen build (ASUS X370-PRO, Ryzen 1600 not overclocked, C-states enabled, BIOS 0902, RAM overclocked to 2933 MHz, and a single GPU, an RX 550). As an Unraid beginner I have had my ups and downs, but I am getting to the point where I would like to turn my workstation permanently into an Unraid + VM workstation. Even without the ACS patch, the GPU sits cleanly in its own IOMMU group:

IOMMU group 11
	[1002:699f] 28:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon RX 550] (rev c7)
	[1002:aae0] 28:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aae0

I could only pass the GPU through to my Ubuntu 16.04 LTS VM (GNOME) with 6.4.0-rc9f. The stable release was either unstable with C-states enabled or had issues passing the GPU through.

My Ubuntu VM uses Q35-2.9 and SeaBIOS (OVMF would not take my GPU at all). On i440fx-2.9 the GPU was not released after a VM reboot or shutdown, so I had to restart the Unraid host to run the VM again. If I tried to start the VM after a reboot or forced shutdown, the whole Unraid host hung: no web GUI, no ping, no SSH. After rebooting the host I had to re-enable VM Management. I even tried to soft-remove the GPU and shut down the VM with a script:

script:

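#!/bin/sh
# Hot-unplug the GPU described in gpudev.xml from the running VM, then request a clean guest shutdown.
virsh detach-device Ubuntu-17.04 /mnt/user/system/gpudev.xml
virsh shutdown Ubuntu-17.04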


xml:
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x28' slot='0x00' function='0x0'/>
  </source>
</hostdev>
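
For other cards, a device XML like the one above can be generated from the PCI address instead of being typed by hand. This is only a minimal Python sketch, assuming the address is given in the `bus:slot.function` form shown in the IOMMU listing (for example `28:00.0`); the function name `hostdev_xml` is made up for this example and is not part of libvirt or Unraid.

```python
import re
import xml.etree.ElementTree as ET

# Accepts "28:00.0" or the full "0000:28:00.0" form (hex digits, function 0-7).
PCI_ADDR_RE = re.compile(
    r"(?:([0-9a-fA-F]{4}):)?([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])"
)


def hostdev_xml(pci_addr: str) -> str:
    """Return a libvirt <hostdev> fragment for one PCI function (hypothetical helper)."""
    match = PCI_ADDR_RE.fullmatch(pci_addr.strip())
    if match is None:
        raise ValueError(f"not a PCI address: {pci_addr!r}")
    domain, bus, slot, function = match.groups()
    domain = domain or "0000"  # assume PCI domain 0 when it is not given

    hostdev = ET.Element(
        "hostdev", {"mode": "subsystem", "type": "pci", "managed": "yes"}
    )
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        {
            "domain": f"0x{domain.lower()}",
            "bus": f"0x{bus.lower()}",
            "slot": f"0x{slot.lower()}",
            "function": f"0x{function}",
        },
    )
    return ET.tostring(hostdev, encoding="unicode")


if __name__ == "__main__":
    print(hostdev_xml("28:00.0"))
```

The printed fragment carries the same attributes as the XML above and could be saved to a file and passed to `virsh detach-device` in the same way as in the script.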

Also, one time after I had to hard-reset the Unraid tower, it didn't boot, and I had to restore the flash drive from a CA backup. So I decided to go with the Ubuntu template, which uses Q35.

 

On Q35-2.9, rebooting the VM works better. Sometimes I get:

unraid internal error: Unknown PCI header type '127'

but it is not that common, and a proper host reboot (stop the array, then reboot) clears it.
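
A note on the number in that error: 127 is what results when the Header Type byte of the PCI configuration space reads back as 0xFF and only its low seven bits are kept (0xFF & 0x7F = 127). That usually suggests the card did not answer at all, for example because it did not come back from a reset. Below is a rough Python sketch of that check, assuming Linux sysfs exposes the configuration space at `/sys/bus/pci/devices/<address>/config`; the helper names are hypothetical.

```python
HEADER_TYPE_OFFSET = 0x0E  # standard offset of the Header Type byte in PCI config space


def decode_header_type(config: bytes):
    """Split the Header Type byte into (layout type, multi-function flag)."""
    raw = config[HEADER_TYPE_OFFSET]
    return raw & 0x7F, bool(raw & 0x80)


def looks_unresponsive(config: bytes) -> bool:
    """True when the first 16 header bytes are all 0xFF (device not answering)."""
    return len(config) >= 16 and all(byte == 0xFF for byte in config[:16])


def read_config(pci_addr: str, sysfs_root: str = "/sys/bus/pci/devices") -> bytes:
    """Read the first 64 bytes of config space; pci_addr uses the full form, e.g. 0000:28:00.0."""
    with open(f"{sysfs_root}/{pci_addr}/config", "rb") as handle:
        return handle.read(64)


if __name__ == "__main__":
    # Simulated read from a device that returns all ones.
    dead = bytes([0xFF] * 64)
    print(decode_header_type(dead), looks_unresponsive(dead))
```

On the host, `decode_header_type(read_config("0000:28:00.0"))` would show what the card currently reports; the address is the RX 550 from the listing above with domain 0000 assumed.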

 

However, in the 16.04 VM the RX 550 is not supported properly. The system works, but anything graphical lags badly: Chrome, YouTube and movies run at about 2 frames per second. When I download and install the AMDGPU-PRO drivers everything works fine, but after I reboot, the system breaks: I cannot get to the login screen and the Unraid host usually hard-hangs (no web GUI, no SSH). After rebooting the tower, I can switch the VM to VNC, uninstall the drivers via SSH, shut down, re-enable the RX 550 and boot the VM. I have also tried soft-removing the GPU and dumping the BIOS ROM file, but no luck. Here is the log from just before the crash, after the drivers are installed:

Oct 7 10:27:35 DarkTower kernel: IO_PAGE_FAULT device=28:00.0 domain=0x0001 address=0x000000f40021a000 flags=0x0010]
Oct 7 10:27:35 DarkTower kernel: AMD-Vi: Event logged [
Oct 7 10:27:35 DarkTower kernel: IO_PAGE_FAULT device=28:00.0 domain=0x0001 address=0x000000f40021c000 flags=0x0010]
Oct 7 10:27:35 DarkTower kernel: AMD-Vi: Event logged [
Oct 7 10:27:35 DarkTower kernel: IO_PAGE_FAULT device=28:00.0 domain=0x0001 address=0x000000f40021bd40 flags=0x0010]
Oct 7 10:27:35 DarkTower kernel: AMD-Vi: Event logged [
Oct 7 10:27:35 DarkTower kernel: IO_PAGE_FAULT device=28:00.0 domain=0x0001 address=0x000000f40021de40 flags=0x0010]
Oct 7 10:27:35 DarkTower kernel: AMD-Vi: Event logged [
Oct 7 10:27:35 DarkTower kernel: IO_PAGE_FAULT device=28:00.0 domain=0x0001 address=0x000000f40021e000 flags=0x0010]
Oct 7 10:27:35 DarkTower kernel: AMD-Vi: Event logged [
Oct 7 10:27:35 DarkTower kernel: IO_PAGE_FAULT device=28:00.0 domain=0x0001 address=0x000000f40021c040 flags=0x0010]
Oct 7 10:27:35 DarkTower kernel: AMD-Vi: Event logged [
Oct 7 10:27:35 DarkTower kernel: IO_PAGE_FAULT device=28:00.0 domain=0x0001 address=0x000000f40021c340 flags=0x0010]
Oct 7 10:27:35 DarkTower kernel: AMD-Vi: Event logged [
Oct 7 10:27:35 DarkTower kernel: IO_PAGE_FAULT device=28:00.0 domain=0x0001 address=0x000000f40021e100 flags=0x0010]
Oct 7 10:27:36 DarkTower kernel: usb 5-4: reset low-speed USB device number 4 using xhci_hcd
Oct 7 10:27:36 DarkTower kernel: usb 5-4: reset low-speed USB device number 4 using xhci_hcd
Oct 7 10:27:37 DarkTower ntpd[1790]: Listen normally on 10 docker0 172.17.0.1:123
Oct 7 10:27:37 DarkTower ntpd[1790]: Listen normally on 11 docker0 [fe80::42:ccff:fec7:dcf5%10]:123
Oct 7 10:27:37 DarkTower ntpd[1790]: Listen normally on 12 veth5c771a1 [fe80::b0c1:fff:feeb:7bde%15]:123
Oct 7 10:27:37 DarkTower ntpd[1790]: new interface(s) found: waking up resolver
Oct 7 10:27:37 DarkTower kernel: usb 5-4: reset low-speed USB device number 4 using xhci_hcd
Oct 7 10:27:38 DarkTower kernel: usb 5-3: reset low-speed USB device number 3 using xhci_hcd
Oct 7 10:27:39 DarkTower kernel: usb 5-3: reset low-speed USB device number 3 using xhci_hcd
Oct 7 10:27:40 DarkTower kernel: usb 5-3: reset low-speed USB device number 3 using xhci_hcd
Oct 7 10:30:11 DarkTower kernel: vethd541a47: renamed from eth0
Oct 7 10:30:11 DarkTower kernel: docker0: port 1(veth5c771a1) entered disabled state
Oct 7 10:30:12 DarkTower avahi-daemon[3813]: Interface veth5c771a1.IPv6 no longer relevant for mDNS.
Oct 7 10:30:12 DarkTower avahi-daemon[3813]: Leaving mDNS multicast group on interface veth5c771a1.IPv6 with address fe80::b0c1:fff:feeb:7bde.
Oct 7 10:30:12 DarkTower kernel: docker0: port 1(veth5c771a1) entered disabled state
Oct 7 10:30:12 DarkTower kernel: device veth5c771a1 left promiscuous mode
Oct 7 10:30:12 DarkTower kernel: docker0: port 1(veth5c771a1) entered disabled state
Oct 7 10:30:12 DarkTower avahi-daemon[3813]: Withdrawing address record for fe80::b0c1:fff:feeb:7bde on veth5c771a1.
Oct 7 10:30:14 DarkTower ntpd[1790]: Deleting interface #10 docker0, 172.17.0.1#123, interface stats: received=0, sent=0, dropped=0, active_time=157 secs
Oct 7 10:30:14 DarkTower ntpd[1790]: Deleting interface #11 docker0, fe80::42:ccff:fec7:dcf5%10#123, interface stats: received=0, sent=0, dropped=0, active_time=157 secs
Oct 7 10:30:14 DarkTower ntpd[1790]: Deleting interface #12 veth5c771a1, fe80::b0c1:fff:feeb:7bde%15#123, interface stats: received=0, sent=0, dropped=0, active_time=157 secs
Oct 7 10:32:12 DarkTower ntpd[1790]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
Oct 7 10:32:15 DarkTower kernel: usb 5-4: reset low-speed USB device number 4 using xhci_hcd
Oct 7 10:32:15 DarkTower kernel: usb 5-3: reset low-speed USB device number 3 using xhci_hcd
Oct 7 10:32:17 DarkTower kernel: usb 5-4: reset low-speed USB device number 4 using xhci_hcd
Oct 7 10:32:17 DarkTower kernel: usb 5-3: reset low-speed USB device number 3 using xhci_hcd
Oct 7 10:32:19 DarkTower kernel: usb 5-4: reset low-speed USB device number 4 using xhci_hcd
Oct 7 10:32:20 DarkTower kernel: usb 5-3: reset low-speed USB device number 3 using xhci_hcd
Oct 7 10:32:20 DarkTower kernel: usb 5-4: reset low-speed USB device number 4 using xhci_hcd
Oct 7 10:32:21 DarkTower kernel: usb 5-3: reset low-speed USB device number 3 using xhci_hcd
Oct 7 10:32:22 DarkTower kernel: usb 5-4: reset low-speed USB device number 4 using xhci_hcd
Oct 7 10:32:23 DarkTower kernel: usb 5-3: reset low-speed USB device number 3 using xhci_hcd
Oct 7 10:32:23 DarkTower kernel: usb 5-4: reset low-speed USB device number 4 using xhci_hcd
Oct 7 10:32:24 DarkTower kernel: usb 5-3: reset low-speed USB device number 3 using xhci_hcd
Oct 7 10:32:28 DarkTower kernel: usb 5-4: reset low-speed USB device number 4 using xhci_hcd
Oct 7 10:32:29 DarkTower kernel: usb 5-3: reset low-speed USB device number 3 using xhci_hcd
Oct 7 10:32:30 DarkTower kernel: usb 5-4: reset low-speed USB device number 4 using xhci_hcd
Oct 7 10:32:30 DarkTower kernel: usb 5-3: reset low-speed USB device number 3 using xhci_hcd
Oct 7 10:32:31 DarkTower kernel: usb 5-4: reset low-speed USB device number 4 using xhci_hcd
Oct 7 10:32:32 DarkTower kernel: usb 5-4: reset low-speed USB device number 4 using xhci_hcd
Oct 7 10:32:35 DarkTower kernel: usb 5-4: reset low-speed USB device number 4 using xhci_hcd
Oct 7 10:32:35 DarkTower kernel: usb 5-3: reset low-speed USB device number 3 using xhci_hcd
Oct 7 10:32:39 DarkTower kernel: traps: emhttpd[3877] trap divide error ip:419a82 sp:2ab092766e00 error:0 in emhttpd[400000+25000]
Oct 7 10:32:46 DarkTower kernel: usb 5-3: reset low-speed USB device number 3 using xhci_hcd
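
To see at a glance what a log like the one above contains, the lines can be tallied by type. The following is a small Python sketch, assuming the syslog format shown above; the function name `summarize` and the regular expressions are illustrative only.

```python
import re
from collections import Counter

# Patterns written against the log lines quoted in this post.
FAULT_RE = re.compile(r"IO_PAGE_FAULT device=(\S+) domain=(\S+) address=(0x[0-9a-fA-F]+)")
USB_RESET_RE = re.compile(r"usb (\S+): reset \S+ USB device number (\d+)")


def summarize(log_text: str):
    """Count IOMMU page faults per PCI device and USB resets per USB port."""
    faults = Counter()
    resets = Counter()
    for line in log_text.splitlines():
        fault = FAULT_RE.search(line)
        if fault:
            faults[fault.group(1)] += 1
            continue
        reset = USB_RESET_RE.search(line)
        if reset:
            resets[reset.group(1)] += 1
    return faults, resets


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py < syslog.txt
    page_faults, usb_resets = summarize(sys.stdin.read())
    print("IO_PAGE_FAULT per device:", dict(page_faults))
    print("USB resets per port:", dict(usb_resets))
```

A burst of page faults attributed to `28:00.0` would point at the passed-through GPU itself rather than at the USB devices that reset afterwards.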

So, do you have any advice on how I can use my Ubuntu VM smoothly with the GPU?

I am pretty happy with Unraid and its array/NAS capabilities, but I need a working GPU in a Linux machine for development. I do not want to have two PCs, one for NAS and one for work.

 


Hey, I have the same GPU, and whenever I install the GPU drivers in the Windows 10 VM, it just stops working: black screen and nothing else happens. Did you have any luck with this issue?


Yes. Mine is working like a charm now. Can you tell me more about the hardware you are using? Which motherboard, which BIOS, how many GPUs (internal, external), etc.

2 hours ago, gelmi said:

Yes. Mine is working like a charm now. Can you tell me more about the hardware you are using? Which motherboard, which BIOS, how many GPUs (internal, external), etc.

Sure:

CPU: Ryzen 5 2600 6 cores / 12 threads

Board: MSI X370 GAMING PRO CARBON (MS-7A32)

GPUs: 

- Primary: AMD Radeon RX 550 

- Secondary: NVIDIA GeForce GTX 750 Ti

RAM: 16GB DDR4

HDD: 2 x 500GB 7200rpm (for the array)

SSD: 2 x 128 GB (one passed through to each Windows VM)

M2SSD: 1 x 128 GB (for cache)

 

==========

IOMMU Groups

IOMMU group 0:	[1022:1452] 00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
IOMMU group 1:	[1022:1453] 00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
IOMMU group 2:	[1022:1452] 00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
IOMMU group 3:	[1022:1452] 00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
IOMMU group 4:	[1022:1453] 00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
IOMMU group 5:	[1022:1453] 00:03.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
IOMMU group 6:	[1022:1452] 00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
IOMMU group 7:	[1022:1452] 00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
IOMMU group 8:	[1022:1454] 00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
IOMMU group 9:	[1022:1452] 00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
IOMMU group 10:	[1022:1454] 00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
IOMMU group 11:	[1022:790b] 00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 59)
	[1022:790e] 00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
IOMMU group 12:	[1022:1460] 00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
	[1022:1461] 00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
	[1022:1462] 00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
	[1022:1463] 00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
	[1022:1464] 00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
	[1022:1465] 00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
	[1022:1466] 00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
	[1022:1467] 00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
IOMMU group 13:	[1022:43b9] 03:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset USB 3.1 xHCI Controller (rev 02)
	[1022:43b5] 03:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset SATA Controller (rev 02)
	[1022:43b0] 03:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] X370 Series Chipset PCIe Upstream Port (rev 02)
	[1022:43b4] 16:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
	[1022:43b4] 16:01.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
	[1022:43b4] 16:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
	[1022:43b4] 16:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
	[1022:43b4] 16:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
	[1022:43b4] 16:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
	[8086:1539] 17:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
	[1b21:2142] 1c:00.0 USB controller: ASMedia Technology Inc. ASM2142 USB 3.1 Host Controller
IOMMU group 14:	[1002:699f] 1d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon RX 550/550X] (rev c7)
	[1002:aae0] 1d:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Baffin HDMI/DP Audio [Radeon RX 550 640SP / RX 560/560X]
IOMMU group 15:	[10de:1380] 1e:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750 Ti] (rev a2)
	[10de:0fbc] 1e:00.1 Audio device: NVIDIA Corporation Device 0fbc (rev a1)
IOMMU group 16:	[1022:145a] 1f:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device 145a
IOMMU group 17:	[1022:1456] 1f:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
IOMMU group 18:	[1022:145f] 1f:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] USB 3.0 Host controller
IOMMU group 19:	[1022:1455] 20:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device 1455
IOMMU group 20:	[1022:7901] 20:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
IOMMU group 21:	[1022:1457] 20:00.3 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
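
A listing like this can be checked mechanically to see which devices share a group with a given GPU. Here is a minimal Python sketch, assuming the text format above (a group header followed by one `[vendor:device] address description` entry per line); `parse_iommu_groups` and `group_members` are hypothetical helper names.

```python
import re

GROUP_RE = re.compile(r"IOMMU group (\d+)")
# "[1002:699f] 1d:00.0 VGA compatible controller: ..."
DEVICE_RE = re.compile(
    r"\[([0-9a-f]{4}:[0-9a-f]{4})\]\s+([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+(.+)",
    re.IGNORECASE,
)


def parse_iommu_groups(text: str) -> dict:
    """Map group number -> list of (vendor:device id, PCI address, description)."""
    groups = {}
    current = None
    for line in text.splitlines():
        header = GROUP_RE.search(line)
        if header:
            current = int(header.group(1))
            groups.setdefault(current, [])
        device = DEVICE_RE.search(line)
        if device and current is not None:
            groups[current].append(device.groups())
    return groups


def group_members(groups: dict, pci_addr: str):
    """Return (group number, addresses in that group) for pci_addr, or None if absent."""
    for number, devices in groups.items():
        addresses = [addr for _, addr, _ in devices]
        if pci_addr in addresses:
            return number, addresses
    return None


if __name__ == "__main__":
    import sys

    # Usage: python iommu.py 1d:00.0 < iommu_groups.txt
    print(group_members(parse_iommu_groups(sys.stdin.read()), sys.argv[1]))
```

A GPU is generally easiest to pass through when its group contains only its own functions (the VGA device and its HDMI audio device), which is what the lookup is meant to confirm.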

Also, the idea behind this hardware is to have the array for sharing files plus two Windows 10 VMs, each with a dedicated GPU.


OK. You know that for the RX 550 you need to install the standard driver and NOT the PRO driver?


Hey, no, I had no idea about that. I was thinking about selling my RX 550 and buying a GTX 1050 instead, since I was able to pass through the other card, the GTX 750 Ti. I will try to find those drivers and test them. Thanks for the reply, I appreciate it very much.


The PRO version is for cards such as Vega or Radeon Pro. The RX 550 should use the regular driver.

Try installing it and post here whether it fixes your issue.


Hey, I tried the drivers provided by the vendor, in this case from the Gigabyte web page instead of the AMD web page, and it worked just fine, even better than my GTX 750 Ti, which has some weird issue with HDMI sound. Thanks a lot!


Great that it is working now. If you pass through the GPU together with its HDMI audio device and run into issues, try either reinstalling the drivers or changing the cable. Sometimes it helps.


Today I tried this again with a new VM, but this time I just let Windows 10 (downloaded directly from the Microsoft website) search for and install the driver for me, and it worked really well. So, as opposed to installing drivers on our own, I'd suggest leaving that to Windows 10 itself, right?


I have the drivers from the AMD website, but as I said, I installed the standard version, not the PRO one.


Newer AMD GPU driver versions tend to break passthrough: VMs won't start, or the monitor stays black. A couple of people have already reported that. Drivers from the end of last year should work fine. I don't know exactly which version the issue started with.


It's an i440fx problem. Switch to Q35 and you can install every driver version out there. But if you would like to stay with i440fx, I guess the latest version that installed without the driver hanging on a black screen was 18.2.1, but don't quote me on that. If you really need to know, I could check later.

On 10/7/2017 at 12:12 PM, gelmi said:

unraid internal error: Unknown PCI header type '127'

I had the same issue. Setting "VFIO allow unsafe interrupts" to "yes" in VM Settings fixed it for me. Now I can boot, sleep, shut down and switch VMs without any problems.
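
On the host, this setting appears to correspond to the `allow_unsafe_interrupts` parameter of the `vfio_iommu_type1` kernel module; that mapping is an assumption here, not something confirmed in this thread. Below is a minimal Python sketch for checking the parameter's current value through sysfs; `unsafe_interrupts_enabled` is a made-up name.

```python
from pathlib import Path

# Module parameter file as exposed by Linux sysfs when vfio_iommu_type1 is loaded.
PARAM_FILE = Path("/sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts")


def unsafe_interrupts_enabled(param_file=PARAM_FILE):
    """Return True/False for the parameter value, or None if the file is missing."""
    try:
        value = Path(param_file).read_text().strip()
    except FileNotFoundError:
        return None  # module not loaded, or not a VFIO-capable kernel
    # Boolean module parameters are printed as "Y" or "N".
    return value in ("Y", "1")


if __name__ == "__main__":
    print("allow_unsafe_interrupts:", unsafe_interrupts_enabled())
```

Running it on the Unraid host before and after changing the VM setting (and rebooting) would show whether the option actually took effect.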

Edited by suRe


I have an Ubuntu VM on i440fx-2.10 with SeaBIOS and a Windows 10 VM on i440fx-2.9 with SeaBIOS. Both of them can use the AMD drivers with no issues.

10 hours ago, gelmi said:

I have an Ubuntu VM on i440fx-2.10 with SeaBIOS and a Windows 10 VM on i440fx-2.9 with SeaBIOS. Both of them can use the AMD drivers with no issues.

Might be due to SeaBIOS; I never actually tested that. I can only speak for myself: i440fx with OVMF never worked for me.

Edited by suRe
