need help with RX Vega GPU pass-through (single GPU)


fae76


Hi
I just bought hardware to replace my 8-year-old machine. So far I have used a RAID controller running RAID 5. Before investing in a new RAID controller, I would like to try Unraid. My question is therefore: does the current stable version of Unraid support my new hardware?

Hardware:
----------------
- Asus Crosshair VI Hero
- Ryzen 7
- 32GB DDR4
- AMD Radeon Vega 64
- 2x 500GB SATA3 SSD for caching (mirrored)
- 4x 8TB WD-red for array (parity)
- 1x NVMe SSD for Win10 VM

Use Cases:
-------------------
- NAS
- run dockers
- run Win10 virtual machine with hardware pass-through.

Hardware pass-through:
---------------------------------------
- I only have 1 GPU and I would like to pass this one through to the Win10 virtual machine
- mouse & keyboard pass-through
- onboard audio pass-through
- NVMe pass-through
- USB pass-through (at least some ports)

General questions:
------------------------------
- Can I install and run Unraid with an AMD Vega GPU? I have read that proper Vega support requires kernel 4.15. Does this mean that I cannot use Unraid with my hardware until Unraid includes kernel 4.15? (Sorry for the noob question; I haven't touched Linux for years, and the last time I did I had to give up because my hardware was not yet supported.)
- Does passing through the only GPU in the system work with AMD cards, especially Vega? I only found hacks for NVIDIA cards that require dumping the vBIOS. Can I do the same with AMD cards?


Thank you for reading this far. Any help is appreciated.

Edited by fae76
Link to comment

First, I do not run any VMs in my setup. However, as you have already surmised, unRAID does have some hardware requirements that have to be met. There are a number of folks who are running a Ryzen 1700 with VMs, so you should be all right on that score. You might want to read this portion of the unRAID manual, which discusses VMs:

 

        http://lime-technology.com/wiki/index.php/UnRAID_Manual_6#Using_Virtual_Machines

 

Be sure to check the spreadsheet of user-tested configurations, found in this portion of the manual, to see whether it lists anything about your choice of hardware:

 

      http://lime-technology.com/wiki/index.php/UnRAID_Manual_6#Assigning_Graphics_Devices_to_Virtual_Machines_.28GPU_Pass_Through.29

 

Good Luck

Edited by Frank1940
Link to comment

Thanks Frank

 

I have already read and searched the forums quite a bit and watched a bunch of tutorial videos. My main concern is Vega: I can't find any posts about someone running Unraid with Vega, and I also can't find anything about single AMD GPU pass-through. I checked the spreadsheet before posting today and found nothing about Vega, neither positive nor negative. There is one row mentioning my motherboard, but unfortunately the important cells are left empty.

 

I know Vega pass-through is possible with KVM: https://forum.level1techs.com/t/threadripper-gpu-passthrough-working-with-vega/120594. This has been done on Fedora with some kernel hacks though. So my question remains, does Vega work with Unraid?

Edited by fae76
Link to comment

You could edit your first post and change the thread title to indicate that you are wanting info on passing your AMD Vega card through to a VM.

 

EDIT: You could also set up an unRAID system using a 30-day trial license and see what the results are. (I seem to recall that you can get an extension for that license after the 30 days are up.)

Edited by Frank1940
Link to comment

I have everything up and running except for GPU passthrough. As long as I start my Win10 VM with VNC graphics, everything works fine. I can pass through my NVMe drive as well as onboard audio, mouse and keyboard. Terminating the VM connects the mouse and keyboard back to Unraid.

With GPU passthrough enabled my screen just freezes. Whatever image was on there stays. In order to stop the VM I have to "force stop" it.

Below are my IOMMU groups and the VM XML file. What else do you need me to provide in order to help me?
Thank you very much!!!


IOMMU group 0: [1022:1452] 00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
IOMMU group 1: [1022:1453] 00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
IOMMU group 2: [1022:1453] 00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
IOMMU group 3: [1022:1452] 00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
IOMMU group 4: [1022:1452] 00:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
IOMMU group 5: [1022:1453] 00:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge
IOMMU group 6: [1022:1452] 00:04.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
IOMMU group 7: [1022:1452] 00:07.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
IOMMU group 8: [1022:1454] 00:07.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
IOMMU group 9: [1022:1452] 00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge
IOMMU group 10: [1022:1454] 00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B
IOMMU group 11: [1022:790b] 00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 59)
[1022:790e] 00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
IOMMU group 12: [1022:1460] 00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0
[1022:1461] 00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1
[1022:1462] 00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2
[1022:1463] 00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3
[1022:1464] 00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4
[1022:1465] 00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5
[1022:1466] 00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6
[1022:1467] 00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7
IOMMU group 13: [144d:a804] 01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
IOMMU group 14: [1022:43b9] 02:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 43b9 (rev 02)
[1022:43b5] 02:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] Device 43b5 (rev 02)
[1022:43b0] 02:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43b0 (rev 02)
[1022:43b4] 03:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
[1022:43b4] 03:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
[1022:43b4] 03:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
[1022:43b4] 03:04.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
[1022:43b4] 03:05.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
[1022:43b4] 03:06.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
[1022:43b4] 03:07.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port (rev 02)
[1b21:1343] 04:00.0 USB controller: ASMedia Technology Inc. Device 1343
[8086:1539] 05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
IOMMU group 15: [1022:1470] 0b:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1470 (rev c1)
IOMMU group 16: [1022:1471] 0c:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1471
IOMMU group 17: [1002:687f] 0d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] (rev c1)
IOMMU group 18: [1002:aaf8] 0d:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf8
IOMMU group 19: [1022:145a] 0e:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device 145a
IOMMU group 20: [1022:1456] 0e:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor
IOMMU group 21: [1022:145c] 0e:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller
IOMMU group 22: [1022:1455] 0f:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device 1455
IOMMU group 23: [1022:7901] 0f:00.2 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
IOMMU group 24: [1022:1457] 0f:00.3 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller
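
Whether a device can be handed to a VM on its own depends on what else shares its IOMMU group, so a listing like the one above can be checked mechanically. The following is a minimal Python sketch, assuming the listing is available as text in exactly the format shown; the names `parse_iommu_groups` and `group_mates` are made up for illustration.

```python
import re
from collections import defaultdict

# Optional "IOMMU group N:" prefix, then "[vendor:device] bus:dev.fn description".
LINE_RE = re.compile(
    r"^(?:IOMMU group (?P<group>\d+):\s*)?"
    r"\[(?P<ids>[0-9a-f]{4}:[0-9a-f]{4})\]\s+"
    r"(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+(?P<desc>.*)$"
)


def parse_iommu_groups(text):
    """Map each group number to a list of (address, vendor:device, description)."""
    groups = defaultdict(list)
    current = None
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # not a device line
        if match.group("group") is not None:
            current = int(match.group("group"))
        if current is None:
            continue  # device line seen before any group header
        groups[current].append(
            (match.group("addr"), match.group("ids"), match.group("desc"))
        )
    return dict(groups)


def group_mates(groups, address):
    """Return (group number, other addresses in that group), or None if not found."""
    for number, devices in groups.items():
        addresses = [addr for addr, _, _ in devices]
        if address in addresses:
            return number, [a for a in addresses if a != address]
    return None


# Excerpt copied from the listing above.
EXCERPT = """\
IOMMU group 14: [1022:43b9] 02:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] Device 43b9 (rev 02)
[1022:43b5] 02:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] Device 43b5 (rev 02)
[8086:1539] 05:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
IOMMU group 17: [1002:687f] 0d:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] (rev c1)
IOMMU group 18: [1002:aaf8] 0d:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aaf8
"""

groups = parse_iommu_groups(EXCERPT)
print(group_mates(groups, "0d:00.0"))  # the Vega 64 video function
print(group_mates(groups, "05:00.0"))  # the Intel NIC
```

In the full listing, the Vega's video and audio functions each sit alone (groups 17 and 18), while the ASMedia USB controller and the Intel NIC share group 14 with the chipset devices.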


Win10 VM config:

win10vm.txt


VM logs:

2018-02-03 20:43:08.889+0000: starting up libvirt version: 3.8.0, qemu version: 2.10.2, hostname: RedStoneTower
LC_ALL=C PATH=/bin:/sbin:/usr/bin:/usr/sbin HOME=/ QEMU_AUDIO_DRV=none /usr/local/sbin/qemu -name 'guest=Win10 - Felix,debug-threads=on' -S -object 'secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-Win10 - Felix/master-key.aes' -machine pc-i440fx-2.10,accel=kvm,usb=off,dump-guest-core=off,mem-merge=off -cpu host,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff,hv_vendor_id=none -drive file=/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd,if=pflash,format=raw,unit=0,readonly=on -drive file=/etc/libvirt/qemu/nvram/c0b11a24-0155-0b78-9ea3-2ff17db22c8e_VARS-pure-efi.fd,if=pflash,format=raw,unit=1 -m 20480 -realtime mlock=off -smp 8,sockets=1,cores=8,threads=1 -uuid c0b11a24-0155-0b78-9ea3-2ff17db22c8e -display none -no-user-config -nodefaults -chardev 'socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-Win10 - Felix/monitor.sock,server,nowait' -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-hpet -no-shutdown -boot strict=on -o-pci,host=0f:00.3,id=hostdev1,bus=pci.0,addr=0x5 -device vfio-pci,host=0d:00.1,id=hostdev2,bus=pci.0,addr=0x6 -device vfio-pci,host=01:00.0,id=hostdev3,bus=pci.0,addr=0x8 -device usb-host,hostbus=1,hostaddr=3,id=hostdev4,bus=usb.0,port=1 -device usb-host,hostbus=1,hostaddr=2,id=hostdev5,bus=usb.0,port=2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x9 -msg timestamp=on
2018-02-03 20:43:08.889+0000: Domain id=1 is tainted: high-privileges
2018-02-03 20:43:08.889+0000: Domain id=1 is tainted: host-cpu
2018-02-03T20:43:08.936893Z qemu-system-x86_64: -chardev pty,id=charserial0: char device redirected to /dev/pts/1 (label charserial0)
2018-02-03T20:43:11.670147Z qemu-system-x86_64: -device vfio-pci,host=0d:00.0,id=hostdev0,bus=pci.0,addr=0x4: Failed to mmap 0000:0d:00.0 BAR 0. Performance may be slow
2018-02-03T20:43:12.951965Z qemu-system-x86_64: vfio: Cannot reset device 0000:0f:00.3, depends on group 22 which is not owned.
2018-02-03T20:43:13.063715Z qemu-system-x86_64: vfio: Cannot reset device 0000:0f:00.3, depends on group 22 which is not owned.
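
The two `Cannot reset device` lines refer to the onboard HD audio controller at 0f:00.3 (group 24 in the listing above) and say that its reset depends on group 22, which holds 0f:00.0. A small Python sketch for pulling such pairs out of a VM log, assuming the message wording shown above (the function name `reset_dependencies` is illustrative only):

```python
import re

# Matches the QEMU warning exactly as it appears in the log above.
RESET_RE = re.compile(
    r"vfio: Cannot reset device "
    r"(?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]), "
    r"depends on group (?P<group>\d+) which is not owned"
)


def reset_dependencies(log_text):
    """Return a sorted list of unique (device, blocking group) pairs."""
    found = {
        (m.group("dev"), int(m.group("group")))
        for m in RESET_RE.finditer(log_text)
    }
    return sorted(found)
```

Each pair names a passed-through device and a group that the VM does not own but that the device's reset depends on.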

 

 

Edited by fae76
Link to comment

Yes, it is. I can't; I do not have another GPU.

There are multiple posts in this forum which state that this should work fine with AMD cards. NVIDIA cards need a vBIOS hack but can do it too.

 

Maybe I have to manually manipulate some config files in order to make this work, but I don't know where and what I have to add.

Link to comment

I am now booting Unraid in legacy mode. I also dumped my vBIOS using GPU-Z (after booting Win10 from an additional SSD) and also tried a vBIOS from TechPowerUp.

 

Now at least my screen flickers when I start the VM. However, the screen stays black afterwards. I'm not sure how to verify whether the dumped vBIOS is working. According to the how-to video from Spaceinvader, at least for NVIDIA cards it is necessary to edit the dumped file. I cannot find in my vBIOS the start of the real ROM file that he mentions in his video.
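
One way to sanity-check a dumped vBIOS is to look for the PCI expansion ROM header: an image starts with the bytes 0x55 0xAA, and the 16-bit value at offset 0x18 points to a data structure that begins with `PCIR`, followed by the vendor and device IDs. The Python sketch below scans a dump for the first offset where that holds. The function names and the file name in the usage note are hypothetical, and passing this check only shows that the header is well formed, not that the image will work in a VM.

```python
import struct

ROM_SIGNATURE = b"\x55\xaa"  # 0xAA55 stored little-endian
PCIR_SIGNATURE = b"PCIR"


def is_rom_image_at(data, offset):
    """True if a PCI expansion ROM header plausibly starts at `offset`."""
    if data[offset:offset + 2] != ROM_SIGNATURE:
        return False
    if offset + 0x1A > len(data):
        return False  # too short to contain the pointer at 0x18
    # Bytes 0x18-0x19 hold the offset of the PCI data structure ("PCIR").
    (pcir_ptr,) = struct.unpack_from("<H", data, offset + 0x18)
    pcir = offset + pcir_ptr
    return data[pcir:pcir + 4] == PCIR_SIGNATURE


def find_rom_start(data):
    """Return the offset of the first plausible ROM image, or None."""
    pos = data.find(ROM_SIGNATURE)
    while pos != -1:
        if is_rom_image_at(data, pos):
            return pos
        pos = data.find(ROM_SIGNATURE, pos + 1)
    return None


def vendor_device(data, offset):
    """Return (vendor_id, device_id) of the image that starts at `offset`."""
    (pcir_ptr,) = struct.unpack_from("<H", data, offset + 0x18)
    # Vendor and device IDs follow the 4-byte "PCIR" signature.
    return struct.unpack_from("<HH", data, offset + pcir_ptr + 4)
```

For example, `find_rom_start(pathlib.Path("vega64.rom").read_bytes())` returning 0 would mean the file (name hypothetical) already starts at the ROM header; a larger value would be the number of extra bytes in front of it. For the card in this thread, `vendor_device` would be expected to report 0x1002 and 0x687f, matching the IOMMU listing.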

 

Any more hints and suggestions?

 

EDIT: The screen flicker already happens when I start Unraid in legacy mode. Without the vBIOS, it seems that at least one of the VMs I created with VNC is booting up (the HDD LED is flickering like crazy). As soon as I add the vBIOS, the HDD LED does not flicker anymore. I think the vBIOS is not valid and causes the VM to crash.

Edited by fae76
Link to comment

Replace your syslinux configuration with:

default menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label unRAID OS
  menu default
  kernel /bzimage
  append vfio-pci.ids=1002:687f,1002:aaf8 initrd=/bzroot
label unRAID OS GUI Mode
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui
label unRAID OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
label unRAID OS GUI Safe Mode (no plugins)
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui unraidsafemode
label Memtest86+
  kernel /memtest
 

Link to comment

default menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label unRAID OS
  menu default
  kernel /bzimage
  append vfio-pci.ids=1002:687f,1002:aaf8 vfio-pci.disable_vga=1 initrd=/bzroot
label unRAID OS GUI Mode
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui
label unRAID OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
label unRAID OS GUI Safe Mode (no plugins)
  kernel /bzimage
  append initrd=/bzroot,/bzroot-gui unraidsafemode
label Memtest86+
  kernel /memtest

Link to comment

Here are my current BIOS PCIe settings (sorry for the bad pictures from my cell phone).

 

I have also started to learn how to debug in this environment and have already learnt quite a bit. GPU passthrough still does not work, but I think I am getting a better understanding of what's going on.

 

For example syslog was full of these messages:

Quote

Feb  5 20:36:56 RedStoneTower kernel: pcieport 0000:00:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000b(Receiver ID)
Feb  5 20:36:56 RedStoneTower kernel: pcieport 0000:00:01.3:   device [1022:1453] error status/mask=00000040/00006000
Feb  5 20:36:56 RedStoneTower kernel: pcieport 0000:00:01.3:    [ 6] Bad TLP              
Feb  5 20:36:57 RedStoneTower kernel: pcieport 0000:00:01.3: AER: Corrected error received: id=0000

After downgrading my PCIe link from 3.0 to 2.0, they disappeared. I'm not sure yet what's going on, but I suspect that either my mainboard has a hardware or BIOS bug (I have the newest BIOS installed) or my Vega 64 has a problem. Or maybe my motherboard applies some overclock to PCIe. Either way, the change does not seem to affect GPU passthrough.
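
To check whether a change such as the PCIe link speed really makes these messages go away, it can help to count them per port before and after. A minimal Python sketch, assuming syslog lines in the format quoted above (`count_aer_errors` is a made-up name):

```python
import re
from collections import Counter

# Matches the "PCIe Bus Error" summary line the kernel logs for each AER event.
AER_RE = re.compile(
    r"pcieport (?P<port>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<type>[^,]+),"
)


def count_aer_errors(syslog_text):
    """Count AER events keyed by (port, severity, error type)."""
    counts = Counter()
    for m in AER_RE.finditer(syslog_text):
        counts[(m.group("port"), m.group("severity"), m.group("type"))] += 1
    return counts
```

Comparing the counts from a syslog captured at PCIe 3.0 with one captured at 2.0 over a similar period would show whether the errors on port 00:01.3 actually stopped or only became less frequent.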

 

The next thing I tracked down were these messages:

Quote

Feb  5 22:24:55 RedStoneTower kernel: vfio-pci 0000:0d:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
Feb  5 22:24:56 RedStoneTower kernel: vfio-pci 0000:0d:00.0: BAR 0: can't reserve [mem 0xe0000000-0xefffffff 64bit pref]
Feb  5 22:24:59 RedStoneTower kernel: vfio-pci 0000:0d:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
Feb  5 22:24:59 RedStoneTower kernel: vfio-pci 0000:0d:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff

Adding the vBIOS ROM file solves this issue.

Attaching multiple GPUs to the VM (VNC as primary, Vega as secondary) does boot my Linux VM; the GPU, however, is not passed through (not claimed by KVM, not visible in the VM). Starting the same VM with the ROM file option results in a screen flicker and a black screen, and the VM either does not boot or crashes very early in the boot cycle (no SSH, no ping, but at least the GPU gets claimed by KVM). Unfortunately I do not yet know how to follow up on and debug this hint. My gut feeling is that there is some kind of memory mapping or addressing issue.

For all the tests above I went back to the stock syslinux.cfg, since the modifications you suggested so far did not seem to have any effect.

 

The next thing I have to report is host crashes. Since I started with Unraid, my host has crashed at least 4 times already: not just the web interface, but SSH and ping stopped working as well, and I had to do a hard reset. I cannot find any clues at all as to why this is happening; even after I started logging syslog to my USB stick, it shows no hints. There was no load on the CPU or GPU. In each case only the parity sync was running, along with 2 or 3 SSH connections.

During the weekend I was stress-testing CPU, RAM and GPU simultaneously using AIDA64's stability test under Win10 for 24h without any issues. CPU and GPU and all the other sensors reported max temps under 60°C during this 24h period so cooling is working as well.

 

Booting Linux Mint from a USB stick in UEFI mode worked flawlessly, even though the image I used was quite old and still on a 4.4 kernel. Mint booted to the GUI (no driver claimed my Vega, since no AMD drivers were installed). This test shows that a Vega card does not necessarily need a 4.15 kernel with full Vega support for the host to boot and show a desktop.

 

For now I'm out of ideas how to proceed. There seem to be multiple issues and only a few clues if at all, so any input is appreciated.

 

IMG_20180205_190714.jpg

IMG_20180205_190729.jpg

IMG_20180205_190744.jpg

IMG_20180205_190818.jpg

Edited by fae76
Link to comment
