Error when trying to pass through Vega 64 🤨



Hey all,

 

So I just picked up an ASUS ROG STRIX Radeon Vega 64 to replace my older Gigabyte GTX970 Mini in my UnRAID 6.7.2 build.  I specifically went this route because I use MacOS extensively and wanted a natively supported card for that VM, so I wouldn't have to deal with the stupidities of getting an NVidia card supported in MacOS. (After a few hardware changes that altered the PCI Express port mappings, I was unable to get the GTX970 recognized in MacOS, although it came up without any issue and was fully functional in my Windows 10 VM with the same XML settings for the card.)

I swapped the cards, and after the reboot everything seemed to be working fine, both in the BIOS and in both the command-line and GUI versions of UnRAID. Before starting the VMs I changed the PCI bus in the XML from 7 to 9 to match the new mapping and removed the vBIOS injection I had been using to make the GTX card work in a VM. But when I start either VM, it hangs right after the VM shows its green triangle: the monitor connected to the Vega 64 never loses UnRAID's interface, and one of the assigned logical CPU cores is pinned.  Looking in the VM logs I am getting:

2020-02-05T22:23:04.260722Z qemu-system-x86_64: vfio_err_notifier_handler(0000:09:00.1) Unrecoverable error detected. Please collect any data possible and then kill the guest
2020-02-05T22:23:04.260795Z qemu-system-x86_64: vfio_err_notifier_handler(0000:09:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest

I need help, people!!! I knew I would have to do a little work on MacOS to remove Clover's forcing of the NVidia driver, etc., but I expected that, as on bare-metal hardware, Windows 10 would boot right up as long as I updated the PCI assignment from bus 7 to 9!  I've included my Win10 VM XML below and attached the diagnostics from two sessions: last night's extensive attempts with both VMs, and today's reboot, where I mainly tried to start my Win10 VM.

 

PLEASE HELP! 😝

<?xml version='1.0' encoding='UTF-8'?>
<domain type='kvm'>
  <name>Backblaze</name>
  <uuid>44cda2aa-66af-a307-7f6a-232c3dc374fd</uuid>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="/mnt/user/domains/Backblaze/backblaze.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>58720256</memory>
  <currentMemory unit='KiB'>58720256</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>20</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='2'/>
    <vcpupin vcpu='1' cpuset='14'/>
    <vcpupin vcpu='2' cpuset='3'/>
    <vcpupin vcpu='3' cpuset='15'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='5' cpuset='16'/>
    <vcpupin vcpu='6' cpuset='5'/>
    <vcpupin vcpu='7' cpuset='17'/>
    <vcpupin vcpu='8' cpuset='6'/>
    <vcpupin vcpu='9' cpuset='18'/>
    <vcpupin vcpu='10' cpuset='7'/>
    <vcpupin vcpu='11' cpuset='19'/>
    <vcpupin vcpu='12' cpuset='8'/>
    <vcpupin vcpu='13' cpuset='20'/>
    <vcpupin vcpu='14' cpuset='9'/>
    <vcpupin vcpu='15' cpuset='21'/>
    <vcpupin vcpu='16' cpuset='10'/>
    <vcpupin vcpu='17' cpuset='22'/>
    <vcpupin vcpu='18' cpuset='11'/>
    <vcpupin vcpu='19' cpuset='23'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-i440fx-3.1'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/44cda2aa-66af-a307-7f6a-232c3dc374fd_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vendor_id state='on' value='none'/>
    </hyperv>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='20' threads='1'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='hypervclock' present='yes'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/user/domains/Backblaze/vdisk1.img'/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/user/isos/Windows10_Install.iso'/>
      <target dev='hda' bus='ide'/>
      <readonly/>
      <boot order='2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/user/isos/virtio-win-0.1.160-1.iso'/>
      <target dev='hdb' bus='ide'/>
      <readonly/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='ide' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x2'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:60:31:97'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0' multifunction='on'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x09' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x1'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <memballoon model='none'/>
  </devices>
</domain>
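
For anyone comparing the XML above with the error lines, here is a minimal Python sketch (not part of the original post) that lists the host PCI addresses a libvirt domain XML hands to the guest through `<hostdev>` entries. The function name `hostdev_pci_addresses` and the trimmed `SAMPLE` string are illustrative assumptions; only the standard-library `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET


def hostdev_pci_addresses(domain_xml: str) -> list[str]:
    """Return host PCI addresses (dddd:bb:ss.f) of every PCI <hostdev> in a domain XML."""
    root = ET.fromstring(domain_xml)
    addresses = []
    for hostdev in root.findall("./devices/hostdev[@type='pci']"):
        # The <source><address/> child is the HOST side; the sibling
        # <address type='pci'/> is where the device appears inside the guest.
        src = hostdev.find("./source/address")
        if src is None:
            continue
        domain = int(src.get("domain", "0x0000"), 16)
        bus = int(src.get("bus"), 16)
        slot = int(src.get("slot"), 16)
        function = int(src.get("function"), 16)
        addresses.append(f"{domain:04x}:{bus:02x}:{slot:02x}.{function:x}")
    return addresses


# Trimmed-down sample holding only the two Vega 64 functions from the XML above.
SAMPLE = """
<domain type='kvm'>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x09' slot='0x00' function='0x1'/>
      </source>
    </hostdev>
  </devices>
</domain>
"""

print(hostdev_pci_addresses(SAMPLE))  # ['0000:09:00.0', '0000:09:00.1']
```

The printed addresses use the same `0000:09:00.x` form that appears in the `vfio_err_notifier_handler` lines, which makes it easy to confirm that the XML and the log refer to the same devices.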

coultonstudios-diagnostics-20200205-2114.zip coultonstudios-diagnostics-20200205-2224.zip

Edited by TheTechnoPilot
Added VM XML
Link to comment

If I follow correctly, when you start the Win10 VM the monitor output doesn't change from Unraid. It sounds like vfio is having trouble grabbing control of the GPU from Unraid.

Add "video=efifb:off" as a boot flag on the append line of your syslinux.cfg. This disables the EFI framebuffer in Unraid, making the GPU fully available. It also disables video output after the bootloader screen, so there is no local GUI; you will need to launch your VM from the webGUI on another machine.

If you want to be able to use GUI mode, I believe you will need 2 GPUs. GUI mode is only recommended for fixing networking issues if you can't access the webGUI, so setting efifb:off shouldn't be a loss.
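
As a rough illustration of where the boot flag suggested above ends up, here is a small Python sketch that adds a flag to the `append` lines of a syslinux.cfg. The helper name `add_boot_flag` and the `SAMPLE_CFG` contents (labels, ACS override value) are assumptions made for the example, not taken from anyone's actual flash drive; check your own file before changing it.

```python
def add_boot_flag(syslinux_cfg: str, flag: str) -> str:
    """Add `flag` to every `append` line of a syslinux.cfg that does not have it yet."""
    out = []
    for line in syslinux_cfg.splitlines():
        tokens = line.split()
        # Kernel parameters live on the `append` line of each boot entry.
        if tokens and tokens[0].lower() == "append" and flag not in tokens[1:]:
            line = f"{line.rstrip()} {flag}"
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical config excerpt, for illustration only.
SAMPLE_CFG = """\
label Unraid OS
  menu default
  kernel /bzimage
  append pcie_acs_override=downstream initrd=/bzroot
label Unraid OS GUI Mode
  kernel /bzimage
  append pcie_acs_override=downstream initrd=/bzroot,/bzroot-gui
"""

print(add_boot_flag(SAMPLE_CFG, "video=efifb:off"))
```

The point of the sketch is that the flag sits on the same `append` line as the other kernel parameters of the entry that is actually booted; a flag placed on a line of its own, or under a label that is not selected at boot, would not reach the kernel.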

Link to comment
9 minutes ago, Skitals said:

If I follow correctly, when you start the Win10 VM the monitor output doesn't change from Unraid. It sounds like vfio is having trouble grabbing control of the GPU from Unraid.

Add "video=efifb:off" as a boot flag on the append line of your syslinux.cfg. This disables the EFI framebuffer in Unraid, making the GPU fully available. It also disables video output after the bootloader screen, so there is no local GUI; you will need to launch your VM from the webGUI on another machine.

If you want to be able to use GUI mode, I believe you will need 2 GPUs. GUI mode is only recommended for fixing networking issues if you can't access the webGUI, so setting efifb:off shouldn't be a loss.

You are 100% correct, and I have no need or desire to run UnRAID in GUI mode; it was just to confirm the new card was functioning without issue.  Normally I run it without, and the only loss I feel from doing this is not seeing the IP address on bootup when working on a new network (I use it for work on the sets of feature films).

I'll give this a try. Would you say, though, that this behaviour when grabbing control is graphics-card dependent? It had no issue previously when using the GTX970.

Link to comment
18 minutes ago, TheTechnoPilot said:

You are 100% correct, and I have no need or desire to run UnRAID in GUI mode; it was just to confirm the new card was functioning without issue.  Normally I run it without, and the only loss I feel from doing this is not seeing the IP address on bootup when working on a new network (I use it for work on the sets of feature films).

I'll give this a try. Would you say, though, that this behaviour when grabbing control is graphics-card dependent? It had no issue previously when using the GTX970.

I can't say for certain. I required "video=efifb:off" even with a single NVIDIA card on my setup. I use two GPUs now and only pass through my non-primary GPU, so it's no longer needed for my setup.

Link to comment
1 hour ago, Skitals said:

I can't say for certain. I required "video=efifb:off" even with a single NVIDIA card on my setup. I use two GPUs now and only pass through my non-primary GPU, so it's no longer needed for my setup.

So, interestingly, adding that flag after my ACS overrides didn't disable video during loading on reboot.  Am I using it right?  When I tried to start the VM again at that point, it did yank the card from UnRAID's command line, but Win10 didn't seem to grab it, and the boot stalled with the same error and a single stuck, pinned logical core out of the 10 pairs assigned.

Link to comment
11 hours ago, TheTechnoPilot said:

So, interestingly, adding that flag after my ACS overrides didn't disable video during loading on reboot.  Am I using it right?  When I tried to start the VM again at that point, it did yank the card from UnRAID's command line, but Win10 didn't seem to grab it, and the boot stalled with the same error and a single stuck, pinned logical core out of the 10 pairs assigned.

Try "video=vesafb:off,efifb:off"
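
One way to confirm whether a `video=` flag actually reached the running kernel is to inspect the kernel command line, which Linux exposes in `/proc/cmdline`. The Python sketch below is only an illustration: `video_options` is a made-up helper name, and `sample` is a hypothetical command line rather than output from this server. It reports what is present on the command line and says nothing about how the kernel interprets the value.

```python
def video_options(cmdline: str) -> list[str]:
    """Return the value of every `video=` token found on a kernel command line."""
    return [
        token.split("=", 1)[1]
        for token in cmdline.split()
        if token.startswith("video=")
    ]


# Hypothetical command line; on a live system, read the text of /proc/cmdline instead.
sample = "BOOT_IMAGE=/bzimage pcie_acs_override=downstream video=efifb:off initrd=/bzroot"

print(video_options(sample))  # ['efifb:off']
```

An empty list after a reboot would suggest the flag never made it onto the `append` line of the boot entry that was used.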

 

 

Link to comment

So sadly, neither of those, nor directly binding the card to the VFIO driver at boot by passing its hardware IDs, has had any effect on the boot error... :(

 

@SpaceInvaderOne is there any chance you might be able to chime in with any thoughts?

 

The VM boot log:

-chardev socket,id=charmonitor,fd=24,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=localtime \
-no-hpet \
-no-shutdown \
-boot strict=on \
-device pcie-root-port,port=0x9,chassis=1,id=pci.1,bus=pcie.0,addr=0x1.0x1 \
-device pcie-root-port,port=0xa,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x2 \
-device pcie-root-port,port=0xb,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x3 \
-device pcie-root-port,port=0x13,chassis=4,id=pci.4,bus=pcie.0,addr=0x2.0x3 \
-device pcie-root-port,port=0x14,chassis=5,id=pci.5,bus=pcie.0,addr=0x2.0x4 \
-device pcie-root-port,port=0x8,chassis=6,id=pci.6,bus=pcie.0,multifunction=on,addr=0x1 \
-device ich9-usb-ehci1,id=usb,bus=pcie.0,addr=0x7.0x7 \
-device ich9-usb-uhci1,masterbus=usb.0,firstport=0,bus=pcie.0,multifunction=on,addr=0x7 \
-device ich9-usb-uhci2,masterbus=usb.0,firstport=2,bus=pcie.0,addr=0x7.0x1 \
-device ich9-usb-uhci3,masterbus=usb.0,firstport=4,bus=pcie.0,addr=0x7.0x2 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.2,addr=0x0 \
-drive 'file=/mnt/user/domains/Windows 10/vdisk2.img,format=raw,if=none,id=drive-virtio-disk2,cache=writeback' \
-device virtio-blk-pci,scsi=off,bus=pci.3,addr=0x0,drive=drive-virtio-disk2,id=virtio-disk2,bootindex=1,write-cache=on \
-drive file=/mnt/user/isos/Windows10_Install.iso,format=raw,if=none,id=drive-sata0-0-0,readonly=on \
-device ide-cd,bus=ide.0,drive=drive-sata0-0-0,id=sata0-0-0,bootindex=2 \
-drive file=/mnt/user/isos/virtio-win-0.1.160-1.iso,format=raw,if=none,id=drive-sata0-0-1,readonly=on \
-device ide-cd,bus=ide.1,drive=drive-sata0-0-1,id=sata0-0-1 \
-netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 \
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:e4:5e:83,bus=pci.1,addr=0x0 \
-chardev pty,id=charserial0 \
-device isa-serial,chardev=charserial0,id=serial0 \
-chardev socket,id=charchannel0,fd=29,server,nowait \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0 \
-device usb-tablet,id=input0,bus=usb.0,port=1 \
-device vfio-pci,host=09:00.0,id=hostdev0,bus=pci.4,addr=0x0 \
-device vfio-pci,host=09:00.1,id=hostdev1,bus=pci.5,addr=0x0 \
-device usb-host,hostbus=1,hostaddr=2,id=hostdev2,bus=usb.0,port=2 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
2020-02-10 00:28:41.371+0000: Domain id=1 is tainted: high-privileges
2020-02-10 00:28:41.371+0000: Domain id=1 is tainted: host-cpu
char device redirected to /dev/pts/0 (label charserial0)
2020-02-10T00:28:43.974057Z qemu-system-x86_64: vfio_err_notifier_handler(0000:09:00.1) Unrecoverable error detected. Please collect any data possible and then kill the guest
2020-02-10T00:28:43.974162Z qemu-system-x86_64: vfio_err_notifier_handler(0000:09:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest
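
When the log is long, a short script can pull out which devices QEMU is complaining about. This is a minimal Python sketch under the assumption that the error lines keep the `vfio_err_notifier_handler(dddd:bb:ss.f)` shape shown above; the names `VFIO_ERR` and `failing_devices` are invented for the example.

```python
import re

# Matches the PCI address inside vfio_err_notifier_handler(...), e.g. 0000:09:00.1
VFIO_ERR = re.compile(
    r"vfio_err_notifier_handler\("
    r"(?P<addr>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\)"
)


def failing_devices(log_text: str) -> list[str]:
    """Return the distinct PCI addresses named in vfio error-notifier lines, sorted."""
    return sorted({match.group("addr") for match in VFIO_ERR.finditer(log_text)})


# The two error lines from the boot log above.
LOG = """\
2020-02-10T00:28:43.974057Z qemu-system-x86_64: vfio_err_notifier_handler(0000:09:00.1) Unrecoverable error detected. Please collect any data possible and then kill the guest
2020-02-10T00:28:43.974162Z qemu-system-x86_64: vfio_err_notifier_handler(0000:09:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest
"""

print(failing_devices(LOG))  # ['0000:09:00.0', '0000:09:00.1']
```

Both functions of the card at `09:00` show up, which matches the two `vfio-pci` devices on the QEMU command line.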

Also see the attached log from my last boot.

coultonstudios-diagnostics-20200210-0034.zip

Link to comment
