Release GPU After VM Crash



My VMs crash under load because my CPU sucks; I'm living with it until my new motherboard arrives in a few weeks. Whenever my Windows 10 VM crashes, the Nvidia GTX 1070 that I have passed through to it stays locked, and the VM won't boot back up, failing with this error:

 

root@unRAID:~# virsh start Windows\ 10
error: Failed to start domain Windows 10
error: internal error: process exited while connecting to monitor: 2017-08-19T09:14:20.728766Z qemu-system-x86_64: -chardev pty,id=charserial0: char device redirected to /dev/pts/1 (label charserial0)
2017-08-19T09:14:20.810864Z qemu-system-x86_64: -device vfio-pci,host=04:00.0,id=hostdev0,bus=pci.0,addr=0x5: vfio error: 0000:04:00.0: failed to open /dev/vfio/25: Device or resource busy

 

If I choose VNC as my video output, it starts fine. A reboot of unRAID also fixes the issue, but I'd rather not reboot the whole server every time the VM crashes when everything else is working well.
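The 25 in /dev/vfio/25 is the IOMMU group number, and "Device or resource busy" on that node generally means something still has the group open. A small sketch for listing which PCI devices sit in a given group via sysfs; the function name iommu_group_devices and its optional sysfs-root argument are hypothetical, added only so the function can be pointed at a test directory:

```shell
# List the PCI addresses that belong to one IOMMU group.
# Usage: iommu_group_devices GROUP [SYSFS_ROOT]
# SYSFS_ROOT defaults to /sys (the second argument is an assumption for testing).
iommu_group_devices() {
    local group="$1"
    local root="${2:-/sys}"
    local dir="$root/kernel/iommu_groups/$group/devices"
    local dev

    if [ ! -d "$dir" ]; then
        echo "no such IOMMU group: $group" >&2
        return 1
    fi

    # Each entry in the devices directory is named after a PCI address.
    for dev in "$dir"/*; do
        [ -e "$dev" ] || continue
        basename "$dev"
    done
}
```

On the system above, `iommu_group_devices 25` would be expected to include 0000:04:00.0, since that is the device the error message names.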

 

I found this set of commands for the same problem over on the RedHat forums, but the last one doesn't work for me and just fails with "-bash: echo: write error: No such device". Here is the quoted post:

 

That was exactly what was going wrong. efifb had attached to some of the nvidia device's memory.
Since efifb can't be compiled as a module, and I'd rather not turn it off, here's what I did:

	echo 0 > /sys/class/vtconsole/vtcon0/bind
	echo 0 > /sys/class/vtconsole/vtcon1/bind
	echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind

This completely solves the problem and all is well doing passthrough on my Skylake system.
Hopefully now that there's a solution with the right magic words in it on the Internet, others will find their answer here. Thanks again!
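The "No such device" error from that last command is what the kernel returns when efi-framebuffer.0 is not currently bound to the efi-framebuffer driver. One possible reason is that the host console is not running on efifb at all (for example, a legacy BIOS boot or a console on a different GPU). A minimal sketch that only attempts the unbind when the device is actually bound; the function name unbind_if_bound is hypothetical, and the device and driver names are the ones from the quoted post:

```shell
# Unbind a device from its driver only if it is currently bound to it.
# Usage: unbind_if_bound DEVICE DRIVER_DIR
# A bound device shows up as an entry inside the driver's sysfs directory.
unbind_if_bound() {
    local device="$1"
    local driver_dir="$2"

    if [ -e "$driver_dir/$device" ]; then
        echo "$device" > "$driver_dir/unbind"
    else
        echo "$device is not bound to $(basename "$driver_dir"); nothing to unbind" >&2
        return 1
    fi
}
```

It would be called as `unbind_if_bound efi-framebuffer.0 /sys/bus/platform/drivers/efi-framebuffer`. If it reports that the device is not bound, efifb is probably not what is holding the card.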
Edited by brando56894

Start by giving us all your specs.

What are the settings in your VM?
And are you passing through your only GPU? Or is unRAID using another GPU?
If my VM crashes I can launch it again just fine, but I have dumped the GPU VBIOS and pass it to the GPU using a directive in the XML file.

29 minutes ago, Ziggurat said:

Start by giving us all your specs.

 

Server: SuperMicro X10SDV-F-0 w/Xeon D-1540 (16x 2 GHz), 1.2 kW EVGA PSU, 2x 32 GB DDR4 ECC RAM

Pool: 5x HGST 4 TB HDDs

Cache: 1x 512 GB Samsung 840 Pro SATA SSD

 

29 minutes ago, Ziggurat said:

What are the settings in your VM?

 

<domain type='kvm' id='2'>
  <name>Windows 10</name>
  <uuid>f4914b40-ce13-7c85-09cf-1bbe740f2d41</uuid>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>16777216</memory>
  <currentMemory unit='KiB'>16777216</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.9'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/f4914b40-ce13-7c85-09cf-1bbe740f2d41_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vendor_id state='on' value='none'/>
    </hyperv>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='2' threads='2'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='hypervclock' present='yes'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/cache/domains/Windows 10/vdisk1.img'/>
      <backingStore/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <alias name='virtio-disk2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='nec-xhci'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:20:41:f6'/>
      <source bridge='br0'/>
      <target dev='vnet1'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/1'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/1'>
      <source path='/dev/pts/1'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/domain-2-Windows 10/org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'>
      <alias name='input0'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input1'/>
    </input>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x04' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='usb' managed='no'>
      <source>
        <vendor id='0x046d'/>
        <product id='0xc22a'/>
        <address bus='3' device='6'/>
      </source>
      <alias name='hostdev2'/>
      <address type='usb' bus='0' port='1'/>
    </hostdev>
    <hostdev mode='subsystem' type='usb' managed='no'>
      <source>
        <vendor id='0x046d'/>
        <product id='0xc22b'/>
        <address bus='3' device='4'/>
      </source>
      <alias name='hostdev3'/>
      <address type='usb' bus='0' port='2'/>
    </hostdev>
    <hostdev mode='subsystem' type='usb' managed='no'>
      <source>
        <vendor id='0x046d'/>
        <product id='0xc52b'/>
        <address bus='3' device='8'/>
      </source>
      <alias name='hostdev4'/>
      <address type='usb' bus='0' port='3'/>
    </hostdev>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='none' model='none'/>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+0:+100</label>
    <imagelabel>+0:+100</imagelabel>
  </seclabel>
</domain>

 

29 minutes ago, Ziggurat said:

And are you passing through your only GPU?

 

No. The BMC has a built-in ASPEED AST2400 GPU, which is only used for IPMI/console access, so the GTX is passed through to the Windows VM, which is my HTPC.

Edited by brando56894
  • 3 weeks later...
6 hours ago, brando56894 said:

Nope, buying a new card won't help, this is a software issue with either Linux or qemu/libvirt.

 

Damn, we have to figure this issue out!

 

Here are my specs and XML:

  • Supermicro 4U Server
    • CSE-846A-R1200B Chassis
    • X9DRI-F Motherboard
    • 2x E5-2670 2.6 GHz 8-Core, 8.0 GT/s, 20 MB Smart Cache CPUs
    • 8x 8 GB PC3-10600R Server Memory
    • 24x 3.5" Trays
    • SAS2-846EL1 Backplane
    • LSI 9207-8i
    • 2x 1200 W PSU

All VMs on SSD Cache (500GB)

 

<domain type='kvm' id='3'>
  <name>LibreELEC</name>
  <uuid>b0d00937-53ca-72ef-0e66-cb938ec10e09</uuid>
  <metadata>
    <vmtemplate xmlns="unraid" name="Linux" icon="libreelec.png" os="linux"/>
  </metadata>
  <memory unit='KiB'>2097152</memory>
  <currentMemory unit='KiB'>2097152</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='13'/>
    <vcpupin vcpu='1' cpuset='14'/>
    <vcpupin vcpu='2' cpuset='29'/>
    <vcpupin vcpu='3' cpuset='30'/>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-q35-2.7'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/b0d00937-53ca-72ef-0e66-cb938ec10e09_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='2' threads='2'/>
  </cpu>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='writeback'/>
      <source file='/mnt/user/domains/LibreELEC/vdisk2.img'/>
      <backingStore/>
      <target dev='hdc' bus='sata'/>
      <boot order='1'/>
      <alias name='sata0-0-2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='2'/>
    </disk>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <alias name='usb'/>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <alias name='usb'/>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <alias name='usb'/>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x2'/>
    </controller>
    <controller type='sata' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'>
      <alias name='pcie.0'/>
    </controller>
    <controller type='pci' index='1' model='dmi-to-pci-bridge'>
      <model name='i82801b11-bridge'/>
      <alias name='pci.1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1e' function='0x0'/>
    </controller>
    <controller type='pci' index='2' model='pci-bridge'>
      <model name='pci-bridge'/>
      <target chassisNr='2'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x02' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:85:00:be'/>
      <source bridge='br0'/>
      <target dev='vnet2'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/2'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/2'>
      <source path='/dev/pts/2'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/domain-3-LibreELEC/org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='disconnected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'>
      <alias name='input0'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input1'/>
    </input>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x82' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x03' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x82' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x04' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='usb' managed='no'>
      <source>
        <vendor id='0x0c45'/>
        <product id='0x5101'/>
        <address bus='2' device='4'/>
      </source>
      <alias name='hostdev2'/>
      <address type='usb' bus='0' port='1'/>
    </hostdev>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x05' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='none' model='none'/>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+0:+100</label>
    <imagelabel>+0:+100</imagelabel>
  </seclabel>
</domain>

 

Edited by CrimsonTyphoon

After doing a little more research, it may be as simple as killing the qemu process that is hanging onto the device. My VM crashed last night, but I hadn't found this yet, so I haven't had a chance to test it. My hung device is /dev/vfio/25, and I don't know why I didn't think of this before, but lsof shows which process is using the device, which in this case is qemu:

 

root@unRAID:~# lsof  /dev/vfio/25
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
qemu-syst 5388 root   24u   CHR  251,0      0t0 97425 /dev/vfio/25

 

So if that process still exists after the VM has crashed and been shut down, a simple kill -9 5388 (the PID from the lsof output) should release the device and allow the VM to be restarted, since in theory nothing else will be holding that device node.
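The lsof-then-kill idea can be sketched as two small shell functions. This is a rough sketch, not a tested fix. holders_of scans /proc for open file descriptors instead of calling lsof (where lsof is installed, `lsof -t /dev/vfio/25` prints the same PIDs), and terminate_pid tries SIGTERM before falling back to SIGKILL. Both function names, the optional /proc-root argument, and the 10-second default grace period are assumptions for illustration:

```shell
# Print the PIDs of processes that have a given path open.
# Usage: holders_of PATH [PROC_ROOT]   (PROC_ROOT defaults to /proc)
holders_of() {
    local target="$1"
    local proc="${2:-/proc}"
    local fd pid

    # Every open file descriptor is a symlink under /proc/PID/fd/.
    for fd in "$proc"/[0-9]*/fd/*; do
        if [ "$(readlink "$fd" 2>/dev/null)" = "$target" ]; then
            pid="${fd#"$proc"/}"   # strip the /proc prefix
            echo "${pid%%/*}"      # keep only the PID component
        fi
    done | sort -un
}

# Ask a process to exit, then force-kill it if it is still alive
# after GRACE seconds.
# Usage: terminate_pid PID [GRACE]
terminate_pid() {
    local pid="$1"
    local grace="${2:-10}"
    local waited=0

    # kill fails if the process is already gone (or we lack permission).
    kill -TERM "$pid" 2>/dev/null || return 0

    while [ "$waited" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        waited=$((waited + 1))
    done

    # Still present after the grace period: send SIGKILL.
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

For the case above this would be run as `for pid in $(holders_of /dev/vfio/25); do terminate_pid "$pid"; done`. Note that a process stuck in uninterruptible sleep will not act on SIGKILL until it wakes up, so this cannot be guaranteed to free the device.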

 

Give it a try the next time you experience a crash and let me know what happens.

 

I posted a thread about this on reddit since we're not getting any help here. I also found a similar thread there about this, though it isn't specific to Windows VMs: https://www.reddit.com/r/VFIO/comments/44f1oc/primary_gpu_hotplug/ (now that I see there is a VFIO subreddit, I'm going to cross-post there for more visibility).

 


Didn't work for me :-(

 

Summary:

  • LibreELEC VM
  • 9500GT Passthru
  • See sig for rig specs

 

With the VM off, there is nothing in /dev/vfio. With the VM on, there is.

 

Here is the error message in the console when I try to turn the VM back on (from unRAID console):

 

Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler
Shutting down cpus with NMI
Kernel Offset: disabled
Rebooting in 30 seconds..

It does not actually reboot.

 

I updated to the latest beta (rc8q) hoping it would help, but it did not. I also added the ROM file for the card's BIOS, but still nothing.

