Complete Unraid lockup when rebooting a VM



Hi all

 

Over the past few months I have been experiencing complete hard lockups of Unraid that force me to power cycle the server.  Each time, it happens as a direct result of attempting to reboot the same Windows 10 VM (via the shutdown menu inside the VM, not the web GUI).  The syslog is as follows:

 

May  4 11:17:41 SERVER kernel: mdcmd (552): spindown 10
May  4 12:25:40 SERVER kernel: mdcmd (553): spindown 0
May  4 13:22:59 SERVER kernel: mdcmd (554): spindown 10
May  4 13:23:00 SERVER kernel: mdcmd (555): spindown 5
May  4 13:52:18 SERVER kernel: mdcmd (556): spindown 9
May  4 16:04:17 SERVER kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
May  4 16:04:17 SERVER kernel: rcu:     24-...0: (1 GPs behind) idle=a12/1/0x4000000000000000 softirq=109480389/109480389 fqs=14466 
May  4 16:04:17 SERVER kernel: rcu:     (detected by 25, t=60002 jiffies, g=587531517, q=76713)
May  4 16:04:17 SERVER kernel: Sending NMI from CPU 25 to CPUs 24:
May  4 16:04:17 SERVER kernel: NMI backtrace for cpu 24
May  4 16:04:17 SERVER kernel: CPU: 24 PID: 307 Comm: CPU 1/KVM Tainted: P           O      4.19.107-Unraid #1
May  4 16:04:17 SERVER kernel: Hardware name: ASUSTek Computer INC. TS700-E7-RS8/Z9PE-D16 Series, BIOS 5601 06/11/2015
May  4 16:04:17 SERVER kernel: RIP: 0010:qi_submit_sync+0x154/0x2db
May  4 16:04:17 SERVER kernel: Code: 30 02 0f 84 40 01 00 00 4d 8b 96 b0 00 00 00 49 8b 42 10 83 3c 30 03 75 0b 41 bc f5 ff ff ff e9 27 01 00 00 49 8b 06 8b 48 34 <f6> c1 10 74 68 49 8b 06 8b 80 80 00 00 00 c1 f8 04 41 39 c3 75 57
May  4 16:04:17 SERVER kernel: RSP: 0018:ffffc90006a8bb50 EFLAGS: 00000093
May  4 16:04:17 SERVER kernel: RAX: ffffc9000001f000 RBX: 0000000000000100 RCX: 0000000000000000
May  4 16:04:17 SERVER kernel: RDX: 0000000000000001 RSI: 000000000000006c RDI: ffff88903f418a00
May  4 16:04:17 SERVER kernel: RBP: ffffc90006a8bba8 R08: 0000000000640000 R09: 0000000000000000
May  4 16:04:17 SERVER kernel: R10: ffff88903f418a00 R11: 000000000000001a R12: 00000000000001b0
May  4 16:04:17 SERVER kernel: R13: ffff88903f418a00 R14: ffff88903f40f400 R15: 0000000000000086
May  4 16:04:17 SERVER kernel: FS:  000014bc64dff700(0000) GS:ffff88a03fa00000(0000) knlGS:0000000000000000
May  4 16:04:17 SERVER kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May  4 16:04:17 SERVER kernel: CR2: ffffe4881907d478 CR3: 000000108a6ba005 CR4: 00000000001626e0
May  4 16:04:17 SERVER kernel: Call Trace:
May  4 16:04:17 SERVER kernel: modify_irte+0xf0/0x136
May  4 16:04:17 SERVER kernel: intel_irq_remapping_deactivate+0x2d/0x47
May  4 16:04:17 SERVER kernel: __irq_domain_deactivate_irq+0x27/0x33
May  4 16:04:17 SERVER kernel: irq_domain_deactivate_irq+0x15/0x22
May  4 16:04:17 SERVER kernel: __free_irq+0x1d8/0x238
May  4 16:04:17 SERVER kernel: free_irq+0x5d/0x75
May  4 16:04:17 SERVER kernel: vfio_msi_set_vector_signal+0x84/0x231
May  4 16:04:17 SERVER kernel: ? flush_workqueue+0x2bf/0x2e3
May  4 16:04:17 SERVER kernel: vfio_msi_set_block+0x6c/0xac
May  4 16:04:17 SERVER kernel: vfio_msi_disable+0x61/0xa0
May  4 16:04:17 SERVER kernel: vfio_pci_set_msi_trigger+0x44/0x230
May  4 16:04:17 SERVER kernel: ? pci_bus_read_config_word+0x44/0x66
May  4 16:04:17 SERVER kernel: vfio_pci_ioctl+0x52d/0x9a2
May  4 16:04:17 SERVER kernel: ? vfio_pci_config_rw+0x209/0x2a6
May  4 16:04:17 SERVER kernel: ? __seccomp_filter+0x39/0x1ed
May  4 16:04:17 SERVER kernel: vfs_ioctl+0x19/0x26
May  4 16:04:17 SERVER kernel: do_vfs_ioctl+0x533/0x55d
May  4 16:04:17 SERVER kernel: ksys_ioctl+0x37/0x56
May  4 16:04:17 SERVER kernel: __x64_sys_ioctl+0x11/0x14
May  4 16:04:17 SERVER kernel: do_syscall_64+0x57/0xf2
May  4 16:04:17 SERVER kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
May  4 16:04:17 SERVER kernel: RIP: 0033:0x14bc687a54b7
May  4 16:04:17 SERVER kernel: Code: 00 00 90 48 8b 05 d9 29 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a9 29 0d 00 f7 d8 64 89 01 48
May  4 16:04:17 SERVER kernel: RSP: 002b:000014bc64dfe2e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
May  4 16:04:17 SERVER kernel: RAX: ffffffffffffffda RBX: 000014b84150a200 RCX: 000014bc687a54b7
May  4 16:04:17 SERVER kernel: RDX: 000014bc64dfe2f0 RSI: 0000000000003b6e RDI: 000000000000004b
May  4 16:04:17 SERVER kernel: RBP: 000014b84150a200 R08: 000000000000006c R09: 00000000ffffff00
May  4 16:04:17 SERVER kernel: R10: 000014b82cd8406b R11: 0000000000000246 R12: 000000000000006a
May  4 16:04:17 SERVER kernel: R13: 0000000000000080 R14: 0000000000000002 R15: 000014b84150a200
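
To make the trace above easier to read, here is a minimal sketch (Python, standard library only) that pulls out the stalled host CPU, the task name, the faulting function and the reliable call-trace frames. The helper name `summarize_stall` and the shortened `SYSLOG` sample are illustrative assumptions, not part of Unraid or the kernel; the sample lines are copied from the log above.

```python
import re

# A shortened copy of the syslog excerpt above (illustrative sample only).
SYSLOG = """\
May  4 16:04:17 SERVER kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
May  4 16:04:17 SERVER kernel: NMI backtrace for cpu 24
May  4 16:04:17 SERVER kernel: CPU: 24 PID: 307 Comm: CPU 1/KVM Tainted: P           O      4.19.107-Unraid #1
May  4 16:04:17 SERVER kernel: RIP: 0010:qi_submit_sync+0x154/0x2db
May  4 16:04:17 SERVER kernel: Call Trace:
May  4 16:04:17 SERVER kernel: modify_irte+0xf0/0x136
May  4 16:04:17 SERVER kernel: intel_irq_remapping_deactivate+0x2d/0x47
May  4 16:04:17 SERVER kernel: ? flush_workqueue+0x2bf/0x2e3
May  4 16:04:17 SERVER kernel: vfio_msi_disable+0x61/0xa0
May  4 16:04:17 SERVER kernel: RIP: 0033:0x14bc687a54b7
"""

CPU_RE = re.compile(r"CPU: (\d+) PID: \d+ Comm: (.+?) (?:Tainted|Not tainted)")
RIP_RE = re.compile(r"RIP: [0-9a-f]{4}:([A-Za-z_]\w*)\+0x")
FRAME_RE = re.compile(r"^(\? )?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+$")


def summarize_stall(log):
    """Return the CPU, task name, kernel RIP symbol and call-trace frames."""
    summary = {"cpu": None, "comm": None, "rip": None, "frames": []}
    in_trace = False
    for line in log.splitlines():
        # Drop the "date host kernel: " prefix and keep only the message.
        msg = line.split("kernel: ", 1)[-1].strip()
        if in_trace:
            frame = FRAME_RE.match(msg)
            if frame:
                # Frames prefixed with "? " are unreliable guesses; skip them.
                if frame.group(1) is None:
                    summary["frames"].append(frame.group(2))
                continue
            in_trace = False  # first non-frame line ends the call trace
        if msg == "Call Trace:":
            in_trace = True
            continue
        cpu = CPU_RE.match(msg)
        if cpu and summary["cpu"] is None:
            summary["cpu"] = int(cpu.group(1))
            summary["comm"] = cpu.group(2)
            continue
        rip = RIP_RE.match(msg)
        if rip and summary["rip"] is None:
            summary["rip"] = rip.group(1)
    return summary


if __name__ == "__main__":
    print(summarize_stall(SYSLOG))
```

On the full log this shows host CPU 24 running the task `CPU 1/KVM`, stuck in `qi_submit_sync` while `vfio_msi_disable` is tearing down interrupts through `intel_irq_remapping_deactivate`. That seems to point at the interrupt teardown of a passed-through device rather than at the array, though that reading is my interpretation of the trace, not a confirmed diagnosis.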

At this point the VM's display shows the Windows shutdown sequence saying "Restarting...".  If I try to use virsh to shut the VM down, I get the following, followed shortly by a total Unraid lockup:

virsh # destroy "Windows 10"
error: Failed to destroy domain Windows 10
error: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainReset)

General info:

Unraid 6.8.3

Platform: Asus Z9PE-D16 with 2x Xeon 2667 v2

RAM: 128GB ECC

 

The VM is pinned to the second physical CPU, which is isolated from Unraid.  It has an RTX 2070 Super and an NVMe drive passed through to it.

<?xml version='1.0' encoding='UTF-8'?>
<domain type='kvm' id='2'>
  <name>Windows 10</name>
  <uuid>39ac96f3-a777-0c5e-419f-596878b407e9</uuid>
  <description>Win-10</description>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>17301504</memory>
  <currentMemory unit='KiB'>17301504</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>16</vcpu>
  <iothreads>2</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='8'/>
    <vcpupin vcpu='1' cpuset='24'/>
    <vcpupin vcpu='2' cpuset='9'/>
    <vcpupin vcpu='3' cpuset='25'/>
    <vcpupin vcpu='4' cpuset='10'/>
    <vcpupin vcpu='5' cpuset='26'/>
    <vcpupin vcpu='6' cpuset='11'/>
    <vcpupin vcpu='7' cpuset='27'/>
    <vcpupin vcpu='8' cpuset='12'/>
    <vcpupin vcpu='9' cpuset='28'/>
    <vcpupin vcpu='10' cpuset='13'/>
    <vcpupin vcpu='11' cpuset='29'/>
    <vcpupin vcpu='12' cpuset='14'/>
    <vcpupin vcpu='13' cpuset='30'/>
    <vcpupin vcpu='14' cpuset='15'/>
    <vcpupin vcpu='15' cpuset='31'/>
    <emulatorpin cpuset='0,16'/>
    <iothreadpin iothread='1' cpuset='1,17'/>
    <iothreadpin iothread='2' cpuset='2,18'/>
  </cputune>
  <numatune>
    <memory mode='preferred' nodeset='1'/>
  </numatune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-3.1'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/39ac96f3-a777-0c5e-419f-596878b407e9_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vendor_id state='on' value='1278467890ab'/>
    </hyperv>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='8' threads='2'/>
    <cache mode='passthrough'/>
    <feature policy='require' name='topoext'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='hypervclock' present='yes'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <alias name='usb'/>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <alias name='usb'/>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <alias name='usb'/>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x2'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:be:8c:35'/>
      <source bridge='br0'/>
      <target dev='vnet1'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/2'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/2'>
      <source path='/dev/pts/2'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/channel/target/domain-2-Windows 10/org.qemu.guest_agent.0'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'>
      <alias name='input0'/>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'>
      <alias name='input1'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input2'/>
    </input>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x83' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x83' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x00' slot='0x1d' function='0x0'/>
      </source>
      <alias name='hostdev2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x82' slot='0x00' function='0x0'/>
      </source>
      <boot order='1'/>
      <alias name='hostdev3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x83' slot='0x00' function='0x2'/>
      </source>
      <alias name='hostdev4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x83' slot='0x00' function='0x3'/>
      </source>
      <alias name='hostdev5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0a' function='0x0'/>
    </hostdev>
    <memballoon model='none'/>
  </devices>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+0:+100</label>
    <imagelabel>+0:+100</imagelabel>
  </seclabel>
</domain>
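
For cross-checking the XML against the kernel log, here is a minimal sketch (Python, standard library only) that lists the host PCI addresses of the passed-through devices and looks up which vCPU is pinned to a given host CPU. The function names and the trimmed `DOMAIN_XML` sample are illustrative assumptions; in practice the full XML above would be fed in instead.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the domain XML above (illustrative sample only).
DOMAIN_XML = """\
<domain type='kvm'>
  <cputune>
    <vcpupin vcpu='0' cpuset='8'/>
    <vcpupin vcpu='1' cpuset='24'/>
  </cputune>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x83' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x82' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>
  </devices>
</domain>
"""


def host_pci_addresses(xml_text):
    """Return host-side PCI addresses of <hostdev> devices as dddd:bb:ss.f."""
    root = ET.fromstring(xml_text)
    addresses = []
    # Only the <source>/<address> child describes the device on the host;
    # the sibling <address type='pci'> is the slot seen inside the guest.
    for addr in root.findall("./devices/hostdev[@type='pci']/source/address"):
        domain = int(addr.get("domain"), 16)
        bus = int(addr.get("bus"), 16)
        slot = int(addr.get("slot"), 16)
        function = int(addr.get("function"), 16)
        addresses.append(f"{domain:04x}:{bus:02x}:{slot:02x}.{function:x}")
    return addresses


def vcpu_for_host_cpu(xml_text, host_cpu):
    """Return the vCPU pinned to exactly this host CPU, or None."""
    root = ET.fromstring(xml_text)
    for pin in root.findall("./cputune/vcpupin"):
        # Assumes each cpuset is a single CPU number, as in the XML above.
        if pin.get("cpuset") == str(host_cpu):
            return int(pin.get("vcpu"))
    return None


if __name__ == "__main__":
    print(host_pci_addresses(DOMAIN_XML))
    print(vcpu_for_host_cpu(DOMAIN_XML, 24))
```

With the full XML, host CPU 24 from the NMI backtrace maps to vCPU 1, which matches the `CPU 1/KVM` task name in the log, and the address list gives the devices (bus 0x83, 0x82 and 00:1d.0) whose reset behaviour may be worth checking.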

Can anyone advise what the issue might be, or at least suggest a way to deal with this without it taking down the entire system?

 

Thanks

Edited by flaggart