• [6.9.0 Beta 30] Can't start VM more than once without host reboot


    nickp85
    • Urgent

    I updated from 6.8.3 to 6.9 beta 30. My Windows 10 VM starts fine after the host boots, but if it is shut down and then started again, it hangs before even reaching the BIOS screen, with 100% CPU on the first core pinned to the VM. Rebooting the host lets the VM start again. Reproducible every time.

     

    VM config is below; it worked in 6.8.3. I'm also trying to diagnose why, while Windows is idle, host CPU usage is high on the first core pinned to the VM, around 50%. It was about 35% in 6.8.3 and is now worse with 6.9 beta 30.

     

    <?xml version='1.0' encoding='UTF-8'?>
    <domain type='kvm'>
      <name>Windows 10</name>
      <uuid>XXXX</uuid>
      <metadata>
        <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
      </metadata>
      <memory unit='KiB'>16777216</memory>
      <currentMemory unit='KiB'>16777216</currentMemory>
      <memoryBacking>
        <nosharepages/>
      </memoryBacking>
      <vcpu placement='static'>8</vcpu>
      <iothreads>1</iothreads>
      <cputune>
        <vcpupin vcpu='0' cpuset='2'/>
        <vcpupin vcpu='1' cpuset='8'/>
        <vcpupin vcpu='2' cpuset='3'/>
        <vcpupin vcpu='3' cpuset='9'/>
        <vcpupin vcpu='4' cpuset='4'/>
        <vcpupin vcpu='5' cpuset='10'/>
        <vcpupin vcpu='6' cpuset='5'/>
        <vcpupin vcpu='7' cpuset='11'/>
        <emulatorpin cpuset='0,6'/>
        <iothreadpin iothread='1' cpuset='0,6'/>
      </cputune>
      <os>
        <type arch='x86_64' machine='pc-q35-5.1'>hvm</type>
        <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
        <nvram>/etc/libvirt/qemu/nvram/4f3e9f82-1d47-0475-2b7d-4a6181c23948_VARS-pure-efi.fd</nvram>
        <smbios mode='host'/>
      </os>
      <features>
        <acpi/>
        <apic/>
        <hyperv>
          <relaxed state='on'/>
          <vapic state='on'/>
          <spinlocks state='on' retries='8191'/>
          <vpindex state='on'/>
          <synic state='on'/>
          <stimer state='on'/>
          <reset state='on'/>
          <vendor_id state='on' value='1234567890ab'/>
          <frequencies state='on'/>
          <reenlightenment state='on'/>
        </hyperv>
      </features>
      <cpu mode='host-passthrough' check='none' migratable='on'>
        <topology sockets='1' dies='1' cores='4' threads='2'/>
        <cache mode='passthrough'/>
      </cpu>
      <clock offset='localtime'>
        <timer name='hypervclock' present='yes'/>
        <timer name='hpet' present='no'/>
      </clock>
      <on_poweroff>destroy</on_poweroff>
      <on_reboot>restart</on_reboot>
      <on_crash>restart</on_crash>
      <devices>
        <emulator>/usr/local/sbin/qemu</emulator>
        <disk type='file' device='disk'>
          <driver name='qemu' type='raw' cache='writeback' io='threads' discard='unmap'/>
          <source file='/mnt/user/domains/Windows 10/vdisk1.img'/>
          <target dev='hdc' bus='scsi'/>
          <boot order='1'/>
          <address type='drive' controller='0' bus='0' target='0' unit='2'/>
        </disk>
        <disk type='block' device='disk'>
          <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
          <source dev='/dev/disk/by-id/ata-Samsung_SSD_860_EVO_1TB_XXXXXXXXXX'/>
          <target dev='hdd' bus='scsi'/>
          <address type='drive' controller='1' bus='0' target='0' unit='2'/>
        </disk>
        <controller type='scsi' index='0' model='virtio-scsi'>
          <driver queues='8' iothread='1'/>
          <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
        </controller>
        <controller type='scsi' index='1' model='virtio-scsi'>
          <driver queues='8'/>
          <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
        </controller>
        <controller type='pci' index='0' model='pcie-root'/>
        <controller type='pci' index='1' model='pcie-root-port'>
          <model name='pcie-root-port'/>
          <target chassis='1' port='0x8'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' multifunction='on'/>
        </controller>
        <controller type='pci' index='2' model='pcie-root-port'>
          <model name='pcie-root-port'/>
          <target chassis='2' port='0x9'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
        </controller>
        <controller type='pci' index='3' model='pcie-root-port'>
          <model name='pcie-root-port'/>
          <target chassis='3' port='0xa'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
        </controller>
        <controller type='pci' index='4' model='pcie-root-port'>
          <model name='pcie-root-port'/>
          <target chassis='4' port='0xb'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x3'/>
        </controller>
        <controller type='pci' index='5' model='pcie-root-port'>
          <model name='pcie-root-port'/>
          <target chassis='5' port='0xc'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
        </controller>
        <controller type='pci' index='6' model='pcie-root-port'>
          <model name='pcie-root-port'/>
          <target chassis='6' port='0xd'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x5'/>
        </controller>
        <controller type='pci' index='7' model='pcie-to-pci-bridge'>
          <model name='pcie-pci-bridge'/>
          <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
        </controller>
        <controller type='pci' index='8' model='pcie-root-port'>
          <model name='pcie-root-port'/>
          <target chassis='8' port='0xe'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x6'/>
        </controller>
        <controller type='pci' index='9' model='pcie-root-port'>
          <model name='pcie-root-port'/>
          <target chassis='9' port='0xf'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x7'/>
        </controller>
        <controller type='pci' index='10' model='pcie-root-port'>
          <model name='pcie-root-port'/>
          <target chassis='10' port='0x10'/>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
        </controller>
        <controller type='virtio-serial' index='0'>
          <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
        </controller>
        <controller type='sata' index='0'>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
        </controller>
        <controller type='usb' index='0' model='qemu-xhci' ports='15'>
          <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
        </controller>
        <interface type='bridge'>
          <mac address='52:54:00:0b:49:22'/>
          <source bridge='br0'/>
          <model type='virtio'/>
          <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
        </interface>
        <serial type='pty'>
          <target type='isa-serial' port='0'>
            <model name='isa-serial'/>
          </target>
        </serial>
        <console type='pty'>
          <target type='serial' port='0'/>
        </console>
        <channel type='unix'>
          <target type='virtio' name='org.qemu.guest_agent.0'/>
          <address type='virtio-serial' controller='0' bus='0' port='1'/>
        </channel>
        <input type='mouse' bus='ps2'/>
        <input type='keyboard' bus='ps2'/>
        <hostdev mode='subsystem' type='pci' managed='yes'>
          <driver name='vfio'/>
          <source>
            <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
          </source>
          <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
        </hostdev>
        <hostdev mode='subsystem' type='pci' managed='yes'>
          <driver name='vfio'/>
          <source>
            <address domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
          </source>
          <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
        </hostdev>
        <hostdev mode='subsystem' type='pci' managed='yes'>
          <driver name='vfio'/>
          <source>
            <address domain='0x0000' bus='0x00' slot='0x14' function='0x0'/>
          </source>
          <address type='pci' domain='0x0000' bus='0x07' slot='0x01' function='0x0'/>
        </hostdev>
        <memballoon model='none'/>
      </devices>
    </domain>
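
    To narrow down which QEMU thread is actually burning that first core while Windows idles, something like the following can help (just a sketch using standard Linux tools; the process match on the domain name is an assumption about how the QEMU process shows up on the host):

    # Find the QEMU process for this VM (matching on the domain name)
    VM_PID=$(pgrep -f 'qemu.*Windows 10' | head -n1)

    # One-shot view of per-thread CPU usage and the host core each thread last ran on
    ps -L -p "$VM_PID" -o tid,psr,pcpu,comm --sort=-pcpu | head

    # Or watch live; vCPU threads appear as "CPU n/KVM", alongside the emulator and iothread
    top -H -p "$VM_PID"

    If the busy thread turned out to be the emulator or iothread rather than a vCPU, that would point at the emulatorpin/iothreadpin cores (0,6 in this config) rather than the guest itself.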

     

    nicknas2-diagnostics-20201011-0146.zip





    Recommended Comments

    I'm having a similar problem. I upgraded from 6.9.0-beta25 to beta30. The Windows 10 VM worked flawlessly in beta25. On beta30 the VM will boot and even run, but it then freezes with CPU utilization at 100% on random cores assigned to the VM.

     

    Running an Intel i7 10700K on an Asus PRIME Z490-PLUS with the latest firmware. I figure it has something to do with the new kernel? I've even tried blowing the VM away and recreating it; same problem.

     

    If I revert to beta25, everything works as it should.


    At the time, I was running a Core i7 8700K with an ASUS PRIME Z370-A on the latest BIOS. I haven't had to restart my VM from the Unraid console, so this hasn't happened again. Usually I boot it after Unraid is up and it stays running all the time as my personal PC.

     

    I find it curious that we have the same manufacturer for the board.  I did just upgrade to a 9900K the other day (good sale).


    Seeing similar behavior. If I stop the array and restart it (without rebooting the host), my VMs won't start. As soon as I reboot the host and start the array, everything starts as expected.

     

    I get QEMU errors about being unable to reserve port 5700 when I try to start the VM manually.
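
    When that happens, it might be worth checking whether a QEMU process from the previous start is still hanging around and holding the port; a rough sketch of what I'd look at (standard libvirt/process tools, nothing Unraid-specific):

    # Any leftover QEMU processes after stopping the array?
    pgrep -a qemu

    # What does libvirt think is running or defined?
    virsh list --all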

     

    ASRock B365-Pro4 and i5-8400 

    unraidnas-diagnostics-20201028-0942.zip


    I'm having a similar problem, where the VM worked in 6.8.3 and, upon upgrading to 6.9 beta30, it completely dies. Even downgrading back to 6.8.3 doesn't fix it. After further restarts, Unraid doesn't even start...


    Hi everyone, and thank you for your reports on this issue. I have been trying to recreate this in my lab and haven't been successful yet. A few things:

     

    @nickp85, from reviewing your logs, it looks like you have a call trace:

     

    Oct 11 01:17:37 nicknas2 kernel: swapper/11: page allocation failure: order:0, mode:0xa20(GFP_ATOMIC), nodemask=(null),cpuset=/,mems_allowed=0
    Oct 11 01:17:37 nicknas2 kernel: CPU: 11 PID: 0 Comm: swapper/11 Not tainted 5.8.13-Unraid #1
    Oct 11 01:17:37 nicknas2 kernel: Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2401 07/12/2019
    Oct 11 01:17:37 nicknas2 kernel: Call Trace:
    Oct 11 01:17:37 nicknas2 kernel: <IRQ>
    Oct 11 01:17:37 nicknas2 kernel: dump_stack+0x6b/0x83
    Oct 11 01:17:37 nicknas2 kernel: warn_alloc+0xe2/0x160
    Oct 11 01:17:37 nicknas2 kernel: __alloc_pages_slowpath.constprop.0+0x753/0x780
    Oct 11 01:17:37 nicknas2 kernel: __alloc_pages_nodemask+0x1a1/0x1fc
    Oct 11 01:17:37 nicknas2 kernel: e1000_alloc_rx_buffers_ps+0x9b/0x20a [e1000e]
    Oct 11 01:17:37 nicknas2 kernel: e1000_clean_rx_irq_ps+0x4b4/0x4da [e1000e]
    Oct 11 01:17:38 nicknas2 kernel: e1000e_poll+0x79/0x227 [e1000e]
    Oct 11 01:17:38 nicknas2 kernel: net_rx_action+0xf3/0x277
    Oct 11 01:17:38 nicknas2 kernel: __do_softirq+0xc4/0x1c2
    Oct 11 01:17:38 nicknas2 kernel: asm_call_irq_on_stack+0x12/0x20
    Oct 11 01:17:38 nicknas2 kernel: </IRQ>
    Oct 11 01:17:38 nicknas2 kernel: do_softirq_own_stack+0x2c/0x39
    Oct 11 01:17:38 nicknas2 kernel: __irq_exit_rcu+0x45/0x80
    Oct 11 01:17:38 nicknas2 kernel: common_interrupt+0x11a/0x130
    Oct 11 01:17:38 nicknas2 kernel: asm_common_interrupt+0x1e/0x40
    Oct 11 01:17:38 nicknas2 kernel: RIP: 0010:arch_local_irq_enable+0x7/0x8
    Oct 11 01:17:38 nicknas2 kernel: Code: fc ff ff 85 c0 75 0a c7 83 90 00 00 00 01 00 00 00 5b c3 9c 58 0f 1f 44 00 00 c3 fa 66 0f 1f 44 00 00 c3 fb 66 0f 1f 44 00 00 <c3> 55 8b af 28 04 00 00 b8 01 00 00 00 45 31 c9 53 45 31 d2 39 c5
    Oct 11 01:17:38 nicknas2 kernel: RSP: 0018:ffffc90000113e80 EFLAGS: 00000246
    Oct 11 01:17:38 nicknas2 kernel: RAX: ffff88883ece2100 RBX: ffffe8ffffce5600 RCX: 000000000000001f
    Oct 11 01:17:38 nicknas2 kernel: RDX: 0000000000000000 RSI: 0000000022a1cd05 RDI: 0000000000000000
    Oct 11 01:17:38 nicknas2 kernel: RBP: 000002c641161f51 R08: 000002c641161f51 R09: 000002c640c0541f
    Oct 11 01:17:38 nicknas2 kernel: R10: 00000000000001e3 R11: 071c71c71c71c71c R12: ffffffff82054f20
    Oct 11 01:17:38 nicknas2 kernel: R13: 0000000000000002 R14: ffffffff82055008 R15: 0000000000000000
    Oct 11 01:17:38 nicknas2 kernel: cpuidle_enter_state+0xd5/0x193
    Oct 11 01:17:38 nicknas2 kernel: cpuidle_enter+0x25/0x31
    Oct 11 01:17:38 nicknas2 kernel: do_idle+0x1c3/0x236
    Oct 11 01:17:38 nicknas2 kernel: cpu_startup_entry+0x18/0x1a
    Oct 11 01:17:38 nicknas2 kernel: start_secondary+0x145/0x163
    Oct 11 01:17:38 nicknas2 kernel: secondary_startup_64+0xa4/0xb0
    Oct 11 01:17:38 nicknas2 kernel: Mem-Info:
    Oct 11 01:17:38 nicknas2 kernel: active_anon:4628609 inactive_anon:45594 isolated_anon:0
    Oct 11 01:17:38 nicknas2 kernel: active_file:529205 inactive_file:2463292 isolated_file:0
    Oct 11 01:17:38 nicknas2 kernel: unevictable:37 dirty:21825 writeback:0
    Oct 11 01:17:38 nicknas2 kernel: slab_reclaimable:262222 slab_unreclaimable:47586
    Oct 11 01:17:38 nicknas2 kernel: mapped:70489 shmem:242929 pagetables:10575 bounce:0
    Oct 11 01:17:38 nicknas2 kernel: free:39814 free_pcp:3722 free_cma:0
    Oct 11 01:17:38 nicknas2 kernel: Node 0 active_anon:18514436kB inactive_anon:182376kB active_file:2116820kB inactive_file:9853168kB unevictable:148kB isolated(anon):0kB isolated(file):0kB mapped:281956kB dirty:87300kB writeback:0kB shmem:971716kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 17166336kB writeback_tmp:0kB all_unreclaimable? no
    Oct 11 01:17:38 nicknas2 kernel: Node 0 DMA free:15884kB min:32kB low:44kB high:56kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15884kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
    Oct 11 01:17:38 nicknas2 kernel: lowmem_reserve[]: 0 2199 31829 31829
    Oct 11 01:17:38 nicknas2 kernel: Node 0 DMA32 free:120132kB min:4668kB low:6920kB high:9172kB reserved_highatomic:2048KB active_anon:80848kB inactive_anon:48kB active_file:478800kB inactive_file:1679756kB unevictable:0kB writepending:86272kB present:2481536kB managed:2398052kB mlocked:0kB kernel_stack:912kB pagetables:212kB bounce:0kB free_pcp:5692kB local_pcp:248kB free_cma:0kB
    Oct 11 01:17:38 nicknas2 kernel: lowmem_reserve[]: 0 0 29629 29629
    Oct 11 01:17:38 nicknas2 kernel: Node 0 Normal free:23240kB min:62880kB low:93220kB high:123560kB reserved_highatomic:4096KB active_anon:18433588kB inactive_anon:182328kB active_file:1638020kB inactive_file:8173412kB unevictable:148kB writepending:1028kB present:30916608kB managed:30340520kB mlocked:32kB kernel_stack:8976kB pagetables:42088kB bounce:0kB free_pcp:9196kB local_pcp:688kB free_cma:0kB
    Oct 11 01:17:38 nicknas2 kernel: lowmem_reserve[]: 0 0 0 0
    Oct 11 01:17:38 nicknas2 kernel: Node 0 DMA: 1*4kB (U) 1*8kB (U) 2*16kB (U) 1*32kB (U) 3*64kB (U) 2*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB
    Oct 11 01:17:38 nicknas2 kernel: Node 0 DMA32: 37*4kB (MEH) 132*8kB (UMEH) 41*16kB (UM) 2*32kB (UM) 1*64kB (U) 7*128kB (UM) 12*256kB (UE) 3*512kB (U) 4*1024kB (UME) 1*2048kB (E) 26*4096kB (M) = 120132kB
    Oct 11 01:17:38 nicknas2 kernel: Node 0 Normal: 194*4kB (MEH) 102*8kB (UME) 171*16kB (MEH) 91*32kB (MEH) 44*64kB (UME) 11*128kB (UME) 6*256kB (UME) 4*512kB (UMH) 2*1024kB (UE) 1*2048kB (E) 1*4096kB (M) = 23240kB
    Oct 11 01:17:38 nicknas2 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
    Oct 11 01:17:38 nicknas2 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
    Oct 11 01:17:38 nicknas2 kernel: 3233059 total pagecache pages
    Oct 11 01:17:38 nicknas2 kernel: 0 pages in swap cache
    Oct 11 01:17:38 nicknas2 kernel: Swap cache stats: add 0, delete 0, find 0/0
    Oct 11 01:17:38 nicknas2 kernel: Free swap  = 0kB
    Oct 11 01:17:38 nicknas2 kernel: Total swap = 0kB
    Oct 11 01:17:38 nicknas2 kernel: 8353533 pages RAM
    Oct 11 01:17:38 nicknas2 kernel: 0 pages HighMem/MovableOnly
    Oct 11 01:17:38 nicknas2 kernel: 164919 pages reserved
    Oct 11 01:17:38 nicknas2 kernel: 0 pages cma reserved
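
    For context, that trace is the e1000e network driver failing an atomic, order-0 page allocation in its receive path, and the Mem-Info dump that follows shows Node 0 Normal with free (~23MB) well below the min watermark (~62MB), so the host really was short on free pages at that moment. If you want to keep an eye on it, something like this (just a sketch with standard tools) shows free memory and what is left at each allocation order:

    # Overall free memory on the host
    free -h

    # Free blocks per order (columns are 4kB, 8kB, 16kB, ... chunks per zone)
    cat /proc/buddyinfo

    # vm.min_free_kbytes sets the min watermark that the failing atomic allocation ran up against
    sysctl vm.min_free_kbytes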

    Are you running pretty tight on memory with your VM running? Have you tried reducing its allocation to see if the problem goes away? Just curious, because this is concerning. In addition, right after those log events, I see this:

    Oct 11 01:41:54 nicknas2 kernel: BTRFS info (device nvme0n1p1): balance: start -d -m -s
    Oct 11 01:41:54 nicknas2 kernel: BTRFS info (device nvme0n1p1): relocating block group 180444200960 flags data|raid1
    Oct 11 01:41:54 nicknas2 kernel: BTRFS info (device nvme0n1p1): relocating block group 179370459136 flags data|raid1
    Oct 11 01:41:54 nicknas2 kernel: BTRFS info (device nvme0n1p1): relocating block group 178296717312 flags data|raid1

    Did you start a balance operation when collecting these logs, or is it unexpected to see one here?
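
    (If you're not sure, something like the command below will say whether a balance is still in progress; /mnt/cache is just an assumption about where that nvme pool is mounted on your box.)

    # Check whether a balance is currently running on the pool
    btrfs balance status /mnt/cache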

     

    This definitely feels like an issue with the kernel or QEMU, but I'm still not 100% sure this isn't just configuration related.

     

    @WackyWRZ, the error message you are referring to occurs when a VM tries to use a VNC port that is already in use by another VM or application. Can you confirm this isn't the case?
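
    A quick way to check is to see what, if anything, is already listening on that port on the host before starting the VM; a sketch, assuming the usual ss/lsof tools are present:

    # Show any listener already bound to TCP port 5700 and the process holding it
    ss -tlnp | grep ':5700'

    # Alternative view
    lsof -i TCP:5700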

    @Shadz please attach your system diagnostics in your next post on this topic so we can review those as well.


    I was able to get 6.9 beta30 to boot in GUI mode only. I can't produce diagnostics because it crashes/reboots immediately when booting headless. I imagine it's the new kernel reacting to the iGPU, which would explain why it works in GUI mode?

     

    System

    CPU: Intel i7-6700k, integrated GPU

    Mobo: Asus Z270-A Prime

    GPU: Asus GT 1030

    Drives: Lite-On 256 GB SATA M2, 2 TB AData XPG 8200 Pro, 5x12 TB WD Red


    @jonp I have 32GB installed and 16GB allocated to the VM. When my VM is running, I typically have about 40% of system memory free. The btrfs balance you see may have been me: I converted my cache pool to the new format and my docker image to XFS after moving from 6.8.3 to beta 30.




