VM won't start after update to 6.3.0 and restart


stottle


I have a Win10 VM with GPU passthrough.  I downloaded unRAID 6.3.0 (upgrading from 6.2.4), shut down the VM, and restarted unRAID.  Now unRAID pops up a big error message when I try to start the VM:

 

Execution error

internal error: qemu unexpectedly closed the monitor: 2017-02-04T21:56:40.034390Z qemu-system-x86_64: Initialization of device cfi.pflash01 failed: failed to read the initial flash content

 

Could be 6.3.0-related, but might not be.  I figured the message would make more sense to the people in this forum.

 

I hope you guys have suggestions.


If it helps, my VM XML is shown below.

 

I do see

    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>

 

This is a file on unRAID dated Feb 2nd, so I am guessing it is part of the 6.3.0 update.  I'm not sure what that means, but the error message says initialization of pflash failed...
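Since the error is about failing to read the initial flash content, a quick sanity check is to confirm both pflash files (the OVMF code image in `<loader>` and the per-VM NVRAM file in `<nvram>`) exist and are non-empty. Just a sketch, using the paths from my XML; adjust them to match your own:

```shell
# check_fw: report whether a firmware/NVRAM file exists and is non-empty.
check_fw() {
  if [ -s "$1" ]; then
    echo "OK: $1"
  else
    echo "PROBLEM: $1 missing or empty"
  fi
}

# Paths taken from the <loader> and <nvram> entries in the XML below:
check_fw /usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd
check_fw /etc/libvirt/qemu/nvram/449c8082-8631-ef95-bd97-1bdad139ddc7_VARS-pure-efi.fd
```

If either file comes back missing or zero-length, that would explain the cfi.pflash01 initialization failure.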

 

<domain type='kvm'>
  <name>Win10</name>
  <uuid>449c8082-8631-ef95-bd97-1bdad139ddc7</uuid>
  <description>Windows 10</description>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>1</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.3'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/449c8082-8631-ef95-bd97-1bdad139ddc7_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='1' threads='1'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/cache/VM/Win10/vdisk1.img'/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='nec-xhci'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:6b:d2:ee'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </memballoon>
  </devices>
</domain>

 

 


Already tried that; it didn't help.  The original XML was from before I tried the steps listed in the release notes.  I tried those steps with no luck, then tried disabling all passthrough devices as well.  Same error message.

 

My current XML is:

<domain type='kvm'>
  <name>Win10</name>
  <uuid>449c8082-8631-ef95-bd97-1bdad139ddc7</uuid>
  <description>Windows 10</description>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>1</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.7'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/449c8082-8631-ef95-bd97-1bdad139ddc7_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='1' threads='1'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/cache/VM/Win10/vdisk1.img'/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='nec-xhci'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:6b:d2:ee'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes' websocket='-1' listen='0.0.0.0' keymap='en-us'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </memballoon>
  </devices>
</domain>


I'm trying to see if there is something else that might be causing the problem.  I'm running a btrfs RAID1 cache, but a read-only scrub reports the following:

root@Tower2:/# btrfs scrub start -rdB /mnt/cache > /boot/logs/scrub_cache.log
root@Tower2:/# cat /boot/logs/scrub_cache.log
scrub device /dev/sdh1 (id 1) done
        scrub started at Sun Feb 12 17:23:16 2017 and finished after 00:06:38
        total bytes scrubbed: 75.88GiB with 175313 errors
        error details: verify=814 csum=174499
        corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
scrub device /dev/sdg1 (id 2) done
        scrub started at Sun Feb 12 17:23:16 2017 and finished after 00:06:38
        total bytes scrubbed: 75.88GiB with 0 errors

 

I assume errors in the VM image could cause the issue I am seeing.  Since one disk has errors but the second doesn't, I assume I can run scrub again, but without the read-only flag?  Any risk in doing this?

 

Would I be better off reverting from 6.3 back to 6.2.4?

 

Thanks


Hmm, I turned off all running dockers (the array is still running, but this is the cache drive) and tried running a second read-only scrub.  I was curious how repeatable it was.  It actually has a few FEWER errors.

 

root@Tower2:/# btrfs scrub start -rdB /mnt/cache > /boot/logs/scrub_cache2.log
root@Tower2:/# cat /boot/logs/scrub_cache2.log 
scrub device /dev/sdh1 (id 1) done
        scrub started at Sun Feb 12 18:00:23 2017 and finished after 00:06:36
        total bytes scrubbed: 75.84GiB with 175178 errors
        error details: verify=679 csum=174499
        corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
scrub device /dev/sdg1 (id 2) done
        scrub started at Sun Feb 12 18:00:23 2017 and finished after 00:06:33
        total bytes scrubbed: 75.84GiB with 0 errors
root@Tower2:/# btrfs dev stats /mnt/cache
[/dev/sdh1].write_io_errs   181614787
[/dev/sdh1].read_io_errs    147213104
[/dev/sdh1].flush_io_errs   3842528
[/dev/sdh1].corruption_errs 349010
[/dev/sdh1].generation_errs 1493
[/dev/sdg1].write_io_errs   0
[/dev/sdg1].read_io_errs    0
[/dev/sdg1].flush_io_errs   0
[/dev/sdg1].corruption_errs 0
[/dev/sdg1].generation_errs 0

 

Diagnostics also attached.

tower2-diagnostics-20170212-1810.zip


You now know that sdh (cache2) is the one with issues.  There are already a lot of errors from when the syslog starts on Feb 4.  The unRAID release you're using has a bug and there are no SMART reports, but these issues on SSDs are mostly caused by bad cabling.  I would replace both cables to rule them out, run a correcting scrub, and keep an eye on it for the next few days.  You can reset the btrfs stat errors by running:

 

btrfs dev stats -z /mnt/cache

 

Then run it again without -z over the next few days; all errors should be 0.
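For reference, the correcting scrub is the same command minus the -r flag.  A sketch, with a small awk filter to confirm every counter is back at zero (the filter is just an illustration, not an unRAID tool):

```shell
# Correcting (read-write) scrub -- same as before but without -r.
# Commented out here; run it as root against the actual pool:
#   btrfs scrub start -dB /mnt/cache

# check_stats: read "btrfs dev stats" output on stdin and exit non-zero
# if any error counter is non-zero.
check_stats() {
  awk '$2 != 0 {bad=1; print "non-zero:", $0} END {exit bad}'
}

# Example against the stats format shown earlier in the thread:
printf '[/dev/sdg1].write_io_errs 0\n[/dev/sdg1].read_io_errs 0\n' | check_stats
```

On the live system you would pipe the real output: `btrfs dev stats /mnt/cache | check_stats`.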

 

 


OK, powered off and replaced the SATA cables.  I had tried starting the VM after running scrub with corrections enabled, but received the same error.  Now, after powering back on with the swapped cables, I'm getting a new message:

 

Execution error

internal error: qemu unexpectedly closed the monitor: 2017-02-13T00:07:35.264400Z qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,bus=pci.0,addr=0x6: vfio: error, group 1 is not viable, please ensure all devices within the iommu_group are bound to their vfio bus driver.
2017-02-13T00:07:35.264413Z qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,bus=pci.0,addr=0x6: vfio: failed to get group 1
2017-02-13T00:07:35.264421Z qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,bus=pci.0,addr=0x6: Device initialization failed
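The message points at IOMMU group 1, so the next step is presumably to see what else lives in that group and which driver each member is bound to.  A sketch (standard Linux sysfs layout assumed; every member has to be bound to vfio-pci before any of them can be passed through):

```shell
# list_group: print every device in an IOMMU group and the driver it is
# bound to. All members must be bound to vfio-pci for passthrough to work.
list_group() {  # $1 = sysfs root (normally /sys), $2 = group number
  for dev in "$1"/kernel/iommu_groups/"$2"/devices/*; do
    [ -e "$dev" ] || continue
    drv="(no driver)"
    [ -L "$dev/driver" ] && drv=$(basename "$(readlink -f "$dev/driver")")
    echo "$(basename "$dev") -> $drv"
  done
}

# On the real box, group 1 from the error message:
list_group /sys 1
```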


Not sure if this is progress or not.  The above error looked like a passthrough issue, so I opened the edit window and changed the sound card from the Nvidia GPU to "None".  So VNC instead of GPU for audio/video, and I removed the passthrough of my PCIe USB controller.

 

With these changes, the VM will actually start, but it goes immediately into the BSOD (Windows ran into a problem) in the VNC window.

 

I would have thought Windows would have all the necessary drivers for VNC, so I'm not sure what the problem is here.  Help?

 

And to make matters worse, sdh is already showing new errors after the count was zeroed (earlier runs after zeroing showed all zeros):

root@Tower2:~# btrfs dev stats /mnt/cache
[/dev/sdh1].write_io_errs   441286
[/dev/sdh1].read_io_errs    407435
[/dev/sdh1].flush_io_errs   2040
[/dev/sdh1].corruption_errs 0
[/dev/sdh1].generation_errs 0
[/dev/sdg1].write_io_errs   0
[/dev/sdg1].read_io_errs    0
[/dev/sdg1].flush_io_errs   0
[/dev/sdg1].corruption_errs 0
[/dev/sdg1].generation_errs 0


That's a lot of errors in so little time.  Post new diagnostics.  If you replaced both cables (power and SATA), try using a different SATA port; if errors continue, then it could be a bad SSD.

 

After changing ports run another correcting scrub, check that all errors are corrected and then run another one to confirm errors remain at 0.

 

PS: Although it's important to fix this issue, your VM problems are probably unrelated, as the other cache device is working properly and any reads from the "bad" device are corrected by the pool using the good member.


<swearing>

So I tried changing the SATA port for the problematic drive on my mobo.  The array and cache drives looked OK on reboot, so I checked the stats and got:

root@Tower2:~# btrfs dev stats /mnt/cache
ERROR: cannot check /mnt/cache: No such file or directory
ERROR: '/mnt/cache' is not a mounted btrfs device

Hmm, OK.  Maybe the cache isn't available until I start the array, so I started the array.  It now says:

unmountable disk present
Cache • Samsung_SSD_850_EVO_500GB_S21HNXAGC11924P (sdb)

 

I immediately turned off the array.  Apologies, but I don't want to touch anything until I get some advice.

 

Diagnostics attached.  Main -> Cache Devices shows both drives.  How do I recover the cache pool?  Note: it is the good drive that isn't being recognized.

 

tower2-diagnostics-20170213-1908.zip


@johnnie.black - thanks for your help and patience.  I've updated to 6.3.1, powered down, disconnected cache2 and restarted.

 

Then I started the array.  At that point, the cache mounts to /mnt/cache.  So far so good.

 

Should I just add the 2nd drive and then balance, or do you have other suggestions in this case?

 

FYI

root@Tower2:~# btrfs dev stats /mnt/cache
[devid:1].write_io_errs   441396
[devid:1].read_io_errs    407459
[devid:1].flush_io_errs   2047
[devid:1].corruption_errs 0
[devid:1].generation_errs 0
[/dev/sdb1].write_io_errs   0
[/dev/sdb1].read_io_errs    0
[/dev/sdb1].flush_io_errs   0
[/dev/sdb1].corruption_errs 0
[/dev/sdb1].generation_errs 0

 

I assume that the errors are leftovers, as I didn't zero the stats before I removed the 2nd cache drive.  NB: the serial matches cache1, the drive that wasn't seeing errors (S21HNXAGC11924P).


When the above is done you can proceed to re-add the other device:

 

-before doing this is a good time to back up any important cache data in case something goes wrong

-if enabled, disable array auto-start (Settings -> Disk Settings -> enable auto start)

-power down and reconnect cache2

-power up and before starting the array run:

 

blkdiscard /dev/sdX

 

replace X with the cache2 identifier and double-check it's the correct SSD; this will delete all of its data.

 

-assign cache2; there should be a blue icon next to it.  If it's green, it's not cleared

(if it was already assigned you will need to refresh the webGUI for the icon to turn blue)

-start the array; a balance will begin, and you can follow its progress on the cache webpage

 

After or during the balance check for btrfs stat errors.
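Since blkdiscard is irreversible, it can be worth scripting the serial check rather than eyeballing it.  A hypothetical guard (the serial and device name below are placeholders, not your actual values):

```shell
# confirm_serial: only proceed if the device's serial matches the expected one.
confirm_serial() {  # $1 = expected serial, $2 = serial actually read from the device
  if [ "$1" = "$2" ]; then
    echo "serial matches -- safe to proceed"
  else
    echo "serial MISMATCH -- do not discard this device" >&2
    return 1
  fi
}

# On the live system, read the serial first, e.g. with udevadm (part of
# unRAID's Slackware base), then gate the discard on the check:
#   actual=$(udevadm info --query=property --name=/dev/sdX | sed -n 's/^ID_SERIAL_SHORT=//p')
#   confirm_serial "CACHE2_SERIAL_HERE" "$actual" && blkdiscard /dev/sdX
confirm_serial "ABC123" "ABC123"
```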


Something seems to be wrong.  The GUI is running very slowly (several-second wait to load webpages) and I've clicked "Balance" twice without it making any changes to the screen.  I.e., it still looks like the image I sent previously, and it's not saying it is doing a balance operation.

 

I'm seeing the following repeated in the logs:

Feb 15 18:57:41 Tower2 root: ERROR: unable to resize '/var/lib/docker': Read-only file system
Feb 15 18:57:41 Tower2 root: Resize '/var/lib/docker' of 'max'
Feb 15 18:57:41 Tower2 emhttp: shcmd (461): /etc/rc.d/rc.docker start |& logger
Feb 15 18:57:41 Tower2 root: starting docker ...
Feb 15 18:57:51 Tower2 emhttp: shcmd (463): umount /var/lib/docker |& logger
Feb 15 18:58:09 Tower2 php: /usr/local/emhttp/plugins/dynamix/scripts/btrfs_balance 'start' '/mnt/cache' '-dconvert=raid1 -mconvert=raid1'
Feb 15 18:58:09 Tower2 emhttp: shcmd (475): set -o pipefail ; /usr/local/sbin/mount_image '/mnt/cache/docker.img' /var/lib/docker 20 |& logger
Feb 15 18:58:09 Tower2 root: truncate: cannot open '/mnt/cache/docker.img' for writing: Read-only file system
Feb 15 18:58:09 Tower2 kernel: BTRFS info (device loop1): disk space caching is enabled
Feb 15 18:58:09 Tower2 kernel: BTRFS info (device loop1): has skinny extents
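Those "Read-only file system" errors suggest btrfs has remounted the pool read-only after hitting errors.  One way to confirm, just a sketch (check_ro is an illustrative helper, not an unRAID script):

```shell
# check_ro: report if a mount point appears with the 'ro' option in a
# /proc/mounts-style table (field 2 = mount point, field 4 = options).
check_ro() {  # $1 = mount table text, $2 = mount point
  echo "$1" | awk -v mp="$2" '
    $2 == mp { n = split($4, o, ","); for (i = 1; i <= n; i++) if (o[i] == "ro") print mp, "is read-only" }'
}

# On the live system:
check_ro "$(cat /proc/mounts)" /mnt/cache
```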

