stottle Posted February 4, 2017

I have a Win10 VM with GPU passthrough. I upgraded unRAID from 6.2.4 to 6.3.0, shut down the VM, and restarted unRAID. Now unRAID pops up a big error message when I try to start the VM:

Execution error
internal error: qemu unexpectedly closed the monitor: 2017-02-04T21:56:40.034390Z qemu-system-x86_64: Initialization of device cfi.pflash01 failed: failed to read the initial flash content

This could be 6.3.0 related, but it might not be. I figured the message would make more sense to the people in this forum. I hope you guys have suggestions.
stottle Posted February 5, 2017

If it helps, my VM XML is shown below. I do see

<loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>

This is a file on unRAID dated Feb 2nd, so I am guessing it is part of the 6.3.0 update. I'm not sure what that means, but the error message says initialization of pflash failed...

<domain type='kvm'>
  <name>Win10</name>
  <uuid>449c8082-8631-ef95-bd97-1bdad139ddc7</uuid>
  <description>Windows 10</description>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>1</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.3'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/449c8082-8631-ef95-bd97-1bdad139ddc7_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='1' threads='1'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/cache/VM/Win10/vdisk1.img'/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='nec-xhci'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:6b:d2:ee'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </memballoon>
  </devices>
</domain>
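As a sanity check (a minimal sketch, assuming the paths in the XML above are the ones actually in use): "failed to read the initial flash content" usually means QEMU could not read either the OVMF code file or the per-VM NVRAM vars file, so confirming both exist and are non-empty narrows the problem down.

# Verify the two pflash files referenced by the <loader> and <nvram> elements
ls -l /usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd
ls -l /etc/libvirt/qemu/nvram/449c8082-8631-ef95-bd97-1bdad139ddc7_VARS-pure-efi.fd
# Both should exist with a non-zero size; a missing, zero-byte or unreadable file would explain the pflash error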
stottle Posted February 11, 2017

Note: this isn't the "Trying to start my VM gives a 'Invalid Machine Type' error" issue noted in the release notes. I've tried the suggestions listed there and they have no effect. Any other suggestions?
NotYetRated Posted February 11, 2017

Change "2.3" to "2.7" in the line of your config noted below. I had to do the same on mine.

<type arch='x86_64' machine='pc-i440fx-2.3'>hvm</type>
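For reference, one way to see which i440fx machine types the installed QEMU actually supports (a hedged sketch, assuming qemu-system-x86_64 is on the PATH; on unRAID the emulator path in the XML is a wrapper around it):

# List the i440fx machine types known to this QEMU build
qemu-system-x86_64 -machine help | grep i440fx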
stottle Posted February 11, 2017

Already tried that, it didn't help. The original XML was from before I tried the steps listed in the release notes. I tried those steps, with no luck, then tried disabling all passthrough devices as well. Same error message. My current XML is:

<domain type='kvm'>
  <name>Win10</name>
  <uuid>449c8082-8631-ef95-bd97-1bdad139ddc7</uuid>
  <description>Windows 10</description>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>1</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.7'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/449c8082-8631-ef95-bd97-1bdad139ddc7_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='1' threads='1'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/cache/VM/Win10/vdisk1.img'/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='nec-xhci'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:6b:d2:ee'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes' websocket='-1' listen='0.0.0.0' keymap='en-us'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </memballoon>
  </devices>
</domain>
stottle Posted February 12, 2017

I'm trying to see if there is something else that might be causing the problem. I'm running a btrfs raid1 cache, but I get the following:

root@Tower2:/# btrfs scrub start -rdB /mnt/cache > /boot/logs/scrub_cache.log
root@Tower2:/# vi /boot/logs/scrub_cache.log
Read /boot/logs/scrub_cache.log, 8 lines, 416 chars
scrub device /dev/sdh1 (id 1) done
        scrub started at Sun Feb 12 17:23:16 2017 and finished after 00:06:38
        total bytes scrubbed: 75.88GiB with 175313 errors
        error details: verify=814 csum=174499
        corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
scrub device /dev/sdg1 (id 2) done
        scrub started at Sun Feb 12 17:23:16 2017 and finished after 00:06:38
        total bytes scrubbed: 75.88GiB with 0 errors

I assume errors in the VM image could cause the issue I am seeing. Since one disk has errors but the 2nd one doesn't, I assume I can run scrub again, but without the read-only flag? Any risk in doing this? Would I be better off reverting from 6.3 back to 6.2.4?

Thanks
JorgeB Posted February 12, 2017

You should run a correcting scrub. It would also be important to find out the cause of so many errors.
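For reference, a correcting scrub is the same command already used above but without the -r (read-only) flag; a minimal sketch (the log file name here is just an example):

# Run a correcting scrub on the cache pool and keep the output
btrfs scrub start -dB /mnt/cache > /boot/logs/scrub_cache_fix.log
cat /boot/logs/scrub_cache_fix.log
# "corrected errors" should account for the errors found; "uncorrectable errors" should stay at 0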
stottle Posted February 12, 2017

Thanks for the help. Any suggestions for determining what is causing the errors?
JorgeB Posted February 12, 2017

Diagnostics (Tools -> Diagnostics) may help. Also post the output of:

btrfs dev stats /mnt/cache
stottle Posted February 12, 2017

Hmm, I turned off all running dockers (the array is still running, but this is the cache drive) and tried running a 2nd read-only scrub, since I was curious how repeatable it was. It actually shows a few FEWER errors.

root@Tower2:/# btrfs scrub start -rdB /mnt/cache > /boot/logs/scrub_cache2.log
root@Tower2:/# cat /boot/logs/scrub_cache2.log
scrub device /dev/sdh1 (id 1) done
        scrub started at Sun Feb 12 18:00:23 2017 and finished after 00:06:36
        total bytes scrubbed: 75.84GiB with 175178 errors
        error details: verify=679 csum=174499
        corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
scrub device /dev/sdg1 (id 2) done
        scrub started at Sun Feb 12 18:00:23 2017 and finished after 00:06:33
        total bytes scrubbed: 75.84GiB with 0 errors
root@Tower2:/# btrfs dev stats /mnt/cache
[/dev/sdh1].write_io_errs   181614787
[/dev/sdh1].read_io_errs    147213104
[/dev/sdh1].flush_io_errs   3842528
[/dev/sdh1].corruption_errs 349010
[/dev/sdh1].generation_errs 1493
[/dev/sdg1].write_io_errs   0
[/dev/sdg1].read_io_errs    0
[/dev/sdg1].flush_io_errs   0
[/dev/sdg1].corruption_errs 0
[/dev/sdg1].generation_errs 0

Diagnostics also attached.

tower2-diagnostics-20170212-1810.zip
JorgeB Posted February 12, 2017

You now know that sdh (cache2) is the one with issues. There are already a lot of errors from when the syslog starts on Feb 4. The unRAID release you're using has a bug, so there are no SMART reports, but these issues on SSDs are mostly caused by bad cabling. I would replace both cables to rule them out, run a correcting scrub, and keep an eye on it for the next few days. You can reset the btrfs stat errors by running:

btrfs dev stats -z /mnt/cache

Then run it again without -z over the next few days; all errors should stay at 0.
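One possible way to do that periodic check with less output to read (a hedged sketch; the grep simply hides counters that are still zero):

# Reset the per-device error counters once
btrfs dev stats -z /mnt/cache
# Re-run this over the next few days; ideally it prints nothing (no non-zero counters)
btrfs dev stats /mnt/cache | grep -v ' 0$'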
stottle Posted February 13, 2017

Ok, powered off and replaced the SATA cables. I had tried starting the VM after running scrub with corrections enabled, but received the same error. Now, after powering back on after swapping cables, I'm getting a new message:

Execution error
internal error: qemu unexpectedly closed the monitor: 2017-02-13T00:07:35.264400Z qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,bus=pci.0,addr=0x6: vfio: error, group 1 is not viable, please ensure all devices within the iommu_group are bound to their vfio bus driver.
2017-02-13T00:07:35.264413Z qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,bus=pci.0,addr=0x6: vfio: failed to get group 1
2017-02-13T00:07:35.264421Z qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,bus=pci.0,addr=0x6: Device initialization failed
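The "group 1 is not viable" message means that not every device sharing IOMMU group 1 is bound to vfio. A hedged sketch of how to see which PCI devices are in that group (standard sysfs paths, not unRAID-specific):

# List every PCI device that shares IOMMU group 1 with the passed-through device
for d in /sys/kernel/iommu_groups/1/devices/*; do
  lspci -nns "${d##*/}"   # print vendor/device IDs for each group member
done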
stottle Posted February 13, 2017

Not sure if this is progress or not. The above error looked like a passthrough issue, so I opened the edit window and changed the sound card from the Nvidia GPU to "None". So VNC instead of GPU for audio/video, and I removed the passthrough of my PCIe USB controller. With these changes, the VM will actually start, but it goes immediately into a BSOD ("Windows ran into a problem") in the VNC window. I would have thought Windows would have all the necessary drivers for VNC, so I'm not sure what the problem is here. Help?

And to make matters worse, sdh is already showing new errors after the count was zeroed (earlier runs showed all zeros):

root@Tower2:~# btrfs dev stats /mnt/cache
[/dev/sdh1].write_io_errs   441286
[/dev/sdh1].read_io_errs    407435
[/dev/sdh1].flush_io_errs   2040
[/dev/sdh1].corruption_errs 0
[/dev/sdh1].generation_errs 0
[/dev/sdg1].write_io_errs   0
[/dev/sdg1].read_io_errs    0
[/dev/sdg1].flush_io_errs   0
[/dev/sdg1].corruption_errs 0
[/dev/sdg1].generation_errs 0
JorgeB Posted February 13, 2017

That's a lot of errors in so little time. Post new diagnostics. If you replaced both cables (power and SATA), try using a different SATA port; if errors continue, it may be a bad SSD. After changing ports, run another correcting scrub, check that all errors are corrected, and then run another one to confirm errors remain at 0.

PS: Although it's important to fix this issue, your VM problems are probably unrelated, as the other cache device is working properly and any reads from the "bad" device are corrected by the pool using the good member.
stottle Posted February 14, 2017

<swearing> So I tried changing the SATA port for the problematic drive on my mobo. The array and cache drives looked ok on reboot, so I tried stats and got:

root@Tower2:~# btrfs dev stats /mnt/cache
ERROR: cannot check /mnt/cache: No such file or directory
ERROR: '/mnt/cache' is not a mounted btrfs device

Hmm, ok. Maybe the cache isn't available until I start the array, so I started the array. It now says:

Unmountable disk present:
Cache • Samsung_SSD_850_EVO_500GB_S21HNXAGC11924P (sdb)

I immediately stopped the array. Apologies, but I don't want to touch anything until I get some advice. Diagnostics attached. Main -> Cache Devices shows both drives. How do I recover the cache pool? Note: it is the good drive that isn't being recognized.

tower2-diagnostics-20170213-1908.zip
JorgeB Posted February 14, 2017

You're getting a superblock crc mismatch. Power down, disconnect one of the devices (cache2 first), power back on and start with cache1 only; if it still doesn't mount, grab and post new diagnostics. Also update to v6.3.1 before powering down, as v6.3.0 has a problem with changes to assigned slots not sticking.
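If it helps to confirm which device the superblock complaint refers to before pulling a drive, the kernel log usually names it; a hedged sketch:

# Look for recent btrfs superblock/checksum complaints and the device they mention
dmesg | grep -iE 'btrfs|superblock' | tail -n 40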
JorgeB Posted February 14, 2017

Just noticed that you changed the SATA port on cache1. That's not a problem, but you'd need to change the port for cache2, preferably to an unused port, in case the current cache2 SATA port is damaged.
stottle Posted February 14, 2017

@johnnie.black - thanks for your help and patience. I've updated to 6.3.1, powered down, disconnected cache2 and restarted, then started the array. At that point, cache mounts to /mnt/cache. So far so good. Should I just add the 2nd drive and then balance, or do you have other suggestions in this case?

FYI:

root@Tower2:~# btrfs dev stats /mnt/cache
[devid:1].write_io_errs   441396
[devid:1].read_io_errs    407459
[devid:1].flush_io_errs   2047
[devid:1].corruption_errs 0
[devid:1].generation_errs 0
[/dev/sdb1].write_io_errs   0
[/dev/sdb1].read_io_errs    0
[/dev/sdb1].flush_io_errs   0
[/dev/sdb1].corruption_errs 0
[/dev/sdb1].generation_errs 0

I assume the errors are leftovers, as I didn't zero the stats before I removed the 2nd cache drive. NB - the serial matches cache1, the drive that wasn't seeing errors (S21HNXAGC11924P).
JorgeB Posted February 14, 2017

The cache should auto re-balance to single. On the Main page, click on "cache" and check if a balance is in progress; don't stop the array if it is. Before adding the other cache device it needs to be cleared; I'll post what to do in a few minutes.
JorgeB Posted February 15, 2017

Forgot that since it's a single device now, the balance won't be shown on the cache page. Check with:

btrfs balance status /mnt/cache

When done (when the output is "no balance found on /mnt/cache"), check that the cache pool is just one device and there are no missing devices:

btrfs fi show /mnt/cache
JorgeB Posted February 15, 2017

When the above is done you can proceed to re-add the other device:

- Before doing it is a good time to back up any important cache data in case something goes wrong.
- If enabled, disable array auto-start (Settings -> Disk Settings -> enable auto start).
- Power down and reconnect cache2.
- Power up and, before starting the array, run: blkdiscard /dev/sdX — replace X with the cache2 identifier and double check it's the correct SSD (one way to double-check is sketched below), as this will delete all its data.
- Assign cache2; there should be a blue icon next to it. If it's green it's not cleared (if it was already assigned you will need to refresh the webGUI for the icon to turn blue).
- Start the array; a balance will begin, and you can follow the progress on the cache webpage.

After or during the balance, check for btrfs stat errors.
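Since drive letters can shuffle between boots, it's worth confirming which /dev/sdX currently belongs to cache2 before wiping anything; a hedged sketch:

# Match device letters to model and serial number before running blkdiscard
lsblk -o NAME,SIZE,MODEL,SERIAL
# Only after confirming the cache2 serial, run (on that device only):
# blkdiscard /dev/sdX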
stottle Posted February 15, 2017

It isn't looking like a balance started automatically. Under Main -> Cache Devices, both SSDs were listed. When I ran blkdiscard and refreshed, Cache 2's icon turned from green to blue, so I started the array. Now the cache details look like the attached image, with no balance seeming to be running. Diagnostics attached as well.

tower2-diagnostics-20170215-1843.zip
JorgeB Posted February 15, 2017

It should be automatic, but run one manually by clicking on "Balance" on the cache page.
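If the button doesn't respond, the same conversion can be started from the console; the syslog later in the thread shows the GUI invoking exactly this profile conversion:

# Convert data and metadata back to raid1 across both pool members
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache
# Check progress from another shell
btrfs balance status /mnt/cache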
JorgeB Posted February 15, 2017

When the balance is done, check "btrfs filesystem df" on the cache page; there should be only the raid1 data profile, not, like now, both single and raid1 data profiles. Also check btrfs dev stats for any errors.
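For reference, the same two checks from the console (a minimal sketch of what to look for):

# After the balance, only raid1 profiles should remain (no 'single' Data line)
btrfs filesystem df /mnt/cache
# And the per-device error counters should all still be 0
btrfs dev stats /mnt/cache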
stottle Posted February 16, 2017

Something seems to be wrong. The GUI is running very slowly (a several-second wait to load webpages) and I've clicked "Balance" twice without it making any changes to the screen. I.e., it still looks like the image I sent previously, and it doesn't say a balance operation is running. I'm seeing the following repeated in the logs:

Feb 15 18:57:41 Tower2 root: ERROR: unable to resize '/var/lib/docker': Read-only file system
Feb 15 18:57:41 Tower2 root: Resize '/var/lib/docker' of 'max'
Feb 15 18:57:41 Tower2 emhttp: shcmd (461): /etc/rc.d/rc.docker start |& logger
Feb 15 18:57:41 Tower2 root: starting docker ...
Feb 15 18:57:51 Tower2 emhttp: shcmd (463): umount /var/lib/docker |& logger
Feb 15 18:58:09 Tower2 php: /usr/local/emhttp/plugins/dynamix/scripts/btrfs_balance 'start' '/mnt/cache' '-dconvert=raid1 -mconvert=raid1'
Feb 15 18:58:09 Tower2 emhttp: shcmd (475): set -o pipefail ; /usr/local/sbin/mount_image '/mnt/cache/docker.img' /var/lib/docker 20 |& logger
Feb 15 18:58:09 Tower2 root: truncate: cannot open '/mnt/cache/docker.img' for writing: Read-only file system
Feb 15 18:58:09 Tower2 kernel: BTRFS info (device loop1): disk space caching is enabled
Feb 15 18:58:09 Tower2 kernel: BTRFS info (device loop1): has skinny extents