stottle Posted February 4, 2017

I have a Win10 VM with GPU passthrough. I upgraded unRAID from 6.2.4 to 6.3.0, shut down the VM, and restarted unRAID. Now unRAID pops up a big error message when I try to start the VM:

Execution error
internal error: qemu unexpectedly closed the monitor: 2017-02-04T21:56:40.034390Z qemu-system-x86_64: Initialization of device cfi.pflash01 failed: failed to read the initial flash content

This could be 6.3.0 related, but it might not be. I figured the message would make more sense to the people in this forum. I hope you guys have suggestions.
stottle Posted February 5, 2017

If it helps, my VM XML is shown below. I do see

<loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>

This is a file on unRAID dated Feb 2nd, so I am guessing it is part of the 6.3.0 update. I'm not sure what that means, but the error message says initialization of pflash failed...

<domain type='kvm'>
  <name>Win10</name>
  <uuid>449c8082-8631-ef95-bd97-1bdad139ddc7</uuid>
  <description>Windows 10</description>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>1</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.3'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/449c8082-8631-ef95-bd97-1bdad139ddc7_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='1' threads='1'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/cache/VM/Win10/vdisk1.img'/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='nec-xhci'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:6b:d2:ee'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </memballoon>
  </devices>
</domain>
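As a sanity check (a minimal sketch, assuming the paths in the XML above are the ones actually in use): "failed to read the initial flash content" usually means QEMU could not read either the OVMF code file or the per-VM NVRAM vars file, so confirming both exist and are non-empty narrows the problem down.

# Verify the two pflash files referenced by the <loader> and <nvram> elements
ls -l /usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd
ls -l /etc/libvirt/qemu/nvram/449c8082-8631-ef95-bd97-1bdad139ddc7_VARS-pure-efi.fd
# Both should exist with a non-zero size; a missing, zero-byte or unreadable file would explain the pflash error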
stottle Posted February 11, 2017

Note: this isn't the "Trying to start my VM gives a 'Invalid Machine Type' error" issue noted in the release notes. I've tried the suggestions listed there and they have no effect. Any other suggestions?
NotYetRated Posted February 11, 2017

Change "2.3" to "2.7" in the line of your config noted below. I had to do the same on mine.

<type arch='x86_64' machine='pc-i440fx-2.3'>hvm</type>
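For reference, one way to see which i440fx machine types the installed QEMU actually supports (a hedged sketch, assuming qemu-system-x86_64 is on the PATH; on unRAID the emulator path in the XML is a wrapper around it):

# List the i440fx machine types known to this QEMU build
qemu-system-x86_64 -machine help | grep i440fx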
stottle Posted February 11, 2017

Already tried that, it didn't help. The original XML was from before I tried the steps listed in the release notes. I tried those steps, with no luck, then tried disabling all passthrough devices as well. Same error message. My current XML is:

<domain type='kvm'>
  <name>Win10</name>
  <uuid>449c8082-8631-ef95-bd97-1bdad139ddc7</uuid>
  <description>Windows 10</description>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>1</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.7'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/449c8082-8631-ef95-bd97-1bdad139ddc7_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='1' threads='1'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/cache/VM/Win10/vdisk1.img'/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='nec-xhci'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:6b:d2:ee'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target port='0'/>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' port='-1' autoport='yes' websocket='-1' listen='0.0.0.0' keymap='en-us'>
      <listen type='address' address='0.0.0.0'/>
    </graphics>
    <video>
      <model type='qxl' ram='65536' vram='65536' vgamem='16384' heads='1' primary='yes'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </hostdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </memballoon>
  </devices>
</domain>
stottle Posted February 12, 2017

I'm trying to see if there is something else that might be causing the problem. I'm running a btrfs raid1 cache, but I get the following:

root@Tower2:/# btrfs scrub start -rdB /mnt/cache > /boot/logs/scrub_cache.log
root@Tower2:/# vi /boot/logs/scrub_cache.log
Read /boot/logs/scrub_cache.log, 8 lines, 416 chars
scrub device /dev/sdh1 (id 1) done
        scrub started at Sun Feb 12 17:23:16 2017 and finished after 00:06:38
        total bytes scrubbed: 75.88GiB with 175313 errors
        error details: verify=814 csum=174499
        corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
scrub device /dev/sdg1 (id 2) done
        scrub started at Sun Feb 12 17:23:16 2017 and finished after 00:06:38
        total bytes scrubbed: 75.88GiB with 0 errors

I assume errors in the VM image could cause the issue I am seeing. Since one disk has errors but the 2nd one doesn't, I assume I can run scrub again, but without the read-only flag? Any risk in doing this? Would I be better off reverting from 6.3 back to 6.2.4?

Thanks
JorgeB Posted February 12, 2017

You should run a correcting scrub. It would also be important to find out the cause of so many errors.
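For reference, a correcting scrub is the same command already used above but without the -r (read-only) flag; a minimal sketch (the log file name here is just an example):

# Run a correcting scrub on the cache pool and keep the output
btrfs scrub start -dB /mnt/cache > /boot/logs/scrub_cache_fix.log
cat /boot/logs/scrub_cache_fix.log
# "corrected errors" should account for the errors found; "uncorrectable errors" should stay at 0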
stottle Posted February 12, 2017

Thanks for the help. Any suggestions for determining what is causing the errors?
JorgeB Posted February 12, 2017

Diagnostics (Tools -> Diagnostics) may help. Also post the output of:

btrfs dev stats /mnt/cache
stottle Posted February 12, 2017

Hmm, I turned off all running dockers (the array is still running, but this is the cache drive) and tried running a 2nd read-only scrub, since I was curious how repeatable it was. It actually shows a few FEWER errors.

root@Tower2:/# btrfs scrub start -rdB /mnt/cache > /boot/logs/scrub_cache2.log
root@Tower2:/# cat /boot/logs/scrub_cache2.log
scrub device /dev/sdh1 (id 1) done
        scrub started at Sun Feb 12 18:00:23 2017 and finished after 00:06:36
        total bytes scrubbed: 75.84GiB with 175178 errors
        error details: verify=679 csum=174499
        corrected errors: 0, uncorrectable errors: 0, unverified errors: 0
scrub device /dev/sdg1 (id 2) done
        scrub started at Sun Feb 12 18:00:23 2017 and finished after 00:06:33
        total bytes scrubbed: 75.84GiB with 0 errors
root@Tower2:/# btrfs dev stats /mnt/cache
[/dev/sdh1].write_io_errs   181614787
[/dev/sdh1].read_io_errs    147213104
[/dev/sdh1].flush_io_errs   3842528
[/dev/sdh1].corruption_errs 349010
[/dev/sdh1].generation_errs 1493
[/dev/sdg1].write_io_errs   0
[/dev/sdg1].read_io_errs    0
[/dev/sdg1].flush_io_errs   0
[/dev/sdg1].corruption_errs 0
[/dev/sdg1].generation_errs 0

Diagnostics also attached.

tower2-diagnostics-20170212-1810.zip
JorgeB Posted February 12, 2017

You now know that sdh (cache2) is the one with issues. There are already a lot of errors from when the syslog starts on Feb 4. The unRAID release you're using has a bug, so there are no SMART reports, but these issues on SSDs are mostly caused by bad cabling. I would replace both cables to rule them out, run a correcting scrub, and keep an eye on it for the next few days. You can reset the btrfs stat errors by running:

btrfs dev stats -z /mnt/cache

Then run it again without -z over the next few days; all errors should stay at 0.
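One possible way to do that periodic check with less output to read (a hedged sketch; the grep simply hides counters that are still zero):

# Reset the per-device error counters once
btrfs dev stats -z /mnt/cache
# Re-run this over the next few days; ideally it prints nothing (no non-zero counters)
btrfs dev stats /mnt/cache | grep -v ' 0$'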
stottle Posted February 13, 2017

Ok, powered off and replaced the SATA cables. I had tried starting the VM after running scrub with corrections enabled, but received the same error. Now, after powering back on after swapping cables, I'm getting a new message:

Execution error
internal error: qemu unexpectedly closed the monitor: 2017-02-13T00:07:35.264400Z qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,bus=pci.0,addr=0x6: vfio: error, group 1 is not viable, please ensure all devices within the iommu_group are bound to their vfio bus driver.
2017-02-13T00:07:35.264413Z qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,bus=pci.0,addr=0x6: vfio: failed to get group 1
2017-02-13T00:07:35.264421Z qemu-system-x86_64: -device vfio-pci,host=01:00.1,id=hostdev0,bus=pci.0,addr=0x6: Device initialization failed
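The "group 1 is not viable" message means that not every device sharing IOMMU group 1 is bound to vfio. A hedged sketch of how to see which PCI devices are in that group (standard sysfs paths, not unRAID-specific):

# List every PCI device that shares IOMMU group 1 with the passed-through device
for d in /sys/kernel/iommu_groups/1/devices/*; do
  lspci -nns "${d##*/}"   # print vendor/device IDs for each group member
done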
stottle Posted February 13, 2017

Not sure if this is progress or not. The above error looked like a passthrough issue, so I opened the edit window and changed the sound card from the Nvidia GPU to "None". So VNC instead of GPU for audio/video, and I removed the passthrough of my PCIe USB controller. With these changes, the VM will actually start, but it goes immediately into a BSOD ("Windows ran into a problem") in the VNC window. I would have thought Windows would have all the necessary drivers for VNC, so I'm not sure what the problem is here. Help?

And to make matters worse, sdh is already showing new errors after the count was zeroed (earlier runs showed all zeros):

root@Tower2:~# btrfs dev stats /mnt/cache
[/dev/sdh1].write_io_errs   441286
[/dev/sdh1].read_io_errs    407435
[/dev/sdh1].flush_io_errs   2040
[/dev/sdh1].corruption_errs 0
[/dev/sdh1].generation_errs 0
[/dev/sdg1].write_io_errs   0
[/dev/sdg1].read_io_errs    0
[/dev/sdg1].flush_io_errs   0
[/dev/sdg1].corruption_errs 0
[/dev/sdg1].generation_errs 0
JorgeB Posted February 13, 2017

That's a lot of errors in so little time. Post new diagnostics. If you replaced both cables (power and SATA), try using a different SATA port; if errors continue, it may be a bad SSD. After changing ports, run another correcting scrub, check that all errors are corrected, and then run another one to confirm errors remain at 0.

PS: Although it's important to fix this issue, your VM problems are probably unrelated, as the other cache device is working properly and any reads from the "bad" device are corrected by the pool using the good member.
stottle Posted February 14, 2017

<swearing> So I tried changing the SATA port for the problematic drive on my mobo. The array and cache drives looked ok on reboot, so I tried stats and got:

root@Tower2:~# btrfs dev stats /mnt/cache
ERROR: cannot check /mnt/cache: No such file or directory
ERROR: '/mnt/cache' is not a mounted btrfs device

Hmm, ok. Maybe the cache isn't available until I start the array, so I started the array. It now says:

Unmountable disk present:
Cache • Samsung_SSD_850_EVO_500GB_S21HNXAGC11924P (sdb)

I immediately stopped the array. Apologies, but I don't want to touch anything until I get some advice. Diagnostics attached. Main -> Cache Devices shows both drives. How do I recover the cache pool? Note: it is the good drive that isn't being recognized.

tower2-diagnostics-20170213-1908.zip
JorgeB Posted February 14, 2017

You're getting a superblock crc mismatch. Power down, disconnect one of the devices (cache2 first), power back on and start with cache1 only; if it still doesn't mount, grab and post new diagnostics. Also update to v6.3.1 before powering down, as v6.3.0 has a problem with changes to assigned slots not sticking.
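If it helps to confirm which device the superblock complaint refers to before pulling a drive, the kernel log usually names it; a hedged sketch:

# Look for recent btrfs superblock/checksum complaints and the device they mention
dmesg | grep -iE 'btrfs|superblock' | tail -n 40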
JorgeB Posted February 14, 2017

Just noticed that you changed the SATA port on cache1. That's not a problem, but you'd need to change the port for cache2, preferably to an unused port, in case the current cache2 SATA port is damaged.
stottle Posted February 14, 2017

@johnnie.black - thanks for your help and patience. I've updated to 6.3.1, powered down, disconnected cache2 and restarted, then started the array. At that point, cache mounts to /mnt/cache. So far so good. Should I just add the 2nd drive and then balance, or do you have other suggestions in this case?

FYI:

root@Tower2:~# btrfs dev stats /mnt/cache
[devid:1].write_io_errs   441396
[devid:1].read_io_errs    407459
[devid:1].flush_io_errs   2047
[devid:1].corruption_errs 0
[devid:1].generation_errs 0
[/dev/sdb1].write_io_errs   0
[/dev/sdb1].read_io_errs    0
[/dev/sdb1].flush_io_errs   0
[/dev/sdb1].corruption_errs 0
[/dev/sdb1].generation_errs 0

I assume the errors are leftovers, as I didn't zero the stats before I removed the 2nd cache drive. NB - the serial matches cache1, the drive that wasn't seeing errors (S21HNXAGC11924P).
JorgeB Posted February 14, 2017

The cache should auto re-balance to single. On the Main page, click on "cache" and check if a balance is in progress; don't stop the array if it is. Before adding the other cache device it needs to be cleared; I'll post what to do in a few minutes.
JorgeB Posted February 15, 2017

Forgot that since it's a single device now, the balance won't be shown on the cache page. Check with:

btrfs balance status /mnt/cache

When done (when the output is "no balance found on /mnt/cache"), check that the cache pool is just one device and there are no missing devices:

btrfs fi show /mnt/cache
JorgeB Posted February 15, 2017

When the above is done you can proceed to re-add the other device:

- Before doing it is a good time to back up any important cache data in case something goes wrong.
- If enabled, disable array auto-start (Settings -> Disk Settings -> enable auto start).
- Power down and reconnect cache2.
- Power up and, before starting the array, run: blkdiscard /dev/sdX — replace X with the cache2 identifier and double check it's the correct SSD (one way to double-check is sketched below), as this will delete all its data.
- Assign cache2; there should be a blue icon next to it. If it's green it's not cleared (if it was already assigned you will need to refresh the webGUI for the icon to turn blue).
- Start the array; a balance will begin, and you can follow the progress on the cache webpage.

After or during the balance, check for btrfs stat errors.
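Since drive letters can shuffle between boots, it's worth confirming which /dev/sdX currently belongs to cache2 before wiping anything; a hedged sketch:

# Match device letters to model and serial number before running blkdiscard
lsblk -o NAME,SIZE,MODEL,SERIAL
# Only after confirming the cache2 serial, run (on that device only):
# blkdiscard /dev/sdX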
stottle Posted February 15, 2017

It isn't looking like a balance started automatically. Under Main -> Cache Devices, both SSDs were listed. When I ran blkdiscard and refreshed, Cache 2's icon turned from green to blue, so I started the array. Now the cache details look like the attached image, with no balance seeming to be running. Diagnostics attached as well.

tower2-diagnostics-20170215-1843.zip
JorgeB Posted February 15, 2017

It should be automatic, but run one manually by clicking on "Balance" on the cache page.
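If the button doesn't respond, the same conversion can be started from the console; the syslog later in the thread shows the GUI invoking exactly this profile conversion:

# Convert data and metadata back to raid1 across both pool members
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/cache
# Check progress from another shell
btrfs balance status /mnt/cache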
JorgeB Posted February 15, 2017

When the balance is done, check "btrfs filesystem df" on the cache page; there should be only the raid1 data profile, not, like now, both single and raid1 data profiles. Also check btrfs dev stats for any errors.
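For reference, the same two checks from the console (a minimal sketch of what to look for):

# After the balance, only raid1 profiles should remain (no 'single' Data line)
btrfs filesystem df /mnt/cache
# And the per-device error counters should all still be 0
btrfs dev stats /mnt/cache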
stottle Posted February 16, 2017

Something seems to be wrong. The GUI is running very slowly (a several-second wait to load webpages) and I've clicked "Balance" twice without it making any changes to the screen. I.e., it still looks like the image I sent previously, and it doesn't say a balance operation is running. I'm seeing the following repeated in the logs:

Feb 15 18:57:41 Tower2 root: ERROR: unable to resize '/var/lib/docker': Read-only file system
Feb 15 18:57:41 Tower2 root: Resize '/var/lib/docker' of 'max'
Feb 15 18:57:41 Tower2 emhttp: shcmd (461): /etc/rc.d/rc.docker start |& logger
Feb 15 18:57:41 Tower2 root: starting docker ...
Feb 15 18:57:51 Tower2 emhttp: shcmd (463): umount /var/lib/docker |& logger
Feb 15 18:58:09 Tower2 php: /usr/local/emhttp/plugins/dynamix/scripts/btrfs_balance 'start' '/mnt/cache' '-dconvert=raid1 -mconvert=raid1'
Feb 15 18:58:09 Tower2 emhttp: shcmd (475): set -o pipefail ; /usr/local/sbin/mount_image '/mnt/cache/docker.img' /var/lib/docker 20 |& logger
Feb 15 18:58:09 Tower2 root: truncate: cannot open '/mnt/cache/docker.img' for writing: Read-only file system
Feb 15 18:58:09 Tower2 kernel: BTRFS info (device loop1): disk space caching is enabled
Feb 15 18:58:09 Tower2 kernel: BTRFS info (device loop1): has skinny extents