[SOLVED] VM doesn't start and locks up Unraid


Endy


I recently switched hardware, so I'm building a new Windows 10 VM from scratch to make sure I can get passthrough working. There seem to be two problems. If I start the VM with the USB controller passed through, the VM doesn't start and the whole Unraid server locks up. If I start the VM with just the video card passed through and not the USB controller, I don't get any video.

 

Hardware specs in sig. This is the only video card in the system and I am trying to passthrough a motherboard usb controller. They are both in their own groups.

IOMMU group 28:
[10de:1b81] 0d:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
[10de:10f0] 0d:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)

IOMMU group 33:
[1022:149c] 10:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
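
For anyone who wants to double check groupings like the ones above without the GUI, here is a minimal Python sketch that reads them from sysfs. It assumes a Linux host with the IOMMU enabled and the standard `/sys/kernel/iommu_groups` layout; the function name `iommu_groups` is made up for this example and is not part of Unraid.

```python
from pathlib import Path


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Map each IOMMU group number to the sorted PCI addresses it contains."""
    groups = {}
    base = Path(root)
    if not base.is_dir():
        # IOMMU disabled in BIOS/kernel, or not a Linux host.
        return groups
    for group_dir in base.iterdir():
        devices = group_dir / "devices"
        if group_dir.name.isdigit() and devices.is_dir():
            groups[int(group_dir.name)] = sorted(p.name for p in devices.iterdir())
    return groups


if __name__ == "__main__":
    for number, addresses in sorted(iommu_groups().items()):
        print(f"IOMMU group {number}: {' '.join(addresses)}")
```

The output only shows how the kernel grouped the devices. As the rest of this thread shows, a device sitting alone in its group does not by itself guarantee that it can be passed through safely.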

 

I am using the vfio-pci.cfg file method.

BIND=10:00.3 0d:00.0 0d:00.1

vm xml

<?xml version='1.0' encoding='UTF-8'?>
<domain type='kvm'>
  <name>Windows 10 Test</name>
  <uuid>8fe401f7-9b18-2ef5-d565-8a916dc0a78c</uuid>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>8388608</memory>
  <currentMemory unit='KiB'>8388608</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>8</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='4'/>
    <vcpupin vcpu='1' cpuset='12'/>
    <vcpupin vcpu='2' cpuset='5'/>
    <vcpupin vcpu='3' cpuset='13'/>
    <vcpupin vcpu='4' cpuset='6'/>
    <vcpupin vcpu='5' cpuset='14'/>
    <vcpupin vcpu='6' cpuset='7'/>
    <vcpupin vcpu='7' cpuset='15'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-q35-4.1'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/8fe401f7-9b18-2ef5-d565-8a916dc0a78c_VARS-pure-efi.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='8' threads='1'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source file='/mnt/user/domains/Windows 10 Test/vdisk1.img'/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/user/Data/ISO/Windows 10/Windows1909.iso'/>
      <target dev='hda' bus='sata'/>
      <readonly/>
      <boot order='2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/user/Data/ISO/virtio-win-0.1.171.iso'/>
      <target dev='hdb' bus='sata'/>
      <readonly/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0xa'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0xb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x3'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0x13'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0x14'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0x8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0xc'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
    </controller>
    <controller type='pci' index='8' model='pcie-to-pci-bridge'>
      <model name='pcie-pci-bridge'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:a3:f7:80'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x0d' slot='0x00' function='0x0'/>
      </source>
      <rom file='/mnt/user/domains/vbios/GTX1070.rom'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0' multifunction='on'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x0d' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x1'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x10' slot='0x00' function='0x3'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </hostdev>
    <memballoon model='none'/>
  </devices>
</domain>

Relevant Syslinux section. I'm not sure if the video=efifb:off is necessary, it was just something I was trying.

label unRAID OS 
  menu default
  kernel /bzimage
  append isolcpus=4-7,12-15 initrd=/bzroot video=efifb:off mitigations=off

This was in the system log file. I'm not sure what's relevant besides the vfio-pci not ready messages.

Jan 15 15:21:03 Turtle kernel: br0: port 2(vnet0) entered blocking state
Jan 15 15:21:03 Turtle kernel: br0: port 2(vnet0) entered disabled state
Jan 15 15:21:03 Turtle kernel: device vnet0 entered promiscuous mode
Jan 15 15:21:03 Turtle kernel: br0: port 2(vnet0) entered blocking state
Jan 15 15:21:03 Turtle kernel: br0: port 2(vnet0) entered forwarding state

Jan 15 15:21:08 Turtle kernel: clocksource: timekeeping watchdog on CPU10: Marking clocksource 'tsc' as unstable because the skew is too large:
Jan 15 15:21:08 Turtle kernel: clocksource: 'hpet' wd_now: 9f7787c3 wd_last: 9e931f84 mask: ffffffff
Jan 15 15:21:08 Turtle kernel: clocksource: 'tsc' cs_now: a5d3c006266 cs_last: a5ccefb201b mask: ffffffffffffffff
Jan 15 15:21:08 Turtle kernel: tsc: Marking TSC unstable due to clocksource watchdog
Jan 15 15:21:08 Turtle kernel: TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
Jan 15 15:21:08 Turtle kernel: sched_clock: Marking unstable (2886087668830, -16976841)<-(2886210465096, -139776844)
Jan 15 15:21:08 Turtle kernel: clocksource: Switched to clocksource hpet

Jan 15 15:21:16 Turtle kernel: vfio-pci 0000:10:00.3: not ready 1023ms after FLR; waiting
Jan 15 15:21:18 Turtle kernel: vfio-pci 0000:10:00.3: not ready 2047ms after FLR; waiting
Jan 15 15:21:21 Turtle kernel: vfio-pci 0000:10:00.3: not ready 4095ms after FLR; waiting
Jan 15 15:21:26 Turtle kernel: vfio-pci 0000:10:00.3: not ready 8191ms after FLR; waiting
Jan 15 15:21:35 Turtle kernel: vfio-pci 0000:10:00.3: not ready 16383ms after FLR; waiting
Jan 15 15:21:53 Turtle kernel: vfio-pci 0000:10:00.3: not ready 32767ms after FLR; waiting
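
The "not ready ... after FLR" lines are the kernel waiting, with a doubling timeout, for the device to come back after a Function Level Reset. A rough Python sketch for pulling the stuck devices out of a syslog is below. The regular expression is written against the exact message format in the lines above, and `flr_stalls` is a hypothetical helper name.

```python
import re

# Matches kernel messages like:
#   vfio-pci 0000:10:00.3: not ready 1023ms after FLR; waiting
FLR_WAIT = re.compile(
    r"vfio-pci (?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"not ready (?P<ms>\d+)ms after FLR; waiting"
)


def flr_stalls(log_lines):
    """Return {pci_address: longest_wait_ms} for devices stuck after FLR."""
    worst = {}
    for line in log_lines:
        match = FLR_WAIT.search(line)
        if match:
            addr, ms = match["addr"], int(match["ms"])
            worst[addr] = max(ms, worst.get(addr, 0))
    return worst


sample = [
    "Jan 15 15:21:03 Turtle kernel: device vnet0 entered promiscuous mode",
    "Jan 15 15:21:16 Turtle kernel: vfio-pci 0000:10:00.3: not ready 1023ms after FLR; waiting",
    "Jan 15 15:21:53 Turtle kernel: vfio-pci 0000:10:00.3: not ready 32767ms after FLR; waiting",
]
print(flr_stalls(sample))
```

Any address that shows up here is a device whose reset never completed, which makes it a good first suspect when the host hangs at VM start.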

Anything I left out and need to add? I'm not sure what to try next. 

Edited by Endy
Link to comment

With regards to USB locking things up, have you checked (and double checked) that your Unraid USB stick isn't connected to the controller you are passing through?

Next, have you checked if the USB controller can be reset?

 

With regards to the GPU, how did you obtain the vbios file?

Are you able to RDP into the VM to see if the Nvidia driver spits out error 43?
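
On the reset question: the kernel exposes a `reset` file under a PCI function's sysfs directory when it has a reset method for that function. A small Python sketch of that check follows, with `can_reset` as a made-up helper name and the sysfs root passed in as a parameter so it can be pointed at a test directory.

```python
from pathlib import Path


def can_reset(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """True if the kernel exposes a 'reset' attribute for this PCI function."""
    return (Path(sysfs_root) / pci_address / "reset").exists()


if __name__ == "__main__":
    # Full domain:bus:device.function form, as used in sysfs.
    for address in ("0000:0d:00.0", "0000:0d:00.1", "0000:10:00.3"):
        print(address, "reset available" if can_reset(address) else "no reset")
```

Keep in mind that this only tells you the kernel believes it has a reset method. It does not prove the reset actually completes on a given board.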

Link to comment

You can not pass through that usb controller on x570. That and onboard audio will lock up unraid without fail. You can pass through the other two usb controllers together, assuming your unraid usb isn't plugged into one of them. See my screenshot, on my system there are 4 devices in group 24. Pass all three devices I have checked together.

 

If you use my VFIO-PCI Config plugin you can see which USB controller your Unraid USB is connected to. In my case it's the Cruzer Fit in 14:00.3. Note that the ".3" here is a bit of a red flag. Even though it is in its own IOMMU group, it is linked to the other 14:00.x devices (10:00.x in your case), which include the problematic onboard audio (that should be 10:00.4 in your case). It is an AGESA bug at the very least. I've seen reports of getting onboard audio working with a kernel patch. I haven't investigated yet, but it might also fix the USB issue. Either way, your best bet is to use your 10:00.3 for the Unraid USB and pass the others.

screenshot2.png
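
If you would rather check from a terminal which controller the flash drive hangs off, the resolved sysfs path of the block device contains the PCI addresses of every bridge and controller above it. Here is a minimal Python sketch of that idea. The helper names are invented for illustration, and the example path in the usage is a typical shape rather than one copied from a real system.

```python
import os
import re

# A full PCI address as it appears in sysfs paths, e.g. 0000:10:00.3
PCI_ADDR = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def controller_of(sysfs_path):
    """Return the last PCI address in a resolved sysfs device path, or None."""
    parts = [p for p in sysfs_path.split("/") if PCI_ADDR.match(p)]
    return parts[-1] if parts else None


def controller_of_block_device(name):
    """Resolve /sys/block/<name> and report the PCI function it hangs off."""
    return controller_of(os.path.realpath(f"/sys/block/{name}"))


# Illustrative path only: a USB stick behind the controller at 10:00.3.
example = (
    "/sys/devices/pci0000:00/0000:00:08.1/0000:10:00.3/"
    "usb3/3-2/3-2:1.0/host0/target0:0:0/0:0:0:0/block/sda"
)
print(controller_of(example))
```

The last PCI address in the path is the controller itself, and the ones before it are the bridges leading to it. If that address matches a device you plan to bind to vfio-pci, move the flash drive to a different port first.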

Edited by Skitals
Link to comment
3 hours ago, testdasi said:

With regards to USB locking things up, have you checked (and double checked) that your Unraid USB stick isn't connected to the controller you are passing through?

Next, have you checked if the USB controller can be reset?

 

With regards to the GPU, how did you obtain the vbios file?

Are you able to RDP into the VM to see if the Nvidia driver spits out error 43?

I did my best to map out which usb ports go to which motherboard controller. USB devices connected to the controller I was trying to pass through no longer show up in System Devices, but the Unraid USB stick does. Yes, the USB controller can be reset. It looks like Skitals might have the answer for me on this one.

 

I dumped the vbios myself and removed the header as shown in SpaceInvader One's video. Since this is a fresh VM, I haven't been able to get far enough to actually install Windows 10, so no RDP.

 

1 hour ago, Skitals said:

You can not pass through that usb controller on x570. That and onboard audio will lock up unraid without fail. You can pass through the other two usb controllers together, assuming your unraid usb isn't plugged into one of them. See my screenshot, on my system there are 4 devices in group 24. Pass all three devices I have checked together.

 

If you use my VFIO-PCI Config plugin you can see which USB controller your Unraid USB is connected to. In my case it's the Cruzer Fit in 14:00.3. Note that the ".3" here is a bit of a red flag. Even though it is in its own IOMMU group, it is linked to the other 14:00.x devices (10:00.x in your case), which include the problematic onboard audio (that should be 10:00.4 in your case). It is an AGESA bug at the very least. I've seen reports of getting onboard audio working with a kernel patch. I haven't investigated yet, but it might also fix the USB issue. Either way, your best bet is to use your 10:00.3 for the Unraid USB and pass the others.

screenshot2.png

That makes a lot of sense. I will try that out and report back. 

Link to comment
2 minutes ago, Endy said:

I did my best to map out which usb ports go to which motherboard controller. USB devices connected to the controller I was trying to pass through no longer show up in System Devices, but the Unraid USB stick does. Yes, the USB controller can be reset. It looks like Skitals might have the answer for me on this one.

 

I dumped the vbios myself and removed the header as shown in SpaceInvader One's video. Since this is a fresh VM, I haven't been able to get far enough to actually install Windows 10, so no RDP.

 

That makes a lot of sense. I will try that out and report back. 

Also note that my 0d:00.1 does not have reset functionality. 0d:00.0 "non-essential instrumentation" controls reset for that controller, which is why it's passed together.

Link to comment

Success! The vm now starts and Unraid doesn't lock up. I am still having the no video problem. @testdasi I assume what I need to do is to use vnc to setup the vm and then try to add the graphics card afterwards so that then I could RDP into it?

 

2 hours ago, Skitals said:

Also note that my 0d:00.1 does not have reset functionality. 0d:00.0 "non-essential instrumentation" controls reset for that controller, which is why it's passed together.

Mine is the same (except that it's 0a:00.0 and 0a:00.1).

What about 03:08.0 [RESET] 1022:57a4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 57a4 since that is in the same iommu group? I think what I typically read is that everything in a group needs to be passed through? So far it seems to be working without that passed through. 

Link to comment
6 minutes ago, Endy said:

Success! The vm now starts and Unraid doesn't lock up. I am still having the no video problem. @testdasi I assume what I need to do is to use vnc to setup the vm and then try to add the graphics card afterwards so that then I could RDP into it?

Start with having the GPU passed through (note: remember to also include the HDMI audio device) to see if there's display. Default generic driver from the windows installer doesn't do error code 43. Error code 43 is a Nvidia driver thing.

 

If no display THEN do Windows install in VNC. However, in this case, it's more likely due to a bad vbios (e.g. file downloaded from Techpowerup that doesn't match the device).

 

In other words, if you do your pass through correctly on the host end (e.g. correct VM xml, correct vbios, stubbing etc.) then you should be able to install Windows with the GPU passed through.

If it doesn't work at this stage then you need to focus on fixing the host config first.

 

Once you have installed Windows + Nvidia driver and then you lose display (or can't install Nvidia driver to begin with) then it's likely error code 43 so you then deal with that.

Edited by testdasi
Link to comment

Ok, trying to start with just vnc instead of the graphics card I get this

Quote

internal error: qemu unexpectedly closed the monitor: 2020-01-16T17:19:14.841017Z qemu-system-x86_64: -device pcie-pci-bridge,id=pci.8,bus=pci.1,addr=0x0: Bus 'pci.1' not found
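
As far as I understand it, the `bus` number in a libvirt PCI address refers to the `index` of a `<controller type='pci'>` element, so an error like the one above usually means some device still points at a controller that is no longer defined in the XML. A rough Python sketch for spotting that in a saved domain XML is below. `missing_pci_buses` is a hypothetical helper, and it only looks at guest-side addresses that are direct children of a device element.

```python
import xml.etree.ElementTree as ET


def missing_pci_buses(domain_xml):
    """Return PCI bus numbers used in addresses but not defined as controllers."""
    devices = ET.fromstring(domain_xml).find("devices")
    if devices is None:
        return set()
    # Every <controller type='pci' index='N'> provides bus N to the guest.
    defined = {
        int(c.get("index", "0"))
        for c in devices.findall("controller")
        if c.get("type") == "pci"
    }
    # Guest-side addresses are direct children of each device element;
    # host-side addresses sit inside <source> and are not matched here.
    used = {
        int(a.get("bus"), 16)
        for dev in devices
        for a in dev.findall("address")
        if a.get("type") == "pci" and a.get("bus")
    }
    return used - defined


broken = """<domain type='kvm'><devices>
  <controller type='pci' index='0' model='pcie-root'/>
  <controller type='pci' index='8' model='pcie-to-pci-bridge'>
    <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </controller>
</devices></domain>"""
print(missing_pci_buses(broken))
```

An empty set means every address points at a defined controller. Anything else is a bus number to fix by hand, or a reason to recreate the template.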

 

Link to comment
2 minutes ago, testdasi said:

If no display THEN do Windows install in VNC. However, in this case, it's more likely due to a bad vbios (e.g. file downloaded from Techpowerup that doesn't match the device).

There was no display and I did get the vbios from my card, I did not download it from techpowerup. 

Link to comment
4 minutes ago, Endy said:

Ok, trying to start with just vnc instead of the graphics card I get this

 

Start a new template and use Q35 machine type (with OVMF).

 

You are better off sorting it out first to make sure your VM boots (to Windows installer) with a display. At the very least, you want to be able to see the Tiano Core screen at VM boot.

Link to comment

Ok so I just added the 0a:00.3 usb controller to the vm and now I can get it to start with vnc.

 

The message that comes up is: Guest has not initialized the display (yet).

@testdasi like you said, I deleted the template and started again. Now it's working. Thank you. :)

 

I still need to install the nvidia driver and make sure that works.

Link to comment
27 minutes ago, Endy said:

Success! The vm now starts and Unraid doesn't lock up. I am still having the no video problem. @testdasi I assume what I need to do is to use vnc to setup the vm and then try to add the graphics card afterwards so that then I could RDP into it?

 

Mine is the same (except that it's 0a:00.0 and 0a:00.1).

What about 03:08.0 [RESET] 1022:57a4 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 57a4 since that is in the same iommu group? I think what I typically read is that everything in a group needs to be passed through? So far it seems to be working without that passed through. 

The numbers in the address stand for Bus:Device.Function. You want to pass all functions of a device. Usually "pass the whole IOMMU group" amounts to the same thing, but in this case the groupings are weird. It's the same reason you want to pass through your GPU audio along with the graphics card, even if they are in different groups (xx:xx.0 and xx:xx.1).
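
To make the Bus:Device.Function point concrete, here is a small Python sketch that splits addresses and groups functions by the device they belong to. The helper names are invented for illustration, and the addresses are the ones from the first post.

```python
from collections import defaultdict


def split_bdf(address):
    """Split 'bb:dd.f' (optionally prefixed with a 'dddd:' domain) into parts."""
    bus_dev, function = address.rsplit(".", 1)
    bus, device = bus_dev.split(":")[-2:]
    return bus, device, function


def functions_by_device(addresses):
    """Group PCI functions that belong to the same bus:device pair."""
    grouped = defaultdict(list)
    for address in addresses:
        bus, device, function = split_bdf(address)
        grouped[f"{bus}:{device}"].append(function)
    return dict(grouped)


print(functions_by_device(["0d:00.0", "0d:00.1", "10:00.3"]))
```

Here 0d:00 shows functions 0 and 1, the GPU and its HDMI audio, which go together. 10:00 shows only function 3, a hint that it has sibling functions on the same device that are not in the list.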

Link to comment

Oh, and to fix your gpu issue, for single gpu on x570 you need to disable the framebuffer in unraid. Add this parameter to your syslinux.cfg:

video=efifb:off

When you boot unraid, you will get no video output after the bootloader. A gtx1070 should work fine with a good vbios. This was my previous setup before upgrading to a 5700XT + second gpu.
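
If you edit syslinux.cfg by script rather than by hand, something like this Python sketch adds the parameter to an `append` line only when it is missing. `ensure_kernel_param` is a made-up name, and the sample line is the one from the first post with the parameter taken out.

```python
def ensure_kernel_param(append_line, param="video=efifb:off"):
    """Return the syslinux 'append' line with param added if it is missing."""
    if param in append_line.split():
        return append_line  # already present, leave the line untouched
    return f"{append_line.rstrip()} {param}"


line = "  append isolcpus=4-7,12-15 initrd=/bzroot mitigations=off"
print(ensure_kernel_param(line))
```

The change only takes effect after a reboot. Once the host is back up, the running kernel's command line can be read from `/proc/cmdline` to confirm the parameter is there.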

Link to comment
4 hours ago, Skitals said:

The numbers in the address stand for Bus:Device.Function. You want to pass all functions of a device. Usually "pass the whole IOMMU group" amounts to the same thing, but in this case the groupings are weird. It's the same reason you want to pass through your GPU audio along with the graphics card, even if they are in different groups (xx:xx.0 and xx:xx.1).

Thanks for this. It's starting to make sense. In all my searching I would keep finding these things that people say to do, but usually not with any explanations as to why.

 

3 hours ago, Skitals said:

Oh, and to fix your gpu issue, for single gpu on x570 you need to disable the framebuffer in unraid. Add this parameter to your syslinux.cfg:


video=efifb:off

When you boot unraid, you will get no video output after the bootloader. A gtx1070 should work fine with a good vbios. This was my previous setup before upgrading to a 5700XT + second gpu.

Yes, that was already done. Kind of like I was saying before, I saw it mentioned somewhere to do it, but no explanation of why.

 

Thank you both for the help. So far everything is now working.

Link to comment
  • 4 months later...

I wanted to mention, this issue has just recently been affecting me.  I am on an MSI x470 Gaming M7 AC motherboard.  The issue occurred when I switched from the 2700x CPU to the 3900x CPU (I wanted more cores!!).  I swapped the CPUs and all the VFIO Bus:Device.Function numbers changed (that's probably expected).  However, the `USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller` (IOMMU group 22 in the pic below) I had passed-through with the 2700x no longer works in pass-thru, even after adjusting for the vfio numbers.  Using it now locks up the system no matter what combo of "Non-essential" components I tag with vfio and pass with it.  @Skitals described this problem with all x570 boards... Looks like it's the same for x470 boards with a 3000 series CPU.  Why the heck is this a problem in the 3000 series, BTW??

 

I am going down the path of passing through the other USB controllers, as stated above.  Unfortunately, on this board one of the controllers is the front USB, which are not very accessible to me.  The other contains 3 regular USB ports and 1 USB-C port on the back to use, but that is contained in a very large IOMMU group that includes things like the Ethernet controller (IOMMU group 17), so I have my doubts about isolating that back USB panel.

 

image.thumb.png.a97a472c23f0e9e72c22623b303ab91c.png

 

Will provide an update when I get this going... Worst case I buy a separate, PCI USB controller and go that route.

Edited by mattz
added IOMMU groups
Link to comment

Wanted to follow-up.  The cause was totally that FLR issue posted above.  Luckily, someone on this forum had already compiled a kernel with a temporary fix, and I used that.  Note that I tried Unraid 6.9.0-beta1 and it did not yet have the FLR fix in the Linux kernel.  Find that custom kernel for Unraid 6.8.3 here: 

 

Link to comment
