Unraid hang and Network storm after VM shutdown


softfeet


Hello and thanks for having a look. 
 

I've been digging into an issue where shutting down a Win10 VM from inside the guest, after it has been running for an hour or more, crashes the Unraid host and sprays packets over the network. The Win10 VM uses GPU passthrough.

 

The GPU is a 1070. CPUs are pinned and isolated. I shut down by clicking Start > Shut down. Then the network hangs, Synergy mouse sharing fails, and the network collapses: hosts can't talk to anything on the LAN or outside (Google etc.). This is a really WEIRD failure case. Help is appreciated.

 

I've been seeing a number of posts with the same issue. I have an X79 board, so I modified the boot line. It did not help.

example: append pcie_no_flr=1022:149c,1022:1487 vfio-pci.ids=8086:10e8 isolcpus=1-11,13-23 initrd=/bzroot
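
One thing that may be worth checking about that line: the pcie_no_flr and vfio-pci.ids entries are vendor:device pairs, and 1022 is AMD's PCI vendor ID while 8086 is Intel's. On an Intel X79 board the 1022 entries probably match nothing unless an AMD device is actually installed. Here is a minimal Python sketch (the helper names are made up for illustration) that splits the append line so the IDs can be compared against `lspci -nn` output on the host:

```python
def parse_append(line):
    """Split a syslinux 'append' line into {option: value}; bare flags map to None."""
    tokens = line.split()
    if tokens and tokens[0] == "append":
        tokens = tokens[1:]
    options = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        options[key] = value if sep else None
    return options


def pci_vendor_ids(options, key):
    """Return the vendor IDs named in a comma-separated vendor:device list."""
    value = options.get(key) or ""
    return {pair.split(":")[0].lower() for pair in value.split(",") if ":" in pair}


# The append line quoted in the post above.
append_line = ("append pcie_no_flr=1022:149c,1022:1487 vfio-pci.ids=8086:10e8 "
               "isolcpus=1-11,13-23 initrd=/bzroot")
opts = parse_append(append_line)
no_flr_vendors = pci_vendor_ids(opts, "pcie_no_flr")
vfio_vendors = pci_vendor_ids(opts, "vfio-pci.ids")
print("pcie_no_flr vendors:", no_flr_vendors)
print("vfio-pci.ids vendors:", vfio_vendors)
```

If none of the vendor:device pairs show up in `lspci -nn`, that option is most likely a no-op on this machine, which would fit with it not helping.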

 

I keep seeing people who have the same issue, with no real solution.

It seems others have had this same sort of issue, for one reason or another, since 2015.

 

advice is appreciated!

 

Here is my xml for the VM

 

<?xml version='1.0' encoding='UTF-8'?>
<domain type='kvm'>
  <name>Win10_Game_001</name>
  <uuid>8e78427f-65a1-5c1a-1d46-43f5a254e863</uuid>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='6'/>
    <vcpupin vcpu='1' cpuset='18'/>
    <vcpupin vcpu='2' cpuset='7'/>
    <vcpupin vcpu='3' cpuset='19'/>
    <emulatorpin cpuset='0,12'/>
  </cputune>
  <os>
    <type arch='x86_64' machine='pc-i440fx-4.2'>hvm</type>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough' check='none'>
    <topology sockets='1' cores='2' threads='2'/>
    <cache mode='passthrough'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source file='/mnt/cache/domains/Win10_Game_001/vdisk1.img'/>
      <target dev='hdc' bus='virtio'/>
      <boot order='1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <controller type='pci' index='0' model='pci-root'/>
    <controller type='ide' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x2'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:53:00:cd:17:15'/>
      <source bridge='br0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='tablet' bus='usb'>
      <address type='usb' bus='0' port='1'/>
    </input>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <sound model='ich9'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </sound>
    <hostdev mode='subsystem' type='pci' managed='yes' xvga='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>
    <memballoon model='none'/>
  </devices>
</domain>
 

Edited by softfeet
Link to post

Hello,

 

Have you used an edited BIOS for the graphics card, with the first part of it removed? SpaceInvader made a video on how to do it. Also, you may want to try DDU to uninstall the graphics drivers and then install the latest ones.

Link to post

I had tried SpaceInvader's video on a first go-through, but it didn't work for me. I dumped the BIOS and found a BIOS file online, but found that plain passthrough worked... to a point, per this thread.

 

I am not familiar with DDU. 

 

Can you explain why the options you mention would help with my specific problem? I don't want to spend a lot of time trying something that has no rationale for being a solution. That is, I don't want to spend hours testing other people's theories unless I know going in that it is a guess-and-check session.

 

Link to post

This issue is likely due to the graphics card passthrough. You can confirm this by running the VM without the graphics card passed through. If it doesn't crash anymore, there's your problem.

 

Leftover Windows-supplied drivers or older NVIDIA drivers mixed in with the current ones may cause an issue as well, which is why I suggested trying DDU and then installing new drivers.

 

The BIOS from the website didn't quite work for my graphics card. I put the card in another system and ran GPU-Z to extract the BIOS directly from it, then used a hex editor to remove the header.
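
For anyone who would rather not do the header removal by hand, here is a rough Python sketch of the same idea. It assumes the dump is a vendor header followed by a standard PCI option ROM image, which begins with the bytes 0x55 0xAA and carries, at offset 0x18, a little-endian pointer to a structure starting with the ASCII signature "PCIR". The function name and file names are hypothetical, and this is not necessarily the exact procedure from the video, so compare the result against a manual hex edit before trusting it:

```python
import struct


def strip_rom_header(data):
    """Return the dump starting at the first valid-looking PCI option ROM image.

    Scans for the 0x55 0xAA ROM signature and accepts a candidate only if the
    16-bit little-endian pointer at offset 0x18 leads to a 'PCIR' signature.
    """
    start = data.find(b"\x55\xaa")
    while start != -1:
        if start + 0x1A <= len(data):
            (pcir_offset,) = struct.unpack_from("<H", data, start + 0x18)
            pcir = start + pcir_offset
            if data[pcir:pcir + 4] == b"PCIR":
                return data[start:]
        start = data.find(b"\x55\xaa", start + 1)
    raise ValueError("no PCI option ROM signature found")


# Example use (file names are placeholders):
#   with open("gpu_dump.rom", "rb") as src:
#       trimmed = strip_rom_header(src.read())
#   with open("gpu_trimmed.rom", "wb") as dst:
#       dst.write(trimmed)
```

Checking for "PCIR" rather than cutting at the first 0x55 0xAA is meant to avoid a false match inside the vendor header. Always keep the untouched dump as a backup.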
 

 

Edited by XiuzSu
Typo
Link to post

I just went through the ROM dump procedure in the VM that crashes (the only Windows machine I've got :D).

I am now running with the .dump file and the system looks stable. I'll try to get it to crash on shutdown some time in the next few days.

I'm unsure what specifically causes the crash, but this looks like a step in the right direction.

Thanks for the explanation and tips. I'll check back in after a few days (or sooner if it blows up) with updates if I use DDU.


Link to post

Just had the Unraid system lock up again. I was logged into the Win10 machine via Parsec and hit the shutdown button from the Start menu. The entire network came to a grinding halt.

 

Network usage at the time:

Two CIFS connections open to two OS X computers (share mounted, not being used for streaming or transfer).

An SMB mount in a Linux VM, with active data being transferred from the WAN to the SMB mount.

An NFSv4 connection from the above Linux VM (vm2), with an active file transfer from the WAN.

The Win10 VM up, with an SMB share connected to Unraid.

An iSCSI VM running, with a network share to the Linux VM.

 

Everything works fine until the Win10 VM is shut off. Then Unraid goes into network hell and the entire network grinds to a halt. Packet blender.

 

This is with the dumped rom file for the video card. 

 

I don't get it. This makes no sense. I would look at logs... but I don't know exactly what to look for on a USB-based system.

 

 

 

 

 

 

Link to post

Did you edit the ROM file as well?

Have you cleared the old drivers and reinstalled them?

Have you tried making a new VM (using the same vdisk) but without the GPU passthrough? If so, does it crash then?

Do you still have "pcie_no_flr=1022:149c,1022:1487" in your Unraid settings section?

Link to post

Edit ROM file: yes. This was a lot easier on a Windows machine. :D

Cleared old drivers: not yet. Something to try.

New VM method: I presume this would use just the regular emulated video hardware. The resolution is terrible, but it's worth a shot to run it as a clone and start and restart it every few hours.

pcie_no_flr: yes, it is still in place.

Link to post

Just run DDU (it has an option to temporarily stop Windows from installing drivers). Uninstall the drivers, install the latest ones, then untick that option when done so Windows can continue to install drivers.

 

You may also want to try running the VM with the emulated video hardware and see if you experience issues.

 

 

(Edit) I just remembered: make sure to edit your XML so that the graphics card section looks like this.

 

      <alias name='hostdev0'/>
      <rom file='/mnt/user/isos/GTX - Bios Dump/Updated Asus GTX 1050 Ti.rom'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0' multifunction='on'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x1'/>

Note the multifunction='on' on the first address line.

 

Then change the next PCI address line (the card's second function) to use the same slot as the first, and set its function to the first one's function +1, as shown above.

 

Note: if you edit the XML and later go back to the regular settings form, change something, and save, the XML is reset to the defaults and you have to make this change again.
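
Since this edit is easy to lose, here is a small Python sketch for checking the layout described above: both functions of the passed-through card should land on the same guest slot, with multifunction='on' on function 0. The helper names are made up, and the sample XML is cut down from the hostdev entries in the first post (guest slots 0x05 and 0x08), so treat it as an illustration only:

```python
import xml.etree.ElementTree as ET

# Cut-down sample based on the first post's XML (GPU and its audio function).
SAMPLE = """
<domain type='kvm'>
  <devices>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source><address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/></source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source><address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/></source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>
  </devices>
</domain>
"""


def hostdev_layout(domain_xml):
    """List (host_address, guest_address, multifunction_on) per PCI hostdev."""
    rows = []
    for dev in ET.fromstring(domain_xml).iter("hostdev"):
        host = dev.find("source/address")
        guest = dev.find("address")
        if dev.get("type") != "pci" or host is None or guest is None:
            continue
        rows.append((
            (host.get("bus"), host.get("slot"), host.get("function")),
            (guest.get("bus"), guest.get("slot"), guest.get("function")),
            guest.get("multifunction") == "on",
        ))
    return rows


def split_devices(rows):
    """Return host (bus, slot) pairs whose functions sit on different guest slots."""
    guest_slots = {}
    for host, guest, _ in rows:
        guest_slots.setdefault(host[:2], set()).add(guest[:2])
    return [host for host, slots in guest_slots.items() if len(slots) > 1]


print(split_devices(hostdev_layout(SAMPLE)))  # prints [('0x01', '0x00')]
```

To run it against the live definition, something like `virsh dumpxml Win10_Game_001 > vm.xml` and reading that file in should work. An empty list means every passed-through card has its functions on one guest slot.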

Edited by XiuzSu
Link to post
    <hostdev mode='subsystem' type='pci' managed='yes' xvga='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev0'/>
      <rom file='/mnt/user/windows_share/desktop/MSI.GTX1070.dump'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0' multifunction='on'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x01' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x1'/>
    </hostdev>

 

Link to post

I still need to do the DDU step, though I will give this phase an hour or two and see if it survives. If it knocks itself out, I'll try DDU.

 

Going to back up the XML too. I recall having that setting a while back... but so many edits. Like you say, they get lost depending on the mode.

 

Thanks! 

Link to post
6 hours ago, softfeet said:

This says that It should be logging syslog here... but it is a complete lie or misdirection. As far as I can tell. 

 

Figured out how it can work. Set up the listener and the server in the sections listed, so that the server is sending to its own IP (the Unraid server)...

Took about 3 minutes to push a few buttons and test after looking back into it. It creates a file in the listed directory. Having both sides set to the same protocol is important: TCP and TCP worked, UDP/TCP failed. lol.
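
Once the syslog is landing in a file that survives a crash, the lines around the VM shutdown can be pulled out with a short filter. This is only a sketch: the keyword list is my guess at what might be relevant to a passthrough and bridge problem, and the file path in the comment is a placeholder, not where Unraid necessarily writes the file.

```python
import re

# Hypothetical starting keywords; adjust them to the hardware involved.
KEYWORDS = ("vfio", "br0", "vnet", "link is down", "call trace", "nmi")


def interesting_lines(lines, keywords=KEYWORDS):
    """Return the log lines that mention any keyword, case-insensitively."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


# Example use (path is a placeholder):
#   with open("/path/to/saved/syslog.log", errors="replace") as log:
#       for line in interesting_lines(log):
#           print(line)
```

Comparing the filtered output from a clean shutdown with the output from a crashing one is probably more useful than reading either on its own.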

 

 

Screen Shot 2020-10-12 at 8.28.45 PM.png

Edited by softfeet
Link to post

Hmm. Still seeing some weirdness. I have another GPU of the exact same type in the Unraid host.

 

When I started the second VM (Ubuntu 16.04) with the second GPU, it crashed the system: Unraid offline, network packet storm.

After a reboot, the Ubuntu VM won't boot at all, even after switching back to the regular VNC video.

 

So strange. It could be due to the type of board BIOS that is being emulated, but I can't imagine that being the reason.

 

I also checked the syslog file. Nothing informative.

Edited by softfeet
Link to post
