
VMs lock up at random with 100% CPU utilization



Hey guys, I'm at a bit of a loss here and would appreciate any input. The TL;DR of my problem: after a varying amount of time (never less than an hour), my VMs suddenly lock up, pegging 100% of their assigned CPUs.

 

This happens regardless of the guest OS (Linux and Windows) and regardless of the load placed on the VM at the time (I have seen it happen while the VM was sitting idle, under full load, and anywhere in between). journalctl on the guest shows nothing, and neither does Unraid's syslog or the VM's own log. The only thing these VMs have in common is a passed-through GPU (only one of the VMs runs at a time). However, the issue is just as likely to occur while the GPU is under full load as when I'm not placing any load on it. I'm using Q35-7.1 and OVMF.

 

I don't really know what to look at next, so any input is greatly appreciated. Thanks in advance!


Where are you observing the 100% CPU? If it's on the Unraid dashboard, a likely cause is that the drive hosting the VM image is busy: the CPU graph there includes I/O wait, which means the CPU counts as "busy" while it waits for data to be read from or written to a disk.
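To separate the two on the host, the iowait share can be sampled straight from /proc/stat; a minimal sketch (plain POSIX shell, nothing Unraid-specific, and only an approximation since irq/steal time is ignored):

```shell
# Sample the host's aggregate CPU counters twice, one second apart,
# and report the approximate share of elapsed time spent in iowait.
# High iowait with a "busy" dashboard graph points at the disk,
# not at genuine compute load.
read -r _ u1 n1 s1 i1 w1 rest < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 w2 rest < /proc/stat
total=$(( (u2 + n2 + s2 + i2 + w2) - (u1 + n1 + s1 + i1 + w1) ))
iowait_pct=$(( 100 * (w2 - w1) / total ))
echo "iowait over the last second: ${iowait_pct}%"
```

On an idle NVMe-backed system this should stay near 0%; sustained double-digit values while a VM is "locked up" would point back at storage.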

 

What drives are you using, how are they arranged into the array and pools, and where do the vdisk image files live?


You are right, I observed it on the Unraid dashboard, but I also noticed an increase in CPU temperature in line with the utilization. Since the issue occurred again just now, I checked with `top`:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                       
15183 root      20   0   33.0g  32.3g  32.0g S  2362  51.5   7214:28 qemu-system-x86 

 

So the CPU utilization appears to be genuine load and not just I/O wait.
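Here is the per-thread check behind that conclusion, sketched out (each vCPU is its own host thread, so this is roughly `top -H -p <pid>` via /proc; it defaults to the current shell so it runs anywhere, with 15183 being the QEMU PID from the `top` output above):

```shell
# Walk the threads of a process and print each thread's scheduler
# state: R = running (spinning), D = uninterruptible disk wait,
# S = sleeping. vCPU threads sitting in R mean genuine compute;
# threads stuck in D would point at storage instead.
# Pass the QEMU PID as $1 (15183 above); defaults to this shell.
pid=${1:-$$}
for task in /proc/"$pid"/task/*; do
    # /proc/<tid>/stat layout: pid (comm) state ...
    # (comm is assumed to contain no spaces, true for qemu-system-x86)
    read -r _ comm state _ < "$task/stat"
    echo "tid ${task##*/}: $comm $state"
done
```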

 

The VM images are stored on my cache drive (a single 2 TB PCIe 4.0 Samsung NVMe). Each of the VMs also has access to some of the shares through virtiofs.

 

 

##### Edit #####

 

For what it's worth, here are the logs Unraid gives me for the VM. Note the absence of any event between the completed startup and my being forced to stop it.

 

2024-07-24 18:00:38.199+0000: shutting down, reason=destroyed
2024-07-25 05:55:31.356+0000: Starting external device: virtiofsd
/usr/libexec/virtiofsd --fd=36 -o source=/mnt/user/linux,cache=always,sandbox=chroot,xattr
2024-07-25 05:55:31.360+0000: Starting external device: virtiofsd
/usr/libexec/virtiofsd --fd=36 -o source=/mnt/user/downloads,cache=always,sandbox=chroot,xattr
2024-07-25 05:55:31.363+0000: Starting external device: virtiofsd
/usr/libexec/virtiofsd --fd=36 -o source=/mnt/user/Media,cache=always,sandbox=chroot,xattr
2024-07-25 05:55:31.532+0000: starting up libvirt version: 8.7.0, qemu version: 7.1.0, kernel: 6.1.36-Unraid, hostname: Tower
LC_ALL=C \
PATH=/bin:/sbin:/usr/bin:/usr/sbin \
HOME=/var/lib/libvirt/qemu/domain-5-Arch-VM \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-5-Arch-VM/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-5-Arch-VM/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-5-Arch-VM/.config \
/usr/local/sbin/qemu \
-name guest=Arch-VM,debug-threads=on \
-S \
-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-5-Arch-VM/master-key.aes"}' \
-blockdev '{"driver":"file","filename":"/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"}' \
-blockdev '{"driver":"file","filename":"/etc/libvirt/qemu/nvram/f8ad561e-6032-0377-b2a7-d6febc57689a_VARS-pure-efi.fd","node-name":"libvirt-pflash1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-pflash1-format","read-only":false,"driver":"raw","file":"libvirt-pflash1-storage"}' \
-machine pc-q35-7.1,usb=off,dump-guest-core=off,mem-merge=off,memory-backend=pc.ram,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format \
-accel kvm \
-cpu host,migratable=on,topoext=on,host-cache-info=on,l3-cache=off \
-m 32768 \
-object '{"qom-type":"memory-backend-memfd","id":"pc.ram","share":true,"x-use-canonical-path-for-ramblock-id":false,"size":34359738368}' \
-overcommit mem-lock=off \
-smp 24,sockets=1,dies=1,cores=12,threads=2 \
-uuid f8ad561e-6032-0377-b2a7-d6febc57689a \
-display none \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=35,server=on,wait=off \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-boot strict=on \
-device '{"driver":"pcie-root-port","port":8,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x1"}' \
-device '{"driver":"pcie-root-port","port":9,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x1.0x1"}' \
-device '{"driver":"pcie-root-port","port":10,"chassis":3,"id":"pci.3","bus":"pcie.0","addr":"0x1.0x2"}' \
-device '{"driver":"pcie-root-port","port":11,"chassis":4,"id":"pci.4","bus":"pcie.0","addr":"0x1.0x3"}' \
-device '{"driver":"pcie-root-port","port":12,"chassis":5,"id":"pci.5","bus":"pcie.0","addr":"0x1.0x4"}' \
-device '{"driver":"pcie-root-port","port":13,"chassis":6,"id":"pci.6","bus":"pcie.0","addr":"0x1.0x5"}' \
-device '{"driver":"pcie-root-port","port":14,"chassis":7,"id":"pci.7","bus":"pcie.0","addr":"0x1.0x6"}' \
-device '{"driver":"pcie-root-port","port":15,"chassis":8,"id":"pci.8","bus":"pcie.0","addr":"0x1.0x7"}' \
-device '{"driver":"pcie-root-port","port":16,"chassis":9,"id":"pci.9","bus":"pcie.0","multifunction":true,"addr":"0x2"}' \
-device '{"driver":"pcie-root-port","port":17,"chassis":10,"id":"pci.10","bus":"pcie.0","addr":"0x2.0x1"}' \
-device '{"driver":"ich9-usb-ehci1","id":"usb","bus":"pcie.0","addr":"0x7.0x7"}' \
-device '{"driver":"ich9-usb-uhci1","masterbus":"usb.0","firstport":0,"bus":"pcie.0","multifunction":true,"addr":"0x7"}' \
-device '{"driver":"ich9-usb-uhci2","masterbus":"usb.0","firstport":2,"bus":"pcie.0","addr":"0x7.0x1"}' \
-device '{"driver":"ich9-usb-uhci3","masterbus":"usb.0","firstport":4,"bus":"pcie.0","addr":"0x7.0x2"}' \
-device '{"driver":"virtio-serial-pci","id":"virtio-serial0","bus":"pci.5","addr":"0x0"}' \
-blockdev '{"driver":"file","filename":"/mnt/user/domains/Arch-VM/vdisk1.img","node-name":"libvirt-2-storage","cache":{"direct":false,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":false,"cache":{"direct":false,"no-flush":false},"driver":"raw","file":"libvirt-2-storage"}' \
-device '{"driver":"virtio-blk-pci","bus":"pci.6","addr":"0x0","drive":"libvirt-2-format","id":"virtio-disk2","bootindex":1,"write-cache":"on","serial":"vdisk1"}' \
-blockdev '{"driver":"file","filename":"/mnt/user/isos/manjaro-gnome-23.1.4-240406-linux66.iso","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":true,"driver":"raw","file":"libvirt-1-storage"}' \
-device '{"driver":"ide-cd","bus":"ide.0","drive":"libvirt-1-format","id":"sata0-0-0","bootindex":2}' \
-chardev socket,id=chr-vu-fs0,path=/var/lib/libvirt/qemu/domain-5-Arch-VM/fs0-fs.sock \
-device '{"driver":"vhost-user-fs-pci","id":"fs0","chardev":"chr-vu-fs0","queue-size":1024,"tag":"linux","bus":"pci.1","addr":"0x0"}' \
-chardev socket,id=chr-vu-fs1,path=/var/lib/libvirt/qemu/domain-5-Arch-VM/fs1-fs.sock \
-device '{"driver":"vhost-user-fs-pci","id":"fs1","chardev":"chr-vu-fs1","queue-size":1024,"tag":"downloads","bus":"pci.2","addr":"0x0"}' \
-chardev socket,id=chr-vu-fs2,path=/var/lib/libvirt/qemu/domain-5-Arch-VM/fs2-fs.sock \
-device '{"driver":"vhost-user-fs-pci","id":"fs2","chardev":"chr-vu-fs2","queue-size":1024,"tag":"Media","bus":"pci.3","addr":"0x0"}' \
-netdev tap,fd=36,id=hostnet0 \
-device '{"driver":"virtio-net","netdev":"hostnet0","id":"net0","mac":"52:54:00:d2:0f:a7","bus":"pci.4","addr":"0x0"}' \
-chardev pty,id=charserial0 \
-device '{"driver":"isa-serial","chardev":"charserial0","id":"serial0","index":0}' \
-chardev socket,id=charchannel0,fd=34,server=on,wait=off \
-device '{"driver":"virtserialport","bus":"virtio-serial0.0","nr":1,"chardev":"charchannel0","id":"channel0","name":"org.qemu.guest_agent.0"}' \
-audiodev '{"id":"audio1","driver":"none"}' \
-device '{"driver":"vfio-pci","host":"0000:07:00.0","id":"hostdev0","bus":"pci.7","multifunction":true,"addr":"0x0","romfile":"/mnt/user/isos/vbios/rtx3080.rom"}' \
-device '{"driver":"vfio-pci","host":"0000:07:00.1","id":"hostdev1","bus":"pci.8","addr":"0x0"}' \
-device '{"driver":"vfio-pci","host":"0000:02:00.0","id":"hostdev2","bus":"pci.9","addr":"0x0"}' \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
char device redirected to /dev/pts/1 (label charserial0)
2024-07-25T05:55:43.143902Z qemu-system-x86_64: vfio: Cannot reset device 0000:02:00.0, depends on group 16 which is not owned.
2024-07-25T05:55:43.360664Z qemu-system-x86_64: vfio: Cannot reset device 0000:02:00.0, depends on group 16 which is not owned.
2024-07-25T13:12:35.490092Z qemu-system-x86_64: terminating on signal 15 from pid 11419 (/usr/sbin/libvirtd)
2024-07-25 13:12:44.333+0000: shutting down, reason=destroyed

 

And here are the relevant parts of the syslog:
 

Jul 25 07:55:31 Tower kernel: device vnet4 entered promiscuous mode
Jul 25 07:55:31 Tower kernel: br0: port 2(vnet4) entered blocking state
Jul 25 07:55:31 Tower kernel: br0: port 2(vnet4) entered forwarding state
Jul 25 07:55:42 Tower kernel: vfio-pci 0000:07:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
Jul 25 07:55:42 Tower kernel: vfio-pci 0000:07:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
Jul 25 07:55:42 Tower kernel: vfio-pci 0000:07:00.0: vfio_ecap_init: hiding ecap 0x26@0xc1c
Jul 25 07:55:42 Tower kernel: vfio-pci 0000:07:00.0: vfio_ecap_init: hiding ecap 0x27@0xd00
Jul 25 07:55:42 Tower kernel: vfio-pci 0000:07:00.0: vfio_ecap_init: hiding ecap 0x25@0xe00
Jul 25 07:55:42 Tower kernel: vfio-pci 0000:07:00.0: No more image in the PCI ROM
Jul 25 07:55:42 Tower kernel: vfio-pci 0000:07:00.1: vfio_ecap_init: hiding ecap 0x25@0x160
Jul 25 07:55:42 Tower kernel: vfio-pci 0000:02:00.0: vfio_ecap_init: hiding ecap 0x19@0x200
Jul 25 07:55:42 Tower kernel: vfio-pci 0000:02:00.0: vfio_ecap_init: hiding ecap 0x1e@0x400
Jul 25 08:43:33 Tower emhttpd: spinning down /dev/sdb
Jul 25 08:46:52 Tower emhttpd: read SMART /dev/sdb
Jul 25 09:07:44 Tower kernel: veth9b835d9: renamed from eth0
Jul 25 09:07:44 Tower kernel: docker0: port 4(vethd2aaba5) entered disabled state
Jul 25 09:07:44 Tower kernel: docker0: port 4(vethd2aaba5) entered disabled state
Jul 25 09:07:44 Tower kernel: device vethd2aaba5 left promiscuous mode
Jul 25 09:07:44 Tower kernel: docker0: port 4(vethd2aaba5) entered disabled state
Jul 25 09:07:44 Tower kernel: docker0: port 4(vethc17f0b2) entered blocking state
Jul 25 09:07:44 Tower kernel: docker0: port 4(vethc17f0b2) entered disabled state
Jul 25 09:07:44 Tower kernel: device vethc17f0b2 entered promiscuous mode
Jul 25 09:07:44 Tower kernel: docker0: port 4(vethc17f0b2) entered blocking state
Jul 25 09:07:44 Tower kernel: docker0: port 4(vethc17f0b2) entered forwarding state
Jul 25 09:07:45 Tower kernel: docker0: port 4(vethc17f0b2) entered disabled state
Jul 25 09:07:45 Tower kernel: eth0: renamed from veth2fd2b19
Jul 25 09:07:45 Tower kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethc17f0b2: link becomes ready
Jul 25 09:07:45 Tower kernel: docker0: port 4(vethc17f0b2) entered blocking state
Jul 25 09:07:45 Tower kernel: docker0: port 4(vethc17f0b2) entered forwarding state
Jul 25 09:17:00 Tower emhttpd: spinning down /dev/sdb
Jul 25 09:40:10 Tower emhttpd: read SMART /dev/sde
Jul 25 10:13:01 Tower emhttpd: read SMART /dev/sdb
Jul 25 10:15:24 Tower emhttpd: spinning down /dev/sde
Jul 25 10:43:02 Tower emhttpd: spinning down /dev/sdb
Jul 25 11:10:01 Tower Plugin Auto Update: Checking for available plugin updates
Jul 25 11:10:04 Tower Plugin Auto Update: Checking for language updates
Jul 25 11:10:04 Tower Plugin Auto Update: Community Applications Plugin Auto Update finished
Jul 25 12:00:07 Tower emhttpd: read SMART /dev/sdb
Jul 25 12:00:08 Tower emhttpd: read SMART /dev/sde
Jul 25 13:01:19 Tower emhttpd: spinning down /dev/sde
Jul 25 13:01:19 Tower emhttpd: spinning down /dev/sdb
Jul 25 13:56:09 Tower emhttpd: read SMART /dev/sde
Jul 25 14:29:47 Tower emhttpd: spinning down /dev/sde
Jul 25 14:53:53 Tower emhttpd: read SMART /dev/sdb
Jul 25 15:12:35 Tower kernel: br0: port 2(vnet4) entered disabled state
Jul 25 15:12:35 Tower kernel: device vnet4 left promiscuous mode
Jul 25 15:12:35 Tower kernel: br0: port 2(vnet4) entered disabled state
Jul 25 15:13:03 Tower kernel: br0: port 2(vnet5) entered blocking state
Jul 25 15:13:03 Tower kernel: br0: port 2(vnet5) entered disabled state
Jul 25 15:13:03 Tower kernel: device vnet5 entered promiscuous mode
Jul 25 15:13:03 Tower kernel: br0: port 2(vnet5) entered blocking state
Jul 25 15:13:03 Tower kernel: br0: port 2(vnet5) entered forwarding state
Jul 25 15:24:01 Tower emhttpd: spinning down /dev/sdb
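
Side note: the two "Cannot reset device 0000:02:00.0, depends on group 16 which is not owned" lines in the VM log look like they might matter. A rough way to dump the IOMMU grouping (assuming the standard sysfs layout) would be:

```shell
# Print every IOMMU group with its devices and bound drivers.
# For a clean reset, every device in a passed-through group must be
# bound to vfio-pci; a device in group 16 still held by a host
# driver would explain the "not owned" warning.
for dev in /sys/kernel/iommu_groups/*/devices/*; do
    [ -e "$dev" ] || continue            # no IOMMU groups on this host
    group=${dev%/devices/*}; group=${group##*/}
    driver=none
    [ -L "$dev/driver" ] && driver=$(basename "$(readlink "$dev/driver")")
    echo "group $group: ${dev##*/} (driver: $driver)"
done
```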

 

Edited by _monad

I did some more troubleshooting yesterday, and the issue appears to be related to the GPU passthrough (I cannot reproduce it with a VM that has no GPU attached).

 

What should I look at or test next to narrow down what exactly is going wrong? Once again, any input is greatly appreciated.
