[SOLVED] [6.12.8] OOM Crashes during high CPU or network usage on a specific VM

gusgus · March 24

Hello all

I am getting reproducible out of memory crashes under very specific circumstances, and it appears to be triggered by one VM. I've not run into this on a KVM system before so I'm not sure what to do. Details below, any help is appreciated. I'd be happy to provide further troubleshooting information if anyone could provide a little guidance. EDIT: I have not tried doing this on another VM which I could do if you think it's useful. I expect that this is a problem with unRAID, not a VM.

System Setup

Hardware specs

Dell Optiplex 5080
32 GB RAM

Network settings

All VMs have network source set to br0
Only one NIC port is a member of br0
All other NIC ports are unused and not a member of br0 (no bonding)

Plugins installed

Community Applications
Fix Common Problems
GPU Statistics
docker.patch2
Intel GPU TOP
Intel GVT-g
iSCSI Initiator
Nvidia Driver
Unassigned Devices
Unassigned Devices Plus
Unassigned Devices Preclear

Offending VM build

OS: Linux Mint 20.3 xfce
Memory assignment: 20 GB
See the attached XML file "OffendingVM.xml" for the VM configuration. This is after the below (minor) changes were made.

How the problem has manifested and what I tried

I first noticed that the problem appeared to happen during high network usage. I would try to copy a 70 GB file from a remote SMB share to local storage and it would crash about 3.5 GB in. I changed the Network Model for the VM from e1000 to virtio. No change in crash behavior. The crash had the following characteristics:
- VM immediately dropped my SSH connections and would no longer respond to connection attempts
- unRAID webUI was completely unresponsive and my web browser acted as if the website was down
- unRAID LAN interface IP address responded correctly to pings as if there was no issue
- unRAID physical console appeared to work correctly and gave no indication of a problem
- unRAID behaved normally again after a reboot via either the physical button or the physical console
I changed the Network Model for the VM from virtio to virtio-net after reading that virtio-net is most stable. No change in crash behavior.
I suspected an issue with the onboard 1 Gb NIC or its driver, so I replaced it with an Intel X550 10 Gb PCIe card. This did change the crash behavior: after the swap I was able to copy the file as fast as the remote SMB server could manage (140-500 MBps) without triggering a crash.
Now that I could copy files over the network at full speed without a crash, I moved on to computing SHA-512 hashes using the Linux command "sha512sum". This is a single-threaded workload. My working theory now is that the NIC swap improved the issue because the new NIC offloads calculations differently, and that this was an OOM issue all along. After 1-2 minutes of hashing, the following happened:
- VM dropped my SSH connections
- unRAID webUI became unresponsive
- 5-10 minutes later, the webUI suddenly became responsive again. I was able to see that the VM had been shut down. I got a notification from the Fix Common Problems plugin of "Out Of Memory errors detected on your server"
- It's worth noting that I was running htop in an SSH window from the VM at the time of the crash. htop clearly showed that the sha512sum command became a dead process (it had not finished executing) right before the crash, and that the memory usage of the VM was no more than 280 MB of 19.5 GB.
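
Since htop inside the guest only shows guest memory, the host's own log is the place to confirm an OOM event. Below is a minimal Python sketch for pulling OOM-related lines out of a syslog file, such as the one in a diagnostics zip. It only assumes that the kernel's OOM messages contain "oom-killer" or "Out of memory"; the function name `oom_lines` is made up for illustration and this is not how Fix Common Problems does its check.

```python
import re

# Assumption: kernel OOM messages contain one of these phrases.
OOM_RE = re.compile(r"oom-killer|out of memory", re.IGNORECASE)


def oom_lines(lines):
    """Return (line_number, text) for every line that mentions an OOM event."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if OOM_RE.search(line)
    ]
```

To use it, open the syslog file and pass the file object in, for example `oom_lines(open(path, errors="replace"))`, where `path` is wherever your syslog lives.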

metaverse-diagnostics-20240324-1101.zip OffendingVM.xml

Edited March 25 by gusgus
solved

gusgus · March 24

Additional info:
I found in the VM logs the following line at the time of each crash (with different timestamps of course):
2024-03-24 17:41:47.443+0000: shutting down, reason=crashed

So it would appear that qemu recognizes a problem and recovers long enough to write to the log. This is the only line after each crash.
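
To line these shutdown entries up against host-side events, the timestamps can be extracted with a few lines of Python. This is only a sketch: the regex is built from the single log line quoted above, so other log formats may need adjusting, and `find_shutdowns` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Pattern derived from the one line quoted in this thread:
# 2024-03-24 17:41:47.443+0000: shutting down, reason=crashed
SHUTDOWN_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}): "
    r"shutting down, reason=(\w+)"
)


def find_shutdowns(lines):
    """Return a list of (timestamp, reason) tuples for shutdown log lines."""
    events = []
    for line in lines:
        match = SHUTDOWN_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f%z")
            events.append((stamp, match.group(2)))
    return events


if __name__ == "__main__":
    log = ["2024-03-24 17:41:47.443+0000: shutting down, reason=crashed"]
    for stamp, reason in find_shutdowns(log):
        print(stamp.isoformat(), reason)
```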

gusgus · March 25

Solved.

All this time I assumed that the OOM error was not actually unRAID running out of memory, but that assumption rested on a miscalculation of how much RAM should be allocated to each VM to prevent an OOM condition. I reset the memory allocation for each VM (correctly this time) and the system no longer crashes.
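
For anyone who finds this thread later, here is a rough Python sketch of that kind of budget check. It is not the exact calculation used here. The 32 GB host and the 20 GB Mint VM are from this thread; the second VM, the per-VM overhead, and the host reserve are placeholder values invented for the example, so substitute your own numbers.

```python
def memory_budget(host_gb, vm_allocations_gb, per_vm_overhead_gb=0.5, host_reserve_gb=4.0):
    """Return (committed_gb, headroom_gb) for a set of VM memory assignments.

    per_vm_overhead_gb and host_reserve_gb are illustrative guesses,
    not measured values or unRAID recommendations.
    """
    vm_count = len(vm_allocations_gb)
    committed = sum(vm_allocations_gb.values()) + per_vm_overhead_gb * vm_count
    headroom = host_gb - host_reserve_gb - committed
    return committed, headroom


if __name__ == "__main__":
    # "other-vm" and its 12 GB assignment are hypothetical.
    vms = {"mint-20.3": 20.0, "other-vm": 12.0}
    committed, headroom = memory_budget(32.0, vms)
    status = "OK" if headroom >= 0 else "OVERCOMMITTED"
    print(f"committed={committed:.1f} GB headroom={headroom:.1f} GB -> {status}")
```

A negative headroom means the VMs together can ask for more memory than the host can give once its own needs are set aside, which is the situation where the OOM killer can step in regardless of how little the guest appears to be using.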
