[SOLVED] [6.12.8] OOM Crashes during high CPU or network usage on a specific VM


Solved by gusgus


Hello all

 

I am getting reproducible out-of-memory crashes under very specific circumstances, and they appear to be triggered by one VM. I've not run into this on a KVM system before, so I'm not sure what to do. Details are below; any help is appreciated. I'd be happy to provide further troubleshooting information if anyone could provide a little guidance. EDIT: I have not tried reproducing this on another VM, which I could do if you think it would be useful. I expect that this is a problem with unRAID, not a VM.

 

System Setup

Hardware specs

  • Dell Optiplex 5080
  • 32 GB RAM

Network settings

  • All VMs have network source set to br0
  • Only one NIC port is a member of br0
  • All other NIC ports are unused and not a member of br0 (no bonding)

Plugins installed

  • Community Applications
  • Fix Common Problems
  • GPU Statistics
  • docker.patch2
  • Intel GPU TOP
  • Intel GVT-g
  • iSCSI Initiator
  • Nvidia Driver
  • Unassigned Devices
  • Unassigned Devices Plus
  • Unassigned Devices Preclear

 

Offending VM build

  1. OS: Linux Mint 20.3 xfce
  2. Memory assignment: 20 GB
  3. See the attached XML file "OffendingVM.xml" for the VM configuration. This is after the below (minor) changes were made.

 

How the problem has manifested and what I tried

  • I first noticed the problem during high network usage: copying a 70 GB file from a remote SMB share to local storage would trigger a crash about 3.5 GB in. I changed the Network Model for the VM from e1000 to virtio, with no change in crash behavior. The crash had the following characteristics:
    • VM immediately dropped my SSH connections and would no longer respond to connection attempts
    • unRAID webUI was completely unresponsive and my web browser acted as if the website was down
    • unRAID LAN interface IP address responded correctly to pings as if there was no issue
    • unRAID physical console appeared to work correctly and gave no indication of a problem
    • unRAID behaved normally again after a reboot via either the physical button or the physical console
       
  • I changed the Network Model for the VM from virtio to virtio-net after reading that virtio-net is the most stable option. No change in crash behavior.
     
  • I suspected an issue with the onboard 1 Gb NIC or its driver, so I replaced it with an Intel X550 10 Gb PCIe card. This did change the crash behavior: after the swap I was able to copy the file as fast as the remote SMB server could manage (140-500 MBps) without triggering a crash.
     
  • Now able to copy files over the network at full speed without a crash, I moved on to computing SHA-512 hashes using the Linux command "sha512sum". This is a single-threaded workload. After 1-2 minutes of this, the following happened. My working theory now is that the NIC swap improved the issue because the new NIC offloads calculations differently, and that this was an OOM issue all along.
    • VM dropped my SSH connections
    • unRAID webUI became unresponsive
    • 5-10 minutes later, the webUI suddenly became responsive again. I was able to see that the VM had been shut down. I got a notification from the Fix Common Problems plugin of "Out Of Memory errors detected on your server"
    • It's worth noting that I was running htop in an SSH window connected to the VM at the time of the crash. htop clearly showed that the sha512sum command became a dead process (it had not finished executing) right before the crash, and that the memory usage of the VM was no more than 280 MB of 19.5 GB.

metaverse-diagnostics-20240324-1101.zip OffendingVM.xml

Edited by gusgus
solved

Additional info:
I found in the VM logs the following line at the time of each crash (with different timestamps of course):
2024-03-24 17:41:47.443+0000: shutting down, reason=crashed

So it would appear that qemu recognizes a problem and recovers long enough to write to the log. This is the only line after each crash.
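For anyone who wants to pull these lines out of a VM log automatically, here is a minimal Python sketch. It assumes the log lines have the same shape as the one quoted above; the function name `find_shutdowns` is made up for illustration, and the regex simply captures whatever word follows `reason=`.

```python
import re
from datetime import datetime

# Matches lines shaped like the one quoted in this post:
#   2024-03-24 17:41:47.443+0000: shutting down, reason=crashed
SHUTDOWN_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}): "
    r"shutting down, reason=(?P<reason>\w+)"
)


def find_shutdowns(log_text):
    """Return a list of (timestamp, reason) tuples found in the log text."""
    events = []
    for line in log_text.splitlines():
        match = SHUTDOWN_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f%z")
            events.append((ts, match.group("reason")))
    return events


# The sample is the exact line from the VM log quoted above.
sample = "2024-03-24 17:41:47.443+0000: shutting down, reason=crashed\n"
events = find_shutdowns(sample)
for ts, reason in events:
    print(ts.isoformat(), reason)
```

Comparing the extracted timestamps against the host's syslog around the same times is one way to check whether the host logged an out-of-memory event just before the VM was shut down.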

  • Solution

Solved.

 

All this time I assumed that the OOM error was not actually unRAID running out of memory. It was: I had miscalculated the amount of RAM that could be allocated to each VM without causing an OOM condition. I reset the memory allocation for each VM (correctly this time) and the system no longer crashes.
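The kind of check that would have caught this can be sketched in a few lines of Python. Only the 32 GB host and the 20 GB allocation for the offending VM come from this thread. The other VM names and sizes, and the 4 GB reserved for unRAID itself, are hypothetical placeholders to be replaced with real values.

```python
HOST_RAM_GB = 32      # from the hardware specs in the first post
HOST_RESERVE_GB = 4   # ASSUMPTION: headroom kept for unRAID, Docker, and plugins


def memory_budget(vm_alloc_gb, host_ram_gb, reserve_gb):
    """Return (total allocated, available for VMs, whether the total fits)."""
    total = sum(vm_alloc_gb.values())
    available = host_ram_gb - reserve_gb
    return total, available, total <= available


vms = {
    "OffendingVM": 20,        # from the first post
    "OtherVM-A": 8,           # HYPOTHETICAL, not stated in the thread
    "OtherVM-B": 8,           # HYPOTHETICAL, not stated in the thread
}

total, available, fits = memory_budget(vms, HOST_RAM_GB, HOST_RESERVE_GB)
print(f"Allocated to VMs: {total} GB, available: {available} GB, fits: {fits}")
```

With these placeholder numbers the VMs are allocated more than the host can supply when they all run at once. Low memory usage inside a guest, as seen in htop, does not by itself show that the host has memory to spare.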

