Host crashing, seems VM load related


harshl


Hello all,

 

I have a system that ran stable for nearly 2 years. I recently changed things up a bit by building a new server and moving most of the storage, containers, and a few server-based VMs to that system. The system in question is now running only 1 container and 3 VMs; 2 of those VMs are gaming VMs, each with a dedicated GPU and USB controller.

 

This system had heavy load while it was stable as well, but after I added the second dedicated GPU and USB controller it has been exhibiting issues: the host crashes/reboots without warning, generally only under load (gaming or rendering something). It seems to be GPU load that triggers it, but CPU load comes pretty naturally with GPU load in our activities, so I'm not sure. The CPU did not change, though, and in its previous server duties it would be very busy from time to time and was rock solid through it all. I also upgraded to an 850 W PSU when adding the second GPU and USB controller.

In order to pass the second GPU to the second VM I did have to enable PCIe ACS Override. Currently it is set to 'downstream'.
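
For anyone unfamiliar, that setting just adds a kernel parameter to the boot line. My /boot/syslinux/syslinux.cfg ends up looking roughly like this; a minimal sketch, since the labels and initrd line will match whatever your install already has:

    label Unraid OS
      menu default
      kernel /bzimage
      append pcie_acs_override=downstream initrd=/bzroot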

 

I am trying to figure out why this machine might be crashing, but I am coming up empty. There don't seem to be any meaningful logs leading up to the crash; I am syslogging to the other Unraid server, but those logs show nothing of significance before the reboot. A snippet of those logs from before and after the crash is attached (it actually captures two crashes relatively close together). I have also attached diagnostics.
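
For anyone digging through a remote syslog the same way, one trick for finding the crash points is to search for the kernel boot banner, since every unexpected reboot starts a fresh one. A minimal sketch, assuming a stock syslog path on the receiving server:

    # each 'Linux version' banner marks a boot, so the lines just above a
    # match are the last entries logged before that crash
    grep -B 50 'Linux version' /var/log/syslog | less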

 

I have three suspicions:

  1. PCIe ACS Override is causing an issue, but the more research I do, the more it seems the override is pretty much required when passing through more than one PCIe GPU. Can anyone confirm or deny whether that is true?
  2. The second video card I added is a 1050 Ti that is bus powered, and I wonder if the motherboard is unable to power everything under heavy video card draw. A bus-powered card pulls its full load (up to the 75 W PCIe slot limit) through the motherboard, so it doesn't seem impossible. Thoughts on that?
  3. If suspicion 1 is correct and ACS Override is causing the issue, and I can get a motherboard that doesn't need it to pass through two GPUs, then perhaps a better motherboard will solve it?

 

I wish I had another video card with its own power connector to try in it, but I don't, and current market conditions don't make it very conducive to buy one just to throw at the problem for troubleshooting.

 

Anyway, if you have any thoughts on how to troubleshoot this, that would be appreciated.

 

Here are the main components of the build for convenience:

I9-9900K

ASRock Z390 Taichi

64GB of DDR4 memory that is on the QVL (I can't find my order to check the exact kit right now)

EVGA SuperNOVA 850 G5 <--New after change

2x Mushkin 1TB drives mirrored for the array, MKNSSDRE1TB (VM OSes live here) <--New after change, but not new drives

2x WD 4TB drive for the array, WD40EFRX-68W <--New after change, but not new drives

2x Samsung 2TB 980 Pros, unassigned, set to passthrough, one each passed through to the gaming VMs (Games and large software for the gaming VMs lives here)

1x Bluray drive, BW-16D1HT

RTX2060 SUPER, passed through to one gaming VM

GTX1050 Ti, passed through to the other gaming VM <--New after change, but not new card

2x FL1100 controllers, one each passed through to the gaming VMs <--1x new after change

 

There are of course some other USB accessories and such, but the above covers the major components.

 

Thanks for any thoughts you might have on how to narrow this down.

 

 

cube-diagnostics-20220207-0931.zip log-snippet.txt

  • 4 weeks later...

Anyone have any thoughts on this at all? I'm specifically interested in whether anyone thinks the PCIe ACS Override could be related to the crashes; am I barking up the wrong tree there? I guess I will move my gaming VMs to be installed directly on the passed-through SSDs to eliminate the cache as a factor, load-wise. Maybe I will also buy another power supply to see if that makes any difference, since the current one was new with the change...

What would you try if it was your system?

 

I just tried running mprime on the host with test 2, Small FFTs (tests the L1/L2/L3 caches; maximum power/heat/CPU stress). This ran just fine for a long period of time, so as far as I can tell the CPU and its temperature are not the issue.

I also just ran it with test 3, Large FFTs (stresses the memory controller and RAM). This ran just fine as well, so I don't believe I have a memory problem.
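
For anyone who wants to repeat these, mprime's torture test is behind the -t switch; the menu numbers above are from the build I used, so treat them as assumptions for your version:

    # run from the directory where mprime was extracted; -t starts the
    # torture test, which then prompts for the test type
    ./mprime -t
    # choose 2 (Small FFTs) for CPU/heat, or 3 (Large FFTs) for the
    # memory controller and RAM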

 

Well, I just ran mprep.info/gpu from each of the gaming VMs, and it has now been running the CPU & GPU test for over 30 minutes without issue. Playing two games at the same time would for sure have crashed it within that time. This is steering me more towards some kind of storage issue now. Super strange...

 

Just finished running another test against the CPU, video card, and passed-through Samsung drives on both VMs simultaneously; it ran fine. Just the cache left to test now. If that is stable also, then I guess I suspect the USB controllers? They are the only thing that gets heavy use while playing a game and hasn't been load tested so far... Still strange...
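
For the cache test I'm planning to use fio from the host console. This is just a sketch: the path and sizes are my assumptions, and it presumes fio is available on the host, since it isn't part of stock Unraid:

    # ten minutes of mixed random read/write against the cache pool
    fio --name=cachetest --directory=/mnt/cache --size=4G \
        --rw=randrw --bs=64k --ioengine=libaio --iodepth=16 \
        --numjobs=4 --runtime=600 --time_based --group_reporting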

 

Well, I've tested the array drives (the C: drive in the VMs) along with the CPU and video cards, and still... everything runs fine. I cannot crash this thing without having two people use the machines manually under at least somewhat significant load, like playing a game... This is the weirdest issue ever...

 

Thanks,

  • 2 months later...

I know it is a super edge case, but I thought I would close the loop on this one. In my case, the issue was power supply related: not a wattage problem, but a faulty power supply. I put in a replacement of the same wattage and everything is running great again.

  • 7 months later...

@harshl I'm curious how you debugged the PSU issue. I've got a similar thing going on: when I have a VM (Ubuntu) running with my GPU passed through, it locks up the whole host system and I have to hard reboot. This has only happened twice so far, but it seems to happen when there's higher load on the VM.

 

I've got a new 550 W PSU and my max usage should never exceed 400 W, but I'm kinda grasping at straws here...


Honestly, I was also grasping at straws. I decided to just start throwing parts at it.

 

Nothing made any difference until I swapped out the power supply. I had a higher-wattage one on hand, threw it in, and the sporadic reboots stopped. So I double-checked all of my wattage calculations and purchased another power supply that should safely cover the load, and everything has been fine since.
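
For what it's worth, the rough math looked something like this; these are approximate published TDP figures for my parts, not measured draw:

    i9-9900K (with raised power limits)    ~200 W
    RTX 2060 SUPER                         ~175 W
    GTX 1050 Ti (bus powered)               ~75 W
    Drives, fans, USB controllers, misc    ~100 W
    ---------------------------------------------
    Estimated peak                         ~550 W

That's why the 850 W unit should have had plenty of headroom, and why I stopped suspecting wattage and started suspecting the unit itself.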

 

If it's locking up, it might be a little different from my situation, as mine was instantly rebooting itself. But I suppose the same fault could manifest in many different ways.

 

I would recommend throwing a higher-wattage unit at it for a while to see if it helps.

 

Good luck @jgillman, instability is no fun.

