Entire unraid server hangs when VM is under load



Hello,

I'm using Unraid to host a few shares and run a headless gaming VM (using Steam In-Home Streaming / Remote Play). After I play for 5-10 minutes, the entire Unraid instance crashes: shares go offline, the server is inaccessible via web or SSH, and the local terminal display freezes and is unresponsive. The only way to bring the server back up is with a power cycle.

 

At first I thought it had to do with temps, but the CPU temps never went above 86C (and even that was uncommon). I then thought it had to do with the GPU (EVGA GTX 1660), but the problem persisted even without a GPU physically connected at all. My next thought was to try lowering the amount of RAM allocated to the VM, but that didn't help either (I tried 12GB, 8GB, then 4GB out of 16GB total). 

 

I have more RAM on the way, which will put my total at 32GB (not specifically due to this issue, but it will give me another thing to try). That's the max my motherboard can support. I have another network card on the way too (also not purchased specifically because of this issue), so I can try passing through my existing network card instead of using bridged/bonded networking.

 

I’m running a memtest now, and it’s gone through one pass without errors so far.

 

Thanks!!

 

UnRAID syslog at the time of the frozen screen (nothing was generated at the time of the crash): https://imgur.com/BAE5AVI

Windows 10 VM Settings: https://imgur.com/70wUEfu

 

Specs:

Intel Core i7 4790k @ 4.0 GHz (stock)

EVGA Z97 Classified motherboard with latest BIOS (2.06)

16GB DDR3 RAM @ 1600 MHz

EVGA GTX 1660 XC Gaming

Supermicro LSI 9300-8i HBA (IT mode)

3x Ironwolf Pro 6TB 

2x 256GB SATA SSD for cache (different brands)

Link to comment

I've run into this issue myself with Docker containers that ran wild on the CPU when I had not configured CPU pinning, which left them with effectively unlimited access to the CPU. I believe your problem may be caused by over-provisioning your VM. Stated differently: you have allocated the vast majority of resources to your VM, and Unraid does not have enough left over to do the work of managing the VM.

 

How many of your cores/threads are you currently passing through to the VM? If you're passing all 4 cores and 8 threads to your VM right now, then that's almost certainly the issue. I'd say you want to leave Unraid with at least a whole core to itself (with both its threads). That still leaves you with 3 cores for the VM, and I would expect it to perform decently.

 

(RAM could also be the issue, but I'm suspecting CPU based on what you've said. 32GB can't hurt, but as long as you aren't passing through all 16GB of your memory to the VM, I'd imagine you should be okay. I'd try giving maybe 12GB to the VM and leaving 4GB for Unraid to see how that runs, but maybe that's what you're doing already.)
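To make the "leave Unraid a whole core" idea concrete, here is a minimal Python sketch that splits logical CPUs into a host set and a VM set. The thread pairing in `ASSUMED_LAYOUT` (core N owning threads N and N+4) is only an assumption about how a 4-core/8-thread CPU is enumerated, and `split_cores` is a made-up helper name, so check the actual pairing on your own system before pinning anything.

```python
from typing import Dict, List, Tuple


def split_cores(
    siblings: Dict[int, Tuple[int, ...]], host_cores: int = 1
) -> Tuple[List[int], List[int]]:
    """Reserve the first `host_cores` physical cores (all threads) for the host.

    `siblings` maps a physical core id to the logical CPU ids that share it.
    Returns (host_cpus, vm_cpus) as sorted lists of logical CPU ids.
    """
    if not 0 < host_cores < len(siblings):
        raise ValueError("host_cores must leave at least one core for the VM")
    core_ids = sorted(siblings)
    host = sorted(cpu for core in core_ids[:host_cores] for cpu in siblings[core])
    vm = sorted(cpu for core in core_ids[host_cores:] for cpu in siblings[core])
    return host, vm


# ASSUMPTION: hypothetical layout for a 4-core/8-thread CPU where
# physical core N is shared by logical CPUs N and N+4.
ASSUMED_LAYOUT = {0: (0, 4), 1: (1, 5), 2: (2, 6), 3: (3, 7)}

if __name__ == "__main__":
    host_cpus, vm_cpus = split_cores(ASSUMED_LAYOUT)
    print("Leave for Unraid:", host_cpus)
    print("Pin to the VM:   ", vm_cpus)
```

The point of keeping sibling threads together is that a VM pinned to only one thread of a core still competes with the host for that core's execution resources.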

Link to comment
16 hours ago, swallace said:

How many of your cores/threads are you currently passing through to the VM? If you're passing all 4 cores and 8 threads to your VM right now, then that's almost certainly the issue. I'd say you want to leave Unraid with at least a whole core to itself (with both its threads). That still leaves you with 3 cores for the VM, and I would expect it to perform decently.

I've been passing 3 physical cores (6 threads) to the VM.  I've tried fewer cores/threads, but it still crashed eventually.

 

16 hours ago, swallace said:

(RAM could also be the issue, but I'm suspecting CPU based on what you've said. 32GB can't hurt, but as long as you aren't passing through all 16GB of your memory to the VM, I'd imagine you should be okay. I'd try giving maybe 12GB to the VM and leaving 4GB for Unraid to see how that runs, but maybe that's what you're doing already.)

I started out giving 12GB (out of 16GB total) to the VM, and it crashed.  I tried lowering that to 8GB and 4GB to no avail.  When my next 16GB of RAM arrives, I'll try that first without changing anything else.  If it continues crashing with more RAM, I'll try lowering the number of cores allocated to the VM and report back.

Link to comment

Settings -> Syslog Server -> turn on mirroring to flash.

Then next time it hangs, reboot and attach the syslog saved to flash.

That's the only way to see the log of what happens shortly before the crash.
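Once you have the mirrored syslog, a hard lockup usually shows up as a jump in the timestamps rather than as an error message. Here is a rough Python sketch that finds the largest gap between consecutive entries. It assumes the usual `Mon DD HH:MM:SS` prefix on each line, and since those timestamps carry no year you have to supply one; `largest_gap` and `parse_stamp` are names I made up for illustration. A long idle period can also produce a big gap, so treat the result as a pointer to where to start reading, not as proof of a crash.

```python
import re
import sys
from datetime import datetime
from typing import Iterable, Optional, Tuple

# Matches a classic syslog prefix such as "Mar  7 21:10:03 ".
STAMP = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2})\s")


def parse_stamp(line: str, year: int) -> Optional[datetime]:
    """Return the timestamp at the start of a syslog line, or None."""
    match = STAMP.match(line)
    if match is None:
        return None
    stamp = " ".join(match.group(1).split())  # collapse the padded day
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def largest_gap(
    lines: Iterable[str], year: int
) -> Optional[Tuple[float, str, str]]:
    """Return (gap_seconds, last_line_before, first_line_after) or None."""
    best = None
    prev_time, prev_line = None, None
    for line in lines:
        current = parse_stamp(line, year)
        if current is None:
            continue  # skip lines without a recognizable timestamp
        if prev_time is not None:
            gap = (current - prev_time).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, prev_line.rstrip("\n"), line.rstrip("\n"))
        prev_time, prev_line = current, line
    return best


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_gap.py <path to the mirrored syslog>
    with open(sys.argv[1], errors="replace") as handle:
        result = largest_gap(handle, year=2020)
    if result is not None:
        seconds, before, after = result
        print(f"Largest gap: {seconds:.0f} s")
        print("Last line before:", before)
        print("First line after:", after)
```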

 

PS: Avoid taking screenshots where possible. You are better off copying and pasting the text into Notepad and attaching the text files to the forum post instead of taking a screenshot.

Link to comment
On 3/5/2020 at 2:02 AM, testdasi said:

Settings -> Syslog Server -> turn on mirroring to flash.

Then next time it hangs, reboot and attach the syslog saved to flash.

That's the only way to see the log of what happens shortly before the crash.

 

PS: Avoid taking screenshots where possible. You are better off copying and pasting the text into Notepad and attaching the text files to the forum post instead of taking a screenshot.

That log has never contained anything useful right before the crash. In the attached syslog, the crash happens at 21:12, and there are no log messages at that time. The system locks up without any apparent hint as to why. Here are the diagnostics as well.

syslog.txt

server-diagnostics-20200307-2128.zip

Edited by vtor
Link to comment
  • 2 weeks later...
12 hours ago, vtor said:

Wanted to bump this. I can reliably reproduce the locking behavior simply by running prime95 in my VM. Running prime95 directly on unRAID causes no issues at all (aside from expected system sluggishness from stressing the CPU).

Unfortunately this kind of instability is incredibly hard to diagnose.

 

The only obvious thing I can see is that you are running DDR3 at 1600MHz. It has been a long time since I had DDR3, but I think that's an overclocked speed, isn't it? Overclocked as in above stock DDR3 speed, not above the manufacturer's spec (which could be a certified overclock).

So perhaps run your RAM at a lower frequency or run memtest.

 

Also turn off Turbo Boost.

Link to comment
On 3/19/2020 at 8:06 AM, testdasi said:

Unfortunately this kind of instability is incredibly hard to diagnose.

 

The only obvious thing I can see is that you are running DDR3 at 1600MHz. It has been a long time since I had DDR3, but I think that's an overclocked speed, isn't it? Overclocked as in above stock DDR3 speed, not above the manufacturer's spec (which could be a certified overclock).

So perhaps run your RAM at a lower frequency or run memtest.

 

Also turn off Turbo Boost.

A combination of turning off XMP for the RAM, turning off Turbo Boost, and underclocking my CPU to 3.8 GHz (down from 4 GHz) seems to work! I’ve been playing for 3-4 hours with no crashes, when I previously couldn’t go more than 15 minutes without a crash. Thank you!
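For anyone who wants to confirm from the OS that Turbo Boost really is off after changing it, here is a small Python sketch. It assumes the host is using the Linux `intel_pstate` driver, which exposes a `no_turbo` file under sysfs (a value of `1` means turbo is disabled); if a different CPU frequency driver is in use the file will not exist and the function returns `None`. The function names are my own.

```python
from pathlib import Path
from typing import Optional

# ASSUMPTION: only present when the intel_pstate driver is active.
NO_TURBO = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")


def turbo_disabled(raw: str) -> bool:
    """Interpret the contents of no_turbo: "1" means turbo is off."""
    return raw.strip() == "1"


def read_turbo_state(path: Path = NO_TURBO) -> Optional[bool]:
    """Return True if turbo is off, False if on, None if unknown."""
    try:
        return turbo_disabled(path.read_text())
    except OSError:
        return None  # file missing or unreadable: driver not in use


if __name__ == "__main__":
    state = read_turbo_state()
    if state is None:
        print("Could not read turbo state (intel_pstate not in use?)")
    else:
        print("Turbo Boost is", "disabled" if state else "enabled")
```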
