VM gets stuck when using 100% of all CPUs for a long time


amstel


Hi,

 

I'm running Win10 in a VM, using CPU isolation for CPUs 1,2,3,5,6,7 out of 8.

 

my xml settings for the cpus:

  <cputune>
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='2'/>
    <vcpupin vcpu='2' cpuset='3'/>
    <vcpupin vcpu='3' cpuset='5'/>
    <vcpupin vcpu='4' cpuset='6'/>
    <vcpupin vcpu='5' cpuset='7'/>
    <emulatorpin cpuset='0,4'/>
  </cputune>

I also noticed that after moving to unRAID 6.4,

the automatic settings changed the order of the CPUs, so I also tried this:

  <cputune>
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='5'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='6'/>
    <vcpupin vcpu='4' cpuset='3'/>
    <vcpupin vcpu='5' cpuset='7'/>
    <emulatorpin cpuset='0,4'/>
  </cputune>
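
A layout like the ones above can be sanity-checked by parsing the `<cputune>` block and confirming that no host CPU is shared between the vCPU pins and the emulator pin. The following is only a minimal Python sketch under stated assumptions: the helper names `parse_cpuset` and `check_cputune` are made up for illustration, and the parser handles only comma lists and simple ranges such as `0-3`, not the full libvirt cpuset syntax.

```python
import xml.etree.ElementTree as ET

# The second layout from the post, copied as-is.
CPUTUNE = """
<cputune>
  <vcpupin vcpu='0' cpuset='1'/>
  <vcpupin vcpu='1' cpuset='5'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='6'/>
  <vcpupin vcpu='4' cpuset='3'/>
  <vcpupin vcpu='5' cpuset='7'/>
  <emulatorpin cpuset='0,4'/>
</cputune>
"""


def parse_cpuset(text):
    """Expand '0,4' or '1-3' into a set of host CPU numbers (hypothetical helper)."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def check_cputune(xml_text):
    """Return (vcpu -> host CPUs, emulator CPUs, CPUs used by both)."""
    root = ET.fromstring(xml_text.strip())
    vcpu_pins = {
        int(pin.get("vcpu")): parse_cpuset(pin.get("cpuset"))
        for pin in root.findall("vcpupin")
    }
    emulator = set()
    for pin in root.findall("emulatorpin"):
        emulator |= parse_cpuset(pin.get("cpuset"))
    pinned = set()
    for cpus in vcpu_pins.values():
        pinned |= cpus
    return vcpu_pins, emulator, pinned & emulator


vcpu_pins, emulator, overlap = check_cputune(CPUTUNE)
print("vCPU pins:", vcpu_pins)
print("emulator CPUs:", sorted(emulator))
print("shared CPUs:", sorted(overlap))
```

An empty "shared CPUs" list means the emulator threads and the vCPUs are not competing for the same host CPU.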

 

 

I'm using software like Matlab that does calculations.

 

the software uses 100% of all 6 cores.

I run that software for 15 hours to finish some calculations.

I leave it running overnight,

but in the morning the VM doesn't wake up when I move the mouse or hit the keyboard.

TeamViewer also shows the VM as offline.

 

unRAID's dashboard shows all 6 CPUs at 98%-100% even while the VM is stuck.

 

The VM log and unRAID's log don't show anything irregular.

 

well,

any ideas what the problem is?

 

my HW  for the VM is:

6 isolated CPUS.

12GB RAM.

unraid's cache SSD also contains the VM's C (main) drive.

passthru of GTX 1060 6GB with 3 connected monitors.

windows auto turn off monitor is OFF.

screen saver turns on after 5 minutes (problem also occurs without screensaver on).

 

 

where do I start investigating the issue?

 

Thanks.

Edited by amstel
Link to comment
1 minute ago, 1812 said:

did the problem start after moving to 6.4 or was it there before?

 

 

 

I started using the software after moving to version 6.4, so I did not check in previous versions.
do you think that downgrading to 6.3.x version might solve that issue?

 

Link to comment
7 minutes ago, nuhll said:

I would not downgrade, since unRAID doesn't seem to have a problem; I think it's your VM. Can you limit your program to, e.g., 95%?


What happens when you connect to the VM via the built-in VNC?

 

thanks for replying.

1. how do I limit the prog to 95%/90% ..?

2. didn't try... I will give it a try..

Link to comment
9 hours ago, amstel said:

 

I started using the software after moving to version 6.4, so I did not check in previous versions.
do you think that downgrading to 6.3.x version might solve that issue?

 

 

no, just trying to isolate the problem area. stay on 6.4

 

is this software you are running in another os, like windows? if so, what guide did you follow to set it up? If you didn't follow a guide, then there is a decent chance that is your problem.

Link to comment
7 hours ago, 1812 said:

 

no, just trying to isolate the problem area. stay on 6.4

 

is this software you are running in another os, like windows? if so, what guide did you follow to set it up? If you didn't follow a guide, then there is a decent chance that is your problem.

 

well, the software configuration is just fine.

 

I ran the software all night using VNC instead of GPU passthrough,

and it worked fine; it didn't get stuck.

 

 

Link to comment

It sounds as though memory may be filling up in the Windows VM and processes are starting to get cancelled. The fact that the CPU is still pegged means that the VM is still running. Will the process complete eventually and CPU utilization come back down to baseline? Or is it your assessment that the process is hung and the high CPU usage indicates some sort of infinite loop?

 

Obviously this is not something other users are observing. But running at that level of utilization is not common, except maybe crypto mining.

 

The fact you did not see the problem with VNC may just be coincidence, but it may be very relevant.

 

Curious if you are using ACS override? I know it can cause some issues.

 

I would try the following in no particular order:

- Upgrade the GPU drivers. If the only symptom of failure relates to use of the video card, it is worth a try. You could also update your motherboard BIOS.

 

- Try running both VNC (or something like SplashTop or NoMachine) and graphics card at the same time with the passthru in place. Would be interesting to see if the other video interface continues to operate after passthrough shuts down. You mention teamviewer is showing offline, so not optimistic here. If the network goes, none of these types of tools will work.

 

- Set up a batch file that writes CPU and memory utilization, along with the date/time, to a text file on your cache and to another on your c: drive. I'm sure you can find command-line programs that output this type of data, which can be piped to two different files. Have it sleep for, say, 5 minutes between log entries. It would be interesting to see if the logging continues even after the display shuts down, and if the c: drive logging continues after the network logging stops. (After a reboot you should see the c: drive logging.) It is possible that resources (like memory) are being consumed and processes / services are shutting down, in which case you may see memory utilization increasing. And if the local logging goes on longer than the network logging, this would imply some sort of gradual loss of function vs. a single event that shuts everything down. Multiple runs that are pretty consistent in terms of duration before the logging stops would point to a resource / software issue, whereas a large amount of variance would point to something more random like hardware or heat.

 

- Download Prime95. It is a program that can stress the CPU to similar levels as your application, and it is frequently used to verify that an overclock is stable. See if running Prime95 gives similar results to the program you are running. Prime95 does not leak memory, and using it would take your proprietary app, which no one else can try to reproduce with, out of the picture. Be careful, as Prime95, depending on the settings, can produce a lot of heat. There is a blended test that may be suitable. But I would run it and watch your CPU temps for at least 15-30 mins to make sure they are not getting out of hand before letting it run unattended. Tests described as using no memory let the CPU run harder, because memory accesses otherwise give the CPU a little breathing room. The harder the CPU is pushed, the hotter it runs.

 

- Try playing with the CPU allocation. Remove the emulator pin. Remove another core. See if reducing the extent of CPU engagement has an effect (e.g., delays or eliminates the hang)
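
The logging idea in the list above could look roughly like this in Python rather than a batch file. It is only a sketch under assumptions: the function names are made up, the default `read_usage` sampler relies on the third-party `psutil` package being installed in the VM, and the two paths in the usage comment are placeholders for a local file and a file on a network share. Each destination is written independently, so a failing share does not stop the local log.

```python
import os
import time
from datetime import datetime


def read_usage():
    """Return (cpu_percent, memory_percent).

    Assumption: the third-party psutil package is installed in the VM.
    """
    import psutil
    return psutil.cpu_percent(interval=1), psutil.virtual_memory().percent


def format_line(now, cpu, mem):
    """One log line: ISO timestamp, CPU % and memory %."""
    return f"{now.isoformat(timespec='seconds')} cpu={cpu:.1f}% mem={mem:.1f}%\n"


def log_once(paths, sampler=read_usage, now=None):
    """Append one sample to every path; return the line and the paths that failed."""
    now = now or datetime.now()
    cpu, mem = sampler()
    line = format_line(now, cpu, mem)
    failed = []
    for path in paths:
        try:
            with open(path, "a", encoding="utf-8") as fh:
                fh.write(line)
                fh.flush()
                os.fsync(fh.fileno())  # push the entry to disk before a possible hang
        except OSError:
            failed.append(path)  # e.g. the share is gone; keep logging elsewhere
    return line, failed


def run(paths, interval_seconds=300, sampler=read_usage):
    """Log every interval_seconds (5 minutes by default) until interrupted."""
    while True:
        log_once(paths, sampler)
        time.sleep(interval_seconds)


# Example call with placeholder paths (a local file and a network share):
# run([r"C:\vm_usage.log", r"\\SERVER\share\vm_usage.log"])
```

After a hang, comparing the last timestamp in each file shows whether the network log stopped before the local one.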

 

Good luck. These types of issues are complex to debug. Hope this gives some ideas of how to narrow it down.

Link to comment
7 hours ago, nuhll said:

 

That's interesting. So this might be the problem. If you don't need the GPU, then you're fine, I guess.

well, I do need the GPU, I also mine with it, never got stuck.

 

thanks for the detailed reply.

well I did try with and without the ACS override -->> same results.

crypto mining is working well for me, even when doing it 24/7 (CPU + GPU).

GPU drivers are being updated constantly.

 

how can I run VNC + passthru at the same time?

 

I have also checked, and this software consumes almost 100% of the CPU at run time,

but only a really small amount of RAM while running, 210MB.

I also ran the software on my laptop and on VNC and it works great.

 

I also tried removing the emulator pin -->> same results.

 

 

well, I guess that if this turns out to be a lot of trouble, I will have to boot into the Windows OS instead of unRAID and see how it works there.

 

 

Thanks for trying till now.

 

Link to comment
53 minutes ago, amstel said:

crypto mining is working well for me, even when doing it 24/7 (CPU + GPU).

Sort of points back to an issue with Matlab. Similar to running Prime95 I guess.

 

53 minutes ago, amstel said:

how can I run VNC + passthru at the same time?

Not sure if a VNC server is part of the VM by default. You can try to connect via VNC to the running VM. If that does not work, you can run a VNC server in the VM and then connect from the laptop or something. Splashtop Desktop and NoMachine are similar software. I had passthrough issues with my first VM install and used SplashTop to remote in while passthrough was active and was able to fix the passthrough problem. Without it I was blind and would never have been able to get the driver installation issue solved.

 

53 minutes ago, amstel said:

I also tried removing the emulator pin -->> same results.

Not surprised, but worth a chance. Thought maybe unRAID was being starved. Removing one CPU is another thing to try. 

 

53 minutes ago, amstel said:

well, I guess that if this turns out to be a lot of trouble, I will have to boot into the Windows OS instead of unRAID and see how it works there.

You could try on bare metal and see if it continues.

 

Did you try the logging? Might tell you more about the consistency of the hang on successive runs.
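
To judge that consistency, the time from the first to the last log entry of each run can be compared across runs. Below is a minimal Python sketch, assuming each log line starts with an ISO 8601 timestamp; the function names are hypothetical and the sample runs are invented purely to show the calculation, they are not measurements from this system.

```python
from datetime import datetime
from statistics import mean, pstdev


def run_duration_hours(lines):
    """Hours between the first and last timestamped line of one run."""
    stamps = [datetime.fromisoformat(line.split()[0]) for line in lines if line.strip()]
    return (stamps[-1] - stamps[0]).total_seconds() / 3600


def summarize(runs):
    """Return (average hours until logging stopped, spread across runs)."""
    durations = [run_duration_hours(run) for run in runs]
    return mean(durations), pstdev(durations)


# Invented sample data: first and last line of three runs.
runs = [
    ["2018-01-20T20:00:00 cpu=99.0%", "2018-01-21T02:00:00 cpu=99.0%"],
    ["2018-01-21T20:00:00 cpu=99.0%", "2018-01-22T02:30:00 cpu=99.0%"],
    ["2018-01-22T20:00:00 cpu=99.0%", "2018-01-23T01:30:00 cpu=99.0%"],
]

average, spread = summarize(runs)
print(f"average hours before logging stopped: {average:.2f}")
print(f"spread (population std dev, hours): {spread:.2f}")
```

A small spread relative to the average would suggest a resource or software cause; a large spread would suggest something more random such as hardware or heat.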

Link to comment

will try that,

but each of these tests takes a lot of time..

 

I don't know if it is related,

but I'm also now trying "CPU Mode" "Emulated" instead of "Host Passthrough".

 

 

maybe I should also try changing between SeaBIOS and OVMF..

 

 

instead of using the OS as is I became a tester for the OS.

Link to comment
