amstel Posted January 29, 2018 (edited)

Hi, I'm running Win10 in a VM, using CPU isolation for CPUs 1,2,3,5,6,7 out of 8. My XML settings for the CPUs:

<cputune>
  <vcpupin vcpu='0' cpuset='1'/>
  <vcpupin vcpu='1' cpuset='2'/>
  <vcpupin vcpu='2' cpuset='3'/>
  <vcpupin vcpu='3' cpuset='5'/>
  <vcpupin vcpu='4' cpuset='6'/>
  <vcpupin vcpu='5' cpuset='7'/>
  <emulatorpin cpuset='0,4'/>
</cputune>

I have also noticed that after moving to unRAID 6.4 the automatic settings changed the order of the CPUs, so I also tried this:

<cputune>
  <vcpupin vcpu='0' cpuset='1'/>
  <vcpupin vcpu='1' cpuset='5'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='6'/>
  <vcpupin vcpu='4' cpuset='3'/>
  <vcpupin vcpu='5' cpuset='7'/>
  <emulatorpin cpuset='0,4'/>
</cputune>

I'm using software like Matlab that does calculations. The software uses 100% of all 6 cores, and I run it for 15 hours to finish some calculations. I leave it running overnight, but in the morning the VM doesn't wake up when I move the mouse or hit the keyboard. TeamViewer also shows the VM as offline. unRAID's dashboard shows all 6 CPUs at 98%-100% even while the VM is stuck. The VM log and unRAID's log don't show anything irregular.

Any ideas what the problem is?

My HW for the VM:
- 6 isolated CPUs.
- 12GB RAM.
- unRAID's cache SSD also contains the VM's C (main) drive.
- Passthrough of a GTX 1060 6GB with 3 connected monitors.
- Windows auto turn-off of the monitor is OFF.
- Screen saver turns on after 5 minutes (the problem also occurs without the screen saver).

Where do I start investigating the issue? Thanks.
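As a rough sanity check on pinning layouts like the two in the post, a short Python sketch can parse the <cputune> block and report whether any vCPU shares a host CPU with the emulator threads. The function names parse_cpuset and check_pinning are made up for illustration, and the parser is an assumption-laden simplification: it handles only comma lists and simple ranges, not the ^ exclusion syntax libvirt also accepts. The XML is the second layout from the post.

```python
import xml.etree.ElementTree as ET

# Second pinning layout from the post, copied as-is.
CPUTUNE_XML = """
<cputune>
  <vcpupin vcpu='0' cpuset='1'/>
  <vcpupin vcpu='1' cpuset='5'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='6'/>
  <vcpupin vcpu='4' cpuset='3'/>
  <vcpupin vcpu='5' cpuset='7'/>
  <emulatorpin cpuset='0,4'/>
</cputune>
"""


def parse_cpuset(text):
    """Expand a simple cpuset string such as '0,4' or '1-3' into a set of ints.

    Simplification: the '^' exclusion syntax is not handled.
    """
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        elif part:
            cpus.add(int(part))
    return cpus


def check_pinning(xml_text):
    """Return the vCPU-to-host map, the emulator CPUs, and any overlap between them."""
    root = ET.fromstring(xml_text)
    vcpu_to_host = {
        int(pin.get("vcpu")): parse_cpuset(pin.get("cpuset"))
        for pin in root.findall("vcpupin")
    }
    emulator_pin = root.find("emulatorpin")
    emulator = parse_cpuset(emulator_pin.get("cpuset")) if emulator_pin is not None else set()
    pinned = set().union(*vcpu_to_host.values())
    return {"vcpu_to_host": vcpu_to_host, "emulator": emulator, "overlap": pinned & emulator}


result = check_pinning(CPUTUNE_XML)
print(result["overlap"] or "no vCPU shares a host CPU with the emulator threads")
```

This only checks the XML for internal consistency. It says nothing about which host CPUs are hyperthread siblings on a given machine, which still has to be read from the host itself.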
NewDisplayName Posted January 29, 2018

I would check the Windows Event Viewer (in German it's Ereignisanzeige).
amstel Posted January 29, 2018

2 hours ago, nuhll said: I would check windows "show event" function. Dont know excact english name, in german its Ereignisanzeige...

Checked that. Couldn't find anything helpful there.
1812 Posted January 29, 2018

Did the problem start after moving to 6.4, or was it there before?
amstel Posted January 29, 2018

1 minute ago, 1812 said: did the problem start after moving to 6.4 or was it there before?

I started using the software after moving to version 6.4, so I did not check on previous versions. Do you think that downgrading to a 6.3.x version might solve the issue?
NewDisplayName Posted January 29, 2018 (edited)

I would not downgrade. unRAID doesn't seem to have a problem, so I think it's your VM. Can you limit your program to, e.g., 95% CPU? What happens when you connect to the VM via the built-in VNC?
amstel Posted January 29, 2018

7 minutes ago, nuhll said: I would not downgrade, since unraid doesnt seem to have a problem, i think its your VM. Can you limit ur prog to 95% e.g.? What happens when u connect to the Vm via the builtin VNC?

Thanks for replying.
1. How do I limit the program to 95%/90%?
2. Didn't try that yet. I will give it a try.
1812 Posted January 29, 2018

9 hours ago, amstel said: I started using the software after moving to version 6.4, so I did not check in previous versions. do you think that downgrading to 6.3.x version might solve that issue?

No, just trying to isolate the problem area. Stay on 6.4.

Is this software you are running in another OS, like Windows? If so, what guide did you follow to set it up? If you didn't follow a guide, then there is a decent chance that is your problem.
amstel Posted January 30, 2018

7 hours ago, 1812 said: no, just trying to isolate the problem area. stay on 6.4 is this software you are running in another os, like windows? if so, what guide did you follow to set it up? If you didn't follow a guide, then there is a decent chance that is your problem.

Well, the software configuration is just fine. I ran the software all night on VNC instead of passthrough, and it worked fine. It didn't get stuck.
SSD Posted January 30, 2018

It sounds as though memory may be filling up in the Windows VM and processes are starting to get cancelled. The fact that the CPU is still pegged means that the VM is still running. Will the process complete eventually and CPU utilization come back down to baseline? Or is it your assessment that the process is hung and the high CPU indicates some sort of infinite loop?

Obviously this is not something other users are observing. But running at that level of utilization is not common, except maybe for crypto mining.

The fact you did not see the problem with VNC may just be coincidence, but it may be very relevant.

Curious if you are using ACS override? I know it can cause some issues.

I would try the following, in no particular order:

- Upgrade the GPU drivers. If the only symptom of failure relates to use of the video card, it is worth a try. You could also update your motherboard BIOS.

- Try running both VNC (or something like SplashTop or NoMachine) and the graphics card at the same time with the passthrough in place. It would be interesting to see if the other video interface continues to operate after passthrough shuts down. You mention TeamViewer is showing offline, so I'm not optimistic here. If the network goes, none of these types of tools will work.

- Set up a batch file that outputs CPU and memory utilization to a text file on your cache and on your c: drive. I'm sure you can find command line programs to output this type of data that can be piped to two different files. Also output the date/time. Have it sleep for, say, 5 minutes between logs. It would be interesting to see if the logging continues even after the display shuts down, and if the c: drive logging continues after the network logging stops. (After a reboot you should see the c: drive logging.) It is possible that resources (like memory) are being consumed and processes / services are shutting down, so you may see memory utilization increasing. And if the local log goes on longer than the network log, this would imply some sort of gradual loss of function rather than a single event that shuts everything down. Multiple runs that are pretty consistent in terms of duration before the logging stops would point to a resource / software issue, whereas a large amount of variance would point to something more random like hardware or heat.

- Download Prime95. It is a program that can stress the CPU to similar levels as your application, and it is frequently used to verify a stable overclock. See if running Prime95 has similar results to the program you are running. Prime95 does not leak memory, and it would take your proprietary app, which no one else could try and reproduce with, out of the picture. Be careful, as Prime95, depending on the settings, can produce a lot of heat. There is a blended test that may be suitable. But I would run it and watch your CPU temps for at least 15-30 mins to make sure they are not getting out of hand before letting it run unattended. Tests that say no memory mean the CPU is able to run harder, because memory accesses give the CPU a little breathing room. The harder the CPU is pushed, the hotter it runs.

- Try playing with the CPU allocation. Remove the emulator pin. Remove another core. See if reducing the extent of CPU engagement has an effect (e.g., delays or eliminates the hang).

Good luck. These types of issues are complex to debug. Hope this gives some ideas of how to narrow it down.
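The logging idea in the list above could look something like the following Python sketch, used here in place of a batch file. Everything named in it is an assumption for illustration: the function names, the log line format, and the reliance on the third-party psutil package for the readings. Each reading is appended to every target file, and a target that stops being writable (for example a network share that has gone away) is skipped so the other log keeps going.

```python
import os
import time
from datetime import datetime


def default_sampler():
    """Return (cpu_percent, memory_percent). Assumes the psutil package is installed."""
    import psutil  # third-party; imported lazily so the rest works without it
    return psutil.cpu_percent(interval=1), psutil.virtual_memory().percent


def log_usage(paths, samples, interval_s=300, sampler=default_sampler):
    """Append `samples` timestamped CPU/memory readings to every file in `paths`."""
    for i in range(samples):
        cpu, mem = sampler()
        stamp = datetime.now().isoformat(timespec="seconds")
        line = f"{stamp} cpu={cpu:.1f}% mem={mem:.1f}%\n"
        for path in paths:
            try:
                with open(path, "a", encoding="utf-8") as fh:
                    fh.write(line)
                    fh.flush()
                    os.fsync(fh.fileno())  # push the line to disk before a possible hang
            except OSError:
                # One target failing (e.g. the share is gone) must not stop the others.
                pass
        if i < samples - 1:
            time.sleep(interval_s)


# Hypothetical usage inside the VM; both paths are placeholders:
# log_usage([r"C:\vm_usage.log", r"\\TOWER\cache\vm_usage.log"], samples=200)
```

With the default 5 minute interval, 200 samples cover a bit under 17 hours, which is longer than the 15 hour run described in the first post. Comparing the last timestamp in the local file with the last one in the share file shows which of the two stopped first.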
NewDisplayName Posted January 30, 2018

6 hours ago, amstel said: well, the software configuration is just fine. I ran the software all night running on VNC instead of passthru, and it is working fine.. didn't get stuck.

That's interesting. So the passthrough might be the problem. If you don't need the GPU, then you're fine, I guess.
amstel Posted January 30, 2018

7 hours ago, nuhll said: Thats interesting. So this might be the problem. If u dont need GPU, then youre fine, i guess.

Well, I do need the GPU. I also mine with it, and it never got stuck.

8 hours ago, SSD said: It sounds as though memory may be filling up in the Windows VM and processes are starting to get cancelled. [...]

Thanks for the detailed reply.

I did try with and without the ACS override: same results.

Crypto mining is working well for me, even when doing it 24/7 (CPU + GPU).

GPU drivers are being updated constantly.

How can I run VNC + passthrough at the same time?

I have also checked, and this software consumes almost 100% of the CPU at run time but only a really small amount of RAM while running, 210MB.

I also ran the software on my laptop and on VNC, and it works great.

I also tried removing the emulator pin: same results.

I guess that if this turns out to be a lot of trouble, I will have to boot into Windows directly instead of unRAID and see how it works there.

Thanks for trying so far.
SSD Posted January 30, 2018

53 minutes ago, amstel said: crypto mining is working well for me, even when doing it 24/7 (CPU + GPU).

Sort of points back to an issue with Matlab. Similar to running Prime95, I guess.

53 minutes ago, amstel said: how can I run VNC + passthru at the same time?

Not sure if a VNC server is part of the VM by default. You can try to connect via VNC to the running VM. If that does not work, you can run a VNC server in the VM and then connect from the laptop or something. Splashtop Desktop and NoMachine are similar software. I had passthrough issues with my first VM install and used SplashTop to remote in while passthrough was active, and was able to fix the passthrough problem. Without it I was blind and would never have been able to get the driver installation issue solved.

53 minutes ago, amstel said: I also tried removing the emulator pin -->> same results.

Not surprised, but worth a chance. Thought maybe unRAID was being starved. Removing one CPU is another thing to try.

53 minutes ago, amstel said: well I guess that if this would be a lot of trouble for me I would have to boot into the winOS instead to unRaid and see how it works there.

You could try on bare metal and see if it continues.

Did you try the logging? Might tell you more about the consistency of the hang on successive runs.
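To put a number on "consistency of the hang on successive runs", one option is to take the time between the first and last log entry of each run and compare the spread across runs. The sketch below assumes a log format in which every line starts with an ISO 8601 timestamp. That format, the function names, and the sample values are all illustrative, not taken from any real log in this thread.

```python
from datetime import datetime
from statistics import mean, pstdev


def run_duration_hours(log_lines):
    """Hours between the first and last timestamp of one run's log.

    Assumes each non-empty line begins with an ISO 8601 timestamp.
    """
    stamps = [datetime.fromisoformat(line.split()[0]) for line in log_lines if line.strip()]
    return (stamps[-1] - stamps[0]).total_seconds() / 3600


def spread(durations):
    """Return (mean, coefficient of variation) for a list of run durations."""
    average = mean(durations)
    return average, pstdev(durations) / average


# Made-up example: three runs that each logged for a different length of time.
example_durations = [6.5, 6.0, 7.0]
average, variation = spread(example_durations)
print(f"mean {average:.2f} h, coefficient of variation {variation:.2f}")
```

A small coefficient of variation would fit the "resource / software" reading from the earlier post, and a large one the "hardware or heat" reading. Where to draw the line between small and large is a judgment call and is not something this sketch decides.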
amstel Posted January 31, 2018

8 hours ago, SSD said: [...] Did you try the logging? Might tell you more about the consistency of the hang on successive runs.

Will try that, but each of these tests takes a lot of time.

I don't know if it is related, but I'm also now trying "CPU Mode" "Emulated" instead of "Host Passthrough". Maybe I should also try switching between SeaBIOS and OVMF.

Instead of using the OS as is, I've become a tester for the OS.