TIE Fighter Posted June 12 Hi all. My Unraid server, with two Windows 10 VMs running, force shuts down after about 40 minutes of gameplay testing, and the syslog shows the warning "critical temperature reached, shutting down". Cinebench testing on each VM runs without problems. The server was updated recently, but I can't remember having this problem before. I'm not sure what the correct way is to find the device associated with the thermal_zone0 name. Something in the /sys/class/thermal/thermal_zone0 folder? Sensors output attached. I have searched for why this is happening, but to no avail. Any help would be much appreciated. Syslog attached: syslog-previous
Kilrah Posted June 12 What were the drive temps? I'm only aware of drive-temperature-related auto-shutdown on Unraid.
TIE Fighter Posted June 12 No high-temperature warnings from any of the drives, which sit in a well-ventilated gaming case.
JorgeB Posted June 13 HDD temps should not cause a shutdown; usually only CPU overheating does, and that is controlled by the kernel and firmware, not Unraid.
Kilrah Posted June 13 16 minutes ago, JorgeB said: "HDD temps should not cause shutdown" The Parity Check Tuning plugin has a feature to shut down based on drive temps, although it's not clear whether it's active only during a parity operation or all the time. OP doesn't have it installed though, so yeah, not that.
JorgeB Posted June 13 Yep, and that should be logged as coming from the kernel.
itimpi Posted June 13 3 hours ago, Kilrah said: "...temps, although it's not clear whether it's only active during a parity op or all the time." I would have to check the code, but I think it is only active during a parity check, although it would be easy to adjust it to always be active. However, if a shutdown is triggered that way you end up with messages in the syslog from the plugin, plus notifications (assuming you get a chance to see them), so it would be obvious what triggered it.
Mainfrezzer Posted June 13 (edited June 13) 17 hours ago, TIE Fighter said: "I'm not sure what the correct way to find the device associated with the thermal_zone0 name is. something in /sys/class/thermal/thermal_zone0 folder?" "sensors -u" should show you the devices listed under thermal_zone0. Edit: Although, it seems like this might be the culprit: 17 hours ago, TIE Fighter said: [quote not preserved] That seems like an oddly low critical temp, for anything really. You could try booting with "thermal.nocrt=1" to disable the automatic shutdown feature, then redo what you did when it originally triggered while keeping an eye on the sensors to see which one hits critical.
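The thermal_zone0 question can also be approached from sysfs directly. On Linux, each directory under /sys/class/thermal/thermal_zone* normally holds a `type` file naming the zone, a `temp` file in millidegrees Celsius, and `trip_point_N_type` / `trip_point_N_temp` pairs that include the critical threshold. Below is a minimal sketch, assuming a Python 3 interpreter is available on the server; the function names `describe_zone` and `list_zones` are made up for illustration and are not part of any Unraid tool.

```python
from pathlib import Path


def read_text(path):
    """Return the stripped file contents, or None if the file cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def to_celsius(raw):
    """Convert a sysfs millidegree string to degrees Celsius, or None."""
    try:
        return int(raw) / 1000
    except (TypeError, ValueError):
        return None


def describe_zone(zone_dir):
    """Collect the type, current temperature, and trip points of one zone."""
    zone_dir = Path(zone_dir)
    trips = []
    for type_file in sorted(zone_dir.glob("trip_point_*_type")):
        # trip_point_0_type pairs with trip_point_0_temp, and so on.
        temp_file = type_file.with_name(type_file.name.replace("_type", "_temp"))
        trips.append((read_text(type_file), to_celsius(read_text(temp_file))))
    return {
        "name": zone_dir.name,
        "type": read_text(zone_dir / "type"),
        "temp_c": to_celsius(read_text(zone_dir / "temp")),
        "trips": trips,
    }


def list_zones(base="/sys/class/thermal"):
    """Describe every thermal_zone* directory found under base."""
    base = Path(base)
    if not base.is_dir():
        return []
    return [describe_zone(z) for z in sorted(base.glob("thermal_zone*"))]


if __name__ == "__main__":
    for zone in list_zones():
        print(f"{zone['name']}: type={zone['type']} temp={zone['temp_c']} C")
        for trip_type, trip_temp in zone["trips"]:
            print(f"  trip {trip_type}: {trip_temp} C")
```

The `type` value is the name to match against the adapter names in the sensors output, and a trip point of type `critical` shows the temperature at which the kernel shuts the machine down.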
TIE Fighter Posted June 13 (edited June 14) 23 hours ago, itimpi said: "I would have to check the code, but I think it is only active during a parity check although it would be easy to adjust it to always be active. However if it is triggered that way you end up with messages in the syslog from the plugin and notifications (assuming you get a chance to see them) so it would be obvious what triggered it." No "parity check tuning" plugin is installed, as I do not have a parity drive in the array yet. 23 hours ago, Mainfrezzer said: ""sensors -u" should show you the devices listed under thermal_zone0 Edit: Although, it seems like this might be the culprit That seems oddly low of a critical temp, for anything really. You could try to start with "thermal.nocrt=1" to disable the automatic shutdown feature and re-do what you did when it originally triggered while having an eye on the sensor to see which one hit critical." I disabled the "corefreq" plugin daemon and uninstalled the plugin. I gamed on both VMs for about one hour with no forced shutdown yet. However, one VM was crashing with "vfio-pci 0000:4a:00.0: vfio_bar_restore: reset recovery - restoring BARs" in the syslog. I did some more reading in the forums and added "pcie_aspm=off" to the syslinux config file on the flash drive, and that seems to have solved it. I'll return if the critical temp issue persists after more testing.
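After editing the syslinux config and rebooting, one way to confirm that a parameter such as "pcie_aspm=off" actually reached the kernel is to look at /proc/cmdline, which holds the command line the running kernel was booted with. A small sketch, again assuming Python 3 is available; `active_kernel_params` is a hypothetical helper name, and the parsing is deliberately simple (it does not handle quoted values containing spaces).

```python
def active_kernel_params(cmdline):
    """Split a kernel command line into a {name: value} mapping.

    Flags without '=' map to None. Quoted values with spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as fh:
            params = active_kernel_params(fh.read())
    except OSError:
        params = {}
    # Report the two parameters discussed in this thread.
    for name in ("pcie_aspm", "thermal.nocrt"):
        if name in params:
            print(f"{name} is set to {params[name]!r}")
        else:
            print(f"{name} is not on the kernel command line")
```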
TIE Fighter Posted June 14 (edited June 14) On 6/13/2024 at 1:55 PM, Mainfrezzer said: ""sensors -u" should show you the devices listed under thermal_zone0 Edit: Although, it seems like this might be the culprit That seems oddly low of a critical temp, for anything really. You could try to start with "thermal.nocrt=1" to disable the automatic shutdown feature and re-do what you did when it originally triggered while having an eye on the sensor to see which one hit critical." I did some more test gaming, and yet again the server shut down automatically because the critical temperature was reached, again after about 40 minutes of gameplay. Where do you suggest I put "thermal.nocrt=1"? Entered in the terminal, it gives "command not found".
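"thermal.nocrt=1" is a kernel boot parameter rather than a shell command, which is why the terminal reports "command not found". It belongs in the same place as the "pcie_aspm=off" entry mentioned earlier: on the `append` line of the syslinux config on the flash drive, followed by a reboot. The sketch below only illustrates the text change on a sample config. The sample contents, the "Unraid OS" label name, and the `add_kernel_param` helper are assumptions for illustration; check them against the actual file before changing anything, and treat the parameter as a temporary diagnostic setting since it disables the kernel's critical-temperature shutdown.

```python
# Hypothetical sample; a real syslinux.cfg may have different labels and lines.
SAMPLE_CFG = """label Unraid OS
  menu default
  kernel /bzimage
  append initrd=/bzroot pcie_aspm=off
label Unraid OS Safe Mode
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
"""


def add_kernel_param(cfg_text, param, label):
    """Append param to the 'append' line of the given boot label only.

    Lines under other labels are left alone, and the parameter is not
    added a second time if it is already present.
    """
    current_label = None
    out = []
    for line in cfg_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0].lower() == "label":
            current_label = " ".join(tokens[1:])
        elif (
            tokens
            and tokens[0].lower() == "append"
            and current_label == label
            and param not in tokens[1:]
        ):
            line = line.rstrip() + " " + param
        out.append(line)
    trailing = "\n" if cfg_text.endswith("\n") else ""
    return "\n".join(out) + trailing


if __name__ == "__main__":
    print(add_kernel_param(SAMPLE_CFG, "thermal.nocrt=1", "Unraid OS"))
```

Limiting the change to one boot label keeps the other menu entries untouched, so a boot option without the parameter remains available.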