Throttle if disks/cpu/sensors get too hot



Hi,

I would suggest adding an action to the warnings you get about high temperature.

 

Let's say a chassis fan or a fan controller dies and you are not at home: nothing would prevent damage. I think an overheating HDD/SSD/CPU (let's say HDD1) would slow itself down to prevent damage, but HDD2-22, which are at normal temperature, wouldn't slow down and would still generate more heat.

 

I don't know how this could best be achieved, but I think throttling to half speed (CPU clock, HDD reads and writes) at XX° and checking again after XX minutes would be sufficient.

 

I also don't know whether this behavior would be achievable with plugins.

Edited by nuhll
Link to comment

For the CPU, yes, but like I said, if not all equipment is at the same temperature this wouldn't help.

 

You understand? 

 

Say the HDDs are at 65° and the CPU is too: the CPU wouldn't throttle, because that's not hot for a CPU, but it is hot for HDDs... :)

 

Unraid could (if that is possible, I have no idea) just throttle all equipment if only one piece gets too hot, before everything burns down... :D
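
A minimal sketch of that idea in Python, assuming hypothetical sensor names and per-device limits (none of these values come from Unraid): if any single device reaches its own limit, the whole system is throttled to half speed.

```python
# Hypothetical per-device limits in °C; the values are illustrative assumptions.
LIMITS_C = {"cpu": 85.0, "hdd1": 55.0, "hdd2": 55.0}


def over_limit(readings_c, limits_c=LIMITS_C):
    """Return the names of devices at or above their own limit."""
    return sorted(
        name
        for name, temp in readings_c.items()
        if name in limits_c and temp >= limits_c[name]
    )


def speed_fraction(readings_c, limits_c=LIMITS_C, throttled=0.5):
    """Throttle the whole system if any single device is too hot."""
    return throttled if over_limit(readings_c, limits_c) else 1.0


# 65 °C is below the assumed CPU limit but above the assumed HDD limit,
# so everything is throttled, not just the hot disk.
print(speed_fraction({"cpu": 65.0, "hdd1": 65.0, "hdd2": 40.0}))  # 0.5
```

A real implementation would call `speed_fraction` again every XX minutes and then actually apply the result. How to slow down HDD reads and writes is left open here, as it is in the post.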

Edited by nuhll
Link to comment
14 minutes ago, michael123 said:

Isn't this behavior provided by the motherboard already?

 

The processor is throttled if getting too hot - actual behavior depending on manufacturer, model, Linux kernel, ...

 

No throttling of disks. But it is normally not possible to get a disk hot enough that it breaks if the disks aren't running too hot in the normal case when every fan is working. The major issue with high temperature is that it affects ageing. In general, the expected lifetime of electronics is halved for every 10°C higher temperature. But one day at 10°C higher temperature isn't affecting the total lifetime unless the drives were already running too hot before the temperature rise.
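
As a rough illustration of that rule of thumb in Python (the 40°C reference temperature is an assumption chosen for the example, and the halving rule is an approximation, not a drive specification):

```python
def relative_lifetime(temp_c, reference_c=40.0):
    """Expected lifetime relative to running at reference_c.

    Applies the rule of thumb: lifetime halves for every 10 °C increase.
    """
    return 0.5 ** ((temp_c - reference_c) / 10.0)


def equivalent_days(days, temp_c, reference_c=40.0):
    """Days of reference-temperature ageing consumed by `days` at temp_c."""
    return days / relative_lifetime(temp_c, reference_c)


print(relative_lifetime(50.0))     # 0.5
print(equivalent_days(1.0, 50.0))  # 2.0
```

Under this rule, one day at 10°C above the reference costs about two days of ageing, which is negligible against a lifetime of several years.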

 

With drives normally running at 35-40°C, it's a long way to reach above 60°C for a single fan failure. But it would be good if unRAID could send alarms for any measurable voltage, temperature or RPM found outside of configured range.
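
Such a range check could look roughly like this in Python. The sensor names and ranges are hypothetical examples, not unRAID settings:

```python
# Hypothetical configured ranges as (low, high); names and values are assumptions.
RANGES = {
    "disk1_temp_c": (5.0, 45.0),
    "fan1_rpm": (600.0, 3000.0),
    "12v_rail": (11.4, 12.6),
}


def find_alarms(readings, ranges=RANGES):
    """Return one message per reading that falls outside its configured range."""
    alarms = []
    for name, value in readings.items():
        if name not in ranges:
            continue  # no range configured for this sensor
        low, high = ranges[name]
        if not low <= value <= high:
            alarms.append(f"{name}={value} outside {low}..{high}")
    return alarms


print(find_alarms({"disk1_temp_c": 38.0, "fan1_rpm": 0.0, "12v_rail": 12.1}))
# ['fan1_rpm=0.0 outside 600.0..3000.0']
```

A stopped fan is reported here even though the disk temperature is still in range, which gives a warning before the heat builds up.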

 

If people have servers with drives failing from overheating, then the servers have normally been running continuously at a very high temperature or the ventilation in the computer room has failed and no one has bothered to respond. There are also multiple stories about people putting blankets etc. over a computer because of the noise. I know disks that survived people moving their brand new PS/2 model 80 into a closet. When they then heard it beep (because it was overheating) they additionally put blankets over it.

Link to comment
6 minutes ago, pwm said:

With drives normally running at 35-40°C, it's a long way to reach above 60°C for a single fan failure.

 

If you're using a 5in3 where the disks are very close together and that fan fails, they get hot rather quickly. I remember this happened to me a while ago, and I did make a feature request for auto shutdown at a set temp, but it's still not implemented unfortunately.

 

 

 

Link to comment

I try to use door-mounted fans at the front of the case to get push-pull. This removes the single point of failure.

 

I have tried mounting two fans after each other behind drives but got issues with awful interference sound if the two fans happened to run at almost the same speed.

 

It's very hard to throttle the different unRAID subsystems - especially for a system with Docker and KVM support. But support for a full shutdown should be quite simple to implement. If it's just the disk temperatures, then it would be possible to shut down dockers/KVM/mover/... and then stop the array and unmount all disks, but leave unRAID online.
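
The two levels described above could be sketched like this in Python. The step names are hypothetical labels for illustration, not real unRAID commands:

```python
# Steps that take the array offline but leave the server itself running.
ARRAY_OFFLINE_STEPS = [
    "stop_dockers",
    "stop_vms",
    "stop_mover",
    "stop_array",
    "unmount_disks",
]


def shutdown_plan(disk_overheat, system_overheat):
    """Return the ordered steps to take for the given overheat condition."""
    if system_overheat:
        # Anything beyond the disks is too hot: go all the way down.
        return ARRAY_OFFLINE_STEPS + ["power_off"]
    if disk_overheat:
        # Only the disks are too hot: stop the array but stay online.
        return list(ARRAY_OFFLINE_STEPS)
    return []


print(shutdown_plan(disk_overheat=True, system_overheat=False)[-1])
# unmount_disks
```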

Link to comment
2 hours ago, johnnie.black said:

 

If you're using a 5in3 where the disks are very close together and that fan fails, they get hot rather quickly. I remember this happened to me a while ago, and I did make a feature request for auto shutdown at a set temp, but it's still not implemented unfortunately.

 

 

 

 

I've asked for this since unRAID 4.4 or so. I think my request had an extra setting: if the temp got that high, it would give up any attempt at a clean shutdown and do its level best to just power down (i.e., "pull" the power button in for 5 seconds :))

 

Auto shutdown would help if the HVAC failed and ambient got crazy hot, or a fan broke, or fire, or who knows what. You'd just need a way to override it in case of a sensor failure so the server would stay booted!
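
A minimal Python sketch of those two settings plus the override. The threshold values are made up for the example:

```python
def shutdown_action(temp_c, clean_limit_c, hard_limit_c, override=False):
    """Decide what to do at the given temperature.

    clean_limit_c: try a clean shutdown at or above this temperature.
    hard_limit_c:  give up on a clean shutdown and just power down.
    override:      manual switch for a known-bad sensor; never shut down.
    """
    if override:
        return "none"
    if temp_c >= hard_limit_c:
        return "power_off"
    if temp_c >= clean_limit_c:
        return "clean_shutdown"
    return "none"


print(shutdown_action(58.0, clean_limit_c=55.0, hard_limit_c=65.0))
# clean_shutdown
```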

Link to comment

When I jumped on unRAID, version 5 was so lacking in supervision that I built a supervisor monitor that I have been running ever since. But that was a C++ application with no GUI for configuration. It collected all SMART + fans + temps + voltages + volume storage levels + CPU% and a couple of other metrics and fed them to an external MQTT broker, with the ability to do a shutdown in a real panic. I've had a real bad lesson costing significant $$$ with a system where the PSU started to oscillate the output voltages when it got warm.
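
This is not that C++ application, just a small Python sketch of one piece of such a monitor: pulling the temperature out of a SMART attribute table and packing it into a JSON message that could be handed to an MQTT client. The sample text mimics the attribute table printed by `smartctl -A`; its values, the host name and the message fields are assumptions, and not every drive reports attribute 194.

```python
import json

# Illustrative sample in the layout of a SMART attribute table.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11234
194 Temperature_Celsius     0x0022   112   098   000    Old_age   Always       -       38 (Min/Max 20/52)
"""


def parse_temperature(smart_attributes):
    """Return the raw value of attribute 194, or None if it is not listed."""
    for line in smart_attributes.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "194":
            return int(fields[9])  # first token of the RAW_VALUE column
    return None


def build_message(host, disk, temp_c):
    """Serialise one reading as JSON (publishing it is left to an MQTT client)."""
    return json.dumps({"host": host, "disk": disk, "temp_c": temp_c}, sort_keys=True)


print(build_message("tower", "disk1", parse_temperature(SAMPLE)))
# {"disk": "disk1", "host": "tower", "temp_c": 38}
```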

Link to comment

Shutdown would at least be a way to stop all this without an atomic explosion.

 

Besides this, I would rather have a way to throttle Unraid before it shuts down, so it only shuts down if throttling is not enough.

 

Yesterday I had an HDD and an SSD at 61°... so this is not impossible to reach, even though more than 3/4 of the disks were spun down... but I have very bad airflow and need to adjust the fans and the spacing between HDDs anyway.

 

I may move my server parts from a 4 rack to a big tower... racks are just sh1t for temperature and airflow... or they sound like an airplane taking off...


But still, if any fans fail, e.g. the fans which move the heat out or in, the whole system could reach much higher temps than suggested... so an automatic way of handling this with a throttle would be really nice. I know limiting the CPU would be easy; whether you can limit the HDDs, I don't know.

Edited by nuhll
Link to comment

Okay, my Reds are rated at 65 C max, and sometimes while transferring files to my system for long periods (more than an hour at 1 Gbit) a disk or two reaches ~56-59 C, stays there for the duration of the transfer, and drops back below 45 C afterwards. I am not that worried since I did not hit the max...

Link to comment
33 minutes ago, Mat1926 said:

 

Why does WD claim that Reds can operate at 65 C? If temps close to that had any side effects, they would inform their customers...

 

65°C is the red hot limit, as in warranty void.

 

Just as you don't want to rev your car engine all the way to red continuously, you don't want a HDD to continuously run close to the maximum allowed temperature.

 

As I wrote earlier in the thread - every 10°C higher temperature halves the expected life of electronics. It doesn't mean the drive breaks - just that the probability increases.

 

Here is a link to Seagate where they discuss a drive where the MTBF is based on 40°C temperature.

http://knowledge.seagate.com/articles/en_US/FAQ/174791en?language=en_US

 

Link to comment
14 minutes ago, pwm said:

 

65°C is the red hot limit, as in warranty void.

 

Just as you don't want to rev your car engine all the way to red continuously, you don't want a HDD to continuously run close to the maximum allowed temperature.

 

As I wrote earlier in the thread - every 10°C higher temperature halves the expected life of electronics. It doesn't mean the drive breaks - just that the probability increases.

 

Here is a link to Seagate where they discuss a drive where the MTBF is based on 40°C temperature.

http://knowledge.seagate.com/articles/en_US/FAQ/174791en?language=en_US

 

 

I am not talking about Seagate here. And I don't know what you mean by warranty void, they clearly say that the operating temps are 0-65 for Red and 0-60 for the Red Pro, and they did not mention anything related to warranty at all...

Link to comment
Just now, Mat1926 said:

 

I am not talking about Seagate here. And I don't know what you mean by warranty void, they clearly say that the operating temps are 0-65 for Red and 0-60 for the Red Pro, and they did not mention anything related to warranty at all...

 

No, I know you are talking about WD RED. But it's the same with any manufacturer - the Seagate text is good for explaining how the drive manufacturers make their lifetime predictions. The Seagate Barracuda drives don't have a max temperature of 40°C, but it's 40°C that Seagate bases its estimates on.

 

65°C is a temperature you should not pass. And WD will not base their MTBF estimates on drive usage at 65°C even if that is the maximum temperature they have specified.

 

And just as with revving a car engine, the failure rate will go up if you run the drive close to 65°C compared to if you run it at 40°C or 30°C.

 

With 65°C being the maximum operating temperature - what do you think WD considers about their warranty if the lifetime high measured by the drive is above 65°C?

Link to comment
6 hours ago, pwm said:

 

No, I know you are talking about WD RED. But it's the same with any manufacturer - the Seagate text is good for explaining how the drive manufacturers make their lifetime predictions. The Seagate Barracuda drives don't have a max temperature of 40°C, but it's 40°C that Seagate bases its estimates on.

 

65°C is a temperature you should not pass. And WD will not base their MTBF estimates on drive usage at 65°C even if that is the maximum temperature they have specified.

 

And just as with revving a car engine, the failure rate will go up if you run the drive close to 65°C compared to if you run it at 40°C or 30°C.

 

With 65°C being the maximum operating temperature - what do you think WD considers about their warranty if the lifetime high measured by the drive is above 65°C?

 

This 100% makes sense and is consistent with forum recommendations. And those are that drives should operate in the upper 30Cs to low 40Cs, and that if you are going much over 45C, it is time to consider additional cooling options.

 

IMO, even 50C is way too hot.

 

But I have read that it is the temperature fluctuations that are worst for drives. So it could be that a drive consistently run at 55C will live a long, happy life, whereas a disk that is spun down (20Cs) most of the time, runs in the low 30Cs when it spins up, and spikes into the upper 50Cs is more negatively impacted.

Link to comment
6 hours ago, pwm said:

 

No, I know you are talking about WD RED. But it's the same with any manufacturer - the Seagate text is good for explaining how the drive manufacturers make their lifetime predictions. The Seagate Barracuda drives don't have a max temperature of 40°C, but it's 40°C that Seagate bases its estimates on.

 

65°C is a temperature you should not pass. And WD will not base their MTBF estimates on drive usage at 65°C even if that is the maximum temperature they have specified.

 

And just as with revving a car engine, the failure rate will go up if you run the drive close to 65°C compared to if you run it at 40°C or 30°C.

 

With 65°C being the maximum operating temperature - what do you think WD considers about their warranty if the lifetime high measured by the drive is above 65°C?

I had a discussion with Seagate about disk temps today, when I called about my RMA disk.

Older drives have a maximum operating temperature range of 5-50c, and newer drive series 5-60c.

Running above 60c would normally void my warranty according to Seagate, since it's outside operating specs.

 

MTBF for my drives is calculated based on 30c, but Seagate now prefers to use AFR since they think it's more accurate.

So I just checked their FAQ on that: http://knowledge.seagate.com/articles/en_US/FAQ/174791en?language=en_US

 

Their FAQ on that states:

Systems will provide adequate cooling to ensure the case temperatures do not exceed 40°C. Temperatures outside the specifications in Section 2.9 will increase the product AFR and decrease MTBF.

But of course that is quite hard to translate to drive temperature.
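
For reference, MTBF and AFR can be converted into each other if you assume a constant failure rate and a drive powered on all year. A small Python sketch of that textbook conversion (the 1,000,000-hour MTBF is only an example value, not a figure from Seagate):

```python
import math

HOURS_PER_YEAR = 8760.0  # assumes the drive is powered on 24/7


def afr_from_mtbf(mtbf_hours):
    """Annualized failure rate for a given MTBF, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


print(f"{afr_from_mtbf(1_000_000):.2%}")  # 0.87%
```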

 

I hope they will implement that feature, or maybe add it to the dynamix systems temp / fan plugin.

Sadly there is no array fan monitoring on my motherboard.

 

Link to comment
  • 2 weeks later...

This feature is sorely needed, my drives spin down a lot normally and run at about 35c when spun up.  But when the parity check runs, they're all going full speed and jump up to over 50c.  I don't want (and shouldn't need) to add more fans when everything operates within range normally.  I have a quiet unit and would like to keep it that way.  I think at least having an option to rate limit the parity check would be a nice feature.  Most enterprise RAID cards I've used have that option.  I understand this is not that, but it should still be doable.  I'd rather have my drives do a 2 day parity check at 30% than burn up at 100% for 6 hours.
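
A rough Python estimate of what such a rate limit would cost in time, assuming the duration scales inversely with the allowed rate (a simplification, since real checks slow down towards the inner tracks):

```python
def estimated_hours(full_speed_hours, rate_fraction):
    """Estimated duration when the check is limited to rate_fraction of full speed."""
    if not 0.0 < rate_fraction <= 1.0:
        raise ValueError("rate_fraction must be in (0, 1]")
    return full_speed_hours / rate_fraction


print(round(estimated_hours(6.0, 0.30), 1))  # 20.0
```

Under that assumption, a 6-hour check limited to 30% would take roughly 20 hours.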

Link to comment
9 minutes ago, Tubez said:

But when the parity check runs, they're all going full speed and jump up to over 50c.  I don't want (and shouldn't need) to add more fans when everything operates within range normally.

 

¿Que?

 

If they overheat when doing a parity check, they are not operating within range and you need better cooling. A parity check is not a stress test; it's just a sequential disk read, which any disk can do without effort or overheating. I'm usually more worried in the winter, when my disks are much colder than I'd like even during a parity check.

Link to comment

The way to get drives to overheat is when you do a large number of writes where the drive has to constantly move between two regions of the disk surface.

 

As @johnnie.black notes, a sequential read isn't much work for the drives. Or are you maybe downloading or streaming data during the parity sync? Because other disk transfers concurrently with the parity sync will result in the drive heads having to constantly move to handle the multiple tasks.

Link to comment

When I wrote tons and tons of data to and from my SSD, I would get periodic Over Temp messages from unRAID. To the point that I was at dinner and was freaking out until I got home to check.

 

You have to remember the default settings in unRAID are just that. Just defaults. I took some time and realized the temp thresholds for my drives were way off, and adjusted them to allow for my heat, since every drive manufacturer sets its own Normal/High/Low thresholds. I also took a 5th look at my SSD, realized it wasn't getting the best airflow in my case, and corrected that too.
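
In Python terms, the idea is global defaults plus per-drive overrides. All names and numbers below are assumptions for illustration; check the drive's datasheet for real values:

```python
# Assumed global defaults and a hypothetical override for one SSD.
DEFAULTS = {"warning_c": 45, "critical_c": 55}
OVERRIDES = {"cache_ssd": {"warning_c": 60, "critical_c": 70}}


def classify(disk, temp_c, defaults=DEFAULTS, overrides=OVERRIDES):
    """Classify a reading using the drive's own thresholds when it has any."""
    limits = {**defaults, **overrides.get(disk, {})}
    if temp_c >= limits["critical_c"]:
        return "critical"
    if temp_c >= limits["warning_c"]:
        return "warning"
    return "normal"


print(classify("disk1", 50))      # warning
print(classify("cache_ssd", 50))  # normal
```

The same 50°C reading triggers a warning on a drive using the defaults but not on the SSD with its own, higher thresholds.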

 

In the end everything was fine, but getting some messages from unRAID made me take a second, third, heck, even a 5th look like I said, to make sure things are where they should be. If something had shut my system down because of a temp problem, I could have lost some data simply because of a freak-out over something that really wasn't anything.

 

Don't get me wrong I'm not saying there shouldn't be something built in. I'm just saying we should check our own systems to make sure we are sitting in the Norm or out of the Norm.  

Link to comment
  • 1 year later...
