NVME drive has reported overheat twice in 1 month's time



I have an NVMe drive (nvme2n1) in a RAID 1 cache pool used only for Plex. The two drives are the same make and model (Samsung SSD 980 1 TB). One drive has reported overheating twice. The first time I was not home and could not monitor it. The second time was less than an hour ago, reporting 84 degrees. Mover was not running and nobody was watching Plex. However, there was one download at 21:14 that lasted 1 minute, and I got the overheat notice for 84 degrees at 21:16. While looking for the cause, I noticed that the disk's SMART information reported the drive at 23 degrees while the dashboard and the Main screen still showed 84. At first I thought the SMART info was stale, but after pressing F5 a few times it updated to 24. As far as I can tell, the displayed temperature never deviated from 84 for the next half hour. Then at 21:47 the server reported that the disk had returned to normal temperature, and the dashboard showed 26. How could a drive drop almost 60 degrees in 30 minutes? I am attaching my diagnostics and screenshots of the temperatures. Thank you.

chrome_N8kGyup2dP.png

chrome_tRvNJC9Ham.png

chrome_e1L5VSq0hN.png

lucifer-diagnostics-20220121-2159.zip


Temperatures are polled by Unraid on an interval set by the 'Tunable (poll_attributes)' setting in Settings->Disk Settings.  The default is 1800 seconds, or 30 minutes.  NVMe disks, being solid state, can spike to a high temperature, and depending on when the poll hits, a high temperature could be detected that will drop quickly once the device is no longer heavily used.

 

Because the update occurs at a 30 minute interval, you see the same temperature on the display for a half an hour.

 

There are several things you can do depending on the result you'd like to see:

  • Reduce the polling interval.  Be careful with this because each poll interrogates all the disks and there could be a performance hit.
  • Increase the temperature thresholds above 84 degrees so you don't get the warning.
  • Provide some cooling in the case so the NVMe drive doesn't get so hot.
3 hours ago, ChatNoir said:

It should be OK, I think there is an issue with 6.10 RC and NVME temperature reporting spiking to 83 or 84°C.

It is just a reporting issue that should be fixed for the stable release.

That would be a welcome update. Thanks for the update.

 

27 minutes ago, dlandon said:

Temperatures are polled by Unraid on an interval set by the 'Tunable (poll_attributes)' setting in Settings->Disk Settings.  The default is 1800 seconds, or 30 minutes.  NVMe disks, being solid state, can spike to a high temperature, and depending on when the poll hits, a high temperature could be detected that will drop quickly once the device is no longer heavily used.

 

Because the update occurs at a 30 minute interval, you see the same temperature on the display for a half an hour.

 

There are several things you can do depending on the result you'd like to see:

  • Reduce the polling interval.  Be careful with this because each poll interrogates all the disks and there could be a performance hit.
  • Increase the temperature thresholds above 84 degrees so you don't get the warning.
  • Provide some cooling in the case so the nvme drive doesn't get so hot.

Thank you. I reduced the polling interval to 900. I am not sure I want to increase the threshold above 84. If it is a bug, I want to see it happen again and get notified. If it is not a bug, I want to find out what is causing it. I do not think it is a cooling issue, as none of my other 4 NVMe drives report anywhere close to this. I don't think any have gotten above 40.

 

24 minutes ago, Squid said:

If this does actually continue, can I suggest that you install NetData.  With it (scroll down to sensors), you'll be able to see a graph of the temps over time that the drive is reporting.

 

Default display is last 5 minutes, but at the top you can switch it to 12 hours

Thanks for this tip. I already have Netdata installed. I could not find where the temperature was last night, but I will increase the displayed time range to something like 6 hours.
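For anyone without Netdata, a rough way to capture the same kind of per-second trace is to sample the drive directly. The sketch below is only an illustration: it assumes `smartctl` from smartmontools is installed and that its JSON output carries the reading under `temperature.current` in degrees Celsius. The helper names `parse_temperature`, `read_temperature` and `log_temperatures` are made up for this example, and the device path is just the one from the first post.

```python
import json
import subprocess
import time
from typing import Optional


def parse_temperature(smartctl_json: str) -> Optional[int]:
    """Extract the current temperature (assumed Celsius) from smartctl JSON text.

    Returns None when the expected "temperature.current" field is missing.
    """
    data = json.loads(smartctl_json)
    return data.get("temperature", {}).get("current")


def read_temperature(device: str) -> Optional[int]:
    """Run smartctl once against `device` and return the parsed temperature."""
    result = subprocess.run(
        ["smartctl", "-A", "--json", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes for warnings too
    )
    if not result.stdout.strip():
        return None
    return parse_temperature(result.stdout)


def log_temperatures(device: str, interval_s: float = 1.0, samples: int = 60) -> None:
    """Print a timestamped reading every `interval_s` seconds, `samples` times."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), read_temperature(device))
        time.sleep(interval_s)
```

Calling `log_temperatures("/dev/nvme2n1", interval_s=1.0, samples=300)` would watch the drive for about five minutes, which should be long enough to see whether a reading jumps and falls back within a few seconds.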

14 minutes ago, kevin182 said:

I am not sure I want to increase the threshold above 84

I've had my NVMe disk threshold set at 80 deg for a long time because I was also getting the warnings.

 

15 minutes ago, kevin182 said:

Thanks for this tip. I have netdata already installed.

This will be interesting to see.  How often does netdata get the temperature?

5 hours ago, dlandon said:

I've had my nvme disk set at 80 deg for a long time because I was also getting the warnings.

 

This will be interesting to see.  How often does netdata get the temperature?

It lasts for 3 seconds. Now I know it is not for real. Temperature goes from 24 to 84 and back to 24 all within 5 seconds? Hmmmm.

By the way, this is the twin RAID 1 drive (same make and model) of the one I mentioned above. Maybe a manufacturing defect?

I wish the high-temperature alarm could require the temperature to stay high for more than X seconds or minutes before sending the alert.
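To make that wish concrete, here is a minimal sketch of a "sustained for X seconds" rule. It is not an Unraid feature as far as this thread shows. It is only an illustration of the logic, and the function name `sustained_alerts` and its parameters are invented for the example.

```python
from typing import Iterable, List, Tuple


def sustained_alerts(
    samples: Iterable[Tuple[float, float]],
    threshold_c: float,
    min_duration_s: float,
) -> List[float]:
    """Return the timestamps at which an alert would fire.

    `samples` is a time-ordered sequence of (timestamp_seconds, temperature_c).
    An alert fires once per hot episode, and only after the temperature has
    stayed at or above `threshold_c` for at least `min_duration_s` seconds.
    """
    alerts: List[float] = []
    above_since = None  # timestamp of the first hot sample in the current episode
    fired = False       # whether this episode has already produced an alert
    for timestamp, temp_c in samples:
        if temp_c >= threshold_c:
            if above_since is None:
                above_since = timestamp
                fired = False
            if not fired and timestamp - above_since >= min_duration_s:
                alerts.append(timestamp)
                fired = True
        else:
            # Temperature dropped back down: the episode is over.
            above_since = None
            fired = False
    return alerts
```

With a 60-second requirement, a 24 to 84 to 24 blip that lasts 3 seconds produces no alert, while a drive that genuinely sits at 84 or more for a minute still does.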
image.thumb.png.3aa151adeaead3261297e815bbe6e1d7.png

11 minutes ago, kevin182 said:

Now I know it is not for real.

Isn't the disk temperature going up that high?

 

As I explained before, Unraid polls the disk temperature and will only alarm if the disk is hot at the time it polls (every 30 minutes by default).  The poll would have to land in the 3-second window when the temperature is high to cause an alarm.  Three seconds out of 30 minutes is not a very high probability that Unraid would hit it.  A couple of times a month fits that.
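As a back-of-the-envelope check of that estimate, the arithmetic can be sketched as below. It uses the 1800-second default poll interval stated earlier in the thread and the 3-second spike seen in Netdata. The one-spike-per-30-minutes rate is an assumption implied by "three seconds out of 30 minutes", not something measured here, and the function name is made up for the example.

```python
def expected_alerts_per_month(
    poll_interval_s: float,
    spike_duration_s: float,
    spike_period_s: float,
    days: int = 30,
) -> float:
    """Estimate how many polls per month land inside a temperature spike.

    Assumes polls are instantaneous and that one spike of `spike_duration_s`
    seconds occurs every `spike_period_s` seconds, at a time unrelated to
    the poll schedule.
    """
    # Fraction of time the drive is reporting the high value.
    fraction_hot = min(1.0, spike_duration_s / spike_period_s)
    polls_per_month = days * 24 * 3600 / poll_interval_s
    return polls_per_month * fraction_hot


# Default 30-minute polling, one 3-second spike per 30 minutes (assumed).
estimate = expected_alerts_per_month(1800, 3, 1800)
print(round(estimate, 1))
```

Under those assumptions there are 1440 polls in a 30-day month and each has a 3-in-1800 chance of landing on the spike, which works out to roughly 2.4 alerts a month. More frequent polling, or more frequent spikes, raises that number proportionally.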

 

You need to raise the temperature threshold.

7 minutes ago, dlandon said:

Isn't the disk temperature going up that high?

 

As I explained before, Unraid polls the disk temperature and will only alarm if the disk is hot at the time it polls (every 30 minutes by default).  The poll would have to land in the 3-second window when the temperature is high to cause an alarm.  Three seconds out of 30 minutes is not a very high probability that Unraid would hit it.  A couple of times a month fits that.

 

You need to raise the temperature threshold.

If that temp is any period within 30 seconds there is no physical way the temperature would go from 24 to 84 and back to 24 within 90 seconds. That is physically impossible.  No?

I can raise the temperature threshold to stop getting the warnings, but that will only stop the notifications. The drive is still reporting 84 and throttling, according to the post I referenced above. Say I do raise the upper threshold to 85. What if my drive hits 75 for real? I will never know unless I constantly monitor it.


If the drive actually throttles itself because of the false temperature, then that's an issue with the firmware and there is nothing you can do about it other than complain to Samsung.

 

If the drive doesn't throttle itself on the false temperature then there's no big deal and it will throttle when the real temperature gets to that point, and you still have to complain to Samsung about the false readings.

 

Thinking back, I've possibly seen this notification followed by another saying the drive returned to normal temperature.  I've just said "OK", assuming the drive was under extremely heavy load (which it usually is).

 

What else can you do except complain to the manufacturer or live with it?

8 minutes ago, Squid said:

If the drive actually throttles itself because of the false temperature, then that's an issue with the firmware and nothing you can do about it other than complain to Samsung

 

If the drive doesn't throttle itself on the false temperature then there's no big deal and it will throttle when the real temperature gets to that point, and you still have to complain to Samsung about the false readings.

 

Thinking back, I've possibly seen this notification and then another saying the drive returned to normal temperature.  I've just said "OK", the drive was under extremely heavy load (which it usually is)

 

What else can you do except complain to the manufacturer or live with it?

I do see firmware on the Samsung site, but no indication of what it is for or what it solves. It is a different version from the one I have. I might just install it to see if the problem goes away. Apparently the spike happens quite often: at least once every 5 minutes while I have been watching it. I guess it only gets caught by the Unraid poll every once in a while.

  • 1 month later...

I have a 2230 Kioxia with no heatsink on a PCIe card that's at 58C. 

 

No heatsink. I figure that's a reasonable temp. I set the warnings to 140/145F. 

 

2600MB/s in an R720xd if anyone is interested in the future.

 

Also, it isn't a flat line in DiskSpeed and the drive isn't being used just yet, so there's a datapoint. 2728 MB/s high and 2579 MB/s low, so something like 6% variance?

Edited by RealActorRob
