kevin182 Posted January 22, 2022
I have an NVMe drive (nvme2n1) in a RAID 1 cache pool used for Plex only. The two drives are identical in make and model (Samsung SSD 980 1 TB). One drive has reported overheating twice. The first time I was not home and could not monitor it. The second time was less than an hour ago, reporting 84 degrees. Mover was not running and nobody was even watching Plex. However, there was one download at 21:14 that lasted one minute. I got the overheat notice for 84 degrees at 21:16.
I was looking in different areas to find the cause. While doing that, I noticed that the disk SMART information was reporting the drive at 23 degrees, while the dashboard and the main screen still reported 84. At first I thought the SMART info was stale, but after pressing F5 a few times it updated to 24. As far as I can tell, the displayed temperature never deviated from 84 for the next half hour. Then at 21:47 the server reported that the disk had returned to normal temperature. It was now showing 26 on the dashboard.
How could a drive drop almost 60 degrees in 30 minutes? I am attaching my diagnostics and screenshots of the temperatures. Thank you.
lucifer-diagnostics-20220121-2159.zip
ChatNoir Posted January 22, 2022
It should be OK. I think there is an issue with 6.10 RC where NVMe temperature reporting spikes to 83 or 84°C. It is just a reporting issue that should be fixed for the stable release.
dlandon Posted January 22, 2022
Temperatures are polled by Unraid on an interval set by the 'Tunable (poll_attributes)' setting in Settings->Disk Settings. The default is 1800 seconds, or 30 minutes. NVMe disks, being solid state, can spike to a high temperature, and depending on when the poll hits, a high temperature could be detected that will go down quickly when the device is not heavily used. Because the update occurs at a 30 minute interval, you see the same temperature on the display for half an hour.
There are several things you can do depending on the result you'd like to see:
- Reduce the polling interval. Be careful with this because each poll interrogates all the disks and there could be a performance hit.
- Increase the temperature thresholds above 84 degrees so you don't get the warning.
- Provide some cooling in the case so the NVMe drive doesn't get so hot.
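The "stuck at 84 for half an hour" effect described above can be illustrated with a small sample-and-hold simulation. This is only a sketch under stated assumptions: the function names and the 3-second, 84-degree spike are hypothetical illustration values, and it does not reproduce how Unraid actually implements polling.

```python
def displayed_temps(true_temp_at, poll_interval_s, duration_s):
    """Return (second, displayed value) pairs for a display that only
    refreshes when a poll happens and holds the value in between."""
    shown = None
    series = []
    for t in range(duration_s):
        if t % poll_interval_s == 0:
            shown = true_temp_at(t)  # poll: read the drive right now
        series.append((t, shown))    # between polls the old value is kept
    return series


def spike_at_poll(t):
    # Hypothetical drive: reports 84 for seconds 0-2, otherwise 24.
    return 84 if 0 <= t < 3 else 24


series = displayed_temps(spike_at_poll, poll_interval_s=1800, duration_s=3600)
stuck = sum(1 for _, value in series if value == 84)
print(f"display shows 84 for {stuck} seconds")
```

With a poll that happens to land on the spike, the display holds 84 for the full 1800-second interval even though the drive only reported it for 3 seconds.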
Squid Posted January 22, 2022
If this does actually continue, can I suggest that you install NetData. With it (scroll down to sensors), you'll be able to see a graph of the temps that the drive is reporting over time. The default display is the last 5 minutes, but at the top you can switch it to 12 hours.
kevin182 (Author) Posted January 22, 2022
3 hours ago, ChatNoir said: It should be OK, I think there is an issue with 6.10 RC and NVME temperature reporting spiking to 83 or 84°C. It is just a reporting issue that should be fixed for the stable release.
That would be a welcome update. Thanks for the update.
27 minutes ago, dlandon said: Temperatures are polled by Unraid on an interval set by the 'Tunable (poll_attributes)' setting in Settings->Disk Settings. The default is 1800 seconds, or 30 minutes. Nvme disks being solid state can spike to a high temperature and depending on when the poll hits, a high temperature could be detected that will go down quickly when the device is not heavily used. Because the update occurs at a 30 minute interval, you see the same temperature on the display for a half an hour. There are several things you can do depending on the result you'd like to see: Reduce the polling interval. Be careful with this because each poll interrogates all the disks and there could be a performance hit. Increase the temperature thresholds above 84 degrees so you don't get the warning. Provide some cooling in the case so the nvme drive doesn't get so hot.
Thank you. I reduced the polling interval to 900. I am not sure I want to increase the threshold above 84. If it is a bug, I want to see it happen again and get notified. If it is not a bug, I want to find out what is causing it. I do not think it is a cooling issue, as none of my other 4 NVMe drives report anywhere close to this. I don't think any have gotten above 40.
24 minutes ago, Squid said: If this does actually continue, can I suggest that you install NetData. With it (scroll down to sensors), you'll be able to see a graph of the temps over time that the drive is reporting. Default display is last 5 minutes, but at the top you can switch it to 12 hours
Thanks for this tip. I have netdata already installed. I could not figure out where the temp was last night, but I will increase the graph interval to something like 6 hours.
dlandon Posted January 22, 2022
14 minutes ago, kevin182 said: I am not sure I want to increase the threshold above 84
I've had my NVMe disk set at 80 deg for a long time because I was also getting the warnings.
15 minutes ago, kevin182 said: Thanks for this tip. I have netdata already installed.
This will be interesting to see. How often does netdata get the temperature?
Squid Posted January 22, 2022
4 hours ago, dlandon said: This will be interesting to see. How often does netdata get the temperature?
Looks like every second.
dlandon Posted January 22, 2022
That would be a good granularity.
kevin182 (Author) Posted January 22, 2022
5 hours ago, dlandon said: I've had my nvme disk set at 80 deg for a long time because I was also getting the warnings. This will be interesting to see. How often does netdata get the temperature?
It lasts for 3 seconds. Now I know it is not for real. Temperature goes from 24 to 84 and back to 24 all within 5 seconds? Hmmmm. By the way, this is the twin RAID 1 drive (same make and model) to the one I mentioned above. Maybe a manufacturing defect?
I wish the high-temperature alarm could require the reading to stay high for more than X seconds or minutes before sending the alert.
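The "must last for more than X" idea above can be sketched as a simple debounce over timestamped readings. This is not an existing Unraid setting; `sustained_alerts`, its parameters, and the sample data are hypothetical, and the sketch assumes readings arrive often enough (for example once per second) for a duration to be measurable.

```python
def sustained_alerts(readings, threshold_c, min_duration_s):
    """Return the timestamps at which an alert would fire, given
    (timestamp_s, temp_c) readings in time order. An alert fires once
    per episode, only after the temperature has stayed at or above
    threshold_c for at least min_duration_s seconds."""
    alerts = []
    over_since = None   # when the current over-threshold episode began
    alerted = False     # whether this episode has already raised an alert
    for ts, temp in readings:
        if temp >= threshold_c:
            if over_since is None:
                over_since = ts
            if not alerted and ts - over_since >= min_duration_s:
                alerts.append(ts)
                alerted = True
        else:
            over_since = None  # episode ended, reset the timer
            alerted = False
    return alerts


# Hypothetical data: a 3-second spike to 84, sampled once per second.
spike = [(t, 84 if 10 <= t < 13 else 24) for t in range(60)]
# Hypothetical data: a real, sustained rise to 75 starting at t=100.
hot = [(t, 75 if t >= 100 else 24) for t in range(300)]

print(sustained_alerts(spike, threshold_c=70, min_duration_s=60))
print(sustained_alerts(hot, threshold_c=70, min_duration_s=60))
```

The brief spike produces no alert, while the sustained rise produces exactly one alert after it has lasted 60 seconds.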
kevin182 (Author) Posted January 22, 2022
Looks like it is a manufacturer problem. https://us.community.samsung.com/t5/Monitors-and-Memory/SSD-980-heat-spikes-to-84-C-183-F/td-p/2002779
dlandon Posted January 22, 2022
11 minutes ago, kevin182 said: Now I know it is not for real.
Isn't the disk temperature going up that high? As I explained before, Unraid polls the disk temperature and will only alarm if the disk is hot at the time it polls (every 30 minutes by default). The poll would have to land within the 3 seconds when the reading is high to cause an alarm. Three seconds out of 30 minutes is not a very high probability that Unraid would hit it. A couple of times a month explains that. You need to raise the temperature threshold.
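The "three seconds out of 30 minutes" reasoning can be turned into a rough estimate. This is a back-of-the-envelope sketch, taking the poll interval as the 1800-second default mentioned earlier in the thread and assuming (hypothetically) one 3-second spike per 30-minute window, with poll timing unrelated to the spikes. The function name and parameters are illustrative only.

```python
def expected_false_alerts(spike_s, spike_period_s, poll_interval_s, days=30):
    """Estimate how often an instantaneous poll lands inside a brief spike.

    Assumes one spike of spike_s seconds every spike_period_s seconds,
    and polls whose timing is independent of the spikes.
    """
    p_hit = spike_s / spike_period_s              # chance one poll sees the spike
    polls = days * 24 * 3600 / poll_interval_s    # polls in the whole period
    return p_hit, p_hit * polls


p_hit, per_month = expected_false_alerts(
    spike_s=3, spike_period_s=1800, poll_interval_s=1800
)
print(f"chance per poll: {p_hit:.2%}, expected alerts in 30 days: {per_month:.1f}")
```

Under these assumptions each poll has about a 1 in 600 chance of catching the spike, which works out to roughly 2.4 alerts in 30 days. More frequent spikes would raise that figure proportionally.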
Squid Posted January 22, 2022
Definitely looks like a firmware bug on the drive that primarily affects Linux systems across the board (and some Windows installations).
kevin182 (Author) Posted January 22, 2022
7 minutes ago, dlandon said: Isn't the disk temperature going up that high? As I explained before, Unraid polls the disk temperature and will only alarm if the disk is high at that time it polls (every 30 minutes by default). It would have to hit in the 3 seconds time when it is high to cause an alarm. Three seconds out of 30 minutes is not a very high probability that Unraid would hit it. A couple times a month explains that. You need to raise the temperature threshold.
If that temp is any period within 30 seconds, there is no physical way the temperature would go from 24 to 84 and back to 24 within 90 seconds. That is physically impossible. No?
I can raise the temperature threshold to stop getting the warnings, but that is just going to stop the notifications. The drive is still reporting 84 and throttling, according to the post I referenced above. Say I do raise the upper threshold to 85. What if my drive hits 75 for real? I will never know unless I constantly monitor it.
Squid Posted January 22, 2022
If the drive actually throttles itself because of the false temperature, then that's an issue with the firmware, and there is nothing you can do about it other than complain to Samsung.
If the drive doesn't throttle itself on the false temperature, then it's no big deal: it will throttle when the real temperature gets to that point, and you still have to complain to Samsung about the false readings.
Thinking back, I've possibly seen this notification, followed by another saying the drive returned to normal temperature. I just said "OK", since the drive was under extremely heavy load (which it usually is).
What else can you do except complain to the manufacturer or live with it?
kevin182 (Author) Posted January 22, 2022
8 minutes ago, Squid said: If the drive actually throttles itself because of the false temperature, then that's an issue with the firmware and nothing you can do about it other than complain to Samsung. If the drive doesn't throttle itself on the false temperature then there's no big deal and it will throttle when the real temperature gets to that point, and you still have to complain to Samsung about the false readings. Thinking back, I've possibly seen this notification and then another saying the drive returned to normal temperature. I've just said "OK", the drive was under extremely heavy load (which it usually is). What else can you do except complain to the manufacturer or live with it?
I do see firmware on the Samsung site, but no indication of what it is for or what it solves. It is a different firmware version from the one I have. I might just install it to see if the problem goes away. Apparently the spike happens quite often: at least once every 5 minutes while I have been watching it. I guess it only gets caught by the Unraid poll every once in a while.
RealActorRob Posted March 10, 2022
I have a 2230 Kioxia with no heatsink on a PCIe card that's at 58C. I figure that's a reasonable temp. I set the warnings to 140/145F. It does 2600 MB/s in an R720xd, if anyone is interested in the future. Also, it isn't a flat line in Diskspeed and isn't being used just yet, so there's a data point: 2728 MB/s high and 2579 MB/s low, so something like 6% variance?