NVMe overheating



Hello all,

 

I just updated to 6.10.0-rc2 and noticed that the temperature of my NVMe cache SSD now spikes from 31 C to 84 C. I use this cache drive for Docker containers. I first noticed the issue after I 1) upgraded to rc2, and 2) installed a new container (Jellyfin). After I removed the new container the problem seemed to go away, until tonight.

 

Any thoughts? Should I downgrade or is my NVMe faulty?

 

Thanks, Brian

Link to comment

I know, it's weird. At first I thought it was a docker container I was trying to install (it's my Docker cache drive) but I still see the issue. 

 

You bring up a good point... it is an older NVMe that I had used before. I suppose I could touch it to see for sure! :-) Otherwise it might be time to upgrade to a 1TB drive w/ heatsink!

Link to comment

Revert to your previous Unraid version and see if the issue persists. If it doesn't, then it is clearly a bad reading somewhere in the kernel.

Link to comment
On 11/4/2021 at 1:20 AM, MaxwellHouse said:

Hello all,

 

I just updated to 6.10.0-rc2 and noticed that the temperature of my NVMe cache SSD now spikes from 31 C to 84 C. I use this cache drive for Docker containers. I first noticed the issue after I 1) upgraded to rc2, and 2) installed a new container (Jellyfin). After I removed the new container the problem seemed to go away, until tonight.

 

Any thoughts? Should I downgrade or is my NVMe faulty?

 

Thanks, Brian

Confirming the same: I'm running two new M.2 drives (one for Docker and one for VMs) and both are reporting 84-degree spikes. I had no such reports on rc1, but have seen several a day since rc2. Both drives have heatsinks and normally operate in the 35-42 C range. Even stranger, the reading always spikes directly to 84, never more, never less, before returning to normal.
 

I’m running a Ryzen 5600G rig on an Asus ROG Strix X570-F board. 

 

Attaching the latest log entries:

08-11-2021 08:11    Unraid Dockers disk message    Notice  - Dockers disk returned to normal temperature    Samsung_SSD_980_1TB_S649NF0R675515B (nvme1n1)    normal    
08-11-2021 07:39    Unraid Dockers disk temperature    Alert  - Dockers disk overheated (84 C)    Samsung_SSD_980_1TB_S649NF0R675515B (nvme1n1)    alert    
08-11-2021 01:27    Unraid Virtuals disk message    Notice  - Virtuals disk returned to normal temperature    Samsung_SSD_980_1TB_S649NF0R675513Z (nvme0n1)    normal    
08-11-2021 01:27    Unraid Dockers disk message    Notice  - Dockers disk returned to normal temperature    Samsung_SSD_980_1TB_S649NF0R675515B (nvme1n1)    normal    
08-11-2021 00:56    Unraid Dockers disk temperature    Alert  - Dockers disk overheated (84 C)    Samsung_SSD_980_1TB_S649NF0R675515B (nvme1n1)    alert    
07-11-2021 23:54    Unraid Virtuals disk temperature    Alert  - Virtuals disk overheated (84 C)    Samsung_SSD_980_1TB_S649NF0R675513Z (nvme0n1)    alert    
07-11-2021 22:53    Unraid Virtuals disk message    Notice  - Virtuals disk returned to normal temperature    Samsung_SSD_980_1TB_S649NF0R675513Z (nvme0n1)    normal    
07-11-2021 22:22    Unraid Virtuals disk temperature    Alert  - Virtuals disk overheated (84 C)    Samsung_SSD_980_1TB_S649NF0R675513Z (nvme0n1)    alert    
07-11-2021 20:21    Unraid Dockers disk message    Notice  - Dockers disk returned to normal temperature    Samsung_SSD_980_1TB_S649NF0R675515B (nvme1n1)    normal    
07-11-2021 19:50    Unraid Dockers disk temperature    Alert  - Dockers disk overheated (84 C)    Samsung_SSD_980_1TB_S649NF0R675515B (nvme1n1)    alert    
07-11-2021 17:49    Unraid Virtuals disk message    Notice  - Virtuals disk returned to normal temperature    Samsung_SSD_980_1TB_S649NF0R675513Z (nvme0n1)    normal    
07-11-2021 16:18    Unraid Virtuals disk temperature    Alert  - Virtuals disk overheated (84 C)    Samsung_SSD_980_1TB_S649NF0R675513Z (nvme0n1)    alert    
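
One way to check the "always exactly 84" observation is to tally the alerts per device and reported temperature. The following is a minimal sketch, assuming the notification lines have been saved to a text file in the same format as the lines pasted above; 'summarize_alerts' and the file name are hypothetical and not part of Unraid.

```shell
#!/bin/sh
# summarize_alerts: hypothetical helper, not an Unraid tool.
# Reads notification lines on stdin and prints "<device> <temp> <count>"
# for every "overheated (NN C)" alert, sorted by device.
summarize_alerts() {
  awk '
    /overheated \([0-9]+ C\)/ {
      temp = $0
      sub(/.*overheated \(/, "", temp)   # drop everything before the number
      sub(/ C\).*/, "", temp)            # drop everything after the number
      dev = $0
      sub(/.*\(nvme/, "nvme", dev)       # keep from the device name onwards
      sub(/\).*/, "", dev)               # drop the closing parenthesis and the rest
      count[dev " " temp]++
    }
    END { for (k in count) print k, count[k] }
  ' | sort
}

# Usage (assumed file name):
#   summarize_alerts < notifications.txt
```

For the twelve lines above this prints "nvme0n1 84 3" and "nvme1n1 84 3": three alerts per drive, every one at the same value.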

//UlfThomas

Edited by Ulf Thomas Johansen
Link to comment
6 minutes ago, Ulf Thomas Johansen said:

Indeed - which leads me to speculate that it might be a misread and not an actual temp reading perhaps?

 

Not sure, but I think you could troubleshoot further by applying different loads to the NVMe drive and checking whether it reports intermediate temperatures.

 

Still thinking this over. For example, the rc2 release notes mention ACPI:

 

[rc2] Enabled additional ACPI kernel options
[rc2] Updated out-of-tree drivers

[rc2] Enabled TPM kernel modules (not utilized yet) - note this is for Unraid host utilizing physical TPM, not emulated TPM support for virtual machines.

Link to comment
22 minutes ago, Ulf Thomas Johansen said:

 

Any suggestions as to how I would do this? Just plain copy jobs?

 

 

Please use the Docker disk (stop Docker first) or the VM disk (stop the VMs first) and run the test below, adjusting the count value to apply different loads.

 

dd if=/dev/random of=/mnt/xxx/test.bin bs=1MB count=1024

 

Edit: please also run 'sensors' at the command prompt to check whether the NVMe drives report their temperature, and post the output here.
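
The stepped-load test described above could be scripted roughly as follows. This is only a sketch under a few assumptions: 'load_and_sample' is a hypothetical helper name, it reads from /dev/urandom with bs=1M rather than the /dev/random and bs=1MB of the command above, and it assumes 'sensors' (lm-sensors) is installed and labels the drives "nvme-pci-..." as in this thread. The target directory stays a placeholder, as in the original command.

```shell
#!/bin/sh
# load_and_sample DIR COUNT
# Hypothetical helper: writes COUNT MiB of random data to DIR/test.bin and,
# while dd is running, prints a timestamp plus the NVMe sections of `sensors`
# about once per second. The test file is deleted afterwards.
load_and_sample() {
  dir=$1
  count=$2
  dd if=/dev/urandom of="$dir/test.bin" bs=1M count="$count" 2>/dev/null &
  dd_pid=$!
  while kill -0 "$dd_pid" 2>/dev/null; do
    date '+%H:%M:%S'
    # Assumes lm-sensors output with "nvme-pci-XXXX" section headers;
    # "|| true" keeps the loop going if no such section is found.
    sensors 2>/dev/null | grep -A 5 '^nvme-pci' || true
    sleep 1
  done
  wait "$dd_pid"
  status=$?
  rm -f "$dir/test.bin"
  return "$status"
}

# Usage: stop Docker or the VMs first, then step the load up, e.g.
#   for n in 512 1024 2048; do load_and_sample /mnt/xxx "$n"; sleep 60; done
```

If the temperatures climb gradually with larger counts, the sensor is reporting intermediate values and an instant jump to 84 looks more like a misread.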

Edited by Vr2Io
Link to comment
1 hour ago, Vr2Io said:

 

Edit: please also run 'sensors' at the command prompt to check whether the NVMe drives report their temperature, and post the output here.

 

Will perform tests later today. This is the output of 'sensors'.

 

amdgpu-pci-0a00
Adapter: PCI adapter
vddgfx:      906.00 mV
vddnb:       993.00 mV
edge:         +33.0°C
power1:      1000.00 uW

nvme-pci-0300
Adapter: PCI adapter
Composite:    +41.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +84.8°C)
Sensor 1:     +41.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +42.9°C  (low  = -273.1°C, high = +65261.8°C)

nct6798-isa-0290
Adapter: ISA adapter
in0:                        1.15 V  (min =  +0.00 V, max =  +1.74 V)
in1:                      1000.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in2:                        3.38 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in3:                        3.31 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in4:                        1.01 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:                        2.04 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in6:                      360.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in7:                        3.38 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in8:                        3.33 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in9:                      896.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in10:                       1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in11:                     496.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in12:                       1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in13:                     392.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in14:                     328.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
Array Fan:                 463 RPM  (min =    0 RPM)
Array Fan:                1124 RPM  (min =    0 RPM)
SYSTIN:                    -62.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
CPU Temp:                  +30.5°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
AUXTIN0:                   +79.0°C    sensor = thermistor
AUXTIN1:                   -62.0°C    sensor = thermistor
MB Temp:                   +26.0°C    sensor = thermistor
AUXTIN3:                   +84.0°C    sensor = thermistor
PECI Agent 0 Calibration:  +32.5°C
intrusion0:               ALARM
intrusion1:               ALARM
beep_enable:              disabled

nvme-pci-0900
Adapter: PCI adapter
Composite:    +31.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +84.8°C)
Sensor 1:     +31.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +34.9°C  (low  = -273.1°C, high = +65261.8°C)

 

Link to comment
3 hours ago, Vr2Io said:

 

dd if=/dev/random of=./test.bin bs=1MB count=10240

Tested with the above command whilst extracting sensor data. It does indeed report increasing temperatures:

 

Composite:    +42.9°C  (low  = -273.1°C, high = +81.8°C)
Sensor 1:     +42.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +46.9°C  (low  = -273.1°C, high = +65261.8°C)

Composite:    +43.9°C  (low  = -273.1°C, high = +81.8°C)
Sensor 1:     +43.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +52.9°C  (low  = -273.1°C, high = +65261.8°C)

Composite:    +43.9°C  (low  = -273.1°C, high = +81.8°C)
Sensor 1:     +43.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +53.9°C  (low  = -273.1°C, high = +65261.8°C)

Composite:    +43.9°C  (low  = -273.1°C, high = +81.8°C)
Sensor 1:     +43.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +54.9°C  (low  = -273.1°C, high = +65261.8°C)

Composite:    +43.9°C  (low  = -273.1°C, high = +81.8°C)
Sensor 1:     +43.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +54.9°C  (low  = -273.1°C, high = +65261.8°C)

Composite:    +43.9°C  (low  = -273.1°C, high = +81.8°C)
Sensor 1:     +43.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +46.9°C  (low  = -273.1°C, high = +65261.8°C)

 

Edited by Ulf Thomas Johansen
Link to comment

No worries! Thank you very much for your suggestions to try to troubleshoot this issue! I upgraded to rc2 and re-ran the random file write. Very interesting results on Sensor 2... see here:

 

There is definitely more of a delta. Incidentally, when I increased the write to count=2048, the Sensor 2 temperature peaked at 53.9 C. Hmm....

 

EDIT: After reverting to rc1 I noticed an increase of 5 C when writing with count=2048.

 

root@maxwell:~# sensors | grep nvme-pci-0400 -A 5
nvme-pci-0400
Adapter: PCI adapter
Composite:    +30.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +84.8°C)
Sensor 1:     +30.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +41.9°C  (low  = -273.1°C, high = +65261.8°C)
root@maxwell:~# sensors | grep nvme-pci-0400 -A 5
nvme-pci-0400
Adapter: PCI adapter
Composite:    +31.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +84.8°C)
Sensor 1:     +31.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)
root@maxwell:~# sensors | grep nvme-pci-0400 -A 5
nvme-pci-0400
Adapter: PCI adapter
Composite:    +31.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +84.8°C)
Sensor 1:     +31.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)
root@maxwell:~# sensors | grep nvme-pci-0400 -A 5
nvme-pci-0400
Adapter: PCI adapter
Composite:    +31.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +84.8°C)
Sensor 1:     +31.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +45.9°C  (low  = -273.1°C, high = +65261.8°C)
root@maxwell:~# sensors | grep nvme-pci-0400 -A 5
nvme-pci-0400
Adapter: PCI adapter
Composite:    +31.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +84.8°C)
Sensor 1:     +31.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +44.9°C  (low  = -273.1°C, high = +65261.8°C)
root@maxwell:~# sensors | grep nvme-pci-0400 -A 5
nvme-pci-0400
Adapter: PCI adapter
Composite:    +31.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +84.8°C)
Sensor 1:     +31.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +44.9°C  (low  = -273.1°C, high = +65261.8°C)
root@maxwell:~# sensors | grep nvme-pci-0400 -A 5
nvme-pci-0400
Adapter: PCI adapter
Composite:    +31.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +84.8°C)
Sensor 1:     +31.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +43.9°C  (low  = -273.1°C, high = +65261.8°C)
root@maxwell:~# sensors | grep nvme-pci-0400 -A 5
nvme-pci-0400
Adapter: PCI adapter
Composite:    +31.9°C  (low  = -273.1°C, high = +81.8°C)
                       (crit = +84.8°C)
Sensor 1:     +31.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +41.9°C  (low  = -273.1°C, high = +65261.8°C)
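
To put a number on the Sensor 2 swing, repeated 'sensors' readings like the ones above can be reduced to a minimum, maximum and delta. This is a minimal sketch, assuming the lm-sensors text format shown in this thread; 'sensor2_range' is a hypothetical helper name.

```shell
#!/bin/sh
# sensor2_range: hypothetical helper.
# Reads captured `sensors` output on stdin and summarises every
# "Sensor 2:" reading as sample count, minimum, maximum and delta (in C).
sensor2_range() {
  awk '
    $1 == "Sensor" && $2 == "2:" {
      t = $3 + 0                      # "+41.9°C" converts to 41.9
      if (n == 0 || t < min) min = t
      if (n == 0 || t > max) max = t
      n++
    }
    END {
      if (n > 0)
        printf "samples=%d min=%.1f max=%.1f delta=%.1f\n", n, min, max, max - min
    }
  '
}

# Usage (assumed file name holding the pasted readings):
#   sensor2_range < readings.txt
```

Fed the eight readings pasted above, it reports a minimum of 41.9, a maximum of 47.9 and a delta of 6.0 for Sensor 2.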

 

Edited by MaxwellHouse
Link to comment
