
False temp alerts and fan Issues - Xeon E5-25x0 v1


False CPU Sensor Events and Max Fan Speeds  

40 members have voted



I'm working with ASRock technicians to troubleshoot an issue with my ASRock EP2C602-4L/D16 motherboard: I get false alerts that a CPU is overheating (the reading returns to normal after a few seconds), and sometimes the fans get stuck at max speed.

 

Their techs are having a difficult time replicating the issue, so I thought I'd try and provide them with some more data points.

 

I'd like to find out how many people are having the problem, and if it is occurring with any other motherboards and/or chip combinations, so thought I'd do a poll.

 

I'm mostly interested in people using an ASRock EP2C602-4L/D16 with a Xeon E5-2650/70, but I added other options as well for anyone else experiencing the issues, in case the issue is in fact with the chip and not the mobo.

 

Please use the comments if your response needs to be expanded (If you chose 'Other' for example).

 

Thanks!!!

 

Link to comment

If I have alerts enabled I receive a MFT of them. I have alerts disabled and might have changed another setting or two to not be annoyed by all the false positives. I don't know if I'm running on the latest BIOS and Firmware though.

 

Dual Intel® Xeon® CPU E5-2670 0 @ 2.60GHz [ Type 0, Family 6, Model 45, Stepping ] on EP2C602-4L/D16, BIOS L1.89E 05/15/2015 with SMBIOS 2.7 present using 128 Gig RAM @ 1333 MHz.

 

Device Information

Firmware Revision: 0.13.9
Firmware Build Time: Apr 8 2015 14:34:22 CST

 

I'll have to check the IPMI Logs to see just how silly it's been. I have 69 pages of this shit going on since the last time I CLEARED all events from 2017-04-17 back to 2016-10-31.

 

 

 

 

ASRock_TempSensor_FUBARED.png

ASRock_TempSensor_FUBARED_SinceLastCleared.png

  • Upvote 1
Link to comment
21 hours ago, DoeBoye said:

Thanks BRiT!

 

Anyone else have a vote? I'd love to be able to help ASRock solve this issue!

 

Thanks! :)

 

I started another thread today about my attempted switch from Dual E5-2670's to Dual E5-2650L's on a Supermicro X9DRi-LN4F+ and I was getting way higher fan speeds with it even though temps were super low. Switched back to the 2670's and fan noise is back down. I tried everything to fix it, but I finally submitted a ticket with Supermicro.

 

 

  • Upvote 1
Link to comment

Hi Guys, same issue here.

2x Xeon E5 2697v2 (ES).

 

So far I have been able to connect to the sensors via IPMI locally as well as to the BMC with ipmi-sensors-config and direct telnet.
However, I am stuck on what to do next. My particular issue is also the insane speed-up of the Noctuas from 1300 RPM up to 2600+ (3400 max).

 

It's really annoying to see the CPU assertion/deassertion spikes from 40 degrees to 90 to 30, etc. Must be a BMC issue IMO.


I created a support ticket with ASRock directly yesterday and will keep you posted.

 

Regards
KyDay

  • Upvote 1
Link to comment
  • 4 months later...
On 5/10/2017 at 11:19 AM, KyDay said:

I created a support ticket with ASRock directly yesterday and will keep you posted.

 

Regards
KyDay

Just following up to see if you ever resolved this. Currently my solution is a filter on my inbox to auto-move these warnings to a sub-folder so they don't constantly pollute my inbox.

 

As an aside, I've noticed that it seems to happen most often whenever there is moderate to heavy load (UnRaring, parity checks etc).

 

Anyone else figure this out/heard from ASRock about a solution?

Link to comment
  • 2 months later...

Just noticed the same issue here. Anyone have any updates on this? I was going to try clearing the CMOS by removing the battery and shorting the CMOS pins on the motherboard, but I assume others have already tried that?

 

I first thought they were happening on only 1 cpu (BSP1)...but left a load on the machine last night and saw that AP1 had a few events as well.   I agree that these events are more frequent with increased load.

 

ep2c602-4l/d16 with 2 E5-2670 V1

 

BIOS: 1.80 (latest on the ASRock website). Is there a newer BIOS?

BMC: 00.18.00 (latest on the ASRock site page)

 

Normally the events are so short that they don't appear in the widgets.  I was able to capture some as pictured.

asrock.PNG

Edited by xnaron
Link to comment

This might not be a false temp reading on the spikes. Maybe there is a bug in the BIOS or motherboard hardware that is causing an overvoltage and making the CPU temp spike. Regardless, the motherboard thinks the CPU is overheating and puts the fans to max. I wonder if it is also thermal throttling the cores.

Edited by xnaron
Link to comment

I ran some more tests. This time I booted into memtestx86 instead of ESXi and used all cores in parallel for the test. A couple of the events lasted over 30 seconds. I am concerned that these aren't false positives and that the mainboard/BIOS is doing something out of spec with voltage and causing the overtemp. While monitoring the graphs I have seen it at 101°C.

 

 

asrock3.PNG

Edited by xnaron
Link to comment

 

I did some more testing today. I installed Ubuntu Server 16.04 and used stress to load the 32 cores. I wrote a script to collect the core temps every second using lm-sensors (not IPMI) and log them. I have the time synced on the server and the BMC with NTP. I waited for an event to occur and then checked the logs. There is no spike in temperature in the lm-sensors log at the time of the corresponding UNC (upper non-critical) assertion in the BMC log. This makes me feel better: it looks like a bug in reading the temp rather than a flaw causing the CPU to exceed the UNC temp. I checked multiple events and grep'd the lm-sensors log to try to find a spike, and I could not.

 

Tue Nov 28 19:06:13 MST 2017
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +51.0°C  (high = +90.0°C, crit = +100.0°C)
Core 0:         +51.0°C  (high = +90.0°C, crit = +100.0°C)
Core 1:         +50.0°C  (high = +90.0°C, crit = +100.0°C)
Core 2:         +49.0°C  (high = +90.0°C, crit = +100.0°C)
Core 3:         +48.0°C  (high = +90.0°C, crit = +100.0°C)
Core 4:         +48.0°C  (high = +90.0°C, crit = +100.0°C)
Core 5:         +48.0°C  (high = +90.0°C, crit = +100.0°C)
Core 6:         +49.0°C  (high = +90.0°C, crit = +100.0°C)
Core 7:         +47.0°C  (high = +90.0°C, crit = +100.0°C)

coretemp-isa-0001
Adapter: ISA adapter
Physical id 1:  +64.0°C  (high = +90.0°C, crit = +100.0°C)
Core 0:         +59.0°C  (high = +90.0°C, crit = +100.0°C)
Core 1:         +61.0°C  (high = +90.0°C, crit = +100.0°C)
Core 2:         +63.0°C  (high = +90.0°C, crit = +100.0°C)
Core 3:         +58.0°C  (high = +90.0°C, crit = +100.0°C)
Core 4:         +61.0°C  (high = +90.0°C, crit = +100.0°C)
Core 5:         +60.0°C  (high = +90.0°C, crit = +100.0°C)
Core 6:         +64.0°C  (high = +90.0°C, crit = +100.0°C)
Core 7:         +62.0°C  (high = +90.0°C, crit = +100.0°C)

Tue Nov 28 19:06:14 MST 2017
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +51.0°C  (high = +90.0°C, crit = +100.0°C)
Core 0:         +51.0°C  (high = +90.0°C, crit = +100.0°C)
Core 1:         +49.0°C  (high = +90.0°C, crit = +100.0°C)
Core 2:         +48.0°C  (high = +90.0°C, crit = +100.0°C)
Core 3:         +47.0°C  (high = +90.0°C, crit = +100.0°C)
Core 4:         +48.0°C  (high = +90.0°C, crit = +100.0°C)
Core 5:         +48.0°C  (high = +90.0°C, crit = +100.0°C)
Core 6:         +49.0°C  (high = +90.0°C, crit = +100.0°C)
Core 7:         +47.0°C  (high = +90.0°C, crit = +100.0°C)

coretemp-isa-0001
Adapter: ISA adapter
Physical id 1:  +64.0°C  (high = +90.0°C, crit = +100.0°C)
Core 0:         +59.0°C  (high = +90.0°C, crit = +100.0°C)
Core 1:         +61.0°C  (high = +90.0°C, crit = +100.0°C)
Core 2:         +63.0°C  (high = +90.0°C, crit = +100.0°C)
Core 3:         +58.0°C  (high = +90.0°C, crit = +100.0°C)
Core 4:         +61.0°C  (high = +90.0°C, crit = +100.0°C)
Core 5:         +61.0°C  (high = +90.0°C, crit = +100.0°C)
Core 6:         +64.0°C  (high = +90.0°C, crit = +100.0°C)
Core 7:         +62.0°C  (high = +90.0°C, crit = +100.0°C)

Tue Nov 28 19:06:15 MST 2017
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +52.0°C  (high = +90.0°C, crit = +100.0°C)
Core 0:         +52.0°C  (high = +90.0°C, crit = +100.0°C)
Core 1:         +50.0°C  (high = +90.0°C, crit = +100.0°C)
Core 2:         +49.0°C  (high = +90.0°C, crit = +100.0°C)
Core 3:         +48.0°C  (high = +90.0°C, crit = +100.0°C)
Core 4:         +48.0°C  (high = +90.0°C, crit = +100.0°C)
Core 5:         +48.0°C  (high = +90.0°C, crit = +100.0°C)
Core 6:         +48.0°C  (high = +90.0°C, crit = +100.0°C)
Core 7:         +48.0°C  (high = +90.0°C, crit = +100.0°C)
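
A logger and check of this kind could look roughly like the sketch below. It is not the script used for the test above; the function names (`parse_sensors`, `peak_near_event`, `log_forever`), the log path and the five-second window are all assumptions for illustration. It shells out to the `sensors` command from lm-sensors, parses temperature lines in the format shown above, and reports the hottest reading logged near a BMC event time.

```python
import re
import subprocess
import time
from datetime import datetime, timedelta

# Matches lines such as "Core 0:   +51.0°C  (high = +90.0°C, crit = +100.0°C)".
TEMP_LINE = re.compile(r"^(?P<label>[^:]+):\s+(?P<temp>[+-]?\d+(?:\.\d+)?)°C")


def parse_sensors(text):
    """Return {(chip, label): temp_c} parsed from `sensors` output."""
    readings = {}
    chip = None
    for line in text.splitlines():
        if not line.strip():
            continue
        match = TEMP_LINE.match(line)
        if match:
            key = (chip, match.group("label").strip())
            readings[key] = float(match.group("temp"))
        elif ":" not in line:
            # A chip header such as "coretemp-isa-0000".
            chip = line.strip()
    return readings


def peak_near_event(samples, event_time, window_s=5):
    """Hottest reading within window_s seconds of event_time, or None.

    samples is a list of (datetime, readings) pairs, where readings is
    a dict as returned by parse_sensors.
    """
    window = timedelta(seconds=window_s)
    temps = [
        temp
        for stamp, readings in samples
        if abs(stamp - event_time) <= window
        for temp in readings.values()
    ]
    return max(temps) if temps else None


def log_forever(path="/tmp/coretemp.log", interval_s=1.0):
    """Append a timestamped `sensors` dump to path every interval_s seconds."""
    while True:
        output = subprocess.run(
            ["sensors"], capture_output=True, text=True, check=True
        ).stdout
        with open(path, "a", encoding="utf-8") as log:
            log.write(f"{datetime.now().isoformat()}\n{output}\n")
        time.sleep(interval_s)
```

With the server and BMC clocks both synced, a peak well below the threshold at the time of a BMC assertion points to a bad reading rather than a real overtemp.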

 

 

 

asrock4.PNG

Edited by xnaron
  • Like 1
Link to comment
  • 2 months later...
  • 1 month later...

Hey Guys, 

 

I created an account just to say I am having the same issue. I currently have only 1 2680 v2 and am experiencing temperature spikes / fans getting stuck at 100 percent. I am using this motherboard for a home lab build, so the fan spikes are getting pretty annoying. 

 

I really feel like this is an issue with the BMC; I am on version 18.0. I emailed someone at ASRock who appears to know his stuff, and will let you know what I find out.

 

Have any of you tried reverting to an older version of BMC? I may if I get desperate enough. 

 

Thanks

2018-04-05 21_53_25-Megarac SP.png

Link to comment
14 hours ago, collsni said:

 I currently have only 1 2680 v2 and am experiencing temperature spikes / fans getting stuck at 100 percent. I am using this motherboard for a home lab build, so the fan spikes are getting pretty annoying. 

 

I think you're the first v2 chip to show this issue. Makes me feel more confident that this is a board issue.

 

I really wish they would just take ownership, admit that it is a problem, and come up with a solution (it seems like every time it is reported, ASRock acts like it's the first they've heard of it). The conspiracy theorist in me thinks it is a hardware issue that can't be resolved by a BMC/BIOS update and they are avoiding it so they don't need to replace a bunch of defective boards...

 

Also, I have not tried an older BMC.

Link to comment
  • 1 month later...

Hi Guys,

 

I know this is an ongoing thing with this motherboard; indeed, I also experienced it at one time. I use 2 x Intel® Xeon® CPU E5-2670 0 @ 2.60GHz, and every now and again I would get the 100°C spikes as mentioned, or the fans would spin up to full and only a reboot would solve the issue.

 

However, by some miracle I found a cure for it. I'm not sure if anyone else is using this, but I installed the IPMI plugin and set up thresholds, and since then I haven't experienced the spikes or the fans running at full speed. Just to give you an idea, the fans used to spin up to full every 10 days or so and the server required a reboot, but I've now been running the IPMI plugin for several months and not once have they spun up to full speed and stuck there. OK, with the thresholds I have set up they do spin up faster every now and again, but that is due to the ambient temp rising, and they never go full pelt these days.

 

I'm so glad I found a solution to this as it was driving me nuts when the fans spun up to full speed, sounded like a jet engine taking off.

Link to comment

OK, I just checked the log in the IPMI plugin and it seems that I too am still getting a false positive (hopefully), but it only appears on one of the CPUs in my case:

 

image.thumb.png.35f2f072d54a38c61ae547dfec7e2642.png

 

However, at least the IPMI plugin is keeping the fans under control and I'm not experiencing what sounded like a jet engine every 10 or so days now.

 

Yep, just checked all 273 entries, and all are showing the spike of CPU_BS1 only.

Edited by apefray
Link to comment

SMBus (I2C) communication is notorious for sometimes having bit errors in a transfer and sometimes totally botching the transfer. So any program processing SMBus data needs a filter time so it doesn't react to spikes and instead requires multiple high values in a row before issuing an alarm or increasing the fan speed.
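
The filter described above can be sketched as a simple consecutive-sample check. This is only an illustration of the idea, not how any BMC firmware or monitoring tool is known to implement it; the function name, the 90-degree threshold and the three-sample requirement are assumptions.

```python
def filter_alarm(readings, threshold_c=90.0, required=3):
    """Return one alarm state per reading.

    The alarm is only raised once `required` consecutive readings are at
    or above threshold_c, so a single corrupted sample is ignored.
    """
    streak = 0
    states = []
    for value in readings:
        # Any reading below the threshold resets the run of high values.
        streak = streak + 1 if value >= threshold_c else 0
        states.append(streak >= required)
    return states


# A one-sample spike like the ones reported in this thread never trips it.
print(filter_alarm([40, 101, 30, 45]))
# A sustained high temperature does, from the third high sample onward.
print(filter_alarm([91, 92, 93, 94, 50]))
```

The trade-off is latency: with one reading per second and `required=3`, a real overtemp is reported about two seconds later than it would be without the filter.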

 

Another thing is that only one program at a time can take control of the SMBus master and perform reads. If two programs monitor the bus without some software mechanism to synchronize them, there will be lots and lots of failed transfers. The old Motherboard Monitor introduced a synchronization method for running multiple Windows monitoring programs, but that solution isn't applicable outside of Windows.

  • Upvote 2
Link to comment
10 hours ago, pwm said:

So any program processing SMBus data needs a filter time so it doesn't react to spikes and instead requires multiple high values in a row before issuing an alarm

 

Thanks for the info!

 

I wonder if there's some way to force the BMC to have this requirement before sending out an alert... That would be a solution to this issue, as the event seems to only occur for an extremely short period of time.

Link to comment

Not a solution for the problem, but I essentially turn off fan control and run the fans at full speed all the time. My servers are in a rack in the basement, so it doesn't matter how loud they are. I still see the BIOS event log entries, so I periodically go and clear the log. It's annoying because it is just spamming the log and potentially hiding other important entries. So if anybody sees an improvement after a BIOS upgrade, I would be interested too!

Link to comment
  • 1 month later...
  • 2 weeks later...
On 6/30/2018 at 5:51 AM, deaerator said:

I have the same cpu and asrock board and have the exact same problem.  Anyone know of a fix yet? 

 

I just saw your post so thought I'd check if they released an updated Bios and BMC for the EP2C602-4L/D16 and it looks like they have released a new bios!! June 26 2018. v.190!

 

Has anyone installed it to see if it helps with this? Notes show Spectre and Meltdown support, but also "Improved system performance"....

Link to comment


1 minute ago, DoeBoye said:

 

I just saw your post so thought I'd check if they released an updated Bios and BMC for the EP2C602-4L/D16 and it looks like they have released a new bios!! June 26 2018. v.190!

 

Has anyone installed it to see if it helps with this? Notes show Spectre and Meltdown support, but also "Improved system performance"....

I have the latest bios and still showing false temperatures and the cpu keeps on getting de-asserted.  I found that disabling Turbo has made my system a bit more stable but still getting crashes every couple of days vs every day.

 

Link to comment
Just now, deaerator said:

I have the latest bios and still showing false temperatures and the cpu keeps on getting de-asserted.  I found that disabling Turbo has made my system a bit more stable but still getting crashes every couple of days vs every day.

 

:(. Drag. I was hoping they fixed this, but sounds like they can't... Sigh

Link to comment
