DoeBoye Posted April 26, 2017
I'm working with ASRock technicians to troubleshoot an issue with my ASRock EP2C602-4L/D16 motherboard. I get false alerts that a CPU is overheating (the reading returns to normal after a few seconds), and sometimes the fans get stuck at maximum speed. Their technicians are having a difficult time replicating the issue, so I thought I'd try to provide them with more data points. I'd like to find out how many people are having the problem, and whether it occurs with any other motherboard and/or CPU combinations, so I thought I'd run a poll. I'm mostly interested in people using an ASRock EP2C602-4L/D16 with a Xeon E5-2650/70, but I added other options as well for anyone else experiencing the issue, in case the problem is in fact with the chip and not the motherboard. Please use the comments if your response needs more detail (if you chose 'Other', for example). Thanks!!!
BRiT Posted April 26, 2017
If I have alerts enabled, I receive a MFT of them. I have alerts disabled and might have changed another setting or two so I'm not annoyed by all the false positives. I don't know if I'm running the latest BIOS and firmware, though.
Dual Intel® Xeon® CPU E5-2670 0 @ 2.60GHz [ Type 0, Family 6, Model 45, Stepping ] on EP2C602-4L/D16, BIOS L1.89E 05/15/2015 with SMBIOS 2.7 present, using 128 GB RAM @ 1333 MHz.
Device Information: Firmware Revision 0.13.9, Firmware Build Time Apr 8 2015 14:34:22 CST.
I'll have to check the IPMI logs to see just how silly it's been. I have 69 pages of this shit going on since the last time I cleared all events, covering 2017-04-17 back to 2016-10-31.
DoeBoye Posted May 2, 2017 (thread author)
Thanks BRiT! Anyone else have a vote? I'd love to be able to help ASRock solve this issue! Thanks!
TBSCamCity Posted May 2, 2017
DoeBoye said: "Thanks BRiT! Anyone else have a vote? I'd love to be able to help ASRock solve this issue! Thanks!"
I started another thread today about my attempted switch from dual E5-2670s to dual E5-2650Ls on a Supermicro X9DRi-LN4F+. I was getting much higher fan speeds with them even though temps were very low. I switched back to the 2670s and the fan noise is back down. I tried everything to fix it, but I finally submitted a ticket with Supermicro.
KyDay Posted May 10, 2017
Hi guys, same issue here with 2x Xeon E5-2697 v2 (ES). So far I have been able to connect to the sensors via IPMI locally, as well as to the BMC with ipmi-sensors-config and direct telnet. However, I am stuck on what to do next. My particular issue is also the insane speed-up of my 1300 RPM Noctuas to 2600+ RPM (3400 max). It's really annoying to see the CPU assertion/deassertion spikes from 40 degrees to 90 to 30, etc. It must be a BMC issue, in my opinion. I created a support ticket with ASRock directly yesterday and will keep you posted.
Regards, KyDay
DoeBoye Posted May 10, 2017 (thread author)
KyDay said: "I have created a support ticket to Asrock directly yesterday. And keep you posted."
Please do!
DoeBoye Posted September 12, 2017 (thread author)
On 5/10/2017 at 11:19 AM, KyDay said: "I have created a support ticket to Asrock directly yesterday. And keep you posted. Regards KyDay"
Just following up to see if you ever resolved this. Currently my solution is an inbox filter that auto-moves these warnings to a subfolder so they don't constantly pollute my inbox. As an aside, I've noticed that it seems to happen most often under moderate to heavy load (unRARing, parity checks, etc.). Has anyone else figured this out or heard from ASRock about a solution?
xnaron Posted November 27, 2017 (edited November 27, 2017)
I just noticed the same issue here. Does anyone have any updates on this? I was going to try clearing the CMOS by removing the battery and shorting the CMOS pins on the motherboard, but I assume others have tried that? I first thought the events were happening on only one CPU (BSP1), but I left a load on the machine last night and saw that AP1 had a few events as well. I agree that these events are more frequent with increased load.
EP2C602-4L/D16 with 2x E5-2670 v1
BIOS: 1.80 (latest on the ASRock website; is there a newer BIOS?)
BMC: 00.18.00 (latest on the ASRock site)
Normally the events are so short that they don't appear in the widgets. I was able to capture some, as pictured.
xnaron Posted November 27, 2017 (edited November 27, 2017)
Check out this weird dip on BSP1 and a spike on AP1. Definitely a bug: it's impossible for it to dip to 0°C.
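A monitoring script could guard against physically impossible readings like this 0°C dip with a simple plausibility filter. The sketch below is a minimal Python example under stated assumptions: the limits (MIN_PLAUSIBLE_C, MAX_PLAUSIBLE_C, MAX_STEP_C, MAX_CONSECUTIVE_REJECTS) and the function name are hypothetical choices for illustration, not values taken from ASRock or the BMC firmware.

```python
from typing import Iterable, List, Optional

# Hypothetical plausibility limits, chosen for illustration only.
MIN_PLAUSIBLE_C = 5.0        # a running CPU should never read 0 °C
MAX_PLAUSIBLE_C = 110.0      # well above the 100 °C crit value lm-sensors shows
MAX_STEP_C = 20.0            # largest believable change between 1 s samples
MAX_CONSECUTIVE_REJECTS = 3  # after this many rejects in a row, waive the jump check


def filter_implausible(readings: Iterable[float]) -> List[Optional[float]]:
    """Replace implausible samples with None.

    A sample is rejected if it lies outside [MIN_PLAUSIBLE_C, MAX_PLAUSIBLE_C],
    or if it differs from the last accepted sample by more than MAX_STEP_C.
    The jump check is waived after MAX_CONSECUTIVE_REJECTS rejects in a row,
    so a genuine, sustained change is eventually accepted.
    """
    result: List[Optional[float]] = []
    last_good: Optional[float] = None
    rejects = 0
    for value in readings:
        in_range = MIN_PLAUSIBLE_C <= value <= MAX_PLAUSIBLE_C
        jump_ok = (
            last_good is None
            or abs(value - last_good) <= MAX_STEP_C
            or rejects >= MAX_CONSECUTIVE_REJECTS
        )
        if in_range and jump_ok:
            result.append(value)
            last_good = value
            rejects = 0
        else:
            result.append(None)
            rejects += 1
    return result


if __name__ == "__main__":
    # A 0 °C dip and a one-sample spike are dropped; normal readings pass.
    print(filter_implausible([48.0, 49.0, 0.0, 50.0, 95.0, 51.0]))
    # → [48.0, 49.0, None, 50.0, None, 51.0]
```

The escape hatch after a few consecutive rejects is a design choice: without it, a real step change larger than MAX_STEP_C would be discarded forever.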
xnaron Posted November 27, 2017 (edited November 27, 2017)
These spikes might not be false temperature readings. Maybe there is a bug in the BIOS or motherboard hardware that causes an overvoltage and makes the CPU temperature spike. Regardless, the motherboard thinks the CPU is overheating and runs the fans at max. I wonder if it is also thermally throttling the cores.
xnaron Posted November 28, 2017 (edited November 28, 2017)
I ran some more tests, this time booting into MemTest86 instead of ESXi and using all cores in parallel for the test. A couple of the events lasted over 30 seconds. I am concerned that these aren't false positives and that the mainboard/BIOS is doing something out of spec with voltage and causing the overtemperature. While monitoring the graphs I have seen it at 101°C.
xnaron Posted November 29, 2017 (edited November 29, 2017)
I did some more testing today. I installed Ubuntu Server 16.04 and used stress to load the 32 cores. I wrote a script that collects the core temperatures every second using lm-sensors (not IPMI) and logs them. The time on the server and the BMC is synced with NTP. I waited for an event to occur and then checked the logs: there is no temperature spike in the lm-sensors log at the time of the corresponding UNC (upper non-critical) assertion in the BMC log. This makes me feel better, since it points to a bug in reading the temperature rather than a flaw that causes the CPU to exceed the UNC threshold. I checked multiple events and grepped the lm-sensors log for a spike, and I could not find one. An excerpt:

Tue Nov 28 19:06:13 MST 2017
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +51.0°C (high = +90.0°C, crit = +100.0°C)
Core 0: +51.0°C (high = +90.0°C, crit = +100.0°C)
Core 1: +50.0°C (high = +90.0°C, crit = +100.0°C)
Core 2: +49.0°C (high = +90.0°C, crit = +100.0°C)
Core 3: +48.0°C (high = +90.0°C, crit = +100.0°C)
Core 4: +48.0°C (high = +90.0°C, crit = +100.0°C)
Core 5: +48.0°C (high = +90.0°C, crit = +100.0°C)
Core 6: +49.0°C (high = +90.0°C, crit = +100.0°C)
Core 7: +47.0°C (high = +90.0°C, crit = +100.0°C)
coretemp-isa-0001
Adapter: ISA adapter
Physical id 1: +64.0°C (high = +90.0°C, crit = +100.0°C)
Core 0: +59.0°C (high = +90.0°C, crit = +100.0°C)
Core 1: +61.0°C (high = +90.0°C, crit = +100.0°C)
Core 2: +63.0°C (high = +90.0°C, crit = +100.0°C)
Core 3: +58.0°C (high = +90.0°C, crit = +100.0°C)
Core 4: +61.0°C (high = +90.0°C, crit = +100.0°C)
Core 5: +60.0°C (high = +90.0°C, crit = +100.0°C)
Core 6: +64.0°C (high = +90.0°C, crit = +100.0°C)
Core 7: +62.0°C (high = +90.0°C, crit = +100.0°C)

Tue Nov 28 19:06:14 MST 2017
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +51.0°C (high = +90.0°C, crit = +100.0°C)
Core 0: +51.0°C (high = +90.0°C, crit = +100.0°C)
Core 1: +49.0°C (high = +90.0°C, crit = +100.0°C)
Core 2: +48.0°C (high = +90.0°C, crit = +100.0°C)
Core 3: +47.0°C (high = +90.0°C, crit = +100.0°C)
Core 4: +48.0°C (high = +90.0°C, crit = +100.0°C)
Core 5: +48.0°C (high = +90.0°C, crit = +100.0°C)
Core 6: +49.0°C (high = +90.0°C, crit = +100.0°C)
Core 7: +47.0°C (high = +90.0°C, crit = +100.0°C)
coretemp-isa-0001
Adapter: ISA adapter
Physical id 1: +64.0°C (high = +90.0°C, crit = +100.0°C)
Core 0: +59.0°C (high = +90.0°C, crit = +100.0°C)
Core 1: +61.0°C (high = +90.0°C, crit = +100.0°C)
Core 2: +63.0°C (high = +90.0°C, crit = +100.0°C)
Core 3: +58.0°C (high = +90.0°C, crit = +100.0°C)
Core 4: +61.0°C (high = +90.0°C, crit = +100.0°C)
Core 5: +61.0°C (high = +90.0°C, crit = +100.0°C)
Core 6: +64.0°C (high = +90.0°C, crit = +100.0°C)
Core 7: +62.0°C (high = +90.0°C, crit = +100.0°C)

Tue Nov 28 19:06:15 MST 2017
coretemp-isa-0000
Adapter: ISA adapter
Physical id 0: +52.0°C (high = +90.0°C, crit = +100.0°C)
Core 0: +52.0°C (high = +90.0°C, crit = +100.0°C)
Core 1: +50.0°C (high = +90.0°C, crit = +100.0°C)
Core 2: +49.0°C (high = +90.0°C, crit = +100.0°C)
Core 3: +48.0°C (high = +90.0°C, crit = +100.0°C)
Core 4: +48.0°C (high = +90.0°C, crit = +100.0°C)
Core 5: +48.0°C (high = +90.0°C, crit = +100.0°C)
Core 6: +48.0°C (high = +90.0°C, crit = +100.0°C)
Core 7: +48.0°C (high = +90.0°C, crit = +100.0°C)
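The cross-check described in this post (lm-sensors logged once per second, compared against BMC event times) can be sketched in Python. This is not xnaron's actual script; it assumes the log layout of the excerpt (a line printed by `date`, then plain `sensors` output), and the function names `parse_log` and `peak_near` and the ±5 s window are hypothetical.

```python
import re
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

# Assumed log layout: a line printed by `date`, followed by the plain
# `sensors` output captured in the same second.
STAMP_RE = re.compile(r"^\w{3} (\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \w+ (\d{4})$")
CHIP_RE = re.compile(r"^(coretemp-isa-\d+)$")
TEMP_RE = re.compile(r"^(Physical id \d+|Core \d+):\s+\+?(-?\d+\.\d+)°C")

Reading = Tuple[datetime, str, float]


def parse_log(text: str) -> List[Reading]:
    """Return (timestamp, 'chip/label', temperature in °C) for each reading."""
    rows: List[Reading] = []
    stamp = None
    chip = ""
    for raw in text.splitlines():
        line = raw.strip()
        m = STAMP_RE.match(line)
        if m:
            # Weekday and time-zone name are dropped, so the BMC event time
            # passed to peak_near must be in the same local time zone.
            stamp = datetime.strptime(f"{m.group(1)} {m.group(2)}", "%b %d %H:%M:%S %Y")
            continue
        m = CHIP_RE.match(line)
        if m:
            chip = m.group(1)
            continue
        m = TEMP_RE.match(line)
        if m and stamp is not None:
            rows.append((stamp, f"{chip}/{m.group(1)}", float(m.group(2))))
    return rows


def peak_near(rows: List[Reading], event: datetime, window_s: int = 5) -> Dict[str, float]:
    """Highest reading per sensor within +/- window_s seconds of a BMC event."""
    lo = event - timedelta(seconds=window_s)
    hi = event + timedelta(seconds=window_s)
    peaks: Dict[str, float] = {}
    for stamp, label, temp in rows:
        if lo <= stamp <= hi:
            peaks[label] = max(temp, peaks.get(label, temp))
    return peaks


if __name__ == "__main__":
    excerpt = (
        "Tue Nov 28 19:06:13 MST 2017\n"
        "coretemp-isa-0000\n"
        "Adapter: ISA adapter\n"
        "Physical id 0: +51.0°C (high = +90.0°C, crit = +100.0°C)\n"
        "Tue Nov 28 19:06:15 MST 2017\n"
        "coretemp-isa-0000\n"
        "Physical id 0: +52.0°C (high = +90.0°C, crit = +100.0°C)\n"
    )
    event = datetime(2017, 11, 28, 19, 6, 14)  # hypothetical SEL timestamp
    print(peak_near(parse_log(excerpt), event))
    # → {'coretemp-isa-0000/Physical id 0': 52.0}
```

If the peak near every UNC assertion stays far below the 90 °C "high" value, that supports the reading-bug explanation over a real overtemperature.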
DoeBoye Posted November 29, 2017 (thread author)
Thanks for all the detailed testing! It's nice to get hard numbers supporting the theories!
DoeBoye Posted February 22, 2018 (thread author)
Has anyone ever heard back from ASRock about a solution? It's fallen off my radar, as I've set up a filter in my email to move the warning emails to a separate folder so they don't annoy me all the time. That said, I'm still getting multiple warnings on a daily basis.
collsni Posted April 6, 2018
Hey guys, I created an account just to say I am having the same issue. I currently have only one E5-2680 v2 and am experiencing temperature spikes and fans getting stuck at 100 percent. I am using this motherboard for a home lab build, so the fan spikes are getting pretty annoying. I really feel like this is an issue with the BMC; I am on version 18.0. I emailed someone at ASRock who appears to know his stuff and will let you know what I find out. Have any of you tried reverting to an older BMC version? I may if I get desperate enough. Thanks
DoeBoye Posted April 6, 2018 (thread author)
collsni said: "I currently have only 1 2680 v2 and am experiencing temperature spikes / fans getting stuck at 100 percent. I am using this motherboard for a home lab build, so the fan spikes are getting pretty annoying."
I think yours is the first v2 chip to show this issue, which makes me more confident that this is a board issue. I really wish they would take ownership, admit that it is a problem (it seems like every time it is reported, ASRock acts like it's the first time they've heard of it), and come up with a solution. The conspiracy theorist in me thinks it is a hardware issue that can't be resolved by a BMC/BIOS update, and they are avoiding it so they don't need to replace a bunch of defective boards... Also, I have not tried an older BMC.
apefray Posted May 11, 2018
Hi guys, I know this is an ongoing thing with this motherboard; indeed, I also experienced it at one time. I use 2x Intel® Xeon® CPU E5-2670 0 @ 2.60GHz, and every now and again I would get the 100°C spikes mentioned above, or the fans would spin up to full speed and only a reboot would solve the issue. However, by some miracle I found a cure for it. I'm not sure if anyone else is using it, but I installed the IPMI plugin and set up thresholds, and since then I haven't experienced the spikes or the fans running at full speed. To give you an idea, the fans would spin up to full speed every 10 days or so and the server required a reboot, but I've now been running the IPMI plugin for several months and not once have they spun up to full speed and stuck there. OK, due to the thresholds I have set up they do spin up faster every now and again, but that is due to the ambient temperature rising; they never go full pelt these days. I'm so glad I found a solution to this, as it was driving me nuts when the fans spun up to full speed; it sounded like a jet engine taking off.
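Threshold-based fan control of the kind described here usually means a fan curve: minimum duty below one temperature, full speed above another, and a ramp in between. The Python sketch below is a minimal illustration under that assumption; the threshold values and the `fan_duty` name are hypothetical, and the IPMI plugin's actual settings and control logic may differ.

```python
# Hypothetical thresholds for illustration; the IPMI plugin's actual
# settings and control logic may differ.
TEMP_LOW_C = 40.0    # at or below this, fans run at the minimum duty
TEMP_HIGH_C = 75.0   # at or above this, fans run at full speed
DUTY_MIN_PCT = 30
DUTY_MAX_PCT = 100


def fan_duty(temp_c: float) -> int:
    """Map a CPU temperature to a fan duty cycle in percent.

    Between the two thresholds the duty rises linearly, so a warmer room
    produces a gradual speed-up rather than a jump straight to 100 %.
    """
    if temp_c <= TEMP_LOW_C:
        return DUTY_MIN_PCT
    if temp_c >= TEMP_HIGH_C:
        return DUTY_MAX_PCT
    fraction = (temp_c - TEMP_LOW_C) / (TEMP_HIGH_C - TEMP_LOW_C)
    return round(DUTY_MIN_PCT + fraction * (DUTY_MAX_PCT - DUTY_MIN_PCT))


if __name__ == "__main__":
    for t in (35.0, 57.5, 80.0):
        print(t, fan_duty(t))  # 30 %, 65 %, 100 %
```

A curve alone still maps a single glitched 101 °C reading to 100 %; keeping such a spike from reaching the fans also requires smoothing or debouncing the input temperature.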
apefray Posted May 11, 2018 (edited May 11, 2018)
OK, I just checked the log in the IPMI plugin, and it seems that I too am still getting false positives (hopefully), but in my case they appear on only one of the CPUs. However, at least the IPMI plugin is keeping the fans under control, and I'm no longer hearing a jet engine every 10 or so days. Yep, I just checked all 273 entries, and all of them show the spike on CPU_BS1 only.
pwm Posted May 11, 2018
SMBus (I2C) communication is notorious for sometimes having bit errors in the transfer, or sometimes botching the transfer entirely. So any program processing SMBus data needs a filter time, so that it doesn't react to spikes and instead requires multiple high values in a row before issuing an alarm or increasing the fan speed.
Another thing is that only one program at a time can take control of the SMBus master and perform reads. If two programs supervise the bus without some software mechanism to synchronize them, there will be lots and lots of failed transfers. The old Motherboard Monitor introduced a synchronization method for running multiple Windows monitoring programs, but that solution isn't applicable outside of Windows.
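The "multiple high values in a row" rule is a standard debounce. A minimal Python sketch follows; the class and function names, the 90 °C threshold and the default of three samples are hypothetical, and nothing in this thread describes how the ASRock BMC firmware actually decides to raise an event.

```python
from typing import Iterable, List


class DebouncedAlarm:
    """Report an alarm only after `required` consecutive readings above `threshold`.

    Any reading at or below the threshold resets the count, so an isolated
    bad SMBus transfer cannot raise the alarm on its own.
    """

    def __init__(self, threshold: float, required: int = 3) -> None:
        if required < 1:
            raise ValueError("required must be at least 1")
        self.threshold = threshold
        self.required = required
        self._consecutive_high = 0

    def update(self, reading: float) -> bool:
        """Feed one sample; return True while the alarm condition holds."""
        if reading > self.threshold:
            self._consecutive_high += 1
        else:
            self._consecutive_high = 0
        return self._consecutive_high >= self.required


def alarm_states(readings: Iterable[float], threshold: float, required: int = 3) -> List[bool]:
    """Run a fresh DebouncedAlarm over a sequence and collect its output."""
    alarm = DebouncedAlarm(threshold, required)
    return [alarm.update(r) for r in readings]


if __name__ == "__main__":
    # A one-sample glitch at 101 °C is ignored; three highs in a row trip the alarm.
    print(alarm_states([50.0, 101.0, 52.0, 95.0, 96.0, 97.0, 60.0], threshold=90.0))
    # → [False, False, False, False, False, True, False]
```

The trade-off is latency: with one-second polling and `required=3`, a genuine overtemperature is reported about two seconds later than it would be otherwise.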
DoeBoye Posted May 11, 2018 (thread author)
pwm said: "So any program processing SMBus data needs a filter time so it doesn't react to spikes and instead requires multiple high values in a row before issuing an alarm"
Thanks for the info! I wonder if there's some way to force the BMC to apply this requirement before sending out an alert... That would be a solution to this issue, as the event seems to occur only for an extremely short period of time.
BobPhoenix Posted May 11, 2018
Not a solution to the problem, but I essentially turn off fan control and run the fans at full speed all the time. My servers are in my rack in the basement, so it doesn't matter how loud they are. I still see the BIOS event log entries, so I periodically go and clear the log. It's annoying because it just spams the log and could hide other important entries. So if anybody sees an improvement after a BIOS upgrade, I would be interested too!
deaerator Posted June 30, 2018
I have the same CPU and ASRock board and have the exact same problem. Does anyone know of a fix yet?
DoeBoye Posted July 10, 2018 (thread author)
On 6/30/2018 at 5:51 AM, deaerator said: "I have the same cpu and asrock board and have the exact same problem. Anyone know of a fix yet?"
I just saw your post, so I thought I'd check whether they had released an updated BIOS or BMC for the EP2C602-4L/D16, and it looks like they have released a new BIOS!! June 26, 2018, v.190! Has anyone installed it to see if it helps with this? The release notes mention Spectre and Meltdown support, but also "Improved system performance"....
deaerator Posted July 10, 2018
DoeBoye said: "I just saw your post so thought I'd check if they released an updated Bios and BMC for the EP2C602-4L/D16 and it looks like they have released a new bios!! June 26 2018. v.190! Has anyone installed it to see if it helps with this? Notes show Spectre and Meltdown support, but also 'Improved system performance'...."
I have the latest BIOS and it still shows false temperatures, and the CPU keeps getting de-asserted. I found that disabling Turbo has made my system a bit more stable, but I still get crashes every couple of days instead of every day.
DoeBoye Posted July 10, 2018 (thread author)
deaerator said: "I have the latest bios and still showing false temperatures and the cpu keeps on getting de-asserted. I found that disabling Turbo has made my system a bit more stable but still getting crashes every couple of days vs every day."
:( Drag. I was hoping they'd fixed this, but it sounds like they can't... Sigh