SOLVED: Cache-error-trigger in log file


Solved by JorgeB.

Hello.

I have been trying to troubleshoot some MCE errors in my logs. While my server seems to be running okay, I would like to fix whatever issue is being identified here.

 

After updating to 6.10, I now get some more detail.

 

My log is full of repeating versions of this:

 

May 18 10:07:29 Tower mcelog: Trigger `cache-error-trigger' exited with status 1
May 18 10:07:30 Tower mcelog: CPU 0 on socket 0 has large number of corrected cache errors in Level-3 Instruction
May 18 10:07:30 Tower mcelog: System operating correctly, but might lead to uncorrected cache errors soon
May 18 10:07:30 Tower mcelog: Running trigger `cache-error-trigger' (reporter: yellow)
May 18 10:07:30 Tower mcelog: Offlining CPU 0 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 0 failed
May 18 10:07:30 Tower mcelog: Offlining CPU 1 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 2 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 3 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 4 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 5 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 6 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 7 due to cache error threshold
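For anyone sifting through a similar log, here is a minimal Python sketch that lists which CPUs mcelog tried to offline and which attempts it reported as failed. It is not part of mcelog or Unraid; the function name `summarize_offlining` and the regular expression are assumptions based only on the message format shown in the excerpt above.

```python
import re

# Assumed message format, taken from the log excerpt in this post:
#   "... mcelog: Offlining CPU 0 due to cache error threshold"
#   "... mcelog: Offlining CPU 0 failed"
OFFLINE_RE = re.compile(
    r"mcelog: Offlining CPU (\d+) (due to cache error threshold|failed)"
)


def summarize_offlining(lines):
    """Return (attempted, failed) as sorted lists of CPU numbers."""
    attempted, failed = set(), set()
    for line in lines:
        match = OFFLINE_RE.search(line)
        if match is None:
            continue  # not an offlining message
        cpu = int(match.group(1))
        if match.group(2) == "failed":
            failed.add(cpu)
        else:
            attempted.add(cpu)
    return sorted(attempted), sorted(failed)
```

Run against the lines above, this would report an offlining attempt for CPUs 0 through 7 and a failure for CPU 0 only, which is the one failure visible in the excerpt.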

 

Thanks for any suggestions!

tower-diagnostics-20220518-1008.zip

Edited by slofiend

Here is the latest output. My log memory is now pegged at 100%.

Tower kernel: smpboot: CPU0: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (family: 0x6, model: 0x5e, stepping: 0x3)
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: Machine check events logged
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: cc44dec000041136
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 100190ac0 MISC 3000034086 
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1652890472 SOCKET 0 APIC 0 microcode f0
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: Machine check events logged
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: TSC 1bd2a07240 ADDR 302bcc0 MISC 3020034086 
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1652890472 SOCKET 0 APIC 0 microcode f0
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c4000400004117a
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: TSC 1bd2ace75e ADDR 83e0d8c0 MISC 3936e034086 
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1652890472 SOCKET 0 APIC 0 microcode f0
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041136
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: TSC 1bd2b5632c ADDR 48034c0 MISC 3040034086 
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1652890472 SOCKET 0 APIC 0 microcode f0

 

This goes on and on. Is this maybe a bad CPU?
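The hex value after "Bank 6:" is the raw machine-check status register for that bank. As a rough aid for reading it, here is a minimal Python sketch that splits the value into a few fields. The bit positions follow the architectural IA32_MCi_STATUS layout as I understand it from Intel's documentation; the function name `decode_mci_status` is hypothetical, and the sketch makes no attempt to interpret the model-specific bits or to translate the error code into words.

```python
def _bit(value, position):
    """Return True if the given bit is set in value."""
    return bool((value >> position) & 1)


def decode_mci_status(status):
    """Split a 64-bit machine-check status value into basic fields.

    Assumed layout (architectural IA32_MCi_STATUS bits):
      63 VAL, 62 OVER, 61 UC, 60 EN, 59 MISCV, 58 ADDRV, 57 PCC,
      52:38 corrected error count, 31:16 model-specific code,
      15:0 MCA error code.
    """
    return {
        "valid": _bit(status, 63),
        "overflow": _bit(status, 62),
        "uncorrected": _bit(status, 61),
        "enabled": _bit(status, 60),
        "misc_valid": _bit(status, 59),
        "addr_valid": _bit(status, 58),
        "context_corrupt": _bit(status, 57),
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # Two status values copied from the log excerpt above.
    for raw in ("cc44dec000041136", "8c40004000041152"):
        fields = decode_mci_status(int(raw, 16))
        print(raw, fields)
```

Under that layout, both values have the UC bit clear, which matches mcelog describing these as corrected cache errors. The first value also has the overflow bit set, meaning at least one earlier error in that bank was overwritten before it could be read.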

I have attached diagnostics.

I know there are MCE errors in the log, but after an upgrade my log memory is maxed out. That didn't happen before.

 

I don't know what to do about the MCE errors. If anyone can help me narrow down what the issue might be, I'd be really grateful.

I'm hoping correcting that will help reduce the memory utilization.

 

syslog.1.txt

tower-diagnostics-20220519-1214.zip

I have merged your threads. Please don't post in multiple threads for the same problem. It makes it impossible to coordinate replies.

 

If, after a reasonable time, you feel your thread isn't getting attention, make a new post in the same thread (known as "bumping"; people will often just post "bump").

 

This seems like a hardware problem to me. Do you have good power?

Thanks for the help, and for the correction on posting. Fair enough.
As far as I know, power is fine; the server has been up for months. It's also running through a battery backup and power regulator, so it should be quite stable.

And I agree it seems like hardware. I'm just not sure how to diagnose which hardware is the issue, aside from just replacing the CPU (if that is the issue).

Edited by slofiend

Ah, good test idea.
So the memory issue no longer occurs on 6.9.2, but I think that's because the MCE errors seem to get logged at a lower verbosity level than in 6.10.

I still have them, though.

 

I've attached an updated log, if that helps at all.


May 20 08:23:15 Tower kernel: rcu: Hierarchical SRCU implementation.
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: TSC 1bf8e2f6d6 ADDR 43a00c0 MISC 3020034086 
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1653060177 SOCKET 0 APIC 0 microcode e2
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: TSC 1bf8e5b88f ADDR 3347cc0 MISC 3020034086 
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1653060177 SOCKET 0 APIC 0 microcode e2
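To compare how many machine-check events each release logs, a small Python sketch like the following could tally the kernel lines by CPU, bank, and status value. The function name `tally_mce` and the regular expression are assumptions based only on the line format shown in these excerpts; other kernels may format the lines differently.

```python
import re
from collections import Counter

# Assumed line format, taken from the excerpts in this thread:
#   "... mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152"
MCE_RE = re.compile(
    r"mce: \[Hardware Error\]: CPU (\d+): Machine Check: \d+ "
    r"Bank (\d+): ([0-9a-fA-F]+)"
)


def tally_mce(lines):
    """Count events keyed by (cpu, bank, status_hex)."""
    counts = Counter()
    for line in lines:
        match = MCE_RE.search(line)
        if match is None:
            continue  # TSC/ADDR and PROCESSOR lines are skipped
        cpu, bank, status = match.groups()
        counts[(int(cpu), int(bank), status.lower())] += 1
    return counts
```

Feeding it a saved syslog from each release (for example, `tally_mce(open("syslog.1.txt"))`) would show whether the same CPU and bank appear in both, independent of how verbosely each version logs.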

 

I don't have another CPU to swap in to see if that fixes it, and it doesn't read to me like a memory error where I just need to move some DIMMs around, so I'm not sure of the best way to A/B test this to confirm a fix.

 

Thanks for the help.

tower-diagnostics-20220520-0828.zip

  • 1 year later...
On 5/20/2022 at 8:37 AM, JorgeB said:

Check that it's well seated; if errors persist, you likely have a bad CPU.

Just to clarify, @slofiend: did you have to reseat your CPU, or was it bad and had to be replaced? I'm getting MCE errors myself after upgrading my CPU and memory.
