May 18, 2022
Hello. I have been trying to troubleshoot some MCE errors in my logs. While my server seems to be running okay, I would like to fix whatever issue is being identified here. After updating to 6.10 I now get more detail. My log is full of repeating versions of this:

May 18 10:07:29 Tower mcelog: Trigger `cache-error-trigger' exited with status 1
May 18 10:07:30 Tower mcelog: CPU 0 on socket 0 has large number of corrected cache errors in Level-3 Instruction
May 18 10:07:30 Tower mcelog: System operating correctly, but might lead to uncorrected cache errors soon
May 18 10:07:30 Tower mcelog: Running trigger `cache-error-trigger' (reporter: yellow)
May 18 10:07:30 Tower mcelog: Offlining CPU 0 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 0 failed
May 18 10:07:30 Tower mcelog: Offlining CPU 1 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 2 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 3 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 4 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 5 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 6 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 7 due to cache error threshold

Thanks for any suggestions!
tower-diagnostics-20220518-1008.zip
Edited May 20, 2022 by slofiend
May 19, 2022 Author
Here is the latest output. My log memory is now pegged at 100%.

Tower kernel: smpboot: CPU0: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (family: 0x6, model: 0x5e, stepping: 0x3)
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: Machine check events logged
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: cc44dec000041136
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 100190ac0 MISC 3000034086
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1652890472 SOCKET 0 APIC 0 microcode f0
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: Machine check events logged
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: TSC 1bd2a07240 ADDR 302bcc0 MISC 3020034086
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1652890472 SOCKET 0 APIC 0 microcode f0
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c4000400004117a
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: TSC 1bd2ace75e ADDR 83e0d8c0 MISC 3936e034086
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1652890472 SOCKET 0 APIC 0 microcode f0
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041136
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: TSC 1bd2b5632c ADDR 48034c0 MISC 3040034086
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1652890472 SOCKET 0 APIC 0 microcode f0

This goes on and on. Is this a bad CPU, maybe?
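For anyone wanting to read the "Bank 6" status words in the log above by hand, here is a minimal Python sketch that splits a status value into its main fields. The bit positions follow the architectural IA32_MCi_STATUS layout from Intel's manuals; the function name and dictionary keys are made up for illustration. It assumes the CPU reports the corrected-error count and threshold status fields, which not every processor does, so treat those two values as hints rather than facts.

```python
def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine-check bank status value into its main fields.

    Helper name and keys are illustrative; bit positions are the
    architectural IA32_MCi_STATUS ones.
    """
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    # Bits 54:53 are only meaningful for corrected errors on CPUs that
    # support threshold-based error status (an assumption here).
    threshold_names = {0: "none", 1: "green", 2: "yellow", 3: "reserved"}

    return {
        "valid": bit(63),
        "overflow": bit(62),            # at least one earlier event was overwritten
        "uncorrected": bit(61),         # False means the hardware corrected it
        "enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        "threshold_status": threshold_names[(status >> 53) & 0x3],
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


if __name__ == "__main__":
    # Status words copied from the kernel log in this thread.
    for raw in ("cc44dec000041136", "8c40004000041152"):
        print(raw, decode_mci_status(int(raw, 16)))
```

For both sample values the "uncorrected" flag comes out False and the threshold status comes out "yellow", which lines up with mcelog describing these as corrected cache errors with "reporter: yellow".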
May 20, 2022 Author
I have attached diagnostics. I know there are MCE errors in the log, but after an upgrade my log memory is maxed out, which didn't happen before. I don't know what to do about the MCE errors. If anyone can help me narrow down what the issue might be, I'd be really grateful. I'm hoping that correcting them will help reduce the memory utilization.
syslog.1.txt
tower-diagnostics-20220519-1214.zip
May 20, 2022 Community Expert
I have merged your threads. Please don't post in multiple threads for the same problem. It makes it impossible to coordinate replies. If after a reasonable time you feel your thread isn't getting attention, make a new post in the same thread (known as "bumping"; people will often just post "bump").

This seems like a hardware problem to me. Do you have good power?
May 20, 2022 Author
Thanks for the help, and for the correction about posting. Fair enough. As far as I know, power is fine; the server has been up for months. It's also running through a battery backup and power regulator, so it should be quite stable. I agree it seems like hardware. I'm just not sure how to diagnose which hardware is at fault, aside from simply replacing the CPU (if that is the issue).
Edited May 20, 2022 by slofiend
May 20, 2022 Community Expert
Try going back to v6.9.2 to see if the errors go away. IIRC there's another user with a similar issue with the same family of CPU.
May 20, 2022 Author
Ah, good test idea. The memory issue no longer occurs on 6.9.2, but I think that's because the MCE errors seem to get logged at a lower verbosity level than in 6.10; I still have them. I've attached an updated log in case that helps.

May 20 08:23:15 Tower kernel: rcu: Hierarchical SRCU implementation.
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: TSC 1bf8e2f6d6 ADDR 43a00c0 MISC 3020034086
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1653060177 SOCKET 0 APIC 0 microcode e2
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: TSC 1bf8e5b88f ADDR 3347cc0 MISC 3020034086
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1653060177 SOCKET 0 APIC 0 microcode e2

I don't have another CPU to swap in to see if that fixes it. It doesn't read to me like a memory error where I just need to move some DIMMs around, so I'm not sure of the best way to A/B test this to confirm a fix. Thanks for the help.
tower-diagnostics-20220520-0828.zip
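One rough way to A/B test a change such as reseating the CPU is to count the machine-check lines in a syslog captured before the change and in one captured after it, then compare the totals over similar uptime. The Python sketch below does that counting. The regular expression is written against the kernel log format shown in this thread, and the function name is only illustrative; other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152"
MCE_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)


def count_mce_events(lines):
    """Return a Counter keyed by (cpu, bank, status) for each logged event."""
    counts = Counter()
    for line in lines:
        match = MCE_LINE.search(line)
        if match:
            key = (int(match["cpu"]), int(match["bank"]), match["status"].lower())
            counts[key] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the 6.9.2 log in this thread.
    sample = [
        "May 20 08:23:15 Tower kernel: rcu: Hierarchical SRCU implementation.",
        "May 20 08:23:15 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152",
        "May 20 08:23:15 Tower kernel: mce: [Hardware Error]: TSC 1bf8e2f6d6 ADDR 43a00c0 MISC 3020034086",
        "May 20 08:23:15 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152",
    ]
    for (cpu, bank, status), n in count_mce_events(sample).most_common():
        print(f"CPU {cpu} bank {bank} status {status}: {n} event(s)")
```

To use it on a real log, pass an open file object instead of the sample list, since iterating a file yields one line at a time. If the count stays at zero for a while after a change, that is some evidence the change helped, though it is not proof.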
May 20, 2022 Community Expert Solution
Check that it's well seated. If the errors persist, you likely have a bad CPU.
April 16, 2024
On 5/20/2022 at 8:37 AM, JorgeB said: "Check that it's well seated, if errors persist you likely have a bad CPU."

Just to clarify, @slofiend: did you have to reseat your CPU? Was it bad, and did you have to replace it? I'm getting MCE errors myself after upgrading my CPU and memory.