slofiend Posted May 18, 2022 (edited)
Hello. I have been trying to troubleshoot some MCE errors in my logs. The server seems to be running okay, but I would like to fix whatever issue is being identified here. After updating to 6.10 I now get some more detail. My log is full of repeating versions of this:
May 18 10:07:29 Tower mcelog: Trigger `cache-error-trigger' exited with status 1
May 18 10:07:30 Tower mcelog: CPU 0 on socket 0 has large number of corrected cache errors in Level-3 Instruction
May 18 10:07:30 Tower mcelog: System operating correctly, but might lead to uncorrected cache errors soon
May 18 10:07:30 Tower mcelog: Running trigger `cache-error-trigger' (reporter: yellow)
May 18 10:07:30 Tower mcelog: Offlining CPU 0 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 0 failed
May 18 10:07:30 Tower mcelog: Offlining CPU 1 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 2 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 3 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 4 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 5 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 6 due to cache error threshold
May 18 10:07:30 Tower mcelog: Offlining CPU 7 due to cache error threshold
Thanks for any suggestions!
tower-diagnostics-20220518-1008.zip
Edited May 20, 2022 by slofiend
slofiend Posted May 19, 2022 Author
Here is the latest output. My log memory is now pegged at 100%.
Tower kernel: smpboot: CPU0: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz (family: 0x6, model: 0x5e, stepping: 0x3)
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: Machine check events logged
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: cc44dec000041136
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 100190ac0 MISC 3000034086
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1652890472 SOCKET 0 APIC 0 microcode f0
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: Machine check events logged
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: TSC 1bd2a07240 ADDR 302bcc0 MISC 3020034086
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1652890472 SOCKET 0 APIC 0 microcode f0
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c4000400004117a
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: TSC 1bd2ace75e ADDR 83e0d8c0 MISC 3936e034086
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1652890472 SOCKET 0 APIC 0 microcode f0
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041136
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: TSC 1bd2b5632c ADDR 48034c0 MISC 3040034086
May 18 09:14:57 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1652890472 SOCKET 0 APIC 0 microcode f0
This goes on and on. Is this a bad CPU, maybe?
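For anyone reading along: the hex word after "Bank 6:" in the kernel lines above is the raw machine-check status register for that bank. Below is a minimal Python sketch for pulling the top-level flags out of such a word. It is not part of the original posts. The bit positions are taken from the IA32_MCi_STATUS layout in Intel's Software Developer's Manual; the function name `decode_status` and the dictionary keys are made up for illustration, and the corrected-error count field (bits 52:38) is only meaningful on processors that report it.

```python
# Sketch only: decode the architectural flag bits of an IA32_MCi_STATUS word.
# Bit positions follow Intel's SDM; names below are illustrative, not an API.
FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten (overflow)
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # error reporting was enabled for this error
    "MISCV": 59,  # the MISC value in the log is valid
    "ADDRV": 58,  # the ADDR value in the log is valid
    "PCC": 57,    # processor context may be corrupted
}

def decode_status(status_hex):
    """Return the flag bits and low fields of a status word given as hex text."""
    status = int(status_hex, 16)
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    # Bits 52:38 hold a corrected-error count where the CPU supports it.
    decoded["corrected_count"] = (status >> 38) & 0x7FFF
    # Bits 15:0 hold the architectural MCA error code.
    decoded["mca_error_code"] = status & 0xFFFF
    return decoded

# Status words copied from the log lines quoted in this thread.
for word in ("cc44dec000041136", "8c40004000041152"):
    info = decode_status(word)
    set_flags = [name for name in FLAGS if info[name]]
    print(word, set_flags, "count:", info["corrected_count"],
          "code:", hex(info["mca_error_code"]))
```

For the words quoted in this thread the UC bit comes back clear, which matches mcelog describing them as corrected cache errors. The sketch deliberately stops short of interpreting the MCA error code itself; mcelog does that decoding far more completely.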
slofiend Posted May 20, 2022 Author
I have attached diagnostics. I know there are MCE errors in the log, but after an upgrade my log memory is maxed out, which didn't happen before. I don't know what to do about the MCE errors. If anyone can help me narrow down what the issue might be, I'd be really grateful. I'm hoping that correcting them will help reduce the memory utilization.
syslog.1.txt tower-diagnostics-20220519-1214.zip
trurl Posted May 20, 2022
I have merged your threads. Please don't post in multiple threads for the same problem. It makes it impossible to coordinate replies. If after a reasonable time you feel your thread isn't getting attention, make a new post in the same thread (known as "bumping"; people will often just post "bump").
This seems like a hardware problem to me. Do you have good power?
slofiend Posted May 20, 2022 Author (edited)
Thanks for the help, and for the posting correction. Fair enough. As far as I know, power is fine; the server has been up for months. It's also running through a battery backup and power regulator, so it should be quite stable. And I agree it seems like hardware. I'm just not sure how to diagnose which hardware is at fault, aside from just replacing the CPU (if that is the issue).
Edited May 20, 2022 by slofiend
JorgeB Posted May 20, 2022
Try going back to v6.9.2 to see if the errors go away. IIRC there's another user with a similar issue with the same CPU family.
slofiend Posted May 20, 2022 Author
Ah, good test idea. The memory issue no longer occurs on 6.9.2, but I think that's because the MCE errors seem to get logged at a lower verbosity level than in 6.10. I still have them, though. I've attached an updated log if that helps at all.
May 20 08:23:15 Tower kernel: rcu: Hierarchical SRCU implementation.
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: TSC 1bf8e2f6d6 ADDR 43a00c0 MISC 3020034086
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1653060177 SOCKET 0 APIC 0 microcode e2
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: TSC 1bf8e5b88f ADDR 3347cc0 MISC 3020034086
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1653060177 SOCKET 0 APIC 0 microcode e2
I don't have another CPU to swap in to see if that fixes it, but it doesn't read to me like a memory error where I just need to move some DIMMs around, so I'm not sure of the best way to A/B test this to confirm a fix. Thanks for the help.
tower-diagnostics-20220520-0828.zip
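One low-effort way to compare captures (for example before and after reseating the CPU, or between Unraid versions) is to tally the machine-check lines per bank and status word. The Python sketch below is an illustration, not something from the original posts: the regular expressions are written against the kernel log lines quoted in this thread, and the names `summarize`, `MCE_RE` and `UCODE_RE` are hypothetical. Other kernels may format these lines differently.

```python
import re
from collections import Counter

# Patterns assume the "mce: [Hardware Error]" line format quoted in this thread.
MCE_RE = re.compile(r"Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]+)")
UCODE_RE = re.compile(r"microcode ([0-9a-fA-F]+)\s*$")

def summarize(lines):
    """Count machine-check events per (bank, status) and collect microcode revisions."""
    events = Counter()
    microcode = set()
    for line in lines:
        if "mce: [Hardware Error]" not in line:
            continue  # ignore unrelated syslog lines
        match = MCE_RE.search(line)
        if match:
            events[(int(match.group(1)), match.group(2).lower())] += 1
        ucode = UCODE_RE.search(line)
        if ucode:
            microcode.add(ucode.group(1).lower())
    return events, microcode

# Sample lines copied from the 6.9.2 log excerpt in this thread.
sample = """\
May 20 08:23:15 Tower kernel: rcu: Hierarchical SRCU implementation.
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1653060177 SOCKET 0 APIC 0 microcode e2
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 6: 8c40004000041152
May 20 08:23:15 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:506e3 TIME 1653060177 SOCKET 0 APIC 0 microcode e2
""".splitlines()

events, microcode = summarize(sample)
for (bank, status), count in events.most_common():
    print(f"bank {bank} status {status}: {count} event(s)")
print("microcode revisions seen:", sorted(microcode))
```

To run it against a real capture, pass the lines of a syslog file instead of `sample`. If the same bank and status words keep appearing across versions and after a reseat, that points toward the hardware rather than the OS release. The excerpts in this thread also show a different microcode revision in each capture (f0 in the first, e2 after going back to 6.9.2), which is worth keeping in mind when comparing counts.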
Solution JorgeB Posted May 20, 2022
Check that the CPU is well seated. If the errors persist, you likely have a bad CPU.
slofiend Posted May 20, 2022 Author
Thanks for the support. I'll mark this as solved for now.
Gex2501 Posted April 16
On 5/20/2022 at 8:37 AM, JorgeB said: Check that it's well seated, if errors persist you likely have a bad CPU.
Just to clarify, @slofiend: did you have to reseat your CPU? Was it bad, and did you have to replace it? I'm getting MCE errors myself after upgrading my CPU and memory.