joeskii, posted September 6, 2018:
Version: Unraid 6.5.3
I'm getting this hardware error on my CPU. What does it mean? I've attached my logs as well. Thank you for your help!
    Sep 5 12:16:04 Tower kernel: mce: [Hardware Error]: Machine check events logged
    Sep 5 12:16:04 Tower kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 8: cc0004c00001009f
    Sep 5 12:16:04 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR c957ef740 MISC 102040800016c4c
    Sep 5 12:16:04 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1536174945 SOCKET 1 APIC 20 microcode a
Attachment: einstein-diagnostics-20180906-0842.zip
trurl, posted September 6, 2018:
Have you run a memtest?
joeskii, posted September 8, 2018:
I just finished running one. No errors.
JorgeB, posted September 8, 2018:
The board should have a system event log. If it does, check it in the BIOS; there might be more information there.
joeskii, posted September 10, 2018:
    Sep 7 18:36:15 einstein kernel: smpboot: CPU0: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (family: 0x6, model: 0x2c, stepping: 0x2)
    Sep 7 18:36:15 einstein kernel: Performance Events: PEBS fmt1+, Westmere events, 16-deep LBR, Intel PMU driver.
    Sep 7 18:36:15 einstein kernel: core: CPUID marked event: 'bus cycles' unavailable
    Sep 7 18:36:15 einstein kernel: ... version: 3
    Sep 7 18:36:15 einstein kernel: ... bit width: 48
    Sep 7 18:36:15 einstein kernel: ... generic registers: 4
    Sep 7 18:36:15 einstein kernel: ... value mask: 0000ffffffffffff
    Sep 7 18:36:15 einstein kernel: ... max period: 000000007fffffff
    Sep 7 18:36:15 einstein kernel: ... fixed-purpose events: 3
    Sep 7 18:36:15 einstein kernel: ... event mask: 000000070000000f
    Sep 7 18:36:15 einstein kernel: Hierarchical SRCU implementation.
    Sep 7 18:36:15 einstein kernel: smp: Bringing up secondary CPUs ...
    Sep 7 18:36:15 einstein kernel: x86: Booting SMP configuration:
    Sep 7 18:36:15 einstein kernel: .... node #1, CPUs: #1
    Sep 7 18:36:15 einstein kernel: mce: [Hardware Error]: Machine check events logged
    Sep 7 18:36:15 einstein kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 8: cc0004c00001009f
    Sep 7 18:36:15 einstein kernel: mce: [Hardware Error]: TSC 0 ADDR c79edeb80 MISC 102040800016040
    Sep 7 18:36:15 einstein kernel: mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1536370556 SOCKET 1 APIC 20 microcode a
    Sep 7 18:36:15 einstein kernel: .... node #0, CPUs: #2
    Sep 7 18:36:15 einstein kernel: .... node #1, CPUs: #3
    Sep 7 18:36:15 einstein kernel: .... node #0, CPUs: #4
    Sep 7 18:36:15 einstein kernel: .... node #1, CPUs: #5
    Sep 7 18:36:15 einstein kernel: .... node #0, CPUs: #6
    Sep 7 18:36:15 einstein kernel: .... node #1, CPUs: #7
    Sep 7 18:36:15 einstein kernel: .... node #0, CPUs: #8
    Sep 7 18:36:15 einstein kernel: .... node #1, CPUs: #9
    Sep 7 18:36:15 einstein kernel: .... node #0, CPUs: #10
    Sep 7 18:36:15 einstein kernel: .... node #1, CPUs: #11
    Sep 7 18:36:15 einstein kernel: smp: Brought up 2 nodes, 12 CPUs
    Sep 7 18:36:15 einstein kernel: smpboot: Total of 12 processors activated (57598.65 BogoMIPS)
Does this mean anything to anyone? I found this in my syslog, specifically the line "CPU 1: Machine Check: 0 Bank 8: cc0004c00001009f".
binky, posted September 10, 2018:
I have encountered Machine Check Exceptions ("mce:") in the past, and for me they have always been caused by a failing ECC memory chip. Memtest says the memory is OK because, as far as it is concerned, the memory is working: correct values are being written and verified. In reality the ECC hardware has had to correct bits on the chip, raising an MCE that Memtest hasn't detected or isn't hooked into.
I have a machine that is currently generating MCEs. If I run Windows on it, I can't tell it's happening, but if I run Linux I can see the errors occasionally. They don't happen often because the machine has 256GB of ECC RAM, so it isn't often using the bit of RAM that's 'iffy'.
This has been my experience, although there could be other reasons MCEs are being raised. 🤔
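The status word in the log can be checked against this explanation by decoding it by hand. The sketch below assumes the architectural IA32_MCi_STATUS bit layout described in the Intel Software Developer's Manual, and it assumes the corrected-error-count field (bits 52:38) is implemented on this CPU. The function and field names are illustrative and do not come from any library.

```python
# Sketch: decode the machine-check status word from the "Bank 8" log line.
# Assumption: architectural IA32_MCi_STATUS layout (Intel SDM).
# All names here are hypothetical, chosen for readability.

STATUS = 0xCC0004C00001009F  # value copied from the syslog line

# MMM sub-field of a memory-controller error code (000F 0000 1MMM CCCC)
MEMORY_OPS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def bit(value: int, pos: int) -> bool:
    """Return True if bit `pos` of `value` is set."""
    return bool((value >> pos) & 1)


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit status word into its main architectural fields."""
    mca_code = status & 0xFFFF
    fields = {
        "valid": bit(status, 63),
        "overflow": bit(status, 62),
        "uncorrected": bit(status, 61),
        "misc_valid": bit(status, 59),
        "addr_valid": bit(status, 58),
        "context_corrupt": bit(status, 57),
        "corrected_count": (status >> 38) & 0x7FFF,  # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_code": mca_code,
        "memory_op": None,
        "channel": None,
    }
    # Memory-controller pattern: bits 15:13 and 11:8 clear, bit 7 set.
    # Bit 12 (the F flag) is ignored by the mask.
    if (mca_code & 0xEF80) == 0x0080:
        fields["memory_op"] = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
        channel = mca_code & 0xF
        fields["channel"] = None if channel == 0xF else channel  # 0xF: not specified
    return fields


if __name__ == "__main__":
    for name, value in decode_mci_status(STATUS).items():
        print(f"{name}: {value}")
```

Under those assumptions, the value decodes as a valid, corrected (not uncorrected) memory read error with no channel specified and a corrected-error count of 19, which is consistent with ECC silently fixing bits on a marginal module.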
joeskii, posted September 11, 2018:
Thanks for the info! I'll test the sticks individually to see if I can find out which one it is. I originally thought it was a CPU error, so I'm much happier if it's just an issue with a stick of RAM.
binky, posted September 11, 2018:
6 hours ago, joeskii said: "I originally thought it was a CPU error so I'm much happier if it was just an issue with a stick of ram."
And of course, in this thread @bfeist isn't using ECC RAM, so it could be an issue with the CPU...
ghost82, posted June 12, 2019:
On 9/10/2018 at 11:15 AM, binky said: "I have encountered Machine Check Exceptions "mce:" in the past and for me they have always been a failing ECC memory chip. [...]"
I confirm this! Thank you for the information. I have a Windows workstation, and I started noticing WHEA-related errors in the event logger (every minute), but the system ran smoothly without crashes for days. Memtest showed no errors for any of the ECC RAM modules. When I installed Unraid I noticed errors related to CPU hardware and memory; after I removed the faulty RAM module, both errors stopped appearing.
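When swapping modules like this, it can help to count the machine-check lines in the syslog before and after each change. The following is a minimal sketch that only assumes the log line format shown earlier in this thread; `count_mce_events` is a hypothetical helper, not part of Unraid or any library.

```python
import re
from collections import Counter

# Matches lines such as:
#   mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 8: cc0004c00001009f
MCE_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]+)"
)


def count_mce_events(lines):
    """Count machine-check events per (cpu, bank, status) triple."""
    counts = Counter()
    for line in lines:
        match = MCE_LINE.search(line)
        if match:
            key = (int(match["cpu"]), int(match["bank"]), match["status"].lower())
            counts[key] += 1
    return counts


# Sample input: the two events reported in this thread, plus unrelated lines.
sample = [
    "Sep 5 12:16:04 Tower kernel: mce: [Hardware Error]: Machine check events logged",
    "Sep 5 12:16:04 Tower kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 8: cc0004c00001009f",
    "Sep 7 18:36:15 einstein kernel: smp: Bringing up secondary CPUs ...",
    "Sep 7 18:36:15 einstein kernel: mce: [Hardware Error]: CPU 1: Machine Check: 0 Bank 8: cc0004c00001009f",
]

if __name__ == "__main__":
    for (cpu, bank, status), n in count_mce_events(sample).items():
        print(f"CPU {cpu}, bank {bank}, status {status}: {n} event(s)")
```

To run it against a real log, pass an open file object instead of `sample`, for example `count_mce_events(open(path))`, where `path` is wherever your system keeps its syslog. If the count for a given bank drops to zero after a module is removed, that module is the likely culprit.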
testdasi, posted June 12, 2019:
25 minutes ago, ghost82 said: "I confirm this! [...] by removing the faulty ram module both errors don't show anymore."
Good to know. I used to have these MCE errors on my previous servers, but they didn't cause any issues, so I ignored them.