June 13, 2024
Hi, I have recently started seeing "Machine Check Events detected" warnings on my server. I have not had any hardware changes in the past 6 months, nor any real configuration updates, just a new docker here or there. I am afraid I need some help figuring out what the issue is. The relevant syslog lines are lines 836-841:
Jun 12 18:29:40 unraid kernel: mce: [Hardware Error]: Machine check events logged
Jun 12 18:29:40 unraid kernel: registered taskstats version 1
Jun 12 18:29:40 unraid kernel: mce: [Hardware Error]: CPU 26: Machine Check: 0 Bank 2: bea0200004020136
Jun 12 18:29:40 unraid kernel: mce: [Hardware Error]: TSC 0 ADDR de3566250 MISC d012000200000000 SYND 74f11d442b29 IPID 200b000000000
Jun 12 18:29:40 unraid kernel: Btrfs loaded, crc32c=crc32c-generic, zoned=no, fsverity=no
Jun 12 18:29:40 unraid kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1718231339 SOCKET 0 APIC 15 microcode a20120a
Diagnostics are attached. Many thanks for your time and assistance.
unraid-diagnostics-20240612-1904.zip
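A side note for readers lining these records up against the rest of the syslog: the TIME field in the last mce line appears to be a Unix epoch timestamp in seconds. That is an assumption based on how the value compares with the syslog timestamps, not something stated in the thread. A minimal Python sketch of the conversion:

```python
from datetime import datetime, timezone

# TIME value copied from the "PROCESSOR 2:a20f12 TIME ..." line above.
# Assumption: it is seconds since the Unix epoch (UTC).
mce_time = 1718231339

event_utc = datetime.fromtimestamp(mce_time, tz=timezone.utc)
print(event_utc.isoformat())
```

Under that assumption the event falls at 22:28:59 UTC on June 12, which sits close to the 18:29:40 syslog stamp if the server clock runs four hours behind UTC.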
June 13, 2024
Run a memory test. 90% of the time it's an ECC memory correction. On AMD hardware, mcelog does not work. On Intel, open a terminal and run mcelog. Is the mcelog output blank? Then it's ECC memory. Does it have content? Then that is the hardware error that is failing. It looks like your processor's memory controller may be getting too hot and/or starting to fail, or some undervolting/overclocking setting is wrong.
June 14, 2024 Author
I do have AMD and no ECC RAM. From what I read, mcelog was needed for older Unraid versions. Aren't the log lines above what mcelog would print out? This seems to happen each time a parity check starts. I got another one tonight (I had cancelled the parity check last night). Both seem related to CPU 26: Machine Check: 0 Bank 2.
Jun 13 21:06:02 unraid kernel: mce: [Hardware Error]: Machine check events logged
Jun 13 21:06:02 unraid kernel: mce: [Hardware Error]: CPU 26: Machine Check: 0 Bank 2: bea0200004020166
Jun 13 21:06:02 unraid kernel: mce: [Hardware Error]: TSC 0 ADDR 446a74670 MISC d012000500000000 SYND e0f01d44163a IPID 200b000000000
Jun 13 21:06:02 unraid kernel: mce: [Hardware Error]: PROCESSOR 2:a20f12 TIME 1718327131 SOCKET 0 APIC 15 microcode a20120a
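To check whether repeated events really come from the same CPU and bank, the two "CPU ... Bank ..." lines can be compared programmatically. This is only a sketch: `parse_mce` and its regex are hypothetical helpers written against the two lines quoted in this thread, not part of Unraid or mcelog, and other kernel versions may format the line differently.

```python
import re

# Hypothetical pattern, built only from the log lines quoted in this thread.
MCE_LINE = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check: \d+ Bank (?P<bank>\d+): (?P<status>[0-9a-f]+)"
)

def parse_mce(line):
    """Return (cpu, bank, status) or None if the line is not a CPU/bank record."""
    m = MCE_LINE.search(line)
    if m is None:
        return None
    return int(m["cpu"]), int(m["bank"]), int(m["status"], 16)

events = [
    "Jun 12 18:29:40 unraid kernel: mce: [Hardware Error]: CPU 26: Machine Check: 0 Bank 2: bea0200004020136",
    "Jun 13 21:06:02 unraid kernel: mce: [Hardware Error]: CPU 26: Machine Check: 0 Bank 2: bea0200004020166",
]
parsed = [parse_mce(e) for e in events]

# True when every event reports the same (cpu, bank) pair.
same_source = len({(cpu, bank) for cpu, bank, _ in parsed}) == 1
# XOR of the two status words shows which bits differ between the events.
diff_bits = parsed[0][2] ^ parsed[1][2]
print(same_source, hex(diff_bits))
```

For the two events above, the CPU and bank match and the status words differ only in a couple of low-order bits, while the ADDR values in the following lines are different.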
June 14, 2024
Try running the server with just 2 sticks of RAM. If the errors continue, try the other two. That will basically rule out a RAM issue.
September 5, 2024
OP, what was the result? I started getting MCE events on AMD recently and wonder if it's related to new unRAID versions (and I notice a fair few similar posts!)
September 6, 2024
18 hours ago, methanoid said:
OP, what was the result? I started getting MCE events on AMD recently and wonder if it's related to new unRAID versions (and I notice a fair few similar posts!)
At the moment I can only recommend running a memory test at boot and a live memory test with the plugin. Per other posts, mcelog on AMD is a bit misleading: its output is informational but means nothing. When you receive an MCE error on AMD, it is best to run a memory test and post diagnostics. Test 1 GB while still live, then reboot and run a full memory test.
September 6, 2024 Author
On 9/5/2024 at 2:08 AM, methanoid said:
OP, what was result... I started getting MCE events on AMD recently and wonder if its related to new unRAID versions (and notice a fair few similar posts!)
Line three in the log is the real clue here:
Jun 12 18:29:40 unraid kernel: mce: [Hardware Error]: CPU 26: Machine Check: 0 Bank 2: bea0200004020136
I actually ended up contacting Unraid support and they pointed this out. It was the CPU for me, and an updated motherboard firmware fixed it.
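For readers wondering what that hex value encodes: it is the bank's machine-check status word. A rough Python sketch of decoding its top flag bits is below. The names for bits 63 to 57 follow the x86 machine-check status layout as used in the Linux kernel; `decode_status` is a hypothetical helper, and the remaining bits are vendor and model specific, so they are left undecoded here.

```python
# Flag positions in the x86 machine-check status word (bits 63..57).
# Everything below bit 57 except the low 16-bit error code is left out
# because its meaning depends on the CPU vendor and model.
FLAGS = {
    63: "VAL",    # status word holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC value is valid
    58: "ADDRV",  # ADDR value is valid
    57: "PCC",    # processor context may be corrupt
}

def decode_status(status):
    """Return (list of set flag names, low 16-bit error code)."""
    set_flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    error_code = status & 0xFFFF
    return set_flags, error_code

flags, code = decode_status(0xBEA0200004020136)
print(flags, hex(code))
```

For the status in this thread the sketch reports VAL, UC, EN, MISCV, ADDRV and PCC as set, with OVER clear, and a low error code of 0x136. Interpreting that error code for a specific bank needs the vendor's documentation for the processor in question.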