Machine Check Events Error



  • 4 months later...

I got the dreaded error warning today after a hard reboot (the system was unresponsive). It's running now and a parity check is in progress. Does anyone have a clue whether this is telling me something?

I've attached the full log, but these lines don't sound like good news.

Jun 25 05:50:04 TheBronze kernel: mce: [Hardware Error]: Machine check events logged
Jun 25 05:50:04 TheBronze kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 3: fe00000000800400
Jun 25 05:50:04 TheBronze kernel: mce: [Hardware Error]: TSC 0 ADDR ffffffff8108843d MISC ffffffff8108843d 
Jun 25 05:50:04 TheBronze kernel: mce: [Hardware Error]: PROCESSOR 0:a0655 TIME 1624618181 SOCKET 0 APIC 0 microcode e0
Jun 25 05:50:04 TheBronze kernel: mce: [Hardware Error]: Machine check events logged
Jun 25 05:50:04 TheBronze kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: fe00000000800400
Jun 25 05:50:04 TheBronze kernel: mce: [Hardware Error]: TSC 0 ADDR fffff8044651bb59 MISC fffff8044651bb59 
Jun 25 05:50:04 TheBronze kernel: mce: [Hardware Error]: PROCESSOR 0:a0655 TIME 1624618181 SOCKET 0 APIC 0 microcode e0
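
For anyone trying to read the status value on the "Bank" lines above, here is a minimal sketch that splits it into its flag bits. It assumes the Intel layout of the IA32_MCi_STATUS register (bits 63 to 57 are flags, bits 31:16 a model-specific code, bits 15:0 the MCA error code). The function name `decode_status` is made up for this example, and it does not try to interpret what the error codes mean.

```python
# Flag bits in the upper part of IA32_MCi_STATUS (Intel layout, assumed here).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error record
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context may be corrupt
}


def decode_status(hex_status: str) -> dict:
    """Split a 64-bit status value (hex string) into flags and code fields."""
    value = int(hex_status, 16)
    flags = [name for bit, name in STATUS_FLAGS.items() if (value >> bit) & 1]
    return {
        "flags": flags,
        "model_specific_code": (value >> 16) & 0xFFFF,
        "mca_error_code": value & 0xFFFF,
    }


decoded = decode_status("fe00000000800400")
print(decoded["flags"])
# ['VAL', 'OVER', 'UC', 'EN', 'MISCV', 'ADDRV', 'PCC']
print(hex(decoded["mca_error_code"]))
# 0x400
```

The flags only say how the hardware classified the record. Whether it matters still depends on when it was logged and on the rest of the diagnostics.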

syslog

  • 1 month later...

Hi guys,

 

Hope someone can shed some light on my Ryzen 5950 Unraid system. I'm a first-time builder, so please be patient with me...

I started getting this after a few months of running, and it has happened twice this month... it might be the heat in the room...

 

Aug  1 19:14:09 MyBongo kernel: mce: [Hardware Error]: Machine check events logged
Aug  1 19:14:09 MyBongo kernel: mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 0: bc00080001010135
Aug  1 19:14:09 MyBongo kernel: mce: [Hardware Error]: TSC 0 ADDR fb8d39280 MISC d012000000000000 IPID 1000b000000000 
Aug  1 19:14:09 MyBongo kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1627870430 SOCKET 0 APIC 8 microcode a201009

 

I'd much appreciate knowing whether I can safely ignore this.  I am planning to update the BIOS on my Gigabyte X570 Master and also upgrade Unraid OS to the latest version... just being extra careful...

 

 

mybongo-syslog-20210802-0321.zip

  • 3 weeks later...

Hey all,

I'm new to this. 
This morning, my Intel machine mysteriously rebooted (I guess the BIOS is set to reboot when encountering hardware problems? I know, I know, I should change this, and I will).

When I logged in at around 2pm, I noticed it started a parity check on reboot.

MCE tells me that there's a hardware error. 
Here's my zip file. Can anyone tell me if this is true?

 

nas-diagnostics-20210822-1449.zip

11 minutes ago, Squid said:

The MCE listed happened during core initialization. It isn't anything to worry about and happens on certain hardware combinations.

 

I would start with running a memtest for a pass or two


So some hardware combos are just doomed to randomly reboot? That sucks camel caboose.

How do I go about running a memtest? I've never done one.

3 minutes ago, Corvus said:

So some hardware combos are just doomed to randomly reboot?

I didn't say that.  I said the mce happens on certain hardware combinations when initializing the cpu cores and is nothing to worry about.

3 minutes ago, Corvus said:

How do I go about running a memtest? I've never done one.

It's on the boot menu.  If you're booting via UEFI, then you'll have to temporarily switch to Legacy in order to run it (or create a new boot stick from the version at https://www.memtest86.com/)

9 minutes ago, Squid said:

I didn't say that.  I said the mce happens on certain hardware combinations when initializing the cpu cores and is nothing to worry about.

It's on the boot menu.  If you're booting via UEFI, then you'll have to temporarily switch to Legacy in order to run it (or create a new boot stick from the version at https://www.memtest86.com/)


Ok that's gonna be a problem. You see, my particular motherboard has this known bug where if the secondary m.2 is occupied, it sometimes refuses to output display via the GPU until the m.2 is reseated - and that's not possible because I'd have to dismantle the entire system to do that.

Sooo anywho, I have no direct display output capabilities whatsoever.

Any alternative?

  • 1 month later...

Hello All,

 

I've also received this error: Your server has detected hardware errors. You should install mcelog via the NerdPack plugin, post your diagnostics and ask for assistance on the unRaid forums. The output of mcelog (if installed) has been logged.

I've uploaded the logs from both methods below.

I don't know what I should be looking for.  Any help would be appreciated.

 

syslog yianni-diagnostics-20211010-0708.zip

  • 2 weeks later...
Oct 13 00:44:23 Yianni kernel: mce: Uncorrected memory error in page 0x0 ignored
Oct 13 00:44:23 Yianni kernel: Rebuild kernel with CONFIG_MEMORY_FAILURE=y for smarter handling
Oct 13 00:44:23 Yianni kernel: [Hardware Error]: Deferred error, no action required.
Oct 13 00:44:23 Yianni kernel: [Hardware Error]: CPU:1 (19:21:0) MC24_STATUS[Over|-|-|AddrV|-|-|UECC|Deferred|-|-]: 0xd589f68949fd8949
Oct 13 00:44:23 Yianni kernel: [Hardware Error]: Error Addr: 0x0000000000000000
Oct 13 00:44:23 Yianni kernel: [Hardware Error]: IPID: 0x0000000000000000
Oct 13 00:44:23 Yianni kernel: [Hardware Error]: System Management Unit Ext. Error Code: 61
Oct 13 00:44:23 Yianni kernel: [Hardware Error]: cache level: L1, tx: GEN

 

Safe to ignore.  It's just a known Ryzen issue where that happens on earlier kernels

  • 1 month later...

Received the same message. Log file is attached. I believe that this is the relevant portion. 

 

Dec 19 10:08:11 Tower root: Fix Common Problems: Error: Machine Check Events detected on your server
Dec 19 10:08:11 Tower root: Hardware event. This is not a software error.
Dec 19 10:08:11 Tower root: MCE 0
Dec 19 10:08:11 Tower root: CPU 1 BANK 6 TSC e0507bb7fe8f4 
Dec 19 10:08:11 Tower root: MISC a010414 ADDR bdbaeefc0 
Dec 19 10:08:11 Tower root: TIME 1639771657 Fri Dec 17 14:07:37 2021
Dec 19 10:08:11 Tower root: MCG status:
Dec 19 10:08:11 Tower root: MCi status:
Dec 19 10:08:11 Tower root: Corrected error
Dec 19 10:08:11 Tower root: MCi_MISC register valid
Dec 19 10:08:11 Tower root: MCi_ADDR register valid
Dec 19 10:08:11 Tower root: Threshold based error status: green
Dec 19 10:08:11 Tower root: MCA: corrected filtering (some unreported errors in same region)
Dec 19 10:08:11 Tower root: Generic CACHE Level-2 Data-Write Error
Dec 19 10:08:11 Tower root: STATUS 8c2000400001114a MCGSTATUS 0
Dec 19 10:08:11 Tower root: MCGCAP 1c09 APICID 0 SOCKETID 0 
Dec 19 10:08:11 Tower root: MICROCODE 1f
Dec 19 10:08:11 Tower root: CPUID Vendor Intel Family 6 Model 44
Dec 19 10:08:11 Tower root: mcelog: warning: 8 bytes ignored in each record
Dec 19 10:08:11 Tower root: mcelog: consider an update
Dec 19 10:08:21 Tower emhttpd: read SMART /dev/sdm
Dec 19 10:08:21 Tower emhttpd: read SMART /dev/sdj
Dec 19 10:08:21 Tower emhttpd: read SMART /dev/sdh
Dec 19 10:08:21 Tower emhttpd: read SMART /dev/sdn
Dec 19 10:08:35 Tower emhttpd: read SMART /dev/sdi
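
One thing worth noticing in output like this is that the TIME field is a Unix timestamp for when the event was recorded, which can be well before the scan that reported it (here the log lines are dated Dec 19, while mcelog prints Dec 17). A minimal sketch for converting the value, assuming it is in seconds since the epoch; `mce_time_utc` is a hypothetical helper name, and the result is in UTC, whereas the time mcelog prints is presumably the server's local time.

```python
from datetime import datetime, timezone


def mce_time_utc(epoch_seconds: int) -> str:
    """Format the TIME field of an MCE record as a UTC date and time."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(mce_time_utc(1639771657))
# 2021-12-17 20:07:37
```

Comparing that time against reboots or other events in the syslog can help tell whether the error coincided with a crash or was just picked up later.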

 

tower-diagnostics-20211219-1008.zip

  • 4 weeks later...

My webui became "unresponsive". I was able to move around the interface.  It told me the array had stopped, but my VMs and Docker containers were all responding and working fine. I tried shutting them down via the Unraid UI, but it seemed like no commands were executing, even though the UI seemed to confirm the commands were being sent. This message was displayed when I clicked on the Apps tab. Luckily I took a picture, because it didn't show again after a page refresh.

789621977_20220111_144921399.thumb.jpg.95c2bb6a8622abe97c8d61577a0e24a6.jpg

I was able to shut down my VMs and then had to shut down the Unraid server via the physical power button.

On restart everything seems fine but I got the MCE error. Please see attached logs. Any guidance would be appreciated.

 

 

 

  • 4 weeks later...

Had this pop up today, any insight?  Much appreciated!

 

Feb  5 08:30:59 Tower kernel: smpboot: CPU0: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz (family: 0x6, model: 0x3f, stepping: 0x2)
Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: Machine check events logged
Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 17: ee2000000004017a
Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 5f000000 MISC 4f00031e0000086 
Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1644067837 SOCKET 0 APIC 0 microcode 44
Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: Machine check events logged
Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 18: ee2000000004017a
Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 5f100000 MISC 44f00031e0000086 
Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1644067837 SOCKET 0 APIC 0 microcode 44
Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 19: ee2000000004017a
Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 5f100080 MISC 84f00031e0000086 
Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: PROCESSOR 0:306f2 TIME 1644067837 SOCKET 0 APIC 0 microcode 44
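
When several banks report at once like this, it can help to pull just the bank lines out of the full syslog and group them. Below is a rough sketch that assumes the kernel line format seen in this thread ("CPU n: Machine Check: n Bank n: status"); the names `BANK_LINE` and `summarize_mce` are made up for the example, and other kernels or the mcelog format would need a different pattern.

```python
import re
from collections import defaultdict

# Matches the "CPU n: Machine Check: n Bank n: <status>" kernel lines.
BANK_LINE = re.compile(
    r"mce: \[Hardware Error\]: CPU (?P<cpu>\d+): Machine Check: \d+ "
    r"Bank (?P<bank>\d+): (?P<status>[0-9a-f]+)"
)


def summarize_mce(lines):
    """Group reported status values by (cpu, bank)."""
    summary = defaultdict(list)
    for line in lines:
        match = BANK_LINE.search(line)
        if match:
            key = (int(match["cpu"]), int(match["bank"]))
            summary[key].append(match["status"])
    return dict(summary)


sample = [
    "Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 17: ee2000000004017a",
    "Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: TSC 0 ADDR 5f000000 MISC 4f00031e0000086",
    "Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 18: ee2000000004017a",
    "Feb  5 08:30:59 Tower kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 19: ee2000000004017a",
]

for (cpu, bank), statuses in sorted(summarize_mce(sample).items()):
    print(f"CPU {cpu} bank {bank}: {', '.join(statuses)}")
```

To run it against a real log, pass the open syslog file instead of `sample`, since iterating a file yields its lines.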

tower-diagnostics-20220205-0854.zip
