Need help: "mce: [Hardware Error]: Machine check events logged" / "kernel: mce: CMCI storm detected: switching to poll mode"

RavenX · September 29, 2022

My unRAID server worked fine for 8+ months without any problems.

My setup: Dual E5-2699, 64 GB ECC Memory, 4x 12 TB Seagate IronWolf HDDs. Unraid version 6.9.2

Samsung SSD 980 PRO 1TB as Cache drive.

The mainboard has 2 network ports, both connected to the network switch with different IP addresses.

Then in August I suddenly had weird network issues: the network connection would drop (on both ports) and the host was unreachable for about 30 seconds to 1 minute. Then the connection was restored as if nothing had happened. The other devices on the network stayed reachable, so it's not the switch.

At first I could not make out a pattern; the connection drops seemed to be random. But then I noticed something: whenever the portable air conditioner (in the same room, on the same breaker) kicked in, the network connection dropped. After observing this 3 times in a row I knew something was up with the power circuit. I assumed the additional power draw was causing the network failure, and that it somehow caused other devices to fail as well.

I removed the air conditioner from the power circuit but the problems remained. I had to hard reset my network printer to get it working again.

But my unRAID system was unstable from then on. I observed the following problems:

The network connection drops and then comes back.
The network drops permanently. I have to log in directly on the PC to restart it.
The system crashes completely. I have to cut power to restart it. (This has only happened 3 times so far.)
The system reboots. This only happens during the night though. I notice it when the array is down in the morning.

I checked the logs and found the following:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce_notify_irq: 4 callbacks suppressed
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce_notify_irq: 37 callbacks suppressed
kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: [Hardware Error]: Machine check events logged

This continues for a few minutes until it disappears or the system crashes completely.
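A rough tally of such a burst can help to judge how bad it is. The following is a minimal sketch in Python (an assumption, since the thread itself uses no scripting language), and the name summarize_mce is hypothetical. It counts the "Machine check events logged" lines, adds up the "callbacks suppressed" numbers (these come from kernel log rate limiting, so they stand for messages that were dropped rather than printed) and counts "CMCI storm detected" lines. It does not decode the machine check records themselves.

```python
import re

# Substrings and pattern taken from the log lines quoted in this post.
LOGGED = "mce: [Hardware Error]: Machine check events logged"
STORM = "CMCI storm detected"
SUPPRESSED = re.compile(r"mce_notify_irq: (\d+) callbacks suppressed")


def summarize_mce(lines):
    """Count MCE-related syslog lines (hypothetical helper, not an Unraid tool)."""
    summary = {"logged": 0, "suppressed": 0, "storms": 0}
    for line in lines:
        if LOGGED in line:
            summary["logged"] += 1
        elif STORM in line:
            summary["storms"] += 1
        else:
            match = SUPPRESSED.search(line)
            if match:
                # Number of rate-limited messages the kernel did not print.
                summary["suppressed"] += int(match.group(1))
    return summary


# The eight lines quoted above, used as sample input.
sample = [
    "kernel: mce: [Hardware Error]: Machine check events logged",
    "kernel: mce: [Hardware Error]: Machine check events logged",
    "kernel: mce_notify_irq: 4 callbacks suppressed",
    "kernel: mce: [Hardware Error]: Machine check events logged",
    "kernel: mce: [Hardware Error]: Machine check events logged",
    "kernel: mce_notify_irq: 37 callbacks suppressed",
    "kernel: mce: [Hardware Error]: Machine check events logged",
    "kernel: mce: [Hardware Error]: Machine check events logged",
]

print(summarize_mce(sample))
```

To run it against a whole log instead of the sample, pass an open file object, for example summarize_mce(open("syslog.txt")), where the file name is a placeholder.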

Another strange log entry:

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: CMCI storm detected: switching to poll mode
dhcpcd[3473]: br0: fe80::1 is unreachable
dhcpcd[3565]: br1: fe80::1 is unreachable
dhcpcd[3565]: br1: soliciting an IPv6 router
dhcpcd[3473]: br0: soliciting an IPv6 router
dhcpcd[3565]: br1: no IPv6 Routers available
dhcpcd[3473]: br0: no IPv6 Routers available
kernel: ------------[ cut here ]------------
kernel: NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out
kernel: WARNING: CPU: 10 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0xcf/0x12b
kernel: Modules linked in: input_leds led_class xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle nf_tables vhost_net tun vhost vhost_iotlb tap veth macvlan xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding r8169 realtek sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper rapl intel_cstate i2c_i801 i2c_smbus i2c_core intel_uncore nvme ahci mxm_wmi nvme_core libahci wmi button [last unloaded: realtek]
kernel: CPU: 10 PID: 0 Comm: swapper/10 Not tainted 5.10.28-Unraid #1
kernel: Hardware name: JINGSHA Default string/Default string, BIOS 5.11 09/14/2020
kernel: RIP: 0010:dev_watchdog+0xcf/0x12b
kernel: Code: 79 b7 00 00 75 38 48 89 ef c6 05 63 79 b7 00 01 e8 79 dd fc ff 44 89 e1 48 89 ee 48 c7 c7 ef 7f de 81 48 89 c2 e8 50 16 10 00 <0f> 0b eb 10 41 ff c4 48 05 40 01 00 00 41 39 f4 75 9d eb 16 48 8b
kernel: RSP: 0018:ffffc900066f8ed8 EFLAGS: 00010286
kernel: RAX: 0000000000000000 RBX: ffff888126136438 RCX: 0000000000000027
kernel: RDX: 00000000ffffbfff RSI: 0000000000000001 RDI: ffff88885f698920
kernel: RBP: ffff888126136000 R08: 0000000000000000 R09: 00000000ffffbfff
kernel: R10: ffffc900066f8d08 R11: ffffc900066f8d00 R12: 0000000000000000
kernel: R13: ffffc900066f8f10 R14: ffffc900066f8f10 R15: ffffffff820060c8
kernel: FS:  0000000000000000(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000153fcce4e4e6 CR3: 00000008bda4c005 CR4: 00000000001706e0
kernel: Call Trace:
kernel: <IRQ>
kernel: call_timer_fn.isra.0+0x12/0x6f
kernel: ? netif_tx_lock+0x7a/0x7a
kernel: __run_timers.part.0+0x144/0x185
kernel: ? update_process_times+0x68/0x6e
kernel: ? hrtimer_forward+0x73/0x7b
kernel: ? tick_sched_timer+0x5a/0x64
kernel: ? timerqueue_add+0x62/0x68
kernel: ? recalibrate_cpu_khz+0x1/0x1
kernel: run_timer_softirq+0x21/0x43
kernel: __do_softirq+0xc4/0x1c2
kernel: asm_call_irq_on_stack+0x12/0x20
kernel: </IRQ>
kernel: do_softirq_own_stack+0x2c/0x39
kernel: __irq_exit_rcu+0x45/0x80
kernel: sysvec_apic_timer_interrupt+0x87/0x95
kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
kernel: RIP: 0010:arch_local_irq_enable+0x7/0x8
kernel: Code: 00 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 9c 58 0f 1f 44 00 00 c3 fa 66 0f 1f 44 00 00 c3 fb 66 0f 1f 44 00 00 <c3> 55 8b af 28 04 00 00 b8 01 00 00 00 45 31 c9 53 45 31 d2 39 c5
kernel: RSP: 0018:ffffc90006357ea0 EFLAGS: 00000246
kernel: RAX: ffff88885f6a2380 RBX: 0000000000000002 RCX: 000000000000001f
kernel: RDX: 0000000000000000 RSI: 0000000037c7f30d RDI: 0000000000000000
kernel: RBP: ffffe8f7fecb7f00 R08: 00000977de756348 R09: 000000000000044e
kernel: R10: 000000007fffffff R11: 071c71c71c71c71c R12: 00000977de756348
kernel: R13: ffffffff820c5dc0 R14: 0000000000000002 R15: 0000000000000000
kernel: cpuidle_enter_state+0x101/0x1c4
kernel: cpuidle_enter+0x25/0x31
kernel: do_idle+0x1a6/0x214
kernel: cpu_startup_entry+0x18/0x1a
kernel: secondary_startup_64_no_verify+0xb0/0xbb
kernel: ---[ end trace 330284c5f5e85237 ]---
kernel: r8169 0000:07:00.0 eth1: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).
dhcpcd[3565]: br1: fe80::1 is reachable again
dhcpcd[3473]: br0: fe80::1 is reachable again
kernel: r8169 0000:06:00.0 eth0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).

I'm not exactly sure what to make of this, but it looks to me like something is wrong with the network adapter on the motherboard. It's almost as if the air conditioner somehow sent a power surge through the network and damaged the mainboard. (I'm just guessing here.)
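One way to check whether the NIC timeouts really follow the machine check messages is to scan the whole syslog for that order of events. The sketch below is only an illustration in Python: the function name watchdog_events and the "after_storm" flag are made up for this example, and the pattern is derived solely from the NETDEV WATCHDOG line quoted above. Seeing a timeout after a storm in the log shows the ordering only; it does not prove that one caused the other.

```python
import re

# Pattern based on the line "NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out".
WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (\S+) \(([^)]+)\): transmit queue (\d+) timed out"
)


def watchdog_events(lines):
    """Return transmit timeouts, noting whether a CMCI storm was logged earlier."""
    storm_seen = False
    events = []
    for line in lines:
        if "CMCI storm detected" in line:
            storm_seen = True
        match = WATCHDOG.search(line)
        if match:
            events.append({
                "iface": match.group(1),
                "driver": match.group(2),
                "queue": int(match.group(3)),
                "after_storm": storm_seen,
            })
    return events


# Three lines taken from the excerpt above.
excerpt = [
    "kernel: mce: [Hardware Error]: Machine check events logged",
    "kernel: mce: CMCI storm detected: switching to poll mode",
    "kernel: NETDEV WATCHDOG: eth1 (r8169): transmit queue 0 timed out",
]

for event in watchdog_events(excerpt):
    print(event)
```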

I was hoping the problem might go away, but it doesn't. The server runs fine for a few days until the problems reappear, usually with the

kernel: mce: [Hardware Error]: Machine check events logged
kernel: mce: CMCI storm detected: switching to poll mode

messages.

Can anyone tell me what is going on? Is the network chip on the motherboard broken? Is my switch broken? (I don't think so because the rest of the PCs work fine)

The problem seems to be getting worse and I'm getting desperate.

Any help or suggestions are welcome.

(I attached log files with the full info)

syslog-20220923-180914.txt syslog-192.168.2.112_2022-09-29.log

JorgeB · September 29, 2022

I would suggest installing an add-on NIC to test, if that's an option.

RavenX · September 29, 2022

Thanks. I will try that.

But I still would like to know what these errors actually mean. Can anyone explain it to me?

RavenX · September 30, 2022

I tried a new NIC. The new card is detected and eth2 and eth3 show up with ifconfig. The old network ports don't work at all any more.

I cannot reach the server over the network because the old ports don't respond and unRAID is not configured to use the new card.

So how do I set up unRAID to use the new NIC from the command line? (The GUI is NOT available!)

JorgeB · September 30, 2022

If you can boot using GUI mode, use that to configure the LAN. If not, disable the onboard NICs in the board BIOS and reboot. If there are still issues, delete or rename /config/network.cfg and /config/network-rules.cfg
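The last step (renaming rather than deleting the two files, so they can be restored) could be sketched like this. It is written in Python purely for illustration, and the function name set_aside_network_config and the ".bak" suffix are made up. The directory is passed in as an argument because the sketch does not assume where the flash drive is mounted; on a running Unraid system that is usually /boot, which would make the directory /boot/config, but check this on your own system first. A reboot is presumably still needed afterwards so the network settings are regenerated.

```python
from pathlib import Path

# File names as given in the post above.
CONFIG_NAMES = ("network.cfg", "network-rules.cfg")


def set_aside_network_config(config_dir, suffix=".bak"):
    """Rename the network config files in config_dir, keeping them as backups.

    Hypothetical helper: returns the list of new paths. Files that do not
    exist are skipped, and an existing backup is never overwritten.
    """
    renamed = []
    for name in CONFIG_NAMES:
        src = Path(config_dir) / name
        if not src.is_file():
            continue
        dst = src.with_name(src.name + suffix)
        if dst.exists():
            raise FileExistsError(f"backup already exists: {dst}")
        src.rename(dst)
        renamed.append(dst)
    return renamed


# Example call (the path is an assumption, see the note above):
# set_aside_network_config("/boot/config")
```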

RavenX · September 30, 2022

Thanks for the tip! I did that and reconfigured the network devices.

It didn't seem to work though. I gave the two new ports a static and a DHCP address, but was unable to ping the router.

Only after plugging network cables into the old ports did it eventually work. (I'm still not sure what did it in the end.)

Now I can only hope that it remains stable.

About the BIOS: unfortunately it's one of those with very few actual settings. I'm not sure I can disable the onboard NIC.

Should I do this? I'm a little worried that it fucks up my network config and nothing works any more.

JorgeB · September 30, 2022

You can leave them enabled. If they are not being used they should not cause any issues. Leave them as eth2 and eth3 with no IP address configured.

RavenX · October 5, 2022

Okay, it seems to be stable for now (4 days). Thanks for the help!
