[Solved] Random reboots "Kernel panic - not syncing: Timeout: Not all cpus entered broadcast exception handler" - General Support

July 27, 2019

Hello everyone

I've had my server since last Christmas, so I'm quite new to this. This morning I woke up to a hardware error on my system:

Quote

Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: CPU 8: Machine Check Event: 0 Bank 5: cc1d53c000010091
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: TSC 0
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: ADDR 106714a940
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: MISC 2040444486
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: PROCESSOR 0:206d6 TIME 1564198254 SOCKET 1 APIC 20
Jul 27 05:30:54 Tower kernel: EDAC MC1: 30031 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x106714a offset:0x940 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:1)
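For readers who want to cross-check the raw values in the log above, here is a minimal Python sketch that decodes the status word from the "Machine Check Event" line and the ADDR value. It assumes the standard Intel IA32_MCi_STATUS bit layout and the compound error code format for memory controller errors; the function names are made up for this illustration and are not part of any Unraid or kernel tool.

```python
# Sketch only: bit positions assume the standard Intel IA32_MCi_STATUS layout.
# Function names are hypothetical and exist only for this illustration.

MEMORY_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_mci_status(status):
    """Split a 64-bit machine check status word into its main fields."""
    return {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,      # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": status & 0xFFFF,                     # bits 15:0
    }


def decode_memory_code(mca_code):
    """Decode a memory controller error code of the form 0000 0000 1MMM CCCC."""
    if (mca_code & 0xFF80) != 0x0080:
        return None  # not a memory controller error code
    operation = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
    channel = mca_code & 0xF  # 0xF would mean "channel not specified"
    return operation, channel


def split_address(addr, page_bits=12):
    """Split a physical address into (page, offset), assuming 4 KiB pages."""
    return addr >> page_bits, addr & ((1 << page_bits) - 1)


fields = decode_mci_status(0xCC1D53C000010091)  # value from the Bank 5 line
print(fields["corrected_count"], fields["uncorrected"], fields["overflow"])
print(decode_memory_code(fields["mca_code"]))
print([hex(part) for part in split_address(0x106714A940)])
```

Decoded this way, the status word gives a corrected error count of 30031, a memory read on channel 1, and page 0x106714a with offset 0x940, which agrees with the EDAC summary line. The uncorrected flag is clear and the overflow flag is set, matching the "CE" and "OVERFLOW" markers in that line.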

As far as I understand, there's a defective memory module. Should I just remove or replace it?

The syslog says it's the Channel 1, DIMM 0 module. I've attached the diagram of my motherboard. Would Channel 1, DIMM 0 correspond to CPU1_DIMM_C0?

Thanks in advance for your help!

syslog tower-diagnostics-20190727-0756.zip

Edited September 17, 2019 by sedoro
Title updated

July 27, 2019

Community Expert

There is memtest on the boot menu.

July 27, 2019

Memtest won't find the errors since they're being corrected.

Sent from my phone as I'm probably having a beer and enjoying a fire

July 29, 2019

Author

Thanks for the answers. I tried memtest from the boot menu; the system rebooted, but nothing happened.

After the reboot, everything was fine until this morning, when I received another hardware error. This one is different.

What should I do next?

syslog290719

August 30, 2019

Author

It's been a month since I had the first hardware error, and it has only gotten worse.

The system has been rebooting randomly since the end of July (kernel panic reboots; see attached capture).

I haven't been able to complete a parity check, as the system always reboots before it finishes (10 TB, usually 25 hours), and I know there are parity errors, so I'm living on the edge now.

When no parity check is running, the longest stretch without a reboot has been 4 days, but it is so random that sometimes it reboots before I can even start the array again.

This is what I've ruled out and why:

RAM: I removed all sticks but one and ran the system. Same reboots. I tried this with 3 different sticks and different slots.
PSU: I have dual PSUs and have tried running with only one at a time, with the same result.
UPS: Ran the system connected directly to AC. Same results.
Latest Unraid upgrade: The problems started, more or less, when I upgraded to 6.7.2. I downgraded to 6.7.1, but the reboots continue as before.

I also removed both CPUs, checked for dust or bent pins, and applied new thermal paste afterwards.

I contacted the retailer and after some hardware tests they said this:

Quote

It seems to be a small known issue with unraid, something to do with Broadwell Era CPU's. Some people are suggesting a boot option to set C States on the CPU to C1 but not a definitive fix.
Have a look at this forum and see if any applies to your situation - https://forums.unraid.net/topic/55140-632-kernel-panic-not-syncing-timeout-not-all-cpus-entered-broadcast-exception-handler/

Could it be related to buggy microcode or to a software problem? They say I could try downgrading to 6.3.2, as that seemed to be the version discussed in that thread. What do you think? Is it worth trying?

Also, two days ago I got a new hardware error:

Quote

Aug 28 01:42:03 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 28 01:42:03 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Aug 28 01:42:03 Tower kernel: EDAC MC1: 14 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x105fa95 offset:0xf00 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:0)
Aug 28 05:30:38 Tower root: Fix Common Problems: Error: Machine Check Events detected on your server
Aug 28 05:30:38 Tower root: Hardware event. This is not a software error.
Aug 28 05:30:38 Tower root: Uncorrected error
Aug 28 05:30:38 Tower root: Data CACHE Level-2 Snoop Error
Aug 28 12:38:46 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 28 12:38:46 Tower kernel: mce: [Hardware Error]: Machine check events logged
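Since the corrected errors quoted in this thread point at different DIMM labels, a per-DIMM tally can help show whether they concentrate on one module or spread across several. The following is a rough Python sketch that parses EDAC summary lines in the format shown in the logs above; the regular expression and the function name are assumptions made for this illustration, and other kernel versions or EDAC drivers may format these lines differently.

```python
import re
from collections import Counter

# Sketch only: the pattern is written against the two EDAC summary lines quoted
# in this thread; it is an assumption that other lines follow the same layout.
EDAC_LINE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+)"
)


def tally_edac_errors(lines):
    """Sum the reported error counts per (kind, DIMM label)."""
    totals = Counter()
    for line in lines:
        match = EDAC_LINE.search(line)
        if match:
            totals[(match["kind"], match["label"])] += int(match["count"])
    return totals


sample = [
    "Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR",
    "Jul 27 05:30:54 Tower kernel: EDAC MC1: 30031 CE memory read error on "
    "CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x106714a offset:0x940 "
    "grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0091 socket:1 ha:0 "
    "channel_mask:2 rank:1)",
    "Aug 28 01:42:03 Tower kernel: EDAC MC1: 14 CE memory read error on "
    "CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x105fa95 offset:0xf00 "
    "grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:1 ha:0 "
    "channel_mask:1 rank:0)",
]

for (kind, label), count in tally_edac_errors(sample).items():
    print(kind, label, count)
```

To run it against a full log, the lines of the syslog file can be passed in place of `sample`, for example `tally_edac_errors(open("syslog"))`, where the file name is a placeholder.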

Thanks all for your help.

PS: Title changed according to the new symptoms.

syslog

September 5, 2019

Author

So I've been able to complete a parity check (25 hours, 2,943 errors) by downgrading Unraid to 6.6.7. The system has been up for 1 day and 2 hours now. Maybe I've just been lucky, but I have a good feeling, as I had tried a parity check about 20 times before on version 6.7.x with no luck. No more MCE errors either. I'll run another parity check in a few days, and if the system doesn't reboot I'll add [Solved] to the title.

September 17, 2019

Author

Hi

It's been 14 days of uptime with no problems or errors. Two parity checks completed with 0 errors. It seems the problem was somehow related to Unraid 6.7.x. I hope it gets fixed in future updates.

I find the "Hardware event. This is not a software error." message quite misleading.
