[Solved] Random reboots "Kernel panic - not syncing: Timeout: Not all cpus entered broadcast exception handler"



Hello everyone

 

I've had my server since last Christmas, so I'm quite new to this. This morning I woke up to a hardware error in my system:

 

Quote

Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: CPU 8: Machine Check Event: 0 Bank 5: cc1d53c000010091
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: TSC 0 
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: ADDR 106714a940 
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: MISC 2040444486 
Jul 27 05:30:54 Tower kernel: EDAC sbridge MC1: PROCESSOR 0:206d6 TIME 1564198254 SOCKET 1 APIC 20
Jul 27 05:30:54 Tower kernel: EDAC MC1: 30031 CE memory read error on CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x106714a offset:0x940 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:1)

 

As far as I understand, there's a defective memory module. Should I just remove or replace it?
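For anyone reading along, the Bank 5 status word in the log can be unpacked by hand. This is only a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout from the Intel manuals (valid bit 63, overflow 62, uncorrected 61, corrected error count in bits 52:38, model-specific code in 31:16, MCA code in 15:0). The function name and dictionary keys are made up for illustration.

```python
# Status word copied from the "Machine Check Event: 0 Bank 5" syslog line.
STATUS = 0xCC1D53C000010091


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine-check status word into its main fields.

    Bit positions are assumed from the architectural IA32_MCi_STATUS layout.
    """
    return {
        "valid": bool((status >> 63) & 1),
        "overflow": bool((status >> 62) & 1),
        "uncorrected": bool((status >> 61) & 1),
        "misc_valid": bool((status >> 59) & 1),
        "addr_valid": bool((status >> 58) & 1),
        "context_corrupt": bool((status >> 57) & 1),
        "corrected_count": (status >> 38) & 0x7FFF,      # bits 52:38
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": status & 0xFFFF,                     # bits 15:0
    }


if __name__ == "__main__":
    for name, value in decode_mci_status(STATUS).items():
        print(f"{name}: {value}")
```

Under those assumptions the word decodes as a corrected (not uncorrected) error with the overflow bit set, and the count and code fields line up with the `30031 CE` and `err_code:0001:0091` values that EDAC prints in the last line of the quote.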

 

Syslog says it's the Channel 1, DIMM 0 module. I've attached the diagram of my motherboard. Would Channel 1, DIMM 0 correspond to CPU1_DIMM_C0?
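To pull the location fields out of the EDAC line without reading it by eye, something like the following could work. It is a rough sketch: the regular expression is written against the exact line quoted above (not against any documented format guarantee), the function name is hypothetical, and 4 KiB pages are assumed when rebuilding the address.

```python
import re

# The corrected-error line from the syslog excerpt above.
LINE = (
    "Jul 27 05:30:54 Tower kernel: EDAC MC1: 30031 CE memory read error on "
    "CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x106714a "
    "offset:0x940 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM "
    "err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:1)"
)

# Pattern tailored to the line above; other drivers may print other fields.
PATTERN = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:0x(?P<page>[0-9a-f]+) offset:0x(?P<offset>[0-9a-f]+)"
)


def parse_edac_line(line: str):
    """Return the location fields of an EDAC error line, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    page = int(m["page"], 16)
    offset = int(m["offset"], 16)
    return {
        "controller": int(m["mc"]),
        "count": int(m["count"]),
        "kind": m["kind"],
        "label": m["label"],
        "channel": int(m["channel"]),
        "slot": int(m["slot"]),
        # Assumes 4 KiB pages: address = page * 4096 + offset.
        "address": (page << 12) | offset,
    }


if __name__ == "__main__":
    info = parse_edac_line(LINE)
    print(info["label"], hex(info["address"]))
```

The rebuilt address comes out as 0x106714a940, the same value as the `ADDR` line in the log. The label only names the socket, channel and slot as the driver sees them. How that maps to the names printed on the board depends on the motherboard, so the manual's diagram is still needed for that step.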


Anotació 2019-07-27 100653.png

 

Thanks in advance for your help!

syslog tower-diagnostics-20190727-0756.zip

Edited by sedoro
Title updated
  • 1 month later...

It's been a month since I had the first hardware error, and it has only gotten worse.

 

The system has been rebooting randomly since the end of July (kernel panic reboots, see the attached capture).

I haven't been able to complete a parity check, as the system always reboots before it finishes (10 TB, usually 25 hours). I know there are parity errors, so I'm living on the edge now.

When no parity check is running, the longest stretch without a reboot has been 4 days, but it is so random that sometimes it reboots before I can even start the array again.

 

This is what I've ruled out and why:

  • RAM: I removed all sticks but one and ran the system. Same reboots. I tried this with 3 different sticks in different slots.
  • PSU: I have dual PSUs and have tried running with only one at a time, with the same result.
  • UPS: Plugged the system directly into AC. Same results.
  • Latest Unraid upgrade: The problems started, more or less, when I upgraded to 6.7.2. I downgraded to 6.7.1, but the reboots continued as before.

I also removed both CPUs, checked for dust and bent pins, and applied new thermal grease afterwards.

 

I contacted the retailer and after some hardware tests they said this:

Quote

It seems to be a small known issue with unraid, something to do with Broadwell Era CPU's. Some people are suggesting a boot option to set C States on the CPU to C1 but not a definitive fix.
Have a look at this forum and see if any applies to your situation - https://forums.unraid.net/topic/55140-632-kernel-panic-not-syncing-timeout-not-all-cpus-entered-broadcast-exception-handler/

 

 

Could it be related to buggy microcode or to a software problem? They say I could try downgrading to 6.3.2, as that seemed to be the topic of that thread. What do you think? Is it worth trying?

 

Also, two days ago I got a new Hardware Error:

Quote

Aug 28 01:42:03 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 28 01:42:03 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
Aug 28 01:42:03 Tower kernel: EDAC MC1: 14 CE memory read error on CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x105fa95 offset:0xf00 grain:32 syndrome:0x0 - OVERFLOW area:DRAM err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:0)
Aug 28 05:30:38 Tower root: Fix Common Problems: Error: Machine Check Events detected on your server
Aug 28 05:30:38 Tower root: Hardware event. This is not a software error.
Aug 28 05:30:38 Tower root: Uncorrected error
Aug 28 05:30:38 Tower root: Data CACHE Level-2 Snoop Error
Aug 28 12:38:46 Tower kernel: mce: [Hardware Error]: Machine check events logged
Aug 28 12:38:46 Tower kernel: mce: [Hardware Error]: Machine check events logged
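With errors now reported on more than one DIMM, it can help to total the corrected-error counts per DIMM label across the syslog. A minimal sketch, assuming the leading number in each `CE` line is the error count for that report. The pattern is written against the two lines quoted in this thread and the function name is hypothetical.

```python
import re
from collections import Counter

# The two corrected-error lines quoted in this thread, plus one unrelated line.
SYSLOG_LINES = [
    "Jul 27 05:30:54 Tower kernel: EDAC MC1: 30031 CE memory read error on "
    "CPU_SrcID#1_Ha#0_Chan#1_DIMM#0 (channel:1 slot:0 page:0x106714a "
    "offset:0x940 grain:32 syndrome:0x0 -  OVERFLOW area:DRAM "
    "err_code:0001:0091 socket:1 ha:0 channel_mask:2 rank:1)",
    "Aug 28 01:42:03 Tower kernel: EDAC MC1: 14 CE memory read error on "
    "CPU_SrcID#1_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x105fa95 "
    "offset:0xf00 grain:32 syndrome:0x0 - OVERFLOW area:DRAM "
    "err_code:0001:0090 socket:1 ha:0 channel_mask:1 rank:0)",
    "Aug 28 12:38:46 Tower kernel: mce: [Hardware Error]: Machine check events logged",
]

# Captures the reported count and the DIMM label of a corrected-error line.
CE_LINE = re.compile(r"EDAC MC\d+: (\d+) CE .*? on (\S+) \(")


def corrected_errors_by_dimm(lines):
    """Sum the reported corrected-error counts per DIMM label."""
    totals = Counter()
    for line in lines:
        m = CE_LINE.search(line)
        if m:
            totals[m.group(2)] += int(m.group(1))
    return totals


if __name__ == "__main__":
    for label, count in corrected_errors_by_dimm(SYSLOG_LINES).most_common():
        print(f"{count:>6}  {label}")
```

On the lines quoted here, this lists two different channels (Chan#1 in July, Chan#0 in August), both on socket 1 behind the same memory controller (MC1).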

 

Thanks, all, for your help.

 

P.S.: Title changed to reflect the new symptoms.

image.png

syslog


So I've been able to complete a parity check (25 hours, 2,943 errors) by downgrading Unraid to 6.6.7. The system has been up for 1 day and 2 hours now. Maybe I've just been lucky, but I have a good feeling about it, as I had tried the parity check about 20 times with version 6.7.x without success. No more MCE errors either. I'll run another parity check in a few days, and if the system doesn't reboot I'll add [Solved] to the title.


Hi

 

It's been 14 days of uptime with no problems or errors. Two parity checks completed with 0 errors. It seems the problem was somehow related to Unraid 6.7.x. I hope it gets fixed in a future update.

 

I find the "Hardware event. This is not a software error." message quite misleading.
