Kernel Panic and Out Of Memory Errors [SOLVED]

Holmesware · February 25, 2021

I'm not good with decoding syslog kernel panics and need some help please.

System Specs:

Supermicro X11SSH-LN4F - C-states disabled in BIOS

Xeon E3-1240 v6

64 GB DDR4 ECC

2 ZFS pools, one Nytro SSD pool and one HDD pool

UNRAID disk is on an NVMe, using a Samsung Fit flash drive.

No longer running VMs or Dockers but ran stable with them.

Got some remote replication to another machine.

Enabled saving syslog to the flash drive and waited for another failure.

Found a kernel panic on Feb 23rd at 14:12, and it repeats going forward. The system kept working but was cranky.

The system gets worse around Feb 24th at 14:00.

Out of memory errors start on Feb 24th at 14:08 and repeat. kernel: Out of memory: Kill process 12302 (monitor) score 0 or sacrifice child

The system required a hard reboot on Feb 24th at 14:43.

I've been digging through forums and Google, and bad RAM comes up as a possible cause, but this is ECC RAM.

Leftover routes from VMs and Dockers are another possibility. They're easy to remove, but I can't see that causing it.

Takes about 6 days to blow up. Currently rebooting on day 4.

There is a record of a boot where the two pools tried mounting to the same mount point. Ignore that, it got fixed.

Another Kernel panic at Feb 24th 17:24 but it doesn't repeat. System stable from this point on.

Thanks in advance.

syslog

Edited March 18, 2021 by Holmesware

Squid · February 26, 2021

Can you also post your diagnostics so the task lists etc. can be put into perspective?

Holmesware · February 26, 2021

Sorry, here they are. Thanks for looking.

brewmaster-diagnostics-20210226-0651.zip

Holmesware · March 3, 2021

This is the issue I'm having: 1 CPU pegged at 100%, and wsdd is the process. This goes on until the kernel starts panicking or running out of memory.

Trying the restart script at the end of the thread and will report back.

https://forums.unraid.net/topic/85073-wsdd-100-using-1-core/page/2/

EDIT: The script did not reset the 100% CPU usage. Disabled WSD. Keeping the script running for now.

Edited March 3, 2021 by Holmesware

Holmesware · March 4, 2021

Finally found this, looks like I got a bad stick of RAM.

I'm running ECC RAM and reseated the RAM during the first server crash.

memtest didn't show anything after a quick run; I didn't have time to do a full test.

Heat is not an issue with my setup. I have a good quality 750W PSU.

Going to swap the DIMM on channel 0 with the one in channel 3 and see if this shows up again.

Mar 4 07:03:17 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:04:56 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:15:59 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:19:42 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:22:31 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:23:28 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:25:47 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:26:18 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:26:38 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:55:16 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:58:00 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:59:02 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 08:01:05 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 08:09:12 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 08:09:58 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 08:10:03 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 08:18:00 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 08:21:35 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 08:21:56 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
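
For anyone who wants to tally these from a saved syslog instead of reading them by eye, here is a minimal Python sketch. The regex is an assumption based only on the ie31200 lines pasted above; other EDAC drivers may format their messages differently, and the function name is just illustrative.

```python
import re
from collections import Counter

# Pattern inferred from the sample lines in this post (an assumption, not a
# general EDAC log grammar): memory controller, CE count, csrow, channel.
EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*?"
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+)"
)

def count_corrected_errors(lines):
    """Sum corrected-error counts per (mc, csrow, channel) tuple."""
    totals = Counter()
    for line in lines:
        match = EDAC_CE.search(line)
        if match:
            key = (int(match["mc"]), int(match["csrow"]), int(match["channel"]))
            totals[key] += int(match["count"])
    return totals

sample = [
    "Mar 4 07:03:17 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 "
    "(csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)",
    "Mar 4 07:04:56 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 "
    "(csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)",
    "Mar 4 07:05:00 kernel: some unrelated message",
]

for (mc, csrow, channel), total in count_corrected_errors(sample).items():
    print(f"mc{mc} csrow{csrow} channel{channel}: {total} corrected errors")
```

To run it against a real log, pass an open file object instead of `sample`; the path to the saved syslog depends on where you told Unraid to write it.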

Edit: To help find which DIMM is having the error:

root@system~: grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:34 <- ERROR COUNT /mc0/csrow1/ch0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch1_ce_count:0

mcX = Memory Controller (single or dual CPU)

chX = Channel 0, Channel 1, Channel 2 (single, dual, triple channel RAM)

csrowX = see chart
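
The same check as the grep above can be scripted so that only the non-zero counters are reported. This is a rough sketch, assuming the sysfs layout shown in this post (`mc*/csrow*/ch*_ce_count`); the function name and the `edac_root` parameter are my own, and newer kernels may expose counters under a different layout.

```python
from pathlib import Path

def nonzero_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {relative path: count} for every ch*_ce_count file above zero.

    The directory layout is assumed from the grep output in this thread.
    """
    root = Path(edac_root)
    results = {}
    for path in sorted(root.glob("mc*/csrow*/ch*_ce_count")):
        count = int(path.read_text().strip())
        if count > 0:
            results[path.relative_to(root).as_posix()] = count
    return results
```

Calling `nonzero_ce_counts()` on the server and printing the result should list only the csrow/channel pairs that have logged corrected errors; an empty result means no counters have incremented since boot.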

root@system~: dmidecode -t memory | grep 'Locator'

Locator: DIMMA1 <- THIS ONE - DIMM_A0
Bank Locator: P0_Node0_Channel0_Dimm0
Locator: DIMMA2
Bank Locator: P0_Node0_Channel0_Dimm1
Locator: DIMMB1
Bank Locator: P0_Node0_Channel1_Dimm0
Locator: DIMMB2
Bank Locator: P0_Node0_Channel1_Dimm1
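
If you want the slot names and bank locators side by side, a small Python sketch like the one below can pair them up from the filtered dmidecode output. It assumes each `Locator:` line is followed by its `Bank Locator:` line, as in the output above. It does not translate csrow numbers into slots; that mapping is board-specific and still needs the chart.

```python
def pair_locators(dmidecode_text):
    """Return (locator, bank_locator) tuples in the order they appear."""
    pairs = []
    current = None
    for raw in dmidecode_text.splitlines():
        line = raw.strip()
        # Check "Bank Locator:" first, since "Locator:" is a suffix of it.
        if line.startswith("Bank Locator:"):
            if current is not None:
                pairs.append((current, line.split(":", 1)[1].strip()))
                current = None
        elif line.startswith("Locator:"):
            current = line.split(":", 1)[1].strip()
    return pairs

sample = """Locator: DIMMA1
Bank Locator: P0_Node0_Channel0_Dimm0
Locator: DIMMA2
Bank Locator: P0_Node0_Channel0_Dimm1
Locator: DIMMB1
Bank Locator: P0_Node0_Channel1_Dimm0
Locator: DIMMB2
Bank Locator: P0_Node0_Channel1_Dimm1"""

for slot, bank in pair_locators(sample):
    print(f"{slot}: {bank}")
```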

EDIT: Moved the stick of RAM and got an error in another slot. Ordering new stick of RAM.

/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch1_ce_count:1 <- NEW ERROR

EDIT2:

/sys/devices/system/edac/mc/mc0/csrow3/ch1_ce_count:2 <- NEW ERROR

Errors have slowed but at least I have a second error now. RAM incoming.

EDIT3:
Replaced the defective DIMM; 4 days with no errors. Turned WSD back on. Watching logs and CPU usage. WSD script still running.

EDIT4:

No more memory errors, and WSD is running without eating a full CPU. Calling this solved.

Edited March 18, 2021 by Holmesware
