
Kernel Panic and Out Of Memory Errors [SOLVED]



I'm not good at decoding kernel panics in the syslog and need some help, please.

 

System Specs:

Supermicro X11SSH-LN4F - C-states disabled in BIOS

Xeon E3-1240 v6

64 GB DDR4 ECC

 

2 ZFS pools, one Nytro SSD pool and one HDD pool

UNRAID disk is on an nvme, using a Samsung Fit flash drive.

No longer running VMs or Docker containers, but the system ran stable with them.

 

Got some remote replication to another machine.

Enabled saving syslog to the flash drive and waited for another failure.

 

Found a kernel panic on Feb 23rd at 14:12, and it repeats from then on. The system kept working but was cranky.

The system gets worse around Feb 24th at 14:00.

Out-of-memory errors start on Feb 24th at 14:08 and repeat: kernel: Out of memory: Kill process 12302 (monitor) score 0 or sacrifice child

The system required a hard reboot on Feb 24th at 14:43.
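
For anyone else trying to pull a timeline like this out of a saved syslog, here is a minimal Python sketch that picks out the interesting lines. The out-of-memory pattern is taken from the line quoted above. The "Kernel panic" pattern and the file path in the usage comment are assumptions, so adjust them to what your own log actually contains.

```python
import re

# Patterns are assumptions: the OOM text comes from the line quoted in the
# post, the panic pattern is the usual "Kernel panic" wording.
PATTERNS = {
    "oom": re.compile(r"Out of memory: Kill process (\d+) \((.+?)\)"),
    "panic": re.compile(r"Kernel panic", re.IGNORECASE),
}


def classify(line):
    """Return 'oom', 'panic', or None for one syslog line."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def scan(lines):
    """Yield (kind, line) for every line that matches one of the patterns."""
    for line in lines:
        kind = classify(line)
        if kind:
            yield kind, line.rstrip("\n")


if __name__ == "__main__":
    import sys

    # Usage (the path is hypothetical): python scan_syslog.py /boot/logs/syslog
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as fh:
            for kind, line in scan(fh):
                print(f"[{kind}] {line}")
```

Printing only the matching lines makes it easier to see when the first event happened and how often it repeats.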

 

I've been digging through forums and Google and came up with possible bad RAM, but it's ECC RAM.

Another possibility is leftover routes from VMs and Dockers. Easy to remove, but I can't see that causing it.

It takes about 6 days to blow up. Currently rebooting on day 4.

 

There is a record of a boot where the two pools tried mounting to the same mount point. Ignore that, it got fixed.

There is another kernel panic on Feb 24th at 17:24, but it doesn't repeat. The system is stable from that point on.

 

Thanks in advance.

 

syslog


This is the issue I'm having: 1 CPU core pegged at 100% by the wsdd process. This goes on until the kernel starts panicking or running out of memory.

Trying the restart script at the end of the thread and will report back.

 

https://forums.unraid.net/topic/85073-wsdd-100-using-1-core/page/2/

 

EDIT: The script did not reset the 100% CPU usage. Disabled WSD. Keeping the script running for now.


Finally found this; looks like I've got a bad stick of RAM.

I'm running ECC RAM and reseated the RAM during the first server crash.

memtest didn't show anything after a quick run; I didn't have time to do a full test.

Heat is not an issue with my setup. I have a good quality 750W PSU.

Going to swap the DIMM on channel 0 with the one in channel 3 and see if this shows up again.

 

Mar  4 07:03:17 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 07:04:56 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 07:15:59 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 07:19:42 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 07:22:31 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 07:23:28 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 07:25:47 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 07:26:18 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 07:26:38 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 07:55:16 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 07:58:00 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 07:59:02 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 08:01:05 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 08:09:12 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 08:09:58 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 08:10:03 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 08:18:00 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 08:21:35 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar  4 08:21:56 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
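
When there are a lot of these lines, it can help to tally them per location instead of reading them one by one. A rough Python sketch follows. The regular expression is an assumption based only on the format of the lines above, so other EDAC drivers may log differently.

```python
import re
from collections import Counter

# Matches corrected-error (CE) lines in the format shown in the post.
# Assumption: other EDAC drivers may word their messages differently.
EDAC_CE = re.compile(r"EDAC MC(\d+): (\d+) CE .*?csrow:(\d+) channel:(\d+)")


def count_corrected_errors(lines):
    """Return a Counter keyed by (mc, csrow, channel) with the CE totals."""
    counts = Counter()
    for line in lines:
        match = EDAC_CE.search(line)
        if match:
            mc, n_errors, csrow, channel = map(int, match.groups())
            counts[(mc, csrow, channel)] += n_errors
    return counts


sample = [
    "Mar  4 07:03:17 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 "
    "(csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)",
    "Mar  4 07:04:56 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 "
    "(csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)",
]

for (mc, csrow, channel), total in count_corrected_errors(sample).items():
    print(f"mc{mc} csrow{csrow} channel{channel}: {total} corrected errors")
```

If every error lands on the same csrow and channel, as it does here, that points at one DIMM rather than the memory controller or the board.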

 

Edit: To help find which DIMM is having the error:

root@system~: grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count

/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:34 <- ERROR COUNT /mc0/csrow1/ch0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch1_ce_count:0
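
The same counters can be read from a script, which is handy for a scheduled check. This is a sketch only: it assumes the sysfs layout shown in the grep output above (mcX/csrowX/chX_ce_count), and the function names are made up for illustration.

```python
import re
from pathlib import Path

# Assumption: the layout matches the grep output in the post.
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_ce_counts(edac_root=EDAC_ROOT):
    """Return {(mc, csrow, channel): corrected_error_count} from sysfs."""
    counts = {}
    for path in sorted(Path(edac_root).glob("mc*/csrow*/ch*_ce_count")):
        mc = int(path.parts[-3][len("mc"):])
        csrow = int(path.parts[-2][len("csrow"):])
        channel = int(re.match(r"ch(\d+)_ce_count", path.name).group(1))
        counts[(mc, csrow, channel)] = int(path.read_text().strip())
    return counts


def nonzero(counts):
    """Keep only the locations that have logged at least one corrected error."""
    return {key: value for key, value in counts.items() if value > 0}


if __name__ == "__main__":
    for (mc, csrow, channel), count in nonzero(read_ce_counts()).items():
        print(f"mc{mc}/csrow{csrow}/ch{channel}: {count} corrected errors")
```

On a machine without the EDAC driver loaded the glob finds nothing and the script prints nothing.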

 

mcX = Memory Controller (single, dual CPU)

chX = Channel 0, Channel 1, Channel 2 (single, dual, triple channel RAM)

csrowX = chip-select row, see chart

 

             Channel 0     Channel 1     Channel 2
============================================
csrow0   |  DIMM_A0   |   DIMM_B0   |   DIMM_C0   |
csrow1   |  DIMM_A0   |   DIMM_B0   |   DIMM_C0   |
============================================
csrow2   |  DIMM_A1   |   DIMM_B1   |   DIMM_C1   |
csrow3   |  DIMM_A1   |   DIMM_B1   |   DIMM_C1   |
============================================
csrow4   |  DIMM_A2   |   DIMM_B2   |   DIMM_C2   |
csrow5   |  DIMM_A2   |   DIMM_B2   |   DIMM_C2   |
============================================
csrow6   |  DIMM_A3   |   DIMM_B3   |   DIMM_C3   |
csrow7   |  DIMM_A3   |   DIMM_B3   |   DIMM_C3   |
============================================
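
The lookup in the chart can be written as a small function. This sketch follows the pattern in the first rows of the chart: one letter per channel and two csrows per DIMM slot, which assumes dual-rank DIMMs. The DIMM_xN names are the chart's generic labels, not the silkscreen labels on any particular board.

```python
import string


def dimm_label(csrow, channel):
    """Map an EDAC csrow/channel pair to the chart's generic DIMM label.

    Assumption: each DIMM occupies two consecutive csrows (dual rank),
    as in the chart. Single-rank DIMMs would map differently.
    """
    letter = string.ascii_uppercase[channel]  # channel 0 -> A, 1 -> B, ...
    slot = csrow // 2                         # csrow 0-1 -> 0, 2-3 -> 1, ...
    return f"DIMM_{letter}{slot}"


# The location reported in the error count above: csrow1, channel 0.
print(dimm_label(1, 0))  # DIMM_A0
```

Always confirm against the board manual or dmidecode before pulling a stick, since the slot naming differs between vendors.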

 

root@system~: dmidecode -t memory | grep 'Locator'

Locator: DIMMA1 <- THIS ONE - DIMM_A0
Bank Locator: P0_Node0_Channel0_Dimm0
Locator: DIMMA2
Bank Locator: P0_Node0_Channel0_Dimm1
Locator: DIMMB1
Bank Locator: P0_Node0_Channel1_Dimm0
Locator: DIMMB2
Bank Locator: P0_Node0_Channel1_Dimm1
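
To go from a channel and DIMM index to the label printed on the board, the dmidecode output can be parsed into a lookup table. A sketch, assuming the "Bank Locator" strings look like the ones above (P0_Node0_ChannelX_DimmY). That format is vendor-specific, so the pattern may need changing on other boards.

```python
import re

# Assumption: Bank Locator strings follow the format shown in the post.
BANK = re.compile(r"Channel(\d+)_Dimm(\d+)")


def parse_locators(text):
    """Map (channel, dimm) -> board label, e.g. (0, 0) -> 'DIMMA1'."""
    slots = {}
    label = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Bank Locator:"):
            match = BANK.search(line)
            if match and label is not None:
                slots[(int(match.group(1)), int(match.group(2)))] = label
            label = None
        elif line.startswith("Locator:"):
            label = line.split(":", 1)[1].strip()
    return slots


SAMPLE = """\
Locator: DIMMA1
Bank Locator: P0_Node0_Channel0_Dimm0
Locator: DIMMA2
Bank Locator: P0_Node0_Channel0_Dimm1
Locator: DIMMB1
Bank Locator: P0_Node0_Channel1_Dimm0
Locator: DIMMB2
Bank Locator: P0_Node0_Channel1_Dimm1
"""

for (channel, dimm), label in sorted(parse_locators(SAMPLE).items()):
    print(f"channel {channel}, dimm {dimm}: {label}")
```

Here channel 0, dimm 0 comes out as DIMMA1, the slot marked above.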

 

EDIT: Moved the stick of RAM and got an error in another slot. Ordering a new stick of RAM.

/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch1_ce_count:1 <- NEW ERROR

 

EDIT2: 

/sys/devices/system/edac/mc/mc0/csrow3/ch1_ce_count:2 <- NEW ERROR

Errors have slowed but at least I have a second error now. RAM incoming.

 

EDIT3:
Replaced the defective DIMM, 4 days with no errors. Turned WSD back on. Watching logs and CPU usage. The WSD script is still running.

 

EDIT4:

No more memory errors and WSD is running without eating a full CPU. Calling this solved.

 

 
