Holmesware Posted February 25, 2021 (edited)

I'm not good at decoding syslog kernel panics and need some help, please.

System specs:
Supermicro X11SSH-LN4F, C-states disabled in BIOS
Xeon E3-1240 v6
64 GB DDR4 ECC
2 ZFS pools, one Nytro SSD pool and one HDD pool
UNRAID disk is on an NVMe, using a Samsung Fit flash drive.

No longer running VMs or Dockers, but the system ran stable with them. Got some remote replication to another machine.

I enabled saving the syslog to the flash drive and waited for another failure.

Found a kernel panic on Feb 23rd at 14:12, and it repeats going forward. The system worked but was cranky.
System gets worse around Feb 24th 14:00.
Out of memory errors start at Feb 24th 14:08 and repeat:
kernel: Out of memory: Kill process 12302 (monitor) score 0 or sacrifice child
System required a hard reboot on Feb 24th at 14:43.

I've been digging through forums and Google and came up with possible bad RAM, but it's ECC RAM. Another possibility is leftover routes from VMs and Dockers. Those are easy to remove, but I can't see that causing it. It takes about 6 days to blow up. Currently rebooting on day 4.

There is a record of a boot where the two pools tried mounting to the same mount point. Ignore that, it got fixed. There is another kernel panic on Feb 24th at 17:24, but it doesn't repeat. System is stable from that point on.

Thanks in advance.

Attached: syslog

Edited March 18, 2021 by Holmesware
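For anyone hunting the same symptom, the out-of-memory line quoted in the post above can be pulled out of a saved syslog with a few lines of Python. This is only a minimal sketch: the function name `find_oom_kills` is made up for illustration, and the pattern assumes the message wording shown in the post (it also accepts "Killed process", which newer kernels are assumed to print).

```python
import re

# Assumption: the OOM-killer message looks like the line quoted in the post,
# "Out of memory: Kill process <pid> (<name>) ...". "Killed" is accepted too.
OOM_KILL = re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \((.+?)\)")


def find_oom_kills(lines):
    """Return (pid, process_name) for every OOM-killer line in `lines`."""
    hits = []
    for line in lines:
        match = OOM_KILL.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


if __name__ == "__main__":
    # The sample is the message from the post, without its timestamp.
    sample = [
        "kernel: Out of memory: Kill process 12302 (monitor) score 0 or sacrifice child",
    ]
    print(find_oom_kills(sample))
```

To run it against a real log, pass an open file object for the saved syslog instead of the sample list, since the function only needs an iterable of lines.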
Squid Posted February 26, 2021

Can you also post your diagnostics so the task lists etc. can be put into perspective?
Holmesware Posted February 26, 2021 (Author)

Sorry, here they are. Thanks for looking.

Attached: brewmaster-diagnostics-20210226-0651.zip
Holmesware Posted March 3, 2021 (Author, edited)

This is the issue I'm having: 1 CPU core pegged at 100%, and wsdd is the process. This goes on until the kernel starts panicking or running out of memory. I'm trying the restart script at the end of this thread and will report back.

https://forums.unraid.net/topic/85073-wsdd-100-using-1-core/page/2/

EDIT: The script did not reset the 100% CPU usage. Disabled WSD. Kept the script running for now.

Edited March 3, 2021 by Holmesware
Holmesware Posted March 4, 2021 (Author, edited)

Finally found this; it looks like I have a bad stick of RAM. I'm running ECC RAM and reseated the RAM during the first server crash. memtest didn't show anything after a quick run, and I didn't have time to do a full test. Heat is not an issue with my setup. I have a good quality 750W PSU. I'm going to swap the DIMM on channel 0 with the one in channel 3 and see if this shows up again.

Mar 4 07:03:17 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:04:56 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:15:59 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:19:42 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:22:31 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:23:28 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:25:47 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:26:18 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:26:38 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:55:16 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:58:00 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 07:59:02 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 08:01:05 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 08:09:12 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 08:09:58 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 08:10:03 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 08:18:00 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 08:21:35 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)
Mar 4 08:21:56 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)

Edit: To help find which DIMM is having the error:

root@system~: grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:34 <- ERROR COUNT /mc0/csrow1/ch0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch1_ce_count:0

mcX = Memory Controller (single, dual CPU)
chX = Channel 0, Channel 1, Channel 2 (single, dual, triple channel RAM)
csrowX = see chart

          Channel 0  Channel 1  Channel 2
============================================
csrow0  | DIMM_A0  | DIMM_B0  | DIMM_C0  |
csrow1  | DIMM_A0  | DIMM_B0  | DIMM_C0  |
============================================
============================================
csrow2  | DIMM_A1  | DIMM_B1  | DIMM_C0  |
csrow3  | DIMM_A1  | DIMM_B1  | DIMM_C0  |
============================================
============================================
csrow4  | DIMM_A1  | DIMM_B1  | DIMM_C0  |
csrow5  | DIMM_A1  | DIMM_B1  | DIMM_C0  |
============================================
============================================
csrow6  | DIMM_A1  | DIMM_B1  | DIMM_C0  |
csrow7  | DIMM_A1  | DIMM_B1  | DIMM_C0  |
============================================

root@system~: dmidecode -t memory | grep 'Locator'
Locator: DIMMA1 <- THIS ONE - DIMM_A0
Bank Locator: P0_Node0_Channel0_Dimm0
Locator: DIMMA2
Bank Locator: P0_Node0_Channel0_Dimm1
Locator: DIMMB1
Bank Locator: P0_Node0_Channel1_Dimm0
Locator: DIMMB2
Bank Locator: P0_Node0_Channel1_Dimm1

EDIT: Moved the stick of RAM and got an error in another slot. Ordering a new stick of RAM.

/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow3/ch1_ce_count:1 <- NEW ERROR

EDIT2:

/sys/devices/system/edac/mc/mc0/csrow3/ch1_ce_count:2 <- NEW ERROR

Errors have slowed, but at least I have a second error now, so the fault followed the stick to its new slot. RAM incoming.

EDIT3: Replaced the defective DIMM, 4 days with no errors. Turned WSD back on. Watching logs and CPU usage. The wsdd restart script is still running.

EDIT4: No more memory errors, and WSD is running without eating a full CPU. Calling this solved.

Edited March 18, 2021 by Holmesware
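The EDAC lines in the post above can also be tallied straight from a saved syslog, which is handy when the sysfs counters have been reset by a reboot. The following is a rough sketch rather than a finished tool: the function name `tally_ce` is hypothetical, and the regular expression assumes the exact line layout shown in the post (other EDAC drivers may format the line differently).

```python
import re
from collections import Counter

# Assumption: correctable-error lines look like the ones quoted in the post:
# "EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 (csrow:1 channel:0 ...)"
EDAC_CE = re.compile(r"EDAC MC(\d+): (\d+) CE .*?csrow:(\d+) channel:(\d+)")


def tally_ce(lines):
    """Sum correctable errors per (memory controller, csrow, channel)."""
    counts = Counter()
    for line in lines:
        match = EDAC_CE.search(line)
        if match:
            mc, errors, csrow, channel = (int(g) for g in match.groups())
            counts[(mc, csrow, channel)] += errors
    return counts


if __name__ == "__main__":
    # Two of the lines from the post, used as sample input.
    sample = [
        "Mar 4 07:03:17 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 "
        "(csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)",
        "Mar 4 07:04:56 kernel: EDAC MC0: 1 CE ie31200 CE on mc#0csrow#1channel#0 "
        "(csrow:1 channel:0 page:0x0 offset:0x0 grain:8 syndrome:0x52)",
    ]
    for (mc, csrow, channel), total in sorted(tally_ce(sample).items()):
        print(f"mc{mc} csrow{csrow} ch{channel}: {total} correctable error(s)")
```

The keys line up with the sysfs paths in the post, so a tally under `(0, 1, 0)` corresponds to `mc0/csrow1/ch0_ce_count`. Mapping that to a physical slot still needs the chart and the dmidecode output.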