super high sysload, trace CPU error


drawde


I have a small script that alerts me when my CPU load is higher than normal. It alerted me that my system load was at 39. After logging in to take a look, I confirmed the load had spiked to around 39 and was staying there. Output from top wasn't showing much going on, so I don't think I had any rogue apps. Dockers appear to be running. I'm not sure about other parts of the GUI, but after I tried to stop the array it's basically stuck.
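For anyone wanting a similar alert, here is a minimal sketch of the idea. It is not the actual script from this post: the function name `load_is_high` and the threshold value are assumptions for illustration, and the real alerting step (email, push notification, etc.) is left out.

```python
import os


def load_is_high(threshold, loadavg=None):
    """Return True when the 1-minute load average exceeds `threshold`.

    `loadavg` is a (1 min, 5 min, 15 min) tuple; when omitted it is read
    from the running system with os.getloadavg().
    """
    if loadavg is None:
        loadavg = os.getloadavg()
    return loadavg[0] > threshold


if __name__ == "__main__":
    # Hypothetical threshold; pick one that suits the core count of the box.
    if load_is_high(8.0):
        print("load average is above the threshold:", os.getloadavg())
```

Run from cron every few minutes, something like this would have flagged the load averages of 41.07, 40.68, 37.77 shown in the top output.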

 

I saw the following error in my syslog:

 

Dec 22 02:08:03 Tower kernel: general protection fault: 0000 [#1] PREEMPT SMP
Dec 22 02:08:03 Tower kernel: Modules linked in: md_mod xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat kvm_amd kvm r8169 ahci mii libahci mvsas libsas scsi_transport_sas sata_sil24 wmi k10temp asus_atk0110 sata_sil i2c_piix4 pata_atiixp i2c_core acpi_cpufreq [last unloaded: md_mod]
Dec 22 02:08:03 Tower kernel: CPU: 0 PID: 556 Comm: kswapd0 Not tainted 4.4.30-unRAID #2
Dec 22 02:08:03 Tower kernel: Hardware name: System manufacturer System Product Name/M4A88T-M, BIOS 2403    12/23/2010
Dec 22 02:08:03 Tower kernel: task: ffff88040da8e0c0 ti: ffff8800ca998000 task.ti: ffff8800ca998000
Dec 22 02:08:03 Tower kernel: RIP: 0010:[<ffffffff8111dfaf>]  [<ffffffff8111dfaf>] __destroy_inode+0xcc/0x11b
Dec 22 02:08:03 Tower kernel: RSP: 0018:ffff8800ca99bbf0  EFLAGS: 00010206
Dec 22 02:08:03 Tower kernel: RAX: 0000ffffffffffff RBX: ffff88001405d8f0 RCX: 0000000000000000
Dec 22 02:08:03 Tower kernel: RDX: 0000000000000001 RSI: ffff88001405d970 RDI: 0001000000000000
Dec 22 02:08:03 Tower kernel: RBP: ffff8800ca99bbf8 R08: ffff88041ffcc2a0 R09: 0000000000000003
Dec 22 02:08:03 Tower kernel: R10: ffffea0002e165c0 R11: 0000000000000000 R12: ffff88001405d970
Dec 22 02:08:03 Tower kernel: R13: ffffffff81667280 R14: ffff8800ca99bd20 R15: 000000000000002b
Dec 22 02:08:03 Tower kernel: FS:  00002b3df7176700(0000) GS:ffff88041fc00000(0000) knlGS:0000000000000000
Dec 22 02:08:03 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Dec 22 02:08:03 Tower kernel: CR2: 00002b5bccef5000 CR3: 000000025de7f000 CR4: 00000000000006f0
Dec 22 02:08:03 Tower kernel: Stack:
Dec 22 02:08:03 Tower kernel: ffff88001405d8f0 ffff8800ca99bc10 ffffffff8111e5aa ffff88001405d8f0
Dec 22 02:08:03 Tower kernel: ffff8800ca99bc38 ffffffff8111e735 ffff8800ca99bc68 ffff8800a46ddcb0
Dec 22 02:08:03 Tower kernel: 0000000000000000 ffff8800ca99bc50 ffffffff8111e76d ffff8800ca99bc68
Dec 22 02:08:03 Tower kernel: Call Trace:
Dec 22 02:08:03 Tower kernel: [<ffffffff8111e5aa>] destroy_inode+0x1f/0x4d
Dec 22 02:08:03 Tower kernel: [<ffffffff8111e735>] evict+0x15d/0x164
Dec 22 02:08:03 Tower kernel: [<ffffffff8111e76d>] dispose_list+0x31/0x3b
Dec 22 02:08:03 Tower kernel: [<ffffffff8111f8bd>] prune_icache_sb+0x45/0x50
Dec 22 02:08:03 Tower kernel: [<ffffffff8110cf3c>] super_cache_scan+0x12a/0x174
Dec 22 02:08:03 Tower kernel: [<ffffffff810c6d52>] shrink_slab.part.6+0x190/0x20b
Dec 22 02:08:03 Tower kernel: [<ffffffff810c91d8>] shrink_zone+0x17c/0x265
Dec 22 02:08:03 Tower kernel: [<ffffffff810c9f5f>] kswapd+0x5bc/0x75d
Dec 22 02:08:03 Tower kernel: [<ffffffff81062f01>] ? finish_task_switch+0xee/0x1b5
Dec 22 02:08:03 Tower kernel: [<ffffffff810c99a3>] ? mem_cgroup_shrink_node_zone+0xae/0xae
Dec 22 02:08:03 Tower kernel: [<ffffffff8105fb24>] kthread+0xcd/0xd5
Dec 22 02:08:03 Tower kernel: [<ffffffff8105fa57>] ? kthread_worker_fn+0x137/0x137
Dec 22 02:08:03 Tower kernel: [<ffffffff81629f7f>] ret_from_fork+0x3f/0x70
Dec 22 02:08:03 Tower kernel: [<ffffffff8105fa57>] ? kthread_worker_fn+0x137/0x137
Dec 22 02:08:03 Tower kernel: Code: 48 c7 c7 9f 09 79 81 e8 c3 c5 f2 ff 48 8b 43 28 f0 48 ff 88 f0 04 00 00 48 8b 7b 10 48 8d 47 ff 48 83 f8 fd 77 0a 48 85 ff 74 05 <f0> ff 0f 74 38 48 8b 7b 18 48 8d 47 ff 48 83 f8 fd 77 0a 48 85
Dec 22 02:08:03 Tower kernel: RIP  [<ffffffff8111dfaf>] __destroy_inode+0xcc/0x11b
Dec 22 02:08:03 Tower kernel: RSP <ffff8800ca99bbf0>
Dec 22 02:08:03 Tower kernel: ---[ end trace 224c26f716313710 ]---
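When posting a trace like this, it can help to pull out the few fields people usually look at first: the faulting process, the function at RIP, and the call chain. The sketch below does that for syslog lines in the format shown above. The function name `summarize_oops` and the regular expressions are my own assumptions, written against this one trace, so other kernel versions may format things differently. Frames marked with `?` are skipped because the kernel itself flags them as unreliable.

```python
import re

PID_COMM = re.compile(r"PID: (\d+) Comm: (\S+)")
# Matches the final "RIP  [<addr>] symbol" line, not the "RIP: 0010:..." one.
RIP = re.compile(r"RIP\s+\[<[0-9a-f]+>\]\s+(\S+)")
FRAME = re.compile(r"\[<[0-9a-f]+>\] (\? )?([\w.]+)\+0x")


def summarize_oops(lines):
    """Extract PID, command, RIP symbol and reliable call-trace frames."""
    summary = {"pid": None, "comm": None, "rip": None, "frames": []}
    in_trace = False
    for line in lines:
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            m = FRAME.search(line)
            if m:
                if not m.group(1):  # skip "?" (unreliable) frames
                    summary["frames"].append(m.group(2))
                continue
            in_trace = False  # first non-frame line ends the trace
        m = PID_COMM.search(line)
        if m:
            summary["pid"], summary["comm"] = int(m.group(1)), m.group(2)
        m = RIP.search(line)
        if m:
            summary["rip"] = m.group(1)
    return summary
```

Fed the trace above, this should report PID 556 (`kswapd0`) faulting in `__destroy_inode`, reached from `kswapd` through the slab shrinker and inode cache pruning.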

 

top - 12:00:58 up 5 days, 11:12,  4 users,  load average: 41.07, 40.68, 37.77
Tasks: 970 total,   1 running, 442 sleeping,   5 stopped, 522 zombie
%Cpu(s):  3.1 us,  1.2 sy,  0.0 ni, 94.9 id,  0.8 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 16178504 total,   414596 free,  1951004 used, 13812904 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13126372 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
17454 nobody    20   0 1484124 621224  23768 S   6.9  3.8  38:04.69 mono
12839 root      20   0   25676   3880   2344 R   1.6  0.0   0:00.10 top
3723 sshd      20   0  323820  22988  13892 S   0.7  0.1   0:09.24 apache2
14883 root      20   0 1759292  40980  21516 S   0.7  0.3   8:20.57 docker
17097 nobody    20   0  163116  50112   3796 S   0.7  0.3  10:19.11 python
17155 nobody    20   0 1185268  54312   8752 S   0.7  0.3  11:49.91 kodi.bin
18700 sshd      20   0  323808  23028  13820 S   0.7  0.1   0:09.33 apache2
1586 root      20   0    9680   2512   2060 S   0.3  0.0  35:03.41 cpuload
8097 root      20   0       0      0      0 S   0.3  0.0   1:34.62 kworker/2:0
18530 nobody    20   0  848256 188260   3472 S   0.3  1.2   4:41.06 mysqld
18666 nobody    20   0  184664  79180   2828 S   0.3  0.5  11:18.32 python
18697 sshd      20   0  324644  22820  13276 S   0.3  0.1   0:09.22 apache2
20012 nobody    35  15 1774328  68288  11236 S   0.3  0.4   3:01.34 Plex Script Hos
20080 nobody    20   0  250804  52960  17440 S   0.3  0.3   3:32.61 Plex DLNA Serve
31704 root      20   0       0      0      0 S   0.3  0.0   0:02.01 kworker/u12:1
    1 root      20   0    4372   1556   1456 S   0.0  0.0   0:14.09 init
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.24 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:23.78 ksoftirqd/0
    7 root      20   0       0      0      0 S   0.0  0.0  22:51.31 rcu_preempt
    8 root      20   0       0      0      0 S   0.0  0.0   0:00.01 rcu_sched
    9 root      20   0       0      0      0 S   0.0  0.0   0:00.04 rcu_bh
   10 root      rt   0       0      0      0 S   0.0  0.0   0:18.79 migration/0
   11 root      rt   0       0      0      0 S   0.0  0.0   0:18.55 migration/1
   12 root      20   0       0      0      0 S   0.0  0.0   0:24.82 ksoftirqd/1
   15 root      rt   0       0      0      0 S   0.0  0.0   0:18.61 migration/2
   16 root      20   0       0      0      0 S   0.0  0.0   0:57.49 ksoftirqd/2
   19 root      rt   0       0      0      0 S   0.0  0.0   0:17.74 migration/3
   20 root      20   0       0      0      0 S   0.0  0.0  16:59.53 ksoftirqd/3
   22 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 kworker/3:0H
   23 root      20   0       0      0      0 S   0.0  0.0   0:00.01 kdevtmpfs
   24 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 netns
   27 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 perf
  273 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 writeback
  275 root      25   5       0      0      0 S   0.0  0.0   0:00.00 ksmd
  276 root      39  19       0      0      0 S   0.0  0.0   1:06.93 khugepaged
  277 root       0 -20       0      0      0 S   0.0  0.0   0:00.00 crypto

 

I'm also monitoring temps via SNMP, and for a CPU load that high my temps are perfect, maybe even lower than at normal idle. I tried to shut down gracefully and it's not doing anything. I'm remote, so I'm waiting for someone to get there and force a reboot.
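One possible explanation for a high load with an idle, cool CPU: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so tasks stuck waiting in the kernel raise the load without using any CPU. Whether that is what happened here is only an assumption. A minimal sketch for checking it is to tally process states from `/proc`; the helper names below are hypothetical.

```python
import glob
from collections import Counter


def state_of(stat_text):
    """Return the one-letter state from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the last ')' rather than on whitespace.
    """
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_states(stat_texts):
    """Count how many processes are in each state (R, S, D, Z, T, ...)."""
    return Counter(state_of(text) for text in stat_texts)


def read_proc_stats():
    """Yield the contents of every /proc/<pid>/stat file that is readable."""
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path, errors="replace") as f:
                yield f.read()
        except OSError:
            continue  # process exited between listing and reading


if __name__ == "__main__":
    print(count_states(read_proc_stats()))
```

A large `D` count alongside a load near that number would point to stuck kernel waits rather than real CPU work. The `Z` count would correspond to the zombies that top reports, which do not add to the load average.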

MobaXterm_tower_20161222_114225.zip


That's a GPF (general protection fault), something I haven't seen in a while.  Unfortunately, there isn't a good clue evident as to the cause.  Check for a motherboard BIOS update; yours is from 2010.

 

I have seen another user with issues, also involving mono.  Using PhAzE's plugins?  I believe he has compiled a newer version of mono recently, but I don't think he believes there's anything wrong with mono.

 

You also have "PCIe ACS overrides enabled", which may allow you to do things you couldn't otherwise, but also carries some risk.  I don't know enough to say whether it could allow a GPF.

 

Or it could be just a random event: a memory fault, an overheating event, or an unknown and unknowable cosmic event.


Is there something in the syslog error that leads you to believe it's mono, or is it because of the top output? I think Sonarr uses mono, but I'm not sure that's what caused the issue. I think it just happened to be doing something when I copied that top output.

 

For now I'll keep an eye on it. I don't think it was an overheating event, as SNMP was still reporting and my temps were OK. If it happens again I'll run memtest. I did recently add 2 new drives (within the last week). I precleared both drives 3x without much issue, so hopefully that's not it.

