insomnus, posted May 20, 2019 (edited)

keywords: CPU stall, starved, jiffies!, rcu_sched, kthread, runc

Hi All,

SOS! My baby is slowly drifting into a coma! I am hoping for some help on this because I'm totally lost as to what is happening.

A few days ago, my server started stalling and eventually becoming completely unresponsive. Unkillable processes (dockerd/runc) choke up the machine until it can't even accept a reboot or halt command. CPU load averages shoot past even my machine's theoretical maximum:

top - 20:24:16 up 1:11, 2 users, load average: 29.05, 15.03, 6.99
Tasks: 480 total, 14 running, 465 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.0 us, 21.0 sy, 0.0 ni, 45.7 id, 33.3 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 64481.7 total, 42197.2 free, 2504.1 used, 19780.4 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 59583.6 avail Mem

  PID USER    PR NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
 5546 nobody  20  0       0      0      0 Z 199.7  0.0  8:00.90 Plex Tran+
18567 root    20  0 1539332 248236 126828 R  99.7  0.4  4:17.36 firefox
20585 root    20  0  281556   2788   2100 S  99.7  0.0  3:23.93 emhttpd
 6142 root    20  0    8876   3340   2332 R   0.7  0.0  0:00.31 top
   11 root    20  0       0      0      0 I   0.3  0.0  0:03.12 rcu_sched
  195 root    20  0       0      0      0 I   0.3  0.0  0:01.38 kworker/5+
  803 root    20  0       0      0      0 I   0.3  0.0  0:00.85 kworker/1+

Curiously, htop has stopped working: it runs but generates no output.

The problem briefly improved after upgrading to unRAID v6.7, but then returned. Currently, the server will run for about twenty minutes before entering a series of stalls until it freezes entirely. I can provoke stalls by having Plex transcode something heavy. However, the server will also eventually seize without this provocation.

I've installed mcelog and have collected a syslog with (I believe) mcelog's reports on CPU stalls and kernel traces. This log corresponds to successive stalls over a two-hour period until the server seized entirely. The server remained unusable throughout, would fail to soft reset or shut down, and ultimately had to be powered off manually.

Attachments:
CPU stalls for two hours and wont shut down doesn't shut down syslog.txt
Stalls from another shorter run.txt
Stalls triggered by Plex Transcoder Playback.txt

I run on a collection of used server gear: dual hexacore Intel Xeon E5645s in a Supermicro X8DTL-I mobo with 64 GB of ECC RAM and an LSI 9201-8I HBA flashed to IT mode. The array is 8 second-hand NAS drives (2x4gb dual parity, 6x3gb storage) and a pair of 650G flash drives for cache and VMs. It currently runs unRAID 6.7.0 stable.

The stalls appear in syslog as a series of entries that look like this:

May 17 22:23:41 hive kernel: rcu: INFO: rcu_sched self-detected stall on CPU
May 17 22:23:41 hive kernel: rcu: 8-....: (59998 ticks this GP) idle=9be/1/0x4000000000000002 softirq=448376/448376 fqs=14953
May 17 22:23:41 hive kernel: rcu: (t=60001 jiffies g=1081213 q=12137)
May 17 22:23:41 hive kernel: NMI backtrace for cpu 8
May 17 22:23:41 hive kernel: CPU: 8 PID: 28377 Comm: runc Not tainted 4.19.41-Unraid #1
May 17 22:23:41 hive kernel: Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1b 11/16/2012
May 17 22:23:41 hive kernel: Call Trace:
May 17 22:23:41 hive kernel: <IRQ>
May 17 22:23:41 hive kernel: dump_stack+0x5d/0x79
May 17 22:23:41 hive kernel: nmi_cpu_backtrace+0x71/0x83
May 17 22:23:41 hive kernel: ? lapic_can_unplug_cpu+0x8e/0x8e
May 17 22:23:41 hive kernel: nmi_trigger_cpumask_backtrace+0x57/0xd7
May 17 22:23:41 hive kernel: rcu_dump_cpu_stacks+0x91/0xbb
May 17 22:23:41 hive kernel: rcu_check_callbacks+0x28f/0x58e
May 17 22:23:41 hive kernel: ? tick_sched_handle.isra.5+0x2f/0x2f
May 17 22:23:41 hive kernel: update_process_times+0x23/0x45
May 17 22:23:41 hive kernel: tick_sched_timer+0x36/0x64
May 17 22:23:41 hive kernel: __hrtimer_run_queues+0xb1/0x105
May 17 22:23:41 hive kernel: hrtimer_interrupt+0xf4/0x20d
May 17 22:23:41 hive kernel: smp_apic_timer_interrupt+0x79/0x91
May 17 22:23:41 hive kernel: apic_timer_interrupt+0xf/0x20
May 17 22:23:41 hive kernel: </IRQ>
May 17 22:23:41 hive kernel: RIP: 0010:smp_call_function_single+0x7f/0xcc
May 17 22:23:41 hive kernel: Code: 72 0b 83 3d 7d 46 15 01 00 75 02 0f 0b 45 85 e4 48 89 e3 75 22 48 c7 c0 80 1c 02 00 65 48 03 05 1a 06 f6 7e 48 89 c3 8b 48 18 <80> e1 01 74 04 f3 90 eb f4 83 48 18 01 48 89 d1 44 89 c7 48 89 f2
May 17 22:23:41 hive kernel: RSP: 0018:ffffc90008adbc00 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
May 17 22:23:41 hive kernel: RAX: ffff88901f8a1c80 RBX: ffff88901f8a1c80 RCX: 0000000000000001
May 17 22:23:41 hive kernel: RDX: 0000000000000000 RSI: ffffffff810282dc RDI: ffffc90008adbc20
May 17 22:23:41 hive kernel: RBP: ffffc90008adbc58 R08: 0000000000000015 R09: ffff88881a8a0020
May 17 22:23:41 hive kernel: R10: ffff88901c2c6fc0 R11: ffffffffff8af02e R12: 0000000000000000
May 17 22:23:41 hive kernel: R13: ffffffff81160401 R14: ffff888819e2f200 R15: ffff8888046566b8
May 17 22:23:41 hive kernel: ? seq_hlist_next_rcu+0x8/0x11
May 17 22:23:41 hive kernel: ? cpu_show_l1tf+0xca/0xca
May 17 22:23:41 hive kernel: aperfmperf_snapshot_cpu+0x36/0x42
May 17 22:23:41 hive kernel: arch_freq_prepare_all+0x48/0x6c
May 17 22:23:41 hive kernel: ? seq_buf_alloc+0xd/0xd
May 17 22:23:41 hive kernel: cpuinfo_open+0x9/0x19
May 17 22:23:41 hive kernel: proc_reg_open+0x74/0xf5
May 17 22:23:41 hive kernel: ? proc_i_callback+0x13/0x13
May 17 22:23:41 hive kernel: do_dentry_open+0x197/0x2d2
May 17 22:23:41 hive kernel: path_openat+0xab2/0xc16
May 17 22:23:41 hive kernel: do_filp_open+0x4c/0xa9
May 17 22:23:41 hive kernel: do_sys_open+0x132/0x1ce
May 17 22:23:41 hive kernel: do_syscall_64+0x57/0xe6
May 17 22:23:41 hive kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
May 17 22:23:41 hive kernel: RIP: 0033:0x47aa1a
May 17 22:23:41 hive kernel: Code: e8 eb 81 fb ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 4c 8b 54 24 28 4c 8b 44 24 30 4c 8b 4c 24 38 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 40 ff ff ff ff 48 c7 44 24 48
May 17 22:23:41 hive kernel: RSP: 002b:000000c420053b80 EFLAGS: 00000206 ORIG_RAX: 0000000000000101
May 17 22:23:41 hive kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000047aa1a
May 17 22:23:41 hive kernel: RDX: 0000000000080000 RSI: 000000c4200883f0 RDI: ffffffffffffff9c
May 17 22:23:41 hive kernel: RBP: 000000c420053c00 R08: 0000000000000000 R09: 0000000000000000
May 17 22:23:41 hive kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000040
May 17 22:23:41 hive kernel: R13: 000000000000003f R14: 0000000000000200 R15: 0000000000000002

Please help!

Edited May 20, 2019 by insomnus: edited for clarity (formatting and attachments were janky)
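For anyone digging through the attached logs, here is a rough sketch of how the stall reports could be tallied by the task that was on the CPU when each stall was detected. It is not part of any Unraid tooling; the function names `find_stalls` and `stalls_by_command` are made up for illustration, and the regular expressions assume the exact message format seen in the 4.19 excerpt above, which other kernel versions may word differently.

```python
import re
from collections import Counter

# Assumption: the stall header and task line look like the excerpt above
# (kernel 4.19.41-Unraid). Other kernels may phrase these differently.
STALL_RE = re.compile(
    r"^(?P<time>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"rcu: INFO: rcu_sched self-detected stall on CPU"
)
TASK_RE = re.compile(
    r"kernel:\s+CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) "
    r"Comm: (?P<comm>.+?) (?:Not tainted|Tainted)"
)


def find_stalls(lines):
    """Return one dict per stall report: time, cpu, pid, comm.

    The task fields stay None if no 'CPU: ... PID: ... Comm: ...' line
    follows the stall header before the next header.
    """
    stalls = []
    pending = None  # stall header still waiting for its task line
    for line in lines:
        header = STALL_RE.search(line)
        if header:
            pending = {"time": header.group("time"),
                       "cpu": None, "pid": None, "comm": None}
            stalls.append(pending)
            continue
        task = TASK_RE.search(line)
        if task and pending is not None:
            pending["cpu"] = int(task.group("cpu"))
            pending["pid"] = int(task.group("pid"))
            pending["comm"] = task.group("comm")
            pending = None  # only the first task line belongs to this stall
    return stalls


def stalls_by_command(stalls):
    """Count stall reports per command name (e.g. 'runc')."""
    return Counter(s["comm"] for s in stalls)
```

To use it, pass an open log file to `find_stalls` (for example `find_stalls(open(path, errors="replace"))`, where `path` points at one of the attached syslog files) and feed the result to `stalls_by_command`. If most reports name the same command, that narrows down which workload is on the CPU when the stalls are detected.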