Server slipping into a coma (multiple CPU stalls lead to server freeze)



keywords: CPU stall, starved, jiffies!, rcu_sched, kthread, runc

Hi All, 

 

SOS! My baby is slowly drifting into a coma!

 

I am hoping for some help on this because I'm totally lost as to what is happening.


A few days ago, my server started stalling and eventually becoming completely unresponsive. Unkillable processes (dockerd/runc) choke up the machine until it can't even accept a reboot or halt command. CPU load averages shoot past even my machine's theoretical maximum.

 

 
 
 
Quote

top - 20:24:16 up  1:11,  2 users,  load average: 29.05, 15.03, 6.99
Tasks: 480 total,  14 running, 465 sleeping,   0 stopped,   1 zombie
%Cpu(s):  0.0 us, 21.0 sy,  0.0 ni, 45.7 id, 33.3 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64481.7 total,  42197.2 free,   2504.1 used,  19780.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  59583.6 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 5546 nobody    20   0       0      0      0 Z 199.7   0.0   8:00.90 Plex Tran+
18567 root      20   0 1539332 248236 126828 R  99.7   0.4   4:17.36 firefox
20585 root      20   0  281556   2788   2100 S  99.7   0.0   3:23.93 emhttpd
 6142 root      20   0    8876   3340   2332 R   0.7   0.0   0:00.31 top
   11 root      20   0       0      0      0 I   0.3   0.0   0:03.12 rcu_sched
  195 root      20   0       0      0      0 I   0.3   0.0   0:01.38 kworker/5+
  803 root      20   0       0      0      0 I   0.3   0.0   0:00.85 kworker/1+
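For anyone comparing the load average in this top output against the CPU count, here is a minimal Python sketch. It assumes 24 logical CPUs (two 6-core E5645s with Hyper-Threading enabled, which is an assumption and not stated in the post); the function names are made up for illustration. Note that on Linux the load average also counts tasks in uninterruptible sleep, so a value above the CPU count can point to blocked tasks as well as CPU saturation.

```python
import re

def parse_load_averages(top_header):
    """Extract the 1-, 5-, and 15-minute load averages from top's first line."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", top_header
    )
    if match is None:
        raise ValueError("no load average found in header")
    return tuple(float(value) for value in match.groups())

def load_per_cpu(load, logical_cpus):
    """Normalize a load average by the number of logical CPUs."""
    return load / logical_cpus

# Header line copied from the top output in this post.
header = "top - 20:24:16 up  1:11,  2 users,  load average: 29.05, 15.03, 6.99"

# Assumption: 2 sockets x 6 cores x 2 threads (Hyper-Threading on).
LOGICAL_CPUS = 24

one_min, five_min, fifteen_min = parse_load_averages(header)
print(f"1-min load per logical CPU: {load_per_cpu(one_min, LOGICAL_CPUS):.2f}")
```

A ratio above 1.0 means more tasks were runnable or blocked than there are logical CPUs to serve them.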

 

 


Curiously htop has stopped working -- it runs but generates no output.

The problem briefly improved after upgrading to unRAID v6.7 but then returned.

 

Currently, the server will run for about twenty minutes before entering a series of stalls until it freezes entirely.

I can provoke stalls by having plex transcode something heavy. However, the server will also eventually seize without this provocation.

I've installed mcelog and have collected a syslog with (I believe) mcelog's reports on CPU stalls and kernel traces. This log corresponds to successive stalls over a two hour period until it seized entirely. The server remained unusable throughout, would fail to soft reset or shutdown, and ultimately had to be powered off manually.

 

CPU stalls for two hours and wont shut down doesn't shut down syslog.txt

Stalls from another shorter run.txt

Stalls triggered by Plex Transcoder Playback.txt

 

I run on a collection of used server gear: dual hexacore Intel Xeon E5645s in a Supermicro X8DTL-I mobo with 64 GB of ECC RAM and an LSI 9201-8I HBA flashed to IT mode. The array is 8 second-hand NAS drives (2x4gb dual parity, 6x3gb storage) plus a pair of 650G flash drives for cache and VMs. It currently runs unRAID 6.7.0 stable.

The stalls appear in syslog as a series of entries that look like this:

Quote

May 17 22:23:41 hive kernel: rcu: INFO: rcu_sched self-detected stall on CPU
May 17 22:23:41 hive kernel: rcu:     8-....: (59998 ticks this GP) idle=9be/1/0x4000000000000002 softirq=448376/448376 fqs=14953 
May 17 22:23:41 hive kernel: rcu:      (t=60001 jiffies g=1081213 q=12137)
May 17 22:23:41 hive kernel: NMI backtrace for cpu 8
May 17 22:23:41 hive kernel: CPU: 8 PID: 28377 Comm: runc Not tainted 4.19.41-Unraid #1
May 17 22:23:41 hive kernel: Hardware name: Supermicro X8DTL/X8DTL, BIOS 2.1b       11/16/2012
May 17 22:23:41 hive kernel: Call Trace:
May 17 22:23:41 hive kernel: <IRQ>
May 17 22:23:41 hive kernel: dump_stack+0x5d/0x79
May 17 22:23:41 hive kernel: nmi_cpu_backtrace+0x71/0x83
May 17 22:23:41 hive kernel: ? lapic_can_unplug_cpu+0x8e/0x8e
May 17 22:23:41 hive kernel: nmi_trigger_cpumask_backtrace+0x57/0xd7
May 17 22:23:41 hive kernel: rcu_dump_cpu_stacks+0x91/0xbb
May 17 22:23:41 hive kernel: rcu_check_callbacks+0x28f/0x58e
May 17 22:23:41 hive kernel: ? tick_sched_handle.isra.5+0x2f/0x2f
May 17 22:23:41 hive kernel: update_process_times+0x23/0x45
May 17 22:23:41 hive kernel: tick_sched_timer+0x36/0x64
May 17 22:23:41 hive kernel: __hrtimer_run_queues+0xb1/0x105
May 17 22:23:41 hive kernel: hrtimer_interrupt+0xf4/0x20d
May 17 22:23:41 hive kernel: smp_apic_timer_interrupt+0x79/0x91
May 17 22:23:41 hive kernel: apic_timer_interrupt+0xf/0x20
May 17 22:23:41 hive kernel: </IRQ>
May 17 22:23:41 hive kernel: RIP: 0010:smp_call_function_single+0x7f/0xcc
May 17 22:23:41 hive kernel: Code: 72 0b 83 3d 7d 46 15 01 00 75 02 0f 0b 45 85 e4 48 89 e3 75 22 48 c7 c0 80 1c 02 00 65 48 03 05 1a 06 f6 7e 48 89 c3 8b 48 18 <80> e1 01 74 04 f3 90 eb f4 83 48 18 01 48 89 d1 44 89 c7 48 89 f2
May 17 22:23:41 hive kernel: RSP: 0018:ffffc90008adbc00 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
May 17 22:23:41 hive kernel: RAX: ffff88901f8a1c80 RBX: ffff88901f8a1c80 RCX: 0000000000000001
May 17 22:23:41 hive kernel: RDX: 0000000000000000 RSI: ffffffff810282dc RDI: ffffc90008adbc20
May 17 22:23:41 hive kernel: RBP: ffffc90008adbc58 R08: 0000000000000015 R09: ffff88881a8a0020
May 17 22:23:41 hive kernel: R10: ffff88901c2c6fc0 R11: ffffffffff8af02e R12: 0000000000000000
May 17 22:23:41 hive kernel: R13: ffffffff81160401 R14: ffff888819e2f200 R15: ffff8888046566b8
May 17 22:23:41 hive kernel: ? seq_hlist_next_rcu+0x8/0x11
May 17 22:23:41 hive kernel: ? cpu_show_l1tf+0xca/0xca
May 17 22:23:41 hive kernel: aperfmperf_snapshot_cpu+0x36/0x42
May 17 22:23:41 hive kernel: arch_freq_prepare_all+0x48/0x6c
May 17 22:23:41 hive kernel: ? seq_buf_alloc+0xd/0xd
May 17 22:23:41 hive kernel: cpuinfo_open+0x9/0x19
May 17 22:23:41 hive kernel: proc_reg_open+0x74/0xf5
May 17 22:23:41 hive kernel: ? proc_i_callback+0x13/0x13
May 17 22:23:41 hive kernel: do_dentry_open+0x197/0x2d2
May 17 22:23:41 hive kernel: path_openat+0xab2/0xc16
May 17 22:23:41 hive kernel: do_filp_open+0x4c/0xa9
May 17 22:23:41 hive kernel: do_sys_open+0x132/0x1ce
May 17 22:23:41 hive kernel: do_syscall_64+0x57/0xe6
May 17 22:23:41 hive kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
May 17 22:23:41 hive kernel: RIP: 0033:0x47aa1a
May 17 22:23:41 hive kernel: Code: e8 eb 81 fb ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 4c 8b 54 24 28 4c 8b 44 24 30 4c 8b 4c 24 38 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 40 ff ff ff ff 48 c7 44 24 48
May 17 22:23:41 hive kernel: RSP: 002b:000000c420053b80 EFLAGS: 00000206 ORIG_RAX: 0000000000000101
May 17 22:23:41 hive kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000047aa1a
May 17 22:23:41 hive kernel: RDX: 0000000000080000 RSI: 000000c4200883f0 RDI: ffffffffffffff9c
May 17 22:23:41 hive kernel: RBP: 000000c420053c00 R08: 0000000000000000 R09: 0000000000000000
May 17 22:23:41 hive kernel: R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000040
May 17 22:23:41 hive kernel: R13: 000000000000003f R14: 0000000000000200 R15: 0000000000000002
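Since the attached logs contain many of these reports, a small script can tally which CPU and command each stall was caught on. The following is a rough Python sketch, not a definitive tool: it assumes every stall report has the same shape as the one quoted above (a "self-detected stall" line followed later by a "CPU: N PID: P Comm: NAME" line), and the helper name `summarize_stalls` is made up for illustration.

```python
import re
from collections import Counter

# Marks the start of a stall report, as seen in the quoted syslog.
STALL_RE = re.compile(r"rcu: INFO: rcu_sched self-detected stall on CPU")
# The command name ends where the kernel's taint status begins.
CPU_RE = re.compile(
    r"kernel: CPU: (\d+) PID: (\d+) Comm: (.+?) (?:Not tainted|Tainted:)"
)

def summarize_stalls(lines):
    """Count stall reports per (cpu, command).

    Each stall header is paired with the next 'CPU: ... Comm: ...' line.
    """
    counts = Counter()
    awaiting_cpu_line = False
    for line in lines:
        if STALL_RE.search(line):
            awaiting_cpu_line = True
            continue
        if awaiting_cpu_line:
            match = CPU_RE.search(line)
            if match:
                cpu, _pid, comm = match.groups()
                counts[(int(cpu), comm)] += 1
                awaiting_cpu_line = False
    return counts

# Three lines copied from the report in this post.
sample = [
    "May 17 22:23:41 hive kernel: rcu: INFO: rcu_sched self-detected stall on CPU",
    "May 17 22:23:41 hive kernel: NMI backtrace for cpu 8",
    "May 17 22:23:41 hive kernel: CPU: 8 PID: 28377 Comm: runc Not tainted 4.19.41-Unraid #1",
]

for (cpu, comm), count in summarize_stalls(sample).items():
    print(f"CPU {cpu} {comm}: {count} stall(s)")
```

To run it over a full log, pass an open file instead of `sample`, for example `summarize_stalls(open("syslog.txt"))`, where the file name is a placeholder.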

 

 

Please help!

 

 

Edited by insomnus
edited for clarity: formatting and attachments were janky