(SOLVED) CPU stalls since Linux kernel 5 was introduced


Scripter
Solved by Hoopster

Hi

 

I've had CPU stalls and similar issues since Linux kernel 5 was introduced, I believe. Sometimes my system stays online for weeks, and sometimes it hangs after only 24 hours. I believe it started when I went from 6.8.3 to 6.9.2, and I've kept upgrading in the hope that it would resolve itself.

I'm currently on 6.10 rc2.

My server runs 2 x Intel® Xeon® E5-2630 v4 CPUs @ 2.20GHz on a Z10PA-D8 motherboard with 64 GB of DDR4 ECC memory.

I have ~130 TB of storage in one array, with 2 x 14 TB parity disks.

Two 1 TB SSD cache disks.

I'm running the usual Docker containers: Plex, Radarr, Sonarr and so on. Plex always seems to be the affected process, though I'm not sure whether that has any bearing on the situation.

 

I've attached the diagnostics file to this post, and below are traces from the syslog from when the computer froze today. The syslog is stored on the flash drive (Syslog Server is enabled); that file is now 44 MB and contains a lot of non-anonymized data, which is why I'm not posting it here. I can post more from that syslog if needed.

 

Any help that narrows down the actual issue or solves it completely is appreciated.

 

Nov 21 18:36:16 mainframe kernel: ------------[ cut here ]------------
Nov 21 18:36:16 mainframe kernel: WARNING: CPU: 27 PID: 25636 at net/netfilter/nf_conntrack_core.c:1134 __nf_conntrack_confirm+0xa5/0x1f9 [nf_conntrack]
Nov 21 18:36:16 mainframe kernel: Modules linked in: xt_nat xt_tcpudp macvlan xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod nct6775 hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables igb input_leds cdc_acm led_class x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif kvm_intel kvm ast drm_vram_helper drm_ttm_helper ttm crct10dif_pclmul drm_kms_helper crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd drm mpt3sas rapl intel_cstate backlight i2c_i801 agpgart i2c_algo_bit i2c_smbus ahci syscopyarea raid_class sysfillrect sysimgblt intel_uncore megaraid_sas i2c_core libahci fb_sys_fops scsi_transport_sas wmi acpi_ipmi ipmi_si acpi_power_meter acpi_pad button [last unloaded: igb]
Nov 21 18:36:16 mainframe kernel: CPU: 27 PID: 25636 Comm: kworker/27:1 Not tainted 5.14.15-Unraid #1
Nov 21 18:36:16 mainframe kernel: Hardware name: ASUSTeK COMPUTER INC. Z10PA-D8 Series/Z10PA-D8 Series, BIOS 3801 08/23/2019
Nov 21 18:36:16 mainframe kernel: Workqueue: events macvlan_process_broadcast [macvlan]
Nov 21 18:36:16 mainframe kernel: RIP: 0010:__nf_conntrack_confirm+0xa5/0x1f9 [nf_conntrack]
Nov 21 18:36:16 mainframe kernel: Code: 49 89 c5 41 89 c6 e8 6f f6 ff ff 44 89 fa 44 89 ef 89 c6 41 89 c4 e8 87 f4 ff ff 84 c0 75 b7 48 8b 85 80 00 00 00 a8 08 74 1a <0f> 0b 44 89 ef 44 89 e6 45 31 ed e8 bb ed ff ff e8 a1 f0 ff ff e9
Nov 21 18:36:16 mainframe kernel: RSP: 0018:ffffc900068f0dc0 EFLAGS: 00010202
Nov 21 18:36:16 mainframe kernel: RAX: 0000000000000188 RBX: ffffffff8216b040 RCX: 0000000000000000
Nov 21 18:36:16 mainframe kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa0118064
Nov 21 18:36:16 mainframe kernel: RBP: ffff888190b7cf00 R08: 00000000b7959cd0 R09: ffff8881d73a24a0
Nov 21 18:36:16 mainframe kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000008f79
Nov 21 18:36:16 mainframe kernel: R13: 000000000000aa0b R14: 000000000000aa0b R15: 0000000000000000
Nov 21 18:36:16 mainframe kernel: FS:  0000000000000000(0000) GS:ffff88885fc40000(0000) knlGS:0000000000000000
Nov 21 18:36:16 mainframe kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 21 18:36:16 mainframe kernel: CR2: 00001498a6182000 CR3: 00000003c4a08006 CR4: 00000000003706e0
Nov 21 18:36:16 mainframe kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 21 18:36:16 mainframe kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 21 18:36:16 mainframe kernel: Call Trace:
Nov 21 18:36:16 mainframe kernel: <IRQ>
Nov 21 18:36:16 mainframe kernel: nf_conntrack_confirm+0x2f/0x36 [nf_conntrack]
Nov 21 18:36:16 mainframe kernel: nf_hook_slow+0x3e/0x93
Nov 21 18:36:16 mainframe kernel: ? ip_protocol_deliver_rcu+0x112/0x112
Nov 21 18:36:16 mainframe kernel: NF_HOOK.constprop.0+0x72/0xcd
Nov 21 18:36:16 mainframe kernel: ? ip_protocol_deliver_rcu+0x112/0x112
Nov 21 18:36:16 mainframe kernel: __netif_receive_skb_one_core+0x79/0x9a
Nov 21 18:36:16 mainframe kernel: process_backlog+0xab/0x143
Nov 21 18:36:16 mainframe kernel: __napi_poll.constprop.0+0x2a/0x114
Nov 21 18:36:16 mainframe kernel: net_rx_action+0xe8/0x1f2
Nov 21 18:36:16 mainframe kernel: __do_softirq+0xef/0x218
Nov 21 18:36:16 mainframe kernel: do_softirq+0x50/0x68
Nov 21 18:36:16 mainframe kernel: </IRQ>
Nov 21 18:36:16 mainframe kernel: netif_rx_ni+0x53/0x85
Nov 21 18:36:16 mainframe kernel: macvlan_broadcast+0x116/0x144 [macvlan]
Nov 21 18:36:16 mainframe kernel: macvlan_process_broadcast+0xc7/0x10b [macvlan]
Nov 21 18:36:16 mainframe kernel: process_one_work+0x193/0x26e
Nov 21 18:36:16 mainframe kernel: worker_thread+0x17c/0x247
Nov 21 18:36:16 mainframe kernel: ? rescuer_thread+0x285/0x285
Nov 21 18:36:16 mainframe kernel: kthread+0xde/0xe3
Nov 21 18:36:16 mainframe kernel: ? set_kthread_struct+0x32/0x32
Nov 21 18:36:16 mainframe kernel: ret_from_fork+0x22/0x30
Nov 21 18:36:16 mainframe kernel: ---[ end trace 65243d7d43630139 ]---
...
Nov 22 19:42:18 mainframe kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Nov 22 19:42:18 mainframe kernel: rcu: 	21-....: (59999 ticks this GP) idle=f86/1/0x4000000000000000 softirq=11354151/11354151 fqs=14994 
Nov 22 19:42:18 mainframe kernel: 	(t=60000 jiffies g=67977705 q=3175387)
Nov 22 19:42:18 mainframe kernel: NMI backtrace for cpu 21
Nov 22 19:42:18 mainframe kernel: CPU: 21 PID: 40424 Comm: Plex Media Serv Tainted: G        W         5.14.15-Unraid #1
Nov 22 19:42:18 mainframe kernel: Hardware name: ASUSTeK COMPUTER INC. Z10PA-D8 Series/Z10PA-D8 Series, BIOS 3801 08/23/2019
Nov 22 19:42:18 mainframe kernel: Call Trace:
Nov 22 19:42:18 mainframe kernel: <IRQ>
Nov 22 19:42:18 mainframe kernel: dump_stack_lvl+0x46/0x5a
Nov 22 19:42:18 mainframe kernel: ? lapic_can_unplug_cpu+0x93/0x93
Nov 22 19:42:18 mainframe kernel: nmi_cpu_backtrace+0x7d/0x8f
Nov 22 19:42:18 mainframe kernel: nmi_trigger_cpumask_backtrace+0x56/0xd3
Nov 22 19:42:18 mainframe kernel: rcu_dump_cpu_stacks+0xc3/0xea
Nov 22 19:42:18 mainframe kernel: rcu_sched_clock_irq+0x22e/0x608
Nov 22 19:42:18 mainframe kernel: ? _raw_spin_unlock_irqrestore+0xe/0x1b
Nov 22 19:42:18 mainframe kernel: ? tick_sched_do_timer+0x3e/0x3e
Nov 22 19:42:18 mainframe kernel: update_process_times+0x8c/0xab
Nov 22 19:42:18 mainframe kernel: tick_sched_timer+0x38/0x65
Nov 22 19:42:18 mainframe kernel: __hrtimer_run_queues+0xfa/0x18a
Nov 22 19:42:18 mainframe kernel: hrtimer_interrupt+0x92/0x160
Nov 22 19:42:18 mainframe kernel: __sysvec_apic_timer_interrupt+0x99/0xdb
Nov 22 19:42:18 mainframe kernel: sysvec_apic_timer_interrupt+0x61/0x7d
Nov 22 19:42:18 mainframe kernel: </IRQ>
Nov 22 19:42:18 mainframe kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Nov 22 19:42:18 mainframe kernel: RIP: 0010:xa_is_sibling+0x0/0x1a
Nov 22 19:42:18 mainframe kernel: Code: 4a 3c 00 48 83 c4 10 c3 c6 07 00 0f 1f 40 00 c3 48 89 f8 83 e0 03 48 83 f8 02 0f 94 c0 48 81 ff 00 10 00 00 0f 97 c2 21 d0 c3 <48> 89 f8 83 e0 03 48 83 f8 02 0f 94 c0 48 81 ff fd 00 00 00 0f 96
Nov 22 19:42:18 mainframe kernel: RSP: 0018:ffffc9000950fa60 EFLAGS: 00000202
Nov 22 19:42:18 mainframe kernel: RAX: 0000000000000038 RBX: ffffffffffffffff RCX: 0000000000000034
Nov 22 19:42:18 mainframe kernel: RDX: 0000000000000001 RSI: ffff888350fd0920 RDI: ffff88836d070db2
Nov 22 19:42:18 mainframe kernel: RBP: ffffffffffffffff R08: ffff88836d070db2 R09: ffffc9000950fab8
Nov 22 19:42:18 mainframe kernel: R10: ffffc9000950fab8 R11: ffffc9000950fab8 R12: ffff8888931196c8
Nov 22 19:42:18 mainframe kernel: R13: 0000000000000008 R14: ffffc9000950fb40 R15: 000000000000000f
Nov 22 19:42:18 mainframe kernel: xas_descend+0x2a/0x49
Nov 22 19:42:18 mainframe kernel: xas_load+0x2d/0x39
Nov 22 19:42:18 mainframe kernel: xas_find+0x58/0x11d
Nov 22 19:42:18 mainframe kernel: find_get_entry+0x20/0x81
Nov 22 19:42:18 mainframe kernel: find_get_entries+0x77/0xfe
Nov 22 19:42:18 mainframe kernel: invalidate_inode_pages2_range+0x71/0x29b
Nov 22 19:42:18 mainframe kernel: fuse_finish_open+0xbb/0xe1
Nov 22 19:42:18 mainframe kernel: fuse_open_common+0xa6/0xcd
Nov 22 19:42:18 mainframe kernel: ? fuse_open_common+0xcd/0xcd
Nov 22 19:42:18 mainframe kernel: do_dentry_open+0x157/0x288
Nov 22 19:42:18 mainframe kernel: path_openat+0x8a0/0x985
Nov 22 19:42:18 mainframe kernel: do_filp_open+0x53/0xb0
Nov 22 19:42:18 mainframe kernel: ? getname_flags+0x29/0x150
Nov 22 19:42:18 mainframe kernel: ? kmem_cache_alloc+0x100/0x176
Nov 22 19:42:18 mainframe kernel: do_sys_openat2+0x72/0xde
Nov 22 19:42:18 mainframe kernel: do_sys_open+0x3b/0x58
Nov 22 19:42:18 mainframe kernel: do_syscall_64+0x83/0xa5
Nov 22 19:42:18 mainframe kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
Nov 22 19:42:18 mainframe kernel: RIP: 0033:0x1482ecfc7739
Nov 22 19:42:18 mainframe kernel: Code: c0 0f 85 24 00 00 00 49 89 fb 48 89 f0 48 89 d7 48 89 ce 4c 89 c2 4d 89 ca 4c 8b 44 24 08 4c 8b 4c 24 10 4c 89 5c 24 08 0f 05 <c3> e9 d5 cb ff ff 41 57 41 56 53 48 81 ec 90 00 00 00 49 89 f6 b8
Nov 22 19:42:18 mainframe kernel: RSP: 002b:00001482e8a05628 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
Nov 22 19:42:18 mainframe kernel: RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00001482ecfc7739
Nov 22 19:42:18 mainframe kernel: RDX: 0000000000000000 RSI: 0000000000088000 RDI: 00001482de8cd0a0
Nov 22 19:42:18 mainframe kernel: RBP: 00001482e8a05a00 R08: 0000000000000000 R09: 0000000000000000
Nov 22 19:42:18 mainframe kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00001482e8a07b38
Nov 22 19:42:18 mainframe kernel: R13: 00001482ed202276 R14: 0000000000000000 R15: 00001482e8a07b74

 

mainframe-diagnostics-20211122-2040.zip

Edited by Scripter
10 minutes ago, Scripter said:

I've attached the diagnostics file to this post, and below are traces from the syslog from when the computer froze today

Your first call trace is the macvlan broadcast call trace that occurs when Docker containers are assigned custom IP addresses on br0. This is well documented in this thread. That particular call trace does not always cause an immediate lockup, but eventually these call traces will lock up the server.
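
If the full syslog is too large or too private to post, a small script can pull out just the kernel WARNING blocks that involve macvlan. The following is only a rough sketch: it assumes the blocks are delimited by the "cut here" and "end trace" lines seen in the excerpt above, and that frames from the macvlan module carry a "[macvlan]" suffix. The function name is made up for illustration.

```python
# Markers the kernel prints around a WARNING block, as seen in the excerpt.
BLOCK_START = "[ cut here ]"
BLOCK_END = "[ end trace"


def find_macvlan_traces(lines):
    """Return the WARNING blocks whose frames reference the macvlan module."""
    hits = []
    block = None  # lines of the block currently being collected, if any
    for line in lines:
        if BLOCK_START in line:
            block = [line]
        elif block is not None:
            block.append(line)
            if BLOCK_END in line:
                # Frames that belong to a module end with "[module]".
                # The "Modules linked in:" line lists plain names, so it
                # does not match this bracketed form.
                if any("[macvlan]" in entry for entry in block):
                    hits.append(block)
                block = None
    return hits
```

Passing an open syslog file (or any list of lines) to `find_macvlan_traces` returns one list of lines per matching block, so the count of blocks and their timestamps can be shared without posting the whole file.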

Edited by Hoopster
  • Solution
14 minutes ago, Scripter said:

I'm currently on 6.10 rc2.

Version 6.10 allows the Docker custom network type to be set to ipvlan instead of macvlan. This setting was introduced to try to work around the macvlan call trace issue. For some it has helped.

 

image.png.dda939842b2ee920268926893c1bba7a.png

 

Since I implemented a docker VLAN on my router and switch, the problem has disappeared for me.
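
After switching the custom network type, it can be worth confirming which driver each Docker network actually uses. Below is a minimal sketch that shells out to `docker network ls` with a `--format` template and reports any network still on macvlan. The helper names are hypothetical, and it assumes the `docker` CLI is available on the server's PATH.

```python
import subprocess


def parse_network_drivers(output):
    """Map network name -> driver from 'name<TAB>driver' lines."""
    drivers = {}
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, driver = line.partition("\t")
        drivers[name.strip()] = driver.strip()
    return drivers


def macvlan_networks(drivers):
    """Return the sorted names of networks that use the macvlan driver."""
    return sorted(name for name, drv in drivers.items() if drv == "macvlan")


def list_network_drivers():
    """Ask the Docker CLI for every network and its driver."""
    result = subprocess.run(
        ["docker", "network", "ls", "--format", "{{.Name}}\t{{.Driver}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_network_drivers(result.stdout)
```

Calling `macvlan_networks(list_network_drivers())` on the server should return an empty list once every custom network has been recreated as ipvlan. Existing networks may need the Docker service to be restarted before the change shows up.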

  • 5 weeks later...
  • Scripter changed the title to (SOLVED) CPU stalls since Linux kernel 5 was introduced
