November 30, 2021

So I've been dealing with this for about a month. I've tried everything I could find through research to fix the problem, but I'm not left with many options.

The issue: After a manual physical reset, the server boots as normal and everything comes online. Everything works: all UIs, all Dockers, all shares. Then after a varying amount of time (sometimes 12 hours, sometimes an hour) everything becomes unreachable, even in Safe Mode. The server itself shows something similar to this every time and is unresponsive at the console.

https://cdn.discordapp.com/attachments/585542638854471702/914953789029638194/PXL_20211127_022300451.jpg

Diagnostics from while the server is running, if that helps.

I kept the log open and was able to catch this before it cleared.

Nov 29 15:42:42 Tower kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Nov 29 15:42:42 Tower kernel: rcu: 21-....: (240004 ticks this GP) idle=c46/1/0x4000000000000000 softirq=3856062/3856062 fqs=52036
Nov 29 15:42:42 Tower kernel: (t=240005 jiffies g=7800977 q=1908501)
Nov 29 15:42:42 Tower kernel: NMI backtrace for cpu 21
Nov 29 15:42:42 Tower kernel: CPU: 21 PID: 29147 Comm: kworker/u256:17 Tainted: G W 5.10.28-Unraid #1
Nov 29 15:42:42 Tower kernel: Hardware name: ASUS System Product Name/ROG ZENITH II EXTREME ALPHA, BIOS 1502 07/13/2021
Nov 29 15:42:42 Tower kernel: Workqueue: events_power_efficient gc_worker [nf_conntrack]
Nov 29 15:42:42 Tower kernel: Call Trace:
Nov 29 15:42:42 Tower kernel: <IRQ>
Nov 29 15:42:42 Tower kernel: dump_stack+0x6b/0x83
Nov 29 15:42:42 Tower kernel: ? lapic_can_unplug_cpu+0x8e/0x8e
Nov 29 15:42:42 Tower kernel: nmi_cpu_backtrace+0x7d/0x8f
Nov 29 15:42:42 Tower kernel: nmi_trigger_cpumask_backtrace+0x56/0xd3
Nov 29 15:42:42 Tower kernel: rcu_dump_cpu_stacks+0x9f/0xc6
Nov 29 15:42:42 Tower kernel: rcu_sched_clock_irq+0x1ec/0x543
Nov 29 15:42:42 Tower kernel: ? trigger_load_balance+0x5a/0x1ca
Nov 29 15:42:42 Tower kernel: update_process_times+0x50/0x6e
Nov 29 15:42:42 Tower kernel: tick_sched_timer+0x36/0x64
Nov 29 15:42:42 Tower kernel: __hrtimer_run_queues+0xb7/0x10b
Nov 29 15:42:42 Tower kernel: ? tick_sched_do_timer+0x39/0x39
Nov 29 15:42:42 Tower kernel: hrtimer_interrupt+0x8d/0x15b
Nov 29 15:42:42 Tower kernel: __sysvec_apic_timer_interrupt+0x5d/0x68
Nov 29 15:42:42 Tower kernel: asm_call_irq_on_stack+0x12/0x20
Nov 29 15:42:42 Tower kernel: </IRQ>
Nov 29 15:42:42 Tower kernel: sysvec_apic_timer_interrupt+0x71/0x95
Nov 29 15:42:42 Tower kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Nov 29 15:42:42 Tower kernel: RIP: 0010:gc_worker+0x9a/0x240 [nf_conntrack]

Any assistance would be appreciated.

Edited November 30, 2021 by madpuma13
November 30, 2021 Author

I've never come across this before. I'll give it a shot. Is it common for this to start at random times? My server was up for about a year without any crashes until now.
November 30, 2021 Community Expert

Also:

Nov 29 11:19:03 Tower kernel: Modules linked in: macvlan xt_mark xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle nf_tables vhost_net tun vhost vhost_iotlb tap xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding atlantic igb i2c_algo_bit edac_mce_amd amd_energy kvm_amd kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper wmi_bmof mxm_wmi btusb btrtl btbcm btintel bluetooth ecdh_generic ecc nvme nvme_core ahci rapl input_leds i2c_piix4 ccp i2c_core wmi led_class libahci k10temp button acpi_cpufreq [last unloaded: atlantic]
November 30, 2021 Author

Thank you so much. This puts me on the path to figuring everything out. You are awesome!
November 30, 2021 Author

At the risk of being too optimistic, I'm marking this as solved. I implemented both of the above suggestions: disabling global C-states in the BIOS, and turning off a Docker container that was on the custom:br0 network type to avoid the call trace issue. The server has now been up and running for over 24 hours, which is the longest in a long time. I'll update if the issue comes up again. Thanks again!!
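
For anyone who wants to check whether the same signature comes back after making these changes, here is a minimal sketch in Python that scans syslog lines for the patterns posted in this thread: the "rcu_sched self-detected stall" message, the "Workqueue:" line naming the stuck worker, and macvlan in the "Modules linked in:" list. The function name `summarize` and the regular expressions are my own assumptions based only on the log excerpts above; this is not an Unraid tool and it does not diagnose anything by itself.

```python
import re

# Patterns taken from the log excerpts in this thread (assumed formats,
# not an official specification of kernel log output).
STALL = re.compile(r"rcu: INFO: rcu_sched self-detected stall on CPU")
WORKQUEUE = re.compile(r"Workqueue: (\S+) (\S+)(?: \[(\w+)\])?")
MODULES = re.compile(r"Modules linked in: (.*)")


def summarize(lines):
    """Count RCU stall reports and note which worker and modules appear.

    `lines` is any iterable of syslog lines (a list or an open file).
    Returns a dict with the stall count, the set of (function, module)
    pairs seen on "Workqueue:" lines, and whether macvlan was listed
    among the loaded modules.
    """
    summary = {"stalls": 0, "workers": set(), "macvlan_loaded": False}
    for line in lines:
        if STALL.search(line):
            summary["stalls"] += 1
        match = WORKQUEUE.search(line)
        if match:
            # group(2) is the worker function, group(3) its module (may be None)
            summary["workers"].add((match.group(2), match.group(3)))
        match = MODULES.search(line)
        if match and "macvlan" in match.group(1).split():
            summary["macvlan_loaded"] = True
    return summary
```

To use it, pass an open log file, for example `summarize(open("/var/log/syslog"))`, assuming that is where your syslog lives. A stall count of zero after a few days of uptime would be consistent with the fix holding, though it does not prove it.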