Doridian Posted April 21, 2021 (edited)

My unRAID install is fairly new (on 6.9.2). Today, I suddenly could no longer reach my server over the network at all. I managed to pull some logs, such as dmesg and syslog (diagnostics would just hang). The only way to reboot it was a hard reset; reboot would just hang as well.

The dmesg excerpt that seems most relevant, from before the reboot:

[259885.572557] ------------[ cut here ]------------
[259885.572564] NETDEV WATCHDOG: eth3 (mlx4_core): transmit queue 23 timed out
[259885.572603] WARNING: CPU: 9 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0xcf/0x12b
[259885.572605] Modules linked in: vhost_net vhost vhost_iotlb tap kvm_intel kvm xt_connmark xt_comment iptable_raw wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libblake2s blake2s_x86_64 libblake2s_generic libchacha tun nfsd lockd grace sunrpc xt_mark xt_nat xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle nf_tables macvlan xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs dm_crypt dm_mod dax md_mod ipmi_devintf ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding mlx4_en mlx4_core igb i2c_algo_bit st sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper isci rapl mpt3sas ipmi_ssif libsas ahci nvme intel_cstate raid_class scsi_transport_sas i2c_i801
[259885.572737] input_leds intel_uncore nvme_core libahci acpi_ipmi i2c_smbus wmi i2c_core led_class ipmi_si button [last unloaded: kvm]
[259885.572756] CPU: 9 PID: 0 Comm: swapper/9 Tainted: G W 5.10.28-Unraid #1
[259885.572757] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.2a 06/30/2015
[259885.572762] RIP: 0010:dev_watchdog+0xcf/0x12b
[259885.572767] Code: 79 b7 00 00 75 38 48 89 ef c6 05 63 79 b7 00 01 e8 79 dd fc ff 44 89 e1 48 89 ee 48 c7 c7 ef 7f de 81 48 89 c2 e8 50 16 10 00 <0f> 0b eb 10 41 ff c4 48 05 40 01 00 00 41 39 f4 75 9d eb 16 48 8b
[259885.572769] RSP: 0018:ffffc90006614ed8 EFLAGS: 00010286
[259885.572772] RAX: 0000000000000000 RBX: ffff88812f120438 RCX: 0000000000000027
[259885.572774] RDX: 00000000ffffdfff RSI: 0000000000000001 RDI: ffff88a03fa58920
[259885.572776] RBP: ffff88812f120000 R08: 0000000000000000 R09: 00000000ffffdfff
[259885.572778] R10: ffffc90006614d08 R11: ffffc90006614d00 R12: 0000000000000017
[259885.572779] R13: ffffc90006614f10 R14: ffffc90006614f10 R15: ffffffff820060c8
[259885.572782] FS: 0000000000000000(0000) GS:ffff88a03fa40000(0000) knlGS:0000000000000000
[259885.572784] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[259885.572786] CR2: 000014d2df44ebd8 CR3: 000000000600a002 CR4: 00000000001706e0
[259885.572788] Call Trace:
[259885.572793] <IRQ>
[259885.572801] call_timer_fn.isra.0+0x12/0x6f
[259885.572806] ? netif_tx_lock+0x7a/0x7a
[259885.572808] __run_timers.part.0+0x144/0x185
[259885.572812] ? update_process_times+0x68/0x6e
[259885.572814] ? hrtimer_forward+0x73/0x7b
[259885.572818] ? tick_sched_timer+0x5a/0x64
[259885.572823] ? timerqueue_add+0x62/0x68
[259885.572831] ? recalibrate_cpu_khz+0x1/0x1
[259885.572834] run_timer_softirq+0x21/0x43
[259885.572840] __do_softirq+0xc4/0x1c2
[259885.572844] asm_call_irq_on_stack+0x12/0x20
[259885.572847] </IRQ>
[259885.572850] do_softirq_own_stack+0x2c/0x39
[259885.572860] __irq_exit_rcu+0x45/0x80
[259885.572866] sysvec_apic_timer_interrupt+0x87/0x95
[259885.572871] asm_sysvec_apic_timer_interrupt+0x12/0x20
[259885.572877] RIP: 0010:arch_local_irq_enable+0x7/0x8
[259885.572880] Code: 00 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 9c 58 0f 1f 44 00 00 c3 fa 66 0f 1f 44 00 00 c3 fb 66 0f 1f 44 00 00 <c3> 55 8b af 28 04 00 00 b8 01 00 00 00 45 31 c9 53 45 31 d2 39 c5
[259885.572882] RSP: 0018:ffffc900000f7ea0 EFLAGS: 00000246
[259885.572884] RAX: ffff88a03fa62380 RBX: 0000000000000004 RCX: 000000000000001f
[259885.572886] RDX: 0000000000000000 RSI: 0000000025a5a719 RDI: 0000000000000000
[259885.572888] RBP: ffffe8ffff27f500 R08: 0000ec5d533ae507 R09: 0000000000000389
[259885.572890] R10: 000000007fffffff R11: 071c71c71c71c71c R12: 0000ec5d533ae507
[259885.572892] R13: ffffffff820c5dc0 R14: 0000000000000004 R15: 0000000000000000
[259885.572897] cpuidle_enter_state+0x101/0x1c4
[259885.572901] cpuidle_enter+0x25/0x31
[259885.572906] do_idle+0x1a6/0x214
[259885.572910] cpu_startup_entry+0x18/0x1a
[259885.572914] secondary_startup_64_no_verify+0xb0/0xbb
[259885.572918] ---[ end trace 6e6cbe7c23f3f6b7 ]---
[259885.572924] mlx4_en: eth3: TX timeout on queue: 23, QP: 0x23f, CQ: 0xd7, Cons: 0x5048a9, Prod: 0x50492d
[259914.564042] rcu: INFO: rcu_sched self-detected stall on CPU
[259914.564054] rcu: 15-....: (59999 ticks this GP) idle=aba/1/0x4000000000000000 softirq=7718420/7718420 fqs=14995
[259914.564062] (t=60000 jiffies g=34306649 q=104068)
[259914.564066] NMI backtrace for cpu 15
[259914.564073] CPU: 15 PID: 21642 Comm: kworker/u66:3 Tainted: G W 5.10.28-Unraid #1
[259914.564076] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.2a 06/30/2015
[259914.564105] Workqueue: events_power_efficient gc_worker [nf_conntrack]
[259914.564111] Call Trace:
[259914.564119] <IRQ>
[259914.564129] dump_stack+0x6b/0x83
[259914.564139] ? lapic_can_unplug_cpu+0x8e/0x8e
[259914.564149] nmi_cpu_backtrace+0x7d/0x8f
[259914.564156] nmi_trigger_cpumask_backtrace+0x56/0xd3
[259914.564162] rcu_dump_cpu_stacks+0x9f/0xc6
[259914.564171] rcu_sched_clock_irq+0x1ec/0x543
[259914.564182] ? _raw_spin_unlock_irqrestore+0xd/0xe
[259914.564189] update_process_times+0x50/0x6e
[259914.564195] tick_sched_timer+0x36/0x64
[259914.564201] __hrtimer_run_queues+0xb7/0x10b
[259914.564212] ? tick_sched_do_timer+0x39/0x39
[259914.564217] hrtimer_interrupt+0x8d/0x15b
[259914.564223] __sysvec_apic_timer_interrupt+0x5d/0x68
[259914.564229] asm_call_irq_on_stack+0x12/0x20
[259914.564233] </IRQ>
[259914.564239] sysvec_apic_timer_interrupt+0x71/0x95
[259914.564246] asm_sysvec_apic_timer_interrupt+0x12/0x20
[259914.564259] RIP: 0010:nf_ct_tuplehash_to_ctrack+0x4/0xe [nf_conntrack]
[259914.564265] Code: 48 8b 57 08 a8 01 48 89 02 75 04 48 89 50 08 c3 48 8b 06 48 89 77 08 48 89 07 a8 01 48 89 3e 75 04 48 89 78 08 c3 0f b6 47 37 <48> 6b c0 c8 48 8d 44 07 f0 c3 48 8b 87 b8 00 00 00 48 85 c0 74 12
[259914.564269] RSP: 0018:ffffc90029bffe40 EFLAGS: 00000206
[259914.564273] RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffff8890df880000
[259914.564277] RDX: 000000010f79753f RSI: ffffc90029bffe5c RDI: ffff8881ed42c7c8
[259914.564280] RBP: 000000000000e51b R08: 0000000000000000 R09: ffffffffa031229a
[259914.564283] R10: 8080808080808080 R11: ffff8881ee35c780 R12: ffffffffa03285a0
[259914.564287] R13: 0000000000040c1b R14: ffff8881ed42c7c8 R15: ffff8881ed42c780
[259914.564299] ? nf_conntrack_free+0x2b/0x35 [nf_conntrack]
[259914.564313] gc_worker+0x9a/0x240 [nf_conntrack]
[259914.564323] process_one_work+0x13c/0x1d5
[259914.564329] worker_thread+0x18b/0x22f
[259914.564336] ? process_scheduled_works+0x27/0x27
[259914.564341] kthread+0xe5/0xea
[259914.564346] ? __kthread_bind_mask+0x57/0x57
[259914.564354] ret_from_fork+0x22/0x30
[259945.987587] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 15-... 31-... } 60412 jiffies s: 16169 root: 0x3/.
[259945.987598] rcu: blocking rcu_node structures: l=1:0-15:0x8000/. l=1:16-31:0x8000/.
[259945.987606] Task dump for CPU 15:
[259945.987616] task:kworker/u66:3 state:R running task stack: 0 pid:21642 ppid: 2 flags:0x00004008
[259945.987636] Workqueue: events_power_efficient gc_worker [nf_conntrack]
[259945.987639] Call Trace:
[259945.987651] ? process_one_work+0x13c/0x1d5
[259945.987654] ? worker_thread+0x18b/0x22f
[259945.987657] ? process_scheduled_works+0x27/0x27
[259945.987661] ? kthread+0xe5/0xea
[259945.987664] ? __kthread_bind_mask+0x57/0x57
[259945.987668] ? ret_from_fork+0x22/0x30
[259945.987672] Task dump for CPU 31:
[259945.987673] task:swapper/31 state:R running task stack: 0 pid: 0 ppid: 1 flags:0x00004008
[259945.987677] Call Trace:
[259945.987683] ? arch_local_irq_enable+0x7/0x8
[259945.987686] ? cpuidle_enter_state+0x101/0x1c4
[259945.987690] ? cpuidle_enter+0x25/0x31
[259945.987695] ? do_idle+0x1a6/0x214
[259945.987698] ? cpu_startup_entry+0x18/0x1a
[259945.987700] ? secondary_startup_64_no_verify+0xb0/0xbb
[260094.564270] rcu: INFO: rcu_sched self-detected stall on CPU
[260094.564281] rcu: 15-....: (240002 ticks this GP) idle=aba/1/0x4000000000000000 softirq=7718420/7718420 fqs=59991
[260094.564288] (t=240003 jiffies g=34306649 q=416808)
[260094.564291] NMI backtrace for cpu 15

After a reboot, I also see a weird warning in dmesg (one that I never got when I ran Proxmox VE):

[ 206.835537] ------------[ cut here ]------------
[ 206.835559] WARNING: CPU: 31 PID: 0 at net/netfilter/nf_conntrack_core.c:1120 __nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]
[ 206.835561] Modules linked in: xt_connmark xt_comment iptable_raw xt_mark xt_nat xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_nat nf_tables vhost_net vhost vhost_iotlb tap macvlan xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter xfs dm_crypt dm_mod dax nfsd lockd grace sunrpc md_mod ipmi_devintf wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libblake2s blake2s_x86_64 libblake2s_generic libchacha tun ip6table_mangle iptable_mangle xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding mlx4_en mlx4_core igb i2c_algo_bit st sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ipmi_ssif ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper i2c_i801 i2c_smbus isci mpt3sas rapl input_leds libsas nvme ahci intel_cstate acpi_ipmi
[ 206.835672] i2c_core raid_class scsi_transport_sas nvme_core wmi libahci intel_uncore led_class ipmi_si button [last unloaded: mlx4_core]
[ 206.835692] CPU: 31 PID: 0 Comm: swapper/31 Not tainted 5.10.28-Unraid #1
[ 206.835694] Hardware name: Supermicro X9DAX/X9DAX, BIOS 3.2a 06/30/2015
[ 206.835702] RIP: 0010:__nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]
[ 206.835707] Code: e8 dc f8 ff ff 44 89 fa 89 c6 41 89 c4 48 c1 eb 20 89 df 41 89 de e8 36 f6 ff ff 84 c0 75 bb 48 8b 85 80 00 00 00 a8 08 74 18 <0f> 0b 89 df 44 89 e6 31 db e8 6d f3 ff ff e8 35 f5 ff ff e9 22 01
[ 206.835710] RSP: 0018:ffffc900069dc938 EFLAGS: 00010202
[ 206.835713] RAX: 0000000000000188 RBX: 000000000000bfba RCX: 000000001aafc951
[ 206.835715] RDX: 0000000000000000 RSI: 0000000000000366 RDI: ffffffffa01690e8
[ 206.835718] RBP: ffff889108901a40 R08: 0000000083d9afb1 R09: 0000000000000000
[ 206.835720] R10: 0000000000000158 R11: ffffc900069dc930 R12: 0000000000003766
[ 206.835722] R13: ffffffff8210b440 R14: 000000000000bfba R15: 0000000000000000
[ 206.835726] FS: 0000000000000000(0000) GS:ffff88a03fdc0000(0000) knlGS:0000000000000000
[ 206.835728] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 206.835731] CR2: 00007ffeeac18f28 CR3: 000000000600a001 CR4: 00000000001706e0
[ 206.835733] Call Trace:
[ 206.835740] <IRQ>
[ 206.835749] nf_conntrack_confirm+0x2f/0x36 [nf_conntrack]
[ 206.835759] nf_hook_slow+0x39/0x8e
[ 206.835765] nf_hook.constprop.0+0xb1/0xd8
[ 206.835770] ? ip_protocol_deliver_rcu+0xfe/0xfe
[ 206.835773] ip_local_deliver+0x49/0x75
[ 206.835780] ip_sabotage_in+0x43/0x4d [br_netfilter]
[ 206.835785] nf_hook_slow+0x39/0x8e
[ 206.835788] nf_hook.constprop.0+0xb1/0xd8
[ 206.835792] ? l3mdev_l3_rcv.constprop.0+0x50/0x50
[ 206.835795] ip_rcv+0x41/0x61
[ 206.835805] __netif_receive_skb_one_core+0x74/0x95
[ 206.835811] netif_receive_skb+0x79/0xa1
[ 206.835817] br_handle_frame_finish+0x30d/0x351
[ 206.835825] ? skb_copy_bits+0xe8/0x197
[ 206.835830] ? ipt_do_table+0x570/0x5c0 [ip_tables]
[ 206.835833] ? br_pass_frame_up+0xda/0xda
[ 206.835837] br_nf_hook_thresh+0xa3/0xc3 [br_netfilter]
[ 206.835841] ? br_pass_frame_up+0xda/0xda
[ 206.835845] br_nf_pre_routing_finish+0x23d/0x264 [br_netfilter]
[ 206.835848] ? br_pass_frame_up+0xda/0xda
[ 206.835851] ? br_handle_frame_finish+0x351/0x351
[ 206.835858] ? nf_nat_ipv4_pre_routing+0x1e/0x4a [nf_nat]
[ 206.835862] ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
[ 206.835865] ? br_handle_frame_finish+0x351/0x351
[ 206.835869] NF_HOOK+0xd7/0xf7 [br_netfilter]
[ 206.835874] ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
[ 206.835878] br_nf_pre_routing+0x229/0x239 [br_netfilter]
[ 206.835883] ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
[ 206.835886] br_handle_frame+0x25e/0x2a6
[ 206.835889] ? br_pass_frame_up+0xda/0xda
[ 206.835893] __netif_receive_skb_core+0x335/0x4e7
[ 206.835898] ? dev_gro_receive+0x55d/0x578
[ 206.835903] __netif_receive_skb_list_core+0x78/0x104
[ 206.835909] netif_receive_skb_list_internal+0x1bf/0x1f2
[ 206.835914] gro_normal_list+0x1d/0x39
[ 206.835918] napi_complete_done+0x79/0x104
[ 206.835926] mlx4_en_poll_rx_cq+0xa8/0xc7 [mlx4_en]
[ 206.835931] net_rx_action+0xf4/0x29d
[ 206.835937] __do_softirq+0xc4/0x1c2
[ 206.835941] asm_call_irq_on_stack+0x12/0x20
[ 206.835944] </IRQ>
[ 206.835950] do_softirq_own_stack+0x2c/0x39
[ 206.835959] __irq_exit_rcu+0x45/0x80
[ 206.835964] common_interrupt+0x119/0x12e
[ 206.835970] asm_common_interrupt+0x1e/0x40
[ 206.835976] RIP: 0010:arch_local_irq_enable+0x7/0x8
[ 206.835980] Code: 00 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 9c 58 0f 1f 44 00 00 c3 fa 66 0f 1f 44 00 00 c3 fb 66 0f 1f 44 00 00 <c3> 55 8b af 28 04 00 00 b8 01 00 00 00 45 31 c9 53 45 31 d2 39 c5
[ 206.835982] RSP: 0018:ffffc9000647bea0 EFLAGS: 00000246
[ 206.835985] RAX: ffff88a03fde2380 RBX: 0000000000000004 RCX: 000000000000001f
[ 206.835987] RDX: 0000000000000000 RSI: 0000000025a5d58a RDI: 0000000000000000
[ 206.835990] RBP: ffffe8ffff5ff500 R08: 00000030285a7775 R09: 000000000002b057
[ 206.835992] R10: 000000007fffffff R11: 071c71c71c71c71c R12: 00000030285a7775
[ 206.835994] R13: ffffffff820c5dc0 R14: 0000000000000004 R15: 0000000000000000
[ 206.836000] cpuidle_enter_state+0x101/0x1c4
[ 206.836004] cpuidle_enter+0x25/0x31
[ 206.836009] do_idle+0x1a6/0x214
[ 206.836013] cpu_startup_entry+0x18/0x1a
[ 206.836018] secondary_startup_64_no_verify+0xb0/0xbb
[ 206.836022] ---[ end trace 79c5313195419a23 ]---

Is this a known issue with Mellanox ConnectX-3 cards, since their driver appears in the trace? Is there anything I can do to fix this? I do have a couple of Docker containers on bridges (most on VLAN bridges, but one on the main non-VLAN bridge).

//EDIT: In fact, after researching past reports of this issue (the second one, with just the trace): if I don't run any containers on the "root bridge" (the one without a VLAN), I no longer get such traces. Is this known? Is there a fix? I would really like to run some containers on that bridge without a VLAN attached.

//EDIT2: I should also add that I am operating the Mellanox NIC in 802.3ad bonded mode with a 9000 MTU (jumbo frames).

Edited April 27, 2021 by Doridian
JorgeB Posted April 22, 2021

10 hours ago, Doridian said: Is this a known issue with Mellanox ConnectX-3 cards

Not that I'm aware of, and I'm using them without issues, but it does appear network related. Try simplifying your network config as much as possible.
Doridian Posted April 22, 2021 Author

2 hours ago, JorgeB said: Not that I'm aware of, and I'm using them without issues, but it does appear network related. Try simplifying your network config as much as possible.

These errors do seem to be a known issue when you run Docker containers directly on a bridge without a VLAN (on 10G cards). (At least the nf_conntrack_confirm call trace seems to be, and I suspect that is the cause of the eventual network stall/crash.) So far I've been running stable, without any call traces or issues, since removing Docker from br0 (and keeping it on br0.X interfaces instead).

See links such as:
- https://forums.unraid.net/topic/101342-solved-69-rc2-kernal-panic-and-trace/
- https://forums.unraid.net/topic/97881-unraid-becomes-unresponsive-rendomlydocker-containers-crashing/

(There are many more, and people seem to have found various solutions to their issue. For some, using only VLAN bridges doesn't seem to have fixed it, and they instead needed to remove custom IPs, etc.)

However, I would like to be able to run Docker containers on br0, which is why I am keeping this thread open to see if anyone has suggestions for how to do that, or whether someone from the unRAID team could give me more instructions on how to debug further, try new kernels, etc.
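For anyone who wants to check whether a network config change actually stopped the traces, here is a minimal sketch that counts the kernel messages quoted in this thread in a saved dmesg dump. The function name, the pattern labels and the idea of saving the log to a file first are assumptions made for illustration; only the matched message texts come from the logs above.

```python
import re
from collections import Counter

# Message fragments copied from the dmesg excerpts in this thread.
# The dictionary keys are illustrative labels, not kernel terminology.
PATTERNS = {
    "tx_timeout": re.compile(r"NETDEV WATCHDOG: \S+"),
    # Match only the WARNING header so each trace is counted once
    # (the RIP line of the same trace also names the function).
    "conntrack_warning": re.compile(r"WARNING: .*__nf_conntrack_confirm"),
    "rcu_stall": re.compile(r"rcu_sched self-detected stall"),
}


def count_kernel_events(log_text: str) -> Counter:
    """Count how many log lines match each known message pattern."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Example use, assuming the log was saved beforehand with `dmesg > dmesg.txt`:
#     with open("dmesg.txt") as f:
#         print(dict(count_kernel_events(f.read())))
```

Running it before and after moving containers off br0 gives a rough comparison, though a zero count only means the messages have not appeared yet in that log.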
Doridian Posted April 27, 2021 Author (edited)

I think I solved the issue: setting iommu=pt intel_iommu=pt (so that IOMMU translation only applies to hardware that is used inside VMs) made it stop spewing these messages. It seems the ConnectX-3 just doesn't like being IOMMU'd when used with Docker? Weird, but okay. It works! (This is something to try for anyone else with the same issue.)

//EDIT: IGNORE THIS. It just made it take longer...

Edited April 27, 2021 by Doridian
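When testing boot parameters like these, it helps to confirm that the running kernel actually received them. The sketch below parses a kernel command line string such as the contents of /proc/cmdline on Linux. The function name is hypothetical, and the parser is deliberately simple: it does not handle quoted values that contain spaces.

```python
def parse_kernel_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {parameter: value}.

    Flags without '=' (for example 'quiet') map to None.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# Example use on a running Linux system:
#     with open("/proc/cmdline") as f:
#         params = parse_kernel_cmdline(f.read())
#     print(params.get("iommu"), params.get("intel_iommu"))
```

If either parameter prints as None, it was not passed to the kernel on that boot, so any change in behaviour cannot be attributed to it.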