subagon Posted May 17, 2021 (edited)

My server has been kernel panicking every 2-3 days since mid-April. It panicked 3 times today alone. I'm looking for any help I can get.

History and steps so far...

Problems started about 10 days after I upgraded to 6.9.2. I searched the forum and Reddit and found other "not syncing" type problems. Some were at boot, and the solution seems to be replacing the USB flash drive. My server has never failed to boot, so I have discounted the USB flash drive as the problem. The second type I found was bad RAM. My server has 64G of RAM on 4 DIMMs. I removed 32G and ran the server until it panicked, which happened within 2 days. I then swapped the RAM and again the server panicked within 2 days. So I think I can assume the RAM is okay.

I've also downgraded to 6.9.1, thinking maybe 6.9.2 was the problem. Again, the server panicked within a few days. BTW, I'm still on 6.9.1.

Since the server panics, I don't have much in the way of logs, etc. I do have screenshots of the IPMI console and they all look very similar.

I did not make any major, or even minor, changes to the server prior to the panics. The system has been running well for years. I'm starting to think the problem is hardware based, since the panics continue across unRAID versions and the lack of changes before they started would seem to rule out software.

After the second (of three) crashes today I changed the Tunable (md_sync_limit) from the default setting of "5" to "10". This was a wild guess at trying to solve this problem. It's the only parameter I could find that has "sync" in it. I couldn't find any documentation on this parameter either. Changing the value to "10" doesn't seem to have helped, since the server has crashed again since then.

Since the panics started, the server has been in an almost constant parity check. So much so that I've had 2 drives die. I can't be sure the constant parity checks killed them, but I'm sure they didn't help either.
It took more than a week, but I was able to replace the 2 dead drives between panics. I also upgraded my parity drives from 10G to 18G and used the old 10G parity drives to replace the dead drives. For now I'm canceling parity checks since the server has crashed 3 times in just one day. I've also stopped all but a few dockers and am running no VMs.

Attached are my diagnostic files and a screenshot of the console after a panic.

Any help will be welcome,
Mike

asok-diagnostics-20210516-1950.zip

Edited June 1, 2021 by subagon: Additional info
JorgeB Posted May 17, 2021

See if this applies to you: https://forums.unraid.net/topic/70529-650-call-traces-when-assigning-ip-address-to-docker-containers/

See also here: https://forums.unraid.net/bug-reports/stable-releases/690691-kernel-panic-due-to-netfilter-nf_nat_setup_info-docker-static-ip-macvlan-r1356/
subagon Posted May 17, 2021

JorgeB, thanks for the info. I'll give it a try and move my br0 connections over to a VLAN. I looked at my syslog and found this from this morning, but the server has not crashed since yesterday.

May 17 07:59:41 asok kernel: ------------[ cut here ]------------
May 17 07:59:41 asok kernel: WARNING: CPU: 15 PID: 0 at net/netfilter/nf_conntrack_core.c:1120 __nf_conntrack_confirm+0x9b/0x1e6
May 17 07:59:41 asok kernel: Modules linked in: vhost_net tun vhost vhost_iotlb tap kvm_intel kvm iptable_raw wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libblake2s blake2s_x86_64 libblake2s_generic libchacha veth xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat iptable_mangle macvlan xt_nat xt_MASQUERADE iptable_nat nf_nat xfs md_mod nct7904 watchdog ip6table_filter ip6_tables iptable_filter ip_tables igb i2c_algo_bit ipmi_ssif sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd mpt3sas i2c_i801 i2c_smbus input_leds glue_helper rapl intel_cstate i2c_core ahci led_class raid_class intel_uncore acpi_ipmi libahci scsi_transport_sas wmi ipmi_si acpi_pad button [last unloaded: kvm]
May 17 07:59:41 asok kernel: CPU: 15 PID: 0 Comm: swapper/15 Not tainted 5.10.21-Unraid #1
May 17 07:59:41 asok kernel: Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.3 08/23/2018
May 17 07:59:41 asok kernel: RIP: 0010:__nf_conntrack_confirm+0x9b/0x1e6
May 17 07:59:41 asok kernel: Code: e8 64 f9 ff ff 44 89 fa 89 c6 41 89 c4 48 c1 eb 20 89 df 41 89 de e8 d5 f6 ff ff 84 c0 75 bb 48 8b 85 80 00 00 00 a8 08 74 18 <0f> 0b 89 df 44 89 e6 31 db e8 5d f3 ff ff e8 30 f6 ff ff e9 22 01
May 17 07:59:41 asok kernel: RSP: 0018:ffffc900066e08a8 EFLAGS: 00010202
May 17 07:59:41 asok kernel: RAX: 0000000000000188 RBX: 0000000000003c05 RCX: 00000000b0e84f5c
May 17 07:59:41 asok kernel: RDX: 0000000000000000 RSI: 0000000000000061 RDI: ffffffff820099c4
May 17 07:59:41 asok kernel: RBP: ffff888179d86f00 R08: 00000000fac24da6 R09: ffff8888a1211ae0
May 17 07:59:41 asok kernel: R10: 0000000000000158 R11: ffff8884a537f200 R12: 0000000000008461
May 17 07:59:41 asok kernel: R13: ffffffff8210db40 R14: 0000000000003c05 R15: 0000000000000000
May 17 07:59:41 asok kernel: FS: 0000000000000000(0000) GS:ffff88885fc40000(0000) knlGS:0000000000000000
May 17 07:59:41 asok kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 17 07:59:41 asok kernel: CR2: 000014faebee29f8 CR3: 000000000200c006 CR4: 00000000001706e0
May 17 07:59:41 asok kernel: Call Trace:
May 17 07:59:41 asok kernel: <IRQ>
May 17 07:59:41 asok kernel: nf_conntrack_confirm+0x2f/0x36
May 17 07:59:41 asok kernel: nf_hook_slow+0x39/0x8e
May 17 07:59:41 asok kernel: nf_hook.constprop.0+0xb1/0xd8
May 17 07:59:41 asok kernel: ? ip_protocol_deliver_rcu+0xfe/0xfe
May 17 07:59:41 asok kernel: ip_local_deliver+0x49/0x75
May 17 07:59:41 asok kernel: ip_sabotage_in+0x43/0x4d
May 17 07:59:41 asok kernel: nf_hook_slow+0x39/0x8e
May 17 07:59:41 asok kernel: nf_hook.constprop.0+0xb1/0xd8
May 17 07:59:41 asok kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
May 17 07:59:41 asok kernel: ip_rcv+0x41/0x61
May 17 07:59:41 asok kernel: __netif_receive_skb_one_core+0x74/0x95
May 17 07:59:41 asok kernel: netif_receive_skb+0x79/0xa1
May 17 07:59:41 asok kernel: br_handle_frame_finish+0x30d/0x351
May 17 07:59:41 asok kernel: ? ipt_do_table+0x570/0x5c0 [ip_tables]
May 17 07:59:41 asok kernel: ? br_pass_frame_up+0xda/0xda
May 17 07:59:41 asok kernel: br_nf_hook_thresh+0xa3/0xc3
May 17 07:59:41 asok kernel: ? br_pass_frame_up+0xda/0xda
May 17 07:59:41 asok kernel: br_nf_pre_routing_finish+0x23d/0x264
May 17 07:59:41 asok kernel: ? br_pass_frame_up+0xda/0xda
May 17 07:59:41 asok kernel: ? br_handle_frame_finish+0x351/0x351
May 17 07:59:41 asok kernel: ? nf_nat_ipv4_in+0x1e/0x4a [nf_nat]
May 17 07:59:41 asok kernel: ? br_nf_forward_finish+0xd0/0xd0
May 17 07:59:41 asok kernel: ? br_handle_frame_finish+0x351/0x351
May 17 07:59:41 asok kernel: NF_HOOK+0xd7/0xf7
May 17 07:59:41 asok kernel: ? br_nf_forward_finish+0xd0/0xd0
May 17 07:59:41 asok kernel: br_nf_pre_routing+0x229/0x239
May 17 07:59:41 asok kernel: ? br_nf_forward_finish+0xd0/0xd0
May 17 07:59:41 asok kernel: br_handle_frame+0x25e/0x2a6
May 17 07:59:41 asok kernel: ? br_pass_frame_up+0xda/0xda
May 17 07:59:41 asok kernel: __netif_receive_skb_core+0x335/0x4e7
May 17 07:59:41 asok kernel: ? __update_load_avg_cfs_rq+0xd6/0x18f
May 17 07:59:41 asok kernel: __netif_receive_skb_list_core+0x78/0x104
May 17 07:59:41 asok kernel: netif_receive_skb_list_internal+0x1bf/0x1f2
May 17 07:59:41 asok kernel: ? dev_gro_receive+0x55d/0x578
May 17 07:59:41 asok kernel: gro_normal_list+0x1d/0x39
May 17 07:59:41 asok kernel: napi_complete_done+0x79/0x104
May 17 07:59:41 asok kernel: igb_poll+0xc9a/0xec8 [igb]
May 17 07:59:41 asok kernel: ? do_send_sig_info+0x63/0x86
May 17 07:59:41 asok kernel: net_rx_action+0xf4/0x29d
May 17 07:59:41 asok kernel: __do_softirq+0xc4/0x1c2
May 17 07:59:41 asok kernel: asm_call_irq_on_stack+0x12/0x20
May 17 07:59:41 asok kernel: </IRQ>
May 17 07:59:41 asok kernel: do_softirq_own_stack+0x2c/0x39
May 17 07:59:41 asok kernel: __irq_exit_rcu+0x45/0x80
May 17 07:59:41 asok kernel: common_interrupt+0x119/0x12e
May 17 07:59:41 asok kernel: asm_common_interrupt+0x1e/0x40
May 17 07:59:41 asok kernel: RIP: 0010:arch_local_irq_enable+0x7/0x8
May 17 07:59:41 asok kernel: Code: 00 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 9c 58 0f 1f 44 00 00 c3 fa 66 0f 1f 44 00 00 c3 fb 66 0f 1f 44 00 00 <c3> 55 8b af 28 04 00 00 b8 01 00 00 00 45 31 c9 53 45 31 d2 39 c5
May 17 07:59:41 asok kernel: RSP: 0018:ffffc900063fbea0 EFLAGS: 00000246
May 17 07:59:41 asok kernel: RAX: ffff88885fc62380 RBX: 0000000000000004 RCX: 000000000000001f
May 17 07:59:41 asok kernel: RDX: 0000000000000000 RSI: 00000000313b14ef RDI: 0000000000000000
May 17 07:59:41 asok kernel: RBP: ffffe8f7ff47f700 R08: 00002a9d58969783 R09: 0000000000000000
May 17 07:59:41 asok kernel: R10: 000000000000f71a R11: 071c71c71c71c71c R12: 00002a9d58969783
May 17 07:59:41 asok kernel: R13: ffffffff820c7ec0 R14: 0000000000000004 R15: 0000000000000000
May 17 07:59:41 asok kernel: cpuidle_enter_state+0x101/0x1c4
May 17 07:59:41 asok kernel: cpuidle_enter+0x25/0x31
May 17 07:59:41 asok kernel: do_idle+0x1a6/0x214
May 17 07:59:41 asok kernel: cpu_startup_entry+0x18/0x1a
May 17 07:59:41 asok kernel: secondary_startup_64_no_verify+0xb0/0xbb
May 17 07:59:41 asok kernel: ---[ end trace eaf4c384b8419f52 ]---

I'll report back after I get the VLAN set up and the dockers moved over to it.

Mike
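
For anyone wanting to check their own syslog for the same kind of trace, here is a minimal Python sketch. It assumes the log has one entry per line and that a kernel warning block starts with a "[ cut here ]" line and ends with an "[ end trace ... ]" line, as in the trace above. The function names and the choice of `__nf_conntrack_confirm` plus `macvlan` as the signature are assumptions made for illustration, not an official diagnostic.

```python
import re

# Markers the kernel prints around a warning block, as seen in the trace above.
START = "[ cut here ]"
END_RE = re.compile(r"\[ end trace [0-9a-f]+ \]")

# Substrings taken from the trace in this thread. Treating them as the
# signature of the macvlan/conntrack problem is an assumption.
SIGNATURE = ("__nf_conntrack_confirm", "macvlan")


def find_traces(lines):
    """Return a list of trace blocks; each block is a list of log lines."""
    blocks, current = [], None
    for line in lines:
        if START in line:
            # A new block begins; any unfinished block is discarded.
            current = [line]
        elif current is not None:
            current.append(line)
            if END_RE.search(line):
                blocks.append(current)
                current = None
    return blocks


def matches_signature(block):
    """True if every signature substring appears somewhere in the block."""
    text = "\n".join(block)
    return all(s in text for s in SIGNATURE)
```

To use it, read the syslog into a list of lines (the path depends on the system), call `find_traces(lines)`, and print the first line of each block for which `matches_signature` returns true. A match only suggests the trace resembles the one posted here; it does not prove the cause.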
subagon Posted May 17, 2021

Update: For troubleshooting I have either shut down or migrated all dockers from "br0" to "bridge". I also upgraded back to 6.9.2 (from 6.9.1), since this issue seems to affect both 6.9.1 and 6.9.2. If this works, I'll likely use a VLAN as described in the links above.

I'll see how long the system stays up. If I don't make any further updates to this thread after 2 weeks, assume that the problem is solved.
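
When migrating containers off br0, it can help to confirm that nothing is still attached to it. The sketch below parses the JSON that `docker network inspect` prints and lists the attached containers with their addresses. The helper names are hypothetical, and the network name "br0" is taken from this thread; adjust it to whatever the custom network is called on a given server.

```python
import json
import subprocess


def containers_on_network(inspect_json):
    """Map container name -> IPv4 address from `docker network inspect` JSON."""
    result = {}
    for network in json.loads(inspect_json):
        # "Containers" maps container IDs to details; it may be empty or null.
        for info in (network.get("Containers") or {}).values():
            result[info.get("Name", "")] = info.get("IPv4Address", "")
    return result


def inspect_network(name="br0"):
    """Run the docker CLI and return its JSON output as a string."""
    return subprocess.run(
        ["docker", "network", "inspect", name],
        check=True, capture_output=True, text=True,
    ).stdout
```

Calling `containers_on_network(inspect_network("br0"))` on the server should return an empty dict once every container has been stopped or moved to "bridge". Only running containers show up in this output, so stopped containers that are still configured for br0 need to be checked separately.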
subagon Posted June 1, 2021

Two weeks and the system hasn't crashed. So I'm going to assume that removing all IPs from br0 and moving them to "bridge" has stabilized the server. Next I'll re-IP all my dockers back onto br0, but this time adding a VLAN as described above.

I hope Lime Tech addresses this issue in a future release so that this workaround isn't necessary.