[Workaround] 6.9.1 & 6.9.2: Kernel panic - not syncing



My server has been kernel panicking every 2-3 days since mid-April. It panicked 3 times today alone. I'm looking for any help I can get.

 

History and steps so far...

 

Problems started about 10 days after I upgraded to 6.9.2. I searched the forum and Reddit and found other "not syncing" type problems. Some were at boot, and the solution seems to be replacing the USB flash drive. My server has never failed to boot, so I have discounted the USB flash drive as the problem. The second type I found was bad RAM. My server has 64 GB of RAM on 4 DIMMs. I removed 32 GB and ran the server until it panicked, which happened within 2 days. I then swapped in the other 32 GB and again the server panicked within 2 days. Since it panics with either half installed, I think I can assume the RAM is okay.

 

I've also downgraded to 6.9.1, thinking maybe 6.9.2 was the problem. Again, the server panicked within a few days. BTW, I'm still on 6.9.1.

 

Since the server panics, I don't have much in the way of logs, etc. I do have screenshots of the IPMI console and they all look very similar.

 

2115674186_panic5-16no2.jpg.8e1830eac40105811afbbee1814be011.jpg

 

I've not made any major, or really even minor, changes to the server prior to the panics. The system has been running well for years. I'm starting to think the problem is hardware-based, since the panics persist across unRAID versions and I made few changes beforehand, which would seem to rule out software.

 

After the second (of three) crashes today I changed the tunable md_sync_limit from the default setting of "5" to "10". This was a wild guess at trying to solve this problem. It's the only parameter I could find that has "sync" in it. I couldn't find any documentation on this parameter either. Changing the value to "10" doesn't seem to have helped, since the server has crashed again since then.

 

Since the panics started, the server has been in an almost constant parity check. So much so that I've had 2 drives die. I can't be sure the constant parity checks killed them, but I'm sure they didn't help either.

 

It took more than a week, but I was able to replace the 2 dead drives between panics. I also upgraded my parity drives from 10G to 18G and used the old 10G parity drives to replace the dead drives.

 

For now I'm canceling parity checks since the server has crashed 3 times in just one day. I've also stopped all but a few dockers and am running no VMs.

 

Attached are my diagnostic files and a screenshot of the console after a panic.

 

Any help will be welcome,

 

Mike

asok-diagnostics-20210516-1950.zip

Edited by subagon
Additional info

JorgeB,

 

Thanks for the info. I'll give it a try and move my br0 connections over to a VLAN.

 

I looked at my syslog and found this from this morning, but the server has not crashed since yesterday.

 

May 17 07:59:41 asok kernel: ------------[ cut here ]------------
May 17 07:59:41 asok kernel: WARNING: CPU: 15 PID: 0 at net/netfilter/nf_conntrack_core.c:1120 __nf_conntrack_confirm+0x9b/0x1e6
May 17 07:59:41 asok kernel: Modules linked in: vhost_net tun vhost vhost_iotlb tap kvm_intel kvm iptable_raw wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libblake2s blake2s_x86_64 libblake2s_generic libchacha veth xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat iptable_mangle macvlan xt_nat xt_MASQUERADE iptable_nat nf_nat xfs md_mod nct7904 watchdog ip6table_filter ip6_tables iptable_filter ip_tables igb i2c_algo_bit ipmi_ssif sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd mpt3sas i2c_i801 i2c_smbus input_leds glue_helper rapl intel_cstate i2c_core ahci led_class raid_class intel_uncore acpi_ipmi libahci scsi_transport_sas wmi ipmi_si acpi_pad button [last unloaded: kvm]
May 17 07:59:41 asok kernel: CPU: 15 PID: 0 Comm: swapper/15 Not tainted 5.10.21-Unraid #1
May 17 07:59:41 asok kernel: Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.3 08/23/2018
May 17 07:59:41 asok kernel: RIP: 0010:__nf_conntrack_confirm+0x9b/0x1e6
May 17 07:59:41 asok kernel: Code: e8 64 f9 ff ff 44 89 fa 89 c6 41 89 c4 48 c1 eb 20 89 df 41 89 de e8 d5 f6 ff ff 84 c0 75 bb 48 8b 85 80 00 00 00 a8 08 74 18 <0f> 0b 89 df 44 89 e6 31 db e8 5d f3 ff ff e8 30 f6 ff ff e9 22 01
May 17 07:59:41 asok kernel: RSP: 0018:ffffc900066e08a8 EFLAGS: 00010202
May 17 07:59:41 asok kernel: RAX: 0000000000000188 RBX: 0000000000003c05 RCX: 00000000b0e84f5c
May 17 07:59:41 asok kernel: RDX: 0000000000000000 RSI: 0000000000000061 RDI: ffffffff820099c4
May 17 07:59:41 asok kernel: RBP: ffff888179d86f00 R08: 00000000fac24da6 R09: ffff8888a1211ae0
May 17 07:59:41 asok kernel: R10: 0000000000000158 R11: ffff8884a537f200 R12: 0000000000008461
May 17 07:59:41 asok kernel: R13: ffffffff8210db40 R14: 0000000000003c05 R15: 0000000000000000
May 17 07:59:41 asok kernel: FS:  0000000000000000(0000) GS:ffff88885fc40000(0000) knlGS:0000000000000000
May 17 07:59:41 asok kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 17 07:59:41 asok kernel: CR2: 000014faebee29f8 CR3: 000000000200c006 CR4: 00000000001706e0
May 17 07:59:41 asok kernel: Call Trace:
May 17 07:59:41 asok kernel: <IRQ>
May 17 07:59:41 asok kernel: nf_conntrack_confirm+0x2f/0x36
May 17 07:59:41 asok kernel: nf_hook_slow+0x39/0x8e
May 17 07:59:41 asok kernel: nf_hook.constprop.0+0xb1/0xd8
May 17 07:59:41 asok kernel: ? ip_protocol_deliver_rcu+0xfe/0xfe
May 17 07:59:41 asok kernel: ip_local_deliver+0x49/0x75
May 17 07:59:41 asok kernel: ip_sabotage_in+0x43/0x4d
May 17 07:59:41 asok kernel: nf_hook_slow+0x39/0x8e
May 17 07:59:41 asok kernel: nf_hook.constprop.0+0xb1/0xd8
May 17 07:59:41 asok kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
May 17 07:59:41 asok kernel: ip_rcv+0x41/0x61
May 17 07:59:41 asok kernel: __netif_receive_skb_one_core+0x74/0x95
May 17 07:59:41 asok kernel: netif_receive_skb+0x79/0xa1
May 17 07:59:41 asok kernel: br_handle_frame_finish+0x30d/0x351
May 17 07:59:41 asok kernel: ? ipt_do_table+0x570/0x5c0 [ip_tables]
May 17 07:59:41 asok kernel: ? br_pass_frame_up+0xda/0xda
May 17 07:59:41 asok kernel: br_nf_hook_thresh+0xa3/0xc3
May 17 07:59:41 asok kernel: ? br_pass_frame_up+0xda/0xda
May 17 07:59:41 asok kernel: br_nf_pre_routing_finish+0x23d/0x264
May 17 07:59:41 asok kernel: ? br_pass_frame_up+0xda/0xda
May 17 07:59:41 asok kernel: ? br_handle_frame_finish+0x351/0x351
May 17 07:59:41 asok kernel: ? nf_nat_ipv4_in+0x1e/0x4a [nf_nat]
May 17 07:59:41 asok kernel: ? br_nf_forward_finish+0xd0/0xd0
May 17 07:59:41 asok kernel: ? br_handle_frame_finish+0x351/0x351
May 17 07:59:41 asok kernel: NF_HOOK+0xd7/0xf7
May 17 07:59:41 asok kernel: ? br_nf_forward_finish+0xd0/0xd0
May 17 07:59:41 asok kernel: br_nf_pre_routing+0x229/0x239
May 17 07:59:41 asok kernel: ? br_nf_forward_finish+0xd0/0xd0
May 17 07:59:41 asok kernel: br_handle_frame+0x25e/0x2a6
May 17 07:59:41 asok kernel: ? br_pass_frame_up+0xda/0xda
May 17 07:59:41 asok kernel: __netif_receive_skb_core+0x335/0x4e7
May 17 07:59:41 asok kernel: ? __update_load_avg_cfs_rq+0xd6/0x18f
May 17 07:59:41 asok kernel: __netif_receive_skb_list_core+0x78/0x104
May 17 07:59:41 asok kernel: netif_receive_skb_list_internal+0x1bf/0x1f2
May 17 07:59:41 asok kernel: ? dev_gro_receive+0x55d/0x578
May 17 07:59:41 asok kernel: gro_normal_list+0x1d/0x39
May 17 07:59:41 asok kernel: napi_complete_done+0x79/0x104
May 17 07:59:41 asok kernel: igb_poll+0xc9a/0xec8 [igb]
May 17 07:59:41 asok kernel: ? do_send_sig_info+0x63/0x86
May 17 07:59:41 asok kernel: net_rx_action+0xf4/0x29d
May 17 07:59:41 asok kernel: __do_softirq+0xc4/0x1c2
May 17 07:59:41 asok kernel: asm_call_irq_on_stack+0x12/0x20
May 17 07:59:41 asok kernel: </IRQ>
May 17 07:59:41 asok kernel: do_softirq_own_stack+0x2c/0x39
May 17 07:59:41 asok kernel: __irq_exit_rcu+0x45/0x80
May 17 07:59:41 asok kernel: common_interrupt+0x119/0x12e
May 17 07:59:41 asok kernel: asm_common_interrupt+0x1e/0x40
May 17 07:59:41 asok kernel: RIP: 0010:arch_local_irq_enable+0x7/0x8
May 17 07:59:41 asok kernel: Code: 00 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 9c 58 0f 1f 44 00 00 c3 fa 66 0f 1f 44 00 00 c3 fb 66 0f 1f 44 00 00 <c3> 55 8b af 28 04 00 00 b8 01 00 00 00 45 31 c9 53 45 31 d2 39 c5
May 17 07:59:41 asok kernel: RSP: 0018:ffffc900063fbea0 EFLAGS: 00000246
May 17 07:59:41 asok kernel: RAX: ffff88885fc62380 RBX: 0000000000000004 RCX: 000000000000001f
May 17 07:59:41 asok kernel: RDX: 0000000000000000 RSI: 00000000313b14ef RDI: 0000000000000000
May 17 07:59:41 asok kernel: RBP: ffffe8f7ff47f700 R08: 00002a9d58969783 R09: 0000000000000000
May 17 07:59:41 asok kernel: R10: 000000000000f71a R11: 071c71c71c71c71c R12: 00002a9d58969783
May 17 07:59:41 asok kernel: R13: ffffffff820c7ec0 R14: 0000000000000004 R15: 0000000000000000
May 17 07:59:41 asok kernel: cpuidle_enter_state+0x101/0x1c4
May 17 07:59:41 asok kernel: cpuidle_enter+0x25/0x31
May 17 07:59:41 asok kernel: do_idle+0x1a6/0x214
May 17 07:59:41 asok kernel: cpu_startup_entry+0x18/0x1a
May 17 07:59:41 asok kernel: secondary_startup_64_no_verify+0xb0/0xbb
May 17 07:59:41 asok kernel: ---[ end trace eaf4c384b8419f52 ]---
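
For anyone comparing their own syslog against the trace above, here is a rough Python sketch that pulls out each kernel "cut here" block and reports where the WARNING fired and whether the call chain passes through bridge (br_*) functions. The function name summarize_traces and the regular expressions are my own assumptions based only on the line format shown in this post; other kernels may format these lines differently.

```python
import re

# Assumed format, taken from the trace in this post:
# "WARNING: CPU: 15 PID: 0 at <file>:<line> <symbol>+0x.../0x..."
WARNING_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+):(\d+) ([\w.]+)")
# A stack frame such as "kernel: br_handle_frame+0x25e/0x2a6".
# Frames prefixed with "? " are unreliable guesses by the unwinder.
FRAME_RE = re.compile(r"kernel:\s+(\? )?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_traces(lines):
    """Return one summary dict per '[ cut here ] ... end trace' block."""
    traces = []
    current = None
    for line in lines:
        if "[ cut here ]" in line:
            current = {"source": None, "line": None, "symbol": None, "frames": []}
            continue
        if current is None:
            continue  # not inside a trace block
        if "---[ end trace" in line:
            # Flag traces whose reliable frames include bridge code.
            current["touches_bridge"] = any(
                f.startswith("br_") for f in current["frames"]
            )
            traces.append(current)
            current = None
            continue
        warning = WARNING_RE.search(line)
        if warning:
            current["source"] = warning.group(1)
            current["line"] = int(warning.group(2))
            current["symbol"] = warning.group(3)
            continue
        frame = FRAME_RE.search(line)
        if frame and frame.group(1) is None:  # skip "? " frames
            current["frames"].append(frame.group(2))
    return traces
```

To try it, something like `summarize_traces(open("/var/log/syslog"))` should work if your syslog lives at that path (an assumption; adjust as needed). For the trace above it would report the warning at net/netfilter/nf_conntrack_core.c line 1120 with bridge functions in the call chain.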

 

I'll report back after I get the VLAN set up and the dockers moved over to it.

 

Mike


Update:

 

For troubleshooting I have either shut down or migrated all dockers from "br0" to "bridge". I also upgraded back to 6.9.2 (from 6.9.1), since this issue seems to affect both 6.9.1 & 6.9.2.
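
In case it helps anyone doing the same migration, here is a small Python sketch for checking that nothing is still attached to br0. It parses the JSON printed by `docker network inspect br0` and lists each attached container with its address. The function names are mine, and it assumes the inspect output carries the usual "Containers" map with "Name" and "IPv4Address" fields; I have not checked this against every Docker version.

```python
import json
import subprocess


def containers_on_network(inspect_output):
    """Parse `docker network inspect` JSON and return (name, ipv4) pairs."""
    networks = json.loads(inspect_output)
    found = []
    for net in networks:
        # "Containers" can be missing or empty when nothing is attached.
        for info in (net.get("Containers") or {}).values():
            found.append((info.get("Name", ""), info.get("IPv4Address", "")))
    return sorted(found)


def inspect_network(name="br0"):
    """Run the Docker CLI and return its raw JSON output (needs docker installed)."""
    result = subprocess.run(
        ["docker", "network", "inspect", name],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `containers_on_network(inspect_network("br0"))` on the server should return an empty list once every container has been stopped or moved to "bridge".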

 

If this works, I'll likely use a VLAN as described in the links above.

 

I'll see how long the system stays up. If I don't make any further updates to this thread after 2 weeks, assume that the problem is solved.

  • 2 weeks later...

Two weeks and the system hasn't crashed, so I'm going to assume that removing all IPs from br0 and moving them to "bridge" has stabilized the server. Next I'll re-IP all my dockers back onto br0, but with a VLAN added as described above.

 

I hope Lime Tech addresses this issue in a future release so that this workaround isn't necessary.

  • subagon changed the title to [Workaround] 6.9.1 & 6.9.2: Kernel panic - not syncing
