• 6.9.0/6.9.1 - Kernel Panic due to netfilter (nf_nat_setup_info) - Docker Static IP (macvlan)


    CorneliousJD
    • Urgent

    So I had posted another thread about how, after a kernel panic, Docker host access to custom networks doesn't work until Docker is stopped/restarted on 6.9.0

     

     

    After further investigation and setting up syslogging, it appears that it may actually be that host access that's CAUSING the kernel panic? 

    EDIT 3/16: I guess I needed to create a VLAN for my dockers with static IPs; so far that's working, so it's probably not HOST access causing the issue, but rather br0 static IPs being set. See the following posts below.

     

    Here's my last kernel panic that thankfully got logged to syslog. It references macvlan and netfilter. I don't know enough to be super useful here, but this is my docker setup.

     

    [Screenshot: my Docker network setup]

     

    Mar 12 03:57:07 Server kernel: ------------[ cut here ]------------
    Mar 12 03:57:07 Server kernel: WARNING: CPU: 17 PID: 626 at net/netfilter/nf_nat_core.c:614 nf_nat_setup_info+0x6c/0x652 [nf_nat]
    Mar 12 03:57:07 Server kernel: Modules linked in: ccp macvlan xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap veth xt_nat xt_MASQUERADE iptable_nat nf_nat xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables bonding igb i2c_algo_bit cp210x usbserial sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd ipmi_ssif isci glue_helper mpt3sas i2c_i801 rapl libsas i2c_smbus input_leds i2c_core ahci intel_cstate raid_class led_class acpi_ipmi intel_uncore libahci scsi_transport_sas wmi ipmi_si button [last unloaded: ipmi_devintf]
    Mar 12 03:57:07 Server kernel: CPU: 17 PID: 626 Comm: kworker/17:2 Tainted: G        W         5.10.19-Unraid #1
    Mar 12 03:57:07 Server kernel: Hardware name: Supermicro PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
    Mar 12 03:57:07 Server kernel: Workqueue: events macvlan_process_broadcast [macvlan]
    Mar 12 03:57:07 Server kernel: RIP: 0010:nf_nat_setup_info+0x6c/0x652 [nf_nat]
    Mar 12 03:57:07 Server kernel: Code: 89 fb 49 89 f6 41 89 d4 76 02 0f 0b 48 8b 93 80 00 00 00 89 d0 25 00 01 00 00 45 85 e4 75 07 89 d0 25 80 00 00 00 85 c0 74 07 <0f> 0b e9 1f 05 00 00 48 8b 83 90 00 00 00 4c 8d 6c 24 20 48 8d 73
    Mar 12 03:57:07 Server kernel: RSP: 0018:ffffc90006778c38 EFLAGS: 00010202
    Mar 12 03:57:07 Server kernel: RAX: 0000000000000080 RBX: ffff88837c8303c0 RCX: ffff88811e834880
    Mar 12 03:57:07 Server kernel: RDX: 0000000000000180 RSI: ffffc90006778d14 RDI: ffff88837c8303c0
    Mar 12 03:57:07 Server kernel: RBP: ffffc90006778d00 R08: 0000000000000000 R09: ffff889083c68160
    Mar 12 03:57:07 Server kernel: R10: 0000000000000158 R11: ffff8881e79c1400 R12: 0000000000000000
    Mar 12 03:57:07 Server kernel: R13: 0000000000000000 R14: ffffc90006778d14 R15: 0000000000000001
    Mar 12 03:57:07 Server kernel: FS:  0000000000000000(0000) GS:ffff88903fc40000(0000) knlGS:0000000000000000
    Mar 12 03:57:07 Server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar 12 03:57:07 Server kernel: CR2: 000000c000b040b8 CR3: 000000000200c005 CR4: 00000000001706e0
    Mar 12 03:57:07 Server kernel: Call Trace:
    Mar 12 03:57:07 Server kernel: <IRQ>
    Mar 12 03:57:07 Server kernel: ? activate_task+0x9/0x12
    Mar 12 03:57:07 Server kernel: ? resched_curr+0x3f/0x4c
    Mar 12 03:57:07 Server kernel: ? ipt_do_table+0x49b/0x5c0 [ip_tables]
    Mar 12 03:57:07 Server kernel: ? try_to_wake_up+0x1b0/0x1e5
    Mar 12 03:57:07 Server kernel: nf_nat_alloc_null_binding+0x71/0x88 [nf_nat]
    Mar 12 03:57:07 Server kernel: nf_nat_inet_fn+0x91/0x182 [nf_nat]
    Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
    Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
    Mar 12 03:57:07 Server kernel: ? ip_protocol_deliver_rcu+0xfe/0xfe
    Mar 12 03:57:07 Server kernel: ip_local_deliver+0x49/0x75
    Mar 12 03:57:07 Server kernel: ip_sabotage_in+0x43/0x4d
    Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
    Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
    Mar 12 03:57:07 Server kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
    Mar 12 03:57:07 Server kernel: ip_rcv+0x41/0x61
    Mar 12 03:57:07 Server kernel: __netif_receive_skb_one_core+0x74/0x95
    Mar 12 03:57:07 Server kernel: process_backlog+0xa3/0x13b
    Mar 12 03:57:07 Server kernel: net_rx_action+0xf4/0x29d
    Mar 12 03:57:07 Server kernel: __do_softirq+0xc4/0x1c2
    Mar 12 03:57:07 Server kernel: asm_call_irq_on_stack+0x12/0x20
    Mar 12 03:57:07 Server kernel: </IRQ>
    Mar 12 03:57:07 Server kernel: do_softirq_own_stack+0x2c/0x39
    Mar 12 03:57:07 Server kernel: do_softirq+0x3a/0x44
    Mar 12 03:57:07 Server kernel: netif_rx_ni+0x1c/0x22
    Mar 12 03:57:07 Server kernel: macvlan_broadcast+0x10e/0x13c [macvlan]
    Mar 12 03:57:07 Server kernel: macvlan_process_broadcast+0xf8/0x143 [macvlan]
    Mar 12 03:57:07 Server kernel: process_one_work+0x13c/0x1d5
    Mar 12 03:57:07 Server kernel: worker_thread+0x18b/0x22f
    Mar 12 03:57:07 Server kernel: ? process_scheduled_works+0x27/0x27
    Mar 12 03:57:07 Server kernel: kthread+0xe5/0xea
    Mar 12 03:57:07 Server kernel: ? __kthread_bind_mask+0x57/0x57
    Mar 12 03:57:07 Server kernel: ret_from_fork+0x22/0x30
    Mar 12 03:57:07 Server kernel: ---[ end trace b3ca21ac5f2c2720 ]---

     




    User Feedback

    Recommended Comments



    5 hours ago, CorneliousJD said:

     

    Not a dev, but as noted in this thread a few times it's your br0 ones causing the issue. Not bridge vs host. 

     

    You probably need to put those on a separate vlan. 

     

    This is admittedly a workaround, but one that's worked for me. Stable with zero crashes since doing it over a month ago.

    It has to be the br0 ones, as I've turned the others off completely. I really don't want to mess around with vlans in my home network, and complicate things further. I'll probably spin up a VM in ESXi for docker for now, and if this isn't fixed in the next few months, I may just end up migrating to a new platform. 6.7 broke things for me, as did 6.8 and 6.8.3, so I came from 6.6.7. I promised myself prior to 6.9.0 if this was another failed upgrade, I'd look into alternatives to unRAID, which really sucks, as I have 2 unraid pro licenses, and have been using unRAID for several years.

    Edited by vagrantprodigy
    26 minutes ago, vagrantprodigy said:

    It has to be the br0 ones, as I've turned the others off completely. I really don't want to mess around with vlans in my home network, and complicate things further. I'll probably spin up a VM in ESXi for docker for now, and if this isn't fixed in the next few months, I may just end up migrating to a new platform. 6.7 broke things for me, as did 6.8 and 6.8.3, so I came from 6.6.7. I promised myself prior to 6.9.0 if this was another failed upgrade, I'd look into alternatives to unRAID, which really sucks, as I have 2 unraid pro licenses, and have been using unRAID for several years.

    To each their own, a single vlan for Dockers is easier than setting up ESXi IMO. 

     

    Well, I agree; I didn't want to complicate my home network any further either, but it was an extremely simple process that took me less than 10 minutes to complete. In my eyes, 10 minutes to avoid crashes was definitely worth it.

     

    I've already linked to a post somewhere in this thread that goes over the details of adding a docker VLAN, complete with photos!
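    For anyone curious what that ends up looking like under the hood, here's a rough sketch (my own illustration, not from the linked post; the VLAN ID, interface names, subnet, and network name below are made-up examples) of the macvlan-on-a-VLAN arrangement Unraid builds when you enable a VLAN and give containers static IPs on it:

```shell
# Rough sketch -- VLAN ID, names, and subnet are example values,
# not the exact commands Unraid runs internally.

# 802.1Q sub-interface for VLAN 2 on top of the physical NIC
ip link add link eth0 name eth0.2 type vlan id 2
ip link set eth0.2 up

# Docker macvlan network bound to the VLAN sub-interface, so container
# traffic is tagged onto VLAN 2 instead of sharing br0 with the host
docker network create -d macvlan \
  --subnet=192.168.2.0/24 \
  --gateway=192.168.2.1 \
  -o parent=eth0.2 vlan2net

# A container with a static IP on the VLAN
docker run -d --network vlan2net --ip 192.168.2.50 nginx
```

    The point is that the VLAN is its own broadcast domain, so the containers' broadcast traffic stays off the interface the host lives on; your switch/router just has to carry VLAN 2 and route between the two subnets.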


    Found this thread after suffering from several kernel panics that halt my server, with lots of nf_**** in the trace.

     

    Just tried disabling host access to custom networks in the Docker configuration; let's see if it helps.

     

    It's strange because if I boot into safe mode, there are no more kernel panics. And I don't have any special plugins installed, only CA Store and some Dynamix ones.


    I have been having this issue as well. I do have one docker that is using br0, but unfortunately I cannot move it to a VLAN. Is there any other way to fix this? Or is there an update on when a permanent fix will be out?

    1 minute ago, Eddie Seelke said:

    I have been having this issue as well. I do have one docker that is using br0, but unfortunately I cannot move it to a VLAN. Is there any other way to fix this? Or is there an update on when a permanent fix will be out?

     

    Is there a reason you can't run a VLAN? (hardware doesn't support it?)

    Alternatively, you could add a 2nd NIC to your server and run the container on a separate IP address from that, I believe? I don't know the specifics exactly, but I think that would work since the container wouldn't be assigned to br0 anymore.

     

    Also, are you able to simply not run that container on its own IP?

     

     

    I don't know of any other way besides those two options personally that will fix this. 

    I had success with the VLAN method, and implemented it in less than 10 minutes total. 

    1 hour ago, CorneliousJD said:

     

    Is there a reason you can't run a VLAN? (hardware doesn't support it?)

    No, I have to use a specific IP for this service and using a VLAN would change it. Well, I don't have to, but it would be too much work to change it. lol

    6 minutes ago, Eddie Seelke said:

    No, I have to use a specific IP for this service and using a VLAN would change it. Well, I don't have to, but it would be too much work to change it. lol

     

    What service is it? Out of curiosity at this point really. 

     

    What about adding another NIC (if your motherboard doesn't already have 2 or more LAN ports?) 

    That could be a very cheap $15 card for a gigabit NIC.

    https://amzn.to/3zfpmzF (referral link)

    I believe that way you could keep the same IP address, but give it its own network interface instead?


    Hey everyone, just a quick update on this issue. The main problem we've faced is the inability to recreate this issue in our labs. We are still actively working on it, but if anyone here knows the full solution, we are open to providing a bounty for it. Just PM me and so long as the fix isn't a hack or workaround, we will gladly compensate you for your time and work. 


    I am in the rather unique situation of being able to build an almost exact clone of my setup (RAM will be different).

    I could set up a VPN and provide Limetech with access for testing.

    11 hours ago, jonp said:

    Hey everyone, just a quick update on this issue. The main problem we've faced is the inability to recreate this issue in our labs. We are still actively working on it, but if anyone here knows the full solution, we are open to providing a bounty for it. Just PM me and so long as the fix isn't a hack or workaround, we will gladly compensate you for your time and work. 

    Have you tried Docker with a custom IP (macvlan) and host access bound to br0?

    After binding to br1 (another NIC), I've never hit this issue.

    15 hours ago, jonp said:

    Hey everyone, just a quick update on this issue. The main problem we've faced is the inability to recreate this issue in our labs. We are still actively working on it, but if anyone here knows the full solution, we are open to providing a bounty for it. Just PM me and so long as the fix isn't a hack or workaround, we will gladly compensate you for your time and work. 

     

    Thank you for continuing to update everyone on this, I appreciate it even though I've been able to work around the issue with a VLAN.

     

    My only "fixes" are workarounds for now - I'm only sharing them here to help people who are frustrated by the crashes do something free or at least very cheap (add a $15 NIC for another interface) to get around the issue.

     

    Hoping someone here has real answers at some point though, that would be great!

     

    If there's anything else I can provide from my system please let me know, happy to keep providing logs/diags/etc. 


    Going the vlan route did not work for me. I ended up having a call trace and subsequent kernel panic after a number of days. So far, creating a second interface (br2 in my case) with the unused onboard NIC seems to be the "fix" for me at this point.

     

    I don't want to muddy the waters any more than they have to be, but since it could be something external to Unraid on the network (multicast, who knows...) that Unraid is choking on, perhaps it would be beneficial to also mention a brief summary of the network gear used? Maybe a common denominator will surface that can serve as a hypothesis-generation device. Perhaps Wireshark could be utilized to help troubleshoot as well? Just throwing out ideas at this point.
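    To put a rough number on the broadcast/multicast theory without a full Wireshark capture, something like the following could be run on each affected system for comparison (my own suggestion, not tested as a fix; the interface name br0 is just an example -- use whatever your macvlan network hangs off):

```shell
# Watch the first 50 broadcast/multicast frames hitting the bridge
tcpdump -i br0 -nn -c 50 'broadcast or multicast'

# Or just count frames over one minute for an apples-to-apples comparison
# between different network environments
timeout 60 tcpdump -i br0 -nn 'broadcast or multicast' 2>/dev/null | wc -l
```

    If systems that crash see wildly more broadcast traffic than ones that don't, that would support the common-denominator idea.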

     

    For reference, my offending Unraid system:

    Ryzen system, 3600. B450 chipset.

    br0 is Intel X520-T2 10Gbe adapter via DAC to Mikrotik CRS305-1G-4S+IN 10Gbe, uplink to "core" Unifi switch.

     

    br2 is Intel I211-AT copper to "core" 1Gbe Unifi switch.

    9 minutes ago, rodan5150 said:

    Going the vlan route did not work for me. I ended up having a call trace and subsequent kernel panic after a number of days. So far, creating a second interface (br2 in my case) with the unused onboard NIC seems to be the "fix" for me at this point.

     

    I went this route and even created user defined bridge networks on a completely different NIC and still had the same issues. I suspect you'll see the issue resurface in a week or so, as this is exactly what happened to me. I have not tested using VLANs, but moved any containers needing their own IPs off of Unraid and haven't had issues for months.


    I'm not sure how true this is, but I was under the impression that it was broadcast traffic, in conjunction with static IPs, that was causing macvlan to s*** the bed and cause kernel panics.  Now I'm not sure if this is true with every vendor, but Ubiquiti switches/routers do not route broadcast packets between networks (so I was told/read), which was the idea behind creating VLANs for docker containers with static IPs.  

     

    I also think it was thrown out there that certain network adapters may be more prone to this than others.

     

    5 hours ago, rodan5150 said:

    I don't want to muddy the waters any more than they have to be, but since it could be something external to Unraid on the network (multicast, who knows...) that Unraid is choking on, perhaps it would be beneficial to also mention a brief summary of the network gear used? Maybe a common denominator will surface that can serve as a hypothesis-generation device. Perhaps Wireshark could be utilized to help troubleshoot as well? Just throwing out ideas at this point.

    This is a very good point since it could be an external device on the network that is hammering docker and therefore the macvlan interface, which isn't present in a "test lab".

     

    Just my $0.02

    6 hours ago, ryanhaver said:

     

    I went this route and even created user defined bridge networks on a completely different NIC and still had the same issues. I suspect you'll see the issue resurface in a week or so, as this is exactly what happened to me. I have not tested using VLANs, but moved any containers needing their own IPs off of Unraid and haven't had issues for months.

     

    I'm worried about this, but I'm trying to stay hopeful the second NIC will be the band-aid for me for now. Uptime is almost 14 days, and no issues to speak of thus far, knock on wood...

     

     


    I purchased a second NIC to hopefully bypass this issue. Are there any instructions on what settings I need to change to utilize the second NIC for my system?


    Managed to catch this today. With the lightest of Google searches, it looks like it may be a bug/regression in the kernel. I very well could be wrong, but maybe?

     

    https://askubuntu.com/questions/1293945/20-10-complete-system-freeze-with-general-protection-fault-smp-nopti

    https://www.spinics.net/lists/linux-nfs/msg78091.html

    https://www.spinics.net/lists/amd-gfx/msg48596.html

     

    Jun 12 14:00:01 thelibrary kernel:
    Jun 12 14:00:44 thelibrary kernel: general protection fault, probably for non-canonical address 0x1090000ffffff76: 0000 [#1] SMP NOPTI
    Jun 12 14:00:44 thelibrary kernel: CPU: 6 PID: 10937 Comm: qbittorrent-nox Tainted: P S W O 5.10.28-Unraid #1
    Jun 12 14:00:44 thelibrary kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470D4U, BIOS L4.21 04/15/2021
    Jun 12 14:00:44 thelibrary kernel: RIP: 0010:nf_nat_setup_info+0x129/0x6aa [nf_nat]
    Jun 12 14:00:44 thelibrary kernel: Code: ff 48 8b 15 ef 6a 00 00 89 c0 48 8d 04 c2 48 8b 10 48 85 d2 74 80 48 81 ea 98 00 00 00 48 85 d2 0f 84 70 ff ff ff 8a 44 24 46 <38> 42 46 74 09 48 8b 92 98 00 00 00 eb d9 48 8b 4a 20 48 8b 42 28
    Jun 12 14:00:44 thelibrary kernel: RSP: 0018:ffffc90000338700 EFLAGS: 00010202
    Jun 12 14:00:44 thelibrary kernel: RAX: ffff88818b422f06 RBX: ffff888108b21a40 RCX: 0000000000000000
    Jun 12 14:00:44 thelibrary kernel: RDX: 01090000ffffff76 RSI: 000000003f50ed19 RDI: ffffc90000338720
    Jun 12 14:00:44 thelibrary kernel: RBP: ffffc900003387c8 R08: 0000000098f45bae R09: ffff88813dd40620
    Jun 12 14:00:44 thelibrary kernel: R10: 0000000000000348 R11: ffffffff815cbe4b R12: 0000000000000000
    Jun 12 14:00:44 thelibrary kernel: R13: ffffc90000338720 R14: ffffc900003387dc R15: ffffffff8210b440
    Jun 12 14:00:44 thelibrary kernel: FS: 0000146c98419700(0000) GS:ffff88881e980000(0000) knlGS:0000000000000000
    Jun 12 14:00:44 thelibrary kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jun 12 14:00:44 thelibrary kernel: CR2: 0000150d3d688320 CR3: 000000020149a000 CR4: 0000000000350ee0
    Jun 12 14:00:44 thelibrary kernel: Call Trace:
    Jun 12 14:00:44 thelibrary kernel: <IRQ>
    Jun 12 14:00:44 thelibrary kernel: ? fq_enqueue+0x25b/0x4e8
    Jun 12 14:00:44 thelibrary kernel: ? igb_xmit_frame_ring+0x7c5/0x8fd [igb]
    Jun 12 14:00:44 thelibrary kernel: ? __ksize+0x15/0x64
    Jun 12 14:00:44 thelibrary kernel: ? krealloc+0x26/0x7a
    Jun 12 14:00:44 thelibrary kernel: nf_nat_masquerade_ipv4+0x10b/0x131 [nf_nat]
    Jun 12 14:00:44 thelibrary kernel: masquerade_tg+0x44/0x5e [xt_MASQUERADE]
    Jun 12 14:00:44 thelibrary kernel: ? __qdisc_run+0x21d/0x3c9
    Jun 12 14:00:44 thelibrary kernel: ipt_do_table+0x51a/0x5c0 [ip_tables]
    Jun 12 14:00:44 thelibrary kernel: ? __dev_queue_xmit+0x4d9/0x501
    Jun 12 14:00:44 thelibrary kernel: ? fib_validate_source+0xb0/0xda
    Jun 12 14:00:44 thelibrary kernel: nf_nat_inet_fn+0xe9/0x183 [nf_nat]
    Jun 12 14:00:44 thelibrary kernel: nf_nat_ipv4_out+0xf/0x88 [nf_nat]
    Jun 12 14:00:44 thelibrary kernel: nf_hook_slow+0x39/0x8e
    Jun 12 14:00:44 thelibrary kernel: nf_hook+0xab/0xd3
    Jun 12 14:00:44 thelibrary kernel: ? __ip_finish_output+0x146/0x146
    Jun 12 14:00:44 thelibrary kernel: ip_output+0x7d/0x8a
    Jun 12 14:00:44 thelibrary kernel: ? __ip_finish_output+0x146/0x146
    Jun 12 14:00:44 thelibrary kernel: ip_forward+0x3f1/0x420
    Jun 12 14:00:44 thelibrary kernel: ? ip_check_defrag+0x18f/0x18f
    Jun 12 14:00:44 thelibrary kernel: ip_sabotage_in+0x43/0x4d [br_netfilter]
    Jun 12 14:00:44 thelibrary kernel: nf_hook_slow+0x39/0x8e
    Jun 12 14:00:44 thelibrary kernel: nf_hook.constprop.0+0xb1/0xd8
    Jun 12 14:00:44 thelibrary kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
    Jun 12 14:00:44 thelibrary kernel: ip_rcv+0x41/0x61
    Jun 12 14:00:44 thelibrary kernel: __netif_receive_skb_one_core+0x74/0x95
    Jun 12 14:00:44 thelibrary kernel: netif_receive_skb+0x79/0xa1
    Jun 12 14:00:44 thelibrary kernel: br_handle_frame_finish+0x30d/0x351
    Jun 12 14:00:44 thelibrary kernel: ? br_pass_frame_up+0xda/0xda
    Jun 12 14:00:44 thelibrary kernel: br_nf_hook_thresh+0xa3/0xc3 [br_netfilter]
    Jun 12 14:00:44 thelibrary kernel: ? br_pass_frame_up+0xda/0xda
    Jun 12 14:00:44 thelibrary kernel: br_nf_pre_routing_finish+0x23d/0x264 [br_netfilter]
    Jun 12 14:00:44 thelibrary kernel: ? br_pass_frame_up+0xda/0xda
    Jun 12 14:00:44 thelibrary kernel: ? br_handle_frame_finish+0x351/0x351
    Jun 12 14:00:44 thelibrary kernel: ? nf_nat_ipv4_pre_routing+0x1e/0x4a [nf_nat]
    Jun 12 14:00:44 thelibrary kernel: ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
    Jun 12 14:00:44 thelibrary kernel: ? br_handle_frame_finish+0x351/0x351
    Jun 12 14:00:44 thelibrary kernel: NF_HOOK+0xd7/0xf7 [br_netfilter]
    Jun 12 14:00:44 thelibrary kernel: ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
    Jun 12 14:00:44 thelibrary kernel: br_nf_pre_routing+0x229/0x239 [br_netfilter]
    Jun 12 14:00:44 thelibrary kernel: ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
    Jun 12 14:00:44 thelibrary kernel: br_handle_frame+0x25e/0x2a6
    Jun 12 14:00:44 thelibrary kernel: ? br_pass_frame_up+0xda/0xda
    Jun 12 14:00:44 thelibrary kernel: __netif_receive_skb_core+0x335/0x4e7
    Jun 12 14:00:44 thelibrary kernel: __netif_receive_skb_one_core+0x3d/0x95
    Jun 12 14:00:44 thelibrary kernel: process_backlog+0xa3/0x13b
    Jun 12 14:00:44 thelibrary kernel: net_rx_action+0xf4/0x29d
    Jun 12 14:00:44 thelibrary kernel: __do_softirq+0xc4/0x1c2
    Jun 12 14:00:44 thelibrary kernel: asm_call_irq_on_stack+0x12/0x20
    Jun 12 14:00:44 thelibrary kernel: </IRQ>
    Jun 12 14:00:44 thelibrary kernel: do_softirq_own_stack+0x2c/0x39
    Jun 12 14:00:44 thelibrary kernel: do_softirq+0x3a/0x44
    Jun 12 14:00:44 thelibrary kernel: __local_bh_enable_ip+0x3b/0x43
    Jun 12 14:00:44 thelibrary kernel: ip_finish_output2+0x2ec/0x31f
    Jun 12 14:00:44 thelibrary kernel: ? ipv4_mtu+0x3d/0x64
    Jun 12 14:00:44 thelibrary kernel: __ip_queue_xmit+0x2a3/0x2df
    Jun 12 14:00:44 thelibrary kernel: __tcp_transmit_skb+0x845/0x8ba
    Jun 12 14:00:44 thelibrary kernel: tcp_connect+0x76d/0x7f4
    Jun 12 14:00:44 thelibrary kernel: tcp_v4_connect+0x3fc/0x455
    Jun 12 14:00:44 thelibrary kernel: __inet_stream_connect+0xd3/0x2b6
    Jun 12 14:00:44 thelibrary kernel: inet_stream_connect+0x34/0x49
    Jun 12 14:00:44 thelibrary kernel: __sys_connect+0x62/0x9d
    Jun 12 14:00:44 thelibrary kernel: ? __sys_bind+0x78/0x9f
    Jun 12 14:00:44 thelibrary kernel: __x64_sys_connect+0x11/0x14
    Jun 12 14:00:44 thelibrary kernel: do_syscall_64+0x5d/0x6a
    Jun 12 14:00:44 thelibrary kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
    Jun 12 14:00:44 thelibrary kernel: RIP: 0033:0x146c9bdec53b
    Jun 12 14:00:44 thelibrary kernel: Code: 83 ec 18 89 54 24 0c 48 89 34 24 89 7c 24 08 e8 bb fa ff ff 8b 54 24 0c 48 8b 34 24 41 89 c0 8b 7c 24 08 b8 2a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2f 44 89 c7 89 44 24 08 e8 f1 fa ff ff 8b 44
    Jun 12 14:00:44 thelibrary kernel: RSP: 002b:0000146c98416f20 EFLAGS: 00000293 ORIG_RAX: 000000000000002a
    Jun 12 14:00:44 thelibrary kernel: RAX: ffffffffffffffda RBX: 0000146c90d18c60 RCX: 0000146c9bdec53b
    Jun 12 14:00:44 thelibrary kernel: RDX: 0000000000000010 RSI: 0000146c90e4ac94 RDI: 000000000000004b
    Jun 12 14:00:44 thelibrary kernel: RBP: 0000146c98417160 R08: 0000000000000000 R09: 0000146c98418258
    Jun 12 14:00:44 thelibrary kernel: R10: 0000146c9841710c R11: 0000000000000293 R12: 0000146c90e4ac94
    Jun 12 14:00:44 thelibrary kernel: R13: 0000000000000000 R14: 0000146c98418258 R15: 0000146c90003310
    Jun 12 14:00:44 thelibrary kernel: Modules linked in: nvidia_uvm(PO) xt_nat xt_tcpudp veth macvlan xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs md_mod nvidia_drm(PO) nvidia_modeset(PO) drm_kms_helper drm backlight agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops nvidia(PO) ip6table_filter ip6_tables iptable_filter ip_tables x_tables igb i2c_algo_bit ipmi_ssif amd64_edac_mod edac_mce_amd kvm_amd wmi_bmof kvm crct10dif_pclmul crc32_pclmul crc32c_intel mpt3sas ghash_clmulni_intel aesni_intel i2c_piix4 crypto_simd cryptd raid_class i2c_core nvme ahci scsi_transport_sas nvme_core wmi glue_helper acpi_ipmi ccp k10temp rapl libahci button ipmi_si acpi_cpufreq [last unloaded: i2c_algo_bit]
    Jun 12 14:00:44 thelibrary kernel: ---[ end trace 98e92523c69e7e44 ]---
    Jun 12 14:00:44 thelibrary kernel: RIP: 0010:nf_nat_setup_info+0x129/0x6aa [nf_nat]
    Jun 12 14:00:44 thelibrary kernel: Code: ff 48 8b 15 ef 6a 00 00 89 c0 48 8d 04 c2 48 8b 10 48 85 d2 74 80 48 81 ea 98 00 00 00 48 85 d2 0f 84 70 ff ff ff 8a 44 24 46 <38> 42 46 74 09 48 8b 92 98 00 00 00 eb d9 48 8b 4a 20 48 8b 42 28
    Jun 12 14:00:44 thelibrary kernel: RSP: 0018:ffffc90000338700 EFLAGS: 00010202
    Jun 12 14:00:44 thelibrary kernel: RAX: ffff88818b422f06 RBX: ffff888108b21a40 RCX: 0000000000000000
    Jun 12 14:00:44 thelibrary kernel: RDX: 01090000ffffff76 RSI: 000000003f50ed19 RDI: ffffc90000338720
    Jun 12 14:00:44 thelibrary kernel: RBP: ffffc900003387c8 R08: 0000000098f45bae R09: ffff88813dd40620
    Jun 12 14:00:44 thelibrary kernel: R10: 0000000000000348 R11: ffffffff815cbe4b R12: 0000000000000000
    Jun 12 14:00:44 thelibrary kernel: R13: ffffc90000338720 R14: ffffc900003387dc R15: ffffffff8210b440
    Jun 12 14:00:44 thelibrary kernel: FS: 0000146c98419700(0000) GS:ffff88881e980000(0000) knlGS:0000000000000000
    Jun 12 14:00:44 thelibrary kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jun 12 14:00:44 thelibrary kernel: CR2: 0000150d3d688320 CR3: 000000020149a000 CR4: 0000000000350ee0
    Jun 12 14:00:44 thelibrary kernel: Kernel panic - not syncing: Fatal exception in interrupt

     

    • Featured Comment

    For Unraid version 6.10 I have replaced the Docker macvlan driver for the Docker ipvlan driver.

     

    IPvlan is a new twist on the tried and true network virtualization technique. The Linux implementations are extremely lightweight because rather than using the traditional Linux bridge for isolation, they are associated to a Linux Ethernet interface or sub-interface to enforce separation between networks and connectivity to the physical network.

     

    The end-user doesn't have to do anything special. At startup legacy networks are automatically removed and replaced by the new network approach. Please test once 6.10 becomes available. Internal testing looks very good so far.
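    For readers curious what the driver swap means in plain Docker terms, here is a minimal sketch (example names and subnet; not the exact commands Unraid runs internally) of the same static-IP network expressed with each driver:

```shell
# Example names/subnet -- shown side by side for contrast; you would
# only create one of these, since the subnets overlap.

# Old approach: macvlan gives each container its own MAC address on the
# parent NIC
docker network create -d macvlan \
  --subnet=192.168.1.0/24 --gateway=192.168.1.1 \
  -o parent=eth0 oldnet

# New approach: ipvlan in L2 mode shares the parent NIC's MAC across all
# containers, avoiding the macvlan broadcast-processing path that shows
# up in the call traces above (macvlan_process_broadcast)
docker network create -d ipvlan \
  --subnet=192.168.1.0/24 --gateway=192.168.1.1 \
  -o parent=eth0 -o ipvlan_mode=l2 newnet

docker run -d --network newnet --ip 192.168.1.50 nginx
```

    One practical difference to be aware of: with ipvlan, all containers present the host NIC's MAC address, so anything on your LAN that keys off MAC addresses (DHCP reservations, some routers' client lists) will see them differently.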

     


    I FINALLY have something I'm fully confident posting. It's been over two weeks since my last crash. I haven't seen a single panic in over a week and a half, but I DID reboot once during that time for an unrelated issue. Host Access to networks was disabled throughout; this did not improve the situation.

     

    My previous configuration used eth0 and eth1 in an Active 802.3AD bond, with Bridging enabled as well. This suffered the kernel panics described in this post. Docker containers were on dedicated IPs on interface br0.

     

    My CURRENT configuration uses eth0, Bonding off, Bridging off, with Docker containers configured "Network Type" as "Custom : eth0"  -- Host Access is still disabled; I have not tested enabling it as I do not require it.

     

    My kernel logs have been clean* since. I don't know if this will help anyone, but it has finally and conclusively fixed the problem for me, using Unraid 6.8.3  (Yes I know it's out of date, I'm trying to chase down a problem so I avoid changing two things at once.)

     

     

     

    *I have a recurring GPU-related message about copying the VBIOS which comes up regularly, no clue why, but it's unrelated.

    On 6/16/2021 at 7:02 PM, codefaux said:

    I FINALLY have something I'm fully confident posting. It's been over two weeks since my last crash. I haven't seen a single panic in over a week and a half, but I DID reboot once during that time for an unrelated issue. Host Access to networks was disabled throughout; this did not improve the situation.

     

    My previous configuration used eth0 and eth1 in an Active 802.3AD bond, with Bridging enabled as well. This suffered the kernel panics described in this post. Docker containers were on dedicated IPs on interface br0.

     

    My CURRENT configuration uses eth0, Bonding off, Bridging off, with Docker containers configured "Network Type" as "Custom : eth0"  -- Host Access is still disabled; I have not tested enabling it as I do not require it.

     

    My kernel logs have been clean* since. I don't know if this will help anyone, but it has finally and conclusively fixed the problem for me, using Unraid 6.8.3  (Yes I know it's out of date, I'm trying to chase down a problem so I avoid changing two things at once.)

     

     

     

    *I have a recurring GPU-related message about copying the VBIOS which comes up regularly, no clue why, but it's unrelated.

    Thanks for posting this, I reverted to 6.8.3 yesterday and had a crash overnight. I just adjusted my settings to see if this works for me.

     

    Edit: unless I have some setting wrong, this did not work for me. My server crashed overnight.

    Edited by Capt_Rekt
    On 6/9/2021 at 6:55 PM, ryanhaver said:

     

    I went this route and even created user defined bridge networks on a completely different NIC and still had the same issues. I suspect you'll see the issue resurface in a week or so, as this is exactly what happened to me. I have not tested using VLANs, but moved any containers needing their own IPs off of Unraid and haven't had issues for months.

     

    When you use a different NIC, make sure that both bonding and bridging are OFF for this interface.

    The Docker custom network will then be directly attached to the interface itself (e.g. eth1) and not the Linux bridge function (br1).
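    As a concrete illustration of the difference (interface names, subnet, and network names here are placeholders of my own, not Unraid's internals), the distinction is just which `parent` the custom macvlan network is created with:

```shell
# Placeholder values -- shown side by side for contrast; only one of
# these would exist at a time.

# Bridging ON for eth1: the custom network's parent is the bridge
docker network create -d macvlan \
  --subnet=192.168.1.0/24 --gateway=192.168.1.1 \
  -o parent=br1 bridged

# Bridging OFF for eth1: the parent is the raw interface itself
docker network create -d macvlan \
  --subnet=192.168.1.0/24 --gateway=192.168.1.1 \
  -o parent=eth1 direct

# Verify which parent an existing network uses:
docker network inspect direct --format '{{index .Options "parent"}}'
```

    Checking the `parent` option after changing the setting is a quick way to confirm the custom network really moved off the bridge.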

     

    On 6/17/2021 at 6:37 AM, Capt_Rekt said:

    Thanks for posting this, I reverted to 6.8.3 yesterday and had a crash overnight. I just adjusted my settings to see if this works for me.

     

    Edit: unless I have some setting wrong, this did not work for me. My server crashed overnight.

    That's unfortunate, I'm still stable and I really would love to help figure out how.

     

    Normally, within twenty-four hours (I check my logs like ten times a day during unstable periods) I'd have one non-fatal panic attributed to nf_nat or conntrack or similar, followed later (several hours) by a metadata checksum error on a random Docker-active volume (assuming a container touched a file and the kernel panic caused the thread to drop before metadata was updated, etc.?), which would worsen (checksum logspam) until I got another nf_nat or similar subsystem panic that would actually be fatal and lock the system entirely. I actually got very good at the reboot/maintenance mount/fsck routine. This explicitly has not happened since the day I switched off bridging and bonding.

     

    Another thing I noticed is that in my kernel logs I'm seeing a few new lines that I haven't ...explicitly noticed before? I don't know if it's due to the no-bonding/no-bridging configuration or not. It's likely due to the fact that I started to enable VLAN on eth1 before disabling eth1, but my logs include VLAN 0 references.

     

    So, to be clear, the day I fixed this crash, my configuration changed FROM:

    eth0 + eth1 in 802.3ad active aggregation (bonding) with bridging enabled, VLANs off

     

    TO:

    eth0: bonding/bridging/VLANs off

    eth1: bonding/bridging off, VLANs ON (unconfigured), interface disabled

     

     

    My suspicion is that despite eth1 being disabled, some script detects that "an interface has VLANs enabled" and triggers default VLAN 0 handling on all interfaces?

     

    Anyway, relevant kernel logs:

    [597058.181899] docker0: port 1(vethcd65992) entered blocking state
    [597058.181902] docker0: port 1(vethcd65992) entered disabled state
    [597058.181961] device vethcd65992 entered promiscuous mode
    [597058.182040] IPv6: ADDRCONF(NETDEV_UP): vethcd65992: link is not ready
    [597058.751447] eth0: renamed from veth8c778f7
    [597058.762734] IPv6: ADDRCONF(NETDEV_CHANGE): vethcd65992: link becomes ready
    [597058.762825] docker0: port 1(vethcd65992) entered blocking state
    [597058.762828] docker0: port 1(vethcd65992) entered forwarding state
    [597058.762934] IPv6: ADDRCONF(NETDEV_CHANGE): docker0: link becomes ready
    [597091.715908] vethbcba7b5: renamed from eth0
    [597091.745044] igb 0000:01:00.0 eth0: mixed HW and IP checksum settings.
    [597093.591117] igb 0000:01:00.0 eth0: mixed HW and IP checksum settings.
    [597093.591546] eth0: renamed from vethc557863
    [597093.602749] 8021q: adding VLAN 0 to HW filter on device eth0
    [597094.179544] veth8c778f7: renamed from eth0
    [597094.215339] docker0: port 1(vethcd65992) entered disabled state
    [597094.292125] docker0: port 1(vethcd65992) entered disabled state
    [597094.294522] device vethcd65992 left promiscuous mode
    [597094.294523] docker0: port 1(vethcd65992) entered disabled state

     

    This happens any time I cycle a container; note the VLAN 0 reference. I'm almost positive that was not present before.

     

     

    Relevant network configuration:

    image.thumb.png.858f7d1fdbbdac8d50e0afe2cc645e72.png

     

    image.thumb.png.716c18758db842d1f2fa2d6f9f4e52bd.png

     

     

    Perhaps you could reproduce this by enabling eth1, enabling VLANs on it, disabling the interface again, then restarting your array? I'm not sure it's worth the effort, but during your next crash cycle you could probably try it without much fanfare.

     

     

    Unrelated:

    Anyone interested in the one-liner I wrote to unconditionally scan and repair every drive in your (Maintenance Mode-mounted) array? It assumes (a) you're okay losing files that are corrupt and would otherwise require extensive filesystem-level repair to recover, and (b) you're using XFS without encryption, or can modify the script to accommodate either of the above.

     

    I just paste it into an SSH terminal after every non-graceful power cycle and typically lose a log file, if anything at all, but it's definitely not for mission-critical arrays. Frankly, unless you're going to do filesystem-level repair yourself or pay someone else to, this is what you'll wind up doing anyway to get your filesystem to mount or to stop complaining about checksum errors. So far it has also produced no sync errors from kernel panics when I re-check parity after starting the array normally (out of Maintenance Mode). I may lose a file or two (you can find out by scrolling back through the logs or capturing them to a file), but this is non-critical data for me; I prefer the parity safety and ease of use over paying someone to save a file or two.
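
    The author's actual one-liner isn't posted in the thread, but a hedged sketch of the same idea might look like the following. Everything here is an assumption: the /dev/md1..md3 device list is a placeholder for your own array devices, and the choice of xfs_repair -L (which zeroes the XFS log and can discard in-flight files) mirrors the "okay losing files" caveat above.

    ```shell
    # NOT the author's script -- a hypothetical sketch of the same idea.
    # Run only with the array mounted in Maintenance Mode.
    # DRY_RUN=1 (the default here) prints each command instead of running it.
    DRY_RUN="${DRY_RUN:-1}"
    for dev in /dev/md1 /dev/md2 /dev/md3; do
      # -L zeroes the XFS log: corrupt in-flight files may be lost.
      if [ "$DRY_RUN" = 1 ]; then
        echo "xfs_repair -L $dev"
      else
        xfs_repair -L "$dev"
      fi
    done
    ```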

    Edited by codefaux

    This all started when I upgraded to 6.9. I can't go back to 6.8, as my network adapter isn't supported. I guess no more Pi-hole on Unraid; even with Pi-hole (the only Docker container of mine with a static IP) disabled, I still get a call trace and shortly thereafter a hard crash.

     

    Jul 5 14:10:55 Fatsally kernel: WARNING: CPU: 3 PID: 26 at net/netfilter/nf_conntrack_core.c:1120 __nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]
    Jul 5 14:10:55 Fatsally kernel: Modules linked in: xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle nf_tables vhost_net tun vhost vhost_iotlb tap xt_nat xt_tcpudp veth macvlan xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs md_mod i915 iosf_mbi i2c_algo_bit drm_kms_helper drm intel_gtt agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding wmi_bmof intel_wmi_thunderbolt mxm_wmi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper rapl intel_cstate intel_uncore mpt3sas i2c_i801 i2c_smbus i2c_core nvme nvme_core igc ahci raid_class input_leds led_class libahci scsi_transport_sas thermal fan video wmi backlight acpi_pad button
    Jul 5 14:10:55 Fatsally kernel: CPU: 3 PID: 26 Comm: kworker/3:0 Not tainted 5.10.28-Unraid #1
    Jul 5 14:10:55 Fatsally kernel: Hardware name: ASUS System Product Name/PRIME Z490-A, BIOS 2103 04/15/2021
    Jul 5 14:10:55 Fatsally kernel: Workqueue: events macvlan_process_broadcast [macvlan]
    Jul 5 14:10:55 Fatsally kernel: RIP: 0010:__nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]
    Jul 5 14:10:55 Fatsally kernel: Code: e8 dc f8 ff ff 44 89 fa 89 c6 41 89 c4 48 c1 eb 20 89 df 41 89 de e8 36 f6 ff ff 84 c0 75 bb 48 8b 85 80 00 00 00 a8 08 74 18 <0f> 0b 89 df 44 89 e6 31 db e8 6d f3 ff ff e8 35 f5 ff ff e9 22 01
    Jul 5 14:10:55 Fatsally kernel: RSP: 0018:ffffc900001d4d38 EFLAGS: 00010202
    Jul 5 14:10:55 Fatsally kernel: RAX: 0000000000000188 RBX: 0000000000007014 RCX: 000000006b49dfa4
    Jul 5 14:10:55 Fatsally kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa0644e54
    Jul 5 14:10:55 Fatsally kernel: RBP: ffff88814c3acf00 R08: 000000002aebd323 R09: ffff88813f9b9560
    Jul 5 14:10:55 Fatsally kernel: R10: 0000000000000158 R11: ffff888105fdea00 R12: 0000000000001b15
    Jul 5 14:10:55 Fatsally kernel: R13: ffffffff8210b440 R14: 0000000000007014 R15: 0000000000000000
    Jul 5 14:10:55 Fatsally kernel: FS: 0000000000000000(0000) GS:ffff88883dac0000(0000) knlGS:0000000000000000
    Jul 5 14:10:55 Fatsally kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jul 5 14:10:55 Fatsally kernel: CR2: 000055e88501fd80 CR3: 000000000200a002 CR4: 00000000003706e0
    Jul 5 14:10:55 Fatsally kernel: Call Trace:
    Jul 5 14:10:55 Fatsally kernel: <IRQ>
    Jul 5 14:10:55 Fatsally kernel: nf_conntrack_confirm+0x2f/0x36 [nf_conntrack]
    Jul 5 14:10:55 Fatsally kernel: nf_hook_slow+0x39/0x8e
    Jul 5 14:10:55 Fatsally kernel: nf_hook.constprop.0+0xb1/0xd8
    Jul 5 14:10:55 Fatsally kernel: ? ip_protocol_deliver_rcu+0xfe/0xfe
    Jul 5 14:10:55 Fatsally kernel: ip_local_deliver+0x49/0x75
    Jul 5 14:10:55 Fatsally kernel: ip_sabotage_in+0x43/0x4d [br_netfilter]
    Jul 5 14:10:55 Fatsally kernel: nf_hook_slow+0x39/0x8e
    Jul 5 14:10:55 Fatsally kernel: nf_hook.constprop.0+0xb1/0xd8
    Jul 5 14:10:55 Fatsally kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
    Jul 5 14:10:55 Fatsally kernel: ip_rcv+0x41/0x61
    Jul 5 14:10:55 Fatsally kernel: __netif_receive_skb_one_core+0x74/0x95
    Jul 5 14:10:55 Fatsally kernel: process_backlog+0xa3/0x13b
    Jul 5 14:10:55 Fatsally kernel: net_rx_action+0xf4/0x29d
    Jul 5 14:10:55 Fatsally kernel: __do_softirq+0xc4/0x1c2
    Jul 5 14:10:55 Fatsally kernel: asm_call_irq_on_stack+0xf/0x20
    Jul 5 14:10:55 Fatsally kernel: </IRQ>
    Jul 5 14:10:55 Fatsally kernel: do_softirq_own_stack+0x2c/0x39
    Jul 5 14:10:55 Fatsally kernel: do_softirq+0x3a/0x44
    Jul 5 14:10:55 Fatsally kernel: netif_rx_ni+0x1c/0x22
    Jul 5 14:10:55 Fatsally kernel: macvlan_broadcast+0x10e/0x13c [macvlan]
    Jul 5 14:10:55 Fatsally kernel: macvlan_process_broadcast+0xf8/0x143 [macvlan]
    Jul 5 14:10:55 Fatsally kernel: process_one_work+0x13c/0x1d5
    Jul 5 14:10:55 Fatsally kernel: worker_thread+0x18b/0x22f
    Jul 5 14:10:55 Fatsally kernel: ? process_scheduled_works+0x27/0x27
    Jul 5 14:10:55 Fatsally kernel: kthread+0xe5/0xea
    Jul 5 14:10:55 Fatsally kernel: ? __kthread_bind_mask+0x57/0x57
    Jul 5 14:10:55 Fatsally kernel: ret_from_fork+0x1f/0x30
    Jul 5 14:10:55 Fatsally kernel: ---[ end trace 5e28eea505cdd363 ]---

    Edited by Gabriel_B
    On 6/23/2021 at 12:23 AM, codefaux said:

    but my logs include VLAN 0 references.

    Perfectly normal when you enable VLANs on an interface.

    VLAN 0 refers to untagged (and 802.1p priority-tagged) traffic.





