• 6.9.0/6.9.1 - Kernel Panic due to netfilter (nf_nat_setup_info) - Docker Static IP (macvlan)


    CorneliousJD
    • Urgent

So I had posted another thread about how, after a kernel panic, Docker host access to custom networks doesn't work until Docker is stopped/restarted on 6.9.0.

     

     

After further investigation and setting up syslogging, it appears that it may actually be that host access that's CAUSING the kernel panic?

EDIT 3/16: It seems I needed to create a VLAN for my Dockers with static IPs. So far that's working, so it's probably not HOST access causing the issue, but rather static IPs being assigned on br0. See the following posts below.
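(For context, the VLAN-backed network Unraid sets up for this is roughly equivalent to the following docker CLI command - a sketch only; the VLAN ID, subnet, and gateway below are just examples, not my actual values.)

    # create a macvlan network on VLAN 2 of br0 (illustrative VLAN/subnet/gateway)
    docker network create -d macvlan \
      --subnet=192.168.2.0/24 --gateway=192.168.2.1 \
      -o parent=br0.2 br0.2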

     

    Here's my last kernel panic that thankfully got logged to syslog. It references macvlan and netfilter. I don't know enough to be super useful here, but this is my docker setup.

     

[image: Docker network settings screenshot]

     

    Mar 12 03:57:07 Server kernel: ------------[ cut here ]------------
    Mar 12 03:57:07 Server kernel: WARNING: CPU: 17 PID: 626 at net/netfilter/nf_nat_core.c:614 nf_nat_setup_info+0x6c/0x652 [nf_nat]
    Mar 12 03:57:07 Server kernel: Modules linked in: ccp macvlan xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap veth xt_nat xt_MASQUERADE iptable_nat nf_nat xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables bonding igb i2c_algo_bit cp210x usbserial sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd ipmi_ssif isci glue_helper mpt3sas i2c_i801 rapl libsas i2c_smbus input_leds i2c_core ahci intel_cstate raid_class led_class acpi_ipmi intel_uncore libahci scsi_transport_sas wmi ipmi_si button [last unloaded: ipmi_devintf]
    Mar 12 03:57:07 Server kernel: CPU: 17 PID: 626 Comm: kworker/17:2 Tainted: G        W         5.10.19-Unraid #1
    Mar 12 03:57:07 Server kernel: Hardware name: Supermicro PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
    Mar 12 03:57:07 Server kernel: Workqueue: events macvlan_process_broadcast [macvlan]
    Mar 12 03:57:07 Server kernel: RIP: 0010:nf_nat_setup_info+0x6c/0x652 [nf_nat]
    Mar 12 03:57:07 Server kernel: Code: 89 fb 49 89 f6 41 89 d4 76 02 0f 0b 48 8b 93 80 00 00 00 89 d0 25 00 01 00 00 45 85 e4 75 07 89 d0 25 80 00 00 00 85 c0 74 07 <0f> 0b e9 1f 05 00 00 48 8b 83 90 00 00 00 4c 8d 6c 24 20 48 8d 73
    Mar 12 03:57:07 Server kernel: RSP: 0018:ffffc90006778c38 EFLAGS: 00010202
    Mar 12 03:57:07 Server kernel: RAX: 0000000000000080 RBX: ffff88837c8303c0 RCX: ffff88811e834880
    Mar 12 03:57:07 Server kernel: RDX: 0000000000000180 RSI: ffffc90006778d14 RDI: ffff88837c8303c0
    Mar 12 03:57:07 Server kernel: RBP: ffffc90006778d00 R08: 0000000000000000 R09: ffff889083c68160
    Mar 12 03:57:07 Server kernel: R10: 0000000000000158 R11: ffff8881e79c1400 R12: 0000000000000000
    Mar 12 03:57:07 Server kernel: R13: 0000000000000000 R14: ffffc90006778d14 R15: 0000000000000001
    Mar 12 03:57:07 Server kernel: FS:  0000000000000000(0000) GS:ffff88903fc40000(0000) knlGS:0000000000000000
    Mar 12 03:57:07 Server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar 12 03:57:07 Server kernel: CR2: 000000c000b040b8 CR3: 000000000200c005 CR4: 00000000001706e0
    Mar 12 03:57:07 Server kernel: Call Trace:
    Mar 12 03:57:07 Server kernel: <IRQ>
    Mar 12 03:57:07 Server kernel: ? activate_task+0x9/0x12
    Mar 12 03:57:07 Server kernel: ? resched_curr+0x3f/0x4c
    Mar 12 03:57:07 Server kernel: ? ipt_do_table+0x49b/0x5c0 [ip_tables]
    Mar 12 03:57:07 Server kernel: ? try_to_wake_up+0x1b0/0x1e5
    Mar 12 03:57:07 Server kernel: nf_nat_alloc_null_binding+0x71/0x88 [nf_nat]
    Mar 12 03:57:07 Server kernel: nf_nat_inet_fn+0x91/0x182 [nf_nat]
    Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
    Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
    Mar 12 03:57:07 Server kernel: ? ip_protocol_deliver_rcu+0xfe/0xfe
    Mar 12 03:57:07 Server kernel: ip_local_deliver+0x49/0x75
    Mar 12 03:57:07 Server kernel: ip_sabotage_in+0x43/0x4d
    Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
    Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
    Mar 12 03:57:07 Server kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
    Mar 12 03:57:07 Server kernel: ip_rcv+0x41/0x61
    Mar 12 03:57:07 Server kernel: __netif_receive_skb_one_core+0x74/0x95
    Mar 12 03:57:07 Server kernel: process_backlog+0xa3/0x13b
    Mar 12 03:57:07 Server kernel: net_rx_action+0xf4/0x29d
    Mar 12 03:57:07 Server kernel: __do_softirq+0xc4/0x1c2
    Mar 12 03:57:07 Server kernel: asm_call_irq_on_stack+0x12/0x20
    Mar 12 03:57:07 Server kernel: </IRQ>
    Mar 12 03:57:07 Server kernel: do_softirq_own_stack+0x2c/0x39
    Mar 12 03:57:07 Server kernel: do_softirq+0x3a/0x44
    Mar 12 03:57:07 Server kernel: netif_rx_ni+0x1c/0x22
    Mar 12 03:57:07 Server kernel: macvlan_broadcast+0x10e/0x13c [macvlan]
    Mar 12 03:57:07 Server kernel: macvlan_process_broadcast+0xf8/0x143 [macvlan]
    Mar 12 03:57:07 Server kernel: process_one_work+0x13c/0x1d5
    Mar 12 03:57:07 Server kernel: worker_thread+0x18b/0x22f
    Mar 12 03:57:07 Server kernel: ? process_scheduled_works+0x27/0x27
    Mar 12 03:57:07 Server kernel: kthread+0xe5/0xea
    Mar 12 03:57:07 Server kernel: ? __kthread_bind_mask+0x57/0x57
    Mar 12 03:57:07 Server kernel: ret_from_fork+0x22/0x30
    Mar 12 03:57:07 Server kernel: ---[ end trace b3ca21ac5f2c2720 ]---

     




    User Feedback

    Recommended Comments



    13 minutes ago, bonienl said:

Your case is very different: you don't have any custom (macvlan) network defined, nor containers with their own (fixed) IP addresses configured. Instead you have a user-defined bridge network (proxynet) and your containers reside in this network (172.18.0.X).

Right. As I mentioned, I originally did have containers (unifi, adguard) with fixed IP addresses, but in the process of troubleshooting I changed the network settings. Through further troubleshooting I then realized that host access to custom networks caused the trace even without those fixed IP addresses (and it's easily/quickly repeatable), so that's what I've been focusing on.
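(For anyone following along, the "fixed IP" setup being described is roughly what you get from a docker run like the one below - a sketch only; the image, network name, and address are just examples.)

    # give a container its own address on the custom (macvlan) network
    docker run -d --name adguard --network br0 --ip 192.168.1.50 adguard/adguardhome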

     

    13 minutes ago, bonienl said:

Q: when host access is enabled, can you show the routing table again? I'd like to see which shim interfaces are defined in this case.

    Here you go! Thanks.

    network with host access on.png
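(Background for anyone wondering what a "shim" interface is: host access works, roughly, by adding a macvlan interface on the host side of the same parent and routing container addresses through it, because a macvlan parent can't talk to its own children directly. A simplified iproute2 equivalent - Unraid creates and manages the real shim interfaces itself, and the addresses below are just examples - might look like this.)

    # rough sketch only - Unraid handles this automatically when host access is enabled
    ip link add shim-br0 link br0 type macvlan mode bridge
    ip link set shim-br0 up
    ip route add 192.168.1.50/32 dev shim-br0   # host route to an example container IP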

    Edited by kaiguy
    additional info
    Link to comment

    Thanks for the quick reply.

     

A side note: host access is only applicable to custom (macvlan) networks, not bridge networks. In your case enabling it has no use and it can stay disabled; it should, however, not cause the call traces (still investigating)!

     

    • Thanks 1
    Link to comment

    I had the crashes on 6.9.0 and 6.9.1, and updated to 6.9.2 shortly after it was released. Mine had a kernel panic overnight, so I can confirm this issue is NOT resolved. My original support thread with logs is here: 

     

    Edited by vagrantprodigy
    Link to comment

Just installed 6.9.2 today; macvlan kernel panic within 4 hours. I used to get frequent macvlan call traces and hard lockups when I had an Intel 10 Gbit card installed on 6.8.3 and up, until I replaced it with a dual 1 Gbit Intel card. 6.9.1 had no macvlan traces or lockups.

    Diagnostics attached...

    tower-diagnostics-20210410-2213.zip

    Link to comment

I waited until 6.9.2 to make the jump from 6.8.3. Unfortunately I ran into this problem as well. 6.8.3 was stable for months for me, so I'm trying to downgrade now to keep the wife and kids happy. Will follow along to see when this one gets sorted out. Cheers to those putting in the effort to get this fixed! Thanks

    Edited by sirkuz
    Link to comment
    4 minutes ago, danieland said:

    same problems here, 6.8.3 is fine.

Yes, I've reverted to 6.8.3 and have been up and running with no issues for 4 days; 6.9.1 would have crashed by now.
     

Perhaps the Unraid devs could compare what changed in the network stack and revert that change, as it's a regression.
     

I think I speak for most Unraid users when I say that reliability is far more important than feature set. Reliability is the reason I've stayed on Unraid.

    • Thanks 1
    Link to comment
    5 minutes ago, danieland said:

    same problems here, 6.8.3 is fine.

    Is there a way to go back all the way to 6.8.3? I had already upgraded to 6.9.x and 6.9.2 and do not have an option to downgrade lower than 6.9.x

    Link to comment
    Just now, Capt_Rekt said:

    Is there a way to go back all the way to 6.8.3? I had already upgraded to 6.9.x and 6.9.2 and do not have an option to downgrade lower than 6.9.x

I just went into Tools >> Update OS and it had the option to restore. It may only allow one version's worth of regression.

    Link to comment
    2 minutes ago, Capt_Rekt said:

    Is there a way to go back all the way to 6.8.3? I had already upgraded to 6.9.x and 6.9.2 and do not have an option to downgrade lower than 6.9.x

Sorry, I don't know how to downgrade.

I hope Limetech finds a way to resolve this bug.

    Link to comment
    1 minute ago, whoopn said:

I just went into Tools >> Update OS and it had the option to restore. It may only allow one version's worth of regression.

Which is my situation: I downgraded one level, but it isn't far enough back. So now I am stuck manually restarting my server every day if I want to use it.

    Link to comment
    Just now, Capt_Rekt said:

Which is my situation: I downgraded one level, but it isn't far enough back. So now I am stuck manually restarting my server every day if I want to use it.

    That stinks. Perhaps this works?

     

1. Make a CA backup; separately back up or protect VMs.

2. Save those files somewhere other than the array; keep a copy on the array for convenience.

3. Reinstall 6.8.3 (see the sketch after this list).

4. Restore the backup.

5. Cross fingers.

6. Profit?!?
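(A rough sketch of step 3, in case the Update OS page won't go back that far: manually copy the 6.8.3 OS files onto the flash drive. Back up the flash first; the zip filename below is assumed from the usual release naming, and the config/ folder on the flash is left untouched.)

    # keep a copy of the current OS files, then overwrite them with the 6.8.3 ones
    mkdir -p /boot/previous && cp /boot/bz* /boot/previous/
    unzip -o unRAIDServer-6.8.3-x86_64.zip 'bz*' -d /boot/
    # reboot to load 6.8.3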

    Link to comment

After upgrading from 6.9.1 to 6.9.2 I took a chance and added br0 with a static IP back on one of my Docker containers. Less than 24 hours later I had a hard crash. My syslog for the past 2 days is attached, with what I believe to be the crash highlighted in yellow. Apr 16 -17 Crash Log.docx

    Link to comment

Looks like others are now having the issue that I have struggled with. I have had the issue with two different motherboards and on versions of unRAID since 6.8.3. I swapped to a motherboard with IPMI and now the macvlan issue has returned. I resolved the issue on my MSI Pro Carbon X370 by removing my 10 Gig PCIe NIC and using the onboard NIC. All of my Dockers are currently in host mode or on br0.2 with a static IP. I had six days of uptime and then just this morning another macvlan call trace came up. Since I installed my new motherboard, I check my logs every morning before I go to work and usually after I get off work. It seems like this has been an issue for a long time, but something changed that caused more users to be affected than before.
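(If you want to make that morning log check quick, something like this against the syslog works; adjust the path if you mirror syslog somewhere else.)

    # scan the current syslog for macvlan/netfilter call traces
    grep -iE 'macvlan|nf_nat|nf_conntrack|cut here|call trace' /var/log/syslog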

     

    Current Unraid Version: 6.9.2

    Original Motherboard: MSI Pro Carbon X370

    Current Motherboard: ASRockRack X470D4U2-2T

     

    tower-syslog-20210421-1728.zip

    • Like 1
    Link to comment

I think the TL;DR is that 6.9.x is borked... 9 days of uptime on 6.8.3 after only getting 2-4 on 6.9.1.

My best is around 140 days before I had to bounce it for something.

    Link to comment

Same issues as K1ng0011, also with an ASRockRack board on the same X470 base. I have had a great number of system lockups since December; it started for me with RC2. I'm not in a position to create new VLANs. I have instead removed all my Dockers that used br0, as I was having issues even with them stopped.

     

    1 hour ago, K1ng0011 said:

    Current Unraid Version: 6.9.2

    Original Motherboard: MSI Pro Carbon X370

    Current Motherboard: ASRockRack X470D4U2-2T

     

     

    thelibrary-diagnostics-20210418-2211 (1).zip

    Edited by DuzAwe
    Link to comment

Today my Unraid crashed again. Is there any way we can fix this without buying a switch?

    Edited by danieland
    Link to comment

If you look at hoopster's post, he reported macvlan call traces in unRAID version 6.5.0, which was released in 2018. This has been an issue for quite a long time. I have seen references suggesting this issue might be related to Docker itself and not unRAID, but it seems the issue has not been widespread enough to warrant unRAID reaching out to the people who maintain the Docker software. I spent from December 2020 to February 2021 diagnosing this issue on my original motherboard. When I removed my 10 Gig PCIe NIC (an ASUS XG-C100C) from the motherboard, the issue stopped. So somehow specific hardware was causing this issue. At the time I was on 6.8.3. Unless a wide number of users are affected, I am not hopeful that this issue will get resolved. Currently I have the "host access to custom networks" setting on the Docker settings page disabled, to hopefully prevent any hard lockups while unRAID comes up with a solution. If I were you I would install the "CA Backup / Restore Appdata" plugin if you do not have it already. I have had these hard lockups corrupt the appdata for my Dockers, and this plugin allowed me to restore my appdata folder.

    Link to comment

    I am having similar issues that seem to be related to docker/macvlan:

     

My kernel panics have existed since 6.8.3 (my initial Unraid version) and have not been solved by upgrading to the latest version either.

     

    I am not using the network "br0" with static IP addresses, only the following:

    - host

    - bridge

    - proxynet (custom docker network)

     

The error logs seem to be nearly identical, though.

     

    Could this be related?

    Link to comment
    On 4/24/2021 at 4:20 AM, jnk22 said:

    I am having similar issues that seem to be related to docker/macvlan:

     

My kernel panics have existed since 6.8.3 (my initial Unraid version) and have not been solved by upgrading to the latest version either.

     

    I am not using the network "br0" with static IP addresses, only the following:

    - host

    - bridge

    - proxynet (custom docker network)

     

The error logs seem to be nearly identical, though.

     

    Could this be related?

Disabling "Host access to custom networks" in the Docker settings should suffice... it's not necessary for proxynet. I had mine enabled for some reason or another (testing at some point) and forgot to disable it, and it was the sole cause of my macvlan call traces and lockups. Disabling it has stopped all macvlan problems for me. This was addressed earlier in this thread by kaiguy and bonienl.

It seems NIC-driver related, since different cards show different stability when host access is enabled.
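(An easy way to confirm what a given custom network actually is - user-defined bridge vs. macvlan - is to ask Docker for its driver:)

    # prints "bridge" for a user-defined bridge like proxynet, "macvlan" for br0-style networks
    docker network inspect -f '{{.Driver}}' proxynet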

    Link to comment

    I think you can put me on the list 😤

Since the upgrade to 6.9, my formerly rock-stable server has been crashing 1-2 times a month on all the recent versions.

    Apr 12 23:33:06 Server kernel: ------------[ cut here ]------------
    Apr 12 23:33:06 Server kernel: WARNING: CPU: 1 PID: 13942 at net/netfilter/nf_conntrack_core.c:1120 __nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]
    Apr 12 23:33:06 Server kernel: Modules linked in: macvlan xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle nf_tables xt_nat xt_tcpudp vhost_net tun vhost vhost_iotlb tap veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs nfsd lockd grace sunrpc md_mod ipmi_devintf ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding igb i2c_algo_bit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm wmi_bmof crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper rapl intel_cstate intel_uncore nvme ipmi_ssif nvme_core i2c_i801 i2c_smbus input_leds i2c_core led_class ahci libahci ie31200_edac intel_pch_thermal wmi fan thermal acpi_ipmi video ipmi_si backlight button [last unloaded: i2c_algo_bit]
    Apr 12 23:33:06 Server kernel: CPU: 1 PID: 13942 Comm: kworker/1:1 Not tainted 5.10.28-Unraid #1
    Apr 12 23:33:06 Server kernel: Hardware name: Supermicro Super Server/X11SCL-F, BIOS 1.5 10/05/2020
    Apr 12 23:33:06 Server kernel: Workqueue: events macvlan_process_broadcast [macvlan]
    Apr 12 23:33:06 Server kernel: RIP: 0010:__nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]
    Apr 12 23:33:06 Server kernel: Code: e8 dc f8 ff ff 44 89 fa 89 c6 41 89 c4 48 c1 eb 20 89 df 41 89 de e8 36 f6 ff ff 84 c0 75 bb 48 8b 85 80 00 00 00 a8 08 74 18 <0f> 0b 89 df 44 89 e6 31 db e8 6d f3 ff ff e8 35 f5 ff ff e9 22 01
    Apr 12 23:33:06 Server kernel: RSP: 0018:ffffc90000110dd8 EFLAGS: 00010202
    Apr 12 23:33:06 Server kernel: RAX: 0000000000000188 RBX: 00000000000082f4 RCX: 000000006bbd9a8a
    Apr 12 23:33:06 Server kernel: RDX: 0000000000000000 RSI: 000000000000019a RDI: ffffffffa03b1dd0
    Apr 12 23:33:06 Server kernel: RBP: ffff88835c40b540 R08: 00000000b4ddb4fe R09: ffff888163ec87c0
    Apr 12 23:33:06 Server kernel: R10: 0000000000000000 R11: ffff888172bda000 R12: 000000000000519a
    Apr 12 23:33:06 Server kernel: R13: ffffffff8210b440 R14: 00000000000082f4 R15: 0000000000000000
    Apr 12 23:33:06 Server kernel: FS:  0000000000000000(0000) GS:ffff88884ec80000(0000) knlGS:0000000000000000
    Apr 12 23:33:06 Server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Apr 12 23:33:06 Server kernel: CR2: 00002a91561fadf4 CR3: 000000000200a003 CR4: 00000000003706e0
    Apr 12 23:33:06 Server kernel: Call Trace:

     

I don't want to / can't use VLANs (my router doesn't support this properly).

After I found out about the macvlan issue, I tried to switch every Docker away from br0. Hopefully this helps.
Just one Docker I wasn't able to move away from br0 (diyHue) - maybe someone can tell me how?
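(For reference, moving a container off br0 usually means running it on the default bridge and publishing its ports to the host - a rough sketch below, with purely illustrative host ports and the image name as it is commonly published. The catch for diyHue specifically is that Hue bridge emulation generally expects to be reachable on port 80 of the device's own IP, which is why it tends to end up on br0 with its own address in the first place.)

    # illustrative only: publish diyHue's web ports on alternate host ports instead of a dedicated br0 IP
    docker run -d --name diyhue -p 8080:80 -p 8443:443 diyhue/core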

Host access to custom networks is disabled (but I don't know whether I changed it after the last stall I had, as I moved the appdata to a different drive).

     

Just one addition: I quite often read that people with this issue are running Pi-hole or the UniFi Controller. I'm also using the UniFi Controller Docker and had it on a fixed IP.

     

Has the Unraid team replied to the issue so far? Only the community seems to care. Very disappointing.

     

    server-diagnostics-20210427-2125.zip

    Edited by gilladur
    Link to comment




