• 6.9.0/6.9.1 - Kernel Panic due to netfilter (nf_nat_setup_info) - Docker Static IP (macvlan)


    CorneliousJD
    • Urgent

    So I had posted another thread about how, after a kernel panic, Docker host access to custom networks doesn't work until Docker is stopped/restarted on 6.9.0.

     

     

    After further investigation and setting up syslogging, it appears that it may actually be that host access that's CAUSING the kernel panic? 

    EDIT: 3/16 - I guess I needed to create a VLAN for my dockers with static IPs; so far that's working, so it's probably not HOST access causing the issue, but rather br0 static IPs being set. See the following posts below.

     

    Here's my last kernel panic, which thankfully got logged to syslog. It references macvlan and netfilter. I don't know enough to be super useful here, but this is my docker setup.

     

    [screenshot of my Docker network settings]

     

    Mar 12 03:57:07 Server kernel: ------------[ cut here ]------------
    Mar 12 03:57:07 Server kernel: WARNING: CPU: 17 PID: 626 at net/netfilter/nf_nat_core.c:614 nf_nat_setup_info+0x6c/0x652 [nf_nat]
    Mar 12 03:57:07 Server kernel: Modules linked in: ccp macvlan xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap veth xt_nat xt_MASQUERADE iptable_nat nf_nat xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables bonding igb i2c_algo_bit cp210x usbserial sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd ipmi_ssif isci glue_helper mpt3sas i2c_i801 rapl libsas i2c_smbus input_leds i2c_core ahci intel_cstate raid_class led_class acpi_ipmi intel_uncore libahci scsi_transport_sas wmi ipmi_si button [last unloaded: ipmi_devintf]
    Mar 12 03:57:07 Server kernel: CPU: 17 PID: 626 Comm: kworker/17:2 Tainted: G        W         5.10.19-Unraid #1
    Mar 12 03:57:07 Server kernel: Hardware name: Supermicro PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
    Mar 12 03:57:07 Server kernel: Workqueue: events macvlan_process_broadcast [macvlan]
    Mar 12 03:57:07 Server kernel: RIP: 0010:nf_nat_setup_info+0x6c/0x652 [nf_nat]
    Mar 12 03:57:07 Server kernel: Code: 89 fb 49 89 f6 41 89 d4 76 02 0f 0b 48 8b 93 80 00 00 00 89 d0 25 00 01 00 00 45 85 e4 75 07 89 d0 25 80 00 00 00 85 c0 74 07 <0f> 0b e9 1f 05 00 00 48 8b 83 90 00 00 00 4c 8d 6c 24 20 48 8d 73
    Mar 12 03:57:07 Server kernel: RSP: 0018:ffffc90006778c38 EFLAGS: 00010202
    Mar 12 03:57:07 Server kernel: RAX: 0000000000000080 RBX: ffff88837c8303c0 RCX: ffff88811e834880
    Mar 12 03:57:07 Server kernel: RDX: 0000000000000180 RSI: ffffc90006778d14 RDI: ffff88837c8303c0
    Mar 12 03:57:07 Server kernel: RBP: ffffc90006778d00 R08: 0000000000000000 R09: ffff889083c68160
    Mar 12 03:57:07 Server kernel: R10: 0000000000000158 R11: ffff8881e79c1400 R12: 0000000000000000
    Mar 12 03:57:07 Server kernel: R13: 0000000000000000 R14: ffffc90006778d14 R15: 0000000000000001
    Mar 12 03:57:07 Server kernel: FS:  0000000000000000(0000) GS:ffff88903fc40000(0000) knlGS:0000000000000000
    Mar 12 03:57:07 Server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar 12 03:57:07 Server kernel: CR2: 000000c000b040b8 CR3: 000000000200c005 CR4: 00000000001706e0
    Mar 12 03:57:07 Server kernel: Call Trace:
    Mar 12 03:57:07 Server kernel: <IRQ>
    Mar 12 03:57:07 Server kernel: ? activate_task+0x9/0x12
    Mar 12 03:57:07 Server kernel: ? resched_curr+0x3f/0x4c
    Mar 12 03:57:07 Server kernel: ? ipt_do_table+0x49b/0x5c0 [ip_tables]
    Mar 12 03:57:07 Server kernel: ? try_to_wake_up+0x1b0/0x1e5
    Mar 12 03:57:07 Server kernel: nf_nat_alloc_null_binding+0x71/0x88 [nf_nat]
    Mar 12 03:57:07 Server kernel: nf_nat_inet_fn+0x91/0x182 [nf_nat]
    Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
    Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
    Mar 12 03:57:07 Server kernel: ? ip_protocol_deliver_rcu+0xfe/0xfe
    Mar 12 03:57:07 Server kernel: ip_local_deliver+0x49/0x75
    Mar 12 03:57:07 Server kernel: ip_sabotage_in+0x43/0x4d
    Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
    Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
    Mar 12 03:57:07 Server kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
    Mar 12 03:57:07 Server kernel: ip_rcv+0x41/0x61
    Mar 12 03:57:07 Server kernel: __netif_receive_skb_one_core+0x74/0x95
    Mar 12 03:57:07 Server kernel: process_backlog+0xa3/0x13b
    Mar 12 03:57:07 Server kernel: net_rx_action+0xf4/0x29d
    Mar 12 03:57:07 Server kernel: __do_softirq+0xc4/0x1c2
    Mar 12 03:57:07 Server kernel: asm_call_irq_on_stack+0x12/0x20
    Mar 12 03:57:07 Server kernel: </IRQ>
    Mar 12 03:57:07 Server kernel: do_softirq_own_stack+0x2c/0x39
    Mar 12 03:57:07 Server kernel: do_softirq+0x3a/0x44
    Mar 12 03:57:07 Server kernel: netif_rx_ni+0x1c/0x22
    Mar 12 03:57:07 Server kernel: macvlan_broadcast+0x10e/0x13c [macvlan]
    Mar 12 03:57:07 Server kernel: macvlan_process_broadcast+0xf8/0x143 [macvlan]
    Mar 12 03:57:07 Server kernel: process_one_work+0x13c/0x1d5
    Mar 12 03:57:07 Server kernel: worker_thread+0x18b/0x22f
    Mar 12 03:57:07 Server kernel: ? process_scheduled_works+0x27/0x27
    Mar 12 03:57:07 Server kernel: kthread+0xe5/0xea
    Mar 12 03:57:07 Server kernel: ? __kthread_bind_mask+0x57/0x57
    Mar 12 03:57:07 Server kernel: ret_from_fork+0x22/0x30
    Mar 12 03:57:07 Server kernel: ---[ end trace b3ca21ac5f2c2720 ]---
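    For anyone trying to confirm they're hitting the same trace, grepping a syslog capture for the two module symbols in the log above is a quick check. This is only a sketch; the /var/log/syslog path is an assumption, use wherever your syslog server actually writes:

```shell
# Quick check for the macvlan/nf_nat trace signature in a syslog capture.
# Demonstrated against a sample line from the trace above; point the same
# grep at your real log file (the path in the last comment is an assumption).
sample='Mar 12 03:57:07 Server kernel: Workqueue: events macvlan_process_broadcast [macvlan]'
printf '%s\n' "$sample" | grep -cE 'nf_nat_setup_info|macvlan_process_broadcast'
# On a live capture: grep -E 'nf_nat_setup_info|macvlan_process_broadcast' /var/log/syslog
```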

     



    User Feedback

    Recommended Comments



    I believe I'm seeing the same issue since I upgraded to 6.9.0; I have not tried 6.9.1 yet.

     

    Following a thread in a Facebook group, someone else has seen this issue on 6.9.0 and 6.9.1. After being pointed to macvlan issues, he found the errors in his syslog, similar to what's attached here.

     

    He believes it's related to anything that is assigned a static IP on br0.

     

    Just now, Tuftuf said:

    I believe I'm seeing the same issue since I upgraded to 6.9.0; I have not tried 6.9.1 yet.

     

    Following a thread in a Facebook group, someone else has seen this issue on 6.9.0 and 6.9.1. After being pointed to macvlan issues, he found the errors in his syslog, similar to what's attached here.

     

    He believes it's related to anything that is assigned a static IP on br0.

     

    I believe the same. I'm 3 days into changing br0 to br0.10 (VLAN 10) and no crashes now (it's too soon to tell fully) but so far so good... 

     

    Hoping this is something that can actually be addressed though. Not everyone can run VLANs at home. 

    6 minutes ago, CorneliousJD said:

    I believe the same. I'm 3 days into changing br0 to br0.10 (VLAN 10) and no crashes now (it's too soon to tell fully) but so far so good... 

     

    Hoping this is something that can actually be addressed though. Not everyone can run VLANs at home. 

    Thanks for confirming you are able to run it on another VLAN. It doesn't fit with how I have things configured, but I can look at moving these to another VLAN.


    A common cause for these call traces is when VMs and Docker share the same interface.

     

    You should change the network model of the VMs to "virtio-net" to avoid VM/docker conflicts.

     

    1 minute ago, bonienl said:

    A common cause for these call traces is when VMs and Docker share the same interface.

     

    You should change the network model of the VMs to "virtio-net" to avoid VM/docker conflicts.

     

    Not running any VMs at all.

    14 hours ago, bonienl said:

    A common cause for these call traces is when VMs and Docker share the same interface.

     

    You should change the network model of the VMs to "virtio-net" to avoid VM/docker conflicts.

     

     

    Can you expand on what you mean by sharing the same interface, and by virtio-net? 

     

    I'm trying to understand whether or not this is being viewed as a bug. I try to keep inter-VLAN routing to a minimum, as Unraid is in my shed and the router is in the house. 

     

    What I'm doing seems a very simple configuration, and these errors started happening when upgrading to 6.9.x. The solution shouldn't be to not run a VM or Docker instance on br0.

     

    I'm running 1 VM on my system and it's in a separate VLAN (br0.50). 

     

    *EDIT* Wow, I did not expect to see posts going back to 2018 talking about this and solutions. I know saying I've never had an issue doesn't mean I didn't miss one. But prior to 6.9, I had a stable system for months. How has this only just started affecting me? I've always had core internal services on br0 and external stuff on other VLANs.

    Edited by Tuftuf

    I have also been getting this pretty constantly since upgrading to 6.9.x, which will eventually result in a system lock for me. I've been trying to troubleshoot over the last few days. Older threads suggested it had to do with docker assignments on br0, so I removed all references to any containers using br0 (in fact, I moved 3 apps from Unraid docker to a Raspberry Pi to try to fix this). Does not seem to help.

     

    @CorneliousJD When you enabled the VLAN for br0.10, did you keep the setting for allowing host access to custom networks? Any chance you can share how you configured it? I'd like to give it a try, but truth be told my understanding of VLANs is pretty basic. I do have the ability to run VLANs, however.

     

    Edited by kaiguy
    Updated info

    I'm experiencing the exact same issue. On 6.8.3 I had ZERO issues, rock solid; I upgraded for NVIDIA GPU support, so that is the only new thing. Yes, I have br0 docker IP assignments; it's never been a problem. I even have VMs and Dockers sharing br0, still no issues.

     

    Limetech, this is a pretty serious issue. What do you need from me to investigate further? Are there logs I can upload?

     

    Thank you in advance!

    15 hours ago, Tuftuf said:

    *EDIT* Wow, I did not expect to see posts going back to 2018 talking about this and solutions. I know saying I've never had an issue doesn't mean I didn't miss one. But prior to 6.9, I had a stable system for months. How has this only just started affecting me? I've always had core internal services on br0 and external stuff on other VLANs.

    Same here, ran multiple containers on br0 for years without any issues, but once 6.9 dropped everything went to hell, unfortunately.

     

    12 hours ago, kaiguy said:

     

    @CorneliousJD When you enabled the VLAN for br0.10, did you keep the setting for allowing host access to custom networks? Any chance you can share how you configured it? I'd like to give it a try, but truth be told my understanding of VLANs is pretty basic. I do have the ability to run VLANs, however.

     

    I did keep allowing host access, yes, and that's still working. 4.5 days of uptime right now (a new record on 6.9 for me!)

    As for VLANs, it really depends on your switch and router hardware. I have a UniFi switch and security gateway, so I just made a VLAN in UniFi; the switch auto-accepts and passes the traffic without any issues. Probably the simplest setup for VLANs you could imagine. 

    Here's a good post that walks you through it; the screenshots are a little outdated but they still get the job done!
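    For anyone wondering what the br0 to br0.10 change amounts to under the hood, it's roughly the following. This is only a sketch: Unraid's GUI creates the sub-interface for you when VLANs are enabled, and the VLAN ID, subnet, gateway, network name, and container below are all example values, not anyone's actual config.

```shell
# Create a VLAN sub-interface on top of the existing bridge
# (Unraid does this itself when VLANs are enabled in network settings).
ip link add link br0 name br0.10 type vlan id 10
ip link set br0.10 up

# Attach container static IPs to a macvlan network parented on br0.10
# instead of br0 (subnet/gateway/name below are example values):
docker network create -d macvlan \
  --subnet=192.168.10.0/24 \
  --gateway=192.168.10.1 \
  -o parent=br0.10 \
  vlan10
docker run -d --network=vlan10 --ip=192.168.10.50 nginx
```

    This needs root and a switch port carrying VLAN 10 to the server, which is what the UniFi setup described above provides.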

     

     

    2 hours ago, whoopn said:

    I'm experiencing the exact same issue. On 6.8.3 I had ZERO issues, rock solid; I upgraded for NVIDIA GPU support, so that is the only new thing. Yes, I have br0 docker IP assignments; it's never been a problem. I even have VMs and Dockers sharing br0, still no issues.

     

    Limetech, this is a pretty serious issue. What do you need from me to investigate further? Are there logs I can upload?

     

    Thank you in advance!

     

    Same situation here; ran for years without issues, but 6.9 introduced this bug for me. It appears to be something very old going back years, but what's changed now that's causing issues for so many people?

     

    Also hoping I can contribute to help get to the bottom of this.

    @limetech


    Not sure if disabling host access to custom networks fixed it, or migrating the two containers that had static IPs assigned to br0 to a raspberry pi, but I’m no longer getting these syslog errors/locks. 
     

    I would prefer to keep everything on the server, so next project will be setting up a docker VLAN on my pfsense and tplink smart switch. Not once did I see them before 6.9.x, so I am hopeful that whatever is going on in this kernel is corrected. 

    8 hours ago, kaiguy said:

    Not sure if disabling host access to custom networks fixed it, or migrating the two containers that had static IPs assigned to br0 to a raspberry pi, but I’m no longer getting these syslog errors/locks. 
     

    I would prefer to keep everything on the server, so next project will be setting up a docker VLAN on my pfsense and tplink smart switch. Not once did I see them before 6.9.x, so I am hopeful that whatever is going on in this kernel is corrected. 

     

    I still have host access enabled so that shouldn't be part of it (I edited the post to no longer state that host access is causing this) -- the fix was more than likely moving your static br0 containers to a Pi. 

     

    Also hoping we can get a chime-in here from someone at Limetech -- VLANs aren't a huge issue for me so I'm a happy camper now, but I can see where VLANs would cause more issues for certain apps, or where users just don't have the equipment at home for that type of setup. 

    4 hours ago, CorneliousJD said:

    I still have host access enabled so that shouldn't be part of it

    Possibly. I still did get one after I removed them from br0 and turned off those containers entirely, but I don't recall if I rebooted between events. I'll maybe try re-enabling host access and see if it happens.


    I went through my UniFi setup and made static IP assignments for all clients. I'll report back if it stays up for more than a week.


    Made the change back to host access and my server locked up sometime after 2:30am this morning. Looks like (at least for me) that's the primary culprit in general.

    bonienl


    On my test server I created a Windows 10 VM connected to br0, and a Firefox browser container on br0 (macvlan) too. The container has a static IP address, while the VM obtains its IP address by DHCP.

     

    Next I opened a browser in the VM and opened the container browser, and let both run Internet speed tests (I have a 500 Mbps symmetrical Internet connection).

     

    Initially one browser at a time, and full upload and download speeds are achieved; then running tests simultaneously in both browsers, sometimes starting the container browser first and sometimes the VM browser first. Again speeds are obtained as expected (reduced speeds due to simultaneous testing).

     

    During this ordeal everything stays stable and no call traces are happening.

     

    I have docker "host access" enabled, but I don't believe this has anything to do with the issue.

     

    In short, reproducing this issue is a challenge... (but I knew this)

     

    Currently I have a 3 hour relaxing YouTube video running in both browsers; we'll see how it goes!

     

    Update

    Both 3 hour long music videos ended without problems, all still going strong. I'll let it sit there for another day.

     

    Edited by bonienl

    For what it's worth, most of my panics weren't when the containers were seeing heavy usage; they happened randomly overnight for me. 

     

    I'm on 7 days 7 hours of uptime now. :D


    Like many, I have this problem since upgrading to 6.9.1 from 6.8.3.

     

    I'm running a docker container on bridge, not a custom br0 IP, and that one was also not available.

     

    Friday my dockers were still running and I could access them, but now that has also stopped. I can reboot the server via SSH, but I'm not planning to do this every X days.

     

    So, question: what is the solution to this? Because the way I see it, this is a BUG that started at 6.9.x for me.

    Edited by KoNeko
    added info

    Count me as another who is now experiencing hard crashes on my server, and based on my reading here I suspect it is my Dockers on br0 creating problems. I've been running Unraid for a long time (more than 15 years??) and can't recall ever having had a server crash. However, since upgrading to 6.9.1 I've had 3 hard crashes in a matter of a week that required forcibly powering off. Prior to upgrading to 6.9.1 I was running 6.8.2 using Dockers and br0 with static IPs without any issue, but now this bug seems to have gotten me as well. Today I've moved all of my Dockers to Bridge and Host where appropriate and will see if this resolves my crashing issues. I'll report back here either way.

    Edited by jsiemon
    4 hours ago, jsiemon said:

    Count me as another who is now experiencing hard crashes on my server, and based on my reading here I suspect it is my Dockers on br0 creating problems. I've been running Unraid for a long time (more than 15 years??) and can't recall ever having had a server crash. However, since upgrading to 6.9.1 I've had 3 hard crashes in a matter of a week that required forcibly powering off. Prior to upgrading to 6.9.1 I was running 6.8.2 using Dockers and br0 with static IPs without any issue, but now this bug seems to have gotten me as well. Today I've moved all of my Dockers to Bridge and Host where appropriate and will see if this resolves my crashing issues. I'll report back here either way.

    Please keep us posted. I moved to VLANs instead of static IPs, and so far I'm at 9 days 12 hours 11 minutes of uptime.

     

    @limetech -- is there any way someone can chime in here on this? 

    Not expecting any immediate fixes, but is this something related to 6.9.x that can be addressed, or?... We're all kind of in the dark here. I understand that it's happened to some users and VLANs have fixed it for them in the past, but this is a lot of users now reporting problems ONLY since 6.9.0/6.9.1... Myself included.


    I had a single call trace yesterday evening.

     

    It seems to me the common denominator is iptables, which crashes.

    Unraid 6.9.1 is running iptables version 1.8.5; I haven't checked the version on Unraid 6.8.3.
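    If it helps anyone compare, iptables 1.8.x prints which backend it was built against as part of its version string, so the check is a single command. The parsing below is shown against a sample string; the "(legacy)" value is only an example, not a claim about what any Unraid release ships:

```shell
# On a live system just run: iptables --version
# iptables 1.8.x output looks like "iptables v1.8.5 (legacy)" or
# "iptables v1.8.5 (nf_tables)". Extract the backend from a sample string:
ver='iptables v1.8.5 (legacy)'
backend=$(printf '%s\n' "$ver" | sed -n 's/.*(\(.*\)).*/\1/p')
echo "backend: $backend"
```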

     


    Hi everyone,

     

    Thank you for your patience with us on this, and @bonienl for taking point on trying to recreate the issue. We are discussing this internally and will continue to do so until we have something to share with you guys. Issues like these can be tricky to pin down, so please bear with us while we attempt to do so.

    1 minute ago, jonp said:

    Hi everyone,

     

    Thank you for your patience with us on this, and @bonienl for taking point on trying to recreate the issue. We are discussing this internally and will continue to do so until we have something to share with you guys. Issues like these can be tricky to pin down, so please bear with us while we attempt to do so.

     

    Thanks for the update, glad to see it's being talked about; that's all I can hope for, for now :)

     

    If you need any additional logs or info, let us know in the thread; myself and others here seem eager to help if we can by providing data/logs/specs, etc.

    19 hours ago, jonp said:

    Hi everyone,

     

    Thank you for your patience with us on this, and @bonienl for taking point on trying to recreate the issue. We are discussing this internally and will continue to do so until we have something to share with you guys. Issues like these can be tricky to pin down, so please bear with us while we attempt to do so.

    Awesome. I also still cannot run docker containers on br0. It will start causing macvlan errors in syslog and eventually a kernel panic.

     

    The problem also followed me from my R510 to my R720, if that helps at all.


    Chiming in - also experienced more frequent kernel panics after the update to 6.9.1.
    I only had 1 docker assigned to a static IP on br0 (UniFi controller). Followed the "Network isolation in unRAID 6.4" instructions, since I'm luckily capable of configuring VLANs thanks to UniFi equipment. Haven't had any panics since.





