• 6.9.0/6.9.1 - Kernel Panic due to netfilter (nf_nat_setup_info) - Docker Static IP (macvlan)


    CorneliousJD
    • Urgent

    I had posted another thread about how, after a kernel panic, Docker host access to custom networks doesn't work until Docker is stopped and restarted on 6.9.0.

     

     

    After further investigation and setting up syslogging, it appears that host access may actually be what's CAUSING the kernel panic?

    EDIT: 3/16 - I guess I needed to create a VLAN for my dockers with static IPs, so far that's working, so it's probably not HOST access causing the issue, but rather br0 static IPs being set. See following posts below.

     

    Here's my last kernel panic that thankfully got logged to syslog. It references macvlan and netfilter. I don't know enough to be super useful here, but this is my docker setup.

     

    image.png.dac2782e9408016de37084cf21ad64a5.png

     

    Mar 12 03:57:07 Server kernel: ------------[ cut here ]------------
    Mar 12 03:57:07 Server kernel: WARNING: CPU: 17 PID: 626 at net/netfilter/nf_nat_core.c:614 nf_nat_setup_info+0x6c/0x652 [nf_nat]
    Mar 12 03:57:07 Server kernel: Modules linked in: ccp macvlan xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap veth xt_nat xt_MASQUERADE iptable_nat nf_nat xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables bonding igb i2c_algo_bit cp210x usbserial sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd ipmi_ssif isci glue_helper mpt3sas i2c_i801 rapl libsas i2c_smbus input_leds i2c_core ahci intel_cstate raid_class led_class acpi_ipmi intel_uncore libahci scsi_transport_sas wmi ipmi_si button [last unloaded: ipmi_devintf]
    Mar 12 03:57:07 Server kernel: CPU: 17 PID: 626 Comm: kworker/17:2 Tainted: G        W         5.10.19-Unraid #1
    Mar 12 03:57:07 Server kernel: Hardware name: Supermicro PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
    Mar 12 03:57:07 Server kernel: Workqueue: events macvlan_process_broadcast [macvlan]
    Mar 12 03:57:07 Server kernel: RIP: 0010:nf_nat_setup_info+0x6c/0x652 [nf_nat]
    Mar 12 03:57:07 Server kernel: Code: 89 fb 49 89 f6 41 89 d4 76 02 0f 0b 48 8b 93 80 00 00 00 89 d0 25 00 01 00 00 45 85 e4 75 07 89 d0 25 80 00 00 00 85 c0 74 07 <0f> 0b e9 1f 05 00 00 48 8b 83 90 00 00 00 4c 8d 6c 24 20 48 8d 73
    Mar 12 03:57:07 Server kernel: RSP: 0018:ffffc90006778c38 EFLAGS: 00010202
    Mar 12 03:57:07 Server kernel: RAX: 0000000000000080 RBX: ffff88837c8303c0 RCX: ffff88811e834880
    Mar 12 03:57:07 Server kernel: RDX: 0000000000000180 RSI: ffffc90006778d14 RDI: ffff88837c8303c0
    Mar 12 03:57:07 Server kernel: RBP: ffffc90006778d00 R08: 0000000000000000 R09: ffff889083c68160
    Mar 12 03:57:07 Server kernel: R10: 0000000000000158 R11: ffff8881e79c1400 R12: 0000000000000000
    Mar 12 03:57:07 Server kernel: R13: 0000000000000000 R14: ffffc90006778d14 R15: 0000000000000001
    Mar 12 03:57:07 Server kernel: FS:  0000000000000000(0000) GS:ffff88903fc40000(0000) knlGS:0000000000000000
    Mar 12 03:57:07 Server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar 12 03:57:07 Server kernel: CR2: 000000c000b040b8 CR3: 000000000200c005 CR4: 00000000001706e0
    Mar 12 03:57:07 Server kernel: Call Trace:
    Mar 12 03:57:07 Server kernel: <IRQ>
    Mar 12 03:57:07 Server kernel: ? activate_task+0x9/0x12
    Mar 12 03:57:07 Server kernel: ? resched_curr+0x3f/0x4c
    Mar 12 03:57:07 Server kernel: ? ipt_do_table+0x49b/0x5c0 [ip_tables]
    Mar 12 03:57:07 Server kernel: ? try_to_wake_up+0x1b0/0x1e5
    Mar 12 03:57:07 Server kernel: nf_nat_alloc_null_binding+0x71/0x88 [nf_nat]
    Mar 12 03:57:07 Server kernel: nf_nat_inet_fn+0x91/0x182 [nf_nat]
    Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
    Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
    Mar 12 03:57:07 Server kernel: ? ip_protocol_deliver_rcu+0xfe/0xfe
    Mar 12 03:57:07 Server kernel: ip_local_deliver+0x49/0x75
    Mar 12 03:57:07 Server kernel: ip_sabotage_in+0x43/0x4d
    Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
    Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
    Mar 12 03:57:07 Server kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
    Mar 12 03:57:07 Server kernel: ip_rcv+0x41/0x61
    Mar 12 03:57:07 Server kernel: __netif_receive_skb_one_core+0x74/0x95
    Mar 12 03:57:07 Server kernel: process_backlog+0xa3/0x13b
    Mar 12 03:57:07 Server kernel: net_rx_action+0xf4/0x29d
    Mar 12 03:57:07 Server kernel: __do_softirq+0xc4/0x1c2
    Mar 12 03:57:07 Server kernel: asm_call_irq_on_stack+0x12/0x20
    Mar 12 03:57:07 Server kernel: </IRQ>
    Mar 12 03:57:07 Server kernel: do_softirq_own_stack+0x2c/0x39
    Mar 12 03:57:07 Server kernel: do_softirq+0x3a/0x44
    Mar 12 03:57:07 Server kernel: netif_rx_ni+0x1c/0x22
    Mar 12 03:57:07 Server kernel: macvlan_broadcast+0x10e/0x13c [macvlan]
    Mar 12 03:57:07 Server kernel: macvlan_process_broadcast+0xf8/0x143 [macvlan]
    Mar 12 03:57:07 Server kernel: process_one_work+0x13c/0x1d5
    Mar 12 03:57:07 Server kernel: worker_thread+0x18b/0x22f
    Mar 12 03:57:07 Server kernel: ? process_scheduled_works+0x27/0x27
    Mar 12 03:57:07 Server kernel: kthread+0xe5/0xea
    Mar 12 03:57:07 Server kernel: ? __kthread_bind_mask+0x57/0x57
    Mar 12 03:57:07 Server kernel: ret_from_fork+0x22/0x30
    Mar 12 03:57:07 Server kernel: ---[ end trace b3ca21ac5f2c2720 ]---
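For anyone who wants to check their own syslog for this specific trace, here is a minimal Python sketch. It only assumes the log lines look like the ones pasted above (a "cut here" line, a WARNING line naming nf_nat_setup_info, a macvlan_process_broadcast line, and an "end trace" line). The function name and the idea of passing the log path on the command line are my own choices for illustration, not anything Unraid ships.

```python
import re
import sys

# Markers copied from the call trace posted above.
START_MARKER = "[ cut here ]"
END_MARKER = "---[ end trace"
MACVLAN_MARKER = "macvlan_process_broadcast"
WARNING_RE = re.compile(r"WARNING: .*nf_nat_setup_info")


def find_macvlan_nat_traces(lines):
    """Return the WARNING line of every trace that mentions both
    nf_nat_setup_info and macvlan_process_broadcast."""
    hits = []
    in_trace = False
    warning_line = None
    saw_macvlan = False
    for line in lines:
        if START_MARKER in line:
            # A new trace begins; reset what we know about it.
            in_trace, warning_line, saw_macvlan = True, None, False
        elif in_trace:
            if warning_line is None and WARNING_RE.search(line):
                warning_line = line.rstrip("\n")
            if MACVLAN_MARKER in line:
                saw_macvlan = True
            if END_MARKER in line:
                if warning_line is not None and saw_macvlan:
                    hits.append(warning_line)
                in_trace = False
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 find_traces.py /path/to/syslog
    with open(sys.argv[1], errors="replace") as fh:
        for hit in find_macvlan_nat_traces(fh):
            print(hit)
```

Traces that do not contain both markers are ignored, so an unrelated call trace in the same log should not be reported.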

     




    User Feedback

    Recommended Comments



     

    User @kaiguy actually noted that it seems to be br0 + "host access" being enabled, and just disabling host access worked for him.

    Worth exploring maybe?

     

    I originally had br0 + host access as well. Now I just have it on br0.10 (VLAN 10)

    Link to comment

    Please attach complete diagnostics, a screenshot doesn't show the relevant information.

     

    In case your server stops responding, consider activating the "Mirror syslog to flash" function, see Tools -> Syslog server. This saves a copy of your syslog to flash, which can be retrieved afterwards.

    Edited by bonienl
    Link to comment
    6 hours ago, bonienl said:

    Please attach complete diagnostics, a screenshot doesn't show the relevant information.

     

    In case your server stops responding, consider activating the "Mirror syslog to flash" function, see Tools -> Syslog server. This saves a copy of your syslog to flash, which can be retrieved afterwards.

     

    I set up an actual syslog server to capture it myself but yes, this is indeed the way.

    I should note my first panic after doing this did NOT get captured but the 2nd one did and that's what's posted in the first post here.

     

    9 hours ago, whoopn said:

    To UNRAID staff...can we get an official response?

     

    it happened again after only being up for one day.  

    0F1FCC4A-4318-4761-9E18-A00CC550CF9E.bmp 606.65 kB · 2 downloads

     

    They did mention it already saying that they're looking into this :) A few posts back Limetech popped in on this.

    Link to comment
    23 hours ago, CorneliousJD said:

     

    They did mention it already saying that they're looking into this :) A few posts back Limetech popped in on this.

     

    Ah thank you!  I'm going to set up syslog as well as disable any host attached network settings for docker...I'm seriously considering moving all of my docker containers to a VM...

    Link to comment

    This is the reason I asked in the thread whether we can attach a VF to a container; it could help solve the issue and still meet the requirement.

     

    Edited by trott
    Link to comment
    27 minutes ago, whoopn said:

     

    Ah thank you!  I'm going to set up syslog as well as disable any host attached network settings for docker...I'm seriously considering moving all of my docker containers to a VM...

     

    If you have anything on br0 custom network, change that, or put it on a VLAN, and possibly disable host access as well. I think moving things to a VM is way overkill. I have nearly 60 containers running, and had just 2 on br0, I moved one to br0.10 (VLAN 10) and just changed the other to bridge mode instead.

     

    Been running solid now ever since.
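To find out which containers are still attached to the untagged br0 network, something like the Python sketch below may help. It assumes the input is the parsed JSON array printed by `docker network inspect`, and that the custom network is a macvlan network whose `parent` option is `br0`; the function name and the sample invocation in the comment are only illustrative.

```python
def containers_on_parent(networks, parent="br0"):
    """List (name, IPv4 address) for containers attached to macvlan
    networks whose parent interface equals `parent`.

    `networks` is the parsed JSON array from `docker network inspect`.
    """
    found = []
    for net in networks:
        if net.get("Driver") != "macvlan":
            continue
        if (net.get("Options") or {}).get("parent") != parent:
            continue
        for info in (net.get("Containers") or {}).values():
            found.append((info.get("Name"), info.get("IPv4Address")))
    return found


# Illustrative usage (network name "br0" is an assumption about your setup):
#   import json, subprocess
#   raw = subprocess.run(["docker", "network", "inspect", "br0"],
#                        capture_output=True, text=True, check=True).stdout
#   print(containers_on_parent(json.loads(raw)))
```

Anything it lists is a candidate for moving to a VLAN network such as br0.10 or to bridge mode, as described above.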

    Link to comment
    3 hours ago, Jimmy said:

    My web interface crashes randomly, and I don't think I had a static IP for my dockers. Here are my logs

     

    You have a different problem.

    Your system is running out of memory and I also see disk full statuses.

     

    Better create a report under General Support

    Start your system in safe mode and post your diagnostics under that report.

     

    Link to comment
    4 minutes ago, bonienl said:

     

    You have a different problem.

    Your system is running out of memory and I also see disk full statuses.

     

    Better create a report under General Support

    Start your system in safe mode and post your diagnostics under that report.

     

    But that’s not the case....sometimes in the logs I have a Kernel Panic that appears.

    image.png.89e8ea6e7fec4b72f2b10df29b040e82.png
    image.png.087d0505f2fc8cb795a591060c274cdc.png

    Link to comment
    16 minutes ago, Jimmy said:

    .sometimes in the logs I have a Kernel Panic that appears.

     

    I don't see the call trace specific to this issue in your log files.

     

    • Thanks 1
    Link to comment

    I run all my Dockers on br0. All have a custom IP that I assigned.

     

    I read VLAN this and VLAN that. Why would I want to do that? If it fixes the lockups, that isn't a good reason in my book, because then something is wrong, and setting this up without the VLAN should not be possible.

     

    I'm just trying to find out why this happens. I don't like answers like "put it on a VLAN and then it works." That sounds to me like taking your car to the garage and saying "my car doesn't drive forward anymore, but reverse still works," and the garage replying "so just drive it in reverse."

     

    I had a Medusa docker, which was the only one not on br0, but that one seems to be gone after I tried to save settings. So I have to recreate that one now :(

    It's been running for 5+ days now.

    Link to comment

    This should explain why the panics don't occur right after a crash/restart of the server: setting "Host Access" to yes does not persist across a reboot for me. After a reboot it still shows yes, but the host can't access the container.

    So I have to disable and then re-enable host access in the settings.

    After that it works again, but the kernel panic will occur at some point.

     

    Maybe I'm alone with this setting persistence bug, but if more people are in the same boat and it has been unknown until now, then there could be many more people with kernel panics coming xD

     

    This can be easily verified: just ping the container from the host after a reboot.
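A rough way to script that check in Python is sketched below. It assumes the Linux `ping` from iputils is available on the host (`-c` for the probe count, `-W` for the reply timeout in seconds); the function name and the example address are placeholders, so substitute the static IP of one of your own containers.

```python
import subprocess


def host_can_reach(ip, run=subprocess.run):
    """Send three ICMP echo requests to `ip`.

    Returns True if ping exits with status 0, False otherwise.
    `run` can be swapped out for testing.
    """
    result = run(
        ["ping", "-c", "3", "-W", "2", ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


# Illustrative usage after a reboot (192.0.2.10 is a placeholder address):
#   if not host_can_reach("192.0.2.10"):
#       print("Host access is not working even though the setting shows yes")
```

If this returns False right after a reboot while the setting still shows yes, that would match the behaviour described in this post.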

     

    P.S. Sometimes I run for weeks before a panic occurs.

    Edited by DarkMan83
    Link to comment

    Just a quick follow-up to my previous post.  It's been about a week now since I changed my 2 Docker containers from br0 + static IP to Host and so far no crashes.  All of my other Docker containers were already on Host or Bridge.  I realize that a week isn't a long time, but prior to this change I had 3 KP in the span of a week so this change appears to have made a significant improvement and may point to the root cause of the issue.  I'll report back again in another week good or bad.

     

    UPDATE:

    url_redirect.cgi.bmp

    I spoke too soon.  Just had a KP.  Here is a screen shot.  I know it's not the same as a log, but it is all that I have.

    Edited by jsiemon
    Link to comment
    22 hours ago, jsiemon said:

    Just a quick follow-up to my previous post.  It's been about a week now since I changed my 2 Docker containers from br0 + static IP to Host and so far no crashes.  All of my other Docker containers were already on Host or Bridge.  I realize that a week isn't a long time, but prior to this change I had 3 KP in the span of a week so this change appears to have made a significant improvement and may point to the root cause of the issue.  I'll report back again in another week good or bad.

     

    UPDATE:

    url_redirect.cgi.bmp 606.65 kB · 12 downloads

    I spoke too soon.  Just had a KP.  Here is a screen shot.  I know it's not the same as a log, but it is all that I have.

    Same kernel panic as always xD

    Link to comment

    I'm getting this kernel panic too, and unRAID freezes and usually has a black empty screen when I plug my monitor in.

    Managed to snap a pic of the error on my phone on 10th March, pic attached. After this I set up a server syslog.

     

    Today, I had another crash/freeze, but no error message, just a blank screen. Attached are my diags and syslog (I hard reset my system at Mar 28 11:24:54)

     

    Overall it's frozen/crashed like this about 4 or 5 times in the past couple of months or so.

     

    kernelpanic.jpg

    syslog-192.168.0.21.log rocinante-diagnostics-20210328-1202.zip

    Link to comment

    Following up - Just got a kernel panic sometime early this morning despite reconfiguring my only static ip'd docker to a br0.4 vlan. 

     

    Link to comment

    I've gone to the effort of building an 11th gen NUC system to replace my main unraid server, partly due to needing to troubleshoot this bug! The 11th gen was a bit of a failure as it seems driver support for 11th gen CPU Quick Sync is shocking right now.

     

    So I found a cheap used 10400T mini pc! Everything is running from there as of last night.

     

    Now it's time to unplug some hard drives and start testing this bug again! Good (well bad) but good to see others are still seeing the issue.

    Link to comment
    8 minutes ago, bonienl said:

    People, please post diagnostics.

     

    Sorry about that, should have added I posted a couple of posts up with all that attached.

    Link to comment
    17 minutes ago, bonienl said:

    People, please post diagnostics.

     

    Disabling host access to custom networks has helped eliminate my kernel panics. Is this something you’d like me to re-enable for the cause? 

    Link to comment

    If you have a way to create these call traces on demand that would be helpful (of course we need diagnostics to further investigate). I have host access enabled but don't have any of these call traces, and as such it is hard to reproduce the issue for me.

     

    In the next Unraid version some more conntrack modules will be loaded, which will hopefully help to tackle the problem in more detail.

     

    • Like 2
    Link to comment
    20 minutes ago, bonienl said:

    If you have a way to create these call traces on demand that would be helpful (of course we need diagnostics to further investigate). I have host access enabled but don't have any of these call traces, and as such it is hard to reproduce the issue for me.

     

    In the next Unraid version some more conntrack modules will be loaded, which will hopefully help to tackle the problem in more detail.

     

     

    FWIW I was never able to create them on demand despite trying to, although for me it seemed to happen roughly every ~60 hours; I would never hit 3 days of uptime before I made VLANs.

     

    I also never had call traces "build up" - it was one and done, resulting in a kernel panic for me.

     

    I still have host access enabled with VLANs and I've eliminated my call traces and kernel panics so far. 

    Uptime 17 days 4 hours 8 minutes since making those changes. 

     

    If anyone finds a way to force this to happen I would consider purposely recreating the issue to help provide whatever logs are needed beyond what I've already supplied (diags and syslog of the call trace and panic).

    Link to comment
    9 minutes ago, CorneliousJD said:

    I also never had call traces "build up" - it was one and done

     

    In this case, is it only the connection to the server that stops working, or has the complete server halted?

    In other words, is the local console still working in this case?

     

    Link to comment





  • Status Definitions

     

    Open = Under consideration.

     

    Solved = The issue has been resolved.

     

    Solved version = The issue has been resolved in the indicated release version.

     

    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.

     

    Retest = Please retest in latest release.


    Priority Definitions

     

    Minor = Something not working correctly.

     

    Urgent = Server crash, data loss, or other showstopper.

     

    Annoyance = Doesn't affect functionality but should be fixed.

     

    Other = Announcement or other non-issue.