• 6.9.0/6.9.1 - Kernel Panic due to netfilter (nf_nat_setup_info) - Docker Static IP (macvlan)


    CorneliousJD
    • Urgent

    So I had posted another thread about how, after a kernel panic, Docker host access to custom networks doesn't work until Docker is stopped/restarted on 6.9.0.

     

     

    After further investigation and setting up syslogging, it appears that it may actually be that host access that's CAUSING the kernel panic?

    EDIT: 3/16 - I guess I needed to create a VLAN for my dockers with static IPs; so far that's working, so it's probably not HOST access causing the issue, but rather br0 static IPs being set. See the posts below.

     

    Here's my last kernel panic that thankfully got logged to syslog. It references macvlan and netfilter. I don't know enough to be super useful here, but this is my docker setup.

     

    [Screenshot: Docker network settings]

     

    Mar 12 03:57:07 Server kernel: ------------[ cut here ]------------
    Mar 12 03:57:07 Server kernel: WARNING: CPU: 17 PID: 626 at net/netfilter/nf_nat_core.c:614 nf_nat_setup_info+0x6c/0x652 [nf_nat]
    Mar 12 03:57:07 Server kernel: Modules linked in: ccp macvlan xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap veth xt_nat xt_MASQUERADE iptable_nat nf_nat xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables bonding igb i2c_algo_bit cp210x usbserial sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd ipmi_ssif isci glue_helper mpt3sas i2c_i801 rapl libsas i2c_smbus input_leds i2c_core ahci intel_cstate raid_class led_class acpi_ipmi intel_uncore libahci scsi_transport_sas wmi ipmi_si button [last unloaded: ipmi_devintf]
    Mar 12 03:57:07 Server kernel: CPU: 17 PID: 626 Comm: kworker/17:2 Tainted: G        W         5.10.19-Unraid #1
    Mar 12 03:57:07 Server kernel: Hardware name: Supermicro PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
    Mar 12 03:57:07 Server kernel: Workqueue: events macvlan_process_broadcast [macvlan]
    Mar 12 03:57:07 Server kernel: RIP: 0010:nf_nat_setup_info+0x6c/0x652 [nf_nat]
    Mar 12 03:57:07 Server kernel: Code: 89 fb 49 89 f6 41 89 d4 76 02 0f 0b 48 8b 93 80 00 00 00 89 d0 25 00 01 00 00 45 85 e4 75 07 89 d0 25 80 00 00 00 85 c0 74 07 <0f> 0b e9 1f 05 00 00 48 8b 83 90 00 00 00 4c 8d 6c 24 20 48 8d 73
    Mar 12 03:57:07 Server kernel: RSP: 0018:ffffc90006778c38 EFLAGS: 00010202
    Mar 12 03:57:07 Server kernel: RAX: 0000000000000080 RBX: ffff88837c8303c0 RCX: ffff88811e834880
    Mar 12 03:57:07 Server kernel: RDX: 0000000000000180 RSI: ffffc90006778d14 RDI: ffff88837c8303c0
    Mar 12 03:57:07 Server kernel: RBP: ffffc90006778d00 R08: 0000000000000000 R09: ffff889083c68160
    Mar 12 03:57:07 Server kernel: R10: 0000000000000158 R11: ffff8881e79c1400 R12: 0000000000000000
    Mar 12 03:57:07 Server kernel: R13: 0000000000000000 R14: ffffc90006778d14 R15: 0000000000000001
    Mar 12 03:57:07 Server kernel: FS:  0000000000000000(0000) GS:ffff88903fc40000(0000) knlGS:0000000000000000
    Mar 12 03:57:07 Server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar 12 03:57:07 Server kernel: CR2: 000000c000b040b8 CR3: 000000000200c005 CR4: 00000000001706e0
    Mar 12 03:57:07 Server kernel: Call Trace:
    Mar 12 03:57:07 Server kernel: <IRQ>
    Mar 12 03:57:07 Server kernel: ? activate_task+0x9/0x12
    Mar 12 03:57:07 Server kernel: ? resched_curr+0x3f/0x4c
    Mar 12 03:57:07 Server kernel: ? ipt_do_table+0x49b/0x5c0 [ip_tables]
    Mar 12 03:57:07 Server kernel: ? try_to_wake_up+0x1b0/0x1e5
    Mar 12 03:57:07 Server kernel: nf_nat_alloc_null_binding+0x71/0x88 [nf_nat]
    Mar 12 03:57:07 Server kernel: nf_nat_inet_fn+0x91/0x182 [nf_nat]
    Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
    Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
    Mar 12 03:57:07 Server kernel: ? ip_protocol_deliver_rcu+0xfe/0xfe
    Mar 12 03:57:07 Server kernel: ip_local_deliver+0x49/0x75
    Mar 12 03:57:07 Server kernel: ip_sabotage_in+0x43/0x4d
    Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
    Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
    Mar 12 03:57:07 Server kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
    Mar 12 03:57:07 Server kernel: ip_rcv+0x41/0x61
    Mar 12 03:57:07 Server kernel: __netif_receive_skb_one_core+0x74/0x95
    Mar 12 03:57:07 Server kernel: process_backlog+0xa3/0x13b
    Mar 12 03:57:07 Server kernel: net_rx_action+0xf4/0x29d
    Mar 12 03:57:07 Server kernel: __do_softirq+0xc4/0x1c2
    Mar 12 03:57:07 Server kernel: asm_call_irq_on_stack+0x12/0x20
    Mar 12 03:57:07 Server kernel: </IRQ>
    Mar 12 03:57:07 Server kernel: do_softirq_own_stack+0x2c/0x39
    Mar 12 03:57:07 Server kernel: do_softirq+0x3a/0x44
    Mar 12 03:57:07 Server kernel: netif_rx_ni+0x1c/0x22
    Mar 12 03:57:07 Server kernel: macvlan_broadcast+0x10e/0x13c [macvlan]
    Mar 12 03:57:07 Server kernel: macvlan_process_broadcast+0xf8/0x143 [macvlan]
    Mar 12 03:57:07 Server kernel: process_one_work+0x13c/0x1d5
    Mar 12 03:57:07 Server kernel: worker_thread+0x18b/0x22f
    Mar 12 03:57:07 Server kernel: ? process_scheduled_works+0x27/0x27
    Mar 12 03:57:07 Server kernel: kthread+0xe5/0xea
    Mar 12 03:57:07 Server kernel: ? __kthread_bind_mask+0x57/0x57
    Mar 12 03:57:07 Server kernel: ret_from_fork+0x22/0x30
    Mar 12 03:57:07 Server kernel: ---[ end trace b3ca21ac5f2c2720 ]---
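For context, the setup implicated above -- Docker containers with static IPs on br0 -- corresponds to a Docker macvlan network. A minimal sketch of how such a network is created from the CLI follows; every name and address below is a placeholder for illustration, not taken from this report:

```shell
# Placeholders throughout -- substitute your own NIC, subnet, and addresses.
NET=my_macvlan
PARENT=eth0              # physical interface the containers will share
SUBNET=192.168.1.0/24
GATEWAY=192.168.1.1
IP=192.168.1.50          # static IP for one example container

if command -v docker >/dev/null 2>&1; then
  # Create the macvlan network bound to the parent NIC...
  docker network create -d macvlan \
    --subnet="$SUBNET" --gateway="$GATEWAY" \
    -o parent="$PARENT" "$NET" || echo "create failed (needs a real NIC)"
  # ...and attach a container with a fixed address on it.
  docker run -d --name web --network "$NET" --ip "$IP" nginx \
    || echo "run failed (needs a working docker host)"
else
  echo "docker not available; commands shown for reference only"
fi
```

Note that macvlan deliberately blocks traffic between the host and its own containers' macvlan addresses, which is why Unraid offers the separate "host access to custom networks" option discussed in this thread.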

     




    User Feedback

    Recommended Comments



    4 hours ago, bonienl said:

    Perfectly normal when you enable VLANs for an interface.

    VLAN 0 refers to the untagged communication.

    Yes, I am well aware of the significance of VLAN 0. I'm also well aware that it happens when VLANs are enabled for an interface.

     

    However, upon reading my message, the following things stand out as unusual -- which may be why I wrote them in detail, and provided screenshots --

     

    1 -- There are no currently enabled interfaces with VLANs enabled so..."perfectly normal when" goes right out the window, yeah?

    2 -- Before (when I was crashing) I did not have VLAN 0 messages in my log

    3 -- After (now that I'm NOT crashing) I DO have VLAN 0 messages in my log

     

    The first point I was trying to raise was that I'm not using VLANs on any of my enabled network interfaces, and none of my disabled interfaces have VLANs configured, although VLAN support is enabled on them -- yet the system still seems to be activating VLAN 0 as if preparing for VLAN operation. That seems abnormal, which is why I commented on it; if it were normal I wouldn't have taken the time.

     

    The second point I was trying to raise, specifically, was that I seem to have stumbled upon a potential solution for this crash problem while Docker/Kernel/Limetech find a solution.

     

    What I'd like to see is if anyone experiencing these crashes could do the following:

    A) Enable VLAN but do not configure it on an unused network interface

    B) DISABLE that interface (IPv4/IPv6 Address Assignment None, not a member of bridge or bond)

    C) Reboot, start your array, check if you see "VLAN 0" related log messages

    D) Report if your system becomes stable

     

    I've got uptime around a month now, perfectly stable, with no changes to my system beyond what's mentioned above and in greater detail in my previous message.
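If it helps, step C boils down to grepping the log for the kernel's VLAN 0 registration message. Here's a small self-contained demo with made-up sample lines (on a live server you would grep /var/log/syslog, or wherever your syslog target writes):

```shell
# Two made-up sample lines standing in for real syslog content.
printf '%s\n' \
  "kernel: igb 0000:04:00.0 eth0: adding VLAN 0 to HW filter on device eth0" \
  "kernel: igb 0000:04:00.1 eth1: igb: eth1 NIC Link is Up 1000 Mbps" \
  > /tmp/sample_syslog

# Step C is essentially this (counts matching lines, case-insensitively):
grep -ci "vlan 0" /tmp/sample_syslog    # -> 1
```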

    Edited by codefaux
    On 7/6/2021 at 1:14 PM, codefaux said:

    What I'd like to see is if anyone experiencing these crashes could do the following [...]


    Well, knock on wood, but my system seems to be stable after making these changes. I had trouble even keeping Unraid up for more than 2 days, but it has now been running without issues for about a week.


    For what it's worth, I was having this issue as well, and simply disabling host access to custom networks seems to have resolved it for me. No VLAN changes or anything else. I couldn't keep it running for longer than a day or two, and now it's been running for about a week and a half without an issue.

    On 8/3/2021 at 10:15 AM, VlarpNL said:

    Well, knock on wood, but my system seems to be stable after making these changes. I had trouble even keeping Unraid up for more than 2 days, but it has now been running without issues for about a week.

    Well, there you have it. I jinxed it. My server just froze up. 

    I'm disabling host access to custom networks to see if that improves anything. Fingers crossed.

    On 6/12/2021 at 10:56 AM, bonienl said:

    For Unraid version 6.10 I have replaced the Docker macvlan driver for the Docker ipvlan driver.

     

    IPvlan is a new twist on the tried and true network virtualization technique. The Linux implementations are extremely lightweight because rather than using the traditional Linux bridge for isolation, they are associated to a Linux Ethernet interface or sub-interface to enforce separation between networks and connectivity to the physical network.

     

    The end-user doesn't have to do anything special. At startup legacy networks are automatically removed and replaced by the new network approach. Please test once 6.10 becomes available. Internal testing looks very good so far.

     

    17 hours on 6.10.0-rc1 with the switch over to ipvlan and no panics yet....knock on wood  :) If I make it to 48 hours I will be convinced.
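For anyone who wants to recreate a network by hand rather than wait for the automatic migration, an ipvlan network is created much like a macvlan one -- only the driver changes. A sketch with placeholder names and addresses (none of these are from the posts above):

```shell
# Placeholders -- substitute your own interface and LAN addressing.
NET=my_ipvlan
PARENT=eth0
SUBNET=192.168.1.0/24
GATEWAY=192.168.1.1

if command -v docker >/dev/null 2>&1; then
  # ipvlan in its default L2 mode: containers share the parent's MAC but
  # get their own IPs, avoiding the macvlan broadcast path entirely.
  docker network create -d ipvlan \
    --subnet="$SUBNET" --gateway="$GATEWAY" \
    -o parent="$PARENT" "$NET" || echo "create failed (needs a real NIC)"
else
  echo "docker not available; command shown for reference only"
fi
```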

    On 8/4/2021 at 7:33 AM, VlarpNL said:

    Well, there you have it. I jinxed it. My server just froze up. 

    I'm disabling host access to custom networks to see if that improves anything. Fingers crossed.

    Sorry to hear it -- I'm still running the configuration posted and still stable save the recent power bumps in our area, but I had nearly a month of uptime at one point and even that power cycle was scheduled. I've also moved additional Docker containers onto the system so I could shut down the other for power/heat savings during the heat wave. I haven't had the spoons to convince myself to change anything since it became stable, so I'm not on the RC yet.

     

    I might suggest that if there is a panic, it could be unrelated -- post its details just to be sure. Various hardware have errata that could cause panics, including both Intel and AMD C-state bugs on various generations, and even filesystem corruption from previous panics -- something Unraid doesn't forward to the UI from the kernel logs, by default.

     

    Good luck all.

    1 hour ago, codefaux said:

    I might suggest that if there is a panic, it could be unrelated -- post its details just to be sure. [...]


    I finally enabled syslog etc. and it seems my server crashes are no longer related to this issue at all. I had the macvlan problem earlier, so I made the assumption (yes, I know! Assumptions are the mother of all f*ups) that it was still the same issue.

     

    That probably means that the macvlan issue is gone on my machine. 


    So, bit of a curveball: I ran macvlan br0 on the new RC for 24 hours without issue.

    I hadn't realised that I hadn't enabled ipvlan. I have now enabled it and it's still all good.

    But I had not been able to run for 24 hours on macvlan with 6.9.

    Edited by DuzAwe

    ipvlan seems like an upgrade from macvlan anyway, so I didn't even bother testing my config on macvlan with 6.10.0-rc1. I went straight to ipvlan as soon as the update was complete, and I'm now at 45 hours with no issues.

    Generally I couldn't make it half that long on 6.9.x with my configuration before locking up.


    This morning I had another crash, again related to br_netfilter/nf_nat. So I have upgraded to 6.10.0-rc1 and switched to ipvlan for Docker. Hopefully this solves the issue.


    Unraid 6.9.2 here. Experiencing the same problem. Happens every once in a long while. The kernel panic screen shows something related to IPv6. The only changes to my server recently are upgrading from 6.8.3 and reassigning a bunch of dockers to br0,br1.


    I'm stable running on 6.10.0-rc1 with ipvlan for over a week now. An uptime of more than 7 days is a record in the past couple of months. 


    Been getting kernel panics about once a week. I figured out a temporary workaround to deal with it:

     

    echo 60 >/proc/sys/kernel/panic

     

    Put this in your go file and Unraid will reboot 60 seconds after a panic instead of being stuck in an infinite panic loop. With this you can at least manage it remotely without having to be there to press the reset button.
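To make that persistent, the line just needs to end up in the go file (on stock Unraid that's /boot/config/go, but verify the path on your system). Here's a sketch that appends it idempotently, run against a scratch copy so it's safe to try anywhere; the same setting can also be applied immediately with `sysctl -w kernel.panic=60`:

```shell
# Demo against a scratch file; on a real server use GO=/boot/config/go
GO=/tmp/go_demo
printf '#!/bin/bash\n' > "$GO"

# Append the panic timeout only if it isn't already present.
grep -qF 'kernel/panic' "$GO" || echo 'echo 60 >/proc/sys/kernel/panic' >> "$GO"
# Running it a second time is a no-op thanks to the guard:
grep -qF 'kernel/panic' "$GO" || echo 'echo 60 >/proc/sys/kernel/panic' >> "$GO"

grep -c 'kernel/panic' "$GO"    # -> 1 (no duplicate lines)
```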





