• 6.9.0/6.9.1 - Kernel Panic due to netfilter (nf_nat_setup_info) - Docker Static IP (macvlan)


    CorneliousJD
    • Urgent

    So I had posted another thread about how, after a kernel panic, Docker host access to custom networks doesn't work until Docker is stopped/restarted on 6.9.0

     

     

    After further investigation and setting up syslogging, it appears that it may actually be host access that's CAUSING the kernel panic? 

    EDIT: 3/16 - I guess I needed to create a VLAN for my dockers with static IPs. So far that's working, so it's probably not HOST access causing the issue but rather br0 static IPs being set. See the posts below.

     

    Here's my last kernel panic that thankfully got logged to syslog. It references macvlan and netfilter. I don't know enough to be super useful here, but this is my docker setup.

     

    [Screenshot attached: my Docker setup]

     

    Mar 12 03:57:07 Server kernel: ------------[ cut here ]------------
    Mar 12 03:57:07 Server kernel: WARNING: CPU: 17 PID: 626 at net/netfilter/nf_nat_core.c:614 nf_nat_setup_info+0x6c/0x652 [nf_nat]
    Mar 12 03:57:07 Server kernel: Modules linked in: ccp macvlan xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap veth xt_nat xt_MASQUERADE iptable_nat nf_nat xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables bonding igb i2c_algo_bit cp210x usbserial sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd ipmi_ssif isci glue_helper mpt3sas i2c_i801 rapl libsas i2c_smbus input_leds i2c_core ahci intel_cstate raid_class led_class acpi_ipmi intel_uncore libahci scsi_transport_sas wmi ipmi_si button [last unloaded: ipmi_devintf]
    Mar 12 03:57:07 Server kernel: CPU: 17 PID: 626 Comm: kworker/17:2 Tainted: G        W         5.10.19-Unraid #1
    Mar 12 03:57:07 Server kernel: Hardware name: Supermicro PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
    Mar 12 03:57:07 Server kernel: Workqueue: events macvlan_process_broadcast [macvlan]
    Mar 12 03:57:07 Server kernel: RIP: 0010:nf_nat_setup_info+0x6c/0x652 [nf_nat]
    Mar 12 03:57:07 Server kernel: Code: 89 fb 49 89 f6 41 89 d4 76 02 0f 0b 48 8b 93 80 00 00 00 89 d0 25 00 01 00 00 45 85 e4 75 07 89 d0 25 80 00 00 00 85 c0 74 07 <0f> 0b e9 1f 05 00 00 48 8b 83 90 00 00 00 4c 8d 6c 24 20 48 8d 73
    Mar 12 03:57:07 Server kernel: RSP: 0018:ffffc90006778c38 EFLAGS: 00010202
    Mar 12 03:57:07 Server kernel: RAX: 0000000000000080 RBX: ffff88837c8303c0 RCX: ffff88811e834880
    Mar 12 03:57:07 Server kernel: RDX: 0000000000000180 RSI: ffffc90006778d14 RDI: ffff88837c8303c0
    Mar 12 03:57:07 Server kernel: RBP: ffffc90006778d00 R08: 0000000000000000 R09: ffff889083c68160
    Mar 12 03:57:07 Server kernel: R10: 0000000000000158 R11: ffff8881e79c1400 R12: 0000000000000000
    Mar 12 03:57:07 Server kernel: R13: 0000000000000000 R14: ffffc90006778d14 R15: 0000000000000001
    Mar 12 03:57:07 Server kernel: FS:  0000000000000000(0000) GS:ffff88903fc40000(0000) knlGS:0000000000000000
    Mar 12 03:57:07 Server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar 12 03:57:07 Server kernel: CR2: 000000c000b040b8 CR3: 000000000200c005 CR4: 00000000001706e0
    Mar 12 03:57:07 Server kernel: Call Trace:
    Mar 12 03:57:07 Server kernel: <IRQ>
    Mar 12 03:57:07 Server kernel: ? activate_task+0x9/0x12
    Mar 12 03:57:07 Server kernel: ? resched_curr+0x3f/0x4c
    Mar 12 03:57:07 Server kernel: ? ipt_do_table+0x49b/0x5c0 [ip_tables]
    Mar 12 03:57:07 Server kernel: ? try_to_wake_up+0x1b0/0x1e5
    Mar 12 03:57:07 Server kernel: nf_nat_alloc_null_binding+0x71/0x88 [nf_nat]
    Mar 12 03:57:07 Server kernel: nf_nat_inet_fn+0x91/0x182 [nf_nat]
    Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
    Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
    Mar 12 03:57:07 Server kernel: ? ip_protocol_deliver_rcu+0xfe/0xfe
    Mar 12 03:57:07 Server kernel: ip_local_deliver+0x49/0x75
    Mar 12 03:57:07 Server kernel: ip_sabotage_in+0x43/0x4d
    Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
    Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
    Mar 12 03:57:07 Server kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
    Mar 12 03:57:07 Server kernel: ip_rcv+0x41/0x61
    Mar 12 03:57:07 Server kernel: __netif_receive_skb_one_core+0x74/0x95
    Mar 12 03:57:07 Server kernel: process_backlog+0xa3/0x13b
    Mar 12 03:57:07 Server kernel: net_rx_action+0xf4/0x29d
    Mar 12 03:57:07 Server kernel: __do_softirq+0xc4/0x1c2
    Mar 12 03:57:07 Server kernel: asm_call_irq_on_stack+0x12/0x20
    Mar 12 03:57:07 Server kernel: </IRQ>
    Mar 12 03:57:07 Server kernel: do_softirq_own_stack+0x2c/0x39
    Mar 12 03:57:07 Server kernel: do_softirq+0x3a/0x44
    Mar 12 03:57:07 Server kernel: netif_rx_ni+0x1c/0x22
    Mar 12 03:57:07 Server kernel: macvlan_broadcast+0x10e/0x13c [macvlan]
    Mar 12 03:57:07 Server kernel: macvlan_process_broadcast+0xf8/0x143 [macvlan]
    Mar 12 03:57:07 Server kernel: process_one_work+0x13c/0x1d5
    Mar 12 03:57:07 Server kernel: worker_thread+0x18b/0x22f
    Mar 12 03:57:07 Server kernel: ? process_scheduled_works+0x27/0x27
    Mar 12 03:57:07 Server kernel: kthread+0xe5/0xea
    Mar 12 03:57:07 Server kernel: ? __kthread_bind_mask+0x57/0x57
    Mar 12 03:57:07 Server kernel: ret_from_fork+0x22/0x30
    Mar 12 03:57:07 Server kernel: ---[ end trace b3ca21ac5f2c2720 ]---
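If you're trying to catch these on your own server, here's a rough sketch of a check you could run against wherever your remote or flash-mirrored syslog lands (the function name is made up, and the pattern list just mirrors the signatures seen in the traces posted in this thread):

```shell
#!/bin/sh
# Sketch: scan a syslog file for the macvlan/netfilter trace signatures
# seen in this thread. Pass the path where your remote or flash-mirrored
# syslog lands (that location varies per setup).
scan_macvlan_traces() {
    grep -E 'nf_nat_setup_info|__nf_conntrack_confirm|macvlan_process_broadcast' "$1"
}
```

Usage would be something like `scan_macvlan_traces /var/log/syslog` after a crash.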

     




    User Feedback

    Recommended Comments



    4 hours ago, bonienl said:

    Perfectly normal when you enable VLANs for an interface.

    VLAN 0 refers to the untagged communication.

    Yes, I am well aware of the significance of VLAN 0. I'm also well aware that it happens when VLANs are enabled for an interface.

     

    However, upon reading my message, the following things stand out as unusual -- which may be why I wrote them in detail, and provided screenshots --

     

    1 -- There are no currently enabled interfaces with VLANs enabled so..."perfectly normal when" goes right out the window, yeah?

    2 -- Before (when I was crashing) I did not have VLAN 0 messages in my log

    3 -- After (now that I'm NOT crashing) I DO have VLAN 0 messages in my log

     

    The first point I was trying to raise, in this case, was that I'm not using VLANs on any of my enabled network interfaces, and while VLANs are enabled on my disabled interfaces, none of them have any VLAN configured -- yet the system still seems to be activating VLAN 0 as if preparing for VLAN operation. That seems abnormal, hence my commenting on it; if it were normal I wouldn't have taken the time.

     

    The second point I was trying to raise, specifically, was that I seem to have stumbled upon a potential workaround for this crash problem while Docker/Kernel/Limetech find a proper fix.

     

    What I'd like to see is if anyone experiencing these crashes could do the following:

    A) Enable VLAN but do not configure it on an unused network interface

    B) DISABLE that interface (IPv4/IPv6 Address Assignment None, not a member of bridge or bond)

    C) Reboot, start your array, check if you see "VLAN 0" related log messages

    D) Report if your system becomes stable

     

    I've got uptime around a month now, perfectly stable, with no changes to my system beyond what's mentioned above and in greater detail in my previous message.

    Edited by codefaux
    Link to comment
    On 7/6/2021 at 1:14 PM, codefaux said:

    Yes, I am well aware of the significance of VLAN 0. I'm also well aware that it happens when VLANs are enabled for an interface. […]

    Well, knock on wood, but my system seems to be stable after making these changes. I had trouble even keeping Unraid up for more than 2 days, but it has now been running without issues for about a week.

    • Like 1
    Link to comment

    For what it's worth, I was having this issue as well, and simply disabling host access to custom networks seems to have resolved it for me. No VLAN changes or anything else. Couldn't keep it running for longer than a day or two, and now it's been running for about a week and a half without an issue.

    • Like 1
    Link to comment
    On 8/3/2021 at 10:15 AM, VlarpNL said:

    Well, knock on wood, but my system seems to be stable after making these changes. I had trouble even keeping Unraid up for more than 2 days, but it has now been running without issues for about a week.

    Well, there you have it. I jinxed it. My server just froze up. 

    I'm disabling host access to custom networks to see if that improves anything. Fingers crossed.

    Link to comment
    On 6/12/2021 at 10:56 AM, bonienl said:

    For Unraid version 6.10 I have replaced the Docker macvlan driver for the Docker ipvlan driver.

     

    IPvlan is a new twist on the tried and true network virtualization technique. The Linux implementations are extremely lightweight because rather than using the traditional Linux bridge for isolation, they are associated to a Linux Ethernet interface or sub-interface to enforce separation between networks and connectivity to the physical network.

     

    The end-user doesn't have to do anything special. At startup legacy networks are automatically removed and replaced by the new network approach. Please test once 6.10 becomes available. Internal testing looks very good so far.

     

    17 hours on 6.10.0-rc1 with the switch over to ipvlan and no panics yet... knock on wood :) If I make it to 48 hours I will be convinced.
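For anyone curious what the driver switch amounts to in plain-Docker terms, it's roughly the difference below. This is only a sketch: Unraid creates its custom networks for you at array start, and the subnet, gateway, parent interface, and network names here are illustrative assumptions, not Unraid's actual values.

```shell
# Illustrative only -- Unraid provisions these networks automatically.
# Subnet, gateway, parent NIC, and names below are assumptions.

# Old approach: macvlan driver, each container gets its own MAC on the parent NIC
docker network create -d macvlan \
  --subnet=192.168.1.0/24 --gateway=192.168.1.1 \
  -o parent=br0 br0_macvlan

# New approach: ipvlan in L2 mode, containers share the parent NIC's MAC
docker network create -d ipvlan \
  --subnet=192.168.1.0/24 --gateway=192.168.1.1 \
  -o parent=br0 -o ipvlan_mode=l2 br0_ipvlan
```

The practical upshot is that ipvlan avoids the per-container MAC handling (and the broadcast processing seen in the traces above) that macvlan requires.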

    • Like 1
    Link to comment
    On 8/4/2021 at 7:33 AM, VlarpNL said:

    Well, there you have it. I jinxed it. My server just froze up. 

    I'm disabling host access to custom networks to see if that improves anything. Fingers crossed.

    Sorry to hear it -- I'm still running the configuration posted and still stable save the recent power bumps in our area, but I had nearly a month of uptime at one point and even that power cycle was scheduled. I've also moved additional Docker containers onto the system so I could shut down the other for power/heat savings during the heat wave. I haven't had the spoons to convince myself to change anything since it became stable, so I'm not on the RC yet.

     

    I might suggest that if there is a panic, it could be unrelated -- post its details just to be sure. Various hardware have errata that could cause panics, including both Intel and AMD C-state bugs on various generations, and even filesystem corruption from previous panics -- something Unraid doesn't forward to the UI from the kernel logs, by default.

     

    Good luck all.

    Link to comment
    1 hour ago, codefaux said:

    Sorry to hear it -- I'm still running the configuration posted and still stable save the recent power bumps in our area […]

    I finally enabled syslog etc., and it seems my server crashes are no longer related to this issue at all. I had the macvlan problem earlier, so I made the assumption (yes, I know! Assumptions are the mother of all f*ups) that it was still the same issue. 

     

    That probably means that the macvlan issue is gone on my machine. 

    Link to comment

    So, bit of a curveball: I ran macvlan on br0 on the new RC for 24 hours without issue. 

    I hadn't realised that I hadn't enabled ipvlan; I have now enabled it and it's still all good.

    But I had not been able to run for 24 hours on macvlan with 6.9. 

    Edited by DuzAwe
    Link to comment

    ipvlan seems like an upgrade from macvlan anyways so I didn't even bother with testing my config on macvlan with 6.10.0-rc1. I went right to ipvlan as soon as the update was complete and now at 45 hours with no issue.

     

    Generally couldn't make it half that long in 6.9.x when I tried with my configuration before locking up.

    • Like 2
    Link to comment

    This morning I had another crash, again related to br_netfilter/nf_nat. So I have upgraded to 6.10.0-rc1 and switched to ipvlan for Docker. Hoping this solves the issue.

    Link to comment

    Unraid 6.9.2 here. Experiencing the same problem. Happens every once in a long while. The kernel panic screen shows something related to IPv6. The only changes to my server recently are upgrading from 6.8.3 and reassigning a bunch of dockers to br0,br1.

    Link to comment

    I'm stable running on 6.10.0-rc1 with ipvlan for over a week now. An uptime of more than 7 days is a record in the past couple of months. 

    • Like 2
    Link to comment

    Just found this thread - I'm on 6.9.2 and have been having the same issue for some time now. 

     

    What's been the result for those of you who have upgraded to 6.10rc?



    My unraid box:

     - Host access to custom networks: off

     - Using a combination of Bridge & Br0 w/ fixed IP addresses

     - Have multiple NICs in the server, also using bonding

     - Motherboard model: Supermicro X9DRi-LN4+ (I noticed someone else having the same problem was using the same model)

     - Was running a Unifi controller container

     

    I just today removed the Unifi container to see if there's any change.

    Considering the 6.10 update as another option depending on how that's worked out for others with this problem.

     

     

    Excerpt of syslog:

     

    Oct 26 10:19:46 X1 kernel: ------------[ cut here ]------------
    Oct 26 10:19:46 X1 kernel: WARNING: CPU: 4 PID: 3970 at net/netfilter/nf_conntrack_core.c:1120 __nf_conntrack_confirm+0x9b>
    Oct 26 10:19:46 X1 kernel: Modules linked in: xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_m>
    Oct 26 10:19:46 X1 kernel: CPU: 4 PID: 3970 Comm: kworker/4:0 Not tainted 5.10.28-Unraid #1
    Oct 26 10:19:46 X1 kernel: Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.4 11/20/2019
    Oct 26 10:19:46 X1 kernel: Workqueue: events macvlan_process_broadcast [macvlan]
    Oct 26 10:19:46 X1 kernel: RIP: 0010:__nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]
    Oct 26 10:19:46 X1 kernel: Code: e8 dc f8 ff ff 44 89 fa 89 c6 41 89 c4 48 c1 eb 20 89 df 41 89 de e8 36 f6 ff ff 84 c0 75>
    Oct 26 10:19:46 X1 kernel: RSP: 0018:ffffc9000c764dd8 EFLAGS: 00010202
    Oct 26 10:19:46 X1 kernel: RAX: 0000000000000188 RBX: 000000000000144d RCX: 00000000ba3cf88b
    Oct 26 10:19:46 X1 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa03085fc
    Oct 26 10:19:46 X1 kernel: RBP: ffff8898dabb1b80 R08: 000000006d5ec060 R09: 0000000000000000
    Oct 26 10:19:46 X1 kernel: R10: 0000000000000098 R11: ffff888107e9cd00 R12: 00000000000010ff
    Oct 26 10:19:46 X1 kernel: R13: ffffffff8210b440 R14: 000000000000144d R15: 0000000000000000
    Oct 26 10:19:46 X1 kernel: FS:  0000000000000000(0000) GS:ffff88981fb00000(0000) knlGS:0000000000000000
    Oct 26 10:19:46 X1 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Oct 26 10:19:46 X1 kernel: CR2: 00001528988c3698 CR3: 00000018e47b0006 CR4: 00000000001706e0

     

    Link to comment

    Just wanted to give an update for those of you having this issue who are still on 6.9 or earlier.

     

    I have about 25 docker containers all running on various VLANs. This is on a Dell PowerEdge R510 server. I was experiencing crashes daily because of this issue. I tried tracking down the problem for weeks.

     

    I updated to 6.10 rc1 3 weeks ago and haven't had a single crash since.

     

    Migrating from macvlan to ipvlan driver couldn't be easier; none of my docker containers were affected negatively.

     

    I would definitely recommend upgrading to 6.10 if you are experiencing kernel panics and run docker with custom networks.

     

    EDIT: Attempted to upgrade to RC2 but received a "bzroot checksum failed" error on bootup. I made a flash backup before upgrading, so I manually restored to a new flash drive. Will hold off on upgrading to RC2 for a bit in case it happens again.

    Edited by PixelDJ
    Added info about RC2 upgrade.
    • Like 2
    Link to comment

    Update from another who was having the same issue on 6.9.3

     

    After upgrading to 6.10-rc2 - Same result

    Updated docker settings to use ipvlan instead of macvlan - seems to have resolved the issue.  Have not noticed any issues with any of my docker containers as a result of this change.

     

    Uptime 3 days, 43 minutes (prior to this change, I was seeing hard crashes before the 3-day mark)

     

    Will report back if there's any change.

     

     

    • Like 1
    Link to comment

    I too have been unable to run any docker containers for months due to this bug in Unraid and am deeply disappointed there is no apparent documented workable solution.

     

    I have dug through 100s of posts looking for a solution and found a lot of contradictions, and mostly "try this, then that" advice.

     

    I have tried 6.10.0-rc2 and still no help.

     

    I keep looking at ipvlan but cannot find how to set it up for my network.

     

    So for now, Unraid and all the hardware for me just burn up electricity.

    Link to comment
    30 minutes ago, -jim said:

    I have tried 6.10.0-rc2 and still no help.

     

    I keep looking at ipvlan but cannot find how to set it up for my network.

    Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right)
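Once switched, you can confirm which driver a custom network actually ended up with from the Unraid terminal. A sketch (the network name `br0` is an assumption; list yours first with `docker network ls`):

```shell
# Show the driver backing a custom Docker network ("br0" is assumed here).
docker network ls
docker network inspect br0 --format '{{.Driver}}'
```

If the second command prints `ipvlan`, the setting took effect.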

    Link to comment

    Now that I run my docker containers that need their own IP on a dedicated "docker" NIC, no issues at all on 6.9. It has been great for months.

    Link to comment
    7 hours ago, -jim said:

    I too have been unable to run any docker containers for months due to this bug in Unraid and am deeply disappointed there is no apparent documented workable solution. […]

    I still haven't updated, because the workaround I posted still works for me. I'm using Docker with 20+ containers, each with a dedicated IP, all from one NIC.

     

    I never even tried the 6.10-rc because, after it was released, I read a few posts from folks on it saying the problem still existed even with ipvlan. Honestly, I'm stable, and I'm not going to upgrade until I stop hearing about this bug.

    Link to comment

    My Unraid has become a warm brick.

    On 11/25/2021 at 6:40 AM, JorgeB said:

    Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right)

    Sure that was easy. But then there is a lot of configuration that must be done in other places that is not defined for Unraid.
     

    Edited by -jim
    Link to comment
    45 minutes ago, -jim said:

    My Unraid has become a warm brick.

    Sure that was easy. But then there is a lot of configuration that must be done in other places that is not defined for Unraid.
     


    Not really, that's all that needs to be done.

    Link to comment

    Just to add to this list, I was also experiencing these issues with 6.9.2 and was getting a kernel panic about every 2 weeks. I've been running *-rc1 and *-rc2 since they came out with ipvlan enabled and have been stable ever since then (58 days currently). For reference this is the hardware I'm running as well:

     

    Mobo: asrock X470D4U

    CPU: 3700x

    Memory: 32GB Kingston KSM26ED8

    Link to comment

    Hello Guys

     

    I think my server has been suffering from this too. I had originally thought it was down to C-States and my dual ES Xeons. It's been getting progressively worse since 6.8; I'm now on 6.10 RC2.

    I changed practically everything: new CPUs, removed the GPUs, a load of new RAM, went through every version of the BIOS. The problem was always there, with crashes coming anywhere from every 12 hours to every 4 days.

    I noticed it got worse when using PiHole/LanCache & occasionally Tdarr nodes, all of which have their own IPs on br0.

    I changed over to ipvlan; it crashed within 12 hrs. 

    I've now disabled "Host access to custom networks" - Let's see how that goes.

     

    The next step will be VLANs, not an issue for me as I already have a few on my LAN. I can set the docker IPs over to my IoT VLAN.

    [Screenshots attached]

    Link to comment




