• 6.9.0/6.9.1 - Kernel Panic due to netfilter (nf_nat_setup_info) - Docker Static IP (macvlan)


    CorneliousJD
    • Urgent

    So I had posted another thread about how, after a kernel panic, Docker host access to custom networks doesn't work until Docker is stopped and restarted on 6.9.0.

     

     

    After further investigation and setting up syslogging, it appears it may actually be that host access that's CAUSING the kernel panic?

    EDIT 3/16: I guess I needed to create a VLAN for my dockers with static IPs. So far that's working, so it's probably not HOST access causing the issue, but rather static IPs being assigned on br0. See the follow-up posts below.
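    In case it helps anyone trying the same workaround, here is roughly what that VLAN approach boils down to at the Docker level. This is only a sketch: the VLAN ID, subnet, and network name are made-up placeholders, and on Unraid the VLAN sub-interface itself comes from enabling VLANs under Settings > Network Settings rather than from the CLI.

    # Sketch: macvlan network parented on a VLAN sub-interface instead of br0
    # (br0.5, 192.168.5.0/24 and "dockervlan" are placeholder values)
    docker network create -d macvlan \
        --subnet=192.168.5.0/24 --gateway=192.168.5.1 \
        -o parent=br0.5 dockervlan

    # Containers then take their static IP on the VLAN, not on br0
    docker run -d --network=dockervlan --ip=192.168.5.10 <image>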

     

    Here's my last kernel panic that thankfully got logged to syslog. It references macvlan and netfilter. I don't know enough to be super useful here, but this is my docker setup.

     

    [screenshot: Docker network configuration]

     

    Mar 12 03:57:07 Server kernel: ------------[ cut here ]------------
    Mar 12 03:57:07 Server kernel: WARNING: CPU: 17 PID: 626 at net/netfilter/nf_nat_core.c:614 nf_nat_setup_info+0x6c/0x652 [nf_nat]
    Mar 12 03:57:07 Server kernel: Modules linked in: ccp macvlan xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap veth xt_nat xt_MASQUERADE iptable_nat nf_nat xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables bonding igb i2c_algo_bit cp210x usbserial sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd ipmi_ssif isci glue_helper mpt3sas i2c_i801 rapl libsas i2c_smbus input_leds i2c_core ahci intel_cstate raid_class led_class acpi_ipmi intel_uncore libahci scsi_transport_sas wmi ipmi_si button [last unloaded: ipmi_devintf]
    Mar 12 03:57:07 Server kernel: CPU: 17 PID: 626 Comm: kworker/17:2 Tainted: G        W         5.10.19-Unraid #1
    Mar 12 03:57:07 Server kernel: Hardware name: Supermicro PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
    Mar 12 03:57:07 Server kernel: Workqueue: events macvlan_process_broadcast [macvlan]
    Mar 12 03:57:07 Server kernel: RIP: 0010:nf_nat_setup_info+0x6c/0x652 [nf_nat]
    Mar 12 03:57:07 Server kernel: Code: 89 fb 49 89 f6 41 89 d4 76 02 0f 0b 48 8b 93 80 00 00 00 89 d0 25 00 01 00 00 45 85 e4 75 07 89 d0 25 80 00 00 00 85 c0 74 07 <0f> 0b e9 1f 05 00 00 48 8b 83 90 00 00 00 4c 8d 6c 24 20 48 8d 73
    Mar 12 03:57:07 Server kernel: RSP: 0018:ffffc90006778c38 EFLAGS: 00010202
    Mar 12 03:57:07 Server kernel: RAX: 0000000000000080 RBX: ffff88837c8303c0 RCX: ffff88811e834880
    Mar 12 03:57:07 Server kernel: RDX: 0000000000000180 RSI: ffffc90006778d14 RDI: ffff88837c8303c0
    Mar 12 03:57:07 Server kernel: RBP: ffffc90006778d00 R08: 0000000000000000 R09: ffff889083c68160
    Mar 12 03:57:07 Server kernel: R10: 0000000000000158 R11: ffff8881e79c1400 R12: 0000000000000000
    Mar 12 03:57:07 Server kernel: R13: 0000000000000000 R14: ffffc90006778d14 R15: 0000000000000001
    Mar 12 03:57:07 Server kernel: FS:  0000000000000000(0000) GS:ffff88903fc40000(0000) knlGS:0000000000000000
    Mar 12 03:57:07 Server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar 12 03:57:07 Server kernel: CR2: 000000c000b040b8 CR3: 000000000200c005 CR4: 00000000001706e0
    Mar 12 03:57:07 Server kernel: Call Trace:
    Mar 12 03:57:07 Server kernel: <IRQ>
    Mar 12 03:57:07 Server kernel: ? activate_task+0x9/0x12
    Mar 12 03:57:07 Server kernel: ? resched_curr+0x3f/0x4c
    Mar 12 03:57:07 Server kernel: ? ipt_do_table+0x49b/0x5c0 [ip_tables]
    Mar 12 03:57:07 Server kernel: ? try_to_wake_up+0x1b0/0x1e5
    Mar 12 03:57:07 Server kernel: nf_nat_alloc_null_binding+0x71/0x88 [nf_nat]
    Mar 12 03:57:07 Server kernel: nf_nat_inet_fn+0x91/0x182 [nf_nat]
    Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
    Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
    Mar 12 03:57:07 Server kernel: ? ip_protocol_deliver_rcu+0xfe/0xfe
    Mar 12 03:57:07 Server kernel: ip_local_deliver+0x49/0x75
    Mar 12 03:57:07 Server kernel: ip_sabotage_in+0x43/0x4d
    Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
    Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
    Mar 12 03:57:07 Server kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
    Mar 12 03:57:07 Server kernel: ip_rcv+0x41/0x61
    Mar 12 03:57:07 Server kernel: __netif_receive_skb_one_core+0x74/0x95
    Mar 12 03:57:07 Server kernel: process_backlog+0xa3/0x13b
    Mar 12 03:57:07 Server kernel: net_rx_action+0xf4/0x29d
    Mar 12 03:57:07 Server kernel: __do_softirq+0xc4/0x1c2
    Mar 12 03:57:07 Server kernel: asm_call_irq_on_stack+0x12/0x20
    Mar 12 03:57:07 Server kernel: </IRQ>
    Mar 12 03:57:07 Server kernel: do_softirq_own_stack+0x2c/0x39
    Mar 12 03:57:07 Server kernel: do_softirq+0x3a/0x44
    Mar 12 03:57:07 Server kernel: netif_rx_ni+0x1c/0x22
    Mar 12 03:57:07 Server kernel: macvlan_broadcast+0x10e/0x13c [macvlan]
    Mar 12 03:57:07 Server kernel: macvlan_process_broadcast+0xf8/0x143 [macvlan]
    Mar 12 03:57:07 Server kernel: process_one_work+0x13c/0x1d5
    Mar 12 03:57:07 Server kernel: worker_thread+0x18b/0x22f
    Mar 12 03:57:07 Server kernel: ? process_scheduled_works+0x27/0x27
    Mar 12 03:57:07 Server kernel: kthread+0xe5/0xea
    Mar 12 03:57:07 Server kernel: ? __kthread_bind_mask+0x57/0x57
    Mar 12 03:57:07 Server kernel: ret_from_fork+0x22/0x30
    Mar 12 03:57:07 Server kernel: ---[ end trace b3ca21ac5f2c2720 ]---

     




    User Feedback

    Recommended Comments



    With a watchdog timing out, it looks like a hardware-related issue.

     

    Have you tried to reseat the NIC or move it to a different slot?
     

    Link to comment
    7 minutes ago, bonienl said:

    With a watchdog timing out, it looks like a hardware-related issue.

     

    Have you tried to reseat the NIC or move it to a different slot?
     

    I've tried three different LAN cards in various slots over the last few months.

     

    This happens using onboard NICs and PCI NICs

    Edited by Mr_Jay84
    Link to comment
    11 hours ago, Mr_Jay84 said:

    This happens using onboard NICs and PCI NICs

    Could also be a motherboard problem ...

     

    Link to comment
    On 3/12/2022 at 10:09 AM, bonienl said:

    Could also be a motherboard problem ...

     

    I've had the onboard NICs turned off too.

     

    Assigning IPs to containers is a sure way of causing a crash in under 24hrs.

     

    I'm not buying any more hardware. I've already spent a lot of money replacing pretty much everything and trying out various hardware combinations over at least a year.

     

    Upgraded to RC3 today... so fingers crossed. If this doesn't work, I'll have to look into other solutions, as Unraid isn't reliable enough.

    Edited by Mr_Jay84
    Link to comment

    Another lockup today, not sure if it's related though.

     

    Mar 16 15:06:55 Ultron kernel: veth437980e: renamed from eth0
    Mar 16 15:06:55 Ultron kernel: docker0: port 26(veth244df90) entered disabled state
    Mar 16 15:06:55 Ultron kernel: docker0: port 26(veth244df90) entered disabled state
    Mar 16 15:06:55 Ultron kernel: device veth244df90 left promiscuous mode
    Mar 16 15:06:55 Ultron kernel: docker0: port 26(veth244df90) entered disabled state
    Mar 16 15:07:55 Ultron kernel: docker0: port 26(veth90a6157) entered blocking state
    Mar 16 15:07:55 Ultron kernel: docker0: port 26(veth90a6157) entered disabled state
    Mar 16 15:07:55 Ultron kernel: device veth90a6157 entered promiscuous mode
    Mar 16 15:07:55 Ultron kernel: eth0: renamed from vethc67d2ac
    Mar 16 15:07:55 Ultron kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth90a6157: link becomes ready
    Mar 16 15:07:55 Ultron kernel: docker0: port 26(veth90a6157) entered blocking state
    Mar 16 15:07:55 Ultron kernel: docker0: port 26(veth90a6157) entered forwarding state
    Mar 16 15:07:57 Ultron kernel: docker0: port 26(veth90a6157) entered disabled state
    Mar 16 15:07:57 Ultron kernel: vethc67d2ac: renamed from eth0
    Mar 16 15:07:57 Ultron kernel: docker0: port 26(veth90a6157) entered disabled state
    Mar 16 15:07:57 Ultron kernel: device veth90a6157 left promiscuous mode
    Mar 16 15:07:57 Ultron kernel: docker0: port 26(veth90a6157) entered disabled state
    Mar 16 18:48:14 Ultron kernel: br-caf2c2672c89: port 1(vethdf85790) entered disabled state
    Mar 16 18:48:14 Ultron kernel: veth6c760c6: renamed from eth0
    Mar 16 18:48:15 Ultron kernel: br-caf2c2672c89: port 1(vethdf85790) entered disabled state
    Mar 16 18:48:15 Ultron kernel: device vethdf85790 left promiscuous mode
    Mar 16 18:48:15 Ultron kernel: br-caf2c2672c89: port 1(vethdf85790) entered disabled state
    Mar 16 18:48:15 Ultron kernel: br-caf2c2672c89: port 1(veth400e60b) entered blocking state
    Mar 16 18:48:15 Ultron kernel: br-caf2c2672c89: port 1(veth400e60b) entered disabled state
    Mar 16 18:48:15 Ultron kernel: device veth400e60b entered promiscuous mode
    Mar 16 18:48:15 Ultron kernel: br-caf2c2672c89: port 1(veth400e60b) entered blocking state
    Mar 16 18:48:15 Ultron kernel: br-caf2c2672c89: port 1(veth400e60b) entered forwarding state
    Mar 16 18:48:15 Ultron kernel: docker0: port 1(vethc04644b) entered blocking state
    Mar 16 18:48:15 Ultron kernel: docker0: port 1(vethc04644b) entered disabled state
    Mar 16 18:48:15 Ultron kernel: device vethc04644b entered promiscuous mode
    Mar 16 18:48:15 Ultron kernel: docker0: port 1(vethc04644b) entered blocking state
    Mar 16 18:48:15 Ultron kernel: docker0: port 1(vethc04644b) entered forwarding state
    Mar 16 18:48:15 Ultron kernel: eth0: renamed from veth3e19731
    Mar 16 18:48:15 Ultron kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth400e60b: link becomes ready

     

    Attachments: ultron.log, 20220316_183623.jpg

    Edited by Mr_Jay84
    Link to comment

    Upgraded to RC3 a few weeks back and still saw the same crashing, so as soon as RC4 came out I tried that.

    It's been up for almost 9 days with no issues so far. This is a record for me. Tomorrow I'll turn Pi-hole back on, as that's a sure way of triggering the previous crashes. Fingers crossed!

    Link to comment

    The issue has not returned since the RC4 upgrade, even with Pi-hole and other known offending containers re-enabled.

    This is excellent!

    Link to comment

    There is a lot of info here, so I'm just going to chime in so that I receive notifications as this thread progresses. I'm not speaking to the cause of anyone else's kernel panics, but mine match a few of these screenshots verbatim.

     

    I've been running Unraid for as long as I can remember and running dockers since they were originally introduced.  My server has been rock solid this entire time, even after upgrading MB/Proc/RAM, adding 10G NICs, replacing HDDs, etc.

     

    Even with 6.9.2, never had any issues. Out of the blue, I decided to upgrade to 6.10.0 and everything went south. Random kernel panics. Obviously blaming it on the upgrade, I reverted to the previous version, 6.9.2, but the kernel panics continued (in hindsight, is it possible that 6.10.0 made changes to the Docker configs that caused 6.9.2 to continue having issues?).

    Doing the research, I assumed it was hardware related and replaced the MB/Proc/RAM, and the kernel panics continued. I noticed that the kernel panics seemed to happen around the time the Appdata Backup/Restore was scheduled, so I started diving into that. Found some errors accessing files, so I thought it might be corruption issues. Resolved all that, moved appdata to different drives, on to/off of cache, etc. Turned off all containers and slowly turned them on, one at a time (over the course of a month), to see if I could pinpoint the cause, but nothing was repeatable. Everything appeared to be random.

    As of two days ago, my server had been running for 28 days with all dockers running without issues. So when it kernel panicked the other day, I removed Appdata Backup/Restore and moved everything Docker-related onto the protected array, and it kernel panicked that evening.

     

    Anyway, the latest change I am testing (fingers crossed) is changing the Docker setup to use ipvlan instead of macvlan. If that doesn't work, I'm going to plug in another NIC and try using eth# instead of br0 for the dockers that need a static IP to operate properly.
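    (For reference, the macvlan/ipvlan choice is a toggle in the Docker settings on newer Unraid builds; the rough CLI equivalent of the ipvlan variant is sketched below. The subnet, parent interface, and network name are placeholders, not values from any setup in this thread.)

    # Sketch: the same kind of custom network, but using the ipvlan driver
    docker network create -d ipvlan \
        --subnet=192.168.1.0/24 --gateway=192.168.1.1 \
        -o parent=br0 -o ipvlan_mode=l2 br0-ipvlan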

     

    PS: I learned of ctop, which is like top for Docker containers. Loving it. I wish it was built into Unraid, or that Unraid had that information available within the web GUI.
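    If anyone else wants to try ctop, it can be run straight from its Docker image without installing anything; the invocation below is the one from the ctop project's README (treat the image name as an assumption if it has moved since):

    docker run --rm -ti \
        --name=ctop \
        -v /var/run/docker.sock:/var/run/docker.sock \
        quay.io/vektorlab/ctop:latest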

    Edited by jzawacki
    Link to comment

    Ipvlan caused more issues for me.

    Then I went back to macvlan with VLANs on Docker and it was fine, until I got a new gateway (UDM-SE), and then it started crashing again.

     

    I am now on a dedicated NIC for containers with their own IP. 
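    (That setup is the same macvlan pattern as the earlier sketches, just parented on the spare NIC instead of br0; eth1, the subnet, and the name below are placeholders.)

    # Sketch: macvlan network on a dedicated interface
    docker network create -d macvlan --subnet=192.168.2.0/24 --gateway=192.168.2.1 -o parent=eth1 docker-eth1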

     

    Things keep changing... but I'm hopeful that this keeps things stable long term. ;)

    Link to comment

    Not sure exactly what fixed it for me, but running the latest stable release the issue has not appeared for quite a few months now. I did make some minor changes to macvlan/ipvlan, but I seem to remember that didn't help.

    Link to comment

    I'm afraid to post because I don't want to jinx myself, but Unraid has been up for 5 days after switching to ipvlan, running all dockers except for one (that is known to cause issues). I hope I don't have to post again any time soon.

    Link to comment

    Well, I had to switch back to macvlan due to some network weirdness. Unraid had been running for 20 days without issues, but I had some dockers that wouldn't stay connected, and it wasn't until I realized that Unraid couldn't check whether a version update was available that I blamed it on the Docker ipvlan setting. Switched it back to macvlan, all the network weirdness went away, and the server hasn't kernel panicked yet, so I'm keeping my fingers crossed.

    Link to comment
    8 minutes ago, jzawacki said:

    Well, I had to switch back to macvlan due to some network weirdness...

     

    I ran into the same issues with ipvlan.

     

    Ipvlan's issues have been less drastic but more sporadic, and IMO it's the worse of the two.

    Link to comment

    Glad I found this thread; I've been dealing with kernel panics for the last month or so and can't for the life of me find the solution. It started after I installed a GPU and set up a VM to pass it to. I never had any issues before, and disabling the VM doesn't stop the kernel panics. I upgraded from 6.11 to 6.12.0-rc2 and am having the same problems. Going to dig through this thread a bit.

    Edited by sage2050
    Link to comment
    22 minutes ago, sage2050 said:

    glad I found this thread...

     

    I wish there was a solid answer/fix. For me personally, it had to do with one of the dockers. I installed Docker and Portainer on my backup server and moved a bunch of dockers to it, and although I still have 9 dockers running full time on Unraid, my server uptime is currently 21 days and I can't remember the last time it kernel panicked. Now, since I'm posting this, it'll kernel panic by the end of the night.

    Link to comment

    Knocking on every piece of wood and tree in sight, but I crossed over 1 week of uptime last night after changing Docker from macvlan to ipvlan. I also disabled some less-than-crucial dockers, so I'll start enabling them one by one.

    Link to comment
    4 minutes ago, sage2050 said:

    Knocking on every piece of wood and tree in sight, but I crossed over 1 week of uptime last night after changing Docker from macvlan to ipvlan...

     

    I was running fine, but after that I would randomly see all sorts of network issues on my server: not able to update containers, not able to reach the internet from some, etc. It seems the way ipvlan networks those containers together didn't play nice with my UniFi router.

    Link to comment

    Same for me. I couldn't even get the host to update after switching to ipvlan, and I'm not using UniFi. So I believe that if ipvlan fixes it for anyone, it's because it stops the actual cause (one of the dockers?) from talking to the network. But that's just a guess.

    Link to comment

    *sigh* got a call trace this morning, but the server recovered at least. I hadn't even started activating dockers yet.

     

    I haven't seen any network issues in the log but i'll keep an eye out.

     

    Edit: this trace was related to a scheduled CA Backup/Restore run, which is currently incompatible with 6.12 and which I hadn't disabled. I'll chalk it up to that.

    Edited by sage2050
    Link to comment

    hmm

     

    Apr  1 18:40:17 Servbot kernel: e1000e 0000:00:1f.6 eth0: NIC Link is Down
    Apr  1 18:40:17 Servbot kernel: bond0: (slave eth0): link status definitely down, disabling slave
    Apr  1 18:40:17 Servbot kernel: device eth0 left promiscuous mode
    Apr  1 18:40:17 Servbot kernel: bond0: now running without any active interface!
    Apr  1 18:40:17 Servbot kernel: br0: port 1(bond0) entered disabled state
    Apr  1 18:40:20 Servbot kernel: e1000e 0000:00:1f.6 eth0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
    Apr  1 18:40:20 Servbot kernel: bond0: (slave eth0): link status definitely up, 1000 Mbps full duplex
    Apr  1 18:40:20 Servbot kernel: bond0: (slave eth0): making interface the new active one
    Apr  1 18:40:20 Servbot kernel: device eth0 entered promiscuous mode
    Apr  1 18:40:20 Servbot kernel: bond0: active interface up!
    Apr  1 18:40:20 Servbot kernel: br0: port 1(bond0) entered blocking state
    Apr  1 18:40:20 Servbot kernel: br0: port 1(bond0) entered forwarding state
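    (A few standard commands for checking whether that link/bond is really flapping; interface names as in the log above:)

    # Negotiated speed/duplex and link state of the physical port
    ethtool eth0

    # Active slave and link-failure counters for the bond
    cat /proc/net/bonding/bond0

    # Per-interface error/drop counters
    ip -s link show eth0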

     

    Link to comment

    15 days of uptime. Going on a trip for 4 days; I've got a feeling it's going to go down as soon as I walk out the door.

    Link to comment

    I did some more messing around, and after another short period of instability I feel fairly confident that my crashes were related to binding the two unused USB controllers that come with my GPU. In my VM settings I set PCIe ACS override to "both", which let me bind only the video and audio devices on the GPU for passthrough, and I haven't seen any call traces in 10 days now.

     

    During my previous stable period I wasn't binding any IOMMU groups.
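    For anyone searching later: as far as I can tell, the "Both" setting just adds a kernel boot parameter, so the append line on the flash drive ends up looking roughly like the sketch below (the rest of the line depends on what is already there), while the per-device vfio binding itself is written by the Tools > System Devices page to config/vfio-pci.cfg.

    # /boot/syslinux/syslinux.cfg (sketch only; "Both" = downstream,multifunction)
    label Unraid OS
      menu default
      kernel /bzimage
      append pcie_acs_override=downstream,multifunction initrd=/bzroot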

    Edited by sage2050
    Link to comment

    It does, it's a:

     

    08:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Rage 3 [Rage XL PCI] (rev 27)
     

    Link to comment


