MACVLAN call traces



Quote

Oct 24 16:25:35 Tower kernel: WARNING: CPU: 33 PID: 1126 at net/netfilter/nf_conntrack_core.c:763 __nf_conntrack_confirm+0x96/0x4fc
Oct 24 16:25:35 Tower kernel: Modules linked in: macvlan xt_CHECKSUM iptable_mangle ipt_REJECT ebtable_filter ebtables ip6table_filter ip6_tables veth vhost_net tun vhost tap xt_nat ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat xfs dm_crypt algif_skcipher af_alg dm_mod dax md_mod nct7904 bonding mlx4_en mlx4_core igb i2c_algo_bit sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper mpt3sas intel_cstate ipmi_ssif intel_uncore isci intel_rapl_perf libsas nvme ahci i2c_i801 raid_class i2c_core libahci scsi_transport_sas nvme_core wmi pcc_cpufreq ipmi_si button [last unloaded: mlx4_core]
Oct 24 16:25:35 Tower kernel: CPU: 33 PID: 1126 Comm: kworker/33:1 Not tainted 4.18.14-unRAID #1
Oct 24 16:25:35 Tower kernel: Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
Oct 24 16:25:35 Tower kernel: Workqueue: events macvlan_process_broadcast [macvlan]
Oct 24 16:25:35 Tower kernel: RIP: 0010:__nf_conntrack_confirm+0x96/0x4fc
Oct 24 16:25:35 Tower kernel: Code: c1 ed 20 89 2c 24 e8 26 f7 ff ff 8b 54 24 04 89 ef 89 c6 41 89 c5 e8 bc f8 ff ff 84 c0 75 b9 49 8b 86 80 00 00 00 a8 08 74 02 <0f> 0b 4c 89 f7 e8 04 ff ff ff 49 8b 86 80 00 00 00 0f ba e0 09 73 
Oct 24 16:25:35 Tower kernel: RSP: 0018:ffff88105f343d30 EFLAGS: 00010202
Oct 24 16:25:35 Tower kernel: RAX: 0000000000000188 RBX: ffff880811e55800 RCX: 0000000083f36e36
Oct 24 16:25:35 Tower kernel: RDX: 0000000000000001 RSI: 00000000000000e6 RDI: ffffffff81e093ec
Oct 24 16:25:35 Tower kernel: RBP: 000000000000db7b R08: 00000000069adc10 R09: ffff88080c7d2000
Oct 24 16:25:35 Tower kernel: R10: 0000000000000098 R11: ffff88080c1fa780 R12: ffffffff81e8cc80
Oct 24 16:25:35 Tower kernel: R13: 000000000000f0e6 R14: ffff8808be1a2440 R15: ffff8808be1a2498
Oct 24 16:25:35 Tower kernel: FS:  0000000000000000(0000) GS:ffff88105f340000(0000) knlGS:0000000000000000
Oct 24 16:25:35 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 24 16:25:35 Tower kernel: CR2: 000000c420cd5000 CR3: 0000000001e0a001 CR4: 00000000001606e0
Oct 24 16:25:35 Tower kernel: Call Trace:
Oct 24 16:25:35 Tower kernel: <IRQ>
Oct 24 16:25:35 Tower kernel: ipv4_confirm+0xaf/0xb7 [nf_conntrack_ipv4]
Oct 24 16:25:35 Tower kernel: nf_hook_slow+0x37/0x96
Oct 24 16:25:35 Tower kernel: ip_local_deliver+0xa7/0xd5
Oct 24 16:25:35 Tower kernel: ? inet_del_offload+0x3e/0x3e
Oct 24 16:25:35 Tower kernel: ip_rcv+0x2dc/0x317
Oct 24 16:25:35 Tower kernel: ? ip_local_deliver_finish+0x1aa/0x1aa
Oct 24 16:25:35 Tower kernel: __netif_receive_skb_core+0x6b2/0x740
Oct 24 16:25:35 Tower kernel: process_backlog+0x7e/0x116
Oct 24 16:25:35 Tower kernel: net_rx_action+0x10b/0x274
Oct 24 16:25:35 Tower kernel: __do_softirq+0xce/0x1c8
Oct 24 16:25:35 Tower kernel: do_softirq_own_stack+0x2a/0x40
Oct 24 16:25:35 Tower kernel: </IRQ>
Oct 24 16:25:35 Tower kernel: do_softirq+0x4d/0x59
Oct 24 16:25:35 Tower kernel: netif_rx_ni+0x1c/0x22
Oct 24 16:25:35 Tower kernel: macvlan_broadcast+0x10f/0x153 [macvlan]
Oct 24 16:25:35 Tower kernel: macvlan_process_broadcast+0xd5/0x131 [macvlan]
Oct 24 16:25:35 Tower kernel: process_one_work+0x16e/0x243
Oct 24 16:25:35 Tower kernel: ? cancel_delayed_work_sync+0xa/0xa
Oct 24 16:25:35 Tower kernel: worker_thread+0x1dc/0x2ac
Oct 24 16:25:35 Tower kernel: kthread+0x10b/0x113
Oct 24 16:25:35 Tower kernel: ? kthread_flush_work_fn+0x9/0x9
Oct 24 16:25:35 Tower kernel: ret_from_fork+0x35/0x40
Oct 24 16:25:35 Tower kernel: ---[ end trace 16e5fe6358e3d26d ]---

It was going well for about a week. I was told this was fixed in a previous version of Unraid, but today it locked up my server completely, forcing an unclean shutdown (parity check running as we speak). I can't have that happening, so I'm shutting down any fixed-IP Docker containers in the hope that it fixes the problem. I'd like to put this down as a bug but need someone's verification before I do so. Thank you. Here's the Diagnostics.
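For anyone else digging through their own diagnostics: a quick way to check whether a trace is this macvlan/conntrack one is to grep the syslog for the macvlan broadcast workqueue. A minimal sketch (the sample lines below are just stand-ins for a real /var/log/syslog pulled from the diagnostics zip):

```shell
# Stand-in sample; in practice, grep the real syslog from the diagnostics zip.
cat > /tmp/sample_syslog <<'EOF'
Oct 24 16:25:35 Tower kernel: WARNING: CPU: 33 PID: 1126 at net/netfilter/nf_conntrack_core.c:763 __nf_conntrack_confirm+0x96/0x4fc
Oct 24 16:25:35 Tower kernel: Workqueue: events macvlan_process_broadcast [macvlan]
Oct 24 16:26:01 Tower kernel: some unrelated message
EOF

# Count how many traces came out of the macvlan broadcast worker.
grep -c 'macvlan_process_broadcast' /tmp/sample_syslog   # → 1
```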


 

tower-diagnostics-20181024-1640.zip

14 minutes ago, slimshizn said:

I'd like to put this down as a bug but need someone's verification before I do so.

It only happens on certain hardware/configurations.  Macvlan call traces have not been reproducible by the Docker networking author, as they do not happen on any of his systems.

 

In my case, it only happened on br0.  As soon as I created a separate VLAN (br0.3) for dockers that needed their own IP address, all macvlan call traces disappeared and I have not seen them for months.
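For reference, a VLAN sub-interface like br0.3 is normally created from Settings > Network Settings in the Unraid GUI, but under the hood it boils down to something like the following (a sketch only; the VLAN ID and names are illustrative, and it needs root plus 802.1Q support in the kernel):

```shell
# Illustrative only -- Unraid's GUI does the equivalent of this for you.
ip link add link br0 name br0.3 type vlan id 3   # tag VLAN 3 on top of br0
ip link set br0.3 up
ip -d link show br0.3    # should report "vlan protocol 802.1Q id 3"
```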


I reported a similar issue a while ago which at the time I suspected might have been related to the 10G Mellanox drivers I was using not playing nicely with macvlan.

 

 

I don't see any reference to the Mellanox drivers in your stack trace though; it looks like you're using Intel drivers. So maybe this is a more common issue than I'd assumed. I haven't really been following it since I last saw it in August. I can confirm at least that it was present in version 6.5.3. Is it safe to assume that you're running 6.6.3? If so, that begins to put a range on the affected software versions...

 

Anyway, as I mentioned in my post linked above, I just avoided the issue by removing VLANs and using multiple hardware NICs. I'm just chiming in here to report that I've seen this issue as well.

 

Cheers,

 

-A

42 minutes ago, Hoopster said:

It only happens on certain hardware/configurations.  Macvlan call traces have not been reproducible by the Docker networking author, as they do not happen on any of his systems.

 

In my case, it only happened on br0.  As soon as I created a separate VLAN (br0.3) for dockers that needed their own IP address, all macvlan call traces disappeared and I have not seen them for months.

I can give this a shot and see what happens, I'm not opposed to testing if it fixes the issue.

 

 

3 minutes ago, Ambrotos said:

I reported a similar issue a while ago which at the time I suspected might have been related to the 10G Mellanox drivers I was using not playing nicely with macvlan.

 

 

I don't see any reference to the Mellanox drivers in your stack trace though; it looks like you're using Intel drivers. So maybe this is a more common issue than I'd assumed. I haven't really been following it since I last saw it in August. I can confirm at least that it was present in version 6.5.3. Is it safe to assume that you're running 6.6.3? If so, that begins to put a range on the affected software versions...

 

Anyway, as I mentioned in my post linked above, I just avoided the issue by removing VLANs and using multiple hardware NICs. I'm just chiming in here to report that I've seen this issue as well.

 

Cheers,

 

-A

I am on 6.6.2 at the moment, but now that I have seen another call trace (I thought it might have been fixed in 6.6.2), I'll go ahead and update after the parity check. Also, the driver is Mellanox; I believe that's the mlx4 referenced above.

20 minutes ago, Ambrotos said:

Ah yes, I do see the mlx4_core reference in the second trace message there. I must have made a typo when I searched the first time and didn't find anything.

 

So alright, is Mellanox a common element of this issue? @AcidReign, what drivers are you using? @Hoopster?

 

-A

 

No Mellanox in my case.  I am using only the built-in NICs on my ASRock motherboard: Intel i210 and Intel i219.  However, at least in my case, the physical hardware was not the issue, as both br0 and br0.3 use the same physical NIC; br0 resulted in call traces whereas br0.3 did not.

14 hours ago, Ambrotos said:

Ah yes, I do see the mlx4_core reference in the second trace message there. I must have made a typo when I searched the first time and didn't find anything.

 

So alright, is Mellanox a common element of this issue? @AcidReign, what drivers are you using? @Hoopster?

 

-A

I have a Mellanox NIC on another server as well and have seen a kernel panic there once. This was without the use of a user-defined static IP for dockers, which makes the argument of docker br0 being the issue a little more difficult. Since it was a one-time occurrence and didn't cause the system to lock up, I'm going to think of it as a fluke.

Edited by slimshizn

Well, I did try increasing the RX and TX buffers and got some interesting results that hadn't happened before. Running ethtool -G eth0 rx 8192 tx 8192, this happened:

 

Quote

Oct 25 09:02:29 Tower kernel: CPU: 36 PID: 14400 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:02:29 Tower kernel: Call Trace:
Oct 25 09:02:35 Tower kernel: CPU: 13 PID: 15145 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:02:35 Tower kernel: Call Trace:
Oct 25 09:03:02 Tower kernel: CPU: 34 PID: 17935 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:03:02 Tower kernel: Call Trace:
Oct 25 09:12:51 Tower kernel: CPU: 33 PID: 1480 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:12:51 Tower kernel: Call Trace:
Oct 25 09:12:53 Tower kernel: CPU: 10 PID: 1672 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:12:53 Tower kernel: Call Trace:
Oct 25 09:12:53 Tower kernel: CPU: 38 PID: 1913 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:12:53 Tower kernel: Call Trace:
Oct 25 09:12:54 Tower kernel: CPU: 11 PID: 1947 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:12:54 Tower kernel: Call Trace:
Oct 25 09:12:58 Tower kernel: CPU: 34 PID: 2444 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:12:58 Tower kernel: Call Trace:
Oct 25 09:12:59 Tower kernel: CPU: 31 PID: 2583 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:12:59 Tower kernel: Call Trace:
Oct 25 09:13:00 Tower kernel: CPU: 18 PID: 2626 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:13:00 Tower kernel: Call Trace:
Oct 25 09:13:00 Tower kernel: CPU: 14 PID: 2629 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:13:00 Tower kernel: Call Trace:
Oct 25 09:13:00 Tower kernel: CPU: 39 PID: 2632 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:13:00 Tower kernel: Call Trace:
Oct 25 09:13:01 Tower kernel: CPU: 18 PID: 2696 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:13:01 Tower kernel: Call Trace:
Oct 25 09:13:01 Tower kernel: CPU: 36 PID: 2901 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:13:01 Tower kernel: Call Trace:
Oct 25 09:13:01 Tower kernel: CPU: 31 PID: 2916 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:13:01 Tower kernel: Call Trace:
Oct 25 09:13:01 Tower kernel: CPU: 33 PID: 2923 Comm: ethtool Tainted: G        W         4.18.14-unRAID #1
Oct 25 09:13:01 Tower kernel: Call Trace:

As you can see, I got a little carried away retrying the command, lol. Hopefully this can help reproduce the issue for bug zapping.
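In hindsight, it's probably worth reading the driver's pre-set maximums before resizing, since repeated -G attempts may simply be asking for more than the card can allocate (the mlx4 "Failed allocating hwq resources" errors point that way). A rough sketch, assuming eth0 is the Mellanox port:

```shell
# Read the current ring sizes first; the "Pre-set maximums" section is the
# ceiling the driver will accept.
ethtool -g eth0

# Then resize once, staying at or below those maximums, instead of retrying.
ethtool -G eth0 rx 4096 tx 4096
```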

 

Quote

Oct 25 09:13:01 Tower kernel: mlx4_en: 0000:83:00.0: Port 1: Failed allocating hwq resources
Oct 25 09:13:01 Tower kernel: mlx4_en: 0000:83:00.0: Port 1: Failed to allocate NIC resources
Oct 25 09:13:01 Tower kernel: mlx4_en: eth0: mlx4_en_try_alloc_resources: Resource allocation failed, using previous configuration



Edit: Interestingly enough, I tried this on my other server and it went through fine.
(screenshot attached)


Edit 2: Updated to 6.6.3, rebooted the server, ran ethtool -G eth0 rx 8192 tx 8192, and it went through without a kernel panic. Testing speeds again.

Edited by slimshizn
6 hours ago, slimshizn said:

which makes the argument of docker br0 being the issue a little more difficult

Yeah, it is not necessarily a br0 issue for everyone (it appears it was for me).  Others have had macvlan call traces on br1.  However, some combination of hardware, Docker LAN configuration, and who knows what else produces macvlan call traces for some users, whereas others have never seen them.

 

The only common factor I can point to, at least to my knowledge, is that they only happen when Docker containers are assigned custom IP addresses.
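For context, giving a container its own IP on Unraid amounts to attaching it to a macvlan Docker network on the bridge, roughly like this (a sketch only; Unraid normally creates the network itself, and the subnet, gateway, and addresses here are made up):

```shell
# Illustrative macvlan network on a VLAN sub-interface (values are examples).
docker network create -d macvlan \
  --subnet=192.168.3.0/24 --gateway=192.168.3.1 \
  -o parent=br0.3 br0.3

# Give a container a fixed address on that network.
docker run -d --name test --network br0.3 --ip 192.168.3.50 alpine sleep 300
```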

