Unraid becomes unresponsive randomly / docker containers crashing


ScottRTL


Hi everyone:

I am new(ish) to Unraid (been running it for about 4 months now). I have managed to fix some issues over the past few months with a lot of the forum posts here, etc. However, I'm not able to find anything about my current problem...

I'm not sure if I am dealing with several problems, or just one recurring problem that has different outcomes.

Basically, sometimes a docker container will randomly crash. Sometimes Unraid will become unresponsive: I am unable to get to the GUI and/or CLI, and sometimes I cannot even ping my server, so I am forced to force a power down/reset. Because of that I cannot capture a log from one of those lockups, only an example of a log entry I see often.

From what I have seen, a docker container crashing always seems to have an entry in my log like this:

Oct 15 12:14:23 Tower kernel: WARNING: CPU: 3 PID: 25038 at net/netfilter/nf_conntrack_core.c:945 __nf_conntrack_confirm+0xa0/0x69e
Oct 15 12:14:23 Tower kernel: Modules linked in: nfsv3 nfs lockd grace sunrpc xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap xt_nat macvlan iptable_filter xfs md_mod i915 i2c_algo_bit iosf_mbi drm_kms_helper drm intel_gtt agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops hwmon_vid iptable_nat ipt_MASQUERADE nf_nat_ipv4 nf_nat ip_tables wireguard ip6_udp_tunnel udp_tunnel bonding e1000e igb(O) x86_pkg_temp_thermal intel_powerclamp coretemp wmi_bmof intel_wmi_thunderbolt mxm_wmi kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate intel_uncore intel_rapl_perf i2c_i801 i2c_core ahci libahci nvme nvme_core thermal fan wmi video acpi_pad button
Oct 15 12:14:23 Tower kernel: pcc_cpufreq backlight [last unloaded: e1000e]
Oct 15 12:14:23 Tower kernel: CPU: 3 PID: 25038 Comm: kworker/3:0 Tainted: G           O      4.19.107-Unraid #1
Oct 15 12:14:23 Tower kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B98/Z390-A PRO (MS-7B98), BIOS 1.A0 06/10/2020
Oct 15 12:14:23 Tower kernel: Workqueue: events macvlan_process_broadcast [macvlan]
Oct 15 12:14:23 Tower kernel: RIP: 0010:__nf_conntrack_confirm+0xa0/0x69e
Oct 15 12:14:23 Tower kernel: Code: 04 e8 56 fb ff ff 44 89 f2 44 89 ff 89 c6 41 89 c4 e8 7f f9 ff ff 48 8b 4c 24 08 84 c0 75 af 48 8b 85 80 00 00 00 a8 08 74 26 <0f> 0b 44 89 e6 44 89 ff 45 31 f6 e8 95 f1 ff ff be 00 02 00 00 48
Oct 15 12:14:23 Tower kernel: RSP: 0018:ffff88902dac3d90 EFLAGS: 00010202
Oct 15 12:14:23 Tower kernel: RAX: 0000000000000188 RBX: ffff888e5d0fc300 RCX: ffff888e5c60acd8
Oct 15 12:14:23 Tower kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff81e08d98
Oct 15 12:14:23 Tower kernel: RBP: ffff888e5c60ac80 R08: 00000000903d9035 R09: ffffffff81c8aa80
Oct 15 12:14:23 Tower kernel: R10: 0000000000000098 R11: ffff888fe139a800 R12: 00000000000045e6
Oct 15 12:14:23 Tower kernel: R13: ffffffff81e91080 R14: 0000000000000000 R15: 00000000000039ad
Oct 15 12:14:23 Tower kernel: FS:  0000000000000000(0000) GS:ffff88902dac0000(0000) knlGS:0000000000000000
Oct 15 12:14:23 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 15 12:14:23 Tower kernel: CR2: 000014a269d6b9b0 CR3: 0000000001e0a002 CR4: 00000000003606e0
Oct 15 12:14:23 Tower kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct 15 12:14:23 Tower kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Oct 15 12:14:23 Tower kernel: Call Trace:
Oct 15 12:14:23 Tower kernel: <IRQ>
Oct 15 12:14:23 Tower kernel: ipv4_confirm+0xaf/0xb9
Oct 15 12:14:23 Tower kernel: nf_hook_slow+0x3a/0x90
Oct 15 12:14:23 Tower kernel: ip_local_deliver+0xad/0xdc
Oct 15 12:14:23 Tower kernel: ? ip_sublist_rcv_finish+0x54/0x54
Oct 15 12:14:23 Tower kernel: ip_rcv+0xa0/0xbe
Oct 15 12:14:23 Tower kernel: ? ip_rcv_finish_core.isra.0+0x2e1/0x2e1
Oct 15 12:14:23 Tower kernel: __netif_receive_skb_one_core+0x53/0x6f
Oct 15 12:14:23 Tower kernel: process_backlog+0x77/0x10e
Oct 15 12:14:23 Tower kernel: net_rx_action+0x107/0x26c
Oct 15 12:14:23 Tower kernel: __do_softirq+0xc9/0x1d7
Oct 15 12:14:23 Tower kernel: do_softirq_own_stack+0x2a/0x40
Oct 15 12:14:23 Tower kernel: </IRQ>
Oct 15 12:14:23 Tower kernel: do_softirq+0x4d/0x5a
Oct 15 12:14:23 Tower kernel: netif_rx_ni+0x1c/0x22
Oct 15 12:14:23 Tower kernel: macvlan_broadcast+0x111/0x156 [macvlan]
Oct 15 12:14:23 Tower kernel: ? __switch_to_asm+0x41/0x70
Oct 15 12:14:23 Tower kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]
Oct 15 12:14:23 Tower kernel: process_one_work+0x16e/0x24f
Oct 15 12:14:23 Tower kernel: worker_thread+0x1e2/0x2b8
Oct 15 12:14:23 Tower kernel: ? rescuer_thread+0x2a7/0x2a7
Oct 15 12:14:23 Tower kernel: kthread+0x10c/0x114
Oct 15 12:14:23 Tower kernel: ? kthread_park+0x89/0x89
Oct 15 12:14:23 Tower kernel: ret_from_fork+0x1f/0x40
Oct 15 12:14:23 Tower kernel: ---[ end trace e09e8b4142c8f63c ]---

 

I also attached the diagnostics ZIP if that helps anyone. I am running 6.8.3. Everything runs smoothly for a day or two, then I get one of these "crashes". It's been happening for a few weeks now.

 

Thanks for any help that can be provided!

tower-diagnostics-20201015-1249.zip

12 hours ago, JorgeB said:

Macvlan call traces are usually related to having dockers with a custom IP address, more info here:

 

Thanks for the reply.

 

This is my issue for sure.

 

It seems really weird to me that you can't do something as simple as setting up a static IP for a docker container without causing this crashing... And it's been around since 6.5.0? Doesn't really give me a lot of confidence in the next version fixing this issue.

36 minutes ago, ScottRTL said:

Doesn't really give me a lot of confidence in the next version fixing this issue

It really has nothing to do with unRAID.  There is no reasonable way Limetech can fix this.

 

It does not happen for everyone and it appears to be something perhaps hardware related that triggers network broadcast problems with macvlan in Docker.

 

At least in my case, VLANs solved the problem and I have several docker containers running happily with their own IP address.  Unfortunately, VLANs are not an option for everyone due to router/switch restrictions.
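For anyone landing here later, the shape of that setup, sketched with example numbers (the subnet, gateway, VLAN ID and container name below are all placeholders, not anyone's actual config; on Unraid the Settings > Docker page creates the equivalent network for you, so the CLI is only for understanding):

```shell
# Illustrative only -- substitute your own subnet/gateway/VLAN values.
# Create a macvlan Docker network whose parent is the VLAN sub-interface
# (br0.5) instead of the main bridge (br0):
docker network create -d macvlan \
  --subnet=192.168.5.0/24 \
  --gateway=192.168.5.1 \
  -o parent=br0.5 \
  br0.5

# Run a container on that network with its own fixed IP on the VLAN:
docker run -d --name pihole --network br0.5 --ip 192.168.5.10 pihole/pihole
```

The point is that the containers' broadcast traffic then rides the tagged VLAN instead of the host's main interface, which is what sidesteps the macvlan call trace for many setups.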

51 minutes ago, Hoopster said:

It really has nothing to do with unRAID.  There is no reasonable way Limetech can fix this.

 

It does not happen for everyone and it appears to be something perhaps hardware related that triggers network broadcast problems with macvlan in Docker.

 

At least in my case, VLANs solved the problem and I have several docker containers running happily with their own IP address.  Unfortunately, VLANs are not an option for everyone due to router/switch restrictions.

Yeah, I don't blame them (the Unraid dev team) for it; I understand it's an issue with some people's setups. I was just hoping it would be fixed, until I saw you've had the issue for 2 years...


I asked in the other post as well...(yours from 2 years ago, that was linked)

 

When you say creating a new VLAN, you mean go into my EdgeRouter, add a new pool, then assign the docker containers IPs in that pool, right? How do I create a new BR that uses that pool? Is there a YouTube video on that or anything? I couldn't find a SpaceinvaderOne video...
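(For the terminology: the "BR" Unraid shows, e.g. br0.5, is just a tagged VLAN sub-interface on top of the existing bridge. Roughly what gets created under the hood is sketched below with illustrative iproute2 commands; on Unraid you only tick "Enable VLANs" in Network Settings and the GUI does this for you.)

```shell
# Illustrative only: roughly what "a new BR" amounts to under the hood.
# Unraid's Network Settings page creates this when you enable a VLAN,
# so these commands are for understanding, not something to run by hand.
ip link add link br0 name br0.5 type vlan id 5   # tagged 802.1Q sub-interface on br0
ip link set br0.5 up

# The EdgeRouter side is the matching half: define VLAN 5 there with its
# own subnet/DHCP pool, and tag VLAN 5 on the switch port facing the server.
```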


Thanks again for the help. :)

Well... I created the VLANs (4 for VMs and 5 for dockers), added them to my EdgeRouter and my UniFi Controller, and added them to my Unraid network config.

 

I can see the routes created in the network settings, and I can select them under Settings > Enable Docker > Advanced, but if I uncheck my br0 interface and check my VLAN 5 interface, "custom:brx" is no longer an option when I edit the docker container... So it seems I have a whole other problem now...

(Going into Settings > Docker and selecting the br0 interface there 'fixes' the issue, as in I can then select br0 again, but I cannot select br0.4 or br0.5 regardless of what I do.)

13 minutes ago, ScottRTL said:

I can see the routes created in the network settings, and I can select them under Settings > Enable Docker > Advanced, but if I uncheck my br0 interface and check my VLAN 5 interface, "custom:brx" is no longer an option when I edit the docker container... So it seems I have a whole other problem now.

Here are my Docker settings (br0, br1 and br0.3 enabled):

 

[screenshot]

 

Network Settings:

[screenshot]

 

Network selection in docker containers:

[screenshot]

 

Not sure what you might have set differently.  I am running unRAID 6.8.3 on this server.

24 minutes ago, ScottRTL said:

Changing those two settings does NOT give me br0.4 or br0.5 in my containers :/

You might want to try not setting an IPv4 address in the VLAN network settings. 

 

As I recall, that caused routing problems in some configurations. This is not necessarily related to this issue, but when I was setting up remote WireGuard access to the LAN, an IP address there would cause docker containers to be unavailable over WireGuard.

 

It's worth a shot.


Welp...

 

Something odd happened...

 

I did as you suggested, and when I hit apply, the server went down (as you would expect with the network stack resetting), but once Unraid was back, it had lost ALL its network settings (static address, MTU, gateway, DNS, VLANs... everything). So I put them all back in, including the VLANs, and applied again.

 

Now that seems to have fixed my issue. I activated VMs and Docker again, put in their ranges, and now I see br0.4 and br0.5.

 

So...I guess that's fixed...? LOL

 

Now to see if the underlying issue of Unraid locking up can be resolved by moving all my containers to br0.5

 

Thanks again for the help. I'll mark this resolved as soon as I get it all done.
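A couple of quick checks for afterwards, sketched below (the container name is an example; the syslog path is where Unraid writes its log):

```shell
# Illustrative follow-up checks once containers are moved to br0.5:
docker network ls                                  # br0.5 should be listed as a custom network
docker inspect -f '{{json .NetworkSettings.Networks}}' pihole   # example container name

# Watch whether the macvlan call trace comes back over the next few days:
grep -i "nf_conntrack" /var/log/syslog
```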

 

42 minutes ago, ScottRTL said:

Now that seems to have fixed my issue. I activated VMs and Docker again, put in their ranges, and now I see br0.4 and br0.5

 

So...I guess that's fixed...? LOL

Just to be clear, I have never had an IPv4 address assigned to my VLAN, going back to the day I set it up. That's why it worked for me, and I was not sure whether that was the problem in your case. I just recall that when I was troubleshooting the WireGuard issues, it was mentioned that there should not be an address there or it would cause routing issues.

 

My problem in that case turned out to be a typo in the gateway IP address.

 

Hopefully, your issues are resolved.

