gurulee Posted September 10, 2024 (edited)

I am experiencing a new issue where Docker containers on my custom VLANs (br0, br0.4, br0.5, br0.6) become unresponsive. I can still ping IPs on any of the networks, but I cannot reach the web services running on any of them while the issue occurs, including the Unraid webUI on br0. No other network changes have occurred, and my Unraid server has been up for 3.5 months so far.

The issue occurs under heavier network load, for example the Plex mobile app downloading content locally to view offline while another container is performing downloads, or when multiple people are watching Plex. The Uptime Kuma webUI also becomes inaccessible during the issue, and when it resolves, its docker monitor events state: "Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?" External PRTG HTTPS monitors for Plex and other containers show timeouts as well, and I cannot reach the Unraid management webUI on br0 during the issue either.

While the issue is occurring, Unraid CPU, RAM, and network utilization are all low. The issue resolves itself after approx. 3-5 min.

I'm on Unraid version 6.12.8. My network config (unchanged for over a year):

- Two physical Ethernet interfaces (eth0, eth1) with bonding and bridging enabled.
- bond0 (eth0, eth1) is connected to a Cisco switch using a LAG port config.
- All VLANs use parent interface bond0.
- Docker VLANs br0, br0.5, br0.6 use the upstream OPNsense firewall for the DHCP pool.
- Docker custom network type: macvlan.

I do not see any kernel call traces in my enhanced syslog plugin output. Can someone help me narrow this down and/or recommend whether I should try switching the Docker network type to ipvlan?
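For reference, the custom networks can be confirmed from the Unraid console; this is only a sketch, the network and interface names below are the ones from my setup, and the exact parent each network reports will depend on how bridging is configured:

```bash
# List the Docker custom networks and check each one's driver and parent interface.
docker network ls
docker network inspect br0.5 --format '{{.Driver}} parent={{index .Options "parent"}}'

# Confirm the underlying bond and VLAN/bridge interfaces are up.
ip -d link show bond0
ip -d link show br0.5
```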
JorgeB Posted September 10, 2024

First thing, I would recommend updating to the latest stable release.
gurulee (Author) Posted September 10, 2024

4 minutes ago, JorgeB said: First thing, I would recommend updating to the latest stable release.

I have been holding off due to all the issues I'm reading about in the release notes and from other users, but if the update has a specific fix for this, then I will plan for it.
gurulee (Author) Posted September 10, 2024

1 hour ago, JorgeB said: First thing, I would recommend updating to the latest stable release.

Okay, I have completed the upgrade from 6.12.8 to 6.12.13 successfully with no known issues. I will monitor it to see if the issue returns.
gurulee (Author) Posted September 10, 2024 (edited)

Returning to my original question: should I switch from macvlan to ipvlan even though I am not aware of any macvlan errors and my VLANs use bond0?
JorgeB Posted September 10, 2024

The macvlan issue with bridging is no longer a problem with the latest release, so test first to see how it is now.
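For anyone weighing the two drivers: on Unraid the change is just the Docker custom network type setting, but at the plain Docker CLI level the difference looks roughly like the sketch below (subnet, gateway, parent, and network names are placeholders, not taken from this setup):

```bash
# macvlan: every container gets its own MAC address on the parent interface.
docker network create -d macvlan \
  --subnet 192.168.5.0/24 --gateway 192.168.5.1 \
  -o parent=bond0.5 example_macvlan

# ipvlan (L2 mode): containers share the parent's MAC address but keep their own IPs,
# which sidesteps the MAC-related issues macvlan can have with bridges and switches.
docker network create -d ipvlan \
  --subnet 192.168.5.0/24 --gateway 192.168.5.1 \
  -o parent=bond0.5 -o ipvlan_mode=l2 example_ipvlan
```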
gurulee (Author) Posted September 10, 2024

2 hours ago, JorgeB said: The macvlan issue with bridging is no longer a problem with the latest release, so test first to see how it is now.

Thank you! I will report back in 48 hours.
gurulee (Author) Posted September 10, 2024

46 minutes ago, gurulee said: Thank you! I will report back in 48 hours.

The issue just reoccurred at around 3:30pm / 15:30. All webUI connectivity was lost to the Unraid management interface (br0) and all my containers, but I was still able to ping the interfaces and the static IPs of the containers on the custom VLANs. The issue resolved itself after approx. 3 min, and the Unraid and container webUIs became accessible again.

I looked at my enhanced syslog plugin and these are the only entries around the time of the issue:

Sep 10 11:05:13 Tower root: Fix Common Problems: Error: Macvlan and Bridging found ** Ignored
Sep 10 11:10:40 Tower kernel: eth0: renamed from vethe9b73b7
Sep 10 11:15:06 Tower webGUI: Successful login user root from 192.168.100.90
Sep 10 11:18:48 Tower kernel: vethb78d931: renamed from eth0
Sep 10 11:19:03 Tower kernel: eth0: renamed from vethd9921c1
Sep 10 12:17:41 Tower emhttpd: spinning down /dev/sdf
Sep 10 12:45:53 Tower emhttpd: read SMART /dev/sdf
Sep 10 13:06:06 Tower emhttpd: spinning down /dev/sdd
Sep 10 13:16:28 Tower emhttpd: spinning down /dev/sdf
Sep 10 13:19:08 Tower emhttpd: read SMART /dev/sdd
Sep 10 13:54:51 Tower emhttpd: read SMART /dev/sdf
Sep 10 13:55:06 Tower emhttpd: spinning down /dev/sdd
Sep 10 15:30:46 Tower emhttpd: read SMART /dev/sdd
JorgeB Posted September 11, 2024

Unfortunately there's nothing relevant logged.
gurulee (Author) Posted September 11, 2024

1 hour ago, JorgeB said: Unfortunately there's nothing relevant logged.

That is apparent, agreed. Is there a way to enable more debug-level logging? And can someone advise me on next steps to narrow down the cause of this issue? Essentially, all HTTP/HTTPS webUIs become inaccessible, both the Unraid management interface on br0 and all the containers on br0.4 and br0.5, intermittently for approx. 3-5 min, and all the while I can still ping all of the interfaces. The issue seems intermittent, with no pattern.
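One idea for capturing more detail, as a rough sketch only (the management IP, log tag, and polling interval are placeholders): a small watchdog run from the Unraid console that probes the webUI and writes interface and socket state to the syslog whenever HTTP stops answering, so there is something to correlate after the next occurrence.

```bash
#!/bin/bash
# Probe the Unraid webUI every 15 seconds; when HTTP stops answering
# (while ping still works), snapshot network state into the syslog.
URL="http://192.168.100.10"   # placeholder: Unraid management IP on br0
while true; do
  if ! curl -s -o /dev/null --max-time 5 "$URL"; then
    logger -t webui-watchdog "webUI probe failed, capturing state"
    ip -s link show bond0 | logger -t webui-watchdog
    ss -s | logger -t webui-watchdog
    cat /proc/net/bonding/bond0 | logger -t webui-watchdog
  fi
  sleep 15
done
```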
JorgeB Posted September 11, 2024

The only thing I can think of is to run the server with half of the containers; if it's the same, try the other half, and if that helps, keep drilling down to see if you can find the culprit.
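A rough way to script that split from the Unraid console is sketched below; it is only an illustration, the container list comes from whatever docker ps reports, and anything with dependencies or autostart ordering may need handling by hand.

```bash
# Stop roughly half of the running containers for the next test window.
docker ps --format '{{.Names}}' | sort > /tmp/containers.txt
half=$(( $(wc -l < /tmp/containers.txt) / 2 ))
head -n "$half" /tmp/containers.txt | xargs -r docker stop

# If the problem still occurs, swap the halves for the following window:
# head -n "$half" /tmp/containers.txt | xargs -r docker start
# tail -n +"$((half + 1))" /tmp/containers.txt | xargs -r docker stop
```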
gurulee (Author) Posted September 13, 2024 (edited)

So far so good... 48 hours and still stable. I'm monitoring at the container level with Uptime Kuma, and the HTTP services of the Unraid management interface and the containers with PRTG. 🤞🙏🙌
gurulee (Author) Posted September 16, 2024

Continued stability and no recurrence of the issue since upgrading to 6.12.13 five days ago.
gurulee (Author) Posted September 25, 2024

I have not been able to catch this while it is occurring, but Uptime Kuma intermittently reports the following on its docker monitors for about 2-3 min every ~7 days:

Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?

At the same time, my PRTG HTTPS monitors for the same containers show timeouts and downtime as well. Any guidance would be appreciated.
gurulee (Author) Posted October 22, 2024

Can anyone give me some guidance on next steps to get support on this outstanding, recurring issue?
JorgeB Posted October 22, 2024

Did you try what I mentioned above? You need to see if you can find the problem container.
gurulee (Author) Posted November 1, 2024

On 10/22/2024 at 7:40 AM, JorgeB said: Did you try what I mentioned above? You need to see if you can find the problem container.

I appreciate the advice. Given the number of containers and their required availability, I'm hoping I can take a different troubleshooting path first, for example generating the necessary system logs for Unraid support to analyze. What is the official Unraid support process?
JorgeB Posted November 1, 2024

There was nothing relevant logged before. You can try using just half of your containers; if there are still issues, try the other half, then keep drilling down.
gurulee (Author) Posted November 3, 2024

On 11/1/2024 at 7:22 AM, JorgeB said: There was nothing relevant logged before. You can try using just half of your containers; if there are still issues, try the other half, then keep drilling down.

What about the anonymized diagnostics file under Tools? Would that reveal anything to help narrow down the root cause?
JorgeB Posted November 3, 2024

There won't be anything extra in the syslog.
gurulee (Author) Posted November 8, 2024

On 11/3/2024 at 1:13 PM, JorgeB said: There won't be anything extra in the syslog.

I'm tracking this intermittent issue with Uptime Kuma and PRTG, and looking into enabling more logging options to narrow down the root cause.
gurulee (Author) Posted November 12, 2024

Strange, I continue to experience containers becoming inaccessible randomly every few days for approx. 5 min, and then they come back up. Uptime Kuma reports this for the docker monitors, for example:

UptimeKuma Alert: Nextcloud (docker): [🔴 Down]
Knex: Timeout acquiring a connection. The pool is probably full. Are you missing a .transacting(trx) call?
Time (America/New_York): 2024-11-11 22:05:36
JorgeB Posted November 12, 2024

And is the WebGUI still accessible when that happens?
gurulee (Author) Posted November 17, 2024

On 11/12/2024 at 7:06 AM, JorgeB said: And is the WebGUI still accessible when that happens?

No, it is not. I noted this previously in the original description of this thread: "I cannot reach the web services running on any of them while the issue occurs, including the Unraid webUI on br0." Any ideas?
Vr2Io Posted November 17, 2024 (Solution)

On 9/10/2024 at 11:20 PM, gurulee said: my VLANs use bond0?

You should try disabling bonding; it is often a source of trouble, and your problem also fixes itself within a few minutes.
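Before (or while) disabling the bond, its state can also be captured from the console during the next outage window; this is only a sketch and assumes the bond is bond0 with slaves eth0 and eth1, as described earlier in the thread:

```bash
# Show bonding mode, LACP partner details, and per-slave link status.
cat /proc/net/bonding/bond0

# Error/drop counters on the bond and its slaves; counters that jump during
# an outage window would point at the LAG rather than at Docker.
ip -s link show bond0
ip -s link show eth0
ip -s link show eth1
```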