Unraid 6.9.2 - Docker causing server to become unresponsive


Hello, I have an HP ML350P with plenty of resources, and I have been attempting to upgrade from 6.8.2 to 6.9.2 for quite some time now.  Each time I do, I am unable to start more than a few Docker containers before the web GUI starts acting erratically.  I am not sure what the cause is, and the only thing the logs have in common across occurrences is the following "upstream timed out" error message:

 

Jan 29 09:49:19 ML350P nginx: 2022/01/29 09:49:19 [error] 9325#9325: *902 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.0.0.51, server: , request: "POST /plugins/community.applications/scripts/notices.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "10.0.0.101", referrer: "http://10.0.0.101/Main"

 

After this error shows in the log, no commands to the server will work.  While I can still navigate the web interface (aside from the Docker page, which just tries to load endlessly), I am unable to send a command to stop or restart the Docker service.  The machine will not reboot either, whether from the button in the GUI or from the command line.  The syslog indicates the system is going down for reboot, but nothing happens after that.  I have not been able to pin this down to a particular container, and the server seems to be fine when no containers are running.  As soon as I fire up three or so, I can expect the issue to occur again.  Also, I should note that I am using ZFS, and that is where my Docker config and containers are located.

 

I have also tried deleting the Docker image file, and now I cannot even get the containers to run from the templates.  I am thinking this may be a ZFS issue, but is there anywhere else I can look for some clues?

 

Here is what happened when I tried to add my containers back:

Jan 29 10:20:44 ML350P nginx: 2022/01/29 10:20:44 [error] 7556#7556: *2249 upstream timed out (110: Connection timed out) while reading upstream, client: 10.0.0.51, server: , request: "POST /Docker/AddContainer?xmlTemplate=user%3A%2Fboot%2Fconfig%2Fplugins%2FdockerMan%2Ftemplates-user%2Fmy-UniFi-Video.xml&rmTemplate= HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "10.0.0.101", referrer: "http://10.0.0.101/Docker/AddContainer?xmlTemplate=user%3A%2Fboot%2Fconfig%2Fplugins%2FdockerMan%2Ftemplates-user%2Fmy-UniFi-Video.xml&rmTemplate="
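To see at a glance which web GUI requests are timing out, the nginx entries can be tallied by request path. The following is only a minimal sketch: the regular expression is written against the two log excerpts above, and `summarize_timeouts` is a hypothetical helper name, not part of Unraid or nginx.

```python
import re
from collections import Counter

# Pattern written against the two syslog excerpts in this post; other nginx
# error formats are not covered.
TIMEOUT_RE = re.compile(
    r'upstream timed out \(110: Connection timed out\) while (?P<phase>[^,]+), '
    r'client: (?P<client>[^,]+), .*?request: "(?P<method>\S+) (?P<path>\S+) [^"]*"'
)


def summarize_timeouts(lines):
    """Count 'upstream timed out' entries per (method, path), ignoring query strings."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            path = match.group("path").split("?", 1)[0]
            counts[(match.group("method"), path)] += 1
    return counts


if __name__ == "__main__":
    # Assumption: the syslog is readable at /var/log/syslog on this server.
    with open("/var/log/syslog", errors="replace") as log:
        for (method, path), n in summarize_timeouts(log).most_common():
            print(f"{n:4d}  {method} {path}")
```

If nearly all of the timeouts point at Docker-related pages, that would be consistent with the PHP backend waiting on Docker rather than with a problem in nginx itself.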

 

Thanks in advance!

ml350p-diagnostics-20220129-1040.zip

Edited by greg_gorrell
Thanks for the reply, Squid. I honestly am not sure what is causing this issue, but the "upstream timed out" error is in the log every time it happens.  To clarify, though, I don't suspect CA has anything to do with it; it just happens to be related to the job I was performing the most recent time this happened.  Generally, the timeout occurs when I am accessing the Docker page and not the CA Apps page, as if it times out when querying the service.  Since I am getting no other information from the logs, I have no clue where to start, but it seems that the issue lies with the query from the web GUI to Docker.  Maybe not, but when I start the array and run it with either Docker disabled completely or Docker enabled with no containers running, the issue does not manifest itself.  After a couple of Docker containers are running, at some point this timeout will occur and any subsequent commands to control a service will not execute, whether sent over SSH or the web interface.  I am going to attempt to move all of the Docker-related data off the zpool and onto a cache drive managed by Unraid, but any assistance on how to better troubleshoot this would be greatly appreciated.
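One way to narrow down which layer stops responding is to run a few read-only commands with a time limit and see which ones never return. This is a rough sketch, not a tested diagnostic: `probe` is a hypothetical helper, and the choice of `docker ps` and `zpool status` as probes is an assumption on my part rather than something suggested in this thread.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and classify it as 'ok', 'failed (N)', 'hung', or 'missing'."""
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except FileNotFoundError:
        return "missing"
    try:
        code = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked in uninterruptible I/O may not
        # exit, so we deliberately do not wait for it again.
        proc.kill()
        return "hung"
    return "ok" if code == 0 else f"failed ({code})"


if __name__ == "__main__":
    # Assumed probes: both are read-only status commands.
    for cmd in (["docker", "ps"], ["zpool", "status"]):
        print(" ".join(cmd), "->", probe(cmd))
```

If `docker ps` reports "hung" while `zpool status` still returns, that would point toward the Docker daemon; if both hang, the storage layer underneath becomes the more likely suspect.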

Perhaps I am not asking in the correct way.  Could somebody please explain to me how the web interface interacts with the underlying services?  When I click "reboot server" on the main page, what has to happen for the "shutdown -r" command to be executed by the system?  Is it possible that the web server component of Unraid sends a command to the system and will not move on until that command is completed?  Is it possible that I have an issue with Docker or ZFS and that this issue is why the timeout is occurring, making the timeout more of a symptom than a cause of the problem?  Thanks again.

Yes, that is the name of the zpool.  I did notice that everything works fine with a fresh docker.img file created on the cache or array via the settings, with the appdata folders left on the zpool, so it is definitely some weird little bug with using ZFS.  It works for now, and I'll see what happens with the official ZFS implementation when it comes around.  Apologies for the delay in responding; I had a drive go on the other server, which took priority lately. Thank you for taking the time to check out the diagnostics and offer input.
