Rubene, posted April 6, 2020 (edited April 6, 2020)

Hi,

This is the fourth time this has happened on my server:

[screenshot]

This has multiple impacts:
- I can no longer access certain Docker containers (e.g. Nextcloud), but most of them are still running fine (all behind Traefik as a reverse proxy).
- Stopping (even forced), restarting, creating, or deleting Docker containers is no longer possible, whether via the GUI, the terminal, or Portainer. The commands hang.
- Creating a diagnostics file is no longer possible, via either the GUI or the terminal.
- Stopping the array is no longer possible (it hangs, I think because Docker is not responding).

The only way to solve this is an (unclean) reboot.

I think it is related to Nextcloud. All four times I was doing something with Nextcloud (although Nextcloud is used often, so why only these four times?). Another reason is that Nextcloud is no longer accessible (gateway timeout). My guess is that it is network related, or could it be something else? And how can I verify that? Like I said, it's impossible to get a diagnostics report.

I'm currently on 6.8.3, but it also happened on 6.8.1.

Some more graphs:

[graphs]

Thanks!

JorgeB, posted April 6, 2020

57 minutes ago, Rubene said: "Like I said, it's impossible to get a diagnostics report."

Try this: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=781601

Rubene, posted April 6, 2020

35 minutes ago, johnnie.black said: "Try this: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=781601"

Thanks! Just did that. I was able to copy everything from Tools -> System log, but there is nothing around the time it started (10:27). At 11:31 I tried to stop the array.

Apr 6 06:00:15 Tower root: /var/lib/docker: 18.9 GiB (20290076672 bytes) trimmed on /dev/loop2
Apr 6 06:00:15 Tower root: /mnt/cache: 217.2 GiB (233194831872 bytes) trimmed on /dev/mapper/sdd1
Apr 6 08:00:32 Tower kernel: mdcmd (62): spindown 0
Apr 6 08:00:33 Tower kernel: mdcmd (63): spindown 1
Apr 6 10:38:14 Tower webGUI: Successful login user root from 172.18.0.18
Apr 6 10:38:36 Tower login[24621]: ROOT LOGIN on '/dev/pts/0'
Apr 6 10:39:38 Tower kernel: mdcmd (64): spindown 0
Apr 6 10:40:21 Tower nginx: 2020/04/06 10:40:21 [error] 10155#10155: *438771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 172.18.0.18, server: , request: "POST /webGui/include/Download.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "unraid.xxx.com", referrer: "https://unraid.xxx.com/Tools/Diagnostics"
Apr 6 10:41:24 Tower kernel: mdcmd (65): spindown 1
Apr 6 11:31:29 Tower webGUI: Successful login user root from 192.168.2.55
Apr 6 11:31:39 Tower kernel: mdcmd (66): nocheck cancel
Apr 6 11:31:40 Tower emhttpd: Spinning up all drives...
Apr 6 11:31:40 Tower emhttpd: shcmd (8973): /usr/sbin/hdparm -S0 /dev/sdd
Apr 6 11:31:40 Tower kernel: mdcmd (67): spinup 0
Apr 6 11:31:40 Tower kernel: mdcmd (68): spinup 1
Apr 6 11:31:40 Tower root:
Apr 6 11:31:40 Tower root: /dev/sdd:
Apr 6 11:31:40 Tower root: setting standby to 0 (off)
Apr 6 11:31:45 Tower emhttpd: Stopping services...
Apr 6 11:31:45 Tower root: Stopping docker_load
Apr 6 11:31:46 Tower emhttpd: shcmd (8977): /etc/rc.d/rc.docker stop
Apr 6 11:31:46 Tower kernel: br-b2a3f6552968: port 8(veth4088f17) entered disabled state

Rubene, posted May 3, 2020 (edited May 3, 2020)

Today I had the issue again, around 11:51. The syslog shows nothing special:

May 3 10:38:37 Tower emhttpd: shcmd (77895): /etc/rc.d/rc.samba restart
May 3 10:38:40 Tower root: Starting Samba: /usr/sbin/smbd -D
May 3 10:38:40 Tower root: /usr/sbin/nmbd -D
May 3 10:38:40 Tower root: /usr/sbin/wsdd
May 3 10:38:40 Tower root: /usr/sbin/winbindd -D
May 3 10:38:40 Tower emhttpd: shcmd (77904): smbcontrol smbd close-share 'x'
May 3 10:55:13 Tower kernel: mdcmd (294): spindown 0
May 3 10:55:14 Tower kernel: mdcmd (295): spindown 1
May 3 12:02:59 Tower kernel: veth0d8fa17: renamed from eth0
May 3 12:02:59 Tower kernel: docker0: port 1(vethb96c1c0) entered disabled state --- Restarted a docker container
May 3 12:02:59 Tower kernel: docker0: port 1(vethb96c1c0) entered disabled state
May 3 12:02:59 Tower kernel: device vethb96c1c0 left promiscuous mode
May 3 12:02:59 Tower kernel: docker0: port 1(vethb96c1c0) entered disabled state

Again Nextcloud was not responding. I was not able to stop or kill that particular container (other containers could still be stopped and started). I managed to forcefully stop Docker with /etc/rc.d/rc.docker force_stop, but the load was still there.

Looking at top and ps, I noticed there were quite a few php-fpm processes in the D state ("uninterruptible sleep (usually IO)"). There was no way of stopping these. Nextcloud uses php-fpm, but I would expect these processes to be gone once the container is no longer running.

Netdata was also running. The only correlation I see is an increased number of TCP sockets and a higher number of IPv4 UDP errors, IPv6 packets, and IPv6 errors.

The issue is still very vague to me. Does anyone have any idea what this could be?

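The D-state observation in the post above can also be checked without top or ps. The following is a minimal sketch, not something posted in the thread: it assumes a Linux host with /proc mounted, and the function names `parse_stat_line` and `find_uninterruptible` are hypothetical names chosen for illustration. It reads the state field from each /proc/<pid>/stat file and collects the processes that are in uninterruptible sleep.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the text of one /proc/<pid>/stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            # The process exited during the scan, or the line was malformed.
            continue
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

Calling `find_uninterruptible()` on the affected host would return the stuck processes as (pid, name) pairs. Processes in this state generally do not react to signals until the kernel call they are blocked in returns, which is consistent with the php-fpm processes that could not be stopped.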
bonienl, posted May 3, 2020

The load values which Linux reports include IOWAIT time. If your drives have issues or are very busy, this is reflected in the load and makes your system sluggish.

One cause can be that the folder "appdata" is located on the array instead of the cache, and consequently all containers make heavy use of the array.

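To see how much of the load is actually I/O wait, the aggregate CPU counters in /proc/stat can be sampled twice and compared. This is a hedged sketch under stated assumptions: it assumes the usual Linux field order on the `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice), and the helper names `cpu_times` and `iowait_share` are hypothetical. Note that the Linux load average counts tasks in uninterruptible sleep as well as runnable tasks, so a high load with a low iowait share is possible.

```python
def cpu_times(stat_text):
    """Return the aggregate CPU counters from /proc/stat text as a list of ints."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_share(before_text, after_text):
    """Fraction of CPU time spent in iowait between two /proc/stat samples."""
    before = cpu_times(before_text)
    after = cpu_times(after_text)
    deltas = [new - old for old, new in zip(before, after)]
    # Sum only the first eight fields: guest and guest_nice are already
    # accounted for inside user and nice.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return deltas[4] / total  # index 4 is the iowait counter
```

In practice the two samples would come from reading /proc/stat a few seconds apart, for example with `open("/proc/stat").read()` before and after a `time.sleep(5)`.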
Rubene, posted May 3, 2020

7 minutes ago, bonienl said: "The load values which Linux reports include IOWAIT time. If your drives have issues or are very busy, this is reflected in the load and makes your system sluggish. One cause can be that the folder "appdata" is located on the array instead of the cache, and consequently all containers make heavy use of the array."

Indeed, it looks like it has something to do with I/O, but I don't suspect the disks. The appdata folder is located on the cache drive. The two hard drives were both asleep, and the cache drive had barely any ops (see also the screenshot in my first post).

I suspect it has something to do with the network. That is where I see the most correlation, but I'm not entirely sure yet.

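One way to test the network suspicion is to log socket counts over time and compare them with the moment the load climbs. The sketch below is an illustration only: it assumes the key/value layout of /proc/net/sockstat on Linux (lines such as `TCP: inuse 5 orphan 0 tw 2 alloc 8 mem 1`), and `parse_sockstat` is a hypothetical helper name. It covers IPv4 only; IPv6 counters live in /proc/net/sockstat6.

```python
def parse_sockstat(text):
    """Parse /proc/net/sockstat text into {protocol: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "name:" prefix
        tokens = rest.split()
        # Counters come as alternating "key value" pairs.
        stats[name.strip()] = {
            key: int(value) for key, value in zip(tokens[0::2], tokens[1::2])
        }
    return stats
```

Recording `parse_sockstat(open("/proc/net/sockstat").read())["TCP"]` once a minute would show whether the TCP socket count rises before the hang or only as a consequence of it.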
Guest, posted July 12, 2022

Did you ever find a solution for this problem, @Rubene? My Nextcloud Docker container (LSIO version) also freezes from time to time, and I can't find where the problem is.

Rubene, posted July 18, 2022 (edited July 18, 2022)

@Emilio Unfortunately not. Nextcloud already didn't quite meet my needs, and with these problems on top, I switched to plain shared network volumes. The problem did not occur again, so it was definitely related to the Nextcloud Docker container. But I have absolutely no idea what the reason may have been.