High load, but low CPU utilization: docker frozen. Unable to stop array, reboot needed

Rubene · April 6, 2020

Hi,

This is the fourth time this has happened on my server: the load is high while CPU utilization stays low.

It has several consequences:

- I can no longer access certain Docker containers (e.g. Nextcloud), although most of them keep running fine (all behind Traefik as reverse proxy).

- Stopping (even forced), restarting, creating, or deleting Docker containers is no longer possible, whether via the GUI, the terminal, or Portainer. The commands simply hang.

- Creating a diagnostics file is no longer possible, neither via the GUI nor via the terminal.

- Stopping the array is no longer possible (it hangs, I think because Docker is not responding).

The only way to recover is an unclean reboot.

I think it is related to Nextcloud. All four times I was doing something with Nextcloud (although Nextcloud is used often, so why only these four times?). Nextcloud is also the container that becomes inaccessible (gateway timeout).

My guess is that it is network related, or could it be something else? And how can I verify that? Like I said, it's impossible to get a diagnostics report.

I'm currently on 6.8.3, but it also happened on 6.8.1.

Some more graphs:

Thanks!

Edited April 6, 2020 by Rubene

JorgeB · April 6, 2020

57 minutes ago, Rubene said:

Like I said, it's impossible to get a diagnostics report.

Try this:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=781601

Rubene · April 6, 2020

35 minutes ago, johnnie.black said:

Try this:

https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=781601

Thanks! Just did that.

I was able to copy everything from Tools -> System log.

But there is nothing in the log around the time the problem started (10:27). At 11:31 I tried to stop the array.

Apr  6 06:00:15 Tower root: /var/lib/docker: 18.9 GiB (20290076672 bytes) trimmed on /dev/loop2
Apr  6 06:00:15 Tower root: /mnt/cache: 217.2 GiB (233194831872 bytes) trimmed on /dev/mapper/sdd1
Apr  6 08:00:32 Tower kernel: mdcmd (62): spindown 0
Apr  6 08:00:33 Tower kernel: mdcmd (63): spindown 1
Apr  6 10:38:14 Tower webGUI: Successful login user root from 172.18.0.18
Apr  6 10:38:36 Tower login[24621]: ROOT LOGIN  on '/dev/pts/0'
Apr  6 10:39:38 Tower kernel: mdcmd (64): spindown 0
Apr  6 10:40:21 Tower nginx: 2020/04/06 10:40:21 [error] 10155#10155: *438771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 172.18.0.18, server: , request: "POST /webGui/include/Download.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "unraid.xxx.com", referrer: "https://unraid.xxx.com/Tools/Diagnostics"
Apr  6 10:41:24 Tower kernel: mdcmd (65): spindown 1
Apr  6 11:31:29 Tower webGUI: Successful login user root from 192.168.2.55
Apr  6 11:31:39 Tower kernel: mdcmd (66): nocheck cancel
Apr  6 11:31:40 Tower emhttpd: Spinning up all drives...
Apr  6 11:31:40 Tower emhttpd: shcmd (8973): /usr/sbin/hdparm -S0 /dev/sdd
Apr  6 11:31:40 Tower kernel: mdcmd (67): spinup 0
Apr  6 11:31:40 Tower kernel: mdcmd (68): spinup 1
Apr  6 11:31:40 Tower root: 
Apr  6 11:31:40 Tower root: /dev/sdd:
Apr  6 11:31:40 Tower root:  setting standby to 0 (off)
Apr  6 11:31:45 Tower emhttpd: Stopping services...
Apr  6 11:31:45 Tower root: Stopping docker_load
Apr  6 11:31:46 Tower emhttpd: shcmd (8977): /etc/rc.d/rc.docker stop
Apr  6 11:31:46 Tower kernel: br-b2a3f6552968: port 8(veth4088f17) entered disabled state
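
For anyone who wants to check a copied log for entries around a specific moment, here is a minimal Python sketch that keeps only the lines within a few minutes of a given time. The function names are my own, and it assumes the classic syslog timestamp format shown above (no year, English month abbreviations), so the year has to be supplied by hand.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Return the timestamp of a line such as 'Apr  6 10:38:14 Tower ...'.

    Returns None when the line does not start with a syslog timestamp.
    The year is an explicit argument because the log format omits it.
    """
    parts = line.split(None, 3)  # month, day, time, remainder
    if len(parts) < 3:
        return None
    try:
        return datetime.strptime(
            f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
        )
    except ValueError:
        return None


def lines_in_window(lines, center, minutes, year):
    """Keep only the lines logged within +/- `minutes` of `center`."""
    delta = timedelta(minutes=minutes)
    kept = []
    for line in lines:
        ts = parse_syslog_time(line, year)
        if ts is not None and abs(ts - center) <= delta:
            kept.append(line.rstrip("\n"))
    return kept
```

For example, `lines_in_window(open("/var/log/syslog"), datetime(2020, 4, 6, 10, 27), 15, 2020)` would return everything logged within 15 minutes of 10:27, assuming the log lives at that path.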

Rubene · May 3, 2020

Today I had the issue again, around 11:51.

Syslog shows nothing special:

May  3 10:38:37 Tower emhttpd: shcmd (77895): /etc/rc.d/rc.samba restart
May  3 10:38:40 Tower root: Starting Samba:  /usr/sbin/smbd -D
May  3 10:38:40 Tower root:                  /usr/sbin/nmbd -D
May  3 10:38:40 Tower root:                  /usr/sbin/wsdd 
May  3 10:38:40 Tower root:                  /usr/sbin/winbindd -D
May  3 10:38:40 Tower emhttpd: shcmd (77904): smbcontrol smbd close-share 'x'
May  3 10:55:13 Tower kernel: mdcmd (294): spindown 0
May  3 10:55:14 Tower kernel: mdcmd (295): spindown 1
May  3 12:02:59 Tower kernel: veth0d8fa17: renamed from eth0
May  3 12:02:59 Tower kernel: docker0: port 1(vethb96c1c0) entered disabled state    --- Restarted a docker container
May  3 12:02:59 Tower kernel: docker0: port 1(vethb96c1c0) entered disabled state
May  3 12:02:59 Tower kernel: device vethb96c1c0 left promiscuous mode
May  3 12:02:59 Tower kernel: docker0: port 1(vethb96c1c0) entered disabled state

Again Nextcloud was not responding. I was not able to stop or kill that particular container (other containers could still be stopped and started). I managed to forcefully stop Docker with /etc/rc.d/rc.docker force_stop, but the load was still there.

Looking at top and ps, I noticed quite a few php-fpm processes in the D state ("uninterruptible sleep (usually IO)"). There was no way of stopping these. Nextcloud uses php-fpm, but I would expect these processes to be gone once the container is no longer running.
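
To list such processes without reading through top output, something like the following Python sketch could be used. It reads the standard Linux /proc/<pid>/stat and /proc/<pid>/wchan files; the function names are my own, and whether wchan shows anything useful depends on the kernel configuration, so treat it as a starting point rather than a diagnosis.

```python
import os


def parse_stat(stat_text):
    """Extract (pid, command, state) from the contents of /proc/<pid>/stat.

    The command is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' instead of on whitespace.
    """
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_processes(proc_root="/proc"):
    """Return (pid, command, wchan) for every process in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited or entry unreadable
        if state != "D":
            continue
        # wchan names the kernel function the task is sleeping in; it can be
        # "0" or unreadable depending on the kernel configuration.
        try:
            with open(os.path.join(proc_root, entry, "wchan")) as fh:
                wchan = fh.read().strip()
        except OSError:
            wchan = "?"
        found.append((pid, comm, wchan))
    return found
```

Calling `blocked_processes()` as root while the problem is happening would show which commands are stuck and, where available, the kernel function they are waiting in, which may hint at whether the wait is on disk, network, or a filesystem layer.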

Netdata was also running. The only correlation I see is an increased number of TCP sockets and a higher number of IPv4 UDP errors, IPv6 packets, and IPv6 errors.
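
To cross-check those counters outside netdata, a rough Python sketch like the one below could compare two snapshots of /proc/net/snmp, the standard Linux file with IPv4 protocol counters. The function names are my own, and it only covers the IPv4 side; the IPv6 counters live in a separate file with a different layout that this sketch does not parse.

```python
def parse_snmp(text):
    """Parse /proc/net/snmp-style text into {protocol: {counter: value}}.

    The file consists of line pairs with the same prefix: a header line
    with counter names, followed by a line with the matching values.
    """
    tables = {}
    lines = [ln for ln in text.splitlines() if ":" in ln]
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        proto_values, numbers = values.split(":", 1)
        if proto != proto_values:
            continue  # unexpected layout, skip this pair
        tables[proto] = dict(zip(names.split(), (int(n) for n in numbers.split())))
    return tables


def counter_deltas(before, after, proto):
    """Return the counters of one protocol that changed between snapshots."""
    old = before.get(proto, {})
    new = after.get(proto, {})
    return {
        name: new[name] - old[name]
        for name in new
        if name in old and new[name] != old[name]
    }


def snapshot(path="/proc/net/snmp"):
    """Read and parse the current counters."""
    with open(path) as fh:
        return parse_snmp(fh.read())
```

Taking one `snapshot()` before and one after reproducing the problem, then calling `counter_deltas(before, after, "Udp")`, would show whether counters such as `InErrors` are really climbing during the hang.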

Issue is still very vague to me. Does anyone have any idea what this could be?

Edited May 3, 2020 by Rubene

bonienl · May 3, 2020

The load values that Linux reports include processes waiting on I/O (IOWAIT), not only processes using the CPU.

If your drives have issues or are very busy, this is reflected in the load and makes your system sluggish.

One cause can be that the folder "appdata" is located on the array instead of the cache and consequently all containers make heavy use of the array.
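
As a rough way to see this on a running system, the Python sketch below reads the standard /proc/loadavg and /proc/stat files and computes how much of the CPU time between two samples was spent in iowait. The function names are my own. Note that tasks in uninterruptible sleep raise the load even when the iowait share stays low, so a high load with low iowait does not rule out blocked I/O.

```python
def parse_loadavg(text):
    """Parse the contents of /proc/loadavg, e.g. '0.20 0.18 0.12 1/80 11206'."""
    one, five, fifteen, tasks, _last_pid = text.split()
    runnable, total = tasks.split("/")
    return {
        "1min": float(one),
        "5min": float(five),
        "15min": float(fifteen),
        "runnable": int(runnable),
        "total": int(total),
    }


def parse_cpu_times(stat_text):
    """Return the aggregate 'cpu' line of /proc/stat as a dict of jiffies."""
    # Guest time is left out on purpose: the kernel already counts it in user.
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return dict(zip(names, map(int, fields[1:])))
    raise ValueError("no aggregate cpu line found")


def iowait_share(before, after):
    """Fraction of CPU time spent in iowait between two parse_cpu_times() samples."""
    deltas = {key: after[key] - before[key] for key in before}
    total = sum(deltas.values())
    return deltas["iowait"] / total if total else 0.0
```

Sampling `parse_cpu_times(open("/proc/stat").read())` twice a few seconds apart and passing both results to `iowait_share` gives a number between 0 and 1 that can be compared with the load reported by `parse_loadavg(open("/proc/loadavg").read())`.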

Rubene · May 3, 2020

7 minutes ago, bonienl said:

The load values which Linux reports include IOWAIT time.

If your drives have issues or are very busy, this is reflected in the load and makes your system sluggish.

One cause can be that the folder "appdata" is located on the array instead of the cache and consequently all containers make heavy use of the array.

Indeed, it looks like it has something to do with I/O. But I don't suspect the disks.

The appdata folder is located on the cache drive. Both hard drives were spun down, and the cache drive had barely any operations (see also the screenshot in my first post).

I suspect it has something to do with the network. That is where I see the most correlation, but I'm not entirely sure yet.

July 12, 2022

Did you ever find a solution for this problem, @Rubene? My Nextcloud docker (LSIO version) also freezes from time to time and I can't find where the problem is.

Rubene · July 18, 2022

@Emilio Unfortunately not. Nextcloud already didn't quite meet my needs, and with these problems on top I switched to plain shared network volumes. The problem did not occur again, so it was definitely related to the Nextcloud Docker container, but I have absolutely no idea what the cause may have been.

Edited July 18, 2022 by Rubene
