High load, but low CPU utilization: docker frozen. Unable to stop array, reboot needed



Hi,

 

This is the fourth time this has happened on my server:

[Screenshot 2020-04-06 at 10.53.09]

 

This has multiple impacts:

- I can no longer access certain Docker containers (e.g. Nextcloud), although most of them are still running fine (all behind Traefik as reverse proxy).

- Stopping (even forced), restarting, creating, or deleting Docker containers is no longer possible, whether via the GUI, the terminal, or Portainer. The commands just hang.

- Creating a diagnostics file is no longer possible, via either the GUI or the terminal.

- Stopping the array is no longer possible (it hangs, I think because Docker is not responding).

 

The only way to solve this is an (unclean) reboot.

 

I think it is related to Nextcloud. All four times I was doing something in Nextcloud when it happened (although Nextcloud is used often, so why only these four times?). Nextcloud is also the service that becomes inaccessible (gateway timeout).

My guess is that it is network related, or could it be something else? And how can I verify that? Like I said, it's impossible to get a diagnostics report.

I'm currently on 6.8.3, but it also happened on 6.8.1.

 

Some more graphs:

[Screenshot 2020-04-06 at 10.53.15]

 

Thanks!

 

 

Edited by Rubene
35 minutes ago, johnnie.black said:

Thanks! Just did that.

 

I was able to copy everything from Tools -> System log.

But there is nothing around the time it started (10:27). At 11:31 I tried to stop the array.

 

Apr  6 06:00:15 Tower root: /var/lib/docker: 18.9 GiB (20290076672 bytes) trimmed on /dev/loop2
Apr  6 06:00:15 Tower root: /mnt/cache: 217.2 GiB (233194831872 bytes) trimmed on /dev/mapper/sdd1
Apr  6 08:00:32 Tower kernel: mdcmd (62): spindown 0
Apr  6 08:00:33 Tower kernel: mdcmd (63): spindown 1
Apr  6 10:38:14 Tower webGUI: Successful login user root from 172.18.0.18
Apr  6 10:38:36 Tower login[24621]: ROOT LOGIN  on '/dev/pts/0'
Apr  6 10:39:38 Tower kernel: mdcmd (64): spindown 0
Apr  6 10:40:21 Tower nginx: 2020/04/06 10:40:21 [error] 10155#10155: *438771 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 172.18.0.18, server: , request: "POST /webGui/include/Download.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "unraid.xxx.com", referrer: "https://unraid.xxx.com/Tools/Diagnostics"
Apr  6 10:41:24 Tower kernel: mdcmd (65): spindown 1
Apr  6 11:31:29 Tower webGUI: Successful login user root from 192.168.2.55
Apr  6 11:31:39 Tower kernel: mdcmd (66): nocheck cancel
Apr  6 11:31:40 Tower emhttpd: Spinning up all drives...
Apr  6 11:31:40 Tower emhttpd: shcmd (8973): /usr/sbin/hdparm -S0 /dev/sdd
Apr  6 11:31:40 Tower kernel: mdcmd (67): spinup 0
Apr  6 11:31:40 Tower kernel: mdcmd (68): spinup 1
Apr  6 11:31:40 Tower root: 
Apr  6 11:31:40 Tower root: /dev/sdd:
Apr  6 11:31:40 Tower root:  setting standby to 0 (off)
Apr  6 11:31:45 Tower emhttpd: Stopping services...
Apr  6 11:31:45 Tower root: Stopping docker_load
Apr  6 11:31:46 Tower emhttpd: shcmd (8977): /etc/rc.d/rc.docker stop
Apr  6 11:31:46 Tower kernel: br-b2a3f6552968: port 8(veth4088f17) entered disabled state

 

  • 4 weeks later...

Today I had the issue again, around 11:51.

 

Syslog shows nothing special:

 

May  3 10:38:37 Tower emhttpd: shcmd (77895): /etc/rc.d/rc.samba restart
May  3 10:38:40 Tower root: Starting Samba:  /usr/sbin/smbd -D
May  3 10:38:40 Tower root:                  /usr/sbin/nmbd -D
May  3 10:38:40 Tower root:                  /usr/sbin/wsdd 
May  3 10:38:40 Tower root:                  /usr/sbin/winbindd -D
May  3 10:38:40 Tower emhttpd: shcmd (77904): smbcontrol smbd close-share 'x'
May  3 10:55:13 Tower kernel: mdcmd (294): spindown 0
May  3 10:55:14 Tower kernel: mdcmd (295): spindown 1
May  3 12:02:59 Tower kernel: veth0d8fa17: renamed from eth0
May  3 12:02:59 Tower kernel: docker0: port 1(vethb96c1c0) entered disabled state    --- Restarted a docker container
May  3 12:02:59 Tower kernel: docker0: port 1(vethb96c1c0) entered disabled state
May  3 12:02:59 Tower kernel: device vethb96c1c0 left promiscuous mode
May  3 12:02:59 Tower kernel: docker0: port 1(vethb96c1c0) entered disabled state

 

Again Nextcloud was not responding. I was not able to stop or kill that particular container (other containers could still be stopped and started). I managed to forcefully stop Docker with /etc/rc.d/rc.docker force_stop, but the load was still there.

 

Looking at top and ps, I noticed quite a few php-fpm processes in the D state ("uninterruptible sleep (usually IO)"). There was no way of stopping these. Nextcloud uses php-fpm, but I would expect these processes to be gone once the container is no longer running.
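
For anyone who wants to capture the same picture without top or ps, here is a minimal sketch that lists processes in the D state by reading /proc directly. The function names (parse_stat_line, find_d_state) are my own and purely illustrative. It assumes a Linux /proc layout, and /proc/<pid>/wchan may be empty or unavailable depending on the kernel configuration.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    open_idx = line.index("(")
    close_idx = line.rindex(")")
    pid = int(line[:open_idx].strip())
    comm = line[open_idx + 1:close_idx]
    state = line[close_idx + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root="/proc"):
    """List (pid, comm, wchan) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable
        if state != "D":
            continue
        try:
            # wchan names the kernel function the task is sleeping in,
            # when the kernel exposes it.
            with open(os.path.join(proc_root, entry, "wchan")) as fh:
                wchan = fh.read().strip() or "?"
        except OSError:
            wchan = "?"
        blocked.append((pid, comm, wchan))
    return blocked
```

Calling find_d_state() during a hang and saving the result somewhere off the affected mounts might give a hint about what the stuck php-fpm processes are waiting on.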

Netdata was also running. The only correlation I see is an increased number of TCP sockets and a higher number of IPv4 UDP errors, IPv6 packets and errors.
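
To compare socket counts before and during a hang without relying on netdata, a small sketch like the following could be used. It assumes the usual "NAME: key value key value" layout of /proc/net/sockstat on Linux. The function names are my own, not part of any tool mentioned here.

```python
def parse_sockstat(text):
    """Parse /proc/net/sockstat-style text into {protocol: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        proto, _, rest = line.partition(":")
        tokens = rest.split()
        # Tokens alternate between a counter name and its integer value.
        stats[proto.strip()] = {
            key: int(value) for key, value in zip(tokens[0::2], tokens[1::2])
        }
    return stats


def read_sockstat(path="/proc/net/sockstat"):
    """Read and parse the socket statistics file (the path is the Linux default)."""
    with open(path) as fh:
        return parse_sockstat(fh.read())
```

Logging read_sockstat()["TCP"] every minute to a file would show whether the TCP socket count really climbs before the load does.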

 

The issue is still very vague to me. Does anyone have any idea what this could be?

 

 

Edited by Rubene

The load values that Linux reports include IOWAIT time.

If your drives have issues or are very busy, this is reflected in the load and makes your system sluggish.

 

One cause can be that the folder "appdata" is located on the array instead of the cache and consequently all containers make heavy use of the array.
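
One way to check how much of the CPU time is actually going to I/O wait is to sample the aggregate "cpu" line of /proc/stat twice and compare. This is only a rough sketch with made-up function names. It assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # user, nice, system, idle, iowait, irq, softirq, steal
    # (guest columns are skipped because they are already counted in user/nice)
    return [int(value) for value in fields[1:9]]


def iowait_fraction(before, after):
    """Fraction of CPU time spent in iowait between two 'cpu' line samples."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    total = sum(deltas)
    # Index 4 is the iowait column.
    return deltas[4] / total if total > 0 else 0.0


def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as fh:
        return fh.readline()
```

Take one read_cpu_line() sample, wait a few seconds, take another, and pass both to iowait_fraction(). A value close to 0 while the load is high would suggest the blocked tasks are not waiting on ordinary disk I/O.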

 

7 minutes ago, bonienl said:

The load values that Linux reports include IOWAIT time.

If your drives have issues or are very busy, this is reflected in the load and makes your system sluggish.

 

One cause can be that the folder "appdata" is located on the array instead of the cache and consequently all containers make heavy use of the array.

 

 

Indeed, looks like it has something to do with IO. But I don't suspect the disks.

The appdata folder is located on the cache drive. The 2 hard drives were both asleep. Also the cache drive had barely any ops (see also the screenshot in my first post).

 

I suspect it has something to do with the network. That is where I see the most correlation, but I'm not entirely sure yet.

