Random Freeze / 6.12.10 / Docker Server Error + CLI unresponsive



Hello,

 

The following happened, in order:

1. My Nextcloud was not reachable anymore.

2. Restarting the Nextcloud container = Server Error.

3. Clicked Unraid GUI >> Tools to view the syslog; the page only loads the frame/menu, the rest stays empty and keeps loading.

4. The server still responds to ping.

5. The web terminal opens; I type "cd /var/us" <TAB>.

6. The cursor keeps blinking, no tab completion happens; CTRL+C or typing anything = nothing happens.

The clock shows 19:31 at this moment (the docker logs below are from iDRAC, which I opened around 19:40).

7. Opened iDRAC, where I am still logged in.

8. top shows nothing fancy: Plex is doing its thing, 1-8% CPU overall, the process list changes like normal, and iDRAC thermals read 36 °C.

9. The syslog (viewed in nano) shows a USB/HID device disconnecting 2 minutes after my freeze (timestamp 19:33). No clue, tbh; maybe the USB dongle of my wifi keyboard timed out, but that was 2 minutes after the freeze. Besides that, no entries for roughly 10 minutes before.

10. Opened docker.log in nano; the last entries are from 2 hours earlier, when I updated my containers.

11. Still in the iDRAC terminal, I tried to copy the logs by typing "cp /var/user/log/*.* /mnt/user/" <TAB> - same freeze as before in the web terminal: the cursor keeps blinking and no further input is recognized.

12. Had to hard-reset via iDRAC, and that is where the diagnostics come from.

 

I noticed another forum post which may be connected:

 

The server (Dell R730xd) is new / I just moved to it. A 15 h RAM check via the Unraid memtest: 3 passes, 0 errors.

Also, I just recreated docker.img and re-installed all containers from their templates, about 3 boots ago.

I had it running for about 2 days without issues and only racked it today.

:(

Screenshot 2024-05-20 194704.png

Screenshot 2024-05-20 193215.png

godzilla-diagnostics-20240520-1956.zip


Seemingly the exact same issue?!

 

Load average inside the Nextcloud LSIO docker: 32 and rising.
Load average outside in Unraid: 42 and rising.

- suspended processes
- killed the Docker daemon at an Unraid load average of 66+
- nginx root process with 70+ sub-processes waiting for I/O (see the D-state check after this list)
- php-fpm root process with 23 sub-processes waiting for I/O
- after the reboot I paused the parity check triggered by my cold boot
- iostat showed 2.5 TB on sda, sdb ... from the parity check, I guess
- any command touching the array freezes = opened a new tty*
- reboot hangs at "Forcing shutdown" for 15+ minutes
- oh, and no: diagnostics did not work, even after leaving it running for 1 h+ in one tty
- writing a file to /boot/ was no issue
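For reference, processes "waiting for I/O" sit in uninterruptible D state; this is a minimal, generic way to list them from any tty (nothing Unraid-specific assumed):

# List processes stuck in uninterruptible sleep (D state), i.e. blocked on I/O:
ps -eo state,pid,ppid,wchan:32,cmd | awk '$1 == "D"'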

 

A friend with way more Linux experience guided me through a lot of this - see the screenshots. Sorry, but I cannot copy text out of the iDRAC remote screen.

 

I am restarting now without any containers and will check tomorrow what the load average says when no dockers have been started.

nc load average 31 Screenshot 2024-05-20 231055.png

nginx php-fpm unraid Screenshot 2024-05-20 235807.png

Screenshot 2024-05-20 235252.png

Screenshot 2024-05-21 001600.png

Screenshot 2024-05-21 001917.png

Screenshot 2024-05-21 002539.png

unraid load average 66 Screenshot 2024-05-20 232232.png

3 hours ago, JorgeB said:

Enable the syslog server, and if it happens again, post that after a crash.

 

Since the whole array is not responding when this happens, I cannot point the syslog server to a share (which is the only option besides writing to the Unraid USB stick). I am currently trying to find a way to mirror /var/log/* to an unassigned SSD, which I can then pull out, connect to a Windows PC, and post whatever is needed.

Unfortunately, all the tutorials and posts I can find are outdated. Still searching for a way to mirror /var/log/* to the UD SSD.
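One possible approach, sketched below; it assumes the SSD is mounted at /mnt/disks/logssd (a placeholder - use your actual UD mount point) and could be started from the go file or the User Scripts plugin:

#!/bin/bash
# Mirror /var/log to an unassigned-devices SSD every 30 seconds.
# /mnt/disks/logssd is a placeholder; substitute your actual UD mount point.
DEST=/mnt/disks/logssd/varlog-mirror
mkdir -p "$DEST"
while true; do
    rsync -a --delete /var/log/ "$DEST"/    # -a preserves attributes, --delete keeps the mirror exact
    sleep 30
done

Since the UD disk sits outside the array, the mirror at least has a chance of surviving an array hang.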

Once I figure that out, I will start all containers again to force the issue and get updated logs.

Currently, Unraid is running fine with the Docker service enabled but no containers running.


To summarise many hours of searching:

 

The main issue is that the Nextcloud and SWAG containers keep opening new nginx processes when talking to each other; these all just wait and nothing happens = the load average rises.

 

In this syslog, Nextcloud stopped responding pretty much exactly at 21:45:00.

Is it normal that every time I start a Docker container, eth0 gets renamed to/from vethXYZ? Like:

May 22 21:16:00 godzilla kernel: eth0: renamed from veth2b5dd4d
May 22 21:16:00 godzilla kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethca560dc: link becomes ready
May 22 21:42:00 godzilla kernel: br-5d08daaed772: port 1(vethca560dc) entered disabled state
May 22 21:42:00 godzilla kernel: veth2b5dd4d: renamed from eth0
May 22 21:42:00 godzilla rsyslogd: cannot connect to 10.0.0.200:514: Connection refused [v8.2102.0 try https://www.rsyslog.com/e/2027 ]
May 22 21:42:00 godzilla kernel: br-5d08daaed772: port 1(vethca560dc) entered disabled state
May 22 21:42:00 godzilla kernel: device vethca560dc left promiscuous mode
May 22 21:42:00 godzilla kernel: br-5d08daaed772: port 1(vethca560dc) entered disabled state
May 22 21:42:01 godzilla kernel: br-5d08daaed772: port 1(vethdd22d09) entered blocking state
May 22 21:42:01 godzilla kernel: br-5d08daaed772: port 1(vethdd22d09) entered disabled state
May 22 21:42:01 godzilla kernel: device vethdd22d09 entered promiscuous mode
May 22 21:42:01 godzilla kernel: br-5d08daaed772: port 1(vethdd22d09) entered blocking state
May 22 21:42:01 godzilla kernel: br-5d08daaed772: port 1(vethdd22d09) entered forwarding state
May 22 21:42:01 godzilla kernel: eth0: renamed from veth777bb11
May 22 21:42:01 godzilla kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethdd22d09: link becomes ready
May 22 21:44:27 godzilla kernel: br-5d08daaed772: port 1(vethdd22d09) entered disabled state
May 22 21:44:27 godzilla rsyslogd: cannot connect to 10.0.0.200:514: Connection refused [v8.2102.0 try https://www.rsyslog.com/e/2027 ]
May 22 21:44:27 godzilla kernel: veth777bb11: renamed from eth0
May 22 21:44:27 godzilla kernel: br-5d08daaed772: port 1(vethdd22d09) entered disabled state
May 22 21:44:27 godzilla kernel: device vethdd22d09 left promiscuous mode
May 22 21:44:27 godzilla kernel: br-5d08daaed772: port 1(vethdd22d09) entered disabled state
May 22 21:45:39 godzilla kernel: br-5d08daaed772: port 1(veth47ec6af) entered blocking state
May 22 21:45:39 godzilla kernel: br-5d08daaed772: port 1(veth47ec6af) entered disabled state
May 22 21:45:39 godzilla kernel: device veth47ec6af entered promiscuous mode
May 22 21:45:39 godzilla rsyslogd: cannot connect to 10.0.0.200:514: Connection refused [v8.2102.0 try https://www.rsyslog.com/e/2027 ]
May 22 21:45:39 godzilla kernel: eth0: renamed from veth7c4d745
May 22 21:45:39 godzilla kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth47ec6af: link becomes ready
May 22 21:45:39 godzilla kernel: br-5d08daaed772: port 1(veth47ec6af) entered blocking state
May 22 21:45:39 godzilla kernel: br-5d08daaed772: port 1(veth47ec6af) entered forwarding state
May 22 21:45:53 godzilla kernel: br-5d08daaed772: port 2(veth62fab02) entered blocking state
May 22 21:45:53 godzilla kernel: br-5d08daaed772: port 2(veth62fab02) entered disabled state
May 22 21:45:53 godzilla kernel: device veth62fab02 entered promiscuous mode
May 22 21:45:53 godzilla kernel: eth0: renamed from veth018449d
May 22 21:45:53 godzilla kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth62fab02: link becomes ready
May 22 21:45:53 godzilla kernel: br-5d08daaed772: port 2(veth62fab02) entered blocking state
May 22 21:45:53 godzilla kernel: br-5d08daaed772: port 2(veth62fab02) entered forwarding state

 

What I found so far:

1. The first issue I found: the nginx worker_processes setting inside the container's appdata was set to 56 - deleted, it is now 12 (6C/12T). See the checks sketched after this list.
2. nginx processes wait indefinitely inside the Nextcloud and SWAG containers once Nextcloud is started and used.
3. Nextcloud's busybox crond is waiting indefinitely - I still don't know why.

4. Nextcloud and SWAG have a huge number of established connections.
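Two quick host-side checks for points 1 and 4; the appdata path is only an example - adjust it to wherever your template stores the config:

# Check the nginx worker count in the container's appdata (path is an example):
grep -n "worker_processes" /mnt/user/appdata/swag/nginx/nginx.conf
# For a 6C/12T machine, "worker_processes 12;" (or "auto") is a sane value.

# Rough count of established TCP connections on the host (output includes a header line):
ss -tan state established | wc -l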

 

I have a huge issue between NC and SWAG, yes - but what makes my head hurt the most is:

I still don't see any reason for the array to stop responding, not even over a screen TTY.
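If it hangs again, one generic way to see what is actually blocking - assuming the kernel's magic SysRq interface is enabled (check /proc/sys/kernel/sysrq) - is to dump all blocked tasks into the kernel log:

# Write all tasks in uninterruptible (blocked) state to the kernel log:
echo w > /proc/sysrq-trigger
# Then read the traces:
dmesg | tail -n 100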

 

I am currently trying to configure Nextcloud as close to default as possible and will try again. But this all feels so bad, tbh :(

Screenshot 2024-05-22 215142.png

9 hours ago, gloeckle said:

Is it normal that every time I start a Docker container, eth0 gets renamed to/from vethXYZ? Like:

Yes; it's just not normal if a container keeps doing that continuously - it suggests the container is constantly restarting. You can check the uptime to confirm.
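A quick way to check, sketched here with a hypothetical container name:

# Uptime per container; one that keeps restarting shows "Up X seconds" over and over:
docker ps --format 'table {{.Names}}\t{{.Status}}'
# Restart count and last start time for a single container ("nextcloud" is an example name):
docker inspect --format '{{.RestartCount}} restarts, started {{.State.StartedAt}}' nextcloud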

 

9 hours ago, gloeckle said:

I have a huge issue between NC and SWAG, yes - but what makes my head hurt the most is:

I still don't see any reason for the array to stop responding, not even over a screen TTY.

While container issues in theory should not crash a server, it's not an uncommon problem; Plex comes to mind - it's known to have crashed a lot of servers.

Link to comment
