WebGUI and SSH broke, restarted when I created diagnostics

OddMagnet · March 20

I've had the WebGUI and SSH break one or two times before already and simply rebooted.

This time I figured I might as well connect my keyboard and monitor and create a diagnostics file.

Weirdly enough, right when I pressed enter to start the diagnostics tool, the WebGUI and SSH started working again.

I checked the logs, and this seems to be the timeline of what happened (unless I'm misinterpreting something):

20:11 → WebGUI and SSH stopped working

20:20 → connected Keyboard

20:23 → connected Monitor

20:31 → started diagnostics

Mar 20 17:05:06 OddNas sshd[8859]: Starting session: shell on pts/0 for root from 192.168.178.80 port 50226 id 0
Mar 20 20:11:15 OddNas kernel: e1000e 0000:00:1f.6 eth0: NIC Link is Down
Mar 20 20:11:15 OddNas kernel: br0: port 1(eth0) entered disabled state
Mar 20 20:11:18 OddNas ntpd[1127]: Deleting interface #1 br0, 192.168.178.200#123, interface stats: received=7185, sent=7186, dropped=0, active_time=1387370 secs
Mar 20 20:11:18 OddNas ntpd[1127]: 216.239.35.12 local addr 192.168.178.200 -> 
Mar 20 20:11:18 OddNas ntpd[1127]: 216.239.35.8 local addr 192.168.178.200 -> 
Mar 20 20:11:18 OddNas ntpd[1127]: 216.239.35.4 local addr 192.168.178.200 -> 
Mar 20 20:11:18 OddNas ntpd[1127]: 216.239.35.0 local addr 192.168.178.200 -> 
Mar 20 20:20:51 OddNas kernel: usb 1-1: new full-speed USB device number 4 using xhci_hcd
Mar 20 20:20:51 OddNas kernel: input: Corsair Corsair Gaming K55 RGB Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.0/0003:1B1C:1B3D.0002/input/input5
Mar 20 20:20:51 OddNas kernel: hid-generic 0003:1B1C:1B3D.0002: input,hidraw0: USB HID v1.11 Keyboard [Corsair Corsair Gaming K55 RGB Keyboard] on usb-0000:00:14.0-1/input0
Mar 20 20:20:51 OddNas kernel: input: Corsair Corsair Gaming K55 RGB Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.1/0003:1B1C:1B3D.0003/input/input6
Mar 20 20:20:51 OddNas kernel: input: Corsair Corsair Gaming K55 RGB Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.1/0003:1B1C:1B3D.0003/input/input7
Mar 20 20:20:51 OddNas kernel: input: Corsair Corsair Gaming K55 RGB Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.1/0003:1B1C:1B3D.0003/input/input8
Mar 20 20:20:51 OddNas kernel: hid-generic 0003:1B1C:1B3D.0003: input,hiddev96,hidraw1: USB HID v1.11 Keyboard [Corsair Corsair Gaming K55 RGB Keyboard] on usb-0000:00:14.0-1/input1
Mar 20 20:20:51 OddNas kernel: hid-generic 0003:1B1C:1B3D.0004: hiddev97,hidraw2: USB HID v1.11 Device [Corsair Corsair Gaming K55 RGB Keyboard] on usb-0000:00:14.0-1/input2
Mar 20 20:23:11 OddNas kernel: fbcon: i915drmfb (fb0) is primary device
Mar 20 20:23:11 OddNas kernel: Console: switching to colour frame buffer device 320x90
Mar 20 20:23:11 OddNas kernel: i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
Mar 20 20:23:33 OddNas login: pam_unix(login:auth): authentication failure; logname=LOGIN uid=0 euid=0 tty=/dev/tty1 ruser= rhost=  user=root
Mar 20 20:23:36 OddNas login: FAILED LOGIN 1 FROM tty1 FOR root, Authentication failure
Mar 20 20:24:04 OddNas login: pam_unix(login:session): session opened for user root(uid=0) by LOGIN(uid=0)
Mar 20 20:24:04 OddNas login: ROOT LOGIN ON tty1
Mar 20 20:26:46 OddNas sshd[8859]: Read error from remote host 192.168.178.80 port 50226: No route to host
Mar 20 20:26:46 OddNas sshd[8859]: pam_unix(sshd:session): session closed for user root
Mar 20 20:27:17 OddNas root: ACPI action up is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:19 OddNas root: ACPI action left is not defined
Mar 20 20:27:22 OddNas root: ACPI action up is not defined
Mar 20 20:27:22 OddNas root: ACPI action down is not defined
Mar 20 20:31:39 OddNas kernel: e1000e 0000:00:1f.6 eth0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Mar 20 20:31:39 OddNas kernel: br0: port 1(eth0) entered blocking state
Mar 20 20:31:39 OddNas kernel: br0: port 1(eth0) entered forwarding state
Mar 20 20:31:41 OddNas ntpd[1127]: Listen normally on 2 br0 192.168.178.200:123
Mar 20 20:31:41 OddNas ntpd[1127]: new interface(s) found: waking up resolver
Mar 20 20:31:45 OddNas emhttpd: read SMART /dev/sdb
Mar 20 20:32:19 OddNas sshd[13939]: Connection from 192.168.178.80 port 54474 on 192.168.178.200 port 22 rdomain ""
Mar 20 20:32:19 OddNas sshd[13939]: Failed publickey for root from 192.168.178.80 port 54474 ssh2: ED25519 SHA256:GInj1AiL72FiRCs+JM71XjgEElxEiNHUoR508RkQS3g
Mar 20 20:32:19 OddNas sshd[13939]: Postponed keyboard-interactive for root from 192.168.178.80 port 54474 ssh2 [preauth]
Mar 20 20:32:22 OddNas sshd[13939]: Postponed keyboard-interactive/pam for root from 192.168.178.80 port 54474 ssh2 [preauth]
Mar 20 20:32:22 OddNas sshd[13939]: Accepted keyboard-interactive/pam for root from 192.168.178.80 port 54474 ssh2
Mar 20 20:32:22 OddNas sshd[13939]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Mar 20 20:32:22 OddNas sshd[13939]: Starting session: shell on pts/0 for root from 192.168.178.80 port 54474 id 0

Full diagnostics are attached

oddnas-diagnostics-20240320-2031.zip

OddMagnet · March 20

Looking back, there also has been a lot of this, though I'm not sure if it's related:

Mar 18 09:58:46 OddNas kernel: veth5e5dac8: renamed from eth0
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth3e692a8) entered disabled state
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth3e692a8) entered disabled state
Mar 18 09:58:46 OddNas kernel: device veth3e692a8 left promiscuous mode
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth3e692a8) entered disabled state
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered blocking state
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered disabled state
Mar 18 09:58:46 OddNas kernel: device veth7260a7f entered promiscuous mode
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered blocking state
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered forwarding state
Mar 18 09:58:47 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered disabled state
Mar 18 09:58:48 OddNas kernel: eth0: renamed from veth190d797
Mar 18 09:58:48 OddNas kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth7260a7f: link becomes ready
Mar 18 09:58:48 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered blocking state
Mar 18 09:58:48 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered forwarding state

Some more stuff that might be noteworthy:

Looking at ifconfig.txt, there are a lot of vethsomething interfaces, no idea why
In the same file, aside from the above-mentioned interfaces, only some br-... interface was up

Edited March 21 by OddMagnet

JorgeB · March 21

Nothing obvious that I can see in the logs.

13 hours ago, OddMagnet said:

Looking back, there also has been a lot of this,

That could be a container constantly restarting; check the uptime of all the containers.
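As a rough sketch of that uptime check from the command line: `docker ps --format` with the `{{.Names}}` and `{{.Status}}` placeholders is standard Docker CLI, while the helper name `flag_recent` and the one-hour cutoff are assumptions made for illustration. It relies on the wording Docker normally uses in the status column ("Up 42 seconds", "Restarting (1) ...").

```shell
# flag_recent: reads "name<TAB>status" lines on stdin and prints the
# containers that are restarting or have been up for less than an hour.
# (Hypothetical helper; the patterns mirror typical `docker ps` status text.)
flag_recent() {
  awk -F '\t' '
    $2 ~ /^(Restarting|Up (Less than a second|About a minute|[0-9]+ (second|minute)s?))/ {
      print $1 ": " $2
    }'
}
```

Usage would be something like `docker ps --format '{{.Names}}\t{{.Status}}' | flag_recent`. No output means every running container has been up for at least an hour.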

OddMagnet · March 21

Containers all had 100% uptime (other than intentional restarts - e.g. when I changed things in the compose file)

JorgeB · March 21

Then you can ignore it.

OddMagnet · March 21

Still looks like a problem to me though.

To me it seems that WebGUI and SSH broke because of the interfaces being down (aside from the br-... and veth... interfaces)

It's hard to believe that the spam I mentioned in my second post is completely unrelated to the interfaces going down.

I don't understand why there are so many veth... interfaces in the first place, why the br-... interfaces' ports are spamming my log so much, or why there are entries about IPv6 when that's not even enabled in my settings.

OddMagnet · March 28

Still looking for help with this problem.

I'd really like to solve the interface spam in my logs, which I believe is the root cause of my problem

JorgeB · March 29

If none of the containers is restarting, you can disable one at a time to see if you find the culprit.

OddMagnet · March 31

So it happened again, or at least the "soft version" where all my containers restart after the interface spam.

This seems to mostly happen on Sunday mornings.

I've looked over all my container logs, but not one of them had anything in it that would indicate being a problem.

Additionally I've checked all my container settings, but none of them are configured to do anything at around that time.

It doesn't look like it's a docker problem to me.

I'll try disabling some of my containers over the coming weeks and months, but it's gonna be a very tedious process of narrowing it down (if they're the problem)

In the meantime, is there anything else I could check? I'm still confused why there is any mention of IPv6 in my logs, when I don't even have it enabled.

OddMagnet · April 3

So it happened again, guess it's not a Sunday only thing then.

Again no container restarts, nothing in the container or docker logs that would indicate any problems

OddMagnet · April 3

Had a look at the syslog, and this time it seems to be different: no interface spam before eth0 goes down.

(I don't think it previously even had an entry about eth0 going down)

@JorgeB can you take another look at the new diagnostics?

oddnas-diagnostics-20240403-0729.zip

JorgeB · April 3

Apr  3 01:34:58 OddNas kernel: e1000e 0000:00:1f.6 eth0: NIC Link is Down

Yep, this would mean the NIC actually lost the physical link, could be an issue with the NIC, cable, switch, etc.
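To see how often the physical link drops and for how long, the link messages can be pulled out of the syslog. This is only a sketch: the helper name `link_events` is made up, and it assumes the "NIC Link is Up/Down" wording that the e1000e driver prints in the excerpts above (other drivers word it differently).

```shell
# link_events IFACE: reads syslog text on stdin and prints one line per
# link state change for IFACE, as "Mon DD HH:MM:SS UP|DOWN".
# (Hypothetical helper; matches the e1000e message format seen in this thread.)
link_events() {
  awk -v ifc="$1" '
    index($0, " " ifc ": NIC Link is ") {
      state = ($0 ~ /NIC Link is Down/) ? "DOWN" : "UP"
      print $1, $2, $3, state
    }'
}
```

Assuming the live log is at /var/log/syslog, `link_events eth0 < /var/log/syslog` would list each drop and recovery in order.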

OddMagnet · April 3

That'd suck a lot.

I guess that'd explain the drops for eth0 as well? Though it's weird there are only drops for receive

Errors info   Receive counters   Transmit counters
eth0          Errors: 0          Errors: 0
              Drops: 19739       Drops: 0
              Overruns: 0        Overruns: 0

Any suggestions on how I'd be able to verify what parts of the chain are good and where the problem actually is?

JorgeB · April 3

Try a different cable and switch/router port; if the result is the same, it could be the NIC.
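When swapping parts one at a time, it can help to have a number to compare before and after. Linux exposes a per-interface counter of link state changes in sysfs (`carrier_changes`). The sketch below just reads it; the helper name `link_flaps` and its optional second argument are assumptions added so the function can be pointed at a different directory.

```shell
# link_flaps IFACE [SYSFS_NET_DIR]: prints how many times the carrier
# (physical link) of IFACE has changed state since boot.
# (Hypothetical helper; SYSFS_NET_DIR defaults to /sys/class/net.)
link_flaps() {
  cat "${2:-/sys/class/net}/$1/carrier_changes"
}
```

For example, note the value of `link_flaps eth0` after changing the cable and check it again a week later. If the number has grown while nothing was unplugged, the link is still dropping.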

OddMagnet · April 3

Changed the cable and router port, hopefully it's not gonna happen again.

Everything (Router, Cable, Server) is brand-new, but I'd much rather have the cable be the culprit, lol.

What's weird though, is that eth0 always goes up again when I connect a KB+Monitor, log in and start the diagnostics command.

(and obviously losing the physical link when nothing is moving is extremely weird in the first place)

OddMagnet · April 21

So I've set up a user script to start logging docker events on Sundays, one hour before the interface spam happens.

Looking at the log, I can see a few things:

container exec_create → exec_start → exec_die happens a lot. After a quick Google search I learned that those events also happen for healthchecks, which is the case in my logs
a lot of container kill / die / stop happening after the interface spam started
also a lot of network disconnect for my docker bridge network

I've edited the logs to remove all the healthchecks and attached them here.

I don't think there are any pointers in there, but if anyone is bored enough to take a look at it I'd appreciate it.

Additionally I've checked docker compose logs (with --since and --until), but there's not much to see before the time of the problem.

(It doesn't help that not all log lines contain times and the containers don't all use the same formatting for times...)

For now I'll set up another script to catch the docker compose logs and hopefully that'll be more helpful next week.

docker-events-edited.log
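The healthcheck filtering described above could also be done automatically rather than by hand. A minimal sketch, assuming the default text output of `docker events`, where each line contains the event type followed by the action (for example "container exec_create: ..."). The helper name `drop_exec_events` is hypothetical, and it also drops `health_status` events, which is an addition not mentioned in the post.

```shell
# drop_exec_events: reads `docker events` text output on stdin and removes
# the exec_create / exec_start / exec_die and health_status lines that
# healthchecks generate. Note that this also hides manual `docker exec` calls.
drop_exec_events() {
  grep -Ev ' container (exec_(create|start|die)|health_status)'
}
```

A possible invocation for one Sunday morning: `docker events --since '2024-04-21T07:00:00' --until '2024-04-21T09:00:00' | drop_exec_events`. With `--until` set, `docker events` exits once the window has been printed.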

OddMagnet · April 28

Again, the interface spam happened at 8am on Sunday.

I've tried getting logs from docker via this script:

#!/bin/bash

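# Run from the directory that holds the compose file; stop if it is missing
cd /mnt/user/appdata || exit 1
# Follow all services' logs with timestamps for up to 2 hours
timeout 2h docker compose logs -f -t --since=1s > /mnt/user/data/docker-compose.log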

However, this didn't record properly for the full duration:

Script Starting Apr 28, 2024  07:00.01

Full logs for this script are available at /tmp/user.scripts/tmpScripts/Get Docker Compose Logs/log.txt

error from daemon in stream: Error grabbing logs: unexpected EOF

Script Finished Apr 28, 2024  07:01.21

Full logs for this script are available at /tmp/user.scripts/tmpScripts/Get Docker Compose Logs/log.txt

I don't think it's a problem with the script, since it recorded for over a minute.

The last line from the logs doesn't seem to be the problem either:

autobrr                  | 2024-04-28T05:01:21.043895040Z {"level":"debug","module":"filter","method":"CheckFilter","time":"2024-04-28T07:01:21+02:00","message":"(Cross-Seed) external filter check not matching what filter wanted"}
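Since the log follower dies on the `unexpected EOF` long before the two hours are up, one possible workaround is to restart it in a loop until a deadline. This is a sketch only: `follow_until` is a made-up helper, it relies on `date +%s` and the coreutils `timeout` that the script above already uses, and a restart may duplicate or miss a few lines around the gap.

```shell
# follow_until DEADLINE_EPOCH CMD [ARGS...]: runs CMD repeatedly until the
# given Unix time is reached, restarting it whenever it exits early.
# (Hypothetical helper; each attempt is capped at the time remaining.)
follow_until() {
  deadline=$1
  shift
  while now=$(date +%s); [ "$now" -lt "$deadline" ]; do
    # A non-zero exit (a crash or the timeout itself) just triggers a re-check.
    timeout "$(( deadline - now ))" "$@" || true
    sleep 1
  done
}
```

In the script above, the last line would then become something like `follow_until "$(( $(date +%s) + 7200 ))" docker compose logs -f -t --since=1s >> /mnt/user/data/docker-compose.log`, with `>>` so that each restart appends to the same file.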

OddMagnet · May 5

Same thing again today, Docker logs got an unexpected EOF again as well.

Guess I'll have to go with the slow and painful approach of disabling some containers and checking if it still happens.

OddMagnet · May 19

So, last weekend I stopped half of my docker containers, and the problem still occurred.

This weekend, I stopped the other half, and it still occurred.

It seems the docker services are not the problem. I still need help with this.

JorgeB · May 19

Did you try a different NIC?

OddMagnet · May 20

Not yet. Since it always happens at exactly the same time (at 8:00am sharp) on Sundays I highly doubt that the hardware is at fault here

JorgeB · May 20

Yes, that's very suspect, suggesting more a container or plugin issue, or the router if this does something at that time.
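Something that fires at exactly 08:00 every Sunday often turns out to be a scheduled job, so one thing worth trying is to search the cron tables for entries at that time. The sketch below is an assumption-laden filter, not a complete cron parser: `sunday_8am_jobs` is a hypothetical helper that only matches literal fields (minute 0, hour 8, day-of-week 0, 7 or "sun"), so ranges, lists and daily jobs are not caught.

```shell
# sunday_8am_jobs: reads crontab lines on stdin and prints those scheduled
# for minute 0, hour 8 on Sundays. Commented-out lines never match because
# their first field is not a plain "0".
sunday_8am_jobs() {
  awk '
    $1 == "0" && ($2 == "8" || $2 == "08") &&
    ($5 == "0" || $5 == "7" || tolower($5) == "sun") { print }
  '
}
```

It could be fed from `crontab -l | sunday_8am_jobs` or `cat /etc/cron.d/* | sunday_8am_jobs`. Where the server keeps its plugin cron entries is an assumption to verify.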

OddMagnet · May 22

I don't think it's the containers, since I had disabled half of them last weekend and the other half the weekend before that and both times it still happened.

My router doesn't really do anything afaik.

I'll try disabling plugins for next Sunday.

OddMagnet · Sunday at 09:12 AM

Took me a week longer, but disabling the Plugins helped. The interface spam didn't happen either.

(I used the Dynamix Safe Mode Plugin to disable other plugins)

I'm not sure if there's a way for me to disable only certain plugins, so I could narrow down the culprit.

JorgeB · Sunday at 09:31 AM

You can temporarily rename the *.plg files and reboot; those plugins then won't be loaded. Try a few at a time.
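The renaming could be scripted so that it is easy to undo. This is a minimal sketch: the helper names and the `.disabled` suffix are arbitrary choices, and the plugin directory is passed in as an argument because its location (commonly /boot/config/plugins on the flash drive) is an assumption to check on the actual system.

```shell
# disable_plugins DIR NAME...: renames DIR/NAME.plg to DIR/NAME.plg.disabled
# so the plugin is skipped on the next boot. Missing files are ignored.
disable_plugins() {
  dir=$1
  shift
  for name in "$@"; do
    if [ -f "$dir/$name.plg" ]; then
      mv -- "$dir/$name.plg" "$dir/$name.plg.disabled"
    fi
  done
}

# enable_plugins DIR NAME...: reverses disable_plugins.
enable_plugins() {
  dir=$1
  shift
  for name in "$@"; do
    if [ -f "$dir/$name.plg.disabled" ]; then
      mv -- "$dir/$name.plg.disabled" "$dir/$name.plg"
    fi
  done
}
```

For example, `disable_plugins /boot/config/plugins first.plugin second.plugin` followed by a reboot, where the two names are placeholders for real plugin file names without the `.plg` extension. Disabling half of the remaining suspects each week narrows things down fastest.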
