WebGUI and SSH broke, restarted when I created diagnostics



I've had the WebGUI and SSH break one or two times before already and simply rebooted.

This time I figured I might as well connect my keyboard and monitor and create a diagnostics file.

Weirdly enough, right when I pressed enter to start the diagnostics tool, the WebGUI and SSH started working again.

 

I checked the logs, and this seems to be the timeline of what happened (unless I'm misinterpreting something):

20:11 → WebGUI and SSH stopped working

20:20 → connected Keyboard

20:23 → connected Monitor

20:31 → started diagnostics

Mar 20 17:05:06 OddNas sshd[8859]: Starting session: shell on pts/0 for root from 192.168.178.80 port 50226 id 0
Mar 20 20:11:15 OddNas kernel: e1000e 0000:00:1f.6 eth0: NIC Link is Down
Mar 20 20:11:15 OddNas kernel: br0: port 1(eth0) entered disabled state
Mar 20 20:11:18 OddNas ntpd[1127]: Deleting interface #1 br0, 192.168.178.200#123, interface stats: received=7185, sent=7186, dropped=0, active_time=1387370 secs
Mar 20 20:11:18 OddNas ntpd[1127]: 216.239.35.12 local addr 192.168.178.200 -> 
Mar 20 20:11:18 OddNas ntpd[1127]: 216.239.35.8 local addr 192.168.178.200 -> 
Mar 20 20:11:18 OddNas ntpd[1127]: 216.239.35.4 local addr 192.168.178.200 -> 
Mar 20 20:11:18 OddNas ntpd[1127]: 216.239.35.0 local addr 192.168.178.200 -> 
Mar 20 20:20:51 OddNas kernel: usb 1-1: new full-speed USB device number 4 using xhci_hcd
Mar 20 20:20:51 OddNas kernel: input: Corsair Corsair Gaming K55 RGB Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.0/0003:1B1C:1B3D.0002/input/input5
Mar 20 20:20:51 OddNas kernel: hid-generic 0003:1B1C:1B3D.0002: input,hidraw0: USB HID v1.11 Keyboard [Corsair Corsair Gaming K55 RGB Keyboard] on usb-0000:00:14.0-1/input0
Mar 20 20:20:51 OddNas kernel: input: Corsair Corsair Gaming K55 RGB Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.1/0003:1B1C:1B3D.0003/input/input6
Mar 20 20:20:51 OddNas kernel: input: Corsair Corsair Gaming K55 RGB Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.1/0003:1B1C:1B3D.0003/input/input7
Mar 20 20:20:51 OddNas kernel: input: Corsair Corsair Gaming K55 RGB Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.1/0003:1B1C:1B3D.0003/input/input8
Mar 20 20:20:51 OddNas kernel: hid-generic 0003:1B1C:1B3D.0003: input,hiddev96,hidraw1: USB HID v1.11 Keyboard [Corsair Corsair Gaming K55 RGB Keyboard] on usb-0000:00:14.0-1/input1
Mar 20 20:20:51 OddNas kernel: hid-generic 0003:1B1C:1B3D.0004: hiddev97,hidraw2: USB HID v1.11 Device [Corsair Corsair Gaming K55 RGB Keyboard] on usb-0000:00:14.0-1/input2
Mar 20 20:23:11 OddNas kernel: fbcon: i915drmfb (fb0) is primary device
Mar 20 20:23:11 OddNas kernel: Console: switching to colour frame buffer device 320x90
Mar 20 20:23:11 OddNas kernel: i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
Mar 20 20:23:33 OddNas login: pam_unix(login:auth): authentication failure; logname=LOGIN uid=0 euid=0 tty=/dev/tty1 ruser= rhost=  user=root
Mar 20 20:23:36 OddNas login: FAILED LOGIN 1 FROM tty1 FOR root, Authentication failure
Mar 20 20:24:04 OddNas login: pam_unix(login:session): session opened for user root(uid=0) by LOGIN(uid=0)
Mar 20 20:24:04 OddNas login: ROOT LOGIN ON tty1
Mar 20 20:26:46 OddNas sshd[8859]: Read error from remote host 192.168.178.80 port 50226: No route to host
Mar 20 20:26:46 OddNas sshd[8859]: pam_unix(sshd:session): session closed for user root
Mar 20 20:27:17 OddNas root: ACPI action up is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:19 OddNas root: ACPI action left is not defined
Mar 20 20:27:22 OddNas root: ACPI action up is not defined
Mar 20 20:27:22 OddNas root: ACPI action down is not defined
Mar 20 20:31:39 OddNas kernel: e1000e 0000:00:1f.6 eth0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Mar 20 20:31:39 OddNas kernel: br0: port 1(eth0) entered blocking state
Mar 20 20:31:39 OddNas kernel: br0: port 1(eth0) entered forwarding state
Mar 20 20:31:41 OddNas ntpd[1127]: Listen normally on 2 br0 192.168.178.200:123
Mar 20 20:31:41 OddNas ntpd[1127]: new interface(s) found: waking up resolver
Mar 20 20:31:45 OddNas emhttpd: read SMART /dev/sdb
Mar 20 20:32:19 OddNas sshd[13939]: Connection from 192.168.178.80 port 54474 on 192.168.178.200 port 22 rdomain ""
Mar 20 20:32:19 OddNas sshd[13939]: Failed publickey for root from 192.168.178.80 port 54474 ssh2: ED25519 SHA256:GInj1AiL72FiRCs+JM71XjgEElxEiNHUoR508RkQS3g
Mar 20 20:32:19 OddNas sshd[13939]: Postponed keyboard-interactive for root from 192.168.178.80 port 54474 ssh2 [preauth]
Mar 20 20:32:22 OddNas sshd[13939]: Postponed keyboard-interactive/pam for root from 192.168.178.80 port 54474 ssh2 [preauth]
Mar 20 20:32:22 OddNas sshd[13939]: Accepted keyboard-interactive/pam for root from 192.168.178.80 port 54474 ssh2
Mar 20 20:32:22 OddNas sshd[13939]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Mar 20 20:32:22 OddNas sshd[13939]: Starting session: shell on pts/0 for root from 192.168.178.80 port 54474 id 0
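
To double-check the timeline against the log, here is a minimal Python sketch that pairs each "NIC Link is Down" line with the next "NIC Link is Up" line for the same interface and reports how long the link was gone. The function name `link_outages` and the regex are my own. Syslog lines carry no year, so 2024 is assumed from the diagnostics filename, and month names are parsed assuming an English locale.

```python
import re
from datetime import datetime

# Matches kernel lines such as:
# "Mar 20 20:11:15 OddNas kernel: e1000e 0000:00:1f.6 eth0: NIC Link is Down"
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:.*?"
    r"(?P<iface>\S+): NIC Link is (?P<state>Up|Down)"
)

def link_outages(lines, year=2024):
    """Return (iface, down_time, up_time, duration) for each Down -> Up pair.

    The year is an assumption, because syslog timestamps do not include one.
    """
    down_since = {}
    outages = []
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        iface = m["iface"]
        if m["state"] == "Down":
            # Keep the first Down if several appear before an Up.
            down_since.setdefault(iface, ts)
        elif iface in down_since:
            start = down_since.pop(iface)
            outages.append((iface, start, ts, ts - start))
    return outages

# Three lines copied from the log above.
SAMPLE = [
    "Mar 20 20:11:15 OddNas kernel: e1000e 0000:00:1f.6 eth0: NIC Link is Down",
    "Mar 20 20:11:15 OddNas kernel: br0: port 1(eth0) entered disabled state",
    "Mar 20 20:31:39 OddNas kernel: e1000e 0000:00:1f.6 eth0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx",
]

if __name__ == "__main__":
    for iface, down, up, duration in link_outages(SAMPLE):
        print(f"{iface} down for {duration}")  # eth0 down for 0:20:24
```

For the excerpt above this gives a single outage of 20 minutes and 24 seconds on eth0, which matches the 20:11 to 20:31 window.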

 

Full diagnostics are attached

oddnas-diagnostics-20240320-2031.zip


Looking back, there also has been a lot of this, though I'm not sure if it's related:

 

Mar 18 09:58:46 OddNas kernel: veth5e5dac8: renamed from eth0
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth3e692a8) entered disabled state
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth3e692a8) entered disabled state
Mar 18 09:58:46 OddNas kernel: device veth3e692a8 left promiscuous mode
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth3e692a8) entered disabled state
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered blocking state
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered disabled state
Mar 18 09:58:46 OddNas kernel: device veth7260a7f entered promiscuous mode
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered blocking state
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered forwarding state
Mar 18 09:58:47 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered disabled state
Mar 18 09:58:48 OddNas kernel: eth0: renamed from veth190d797
Mar 18 09:58:48 OddNas kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth7260a7f: link becomes ready
Mar 18 09:58:48 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered blocking state
Mar 18 09:58:48 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered forwarding state

 

 

Some more stuff that might be noteworthy:

  1. Looking at ifconfig.txt, there are a lot of veth... interfaces, no idea why
  2. In the same file, aside from the above-mentioned interfaces, only some br-... interface was up
Edited by OddMagnet

Still looks like a problem to me though.

 

To me it seems that the WebGUI and SSH broke because the interfaces went down (aside from the br-... and veth... interfaces).

It's hard to believe that the spam I mentioned in my second post is completely unrelated to the interfaces going down.

I don't understand why there are so many veth... interfaces in the first place, why the br-... interfaces' ports are spamming my log so much, or why there are entries about IPv6 when that's not even enabled in my settings.


So it happened again, or at least the "soft version" where all my containers restart after the interface spam.

This seems to mostly happen on Sunday mornings.

 

I've looked over all my container logs, but none of them contained anything that would point to a problem.

Additionally I've checked all my container settings, but none of them are configured to do anything at around that time.

 

It doesn't look like it's a docker problem to me.

 

I'll try disabling some of my containers over the coming weeks and months, but it's gonna be a very tedious process of narrowing it down (if they're the problem)

 

In the meantime, is there anything else I could check? I'm still confused why there is any mention of IPv6 in my logs, when I don't even have it enabled.


That'd suck a lot.

I guess that'd explain the drops for eth0 as well? Though it's weird that there are only drops on the receive side.

 

Errors info    Receive counters    Transmit counters
eth0           Errors: 0           Errors: 0
               Drops: 19739        Drops: 0
               Overruns: 0         Overruns: 0

 

Any suggestions on how I'd be able to verify what parts of the chain are good and where the problem actually is?
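
One cheap check is to see whether the receive drop counter is still climbing or is just a leftover from earlier, for example before and after swapping a cable. Below is a rough Python sketch that reads the per-interface counters Linux exposes under /sys/class/net/<iface>/statistics/ and compares two snapshots. The helper names `read_counters` and `counter_deltas` are made up for illustration, and this only tells you whether the numbers move, not which part of the chain is at fault.

```python
from pathlib import Path

# Counter files found under /sys/class/net/<iface>/statistics/ on Linux.
COUNTERS = ("rx_errors", "rx_dropped", "tx_errors", "tx_dropped")

def read_counters(iface="eth0", root="/sys/class/net"):
    """Read the current error/drop counters for one interface from sysfs."""
    stats_dir = Path(root) / iface / "statistics"
    return {name: int((stats_dir / name).read_text()) for name in COUNTERS}

def counter_deltas(before, after):
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before[name] for name in before}
```

Taking one snapshot with `read_counters()`, waiting a few hours and passing both snapshots to `counter_deltas()` shows whether `rx_dropped` is still growing. The `root` parameter is only there so the functions can be pointed at a test directory.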


Changed the cable and router port, hopefully it's not gonna happen again.

 

Everything (Router, Cable, Server) is brand-new, but I'd much rather have the cable be the culprit, lol.

 

What's weird, though, is that eth0 always comes back up when I connect a KB+Monitor, log in and start the diagnostics command.

(and obviously losing the physical link when nothing is moving is extremely weird in the first place)

  • 3 weeks later...

So I've set up a user script to start logging docker events on Sundays, one hour before the interface spam happens.

 

Looking at the log, I can see a few things:

  • container exec_create → exec_start → exec_die happens a lot. After a quick Google search I learned that these events are also emitted for healthchecks, which is the case in my logs
  • a lot of container kill / die / stop events happen after the interface spam starts
  • also a lot of network disconnect for my docker bridge network

 

I've edited the log to remove all the healthcheck events and attached it here.
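
For anyone who wants to do the same filtering, this is roughly how it can be done in Python. It is a sketch that assumes the default `docker events` text output, where each line looks like `<timestamp> container exec_create: <command> <container id> (attributes)`. The function names and file arguments are made up for the example. Note that it drops every exec event, not only those caused by healthchecks, so a manual `docker exec` would be filtered out too.

```python
# Actions that show up for healthchecks (and for any other "docker exec").
EXEC_ACTIONS = {"exec_create", "exec_start", "exec_die"}

def is_exec_event(line):
    """True if the line is a container exec_* event.

    Assumes the default text format: timestamp, object type, action, rest.
    """
    parts = line.split(maxsplit=3)
    if len(parts) < 3 or parts[1] != "container":
        return False
    # exec actions are printed with a trailing colon, e.g. "exec_create:"
    return parts[2].rstrip(":") in EXEC_ACTIONS

def strip_exec_events(lines):
    """Return only the lines that are not exec_* events."""
    return [line for line in lines if not is_exec_event(line)]

def filter_file(src, dst):
    """Copy src to dst without the exec_* event lines."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(strip_exec_events(fin))
```

Calling `filter_file("docker-events.log", "docker-events-edited.log")` would write the reduced log (the input filename here is just a placeholder).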

 

I don't think there are any pointers in there, but if anyone is bored enough to take a look at it I'd appreciate it.

 

Additionally I've checked docker compose logs (with --since and --until), but there's not much to see before the time of the problem.

(It doesn't help that not all log lines contain times and the containers don't all use the same formatting for times...)

 

For now I'll set up another script to capture the docker compose logs, and hopefully that'll be more helpful next week.

docker-events-edited.log
