OddMagnet Posted March 20

I've had the WebGUI and SSH break once or twice before and simply rebooted. This time I figured I might as well connect my keyboard and monitor and create a diagnostics file. Weirdly enough, right when I pressed Enter to start the diagnostics tool, the WebGUI and SSH started working again.

I checked the logs, and this seems to be the timeline (unless I'm misinterpreting something):

20:11 → WebGUI and SSH stopped working
20:20 → connected keyboard
20:23 → connected monitor
20:31 → started diagnostics

Mar 20 17:05:06 OddNas sshd[8859]: Starting session: shell on pts/0 for root from 192.168.178.80 port 50226 id 0
Mar 20 20:11:15 OddNas kernel: e1000e 0000:00:1f.6 eth0: NIC Link is Down
Mar 20 20:11:15 OddNas kernel: br0: port 1(eth0) entered disabled state
Mar 20 20:11:18 OddNas ntpd[1127]: Deleting interface #1 br0, 192.168.178.200#123, interface stats: received=7185, sent=7186, dropped=0, active_time=1387370 secs
Mar 20 20:11:18 OddNas ntpd[1127]: 216.239.35.12 local addr 192.168.178.200 ->
Mar 20 20:11:18 OddNas ntpd[1127]: 216.239.35.8 local addr 192.168.178.200 ->
Mar 20 20:11:18 OddNas ntpd[1127]: 216.239.35.4 local addr 192.168.178.200 ->
Mar 20 20:11:18 OddNas ntpd[1127]: 216.239.35.0 local addr 192.168.178.200 ->
Mar 20 20:20:51 OddNas kernel: usb 1-1: new full-speed USB device number 4 using xhci_hcd
Mar 20 20:20:51 OddNas kernel: input: Corsair Corsair Gaming K55 RGB Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.0/0003:1B1C:1B3D.0002/input/input5
Mar 20 20:20:51 OddNas kernel: hid-generic 0003:1B1C:1B3D.0002: input,hidraw0: USB HID v1.11 Keyboard [Corsair Corsair Gaming K55 RGB Keyboard] on usb-0000:00:14.0-1/input0
Mar 20 20:20:51 OddNas kernel: input: Corsair Corsair Gaming K55 RGB Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.1/0003:1B1C:1B3D.0003/input/input6
Mar 20 20:20:51 OddNas kernel: input: Corsair Corsair Gaming K55 RGB Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.1/0003:1B1C:1B3D.0003/input/input7
Mar 20 20:20:51 OddNas kernel: input: Corsair Corsair Gaming K55 RGB Keyboard as /devices/pci0000:00/0000:00:14.0/usb1/1-1/1-1:1.1/0003:1B1C:1B3D.0003/input/input8
Mar 20 20:20:51 OddNas kernel: hid-generic 0003:1B1C:1B3D.0003: input,hiddev96,hidraw1: USB HID v1.11 Keyboard [Corsair Corsair Gaming K55 RGB Keyboard] on usb-0000:00:14.0-1/input1
Mar 20 20:20:51 OddNas kernel: hid-generic 0003:1B1C:1B3D.0004: hiddev97,hidraw2: USB HID v1.11 Device [Corsair Corsair Gaming K55 RGB Keyboard] on usb-0000:00:14.0-1/input2
Mar 20 20:23:11 OddNas kernel: fbcon: i915drmfb (fb0) is primary device
Mar 20 20:23:11 OddNas kernel: Console: switching to colour frame buffer device 320x90
Mar 20 20:23:11 OddNas kernel: i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
Mar 20 20:23:33 OddNas login: pam_unix(login:auth): authentication failure; logname=LOGIN uid=0 euid=0 tty=/dev/tty1 ruser= rhost= user=root
Mar 20 20:23:36 OddNas login: FAILED LOGIN 1 FROM tty1 FOR root, Authentication failure
Mar 20 20:24:04 OddNas login: pam_unix(login:session): session opened for user root(uid=0) by LOGIN(uid=0)
Mar 20 20:24:04 OddNas login: ROOT LOGIN ON tty1
Mar 20 20:26:46 OddNas sshd[8859]: Read error from remote host 192.168.178.80 port 50226: No route to host
Mar 20 20:26:46 OddNas sshd[8859]: pam_unix(sshd:session): session closed for user root
Mar 20 20:27:17 OddNas root: ACPI action up is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:18 OddNas root: ACPI action left is not defined
Mar 20 20:27:19 OddNas root: ACPI action left is not defined
Mar 20 20:27:22 OddNas root: ACPI action up is not defined
Mar 20 20:27:22 OddNas root: ACPI action down is not defined
Mar 20 20:31:39 OddNas kernel: e1000e 0000:00:1f.6 eth0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Mar 20 20:31:39 OddNas kernel: br0: port 1(eth0) entered blocking state
Mar 20 20:31:39 OddNas kernel: br0: port 1(eth0) entered forwarding state
Mar 20 20:31:41 OddNas ntpd[1127]: Listen normally on 2 br0 192.168.178.200:123
Mar 20 20:31:41 OddNas ntpd[1127]: new interface(s) found: waking up resolver
Mar 20 20:31:45 OddNas emhttpd: read SMART /dev/sdb
Mar 20 20:32:19 OddNas sshd[13939]: Connection from 192.168.178.80 port 54474 on 192.168.178.200 port 22 rdomain ""
Mar 20 20:32:19 OddNas sshd[13939]: Failed publickey for root from 192.168.178.80 port 54474 ssh2: ED25519 SHA256:GInj1AiL72FiRCs+JM71XjgEElxEiNHUoR508RkQS3g
Mar 20 20:32:19 OddNas sshd[13939]: Postponed keyboard-interactive for root from 192.168.178.80 port 54474 ssh2 [preauth]
Mar 20 20:32:22 OddNas sshd[13939]: Postponed keyboard-interactive/pam for root from 192.168.178.80 port 54474 ssh2 [preauth]
Mar 20 20:32:22 OddNas sshd[13939]: Accepted keyboard-interactive/pam for root from 192.168.178.80 port 54474 ssh2
Mar 20 20:32:22 OddNas sshd[13939]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Mar 20 20:32:22 OddNas sshd[13939]: Starting session: shell on pts/0 for root from 192.168.178.80 port 54474 id 0

Full diagnostics are attached: oddnas-diagnostics-20240320-2031.zip
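For anyone reading along: the two lines that matter in the log above are the e1000e "NIC Link is Down" at 20:11:15 and "NIC Link is Up" at 20:31:39, roughly 20 minutes apart. A minimal sketch for pulling just those events out of a syslog is shown here. The function name `link_events` and the default path `/var/log/syslog` are assumptions for illustration, not something from the thread.

```shell
#!/bin/bash
# Print every "NIC Link is Up/Down" event found in a syslog file,
# one per line, as "<month> <day> <time> <Up|Down>".
# $1: path to the syslog (the default below is an assumption; adjust as needed).
link_events() {
  local logfile=${1:-/var/log/syslog}
  # Matching lines look like:
  #   Mar 20 20:11:15 OddNas kernel: e1000e 0000:00:1f.6 eth0: NIC Link is Down
  awk '/NIC Link is (Up|Down)/ {
         state = ""
         for (i = 2; i < NF; i++)
           if ($(i-1) == "Link" && $i == "is") state = $(i+1)
         print $1, $2, $3, state
       }' "$logfile"
}
```

Running `link_events` against the syslog inside each diagnostics zip gives a quick history of how often, and at what times, the physical link actually dropped.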
OddMagnet Posted March 20 (Edited March 21 by OddMagnet)

Looking back, there has also been a lot of this, though I'm not sure if it's related:

Mar 18 09:58:46 OddNas kernel: veth5e5dac8: renamed from eth0
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth3e692a8) entered disabled state
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth3e692a8) entered disabled state
Mar 18 09:58:46 OddNas kernel: device veth3e692a8 left promiscuous mode
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth3e692a8) entered disabled state
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered blocking state
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered disabled state
Mar 18 09:58:46 OddNas kernel: device veth7260a7f entered promiscuous mode
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered blocking state
Mar 18 09:58:46 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered forwarding state
Mar 18 09:58:47 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered disabled state
Mar 18 09:58:48 OddNas kernel: eth0: renamed from veth190d797
Mar 18 09:58:48 OddNas kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth7260a7f: link becomes ready
Mar 18 09:58:48 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered blocking state
Mar 18 09:58:48 OddNas kernel: br-db82cd71027f: port 20(veth7260a7f) entered forwarding state

Some more things that might be noteworthy:

Looking at ifconfig.txt, there are a lot of vethsomething interfaces, no idea why.
In the same file, aside from the above-mentioned interfaces, only some br-... interface was up.
JorgeB Posted March 21
Nothing obvious that I can see in the logs.
13 hours ago, OddMagnet said: "Looking back, there also has been a lot of this,"
That could be a container constantly restarting; check the uptime of all containers.
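A rough sketch of that uptime check from the command line, in case it helps. The function name `container_restarts` is made up for illustration. It relies on `docker ps -aq` and the `RestartCount` and `State.StartedAt` fields of `docker inspect`. As far as I know `RestartCount` only counts restarts triggered by Docker's restart policy, so treat a zero there as a hint rather than proof.

```shell
#!/bin/bash
# List every container (running or stopped) with its restart count and
# the time it was last started. Container names are printed with the
# leading "/" that docker inspect reports.
container_restarts() {
  local id
  for id in $(docker ps -aq); do
    docker inspect \
      --format '{{.Name}} restarts={{.RestartCount}} started={{.State.StartedAt}}' \
      "$id"
  done
}
```

A container whose `started=` timestamp keeps moving forward between two runs of this function is the one that is restarting.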
OddMagnet Posted March 21
All containers had 100% uptime (other than intentional restarts, e.g. when I changed things in the compose file).
OddMagnet Posted March 21
Still looks like a problem to me though. To me it seems that the WebGUI and SSH broke because the interfaces were down (aside from the br-... and veth... interfaces). It's hard to believe that the spam I mentioned in my second post is completely unrelated to the interfaces going down. I don't understand why there are so many veth... interfaces in the first place, why the ports of the br-... interfaces are spamming my log so much, or why there are entries about IPv6 when that's not even enabled in my settings.
OddMagnet Posted March 28
Still looking for help with this problem. I'd really like to solve the interface spam in my logs, which I believe is the root cause of my problem.
JorgeB Posted March 29
If none of the containers are restarting, you can disable one at a time to see if you find the culprit.
OddMagnet Posted March 31
So it happened again, or at least the "soft version" where all my containers restart after the interface spam. This seems to mostly happen on Sunday mornings. I've looked over all my container logs, but none of them contained anything that would indicate a problem. Additionally, I've checked all my container settings, but none of them are configured to do anything at around that time. It doesn't look like a Docker problem to me. I'll try disabling some of my containers over the coming weeks and months, but it's going to be a very tedious process of narrowing it down (if they're the problem). In the meantime, is there anything else I could check? I'm still confused why there is any mention of IPv6 in my logs when I don't even have it enabled.
OddMagnet Posted April 3
So it happened again; guess it's not a Sunday-only thing then. Again no container restarts, and nothing in the container or Docker logs that would indicate any problems.
OddMagnet Posted April 3
Had a look at the syslog, and this time it seems to be different: there's no interface spam before eth0 goes down. (I don't think it previously even had an entry about eth0 going down.) @JorgeB can you take another look at the new diagnostics?
oddnas-diagnostics-20240403-0729.zip
JorgeB Posted April 3
Apr 3 01:34:58 OddNas kernel: e1000e 0000:00:1f.6 eth0: NIC Link is Down
Yep, this would mean the NIC actually lost the physical link. It could be an issue with the NIC, cable, switch, etc.
OddMagnet Posted April 3
That'd suck a lot. I guess that'd explain the drops for eth0 as well? Though it's weird that there are only drops on the receive side:

Interface  Counter   Receive  Transmit
eth0       Errors    0        0
eth0       Drops     19739    0
eth0       Overruns  0        0

Any suggestions on how I'd be able to verify which parts of the chain are good and where the problem actually is?
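One way to put numbers on this is to read the kernel's per-interface counters from sysfs and compare them before and after swapping a cable or port. The sketch below assumes the standard Linux layout under `/sys/class/net`. The function name `nic_counters` and the optional second argument (an alternative sysfs root, mainly for testing) are illustrative only.

```shell
#!/bin/bash
# Print a few link-health counters for one interface.
# $1: interface name, e.g. eth0
# $2: sysfs root (optional, defaults to /sys/class/net)
nic_counters() {
  local iface=$1 root=${2:-/sys/class/net}
  local f
  # carrier_changes counts every up->down and down->up transition of the link;
  # the statistics/* files are the same counters ifconfig summarises.
  for f in carrier_changes statistics/rx_dropped statistics/rx_errors statistics/tx_dropped; do
    printf '%s=%s\n' "$f" "$(cat "$root/$iface/$f")"
  done
}
```

If `carrier_changes` keeps climbing after the cable and port have been swapped, the NIC itself (or whatever is driving it) becomes the more likely suspect. Receive drops on their own are not necessarily a sign of a bad link.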
JorgeB Posted April 3
Try a different cable and switch/router port; if it still happens, it could be the NIC.
OddMagnet Posted April 3
Changed the cable and router port; hopefully it's not going to happen again. Everything (router, cable, server) is brand-new, but I'd much rather have the cable be the culprit, lol. What's weird, though, is that eth0 always comes up again when I connect a keyboard and monitor, log in and start the diagnostics command. (And obviously losing the physical link when nothing is moving is extremely weird in the first place.)
OddMagnet Posted April 21
So I've set up a user script to start logging Docker events on Sundays, one hour before the interface spam happens. Looking at the log, I can see a few things:

container exec_create → exec_start → exec_die happens a lot. After a quick Google search I learned that those events also happen for healthchecks, which is the case in my logs.
A lot of container kill / die / stop events happen after the interface spam has started.
There are also a lot of network disconnect events for my Docker bridge network.

I've edited the log to remove all the healthchecks and attached it here. I don't think there are any pointers in there, but if anyone is bored enough to take a look at it, I'd appreciate it.

Additionally, I've checked the docker compose logs (with --since and --until), but there's not much to see before the time of the problem. (It doesn't help that not all log lines contain times and the containers don't all use the same formatting for times...) For now I'll set up another script to catch the docker compose logs, and hopefully that'll be more helpful next week.

docker-events-edited.log
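The healthcheck clean-up described above can be done with a small filter instead of by hand. This is only a sketch: the function name `strip_exec_events` is made up, and it assumes the default text output of `docker events`, where each line carries the object type followed by the action (for example `container exec_start: ...`). It drops only the three exec_* actions named in the post. Healthchecks may also emit `health_status` events, which are left in.

```shell
#!/bin/bash
# Read `docker events` text output on stdin and drop the exec_create,
# exec_start and exec_die lines that healthchecks generate in bulk.
strip_exec_events() {
  grep -Ev ' container exec_(create|start|die)(:| )'
}

# Example use (the output path is hypothetical):
#   timeout 2h docker events | strip_exec_events > /mnt/user/data/docker-events.log
```

Note that this also removes exec events caused by a manual `docker exec`, so it is best kept for the weekly capture window only.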
OddMagnet Posted April 28
Again, the interface spam happened at 8am on Sunday. I've tried getting logs from Docker via this script:

#!/bin/bash
cd /mnt/user/appdata
timeout 2h docker compose logs -f -t --since=1s > /mnt/user/data/docker-compose.log

However, this didn't record for the full duration:

Script Starting Apr 28, 2024 07:00.01
Full logs for this script are available at /tmp/user.scripts/tmpScripts/Get Docker Compose Logs/log.txt
error from daemon in stream: Error grabbing logs: unexpected EOF
Script Finished Apr 28, 2024 07:01.21
Full logs for this script are available at /tmp/user.scripts/tmpScripts/Get Docker Compose Logs/log.txt

I don't think it's a problem with the script, since it recorded for over a minute. The last line from the logs doesn't seem to be the problem either:

autobrr | 2024-04-28T05:01:21.043895040Z {"level":"debug","module":"filter","method":"CheckFilter","time":"2024-04-28T07:01:21+02:00","message":"(Cross-Seed) external filter check not matching what filter wanted"}
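Since the follow ended after about 80 seconds with "unexpected EOF", one possible workaround is to restart the log follow whenever it exits, until the capture window is over. The sketch below is an untested idea rather than a confirmed fix. The function name `follow_until` is hypothetical, and because each restart begins a fresh follow, lines emitted between two runs can be missed.

```shell
#!/bin/bash
# Keep re-running a command, appending its output to a file, until a deadline.
# $1: deadline as epoch seconds
# $2: output file
# rest: the command to run (must be an external program, since it is
#       launched through `timeout`)
follow_until() {
  local deadline=$1 outfile=$2; shift 2
  local now remaining
  while now=$(date +%s); [ "$now" -lt "$deadline" ]; do
    remaining=$((deadline - now))
    # Marker line so restarts are visible in the captured log.
    echo "--- (re)starting at $(date '+%Y-%m-%dT%H:%M:%S%z') ---" >> "$outfile"
    timeout "$remaining" "$@" >> "$outfile" 2>&1 || true
    sleep 1
  done
}

# Example use, with the paths and flags from the script above:
#   cd /mnt/user/appdata
#   follow_until "$(( $(date +%s) + 7200 ))" /mnt/user/data/docker-compose.log \
#     docker compose logs -f -t --since=1s
```

The restart markers also show exactly when each EOF occurred, which can then be lined up against the syslog.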
OddMagnet Posted May 5
Same thing again today; the Docker logs got an unexpected EOF again as well. Guess I'll have to go with the slow and painful approach of disabling some containers and checking if it still happens.
OddMagnet Posted May 19
So, last weekend I stopped half of my Docker containers, and the problem still occurred. This weekend I stopped the other half, and it still occurred. It seems the Docker services are not the problem. I still need help with this.
OddMagnet Posted May 20
Not yet. Since it always happens at exactly the same time (at 8:00am sharp) on Sundays, I highly doubt that the hardware is at fault here.
JorgeB Posted May 20
Yes, that's very suspect. It suggests a container or plugin issue, or the router, if it does something at that time.
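Something that fires at exactly 8:00 on Sundays is often a scheduled job, so it may be worth scanning the cron tables for matching entries. The filter below is a simplified sketch. The function name `sunday_8am_jobs` is made up, and it only understands plain numbers, `*` and `sun` in the minute, hour and day-of-week fields. Ranges, lists and steps such as `*/5` are not interpreted, so entries using those need checking by eye.

```shell
#!/bin/bash
# Read crontab-style lines on stdin and print those that can fire at
# 08:00 on a Sunday. Fields: minute hour day-of-month month day-of-week command.
sunday_8am_jobs() {
  awk '
    $1 ~ /^#/ { next }    # comment line
    NF < 6    { next }    # blank line, variable assignment, or not a schedule
    ($1 == "0" || $1 == "00" || $1 == "*") &&
    ($2 == "8" || $2 == "08" || $2 == "*") &&
    ($5 == "0" || $5 == "7"  || $5 == "*" || tolower($5) == "sun")
  '
}

# Example use:
#   crontab -l | sunday_8am_jobs
```

Entries that match on `*` in every field will show up too. Those run all the time and can usually be ruled out quickly.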
OddMagnet Posted May 22
I don't think it's the containers, since I had disabled half of them last weekend and the other half the weekend before that, and both times it still happened. My router doesn't really do anything, as far as I know. I'll try disabling plugins for next Sunday.
OddMagnet Posted Sunday at 09:12 AM
Took me a week longer, but disabling the plugins helped. The interface spam didn't happen either. (I used the Dynamix Safe Mode plugin to disable the other plugins.) I'm not sure if there's a way for me to disable only certain plugins so I could narrow down the culprit.
JorgeB Posted Sunday at 09:31 AM
You can temporarily rename the *.plg files and reboot; those plugins won't be loaded then. Try a few at a time.
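The renaming can be scripted so it is easy to undo. This is a minimal sketch with a few assumptions: the directory holding the *.plg files is passed in explicitly (on Unraid this is commonly /boot/config/plugins, but verify that on your system), the `.disabled` suffix is an arbitrary choice, and both function names are hypothetical.

```shell
#!/bin/bash
# Rename the given plugins' .plg files so they are not picked up on the next boot.
# $1: directory containing the *.plg files
# rest: plugin names without the ".plg" extension
disable_plugins() {
  local dir=$1; shift
  local name
  for name in "$@"; do
    if [ -f "$dir/$name.plg" ]; then
      mv -- "$dir/$name.plg" "$dir/$name.plg.disabled"
      echo "disabled $name"
    else
      echo "skipped $name (no $name.plg in $dir)" >&2
    fi
  done
}

# Restore every previously disabled plugin in the directory.
enable_all_plugins() {
  local dir=$1 f
  for f in "$dir"/*.plg.disabled; do
    [ -e "$f" ] || continue   # the glob matched nothing
    mv -- "$f" "${f%.disabled}"
  done
}
```

Disabling half of the plugins per week and then halving the suspect group again narrows things down in a few Sundays, the same way the containers were split.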