Server unreachable constantly

Dextabrewa · December 11, 2023

Hi, I need help digging into an ongoing problem I've been having with server outages. These outages happen a couple of times a week. The server is still on with its fans running, but the Unraid dashboard is unavailable both locally and through Unraid Connect. All of my reverse-proxied Docker containers are also inaccessible, and the router shows the server as offline, which indicates some kind of network outage or an outright crash of the OS.

Typically only a server restart brings it back. Unraid is able to shut down gracefully if I press the power button once.

brulu-diagnostics-20231210-1931.zip

Edited December 11, 2023 by Dextabrewa

MrGrey · December 11, 2023

4 hours ago, Dextabrewa said:

These outages happen a couple times a week,

Can you nail down a reason for the "couple" of times? Neighbors coming/going (i.e., devices joining and leaving)?

4 hours ago, Dextabrewa said:

the router shows that the server is offline

What router?... Make/Model, Yours/ISPs?

4 hours ago, Dextabrewa said:

Typically only a server restart can bring it back

Well... You're in REALLY big trouble when the computer doesn't start at all.

We/You will figure it out.

MrGrey.

JorgeB · December 11, 2023

Unfortunately there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers and VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

itimpi · December 11, 2023

You might want to try enabling the syslog server (probably using the Mirror to flash option) and post the file it produces after the issue occurs, so we can see if anything was logged leading up to the problem.

Dextabrewa · December 11, 2023

8 hours ago, MrGrey said:

Can you nail down a reason for the "couple" of times? Neighbors coming/going (i.e., devices joining and leaving)?

What router?... Make/Model, Yours/ISPs?

Well... You're in REALLY big trouble when the computer doesn't start at all.

We/You will figure it out.

MrGrey.

Router: Ubiquiti Networks UniFi Dream Machine
ISP: Google Fiber 1gig

There doesn't seem to be an obvious reason for the crashes. The computer always reboots without a problem and the Unraid dashboard comes back up. I never have to do a hard restart either; it's always graceful.
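One way to pin down when the outages start is to run a small probe from another machine on the LAN and line its timestamps up with the syslog later. The following is only a minimal sketch: the address 192.168.1.164 is the one that appears in this thread's syslog file name, port 80 is an assumption about where the web UI listens, and the function names `is_reachable` and `watch` are made up for illustration.

```python
import socket
import time
from datetime import datetime


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def watch(host, port, interval=30.0, checks=None):
    """Print a timestamped line each time reachability changes.

    checks=None keeps probing until interrupted; an integer limits the
    number of probes.
    """
    last_state = None
    done = 0
    while checks is None or done < checks:
        state = is_reachable(host, port)
        if state != last_state:
            stamp = datetime.now().isoformat(timespec="seconds")
            label = "reachable" if state else "UNREACHABLE"
            print(f"{stamp} {host}:{port} {label}")
            last_state = state
        done += 1
        if checks is None or done < checks:
            time.sleep(interval)


# Example call (assumed address and port, runs until interrupted):
# watch("192.168.1.164", 80)
```

The probe only shows that the server stopped answering on the network. It cannot by itself distinguish a dead link from a hung OS, so it is best read alongside the mirrored syslog.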

Dextabrewa · December 11, 2023

3 hours ago, itimpi said:

You might want to try enabling the syslog server (probably using the Mirror to flash option) and post the file it produces after the issue occurs, so we can see if anything was logged leading up to the problem.

I have a syslog file that goes back a couple months that I can also share here:

syslog-192.168.1.164.log

Dextabrewa · December 11, 2023

5 hours ago, JorgeB said:

Unfortunately there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers and VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

If it were hardware related, I would expect the server to just shut down or the OS to crash. Instead, the OS seems to be running but the network goes down. I replaced the onboard motherboard networking with a PCI network card and the outage still happens.

xoC · December 11, 2023

I have had the same thing happening since an update sometime in September/October, IIRC. Usually at these times, the dashboard shows this:

image.png.7f51d49a571909dec0f1147759339efe.png

The dashboard becomes unresponsive, and so does everything else. I can't even power it off; it doesn't respond, even to a single push of the hardware button. The terminal doesn't load either, so I can't check what is happening with htop...

Edit: Over the last few months I managed to get some diagnostics during one of these hangs, and they also showed nothing.

Edit 2: It was actually before September:

Edited December 11, 2023 by xoC

itimpi · December 11, 2023

1 hour ago, Dextabrewa said:

If it were hardware related, I would expect the server to just shut down or the OS to crash. Instead, the OS seems to be running but the network goes down. I replaced the onboard motherboard networking with a PCI network card and the outage still happens.

The syslog seems to show the eth1 connection going up and down frequently.

Have you tried changing the network settings so that eth0 is not bonded with eth1 to see if that helps?

JorgeB · December 11, 2023

1 hour ago, Dextabrewa said:

I have a syslog file that goes back a couple months that I can also share here:

Various call traces and btrfs is detecting a lot of data corruption, suggest running memtest.

Dextabrewa · December 11, 2023

4 minutes ago, JorgeB said:

Various call traces and btrfs is detecting a lot of data corruption, suggest running memtest.

I replaced the SSD with the data corruption yesterday, so those errors should be gone going forward. I had suspected that was the cause of the system crashes/freezes, but the server went offline again just hours after the drive replacement.

Edited December 11, 2023 by Dextabrewa

JorgeB · December 11, 2023

8 minutes ago, Dextabrewa said:

I replaced the SSD with the data corruption yesterday

Almost certainly the SSD was not the problem.

Dextabrewa · December 11, 2023

22 minutes ago, JorgeB said:

Almost certainly the SSD was not the problem.

The faulty SSD had errors when I checked the SSD log. My theory is that they were caused by a bad set of RAM I was using months prior, which I have since replaced with a known good set. But I will run a memtest again to verify.

Edited December 11, 2023 by Dextabrewa

Dextabrewa · December 11, 2023

2 hours ago, JorgeB said:

Almost certainly the SSD was not the problem.

The error below seems to be the common one whenever my server is unreachable. eth0 is correctly assigned to the new PCI network card in the interface rules.

Network card: Gigabit Dual NIC with Intel 82576 Chip, 1Gb
Router: UniFi Dream Machine

Any suggestions?

I've included updated diagnostics that include a crash.

Dec 11 13:32:10 Brulu kernel: igb 0000:29:00.0 eth0: igb: eth0 NIC Link is Down
Dec 11 13:32:10 Brulu kernel: br0: port 1(eth0) entered disabled state
Dec 11 13:32:14 Brulu ntpd[1812]: Deleting interface #1 br0, 192.168.1.164#123, interface stats: received=12, sent=12, dropped=0, active_time=264 secs
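To see how often the link drops across a longer syslog, lines like the ones above can be counted per interface. This is only a rough sketch: the regular expression is modeled on the single "NIC Link is Down" line quoted here, it assumes the matching "Up" lines start the same way, and `count_link_events` is a made-up helper name.

```python
import re
from collections import Counter

# Modeled on the line quoted in this post:
#   "... kernel: igb 0000:29:00.0 eth0: igb: eth0 NIC Link is Down"
# Assumption: "Up" events use the same prefix, possibly with trailing text.
LINK_EVENT = re.compile(r"igb: (?P<iface>\w+) NIC Link is (?P<state>Up|Down)")


def count_link_events(lines):
    """Return a Counter keyed by (interface, state) for link-state lines."""
    counts = Counter()
    for line in lines:
        match = LINK_EVENT.search(line)
        if match:
            counts[(match["iface"], match["state"])] += 1
    return counts


if __name__ == "__main__":
    # The three log lines quoted in this post.
    sample = [
        "Dec 11 13:32:10 Brulu kernel: igb 0000:29:00.0 eth0: igb: eth0 NIC Link is Down",
        "Dec 11 13:32:10 Brulu kernel: br0: port 1(eth0) entered disabled state",
        "Dec 11 13:32:14 Brulu ntpd[1812]: Deleting interface #1 br0, 192.168.1.164#123, "
        "interface stats: received=12, sent=12, dropped=0, active_time=264 secs",
    ]
    for (iface, state), n in sorted(count_link_events(sample).items()):
        print(f"{iface} {state}: {n}")
```

An open file object can be passed in place of the sample list, since the function only iterates over lines. A high count of Down events on one interface would point toward the cable, port, or NIC rather than the OS.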

brulu-diagnostics-20231211-1338.zip

Dextabrewa · December 11, 2023

3 hours ago, JorgeB said:

Various call traces and btrfs is detecting a lot of data corruption, suggest running memtest.

Ran memtest for 1 hour with no errors. If you feel I should run it longer, I can.

Dextabrewa · December 12, 2023

Swapping out the Ethernet cable seems to have fixed the issue for now. Will report back in a few days!
