Server unreachable constantly


Go to solution Solved by Dextabrewa,


Hi, I need help digging into an ongoing problem with server outages. They happen a couple of times a week: the server is still on with fans running, but the Unraid dashboard is unavailable both locally and through Unraid Connect. All of my reverse-proxied Docker containers are also inaccessible, and the router shows the server as offline, which points to either a network outage or an outright OS crash.

Typically only a server restart brings it back. Unraid is able to shut down gracefully if I press the power button once.

brulu-diagnostics-20231210-1931.zip

Edited by Dextabrewa
Link to comment
4 hours ago, Dextabrewa said:

These outages happen a couple times a week,

Can you nail down a reason for the "couple" of times? Neighbors coming/going (ie: devices coming and leaving)?

 

4 hours ago, Dextabrewa said:

the router shows that the server is offline

What router?... Make/Model, Yours/ISPs?

 

4 hours ago, Dextabrewa said:

Typically only a server restart can bring it back

Well... You're in REALLY big trouble when the computer doesn't start at all.

 

We/You will figure it out.

 

MrGrey.

 

Link to comment

Unfortunately, there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

Link to comment
8 hours ago, MrGrey said:

Can you nail down a reason for the "couple" of times? Neighbors coming/going (ie: devices coming and leaving)?

 

What router?... Make/Model, Yours/ISPs?

 

Well... You're in REALLY big trouble when the computer doesn't start at all.

 

We/You will figure it out.

 

MrGrey.

 


Router: Ubiquiti Networks UniFi Dream Machine
ISP: Google Fiber 1gig

 

There doesn't seem to be an obvious reason for the crashes. The computer always reboots without a problem and the Unraid dashboard comes back up. I never have to hard restart either; it's always graceful.

Link to comment
5 hours ago, JorgeB said:

Unfortunately, there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

 

If it were hardware related, I would expect the server to just shut down or the OS to crash. Instead, the OS seems to keep running while the network goes down. I replaced the onboard motherboard networking with a PCI network card and the outage still happens.

Link to comment

I have had the same thing happening since an update sometime in September/October, IIRC. Usually at these times, the dashboard shows this:

 

image.png.7f51d49a571909dec0f1147759339efe.png

The dashboard becomes unresponsive, and so does everything else. I can't even power the server off: it doesn't respond, even to a single push of the hardware power button. The terminal doesn't load either, so I can't see what's happening with htop...

 

Edit: Over the last few months, I managed to capture diagnostics during one of these hangs, and they also show nothing.

Edit 2: It was actually before September:

 

Edited by xoC
Link to comment
1 hour ago, Dextabrewa said:

 

If it were hardware related, I would expect the server to just shut down or the OS to crash. Instead, the OS seems to keep running while the network goes down. I replaced the onboard motherboard networking with a PCI network card and the outage still happens.

 

The syslog seems to show the eth1 connection going up and down frequently.

 

Have you tried changing the network settings so that eth0 is not bonded with eth1 to see if that helps?

Link to comment
4 minutes ago, JorgeB said:

Various call traces and btrfs is detecting a lot of data corruption, suggest running memtest.

I replaced the SSD with the data corruption yesterday, so those errors should be gone going forward. I had suspected that it was the cause of the system crashes/freezes, but the server went offline again just hours after the drive replacement.

Edited by Dextabrewa
Link to comment
22 minutes ago, JorgeB said:

Almost certainly the SSD was not the problem.

 

The faulty SSD had errors when I checked the SSD log. My theory is that they were caused by a bad set of RAM I was using months prior, which I have since replaced with a known-good set. But I will run memtest again to verify.

Edited by Dextabrewa
Link to comment
2 hours ago, JorgeB said:

Almost certainly the SSD was not the problem.

The log excerpt at the end of this post seems to be the common error whenever my server is unreachable. eth0 is correctly assigned to the new PCI network card in the interface rules.

Network card: Gigabit Dual NIC with Intel 82576 Chip, 1Gb
Router: UniFi Dream Machine

 

Any suggestions?

I've attached updated diagnostics that include a crash.

Dec 11 13:32:10 Brulu kernel: igb 0000:29:00.0 eth0: igb: eth0 NIC Link is Down
Dec 11 13:32:10 Brulu kernel: br0: port 1(eth0) entered disabled state
Dec 11 13:32:14 Brulu ntpd[1812]: Deleting interface #1 br0, 192.168.1.164#123, interface stats: received=12, sent=12, dropped=0, active_time=264 secs
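
For anyone who wants to check how often the link actually drops, here is a minimal Python sketch that pulls the "NIC Link is Up/Down" events out of syslog lines like the ones above. The function names (`link_events`, `count_down_events`) are made up for illustration, and the regex only assumes the igb message format shown in this post; other drivers may word their messages differently.

```python
import re
from collections import Counter

# Matches kernel link-state messages in the format quoted in this post, e.g.
# "Dec 11 13:32:10 Brulu kernel: igb ... eth0 NIC Link is Down".
# Assumption: the interface is named ethN and the driver logs "NIC Link is Up/Down".
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:.*?"
    r"(?P<iface>eth\d+) NIC Link is (?P<state>Up|Down)"
)


def link_events(lines):
    """Return (timestamp, interface, state) for every link-state line."""
    events = []
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            events.append(
                (match.group("ts"), match.group("iface"), match.group("state"))
            )
    return events


def count_down_events(lines):
    """Count how many times each interface reported 'Link is Down'."""
    return Counter(
        iface for _, iface, state in link_events(lines) if state == "Down"
    )


# The three syslog lines from this post, used as sample input.
SAMPLE = [
    "Dec 11 13:32:10 Brulu kernel: igb 0000:29:00.0 eth0: igb: eth0 NIC Link is Down",
    "Dec 11 13:32:10 Brulu kernel: br0: port 1(eth0) entered disabled state",
    "Dec 11 13:32:14 Brulu ntpd[1812]: Deleting interface #1 br0, 192.168.1.164#123, "
    "interface stats: received=12, sent=12, dropped=0, active_time=264 secs",
]

if __name__ == "__main__":
    for ts, iface, state in link_events(SAMPLE):
        print(ts, iface, state)
    print(dict(count_down_events(SAMPLE)))
```

To run it against a real log, pass the lines of the syslog file from the diagnostics zip (or the live syslog on the server) to `link_events` instead of `SAMPLE`. Comparing the timestamps of the "Down" events with the times the server became unreachable should show whether the outages line up with the link dropping.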

 

brulu-diagnostics-20231211-1338.zip

Link to comment
