Unraid becomes unresponsive over night

ThanatopsisJSH · May 11, 2020

Hi,

For the last two days (after about 3 weeks of uptime) my Unraid server has become unresponsive overnight. When I try to access it in the morning, the shares, the Docker containers and the web interface are all unresponsive. When I access the server directly (via console redirection) the shell is still up. htop shows basically no activity. After I reboot the server everything is fine.

I have looked into the logs, but except for some weird network issues starting around 1 am I cannot find anything that explains the unresponsiveness.

Here are the diagnostics: https://drive.google.com/open?id=1TDVWs-58npp8zMHNqGjW0mPwz_RPHCkJ

JorgeB · May 11, 2020

Syslog starts over after a reboot, try this.

ThanatopsisJSH · May 11, 2020

I have done that.

The reboot was this morning around 8:50 or so, when I tried to access the server. There was very little activity in the log before that. Here is the extended log since yesterday:

https://drive.google.com/open?id=1K42PlJj-tq6h86uunJzIN_6ckNAPYlqt

polish_pat · May 11, 2020

+1 for the issue.

Running a PowerEdge T410 with 2 CPUs, 64 GB RAM, and all 6 HDDs on the backplane with a SAS RAID controller. I don't remember the exact model, but it's in IT mode and shows up as SAS2000. I have 6 Ethernet ports and an iDRAC 6 Enterprise.

Let me throw something your way, I don't think the problem is the server. I have a Ubiquiti USG, but a dedicated Rasp3B+bian for PiHole.

I'm fairly new to both Unraid and Pi-hole. I'm well established on the networking side, as I equip all my customers with the same network setup, usually a USG with some AP-AC-Pro/Lite/LR access points.

I was playing around with the Raspberry Pi and the USG to find the optimal setup for resolving DHCP queries and filtering traffic while keeping a complete log of all IPs, hostnames and their requests. Just like you, I realized that for no obvious reason Unraid would stop responding, and I would get a DNS NXDOMAIN error even when using the IP. But once more, the DRAC always remained accessible and rebooting would restore Unraid. I was not able to find any cause in any log.

To me, it seems like the DNS server drops Unraid and all the IPs it uses, because I left a monitor connected to see what showed up when I couldn't log in, and everything was normal... the screen showed the normal console login prompt.

I "think" the issues started and stopped when I had made the Pi my DHCP server. Until that point the USG had always been it, and my issues seem to have stopped (good for day 3 as of now), when previously I'd wake up, try to access a container and get nothing.

ThanatopsisJSH · May 11, 2020

Hmm, unrelated: my Pihole was throwing issues today and I reinstalled everything. My USG is my DHCP server but the PiHole is the DNS. Let's hope the PiHole update to version 5 and the reinstall solve the issues.

And yes, my Pihole is also running on a Raspberry Pi in the server rack. Seems that I basically run the same setup as you do.

polish_pat · May 11, 2020

Well yes, needless to say Pi-hole is the DNS. I don't see what else it could be... 😆

Before posting, I just wanted to mention that I had not checked any of your attached logs; I relied only on your description. My intuition seems to be right, since you confirm a setup that is too similar to be a coincidence.

But if I were you, I would look at your current controller version and firmware. Ubiquiti released a pretty big bunch of firmware updates for the APs and the USG that fixed a hell of a lot of problems that had been pending a fix for a very long time. Ever since I stopped playing games with the DNS and DHCP settings just to see if one way was more optimal than the other, all of my issues seem to have disappeared, or at least I'm now on day 4 when previously it was a daily thing.

Edited May 11, 2020 by polish_pat

ThanatopsisJSH · May 12, 2020

Hmm, the problem is still there and it is not a pihole problem...

When I came back to my system this morning I had the same problem again. The server is unavailable via IP: no web interface, no shares, etc.

This time I took a little time to investigate, and the problem seems to be that my main network link eth0 is reported as "link ifdown" in ethtool.

It says Link detected: yes but the adapter is down.
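
The mismatch described here (physical link detected, but the interface itself down) can also be checked from the console without ethtool. What follows is only a sketch, assuming a Linux system that exposes the usual `operstate` and `carrier` files under `/sys/class/net`. The function name `link_state` and the `sysfs_root` parameter are made up for illustration.

```python
from pathlib import Path


def link_state(iface, sysfs_root="/sys/class/net"):
    """Return the kernel's view of an interface as a small dict.

    operstate is the operational state ("up", "down", ...).
    carrier is "1" when a physical link is sensed and "0" when not.
    Reading carrier can fail while the interface is administratively
    down, so that case is reported as "unknown" instead of raising.
    """
    base = Path(sysfs_root) / iface
    operstate = (base / "operstate").read_text().strip()
    try:
        carrier = (base / "carrier").read_text().strip()
    except OSError:
        carrier = "unknown"
    return {"operstate": operstate, "carrier": carrier}


if __name__ == "__main__":
    # Interface names taken from the log in this thread.
    for name in ("eth0", "eth1", "bond0"):
        try:
            print(name, link_state(name))
        except OSError:
            print(name, "not present")
```

A result with `carrier` at `1` but `operstate` at `down` would match what ethtool showed here: the cable and switch port are fine, but the interface is not up.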

ThanatopsisJSH · May 12, 2020

Ok, this is getting weirder and weirder. Unplugging and replugging the network cable brings the server back up.

But: Something deep in unraid must be the issue here, the server was not only unresponsive but the parity rebuild that was running when the server became unresponsive actually stopped (otherwise it should have finished by morning) and only started back up when I did my unplug/replug of the network cable. This means the dropped network link actually stopped the whole system from working.

ThanatopsisJSH · May 12, 2020

Here is the extract from the logs from last night:

May 12 01:03:06 Cubyserve kernel: igb 0000:03:00.0 eth1: igb: eth1 NIC Link is Down
May 12 01:03:06 Cubyserve kernel: bond0: link status definitely down for interface eth1, disabling it
May 12 01:03:08 Cubyserve kernel: mlx4_en: eth0: Link Down
May 12 01:03:09 Cubyserve dhcpcd[6094]: bond0: carrier lost
May 12 01:03:09 Cubyserve kernel: bond0: link status definitely down for interface eth0, disabling it
May 12 01:03:09 Cubyserve kernel: bond0: now running without any active interface!
May 12 01:04:36 Cubyserve kernel: mlx4_en: eth0: Link Up
May 12 01:04:36 Cubyserve dhcpcd[6094]: bond0: carrier acquired
May 12 01:04:36 Cubyserve kernel: bond0: link status definitely up for interface eth0, 10000 Mbps full duplex
May 12 01:04:36 Cubyserve kernel: bond0: making interface eth0 the new active one
May 12 01:04:36 Cubyserve kernel: bond0: first active interface up!
May 12 01:04:36 Cubyserve dhcpcd[6094]: bond0: IAID c9:54:99:24
May 12 01:04:36 Cubyserve dhcpcd[6094]: bond0: soliciting an IPv6 router
May 12 01:04:37 Cubyserve kernel: mlx4_en: eth0: Link Down
May 12 01:04:37 Cubyserve dhcpcd[6094]: bond0: carrier lost
May 12 01:04:37 Cubyserve kernel: bond0: link status definitely down for interface eth0, disabling it
May 12 01:04:37 Cubyserve kernel: bond0: now running without any active interface!
May 12 01:04:38 Cubyserve kernel: mlx4_en: eth0: Link Up
May 12 01:04:38 Cubyserve dhcpcd[6094]: bond0: carrier acquired
May 12 01:04:38 Cubyserve kernel: bond0: link status definitely up for interface eth0, 10000 Mbps full duplex
May 12 01:04:38 Cubyserve kernel: bond0: making interface eth0 the new active one
May 12 01:04:38 Cubyserve kernel: bond0: first active interface up!
May 12 01:04:38 Cubyserve dhcpcd[6094]: bond0: IAID c9:54:99:24
May 12 01:04:38 Cubyserve kernel: mlx4_en: eth0: Link Down
May 12 01:04:38 Cubyserve dhcpcd[6094]: bond0: carrier lost
May 12 01:04:38 Cubyserve kernel: bond0: link status definitely down for interface eth0, disabling it
May 12 01:04:38 Cubyserve kernel: bond0: now running without any active interface!
May 12 01:04:38 Cubyserve kernel: mlx4_en: eth0: Link Up
May 12 01:04:38 Cubyserve dhcpcd[6094]: bond0: carrier acquired
May 12 01:04:38 Cubyserve kernel: bond0: link status definitely up for interface eth0, 10000 Mbps full duplex
May 12 01:04:38 Cubyserve kernel: bond0: making interface eth0 the new active one
May 12 01:04:38 Cubyserve kernel: bond0: first active interface up!
May 12 01:04:38 Cubyserve dhcpcd[6094]: bond0: IAID c9:54:99:24
May 12 01:04:38 Cubyserve dhcpcd[6094]: bond0: soliciting an IPv6 router
May 12 01:04:41 Cubyserve kernel: igb 0000:03:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
May 12 01:04:41 Cubyserve kernel: bond0: link status definitely up for interface eth1, 1000 Mbps full duplex
May 12 01:04:43 Cubyserve kernel: mlx4_en: eth0: Link Down
May 12 01:04:43 Cubyserve kernel: bond0: link status definitely down for interface eth0, disabling it
May 12 01:04:43 Cubyserve kernel: bond0: making interface eth1 the new active one
May 12 01:04:43 Cubyserve kernel: mlx4_en: eth0: Link Up
May 12 01:04:43 Cubyserve kernel: bond0: link status definitely up for interface eth0, 10000 Mbps full duplex
May 12 01:04:43 Cubyserve kernel: mlx4_en: eth0: Link Down
May 12 01:04:43 Cubyserve kernel: bond0: link status definitely down for interface eth0, disabling it
May 12 01:04:43 Cubyserve kernel: mlx4_en: eth0: Link Up
May 12 01:04:43 Cubyserve kernel: bond0: link status definitely up for interface eth0, 10000 Mbps full duplex
May 12 01:04:43 Cubyserve kernel: mlx4_en: eth0: Link Down
May 12 01:04:43 Cubyserve kernel: bond0: link status definitely down for interface eth0, disabling it
May 12 01:04:43 Cubyserve kernel: mlx4_en: eth0: Link Up
May 12 01:04:43 Cubyserve kernel: bond0: link status definitely up for interface eth0, 10000 Mbps full duplex
May 12 01:04:50 Cubyserve dhcpcd[6094]: bond0: no IPv6 Routers available
May 12 01:04:56 Cubyserve kernel: mlx4_en: eth0: Link Down
May 12 01:04:56 Cubyserve kernel: bond0: link status definitely down for interface eth0, disabling it
May 12 01:04:56 Cubyserve kernel: igb 0000:03:00.0 eth1: igb: eth1 NIC Link is Down
May 12 01:04:56 Cubyserve dhcpcd[6094]: bond0: carrier lost
May 12 01:04:56 Cubyserve kernel: bond0: link status definitely down for interface eth1, disabling it
May 12 01:04:56 Cubyserve kernel: bond0: now running without any active interface!
May 12 01:04:57 Cubyserve kernel: mlx4_en: eth0: Link Up
May 12 01:04:57 Cubyserve dhcpcd[6094]: bond0: carrier acquired
May 12 01:04:57 Cubyserve kernel: bond0: link status definitely up for interface eth0, 10000 Mbps full duplex
May 12 01:04:57 Cubyserve kernel: bond0: making interface eth0 the new active one
May 12 01:04:57 Cubyserve kernel: bond0: first active interface up!
May 12 01:04:57 Cubyserve dhcpcd[6094]: bond0: IAID c9:54:99:24
May 12 01:04:57 Cubyserve dhcpcd[6094]: bond0: soliciting an IPv6 router
May 12 01:05:00 Cubyserve kernel: igb 0000:03:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
May 12 01:05:00 Cubyserve kernel: bond0: link status definitely up for interface eth1, 1000 Mbps full duplex
May 12 01:05:09 Cubyserve dhcpcd[6094]: bond0: no IPv6 Routers available
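
To get a feel for how often each NIC flapped in an extract like the one above, the link events can be tallied per interface. This is a rough sketch that only recognizes the two kernel message shapes visible in this log (`mlx4_en: ethX: Link Up/Down` and `igb ... NIC Link is Up/Down`). The function name `count_link_events` is made up for illustration, and other drivers will log differently.

```python
import re
from collections import Counter

# Matches only the two message formats seen in the log extract above.
LINK_EVENT = re.compile(
    r"kernel: (?:"
    r"mlx4_en: (?P<mlx_if>\w+): Link (?P<mlx_state>Up|Down)"
    r"|igb \S+ (?P<igb_if>\w+): igb: \w+ NIC Link is (?P<igb_state>Up|Down)"
    r")"
)


def count_link_events(lines):
    """Count (interface, state) link transitions in syslog lines."""
    counts = Counter()
    for line in lines:
        match = LINK_EVENT.search(line)
        if match:
            iface = match.group("mlx_if") or match.group("igb_if")
            state = match.group("mlx_state") or match.group("igb_state")
            counts[(iface, state)] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the extract above.
    sample = [
        "May 12 01:03:06 Cubyserve kernel: igb 0000:03:00.0 eth1: igb: eth1 NIC Link is Down",
        "May 12 01:03:08 Cubyserve kernel: mlx4_en: eth0: Link Down",
        "May 12 01:04:36 Cubyserve kernel: mlx4_en: eth0: Link Up",
        "May 12 01:04:37 Cubyserve kernel: mlx4_en: eth0: Link Down",
    ]
    for (iface, state), n in sorted(count_link_events(sample).items()):
        print(iface, state, n)
```

Feeding the whole syslog through this (for example with `open(path)` as the `lines` argument) would show whether the flapping is concentrated on eth0 or spread across both ports.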

I have configured eth0 and eth1 as active-backup in bond0, so even if eth0 drops, eth1 should still be there...

And then the server effectively does nothing until 9 am, when I come in and first try some stuff at the console, restart the switch (which did not fix it) and then unplug and replug the network cable, which for some reason fixed it...
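
Since eth0 and eth1 are in an active-backup bond, it may help to record which slave the bonding driver considers active while the server is in this state. The Linux bonding driver normally exposes this as text in `/proc/net/bonding/bond0`. The sketch below parses a few fields from that text. The field labels are the ones commonly seen in that file and should be checked against the actual output on the server, and the function name `parse_bond_status` is made up for illustration.

```python
def parse_bond_status(text):
    """Extract mode, active slave and per-slave state from bonding status text."""
    status = {"mode": None, "active_slave": None, "slaves": {}}
    current = None  # name of the slave section currently being read
    for raw in text.splitlines():
        key, sep, value = raw.partition(":")
        if not sep:
            continue  # blank line or a line without a "key: value" shape
        key, value = key.strip(), value.strip()
        if key == "Bonding Mode":
            status["mode"] = value
        elif key == "Currently Active Slave":
            status["active_slave"] = value
        elif key == "Slave Interface":
            current = value
            status["slaves"][current] = {}
        elif key == "MII Status" and current is not None:
            status["slaves"][current]["mii_status"] = value
        elif key == "Link Failure Count" and current is not None:
            status["slaves"][current]["link_failures"] = int(value)
    return status


if __name__ == "__main__":
    try:
        with open("/proc/net/bonding/bond0") as handle:
            print(parse_bond_status(handle.read()))
    except OSError:
        print("no bond0 status file on this machine")
```

If the bond reports an active slave whose MII status is up while the server is still unreachable, that would point away from the cabling and toward the bond or driver state.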

Edited May 12, 2020 by ThanatopsisJSH

polish_pat · May 12, 2020

I see. Well, now I can say that what seemed like a perfect match was more of a happy coincidence.

Before I give more details about what I'm having to deal with, I suggest you make use of the tools at hand, because I'm under the impression your problem might be a driver problem, or perhaps firmware, specifically for the NIC you have. I see a lot of kernel-related messages, and all are normal until something happens with the Mellanox driver (mlx4_en in your log).

1- If not already done, get a syslog server running and tell the UniFi controller to send all syslog and netconsole data to it.

2- In your BIOS, disable IOMMU and any similar functionality that could be enabled by default and/or by the user.

3- Not so relevant to this, but unless you intend to use IPv6, disable it wherever you can. It won't affect anything, and leaving it on might add some unnecessary data to the mix.

I've come to realize in my short experience with Unraid that it doesn't play nice with hardware that doesn't give it direct access to the raw disks. A RAID card, for example, doesn't even show the HDDs to Unraid unless it's in IT mode. I think Unraid likes to be in charge of everything.

Unlink your aggregate link, AKA bond0, use only a single port, and disable the other one in the BIOS if you can.

Now wait and see. If you still have issues, make sure you do a full firmware update using the one big ISO the OEM provides for offline updates.

As for my problem... it's quite different. I have good hopes of seeing your issue solved, or at least well on the way there... mine, hahaha.

My management UI is dead; I can't resolve it. Both tabs at the top show the local UI, one of them through a DDNS. All the other tabs show containers that run off the same IP but use a different port. They all work when accessed locally or remotely. All of the shares are online, and as for the Pi, its status is the same as last night, even though I upgraded it in between to Pi-hole v5.0, Web Interface v5.0, FTL v5.0.

To remove the Pi from the equation, I stopped sending any DNS through it and set everything back to how it was. The only effect was that the IP of the main UI now shows up in my UniFi client list, whereas before it didn't.

Right now, most of my logs point to a Bind problem, not a bond problem like yours... I will update this with some data after a little siesta.

polish_pat · May 18, 2020

OK. Investigation over, case closed.

BURN THE PI!!!

V5 V5 V5 and this FTL offline BS could not be dealt with, so I purged all DHCP and DNS data from my Pi, then from my USG and APs using the CLI. No more issues at all. I will maybe revisit the Pi some day. For now, it seems to be too much of a work in progress to bother with when I run a semi-business, semi-host and data center for my customers, especially since a uBlock Origin extension seemed to do better if what you want is a mostly ad-free surfing experience.

Good luck if you decide to keep it.
