Unraid freezes - cannot ping or access GUI.



Hello,

 

I am having an issue with my Unraid server: every so often it freezes, and I cannot access the GUI or ping its IP address.

 

Here is an error I got when it last froze:

 

Oct 29 09:54:43 GamingNAS nginx: 2020/10/29 09:54:43 [crit] 16840#16840: *11684957 connect() to unix:/var/run/ttyd.sock failed (2: No such file or directory) while connecting to upstream, client: 192.168.1.238, server: , request: "GET /webterminal/token HTTP/1.1", upstream: "http://unix:/var/run/ttyd.sock:/token", host: "192.168.1.185", referrer: "http://192.168.1.185/webterminal/"

 

Also these come up:

 

Oct 29 08:31:26 GamingNAS kernel: IPv6: ADDRCONF(NETDEV_UP): vethb8b4836: link is not ready

Oct 29 08:31:26 GamingNAS kernel: docker0: port 4(vethb8b4836) entered blocking state

Oct 29 08:31:26 GamingNAS kernel: docker0: port 4(vethb8b4836) entered forwarding state

Oct 29 08:31:26 GamingNAS kernel: docker0: port 4(vethb8b4836) entered disabled state

Oct 29 08:31:28 GamingNAS kernel: eth0: renamed from vethe362d11

Oct 29 08:31:28 GamingNAS kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethb8b4836: link becomes ready

Oct 29 08:31:28 GamingNAS kernel: docker0: port 4(vethb8b4836) entered blocking state

Oct 29 08:31:28 GamingNAS kernel: docker0: port 4(vethb8b4836) entered forwarding state

Oct 29 08:32:03 GamingNAS kernel: veth3f100e6: renamed from eth0

Oct 29 08:32:03 GamingNAS kernel: docker0: port 6(vethfbf7f5f) entered disabled state

Oct 29 08:32:03 GamingNAS kernel: docker0: port 6(vethfbf7f5f) entered disabled state

Oct 29 08:32:03 GamingNAS kernel: device vethfbf7f5f left promiscuous mode

Oct 29 08:32:03 GamingNAS kernel: docker0: port 6(vethfbf7f5f) entered disabled state

Oct 29 08:32:07 GamingNAS kernel: docker0: port 5(veth07afff9) entered blocking state

Oct 29 08:32:07 GamingNAS kernel: docker0: port 5(veth07afff9) entered disabled state

Oct 29 08:32:07 GamingNAS kernel: device veth07afff9 entered promiscuous mode

 

Any ideas?

 

Thanks

gamingnas-diagnostics-20201029-0959.zip

Link to comment

Hi all,

 

Restarted the PC this morning to run memtest and it would not run. Went to go back into Unraid and it got stuck at the screen in the attached photo. I haven't got a recent backup; my most recent is from June. I am wondering if the old USB is corrupt, so I tried copying the files from the old USB to a new one and making it bootable, but I get the same result. What's my best course of action? Is there a way to find the correct drive assignments from my diagnostics?

 

I have another USB stick. Is there a way to create a fresh boot drive on it and copy the config folder over?

 

Thanks

IMG20201030090429.jpg

Link to comment

Got it sorted. Created another Unraid boot disk, copied the config folder over from the old USB stick and booted it. Everything is back up; the only issue is Jellyfin throwing a wobbler. Small price to pay. Downloaded a flash backup straight away.
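For anyone who lands here with the same problem, the copy step described above can be scripted. This is only a rough sketch: it assumes both sticks are already mounted, that the new stick was already made bootable with the normal Unraid tooling, and the mount points shown in the comment are hypothetical.

```python
import shutil
from pathlib import Path


def copy_config(old_flash: Path, new_flash: Path) -> Path:
    """Copy the config folder from the old flash drive onto a freshly created one."""
    src = old_flash / "config"
    dst = new_flash / "config"
    if not src.is_dir():
        raise FileNotFoundError(f"no config folder found at {src}")
    # dirs_exist_ok (Python 3.8+) lets the copy overwrite the default
    # config folder that a freshly created boot stick already contains.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


# Example call (hypothetical mount points, adjust to your system):
# copy_config(Path("/mnt/old_usb"), Path("/mnt/new_usb"))
```

Only the config folder is copied, so the boot files on the new stick stay untouched.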

 

Ran a memtest. Had to download and set up Ultimate Boot CD as memtest wouldn't boot. Ran it for 8 hours, no errors.

Link to comment
  • 2 weeks later...

Hi, I bought another 8GB of RAM and ran a further memtest, still with no errors, but I am still getting the occasional freeze where I cannot access the GUI or ping the server. It's strange: I have a Windows PC connected through the same switch, and when I am connected to it through RDP, that goes down as well and I lose the connection.

 

I thought it was the PC, but I haven't turned it on for a few days, and I am still getting freezes.

 

This is from the logs, I still don't understand what the blocking state is about.

 

Nov 12 19:51:07 GamingNAS kernel: eth0: renamed from vethc8ba37b
Nov 12 19:51:07 GamingNAS kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth6fd2a8f: link becomes ready
Nov 12 19:51:07 GamingNAS kernel: docker0: port 5(veth6fd2a8f) entered blocking state
Nov 12 19:51:07 GamingNAS kernel: docker0: port 5(veth6fd2a8f) entered forwarding state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered blocking state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered disabled state
Nov 12 20:19:37 GamingNAS kernel: device vnet3 entered promiscuous mode
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered blocking state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered forwarding state
Nov 12 20:20:10 GamingNAS kernel: br0: port 3(vnet1) entered disabled state
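As far as I understand it, these "blocking / forwarding / disabled" lines are the Linux bridge logging port state changes as virtual interfaces (veth for Docker containers, vnet for VMs) join or leave docker0 and br0, which matches the later observation in this thread that they line up with VMs being started and stopped. A minimal sketch, assuming syslog lines in exactly the format pasted above, that groups the transitions per interface so they are easier to compare against freeze times:

```python
import re
from collections import defaultdict

# Matches kernel bridge messages such as:
#   "Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered blocking state"
BRIDGE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s\S+\skernel:\s"
    r"(?P<bridge>\S+):\sport\s\d+\((?P<iface>[^)]+)\)\sentered\s(?P<state>\w+)\sstate$"
)


def bridge_transitions(lines):
    """Group bridge port state changes by (bridge, interface)."""
    history = defaultdict(list)
    for line in lines:
        m = BRIDGE_RE.match(line.strip())
        if m:
            history[(m["bridge"], m["iface"])].append((m["ts"], m["state"]))
    return dict(history)


# Log lines copied from the post above.
SAMPLE = """\
Nov 12 19:51:07 GamingNAS kernel: eth0: renamed from vethc8ba37b
Nov 12 19:51:07 GamingNAS kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth6fd2a8f: link becomes ready
Nov 12 19:51:07 GamingNAS kernel: docker0: port 5(veth6fd2a8f) entered blocking state
Nov 12 19:51:07 GamingNAS kernel: docker0: port 5(veth6fd2a8f) entered forwarding state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered blocking state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered disabled state
Nov 12 20:19:37 GamingNAS kernel: device vnet3 entered promiscuous mode
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered blocking state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered forwarding state
Nov 12 20:20:10 GamingNAS kernel: br0: port 3(vnet1) entered disabled state
""".splitlines()

if __name__ == "__main__":
    for (bridge, iface), events in bridge_transitions(SAMPLE).items():
        print(bridge, iface, " -> ".join(state for _, state in events))
```

If the timestamps only ever cluster around container or VM starts and stops, the messages are probably a side effect rather than the cause of the freeze.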

 

Any help would be appreciated.

Theldron

Link to comment

I have found out it was an Npcap Loopback Adapter on my laptop causing the crashing: not the Unraid server or the PC, but my laptop. I only found this out because my step-son has come back to live with us and was watching a film on the server when it 'crashed', yet he was able to keep watching it.

 

Looked at the event viewer on my laptop and saw loads of entries for the loopback adapter during the crashes.

 

Sorry, can't believe I missed that. Nobody else uses the server but me normally.

Edited by Theldron
Link to comment
  • 1 month later...

I have the same symptoms.

Asrock TRX40 Creator, AMD Threadripper 3960X, 128GB of unbuffered ECC Samsung M391A4G43MB1-CTD.

 

All network accessibility on the server seems to suddenly severely degrade and/or eventually fail completely: No SSH, no SMB, and no ping responses.

Console seems responsive, but last time this occurred it became non-responsive and I had to hard-power down. Link stays up at 1Gbit to my existing switch.

 

I have the same/similar log entries with the interfaces - which seem to correlate when VMs are powered on/off (and it's the bond0 interface, so likely unrelated, just like @JorgeB said)

Jan 1 16:51:28 server kernel: br0: port 1(bond0) entered blocking state

Jan 1 16:51:28 server kernel: br0: port 1(bond0) entered forwarding state

 

 

However: Physically bouncing the NIC by unplugging and re-plugging the ethernet cable into my switch seemed to immediately resolve the issue. Either the NIC driver is faulty (it's a 10GBE PHY from Aquantia), or the Netgear managed switch I have is faulty and causing me grief. I suspect it's my existing network switch, which is also not suitable for my application.

 

I may be chasing two issues here, but I believe the previous issue was due to overclocking that RAM to 3200 (as many have been known to do successfully, and I've burned in for about 2 weeks of heavy memory load during initial build testing). I backed that off and I haven't had a hard-lock since.

Edited by bitcore
Link to comment
11 hours ago, bitcore said:

I have the same symptoms.


Hi, I am still getting these issues. Randomly I will get lock-ups where the GUI is unresponsive and I cannot ping the server or access the shares. I have tried different switches, which seems to solve the issue for a short time, then it rears its head again. I have tried PCIe NICs as well, but that did not help.

 

What is very strange in my case is that any PCs I access remotely go down as well, and I cannot reach anything through my WAP. The Amazon Fire Cube stays up.

 

I am wondering if it could be my router, which is an EE hub (I am in the UK), or whether something is randomly swamping the network. I don't know.

 

I might set up Wireshark to monitor things.
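Before going as far as a packet capture, a lighter option is to log which devices drop off and when, so the outage windows can be compared across the server, the router and the other PCs. A rough sketch, not a finished tool: it probes with a plain TCP connect rather than ICMP ping, 192.168.1.185 is the server address from the log earlier in the thread, and the router address and ports in the usage comment are assumptions.

```python
import socket
import time
from datetime import datetime


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def outage_windows(samples):
    """Collapse (timestamp, ok) samples into (start, end) outage windows.

    end is None if the target was still down at the last sample.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts  # first failed probe opens a window
        elif ok and start is not None:
            windows.append((start, ts))  # first good probe closes it
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def monitor(targets, interval=10.0, rounds=6):
    """Probe every (host, port) target once per interval, for a number of rounds."""
    samples = {target: [] for target in targets}
    for _ in range(rounds):
        now = datetime.now()
        for host, port in targets:
            samples[(host, port)].append((now, is_reachable(host, port)))
        time.sleep(interval)
    return samples


# Example (addresses and ports are assumptions, adjust to your network):
# samples = monitor([("192.168.1.185", 80), ("192.168.1.254", 80)])
# for target, data in samples.items():
#     print(target, outage_windows(data))
```

If every target shows the same outage windows, the problem is more likely in the shared path (router, switch, or the machine running the script) than in any single device.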

 

 

Edited by Theldron
Link to comment
On 1/1/2021 at 10:12 PM, bitcore said:

I have the same symptoms.


Hi @bitcore

 

I think I have found the cause: my router. I have been using the ISP router, and when I started checking its logs after a disconnect, I found that it has been randomly disconnecting devices. Not just the Unraid server, but laptops, WAPs, everything. I have ordered a new TP-Link AC2800 router, so I'm hoping that will sort the issue.

Link to comment

This issue appeared again today. I have a pfSense VM handling internet, NAT, etc., with a quad-port NIC passed through to the VM. That does not go down and the internet stays stable.

However, the NIC that all other Unraid services operate on (webGUI, shares, other VMs, etc.) seemed to suddenly stop working, with no other entries in dmesg that I can see. This time, bouncing the physical port (disconnect, reconnect) did not help. Neither did rebooting my Netgear switch (its firmware is also fully up to date).

 

This is the 10Gig Aquantia AQC107 NIC. I hope I can get to the bottom of this so I don't have to waste a PCI-E slot on another NIC. Hopefully the upcoming 6.9 release will include better driver support and resolve this - this is a fairly new platform.
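When comparing behaviour across releases, it can help to confirm which kernel driver is actually bound to the interface. A small sketch, assuming the standard Linux sysfs layout where a physical NIC exposes a device/driver symlink; the interface names in the usage comment are guesses, and as far as I know the in-kernel driver for Aquantia AQC parts is called atlantic.

```python
import os
from typing import Optional


def nic_driver(iface: str, sysfs_root: str = "/sys/class/net") -> Optional[str]:
    """Return the kernel driver bound to iface, or None if there is none."""
    link = os.path.join(sysfs_root, iface, "device", "driver")
    if not os.path.islink(link):
        # Virtual interfaces (bridges, bonds) have no device/driver link.
        return None
    # The symlink points at the driver directory; its name is the driver name.
    return os.path.basename(os.path.realpath(link))


# Example (interface names are assumptions):
# for name in ("eth0", "bond0", "br0"):
#     print(name, nic_driver(name))
```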

Link to comment
