Unraid freezes - cannot ping or access GUI.

Theldron · October 29, 2020

Hello,

I am having an issue with my Unraid server: every so often it freezes, and I can neither access the GUI nor ping its IP address.

Here is an error I got when it last froze:

Oct 29 09:54:43 GamingNAS nginx: 2020/10/29 09:54:43 [crit] 16840#16840: *11684957 connect() to unix:/var/run/ttyd.sock failed (2: No such file or directory) while connecting to upstream, client: 192.168.1.238, server: , request: "GET /webterminal/token HTTP/1.1", upstream: "http://unix:/var/run/ttyd.sock:/token", host: "192.168.1.185", referrer: "http://192.168.1.185/webterminal/"

Also these come up:

Oct 29 08:31:26 GamingNAS kernel: IPv6: ADDRCONF(NETDEV_UP): vethb8b4836: link is not ready

Oct 29 08:31:26 GamingNAS kernel: docker0: port 4(vethb8b4836) entered blocking state

Oct 29 08:31:26 GamingNAS kernel: docker0: port 4(vethb8b4836) entered forwarding state

Oct 29 08:31:26 GamingNAS kernel: docker0: port 4(vethb8b4836) entered disabled state

Oct 29 08:31:28 GamingNAS kernel: eth0: renamed from vethe362d11

Oct 29 08:31:28 GamingNAS kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethb8b4836: link becomes ready

Oct 29 08:31:28 GamingNAS kernel: docker0: port 4(vethb8b4836) entered blocking state

Oct 29 08:31:28 GamingNAS kernel: docker0: port 4(vethb8b4836) entered forwarding state

Oct 29 08:32:03 GamingNAS kernel: veth3f100e6: renamed from eth0

Oct 29 08:32:03 GamingNAS kernel: docker0: port 6(vethfbf7f5f) entered disabled state

Oct 29 08:32:03 GamingNAS kernel: device vethfbf7f5f left promiscuous mode

Oct 29 08:32:03 GamingNAS kernel: docker0: port 6(vethfbf7f5f) entered disabled state

Oct 29 08:32:07 GamingNAS kernel: docker0: port 5(veth07afff9) entered blocking state

Oct 29 08:32:07 GamingNAS kernel: docker0: port 5(veth07afff9) entered disabled state

Oct 29 08:32:07 GamingNAS kernel: device veth07afff9 entered promiscuous mode

Any ideas?

Thanks

gamingnas-diagnostics-20201029-0959.zip

Squid · October 29, 2020

Are you overclocking?

Model name:                      Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Stepping:                        7
CPU MHz:                         4019.685

Overclocks are never recommended when stability is required.

It would also be a good idea to run a memtest (at minimum a single pass).

Theldron · October 29, 2020

No I've never overclocked. Is there a memtest app on unraid?

ChatNoir · October 29, 2020

1 hour ago, Theldron said:

No I've never overclocked. Is there a memtest app on unraid?

At boot, you can select memtest in the menu.

Theldron · October 29, 2020

Cheers. I have 3 sticks of RAM in there at the moment. I know ideally I need 2 pairs, so I have ordered another stick to go in.

Theldron · October 30, 2020

Hi all,

Restarted the PC this morning to run memtest and it would not run. Went to go back into Unraid and it got stuck (see attached). I haven't got a recent backup; my most recent is from June. I am wondering if the old USB is corrupt, so I tried copying the files from the old USB to a new one and making it bootable, but I get the same result. What's my best course of action? Is there a way to find the correct drive assignments from my diagnostics?

I have another USB stick. Is there a way to create a boot drive on it and copy the config folder over?

Thanks

Theldron · October 30, 2020

Got it sorted. Created another Unraid boot disk, copied the config over from the old USB stick and booted it. Everything is back up, the only issue is JellyFin throwing a wobbler. Small price to pay. Downloaded a flash backup straight away.

Ran a memtest. Had to download and set up Ultimate Boot CD, as memtest wouldn't boot. Ran it for 8 hours, no errors.

Theldron · November 14, 2020

Hi, I bought another 8GB of RAM and ran a further memtest, still with no errors, but I am still getting the occasional freeze where I cannot access the GUI or ping the server. It's strange: I have a Windows PC connected through the same switch, and when I am connected to it through RDP, that goes down as well and I lose the connection.

I thought it was the PC, but I haven't turned it on for a few days, and I am still getting freezes.

This is from the logs, I still don't understand what the blocking state is about.

Nov 12 19:51:07 GamingNAS kernel: eth0: renamed from vethc8ba37b
Nov 12 19:51:07 GamingNAS kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth6fd2a8f: link becomes ready
Nov 12 19:51:07 GamingNAS kernel: docker0: port 5(veth6fd2a8f) entered blocking state
Nov 12 19:51:07 GamingNAS kernel: docker0: port 5(veth6fd2a8f) entered forwarding state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered blocking state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered disabled state
Nov 12 20:19:37 GamingNAS kernel: device vnet3 entered promiscuous mode
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered blocking state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered forwarding state
Nov 12 20:20:10 GamingNAS kernel: br0: port 3(vnet1) entered disabled state

Any help would be appreciated.

Theldron

JorgeB · November 14, 2020

Those entries are normal.
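
One way to see why entries like these are routine is to reduce them to the last state reported for each bridge port. The sketch below is illustrative only: the regular expression and the `last_port_states` helper are assumptions written against the log lines quoted in this thread, not an Unraid tool.

```python
import re

# Matches kernel bridge messages such as:
#   "kernel: br0: port 5(vnet3) entered forwarding state"
# The pattern is an assumption based on the lines quoted in this thread.
PORT_STATE = re.compile(
    r"kernel: (?P<bridge>\S+): port (?P<port>\d+)\((?P<iface>[^)]+)\) "
    r"entered (?P<state>\w+) state"
)


def last_port_states(lines):
    """Return {(bridge, interface): last state seen} for bridge port messages."""
    states = {}
    for line in lines:
        match = PORT_STATE.search(line)
        if match:
            states[(match["bridge"], match["iface"])] = match["state"]
    return states


# Lines copied from the post above.
SAMPLE = """\
Nov 12 19:51:07 GamingNAS kernel: docker0: port 5(veth6fd2a8f) entered blocking state
Nov 12 19:51:07 GamingNAS kernel: docker0: port 5(veth6fd2a8f) entered forwarding state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered blocking state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered disabled state
Nov 12 20:19:37 GamingNAS kernel: device vnet3 entered promiscuous mode
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered blocking state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered forwarding state
Nov 12 20:20:10 GamingNAS kernel: br0: port 3(vnet1) entered disabled state
""".splitlines()

for (bridge, iface), state in last_port_states(SAMPLE).items():
    print(f"{bridge} {iface}: {state}")
```

For the sample, `veth6fd2a8f` and `vnet3` end in `forwarding` and `vnet1` ends in `disabled`, which is typically what a container or VM starting and another one stopping looks like.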

Theldron · November 14, 2020

Hi all,

At 6.22pm UK time, I have just had another freeze. I have attached my diagnostics folder. I couldn't see anything in the logs.

Any help would be appreciated.

Thanks

gamingnas-diagnostics-20201114-1824.zip

Edited November 14, 2020 by Theldron

JorgeB · November 16, 2020

Nothing out of the ordinary in the logs, so it's most likely a hardware issue. You can try running in safe mode without any Docker containers or VMs; if it still crashes like that, it pretty much confirms it's a hardware problem.

Theldron · November 17, 2020

I have found out that an Npcap Loopback Adapter on my laptop was causing the crashing: not the Unraid server or the PC, but my laptop. I only found this out because my step-son has come back to live with us, and he was watching a film on the server when it 'crashed', yet he was able to keep watching it.

Looked at the event viewer on my laptop and saw loads of entries for the loopback adapter during the crashes.

Sorry, can't believe I missed that. Nobody else uses the server but me normally.

Edited November 17, 2020 by Theldron

bitcore · January 1, 2021

I have the same symptoms.

Asrock TRX40 Creator, AMD Threadripper 3960X, 128GB of unbuffered ECC Samsung M391A4G43MB1-CTD.

All network accessibility on the server seems to suddenly severely degrade and/or eventually fail completely: No SSH, no SMB, and no ping responses.

Console seems responsive, but last time this occurred it became non-responsive and I had to hard-power down. Link stays up at 1Gbit to my existing switch.

I have the same/similar log entries with the interfaces - which seem to correlate when VMs are powered on/off (and it's the bond0 interface, so likely unrelated, just like @JorgeB said)

Jan 1 16:51:28 server kernel: br0: port 1(bond0) entered blocking state

Jan 1 16:51:28 server kernel: br0: port 1(bond0) entered forwarding state

However: Physically bouncing the NIC by unplugging and re-plugging the ethernet cable into my switch seemed to immediately resolve the issue. Either the NIC driver is faulty (it's a 10GBE PHY from Aquantia), or the Netgear managed switch I have is faulty and causing me grief. I suspect it's my existing network switch, which is also not suitable for my application.

I may be chasing two issues here, but I believe the previous issue was due to overclocking that RAM to 3200 (as many have been known to do successfully, and I've burned in for about 2 weeks of heavy memory load during initial build testing). I backed that off and I haven't had a hard-lock since.

Edited January 1, 2021 by bitcore

JorgeB · January 2, 2021

9 hours ago, bitcore said:

Jan 1 16:51:28 server kernel: br0: port 1(bond0) entered blocking state

Jan 1 16:51:28 server kernel: br0: port 1(bond0) entered forwarding state

These are normal.

Theldron · January 2, 2021

Hi, I am still getting these issues. Randomly I will get locks where the GUI is unresponsive and I cannot ping the server or access the shares. I have tried different switches, which seems to solve the issue for a short time, then it rears its head again. I have tried PCIe NICs as well, but they did not help.

What is very strange in my case is that any PCs I access remotely go down as well, and I cannot access anything through my WAP. The Amazon Fire Cube stays up.

I am wondering if it could be my router which is an EE hub (I am in the UK) or something is swamping the network randomly. I don't know.

I might set up Wireshark to monitor things.
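
A lighter first step than a packet capture might be to ping several devices at once and note which ones go silent together, to separate a server freeze from a network-wide drop. The following is a minimal sketch, assuming a Linux machine with the standard `ping` command (`-c` for count, `-W` for timeout in seconds). The router address is a placeholder, and the helper names are made up for this example.

```python
import subprocess
import time

# 192.168.1.185 is the server address seen in the nginx log earlier in the
# thread; the router address is an assumption and must be adjusted.
HOSTS = {
    "unraid": "192.168.1.185",
    "router": "192.168.1.1",
}


def is_reachable(address, timeout_s=1):
    """Send a single ICMP echo request using the Linux `ping` command."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def summarize(status):
    """Describe one round of results, given {name: reachable?}."""
    down = sorted(name for name, up in status.items() if not up)
    if not down:
        return "all reachable"
    if len(down) == len(status):
        return "everything unreachable (suspect the network or this machine)"
    return "unreachable: " + ", ".join(down)


def monitor(interval_s=10):
    """Print one timestamped summary line per round until interrupted."""
    while True:
        status = {name: is_reachable(addr) for name, addr in HOSTS.items()}
        print(time.strftime("%Y-%m-%d %H:%M:%S"), summarize(status), flush=True)
        time.sleep(interval_s)
```

Calling `monitor()` from a machine that stays on and redirecting the output to a file gives a timeline to compare against the next freeze. If every host drops at the same moment, the server itself is a less likely culprit.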

Edited January 2, 2021 by Theldron

Theldron · January 8, 2021

Hi @bitcore

I think I have found the cause: my router. I have been using the ISP router, and when I started checking its logs after a disconnect, I found that it has been randomly disconnecting devices: not just the Unraid server, but laptops, WAPs, everything. I have ordered a new TP-Link AC2800 router, so I am hoping that will sort the issue.

theruck · January 8, 2021

Also check whether you have installed and configured S3 sleep. I was wondering why my Unraid server was not accessible, and it was just sleeping; sending a WOL packet got it sorted straight away.
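
For anyone wanting to try the WOL route, here is a minimal sketch of sending a Wake-on-LAN magic packet from Python. The payload layout (six `0xFF` bytes followed by the target MAC address repeated 16 times) is the standard magic-packet format. The function names, the broadcast address, and UDP port 9 are choices made for this example, and the sleeping machine's NIC and BIOS must have Wake-on-LAN enabled for it to work.

```python
import socket


def build_magic_packet(mac):
    """Return the 102-byte Wake-on-LAN payload for a MAC like 'aa:bb:cc:dd:ee:ff'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 6-byte MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)
    # Six 0xFF bytes followed by the MAC address repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP; port 9 is a common convention."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Usage would be something like `send_magic_packet("00:11:22:33:44:55")`, with that placeholder replaced by the server's real MAC address.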

bitcore · January 10, 2021

This issue appeared again today. I have a PFSense VM handling internet+NAT+etc with a quad port nic passed through to the VM. This does not go down and internet stays stable.

However, the NIC that all other Unraid services operate on (webGUI, shares, other VMs, etc.) seemed to suddenly stop working, with no other entries in dmesg that I can see. This time, bouncing the physical port (disconnect, reconnect) did not help. Neither did rebooting my Netgear switch (its firmware is also fully up to date).

This is the 10Gig Aquantia AQC107 NIC. I hope I can get to the bottom of this so I don't have to waste a PCI-E slot on another NIC. Hopefully the upcoming 6.9 release will include better driver support and resolve this - this is a fairly new platform.
