Theldron Posted October 29, 2020
Hello,
I am having an issue with my Unraid server where every so often it will freeze and I cannot access the GUI or ping the IP address. Here is an error that I got when it last froze:
Oct 29 09:54:43 GamingNAS nginx: 2020/10/29 09:54:43 [crit] 16840#16840: *11684957 connect() to unix:/var/run/ttyd.sock failed (2: No such file or directory) while connecting to upstream, client: 192.168.1.238, server: , request: "GET /webterminal/token HTTP/1.1", upstream: "http://unix:/var/run/ttyd.sock:/token", host: "192.168.1.185", referrer: "http://192.168.1.185/webterminal/"
These also come up:
Oct 29 08:31:26 GamingNAS kernel: IPv6: ADDRCONF(NETDEV_UP): vethb8b4836: link is not ready
Oct 29 08:31:26 GamingNAS kernel: docker0: port 4(vethb8b4836) entered blocking state
Oct 29 08:31:26 GamingNAS kernel: docker0: port 4(vethb8b4836) entered forwarding state
Oct 29 08:31:26 GamingNAS kernel: docker0: port 4(vethb8b4836) entered disabled state
Oct 29 08:31:28 GamingNAS kernel: eth0: renamed from vethe362d11
Oct 29 08:31:28 GamingNAS kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethb8b4836: link becomes ready
Oct 29 08:31:28 GamingNAS kernel: docker0: port 4(vethb8b4836) entered blocking state
Oct 29 08:31:28 GamingNAS kernel: docker0: port 4(vethb8b4836) entered forwarding state
Oct 29 08:32:03 GamingNAS kernel: veth3f100e6: renamed from eth0
Oct 29 08:32:03 GamingNAS kernel: docker0: port 6(vethfbf7f5f) entered disabled state
Oct 29 08:32:03 GamingNAS kernel: docker0: port 6(vethfbf7f5f) entered disabled state
Oct 29 08:32:03 GamingNAS kernel: device vethfbf7f5f left promiscuous mode
Oct 29 08:32:03 GamingNAS kernel: docker0: port 6(vethfbf7f5f) entered disabled state
Oct 29 08:32:07 GamingNAS kernel: docker0: port 5(veth07afff9) entered blocking state
Oct 29 08:32:07 GamingNAS kernel: docker0: port 5(veth07afff9) entered disabled state
Oct 29 08:32:07 GamingNAS kernel: device veth07afff9 entered promiscuous mode
Any ideas? Thanks
gamingnas-diagnostics-20201029-0959.zip
Squid Posted October 29, 2020
Are you overclocking?
Model name: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Stepping: 7
CPU MHz: 4019.685
Overclocks are never recommended when stability is required. It would also be a good idea to run a memtest (at minimum a single pass).
Theldron Posted October 29, 2020
No, I've never overclocked. Is there a memtest app on Unraid?
ChatNoir Posted October 29, 2020
1 hour ago, Theldron said: No I've never overclocked. Is there a memtest app on unraid?
At boot, you can select memtest in the menu.
Theldron Posted October 29, 2020
Cheers. I have got 3 sticks of RAM in there at the moment. I know ideally I need 2 pairs, so I have ordered another to go in.
Theldron Posted October 30, 2020
Hi all,
Restarted the PC this morning to run memtest and it would not run. Went to go back into Unraid and it stuck at the attached. I haven't got a recent backup; my most recent is from June. I am wondering if the old USB is corrupt, so I have tried to copy the files from the old USB to the new one and make it bootable, but I get the same result.
What's my best course of action? From my diagnostics, is there a way to find the correct drive assignments? I have another USB; is there a way to create it and copy the config folder over?
Thanks
Theldron Posted October 30, 2020
Got it sorted. Created another Unraid boot disk, copied the config over from the old USB stick and booted it. Everything is back up; the only issue is Jellyfin throwing a wobbler. Small price to pay. Downloaded a flash backup straight away.
Ran a memtest. Had to download and set up Ultimate Boot CD, as memtest wouldn't boot. Ran it for 8 hours, no errors.
Theldron Posted November 14, 2020
Hi,
I bought another 8GB of RAM and ran a further memtest, still with no errors, but I am still getting the occasional freeze where I cannot access the GUI or ping the server. It's strange, as I have a Windows PC connected through the same switch, and when I am connected to that through RDP, it goes down as well and I lose connection. I thought it was the PC, but I haven't turned it on for a few days and I am still getting freezes.
This is from the logs. I still don't understand what the blocking state is about.
Nov 12 19:51:07 GamingNAS kernel: eth0: renamed from vethc8ba37b
Nov 12 19:51:07 GamingNAS kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth6fd2a8f: link becomes ready
Nov 12 19:51:07 GamingNAS kernel: docker0: port 5(veth6fd2a8f) entered blocking state
Nov 12 19:51:07 GamingNAS kernel: docker0: port 5(veth6fd2a8f) entered forwarding state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered blocking state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered disabled state
Nov 12 20:19:37 GamingNAS kernel: device vnet3 entered promiscuous mode
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered blocking state
Nov 12 20:19:37 GamingNAS kernel: br0: port 5(vnet3) entered forwarding state
Nov 12 20:20:10 GamingNAS kernel: br0: port 3(vnet1) entered disabled state
Any help would be appreciated.
Theldron
JorgeB Posted November 14, 2020
Those entries are normal.
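Since the bridge and veth messages quoted in this thread are routine, it can help to hide them when reading a syslog so that anything unusual stands out. The following is only a minimal sketch: the patterns are derived from the log lines posted above, and the function names `is_routine` and `unusual_lines` are made up for illustration. It is not an official list of harmless kernel messages.

```python
import re

# Patterns for messages that show up whenever a Docker container or VM
# starts or stops. Assumption: these are derived only from the lines
# quoted in this thread, not from any official list.
ROUTINE_PATTERNS = [
    re.compile(r"port \d+\(\S+\) entered (blocking|forwarding|disabled) state"),
    re.compile(r"device \S+ (entered|left) promiscuous mode"),
    re.compile(r"\S+: renamed from \S+"),
    re.compile(r"IPv6: ADDRCONF\(NETDEV_(UP|CHANGE)\): \S+: link (is not|becomes) ready"),
]


def is_routine(line: str) -> bool:
    """Return True if the line matches one of the routine network patterns."""
    return any(p.search(line) for p in ROUTINE_PATTERNS)


def unusual_lines(lines):
    """Keep only non-empty lines that are not routine bridge/veth chatter."""
    return [ln for ln in lines if ln.strip() and not is_routine(ln)]


sample = [
    "Oct 29 08:31:26 GamingNAS kernel: docker0: port 4(vethb8b4836) entered blocking state",
    "Oct 29 08:31:28 GamingNAS kernel: eth0: renamed from vethe362d11",
    "Oct 29 08:32:03 GamingNAS kernel: device vethfbf7f5f left promiscuous mode",
    "Oct 29 09:54:43 GamingNAS nginx: [crit] connect() to unix:/var/run/ttyd.sock failed",
]

for line in unusual_lines(sample):
    print(line)
```

To use it on a real log, pass the lines of the syslog file from the diagnostics zip to `unusual_lines` instead of the `sample` list.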
Theldron Posted November 14, 2020
Hi all,
At 6:22pm UK time, I had another freeze. I have attached my diagnostics folder. I couldn't see anything in the logs.
Any help would be appreciated. Thanks
gamingnas-diagnostics-20201114-1824.zip
Edited November 14, 2020 by Theldron
JorgeB Posted November 16, 2020
Nothing out of the ordinary in the logs, so it is most likely a hardware issue. You can try running in safe mode without any Docker containers or VMs; if it still crashes like that, it pretty much confirms it's a hardware problem.
Theldron Posted November 17, 2020
I have found out it was an NPCAP Loopback Adapter on my laptop causing the crashing: not the Unraid server or the PC, but my laptop. I only found this out because my step-son has come back to live with us and he was watching a film on the server when it 'crashed', but he was able to keep watching it. I looked at the Event Viewer on my laptop and saw loads of entries for the loopback adapter during the crashes.
Sorry, can't believe I missed that. Nobody else normally uses the server but me.
Edited November 17, 2020 by Theldron
bitcore Posted January 1, 2021
I have the same symptoms. ASRock TRX40 Creator, AMD Threadripper 3960X, 128GB of unbuffered ECC Samsung M391A4G43MB1-CTD.
All network accessibility on the server seems to suddenly severely degrade and/or eventually fail completely: no SSH, no SMB, and no ping responses. The console seems responsive, but the last time this occurred it became non-responsive and I had to hard power down. Link stays up at 1Gbit to my existing switch.
I have the same/similar log entries with the interfaces, which seem to correlate with VMs being powered on/off (and it's the bond0 interface, so likely unrelated, just like @JorgeB said):
Jan 1 16:51:28 server kernel: br0: port 1(bond0) entered blocking state
Jan 1 16:51:28 server kernel: br0: port 1(bond0) entered forwarding state
However: physically bouncing the NIC by unplugging and re-plugging the ethernet cable into my switch seemed to immediately resolve the issue. Either the NIC driver is faulty (it's a 10GBE PHY from Aquantia), or the Netgear managed switch I have is faulty and causing me grief. I suspect it's my existing network switch, which is also not suitable for my application.
I may be chasing two issues here, but I believe the previous issue (the hard lock) was due to overclocking the RAM to 3200 (as many have been known to do successfully, and I had burned it in for about 2 weeks of heavy memory load during initial build testing). I backed that off and I haven't had a hard lock since.
Edited January 1, 2021 by bitcore
JorgeB Posted January 2, 2021
9 hours ago, bitcore said:
Jan 1 16:51:28 server kernel: br0: port 1(bond0) entered blocking state
Jan 1 16:51:28 server kernel: br0: port 1(bond0) entered forwarding state
These are normal.
Theldron Posted January 2, 2021
11 hours ago, bitcore said: I have the same symptoms. [...]
Hi,
I am still getting these issues. Randomly I will get locks where the GUI is unresponsive and I cannot ping or access the shares. I have tried different switches, which seem to solve the issue for a short time, then it rears its head again. I have tried PCIe NICs as well, but they did not help.
What is very strange in my case is that any PCs I access remotely go down as well, and I cannot access anything through my WAP. The Amazon Fire TV Cube stays up.
I am wondering if it could be my router, which is an EE hub (I am in the UK), or whether something is swamping the network randomly. I don't know. I might set up Wireshark to monitor things.
Edited January 2, 2021 by Theldron
Theldron Posted January 8, 2021
On 1/1/2021 at 10:12 PM, bitcore said: I have the same symptoms. [...]
Hi @bitcore
I think I have found the cause: my router. I have been using the ISP router, and when I started checking its logs after a disconnect, I found that it has been randomly disconnecting devices. Not just the Unraid server, but laptops, WAPs, everything. I have ordered a new TP-Link AC2800 router, so I am hoping that will sort the issue.
theruck Posted January 8, 2021
Also check whether you have installed and configured S3 sleep. I was wondering why my Unraid was not accessible, and it was just sleeping, so sending a WOL packet got it sorted straight away.
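For anyone who wants to test the sleep theory, a Wake-on-LAN magic packet is simple to build: 6 bytes of 0xFF followed by the target MAC address repeated 16 times, usually sent as a UDP broadcast. Here is a minimal sketch; the MAC address in the usage note is a made-up placeholder, and UDP port 9 is a common convention rather than a requirement.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build a WOL magic packet: 6 x 0xFF, then the MAC repeated 16 times."""
    cleaned = mac.replace(":", "").replace("-", "")
    if len(cleaned) != 12:
        raise ValueError(f"expected a 6-byte MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(cleaned)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP on the local network."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

Calling `send_wol("00:11:22:33:44:55")` with the server's real MAC address sends the packet. Wake-on-LAN also has to be enabled in the BIOS and on the NIC for the server to respond.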
bitcore Posted January 10, 2021
This issue appeared again today. I have a pfSense VM handling internet, NAT, etc. with a quad-port NIC passed through to the VM. This does not go down and the internet stays stable.
However, the NIC that all other Unraid services operate on (webGUI, shares, other VMs, etc.) seemed to suddenly stop working, with no other entries in dmesg that I can see. This time, bouncing the physical port (disconnect, reconnect) did not help. Neither did rebooting my Netgear switch (its firmware is also fully up to date).
This is the 10Gig Aquantia AQC107 NIC. I hope I can get to the bottom of this so I don't have to waste a PCI-E slot on another NIC. Hopefully the upcoming 6.9 release will include better driver support and resolve this; this is a fairly new platform.
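When the link light stays on but traffic stops, it can be useful to record what the kernel itself reports for the interface at the moment of the failure. On Linux this is exposed under `/sys/class/net/<iface>/`. The following is a rough sketch, assuming the standard sysfs attributes `operstate`, `carrier` and `speed`; the interface name `eth0` in the usage note is a guess and may differ on a given system.

```python
from pathlib import Path


def read_link_state(iface: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Read operstate, carrier and speed for one interface from sysfs.

    Attributes that cannot be read (missing interface, or the kernel
    refusing the read while the interface is down) are reported as None.
    """
    base = Path(sysfs_root) / iface
    state = {}
    for attr in ("operstate", "carrier", "speed"):
        try:
            state[attr] = (base / attr).read_text().strip()
        except OSError:
            state[attr] = None
    return state
```

Running `read_link_state("eth0")` from the local console during an outage and comparing the result with a healthy reading shows whether the kernel still believes the link is up, which helps separate a driver or NIC problem from a switch or router problem.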