Nexal Posted April 3 (edited)

Hello everyone,

I recently upgraded my Unraid server from a QNAP TS-873A to a custom-built machine with significantly higher capabilities. However, ever since I switched machines, I have had nothing but problems getting consistent access to Unraid.

To clarify: the Web GUI, Samba, and Docker web-based services are all available when the server first boots, but after what seems like a random amount of time (maybe 30-45 min on average) they all become unavailable simultaneously, often while I am actively using them. I am still able to access the server via the command line sitting at the server itself (not through the network), which shows me the server is still operational and has not frozen. From the CLI I can validate that the network connection is still live, and I can ping both my router and client computers successfully.

The new specs are below (in case it helps diagnose the issue):

- Intel Core i7-14700K CPU
- Crucial Pro RAM 96GB Kit (2x48GB) DDR5 5600MT/s
- GIGABYTE Z790 AERO G Motherboard
- 10Gb SFP+ PCI-E Network Card NIC, with Broadcom BCM57810S Chip, Dual SFP+ Port
- LSI SAS9300-16I 9300-16I 16-PORT 12GBPS PCIe HBA (for Unraid array)
- 8x Seagate Exos X20 ST20000NM007D 20TB 7.2K RPM SATA 6Gb/s 3.5in Hard Drive
- 3x TEAMGROUP MP34 4TB 3D NAND TLC NVMe (2x for cache drives, 1x for VMs)
- 2x SAMSUNG 870 EVO SATA SSD 250GB 2.5” (for mirrored Proxmox install, unmapped in Unraid)
- ZOTAC Gaming GeForce RTX 4080 Super (for Windows VMs)
- EVGA Supernova 1000 P3, 80 Plus Platinum 1000W

As both an experiment and an information-gathering step, I have tried running Unraid both on bare metal and as a Proxmox VM with PCIe passthrough for my HBA and NVMe drives. Unfortunately, I am experiencing the same issues both with and without Unraid being virtualized.

What I have tried:

- Running the memtest that comes with Unraid. After multiple hours the tests passed with 0 errors.
- Turning off all VMs and Docker containers
- Using the Tips and Tweaks plugin to disable NIC Flow Control and NIC Offload (per this post: https://forums.unraid.net/topic/133745-network-issue-server-sometimes-cant-be-reached-without-reseting-network-connection/?do=findComment&comment=1215477)

I have gone through the logs and there was one line that stood out to me, but it's possible it is just coincidental:

- Apr 1 13:58:46 NAS ntpd[1459]: no peer for too long, server running free now

Diagnostics are attached below, some from when the Web GUI does briefly return, and others from running `diagnostics` from the command line while the Web GUI was unavailable.

Any advice on what to do next would be greatly appreciated, as I have been unsuccessful in diagnosing the issue on my own for nearly two weeks.

Thanks in advance.

Attachments:
- nas-diagnostics-20240401-1434.zip
- nas-diagnostics-20240401-1214.zip
- nas-diagnostics-20240402-0948.zip
- nas-diagnostics-20240403-0722.zip

Edited April 3 by Nexal: cleaning up typos
JorgeB Posted April 3

Enable the syslog server and post that after it happens again.
Nexal (Author) Posted April 3

@JorgeB I will try this. To clarify the procedure: I turn on the syslog server and set it to save the files to USB. Once the disconnection occurs, I should use the command line to shut down the computer. After plugging the USB stick into my client machine, I should have a file somewhere that I need to upload? Will this look different from the diagnostic files?

Thanks in advance for the guidance.
JorgeB Posted April 3

If you choose the mirror to flash drive option it will be in the /logs folder.
Nexal (Author) Posted April 3

I just lost connectivity to the Unraid Web UI. I am uncertain whether this is the same issue as before: while I also lost access to one of the two Docker services I was running, my Plex service was still able to load new video data into the buffer, which seems different from previous times this has occurred. This is also the first time I have seen the following in the logs:

Error: eth0 does not have a valid IP address

Curious to get an outside perspective. Thanks!

Attachment: syslog
JorgeB Posted April 3

31 minutes ago, Nexal said: "Error: eth0 does not have a valid IP address"

That would explain the issue, and suggests a network problem.
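Since the thread keeps returning to two specific log messages, a small filter over the mirrored syslog can make it easier to line their timestamps up against the moment the services drop. This is only a minimal sketch: the function name `find_events` and the pattern list are illustrative, the regular expression assumes the classic "Mon D HH:MM:SS host process: message" layout seen in the ntpd line quoted earlier, and the location of the syslog file on your system is something you would need to supply yourself.

```python
import re
from typing import Iterable, List, Tuple

# Messages quoted in this thread; extend the list as new ones turn up.
PATTERNS = [
    "does not have a valid IP address",
    "no peer for too long",
]

# Assumed layout: "Apr 1 13:58:46 NAS ntpd[1459]: message"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+([^:]+):\s*(.*)$"
)


def find_events(
    lines: Iterable[str], patterns: Iterable[str] = PATTERNS
) -> List[Tuple[str, str, str]]:
    """Return (timestamp, process, message) for each line containing a pattern."""
    wanted = list(patterns)
    events = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue  # skip anything that is not in the assumed layout
        timestamp, _host, process, message = match.groups()
        if any(pattern in message for pattern in wanted):
            events.append((timestamp, process, message))
    return events
```

Passing an open file handle for the saved syslog to `find_events` yields one tuple per matching line, which can then be compared by eye with the time the Web GUI stopped responding.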
Nexal (Author) Posted April 3

Yeah, weird that Plex continued to function despite this. Though as I said, this is the first time I have experienced this particular error in the logs. I am going to reboot and try again, and I'll post the syslog once more when it occurs again.

Thank you for the support and separate pair of eyes.
Nexal (Author) Posted April 3

Ok. The web GUI and services stopped again, but this time with the more commonly seen message:

no peer for too long, server running free now

I tried to let Plex play media so I'd know when the connection dropped. However, as long as media was queued, the Plex service continued to function. I had to pause the media for about 30 min to take a work call, and when I returned I could no longer resume the media or access Plex. It is acting as if some small pause in utilization causes the server to somehow disable its ability to accept new messages. The timing doesn't match up with the logs, though: the "no peer for too long, server running free now" message appeared right before 12:00pm, but the content continued to stream successfully until I paused the media at 1:30pm. I don't see anything in the syslog after the `no peer for too long, server running free now` message besides me signing in to try to debug further after the disconnection took place.

What's also weird is that if I am at the server's CLI, I can successfully ping my client's IP address and the router, but once disconnected, I am unable to ping the server's IP address from the client machine. However, after I ping the client from the server, on many occasions this restores the client's ability to connect to the Unraid server.

If this were caused by a network issue, why would I be able to connect without issues for the first 30-45 min after the server has been started?

Another thing I had tried previously was raising the DHCP lease from 1 hour to 1 day, thinking that maybe this was the cause, but that made no observable difference in the ability to connect to the server.

My network topology is pretty simple: I have a pfSense Netgate router connected to my ISP and to a 10Gbps unmanaged switch. Both the client and server are connected to the two 10Gbps ports on the switch. Given the switch is unmanaged, I don't think there are any changes I am able to make to it. And the router uses the same configuration I had before upgrading, which had previously worked fine.

I'm really at a loss here, as to me none of this adds up to an obvious cause of the issue. Any further thoughts from the community would be very much appreciated.

Thanks!

Attachment: syslog
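Because the streaming session masked the exact moment the server stopped accepting new connections, one way to pin down the timing is a timestamped probe run from the client machine. The following is a rough sketch rather than a definitive tool: the function names are made up for illustration, and the ports you would pass in (for example 80 for the Web GUI, 445 for Samba, 32400 for Plex) are assumptions based on common defaults that may not match this particular server.

```python
import socket
import time
from datetime import datetime
from typing import Dict, Iterable


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a new TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and unreachable hosts
        return False


def probe_once(host: str, ports: Iterable[int], timeout: float = 2.0) -> Dict[int, bool]:
    """Probe each port once and return {port: reachable}."""
    return {port: port_open(host, port, timeout) for port in ports}


def watch(host: str, ports: Iterable[int], interval: float = 30.0, rounds: int = 0) -> None:
    """Print one timestamped status line per round; rounds=0 runs until interrupted."""
    port_list = list(ports)
    count = 0
    while rounds == 0 or count < rounds:
        status = probe_once(host, port_list)
        summary = " ".join(
            f"{port}={'up' if ok else 'DOWN'}" for port, ok in status.items()
        )
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} {host} {summary}", flush=True)
        count += 1
        if rounds == 0 or count < rounds:
            time.sleep(interval)
```

Calling `watch` with the server's address and a list of ports, then leaving it running, gives a log whose first `DOWN` line can be compared against the syslog timestamps. Each probe opens a fresh connection, so it tests the "new connections" path rather than an already established stream.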
Nexal (Author) Posted April 3 (edited)

Regarding this being a network issue: when I use the same port / MAC address / static IP with my Proxmox installation, I don't have any connection issues with the Proxmox hypervisor. However, I have the same connection issues as described above with Unraid when it's running in a Proxmox VM, using a different MAC address and static IP than when it runs on bare metal. This leads me to believe that the issue is somehow with Unraid and not my networking hardware or configuration, as Unraid is the common denominator.

Edited April 3 by Nexal
JorgeB Posted April 4

11 hours ago, Nexal said: "no peer for too long, server running free now"

Not sure what this means, but a search turned up this thread, see if it helps:
Nexal (Author) Posted April 5 (edited)

While going through @JorgeB's last post, I had an epiphany. The issue started occurring in a similar timeframe to when I added the GluetunVPN Docker container to my Unraid server. After turning off Docker entirely, the Web GUI and Samba services have now been consistently accessible for 12+ hours.

I have no idea by what mechanism a single Docker container can take an entire host machine offline, but so far this change seems to be doing the trick. I only have 3 containers running, two of which are non-VPN containers that have been running for more than 2 years without issues. This makes me believe the newly added GluetunVPN is the culprit.

I will be experimenting in the days to come to confirm this is the issue, but so far things are heading in the right direction.

Edited April 5 by Nexal: fixing typos, rephrasing for clarity
Nexal (Author) Posted April 5

Still investigating, but I turned on the Docker service, immediately stopped the VPN container, and set it to no longer auto-start when the array is brought online. During this process, the Web GUI remained responsive. However, just like before, after a certain period of time the Web GUI once again became unresponsive (connection timed out).

I am unsure if this is because the VPN container was turned on even for that brief amount of time, or whether there is a more serious issue with the Docker daemon itself. Could the container still be causing the issue if it's stopped? Or could it have somehow corrupted my Docker daemon?

Restarting now to do more information gathering.
Nexal (Author) Posted April 5

My previous conclusion appears to have been incorrect. I have removed the GluetunVPN container, and simply enabling Docker and the two (previously stable) containers (Plex & Komga) causes the Unraid Web GUI to become inaccessible after some period of time.
Nexal (Author) Posted April 11

I am extremely dismayed to report that the new motherboard and RAM did not resolve the kernel errors I have been seeing and reporting. I am at a complete loss as to what to do next.

This makes me think that the issue wasn't hardware related. If it's not hardware related, that means it has to be software related. I am starting to wonder if I need to find a solution other than Unraid. Not my preference, but I am not seeing how else to get a stable system...

Any thoughts would be greatly appreciated.
Nexal (Author) Posted April 11 (edited)

It was only on the screen for less than a second, but I might have just seen a similar kernel error when trying to boot into Proxmox. If so, then it's likely not a software error...

Could it be my CPU? Never heard of a brand new CPU being bad. But what else could cause these issues if the mobo and RAM have been replaced and the issue persists?

Edited April 11 by Nexal
Vr2Io Posted April 12 (edited)

You could try using the onboard LAN and see if it makes any difference.

You should also check the logs and post them; we can't tell what kind of error it is based on the description above. If Unraid is running under Proxmox, there should be a virtual console screen you can check.

Edited April 12 by Vr2Io
JorgeB Posted April 12

10 hours ago, Nexal said: "Could it be my CPU? Never heard of a brand new CPU being bad."

It's not common, but I did see that happening before with other users.
Nexal (Author, Solution) Posted April 14

Thanks for all the wonderful responses. I have replaced the CPU and the errors seem to have gone away. I will continue monitoring, but I haven't seen these errors in the 12 hours since the CPU was replaced.