Dextabrewa Posted December 11, 2023 (edited)
Hi, I need help digging into an ongoing problem I've been having with server outages. These outages happen a couple of times a week: the server is still on with fans running, but the Unraid dashboard is unavailable both locally and through Unraid Connect. All of my reverse proxy Docker containers are also inaccessible, and the router shows the server as offline, indicating some kind of network outage or an outright crash of the OS. Typically only a server restart brings it back; Unraid is able to shut down gracefully if I press the power button once.
brulu-diagnostics-20231210-1931.zip
Edited December 11, 2023 by Dextabrewa
MrGrey Posted December 11, 2023
4 hours ago, Dextabrewa said: "These outages happen a couple times a week,"
Can you nail down a reason for the "couple" of times? Neighbors coming/going (i.e. devices joining and leaving)?
4 hours ago, Dextabrewa said: "the router shows that the server is offline"
What router? Make/model? Yours or the ISP's?
4 hours ago, Dextabrewa said: "Typically only a server restart can bring it back"
Well, you're in REALLY big trouble when the computer doesn't start at all. We/you will figure it out.
MrGrey.
JorgeB Posted December 11, 2023
Unfortunately there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
itimpi Posted December 11, 2023
You might want to try enabling the syslog server (probably using the Mirror to flash option) and posting the file it produces after the issue occurs, so we can see if anything was logged leading up to the problem.
Dextabrewa (Author) Posted December 11, 2023
8 hours ago, MrGrey said: "Can you nail down a reason for the "couple" of times? Neighbors coming/going (ie: devices coming and leaving)? What router?... Make/Model, Yours/ISPs? Well... Your in REALLY big trouble when the computer doesn't start at all. We/You will figure it out. MrGrey."
Router: Ubiquiti Networks UniFi Dream Machine
ISP: Google Fiber 1 Gig
There doesn't seem to be an obvious reason for the crashes. The computer always reboots without a problem and the Unraid dashboard comes back up. I never have to do a hard restart either; it's always graceful.
Dextabrewa (Author) Posted December 11, 2023
3 hours ago, itimpi said: "You might want to try enabling the syslog server (probably using the Mirror to flash option) and post the file that produces after the issue occurs so we can see if anything was logged leading up to the problem."
I have a syslog file that goes back a couple months that I can also share here: syslog-192.168.1.164.log
Dextabrewa (Author) Posted December 11, 2023
5 hours ago, JorgeB said: "Unfortunately there's nothing relevant logged, this usually points to a hardware issue, one thing you can try is to boot the server in safe mode with all docker/VMs disabled, let it run as a basic NAS for a few days, if it still crashes it's likely a hardware problem, if it doesn't start turning on the other services one by one."
If it were hardware related, I would expect the server to just shut down or the OS to crash. Instead, the OS seems to keep running while the network goes down. I replaced the onboard motherboard networking with a PCI network card and the outage still happens.
xoC Posted December 11, 2023 (edited)
I have had the same thing happening since an update somewhere in September/October, IIRC. Usually at these times, the dashboard shows this:
The dashboard becomes unresponsive, and so does everything else. I can't even power it off; it doesn't respond (even to a single press of the hardware button). The terminal doesn't load either, so I can't see what is happening with htop.
Edit: Over the last few months, I managed to get some diagnostics during one of these hangs, and they also show nothing.
Edit 2: It was actually before September.
Edited December 11, 2023 by xoC
itimpi Posted December 11, 2023
1 hour ago, Dextabrewa said: "If it was hardware related I would expect the server to just shut down or the OS to crash, the OS seems to be running but the network goes down. I replaced the onboard motherboard networking with a PCI Network card and the outage still happens."
The syslog seems to show the eth1 connection going up and down frequently. Have you tried changing the network settings so that eth0 is not bonded with eth1 to see if that helps?
JorgeB Posted December 11, 2023
1 hour ago, Dextabrewa said: "I have a syslog file that goes back a couple months that I can also share here:"
There are various call traces, and btrfs is detecting a lot of data corruption. I suggest running memtest.
Dextabrewa (Author) Posted December 11, 2023 (edited)
4 minutes ago, JorgeB said: "Various call traces and btrfs is detecting a lot of data corruption, suggest running memtest."
I replaced the SSD with the data corruption yesterday, so those errors should be gone going forward. I had suspected it was the cause of the system crashes/freezes, but the server went offline again just hours after the drive replacement.
Edited December 11, 2023 by Dextabrewa
JorgeB Posted December 11, 2023
8 minutes ago, Dextabrewa said: "I replaced the SSD with the data corruption yesterday"
Almost certainly the SSD was not the problem.
Dextabrewa (Author) Posted December 11, 2023 (edited)
22 minutes ago, JorgeB said: "Almost certainly the SSD was not the problem."
The faulty SSD had errors when I checked the SSD log. My theory is that they were caused by a bad set of RAM I was using months prior, which I have since replaced with a known good set. But I will run memtest again to verify.
Edited December 11, 2023 by Dextabrewa
Dextabrewa (Author) Posted December 11, 2023
2 hours ago, JorgeB said: "Almost certainly the SSD was not the problem."
The lines below seem to be the common error whenever my server is unreachable. My eth0 is set correctly in the interface rules to the new PCI network card.
Network card: Gigabit Dual NIC with Intel 82576 Chip, 1Gb
Router: UniFi Dream Machine
Any suggestions? I've attached updated diagnostics that include a crash.
Dec 11 13:32:10 Brulu kernel: igb 0000:29:00.0 eth0: igb: eth0 NIC Link is Down
Dec 11 13:32:10 Brulu kernel: br0: port 1(eth0) entered disabled state
Dec 11 13:32:14 Brulu ntpd[1812]: Deleting interface #1 br0, 192.168.1.164#123, interface stats: received=12, sent=12, dropped=0, active_time=264 secs
brulu-diagnostics-20231211-1338.zip
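For anyone wanting to check how often a link drops, the kernel messages above can be tallied with a short script. This is only a rough sketch: it assumes the link-state lines follow the "ethN NIC Link is Up/Down" wording seen in the post (the "Up" form is an assumption, since only a "Down" line appears here), and the function name count_link_events is made up for illustration.

```python
import re
from collections import Counter

# Matches kernel link-state messages such as:
#   "... kernel: igb 0000:29:00.0 eth0: igb: eth0 NIC Link is Down"
# The "Up" variant is assumed to use the same wording.
LINK_RE = re.compile(r"kernel: .*?\b(eth\d+) NIC Link is (Up|Down)")


def count_link_events(lines):
    """Return a Counter keyed by (interface, state) for link-state lines."""
    counts = Counter()
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# The three log lines quoted in the post above.
SAMPLE = [
    "Dec 11 13:32:10 Brulu kernel: igb 0000:29:00.0 eth0: igb: eth0 NIC Link is Down",
    "Dec 11 13:32:10 Brulu kernel: br0: port 1(eth0) entered disabled state",
    "Dec 11 13:32:14 Brulu ntpd[1812]: Deleting interface #1 br0, 192.168.1.164#123, "
    "interface stats: received=12, sent=12, dropped=0, active_time=264 secs",
]

if __name__ == "__main__":
    for (iface, state), n in sorted(count_link_events(SAMPLE).items()):
        print(f"{iface} {state}: {n}")
```

To run it against a saved log instead of the sample, pass an open file, for example count_link_events(open("syslog-192.168.1.164.log", errors="replace")). Many "Down" events on one interface would point toward the cable, switch port, or NIC rather than the OS.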
Dextabrewa (Author) Posted December 11, 2023
3 hours ago, JorgeB said: "Various call traces and btrfs is detecting a lot of data corruption, suggest running memtest."
I ran memtest for 1 hour with no errors. If you feel I should run it longer, I can.
Dextabrewa (Author) Posted December 12, 2023 (Solution)
Swapping out the Ethernet cable seems to have fixed the issue for now. Will report back in a few days!