AcidReign Posted August 28, 2018 (edited)
Hi everybody, I need help... again. My newly built server crashes and I have no idea what is causing it.
The first crash happened in the middle of the night. I guess the server was idling at the time, because I had not even installed Plex yet. It then happened again during the parity check that followed the first crash. I switched on troubleshooting mode in the Fix Common Problems plugin, but then I had to reboot, so it went off again. The server crashed again this morning (7 hours ago). I started troubleshooting mode again, and now it has crashed once more during the parity check (while I was also working on the server). See the attached log files.
What I mean by "crashing": the WebUI and all Dockers stop responding, and SSH won't let me do anything anymore. The first time it happened I could still log in on the local machine, and it crashed 10 seconds later. During the later crashes I could not log in on the local machine at all. Pressing the power button did not initiate a graceful shutdown, so I had to hold the button down.
The hardware is completely new and I have a UPS installed, so it should not be a power issue.
Please help. I paid a load of money for this server and now my happiness is really going downhill.
Acid
Attachments: FCPsyslog_tail.txt, optimusprime-diagnostics-20180828-1631.zip
Edited August 28, 2018 by AcidReign
John_M Posted August 28, 2018
2 hours ago, AcidReign said: "The hardware is completely new"
Have you run memtest from the boot menu to give the RAM a workout?
John_M Posted August 28, 2018
It looks as though you're being attacked. Do you recognise the IP address? Its whois record shows that it's currently held by China Unicom. Get your server behind a firewall and don't forward any ports to it.
Aug 28 12:12:41 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2
Aug 28 12:12:42 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2
Aug 28 12:12:44 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2
Aug 28 12:12:44 OptimusPrime sshd[15200]: error: maximum authentication attempts exceeded for root from 112.85.42.102 port 21600 ssh2 [preauth]
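To gauge how widespread login attempts like these are, the sshd lines can be tallied per source address. The following is only a minimal sketch and was not posted in the thread: the function name `count_failed_logins` and the regular expression are assumptions chosen for illustration, and the pattern is based solely on the "Failed password" line format shown above.

```python
import re
from collections import Counter

# Matches sshd lines of the form shown in the post above, e.g.
#   "... sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2"
# The optional "invalid user " prefix covers attempts on accounts that do not exist.
FAILED_RE = re.compile(
    r"sshd\[\d+\]: Failed password for (?:invalid user )?(\S+) from (\S+) port \d+"
)


def count_failed_logins(lines):
    """Return a Counter mapping (source_ip, user) to the number of failed attempts."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            user, ip = match.groups()
            counts[(ip, user)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Aug 28 12:12:41 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2",
        "Aug 28 12:12:42 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2",
        "Aug 28 12:12:44 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2",
        "Aug 28 12:12:44 OptimusPrime sshd[15200]: error: maximum authentication attempts exceeded for root from 112.85.42.102 port 21600 ssh2 [preauth]",
    ]
    # Print the busiest (ip, user) pairs first.
    for (ip, user), n in count_failed_logins(sample).most_common():
        print(ip, user, n)
```

To use it on a real log, pass an open file object instead of the sample list; the path to the syslog depends on the system and is not given in the thread.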
Squid Posted August 28, 2018
https://www.abuseipdb.com/check/112.85.42.102
AcidReign Posted August 28, 2018 (Author)
I closed ports 21 and 22. The remaining open ports are for Transmission and Plex, which I don't think I could be attacked through. So you think this could be the reason for the whole server freezing?
For how long should I run memtest?
Squid Posted August 28, 2018
4 minutes ago, AcidReign said: "For how long should I run memtest?"
Ideally 24 hours. Does anything appear on the locally attached monitor when it crashes? What is the power supply?
AcidReign Posted August 28, 2018 (Author)
The local monitor just shows the standard non-GUI login screen for unRAID, and I can't type anything. The power source is an APC SmartUPS 1500, which is fully charged and only at 10% load, so this should not be a problem. OK, I'll let memtest run tomorrow morning.
Squid Posted August 28, 2018
Power supply, not UPS.
AcidReign Posted August 28, 2018 (Author)
Ah OK, sorry. I don't have much information about the power supply. It's a 650W 80+ unit, and I can't open the server right now. Is that enough info?
Squid Posted August 28, 2018
For the limited number of drives that you have, yes, that should easily be sufficient. Assuming, of course, that a Google search for the manufacturer brings up a website in English. Anything else I wouldn't trust.
AcidReign Posted August 29, 2018 (Author)
So memtest ran for 2 complete passes and did not show any errors. Any further ideas?
John_M Posted August 29, 2018
1 hour ago, AcidReign said: "Any further ideas?"
Investigate the call traces in your syslog. It looks as though they are macvlan related, which is not something I use. I'll tell you what I'd do, and it's advice I give over and over: I would break the problem into smaller, more manageable pieces. Disable dockers and VMs and run the server as a plain old NAS for a while. Once you're happy with that, start enabling things gradually. That way you stand a better chance of narrowing down the problem.
Does your power supply have a single +12 volt rail? I'd be interested to know what make and model it is. An expensive server deserves a quality power supply, but people often compromise there.
AcidReign Posted August 29, 2018 (Author, edited)
Sorry for the stupid question, but what is macvlan? I did not install anything like that. Does it come with unRAID?
Edit: Google points me to something about using different IPs for containers. So is this the "Bridge" or "br0" thing from dockers?
The power supply should be a good one, but I'll check that with the manufacturer of the server because I cannot open the case right now.
Edited August 29, 2018 by AcidReign
John_M Posted August 29, 2018
Yes. My docker containers, for example, all use the same IP address as unRAID itself, but provide their services on different TCP ports. I have no need to give them individual IP addresses.
AcidReign Posted August 29, 2018 (Author)
Hm, I have two dockers that I need to run on separate IPs. They did not work with just a port, because the hostname was checked.
AcidReign Posted August 30, 2018 (Author, edited)
The server crashed again today. This time I was not home, so it sat for some time before I did the hard reboot, and I think that is why it behaved differently this time.
UptimeRobot told me that the server went offline some time before 16:30 (I was monitoring the Plex port, so maybe it was only the Plex docker). The log file goes on until 18:02. I did the reset at 18:50, so it sat there for almost an hour doing nothing at all.
This time the local screen showed the following lines (also see the attached picture):
crond[2909]: exit status 137 from user root /etc/rc.d/rc.diskinfo --daemon &> /dev/null
crond[2909]: exit status 137 from user root /usr/local/emhttp/plugins/dynamixs/scripts/monitor &> /dev/null
Can anyone tell me what that means and what the log files say this time? When I search for that error I find another thread here that has the same issue but was never solved. The only similarity I see is that I also have the UnifiController docker running.
Thank you all for your help. I really appreciate it.
Acid
Edit: Sorry, I had the wrong logs attached. They are correct now.
Attachments: FCPsyslog_tail.txt, optimusprime-diagnostics-20180830-1553.zip
Edited August 30, 2018 by AcidReign
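As a hint on the "exit status 137" lines: under the usual shell convention, an exit status above 128 means the process was terminated by a signal, with the signal number being the status minus 128. On Linux, 137 - 128 = 9, which is SIGKILL. The sketch below decodes a status this way. It assumes crond is reporting a shell-style status, which the thread does not confirm, and the helper name `describe_exit_status` is made up for illustration.

```python
import signal


def describe_exit_status(status):
    """Describe a shell-style exit status (0-255).

    Assumption: values above 128 follow the common shell convention of
    128 + signal number. The thread does not confirm how crond derived 137.
    """
    if status == 0:
        return "success"
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # No named signal with this number on the current platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with error code {status}"


if __name__ == "__main__":
    print(describe_exit_status(137))
```

A SIGKILL can come from several sources, for example the kernel's out-of-memory killer or a shutdown sequence, so the status alone does not identify the cause of the crash.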
AcidReign Posted August 30, 2018 (Author)
And just 10 minutes ago I had another crash. This time I was sitting right next to the server and tried to shut it down from the local monitor, but had no luck. The log files again show a lot of entries like the ones quoted below. What are they, and what can I do about them?
Aug 30 22:12:12 OptimusPrime kernel: INFO: rcu_bh detected stalls on CPUs/tasks:
Aug 30 22:12:12 OptimusPrime kernel: 6-...: (1 GPs behind) idle=98a/140000000000000/0 softirq=3443318/3678502 fqs=59996
Aug 30 22:12:12 OptimusPrime kernel: (detected by 26, t=240007 jiffies, g=-243, c=-244, q=6)
Aug 30 22:12:12 OptimusPrime kernel: Sending NMI from CPU 26 to CPUs 6:
Aug 30 22:12:12 OptimusPrime kernel: NMI backtrace for cpu 6
Aug 30 22:12:12 OptimusPrime kernel: CPU: 6 PID: 16400 Comm: nc2 Tainted: G W 4.14.49-unRAID #1
Aug 30 22:12:12 OptimusPrime kernel: Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 2.0b 02/28/2018
Aug 30 22:12:12 OptimusPrime kernel: task: ffff880838bb0000 task.stack: ffffc90005d28000
Aug 30 22:12:12 OptimusPrime kernel: RIP: 0010:nf_nat_setup_info+0x198/0x5d1 [nf_nat]
Aug 30 22:12:12 OptimusPrime kernel: RSP: 0018:ffff88046ff83758 EFLAGS: 00000286
Aug 30 22:12:12 OptimusPrime kernel: RAX: ffff88044b8b9706 RBX: ffff880449f45b80 RCX: 00000000cb652c51
Aug 30 22:12:12 OptimusPrime kernel: RDX: ffff88044b880000 RSI: 00000000b603c035 RDI: 000000008b215da7
Aug 30 22:12:12 OptimusPrime kernel: RBP: ffff88046ff83828 R08: 0000000000000038 R09: ffff88040b75c700
Aug 30 22:12:12 OptimusPrime kernel: R10: 0000000000000348 R11: 0000000000000000 R12: ffff88044cca7960
Aug 30 22:12:12 OptimusPrime kernel: R13: ffff88046ff83838 R14: ffffffff81c88480 R15: 0000000000000000
Aug 30 22:12:12 OptimusPrime kernel: FS: 000014ca3e6aa700(0000) GS:ffff88046ff80000(0000) knlGS:0000000000000000
Aug 30 22:12:12 OptimusPrime kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 30 22:12:12 OptimusPrime kernel: CR2: 000014b6626ef000 CR3: 000000083dfb0004 CR4: 00000000007606e0
Aug 30 22:12:12 OptimusPrime kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 30 22:12:12 OptimusPrime kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 30 22:12:12 OptimusPrime kernel: PKRU: 55555554
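When a syslog contains many traces like this one, a useful first step is to see which kernel function and module the stalled CPU was executing each time. In the trace above, the RIP line names `nf_nat_setup_info` in the `nf_nat` module. The following is a rough sketch for tallying such lines. The function name `summarize_rip` and the regular expression are assumptions based only on the single RIP line quoted in this post, so other kernel versions may format the line differently.

```python
import re
from collections import Counter

# Matches kernel lines of the form quoted above, e.g.
#   "kernel: RIP: 0010:nf_nat_setup_info+0x198/0x5d1 [nf_nat]"
# The trailing "[module]" part is optional because built-in kernel code has none.
RIP_RE = re.compile(
    r"kernel: RIP: [0-9a-f]{4}:(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def summarize_rip(lines):
    """Count how often each (symbol, module) pair appears in RIP lines."""
    counts = Counter()
    for line in lines:
        match = RIP_RE.search(line)
        if match:
            module = match.group("module") or "core kernel"
            counts[(match.group("symbol"), module)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Aug 30 22:12:12 OptimusPrime kernel: NMI backtrace for cpu 6",
        "Aug 30 22:12:12 OptimusPrime kernel: RIP: 0010:nf_nat_setup_info+0x198/0x5d1 [nf_nat]",
    ]
    for (symbol, module), n in summarize_rip(sample).most_common():
        print(symbol, module, n)
```

If the same networking module shows up in most traces, that would be consistent with the suggestion elsewhere in the thread to look at the container network configuration, though a tally like this is only a pointer and not a diagnosis.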
John_M Posted August 30, 2018
rc.diskinfo is used by the preclear plugin. Non-zero return codes usually mean an error condition. Did you fix the problem that was producing the call traces?
AcidReign Posted August 31, 2018 (Author)
How do I fix those?
John_M Posted August 31, 2018
On 8/29/2018 at 6:24 PM, John_M said: "Investigate the call traces in your syslog. It looks as though they are macvlan related, which is not something I use. I'll tell you what I'd do - and it's advice I give over and over - I would break the problem into smaller, more manageable pieces. Disable dockers and VMs and run the server as a plain old NAS for a while. Once you're happy with that start enabling things gradually. That way you stand a better chance of narrowing down the problem."
AcidReign Posted August 31, 2018 (Author)
Well, I deactivated the two dockers running on separate IPs today. The server has an uptime of 16 hours now. Let's see if it lasts. But what then? Being able to give some dockers their own IPs was one of the major reasons I chose unRAID.
John_M Posted August 31, 2018
Perhaps there's a configuration error. Other people use macvlans successfully, so while you're waiting to see if your uptime continues you could read up on them. As I said above, I don't use them myself.
Rudde Posted June 2, 2019
I'm also experiencing this exact issue, also on a new motherboard, RAM, CPU, and USB drive. I made this thread about it:
Did you ever fix this?