Help - New Build unRAID Server stalling/crashing

AcidReign · August 28, 2018

Hi everybody,

I need help... again.

My newly built server keeps crashing and I have no idea what is causing it.

The first time, it happened in the middle of the night. I guess it was idling at the time because I had not even installed Plex.

Then it happened again during the parity check after the first crash.

I switched on troubleshooting mode in the Fix Common Problems plugin, but then I had to reboot, so it went off again.

The server crashed again this morning (7 hours ago). I then started troubleshooting mode again.

Now it crashed again during the parity check (and I was working on the server too).

See the log files attached.

What I mean by crashing:

The WebUI and all Dockers stop responding. SSH won't let me do anything anymore.

The first time that happened I could still log in on the local machine, and it crashed 10 seconds later.

During the last crashes I could not log in on the local machine. Pushing the power button did not initiate a graceful shutdown, so I had to hold the button down.

The hardware is completely new and I have a UPS installed, so it should not be a power issue.

Please help - I paid a load of money for this server and now my happiness is really going downhill

Acid

FCPsyslog_tail.txt

optimusprime-diagnostics-20180828-1631.zip

Edited August 28, 2018 by AcidReign

John_M · August 28, 2018

2 hours ago, AcidReign said:

The hardware is completely new

Have you run memtest from the boot menu to give the RAM a workout?

John_M · August 28, 2018

It looks as though you're being attacked. Do you recognise the IP address? Its whois record shows that it's currently held by China Unicom. Get your server behind a firewall and don't forward any ports to it.

Aug 28 12:12:41 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2
Aug 28 12:12:42 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2
Aug 28 12:12:44 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2
Aug 28 12:12:44 OptimusPrime sshd[15200]: error: maximum authentication attempts exceeded for root from 112.85.42.102 port 21600 ssh2 [preauth]
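To see how many attempts each address is making, the failed-password lines can be tallied per source IP. This is only a minimal Python sketch, assuming the syslog keeps the sshd format quoted above; the name `count_failed_logins` and the sample list are illustrative and not part of unRAID.

```python
import re
from collections import Counter

# Matches sshd "Failed password" lines in the format quoted above.
# Group 1 is the user name, group 2 is the source address.
FAILED_RE = re.compile(
    r"sshd\[\d+\]: Failed password for (?:invalid user )?(\S+) from (\S+) port \d+"
)


def count_failed_logins(lines):
    """Return a Counter mapping source IP -> number of failed password attempts."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Sample taken from the syslog excerpt in this thread.
sample = [
    "Aug 28 12:12:41 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2",
    "Aug 28 12:12:42 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2",
    "Aug 28 12:12:44 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2",
    "Aug 28 12:12:44 OptimusPrime sshd[15200]: error: maximum authentication attempts exceeded for root from 112.85.42.102 port 21600 ssh2 [preauth]",
]

for ip, attempts in count_failed_logins(sample).most_common():
    print(ip, attempts)
```

To run it against a real log, pass an open file object instead of `sample`. The path to the syslog depends on where the file was saved.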

Squid · August 28, 2018

https://www.abuseipdb.com/check/112.85.42.102

AcidReign · August 28, 2018

I closed ports 21 and 22.

The remaining open ports (Transmission and Plex) should not be anything I could be attacked through. So you think this could be the reason for the whole server freezing?

For how long should I run a memtest?

Squid · August 28, 2018

4 minutes ago, AcidReign said:

For how long should I run a memtest?

Ideally 24 hours.

Does anything appear on the locally attached monitor when it crashes? What is the power supply?

AcidReign · August 28, 2018

The locally attached monitor just shows the standard non-GUI login screen for unRAID, and I can't type anything.

The power source is an APC SmartUPS 1500, which is fully charged and only has 10% load. So this should not be a problem.

OK, I'll let memtest run tomorrow morning.

Squid · August 28, 2018

Power Supply, not UPS

AcidReign · August 28, 2018

Ah ok, sorry. I don't have much information about the power source. It's a 650W 80+

I can't open the server right now. Is that enough info?

Squid · August 28, 2018

For the limited number of drives that you have, yes, that should easily be sufficient. Assuming, of course, that a Google search for the manufacturer brings up a website in English. Anything else I wouldn't trust.

AcidReign · August 29, 2018

So memtest ran for 2 complete passes and did not show any errors.

Any further ideas?

John_M · August 29, 2018

1 hour ago, AcidReign said:

Any further ideas?

Investigate the call traces in your syslog. It looks as though they are macvlan related, which is not something I use. I'll tell you what I'd do - and it's advice I give over and over - I would break the problem into smaller, more manageable pieces. Disable dockers and VMs and run the server as a plain old NAS for a while. Once you're happy with that start enabling things gradually. That way you stand a better chance of narrowing down the problem.

Does your power supply have a single +12 volt rail? I'd be interested to know what make and model it is. An expensive server deserves a quality power supply but people often compromise there.

AcidReign · August 29, 2018

Sorry for the stupid question, but what is macvlan? I did not install anything like that. Does it come with unRAID?

Edit: Google points me to something like using different IPs for containers. So is this the "Bridge" or "br0" thing from dockers?

The power supply should be a good one, but I'll check that with the manufacturer of the server because I cannot open the case right now.

Edited August 29, 2018 by AcidReign

John_M · August 29, 2018

Yes, my docker containers, for example, all use the same IP address as unRAID itself, but provide their services on different TCP ports. I have no need to give them individual IP addresses.

AcidReign · August 29, 2018

Hm, I need the two dockers that I have running on separate IPs. They did not work with just a different port because the hostname is checked.

AcidReign · August 30, 2018

The server crashed again today. This time I was not home, so it sat for some time before I did the hard reboot.

And I think that is why it behaved differently this time.

UptimeRobot told me that the server (I was monitoring the Plex port, so maybe only the Plex docker) went offline some time before 16:30. The log file goes on until 18:02. I did the reset at 18:50, so it sat there for almost an hour doing nothing at all.

This time the local screen showed the following lines (see also the attached picture).

crond[2909]: exit status 137 from user root /etc/rc.d/rc.diskinfo --daemon &> /dev/null

crond[2909]: exit status 137 from user root /usr/local/emhttp/plugins/dynamixs/scripts/monitor &> /dev/null

Can anyone tell me what that means and what the logfiles say this time?!

When I search for that error I find another thread here that has the same issue but was never solved. The only similarity I see is that I also have the UnifiController docker running.

Thank you all for your help. I really appreciate it.

Acid

Edit: Sorry had the wrong Logs attached. Now correct

FCPsyslog_tail.txt

optimusprime-diagnostics-20180830-1553.zip

Edited August 30, 2018 by AcidReign

AcidReign · August 30, 2018

And just 10 minutes ago I had another crash.

This time I was sitting right next to the server and tried to shut it down from the local monitor, but had no luck.

The log files again show a lot of these (see quote). What are they and what can I do about them?

Aug 30 22:12:12 OptimusPrime kernel: INFO: rcu_bh detected stalls on CPUs/tasks:
Aug 30 22:12:12 OptimusPrime kernel: 	6-...: (1 GPs behind) idle=98a/140000000000000/0 softirq=3443318/3678502 fqs=59996 
Aug 30 22:12:12 OptimusPrime kernel: 	(detected by 26, t=240007 jiffies, g=-243, c=-244, q=6)
Aug 30 22:12:12 OptimusPrime kernel: Sending NMI from CPU 26 to CPUs 6:
Aug 30 22:12:12 OptimusPrime kernel: NMI backtrace for cpu 6
Aug 30 22:12:12 OptimusPrime kernel: CPU: 6 PID: 16400 Comm: nc2 Tainted: G        W       4.14.49-unRAID #1
Aug 30 22:12:12 OptimusPrime kernel: Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 2.0b 02/28/2018
Aug 30 22:12:12 OptimusPrime kernel: task: ffff880838bb0000 task.stack: ffffc90005d28000
Aug 30 22:12:12 OptimusPrime kernel: RIP: 0010:nf_nat_setup_info+0x198/0x5d1 [nf_nat]
Aug 30 22:12:12 OptimusPrime kernel: RSP: 0018:ffff88046ff83758 EFLAGS: 00000286
Aug 30 22:12:12 OptimusPrime kernel: RAX: ffff88044b8b9706 RBX: ffff880449f45b80 RCX: 00000000cb652c51
Aug 30 22:12:12 OptimusPrime kernel: RDX: ffff88044b880000 RSI: 00000000b603c035 RDI: 000000008b215da7
Aug 30 22:12:12 OptimusPrime kernel: RBP: ffff88046ff83828 R08: 0000000000000038 R09: ffff88040b75c700
Aug 30 22:12:12 OptimusPrime kernel: R10: 0000000000000348 R11: 0000000000000000 R12: ffff88044cca7960
Aug 30 22:12:12 OptimusPrime kernel: R13: ffff88046ff83838 R14: ffffffff81c88480 R15: 0000000000000000
Aug 30 22:12:12 OptimusPrime kernel: FS:  000014ca3e6aa700(0000) GS:ffff88046ff80000(0000) knlGS:0000000000000000
Aug 30 22:12:12 OptimusPrime kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 30 22:12:12 OptimusPrime kernel: CR2: 000014b6626ef000 CR3: 000000083dfb0004 CR4: 00000000007606e0
Aug 30 22:12:12 OptimusPrime kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 30 22:12:12 OptimusPrime kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 30 22:12:12 OptimusPrime kernel: PKRU: 55555554
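Stall reports like this are easier to compare across crashes when reduced to a count plus the function named on each `RIP:` line. This is only a rough Python sketch, assuming the kernel lines keep the format quoted above; `summarize_stalls` and the sample list are illustrative names, not an unRAID tool.

```python
import re

# "INFO: rcu_bh detected stalls ..." (other RCU flavours such as rcu_sched use the same wording).
STALL_RE = re.compile(r"kernel: INFO: (rcu_\w+) detected stalls")
# "RIP: 0010:nf_nat_setup_info+0x198/0x5d1 [nf_nat]" -> function name and optional module.
RIP_RE = re.compile(
    r"kernel: RIP: [0-9a-f]{4}:(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)


def summarize_stalls(lines):
    """Return (number of stall reports, list of (function, module) from RIP lines)."""
    stalls = 0
    symbols = []
    for line in lines:
        if STALL_RE.search(line):
            stalls += 1
        rip = RIP_RE.search(line)
        if rip:
            symbols.append((rip.group(1), rip.group(2)))
    return stalls, symbols


# Two lines copied from the excerpt above.
sample = [
    "Aug 30 22:12:12 OptimusPrime kernel: INFO: rcu_bh detected stalls on CPUs/tasks:",
    "Aug 30 22:12:12 OptimusPrime kernel: RIP: 0010:nf_nat_setup_info+0x198/0x5d1 [nf_nat]",
]

stall_count, rip_symbols = summarize_stalls(sample)
print(stall_count, rip_symbols)
```

If the same function and module show up in every report, that narrows down which subsystem to look at. In the excerpt above the module is `nf_nat`, which belongs to the kernel's network address translation code.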

John_M · August 30, 2018

rc.diskinfo is used by the preclear plugin. Non-zero return codes usually mean an error condition.
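By the usual shell convention, an exit status above 128 means the process was terminated by a signal numbered status minus 128, so 137 corresponds to signal 9 (SIGKILL). The short Python sketch below decodes a status this way. It assumes crond is reporting a shell-style status, and `describe_exit_status` is an illustrative name. The logs quoted here do not show what sent the signal.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell-style exit status (assumes the 128 + signal convention)."""
    if status == 0:
        return "success"
    if status > 128:
        signum = status - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited with error code {status}"


print(describe_exit_status(137))
```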

Did you fix the problem that was producing the call traces?

AcidReign · August 31, 2018

How do I fix those?

John_M · August 31, 2018

On 8/29/2018 at 6:24 PM, John_M said:

Investigate the call traces in your syslog. It looks as though they are macvlan related, which is not something I use. I'll tell you what I'd do - and it's advice I give over and over - I would break the problem into smaller, more manageable pieces. Disable dockers and VMs and run the server as a plain old NAS for a while. Once you're happy with that start enabling things gradually. That way you stand a better chance of narrowing down the problem.

AcidReign · August 31, 2018

Well, I deactivated the two dockers running on separate IPs today. The server has an uptime of 16 hours now. Let's see if it lasts.

But what then? Separate IPs for some dockers were one of the major reasons I chose unRAID.

John_M · August 31, 2018

Perhaps there's a configuration error. Other people use macvlans successfully so while you're waiting to see if your uptime continues you could read up on them. As I said above, I don't use them myself.

Rudde · June 2, 2019

I'm also experiencing this exact issue, also on a new MB, RAM, CPU, and USB.

I made this thread about it:

Did you ever fix this?
