Help - New Build unRAID Server stalling/crashing



Hi everybody,

 

I need help... again. 

My newly built server keeps crashing and I have no idea what is causing it.

The first crash happened in the middle of the night. I guess the server was idling at the time, because I had not even installed Plex yet.

Then it happened again during the parity check that followed the first crash.

I switched on troubleshooting mode in the Fix Common Problems plugin, but then I had to reboot, so it was switched off again.

The server crashed again this morning (7 hours ago). I then started troubleshooting mode again.

Now it has crashed once more during the parity check (and I was working on the server at the time too).

See the log files attached. 

 

What I mean by crashing:

The WebUI and all Docker containers stop responding. SSH won't let me do anything anymore.

The first time it happened I could still log in on the local console, and it crashed 10 seconds later.

During the most recent crashes I could not log in on the local console at all. Pressing the power button did not initiate a graceful shutdown, so I had to hold the button down to force the machine off.

 

The hardware is completely new and I have a UPS installed, so it should not be a power issue.

 

Please help - I paid a load of money for this server and now my happiness is really going downhill :(

 

Acid

 

FCPsyslog_tail.txt

optimusprime-diagnostics-20180828-1631.zip

Edited by AcidReign
Link to comment

It looks as though you're being attacked. Do you recognise the IP address? Its whois record shows that it's currently held by China Unicom. Get your server behind a firewall and don't forward any ports to it.

Aug 28 12:12:41 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2
Aug 28 12:12:42 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2
Aug 28 12:12:44 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2
Aug 28 12:12:44 OptimusPrime sshd[15200]: error: maximum authentication attempts exceeded for root from 112.85.42.102 port 21600 ssh2 [preauth]
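
To get a quick picture of how widespread the attempts are, the failed logins can be tallied per source address. This is only a minimal Python sketch: it assumes the syslog lines have the same format as the excerpt above, and `count_failed_logins` is a hypothetical helper name, not something that ships with unRAID.

```python
import re
from collections import Counter

# Matches sshd "Failed password" lines in the format shown in the excerpt above.
FAILED_RE = re.compile(
    r"sshd\[\d+\]: Failed password for (?:invalid user )?(\S+) from (\S+) port \d+"
)


def count_failed_logins(lines):
    """Return a Counter mapping source IP -> number of failed password attempts."""
    counts = Counter()
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


sample = [
    "Aug 28 12:12:41 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2",
    "Aug 28 12:12:42 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2",
    "Aug 28 12:12:44 OptimusPrime sshd[15200]: Failed password for root from 112.85.42.102 port 21600 ssh2",
    "Aug 28 12:12:44 OptimusPrime sshd[15200]: error: maximum authentication attempts exceeded for root from 112.85.42.102 port 21600 ssh2 [preauth]",
]

for ip, attempts in count_failed_logins(sample).most_common():
    print(ip, attempts)
```

Only the three "Failed password" lines are counted; the "maximum authentication attempts exceeded" line is deliberately ignored so that one connection is not counted twice. To run it over a real log, pass an open file object instead of `sample`.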

 

Link to comment
1 hour ago, AcidReign said:

Any further ideas?

 

Investigate the call traces in your syslog. They look macvlan-related, and macvlan is not something I use. I'll tell you what I'd do, and it's advice I give over and over: break the problem into smaller, more manageable pieces. Disable Docker containers and VMs and run the server as a plain old NAS for a while. Once you're happy with that, start enabling things gradually. That way you stand a better chance of narrowing down the problem.

 

Does your power supply have a single +12 volt rail? I'd be interested to know what make and model it is. An expensive server deserves a quality power supply but people often compromise there.

Link to comment

Sorry for the stupid question, but what is macvlan? I did not install anything like that. Does it come with unRAID?

Edit: Google suggests it has to do with giving containers their own IP addresses. So is this the "Bridge" or "br0" setting in Docker?

The power supply should be a good one, but I'll check that with the manufacturer of the server because I cannot open the case right now.

Edited by AcidReign
Link to comment

The server crashed again today. This time I was not home, so it sat for a while before I did the hard reboot.

I think that is why it behaved differently this time.

UptimeRobot told me that the server went offline some time before 16:30 (I was monitoring the Plex port, so maybe it was only the Plex container). The log file continues until 18:02, and I did the reset at 18:50, so the server sat there for almost an hour doing nothing at all.

This time the local screen showed the following lines (see also the attached picture):

crond[2909]: exit status 137 from user root /etc/rc.d/rc.diskinfo --daemon &> /dev/null

crond[2909]: exit status 137 from user root /usr/local/emhttp/plugins/dynamixs/scripts/monitor &> /dev/null
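
For what it's worth, an exit status above 128 conventionally means the process was terminated by a signal, with the signal number being the status minus 128. On that reading, 137 corresponds to signal 9 (SIGKILL). This assumes crond is reporting the usual shell-style status; the lines above do not show what sent the signal. A minimal Python sketch of the arithmetic, where `describe_exit_status` is a hypothetical helper name:

```python
import signal


def describe_exit_status(status):
    """Describe a shell-style exit status (values above 128 mean 128 + signal number)."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with code {status}"


print(describe_exit_status(137))
```

The signal name is looked up on the machine running the script, so it is best run on a Linux box such as the server itself.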

 

Can anyone tell me what that means and what the logfiles say this time?!

 

When I search for that error I find another thread here with the same issue, but it was never solved. The only similarity I can see is that I am also running the UnifiController Docker container.

 

 

Thank you all for your help. I really appreciate it.

Acid

 

Edit: Sorry, I had attached the wrong logs. The correct ones are attached now.

 

 

IMG_1692.jpg

FCPsyslog_tail.txt

optimusprime-diagnostics-20180830-1553.zip

Edited by AcidReign
Link to comment

And just 10 minutes ago I had another crash.

This time I was sitting right next to the server and tried to shut it down from the local console, but had no luck.

The log files again show a lot of entries like the ones quoted below. What are they, and what can I do about them?

 

Aug 30 22:12:12 OptimusPrime kernel: INFO: rcu_bh detected stalls on CPUs/tasks:
Aug 30 22:12:12 OptimusPrime kernel: 	6-...: (1 GPs behind) idle=98a/140000000000000/0 softirq=3443318/3678502 fqs=59996 
Aug 30 22:12:12 OptimusPrime kernel: 	(detected by 26, t=240007 jiffies, g=-243, c=-244, q=6)
Aug 30 22:12:12 OptimusPrime kernel: Sending NMI from CPU 26 to CPUs 6:
Aug 30 22:12:12 OptimusPrime kernel: NMI backtrace for cpu 6
Aug 30 22:12:12 OptimusPrime kernel: CPU: 6 PID: 16400 Comm: nc2 Tainted: G        W       4.14.49-unRAID #1
Aug 30 22:12:12 OptimusPrime kernel: Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 2.0b 02/28/2018
Aug 30 22:12:12 OptimusPrime kernel: task: ffff880838bb0000 task.stack: ffffc90005d28000
Aug 30 22:12:12 OptimusPrime kernel: RIP: 0010:nf_nat_setup_info+0x198/0x5d1 [nf_nat]
Aug 30 22:12:12 OptimusPrime kernel: RSP: 0018:ffff88046ff83758 EFLAGS: 00000286
Aug 30 22:12:12 OptimusPrime kernel: RAX: ffff88044b8b9706 RBX: ffff880449f45b80 RCX: 00000000cb652c51
Aug 30 22:12:12 OptimusPrime kernel: RDX: ffff88044b880000 RSI: 00000000b603c035 RDI: 000000008b215da7
Aug 30 22:12:12 OptimusPrime kernel: RBP: ffff88046ff83828 R08: 0000000000000038 R09: ffff88040b75c700
Aug 30 22:12:12 OptimusPrime kernel: R10: 0000000000000348 R11: 0000000000000000 R12: ffff88044cca7960
Aug 30 22:12:12 OptimusPrime kernel: R13: ffff88046ff83838 R14: ffffffff81c88480 R15: 0000000000000000
Aug 30 22:12:12 OptimusPrime kernel: FS:  000014ca3e6aa700(0000) GS:ffff88046ff80000(0000) knlGS:0000000000000000
Aug 30 22:12:12 OptimusPrime kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 30 22:12:12 OptimusPrime kernel: CR2: 000014b6626ef000 CR3: 000000083dfb0004 CR4: 00000000007606e0
Aug 30 22:12:12 OptimusPrime kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 30 22:12:12 OptimusPrime kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 30 22:12:12 OptimusPrime kernel: PKRU: 55555554
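
Each of these stall reports names the code the stuck CPU was executing on its `RIP:` line (here `nf_nat_setup_info` in the `nf_nat` module). With many such entries in the log, it may help to tally which functions keep showing up. The following is a rough Python sketch, assuming the `RIP:` lines all follow the format quoted above; `stalled_functions` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   RIP: 0010:nf_nat_setup_info+0x198/0x5d1 [nf_nat]
# The trailing "[module]" part is absent for code built into the kernel.
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)


def stalled_functions(lines):
    """Count (function, module) pairs found on RIP lines; module is None if built in."""
    counts = Counter()
    for line in lines:
        match = RIP_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


sample = [
    "Aug 30 22:12:12 OptimusPrime kernel: NMI backtrace for cpu 6",
    "Aug 30 22:12:12 OptimusPrime kernel: RIP: 0010:nf_nat_setup_info+0x198/0x5d1 [nf_nat]",
    "Aug 30 22:12:12 OptimusPrime kernel: RSP: 0018:ffff88046ff83758 EFLAGS: 00000286",
]

for (function, module), hits in stalled_functions(sample).most_common():
    print(function, module, hits)
```

If the same function or module dominates the tally, that narrows down which subsystem to look at first.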

 

Link to comment