Random crashes involving NIC



Hi Guys,

 

Server stats:

Dell T610

Dual Xeon E5530 CPUs

44GB RAM

UNRAID 6.8.3

 

Plugins:

CA Auto Update Applications

CA Backup / Restore

CA Config Editor

Custom Tab

Local Master

SSD TRIM

System Statistics

Fix common problems

Nerd Tools

NUT

Server Layout

Speed Test

Theme engine

Tips and Tweaks

Unassigned Devices (both)

User scripts

 

I've been running UNRAID for about a year now and have had no issues. It has been solid, until now!

 

About every two days now my server will crash. I've searched the web and the only things I can really find have to do with a kernel issue that affects the NIC.

 

I honestly have no idea what is causing it, because when this happens I can't access anything on the server. The monitor plugged into the server just repeats the same error message over and over, and the only thing I can do is hold down the power button to force a reboot.

 

The only similar reports I've been able to find are fairly dated and involve other operating systems. Those issues seem to have something to do with the kernel.

 

I'm attaching a picture of the error. Apologies because I can't access any other diagnostics when this happens.

 

Syslog and image of the error attached...

unraid-error.jpg

tower-syslog-20200413-0241.zip

5 hours ago, Mathervius said:

bumping this in the hopes that someone has some input or experience

Unfortunately, I don't have experience with this, and there really isn't much information provided, but here are some things for you to consider until someone else chimes in...

On 4/12/2020 at 10:49 PM, Mathervius said:

Apologies because I can't access any other diagnostics

Enable the syslog server. The attached log covers only a few minutes and doesn't really show anything other than:

Quote

Apr 12 18:47:30 Tower kernel: traps: ffdetect[12663] general protection ip:4042af sp:7ffd6b77a880 error:0 in ffdetect[403000+c000]
Apr 12 18:47:30 Tower kernel: traps: ffdetect[12664] general protection ip:4042af sp:7ffec7e115e0 error:0 in ffdetect[403000+c000]

All I could find on this error (↑) was that it may be related to Emby.

On 4/12/2020 at 10:49 PM, Mathervius said:

I've been running UNRAID for about a year now and have no issues and it has been solid, until now!

It is absolutely possible that hardware is failing, but I have to ask whether something has changed recently: hardware, software (new or updated), or perhaps you moved the rig and knocked something loose. If not, then a piece of hardware is failing (possibly the NIC). This post suggests kernel crashes may be related to setting the MTU greater than 1500 (jumbo frames). I've always used the default of 1500, so I'm not familiar with changing this setting. Again, if the setting has been that way for a year, then I'd question whether something else has changed recently.

 

Things to try

  • Start in safe mode (no Docker containers, VMs, or plugins)
  • Enable the syslog server
  • Pull diagnostics frequently until the crash and upload the most recent set
  • If your RAM is not ECC, run memtest for 24 hours and add the results along with your diagnostics

You may want to consider tailing the log as well. This lets you get a screenshot of messages that could not be written to the log before the crash.

  • Attach a monitor and keyboard and, at the command line, enter:
Quote

tail -n 30 /var/log/syslog -f

This shows the last 30 lines and, because of -f, keeps printing new lines as they arrive. You can change the -n value or omit it (the default is 10). To quit tail:

Quote

Ctrl + C
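
Since whatever is on screen is lost after a forced power-off, it may also help to copy the end of the log at regular intervals to storage that survives a reboot. The following is only a sketch: the `save_tail` name, the 200-line default, and the `/boot/logs` destination (assumed here to be the flash drive) are illustrative assumptions, not something confirmed in this thread.

```shell
#!/bin/sh
# save_tail LOG DEST_DIR [LINES]
# Copy the last LINES lines (default 200, an arbitrary choice) of LOG into a
# timestamped file under DEST_DIR and print the path of the new file.
save_tail() {
    log=$1
    dest=$2
    lines=${3:-200}
    mkdir -p "$dest" || return 1
    out="$dest/syslog-tail-$(date +%Y%m%d-%H%M%S).txt"
    tail -n "$lines" "$log" > "$out" || return 1
    sync    # flush to disk so the copy has a chance to survive a hard reset
    printf '%s\n' "$out"
}

# Example call (both paths are assumptions):
#   save_tail /var/log/syslog /boot/logs
```

Run on a schedule, the newest file gives a rough picture of the last messages logged before a crash. It will not catch anything the kernel prints after logging has already stopped, so the on-screen tail is still worth having.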

Hopefully this will get you going in the right direction

Edited by Dissones4U

Thank you for getting back to me! 

 

The only recent changes are adding syncthing and tdarr_aio. They have both been running without any issues until this all started about two weeks ago.

 

I believe you are correct about the ffdetect error being Emby. That was my impression as well.

 

My MTU is default of 1500.

 

I'm going to set up the syslog server now. Thank you for that idea!

 

I have tdarr_aio off for now as it isn't essential, and I'm going to wait a few days to see if it happens again. It seems to happen about every 48 hours, give or take a little.

  • 1 month later...

Well, the crashes are back....

 

I was finally able to set up Graylog, since my other syslog server (UNRAID) wasn't capturing the issue.

 

I've attached the logs. One other issue: it seems Graylog didn't export everything in the exact order it came in. Sorry about that...

 

This came in after I rebooted the server and had it running for about an hour:

May 26 19:17:08 Tower kernel: RIP: 0010:__nf_conntrack_confirm+0xa0/0x69e
May 26 19:17:08 Tower kernel: Code: 04 e8 56 fb ff ff 44 89 f2 44 89 ff 89 c6 41 89 c4 e8 7f f9 ff ff 48 8b 4c 24 08 84 c0 75 af 48 8b 85 80 00 00 00 a8 08 74 26 <0f> 0b 44 89 e6 44 89 ff 45 31 f6 e8 95 f1 ff ff be 00 02 00 00 48
May 26 19:17:08 Tower kernel: RSP: 0018:ffff8885a99c3d58 EFLAGS: 00010202
May 26 19:17:08 Tower kernel: RAX: 0000000000000188 RBX: ffff888574348500 RCX: ffff888ad98fce18
May 26 19:17:08 Tower kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff81e091c4
May 26 19:17:08 Tower kernel: RBP: ffff888ad98fcdc0 R08: 00000000e48f2dcb R09: ffffffff81c8aa80
May 26 19:17:08 Tower kernel: R10: 0000000000000158 R11: ffffffff81e91080 R12: 000000000000baf1
May 26 19:17:08 Tower kernel: R13: ffffffff81e91080 R14: 0000000000000000 R15: 000000000000eaf0
May 26 19:17:08 Tower kernel: FS: 0000000000000000(0000) GS:ffff8885a99c0000(0000) knlGS:0000000000000000
May 26 19:17:08 Tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 26 19:17:08 Tower kernel: CR2: 0000146fb1114000 CR3: 0000000001e0a000 CR4: 00000000000006e0
May 26 19:17:08 Tower kernel: Call Trace:
May 26 19:17:08 Tower kernel: <IRQ>
May 26 19:17:08 Tower kernel: ipv4_confirm+0xaf/0xb9
May 26 19:17:08 Tower kernel: nf_hook_slow+0x3a/0x90
May 26 19:17:08 Tower kernel: ip_local_deliver+0xad/0xdc
May 26 19:17:08 Tower kernel: ? ip_sublist_rcv_finish+0x54/0x54
May 26 19:17:08 Tower kernel: ip_sabotage_in+0x38/0x3e
May 26 19:17:08 Tower kernel: nf_hook_slow+0x3a/0x90
May 26 19:17:08 Tower kernel: ip_rcv+0x8e/0xbe
May 26 19:17:08 Tower kernel: ? ip_rcv_finish_core.isra.0+0x2e1/0x2e1
May 26 19:17:08 Tower kernel: __netif_receive_skb_one_core+0x53/0x6f
May 26 19:17:08 Tower kernel: process_backlog+0x77/0x10e
May 26 19:17:08 Tower kernel: net_rx_action+0x107/0x26c
May 26 19:17:08 Tower kernel: __do_softirq+0xc9/0x1d7
May 26 19:17:08 Tower kernel: do_softirq_own_stack+0x2a/0x40
May 26 19:17:08 Tower kernel: </IRQ>
May 26 19:17:08 Tower kernel: do_softirq+0x4d/0x5a
May 26 19:17:08 Tower kernel: netif_rx_ni+0x1c/0x22
May 26 19:17:08 Tower kernel: macvlan_broadcast+0x111/0x156 [macvlan]
May 26 19:17:08 Tower kernel: ? __switch_to_asm+0x41/0x70
May 26 19:17:08 Tower kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]
May 26 19:17:08 Tower kernel: process_one_work+0x16e/0x24f
May 26 19:17:08 Tower kernel: worker_thread+0x1e2/0x2b8
May 26 19:17:08 Tower kernel: ? rescuer_thread+0x2a7/0x2a7
May 26 19:17:08 Tower kernel: kthread+0x10c/0x114
May 26 19:17:08 Tower kernel: ? kthread_park+0x89/0x89
May 26 19:17:08 Tower kernel: ret_from_fork+0x35/0x40
May 26 19:17:08 Tower kernel: ---[ end trace b58796bea918bc16 ]---

It didn't crash the server this time though. Sorry I just don't have experience with this kind of issue...
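
For reading a trace like the one above: frames that belong to a loadable kernel module carry the module name in square brackets at the end of the line, which is how `macvlan` shows up here. Below is a rough sketch for counting those names in a saved log file. `trace_modules` is a made-up helper name, and the file name in the usage comment is simply the attachment from this post.

```shell
#!/bin/sh
# trace_modules FILE
# Print each kernel module named in "[module]" on call-trace frames in FILE,
# with a count of how many frames mention it, most frequent first.
trace_modules() {
    # A module frame looks like: "macvlan_broadcast+0x111/0x156 [macvlan]".
    # Lines without a bracketed module after the offset are skipped.
    sed -nE 's/.*\+0x[0-9a-f]+\/0x[0-9a-f]+ \[([a-z0-9_]+)\].*/\1/p' "$1" \
        | sort | uniq -c | sort -rn
}

# Example call:
#   trace_modules graylog-search-result-relative-0.txt
```

A high count for one module only says where the trace passed through; it does not by itself prove that module is at fault.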

 

UNRAID_CRASH05262020.jpg

graylog-search-result-relative-0.txt

6 hours ago, johnnie.black said:

Macvlan call traces are usually related to having dockers with a custom IP address:

 

 

OK, I read through that post, but my dockers don't have IP addresses assigned to them. Mine use the Host, Bridge, and Proxynet (letsencrypt) networks, plus a VPN container.

 

Could one of those networks cause the macvlan issue? Maybe it's because I have docker set to be able to communicate with the host network (Host access to custom networks)?
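
One way to check which driver each Docker network actually uses is to list network names with their drivers and filter for macvlan. This is just a sketch: `macvlan_networks` is a hypothetical helper, and whether a macvlan network exists on this particular server is exactly what it is meant to find out.

```shell
#!/bin/sh
# macvlan_networks
# Read "NAME DRIVER" pairs on stdin (one network per line) and print the
# names of the networks whose driver is macvlan.
macvlan_networks() {
    awk '$2 == "macvlan" { print $1 }'
}

# Example call on the server (requires the docker CLI):
#   docker network ls --format '{{.Name}} {{.Driver}}' | macvlan_networks
```

If this prints nothing, no Docker network on the host is using the macvlan driver at that moment.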


After reading through a bunch of forum posts, I tried putting the docker network onto its own NIC. Any time I change the network settings, I can no longer reach the machine over the LAN.

 

I then deleted the network.cfg and rebooted into GUI mode. I made the suggested adjustments to put docker on its own NIC and once again I lost all network connectivity. 

 

I have now adjusted eth0 to: Bonding = no, Enable bridge = yes, Bridging members of br0 = eth0.

 

If I try to set eth0 to a static IP, I lose network connectivity again. It already has a static IP assigned in pfSense, but previously I had it set as static in UNRAID as well and it worked without a problem.

 

eth1, eth2, and eth3 all show as not configured now. If I make any adjustments to them I lose network connectivity and have to delete the network.cfg file and reboot in order to get connected again.

 

The dashboard still shows that I'm using bond0, which is what it always showed before. Shouldn't it match the network settings page, though?

 

Sorry for the long post but I am genuinely stuck here. 

