Unraid 6.9.1 Trial - Kernel Panic



I'm having a persistent issue with kernel panics on Unraid, and I'm trying to troubleshoot my way out of this, but it's not working, so I'm turning to you guys.
At best, I get about 3 days of uptime before it hits. I've tried a ton of things suggested on this forum and on Reddit.

I'm not ruling out a hardware problem, since the server (Supermicro X8DT3-LN4F, 2x Xeon X5650, 48GB RAM) was bought used on eBay, but the video card and SAS controller are new. Memtest comes up clean. The hard drives are a mix of new ones and ones I used in my Synology for a while, but all test clean.
I had some issues with UDMA CRC SMART errors when I first set up Unraid, but figured out it was the onboard SATA ports being flaky, which is why I switched to the SAS controller.
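For anyone checking the same thing: the CRC counter is SMART attribute 199, and a quick spot check from the terminal looks something like this (sdX is just a placeholder for whichever drive you're looking at):

# UDMA CRC errors are reported by the drive as attribute 199
smartctl -A /dev/sdX | grep -i crc

If that count keeps climbing after swapping cables or ports, the link itself is still flaky.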

I'm about halfway through my trial, and I really want to get this working because I love what Unraid offers, but it's getting to be sink-or-swim time to decide whether I stick with the platform or cut my losses and move to something else.

 

I can't pinpoint what's causing it, so any suggestions would be appreciated. 

 

[Attached screenshot: IPMIView 2.18.0 console capture of the kernel panic, 2021-04-01 09:05]

unraid-diagnostics-20210401-0911.zip

7 minutes ago, jonathanm said:

Does it crash if you don't enable the nvidia drivers?


I can try disabling them for troubleshooting, but my hope was to run Tdarr on this server to transcode my downloaded files to maintain consistency, and I got a Quadro P400 specifically for that task. 
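For anyone setting up the same thing: the stock nvidia-smi tool is the quick sanity check that the card and driver are visible before pointing a Tdarr node at it (just an example invocation, nothing Unraid-specific):

# confirm the GPU, driver version and current utilization are reported
nvidia-smi --query-gpu=name,driver_version,utilization.gpu --format=csv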

Edit: Driver removed... let's see. 

Edited by Diesel
1 hour ago, jonathanm said:

Are you running any containers with their own IP address? I see macvlan mentioned, and that is a known issue with some setups.


I have 3 containers using br0 due to port overlap with other containers. Looking at the thread that @Hoopster mentioned, it looks like setting up a vlan for the containers might be needed to resolve this.

Curious though... is this functionality considered broken? I thought the whole point of using br0 was for this use case.

 

1 hour ago, jonathanm said:

 

Also, blake2 is shown, are you by any chance using the file integrity plugin?

 

Not to my knowledge. I'm not familiar with blake2.  

The only plugins I've installed (so far) were to try to resolve issues I've been having. 
Community Applications - self-explanatory.
Tips & Tweaks - to kill ssh before stopping the array, since I was having issues with the process hanging when I tried to stop the array.
Open Files - to see what files were holding up the array from stopping (a rough CLI equivalent is sketched below).
NerdPack - to use the CLI locate tool to troubleshoot where a container was putting its config files, only to find out that the container hadn't been marked as deprecated but was so out of date it should have been. Probably no longer needed, but none of the tools in there should cause problems because they're only used on demand.
CA.Backup - to back up my appdata to the array once I installed a cache drive.
Nvidia Driver - now removed for troubleshooting, but used to allow Tdarr_node to do hardware transcoding.

I haven't installed any other plugins beyond those. 
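For reference, the check the Open Files plugin does can be approximated from the command line; a rough sketch (the paths are the standard Unraid mounts, adjust as needed):

# list processes holding files open under the user shares
lsof +D /mnt/user 2>/dev/null
# or, for a specific disk that refuses to unmount
fuser -vm /mnt/disk1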

2 minutes ago, Diesel said:

Curious though... is this functionality considered broken? I thought the whole point of using br0 was for this use case.

 

IP addresses on br0 work great on many systems with no call traces. The problem only arises with certain hardware setups. Some have it, some don't, and so far there has been no common denominator identified.

 

If you have the problem, VLANs seem to be the solution that works for most.


Follow-up: I set up a VLAN for the containers this morning, set the containers to use the new interface, and I'm currently hammering the box via Tdarr, which previously seemed to exacerbate whatever was causing the kernel panics.
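For anyone following along: in Unraid this is all done through Settings > Network Settings and the Docker settings page, but the rough docker CLI equivalent of what gets created looks something like this (the VLAN ID, subnet, gateway and IP below are made-up examples, not my actual values):

# br0.10 is the VLAN 10 sub-interface of the main bridge
docker network create -d macvlan \
  --subnet=192.168.10.0/24 --gateway=192.168.10.1 \
  -o parent=br0.10 br0.10
# attach a container to that network with a static IP on the VLAN
docker run -d --network=br0.10 --ip=192.168.10.50 --name=unifi-controller linuxserver/unifi-controller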
So, fingers crossed that resolved it. Wait and see...

Follow-up question: One of the containers I was using br0 for was just to assign my unifi-controller its own IP on my LAN subnet. Obviously, putting it on the VLAN puts it in a different subnet, but my preference would be to have the controller on the same subnet as the devices it's controlling. I realize there's no need for this if the networking is set up properly, but I'm also experienced enough to know that should problems with my UniFi setup arise in the future, having fewer variables in the mix makes for easier troubleshooting.
Given that I seem to be one of the ones randomly afflicted by this issue, am I correct in assuming at this point that there's no way for me to recover this functionality without risking reintroducing the call traces and kernel panics?

 

3 hours ago, Diesel said:

Follow-up question: One of the containers I was using br0 for was just to assign my unifi-controller its own IP on my LAN subnet. Obviously, putting it on the VLAN puts it in a different subnet, but my preference would be to have the controller on the same subnet as the devices it's controlling

LOL - I was going to ask if you had Ubiquiti UniFi hardware.  A large percentage of us who have these macvlan call traces have UniFi.  However, I believe @bonienl also has some UniFi and he has not had the call traces so that cannot be the sole determining factor.

 

By default, the UniFi routers pass all LAN traffic freely among the "Corporate" LANs, which includes any VLANs. I currently have my unifi-controller in bridge mode, which puts it on the same subnet as the router, switches and APs it controls, since there are no port conflicts with other containers. However, early on in my VLAN experimentation, I believe I had it on the VLAN and did not have any issues. I am not 100% certain of this, but that is my recollection.

 

In my case, the Controller Hostname/IP setting in the controller is the unRAID server IP address, since the controller is in bridge mode; however, if it were put on a VLAN, it would need to be set to the IP address assigned to the controller container.

 

I eagerly await unRAID 6.9.2 "soon"

Edited by Hoopster
2 hours ago, Hoopster said:

LOL - I was going to ask if you had Ubiquiti UniFi hardware.  A large percentage of us who have these macvlan call traces have UniFi.  However, I believe @bonienl also has some UniFi and he has not had the call traces so that cannot be the sole determining factor.


To clarify, I have UniFi switches and APs, but my router is pfSense on its own hardware.


I did set up the container VLAN to be open to my LAN, but I also set the UniFi controller container to bridge instead of br0 or the VLAN. Like I said, I wanted it to have its own IP, but for now I'm managing that with a host override in the pfSense DNS Resolver. However, this isn't my ideal setup.
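For anyone replicating the workaround: the pfSense DNS Resolver host override is just an unbound local-data entry under the hood, roughly equivalent to this (hostname and IP are made-up examples):

# Services > DNS Resolver > Host Overrides, or as raw unbound config:
local-data: "unifi.home.lan. A 192.168.1.10"
local-data-ptr: "192.168.1.10 unifi.home.lan."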
Assuming that I've gotten the kernel panics under control, I too will be eagerly awaiting 6.9.2. But I'm still in wait-and-see mode. My current uptime is barely 25 hours, and I've only got 8 days left on my trial. Yes, I'm aware I can extend the trial, but it honestly shouldn't take me 30 days and numerous fixes just to achieve basic functionality of network storage and Docker containers. If I can get >4 days with this fix, I'll consider it a success and finish migrating the rest of my drives into the array and purchase a license. If not, I fear it's back to the drawing board. 

16 minutes ago, Diesel said:

To clarify, I have UniFi switches and APs, but my router is pfSense on its own hardware

Ah, OK.  A different situation than mine then.

 

The br0 call trace issue has been around for three years. It has something to do with the way docker/macvlan networking interacts with the physical hardware. It is not an unRAID-specific issue and has been reported on Ubuntu and other Linux distros.

 

The potential fix (hopefully, it is a real fix) in unRAID 6.9.2 apparently involves a Linux kernel patch. As such, it highlights that Limetech really did not have much control over this one. I am just glad the VLAN approach worked for me three years ago, or all my docker containers would have to be in bridge or host mode. That can get messy when they start fighting over the same ports.


Looks like it crashed again. Hard locked this time. I couldn't even get the local console or the IPMI KVM to wake up to see what happened.
Checked the syslog, and it looks like there are still some macvlan issues going on. I had it down to 2 containers on the VLAN, and still...

Apr  5 15:39:11 unraid kernel: ------------[ cut here ]------------
Apr  5 15:39:11 unraid kernel: WARNING: CPU: 0 PID: 2716 at net/netfilter/nf_nat_core.c:614 nf_nat_setup_info+0x6c/0x652 [nf_nat]
Apr  5 15:39:11 unraid kernel: Modules linked in: nvidia_uvm(PO) nvidia_drm(PO) nvidia_modeset(PO) drm_kms_helper drm backlight agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops nvidia(PO) iptable_mangle iptable_raw wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libblake2s blake2s_x86_64 libblake2s_generic libchacha veth xt_nat macvlan xt_MASQUERADE iptable_nat nf_nat xfs nfsd lockd grace sunrpc md_mod ip6table_filter ip6_tables iptable_filter ip_tables bonding igb i2c_algo_bit intel_powerclamp coretemp kvm_intel kvm mpt3sas mptsas mptscsih mptbase raid_class i2c_i801 crc32c_intel ahci scsi_transport_sas i2c_smbus input_leds intel_cstate i2c_core led_class i5500_temp intel_uncore ipmi_si libahci i7core_edac button acpi_cpufreq [last unloaded: i2c_algo_bit]
Apr  5 15:39:11 unraid kernel: CPU: 0 PID: 2716 Comm: kworker/0:3 Tainted: P        W  O      5.10.21-Unraid #1
Apr  5 15:39:11 unraid kernel: Hardware name: Supermicro X8DT3/X8DT3, BIOS 2.2     07/09/2018
Apr  5 15:39:11 unraid kernel: Workqueue: events macvlan_process_broadcast [macvlan]
Apr  5 15:39:11 unraid kernel: RIP: 0010:nf_nat_setup_info+0x6c/0x652 [nf_nat]
Apr  5 15:39:11 unraid kernel: Code: 89 fb 49 89 f6 41 89 d4 76 02 0f 0b 48 8b 93 80 00 00 00 89 d0 25 00 01 00 00 45 85 e4 75 07 89 d0 25 80 00 00 00 85 c0 74 07 <0f> 0b e9 1f 05 00 00 48 8b 83 90 00 00 00 4c 8d 6c 24 20 48 8d 73
Apr  5 15:39:11 unraid kernel: RSP: 0000:ffffc90000003c38 EFLAGS: 00010202
Apr  5 15:39:11 unraid kernel: RAX: 0000000000000080 RBX: ffff888707aeba40 RCX: ffff888646fcc540
Apr  5 15:39:11 unraid kernel: RDX: 0000000000000180 RSI: ffffc90000003d14 RDI: ffff888707aeba40
Apr  5 15:39:11 unraid kernel: RBP: ffffc90000003d00 R08: 0000000000000000 R09: ffff88865e1802a0
Apr  5 15:39:11 unraid kernel: R10: 0000000000000158 R11: ffff88874ac92000 R12: 0000000000000000
Apr  5 15:39:11 unraid kernel: R13: 0000000000000000 R14: ffffc90000003d14 R15: 0000000000000001
Apr  5 15:39:11 unraid kernel: FS:  0000000000000000(0000) GS:ffff888627a00000(0000) knlGS:0000000000000000
Apr  5 15:39:11 unraid kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr  5 15:39:11 unraid kernel: CR2: 00001488a04d6000 CR3: 00000006fee4c002 CR4: 00000000000206f0
Apr  5 15:39:11 unraid kernel: Call Trace:
Apr  5 15:39:11 unraid kernel: <IRQ>
Apr  5 15:39:11 unraid kernel: ? ip_route_input_slow+0x5e9/0x754
Apr  5 15:39:11 unraid kernel: ? ipt_do_table+0x49b/0x5c0 [ip_tables]
Apr  5 15:39:11 unraid kernel: nf_nat_alloc_null_binding+0x71/0x88 [nf_nat]
Apr  5 15:39:11 unraid kernel: nf_nat_inet_fn+0x91/0x182 [nf_nat]
Apr  5 15:39:11 unraid kernel: nf_hook_slow+0x39/0x8e
Apr  5 15:39:11 unraid kernel: nf_hook.constprop.0+0xb1/0xd8
Apr  5 15:39:11 unraid kernel: ? ip_protocol_deliver_rcu+0xfe/0xfe
Apr  5 15:39:11 unraid kernel: ip_local_deliver+0x49/0x75
Apr  5 15:39:11 unraid kernel: ip_sabotage_in+0x43/0x4d
Apr  5 15:39:11 unraid kernel: nf_hook_slow+0x39/0x8e
Apr  5 15:39:11 unraid kernel: nf_hook.constprop.0+0xb1/0xd8
Apr  5 15:39:11 unraid kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
Apr  5 15:39:11 unraid kernel: ip_rcv+0x41/0x61
Apr  5 15:39:11 unraid kernel: __netif_receive_skb_one_core+0x74/0x95
Apr  5 15:39:11 unraid kernel: process_backlog+0xa3/0x13b
Apr  5 15:39:11 unraid kernel: net_rx_action+0xf4/0x29d
Apr  5 15:39:11 unraid kernel: __do_softirq+0xc4/0x1c2
Apr  5 15:39:11 unraid kernel: asm_call_irq_on_stack+0x12/0x20
Apr  5 15:39:11 unraid kernel: </IRQ>
Apr  5 15:39:11 unraid kernel: do_softirq_own_stack+0x2c/0x39
Apr  5 15:39:11 unraid kernel: do_softirq+0x3a/0x44
Apr  5 15:39:11 unraid kernel: netif_rx_ni+0x1c/0x22
Apr  5 15:39:11 unraid kernel: macvlan_broadcast+0x10e/0x13c [macvlan]
Apr  5 15:39:11 unraid kernel: macvlan_process_broadcast+0xf8/0x143 [macvlan]
Apr  5 15:39:11 unraid kernel: process_one_work+0x13c/0x1d5
Apr  5 15:39:11 unraid kernel: worker_thread+0x18b/0x22f
Apr  5 15:39:11 unraid kernel: ? process_scheduled_works+0x27/0x27
Apr  5 15:39:11 unraid kernel: kthread+0xe5/0xea
Apr  5 15:39:11 unraid kernel: ? __kthread_bind_mask+0x57/0x57
Apr  5 15:39:11 unraid kernel: ret_from_fork+0x22/0x30
Apr  5 15:39:11 unraid kernel: ---[ end trace 958c8b9071653523 ]---
Apr  5 16:08:03 unraid kernel: eth0: renamed from veth90b5415
Apr  5 16:10:31 unraid kernel: ------------[ cut here ]------------
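(For posterity: pulling just these traces out of the syslog in the diagnostics zip is a one-liner; the path below assumes the standard Unraid /var/log/syslog.)

# show each macvlan call trace with some surrounding context
grep -B2 -A40 "macvlan_process_broadcast" /var/log/syslog | less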


Mainly posting for posterity now, in case this ends up helping someone with the same problem in the future. 

I've replaced some of the containers on the VLAN (e.g., the torrent client) with different apps that don't use the same ports, so there's no overlap with other containers and they can live on the bridge network.
The remaining containers that were on the VLAN have been turned off to see if that stops the crashes.
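Worth noting for anyone else hitting port clashes: where the conflict is only on the host side, bridge mode also lets you remap the published port rather than swapping apps. A hypothetical example, not one of my actual containers:

# publish the container's internal 8080 on host port 8081 to dodge a clash
docker run -d --network=bridge -p 8081:8080 --name=some-app some/image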

I saw another post that mentioned, in addition to putting br0 containers on their own VLAN, also turning off "host access to custom networks", which I had needed to enable to allow my apps on the bridge network to talk to the ones with custom IPs. So that's on the troubleshooting checklist as well. For now, I'll just leave the VLAN hosts turned off.
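A quick way to double check that nothing is still attached to a custom/macvlan network while testing (a sketch; br0.10 is just whatever your VLAN network ended up being called):

docker network ls
docker network inspect br0.10 --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}'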

If I can at least get this box to stay up successfully without crashes, I can live with a workaround, as frustrating as that may be. Then I can tweak things with additional suggested fixes if needed. But I'd at least like to make it to the end of the trial without another crash, if possible. 

Edited by Diesel
  • 3 weeks later...

Just as a further follow-up, everything has been working smoothly on the kernel panic front since taking the containers off the br0 interface.
I haven't had a chance to test it again running a container on another VLAN since turning off "host access to custom networks", but I may still get to that at some point. For now, I'm managing without that particular container. 
I've also upgraded to 6.9.2, as @bonienl mentioned. I'll retest that as well as soon as I'm able. 

Edited by Diesel
