
2nd Kernel Panic on 6.9.0 - Help Please!



This is now the 2nd time I've gotten a kernel panic since upgrading to 6.9.0 (and changing my SSDs to the 1MiB partition layout).

 

Attached are my diagnostics and an image of what was on the screen.

I do see it says something about "btrfs release delayed node" - is that related to the panic?

I'm really hoping to get some help on this, if at all possible.

 

The first time it happened, the server was running a btrfs balance (for the 1MiB partition change).

But this time (a few days later) I wasn't doing anything at all outside of running my normal containers. 

 

Is there any way to figure out what is going on with this?

If not, will downgrading back to 6.8.X affect the SSDs now that I've changed them to the 1MiB layout?

 

 

image.thumb.png.6ceafd64d10ded85d9f6f67e9191c939.png

server-diagnostics-20210308-0837.zip

1 hour ago, JorgeB said:

After formatting the cache pool did you re-create the docker image or copied the old one?

 

I didn't format the cache; I used the unassign/re-assign method, so all the original data on the drives remained intact through 4 different balance operations.

 

Quote

Unassign/Re-assign Method

1. Stop array and unassign one of the devices from your existing pool; leave device unassigned.

2. Start array. A balance will take place on your existing pool. Let the balance complete.

3. Stop array. Re-assign the device, adding it back to your existing pool.

4. Start array. The added device will get re-partitioned and a balance will start moving data to the new device. Let the balance complete.

Repeat steps 1-4 for the other device in your existing pool.

 

What's happening here is this:

At the completion of step 2, btrfs will 'delete' the missing device from the volume and wipe the btrfs signature from it.

At the beginning of step 4, Unraid OS will re-partition the new device being added to an existing pool.
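For anyone following this procedure, each balance can be watched from the console before stopping the array again. A minimal sketch, assuming the pool is mounted at /mnt/cache (the Unraid default); the helper just checks for the "No balance found" string that btrfs-progs prints once a balance has finished:

```shell
#!/bin/bash
# Interpret the output of `btrfs balance status <mount>`.
# btrfs-progs prints "No balance found on '<mount>'" when no balance
# is running, i.e. the current step's balance has completed.
balance_finished() {
    case "$1" in
        *"No balance found"*) return 0 ;;  # idle: safe to stop the array
        *) return 1 ;;                     # still running (or paused)
    esac
}

# On the live server you would poll something like:
#   status=$(btrfs balance status /mnt/cache)
#   balance_finished "$status" && echo "balance complete"
#   btrfs filesystem show /mnt/cache   # confirm which devices are in the pool
```

This is only a convenience; the Unraid GUI shows the same balance progress on the Main page.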

 

 

9 hours ago, JorgeB said:

Try re-creating the docker image.

 

OK, I've rebuilt the docker.img file; I've stuck with a btrfs docker.img for now.

https://wiki.unraid.net/Unraid_OS_6.9.0#Docker

 

The release notes mention different options, but I'm not sure I understand what the benefit of XFS (or a directory) would be for Docker.

I left it btrfs in case I need to roll back (if that's even still an option).

 

I also set up a syslog server on Unraid itself, logging to itself, so hopefully I can catch more of what happens the next time it kernel panics (if it does).
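A syslog mirror only helps if the crash lines actually make it to disk, so it's worth being able to pull the kernel's crash markers out of the mirrored file quickly. A rough sketch; the path below is a made-up example, substitute wherever the local syslog server actually writes:

```shell
#!/bin/bash
# Extract kernel crash markers (cut here / WARNING / RIP / Call Trace)
# from a syslog file or mirror.
panic_lines() {
    grep -E "kernel: (-+\[ cut here \]|WARNING:|RIP:|Call Trace:)" "$1"
}

# On the server (hypothetical mirror path):
#   panic_lines /mnt/user/syslog/syslog-server.log
```

If the function prints nothing after a crash, the panic happened too fast for anything to be flushed, which is itself useful to know.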

 

Fingers crossed!


Sadly, here we are again with a 3rd... 

 

I'll have to try grabbing my syslog later. 

 

6.9.1 was released, but the changelog doesn't mention any kernel panic fixes.

 

I've never had an issue before 6.9.0.

 

Is the best option here a downgrade? Or is there a way to track down what's causing this with the syslog?

 

The RIP line shown here is different from the last screenshot; not sure if that's anything to go on.

 

The first panic's RIP referenced nf_nat, the second mentioned btrfs, and now we're back to nf_nat again.

 

Help is greatly appreciated here. 

PXL_20210310_115213352.jpg

Edited by CorneliousJD
28 minutes ago, JorgeB said:

You can try this to capture the complete crash.

 

I actually had that set up already. I grabbed the syslog; here's where the crash happened at 4:40 AM, followed by the reboot at 6:56 AM.

 

The CA plugin and Docker auto-updates ran successfully at midnight and 4 AM respectively; those finished and things seemed stable until the random crash.

Looks like it happened right after the "MyServers" plugin backed up my flash drive? Setting that up IS one of the recent changes I made; should I remove/uninstall it? Not sure if there have been any other reports of kernel panics from it.

 

Maybe I'm drawing lines where there shouldn't be any? 

 

Mar 10 00:00:08 Server Plugin Auto Update: Community Applications Plugin Auto Update finished
Mar 10 00:00:11 Server emhttpd: read SMART /dev/sdj
Mar 10 00:18:39 Server emhttpd: read SMART /dev/sdg
Mar 10 00:30:02 Server root: /var/lib/docker: 1.3 GiB (1385422848 bytes) trimmed on /dev/loop2
Mar 10 00:30:02 Server root: /etc/libvirt: 125.7 MiB (131821568 bytes) trimmed on /dev/loop3
Mar 10 03:00:14 Server emhttpd: spinning down /dev/sdj
Mar 10 03:00:14 Server emhttpd: spinning down /dev/sdl
Mar 10 03:00:16 Server crond[2515]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Mar 10 03:14:38 Server emhttpd: read SMART /dev/sdm
Mar 10 04:00:01 Server Docker Auto Update: Community Applications Docker Autoupdate running
Mar 10 04:00:01 Server Docker Auto Update: Checking for available updates
Mar 10 04:00:47 Server Docker Auto Update: Stopping SnipeIT
Mar 10 04:00:47 Server kernel: docker0: port 35(veth5ceba18) entered disabled state
Mar 10 04:00:47 Server kernel: veth95a63a5: renamed from eth0
Mar 10 04:00:47 Server kernel: docker0: port 35(veth5ceba18) entered disabled state
Mar 10 04:00:47 Server kernel: device veth5ceba18 left promiscuous mode
Mar 10 04:00:47 Server kernel: docker0: port 35(veth5ceba18) entered disabled state
Mar 10 04:00:48 Server Docker Auto Update: Stopping Jackett
Mar 10 04:00:52 Server kernel: veth2f06713: renamed from eth0
Mar 10 04:00:52 Server kernel: docker0: port 6(veth5ba43bd) entered disabled state
Mar 10 04:00:52 Server kernel: docker0: port 6(veth5ba43bd) entered disabled state
Mar 10 04:00:52 Server kernel: device veth5ba43bd left promiscuous mode
Mar 10 04:00:52 Server kernel: docker0: port 6(veth5ba43bd) entered disabled state
Mar 10 04:00:52 Server Docker Auto Update: Installing Updates for SnipeIT Jackett
Mar 10 04:03:14 Server Docker Auto Update: Restarting SnipeIT
Mar 10 04:03:15 Server kernel: docker0: port 6(veth878096f) entered blocking state
Mar 10 04:03:15 Server kernel: docker0: port 6(veth878096f) entered disabled state
Mar 10 04:03:15 Server kernel: device veth878096f entered promiscuous mode
Mar 10 04:03:15 Server kernel: eth0: renamed from veth1eb125e
Mar 10 04:03:15 Server kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth878096f: link becomes ready
Mar 10 04:03:15 Server kernel: docker0: port 6(veth878096f) entered blocking state
Mar 10 04:03:15 Server kernel: docker0: port 6(veth878096f) entered forwarding state
Mar 10 04:03:16 Server Docker Auto Update: Restarting Jackett
Mar 10 04:03:16 Server kernel: docker0: port 32(veth229665f) entered blocking state
Mar 10 04:03:16 Server kernel: docker0: port 32(veth229665f) entered disabled state
Mar 10 04:03:16 Server kernel: device veth229665f entered promiscuous mode
Mar 10 04:03:17 Server kernel: eth0: renamed from veth691ff9e
Mar 10 04:03:17 Server kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth229665f: link becomes ready
Mar 10 04:03:17 Server kernel: docker0: port 32(veth229665f) entered blocking state
Mar 10 04:03:17 Server kernel: docker0: port 32(veth229665f) entered forwarding state
Mar 10 04:03:18 Server Docker Auto Update: Community Applications Docker Autoupdate finished
Mar 10 04:40:07 Server root: Fix Common Problems Version 2021.02.18
Mar 10 04:40:25 Server root: Fix Common Problems: Ignored errors / warnings / other comments found, but not logged per user settings
Mar 10 04:40:47 Server emhttpd: read SMART /dev/sdj
Mar 10 04:40:50 Server flash_backup: adding task: php /usr/local/emhttp/plugins/dynamix.unraid.net/include/UpdateFlashBackup.php update
Mar 10 04:40:52 Server emhttpd: read SMART /dev/sdl
Mar 10 06:56:25 Server root: Delaying execution of fix common problems scan for 10 minutes
Mar 10 06:56:25 Server unassigned.devices: Mounting 'Auto Mount' Devices...
Mar 10 06:56:25 Server emhttpd: /usr/local/emhttp/plugins/user.scripts/backgroundScript.sh "/tmp/user.scripts/tmpScripts/Boot - USB Device Symlink/script" >/dev/null 2>&1
Mar 10 06:56:25 Server emhttpd: Starting services...
Mar 10 06:56:25 Server emhttpd: shcmd (91): /etc/rc.d/rc.samba restart

 

5 minutes ago, JorgeB said:

There's nothing logged, so no idea what's causing the problem. Doubt it's MyServers, but try uninstalling it.

 

I removed it. Is there anything else I can do to try and find the cause of the problem? 

 

Is it safe to downgrade back to 6.8.X if it happens again? I never had a problem until 6.9.0.

Now that MyServers is removed, if it happens again, the only remaining changes are the 1MiB partition layout and 6.9.0 itself.

 

I don't think the 1MiB partition layout of the cache would cause any issues if I tried downgrading back to 6.8, but I was hoping to get confirmation?

1 hour ago, CorneliousJD said:

Is it safe to downgrade back to 6.8.X if it happens again?

Yes.

 

1 hour ago, CorneliousJD said:

that the 1MiB partition of the cache wouldn't cause any issues if I tried downgrading back to 6.8?

It's OK if you use a pool, and you can have a single-device pool if cache slots are set to >1.

44 minutes ago, JorgeB said:

Yes.

 

It's OK if you use a pool, and you can have a single-device pool if cache slots are set to >1.

 

Thank you for confirming - I'm using a 2-device btrfs pool, so that's good to know.

 

I'll sit tight and see what happens first; if it happens again I'll try downgrading. If it happens again on 6.8.X, that will open a whole new can of worms.

 

Thank you @JorgeB


I'm also running tail -f /var/log/syslog on the local monitor, although I think the kernel panic itself will force that off the screen...

 

I am *ALSO* keeping the system log open via the web GUI on my PC, in case it shows something there that never had a chance to be written to the syslog server...

 

Worth a shot while I wait and see.


Welp, got another one... 

 

Tailing the log didn't show anything on screen, because the panic output takes up the whole screen.

Live-monitoring the syslog from the web UI also didn't show anything; the last logged action was HOURS ago, a disk spinning down.

 

 

I saw other users on Discord having kernel panic issues on 6.9.0 too.

Is there anything else we can do to try to find the root cause of this problem?

For now, without any other suggestions, I suppose my only option is to just downgrade...

 

 

 

image.thumb.png.8bb09b5808b75f39e3dba3f3f7303be2.png

 


Actually, it looks like my syslog DID log the error this time.

 

Mar 12 03:57:07 Server kernel: ------------[ cut here ]------------
Mar 12 03:57:07 Server kernel: WARNING: CPU: 17 PID: 626 at net/netfilter/nf_nat_core.c:614 nf_nat_setup_info+0x6c/0x652 [nf_nat]
Mar 12 03:57:07 Server kernel: Modules linked in: ccp macvlan xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap veth xt_nat xt_MASQUERADE iptable_nat nf_nat xfs md_mod ip6table_filter ip6_tables iptable_filter ip_tables bonding igb i2c_algo_bit cp210x usbserial sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd ipmi_ssif isci glue_helper mpt3sas i2c_i801 rapl libsas i2c_smbus input_leds i2c_core ahci intel_cstate raid_class led_class acpi_ipmi intel_uncore libahci scsi_transport_sas wmi ipmi_si button [last unloaded: ipmi_devintf]
Mar 12 03:57:07 Server kernel: CPU: 17 PID: 626 Comm: kworker/17:2 Tainted: G        W         5.10.19-Unraid #1
Mar 12 03:57:07 Server kernel: Hardware name: Supermicro PIO-617R-TLN4F+-ST031/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.2 03/04/2015
Mar 12 03:57:07 Server kernel: Workqueue: events macvlan_process_broadcast [macvlan]
Mar 12 03:57:07 Server kernel: RIP: 0010:nf_nat_setup_info+0x6c/0x652 [nf_nat]
Mar 12 03:57:07 Server kernel: Code: 89 fb 49 89 f6 41 89 d4 76 02 0f 0b 48 8b 93 80 00 00 00 89 d0 25 00 01 00 00 45 85 e4 75 07 89 d0 25 80 00 00 00 85 c0 74 07 <0f> 0b e9 1f 05 00 00 48 8b 83 90 00 00 00 4c 8d 6c 24 20 48 8d 73
Mar 12 03:57:07 Server kernel: RSP: 0018:ffffc90006778c38 EFLAGS: 00010202
Mar 12 03:57:07 Server kernel: RAX: 0000000000000080 RBX: ffff88837c8303c0 RCX: ffff88811e834880
Mar 12 03:57:07 Server kernel: RDX: 0000000000000180 RSI: ffffc90006778d14 RDI: ffff88837c8303c0
Mar 12 03:57:07 Server kernel: RBP: ffffc90006778d00 R08: 0000000000000000 R09: ffff889083c68160
Mar 12 03:57:07 Server kernel: R10: 0000000000000158 R11: ffff8881e79c1400 R12: 0000000000000000
Mar 12 03:57:07 Server kernel: R13: 0000000000000000 R14: ffffc90006778d14 R15: 0000000000000001
Mar 12 03:57:07 Server kernel: FS:  0000000000000000(0000) GS:ffff88903fc40000(0000) knlGS:0000000000000000
Mar 12 03:57:07 Server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 12 03:57:07 Server kernel: CR2: 000000c000b040b8 CR3: 000000000200c005 CR4: 00000000001706e0
Mar 12 03:57:07 Server kernel: Call Trace:
Mar 12 03:57:07 Server kernel: <IRQ>
Mar 12 03:57:07 Server kernel: ? activate_task+0x9/0x12
Mar 12 03:57:07 Server kernel: ? resched_curr+0x3f/0x4c
Mar 12 03:57:07 Server kernel: ? ipt_do_table+0x49b/0x5c0 [ip_tables]
Mar 12 03:57:07 Server kernel: ? try_to_wake_up+0x1b0/0x1e5
Mar 12 03:57:07 Server kernel: nf_nat_alloc_null_binding+0x71/0x88 [nf_nat]
Mar 12 03:57:07 Server kernel: nf_nat_inet_fn+0x91/0x182 [nf_nat]
Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
Mar 12 03:57:07 Server kernel: ? ip_protocol_deliver_rcu+0xfe/0xfe
Mar 12 03:57:07 Server kernel: ip_local_deliver+0x49/0x75
Mar 12 03:57:07 Server kernel: ip_sabotage_in+0x43/0x4d
Mar 12 03:57:07 Server kernel: nf_hook_slow+0x39/0x8e
Mar 12 03:57:07 Server kernel: nf_hook.constprop.0+0xb1/0xd8
Mar 12 03:57:07 Server kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
Mar 12 03:57:07 Server kernel: ip_rcv+0x41/0x61
Mar 12 03:57:07 Server kernel: __netif_receive_skb_one_core+0x74/0x95
Mar 12 03:57:07 Server kernel: process_backlog+0xa3/0x13b
Mar 12 03:57:07 Server kernel: net_rx_action+0xf4/0x29d
Mar 12 03:57:07 Server kernel: __do_softirq+0xc4/0x1c2
Mar 12 03:57:07 Server kernel: asm_call_irq_on_stack+0x12/0x20
Mar 12 03:57:07 Server kernel: </IRQ>
Mar 12 03:57:07 Server kernel: do_softirq_own_stack+0x2c/0x39
Mar 12 03:57:07 Server kernel: do_softirq+0x3a/0x44
Mar 12 03:57:07 Server kernel: netif_rx_ni+0x1c/0x22
Mar 12 03:57:07 Server kernel: macvlan_broadcast+0x10e/0x13c [macvlan]
Mar 12 03:57:07 Server kernel: macvlan_process_broadcast+0xf8/0x143 [macvlan]
Mar 12 03:57:07 Server kernel: process_one_work+0x13c/0x1d5
Mar 12 03:57:07 Server kernel: worker_thread+0x18b/0x22f
Mar 12 03:57:07 Server kernel: ? process_scheduled_works+0x27/0x27
Mar 12 03:57:07 Server kernel: kthread+0xe5/0xea
Mar 12 03:57:07 Server kernel: ? __kthread_bind_mask+0x57/0x57
Mar 12 03:57:07 Server kernel: ret_from_fork+0x22/0x30
Mar 12 03:57:07 Server kernel: ---[ end trace b3ca21ac5f2c2720 ]---

 

1 hour ago, JorgeB said:

Macvlan call traces are usually the result of having dockers with a custom IP address, more info below.

 

https://forums.unraid.net/topic/70529-650-call-traces-when-assigning-ip-address-to-docker-containers/

 

Thanks, I'm working on setting up a VLAN now for the 2 containers on br0.
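For anyone else hunting down which containers are affected, the network mode of every running container can be listed with the docker CLI. A rough sketch; the helper simply treats anything other than Docker's built-in bridge/host/none modes as "custom" (e.g. br0 or another macvlan network), which is an assumption about naming, not an official classification:

```shell
#!/bin/bash
# Flag network modes other than Docker's built-in ones (bridge, host,
# none). Containers on custom networks like br0 are the ones relevant
# to the macvlan call-trace issue.
is_custom_network() {
    case "$1" in
        bridge|host|none|default) return 1 ;;
        *) return 0 ;;
    esac
}

# On the live server:
#   docker ps --format '{{.Names}}' | while read -r name; do
#       mode=$(docker inspect --format '{{.HostConfig.NetworkMode}}' "$name")
#       is_custom_network "$mode" && echo "$name -> $mode"
#   done
```

Anything the loop prints is a candidate for moving to host networking or a VLAN.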

 

I also want to note that I tried downgrading to 6.8.3, but the cache pool just showed up as unassigned devices, so I'm back on 6.9.1 now; otherwise my cache wasn't working.

45 minutes ago, JorgeB said:

That's normal, it's in the release notes, you need to re-assign it.

 

Ah, sorry - I saw the devices didn't assign and just came back to 6.9.1.

 

I've changed one Docker container (Home Assistant Core) to HOST network mode instead of its own IP, and changed the Pi-hole container to be on a VLAN, so nothing is using br0 anymore. Fingers crossed again.

 

Thanks again for the assistance. 

