Regular Kernel Panics UNRAID v6.6.3


I've had fairly consistent kernel panics on most versions of UNRAID v6.6.x. I can't say exactly when they started because I didn't always notice them, and I think I initially blamed a couple of them on some power loss we had.

 

Hardware is a Supermicro MBD-A1SRi-2758F-O Mini ITX server motherboard. It ran just fine previously with Proxmox and with earlier versions of UNRAID. All drives are connected to an LSISAS2008 : FWVersion(18.00.00.00), ChipRevision(0x03), BiosVersion(07.35.00.00) (SAS9211-8I 8PORT Int 6GB Sata+sas Pcie 2.0), which I believe is used by a few people here. Additionally, I have 64GB of ECC RAM. All of this had been running solid for months (from at least December of last year until maybe a month or so ago). I've got four 1Gb NICs aggregated together to a Unifi switch which is configured for aggregation.

 

I have IPMI on this device, but I can only see the last part of the console output. What I've captured is attached, but I fear the useful information is above what those screens show.

 

I was using AFP on this server (for Time Machine backups). I saw an article saying this might cause some issues, so I've disabled AFP altogether. There are only a few docker containers: binhex-jenkins, duckdns, letsencrypt (not running), Netdata (not running), and unifi. All are up to date.

 

I enabled mcelog, or thought I did, with nerdtools, but I don't see it running; I may need to do something else. I also set up CA Fix Common Problems and did everything it said to do. The usage on this system is pretty light. Right now it's mostly just used for backups from a couple of Proxmox servers, and their backup schedules don't seem to correlate with the panics, as far as I can tell.

 

Is there a correct way to make sure mcelog is running and logging? What else can I do here to troubleshoot this? I just updated to 6.6.5 and, since I'm rebooting anyway, I'd like to make any other settings changes now so I can try to catch this thing and get some usable data.
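
One rough way to answer the "is it running?" part is to scan /proc for a process named mcelog. The sketch below assumes Python 3 is installed (it is not part of a stock Unraid install) and that mcelog was started as a long-running daemon; if it is invoked periodically instead, no process will show up between runs. The function name `find_processes` is made up for this example.

```python
import os


def find_processes(name, proc_root="/proc"):
    """Return the PIDs whose <proc_root>/<pid>/comm matches name exactly."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        # No /proc here (for example, not a Linux system).
        return []
    pids = []
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            # The process exited between listdir() and open(); ignore it.
            continue
    return sorted(pids)


if __name__ == "__main__":
    pids = find_processes("mcelog")
    if pids:
        print("mcelog is running, PIDs:", pids)
    else:
        print("mcelog does not appear to be running")
```

A running process only shows that the daemon started. Whether it is actually logging machine-check events is a separate question that this check does not answer.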

 

Screenshot 2018-11-09 10.55.19.png

Screenshot 2018-11-05 09.26.19.png

Screenshot 2018-10-31 07.53.21.png

Screenshot 2018-10-29 10.46.23.png

  • 2 weeks later...

I just caught one, and I think I have the full dump now. This wasn't a kernel panic, at least, but all network activity stopped. This coincided with me trying to update my Unifi docker container. In fact, Unifi was already unresponsive before I started, so I figured I'd just update it. Docker got about halfway through, to the point where it was shutting the container down, and then I lost network connectivity. I was able to get on the console through IPMI, read the syslog, and take screenshots. I copied the syslog to the USB stick and I'm waiting for it to reboot now, hoping it sticks.


My syslog from today attached. Seems like this was the first occurrence this morning:

Nov 20 05:01:50 nas kernel: WARNING: CPU: 6 PID: 23287 at net/netfilter/nf_conntrack_core.c:763 __nf_conntrack_confirm+0x96/0x4fc
Nov 20 05:01:50 nas kernel: Modules linked in: xt_CHECKSUM iptable_mangle ipt_REJECT ebtable_filter ebtables ip6table_filter ip6_tables vhost_net tun vhost tap xt_nat macvlan ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 iptable_filter ip_tables nf_nat xfs nfsd lockd grace sunrpc md_mod ipmi_devintf bonding igb i2c_algo_bit intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper mpt3sas ahci libahci intel_cstate raid_class scsi_transport_sas ipmi_ssif i2c_i801 i2c_core pcc_cpufreq button ipmi_si acpi_cpufreq [last unloaded: i2c_algo_bit]
Nov 20 05:01:50 nas kernel: CPU: 6 PID: 23287 Comm: kworker/6:1 Tainted: G    B D W         4.18.17-unRAID #1
Nov 20 05:01:50 nas kernel: Hardware name: Supermicro A1SAi/A1SRi, BIOS 1.1a 08/27/2015
Nov 20 05:01:50 nas kernel: Workqueue: events macvlan_process_broadcast [macvlan]
Nov 20 05:01:50 nas kernel: RIP: 0010:__nf_conntrack_confirm+0x96/0x4fc
Nov 20 05:01:50 nas kernel: Code: c1 ed 20 89 2c 24 e8 26 f7 ff ff 8b 54 24 04 89 ef 89 c6 41 89 c5 e8 bc f8 ff ff 84 c0 75 b9 49 8b 86 80 00 00 00 a8 08 74 02 <0f> 0b 4c 89 f7 e8 04 ff ff ff 49 8b 86 80 00 00 00 0f ba e0 09 73 
Nov 20 05:01:50 nas kernel: RSP: 0018:ffff880fefd83d30 EFLAGS: 00010202
Nov 20 05:01:50 nas kernel: RAX: 0000000000000188 RBX: ffff880d928d3200 RCX: 0000000000000101
Nov 20 05:01:50 nas kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffff81e092a0
Nov 20 05:01:50 nas kernel: RBP: 000000000000d9d7 R08: 0000000009d52fd0 R09: 0000000000000000
Nov 20 05:01:50 nas kernel: R10: 0000000000000000 R11: ffff880d50e60000 R12: ffffffff81e8ccc0
Nov 20 05:01:50 nas kernel: R13: 0000000000006b28 R14: ffff880d9c5d63c0 R15: ffff880d9c5d6418
Nov 20 05:01:50 nas kernel: FS:  0000000000000000(0000) GS:ffff880fefd80000(0000) knlGS:0000000000000000
Nov 20 05:01:50 nas kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 20 05:01:50 nas kernel: CR2: 0000146bfa86f000 CR3: 0000000001e0a000 CR4: 00000000001006e0
Nov 20 05:01:50 nas kernel: Call Trace:
Nov 20 05:01:50 nas kernel: <IRQ>
Nov 20 05:01:50 nas kernel: ipv4_confirm+0xaf/0xb7 [nf_conntrack_ipv4]
Nov 20 05:01:50 nas kernel: nf_hook_slow+0x37/0x96
Nov 20 05:01:50 nas kernel: ip_local_deliver+0xa7/0xd5
Nov 20 05:01:50 nas kernel: ? inet_del_offload+0x3e/0x3e
Nov 20 05:01:50 nas kernel: ip_rcv+0x2dc/0x317
Nov 20 05:01:50 nas kernel: ? ip_local_deliver_finish+0x1aa/0x1aa
Nov 20 05:01:50 nas kernel: __netif_receive_skb_core+0x6b2/0x740
Nov 20 05:01:50 nas kernel: process_backlog+0x7e/0x116
Nov 20 05:01:50 nas kernel: net_rx_action+0x10b/0x274
Nov 20 05:01:50 nas kernel: __do_softirq+0xce/0x1c8
Nov 20 05:01:50 nas kernel: do_softirq_own_stack+0x2a/0x40
Nov 20 05:01:50 nas kernel: </IRQ>
Nov 20 05:01:50 nas kernel: do_softirq+0x4d/0x59
Nov 20 05:01:50 nas kernel: netif_rx_ni+0x1c/0x22
Nov 20 05:01:50 nas kernel: macvlan_broadcast+0x10f/0x153 [macvlan]
Nov 20 05:01:50 nas kernel: ? __switch_to_asm+0x34/0x70
Nov 20 05:01:50 nas kernel: macvlan_process_broadcast+0xd5/0x131 [macvlan]
Nov 20 05:01:50 nas kernel: process_one_work+0x16e/0x243
Nov 20 05:01:50 nas kernel: ? cancel_delayed_work_sync+0xa/0xa
Nov 20 05:01:50 nas kernel: worker_thread+0x1dc/0x2ac
Nov 20 05:01:50 nas kernel: kthread+0x10b/0x113
Nov 20 05:01:50 nas kernel: ? kthread_flush_work_fn+0x9/0x9
Nov 20 05:01:50 nas kernel: ret_from_fork+0x35/0x40
Nov 20 05:01:50 nas kernel: ---[ end trace cf2d1fc891b38b47 ]---
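
To see at a glance how often each kind of event shows up in a long syslog, something like the following could be used. It is only a sketch: it assumes the syslog line format shown above (a 15-character timestamp, then `kernel: WARNING:` or `kernel: BUG:`), and the function name and grouping rules are made up for this example.

```python
import re
from collections import Counter

# Matches the headline of a kernel WARNING or BUG report in a syslog line.
EVENT_RE = re.compile(r"kernel: (WARNING|BUG): (\S.*)")


def summarize_kernel_events(lines):
    """Count WARNING/BUG headlines and record the timestamp of the first of each."""
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = EVENT_RE.search(line)
        if not match:
            continue
        kind, detail = match.groups()
        if kind == "WARNING":
            # Keep the source location, e.g. net/netfilter/nf_conntrack_core.c:763
            key = "WARNING at " + detail.split(" at ", 1)[-1].split()[0]
        else:
            # Drop the per-process details so repeats group together.
            key = "BUG: " + detail.split(" in process", 1)[0]
        counts[key] += 1
        first_seen.setdefault(key, line[:15])  # classic syslog timestamp width
    return counts, first_seen


if __name__ == "__main__":
    sample = [
        "Nov 20 05:01:50 nas kernel: WARNING: CPU: 6 PID: 23287 at "
        "net/netfilter/nf_conntrack_core.c:763 __nf_conntrack_confirm+0x96/0x4fc",
        "Nov 20 08:29:02 nas kernel: BUG: Bad page map in process php-fpm7  "
        "pte:ffff880f7acfd9b8 pmd:56ef73067",
    ]
    counts, first_seen = summarize_kernel_events(sample)
    for key, count in counts.items():
        print(first_seen[key], count, key)
```

On a live system the same function can be fed an open file, for example `summarize_kernel_events(open("/var/log/syslog"))`.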

 

oldsyslog.txt


Yeah, it seems like that's not the problem. Farther down I'm seeing this, which is the most concerning: "Bad page map".

 

It seems almost like it's pointing to memory, but I'm running ECC and not seeing any ECC-related errors, so I'm not sure. This thing, with the same hardware configuration, had been stable for months...

 

Nov 20 08:29:02 nas kernel: BUG: Bad page map in process php-fpm7  pte:ffff880f7acfd9b8 pmd:56ef73067
Nov 20 08:29:02 nas kernel: addr:000000000a2169c4 vm_flags:00000075 anon_vma:          (null) mapping:000000005445b8e9 index:85
Nov 20 08:29:02 nas kernel: file:xmlreader.so fault:filemap_fault mmap:btrfs_file_mmap readpage:btrfs_readpage
Nov 20 08:29:02 nas kernel: CPU: 6 PID: 6559 Comm: php-fpm7 Tainted: G    B D W         4.18.17-unRAID #1
Nov 20 08:29:02 nas kernel: Hardware name: Supermicro A1SAi/A1SRi, BIOS 1.1a 08/27/2015
Nov 20 08:29:02 nas kernel: Call Trace:
Nov 20 08:29:02 nas kernel: dump_stack+0x5d/0x79
Nov 20 08:29:02 nas kernel: print_bad_pte+0x212/0x22f
Nov 20 08:29:02 nas kernel: _vm_normal_page+0x50/0xa6
Nov 20 08:29:02 nas kernel: unmap_page_range+0x4b6/0x88a
Nov 20 08:29:02 nas kernel: unmap_vmas+0x4b/0x7f
Nov 20 08:29:02 nas kernel: exit_mmap+0xc8/0x16a
Nov 20 08:29:02 nas kernel: ? wake_bit_function+0x1/0x20
Nov 20 08:29:02 nas kernel: mmput+0x4d/0xe5
Nov 20 08:29:02 nas kernel: do_exit+0x3a4/0x8a4
Nov 20 08:29:02 nas kernel: ? dput.part.5+0xdf/0xea
Nov 20 08:29:02 nas kernel: do_group_exit+0x9a/0x9a
Nov 20 08:29:02 nas kernel: get_signal+0x417/0x44c
Nov 20 08:29:02 nas kernel: ? wait_woken+0x68/0x68
Nov 20 08:29:02 nas kernel: do_signal+0x31/0x59d
Nov 20 08:29:02 nas kernel: ? inet_accept+0x3e/0x127
Nov 20 08:29:02 nas kernel: ? put_unused_fd+0x31/0x40
Nov 20 08:29:02 nas kernel: ? __do_page_fault+0x379/0x40b
Nov 20 08:29:02 nas kernel: exit_to_usermode_loop+0x25/0x96
Nov 20 08:29:02 nas kernel: do_syscall_64+0xdf/0xe6
Nov 20 08:29:02 nas kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 20 08:29:02 nas kernel: RIP: 0033:0x14ed7bb025e4
Nov 20 08:29:02 nas kernel: Code: Bad RIP value.
Nov 20 08:29:02 nas kernel: RSP: 002b:00007fff4e407778 EFLAGS: 00000246 ORIG_RAX: 000000000000002b
Nov 20 08:29:02 nas kernel: RAX: fffffffffffffe00 RBX: 000014ed7bd3eb88 RCX: 000014ed7bb025e4
Nov 20 08:29:02 nas kernel: RDX: 00007fff4e4077f0 RSI: 00007fff4e4077f8 RDI: 0000000000000008
Nov 20 08:29:02 nas kernel: RBP: 000000000000002b R08: 0000000000000000 R09: 0000000000000000
Nov 20 08:29:02 nas kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000008
Nov 20 08:29:02 nas kernel: R13: 00007fff4e407f08 R14: 0000000000000000 R15: 0000000000000000
Nov 20 08:29:02 nas kernel: BUG: Bad page map in process php-fpm7  pte:ffff880f7acc5998 pmd:6b1c61067
Nov 20 08:29:02 nas kernel: addr:00000000c2da8294 vm_flags:00000075 anon_vma:          (null) mapping:000000002c2a5265 index:15e
Nov 20 08:29:02 nas kernel: file:libldap_r-2.4.so.2.10.9 fault:filemap_fault mmap:btrfs_file_mmap readpage:btrfs_readpage
Nov 20 08:29:02 nas kernel: CPU: 7 PID: 6557 Comm: php-fpm7 Tainted: G    B D W         4.18.17-unRAID #1
Nov 20 08:29:02 nas kernel: Hardware name: Supermicro A1SAi/A1SRi, BIOS 1.1a 08/27/2015
Nov 20 08:29:02 nas kernel: Call Trace:
Nov 20 08:29:02 nas kernel: dump_stack+0x5d/0x79
Nov 20 08:29:02 nas kernel: print_bad_pte+0x212/0x22f
Nov 20 08:29:02 nas kernel: _vm_normal_page+0x50/0xa6
Nov 20 08:29:02 nas kernel: unmap_page_range+0x4b6/0x88a
Nov 20 08:29:02 nas kernel: unmap_vmas+0x4b/0x7f
Nov 20 08:29:02 nas kernel: exit_mmap+0xc8/0x16a
Nov 20 08:29:02 nas kernel: mmput+0x4d/0xe5
Nov 20 08:29:02 nas kernel: do_exit+0x3a4/0x8a4
Nov 20 08:29:02 nas kernel: ? handle_mm_fault+0x159/0x1a8
Nov 20 08:29:02 nas kernel: do_group_exit+0x9a/0x9a
Nov 20 08:29:02 nas kernel: __x64_sys_exit_group+0xf/0xf
Nov 20 08:29:02 nas kernel: do_syscall_64+0x57/0xe6
Nov 20 08:29:02 nas kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Nov 20 08:29:02 nas kernel: RIP: 0033:0x14ed7bacbf9a
Nov 20 08:29:02 nas kernel: Code: Bad RIP value.
Nov 20 08:29:02 nas kernel: RSP: 002b:00007fff4e4075f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
Nov 20 08:29:02 nas kernel: RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 000014ed7bacbf9a
Nov 20 08:29:02 nas kernel: RDX: 00000000000004ca RSI: 0000000000000000 RDI: 0000000000000000
Nov 20 08:29:02 nas kernel: RBP: 00000000000004ca R08: 00000000000000ca R09: f270ebec9d813dc2
Nov 20 08:29:02 nas kernel: R10: 000055661af14178 R11: 0000000000000246 R12: 000055661afccf40
Nov 20 08:29:02 nas kernel: R13: 00007fff4e407728 R14: 0000000000000001 R15: 000055661afcc640
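
The `Tainted: G    B D W` field in these traces is worth decoding. In the kernel's taint flags, B means a bad page was encountered, D means the kernel has already hit an OOPS or BUG, and W means a warning was issued earlier. Since the 05:01:50 trace already carries B and D, it suggests bad-page and oops events had been recorded since boot before that warning. A small sketch for pulling the letters out of a log line follows; the lookup table covers only a subset of the flags, and the function names are made up for this example.

```python
# Subset of the kernel's taint letters; anything else is reported as unlisted.
TAINT_MEANINGS = {
    "G": "no proprietary module loaded",
    "P": "proprietary module loaded",
    "B": "bad page referenced or unexpected page flags",
    "D": "kernel has died recently (an OOPS or BUG occurred)",
    "W": "a kernel warning has been issued",
    "O": "out-of-tree module loaded",
}


def taint_letters(line):
    """Return the taint letters from a 'Tainted: ...' log line, in order."""
    marker = "Tainted: "
    start = line.find(marker)
    if start == -1:
        return []  # e.g. a "Not tainted" line
    letters = []
    for token in line[start + len(marker):].split():
        # The flags are upper-case letters; the kernel release string that
        # follows (4.18.17-unRAID here) ends the field.
        if not (token.isalpha() and token.isupper()):
            break
        letters.extend(token)
    return letters


def describe_taint(line):
    """Pair each taint letter with its meaning from the table above."""
    return [(letter, TAINT_MEANINGS.get(letter, "not listed in this sketch"))
            for letter in taint_letters(line)]


if __name__ == "__main__":
    line = ("Nov 20 05:01:50 nas kernel: CPU: 6 PID: 23287 Comm: kworker/6:1 "
            "Tainted: G    B D W         4.18.17-unRAID #1")
    for letter, meaning in describe_taint(line):
        print(letter, "-", meaning)
```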

 

 

Just now, John_M said:

Same hardware but you've changed the software - specifically, the kernel. Have you looked for a BIOS update or is the board no longer supported?

Oh, for sure. Most recent BIOS update, and the board is supported until 2020, supposedly.

 

Is it possible that the macvlan stuff is making things unstable? I can't remember when I actually enabled that. I'm going to disable Unifi (since that's the only thing that's really using that) and see if that helps.

 

I plan on running a full memtest tonight. I let it go through about 7% (5-6 tests, IIRC) but would like to see a full run.


Okay, so I just decided to disable docker and KVM altogether. KVM had a virtual machine, but it wasn't powered on or being used. The longest uptime I've had so far has been 8 days, so I guess I'll see what I get now.

I'm a little bit discouraged, as this is made to be a NAS platform. I can't find others with this problem AFAIK, so maybe it is bad hardware, but it's been a pretty painful thing to try to diagnose.

3 minutes ago, billchurch said:

most recent bios update, the board is supported until 2020 supposedly.

Why is it reporting a BIOS from 2015 then?

Nov 20 08:29:02 nas kernel: Hardware name: Supermicro A1SAi/A1SRi, BIOS 1.1a 08/27/2015

 

5 minutes ago, billchurch said:

Is it possible that the macvlan stuff is making things unstable?

Well, it's a known issue and it's easy to test. So yes, try disabling it. I like to go a stage further when I'm trying to troubleshoot. It's easy to shut down entire sub-systems (dockers and VMs) and that makes it much easier to narrow down problems like this. Of course, you can't prove much unless you have good RAM and good power so, yes, a full memory test is a good place to start. I recommend the free download version of MemTest86 (version 7.5): make a separate bootable USB flash drive and UEFI-boot it.

4 minutes ago, billchurch said:

Okay, so I just decided to disable docker and KVM altogether. KVM had a virtual machine, but it wasn't powered on or being used. The longest uptime I've had so far has been 8 days, so I guess I'll see what I get now.

I'm a little bit discouraged, as this is made to be a NAS platform. I can't find others with this problem AFAIK, so maybe it is bad hardware, but it's been a pretty painful thing to try to diagnose.

Ha! You beat me to it.

 

Unraid started off as a 32-bit NAS OS and it has since grown. Is that a bad thing? I don't think so. Strip it down to its basics and get it running as a stable NAS and then add the extras back in. You might have bad hardware. Your diagnostics (Tools -> Diagnostics) might reveal more.

 

There were big changes between the kernels used in version 6.5 and 6.6. Some people are sticking with 6.5.3 for a while. It's still available from the Downloads page if you'd like to try it. I still have it running on one server at the moment.


This is irritating and embarrassing. I know I checked that before and when @John_M mentioned that I did a double-take and sure enough there's an update. Now I just have to figure out how to make the update stick, lol. Thanks for calling that out guys, I'll give that a shot. 

 

Also... If this **is** the issue, sorry for wasting everyone's time. :/
