• [unRAID 6.10.0-rc1] - Seemingly random crashes


    danioj
    • Urgent

    I previously mentioned (on the release thread) that I had experienced a few random crashes since upgrading to 6.10.0-rc1. Each crash required a hard reset of the server, which made capturing diagnostics difficult.

     

    I have since enabled syslog mirroring, though, and was able to capture the following error:

     

    Aug 24 08:58:01 unraid kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000dc000-0x000dffff window]
    Aug 24 08:58:01 unraid kernel: caller _nv000722rm+0x1ad/0x200 [nvidia] mapping multiple BARs
    Aug 24 09:56:19 unraid kernel: ------------[ cut here ]------------
    Aug 24 09:56:19 unraid kernel: WARNING: CPU: 3 PID: 4821 at net/netfilter/nf_conntrack_core.c:1132 __nf_conntrack_confirm+0xa0/0x1eb [nf_conntrack]
    Aug 24 09:56:19 unraid kernel: Modules linked in: nvidia_modeset(PO) nvidia_uvm(PO) veth xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle xt_nat xt_tcpudp ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap macvlan xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs md_mod nvidia(PO) nct6775 hwmon_vid jc42 ip6table_filter ip6_tables iptable_filter ip_tables x_tables igb x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ast drm_vram_helper drm_ttm_helper ttm drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel drm crypto_simd cryptd rapl ahci ipmi_ssif agpgart intel_cstate mpt3sas syscopyarea intel_uncore sysfillrect sysimgblt i2c_i801 libahci fb_sys_fops input_leds intel_pch_thermal video i2c_algo_bit raid_class i2c_smbus scsi_transport_sas i2c_core led_class backlight thermal button acpi_ipmi fan ipmi_si [last unloaded: igb]
    Aug 24 09:56:19 unraid kernel: CPU: 3 PID: 4821 Comm: kworker/3:1 Tainted: P           O      5.13.8-Unraid #1
    Aug 24 09:56:19 unraid kernel: Hardware name: Supermicro X10SL7-F/X10SL7-F, BIOS 3.2 06/09/2018
    Aug 24 09:56:19 unraid kernel: Workqueue: events macvlan_process_broadcast [macvlan]
    Aug 24 09:56:19 unraid kernel: RIP: 0010:__nf_conntrack_confirm+0xa0/0x1eb [nf_conntrack]
    Aug 24 09:56:19 unraid kernel: Code: e8 7e f6 ff ff 44 89 fa 89 c6 41 89 c4 48 c1 eb 20 89 df 41 89 de e8 92 f4 ff ff 84 c0 75 bb 48 8b 85 80 00 00 00 a8 08 74 18 <0f> 0b 89 df 44 89 e6 31 db e8 c6 ed ff ff e8 09 f3 ff ff e9 22 01
    Aug 24 09:56:19 unraid kernel: RSP: 0018:ffffc9000015cd20 EFLAGS: 00010202
    Aug 24 09:56:19 unraid kernel: RAX: 0000000000000188 RBX: 0000000000008091 RCX: 00000000b4e7b974
    Aug 24 09:56:19 unraid kernel: RDX: 0000000000000000 RSI: 000000000000033c RDI: ffffffffa0264eb0
    Aug 24 09:56:19 unraid kernel: RBP: ffff888390b048c0 R08: 00000000f0e26b20 R09: ffff88818087a6a0
    Aug 24 09:56:19 unraid kernel: R10: ffff88822f778040 R11: 0000000000000000 R12: 000000000000233c
    Aug 24 09:56:19 unraid kernel: R13: ffffffff82168b00 R14: 0000000000008091 R15: 0000000000000000
    Aug 24 09:56:19 unraid kernel: FS:  0000000000000000(0000) GS:ffff8887ffcc0000(0000) knlGS:0000000000000000
    Aug 24 09:56:19 unraid kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Aug 24 09:56:19 unraid kernel: CR2: 00007f9528a8af73 CR3: 000000000200a005 CR4: 00000000001726e0
    Aug 24 09:56:19 unraid kernel: Call Trace:
    Aug 24 09:56:19 unraid kernel: <IRQ>
    Aug 24 09:56:19 unraid kernel: nf_conntrack_confirm+0x2f/0x36 [nf_conntrack]
    Aug 24 09:56:19 unraid kernel: nf_hook_slow+0x3e/0x93
    Aug 24 09:56:19 unraid kernel: ? ip_protocol_deliver_rcu+0x115/0x115
    Aug 24 09:56:19 unraid kernel: NF_HOOK.constprop.0+0x70/0xc8
    Aug 24 09:56:19 unraid kernel: ? ip_protocol_deliver_rcu+0x115/0x115
    Aug 24 09:56:19 unraid kernel: ip_sabotage_in+0x4c/0x59 [br_netfilter]
    Aug 24 09:56:19 unraid kernel: nf_hook_slow+0x3e/0x93
    Aug 24 09:56:19 unraid kernel: ? ip_rcv_finish_core.constprop.0+0x351/0x351
    Aug 24 09:56:19 unraid kernel: NF_HOOK.constprop.0+0x70/0xc8
    Aug 24 09:56:19 unraid kernel: ? ip_rcv_finish_core.constprop.0+0x351/0x351
    Aug 24 09:56:19 unraid kernel: __netif_receive_skb_one_core+0x77/0x98
    Aug 24 09:56:19 unraid kernel: process_backlog+0xab/0x143
    Aug 24 09:56:19 unraid kernel: __napi_poll+0x2a/0x114
    Aug 24 09:56:19 unraid kernel: net_rx_action+0xe8/0x1f2
    Aug 24 09:56:19 unraid kernel: __do_softirq+0xef/0x21b
    Aug 24 09:56:19 unraid kernel: do_softirq+0x50/0x68
    Aug 24 09:56:19 unraid kernel: </IRQ>
    Aug 24 09:56:19 unraid kernel: netif_rx_ni+0x56/0x8b
    Aug 24 09:56:19 unraid kernel: macvlan_broadcast+0x116/0x144 [macvlan]
    Aug 24 09:56:19 unraid kernel: macvlan_process_broadcast+0xc7/0x10b [macvlan]
    Aug 24 09:56:19 unraid kernel: process_one_work+0x196/0x274
    Aug 24 09:56:19 unraid kernel: worker_thread+0x19c/0x240
    Aug 24 09:56:19 unraid kernel: ? rescuer_thread+0x2a2/0x2a2
    Aug 24 09:56:19 unraid kernel: kthread+0xdf/0xe4
    Aug 24 09:56:19 unraid kernel: ? set_kthread_struct+0x32/0x32
    Aug 24 09:56:19 unraid kernel: ret_from_fork+0x22/0x30
    Aug 24 09:56:19 unraid kernel: ---[ end trace 44186f4b6dd2c3e1 ]---
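
    For anyone who wants to check a mirrored syslog for traces like the one above, here is a minimal sketch in Python. It only assumes the line format shown in the capture (a "cut here" marker, a WARNING line, an optional Workqueue line, and an "end trace" marker). The function name find_kernel_traces and the dictionary keys are made up for illustration, and the location of the mirrored syslog file is not stated in this post, so it has to be supplied by the reader.

```python
import re
from typing import Dict, Iterable, List

# Markers taken from the log excerpt above.
CUT_HERE = "------------[ cut here ]------------"
END_TRACE = re.compile(r"---\[ end trace ([0-9a-f]+) \]---")
WARNING = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+) (\S+)")
WORKQUEUE = re.compile(r"Workqueue: (.+)$")


def find_kernel_traces(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one summary dict per complete kernel trace found in `lines`.

    Hypothetical helper: a trace starts at a "cut here" line and ends at
    the matching "end trace" line. Incomplete traces are ignored.
    """
    traces: List[Dict[str, str]] = []
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if CUT_HERE in line:
            # Classic syslog timestamps ("Aug 24 09:56:19") are 15 characters.
            current = {"timestamp": line.strip()[:15]}
            continue
        if current is None:
            continue  # not inside a trace
        match = WARNING.search(line)
        if match:
            current["location"] = match.group(1)
            current["function"] = match.group(2)
        match = WORKQUEUE.search(line)
        if match:
            current["workqueue"] = match.group(1)
        match = END_TRACE.search(line)
        if match:
            current["trace_id"] = match.group(1)
            traces.append(current)
            current = None
    return traces


# Example use (the path is whatever file your syslog is mirrored to):
#     with open(path_to_mirrored_syslog, errors="replace") as handle:
#         for trace in find_kernel_traces(handle):
#             print(trace)
```

    An empty result after a period of uptime would match the "not one call trace in the log" check described later in this thread. It says nothing about hard lockups that never reach the log.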

     

    After a reset, all is well, and there appears to be no obvious pattern to what triggers the above.

     

    EDIT: Set to Urgent, as per the priority definitions, given that it was a server crash.

     

     




    User Feedback

    Recommended Comments



    On 9/3/2021 at 5:00 AM, danioj said:

    Just checking in.

     

    Uptime is now 3 days 13 hours 9 minutes since I last issued the netfilter fix command.

     

    Not one call trace in the log or hard lock up / crash since.

     

    Hi @danioj, I wanted to check in and see if you still had zero traces.


    Update: 

     

    After roughly 10 days of uptime, I started to get crashes. They seemed to coincide with a lot of traffic going to and from the server, which doesn't happen all the time.

     

    I have since removed the need for Host Access to Custom Networks to be enabled. After disabling that option, I stress tested the server with some serious traffic.

     

    Now I have an uptime of 3 weeks with no call traces.

     

    I am still applying the fix described earlier in the thread, so I am not sure what is providing the stability now. I'm inclined to think it's the disabling of Host Access to Custom Networks (HATCN), but I really don't know. I am still on macvlan.







    Priority Definitions

     

    Minor = Something not working correctly.

     

    Urgent = Server crash, data loss, or other showstopper.

     

    Annoyance = Doesn't affect functionality but should be fixed.

     

    Other = Announcement or other non-issue.