• 6.12.8 Unstable on Intel E5 v4 platforms


    BCinBC
    • Urgent

    I've had more than a few crashes with this hardware ever since I updated from 6.12.6 to 6.12.8. I've since swapped over to different hardware from the system that originally crashed, keeping only:
    1 quad AIC NVMe card with 4 * NVMe
    The HDDs
    The USB key

     

    The system runs for up to about a week, then crashes hard. It still responds to pings, but everything else appears unresponsive: no SSH, and all Docker containers and VMs are gone.

     

    System changes from previous failing platform:

    Same model of motherboard (different one, older BIOS version)

    Different Intel CPU

    Different RAM (Still ECC but from 2 * 32 to 4 * 8)

    Different P4000.

    Different LSI controller, different SATA breakout cables

    Different Power supply

    Different Intel 10GBE NIC

     

    Also, one bug on these Lenovo P410 workstation motherboards: a restart/reboot always results in a shutdown.

     

    I will revert to 6.12.6 in the interim.

    bellerophon-diagnostics-20240323-2051.zip





    The syslog in the diagnostics is the RAM version that starts afresh every time the system is booted.  You should enable the syslog server (probably with the option to Mirror to Flash set) to get a syslog that survives a reboot so we can see what leads up to a crash.  The mirror to flash option is the easiest to set up (and if used the file is then automatically included in any diagnostics), but if you are worried about excessive wear on the flash drive you can put your server's address into the remote server field.

    Link to comment

    I've tried this more than once and it just ends in tears. I have about 25 containers and a bunch of them don't seem to work in ipvlan.  I guess I'll try it again. 

    Link to comment

    If you really need macvlan, you can still use it if you disable bridging for eth0; see the 6.12.4 release notes.

    Link to comment

    I've actually done the switch away from macvlan and it doesn't seem to be as bad this time, for some reason. Bridging with a fixed IP address (about 3 or 4 dockers) seems to work fine... so far.

     

    My network monitoring is freaking out a bit, though, as a bunch of services with dedicated MAC addresses have "gone offline." They work; it's just not the same MAC address responding.

     

    Edited by BCinBC
    Explanation for monitoring errors.
    Link to comment

    Also seeing possibly a related issue. Hard freeze, no response to ping at all. Started saving syslog to USB and finally got a useful trace.

     

    Lenovo ThinkServer TS140, Xeon E3-1226 v3.

     

    Going to migrate from macvlan to see if it mitigates freezing.

     

    Mar 25 12:29:25 exa kernel: ------------[ cut here ]------------
    Mar 25 12:29:25 exa kernel: WARNING: CPU: 0 PID: 5152 at net/netfilter/nf_nat_core.c:594 nf_nat_setup_info+0x8c/0x7d1 [nf_nat]
    Mar 25 12:29:25 exa kernel: Modules linked in: xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb tap tls macvlan veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter xfs md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs bridge 8021q garp mrp stp llc intel_rapl_msr intel_rapl_common i915(+) x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 iosf_mbi drm_buddy i2c_algo_bit sha256_ssse3 sha1_ssse3 ttm aesni_intel drm_display_helper drm_kms_helper crypto_simd cryptd mei_hdcp mei_pxp wmi_bmof rapl drm intel_cstate intel_uncore firewire_ohci i2c_i801 ahci libahci i2c_smbus firewire_core intel_gtt mei_me agpgart
    Mar 25 12:29:25 exa kernel: input_leds mei led_class e1000e i2c_core tpm_tis syscopyarea video sysfillrect sysimgblt fb_sys_fops tpm_tis_core thermal fan backlight wmi tpm button unix
    Mar 25 12:29:25 exa kernel: CPU: 0 PID: 5152 Comm: kworker/u8:8 Tainted: P    B D W  O       6.1.74-Unraid #1
    Mar 25 12:29:25 exa kernel: Hardware name: LENOVO ThinkServer TS140/ThinkServer TS140, BIOS FBKTD9AUS 12/09/2019
    Mar 25 12:29:25 exa kernel: Workqueue: events_unbound macvlan_process_broadcast [macvlan]
    Mar 25 12:29:25 exa kernel: RIP: 0010:nf_nat_setup_info+0x8c/0x7d1 [nf_nat]
    Mar 25 12:29:25 exa kernel: Code: a8 80 75 26 48 8d 73 58 48 8d 7c 24 20 e8 18 fb 9a ff 48 8d 43 0c 4c 8b bb 88 00 00 00 48 89 44 24 18 eb 54 0f ba e0 08 73 07 <0f> 0b e9 75 06 00 00 48 8d 73 58 48 8d 7c 24 20 e8 eb fa 9a ff 48
    Mar 25 12:29:25 exa kernel: RSP: 0018:ffffc90000003c78 EFLAGS: 00010282
    Mar 25 12:29:25 exa kernel: RAX: 0000000000000180 RBX: ffff8881b2838600 RCX: ffff8881036cfa40
    Mar 25 12:29:25 exa kernel: RDX: 0000000000000000 RSI: ffffc90000003d5c RDI: ffff8881b2838600
    Mar 25 12:29:25 exa kernel: RBP: ffffc90000003d40 R08: 000000006a01000a R09: 0000000000000000
    Mar 25 12:29:25 exa kernel: R10: 0000000000000098 R11: 0000000000000000 R12: ffffc90000003d5c
    Mar 25 12:29:25 exa kernel: R13: 0000000000000000 R14: ffffc90000003e40 R15: 0000000000000001
    Mar 25 12:29:25 exa kernel: FS:  0000000000000000(0000) GS:ffff88880ea00000(0000) knlGS:0000000000000000
    Mar 25 12:29:25 exa kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar 25 12:29:25 exa kernel: CR2: 00007f2cfc0cd000 CR3: 000000000420a002 CR4: 00000000001726f0
    Mar 25 12:29:25 exa kernel: Call Trace:
    Mar 25 12:29:25 exa kernel: <IRQ>
    Mar 25 12:29:25 exa kernel: ? __warn+0xab/0x122
    Mar 25 12:29:25 exa kernel: ? report_bug+0x109/0x17e
    Mar 25 12:29:25 exa kernel: ? nf_nat_setup_info+0x8c/0x7d1 [nf_nat]
    Mar 25 12:29:25 exa kernel: ? handle_bug+0x41/0x6f
    Mar 25 12:29:25 exa kernel: ? exc_invalid_op+0x13/0x60
    Mar 25 12:29:25 exa kernel: ? asm_exc_invalid_op+0x16/0x20
    Mar 25 12:29:25 exa kernel: ? nf_nat_setup_info+0x8c/0x7d1 [nf_nat]
    Mar 25 12:29:25 exa kernel: ? nf_nat_setup_info+0x44/0x7d1 [nf_nat]
    Mar 25 12:29:25 exa kernel: ? xt_write_recseq_end+0xf/0x1c [ip_tables]
    Mar 25 12:29:25 exa kernel: ? __local_bh_enable_ip+0x56/0x6b
    Mar 25 12:29:25 exa kernel: ? ipt_do_table+0x575/0x5ba [ip_tables]
    Mar 25 12:29:25 exa kernel: __nf_nat_alloc_null_binding+0x66/0x81 [nf_nat]
    Mar 25 12:29:25 exa kernel: nf_nat_inet_fn+0xc0/0x1a8 [nf_nat]
    Mar 25 12:29:25 exa kernel: nf_nat_ipv4_local_in+0x2a/0xaa [nf_nat]
    Mar 25 12:29:25 exa kernel: nf_hook_slow+0x3d/0x96
    Mar 25 12:29:25 exa kernel: ? ip_protocol_deliver_rcu+0x164/0x164
    Mar 25 12:29:25 exa kernel: NF_HOOK.constprop.0+0x79/0xd9
    Mar 25 12:29:25 exa kernel: ? ip_protocol_deliver_rcu+0x164/0x164
    Mar 25 12:29:25 exa kernel: __netif_receive_skb_one_core+0x77/0x9c
    Mar 25 12:29:25 exa kernel: process_backlog+0x8c/0x116
    Mar 25 12:29:25 exa kernel: __napi_poll.constprop.0+0x2b/0x124
    Mar 25 12:29:25 exa kernel: net_rx_action+0x159/0x24f
    Mar 25 12:29:25 exa kernel: __do_softirq+0x129/0x288
    Mar 25 12:29:25 exa kernel: do_softirq+0x7f/0xab
    Mar 25 12:29:25 exa kernel: </IRQ>
    Mar 25 12:29:25 exa kernel: <TASK>
    Mar 25 12:29:25 exa kernel: __local_bh_enable_ip+0x4c/0x6b
    Mar 25 12:29:25 exa kernel: netif_rx+0x52/0x5a
    Mar 25 12:29:25 exa kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
    Mar 25 12:29:25 exa kernel: ? _raw_spin_unlock+0x14/0x29
    Mar 25 12:29:25 exa kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]
    Mar 25 12:29:25 exa kernel: process_one_work+0x1ab/0x295
    Mar 25 12:29:25 exa kernel: worker_thread+0x18b/0x244
    Mar 25 12:29:25 exa kernel: ? rescuer_thread+0x281/0x281
    Mar 25 12:29:25 exa kernel: kthread+0xe7/0xef
    Mar 25 12:29:25 exa kernel: ? kthread_complete_and_exit+0x1b/0x1b
    Mar 25 12:29:25 exa kernel: ret_from_fork+0x22/0x30
    Mar 25 12:29:25 exa kernel: </TASK>
    Mar 25 12:29:25 exa kernel: ---[ end trace 0000000000000000 ]---
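
For anyone wanting to check a persisted syslog for the same pattern, here is a minimal Python sketch. It assumes the kernel reports are delimited by the "[ cut here ]" and "[ end trace ... ]" markers seen in the trace above; the function names (find_traces, summarize) are hypothetical helpers made up for illustration, and the path to the syslog file is whatever you pass on the command line.

```python
import re
import sys

# Markers the kernel prints around a WARNING report, as seen in the trace above.
START = "[ cut here ]"
END_RE = re.compile(r"\[ end trace [0-9a-f]+ \]")


def find_traces(lines, keyword="macvlan"):
    """Return kernel report blocks (lists of lines) that mention `keyword`."""
    blocks, current = [], None
    for line in lines:
        if START in line:
            current = [line]              # a new report begins
        elif current is not None:
            current.append(line)
            if END_RE.search(line):       # the report is complete
                if any(keyword in entry for entry in current):
                    blocks.append(current)
                current = None
    return blocks


def summarize(block):
    """Extract the WARNING and Workqueue lines from one block, if present."""
    summary = {}
    for line in block:
        if "WARNING:" in line:
            summary["warning"] = line.split("WARNING:", 1)[1].strip()
        elif "Workqueue:" in line:
            summary["workqueue"] = line.split("Workqueue:", 1)[1].strip()
    return summary


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python scan_syslog.py <path-to-syslog>
    with open(sys.argv[1], errors="replace") as handle:
        for block in find_traces(handle.read().splitlines()):
            print(block[0][:15], summarize(block))
```

If this prints any blocks whose workqueue mentions macvlan_process_broadcast, the log contains the same kind of macvlan call trace as the one posted here.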

     

    Link to comment
    11 hours ago, Justin F. said:

    Also seeing possibly a related issue.

     

    On 3/24/2024 at 5:54 PM, JorgeB said:

    There are macvlan call traces, change the docker network to ipvlan.

     

    Link to comment



