• [6.7.0-rc1] System Hard Lock


    TechnoBabble28
    • Urgent

    I upgraded from 5.6.3 to 6.7.0-rc1 yesterday afternoon. The server first locked up after about 4 hours: no SSH or browser access, and it required a hard reset. I then started troubleshooting (TS) mode from the Fix Common Problems (FCP) plugin, left the dashboard window open on my screen, and went to bed. When I woke up this morning it had locked up again, around 12.5 hours in. I am not running any VMs; this is purely a media server with minimal Docker containers. The logs are attached below.

    Unraid.thumb.PNG.89e543f8e0299621ee334448646ccc9e.PNG

    FCPsyslog_tail.txt

    mediaserver-diagnostics-20190122-0245.zip







    I don't understand why syslog works for you. For me, when it locks up there is nothing in the log at all. I reverted to the latest stable release; the RC is not usable.

    Edited by nuhll
    Link to comment
    10 hours ago, d2dyno said:

    Well, seems I made it happen quicker this time 😅

    This line:

    Jan 25 04:49:43 TheiaHD kernel: br0: received packet on bond0 with own address as source address (addr:00:02:c9:54:99:2e, vlan:0)

    can happen as a result of an "ARP storm", e.g., from plugging multiple NICs into the same router without bonding set up.  I'm not sure about the kernel exception.
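    To judge whether that warning is a one-off or a flood, it can help to count how often it appears and for which port and MAC. Below is a minimal sketch, not an official tool: the function name `count_own_address_warnings` is made up for illustration, and the regular expression is only modeled on the single line quoted above (it tolerates a missing bridge name before the colon).

```python
import re
from collections import Counter

# Modeled on the bridge warning quoted in this thread. The bridge name
# before the first colon is treated as optional.
OWN_ADDRESS = re.compile(
    r"kernel: (?P<bridge>[^:\s]*): received packet on (?P<port>\S+) "
    r"with own address as source address "
    r"\(addr:(?P<mac>[0-9a-fA-F:]{17}), vlan:(?P<vlan>\d+)\)"
)


def count_own_address_warnings(lines):
    """Count the warning per (port, MAC) pair across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = OWN_ADDRESS.search(line)
        if match:
            counts[(match["port"], match["mac"].lower())] += 1
    return counts


# Example use against a saved log (the path is an assumption):
#   with open("/var/log/syslog", errors="replace") as handle:
#       for (port, mac), n in count_own_address_warnings(handle).most_common():
#           print(n, port, mac)
```

    A handful of hits around an interface change is probably less interesting than a steady stream from the same port, which would point more toward a loop or a bonding misconfiguration.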

    Link to comment
    3 minutes ago, nuhll said:

    @limetech

    Do you know whether rc2 helps against the "losing connection" bug?

    No clue.  If you don't mind, it would be better to post a separate report, since you're not using Ryzen.

    Link to comment
    1 minute ago, limetech said:

    No clue.  If you don't mind, it would be better to post a separate report, since you're not using Ryzen.

    Already did, but I couldn't get any useful logs; I tried 3 times.

    I'm back on 6.6 stable now, with 24+ hours of uptime.

    Link to comment
    17 minutes ago, nuhll said:

    Already did, but I couldn't get any useful logs; I tried 3 times.

    Ok I didn't see it over in General Support - moved to Bug Reports.

    Link to comment

    Happened again. I tried following a thread that said to move dockers to a different bridge/network group. Syslog is a bit longer:

    Jan 25 21:08:55 TheiaHD ntpd[3758]: Deleting interface #3 br0, fe80::907c:a6ff:fe97:c361%15#123, interface stats: received=0, sent=0, dropped=0, active_time=58141 secs
    Jan 25 21:09:05 TheiaHD kernel: : received packet on bond0 with own address as source address (addr:00:02:c9:54:99:2e, vlan:0)
    Jan 25 21:09:15 TheiaHD kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    Jan 25 21:09:15 TheiaHD kernel: rcu: 	8-....: (13 GPs behind) idle=706/0/0x1 softirq=9150793/9150793 fqs=10124 
    Jan 25 21:09:15 TheiaHD kernel: rcu: 	(detected by 35, t=60002 jiffies, g=21586645, q=566349)
    Jan 25 21:09:15 TheiaHD kernel: Sending NMI from CPU 35 to CPUs 8:
    Jan 25 21:09:15 TheiaHD kernel: NMI backtrace for cpu 8
    Jan 25 21:09:15 TheiaHD kernel: CPU: 8 PID: 0 Comm: swapper/8 Not tainted 4.19.17-Unraid #1
    Jan 25 21:09:15 TheiaHD kernel: Hardware name: System manufacturer System Product Name/PRIME X399-A, BIOS 0808 10/12/2018
    Jan 25 21:09:15 TheiaHD kernel: RIP: 0010:ip_defrag+0x871/0xaba
    Jan 25 21:09:15 TheiaHD kernel: Code: 0c 00 4c 8b 5c 24 28 48 89 e8 4c 09 d8 0f 84 c8 00 00 00 4d 85 db 0f 84 8e 00 00 00 31 c0 b9 06 00 00 00 4c 89 df 4d 89 1c 24 <f3> ab 49 c7 43 18 00 00 00 00 4d 89 dc 41 8b 83 80 00 00 00 41 01
    Jan 25 21:09:15 TheiaHD kernel: RSP: 0018:ffff88905e203b10 EFLAGS: 00000246
    Jan 25 21:09:15 TheiaHD kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
    Jan 25 21:09:15 TheiaHD kernel: RDX: 0000000000000001 RSI: ffff8887e1010070 RDI: ffff888fa2ef9208
    Jan 25 21:09:15 TheiaHD kernel: RBP: 0000000000000000 R08: 0000000000000001 R09: ffffffff81569100
    Jan 25 21:09:15 TheiaHD kernel: R10: ffffea003c511000 R11: ffff888fa2ef9200 R12: ffff888fa2ef9200
    Jan 25 21:09:15 TheiaHD kernel: R13: ffff888fa2ef9200 R14: ffff8887e1010000 R15: ffff888fa2ef9200
    Jan 25 21:09:15 TheiaHD kernel: FS:  0000000000000000(0000) GS:ffff88905e200000(0000) knlGS:0000000000000000
    Jan 25 21:09:15 TheiaHD kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jan 25 21:09:15 TheiaHD kernel: CR2: 0000000041f00083 CR3: 0000000005e0a000 CR4: 00000000003406e0
    Jan 25 21:09:15 TheiaHD kernel: Call Trace:
    Jan 25 21:09:15 TheiaHD kernel: <IRQ>
    Jan 25 21:09:15 TheiaHD kernel: ? br_handle_local_finish+0x31/0x31
    Jan 25 21:09:15 TheiaHD kernel: ipv4_conntrack_defrag+0xca/0xf6
    Jan 25 21:09:15 TheiaHD kernel: nf_hook_slow+0x37/0x96
    Jan 25 21:09:15 TheiaHD kernel: br_nf_pre_routing+0x2c0/0x2f7
    Jan 25 21:09:15 TheiaHD kernel: ? br_nf_forward_ip+0x349/0x349
    Jan 25 21:09:15 TheiaHD kernel: nf_hook_slow+0x37/0x96
    Jan 25 21:09:15 TheiaHD kernel: br_handle_frame+0x291/0x2d0
    Jan 25 21:09:15 TheiaHD kernel: ? br_pass_frame_up+0x143/0x143
    Jan 25 21:09:15 TheiaHD kernel: __netif_receive_skb_core+0x461/0x793
    Jan 25 21:09:15 TheiaHD kernel: ? __wake_up_common_lock+0xb/0xcb
    Jan 25 21:09:15 TheiaHD kernel: __netif_receive_skb_one_core+0x31/0x69
    Jan 25 21:09:15 TheiaHD kernel: netif_receive_skb_internal+0x9f/0xba
    Jan 25 21:09:15 TheiaHD kernel: napi_gro_frags+0x153/0x18b
    Jan 25 21:09:15 TheiaHD kernel: mlx4_en_process_rx_cq+0x7ea/0x953 [mlx4_en]
    Jan 25 21:09:15 TheiaHD kernel: ? mlx4_cq_completion+0x1e/0x63 [mlx4_core]
    Jan 25 21:09:15 TheiaHD kernel: ? mlx4_en_rx_irq+0x23/0x3e [mlx4_en]
    Jan 25 21:09:15 TheiaHD kernel: ? mlx4_eq_int+0xb2a/0xb55 [mlx4_core]
    Jan 25 21:09:15 TheiaHD kernel: mlx4_en_poll_rx_cq+0x66/0xc6 [mlx4_en]
    Jan 25 21:09:15 TheiaHD kernel: net_rx_action+0x10b/0x274
    Jan 25 21:09:15 TheiaHD kernel: __do_softirq+0xce/0x1e2
    Jan 25 21:09:15 TheiaHD kernel: irq_exit+0x5e/0x9d
    Jan 25 21:09:15 TheiaHD kernel: do_IRQ+0xa9/0xc7
    Jan 25 21:09:15 TheiaHD kernel: common_interrupt+0xf/0xf
    Jan 25 21:09:15 TheiaHD kernel: </IRQ>
    Jan 25 21:09:15 TheiaHD kernel: RIP: 0010:cpuidle_enter_state+0xe8/0x141
    Jan 25 21:09:15 TheiaHD kernel: Code: ff 45 84 ff 74 1d 9c 58 0f 1f 44 00 00 0f ba e0 09 73 09 0f 0b fa 66 0f 1f 44 00 00 31 ff e8 03 52 be ff fb 66 0f 1f 44 00 00 <48> 2b 1c 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48 39 cb
    Jan 25 21:09:15 TheiaHD kernel: RSP: 0018:ffffc9000022fea0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdb
    Jan 25 21:09:15 TheiaHD kernel: RAX: ffff88905e220c40 RBX: 000034eb542a1d6c RCX: 000000000000001f
    Jan 25 21:09:15 TheiaHD kernel: RDX: 000034eb542a1d6c RSI: 000000002abffb5d RDI: 0000000000000000
    Jan 25 21:09:15 TheiaHD kernel: RBP: ffff889040135c00 R08: 0000000000000002 R09: 0000000000020500
    Jan 25 21:09:15 TheiaHD kernel: R10: 0000000000000144 R11: 00009e9caca94e5c R12: 0000000000000002
    Jan 25 21:09:15 TheiaHD kernel: R13: 0000000000000002 R14: ffffffff81e5c4f8 R15: 0000000000000000
    Jan 25 21:09:15 TheiaHD kernel: do_idle+0x192/0x20e
    Jan 25 21:09:15 TheiaHD kernel: cpu_startup_entry+0x6a/0x6c
    Jan 25 21:09:15 TheiaHD kernel: start_secondary+0x197/0x1b2
    Jan 25 21:09:15 TheiaHD kernel: secondary_startup_64+0xa4/0xb0
    Jan 25 21:09:40 TheiaHD kernel: ------------[ cut here ]------------
    Jan 25 21:09:40 TheiaHD kernel: NETDEV WATCHDOG: eth2 (mlx4_core): transmit queue 0 timed out
    Jan 25 21:09:40 TheiaHD kernel: WARNING: CPU: 43 PID: 0 at net/sched/sch_generic.c:461 dev_watchdog+0x15f/0x1b7
    Jan 25 21:09:40 TheiaHD kernel: Modules linked in: tun iptable_mangle veth xt_nat ipt_MASQUERADE iptable_nat nf_nat_ipv4 iptable_filter ip_tables nf_nat md_mod bonding mlx4_en mlx4_core igb i2c_algo_bit amd64_edac_mod edac_mce_amd kvm_amd kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 mpt3sas crypto_simd i2c_piix4 raid_class cryptd wmi_bmof mxm_wmi i2c_core k10temp scsi_transport_sas ahci glue_helper ccp libahci pcc_cpufreq wmi button acpi_cpufreq [last unloaded: mlx4_core]
    Jan 25 21:09:40 TheiaHD kernel: CPU: 43 PID: 0 Comm: swapper/43 Not tainted 4.19.17-Unraid #1
    Jan 25 21:09:40 TheiaHD kernel: Hardware name: System manufacturer System Product Name/PRIME X399-A, BIOS 0808 10/12/2018
    Jan 25 21:09:40 TheiaHD kernel: RIP: 0010:dev_watchdog+0x15f/0x1b7
    Jan 25 21:09:40 TheiaHD kernel: Code: 55 63 97 00 00 75 36 4c 89 ef c6 05 49 63 97 00 01 e8 0e ba fd ff 89 e9 4c 89 ee 48 c7 c7 0e 5e d9 81 48 89 c2 e8 8c 2e b2 ff <0f> 0b eb 0f ff c5 48 81 c2 40 01 00 00 39 cd 75 98 eb 13 48 8b 83
    Jan 25 21:09:40 TheiaHD kernel: RSP: 0018:ffff88905e4c3ea0 EFLAGS: 00010286
    Jan 25 21:09:40 TheiaHD kernel: RAX: 0000000000000000 RBX: ffff88882f180438 RCX: 0000000000000007
    Jan 25 21:09:40 TheiaHD kernel: RDX: 0000000000000000 RSI: ffff88905e4d64e0 RDI: ffff88905e4d64e0
    Jan 25 21:09:40 TheiaHD kernel: RBP: 0000000000000000 R08: 0000000000000003 R09: 0000000000020500
    Jan 25 21:09:40 TheiaHD kernel: R10: 0000000000000dce R11: 000000000004caf4 R12: ffff88882f18041c
    Jan 25 21:09:40 TheiaHD kernel: R13: ffff88882f180000 R14: ffff88882f1b3f40 R15: 000000000000002b
    Jan 25 21:09:40 TheiaHD kernel: FS:  0000000000000000(0000) GS:ffff88905e4c0000(0000) knlGS:0000000000000000
    Jan 25 21:09:40 TheiaHD kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Jan 25 21:09:40 TheiaHD kernel: CR2: 0000152db1864110 CR3: 00000007e8dc6000 CR4: 00000000003406e0
    Jan 25 21:09:40 TheiaHD kernel: Call Trace:
    Jan 25 21:09:40 TheiaHD kernel: <IRQ>
    Jan 25 21:09:40 TheiaHD kernel: call_timer_fn+0x18/0x7b
    Jan 25 21:09:40 TheiaHD kernel: ? qdisc_reset+0xc0/0xc0
    Jan 25 21:09:40 TheiaHD kernel: expire_timers+0x7f/0x8e
    Jan 25 21:09:40 TheiaHD kernel: run_timer_softirq+0x72/0x120
    Jan 25 21:09:40 TheiaHD kernel: ? enqueue_hrtimer.isra.3+0x23/0x27
    Jan 25 21:09:40 TheiaHD kernel: ? __hrtimer_run_queues+0xd7/0x105
    Jan 25 21:09:40 TheiaHD kernel: ? ktime_get+0x3a/0x8d
    Jan 25 21:09:40 TheiaHD kernel: __do_softirq+0xce/0x1e2
    Jan 25 21:09:40 TheiaHD kernel: irq_exit+0x5e/0x9d
    Jan 25 21:09:40 TheiaHD kernel: smp_apic_timer_interrupt+0x7e/0x91
    Jan 25 21:09:40 TheiaHD kernel: apic_timer_interrupt+0xf/0x20
    Jan 25 21:09:40 TheiaHD kernel: </IRQ>
    Jan 25 21:09:40 TheiaHD kernel: RIP: 0010:native_safe_halt+0x2/0x3
    Jan 25 21:09:40 TheiaHD kernel: Code: c9 65 48 8b 04 25 40 5c 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b 00 a8 08 74 0b 65 81 25 b4 83 9c 7e ff ff ff 7f c3 fb f4 <c3> f4 c3 e8 5f aa a4 ff 65 8b 05 42 19 9c 7e fb 66 0f 1f 44 00 00
    Jan 25 21:09:40 TheiaHD kernel: RSP: 0018:ffffc9000665be48 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
    Jan 25 21:09:40 TheiaHD kernel: RAX: 0000000080000000 RBX: ffff889059015c64 RCX: 000000000000001f
    Jan 25 21:09:40 TheiaHD kernel: RDX: ffff88905e4c0000 RSI: ffffffff81e5c420 RDI: ffff889059015c64
    Jan 25 21:09:40 TheiaHD kernel: RBP: 0000000000000001 R08: 0000000000000002 R09: 0000000000020500
    Jan 25 21:09:40 TheiaHD kernel: R10: 00000000000011b8 R11: 00009ed7ab4291be R12: ffff889059015c00
    Jan 25 21:09:40 TheiaHD kernel: R13: 0000000000000001 R14: ffffffff81e5c498 R15: 0000000000000000
    Jan 25 21:09:40 TheiaHD kernel: acpi_safe_halt+0x15/0x1f
    Jan 25 21:09:40 TheiaHD kernel: acpi_idle_enter+0x1dd/0x21b
    Jan 25 21:09:40 TheiaHD kernel: cpuidle_enter_state+0xa1/0x141
    Jan 25 21:09:40 TheiaHD kernel: do_i

    Next I am going to try the rcu_nocbs option in the syslinux config, as mentioned by spaceinvader.
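    For comparing crashes between machines, it may be easier to pull just the lockup markers and their timestamps out of a long syslog than to diff whole traces. The sketch below is an illustration only: `find_lockup_markers` is a hypothetical helper, the two patterns are taken from the log pasted above, and the year is an assumption because syslog lines of this format do not carry one.

```python
import re
from datetime import datetime

# "Jan 25 21:09:15 TheiaHD kernel: <message>" as seen in the pasted log.
KERNEL_LINE = re.compile(
    r"^\s*(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: (?P<msg>.*)$"
)

# Only the two markers that appear in this thread's log.
MARKERS = {
    "rcu_stall": re.compile(r"rcu_sched detected stalls on CPUs/tasks"),
    "netdev_watchdog": re.compile(
        r"NETDEV WATCHDOG: \S+ \(\S+\): transmit queue \d+ timed out"
    ),
}


def find_lockup_markers(lines, year=2019):
    """Return (datetime, kind, message) for each marker line, in log order.

    The year is supplied by the caller (assumed 2019 here); month names are
    parsed with %b, so an English/C locale is assumed as well.
    """
    events = []
    for line in lines:
        match = KERNEL_LINE.match(line)
        if not match:
            continue
        for kind, pattern in MARKERS.items():
            if pattern.search(match["msg"]):
                when = datetime.strptime(
                    f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S"
                )
                events.append((when, kind, match["msg"].strip()))
    return events
```

    Run over the excerpt above, this would report the RCU stall first and the transmit-queue timeout on eth2 25 seconds later, which is the kind of ordering worth comparing against other reports.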

    Link to comment

    I am considering a Ryzen build too, and while gathering info about it I found this video. At 9:30 he adds rcu_nocbs, which fixed it for him; maybe it will fix yours too?
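    After editing the boot config it is worth confirming that the running kernel actually received the option, since a typo on the append line fails silently. A small sketch, assuming only that the kernel command line is readable at `/proc/cmdline`; the helper names and the CPU range in the example comment are hypothetical and need to match your own core count.

```python
from typing import Optional


def rcu_nocbs_value(cmdline: str) -> Optional[str]:
    """Return the CPU list given to rcu_nocbs, or None if the option is absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "rcu_nocbs":
            # An option written without "=" yields an empty string.
            return value if sep else ""
    return None


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()


# Example (the range 0-15 is only an illustration):
#   rcu_nocbs_value("BOOT_IMAGE=/bzimage initrd=/bzroot rcu_nocbs=0-15")
#   would return "0-15"
```

    Note that the option only has an effect on kernels built with CONFIG_RCU_NOCB_CPU, so seeing it on the command line is necessary but not sufficient.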

     

     

     

    Link to comment

    @mikeydk tried this, still crashed earlier this morning.

     

    Next I will pull out the Mellanox cards I guess. If they're the issue that won't be good though, as I don't want to go back to gigabit.

    Link to comment
    3 minutes ago, d2dyno said:

    @mikeydk tried this, still crashed earlier this morning.

     

    Next I will pull out the Mellanox cards I guess. If they're the issue that won't be good though, as I don't want to go back to gigabit.

    That is annoying. I would love to hear what fixed it when you find the problem. I am considering building a new server based on the same CPU, but my current E3-1230 V2 just needs to hang in there a bit longer :D

    Link to comment

    I updated the BIOS on my Gigabyte AX-370; I was stable on F20 and moved to F25.

    Now I get the same hard lock. If I go back to 6.6 I am stable. I checked that C6 is disabled, so maybe AMD changed something in the new BIOS.

    Link to comment

    I have the same problem, and I have no Gigabyte board or extra network card. There must be some change in the latest RC; RC2 also hard locks for me. 6.6.6 is stable.

    Did you all change your MTU? I changed mine to 1436. Maybe it only hard locks when you have changed the MTU.

    Edited by nuhll
    Link to comment
    On 1/26/2019 at 12:07 AM, bonienl said:

    Try a smaller MTU size, e.g. 1400 bytes.

    Tried this, didn't work.

     

    EDIT: With no end in sight, I am going back to 6.6.6, which I really didn't want to do. Clearly either Ryzen or Mellanox support is borked in this release, and I cannot spend 2 hours a day debugging anymore.

    Edited by d2dyno
    Link to comment
    6 hours ago, Stef-dk said:

    I updated the BIOS on my Gigabyte AX-370; I was stable on F20 and moved to F25.

    Now I get the same hard lock. If I go back to 6.6 I am stable. I checked that C6 is disabled, so maybe AMD changed something in the new BIOS.

     

    1 hour ago, nuhll said:

    I have the same problem, and I have no Gigabyte board or extra network card. There must be some change in the latest RC; RC2 also hard locks for me. 6.6.6 is stable.

    Did you all change your MTU? I changed mine to 1436. Maybe it only hard locks when you have changed the MTU.

    Can you both either post a picture of your server monitor showing the crash error, or keep 'tail -f /var/log/syslog' open from another computer? I'm curious whether our error logs are identical or different.
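    If the log lives only in RAM, anything not yet copied elsewhere is lost at the hard reset. One option is to follow the log and force every new line onto persistent storage as it arrives. The following is a rough sketch under stated assumptions: the live log is `/var/log/syslog`, the destination path in the usage comment is hypothetical, `mirror_log` is an invented name, and log rotation is not handled. It also cannot capture messages the kernel never manages to hand to the syslog daemon before the machine freezes.

```python
import os
import time


def mirror_log(src_path, dst_path, from_start=False,
               poll_seconds=1.0, max_idle_polls=None):
    """Append lines written to src_path onto dst_path, syncing each to disk.

    from_start=False behaves like `tail -f` and begins at the current end.
    max_idle_polls=None runs until interrupted; a number stops the loop
    after that many empty reads in total. Returns the count of lines copied.
    """
    copied = 0
    idle_polls = 0
    with open(src_path, "r", errors="replace") as src, open(dst_path, "a") as dst:
        if not from_start:
            src.seek(0, os.SEEK_END)
        while max_idle_polls is None or idle_polls < max_idle_polls:
            line = src.readline()
            if not line:
                idle_polls += 1
                time.sleep(poll_seconds)
                continue
            dst.write(line)
            dst.flush()              # push Python's buffer to the OS
            os.fsync(dst.fileno())   # ask the OS to commit it to the device
            copied += 1
    return copied


# Example (destination path is hypothetical):
#   mirror_log("/var/log/syslog", "/boot/logs/syslog-mirror.txt")
```

    Syncing on every line is slow and adds wear if the destination is a flash drive, so this is something to run only while chasing a crash.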

    Link to comment
    4 hours ago, d2dyno said:

     

    Can you both either post a picture of your server monitor showing the crash error, or keep 'tail -f /var/log/syslog' open from another computer? I'm curious whether our error logs are identical or different.

    That's the problem: I was not able to produce a useful syslog (I tried tail -f to a user share).

    But there is nothing "crashing" in it.

    I guess it must be more than just a connectivity error then, because /mnt/user/something is only in RAM, and if there is nothing logged, then it must be a bigger problem than just the network.

    The problem is that I currently have a parity rebuild running, and it says it needs 5 days, so I guess we'll have to wait for the next try.

    By the way, during the crash I plugged in VGA, and it just showed the normal screen:

    "blabla your server is reachable under ...

    Login:"

     

    Edited by nuhll
    • Like 1
    Link to comment
    12 hours ago, Stef-dk said:

    I updated the BIOS on my Gigabyte AX-370; I was stable on F20 and moved to F25.

    Now I get the same hard lock. If I go back to 6.6 I am stable. I checked that C6 is disabled, so maybe AMD changed something in the new BIOS.

     

    Do NOT disable C6.  I've recently seen this advice floating around the forums, and even saw it in someone's video.  Not only will disabling C6 not help, it actually makes it worse.

     

    How do I know?  I'm the guy who originally identified the solution nearly 2 years ago.  I meticulously tested every BIOS setting, figuring out that disabling "Global C-state Control" is the solution.  I even have a link to this in my signature (though for some reason our sigs don't show here in the bug report section).

     

    Disabling C6 is not the same thing as disabling Global C-state Control.

     

    Now, all that said, it seems there's something going on with 6.7 that even Global C-state Control isn't helping.
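    Since the BIOS setting is easy to confuse with the C6 toggle, one sanity check is to look at which idle states the kernel itself exposes after a reboot. This is a sketch, assuming the standard Linux cpuidle sysfs layout under `/sys/devices/system/cpu/cpu0/cpuidle`; the function name is hypothetical, and whether disabling Global C-state Control visibly changes this list on a given board is something to verify rather than take from here.

```python
import os


def _read(base, name):
    """Return the stripped contents of base/name, or None if unreadable."""
    try:
        with open(os.path.join(base, name)) as handle:
            return handle.read().strip()
    except OSError:
        return None


def list_idle_states(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return [(dir_name, state_name, disabled)] for each stateN directory."""
    if not os.path.isdir(cpuidle_dir):
        return []  # no cpuidle driver loaded, or a different layout
    entries = [
        entry for entry in os.listdir(cpuidle_dir)
        if entry.startswith("state") and entry[5:].isdigit()
    ]
    entries.sort(key=lambda entry: int(entry[5:]))  # numeric, so state10 > state2
    states = []
    for entry in entries:
        base = os.path.join(cpuidle_dir, entry)
        states.append((entry, _read(base, "name"), _read(base, "disable") == "1"))
    return states


# Example:
#   for entry, name, disabled in list_idle_states():
#       print(entry, name, "disabled" if disabled else "enabled")
```

    Comparing the output before and after the BIOS change gives a concrete record to post alongside a crash report.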

    • Like 1
    Link to comment

    I was stable on 6.7rc1 for a couple days then went to rc2. RC2 has crashed twice now all within 24 hours of uptime. I am now running FCP in troubleshooting mode to capture logs and will upload once it happens again. 

    Link to comment

    Can the report be reopened? This doesn't seem like it should be closed at all, with, by my count, 4 different reports of the issue.

    Link to comment

    In case anyone needs it, here's a link to the original discussion on the Ryzen lock issue: 

     

    Most of the troubleshooting comes before that entry.

     

    Some notes:  I never got a log entry, ever, in all my crashes - it just crashes too fast.  On a few occasions I was able to see a crash dump on the console screen, but that was rare, and though I shared photos of it, no one was able to determine anything from it. 

     

    I also configured my system to boot into Windows by default, then manually started Unraid. That way, when it crashed, the next boot was into Windows, and it was there that Windows reported the Machine Check Exceptions (MCEs) in the logs. These seem to get stored in the BIOS and reported to the system on the next boot, though Unraid doesn't show this info.

     

    Link to comment
    1 hour ago, d2dyno said:

    Can the report be reopened? This doesn't seem like it should be closed at all, with, by my count, 4 different reports of the issue.

    This is a tough one for us because Ryzen on Linux is just plain broken and AMD will not fix it.

     

    512 posts later (as of writing this) on the main kernel Bug Report, I don't see a clear solution:

    https://bugzilla.kernel.org/show_bug.cgi?id=196683

    Link to comment





  • Status Definitions

     

    Open = Under consideration.

     

    Solved = The issue has been resolved.

     

    Solved version = The issue has been resolved in the indicated release version.

     

    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.

     

    Retest = Please retest in latest release.


    Priority Definitions

     

    Minor = Something not working correctly.

     

    Urgent = Server crash, data loss, or other showstopper.

     

    Annoyance = Doesn't affect functionality but should be fixed.

     

    Other = Announcement or other non-issue.