Unraid locking up, webui inaccessible, SSH not accessible, cannot shut down via CLI


Good evening!

I'm about at my wits' end with my server. Over the last few months I've had an issue where the server locks up completely and I just cannot figure out why. It's completely random, but tonight I managed to pull the syslog before it went down. I have no idea how to interpret it, so I was hoping someone else might be able to shed some light.

When the server locks up, something is still reading from or writing to the cache, because the activity light keeps blinking. Sometimes I can SSH into the server, sometimes I cannot. When I can SSH in, a restart issued from the CLI never actually restarts the machine. The webui errors out with an nginx 500 error, and most (sometimes not all) Docker containers stop working.

Syslog attached, diagnostics attached.

What I've done so far:
Rebuilt the Docker image
Replaced the motherboard
Replaced the RAM
Replaced the NIC
Added 2 new SSDs (RAID1)

Apologies if I posted this in the wrong area, please move or direct me where to post and I will follow up asap.

syslog 2021-10-09 chunker-diagnostics-20211009-2249.zip

I think that you have file system corruption on sdn and sdo, which should be the members of your cache_ssd pool.

I'd wait for JorgeB's advice on this.

 

Also, your syslog is spammed by:

Oct  9 23:35:00 Chunker kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Oct  9 23:35:00 Chunker kernel: caller _nv000723rm+0x1ad/0x200 [nvidia] mapping multiple BARs

Not sure if it is an issue.
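
If the spam makes the rest of the syslog hard to read, one option is to filter those two messages out before going through it. This is only a minimal Python sketch: the patterns are copied from the two lines quoted above, the helper name `drop_noise` is made up, and the third sample line is invented purely for illustration.

```python
import re

# Patterns taken from the two repeating lines quoted above.
NOISE = [
    re.compile(r"resource sanity check: requesting \[mem "),
    re.compile(r"\[nvidia\] mapping multiple BARs"),
]

def drop_noise(lines, patterns=NOISE):
    """Yield only the lines that match none of the noise patterns."""
    for line in lines:
        if not any(p.search(line) for p in patterns):
            yield line

sample = [
    "Oct  9 23:35:00 Chunker kernel: resource sanity check: requesting "
    "[mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 "
    "[mem 0x000c0000-0x000dffff window]",
    "Oct  9 23:35:00 Chunker kernel: caller _nv000723rm+0x1ad/0x200 "
    "[nvidia] mapping multiple BARs",
    # Invented line, only here to show that non-matching lines are kept.
    "Oct  9 23:35:01 Chunker kernel: example line that is not spam",
]

kept = list(drop_noise(sample))
for line in kept:
    print(line)
```

For a real file, pass the open file object instead of `sample`; it is read line by line, so the size of the log should not matter much.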

 

In any case, in your situation, I'd lower the memory speed to something that is supported by the memory controller. Running out of spec is possibly the cause of the corruption.

That is 2933 for four single-rank modules on a 3rd gen Ryzen.

While you're in the BIOS, you can also check that the Power supply control / C-States setting is appropriate.

2 minutes ago, Rockstar said:

they're running default values (no OC).

Yes they are.

Standard DDR4 memory is 2133. Everything above is an overclock. XMP is also an overclock but with pre-validated values on the sticks.

 

It can be perfectly fine if every element of the chain can handle the speed that you set.

In your situation, the memory itself might be fine. But the memory controller on Ryzen has different acceptable speeds depending on the number and type of sticks. You are outside those specifications.

Oct 16 09:39:13 Chunker kernel: ------------[ cut here ]------------
Oct 16 09:39:13 Chunker kernel: WARNING: CPU: 8 PID: 0 at net/netfilter/nf_conntrack_core.c:1120 __nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]
Oct 16 09:39:13 Chunker kernel: Modules linked in: tun xt_mark nvidia_uvm(PO) macvlan veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter iptable_mangle xfs md_mod nvidia_drm(PO) nvidia_modeset(PO) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops nvidia(PO) drm backlight agpgart ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding mlx4_en mlx4_core igb i2c_algo_bit edac_mce_amd kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel mxm_wmi wmi_bmof aesni_intel crypto_simd cryptd glue_helper mpt3sas i2c_piix4 i2c_core rapl k10temp raid_class ccp nvme scsi_transport_sas ahci nvme_core libahci wmi button acpi_cpufreq [last unloaded: mlx4_core]
Oct 16 09:39:13 Chunker kernel: CPU: 8 PID: 0 Comm: swapper/8 Tainted: P           O      5.10.28-Unraid #1
Oct 16 09:39:13 Chunker kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B450-F GAMING, BIOS 4402 06/28/2021
Oct 16 09:39:13 Chunker kernel: RIP: 0010:__nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]
Oct 16 09:39:13 Chunker kernel: Code: e8 dc f8 ff ff 44 89 fa 89 c6 41 89 c4 48 c1 eb 20 89 df 41 89 de e8 36 f6 ff ff 84 c0 75 bb 48 8b 85 80 00 00 00 a8 08 74 18 <0f> 0b 89 df 44 89 e6 31 db e8 6d f3 ff ff e8 35 f5 ff ff e9 22 01
Oct 16 09:39:13 Chunker kernel: RSP: 0018:ffffc90000394938 EFLAGS: 00010202
Oct 16 09:39:13 Chunker kernel: RAX: 0000000000000188 RBX: 000000000000c0bf RCX: 00000000b0198cd7
Oct 16 09:39:13 Chunker kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa0238f10
Oct 16 09:39:13 Chunker kernel: RBP: ffff8881113c2bc0 R08: 00000000fb009142 R09: 0000000000000000
Oct 16 09:39:13 Chunker kernel: R10: 0000000000000158 R11: ffff8880962fc200 R12: 0000000000005b44
Oct 16 09:39:13 Chunker kernel: R13: ffffffff8210b440 R14: 000000000000c0bf R15: 0000000000000000
Oct 16 09:39:13 Chunker kernel: FS:  0000000000000000(0000) GS:ffff88881ea00000(0000) knlGS:0000000000000000
Oct 16 09:39:13 Chunker kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 16 09:39:13 Chunker kernel: CR2: 0000000000fdd120 CR3: 00000001ae75e000 CR4: 0000000000350ee0
Oct 16 09:39:13 Chunker kernel: Call Trace:
Oct 16 09:39:13 Chunker kernel: <IRQ>
Oct 16 09:39:13 Chunker kernel: nf_conntrack_confirm+0x2f/0x36 [nf_conntrack]
Oct 16 09:39:13 Chunker kernel: nf_hook_slow+0x39/0x8e
Oct 16 09:39:13 Chunker kernel: nf_hook.constprop.0+0xb1/0xd8
Oct 16 09:39:13 Chunker kernel: ? ip_protocol_deliver_rcu+0xfe/0xfe
Oct 16 09:39:13 Chunker kernel: ip_local_deliver+0x49/0x75
Oct 16 09:39:13 Chunker kernel: ip_sabotage_in+0x43/0x4d [br_netfilter]
Oct 16 09:39:13 Chunker kernel: nf_hook_slow+0x39/0x8e
Oct 16 09:39:13 Chunker kernel: nf_hook.constprop.0+0xb1/0xd8
Oct 16 09:39:13 Chunker kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
Oct 16 09:39:13 Chunker kernel: ip_rcv+0x41/0x61
Oct 16 09:39:13 Chunker kernel: __netif_receive_skb_one_core+0x74/0x95
Oct 16 09:39:13 Chunker kernel: netif_receive_skb+0x79/0xa1
Oct 16 09:39:13 Chunker kernel: br_handle_frame_finish+0x30d/0x351
Oct 16 09:39:13 Chunker kernel: ? skb_copy_bits+0xe8/0x197
Oct 16 09:39:13 Chunker kernel: ? ipt_do_table+0x570/0x5c0 [ip_tables]
Oct 16 09:39:13 Chunker kernel: ? br_pass_frame_up+0xda/0xda
Oct 16 09:39:13 Chunker kernel: br_nf_hook_thresh+0xa3/0xc3 [br_netfilter]
Oct 16 09:39:13 Chunker kernel: ? br_pass_frame_up+0xda/0xda
Oct 16 09:39:13 Chunker kernel: br_nf_pre_routing_finish+0x23d/0x264 [br_netfilter]
Oct 16 09:39:13 Chunker kernel: ? br_pass_frame_up+0xda/0xda
Oct 16 09:39:13 Chunker kernel: ? br_handle_frame_finish+0x351/0x351
Oct 16 09:39:13 Chunker kernel: ? nf_nat_ipv4_pre_routing+0x1e/0x4a [nf_nat]
Oct 16 09:39:13 Chunker kernel: ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
Oct 16 09:39:13 Chunker kernel: ? br_handle_frame_finish+0x351/0x351
Oct 16 09:39:13 Chunker kernel: NF_HOOK+0xd7/0xf7 [br_netfilter]
Oct 16 09:39:13 Chunker kernel: ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
Oct 16 09:39:13 Chunker kernel: br_nf_pre_routing+0x229/0x239 [br_netfilter]
Oct 16 09:39:13 Chunker kernel: ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
Oct 16 09:39:13 Chunker kernel: br_handle_frame+0x25e/0x2a6
Oct 16 09:39:13 Chunker kernel: ? br_pass_frame_up+0xda/0xda
Oct 16 09:39:13 Chunker kernel: __netif_receive_skb_core+0x335/0x4e7
Oct 16 09:39:13 Chunker kernel: ? dev_gro_receive+0x55d/0x578
Oct 16 09:39:13 Chunker kernel: __netif_receive_skb_list_core+0x78/0x104
Oct 16 09:39:13 Chunker kernel: netif_receive_skb_list_internal+0x1bf/0x1f2
Oct 16 09:39:13 Chunker kernel: gro_normal_list+0x1d/0x39
Oct 16 09:39:13 Chunker kernel: napi_complete_done+0x79/0x104
Oct 16 09:39:13 Chunker kernel: mlx4_en_poll_rx_cq+0xa8/0xc7 [mlx4_en]
Oct 16 09:39:13 Chunker kernel: net_rx_action+0xf4/0x29d
Oct 16 09:39:13 Chunker kernel: __do_softirq+0xc4/0x1c2
Oct 16 09:39:13 Chunker kernel: asm_call_irq_on_stack+0x12/0x20
Oct 16 09:39:13 Chunker kernel: </IRQ>
Oct 16 09:39:13 Chunker kernel: do_softirq_own_stack+0x2c/0x39
Oct 16 09:39:13 Chunker kernel: __irq_exit_rcu+0x45/0x80
Oct 16 09:39:13 Chunker kernel: common_interrupt+0x119/0x12e
Oct 16 09:39:13 Chunker kernel: asm_common_interrupt+0x1e/0x40
Oct 16 09:39:13 Chunker kernel: RIP: 0010:native_safe_halt+0x7/0x8
Oct 16 09:39:13 Chunker kernel: Code: 60 02 df f0 83 44 24 fc 00 48 8b 00 a8 08 74 0b 65 81 25 a1 a9 95 7e ff ff ff 7f c3 e8 95 4d 98 ff f4 c3 e8 8e 4d 98 ff fb f4 <c3> 53 e8 04 ef 9d ff e8 04 76 9b ff 65 48 8b 1c 25 c0 7b 01 00 48
Oct 16 09:39:13 Chunker kernel: RSP: 0018:ffffc9000016fe78 EFLAGS: 00000246
Oct 16 09:39:13 Chunker kernel: RAX: 0000000000004000 RBX: 0000000000000001 RCX: 000000000000001f
Oct 16 09:39:13 Chunker kernel: RDX: ffff88881ea00000 RSI: ffffffff820c8c40 RDI: ffff88810207d064
Oct 16 09:39:13 Chunker kernel: RBP: ffff8881056c8000 R08: ffff88810207d000 R09: 00000000000000b8
Oct 16 09:39:13 Chunker kernel: R10: 00000000000000cb R11: 071c71c71c71c71c R12: 0000000000000001
Oct 16 09:39:13 Chunker kernel: R13: ffff88810207d064 R14: ffffffff820c8ca8 R15: 0000000000000000
Oct 16 09:39:13 Chunker kernel: ? native_safe_halt+0x5/0x8
Oct 16 09:39:13 Chunker kernel: arch_safe_halt+0x5/0x8
Oct 16 09:39:13 Chunker kernel: acpi_idle_do_entry+0x25/0x37
Oct 16 09:39:13 Chunker kernel: acpi_idle_enter+0x9a/0xa9
Oct 16 09:39:13 Chunker kernel: cpuidle_enter_state+0xba/0x1c4
Oct 16 09:39:13 Chunker kernel: cpuidle_enter+0x25/0x31
Oct 16 09:39:13 Chunker kernel: do_idle+0x1a6/0x214
Oct 16 09:39:13 Chunker kernel: cpu_startup_entry+0x18/0x1a
Oct 16 09:39:13 Chunker kernel: secondary_startup_64_no_verify+0xb0/0xbb
Oct 16 09:39:13 Chunker kernel: ---[ end trace 906f7f9f734c7e09 ]---
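
When skimming traces like this one, the `WARNING:` line is usually the quickest summary, since it names the source file, line, function and module. Here is a rough Python sketch for pulling those fields out; it assumes the line format seen in the trace above, and `summarize_warning` is a hypothetical helper name, not part of any tool.

```python
import re

# Format assumed from the WARNING line in the trace above.
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) "
    r"at (?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)

def summarize_warning(text):
    """Return the fields of the first kernel WARNING line, or None."""
    m = WARNING_RE.search(text)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ("cpu", "pid", "line"):
        fields[key] = int(fields[key])
    return fields

entry = (
    "Oct 16 09:39:13 Chunker kernel: WARNING: CPU: 8 PID: 0 at "
    "net/netfilter/nf_conntrack_core.c:1120 "
    "__nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]"
)
summary = summarize_warning(entry)
print(summary)
```

Running it over every line of a syslog and collecting the results would show whether the same function keeps warning or whether several different ones are involved.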

 

 

This is not my area of expertise, hopefully someone can point you on the right track.

  • 2 months later...

So over the course of the weekend the following has been done:
Memtest on the memory - 10 hours - 0 errors
Memory speed has been set to the base limit of the board (2666) (verified that this was actually applied)
C-states have all been disabled (verified)
Having learned the relationship between memory and btrfs, I removed my raid1 btrfs cache pool and formatted it, then set it all back up again
Rebuilt the docker file, and added all my containers back.

I'm slowly losing my goddamn mind...  The lockups happen sometimes twice a day now, though this last instance after I finished rebuilding all the docker stuff was just under 48 hours uptime (not going to lie, I was very hopeful).

Although I don't think there's anything new in these diagnostics and logs...they're attached.

log20211219.txt chunker-diagnostics-20211219-1533.zip


So here's a new one that I cannot explain at all... I was watching htop via SSH when I noticed all the cores peaking, followed by a ton of freezing and delays. The webui went down as it normally would (along with the containers), but the SSH terminal would still update every 30s or so. I watched for around 10 minutes and then hard reset it; I couldn't make sense of which processes were causing the issue. It wasn't until the server came back online and I went to check the log file that I saw it was 38GB in size!?!?!

It was repeating the text in the logfile I've uploaded over and over and over... perhaps there was other stuff, but scrolling through 38GB of text is not something I was willing to do.
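
For a log that size, counting the repeated messages is probably more practical than scrolling. A minimal Python sketch follows, assuming every line starts with the usual 16-character syslog timestamp (e.g. `Oct 16 09:39:13 `); `top_repeated` is a made-up name and the demo file contents are invented.

```python
import os
import tempfile
from collections import Counter

PREFIX_LEN = 16  # width of "Oct 16 09:39:13 " (assumed syslog timestamp)

def top_repeated(path, n=5):
    """Stream the file line by line and return the n most common messages.

    Memory use grows with the number of distinct messages, not with the
    file size, so a log that repeats the same text stays cheap to scan.
    """
    counts = Counter()
    with open(path, "r", errors="replace") as fh:
        for line in fh:
            counts[line[PREFIX_LEN:].rstrip("\n")] += 1
    return counts.most_common(n)

# Tiny invented demo file standing in for the real log.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Oct 16 09:39:13 Chunker kernel: made-up repeated message\n")
    tmp.write("Oct 16 09:39:14 Chunker kernel: made-up repeated message\n")
    tmp.write("Oct 16 09:39:15 Chunker kernel: made-up other message\n")
demo_path = tmp.name

result = top_repeated(demo_path, n=1)
os.remove(demo_path)
print(result)
```

Anything that shows up in the output besides the already-known repeating block would be the "other stuff" worth posting.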

logfile20211222.txt

