Server frozen again! Finally have logs



On and off since I moved to 6.9-RC2, I have had a freeze issue. In that time I have replaced the motherboard and RAM, and the RAM is now ECC. From my reading of the logs it looks like a driver crash, either macvlan or Nvidia?

Syslog attached; any help is much appreciated. The server has to be hard reset before I can get any access, so diagnostics aren't possible.

Apr 18 04:13:38 thelibrary kernel: NETDEV WATCHDOG: eth1 (igb): transmit queue 2 timed out
Apr 18 04:13:38 thelibrary kernel: WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:442 dev_watchdog+0xcf/0x12b
Apr 18 04:13:38 thelibrary kernel: Modules linked in: macvlan md_mod nvidia_uvm(PO) veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs nvidia_drm(PO) nvidia_modeset(PO) drm_kms_helper drm backlight agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops nvidia(PO) nct6775 hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding igb i2c_algo_bit ipmi_ssif amd64_edac_mod edac_mce_amd kvm_amd kvm wmi_bmof crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd mpt3sas cryptd nvme nvme_core ccp ahci i2c_piix4 wmi raid_class glue_helper scsi_transport_sas rapl k10temp i2c_core acpi_ipmi libahci button ipmi_si acpi_cpufreq [last unloaded: md_mod]

 

syslog


Already removed; I saw that info earlier this week. The only GPU-related things installed are the driver plugin, a user script to allow more than three streams, and Plex. As far as I am aware, nothing else is touching nvidia-smi or the GPU.


It looks like I have had another kernel panic/macvlan crash, but no lockup this time, so I was able to export diagnostics. Hopefully they show something that can stop this altogether. As I said earlier, I no longer have any Docker containers with static custom IPs in my Docker setup.

Apr 18 14:54:38 thelibrary kernel: ------------[ cut here ]------------
Apr 18 14:54:38 thelibrary kernel: WARNING: CPU: 6 PID: 13151 at net/netfilter/nf_conntrack_core.c:1120 __nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]
Apr 18 14:54:38 thelibrary kernel: Modules linked in: macvlan nvidia_uvm(PO) veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs md_mod nvidia_drm(PO) nvidia_modeset(PO) drm_kms_helper drm backlight agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops nvidia(PO) nct6775 hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables bonding igb i2c_algo_bit ipmi_ssif amd64_edac_mod edac_mce_amd kvm_amd wmi_bmof kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel mpt3sas crypto_simd i2c_piix4 cryptd i2c_core nvme raid_class glue_helper ccp nvme_core scsi_transport_sas rapl ahci wmi acpi_ipmi k10temp libahci button ipmi_si acpi_cpufreq [last unloaded: i2c_algo_bit]
Apr 18 14:54:38 thelibrary kernel: CPU: 6 PID: 13151 Comm: kworker/6:0 Tainted: P           O      5.10.28-Unraid #1
Apr 18 14:54:38 thelibrary kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470D4U, BIOS P3.50 11/02/2020
Apr 18 14:54:38 thelibrary kernel: Workqueue: events macvlan_process_broadcast [macvlan]
Apr 18 14:54:38 thelibrary kernel: RIP: 0010:__nf_conntrack_confirm+0x9b/0x1e6 [nf_conntrack]
Apr 18 14:54:38 thelibrary kernel: Code: e8 dc f8 ff ff 44 89 fa 89 c6 41 89 c4 48 c1 eb 20 89 df 41 89 de e8 36 f6 ff ff 84 c0 75 bb 48 8b 85 80 00 00 00 a8 08 74 18 <0f> 0b 89 df 44 89 e6 31 db e8 6d f3 ff ff e8 35 f5 ff ff e9 22 01
Apr 18 14:54:38 thelibrary kernel: RSP: 0018:ffffc9000031cd38 EFLAGS: 00010202
Apr 18 14:54:38 thelibrary kernel: RAX: 0000000000000188 RBX: 0000000000006b19 RCX: 00000000455a00ea
Apr 18 14:54:38 thelibrary kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffa01e8e64
Apr 18 14:54:38 thelibrary kernel: RBP: ffff8881c977ea80 R08: 000000001f21a935 R09: ffff8881965b4c20
Apr 18 14:54:38 thelibrary kernel: R10: 0000000000000158 R11: ffff888195fdbf00 R12: 0000000000008cdd
Apr 18 14:54:38 thelibrary kernel: R13: ffffffff8210b440 R14: 0000000000006b19 R15: 0000000000000000
Apr 18 14:54:38 thelibrary kernel: FS:  0000000000000000(0000) GS:ffff8887fe980000(0000) knlGS:0000000000000000
Apr 18 14:54:38 thelibrary kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 18 14:54:38 thelibrary kernel: CR2: 000015146e169000 CR3: 0000000246280000 CR4: 0000000000350ee0
Apr 18 14:54:38 thelibrary kernel: Call Trace:
Apr 18 14:54:38 thelibrary kernel: <IRQ>
Apr 18 14:54:38 thelibrary kernel: nf_conntrack_confirm+0x2f/0x36 [nf_conntrack]
Apr 18 14:54:38 thelibrary kernel: nf_hook_slow+0x39/0x8e
Apr 18 14:54:38 thelibrary kernel: nf_hook.constprop.0+0xb1/0xd8
Apr 18 14:54:38 thelibrary kernel: ? ip_protocol_deliver_rcu+0xfe/0xfe
Apr 18 14:54:38 thelibrary kernel: ip_local_deliver+0x49/0x75
Apr 18 14:54:38 thelibrary kernel: ip_sabotage_in+0x43/0x4d [br_netfilter]
Apr 18 14:54:38 thelibrary kernel: nf_hook_slow+0x39/0x8e
Apr 18 14:54:38 thelibrary kernel: nf_hook.constprop.0+0xb1/0xd8
Apr 18 14:54:38 thelibrary kernel: ? l3mdev_l3_rcv.constprop.0+0x50/0x50
Apr 18 14:54:38 thelibrary kernel: ip_rcv+0x41/0x61
Apr 18 14:54:38 thelibrary kernel: __netif_receive_skb_one_core+0x74/0x95
Apr 18 14:54:38 thelibrary kernel: process_backlog+0xa3/0x13b
Apr 18 14:54:38 thelibrary kernel: net_rx_action+0xf4/0x29d
Apr 18 14:54:38 thelibrary kernel: __do_softirq+0xc4/0x1c2
Apr 18 14:54:38 thelibrary kernel: asm_call_irq_on_stack+0x12/0x20
Apr 18 14:54:38 thelibrary kernel: </IRQ>
Apr 18 14:54:38 thelibrary kernel: do_softirq_own_stack+0x2c/0x39
Apr 18 14:54:38 thelibrary kernel: do_softirq+0x3a/0x44
Apr 18 14:54:38 thelibrary kernel: netif_rx_ni+0x1c/0x22
Apr 18 14:54:38 thelibrary kernel: macvlan_broadcast+0x10e/0x13c [macvlan]
Apr 18 14:54:38 thelibrary kernel: macvlan_process_broadcast+0xf8/0x143 [macvlan]
Apr 18 14:54:38 thelibrary kernel: process_one_work+0x13c/0x1d5
Apr 18 14:54:38 thelibrary kernel: worker_thread+0x18b/0x22f
Apr 18 14:54:38 thelibrary kernel: ? process_scheduled_works+0x27/0x27
Apr 18 14:54:38 thelibrary kernel: kthread+0xe5/0xea
Apr 18 14:54:38 thelibrary kernel: ? __kthread_bind_mask+0x57/0x57
Apr 18 14:54:38 thelibrary kernel: ret_from_fork+0x22/0x30
Apr 18 14:54:38 thelibrary kernel: ---[ end trace d16416a764eaff38 ]---

 

thelibrary-diagnostics-20210418-2211.zip

Edited by DuzAwe

One thing you can try is to boot the server in safe mode with Docker and VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
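
While it runs like that, one way to tell whether either of the warnings quoted above comes back is to scan the syslog for them. What follows is a rough Python sketch, not an Unraid tool. It assumes a plain-text syslog with the usual `Apr 18 04:13:38` style prefix. The two patterns are copied from the traces in this thread, and `find_events` and `scan_file` are made-up helper names.

```python
import re

# Hypothetical patterns copied from the two warnings quoted in this thread;
# they are not part of any official Unraid diagnostic.
SIGNATURES = [
    ("igb transmit timeout",
     re.compile(r"NETDEV WATCHDOG: \S+ \(igb\): transmit queue \d+ timed out")),
    ("nf_conntrack confirm warning",
     re.compile(r"WARNING: CPU: \d+ PID: \d+ at "
                r"net/netfilter/nf_conntrack_core\.c:\d+ __nf_conntrack_confirm")),
]

# Assumes classic syslog lines that start with "Mon DD HH:MM:SS".
TIMESTAMP = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2})")


def find_events(lines):
    """Return (timestamp, label) for every line that matches a known signature."""
    events = []
    for line in lines:
        for label, pattern in SIGNATURES:
            if pattern.search(line):
                stamp = TIMESTAMP.match(line)
                events.append((stamp.group(1) if stamp else "?", label))
    return events


def scan_file(path):
    """Hypothetical helper: scan a saved syslog file on disk."""
    with open(path, errors="replace") as handle:
        return find_events(handle)
```

The path to pass to `scan_file` depends on your setup (for example a syslog copy kept on the flash drive or sent to a syslog server). An empty result only means neither warning was logged; a hard freeze may leave nothing in the log at all.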
