sirkuz

Sudden instability/crashes 6.8.3 LinuxServer Nvidia Build


Before I revert to the standard build to see if anything changes, I thought I would post first. I had the Nvidia custom build running for months without any issues, and then a month or so ago the server started crashing. It typically happens randomly, as far as I can tell, after anywhere from a few hours to 3-5 days at most. Attached is the console output, along with the pertinent part of the logs. Perhaps someone more familiar with them could let me know whether it looks hardware related (failing memory/CPU) or software related.

 

Thank you in advance!

 

Oct  9 03:25:51 Tower root: mover: finished
Oct  9 03:29:53 Tower kernel: mdcmd (336): spindown 2
Oct  9 03:30:52 Tower kernel: mdcmd (337): spindown 10
Oct  9 03:31:07 Tower kernel: mdcmd (338): spindown 7
Oct  9 03:31:09 Tower kernel: mdcmd (339): spindown 9
Oct  9 03:31:10 Tower kernel: mdcmd (340): spindown 11
Oct  9 03:32:22 Tower kernel: mdcmd (341): spindown 8
Oct  9 03:32:51 Tower kernel: mdcmd (342): spindown 4
Oct  9 03:37:14 Tower kernel: mdcmd (343): spindown 3
Oct  9 03:40:17 Tower kernel: mdcmd (344): spindown 13
Oct  9 03:47:39 Tower kernel: mdcmd (345): spindown 15
Oct  9 03:49:22 Tower kernel: mdcmd (346): spindown 16
Oct  9 03:57:26 Tower kernel: mdcmd (347): spindown 1
Oct  9 04:10:39 Tower kernel: mdcmd (348): spindown 0
Oct  9 04:10:42 Tower kernel: mdcmd (349): spindown 6
Oct  9 04:10:43 Tower kernel: mdcmd (350): spindown 29
Oct  9 09:57:03 Tower nginx: 2020/10/09 09:57:03 [error] 13476#13476: *1269818 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 192.168.199.101, server: , request: "POST /webGui/include/DeviceList.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "8a3d2bfb48855a17ee15048f7164ef99e10efe5f.unraid.net:4443", referrer: "https://8a3d2bfb48855a17ee15048f7164ef99e10efe5f.unraid.net:4443/Main"
Oct  9 09:57:03 Tower php-fpm[13435]: [WARNING] [pool www] child 6660 exited on signal 7 (SIGBUS) after 124.006379 seconds from start
Oct  9 10:15:25 Tower kernel: mdcmd (351): spindown 1
Oct  9 10:15:35 Tower kernel: mdcmd (352): spindown 7
Oct  9 10:32:22 Tower kernel: mdcmd (353): spindown 11
Oct  9 11:14:57 Tower kernel: mdcmd (354): spindown 5
Oct  9 11:51:28 Tower kernel: mdcmd (355): spindown 4
Oct  9 11:51:43 Tower kernel: mdcmd (356): spindown 14
Oct  9 11:53:14 Tower kernel: mdcmd (357): spindown 12
Oct  9 11:53:15 Tower kernel: mdcmd (358): spindown 13
Oct  9 11:54:17 Tower kernel: mdcmd (359): spindown 9
Oct  9 11:54:17 Tower kernel: mdcmd (360): spindown 10
Oct  9 11:54:18 Tower kernel: mdcmd (361): spindown 11
Oct  9 11:55:18 Tower kernel: mdcmd (362): spindown 16
Oct  9 11:55:56 Tower kernel: mdcmd (363): spindown 3
Oct  9 11:56:15 Tower kernel: mdcmd (364): spindown 2
Oct  9 13:29:26 Tower kernel: WARNING: CPU: 16 PID: 27141 at net/netfilter/nf_conntrack_core.c:945 __nf_conntrack_confirm+0xa0/0x69e
Oct  9 13:29:26 Tower kernel: Modules linked in: vhost_net tun vhost tap kvm_intel kvm nvidia_uvm(O) xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables veth macvlan xt_nat ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs nfsd lockd grace sunrpc md_mod mlx4_en mlx4_core igb(O) nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) crc32_pclmul intel_rapl_perf intel_uncore pcbc aesni_intel aes_x86_64 glue_helper crypto_simd ghash_clmulni_intel cryptd intel_cstate coretemp crct10dif_pclmul intel_powerclamp drm_kms_helper crc32c_intel sb_edac drm syscopyarea mpt3sas isci x86_pkg_temp_thermal sysfillrect rsnvme(PO) sysimgblt fb_sys_fops ahci raid_class libsas nvme i2c_i801 nvme_core libahci agpgart wmi scsi_transport_sas ipmi_ssif pcc_cpufreq button
Oct  9 13:29:26 Tower kernel: i2c_core ipmi_si [last unloaded: tun]
Oct  9 13:29:26 Tower kernel: CPU: 16 PID: 27141 Comm: kworker/16:1 Tainted: P           O      4.19.107-Unraid #1
Oct  9 13:29:26 Tower kernel: Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.3 05/23/2018
Oct  9 13:29:26 Tower kernel: Workqueue: events macvlan_process_broadcast [macvlan]
Oct  9 13:29:26 Tower kernel: RIP: 0010:__nf_conntrack_confirm+0xa0/0x69e
Oct  9 13:29:26 Tower kernel: Code: 04 e8 56 fb ff ff 44 89 f2 44 89 ff 89 c6 41 89 c4 e8 7f f9 ff ff 48 8b 4c 24 08 84 c0 75 af 48 8b 85 80 00 00 00 a8 08 74 26 <0f> 0b 44 89 e6 44 89 ff 45 31 f6 e8 95 f1 ff ff be 00 02 00 00 48
Oct  9 13:29:26 Tower kernel: RSP: 0018:ffff889fffa03d58 EFLAGS: 00010202
Oct  9 13:29:26 Tower kernel: RAX: 0000000000000188 RBX: ffff888139ac1300 RCX: ffff88a0dc239098
Oct  9 13:29:26 Tower kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff81e09150
Oct  9 13:29:26 Tower kernel: RBP: ffff88a0dc239040 R08: 0000000029e50e39 R09: ffffffff81c8aa80
Oct  9 13:29:26 Tower kernel: R10: 0000000000000158 R11: ffffffff81e91080 R12: 000000000000aad4
Oct  9 13:29:26 Tower kernel: R13: ffffffff81e91080 R14: 0000000000000000 R15: 000000000000f92c
Oct  9 13:29:26 Tower kernel: FS:  0000000000000000(0000) GS:ffff889fffa00000(0000) knlGS:0000000000000000
Oct  9 13:29:26 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  9 13:29:26 Tower kernel: CR2: 0000148269c74000 CR3: 0000000001e0a001 CR4: 00000000001606e0
Oct  9 13:29:26 Tower kernel: Call Trace:
Oct  9 13:29:26 Tower kernel: <IRQ>
Oct  9 13:29:26 Tower kernel: ipv4_confirm+0xaf/0xb9
Oct  9 13:29:26 Tower kernel: nf_hook_slow+0x3a/0x90
Oct  9 13:29:26 Tower kernel: ip_local_deliver+0xad/0xdc
Oct  9 13:29:26 Tower kernel: ? ip_sublist_rcv_finish+0x54/0x54
Oct  9 13:29:26 Tower kernel: ip_sabotage_in+0x38/0x3e
Oct  9 13:29:26 Tower kernel: nf_hook_slow+0x3a/0x90
Oct  9 13:29:26 Tower kernel: ip_rcv+0x8e/0xbe
Oct  9 13:29:26 Tower kernel: ? ip_rcv_finish_core.isra.0+0x2e1/0x2e1
Oct  9 13:29:26 Tower kernel: __netif_receive_skb_one_core+0x53/0x6f
Oct  9 13:29:26 Tower kernel: process_backlog+0x77/0x10e
Oct  9 13:29:26 Tower kernel: net_rx_action+0x107/0x26c
Oct  9 13:29:26 Tower kernel: __do_softirq+0xc9/0x1d7
Oct  9 13:29:26 Tower kernel: do_softirq_own_stack+0x2a/0x40
Oct  9 13:29:26 Tower kernel: </IRQ>
Oct  9 13:29:26 Tower kernel: do_softirq+0x4d/0x5a
Oct  9 13:29:26 Tower kernel: netif_rx_ni+0x1c/0x22
Oct  9 13:29:26 Tower kernel: macvlan_broadcast+0x111/0x156 [macvlan]
Oct  9 13:29:26 Tower kernel: ? __switch_to_asm+0x41/0x70
Oct  9 13:29:26 Tower kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]
Oct  9 13:29:26 Tower kernel: process_one_work+0x16e/0x24f
Oct  9 13:29:26 Tower kernel: worker_thread+0x1e2/0x2b8
Oct  9 13:29:26 Tower kernel: ? rescuer_thread+0x2a7/0x2a7
Oct  9 13:29:26 Tower kernel: kthread+0x10c/0x114
Oct  9 13:29:26 Tower kernel: ? kthread_park+0x89/0x89
Oct  9 13:29:26 Tower kernel: ret_from_fork+0x35/0x40
Oct  9 13:29:26 Tower kernel: ---[ end trace de4fa2592551a7a5 ]---
Oct  9 13:34:02 Tower kernel: mdcmd (365): spindown 6
Oct  9 14:30:16 Tower kernel: mdcmd (366): spindown 4
Oct  9 14:30:33 Tower kernel: mdcmd (367): spindown 2
Oct  9 14:31:28 Tower kernel: mdcmd (368): spindown 8
Oct  9 14:31:29 Tower kernel: mdcmd (369): spindown 9
Oct  9 14:31:34 Tower kernel: mdcmd (370): spindown 1
Oct  9 14:32:16 Tower kernel: mdcmd (371): spindown 10
Oct  9 14:33:12 Tower kernel: mdcmd (372): spindown 12
Oct  9 16:04:08 Tower kernel: mdcmd (373): spindown 1
Oct  9 19:50:17 Tower kernel: mdcmd (374): spindown 10
Oct  9 19:50:27 Tower kernel: mdcmd (375): spindown 11
Oct  9 19:50:28 Tower kernel: mdcmd (376): spindown 13
Oct  9 19:50:30 Tower kernel: mdcmd (377): spindown 8
Oct  9 19:50:32 Tower kernel: mdcmd (378): spindown 2
Oct  9 19:50:33 Tower kernel: mdcmd (379): spindown 4
Oct  9 19:50:35 Tower kernel: mdcmd (380): spindown 9
Oct  9 19:50:37 Tower kernel: mdcmd (381): spindown 5
Oct  9 19:50:39 Tower kernel: mdcmd (382): spindown 14
Oct  9 19:50:41 Tower kernel: mdcmd (383): spindown 15
Oct  9 19:50:48 Tower kernel: mdcmd (384): spindown 16
Oct  9 19:50:50 Tower kernel: mdcmd (385): spindown 17
Oct  9 19:50:52 Tower kernel: mdcmd (386): spindown 1
Oct  9 19:50:54 Tower kernel: mdcmd (387): spindown 3
Oct  9 19:50:56 Tower kernel: mdcmd (388): spindown 6
Oct  9 20:12:25 Tower kernel: mdcmd (389): spindown 12
Oct  9 20:59:14 Tower kernel: mdcmd (390): spindown 4
Oct  9 20:59:25 Tower kernel: mdcmd (391): spindown 2
Oct  9 20:59:38 Tower kernel: mdcmd (392): spindown 9
Oct  9 21:00:00 Tower kernel: mdcmd (393): spindown 10
Oct  9 23:15:03 Tower kernel: mdcmd (394): spindown 17

java_EczFsYBj4G.png


Macvlan call traces are usually caused by having Docker containers with a custom IP address, more info here:

 


Thank you kindly Jorge! Will be looking that over and adjusting as needed.
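
For anyone else sifting through a long syslog like the one above, here is a rough Python sketch for checking whether the kernel warnings in it involve macvlan. It assumes each warning block starts at a `kernel: WARNING:` line and ends at `---[ end trace`, as in the log posted here, and that stack frames name their module in square brackets. The function name `find_kernel_warnings` and the sample lines are just for illustration; this is not an Unraid tool.

```python
import re
from typing import Dict, List

# Matches a stack frame that names its module,
# e.g. "macvlan_broadcast+0x111/0x156 [macvlan]".
FRAME_MODULE = re.compile(r"\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]")


def find_kernel_warnings(lines: List[str]) -> List[Dict[str, object]]:
    """Group each kernel WARNING block and collect the modules its frames name."""
    traces = []
    current = None
    for line in lines:
        if "kernel: WARNING:" in line:
            # Start of a new warning block; keep the text after "kernel: ".
            current = {"header": line.split("kernel: ", 1)[1].strip(), "modules": set()}
        elif current is not None:
            match = FRAME_MODULE.search(line)
            if match:
                current["modules"].add(match.group(1))
            if "---[ end trace" in line:
                traces.append(current)
                current = None
    return traces


# A few lines copied from the log above, used as sample input.
sample = [
    "Oct  9 13:29:26 Tower kernel: WARNING: CPU: 16 PID: 27141 at net/netfilter/nf_conntrack_core.c:945 __nf_conntrack_confirm+0xa0/0x69e",
    "Oct  9 13:29:26 Tower kernel: Workqueue: events macvlan_process_broadcast [macvlan]",
    "Oct  9 13:29:26 Tower kernel: netif_rx_ni+0x1c/0x22",
    "Oct  9 13:29:26 Tower kernel: macvlan_broadcast+0x111/0x156 [macvlan]",
    "Oct  9 13:29:26 Tower kernel: ---[ end trace de4fa2592551a7a5 ]---",
    "Oct  9 13:34:02 Tower kernel: mdcmd (365): spindown 6",
]

traces = find_kernel_warnings(sample)
for trace in traces:
    print(trace["header"])
    print("modules in call trace:", sorted(trace["modules"]))
```

To run it against a full log, read the file into a list of lines and pass that in instead of `sample`. If `macvlan` shows up in the module list for a warning, that trace matches the pattern Jorge describes; it does not by itself rule out a hardware problem.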
