gdeyoung

Members
  • Posts: 31
  • Joined
  • Last visited




  1. It turned out not to be the UDM. The root cause was the DNS Docker container causing panics in the network stack.
  2. So, a quick update to close this out as fixed.
     - Ran all three servers with issues in safe mode; there were no kernel panics and all were stable.
     - Began troubleshooting which Docker container or plugin was causing the instability.
     - Found it was the Technitium DNS Docker container from Community Apps, with one wrinkle: on a 1G connection it was stable, but on a 10G connection it generated kernel panics and spin-lock errors. The stock install was on br0, so I wonder if the cause was in the container's network config.
     - Once I removed the Technitium DNS server container from the three servers, they have been completely stable.
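For anyone narrowing down a container the same way, the network a container is attached to can be checked from the shell with standard Docker commands (the container name below is a placeholder; use `docker ps` for the real one):

```shell
# Sketch: see which network mode a container runs under.
# "TechnitiumDNS" is a placeholder container name.
docker inspect -f '{{.HostConfig.NetworkMode}}' TechnitiumDNS

# List all running containers together with their attached networks:
docker ps --format '{{.Names}}\t{{.Networks}}'
```

A container attached directly to br0 gets its own MAC on the bridge, which behaves differently from the default bridge network under load, so it is worth confirming which mode a suspect container actually uses.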
  3. The 3rd server is in safe mode and still going. I switched the 2nd server back to 10G in normal mode with no file copies, and it panicked in 20 minutes. Diagnostics attached: mediaserver-diagnostics-20210122-1520.zip
  4. OK, I put one of the servers in safe mode on 10G and am doing some file copies.
  5. They have the default 1500 MTU on the 10G NICs. Should I be using 9000 for jumbo frames?
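For reference, the MTU can be checked and changed per interface with `ip link`. A sketch, assuming the 10G interface is eth1 (an assumption; check with `ip link`) and noting that jumbo frames only help if every hop in the path (NIC, switch, client) agrees on the same MTU:

```shell
# Show the current MTU (printed on the first line of output):
ip link show eth1

# Raise the MTU to 9000 for jumbo frames (requires root; a mismatch
# anywhere in the path causes silent drops or fragmentation problems):
ip link set eth1 mtu 9000

# Verify end-to-end with a ping that forbids fragmentation.
# 8972 = 9000 - 20 (IP header) - 8 (ICMP header); the address is a placeholder.
ping -M do -s 8972 -c 3 192.168.1.10
```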
  6. Yes, the Intel NICs seem to be more stable. I was also having transfer issues on some of my Windows PCs with Aquantia 10G, so I switched to Intel 10G across the board. Yes, I rolled back to 6.8.3 and had the same issues with both NICs. My other observation is that the panics happen more often on the ingest servers, where I copy files to more frequently.
  7. @JorgeB Thanks for continuing to engage; I really appreciate it. I have completed troubleshooting to try to localize the issue. I have completely rebuilt two of the four servers with new components; the only remaining original parts are the drives, and I still get the panics. I have swapped out all of the network gear: three different 10G switches and new cables. I have removed all external items, or replaced them several times with new ones, and still get the panics. In the last couple of days I switched two of the servers back to 1G, and they are rock solid with no issues and are not chatty in the logs, whereas on 10G I was getting a variety of things popping up in the logs every hour. This is not my first post on this. In my previous post I attached full diagnostics and got no replies. I DM'd @limetech for help and still silence. So I am trying; I really would like to get this working. All I have are the panic traces to go on now, and I don't have the knowledge to troubleshoot at that level. Here are the two types of NICs I have used that are supposed to be fully supported:
     - TRENDnet TEG-10GECSFP - SFP+, Aquantia chipset
     - Supermicro AOC-STGF-i2S - dual SFP+, Intel chipset
  8. So 2 days ago I switched my 2nd server from 10G to 1G. 1 day ago I switched my 3rd server from 10G to 1G. These are all different hardware machines, Intel and Ryzen, running a combination of 6.8.3 and 6.9-rc2. All of my servers on 10G (each on its swapped-out/2nd 10G NIC) kernel panic under heavy/sustained file copies within 24 hours. Without heavy file load they will panic within 72 hours. I have reworked the network and simplified the network configs. I have up-to-date BIOS on the motherboards. It all comes down to sustained load on the 10G Intel and Aquantia NICs. I have even tried three different 10G switches, new 10G DAC cables, and 10GBASE-T transceivers with Cat-7. It all comes back to something in the kernel that isn't right under heavy 10G network loads and causes panics. One thing I'm seeing is native_queued_spin_lock_slowpath errors before the full panic, but I'm not seeing high CPU loads. I found these two articles/posts that might have some relevance:
     - High CPU load by native_queued_spin_lock_slowpath (linuxquestions.org)
     - The need for speed and the kernel datapath - recent improvements in UDP packets processing - Red Hat Developer
     What can be done to get 10G working in a stable fashion under sustained file copy loads? That's the whole reason for 10G... @limetech @JorgeB
  9. Server 3 just panicked again. Again, this is a 10G server, also on its second 10G Intel NIC. It appears the panics happen more under large file copy loads on the 10G connection. I will move it back to 1G to see if it makes a difference.
     Jan 19 16:27:05 Homeserver kernel: Call Trace:
     Jan 19 16:27:05 Homeserver kernel: <IRQ>
     Jan 19 16:27:05 Homeserver kernel: dump_stack+0x67/0x83
     Jan 19 16:27:05 Homeserver kernel: nmi_cpu_backtrace+0x71/0x83
     Jan 19 16:27:05 Homeserver kernel: ? lapic_can_unplug_cpu+0x97/0x97
     Jan 19 16:27:05 Homeserver kernel: nmi_trigger_cpumask_backtrace+0x57/0xd4
     Jan 19 16:27:05 Homeserver kernel: rcu_dump_cpu_stacks+0x8b/0xb4
     Jan 19 16:27:05 Homeserver kernel: rcu_check_callbacks+0x296/0x5a0
     Jan 19 16:27:05 Homeserver kernel: update_process_times+0x24/0x47
     Jan 19 16:27:05 Homeserver kernel: tick_sched_timer+0x36/0x64
     Jan 19 16:27:05 Homeserver kernel: __hrtimer_run_queues+0xb7/0x10b
     Jan 19 16:27:05 Homeserver kernel: ? tick_sched_handle.isra.0+0x2f/0x2f
     Jan 19 16:27:05 Homeserver kernel: hrtimer_interrupt+0xf4/0x20e
     Jan 19 16:27:05 Homeserver kernel: smp_apic_timer_interrupt+0x7b/0x93
     Jan 19 16:27:05 Homeserver kernel: apic_timer_interrupt+0xf/0x20
     Jan 19 16:27:05 Homeserver kernel: </IRQ>
     Jan 19 16:27:05 Homeserver kernel: RIP: 0010:gc_worker+0xad/0x270
     Jan 19 16:27:05 Homeserver kernel: Code: f6 c6 01 0f 85 4a 01 00 00 41 0f b6 46 37 49 c7 c0 f0 ff ff ff 41 ff c5 48 6b c0 38 49 29 c0 4f 8d 3c 06 49 8b 97 80 00 00 00 <41> 8b 87 88 00 00 00 0f ba e2 0e 73 2c 48 8b 15 ce dc 88 00 29 d0
     Jan 19 16:27:05 Homeserver kernel: RSP: 0018:ffffc9001683fe60 EFLAGS: 00000296 ORIG_RAX: ffffffffffffff13
     Jan 19 16:27:05 Homeserver kernel: RAX: 0000000000000038 RBX: 0000000000000000 RCX: 0000000000010000
     Jan 19 16:27:05 Homeserver kernel: RDX: 0000000000000188 RSI: 00000000000000ad RDI: ffff8887f610d500
     Jan 19 16:27:05 Homeserver kernel: RBP: 0000000000005aae R08: ffffffffffffffb8 R09: ffffffff81574c00
     Jan 19 16:27:05 Homeserver kernel: R10: ffffea000edc5700 R11: ffff8887f610d501 R12: ffffffff822aa760
     Jan 19 16:27:05 Homeserver kernel: R13: 00000000dba74d6c R14: ffff8887abf8ca48 R15: ffff8887abf8ca00
     Jan 19 16:27:05 Homeserver kernel: ? nf_ct_get_id+0x80/0xb7
     Jan 19 16:27:05 Homeserver kernel: process_one_work+0x16e/0x24f
     Jan 19 16:27:05 Homeserver kernel: worker_thread+0x1e2/0x2b8
     Jan 19 16:27:05 Homeserver kernel: ? rescuer_thread+0x2a7/0x2a7
     Jan 19 16:27:05 Homeserver kernel: kthread+0x10c/0x114
     Jan 19 16:27:05 Homeserver kernel: ? kthread_park+0x89/0x89
     Jan 19 16:27:05 Homeserver kernel: ret_from_fork+0x22/0x40
  10. OK, to update this thread: I tried going back to 6.8.3 on the 2nd and 3rd of my 4 servers that are kernel panicking, and they are still having panics and crashes daily. The only server not experiencing any issues is my 4th one, the 1G-connected one. All of my 10G servers are panicking, and I have replaced the NICs with Intel server-class 10G NICs. I finally took my 2nd server back to a 1G connection to see if that stays stable. I have more log snippets from the 10G servers; it looks like they are also having a native_queued_spin_lock_slowpath error in the panic.
     Call Trace:
     Jan 19 12:52:28 Mediaserver kernel: <IRQ>
     Jan 19 12:52:28 Mediaserver kernel: dump_stack+0x67/0x83
     Jan 19 12:52:28 Mediaserver kernel: nmi_cpu_backtrace+0x71/0x83
     Jan 19 12:52:28 Mediaserver kernel: ? lapic_can_unplug_cpu+0x97/0x97
     Jan 19 12:52:28 Mediaserver kernel: nmi_trigger_cpumask_backtrace+0x57/0xd4
     Jan 19 12:52:28 Mediaserver kernel: rcu_dump_cpu_stacks+0x8b/0xb4
     Jan 19 12:52:28 Mediaserver kernel: rcu_check_callbacks+0x296/0x5a0
     Jan 19 12:52:28 Mediaserver kernel: update_process_times+0x24/0x47
     Jan 19 12:52:28 Mediaserver kernel: tick_sched_timer+0x36/0x64
     Jan 19 12:52:28 Mediaserver kernel: __hrtimer_run_queues+0xb7/0x10b
     Jan 19 12:52:28 Mediaserver kernel: ? tick_sched_handle.isra.0+0x2f/0x2f
     Jan 19 12:52:28 Mediaserver kernel: hrtimer_interrupt+0xf4/0x20e
     Jan 19 12:52:28 Mediaserver kernel: smp_apic_timer_interrupt+0x7b/0x93
     Jan 19 12:52:28 Mediaserver kernel: apic_timer_interrupt+0xf/0x20
     Jan 19 12:52:28 Mediaserver kernel: </IRQ>
     RIP: 0010:native_queued_spin_lock_slowpath+0x6b/0x171
     Jan 19 12:52:28 Mediaserver kernel: Code: 42 f0 8b 07 30 e4 09 c6 f7 c6 00 ff ff ff 74 0e 81 e6 00 ff 00 00 75 1a c6 47 01 00 eb 14 85 f6 74 0a 8b 07 84 c0 74 04 f3 90 <eb> f6 66 c7 07 01 00 c3 48 c7 c2 40 07 02 00 65 48 03 15 80 6a f8
     Jan 19 12:52:28 Mediaserver kernel: RSP: 0018:ffffc90003ce3b88 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
     Jan 19 12:52:28 Mediaserver kernel: RAX: 00000000001c0101 RBX: ffffc90003ce3c10 RCX: 000ffffffffff000
     Jan 19 12:52:28 Mediaserver kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffea002085d368
     Jan 19 12:52:28 Mediaserver kernel: RBP: ffffea0004d77200 R08: ffff888000000000 R09: ffffea0004d77240
     Jan 19 12:52:28 Mediaserver kernel: R10: 0000000000000008 R11: 0000000000023eb8 R12: ffffea0004d77200
     Jan 19 12:52:28 Mediaserver kernel: R13: ffff8882ed6dc400 R14: ffffea0004d77200 R15: ffff888114684600
  11. So my second server just crashed with a kernel panic. All three are having panics, and they are all different hardware. Any ideas from this trace?
     Jan 15 22:36:42 Mediaserver kernel: rcu: INFO: rcu_sched self-detected stall on CPU
     Jan 15 22:36:42 Mediaserver kernel: rcu: #0110-....: (59999 ticks this GP) idle=e7a/1/0x4000000000000000 softirq=11770626/11770626 fqs=14993
     Jan 15 22:36:42 Mediaserver kernel: #011(t=60000 jiffies g=13660245 q=3404623)
     Jan 15 22:36:42 Mediaserver kernel: NMI backtrace for cpu 0
     Jan 15 22:36:42 Mediaserver kernel: CPU: 0 PID: 28592 Comm: kworker/u24:0 Tainted: P O 5.10.1-Unraid #1
     Jan 15 22:36:42 Mediaserver kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z390 Extreme4, BIOS P2.30 12/25/2018
     Jan 15 22:36:42 Mediaserver kernel: Workqueue: events_power_efficient gc_worker
     Jan 15 22:36:42 Mediaserver kernel: Call Trace:
     Jan 15 22:36:42 Mediaserver kernel: <IRQ>
     Jan 15 22:36:42 Mediaserver kernel: dump_stack+0x6b/0x83
     Jan 15 22:36:42 Mediaserver kernel: ? lapic_can_unplug_cpu+0x8e/0x8e
     Jan 15 22:36:42 Mediaserver kernel: nmi_cpu_backtrace+0x7d/0x8f
     Jan 15 22:36:42 Mediaserver kernel: nmi_trigger_cpumask_backtrace+0x56/0xd3
     Jan 15 22:36:42 Mediaserver kernel: rcu_dump_cpu_stacks+0x9f/0xc6
     Jan 15 22:36:42 Mediaserver kernel: rcu_sched_clock_irq+0x1ec/0x543
     Jan 15 22:36:42 Mediaserver kernel: ? _raw_spin_unlock_irqrestore+0xd/0xe
     Jan 15 22:36:42 Mediaserver kernel: update_process_times+0x50/0x6e
     Jan 15 22:36:42 Mediaserver kernel: tick_sched_timer+0x36/0x64
     Jan 15 22:36:42 Mediaserver kernel: __hrtimer_run_queues+0xb7/0x10b
     Jan 15 22:36:42 Mediaserver kernel: ? tick_sched_do_timer+0x39/0x39
     Jan 15 22:36:42 Mediaserver kernel: hrtimer_interrupt+0x8d/0x160
     Jan 15 22:36:42 Mediaserver kernel: __sysvec_apic_timer_interrupt+0x5d/0x68
     Jan 15 22:36:42 Mediaserver kernel: asm_call_irq_on_stack+0x12/0x20
     Jan 15 22:36:42 Mediaserver kernel: </IRQ>
     Jan 15 22:36:42 Mediaserver kernel: sysvec_apic_timer_interrupt+0x71/0x95
     Jan 15 22:36:42 Mediaserver kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
     Jan 15 22:36:42 Mediaserver kernel: RIP: 0010:gc_worker+0xf4/0x240
     Jan 15 22:36:42 Mediaserver kernel: Code: 5c 26 05 41 89 47 08 e9 bc 00 00 00 48 8b 15 ec 05 a4 00 29 d0 85 c0 7f 11 4c 89 ff e8 10 f0 ff ff ff 44 24 08 e9 9e 00 00 00 <85> db 0f 84 96 00 00 00 49 8b 87 80 00 00 00 a8 08 0f 84 87 00 00
     Jan 15 22:36:42 Mediaserver kernel: RSP: 0018:ffffc9000525fe48 EFLAGS: 00000206
     Jan 15 22:36:42 Mediaserver kernel: RAX: 0000000001447690 RBX: 0000000000000000 RCX: ffff888103000000
     Jan 15 22:36:42 Mediaserver kernel: RDX: 00000001014602fe RSI: ffffc9000525fe5c RDI: ffff88840da28548
     Jan 15 22:36:42 Mediaserver kernel: RBP: 000000000000c386 R08: 0000000000000000 R09: ffffffff815c56ac
     Jan 15 22:36:42 Mediaserver kernel: R10: 8080808080808080 R11: ffff88830e1fa780 R12: ffffffff82547ec0
     Jan 15 22:36:42 Mediaserver kernel: R13: 000000009fd57c44 R14: ffff88840da28548 R15: ffff88840da28500
     Jan 15 22:36:42 Mediaserver kernel: ? nf_conntrack_free+0x2b/0x35
     Jan 15 22:36:42 Mediaserver kernel: ? gc_worker+0x9a/0x240
     Jan 15 22:36:42 Mediaserver kernel: process_one_work+0x13c/0x1d5
     Jan 15 22:36:42 Mediaserver kernel: worker_thread+0x18b/0x22f
     Jan 15 22:36:42 Mediaserver kernel: ? process_scheduled_works+0x27/0x27
     Jan 15 22:36:42 Mediaserver kernel: kthread+0xe5/0xea
     Jan 15 22:36:42 Mediaserver kernel: ? kthread_unpark+0x52/0x52
     Jan 15 22:36:42 Mediaserver kernel: ret_from_fork+0x22/0x30
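The gc_worker, nf_ct_get_id, and nf_conntrack_free frames in the traces above all belong to netfilter connection tracking, so one avenue worth checking is whether the conntrack table is under pressure during a sustained 10G copy. A sketch, using standard Linux procfs/sysctl paths (the 262144 value is only an illustration):

```shell
# Sketch: check netfilter conntrack pressure. The gc_worker /
# nf_conntrack frames in the panic traces suggest the connection
# tracking garbage collector is involved.

cat /proc/sys/net/netfilter/nf_conntrack_count   # entries currently in use
cat /proc/sys/net/netfilter/nf_conntrack_max     # table ceiling

# If count approaches max under load, the table can be enlarged
# (at the cost of some kernel memory; requires root):
sysctl -w net.netfilter.nf_conntrack_max=262144
```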
  12. OK, I figured it out. I was going about it backwards. In the network settings you can arrange the MAC addresses of the NICs onto whichever eth port you want to assign them to, so I just moved the 10G card's port 0 MAC address onto the eth0 configuration. To simplify the networking, I turned off the bond for eth0-eth2 that was set to active-passive (that was the Unraid default, BTW). I'm betting it was bouncing, since I only had 10G port 1 (eth1) plugged in. I will report back on stability.
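For anyone verifying the same thing, the state of the default bond can be inspected from a shell before and after a change like this. A sketch, assuming the bond is named bond0 (the usual default; adjust if yours differs):

```shell
# Sketch: inspect an active-backup ("active-passive") bond.
# A bond with only one cabled member can flap as the kernel
# re-evaluates the dead links.

cat /proc/net/bonding/bond0   # mode, currently active slave, per-link state
ip -d link show type bond     # bond mode and member details

# Repeated "link status definitely up/down" messages suggest the
# bond is bouncing between members:
dmesg | grep -i bond | tail -n 20
```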
  13. So I have a motherboard with an integrated 1G Ethernet port that is mounted as eth0. I have an Intel 10G 2-port SFP+ card that is mounted as eth1 and eth2, with a single DAC cable in eth1. It is configured as an active bridge on br0 for eth0, eth1, and eth2; this is all the default config. I went into the BIOS and turned off the built-in 1G port mounted as eth0, because I wanted the system to default to port 0 of the 10G card as eth0. On bootup there is an error that eth0 can't be found. How do I make the server forget the disabled 1G port and make port 0 of the 10G card eth0?
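One generic way to pin a name like eth0 to a specific port is to bind interface names to MAC addresses with udev-style rules. A sketch; the path /boot/config/network-rules.cfg is what Unraid reportedly uses for this mapping (an assumption here; deleting that file and rebooting is said to regenerate it), and the MAC addresses are placeholders for your own from `ip link`:

```shell
# Sketch: persistent MAC-to-name mapping, udev rule syntax.
# MACs below are placeholders; substitute the real ones from `ip link`.
# File path assumes Unraid's flash-based config; verify before use.

cat <<'EOF' > /boot/config/network-rules.cfg
# 10G card port 0 becomes eth0
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="aa:bb:cc:dd:ee:01", NAME="eth0"
# 10G card port 1 becomes eth1
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="aa:bb:cc:dd:ee:02", NAME="eth1"
EOF
```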
  14. Yes, I have the GPU stat plugin installed. Any insight on the kernel panic trace above?
  15. Do the above traces connect with the Nvidia driver at all? I'm seeing this in the log this morning after a reboot, repeated a lot.
     Jan 15 10:12:30 Homeserver kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
     Jan 15 10:12:30 Homeserver kernel: caller _nv000709rm+0x1af/0x200 [nvidia] mapping multiple BARs
     Jan 15 10:12:32 Homeserver kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
     Jan 15 10:12:32 Homeserver kernel: caller _nv000709rm+0x1af/0x200 [nvidia] mapping multiple BARs
     Jan 15 10:12:34 Homeserver kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
     Jan 15 10:12:34 Homeserver kernel: caller _nv000709rm+0x1af/0x200 [nvidia] mapping multiple BARs
     Jan 15 10:12:35 Homeserver kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
     Jan 15 10:12:35 Homeserver kernel: caller _nv000709rm+0x1af/0x200 [nvidia] mapping multiple BARs
     Jan 15 10:12:36 Homeserver kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
     Jan 15 10:12:36 Homeserver kernel: caller _nv000709rm+0x1af/0x200 [nvidia] mapping multiple BARs
     Jan 15 10:12:38 Homeserver kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
     Jan 15 10:12:38 Homeserver kernel: caller _nv000709rm+0x1af/0x200 [nvidia] mapping multiple BARs
     Jan 15 10:12:39 Homeserver kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
     Jan 15 10:12:39 Homeserver kernel: caller _nv000709rm+0x1af/0x200 [nvidia] mapping multiple BARs