gdeyoung

Members
  • Posts

    35
  • Joined

  • Last visited

Converted

  • Gender
    Undisclosed

Recent Profile Visitors

The recent visitors block is disabled and is not being shown to other users.

gdeyoung's Achievements

Noob

Noob (1/14)

3

Reputation

  1. I actually figured it out. You can install flatpaks in the steam headless and they are persistent. very cool
  2. I noticed on the last update a script that was updating flatpacks. Is there a way to add your own list of flatpaks to be installed and updated?
  3. I'm trying to revert to the previous version to fix the DB bug. I looked in your repo and you don't have a tag to load the previous or by version number that I can see. How do I revert the docker image?
  4. It turned out not to be the UDM. It was the DNS Docker causing panics on the network stack that was my root issue.
  5. So a quick update to close this out as fixed. - Ran all three servers with issues in safe mode and had no kernel panics and were stable. - Began trying to troubleshoot down which docker container or plugin was causing the instability - Found out it was the Technitum DNS docker container in community apps. With one wrinkle, on a 1G connection it was stable. On a 10G connection it generated kernel panics and spin lock errors. The stock install was on BR0 so wonder if it was in the dockers config for networking. - Once I removed the Technitum DNS server docker from the three servers they have been completely stable
  6. The 3rd server is in safe mode and still going I switched the 2nd server back to 10G in normal mode with no file copies and it panic'd in 20 minutes. diag's attached mediaserver-diagnostics-20210122-1520.zip
  7. Ok, I put one of the servers in safe mode on 10G and doing some file copies.
  8. They have the default 1500 MTU on the 10G nic. Should I be using 9000 for the jumbo frames?
  9. Yes, the Intel NIC seem to be more stable. I was also having issues with some of my Windows PC with Aquantia 10G and transfers, so I switched to Intel 10G across the board. Yes, rolled back to 6.8.3 and had the same issues for both nics. My other observation is the panics are happening on the ingest servers where I copy files to more often.
  10. @JorgeB Thank for continuing to engage, I really appreciate it. I have completed troubleshooting to try and localize down the issues. I have completely rebuilt two of the four servers with new components the only remaining thing is the Drives and still get the panic issues. I have swapped out all of the all of the network gear, three different 10G switches, new cables. I have removed all external items or replaced several times with new and still get the panics. In the last couple of days I switched two of the servers back to 1G and they are rock solid with no issues and are not chatty in the logs. Where with the 10G I was getting a variety of things pop up in logs every hour. This is not my first post on this. In my previous post I did post full diags and got NO replies. I DM'd @limetech for help and still silence. So I am trying, I really would like to get this working. All I have are the panic traces to go on now and don't have the knowledge to trouble shoot at that level. Here are the two types of NIC's I have used that are supposed to be fully supported. TRENDnet - TRENDnet TEG-10GECSFP - SFP+ Aquantia chipset Supermicro AOC-STGF-i2S - Dual SFP+ Intel chipset
  11. So 2 days ago I switched the my 2nd server from 10g to 1G. 1 day ago I switched my 3rd server to 1G from 10G. These are all different hardware machines Intel & Ryzen running a combo of 6.8.3 and 6.9rc2. All of my servers on 10G (all on their swapped out/2nd 10G NIC) kernel panics under heavy/sustained file copy within 24hrs. Without heavy file load they will panic under 72hrs. I have reworked network and simplified network configs. I have up to date bios on mobo's. It all comes down to sustained load on the 10G Intel and Aquantia nics. I have even three 3 different 10G switches, new 10G DAC cables, 10000base-T transcievers with Cat-7. It all comes back to there is something in the kernel that isn't right with heavy 10G network loads and causes panics. One thing I'm seeing is native_queued_spin_lock_slowpath errors before the full panic but I'm not seeing high CPU loads. Found these two articles/posts that might have some relevance. High CPU load by native_queued_spin_lock_slowpath (linuxquestions.org) The need for speed and the kernel datapath - recent improvements in UDP packets processing - Red Hat Developer What can be done to get 10G working in a stable fashion with sustained file copy loads? The whole reason for 10G... @limetech @JorgeB
  12. Server 3 just panic'd again. again this is a 10G server. also on it's second 10G Intel nic. It appears the panics happen more under large file copy loads on the 10G connection. Will move it back to 1G to see if it makes a difference. Jan 19 16:27:05 Homeserver kernel: Call Trace: Jan 19 16:27:05 Homeserver kernel: <IRQ> Jan 19 16:27:05 Homeserver kernel: dump_stack+0x67/0x83 Jan 19 16:27:05 Homeserver kernel: nmi_cpu_backtrace+0x71/0x83 Jan 19 16:27:05 Homeserver kernel: ? lapic_can_unplug_cpu+0x97/0x97 Jan 19 16:27:05 Homeserver kernel: nmi_trigger_cpumask_backtrace+0x57/0xd4 Jan 19 16:27:05 Homeserver kernel: rcu_dump_cpu_stacks+0x8b/0xb4 Jan 19 16:27:05 Homeserver kernel: rcu_check_callbacks+0x296/0x5a0 Jan 19 16:27:05 Homeserver kernel: update_process_times+0x24/0x47 Jan 19 16:27:05 Homeserver kernel: tick_sched_timer+0x36/0x64 Jan 19 16:27:05 Homeserver kernel: __hrtimer_run_queues+0xb7/0x10b Jan 19 16:27:05 Homeserver kernel: ? tick_sched_handle.isra.0+0x2f/0x2f Jan 19 16:27:05 Homeserver kernel: hrtimer_interrupt+0xf4/0x20e Jan 19 16:27:05 Homeserver kernel: smp_apic_timer_interrupt+0x7b/0x93 Jan 19 16:27:05 Homeserver kernel: apic_timer_interrupt+0xf/0x20 Jan 19 16:27:05 Homeserver kernel: </IRQ> Jan 19 16:27:05 Homeserver kernel: RIP: 0010:gc_worker+0xad/0x270 Jan 19 16:27:05 Homeserver kernel: Code: f6 c6 01 0f 85 4a 01 00 00 41 0f b6 46 37 49 c7 c0 f0 ff ff ff 41 ff c5 48 6b c0 38 49 29 c0 4f 8d 3c 06 49 8b 97 80 00 00 00 <41> 8b 87 88 00 00 00 0f ba e2 0e 73 2c 48 8b 15 ce dc 88 00 29 d0 Jan 19 16:27:05 Homeserver kernel: RSP: 0018:ffffc9001683fe60 EFLAGS: 00000296 ORIG_RAX: ffffffffffffff13 Jan 19 16:27:05 Homeserver kernel: RAX: 0000000000000038 RBX: 0000000000000000 RCX: 0000000000010000 Jan 19 16:27:05 Homeserver kernel: RDX: 0000000000000188 RSI: 00000000000000ad RDI: ffff8887f610d500 Jan 19 16:27:05 Homeserver kernel: RBP: 0000000000005aae R08: ffffffffffffffb8 R09: ffffffff81574c00 Jan 19 16:27:05 Homeserver kernel: R10: ffffea000edc5700 R11: ffff8887f610d501 R12: ffffffff822aa760 Jan 19 16:27:05 Homeserver kernel: R13: 00000000dba74d6c R14: ffff8887abf8ca48 R15: ffff8887abf8ca00 Jan 19 16:27:05 Homeserver kernel: ? nf_ct_get_id+0x80/0xb7 Jan 19 16:27:05 Homeserver kernel: process_one_work+0x16e/0x24f Jan 19 16:27:05 Homeserver kernel: worker_thread+0x1e2/0x2b8 Jan 19 16:27:05 Homeserver kernel: ? rescuer_thread+0x2a7/0x2a7 Jan 19 16:27:05 Homeserver kernel: kthread+0x10c/0x114 Jan 19 16:27:05 Homeserver kernel: ? kthread_park+0x89/0x89 Jan 19 16:27:05 Homeserver kernel: ret_from_fork+0x22/0x40
  13. Ok to update this thread. I tried going back to 6.8.3 on the 2nd and 3rd of my 4 servers that are kernel panicking and they still having panics and crashes daily. My only server that is not experiencing any issues is my 4thone that is 1G connected one. All of my 10G are panicking, and I have replaced the nics to intel server class 10g nics. I finally took my 2nd server back to a 1G connection to see if that stays stable. I have more log snippets from the 10G servers. It looks like they are also having a native_queued_spin_lock_slowpath error in the panic. Call Trace: Jan 19 12:52:28 Mediaserver kernel: <IRQ> Jan 19 12:52:28 Mediaserver kernel: dump_stack+0x67/0x83 Jan 19 12:52:28 Mediaserver kernel: nmi_cpu_backtrace+0x71/0x83 Jan 19 12:52:28 Mediaserver kernel: ? lapic_can_unplug_cpu+0x97/0x97 Jan 19 12:52:28 Mediaserver kernel: nmi_trigger_cpumask_backtrace+0x57/0xd4 Jan 19 12:52:28 Mediaserver kernel: rcu_dump_cpu_stacks+0x8b/0xb4 Jan 19 12:52:28 Mediaserver kernel: rcu_check_callbacks+0x296/0x5a0 Jan 19 12:52:28 Mediaserver kernel: update_process_times+0x24/0x47 Jan 19 12:52:28 Mediaserver kernel: tick_sched_timer+0x36/0x64 Jan 19 12:52:28 Mediaserver kernel: __hrtimer_run_queues+0xb7/0x10b Jan 19 12:52:28 Mediaserver kernel: ? tick_sched_handle.isra.0+0x2f/0x2f Jan 19 12:52:28 Mediaserver kernel: hrtimer_interrupt+0xf4/0x20e Jan 19 12:52:28 Mediaserver kernel: smp_apic_timer_interrupt+0x7b/0x93 Jan 19 12:52:28 Mediaserver kernel: apic_timer_interrupt+0xf/0x20 Jan 19 12:52:28 Mediaserver kernel: </IRQ> RIP: 0010:native_queued_spin_lock_slowpath+0x6b/0x171 Jan 19 12:52:28 Mediaserver kernel: Code: 42 f0 8b 07 30 e4 09 c6 f7 c6 00 ff ff ff 74 0e 81 e6 00 ff 00 00 75 1a c6 47 01 00 eb 14 85 f6 74 0a 8b 07 84 c0 74 04 f3 90 <eb> f6 66 c7 07 01 00 c3 48 c7 c2 40 07 02 00 65 48 03 15 80 6a f8 Jan 19 12:52:28 Mediaserver kernel: RSP: 0018:ffffc90003ce3b88 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13 Jan 19 12:52:28 Mediaserver kernel: RAX: 00000000001c0101 RBX: ffffc90003ce3c10 RCX: 000ffffffffff000 Jan 19 12:52:28 Mediaserver kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffea002085d368 Jan 19 12:52:28 Mediaserver kernel: RBP: ffffea0004d77200 R08: ffff888000000000 R09: ffffea0004d77240 Jan 19 12:52:28 Mediaserver kernel: R10: 0000000000000008 R11: 0000000000023eb8 R12: ffffea0004d77200 Jan 19 12:52:28 Mediaserver kernel: R13: ffff8882ed6dc400 R14: ffffea0004d77200 R15: ffff888114684600
  14. So my second server just crashed with a kernel panic, all three are having panics and they are all different hardware. Any idea from this trace? Jan 15 22:36:42 Mediaserver kernel: rcu: INFO: rcu_sched self-detected stall on CPU Jan 15 22:36:42 Mediaserver kernel: rcu: #0110-....: (59999 ticks this GP) idle=e7a/1/0x4000000000000000 softirq=11770626/11770626 fqs=14993 Jan 15 22:36:42 Mediaserver kernel: #011(t=60000 jiffies g=13660245 q=3404623) Jan 15 22:36:42 Mediaserver kernel: NMI backtrace for cpu 0 Jan 15 22:36:42 Mediaserver kernel: CPU: 0 PID: 28592 Comm: kworker/u24:0 Tainted: P O 5.10.1-Unraid #1 Jan 15 22:36:42 Mediaserver kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z390 Extreme4, BIOS P2.30 12/25/2018 Jan 15 22:36:42 Mediaserver kernel: Workqueue: events_power_efficient gc_worker Jan 15 22:36:42 Mediaserver kernel: Call Trace: Jan 15 22:36:42 Mediaserver kernel: <IRQ> Jan 15 22:36:42 Mediaserver kernel: dump_stack+0x6b/0x83 Jan 15 22:36:42 Mediaserver kernel: ? lapic_can_unplug_cpu+0x8e/0x8e Jan 15 22:36:42 Mediaserver kernel: nmi_cpu_backtrace+0x7d/0x8f Jan 15 22:36:42 Mediaserver kernel: nmi_trigger_cpumask_backtrace+0x56/0xd3 Jan 15 22:36:42 Mediaserver kernel: rcu_dump_cpu_stacks+0x9f/0xc6 Jan 15 22:36:42 Mediaserver kernel: rcu_sched_clock_irq+0x1ec/0x543 Jan 15 22:36:42 Mediaserver kernel: ? _raw_spin_unlock_irqrestore+0xd/0xe Jan 15 22:36:42 Mediaserver kernel: update_process_times+0x50/0x6e Jan 15 22:36:42 Mediaserver kernel: tick_sched_timer+0x36/0x64 Jan 15 22:36:42 Mediaserver kernel: __hrtimer_run_queues+0xb7/0x10b Jan 15 22:36:42 Mediaserver kernel: ? tick_sched_do_timer+0x39/0x39 Jan 15 22:36:42 Mediaserver kernel: hrtimer_interrupt+0x8d/0x160 Jan 15 22:36:42 Mediaserver kernel: __sysvec_apic_timer_interrupt+0x5d/0x68 Jan 15 22:36:42 Mediaserver kernel: asm_call_irq_on_stack+0x12/0x20 Jan 15 22:36:42 Mediaserver kernel: </IRQ> Jan 15 22:36:42 Mediaserver kernel: sysvec_apic_timer_interrupt+0x71/0x95 Jan 15 22:36:42 Mediaserver kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20 Jan 15 22:36:42 Mediaserver kernel: RIP: 0010:gc_worker+0xf4/0x240 Jan 15 22:36:42 Mediaserver kernel: Code: 5c 26 05 41 89 47 08 e9 bc 00 00 00 48 8b 15 ec 05 a4 00 29 d0 85 c0 7f 11 4c 89 ff e8 10 f0 ff ff ff 44 24 08 e9 9e 00 00 00 <85> db 0f 84 96 00 00 00 49 8b 87 80 00 00 00 a8 08 0f 84 87 00 00 Jan 15 22:36:42 Mediaserver kernel: RSP: 0018:ffffc9000525fe48 EFLAGS: 00000206 Jan 15 22:36:42 Mediaserver kernel: RAX: 0000000001447690 RBX: 0000000000000000 RCX: ffff888103000000 Jan 15 22:36:42 Mediaserver kernel: RDX: 00000001014602fe RSI: ffffc9000525fe5c RDI: ffff88840da28548 Jan 15 22:36:42 Mediaserver kernel: RBP: 000000000000c386 R08: 0000000000000000 R09: ffffffff815c56ac Jan 15 22:36:42 Mediaserver kernel: R10: 8080808080808080 R11: ffff88830e1fa780 R12: ffffffff82547ec0 Jan 15 22:36:42 Mediaserver kernel: R13: 000000009fd57c44 R14: ffff88840da28548 R15: ffff88840da28500 Jan 15 22:36:42 Mediaserver kernel: ? nf_conntrack_free+0x2b/0x35 Jan 15 22:36:42 Mediaserver kernel: ? gc_worker+0x9a/0x240 Jan 15 22:36:42 Mediaserver kernel: process_one_work+0x13c/0x1d5 Jan 15 22:36:42 Mediaserver kernel: worker_thread+0x18b/0x22f Jan 15 22:36:42 Mediaserver kernel: ? process_scheduled_works+0x27/0x27 Jan 15 22:36:42 Mediaserver kernel: kthread+0xe5/0xea Jan 15 22:36:42 Mediaserver kernel: ? kthread_unpark+0x52/0x52 Jan 15 22:36:42 Mediaserver kernel: ret_from_fork+0x22/0x30
  15. Ok, I figured it out. I was going about it backwards. In the network settings you can arrange the MAC addresses of the NIC's to what Eth port you want to assign them to. I just rearranged the port 0 MAC address to the Eth0 configuration To simplify the networking I turned off the bond for Eth 0-2 that was set to active-passive (that was the unraid default BTW) . I'm betting it was bouncing since I only had 10G port 1 (Eth1) plugged in. I will report back on the stability