gdeyoung

Members
  • Content Count

    30
  • Joined

  • Last visited

Community Reputation

2 Neutral

About gdeyoung

  • Rank
    Newbie


  1. So a quick update to close this out as fixed.
     - Ran all three servers with issues in safe mode: no kernel panics, and they were stable.
     - Began troubleshooting which Docker container or plugin was causing the instability.
     - Found out it was the Technitium DNS Docker container from Community Apps, with one wrinkle: on a 1G connection it was stable, but on a 10G connection it generated kernel panics and spin lock errors. The stock install was on br0, so I wonder if it was in the Docker networking config (see the Docker networking sketch at the end of this page).
     - Once I removed the Technitium DNS server Docker from t
  2. The 3rd server is in safe mode and still going. I switched the 2nd server back to 10G in normal mode with no file copies, and it panicked in 20 minutes. Diagnostics attached: mediaserver-diagnostics-20210122-1520.zip
  3. Ok, I put one of the servers in safe mode on 10G and am doing some file copies.
  4. They have the default 1500 MTU on the 10G NIC. Should I be using 9000 for jumbo frames? (See the MTU sketch at the end of this page.)
  5. Yes, the Intel NICs seem to be more stable. I was also having issues with transfers on some of my Windows PCs with Aquantia 10G, so I switched to Intel 10G across the board. Yes, I rolled back to 6.8.3 and had the same issues with both NICs. My other observation is that the panics happen more often on the ingest servers where I copy files to.
  6. @JorgeB Thanks for continuing to engage, I really appreciate it. I have completed troubleshooting to try to localize the issue. I have completely rebuilt two of the four servers with new components; the only remaining original parts are the drives, and I still get the panic issues. I have swapped out all of the network gear: three different 10G switches, new cables. I have removed all external items, or replaced them several times with new ones, and still get the panics. In the last couple of days I switched two of the servers back to 1G and they are rock solid with no issue
  7. So 2 days ago I switched my 2nd server from 10G to 1G. 1 day ago I switched my 3rd server from 10G to 1G. These are all different hardware machines, Intel & Ryzen, running a combination of 6.8.3 and 6.9-rc2. All of my servers on 10G (each on its swapped-out, 2nd 10G NIC) kernel panic under heavy/sustained file copies within 24 hrs. Without heavy file load they will panic within 72 hrs. I have reworked the network and simplified the network configs. I have up-to-date BIOS on the mobos. It all comes down to sustained load on the 10G Intel and Aquantia NICs (a load-test sketch is at the end of this page). I have even tried three different 10G switches, n
  8. Server 3 just panicked again. Again, this is a 10G server, also on its second 10G Intel NIC. It appears the panics happen more under large file copy loads on the 10G connection. Will move it back to 1G to see if it makes a difference.
     Jan 19 16:27:05 Homeserver kernel: Call Trace:
     Jan 19 16:27:05 Homeserver kernel: <IRQ>
     Jan 19 16:27:05 Homeserver kernel: dump_stack+0x67/0x83
     Jan 19 16:27:05 Homeserver kernel: nmi_cpu_backtrace+0x71/0x83
     Jan 19 16:27:05 Homeserver kernel: ? lapic_can_unplug_cpu+0x97/0x97
     Jan 19 16:27:05 Homeserver kernel: nmi_trigger_cpumask_back
  9. Ok, to update this thread: I tried going back to 6.8.3 on the 2nd and 3rd of my 4 servers that are kernel panicking, and they are still having panics and crashes daily. My only server that is not experiencing any issues is my 4th one, the 1G-connected one. All of my 10G servers are panicking, and I have replaced the NICs with Intel server-class 10G NICs. I finally took my 2nd server back to a 1G connection to see if that stays stable. I have more log snippets from the 10G servers. It looks like they are also having a native_queued_spin_lock_slowpath error in the panic. Call Trac
  10. So my second server just crashed with a kernel panic; all three are having panics and they are all different hardware. Any idea from this trace?
      Jan 15 22:36:42 Mediaserver kernel: rcu: INFO: rcu_sched self-detected stall on CPU
      Jan 15 22:36:42 Mediaserver kernel: rcu: #0110-....: (59999 ticks this GP) idle=e7a/1/0x4000000000000000 softirq=11770626/11770626 fqs=14993
      Jan 15 22:36:42 Mediaserver kernel: #011(t=60000 jiffies g=13660245 q=3404623)
      Jan 15 22:36:42 Mediaserver kernel: NMI backtrace for cpu 0
      Jan 15 22:36:42 Mediaserver kernel: CPU: 0 PID: 28592 Comm: kworker/u2
  11. Ok, I figured it out. I was going about it backwards. In the network settings you can arrange the MAC addresses of the NICs to whichever eth port you want to assign them to. I just rearranged the 10G card's port 0 MAC address to the eth0 configuration. To simplify the networking I turned off the bond for eth0-2 that was set to active-passive (that was the Unraid default, BTW). I'm betting it was bouncing since I only had 10G port 1 (eth1) plugged in. I will report back on the stability. (A udev-style sketch of this MAC-to-name mapping is at the end of this page.)
  12. So I have a MB with an integrated 1G ethernet that is mounted as eth0.
      I have an Intel 10G 2-port SFP+ card that is mounted as eth1 and eth2.
      I have a single DAC cable in eth1 of the 10G card.
      It is configured as an active bridge on br0 for eth0, eth1, eth2. This is all the default config.
      I went into the BIOS and turned off the built-in 1G card mounted as eth0; I wanted the system to default to port 0 of the 10G card as eth0.
      On bootup it has an error that eth0 can't be found. How do I make the server forget the disabled 1G port and make the 10G port 0 as Et
  13. Yes, I have the GPU stat plugin installed. Any insight on the kernel panic trace above?
  14. Do the above traces connect with the Nvidia driver at all? I'm seeing this in the log this morning after a reboot, repeated a lot.
      Jan 15 10:12:30 Homeserver kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
      Jan 15 10:12:30 Homeserver kernel: caller _nv000709rm+0x1af/0x200 [nvidia] mapping multiple BARs
      Jan 15 10:12:32 Homeserver kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
      J
  15. Happened again 24 hrs later. Page faulted:
      Jan 11 08:30:00 Homeserver kernel: BUG: unable to handle page fault for address: 00000000000053d8
      Whole trace:
      Jan 11 04:07:30 Homeserver kernel: br0: port 1(bond0) entered forwarding state
      Jan 11 04:08:25 Homeserver flash_backup: adding task: php /usr/local/emhttp/plugins/dynamix.unraid.net/include/UpdateFlashBackup.php update
      Jan 11 04:15:05 Homeserver kernel: br0: received packet on bond0 with own address as source address (addr:30:9c:23:af:51:e0, vlan:0)
      Jan 11 04:15:05 Homeserver kernel: br0: received pac
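
For the Docker networking question in post 1: a minimal sketch, assuming the stock template attached the container to a custom network on br0, of running Technitium DNS on Docker's default bridge with its ports published instead. The container name is illustrative; the image name and ports follow the public technitium/dns-server image, not necessarily the poster's exact config.

    # Hypothetical sketch: run Technitium DNS on Docker's default
    # bridge instead of a custom network attached to br0.
    docker run -d --name technitium-dns \
      --network bridge \
      -p 53:53/udp -p 53:53/tcp \
      -p 5380:5380 \
      technitium/dns-server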
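For the jumbo-frame question in post 4: a minimal sketch of checking and temporarily raising the MTU on Linux (eth1 and the ping target are assumed names). Jumbo frames only help if every hop, switch ports included, carries the same MTU; the don't-fragment ping verifies the path end to end.

    # Hypothetical sketch: inspect and temporarily set a 9000 MTU.
    # Not persistent across reboots; the switch must accept jumbo frames.
    ip link show eth1                # current MTU is printed on the first line
    ip link set dev eth1 mtu 9000

    # 9000-byte MTU minus 28 bytes of IP/ICMP headers = 8972-byte payload.
    ping -M do -s 8972 192.168.1.10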
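For the sustained-load pattern in post 7: a minimal sketch of generating sustained network load without touching the disks, assuming iperf3 is installed on both ends (hostname and duration are placeholders). If this alone reproduces the panic, it points at the NIC/driver path rather than the file copies themselves.

    # Hypothetical sketch: one hour of sustained traffic over four
    # parallel streams against the server under test.
    iperf3 -s                            # run on the server being tested
    iperf3 -c tower.local -t 3600 -P 4   # run from another 10G machine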
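For the MAC-to-port assignment in post 11: on a stock Linux system the same mapping can be expressed as a udev rule. A minimal sketch with a placeholder MAC address; Unraid manages an equivalent mapping through its network settings UI rather than this exact file.

    # Hypothetical sketch: pin the 10G card's port 0 to the name eth0.
    # Replace the placeholder MAC with the card's real port 0 address.
    printf '%s\n' \
      'SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="aa:bb:cc:dd:ee:ff", NAME="eth0"' \
      > /etc/udev/rules.d/70-persistent-net.rules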