Synd

Members
  • Posts: 14

  1. Sorry, I can't share the diagnostics @JorgeB (there's Engineering Sample equipment on my Unraid server; I can't do that due to NDAs).

     Model name: AMD EPYC 7B13 64-Core Processor
     root@Megumin:~# grep -c ^processor /proc/cpuinfo
     128

     The big difference from my look at both of them is the multiple GPUs vs my single one (which is a converged accelerator, which is the ES part).

     root@Megumin:~# du -sh /run/udev/* | sort -h | tail -3
     0    /run/udev/tags
     0    /run/udev/watch
     17M  /run/udev/data

     (this is my udev size; the same checks are repeated as a sketch after this post list)

     I know there were changes in the kernel around udev for multiple GPUs, due to issues bringing them up "fast" enough on AI clusters. I just don't remember which 6.1 kernel version got it, but it's one of the releases between 6.12.6 and 6.12.8; I'm still looking at the patches to see where it was applied, as we upgraded to 6.6 at work to have the patch before it was in 6.1.

     So one thing I'd suggest is making the udev allocation 64MB for 6.13, which would help with weird and complex systems. I know Rome vs Milan also handle udev differently, as Milan was an updated architecture on the CPU side.

     Mobo: ASRock Rack ROMED8-2T
     CPU: Epyc 7B13
     RAM: 1TB
     PCIe expansions:
       1 x LSI 9400-8i (x8)
       3 x ASUS Hyper M.2 cards filled with M.2 drives
       1 x Nvidia A100X ES
       1 x ConnectX-5 Pro
       2 x OCuLink to U.2
       2 x M.2 (on-board)

     Those are all my specs, as I posted yesterday on Discord, to confirm where the issue could be.
  2. Done, and I found the regression in the kernel itself by going through our gits at work plus the ticketing system. Dummy waits were applied to AMD in kernel 5.19 and were patched out for 6.0 (a quick version-window check is sketched after this post list). I shared a lot more details with staff in a private Discord channel, since it meant bringing internal work data into the conversation.
  3. Which is what we found on Discord: Intel is not affected. @Fuggin and @Kilrah ran parity checks too, but only the AMD builds are slowing down. @AgentXXL, @Pri and I are running Epyc, and I know @The_Mountain also has the issue on Epyc. After testing with multiple setups in the #offtopic channel of the Discord, this is a specifically AMD bug.
  4. To add onto this: after more discussion, we found that it's the people with Epyc systems who have had slower parity checks since 6.11, exactly as this bug report describes. The Xeon users are generally fine; the Discord mods ran tests together with some of the most active Discord users.
  5. Hi,

     Since upgrading to 6.11.5, each time I run a parity check, write to the array, or use the mover, parity runs at 60-70MB/s, while before the upgrade it was at 170-180MB/s. The system is all connected over SAS3 equipment, but we've seen the same issue on Discord with others running SAS2 or direct-connect. I included my diagnostics, and others will do so as well, as we agreed on Discord, to give as many data points as possible (a per-disk read benchmark to rule out the drives themselves is sketched after this post list).

     I don't spin down drives or anything. I run reconstruct write all the time, but it's almost like the setting doesn't work at all anymore.

     This is a bug that impacts AMD only. There was a regression in kernel 5.19 around dummy waits that impacted AMD CPUs (https://www.phoronix.com/news/Linux-6.0-AMD-Chipset-WA).

     Thanks.

     megumin-diagnostics-20221221-1116.zip
  6. After updating to RC4, the eth1 NIC completely disappeared and only eth3 is left; some reboots have both NICs working together. I reduced the priority to "annoyance" and added a new set of diagnostics to help compare rc3o, rc3 and rc4, in case there's something in them that could explain it for the future. Thanks. megumin-diagnostics-20220319-2132.zip
  7. Hi,

     I was doing tests with the rc3n/o versions and it did this as well. My NIC gets renamed from eth1 to eth3, or the opposite, each time I reboot the system. I changed the assignment from eth4/5 to eth0/1 during the RC2 cycle, and once I moved to the test versions (n or o) or RC3, it started doing this on every reboot. I also see the NICs duplicated in my network_rules.cfg on some reboots (the rule format is sketched after this post list).

     I uploaded diagnostics from two different reboots to show the issues I'm hitting as well. My config runs the Mellanox NICs in LACP, as that's the only bonding my switches allow.

     Thanks,
     Synd.

     megumin-diagnostics-20220310-1650.zip megumin-diagnostics-20220310-1321.zip
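
Sketch for post 1: the same inspection can be repeated on another box with the commands below. This is my own summary of the checks already quoted in the post, assuming (as the post implies) that /run/udev is a size-limited tmpfs on Unraid; if it is not a separate mount, df will simply report the filesystem that contains /run instead.

    # Logical CPU count, as quoted in the post
    grep -c ^processor /proc/cpuinfo

    # Space currently used by the udev runtime database
    du -sh /run/udev/data

    # Size and usage of the tmpfs backing it
    df -h /run/udev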
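
Sketch for post 2: a quick way to see whether a given machine is running a kernel inside the affected window (5.19 or newer, but older than the 6.0 fix). This is my own check, not something from the original posts.

    # Sort the running kernel version against the window boundaries; if it
    # lands between 5.19 and 6.0, the dummy-wait behaviour is still present.
    KVER=$(uname -r | cut -d- -f1)
    printf '%s\n' 5.19 "$KVER" 6.0 | sort -V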
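
Sketch for post 5: a simple way to separate a kernel regression from a slow drive or link is to benchmark raw sequential reads on each array device while the array is otherwise idle. This is a generic check of my own, not taken from the diagnostics; adjust the device glob to match your system.

    # Rough sequential-read benchmark per drive (run as root on an idle array)
    for d in /dev/sd?; do
        echo "== $d =="
        hdparm -t "$d"
    done
    # If every drive individually reads well above the 60-70MB/s parity speed,
    # the bottleneck is unlikely to be the disks or the SAS links.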
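
Sketch for post 7: as far as I know, network_rules.cfg follows the standard udev persistent-net rule format, one line per interface keyed on the MAC address, which is what pins eth names across reboots. The MAC addresses below are placeholders, not values from the diagnostics.

    # Pin each Mellanox port to a fixed name by MAC address
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="aa:bb:cc:dd:ee:f0", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="aa:bb:cc:dd:ee:f1", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
    # Duplicate lines for the same MAC, as described in the post, would make the
    # rename non-deterministic, so any doubled entries should be removed.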