Mellanox ConnectX-6 Support?

Amane · April 13

Hi everyone

Does anyone have ideas to fix that or experience with ConectX-6?
Card: Mellanox ConnectX-6 Lx (MCX631102AN-ADAT)

Apr 13 19:09:11 unraid unraid-api[10591]: ✔️ UNRAID API started successfully!
Apr 13 19:09:23 unraid SysDrivers: SysDrivers Build Complete
Apr 13 19:09:44 unraid kernel: mlx5_core 0000:61:00.0: poll_health:824:(pid 0): Fatal error 1 detected
Apr 13 19:09:44 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:423:(pid 0): PCI slot is unavailable
Apr 13 19:09:46 unraid kernel: mlx5_core 0000:61:00.0 eth2: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:46 unraid kernel: mlx5_core 0000:61:00.0 eth2: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:46 unraid kernel: mlx5_core 0000:61:00.1: poll_health:824:(pid 0): Fatal error 1 detected
Apr 13 19:09:46 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:423:(pid 0): PCI slot is unavailable
Apr 13 19:09:46 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:46 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:49 unraid kernel: mlx5_core 0000:61:00.0: mlx5_crdump_collect:50:(pid 1083): crdump: failed to lock vsc gw err -16
Apr 13 19:09:49 unraid kernel: mlx5_core 0000:61:00.0: mlx5_health_try_recover:335:(pid 1083): handling bad device here
Apr 13 19:09:49 unraid kernel: mlx5_core 0000:61:00.0: mlx5_error_sw_reset:243:(pid 1083): start
Apr 13 19:09:49 unraid kernel: mlx5_core 0000:61:00.0 eth2: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:49 unraid kernel: mlx5_core 0000:61:00.0 eth2: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:49 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:49 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:50 unraid kernel: mlx5_core 0000:61:00.1: mlx5_crdump_collect:50:(pid 1206): crdump: failed to lock vsc gw err -16
Apr 13 19:09:50 unraid kernel: mlx5_core 0000:61:00.1: mlx5_health_try_recover:335:(pid 1206): handling bad device here
Apr 13 19:09:50 unraid kernel: mlx5_core 0000:61:00.1: mlx5_error_sw_reset:243:(pid 1206): start
Apr 13 19:09:51 unraid kernel: mlx5_core 0000:61:00.0: NIC IFC still 7 after 2000ms.
Apr 13 19:09:51 unraid kernel: mlx5_core 0000:61:00.0: mlx5_error_sw_reset:276:(pid 1083): end
Apr 13 19:09:51 unraid kernel: bond0: (slave eth2): Releasing backup interface
Apr 13 19:09:51 unraid kernel: mlx5_core 0000:61:00.0: mlx5e_execute_l2_action:598:(pid 1089): MPFS, failed to add mac a0:88:c2:ae:3f:50, err(-67)
Apr 13 19:09:51 unraid kernel: mlx5_core 0000:61:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
Apr 13 19:09:51 unraid kernel: mlx5_core 0000:61:00.0: mlx5_wait_for_pages:777:(pid 1083): Skipping wait for vf pages stage
Apr 13 19:09:52 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:52 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:52 unraid kernel: mlx5_core 0000:61:00.1: NIC IFC still 7 after 2000ms.
Apr 13 19:09:52 unraid kernel: mlx5_core 0000:61:00.1: mlx5_error_sw_reset:276:(pid 1206): end
Apr 13 19:09:52 unraid kernel: bond0: (slave eth3): Releasing backup interface
Apr 13 19:09:52 unraid kernel: mlx5_core 0000:61:00.1: mlx5e_execute_l2_action:598:(pid 1089): MPFS, failed to add mac a0:88:c2:ae:3f:51, err(-67)
Apr 13 19:09:52 unraid kernel: mlx5_core 0000:61:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
Apr 13 19:09:52 unraid kernel: mlx5_core 0000:61:00.1: mlx5_wait_for_pages:777:(pid 1206): Skipping wait for vf pages stage
Apr 13 19:10:53 unraid kernel: mlx5_core 0000:61:00.0: mlx5_health_try_recover:338:(pid 1083): health recovery flow aborted, PCI reads still not working
Apr 13 19:10:55 unraid kernel: mlx5_core 0000:61:00.1: mlx5_health_try_recover:338:(pid 1206): health recovery flow aborted, PCI reads still not working

I will be a link before Unraid is launched and then the link goes down... after that I have this state:

Edited April 14 by Amane

Amane · April 13

Hi everyone

Does anyone have ideas to fix that or experience with ConectX-6?

Apr 13 19:09:11 unraid unraid-api[10591]: ✔️ UNRAID API started successfully!
Apr 13 19:09:23 unraid SysDrivers: SysDrivers Build Complete
Apr 13 19:09:44 unraid kernel: mlx5_core 0000:61:00.0: poll_health:824:(pid 0): Fatal error 1 detected
Apr 13 19:09:44 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:423:(pid 0): PCI slot is unavailable
Apr 13 19:09:46 unraid kernel: mlx5_core 0000:61:00.0 eth2: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:46 unraid kernel: mlx5_core 0000:61:00.0 eth2: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:46 unraid kernel: mlx5_core 0000:61:00.1: poll_health:824:(pid 0): Fatal error 1 detected
Apr 13 19:09:46 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:423:(pid 0): PCI slot is unavailable
Apr 13 19:09:46 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:46 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:49 unraid kernel: mlx5_core 0000:61:00.0: mlx5_crdump_collect:50:(pid 1083): crdump: failed to lock vsc gw err -16
Apr 13 19:09:49 unraid kernel: mlx5_core 0000:61:00.0: mlx5_health_try_recover:335:(pid 1083): handling bad device here
Apr 13 19:09:49 unraid kernel: mlx5_core 0000:61:00.0: mlx5_error_sw_reset:243:(pid 1083): start
Apr 13 19:09:49 unraid kernel: mlx5_core 0000:61:00.0 eth2: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:49 unraid kernel: mlx5_core 0000:61:00.0 eth2: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:49 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:49 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:50 unraid kernel: mlx5_core 0000:61:00.1: mlx5_crdump_collect:50:(pid 1206): crdump: failed to lock vsc gw err -16
Apr 13 19:09:50 unraid kernel: mlx5_core 0000:61:00.1: mlx5_health_try_recover:335:(pid 1206): handling bad device here
Apr 13 19:09:50 unraid kernel: mlx5_core 0000:61:00.1: mlx5_error_sw_reset:243:(pid 1206): start
Apr 13 19:09:51 unraid kernel: mlx5_core 0000:61:00.0: NIC IFC still 7 after 2000ms.
Apr 13 19:09:51 unraid kernel: mlx5_core 0000:61:00.0: mlx5_error_sw_reset:276:(pid 1083): end
Apr 13 19:09:51 unraid kernel: bond0: (slave eth2): Releasing backup interface
Apr 13 19:09:51 unraid kernel: mlx5_core 0000:61:00.0: mlx5e_execute_l2_action:598:(pid 1089): MPFS, failed to add mac a0:88:c2:ae:3f:50, err(-67)
Apr 13 19:09:51 unraid kernel: mlx5_core 0000:61:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
Apr 13 19:09:51 unraid kernel: mlx5_core 0000:61:00.0: mlx5_wait_for_pages:777:(pid 1083): Skipping wait for vf pages stage
Apr 13 19:09:52 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:52 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 13 19:09:52 unraid kernel: mlx5_core 0000:61:00.1: NIC IFC still 7 after 2000ms.
Apr 13 19:09:52 unraid kernel: mlx5_core 0000:61:00.1: mlx5_error_sw_reset:276:(pid 1206): end
Apr 13 19:09:52 unraid kernel: bond0: (slave eth3): Releasing backup interface
Apr 13 19:09:52 unraid kernel: mlx5_core 0000:61:00.1: mlx5e_execute_l2_action:598:(pid 1089): MPFS, failed to add mac a0:88:c2:ae:3f:51, err(-67)
Apr 13 19:09:52 unraid kernel: mlx5_core 0000:61:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
Apr 13 19:09:52 unraid kernel: mlx5_core 0000:61:00.1: mlx5_wait_for_pages:777:(pid 1206): Skipping wait for vf pages stage
Apr 13 19:10:53 unraid kernel: mlx5_core 0000:61:00.0: mlx5_health_try_recover:338:(pid 1083): health recovery flow aborted, PCI reads still not working
Apr 13 19:10:55 unraid kernel: mlx5_core 0000:61:00.1: mlx5_health_try_recover:338:(pid 1206): health recovery flow aborted, PCI reads still not working

The link was up before Unraid starts, and then the link goes down... after that I have this state:

Edited April 13 by Amane

bmartino1 · April 13

Curious as this is more network driver module issues.

?Docker ipvlan/macvlan tap issues?
Settings > Docker > Advance togle make sure "Docker custom network type" ipvlan...

But what's the output of "ip a" and "lspci -v"

Looking for the kernel driver in use. and if the melnox card is getting a ip and network traffic.

Unraid Devs: kmod-mlx5-core module may need to be looked at with kernel changes.

Edited April 13 by bmartino1

bmartino1 · April 13

@Amane

a NVIDIA EN Driver for Linux dirver is required.

Seeing as this may be a Nvidia driver issue. Custom plugin driver may be required to use the hardware.

https://docs.nvidia.com/networking/display/connectx6dxen/linux+driver+installation

Unraid doesn't support 3rd party run driver install instances, and at reboot they would be gone.

ich777 · April 13

2 hours ago, Amane said:

Apr 13 19:10:53 unraid kernel: mlx5_core 0000:61:00.0: mlx5_health_try_recover:338:(pid 1083): health recovery flow aborted, PCI reads still not working
Apr 13 19:10:55 unraid kernel: mlx5_core 0000:61:00.1: mlx5_health_try_recover:338:(pid 1206): health recovery flow aborted, PCI reads still not working

Seems like you have issues with your PCI bus and the card won't properly reset of the above errors.

Please make sure that you have Resizable BAR enabled in your BIOS.

Do you have AER on? If yes, are there some AER messages in the syslog too?

Maybe it would be better to reach out in the Nvidia Developer forums (at least that's what I do when they broke a driver).

Amane · April 13

2 hours ago, bmartino1 said:

a NVIDIA EN Driver for Linux dirver is required.

Ok i try this tomorow, if necessary i will make a script for the go file. thanks for the help

Edited April 13 by Amane

Amane · April 14

16 hours ago, ich777 said:

Please make sure that you have Resizable BAR enabled in your BIOS.

OK, it was deactivated, now it is activated (automatically). But it doesn't help - thanks anyway

16 hours ago, ich777 said:

Do you have AER on?

I can't find the option..

It may really be that I still have to set something in the BIOS.. (before I test another driver or a workaround)

Here are my comlete syslog: syslog.txt

Reduced to the most important parts:

Spoiler

Apr 14 13:34:14 unraid kernel: Linux version 6.1.79-Unraid (root@Develop-612) (gcc (GCC) 12.2.0, GNU ld version 2.40-slack151) #1 SMP PREEMPT_DYNAMIC Fri Mar 29 13:34:03 PDT 2024
Apr 14 13:34:14 unraid kernel: pci 0000:61:00.0: [15b3:101f] type 00 class 0x020000
Apr 14 13:34:14 unraid kernel: pci 0000:61:00.0: reg 0x10: [mem 0x10064000000-0x10065ffffff 64bit pref]
Apr 14 13:34:14 unraid kernel: pci 0000:61:00.0: reg 0x30: [mem 0xf3500000-0xf35fffff pref]
Apr 14 13:34:14 unraid kernel: pci 0000:61:00.0: PME# supported from D3cold
Apr 14 13:34:14 unraid kernel: pci 0000:61:00.0: reg 0x1a4: [mem 0x10066800000-0x100668fffff 64bit pref]
Apr 14 13:34:14 unraid kernel: pci 0000:61:00.0: VF(n) BAR0 space: [mem 0x10066800000-0x10066ffffff 64bit pref] (contains BAR0 for 8 VFs)
Apr 14 13:34:14 unraid kernel: pci 0000:61:00.1: [15b3:101f] type 00 class 0x020000
Apr 14 13:34:14 unraid kernel: pci 0000:61:00.1: reg 0x10: [mem 0x10062000000-0x10063ffffff 64bit pref]
Apr 14 13:34:14 unraid kernel: pci 0000:61:00.1: reg 0x30: [mem 0xf3400000-0xf34fffff pref]
Apr 14 13:34:14 unraid kernel: pci 0000:61:00.1: PME# supported from D3cold
Apr 14 13:34:14 unraid kernel: pci 0000:61:00.1: reg 0x1a4: [mem 0x10066000000-0x100660fffff 64bit pref]
Apr 14 13:34:14 unraid kernel: pci 0000:61:00.1: VF(n) BAR0 space: [mem 0x10066000000-0x100667fffff 64bit pref] (contains BAR0 for 8 VFs)
Apr 14 13:34:14 unraid kernel: pci_bus 0000:61: resource 1 [mem 0xf3400000-0xf35fffff]
Apr 14 13:34:14 unraid kernel: pci_bus 0000:61: resource 2 [mem 0x10062000000-0x10066ffffff 64bit pref]
Apr 14 13:34:14 unraid kernel: pci 0000:61:00.0: Adding to iommu group 10
Apr 14 13:34:14 unraid kernel: pci 0000:61:00.1: Adding to iommu group 11
Apr 14 13:34:14 unraid kernel: ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver
Apr 14 13:34:14 unraid kernel: ixgbe: Copyright (c) 1999-2016 Intel Corporation.
Apr 14 13:34:14 unraid kernel: ixgbe 0000:24:00.0: Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63 XDP Queue count = 0
Apr 14 13:34:14 unraid kernel: ixgbe 0000:24:00.0: 31.504 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x4 link)
Apr 14 13:34:14 unraid kernel: ixgbe 0000:24:00.0: MAC: 4, PHY: 0, PBA No: 000000-000
Apr 14 13:34:14 unraid kernel: ixgbe 0000:24:00.0: 58:11:22:d7:21:36
Apr 14 13:34:14 unraid kernel: ixgbe 0000:24:00.0: Intel(R) 10 Gigabit Network Connection
Apr 14 13:34:14 unraid kernel: ixgbe 0000:24:00.1: Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63 XDP Queue count = 0
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.0: firmware version: 26.36.1010
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.0: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
Apr 14 13:34:14 unraid kernel: ixgbe 0000:24:00.1: MAC: 4, PHY: 0, PBA No: 000000-000
Apr 14 13:34:14 unraid kernel: ixgbe 0000:24:00.1: 58:11:22:d7:21:37
Apr 14 13:34:14 unraid kernel: ixgbe 0000:24:00.1: Intel(R) 10 Gigabit Network Connection
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.0: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.0: Port module event: module 0, Cable plugged
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.0: mlx5_pcie_event:301:(pid 1136): PCIe slot advertised sufficient power (75W).
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.1: firmware version: 26.36.1010
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.1: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.1: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.1: Port module event: module 1, Cable unplugged
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.1: mlx5_pcie_event:301:(pid 790): PCIe slot advertised sufficient power (75W).
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.1: Supported tc offload range - chains: 4294967294, prios: 4294967295
Apr 14 13:34:14 unraid kernel: ixgbe 0000:24:00.1: complete
Apr 14 13:34:14 unraid kernel: ixgbe 0000:24:00.0: complete
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.1: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), active vports(0)
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
Apr 14 13:34:14 unraid kernel: mlx5_core 0000:61:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
Apr 14 13:34:15 unraid kernel: mlx5_core 0000:61:00.1: E-Switch: cleanup
Apr 14 13:34:16 unraid kernel: mlx5_core 0000:61:00.0: E-Switch: Unload vfs: mode(LEGACY), nvfs(0), active vports(0)
Apr 14 13:34:16 unraid kernel: mlx5_core 0000:61:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
Apr 14 13:34:16 unraid kernel: mlx5_core 0000:61:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
Apr 14 13:34:16 unraid kernel: mlx5_core 0000:61:00.0: E-Switch: cleanup
Apr 14 13:34:17 unraid kernel: ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver
Apr 14 13:34:17 unraid kernel: ixgbe: Copyright (c) 1999-2016 Intel Corporation.
Apr 14 13:34:18 unraid kernel: ixgbe 0000:24:00.0: Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63 XDP Queue count = 0
Apr 14 13:34:18 unraid kernel: ixgbe 0000:24:00.0: 31.504 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x4 link)
Apr 14 13:34:18 unraid kernel: ixgbe 0000:24:00.0: MAC: 4, PHY: 0, PBA No: 000000-000
Apr 14 13:34:18 unraid kernel: ixgbe 0000:24:00.0: 58:11:22:d7:21:36
Apr 14 13:34:18 unraid kernel: ixgbe 0000:24:00.0: Intel(R) 10 Gigabit Network Connection
Apr 14 13:34:19 unraid kernel: ixgbe 0000:24:00.1: Multiqueue Enabled: Rx Queue count = 63, Tx Queue count = 63 XDP Queue count = 0
Apr 14 13:34:19 unraid kernel: ixgbe 0000:24:00.1: 31.504 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x4 link)
Apr 14 13:34:19 unraid kernel: ixgbe 0000:24:00.1: MAC: 4, PHY: 0, PBA No: 000000-000
Apr 14 13:34:19 unraid kernel: ixgbe 0000:24:00.1: 58:11:22:d7:21:37
Apr 14 13:34:20 unraid kernel: ixgbe 0000:24:00.1: Intel(R) 10 Gigabit Network Connection
Apr 14 13:34:20 unraid kernel: mlx5_core 0000:61:00.0: firmware version: 26.36.1010
Apr 14 13:34:20 unraid kernel: mlx5_core 0000:61:00.0: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
Apr 14 13:34:20 unraid kernel: mlx5_core 0000:61:00.0: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
Apr 14 13:34:20 unraid kernel: mlx5_core 0000:61:00.0: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)
Apr 14 13:34:20 unraid kernel: mlx5_core 0000:61:00.0: Port module event: module 0, Cable plugged
Apr 14 13:34:20 unraid kernel: mlx5_core 0000:61:00.0: mlx5_pcie_event:301:(pid 790): PCIe slot advertised sufficient power (75W).
Apr 14 13:34:20 unraid kernel: mlx5_core 0000:61:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
Apr 14 13:34:20 unraid kernel: mlx5_core 0000:61:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295
Apr 14 13:34:20 unraid kernel: mlx5_core 0000:61:00.1: firmware version: 26.36.1010
Apr 14 13:34:20 unraid kernel: mlx5_core 0000:61:00.1: 126.024 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x8 link)
Apr 14 13:34:21 unraid kernel: mlx5_core 0000:61:00.1: Rate limit: 127 rates are supported, range: 0Mbps to 24414Mbps
Apr 14 13:34:21 unraid kernel: mlx5_core 0000:61:00.1: E-Switch: Total vports 10, per vport: max uc(128) max mc(2048)
Apr 14 13:34:21 unraid kernel: mlx5_core 0000:61:00.1: Port module event: module 1, Cable unplugged
Apr 14 13:34:21 unraid kernel: mlx5_core 0000:61:00.1: mlx5_pcie_event:301:(pid 1085): PCIe slot advertised sufficient power (75W).
Apr 14 13:34:21 unraid kernel: mlx5_core 0000:61:00.1: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0)
Apr 14 13:34:21 unraid kernel: mlx5_core 0000:61:00.1: Supported tc offload range - chains: 4294967294, prios: 4294967295
Apr 14 13:34:22 unraid rc.inet1: ip -4 addr add 127.0.0.1/8 dev lo
Apr 14 13:34:22 unraid rc.inet1: ip -6 addr add ::1/128 dev lo
Apr 14 13:34:22 unraid rc.inet1: ip link set lo up
Apr 14 13:34:22 unraid kernel: MII link monitoring set to 100 ms
Apr 14 13:34:22 unraid rc.inet1: ip link add name bond0 type bond mode 1 miimon 100
Apr 14 13:34:22 unraid rc.inet1: ip link set eth0 up
Apr 14 13:34:22 unraid kernel: pps pps0: new PPS source ptp2
Apr 14 13:34:22 unraid kernel: ixgbe 0000:24:00.0: registered PHC device on eth0
Apr 14 13:34:22 unraid rc.inet1: ip link set eth0 master bond0 down type bond_slave
Apr 14 13:34:22 unraid kernel: ixgbe 0000:24:00.0: removed PHC on eth0
Apr 14 13:34:23 unraid kernel: pps pps0: new PPS source ptp2
Apr 14 13:34:23 unraid kernel: ixgbe 0000:24:00.0: registered PHC device on eth0
Apr 14 13:34:23 unraid kernel: bond0: (slave eth0): Enslaving as a backup interface with a down link
Apr 14 13:34:23 unraid rc.inet1: ip link set eth1 up
Apr 14 13:34:23 unraid kernel: pps pps1: new PPS source ptp3
Apr 14 13:34:23 unraid kernel: ixgbe 0000:24:00.1: registered PHC device on eth1
Apr 14 13:34:23 unraid rc.inet1: ip link set eth1 master bond0 down type bond_slave
Apr 14 13:34:23 unraid kernel: ixgbe 0000:24:00.1: removed PHC on eth1
Apr 14 13:34:24 unraid kernel: pps pps1: new PPS source ptp3
Apr 14 13:34:24 unraid kernel: ixgbe 0000:24:00.1: registered PHC device on eth1
Apr 14 13:34:24 unraid kernel: bond0: (slave eth1): Enslaving as a backup interface with a down link
Apr 14 13:34:24 unraid rc.inet1: ip link set eth2 up
Apr 14 13:34:25 unraid kernel: mlx5_core 0000:61:00.0 eth2: Link up
Apr 14 13:34:25 unraid kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready
Apr 14 13:34:25 unraid rc.inet1: ip link set eth2 master bond0 down type bond_slave
Apr 14 13:34:26 unraid kernel: mlx5_core 0000:61:00.0 eth2: Link up
Apr 14 13:34:26 unraid kernel: bond0: (slave eth2): making interface the new active one
Apr 14 13:34:26 unraid kernel: bond0: (slave eth2): Enslaving as an active interface with an up link
Apr 14 13:34:26 unraid rc.inet1: ip link set eth3 up
Apr 14 13:34:27 unraid kernel: mlx5_core 0000:61:00.1 eth3: Link down
Apr 14 13:34:27 unraid rc.inet1: ip link set eth3 master bond0 down type bond_slave
Apr 14 13:34:27 unraid kernel: ixgbe 0000:24:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: None
Apr 14 13:34:28 unraid kernel: mlx5_core 0000:61:00.1 eth3: Link down
Apr 14 13:34:28 unraid kernel: bond0: (slave eth3): Enslaving as a backup interface with a down link
Apr 14 13:34:28 unraid rc.inet1: ip link set name bond0 type bond primary eth0
Apr 14 13:34:28 unraid rc.inet1: ip link add name br0 type bridge stp_state 0 forward_delay 0
Apr 14 13:34:28 unraid kernel: bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
Apr 14 13:34:28 unraid rc.inet1: ip link set bond0 down
Apr 14 13:34:28 unraid rc.inet1: ip -4 addr flush dev bond0
Apr 14 13:34:28 unraid rc.inet1: ip link set bond0 master br0 up
Apr 14 13:34:28 unraid kernel: br0: port 1(bond0) entered blocking state
Apr 14 13:34:28 unraid kernel: br0: port 1(bond0) entered disabled state
Apr 14 13:34:28 unraid kernel: device bond0 entered promiscuous mode
Apr 14 13:34:28 unraid kernel: device eth2 entered promiscuous mode
Apr 14 13:34:28 unraid kernel: bond0: (slave eth0): link status definitely up, 10000 Mbps full duplex
Apr 14 13:34:28 unraid kernel: bond0: (slave eth0): making interface the new active one
Apr 14 13:34:28 unraid kernel: device eth2 left promiscuous mode
Apr 14 13:34:28 unraid kernel: device eth0 entered promiscuous mode
Apr 14 13:34:28 unraid kernel: mlx5_core 0000:61:00.0: mlx5e_fs_set_rx_mode_work:837:(pid 790): S-tagged traffic will be dropped while C-tag vlan stripping is enabled
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: poll_health:840:(pid 0): device's health compromised - reached miss count
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:429:(pid 0): Health issue observed, High temperature, severity(2) CRITICAL:
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:433:(pid 0): assert_var[0] 0x00000073
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:433:(pid 0): assert_var[1] 0x00000073
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:433:(pid 0): assert_var[2] 0x00000000
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:433:(pid 0): assert_var[3] 0x00000000
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:433:(pid 0): assert_var[4] 0x00000000
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:433:(pid 0): assert_var[5] 0x00000000
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:436:(pid 0): assert_exit_ptr 0x20b561d8
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:437:(pid 0): assert_callra 0x20b56238
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:438:(pid 0): fw_ver 26.36.1010
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:440:(pid 0): time 1713094456
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:441:(pid 0): hw_id 0x00000216
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:442:(pid 0): rfr 0
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:443:(pid 0): severity 2 (CRITICAL)
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:444:(pid 0): irisc_index 0
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:445:(pid 0): synd 0x10: High temperature
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:447:(pid 0): ext_synd 0x0000
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:448:(pid 0): raw fw_ver 0x1a2403f2
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: poll_health:840:(pid 0): device's health compromised - reached miss count
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:429:(pid 0): Health issue observed, High temperature, severity(2) CRITICAL:
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:433:(pid 0): assert_var[0] 0x00000073
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:433:(pid 0): assert_var[1] 0x00000073
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:433:(pid 0): assert_var[2] 0x00000000
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:433:(pid 0): assert_var[3] 0x00000000
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:433:(pid 0): assert_var[4] 0x00000000
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:433:(pid 0): assert_var[5] 0x00000000
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:436:(pid 0): assert_exit_ptr 0x20b561d8
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:437:(pid 0): assert_callra 0x20b56238
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:438:(pid 0): fw_ver 26.36.1010
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:440:(pid 0): time 1713094456
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:441:(pid 0): hw_id 0x00000216
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:442:(pid 0): rfr 0
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:443:(pid 0): severity 2 (CRITICAL)
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:444:(pid 0): irisc_index 0
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:445:(pid 0): synd 0x10: High temperature
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:447:(pid 0): ext_synd 0x0000
Apr 14 13:34:30 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:448:(pid 0): raw fw_ver 0x1a2403f2
Apr 14 13:35:14 unraid kernel: mlx5_core 0000:61:00.0: poll_health:824:(pid 0): Fatal error 1 detected
Apr 14 13:35:14 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:423:(pid 0): PCI slot is unavailable
Apr 14 13:35:14 unraid kernel: mlx5_core 0000:61:00.0 eth2: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 14 13:35:14 unraid kernel: mlx5_core 0000:61:00.0 eth2: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 14 13:35:15 unraid SysDrivers: SysDrivers Build Starting
Apr 14 13:35:17 unraid kernel: mlx5_core 0000:61:00.1: poll_health:824:(pid 0): Fatal error 1 detected
Apr 14 13:35:17 unraid kernel: mlx5_core 0000:61:00.1: print_health_info:423:(pid 0): PCI slot is unavailable
Apr 14 13:35:17 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 14 13:35:17 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 14 13:35:17 unraid kernel: mlx5_core 0000:61:00.0 eth2: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 14 13:35:17 unraid kernel: mlx5_core 0000:61:00.0 eth2: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 14 13:35:17 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 14 13:35:17 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 14 13:35:19 unraid kernel: mlx5_core 0000:61:00.0: mlx5_crdump_collect:50:(pid 1085): crdump: failed to lock vsc gw err -16
Apr 14 13:35:19 unraid kernel: mlx5_core 0000:61:00.0: mlx5_health_try_recover:335:(pid 1085): handling bad device here
Apr 14 13:35:19 unraid kernel: mlx5_core 0000:61:00.0: mlx5_error_sw_reset:243:(pid 1085): start
Apr 14 13:35:20 unraid kernel: mlx5_core 0000:61:00.0 eth2: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 14 13:35:20 unraid kernel: mlx5_core 0000:61:00.0 eth2: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 14 13:35:20 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 14 13:35:20 unraid kernel: mlx5_core 0000:61:00.1 eth3: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
Apr 14 13:35:21 unraid kernel: mlx5_core 0000:61:00.0: NIC IFC still 7 after 2000ms.
Apr 14 13:35:21 unraid kernel: mlx5_core 0000:61:00.0: mlx5_error_sw_reset:276:(pid 1085): end
Apr 14 13:35:21 unraid kernel: bond0: (slave eth2): Releasing backup interface
Apr 14 13:35:21 unraid kernel: mlx5_core 0000:61:00.0: mlx5e_execute_l2_action:598:(pid 790): MPFS, failed to add mac a0:88:c2:ae:3f:50, err(-67)
Apr 14 13:35:21 unraid kernel: mlx5_core 0000:61:00.0: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
Apr 14 13:35:21 unraid kernel: mlx5_core 0000:61:00.0: mlx5_wait_for_pages:777:(pid 1085): Skipping wait for vf pages stage
Apr 14 13:35:21 unraid kernel: mlx5_core 0000:61:00.1: mlx5_crdump_collect:50:(pid 1136): crdump: failed to lock vsc gw err -16
Apr 14 13:35:21 unraid kernel: mlx5_core 0000:61:00.1: mlx5_health_try_recover:335:(pid 1136): handling bad device here
Apr 14 13:35:21 unraid kernel: mlx5_core 0000:61:00.1: mlx5_error_sw_reset:243:(pid 1136): start
Apr 14 13:35:23 unraid kernel: mlx5_core 0000:61:00.1: NIC IFC still 7 after 2000ms.
Apr 14 13:35:23 unraid kernel: mlx5_core 0000:61:00.1: mlx5_error_sw_reset:276:(pid 1136): end
Apr 14 13:35:23 unraid kernel: bond0: (slave eth3): Releasing backup interface
Apr 14 13:35:23 unraid kernel: mlx5_core 0000:61:00.1: mlx5e_execute_l2_action:598:(pid 1180): MPFS, failed to add mac a0:88:c2:ae:3f:51, err(-67)
Apr 14 13:35:23 unraid kernel: mlx5_core 0000:61:00.1: E-Switch: Disable: mode(LEGACY), nvfs(0), active vports(0)
Apr 14 13:35:23 unraid kernel: mlx5_core 0000:61:00.1: mlx5_wait_for_pages:777:(pid 1136): Skipping wait for vf pages stage
Apr 14 13:35:38 unraid kernel: mlx5_core 0000:61:00.0: invalid VPD tag 0xff (size 0) at offset 0; assume missing optional EEPROM
Apr 14 13:35:38 unraid kernel: mlx5_core 0000:61:00.1: invalid VPD tag 0xff (size 0) at offset 0; assume missing optional EEPROM
Apr 14 13:36:23 unraid kernel: mlx5_core 0000:61:00.0: mlx5_health_try_recover:338:(pid 1085): health recovery flow aborted, PCI reads still not working
Apr 14 13:36:26 unraid kernel: mlx5_core 0000:61:00.1: mlx5_health_try_recover:338:(pid 1136): health recovery flow aborted, PCI reads still not working

Edited April 14 by Amane

Amane · April 14

mstflint -d 61:00.0 -i fw-ConnectX6Lx-rel-26_40_1000-MCX631102AN-ADA_Ax-UEFI-14.33.10-FlexBoot-3.7.300.bin burn
FATAL - Can't find device id.
-E- Cannot open Device: 61:00.0. No such file or directory. MFE_UNSUPPORTED_DEVICE

@ich777 How do I get the new firmware into the card, do I have to fix my problem first? - I think so..

Edited April 14 by Amane

ich777 · April 14

1 hour ago, Amane said:

do I have to fix my problem first?

Yes:

Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: poll_health:840:(pid 0): device's health compromised - reached miss count
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:429:(pid 0): Health issue observed, High temperature, severity(2) CRITICAL:
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:433:(pid 0): assert_var[0] 0x00000073
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:433:(pid 0): assert_var[1] 0x00000073
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:433:(pid 0): assert_var[2] 0x00000000
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:433:(pid 0): assert_var[3] 0x00000000
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:433:(pid 0): assert_var[4] 0x00000000
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:433:(pid 0): assert_var[5] 0x00000000
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:436:(pid 0): assert_exit_ptr 0x20b561d8
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:437:(pid 0): assert_callra 0x20b56238
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:438:(pid 0): fw_ver 26.36.1010
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:440:(pid 0): time 1713094456
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:441:(pid 0): hw_id 0x00000216
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:442:(pid 0): rfr 0
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:443:(pid 0): severity 2 (CRITICAL)
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:444:(pid 0): irisc_index 0
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:445:(pid 0): synd 0x10: High temperature
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:447:(pid 0): ext_synd 0x0000
Apr 14 13:34:29 unraid kernel: mlx5_core 0000:61:00.0: print_health_info:448:(pid 0): raw fw_ver 0x1a2403f2

(second line)

1 hour ago, Amane said:

How do I get the new firmware into the card

I'm also not too sure if a ConnectX6 Lx is already supported by mstflint

Edited April 14 by ich777

bmartino1 · April 14

Most likely a window/full Linux desktop environment to install the firmware:

https://network.nvidia.com/support/firmware/connectx6dx/

firmware install instructions:

https://network.nvidia.com/support/firmware/nic/

the LX may have been pulled from a DELL

https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=ntf5k

Edited April 14 by bmartino1

ich777 · April 14

10 minutes ago, bmartino1 said:

Most likely a window/full Linux desktop environment to install the firmware:

https://network.nvidia.com/support/firmware/connectx6dx/

Please read the full post…

The card is overheating and therefore disabled.

It should be also possible to flash the firmware on Unraid with the MFT plugin, but first the overheating problem needs to be solved.

bmartino1 · April 14

6 minutes ago, ich777 said:

Please read the full post…

The card is overheating and therefore disabled.

It should be also possible to flash the firmware on Unraid with the MFT plugin, but first the overheating problem needs to be solved.

~~Unless you have a diagnostic or other PM Log there is no mention of overheating except with your post and not from the client/user who organdy poste~~d.

While I do see in your post a temp critical for the log the user posted log do not show that... Glad the plugin/unraid is able to flash the firmware. Thought that was a question.

I see now, its a hidden spoiler...

Edited April 14 by bmartino1

Amane · April 14

2 hours ago, ich777 said:

but first the overheating problem needs to be solved.

Even if I don't plug anything in and just leave the card dangling in the PCI slot, after a while 100°C..
I have a fan right next to it, but I think I need an jetengine with dry ice.. 😅 ..f*ing server-hardware.

grafik.png.1d0fbc2b3e3fbb07b15f5cde1abadb1c.png

What is the card doing that it gets so hot when I'm not using it?

Edited April 14 by Amane

Amane · April 14

Its definitly a termalproblem:
output.gif.ed20dd4f164dc61a4cf2ed02854359e0.gif

I managed to keep the temperatures from rising above 105°C, and look here:

By the way, I have successfully falsh the newest firmware..

grafik.png.8ef9db55de03a6c5a98e210c819f215f.png

But it doesn't help, why soooo hot... nothing is plugged in..

Edited April 16 by Amane

ich777 · April 15

7 hours ago, Amane said:

But it doesn't help, why soooo hot... nothing is plugged in..

What do you expect? This is server hardware and TBH I really don't understand why you need a 1x 50Gbps adapter (or 2x 25Gbps) adapter for home usage...

In a server you have a significant amount of airflow over theses cards, they even have some times special air direction thingies to ensure enough air flow.

These cards usually don't have ASPM or similar features and are running always at full blast.

You can try a ConnectX3 card which you can keep under control with a little bit of airflow over them (~70°C) or with a little fan attached ~47°C.

7 hours ago, Amane said:

By the way, I have successfully falsh the newest firmware..

With the plugin, or better speaking with the tools that it ships with?

Next time when you have such an hardware issue please always attach Diagnostics for easier troubleshooting.

Amane · April 15

7 minutes ago, ich777 said:

I really don't understand why you need a 1x 50Gbps adapter (or 2x 25Gbps) adapter for home usage..

Hey, I have 25Gbit with my provider (real 15Gbit)..

9 minutes ago, ich777 said:

With the plugin, or better speaking with the tools that it ships with?

With mstflint 👍

Yes, it's server hardware, but with that little aluminum cooler, it seems to work nit even with a direct airflow. (There are other hot PCI devices in the area) - I will probably have to consider converting the component to water cooling..

ich777 · April 15

4 hours ago, Amane said:

Hey, I have 25Gbit with my provider (real 15Gbit)..

And you can reach those speeds in the real world? Sure you'll maybe reach those speeds with Speedtest and such sites but do you really benefit from the full 15Gbit/s with Google services or GitHub? I assume you will be somewhat limited by the remote servers anyways.

I would stick to a ConnectX4 if you really need those kind of speeds or not (of course with active cooling) because those cards run much cooler FWIK and they can also run 25Gbit/s

4 hours ago, Amane said:

Yes, it's server hardware, but with that little aluminum cooler, it seems to work nit even with a direct airflow.

I think you are underestimating how much airflow there is in a server chassis and how cooling works in a server chassis, this is a completely different ballpark compared to a consumer or serverish case...

Amane · April 15

3 hours ago, ich777 said:

And you can reach those speeds in the real world? Sure you'll maybe reach those speeds with Speedtest and such sites but do you really benefit from the full 15Gbit/s with Google services or GitHub? I assume you will be somewhat limited by the remote servers anyways.

I would stick to a ConnectX4 if you really need those kind of speeds or not (of course with active cooling) because those cards run much cooler FWIK and they can also run 25Gbit/s

I actually find it quite cheeky that you claim I can't use it anyway, apart from the fact that it's not true -

I use a downloader where I get almost more than 10Gbit... and I want to use the bandwidth too!

It's nice to have - even if I know that I will rarely even come close to reaching that speed..

I work in the IT industry and know the performance of a server rack.. but show the small cooler, like a raspberry cooler.

I just wanted to give it a try...

grafik.png.81e6d6c863b2d6587daa134410cb5e10.png

JorgeB · April 15

That is indeed kind of small, I have two 25GbE ConnectX-4 like this:

They work fine without special cooling, though for large continuous transfers they need some airflow around them.

ich777 · April 15

1 hour ago, Amane said:

I actually find it quite cheeky that you claim I can't use it anyway, apart from the fact that it's not true

Just to clarify this was a question…

Amane · April 15

2 minutes ago, ich777 said:

Just to clarify this was a question…

ok, sry bad english..

Amane · April 21

@ich777

Hi, I have new findings!

The termalsensor AUXTIN is not output correctly!

On 4/15/2024 at 1:05 AM, Amane said:

It has nothing to do with the Mellanox card, even when I have unplugged it, the thermal sensors are displayed at 105°C and above.. You can find a lot about this on the internet. Actually I should ignore them... but the problem is that my card only works if they stay below 105°C, also the temperature of the Melanox card is probably wrong because of that?!

They are aditional temperature sensors.. I'm trying to deactivate them somehow...

I plugged the card in with a riser cable and touched it with my fingers, it doesn't get that bad warm...

Edited April 21 by Amane

ich777 · April 21

2 hours ago, Amane said:

The termalsensor AUXTIN is not output correctly!

Yes this is a common thing and the case on most systems like on mine but with a negative value:

grafik.png.199f1ae5b38ecf22dbda488065c33f7d.png

2 hours ago, Amane said:

but the problem is that my card only works if they stay below 105°C, also the temperature of the Melanox card is probably wrong because of that?!

Not really, your card actually reports that temperature and I only display it on the plugin page through the command `mget_temp` nothing more.

I don't think that you can deactivate the thermistor on the card because then the card also won't work and report a silly temperature and please keep in mind that the card deactivates itself and that is a thing that the firmware on the card does.

2 hours ago, Amane said:

I plugged the card in with a riser cable and touched it with my fingers, it doesn't get that bad warm...

Maybe the thermistor is defective on your card but AFAIK the thermistor is located right in the DIE.

What could also cause such an issue if the heatsink doesn't make proper contact with the whole chip.

bmartino1 · April 21

it may need new thermal pads/paste to the heat sink. Otherwise, you may need to put a 12 v fan behind the card.

Remember that these cards were put in server running 40mm delta fans and constant active air for cooling.

Mellanox ConnectX-6 Support?

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation