
unRAID 6.5.1 - Bridging Introduces Severe and Inconsistent Network Lag


0perat0r


Hello unRAID.

 

First let me say I am sorry for the wall of text below, but this issue has not been easy to diagnose.

 

I have been having an issue with my unRAID Pro server for as long as I can remember. I have always used VMs on this specific server, which requires the "br0" virtual switch to bridge connections to the VMs. Long story short: if I disable bridging (br0), all of my issues go away; re-enable it, and they all return.


Testing:

A PowerShell script run on the client machines to measure how long it takes to open a shared folder on the unRAID server:

 

# Time a directory listing of the unRAID share every 3 seconds, 150 times
$i = 1
Do {
    Start-Sleep -Seconds 3
    Measure-Command { cmd /c dir \\10.0.10.3\testshare\testfolder }
    $i          # print the iteration number
    $i++
} While ($i -le 150)

 

Output (with br0 disabled):

9
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 277
Ticks             : 2770585
TotalDays         : 3.20669560185185E-06
TotalHours        : 7.69606944444444E-05
TotalMinutes      : 0.00461764166666667
TotalSeconds      : 0.2770585
TotalMilliseconds : 277.0585

10
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 254
Ticks             : 2549549
TotalDays         : 2.95086689814815E-06
TotalHours        : 7.08208055555556E-05
TotalMinutes      : 0.00424924833333333
TotalSeconds      : 0.2549549
TotalMilliseconds : 254.9549

11
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 218
Ticks             : 2185821
TotalDays         : 2.52988541666667E-06
TotalHours        : 6.071725E-05
TotalMinutes      : 0.003643035
TotalSeconds      : 0.2185821
TotalMilliseconds : 218.5821

Output (with br0 enabled):

6
Days              : 0
Hours             : 0
Minutes           : 1
Seconds           : 2
Milliseconds      : 747
Ticks             : 627476154
TotalDays         : 0.000726245548611111
TotalHours        : 0.0174298931666667
TotalMinutes      : 1.04579359
TotalSeconds      : 62.7476154
TotalMilliseconds : 62747.6154

7
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 475
Ticks             : 4752386
TotalDays         : 5.50044675925926E-06
TotalHours        : 0.000132010722222222
TotalMinutes      : 0.00792064333333333
TotalSeconds      : 0.4752386
TotalMilliseconds : 475.2386

8
Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 26
Milliseconds      : 348
Ticks             : 263486462
TotalDays         : 0.00030496118287037
TotalHours        : 0.00731906838888889
TotalMinutes      : 0.439144103333333
TotalSeconds      : 26.3486462
TotalMilliseconds : 26348.6462

You can see that with bridging enabled, accessing folders on the unRAID server can literally take minutes. General browsing performance is abysmal, and very inconsistent: sometimes it works quickly, other times it takes ages. Upon disabling bridging, all browsing lag disappears.

 

Pinging the server works perfectly (only after toggling br0 upon a fresh boot, more on that later):

(ping)

Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64

I have made sure DNS resolution and the network path work through Windows Server 2012:

(tracert)

1  <1ms  <1ms  <1ms HYDRA (10.0.10.3)
Trace Complete

I even added the IP and hostname of the unRAID server to the hosts file on all machines.

 

A few other bits:
-The issue occurs with the new 10Gb X540-T2 NIC, with an older 10Gb X520-DA2, and with the old on-board 1Gb NIC.
-It occurs with 802.3ad (LACP) bonding on OR off.
-It occurs with a standard MTU of 1500 or a jumbo MTU of 9014.
-The VLAN this server sits on is set up entirely at MTU 9000/9014. I have also tested with the whole network at MTU 1500.
-Only one VLAN is exposed to this server, for simplicity.
-The unRAID server is joined to a Windows domain, and files are read/write protected via user security groups.
-That Windows Server 2012 R2 Standard system acts as the DNS and DHCP server on the network.
-Routing is done with pfSense.
-The switch is a brand new Ubiquiti ES-16-XG. The copper ports ARE being used, though the issue also happens over SFP+ with the X520-DA2.
-I see this issue from all machines on the network that connect to the unRAID system, and at the exact same time. Whatever is happening is happening on unRAID's side.
-This same network has two ESXi servers, both of which "bridge" fine. (A sketch of commands for inspecting the bond/bridge state on unRAID follows below.)
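
For anyone who wants to poke at the same things, here is a minimal diagnostic sketch for the unRAID console (assuming stock iproute2, bridge-utils, and ethtool, and the bond0/br0/eth0 interface names that appear in the log below):

# 802.3ad status: aggregator info, partner MAC, per-slave link state
cat /proc/net/bonding/bond0

# Which interfaces are enslaved to the bridge, and their STP state
brctl show br0
bridge link show

# One-line summary of every interface (state, MAC address, flags)
ip -br link

# Per-NIC error/drop counters from the ixgbe driver
ethtool -S eth0 | grep -iE 'error|drop|miss'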

 

Here is the log from enabling br0 (with the X540-T2, 802.3ad LACP, and MTU 9014):

May 12 09:19:32 HYDRA ool www[18903]: /usr/local/emhttp/plugins/dynamix/scripts/netconfig 'eth0'
May 12 09:19:32 HYDRA rc.inet1: ip -4 addr flush dev bond0
May 12 09:19:32 HYDRA avahi-daemon[4170]: Withdrawing address record for 10.0.10.3 on bond0.
May 12 09:19:32 HYDRA avahi-daemon[4170]: Leaving mDNS multicast group on interface bond0.IPv4 with address 10.0.10.3.
May 12 09:19:32 HYDRA avahi-daemon[4170]: Interface bond0.IPv4 no longer relevant for mDNS.
May 12 09:19:32 HYDRA rc.inet1: ip link set bond0 down
May 12 09:19:32 HYDRA avahi-daemon[4170]: Interface bond0.IPv6 no longer relevant for mDNS.
May 12 09:19:32 HYDRA avahi-daemon[4170]: Leaving mDNS multicast group on interface bond0.IPv6 with address fe80::a236:9fff:fee5:34d4.
May 12 09:19:32 HYDRA avahi-daemon[4170]: Withdrawing address record for fe80::a236:9fff:fee5:34d4 on bond0.
May 12 09:19:32 HYDRA rc.inet1: ip link set eth0 nomaster
May 12 09:19:32 HYDRA kernel: bond0: Removing an active aggregator
May 12 09:19:32 HYDRA kernel: bond0: Releasing backup interface eth0
May 12 09:19:32 HYDRA kernel: bond0: the permanent HWaddr of eth0 - a0:36:9f:e5:34:d4 - is still in use by bond0 - set the HWaddr of eth0 to a different address to avoid conflicts
May 12 09:19:32 HYDRA kernel: bond0: first active interface up!
May 12 09:19:32 HYDRA kernel: ixgbe 0000:06:00.0: removed PHC on eth0
May 12 09:19:32 HYDRA kernel: ixgbe 0000:06:00.0 eth0: changing MTU from 9014 to 1500
May 12 09:19:32 HYDRA rc.inet1: ip link set eth1 nomaster
May 12 09:19:32 HYDRA kernel: bond0: Removing an active aggregator
May 12 09:19:32 HYDRA kernel: bond0: Releasing backup interface eth1
May 12 09:19:32 HYDRA kernel: ixgbe 0000:06:00.1: removed PHC on eth1
May 12 09:19:32 HYDRA kernel: ixgbe 0000:06:00.1 eth1: changing MTU from 9014 to 1500
May 12 09:19:32 HYDRA rc.inet1: ip link set bond0 down
May 12 09:19:32 HYDRA rc.inet1: ip link del bond0
May 12 09:19:32 HYDRA kernel: bond0 (unregistering): Released all slaves
May 12 09:19:32 HYDRA rc.inet1: ip link add name bond0 type bond mode 4 miimon 100
May 12 09:19:32 HYDRA rc.inet1: ip link set bond0 up
May 12 09:19:32 HYDRA kernel: IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
May 12 09:19:32 HYDRA kernel: 8021q: adding VLAN 0 to HW filter on device bond0
May 12 09:19:32 HYDRA rc.inet1: ip link set eth0 master bond0 down
May 12 09:19:32 HYDRA kernel: pps pps2: new PPS source ptp2
May 12 09:19:32 HYDRA kernel: ixgbe 0000:06:00.0: registered PHC device on eth0
May 12 09:19:32 HYDRA kernel: 8021q: adding VLAN 0 to HW filter on device eth0
May 12 09:19:32 HYDRA kernel: bond0: Enslaving eth0 as a backup interface with a down link
May 12 09:19:32 HYDRA rc.inet1: ip link set eth1 master bond0 down
May 12 09:19:33 HYDRA kernel: pps pps3: new PPS source ptp3
May 12 09:19:33 HYDRA kernel: ixgbe 0000:06:00.1: registered PHC device on eth1
May 12 09:19:33 HYDRA kernel: 8021q: adding VLAN 0 to HW filter on device eth1
May 12 09:19:33 HYDRA kernel: bond0: Enslaving eth1 as a backup interface with a down link
May 12 09:19:33 HYDRA rc.inet1: ip link add name br0 type bridge stp_state 0 forward_delay 0
May 12 09:19:33 HYDRA rc.inet1: ip link set br0 up
May 12 09:19:33 HYDRA rc.inet1: ip link set bond0 down
May 12 09:19:33 HYDRA rc.inet1: ip link set bond0 mtu 9014
May 12 09:19:33 HYDRA kernel: ixgbe 0000:06:00.0 eth0: changing MTU from 1500 to 9014
May 12 09:19:33 HYDRA kernel: ixgbe 0000:06:00.1 eth1: changing MTU from 1500 to 9014
May 12 09:19:33 HYDRA rc.inet1: ip -4 addr flush dev bond0
May 12 09:19:33 HYDRA rc.inet1: ip link set bond0 promisc on master br0 up
May 12 09:19:33 HYDRA kernel: device bond0 entered promiscuous mode
May 12 09:19:33 HYDRA kernel: device eth0 entered promiscuous mode
May 12 09:19:33 HYDRA kernel: device eth1 entered promiscuous mode
May 12 09:19:33 HYDRA kernel: IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
May 12 09:19:33 HYDRA kernel: 8021q: adding VLAN 0 to HW filter on device bond0
May 12 09:19:33 HYDRA kernel: br0: port 1(bond0) entered blocking state
May 12 09:19:33 HYDRA kernel: br0: port 1(bond0) entered disabled state
May 12 09:19:33 HYDRA rc.inet1: ip link set br0 mtu 9014
May 12 09:19:33 HYDRA rc.inet1: ip -4 addr add 10.0.10.3/255.255.255.0 dev br0
May 12 09:19:33 HYDRA avahi-daemon[4170]: Joining mDNS multicast group on interface br0.IPv4 with address 10.0.10.3.
May 12 09:19:33 HYDRA avahi-daemon[4170]: New relevant interface br0.IPv4 for mDNS.
May 12 09:19:33 HYDRA avahi-daemon[4170]: Registering new address record for 10.0.10.3 on br0.IPv4.
May 12 09:19:33 HYDRA rc.inet1: ip link set br0 up
May 12 09:19:33 HYDRA rc.inet1: ip -4 route add default via 10.0.10.1 dev br0
May 12 09:19:34 HYDRA avahi-daemon[4170]: Joining mDNS multicast group on interface br0.IPv6 with address fe80::cc9b:5eff:fe34:d3c4.
May 12 09:19:34 HYDRA avahi-daemon[4170]: New relevant interface br0.IPv6 for mDNS.
May 12 09:19:34 HYDRA avahi-daemon[4170]: Registering new address record for fe80::cc9b:5eff:fe34:d3c4 on br0.*.
May 12 09:19:36 HYDRA ntpd[2594]: Deleting interface #6 bond0, 10.0.10.3#123, interface stats: received=129, sent=170, dropped=0, active_time=101831 secs
May 12 09:19:36 HYDRA ntpd[2594]: 166.70.136.35 local addr 10.0.10.3 -> <null>
May 12 09:19:38 HYDRA kernel: ixgbe 0000:06:00.0 eth0: NIC Link is Up 10 Gbps, Flow Control: None
May 12 09:19:38 HYDRA kernel: bond0: link status definitely up for interface eth0, 10000 Mbps full duplex
May 12 09:19:38 HYDRA kernel: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
May 12 09:19:38 HYDRA kernel: bond0: first active interface up!
May 12 09:19:38 HYDRA kernel: IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
May 12 09:19:38 HYDRA kernel: br0: port 1(bond0) entered blocking state
May 12 09:19:38 HYDRA kernel: br0: port 1(bond0) entered forwarding state
May 12 09:19:38 HYDRA kernel: ixgbe 0000:06:00.1 eth1: NIC Link is Up 10 Gbps, Flow Control: None
May 12 09:19:38 HYDRA kernel: bond0: link status definitely up for interface eth1, 10000 Mbps full duplex
May 12 09:19:40 HYDRA avahi-daemon[4170]: Joining mDNS multicast group on interface bond0.IPv6 with address fe80::a236:9fff:fee5:34d4.
May 12 09:19:40 HYDRA avahi-daemon[4170]: New relevant interface bond0.IPv6 for mDNS.
May 12 09:19:40 HYDRA avahi-daemon[4170]: Registering new address record for fe80::a236:9fff:fee5:34d4 on bond0.*.
May 12 09:19:41 HYDRA ntpd[2594]: Listen normally on 7 br0 10.0.10.3:123
May 12 09:19:41 HYDRA ntpd[2594]: new interface(s) found: waking up resolver
May 12 09:19:49 HYDRA ool www[19815]: /usr/local/emhttp/plugins/dynamix/scripts/emhttpd_update
May 12 09:19:49 HYDRA emhttpd: req (5): cmdStatus=apply&csrf_token=****************
May 12 09:19:49 HYDRA emhttpd: Starting services...
May 12 09:19:49 HYDRA emhttpd: shcmd (5810): /usr/local/sbin/mount_image '/mnt/user/system/libvirt/libvirt.img' /etc/libvirt 1
May 12 09:19:49 HYDRA kernel: BTRFS info (device loop2): disk space caching is enabled
May 12 09:19:49 HYDRA kernel: BTRFS info (device loop2): has skinny extents
May 12 09:19:49 HYDRA root: Resize '/etc/libvirt' of 'max'
May 12 09:19:49 HYDRA kernel: BTRFS info (device loop2): new size for /dev/loop2 is 1073741824
May 12 09:19:49 HYDRA emhttpd: shcmd (5812): /etc/rc.d/rc.libvirt start
May 12 09:19:49 HYDRA root: Starting virtlockd...
May 12 09:19:49 HYDRA root: Starting virtlogd...
May 12 09:19:49 HYDRA root: Starting libvirtd...
May 12 09:19:49 HYDRA kernel: tun: Universal TUN/TAP device driver, 1.6
May 12 09:19:49 HYDRA kernel: virbr0: port 1(virbr0-nic) entered blocking state
May 12 09:19:49 HYDRA kernel: virbr0: port 1(virbr0-nic) entered disabled state
May 12 09:19:49 HYDRA kernel: device virbr0-nic entered promiscuous mode
May 12 09:19:50 HYDRA kernel: virbr0: port 1(virbr0-nic) entered blocking state
May 12 09:19:50 HYDRA kernel: virbr0: port 1(virbr0-nic) entered listening state
May 12 09:19:50 HYDRA avahi-daemon[4170]: Joining mDNS multicast group on interface virbr0.IPv4 with address 192.168.122.1.
May 12 09:19:50 HYDRA avahi-daemon[4170]: New relevant interface virbr0.IPv4 for mDNS.
May 12 09:19:50 HYDRA avahi-daemon[4170]: Registering new address record for 192.168.122.1 on virbr0.IPv4.
May 12 09:19:50 HYDRA dnsmasq[20823]: started, version 2.79 cachesize 150
May 12 09:19:50 HYDRA dnsmasq[20823]: compile time options: IPv6 GNU-getopt no-DBus i18n IDN2 DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth no-DNSSEC loop-detect inotify
May 12 09:19:50 HYDRA dnsmasq-dhcp[20823]: DHCP, IP range 192.168.122.2 -- 192.168.122.254, lease time 1h
May 12 09:19:50 HYDRA dnsmasq-dhcp[20823]: DHCP, sockets bound exclusively to interface virbr0
May 12 09:19:50 HYDRA dnsmasq[20823]: reading /etc/resolv.conf
May 12 09:19:50 HYDRA dnsmasq[20823]: using nameserver 10.0.10.2#53
May 12 09:19:50 HYDRA dnsmasq[20823]: using nameserver 10.0.10.1#53
May 12 09:19:50 HYDRA dnsmasq[20823]: using nameserver 8.8.8.8#53
May 12 09:19:50 HYDRA dnsmasq[20823]: read /etc/hosts - 2 addresses
May 12 09:19:50 HYDRA dnsmasq[20823]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
May 12 09:19:50 HYDRA dnsmasq-dhcp[20823]: read /var/lib/libvirt/dnsmasq/default.hostsfile
May 12 09:19:50 HYDRA kernel: virbr0: port 1(virbr0-nic) entered disabled state
May 12 09:19:53 HYDRA kernel: vfio-pci 0000:05:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
May 12 09:19:53 HYDRA kernel: br0: port 2(vnet0) entered blocking state
May 12 09:19:53 HYDRA kernel: br0: port 2(vnet0) entered disabled state
May 12 09:19:53 HYDRA kernel: device vnet0 entered promiscuous mode
May 12 09:19:53 HYDRA kernel: br0: port 2(vnet0) entered blocking state
May 12 09:19:53 HYDRA kernel: br0: port 2(vnet0) entered forwarding state
May 12 09:19:55 HYDRA kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
May 12 09:19:55 HYDRA kernel: br0: port 3(vnet1) entered blocking state
May 12 09:19:55 HYDRA kernel: br0: port 3(vnet1) entered disabled state
May 12 09:19:55 HYDRA kernel: device vnet1 entered promiscuous mode
May 12 09:19:55 HYDRA kernel: br0: port 3(vnet1) entered blocking state
May 12 09:19:55 HYDRA kernel: br0: port 3(vnet1) entered forwarding state
May 12 09:19:55 HYDRA avahi-daemon[4170]: Joining mDNS multicast group on interface vnet0.IPv6 with address fe80::fc54:ff:fe5a:4500.
May 12 09:19:55 HYDRA avahi-daemon[4170]: New relevant interface vnet0.IPv6 for mDNS.
May 12 09:19:55 HYDRA avahi-daemon[4170]: Registering new address record for fe80::fc54:ff:fe5a:4500 on vnet0.*.
May 12 09:19:56 HYDRA avahi-daemon[4170]: Joining mDNS multicast group on interface vnet1.IPv6 with address fe80::fc54:ff:fe4c:8dc1.
May 12 09:19:56 HYDRA avahi-daemon[4170]: New relevant interface vnet1.IPv6 for mDNS.
May 12 09:19:56 HYDRA avahi-daemon[4170]: Registering new address record for fe80::fc54:ff:fe4c:8dc1 on vnet1.*.

The "warning" lines:

May 12 09:19:32 HYDRA kernel: bond0: the permanent HWaddr of eth0 - a0:36:9f:e5:34:d4 - is still in use by bond0 - set the HWaddr of eth0 to a different address to avoid conflicts
May 12 09:19:38 HYDRA kernel: bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond

 

I am not sure whether these warnings are normal when cycling a bridge like this, or whether they point to the actual issue. Could enabling the bridge be causing ARP problems because of a MAC conflict? The LACP response warning appears right after only one interface has come up, so that error may be very temporary.
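
If it is a MAC conflict, something like this on the unRAID console should show whether br0, bond0, and the NICs ended up sharing a hardware address, and whether the switch ever answered LACP (just a sketch, using the interface names from the log):

# Compare the MACs the bridge, bond, and physical NICs are actually using
ip link show br0   | awk '/ether/ {print "br0   " $2}'
ip link show bond0 | awk '/ether/ {print "bond0 " $2}'
ip link show eth0  | awk '/ether/ {print "eth0  " $2}'
ip link show eth1  | awk '/ether/ {print "eth1  " $2}'

# In 802.3ad mode the bonding status lists the partner's MAC;
# 00:00:00:00:00:00 means the link partner never responded to LACP
grep -A 1 'Partner Mac Address' /proc/net/bonding/bond0

# Look for flapping entries in the bridge's forwarding database
bridge fdb show br br0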

 

On the point above where I said "only after toggling br0 upon a fresh boot": if the server is rebooted with bridging enabled, it experiences severe network lag "spikes". You can see this just by continually pinging the server:

(ping)

Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Request timed out.
Reply from 10.0.10.3: bytes=32 time=8ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Request timed out.
Request timed out.
Reply from 10.0.10.3: bytes=32 time=13ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64

This "lag" is absolutely felt when using the server. Whatever you are doing at the time has to wait until the server comes back. Toggling br0 after every reboot fixes this:

(ping)

Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
Reply from 10.0.10.3: bytes=32 time<1ms TTL=64
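
Since toggling the bridge after boot reliably clears the spikes, a stopgap (not a fix) might be to bounce br0 from the go file at startup. This is only a sketch; I have not verified that a plain link down/up replicates whatever the web GUI's toggle actually does:

# Candidate lines for /boot/config/go - bounce the bridge once after boot,
# mimicking the manual br0 toggle. Untested assumption: the GUI toggle may
# do more (e.g. rebuild the bond) than a simple down/up.
sleep 60                 # let bonding/LACP settle after boot
ip link set br0 down
sleep 2
ip link set br0 up
logger "go: bounced br0 as bridging-lag workaround"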

This issue is weird because everything WORKS, just inconsistently. I have used Wireshark to see what I can glean from connections to the unRAID server, only to find that there is sometimes simply a large delay between packets being received from unRAID, as if it were busy doing something else. That is not usually the case, with more than 12GB of RAM free and 0-8% CPU usage on a hyper-threaded 6-core.
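
To see whether the delay is introduced before or after the bridge, it might help to capture on the server at the same time Wireshark runs on the client. A rough sketch, assuming tcpdump is installed on the unRAID box (it may need to be added, e.g. via the NerdPack plugin):

# Capture SMB traffic on the bridge and the underlying bond simultaneously;
# comparing timestamps between the two pcaps shows whether packets stall
# inside the bridge or before they ever reach it.
mkdir -p /mnt/user/captures
tcpdump -i br0   -n 'port 445' -w /mnt/user/captures/br0.pcap &
P1=$!
tcpdump -i bond0 -n 'port 445' -w /mnt/user/captures/bond0.pcap &
P2=$!
sleep 60                 # reproduce the slow dir listing during this window
kill $P1 $P2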

 

unRAID Server Specs:
Xeon E5-1650 v3
Supermicro X10SRL-F
32GB (4x8GB) ECC DDR4 RDIMMs
Intel X540-T2
Dell H310 in IT mode
Supermicro CSE-846 with expander backplane, dual-linked
2x 8TB 7200RPM (parity)
12x 8TB 5400RPM (data)
512GB RAID0 cache (2x 256GB Samsung 850 Pro)
512GB Samsung 850 Pro - Unassigned Devices, for VMs
480GB Intel 730 - Unassigned Devices, for VMs
USB boot drive: 2GB Swissbit SLC

 

As stated above, I have tested this with LACP and jumbo frames disabled, only to see the same issue. I have had this problem since I built the machine in early 2016 with only a 1Gb NIC, and always assumed unRAID was simply a "slow" product. Only recently, with the move to 10Gb, did I disable bridging and VMs completely and find that unRAID is NOT a slow product. The issue has persisted through every version I have upgraded to since 2016.

 

Any help or insight you can provide would be appreciated. I am debating moving my virtualization needs back to ESXi and building a much more basic core system for unRAID, but if I can figure out why bridging is so problematic on this system, I can keep it as-is.

 

Thank you.

2 months later...

I can't really tell what the cause of your problem is, but bridging is an essential part of how VMs work. It is a hard requirement for letting VMs talk to the outside world.

I have not seen any reports similar to yours, and *if* it were a widespread problem, a lot of users would surely be complaining, given that bridging is a basic requirement.

From my own experience I can only say that it works fine, without any delays, on the different VMs that I am using.

 
