Dropped RX packets on br0


jowe


I have a very annoying problem: br0 is dropping packets all the time. Only a few, and I don't have any problems most of the time. But a couple of times the whole server has stopped responding on the network, and I just had to unplug the NIC, replug it, and everything worked again. I don't know if this is related to the drops. It only happens after a month or so without a reboot.

 

I used the onboard Realtek NIC at first, but switched to an Intel Gigabit CT Desktop Adapter with the 82574L controller, and a new cable. I tried two different switches. No change at all.

 

I've tried shutting down different VMs, Docker containers, and my receiver, which are all connected to the same router. I've even tried starting from a clean install of unRAID.

 

I get some "Tower kernel: br0: received packet on eth0 with own address as source address" messages, usually four in a row. But the interface is dropping packets all the time, not just when those get logged.

 

I'm out of ideas, so I need your help, guys!

 

br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.101  netmask 255.255.255.0  broadcast 192.168.1.255
        ether xx:xx:xx:xx:xx:xx  txqueuelen 0  (Ethernet)
        RX packets 5426647  bytes 14194429168 (13.2 GiB)
        RX errors 0  dropped 5861  overruns 0  frame 0
        TX packets 4081191  bytes 30250635997 (28.1 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

tower-diagnostics-20160202-0758.zip

Link to comment
  • 1 month later...

Dropped packets are not necessarily a problem in themselves!!!

 

I currently have 20,000,000+ dropped receive packets on my Media Server (see before for its specs).  That server has only been up for about four days.  The server has ZERO issues even with that many!

 

However, sometimes they are an indicator of a problem, BUT you need some other symptom before you can even say that they might be part of the problem. BTW, if you google this issue, you will generally find that it is a networking issue and that the source is often outside of the server.

 

Now, the problem where you have to unplug/replug the NIC connection is a real issue, but if you had it with both the Intel and Realtek NICs, I would be more suspicious of the switch. (The 'green' power-saving feature that most switches have these days sometimes seems to disconnect or shut down an active port when it shouldn't.) The next time the problem happens, try power-cycling the switch. You could also move the cable to another port on the switch, as this sometimes seems to help. (I have had to replace a switch where this problem became a real headache.)

Link to comment

The OP's 0.1% dropped packets is absolutely nothing to worry about.

 

Frank1940's quoted value of 20,000,000 dropped packets is meaningless without knowing the total number of packets.
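To make that point concrete, the drop count has to be divided by the received-packet count. A minimal Python sketch, assuming the net-tools `ifconfig` layout from the first post; the function name `rx_drop_stats` is hypothetical, and the sample reuses the OP's br0 counters:

```python
import re


def rx_drop_stats(ifconfig_text: str) -> tuple[int, int, float]:
    """Return (rx_packets, rx_dropped, dropped as a percentage of rx_packets).

    Assumes the net-tools ifconfig layout shown in the first post:
      'RX packets 5426647  bytes ...'
      'RX errors 0  dropped 5861  overruns 0  frame 0'
    """
    packets = re.search(r"RX packets\s+(\d+)", ifconfig_text)
    dropped = re.search(r"RX errors\s+\d+\s+dropped\s+(\d+)", ifconfig_text)
    if packets is None or dropped is None:
        raise ValueError("no RX counters found in the ifconfig output")
    rx_packets = int(packets.group(1))
    rx_dropped = int(dropped.group(1))
    # Avoid dividing by zero on an interface that has received nothing yet.
    pct = 100.0 * rx_dropped / rx_packets if rx_packets else 0.0
    return rx_packets, rx_dropped, pct


# Counters copied from the br0 output in the first post.
SAMPLE = """\
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        RX packets 5426647  bytes 14194429168 (13.2 GiB)
        RX errors 0  dropped 5861  overruns 0  frame 0
        TX packets 4081191  bytes 30250635997 (28.1 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
"""

if __name__ == "__main__":
    packets, dropped, pct = rx_drop_stats(SAMPLE)
    print(f"{dropped} / {packets} = {pct:.3f}%")
```

For the OP's br0 this prints `5861 / 5426647 = 0.108%`, which is where the 0.1% figure comes from.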

 

15,372,099 packets received! Actual count of drops: 20,646,601! (So you can see that dropped packets are not indicative of any real server-type problem. I suspect that my Netgear media players flood the network with packet requests, as the count always increases dramatically when I am streaming a Blu-ray rip.)

 

EDIT: This is a point of information rather than truly useful information for helping the OP get to the bottom of his issue...

Link to comment
  • 1 month later...

Anyone with the same problem?

 

Yes,  4+ million dropped packets in the last 5 hours out of 106 million packets, so about 4% are dropped.

 

I'm using the same CT card you have. If I install an older Intel PRO/1000 GT, I get about 8,000 drops an hour instead of 800,000. With my onboard Realtek 8211, I got roughly double the number of dropped packets. The odd thing is that I run a second unRAID server on the exact same network, with the same RAM, CPU, and unRAID version. It has a Realtek 8168, and it has had 0 dropped packets in over 100 days! This leads me to believe it isn't the switch, router, etc. The problem remains even if I swap ports.
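A drops-per-hour figure like the 8,000 vs. 800,000 above can be measured by sampling the counter twice. A sketch, assuming the standard Linux sysfs layout (`/sys/class/net/<iface>/statistics/rx_dropped`); the helper names are hypothetical:

```python
import time
from pathlib import Path

SYSFS_NET = Path("/sys/class/net")


def read_rx_dropped(iface: str, sysfs: Path = SYSFS_NET) -> int:
    """Read the kernel's per-interface rx_dropped counter from sysfs."""
    return int((sysfs / iface / "statistics" / "rx_dropped").read_text())


def per_hour(before: int, after: int, seconds: float) -> float:
    """Scale a counter increase measured over `seconds` to a per-hour rate."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    return (after - before) * 3600.0 / seconds


def sample_drop_rate(iface: str, seconds: float = 60.0) -> float:
    """Sample rx_dropped twice, `seconds` apart, and return drops per hour."""
    start = read_rx_dropped(iface)
    time.sleep(seconds)
    return per_hour(start, read_rx_dropped(iface), seconds)


if __name__ == "__main__":
    # The figures reported above: 4 million drops over 5 hours.
    print(per_hour(0, 4_000_000, 5 * 3600))
```

`sample_drop_rate("eth0", 600)` would sample over ten minutes. The `__main__` line only checks the arithmetic from the post: 4 million drops in 5 hours is 800,000 per hour.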

 

I also run the card in promiscuous mode and still get dropped packets, so I think they are actual dropped packets. Using these same NICs on unRAID 5.0.6, I had zero dropped packets for years. So I definitely think it has something to do with the kernel and drivers built into unRAID, and that 6.1.X is less friendly to NICs.

Link to comment


If you google the 'problem' of dropped packets, I think you will find that the consensus is that they are not a problem in and of themselves. (At least, that is what I found when I did.) If you don't have some other problem, they are harmless. There is no doubt that things have changed in the kernel (look at the version numbers) and in how the kernel was compiled for unRAID to make it more 'friendly' to both Dockers and VMs. Did these changes have effects in other areas? Undoubtedly! But did they break anything? That is the question...

 

What I think was happening in earlier versions was that all the packets were being accepted and the 'extra' ones tossed away without any notice of the action. Now, for whatever reason, they are being tossed at the NIC level, and the count of these 'extra' ones is reported as 'dropped packets'. I also suspect that the change to how interrupts are handled affected this problem. (You can look back through the 6.1 beta series change notes to find which release it was made in.)

Link to comment


That's the reason I ran in promiscuous mode. In that mode the 'extra ones' that are supposed to be dropped are no longer counted, so it should only be reporting genuine dropped packets.

 

So I don't agree with your conclusion that this is normal. I think something is broken. I do agree that newer Linux kernels treat more things as dropped packets (IPv6 when you aren't configured for it, etc.), so seeing an increase in dropped packets may be normal. But once those are all filtered out and this many drops are still reported, there is a problem. Essentially, a 4% drop rate of valid packets means 4% need to be resent (and then some of the resent ones get dropped, etc.). So you are basically talking about a 4% slowdown in network transfers in my case (about 9.5% with the Realtek 8211).
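The "resent, and some of the resends dropped" chain is a geometric series. Under the simplifying assumption that every transmission is dropped independently with the same probability p, the expected number of transmissions per delivered packet is:

```latex
% Assumption: each transmission (original or retry) is dropped independently
% with probability p; N = transmissions needed to deliver one packet.
\mathbb{E}[N] = \sum_{k=1}^{\infty} k\,p^{k-1}(1-p) = \frac{1}{1-p},
\qquad
p = 0.04 \;\Rightarrow\; \mathbb{E}[N] = \frac{1}{0.96} \approx 1.042
```

With p = 0.04 the resend chain costs about 4.2% extra transmissions, only slightly more than the 4% figure. This counts retransmissions only; TCP also slows its sending rate after a loss, so the real throughput effect can be larger than this estimate.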

 

Over 5 hours I transferred 120+ GB of data to unRAID and had 4 million dropped packets.  On my other unRAID, which is on the same network, I transfer 6 GB/hour 24/7 to the unRAID.  That is 144 GB per day, and that server has been running over 100 days without a single dropped packet.

 

Now, I don't know that it is an unRAID problem per se. It may be more of a Linux/hardware problem, and in their desire to upgrade, unRAID stepped on the landmine. I don't know. If you look at the change log, one of the unRAID updates notes that new firmware for Realtek NICs was added to correct issues. So is this why my one Realtek NIC works error-free? Some firmware fix for the NIC that unRAID distributes? I believe the 8168 in my working server is one that got a firmware update, and an updated driver too. However, unRAID doesn't detect my Realtek 8211 (in the troubled server) as a Realtek NIC, so it loads the forcedeth driver.

 

However, it also loaded the forcedeth driver in 5.0.6, and I had zero drops on that platform. So something else is going on, but I don't know what. Also, if you look into the Intel CT, there are numerous reports of dropped packets with Linux in general, and several driver and firmware modifications have been made for the NIC. Unfortunately, the last report I have of it working error-free on Linux was back in Linux 2.X something. But that doesn't rule out later changes to Linux breaking it (I think there is a newer Linux driver that I don't believe unRAID is using).
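To check which driver the kernel actually bound to each NIC (forcedeth vs. r8169 vs. e1000e, for example), sysfs exposes a `device/driver` symlink per interface. A sketch assuming the standard sysfs layout; `nic_driver` is a hypothetical name, and `ethtool -i <iface>` reports the same driver name:

```python
import os
from typing import Optional


def nic_driver(iface: str, sysfs_net: str = "/sys/class/net") -> Optional[str]:
    """Return the kernel driver bound to `iface` (e.g. 'e1000e'), or None.

    Physical NICs have a device/driver symlink pointing into
    /sys/bus/<bus>/drivers/<name>; virtual interfaces such as br0 or lo don't.
    """
    link = os.path.join(sysfs_net, iface, "device", "driver")
    if not os.path.exists(link):
        return None
    # Resolve the symlink chain; the last path component is the driver name.
    return os.path.basename(os.path.realpath(link))


if __name__ == "__main__":
    root = "/sys/class/net"
    names = sorted(os.listdir(root)) if os.path.isdir(root) else []
    for name in names:
        print(f"{name}: {nic_driver(name, root) or '(no driver link)'}")
```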

 

I guess my general complaint here is that I don't think unRAID looks back enough when they do their upgrades. Everyone here has unRAID boxes built from an older mishmash of equipment that is stable for them. But unRAID wants to add some new doodad like Docker, so they upgrade drivers, the Linux kernel, etc. to accommodate it. Then old, stable platforms get broken (I guess we should never upgrade, and I'm kicking myself a little for moving off 5.0.6 now).

 

unRAID would actually work better as a combined hardware/software platform. If Limetech controlled the hardware, they would know that only three motherboards were used, three different NICs, one brand of RAM, etc. Then when it came time to upgrade the software, they would have 48 pieces of hardware (just an example) to test, and once it was all working in beta, it could be released to everyone with zero problems (more or less).

 

And just for more info on changes to dropped packets:

 

Beginning with kernel 2.6.37, the meaning of the dropped packet count changed. Before, a dropped packet was most likely due to an error. Now, the rx_dropped counter shows statistics for frames dropped because of:

 

Softnet backlog full  -- (Measured from /proc/net/softnet_stat)

Bad / Unintended VLAN tags

Unknown / Unregistered protocols

IPv6 frames when the server is not configured for IPv6

 

If any frames meet those conditions, they are dropped before the protocol stack and the rx_dropped counter is incremented.
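The first of those causes can be checked directly. A sketch that reads the per-CPU drop column of `/proc/net/softnet_stat` (the second hexadecimal field on each line); the function name and the sample values are hypothetical:

```python
from pathlib import Path


def softnet_drops(text: str) -> list[int]:
    """Per-CPU drop counts from /proc/net/softnet_stat.

    Each line is one CPU and every field is hexadecimal. The first field is
    packets processed; the second is packets dropped because the backlog was full.
    """
    drops = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            drops.append(int(fields[1], 16))
    return drops


# Made-up two-CPU sample, truncated to four columns, for illustration only.
SAMPLE = """\
0000a3f1 00000000 00000002 00000000
0001b2c0 0000001a 00000005 00000000
"""

if __name__ == "__main__":
    path = Path("/proc/net/softnet_stat")
    text = path.read_text() if path.exists() else SAMPLE
    per_cpu = softnet_drops(text)
    print("backlog drops per CPU:", per_cpu, "total:", sum(per_cpu))
```

If this total stays at zero while rx_dropped keeps climbing, the softnet backlog is probably not the source, and the other causes listed above are more likely.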

 

Care should be taken to confirm that frames are not being legitimately dropped.  A quick way to test this (WARNING: this test does not work for bonding interfaces) is to force the NIC into promiscuous mode:

 

host:~# ifconfig <interface> promisc

 

Then watch the rx_dropped counter. If it stops incrementing while the NIC is in promiscuous mode, then it is more than likely showing drops because of the reasons listed earlier. If frames continue to be shown as dropped, investigation should take place to determine the root cause.
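The same check can be scripted. A sketch, assuming the sysfs files `/sys/class/net/<iface>/flags` (where IFF_PROMISC is bit 0x100) and `.../statistics/rx_dropped`; the helper names are hypothetical, and the code only reads counters, so promiscuous mode still has to be set with the `ifconfig` command above:

```python
import time
from pathlib import Path

IFF_PROMISC = 0x100  # value of IFF_PROMISC in <linux/if.h>


def is_promisc(flags_text: str) -> bool:
    """Interpret the contents of /sys/class/net/<iface>/flags, e.g. '0x1103'."""
    return bool(int(flags_text, 16) & IFF_PROMISC)


def still_dropping(samples: list[int]) -> bool:
    """True if rx_dropped increased between any two consecutive samples."""
    return any(later > earlier for earlier, later in zip(samples, samples[1:]))


def watch_rx_dropped(iface: str, count: int = 6, interval: float = 10.0) -> list[int]:
    """Sample rx_dropped `count` times; refuse to run unless promisc is set."""
    base = Path("/sys/class/net") / iface
    if not is_promisc((base / "flags").read_text()):
        raise RuntimeError(f"{iface} is not in promiscuous mode")
    samples = []
    for i in range(count):
        if i:
            time.sleep(interval)  # wait only between samples
        samples.append(int((base / "statistics" / "rx_dropped").read_text()))
    return samples
```

For example, `still_dropping(watch_rx_dropped("eth0"))` returning True would mean frames are still being dropped in promiscuous mode, which per the note above warrants further investigation.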

 

 

 

Link to comment

And here is a similar problem with Fedora.  One version no dropped packets, update and poof, dropped packets.

 

http://forums.fedoraforum.org/showthread.php?t=297243

 

But apparently they are not real dropped packets. I'm trying to read through it, understand what was done, and apply it to troubleshooting my server. His problems were flow control (flow control was now being counted as dropped packets) and DHCP requests (VLAN related).

 

And for perspective, Fedora 18 was 3.6.X kernel and Fedora 20 was 3.11.X kernel.

 

unRAID 5.0.6 used kernel 3.9.X. Every stable release of 6 and higher has used kernel 4.X.

 

So based on this user's report and my experience, something changed after 3.9.X.
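Every kernel named in this thread (Fedora 18's 3.6, Fedora 20's 3.11, unRAID 5.0.6's 3.9, unRAID 6's 4.x) is newer than the 2.6.37 change quoted earlier, so that change alone would not separate the two unRAID releases. A small sketch that makes the comparison explicit; the function names and the `-unRAID` suffix in the docstring are hypothetical examples:

```python
# Kernel version from the quoted note on rx_dropped semantics.
RX_DROPPED_CHANGE = (2, 6, 37)


def parse_version(version: str) -> tuple[int, ...]:
    """'4.1.18-unRAID' -> (4, 1, 18); stops at the first non-numeric part,
    so a placeholder like '3.9.X' becomes (3, 9)."""
    parts = []
    for piece in version.split("-")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_new_rx_dropped_semantics(version: str) -> bool:
    # Tuple comparison orders versions element by element.
    return parse_version(version) >= RX_DROPPED_CHANGE


if __name__ == "__main__":
    kernels = {
        "Fedora 18": "3.6.X",
        "Fedora 20": "3.11.X",
        "unRAID 5.0.6": "3.9.X",
        "unRAID 6.x": "4.X",
    }
    for name, version in kernels.items():
        print(f"{name:13} {version:7} {has_new_rx_dropped_semantics(version)}")
```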

Link to comment

On the complaint that unRAID upgrades break older, stable platforms: that's no different than any other OS. Try running Win 10 on a platform you bought for XP. During the WinXP days, I bought a Microsoft fingerprint reader; with Windows Vista, the fingerprint reader no longer worked. My Microsoft webcam that worked perfectly with Windows 7 has no drivers for Windows 10.

 

On running unRAID as a combined hardware/software platform: absolutely, which is why LT does intermittently build and sell their own servers. However, if that's all they supported, the price of your server would skyrocket accordingly.
Link to comment

All right, I figured out my dropped RX packets...

 

98% of them were caused by flow control. On earlier versions of the Linux kernel, flow control was disabled on Intel NICs. Then it was enabled, so upgrading unRAID results in a kernel with flow control active.
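To see whether pause (flow-control) frames are actually being exchanged on the link, many drivers report them in `ethtool -S`. A sketch assuming e1000e-style counter names (`rx_flow_control_xon`/`xoff`, `tx_flow_control_xon`/`xoff`); other drivers name these differently, and the function names and sample numbers are hypothetical:

```python
import subprocess


def parse_ethtool_stats(text: str) -> dict[str, int]:
    """Parse `ethtool -S <iface>` output: one 'name: value' pair per line."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit():  # skips the 'NIC statistics:' header
            stats[name.strip()] = int(value)
    return stats


def flow_control_counters(stats: dict[str, int]) -> dict[str, int]:
    # Names are driver-specific; matching on "flow_control" is an assumption
    # that fits e1000e-style names such as rx_flow_control_xoff.
    return {name: n for name, n in stats.items() if "flow_control" in name}


def read_stats(iface: str) -> dict[str, int]:
    """Run ethtool (must be installed; may need root) and parse its output."""
    result = subprocess.run(["ethtool", "-S", iface],
                            capture_output=True, text=True, check=True)
    return parse_ethtool_stats(result.stdout)


# Made-up output in the shape ethtool prints, for illustration only.
SAMPLE = """NIC statistics:
     rx_packets: 1000
     rx_flow_control_xon: 12
     rx_flow_control_xoff: 15
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
"""

if __name__ == "__main__":
    print(flow_control_counters(parse_ethtool_stats(SAMPLE)))
```

On a live system, `flow_control_counters(read_stats("eth0"))` growing during a transfer would suggest pause frames are in play on that link.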

 

My stupid router doesn't allow flow control to be disabled and doesn't have QoS (which disables flow control). But I disabled it at the NIC, and 98% of the dropped packets went away.

 

I was still getting a few, and most of those were resolved by increasing the size of the RX ring buffer (default 256, changed to 1024).
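The RX buffer size in question is the NIC's RX ring, which `ethtool -g` reports and `ethtool -G <iface> rx N` changes. A sketch that parses the `ethtool -g` output and builds the command, capped at the hardware maximum; the helper names and the sample's 4096 maximum are illustrative assumptions, with the 256 default taken from the post above:

```python
def parse_ring_params(text: str) -> dict[str, dict[str, int]]:
    """Parse `ethtool -g <iface>` into {'max': {...}, 'current': {...}}."""
    headers = {"Pre-set maximums": "max", "Current hardware settings": "current"}
    result: dict[str, dict[str, int]] = {"max": {}, "current": {}}
    section = None
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        key, value = key.strip(), value.strip()
        if not sep:
            continue
        if key in headers:
            section = headers[key]
        elif section is not None and value.isdigit():
            result[section][key] = int(value)
    return result


def rx_ring_command(iface: str, ring: dict[str, dict[str, int]],
                    wanted: int = 1024) -> list[str]:
    """Build an `ethtool -G` command, capped at the hardware's RX maximum."""
    rx = min(wanted, ring["max"]["RX"])
    return ["ethtool", "-G", iface, "rx", str(rx)]


# Illustrative output in the shape `ethtool -g` prints.
SAMPLE = """Ring parameters for eth0:
Pre-set maximums:
RX:\t\t4096
RX Mini:\t0
RX Jumbo:\t0
TX:\t\t4096
Current hardware settings:
RX:\t\t256
RX Mini:\t0
RX Jumbo:\t0
TX:\t\t256
"""

if __name__ == "__main__":
    ring = parse_ring_params(SAMPLE)
    print(ring["current"]["RX"], "->", " ".join(rx_ring_command("eth0", ring)))
```

Capping at the reported maximum avoids asking the driver for a ring size it will reject.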

 

The remainder were being caused by multicast packets that somehow weren't being routed correctly. I don't use multicast anyway, so I disabled multicast on the router, and the drops completely stopped!

 

Now to transfer another 100 GB of data and see how it compares...

Link to comment

Well,

 

So far out of 186+ million packets received, I have had 1 drop. 

 

So I believe what was happening before was that the receive buffer would fill up. With flow control, a pause command would be issued until the network card got its regularly scheduled CPU time to flush the buffer. This pause causes a certain number of packets (probably ones in transit or in software, but not yet buffered) to be dropped.

 

But now, when the buffer fills, the NIC uses an IRQ to take immediate control of the CPU (at the expense of other programs) to flush the buffer.  So more CPU usage, but no dropped packets, which is fine for me since I'm only using 20% of my CPU anyway.

Link to comment


So, are you getting more throughput? 

 

And as a second piece of information, how did you disable the flow control? 

 


Link to comment


It's hard to say on the speed. The top speed is not any faster. However, when I charted data transfers previously, the charts almost looked like a sine wave (I guess from the pauses). Now they run at the same top speed, but the speed stays consistent. So I think you end up transferring more data in a given period of time.

 

Copying to my cache drive doesn't feel a whole lot faster, but copying directly to a disk in the array definitely feels a lot faster.

 

First, you should check whether flow control is your problem (see the link I posted earlier). The best way to disable it is to turn it off at the switch/router (if possible). Enabling QoS on your router will also disable flow control.

 

If that doesn't work, you can manually turn it off with:

 

ethtool -A ethX autoneg off rx off

 

Instead of ethX, use whatever your adapter is called in your system (eth0 in mine).

 

ethtool -a ethX will show your NIC's current flow control (pause) settings.
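A sketch that reads `ethtool -a` output and, if RX pause is on, builds the `ethtool -A` command given above. The parser assumes the `Key: on|off` line format; the function names and the sample are hypothetical:

```python
def parse_pause(text: str) -> dict[str, bool]:
    """Parse `ethtool -a <iface>` output into e.g. {'Autonegotiate': True, 'RX': True}."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value in ("on", "off"):  # skips the 'Pause parameters' header
            settings[key.strip()] = value == "on"
    return settings


def disable_rx_pause_command(iface: str) -> list[str]:
    # The command from the post: pause autonegotiation off, RX pause off.
    return ["ethtool", "-A", iface, "autoneg", "off", "rx", "off"]


# Illustrative output in the shape `ethtool -a` prints.
SAMPLE = "Pause parameters for eth0:\nAutonegotiate:\ton\nRX:\t\ton\nTX:\t\ton\n"

if __name__ == "__main__":
    pause = parse_pause(SAMPLE)
    if pause.get("RX"):
        print("RX flow control is on; to disable:",
              " ".join(disable_rx_pause_command("eth0")))
```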

 

Up to 410+ million received packets and still just 1 drop.

 

 

Link to comment


I am a bit confused about why you are turning off autoneg. As I was looking into this parameter, I got the sense that it is the automatic negotiation of the NIC's connection speed. Am I totally confused, or is there some other issue that I am not aware of?

Link to comment


This is autonegotiation of flow control; there are many autonegotiations. If you are connected to a switch that has flow control you can't disable, and you don't turn autoneg off, flow control will automatically turn back on the instant you turn it off (so it will look like rx off does nothing).

Link to comment
