UDP packet drops and packet receive error difference

If you write a program that receives a very high volume of UDP packets, it is good to know whether all of the packets are processed by your program or not.

To get statistics about the UDP stack, you can use netstat with the -anus parameters like this:

$ netstat -anus
...
Udp:
    531412134 packets received
    125 packets to unknown port received.
    38491 packet receive errors
    531247364 packets sent
...

In this example there are 125 packets to unknown ports. This can be normal: without a firewall you can’t control the incoming traffic, and any host can send UDP packets to a port your server doesn’t listen on. In this scenario, you don’t need to worry much about it. But if this number stays high on average over a day, you may want to capture the UDP traffic and analyze it further.

Another cause of unknown port errors can be a very important issue. If your program crashes randomly, your system may still be receiving packets, but because of the crash there is no software to accept those packets, so the unknown port counter increments. You have to fix this at the software level.

There are 38491 packet receive errors and we have to take them seriously. Packet receive errors don’t include problems that occurred at the network card level; they only cover packets that reached the UDP protocol stack. The main reasons for packet receive errors:

  • UDP packet header corruption or checksum problems
  • packet receive buffer problems on the application or kernel side

Unlike TCP, UDP does not have built-in flow control, so if you can’t process the received packets fast enough, the kernel will start to drop new incoming packets because the socket receive buffer is full. Without any tuning of the UDP stack, the default UDP receive buffer size is between 32 and 128 kilobytes per socket (the exact value is in /proc/sys/net/core/rmem_default). You can set it to a much higher value with setsockopt like below:

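#include <sys/socket.h>
int size = 2 * 1024 * 1024;  /* requested receive buffer size: 2 MB */
/* sockfd is an already opened UDP socket */
setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));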

A 2 MB receive buffer can handle up to 1 Gbps of traffic if you process packets fast enough, but you can also increase it to 16 or 32 MB if you need to.
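
To get a feel for these sizes, it helps to convert a buffer size into the length of a stall it can absorb at line rate. The helper below is only a back-of-the-envelope sketch (the function name is mine, not from any library), and it ignores the per-packet bookkeeping overhead that the kernel charges against the buffer, so the real headroom is smaller, especially with short packets:

```c
/* Milliseconds of line-rate traffic that fit into a receive buffer.
 * Ignores per-packet kernel overhead, so treat the result as an upper bound. */
double rcvbuf_headroom_ms(double buf_bytes, double link_bits_per_sec)
{
    return buf_bytes * 8.0 / link_bits_per_sec * 1000.0;
}
```

For a 2 MB buffer at 1 Gbps, rcvbuf_headroom_ms(2 * 1024 * 1024, 1e9) gives about 16.8 ms, so the application must never stay away from the socket for longer than that. With 16 MB it stretches to roughly 134 ms.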

Please note that you can’t set the socket receive buffer above the maximum value defined in the kernel, which you can see in /proc/sys/net/core/rmem_max. You have to raise this value to use a big socket receive buffer in your application:

$ sudo sysctl -w net.core.rmem_max=33554432
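
Note that setsockopt does not fail when the requested size is above rmem_max; the kernel silently caps it. A minimal sketch for checking what you actually got is to read the value back with getsockopt (set_rcvbuf is a hypothetical helper name, and sockfd is assumed to be an already opened socket):

```c
#include <sys/socket.h>

/* Request a receive buffer of `want` bytes on an open socket and return the
 * size the kernel reports afterwards, or -1 on error. */
int set_rcvbuf(int sockfd, int want)
{
    int got = 0;
    socklen_t len = sizeof(got);

    if (setsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want)) < 0)
        return -1;
    /* Read the value back: the kernel may have capped the request. */
    if (getsockopt(sockfd, SOL_SOCKET, SO_RCVBUF, &got, &len) < 0)
        return -1;
    return got;
}
```

On Linux the kernel doubles the requested value to leave room for bookkeeping, and getsockopt reports the doubled value, so a result smaller than twice the request means the request was capped by rmem_max.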

There is one more parameter: netdev_max_backlog. It controls the number of packets allowed to queue on the kernel side when a network card receives packets faster than the kernel can process them. If you are receiving a very high level of traffic, you may want to increase the packet backlog queue in the kernel to 2000 (the default value is 1000):

$ sudo sysctl -w net.core.netdev_max_backlog=2000
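
If the backlog queue overflows, the kernel counts the drops per CPU in /proc/net/softnet_stat, so that file is worth checking before and after changing netdev_max_backlog. The sketch below assumes the usual layout of that file (one line per CPU, hexadecimal columns, first column packets processed, second column packets dropped because the backlog was full); the function names are only illustrative:

```c
#include <stdio.h>

/* Parse one per-CPU line of /proc/net/softnet_stat.
 * Returns 0 on success, -1 on a malformed line. */
int parse_softnet_line(const char *line, unsigned int *processed,
                       unsigned int *dropped)
{
    return sscanf(line, "%x %x", processed, dropped) == 2 ? 0 : -1;
}

/* Sum the "dropped" column over all CPUs. Returns -1 if the file
 * cannot be opened. */
long long softnet_total_dropped(const char *path)
{
    FILE *f = fopen(path, "r");
    char line[512];
    unsigned int processed, dropped;
    long long total = 0;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        if (parse_softnet_line(line, &processed, &dropped) == 0)
            total += dropped;
    }
    fclose(f);
    return total;
}
```

If softnet_total_dropped("/proc/net/softnet_stat") keeps growing under load, the backlog is a likely place to look; if it stays flat, raising netdev_max_backlog probably won’t change much.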

After changing these settings, check the UDP packet statistics again. netstat cannot group these statistics by process; they always cover the whole UDP protocol stack (per-socket drop counters do exist in the drops column of /proc/net/udp). The counters are reset on system reboot.
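
For a per-socket view, the kernel keeps a drop counter for each UDP socket: it is the last column (drops) of /proc/net/udp, and the inode column can be matched against the socket:[inode] links under /proc/<pid>/fd to find the owning process. A rough sketch of parsing one line, assuming the usual IPv4 column layout of that file (the struct and function names are made up for illustration):

```c
#include <stdio.h>

struct udp_sock_stat {
    unsigned int  local_port; /* local UDP port */
    unsigned int  rx_queue;   /* bytes currently waiting in the receive queue */
    unsigned long inode;      /* socket inode number */
    unsigned long drops;      /* datagrams dropped on this socket */
};

/* Parse one data line of /proc/net/udp (not the header line).
 * Returns 0 on success, -1 if the line does not match the expected layout. */
int parse_proc_net_udp_line(const char *line, struct udp_sock_stat *st)
{
    /* Fields: sl, local addr:port, remote addr:port, state,
     * tx_queue:rx_queue, tr:tm->when, retrnsmt, uid, timeout,
     * inode, ref, pointer, drops. Unused fields are skipped with '*'. */
    int n = sscanf(line,
                   " %*d: %*[0-9A-Fa-f]:%x %*[0-9A-Fa-f]:%*x %*x "
                   "%*x:%x %*x:%*x %*x %*u %*u %lu %*d %*s %lu",
                   &st->local_port, &st->rx_queue, &st->inode, &st->drops);
    return n == 4 ? 0 : -1;
}
```

Feeding every line of /proc/net/udp through this function and watching drops for the port your program listens on shows whether the system-wide errors belong to your socket or to something else on the machine.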

See also: man setsockopt

Here is a life-saver update after 6 years.

Today I found a strange issue with UDP packet drops.

I have a non-standard application that processes more than 300,000 short UDP packets per second, works across lots of VLANs and uses advanced IP routing tables with the 10 Gbps ixgbe driver.

The problem is that after a while UDP packets start to get dropped. Most UDP packet drops are related to the UDP socket receive buffer size, but this time there is no increase in receive buffer errors, while packet receive errors increase dramatically in the netstat -anus output.

And there are no UDP checksum errors either; this can be checked through the /proc/net/snmp file:

$ watch -n1 "grep 'Udp:' /proc/net/snmp"

Look especially at the InErrors, RcvbufErrors and InCsumErrors fields.
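
The file has two Udp: lines, the first with the field names and the second with the values in the same order, so a program can pick a counter by name instead of by position. A small sketch (udp_snmp_field is a hypothetical helper; it expects the two lines as strings that were already read from the file):

```c
#include <stdio.h>
#include <string.h>

/* Look up one counter by name, given the two "Udp:" lines of /proc/net/snmp
 * (field names first, then values). Returns 0 on success, -1 if the field
 * is missing or its value is not a number. */
int udp_snmp_field(const char *names, const char *values,
                   const char *field, unsigned long long *out)
{
    char name[64], value[64];
    int n1, n2;

    /* Walk both lines in lockstep, one whitespace-separated token at a time. */
    while (sscanf(names, "%63s%n", name, &n1) == 1 &&
           sscanf(values, "%63s%n", value, &n2) == 1) {
        names += n1;
        values += n2;
        if (strcmp(name, field) == 0)
            return sscanf(value, "%llu", out) == 1 ? 0 : -1;
    }
    return -1;
}
```

That makes it easy to log the three counters over time and spot the pattern described here, where InErrors grows while RcvbufErrors and InCsumErrors stay flat.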

I looked at all of the tunable parameters on both the kernel side (including NAPI and backlog parameters) and the Ethernet driver side (including all types of offloading and the ring buffer parameters); nothing helped.

After that I found dropwatch, a network diagnostic utility that tries to find where packets are getting dropped: https://github.com/nhorman/dropwatch

It can be started as:

$ sudo dropwatch -l kas
Initializing kallsyms db
dropwatch> start
Enabling monitoring...
Waiting for activation ack....
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring

and it just prints lots of drops like this:

709 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)
1535 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)
1311 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)
2376 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)
1348 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)
1595 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)
2456 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)

I quickly looked at the udp_queue_rcv_skb kernel function (https://github.com/torvalds/linux/blob/v4.19/net/ipv4/udp.c#L1989) and a few steps later I found that the drops come from the __udp_enqueue_schedule_skb function.

I won’t go into much detail beyond this point, because it gave me the hint to play with the socket receive buffer size in the application (voilà, we are back at the same parameter that we already said must be checked first). Yes, the receive buffer error count is not increasing, so our UDP socket buffer is already big enough. But it is too big!

We used 16 MB UDP socket receive buffers on this setup. That works fairly well without any packet drops on different hardware, and we can’t reproduce the problem on anything other than ixgbe Ethernet cards. The problem must be somehow related to ixgbe, but when we decreased the socket receive buffers to 2 or 4 MB, all the problems just went away and we no longer see any drops at udp_queue_rcv_skb lines in the dropwatch output.

In summary:

  • dropwatch is a really great utility for diagnosing network packet drops on the kernel side
  • you must always double-check everything when using big socket receive buffers in your application

Here is another problem which is not the same but somewhat similar; again, dropwatch is the key to the solution:

You are a master!!

I have been suffering from the same problem (in this case with the tg3 driver).
The software (composed of several pieces) ran OK, but after 6 hours one of these pieces started to behave weirdly, just the one in charge of processing UDP packets (about 6 MBytes/second), and even the NTP daemon started to do strange things (WTF).
I ran dropwatch and realised that I got the same “drops at udp_queue_rcv_skb” messages as you.
After reducing the rcv buffer to 2 MB all problems went away.

Thank you friend!

thanks for the awesome information.

thanks my issue has been fixed.