Here is a life-saving update to this answer, six years later.
Today I ran into a strange UDP packet drop issue.
I have a non-standard application that processes more than 300,000 short UDP packets per second. It works across lots of VLANs and uses advanced IP routing tables, on 10 Gbps hardware with the ixgbe driver.
The problem is that after a while UDP packets start to get dropped. Most UDP packet drops are related to the UDP socket receive buffer size, but this time the receive buffer error count does not increase, while the packet receive error count increases dramatically in the netstat -anus output.
There are no UDP checksum errors either. This can be checked through the /proc/net/snmp file:
$ watch -n1 "cat /proc/net/snmp | grep 'Udp:'"
Look especially at the InErrors, RcvbufErrors and InCsumErrors fields.
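If you would rather see how fast each counter grows than read raw totals, the same file can be parsed with a few lines of Python. This is only a sketch: the function names are mine, and it assumes the usual /proc/net/snmp layout where a "Udp:" header line is followed by a "Udp:" value line.

```python
def parse_udp_counters(snmp_text):
    """Return the Udp: counters from /proc/net/snmp text as a dict of ints."""
    # "UdpLite:" lines do not start with "Udp:", so they are skipped here.
    udp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Udp:")]
    if len(udp_lines) < 2:
        raise ValueError("expected a 'Udp:' header line and a 'Udp:' value line")
    names, values = udp_lines[0][1:], udp_lines[1][1:]
    return dict(zip(names, (int(v) for v in values)))


def counter_deltas(before, after,
                   names=("InErrors", "RcvbufErrors", "InCsumErrors")):
    """Return how much each named counter grew between two snapshots."""
    return {name: after.get(name, 0) - before.get(name, 0) for name in names}


def read_udp_counters(path="/proc/net/snmp"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_udp_counters(f.read())
```

Taking two snapshots with read_udp_counters() a second apart and passing them to counter_deltas() gives per-second growth. In the situation described here, InErrors would keep growing while RcvbufErrors and InCsumErrors stay at zero.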
I looked at all of the tunable parameters on both the kernel side (including NAPI and backlog parameters) and the Ethernet driver side (including all types of offloading and the ring buffer parameters), but nothing helped.
After that I found dropwatch, a network diagnostic utility that tries to find where packets are being dropped: https://github.com/nhorman/dropwatch
It can be started as:
$ sudo dropwatch -l kas
Initializing kallsyms db
dropwatch> start
Enabling monitoring...
Waiting for activation ack....
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
and it just prints lots of drops like this:
709 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)
1535 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)
1311 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)
2376 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)
1348 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)
1595 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)
2456 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)
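When the output gets long, it helps to add the drops up per kernel symbol. Here is a small sketch that does so; the regular expression is my own and only assumes the "N drops at symbol+offset (address)" line format shown above.

```python
import re
from collections import Counter

# Matches lines such as:
#   709 drops at udp_queue_rcv_skb+384 (0xffffffff832666a4)
DROP_LINE = re.compile(r"^\s*(\d+) drops at ([^\s+]+)\+")


def total_drops_by_symbol(lines):
    """Sum the drop counts per kernel symbol, ignoring unrelated lines."""
    totals = Counter()
    for line in lines:
        match = DROP_LINE.match(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals
```

Feeding it the saved dropwatch output line by line and calling most_common() on the result shows which function accounts for most of the drops.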
I had a quick look at the udp_queue_rcv_skb kernel function (https://github.com/torvalds/linux/blob/v4.19/net/ipv4/udp.c#L1989) and a few steps later found that the drops come from the __udp_enqueue_schedule_skb function.
I won't go into much detail beyond this point, because this gave me the hint to play with the socket receive buffer size in the application (and here we are, back at the very parameter that we already said must be checked first). Yes, the receive buffer error count is not increasing, so our UDP socket buffer is already big enough. But it is too big!
We used 16 MB UDP socket receive buffers on this setup. That works fairly well without any packet drops on different hardware, and we can't reproduce the problem on anything other than ixgbe Ethernet adapters. The problem must be somehow related to ixgbe, but when we decreased the socket receive buffers to 2 or 4 MB, all the problems were simply gone and we no longer saw any drops at udp_queue_rcv_skb lines in the dropwatch output.
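For reference, this is roughly how a receive buffer size is requested and then verified from an application. It is a sketch in Python rather than our actual code, and the helper name is made up. On Linux the kernel doubles the requested value for bookkeeping overhead and caps the request at net.core.rmem_max, so the value read back is usually not the value you asked for.

```python
import socket


def open_udp_socket(rcvbuf_bytes):
    """Create a UDP socket, request a receive buffer size, report what we got."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Ask the kernel for the buffer size; it may silently cap the request.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    # Read back the size that is actually in effect for this socket.
    effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, effective


if __name__ == "__main__":
    sock, effective = open_udp_socket(2 * 1024 * 1024)  # request 2 MB
    print("effective SO_RCVBUF:", effective, "bytes")
    sock.close()
```

Printing the effective value at startup makes it easy to confirm that a change from 16 MB down to 2 or 4 MB really took effect on the machine you are testing.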
In summary:
- dropwatch is a really great utility for diagnosing network packet drops on the kernel side
- you must always double-check everything when using big socket receive buffers in your application