- Source: blog.cloudflare.com/how-to-rece…
- 16 Jun 2015 by Marek Majkowski.
- This translation has been approved by the original author, Marek Majkowski. Please contact me directly if you need to request permission.
- Unauthorized reproduction is forbidden.
Last week, in a casual conversation, I overheard a colleague say: “The Linux network stack is slow! You can’t expect it to handle more than 50,000 packets per second per core!”
That got me thinking: while I agree that 50kpps per core is probably the limit for any practical application, what is the Linux networking stack actually capable of? Let’s rephrase the question to make it more interesting:
How difficult is it to write a program that receives 1 million UDP packets per second on Linux?
Hopefully, answering this question will be a good lesson in the design of a modern networking stack.
CC BY-SA 2.0 image by Bob McCaffrey
First, let’s assume:
- Measuring packets per second (PPS) is much more interesting than measuring bytes per second (BPS). High BPS can be achieved with better pipelining and by sending longer packets; improving PPS is much harder.
- Since we are interested in PPS, our experiments will use short UDP messages: to be exact, 32 bytes of UDP payload. Together with the UDP header (8 bytes), IPv4 header (20 bytes), and Ethernet header (14 bytes), that means 74 bytes on the Ethernet layer.
- For the experiment we will use two physical servers: “receiver” and “sender”.
- They both have two six-core 2GHz Xeon processors. With hyperthreading (HT) enabled, that gives 24 logical processors per server. Each server has a multi-queue 10G network card (NIC) from Solarflare, with 11 receive queues configured. More on that later.
- The source code for the test programs is available here: udpsender, udpreceiver
Prerequisites
We use port 4321 to send and receive UDP packets. Before we begin, we must make sure the traffic won’t be interfered with by iptables:
receiver$ iptables -I INPUT 1 -p udp --dport 4321 -j ACCEPT
receiver$ iptables -t raw -I PREROUTING 1 -p udp --dport 4321 -j NOTRACK
Configure some IP addresses for later use:
receiver$ for i in `seq 1 20`; do \
ip addr add 192.168.254.$i/24 dev eth2; \
done
sender$ ip addr add 192.168.254.30/24 dev eth3
1. The easiest way
First, let’s do the simplest possible experiment: how many packets can a naive sender and receiver transmit and receive?
Sender pseudocode:
fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fd.bind(("0.0.0.0", 65400)) # select source port to reduce nondeterminism
fd.connect(("192.168.254.1", 4321))
while True:
fd.sendmmsg(["\x00" * 32] * 1024)
While we could use the usual send syscall here, it would not be efficient: context switches into the kernel have a cost, and it is better to avoid them. Fortunately, Linux recently gained a handy syscall: sendmmsg, which submits many packets in a single call. Let’s send 1,024 packets at a time.
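For reference, here is a minimal C sketch of what such a batched sender could look like; it is not the author's udpsender program, just an illustration of the sendmmsg batching pattern, with error handling omitted.

/* Minimal sketch (not the author's udpsender): batch 1,024 small UDP
 * datagrams into a single sendmmsg() call on a connected socket. */
#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define BATCH 1024

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(4321);
    inet_pton(AF_INET, "192.168.254.1", &dst.sin_addr);
    connect(fd, (struct sockaddr *)&dst, sizeof(dst));

    static char payload[32];                 /* 32 bytes of zeroes */
    static struct iovec iov[BATCH];
    static struct mmsghdr msgs[BATCH];
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = payload;
        iov[i].iov_len = sizeof(payload);
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    for (;;)
        sendmmsg(fd, msgs, BATCH, 0);        /* one syscall, 1,024 packets */
}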
Receiver pseudocode:
fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fd.bind(("0.0.0.0", 4321))
while True:
packets = [None] * 1024
fd.recvmmsg(packets, MSG_WAITFORONE)
recvmmsg is a more efficient version of the plain recv syscall: it receives many packets in one call.
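Again for reference only, a minimal C sketch of a batched receiver (not the author's udpreceiver1) might look like this; MSG_WAITFORONE blocks until at least one datagram arrives and then returns whatever else is already queued.

/* Minimal sketch (not the author's udpreceiver1): drain up to 1,024
 * datagrams per recvmmsg() call and simply count them. */
#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>

#define BATCH 1024

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(4321);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));

    static char bufs[BATCH][32];
    static struct iovec iov[BATCH];
    static struct mmsghdr msgs[BATCH];
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = sizeof(bufs[i]);
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    unsigned long long packets = 0;
    for (;;) {
        int n = recvmmsg(fd, msgs, BATCH, MSG_WAITFORONE, NULL);
        if (n > 0)
            packets += n;    /* the payload itself is never inspected */
    }
}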
Let’s try it:
sender$ ./udpsender 192.168.254.1:4321
receiver$ ./udpreceiver1 0.0.0.0:4321
  0.352M pps  10.730MiB /  90.010Mb
  0.284M pps   8.655MiB /  72.603Mb
  0.262M pps   7.991MiB /  67.033Mb
  0.199M pps   6.081MiB /  51.013Mb
  0.195M pps   5.956MiB /  49.966Mb
  0.199M pps   6.060MiB /  50.836Mb
  0.200M pps   6.097MiB /  51.147Mb
  0.197M pps   6.021MiB /  50.509Mb
With this naive implementation we get between 197kpps and 350kpps. Not too bad. Unfortunately, there is quite a lot of variability, caused by the kernel shuffling our programs between CPU cores. Pinning the processes to specific CPUs avoids this:
sender$ taskset -c 1 ./udpsender 192.168.254.1:4321
receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.362M pps  11.058MiB /  92.760Mb
  0.374M pps  11.411MiB /  95.723Mb
  0.369M pps  11.252MiB /  94.389Mb
  0.370M pps  11.289MiB /  94.696Mb
  0.365M pps  11.152MiB /  93.552Mb
  0.360M pps  10.971MiB /  92.033Mb
Now the kernel scheduler keeps the processes on the assigned CPUs. This improves processor cache locality and, as we wanted, makes the pps numbers more consistent.
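taskset does the pinning from the outside; a program can also pin itself. A small sketch using sched_setaffinity (an illustration, not something the original test programs necessarily do) could be:

/* Illustrative sketch: pin the calling process to a single CPU core,
 * the in-program equivalent of `taskset -c <cpu>`. Returns 0 on success. */
#define _GNU_SOURCE
#include <sched.h>

int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* pid 0 = self */
}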
2. Send more packets
While 370kpps is not bad for a naive program, it is still far from the goal of 1Mpps. To receive more packets, first we must send more packets. Let’s try sending from two independent threads:
sender$ taskset -c 1,2 ./udpsender \
            192.168.254.1:4321 192.168.254.1:4321
receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.349M pps  10.651MiB /  89.343Mb
  0.354M pps  10.815MiB /  90.724Mb
  0.354M pps  10.806MiB /  90.646Mb
  0.354M pps  10.811MiB /  90.690Mb
The numbers on the receiving side did not increase. The ethtool -S statistics reveal where the packets actually went:
receiver$ watch 'sudo ethtool -S eth2 |grep rx'
     rx_nodesc_drop_cnt:    451.3k/s
     rx-0.rx_packets:         8.0/s
     rx-1.rx_packets:         0.0/s
     rx-2.rx_packets:         0.0/s
     rx-3.rx_packets:         0.5/s
     rx-4.rx_packets:       355.2k/s
     rx-5.rx_packets:         0.0/s
     rx-6.rx_packets:         0.0/s
     rx-7.rx_packets:         0.5/s
     rx-8.rx_packets:         0.0/s
     rx-9.rx_packets:         0.0/s
     rx-10.rx_packets:        0.0/s
Through these statistics the NIC reports that it has successfully delivered about 350kpps to RX queue #4. rx_nodesc_drop_cnt is a Solarflare-specific counter which says that the NIC failed to deliver a further 450kpps to the kernel.
Sometimes it is not obvious why packets are not delivered. In our case, though, it is very clear: RX queue #4 delivers packets to CPU #6, and CPU #6 cannot handle any more work; it is already fully busy just reading about 350kpps. In htop this shows up as CPU #6 pegged at 100%.
A crash course in multi-queue NICs
In the past, network cards had only one RX queue for passing packets between the hardware and the kernel. One obvious limitation of this design is that the number of packets delivered cannot exceed the processing power of a single CPU.
To make use of multi-core systems, NICs began to support multiple RX queues. The design is simple: each RX queue is pinned to a separate CPU, so as long as the NIC spreads packets across all the RX queues, every CPU can be used. But it raises a question: given a packet, how does the NIC decide which RX queue to push it to?
Round-robin balancing is not acceptable, as it could cause packet reordering within a single connection. An alternative is to use a hash of the packet to decide the RX queue number. The hash is usually computed from the tuple (src IP, dst IP, src port, dst port). This guarantees that packets of a single connection always land on exactly the same RX queue, so reordering within a connection cannot happen.
In our example, hashes can be used like this:
RX_queue_number = hash('192.168.254.30', '192.168.254.1', 65400, 4321) % number_of_queues
Multi-queue hashing algorithm
The hashing algorithm can be configured with ethtool. On our setup it is:
receiver$ ethtool -n eth2 rx-flow-hash udp4
UDP over IPV4 flows use these fields for computing Hash flow key:
IP SA
IP DA
This reads as: for IPv4 UDP packets, the NIC hashes the (src IP, dst IP) addresses. That is:
RX_queue_number = hash('192.168.254.30', '192.168.254.1') % number_of_queues
This is pretty limited, since it ignores the port numbers. Many NICs allow the hash to be customized. Again with ethtool, we can try to select the full (src IP, dst IP, src port, dst port) tuple for hashing:
receiver$ ethtool -N eth2 rx-flow-hash udp4 sdfn
Cannot change RX network flow hashing options: Operation not supported
Unfortunately, our NIC does not support it. So our experiment is limited to hashing (SRC IP, DST IP).
A note on NUMA performance
So far, all of our packets have only flowed into one RX queue, accessing only one CPU.
Let’s use this opportunity to compare the performance of different CPU placements. In our setup, the receiver host has two CPU sockets, each belonging to a different NUMA node.
We can pin the single-threaded receiver to one of four interesting CPUs relative to that RX queue. The four options are:
- Run the receiver on another CPU, but on the same NUMA node as the RX queue. The performance, as we saw above, is around 360kpps.
- Run the receiver on exactly the same CPU as the RX queue: we can get up to ~430kpps, but the variability is very high. Performance drops to zero if the NIC is overwhelmed with packets.
- Run the receiver on the HT sibling of the CPU handling the RX queue: performance is about half the usual number, around 200kpps.
- Run the receiver on a CPU on a different NUMA node from the RX queue: we get ~330kpps, but the numbers are not very stable.
While a 10% penalty for running on a different NUMA node may not sound too bad, the problem only gets worse with scale. On some tests I could squeeze out only 250kpps per core, and on all the cross-NUMA tests the variability was bad. The performance penalty across NUMA nodes is even more visible at higher throughput: in one of the tests I saw a 4x penalty when running the receiver on a bad NUMA node.
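To place the receiver well, you first need to know which NUMA node the NIC hangs off. One way (a sketch, relying on the standard sysfs attribute exposed for PCI network devices, such as our eth2) is to read the device's numa_node file:

/* Sketch: read which NUMA node a network interface's PCI device is
 * attached to, e.g. nic_numa_node("eth2"). Returns -1 if unknown. */
#include <stdio.h>

int nic_numa_node(const char *ifname) {
    char path[128];
    int node = -1;
    snprintf(path, sizeof(path), "/sys/class/net/%s/device/numa_node", ifname);
    FILE *f = fopen(path, "r");
    if (f != NULL) {
        if (fscanf(f, "%d", &node) != 1)
            node = -1;
        fclose(f);
    }
    return node;
}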
3. Multiple receiver IP addresses
Since the hashing algorithm on our NIC is so limited, the only way to distribute packets across multiple RX queues is to use multiple destination IP addresses. Here is how to send packets to two different destination IPs:
sender$ taskset -c 1,2 ./udpsender 192.168.254.1:4321 192.168.254.2:4321
Use ethtool to verify that packets arrive on different RX queues:
receiver$ watch 'sudo ethtool -S eth2 |grep rx'
Receiving part:
receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
  0.609M pps  18.599MiB / 156.019Mb
  0.657M pps  20.039MiB / 168.102Mb
  0.649M pps  19.803MiB / 166.120Mb
Hooray! Two cores are busy handling the RX queues, and a third runs the application; we can get ~650kpps!
We can increase this number further by sending traffic to three or four RX queues, but soon the application hits another limit. This time rx_nodesc_drop_cnt does not grow, but the netstat “receive errors” do:
receiver$ watch 'netstat -s --udp'
Udp:
      437.0k/s packets received
        0.0/s packets to unknown port received.
      386.9k/s packet receive errors
        0.0/s packets sent
    RcvbufErrors:  123.8k/s
    SndbufErrors: 0
    InCsumErrors: 0
This means that while the NIC can deliver the packets to the kernel, the kernel cannot deliver them to the application. In our case it delivers only 440kpps; the remaining 390kpps (“packet receive errors”) + 123kpps (RcvbufErrors) are dropped because the application is not receiving them fast enough.
4. Multi-threaded reception
We need to scale out the receiver application. The naive approach of receiving from many threads on a single socket won’t work well:
sender$ taskset -c 1,2 ./udpsender 192.168.254.1:4321 192.168.254.2:4321
receiver$ taskset -c 1,2 ./udpreceiver1 0.0.0.0:4321 2
  0.495M pps  15.108MiB / 126.733Mb
  0.480M pps  14.636MiB / 122.775Mb
  0.461M pps  14.071MiB / 118.038Mb
  0.486M pps  14.820MiB / 124.322Mb
Receive performance drops compared to the single-threaded program. This is caused by lock contention on the UDP receive buffer: since both threads use the same socket descriptor, they spend a disproportionate amount of time fighting over the lock around the UDP receive buffer. This article describes the problem in more detail.
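To make the contended pattern concrete, here is a small sketch (simplified to plain recv instead of recvmmsg, and not the author's code) of several threads all reading from one shared socket descriptor; every call serializes on that socket's receive-buffer lock inside the kernel:

/* Anti-pattern sketch: N threads share a single UDP socket descriptor. */
#include <pthread.h>
#include <sys/socket.h>
#include <netinet/in.h>

static int shared_fd;                 /* one descriptor for all threads */

static void *rx_worker(void *arg) {
    (void)arg;
    char buf[32];
    for (;;)
        recv(shared_fd, buf, sizeof(buf), 0);   /* threads contend here */
    return NULL;
}

int main(void) {
    shared_fd = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(4321);
    bind(shared_fd, (struct sockaddr *)&addr, sizeof(addr));

    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, rx_worker, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
}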
Using multiple threads to receive from a single descriptor is not optimal.
5. SO_REUSEPORT
Fortunately, a workaround was recently added to Linux: the SO_REUSEPORT flag. When this flag is set on a socket descriptor, Linux allows many processes to bind to the same port. In fact, any number of processes can bind to it, and the load will be spread across them.
With SO_REUSEPORT, each process has its own socket descriptor, and therefore its own dedicated UDP receive buffer. This avoids the contention problem we ran into before; a sketch of the pattern and the measured results follow below.
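As an illustration only (not the author's udpreceiver1 code), each SO_REUSEPORT worker opens its own socket, sets the flag before bind(), and binds to the same port:

/* Sketch: each worker process or thread calls this to get its own
 * descriptor bound to the shared UDP port; the kernel spreads incoming
 * packets between the bound sockets. */
#include <sys/socket.h>
#include <netinet/in.h>

int open_reuseport_socket(unsigned short port) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int one = 1;
    /* Must be set before bind(). */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    return fd;
}

Each worker then runs its own receive loop on its private descriptor. Running the receiver with four such workers pinned to four cores gives: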
receiver$ taskset -c 1,2,3,4 ./udpreceiver1 0.0.0.0:4321 4 1
  1.114M pps  34.007MiB / 285.271Mb
  1.147M pps  34.990MiB / 293.518Mb
  1.126M pps  34.374MiB / 288.354Mb
Now that’s more like it! Throughput is finally above 1Mpps.
There is room for improvement in our scheme. Although we started four receiving threads, the load was not evenly distributed among them:
Two threads received all the work, and the other two got no packets at all. This is caused by a hashing collision, but this time at the SO_REUSEPORT layer.
Conclusion
I did some further testing, and with perfectly aligned RX queues and receiver threads on a single NUMA node it was possible to get 1.4Mpps. Running the receiver on a different NUMA node caused the numbers to drop, achieving at best 1Mpps.
In summary, if you want perfect performance, you need:
- Ensure that traffic is spread evenly across many RX queues and SO_REUSEPORT processes. In practice, the load is usually well distributed as long as there is a large number of connections (or flows).
- You need to have enough spare CPU capacity to actually pick the packets up from the kernel.
- For better performance, both RX queues and receiver processes should be located on a single NUMA node.
Although we have shown that it is technically possible to receive 1Mpps on a Linux machine, the application was not doing any actual processing of the received packets; it did not even look at the contents of the traffic. Don’t expect performance like this from any real application without a lot more work.