Original article: cilium.io/blog/2020/0…

Thomas Graf is the co-founder of Cilium and the CTO and co-founder of Isovalent, the company behind Cilium. Before that, Thomas spent 15 years working on the Linux kernel in the areas of networking, security, and eBPF. Note: this translation has been authorized by the author.

Hello everyone! 👋

As more and more critical workloads are migrated to Kubernetes, network performance benchmarks are becoming an important reference for choosing a Kubernetes networking solution. In this article, we will explore Cilium's performance characteristics based on a number of benchmarks run over the past few weeks. At the request of our users, we also present Calico's test results for direct comparison.

In addition to presenting the resulting data, we will take a closer look at the topic of container network benchmarking and explore the following aspects:

  • Throughput benchmark
  • Does the container network add overhead?
  • Breaking the rules: eBPF host routing
  • Measuring latency: requests per second
  • CPU flame graph comparison of Cilium eBPF and Calico eBPF
  • New connection processing rate
  • WireGuard vs. IPsec encryption overhead
  • The test environment

Summary of test results

Before diving into the benchmarks and their data in detail, let's present the aggregated test results. You can skip this section if you want to go straight to the testing details and draw your own conclusions.

  • eBPF is decisive: Cilium outperforms Calico's eBPF data path in some respects, such as the latency observed in the TCP_RR and TCP_CRR benchmarks. More importantly, eBPF is clearly superior to iptables: both Cilium and Calico perform significantly better in configurations where eBPF can bypass iptables than in configurations where it cannot.

    After looking into the details, we found that Cilium and Calico do not use eBPF in exactly the same way. While some of the concepts are similar (which is not surprising given the nature of open source), the CPU flame graphs show that Cilium benefits from additional context-switch savings. This may explain the difference in the TCP_RR and TCP_CRR results.

    Overall, based on these benchmark results, eBPF is clearly the best technology for meeting the demands of cloud native networking.

  • Observability, NetworkPolicy, and Service: For this benchmark, we focused on what the two projects have in common, namely the network path itself. This allows a direct comparison of node-to-node networking performance. In practice, however, observability, NetworkPolicy, and Service are also required, and here the Cilium and Calico eBPF data paths differ greatly. Cilium supports features that the Calico eBPF data path does not, and even standard features such as Kubernetes NetworkPolicy are implemented differently. If we invested more effort in testing eBPF in these more advanced use cases, we would likely see significant performance differences there as well. For reasons of space, this article will not explore this further; a more in-depth study is left for a follow-up article.

  • Comparing WireGuard and IPsec: Somewhat surprisingly, although WireGuard achieved a higher maximum throughput in our tests, IPsec was more CPU efficient at the same throughput. This is most likely due to the AES-NI CPU instruction set: IPsec can offload its encryption to AES-NI, whereas WireGuard cannot benefit from it. When AES-NI is unavailable, the result reverses dramatically.

    The good news is that as of Cilium 1.10, Cilium supports both IPsec and WireGuard, and you can choose whichever suits you.

Throughput benchmark

Disclaimer:

Benchmarking is difficult. Test results depend heavily on the hardware environment in which the tests are run. Unless the results are collected on the same system, comparisons should not be made directly in absolute terms.

Let’s start with the most common and obvious TCP throughput benchmark, measuring the maximum data transfer rate between containers running on different nodes.

The figure above shows the maximum throughput achieved with a single TCP connection, with the best configurations reaching just over 40 Gbit/s. These results are based on netperf's TCP_STREAM test, run over 100 Gbit/s network ports to ensure that the network adapter is not the bottleneck. Since a single netperf process transfers data over a single TCP connection, most of the network processing is done by a single CPU core. This means that the maximum throughput above is limited by the resources of a single core, which shows how much throughput each configuration can achieve when one CPU core becomes the bottleneck. Later in this article, the tests are expanded to use more CPU cores to remove this limitation.
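
For readers who want to run a similar single-stream measurement themselves, the sketch below shows one way to drive a TCP_STREAM run from Python. The remote address, test duration, and the assumption that netperf and netserver are installed are ours, not from the article.

```python
import subprocess

# Assumptions (not from the article): netserver is running on the remote
# node and netperf is installed on the local node.
REMOTE_HOST = "192.168.1.2"   # hypothetical address of the receiving node
DURATION = 60                  # seconds per run

def run_tcp_stream(host: str, duration: int) -> str:
    """Run a single-connection TCP_STREAM test and return netperf's output."""
    cmd = [
        "netperf",
        "-H", host,            # remote netserver to test against
        "-t", "TCP_STREAM",    # bulk-throughput test over one TCP connection
        "-l", str(duration),   # test length in seconds
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    print(run_tcp_stream(REMOTE_HOST, DURATION))
```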

Throughput achieved using the high-performance eBPF configuration is even slightly higher than node-to-node throughput. This is very surprising: it is generally accepted that container networking adds overhead compared to node-to-node networking. Let's set this puzzle aside for now and return to it after further analysis.

CPU resources required for a 100 Gbit/s transfer rate

The results of the TCP_STREAM benchmark already hint at which configurations are most efficient at achieving high transfer rates, but let's also look at the overall CPU consumption of the system while running the benchmark.

The figure above shows the CPU utilization required by the entire system to achieve 100 Gbit/s of throughput. Note that this is different from the previous figure, which showed CPU consumption at whatever throughput each configuration reached; here, all CPU usage has been normalized to a constant transfer rate of 100 Gbit/s so the configurations can be compared directly. The shorter the bar, the more efficient the configuration is at 100 Gbit/s.
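
As a concrete illustration of this normalization (our own sketch, not the authors' tooling, and with hypothetical numbers): if a configuration reaches 40 Gbit/s while the system sits at 30% CPU, its cost extrapolated to a constant 100 Gbit/s is 30% × (100 / 40) = 75%.

```python
def cpu_at_100gbit(cpu_percent: float, measured_gbit_s: float) -> float:
    """Linearly extrapolate measured CPU usage to a constant 100 Gbit/s rate."""
    return cpu_percent * (100.0 / measured_gbit_s)

# Hypothetical numbers, for illustration only (not the article's data):
print(cpu_at_100gbit(cpu_percent=30.0, measured_gbit_s=40.0))  # -> 75.0
```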

Note: TCP stream performance is usually limited by the receiving side, because the sending side can use TSO super packets. This can be observed in the increased CPU overhead on the server (receiving) side in the tests above.

The significance of TCP throughput benchmarks

While most users are unlikely to encounter these throughput levels on a regular basis, such benchmarks are important for specific types of applications:

  • AI/ML applications that require access to large amounts of data
  • Data upload/download service (backup service, VM image, container image service, etc.)
  • Streaming services, especially 4K+ resolution streaming

Later in this article, we will continue with measuring latency (requests per second) and the new connection processing rate to better characterize the performance of typical microservice workloads.

Does the container network add overhead?

In the analysis of the first benchmark, we mentioned that container networking adds some overhead compared to node-to-node networking. Why is that? Let's compare the two network models from an architectural perspective.

The figure above shows that the container network must also perform all of the processing done by node-to-node networking, and that this processing takes place inside the container's network namespace (dark blue).

Since node-level network processing also has to happen inside the container's network namespace, anything performed outside that namespace is essentially extra overhead. The figure above shows the network path of Linux routing via veth devices. If you use a Linux bridge or OVS instead, the model looks slightly different, but the sources of overhead are fundamentally the same.

Breaking the rules: eBPF host routing

In the benchmarks above, you may have wondered about the difference between the Cilium eBPF and Cilium eBPF legacy host routing configurations, and why the native Cilium eBPF data path is so much faster than legacy host routing. The native Cilium eBPF data path uses an optimization called eBPF host routing, shown below:

eBPF host routing bypasses all of iptables and the upper network stack in the host namespace, as well as some of the context switching when crossing the veth pair, saving resources. Network packets are captured as soon as they arrive at the network-facing interface and delivered directly into the network namespace of the Kubernetes Pod. On egress, packets still cross the veth pair, are captured by eBPF, and are delivered directly to the external-facing network interface. eBPF queries the routing table directly, so this optimization is completely transparent and coexists with any other routing daemons running on the system. For details on how to enable this feature, see eBPF Host Routing in the tuning guide.
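
To quickly check which mode a running agent ended up in, something along the lines of the sketch below can help. It assumes the usual cilium DaemonSet in kube-system and that the `cilium status` output contains a "Host Routing" line (BPF vs. Legacy), as described in the tuning guide; adjust names to your cluster.

```python
import subprocess

def cilium_host_routing_mode(namespace: str = "kube-system") -> str:
    """Return the 'Host Routing' line from `cilium status` inside the agent pod.

    Assumes a DaemonSet named 'cilium' in the given namespace and that the
    status output contains a line such as 'Host Routing: BPF' or
    'Host Routing: Legacy'.
    """
    out = subprocess.run(
        ["kubectl", "-n", namespace, "exec", "ds/cilium", "--", "cilium", "status"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        if "Host Routing" in line:
            return line.strip()
    return "Host Routing line not found"

if __name__ == "__main__":
    print(cilium_host_routing_mode())
```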

Calico eBPF uses a somewhat similar iptables bypass, but it does not work exactly the same way as Cilium's, as described later in this article. Either way, the test results show that bypassing slow kernel subsystems such as iptables brings significant performance gains.

Line Rate approaching 100 Gbit/s

So far we have analyzed results involving only a single CPU core. Next we drop the single-core restriction and parallelize the TCP streams by running multiple netperf processes.

Note: Since the hardware has 32 threads, we deliberately chose 32 processes to ensure that the load is evenly distributed on the system.
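
A minimal sketch of such a fan-out, under the same assumptions as the earlier single-stream example (netperf installed, netserver reachable); the process count simply mirrors the 32 hardware threads mentioned above.

```python
import subprocess

REMOTE_HOST = "192.168.1.2"  # hypothetical receiver address
PROCESSES = 32               # one netperf process per hardware thread
DURATION = 60                # seconds per run

def run_parallel_tcp_stream() -> None:
    """Start PROCESSES netperf TCP_STREAM runs in parallel and wait for all."""
    procs = [
        subprocess.Popen(
            ["netperf", "-H", REMOTE_HOST, "-t", "TCP_STREAM", "-l", str(DURATION)],
            stdout=subprocess.PIPE, text=True,
        )
        for _ in range(PROCESSES)
    ]
    for p in procs:
        out, _ = p.communicate()
        print(out)

if __name__ == "__main__":
    run_parallel_tcp_stream()
```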

The figure above does not reveal much beyond showing that all of the tested configurations can reach rates close to the 100 Gbit/s line rate if sufficient CPU resources are devoted to them. However, we can still see differences in how efficiently those CPU resources are used.

Note that the CPU usage in the figure above covers all CPU consumption, including that of the running netperf processes as well as the CPU needed by the workload to perform network I/O. It does not, however, include the CPU that an application would normally spend on its own business logic.

Measuring latency: requests per second

Requests per second is almost the exact opposite of the throughput metric. It measures the rate of sequential single-byte round-trips over a single persistent TCP connection, and thus demonstrates how efficiently network packets are processed: the lower the latency of an individual packet, the more requests can be handled per second. Optimizing for both throughput and latency is often a trade-off. Larger buffers are ideal for maximum throughput, but they increase latency, a phenomenon known as bufferbloat. Cilium provides a feature called Bandwidth Manager that automatically configures fair queueing, enforces Pod rate limits based on earliest departure time (EDT), and tunes TCP stack settings for server workloads to strike an optimal balance between throughput and latency.
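
Returning to the measurement itself, the round-trip test can be driven much like the earlier throughput runs. The sketch below (ours, with a hypothetical host and duration) requests single-byte request/response transactions over one persistent connection; the connection-rate benchmark later in the article differs only in the test name, since TCP_CRR opens a new connection for every transaction.

```python
import subprocess

REMOTE_HOST = "192.168.1.2"  # hypothetical receiver address

def run_rr_test(test_name: str = "TCP_RR", duration: int = 60) -> str:
    """Run a request/response test with 1-byte requests and responses.

    test_name can be "TCP_RR" (one persistent connection) or "TCP_CRR"
    (a new TCP connection per transaction, used later in the article).
    """
    cmd = [
        "netperf", "-H", REMOTE_HOST, "-t", test_name, "-l", str(duration),
        "--",          # separator for test-specific options
        "-r", "1,1",   # 1-byte request, 1-byte response
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    print(run_rr_test("TCP_RR"))
```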

This benchmark is often overlooked, but it is often much more important to users than you might think, because it emulates a very common microservice usage pattern: sending requests and responses between services using persistent HTTP or gRPC connections.

The following figure shows the TCP_RR performance of a single netperf process under different configurations:

Configurations that perform better in this test also achieve lower average latency. However, this is not enough to draw conclusions about p95 or p99 latencies; we will explore those in a future blog post.

We then ran 32 parallel netperf processes to utilize all available CPU cores. As you can see, performance improved for all configurations. Unlike the throughput test, however, throwing more CPU at this test cannot make up for a lack of efficiency, because the maximum rate is bounded by latency rather than by available CPU. We would see the same requests-per-second numbers even if network bandwidth became the bottleneck.

Overall, the results were very encouraging and Cilium was able to achieve a processing rate of nearly 1,000,000 requests per second via eBPF host routing on our test system.

CPU flame graph comparison of Cilium eBPF and Calico eBPF

Overall, Cilium eBPF and Calico eBPF perform roughly the same. Is that because they use the same data path? No. There is no predefined eBPF data path: eBPF is a programming language and runtime engine that lets you build data path features and much more, and the Cilium and Calico eBPF data paths are very different. In fact, Cilium offers many features that Calico eBPF does not support, and even in how they interact with the Linux networking stack there are significant differences. We can analyze this further by looking at the CPU flame graphs of both.

Cilium eBPF (receive path)

Cilium's eBPF host routing provides an almost context-switch-free path for transferring data from the network card to the application's socket, which is why the entire receive path fits neatly into a single flame graph above. The flame graph also shows the eBPF, TCP/IP, and socket processing blocks.

Calico eBPF (Receive path)

The Calico eBPF receive path looks different. Although the same kind of eBPF processing block executes the eBPF program, the Calico receive path crosses an additional veth device that the Cilium eBPF data path does not need on the receive side.

The processing in the figure above still runs in the host context. The flame graph below shows the work inside the Pod, resumed via process_backlog. Although the work itself is the same as in the Cilium case (TCP/IP processing plus socket data transfer), crossing the veth requires an additional context switch.

If you wish to investigate further yourself, you can open the interactive flame graph SVG files linked in the original post to examine the details.

New connection processing rate

The connection processing rate benchmark builds on the requests-per-second benchmark, but establishes a new connection for each request. The results show the performance difference between using persistent connections and creating a new connection per request. Creating a new TCP connection involves many components in the system, making this by far the most stressful test for the whole system; it exercises most of the resources the system has available.

This test represents workloads that receive or initiate large numbers of TCP connections. A typical scenario is a publicly exposed service handling a large number of incoming client connections, such as an L4 proxy, or a service that creates many connections to external endpoints, such as a data crawler. This benchmark applies as little offloading to hardware as possible in order to stress the system and expose the maximum performance differences between configurations.

First, we run a single netperf process performing the TCP_CRR test.

The performance differences between configurations are already huge with a single process, and they grow even larger when more CPU cores are used. It is also clear that Cilium is once again able to compensate for the extra overhead of the network namespace and achieve nearly the same performance as the baseline configuration.

Follow-up plans: this CPU utilization surprised us and prompted further investigation in the 1.11 development cycle. It seems that whenever a network namespace is involved, there is additional overhead on the sending side. This overhead appears in all configurations involving network namespaces, so it is likely caused by kernel code paths shared by both Cilium and Calico. We will keep you posted on the progress of this investigation.

When we ran 32 netperf processes performing TCP_CRR tests in parallel to utilize all CPU cores, we observed something very interesting.

The connection processing rate of the baseline configuration dropped significantly and did not improve with additional CPU resources, even though the connection tracking table size was adjusted accordingly and we verified that the drop was not caused by hitting the connection tracking limit. We repeated the test many times with the same result. When we manually bypassed the iptables connection tracking table with a -j NOTRACK rule, the problem disappeared immediately and baseline performance recovered to 200,000 connections per second. So clearly, once the number of connections exceeds a certain threshold, the iptables connection tracking table becomes a bottleneck.
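
The article does not list the exact rules used. For illustration only, one common form of this bypass inserts NOTRACK rules into the raw table so that the benchmark traffic never enters the connection tracking table; the sketch below shows the general technique.

```python
import subprocess

# Illustrative only: these rules disable connection tracking for ALL traffic,
# which may be acceptable on a dedicated benchmark host but is far too broad
# for a production system. Scope them (e.g. by port or address) in practice.
NOTRACK_RULES = [
    ["iptables", "-t", "raw", "-I", "PREROUTING", "-j", "NOTRACK"],
    ["iptables", "-t", "raw", "-I", "OUTPUT", "-j", "NOTRACK"],
]

def apply_notrack() -> None:
    """Insert NOTRACK rules so benchmark flows bypass the conntrack table."""
    for rule in NOTRACK_RULES:
        subprocess.run(rule, check=True)  # requires root privileges

if __name__ == "__main__":
    apply_notrack()
```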

Note: the Calico eBPF data path results were not very stable in this test, and we do not yet know why; packet transmission was also unstable. We have excluded those results because they may or may not be accurate, and have invited the Calico team to look into the issue with us and re-test.

Considering that we are using a standard, unmodified application to handle the requests and transfer the data, 200,000 connections per second is a very good result. However, let's also look at the CPU consumption.

This benchmark shows the largest performance differences between configurations: to process 250,000 new connections per second, the system as a whole must spend anywhere from 33% to 90% of its available resources, depending on the configuration.

Since CPU consumption on the sending side is consistently higher than on the receiving side, we can conclude that, with the same resources, more connections can be received per second than can be initiated per second.

WireGuard vs. IPsec encryption overhead

Most people probably expect WireGuard to outperform IPsec, so let's first test WireGuard's performance under different maximum transmission units (MTUs).

There are some differences between configurations. Notably, Cilium in combination with kube-proxy performs better than Cilium alone. However, this difference is relatively small and can largely be compensated for by optimizing the MTU.

Here is the CPU resource consumption:

The results above show that with the same MTU, the CPU usage of the different configurations is very close, so the biggest gains come from optimizing the MTU. We also tested requests per second and observed the same pattern; interested readers can refer to the CNI performance benchmark section of the Cilium documentation.

Comparison between WireGuard and IPsec

Comparing the performance of WireGuard and IPsec is even more interesting. Cilium has supported IPsec for some time, and starting with 1.10 it also supports WireGuard. With everything else equal, it is interesting to put the two encryption schemes side by side.
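
As a rough sketch of how the choice is made (using the Helm values we believe Cilium 1.10 introduced; consult the official encryption documentation for the authoritative flags), switching between the two comes down to the encryption type set at install time.

```python
import subprocess

def install_cilium_with_encryption(mode: str = "wireguard") -> None:
    """Install Cilium with transparent encryption enabled.

    mode is "wireguard" or "ipsec". The Helm values below are, to the best of
    our knowledge, those of Cilium 1.10; check the official encryption docs
    before relying on them. IPsec additionally requires a pre-created key
    secret, and the Cilium Helm repo must already be added.
    """
    cmd = [
        "helm", "install", "cilium", "cilium/cilium",
        "--namespace", "kube-system",
        "--set", "encryption.enabled=true",
        "--set", f"encryption.type={mode}",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    install_cilium_with_encryption("wireguard")
```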

As expected, WireGuard's throughput is higher, achieving a higher maximum transfer rate under both MTU configurations.

Next, we measured the CPU usage of WireGuard and IPsec at the same 10 Gbit/s throughput under different MTU configurations.

While WireGuard reaches a higher maximum throughput, IPsec is more efficient at a given throughput, doing the same work with far less CPU overhead, and the difference is huge.

Note: for IPsec to be efficient, hardware supporting the AES-NI instruction set is required so that IPsec encryption can be offloaded to it. Future plans: it is not yet clear why IPsec's higher efficiency does not translate into higher throughput. Using more CPU cores did not significantly improve performance either. This is most likely because RSS cannot distribute the encrypted traffic across CPU cores: the L4 information normally used to hash flows onto different cores is encrypted and cannot be parsed, so from the hash's point of view all connections look the same, since only two IP addresses are involved in the test.
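
To see why the hashing degenerates, here is a toy illustration (our own; real NICs use a Toeplitz hash for RSS, not Python's hash): when the L4 ports are visible, distinct connections map to different receive queues, but once only the two IP addresses are hashable, every flow lands on the same queue.

```python
# Toy model of receive-side scaling (RSS). The point is only that the hash
# input shrinks once the L4 headers are hidden inside the encrypted payload.
NUM_QUEUES = 8

def rss_queue(fields: tuple) -> int:
    """Map a tuple of hashable header fields onto a receive queue."""
    return hash(fields) % NUM_QUEUES

plaintext_flows = [("10.0.0.1", "10.0.0.2", 40000 + i, 12865) for i in range(8)]
encrypted_flows = [("10.0.0.1", "10.0.0.2")] * 8  # ports hidden by encryption

print({rss_queue(f) for f in plaintext_flows})   # typically spread over queues
print({rss_queue(f) for f in encrypted_flows})   # always a single queue
```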

Does this affect latency? Let's look into it further. Latency benchmarks best reflect the reality of microservice workloads, which typically exchange requests and responses over persistent connections.

The CPU efficiency observed is consistent with the requests-per-second results. However, the total CPU consumption of each configuration in this test is not very high, and the difference in latency is more significant than the difference in CPU consumption.

The test environment

Here is the bare-metal configuration we used. We built two identical systems that were directly connected to each other.

  • CPU: AMD Ryzen 9 3950X, AM4 platform, 3.5 GHz, 16 cores / 32 threads
  • Motherboard: X570 Aorus Master, supporting PCIe 4.0 x16
  • Memory: HyperX Fury DDR4-3200 128 GB, XMP clocked at 3.2 GHz
  • NIC: Intel E810-CQDA2, dual-port, 100 Gbit/s per port, PCIe 4.0 x16
  • Operating system kernel: Linux 5.10 LTS (configured with CONFIG_PREEMPT_NONE)

All tests used a standard 1500-byte MTU unless otherwise noted. Although higher MTU values yield better absolute numbers, the purpose of the benchmarks in this article is to compare relative differences between configurations, not to chase the highest (or lowest) possible absolute performance.

Test configuration

At the request of our users, we also present Calico's test results for comparison. To make the comparison as clear as possible, we used the following configuration types:

  • Baseline configuration (node to node): No Kubernetes or containers; netperf runs directly on the bare-metal host. In general, this configuration gives the best performance.
  • Cilium eBPF: Cilium 1.9.6, tuned according to the tuning guide with eBPF host routing and kube-proxy replacement enabled. Requires an operating system kernel of 5.10 or later. This configuration is most comparable to the Calico eBPF configuration. We focused on benchmarking the direct routing mode, since that is where performance usually matters most; tunneling mode will be benchmarked separately later.
  • Cilium eBPF (legacy host routing): Cilium 1.9.6 running in legacy host routing mode with standard kube-proxy, supported on 4.9 and later kernels. This configuration is most comparable to the Calico configuration.
  • Calico eBPF: Calico 3.17.3, using the eBPF data path with kube-proxy replacement, connection tracking bypass, and eBPF FIB lookup enabled. Requires an operating system kernel of 5.3 or later. This configuration is most comparable to the Cilium eBPF configuration.
  • Calico: Calico 3.17.3 with standard kube-proxy, supporting older operating system kernels. This configuration is most comparable to Cilium eBPF (legacy host routing).

Reproducing the test results

All scripts used in these tests have been uploaded to the GitHub repository cilium/cilium-perf-networking and can be used to reproduce the results.

Next steps

We have achieved a lot in terms of performance tuning, but we have many more ideas and will continue to optimize Cilium's performance in all aspects:

  • Observability benchmarks: Pure network benchmarks are necessary, but the resource consumption required to provide observability is what really distinguishes one system from another. Observability is a key requirement for infrastructure, both for security and for troubleshooting, and its resource cost varies greatly between systems. eBPF is an excellent tool for observability, and Cilium's Hubble component benefits from it. In this article's benchmarks we disabled Hubble so that the results could be compared with Calico. In follow-up articles we will benchmark Hubble to verify its CPU requirements and compare it with similar systems.

  • Service and NetworkPolicy benchmarks: The current benchmarks do not involve any Services or NetworkPolicies. We left both out to keep the scope of this article manageable. We will follow up with tests covering NetworkPolicy as well as east-west and north-south Services. If you cannot wait, the Cilium 1.8 release blog already includes some benchmark results showing significant performance improvements from XDP and eBPF.

    We are currently not satisfied with NetworkPolicy performance for CIDR rules. Our current architecture is optimized for a small number of complex CIDRs, but it does not cover well some of the special cases handled by the LPM implementation. Some users may want to benchmark large allow and deny lists of individual IP addresses; we will prioritize this use case and provide a hash-based implementation.

  • Memory optimization: We will continue to optimize Cilium's memory footprint. Cilium's main memory consumers are the eBPF maps, the kernel-level data structures required for network processing. Their sizes are currently pre-set based on configuration so that the required memory is kept to what the configuration needs. We are not entirely happy with this yet, and it will be a focus of upcoming releases.

  • Breaking more rules: bypassing iptables entirely: We believe iptables should be bypassed completely. The container network namespace and other parts of the system still hold optimization potential. We will also continue our work on accelerating the data path for service mesh workloads; a rudimentary version of socket-level redirection for Envoy already exists. Expect progress on this front.

  • Ideas and suggestions: If you have other ideas or suggestions, for example benchmarks or improvements you would like to see, please let us know. We would love to hear your feedback. You can reach us on Cilium Slack or on Twitter.

For more information

  • All of the resulting data for this article is available in the CNI Performance Benchmarking section of the Cilium documentation. We will continue to update these data.
  • The Tuning guide provides a complete tutorial for tuning Cilium.
  • For more information about Cilium, please refer to the official Cilium documentation.
  • For more information about eBPF, visit eBPF’s official website.

KubeSphere community events notice

To stay in close touch with old and new friends in the community, we are joining hands with CNCF and other partners to bring technical exchanges to Shanghai, Hangzhou, Shenzhen, and Chengdu from May to July 2021. After the first Meetup in Shanghai, we will continue the "KubeSphere and Friends" theme and bring the Kubernetes and Cloud Native Meetup to Hangzhou on May 29.

We have prepared a full set of KubeSphere commemorative souvenirs: T-shirts, mugs, badges, canvas bags, masks, and more. There are also plenty of hardcore technical books waiting for you!

Tempted? Sign up for the upcoming Hangzhou event to get your customized souvenirs!
