Kubernetes Service [1] provides mutual access and load balancing between services in a cluster. The community currently implements three kube-proxy modes: userspace, iptables, and IPVS. The IPVS mode delivers the best performance, but there is still room for optimization: the IPVS kernel module performs DNAT, while nf_conntrack/iptables performs SNAT. nf_conntrack is designed for general-purpose connection tracking; its internal states and processing are complex, which causes a significant performance loss.

The Tencent TKE team [2] developed a new IPVS-BPF mode that completely bypasses nf_conntrack and uses eBPF to perform SNAT. For the most common scenario, a Pod accessing a ClusterIP, short-connection performance improves by 40% and P99 latency drops by 31%. NodePort scenarios improve even more. For details, see the table below and the Performance Measurement section.

I. Current state of container networking

The iptables mode

Existing problems:

1. Poor scalability. When the number of Services reaches the thousands, performance degrades dramatically on both the control plane and the data plane. On the control plane, the iptables interface requires traversing and rewriting all rules every time a rule is added, so updates are O(n²). On the data plane, rules are organized as a linked list, so lookups are O(n).

2. The load-balancing algorithm supports only random forwarding.

IPVS mode

IPVS was designed specifically for load balancing. It uses hash tables to manage Services, so adding, deleting, and looking up a Service is O(1). However, the IPVS kernel module has no SNAT capability, so it borrows SNAT from iptables: after IPVS performs DNAT on a packet, the connection information is stored in nf_conntrack, and iptables performs SNAT based on that state. This mode is currently the best choice for Kubernetes network performance, but the complexity of nf_conntrack still brings a large performance loss.

II. Introduction to the IPVS-BPF scheme

Introduction to eBPF

eBPF [3] is a software virtual machine inside the Linux kernel. Users compile eBPF programs into eBPF bytecode and load it at specific kernel hook points via the bpf() system call; specific events then trigger execution of the eBPF instructions. When a program is attached, the kernel thoroughly verifies it to prevent eBPF code from compromising kernel security and stability. The kernel also JIT-compiles the eBPF instructions into native instructions to reduce execution overhead.

The kernel provides many eBPF hook points along the network processing path, such as XDP, qdisc (tc), TCP-BPF, and sockets. eBPF programs loaded at these hook points can call specific kernel-provided helpers to inspect, modify, and control network packets, and can store and exchange data through map data structures.
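To make this concrete, here is a minimal, generic eBPF example in C (not taken from the IPVS-BPF code): an XDP program that counts packets in a map, which user space can then read. The map and function names are illustrative only.

```c
// Generic eBPF example: count packets at the XDP hook point and store the
// counter in a map. Compiled with clang -target bpf and loaded through the
// bpf() system call (typically via libbpf).
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, __u64);
} pkt_count SEC(".maps");

SEC("xdp")
int count_packets(struct xdp_md *ctx)
{
    __u32 key = 0;
    __u64 *val = bpf_map_lookup_elem(&pkt_count, &key);

    if (val)
        (*val)++;       /* data stored in the map can be read from user space */

    return XDP_PASS;    /* hand the packet on unchanged */
}

char _license[] SEC("license") = "GPL";
```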

The IPVS-BPF optimization scheme based on eBPF

The Tencent TKE team designed and implemented IPVS-BPF to solve the performance problems caused by nf_conntrack. The core idea is to bypass nf_conntrack and reduce the number of instructions needed to process each packet, saving CPU and improving performance. The main logic is as follows:

1. A switch is introduced in the IPVS kernel module to toggle between the native IPVS logic and the IPVS-BPF logic.

2. In IPVS-BPF mode, the IPVS hook point is moved from LOCAL_IN to PREROUTING, so that nf_conntrack is bypassed.

3. The IPVS connection creation and deletion code adds or removes session information in an eBPF map.

4. eBPF SNAT code is attached at the qdisc (tc) hook and performs SNAT based on the session information in the eBPF map (a simplified sketch follows this list).
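The following is a heavily simplified sketch of the idea in step 4, not the actual TKE implementation: a tc/qdisc eBPF program that looks up the connection in a hash map keyed by the packet's tuple (in IPVS-BPF the session entries are written by the modified IPVS module; here the map is only declared) and rewrites the source address. The map name, key layout, and TCP-only handling are assumptions for illustration.

```c
// Simplified SNAT sketch at the tc/qdisc hook. IP options and source-port
// rewriting are omitted for brevity.
#include <stddef.h>
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct sess_key {
    __u32 saddr, daddr;
    __u16 sport, dport;
};

struct sess_val {
    __u32 snat_addr;    /* source address to write into the packet */
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1 << 16);
    __type(key, struct sess_key);
    __type(value, struct sess_val);
} snat_sessions SEC(".maps");

SEC("tc")
int do_snat(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return TC_ACT_OK;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_TCP)
        return TC_ACT_OK;

    struct tcphdr *tcp = (void *)(ip + 1);  /* ignore IP options for brevity */
    if ((void *)(tcp + 1) > data_end)
        return TC_ACT_OK;

    struct sess_key key = {
        .saddr = ip->saddr, .daddr = ip->daddr,
        .sport = tcp->source, .dport = tcp->dest,
    };
    struct sess_val *val = bpf_map_lookup_elem(&snat_sessions, &key);
    if (!val)
        return TC_ACT_OK;                   /* no session: leave packet alone */

    /* Rewrite the source address and fix the IP and TCP checksums. */
    __u32 old = ip->saddr;
    int ip_off       = (char *)ip - (char *)data;
    int l3_csum_off  = ip_off + offsetof(struct iphdr, check);
    int addr_off     = ip_off + offsetof(struct iphdr, saddr);
    int l4_csum_off  = (char *)tcp - (char *)data + offsetof(struct tcphdr, check);

    bpf_l4_csum_replace(skb, l4_csum_off, old, val->snat_addr,
                        BPF_F_PSEUDO_HDR | sizeof(__u32));
    bpf_l3_csum_replace(skb, l3_csum_off, old, val->snat_addr, sizeof(__u32));
    bpf_skb_store_bytes(skb, addr_off, &val->snat_addr, sizeof(__u32), 0);

    return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";
```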

In addition, ICMP and fragmented packets receive special handling. The background and further details will be presented at the upcoming QCon online conference [4]; everyone is welcome to join the discussion.

Comparison of packet processing flow before and after optimization

The packet processing flow is greatly simplified.

Why not just go full eBPF

Many readers will ask why we combined the IPVS module with eBPF rather than implementing Services directly in eBPF.

We studied this question carefully at the beginning of the design, with the following considerations:

• nf_conntrack consumes far more CPU instructions and adds more latency than the IPVS module, and is the number-one performance killer on the forwarding path. IPVS itself is designed for high performance and is not the bottleneck.

• IPVS has a history of nearly 20 years and is widely used in production environments; its performance and maturity are proven.

• IPVS ages out its session table internally with timers, whereas eBPF does not support timers and would have to age its session table from user-space code (see the sketch after this list).

• IPVS supports a rich set of scheduling policies; re-implementing them in eBPF would add a large amount of code, and many of these policies require loop statements, which eBPF does not support.
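To illustrate the timer point above, the sketch below shows roughly what user-space aging of an eBPF session table looks like: a daemon periodically walks the map and deletes stale entries. This is an illustration under assumptions, not TKE code; the pinned map path and key/value layouts are hypothetical.

```c
// User-space aging loop for an eBPF hash map, using libbpf. Deleting entries
// while iterating with bpf_map_get_next_key() is acceptable for a sketch but
// can restart the iteration in a real system.
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <bpf/bpf.h>

struct sess_key { uint32_t saddr, daddr; uint16_t sport, dport; };
struct sess_val { uint32_t snat_addr; uint64_t last_seen_ns; };

static void age_once(int map_fd, uint64_t now_ns, uint64_t timeout_ns)
{
    struct sess_key key, next;
    struct sess_val val;
    int has_prev = 0;

    /* Passing NULL as the current key starts the iteration. */
    while (bpf_map_get_next_key(map_fd, has_prev ? &key : NULL, &next) == 0) {
        if (bpf_map_lookup_elem(map_fd, &next, &val) == 0 &&
            now_ns - val.last_seen_ns > timeout_ns)
            bpf_map_delete_elem(map_fd, &next);
        key = next;
        has_prev = 1;
    }
}

int main(void)
{
    /* Hypothetical pinned-map path. */
    int map_fd = bpf_obj_get("/sys/fs/bpf/snat_sessions");
    if (map_fd < 0)
        return 1;

    for (;;) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        uint64_t now = (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
        age_once(map_fd, now, 300ull * 1000000000ull /* 5-minute timeout */);
        sleep(10);
    }
}
```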

Our goal was to land the optimization with a manageable amount of code. Based on the above considerations, we chose to reuse the IPVS module, bypass nf_conntrack, and implement SNAT with eBPF. The final data-plane code was 500+ lines of BPF code and 1000+ lines of changes to the IPVS module (most of which support SNAT map management).

III. Performance measurement

This chapter takes a quantitative approach: the perf tool reads the CPU performance counters to explain the macro-level performance numbers from a micro-level perspective. wrk and iperf are used as the benchmarking tools.
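For readers who want to reproduce the micro-level numbers, the sketch below shows one way to read the hardware instruction and cycle counters behind the "Instruction count and CPI" figures via perf_event_open(2). This is a generic illustration under assumptions, not the measurement harness used for this article; the perf CLI exposes the same counters through perf stat.

```c
// Read the hardware instruction and cycle counters for the current thread
// and compute CPI (cycles per instruction).
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_counter(unsigned long long config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_hv = 1;
    /* count for the calling thread on any CPU */
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int instr_fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    int cycle_fd = open_counter(PERF_COUNT_HW_CPU_CYCLES);
    long long instr = 0, cycles = 0;

    if (instr_fd < 0 || cycle_fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(instr_fd, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(cycle_fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the workload being measured here ... */

    ioctl(instr_fd, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(cycle_fd, PERF_EVENT_IOC_DISABLE, 0);

    read(instr_fd, &instr, sizeof(instr));
    read(cycle_fd, &cycles, sizeof(cycles));

    /* Fewer instructions per packet and lower CPI both mean less CPU spent. */
    printf("instructions=%lld cycles=%lld CPI=%.2f\n",
           instr, cycles, instr ? (double)cycles / instr : 0.0);
    return 0;
}
```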

The test environment

Two points need to be noted in reproducing the test:

1. Different clusters and machines of the same model can show different baseline performance because of differences in host and rack topology. To reduce the error introduced by such differences, we compared IPVS and IPVS-BPF modes using the same cluster, the same set of backend Pods, and the same LB node: we first measured performance in IPVS mode, then switched the LB node to IPVS-BPF mode and measured again. (Note: the switch was performed in the backend by changing the control plane from kube-proxy to kube-proxy-bpf; it is not a supported product feature.)

2. The goal of the test is to measure the impact of the software optimization on the performance of accessing a Service through the LB node, so the bandwidth and CPU of the client and the backend (RS) servers must not become bottlenecks. Therefore, the LB node under load uses a 1-core instance type and runs no backend Pods, while the nodes running the backend services use 8-core instance types.

NodePort

ClusterIP

Here the LB node (the node on the left) uses an SA2 1-core, 1 GB instance type.

The measured results

In IPVS-BPF mode, NodePort short-connection performance improves by 64% and ClusterIP short-connection performance improves by 40%.

The NodePort improvement is larger because NodePort requires SNAT, and our eBPF SNAT is more efficient than iptables SNAT, so the gain is greater.

As shown in the figure above, IPVS-BPF mode improves iperf bandwidth by 22% over IPVS mode.

In the figure above, wrk tests show a 47% reduction in P99 latency for NodePort short connections.

In the figure above, wrk tests show a 31% reduction in P99 latency for ClusterIP short connections.

Instruction count and CPI

Test summary

Service type    Short-connection CPS    Short-connection P99 latency    Long-connection throughput
ClusterIP       +40%                    -31%                            not measured (see below)
NodePort        +64%                    -47%                            +22%

As the table shows, compared with the native IPVS mode, IPVS-BPF improves NodePort short-connection performance by 64%, reduces P99 latency by 47%, and increases long-connection bandwidth by 22%. For ClusterIP, short-connection throughput increases by 40% and P99 latency decreases by 31%.

When testing ClusterIP long-connection throughput, iperf itself consumed 99% of the CPU, so the effect of the optimization was difficult to measure directly. We also observed an increase in CPI in IPVS-BPF mode, which deserves further study.

IV. Other optimizations, feature limitations, and follow-up work

In the course of developing the IPVS-BPF solution, several other issues were also fixed or optimized:

• Poor new-connection performance when conn_reuse_mode = 1 [5], and the "no route to host" problem [6]

The problem is that when a client initiates a large number of new TCP connections, new connections may be forwarded to Pods in the Terminating state, causing continuous packet loss. This does not occur with conn_reuse_mode=1, but conn_reuse_mode=1 has another bug that drastically degrades new-connection performance, so conn_reuse_mode=0 is generally used. We have fully fixed the problem in the TencentOS kernel with commits ef8004f8 [7], 8ec35911 [8], and 07a6e5ff63 [9], and are submitting the fixes to the upstream kernel community.

• Occasional 5-second delay in DNS resolution [10]

iptables SNAT allocates a local port (lport) and inserts the entry into nf_conntrack using optimistic locking. If a race occurs and the same lport and 5-tuple are inserted into nf_conntrack at the same time, packets are dropped. In IPVS-BPF mode, the SNAT lport selection and the hash-table insertion are retried together up to five times, which reduces the probability of this problem; a schematic sketch follows.
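The retry idea can be sketched roughly as below. This is schematic only, with hypothetical names and a hypothetical port-selection policy; in IPVS-BPF the real logic lives in the modified IPVS module. The key point is that an insert with BPF_NOEXIST fails if the entry already exists, so a colliding (lport, 5-tuple) choice is detected and retried instead of causing a dropped packet.

```c
// Schematic port-allocation retry against an eBPF hash map. The loop is
// unrolled so the sketch also works on kernels without bounded-loop support.
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

#define SNAT_MAX_RETRIES 5

struct snat_key { __u32 saddr, daddr; __u16 sport, dport; };
struct snat_val { __u16 lport; };

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1 << 16);
    __type(key, struct snat_key);
    __type(value, struct snat_val);
} snat_map SEC(".maps");

/* Try up to SNAT_MAX_RETRIES times to pick a local port and atomically claim
 * the resulting tuple in the hash map. Ports kept in host order for brevity. */
static __always_inline __u16 pick_lport(struct snat_key *key)
{
    struct snat_val val = {};

#pragma unroll
    for (int i = 0; i < SNAT_MAX_RETRIES; i++) {
        /* hypothetical policy: random port in 32768..60999 */
        val.lport = 32768 + bpf_get_prandom_u32() % (61000 - 32768);
        key->sport = val.lport;

        /* atomic "insert only if absent": a lost race returns -EEXIST */
        if (bpf_map_update_elem(&snat_map, key, &val, BPF_NOEXIST) == 0)
            return val.lport;
    }
    return 0;   /* give up after five attempts */
}

SEC("tc")
int snat_alloc_demo(struct __sk_buff *skb)
{
    struct snat_key key = { .saddr = 1, .daddr = 2, .dport = 80 };

    return pick_lport(&key) ? TC_ACT_OK : TC_ACT_SHOT;
}

char _license[] SEC("license") = "GPL";
```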

• CLB health checks failing because of the externalIP optimization [11]

Details: github.com/kubernetes/… 07864

Feature limitations

• When a Pod accesses a Service that it belongs to, IPVS-BPF forwards the request to other Pods, not back to the Pod itself.

Follow-up work

• Further optimize ClusterIP performance with eBPF, drawing on Cilium's approach

• Investigate the cause of the CPI increase in IPVS-BPF mode and explore the possibility of further performance improvements

V. How to enable IPVS-BPF mode in TKE

As shown in the figure below, when creating a cluster in the Tencent Cloud TKE console [12], select ipvs-bpf as the kube-proxy proxy mode under Advanced Settings.

Currently this feature requires a whitelist application; please submit your request through the application page [13].

VI. Related patents

The related patent applications generated by this product are as follows:

2019050831CN A message transmission method and related devices

2019070906CN Load balancing method, apparatus, device, and storage medium

2020030535CN A method for detecting idle network service applications using eBPF technology

2020040017CN Host real-time load awareness adaptive load balancing scheduling algorithm