This article is compiled from Meituan-Dianping Technology Salon No. 14: The Story Behind Meituan – The Meituan Cloud You Don’t Know.
The Meituan-Dianping Technology Salon is hosted by the Meituan-Dianping technical team every month. Each salon invites technical experts from Meituan-Dianping and other Internet companies to share front-line practical experience, covering all major technical fields.
At present, the salon is held in Beijing, Shanghai, Xiamen, and other cities. Would you like to join the next salon? Please follow the WeChat public account “Meituan-Dianping Technical Team”.
This salon included three talks: the Meituan Cloud Docker platform, the Meituan Cloud object storage system, and MGW, Meituan’s layer-4 load-balancing gateway. The other talks will be published later; please stay tuned.
Preface
In the era of the rapidly developing mobile Internet, load balancing plays an important role: it is the entry point for application traffic and is decisive for an application’s reliability and performance. A load balancer therefore needs to be both high-performance and highly reliable. MGW is a layer-4 load balancer developed in-house by Meituan-Dianping, built mainly to replace LVS, the layer-4 load balancer in the original environment. It currently carries tens of Gbps of Meituan-Dianping traffic and tens of millions of concurrent connections. This article introduces how MGW achieves high performance and high reliability.
What is load balancing?
In the early days of the Internet, traffic was small and business logic was simple, so a single server could meet basic needs. As the Internet grew, however, traffic increased and business logic became more complex, and single-machine performance limits and single points of failure became prominent problems. Multiple machines were needed to scale performance horizontally and avoid single points of failure. But how do you distribute traffic from different users across different servers?
The early approach was to use DNS for load distribution: by resolving different IP addresses for different clients, DNS steered their traffic to different servers. However, this method has a serious drawback: latency. After a scheduling policy changes, the caches at the various DNS levels do not expire on clients in time. Moreover, DNS scheduling policies are relatively simple and cannot meet service requirements. This is why dedicated load balancers emerged.
Client traffic first reaches the load-balancing server, which distributes it to different application servers according to a scheduling algorithm. The load-balancing server can also run periodic health checks on the application servers and dynamically remove failed nodes from the application-server cluster, ensuring the application’s high availability.
Load balancing is divided into layer-4 load balancing and layer-7 load balancing. Layer-4 load balancing works at the transport layer of the OSI model; it forwards traffic to application servers by modifying the address information of the packets received from clients.
Layer-7 load balancing works at the application layer of the OSI model. Because it must parse application-layer traffic, a layer-7 load balancer also needs a complete TCP/IP protocol stack behind the point where client traffic is received. It establishes a complete connection with the client, parses the application-layer request traffic, selects an application server according to the scheduling algorithm, and then establishes another connection with that server to send the request. The main work of a layer-7 load balancer is therefore proxying.
Since the main work of layer-4 load balancing is forwarding, the question of forwarding mode arises. There are currently four main forwarding modes: DR mode, NAT mode, TUNNEL mode, and FULLNAT mode.
In DR mode, traffic is forwarded to the application server at layer 2 by modifying the destination MAC address of the packet. The application server can then send its response directly to the client, which yields high performance. Because this mode relies on layer-2 forwarding, the load balancer and the application servers must be in a layer-2 reachable environment, and the VIP must be configured on the application servers.
In NAT mode, traffic reaches the application server by changing the packet’s destination IP address to that of the application server, so the VIP does not need to be configured on the application servers. The drawback is that, because this mode changes the destination IP address, if the application server sent its reply directly to the client, the reply’s source IP would be the application server’s IP and the client would not accept it. The reply traffic must therefore pass back through the load balancer, which changes the reply’s source IP back to the VIP before sending it on to the client; only then is normal communication ensured. As a result, NAT mode requires the load balancer to exist on the network as a gateway.
The advantages and disadvantages of TUNNEL mode are the same as those of DR mode, and in addition the application servers must support the TUNNEL protocol.
FULLNAT mode adds source address translation (SNAT) on top of NAT mode. The advantage of SNAT is that response traffic can route back to the load balancer through normal layer-3 routing, so the load balancer no longer has to act as a gateway and places few demands on the network environment. The drawback is that the application server loses the client’s real IP address.
The FULLNAT mode is described in detail below. First, the load balancer needs a localip address pool; SNAT source IP addresses are chosen from this pool. When client traffic reaches the load balancer, the load balancer selects an application server from the server pool according to the scheduling policy and changes the packet’s destination IP address to that server’s address. At the same time, it selects an address from the localip pool and changes the packet’s source IP address to that localip. When the application server replies, the destination IP address of the reply is a localip that actually exists on the load balancer, so the reply can reach the load balancer through normal layer-3 routing. Because FULLNAT performs one more SNAT than NAT mode, including port selection during SNAT, its performance is somewhat lower than NAT mode’s. However, its strong adaptability to the network environment is why FULLNAT was chosen as MGW’s forwarding mode.
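The two FULLNAT rewrites can be sketched in Python as follows. All addresses, ports, and helper names here are invented for illustration; the real data path is of course packet-level code, not Python objects:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int

def fullnat_forward(pkt: Packet, rs_ip: str, rs_port: int,
                    lip: str, lport: int) -> Packet:
    """Client -> VIP packet: DNAT the destination to the chosen RS,
    and SNAT the source to a localip picked from the pool."""
    return replace(pkt, src_ip=lip, src_port=lport,
                   dst_ip=rs_ip, dst_port=rs_port)

def fullnat_reverse(pkt: Packet, vip: str, vport: int,
                    cip: str, cport: int) -> Packet:
    """RS -> localip reply: restore the VIP as the source and the
    client as the destination before sending the reply out."""
    return replace(pkt, src_ip=vip, src_port=vport,
                   dst_ip=cip, dst_port=cport)

# A request (cip:cport -> vip:vport) and the corresponding RS reply.
req = Packet("203.0.113.5", 40000, "198.51.100.1", 80)
fwd = fullnat_forward(req, "10.0.0.8", 8080, "172.16.0.10", 50000)
rsp = Packet("10.0.0.8", 8080, "172.16.0.10", 50000)
out = fullnat_reverse(rsp, "198.51.100.1", 80, "203.0.113.5", 40000)
```

Note that the reply’s destination (`172.16.0.10`) is an address owned by the load balancer, which is exactly what lets it return via ordinary layer-3 routing.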
Why choose self-developed layer 4 load balancing?
There are two main reasons for building a layer-4 load balancer in-house: first, the high cost of hardware load balancers; second, with Meituan-Dianping’s ever-growing traffic, LVS hit performance bottlenecks and its operations cost kept rising.
Hardware load balancing cost problem
- Hardware cost: low-end hardware load balancers cost in the hundreds of thousands, and high-end models in the millions, which is very expensive. Building a high-availability cluster takes several machines, so the cost becomes extreme.
- Labor cost: hardware load balancers are powerful and flexibly configurable, so professionally trained personnel are needed to maintain them, which increases labor cost.
- Time cost: when we hit a bug or a new requirement during use and need the vendor to provide a new version, we have to go through a tedious process of reporting to the vendor and then waiting for it to release an upgrade. The cycle is very long, which is unacceptable in the fast-moving Internet industry.
LVS performance issues
Initially, Meituan-Dianping used a load-balancing architecture composed of LVS + Nginx: LVS for layer-4 load balancing and Nginx for layer-7 load balancing. But with the rapid growth of Meituan-Dianping’s traffic (both the new-connection rate and the throughput tripled within a few months), LVS failed frequently and hit performance bottlenecks. We therefore developed MGW, a high-performance and highly reliable layer-4 load balancer, to replace LVS.
How does MGW achieve high performance
The following explains how MGW achieves high performance by examining some of LVS’s performance bottlenecks.
Interrupts and the long protocol-stack path
Interrupts are one of the most important factors affecting LVS performance. If we need to process 6 million packets per second, with one hardware interrupt raised for every six packets, that is 1 million hardware interrupts per second. Each interrupt breaks into the compute-intensive load-balancing program and produces a large number of cache misses, so the impact on performance is severe.
Meanwhile, LVS is developed on the kernel’s NetFilter, a hook framework running inside the kernel protocol stack. This means that by the time a packet reaches LVS, it has already gone through a long stretch of protocol-stack processing that LVS does not need, causing unnecessary performance loss.
The solution to both problems is a poll-mode driver plus kernel bypass, and the user-space PMD drivers provided by DPDK address exactly these two issues. DPDK also exploits many hardware-related features such as NUMA, memory channels, and DDIO, which greatly improve performance, and it provides a rich set of network libraries that reduce development difficulty and improve development efficiency. DPDK was therefore chosen as MGW’s development framework.
Locks
Because the kernel is general-purpose software, it is not custom-designed for specific scenarios, so some common data structures need lock protection. Below are the reasons locking is needed and MGW’s solution.
First, a brief introduction to Receive Side Scaling (RSS). RSS hashes packets to different NIC queues using the packets’ tuple information; different CPUs then read packets from their corresponding queues, making full use of CPU resources. As introduced earlier, MGW uses FULLNAT mode, which changes all of a packet’s tuple information, so the packets of the same connection in the request and reply directions may be hashed by RSS to different NIC queues. Different NIC queues mean processing by different CPUs, in which case accesses to the session structure must be protected by a lock.
There are two ways to solve this problem. The first is to choose the port lport0 so that RSS(cip0, cport0, vip0, vport0) equals RSS(dip0, dport0, lip0, lport0). The second is to assign one localip to each CPU: during SNAT, each CPU selects its own localip, and when the response comes back, packets with that destination IP are steered to the corresponding queue through a LIP-to-CPU mapping.
Since the second method happens to be supported by the NIC’s Flow Director feature, we chose it to remove the lock on the session structure.
Flow Director can send specified packets to a specified NIC queue according to rules, and its priority in the NIC is higher than RSS. Using it, we assign one localip to each CPU at initialization: lip0 to CPU0, lip1 to CPU1, lip2 to CPU2, and lip3 to CPU3. When a request packet (cip0, cport0, vip0, vport0) reaches the load balancer and is hashed by RSS to queue 0, it is processed by CPU0. During FULLNAT, CPU0 selects lip0 as the localip and sends the packet (lip0, lport0, dip0, dport0) to the application server. The server’s reply (dip0, dport0, lip0, lport0) returns to the load balancer, where a Flow Director rule steers any packet whose destination IP is lip0 to queue 0, so the reply is also processed by CPU0. Now the packets of both directions of a connection are handled completely serially by the same CPU, and the session structure no longer needs a lock.
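The steering rule can be pictured as a tiny lookup that takes priority over the RSS result. The addresses and queue numbers below are the hypothetical ones from the example above, not real configuration:

```python
# Hypothetical mapping set up at initialization: one localip per CPU/NIC queue.
LIP_TO_QUEUE = {
    "172.16.0.10": 0,  # lip0 -> queue 0 -> CPU0
    "172.16.0.11": 1,  # lip1 -> queue 1 -> CPU1
    "172.16.0.12": 2,  # lip2 -> queue 2 -> CPU2
    "172.16.0.13": 3,  # lip3 -> queue 3 -> CPU3
}

def dispatch_queue(dst_ip: str, rss_queue: int) -> int:
    """Flow-Director-style rule: a reply addressed to a localip overrides
    the RSS result, so both directions of a connection hit the same queue."""
    return LIP_TO_QUEUE.get(dst_ip, rss_queue)
```

A reply to lip0 always lands on queue 0 regardless of what RSS computed, while ordinary traffic falls through to the RSS-chosen queue.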
Context switch
In the design, we want the control plane and the data plane to be completely separated, with the data plane concentrating on its own processing, uninterrupted by any events. The CPUs are therefore divided into two groups, one for the data plane and one for the control plane. The data-plane CPUs are isolated so that control-plane processes can never be scheduled onto them, and each data-plane thread is pinned to a CPU of its own. Everything else on the control plane, such as the Linux kernel and SSH, runs on the control-plane CPUs.
How does MGW achieve high reliability
The following describes how MGW achieves high reliability at three layers: the MGW cluster, a single MGW machine, and the application servers.
High reliability of the cluster
MGW uses OSPF + ECMP to form a cluster. Packets are hashed by ECMP to the nodes in the cluster, and when a machine fails, OSPF dynamically withdraws that machine’s route, so ECMP no longer distributes traffic to it, achieving dynamic failover.
The traditional ECMP algorithm has a serious problem: when the number of nodes in the cluster changes, the paths of most flows change. When the redirected traffic reaches other MGW nodes, it cannot find its session structure, leading to a large number of broken connections and a serious impact on services. The problem is exacerbated by the fact that a cluster upgrade takes every node offline in turn.
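The path-change problem with naive hash-mod-N ECMP is easy to demonstrate. In this Python sketch (with made-up addresses, and CRC32 standing in for whatever hash the switch uses), removing one node from a four-node group reroutes roughly three quarters of the flows:

```python
import zlib

def pick_node(flow, nodes):
    """Naive ECMP-style choice: hash the 4-tuple, take it modulo the node count."""
    key = ",".join(map(str, flow)).encode()
    return nodes[zlib.crc32(key) % len(nodes)]

# 1000 synthetic client flows toward one VIP.
flows = [("10.0.%d.%d" % (i // 256, i % 256), 10000 + i, "198.51.100.1", 80)
         for i in range(1000)]
nodes = ["mgw0", "mgw1", "mgw2", "mgw3"]
after_failure = nodes[:3]  # one node drops out of the ECMP group

moved = sum(pick_node(f, nodes) != pick_node(f, after_failure) for f in flows)
```

A flow keeps its path only when the hash agrees modulo both 4 and 3, which happens for about a quarter of the flows; every moved flow arrives at an MGW node that, without session synchronization, has no session entry for it.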
One solution is to use switches that support consistent hashing, so that when a node changes, only the connections on that node are affected and the other connections stay normal. However, few switches support such an algorithm, and even then high availability is not fully achieved, so we built session synchronization within the cluster instead.
Each node in the cluster fully synchronizes its sessions to the others, so every node maintains a global session table. No matter how traffic paths change after a node change, the traffic can find its session structure and be forwarded normally. Changing the number of nodes in the cluster therefore leaves all connections intact.
Two issues were the main considerations in the design of this process: first, failover; second, fault recovery and capacity expansion.
Failover
For failover, we want the switch to move traffic to other machines immediately after a machine fails, because any traffic still reaching the failed machine is dropped, causing massive packet loss. Investigation and testing showed that if physical interfaces are used on the switch side and the interface is powered down on the server side, the switch moves the traffic to the other machines almost instantly. In a test that sent two packets every 100 ms (one from the client side and one from the server side), this operation caused zero packet loss.
Because failover depends mainly on what the switch can perceive, when the server has an abnormality that the switch does not perceive, the switch cannot perform the failover. We therefore need a health self-check program that checks the machine’s health every half second; when it finds an abnormality, it powers down the server’s interface so that the traffic is cut away at once.
Powering down the port depends on operations of the NIC driver, which runs inside the main program, so once the main program crashes, the power-down can no longer be executed. To solve this, the main program catches abnormal signals; on receiving one, it powers down the NIC first and only then passes the signal on to the system for handling.
With the design above, MGW achieves zero packet loss during upgrades and zero packet loss when the main program fails. Other abnormalities (a network cable failure, for instance) cause at most 500 ms of packet loss, because they must be detected by the self-check program, whose cycle is 500 ms.
Fault recovery and capacity expansion
During fault recovery or capacity expansion, the number of nodes in the cluster changes and the traffic paths change with it. When the redirected traffic reaches the cluster’s original nodes, it can be forwarded normally, because those nodes maintain a global session table. But if it reaches a newly added machine that has no global session table, it is dropped entirely. To solve this, a newly started MGW goes through an intermediate pre-online state. In this state, MGW does not let the switch know it is online, so the switch does not move traffic to it. MGW first sends the other nodes in the cluster a batch-synchronization request; on receiving the request, the other nodes synchronize their full session tables to the new node. Once the new node has received the full set of sessions, it lets the switch perceive that it is online; the switch then moves traffic to it, and that traffic can be forwarded normally.
Two main problems arise in this process. The first is that, since the cluster has no master node maintaining global state, if the request or some session-synchronization data is lost, the newly online node cannot build a complete global session table. But because all nodes maintain a global session table, they all hold the same number of sessions. Therefore each node, after finishing its batch synchronization, sends a Finish message carrying its session count. When the new node receives a Finish message, it compares its own session count with the count in the message. Once the count requirement is met, the new node performs its own online operation; otherwise it waits for a timeout period and requests batch synchronization again, until the requirement is met.
The other problem is that connections created during the batch synchronization are not carried by it, and if too many new connections are created, the new machine may never meet the count requirement. Machines in the pre-online state must therefore also receive incremental synchronization data, since new connections are propagated through incremental synchronization. Batch synchronization plus incremental synchronization together guarantee that a newly online machine eventually obtains a global session table.
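The pre-online handshake can be sketched as a small state holder. The message names and shapes here are assumptions for illustration, not MGW’s actual wire protocol:

```python
class PreOnlineNode:
    """Sketch of the pre-online state: the new node collects batch and
    incremental session sync, and goes online only once its table is full."""

    def __init__(self):
        self.sessions = {}
        self.online = False

    def on_batch_sync(self, entries):
        # A chunk of the full session table pushed by an existing node.
        self.sessions.update(entries)

    def on_incremental_sync(self, key, value):
        # Connections created while the batch sync was running arrive here.
        self.sessions[key] = value

    def on_finish(self, reported_count):
        # Go online only if we hold at least as many sessions as reported;
        # otherwise the caller waits for a timeout and retries the batch sync.
        if len(self.sessions) >= reported_count:
            self.online = True
        return self.online
```

Only after `on_finish` succeeds would the node announce its route via OSPF and let the switch start sending it traffic.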
Single-machine high reliability
For single-machine reliability, MGW built an automated test platform that judges whether a test case passes by connectivity and configuration correctness; for failed cases, the platform can notify testers by email. At the end of each feature iteration, test cases for the new features are added to the platform, so automated tests run before every release, largely avoiding problems introduced by changes.
Previously, a manual regression test was required before every release. Regression testing was time-consuming and prone to missing cases, but it had to be done to avoid new problems introduced by changes. The automated test platform greatly improved the efficiency and reliability of regression testing.
RS reliability
Smooth node offline
For RS reliability, MGW provides a smooth-offline function for nodes, mainly to solve this problem: when users need to upgrade an RS, taking it offline directly kills all the existing connections on it and affects the service. If the user instead calls MGW’s smooth-offline function, MGW keeps the RS’s existing connections working while scheduling no new connections to it. When all existing connections have ended, MGW reports a finished state; the user upgrades the RS accordingly and then calls the online interface so the RS can serve normally again. If the user’s platform supports automated application deployment, it can integrate with the cloud platform and use smooth offline to implement fully automated upgrades with no impact on services.
Consistent Source IP Hash scheduler
A source-IP-hash scheduler ensures that connections from the same client are scheduled to the same application server; that is, it establishes a one-to-one mapping between clients and application servers. An ordinary source-IP-hash scheduler changes this mapping when the set of application servers changes, affecting services.
We therefore developed a consistent source-IP-hash scheduler, which ensures that when the application-server cluster changes, only the mappings involving the changed servers change, and everything else stays the same.
To balance traffic, a fixed number of virtual nodes is allocated on the hash ring, and the virtual nodes are then evenly distributed across the physical nodes. The redistribution algorithm must guarantee the following two points:
- When physical nodes change, only a few virtual-node mappings change; this is the basic premise of consistent hashing.
- Because MGW exists as a cluster, when several application servers go online or offline, the changes may reach different MGW nodes in different orders. The final mapping must therefore be identical regardless of the order in which each MGW node observed the changes; otherwise the same client’s connections would be scheduled to different application servers by different MGW nodes, violating the principle of the source-IP-hash scheduler.
Given these two requirements, the consistent-hash algorithm of Google’s Maglev load balancer is a good reference; it is described in detail in the paper and not repeated here.
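As a rough illustration of the table-population idea from the Maglev paper, here is a minimal Python sketch. The hash functions, table size, and backend names are arbitrary stand-ins, not MGW’s actual implementation; sorting the backends first is one simple way to satisfy the order-independence requirement above:

```python
import hashlib

def _h(name: str, seed: str) -> int:
    # Stand-in for the two independent hash functions the paper uses.
    return int(hashlib.sha256((seed + ":" + name).encode()).hexdigest(), 16)

def maglev_table(backends, m=251):
    """Build a Maglev lookup table of prime size m, mapping slot -> backend.

    Sorting the backends makes the table identical no matter in which
    order the online/offline events were observed by each MGW node."""
    backends = sorted(backends)
    offset = [_h(b, "offset") % m for b in backends]
    skip = [_h(b, "skip") % (m - 1) + 1 for b in backends]  # coprime with prime m
    nxt = [0] * len(backends)
    table = [None] * m
    filled = 0
    while filled < m:
        for i in range(len(backends)):
            # Walk backend i's preference permutation to its next empty slot.
            while True:
                slot = (offset[i] + nxt[i] * skip[i]) % m
                nxt[i] += 1
                if table[slot] is None:
                    break
            table[slot] = backends[i]
            filled += 1
            if filled == m:
                break
    return table

def pick_rs(table, cip: str) -> str:
    # Same source IP -> same slot -> same RS, on every MGW node.
    return table[_h(cip, "client") % len(table)]
```

Because each backend claims slots in round-robin order, the slots are split almost evenly; and when a backend is removed, the remaining backends reclaim mostly just its freed slots, so the client-to-RS mapping for the surviving backends is largely preserved.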
Conclusion
Verified by Meituan-Dianping and Meituan Cloud traffic, MGW shows excellent performance and stability in both traditional network environments and large layer-2 overlay environments. In terms of business scenarios, it covers database services, ten-million-scale long-connection services, embedded services, storage services, and Web application services such as hotel, takeout, and group purchase. Against rapidly changing business requirements, MGW has kept improving its own capabilities and performed well across these scenarios. In the future, MGW will not only continue to improve its layer-4 features but also consider extending to layer 7.
References
- DPDK.
- LVS.
- Eisenbud D E, Yi C, Contavalli C, et al. Maglev: A Fast and Reliable Software Network Load Balancer.
Don’t want to miss tech blog updates? Want to comment on articles and interact with the authors? Want first access to technical salon information?
Please follow our official wechat account “Meituan-Dianping Technical Team”.