Author | Li Wanzhi, Volcano Engine development engineer
This article is based on a talk given at a Volcano Engine developer community Meetup. It introduces the Service Mesh traffic governance technology used to handle the massive traffic of the Douyin Spring Festival Gala red envelope campaign.
Background and Challenge
The 2021 CCTV Spring Festival Gala red envelope project left little time for business R&D engineers, who had to complete the development, testing, and launch of the relevant code on a very tight schedule.
The project involved many different technical teams and, naturally, a large number of microservices. These microservices are written in different language stacks, including Go, C++, Java, Python, and Node, and run in very diverse environments such as containers, virtual machines, and physical machines. They may also need different traffic governance policies at different stages of the event to stay stable.
Therefore, the infrastructure needs to provide unified traffic governance capabilities for these microservices from different teams and written in different languages.
How traditional microservice architectures respond
Speaking of microservices, let's first look at how traditional microservice architectures solve these problems. As organizations grow, the business logic of a product becomes increasingly complex. To improve iteration efficiency, the back end of Internet software has gradually evolved from a single monolithic service into distributed microservices. Compared with a monolithic architecture, a distributed architecture is weaker in stability and observability. To make up for this, a lot of functionality has to be implemented in the microservice framework, for example:
- Microservices call each other to complete what the original monolithic service did, which involves network communication and the serialization and deserialization of requests and responses.
- Calls between services involve service discovery.
- Distributed architectures may require different traffic governance policies to ensure the stability of services calling each other.
- Observability needs to be improved, including logging, monitoring, and tracing.
By implementing these capabilities in the framework, a microservice architecture can address the problems mentioned earlier. But this approach has problems of its own:
- Implementing all of these features in frameworks for multiple languages carries very high development and maintenance costs.
- Shipping a new framework feature, or rolling one back, requires business R&D engineers to make changes and redeploy their services, which over time leads to fragmented, uncontrolled framework versions.
So how do we solve these problems? There is a saying in software engineering that any problem can be solved by adding a layer of indirection. The industry has already given the answer: this middle layer is the Service Mesh.
Self-developed Service Mesh implementation
Now I will introduce Volcano Engine's self-developed Service Mesh implementation. Take a look at the architecture diagram below.
The Proxy node in the blue rectangle is the data plane of the Service Mesh. It is a separate process deployed in the same running environment (the same container or the same machine) as the Service process that runs the business logic, and it proxies all traffic flowing in and out of the Service process. The service discovery and traffic governance policies that used to be implemented in the microservice framework can now be implemented in this data plane process.
The green rectangle is the control plane of the Service Mesh. The routing and governance policies we want to apply are decided here. It is a remotely deployed service that pushes traffic governance rules down to the data plane processes, which then execute them.
We can also see that the data plane and control plane are independent of the business logic: they can be released and upgraded on their own, without involving business R&D engineers. With this architecture, the problems mentioned above can be solved:
- We no longer need to implement the framework's features in every language; they only need to be implemented once, in the Service Mesh data plane process.
- The data plane process hides the complexity of the various runtime environments; the Service process only needs to talk to the data plane process.
- Flexible traffic governance policies can be customized through the Service Mesh control plane.
Service Mesh traffic governance technology
Next, I will introduce the specific traffic governance technologies our Service Mesh implementation provides to keep microservices stable in the face of the traffic peaks of the Douyin Spring Festival Gala.
First, the core of traffic governance:
- Routing: traffic leaving one microservice instance needs service discovery or routing rules to reach the next microservice; many traffic governance capabilities derive from this process.
- Security: When traffic is transferred between different micro-services, identity authentication, authorization, and encryption are required to ensure that traffic is secure, authentic, and trusted.
- Control: Dynamically adjust governance policies to ensure the stability of microservices in different scenarios.
- Observability: we need to record and trace the status of traffic, and work with the alerting system to find and fix problems in time.
Combining these four core aspects with concrete traffic governance strategies improves the stability of microservices, keeps traffic content secure, raises the development efficiency of business R&D engineers, and strengthens overall disaster tolerance in the face of black swan events.
Let's take a look at the traffic governance policies the Service Mesh provides to keep microservices stable.
Stability strategy – circuit breaking
The first strategy is circuit breaking. In a microservice architecture, failures of individual nodes are the norm. When a single node fails, keeping the overall success rate high is the problem to solve.
The circuit breaker works from the client's point of view: it records the success rate of requests sent from the service to each downstream node. When the success rate of requests to a node falls below a certain threshold, we break the circuit to that node so that requests are no longer sent to it.
Once the faulty node recovers, we also need a strategy to recover from the break. For example, after some time we can send a small amount of traffic to the failed node as a probe. If the node still cannot serve, the circuit stays broken; if it can, we ramp traffic back up until it returns to the normal level. A circuit-breaking strategy tolerates the unavailability of individual nodes and prevents failures from avalanching further.
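The data plane itself is written in C++ on top of Envoy (see the Q&A at the end); purely for readability, the sketches in this article use Go. Below is a minimal, illustrative per-node circuit breaker with a probing recovery phase. The thresholds, window size, and names such as `breakerState` are assumptions for illustration, not the production implementation.

```go
package breaker

import (
	"math/rand"
	"sync"
	"time"
)

// breakerState tracks request outcomes for one downstream node over a
// fixed-size window. Illustrative only.
type breakerState struct {
	mu        sync.Mutex
	success   int
	total     int
	tripped   bool
	trippedAt time.Time
}

const (
	minRequests   = 100              // only evaluate after enough samples
	minSuccessPct = 0.80             // example threshold: trip below 80% success
	probeAfter    = 5 * time.Second  // start probing the node after this cool-down
	probeRatio    = 0.05             // share of traffic allowed through as probes
)

// Allow decides whether a request may be sent to this node.
func (b *breakerState) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if !b.tripped {
		return true
	}
	// Half-open: after the cool-down, let a small share of requests probe the node.
	if time.Since(b.trippedAt) > probeAfter {
		return rand.Float64() < probeRatio
	}
	return false
}

// Record updates the window and trips or resets the breaker at window end.
func (b *breakerState) Record(ok bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.total++
	if ok {
		b.success++
	}
	if b.total < minRequests {
		return
	}
	rate := float64(b.success) / float64(b.total)
	if !b.tripped && rate < minSuccessPct {
		b.tripped, b.trippedAt = true, time.Now() // stop sending traffic to this node
	} else if b.tripped && rate >= minSuccessPct {
		b.tripped = false // node looks healthy again; ramp traffic back up
	}
	b.success, b.total = 0, 0 // start a new window
}
```

A real implementation would typically use a sliding window and gradual ramp-up rather than this all-or-nothing window reset, but the trip/probe/recover cycle is the same.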
Stability strategy – rate limiting
Another governance strategy is rate limiting. It is based on the observation that an overloaded Server processes requests less successfully. For example, a Server node may normally handle 2,000 QPS, but once it is overloaded (say it receives 3,000 QPS) it may only manage 1,000 QPS or even less. Rate limiting proactively drops part of the traffic to keep the Server from being overloaded and to avoid the avalanche effect.
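To make the idea concrete, here is a minimal token-bucket limiter sketch; the rate and burst values are examples to be tuned by load testing, and this is not the mesh's actual limiter.

```go
package ratelimit

import (
	"sync"
	"time"
)

// TokenBucket is a minimal rate limiter: tokens refill at a fixed rate,
// and each admitted request consumes one token.
type TokenBucket struct {
	mu       sync.Mutex
	capacity float64   // burst size
	tokens   float64   // currently available tokens
	rate     float64   // refill rate, tokens per second
	last     time.Time // last refill time
}

// New builds a limiter, e.g. New(2000, 2000) for roughly 2,000 QPS.
func New(ratePerSec, burst float64) *TokenBucket {
	return &TokenBucket{capacity: burst, tokens: burst, rate: ratePerSec, last: time.Now()}
}

// Allow returns true if the request may proceed, false if it should be
// rejected to keep the Server below its overload point.
func (tb *TokenBucket) Allow() bool {
	tb.mu.Lock()
	defer tb.mu.Unlock()
	now := time.Now()
	tb.tokens += now.Sub(tb.last).Seconds() * tb.rate
	if tb.tokens > tb.capacity {
		tb.tokens = tb.capacity
	}
	tb.last = now
	if tb.tokens >= 1 {
		tb.tokens--
		return true
	}
	return false
}
```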
Stability strategy – degradation
When the Server becomes even more overloaded, a degradation strategy is needed. There are two forms of degradation (a minimal sketch of both follows the list):
- One is to drop traffic proportionally. For example, traffic sent from service A to service B can be dropped at a certain ratio (20% or even more).
- The other is degrading bypass (non-critical) dependencies. Suppose service A depends on services B, C, and D, and D is a bypass dependency. The traffic to D can be cut off so that the freed resources go to the core path and further overload is prevented.
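A minimal sketch of the two degrade rules above; the `Config` fields and names are hypothetical, not the real control-plane schema.

```go
package degrade

import "math/rand"

// Config sketches a per-caller degrade rule: drop a fraction of all traffic
// to a callee, and cut off bypass (non-critical) dependencies entirely.
type Config struct {
	DropRatio     float64         // e.g. 0.2 drops 20% of traffic to the callee
	BypassCallees map[string]bool // non-critical dependencies whose calls are cut
}

// ShouldDrop decides whether a request to `callee` is shed under this rule.
func (c *Config) ShouldDrop(callee string) bool {
	if c.BypassCallees[callee] {
		return true // bypass dependency fully degraded to free resources for the core path
	}
	return rand.Float64() < c.DropRatio
}
```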
Stability strategy – dynamic overload protection
Circuit breaking, rate limiting, and degradation are all strategies for dealing with errors after they appear. The best strategy, though, is to nip problems in the bud before they happen, which is what dynamic overload protection does. As mentioned above, the threshold for rate limiting is hard to determine; it is usually obtained by load-testing how much QPS a node can carry, but the real limit varies across nodes because their runtime environments differ. Dynamic overload protection is based on the fact that service nodes with the same resource specification do not necessarily have the same processing capacity. How is dynamic overload protection achieved? It consists of three parts: overload detection, overload handling, and overload recovery. The most critical question is how to decide whether a Server node is overloaded.
The Ingress Proxy in the figure above is the data plane process of the Service Mesh; it proxies traffic and forwards it to the Server process. T3 in the figure is the time from when the Proxy process receives a request until the Server finishes processing it and returns. Can this time be used to judge overload? No, because the Server may depend on other nodes: if those nodes slow down, the Server's processing time grows even though the Server itself is not overloaded, so T3 does not reliably reflect overload.
T2, on the other hand, is the interval between the data plane process forwarding the request to the Server and the Server actually starting to process it. Can T2 reflect overload? Yes. For example, suppose the Server runs on a 4-core, 8 GB instance, so it can only truly process about four requests at the same time. If 100 requests arrive, the other 96 sit pending; when this pending time grows too long, the Server is considered overloaded.
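Here is a minimal sketch of a T2-based overload check, assuming the data plane stamps each request with the time it was forwarded; the smoothing factor and threshold are illustrative values, not production ones.

```go
package overload

import (
	"sync"
	"time"
)

// Detector smooths the observed T2 (how long requests sit pending before the
// Server starts handling them) and compares it against a threshold.
type Detector struct {
	mu      sync.Mutex
	pending time.Duration // exponentially weighted moving average of T2
}

const (
	ewmaAlpha        = 0.1
	pendingThreshold = 50 * time.Millisecond // example threshold; tune per service
)

// Observe is called when the Server picks a request up. forwardedAt is the
// timestamp the data plane attached when it forwarded the request.
func (d *Detector) Observe(forwardedAt time.Time) {
	t2 := time.Since(forwardedAt)
	d.mu.Lock()
	d.pending = time.Duration(float64(d.pending)*(1-ewmaAlpha) + float64(t2)*ewmaAlpha)
	d.mu.Unlock()
}

// Overloaded reports whether the smoothed pending time exceeds the threshold.
func (d *Detector) Overloaded() bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	return d.pending > pendingThreshold
}
```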
What do we do once Server overload is detected? There are many possible overload-handling strategies; the one we adopt is to proactively drop low-priority requests, based on each request's priority, to relieve the overload. When the Server returns to its normal level after some traffic has been dropped, we perform overload recovery so that QPS can climb back to normal.
Why is this process dynamic? Overload detection runs in real time on a fixed cycle. In each cycle, if the Server is detected as overloaded, the proportion of low-priority requests being dropped is raised gradually; in the next cycle, if the Server has recovered, the drop ratio is lowered so that the Server recovers step by step.
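A sketch of that periodic adjustment loop follows. It raises or lowers the drop ratio for low-priority requests each detection cycle; the step size, cycle handling, and the wiring to the overload check (which could be the `Detector` sketched above) are assumptions.

```go
package overload

import (
	"math/rand"
	"sync"
	"time"
)

// Protector sheds a growing share of low-priority requests while the Server
// looks overloaded, and lowers the share again once it recovers.
type Protector struct {
	mu         sync.Mutex
	overloaded func() bool // e.g. a Detector.Overloaded check
	dropRatio  float64     // current fraction of low-priority requests to shed
}

const adjustStep = 0.05 // move the drop ratio by 5 points per detection cycle

func NewProtector(overloaded func() bool) *Protector {
	return &Protector{overloaded: overloaded}
}

// Run adjusts the drop ratio once per detection cycle; call it in a goroutine.
func (p *Protector) Run(cycle time.Duration) {
	for range time.Tick(cycle) {
		p.mu.Lock()
		if p.overloaded() {
			p.dropRatio += adjustStep
			if p.dropRatio > 1 {
				p.dropRatio = 1
			}
		} else {
			p.dropRatio -= adjustStep
			if p.dropRatio < 0 {
				p.dropRatio = 0
			}
		}
		p.mu.Unlock()
	}
}

// Admit decides whether a request may run; only low-priority traffic is shed.
func (p *Protector) Admit(lowPriority bool) bool {
	if !lowPriority {
		return true
	}
	p.mu.Lock()
	ratio := p.dropRatio
	p.mu.Unlock()
	return rand.Float64() >= ratio
}
```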
The effect of dynamic overload protection is very clear: it keeps services from collapsing under heavy traffic and high pressure. This strategy was applied widely to several large services in the Douyin Spring Festival Gala red envelope project.
Stability strategy – load balancing
Let's look at load balancing. Suppose service A sends traffic to downstream service B, and both A and B have 10,000 nodes. How do we keep the traffic from A to B balanced? There are many methods; the most common are random, round robin, weighted random, and weighted round robin, and their behavior is clear from the names.
Another common strategy is consistent hashing. It establishes a mapping between requests and nodes: based on some characteristic of the request, the request is always routed to the same downstream node. This policy mainly serves cache-sensitive services: it greatly improves the cache hit ratio, improves Server performance, and reduces timeout errors. When new nodes join the service or some nodes become unavailable, consistent hashing disturbs the established mapping as little as possible.
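To make the mapping concrete, here is a minimal consistent-hash ring with virtual nodes; the hash function, replica count, and lack of weighting are simplifications for illustration, not the mesh's actual load balancer.

```go
package lb

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is a consistent-hash ring: each node is placed on the ring many times
// (virtual nodes) and a request key is routed to the next point clockwise.
type Ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // hash point -> node address
}

const virtualNodes = 100 // virtual replicas per node to even out the distribution

func hashOf(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func NewRing(nodes []string) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < virtualNodes; i++ {
			p := hashOf(fmt.Sprintf("%s#%d", n, i))
			r.points = append(r.points, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Pick routes a request key (e.g. a user ID) to a node; the same key keeps
// hitting the same node, which is what makes caches effective. Adding or
// removing a node only remaps the keys near its points on the ring.
func (r *Ring) Pick(key string) string {
	h := hashOf(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}
```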
There are many other load balancing strategies that are not widely used in production scenarios and will not be described here.
Stability strategy – node sharding
For the large traffic scale of the Douyin Spring Festival Gala red envelope scenario, an especially useful strategy is node sharding. Node sharding addresses the fact that long-connection reuse is very low between services with many nodes. Microservices generally communicate over TCP: a connection is established first and traffic then flows over it, and we want to reuse connections as much as possible to send requests and receive responses, avoiding the overhead of constantly opening and closing connections.
When node counts are very large, for example when service A and service B each have 10,000 nodes, a full mesh of long connections would be enormous. To avoid maintaining that many connections, an idle timeout is usually set: when no traffic passes over a connection for a certain period, the connection is closed. With a very large number of nodes, long connections thus degenerate into short connections, and every request has to establish a new connection. The effects are:
- Errors caused by connection timeouts.
- Degraded performance.
The solution is node sharding, which we also used very widely in the Spring Festival Gala red envelope scenario. Services with a large number of nodes are divided into shards, and a mapping is established between them: as the following figure shows, traffic sent by shard 0 of service A always goes to shard 0 of service B.
This greatly improves long-connection reuse: instead of a potential 10,000 × 10,000 full mesh of connections, each shard pair only needs on the order of 100 × 100. With node sharding we greatly improve the reuse rate of long connections, reduce errors caused by connection timeouts, and improve microservice performance.
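A sketch of the shard mapping follows, assuming each upstream node knows its own index and the downstream node list; in practice the control plane computes and distributes this mapping, and shard counts are chosen per service.

```go
package shard

// ShardFor returns the subset of downstream nodes a given upstream node may
// connect to: upstream shard i only talks to downstream shard i, so the
// connection fan-out shrinks from a full mesh to one shard's worth of nodes.
func ShardFor(upstreamIndex int, downstream []string, shardCount int) []string {
	shard := upstreamIndex % shardCount
	var picked []string
	for i, addr := range downstream {
		if i%shardCount == shard {
			picked = append(picked, addr)
		}
	}
	return picked
}
```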
Efficiency strategies
Rate limiting, circuit breaking, degradation, dynamic overload protection, and node sharding are all strategies for improving microservice stability. There are also strategies related to efficiency. Let me first introduce the concepts of swimlanes and dyed traffic splitting.
A feature shown in the figure above may involve six microservices A, B, C, D, E, and F. Swimlanes isolate traffic: each swimlane contains all six microservices and can complete the feature end to end. Dyed traffic splitting means directing traffic into different swimlanes according to certain rules (a routing sketch follows the list). Its main uses include:
- Feature debugging: during development and testing, requests issued by an individual developer can be sent to a swimlane that developer set up, where the feature can be debugged.
- Failure drills: after the services for the Douyin Spring Festival Gala were developed, drills had to be run against different failure modes. Load-test traffic can be diverted by rules into a swimlane dedicated to the failure drill.
- Traffic recording and replay: traffic matching certain rules is recorded and replayed later, mainly to debug problems or to investigate fraud (black-market) scenarios.
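Here is a sketch of dye-based lane selection, assuming the dye tag travels as request metadata under a hypothetical key; the real routing rules are pushed by the control plane and can be far richer.

```go
package lane

// dyeKey is an example metadata key carrying the dye color of a request.
const dyeKey = "x-dye-tag"

// PickLane returns which swimlane a request should enter. lanes maps a dye
// value (e.g. "feature-x", "failure-drill") to a lane name; undyed traffic
// or unknown colors fall back to the default lane.
func PickLane(metadata map[string]string, lanes map[string]string) string {
	if lane, ok := lanes[metadata[dyeKey]]; ok {
		return lane
	}
	return "default"
}
```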
Security policies
Security policies are also an important part of traffic governance. We provide three main security policies:
- Authorization: Authorization defines which services a service can be invoked by.
- Authentication: When a service receives traffic, it needs to authenticate the source of the traffic.
- Mutual TLS (mTLS): two-way encryption protects traffic from eavesdropping, tampering, and attacks.
With these policies we provide reliable identity authentication and encrypted transport, and we protect the transmitted content from being tampered with or attacked.
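As an illustration of mutual TLS between two processes, here is a minimal server-side TLS configuration using Go's standard library; in the mesh, certificates are issued and rotated automatically, and the file paths below are placeholders, not real paths.

```go
package mtls

import (
	"crypto/tls"
	"crypto/x509"
	"os"
)

// serverTLSConfig builds a TLS config that both presents a server certificate
// and requires callers to present a certificate signed by the trusted CA,
// which is what makes the encryption mutual.
func serverTLSConfig() (*tls.Config, error) {
	cert, err := tls.LoadX509KeyPair("server.crt", "server.key") // placeholder paths
	if err != nil {
		return nil, err
	}
	caPEM, err := os.ReadFile("ca.crt") // placeholder path to the mesh CA
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		ClientCAs:    pool,
		// Require and verify a client certificate so the server can
		// authenticate and authorize the calling service.
		ClientAuth: tls.RequireAndVerifyClientCert,
	}, nil
}
```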
Landing in the Spring Festival Gala red envelope scenario
With the strategies above, we can greatly improve both the stability of microservices and the efficiency of business development. But landing them brings challenges, and the main one is performance. Adding a middle layer improves scalability and flexibility, but it comes at a cost, and that cost is performance. Without a Service Mesh, the main overhead of a microservice framework comes from serialization and deserialization, network communication, service discovery, and traffic governance policies. With a Service Mesh, two new kinds of overhead appear:
- Protocol parsing: to proxy traffic, the data plane must parse the traffic's protocol to learn where it comes from and where it is going. Full protocol parsing is expensive, so we prepend a header (a collection of keys and values) that carries service meta information such as the traffic's source. The data plane then only needs to parse one or two hundred bytes to complete routing (a sketch of this idea follows the list).
- Inter-process communication: the data plane process proxies traffic for the business process, usually via iptables, whose overhead is very high. Instead, we agree on a Unix domain socket address or a local port with the microservice framework and hijack traffic there. This performs better than iptables, but it still carries some overhead of its own.
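Before turning to inter-process communication, here is a rough sketch of the meta header idea from the first point above. The wire layout, key names, and framing are assumptions for illustration only, not the mesh's real protocol; error handling is omitted for brevity.

```go
package metahdr

import (
	"bytes"
	"encoding/binary"
	"io"
)

// encodeMeta serializes a small key/value block (e.g. source service,
// destination, priority) that the sender prepends to the payload, so the
// data plane can route by reading only this short prefix.
func encodeMeta(kv map[string]string) []byte {
	var buf bytes.Buffer
	binary.Write(&buf, binary.BigEndian, uint16(len(kv)))
	for k, v := range kv {
		binary.Write(&buf, binary.BigEndian, uint16(len(k)))
		buf.WriteString(k)
		binary.Write(&buf, binary.BigEndian, uint16(len(v)))
		buf.WriteString(v)
	}
	return buf.Bytes()
}

// decodeMeta parses the prefix back into a map without touching the payload.
func decodeMeta(b []byte) map[string]string {
	r := bytes.NewReader(b)
	var n uint16
	binary.Read(r, binary.BigEndian, &n)
	kv := make(map[string]string, n)
	for i := uint16(0); i < n; i++ {
		var klen, vlen uint16
		binary.Read(r, binary.BigEndian, &klen)
		k := make([]byte, klen)
		io.ReadFull(r, k)
		binary.Read(r, binary.BigEndian, &vlen)
		v := make([]byte, vlen)
		io.ReadFull(r, v)
		kv[string(k)] = string(v)
	}
	return kv
}
```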
How can we reduce the cost of inter-process communication? Traditional IPC over Unix domain sockets or local ports copies data between user space and kernel space. Writing a request to the data plane process copies it from user space into the kernel, and the data plane process copies it from the kernel back into user space when reading; a round trip can involve up to four memory copies.
Our solution uses shared memory. Shared memory is one of the highest-performance IPC mechanisms on Linux, but it has no notification mechanism: when we put a request into shared memory, the other process does not know it is there. So we add an event notification mechanism, using Unix domain sockets, to tell the data plane process when to look, while the data itself stays in shared memory, which removes most of the copying overhead. We also maintain a queue in shared memory so that I/O can be harvested in batches, reducing system calls. The effect is significant: in some risk-control scenarios of the Douyin Spring Festival Gala, performance improved by 24%. With these optimizations in place, there is much less resistance to adoption.
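The production shared-memory transport lives in the C++ data plane; the in-process Go sketch below models only the notification side of the design, under stated assumptions: a queue stands in for the shared-memory ring, and a wake-up (in practice, a byte written to the Unix domain socket) is issued only when the consumer is idle, so a burst of requests costs a single system call and is harvested as one batch.

```go
package shmqueue

import (
	"sync"
	"sync/atomic"
)

// Queue models a producer/consumer ring with wake-up suppression.
type Queue struct {
	mu       sync.Mutex
	items    [][]byte
	sleeping int32  // 1 when the consumer is parked and needs a wake-up
	wake     func() // e.g. write one byte to the UDS the peer process polls
}

func New(wake func()) *Queue { return &Queue{wake: wake} }

// Enqueue adds a request and wakes the consumer only if it was idle, so
// request-intensive bursts trigger a single notification.
func (q *Queue) Enqueue(req []byte) {
	q.mu.Lock()
	q.items = append(q.items, req)
	q.mu.Unlock()
	if atomic.CompareAndSwapInt32(&q.sleeping, 1, 0) {
		q.wake()
	}
}

// DrainAll harvests everything currently queued in one batch and marks the
// consumer as sleeping again only when the queue is empty.
func (q *Queue) DrainAll() [][]byte {
	q.mu.Lock()
	batch := q.items
	q.items = nil
	if len(batch) == 0 {
		atomic.StoreInt32(&q.sleeping, 1)
	}
	q.mu.Unlock()
	return batch
}
```

This is also the logic behind the Q&A answer below: while the consumer is already awake and draining, producers skip the wake-up entirely.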
Conclusion
This article has described the traffic governance capabilities that the Service Mesh provides to ensure the stability and security of microservices. It comes down to three core points:
- Stability: in the face of instantaneous traffic peaks of 100 million QPS, the traffic governance technology provided by the Service Mesh keeps microservices stable.
- Security: the security policies provided by the Service Mesh ensure that traffic between services is secure and trustworthy.
- Efficiency: the gala involved microservices written in many different programming languages, and the Service Mesh naturally provides unified traffic governance for all of them, improving developers' efficiency.
Q&A
Q: Why does shared-memory IPC reduce system calls?
A: When the client process puts a request into shared memory, it has to notify the Server process to handle it; each wake-up is a system call. But if the Server has not yet woken up, or is still processing requests when the next request arrives, no additional wake-up is needed. In request-intensive scenarios this removes most wake-ups and therefore most of the system calls.
Q: Is the self-developed Service Mesh purely in-house, or based on community products such as Istio? If in-house, is it written in Go or Java? Does the data plane use Envoy? Is iptables used for traffic hijacking?
A:
- The data plane is a C++ implementation developed on top of Envoy.
- Traffic hijacking uses the UDS or local port agreed with the microservice framework instead of iptables.
- The Ingress Proxy and the business process are deployed in the same running environment, and no container restart is required to publish updates.