Introduction: This article describes our recent practice and value exploration in using a Service Mesh architecture to solve problems in Alibaba's overseas business. While introducing the mesh into these businesses, we made full use of Istio's YAML-based abstractions for describing and defining routes, formulated enterprise traffic governance standards, and consolidated the routing modules that the overseas business had built up over many years into a single Mesh-based routing framework. This framework carried the full volume of overseas business traffic during this year's Singles' Day. We hope our experience offers a useful reference for others who are still exploring how to adopt a service mesh.
The author | thousand deep Jane, day
Background
Since the Service Mesh concept was proposed in 2016, it has made great progress in both application and technical evolution. Many large Chinese Internet companies have put a mesh into production. As one of the early investors in this field, Alibaba has gone through technology verification, value exploration, large-scale rollout, and the release of technical dividends within the group. Along the way we solved many hard problems caused by Alibaba Group's scale and saw the changes this technology brings. On the one hand, Alibaba's services and nodes are extremely numerous, and stock Istio + Envoy has difficulty supporting that scale, so we did a great deal of performance optimization work. On the other hand, much of Alibaba's basic technical infrastructure is built on the Java stack, so to connect the mesh to Alibaba's relatively mature technical system we spent considerable effort rewriting many internal components in C++ and Golang. In value exploration, we worked with business teams to balance and trade off short-term value against long-term value.
Routing complexity of overseas services
Within Alibaba Group, the overseas business has far more complex routing requirements than the domestic business. The overseas business team therefore built many custom routing capabilities on top of the existing microservice framework, each implemented as an independent module, to schedule traffic along dimensions such as traffic switching, disaster recovery, drills, and gray (canary) release. The result was a large number of independent modules that are also used in different ways: some are scheduled through configuration, while others require code changes. These modules are very costly to maintain, and the routing they provide is inflexible and coarse-grained. Against this background, we began introducing the mesh into the overseas business to resolve these pain points through one unified, Mesh-based approach.
By abstracting the services, we summarized overseas service routing into three basic steps: traffic marking, cluster grouping, and conditional routing. Put simply, traffic that meets certain conditions is routed to a certain group within the corresponding cluster. The problem then becomes how to divide a cluster's nodes into groups and how to identify the traffic that meets the conditions. In Istio these correspond to VirtualService and DestinationRule: the former selects a predefined group based on headers, context, and other conditions in the request, while the latter divides nodes into groups based on machine labels.
With this routing model in place, the next step was to map the various routing modules of the overseas business onto VirtualService and DestinationRule. In practice, however, routing turned out to be more complex than we expected. Besides stacking routes on top of one another, we also needed fallbacks under various conditions. The end result resembles a large funnel: each routing module applies its own conditions to the nodes left by the previous module and filters out the batch that meets its requirements. We therefore extended Istio with two concepts, RouteGroup and RouteChain. A set of VirtualService and DestinationRule resources forms a RouteGroup, which defines one class of route, and multiple RouteGroups can be arranged in any order through a RouteChain to form the large funnel (illustrated by a figure in the original article).
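The funnel can be sketched in a few lines of Python. This is only an illustration of the idea, not the internal implementation: the `RouteGroup` class, the `route_chain` function, the header names, and the fallback policy (an empty result keeps the previous stage's nodes) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Node = Dict[str, str]     # a node, described by its labels
Request = Dict[str, str]  # a request, described by its headers


@dataclass
class RouteGroup:
    """One class of route (hypothetical model, not the internal API).

    `match` stands in for a VirtualService condition and
    `subset` for the labels of a DestinationRule subset.
    """
    name: str
    match: Callable[[Request], bool]
    subset: Dict[str, str]

    def apply(self, request: Request, nodes: List[Node]) -> List[Node]:
        if not self.match(request):
            return nodes  # condition not met: this stage filters nothing
        selected = [n for n in nodes
                    if all(n.get(k) == v for k, v in self.subset.items())]
        # Assumed fallback policy: an empty result keeps the previous nodes.
        return selected or nodes


def route_chain(groups: List[RouteGroup], request: Request,
                nodes: List[Node]) -> List[Node]:
    """Run the request through each RouteGroup in order, like a funnel."""
    for group in groups:
        nodes = group.apply(request, nodes)
    return nodes


nodes = [
    {"ip": "10.0.0.1", "region": "sg", "env": "gray"},
    {"ip": "10.0.0.2", "region": "sg", "env": "prod"},
    {"ip": "10.0.0.3", "region": "us", "env": "prod"},
]
chain = [
    RouteGroup("region-sg", lambda r: r.get("x-region") == "sg", {"region": "sg"}),
    RouteGroup("region-us", lambda r: r.get("x-region") == "us", {"region": "us"}),
    RouteGroup("gray", lambda r: r.get("x-gray") == "true", {"env": "gray"}),
]
picked = route_chain(chain, {"x-region": "sg", "x-gray": "true"}, nodes)
print([n["ip"] for n in picked])
```

Each stage only narrows the node list it receives, so reordering the entries in `chain` changes the shape of the funnel without touching any routing code.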
In a standard Istio implementation, a DestinationRule works by dividing nodes into groups by label on the control plane, creating a cluster for each group, and delivering those clusters to Envoy. If a cluster is divided into multiple groups and the groups overlap, the number of nodes held by an Envoy swells: a node that belongs to three groups is held as three copies inside Envoy. At Alibaba, clusters generally contain a large number of nodes, so stacking routes with Istio's per-group clusters inflates Envoy's memory. We therefore optimized this internally by pushing the whole DestinationRule grouping logic down into Envoy, so that grouping is done by Envoy itself inside the cluster. The approach is similar to the Envoy community's Subset LoadBalancer mechanism: each node is held only once, and each subset is really just a set of pointers to nodes. In this way we finally mapped all routes of the overseas business onto our unified routing scheme.
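The difference between the two grouping strategies can be shown with a toy model. The sketch below is an assumption-laden illustration in Python, not Envoy code: `Host`, `clusters_by_copy`, and `subsets_by_reference` are hypothetical names, and Python object references stand in for the pointers mentioned above.

```python
from typing import Dict, List


class Host:
    """An upstream node with an address and labels (hypothetical model)."""

    def __init__(self, address: str, labels: Dict[str, str]) -> None:
        self.address = address
        self.labels = labels


def in_group(host: Host, selector: Dict[str, str]) -> bool:
    return all(host.labels.get(k) == v for k, v in selector.items())


def clusters_by_copy(hosts: List[Host],
                     groups: Dict[str, Dict[str, str]]) -> Dict[str, List[Host]]:
    """Per-group clusters: every group gets its own copy of each member."""
    return {name: [Host(h.address, dict(h.labels))
                   for h in hosts if in_group(h, sel)]
            for name, sel in groups.items()}


def subsets_by_reference(hosts: List[Host],
                         groups: Dict[str, Dict[str, str]]) -> Dict[str, List[Host]]:
    """Subsets inside one cluster: every group only references shared hosts."""
    return {name: [h for h in hosts if in_group(h, sel)]
            for name, sel in groups.items()}


def host_objects(grouping: Dict[str, List[Host]]) -> int:
    """Count the distinct Host objects kept alive by a grouping."""
    return len({id(h) for members in grouping.values() for h in members})


hosts = [
    Host("10.0.0.1", {"region": "sg", "env": "gray", "unit": "a"}),
    Host("10.0.0.2", {"region": "sg", "env": "prod", "unit": "a"}),
]
groups = {
    "region-sg": {"region": "sg"},
    "env-gray": {"env": "gray"},
    "unit-a": {"unit": "a"},
}
copied = clusters_by_copy(hosts, groups)
shared = subsets_by_reference(hosts, groups)
print(host_objects(copied), host_objects(shared))
```

With two hosts and three overlapping groups, the copying strategy holds five host objects while the reference strategy holds two, and the gap widens as more routing dimensions are stacked.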
Layered unified traffic scheduling
Business teams care about routing functions and scenarios, such as gray release and traffic switching. The bottom layer of the mesh provides atomic routing capabilities: it can divide a cluster into groups by machine, region, environment, and so on, and route a request to a group according to its headers, context, and other information. There is a gap between the two: how to use the atomic routing capabilities provided by the mesh to build scheduling scenarios with business semantics. To close it, we worked with the business side to implement a layered, unified traffic scheduling scheme with three layers. The bottom layer is the Mesh base, which provides atomic routing capabilities such as RouteGroup and RouteChain. The middle layer is a product capability with platform attributes: it encapsulates the atomic capabilities of the mesh and provides a customized standard model for business scenarios, in which routing policies can be defined and route combinations can be arranged. The top layer consists of traffic scheduling scenarios with business attributes. The architecture of unified traffic scheduling is as follows:
This unified traffic scheduling scheme converges all overseas routes onto one platform. New routing scenarios can subsequently be added without code changes, and traffic can be switched at service granularity, which is finer and more efficient than the previous application-level granularity.
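One way to picture the three layers is as a small compiler from business scenarios down to atomic route definitions. The Python sketch below is purely illustrative: the scenario names, parameters, header names, and the `compile_plan` function are assumptions for the example, not the platform's real model.

```python
from typing import Any, Callable, Dict, List, Tuple

RouteGroupSpec = Dict[str, Any]

# Middle layer: each business scenario maps to a template that expands
# into atomic RouteGroup specs. All names here are hypothetical.
SCENARIOS: Dict[str, Callable[[Dict[str, str]], List[RouteGroupSpec]]] = {
    "traffic_switch": lambda p: [{
        "name": "switch-" + p["to_region"],
        "match": {"x-region": p["from_region"]},
        "subset": {"region": p["to_region"]},
    }],
    "gray_release": lambda p: [{
        "name": "gray-" + p["version"],
        "match": {"x-gray": "true"},
        "subset": {"version": p["version"]},
    }],
}


def compile_plan(service: str,
                 plan: List[Tuple[str, Dict[str, str]]]) -> Dict[str, Any]:
    """Top layer in, bottom layer out: scenarios become an ordered RouteChain."""
    chain: List[RouteGroupSpec] = []
    for scenario, params in plan:
        chain.extend(SCENARIOS[scenario](params))
    return {"service": service, "routeChain": chain}


# Top layer: a per-service plan written in business terms.
config = compile_plan("order-service", [
    ("traffic_switch", {"from_region": "sg", "to_region": "us"}),
    ("gray_release", {"version": "v2"}),
])
print([g["name"] for g in config["routeChain"]])
```

In this sketch a new scenario is a new template entry rather than a new routing module, and the plan is keyed by service, which mirrors the service-level granularity described above.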
Route visualization
Beyond value exploration, we also solved many engineering problems while adopting the mesh, such as visualizing the mesh routing process. Before the mesh, routing problems on the business side were handled by the various feature teams. After the mesh, responsibility for maintaining routing moved to the Mesh team, whose workload for answering questions and troubleshooting became huge. In addition, because overseas business routes can be stacked and arranged freely, confirming that a routing configuration behaves as expected can be very time-consuming. To solve this, we developed a routing simulation platform that mirrors, parses, and replays online traffic, generates a record of the routing process, and returns it to the routing platform. This closed-loop simulation shows which RouteGroups the traffic passed through and which groups were matched. Finally, the selected machines are presented in a routing diagram, so the whole routing process is shown graphically.
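The kind of record such a simulation might produce can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `simulate` function, the dictionary layout of a RouteGroup, the header names, and the fallback rule are hypothetical and are not the simulation platform's actual format.

```python
from typing import Any, Dict, List


def simulate(chain: List[Dict[str, Any]], request: Dict[str, str],
             nodes: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Replay one request through the chain and record every step."""
    trace = []
    for group in chain:
        matched = all(request.get(k) == v for k, v in group["match"].items())
        fell_back = False
        if matched:
            selected = [n for n in nodes
                        if all(n.get(k) == v for k, v in group["subset"].items())]
            if selected:
                nodes = selected
            else:
                fell_back = True  # assumed policy: keep the previous nodes
        trace.append({
            "group": group["name"],
            "matched": matched,
            "fell_back": fell_back,
            "remaining": [n["ip"] for n in nodes],
        })
    return trace


chain = [
    {"name": "region-sg", "match": {"x-region": "sg"}, "subset": {"region": "sg"}},
    {"name": "gray", "match": {"x-gray": "true"}, "subset": {"env": "gray"}},
]
nodes = [
    {"ip": "10.0.0.1", "region": "sg", "env": "prod"},
    {"ip": "10.0.0.2", "region": "sg", "env": "prod"},
    {"ip": "10.0.0.3", "region": "us", "env": "gray"},
]
trace = simulate(chain, {"x-region": "sg", "x-gray": "true"}, nodes)
for step in trace:
    print(step)
```

Each trace entry records whether the group matched, whether a fallback occurred, and which machines remain, which is the information a routing diagram would need to draw one stage of the funnel.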
For example, consider the following routing request:
Simulating it on the simulation platform yields a route execution diagram that is exactly the same as the route selection made online. The route selection process and result are clear at a glance, and the result is as follows:
Conclusion
By delivering incremental value to the business, we were able to explore and solve the problems of adopting a mesh together with business teams and to grow together, which offers a feasible path for promoting Service Mesh at large scale. We have now built a complete product system around Service Mesh. Besides supporting a large number of e-commerce businesses within Alibaba Group, we have also made a number of capabilities available through open source and on the cloud. In the future, we will continue to invest in value exploration, adoption paths, traffic governance standards, a high-performance service mesh, and other areas, and will share the experience Alibaba has accumulated in the Mesh field with the industry in a timely manner. We look forward to seeing Service Mesh flourish as this future-oriented technology is built out.
This article is the original content of Aliyun and shall not be reproduced without permission.