Cloud native is in full swing as a concept, but only a handful of companies have actually rolled it out at large scale. Ant was one of the earlier companies in China to try. After more than two years of exploration, it has settled on a workable approach, which passed the test of Double 11.

I. Why do we need Service Mesh?

Why do we need Service Mesh, and what is its value to the business? We summarize three points:

1. Decouple microservice governance from business logic.

2. Unified management of heterogeneous systems.

3. Financial-grade network security.

Each point is explained below.

1. Decouple microservice governance from business logic

Before Service Mesh, a traditional microservice system typically worked like this: the middleware team provided an SDK for business applications to use, and the SDK implemented the service governance capabilities, such as service discovery, load balancing, circuit breaking and rate limiting, and service routing.

At runtime, the SDK and the business application code run together in the same process and are tightly coupled, which causes a series of problems:

  • **High upgrade cost.** Every upgrade requires the business application to change the SDK version number and release again. At Ant, we used to spend thousands of person-days each year upgrading middleware versions.
  • **Severe version fragmentation.** Because upgrades are costly while the middleware keeps moving forward, the SDK versions running online drift apart over time and their capabilities become uneven, which makes unified governance difficult.
  • **Difficult middleware evolution.** Because of the version fragmentation, the middleware has to stay compatible with all kinds of old-version logic in its code as it evolves. It is like wearing “shackles”, and rapid iteration is impossible.

With Service Mesh, we can strip most of the SDK’s capabilities out of the application and split them into an independent process that runs in Sidecar mode. By sinking service governance capabilities into the infrastructure, the business can focus on business logic and the middleware team can focus on building common capabilities. The two evolve independently, upgrades become transparent, and overall efficiency improves.

2. Unified management of heterogeneous systems

As new technologies develop, applications and services in different languages and frameworks often appear within the same company. Ant is an example: its internal businesses are highly diverse, covering front end, search and recommendation, artificial intelligence, security, and more. Besides Java, they also use NodeJS, Golang, Python, C++, and others. To manage these services in a unified way, we had to redevelop a complete SDK for each language and each framework, which was very expensive to maintain and put great strain on our team structure.

With Service Mesh, the main body of service governance capability sinks into the infrastructure, so multi-language support becomes much easier. We only need to provide a very lightweight SDK, and in many cases no separate SDK at all, to meet unified traffic control, monitoring, and governance requirements across languages and protocols.

3. Financial-grade network security

Many companies currently build microservices on the assumption of “intranet trust”, but this assumption may no longer be appropriate now that workloads are moving to the cloud at large scale, especially in financial scenarios.

With Service Mesh, we can more easily implement identity authentication and access control for applications. Combined with data encryption, this gives us full-link trust, so that services can run in zero-trust networks and the overall security level improves.

II. Large-scale rollout practice of Ant Service Mesh

Because of these benefits, we started exploring Service Mesh and ran a small pilot in early 2018. However, when we promoted Service Mesh to the business teams, we were met with some soul-searching questions:

**1. Do we need to change the business code?** Business teams face heavy business pressure every day and do not have much energy for technical transformation.

**2. The upgrade process must not affect our business.** For the company’s business, stability always comes first, and the new architecture must not affect the business.

**3. The rest is up to you.** The implication is that the business team is willing to cooperate with us on rolling out Service Mesh as long as we can keep the transformation cost low and the stability good.

Which brings us to the well-known product value formula:

product value = (new experience - old experience) - migration cost

From this formula, we can see:

**“New experience - old experience”** is the full set of Service Mesh benefits described above, which we want to maximize.

**“Migration cost”** is the cost of moving the business to the new Service Mesh architecture, which we want to minimize. It mainly includes:

  • **Access cost:** How do existing systems connect to Service Mesh? Does the business need to be modified?
  • **Smooth migration:** Many business systems are already running in production. Can they be migrated smoothly to the Service Mesh architecture?
  • **Stability:** Service Mesh is a brand-new architecture. How do we ensure stability after the business migrates?

Now let’s see what Ant did.

Access cost

Since Ant’s services uniformly use the SOFA framework, we minimized the access cost for the business by modifying the SOFA SDK to detect its operating mode automatically. When the SDK finds that Service Mesh is enabled in the runtime environment, it connects to the Sidecar automatically. If Service Mesh is not enabled, it keeps running in non-Service-Mesh mode. The business side only needs to upgrade the SDK once to complete access, with no change to business code.
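As a rough illustration of this kind of mode detection, here is a minimal Python sketch. The article does not say how the SOFA SDK actually detects Service Mesh, so the environment variable `MESH_ENABLED` and the two client classes are purely hypothetical names chosen for this example.

```python
import os

# Hypothetical switch name; the real SOFA SDK mechanism is not given in the article.
MESH_ENABLED_ENV = "MESH_ENABLED"


class DirectClient:
    """Non-mesh mode: the SDK talks to the registry and remote services itself."""
    mode = "direct"


class SidecarClient:
    """Mesh mode: the SDK hands registration, subscription and calls to the local Sidecar."""
    mode = "sidecar"


def select_client(env=None):
    """Pick the operating mode once at startup, based on the runtime environment."""
    env = os.environ if env is None else env
    enabled = env.get(MESH_ENABLED_ENV, "false").strip().lower() == "true"
    return SidecarClient() if enabled else DirectClient()
```

Because the decision lives inside the SDK, business code calls the same client interface in both modes, which is what makes a one-time SDK upgrade sufficient.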

Let’s take a look at how the SDK works with Sidecar, starting with the process of service discovery:

1. Assume that the server runs on 1.2.3.4 and listens on port 20880. First, the server sends a service registration request to its Sidecar, telling it which service to register and the IP + port (1.2.3.4:20880).

2. The Sidecar on the server sends a service registration request to the registry, telling it which service to register and the IP + port. Note that the registered port is not the application’s port (20880) but a port that the Sidecar itself listens on (for example, 20881).

3. The calling end sends a service subscription request to its Sidecar, specifying the service it wants to subscribe to.

4. The Sidecar on the calling end pushes the service address to the calling end. Note that the pushed IP address is the local address and the port is the one that the calling end’s Sidecar listens on (for example, 20882).

5. The Sidecar on the calling end sends a service subscription request to the registry, specifying the service it wants to subscribe to.

6. The registry pushes the service address (1.2.3.4:20881) to the Sidecar.
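The six steps above can be sketched with a small in-memory model. This is only an illustration under simplifying assumptions: the `Registry` and `Sidecar` classes, the service name `demo.HelloService`, and the caller IP are invented for the example, and the asynchronous "push" is collapsed into a synchronous return value.

```python
from collections import defaultdict


class Registry:
    """In-memory stand-in for the service registry."""

    def __init__(self):
        self._providers = defaultdict(list)

    def register(self, service, address):
        self._providers[service].append(address)

    def subscribe(self, service):
        return list(self._providers[service])


class Sidecar:
    """Sidecar that sits between one application and the registry."""

    def __init__(self, registry, host_ip, listen_port):
        self.registry = registry
        self.host_ip = host_ip
        self.listen_port = listen_port
        self.local_apps = {}   # service -> local application address (inbound)
        self.remote = {}       # service -> remote addresses from the registry (outbound)

    def register(self, service, app_port):
        # Steps 1-2: remember the real app port, publish the Sidecar port instead.
        self.local_apps[service] = ("127.0.0.1", app_port)
        self.registry.register(service, (self.host_ip, self.listen_port))

    def subscribe(self, service):
        # Steps 3-6: fetch real addresses from the registry, but hand the
        # application a local address that points back at this Sidecar.
        self.remote[service] = self.registry.subscribe(service)
        return ("127.0.0.1", self.listen_port)


registry = Registry()
server_sidecar = Sidecar(registry, "1.2.3.4", 20881)
client_sidecar = Sidecar(registry, "5.6.7.8", 20882)  # caller IP is made up

server_sidecar.register("demo.HelloService", 20880)
pushed_to_app = client_sidecar.subscribe("demo.HelloService")
```

The point of the sketch is the address rewriting on both sides: the registry only ever sees Sidecar ports, and the calling application only ever sees a local address.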

Let’s look at the service communication process:

1. The calling end obtains the server address 127.0.0.1:20882 and sends the service call to this address.

2. After receiving the request, the Sidecar on the calling end parses the request header to find out which service is being called, looks up the address returned by the registry (1.2.3.4:20881), and makes the actual call.

3. After receiving the request, the Sidecar on the server processes it and finally forwards it to the server (127.0.0.1:20880).

With the preceding procedure, the SDK and Sidecar are connected. One might ask: why not use the iptables solution? The main reason is that iptables performance degrades badly when too many rules are configured. More importantly, iptables offers poor manageability and observability, which makes troubleshooting difficult.

Smooth migration

Ant’s production environment runs a large number of business systems with complex upstream and downstream dependencies. Some are core applications where even slight jitter can cause failures. For an architecture transformation as large as Service Mesh, smooth migration is therefore mandatory, and grayscale rollout and rollback must both be supported.

Thanks to the registry we kept in the architecture, the smooth migration solution is also straightforward:

1. Initial status

In the following diagram, there is an initial service provider and a service caller.

2. Transparently migrate callers

Our solution imposes no requirement on whether the caller or the service provider migrates first. Suppose the caller wants to migrate to Service Mesh first. As soon as Sidecar injection is enabled on the caller, the SDK automatically recognizes that Service Mesh is enabled, subscribes to services through the Sidecar, and communicates with it. The Sidecar in turn subscribes to the service and communicates with the real service provider, which is completely unaware that the caller has migrated. The caller can therefore enable Sidecar one instance at a time in grayscale and roll back if there is a problem.

3. Transparently migrate service providers

Suppose instead that the server wants to migrate to Service Mesh first. As soon as Sidecar injection is enabled on the server, the SDK automatically recognizes that Service Mesh is enabled, registers through the Sidecar, and communicates with it. The Sidecar then registers itself with the registry as the service provider. The caller still subscribes to the service from the registry and is unaware that the provider has migrated. The server can therefore enable Sidecar one instance at a time in grayscale and roll back if there is a problem.

4. Final state

Finally, the final state is reached: both the caller and the server have migrated smoothly to Service Mesh, as shown in the figure below.
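To see why the migration order does not matter, it can help to enumerate the four combinations of caller and provider state. The following Python sketch reuses the example addresses from the service discovery walkthrough; the function names are hypothetical, and it models only the addresses involved, not real network traffic.

```python
APP_PORT = 20880               # port the provider application listens on
PROVIDER_SIDECAR_PORT = 20881  # port the provider's Sidecar listens on
CALLER_SIDECAR_PORT = 20882    # port the caller's Sidecar listens on
PROVIDER_IP = "1.2.3.4"


def registered_address(provider_meshed):
    """Address the registry holds for the provider."""
    port = PROVIDER_SIDECAR_PORT if provider_meshed else APP_PORT
    return f"{PROVIDER_IP}:{port}"


def call_path(caller_meshed, provider_meshed):
    """Addresses a request passes through for one combination of migration states."""
    hops = []
    if caller_meshed:
        hops.append(f"127.0.0.1:{CALLER_SIDECAR_PORT}")  # caller -> its own Sidecar
    hops.append(registered_address(provider_meshed))      # across the network
    if provider_meshed:
        hops.append(f"127.0.0.1:{APP_PORT}")              # Sidecar -> provider app
    return hops


for caller_meshed in (False, True):
    for provider_meshed in (False, True):
        print(caller_meshed, provider_meshed, call_path(caller_meshed, provider_meshed))
```

In every combination the network hop is simply whatever address the registry holds, so each side can switch its Sidecar on or off without the other side changing anything.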

Stability

With the introduction of the Service Mesh architecture, we have initially decoupled the application from the infrastructure, greatly speeding up the iteration of the infrastructure, but what does this mean for stability?

In SDK mode, after the middleware engineers release an SDK, business applications upgrade one by one, moving step by step through the development, test, pre-release, grayscale, and production environments and completing functional verification along the way. To some extent, a large number of business engineers are helping to test the middleware product, and the upgrade proceeds gradually from small environments, so the risk is very small.

However, with Service Mesh, business applications and infrastructure are decoupled. This speeds up iteration, but it also means that we can no longer use the previous model to ensure stability. We need not only to ensure product quality during the development phase, but also to control risk during changes online.

Given the size of Ant’s clusters, where an online change often involves hundreds of thousands of containers, how do we ensure the stability of an upgrade at such a scale? Our answer: unattended change.

Before looking at unattended change, let’s look at autonomous driving. The figure below defines the maturity levels of autonomous driving, from L0 to L5. L0 corresponds to how most of us drive today: the car has no automation and must be fully controlled by the driver. L5 is the highest level and means truly driverless operation. Tesla’s automated driving, for example, sits between L2 and L3, with automated driving available in certain scenarios.

We also defined the level of unattended change with reference to this system, as shown in the figure below:

L0: Purely manual change, performed as black-screen (command-line) operations without any tool assistance.

L1: Some tools exist, but they are not systematic. Making a change requires chaining different tools together by hand.

L2: Preliminary automation. The system orchestrates and chains the whole change process by itself and enforces grayscale rollout. Compared with L1, people’s hands are freed: each change only requires submitting one work order.

L3: The system has observation capability. If an anomaly occurs during a change, it notifies the user and blocks the change. Compared with L2, people’s eyes are freed as well.

L4: One step further, the system has decision-making capability. When it finds a problem with a change, it can handle it automatically and self-heal. Compared with L3, people’s brains are freed: changes can run in the middle of the night, problems are handled automatically according to predefined plans, and a person is called in only when the system cannot resolve the issue.

L5: The final state. After submitting a change, the user can walk away. The system executes it automatically and ensures that nothing goes wrong.

At present, by our own assessment we have reached L3, which is mainly reflected in:

1. The system automatically arranges the batching strategy to enforce grayscale rollout.

2. Change defense is introduced, with pre- and post-checks added so that a change can be blocked promptly when problems occur.

The change defense process is as follows:

  • After a change work order is submitted, the system splits the change into batches by data center, application, and unit, and starts changing batch by batch.
  • Before each batch, pre-checks run, such as whether the current time falls in a business peak period or a fault period, and whether system capacity is sufficient.
  • If a pre-check fails, the change is terminated and the engineer who submitted it is notified. If the pre-checks pass, the Mosn upgrade or access process starts.
  • After the batch is changed, post-checks run. These cover business monitoring (for example, whether the transaction and payment success rates have dropped) and service health (for example, whether RT and error rate are abnormal). Upstream and downstream systems are also checked, and alarms are correlated to see whether any fault occurred during the change.
  • If a post-check fails, the change is terminated and the engineer who submitted it is notified. If the post-checks pass, the next batch starts.
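The change defense flow above is essentially a batched loop guarded by checks on both sides. Here is a minimal Python sketch of that control flow. It is not Ant’s implementation: `run_change`, `ChangeResult`, and the stubbed checks are hypothetical, and real checks would query monitoring and capacity systems rather than return constants.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ChangeResult:
    completed: List[str] = field(default_factory=list)
    blocked_at: str = ""   # batch that failed a check, empty if none
    reason: str = ""


def run_change(batches: List[str],
               pre_checks: List[Callable[[str], bool]],
               apply: Callable[[str], None],
               post_checks: List[Callable[[str], bool]],
               notify: Callable[[str], None]) -> ChangeResult:
    """Apply a change batch by batch, blocking on the first failed check."""
    result = ChangeResult()
    for batch in batches:
        if not all(check(batch) for check in pre_checks):
            result.blocked_at, result.reason = batch, "pre-check failed"
            notify(f"change blocked before {batch}")
            return result
        apply(batch)
        if not all(check(batch) for check in post_checks):
            result.blocked_at, result.reason = batch, "post-check failed"
            notify(f"change blocked after {batch}")
            return result
        result.completed.append(batch)
    return result


# Usage with stubbed checks: the second batch fails its post-check.
messages = []
outcome = run_change(
    batches=["batch-1", "batch-2", "batch-3"],
    pre_checks=[lambda b: True],              # e.g. not in peak period, capacity OK
    apply=lambda b: None,                     # e.g. upgrade Mosn on this batch
    post_checks=[lambda b: b != "batch-2"],   # e.g. success rate, RT, error rate
    notify=messages.append,
)
```

In this run the change stops after `batch-2`, so `batch-3` is never touched. Limiting the blast radius to the batches already changed is the reason for combining enforced batching with pre- and post-checks.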

Overall architecture

Let’s take a look at the overall architecture of Ant SOFAMesh. “Dual-mode microservices” here refers to the coexistence of traditional SDK-based microservices and Service Mesh microservices, which achieves:

  • **Interconnectivity:** Applications in the two systems can call each other.
  • **Smooth migration:** Applications can migrate smoothly between the two architectures, transparently to their upstream and downstream dependencies.
  • **Flexible evolution:** With interconnectivity and smooth migration in place, we can flexibly adapt applications and evolve the architecture according to the actual situation.

On the control plane, we introduced Pilot to deliver configuration (such as service routing rules), and kept an independent registry for service discovery in order to achieve smooth migration and large-scale rollout.

On the data plane, we use our self-developed Mosn, which supports not only SOFA applications but also Dubbo and Spring Cloud applications.

In terms of deployment mode, we support not only containers/K8s but also virtual machine scenarios.

Rollout scale and business value

At present, Service Mesh covers thousands of Ant applications, with full coverage of the core links. Hundreds of thousands of Pods run in production. On the day of Double 11, it handled tens of millions of QPS, with an average processing RT of less than 0.2 ms.

In terms of business value, the Service Mesh architecture has given us an initial decoupling of infrastructure from business applications. The infrastructure upgrade cadence has risen from 1-2 times per year to 1-2 times per month, which greatly speeds up iteration and saves the whole site thousands of person-days of upgrade cost every year. Using Mosn’s traffic allocation to implement time-shared scheduling, we switched 20,000+ containers in only 3 minutes 40 seconds and saved 36,000+ physical cores, supporting two major promotions without adding machines. For security and trust, we implemented identity authentication, service authentication, and communication encryption, so that services can run in zero-trust networks and the overall security level improves. For service governance, we quickly launched adaptive rate limiting, global rate limiting, single-machine stress testing, and business unit isolation, which greatly raised the level of fine-grained service governance and brought great value to the business.

Looking to the future

At present, we can see very clearly that the entire industry is going through the process from Cloud Hosted to Cloud Ready to Cloud Native.

But the point I want to emphasize here is that we do not pursue technology for technology’s sake. Technology development essentially serves business development. Cloud native is the same: its fundamental purpose is to improve efficiency and reduce cost, so cloud native itself is not an end but a means.

With the large-scale rollout of Service Mesh, we have taken a solid step towards cloud native, verified its feasibility, and seen real improvements in R&D and operations efficiency for both the business and the infrastructure teams.

At present Mosn mainly provides RPC and MQ capabilities, and a lot of infrastructure logic is still embedded in business systems in the form of SDKs. There is still a long way to go before infrastructure and business are truly decoupled, so in the future we will sink more capabilities into Mosn (such as transactions, caching, configuration, and task scheduling), evolving from Service Mesh to Mesh. Business applications will interact with Mosn through standardized interfaces, without needing to pull in various heavyweight SDKs, and Mosn will evolve from a simple traffic proxy into a next-generation middleware runtime.

In this way, the coupling between business applications and infrastructure can be reduced further, making business applications lighter. As shown in the figure, the evolution from the earliest monolithic applications to microservices decoupled business teams from one another, but did not remove the coupling between business teams and infrastructure teams. The future direction is shown in the third figure. We hope that business applications will move towards pure business logic. By sinking all non-business logic into the Sidecar, the business and the infrastructure can truly evolve independently, increasing overall efficiency.

Another trend is Serverless. Currently, limited by factors such as application size and startup speed, Serverless is mainly used in Function scenarios.

However, we have always believed that Serverless is not limited to Function scenarios. Its elasticity, freedom from operations and maintenance, and on-demand usage are clearly even more valuable for ordinary business applications.

Once a business application has evolved into Micrologic + Sidecar, the application itself becomes smaller and starts faster. In addition, the infrastructure can do more optimization (for example, connecting to the database in advance and preparing cached data in advance). In this way, ordinary business applications can also be brought into the Serverless system and truly enjoy the efficiency, cost, and other benefits that Serverless brings.