Introducing the Mesh pattern is a key path toward making applications cloud native, and Ant Group has already rolled it out internally at large scale. As more middleware capabilities such as Message, DB, and Cache Mesh sink into the infrastructure, the application runtime that evolves from Mesh will be the future form of middleware technology. The application runtime is designed to help developers build cloud-native applications quickly and to further decouple applications from infrastructure. At its core is a set of API standards, which we hope the community will build together.

Introduction to Ant Group Mesh

Ant is a company driven by technology and innovation. From its beginnings as a payment application on Taobao to a large company serving 1.2 billion users worldwide, Ant's technical architecture has evolved through roughly the following stages:

Before 2006, the earliest Alipay was a centralized monolithic application, with different businesses developed as modules.

In 2007, as payment spread to more scenarios, we began splitting applications and data and carried out some SOA-oriented transformation.

From 2010, Ant launched phenomenal products such as quick payment, mobile payment, and Yu'ebao, and supported Double Eleven. As the number of users reached 100 million, the number of Ant's applications grew by an order of magnitude, and Ant developed a full set of microservice middleware to support its business.

In 2014, with products such as Jiebei and Huabei, offline payment, and more scenarios and business forms emerging, higher demands were placed on Ant's availability and stability. Ant added LDC unitization support to its microservice middleware to enable multi-site active-active deployment, along with elastic scaling on hybrid cloud to absorb the very large traffic of Double Eleven.

In 2018, Ant's business was no longer only digital finance. New strategies such as digital life and internationalization emerged, pushing us toward a more efficient technical architecture so that the business could run faster and more stably. Ant therefore drew on the cloud native ideas popular in the industry and put Service Mesh, Serverless, and trusted-native directions into practice internally.

As can be seen, Ant's technical architecture has kept evolving along with the company's business innovation. Anyone who has worked on microservices will be deeply familiar with the path from centralized to SOA and then to microservices. The practice of moving from microservices to cloud native is something Ant has explored on its own in recent years.

Why introduce a Service Mesh

Since Ant has a complete set of microservice governance middleware, why does it need to introduce Service Mesh?

Take SOFARPC, a service framework developed by Ant itself, as an example. It is a powerful SDK that bundles capabilities such as service discovery, routing, circuit breaking, and rate limiting. In a basic SOFA (Java) application, the business code integrates the SOFARPC SDK, and the two run in one process. After Ant adopted microservices at large scale, we faced the following problems:

High upgrade cost: the SDK has to be imported by the business code, so every upgrade requires the application to change its code and release again. Because there are so many applications, a major technical change or security fix means upgrading thousands of applications at a time.

Version fragmentation: because upgrades are so costly, SDK versions are badly fragmented. New code has to stay compatible with historical logic, which makes overall technical evolution difficult.

Cross-language governance: most of Ant's online middle-office and back-office applications use Java, but there are many applications in other languages, such as C++, Python, and Golang, in the front end, AI, big data, and other fields. Because there is no SDK for those languages, their service governance capability is effectively missing.

We noticed that the ideas behind Service Mesh were starting to emerge in the cloud native world, so we began to explore this direction. Service Mesh has two main concepts: the Control Plane and the Data Plane. We will not expand on the control plane here. The core idea of the data plane is decoupling: complex logic that has nothing to do with the business (such as service discovery, service routing, circuit breaking, rate limiting, and security in RPC calls) is abstracted into an independent process. As long as the communication protocol between the business process and that independent process stays the same, these capabilities can be upgraded independently along with the independent process, and the whole Mesh can evolve in a unified way. Applications in other languages get the same service governance capabilities as long as their traffic passes through our Data Plane. The underlying infrastructure capabilities become transparent to the application, which is what makes it truly cloud native.
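The decoupling described above can be sketched in a few lines of Python. This is a minimal illustration and not MOSN's implementation: the `Sidecar` class, its `registry` and `max_calls` fields, the `business_send` function, and the addresses are all hypothetical names chosen for the example. It shows service discovery, routing, and rate limiting living outside the business code, so they can change without the application being released again.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Sidecar:
    """Hypothetical stand-in for the data-plane process next to an application."""
    registry: Dict[str, List[str]]   # service name -> known provider addresses
    max_calls: int                   # crude per-service call budget (rate limit)
    _calls: Dict[str, int] = field(default_factory=dict)

    def call(self, service: str, send: Callable[[str], str]) -> str:
        # Service discovery: the business code never handles addresses itself.
        addresses = self.registry.get(service)
        if not addresses:
            raise LookupError(f"no provider registered for {service}")
        # Rate limiting: reject once the budget for this service is used up.
        used = self._calls.get(service, 0)
        if used >= self.max_calls:
            raise RuntimeError(f"rate limit exceeded for {service}")
        self._calls[service] = used + 1
        # Routing: simple round robin over the discovered addresses.
        target = addresses[used % len(addresses)]
        return send(target)


def business_send(address: str) -> str:
    # The business side only knows how to hand a request to an address.
    return f"sent to {address}"


sidecar = Sidecar(registry={"pay": ["10.0.0.1:12200", "10.0.0.2:12200"]}, max_calls=2)
first = sidecar.call("pay", business_send)
second = sidecar.call("pay", business_send)
```

Changing the routing policy or the limit here only touches `Sidecar`; `business_send` stays the same, which is the property the data plane is meant to provide.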

Rollout of Ant Mesh

Since the end of 2017, Ant has been exploring the Service Mesh direction, with a vision of unified infrastructure and upgrades that are imperceptible to the business. The main milestones are:

At the end of 2017, we began pre-research on Service Mesh technology and identified it as the future direction.

At the beginning of 2018, we started developing the sidecar MOSN in Golang and open-sourced it, mainly supporting RPC in a small-scale pilot for Double Eleven.

In 2019, the Message Mesh and DB Mesh forms were added, covering several core links and supporting 618.

In 2019, Mesh covered hundreds of applications across all core big-promotion links and supported that year's big promotion.

By Double Eleven (November 11) 2020, more than 80% of online applications across the site were connected to Mesh, and the Mesh system could take a capability from development to a site-wide upgrade within 2 months.

Ant Mesh rollout architecture

At present, Mesh runs in thousands of applications and hundreds of thousands of containers at Ant. A rollout at this scale is among the largest in the industry, and there was no existing path to follow, so Ant also built a complete R&D and operations system during the rollout to support it.

The architecture of Ant Mesh is roughly as shown in the figure. At the bottom is the control plane, which hosts the server side of the service governance center, PaaS, the monitoring center, and other platforms, all of them existing products. Alongside it is our operations system, including the R&D platform and the PaaS platform. In the middle is our main data plane, MOSN, which manages four kinds of traffic (RPC, message, MVC, and task) and has also absorbed basic capabilities such as health checks, monitoring, configuration, security, and technical risk control. MOSN also shields the business from some interactions with the underlying platforms. DBMesh is an independent product at Ant and is not shown in the figure. On the top layer are our applications, which currently support Java, Node.js, and other languages.

Although Mesh decouples applications from infrastructure, connecting an application to it still incurs a one-time upgrade cost. To drive adoption, Ant therefore connected the whole R&D and operations workflow: simplifying access as far as possible in the existing frameworks, rolling out in batches to control risk and schedule, having new applications connect to Mesh by default, and so on.

At the same time, as more and more capabilities sank, they ran into R&D collaboration problems and even affected one another's performance and stability. We therefore also improved the R&D efficiency of Mesh itself, for example with modular isolation, dynamic plugging and unplugging of new capabilities, and automated regression. At present, a sunk capability can go from development to site-wide rollout within 2 months.

Exploring the cloud native application runtime

New problems and reflections after large-scale rollout

After rolling out Ant Mesh at large scale, we ran into some new problems.

High maintenance cost of cross-language SDKs: take RPC as an example. Most of the logic has sunk into MOSN, but some communication codec logic still lives in a lightweight Java SDK, which carries its own maintenance cost. There are as many lightweight SDKs as there are languages, and one team cannot be proficient in all of them, so the code quality of these lightweight SDKs is a problem.

New scenarios where the business must be compatible with different environments: some Ant applications are deployed inside Ant and also exported to financial institutions. When deployed inside Ant they connect to Ant's control plane, and when deployed at a bank they connect to the bank's existing control plane. Most applications currently handle this by wrapping a layer of their own code and adding ad hoc support for components that are not yet supported.

From Service Mesh to multi-Mesh: Ant's earliest scenario was Service Mesh. MOSN intercepted traffic by proxying network connections, while other middleware still talked to its servers through the original SDKs. Today MOSN is no longer only a Service Mesh but a multi-Mesh: besides RPC, we support Mesh implementations of more middleware, including message, configuration, and cache. For almost every piece of middleware that sinks, there is a corresponding lightweight SDK on the application side. Combined with the first problem above, that leaves a great many lightweight SDKs to maintain. To keep the functions from affecting one another, each function opens a different port and talks to MOSN over a different protocol: the RPC protocol for RPC, the MQ protocol for messages, the Redis protocol for cache. MOSN is also no longer purely traffic-oriented. Configuration, for example, exposes an API for business code to call.

To address these problems and scenarios, we have been thinking about the following questions:

1. Can the SDKs for different middleware and languages share a consistent style?

2. Can the interaction protocols of the sunk capabilities be unified?

3. Should we sink middleware by component or by capability?

4. Can the underlying implementation be replaced?

Ant's cloud native application runtime architecture

Since March of last year, after several rounds of internal discussion and a survey of new ideas in the industry, we have arrived at a concept we call the "cloud native application runtime" (runtime for short). As the name suggests, we want this runtime to include all the distributed capabilities that applications care about, helping developers build cloud-native applications quickly and further decoupling applications from infrastructure!

The core points of the runtime's design are as follows. First, because of our experience with large-scale rollout and the supporting operations system, we decided to build the cloud native application runtime on the MOSN kernel. Second, the runtime's APIs are defined by capability rather than by component. Third, business code interacts with the Runtime API over a single protocol, gRPC, so the business side can generate a client directly from the proto file and call it. Fourth, the component behind each capability can be replaced. For example, the provider of the registry service can be SOFARegistry, Nacos, or Zookeeper.

Runtime capability abstraction

To abstract the capabilities that cloud native applications need most, we first set a few principles:

1. Focus on the APIs and scenarios that distributed applications need, rather than on components.

2. APIs should be intuitive and work out of the box, with convention over configuration.

3. APIs are not bound to an implementation. Implementation-specific differences go in extension fields.

With these principles in place, we abstracted three sets of APIs: mosn.proto for the application calling the runtime, appcallback.proto for the runtime calling the application, and exoskeleton.proto for operating the runtime. For example, RPC calls, sending messages, reading the cache, and reading configuration are calls from the application to the runtime. Receiving RPC requests, receiving messages, and receiving task scheduling are calls from the runtime to the application. Monitoring checks, component management, and flow control belong to runtime operations.

Examples of the three protos can be seen below:

Runtime component control

To make the runtime's implementations replaceable, we also introduced two concepts in MOSN. We call each distributed capability a Service, and Components implement Services. A Service can be implemented by multiple Components, and a Component can implement multiple Services. For example, the “Mq-pub” messaging Service has two Components, SOFAMQ and Kafka, while the Kafka Component implements two Services: sending messages and health checking. When the business makes a request through the gRPC-generated client, the data is sent to the Runtime over the gRPC protocol and dispatched to a concrete implementation behind it. In this way, an application uses a single set of APIs and connects to different implementations through request parameters or runtime configuration.
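The Service and Component relationship can be illustrated with a small dispatch table. This is a sketch under assumptions, not the runtime's actual code: the `Runtime` class, its `register` and `invoke` methods, the "health" service name, and the lambda handlers are hypothetical. Only the "Mq-pub" service and the SOFAMQ and Kafka components come from the text.

```python
from typing import Callable, Dict, Optional

Handler = Callable[[str], str]


class Runtime:
    """Hypothetical dispatcher: one API per Service, many Components behind it."""

    def __init__(self) -> None:
        # service name -> component name -> handler implementing that service
        self._services: Dict[str, Dict[str, Handler]] = {}
        # service name -> component used when the request does not name one
        self._default: Dict[str, str] = {}

    def register(self, service: str, component: str, handler: Handler) -> None:
        self._services.setdefault(service, {})[component] = handler
        # The first component registered for a service becomes its default.
        self._default.setdefault(service, component)

    def invoke(self, service: str, payload: str, component: Optional[str] = None) -> str:
        components = self._services[service]
        # A request parameter may select the component; otherwise configuration decides.
        name = component or self._default[service]
        return components[name](payload)


runtime = Runtime()
# One Service ("Mq-pub") implemented by two Components.
runtime.register("Mq-pub", "SOFAMQ", lambda msg: f"SOFAMQ published {msg}")
runtime.register("Mq-pub", "Kafka", lambda msg: f"Kafka published {msg}")
# One Component (Kafka) implementing a second Service.
runtime.register("health", "Kafka", lambda _: "Kafka healthy")

default_result = runtime.invoke("Mq-pub", "order-1")
kafka_result = runtime.invoke("Mq-pub", "order-1", component="Kafka")
health_result = runtime.invoke("health", "")
```

The caller's code is identical in both `Mq-pub` calls apart from the optional selector, which mirrors the idea that the API stays fixed while the implementation behind it varies.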

Runtime vs. Mesh

To sum up, a simple comparison between the cloud native application runtime and the previous Mesh is as follows:

Adoption scenarios for the cloud native application runtime

The runtime has been under research and development since last year. Currently it is mainly used in the following scenarios.

Heterogeneous technology stack access

Inside Ant, applications written in different languages need more than RPC service governance and messaging. They also want to use Ant's unified middleware and other infrastructure capabilities. Java and Node.js have corresponding SDKs, while other languages do not. With the application runtime in place, applications in these heterogeneous languages can call the runtime directly through a gRPC client and connect to Ant's infrastructure.

Avoiding vendor lock-in

As mentioned earlier, Ant businesses such as blockchain, risk control, intelligent customer service, and the financial platform are deployed both on the main site and on Alibaba Cloud or proprietary clouds. With the runtime, an application can build one set of code together with the runtime into a single image and use configuration to decide which underlying implementation to call, without being tied to a specific implementation. For example, it connects to SOFARegistry and SOFAMQ inside Ant, to Nacos and RocketMQ on the public cloud, and to Zookeeper and Kafka on a private cloud. We are in the middle of rolling out this scenario. The same approach can also be used for legacy system governance, such as upgrading from SOFAMQ 1.0 to SOFAMQ 2.0 without upgrading the applications connected to the runtime.
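A configuration-driven binding of this kind might look like the following sketch. The JSON layout, the environment names ("ant", "public-cloud", "private-cloud"), and the functions `load_bindings` and `publish_order` are assumptions made for illustration. Only the component names per environment are taken from the text.

```python
import json

# Hypothetical runtime configuration shipped alongside the same application image.
RUNTIME_CONFIG = """
{
  "ant":           {"registry": "SOFARegistry", "mq": "SOFAMQ"},
  "public-cloud":  {"registry": "Nacos",        "mq": "RocketMQ"},
  "private-cloud": {"registry": "Zookeeper",    "mq": "Kafka"}
}
"""


def load_bindings(config_text: str, environment: str) -> dict:
    """Return the capability -> component mapping for one deployment environment."""
    profiles = json.loads(config_text)
    if environment not in profiles:
        raise KeyError(f"no runtime profile for environment {environment!r}")
    return profiles[environment]


def publish_order(bindings: dict, order_id: str) -> str:
    """Business code: asks for the 'mq' capability, never names a vendor."""
    return f"order {order_id} published via {bindings['mq']}"


# The same business function runs unchanged in all three environments.
results = {
    env: publish_order(load_bindings(RUNTIME_CONFIG, env), "42")
    for env in ("ant", "public-cloud", "private-cloud")
}
```

Swapping SOFAMQ for another provider in one environment is then a configuration change only, which is also how the legacy-upgrade case described above would be handled.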

FaaS cold-start pre-warming pool

The FaaS cold-start pre-warming pool is another scenario we have been exploring recently. As is well known, a cold start of a FaaS Function has to create the Pod, download the Function, and then start it, which takes a long time. With the runtime, we can create the Pod and start the runtime in advance, so that by the time the application starts, only very simple application logic remains to be launched. In our tests this shortened the start time from 5s to 1s, a reduction of 80%. We will continue to explore this direction.
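The pre-warming idea and the 80% figure can be sketched as follows. This is a simplified model under assumptions: the `Pod` and `WarmPool` classes and the function name "resize-image" are hypothetical, and no real Pod is created. Only the 5s and 1s timings come from the text.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Pod:
    name: str
    runtime_started: bool = False
    function: Optional[str] = None


class WarmPool:
    """Hypothetical pool of Pods whose runtime is already running."""

    def __init__(self, size: int) -> None:
        # Pods are created, and their runtime started, before any request arrives.
        self._idle = deque(Pod(f"pod-{i}", runtime_started=True) for i in range(size))

    def acquire(self, function: str) -> Tuple[Pod, bool]:
        """Return a Pod for the function and whether it came from the warm pool."""
        if self._idle:
            pod, warm = self._idle.popleft(), True
        else:
            # Pool exhausted: fall back to the slow path and build everything now.
            pod, warm = Pod("pod-cold", runtime_started=True), False
        pod.function = function  # only the function itself is left to load
        return pod, warm


def reduction(before_s: float, after_s: float) -> float:
    """Fraction of start-up time saved."""
    return (before_s - after_s) / before_s


pool = WarmPool(size=1)
warm_pod, was_warm = pool.acquire("resize-image")
cold_pod, second_was_warm = pool.acquire("resize-image")
saved = reduction(5.0, 1.0)  # timings reported in the text: 5s before, 1s after
```

Here `saved` works out to 0.8, matching the 80% reported: (5 - 1) / 5 = 0.8.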

Planning and Outlook

Building the API together

One of the most important parts of the runtime is the definition of its API. We already have a fairly complete set of APIs in use internally, but we have also seen many products in the industry with similar goals, such as Dapr and Envoy. So one of the things we will be doing next is working with the communities to arrive at a set of cloud-native APIs that everyone can agree on.

Open source

In addition, we will gradually open source our internal runtime practice in the near future. We expect to release version 0.1 in May or June and then a minor version each month, aiming to release version 1.0 before the end of the year.

Conclusion

A final summary:

1. Introducing the Service Mesh pattern is a key path toward cloud native applications.

2. Any middleware can be meshed, but the problem of R&D efficiency remains.

3. Rolling out Mesh at large scale is an engineering effort that requires a complete supporting system.

4. The cloud native application runtime will be the future form of middleware and other basic technologies, further decoupling applications from distributed capabilities.

5. The API is the core of the cloud native application runtime, and we hope the community will build a standard together.
