*Service Mesh* is the foundation for the next generation of microservice architecture. Ant Group has been exploring and piloting the technology since early 2018. Service Mesh now covers thousands of Ant applications, with full coverage of the core call paths.

Through the large-scale rollout of Service Mesh, Ant Group has taken a solid step toward cloud native and verified its feasibility. It has also seen first-hand that moving infrastructure capabilities down into a shared layer improves research and development efficiency and operations efficiency for both business and infrastructure teams, and reduces costs.

At the same time, Ant is actively opening its mature technology to the community. Its self-developed data plane, MOSN, is now open source, and open source enthusiasts are welcome to help build it.

github.com/mosn/mosn

Foreword

Microservice architecture has become the mainstream system architecture for Internet companies and financial institutions. Its core is a service framework that integrates service communication and service governance. As microservice frameworks continue to evolve, the service mesh (Service Mesh) has emerged as a new type of microservice architecture. Because it is flexible and broadly applicable, it is considered to have good development prospects.

The Industrial and Commercial Bank of China (hereinafter ICBC) took the initiative to explore the service mesh field and began pre-research on service mesh technology in 2019. After in-depth research and practice, it established its service mesh platform in 2021. Integrating the service mesh with the existing microservice architecture will help ICBC move its application architecture toward a distributed, service-oriented model, and the platform will carry the open-platform core banking system in the future.

PART. 1 Service mesh development status in the industry

Since the birth of service mesh technology in 2016, many open source products have emerged in the industry, such as Istio (Google + IBM + Lyft), Linkerd (Twitter), and Consul (HashiCorp). Among them, Istio has the most active community and the widest recognition, and is regarded as the benchmark open source service mesh product.

A service mesh is an infrastructure layer dedicated to handling communication between services. It takes over the traffic of the service container by injecting a Sidecar container into the service Pod. The Sidecar container connects to the control plane of the mesh platform and governs the proxied traffic according to the policies the control plane delivers. The governance capabilities of the original service framework are thereby moved down into the Sidecar container, so that basic framework capabilities sink into the infrastructure and are decoupled from the business system.

Figure 1: Service mesh schematic

After the Sidecar container takes over the inbound and outbound traffic of back-end services, services communicate with each other through standard protocols, so services written in different languages and using different protocols can call each other. In addition, the Sidecar container can govern the proxied traffic in a unified way, for example through service routing, encryption, and collection of monitoring data.

Figure 2: Service mesh request flow diagram

PART. 2 Service mesh technology at ICBC

Exploration and Practice

ICBC started its IT architecture transformation project in 2015. To date, the distributed system covers more than 240 key applications, about 480,000 distributed service provider nodes run in production, and cluster processing capacity has gradually come to exceed mainframe capacity. While ICBC's distributed service platform stably supports the existing business systems, it also faces some challenges that are common in the industry, such as:

(1) Interconnecting technology stacks written in different languages requires developing multiple sets of basic frameworks, so development and maintenance costs are high.

(2) With multiple product lines, applications use different versions of the basic framework. Pushing applications to upgrade the framework takes a long time, and multiple framework versions run in parallel in production, which creates heavy compatibility pressure.

To address these pain points, ICBC actively introduced service mesh technology, exploring the decoupling of business systems from infrastructure and improving service governance capabilities.

Integrate with the microservice framework to build an enterprise-level service mesh platform

While building the Service Mesh platform, ICBC integrated the registry, service monitoring, and other infrastructure of the original distributed system. Only the most basic capability of the original service framework client, the encoding and decoding of the communication protocol, is retained in the business system as a lightweight client. The client's other capabilities are moved down into the Sidecar. This keeps the mesh compatible with the service framework as it evolves and allows a smooth transition.

At present, ICBC has completed construction of the service mesh (Service Mesh) platform and integrated it with the distributed service platform. This connects service governance and monitoring across systems written in different languages, decouples business systems from middleware, and enriches traffic governance capabilities. Business pilots have been carried out in applications such as smart interest and character recognition.

Figure 3: Comparison of the service mesh Sidecar and the microservice SDK

The service mesh control plane contains modules such as the configuration center, registry, security center, control center, monitoring center, and log center. The data plane Sidecar uses the same communication protocols as the original service framework (Dubbo/Spring Cloud), which supports interconnection and smooth migration between the service mesh system and the original service framework system.

Figure 4: ICBC service mesh architecture diagram

Explore enterprise-level solutions that support large-scale deployment and smooth migration

ICBC's service mesh has been put into practice in big data and high-frequency online service scenarios, covering traffic proxy deployment modes, smooth migration, and performance optimization.

(1) Non-intrusive traffic proxy deployment mode in big data scenarios

ICBC applications are developed mainly in Java, but Python is also widely used in the big data field. For heterogeneous-language scenarios, the service mesh platform provides a non-intrusive traffic proxy scheme based on transparent hijacking, which makes it easier for applications in other languages to join the mesh. The core of the non-intrusive traffic proxy is to modify the network iptables rules so that traffic entering and leaving the service container is intercepted and redirected to the Sidecar container.

The concrete implementation is as follows: when the Pod starts, an Init Container modifies the Pod's iptables rules. These rules forcibly redirect traffic entering and leaving the service container to the Sidecar container, so that the Sidecar container takes over the service container's traffic.

Figure 5: Schematic diagram of transparent hijacking traffic proxy
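
As a rough illustration of the redirect described above, the following Python sketch builds the kind of iptables NAT rules an Init Container might apply. The port numbers (15006 inbound, 15001 outbound), the Sidecar UID 1337, and the function name are assumptions borrowed from common Istio-style setups, not details of the ICBC platform. The sketch only prints the commands and does not execute them.

```python
def build_redirect_rules(inbound_port=15006, outbound_port=15001, sidecar_uid=1337):
    """Return iptables commands that steer a Pod's TCP traffic to a Sidecar.

    All default values are illustrative assumptions, not values from the article.
    """
    return [
        # Inbound: TCP traffic arriving at the Pod goes to the Sidecar's inbound listener.
        f"iptables -t nat -A PREROUTING -p tcp -j REDIRECT --to-ports {inbound_port}",
        # Outbound: traffic sent by the Sidecar itself is skipped to avoid a redirect loop.
        f"iptables -t nat -A OUTPUT -p tcp -m owner --uid-owner {sidecar_uid} -j RETURN",
        # Outbound: loopback traffic stays local.
        "iptables -t nat -A OUTPUT -p tcp -d 127.0.0.1/32 -j RETURN",
        # Outbound: all remaining TCP traffic goes to the Sidecar's outbound listener.
        f"iptables -t nat -A OUTPUT -p tcp -j REDIRECT --to-ports {outbound_port}",
    ]


for rule in build_redirect_rules():
    print(rule)
```

Rule order matters in this sketch: the two RETURN rules must come before the final catch-all redirect, otherwise the Sidecar's own outbound traffic would be redirected back to itself.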

However, iptables presents significant performance and maintainability challenges, so for high-frequency online service scenarios we provide a traffic proxy solution in which a lightweight client cooperates with the Sidecar.

(2) Low-intrusion traffic proxy deployment mode in high-frequency online scenarios

In high-frequency online service scenarios, we introduce a lightweight client into the business application. Transparently to the business, the client changes the application's service registration and discovery behavior: registration and subscription requests that would originally go to the registry are sent instead to the local Sidecar at 127.0.0.1, and the Sidecar then registers and subscribes with the registry on the application's behalf. After the service container subscribes through the Sidecar, the destination address it obtains for a service is the local Sidecar address, 127.0.0.1. All subsequent requests are sent directly to the Sidecar, which forwards them to the real destination IP address, implementing the traffic proxy capability.

Figure 6: Schematic diagram of port traffic proxy
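
The registration and subscription flow above can be sketched in a few lines of Python. This is a minimal in-memory model under stated assumptions: the `Registry` and `Sidecar` classes, the port 20880, and the pick-the-first-provider forwarding are all hypothetical and stand in for the real registry, Sidecar, and load balancing.

```python
SIDECAR_HOST = "127.0.0.1"  # local Sidecar address, as described in the text
SIDECAR_PORT = 20880        # hypothetical Sidecar listening port


class Registry:
    """In-memory stand-in for the real registry (hypothetical)."""

    def __init__(self):
        self.providers = {}  # service name -> list of (ip, port)

    def register(self, service, address):
        self.providers.setdefault(service, []).append(address)

    def lookup(self, service):
        return list(self.providers.get(service, []))


class Sidecar:
    """Registers and subscribes with the registry on behalf of the local application."""

    def __init__(self, registry, pod_ip):
        self.registry = registry
        self.pod_ip = pod_ip
        self.routes = {}  # service name -> real provider addresses

    def register(self, service, port):
        # The Sidecar publishes the Pod's real address to the registry.
        self.registry.register(service, (self.pod_ip, port))

    def subscribe(self, service):
        # Real addresses stay inside the Sidecar; the application only
        # ever sees the local Sidecar address.
        self.routes[service] = self.registry.lookup(service)
        return (SIDECAR_HOST, SIDECAR_PORT)

    def forward(self, service):
        # Pick the real destination for a request (first provider, for brevity).
        providers = self.routes.get(service)
        if not providers:
            raise LookupError(f"no provider known for {service}")
        return providers[0]


registry = Registry()
provider_sidecar = Sidecar(registry, "10.0.0.5")
provider_sidecar.register("OrderService", 20880)

consumer_sidecar = Sidecar(registry, "10.0.0.9")
print(consumer_sidecar.subscribe("OrderService"))  # address the application sees
print(consumer_sidecar.forward("OrderService"))    # address the Sidecar forwards to
```

The point of the design is that the application's view of the world shrinks to a single local address, so routing and governance decisions can change inside the Sidecar without touching application code.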

(3) Smooth migration from traditional deployment to mesh deployment

At present, ICBC's microservices consist mainly of two kinds of service instances, based on Dubbo and Spring Cloud, both of which run at large scale in production. When the service mesh is introduced, it must be able to transition smoothly from the original microservice system. The ICBC service mesh supports both the Dubbo and Spring Cloud protocols, so mesh instances and original service framework instances can call each other over the same protocol. Under the same registry, the service mesh system and the original distributed service system can therefore develop together and transition smoothly.

Figure 7: Smooth migration diagram

(4) Performance challenges and optimization after large-scale deployment

At present, ICBC's largest registry cluster serves a very large business scenario with more than 480,000 providers. In the open source Istio architecture, the destination addresses found by service discovery and the configuration information are delivered in full through Pilot's xDS API. With a large number of service instances, full delivery affects the performance and stability of Pilot and the Sidecar. The service mesh platform therefore introduces a third-party registry and configuration center. The Sidecar connects directly to them, which supports on-demand subscription and precise configuration delivery and greatly reduces the pressure on Pilot and the Sidecar. Stress testing shows that the control plane can support millions of instances.

Figure 8: Evolution diagram of ICBC control plane components
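
To make the difference between full delivery and on-demand subscription concrete, the following Python sketch counts how many endpoints a single Sidecar receives under each mode. The registry layout (1,000 services with 480 providers each, giving the 480,000 providers mentioned above), the names, and the three subscribed services are hypothetical and chosen only for the arithmetic.

```python
def full_push(endpoints_by_service):
    """Full delivery: every Sidecar receives every endpoint of every service."""
    return {svc: list(eps) for svc, eps in endpoints_by_service.items()}


def on_demand_push(endpoints_by_service, subscribed):
    """On-demand delivery: a Sidecar receives only the services it subscribed to."""
    return {svc: list(endpoints_by_service[svc])
            for svc in subscribed if svc in endpoints_by_service}


def endpoint_count(snapshot):
    """Total number of endpoints contained in one delivered snapshot."""
    return sum(len(eps) for eps in snapshot.values())


# Hypothetical registry contents: 1,000 services x 480 providers = 480,000 providers.
registry = {f"service-{i}": [f"provider-{i}-{j}" for j in range(480)]
            for i in range(1000)}

# A Sidecar whose application calls only three services (assumed).
subscribed = {"service-1", "service-2", "service-3"}

print(endpoint_count(full_push(registry)))                   # 480000
print(endpoint_count(on_demand_push(registry, subscribed)))  # 1440
```

Under these assumed numbers, each Sidecar handles 1,440 endpoints instead of 480,000, and the saving is repeated for every Sidecar in the mesh.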

Build enterprise-level service governance capabilities to support precise traffic control

At present, the traffic governance capability of open source Istio is very limited, offering only basic routing and observability, which cannot meet enterprise requirements. Based on the Istio architecture, SOFAMesh developed its own data plane and optimized some control plane components to meet enterprise needs. ICBC worked with the SOFAMesh team to build a financial-grade service mesh platform and to strengthen traffic control at the enterprise level. The ICBC service mesh has comprehensive monitoring and operations capabilities: it can monitor the running status of each node, adjust traffic allocation across nodes in real time, remove traffic from faulty nodes in real time, and apply unified security control to every node.

(1) Monitoring and operations capabilities

The service mesh platform has built-in monitoring and alerting, and supports reporting service monitoring, trace monitoring, and other metrics to third-party monitoring systems. An alert can be triggered when the abnormal rate of business requests within a unit of time crosses a threshold. Corresponding alert events are also triggered when service governance functions such as rate limiting, circuit breaking, degradation, and fault self-healing take effect.

(2) Traffic governance capabilities

The service mesh platform supports fine-grained, precise traffic matching: it identifies the set of traffic that carries a specific identifier and applies precise control to that part of the traffic. The platform currently supports enterprise-level traffic control at several granularities (label/method/service/application), including rate limiting, circuit breaking, degradation, routing, traffic mirroring, link encryption, authentication, fault drills, and fault isolation.
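
A minimal Python sketch of matching at the four granularities named above (label, method, service, application) might look as follows. The rule structure, the action names, and the sample request are hypothetical; the sketch shows only the matching idea, not the platform's actual rule format.

```python
from dataclasses import dataclass, field


@dataclass
class TrafficRule:
    action: str                                # e.g. "rate_limit", "mirror" (assumed names)
    match: dict = field(default_factory=dict)  # dimension -> required value

    def matches(self, request):
        # A rule applies only if every dimension it names equals the request's value.
        return all(request.get(dim) == value for dim, value in self.match.items())


def select_actions(rules, request):
    """Return the actions of all rules whose match conditions fit the request."""
    return [rule.action for rule in rules if rule.matches(request)]


rules = [
    TrafficRule("rate_limit", {"application": "payment-app"}),
    TrafficRule("mirror", {"service": "OrderService", "method": "create"}),
    TrafficRule("route_to_gray", {"label": "gray"}),
]

request = {"application": "payment-app", "service": "OrderService",
           "method": "query", "label": "gray"}

print(select_actions(rules, request))  # ['rate_limit', 'route_to_gray']
```

The mirror rule does not fire here because the request's method is `query` rather than `create`, which is what matching on a combination of dimensions means in this sketch.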

(3) Fault self-healing capabilities

Traditional fault handling relies on monitoring and alerting, with faulty nodes dealt with ad hoc through emergency plans. The ability of business and operations teams to design such plans depends heavily on experienced operations engineers, so the cost for newcomers is high. In addition, plan procedures are scattered across documents and are hard to maintain. As the business iterates, they may gradually decay and make operations more complex.

The service mesh platform provides a unified fault self-healing system. It uses the failure rate of business requests within a time window as the golden metric, supplemented by conditions such as the minimum number of calls in the window and a failure-rate multiple, to detect common faults automatically. Faulty nodes are automatically isolated at the network level from either the client side or the server side, and once a faulty node recovers it is automatically restored to the network. This gives the business self-healing capability and improves the availability and operability of the distributed system.

Figure 9: Fault isolation working diagram
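
The detection logic described above (failure rate within a time window, guarded by a minimum call count) can be sketched in Python as follows. The class name, the default thresholds, and the single-success recovery probe are assumptions made for illustration, and the failure-rate multiple mentioned in the text is left out for brevity.

```python
from collections import deque


class NodeHealth:
    """Failure-rate tracker for one node over a sliding time window."""

    def __init__(self, window_seconds=10.0, min_calls=20, max_failure_rate=0.5):
        # All thresholds are illustrative defaults, not values from the article.
        self.window_seconds = window_seconds
        self.min_calls = min_calls
        self.max_failure_rate = max_failure_rate
        self.calls = deque()  # (timestamp, succeeded)
        self.isolated = False

    def record(self, now, succeeded):
        """Record one call result and isolate the node if the window looks unhealthy."""
        self.calls.append((now, succeeded))
        # Drop results that have fallen out of the time window.
        while self.calls and now - self.calls[0][0] > self.window_seconds:
            self.calls.popleft()
        total = len(self.calls)
        failures = sum(1 for _, ok in self.calls if not ok)
        # The minimum call count keeps a few early failures from isolating a node.
        if total >= self.min_calls and failures / total >= self.max_failure_rate:
            self.isolated = True

    def probe(self, succeeded):
        """Health probe for an isolated node; a success puts it back into rotation."""
        if self.isolated and succeeded:
            self.isolated = False
            self.calls.clear()


node = NodeHealth(window_seconds=10, min_calls=5, max_failure_rate=0.5)
for t in range(5):
    node.record(t, succeeded=False)
print(node.isolated)  # True: 5 calls in the window, all failed

node.probe(succeeded=True)
print(node.isolated)  # False: the node has rejoined after a successful probe
```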

(4) Security management capabilities

The service mesh platform supports security authentication and can build encrypted channels using Chinese national cryptographic (SM) algorithms and a variety of mainstream algorithms, for more secure data transmission. Following a zero-trust approach to network security, it achieves trust and encryption along the full call path. In addition, it can identify the caller's identity and set access control policies (blacklists or whitelists) based on that identity. In service scenarios with many access parties, this guards against system failures or malicious attacks that originate from individual clients: abnormal clients can be blacklisted and unauthorized access rejected to protect system availability.

Figure 10: Schematic diagram of security control
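
The identity-based access control above can be illustrated with a small Python function. The function name, the caller IDs, and the rule that a blacklist entry takes precedence over a whitelist entry are assumptions made for this sketch, since the article does not specify how the two lists interact.

```python
def is_allowed(caller_id, whitelist=None, blacklist=None):
    """Decide whether a caller may access a service.

    Assumed semantics: blacklisted callers are always rejected, and when a
    whitelist is configured only the callers on it are admitted.
    """
    if blacklist and caller_id in blacklist:
        return False
    if whitelist is not None:
        return caller_id in whitelist
    return True


print(is_allowed("app-a", whitelist={"app-a", "app-b"}))  # True
print(is_allowed("app-c", whitelist={"app-a", "app-b"}))  # False
print(is_allowed("app-x", blacklist={"app-x"}))           # False
```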

PART. 3 Future outlook

As the next generation of microservice technology in the cloud native field, service mesh has been evolving for more than five years. It has been put into large-scale production only by a few leading enterprises, and there is as yet no successful case in the financial industry represented by banks. The ICBC service mesh has completed business pilots covering multi-language, heterogeneous technology, and edge scenarios. These pilots have largely demonstrated the advantages of service mesh in traffic control and system extensibility, its ability to move service governance down to the infrastructure layer, and the feasibility of highly decoupling middleware from business systems.

Going forward, ICBC will comprehensively summarize the experience of the early pilots and, on that basis, expand the range of pilot applications to fully validate how well the service mesh architecture adapts to the differentiated and diverse business scenarios of a bank. At the same time, it will continue to refine the platform's capabilities and improve its overall performance, capacity, and stability, so as to provide best practices and a reference for adopting service mesh technology in the financial industry.
