Article | Zhang Huaren (alias: Hua Lun)

Architect, Basic Technology Architecture Department, China Merchant Bank

Proofreading | Kan Guangwen (alias: Kongmen)

This article is 4,832 words long and takes about 10 minutes to read

Introduction

In a microservice architecture, invocation relationships between services are complex, and a single application may carry several kinds of business traffic. Because these businesses run in the same application process, they inevitably affect one another.

If a sudden traffic increase in one business drives up the load on the application process and causes requests to queue, other businesses are affected as well. In most cases this mutual influence is tolerable or can be avoided. In certain scenarios, however, we may need to isolate some business traffic to eliminate the risk of businesses affecting each other:

  • For example, when background scheduling traffic affects online user requests;

  • Or when a less sensitive business, or even one that can tolerate failure, affects a more sensitive business that requires strict protection.

The need for business link isolation is common in the industry. The usual solution is to create a new application and migrate the services that need isolation into it.

Creating a new application incurs costs in development as well as in operation and maintenance, and related applications must cooperate with the changes and migration. When only a single application needs to be created, this may be barely acceptable. Some of the bank's applications, such as the high-guarantee minimalist gateway and the high-guarantee customer view, currently use this approach. But it is cumbersome, and when we want to isolate a business across multiple applications on the entire link associated with it, the cost grows nonlinearly and becomes unacceptable.

In a cloud native architecture, containers and traffic can be managed at a finer granularity. This gives us a simpler, more flexible, and more general alternative for the business traffic isolation scenarios above. We call it “business unit isolation”, and it meets these requirements without creating new applications. The solution has been applied in many business scenarios, including core links, and passed the test of this year’s Double Eleven promotion.

So what exactly is business unit isolation, and how do we use it to isolate business links? This article explains in detail.

PART. 1 Concepts and basic principles

Concepts and O&M model

Business unit isolation is a solution built on traffic dyeing and resource isolation that lets a business isolate its link with relatively little effort. During investigation and verification, we also proposed optimizations and pushed them into production, which further reduced the cost for businesses to adopt it.

“Business unit isolation” is best explained through two new concepts: “AIG” and “business unit”.

An AIG is a group of resources that an application sets aside to support a particular business. A business link made up of the AIGs of one or more applications, serving a particular business or class of business, is called a business unit. Ensuring that a business unit receives only the traffic that matches its characteristics is called “isolated deployment of the business unit”.

Main tasks and supporting facilities

From the concept of “business unit isolation”, it follows that isolating the traffic of a business link requires at least the following:

1. Business unit construction: Create a business unit for each application on the link, and ensure that no traffic flows into the new business unit by default.

2. Business traffic identification: The traffic that belongs to the specific business must be identified at the upstream application.

3. Traffic diversion for the specific business: A mechanism is required so that the identified traffic flows to the newly created business unit.

These tasks require the infrastructure side and the application side to work together. The figure below shows the infrastructure and capabilities involved:

1. Business unit construction: provide complete R&D, O&M, and monitoring support for AIG;

2. Traffic identification (RPC): the application upstream of the business unit on the link (A) must integrate the marking and dyeing SDK, and marking and dyeing rules are delivered through the dyeing control platform;

3. Traffic identification (scheduling): complex scheduling (triggered by a message, with the application distributing batch tasks on its own within a single LDC) can be converted into SOFARPC-based streaming tasks, which makes dyeing and isolation possible;

4. Traffic diversion for the specific business: fine-grained routing on the MOSN side must support AIG so that traffic can flow to the new business unit.

Business unit building

A business unit is a relatively abstract concept that corresponds to a business link.

In practice, to make the business unit more concrete, we stipulate that the AigCode part of the AIG name (AppName-AigCode) should be kept as consistent as possible across the applications within a business unit.

Building a particular business unit therefore essentially means creating, for every related application on the link, a resource group (AIG) dedicated to isolating that business.
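To make the naming convention concrete, here is a minimal Python sketch, not the platform's actual implementation. It assumes the AIG name splits on its last hyphen, and the function names and sample AIG names other than A-Vostro are hypothetical. It groups AIG names by AigCode to recover which applications form a business unit.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Aig(NamedTuple):
    app_name: str  # application that owns this resource group
    aig_code: str  # code shared by all AIGs of one business unit


def parse_aig_name(name: str) -> Aig:
    """Split 'AppName-AigCode' on the last hyphen (assumed separator)."""
    app_name, sep, aig_code = name.rpartition("-")
    if not sep or not app_name or not aig_code:
        raise ValueError(f"not an AIG name: {name!r}")
    return Aig(app_name, aig_code)


def group_business_units(aig_names: Iterable[str]) -> Dict[str, List[str]]:
    """Group applications by AigCode; each group is one business unit."""
    units: Dict[str, List[str]] = defaultdict(list)
    for name in aig_names:
        aig = parse_aig_name(name)
        units[aig.aig_code].append(aig.app_name)
    return dict(units)


if __name__ == "__main__":
    # "B-Vostro" and "C-Batch" are made-up names for illustration.
    print(group_business_units(["A-Vostro", "B-Vostro", "C-Batch"]))
```

Keeping the AigCode identical across applications is what lets a single code identify the whole unit.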

For a single application, building AIG consists of two parts:

The first is initializing the AIG metadata;

The second is the O&M operations around the AIG (scaling out and in, taking offline, restarting, sidecar injection and upgrade, and so on).

To support AIG, almost every O&M operation on the PaaS side has to be adapted, which is a very large amount of work. The PaaS side therefore had to make trade-offs and decided to support AIG only in the workload O&M mode, its target mode. As a result, applications that depend on AIG must migrate from the existing image mode to workload mode.

In workload O&M mode, PaaS describes release and O&M content as CRD resources and hands them to the underlying Sigma (K8s) for O&M. Switching to workload O&M mode helps unify the group's release and O&M system and better supports scenarios such as elastic scaling and self-healing.

Compared with image mode, however, workload mode significantly changes users' habits and experience, and it had many problems in its early stage. Although the move of online businesses to workload mode has been progressing in an orderly manner, in the later project that connected core businesses to AIG we wanted to avoid a forced switch to workload O&M mode affecting emergency O&M for those businesses. We therefore asked PaaS to also support enabling workload mode only on AIG machines, and we ran a complete verification of this hybrid O&M setup.

RPC Traffic Isolation

After a business unit is created, how do we ensure that it receives no RPC traffic by default, before any traffic is diverted to it?

An application machine receives RPC traffic because its IP is mounted in SOFARegistry and AntVip. After MOSN detects that the application process has started successfully, it registers the service information with SOFARegistry. After the machine passes the health check in the O&M process, PaaS invokes an interface to mount the machine IP to AntVip.

So, to ensure that a new AIG machine receives no incoming traffic by default, adjustments are required on both the MOSN side and the PaaS side.

The overall adjustment plan is shown in the figure below:

How do you identify RPC traffic for a particular business?

Once the upstream application integrates the marking and dyeing SDK, the SDK's RPC interceptor intercepts requests both when the application is called as a server and when it calls other applications as a client. The interceptor checks each RPC request against the delivered marking and dyeing rules; on a match, it adds a business identifier to the RPC header.
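The matching step can be sketched roughly as follows. This is an illustration in Python, not the SDK's real API. The `DyeRule` fields, the `dye_headers` function, and matching on service and method names are all assumptions. The header key `sl_biz_unit` is taken from the routing rule example later in this article, and the value "Vostro" is made up.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Header key as it appears in this article's routing rule example.
BIZ_UNIT_HEADER = "sl_biz_unit"


@dataclass(frozen=True)
class DyeRule:
    service: str   # hypothetical match field: RPC interface name
    method: str    # hypothetical match field: RPC method name
    biz_unit: str  # identifier written into the header on a match


def dye_headers(service: str, method: str,
                headers: Dict[str, str],
                rules: Iterable[DyeRule]) -> Dict[str, str]:
    """Return a copy of the headers, dyed if any rule matches the request."""
    dyed = dict(headers)
    for rule in rules:
        if rule.service == service and rule.method == method:
            dyed[BIZ_UNIT_HEADER] = rule.biz_unit
            break  # first matching rule wins (an assumption)
    return dyed


if __name__ == "__main__":
    rules = [DyeRule("RepairFacade", "repair", "Vostro")]
    print(dye_headers("RepairFacade", "repair", {}, rules))
    print(dye_headers("TradeFacade", "pay", {}, rules))
```

Requests that match no rule pass through unchanged, so undyed traffic keeps flowing to the default unit.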

Finally, traffic is diverted to a specific business unit.

With MOSN's fine-grained routing capability, traffic can be routed to a specified business unit and kept within it. Business unit isolation mainly relies on MOSN's client-side routing: when a client application initiates a call, the request passes through the MOSN of the current Pod, and the traffic is steered according to the routing rules we deliver.

Scheduling traffic isolation

Scheduling is message-driven in nature, and simple scheduling scenarios usually do not need isolation. Many of the scenarios that do need it currently use a “message task + three-tier distribution” mode, in which scheduling triggers batch logic.

The three-tier distribution protocol distributes requests over TB-Remoting rather than the standard SOFARPC protocol, so the requests do not pass through MOSN. MOSN therefore cannot control where such requests go.

To solve this problem, AntScheduler introduced a new streaming scheduling mode. By converting the three-tier distribution mode into multiple standard SOFARPC calls, AntScheduler works seamlessly with MOSN and meets the need for traffic isolation.

For scenarios where scheduling traffic should be routed directly to an AIG, this can be configured directly in the AntScheduler interface. After configuration, the platform delivers service-level MOSN client routing rules.

For full-link isolation scenarios, the scheduling platform is integrated with the marking and dyeing platform, so RPC traffic initiated by the scheduling platform is marked automatically. Downstream applications can choose to perform further dyeing and traffic diversion based on this mark.

PART. 2 Link Isolation for Asynchronous Account Repair

After the business unit isolation infrastructure was in place, several business scenarios adopted it one after another. Asynchronous account repair link isolation was the first application of business unit isolation on a core link: real-time transaction traffic and asynchronous account repair traffic are isolated so that neither affects the other. During this year’s Double Eleven promotion, the asynchronous account repair business unit carried 10% of the asynchronous account repair traffic and ran smoothly.

I’ll use this project to show in detail how we isolate business links with “business unit isolation.”

Project background

The applications involved in this project sit on the bank's core links and require strict protection. The business is expected to grow rapidly, which poses a major challenge for guaranteeing the high availability of the link.

The link currently carries two main types of traffic: real-time transaction traffic, and account repair traffic initiated asynchronously by the upstream application.

Failures in account repair traffic can be tolerated because the data has already been persisted to the database. Real-time transaction traffic, by contrast, must be strictly protected.

As the business grows, asynchronous account repair traffic will increase sharply and may affect real-time transaction traffic. The business therefore needs to isolate asynchronous account repair traffic from real-time transaction traffic to guarantee the high availability of real-time transactions.

The overall plan

Because the link involves multiple core applications, the traditional solution of creating new applications would make both the initial changes and the subsequent maintenance very costly. The business therefore wanted to adopt business unit isolation. After in-depth discussion with the business side, we agreed to create a new asynchronous account repair business unit that carries the following traffic:

1. Asynchronous account repair traffic (RPC) from upstream application U;

2. Follow-up traffic from upstream application U’s account repair scheduling (scheduling -> RPC).

Asynchronous repair RPC isolation

Upstream application U of the asynchronous account repair unit needs a small change: it must integrate the traffic marking and dyeing SDK so that we can identify the traffic bound for the asynchronous account repair unit.

After application U integrates the SDK, requests are intercepted by the SDK's RPC interceptor both when U is called as a server and when it calls other applications as a client, and they can then be marked and dyed. Dyed traffic carries a traffic identifier in the RPC request or response header. When MOSN routing recognizes this identifier, it diverts the traffic to the asynchronous account repair business unit.

The following figure shows the marking, dyeing and drainage logic diagram of RPC traffic for asynchronous account repair:

Asynchronous repair schedule isolation

To identify scheduling traffic, the application must switch from the “message task + three-tier distribution” mode to the streaming task mode, which turns the work into multiple SOFARPC calls that MOSN can then route to the specified AIG with fine-grained routing.

In this project, the RPC requests for account repair scheduling are already marked, so all that remains is to dye the traffic and deliver MOSN traffic diversion rules on the upstream application U side.

The whole logic is shown as follows:

Stress testing and gray release mechanism

The marking and dyeing SDK can recognize stress-test traffic when marking, but we did not use that method in this project. Instead, we added a restricting condition to the MOSN routing rules.

On the one hand, the SDK does not support identifying the bank's stress-test traffic;

On the other hand, the MOSN rule delivery process is simpler.

Multiple MOSN routing rules can be configured. Each rule consists of the scope in which it takes effect, a condition, and a destination. MOSN supports gray release at any ratio and restricting rules to stress-test traffic, which keeps traffic diversion safe. The figure below shows the MOSN routing rule with which upstream application U diverts 1/1000 of its stress-test traffic (shadowTest=T) to the asynchronous account repair AIG of application A (A-Vostro).
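Separately from the figure, the sketch below shows one way a rule with a scope, a condition, and a destination could be evaluated, with a gray release ratio on top. It is not MOSN's implementation. The `RouteRule` structure, the per-mille field, and the CRC32 bucketing of a trace ID are all assumptions made for illustration; only the values app=U, shadowTest=T, and A-Vostro come from the text.

```python
import zlib
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class RouteRule:
    scope: Dict[str, str]       # where the rule applies, e.g. {"app": "U"}
    condition: Dict[str, str]   # header values the request must carry
    destination: str            # target AIG, e.g. "A-Vostro"
    gray_per_mille: int = 1000  # share of matching traffic to divert, 0..1000


def in_gray_bucket(trace_id: str, per_mille: int) -> bool:
    """Deterministically map a request into one of 1000 buckets (assumed scheme)."""
    return zlib.crc32(trace_id.encode("utf-8")) % 1000 < per_mille


def pick_destination(rules: Iterable[RouteRule], client: Dict[str, str],
                     headers: Dict[str, str], trace_id: str) -> Optional[str]:
    """Return the AIG of the first rule that applies, or None for the default unit."""
    for rule in rules:
        in_scope = all(client.get(k) == v for k, v in rule.scope.items())
        matched = all(headers.get(k) == v for k, v in rule.condition.items())
        if in_scope and matched and in_gray_bucket(trace_id, rule.gray_per_mille):
            return rule.destination
    return None


if __name__ == "__main__":
    # The rule described above: 1/1000 of U's stress-test traffic goes to A-Vostro.
    rule = RouteRule({"app": "U"}, {"shadowTest": "T"}, "A-Vostro", gray_per_mille=1)
    print(pick_destination([rule], {"app": "U"}, {"shadowTest": "T"}, "trace-42"))
```

Raising `gray_per_mille` step by step, and later dropping the stress-test condition, is how such a rule would move from verification toward production traffic.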

Traffic convergence within the unit

After traffic flows into the business unit, the applications there invoke other applications in turn. MOSN routing rules must be delivered to keep this traffic within the business unit; otherwise it flows back to the default business unit.

The initial scheme was to keep routing on the traffic identifier written by the marking and dyeing SDK, for example: scope: app=U; condition: sl_biz_unit = XXX; destination: mosn_aig = A-PCS.

However, such rules are tightly bound to both the client application and the server application. In a complex scenario such as this project, every call relationship needs its own rule, so the overall effort of sorting out and maintaining the rules is very large.

During investigation and verification we identified this problem and, after discussing it with the colleagues involved, proposed a simpler and more practical solution: AIG self-convergence. MOSN identifies its own AigCode and passes it along to every application it invokes. Rules can then be simplified to depend only on the current application and the AigCode, for example: scope: AigCode = Vostro; destination: mosn_aig = A-PCS. After this simplification, the number of rules equals the number of applications in the unit.
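The saving can be illustrated with a small Python sketch that generates both kinds of rules for a made-up call graph. The rule dictionaries only mirror the scope/condition/destination fields quoted above; the function names, the call graph, and the AppName-AigCode destination naming are assumptions, not the platform's rule format.

```python
from typing import Dict, Iterable, List, Tuple


def per_edge_rules(calls: Iterable[Tuple[str, str]],
                   biz_unit: str, aig_code: str) -> List[Dict]:
    """Initial scheme: one rule per (client, server) call relationship."""
    return [
        {
            "scope": {"app": client},
            "condition": {"sl_biz_unit": biz_unit},
            "destination": f"{server}-{aig_code}",
        }
        for client, server in calls
    ]


def self_convergence_rules(apps: Iterable[str], aig_code: str) -> List[Dict]:
    """AIG self-convergence: one rule per application in the unit."""
    return [
        {"scope": {"AigCode": aig_code}, "destination": f"{app}-{aig_code}"}
        for app in sorted(apps)
    ]


if __name__ == "__main__":
    # Hypothetical call graph: U is upstream; A, B and C form the unit.
    calls = [("U", "A"), ("A", "B"), ("A", "C"), ("B", "C")]
    print(len(per_edge_rules(calls, "XXX", "Vostro")))               # one per edge
    print(len(self_convergence_rules({"A", "B", "C"}, "Vostro")))   # one per app
```

With per-edge rules the count grows with the number of call relationships, while self-convergence rules grow only with the number of applications, which is why they are easier to sort out and maintain.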

The self-convergence rule of this project is shown as follows:

Summary and outlook

This article introduced a new solution the bank uses to handle traffic isolation, along with its application in a real business scenario.

Compared with the traditional, heavyweight approach of adding new applications, “business unit isolation”, which builds on cloud native technologies such as containers and Service Mesh, is more lightweight and flexible. We have so far implemented isolation for RPC, scheduling, and HTTP traffic, and will go on to improve isolation for messages and other traffic.

Colleagues who face similar needs or are interested in the related technical solutions are welcome to get in touch and discuss at any time.
