Riding the Wave

What’s been the hottest technology topic in the last two years?

Service Mesh is definitely one of them.

What is a Service Mesh?

It can be likened to a TCP/IP layer between applications or microservices: it takes care of network calls, rate limiting, circuit breaking, and monitoring between services.

Just as you don't have to worry about the TCP/IP layer when writing an application (say, a RESTful application over HTTP), with a Service Mesh you no longer need to worry about the inter-service concerns that used to be implemented in service frameworks. Just hand them over to the Service Mesh.

The Service Mesh runs as a sidecar and is transparent to applications. All traffic between applications passes through the Service Mesh, so application traffic can be controlled there: the mesh is a natural interception point for rate limiting and circuit breaking.

Now that more than 80% of Ant Group's applications have been moved onto the Mesh, building unified rate limiting and circuit breaking in the Mesh is a natural next step.

Service Mesh is an infrastructure layer that handles communication between services. It is responsible for reliably delivering requests through the complex service topology that makes up a modern cloud-native application.

In practice, a Service Mesh is typically implemented as an array of lightweight network proxies deployed alongside application code, without the application being aware of their presence.

Compared with traditional rate-limiting components, Mesh-based rate limiting has many advantages and brings significant gains in R&D efficiency and cost:

– The MOSN architecture intercepts traffic natively, so applications no longer need to integrate an SDK

– There is no need to develop separate versions of the rate-limiting component for each language

– Rate-limiting capability upgrades can be rolled out without requiring every service to upgrade in step

Business Background

Before unified Mesh rate limiting, Ant Group had several different rate-limiting products, each providing different flow-control strategies:

Rate-limiting configurations for different traffic types (SOFARPC, wireless gateway RPC, HTTP, and messaging) were scattered across different platforms and maintained by different teams, so product and documentation quality was uneven, learning costs were high, and the user experience was poor.

Different rate-limiting policies required integrating different SDKs, which brought in many transitive dependencies, forced frequent upgrades whenever security vulnerabilities were found, and made maintenance expensive.

This not only wasted development effort on our side, but also created friction and inconvenience for business teams.

On the other hand, as the business kept growing, a large number of services were still using the simplest single-machine rate-limiting strategy, with no general-purpose adaptive rate limiting, hotspot rate limiting, fine-grained rate limiting, or cluster rate limiting.

Faults occurred frequently because rate limiting was missing altogether, rules were accidentally omitted, or rules were misconfigured.

In the Mesh architecture, the sidecar has natural advantages for traffic management: services no longer need to integrate or upgrade rate-limiting components inside the application, and the middleware team no longer needs to develop and maintain multiple versions of rate-limiting components for different technology stacks.

With the Service Mesh already deployed at large scale inside Ant Group, we could consolidate the different rate-limiting capabilities into MOSN and move all rate-limiting rules into a unified rate-limiting center, further strengthening MOSN's traffic-management capability while greatly reducing the cost for services to adopt and configure rate limiting.

Against this background, we set out to build a unified rate-limiting capability in MOSN.

Standing on the shoulders of giants

In building the unified rate-limiting capability, we investigated a number of established products, including our own Guardian, Shiva, and Dujiangyan, as well as concurrency-limits, Hystrix, and Sentinel from the open source community.

Among them, Sentinel, open-sourced by Alibaba Group, stood out.

While building Shiva we had already been in touch with the Sentinel team, who were also actively building Sentinel-Golang, the Golang version of Sentinel.

MOSN is an open source framework developed by Ant Group and written in Golang. Combining it with Sentinel's strong flow-control capability and considerable community influence looked like a perfect match: the two complement and reinforce each other.

Still, Sentinel was not something we could use out of the box. We are not a brand-new business without historical baggage, and we had to consider compatibility with Ant's infrastructure and our historical rate-limiting products:

  1. Control-plane rule delivery had to be integrated with Ant's infrastructure

  2. Sentinel-Golang's single-machine rate limiting, circuit breaking, and related logic differed considerably from our previous products

  3. Cluster rate limiting also had to be implemented on top of Ant's infrastructure

  4. Sentinel's adaptive rate limiting was too coarse-grained; Ant needed finer control

  5. The log collection scheme had to be adjusted

After weighing everything, we decided to extend Sentinel and build Ant's own Mesh rate-limiting capability on the shoulders of giants.

Thanks to Sentinel's good extensibility, we implemented our own single-machine rate limiting, service circuit breaking, cluster rate limiting, and adaptive rate limiting, and contributed some general-purpose changes back to the open source community. Meanwhile, we built unified log monitoring and alerting and a unified rate-limiting center.

Finally, we completed these capabilities in MOSN. The following table compares MOSN's rate-limiting capability with other rate-limiting components:

Occam’s razor

Pluralitas non est ponenda sine necessitate.

Entities should not be multiplied beyond necessity.

Each rate-limiting strategy came with its own SDK and its own management console. The interaction, documentation, and operation manuals varied widely in quality, and each was maintained and supported by a different team. Anyone who has been through all of them will surely have hated the experience.

One core goal of unified Mesh rate limiting is to cut all of this away: to simplify, to reduce the learning and usage cost for business developers, and to reduce our own maintenance cost.

– All flow-control capability is consolidated into MOSN, keeping the essence of each product and discarding the dross

– All flow-control consoles are consolidated into the unified rate-limiting center

I think this is the last rate-limiting wheel we will ever build.

Surpassing the Master

As mentioned above, we built unified Mesh rate limiting on the shoulders of Sentinel. So what do we provide that Sentinel does not?

In fact, we re-implemented almost all of the rate-limiting capabilities Sentinel provides, with a number of highlights and enhancements.

Here are a few highlights of our technology.

Adaptive rate limiting

– For business developers, evaluating capacity and running load-test regressions interface by interface takes a great deal of time and effort. With limited energy they can only focus on the key interfaces, so some low-traffic interfaces inevitably end up with no rate limiting at all.

– Engineers responsible for quality and stability keep seeing faults caused by omitted rules, misconfigured rules, failed load tests, blocked threads, and so on.

We wanted MOSN to pinpoint the real culprit when system resources run short, and to adjust abnormal traffic automatically and in real time according to the system's load level, even when rules are omitted or misconfigured.

With this in mind, we implemented a self-detecting, self-adjusting rate-limiting strategy in the spirit of cloud-native maturity.

The principle behind adaptive rate limiting is not complicated. Put simply: once limiting is triggered, the overall load level of the system is sampled in real time, and traffic is adjusted proportionally at second-level granularity.

The core logic is as follows:

– System resource detection: system resource usage is sampled every second. If usage stays above the threshold for N seconds (5 by default), baseline computation is triggered, and load-test traffic is blocked so its resources are handed back to online services.

– Baseline computation: walk through the current statistics of all interfaces, use a set of algorithms to find the big resource consumers, then find the abnormal traffic among them, snapshot their current resource usage, and store it as baseline data;

– Baseline adjustment: the stored baseline data is then adjusted second by second according to the result of system resource detection. If the system is still above its threshold, the baselines are scaled down; otherwise they are gradually restored.

– Rate-limiting decision:

As traffic continues to pass through the adaptive rate-limiting module, the module looks up the baseline data for each interface.

If baseline data exists, the request is allowed or rejected depending on whether its current concurrency exceeds the baseline.
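The detect / snapshot / adjust / decide loop above can be sketched in a few lines. This is an illustrative model only: the type names, the 80% trigger threshold, and the 0.8 / 1.25 scaling factors are made up for the sketch and are not MOSN's actual parameters.

```go
package main

import "fmt"

// Adaptive models the control loop: when system usage stays hot, each
// interface's observed concurrency is snapshotted as its baseline and
// then scaled down every tick; as the system recovers, baselines are
// scaled back up.
type Adaptive struct {
	threshold float64            // system usage above this triggers limiting
	baseline  map[string]float64 // per-interface concurrency baseline
}

func NewAdaptive(threshold float64) *Adaptive {
	return &Adaptive{threshold: threshold, baseline: map[string]float64{}}
}

// Tick runs one second of the loop: usage is current system utilization,
// concurrency the observed per-interface concurrency.
func (a *Adaptive) Tick(usage float64, concurrency map[string]float64) {
	if usage > a.threshold {
		for iface, c := range concurrency {
			if b, ok := a.baseline[iface]; !ok {
				a.baseline[iface] = c // first trigger: snapshot as baseline
			} else {
				a.baseline[iface] = b * 0.8 // still hot: scale down
			}
		}
	} else {
		for iface, b := range a.baseline {
			a.baseline[iface] = b * 1.25 // recovering: scale back up
		}
	}
}

// Allow admits a request unless the interface has a baseline and its
// current concurrency already exceeds it.
func (a *Adaptive) Allow(iface string, current float64) bool {
	b, ok := a.baseline[iface]
	return !ok || current < b
}

func main() {
	a := NewAdaptive(0.8)
	a.Tick(0.9, map[string]float64{"/pay": 100}) // hot: baseline = 100
	fmt.Println(a.Allow("/pay", 120))            // over baseline: false
	a.Tick(0.9, map[string]float64{"/pay": 100}) // still hot: baseline = 80
	fmt.Println(a.Allow("/pay", 79))             // under baseline: true
}
```

Note how interfaces without a baseline are always admitted: limiting only kicks in for the resource consumers the detection step has flagged.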

This home-grown adaptive rate limiting has the following advantages:

– Worry-free configuration: no code intrusion and minimal configuration;

– Second-level adjustment: each machine detects and adjusts by itself, with no external dependency, tracking the load level at second-level granularity;

– Intelligent identification: features such as reclaiming load-test resources and identifying abnormal traffic;

– Precise identification: compared with other adaptive rate-limiting techniques, such as Netflix's concurrency-limits or Sentinel's system-level adaptive limiting based on the BBR idea, ours can adapt at the granularity of an interface, or even of a parameter or a calling application.

Cluster rate limiting

Before introducing cluster rate limiting, let's briefly look at the scenarios where single-machine rate limiting falls short.

A single-machine rate limiter keeps its counter in that machine's own memory; machines know nothing about each other's counts, yet they usually share the same rate-limiting configuration.

Consider the following scenario:

– Suppose a service wants a total threshold smaller than its machine count. For example, it has 1,000 machines but wants to cap total QPS at 500, so each machine's share is less than 1 QPS. How could you possibly configure a single-machine threshold for that?

– Suppose a service wants to cap total QPS at 1,000 across 10 machines, but the traffic is not distributed perfectly evenly among them. What should the single-machine threshold be?

"Any problem in computer science can be solved by adding another level of indirection." It is natural to think of a unified external counter that stores the rate-limiting statistics; this is the basic idea behind cluster rate limiting.

However, synchronously consulting the shared cache on every single request has its problems:

– When request volume is high, the cache comes under great pressure and must be provisioned with sufficient resources;

– Synchronously calling the cache, especially across cities, adds noticeable latency; in the worst case a 30ms+ cross-city round trip per request is not acceptable to every business.

We therefore provide both a synchronous and an asynchronous mode in cluster rate limiting. For high-volume or latency-sensitive cases we designed a two-level cache scheme: instead of calling the cache on every request, each machine accumulates a local sum and only consults the cache after a certain share is used up or a certain time interval has passed. If the remote quota has already been exhausted, no further traffic is admitted until it is replenished in the next time window. This asynchronous mode balances the performance and precision of cluster rate limiting under heavy traffic.
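The asynchronous two-level scheme can be sketched as follows. The in-process `RemoteQuota` type stands in for a real shared store such as Redis, and the batch-of-permits model is a simplification of the share/time-interval batching described above; names and numbers are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// RemoteQuota plays the role of the shared counter (e.g. a Redis key).
type RemoteQuota struct {
	mu        sync.Mutex
	remaining int
}

// Take atomically deducts up to n permits from the shared quota and
// returns how many were actually granted.
func (r *RemoteQuota) Take(n int) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	if n > r.remaining {
		n = r.remaining
	}
	r.remaining -= n
	return n
}

// LocalLimiter fetches permits in batches so that most requests are
// decided locally, trading a little precision for far fewer remote calls.
type LocalLimiter struct {
	remote  *RemoteQuota
	batch   int // permits to fetch per remote round trip
	granted int // permits fetched but not yet used
}

func (l *LocalLimiter) Allow() bool {
	if l.granted == 0 {
		// Only now do we pay for a remote round trip.
		l.granted = l.remote.Take(l.batch)
	}
	if l.granted == 0 {
		return false // cluster-wide quota exhausted until the next window
	}
	l.granted--
	return true
}

func main() {
	remote := &RemoteQuota{remaining: 5}
	a := &LocalLimiter{remote: remote, batch: 3}
	b := &LocalLimiter{remote: remote, batch: 3}
	passed := 0
	for i := 0; i < 4; i++ {
		if a.Allow() {
			passed++
		}
		if b.Allow() {
			passed++
		}
	}
	fmt.Println(passed) // the two machines jointly admit at most 5 requests
}
```

The precision loss is bounded by the batch size: at worst, permits already fetched by one machine go unused while another machine is refused, which is exactly the performance-versus-precision trade-off mentioned above.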

Fine-grained rate limiting

Traditional interface-granularity rate limiting cannot satisfy some complex business requirements. For example, a service may need to treat traffic to the same interface differently depending on the calling source, or configure independent limits keyed on the value of a business parameter (such as a merchant ID or an activity ID).

Fine-grained rate limiting is designed for exactly these complex configurations.

Let's first sort out the conditions business developers may want to express. They fall roughly into a few categories:

  1. By business source

For example, the service provided by application A is invoked by systems B, C, and D, and only the traffic coming from application B should be limited.

  2. By business parameter value

For example, by UID, activity ID, merchant ID, payment scene ID, and so on.

  3. By full-link business identifier¹

For example, "Huabei auto-deduction", "Yu'ebao subscription payment", and so on.

[Note 1]: The full-link business identifier is a tag generated according to rules configured by the business. It is propagated transparently in the RPC protocol so that the business source can be identified across services.

In more complex scenarios, these conditions may be combined with logical operators, for example: traffic whose business source is A and whose activity ID is xxx, or traffic whose business tag is A or B and whose parameter value is xxx.

Some of these conditions can be read directly from the request header, for example the source application and the source IP; we call this basic information. Business parameters and the full-link identifier, however, are not available for every application; we call this business information.

Traffic condition rules allow basic information and business information to be combined with basic logical operations, and the result of evaluating a rule maps the request to an independent sub-resource point.

According to the condition rules the business configures, traffic is split into several sub-resource points, and each sub-resource point gets its own rate-limiting rules, which satisfies the need for fine-grained rate limiting.
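The mapping from a request to a sub-resource point can be sketched as below. The rule model (a source-application condition plus one business-parameter condition, combined with AND semantics) is a simplification for illustration; the real rule language described above supports richer logical combinations.

```go
package main

import "fmt"

// Request carries the basic and business information used for matching.
type Request struct {
	Interface string
	SourceApp string
	Params    map[string]string
}

// Rule is one condition rule; empty fields mean "no condition".
type Rule struct {
	Name      string
	SourceApp string
	ParamKey  string
	ParamVal  string
}

// Matches applies the rule's conditions with AND semantics.
func (r Rule) Matches(req Request) bool {
	if r.SourceApp != "" && r.SourceApp != req.SourceApp {
		return false
	}
	if r.ParamKey != "" && req.Params[r.ParamKey] != r.ParamVal {
		return false
	}
	return true
}

// SubResource returns the independent resource key this request is
// counted under, falling back to the plain interface name when no
// rule matches.
func SubResource(req Request, rules []Rule) string {
	for _, r := range rules {
		if r.Matches(req) {
			return req.Interface + "#" + r.Name
		}
	}
	return req.Interface
}

func main() {
	rules := []Rule{
		{Name: "from-B-activity-xxx", SourceApp: "B", ParamKey: "activityId", ParamVal: "xxx"},
		{Name: "from-B", SourceApp: "B"},
	}
	fmt.Println(SubResource(Request{Interface: "/pay", SourceApp: "B",
		Params: map[string]string{"activityId": "xxx"}}, rules))
	fmt.Println(SubResource(Request{Interface: "/pay", SourceApp: "C"}, rules))
}
```

Each sub-resource key can then carry its own independent counter and threshold, exactly as the text describes.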

Do More

Now that rate limiting and circuit breaking are unified, what else can we do? Let us share some of our thoughts.

Rate limiting X self-healing

After adaptive rate limiting landed, we quickly rolled it out at scale within the group. Cases where it triggered appeared almost every day, but we found that many triggers were caused by single-machine faults: with hundreds of thousands of containers running online, occasional single-machine jitter is inevitable.

Rate limiting solves overall capacity problems, but for heavily depended-on services, limited requests still show up as failures; it would be better to quickly shift the traffic to other, healthy machines.

A traditional self-healing platform discovers machine faults through monitoring and then starts its healing actions, but monitoring data often lags by 2 to 3 minutes. If instead a report is sent to the self-healing platform the moment adaptive rate limiting triggers, the platform can confirm whether it is a single-machine problem and run the self-healing process immediately, improving the timeliness of healing and further improving business availability.

In the same way, if the self-healing platform receives an adaptive rate-limiting trigger and finds that the problem is not a single machine but overall capacity, it can scale out quickly and so self-heal the capacity problem.

Rate limiting X degradation center

When a heavily depended-on service fails, rate limiting ensures the service does not avalanche from the capacity problem, but it does not by itself improve availability. Traffic can be diverted around a single-machine failure, but what about an overall failure?

A better approach is to forward the request to a pre-prepared degraded service.

The degradation center, implemented on a Serverless platform, can sink some of the generic degradation logic into the base (for example cache bookkeeping and asynchronous recovery), while each business implements its own Serverless degradation module as needed. Then, even when a service is completely unavailable, MOSN can still forward requests to the degraded service and achieve higher availability.

Summary

As MOSN's rate-limiting capability is gradually enriched and improved, and as more Mesh high-availability capabilities are built in the future, MOSN is becoming an important link in our technical-risk and high-availability infrastructure.

The above is our experience with the practice and rollout of Mesh rate limiting. We hope it gives you a deeper understanding of Service Mesh, and we hope you will keep an eye on MOSN so that we can get more feedback from the community and do even better.