Author: Chu He

The stability of microservices has always been a major concern for developers. As services evolve from monolithic to distributed architectures and deployment modes change, the dependencies between services grow more and more complex, and service systems face serious high-availability challenges. Application High Availability Service (AHAS) is a cloud product distilled from many years of Alibaba's internal high-availability practice. It takes traffic and fault tolerance as the entry point and helps ensure service stability along several dimensions: flow control, isolation of unstable calls, circuit breaking and degradation, hotspot traffic protection, adaptive system protection, and cluster flow control. It also provides second-level traffic monitoring and analysis. AHAS is widely used in Alibaba's e-commerce businesses such as Taobao and Tmall, and has many deployments in Internet finance, online education, gaming, live streaming, and other industries, as well as in large government and enterprise organizations.

Flow control is the most common and direct way to ensure the stability of microservices. Every system or service has a capacity ceiling, and the idea behind flow control is simple: when the QPS of an interface exceeds the limit, the system rejects the excess requests so that a sudden burst of traffic cannot overwhelm it. The most common solution on the market is single-node flow control. For example, if a PTS performance test estimates the maximum capacity of an interface at 100 QPS and the service has 10 instances, the single-node threshold is set to 10 QPS. In practice, however, the traffic distribution is unpredictable, so flow control at the single-node level often works poorly.
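To make the idea concrete, here is a minimal sketch of a single-node limiter, assuming a simple fixed one-second window counter. The class name, the injectable clock, and the helper function are illustrative only and are not part of any AHAS API.

```python
import time


class FixedWindowLimiter:
    """Rejects requests once the count in the current window reaches the threshold."""

    def __init__(self, threshold, window_seconds=1.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.window_start = clock()
        self.count = 0

    def try_acquire(self):
        now = self.clock()
        # Start a fresh window once the current one has elapsed.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        if self.count < self.threshold:
            self.count += 1
            return True
        return False


def per_node_threshold(total_qps, instances):
    """Split a cluster-wide capacity estimate evenly over the instances."""
    return total_qps // instances


# The example from the text: 100 QPS estimated capacity, 10 instances.
limiter = FixedWindowLimiter(per_node_threshold(100, 10))
```

Each instance would hold its own limiter and call try_acquire() before handling a request. Nothing is shared between instances, which is exactly why skewed traffic causes trouble.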

Typical scenario 1: Precisely control the total number of calls to the downstream

Scenario: Service A frequently invokes the query interface of service B, but the two services have different capacities. Service B agrees to provide service A with a maximum of 600 QPS of query capacity, enforced through flow control.

Pain point: With a single-node flow control policy, the invocation logic and load balancing policy may distribute the traffic from A unevenly across the instances of B. Instances of service B that receive heavy traffic trigger single-node flow control even though the overall limit has not been reached, so the agreed SLA is not met. This imbalance often occurs when a dependent service or component (such as a database) is called, and it is a typical scenario for cluster flow control: precisely controlling the total number of calls that a microservice cluster makes to a downstream service (or database, or cache).
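The effect of skewed traffic can be shown with a little arithmetic. The sketch below is a simplified model: the number of service B instances (six) and the per-instance traffic figures are assumptions made for illustration, not values from a real system.

```python
def passed_single_node(traffic_per_instance, total_limit):
    """Requests admitted when the total limit is split evenly per instance."""
    node_limit = total_limit // len(traffic_per_instance)
    return sum(min(t, node_limit) for t in traffic_per_instance)


def passed_cluster(traffic_per_instance, total_limit):
    """Requests admitted when one shared limit covers the whole cluster."""
    return min(sum(traffic_per_instance), total_limit)


# Hypothetical one-second snapshot: 500 requests from service A, skewed
# across six instances of service B (the instance count is an assumption).
traffic = [200, 150, 60, 40, 30, 20]

print(passed_single_node(traffic, 600))  # 350: two hot instances reject 150 requests
print(passed_cluster(traffic, 600))      # 500: the 600 QPS agreement is not exceeded
```

With a per-instance threshold of 100 QPS, 150 requests are rejected although service A sent only 500 of its agreed 600. A single shared limit admits all of them.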

Typical scenario 2: Control the total number of requests at the entry of the service link

Scenario: Flow control is applied to incoming traffic at the Nginx/Ingress gateway or an API gateway (Spring Cloud Gateway, Zuul). The goal is to precisely control the traffic of a certain API or a group of APIs so that protection happens early and excess traffic never reaches the back-end system.

Pain point: If thresholds are configured per machine, changes in the number of gateway machines are not sensed, and uneven gateway traffic may make the limiting effect poor. From the perspective of the gateway entry, configuring an overall threshold is also the most natural approach.
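The first half of this pain point is easy to see numerically. In the sketch below, the 200 QPS target and the gateway counts are assumed values chosen purely for illustration.

```python
def effective_total(node_threshold, gateway_count):
    """Upper bound on QPS admitted when every gateway enforces its own limit."""
    return node_threshold * gateway_count


# Assumed setup: a 200 QPS target split over 4 gateway instances.
target_total = 200
node_threshold = target_total // 4  # 50 QPS per gateway

for count in (2, 4, 8):
    print(count, "gateways ->", effective_total(node_threshold, count), "QPS admitted at most")
```

After scaling in to 2 gateways the tier admits at most 100 QPS, and after scaling out to 8 it admits up to 400 QPS, although the intended total never changed. A cluster-wide threshold stays at 200 regardless of the instance count.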

AHAS cluster flow control

AHAS cluster flow control precisely controls the total number of real-time calls to a service interface across the whole cluster. It solves the poor limiting effect that single-node flow control suffers from when traffic is uneven, when the number of machines changes frequently, or when the per-node threshold becomes too small after amortization. Combined with single-node flow control as a fallback, it provides even better traffic protection.
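The combination of a cluster-wide limit with a single-node fallback can be sketched as follows. This is an illustrative in-process model under our own assumptions, not the AHAS implementation: TokenServer, NodeLimiter, and the outage flag are hypothetical names, and a real token server would be reached over the network.

```python
import time


class WindowCounter:
    """Admits up to `threshold` requests per fixed window."""

    def __init__(self, threshold, window_seconds=1.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def try_acquire(self):
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            self.window_start, self.count = now, 0
        if self.count >= self.threshold:
            return False
        self.count += 1
        return True


class TokenServer:
    """Hypothetical central counter shared by every instance in the cluster."""

    def __init__(self, total_threshold, clock=time.monotonic):
        self.counter = WindowCounter(total_threshold, clock=clock)
        self.available = True  # set to False to simulate an outage

    def request_token(self):
        if not self.available:
            raise ConnectionError("token server unreachable")
        return self.counter.try_acquire()


class NodeLimiter:
    """Per-instance entry point: cluster limit first, local limit as fallback."""

    def __init__(self, server, local_threshold, clock=time.monotonic):
        self.server = server
        self.local = WindowCounter(local_threshold, clock=clock)

    def try_acquire(self):
        try:
            return self.server.request_token()
        except ConnectionError:
            # Fallback: protect this instance with its own threshold.
            return self.local.try_acquire()
```

Because every instance draws from the same counter, the total admitted per window is bounded by the cluster threshold no matter how the traffic is spread. The local counter only comes into play when the shared one cannot be reached.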

For the scenarios above, AHAS cluster flow control precisely controls the total number of calls regardless of call logic, traffic distribution, or instance distribution, whether the traffic comes from Dubbo service calls, Web API access, or custom business logic. It supports high-volume flow control at hundreds of thousands of QPS as well as precise control of small traffic volumes in minute-level or hour-level service dimensions. The behavior after protection is triggered can be customized by users (for example, returning user-defined content or objects).
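As a rough illustration of a minute-level threshold with a user-defined response, consider the sketch below. The make_guard helper, the export_report handler, and the shape of the blocked response are all hypothetical; in AHAS these settings are configured as rules in the console rather than written in code like this.

```python
import time


def make_guard(threshold, window_seconds, on_blocked, clock=time.monotonic):
    """Build a decorator that lets `threshold` calls per window reach the handler."""
    state = {"start": clock(), "count": 0}

    def guard(handler):
        def wrapped(*args, **kwargs):
            now = clock()
            # Reset the counter when the window (for example 60 s) has elapsed.
            if now - state["start"] >= window_seconds:
                state["start"], state["count"] = now, 0
            if state["count"] >= threshold:
                # Blocked: return the user-defined content instead.
                return on_blocked(*args, **kwargs)
            state["count"] += 1
            return handler(*args, **kwargs)

        return wrapped

    return guard


# At most 5 calls per minute reach the handler; the rest get a custom reply.
per_minute = make_guard(5, 60.0, on_blocked=lambda order_id: {"code": 429, "msg": "busy"})


@per_minute
def export_report(order_id):
    return {"code": 200, "order": order_id}
```

The same counter logic serves second, minute, or hour windows. Only the window length changes, and the blocked branch is where a custom return value would be produced.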

AHAS cluster protection has the following advantages:

  • Broad scenario coverage: from precise protection of gateway/Mesh entry traffic and precise flow control of Web/RPC service/SQL invocations to traffic control in minute/hour service dimensions, with support for hundreds of thousands of QPS.

  • Low cost of use: no special integration is required; it works out of the box and is quick to configure.

  • Automated management, control, and O&M: Token Server resources are managed, controlled, and allocated automatically to ensure availability. Users do not need to prepare or allocate server resources; they only need to focus on rule configuration and their business processes.

  • Low performance overhead: in performance mode, the service link sees no added latency; in precise mode, the added RT on the service link is kept within 3 ms, so users can adopt it with confidence.

  • Built-in observability: interface stability and rule effectiveness can be viewed in real time.

Let's walk through an example of how to quickly connect an application to AHAS, try out its cluster flow control capability, and ensure service stability.

Quick start with AHAS cluster flow control

Step 1: Connect the service or gateway to AHAS traffic protection. AHAS provides a variety of fast, convenient, non-invasive access methods.

AHAS traffic protection supports native access for multiple languages such as Java, Go, C++, and PHP, as well as Nginx/Ingress gateway access and Mesh access. Java applications are supported across a full range of 20+ microservice frameworks and components (see the links at the end of this article):

  • Web server: Spring Web/Spring Boot/Spring Cloud/Tomcat/Jetty/Undertow

  • Web Client: OkHttp/Apache HttpClient

  • RPC: Dubbo/Feign/gRPC

  • DAO/cache: MyBatis/Spring Data JPA/Memcached/Jedis client

  • MQ Consumer: RocketMQ Client/Kafka Client

  • API Gateway: Spring Cloud Gateway/Zuul 1.x

  • Reactor framework

Once AHAS is connected and a service call or interface access has been triggered, the service appears in the AHAS console (see the related links at the end of this article), and its interfaces appear on the monitoring page.

Step 2: On the “Cluster Flow Control – Cluster Configuration” page in the application's left-hand menu, enable the cluster flow control function. We can enable a “trial” cluster; different cluster sizes can carry different QPS levels.

Step 3: On the real-time monitoring page, find an interface and click “+” in the upper right corner to add a flow control rule (see the related links at the end of this article for details). In this example, we configure a cluster flow control rule for the /doSomething interface so that its total number of visits per second does not exceed 200. If the rule status is On, the new rule takes effect immediately.

Clicking Next lets us configure the processing logic applied to the selected Web/RPC call after the protection rule is triggered, such as a custom return value (see the link at the end of this article). Once the configuration is complete, we click the Add button and the rule is applied to every node.

After the configuration is complete, we can send a number of requests for the interface to different machines in the application cluster. Once the number of requests exceeds 200 per second, the return behavior preset in the rule is returned automatically. The real-time monitoring page of the console also shows the excess traffic being rejected, with the total passed traffic for the interface holding steady at 200 QPS.
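A small script can help with this check. The sketch below spreads GET requests round-robin over several instances and tallies the status codes. The host addresses are placeholders, and the status code returned for blocked requests depends on the return behavior configured in the rule, so treat both as assumptions.

```python
import itertools
import urllib.error
import urllib.request
from collections import Counter


def http_status(url, timeout=2.0):
    """Return the HTTP status code for a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses arrive as exceptions; the code is what we want.
        return err.code


def probe(hosts, path, total_requests, send=http_status):
    """Spread requests round-robin over the hosts and tally status codes."""
    tally = Counter()
    for _, host in zip(range(total_requests), itertools.cycle(hosts)):
        tally[send(f"http://{host}{path}")] += 1
    return tally


# Example call (placeholder addresses, not real endpoints):
# print(probe(["10.0.0.1:8080", "10.0.0.2:8080"], "/doSomething", 1000))
```

This version sends requests one at a time, so it may not exceed 200 requests per second on its own. Running several copies in parallel or using a load-testing tool is a more realistic way to push past the threshold.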

With a few simple configuration steps, we can quickly experience the “silky smooth” protection that AHAS cluster flow control provides for service traffic. AHAS has also recently launched new core features such as Nginx/Ingress gateway traffic protection and flow control on Web request parameters (see the links at the end of this article). You are welcome to try them in the AHAS console.

Related links

1) A full range of 20+ microservice frameworks/components:

Help.aliyun.com/document_de…

2) AHAS Console:

Common-buy.aliyun.com/?commodityC…

3) New flow control rules:

Help.aliyun.com/document_de…

4) Processing logic after protection rules are triggered:

Help.aliyun.com/document_de…

5) Nginx/Ingress gateway traffic protection:

Help.aliyun.com/document_de…

6) Flow control of Web request parameters:

Help.aliyun.com/document_de…