Brief introduction: Application High Availability Service (AHAS) is a cloud product distilled from many years of experience with Alibaba’s internal high availability system. Taking traffic and fault tolerance as its entry point, it helps ensure service stability across multiple dimensions, including flow control, isolation of unstable calls, circuit breaking and degradation, hotspot traffic protection, adaptive system protection, and cluster flow control, while also providing second-level traffic monitoring and analysis.

Author: Chu He

The stability of microservices has always been a major concern for developers. As services evolve from monolithic to distributed architectures and deployment modes change, the dependencies between services become more and more complex, and business systems face serious high availability challenges. Application High Availability Service (AHAS) is a cloud product distilled from many years of experience with Alibaba’s internal high availability system. Taking traffic and fault tolerance as its entry point, it helps ensure service stability across multiple dimensions, including flow control, isolation of unstable calls, circuit breaking and degradation, hotspot traffic protection, adaptive system protection, and cluster flow control, while also providing second-level traffic monitoring and analysis. AHAS is widely used in Alibaba's e-commerce businesses such as Taobao and Tmall, and has also been put into practice extensively in Internet finance, online education, gaming, live streaming, and other industries, including large government and enterprise customers.

Flow control is the most common and direct way to keep microservices stable. Every system or service has an upper limit on its capacity, and the idea behind flow control is simple: when the QPS of an interface exceeds that limit, the system rejects the excess requests so that it is not overwhelmed by sudden traffic. The most common solution on the market is single-node flow control. For example, if a PTS performance test estimates the maximum capacity of an interface at 100 QPS and the service has 10 instances, each instance is configured with a single-node limit of 10 QPS. In practice, however, traffic is rarely distributed evenly, so flow control at the single-node level often performs poorly.
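The single-node arithmetic can be illustrated with a minimal Python sketch. It is not AHAS code: `per_node_threshold` and `SingleNodeLimiter` are hypothetical names, and the fixed one-second counting window is an assumption chosen for simplicity.

```python
def per_node_threshold(cluster_capacity_qps, instances):
    """Split the estimated cluster capacity evenly across instances."""
    return cluster_capacity_qps // instances


class SingleNodeLimiter:
    """Fixed-window QPS limiter for one instance (illustrative only)."""

    def __init__(self, qps_limit):
        self.qps_limit = qps_limit
        self.window = None  # the one-second window currently being counted
        self.count = 0      # requests admitted in that window

    def try_acquire(self, now):
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        second = int(now)
        if second != self.window:
            # A new second has started, so reset the counter.
            self.window = second
            self.count = 0
        if self.count < self.qps_limit:
            self.count += 1
            return True
        return False


# 100 QPS of estimated capacity shared by 10 instances gives 10 QPS per instance.
limit = per_node_threshold(100, 10)
limiter = SingleNodeLimiter(limit)
```

With a limit of 10 QPS, the eleventh request that an instance receives within the same second is rejected, even if the other nine instances are idle at that moment.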

Typical scenario 1: Precisely controlling the total number of calls to a downstream service

**Scenario:** Service A needs to call the query interface of service B frequently, but the two services have different capacities. Service B agrees to provide service A with a total query capacity of 600 QPS, which is enforced through flow control.

**Pain points:** With a single-node flow control policy, the traffic from A to B may be distributed very unevenly across instances because of factors such as the invocation logic and the load balancing policy. Some instances of service B that receive heavy traffic trigger single-node flow control even though the overall limit has not been reached, so the SLA is not met. This imbalance often occurs when a dependent service or component (such as a database) is called, and it is a typical scenario for cluster flow control: precisely controlling the total number of calls that a microservice cluster makes to downstream services (or databases and caches).
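To make the pain point concrete, the following sketch compares the two policies for one second of traffic. The instance count (6), the per-node limit (100 QPS, that is, 600 / 6), and the traffic split are all assumed values for illustration; only the 600 QPS total comes from the scenario.

```python
CLUSTER_LIMIT = 600                       # total QPS agreed between A and B
TRAFFIC = [250, 150, 80, 50, 40, 30]      # assumed requests per instance in one second
PER_NODE_LIMIT = CLUSTER_LIMIT // len(TRAFFIC)


def admitted_single_node(per_instance_traffic, per_node_limit):
    """Each instance independently admits at most its own share."""
    return sum(min(t, per_node_limit) for t in per_instance_traffic)


def admitted_cluster(per_instance_traffic, cluster_limit):
    """Only the cluster-wide total is capped, however traffic is split."""
    return min(sum(per_instance_traffic), cluster_limit)


single = admitted_single_node(TRAFFIC, PER_NODE_LIMIT)
cluster = admitted_cluster(TRAFFIC, CLUSTER_LIMIT)
```

Here 600 requests arrive, exactly the agreed total, yet the per-node policy admits only 400 because two instances exceed their own share, while a cluster-wide limit admits all 600.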

Typical scenario 2: Controlling the total number of requests at the entry of the service link

**Scenario:** Inbound traffic is controlled at an Nginx/Ingress gateway or an API gateway (Zuul). The goal is to precisely limit the traffic of one API or a group of APIs so that protection is applied up front and excess traffic is never sent to the back-end system.

**Pain points:** If the limit is configured per machine, it is hard to account for changes in the number of gateway machines, and uneven gateway traffic may weaken the effect of traffic limiting. From the perspective of the gateway entry, configuring one overall threshold is also the most natural approach.

AHAS cluster flow control

AHAS cluster flow control precisely controls the total number of real-time calls to a service interface across the whole cluster. This addresses the weaknesses of single-node flow control: poor limiting caused by uneven traffic, frequent changes in the number of machines, and per-node thresholds that become too small once the total is divided among instances. Combined with single-node flow control as a fallback, it provides more effective traffic protection.

For the scenarios above, AHAS cluster flow control precisely controls the total number of calls regardless of call logic, traffic distribution, or instance distribution, whether for Dubbo service calls, Web API access, or custom business logic. It supports high-volume flow control at hundreds of thousands of QPS as well as precise control of low-volume traffic in minute-level or hour-level business dimensions. The behavior after protection is triggered can be customized by users (for example, returning user-defined content or objects).
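One way to picture the mechanism is a shared counter that every instance consults before admitting a request, with a local limiter as the fallback. The sketch below is a simplified, in-process model under that assumption; `TokenServer`, `ClusterNode`, and the simulated outage flag are hypothetical names, and it does not reflect how AHAS is actually implemented.

```python
class FixedWindowCounter:
    """Counts admitted requests per one-second window (illustrative only)."""

    def __init__(self, qps_limit):
        self.qps_limit = qps_limit
        self.window = None
        self.count = 0

    def try_acquire(self, now):
        second = int(now)
        if second != self.window:
            self.window, self.count = second, 0
        if self.count < self.qps_limit:
            self.count += 1
            return True
        return False


class TokenServer:
    """Hypothetical stand-in for a shared, cluster-wide token counter."""

    def __init__(self, cluster_qps_limit):
        self.counter = FixedWindowCounter(cluster_qps_limit)
        self.available = True  # set to False to simulate an outage

    def request_token(self, now):
        if not self.available:
            raise ConnectionError("token server unreachable")
        return self.counter.try_acquire(now)


class ClusterNode:
    """One service instance: asks the shared counter first, falls back locally."""

    def __init__(self, server, fallback_qps_limit):
        self.server = server
        self.fallback = FixedWindowCounter(fallback_qps_limit)

    def try_acquire(self, now):
        try:
            return self.server.request_token(now)
        except ConnectionError:
            # Single-node flow control acts as the safety net.
            return self.fallback.try_acquire(now)
```

Because every node draws from the same counter, the cluster total stays at the configured limit no matter which node receives the traffic. The local fallback only takes over when the shared counter cannot be reached.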

AHAS cluster protection has the following advantages:

  • **Multiple scenarios:** Covers scenarios ranging from precise protection of gateway/Mesh entry traffic and precise flow control of Web/RPC service and SQL calls to traffic control in minute-level or hour-level business dimensions, and supports hundreds of thousands of QPS;
  • **Low cost of use:** No special integration is required; it works out of the box and is quick to configure;
  • **Automatic control and operation:** Token Server resources are managed and allocated automatically, and automated operation and maintenance ensures availability. Users do not need to prepare or allocate server resources and only need to focus on rule configuration and business processes;
  • **Low performance loss:** In performance mode, no latency is added to the service link. In precise mode, the RT overhead on the service link is kept within 3 ms, so users can adopt it with confidence;
  • **Matching observability:** Real-time insight into interface stability and rule effectiveness.

Let’s use an example to show how to quickly connect an application to AHAS, try out cluster flow control, and ensure service stability.

Getting started quickly with AHAS cluster flow control

Step 1: Connect the service or gateway to AHAS traffic protection. AHAS provides a variety of fast, convenient, non-intrusive access methods.

AHAS traffic protection supports native access for multiple languages such as Java, Go, C++, and PHP, as well as Nginx/Ingress gateway access and Mesh access. For Java applications, more than 20 microservice frameworks and components are supported:

  • Web server: Spring Web/Spring Boot/Spring Cloud/Tomcat/Jetty/Undertow
  • Web Client: OkHttp/Apache HttpClient
  • RPC: Dubbo/Feign/gRPC
  • DAO/cache: MyBatis/Spring Data JPA/Memcached/Jedis client
  • MQ Consumer: RocketMQ Client/Kafka Client
  • API Gateway: Spring Cloud Gateway/Zuul 1.x
  • Reactor framework

Once AHAS access succeeds, as soon as a service call or interface access is triggered, you can see your service in the AHAS console and your interfaces on the monitoring page.

Step 2: Enable cluster flow control on the “Cluster Flow Control – Cluster Configuration” page in the left menu of the application. A “trial” cluster can be enabled, and different cluster sizes can carry different QPS levels.

Step 3: Find an interface on the real-time monitoring page and click “+” in the upper right corner to add a flow control rule. In this example, we configure a cluster flow control rule for the /doSomething interface that limits its total number of visits per second to 200. If the rule status is On, the new rule takes effect immediately.

By clicking Next, we can also configure what happens after the protection rule is triggered for the selected Web/RPC call, such as a custom return value. Once the configuration is complete, we click the Add button and the rule is applied to every node.
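In AHAS this behavior is configured in the console, but the idea of a custom return value can be sketched in code. The example below is a rough model only: `flow_controlled`, `blocked_response`, and the response format are hypothetical, and for brevity a local in-process limiter stands in for the cluster-wide rule.

```python
import functools
import time


class FixedWindowLimiter:
    """Admits at most `qps_limit` requests per one-second window (illustrative)."""

    def __init__(self, qps_limit):
        self.qps_limit = qps_limit
        self.window = None
        self.count = 0

    def try_acquire(self, now):
        second = int(now)
        if second != self.window:
            self.window, self.count = second, 0
        if self.count < self.qps_limit:
            self.count += 1
            return True
        return False


def flow_controlled(limiter, on_blocked, clock=time.monotonic):
    """Wrap a handler so that rejected requests get a custom response."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            if limiter.try_acquire(clock()):
                return handler(*args, **kwargs)
            # The rule was triggered: return the user-defined content instead.
            return on_blocked(*args, **kwargs)
        return wrapper
    return decorator


def blocked_response(*args, **kwargs):
    """Hypothetical custom return value for rejected requests."""
    return {"code": 429, "message": "Too many requests, please retry later"}


@flow_controlled(FixedWindowLimiter(qps_limit=200), on_blocked=blocked_response)
def do_something():
    return {"code": 200, "message": "ok"}
```

Keeping the blocked behavior in a separate function mirrors the console workflow, where the threshold and the return value are configured independently of the business logic.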

After the configuration is complete, we can send requests for the interface to different machines in the application cluster. Once the number of requests exceeds 200 per second, the return behavior preset in the rule is returned automatically. At the same time, the real-time monitoring page of the console shows that the excess traffic is rejected and that the total number of requests passing through the interface per second holds steady at 200 QPS.

With just a few simple configuration steps, we can quickly experience the “silky smooth” protection that AHAS cluster flow control provides for service traffic. AHAS has also recently launched other core functions such as Nginx/Ingress gateway traffic protection and flow control by Web request parameters. You are welcome to visit the AHAS console to try them out.

This article is the original content of Aliyun and shall not be reproduced without permission.