In a microservices architecture, stability and high availability are perennial topics. In day-to-day governance, we may encounter the following scenarios:

  • An application is released in a grayscale rollout to a few machines first, and a bug in the new code fills the thread pool, so those machines run abnormally.
  • In the server cluster, the load on some servers is too high because of full disks or resource contention between hosts, and client invocations time out.
  • Some machines in the server cluster have a full thread pool or run into Full Garbage Collection (Full GC).
In all three scenarios, the client is not aware of the problem and keeps sending requests to the faulty servers, so business invocations fail. A temporary failure of a few machines can therefore drag down the applications that call them and even cause an application avalanche.

In these scenarios, degrading the whole service because of a few faulty machines does too much harm to the application. If we can instead detect the faulty instances in the service cluster and isolate them briefly, we can effectively preserve the high availability of the service and the stability of the system. This also gives operations staff valuable buffer time for locating and troubleshooting the problem.

This article, the first in a series on microservice governance practices, introduces how to implement outlier instance removal. The series is based on the microservice practice of Alibaba Cloud’s commercial product EDAS. If your team already has strong microservice governance capabilities, we hope our practice and the thinking behind it can serve as a useful reference.

Microservice Outlier Ejection

  • What is outlier instance removal
When a single provider instance becomes abnormal, the consumer can detect this on its own, remove that provider instance for a short time and stop sending requests to it, and then try it again after a certain interval. In addition, provider-wide exceptions can be detected: when the proportion of abnormal provider instances exceeds a configured percentage, the overall service quality of the provider is considered low.

  • What outlier instance removal provides
It strengthens service stability through fault tolerance at the service layer and effectively addresses single points of failure.

  • Difference from circuit breaking
A circuit breaker disconnects the input load of a service when that load surges, so that the service is not quickly overwhelmed and does not cause an avalanche effect. A circuit breaker generally consists of a request judgment algorithm, a recovery mechanism, alarms, and other modules. Isolation is an architectural approach in system design that prevents failures from spreading when dependent services fail.

If the only problem is a single outlier instance in the server cluster, circuit breaking and degradation do too much harm to the application, whereas outlier instance removal solves the single-instance problem effectively and preserves service quality. If the overall service quality of the provider is low and outlier removal has little effect, you can use circuit breaking and degradation instead.

  • Versions supported by outlier instance removal
As long as your application version is in the list, you can use the outlier removal feature without changing a line of code.

Currently supported versions:
Dubbo: 2.5.0 to 2.7.3; Dubbox versions < 2.5.0
Spring Cloud: D, E, F, G, and HC

These versions cover most microservice scenarios in use today, and we will continue to support the latest open source Dubbo/Spring Cloud releases.

We provide outlier removal for both Dubbo and Spring Cloud. This article first introduces the practice and effect of outlier ejection for Dubbo microservices.

The following sections demonstrate the function and effect of Dubbo outlier removal on EDAS.

Enterprise Distributed Application Service (EDAS) is a PaaS platform for application hosting and microservice management. It provides full-stack solutions for application development, deployment, monitoring, and operations and maintenance, and supports microservice runtime environments such as Dubbo and Spring Cloud.

www.aliyun.com/product/eda…

Preparation

Next, we use a microservice demo to demonstrate the outlier removal function. The demo can be downloaded from GitHub for verification:

Github.com/aliyun/alib…
The microservice demo is a simple e-commerce project. The following figure shows the project structure. cartservice is the shopping cart service provider, built on the Dubbo framework, and productservice is the product detail service provider, built on Spring Cloud. frontend is the web controller that serves the front-end pages and can be understood as the consumer.





We will use the CartService service (Dubbo server) as an example to show the outlier instance removal function.

Deploy the microservice Demo on EDAS

First, run cd cartservice to switch to the cartservice directory and run mvn clean install to build the package. Then run cd cartservice-provider/target to switch to the target directory, where we can see the newly generated cartservice-provider-1.0.0-SNAPSHOT.jar package. Then create a cartservice application on EDAS.





After clicking Next, upload the cartservice-provider/target/cartservice-provider-1.0.0-SNAPSHOT.jar package and click Next again. Make a note of the login password, and continue until the application is created successfully.





Then start the application. At this point one cartservice-provider instance is running. Click the option to scale out using this instance's specification, so that the service is deployed on two instances.



In the provider's com.alibabacloud.hipstershop.provider.CartServiceImpl class, we can see that the provider offers two shopping cart services, viewCart and addItemToCart. We add some logic to them to simulate runtime exceptions.

    @Value("${exception.ip}")
    private String exceptionIp;

    @Override
    public List<CartItem> viewCart(String userID) {
        // Simulate a failure on the instance whose IP matches the configured exception.ip.
        if (exceptionIp != null && exceptionIp.equals(getLocalIp())) {
            throw new RuntimeException("Runtime exception");
        }

        return cartStore.getOrDefault(userID, Collections.emptyList());
    }
exceptionIp is bound to the exception.ip configuration item in the ACM configuration center. If this item is set to the local IP address of an instance, the service on that instance throws a RuntimeException to simulate a service exception scenario.

  • Why is cartservice scaled out to two instances? As you may have guessed, at run time we set exception.ip in the ACM configuration center to the IP address of one instance, to simulate a scenario in which a single instance is abnormal.
Next, we deploy frontend and productservice in the same way, uploading frontend/target/frontend-1.0.0-SNAPSHOT.jar and productservice/productservice-provider/target/productservice-provider-1.0.0-SNAPSHOT.jar respectively.

As you can see from the figure below, our microservice Demo is deployed on EDAS.





Simulating service Exceptions

In the frontend application, we can see that the public IP address of the instance is 47.99.150.33.



Open http://47.99.150.33:8080/ in a browser.





Click View Cart to access http://47.99.150.33:8080/cart



As you can see, the services are all normal at this point.

We go to the ACM configuration center and set exception.ip to 172.16.205.180 (that is, the IP of one of the cartservice instances).



Then we keep visiting http://47.99.150.33:8080/cart and find that an error page is returned about 50% of the time.





At this point, we write a script that sends requests to http://47.99.150.33:8080/cart at regular intervals.

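while :
do
        # Fetch the page silently; $1 is the URL passed on the command line.
        result=`curl -s "$1"`
        # Treat any response containing "500" as a failed call.
        # Note: [[ ... ]] is a bash feature, so sh must resolve to bash here.
        if [[ "$result" == *"500"* ]]; then
                echo `date +%F-%T` $result
        else
                echo `date +%F-%T` "success"
        fi
        sleep 0.1
done
Then run sh curlservice.sh http://47.99.150.33:8080/cart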

With about 10 calls per second, we see the call success rate stay at about 50%.





This is easy to understand: the service quality of a downstream application deteriorates sharply when a single upstream machine is abnormal, and the downstream application may even be dragged down entirely by (system or business) exceptions on a few upstream machines.

Enable the outlier removal policy

Next, we demonstrate how to enable an outlier removal policy and what effect it has.

Create

In the left navigation of EDAS, we open the [Outlier Instance Removal] page under [Microservice Management] and choose to create an outlier instance removal policy.





Then follow the prompts step by step to create the outlier removal policy.

Basic information





As shown above, you can select the namespace, fill in the policy name, and select the framework type (Dubbo/Spring Cloud) supported by the policy.

Select the applications the policy applies to





Given the current call chain, we only need to select the frontend application, which is the consumer to be protected.

Configure the policy





These parameters all have default values; adjust them to the values that best suit your application. Because the RuntimeException we want protection against is a business exception, select network exception + business exception. (Note that even if the upper limit on the instance removal ratio is so low that it works out to less than one instance, we still remove one instance when the cluster has more than one instance and an instance is abnormal.)

Creation complete





The policy information is displayed.

The policy





Here we can see the outlier removal policy we created, for the Dubbo framework and with the exception type network exception + business exception.

Verify the effect of outlier removal

Now we see that outlier removal takes effect once the abnormality is detected: after the requests have been running for a while, they all return correct results.





Repeatedly refreshing http://47.99.150.33:8080/cart in the browser also returns a normal page every time.



When the client detects that a server instance is abnormal, it removes that instance automatically and calls only the healthy provider instances. Through ARMS (EDAS monitoring system), we can also see that service quality improves and that traffic is moved away from the abnormal provider.

For the Dubbo framework, go to the /home/admin/opt/ArmsAgent/logs directory and search the logs for the keyword “OutlierRouter” to see the series of event logs written when an instance is removed.

Modify/disable the outlier removal policy

For EDAS applications we support dynamic modification and deletion of outlier policies through the console.

  • Modify the corresponding policy rules
Click the button to modify the applications a policy applies to, or to edit the policy.





Then add or remove applications or adjust parameters. The changes take effect immediately.

  • Deleting a Corresponding Policy




Operations performed on the console take effect on application configurations in real time. Once a policy is deleted, it is disabled.

We can also turn on ARMS monitoring to observe the actual invocations.

A striking difference in monitoring

If we enable monitoring, we will see information such as traffic and request errors.

Before enabling outlier removal

Go to the application monitoring page of ARMS (EDAS monitoring system). We need to enable advanced monitoring for all three applications.



We can see the results directly from the application monitoring page of ARMS (EDAS Monitoring System) in the figure below.





As we can see in the following topology, traffic is constantly accessing the CartService service.





After starting outlier removal

The simple example above already shows the effect of outlier removal. The improvement in service quality can also be observed clearly through the monitoring of ARMS (EDAS monitoring system).





It can be seen that after the point where outlier removal was started, the error rate decreased significantly from 50%.





The two small spikes occur because, after an endpoint has been removed for a while, access to it is retried. If its error rate is still above the threshold, it is isolated again, for a longer interval.

Control logic of outlier instance removal

We have seen how outlier instance removal improves stability. Now we analyze its control logic in detail, to help you better understand what each parameter means and configure a suitable removal policy by adjusting the parameters for your own application.

For the Dubbo/SpringCloud framework:

  • The default QPS lower limit is 1
Outlier protection starts only when the QPS of the current instance is greater than 1.

  • The default error rate lower limit is 50%
If the invocation error rate of an instance is higher than 50%, the system considers that an instance of the server cluster is abnormal.

  • The default upper limit on the proportion of instances removed is 20%
If more than 20% of the instances in the current service cluster are abnormal, the system removes abnormal instances only up to 20% of the instances in the cluster.

  • Exception types
If the exception type is network exception, the system counts only network exceptions toward the error rate and ignores business exceptions. If network exception + business exception is selected, the system treats all exceptions as errors and counts them toward the error rate.

  • Recovery detection unit time (30s by default) and maximum number of unrecovered times (40 by default)
The first removal lasts 0.5 minutes, after which the consumer resumes access to the provider instance. If its service quality is still low, the instance is removed again. The removal duration grows linearly with the number of consecutive removals, by 0.5 minutes each time, up to a maximum of 20 minutes per removal (40 × 0.5 minutes). If service quality has recovered when calls resume, the instance is treated as healthy again. The next time an exception degrades its service quality, it is isolated for 0.5 minutes again and the rules above start over.

  • Safeguards
If the provider of the service invoked by the client has only one instance, that instance is never isolated.

If the provider has more than one instance and the number of instances allowed by the removal ratio works out to less than one, one instance is still removed when a single point of failure occurs in the server cluster.

Every “instance” mentioned above can be understood as an endpoint (the dimension is IP + port).

  • General best practices
Set a relatively high error rate threshold (50%) and a relatively low upper limit on the proportion of instances removed (10%).
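
The parameters above can be tied together in a small sketch. The Java below is only an illustration under stated assumptions, not EDAS code: the class and method names are hypothetical, the constants are the defaults quoted in this article, and rounding the removal ratio down before applying the at-least-one rule is our own assumption.

```java
/** Hypothetical sketch of the control rules described in this article; not EDAS code. */
class OutlierPolicySketch {
    // Defaults quoted in the article.
    static final double QPS_LOWER_LIMIT = 1.0;
    static final double ERROR_RATE_THRESHOLD = 0.5;
    static final int RECOVERY_UNIT_SECONDS = 30;
    static final int MAX_UNRECOVERED_TIMES = 40;

    /** An endpoint is judged abnormal only if it has enough traffic and too many errors. */
    static boolean isAbnormal(double qps, double errorRate) {
        return qps > QPS_LOWER_LIMIT && errorRate > ERROR_RATE_THRESHOLD;
    }

    /** Upper bound on how many endpoints may be removed from a cluster of the given size. */
    static int maxEjectable(int clusterSize, double maxRatio) {
        if (clusterSize <= 1) {
            return 0; // a sole instance is never isolated
        }
        // Assumption: the ratio is rounded down before the at-least-one rule applies.
        int byRatio = (int) Math.floor(clusterSize * maxRatio);
        return Math.max(1, byRatio); // ratio works out below one: still remove one
    }

    /** Removal time grows by one unit per consecutive removal, capped at the maximum. */
    static int ejectionSeconds(int consecutiveEjections) {
        int times = Math.min(Math.max(consecutiveEjections, 1), MAX_UNRECOVERED_TIMES);
        return times * RECOVERY_UNIT_SECONDS;
    }
}
```

Under these assumptions, a two-instance cluster such as the cartservice demo allows one instance to be removed with a 20% limit, and a third consecutive removal lasts 90 seconds.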

Outlier instance removal technical details

Non-invasive technology

The non-intrusive solution is implemented with agent technology, in other words bytecode enhancement: our code is inserted at run time and changes the application's original logic. It can be thought of as runtime AOP. By inserting a Filter/Router into the Dubbo call chain and enhancing the LoadBalance logic in Spring Cloud, we implement the routing control logic we want. Because the enhancement is done by the agent, the overall call chain barely changes across Dubbo versions, and the Spring Cloud model is consistent, we can cover essentially all versions at low cost.



Technical architecture of Dubbo Agent solution

For users, they can enjoy stability enhancement without changing a single line of code or configuration.
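
To make the idea of "runtime AOP" concrete, here is a minimal sketch that uses a different, simpler technique: a JDK dynamic proxy rather than agent-based bytecode enhancement. It only illustrates how call outcomes can be recorded around an unchanged implementation. The Greeter interface and all names are hypothetical and do not come from the demo or from EDAS.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustration only: a JDK dynamic proxy standing in for agent-based enhancement. */
class InterceptSketch {
    /** Hypothetical service interface, used only for this example. */
    interface Greeter {
        String greet(String name);
    }

    static final AtomicInteger successes = new AtomicInteger();
    static final AtomicInteger failures = new AtomicInteger();

    /** Wraps a Greeter so that every call is counted without touching its implementation. */
    static Greeter instrument(Greeter target) {
        return (Greeter) Proxy.newProxyInstance(
            Greeter.class.getClassLoader(),
            new Class<?>[] { Greeter.class },
            (proxy, method, args) -> {
                try {
                    Object result = method.invoke(target, args);
                    successes.incrementAndGet();
                    return result;
                } catch (InvocationTargetException e) {
                    failures.incrementAndGet();
                    throw e.getCause(); // rethrow the original exception to the caller
                }
            });
    }
}
```

A real agent rewrites classes as they are loaded, so the application does not even need to call a wrapper such as instrument; the proxy here only shows the interception pattern.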

Outlier instance removal technique

Outlier Detection

All data statistics are based on time window.

1. Dubbo 2.7: a MetricsFilter is embedded in the call chain to record a data point for each request/response, collecting RT, whether the call succeeded, and the exception type. The endpoint (IP + port) is used as the storage key.

2. HTTP requests are counted in the agent's base layer: statistics for the latest time window are collected from data such as URL, RT, status code, and exception type.

The call information of the last N seconds is counted in real time as the basis for the action of outlier instance removal.
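
A minimal sketch of such time-window statistics is shown below. It is an assumption-laden illustration rather than the agent's actual data structure: the class name, the per-call deque, and the "ip:port" string key are all hypothetical choices, and the caller supplies timestamps explicitly so the behavior is easy to follow.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of time-window call statistics keyed by endpoint (IP + port). */
class EndpointWindowStats {
    /** One recorded call: when it finished and whether it failed. */
    private static final class Call {
        final long timeMillis;
        final boolean error;

        Call(long timeMillis, boolean error) {
            this.timeMillis = timeMillis;
            this.error = error;
        }
    }

    private final long windowMillis;
    private final Map<String, Deque<Call>> callsByEndpoint = new ConcurrentHashMap<>();

    EndpointWindowStats(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Records the outcome of one call to the given endpoint, written as "ip:port". */
    void record(String endpoint, long nowMillis, boolean error) {
        Deque<Call> calls = callsByEndpoint.computeIfAbsent(endpoint, k -> new ArrayDeque<>());
        synchronized (calls) {
            calls.addLast(new Call(nowMillis, error));
            evictExpired(calls, nowMillis);
        }
    }

    /** Error rate of the endpoint over the last window; 0.0 if it had no calls. */
    double errorRate(String endpoint, long nowMillis) {
        Deque<Call> calls = callsByEndpoint.get(endpoint);
        if (calls == null) {
            return 0.0;
        }
        synchronized (calls) {
            evictExpired(calls, nowMillis);
            if (calls.isEmpty()) {
                return 0.0;
            }
            long errors = calls.stream().filter(c -> c.error).count();
            return (double) errors / calls.size();
        }
    }

    /** Average calls per second to the endpoint over the last window. */
    double qps(String endpoint, long nowMillis) {
        Deque<Call> calls = callsByEndpoint.get(endpoint);
        if (calls == null) {
            return 0.0;
        }
        synchronized (calls) {
            evictExpired(calls, nowMillis);
            return calls.size() * 1000.0 / windowMillis;
        }
    }

    /** Drops calls that are at least one full window older than the given time. */
    private void evictExpired(Deque<Call> calls, long nowMillis) {
        while (!calls.isEmpty() && nowMillis - calls.peekFirst().timeMillis >= windowMillis) {
            calls.removeFirst();
        }
    }
}
```

Storing every call is the simplest possible design; a production implementation would more likely keep per-bucket counters to bound memory use, which is a design choice this sketch does not attempt to reproduce.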

Outlier Ejection

For Dubbo, removal is implemented on top of the Dubbo Router. Among all invokers for the upstream service being called, the “unhealthy” nodes are filtered out and the shielding information is recorded.





The Dubbo Router control logic only checks and marks state for each request. Two dedicated background threads decide whether marked endpoints should enter or leave the isolation list, and handle time-consuming work such as updating the shielding information, so that requests are handled in as close to real time as possible.

For Spring Cloud, removal is implemented as a LoadBalance extension and works in a similar way.
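
The split between a cheap per-request check and background maintenance can be sketched as follows. This is plain Java with hypothetical names, not the Dubbo Router API: endpoints are represented as "ip:port" strings instead of invokers, and the fallback that returns all endpoints when every one is isolated is our own assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: the request path only reads a set that other threads maintain. */
class IsolationRouterSketch {
    // Endpoints ("ip:port") currently isolated; written in the background, read per request.
    private final Set<String> isolated = ConcurrentHashMap.newKeySet();

    /** Called from a background thread when an endpoint is judged abnormal. */
    void isolate(String endpoint) {
        isolated.add(endpoint);
    }

    /** Called from a background thread when the isolation period of an endpoint ends. */
    void release(String endpoint) {
        isolated.remove(endpoint);
    }

    /** Request path: a lookup-only filter with no statistics or bookkeeping. */
    List<String> route(List<String> endpoints) {
        List<String> healthy = new ArrayList<>();
        for (String endpoint : endpoints) {
            if (!isolated.contains(endpoint)) {
                healthy.add(endpoint);
            }
        }
        // Assumption: never return an empty list; fall back to all endpoints instead.
        return healthy.isEmpty() ? endpoints : healthy;
    }
}
```

Because route only performs set lookups, the cost added to each request stays small, while the decisions about who enters or leaves the set can take as long as they need on other threads.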







This article is original content of Alibaba Cloud and may not be reproduced without permission.