
Authors | Guyi, Shimian

Preface

The first article in this series, “How to Build a Lossless-Traffic Online Application Architecture | Opening”, explained, based on the figure below, some key techniques for keeping traffic lossless in two areas: traffic analysis and traffic access. In this article we look at traffic from the service dimension and discuss in detail the technical points that can affect online traffic while a service is formally serving requests.

Service

Recently, we analyzed the causes of online failures at a large Internet company over the past year. Product quality problems (unreasonable design, bugs, etc.) accounted for the highest proportion at 37%; problems caused by online releases and by product and configuration changes accounted for 36%; the rest were high-availability problems related to dependencies, such as equipment failures, problems in upstream or downstream services, and resource bottlenecks.

Based on this cause analysis, it is clear that managing changes, product quality, and high-availability problems well is the key to keeping service traffic lossless. We break the application lifecycle down into several key stages:

  • Application change state: we need concrete means to protect our services (applications) throughout release and configuration changes.
  • Running service state: once a change is complete, the application starts out “cold”. How do we get the system smoothly through the period before it can handle normal, or even above-normal, traffic?
  • High-availability dependencies: do we have a plan for when some of our own service nodes fail, or when our external dependencies (other microservices, databases, caches) hit bottlenecks?

To deal with the above scenarios, we will discuss the corresponding solutions with examples.

1. Change state: graceful application offline

One step in an application change is stopping the application that is currently running. Before an online application in the production environment is stopped, its service traffic needs to be taken offline first. There are two main types of traffic: 1) synchronous call traffic, such as RPC and HTTP requests; 2) asynchronous traffic, such as message consumption and background task scheduling.

Both types of traffic will be partly lost if the process is stopped while requests are still being handled on the server. To avoid this, two steps are usually needed:

1) Remove the node from the corresponding registration services. Typical scenarios include: removing the node's RPC services from the registry, removing the node from the upstream load balancer for HTTP services, and shutting down the thread pools for background tasks (message consumption, etc.), so that no new requests or messages are taken on.

2) Wait for a period of time (depending on the service) so that the requests already inside the process can be handled properly, and only then shut down the process.
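As an illustration of these two steps, here is a minimal Java sketch of a graceful shutdown hook. The `RegistryClient` interface, the drain period, and the thread pool are placeholders standing in for whatever registry client and framework your application actually uses.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

public final class GracefulShutdown {

    /** Placeholder for whatever registry client the application really uses. */
    public interface RegistryClient {
        void deregister();
    }

    /**
     * Hypothetical sketch: deregister first, let in-flight requests drain,
     * then stop the worker pools and allow the process to exit.
     */
    public static void install(RegistryClient registry,
                               ExecutorService backgroundTasks,
                               long drainMillis) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                // 1) stop accepting new traffic: deregister RPC/HTTP services
                //    and stop pulling new messages / background tasks
                registry.deregister();
                backgroundTasks.shutdown();            // accept no new tasks

                // 2) give requests already inside the process time to finish
                Thread.sleep(drainMillis);
                backgroundTasks.awaitTermination(drainMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "graceful-shutdown"));
    }
}
```

In practice the drain period should be chosen from the service's own latency profile, and frameworks such as Spring Boot and Dubbo already ship similar graceful-shutdown behavior that can be configured instead of hand-rolled.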

2. Change state: application scheduling

Another action in a change is choosing the resources (machines or containers) on which to launch the deployment; how those resources are chosen is what we usually call scheduling. For traditional physical machines or virtual machines there is little room to play at this layer, largely because the resources are fixed in advance. The emergence of container technology, and especially the later popularity of Kubernetes, has brought many changes to delivery, and a different story to scheduling as well: it has moved us from the traditional, planned allocation of resources into an era of flexible scheduling.

In Kubernetes, by default, the scheduler picks the most suitable node based on the resources an application requests (CPU, memory, disk, etc.). Going one step further, we can also customize scheduling according to the traffic characteristics of our own applications; for example, we can try to keep heavy-traffic applications from piling up on the same node, so as to avoid traffic loss caused by bandwidth contention.
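To make the idea concrete, the following is a deliberately simplified Java sketch of the kind of scoring a traffic-aware scheduler could do. The per-node traffic figures and the “pick the least loaded node” rule are invented for illustration; a real implementation would live in a Kubernetes scheduler plugin or use pod anti-affinity rules.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Illustrative only: spread heavy-traffic pods by preferring the least loaded node. */
public final class TrafficAwareScheduler {

    /** nodeName -> estimated traffic of the pods already placed on that node. */
    private final Map<String, Long> nodeTraffic;

    public TrafficAwareScheduler(Map<String, Long> nodeTraffic) {
        this.nodeTraffic = nodeTraffic;
    }

    /** Pick the candidate node currently carrying the least traffic, so that
     *  heavy-traffic applications do not fight for bandwidth on one node. */
    public Optional<String> pickNode(List<String> candidates) {
        return candidates.stream()
                .min(Comparator.comparingLong((String n) -> nodeTraffic.getOrDefault(n, 0L)));
    }
}
```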

3. Change state: graceful application online

After the application has been scheduled onto suitable resources, the next step is to deploy and start it. Similar to the stop scenario, it is quite possible that the node gets registered and the background task thread pools get started before the application is fully initialized. At that point upstream systems (SLB routing, message brokers, and so on) will already send traffic its way, yet until initialization completes, the quality of service for that traffic cannot be guaranteed. For example, the first few requests after a Java application starts are usually so slow that they appear almost stuck.

How do we solve this problem? Mirroring the graceful-offline sequence, we need to deliberately delay service registration and the initialization of the background task and message consumer thread pools until the application is fully initialized. If an external load balancer is used to route traffic, the deployment automation tooling also has to coordinate when traffic is switched over.
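A minimal sketch of delayed registration, assuming a hypothetical `RegistryClient`: all heavy initialization happens first, and only then does the instance register itself and start consuming messages.

```java
public final class DelayedRegistration {

    public static void main(String[] args) throws Exception {
        // 1) do the heavy initialization first: connection pools, local caches,
        //    framework setup, possibly a few self-issued warm-up calls
        initConnectionPools();
        warmUpLocalCaches();

        // 2) only now expose the instance to traffic: register the service and
        //    start the message-consumer / background-task thread pools
        RegistryClient registry = new RegistryClient();     // assumed client
        registry.register("order-service", "10.0.0.12:20880");
        startMessageConsumers();
    }

    private static void initConnectionPools()   { /* ... */ }
    private static void warmUpLocalCaches()     { /* ... */ }
    private static void startMessageConsumers() { /* ... */ }

    /** Placeholder for the real registry client. */
    static final class RegistryClient {
        void register(String service, String address) { /* ... */ }
    }
}
```

The service name and address above are made up; the point is only the ordering of the steps.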

4. Change state: application service warm-up

Once the application is online, a sudden surge of traffic can push the system's water level up instantly and crash it. Typical scenarios are the stroke of midnight during a big promotion or an incoming flood peak: a newly started instance suddenly receives a large amount of traffic, which triggers low-level work such as JIT compilation, framework initialization, and class loading. That work puts a high load on the system for a short period and causes traffic loss. To solve this problem, we need to ramp traffic up slowly and increase the service capacity of newly started applications by, for example, enabling parallel class loading, initializing frameworks in advance, and making logging asynchronous. This keeps traffic lossless during scale-out and release operations under heavy traffic.
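One common way of ramping traffic up slowly is to give a newly started instance a smaller routing weight that grows with its uptime; Dubbo's warm-up weighting works in this spirit. The sketch below is an illustrative re-implementation of that idea, not the actual framework code.

```java
public final class WarmUpWeight {

    /**
     * Scale the configured routing weight by how far the instance is into its
     * warm-up window, so that a cold instance receives only a trickle of traffic.
     *
     * @param uptimeMillis how long the instance has been running
     * @param warmupMillis length of the warm-up window, e.g. 10 minutes
     * @param weight       the normal routing weight, e.g. 100
     */
    public static int currentWeight(long uptimeMillis, long warmupMillis, int weight) {
        if (uptimeMillis <= 0) {
            return 1;                       // brand-new instance: minimal traffic
        }
        if (uptimeMillis >= warmupMillis) {
            return weight;                  // fully warmed up: normal traffic share
        }
        int w = (int) (weight * uptimeMillis / warmupMillis);
        return Math.max(w, 1);              // never drop to zero during warm-up
    }
}
```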

5. Change state: combining with Kubernetes

Starting in 2020 we have seen a clear trend: Spring Cloud plus Kubernetes has become the most popular combination for microservices. In a microservice system built on Kubernetes, combining the microservice framework with Kubernetes effectively is a genuinely challenging point. Pod lifecycle management in Kubernetes itself provides two probe points:

  • readinessProbe is used to check whether a Pod is ready to receive traffic. If the probe fails, the Pod is removed from the endpoints of the Kubernetes Service and marked NotReady.
  • livenessProbe is used to check whether the Pod is healthy. If the probe fails, the container in the Pod is restarted.

If our application does not configure a readinessProbe, Kubernetes by default only checks whether the process in the container is running, and that says very little about whether the business inside that process is actually healthy. During a rolling release, Kubernetes starts destroying old Pods as soon as it sees that the business process in the new Pod has started, which seems fine at first. But think about it carefully: “the business process in the Pod has started” does not mean “the business is ready”. If there is a problem in the business code, the process may have started and may even have opened the business port, yet because of some exception its services have not been registered in time. Meanwhile the old Pods have already been destroyed, so consumers of the application may see No Provider errors, resulting in a large amount of traffic loss during the release.

Similarly, if our application does not configure a livenessProbe, Kubernetes by default only checks whether the process in the container is alive. But if a process in our application has effectively died, because of resource contention, full GC, exhausted thread pools, or some unexpected logic, the process is still alive while its quality of service is low or even zero. All traffic entering the application at that point will fail, again causing a large amount of traffic loss. In this case the application should tell Kubernetes, through the livenessProbe, that the current Pod is in an unhealthy state from which it cannot recover on its own and needs to be restarted.

Configuring readinessProbe and livenessProbe is about giving Kubernetes timely and sensitive feedback on the application's real health, so that the processes in the Pod stay healthy and service traffic is not lost.
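For illustration, here is a dependency-free Java sketch of two HTTP endpoints that such probes could point to. The port, paths, and readiness/liveness conditions are assumptions; in a Spring Boot application you would more likely point the probes at the actuator health endpoints.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.concurrent.atomic.AtomicBoolean;

public final class ProbeEndpoints {

    /** Flipped to true only after registration, caches and pools are ready. */
    static final AtomicBoolean READY = new AtomicBoolean(false);
    /** Flipped to false if the application detects an unrecoverable state. */
    static final AtomicBoolean LIVE = new AtomicBoolean(true);

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);

        // readinessProbe target: return 200 only when the business can take traffic
        server.createContext("/ready", exchange -> {
            int code = READY.get() ? 200 : 503;
            byte[] body = (code == 200 ? "ready" : "not ready").getBytes();
            exchange.sendResponseHeaders(code, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });

        // livenessProbe target: a 503 here asks Kubernetes to restart the Pod
        server.createContext("/live", exchange -> {
            int code = LIVE.get() ? 200 : 503;
            byte[] body = (code == 200 ? "alive" : "dead").getBytes();
            exchange.sendResponseHeaders(code, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });

        server.start();

        // ... full business initialization happens here, and only then:
        READY.set(true);
    }
}
```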

6. Change state: grayscale release

In any given release iteration, it is difficult to guarantee that the new code has been tested for every problem it could cause online. Why are most failures release-related? Because a release is the last step before the business goes online, and many problems accumulated during development are only triggered at that final step. In other words, the unspoken rule is that a release is almost never bug-free; the bugs are just bigger or smaller. The question a release really has to answer is: if problems are going to occur, how do we minimize their impact? The answer is grayscale. If issues that escaped testing are released to the full fleet at once, the errors are amplified across the whole network, causing large and prolonged traffic loss. If our system has grayscale capability (or even full-link grayscale), we can minimize the blast radius of a problem through grayscale publishing. The release becomes much more stable and secure if the system also has full observability during the grayscale process. And if the grayscale capability covers the whole call chain, online traffic can be protected even when multiple applications are released at the same time.
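As a hint of what a grayscale routing rule can look like, the sketch below sends requests that carry a gray tag to instances labeled as gray. The tag name and instance metadata are assumptions; full-link grayscale additionally requires propagating the tag along the entire call chain.

```java
import java.util.List;
import java.util.stream.Collectors;

public final class GrayRouter {

    /** Minimal view of a service instance: its address plus a gray label. */
    public record Instance(String address, boolean gray) {}

    /** Requests carrying the gray tag go to gray instances, others to stable ones.
     *  Fall back to the full list if nothing matches, to avoid losing traffic. */
    public static List<Instance> route(List<Instance> all, boolean requestIsGray) {
        List<Instance> matched = all.stream()
                .filter(i -> i.gray() == requestIsGray)
                .collect(Collectors.toList());
        return matched.isEmpty() ? all : matched;
    }
}
```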

7. Operating state: service degradation

When an application is handling peak business traffic and discovers that a downstream service provider is hitting a performance bottleneck, or is even about to affect the business, we can degrade some service consumers: unimportant business callers stop making real calls and directly return mock results (or even exceptions), reserving the precious capacity of the downstream provider for important business callers and thereby improving overall service stability. We call this process service degradation.

If a downstream service the application depends on becomes unavailable, service traffic will be lost. By configuring a service degradation capability, traffic can fail fast on the calling side when the downstream service is abnormal, which effectively prevents an avalanche.
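A minimal degradation sketch in plain Java; in practice a framework such as Sentinel or resilience4j would provide this, but the shape is the same: when degradation is switched on, or the real call fails, return a mock result immediately instead of calling downstream.

```java
import java.util.function.Supplier;

public final class Degradation {

    /** Switched on through configuration when the downstream provider is in trouble. */
    private volatile boolean degraded = false;

    public void setDegraded(boolean degraded) {
        this.degraded = degraded;
    }

    /** Call the downstream service, or fail fast with a mock result when degraded. */
    public <T> T call(Supplier<T> downstreamCall, Supplier<T> mockResult) {
        if (degraded) {
            return mockResult.get();      // non-critical caller: no real call at all
        }
        try {
            return downstreamCall.get();
        } catch (RuntimeException e) {
            return mockResult.get();      // downstream abnormal: degrade instead of cascading
        }
    }
}
```

A non-critical caller would then use something like `degradation.call(() -> client.queryRecommendations(userId), () -> "[]")`, where the client and method names are of course hypothetical.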

8. Operating state: automatic outlier removal

Similar to service degradation, automatic outlier removal is the ability to automatically remove a node when a single instance of a service becomes unavailable while traffic is being served. It differs from service degradation mainly in two respects:

1) It is automatic: service degradation is an operations action that has to be configured through a console, with the corresponding service name specified, before it takes effect. The automatic outlier removal capability, by contrast, actively probes whether upstream nodes are alive and removes unhealthy ones from the call link without manual intervention.

2) Removal granularity: service degradation works at the granularity of service plus node IP. Taking Dubbo as an example, a process publishes microservices using the interface name as the service name; if degradation is triggered for one of those services, that service on that node will not be called next time, but the node's other services still will be. Outlier removal, by contrast, means the entire node stops being called.
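An illustrative sketch of node-level outlier removal: track the recent error rate per node and take the whole node out of the candidate list once it crosses a threshold. The threshold, the minimum sample count, and the fact that ejected nodes should later be re-probed and allowed back are all assumptions made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

public final class OutlierEjector {

    private static final double ERROR_RATE_THRESHOLD = 0.5;   // assumed threshold
    private static final int MIN_SAMPLES = 20;                 // avoid judging on too little data

    private static final class Stats { long total; long errors; }

    private final Map<String, Stats> statsByNode = new ConcurrentHashMap<>();

    /** Record the outcome of one call to the given node (the whole node, not one service). */
    public void record(String node, boolean success) {
        Stats s = statsByNode.computeIfAbsent(node, k -> new Stats());
        synchronized (s) {
            s.total++;
            if (!success) s.errors++;
        }
    }

    /** Filter out nodes whose recent error rate exceeds the threshold. */
    public List<String> healthyNodes(List<String> candidates) {
        return candidates.stream()
                .filter(n -> {
                    Stats s = statsByNode.get(n);
                    if (s == null || s.total < MIN_SAMPLES) return true;
                    synchronized (s) {
                        return (double) s.errors / s.total < ERROR_RATE_THRESHOLD;
                    }
                })
                .collect(Collectors.toList());
    }
}
```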

9. High availability: registry disaster recovery

As the core component for service registration and discovery, the registry is an essential part of a microservices architecture. In terms of the CAP model, a registry can sacrifice a little consistency (C), that is, the service addresses seen by different nodes at the same moment are allowed to be briefly inconsistent, but availability (A) must be guaranteed. If the registry becomes unavailable for some reason, the nodes connected to it cannot obtain service addresses, which can have a catastrophic impact on the whole system. Besides the usual high-availability techniques, registries have some disaster recovery measures of their own:

1) Push-empty protection: during data center network jitter, or during a release, there are often batches of transient disconnections, but that does not mean the services are actually unavailable. If the registry recognizes the situation as abnormal (a batch of disconnections, or an empty address list), it should adopt a conservative strategy and refrain from pushing the empty list; otherwise every microservice may report No Provider errors and a large amount of traffic will be lost.

2) Client cache disaster recovery: the same logic applies on the client side. When the client runs into network problems with the registry during an address update, it cannot blindly trust every result the registry returns; it should only update the addresses in memory for results it is sure are normal, and it should be especially cautious when a push would remove the last remaining address. Even after receiving addresses, it cannot trust them completely, because the pushed addresses may be unreachable. A heartbeat keep-alive policy is therefore needed to dynamically adjust the availability of peer services, avoiding traffic loss caused by sending requests straight to unreachable addresses.

3) Local cache disaster recovery: are the registry-side and client-side measures enough? They usually are, as long as nothing is being changed. But what if the registry becomes unavailable exactly while an application is being changed? Then the application's own addresses cannot be registered, and if it depends on other services, its calls will report No Provider, traffic will be lost, and the application may not even be able to start. If we simplify the dependency on the registry down to resolving a service name into a list of addresses, we can persist the resolution results locally as a disaster backup, which effectively avoids traffic loss when the registry is unavailable during a change.
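The client-side pieces, push-empty protection, an in-memory cache, and a local file backup, can be sketched roughly as follows; the backup path and file format are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public final class AddressCache {

    private final Path backupFile;                    // local disaster backup, e.g. /tmp/order-service.addrs
    private volatile List<String> addresses = List.of();

    public AddressCache(Path backupFile) {
        this.backupFile = backupFile;
    }

    /** Called whenever the registry pushes a new address list. */
    public void onPush(List<String> pushed) {
        // push-empty protection: never overwrite a working list with an empty one
        if (pushed == null || pushed.isEmpty()) {
            return;
        }
        addresses = List.copyOf(pushed);
        try {
            Files.write(backupFile, addresses);       // persist as the local disaster backup
        } catch (IOException ignored) {
            // failing to write the backup must not affect normal traffic
        }
    }

    /** Used at startup when the registry itself is unreachable. */
    public List<String> loadFromBackup() throws IOException {
        return Files.exists(backupFile) ? Files.readAllLines(backupFile) : List.of();
    }

    public List<String> current() {
        return addresses;
    }
}
```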

10. High availability: same-city multi-machine-room disaster recovery

The defining characteristic of the same city is that the RT between machine rooms is generally very low (< 3ms), so by default we can build one large LAN across different machine rooms in the same city and spread our applications over several rooms, to cope with the risk of traffic loss when a single room fails. Compared with multi-region active-active, this infrastructure is cheaper to build and requires fewer architectural changes. However, in a microservice system the call links between applications are complex, and as the links grow deeper, governance becomes harder. As shown in the figure below, front-end traffic may end up calling across different machine rooms, causing RT to surge and traffic to be lost.

The key to solving this problem is supporting same-room routing at the service framework level: if the target service has instances in the same machine room as the caller, traffic is preferentially routed to those instances. To achieve this, each provider reports which machine room it is in when it registers, and that room information is pushed to callers as metadata. When routing, the caller customizes the Router capability of the service framework and prefers addresses in the same machine room as itself as the target.
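A rough sketch of the selection logic, assuming each provider registers a room label as metadata; in a real system the extension point would be something like Dubbo's Router SPI, and the fallback to all rooms matters so that traffic is not lost when the local room has no healthy provider.

```java
import java.util.List;
import java.util.stream.Collectors;

public final class SameRoomRouter {

    /** Minimal view of a provider: its address plus the machine room it registered. */
    public record Provider(String address, String room) {}

    private final String localRoom;                   // machine room of the caller, e.g. "room-A"

    public SameRoomRouter(String localRoom) {
        this.localRoom = localRoom;
    }

    /** Prefer providers in the caller's own machine room; if none are available
     *  (for example the local room is down), fall back to all providers. */
    public List<Provider> route(List<Provider> providers) {
        List<Provider> sameRoom = providers.stream()
                .filter(p -> localRoom.equals(p.room()))
                .collect(Collectors.toList());
        return sameRoom.isEmpty() ? providers : sameRoom;
    }
}
```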

Conclusion

This is the second article in the series “How to Build a Lossless-Traffic Online Application Architecture”. The series consists of three articles and aims to use the plainest language possible to classify the technical problems that affect the stability of online application traffic. Some of the solutions to these problems are just code-level details, some require tooling to cooperate, and others require costly solutions. If you want a one-stop “lossless traffic” experience for your applications on the cloud, you can look at Alibaba Cloud's product Enterprise Distributed Application Service (EDAS), which will continue to evolve toward lossless traffic access by default. The next article will explain the topic from the perspective of data exchange and, more importantly, will point out two key points that deserve particular attention.


This article is original content from Alibaba Cloud and may not be reproduced without permission.