Author: Li Zhixin (Alibaba) | Source: Alibaba Tech official account
One: Background
1 Graceful online/offline
In distributed scenarios, microservice processes run as containers under a container scheduling system such as Kubernetes (K8s), where the Pod is the smallest schedulable resource unit. As a service iterates, releasing a new version means that instances of the new version must replace the instances currently serving traffic online.
During steady-state operation, container scheduling is entirely controlled by K8s, while microservice governance is maintained by the service framework or by operations staff. During a new release, or when scaling out or in, old container instances are terminated and replaced by new ones. For a production environment carrying high traffic, any gap in this hand-over produces a burst of failed requests within a short window, triggering alarms and even affecting normal business. For larger companies, the losses caused by problems during a release can be huge.
Hence the demand for graceful online/offline: on top of stable service invocation and traditional service governance capabilities, the service framework must also provide stability guarantees while services go online and offline, reducing operations cost and improving application stability.
2 Desired effect
Ideally, graceful online/offline means that in a distributed system carrying heavy traffic, any component instance can be scaled out, scaled in, or rolling-updated at will, while TPS (requests per second) and RT (request latency) stay stable and no request fails because of instances going online or offline. Furthermore, the system's disaster-recovery capability should keep traffic correctly scheduled when one or more nodes become unavailable, minimizing failed requests.
3 Dubbo-go's graceful online/offline capability
Dubbo-go's exploration of graceful offline dates back three years; basic graceful-offline support has been available since the early 1.5 releases. By listening for termination signals, the framework can perform cleanup such as de-registration and port release, remove traffic from the instance, and ensure that in-flight client requests are answered correctly.
Some time ago, following the official release of Dubbo-go 3.0, I listed in a proposal issue (Dubbo-go issue 1685) [1] several topics that production users care deeply about as candidate directions for the 3.x line, and invited everyone to share their views. The feature users wanted most was lossless online/offline. Thanks again to Wang Xiaowei from the community for his contribution.
Dubbo-go now has this capability; it has been polished and verified in production environments and will ship in an upcoming release.
Two: How Dubbo-go implements graceful online/offline
Graceful online/offline can be viewed from three angles: going online, going offline, and the client's disaster-recovery (DR) policies. Together, these three perspectives ensure that production instances serve no failed requests during normal release iterations.
1 Client load balancing mechanism
The microservice architecture modeled after the Apache top-level project Dubbo will not be described again here. In distributed scenarios, most users rely on the service discovery capabilities of a third-party registry component even inside K8s; for reasons of operations cost, stability, and layered decoupling, native K8s Services are rarely used for service discovery and load balancing except in special cases. These capabilities have therefore become standard in microservice frameworks.
Those familiar with Dubbo will know that it supports multiple load balancing algorithms, integrated into the framework through an extension mechanism. Dubbo-go likewise supports multiple load balancing algorithms in multi-instance scenarios, such as round robin, weighted random, flexible (adaptive) load balancing, and so on.
(Image from the Dubbo website.)
Dubbo-go load balancing component
The Dubbo-go service framework has an interface-level extension mechanism that can load different implementations of the same component interface depending on configuration. One of these is the random load balancing strategy, which is Dubbo-go's default. With this algorithm, a provider is chosen at random according to a weighting policy, so every provider instance is a potential target for the next request.
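To make the idea concrete, here is a minimal Go sketch of weighted random selection over provider instances; the types and names are hypothetical and this is not Dubbo-go's actual implementation.

```go
package loadbalance

import (
	"errors"
	"math/rand"
)

// Invoker is a hypothetical stand-in for a callable downstream instance.
type Invoker struct {
	Address string
	Weight  int
}

// randomSelect picks one invoker with probability proportional to its weight.
// Every provider instance is a potential target on every call, regardless of
// how previous calls turned out.
func randomSelect(invokers []Invoker) (Invoker, error) {
	if len(invokers) == 0 {
		return Invoker{}, errors.New("no providers available")
	}
	total := 0
	for _, inv := range invokers {
		total += inv.Weight
	}
	if total <= 0 {
		// No usable weights: fall back to a plain uniform pick.
		return invokers[rand.Intn(len(invokers))], nil
	}
	point := rand.Intn(total)
	for _, inv := range invokers {
		point -= inv.Weight
		if point < 0 {
			return inv, nil
		}
	}
	return invokers[len(invokers)-1], nil
}
```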
This traditional load balancing algorithm has a hidden risk: the outcome of previous calls has no influence on which downstream instance is selected next. So if some downstream instances are in the offline phase and temporarily unavailable, every request randomly routed to them fails, which in high-traffic scenarios means huge losses.
Cluster Retry Policy
(Image from the Dubbo website.)
Dubbo-go's cluster retry strategy is borrowed from Dubbo. By default, Failover logic is used; there are also Failback and Failfast policies, integrated into the framework through the component extension mechanism.
Both the load balancing and the retry logic above follow an aspect-oriented idea: an abstract invoker wraps the call and passes traffic down layer by layer. The Failover policy adds retry logic on top of load balancing over downstream instances: once a request fails, the next invoker is selected and tried, until the request succeeds or the maximum number of attempts is exceeded.
Cluster retry only increases the number of attempts and lowers the error rate; it remains essentially stateless, and when downstream instances become unavailable it can still have disastrous consequences.
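A rough sketch of what a stateless failover loop looks like, assuming a simple per-instance `call` function; the names are illustrative, not the framework's API.

```go
package cluster

import (
	"errors"
	"fmt"
)

// invoke is a placeholder for an RPC call to a single provider instance.
type invoke func(addr string) error

// failoverInvoke is a minimal sketch of the failover idea: pick an instance,
// and on error pick another one until the call succeeds or the retry budget
// is exhausted. It is stateless: a failed instance can be picked again on
// the very next request.
func failoverInvoke(addrs []string, retries int, call invoke) error {
	if len(addrs) == 0 {
		return errors.New("no providers available")
	}
	var lastErr error
	for i := 0; i <= retries; i++ {
		addr := addrs[i%len(addrs)] // stand-in for a real load-balancing pick
		if err := call(addr); err != nil {
			lastErr = fmt.Errorf("call to %s failed: %w", addr, err)
			continue
		}
		return nil
	}
	return lastErr
}
```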
Blacklist mechanism
The blacklist mechanism was the first task my mentor assigned me during my internship last year. The idea is simple: when a request to an invoker fails, put the IP of the corresponding instance on a blacklist and stop routing traffic to it; after a while, probe it with a request, and remove it from the blacklist if the probe succeeds.
The implementation logic is very simple, but it essentially upgrades the stateless load balancing algorithm to a stateful one. For an unavailable downstream instance, a single failed request quickly blocks it, and subsequent requests see it on the blacklist, so no extra traffic flows to it.
How long an entry stays on the blacklist, the policy for taking an instance off the list, and so on should be decided in the context of the specific scenario. In essence this is a stateful failover policy, and it generalizes well.
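A minimal sketch of such a blacklist, assuming an address-keyed map with a cool-down TTL; the real policy and data structure would be tuned per scenario.

```go
package cluster

import (
	"sync"
	"time"
)

// blacklist is a sketch of the stateful filter described above: an instance
// that has just failed is put on the list and skipped by the load balancer,
// and after a cool-down period it becomes eligible for a probe request again.
type blacklist struct {
	mu      sync.Mutex
	entries map[string]time.Time // addr -> time it was blacklisted
	ttl     time.Duration        // how long an entry stays blocked
}

func newBlacklist(ttl time.Duration) *blacklist {
	return &blacklist{entries: make(map[string]time.Time), ttl: ttl}
}

// Ban records a failing instance so no further traffic is routed to it.
func (b *blacklist) Ban(addr string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.entries[addr] = time.Now()
}

// Blocked reports whether the instance should still be skipped. Once the
// cool-down has passed, the entry is dropped and the instance may be probed.
func (b *blacklist) Blocked(addr string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	banned, ok := b.entries[addr]
	if !ok {
		return false
	}
	if time.Since(banned) > b.ttl {
		delete(b.entries, addr)
		return false
	}
	return true
}
```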
P2C flexible load balancing algorithm
Flexible (adaptive) load balancing is an important feature of the Dubbo3 ecosystem, which the Dubbo-go community is exploring and practicing together with Dubbo. Some readers may have seen it introduced in earlier Dubbo-go 3.0 articles. In short, it is a stateful strategy that, unlike the "one size fits all" blacklist, takes a broader set of variables into account. Based on the P2C (power of two choices) algorithm, it considers per-instance signals such as request latency and machine resource utilization and applies a certain policy to decide which downstream instance is the most suitable. The concrete policies and application scenarios are being explored by interested community members, currently led by Niu Xuewei (github @justxuewei) from ByteDance.
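As an illustration of the P2C idea only (the real strategy weighs more signals, such as latency and machine load), here is a sketch that compares two randomly sampled instances by their in-flight request count; the type and field names are assumptions.

```go
package loadbalance

import "math/rand"

// instanceStats holds the per-instance signals a P2C policy might look at;
// which signals are used (latency, in-flight requests, machine load) is up
// to the concrete strategy.
type instanceStats struct {
	Addr     string
	InFlight int64 // requests currently outstanding to this instance
}

// pickP2C is a minimal "power of two choices" sketch: sample two instances
// at random and keep the one that currently looks less loaded. This keeps
// selection cheap while still reacting to the downstream's state.
func pickP2C(instances []instanceStats) (instanceStats, bool) {
	switch len(instances) {
	case 0:
		return instanceStats{}, false
	case 1:
		return instances[0], true
	}
	a := instances[rand.Intn(len(instances))]
	b := instances[rand.Intn(len(instances))]
	if a.InFlight <= b.InFlight {
		return a, true
	}
	return b, true
}
```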
Many of these load balancing strategies are designed to maximize the chance of hitting a healthy instance from the client's point of view. From the perspective of lossless online/offline, abnormal instances in the release phase can be filtered out by the client through reasonable algorithms and policies, such as the blacklist mechanism.
I think client-side load balancing is a generic capability that is merely icing on the cake for lossless offline scenarios, not the core element. The fundamental fix has to come from the server instance that is going offline.
2 The server goes online gracefully
Compared with the client, the server is the service provider and hosts the user's business logic, so it is more complex in the scenario we are discussing. Before diving in, let's revisit the basic service invocation model.
Traditional service invocation model
Referring to the architecture diagram on the Dubbo website, a service invocation generally involves three components: the registry, the server (provider), and the client (consumer).
- The server first needs to expose the service and listen on ports to be able to accept requests.
- The server registers current service information, such as IP and port, in a centralized registry, such as Nacos.
- The client accesses the registry to obtain the IP address and port of the service to be invoked, and discovers the service.
- Service invocation: the client sends requests to the corresponding IP and port.
These four simple steps are the core focus of Dubbo-go's graceful online/offline strategy. In the normal case the four steps proceed smoothly and logically; in a large production cluster, however, bringing a service online involves many more details.
First we need to understand how errors arise during the online/offline process. There are only two sources to focus on: "a request was sent to an unhealthy instance" and "a process was killed while it was still handling requests". Almost all errors during online/offline come from these two.
Details of graceful service online logic
When a service comes online, it follows the steps above: first expose the service and listen on the port; only after making sure it can serve requests does it register its own information with the registry, so that client traffic can reach its IP. This order must not be reversed, otherwise the instance would receive requests before it is ready and errors would result.
The above is only the simple case. In the real world, a server instance usually contains a set of interdependent clients and servers; in Dubbo ecosystem configuration these are called Services and References.
In an example familiar to business developers: a service method executes some business logic, calls several downstream services, which may be databases, caches, or other service providers, and then returns the result. Mapping this onto Dubbo concepts, a Service listens on a port, accepts requests, and hands them to the application's business code; the business code written by the developer calls downstream services through clients, known as References. There can be several downstream protocols here; we only consider the Dubbo stack.
From this common service model, we can see that a Service depends on its References: only when all References of a Service work properly can the Service correctly serve its upstream. It follows that References should be loaded first; once all References are loaded and the clients are known to be usable, the Service is loaded, the working service is exposed, and the instance is registered with the registry so upstream can call it. If the order is reversed and the Service is ready while its References are not, request errors will result.
Therefore, the service online sequence is: load Consumers (References) -> load Providers (Services) -> register with the registry.
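A sketch of that startup ordering, with hypothetical helper names standing in for the real framework steps.

```go
package bootstrap

// startInstance sketches the ordering argument above: all References
// (clients to downstreams) are built first, the Services (exposed providers)
// are started next, and only then is the instance registered so that
// upstream traffic can reach it.
func startInstance() error {
	// 1. Load Consumers: every Reference this instance depends on must be usable.
	if err := loadReferences(); err != nil {
		return err
	}
	// 2. Load Providers: expose Services and start listening on their ports.
	if err := exposeServices(); err != nil {
		return err
	}
	// 3. Only now publish this instance's IP/port to the registry (e.g. Nacos),
	//    so the first upstream request never arrives before we are ready.
	return registerToRegistry()
}

// The helpers below are placeholders for the real framework steps.
func loadReferences() error     { return nil }
func exposeServices() error     { return nil }
func registerToRegistry() error { return nil }
```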
Some readers may wonder what happens if a Consumer depends on a Provider in the same instance. Dubbo's implementation can make such calls in-process without going over the network, and the Go side can do the same, but that implementation is still in development. The scenario is relatively rare, so we will not dwell on it here.
3 The server goes offline gracefully
There are more points to consider when a service goes offline than when it goes online. Let’s go back to the four steps of the service invocation model mentioned in the previous section:
- The server first needs to expose the service and listen on ports to be able to accept requests.
- The server registers current service information, such as IP and port, in a centralized registry, such as Nacos.
- The client accesses the registry to obtain the IP address and port of the service to be invoked, and discovers the service.
- Service invocation: the client sends requests to the corresponding IP and port.
When a service is about to go offline, it has to clean up after itself. If the process were simply terminated, then on the one hand a large number of TCP connections would break within a fraction of a second, with consequences that depend on the client load balancing strategies discussed in section 1; on the other hand, all requests currently being processed would be forcibly dropped. That is not graceful. So the first thing an instance should do when it learns it is being terminated is to tell its clients, "my service is going away, cut off the traffic." Concretely, it removes its own service information from the registry. Once clients can no longer resolve the instance's IP, they stop sending requests to it; only then is it graceful to terminate the process.
That is still the simple case. In reality the clients may not cut off traffic that quickly, and the server still has plenty of work in its hands. Terminating the process too early is like spilling a basin of water you are still carrying.
With that in mind, let’s talk in detail about the steps to take a service offline.
Graceful offline: usage and triggering
As described above, the process must first learn that it is "about to be terminated" so that the graceful-offline logic can be triggered. This message can be a signal: when K8s terminates a container, the kubelet sends SIGTERM to the process. The Dubbo-go framework presets listeners for a series of termination signals, so that after receiving one the process still controls its own behavior, namely executing the graceful-offline logic.
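A minimal Go sketch of this trigger using the standard os/signal package; the callback name is illustrative.

```go
package shutdown

import (
	"os"
	"os/signal"
	"syscall"
)

// listenForTermination is a minimal sketch of the trigger described above:
// the process subscribes to SIGTERM (sent by the kubelet before a Pod is
// killed) and runs its graceful-offline logic before actually exiting.
func listenForTermination(gracefulShutdown func()) {
	signals := make(chan os.Signal, 1)
	signal.Notify(signals, syscall.SIGTERM, syscall.SIGINT)
	go func() {
		<-signals          // block until a termination signal arrives
		gracefulShutdown() // de-register, drain requests, release ports, ...
		os.Exit(0)
	}()
}
```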
However, some applications listen for SIGTERM themselves to run their own shutdown logic, for example closing DB connections or clearing caches. This matters especially for gateway-type applications at the access layer, where both a web container and an RPC container exist and the order in which they are closed is important. Dubbo-go therefore lets the user control when signal listening starts via the internal.signal configuration, and gracefully close the RPC container at the right moment via graceful_shutdown.beforeShutdown(). Likewise, Dubbo-go lets the user choose in configuration whether to enable signal listening at all.
De-registration
As mentioned above, the server tells clients it is terminating by de-registering through the registry. Common registry middleware such as Nacos, ZooKeeper, and Polaris supports service de-registration and pushes the deletion to upstream clients as an event. Clients must listen to the registry at all times; whether requests keep succeeding depends largely on whether registry messages are received and handled by the client in time.
In the Dubbo-go implementation, once the client receives the delete event it removes the corresponding invoker from its cache, ensuring that subsequent requests are no longer routed to that invoker.
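A simplified sketch of that client-side cache update, assuming an address-keyed invoker map; this is not the framework's actual data structure.

```go
package directory

import "sync"

// invokerCache is a sketch of the client-side view of available providers:
// registry events add and remove entries, and the load balancer only ever
// sees what is currently in the cache.
type invokerCache struct {
	mu       sync.RWMutex
	invokers map[string]struct{} // keyed by provider address; value is a placeholder
}

// onRegistryEvent applies a change pushed by the registry. A delete event
// for a de-registered instance removes its invoker immediately, so no later
// request can be routed to it.
func (c *invokerCache) onRegistryEvent(addr string, deleted bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if deleted {
		delete(c.invokers, addr)
		return
	}
	if c.invokers == nil {
		c.invokers = make(map[string]struct{})
	}
	c.invokers[addr] = struct{}{}
}
```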
The de-registration process, while fast, spans three components and is not guaranteed to be instantaneous. So the next step is to wait for client updates.
Related to the later steps: at this stage only unregister is performed, not unsubscribe, because while graceful offline is executing, this instance's own clients will still issue requests to downstream services. If we unsubscribed now, we could no longer receive downstream updates, which might cause errors.
Waiting for client updates
The server cannot kill the current service immediately after the de-registration step of the graceful-offline logic. Instead, it blocks the offline flow for a short period configured by the developer; by default this should be longer than the time it takes for a de-registration to propagate to the clients' invoker caches.
After this waiting period, the server can assume that no new requests will arrive from clients, and its logic switches to rejecting any new request that still does.
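A minimal sketch of this waiting phase, assuming a configurable wait duration and a flag that a provider-side filter could consult; the names are illustrative.

```go
package shutdown

import (
	"sync/atomic"
	"time"
)

// rejecting flips to 1 once the de-registration wait is over; a provider-side
// filter can check it and refuse any request that still arrives afterwards.
var rejecting int32

// waitForClientUpdate blocks the offline flow for a configurable period,
// long enough for the registry delete event to reach all clients and for
// their invoker caches to be refreshed, then starts rejecting new traffic.
func waitForClientUpdate(wait time.Duration) {
	time.Sleep(wait)
	atomic.StoreInt32(&rejecting, 1)
}
```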
Wait for requests from upstream to complete
We still can't kill the process yet. It's as if you are holding the basin of water: all you have done is turn off the tap, the basin is not empty. So the next step is to wait until every upstream request the current instance is processing has completed.
In a Filter layer, the server maintains a concurrency-safe counter recording how many requests have entered the current instance but have not yet returned. At this point the graceful-offline logic polls the counter; once it drops to zero, we consider that no requests from upstream remain and the "water already in the basin" has been drained.
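A sketch of such a counter and the polling loop, using an atomic counter that an assumed provider-side filter would bump around each invocation; the function names are illustrative.

```go
package filter

import (
	"sync/atomic"
	"time"
)

// activeCount tracks requests that have entered this instance but have not
// yet been answered: incremented when a request arrives, decremented when
// its response is written.
var activeCount int64

// onRequestStart and onRequestEnd would be called by a provider-side filter
// around the actual business invocation.
func onRequestStart() { atomic.AddInt64(&activeCount, 1) }
func onRequestEnd()   { atomic.AddInt64(&activeCount, -1) }

// drainUpstream polls the counter until every in-flight upstream request has
// completed, i.e. the "water already in the basin" has been emptied.
func drainUpstream(pollInterval time.Duration) {
	for atomic.LoadInt64(&activeCount) > 0 {
		time.Sleep(pollInterval)
	}
}
```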
Wait for responses to the instance's own requests
At this point, all requests from upstream along the whole link have been drained. But the number of requests this instance has issued downstream is unknown; there may still be many outstanding requests it has made that have not been answered. If the process were abruptly terminated now, unpredictable problems could occur.
So, similarly to the logic above, the instance maintains a concurrency-safe counter in a client-side filter, which the graceful-offline logic polls until all outgoing requests have returned and the counter drops to zero.
If the instance contains a client that keeps issuing a continuous stream of requests downstream, the counter may never reach zero; in that case the timeout configured for this phase forcibly ends it.
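A sketch of the same polling on the client side, with a timeout so that a never-idle client cannot block shutdown indefinitely; the parameter names are illustrative.

```go
package shutdown

import (
	"sync/atomic"
	"time"
)

// waitForPendingCalls polls a client-side in-flight counter until all
// requests this instance has issued downstream have been answered, or until
// the configured timeout expires, so a client that keeps generating traffic
// cannot block the offline flow forever.
func waitForPendingCalls(pending *int64, timeout, pollInterval time.Duration) {
	deadline := time.Now().Add(timeout)
	for atomic.LoadInt64(pending) > 0 && time.Now().Before(deadline) {
		time.Sleep(pollInterval)
	}
}
```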
Destroy the protocol and free the port
At this point it is safe to do the final work: destroying protocols, closing listeners, releasing ports, and unsubscribing from the registry. Users may want to run some logic of their own after the offline logic finishes and the port is released, so a callback interface is provided for developers.
Three: The effect of graceful online/offline
Based on the above, we ran load tests and simulated online/offline operations in a cluster.
Using one client instance, five proxy instances, and five Provider instances, the request link is:
client -> proxy -> provider
Due to resource constraints, we kept the client at a load of 5,000 TPS and exposed the success rate and error-request count through Dubbo-go's Prometheus-based visualization. We then ran a series of experiments on the proxy instances in the middle of the link and the provider instances downstream: rolling releases, scale-out, scale-in, and instance deletion, to simulate the production release process.
I recorded a lot of data during these experiments, and the contrast is clear.
Without graceful online/offline logic: during updates the success rate drops dramatically, the error count keeps rising, and the client has to be restarted.
With graceful online/offline optimization: no failed requests; the success rate stays at 100%.
Four: Outlook for Dubbo-go's service governance capabilities
Dubbo-go v3.0 was officially released at the end of last year, a little over a month ago now. For us the 3.0 release is not the finish line but a new step toward the future. We are about to release version 3.1, which will carry the graceful online/offline capability.
During the preparation of 3.0 I kept thinking that if a service framework is to move from a traditional design into the future, it has to get there step by step, along several paths: first the most basic user-facing support, configuration refactoring, ease of use, integration testing, and documentation; then the Triple-Go implementation of the Dubbo3 transport protocol, cross-ecosystem, stable, high-performance, extensible, and production-ready; then, after the 3.0 release, service governance, operations, visualization, and stability, including graceful offline, traffic governance, and Proxyless; and finally ecosystem building and cross-ecosystem integration. In this way we can accumulate and iterate step by step.
Strengthening and optimizing operations and service governance capabilities will be an important feature of subsequent versions. We will further improve the traffic governance, routing, Proxyless Service Mesh, and flexible load balancing algorithm mentioned above, which are the focus of the community's work this year.