Releasing and upgrading business services is one of the most likely times for online problems to occur. This article discusses how to take services offline gracefully during the release process under a Serverless architecture.

In this article, Solomon_ Xiao Ge Plays Architecture walks you through how to take services offline more gracefully, using scheduling schemes for north-south and east-west traffic.

Do you encounter the following problems during the release process?

  • A request that is being executed is interrupted during the release.
  • When a downstream service node goes offline, the upstream node keeps invoking it. The calls fail with errors, which leads to service exceptions.
  • Data becomes inconsistent during the release, and the dirty data has to be repaired.

Releases are usually scheduled for 2 or 3 o'clock in the morning, when business traffic is relatively low. How can the problems above be solved, so that the application release process is stable and efficient and the business is not affected?

Scenario analysis

The figure above describes a common scenario in which we develop applications using microservices architecture. Let’s first look at the service invocation relationship in this scenario:

  • Services B and C register themselves with the registry, and services A and B discover the services they need to invoke from the registry.
  • Business traffic reaches service A through the load balancer (SLB). A health check for the instances of service A is configured on the SLB, so when an instance of service A stops, that instance is removed from the SLB. Service A calls service B, which in turn calls service C.

The figure above shows two types of traffic:

  • North-south traffic:
    • Business traffic forwarded to the back-end servers through the SLB, such as the invocation path business traffic -> SLB -> A.
  • East-west traffic:
    • Traffic between services that discover the service to invoke through the registry, such as the invocation path A -> B.

North-south traffic

North-south traffic problems

When service A is released and instance A1 is stopped, the SLB detects through its health check that A1 is offline and removes the instance. Depending on the health check configuration, it takes from several seconds to ten seconds for A1 to be removed from the SLB. If traffic keeps arriving at the SLB during this window, some requests are still routed to instance A1 and fail.

How can service A make sure that traffic passing through the SLB does not produce errors while service A is being released? Let's look at how SAE does this.

Graceful upgrade scheme for north-south traffic

The requests fail because the back-end service instance is stopped before it is removed from the SLB. Can we remove the service instance from the SLB first and then upgrade the instance?

SAE provides a solution based on the capabilities of the K8S Service:

  • When a user binds an SLB to an application through SAE, SAE creates a Service resource in the cluster and associates the application instances with it. The CCM component is responsible for purchasing the SLB and creating the SLB virtual server group. The ENI network card associated with each application instance is added to the virtual server group. Users can then access the application instances through the SLB.
  • When the application is released, CCM removes the ENI of an instance from the virtual server group before the instance is upgraded, so that no traffic is lost.
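The effect of the ordering can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Instance` and `LoadBalancer` classes, the round-robin routing, and the function names are hypothetical stand-ins and do not represent the SLB, CCM, or SAE APIs.

```python
import itertools


class Instance:
    """Hypothetical back-end instance that fails requests once stopped."""

    def __init__(self, name):
        self.name = name
        self.running = True

    def handle(self):
        if not self.running:
            raise ConnectionError(f"{self.name} is stopped")
        return f"ok from {self.name}"


class LoadBalancer:
    """Toy round-robin balancer standing in for the SLB server group."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._counter = itertools.count()

    def remove(self, instance):
        self.backends.remove(instance)

    def route(self):
        backend = self.backends[next(self._counter) % len(self.backends)]
        return backend.handle()


def count_failures(lb, requests):
    """Send a number of requests through the balancer and count failures."""
    failures = 0
    for _ in range(requests):
        try:
            lb.route()
        except ConnectionError:
            failures += 1
    return failures


def upgrade_stop_first(lb, instance, requests_in_window):
    """Stop the instance while it is still registered. The balancer removes
    it only after the simulated health-check window has passed."""
    instance.running = False
    failures = count_failures(lb, requests_in_window)
    lb.remove(instance)
    return failures


def upgrade_remove_first(lb, instance, requests_in_window):
    """Remove the instance from the balancer first, then stop it."""
    lb.remove(instance)
    instance.running = False
    return count_failures(lb, requests_in_window)
```

With two instances and ten requests arriving during the window, the stop-first ordering fails every request routed to the stopped instance, while the remove-first ordering fails none, because the stopped instance no longer receives traffic.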

This is how SAE protects north-south traffic during application upgrades.

East-west traffic

East-west traffic problems

In the traditional release process, the service provider is stopped and then started again. The service consumer becomes aware that a service provider node has stopped through the following steps:

  1. Before the release, the service consumer invokes service providers according to the load balancing rule.
  2. Service provider B needs to release a new version and operates on one of its nodes, starting by stopping the Java process.
  3. Stopping the service leads to either active or passive deregistration. Active deregistration is quasi-real-time; the time taken by passive deregistration depends on the registry and is 1 minute in the worst case.
    • If the application is stopped normally, the shutdown hooks of the Spring Cloud and Dubbo frameworks run as expected, and this step takes negligible time.
    • If the application is stopped abnormally, for example directly with kill -9, or because the Docker image was built so that the Java application is not process 1 and the kill signal is not passed to it, the service provider does not actively deregister the node. The registry removes it passively after a period of time, once the heartbeat times out.
  4. The registry notifies the consumer that one of the service provider nodes has gone offline. There are two modes, push and polling. Push can be considered quasi-real-time; the polling delay depends on the consumer's polling interval and is 1 minute in the worst case.
  5. The service consumer refreshes its service list and learns that the provider node has gone offline. This step does not exist in the Dubbo framework, but Ribbon, the load-balancing component of Spring Cloud, refreshes every 30 seconds by default, so this step takes 30 seconds in the worst case.
  6. The service consumer no longer invokes the node that has gone offline.
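The difference between active and passive deregistration in step 3 can be sketched with a toy registry. Everything here is an assumption made for illustration: the `Registry` class, its method names, and the 60-second heartbeat timeout are hypothetical and do not correspond to the Eureka or Nacos APIs or their defaults.

```python
class Registry:
    """Toy registry that tracks the last heartbeat time of each node."""

    def __init__(self, heartbeat_timeout):
        # Hypothetical timeout in seconds; real registries use their own values.
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat = {}

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now

    def deregister(self, node):
        # Active deregistration: a shutdown hook removes the node right away.
        self.last_heartbeat.pop(node, None)

    def evict_expired(self, now):
        # Passive deregistration: drop nodes whose heartbeat is too old.
        expired = [node for node, seen in self.last_heartbeat.items()
                   if now - seen > self.heartbeat_timeout]
        for node in expired:
            del self.last_heartbeat[node]

    def nodes(self, now):
        self.evict_expired(now)
        return sorted(self.last_heartbeat)


registry = Registry(heartbeat_timeout=60)
registry.heartbeat("B1", now=0)
registry.heartbeat("B2", now=0)

# B1 stops normally: its shutdown hook deregisters it immediately.
registry.deregister("B1")
nodes_after_graceful_stop = registry.nodes(now=1)

# B2 is killed with kill -9 at t=0: no hook runs and heartbeats simply stop.
nodes_while_waiting = registry.nodes(now=30)
nodes_after_timeout = registry.nodes(now=61)
```

In this sketch the gracefully stopped node disappears from the registry at once, while the killed node is still listed at t=30 and is only evicted after the heartbeat timeout has elapsed. Consumers can keep calling the dead node during that interval.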

From step 2 to step 6, the worst case is two minutes with Eureka and 50 seconds with Nacos. Requests can fail during this window, so a release produces various errors that affect the user experience, and requests interrupted mid-execution leave dirty data that has to be repaired afterwards. In the end, every release has to be scheduled for 2 or 3 a.m., which leaves the team anxious and short of sleep.

Graceful upgrade scheme for east-west traffic

In the traditional release process, there is a window of invocation errors on the client because the client does not learn in time that a server instance has gone offline. The traditional process relies on the registry to notify consumers to update their list of service providers. Can service providers notify service consumers directly, bypassing the registry? The answer is yes. We did two things:

  • At release time, the service provider proactively deregisters the application from the registry and marks the application as offline. Deregistration, which used to happen in the process-stop phase, now happens in the prestop phase.
  • When a provider that is marked offline receives a request from a service consumer, it still processes the call normally, and it also notifies the consumer that the node is offline. The consumer immediately deletes the node from its call list and no longer invokes it. This removes the dependence on registry push: the service provider notifies the consumer directly so that the consumer removes the node from its call list.
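The two changes above can be sketched as a small simulation. This is an illustration under assumptions, not the actual SAE, Spring Cloud, or Dubbo mechanism: the `Provider` and `Consumer` classes, the `offline` field in the response, and the round-robin selection are all hypothetical.

```python
class Provider:
    """Hypothetical provider node that can be marked offline before it stops."""

    def __init__(self, name):
        self.name = name
        self.offline = False

    def pre_stop(self):
        # Runs before the process is stopped: mark the node as offline.
        self.offline = True

    def handle(self, payload):
        # The call is still processed normally; the response also tells
        # the consumer whether this node is going offline.
        return {"result": payload.upper(), "offline": self.offline}


class Consumer:
    """Hypothetical consumer that keeps a local call list of providers."""

    def __init__(self, providers):
        self.call_list = list(providers)
        self._next = 0

    def call(self, payload):
        if not self.call_list:
            raise RuntimeError("no provider available")
        provider = self.call_list[self._next % len(self.call_list)]
        self._next += 1
        response = provider.handle(payload)
        if response["offline"]:
            # Remove the node immediately instead of waiting for the registry.
            self.call_list.remove(provider)
        return response["result"]


b1, b2 = Provider("B1"), Provider("B2")
consumer = Consumer([b1, b2])
b1.pre_stop()
first_result = consumer.call("ping")  # served by B1, which is going offline
remaining = [p.name for p in consumer.call_list]
```

The request that reaches the offline node still succeeds, and the same response is what removes the node from the consumer's call list, so no later request is routed to it.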

With this solution, the time to become aware of an offline node drops from the minute level to quasi-real-time, so that an application can go offline without affecting the business.

Batch release and grayscale release

The sections above describe some of SAE's graceful offline capabilities. During an application upgrade, taking instances offline gracefully is not enough on its own. A matching set of release strategies is also needed to make sure the new version of the business works. SAE provides batch release and grayscale release capabilities, which make the release process easier to control.

An application contains 10 application instances. The deployment version of each application instance is Ver.1, and each application instance needs to be upgraded to Ver.

As the figure shows, two instances are released in grayscale first, and the remaining instances are released in batches after the business is confirmed to be working normally. Throughout the release, some instances are always in the running state. During the upgrade, each instance goes through the graceful offline process described above, which ensures that the business is not affected.

Now let's look at batch release. Batch release supports both manual and automatic batching. For the 10 application instances above, assume that all instances are deployed in three batches. The release process under this batch release policy is shown in the figure and is not described in detail here.
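As a rough illustration of how a release plan might be laid out, the sketch below splits instances into an optional grayscale batch followed by a number of regular batches. The function name `plan_release` and the even-split rule are assumptions; the source does not state how SAE sizes its batches, so the actual split may differ.

```python
def plan_release(instances, batches, gray=0):
    """Return a list of batches: an optional grayscale batch first, then the
    remaining instances split as evenly as possible (assumed rule)."""
    instances = list(instances)
    if batches < 1:
        raise ValueError("batches must be at least 1")
    if gray < 0 or gray > len(instances):
        raise ValueError("gray must be between 0 and the number of instances")

    plan = []
    if gray:
        plan.append(instances[:gray])

    rest = instances[gray:]
    size, extra = divmod(len(rest), batches)
    start = 0
    for i in range(batches):
        # The first `extra` batches take one additional instance.
        end = start + size + (1 if i < extra else 0)
        if end > start:
            plan.append(rest[start:end])
        start = end
    return plan


instances = [f"instance-{n}" for n in range(1, 11)]
batch_sizes = [len(b) for b in plan_release(instances, batches=3)]
gray_sizes = [len(b) for b in plan_release(instances, batches=2, gray=2)]
```

Under this assumed rule, 10 instances in three batches are split into groups of 4, 3, and 3, and a grayscale release of 2 instances followed by two batches gives groups of 2, 4, and 4. Each batch would be upgraded only after the previous one is confirmed healthy.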


🏆 issue 7 | all things can be Serverless technology projects