Author: Sanchen | Technical expert on the Alibaba Cloud cloud-native microservices infrastructure team, responsible for the high-availability architecture of the MSE engine

This is the first article in a series on microservices high-availability best practices. The series will continue to be updated; stay tuned.

Introduction

Before we start, let’s share a real case.

A customer ran many of their microservices on Alibaba Cloud in a Kubernetes (K8s) cluster. One day, the network card of one node failed, and the failure eventually made a service unavailable: it could not call its downstream, and the business was impacted. Let's look at how the chain of problems formed:

  1. The failed ECS node happened to be running all the Pods of CoreDNS, the core DNS component of the K8s cluster, and those Pods had not been spread across nodes, so DNS resolution broke for the whole cluster.

  2. The customer's services were using a defective client version (nacos-client 1.4.1). The defect is DNS-related: if a heartbeat request fails to resolve the domain name, the process stops renewing heartbeats until it is restarted.

  3. The defect was a known issue. Alibaba Cloud had pushed an announcement about the serious bug in nacos-client 1.4.1 back in May, but the customer did not see the notice and kept using this version in production.

The risks were chained together, and every link was necessary for the failure to happen.

The final manifestation was that the service could not call its downstream, availability dropped, and the business was impacted. The figure below shows how the client defect became the root cause, step by step:

  1. During heartbeat renewal, a DNS exception occurs on the Provider client;

  2. The heartbeat thread fails to handle the DNS exception correctly, so the thread exits unexpectedly and heartbeats are no longer renewed;

  3. The registry's normal mechanism is to automatically deregister an instance whose heartbeat has not been renewed for 30 seconds. Because CoreDNS affected DNS resolution for the entire K8s cluster, every Provider instance hit the same problem and all instances of the whole service went offline;

  4. The Consumer side receives the pushed empty instance list, can no longer find any downstream, and the upstream that calls it (such as the gateway) starts seeing exceptions.

Looking back at the case as a whole, each individual risk looks like a small-probability event, but once they line up the impact is severe.

This article therefore explores how to design high-availability solutions for microservices, drilling down into service discovery and configuration management.

Microservices high availability solution

First, the fact remains: no system is 100% trouble-free, so high availability architectural solutions are designed for failure (risk).

Risks are everywhere, and although many occur with very small probabilities, they cannot be completely avoided.

What are the risks in microservices?

The list above is only a subset, yet in more than a decade of microservices practice inside Alibaba every one of these problems has been encountered, some more than once. Even so, we have been able to keep the Double 11 shopping festival stable, and that rests on a mature, battle-tested high-availability system.

We can’t completely avoid risk, but we can control it, and that’s the nature of high availability.

What are the strategies for controlling risk?

The registry and the configuration center sit on the core path of a microservices system: pull one hair and the whole body moves. Any jitter in them can significantly affect the stability of the entire system.

Strategy 1: Reduce the scope of risk impact

Cluster high availability

Multiple replicas: deploy instances across at least three nodes.

Multiple availability zones (same-city disaster recovery): deploy the nodes of a cluster across different availability zones (AZs). When a node or an AZ fails, only part of the cluster is affected; if failover is fast and the failed node is automatically taken out of rotation, the impact can be minimized.
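On Kubernetes, one way to realize both points is a Deployment with several replicas plus a topology spread constraint, so the scheduler spreads the Pods across availability zones. The sketch below is illustrative only: the Deployment name, labels, and image are placeholders, and it assumes nodes carry the standard topology.kubernetes.io/zone label.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: registry-cluster              # placeholder name
spec:
  replicas: 3                         # multiple replicas: at least three instances
  selector:
    matchLabels:
      app: registry-cluster
  template:
    metadata:
      labels:
        app: registry-cluster
    spec:
      # Spread replicas evenly across availability zones, so a single
      # node or AZ failure affects only part of the cluster.
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: registry-cluster
      containers:
        - name: registry
          image: registry-image:latest    # placeholder image
          ports:
            - containerPort: 8848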

Reduce upstream and downstream dependencies

Design the system with as few upstream and downstream dependencies as possible; the more dependencies there are, the more likely the whole service (or at least a functional block of it) becomes unavailable when one of them fails. Dependencies that are truly necessary must themselves be built on highly available architectures.

Grayscale changes

When releasing a new version, start the grayscale from the smallest possible scope and expand it gradually, batch by batch, by user and by region. If a problem shows up, it is confined to the current grayscale scope, which reduces the blast radius.

Services can be degraded, rate-limited, and circuit-broken

  • When the registry is under abnormal load, degrade the heartbeat renewal interval, switch off some non-core functions, and so on;

  • Rate-limit abnormal traffic down to within capacity, so that at least part of the traffic continues to be served;

  • On the client side, fall back to the local cache when exceptions occur (push-empty protection is also a form of degradation), temporarily sacrificing consistency of instance-list updates in exchange for availability.

As shown in the figure, the three-node architecture of MSE (Microservices Engine) keeps upstream and downstream dependencies minimal. Multi-node MSE instances are automatically spread across different availability zones by the underlying scheduling capability, forming a multi-replica cluster.

Strategy 2: Shorten how long a risk lasts

The core idea: identify the problem as early as possible and handle it as quickly as possible.

Identify: observability

For example, build monitoring and alerting for instances based on Prometheus.
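As a simple illustration, a Prometheus alerting rule that fires when a registry instance stops being reachable could look like the following. The job label, rule names, and thresholds are assumptions for this sketch, not values shipped by MSE:

groups:
  - name: registry-availability              # hypothetical rule group
    rules:
      - alert: RegistryInstanceDown
        # `up` is the built-in Prometheus metric: 0 means the scrape target is unreachable.
        expr: up{job="nacos-registry"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Registry instance {{ $labels.instance }} unreachable for 1 minute"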

Going further, observability is strengthened at the product level: dashboards, alert convergence and classification (so problems are identified quickly), protection for key customers, and service-level (SLA) construction.

The MSE registry and configuration center currently offer a service level of 99.95% and are moving toward four nines (99.99%).

Handle quickly: emergency response

An emergency-response mechanism must be established: notify the right people quickly and effectively, be able to execute contingency plans fast (note the efficiency gap between console-driven "white screen" operations and manual command-line "black screen" operations), and run fault-response drills regularly.

A contingency plan should be something anyone can execute with confidence, whether or not they are familiar with the system. Behind that sits a well-accumulated body of technical support and depth.

Strategy 3: Reduce the number of times a risk is triggered

Reduce unnecessary changes: improve iteration efficiency instead of releasing casually, and enforce a change freeze around important events and big promotions.

From a probability point of view, no matter how low the probability of a single risk is, if you keep rolling the dice the probability of it occurring at least once approaches 1.
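A quick back-of-the-envelope calculation makes this concrete: if each change independently carries a small incident probability p, then over n changes the probability of hitting at least one incident is 1 - (1 - p)^n. With p = 1% and n = 100 changes, that is 1 - 0.99^100, roughly 63%. Cutting the number of risky changes directly cuts the chance of ever triggering the risk.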

Strategy 4: Reduce the probability of a risk

Upgrade the architecture and improve the design

Nacos 2.0 is not only a performance upgrade but also an architectural upgrade:

  1. The data storage structure is upgraded: fault-tolerance granularity is refined from the Service level down to Instance-level partitioning (bypassing the problem of a whole Service hanging because its Service-level data became inconsistent);

  2. The connection model is upgraded to long-lived connections, reducing dependence on threads, connections, and DNS.

Identify risks in advance

  1. "In advance" means exposing potential risks as much as possible in the design, development, and testing stages;

  2. Use capacity assessment in advance to understand where the capacity risk level sits;

  3. Run regular fault drills to detect upstream and downstream environmental risks in advance and to verify system robustness.

As shown in the figure, Alibaba's big-promotion high-availability system relies on constant load-test drills: verifying the robustness and elasticity of the system, observing and tracking problems, and validating the feasibility of plans such as rate limiting and degradation.

Service discovery high availability solution

Service discovery includes service consumers and service providers.

High availability on the Consumer side

Disaster recovery on the Consumer side relies on push-empty protection and service degradation.

Push-empty protection

Recall the opening case: with push-empty protection, a pushed empty instance list is automatically degraded to the locally cached data.

The service Consumer subscribes to the list of instances of the service Provider from the registry.

If something unexpected happens (for example, an availability zone loses network connectivity and the Provider cannot report heartbeats), or the registry itself hits an unexpected exception (during a configuration change, restart, upgrade, or downgrade), the subscription may go wrong and the availability of service Consumers is affected.

Without push-empty protection

  • The Provider fails to register or renew (for example, because of a network problem or an SDK bug);

  • The registry judges the Provider's heartbeat to have expired and removes the instances;

  • The Consumer receives the empty list; calls fail and the service is interrupted.

With push-empty protection enabled

  • Same as above;

  • The Consumer receives the empty list, push-empty protection takes effect, the change is discarded, and the business service stays available.

How to enable it

Enabling it is straightforward.

It is supported by the open-source client nacos-client in versions 1.4.2 and above.

Configuration items

  • Spring Cloud Alibaba: add the Spring configuration item spring.cloud.nacos.discovery.namingPushEmptyProtection=true

  • Dubbo: add the registry URL parameter namingPushEmptyProtection=true

Push-empty protection relies on the local cache, so the cache directory must be persisted to survive restarts. The directory is ${user.home}/nacos/naming/${namespaceId}.
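For example, in a Spring Cloud Alibaba application the switch could be set in application.yml roughly as follows. This is a minimal sketch based on the configuration item above; the server address is the same placeholder used elsewhere in this article:

spring:
  application:
    name: sc-A
  cloud:
    nacos:
      discovery:
        server-addr: mse-xxx-nacos-ans.mse.aliyuncs.com:8848   # placeholder registry address
        # Enable push-empty protection: an empty instance-list push is ignored
        # and the previously cached list continues to be used.
        namingPushEmptyProtection: true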

Service degradation

On the Consumer side, individual invocation interfaces can be degraded according to different policies, protecting the business request flow (reserving scarce downstream Provider resources for important business) and the availability of key services.

Specific strategies for service degradation include returning Null values, returning exceptions, returning custom JSON data, and custom callbacks.

This high availability capability is available by default in the MSE Microservice Governance Center.

High availability on the Provider side

On the Provider side, the registry and service governance provide capabilities such as disaster-recovery protection, outlier instance removal, and lossless offline to improve availability.

Disaster-recovery protection

Disaster-recovery protection prevents an avalanche in the cluster when traffic turns abnormal.

Let’s look at it in detail:

Without protection (default threshold = 0)

  • A burst of requests pushes the capacity water level up, and some Provider instances fail;

  • The registry removes the failed nodes, so all traffic falls on the remaining nodes;

  • The load on the remaining nodes rises further, and they are very likely to fail as well;

  • Eventually every node fails and the service is 100% unavailable.

With protection enabled (threshold = 0.6)

  • Same as above;

  • Once the proportion of failed nodes reaches the protection threshold, the registry stops removing them and traffic is distributed evenly across all machines;

  • In the end, it is guaranteed that roughly 50% of the nodes can still serve traffic.

Disaster-recovery protection keeps service availability above a floor in an emergency; it can be regarded as the last line of defense for the whole system.

This scheme has saved many business systems.

Outlier instance removal

Heartbeat renewal is the basic way the registry senses instance availability.

However, in some cases a live heartbeat is not the same as an available service.

There are still cases where the heartbeat is normal but the service is not available, for example:

  • The thread pool for Request processing is full

  • A dependent RDS connection is abnormal, or slow SQL is piling up

The Microservices Governance Center provides outlier instance removal:

  • Removal policies based on exception detection: network exceptions, or network exceptions plus service exceptions (HTTP 5xx)

  • Configurable exception threshold, QPS lower limit, and removal-ratio lower limit

Outlier instance removal complements heartbeat checking by measuring service availability from the call-level exception characteristics of specific interfaces.

Lossless offline

Lossless offline is also called graceful offline or smooth offline; they all mean the same thing. Let's first look at what a lossy offline looks like:

During a Provider upgrade, after an instance goes offline it takes time for the registry to notice the expired heartbeat and for the change to propagate. During that window, the Consumer's subscription list has not yet dropped the offline instance, so if the Provider process has already stopped, the traffic still routed to it is lost.

There are many solutions for lossless offline. The least intrusive is the capability the Microservices Governance Center provides by default: it hooks into the release process transparently and runs automatically, sparing you the tedious maintenance of O&M script logic. For comparison, a script-based alternative is sketched below.
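A do-it-yourself version usually wires deregistration into the Kubernetes lifecycle, for example with a preStop hook. The sketch below is illustrative only: the /deregister endpoint, the image, and the 30-second wait are assumptions, and this is exactly the kind of script logic the governance center saves you from maintaining.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: provider-app                   # any Provider deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: provider-app
  template:
    metadata:
      labels:
        app: provider-app
    spec:
      containers:
        - name: provider-app
          image: provider-image:latest     # placeholder image
          ports:
            - containerPort: 8080
          lifecycle:
            preStop:
              exec:
                # Hypothetical endpoint that deregisters this instance from the registry,
                # then wait long enough for Consumers to receive the updated instance list
                # before the process is actually stopped.
                command: ["sh", "-c", "curl -s -X POST http://127.0.0.1:8080/deregister; sleep 30"]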

Configuration management high-availability solution

Configuration management consists of two operations: subscription and publication.

What problems does configuration management solve?

Publishing configuration across multiple environments and many machines, and pushing configuration changes dynamically in real time.

High availability of services based on configuration management

How can microservices do high availability solutions based on configuration management?

Release and environment management

Managing hundreds of machines and multiple environments at once: how to push correctly, how to roll back quickly after a mis-operation or an online problem, and how to run a grayscale release process.

Dynamic pushing of service switches

Switches for features, campaign pages, and so on.

Pushing disaster-recovery and degradation plans

Preset plans can be pushed out to adjust flow-control thresholds and the like in real time; a sketch of such a pushed plan follows.
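As an illustration only, a preset plan pushed through the configuration center might be a single configuration entry whose content carries a flow-control rule. The dataId, group, and field names below are hypothetical, not an MSE or Sentinel schema:

# Hypothetical entry in the configuration center
dataId: gateway-flow-control-plan      # hypothetical dataId
group: DEFAULT_GROUP
content:
  rules:
    - resource: /ip                    # interface to protect
      limitType: qps                   # hypothetical field: limit by QPS
      threshold: 100                   # threshold adjusted in real time when the plan is pushed

Applications subscribed to this entry receive the new threshold immediately and apply it without a restart.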

The figure above shows the overall high-availability solution built on configuration management during a big promotion, for example degrading non-core services, features, logging, and high-risk operations.

High availability on the client side

The configuration management client also has a disaster-recovery mechanism.

The local directory is split into two levels: a higher-priority disaster-recovery (failover) directory and a lower-priority cache directory.

Cache directory: after each exchange with the configuration center, the client saves the latest configuration to the local cache directory. If the server becomes unavailable, the client reads from this cache.

Disaster-recovery directory: when the server is unavailable, you can manually edit the configuration in the local disaster-recovery directory; the client loads that content with priority, simulating the effect of a change pushed from the server.

Simply put, when the configuration center is unavailable the disaster-recovery directory is checked first; if it has nothing, the previously pulled cache is used.

The disaster-recovery directory exists because the cached configuration may not exist, or because in an emergency the configuration needs to be overridden with new content to activate a contingency plan.

The overall idea: whatever goes wrong, the client can always read a workable configuration, so the microservice stays available.

High availability on the server side

On the configuration center server, rate limiting targets read and write traffic, limiting connections and limiting writes:

  • Connection limiting: caps on the maximum number of connections per node and per single client

  • Write limiting: per-second and per-minute limits on publish operations, including for a single specific configuration item

Control operational risk

Control the risk introduced by people publishing configuration.

Configuration publishing operations should be grayscale-capable, traceable, and roll-backable:

Grayscale configuration release

Release history & rollback

Change diff (comparison)

Hands-on practice

Finally, let’s do an exercise together.

The scenario is taken from the case discussed earlier: with push-empty protection turned on, how does the service consumer behave when all of the service provider's machines hit a registration exception?

Experiment architecture and approach

The figure above shows the architecture of this exercise. On the right is a simple call chain: external traffic enters through a gateway; here, the cloud-native gateway from the MSE product family is used.

Downstream of the gateway are three applications, A, B, and C, whose call relationships can be rewired dynamically through configuration management, as we will see later.

Basic idea:

  1. Deploy the services and adjust the call relationship to gateway -> A -> B -> C; check the call success rate at the gateway.

  2. Simulate a network problem by cutting the heartbeat link between application B and the registry, reproducing a registration exception.

  3. Check the gateway call success rate again; the expectation is that, with push-empty protection on, the A -> B link is not affected by the registration exception.

For comparison, application A is deployed in two versions: one with push-empty protection enabled and one without. The desired result is that, with the switch turned on, application A can still find application B when the exception occurs.

Since gateway traffic is split between the two versions of application A, the observed interface success rate should settle at roughly 50%.

Getting started

Let's get our hands dirty. Here I use the Alibaba Cloud MSE + ACK combination to build the complete setup.

Preparing the environment

First, purchase an MSE registry/configuration center (Professional edition) and an MSE cloud-native gateway. The purchase process itself is not covered here.

Before deploying the applications, prepare the configuration in advance: initially set the downstream of A to C, and the downstream of B to C as well.
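To make this concrete, the prepared entries in the configuration center could look roughly like the following. The dataId names and the downstream key are hypothetical, since the demo application's real configuration keys are not shown in this article:

# Hypothetical configuration entries for wiring the call relationships
- dataId: sc-A.properties
  group: DEFAULT_GROUP
  content: "downstream=sc-C"    # A initially calls C
- dataId: sc-B.properties
  group: DEFAULT_GROUP
  content: "downstream=sc-C"    # B calls C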

Deploying the applications

Next we deploy the three applications on ACK. As the configuration below shows, the spring-cloud-a-b version of application A has the push-empty protection switch turned on.

The nacos-client version used in this demo is 1.4.2, because push-empty protection is only supported from that version onward.

Configuration for reference (not directly usable as-is):

# Application A, version with push-empty protection enabled (spring-cloud-a-b)
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-a
  name: spring-cloud-a-b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-a
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-a
      labels:
        app: spring-cloud-a
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: spring.cloud.nacos.discovery.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.cloud.nacos.config.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.cloud.nacos.discovery.metadata.version
              value: base
            - name: spring.application.name
              value: sc-A
            - name: spring.cloud.nacos.discovery.namingPushEmptyProtection
              value: "true"
          image: mse-demo/demo:1.4.2
          imagePullPolicy: Always
          name: spring-cloud-a
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
---
# Application A, base version (push-empty protection not enabled)
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-a
  name: spring-cloud-a
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-a
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-a
      labels:
        app: spring-cloud-a
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: spring.cloud.nacos.discovery.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.cloud.nacos.config.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.cloud.nacos.discovery.metadata.version
              value: base
            - name: spring.application.name
              value: sc-A
          image: mse-demo/demo:1.4.2
          imagePullPolicy: Always
          name: spring-cloud-a
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
---
# Application B
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-b
  name: spring-cloud-b
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-b
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-b
      labels:
        app: spring-cloud-b
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: spring.cloud.nacos.discovery.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.cloud.nacos.config.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.application.name
              value: sc-B
          image: mse-demo/demo:1.4.2
          imagePullPolicy: Always
          name: spring-cloud-b
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
---
# Application C, base version
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: spring-cloud-c
  name: spring-cloud-c
spec:
  replicas: 2
  selector:
    matchLabels:
      app: spring-cloud-c
  template:
    metadata:
      annotations:
        msePilotCreateAppName: spring-cloud-c
      labels:
        app: spring-cloud-c
    spec:
      containers:
        - env:
            - name: LANG
              value: C.UTF-8
            - name: spring.cloud.nacos.discovery.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.cloud.nacos.config.server-addr
              value: mse-xxx-nacos-ans.mse.aliyuncs.com:8848
            - name: spring.application.name
              value: sc-C
          image: mse-demo/demo:1.4.2
          imagePullPolicy: Always
          name: spring-cloud-c
          ports:
            - containerPort: 8080
              protocol: TCP
          resources:
            requests:
              cpu: 250m
              memory: 512Mi

Deploying the application:

Register the service at the gateway

After the applications are deployed, associate the MSE registry with the MSE cloud-native gateway and register the service.

We designed the gateway to call only A, so only A needs to be added and registered.

Verify and adjust the call link

Verify the link using the curl command:

$ curl http://${gateway IP}/ip
sc-A[192.168.1.194] --> sc-C[192.168.1.195]

As you can see, A is currently calling C. Next we make a configuration change to switch A's downstream to B in real time.

Checking again, the call relationship of the three applications is now A -> B -> C, which matches our plan:

$ curl http://${gateway IP}/ip
sc-A[192.168.1.194] --> sc-B[192.168.1.191] --> sc-C[192.168.1.180]

Next, we call the interface continuously with a one-line command to simulate uninterrupted business traffic in a real scenario:

$ while true; do sleep .1; curl -s -o /dev/null http://${gateway IP}/ip; done

Observing the calls

The call success rate can be observed on the gateway's monitoring dashboard.

Fault injection

All is well, now we can start injecting the fault.

Here, we can use the K8s NetworkPolicy mechanism to simulate egress network exceptions.

kind: NetworkPolicy
apiVersion: networking.k8s.io/v1
metadata:
  name: block-registry-from-b
spec:
  podSelector:
    matchLabels:
      app: spring-cloud-b
  ingress:
    - {}
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 8080

Allowing only port 8080 as egress means calls to downstream application ports on the internal network are unaffected, while all other egress traffic is blocked (for example, traffic to the registry on port 8848 is cut off). Here, B's downstream is C.

After the network is cut, heartbeats to the registry are no longer renewed, and after 30 seconds all IP addresses of application B are removed from the registry.

Check again

Looking at the dashboard, the success rate starts to drop, and application B's IP addresses are no longer visible on the registry console.

Back on the dashboard, the success rate stops falling and levels off at around 50%.

Summary

In this exercise we simulated a real risk scenario and, through the client-side high-availability scheme (push-empty protection), successfully contained the risk and kept service calls from failing.