One of the benefits of splitting a monolithic application into microservices is that a failure stays confined to the affected microservice instead of bringing down the entire application. Still, failures do happen in a microservice system, so how should they be handled?

1 Cluster Faults

If buggy code is deployed, the entire cluster may become unable to provide service to the outside.

1.1 Fault Causes

  • A code bug

For example, a memory bug that causes OOM (out-of-memory) crashes

  • A sudden traffic surge that exceeds the system's maximum capacity

For example, during a flash-sale (seckill) event, a huge amount of traffic floods into an e-commerce system the moment the sale opens, exceeding the system's maximum carrying capacity and bringing the whole system down at once

1.2 Solutions

1.2.1 Rate Limiting

The amount of traffic a system can carry is essentially fixed by the size of the cluster; this can be called the system's maximum capacity. When actual traffic exceeds that capacity, the system responds slowly and a large number of service invocations time out, so users experience sluggishness or no response at all. The solution is to set a threshold based on the system's maximum capacity and automatically discard requests that exceed it, so that the system can continue to provide normal service as far as possible.

A microservice system usually exposes multiple services at the same time, and at any given moment the request volume for each service differs. A sudden spike in requests for one service can easily consume most of the system's resources, leaving nothing for the other services. Therefore a threshold should also be set per service, and requests beyond that threshold automatically discarded, so that no single service can drag down all the others.

In a real project, two metrics can be used to measure the volume of service requests:

  • QPS (requests per second)
  • Number of worker threads

QPS is problematic because different services respond at different speeds, so the QPS a system can sustain varies widely from service to service. The number of worker threads is therefore generally chosen as the rate-limiting metric: a maximum total number of worker threads is set for the system, plus a maximum number of worker threads per service. Whether the overall request volume pushes the total worker-thread count to its limit, or the requests for one service exceed that service's per-service limit, traffic is rejected so that the system as a whole stays protected.
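To make the idea concrete, below is a minimal sketch of worker-thread-based limiting using java.util.concurrent.Semaphore. The class name, the global/per-service split and the thresholds are all illustrative, not taken from any particular framework:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Minimal sketch: limit in-flight requests (worker threads) both globally
// and per service. All limits here are illustrative.
public class WorkerThreadLimiter {
    private final Semaphore globalPermits;                       // max worker threads for the whole system
    private final Map<String, Semaphore> servicePermits = new ConcurrentHashMap<>();
    private final int perServiceMax;

    public WorkerThreadLimiter(int globalMax, int perServiceMax) {
        this.globalPermits = new Semaphore(globalMax);
        this.perServiceMax = perServiceMax;
    }

    /** Try to admit a request for the given service; returns false if it should be discarded. */
    public boolean tryAcquire(String service) {
        Semaphore s = servicePermits.computeIfAbsent(service, k -> new Semaphore(perServiceMax));
        if (!s.tryAcquire()) {
            return false;                 // this service already uses its maximum worker threads
        }
        if (!globalPermits.tryAcquire()) {
            s.release();                  // the system as a whole is saturated
            return false;
        }
        return true;
    }

    /** Must be called when the request finishes, successful or not. */
    public void release(String service) {
        servicePermits.get(service).release();
        globalPermits.release();
    }
}
```

A request handler would call tryAcquire(service) before running the business logic and release(service) in a finally block; rejected requests return an error immediately instead of queuing up and dragging the system down.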

1.2.2 Degradation

Degradation means switching off certain functions of the system in order to preserve its overall availability. It is a passive defense measure, because it is usually a stop-loss action taken after a failure has already occurred.

How to implement

With switches.

Reserve an area of memory in the running service dedicated to storing the state of each switch, that is, whether it is on or off. The service also listens on a port through which commands can be sent to change a switch's state in memory. When a switch is turned on, the piece of business logic it guards is skipped; under normal conditions the switch stays off.
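A minimal sketch of such a switch store follows, assuming the JDK's built-in com.sun.net.httpserver is used as the listening port and that switches are addressed by name; both choices are illustrative:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of an in-memory degradation-switch store. The HTTP endpoint
// and switch names are illustrative; a real system might use a config center instead.
public class DegradationSwitches {
    private static final Map<String, Boolean> SWITCHES = new ConcurrentHashMap<>();

    /** Business code checks this before running the logic guarded by the switch. */
    public static boolean isOn(String name) {
        return SWITCHES.getOrDefault(name, false);   // switches are off by default
    }

    /** Listen on a port so operators can flip switches at runtime, e.g. GET /switch?name=newOrderLogic&on=true */
    public static void startListener(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/switch", exchange -> {
            String query = exchange.getRequestURI().getQuery();   // e.g. "name=newOrderLogic&on=true"
            if (query != null) {
                String name = null;
                boolean on = false;
                for (String kv : query.split("&")) {
                    String[] parts = kv.split("=", 2);
                    if (parts.length == 2 && parts[0].equals("name")) name = parts[1];
                    if (parts.length == 2 && parts[0].equals("on"))   on = Boolean.parseBoolean(parts[1]);
                }
                if (name != null) SWITCHES.put(name, on);
            }
            byte[] body = "ok".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        server.start();
    }
}
```

Business code then guards the risky path with something like if (!DegradationSwitches.isOn("newOrderLogic")) { ... } (the switch name is made up), so flipping the switch on skips that path at runtime.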

Switches are generally used in two ways:

  • Newly added business logic

New business logic is relatively immature and often carries some risk, so a switch is added to control whether the new logic is executed

  • Dependent services or resources

A dependent service or resource is not always reliable, so it is a good idea to have a switch that controls whether the dependency is called at all; if the dependency fails, its impact can be avoided through degradation.

In practice, degradations are graded according to their impact on the business (a sketch of such grading follows the list):

  • A level-1 degradation has the least impact on the business and is performed first when a fault occurs; because its impact is small, it can also be configured to trigger automatically, without manual intervention
  • A level-2 degradation has a noticeable impact on the business; if the fault persists after the level-1 degradation, the level-2 degradation can be performed manually
  • A level-3 degradation has a significant impact on the business, whether on revenue or on user experience, so it must be carried out with great caution and used only as a last resort.
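One possible way to encode these levels is sketched below, with made-up switch names and the assumption that applying a higher level also applies all lower levels:

```java
import java.util.List;
import java.util.Map;

// Illustrative mapping from degradation level to the switches it turns on.
// Level ONE can be triggered automatically; TWO and THREE need a human decision.
public class GradedDegradation {
    enum Level { ONE, TWO, THREE }

    private static final Map<Level, List<String>> PLAN = Map.of(
            Level.ONE,   List.of("recommendationFeed"),            // least business impact
            Level.TWO,   List.of("productReviews", "searchSuggest"),
            Level.THREE, List.of("placeOrderHardLimit"));          // hurts revenue, last resort

    /** Turn on every switch at the given level and all lower levels. */
    public static void degradeTo(Level level) {
        for (Level l : Level.values()) {
            if (l.compareTo(level) <= 0) {
                PLAN.get(l).forEach(name -> setSwitch(name, true));
            }
        }
    }

    private static void setSwitch(String name, boolean on) {
        // In this sketch this would call a switch store like the one shown earlier.
        System.out.println("switch " + name + " -> " + on);
    }
}
```

An automated monitor would only ever call degradeTo(Level.ONE); levels TWO and THREE would be invoked by an operator.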

2 Single-IDC Faults

To keep service availability high, services are deployed across more than one IDC. An entire IDC going off the network does happen, mostly because of force majeure such as a fire in the machine room or a fiber-optic cable being dug up.

If all services are deployed in a single IDC, a fault there makes them completely inaccessible, which is why multiple IDCs are used. Some companies use same-city active-active, deploying in two IDCs within one city; others use remote multi-active, generally deploying in two IDCs in two different cities. A financial-grade application such as Alipay uses a "three sites, five centers" deployment, which costs noticeably more than two IDCs but provides a correspondingly higher availability guarantee.

When one IDC fails, its traffic can be switched to a healthy IDC so that services remain accessible.

Traffic switching schemes

Traffic switching based on DNS resolution

This works by switching the VIPs that a domain name resolves to from one IDC to another. For example, when users visit "www.baidu.com", under normal circumstances users in the north are resolved to the VIP of the China Unicom machine room and users in the south to the VIP of the China Telecom machine room. If the China Unicom machine room fails, northern users are also resolved to the China Telecom VIP, although network latency may then be higher.
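The routing decision behind such DNS-based switching can be modelled roughly as below. The region names, IDC names and VIPs are invented, and in reality this logic lives in the DNS/GSLB provider rather than in application code:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of geo-based DNS resolution with IDC failover.
// Regions, IDC names and VIPs are all made up for illustration.
public class GeoDnsResolver {
    private final Map<String, String> regionToIdc = Map.of(
            "north", "unicom-idc",
            "south", "telecom-idc");
    private final Map<String, String> idcToVip = Map.of(
            "unicom-idc",  "10.0.1.100",
            "telecom-idc", "10.0.2.100");
    private final Set<String> downIdcs = ConcurrentHashMap.newKeySet();

    /** Resolve a user's region to a VIP, falling back to any healthy IDC if the preferred one is down. */
    public String resolve(String region) {
        String preferred = regionToIdc.getOrDefault(region, "telecom-idc");
        if (!downIdcs.contains(preferred)) {
            return idcToVip.get(preferred);
        }
        return idcToVip.entrySet().stream()
                .filter(e -> !downIdcs.contains(e.getKey()))
                .map(Map.Entry::getValue)
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("all IDCs are down"));
    }

    /** Operators mark an IDC as failed to switch its traffic away. */
    public void markDown(String idc) { downIdcs.add(idc); }
}
```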

Traffic switching based on RPC grouping

If a service is deployed across multiple IDCs, the instances in each IDC form a group. When one IDC fails, a command can switch the traffic originally routed to that group over to the other groups, thereby moving the faulty IDC's traffic away.
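A rough sketch of group-based routing on the RPC client side, assuming each IDC corresponds to one group of provider addresses and that a "switch traffic" command simply rewrites a redirect table; all names are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Sketch: the RPC client keeps one address group per IDC plus a redirect table.
// Switching traffic away from a failed IDC is just an update to that table.
public class GroupRouter {
    private final Map<String, List<String>> groups = new ConcurrentHashMap<>();   // group (IDC) -> provider addresses
    private final Map<String, String> redirects = new ConcurrentHashMap<>();      // failed group -> healthy group

    public void registerGroup(String group, List<String> addresses) {
        groups.put(group, List.copyOf(addresses));
    }

    /** The "switch traffic" command: send everything meant for `from` to `to` instead. */
    public void switchTraffic(String from, String to) {
        redirects.put(from, to);
    }

    /** Pick a provider address for the group the caller would normally use. */
    public String select(String preferredGroup) {
        String effective = redirects.getOrDefault(preferredGroup, preferredGroup);
        List<String> addrs = groups.get(effective);
        return addrs.get(ThreadLocalRandom.current().nextInt(addrs.size()));
    }
}
```

When the group "idc-a" fails, an operator issues switchTraffic("idc-a", "idc-b") (hypothetical group names) and all calls that would have gone to idc-a's instances are routed to idc-b instead.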

3 Single-node Faults

The failure of a few machines in a cluster usually has little effect on the overall service, but every request routed to a failed machine will fail, which lowers the success rate of the whole system.

Single-node faults are the most frequent kind of failure, especially for large Internet applications, where clusters of tens of thousands of machines are common and the probability that some machine is failing at any moment is high. Handling this purely through manual operations is clearly not feasible, so some means of dealing with single-node failures automatically is needed.

An effective way to handle a single-node failure is an automatic restart. Set a threshold, for example on the average response time of an interface; when monitoring shows that the average time of an interface on one machine exceeds the threshold, that machine is considered faulty, removed from the online cluster, restarted, and then added back to the cluster.

Care must be taken that interface timeouts caused by transient network jitter do not trigger automatic restarts. One method is to collect several response-time samples per machine, for example one sample every 10 seconds with five samples per window; only when more than three of the five samples exceed the threshold is the node considered genuinely faulty and the automatic restart policy triggered.

To prevent special situations in which too many machines are restarted in a short period, leaving too few available nodes in the service pool, it is best to cap the fraction of the cluster that may be restarted at once, usually at no more than 10%, since under normal circumstances it is unlikely that more than 10% of the machines fail at the same time.
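Putting the two safeguards together, here is a sketch under the assumptions above (five samples per window, more than three bad samples to declare a fault, at most 10% of the cluster restarting at once); the thresholds and method names are illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of automatic single-node handling: a node is considered faulty only if
// more than three of its last five latency samples exceed the threshold, and at
// most 10% of the cluster may be pulled out and restarted at the same time.
public class SingleNodeGuard {
    private static final int WINDOW = 5;                // samples, e.g. one every 10 seconds
    private static final double MAX_RESTART_RATIO = 0.10;

    private final long latencyThresholdMs;
    private final int clusterSize;
    private final Map<String, Deque<Long>> samples = new HashMap<>();
    private final Set<String> restarting = ConcurrentHashMap.newKeySet();

    public SingleNodeGuard(long latencyThresholdMs, int clusterSize) {
        this.latencyThresholdMs = latencyThresholdMs;
        this.clusterSize = clusterSize;
    }

    /** Feed one average-latency sample for a node; returns true if a restart was triggered. */
    public synchronized boolean report(String node, long avgLatencyMs) {
        Deque<Long> window = samples.computeIfAbsent(node, k -> new ArrayDeque<>());
        window.addLast(avgLatencyMs);
        if (window.size() > WINDOW) window.removeFirst();
        long bad = window.stream().filter(ms -> ms > latencyThresholdMs).count();
        if (window.size() == WINDOW && bad > 3 && canRestart()) {   // more than three of five samples over threshold
            restarting.add(node);
            // here: take the node out of the serving cluster, restart it, then re-register it
            return true;
        }
        return false;
    }

    private boolean canRestart() {
        return restarting.size() + 1 <= clusterSize * MAX_RESTART_RATIO;
    }

    public void restartFinished(String node) { restarting.remove(node); }
}
```

report would be fed by the monitoring system; the actual take-out-of-cluster and restart steps are left as a comment because they depend on the deployment platform.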

Conclusion

An actual failure is often handled with several of these means combined. Take a single-IDC fault as an example: the first step is to quickly switch traffic to the healthy IDC, but that IDC may not have enough capacity to carry the traffic of both IDCs, so some functions should be degraded first to make sure the healthy IDC can absorb the switched-over traffic smoothly.

Also, automate fault handling as much as possible, because that greatly reduces the impact time. Once human intervention is required, fault handling usually takes minutes, which is a huge impact for businesses that are sensitive to user experience. If fault handling can be automated, the time can be reduced to around a minute or even to seconds, minimizing the disruption to users.