God knows how many cards I played to make microservices highly available

High availability is not a single, all-in-one solution. It is made up of many links in a chain, and God knows how many cards I had to play to connect those links into a high availability solution for the whole system.

What is high availability

Before defining high availability, it helps to define unavailability. The content of a website passes through many links before it is finally presented to the user, and a failure in any one of them may make the page inaccessible. That is what it means for the site to be unavailable.

How does Wikipedia define high availability

The ability of a system to perform its functions without interruption, representing the degree of availability of the system, is one of the criteria for system design.

The difficulty, and the key point, is “no interruption”: providing service 7 x 24 hours with no interruption and no exceptions.

Why high availability

A system that serves external users is a combination of hardware and software. Software has bugs, hardware gradually ages, the network is always unstable, and software keeps growing larger and more complex. Beyond the fact that hardware and software cannot inherently run “without interruption”, the external environment can also interrupt service: power outages, earthquakes, fires, or an optical fiber cut during excavation. The impact of these events may be even greater.

Dimensions for evaluating high availability

Evaluating high availability by time

The industry has a well-known metric for evaluating website availability. It is usually quantified as N nines, which maps directly to the percentage of website uptime.

Description	Number of nines	Availability level	Annual downtime
Basic availability	2 nines	99%	87.6 hours
Relatively high availability	3 nines	99.9%	8.8 hours
Availability with automatic fault recovery	4 nines	99.99%	53 minutes
Extremely high availability	5 nines	99.999%	5 minutes
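The downtime column follows directly from the availability percentage. As a quick check, here is a minimal sketch that assumes a 365-day year; the function name annual_downtime_hours is made up for this illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(nines: int) -> float:
    """Return the downtime allowed per year, in hours, for N nines."""
    unavailability = 10 ** (-nines)  # e.g. 3 nines -> 0.001
    return HOURS_PER_YEAR * unavailability


for n in range(2, 6):
    hours = annual_downtime_hours(n)
    print(f"{n} nines: {hours:.2f} hours ({hours * 60:.1f} minutes) per year")
```

The figures in the table are these values rounded.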

An Internet company I used to work for defined availability by this metric, but we ran into problems in practice. For example, some service upgrades or data migrations could clearly have been done by stopping or suspending the service in the middle of the night. However, because our reports had to show how many nines our system achieved, we gave up simple solutions that required downtime: a two-hour outage, for instance, means you can never reach four nines. On the other hand, in high-concurrency scenarios such as flash sales or group buying, an outage of only a few minutes may have a very significant impact on the whole company’s business, and the orders lost in a single minute may be a large number. So quantifying availability with N nines really has to be weighed against the business.

Microservices high availability design approach

High availability is a complex proposition. Almost every part of request processing is involved, so designing a high availability solution touches many aspects, and all kinds of details surface along the way. We therefore need a top-level design for microservice high availability. Centering on the high availability of services, let’s check how many cards we have in our hand.

Service redundancy

Redundancy strategy

Each request may be handled by multiple services, and every machine and every service may fail. The first thing to consider, therefore, is that each service must have more than one instance. Keeping multiple identical instances of a service is what we call redundancy. “Service” here covers the machine, the container, and the service process itself.

At the machine level, consider whether the redundant machines are physically isolated from each other: for example, whether the machines are deployed in different machine rooms; if they are in the same machine room, whether they are in different cabinets; and if the services run in Docker containers, whether the containers are deployed on different physical machines. The strategy actually depends on the business the service supports, so services need to be graded and given different strategies. Different strategies provide different levels of safety at different costs. Services with a higher safety level may need to consider not only different machine rooms but also the region each machine room is in; for example, two machine rooms should not be in the same earthquake zone.

Statelessness

Service redundancy requires that we can scale a service out or in at any time, for example going from two machines to three. To scale a service anytime and anywhere, the service must be stateless. A stateless service is one in which every instance holds the same content and data.

For example, our microservices architecture is divided horizontally into several layers, and because every layer is stateless, scaling at any layer is very easy.

Suppose we need to scale out the gateway: we just add more instances, without worrying about whether the gateway stores some extra piece of data.

The gateway does not store any session data and holds nothing that could cause consistency problems. Data that could become inconsistent is stored in middleware that is better at data synchronization.

This is the mainstream approach at present: the service itself provides only logic as far as possible, and data consistency is handled centrally. In this way, the “state” is extracted and the gateway stays “stateless”.

The gateway is just one example. Basically, all services in a microservices architecture should follow this idea. If a service holds state, the state should be extracted and handed to components that are better at handling data, rather than making the microservice itself accommodate stateful data.
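To make the idea concrete, here is a minimal sketch of a stateless gateway. It is only an illustration: SessionStore is an in-memory stand-in for whatever external middleware holds the state, and all class, method, and instance names are hypothetical.

```python
class SessionStore:
    """Stand-in for external middleware that owns the state."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, session_id: str) -> dict:
        return self._data.setdefault(session_id, {})


class Gateway:
    """A stateless gateway instance: it holds no per-user data of its own."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store  # shared dependency, not local state

    def handle(self, session_id: str, item: str) -> list:
        session = self.store.get(session_id)
        cart = session.setdefault("cart", [])
        cart.append(item)
        return list(cart)


store = SessionStore()
gateways = [Gateway("gw-1", store), Gateway("gw-2", store)]

# Requests from the same user may land on any instance.
gateways[0].handle("user-42", "book")
gateways[1].handle("user-42", "pen")

# Scaling out is just adding another instance; no data has to be copied to it.
gateways.append(Gateway("gw-3", store))
result = gateways[2].handle("user-42", "lamp")
print(result)
```

Because every instance reads and writes through the shared store, any instance can serve any request, which is what makes adding or removing instances safe.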

High availability of data storage

The service redundancy described above can be thought of as compute high availability, which only requires stateless scaling out and in. For systems that need to store data, however, the data itself is stateful.

Compared with compute, storage has one essential difference:

Moving data from one machine to another requires transmission over the network

The network is unstable, especially across machine rooms, where ping latency may be tens or hundreds of milliseconds. Although milliseconds may seem insignificant to a human, for a high availability system the difference is essential: it means that at some points in time the data across the system is necessarily inconsistent. By the formula “data + logic = service”, inconsistent data combined with consistent logic leads to inconsistent service behavior.

Whether caused by normal transmission delay or by an abnormal transmission interruption, data will be inconsistent across the system at some point in time, and inconsistent data may cause service problems. Yet without data redundancy, the high availability of the system cannot be guaranteed.

Therefore, the difficulty of storage high availability is not how to back up data, but how to reduce or avoid the impact of data inconsistency on the business.

The distributed systems field has the famous CAP theorem, which demonstrates in theory the complexity of storage high availability. It says that a highly available storage system cannot satisfy consistency, availability, and partition tolerance all at once; it can satisfy at most two of the three. Since partition tolerance is a must in distributed systems, we have to make a business trade-off between consistency and availability when designing the architecture.

The essence of a storage high availability solution is to replicate data to multiple storage devices and achieve high availability through data redundancy. Its complexity lies mainly in the data inconsistency caused by delayed or interrupted replication. When designing a storage architecture, we must consider the following aspects:

How is data replicated
What are the responsibilities of each node in the architecture
How to handle delayed data replication
How to ensure high availability when nodes in the architecture fail

Primary/secondary data replication

Master-slave replication is the most common and simplest storage high availability solution; MySQL and Redis, for example, both support it.

The advantage of this architecture is its simplicity: the master handles both writes and reads, while the slaves handle only reads. When read concurrency is high, more slaves can be added to relieve the pressure. When the master fails, the slaves can still serve reads, so the read service keeps running.

The disadvantage is that clients must be aware of the master-slave relationship and send different operations to different machines. In addition, because the slaves serve reads, a long replication delay between master and slave may cause clients to read inconsistent data.
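Both drawbacks can be seen in a small sketch. This is an in-memory illustration, not the replication protocol of MySQL or Redis: Node, MasterSlaveClient, and the explicit replicate() step are hypothetical stand-ins for real database nodes and asynchronous replication.

```python
import random


class Node:
    """In-memory stand-in for a database node."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.rows = {}


class MasterSlaveClient:
    """Client that knows the topology: writes go to the master, reads to slaves."""

    def __init__(self, master: Node, slaves: list) -> None:
        self.master = master
        self.slaves = slaves

    def write(self, key: str, value: str) -> None:
        self.master.rows[key] = value  # replication to the slaves happens later

    def read(self, key: str):
        node = random.choice(self.slaves)
        return node.rows.get(key)

    def replicate(self) -> None:
        """Simulate the (normally asynchronous) copy from master to slaves."""
        for slave in self.slaves:
            slave.rows.update(self.master.rows)


client = MasterSlaveClient(Node("master"), [Node("slave-1"), Node("slave-2")])
client.write("order:1", "paid")
stale = client.read("order:1")  # replication has not run yet: the row is missing
client.replicate()
fresh = client.read("order:1")  # after replication the slaves return the row
print(stale, fresh)
```

Between the write and the replication step, a read from a slave does not see the new row, which is the inconsistency window described above.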

Data primary/secondary switchover

Plain master-slave replication leaves two problems: 1. When the master fails, writes cannot be performed. 2. One of the slaves has to be promoted to master manually.

To solve these two problems, we can design an automatic master/slave switchover scheme, which has to address master status detection, the switchover decision, and data loss and conflicts.

1. Master status detection

Multiple checkpoints are needed: whether the master machine is healthy, whether the process exists, whether requests time out, whether writes fail, and whether reads fail. The results are aggregated and handed to the switchover decision.

2. Switchover decision

Decide when to switch, that is, when a slave should be promoted to master: when the process no longer exists, when writes can no longer be performed, or after a certain number of consecutive failed checks. Decide which slave to promote; in general, choose the slave that has synchronized the most data. Decide whether switching is fully automatic or semi-automatic, where an alarm is raised and a person confirms the switch.

3. Data loss and data conflicts

Data is written to the master, and the master goes down before the data has been copied to the slaves. How should this be handled?

MySQL also has a data conflict problem, mostly caused by auto-increment primary keys. Even setting aside primary key conflicts, auto-increment primary keys cause many other problems.
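The detection and decision steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the threshold of three consecutive failures, the replicated_offset field, and all class names are made up for the example, and the data loss and conflict handling of step 3 is not covered.

```python
from dataclasses import dataclass


@dataclass
class Slave:
    name: str
    replicated_offset: int  # how far this slave has caught up with the master


class FailoverDecider:
    """Signal a switchover only after several consecutive failed master checks."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def record_check(self, master_healthy: bool) -> bool:
        """Record one health check; return True when a switch should happen."""
        if master_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold

    @staticmethod
    def choose_new_master(slaves: list) -> Slave:
        """Pick the slave that has replicated the most data."""
        return max(slaves, key=lambda s: s.replicated_offset)


decider = FailoverDecider(failure_threshold=3)
slaves = [Slave("slave-1", 1040), Slave("slave-2", 1075)]

checks = [True, False, False, True, False, False, False]
decisions = [decider.record_check(ok) for ok in checks]
new_master = decider.choose_new_master(slaves) if decisions[-1] else None
print(decisions, new_master)
```

Requiring consecutive failures keeps a single slow check from triggering a switch. In a semi-automatic setup, the True result would raise an alarm for a person to confirm instead of promoting the slave directly.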

Data sharding

The data redundancy problem above can be solved by replicating data, but scaling the data has to be solved by sharding it (in a relational database, this means splitting tables).

Sharding in HDFS and MongoDB is basically implemented on this model. When designing sharding, we mainly consider the following points:

How to shard the data, that is, how to map data to nodes
The shard key, that is, which attribute (field) of the data is used for sharding
How to ensure high performance and high availability of the metadata server and, if it is a group of servers, how to ensure strong consistency among them
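The first two points, mapping data to nodes and choosing a shard key, can be illustrated with a minimal hash-based sketch. It is one possible mapping, not the scheme HDFS or MongoDB actually uses; the shard names, the choice of user_id as shard key, and the function shard_for are assumptions made for the example.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical node names


def shard_for(shard_key: str, shards: list = SHARDS) -> str:
    """Map a shard-key value to a node with a stable hash."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


orders = [
    {"order_id": "o-1001", "user_id": "u-1"},
    {"order_id": "o-1002", "user_id": "u-2"},
    {"order_id": "o-1003", "user_id": "u-1"},
]

# user_id is the shard key, so all orders of one user land on the same shard.
placement = {o["order_id"]: shard_for(o["user_id"]) for o in orders}
print(placement)
```

A plain modulo mapping like this one moves most keys when the number of shards changes; range-based mapping or consistent hashing are common alternatives when shards are added often.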

Flexibility/asynchronization

Asynchronization

The longer a call takes, the greater the risk of a timeout; the more complicated the logic and the more steps it performs, the greater the risk of failure. If the business allows, a call should synchronously return only the results the user must have. Results that do not need to be returned synchronously can be produced asynchronously elsewhere. This reduces the risk of timeouts and, by breaking up a complex operation, reduces complexity.

Of course, asynchronization has many other benefits, such as decoupling; here we only look at it from the perspective of availability.

Asynchronization can generally be implemented in three ways:

After receiving the request, the server creates a new thread to process the business logic, and the server responds to the client first

After receiving the request, the server responds to the request and then continues processing the business logic

After receiving the request, the server saves the information in a message queue or database and sends the response to the client. A separate processing process on the server then reads the information from the message queue or database and executes the business logic
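The third approach can be sketched with the standard library. Here queue.Queue is an in-process stand-in for a real message queue, and handle_request, worker, and the order ID are hypothetical names used only for the illustration.

```python
import queue
import threading

tasks = queue.Queue()  # in-process stand-in for a message queue
processed = []


def handle_request(order_id: str) -> dict:
    """Save the work item and respond to the client immediately."""
    tasks.put({"order_id": order_id})
    return {"status": "accepted", "order_id": order_id}


def worker() -> None:
    """Separate processing loop that consumes the queue."""
    while True:
        task = tasks.get()
        if task is None:  # sentinel: stop the worker
            tasks.task_done()
            break
        processed.append(task["order_id"])  # the slow business logic would go here
        tasks.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_request("o-1001")  # returns before the work is done
tasks.put(None)
tasks.join()  # wait until the worker has drained the queue
thread.join()
print(response, processed)
```

With a real message queue or database the saved request also survives a restart of the server, which the in-process queue in this sketch does not provide.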

Flexibility

What is flexibility? Imagine a scenario: our system awards users points for every order according to the order amount. A user places an order, and the service that adds the points fails. Should we cancel the order, or let the order go through and deal with the points problem later by retrying or by raising an alarm?

By flexibility we mean that, where the business allows, we degrade and still give the user as much service as we can, rather than insisting on all or nothing, 100 points or 0 points, every time.

How to be flexible depends mostly on understanding and judging the business. Flexibility is more a way of thinking, and it requires a deep understanding of the business scenario.

In an e-commerce order scenario, placing the order, deducting inventory, and payment are steps that must succeed; if any of them fails, the order fails. Adding points, shipping, and after-sales service, however, can be handled flexibly.
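The distinction between mandatory and flexible steps can be sketched as below. This is only an illustration of the idea: the step names, the place_order function, and the retry_later list (standing in for a retry queue or an alarm) are all hypothetical.

```python
class OrderFailed(Exception):
    pass


def place_order(order: dict, steps: dict, retry_later: list) -> dict:
    """Run mandatory steps strictly and flexible steps on a best-effort basis."""
    # Mandatory steps: any failure fails the whole order.
    for name in ("create_order", "deduct_inventory", "pay"):
        try:
            steps[name](order)
        except Exception as exc:
            raise OrderFailed(f"{name} failed") from exc

    # Flexible steps: a failure is recorded for retry or alerting, not propagated.
    for name in ("add_points",):
        try:
            steps[name](order)
        except Exception:
            retry_later.append((name, order["id"]))

    return {"id": order["id"], "status": "success"}


def broken_points_service(order: dict) -> None:
    raise RuntimeError("points service is down")


steps = {
    "create_order": lambda o: None,
    "deduct_inventory": lambda o: None,
    "pay": lambda o: None,
    "add_points": broken_points_service,
}
retry_later = []
result = place_order({"id": "o-1001", "amount": 100}, steps, retry_later)
print(result, retry_later)
```

The order succeeds even though the points service is down, and the failed step is kept so that it can be retried or raised as an alarm later.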

Fault tolerance

A fallback is probably what we often call a degradation plan. A plan is something to be executed, whereas the fallback here is more of a mindset: every operation can go wrong, and we can accept that, but for every error we must have a fallback plan. The fallback plan is our fault tolerance, the best way to avoid greater harm, and it is also a process of continuous degradation.

Here’s an example:

Suppose the home page calls an interface that returns personalized product recommendations, and the recommendation system fails. We should neither amplify the error (throwing the exception straight to the user) nor keep calling the failing interface. Instead we should tolerate the error and be flexible: first try the data cached from that interface before the failure; if there is none, fetch generic products without personalization; if that also fails, read static text and display it.
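This chain of fallbacks can be sketched as a list of sources tried in order. The function names and the sample data are hypothetical; the point is only the shape of the degradation.

```python
def first_available(sources):
    """Try each source in order; return the first non-empty result."""
    for name, source in sources:
        try:
            result = source()
            if result:
                return name, result
        except Exception:
            continue  # tolerate the error and degrade to the next source
    return "none", []


def personalized():
    raise TimeoutError("recommendation system error")


def cached():
    return []  # nothing cached for this user


def generic():
    return ["best seller A", "best seller B"]


def static_text():
    return ["Welcome to our store"]


level, items = first_available([
    ("personalized", personalized),
    ("cache", cached),
    ("generic", generic),
    ("static", static_text),
])
print(level, items)
```

Each level gives the user a little less than the one before it, but the page always has something to show.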

Our architecture is layered into app, gateway, business logic layer, data access layer and so on, and the organization is likewise divided into a front-end team, a back-end business logic team, and even a middle-platform team. Since both the code and the organization are layered, each layer must tolerate the errors of the layer below it and provide the best possible service to the layer above it.

Here’s an example:

Suppose the dollar price of a product is computed as the price in the product’s currency divided by the exchange rate. If the lower layer returns bad data, for example an exchange rate of 0, the upper layer will throw java.lang.ArithmeticException: / by zero if it divides directly. Following the principle that we cannot fully trust any layer we call, the upper layer should tolerate the fault rather than let the exception propagate, and should still make its best effort to provide service to the layer above.
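A minimal sketch of this kind of tolerance is shown below, written in Python, where the counterpart of the Java exception is ZeroDivisionError. The function usd_price and the idea of falling back to a last known exchange rate are assumptions made for the illustration, not something the scenario prescribes.

```python
from typing import Optional


def usd_price(local_price: float, exchange_rate: float,
              last_known_rate: Optional[float] = None) -> Optional[float]:
    """Convert a price to dollars without letting bad lower-layer data escape."""
    try:
        return round(local_price / exchange_rate, 2)
    except ZeroDivisionError:
        # Python's counterpart of java.lang.ArithmeticException: / by zero.
        if last_known_rate:
            return round(local_price / last_known_rate, 2)
        return None  # the caller can hide the dollar price instead of failing


normal = usd_price(700.0, 7.0)
degraded = usd_price(700.0, 0.0, last_known_rate=7.0)
missing = usd_price(700.0, 0.0)
print(normal, degraded, missing)
```

Whether to reuse an older rate or to show no dollar price at all is a business decision; the sketch only shows that the exception stops at this layer.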

Load balancing

Load balancing is a topic every microservice developer or designer knows well. It can be implemented in hardware, with appliances such as F5 and A10, or in software, such as LVS, Nginx, and HAProxy. Load balancing algorithms include Random, RoundRobin, ConsistentHash and so on.

Nginx load balancing failover

Failover process

When a request is sent to Tomcat1 and Nginx finds that Tomcat1 has a connection error (node failure), Nginx removes Tomcat1 from the list of callable nodes according to certain rules. On the next request, Nginx does not send the request to the failed Tomcat1, but to another Tomcat.

Node failure

By default, Nginx treats connection refused and timeout as the criteria for a failed attempt. When the number of failed attempts within fail_timeout reaches max_fails, the node is considered failed.

Node recovery

Once a node has reached max_fails, Nginx stops sending requests to it for the duration of fail_timeout. It tries the node again only after fail_timeout has elapsed or when all nodes have failed.
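The failure and recovery rules can be modelled in a few lines. This is a simplified simulation and not Nginx’s implementation: only the parameter names max_fails and fail_timeout come from Nginx, while the Upstream class, the pick function, and the use of explicit timestamps are assumptions made to keep the sketch deterministic.

```python
class Upstream:
    """One backend node with passive failure tracking."""

    def __init__(self, name: str, max_fails: int = 1, fail_timeout: float = 10.0):
        self.name = name
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.fails = 0
        self.window_start = 0.0  # start of the current failure-counting window
        self.down_until = 0.0    # the node is skipped until this time

    def available(self, now: float) -> bool:
        return now >= self.down_until

    def report_failure(self, now: float) -> None:
        if self.fails == 0 or now - self.window_start > self.fail_timeout:
            # First failure, or the old window expired: start counting again.
            self.fails, self.window_start = 0, now
        self.fails += 1
        if self.fails >= self.max_fails:
            self.down_until = now + self.fail_timeout
            self.fails = 0


def pick(nodes, now):
    """Return the first available node, or the first node if all are marked down."""
    for node in nodes:
        if node.available(now):
            return node
    return nodes[0]


tomcat1 = Upstream("tomcat1", max_fails=2, fail_timeout=10.0)
tomcat2 = Upstream("tomcat2", max_fails=2, fail_timeout=10.0)
nodes = [tomcat1, tomcat2]

tomcat1.report_failure(now=0.0)
tomcat1.report_failure(now=1.0)      # second failure inside the window
during = pick(nodes, now=5.0).name   # tomcat1 is skipped
after = pick(nodes, now=12.0).name   # fail_timeout has elapsed, tomcat1 is tried again
print(during, after)
```

The sketch always picks the first available node to stay short; a real load balancer would combine this check with an algorithm such as round robin.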

ZK load balancing failover

When ZooKeeper (ZK) is used as the registry, fault discovery is done by ZK. The business logic layer registers itself with ZK and keeps the registration alive through a heartbeat, and the gateway subscribes to ZK through the Watch mechanism to learn which instances are callable. When a business logic instance restarts or shuts down, its heartbeat stops and ZK updates the callable list.

The biggest problem with using ZK as the load balancing coordinator is that ZK judges availability by a ping-pong heartbeat: as long as the heartbeat exists, ZK considers the service available. If the service is hung (alive but unable to serve), ZK has no way of knowing. In that case only the gateway, which actually calls the service, can tell whether the business logic service is really available.

Idempotent design

The need for idempotent design comes mainly from the failover policy of load balancing, which retries failed calls. If the call is a read operation, repeating it is generally harmless. But suppose the call creates an order and deducts inventory, and the first call to Tomcat1 times out. Tomcat1 may never have processed the call, or it may have processed it successfully and the response timed out for some reason. When the call is retried, the interface may end up being executed twice. If we do not ensure idempotency, one order may deduct inventory twice.

Idempotency means that invoking the same interface multiple times for the same business operation produces the same result as invoking it once.
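One common way to get this property is to attach a unique request ID to each business operation and remember the result per ID. The sketch below assumes that approach; InventoryService, the in-memory result table, and the request ID format are hypothetical, and a real service would keep the table in shared storage.

```python
class InventoryService:
    """Deducts stock at most once per request ID."""

    def __init__(self, stock: int) -> None:
        self.stock = stock
        self._results = {}  # request_id -> stock left after the deduction

    def deduct(self, request_id: str, quantity: int) -> int:
        if request_id in self._results:
            # A retry of a request that already succeeded: return the old result.
            return self._results[request_id]
        if quantity > self.stock:
            raise ValueError("insufficient stock")
        self.stock -= quantity
        self._results[request_id] = self.stock
        return self.stock


service = InventoryService(stock=10)
first = service.deduct("order-1001", 1)
retry = service.deduct("order-1001", 1)  # e.g. a retry after a timeout
print(first, retry, service.stock)
```

The retried call returns the same result as the first one, and the stock is deducted only once.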

Service rate limiting, degradation, and circuit breaking

First, let’s look at why microservices need rate limiting and circuit breaking. In a microservices architecture the system is distributed and services communicate with each other through an RPC framework, so the probability that something in the system fails grows as the system grows.

How does rate limiting relate to high availability? Suppose our system can carry at most 500 concurrent users, and suddenly 1000 users arrive at once. The whole system collapses, and a system that could have served 500 people now serves nobody. It is better to serve 500 people and reject the other 500 than to let all 1000 in and serve none of them. Rate limiting isolates access in order to guarantee availability for a portion of the users.
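The 500-out-of-1000 scenario can be sketched with a non-blocking semaphore. This is one simple way to cap concurrency, shown as an illustration; the class name ConcurrencyLimiter and the fact that no request finishes during the burst are assumptions of the example.

```python
import threading


class ConcurrencyLimiter:
    """Admit at most `limit` requests at a time and reject the rest."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)

    def try_acquire(self) -> bool:
        # Non-blocking: a request that finds no free slot is rejected at once.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


limiter = ConcurrencyLimiter(limit=500)

# 1000 requests arrive at the same moment and none has finished yet.
admitted = sum(1 for _ in range(1000) if limiter.try_acquire())
rejected = 1000 - admitted
print(admitted, rejected)
```

In a real handler, release() would be called when a request finishes so that the slot becomes available to the next one.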

How does circuit breaking relate to high availability? As mentioned above, call chains between microservices are complex. Suppose module A calls module B, B calls module C, and C calls module D. If module D develops a problem and responds with serious delay, the whole call chain is dragged down: A waits for B, B waits for C, C waits for D, and the resources of A, B, C, and D are all held and cannot be released. Under heavy traffic this easily causes an avalanche. With circuit breaking, we proactively give up calling module D and degrade some functionality, which keeps the system robust. Circuit breaking isolates modules in order to preserve as much functionality as possible.
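A minimal circuit breaker might look like the sketch below. It is an illustration under stated assumptions, not the behavior of any particular library: the threshold of three failures, the 30-second reset timeout, and the names CircuitBreaker and module_d are all made up for the example.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and use a fallback instead."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: do not touch the failing module
            self.opened_at = None  # half-open: let one trial call through
            self.failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


calls = {"module_d": 0}


def module_d():
    calls["module_d"] += 1
    raise TimeoutError("module D is too slow")


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30.0)
results = [breaker.call(module_d, lambda: "degraded") for _ in range(10)]
print(calls["module_d"], results[0])
```

After three failures the breaker opens, so the remaining seven requests get the degraded result without calling module D at all, and the callers’ resources are not tied up waiting for it.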

Service governance

Service module division

Service modules are related to one another in countless ways, but each carries a different weight in the business. The order module, for example, may be the top priority of an e-commerce company: if it fails, the revenue of the whole company is directly affected. A back-office query service may also be important, but certainly not as important as orders. Therefore, in service governance, the importance level of each service module must be determined so that monitoring and resources can be allocated accordingly. Each company has its own standard for this; an e-commerce company, for example, is likely to grade services by the number of user requests and by revenue-related indicators.

Service level	Service modules
Level 1 services	Payment system, order service, product service, user service, release system, …
Level 2 services	Message service, permission system, CRM system, points system, BI system, comment system, …
Level 3 services	Back-office log system

The real division may be more complicated than this and must be set according to the specific business. The importance of a service module can be estimated from its normal traffic and from traffic forecasts, and more important modules are usually given more resources. So we need to understand not only the technical architecture but also the various forms the business takes.

Service grading not only plays an important role in defining faults, it also determines the intensity of service monitoring. Service monitoring is a safeguard for high availability: it not only preserves the scene of a service crash for later investigation but, more importantly, acts as a prophet that makes judgments ahead of time, so that in many cases danger can be anticipated and prevented.

Service monitoring

Service monitoring is an important part of microservice governance. The completeness of the monitoring system directly affects the quality of our services. Having a complete monitoring system that tells us the health of our microservices while they run online is very important for the reliability and stability of the whole system, and reliability and stability are prerequisites for high availability.

Service monitoring is mostly about anticipating risk, discovering a problem before the service becomes unavailable. If the monitoring and alarm system detects a problem the system can repair by itself, the error is eliminated before anyone notices; if it detects a problem the system cannot repair by itself, it notifies people to step in early.

The following figure shows the levels involved in a relatively complete micro-service monitoring system, which can be roughly divided into five levels of monitoring

Infrastructure monitoring

This covers low-level devices such as networks, switches, and routers, whose reliability and stability directly affect the stability of upper-layer service applications. It is therefore necessary to monitor core infrastructure indicators such as network traffic, packet loss, packet errors, and the number of connections.

System Layer monitoring

It covers physical machines, virtual machines, and operating systems, which are all aspects of system-level monitoring. It monitors several core indicators, such as CPU usage, memory usage, disk I/O, and network bandwidth

Application Layer Monitoring

This covers, for example, URL access performance, call counts, and access latency, as well as the performance and error rate of the services provided. SQL also needs to be monitored for slow queries, and caches need to be monitored for hit ratio and performance. The response time and QPS of each service are monitored as well.

Business monitoring

For example, an e-commerce website needs to pay attention to its user login, registration, ordering and payment conditions, which directly affect the actual triggered business transactions. This monitoring can provide operators and company executives with the data they need to pay attention to, which may directly affect the company’s strategy.

End user experience monitoring

Users reach our services through browsers and client apps, so how the user experience is on the client, how the client performs, and whether it hits errors are all things that need to be monitored and recorded. Without this monitoring, users may have a very poor experience because of an error or a performance problem, and we would not notice. This level covers client performance, return codes, the cities and regions where the client is used, the network operator (for example China Telecom or China Unicom), and the user’s connectivity. We also need to know the operating system and browser versions used by the client, in order to tell whether a problem lies with the access channel or with the user’s environment.

Conclusion:

So many cards have been played, but cards are only means. The real approach is to calm down and look at what the essence of service high availability is. As calls between microservices grow more and more complex, the links will only multiply. Only by establishing a clear architecture and layering, and analyzing how each link safeguards high availability, can we keep things simple.

High availability in terms of means: the main techniques are redundant backup and failover of services and data. A group of services or a set of data is spread over multiple nodes that back each other up. When a machine goes down or has problems, traffic can be switched from the current service to another available one, without affecting the availability of the system and without losing data.
High availability in terms of architecture: keep the architecture simple. At present most websites adopt the classic layered architecture of application layer, service layer, and data layer. The application layer handles business logic, the service layer provides closely related data and business services, and the data layer is responsible for reading and writing data. A simple architecture lets the application layer and service layer stay stateless and scale horizontally; this is compute high availability. High availability at the data layer is data high availability, which is more complex than compute high availability because data consistency has to be considered. This is where the CAP theorem plays a key role, and the choice between AP and CP depends on the business model.
High availability in terms of hardware: first accept that hardware will always break and the network will always be unstable. The solution is simple: if one server is not enough, use several; if one cabinet is not enough, use several; if one machine room is not enough, use several.
High availability in terms of software: careless development and non-standard releases also lead to all kinds of unavailability. Unavailability can be reduced by controlling quality throughout the software development process, with measures such as testing, pre-release, and gray release.
High availability in terms of governance: a system may be running well online, yet we cannot guarantee it will not become unavailable in the next second. Standardize services, grade services in advance, monitor services well, anticipate unavailability, and find and solve problems before unavailability occurs.

[Note: Part of the article refers to Li Yunhua’s “Learning Architecture from Zero” and Yang Bo’s “Micro Services”.]