How to systematically learn distributed systems?

This article grew out of my answer to the question “How to systematically learn distributed systems” in the Zhihu Roundtable “The Beauty of Distributed Systems” (Zhihu ID: Kylin). After a little tidying up, that answer became this article.

Preface

Before learning any subject, I think a good approach is to first understand its ins and outs: how the knowledge came about, what problems it solves, how it solves them, and what new problems it introduces (there is no silver bullet). That way we can grasp its context and key points instead of getting lost in the details from the start.

So, before we learn about distributed systems, the first question we need to address is: what problems do distributed systems solve?

What problems do distributed systems solve?

The first is the cost problem caused by the single-machine performance bottleneck. Because Moore’s Law no longer holds, the performance of cheap PCs can no longer keep breaking through its bottleneck. Minicomputers and mainframes offer higher single-machine performance, but they cost too much for most companies to bear.

The second is the cost problem caused by the explosive growth of users and data. In the Internet era, the number of users and the amount of user-generated data have both grown explosively, yet the value of a single user or a single piece of data is much lower than in the software era (bank users, for example). A more economical solution is therefore needed.

The third is the high availability that the business requires. Internet products are expected to provide 7 * 24 service and cannot tolerate outages. The only way to provide high availability is to add redundancy, so even if a single machine could carry the load, the high availability requirement turns the system into a distributed one.

From these three reasons we can see that, in the Internet era, single-machine systems cannot solve the cost and high availability problems, and these two problems are critical for almost every company. Moving from single-machine systems to distributed systems is therefore an inevitable technology trend.

How do distributed systems solve problems?

So, how can distributed systems solve the cost and high availability problems faced by stand-alone systems?

In fact, the idea is very simple: connect a number of cheap PCs over a network so that they work together, and build redundancy into the system to provide high availability.

What new problems do distributed systems introduce?

Let’s consider the definition of a distributed system: a distributed system is a system composed of a group of computer nodes that communicate over a network and coordinate their work to complete a common task. From the definition we can see that a distributed system solves the cost and availability problems of a single-machine system by using multiple work nodes, but in doing so it introduces the problem of coordinating those work nodes.

We often say that to understand a piece of knowledge we need to know its causes and consequences. For distributed systems, the cause is “what problems distributed systems solve”, and the consequence is “how the internal work nodes are coordinated”. So the second question we need to answer is: how are the work nodes inside a distributed system coordinated?

What new problems does distributed computing introduce?

Let’s start with the simple case. For distributed computing (stateless), what does the coordination within the system need to do?

1. How to find the service?

In a distributed system there are different services (roles). How service A finds service B is a problem that needs to be solved. Generally speaking, a service registration and discovery mechanism is the common approach, so it is worth understanding how such a mechanism is implemented. We can also think about whether an AP or a CP system is the more reasonable choice for service registration and discovery. (Strictly according to the CAP theorem, most of the systems we currently use can hardly satisfy C or A in full, so AP and CP are meant here only in the general sense.)
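As a rough illustration of registration and discovery, here is a minimal in-memory sketch. The class name `ServiceRegistry`, the TTL-based expiry, and the injectable `clock` are assumptions made for this example and do not describe any particular registry product; a real registry would also replicate this state across nodes, which is where the AP versus CP choice comes in.

```python
import time


class ServiceRegistry:
    """In-memory registry: instances register and must heartbeat within a TTL."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._instances = {}  # service name -> {address: time of last heartbeat}

    def register(self, service, address):
        self._instances.setdefault(service, {})[address] = self.clock()

    # A heartbeat is simply a re-registration that refreshes the timestamp.
    heartbeat = register

    def discover(self, service):
        """Return the addresses whose last heartbeat is still within the TTL."""
        now = self.clock()
        entries = self._instances.get(service, {})
        return sorted(a for a, seen in entries.items() if now - seen <= self.ttl)
```

Service B would call `register("B", address)` on startup and `heartbeat` periodically, and service A would call `discover("B")` before sending a request.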

2. How to find instances?

After the service is found, which instance of the service should the current request be sent to? In general, if all instances of a service are fully equivalent (stateless), choosing one according to a load-balancing policy (round-robin, weighted, hash, consistent hash, fair, etc.) is sufficient. If the instances are not equivalent (stateful), a routing service (a metadata service, for example) must first determine which instance holds the data the request accesses, and the request is then sent to that instance.
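Two of the load-balancing policies mentioned above can be sketched in a few lines. This is only an illustration under assumptions: the class names, the use of MD5 as a stable hash, and the choice of 100 virtual nodes per instance are all made up for the example.

```python
import bisect
import hashlib
import itertools


def _hash(key):
    # Stable hash (Python's built-in hash() is randomized per process).
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class RoundRobin:
    """Hand out instances in a fixed rotating order."""

    def __init__(self, instances):
        self._cycle = itertools.cycle(list(instances))

    def pick(self):
        return next(self._cycle)


class ConsistentHashRing:
    """Map a request key to an instance via a hash ring with virtual nodes."""

    def __init__(self, instances, vnodes=100):
        self._ring = sorted(
            (_hash(f"{inst}#{i}"), inst) for inst in instances for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def pick(self, request_key):
        # First virtual node clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(request_key)) % len(self._ring)
        return self._ring[idx][1]
```

With the ring, removing one instance only remaps the keys that were assigned to it; keys on the other instances stay where they were, which is the main reason to prefer it over a plain hash when instances come and go.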

3. How to avoid avalanches?

A system avalanche is a failure whose scale keeps expanding because of positive feedback. It is usually triggered by the failure of a small part of the system, which then causes other parts of the system to fail. For example, when one instance of a service fails, the load balancer removes it and the load on the remaining instances increases. As a result, the instances of the service fail one after another like dominoes.

The overall strategy for avoiding avalanches is relatively simple, and there are mainly two ideas. One is a fail-fast and degradation mechanism (circuit breaking, degradation, rate limiting, etc.), which avoids an avalanche by rapidly reducing the system load. The other is an elastic scaling mechanism, which avoids an avalanche by rapidly increasing the service capacity of the system. Which one to choose depends on the scenario, and both can be used together.

In general, failing fast causes some requests to fail, and if a distributed system has high consistency requirements, failing fast may lead to data inconsistency. Elastic scaling is the better choice in that case, but its cost and response time are much greater than those of failing fast.
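The fail-fast idea can be illustrated with a small circuit breaker. This is a minimal sketch under assumptions: the names `CircuitBreaker` and `CircuitOpenError`, the default thresholds, and the simple half-open behaviour (one trial call after the timeout) are choices made for this example, not a description of any specific library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate error instead of piling more load onto a struggling instance, which is what breaks the positive feedback loop.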

4. How to do monitoring and alerting?

For a distributed system, if we cannot clearly see its internal state, there is no way to fully guarantee high availability. Monitoring of the distributed system (such as interface latency and availability), distributed tracing, fault simulation through chaos engineering, and the related alerting mechanisms must therefore all be well developed.
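To make the interface latency and availability metrics concrete, here is a small sketch that turns raw request samples into alerts. The function names, the nearest-rank percentile, and the thresholds (500 ms p99, 99.9% availability) are assumptions picked for illustration only.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_interface(samples, max_p99_ms=500.0, min_availability=0.999):
    """samples: list of (latency_ms, succeeded) tuples for one interface."""
    latencies = [latency for latency, _ in samples]
    availability = sum(1 for _, ok in samples if ok) / len(samples)
    p99 = percentile(latencies, 99)
    alerts = []
    if p99 > max_p99_ms:
        alerts.append(f"p99 latency {p99:.0f} ms exceeds {max_p99_ms:.0f} ms")
    if availability < min_availability:
        alerts.append(f"availability {availability:.2%} below {min_availability:.2%}")
    return alerts
```

In practice these numbers would come from a metrics system rather than an in-memory list, but the alerting rule itself has the same shape.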

What new problems does distributed storage introduce?

Next, let’s look at how internal coordination works for distributed storage (stateful). The coordination methods for distributed computing described above also apply to distributed storage.

1. Distributed system theory and trade-offs

ACID, BASE, and the CAP theorem. To understand these three topics, I recommend the following article. English version: www.infoq.com/articles/ca… Chinese version: www.infoq.cn/article/cap…

2. How to do data sharding?

A single machine does not have the capacity to store all the data, so we need to decide how to distribute the data across different machines according to certain rules. The most commonly used schemes today are hash, consistent hash, and range-based sharding. It is worth learning the advantages, disadvantages, and application scenarios of each.
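The trade-off between hash and range-based sharding can be seen in a short sketch. The function names, the MD5-based hash, and the sample keys are assumptions for this example.

```python
import bisect
import hashlib


def hash_shard(key, num_shards):
    """Hash sharding: spreads keys evenly, but a range scan touches every shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def range_shard(key, split_points):
    """Range sharding: shard i holds keys in [split_points[i-1], split_points[i])."""
    return bisect.bisect_right(split_points, key)


keys = [f"user{i:04d}" for i in range(1000)]
# Fraction of keys that change shard when hash sharding grows from 4 to 5 shards.
moved = sum(hash_shard(k, 4) != hash_shard(k, 5) for k in keys) / len(keys)
```

With a uniform hash, roughly four out of five keys are expected to move when going from 4 to 5 shards, which is the resharding cost that consistent hashing is designed to reduce. Range sharding keeps adjacent keys on the same shard, which makes range scans cheap but can create hot spots when accesses concentrate on one key range.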

3. How to do data replication?

To meet the high availability requirements of the system, data must be stored redundantly. The current schemes fall mainly into centralized ones (master-slave replication, consensus protocols such as Raft and Paxos) and decentralized ones (Quorum and Vector Clock). It is worth understanding their advantages, disadvantages, and application scenarios, as well as the level of data consistency the system presents to the outside (linearizability, sequential consistency, eventual consistency, etc.).
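For the decentralized schemes, two small building blocks are easy to show: the quorum overlap condition and vector clock comparison. The function names and the dict representation of a clock are assumptions made for this sketch.

```python
def quorum_overlaps(n, r, w):
    """With n replicas, any read quorum r and write quorum w intersect iff r + w > n."""
    return r + w > n


def compare(a, b):
    """Compare two vector clocks given as {node_id: counter} dicts.

    Returns "equal", "before" (a happened before b), "after", or "concurrent".
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def merge(a, b):
    """Element-wise maximum, used once two versions have been reconciled."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```

Two versions that compare as "concurrent" were written without knowledge of each other, so the system (or the application) has to resolve the conflict; this is the price a decentralized scheme pays for not routing every write through a single master.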

4. How to do distributed transactions?

For a distributed system to implement transactions, it must first be able to order concurrent transactions, so that when transactions conflict it can decide which one commits successfully and which one fails. For a single-machine system this is not a problem: a timestamp plus a sequence number is enough. In a distributed system, however, machine clocks cannot be perfectly synchronized and a single machine’s sequence number has no global meaning, so that approach does not work. It is possible to pick a single machine to generate transaction IDs for the whole system, and this works well for multiple data centers in the same city or in nearby regions. For a globally distributed system, though, going to one node to fetch a transaction ID for every transaction is too expensive (for example, the RTT from Hangzhou, China to the eastern United States is 200+ ms). Google’s Spanner solves this problem by implementing the TrueTime API with GPS and atomic clocks, which makes a globally distributed database possible.

With transaction IDs in place, the atomicity of distributed transactions is achieved through the 2PC or 3PC protocol. The remaining parts differ little from single-machine transactions, so I won’t go into detail.
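The single-machine ID generator and its cost over a long-distance link can be sketched as follows. The class and function names are hypothetical, and the throughput bound assumes one client that fetches an ID for each transaction strictly one at a time, with no batching or pipelining.

```python
import itertools
import threading


class TransactionIdAllocator:
    """Single-node allocator handing out strictly increasing transaction IDs."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            return next(self._counter)


def max_sequential_txn_per_second(rtt_ms):
    """Upper bound for one client that waits a full round trip per transaction ID."""
    return 1000.0 / rtt_ms
```

At an RTT of 200 ms, `max_sequential_txn_per_second(200)` gives 5.0, so such a client could start at most five transactions per second, before any actual work is done.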
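The core of 2PC fits in a few lines. This is a simplified sketch with hypothetical names (`Participant`, `two_phase_commit`); it ignores timeouts, coordinator failure, and durable logging, which are exactly the parts that make the real protocol hard.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if the transaction committed on every participant."""
    # Phase 1 (voting): ask each participant in turn; stop at the first "no" vote.
    if all(p.prepare() for p in participants):
        # Phase 2 (completion): the vote was unanimous, so commit everywhere.
        for p in participants:
            p.commit()
        return True
    # At least one participant voted no: abort everywhere.
    for p in participants:
        p.abort()
    return False
```

Either every participant ends in the "committed" state or every participant ends in "aborted", which is the atomicity property the protocol is meant to provide.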

Advanced learning

By this point we have a basic picture of distributed systems, and it is time to study the details. This is also a very difficult stage. Depth of understanding and depth of detail are very important measures of how well you know distributed systems; after all, the devil is in the details. We can study systematically from two directions:

1. Start from practice

Study the designs of commonly used distributed systems, such as HDFS or GFS (distributed file systems), Kafka and Pulsar (distributed message queues), Redis Cluster and Codis (distributed caches), MySQL database and table sharding (the distributed scheme for traditional relational databases), MongoDB’s Replica Set and Sharding mechanisms as well as the decentralized Cassandra (NoSQL databases), the centralized TiDB and the decentralized CockroachDB (NewSQL), and some microservice frameworks.

2. Start from theory

Read the book “Designing Data-Intensive Applications” first, and then read the references cited in the chapters that interest you.

Conclusion

This article starts with the problems that distributed systems solve, then discusses how they solve them, and finally discusses the new problems they introduce and the solutions to those problems. Together these form the overall context of distributed systems. With this context in hand, you can go on to study distributed systems in detail from both the practical and the theoretical side.

References

Zhihu: How to systematically learn distributed systems
Martin Kleppmann. Designing Data-Intensive Applications
CAP Twelve Years Later: How the “Rules” Have Changed