Quorum

The Quorum design pattern requires that every change in a distributed system be accepted by a majority of instances before the change is considered committed.

The problem background

In a distributed storage system, user requests are sent to an instance. Changes made on one instance normally need to be copied to other instances so that users can still see the changes if the original instance dies. This raises the question of how many other instances a change must be copied to before the request returns success. If the change is replicated to too many instances, the response time grows; if it is copied to too few, the change may be lost. Striking this balance matters, and it is the trade-off between Consistency (C) and Latency (L) in the PACELC theorem.

The solution

When a change is accepted by a majority of the cluster’s nodes (let’s call that number N), the change is accepted by the cluster. This N is the quorum. If the cluster has n nodes, then N = n/2 + 1 (rounding n/2 down). For example, if n = 5, then N = 3.

The quorum also determines f, the maximum number of instances in the cluster that can fail: f = n - N. Put the other way round, if we expect to tolerate F instance failures, the cluster should have at least 2F + 1 instances.
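To make the arithmetic concrete, here is a minimal Python sketch (the function names are ours, not from any library) that computes the quorum size and the tolerated number of failures for a given cluster size:

    def quorum_size(n: int) -> int:
        """Smallest majority of an n-node cluster: N = n // 2 + 1."""
        return n // 2 + 1

    def max_failures(n: int) -> int:
        """Failures the cluster can survive while still forming a quorum: f = n - N."""
        return n - quorum_size(n)

    for n in (3, 4, 5, 7):
        print(f"n={n}: quorum N={quorum_size(n)}, tolerated failures f={max_failures(n)}")
    # n=5 gives N=3 and f=2, i.e. a cluster of 2F + 1 = 5 nodes tolerates F = 2 failures.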

Here are two classic cases of design patterns that rely on a quorum:

  • Updating data in a storage cluster. This also involves the High-Water Mark design pattern, where the high-water mark marks the end of the portion of the log that has been synchronized to a majority of instances in the cluster.
  • Leader election. In the Leader and Followers design pattern, whichever node is elected by a quorum becomes the leader.

How to decide the cluster size

At present, there are two mainstream cluster design modes:

  • The first is master-slave synchronization:
    • In one case, the request is sent to the master, which synchronizes the change to the slaves before returning. If a request arrives at a slave, the slave forwards it to the master for processing. Zookeeper, for example, works this way.
    • Alternatively, the instance that receives the request acts as the master and synchronizes the request to the slaves. Eureka, for example, is designed this way.
  • The second is the partitioned mode, in which different nodes in the cluster store different data. This data sharding usually relies on consistent hashing. Suppose a request is sent to A; after A computes the hash, the data needs to be stored on D, and the one replica we configured lives on E, so the request is synchronized to D and E. ElasticSearch, Riak, and Dynamo follow similar designs (a toy sketch follows this list).
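The placement step of the partitioned mode can be sketched with a toy consistent-hash ring. This is only an illustration with made-up node names and a single replica, not the actual placement logic of ElasticSearch, Riak, or Dynamo:

    import hashlib
    from bisect import bisect_right

    class HashRing:
        """Toy consistent-hash ring: each node owns one position on the ring,
        and a key is stored on the first node clockwise plus `replicas` successors."""

        def __init__(self, nodes, replicas=1):
            self.replicas = replicas
            self.ring = sorted((self._hash(n), n) for n in nodes)

        @staticmethod
        def _hash(key: str) -> int:
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def nodes_for(self, key: str):
            hashes = [h for h, _ in self.ring]
            start = bisect_right(hashes, self._hash(key)) % len(self.ring)
            # primary owner plus `replicas` distinct successors on the ring
            return [self.ring[(start + i) % len(self.ring)][1]
                    for i in range(self.replicas + 1)]

    ring = HashRing(["A", "B", "C", "D", "E"], replicas=1)
    print(ring.nodes_for("cart:42"))  # primary plus one replica, e.g. ['D', 'E'] (depends on the hashes)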

In either design mode, there are two main considerations:

  • Throughput of write operations. Because every write to the cluster is replicated to multiple instances, performance is inevitably affected. Replication is typically done concurrently, so write latency is determined mainly by the slowest instance being synchronized.
  • The number of tolerable instance failures. Such a cluster would have to have at least 2F + 1 instances if we expected to tolerate F instance failures.

Implementation examples

1. Zookeeper’s two-phase commit and majority-write mechanism

The client sends a write request to the leader node (if it is sent to a follower, the follower forwards the write to the leader). The leader sends the data to all nodes (including itself) as a PROPOSAL request. Each node that receives the proposal writes the data to its local disk and then sends an ACK back to the leader. As soon as more than half of the nodes have responded with an ACK, the leader sends a COMMIT message to every node. Each node then applies the message in memory (for performance), and the change becomes visible to users.
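The majority-write step can be sketched roughly as follows. This is a simplified illustration of the PROPOSAL/ACK/COMMIT flow described above, not Zookeeper’s actual ZAB implementation; the Node class and its methods are assumptions made for the example:

    class Node:
        """Assumed stand-in for a member of the ensemble (leader included)."""
        def __init__(self, name):
            self.name = name
            self.log = []       # stands in for the on-disk transaction log
            self.state = {}     # in-memory state that clients read

        def propose(self, txn) -> bool:
            self.log.append(txn)   # in real Zookeeper this is a disk write
            return True            # ACK back to the leader

        def commit(self, txn):
            self.state[txn["path"]] = txn["data"]  # change becomes visible

    def leader_write(nodes, txn) -> bool:
        """Phase 1: PROPOSAL to all nodes; phase 2: COMMIT once a majority has ACKed."""
        acks = sum(1 for n in nodes if n.propose(txn))
        if acks >= len(nodes) // 2 + 1:   # more than half of the ensemble
            for n in nodes:
                n.commit(txn)
            return True
        return False                      # quorum not reached, change is not committed

    ensemble = [Node(name) for name in ("zk1", "zk2", "zk3")]
    print(leader_write(ensemble, {"path": "/config", "data": "v1"}))  # True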

2. Riak, DynamoDB

By default these are PA/EL systems in PACELC terms, but this can be changed through configuration. They are built on the NWR model with synchronous and asynchronous replication: N is the number of replicas, W is the minimum number of replicas that must acknowledge a write before it succeeds, and R is the minimum number of replicas that must be read. The configuration must satisfy W + R > N. Because W + R > N, we have R > N - W; in other words, the number of replicas read is greater than the total number of replicas minus the number guaranteed to hold a successful write, so every read touches at least one replica holding the latest version and stale data alone is never returned.

When we need a highly writable environment (for example, Amazon’s shopping cart, where an add request should never be rejected), we can configure W = 1 and, with N = 3, R = 3: a write counts as successful as soon as any single node accepts it, but a read must gather responses from all nodes. If we want read efficiency instead, we can configure W = N and R = 1: a read succeeds as soon as any single node responds, but a write succeeds only after all three nodes have written it. Note that the latency of an operation is the latency of the slowest of its parallel sub-operations. For example, with R = 3 a read request is sent to three nodes at the same time and completes only when all three return results, so one slow node noticeably slows down the whole read.
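The W + R > N rule itself is just arithmetic; a minimal sketch of checking a configuration (an illustration of the rule, not Riak’s or Dynamo’s client API) looks like this:

    def overlaps(n: int, w: int, r: int) -> bool:
        """W + R > N guarantees every read quorum overlaps the latest write quorum."""
        return w + r > n

    # Highly writable (shopping-cart style): one ACK is enough to write,
    # but a read has to contact every replica.
    print(overlaps(n=3, w=1, r=3))   # True

    # Read-optimized: one replica is enough to read,
    # but a write succeeds only after all N replicas have it.
    print(overlaps(n=3, w=3, r=1))   # True

    # Invalid: read and write quorums may miss each other, so stale reads are possible.
    print(overlaps(n=3, w=1, r=1))   # False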

3. MongoDB

Like Dynamo, MongoDB exposes its consistency and availability trade-offs through configuration (a pymongo sketch follows this list):

  • write-concern: a write request returns to the client only after the number of MongoDB instances given by the write-concern value have processed it.
  • read-concern: sets how consistent the data returned by a read must be, e.g. the node’s latest local data or only data that has already been acknowledged by a majority of nodes.
  • read-preference: sets which members of the replica set serve reads, e.g. whether the latest data must be read from the primary or eventually consistent data may be read from a secondary.
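For concreteness, the three knobs can be set per collection with the pymongo driver. This is a sketch that assumes a reachable replica set named rs0 on localhost and uses made-up database and collection names:

    from pymongo import MongoClient, ReadPreference
    from pymongo.read_concern import ReadConcern
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

    orders = client["shop"].get_collection(
        "orders",
        # write-concern: return to the client only after a majority of members persist the write
        write_concern=WriteConcern(w="majority"),
        # read-concern: only return data that a majority of members have acknowledged
        read_concern=ReadConcern("majority"),
        # read-preference: allow reads from secondaries (eventually consistent)
        read_preference=ReadPreference.SECONDARY_PREFERRED,
    )

    orders.insert_one({"user": "alice", "item": "book"})
    print(orders.find_one({"user": "alice"}))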
