“In distributed system design, skepticism, pessimism and paranoia are all worthwhile; in fact, building more reliable systems from less reliable foundations is an old idea in computing.”
ZooKeeper, as a distributed application coordination system, has long been a benchmark in the Java ecosystem for configuration management, master election, service registries and similar use cases. Since it was open-sourced, ZooKeeper has been widely used in the architecture of major open-source systems such as Hadoop, Kafka and Dubbo. etcd, a rising star of the Golang stack, is often compared with ZooKeeper in terms of performance, reliability and consistency, and it surpasses ZooKeeper in some respects and application scenarios. I have recently been reading Designing Data-Intensive Applications and wanted to study it alongside mature open-source distributed systems and their theory; moreover, ZooKeeper was part of my previous company's technology stack while etcd is new to me, so I hope to take this opportunity to master some of its more advanced system design.
This article is mainly divided into two parts:
- the origin of the two names and a comparison of their data models;
- the distributed consensus algorithms and their application in ZooKeeper and etcd.
Note: the etcd features discussed in this article are based on the v3 version.
Origin of the names
ZooKeeper was designed and open-sourced by Yahoo. It is said that at the time many internal systems were named after animals; since ZooKeeper was designed to be the coordinator and manager of these various distributed systems, it was given the name ZooKeeper.
The etcd name is made up of two parts (see the etcd documentation): the "/etc" directory of Unix systems and "d" for distributed systems. On Unix-like systems the /etc directory stores system configuration data, so in its designers' vision etcd is a distributed configuration database with high stability and strong consistency.
ZooKeeper data model
The logical data structure of ZooKeeper is a tree similar to a Unix file system, but unlike a file system, each node can act as both a file and a directory. The smallest data unit in ZooKeeper is called a znode, and znodes together form the tree. Znodes are divided into persistent nodes and ephemeral nodes: persistent nodes are kept until they are explicitly deleted, while ephemeral nodes are bound to a client session and are automatically removed when the session ends. In addition, a sequential attribute can be assigned to a node so that sibling nodes are ordered by creation.
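As an illustration of these node types, the following is a minimal sketch using the community Go client (github.com/go-zookeeper/zk and the paths used here are assumptions for the example, not part of the original text):

// Illustrative sketch: creating persistent, ephemeral and sequential znodes.
package main

import (
    "fmt"
    "time"

    "github.com/go-zookeeper/zk"
)

func main() {
    conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    acl := zk.WorldACL(zk.PermAll)

    // Persistent node: stays until explicitly deleted
    // (will fail with ErrNodeExists if /app is already there; error handling kept minimal).
    if _, err := conn.Create("/app", []byte("config"), 0, acl); err != nil {
        panic(err)
    }
    // Ephemeral node: removed automatically when this client session ends.
    if _, err := conn.Create("/app/worker-1", []byte("alive"), zk.FlagEphemeral, acl); err != nil {
        panic(err)
    }
    // Sequential node: the server appends a monotonically increasing suffix.
    name, err := conn.Create("/app/task-", []byte("job"), zk.FlagSequence, acl)
    if err != nil {
        panic(err)
    }
    fmt.Println("created sequential node:", name) // e.g. /app/task-0000000001
}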
For example, when Dubbo uses ZooKeeper as its registry, whether a service Provider is registered as a persistent or an ephemeral node is determined by the dynamic configuration: if dynamic is true (the default), only ephemeral nodes are created, and dynamic service removal is realized through the watch mechanism. The Consumer node does not need to change dynamically, so it can be created as a persistent node. The specific structure is shown in the following figure:
(Figure: Dubbo registry node tree in ZooKeeper)
See also: Why Alibaba does not use ZooKeeper for service discovery
Etcd data model
The etcd v2 storage engine keeps data in memory and does not flush to disk in real time; here we only discuss the v3 data model. etcd is designed for reliable storage of infrequently updated data, and its data model is a multi-version, persistent key-value store. Keys may contain any characters and are stored in lexicographic order, which makes range queries over keys convenient, so a ZooKeeper-style hierarchical directory layout (e.g. keys under an /app prefix) can also be emulated.
Physical data is persisted in boltdb as key-value pairs in a b+tree, where the key of a KV pair is a revision triplet (major, sub, type):
- major is the main revision of the key; it is incremented by 1 for every committed transaction;
- sub distinguishes different keys modified within the same major revision; within one transaction, sub is incremented by 1 for each operation;
- type marks a special attribute of the key; for example, type=t means this revision is a tombstone, i.e. the key was deleted at this revision (etcd also needs to run compaction periodically to discard old versions and save disk space).
The value of the KV pair contains the change relative to the previous revision, i.e. an incremental delta against the previous version.
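As a rough illustration of the revision triplet above, the sketch below shows how a (major, sub) revision could be encoded as a lexicographically sortable boltdb key with a trailing 't' marking a tombstone. It mirrors the idea in etcd's mvcc package but is not its exact code; the helper names are illustrative:

// Simplified sketch of encoding a revision as a sortable boltdb key.
package mvccsketch

import "encoding/binary"

type revision struct {
    main int64 // major revision: +1 per committed transaction
    sub  int64 // sub revision: +1 per operation within a transaction
}

// revToBytes encodes the revision as big-endian main, a separator, then
// big-endian sub, so byte-wise ordering matches revision ordering.
func revToBytes(rev revision) []byte {
    b := make([]byte, 17)
    binary.BigEndian.PutUint64(b[0:8], uint64(rev.main))
    b[8] = '_'
    binary.BigEndian.PutUint64(b[9:17], uint64(rev.sub))
    return b
}

// markTombstone appends the 't' suffix (the "type" part of the triplet) so a
// delete at this revision can be recognized during reads and compaction.
func markTombstone(b []byte) []byte {
    return append(b, 't')
}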
Since the keys stored in boltdb are revisions, a mapping from the user key to its revisions, the keyIndex, must be maintained in memory. A keyIndex holds the user's original key, its generations and the last modification revision, where the generations contain the key's historical revisions:
type keyIndex struct {
    key         []byte
    modified    revision // the main rev of the last modification
    generations []generation
}

// generation contains multiple revisions of a key.
type generation struct {
    ver     int64
    created revision // when the generation is created (put in first revision).
    revs    []revision
}
A generation object holds the key's historical revision array: each new revision appends a new version to the revs array, which is how etcd implements MVCC (multi-version concurrency control). The in-memory index is also a b+tree, to speed up queries. If a keyIndex is modified many times, its revs arrays (e.g. generations[0].revs) can grow very large, and etcd's compaction mechanism is needed to compress them and actively discard some historical versions.
The following key lifecycle is described in a code comment in etcd's key_index.go:
// For example: put(1.0); put(2.0); tombstone(3.0); put(4.0); tombstone(5.0) on key "foo"
// generate a keyIndex:
// key:     "foo"
// rev: 5
// generations:
//    {empty}
//    {4.0, 5.0(t)}
//    {1.0, 2.0, 3.0(t)}
//
// Compact a keyIndex removes the versions with smaller or equal to
// rev except the largest one. If the generation becomes empty
// during compaction, it will be removed. If all the generations get
// removed, the keyIndex should be removed.
//
// For example:
// compact(2) on the previous example
// generations:
//    {empty}
//    {4.0, 5.0(t)}
//    {2.0, 3.0(t)}
//
// compact(4)
// generations:
//    {empty}
//    {4.0, 5.0(t)}
//
// compact(5):
// generations:
//    {empty} -> key SHOULD be removed.
//
// compact(6):
// generations:
//    {empty} -> key SHOULD be removed.
When a tombstone (delete) operation is executed, the current generation is terminated and a new empty generation is appended, and the revision of that last operation is marked with type=t.
To summarize: the in-memory b+tree maintains the mapping from user keys to keyIndex; the keyIndex speeds up queries and maintains the mapping to revisions; and through the revision, the user's value can be fetched from the actual database storage.
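To make that layering concrete, here is a deliberately simplified toy model of the lookup path (in-memory index plus a revision-keyed backend); the names and structure are illustrative and do not match etcd's internals:

// Toy model: user key -> revision history (in-memory index) -> value (backend).
package main

import (
    "errors"
    "fmt"
)

type revision struct{ main, sub int64 } // sub is unused in this toy model

type store struct {
    index   map[string][]revision // user key -> revision history (newest last)
    backend map[revision][]byte   // revision -> value (boltdb in real etcd)
    rev     int64                 // current main revision
}

func newStore() *store {
    return &store{index: map[string][]revision{}, backend: map[revision][]byte{}}
}

// put bumps the main revision (one transaction per put here), records it in
// the index and writes the value into the backend.
func (s *store) put(key string, value []byte) {
    s.rev++
    r := revision{main: s.rev}
    s.index[key] = append(s.index[key], r)
    s.backend[r] = value
}

// get resolves user key -> latest revision -> value.
func (s *store) get(key string) ([]byte, error) {
    revs := s.index[key]
    if len(revs) == 0 {
        return nil, errors.New("key not found")
    }
    return s.backend[revs[len(revs)-1]], nil
}

func main() {
    s := newStore()
    s.put("foo", []byte("bar1"))
    s.put("foo", []byte("bar2")) // a second version; MVCC keeps both
    v, _ := s.get("foo")
    fmt.Printf("foo@rev=%d -> %s\n", s.rev, v) // foo@rev=2 -> bar2
}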
Consistency algorithm and consensus
Let's go back to the quote from Designing Data-Intensive Applications at the beginning of this article:
“In distributed system design, skepticism, pessimism, and paranoia are all worthwhile; in fact, building more reliable systems from less reliable foundations is an old idea in computing.”
Distributed systems are very different from single-node systems, and in system design practice you will encounter all kinds of strange failures and problems, including:
- Partial cluster failure (kernel failure, disk failure, node restart, etc.)
- Unreliable network (data packet loss, network partition, network congestion, network timeout, etc.);
- Unreliable clocks (NTP synchronization relies on the network, local time jumps, etc.);
In a typical distributed environment there are no global variables and no shared memory, and nodes do not even know the exact time of day, not to mention more complex state. Information flows from one node to another only over an unreliable network, so a single node cannot make accurate decisions on its own; instead, multiple nodes must reach consensus so that all of them agree on a proposal. This requires the consensus algorithms that researchers in the distributed-systems field have explored and distilled over decades, the most famous of which are Paxos, Raft and Zab.
Paxos algorithm
First, a brief introduction to the Paxos algorithm, which defines three roles:
- Proposer: initiates Proposals;
- Acceptor: responds to and accepts Proposals; a Proposal accepted by a majority of Acceptors is Chosen;
- Learner: only synchronizes Chosen proposals and does not take part in approving them.
Getting a Proposal from initiation to Chosen is a two-phase process in the Paxos algorithm (a minimal Acceptor sketch follows this list):
- In the Prepare phase, a Proposer picks a unique, increasing proposalId and asks the Acceptors to promise not to accept lower-numbered proposals;
- In the Accept phase, the Proposer sends the Proposal (the proposalId together with a value); once a majority of Acceptors accept it, it becomes a resolution and is synchronized by the Learners.
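As a rough illustration of these two phases, here is a minimal sketch of a single Acceptor's rules, ignoring networking, persistence and the Learner role (a value is only Chosen once a majority of Acceptors accept it):

// Minimal single-decree Paxos Acceptor sketch (Prepare/Promise and Accept).
package paxossketch

type acceptor struct {
    promisedN int64  // highest proposalId promised so far
    acceptedN int64  // proposalId of the accepted value (0 if none)
    acceptedV string // accepted value, if any
}

// prepare handles Prepare(n): promise to ignore proposals numbered below n,
// and report any already-accepted value so the Proposer must adopt it.
func (a *acceptor) prepare(n int64) (promised bool, prevN int64, prevV string) {
    if n > a.promisedN {
        a.promisedN = n
        return true, a.acceptedN, a.acceptedV
    }
    return false, 0, ""
}

// accept handles Accept(n, v): accept unless a higher-numbered Prepare has
// been promised in the meantime.
func (a *acceptor) accept(n int64, v string) bool {
    if n >= a.promisedN {
        a.promisedN, a.acceptedN, a.acceptedV = n, n, v
        return true
    }
    return false
}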
Multiple constraints on the behaviour of Acceptors and Proposers guarantee the correctness of the algorithm. Paxos is known for its complexity, and a flawless engineering implementation is hard to achieve, so optimizations have evolved on top of Basic Paxos, such as Multi-Paxos (with a stable Leader, part of the Prepare/Promise phase is skipped to reduce cost) and Fast Paxos (see the Paxos algorithm literature for details). In this article we mainly look at ZooKeeper's Zab protocol (the ZooKeeper atomic broadcast protocol) and the Raft algorithm as put into engineering practice in etcd.
Zab protocol
Zab borrows some ideas from the Paxos algorithm, but it is not a concrete implementation of Paxos. Its full name is ZooKeeper Atomic Broadcast; ZooKeeper uses the Zab protocol to build a primary-backup system architecture and to guarantee the consistency of data across the distributed system.
A simple description of the Zab protocol: within a ZooKeeper cluster, a single process, the Leader node, handles all transaction operations; each transaction is converted into a Proposal and synchronized to all Follower nodes via the atomic broadcast protocol.
Therefore, in order to guarantee data consistency and ordering, ZooKeeper processes write requests only on the Leader node, so its write performance cannot scale horizontally.
Zab protocol has two basic modes:
- Crash recovery
- Message broadcast
1. Message broadcast mode
- The Leader accepts client transaction requests and broadcasts them as Proposals;
- Before broadcasting, each Proposal is assigned a globally, monotonically increasing unique ID, which is also ZooKeeper's transaction ID (ZXID); Proposals must be processed in ZXID order;
- The Leader assigns a separate queue to each Follower, places Proposals into the queues and sends them according to a FIFO policy;
- After receiving a Proposal, a Follower first writes the transaction log to disk and then sends an Ack to the Leader; once the Leader has received Acks from more than half of the nodes, it broadcasts a COMMIT message to the Followers and completes its own local commit (a minimal ack-counting sketch follows this list).
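The commit decision in the last step is essentially quorum counting. The sketch below illustrates that bookkeeping on the Leader side; it is not ZooKeeper's actual implementation, and the type names are made up:

// Illustrative quorum bookkeeping for Proposals on the Leader side.
// Real Zab also tracks per-Follower queues, epochs and ZXID ordering.
package zabsketch

type proposalTracker struct {
    clusterSize int
    acks        map[uint64]map[string]bool // zxid -> set of follower ids that acked
}

func newProposalTracker(clusterSize int) *proposalTracker {
    return &proposalTracker{clusterSize: clusterSize, acks: map[uint64]map[string]bool{}}
}

// onAck records an Ack and reports whether the Proposal has reached a quorum
// (more than half of the cluster, counting the Leader's own implicit ack).
func (t *proposalTracker) onAck(zxid uint64, followerID string) (commit bool) {
    if t.acks[zxid] == nil {
        t.acks[zxid] = map[string]bool{}
    }
    t.acks[zxid][followerID] = true
    return len(t.acks[zxid])+1 > t.clusterSize/2 // +1 for the Leader itself
}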
2. Crash recovery mode
When the ZooKeeper cluster starts up, or when the Leader loses contact with half of the Followers due to network problems, the cluster enters recovery mode and elects a new Leader. During crash recovery the Zab protocol must guarantee two properties:
- Transactions that have been Committed on the Leader must eventually be committed by all servers;
- Transactions that were only proposed on the Leader (but never committed) must be discarded.
To satisfy the above requirements, Zab encodes this into the design of the transaction ID (ZXID). A ZXID is a 64-bit number: the lower 32 bits are a simple monotonically increasing counter that the Leader increments by 1 for each transaction, and the upper 32 bits hold the epoch of the Leader's term. Each time a new Leader is elected, it takes the largest ZXID from its local transaction log, adds 1 to the upper 32 bits to form the new epoch, and restarts the lower 32 bits from 0 to generate new ZXIDs.
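A small sketch of that bit layout (purely illustrative, not code taken from ZooKeeper):

// Illustrative ZXID helpers: upper 32 bits = epoch, lower 32 bits = counter.
package zabsketch

func makeZxid(epoch, counter uint32) uint64 {
    return uint64(epoch)<<32 | uint64(counter)
}

func epochOf(zxid uint64) uint32 { return uint32(zxid >> 32) }

func counterOf(zxid uint64) uint32 { return uint32(zxid) }

// newEpochFirstZxid is what a newly elected Leader would generate: bump the
// epoch taken from the largest ZXID seen locally and reset the counter to 0.
func newEpochFirstZxid(largestSeen uint64) uint64 {
    return makeZxid(epochOf(largestSeen)+1, 0)
}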
Through this mechanism, Zab simplifies data resynchronization during Leader crash recovery. In extreme cases, re-electing a Leader and resynchronizing a large amount of data across the cluster may take some time, during which the ZooKeeper service is unavailable. This is the availability sacrifice Zab makes in order to guarantee strong data consistency.
In summary, from the Zab scenarios discussed above we can see that ZooKeeper is designed to be a strongly consistent distributed system: it satisfies CP in terms of the CAP theorem, and it comes with problems such as non-scalable write performance and slow crash recovery. Specifically:
- A CP design cannot satisfy the high-availability requirements of a service registry, which can hardly afford to be unavailable;
- Write performance cannot scale horizontally, and once the number of services grows, restarting a large batch of services at once can be a disaster;
- ZooKeeper's session-based liveness detection is not completely reliable; health checks should be actively reported by the service provider itself;
- For disaster recovery, service invocation should not be affected even if the registry is completely down, which requires clients to cache the service list proactively, something the native ZooKeeper client does not support;
- As an elder of the distributed-system family, ZooKeeper issues often do not get timely support and frequently have to be driven and maintained by ZooKeeper users themselves.
For the registry scenario discussed in the article mentioned earlier, these problems become fatal once the system grows. It is therefore important to choose ZooKeeper's usage scenarios carefully in a project: in read-heavy, write-light scenarios with strong consistency requirements (such as coordination for big-data offline computation or data storage), ZooKeeper is still a sharp tool in the distributed architecture toolbox.
Raft algorithm
Compared with Paxos, the Raft algorithm reduces the complexity of the protocol, making it easier for engineers to understand and to put into engineering practice. Raft splits the consensus problem into two sub-problems: leader election and log replication.
I. Leader Election
Members of a Raft cluster can be in one of three states:
- Leader: handles all transaction operations and replicates them to the other replica nodes;
- Follower: receives replicated data from the Leader and writes it to its local log; each Follower maintains a heartbeat with the Leader;
- Candidate: the intermediate state between Follower and Leader; when the Leader crashes and its heartbeat is lost, Followers turn into Candidates and start a new round of elections until a new Leader is produced.
The Raft protocol maintains a single term across the cluster, i.e. the Leader's term of office, and the term_id increases with every new round of elections. At startup all members of the cluster are Followers, and any Follower may take part in an election: it increments its local maximum term_id by 1, switches to the Candidate state and requests votes from the other servers. Only after receiving votes from more than half of the nodes does it become the Leader; the other Candidates fall back to Followers and start log replication (the Byzantine generals problem is not considered here; in theory every node in the cluster environment is trusted).
How does Raft handle the abnormal scenario in which no Candidate receives more than half of the votes? Two timeouts in the Raft protocol deal with such scenarios (a small sketch follows this list):
- Election timeout: the time a Follower waits before becoming a Candidate, a random value between 150 and 300 ms, which makes it unlikely that several nodes become Candidates and start a vote at the same moment. Each node grants at most one vote per term, and after voting its election timeout is reset, ensuring that a new election can be started if no news of a successful Leader election arrives.
- Heartbeat timeout: the heartbeat interval maintained between the Followers and the Leader after a successful election. When a Follower loses the heartbeat, it starts waiting again, becomes a Candidate and initiates a new vote.
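The randomized election timeout can be sketched as follows; the 150-300 ms range follows the text above, while real implementations usually make it configurable, and the function names here are illustrative:

// Illustrative sketch of the randomized election timeout and a Follower loop.
package raftsketch

import (
    "math/rand"
    "time"
)

func electionTimeout() time.Duration {
    return 150*time.Millisecond + time.Duration(rand.Intn(150))*time.Millisecond
}

// runFollower waits for heartbeats; if none arrives within the election
// timeout, the node switches to the Candidate state and requests votes.
func runFollower(heartbeat <-chan struct{}, becomeCandidate func()) {
    timer := time.NewTimer(electionTimeout())
    defer timer.Stop()
    for {
        select {
        case <-heartbeat: // heartbeat from the Leader resets the timer
            if !timer.Stop() {
                <-timer.C
            }
            timer.Reset(electionTimeout())
        case <-timer.C: // timed out: start an election
            becomeCandidate()
            return
        }
    }
}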
II. Log Replication
After the Leader is elected, data synchronization, i.e. log replication, begins. As with Zab, the Leader node in Raft receives all client write requests. The replication process is as follows:
- After receiving a transaction request, the Leader appends it to its local log (typically a WAL), marks it as uncommitted, and sends a replication request to all Followers; replication requests and heartbeats use the same message type, Append Entries;
- A Follower sends an Ack to the Leader after receiving the Append Entries message;
- After receiving Acks from more than half of the nodes, the Leader commits the transaction, returns success to the client and instructs the Followers to commit as well.
Raft must ensure that every committed log entry is applied on all nodes. If a Follower misses a transaction because of network problems, or receives it but fails to send an Ack, the handling is basically the same as in Zab: the Leader maintains, for each Follower, the log index that Follower has processed, and every Append Entries message carries the term, the index and the data (as in etcd's WAL). If a Follower runs into any of the anomalies above, the Append Entries message is resent until it succeeds.
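The following sketch illustrates the Append Entries message and the Follower-side consistency check it implies. The field names roughly follow the Raft paper rather than etcd's actual structures, and conflict handling is simplified:

// Illustrative Append Entries handling on a Follower: the entry preceding the
// new ones must match in both index and term, otherwise the Follower rejects
// and the Leader retries with an earlier index.
package raftsketch

type entry struct {
    Term  uint64
    Index uint64
    Data  []byte
}

type appendEntries struct {
    Term         uint64  // Leader's current term
    PrevLogIndex uint64  // index of the entry immediately before Entries
    PrevLogTerm  uint64  // term of that entry
    Entries      []entry // new entries to replicate (empty for heartbeat)
    LeaderCommit uint64  // highest index known to be committed
}

type follower struct {
    currentTerm uint64
    log         []entry // log[0] is a sentinel entry at index 0, term 0
    commitIndex uint64
}

func newFollower() *follower {
    return &follower{log: []entry{{}}} // sentinel at index 0
}

func (f *follower) handleAppendEntries(msg appendEntries) (ok bool) {
    if msg.Term < f.currentTerm {
        return false // stale Leader
    }
    f.currentTerm = msg.Term
    // Consistency check: do we have the entry the Leader assumes we have?
    if msg.PrevLogIndex >= uint64(len(f.log)) || f.log[msg.PrevLogIndex].Term != msg.PrevLogTerm {
        return false // Leader will retry with a smaller PrevLogIndex
    }
    // Simplification: drop any conflicting suffix and append the new entries.
    f.log = append(f.log[:msg.PrevLogIndex+1], msg.Entries...)
    // Advance the commit index as instructed by the Leader.
    if msg.LeaderCommit > f.commitIndex {
        if last := f.log[len(f.log)-1].Index; msg.LeaderCommit < last {
            f.commitIndex = msg.LeaderCommit
        } else {
            f.commitIndex = last
        }
    }
    return true
}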
A plain-text description of the Raft protocol can be rather dry, and this article only covers some common scenarios. For more cases, such as leader switchover, network partitions and cluster topology changes, the Raft animation is much easier to follow, and the Zhihu article on the Raft protocol covers it in great detail.
Scenario comparison
Finally, with reference to the "Why etcd" section of the official etcd documentation, etcd and ZooKeeper are compared in terms of usage scenarios and features.
etcd and ZooKeeper solve the same problem: distributed system coordination and metadata storage. As a latecomer, however, etcd can stand on the shoulders of ZooKeeper and take a more forward-looking view. In its system design, etcd drew extensively on ZooKeeper's design and implementation, with many references in its consensus protocol practice and data structure design. While learning from ZooKeeper, etcd also implements several capabilities beyond it:
- Dynamic cluster topology adjustment (scaling the cluster out and in at runtime);
- More stable read and write throughput under high load;
- An MVCC (multi-version concurrency control) data model;
- Reliable event watching and replay from a specific version, based on the MVCC data model;
- A lease mechanism that decouples connections from sessions;
- Safer API calls.
At the same time, etcd offers better support for calls from multiple languages. For implementation reasons, ZooKeeper is largely confined to the Java ecosystem, whereas etcd provides convenient APIs for many languages, and even a simple curl command can invoke its HTTP service.
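As a quick illustration of the Go client experience, here is a minimal sketch that puts a key and reads it back, assuming the official go.etcd.io/etcd/client/v3 module and a local endpoint on 127.0.0.1:2379:

// Minimal etcd v3 Go client sketch: put a key, then read it back.
package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"127.0.0.1:2379"},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()

    if _, err := cli.Put(ctx, "/app/config", "hello"); err != nil {
        panic(err)
    }
    resp, err := cli.Get(ctx, "/app/config")
    if err != nil {
        panic(err)
    }
    for _, kv := range resp.Kvs {
        fmt.Printf("%s = %s (mod_revision=%d)\n", kv.Key, kv.Value, kv.ModRevision)
    }
}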
Therefore, when designing a system architecture, etcd can be the preferred choice for its higher performance, richer features and more responsive community support. etcd can also let some legacy systems keep using the ZooKeeper API style to access an etcd cluster; see the zetcd project for details.
In addition, for RPC service-discovery scenarios there are systems more focused on that particular niche, such as Consul, Eureka and Nacos. For long-term architecture planning and that specific application scenario, these open-source systems are better suited than etcd and ZooKeeper as the service-discovery component of a distributed RPC architecture, and they have established best practices for larger-scale service discovery.
References:
- ZooKeeper official documentation
- From Paxos to Zookeeper
- Dubbo ZooKeeper registry documentation
- etcd architecture and implementation analysis
- etcd design principle analysis
- etcd v3 principle analysis
- Raft protocol in detail