ETCD introduction

What is ETCD?

etcd is a distributed, highly available, and strongly consistent key-value store written in Go. It provides reliable distributed key-value storage, configuration sharing, and service discovery.

It has the following characteristics:

  • Simple:
    • Easy to use: exposes an HTTP + JSON API that can be called with curl or a client library (a Go client sketch follows this list);
    • Easy to deploy: written in Go, cross-platform, and easy to deploy and maintain.
  • Reliable:
    • Strong consistency: the Raft algorithm guarantees strong consistency of the distributed data;
    • High availability: fault tolerant; a cluster of N nodes keeps serving as long as no more than (N-1)/2 nodes fail.
    • Persistence: data updates are persisted to disk as a write-ahead log (WAL), and snapshots are supported.
  • Fast: a single instance supports roughly 1,000 writes per second, with peak write performance up to 10K QPS.
  • Secure: optional SSL/TLS client certificate authentication.
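
To make the “easy to use” point concrete, here is a minimal sketch of writing and reading a key with the official etcd v3 Go client (go.etcd.io/etcd/client/v3); the endpoint and key names are just placeholders for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to a local single-node etcd instance.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Write a key, then read it back.
	if _, err := cli.Put(ctx, "/config/app/mode", "production"); err != nil {
		panic(err)
	}
	resp, err := cli.Get(ctx, "/config/app/mode")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}
}
```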

The overall architecture

From the architecture diagram of ETCD, we can see that ETCD is divided into four parts:

  • HTTP Server: handles API requests sent by users, as well as synchronization and heartbeat requests from other etcd nodes.
  • Store: handles the transactions behind the various features etcd supports, including data indexing, node state changes, monitoring and feedback, event handling and execution, and so on. It is the concrete implementation of most of the API functionality etcd exposes to users.
  • Raft: the concrete implementation of the Raft consensus algorithm; it is the core of etcd.
  • WAL: the Write Ahead Log, etcd’s data storage format. Besides keeping the full data state and the node index in memory, etcd persists data through the WAL: every change is written to the log before it is committed. A Snapshot is a state snapshot taken to keep the log from growing too large, and an Entry is the concrete content of a log record. A simplified sketch of these record types follows this list.
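
To make the Entry and Snapshot terms a bit more concrete, here is a simplified sketch of what each record conceptually carries; the field names loosely mirror the raft Entry type etcd uses internally and are not etcd’s actual on-disk WAL format.

```go
package walsketch

// Entry is a single logged change: which term created it, where it sits in the
// log, and the serialized request being recorded.
type Entry struct {
	Term  uint64 // term in which the entry was created
	Index uint64 // the entry's position in the log
	Data  []byte // the serialized request/command being logged
}

// Snapshot captures the state of the store up to a given log position so that
// older log entries can be discarded.
type Snapshot struct {
	Term  uint64 // term of the last entry covered by this snapshot
	Index uint64 // index of the last entry covered by this snapshot
	State []byte // serialized state of the store up to Index
}
```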

The Raft protocol

Basic concepts

Terminology

The Raft protocol contains three types of roles:

  • Leader: elected by a vote of the nodes; only one leader can exist at a time;
  • Candidate: when there is no leader, a node can become a candidate and compete for the leadership;
  • Follower: an ordinary node that follows the leader; this one is easy to understand, so I won’t explain it further.

There are also a few important concepts in the election process:

  • Leader Election: the process of choosing a leader from among the candidates;
  • Term: a monotonically increasing number; every new election produces a new term;
  • Election Timeout: if a follower does not receive a heartbeat from the leader within this time, a new election is triggered.

Role transitions

This picture shows the transitions between the leader, candidate, and follower roles. A brief summary:

  • Follower -> Candidate: when an election starts, or when the election timeout expires
  • Candidate -> Candidate: when the election times out without a result and a new term begins
  • Candidate -> Leader: when it obtains a majority of the votes
  • Candidate -> Follower: when another node becomes the leader, or a new term begins
  • Leader -> Follower: automatically steps down when it finds that its term is smaller than another node’s
  • Note: Each case is explained in detail later; a rough code sketch of these transitions follows this list.
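
Here is a minimal Go sketch of the three roles and the transitions listed above; the Node type and method names are illustrative, not etcd’s implementation.

```go
package raftsketch

// Role models the three Raft roles.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// Node holds just enough state to express the transitions.
type Node struct {
	role Role
	term uint64
}

// Follower -> Candidate: the election timer fired without hearing from a leader.
func (n *Node) startElection() {
	n.role = Candidate
	n.term++ // starting an election begins a new term
}

// Candidate -> Leader: the candidate collected a majority of votes.
func (n *Node) winElection() {
	n.role = Leader
}

// Candidate or Leader -> Follower: a message with a higher term was seen,
// meaning another node became leader or a new term has begun.
func (n *Node) stepDown(newTerm uint64) {
	n.term = newTerm
	n.role = Follower
}
```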

The election

Leader election

To help with the explanation, I have drawn a sketch of the “election timer”, which is essentially each node’s timeout.


Becoming a candidate: each node has its own timeout, chosen at random between 150 ms and 300 ms, so the probability that two nodes pick the same value is small. Node B is the first node to time out, and it becomes the candidate.


Electing the leader: candidate B requests votes, and followers A and C vote for it. Once candidate B has won a majority of the votes, it becomes the leader.


Heartbeats: to maintain its leadership, leader B must keep sending heartbeats to the followers. Whenever followers A and C receive a heartbeat from leader B, their election timers reset to 0 and start counting again.

Note that the leader’s heartbeat broadcast interval must be shorter than the election timer’s timeout; otherwise the followers will keep becoming candidates, leading to frequent elections and leader changes.
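
As a rough sketch of the mechanism just described, the snippet below picks a random 150-300 ms election timeout and resets it on every heartbeat; the heartbeat channel and becomeCandidate callback are assumed plumbing, not etcd’s API.

```go
package raftsketch

import (
	"math/rand"
	"time"
)

// electionTimeout picks a random timeout in the 150-300 ms range mentioned above.
func electionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(151)) * time.Millisecond
}

// runFollower waits for heartbeats; every heartbeat resets the timer, and if
// the timer fires first the node becomes a candidate.
func runFollower(heartbeat <-chan struct{}, becomeCandidate func()) {
	timer := time.NewTimer(electionTimeout())
	defer timer.Stop()
	for {
		select {
		case <-heartbeat:
			// Heartbeat arrived in time: reset the countdown to a fresh random value.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(electionTimeout())
		case <-timer.C:
			// No heartbeat before the timeout: start an election.
			becomeCandidate()
			return
		}
	}
}
```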


The leader fails

When leader B fails, the election timers of followers A and C keep running. Follower A times out first and becomes the candidate. The rest of the process is the same as leader election: request votes -> receive votes -> become leader -> send heartbeats.


Multiple candidates

When there are multiple candidates, A and D, both request votes at the same time. If the vote counts differ, the node that first wins a majority becomes the leader; if the votes are split evenly, a new election is held.


When node C then becomes the new candidate with Term 5, a new round of voting is initiated. As the other nodes vote, they update their own Term values, and node C is finally elected as the new leader.
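
The rule underlying both scenarios is a simple majority check, sketched below (names are illustrative): a candidate wins only with more than half of the votes; in an even split nobody wins, the election times out, the term is incremented, and voting starts again.

```go
package raftsketch

// hasMajority reports whether a candidate has won the election. clusterSize
// includes the candidate itself, and grantedVotes includes its own vote.
// For example, in a 5-node cluster 3 votes win, while a 2-2 split does not,
// so the election times out and is retried with a higher term.
func hasMajority(grantedVotes, clusterSize int) bool {
	return grantedVotes > clusterSize/2
}
```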


Log replication

Replicated state machine

The basic idea of a replicated state machine is a distributed state machine: the system consists of multiple replicas, each of which is a state machine whose state is determined by its operation log. As shown in the figure below, the consensus module on each server receives commands from clients and appends them to its own operation log; it communicates with the consensus modules on the other servers to ensure that every server’s log eventually contains the same commands in the same order. Once the commands have been replicated correctly, each server’s state machine processes them in log order and returns the output to the client.
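
A minimal sketch of this idea, using a key-value map as the state machine (the types are illustrative): because every replica applies the same committed commands in the same order, every replica ends up in the same state.

```go
package raftsketch

// Command is one logged operation on the key-value state machine.
type Command struct {
	Key, Value string
}

// KVStateMachine is the per-replica state machine.
type KVStateMachine struct {
	data map[string]string
}

// Apply replays committed log entries strictly in log order.
func (sm *KVStateMachine) Apply(committed []Command) {
	if sm.data == nil {
		sm.data = make(map[string]string)
	}
	for _, cmd := range committed {
		sm.data[cmd.Key] = cmd.Value
	}
}
```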


Data Synchronization Process

The data synchronization process borrows the idea of the replicated state machine: first “commit”, then “apply”. When a client sends a data update request, the request first goes to leader node C, which appends the update to its own log and then notifies the follower nodes to append it as well. When a follower appends the entry successfully, it returns a success response to leader C, completing the “commit” step. When leader C receives these responses, it applies the update to its local data, tells the followers to apply it locally too, and returns a success response to the client, completing the “apply” step. Any further update from the client repeats the same process. A rough leader-side sketch of this flow follows.
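
This sketch assumes replicate and apply are helper functions standing in for the real RPCs; it is not etcd’s actual code, just the shape of the commit-then-apply flow.

```go
package raftsketch

import "errors"

// handleClientUpdate sketches the leader's side of the flow: append to the
// local log, replicate to the followers ("commit"), then apply the entry and
// answer the client ("apply").
func handleClientUpdate(cmd string, log *[]string,
	replicate func(cmd string) bool, apply func(cmd string)) error {
	*log = append(*log, cmd) // leader appends the update to its own log
	if !replicate(cmd) {     // followers append it and acknowledge
		return errors.New("commit failed: followers did not acknowledge")
	}
	apply(cmd) // leader applies the committed entry locally
	return nil // success is reported back to the client
}
```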


Log principle

Each log entry typically contains three attributes: an integer index (Log Index), a term number (Term), and a command (Command). The integer index is the entry’s slot in the log file; the term number, shown as the number in each box in the figure, is used to detect log inconsistencies between servers; and the command, shown as the arrowed numbers in the figure, is the external instruction the state machine executes.

The leader decides when it is safe to apply a log entry to the state machine, i.e. when it is committed. An entry created by the leader is committable once it has been replicated to more than half of the nodes. For example, entry 9 in the figure is replicated on four of the seven nodes, so entry 9 can be committed; entry 10 is replicated on only three nodes, so entry 10 is not yet committable.
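
One common way to express this rule in code is sketched below: given the highest replicated index known for every node (leader included), the median of the sorted values is the largest index stored on a majority. The function and field names are assumptions for illustration.

```go
package raftsketch

import "sort"

// commitIndex returns the highest index that is replicated on a majority of
// nodes. matchIndex holds, for every node including the leader, the highest
// log index known to be stored on that node.
func commitIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// With 7 nodes, sorted[3] is stored on at least 4 of them, matching the
	// "entry 9 is on four of seven nodes" example above.
	return sorted[(len(sorted)-1)/2]
}
```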


Normally the leader’s and followers’ logs stay identical. But if the leader crashes before it has replicated all of its log entries to the other nodes, the logs can become inconsistent. In the Raft algorithm, the leader forces the followers to keep logs identical to its own, so follower entries that conflict with the leader are overwritten by the leader’s entries. To make this work, the leader has to know where a follower’s log first diverges from its own. So how does the leader find the exact slot where the follower’s log becomes inconsistent?

The leader maintains a nextIndex for each follower, which is the index of the next log entry the leader will send to that follower. When a leader wins an election, it assumes that every follower’s log is consistent with its own, so it initializes nextIndex to its latest log entry index + 1. In the figure above, the leader’s latest log entry index is 10, so nextIndex is initialized to 11. When the leader sends an AppendEntries RPC to a follower, it carries the pair (term_id, nextIndex - 1), where term_id is the term of the log entry in slot nextIndex - 1. On receiving the AppendEntries RPC, the follower performs a consistency check: it looks for a log entry with that index and term in its own log file. If no such entry exists, the follower rejects the AppendEntries RPC, the leader decrements nextIndex, and the RPC is retried until it succeeds. The rest of the logic is straightforward: the follower keeps all log entries before nextIndex, deletes everything from nextIndex onward, and then replicates all entries from nextIndex onward from the leader.

The above only describes the method; here is an example, based on the figure above, to deepen understanding. Starting from the leader’s nextIndex of 11, the leader sends AppendEntries RPCs carrying (6,9), (6,8), (5,7), (5,6), (4,5), (4,4) in turn, and none of them is found in the follower’s log until nextIndex has dropped to 4. The follower then deletes all of its log entries from nextIndex = 4 onward and appends all of the leader’s entries from nextIndex = 4 onward.
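
The retry loop can be sketched as below; note that it uses 0-based slice positions rather than the 1-based indices in the figure, and sendAppendEntries is an assumed stand-in for the real RPC.

```go
package raftsketch

// logEntry is a minimal stand-in for one slot of the log.
type logEntry struct {
	Term uint64
	Data []byte
}

// syncFollower probes the follower with (prevTerm, prevIndex); on rejection it
// decrements nextIndex and tries again. Once the probe matches, the follower
// drops its conflicting suffix and copies the leader's entries from nextIndex
// onward. Returns the index the follower's log has caught up to.
func syncFollower(leaderLog []logEntry, nextIndex uint64,
	sendAppendEntries func(prevIndex, prevTerm uint64, entries []logEntry) bool) uint64 {
	for nextIndex > 0 {
		prevIndex := nextIndex - 1
		prevTerm := leaderLog[prevIndex].Term
		if sendAppendEntries(prevIndex, prevTerm, leaderLog[nextIndex:]) {
			return uint64(len(leaderLog)) // follower now matches the leader's log
		}
		nextIndex-- // consistency check failed: step back one slot and retry
	}
	return nextIndex
}
```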

Split brain

When a network partition causes a split brain, there are two leaders, and each partition can be thought of as an independent network. Because the original leader is alone in its partition, data submitted to it can never be replicated to a majority of nodes, so it can never be committed. This is what the fourth picture shows (SET 3 is not committed).


After the network is restored, the old leader finds that the Term of the new leader in the cluster is higher than its own and automatically steps down to follower. It then synchronizes data from the new leader to restore data consistency across the cluster. For details of how the data is synchronized, see the “Log principle” section above.
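
The rule that resolves the split brain can be sketched in a few lines (names are illustrative): any node, including the old leader, that sees a message carrying a higher term adopts that term and falls back to follower.

```go
package raftsketch

// maybeStepDown applies the higher-term rule to an incoming message and
// returns the node's new term and whether it is still the leader.
func maybeStepDown(currentTerm, msgTerm uint64, isLeader bool) (term uint64, leader bool) {
	if msgTerm > currentTerm {
		return msgTerm, false // the other partition has held newer elections: step down
	}
	return currentTerm, isLeader
}
```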


In fact, split brain is only one of the possible abnormal situations. All kinds of problems can arise while the leader is notifying followers to update the log and while the leader is committing the update; I will not elaborate on them here. For details, see the chapter “1.4.3 Abnormal Situations” in the book “Cornerstone of Cloud Native Distributed Storage – ETCD In-depth Analysis”, which explains them quite clearly.

Please give this article a like. For more articles, follow the WeChat public account “Lou Zai advanced road”: hit follow and don’t get lost ~~