Translated from Eli Bendersky’s blog series, with permission from the original author.
This article is the introduction to a series designed to present the Raft distributed consistency protocol and its implementation in Go. The full list of articles is as follows:
- Preface: Introduction (this article)
- Part one: Leader election
- Part two: Commands and log replication
- Part three: Persistence and optimizations
Raft is a relatively new algorithm (2014), but it is already widely used in industry. The best-known example is probably Kubernetes, whose distributed key-value store, etcd, relies on the Raft protocol.
The purpose of this series of articles is to describe a fully functional and rigorously tested implementation of the Raft protocol, and to provide some intuition for how Raft works along the way. It is not meant to be the only way you learn Raft. I assume that you have at least read the Raft paper; it is also highly recommended that you take the time to explore the resources on the Raft website: watch a talk or two by the creators, play around with the visualization of the algorithm, skim Ongaro's PhD thesis for more details, and so on.
Don’t expect to master the Raft protocol completely in one day. Although Raft is designed to be easier to understand than Paxos, the Raft algorithm is still quite complex. The problem it is trying to solve (distributed consistency) is a difficult one, so the solution is naturally not easy.
Replicated state machines
Distributed consistency algorithms can be thought of as solving the problem of replicating a deterministic state machine across multiple servers. The term state machine here can stand for any service; after all, state machines are one of the foundations of computer science, and everything can be represented as one. Databases, file servers, lock servers, and so on can all be thought of as complex state machines.
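To make the state-machine view a bit more concrete, here is a minimal Go sketch of a deterministic key-value store driven by commands. It is not taken from the series' code; the Command and KVStateMachine names are purely illustrative.

```go
package main

import "fmt"

// Command is one input to the state machine. The field names here are
// illustrative only.
type Command struct {
	Op    string // "set" or "get"
	Key   string
	Value string
}

// KVStateMachine is a deterministic key-value store: applying the same
// sequence of commands from the same starting state always yields the
// same final state.
type KVStateMachine struct {
	data map[string]string
}

func NewKVStateMachine() *KVStateMachine {
	return &KVStateMachine{data: make(map[string]string)}
}

// Apply executes a single command and returns its result.
func (sm *KVStateMachine) Apply(cmd Command) string {
	switch cmd.Op {
	case "set":
		sm.data[cmd.Key] = cmd.Value
		return "OK"
	case "get":
		return sm.data[cmd.Key]
	}
	return ""
}

func main() {
	sm := NewKVStateMachine()
	fmt.Println(sm.Apply(Command{Op: "set", Key: "x", Value: "42"})) // OK
	fmt.Println(sm.Apply(Command{Op: "get", Key: "x"}))              // 42
}
```

Because Apply has no hidden sources of nondeterminism, any copy of this machine fed the same commands ends up with the same contents.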
Consider a state machine that represents some service: multiple clients can connect to it, send requests, and expect responses:
This system works as long as the server running the state machine is stable and reliable. If the server crashes, our service becomes unavailable, which is unacceptable. In general, our service is only as reliable as the single server it runs on.
One common way to improve service reliability is replication. We can run multiple instances of the service on different servers. This creates a cluster of servers that work together to provide the service, and the failure of any single one of them will not cause a service disruption. Isolating the servers [1] further improves reliability by eliminating failure modes that would affect multiple servers at the same time.
Instead of connecting to a single serving machine, clients connect to the whole cluster. In addition, the service replicas that make up the cluster must communicate with each other to correctly replicate the state:
Each state machine in the figure above is a copy of a service. The idea is that all state machines run synchronously, taking the same input from client requests and performing the same state transitions. This ensures that even if some servers in the cluster fail, the same results will be returned to the client. Raft is one algorithm for this purpose.
It’s a good time to clarify some terms that will be used frequently later in this article:
- A Service is a logical task that we will implement in a distributed system, for example, a key-value database.
- Server or Replica: an instance of a Raft service running on an isolated machine that connects to other replicas and to clients over the network.
- Cluster: a set of Raft servers that collaborate to implement a distributed service. Typical cluster sizes are 3 or 5.
Consensus module and Raft log
Now let's zoom in on one of the state machines shown above. Raft is a generic algorithm and does not care how the service represented by the state machine is implemented. Its goal is to reliably and accurately record and reproduce the sequence of inputs (called commands in Raft terminology) received by the state machine; given the initial state and all the inputs, the state machine can be replayed exactly. Another way to think about it: if we have two independent copies of the same state machine and send them the same sequence of inputs from the same starting state, both copies will end up in the same state and produce the same outputs.
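A tiny illustration of this determinism (again only a sketch, not part of the series' code): two independent copies of a trivial state machine, fed the same command sequence, always end in the same state.

```go
package main

import "fmt"

// Counter is a trivially deterministic state machine: its state is an
// integer and its commands are increments. Given the same starting state
// and the same command sequence, any two copies end in the same state.
type Counter struct{ state int }

func (c *Counter) Apply(delta int) { c.state += delta }

func main() {
	commands := []int{5, -2, 7, 1} // the "log" of inputs

	var a, b Counter // two independent replicas, same initial state
	for _, cmd := range commands {
		a.Apply(cmd)
		b.Apply(cmd)
	}
	fmt.Println(a.state == b.state, a.state) // true 11
}
```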
Here is the structure of a generic service using Raft:
These components are described in detail as follows:
- The state machine is the same one we discussed above. It represents an arbitrary service; key-value storage is a common example used when introducing Raft.
- The Log is where all commands (inputs) issued by clients are stored. The commands are not applied directly to the state machine; instead, Raft commits them only once they have been successfully replicated to a majority of the servers. In addition, the log is persistent: it is kept in stable storage that survives system crashes, and after a crash it can be replayed to the state machine.
- The consensus module is the heart of the Raft algorithm. It takes commands from clients, makes sure they are saved in the log, replicates them to the other Raft replicas in the cluster (the green arrows in the figure above), and commits them to the state machine when it is safe to do so; once a command is committed, the client is notified of the new state. A rough sketch of these pieces in Go follows this list.
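The sketch below loosely mirrors the kind of structure used later in the series, but the exact types and field names here should be read as assumptions for illustration rather than the series' actual API.

```go
package raft

// LogEntry is one client command stored in the Raft log, tagged with the
// term in which the leader received it.
type LogEntry struct {
	Command interface{}
	Term    int
}

// CommitEntry is what the consensus module reports to the service once an
// entry has been safely replicated and can be applied to the state machine.
type CommitEntry struct {
	Command interface{}
	Index   int
	Term    int
}

// ConsensusModule holds the state a single replica needs: its identity,
// its peers, its own copy of the log, and a channel on which committed
// entries are delivered to the service's state machine.
type ConsensusModule struct {
	id          int
	peerIds     []int
	currentTerm int
	log         []LogEntry
	commitChan  chan<- CommitEntry
}
```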
Leaders and followers
Raft uses a strong leadership model, in which one replica in the cluster is the leader and the others are followers. The leader is responsible for accepting client requests, replicating commands to the followers, and returning responses to clients.
In normal operation, the followers' job is simply to replicate the leader's log. If the leader fails or becomes disconnected from the network, one of the followers takes over, so the service remains available.
There are pros and cons to this model. An important advantage is simplicity: data always flows from the leader to the followers, and only the leader responds to client requests. This design makes Raft easier to analyze, test, and debug. The downside is performance: since only one server in the cluster talks to clients, it can become a bottleneck when client requests surge. The usual answer is that Raft is simply not suited to high-traffic services. It is a better fit for low-traffic services that guarantee consistency at the expense of availability; we will come back to this point in the section on fault tolerance.
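For illustration only, the roles a replica can take on in this model can be expressed as a small enumeration in Go. The names below are assumptions; the Candidate state only becomes relevant during elections, which the next part of the series covers.

```go
package raft

// CMState enumerates the roles of a replica. In the strong-leader model,
// exactly one replica is the Leader during normal operation and the rest
// are Followers; Candidate is a transitional state used during elections.
type CMState int

const (
	Follower CMState = iota
	Candidate
	Leader
	Dead
)

func (s CMState) String() string {
	switch s {
	case Follower:
		return "Follower"
	case Candidate:
		return "Candidate"
	case Leader:
		return "Leader"
	case Dead:
		return "Dead"
	}
	return "Unknown"
}
```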
Client interaction
What did it mean when I wrote earlier that "clients will no longer connect to a single serving machine, but to the whole cluster"? A cluster is just a group of servers connected over the network, so how do you connect to "the whole cluster"?
The answer is simple:
- When accessing a Raft cluster, the client knows the network addresses of the cluster's replicas. How it knows them (for example, through some service discovery mechanism) is beyond the scope of this article.
- The client initially sends its request to an arbitrary replica. If that replica is the leader, it accepts the request immediately and the client waits for the full response. From then on, the client remembers which replica is the leader and does not have to search for it again (unless it encounters some failure, such as a leader crash).
- If the replica says it is not the leader, the client tries another replica. A possible optimization here is for the follower to tell the client directly which replica is the leader; since replicas constantly communicate with each other, they usually know the right answer, and this saves the client a few guesses.
- There is another case in which the client may realize it is not connected to the leader: when its request is not committed within a certain timeout. The replica it talked to may not actually be the leader (even if it still thinks it is), for example because it has been cut off from the other Raft servers. When the timeout elapses, the client goes looking for the leader again. (A rough sketch of such a client loop follows this list.)
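As referenced in the list above, here is a rough, hypothetical sketch of such a client loop. The Client type, the call helper, and ErrNotLeader are inventions for illustration, not the series' actual client API.

```go
package client

import (
	"errors"
	"time"
)

var ErrNotLeader = errors.New("replica is not the leader")

type Client struct {
	replicaAddrs []string // known from some service discovery mechanism
	leaderHint   int      // cached index of the last known leader
}

// call sends a request to one replica. It returns ErrNotLeader if that
// replica is not the leader, or another error if the request times out
// without being committed. The actual network call is elided here.
func (c *Client) call(addr string, req []byte) ([]byte, error) {
	return nil, ErrNotLeader
}

// Submit tries the cached leader first, then walks the other replicas
// until one of them accepts and commits the request.
func (c *Client) Submit(req []byte) ([]byte, error) {
	for attempt := 0; attempt < len(c.replicaAddrs)*2; attempt++ {
		idx := (c.leaderHint + attempt) % len(c.replicaAddrs)
		resp, err := c.call(c.replicaAddrs[idx], req)
		if err == nil {
			c.leaderHint = idx // remember the leader for next time
			return resp, nil
		}
		// Not the leader, or the request was never committed:
		// try the next replica after a short pause.
		time.Sleep(50 * time.Millisecond)
	}
	return nil, errors.New("could not find a leader")
}
```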
The optimization mentioned in the third point is not strictly necessary in most cases. In Raft it is useful to distinguish between "normal operation" and "fault scenarios". A service typically spends 99.9% of its time in normal operation, during which clients know who the leader is because they cached this information when they first contacted the service. Fault scenarios certainly cause confusion (more on that in the next section), but only for a short time. As we will cover in more detail in the next articles, a Raft cluster recovers from temporary machine failures or network partitions very quickly, in most cases in well under a second. There will be a short window of unavailability while the new leader asserts its leadership and clients discover which replica it is, after which the cluster returns to its "normal operating mode".
Raft fault tolerance and the CAP theorem
Let's take a look at a schematic of a three-replica Raft cluster, this time without connected clients:
What types of failures can we expect in this cluster?
Every component of a modern computer can fail, but for the sake of this discussion let's treat a server running a Raft instance as an atomic unit. That leaves us with two main categories of failures:
- Server crash: one of the servers stops responding to all network requests for a period of time. A crashed server is usually restarted and rejoins the cluster after a brief interruption.
- Network partition: one or more servers become disconnected from the other servers and/or from clients, due to a problem with network equipment or the transmission medium.
From the point of view of a server A communicating with a server B, a crash of B is indistinguishable from a network partition between A and B. In both cases the behavior is the same: A stops receiving messages or responses from B. From a system-wide point of view, however, network partitions are more insidious because they affect multiple servers at once. In the next parts of this series we will discuss some of the tricky scenarios that partitions can cause.
To deal gracefully with arbitrary server failures and network partitions, Raft requires that a majority of the servers in the cluster be up and reachable from the leader at any given time. With 3 servers, Raft tolerates 1 failure; with 5 servers, it tolerates 2. More generally, with 2N+1 servers it tolerates N failed servers.
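A quick, purely illustrative sketch of this majority arithmetic:

```go
package main

import "fmt"

// quorum returns the number of servers that must be up (and reachable from
// the leader) for a Raft cluster of the given size to make progress: a
// strict majority.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("cluster of %d: needs %d up, tolerates %d failure(s)\n",
			n, quorum(n), n-quorum(n))
	}
	// cluster of 3: needs 2 up, tolerates 1 failure(s)
	// cluster of 5: needs 3 up, tolerates 2 failure(s)
	// cluster of 7: needs 4 up, tolerates 3 failure(s)
}
```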
This leads us to the CAP theorem, whose practical conclusion is that when network partitions occur (and in practice they are to some extent unavoidable), we must carefully balance availability against consistency.
In this trade-off, Raft is firmly in the consistency camp. Its design prevents situations in which the cluster could reach an inconsistent state and different clients could receive different answers. Raft sacrifices some availability for this.
As I briefly mentioned earlier, Raft was not designed for high-throughput, fine-grained services. Every client request triggers a fair amount of work: Raft communication between the replicas to replicate the command to a majority of the servers and persist it, and all of this happens before the client gets a response.
For example, you would not design a replicated database in which every client request goes through Raft; it would simply be too slow. Raft is much better suited to coarse-grained distributed primitives: implementing a lock server, electing leaders for higher-level protocols, replicating critical configuration data in a distributed system, and so on.
Why Go
The Raft implementation described in this series is written in Go. In my view, Go has three strong advantages that make it the implementation language of choice for this series, and for networked services in general:
- Concurrency: algorithms like Raft are deeply concurrent in nature. Each replica performs ongoing actions, runs timers for timed events, and has to respond to asynchronous requests from other replicas and from clients. I have written before about why I think Go is an ideal language for this kind of code.
- Standard library: Go has a powerful, industrial-strength standard library that makes it easy to write sophisticated network servers without importing or learning any third-party packages. In Raft specifically, the first question you have to face is "how do the replicas exchange messages?" Many people get bogged down designing a protocol and a serialization format, or pulling in heavyweight third-party libraries. Go's standard net/rpc package is an adequate solution for this task, is quick to set up, and requires nothing outside the standard library (a minimal sketch follows this list).
- Simplicity: implementing distributed consistency is complex enough, regardless of the programming language. It is possible to write clear, simple code in any language, but in Go that is the default idiom; the language pushes back against code complexity at every possible level.
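As mentioned in the list above, here is a minimal sketch of two processes talking over net/rpc. The Replica type and its Ping method are placeholders for illustration; the series defines its own RPC methods and argument types.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
)

// Args and Reply are hypothetical request/response types for a single RPC
// between replicas.
type Args struct {
	Term int
}

type Reply struct {
	Term    int
	Granted bool
}

// Replica exposes RPC methods; net/rpc requires exported methods of the
// form Method(args T1, reply *T2) error.
type Replica struct{}

func (r *Replica) Ping(args Args, reply *Reply) error {
	reply.Term = args.Term
	reply.Granted = true
	return nil
}

func main() {
	// Server side: register the replica and accept connections.
	srv := rpc.NewServer()
	if err := srv.Register(&Replica{}); err != nil {
		log.Fatal(err)
	}
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	go srv.Accept(ln)

	// Client side: dial the server and make a synchronous call.
	client, err := rpc.Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	var reply Reply
	if err := client.Call("Replica.Ping", Args{Term: 1}, &reply); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("reply: %+v\n", reply)
}
```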
The next step
Thank you for reading this far! If you think there is anything I could explain better, please let me know. Although Raft may look simple conceptually, quite a few issues come up as soon as you start implementing it. Later parts of this series will cover the different aspects of the Raft algorithm in more detail.
You should now be ready to move on to Part 1, where we’ll start implementing Raft.
In this series of articles, we use Go to implement the Raft protocol, which not only explains some of the difficult points of Raft directly, but also helps with learning concurrent programming in Go.
I learned a lot from reading the original blog. With the author's permission, I am translating this series into Chinese and sharing it with you; I hope anyone interested in Go or Raft can get something out of it.
It is highly recommended that after reading this article you run the test cases in the author's code and check your understanding of the Raft protocol against the test output logs.
While studying, I forked the original author's code, added Chinese comments on top of it, and also included the output of the test cases. If you do not want to run the tests yourself, you can read the test output logs there directly.
GitHub address: github.com/GuoYaxiang/…
---
[1] For example, they can be placed in different racks, connected to different power supplies, or even located in different buildings. Truly critical services at large companies are often replicated globally, with replicas distributed across different regions. ↩︎