Scalable Open Financial Architecture Stack (SOFAStack) is a finance-level cloud native Architecture independently developed by Ant Financial, which contains various components required to build the finance-level cloud native Architecture and is the best practice developed in the Financial scene.
SOFAJRaft is a production-level, high-performance Java implementation based on the Raft consistency algorithm that supports multi-raft-Groups and is suitable for high load and low latency scenarios.
Zongtang Hu, SOFAJRaft Committer, China Mobile. This paper mainly introduces an important optimization scheme of SOFAJRaft in Leader election process — a semi-deterministic priority election mechanism. It will briefly introduce the general content of random timeout election mechanism in Raft algorithm first. If the readers don’t understand this content deeply enough, Suggestions can read under the SOFAJRaft election mechanism analysis | SOFAJRaft principle “. After reading this article, come back and have a better understanding of semi-deterministic priority voting.
SOFAJRaft:github.com/sofastack/s…
The concept and characteristics of election mechanism in Raft algorithm
Raft algorithm is a “consensual” algorithm where multiple participants reach a complete agreement on one thing: one thing is a conclusion, and the conclusion that has been agreed on is irrefutable. Based on this fundamental feature, the Raft algorithm has the following main characteristics:
- Strong leader: A Raft cluster has a maximum of one leader and logs can be copied from the leader to the followers only.
- Leader Election: Raft algorithm uses random timeout to trigger election to avoid the situation of votes being divided and ensure the smooth completion of election. This is to ensure that only one Leader node can exist in the Raft cluster at any given time.
- Membership changes: it deals with the Membership or withdrawal of members in the cluster in a two-stage manner, during which the external services of the cluster are not affected;
Elections are an important part of Raft’s algorithm. An election is the choice of a Leader node in a Raft cluster of multiple nodes to provide write services (and, by default, read services) externally.
Here, I will first introduce the concept of one Term — Term, which is used to logically cut a continuous timeline into intervals, similar to the meaning of “26” in the expression “26th President of the United States”. At the same time, the value of this Term ID is monotonically increasing along the timeline, which constitutes a necessary property for the Raft Leader election.
The first thing a cluster does during each Term is elect a Leader. At first, all servers serve as followers. If a Follower does not receive any messages from another Server after waiting for election timeout, Followers then assume that there are no leaders available in the cluster and start preparing for an election. In order to make all nodes in Raft cluster as objective and fair as possible, a random timeout is used to trigger the election to avoid the situation where several nodes try to vote at the same time and the votes are split up to ensure the smooth completion of the election. SOFAJRaft does this by setting the point in time at which each timer will execute in the timed task – electionTimer that Node triggers the election: Any time point in the time range [electionTimeoutMs, electionTimeoutMs + maxElectionDelayMs].
At the time of initiating the election, the Server will change from the Follower role to a Candidate and then try to run for the Leader of Term + 1. At this time, it will send voting requests to other servers. When receiving the reply from most of the machines in the cluster, Candidate Was elected Leader. However, the following two situations will make the Candidate step down to followers and give up running for the current Leader:
- If a Candidate receives a vote request from another Server with a higher Term while waiting for a vote on Servers;
- If the Candidate receives a heartbeat from another Server with a higher Term while waiting for the vote on Servers;
At the same time, a Leader will return to the Follower state when he finds a Leader with a higher Term. Once the Leader is elected, the Raft cluster can normally provide read and write services. As shown in the figure above, the cluster consists of a Leader and two followers. The Leader processes the read and write requests sent by the Client. They also need to keep a heartbeat with the followers and copy the Log to the followers.
However, Raft’s “random timeout election mechanism” has the following problems and limitations:
- In the next Term, it is uncertain who will become the Leader in Raft cluster. Other nodes in Raft cluster will become the Leader at random and cannot be predicted. Imagine a scenario where a Raft cluster is deployed on a server with different performance specifications. Business users expect the Leader node to be on the server with the highest performance to provide clients with better read and write capabilities.
- As shown in the figure above, each Candidate role node in the cluster will re-initiate the election in the next cycle due to the scenario of votes being split up. In this very short period of time, the Leader role does not exist in the cluster, so the read and write capability cannot be provided to the client. Therefore, service users need to use other methods to avoid the impact of the short period of unavailability.
Second, SOFAJRaft semi-deterministic election mechanism based on priority
2.1 Principle of SOFAJRaft based on priority election mechanism
In order to solve the problem caused by Raft algorithm “random time-out election mechanism” and increase the certainty of election, the author proposes a “priority-based semi-deterministic election mechanism”. The main algorithm idea is: Pre-set the value of priority for each node in the Raft cluster by setting a parameter. Each Raft node process will know the value of priority for all nodes in the cluster (including its own locally maintained priority variable) once it is up and running.
The configuration of each Raft node is as follows (the following is an example of the configuration of one of the nodes), where the string value of PeerId is in the form of {IP}:{port}:{index}:{priority};
Set the value of priority to the local global variable maintained by the Raft node itself by maximizing the value of priority for all nodes during the initialization of the Raft node process. In the example above, the node’s targetPriority local global variable value is set to 160 and its priority value is 100.
Before each Raft node triggers the PreVote pre-election phase via a random timeout mechanism, it compares its priority value to its targetPriority value to decide whether to vote for the Leader election in this round. Therefore, when a Raft cluster starts running, the node with the highest priority (160 in the example above) will be chosen as the Leader of the cluster to provide read and write services.
2.2 SOFAJRaft priority election reelection mechanism in case of failure
In most normal business scenarios, the Leader node in the Raft cluster can be predetermined using the method described in Section 2.1 above. But in a real production environment, all the unknowns can happen, and if the Leader node in the Raft cluster fails and goes down, is there a problem with priority voting based on the above?
As you can imagine, if the local global variable targetPriority maintained by other nodes does not change after the node with the highest priority value goes down, because the priority value of the node itself is less than the former, The other Raft nodes will never be able to participate in the Leader role, and without the Leader node the whole Raft cluster will not be able to provide read and write services to the outside world, which would be a major design flaw!!
To address the above problem of other nodes not being able to campaign for the new Leader role when a Raft cluster fails over. In the design, the author introduces the decayTargetPriority() target priority attenuation degradation function. If the Leader role is not elected in the last election cycle triggered by random timeout, The other nodes in the Raft cluster decrement the local global variable targetPriority by 20% each time until the decrement priority is the minimum “1”. The source code for the target priority attenuation degradation function is as follows:
After other nodes attenuate the local global variable targetPriority maintained by themselves, if the priority value of the node itself is greater than or equal to the targetPriority value, the node can participate in the next Leader election process triggered by random timeout. In general, the node with the second priority value can preempt the opportunity to elect the Leader in the next round.
As shown in the sequence diagram above, when Node1, the Leader node in the Raft cluster, goes down, Node2 and Node3 will trigger the election process of the new Leader through a random timeout because they cannot synchronize heartbeat information with it.
At t2 (> T1), Node3 preemptively triggers the Leader election process because its own priority value (40) is less than the local global targetPriority variable value (100), There is no way to initiate a PreVote pre-vote request to other surviving nodes in the Raft cluster.
Similarly, at time T3 (> T2), Node2, like Node3, cannot initiate a PreVote request. But at t4 (> T3), the local global variable targetPriority is attenuated by 20% (as in “Set T3 = 80” in the figure above) because no Leader node was created in the Raft cluster during the previous election cycle. After the attenuated Node3, the priority value is still less than the attenuated targetPriority variable value (80), so the vote request still cannot be initiated.
Similarly, at t5 (> T4), Node2’s targetPriority value (80) after attenuation is equal to its priority value (80), so it can initiate a PreVote request to other nodes in Raft cluster. After Node3 responds, The Raft cluster produces a new Leader node, Node2.
As Node triggers electionTimer, the execution time point triggered by random timeout is uncertain. In the actual application scenario as shown in the figure above, Node2 or Node3 may preempt the process of target priority attenuation and degradation. So if Node2 and Node3 themselves are set to similar priority values, such as 50 and 40, then at some point, Node2 with low priority can also be the Leader in a Raft cluster. Therefore, in a real production environment, it is recommended to define the priority of each node in the Raft cluster as a highly differentiated number.
A practical example of SOFAJRaft priority election mechanism
A more detailed example can be found on GitHub of SOFAJRaft at github.com/sofastack/s…
The startup code is as follows:
Interested users can set the parameters required by the demo program in the local editing environment, such as Idea or Eclipse command line, to experience the actual effect of priority election. The configuration parameters related to priority election are in the NodeOptions class.
As shown in the figure above, where:
- ElectionPriority: priority value of the Node. If set to 0, this node does not participate in the Raft cluster Leader election process and will never become the Leader. If set to -1, this node does not support priority voting and it still performs the Raft random timeout voting process.
- DecayPriorityGap: interval value of priority attenuation. If the user thinks that the priority value of the Node itself decays too slowly, this configuration parameter can be appropriately increased, so that the Node with a smaller priority value does not need too much time to complete the attenuation.
Source code analysis of SOFAJRaft priority election mechanism
As shown in the figure above, in the handleElectionTimeout() method executed by the timed task jraft-ElectionTimer-x, which is triggered by SOFAJRaft random timeout, Before preVote, determine whether the current Node participates in the Leader election process of Raft cluster by comparing the priority variable of the current Node with the local global variable targetPriority.
The allowLaunchElection() method defines the logic for the current Node to determine the priority value versus the local global targetPriority value. At the same time, if no Leader node is elected in the previous election cycle, then the target priority decrement and demotion method is implemented and the relevant variable value is set.
In addition, there is a problem need to note that in the stepDown NodeImpl () method will be called stopAllAndFindTheNextCandidate () method to suspend all log copy Replicator threads, Identify the next node with the most complete log as the last Candicate candidate to take over the role of the next Leader. Therefore, after the concept of priority election is introduced, in addition to the need to compare the log_index value of logs, if the log_index value of two nodes is equal, the need to determine the priority value. The specific code is as follows:
5. Jepsen test of SOFAJRaft priority election mechanism
Jepsen can verify that the system is consistent under specific failures. On the one hand, it provides a means of fault injection, which can simulate a variety of failures, such as network partitions, process crashes, and CPU overloads. On the other hand, it provides various verification models, such as Set, Lock, Queue, etc., to check whether the various distributed systems still meet the expected consistency under failure. Therefore, through Jepsen test, hidden errors of distributed system under extreme faults can be found, so as to improve fault tolerance of distributed system.
For distributed systems, the following types of faults are generally used for injection:
(1) the partition-random-node and partition-random-halves faults are common symmetric network partitions simulated; (2) Kill-random-processes and crash-random-Nodes faults simulate the process crash and node crash. (3) Hammer-time failure is to simulate the situation of some slow nodes, such as Full GC and OOM; (4) Bridge and partition-Majorities ring simulate extreme asymmetric network partitions;
To verify the reliability of the SOFAJRaft priority election mechanism, we chose to conduct injection tests on the above (1), (2), and (4) failures. The performance of the SOFAJRaft cluster during testing can be better analyzed in the form of diagrams. The following figure shows the client delay for each operation of the SOFAJRaft cluster in the case of simulated injection of symmetric network partition failure:
The blue box indicates that the data is added successfully, the red box indicates that the data is added failed, the yellow box indicates that the data is added successfully, and the gray box indicates that the data is added successfully. It is reasonable to see that some fail-injection periods cause the cluster to be temporarily unavailable, and some fail-injection periods do not. Due to random network partitions, a cluster re-election will only occur if the current Leader nodes are isolated to a few node regions, but even if a cluster re-election occurs, the SOFAJRaft cluster will be available again in a relatively short time. In addition, you can see that due to SOFAJRaft’s fault-tolerant design for symmetric network partitions, there is no re-election of the cluster after each failure recovery.
The figure below shows the percentage point delay of SOFAJRaft during testing.
You can see that the SOFAJRaft cluster is stable at all times except during periods of time when a failure is introduced that causes the cluster to re-elect. SOFAJRaft performed stably and as expected under random symmetric network partition fault injection. In addition to random symmetric network partition, SOFAJRaft also passed the consistency verification of Set test under other fault injection conditions, proving that SOFAJRaft’s fault tolerance and good reliability for network partition, process, node crash and other faults.
Six, summarized
Starting from the uncertainty caused by random time-out election mechanism in Raft algorithm, this paper elaborates on the basic process and semi-determinism of priority election mechanism by focusing on the concept, characteristics and principle of priority election mechanism and combining with the design and implementation details of SOFAJRaft priority election mechanism proposed by the author. This paper introduces the practical application of SOFAJRaft priority election, and analyzes the implementation details in the source code.
Before SOFA team and community students build over the analysis | SOFAJRaft principle “series, the related principle of SOFAJRaft source code parsing. This is the first of a series of articles on SOFAJRaft features that will continue to be shared. If you are interested in SOFAJRaft, please join us
SOFAJRaft:github.com/sofastack/s…
The anatomy of | SOFAJRaft principle “series
- SOFAJRaft Snapshot principle analysis | SOFAJRaft implementation principle
- SOFAJRaft – RheaKV distributed lock implementation analysis | SOFAJRaft implementation principle
- SOFAJRaft log copy – pipeline implementation analysis | SOFAJRaft implementation principle
- SOFAJRaft – RheaKV MULTI – RAFT – GROUP implementation analysis | SOFAJRaft implementation principle
- SOFAJRaft election mechanism analysis | SOFAJRaft implementation principle
- SOFAJRaft linear consistent read implementation analysis | SOFAJRaft implementation principle
- How does SOFAJRaft work with RheaKV
- SOFAJRaft implementation principle – Production level Raft algorithm library storage module analysis principle
Financial Class Distributed Architecture (Antfin_SOFA)