Just as automatic failover in Sentinel mode provides high availability for a master/slave deployment, the automatic failover of Redis Cluster guarantees the availability of the cluster. The two share many design ideas. This section analyzes how Redis Cluster achieves it.
Heartbeat mechanism
As a decentralized distributed system, Redis Cluster relies on the cooperation of all nodes for fault tolerance: when a node failure is detected, the failure is propagated until the cluster reaches a consensus on it, which then triggers node election and failover. All of this is built on the cluster state that nodes maintain about each other through the heartbeat mechanism.
The diagram below shows the cluster state from the perspective of node A (only the parts related to cluster fault tolerance are drawn). "myself" refers to node A itself and describes node A's own state; nodes[B] points to node B and records node B's status as seen by node A. The state also includes the cluster currentEpoch and the mapping between hash slots and nodes.

Within the cluster, every pair of nodes maintains a heartbeat by exchanging PING and PONG messages. As the previous section showed, the two message types have exactly the same structure (identical header and body) and differ only in the type field, so we refer to this message pair as a heartbeat message.
The PING/PONG message header carries the source node's configuration epoch (configEpoch), replication offset (offset), node name (sender), responsible hash slots (myslots), and node flags (flags), which correspond to the node information the destination node maintains about the source node. It also carries the cluster currentEpoch and cluster state as seen by the source node.

The PING/PONG message body contains several clusterMsgDataGossip entries. Each clusterMsgDataGossip entry corresponds to one cluster node and describes the heartbeat state between the source node and that node, together with the source node's judgment of its running state.
Heartbeat messages spread between cluster nodes in a gossip (epidemic) fashion, which ensures that node state converges across the cluster within a short time. Below we examine the heartbeat mechanism from three angles: when heartbeats are triggered, how the message body is composed, and how the messages are applied.
Trigger timing
In cluster mode, the heartbeat is driven by the periodic function clusterCron(), which runs every 100 milliseconds. To keep the volume of cluster messages under control while still keeping heartbeats between nodes timely, Redis Cluster combines two strategies.
Normally, clusterCron() sends a PING message to one node every second (i.e. on every 10th execution of the function). That node is chosen at random, as follows:
- Randomly sample the node list five times; a sampled node qualifies as a candidate if its cluster bus link exists, it is not waiting for a PONG reply, it is not in the handshake state, and it is not the local node.
- If the candidate set is not empty, pick the candidate whose last PONG reply was received the longest ago.
Note: Redis Cluster tracks sent and received heartbeat messages with two fields, which are also important for the fault detection described later:
- When the source node sends a PING message to the destination node, it sets the destination node's ping_sent to the current time.
- When the source node receives the destination node's PONG reply, it resets that node's ping_sent to 0 and updates its pong_received to the current time.

In other words, ping_sent equal to 0 means the PONG reply has been received and the node is waiting for the next PING to be sent; ping_sent not equal to 0 means the node is still waiting for a PONG reply.
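A minimal sketch of this bookkeeping, using a simplified stand-in struct rather than the real clusterNode (the actual updates happen inside clusterSendPing() and clusterProcessPacket()); the helper functions here are hypothetical illustrations:

```c
#include <stdint.h>

typedef long long mstime_t;

/* Simplified stand-in for the relevant clusterNode fields. */
typedef struct {
    mstime_t ping_sent;     /* 0 = no PING pending, otherwise the send time */
    mstime_t pong_received; /* time the last PONG was received */
} nodeHeartbeat;

/* Called when a PING is sent to the node. */
static void on_ping_sent(nodeHeartbeat *n, mstime_t now) {
    if (n->ping_sent == 0) n->ping_sent = now;  /* now waiting for a PONG */
}

/* Called when the node's PONG reply arrives. */
static void on_pong_received(nodeHeartbeat *n, mstime_t now) {
    n->ping_sent = 0;        /* heartbeat round completed */
    n->pong_received = now;  /* last time we heard from this node */
}
```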
The timeout parameter cluster-node-timeout is set in the cluster configuration file, and the corresponding variable NODE_TIMEOUT is the basis for judging a heartbeat timeout against the target node. To keep PING/PONG exchanges from timing out and to leave room for retries, Redis Cluster performs heartbeat compensation at the NODE_TIMEOUT/2 threshold.
clusterCron() checks each node in turn every time it runs (every 100 ms):
- If more than NODE_TIMEOUT/2 has passed since the last PONG was received from the destination node and no PING is currently pending, the source node immediately sends it a PING.
- If a PING has already been sent to the destination node but no PONG has arrived within NODE_TIMEOUT/2, the source node frees the connection and reconnects, so that a faulty network connection does not keep affecting the heartbeat.
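Roughly how these two compensation branches look in clusterCron() (a simplified excerpt based on the Redis 6.0 source; the surrounding per-node loop and several extra conditions are omitted):

```c
/* Simplified excerpt from clusterCron(); 'node' is the node being checked. */
mstime_t now = mstime();

/* No PING pending and no PONG heard for half the timeout:
 * send a PING right away instead of waiting for the random pick. */
if (node->link != NULL &&
    node->ping_sent == 0 &&
    (now - node->pong_received) > server.cluster_node_timeout/2)
{
    clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
}

/* A PING is pending but no PONG arrived within half the timeout:
 * free the link so it gets reconnected, ruling out a dead connection. */
if (node->link != NULL &&
    node->ping_sent != 0 &&
    (now - node->ping_sent) > server.cluster_node_timeout/2)
{
    freeClusterLink(node->link); /* the link will be recreated automatically */
}
```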
In general, the number of heartbeat messages sent and received per second in the cluster is stable, so the instantaneous network I/O does not spike enough to burden the cluster even when it has many nodes. Modelled as a directed complete graph of N nodes, the cluster has N*(N-1) links, each of which maintains a heartbeat, and heartbeat messages travel in pairs.
If the cluster has 100 nodes and NODE_TIMEOUT is 60 seconds, that means that each node sends 99 pings in 30 seconds, or an average of 3.3 pings per second. With 100 nodes sending a total of 330 messages per second, this order of magnitude of network traffic pressure is acceptable.
However, for a given number of nodes, NODE_TIMEOUT must be set sensibly: if it is too small, heartbeat messages put considerable pressure on the network; if it is too large, node faults may not be detected in a timely manner.
Message composition
PING/PONG messages share the same data structure. The header content comes from myself in the cluster state, which is easy to understand; the message body, however, needs to append the state of several nodes, and a cluster can have many nodes. Which nodes should be included?
By design, each message body contains both healthy nodes and nodes in the PFAIL state. The selection works as follows (this part of the source lives in the clusterSendPing() function):
- Determine the theoretical maximum number of healthy nodes the heartbeat message should contain (theoretical because the steps below still filter out nodes by state, such as handshaking or down nodes):
  - Condition 1: the maximum equals the number of cluster nodes minus 2 ("minus 2" excludes the source node and the destination node).
  - Condition 2: the maximum equals 1/10 of the number of cluster nodes, but not less than 3.
  - Take the minimum of the two results as the desired count wanted.
- Determine the number of PFAIL-state nodes that must be included, pfail_wanted: this is the number of nodes currently in the PFAIL state in the cluster (server.cluster->stats_pfail_nodes).
- Add healthy nodes: nodes are picked at random from the cluster node list until wanted gossip entries have been created and appended to the message body. A node qualifies only if:
  - it is not the source node itself;
  - it is not in the PFAIL state (PFAIL nodes are appended separately below);
  - it is not in the handshake or no-address state, and it is not a node whose cluster bus link is missing while it is responsible for zero hash slots.
- Add PFAIL-state nodes: traverse the node list and create gossip entries for nodes that are in the PFAIL state and not in the handshake or no-address state, appending them to the message body.
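The sizing logic described above looks roughly like this in clusterSendPing() (a simplified excerpt based on the Redis 6.0 source; the loops that actually build the gossip entries are omitted):

```c
/* Simplified excerpt from clusterSendPing() in cluster.c. */
int freshnodes = dictSize(server.cluster->nodes) - 2; /* exclude source and destination */

int wanted = dictSize(server.cluster->nodes) / 10;    /* 1/10 of the cluster ... */
if (wanted < 3) wanted = 3;                           /* ... but at least 3 ... */
if (wanted > freshnodes) wanted = freshnodes;         /* ... and at most freshnodes */

/* All currently known PFAIL nodes are appended as well. */
int pfail_wanted = server.cluster->stats_pfail_nodes;
```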
Message application
After receiving a PING or PONG message, a node updates its locally maintained node state according to the message header and body. Walking the source code field by field would get fairly involved, so I will illustrate message processing from the perspective of application scenarios (based on the source function clusterProcessPacket).
Cluster current epoch and configuration epoch
The heartbeat header contains the source node's configuration epoch and what it considers the cluster's currentEpoch to be. Upon receipt, the destination node checks them against the configuration epoch it has cached for the source node and its own view of the cluster currentEpoch. The rules are as follows:
- If the source node's configuration epoch cached by the destination node (the cached config epoch) is smaller than the configuration epoch declared in the source node's heartbeat message (the declared config epoch), the cached value is updated to the declared one.
- If the cluster currentEpoch cached by the destination node is smaller than the cluster currentEpoch declared in the source node's heartbeat message, the cached value is updated to the declared one.
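In the 6.0 source these two checks sit near the top of clusterProcessPacket(); roughly (simplified, with the handshake checks omitted):

```c
/* Simplified excerpt from clusterProcessPacket(); 'hdr' is the message header,
 * 'sender' is the locally cached clusterNode of the source node. */
uint64_t senderCurrentEpoch = ntohu64(hdr->currentEpoch);
uint64_t senderConfigEpoch  = ntohu64(hdr->configEpoch);

/* Follow a newer cluster currentEpoch declared by the sender. */
if (senderCurrentEpoch > server.cluster->currentEpoch)
    server.cluster->currentEpoch = senderCurrentEpoch;

/* Refresh the cached configEpoch of the sender if it declares a newer one. */
if (sender && senderConfigEpoch > sender->configEpoch) {
    sender->configEpoch = senderConfigEpoch;
    clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|CLUSTER_TODO_FSYNC_CONFIG);
}
```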
Hash slot change detection
The message header contains the list of hash slots the source node is currently responsible for, and the destination node compares it against its locally cached slot-to-node mapping to see whether any slot assignment is inconsistent. When an inconsistency is found, it is handled as follows:
- If the source node is a master and its declared configuration epoch is greater than the one cached by the destination node, the local slot-to-node mapping is updated according to the declared hash slots.
- If the locally cached configuration epoch is greater than the epoch declared by the source node, the destination node tells the source node to update its hash slot configuration via an UPDATE message.
- If the locally cached configuration epoch equals the one declared by the source node, and both the source and destination nodes are masters, the configuration epoch conflict is resolved: the destination node bumps the cluster currentEpoch and sets its own configuration epoch to match it. The conflicting hash slots are then reconciled in subsequent heartbeats until the cluster converges.
New Node Discovery
From the node handshake process we know that a new node only needs to shake hands with any one node in the cluster in order to join it; at that point the other nodes do not yet know that the new node has joined, and the new node does not know about them either. Both situations are resolved by new node discovery.
During the heartbeat process, the source node adds the new node information unknown to the other node to the message body to notify the destination node. The target node will perform the following process:
- The destination node finds a node in the message body that does not exist in its local cache, creates a clusterNode object for it, and adds it to the cluster node list.
- The destination node then creates a network link to that node in the clusterCron() function, and the two nodes complete mutual discovery through their subsequent heartbeat exchanges.
Node Fault Discovery
Node fault detection is a core function of heartbeat, which is described separately in the next section.
Fault detection and failover
PFAIL and FAIL concepts
Redis Cluster uses two concepts PFAIL and FAIL to describe the running status of nodes, which is similar to SDOWN and ODOWN in Sentinel mode.
- PFAIL: possibly failed
When a node cannot reach another node for longer than NODE_TIMEOUT, it marks the unreachable node as PFAIL. Regardless of node type, both master and slave nodes can mark other nodes as PFAIL.
Unreachability in Redis Cluster means that the source node has sent a PING to the target node and has not received its PONG reply within NODE_TIMEOUT; the target node is then considered unreachable.
This is a concept derived from the heartbeat. To keep PFAIL judgments as accurate as possible, NODE_TIMEOUT must be larger than the network round-trip time between two nodes; to improve reliability, when no PONG reply has been received after half of NODE_TIMEOUT, the source node immediately tries to reconnect to the target node.
Therefore, PFAIL is the result of heartbeat detection from source node to target node, which is subjective to a certain extent.
- FAIL: failed (down)
The PFAIL state is subjective and does not by itself mean the target node is down; only when the target node reaches the FAIL state is it considered down.
However, as we already know, during heartbeats each node shares its PFAIL findings with the other nodes. So if a node really is down, other nodes will inevitably detect its PFAIL state as well.
When a certain number of nodes in the cluster consider the target node to be in the PFAIL state within a certain window of time, the node raises the status of the target node to FAIL.
Node fault detection
This section walks through node fault detection in detail, using master nodes A, B, and C, with node B going down, as the example; the process involves the ping_sent, pong_received, and fail_reports fields. First, how does a node reach the PFAIL state?
The cluster state maintained by each node contains the node list; the relevant node information is shown for node B in the figure above. The ping_sent field reflects the heartbeat state between node A and node B: if it is 0, the heartbeat is healthy; if it is non-zero, node A has sent a PING to node B and is waiting for node B's PONG reply.
The cluster node executes the clusterCron() function every 100 milliseconds, which checks the heartbeat and data interaction status of each node. If node A does not receive any data from node B within the NODE_TIMEOUT period, node B is regarded as faulty, and node A sets its status to PFAIL. The specific code is as follows:
```c
void clusterCron(void) {
    /* ... omitted ... */
    // How long since the PING message was sent
    mstime_t ping_delay = now - node->ping_sent;
    // How long since we last received data from the node
    mstime_t data_delay = now - node->data_received;
    // Take the smaller of the two
    mstime_t node_delay = (ping_delay < data_delay) ? ping_delay : data_delay;
    // Check for timeout
    if (node_delay > server.cluster_node_timeout) {
        /* The node timed out: if it is not already in the PFAIL or FAIL state, set PFAIL */
        if (!(node->flags & (CLUSTER_NODE_PFAIL|CLUSTER_NODE_FAIL))) {
            serverLog(LL_DEBUG, "*** NODE %.40s possibly failing", node->name);
            node->flags |= CLUSTER_NODE_PFAIL;
            update_state = 1;
        }
    }
    /* ... omitted ... */
}
```
The PFAIL state is propagated
As described in the heartbeat mechanism's Message composition section, nodes in the PFAIL state are carried in heartbeat messages and thus propagated to all reachable nodes in the cluster.
The PFAIL state changes to the FAIL state
The switch from PFAIL to FAIL requires the agreement of more than half of the master nodes in the cluster. Cluster nodes collect PFAIL flags about a node through heartbeat messages. If node B fails, both A and C will detect the failure and mark node B as PFAIL, and the heartbeat messages exchanged between A and C will carry node B's PFAIL state. Let's look at how Redis Cluster handles this from node A's perspective.
The states of other nodes carried in the heartbeat message body are processed by the receiver in the clusterProcessGossipSection function. Suppose node A receives a heartbeat from master node C that declares node B to be in the PFAIL state. Following the source code, the process is as follows:
- Add a failure report for node B, i.e. add node C to node B's fail_reports list.

fail_reports is a list of clusterNodeFailReport entries, storing all the nodes that consider node B to have failed. Its structure is shown below; the time field records when the report was added, i.e. the latest time this reporter declared node B faulty. When the same node reports the failure again, only the time field is refreshed (see the sketch after the struct).
```c
typedef struct clusterNodeFailReport {
    /* The node that reported the failure */
    struct clusterNode *node;  /* Node reporting the failure condition. */
    /* Time of the last report from this node */
    mstime_t time;             /* Time of the last report from this node. */
} clusterNodeFailReport;
```
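Adding or refreshing a report is handled by clusterNodeAddFailureReport(); a simplified version based on the Redis 6.0 source looks like this:

```c
/* Add a failure report from 'sender' about 'failing'. Returns 1 if a new
 * report was added, 0 if an existing one was merely refreshed. */
int clusterNodeAddFailureReport(clusterNode *failing, clusterNode *sender) {
    list *l = failing->fail_reports;
    listNode *ln;
    listIter li;
    clusterNodeFailReport *fr;

    /* If this sender already reported the failure, just refresh the time. */
    listRewind(l, &li);
    while ((ln = listNext(&li)) != NULL) {
        fr = ln->value;
        if (fr->node == sender) {
            fr->time = mstime();
            return 0;
        }
    }

    /* Otherwise append a brand new report. */
    fr = zmalloc(sizeof(*fr));
    fr->node = sender;
    fr->time = mstime();
    listAddNodeTail(l, fr);
    return 1;
}
```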
- Check whether the FAIL condition is met: Redis Cluster specifies that if more than half of the master nodes consider a node to be in the PFAIL state, that node is switched to the FAIL state. Continuing the example above, the process is:
  - Compute the quorum needed to mark a node as FAIL: with three master nodes in this cluster, at least two of them must agree.
  - In node A's cluster state, node C has been added to node B's fail_reports list, and node A itself has also marked node B as failed. That is, two nodes confirm node B's failure, so node B can be set to FAIL.
  - Node A clears node B's PFAIL flag, sets the FAIL flag, and then sends a FAIL message about node B to all reachable nodes.
The detailed code procedure is shown in the following function:
```c
void markNodeAsFailingIfNeeded(clusterNode *node) {
    int failures;
    // Quorum needed to declare the node down: more than half of the masters
    int needed_quorum = (server.cluster->size / 2) + 1;

    // Only proceed if this node also considers the target node timed out
    if (!nodeTimedOut(node)) return;    /* We can reach it. */
    if (nodeFailed(node)) return;       /* Already FAILing. */

    failures = clusterNodeFailureReportsCount(node);
    /* If we are a master, our own PFAIL judgment also counts */
    if (nodeIsMaster(myself)) failures++;
    if (failures < needed_quorum) return; /* No weak agreement from masters. */

    serverLog(LL_NOTICE, "Marking node %.40s as failing (quorum reached).", node->name);

    /* Switch the node from PFAIL to FAIL */
    node->flags &= ~CLUSTER_NODE_PFAIL;
    node->flags |= CLUSTER_NODE_FAIL;
    node->fail_time = mstime();

    /* Broadcast the FAIL state to all reachable nodes; every node that receives
     * the message is forced to accept it. */
    clusterSendFail(node->name);
    clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE|CLUSTER_TODO_SAVE_CONFIG);
}
```
Note: records in fail_reports have a validity period of 2 × NODE_TIMEOUT, and records older than that are removed. In other words, enough reports must be collected within that time window to complete the PFAIL-to-FAIL transition. Also, if the heartbeat between a reporting master and the node recovers, that master is removed from fail_reports immediately.
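The expiry rule corresponds to clusterNodeCleanupFailureReports() in the source; a simplified version (CLUSTER_FAIL_REPORT_VALIDITY_MULT is 2):

```c
/* Drop failure reports older than 2 * NODE_TIMEOUT before counting them. */
void clusterNodeCleanupFailureReports(clusterNode *node) {
    list *l = node->fail_reports;
    listNode *ln;
    listIter li;
    clusterNodeFailReport *fr;
    mstime_t maxtime = server.cluster_node_timeout *
                       CLUSTER_FAIL_REPORT_VALIDITY_MULT;  /* 2 * NODE_TIMEOUT */
    mstime_t now = mstime();

    listRewind(l, &li);
    while ((ln = listNext(&li)) != NULL) {
        fr = ln->value;
        if (now - fr->time > maxtime) listDelNode(l, ln);  /* report expired */
    }
}
```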
After node A sets node B to the FAIL state, it sends a message about node B's FAIL to all reachable nodes; the message type is CLUSTERMSG_TYPE_FAIL. As soon as another node receives a FAIL message, it immediately sets node B to the FAIL state, regardless of whether node B is in the PFAIL state from its own point of view.
When a master node fails, a FAIL message about it is propagated to all reachable nodes in the cluster, which mark it as FAIL. To keep the cluster available, the slaves of that master then start the failover procedure and the best-suited slave is promoted to master. Failover in Redis Cluster consists of two key processes: slave node election and slave node promotion.
Slave node election
When a master fails, all of its slaves start an election, and only the slave that wins the vote of the other masters gets the chance to be promoted to master. Both the preparation and the execution of the slave election happen in clusterCron.
Conditions and timing for starting an election
A slave node must satisfy the following conditions before starting the election process (these checks are performed at the start of clusterHandleSlaveFailover(void)):
- The slave node's master is in the FAIL state;
- The set of hash slots the master is responsible for is not empty;
- To ensure the slave's data is reasonably fresh, the time the slave has been disconnected from its master must be shorter than a given limit. I extracted this limit from the code, shown below:
```c
/* Age of the data on this replica */
mstime_t data_age;

/* Time since the last interaction with the master */
if (server.repl_state == REPL_STATE_CONNECTED) {
    data_age = (mstime_t)(server.unixtime - server.master->lastinteraction) * 1000;
} else {
    data_age = (mstime_t)(server.unixtime - server.repl_down_since) * 1000;
}

/* Subtract the node timeout, since that much delay is needed just to detect the failure */
if (data_age > server.cluster_node_timeout)
    data_age -= server.cluster_node_timeout;

/* The data is considered too old (and the election is not started) when: */
if (data_age >
    (((mstime_t)server.repl_ping_slave_period * 1000) +
     (server.cluster_node_timeout * server.cluster_slave_validity_factor)))
{
    /* ... give up this failover attempt ... */
}
```
If a master in the FAIL state has multiple slaves, Redis Cluster always prefers the slave with the most complete data to be promoted as the new master. However, if all slaves started the election at the same time and competed on equal terms, there would be no guarantee that the node with the most complete data wins. To raise that node's priority, Redis Cluster introduces a delayed-start mechanism for the election: each slave computes a delay value and derives the start time of its election process using the following formula:
```c
/* Delay before the election process starts */
DELAY = 500 + random(0, 500) + SLAVE_RANK * 1000

/* Start time of this slave's election process */
server.cluster->failover_auth_time = mstime() + DELAY
```
To explain this formula:
- The fixed delay of 500 ms gives the master's FAIL state time to propagate through the cluster, so that masters do not reject the vote request simply because they have not yet learned of the failure;
- The random delay random(0,500) ensures that the slaves do not all start the election at the same moment;
- SLAVE_RANK: this value depends on how complete the slave's data is. After the master enters the FAIL state, the slaves exchange state with each other through PONG messages in order to establish a ranking. The rule is: the slave with the largest repl_offset (the most complete data) gets rank 0, the next gets rank 1, and so on.
So, the node with the highest data integrity will start the election process first, and if all goes well, it will be promoted to the new master node.
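The rank is computed locally by clusterGetSlaveRank(); a simplified version based on the 6.0 source (each sibling with a larger replication offset pushes our rank, and therefore our delay, up):

```c
/* Return the rank of this replica among the slaves of its master:
 * 0 = largest replication offset, 1 = second largest, and so on. */
int clusterGetSlaveRank(void) {
    long long myoffset;
    int j, rank = 0;
    clusterNode *master = myself->slaveof;

    if (master == NULL) return 0;
    myoffset = replicationGetSlaveOffset();
    for (j = 0; j < master->numslaves; j++)
        if (master->slaves[j] != myself &&
            !nodeCantFailover(master->slaves[j]) &&
            master->slaves[j]->repl_offset > myoffset) rank++;
    return rank;
}
```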
After failover_auth_time is set, the next runs of clusterCron() check it; once the system time reaches that scheduled moment (and failover_auth_sent is still 0), the election process starts.
The slave node election process
When the slave starts the election process, it increments currentEpoch by 1 and sets failover_auth_sent = 1 to mark that the election has begun. It then sends an election request to all master nodes in the cluster via FAILOVER_AUTH_REQUEST messages and waits for their responses for 2 × NODE_TIMEOUT (at least 2 seconds).
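In the 6.0 source this corresponds to the following branch of clusterHandleSlaveFailover() (simplified):

```c
/* Simplified excerpt from clusterHandleSlaveFailover(). */
if (server.cluster->failover_auth_sent == 0) {
    server.cluster->currentEpoch++;                     /* bump the cluster epoch */
    server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
    serverLog(LL_WARNING, "Starting a failover election for epoch %llu.",
              (unsigned long long) server.cluster->currentEpoch);
    clusterRequestFailoverAuth();                       /* broadcast FAILOVER_AUTH_REQUEST */
    server.cluster->failover_auth_sent = 1;             /* remember we already asked */
    clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                         CLUSTER_TODO_UPDATE_STATE|
                         CLUSTER_TODO_FSYNC_CONFIG);
    return;                                             /* wait for the votes */
}
```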
The other masters in the cluster are the voters in the slave election and must perform strict checks before voting. To prevent multiple slaves from winning the election and to guarantee the legitimacy of the process, a master verifies the following conditions after receiving a FAILOVER_AUTH_REQUEST message:
- The master of the slave that initiated the election must be in the FAIL state (the fixed 500 ms part of the DELAY above exists precisely so that FAIL messages have time to propagate through the cluster);
- For a given currentEpoch, a master votes only once; the epoch of that vote is saved in lastVoteEpoch, and requests carrying older epochs are rejected. Likewise, if the currentEpoch in the slave's vote request is smaller than the master's own currentEpoch, the request is rejected;
- Once a master has voted for a particular slave via a FAILOVER_AUTH_ACK message, it will not vote for any other slave of the same master within 2 × NODE_TIMEOUT.
After voting, the primary node will record the information and save it to the configuration file in a secure and persistent manner:
- The cluster currentEpoch of this latest vote is saved in lastVoteEpoch;
- The voting time is saved in the voted_time field of the corresponding node in the cluster node list.
To avoid counting an older vote as a current one, the slave checks the currentEpoch declared in each FAILOVER_AUTH_ACK message: if it is smaller than the slave's own cluster currentEpoch, the vote is discarded. If the vote is valid, the slave increments server.cluster->failover_auth_count.
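Vote counting happens in clusterProcessPacket() when a FAILOVER_AUTH_ACK arrives; simplified from the 6.0 source:

```c
/* Simplified excerpt from clusterProcessPacket() for FAILOVER_AUTH_ACK. */
if (nodeIsMaster(sender) && sender->numslots > 0 &&
    senderCurrentEpoch >= server.cluster->failover_auth_epoch)
{
    /* A valid vote from a slot-owning master in the current election epoch. */
    server.cluster->failover_auth_count++;
    /* Maybe we reached the quorum: make sure the failover handler runs soon. */
    clusterDoBeforeSleep(CLUSTER_TODO_HANDLE_FAILOVER);
}
```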
After being approved by a majority of the master nodes, the slave node wins the election. If the majority of nodes do not vote for it within 2 times NODE_TIMEOUT (at least 2 seconds), the election process will terminate and a new election process will start after 4 times NODE_TIMEOUT (at least 4 seconds).
Slave node promotion
Each valid vote the slave receives increments the failover_auth_count counter by 1. The election-and-promotion handler clusterHandleSlaveFailover checks this counter periodically; once the slave has been recognized by a majority (quorum) of master nodes, the promotion process is triggered.
The quorum here is the same one used to mark a node as FAIL, namely (server.cluster->size / 2) + 1, i.e. more than half of the master nodes.
When the slave starts the promotion process, it makes a series of changes to its own state and finally promotes itself to master. The details are as follows:
- Increment its own configuration epoch (configEpoch);
- Switch its role to master and reset the master/slave replication relationship (similar to slave promotion in Sentinel mode);
- Take over the hash slots the original master was responsible for and mark them as its own;
- Save and persist the configuration changes above.
After setting itself as master, the node broadcasts its state change to all nodes in the cluster through PONG messages, so that the other nodes can update their state in time and accept the slave's role promotion.
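These steps map closely onto clusterFailoverReplaceYourMaster() in the 6.0 source; a simplified version:

```c
/* Simplified version of clusterFailoverReplaceYourMaster(). */
void clusterFailoverReplaceYourMaster(void) {
    clusterNode *oldmaster = myself->slaveof;
    if (nodeIsMaster(myself) || oldmaster == NULL) return;

    /* 1) Turn this node into a master and stop replicating. */
    clusterSetNodeAsMaster(myself);
    replicationUnsetMaster();

    /* 2) Claim every hash slot the old master was responsible for. */
    for (int j = 0; j < CLUSTER_SLOTS; j++) {
        if (clusterNodeGetSlotBit(oldmaster, j)) {
            clusterDelSlot(j);
            clusterAddSlot(myself, j);
        }
    }

    /* 3) Update the cluster state and persist the new configuration. */
    clusterUpdateState();
    clusterSaveConfigOrDie(1);

    /* 4) PONG all nodes so they learn about the role change right away. */
    clusterBroadcastPong(CLUSTER_BROADCAST_ALL);
}
```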
After a node wins the election, its own role promotion is relatively simple; what matters more is that the promotion is acknowledged by every other node in the cluster. Based on the roles the other nodes played, three scenarios need to be considered:
- Other nodes: nodes that, before the promotion, had no master/slave relationship with the promoted node and do not share a master with it;
- Sibling slave: A slave of the same master node before the slave node is promoted.
- Old master node: the failed master, which has to rejoin the cluster after it recovers.
General processing logic
When the new master is promoted, it sends PONG messages to all reachable nodes in the cluster. When the other nodes receive this PONG, in addition to the general processing logic (such as raising the configuration epoch), they detect the node's role change (from slave to master) and update their local clusterState accordingly. The specific updates are:
- Update the node's configuration epoch and the cluster currentEpoch, since both were raised during the promotion;
- Remove the promoted node from its original master's slave list;
- Update the node role: mark the node as a master and clear its slave flag;
- Handle the hash slot conflicts: the new master takes over all hash slots of the old master, so the slots previously mapped to the old master are remapped to the new master.
Sibling slave nodes switch replication to the new master
The processing above is common to the "other nodes" and the "sibling slave nodes"; the "old master node" is temporarily disconnected and cannot be notified. After this PONG message, the "other nodes" have accepted the new master's role change. The "sibling slave nodes", however, still regard the old master as their master; following the idea of failover, they should take the new master as their replication source. How is this achieved?
During hash slot conflict handling, a "sibling slave node" finds that the conflicting hash slots used to be the responsibility of its original master. When it detects this change, it sets the new master as its own master and starts replicating from it.
The old master node rejoins the cluster
When the old master recovers, it resumes heartbeats with the other nodes using the same configuration it had before the outage (cluster currentEpoch, configuration epoch, hash slots, and so on).
When a node in the cluster receives its PING message and finds that its configuration is out of date (its node configuration epoch is stale) and its hash slot assignment conflicts with the local view, the node notifies it to refresh its configuration through an UPDATE message.
The UPDATE message carries the information of the node that now owns the conflicting hash slots. After receiving it, the old master finds that its own configuration epoch is out of date; it therefore takes the node carried in the UPDATE message as its master, changes its own identity to slave, and updates its local hash slot mapping.
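The demotion itself happens inside clusterUpdateSlotsConfigWith(); simplified from the 6.0 source, the key branch is:

```c
/* Simplified excerpt from clusterUpdateSlotsConfigWith(): 'sender' is the node
 * announcing the newer configuration, 'newmaster' is set when slots we served
 * were reassigned to it, 'curmaster' is the master we currently belong to. */
if (newmaster && curmaster->numslots == 0) {
    serverLog(LL_WARNING,
        "Configuration change detected. Reconfiguring myself as a replica of %.40s",
        sender->name);
    clusterSetMaster(sender);   /* demote ourselves and replicate from the new owner */
    clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
                         CLUSTER_TODO_UPDATE_STATE|
                         CLUSTER_TODO_FSYNC_CONFIG);
}
```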
In subsequent heartbeats, other nodes will update the old master node as the slave node of the new master node.
At this point, failover is complete.
Other topics related to fault tolerance
Slave node migration
To improve the availability of the cluster, Redis Cluster implements a slave node migration mechanism. When a cluster is built, each master has some number of slaves; if several independent failure events occur while the cluster runs, a master can end up with no working slave (an isolated master), and once that master goes down the cluster can no longer work. Redis Cluster detects the isolated master and the master with the largest number of slaves in time, and then migrates a suitable slave to the isolated master, so that the cluster can withstand another outage and the availability of the whole system improves.
The following figure shows an example. In the initial state the cluster has seven nodes: A, B, and C are masters, A has two slaves A1 and A2, while B and C have one slave each, B1 and C1 respectively. The role slave migration plays when a node fails is as follows:
- While the cluster is running, node B fails and B1 is promoted to the new master through an election, leaving B1 isolated with no slave. If B1 then fails as well, the cluster becomes unavailable.
- However, A has two slaves at this point, so Redis Cluster starts the slave migration mechanism and turns A1 into a slave of B1, so that B1 is no longer isolated.
- Even if B1 fails again, A1 can be promoted to the new master node and the cluster can continue to work.
Cluster split-brain
As a distributed system, Redis Cluster has to deal with the various complications caused by network partitions. When a partition splits the cluster nodes into two groups, the cluster experiences a "split-brain". How do slave election and promotion behave in the two partitions in this case? As shown in the figure above, node A and its slave A1 are cut off from the other nodes by the network partition. Let's see how the nodes in each partition behave.
- The majority partition
The nodes in this partition detect the PFAIL state of node A and then, through propagation, confirm that node A has reached the FAIL state. Node A2 triggers the election process, wins it, and is promoted to the new master to keep serving. After this failover, the partition containing the majority of the nodes can continue working.
- The minority partition
Nodes A and A1, located in the minority partition, detect the PFAIL state of nodes B and C. However, since this cannot be confirmed by a majority of the master nodes, B and C never reach the FAIL state, and the subsequent failover cannot happen.
Redis Cluster summary
This article has introduced how Redis Cluster works in three main parts: cluster structure, data sharding, and the fault tolerance mechanism, covering almost everything about Redis Cluster. I hope it helps you learn Redis Cluster.
While studying the official documentation and the source code, I ran into plenty of puzzling details; by repeatedly walking through the code I uncovered the answers one by one, and finally built up a complete picture.
And here we are at last: this article took a month from start to finish, written a little bit each day.
References
- Redis Cluster Specification: redis.io/topics/clus…
- Redis Cluster Tutorial: redis.io/topics/clus…
- Redis source code 6.0.10