1. Implementation of the CLUSTER MEET command

By sending the CLUSTER MEET command to node A, a client can ask node A to add another node B to the cluster that node A currently belongs to:

    CLUSTER MEET <ip> <port>

On receiving the command, node A performs a handshake with node B to confirm each other's existence and lay the groundwork for further communication:

1) Node A creates a clusterNode structure for node B and adds the structure to its clusterState.nodes dictionary.

2) Node A then sends a MEET message to node B at the IP address and port given in the CLUSTER MEET command.

3) If everything goes well, node B receives the MEET message sent by node A, creates a clusterNode structure for node A, and adds the structure to its own clusterState.nodes dictionary.

4) Node B then returns a PONG message to node A.

5) If everything goes well, node A receives the PONG message returned by node B; from this PONG message node A knows that node B has successfully received its MEET message.

6) Node A then sends a PING message back to node B.

7) If everything goes well, node B receives the PING message from node A; from this PING message node B knows that node A has successfully received its PONG message, and the handshake is complete.

After that, node A spreads node B's information to the other nodes in the cluster through the Gossip protocol, so that the other nodes also shake hands with node B. After a period of time, node B becomes known to all nodes in the cluster.
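The seven handshake steps can be modeled with a short sketch. This is only an illustration under simplifying assumptions: the Node class and its nodes dictionary below are stand-ins for the real clusterNode/clusterState structures, and messages are delivered as direct method calls instead of cluster-bus packets.

    # Sketch of the CLUSTER MEET handshake with simplified in-process nodes.
    class Node:
        def __init__(self, name):
            self.name = name
            self.nodes = {}                    # stand-in for clusterState.nodes

        def cluster_meet(self, other):
            self.nodes[other.name] = other     # step 1: A creates an entry for B
            other.on_meet(self)                # step 2: A sends MEET to B

        def on_meet(self, sender):
            self.nodes[sender.name] = sender   # step 3: B records A
            sender.on_pong(self)               # step 4: B replies with PONG

        def on_pong(self, sender):
            sender.on_ping(self)               # steps 5-6: A got the PONG and sends PING

        def on_ping(self, sender):
            pass                               # step 7: B knows A got the PONG; handshake done

    a, b = Node("A"), Node("B")
    a.cluster_meet(b)
    assert "B" in a.nodes and "A" in b.nodes   # both nodes now know each other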

 

2. Slot assignment

A Redis cluster stores the database's key-value pairs by sharding: the cluster's whole database is divided into 16384 slots, each key in the database belongs to one of these 16384 slots, and each node in the cluster can handle anywhere from zero up to all 16384 slots.

When every one of the 16384 slots has a node handling it, the cluster is online (ok). Conversely, if even one slot has no node handling it, the cluster is offline (fail).
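Concretely, the online check amounts to verifying that every slot has an owner. In the sketch below, slot_owner is a hypothetical lookup that returns the node assigned to a slot, or None if the slot is unassigned:

    # Sketch of the ok/fail rule; slot_owner is a hypothetical lookup.
    def cluster_state_ok(slot_owner):
        # ok only if all 16384 slots are assigned to some node
        return all(slot_owner(i) is not None for i in range(16384))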

You can assign one or more slots to a node by issuing the CLUSTER ADDSLOTS command:

    CLUSTER ADDSLOTS <slot> [slot ...]

    127.0.0.1:7000> CLUSTER ADDSLOTS 0 1 2 3 4 ... 5000
    OK

    127.0.0.1:7000> CLUSTER INFO
    cluster_state:ok

 

The slots and numslots attributes of the clusterNode structure record which slots the node is responsible for:

    struct clusterNode {
        // ...
        unsigned char slots[16384/8];  // bit array: bit i is 1 if this node handles slot i
        int numslots;                  // number of slots this node handles
        // ...
    };

The slots attribute is a bit array 2048 bytes (16384 bits) long. If the bit at index i of the slots array is 1, the node is responsible for slot i; if it is 0, the node is not.
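The bit manipulation involved is straightforward. Below is a minimal sketch using a Python bytearray in place of the C unsigned char array; it illustrates the idea rather than reproducing the Redis source:

    # Sketch of the slots bit array; a bytearray stands in for unsigned char slots[16384/8].
    slots = bytearray(16384 // 8)

    def set_slot(slots, i):
        # mark slot i as handled by this node
        slots[i // 8] |= 1 << (i % 8)

    def handles_slot(slots, i):
        # True if bit i of the array is 1
        return (slots[i // 8] >> (i % 8)) & 1 == 1

    set_slot(slots, 5000)
    assert handles_slot(slots, 5000) and not handles_slot(slots, 5001)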

 

In addition to recording its own slots in the slots and numslots attributes of its clusterNode structure, a node also sends its slots array to the other nodes in the cluster to tell them which slots it is currently handling.

 

The slots array in the clusterState structure records the assignment of all 16384 slots in the cluster:

    typedef struct clusterState {
        // ...
        clusterNode *slots[16384];     // slots[i] points to the clusterNode handling slot i
        // ...
    } clusterState;

    

3. Running commands in the cluster

Once all 16384 slots in the database have been assigned, the cluster comes online and clients can send data commands to the nodes in the cluster.

When a client sends a node a command involving a database key, the receiving node calculates which slot the key belongs to and checks whether that slot has been assigned to itself:

If the slot is assigned to the current node, the node executes the command directly. Otherwise, the node returns a MOVED error that redirects the client to the correct node, and the client sends the command it wanted to execute again.

Calculate which slot the key belongs to:

    def slot_number(key):
        # CRC16(key) computes the CRC-16 checksum of the key
        return CRC16(key) & 16383
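The CRC16 used by Redis Cluster is the CRC-16/XMODEM variant (polynomial 0x1021, initial value 0). A self-contained sketch of the slot calculation:

    # Sketch of the slot calculation with a plain-Python CRC-16/XMODEM.
    def crc16(data):
        crc = 0
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = (crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1
                crc &= 0xFFFF
        return crc

    def slot_number(key):
        return crc16(key) & 16383

    print(slot_number(b"foo"))    # some value between 0 and 16383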

Check whether slot i is handled by the current node:

    clusterState.slots[i] == clusterState.myself

 

A cluster-aware client typically keeps socket connections to several nodes in the cluster, so redirecting to another node essentially means switching to a different socket before resending the command.

One database-related difference between a cluster node and a stand-alone server is that a node can only use database 0.
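Putting the slot check and the redirection together, a toy in-process model of MOVED handling looks like this. Node and Client below are simplified stand-ins, not real Redis nodes or a real client library, and slot_number is the sketch shown above:

    # Toy model of MOVED redirection; uses slot_number() from the sketch above.
    class Node:
        def __init__(self, name, slot_table):
            self.name, self.slot_table, self.data = name, slot_table, {}

        def execute(self, key, value=None):
            i = slot_number(key)
            owner = self.slot_table[i]
            if owner is not self:
                return ("MOVED", i, owner.name)      # redirect to the owning node
            if value is not None:
                self.data[key] = value               # stand-in for SET
            return self.data.get(key)                # stand-in for GET

    class Client:
        def __init__(self, nodes):
            self.nodes = nodes                       # one "connection" per node

        def execute(self, node_name, key, value=None):
            reply = self.nodes[node_name].execute(key, value)
            if isinstance(reply, tuple) and reply[0] == "MOVED":
                # switch to the socket of the correct node and resend the command
                return self.nodes[reply[2]].execute(key, value)
            return reply

    slot_table = {}
    a, b = Node("A", slot_table), Node("B", slot_table)
    for i in range(16384):
        slot_table[i] = a if i < 8192 else b         # A handles 0-8191, B the rest
    client = Client({"A": a, "B": b})
    client.execute("A", b"foo", b"bar")              # may be answered with MOVED and retried
    print(client.execute("A", b"foo"))               # b'bar'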

 

4. Resharding

A resharding operation reassigns any number of slots from one node (the source node) to another node (the target node), and the key-value pairs belonging to those slots are moved from the source node to the target node as well. (Resharding is not rehashing; do not confuse it with client-side consistent-hash sharding.)

Resharding can be done online without the cluster going offline, and both source and target nodes can continue to process command requests.

Resharding is carried out by Redis's cluster management tool, redis-trib, which provides all the commands needed for resharding and performs the operation by sending commands to the source and target nodes.

The steps redis-trib takes to reshard a single slot are as follows (a sketch of the full sequence follows the list):

1) redis-trib sends CLUSTER SETSLOT <slot> IMPORTING <source_id> to the target node, preparing it to import the slot's key-value pairs from the source node.

2) redis-trib sends CLUSTER SETSLOT <slot> MIGRATING <target_id> to the source node, preparing it to migrate the slot's key-value pairs to the target node.

3) redis-trib sends CLUSTER GETKEYSINSLOT <slot> <count> to the source node to obtain up to count key names belonging to the slot.

4) For each key name obtained in step 3, redis-trib sends a MIGRATE <target_ip> <target_port> <key_name> 0 <timeout> command to the source node to migrate the selected key atomically from the source node to the target node.

5) Steps 3 and 4 are repeated until all key-value pairs belonging to the slot have been migrated from the source node to the target node.

6) redis-trib sends the CLUSTER SETSLOT <slot> NODE <target_id> command to any node in the cluster to assign the slot to the target node. This assignment is propagated to the whole cluster through messages, so eventually every node knows that the slot has been assigned to the target node.
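The per-slot procedure can be sketched as below. The send_command callable, and the source/target records with conn, node_id, ip and port fields, are hypothetical stand-ins for a real client library; this outlines the command sequence rather than redis-trib's actual source.

    # Sketch of resharding one slot; send_command(conn, *args) is assumed to
    # issue a single Redis command over an open connection and return its reply.
    def reshard_slot(send_command, source, target, slot, timeout_ms=60000, batch=100):
        send_command(target.conn, "CLUSTER", "SETSLOT", slot, "IMPORTING", source.node_id)
        send_command(source.conn, "CLUSTER", "SETSLOT", slot, "MIGRATING", target.node_id)
        while True:
            keys = send_command(source.conn, "CLUSTER", "GETKEYSINSLOT", slot, batch)
            if not keys:
                break                                  # every key of the slot has moved
            for key in keys:
                # atomically move one key from the source node to the target node
                send_command(source.conn, "MIGRATE", target.ip, target.port,
                             key, 0, timeout_ms)
        # announce the slot's new owner to the cluster
        send_command(source.conn, "CLUSTER", "SETSLOT", slot, "NODE", target.node_id)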

ASK error:

While the source node is migrating a slot to the target node, it can happen that some of the key-value pairs belonging to that slot are still on the source node while the rest have already been moved to the target node.

When a client sends the source node a command involving a database key, and that key happens to belong to the slot being migrated:

The source node first looks for the key in its own database; if it finds the key, it executes the client's command. If not, the key may already have been migrated to the target node, so the source node returns an ASK error that redirects the client to the target node importing the slot, and the client sends the command it wanted to execute again.
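Client-side handling of the ASK error can be sketched as follows. The send_command callable and the connections mapping are hypothetical stand-ins for a real cluster client; note that a real client must send ASKING to the target node before retrying, otherwise the target, which does not yet own the slot, would answer with MOVED.

    # Sketch of ASK handling; send_command(conn, *args) is a hypothetical helper.
    def execute_with_ask(send_command, connections, node, *command):
        reply = send_command(connections[node], *command)
        if isinstance(reply, str) and reply.startswith("ASK"):
            # reply format: "ASK <slot> <ip>:<port>"
            _, _slot, addr = reply.split()
            conn = connections[addr]
            send_command(conn, "ASKING")            # mark the next command as an ASK retry
            reply = send_command(conn, *command)    # retry once on the importing node
        return reply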

 

5. Replication and failover

Nodes in a Redis cluster are divided into master nodes and slave nodes. Master nodes handle slots, while a slave node replicates a master node and, when that master goes offline, takes over processing command requests in its place.

A node is made a slave of a given master with:

    CLUSTER REPLICATE <node_id>

Failure detection:

Each node in the cluster periodically sends PING messages to the other nodes to check whether they are online. If a node does not return a PONG message within the specified time, the sending node flags it as suspected offline (probable fail, PFAIL).

If more than half of the master nodes responsible for handling slots in the cluster report master node X as suspected offline, X is marked as offline (FAIL), and the node that marked X as FAIL broadcasts a FAIL message about X to the cluster. Every node that receives this FAIL message immediately marks X as FAIL.
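The promotion from PFAIL to FAIL is a simple majority rule, sketched below under the assumption that pfail_reports maps a suspected node to the set of slot-handling masters currently reporting it as PFAIL (the real implementation keeps a per-node list of fail reports):

    # Sketch of the PFAIL -> FAIL promotion rule.
    def should_mark_fail(pfail_reports, suspect, voting_masters):
        needed = len(voting_masters) // 2 + 1            # more than half of the masters
        return len(pfail_reports.get(suspect, set())) >= needed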

Failover:

When a slave node notices that the master it is replicating has entered the FAIL state, it starts a failover for that offline master. The failover steps are as follows (a sketch follows the list):

1) One slave node is selected from all the slave nodes replicating the offline master.

2) The selected slave node executes the SLAVEOF NO ONE command and becomes the new master node.

3) The new master node revokes the slot assignments of the offline master node and assigns all of those slots to itself.

4) The new master node broadcasts a PONG message to the cluster. This PONG message lets the other nodes in the cluster know immediately that this node has changed from a slave into a master and has taken over the slots previously handled by the offline node.

5) The new master node starts accepting command requests for the slots it is now responsible for, and the failover is complete.
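Steps 2 to 4 amount to a slot takeover, sketched here with simplified stand-ins for the real clusterState/clusterNode structures (the PONG broadcast of step 4 is only noted as a comment):

    # Sketch of the slot takeover during failover, using toy node objects.
    class ToyNode:
        def __init__(self, name, role, slots=()):
            self.name, self.role, self.slots = name, role, set(slots)

    def failover(cluster_slots, new_master, old_master):
        new_master.role = "master"                 # effect of SLAVEOF NO ONE (step 2)
        new_master.slots |= old_master.slots       # take over the failed master's slots (step 3)
        old_master.slots = set()
        for i in new_master.slots:
            cluster_slots[i] = new_master          # update the slot -> node mapping
        # step 4: broadcast a PONG so the rest of the cluster learns of the takeover

    cluster_slots = {}
    m = ToyNode("M", "master", range(0, 5001))
    s = ToyNode("S", "slave")
    for i in m.slots:
        cluster_slots[i] = m
    failover(cluster_slots, s, m)
    assert cluster_slots[0] is s and s.role == "master"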

Electing a new master node (a vote-counting sketch follows the list):

1) The cluster's configuration epoch is an auto-incrementing counter with an initial value of 0.

2) When a node in the cluster starts a failover operation, the value of the cluster's configuration epoch is increased by one.

3) Within each configuration epoch, every master node responsible for handling slots has a single vote, and the first slave node that asks a master for its vote receives it.

4) When a slave node finds that the master it is replicating has gone offline, it broadcasts a CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST message to the cluster, asking every master node that receives the message and has the right to vote to vote for this slave node.

5) If a master node has the right to vote (it is handling at least one slot) and has not yet voted for another slave node, it returns a CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK message to the requesting slave node, indicating that it supports this slave becoming the new master.

6) Each slave node taking part in the election counts the CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK messages it receives to determine how many master nodes support it.

7) If there are N master nodes with voting rights in the cluster, a slave node that collects N/2+1 or more votes is elected as the new master node.

8) Because each master node can vote only once per configuration epoch, if there are N voting masters, at most one slave node can collect N/2+1 or more votes, which guarantees that only one new master is chosen.

9) If no slave node collects enough supporting votes within a configuration epoch, the cluster enters a new configuration epoch and holds the election again, until a new master node is elected.
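The vote-counting rule can be sketched with a toy model in which each voting master grants its one vote per epoch to the first request it sees; the real implementation exchanges CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST/ACK messages over the cluster bus:

    # Toy model of the election: one vote per master per configuration epoch.
    class VotingMaster:
        def __init__(self):
            self.voted_epoch = 0       # last epoch in which this master voted

    def request_votes(masters, epoch):
        acks = 0
        for master in masters:
            if master.voted_epoch < epoch:   # has not voted in this epoch yet
                master.voted_epoch = epoch   # grant the vote (FAILOVER_AUTH_ACK)
                acks += 1
        return acks

    def wins_election(acks, n_masters):
        return acks >= n_masters // 2 + 1    # needs a majority: N/2 + 1 votes

    masters = [VotingMaster() for _ in range(5)]
    assert wins_election(request_votes(masters, epoch=1), len(masters))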

This way of electing a new master node is very similar to the way a leader Sentinel is elected, because both are leader elections based on the Raft algorithm.