Introduction

Since Redis 3.0, a complete decentralized solution covering sharding, replication (the replication mechanism itself reuses the original master/slave mechanism, but the cluster is now aware of master and slave roles), and failover has been provided between nodes; it is called Redis Cluster. The work previously done by a proxy or by Sentinel is merged into the regular Redis nodes.

topology

A Redis Cluster consists of multiple Redis node groups. The data served by different node groups does not intersect (each node group corresponds to one shard of the data). Within a node group, the data of the two kinds of nodes is consistent in quasi-real time, which is guaranteed by the asynchronous master/slave replication mechanism. A node group has exactly one master node (read/write service) and zero or more slave nodes (read-only service).

Redis Cluster stores key-value pairs by sharding. The entire key space is divided into 16384 slots, and every key in the database belongs to exactly one of them. Each node handles anywhere from 0 to 16384 slots, and no slot is handled by more than one master.

The node structure of the Redis Cluster is shown below:

In this example figure, the complete set of key-value data is divided into five slots for illustration (a real Redis Cluster always has 16,384 slots). Nodes A and B are master nodes and provide read/write services. A and A1 form a master/slave node group and synchronize data through master/slave replication: A1 is the slave of A, A owns slots 1/2/3, and A1 holds a replica of A's slot data. Similarly, B1 and B2, as slaves of B, form a node group with B.

In the Redis Cluster node structure example above, you can see that nodes interact with each other through the Redis Cluster Bus. The information they exchange includes:

  1. Mapping between data fragments (slots) and nodes

  2. Available status of each node in the cluster

  3. When the cluster structure (configuration) changes, the nodes agree on the new configuration through certain protocols. The migration of data shards, master/slave switchover, failure of a master, and changes of master/slave relationships all lead to changes of the cluster structure.

  4. The publish/subscribe function requires message propagation between nodes, which in the cluster edition is implemented internally over the bus.

The Redis Cluster Bus uses a separate port. The Bus is the internal communication mechanism between nodes and exchanges binary-serialized information, which is more efficient than the character-based serialization used between clients and the Redis server. A client can connect to any node in the cluster and gradually learns the data-shard mapping of the whole cluster through this interaction.

Configuration consistency

In a decentralized implementation, the cluster topology does not live on a single configuration node, and doing without such a node introduces new consistency issues: how do the nodes agree on the cluster topology, and how does the Redis Cluster configuration mechanism work? Redis Cluster makes the cluster configuration eventually consistent across nodes by introducing two self-incrementing epoch version variables.

Configuration information data structures

Each Node in the Redis Cluster saves the configuration information of the Cluster, which is stored in the ClusterState, as shown in the following figure:

The semantics of the above variables are as follows:

  • ClusterState records the cluster configuration state from the perspective of a node in the cluster
  • CurrentEpoch indicates the maximum version number in the entire cluster. The version number increases every time the cluster information changes
  • Nodes is a list of all cluster Nodes known to the node, including the node itself
  • ClusterNode records information about each node, including the epoch of the node, the data fragment (slot) of the node, the list of slave nodes when the node is master, and the master node when the node is slave.
  • Each node contains a globally unique NodeId
  • When data shards migrate between node groups, the Redis Cluster continues to serve client requests; the migration process is controlled by a set of "shard migration related state" variables
  • When a master in the cluster fails, the Redis Cluster automatically detects this and triggers a failover that promotes one of the master's slaves to be the new master (a set of variables controls the failover process).

As can be seen, each node preserves the cluster structure from its own perspective: it describes the data sharding scheme and the master/slave relationships between nodes, uses epochs as version numbers to keep the cluster structure information (configuration) consistent, and also controls the processes of data migration and failover.
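The variables described above can be sketched as a simplified data model. This is a hypothetical Python rendering of the concepts, not the actual C structs in the redis source; the field names follow the description in this section:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ClusterNode:
    """Per-node record, as described above (names are illustrative)."""
    node_id: str                                     # globally unique NodeId
    epoch: int = 0                                   # version of this node's slot/role info
    slots: List[int] = field(default_factory=list)   # slots served (if master)
    slaves: List[str] = field(default_factory=list)  # slave NodeIds (if master)
    master_id: Optional[str] = None                  # master NodeId (if slave)

@dataclass
class ClusterState:
    """One node's view of the whole cluster configuration."""
    myself: str                                      # this node's NodeId
    current_epoch: int = 0                           # max version seen cluster-wide
    nodes: Dict[str, ClusterNode] = field(default_factory=dict)
```

Note how the same ClusterNode type serves both roles: a master carries a slave list, a slave carries a pointer back to its master.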

Information interaction

In the decentralized architecture, there is no unified configuration center; a node's knowledge of the cluster comes from information exchanged with other nodes, which is carried out over the Redis Cluster Bus (on its own port). The structure of the exchanged information is shown in the figure below:

The Type field of clusterMsg specifies the message type. The consistency of configuration information depends on two message types, PING and PONG. Each node periodically and frequently sends PING messages to other nodes and receives PONG replies. The Gossip section of a message contains information about other nodes known to the sender, which the receiver uses to update its own knowledge of the cluster.

For large clusters, carrying the entire cluster structure in every frequent PING/PONG exchange would impose a heavy network burden. Therefore, each Redis Cluster PING/PONG packet contains information about only a few randomly selected nodes, so that the cluster state still converges quickly after several rounds of interaction over a short time.

Achievement of consistency

When the cluster does not change, each node obtains the structure of the cluster through the gossip protocol after several rounds of interaction and reaches a consistent state. In the case of failover, shard migration, and similar events, the cluster structure changes; the nodes that learn of the change first spread their latest information through the cluster using the epoch variables to reach eventual consistency.

  • The epoch attribute of clusterNode has per-node granularity: it versions a single node's data shards and master/slave information.
  • The currentEpoch property of clusterState has cluster-wide granularity and assists in generating monotonically increasing epochs. Because currentEpoch is also maintained on each node, Redis Cluster updates the currentEpoch of every node within a certain time window whenever the structure changes.

Updates follow the following rules:

  • When a node is the first to learn of an information change, it increases its currentEpoch past the maximum value in the cluster and uses this currentEpoch as the epoch of the new version.
  • When a node receives a currentEpoch larger than its own, it updates its own currentEpoch.
  • If the epoch of a node carried in a received Redis Cluster Bus message is greater than the epoch recorded locally for that node, the local mapping information is updated to the content of the message.
  • If a received Redis Cluster Bus message mentions a node that is unknown locally, that node is added to the local configuration.

The rules above ensure that information updates are always unidirectional, converging toward larger epoch values. Meanwhile, epochs also increase monotonically, via the currentEpoch increment, with each configuration change, which keeps the direction of information updates stable on every node.
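The four update rules can be sketched as a toy model. This is a simplified illustration, not the actual redis implementation; `NodeInfo`, `NodeView`, and the message shape are all assumptions made for the sketch:

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet

@dataclass
class NodeInfo:
    epoch: int                 # per-node config version
    slots: FrozenSet[int]      # slots this node claims

class NodeView:
    """One node's local view of the cluster configuration."""
    def __init__(self):
        self.current_epoch = 0
        self.nodes: Dict[str, NodeInfo] = {}

    def local_change(self, node_id: str, slots: FrozenSet[int]) -> None:
        # Rule 1: the first node to learn of a change bumps currentEpoch
        # past the cluster maximum and stamps the new config with it.
        self.current_epoch += 1
        self.nodes[node_id] = NodeInfo(self.current_epoch, slots)

    def on_bus_message(self, sender_epoch: int,
                       gossip: Dict[str, NodeInfo]) -> None:
        # Rule 2: adopt a larger currentEpoch seen on the bus.
        self.current_epoch = max(self.current_epoch, sender_epoch)
        for node_id, info in gossip.items():
            known = self.nodes.get(node_id)
            if known is None:
                # Rule 4: a node we did not know about gets added.
                self.nodes[node_id] = info
            elif info.epoch > known.epoch:
                # Rule 3: a larger per-node epoch overwrites our mapping.
                self.nodes[node_id] = info
```

Because stale gossip (a smaller epoch) never overwrites newer local state, repeated exchanges can only move every view toward the same maximum, which is exactly the unidirectional convergence described above.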

sharding

Different node groups serve subsets of data (shards) that have no intersection with each other. Because Redis Cluster has no separate proxy or configuration server, requests from clients must be routed to the correct shard.

Data sharding

The Redis Cluster divides all data into 16,384 shards (slots), and each node is responsible for a portion of them. Each key-value pair is mapped to one of the 16384 slots by a data distribution algorithm based on the key. The algorithm is as follows:

slotId = crc16(key)%16384

The client decides which Redis node to route a request to based on the slotId. The cluster does not support multi-key commands whose keys span nodes: for SINTERSTORE, for example, if the slots of the two keys live on different nodes, the operation fails.
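The routing decision above can be reproduced in a few lines. Redis uses the CRC16-CCITT (XMODEM) variant of CRC16; the sketch below implements it bit by bit:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def key_slot(key: str) -> int:
    """slotId = crc16(key) % 16384, as in the formula above."""
    return crc16(key.encode()) % 16384
```

A cluster-aware client keeps a slot-to-node table and sends each command to the node that `key_slot(key)` points at.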

A key usually has a business meaning, for example:

  • Key of a commodity transaction record: product_trade_prod123
  • Key of commodity transaction details: product_detail_prod123

The slot IDs computed for these two keys by the data distribution algorithm will usually differ, so the records may land in different slots. If you need to run a single command that operates on these two highly correlated records, it cannot be executed atomically. To solve this, Redis introduced the hash tag (the slot can be computed from just a portion of the key) so that related records are placed in the same shard. The keys above can be changed as follows:

  • Key of a commodity transaction record: product_trade_{prod123}
  • Key of commodity transaction details: product_detail_{prod123}

As you can see, Redis then takes only the string between { and } as the input to the data distribution algorithm.
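The hash-tag rule can be sketched as: take the substring between the first `{` and the first `}` that follows it, provided that substring is non-empty; otherwise hash the whole key. A minimal sketch:

```python
def hash_tag(key: str) -> str:
    """Return the effective string that is fed to the slot hash."""
    start = key.find('{')
    if start != -1:
        end = key.find('}', start + 1)
        if end != -1 and end > start + 1:   # non-empty tag only
            return key[start + 1:end]
    return key

# Both keys hash the same substring, so they land in the same slot.
assert hash_tag("product_trade_{prod123}") == "prod123"
assert hash_tag("product_detail_{prod123}") == "prod123"
```

Note the non-empty check: a key like `empty{}tag` contains an empty tag, so the whole key is hashed, which matches Redis's documented behavior.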

Client routing

Compared with a stand-alone Redis client, the client of a Redis Cluster must be able to recognize routing semantics and to cache routes.

If the key accessed by the client is not on the node it contacted (say, node A), the client receives a MOVED response informing it of the correct routing information (say, node B), as shown below:

After receiving the MOVED response, the client resends the original command to the node B named in the response. If a shard migration has occurred and B is no longer the correct node either, the client will receive another MOVED response. The client updates its internal routing cache from MOVED responses so that next time it can route directly to the correct node, reducing the number of round trips.

While the cluster is in the middle of a slot migration, client routing is controlled with the ASK response, as shown in the following figure:

In the figure above, the client requests a key of node A that maps to slot 2, but by the time of the request slot 2 is no longer on node A (it has been migrated to node B). Node A answers with an ASK redirect (for example, ASK 2 127.0.0.1:7003) telling the client to retry on the node named in the reply. The client then first sends the ASKING command to node B (which turns on the REDIS_ASKING flag on that connection) and then resends the original command to node B. Node B executes the process shown below:

From the preceding flowchart, you can see the purpose of the ASKING flag: if slot 2 is being imported and the client connection has the ASKING flag set, the command is executed; otherwise a MOVED response is returned. Note that REDIS_ASKING is a one-time flag: it is cleared after the next command is executed.

The differences between the ASK and MOVED responses:

  • They are returned in different scenarios: if the key accessed by the client belongs to a slot the node does not serve at all, MOVED is returned; if the slot is being migrated away, ASK is returned.
  • A MOVED response updates the client's routing cache (subsequent operations go directly to the new node), whereas ASK only redirects the current request: the routing cache is not updated, and further operations on the same slot are still routed to the old node.

The migration may last for a while, during which the data of a slot is distributed across both the old and the new node. Because a MOVED response changes the client's routing cache, if both nodes answered with MOVED for keys of the migrating slot, the client's cache would change back and forth constantly. The ASK response was therefore introduced to separate redirection from routing-cache updates.
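The contrast between the two redirects can be sketched from the client's side. This is a toy model under assumed names, not a real client library (which would be considerably more involved):

```python
class ClusterClient:
    """Toy routing cache showing how MOVED and ASK are treated differently."""

    def __init__(self, slot_map):
        self.slot_map = dict(slot_map)      # slot -> node address

    def route(self, slot):
        return self.slot_map[slot]

    def handle_redirect(self, kind, slot, node):
        """Return the node to retry the command on."""
        if kind == "MOVED":
            # Permanent relocation: update the cache so later requests
            # for this slot go straight to the new owner.
            self.slot_map[slot] = node
            return node
        if kind == "ASK":
            # One-shot redirect during migration: retry on the target
            # (prefixed with ASKING) but leave the cache untouched.
            return node
        raise ValueError(f"unknown redirect {kind!r}")
```

The cache is mutated only on MOVED, so a slot caught mid-migration does not cause the table to flap between old and new owners.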

Shard migration

In a stable Redis Cluster, the node serving each slot is fixed. In some situations, however, the mapping between nodes and shards needs to change:

  • New nodes are added as master
  • A node group needs to go offline
  • Slot distribution needs to be adjusted for uneven load

In these cases, shard migration is required. The triggering and process control of a migration is done by an external system; Redis Cluster provides only the primitives the migration process needs. There are two kinds of primitives:

  • Node migration status setting: mark the source and target nodes before the migration
  • Atomic key migration command: performs the actual steps of the migration

In the following example, slot1 is migrated from node A to node B.

1. Send a status change command to node B, setting the status of slot 1 on B to IMPORTING

2. Send a status change command to node A, setting the status of slot 1 on A to MIGRATING

3. For every key in slot 1, send the MIGRATE command to A, telling A to migrate that key to B

When slot 1 on node A is in the MIGRATING state, the slot is being migrated away from node A. To keep the data consistent, node A serves reads and writes for this slot differently from the normal state. For a MIGRATING slot:

  • If the key accessed by the client is not migrated, the key is processed normally
  • If the key has been migrated or does not exist, node A returns an ASK redirect telling the client to go to B

When slot 1 on node B is in the IMPORTING state, the slot is being migrated into node B. B still serves reads and writes for the slot, but with the following difference:

  • When a client request did not arrive via an ASK redirect, the client does not know a migration is in progress, and before the migration completes it may well operate on a key that is still on A. If B executed such a request, the key could also be changed on A, and the values on A and B would later conflict. Therefore B does not process requests for this slot that were not ASK-redirected; instead it returns MOVED, bouncing the client back to A.

This state control ensures that a given key is always handled on the source node before it is migrated and on the target node after, eliminating the value conflicts that simultaneous writes on both sides would cause. In addition, keys created during the migration are always created on the target node (the source node no longer creates keys), so the migration takes bounded time and is guaranteed to finish at some point.
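The behavior of the two sides during migration can be condensed into a small decision table. This is a simplified model of the rules described above, with hypothetical function names:

```python
def source_reply(slot_migrating: bool, key_present: bool) -> str:
    """How the source node (slot in MIGRATING state) answers a request."""
    if not slot_migrating or key_present:
        return "OK"      # key not migrated yet: process normally
    return "ASK"         # key migrated or absent: redirect to the target

def target_reply(slot_importing: bool, asking: bool) -> str:
    """How the target node (slot in IMPORTING state) answers a request."""
    if not slot_importing or asking:
        return "OK"      # ASK-redirected clients are served
    return "MOVED"       # everyone else is bounced back to the source
```

Read together, the two functions show why each key has exactly one authoritative home at every moment: the source serves it until it leaves, and the target serves it only for clients that the source explicitly redirected.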

The atomized MIGRATE command transfers the data to B, waits for B to acknowledge receipt, and then deletes the key on A. The slaves of A and B pick up the keys added and deleted on their masters through ordinary master/slave replication.

After all keys have been migrated, the migration controller uses the CLUSTER SETSLOT command to make B's shard information include the migrated slot. The setting process increments B's epoch to the new cluster-wide maximum, after which the epoch mechanism spreads the new mapping through the cluster (achieving consistency as described above).

failover

Like Sentinel, Redis Cluster has a complete mechanism for node fault detection, fault state consistency, and master/slave switchover. Let's look at each part below.

Failover state changes

1. Fault discovery: when a master goes down, how do the other nodes in the cluster sense it?

2. Fault confirmation: how do multiple nodes agree on whether a master is down?

3. Slave election: once the cluster confirms that a master is down, how is one of its slaves promoted to be the new master, and how is a winner chosen among multiple slaves?

4. Cluster structure change: after a slave wins the election and is promoted to master, how is the cluster structure information updated?

Fault discovery

Redis Cluster nodes periodically exchange PING/PONG messages through the Redis Cluster Bus. When a node goes down, the PING messages sent to it go unanswered. When no PONG reply arrives within a certain period (NODE_TIMEOUT), the sender considers the node faulty and marks it as POSSIBLE FAIL (PFAIL). The PFAIL status is then propagated to the other nodes of the cluster in the Gossip sections of subsequent PING/PONG messages.

Two Redis Cluster nodes are connected over the Redis Cluster Bus by TCP. When a PING gets no PONG, the peer may really be down, or the TCP connection may merely have broken; a response timing out because of a broken connection is a false positive. Although such false positives are usually harmless (the other nodes still reach the peer normally, so the suspicion is ignored), they can be avoided. Redis Cluster uses a pre-retry mechanism to eliminate them: if no response has arrived within NODE_TIMEOUT/2, the node reconnects and resends the PING; if a reply then arrives within a short time, the peer is considered healthy.
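The timing heuristic can be sketched as a tiny classifier. This is an illustrative model of the pre-retry rule, not the actual redis code; the function name and return labels are assumptions:

```python
def probe_state(now: float, last_pong: float, node_timeout: float) -> str:
    """Classify a peer by how long its PONG has been outstanding.

    Halfway through NODE_TIMEOUT the node reconnects and re-PINGs to
    rule out a dead TCP connection; only after the full timeout does
    it mark the peer PFAIL, which gossip then spreads.
    """
    waited = now - last_pong
    if waited > node_timeout:
        return "PFAIL"        # suspected down; gossip will spread this
    if waited > node_timeout / 2:
        return "RECONNECT"    # retry the PING on a fresh connection first
    return "OK"
```

The middle band is what removes false positives: a broken connection fails fast on reconnect, while a truly dead peer stays silent through both halves of the window.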

Fault confirmation

Under a network partition, node B may not be faulty yet be unreachable from A while remaining reachable from other nodes such as C and D. In that case, only A marks B as PFAIL; the other nodes still consider B normal. The views of A and of C/D are now inconsistent, and Redis Cluster reaches a consensus through the fault confirmation protocol.

Every node in the cluster is also a gossip receiver, so A receives gossip messages from other nodes reporting whether they see B in the PFAIL state. Once A has collected PFAIL reports on B from a sufficient number of master nodes, A changes B's status from PFAIL to FAIL, meaning B is confirmed to be down, and the slave election process can then be initiated.

The flow by which B's status transitions from PFAIL to FAIL in node A's cluster information is shown below:

Slave election

In the figure above, B is the master of A, and B has been recognized by the cluster as being in the FAIL state, so A starts a campaign to become the new master.

If B has multiple slaves (A/E/F) that all know B is in the FAIL state, A/E/F may all start campaigns at the same time. When B has three or more slaves, several rounds of elections may well fail. To reduce such conflicts, the slave with the highest priority (the newer, i.e. more complete, its data, the higher its priority) is more likely to start its campaign first, which increases the probability that an election succeeds.

Before the election, the slave increments currentEpoch and places the new, latest epoch into a FAILOVER_AUTH_REQUEST message, which it sends to the other masters to start the campaign. A master that receives the request replies with a FAILOVER_AUTH_ACK message indicating whether it grants its vote; it agrees only if it has not already voted in this epoch.
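A minimal model of the voting rule, sketched under assumed names: each master grants at most one FAILOVER_AUTH_ACK per epoch, and a slave is promoted only with a majority of the masters' votes.

```python
class Master:
    """A voting master: remembers the last epoch it voted in."""
    def __init__(self):
        self.last_vote_epoch = -1

    def on_auth_request(self, epoch: int) -> bool:
        # Grant FAILOVER_AUTH_ACK at most once per epoch.
        if epoch > self.last_vote_epoch:
            self.last_vote_epoch = epoch
            return True
        return False

def run_election(slave_epoch: int, masters) -> bool:
    """Return True if the campaigning slave wins a majority of votes."""
    votes = sum(m.on_auth_request(slave_epoch) for m in masters)
    return votes > len(masters) // 2
```

The one-vote-per-epoch rule is what prevents two slaves of the same failed master from both collecting a majority in the same epoch; a losing slave must bump the epoch and try again.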

Notice of Structural change

When the slave has received affirmative replies from more than half of the masters, it replaces B as the new master. It then broadcasts the fact that it has become master in PONG messages carrying the latest epoch, so that the other nodes of the cluster update their topology as soon as possible.

When B becomes available again, it still believes at first that it is a master, but through the gossip protocol it gradually learns that A has replaced it, and it then demotes itself to a slave of A.

Availability and performance

Redis Cluster also provides several mechanisms to improve performance and availability.

Read/write separation of Redis Cluster

In scenarios that require read/write separation, some data consistency can be sacrificed so that read requests achieve higher throughput: the slaves are expected to handle read requests and share the master's load.

In the data-shard mapping, the node responsible for a slot is always a master; clients are routed to the masters via MOVED responses. Even if a client sends a request directly to a slave, the slave replies with MOVED pointing to its master.

For this purpose, Redis Cluster introduces the READONLY command. After a client sends this command to a slave, the slave no longer answers read operations with MOVED-to-master but processes them directly; this is called the slave's READONLY mode. The READWRITE command cancels the slave's READONLY mode.

Master single point protection

As an example of single-point protection for masters, suppose the initial structure of the cluster looks like this:

Masters A and B have one slave and two slaves respectively. If A1 goes down, the cluster structure becomes:

Now A is a single point: once A goes down, its slots become unavailable. At this point, Redis Cluster migrates a slave of B (say B1) to make it a slave of A. The structure after this replica migration is as follows:

This keeps at least one slave attached to every master in the cluster: as long as the cluster maintains 2 * (number of masters) + 1 nodes, replica migration automatically preserves high availability even if any single node goes down.

Reference books: Deep Into Distributed Caching, Redis Design and Implementation