
I. Master-slave replication

1. Introduction

Master-slave replication is the cornerstone of distributed Redis and the basis of Redis high availability. In Redis, the server being replicated is called the master server, and the server doing the replicating is called the slave server.



Configuring master-slave replication is very simple. There are three methods (IP is the master's IP address, PORT is the master's Redis port):

  1. Configuration file: add slaveof IP PORT to the conf file
  2. Command: in the Redis client, run slaveof IP PORT
  3. Startup parameter: ./redis-server --slaveof IP PORT
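For instance, the three methods above might look like this in practice, a sketch using the master address from the resource list later in this article (the paths and addresses are placeholders to adapt to your setup; note that Redis 5.0+ prefers the replicaof spelling, though slaveof still works):

```
# 1. Configuration file: add to the slave's redis.conf
slaveof 192.168.211.104 6379

# 2. Command: from a client connected to the slave
#    redis-cli> slaveof 192.168.211.104 6379

# 3. Startup parameter
#    ./redis-server --slaveof 192.168.211.104 6379
```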

2. Evolution of master-slave replication

Redis's master-slave replication mechanism was not as polished as it is in 6.x from the beginning; it was refined version by version, through roughly three stages:

  • Before 2.8
  • 2.8 to 4.0
  • After 4.0

As versions advanced, the master-slave replication mechanism gradually improved, but its essence has always been two things: synchronization (sync) and command propagation:

  • Sync: bringing the slave's data state up to the master's current data state, mainly during initialization or a later full synchronization.
  • Command propagate: when the master's data state is modified (a write, delete, etc.) so that master and slave become inconsistent, the master propagates the state-changing command to the slave, making the two consistent again.

2.1 Versions prior to 2.8

2.1.1 Synchronization

Prior to 2.8, synchronization was initiated by the slave sending the sync command to the master:

  1. The slave receives the slaveof IP PORT command from a client and establishes a socket connection to the master at IP:PORT
  2. Once the socket connects to the master, the slave associates it with a file event handler dedicated to replication, which handles the RDB file and the commands the master propagates afterwards
  3. The slave sends the sync command to the master
  4. On receiving sync, the master executes bgsave: a child process forked from the main process generates the RDB file, while every write that arrives after the snapshot begins is recorded in a buffer
  5. When bgsave completes, the master sends the RDB file to the slave; the slave first clears all of its own data, then loads the RDB file to bring its data state up to the master's state at snapshot time
  6. The master sends the buffered write commands to the slave, which receives and executes them
  7. Master-slave synchronization is complete
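To make the flow concrete, here is a toy Python model of steps 4-6 above: the "RDB file" is a point-in-time copy of the master's data, and writes that arrive after the snapshot starts are recorded in a buffer and replayed on the slave. All class and field names are illustrative, not Redis internals.

```python
class Master:
    def __init__(self):
        self.data = {}
        self.buffer = []            # writes recorded after the snapshot starts

    def write(self, key, value):
        self.data[key] = value
        self.buffer.append((key, value))

    def start_bgsave(self):
        self.buffer = []            # record writes from the snapshot point on
        return dict(self.data)      # the "RDB file": a point-in-time copy


master = Master()
master.write("k1", "v1")

rdb = master.start_bgsave()         # step 4: fork + snapshot
master.write("k2", "v2")            # a write arriving while bgsave runs

slave = {"stale": "x"}
slave.clear()                       # step 5: the slave clears its own data...
slave.update(rdb)                   # ...and loads the RDB
for key, value in master.buffer:    # step 6: replay the buffered writes
    slave[key] = value

print(slave == master.data)         # True: the two states converge
```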

2.1.2 Command Propagation

After synchronization completes, the master and slave rely on command propagation to keep their data states consistent. As shown in the figure below, after synchronization the master receives DEL K6 from a client and deletes K6; at that moment the slave still holds K6, so the two are inconsistent. To restore consistency, the master propagates the command that changed its data state to the slave; once the slave has executed the same command, master and slave are consistent again.

2.1.3 Defects

Master-slave replication before 2.8 looks flawless only because we have not yet considered network instability. CAP theory is the cornerstone of distributed storage systems, and the P (network partition) is unavoidable; Redis master-slave replication is no exception. When a network fault occurs between master and slave, the slave cannot communicate with the master for a while, and the master's data state keeps changing in the meantime, so when the slave reconnects the two are inconsistent. In versions before Redis 2.8, this inconsistency could only be resolved by sending sync again. sync does restore consistency, but it is obviously a very resource-intensive operation.

Resources consumed when the sync command is executed:

  • The master runs bgsave to generate the RDB file, consuming large amounts of CPU, disk I/O, and memory
  • The master sends the generated RDB file to the slave, consuming a large amount of network bandwidth
  • The slave blocks while receiving and loading the RDB file, and cannot serve requests during that time

For these three reasons, the sync command not only reduces the master's responsiveness but also leaves the slave unable to serve requests for the duration.

2.2 Versions 2.8 to 4.0

2.2.1 Improvements

Starting with 2.8, Redis improved how a reconnecting slave synchronizes its data state. The direction of the improvement is to reduce full resynchronization and use incremental partial resynchronization whenever possible. From 2.8 on, the psync command replaces sync for synchronization. psync covers both full and partial synchronization:

  • Full synchronization works the same as in earlier versions (sync)
  • Partial synchronization handles replication after a disconnect and reconnect: if conditions allow, the master sends the slave only the data it is missing

2.2.2 How psync is implemented

Redis adds three auxiliary pieces of state to support incremental synchronization after a slave disconnects and reconnects:

  • Replication offset
  • Replication backlog
  • Server run ID

2.2.2.1 Replication Offset

A replication offset is maintained on both the master and the slave:

  • When the master propagates N bytes of data to a slave, it increases its own replication offset by N
  • When a slave receives N bytes of data from the master, it increases its own replication offset by N

Normal synchronization looks like this:

By comparing the replication offsets of master and slave, you can tell whether their data states are consistent. If propagation to slaves A and B proceeds normally while slave C disconnects, the following situation arises:



Clearly, with the replication offset, when slave C disconnects and reconnects the master only needs to send the 100 bytes of data the slave is missing. But how does the master know which data the slave is missing?
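The offset bookkeeping can be sketched in a few lines of Python (the starting offset of 10086 and the slave names A/B/C are illustrative):

```python
# Toy replication-offset bookkeeping: the master bumps its offset by N bytes
# when it propagates, and each connected slave bumps its own on receipt.
master_offset = 10086
slave_offsets = {"A": 10086, "B": 10086, "C": 10086}

def propagate(payload, connected):
    global master_offset
    master_offset += len(payload)
    for name in connected:
        slave_offsets[name] += len(payload)

propagate(b"x" * 100, connected=["A", "B"])    # C is disconnected

print(master_offset)                           # 10186
print(slave_offsets["C"])                      # still 10086
print(master_offset - slave_offsets["C"])      # 100 bytes behind
```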

2.2.2.2 Replication Backlog

The replication backlog is a fixed-length queue, 1MB by default. When the master's data state changes, the master synchronizes the data to the slaves and also saves a copy into the replication backlog.



To match against the offset, the replication backlog stores not only the data itself but also the offset of each byte:



When the slave disconnects and reconnects, it sends its replication offset to the master via the psync command. The master uses this offset to decide between incremental propagation and full synchronization:

  • If the data at offset+1 is still in the replication backlog, incremental synchronization is performed
  • Otherwise full synchronization is performed, just as with sync

The replication backlog defaults to 1MB. How should it be sized if you want to customize it? We want incremental synchronization to be used as often as possible without the buffer taking up too much memory, so we can set the backlog size S from an estimate of the average time T for a slave to disconnect and reconnect, and the volume M of write commands the master receives per second:

S = 2 * M * T

Note the doubling: it leaves some headroom so that most disconnect-reconnects can still be handled incrementally.
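As a quick sanity check of the formula, assuming (hypothetically) that the master receives about 1MB of writes per second and reconnects take about 5 seconds on average:

```python
# Backlog sizing per the formula above: S = 2 * M * T.
# The input numbers below are assumptions for illustration only.
def backlog_size(write_bytes_per_sec, reconnect_seconds):
    return 2 * write_bytes_per_sec * reconnect_seconds

size = backlog_size(1024 * 1024, 5)   # ~1 MB/s of writes, ~5 s reconnects
print(size)   # 10485760 bytes; in redis.conf: repl-backlog-size 10mb
```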

2.2.2.3 Server Run ID

Why is a run ID needed when disconnect-reconnect can already be handled incrementally? There is one case not yet considered: the master goes down and a slave is elected as the new master. Comparing run IDs is how we distinguish this case.

  • The run ID is a string of 40 random hexadecimal characters generated automatically when a server starts; both masters and slaves generate one
  • When a slave synchronizes with a master for the first time, the master sends the slave its run ID, which the slave saves (it is also persisted in the RDB file)
  • When the slave disconnects and reconnects, it sends the saved master run ID back to the master. If it matches the master's own run ID, the master has not changed and incremental synchronization is possible
  • If the run IDs do not match, full synchronization is performed
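Putting the run ID and the backlog check together, the master-side decision described above can be sketched as follows (function and parameter names are illustrative, not from the Redis source):

```python
# A sketch of the master-side psync decision: compare the run ID the slave
# saved with the master's own, then check whether the data from offset+1
# onward is still covered by the replication backlog.
def psync_decision(master_runid, backlog_start, backlog_end,
                   slave_saved_runid, slave_offset):
    if slave_saved_runid != master_runid:
        return "full sync"                       # master changed
    if backlog_start <= slave_offset + 1 <= backlog_end:
        return "partial sync"                    # send only the missing bytes
    return "full sync"                           # backlog no longer covers it

print(psync_decision("abc", 9000, 10186, "abc", 10086))   # partial sync
print(psync_decision("abc", 9000, 10186, "old", 10086))   # full sync
print(psync_decision("abc", 10100, 10186, "abc", 9000))   # full sync
```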

2.2.3 The complete psync

The complete psync flow is fairly complex, and was refined step by step between 2.8 and 4.0. The psync command is sent with the following arguments:

psync <master_run_id> <offset>

When the slave has never replicated any master (this is the slave's first full synchronization; keep in mind the master may change over time), the slave sends:

psync ? -1

The complete psync flow is shown below:

[Figure: the complete psync flow]

  1. The slave receives the SLAVEOF 127.0.0.1 6379 command
  2. The slave returns OK to the sender of the command (this is asynchronous: OK is returned first, then the address and port are saved)
  3. The slave saves the IP address and port into masterhost and masterport
  4. The slave actively opens a socket connection to the master using masterhost and masterport, and associates the socket with a file event handler dedicated to replication, for the subsequent RDB transfer and other work
  5. The master accepts the socket connection from the slave, creates the corresponding connection state, and treats the slave as one of its clients (in master-slave replication, master and slave are in fact each other's client and server)
  6. Once the socket is established, the slave sends PING to the master. If the master returns PONG within the specified timeout, the connection is usable; otherwise the slave disconnects and retries
  7. If the master has a password set (masterauth), the slave sends AUTH with the password. Note: if the slave sends a password but the master has none set, the master returns a "no password is set" error; if the master requires a password and the slave sends none, the master returns a NOAUTH error; if the passwords do not match, the master returns an "invalid password" error
  8. The slave sends REPLCONF listening-port XXXX (XXXX being the slave's port). The master saves this, and clients can see it via INFO replication
  9. The slave sends the psync command; see the two psync cases above
  10. Master and slave act as each other's client, exchanging data requests and responses
  11. Master and slave use a heartbeat mechanism to detect broken connections: every second the slave sends REPLCONF ACK <offset> (its replication offset) to the master. This mechanism keeps the data synchronized; if the offsets differ, the master performs incremental or full synchronization to bring the data states back in line (the choice depends on whether the data at offset+1 is still in the replication backlog)

2.3 Version 4.0

Redis 2.8-4.0 still left room for improvement: can synchronization stay incremental across a master switchover? Redis 4.0 optimizes for exactly this problem, upgrading psync to psync2. psync2 drops the server run ID in favor of two values: replid, which stores the current master's replication ID, and replid2, which stores the previous master's replication ID.

  • Replication offset
  • Replication backlog
  • Current master replication ID (replid)
  • Previous master replication ID (replid2)

With replid and replid2 we can solve incremental synchronization across a master switchover:

  • If the replid offered by the slave equals the current master's replid, choose between incremental and full synchronization as before
  • If it does not match, check whether it equals replid2 (that is, whether the slave belonged to the previous master). If so, incremental or full synchronization can still be chosen; if not, only full synchronization is possible
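The replid/replid2 check can be sketched like this (a simplification: real psync2 also validates the offset against the point where the replication histories diverged; all names here are illustrative):

```python
# A sketch of the psync2 acceptance check: a reconnecting slave offers the
# replication ID it last replicated from; a (possibly newly promoted)
# master accepts a partial resync if that ID matches either its current
# replid or the replid2 inherited from the old master.
def psync2_accepts_partial(master_replid, master_replid2, offered_replid):
    if offered_replid == master_replid:
        return True                  # same replication history
    if offered_replid == master_replid2:
        return True                  # slave of the previous master
    return False                     # unknown history: full sync only

# After failover, the new master keeps the old master's ID as replid2:
print(psync2_accepts_partial("new-id", "old-id", "old-id"))   # True
print(psync2_accepts_partial("new-id", "old-id", "new-id"))   # True
print(psync2_accepts_partial("new-id", "old-id", "other"))    # False
```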

II. Sentinel

1. Introduction

Master-slave replication is the foundation of distributed Redis, but plain master-slave replication is not highly available: if the master goes down, an operations engineer has to switch over to a new master by hand, which is clearly unacceptable. For this situation Redis officially provides a high-availability solution that tolerates node failures: Redis Sentinel. A Sentinel system consists of one or more Sentinel instances and can monitor any number of masters and their slaves. When a monitored master goes down, the Sentinel system automatically takes it offline and promotes one of its slaves to be the new master.

For example: if the old Master stays offline longer than the user-configured limit, the Sentinel system performs a failover on it, in three steps:

  1. Select the Slave with the most up-to-date data as the new Master
  2. Send new replication commands to the other Slaves, making them Slaves of the new Master
  3. Keep monitoring the old Master; if it comes back online, make it a Slave of the new Master

This article is based on the following resource list:

IP address         Role                     Ports
192.168.211.104    Redis Master / Sentinel  6379/26379
192.168.211.105    Redis Slave / Sentinel   6379/26379
192.168.211.106    Redis Slave / Sentinel   6379/26379

2. Sentinel initialization and network connections

Sentinel is a stripped-down Redis server that loads a different command table and a different configuration file when it starts, so Sentinel is essentially a Redis service with fewer commands and some special features. When a Sentinel starts, it goes through the following steps:

  1. Initialize the Sentinel server
  2. Replace the regular Redis code with Sentinel-specific code
  3. Initialize the Sentinel state
  4. Initialize the list of monitored masters from the user-supplied Sentinel configuration file
  5. Create network connections to the masters
  6. Obtain slave information from the masters and create network connections to the slaves
  7. Obtain information about other Sentinels via publish/subscribe and create network connections between Sentinels

2.1 Initializing the Sentinel Server

Sentinel is essentially a Redis server, so starting a Sentinel means starting a Redis server; unlike a regular server, however, Sentinel does not need to read RDB/AOF files to restore a data state.

2.2 Replacing the regular Redis code with Sentinel-specific code

Sentinel uses only a small subset of Redis commands; most regular commands are unavailable to Sentinel clients, and Sentinel has special features of its own. Sentinel therefore replaces the code used by a regular Redis server with Sentinel-specific code at startup, during which it loads a command table different from a regular Redis server's. Sentinel does not support commands such as SET and DBSIZE, but keeps PING, PSUBSCRIBE, SUBSCRIBE, UNSUBSCRIBE, INFO and a few others; these commands are the backbone of Sentinel's work.

2.3 Initializing the Sentinel state

After loading the Sentinel-specific code, Sentinel initializes a sentinelState structure to hold Sentinel-related state; the most important field in it is the masters dictionary.

struct sentinelState {

    // Current epoch, used during failover
    uint64_t current_epoch;

    // The masters monitored by this Sentinel
    // key   -> master name
    // value -> pointer to a sentinelRedisInstance
    dict *masters;

    // ...

} sentinel;

2.4 Initializing the list of masters monitored by Sentinel

The list of masters monitored by Sentinel is kept in sentinelState's masters dictionary, whose initialization begins when sentinelState is created.

  • The key of masters is the master's name
  • The value of masters is a pointer to a sentinelRedisInstance

The master's name is specified in our sentinel.conf configuration file. In the configuration below, the master is named redis-master:

daemonize yes
port 26379
protected-mode no
dir "/usr/local/sof/redis-6.2.4/sentinel-tmp"
sentinel monitor redis-master 192.168.211.104 6379 2
sentinel down-after-milliseconds redis-master 30000
sentinel failover-timeout redis-master 180000
sentinel parallel-syncs redis-master 1

The sentinelRedisInstance structure holds Redis server information (master, slave, and Sentinel information are all stored in this structure):

typedef struct sentinelRedisInstance {

    // Identifies the type and status of this instance,
    // e.g. SRI_MASTER, SRI_SLAVE, SRI_SENTINEL
    int flags;

    // Instance name: masters are named in the config file,
    // slaves and Sentinels default to ip:port
    char *name;

    // Server run ID
    char *runid;

    // Configuration epoch, used during failover
    uint64_t config_epoch;

    // Instance address (IP and port)
    sentinelAddr *addr;

    // sentinel down-after-milliseconds redis-master 30000
    // ms without a valid reply before the instance is considered subjectively down
    mstime_t down_after_period;

    // sentinel monitor redis-master 192.168.211.104 6379 2
    // number of votes needed to judge the master objectively down
    int quorum;

    // sentinel parallel-syncs redis-master 1
    // number of slaves resynchronizing with the new master in parallel
    int parallel_syncs;

    // sentinel failover-timeout redis-master 180000
    // maximum timeout for refreshing the failover state
    mstime_t failover_timeout;

    // ...

} sentinelRedisInstance;

With the one-master, two-slave configuration above, the following instance structures are obtained:

2.5 Creating network connections to the master

Once the instance structures are initialized, Sentinel starts creating network connections to the Master, becoming a client of the Master. Between Sentinel and Master, a command connection and a subscription connection are created:

  • The command connection is used to obtain master/slave information
  • The subscription connection is used to broadcast information between Sentinels: each Sentinel subscribes to the __sentinel__:hello channel of the masters and slaves it monitors (note that Sentinels do not create subscription connections to each other; they obtain initial information about other Sentinels through the __sentinel__:hello channel)



After the connections are created, Sentinel sends the INFO command to the Master every 10 seconds to obtain:

  • Information about the Master itself
  • Information about the Slaves under the Master



2.6 Creating network connections to the slaves

Based on the master's information, Sentinel can create network connections to the Slaves. A command connection and a subscription connection are created between Sentinel and each Slave as well.



Once the network connection is created, Sentinel becomes a client of the Slave too, and likewise requests server information from the Slave with the INFO command every 10 seconds.

At this point Sentinel holds the relevant server data for both Master and Slaves. The important fields are:

  • Server IP and port
  • Server run ID
  • Server role
  • Master link status master_link_status
  • Slave replication offset slave_repl_offset (used to elect the new Master during failover)
  • Slave priority slave_priority

The instance structures now look like this:

2.7 Creating network connections between Sentinels

How Sentinels discover and communicate with each other all comes down to the __sentinel__:hello channel. Every Sentinel subscribes to the __sentinel__:hello channel of all the masters and slaves it monitors, and every 2 seconds publishes a message to that channel with the following content:

PUBLISH __sentinel__:hello "<s_ip>,<s_port>,<s_runid>,<s_epoch>,<m_name>,<m_ip>,<m_port>,<m_epoch>"

Here the s_ fields describe the Sentinel itself and the m_ fields describe the Master; ip is an IP address, port a port, runid a run ID, and epoch a configuration epoch.

Multiple Sentinels are configured with the same master IP and port, so they all subscribe to that master's __sentinel__:hello channel, and each can learn the IP and port of the others from the messages received there. Two points to note:

  • If the run ID in a received message equals the Sentinel's own run ID, the message was published by this Sentinel itself and is discarded
  • Otherwise the message was published by another Sentinel, and the receiver updates or adds a Sentinel instance according to the IP and port in the message

Subscription connections are not created between Sentinels; only command connections are:



The instance structures now look like this:



3. How Sentinel works

Sentinel's main job is to monitor Redis servers and, when the Master has been down longer than the preset limit, to switch over to a new Master instance. The process has many details and can be roughly divided into four steps: detecting whether the Master is subjectively down, detecting whether the Master is objectively down, electing the leader Sentinel, and failover.

3.1 Checking whether the Master is subjectively down

Every second, Sentinel sends a PING command to every Master, Slave, and other Sentinel recorded in its sentinelRedisInstance structures to check whether they are still online.

sentinel down-after-milliseconds redis-master 30000

If an instance keeps returning invalid replies to Sentinel's PING for the down-after-milliseconds duration configured in the Sentinel configuration file, the current Sentinel considers it subjectively down. The configured down-after-milliseconds value applies to all Masters, Slaves, and other Sentinels.

Invalid replies are replies other than +PONG, -LOADING, and -MASTERDOWN, including no reply at all.

If a Sentinel decides the Master is subjectively down, it sets the flags of the Master's sentinelRedisInstance to SRI_S_DOWN.



3.2 Checking whether the Master is objectively down

What the current Sentinel has established so far is only subjective offline status. To decide whether the Master is objectively down, it must ask the other Sentinels: if the number of Sentinels that consider the Master subjectively or objectively down reaches the configured quorum, the current Sentinel marks the Master as objectively down.

The current Sentinel sends the following command to the other Sentinels recorded in sentinelRedisInstance:

SENTINEL is-master-down-by-addr <ip> <port> <current_epoch> <runid>

  • ip: IP address of the Master judged subjectively down
  • port: port of the Master judged subjectively down
  • current_epoch: the current Sentinel's configuration epoch
  • runid: either * (for a plain offline check) or the current Sentinel's run ID (when requesting leader election)

current_epoch and runid are both used in the Sentinel election: once the Master is objectively down, a leader Sentinel must be elected to choose the new Master, and current_epoch and runid play an important role in that.

A Sentinel receiving the command checks, based on its parameters, whether it also considers the Master down, and returns three values:

  • down_state: the check result; 1 means down, 0 means not down
  • leader_runid: * when the request is only an offline check, or a run ID when the request is part of electing the leader Sentinel
  • leader_epoch: the configuration epoch; it carries a value when leader_runid is a run ID, and is 0 otherwise

  1. When a Sentinel detects that the Master is subjectively down, it queries the other Sentinels, sending current_epoch and runid with current_epoch = 0 and runid = *
  2. A Sentinel receiving the command checks whether it considers the Master down and replies with down_state = 1/0, leader_runid = *, leader_epoch = 0



3.3 Electing the leader Sentinel

A down_state of 1 in a reply means that the Sentinel receiving is-master-down-by-addr also considers the Master subjectively down. If the number of Sentinels returning down_state = 1 (including the current Sentinel itself) reaches the quorum configured in the configuration file, the current Sentinel formally marks the Master as objectively down, and sends the command again:

SENTINEL is-master-down-by-addr <ip> <port> <current_epoch> <runid>

This time runid is no longer *, but the Sentinel's own run ID, meaning the current Sentinel is asking the Sentinels that receive the command to set it as the leader Sentinel. The setting is first-come, first-served: the first such request a Sentinel receives wins its vote. The sending Sentinel then determines from the replies whether it has been elected. If more than half of all Sentinels (this count is available in the sentinels dictionary of sentinelRedisInstance) have set it as their leader, the Sentinel considers itself the leader and begins the failover. (Because a strict majority is required and each Sentinel grants only one leader vote, at most one leader can emerge; if no Sentinel meets the requirement, the election is re-run until a leader is elected.)
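The first-come-first-served voting and the majority check described above can be modeled as a small simulation (the Sentinel names and request order are invented for illustration):

```python
# Toy leader election: each Sentinel grants its vote to the first candidate
# that asks in a given epoch; a candidate with a strict majority of all
# Sentinels becomes the leader, otherwise the election is re-run.
def run_election(sentinels, requests):
    votes = {}                              # voter -> candidate run ID
    for voter, candidate in requests:       # requests arrive in order
        votes.setdefault(voter, candidate)  # first request wins the vote
    tally = {}
    for candidate in votes.values():
        tally[candidate] = tally.get(candidate, 0) + 1
    for candidate, count in tally.items():
        if count > len(sentinels) // 2:     # strict majority required
            return candidate
    return None                             # no leader elected this round

sentinels = ["s1", "s2", "s3"]
requests = [("s1", "s1"), ("s2", "s1"), ("s3", "s3"), ("s2", "s3")]
print(run_election(sentinels, requests))    # s1 (2 of 3 votes; s2's late
                                            # request for s3 is ignored)
```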

3.4 Failover

Failover is handled by the leader Sentinel, which does the following:

  1. Select the best slave of the original master as the new master
  2. Make the other slaves replicate the new master
  3. Keep monitoring the old master; if it comes back online, make it a slave of the new master

The hardest part is selecting the best new Master. The leader Sentinel filters and sorts the candidates as follows:

  1. Check whether any slave is offline, and remove offline slaves from the slave list
  2. Remove slaves that have not responded to the Sentinel's INFO command within the last 5 seconds
  3. Remove all slaves whose link to the offline master was down for longer than down-after-milliseconds * 10
  4. Select the slave with the highest slave_priority as the new master
  5. If the priorities are equal, select the slave with the largest replication offset slave_repl_offset as the new master
  6. If the offsets are also equal, sort the slaves by run ID and select the one with the smallest run ID as the new master
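The filtering and sorting rules above boil down to one filter and one sort key; a sketch follows (note that in Redis a lower slave_priority number means a higher priority, and a priority of 0 disqualifies a slave from promotion; the dictionary field names here are illustrative):

```python
# A sketch of the leader Sentinel's candidate selection.
def pick_new_master(slaves, down_after_ms):
    candidates = [
        s for s in slaves
        if s["online"]
        and s["priority"] != 0                          # priority 0: never promote
        and s["last_info_reply_ms"] <= 5000             # rule 2
        and s["master_link_down_ms"] <= down_after_ms * 10   # rule 3
    ]
    if not candidates:
        return None
    # lower priority number = higher priority; larger offset = newer data;
    # ties broken by the lexicographically smallest run ID (rules 4-6)
    candidates.sort(key=lambda s: (s["priority"], -s["repl_offset"], s["runid"]))
    return candidates[0]["runid"]

slaves = [
    {"runid": "b", "online": True, "priority": 100, "repl_offset": 200,
     "last_info_reply_ms": 1000, "master_link_down_ms": 0},
    {"runid": "a", "online": True, "priority": 100, "repl_offset": 200,
     "last_info_reply_ms": 1000, "master_link_down_ms": 0},
    {"runid": "c", "online": True, "priority": 1, "repl_offset": 999,
     "last_info_reply_ms": 9000, "master_link_down_ms": 0},  # stale INFO: filtered
]
print(pick_new_master(slaves, down_after_ms=30000))   # a
```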

After the new Master is chosen, the leader Sentinel sends the SLAVEOF IP PORT command to the offline master's remaining slaves (excluding the new Master) to make them slaves of the new Master.

That is the whole Sentinel workflow; if the new master goes down in turn, the cycle simply repeats!

III. Cluster

1. Introduction

Redis Cluster is the distributed database solution provided by Redis. The cluster shards its data across nodes, with the following main goals:

  • Performance remains good at up to 1000 nodes, and scalability is linear
  • There are no merge operations (multiple nodes never hold the same key), which works well with the typically large values in Redis's data model
  • Write safety: the system tries to retain all writes performed by clients connected to the majority of nodes; still, Redis cannot guarantee zero data loss, since asynchronous master-slave replication can always lose some data on failure
  • Availability: when a master becomes unavailable, its slave can take over

The Redis cluster tutorial:

redis.cn/topics/clus…

The Redis cluster specification:

redis.cn/topics/clus…

Redis 3 master-slave pseudo-cluster deployment:

blog.csdn.net/qq_41125219…

The rest of this section is based on the following three-master, three-slave structure:



List of resources:

Node        IP:port                  Slot range
Master[0]   192.168.211.107:6319     slots 0 - 5460
Master[1]   192.168.211.107:6329     slots 5461 - 10922
Master[2]   192.168.211.107:6339     slots 10923 - 16383
Slave[0]    192.168.211.107:6369     -
Slave[1]    192.168.211.107:6349     -
Slave[2]    192.168.211.107:6359     -

2. Inside the cluster

Redis Cluster does not use consistent hashing; instead it introduces the concept of hash slots. The cluster has 16384 hash slots, and each key is mapped to a slot by taking its CRC16 checksum modulo 16384; this structure makes it easy to add and remove nodes. Each node in the cluster is responsible for a portion of the hash slots. For the three-master cluster in the resource list above, the slots are allocated as follows:

  • The node Master[0] contains hash slots 0 through 5460
  • Node Master[1] contains hash slots 5461 through 10922
  • Node Master[2] contains hash slots 10923 to 16383
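The key-to-slot mapping can be reproduced directly: the Redis Cluster specification uses the CRC16-CCITT (XMODEM) variant, plus a "hash tag" rule so that keys sharing the same {tag} land in the same slot. A minimal sketch:

```python
def crc16(data: bytes) -> int:
    # CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0,
    # no reflection -- the variant the Redis Cluster spec uses
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def keyslot(key: str) -> int:
    # Hash tags: if the key contains a non-empty {...} section,
    # only the part between the first braces is hashed
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16(key.encode()) % 16384

print(keyslot("foo"))                                       # a slot in 0..16383
print(keyslot("{user1}.name") == keyslot("{user1}.age"))    # True: same tag
```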

Before digging deeper into Redis Cluster, you need to understand the internal structure of a Redis instance in cluster mode. When a Redis node starts with cluster-enabled set to yes, it keeps using the server components of single-machine mode, and adds structures such as clusterState, clusterNode, and clusterLink to store cluster-specific data.

The following three data structures deserve a careful read, especially the comments; once you have been through them, you will have a good general sense of how the cluster works.

2.1 clusterNode

clusterNode stores a node's information, such as its name, IP address, port, and configuration epoch. The following code lists the most important fields:

```c
typedef struct clusterNode {

    // Creation time
    mstime_t ctime;

    // Node name: 40 random hexadecimal characters
    // (the same idea as a server's run ID in Sentinel)
    char name[REDIS_CLUSTER_NAMELEN];

    // Node flags, identifying the node's role and state
    // Role  -> master or slave, e.g. REDIS_NODE_MASTER / REDIS_NODE_SLAVE
    // State -> online or offline, e.g. REDIS_NODE_PFAIL (suspected offline) / REDIS_NODE_FAIL (offline)
    int flags;

    // Configuration epoch of the node, used during failover
    uint64_t configEpoch;

    // Node IP address
    char ip[REDIS_IP_STR_LEN];

    // Node port
    int port;

    // Connection information for this node
    clusterLink *link;

    // A 2048-byte array of binary bits (16384 bits in total)
    // bit i == 0: this node is not responsible for slot i
    // bit i == 1: this node is responsible for slot i
    unsigned char slots[16384/8];

    // Total number of slots this node is responsible for
    int numslots;

    // If the current node is a slave, points to its master
    struct clusterNode *slaveof;

    // If the current node is a master:
    // the number of slaves replicating from it, and pointers to them
    int numslaves;
    struct clusterNode **slaves;

} clusterNode;
```

slots[16384/8] is a 2048-byte array, i.e. 16384 bits. If bit i is 1, slot i is handled by the current clusterNode; if bit i is 0, it is not. Through slots, a clusterNode records exactly which slots its process is responsible for.
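As a concrete illustration, here is a minimal sketch of how such a bitmap can be set and tested; the helper names are mine, not actual Redis function names:

```c
#include <string.h>

#define CLUSTER_SLOTS 16384

/* Sketch of the per-node slots bitmap: 2048 bytes = 16384 bits. */
typedef struct {
    unsigned char slots[CLUSTER_SLOTS/8];
    int numslots;
} NodeSlots;

/* Mark slot i as handled by this node (set bit i to 1). */
static void slot_set(NodeSlots *n, int i) {
    if (!((n->slots[i/8] >> (i%8)) & 1)) {
        n->slots[i/8] |= (unsigned char)(1 << (i%8));
        n->numslots++;
    }
}

/* Is slot i handled by this node? */
static int slot_test(const NodeSlots *n, int i) {
    return (n->slots[i/8] >> (i%8)) & 1;
}
```

Filling bits 0 through 5460 reproduces Master[0]'s range from the resource list, with numslots matching the count the real clusterNode keeps.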

The slots bitmap of a freshly initialized clusterNode, or of the clusterNodes in a cluster whose slots have not yet been assigned, looks like this:



Assuming the cluster matches the resource list I gave above, the clusterNode slots bitmap representing Master[0] looks like this:

2.2 clusterLink

clusterLink is an attribute of clusterNode that stores the information needed to connect to the node, such as the socket descriptor and the input/output buffers. The following code lists some of the most important attributes:

```c
typedef struct clusterLink {

    // Creation time
    mstime_t ctime;

    // TCP socket descriptor
    int fd;

    // Output buffer: messages to be sent to other nodes are cached here
    sds sndbuf;

    // Input buffer: messages received from other nodes are cached here
    sds rcvbuf;

    // The node associated with this link
    struct clusterNode *node;

} clusterLink;
```

2.3 clusterState

Every node holds a clusterState structure that stores all the data about the current cluster from that node's point of view, such as the cluster's state and information about all the nodes in the cluster (masters and slaves). The following code lists some of the most important attributes:

```c
typedef struct clusterState {

    // Pointer to the current node's own clusterNode
    clusterNode *myself;

    // Current configuration epoch of the cluster, used for failover
    // (similar to the epoch used in Sentinel)
    uint64_t currentEpoch;

    // Cluster state: online / offline
    int state;

    // Number of nodes in the cluster that handle at least one slot
    int size;

    // Dictionary of all clusterNodes in the cluster, including myself
    dict *nodes;

    // Slot assignment: slots[i] points to the node responsible for slot i
    clusterNode *slots[16384];

    // Re-sharding: slots the current node is importing from other nodes
    clusterNode *importing_slots_from[16384];

    // Re-sharding: slots the current node is migrating to other nodes
    clusterNode *migrating_slots_to[16384];

    // ...

} clusterState;
```

The first thing to look at is the slots array. The slots array in clusterState is different from the slots bitmap in clusterNode: the clusterNode bitmap records the slots of that particular node, while the clusterState array records the slot assignment of the whole cluster. So, when the cluster is working normally, each index of the clusterState slots array points to the clusterNode responsible for that slot, and points to NULL before the slot is assigned.

The figure shows the clusterState slots array and the clusterNode slots bitmaps for the resource listing:



The Redis cluster keeps two slots arrays for performance reasons:

  • To find out which slots a given clusterNode handles, we only need to consult the clusterState slots array. Without it, we would have to traverse every clusterNode structure, which is obviously slower.
  • The clusterNode slots bitmap is still necessary, because each node in the cluster needs to know which slots every other node handles; for that, nodes only need to send each other their own clusterNode slots bitmap.

The second structure to understand carefully is the nodes dictionary. Simple as it is, it stores every clusterNode, and it is the main place where a single node in a Redis cluster gets information about the other masters and slaves, so it deserves attention. The third important piece is the pair of arrays importing_slots_from[16384] and migrating_slots_to[16384], which are used when the cluster is re-sharded.

3. How the cluster works

3.1 How are slots assigned?

A Redis cluster has 16384 slots in total. As the resource list above shows, in our three-master three-slave cluster each master node is responsible for its own range of slots. During the three-master three-slave deployment, however, I never assigned slots to the masters myself, because the cluster tooling divided the slots up for us. But what if we want to assign slots ourselves? We can assign one or more slots to a node by sending it the following command:

CLUSTER ADDSLOTS

For example, if we want to assign slots 0 and 1 to Master[0], we simply send the following command to Master[0] :

CLUSTER ADDSLOTS 0 1

When a node is assigned slots, it updates its clusterNode slots bitmap and sends its slots bitmap to the other nodes in the cluster via a message. On receiving the message, the other nodes update the corresponding clusterNode slots bitmap and their clusterState slots array.

3.2 How is ADDSLOTS implemented within a Redis cluster?

When we send the CLUSTER ADDSLOTS command to a node in the Redis cluster, the node first uses its clusterState slots array to verify that none of the slots being assigned are already owned by another node; if any of them is, it throws an error straight back to the client that issued the assignment. If none of the slots are assigned yet, the node assigns them all to itself, in three main steps:

  1. Update the clusterState slots array so that slots[i] points to the current clusterNode
  2. Update the current clusterNode's slots bitmap, setting bit i to 1
  3. Send a message to the other nodes in the cluster carrying the clusterNode slots bitmap; on receiving it, the other nodes update their own clusterState slots array and the corresponding clusterNode slots bitmap
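The first two steps can be sketched in a few lines of C. This is a simplified illustration, not the real Redis code: step 3 (the cluster-bus message) is omitted, and the structures are cut down to the fields involved.

```c
#include <stddef.h>

#define CLUSTER_SLOTS 16384

/* Cut-down versions of the structures described above. */
struct clusterNode {
    unsigned char slots[CLUSTER_SLOTS/8];
    int numslots;
};

struct clusterState {
    struct clusterNode *myself;
    struct clusterNode *slots[CLUSTER_SLOTS]; /* slot -> owning node */
};

/* Returns 0 on success, -1 if the slot already has an owner. */
static int cluster_add_slot(struct clusterState *st, int i) {
    if (st->slots[i] != NULL) return -1;                    /* already assigned: error */
    st->slots[i] = st->myself;                              /* step 1: clusterState.slots */
    st->myself->slots[i/8] |= (unsigned char)(1 << (i%8));  /* step 2: node bitmap */
    st->myself->numslots++;
    return 0;
}
```

Note how a single ADDSLOTS touches both arrays: the cluster-wide view in clusterState and the node's own bitmap in clusterNode.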

3.3 With so many nodes in the cluster, how does the client know which node to request?

Before we get to that, we need to know one thing: how does the Redis cluster work out which slot a key belongs to? According to the official documentation, Redis does not use consistent hashing; instead, each requested key is run through a CRC16 checksum and taken modulo 16384 to decide which slot it lands in.

HASH_SLOT = CRC16(key) mod 16384
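The formula is easy to reproduce. Below is a bitwise sketch of the CRC16 variant the cluster spec uses (CCITT/XModem: polynomial 0x1021, initial value 0), rather than the table-driven version Redis actually ships; hash-tag handling (keys containing `{...}`) is omitted for brevity.

```c
#include <stdint.h>
#include <stddef.h>

/* CRC16-CCITT (XModem), the checksum named in the Redis cluster spec. */
static uint16_t crc16(const unsigned char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)buf[i] << 8;
        for (int b = 0; b < 8; b++)
            crc = (uint16_t)((crc & 0x8000) ? (crc << 1) ^ 0x1021 : crc << 1);
    }
    return crc;
}

/* HASH_SLOT = CRC16(key) mod 16384 */
static int hash_slot(const unsigned char *key, size_t len) {
    return crc16(key, len) % 16384;
}
```

The standard check value for this CRC variant is CRC16("123456789") = 0x31C3, which also makes a handy sanity test for the slot function.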

With that in place: when a client sends a request to some node, the node receiving the command first uses this algorithm to compute the slot i that the key belongs to. It then checks whether clusterState.slots[i] points to itself. If it does, the node serves the client's request directly; if not, the following steps are performed:

  1. The node returns a MOVED redirection error to the client, carrying the IP address and port of the clusterNode that actually handles the key's slot
  2. On receiving the MOVED redirection error, the client resends the command to the correct node based on that IP and port. The process is transparent to the programmer: it is handled jointly by the Redis cluster server and the client library.
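To make the client side concrete, here is a hypothetical sketch of the first thing a "smart" client does with a MOVED reply: parse out the slot, IP, and port so it can retry against the right node (the retry itself, and the usual caching of the slot map, are omitted).

```c
#include <stdio.h>

/* Parse a reply of the form "-MOVED <slot> <ip>:<port>".
 * Returns 1 on success, 0 if the reply does not match. */
static int parse_moved(const char *err, int *slot, char *ip, int *port) {
    return sscanf(err, "-MOVED %d %63[^:]:%d", slot, ip, port) == 3;
}
```

For example, a reply such as `-MOVED 3999 192.168.211.107:6329` tells the client that slot 3999 lives on 192.168.211.107:6329.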

3.4 What if I want to reassign a slot from node A to node B?

This question actually covers a number of scenarios, such as removing nodes from or adding nodes to a Redis cluster, all of which boil down to moving hash slots from one node to another. This is where the Redis cluster shines: it supports reassigning slots online (without stopping the service), which the documentation calls live reconfiguration.

Before looking at how this is implemented, take a look at the CLUSTER subcommands involved:

  • CLUSTER ADDSLOTS slot1 [slot2] … [slotN]
  • CLUSTER DELSLOTS slot1 [slot2] … [slotN]
  • CLUSTER SETSLOT slot NODE node
  • CLUSTER SETSLOT slot MIGRATING node
  • CLUSTER SETSLOT slot IMPORTING node

ADDSLOTS and DELSLOTS are used for quick slot assignment and removal, usually when a cluster is first set up. CLUSTER SETSLOT slot NODE node likewise assigns a slot directly to a specified node. If the cluster is already up and running, we usually use the last two commands to reassign slots; their meaning is roughly as follows:

  • While a slot is set to MIGRATING, the node that originally owned it still serves all requests for that slot, but only if the queried key still exists on the original node; otherwise the query is redirected to the migration target with an -ASK redirection.
  • While a slot is set to IMPORTING, the importing node accepts requests for that slot only after receiving the ASKING command. If the client never sends ASKING, the query is redirected to the node that officially owns the slot with a -MOVED redirection error.

Are those two sentences hard to follow? That is the official description. If so, here is a more down-to-earth version.

  1. redis-trib (the cluster management tool responsible for slot allocation in a Redis cluster) sends CLUSTER SETSLOT slot IMPORTING node to the target node, which then prepares to import the slot from the source node (the node the slot is being exported from)
  2. redis-trib then sends CLUSTER SETSLOT slot MIGRATING node to the source node, which then prepares to export the slot
  3. redis-trib then sends CLUSTER GETKEYSINSLOT slot count to the source node; on receiving it, the source node returns at most count keys belonging to the slot
  4. For each returned key, redis-trib sends MIGRATE ip port key 0 timeout to the source node, which migrates that key to the target node
  5. After the migration, redis-trib sends CLUSTER SETSLOT slot NODE node to a node in the cluster; that node updates its clusterNode and clusterState structures and then propagates the new slot assignment through messages. At this point the migration is complete, and the other nodes in the cluster have the updated slot allocation.

3.5 What if the slot containing the key a client is accessing is being migrated?

Nice, you even thought of this concurrent case. Impressive!



The officials have considered this too. Remember the clusterState structure we covered earlier? Its importing_slots_from and migrating_slots_to arrays are exactly what handles it.

```c
typedef struct clusterState {

    // ...

    // Slots the current node is importing from other nodes
    clusterNode *importing_slots_from[16384];

    // Slots the current node is migrating to other nodes
    clusterNode *migrating_slots_to[16384];

    // ...

} clusterState;
```
  • When a node is exporting a slot, the entry at that slot's index in the clusterState migrating_slots_to array points to the clusterNode of the importing node.
  • When a node is importing a slot, the entry at that slot's index in the clusterState importing_slots_from array points to the clusterNode of the exporting node.

With these two mirror-image arrays, we can tell whether a given slot is currently being migrated, where it is coming from, and where it is going. Funnily enough, it really is that simple…

Back to the question: suppose the key requested by the client belongs to a slot that is being migrated. The node that receives the command first tries to find the key in its own database; if the slot has not finished migrating and the key itself has not been moved yet, the node simply serves the request. If the key is no longer there, the node checks the slot's entry in migrating_slots_to. If that entry is not NULL but points to a clusterNode, the key has already been migrated to that node, so instead of serving the request the node returns an -ASK redirection error carrying the IP address and port of the importing clusterNode. On receiving the -ASK error, the client needs to redirect the request to the correct node, but there is a subtlety here that deserves special attention.



As mentioned earlier, a node returns a MOVED error when it finds that a slot does not belong to it. So what does it return for a slot that is being migrated away? This is exactly where the Redis cluster plays its trick.

If a node detects that the slot is being migrated away, it returns an -ASK redirection error to the client, carrying the IP address and port of the clusterNode the slot is migrating to. The client must then first send an ASKING command to that importing clusterNode. This command has to be sent because, by the importing node's own clusterState, it does not yet own the slot: without ASKING it would simply reply with a MOVED error and refuse the request. ASKING effectively tells it "the slot is being migrated to you, so please make an exception for this request; you can't refuse me outright." A node that has received ASKING forces execution of the next request for the importing slot exactly once (and only once).
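The whole MOVED/ASK decision can be condensed into a small state check. This is my own sketch, not Redis code; the field names mirror the clusterState arrays above, with the source-node and target-node cases folded into one function.

```c
/* The reply a node chooses for a command on a key in slot i. */
typedef enum { SERVE, REPLY_MOVED, REPLY_ASK } Verdict;

typedef struct {
    int slot_is_mine;   /* clusterState.slots[i] == myself        */
    int key_in_db;      /* the key is still in the local database */
    int migrating;      /* migrating_slots_to[i] != NULL          */
    int importing;      /* importing_slots_from[i] != NULL        */
    int client_asking;  /* client sent ASKING just before this    */
} SlotCtx;

static Verdict route(SlotCtx c) {
    if (c.slot_is_mine) {
        /* Source side: serve while the key is still local; once the
           key has moved, redirect with -ASK to the importing node. */
        if (c.key_in_db || !c.migrating) return SERVE;
        return REPLY_ASK;
    }
    /* Target side: only a preceding ASKING unlocks the one-shot
       exception while the slot is still being imported; everyone
       else gets -MOVED to the official owner. */
    if (c.importing && c.client_asking) return SERVE;
    return REPLY_MOVED;
}
```

Reading the two branches side by side shows why ASK and MOVED are different: MOVED means "the slot permanently lives elsewhere, update your map", while ASK means "just for this key, try over there once".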

4. Cluster failure handling

Failure handling in a Redis cluster is relatively simple and similar to Sentinel: when a master node goes down or does not respond within the specified maximum time, a new master is elected from its slave nodes. The premise, of course, is that we gave each master node in the cluster a slave node in advance; otherwise… no dice. The general steps are as follows:

  1. Nodes in the cluster ping each other periodically. If a pinged node does not return PONG within the specified time, the pinging node sets that node's clusterNode flags to REDIS_NODE_PFAIL. PFAIL does not mean offline; it means suspected offline
  2. Cluster nodes exchange messages to tell each other about the state of every node in the cluster
  3. If more than half of the slot-holding master nodes mark a given master as suspected offline, that node is marked as offline: its clusterNode flags are set to REDIS_NODE_FAIL, which does mean offline
  4. Cluster nodes again exchange messages about node states; when a slave of the offline node learns that its master has been marked offline, it is time to step forward
  5. The slaves of the offline master elect one of themselves as the new master; the elected node executes SLAVEOF no one to become the new master
  6. The new master revokes the slot assignments of the old master and assigns those slots to itself, i.e. it updates the clusterNode and clusterState structures
  7. The new master broadcasts a PONG message to the cluster; the other nodes learn that a new master has appeared and update their clusterNode and clusterState structures
  8. The new master sends SLAVEOF commands to the remaining slaves of the old master, making them its own slaves
  9. Finally, the new master takes over serving the old master's slots
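The promotion rule in step 3 is just a majority check; as a tiny sketch (my own simplification of the agreement rule, not the actual Redis voting code):

```c
/* PFAIL -> FAIL promotion: more than half of the slot-holding masters
 * must report the node as suspected offline before it is flagged as
 * actually offline. */
static int should_mark_fail(int pfail_reports, int slot_holding_masters) {
    return pfail_reports > slot_holding_masters / 2;
}
```

With the three masters in our resource list, two PFAIL reports are enough to promote a node to FAIL, while one is not.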

I have been fairly brief here; if you need to dig deeper, you should read this article:

redis.cn/topics/clus…

Or read Huang Jianhong's book Redis Design and Implementation, which is very good; I also drew on it a great deal.