We have mentioned that Redis master-slave replication provides hot data backup, load balancing for reads, and a basis for failure recovery. One problem with master-slave replication, however, is that failover cannot be automated. Sentinel, the subject of this article, is built on top of Redis master-slave replication; its main purpose is to automate recovery from master node failures and thereby further improve the availability of the system.

 

Note: This article is based on Redis version 3.0.

 

I. Roles and architecture

 

1. Roles

 

Before introducing Sentinel, let's take a step back and review the technologies Redis offers for high availability: persistence, replication, Sentinel, and Cluster. Their main functions and limitations are:

 

  • Persistence: persistence is the simplest high availability method (sometimes it is not even classified as high availability). Its main purpose is data backup: data is written to disk so that it is not lost when the process exits.

  • Replication: replication is the foundation of Redis high availability; both Sentinel and Cluster build on it. Replication mainly provides multi-machine data backup, load balancing for read operations, and simple failure recovery. Its shortcomings: failure recovery cannot be automated, write operations cannot be load balanced, and storage capacity is limited to a single machine.

  • Sentinel: on top of replication, Sentinel adds automated failover. Its shortcomings: write operations still cannot be load balanced, and storage capacity is still limited to a single machine.

  • Cluster: with Cluster, Redis solves the problems that write operations cannot be load balanced and that storage capacity is limited to a single machine, giving a relatively complete high availability solution.

 

For details, see:

 

  • Redis high availability details: Persistence technology and solution selection

  • After reading this article, you'll fully understand Redis master-slave replication

 

Now back to Sentinel.

 

Redis Sentinel was introduced in Redis 2.8. Its core function is automatic failover of the master node. Here is how the official Redis documentation describes Sentinel's capabilities:

 

  • Monitoring: Sentinel constantly checks whether the master and slave nodes are working properly.

  • Automatic failover: when the master node is not working properly, Sentinel starts an automatic failover operation: it promotes one of the failed master's slaves to be the new master and makes the other slaves replicate the new master instead.

  • Configuration provider: during initialization, the client connects to Sentinel to obtain the address of the current Redis master node.

  • Notification: Sentinel can send the result of a failover to the client.

 

Monitoring and automatic failover allow Sentinel to detect a master node failure and complete the switchover on its own; the configuration provider and notification roles come into play in the interaction with clients.

 

In previous articles, anything that accessed the Redis server through its API was called a client, including redis-cli, the Java client Jedis, and so on. For clarity, the client in this article does not include redis-cli but refers to something more sophisticated: redis-cli uses the low-level interfaces provided by Redis directly, whereas a client wraps those interfaces so that it can take advantage of Sentinel's configuration provider and notification capabilities.

 

2. Architecture

 

A typical sentinel architecture looks like this:

 

 

It consists of two parts:

 

  • Sentinel nodes: the sentinel system consists of one or more sentinel nodes, which are special Redis nodes that do not store data.

  • Data nodes: both the master node and the slave nodes are data nodes.

 

II. Deployment

 

This section deploys a simple sentinel system consisting of one master node, two slave nodes, and three sentinel nodes. For convenience, all nodes run on the same machine (LAN IP 192.168.92.128) and are distinguished by port number; the node configuration is kept as simple as possible.

 

1. Deploy the master and slave nodes

 

The master and slave nodes in a sentinel system are configured exactly like ordinary master/slave nodes and need no additional configuration. Below are the configuration files of the master node (port 6379) and the two slave nodes (ports 6380 and 6381); they are fairly simple and are not discussed further:

 

#redis-6379.conf
port 6379
daemonize yes
logfile "6379.log"
dbfilename "dump-6379.rdb"

 

#redis-6380.conf
port 6380
daemonize yes
logfile "6380.log"
dbfilename "dump-6380.rdb"
slaveof 192.168.92.128 6379

 

#redis-6381.conf
port 6381
daemonize yes
logfile "6381.log"
dbfilename "dump-6381.rdb"
slaveof 192.168.92.128 6379

 

After the configuration is complete, start the master node and then the slave nodes:

 

redis-server redis-6379.conf
redis-server redis-6380.conf
redis-server redis-6381.conf

 

After the nodes are started, connect to the master node to verify that the master/slave relationship is normal, as shown below:
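
Since the screenshot is not reproduced here, a quick way to do this check is the info replication command; on the master the output looks roughly like this (offsets vary from run to run):

127.0.0.1:6379> info replication
# Replication
role:master
connected_slaves:2
slave0:ip=192.168.92.128,port=6380,state=online,offset=<offset>,lag=0
slave1:ip=192.168.92.128,port=6381,state=online,offset=<offset>,lag=0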

 

 

2. Deploy the sentinel nodes

 

Sentinel nodes are essentially special Redis nodes.

 

The configurations of the three sentinel nodes are almost identical; the main difference is the port number (26379/26380/26381). The following uses 26379 as an example to describe the configuration and startup of a sentinel node. The configuration is kept as simple as possible here; more options are introduced later:

 

#sentinel-26379.conf
port 26379
daemonize yes
logfile "26379.log"
sentinel monitor mymaster 192.168.92.128 6379 2

 

The line sentinel monitor mymaster 192.168.92.128 6379 2 means: this sentinel node monitors the master node at 192.168.92.128:6379; the master node is named mymaster; and the final 2 relates to how a master node failure is judged (the quorum, explained later).
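
For reference, the configuration files of the other two sentinel nodes differ only in the port number and log file name; a sketch of sentinel-26380.conf under the same assumptions (monitoring the same master):

#sentinel-26380.conf
port 26380
daemonize yes
logfile "26380.log"
sentinel monitor mymaster 192.168.92.128 6379 2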

 

A sentinel node can be started in either of two ways; the two are completely equivalent:

 

redis-sentinel sentinel-26379.conf
redis-server sentinel-26379.conf --sentinel

 

Once the sentinels are configured and started in this way, the sentinel system is up and running. You can verify this by connecting to a sentinel node with redis-cli, as shown below: the sentinel on port 26379 is already monitoring the mymaster node (i.e. 192.168.92.128:6379) and has discovered its two slave nodes and the two other sentinel nodes.
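
Since the screenshot is not reproduced here, the output of info sentinel on the 26379 node looks roughly like this:

127.0.0.1:26379> info sentinel
# Sentinel
sentinel_masters:1
sentinel_tilt:0
sentinel_running_scripts:0
sentinel_scripts_queue_length:0
master0:name=mymaster,status=ok,address=192.168.92.128:6379,slaves=2,sentinels=3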

 

 

If you now look at the sentinel configuration file, you will find that it has changed. Take 26379 as an example:

 

 

Here, dir simply declares the directory where data and logs are stored (for a sentinel, only logs); the known-slave and known-sentinel entries show that the sentinel has discovered the slave nodes and the other sentinels; and the parameters containing epoch relate to the configuration epoch, a counter that starts at 0 and is incremented by 1 for each leader sentinel election (leader election is part of the failover process, described in the principles section below).
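
Since the screenshot is not reproduced here, the rewritten sentinel-26379.conf might look roughly like the sketch below (the dir path and the run IDs are illustrative placeholders):

port 26379
daemonize yes
logfile "26379.log"
dir "/path/to/working/dir"
sentinel monitor mymaster 192.168.92.128 6379 2
sentinel config-epoch mymaster 0
sentinel known-slave mymaster 192.168.92.128 6380
sentinel known-slave mymaster 192.168.92.128 6381
sentinel known-sentinel mymaster 192.168.92.128 26380 <runid>
sentinel known-sentinel mymaster 192.168.92.128 26381 <runid>
sentinel current-epoch 0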

 

3. Demonstrate failover

 

Of Sentinel's four roles, the configuration provider and notification roles require cooperation from the client and are demonstrated in the next chapter, when a client accesses the sentinel system. This section demonstrates Sentinel's monitoring and automatic failover capabilities when the master node fails.

 

Step 1: First, kill the master node process with the kill command:
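
A minimal way to do this, assuming the master was started as above:

ps -ef | grep redis-server | grep 6379
kill -9 <pid-of-the-6379-process>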

 

 

Step 2: If you run the info sentinel command on a sentinel node right away, you will find that the master node has not been switched yet, because it takes the sentinels some time to detect the failure and complete the failover.

 

 

Step 3: After a while, run info sentinel on the sentinel node again; the master node has now been switched to the 6380 node.

 

 

At the same time, notice that the sentinel still believes the new master has two slave nodes. This is because, when the sentinel promoted 6380 to master, it set node 6379 as its slave; although node 6379 is down, the sentinel never marks a slave node objectively offline (the meaning of this is explained in the principles section), so it still considers the slave to exist. When node 6379 is restarted, it will automatically become a slave of node 6380. Let's verify that.

 

Step 4: Restart node 6379; it becomes a slave of node 6380.
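
For example (the screenshot is omitted here):

redis-server redis-6379.conf
redis-cli -p 6379 info replication    # now shows role:slave, master_host:192.168.92.128, master_port:6380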

 

 

Step 5: During failover, the configuration files of the sentinels and of the master and slave nodes are all rewritten.

 

For the master and slave nodes, the slaveof configuration changes: the new master no longer has a slaveof line, while its slaves now slaveof the new master.
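
For instance, after the failover demonstrated above (6380 promoted to master), the relevant lines of the data nodes' configuration files would roughly read:

# redis-6380.conf (new master): the slaveof line has been removed
# redis-6381.conf and, after its restart, redis-6379.conf:
slaveof 192.168.92.128 6380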

 

For the sentinel nodes, in addition to the updated master/slave information, the epoch changes as well: as shown below, the epoch-related parameters have all been incremented by 1.

 

 

4. Summary

 

A few points are worth noting about the setup of the sentinel system:

 

  • The master and slave nodes in a sentinel system are no different from ordinary master/slave nodes; failure detection and failover are controlled and carried out by the sentinels.

  • Sentinel nodes are essentially Redis nodes.

  • Each sentinel node only needs to be configured to monitor the master node; it discovers the other sentinels and the slave nodes automatically.

  • During sentinel startup and during failover, the configuration files of the nodes are rewritten (config rewrite).

  • In this chapter's example, one sentinel monitors only one master node; in fact, a sentinel can monitor multiple master nodes by configuring multiple sentinel monitor directives.

 

III. Client access to the sentinel system

 

The previous section demonstrated Sentinel's two major roles, monitoring and automatic failover; this section brings in the client to demonstrate the other two roles: configuration provider and notification.

 

1. Code examples

 

Before explaining how the client works, let's take the Java client Jedis as an example of how to use it. The following code connects to the sentinel system we just built and performs read and write operations:

 

// requires redis.clients.jedis.Jedis, redis.clients.jedis.JedisSentinelPool, java.util.Set, java.util.HashSet
public static void testSentinel() throws Exception {
    String masterName = "mymaster";
    Set<String> sentinels = new HashSet<>();
    sentinels.add("192.168.92.128:26379");
    sentinels.add("192.168.92.128:26380");
    sentinels.add("192.168.92.128:26381");

    JedisSentinelPool pool = new JedisSentinelPool(masterName, sentinels); // the initialization does a lot of work
    Jedis jedis = pool.getResource();
    jedis.set("key1", "value1");
    pool.close();
}

(Note: the code only demonstrates how to connect to the sentinels; exception handling, resource cleanup, and so on are not considered.)

 

2. Client principle

 

The Jedis client has good support for Sentinel. As shown above, we only need to give Jedis the set of sentinel nodes and the masterName to construct a JedisSentinelPool object; after that it can be used like an ordinary Redis connection pool: obtain a connection through pool.getResource() and execute commands on it.

 

Throughout this process, our code never explicitly specifies the master node's address, yet it connects to the master; failover is never mentioned in the code, yet after the sentinels complete a failover the client switches to the new master automatically. This is possible because of the work done in the JedisSentinelPool constructor, which mainly includes the following two points:

 

(1) Traversing the sentinel nodes to obtain the master node's address: the client iterates over the sentinel nodes and, given one sentinel node plus the masterName, obtains the master node's information; this is done by calling the sentinel get-master-addr-by-name command on the sentinel node. An example of this command:
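
Run against one of the sentinels deployed above, the command and its output look roughly like this:

127.0.0.1:26379> sentinel get-master-addr-by-name mymaster
1) "192.168.92.128"
2) "6379"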

 

 

Once the master node's information is obtained, the traversal stops (so it usually stops at the first sentinel node).

 

(2) Registering listeners on the sentinels: this allows the client to be notified by the sentinels when a failover occurs, so it can switch to the new master in time. Concretely, the client uses Redis's publish/subscribe feature: for each sentinel node it starts a separate thread that subscribes to that sentinel's +switch-master channel and re-initializes the connection pool when a message arrives.
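
You can observe this notification mechanism by hand with redis-cli: subscribe to the channel on a sentinel node, then trigger a failover. The message payload is roughly "<masterName> <old-ip> <old-port> <new-ip> <new-port>":

redis-cli -p 26379 subscribe +switch-master
# after a failover, a message like the following arrives:
# mymaster 192.168.92.128 6379 192.168.92.128 6380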

 

3. Summary

 

Having looked at how the client works, we can better understand the sentinel's functions:

 

Configuration provider: the client obtains the current master node's address from a sentinel node plus the masterName; here the sentinel acts as a configuration provider.

 

It is important to note that the sentinel is only a configuration provider, not a proxy. The difference is:

 

  • With a configuration provider, the client obtains the master node's address from the sentinel and then establishes a connection directly to the master node; subsequent requests (such as set/get) are sent straight to the master node.

  • With a proxy, every request from the client would be sent to the sentinel, which would forward it to the master node.

 

Here is an example showing that the sentinel acts as a configuration provider rather than a proxy. In the sentinel system deployed above, change the sentinel node's configuration as follows:

 

sentinel monitor mymaster 192.168.92.128 6379 2

is changed to

sentinel monitor mymaster 127.0.0.1 6379 2

 

Then run the client code on another machine on the LAN; you will find that the client cannot connect to the master node. This is because, as a configuration provider, the sentinel tells the client that the master node's address is 127.0.0.1:6379; the client then tries to open a Redis connection to 127.0.0.1:6379 on its own machine, which naturally fails. If the sentinel were a proxy, this problem would not arise.

 

Notification: after a failover, the sentinel nodes send the new master node's information to the client so that the client can switch to the new master in time.

 

IV. Basic principles

 

The previous sections covered how to deploy and use Sentinel; this part introduces the basic principles of how Sentinel works.

 

1. Commands supported by the sentinel node

 

As Redis nodes running in a special mode, sentinel nodes support a different set of commands from ordinary Redis nodes. In day-to-day operations we can query or modify the sentinel system through these commands. More importantly, the sentinel system could not implement failure detection, failover, and so on without communication between the sentinels, and a large part of that communication happens through the commands the sentinels support. The main commands supported by sentinel nodes are listed below.

 

Basic queries:

 

With these commands you can query the topology, node information, and configuration of the sentinel system (a few command-line examples follow the list):

 

  • info sentinel: obtains basic information about all monitored master nodes.

  • sentinel masters: obtains detailed information about all monitored master nodes.

  • sentinel master mymaster: obtains detailed information about the monitored master node mymaster.

  • sentinel slaves mymaster: obtains detailed information about the slave nodes of the monitored master node mymaster.

  • sentinel sentinels mymaster: obtains detailed information about the sentinel nodes monitoring the master node mymaster.

  • sentinel get-master-addr-by-name mymaster: obtains the address of the master node mymaster.

  • sentinel is-master-down-by-addr: used by sentinel nodes to ask one another whether a master node is down, in order to decide whether it is objectively offline.
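
For example, from the command line these queries can be issued directly against one of the sentinel nodes deployed above:

redis-cli -p 26379 sentinel masters
redis-cli -p 26379 sentinel slaves mymaster
redis-cli -p 26379 sentinel sentinels mymaster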

 

Add/remove monitoring for the master node:

 

sentinel monitor mymaster2 192.168.92.128 16379 2: adds monitoring of a new master node; it works the same way as the sentinel monitor directive in the sentinel configuration file.

 

sentinel remove mymaster2: makes the current sentinel node stop monitoring the master node mymaster2.

 

Forced failover:

 

sentinel failover mymaster: forces a failover of mymaster even if the current master node is running fine; for example, if the current master's machine is about to be decommissioned, this command can be used to perform the failover in advance.

 

2. Fundamentals

 

The key to understanding how sentinels work lies in the following concepts:

 

Scheduled tasks: each sentinel node maintains three scheduled tasks. Their functions are: sending the info command to the master and slave nodes to obtain the latest master/slave topology; obtaining information about the other sentinel nodes through publish/subscribe; and sending the ping command to the other nodes for heartbeat detection, to determine whether they are down.

 

Subjective offline: in the heartbeat-detection scheduled task, if another node does not give a valid reply within a certain time, the sentinel marks it as subjectively offline. As the name suggests, subjective offline means a single sentinel "subjectively" judges the node to be down; its counterpart is objective offline.

 

Objective offline: after marking the master node subjectively offline, the sentinel asks the other sentinel nodes for their view of the master's state via the sentinel is-master-down-by-addr command. If the number of sentinels that consider the master to be down reaches a certain threshold, the master is marked objectively offline.

 

Note that objective offline is a concept that applies only to the master node; if a slave node or a sentinel node fails, it is marked subjectively offline by the sentinels, but no objective offline or failover follows.

 

Leader sentinel election: when the master node is judged objectively offline, the sentinel nodes negotiate to elect a leader sentinel, and that leader carries out the failover.

 

All sentinels monitoring the master node are candidates for leader; the election uses the Raft algorithm, whose basic idea is first come, first served: in a given election round, sentinel A asks B to make A the leader, and if B has not yet agreed to any other sentinel, it agrees to A becoming the leader. The detailed election process is not described here; in general the election is very fast, and whichever sentinel completes the objective-offline judgment first usually becomes the leader.

 

Failover: the elected leader sentinel starts the failover operation, which can be broken down into three steps:

 

  • Select the new master among the slave nodes: first filter out unhealthy slaves; then choose the slave with the highest priority (specified by slave-priority); if priorities are equal, choose the slave with the largest replication offset; if still tied, choose the slave with the smallest runid.

  • Update the master/slave state: use the slaveof no one command to turn the chosen slave into the master, and use the slaveof command to make the other slaves replicate the new master (a command-level sketch follows this list).

  • Set the offline master node (6379 in our example) as a slave of the new master, so that when 6379 comes back online it automatically becomes a slave of the new master.
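
As a command-level sketch of our demo (6380 promoted; the leader sentinel issues the equivalent of these commands itself, you do not run them by hand):

# on 6380, the chosen slave: promote it to master
slaveof no one
# on 6381: replicate the new master
slaveof 192.168.92.128 6380
# 6379 is recorded as a slave of the new master; when it comes back online it is told to run:
slaveof 192.168.92.128 6380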

 

With the key concepts above, the basic workings of the sentinels can be understood. As an illustration, the figure below shows the log of the leader sentinel node from startup through a failover.

 

 

V. Configuration and practical suggestions

 

1. Configuration

 

Several Sentinel-related configuration items are described below.

 

Configuration 1: sentinel monitor {masterName} {masterIp} {masterPort} {quorum}

 

sentinel monitor is the core sentinel configuration and was explained in the deployment section above. masterName specifies the master node's name, masterIp and masterPort specify its address, and quorum is the threshold for judging the master node objectively offline: when the number of sentinels judging the master to be offline reaches quorum, the master is marked objectively offline. The recommended value is half the number of sentinels plus one (for example, 2 with three sentinels).

 

Configuration 2: sentinel down-after-milliseconds {masterName} {time}

 

down-after-milliseconds relates to the subjective-offline judgment: the sentinel performs heartbeat detection on other nodes with the ping command, and if a node does not give a valid reply within down-after-milliseconds, the sentinel marks it subjectively offline. This setting applies to the subjective-offline judgment of master nodes, slave nodes, and sentinel nodes alike.

 

The default value of down-after-milliseconds is 30000 (30 s); it can be adjusted to the network environment and application requirements. A larger value makes the subjective-offline judgment more lenient, which reduces the chance of false positives but increases the time needed to detect a failure and complete the failover, so clients wait longer. If the application demands high availability, the value can be lowered so that failover completes as soon as possible after a failure; if the network is poor, the threshold can be raised to avoid frequent false positives.

 

Configuration 3: sentinel parallel-syncs {masterName} {number}

 

parallel-syncs relates to how the slaves replicate after a failover: it specifies the number of slaves that initiate replication against the new master at one time. For example, suppose that after a master switchover three slaves need to replicate from the new master: with parallel-syncs=1 the slaves replicate one by one, and with parallel-syncs=3 all three start replicating at once.

 

The larger parallel-syncs is, the sooner the slaves finish replicating, but the greater the network and disk load on the master node. Set it according to the actual situation; for example, if the master's load is low and the availability of the slaves matters, parallel-syncs can be increased appropriately. The default value of parallel-syncs is 1.

 

Configuration 4: sentinel failover-timeout {masterName} {time}

 

failover-timeout relates to judging failover timeouts. It is not the timeout for the entire failover but for several of its sub-phases: for example, if promoting a slave to master takes longer than failover-timeout, or if a slave takes longer than failover-timeout to start replicating from the new master (excluding the time required for the data replication itself), the failover is considered to have timed out and failed.

 

The default value of failover-timeout is 180000, i.e. 180 s. If a failover times out, the next attempt doubles the value.
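
Putting these tuning parameters together, the relevant lines of a sentinel configuration for the mymaster example might look like this (the values shown are the defaults discussed above, not recommendations):

sentinel monitor mymaster 192.168.92.128 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000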

 

Configuration 5: there are other parameters besides the above, for example those related to security authentication, which are not covered here.

 

2. Practical suggestions

 

  • There should be more than one sentinel node. On the one hand this provides redundancy so that the sentinels themselves do not become the bottleneck for high availability; on the other hand it reduces false judgments about nodes being offline. These sentinel nodes should also be deployed on different physical machines.

  • The number of sentinels should be odd, so that the sentinels can reach "decisions" by voting: leader election, the objective-offline judgment, and so on.

  • The sentinel nodes should be configured consistently, both in hardware and in parameters. In addition, all nodes should use NTP or a similar service to keep their clocks accurate and in sync.

  • Sentinel's configuration-provider and notification functions require client support, as seen with Jedis above; if the library a developer uses does not provide such support, it may have to be implemented by hand.

  • When the nodes of a sentinel system are deployed in Docker (or any other software that performs port mapping), take special care: port mapping can prevent the sentinel system from working properly, because the sentinels' work depends on communicating with the other nodes, and Docker's port mapping can make a sentinel unable to reach them. For example, sentinels discover one another through the IP and port each one announces; if sentinel A runs in Docker behind a port mapping, the other sentinels cannot connect to A using the port A announces.

 

VI. Summary

 

This article first introduced Sentinel's roles: monitoring, automatic failover, configuration provider, and notification; it then described how to deploy a sentinel system and how a client accesses it; after that it briefly explained the basic principles behind Sentinel; finally, it offered some practical suggestions.

 

On top of master-slave replication, Sentinel adds automatic failover of the master node, further improving the availability of Redis. But Sentinel's shortcomings are also obvious: it cannot automatically fail over slave nodes, so in a read/write-splitting scenario a slave failure makes the read service unavailable, and we have to monitor and switch over slave nodes ourselves.

 

In addition, Sentinel still does not solve the problems that write operations cannot be load balanced and that storage capacity is limited to a single machine; solving those requires Redis Cluster, so stay tuned for upcoming community content.

 

References

  • https://redis.io/topics/sentinel

  • http://www.redis.cn/

  • Redis Development and Operation

  • Redis Design and Implementation

Author: The programming myth

Source: www.cnblogs.com/kismetv/p/9609938.html
