Redis is a very popular in-memory database. Besides delivering very high performance, it also needs to remain highly available: when a failure occurs, its impact on the service should be kept as small as possible. To that end, Redis provides a complete fault recovery mechanism: Sentinel.

Let’s take a look at how Redis fault recovery works.

Deployment patterns

When Redis is deployed, it can be deployed in a number of ways, each of which corresponds to a different availability level.

  • Single-node deployment: Only one node provides services. All read and write services are performed on this node. If this node goes down, all data is lost, affecting services.
  • Master-slave deployment: In master-slave mode, data is written to the master and data is read from the slave. Read/write separation improves access performance. If the master is down, you need to manually promote the slave to master.
  • Master-slave + Sentinel deployment: The master-slave setup is the same as above; the difference is that a group of Sentinels is added to check the health of the master in real time. If the master breaks down, a slave is automatically promoted to be the new master, minimizing downtime and the impact on the service.

As you can see from the deployment patterns above, the key to improving Redis availability is multi-replica deployment + automatic failover, which relies on master/slave replication.
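
To make the read/write separation of the master-slave pattern concrete, here is a minimal sketch using the redis-py client; the host names are placeholders, and reads from the replica are only eventually consistent with the master.

```python
# Minimal sketch of master-slave read/write separation with redis-py.
# "redis-master" and "redis-replica-1" are placeholder host names.
import redis

master = redis.Redis(host="redis-master", port=6379)      # all writes go to the master
replica = redis.Redis(host="redis-replica-1", port=6379)  # reads can be served by a replica

master.set("user:1:name", "alice")  # written on the master
print(replica.get("user:1:name"))   # readable on the replica once replication catches up
```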

High availability in practice

Redis natively provides master-slave data replication to ensure that slave data is always consistent with master data.

In the event of a master problem, we need to promote the slave to master and continue to provide service. Therefore, Redis provides sentinel nodes to manage the master-slave nodes and automatically recover the master when it fails.

The whole fault recovery process is carried out automatically by Redis Sentinel.

Introducing Sentinel

Sentinel is Redis’s high-availability solution: a service that manages multiple Redis instances, providing monitoring, notification, and automatic failover.

When deploying Sentinel, we only need to list the master nodes to be managed in the Sentinel configuration file; the Sentinel nodes then manage those Redis nodes according to that configuration to achieve high availability.
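
As a rough illustration, the core of that configuration is a single "sentinel monitor" directive naming the master and its quorum. The sketch below, assuming a local Sentinel on the default port 26379 and a master registered under the name mymaster, uses the redis-py client to ask a running Sentinel what it is monitoring.

```python
# Sketch: inspect what a Sentinel node is monitoring, using redis-py.
# The corresponding entry in sentinel.conf would look like:
#     sentinel monitor mymaster 127.0.0.1 6379 2
# where "mymaster" is the logical master name and 2 is the quorum.
import redis

# Sentinel speaks the Redis protocol itself, by default on port 26379.
sentinel_node = redis.Redis(host="127.0.0.1", port=26379, decode_responses=True)

# SENTINEL MASTERS lists every master this Sentinel manages.
for name, info in sentinel_node.sentinel_masters().items():
    print(name, info["ip"], info["port"])
```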

Generally, we need to deploy multiple Sentinel nodes. In a distributed environment, a single machine cannot reliably decide whether a node on another machine has failed: the network between the two machines may be at fault while the node itself is perfectly healthy.

Therefore, health detection is usually performed by several nodes at the same time, spread across different machines and deployed in an odd number, so that a network partition does not lead the Sentinels to a wrong decision. The Sentinel nodes exchange their detection results with one another and reach a final, joint decision on whether a node has actually failed.

After the Sentinels are deployed and configured, they automatically manage the configured master-slave groups. If the master fails, the Sentinels promptly promote a slave to be the new master to keep the service available.

So how does it work?

How Sentinel works

Sentinel’s workflow is divided into the following stages:

  • State awareness
  • Heartbeat detection
  • Elect a Sentinel leader
  • Select a new master
  • Promote the new master (fault recovery)
  • The client senses the new master

These phases are described in detail below.

State awareness

At startup, a Sentinel only knows the address of the master it was configured with. To recover from a master failure, however, it needs to know the slaves of each master, and a master may have more than one slave, so the Sentinels need the complete topology of the whole cluster. How do they get this information?

Every 10 seconds, each Sentinel sends the info command to every master node. The reply contains the master/slave topology, including the address and port of every slave. With this information, a Sentinel keeps track of the topology of these nodes and can pick a suitable slave for fault recovery when a failure occurs later.
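
To see what a Sentinel has to work with, the sketch below (assuming a local master and the redis-py client) reads the replication section of info, which is where the replica addresses and offsets appear.

```python
# Sketch: the replication section of INFO is what a Sentinel parses every 10s
# to learn the topology. The master address below is a placeholder.
import redis

master = redis.Redis(host="127.0.0.1", port=6379)
repl = master.info("replication")

print(repl["role"])               # "master"
print(repl["connected_slaves"])   # number of attached replicas
# Each replica appears as slave0, slave1, ... with its ip, port, state and offset.
for key, value in repl.items():
    if key.startswith("slave") and isinstance(value, dict):
        print(key, value["ip"], value["port"], value["state"])
```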

In addition to sending info to the master, each Sentinel also publishes the master’s current state together with its own information to a dedicated Pub/Sub channel on every master node (the __sentinel__:hello channel). The other Sentinels can subscribe to this channel to receive what every Sentinel publishes.

There are two main reasons for doing this:

  • A Sentinel can discover other Sentinels as they join, which makes it easy for the Sentinel nodes to communicate with each other and lays the groundwork for the joint negotiation that follows
  • Sentinels exchange the master’s status information with one another, which is the basis for deciding whether the master has failed
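
If you are curious what this exchange looks like, the sketch below (placeholder master address, redis-py client) simply subscribes to the __sentinel__:hello channel on the master and prints the announcements the Sentinels publish there, roughly every 2 seconds.

```python
# Sketch: watch the hello messages Sentinels publish on a monitored master.
import redis

master = redis.Redis(host="127.0.0.1", port=6379, decode_responses=True)
pubsub = master.pubsub()
pubsub.subscribe("__sentinel__:hello")

for message in pubsub.listen():
    if message["type"] == "message":
        # Each payload carries the announcing Sentinel's ip/port/runid and epoch,
        # plus the master name, address and configuration epoch it believes in.
        print(message["data"])
```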

Heartbeat detection

When a failure occurs, you need to start the fail-over mechanism immediately, so how do you ensure timeliness?

Every second, each Sentinel node sends the ping command to the master, the slaves, and the other Sentinels. If a node responds within the specified time, that node is considered healthy and alive; if it does not respond within the specified (configurable) time, the Sentinel considers that node subjectively offline.

Why is this called subjectively offline?

Because only the current Sentinel has observed the missing responses. The network between the two machines may be at fault while the master itself is perfectly fine, in which case declaring the master failed would be wrong.

Multiple sentinels are required to confirm that the master node has failed.

Each sentinel node jointly confirms that there is a true failure on the node by asking the other sentinels about the status of the master node.

A node is marked as objectively offline only when more than a specified (configurable) number of Sentinels consider it subjectively offline.
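
As a toy illustration of these two notions (not Sentinel’s actual code; the threshold names mirror the configurable down-after-milliseconds and quorum settings):

```python
import time

DOWN_AFTER_MS = 30_000  # how long without a valid PING reply before suspecting a node
QUORUM = 2              # how many Sentinels must agree before it is "objective"

def subjectively_down(last_valid_pong_ms: float, now_ms: float) -> bool:
    """This Sentinel alone has not seen a valid reply for too long."""
    return now_ms - last_valid_pong_ms > DOWN_AFTER_MS

def objectively_down(sdown_votes: int) -> bool:
    """Enough Sentinels independently report the node as subjectively down."""
    return sdown_votes >= QUORUM

now = time.time() * 1000
print(subjectively_down(now - 45_000, now))  # True: no reply for 45 seconds
print(objectively_down(sdown_votes=2))       # True: the quorum of 2 is reached
```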

Elect a Sentinel leader

After confirming that the node is truly faulty, you need to enter the fault recovery phase. How to perform fault recovery also requires a series of processes.

The first step is to elect a Sentinel leader, which alone carries out the failover rather than having every Sentinel take part. Electing this leader requires negotiation among the Sentinel nodes.

In the distributed world, this kind of election by negotiation is called reaching consensus, and the negotiation procedure is called a consensus algorithm.

A consensus algorithm solves the problem of how multiple nodes in a distributed system agree on a single result for a given question.

There are many consensus algorithms, such as Paxos, Raft, and Gossip. If you are interested, you can look up the details yourself.

Sentinel’s leader election is similar to the Raft algorithm, which is simple and easy to understand.

The process is as follows:

  • Each Sentinel sets a random timeout; when it expires, the Sentinel asks the others to vote for it as leader
  • Each Sentinel can only grant its vote to the first request it receives
  • The first Sentinel to collect a majority of votes becomes the leader
  • If, after the replies are counted, no Sentinel reaches a majority, a new round of election is held until a leader emerges

Once the Sentinel leader is elected, it carries out all subsequent failover operations.
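
A toy simulation of this vote (a deliberate simplification, not Sentinel’s implementation) can make the "first request wins the vote, majority wins the election" rule concrete:

```python
# Toy simulation: each Sentinel waits a random timeout, then asks the others
# for a vote; a Sentinel only grants its vote to the first request it sees.
import random

sentinels = ["s1", "s2", "s3"]
timeouts = {s: random.uniform(0, 1) for s in sentinels}  # random election delays
votes = {}  # voter -> candidate it granted its single vote to

# Candidates whose timeout fires earlier get to ask for votes first.
for candidate in sorted(sentinels, key=lambda s: timeouts[s]):
    for voter in sentinels:
        if voter not in votes:      # each Sentinel votes at most once per round
            votes[voter] = candidate

tally = {c: list(votes.values()).count(c) for c in sentinels}
leader = max(tally, key=tally.get)
if tally[leader] > len(sentinels) // 2:  # a strict majority is required
    print(f"{leader} becomes the failover leader with {tally[leader]} votes")
else:
    print("no majority reached, a new election round starts")
```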

Select a new master

When a master fails, the Sentinel leader needs to select one of that master’s slave nodes to take its place.

The selection of the new master follows an order of priority. When there are multiple slaves, candidates are ranked as follows: slave-priority configuration > data completeness > smaller run ID.

In other words, the slave with the lowest slave-priority is preferred. If all slaves are configured with the same priority, the one with the most complete data (the largest replication offset) is chosen. If the data is also the same, the slave with the smaller run ID is chosen.
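
The ranking can be expressed as a simple sort key, as in the sketch below (the candidate data is made up for illustration; note also that a slave-priority of 0 excludes a replica from promotion entirely):

```python
# Sketch of the selection order: lower slave-priority first, then more complete
# data (larger replication offset), then the lexicographically smaller run id.
candidates = [
    {"addr": "10.0.0.2:6379", "slave_priority": 100, "repl_offset": 5000, "runid": "b3f1..."},
    {"addr": "10.0.0.3:6379", "slave_priority": 100, "repl_offset": 5200, "runid": "a17c..."},
    {"addr": "10.0.0.4:6379", "slave_priority": 50,  "repl_offset": 4800, "runid": "c90d..."},
]

best = min(
    candidates,
    key=lambda c: (c["slave_priority"], -c["repl_offset"], c["runid"]),
)
print("promote", best["addr"])  # 10.0.0.4:6379 wins on slave-priority alone
```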

Promote the new master

After ranking the candidates and picking the new master, the next step is to perform the actual master/slave switch.

The Sentinel leader sends the slaveof no one command to the selected node to promote it to master.

The Sentinel leader then sends the slaveof $newmaster command to all remaining slaves of the failed master, making them slaves of the new master and starting data synchronization from it.

Finally, the Sentinel leader demotes the faulty node to a slave and records this in its own configuration file. When the faulty node recovers, it automatically becomes a slave of the new master.
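
The commands the leader effectively issues can be sketched with redis-py as follows (the addresses are placeholders, and in reality Sentinel sends these commands itself):

```python
# Sketch of the failover commands, for illustration only.
import redis

new_master = redis.Redis(host="10.0.0.4", port=6379)
new_master.slaveof()  # with no arguments, redis-py sends SLAVEOF NO ONE: promote to master

for host in ("10.0.0.2", "10.0.0.3"):
    replica = redis.Redis(host=host, port=6379)
    replica.slaveof("10.0.0.4", 6379)  # repoint the remaining slaves at the new master
```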

At this point, the whole failover is complete.

The client senses the new master

Finally, how does the client get the latest master address?

After a failover, the Sentinel publishes a message to a dedicated Pub/Sub channel on the Sentinel node (the +switch-master channel); clients can subscribe to this channel to be notified when the master changes. A client can also actively query a Sentinel node for the current master address and thus obtain the latest master.
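
A minimal sketch of the subscription approach, assuming a local Sentinel on port 26379 and the redis-py client:

```python
# Sketch: learn the new master address by subscribing to +switch-master on a Sentinel.
import redis

sentinel_node = redis.Redis(host="127.0.0.1", port=26379, decode_responses=True)
pubsub = sentinel_node.pubsub()
pubsub.subscribe("+switch-master")

for message in pubsub.listen():
    if message["type"] == "message":
        # Payload: "<master-name> <old-ip> <old-port> <new-ip> <new-port>"
        name, old_ip, old_port, new_ip, new_port = message["data"].split()
        print(f"master {name} moved from {old_ip}:{old_port} to {new_ip}:{new_port}")
```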

In addition, Sentinel provides a “hook” mechanism: you can configure a script in the Sentinel configuration file that is triggered when the failover completes, notifying the client that a failover has happened so that it can fetch the latest master address from Sentinel again.

In general, the first approach is recommended. Many client SDKs already integrate the logic of obtaining the latest master address from the Sentinel nodes, so we can use them directly.
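
For example, the redis-py client ships a Sentinel helper that asks the Sentinels for the current master and replicas and transparently follows failovers (the addresses and the mymaster name below are placeholders):

```python
from redis.sentinel import Sentinel

sentinel = Sentinel([("127.0.0.1", 26379), ("127.0.0.1", 26380)], socket_timeout=0.5)

print(sentinel.discover_master("mymaster"))  # current (ip, port) of the master

master = sentinel.master_for("mymaster", socket_timeout=0.5)  # connection that tracks the master
replica = sentinel.slave_for("mymaster", socket_timeout=0.5)  # connection to one of the replicas

master.set("greeting", "hello")
print(replica.get("greeting"))
```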

Conclusion

As we have seen, to keep Redis highly available the Sentinel nodes must accurately determine that a failure has occurred and quickly select a new node to take over the service, which is a fairly involved process.

Along the way it relies on distributed consensus, distributed negotiation, and related techniques to ensure that failover is performed correctly.

Understanding how Redis high availability works helps us use Redis more accurately and effectively.