Distributed Redis Deep Adventures: Sentinel

This article is the second in the Distributed Redis Deep Adventures series and focuses on the Redis Sentinel feature. More articles can be found on my blog: github.com/farmerjohng…

The previous article explained how data is synchronized between Redis master and slave servers. In a single-master, single-slave or single-master, multi-slave setup, if the master server fails, the cluster can no longer serve writes, so replication alone does not remove the single point of failure. Redis uses Sentinel to solve this problem and keep the cluster highly available.

How to ensure high availability of a cluster

To ensure high availability of a cluster, the following requirements must be met:

Monitor the status of the servers, so that an unavailable master server is detected promptly
When the master server is unavailable, select the most suitable slave server to replace it
At any given time, only one master server holds a given set of data

The most intuitive way to do this is to use a dedicated monitoring server that watches the status of the Redis servers.

A heartbeat connection is maintained between the monitoring server and the master and slave servers. If no heartbeat arrives from the master server within a set period, the monitoring server marks the master as offline and notifies a slave server to take over as the new master.

When the original master server comes back online, the monitoring server converts it into a slave of the new master.
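The naive scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Redis does it: the `Node` and `Monitor` classes, the `timeout` parameter, and the "offline" role name are all made up for this example, and timestamps are passed in by hand instead of being read from a clock.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    name: str
    role: str              # "master", "slave", or "offline" (hypothetical labels)
    last_heartbeat: float  # time (seconds) at which the last heartbeat was seen


@dataclass
class Monitor:
    timeout: float                              # allowed silence before a master is considered down
    nodes: Dict[str, Node] = field(default_factory=dict)

    def add(self, node: Node) -> None:
        self.nodes[node.name] = node

    def heartbeat(self, name: str, now: float) -> None:
        node = self.nodes[name]
        node.last_heartbeat = now
        if node.role == "offline":
            # A returning former master rejoins as a slave of the current master.
            node.role = "slave"

    def check(self, now: float) -> None:
        master = next((n for n in self.nodes.values() if n.role == "master"), None)
        if master is None or now - master.last_heartbeat <= self.timeout:
            return
        # The master missed its heartbeat window: mark it offline ...
        master.role = "offline"
        # ... and promote the first slave whose heartbeat is still fresh.
        for node in self.nodes.values():
            if node.role == "slave" and now - node.last_heartbeat <= self.timeout:
                node.role = "master"
                break
```

Calling `check` periodically and `heartbeat` whenever a server reports in is enough to reproduce the promote-then-demote behavior described above. Note that the `Monitor` itself is still a single point of failure.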

This process seems to solve the high-availability problem, but something is still wrong: what if the monitoring server itself fails? We could add a standby monitoring server that takes over when the primary monitoring server is unavailable.

But then who monitors the monitoring servers? The chain of monitors never ends.

Setting that doubt aside for now, let's look at how Redis Sentinel is implemented.

Sentinel

As in the idea from the previous section, Redis monitors its data servers by adding extra Sentinel servers. Each Sentinel keeps connections to all primary and secondary servers so that it can track their status and send them commands.

Sentinel itself is a Redis server running in a special mode. It is started with the command: redis-server /xxx/sentinel.conf --sentinel. Startup in Sentinel mode differs from that of an ordinary Redis server: for example, RDB and AOF files are not loaded, because Sentinel does not store application data.

Establish a connection to the primary server

When Sentinel starts, it establishes two connections, a command connection and a subscription connection, to each master server listed in its configuration file.

Command connections are used to send commands to the server.

The subscription connection is used to subscribe to the server's __sentinel__:hello channel, through which a Sentinel learns about the other Sentinels. More on that below.

Get master server information

Sentinel sends the INFO command to the master server at a fixed interval (every 10 seconds by default) to obtain information about the master itself, such as its run ID, and about its slave servers, including their IP addresses and ports. Sentinel updates its own records based on the INFO reply and establishes connections to any newly discovered slave servers.
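To make the discovery step concrete, here is a minimal sketch of pulling the slave list out of an INFO reply. It assumes the replication section lists slaves as lines of the form slave0:ip=...,port=...,state=...,offset=...,lag=..., which is how I recall the output looking; the function name `parse_slaves` is invented for this example and is not part of Redis.

```python
from typing import Dict, List


def parse_slaves(info_text: str) -> List[Dict[str, object]]:
    """Extract slave entries from the text of an INFO reply (assumed line format)."""
    slaves = []
    for line in info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Only keep keys such as "slave0", "slave1"; skip e.g. "slave_priority".
        if not sep or not key.startswith("slave") or not key[5:].isdigit():
            continue
        fields = dict(item.split("=", 1) for item in value.split(","))
        slaves.append({
            "ip": fields["ip"],
            "port": int(fields["port"]),
            "state": fields.get("state"),
            "offset": int(fields.get("offset", 0)),
        })
    return slaves
```

A Sentinel-like monitor would compare this list with the slaves it already knows and open connections only to the new ones.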

Get slave server information

Just as it does with the master server, Sentinel uses the INFO command to obtain information about each slave server, including the slave's run ID, the state of its connection to the master, its priority, and its replication offset.
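The four values listed above map onto flat key:value lines in a slave's INFO reply. The sketch below assumes the field names run_id, master_link_status, slave_priority, and slave_repl_offset, which is how I remember them; names can differ between Redis versions, and `slave_summary` is a helper invented for this example.

```python
from typing import Dict, Optional


def parse_info(info_text: str) -> Dict[str, str]:
    """Turn 'key:value' lines into a dict, skipping blank lines and '# Section' headers."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def _to_int(value: Optional[str]) -> Optional[int]:
    return int(value) if value is not None else None


def slave_summary(info_text: str) -> Dict[str, object]:
    """Pick out the values Sentinel cares about (field names are assumptions)."""
    f = parse_info(info_text)
    return {
        "run_id": f.get("run_id"),
        "master_link_status": f.get("master_link_status"),
        "priority": _to_int(f.get("slave_priority")),
        "repl_offset": _to_int(f.get("slave_repl_offset")),
    }
```

Missing fields come back as None rather than a guessed default, so the caller can tell "unknown" apart from a real value.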

Subscribe and publish messages to the server

The section on how to ensure high availability of a cluster left one question open: how do we make the monitoring servers themselves highly available? The short answer is to use a cluster of monitoring servers, that is, a Sentinel cluster. We will set aside for the moment how the Sentinels stay consistent with one another; just keep in mind that high availability requires several Sentinels. That raises the next question: how does one Sentinel discover the others?

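As mentioned earlier, Sentinel establishes two connections to each server, one of which is a subscription connection. Sentinel periodically (every 2 seconds by default) publishes a message to the server's __sentinel__:hello channel over its command connection, and receives the messages published by others over its subscription connection. Each message includes: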

Information about the Sentinel itself, such as its IP address, port number, and configuration epoch (see below)
Information about the primary server the Sentinel is monitoring, including its IP address, port, and configuration epoch (see below)

Sentinel also subscribes to the __sentinel__:hello channel, meaning every Sentinel both publishes to and receives from the channel.

For each master server it monitors, Sentinel keeps a dictionary called sentinels, which holds all the other Sentinels that monitor the same master. When a Sentinel receives a message from the __sentinel__:hello channel, it checks who sent it. If the message is its own, it is ignored. Otherwise the Sentinel updates the matching entry in sentinels and, if the sender is new, establishes a connection to it.
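The discovery logic above can be sketched as follows. The payload layout used here (eight comma-separated fields: Sentinel IP, port, run ID, current epoch, then master name, IP, port, and configuration epoch) is my recollection of the hello message and should be treated as an assumption. `Hello`, `parse_hello`, and `handle_hello` are names made up for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Hello:
    ip: str
    port: int
    run_id: str
    current_epoch: int
    master_name: str
    master_ip: str
    master_port: int
    master_config_epoch: int


def parse_hello(payload: str) -> Hello:
    """Parse an assumed 8-field, comma-separated hello payload."""
    parts = payload.split(",")
    if len(parts) != 8:
        raise ValueError("unexpected hello payload: %r" % payload)
    return Hello(parts[0], int(parts[1]), parts[2], int(parts[3]),
                 parts[4], parts[5], int(parts[6]), int(parts[7]))


def handle_hello(my_run_id: str, sentinels: Dict[str, Hello], payload: str) -> Optional[Hello]:
    """Update `sentinels`; return the entry only if the sender was previously unknown."""
    hello = parse_hello(payload)
    if hello.run_id == my_run_id:
        return None                      # our own message: ignore it
    is_new = hello.run_id not in sentinels
    sentinels[hello.run_id] = hello      # refresh (or create) the record
    return hello if is_new else None
```

A caller would open a command connection to any Sentinel returned by `handle_hello`, since a non-None result means the sender has just been discovered.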

Subjective offline

By default, Sentinel sends a PING command once per second to every server it is connected to (primaries, secondaries, and other Sentinels). If no valid response is received within down-after-milliseconds, Sentinel marks the server as subjectively offline, meaning that this particular Sentinel believes the server is down. Note that different Sentinels may be configured with different down-after-milliseconds values, so they can disagree about whether a server is down.
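As a small illustration of why two Sentinels can disagree, the check can be reduced to one comparison. The function name and the millisecond timestamps are assumptions for this sketch; the real Sentinel tracks more state per connection.

```python
def is_subjectively_down(last_reply_ms: int, now_ms: int, down_after_ms: int) -> bool:
    """True when no valid reply has arrived within the down-after-milliseconds window."""
    return now_ms - last_reply_ms > down_after_ms


# The same silence of 20 seconds, judged by two Sentinels with different settings.
last_reply_ms, now_ms = 0, 20_000
sentinel_a_view = is_subjectively_down(last_reply_ms, now_ms, down_after_ms=10_000)
sentinel_b_view = is_subjectively_down(last_reply_ms, now_ms, down_after_ms=30_000)
```

Here the first Sentinel already considers the server down while the second does not, which is exactly why a single Sentinel's opinion is only "subjective".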

Objective offline

To confirm that a primary server is really offline, a Sentinel that considers it subjectively offline sends the SENTINEL is-master-down-by-addr command to the other Sentinel instances monitoring the same primary. Each Sentinel that receives the command replies with its own view of the primary server's status, that is, whether it also considers the primary offline based on its own connection to it.

The Sentinel tallies the replies to the SENTINEL is-master-down-by-addr commands it sent and counts how many Sentinels agree that the primary server is offline. If that number reaches the configured threshold (the quorum), the primary server is marked as objectively offline.
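The tally can be sketched like this. It assumes that the asking Sentinel counts its own subjective verdict as one vote and that the threshold is met when the count is greater than or equal to the quorum; both points are my understanding of the behavior rather than something stated above, and the function name is invented.

```python
from typing import List


def is_objectively_down(own_view_down: bool, peer_replies: List[bool], quorum: int) -> bool:
    """Decide objective-offline status from this Sentinel's view plus its peers' replies."""
    if not own_view_down:
        # A Sentinel only asks its peers once it already thinks the primary is down.
        return False
    agreeing = 1 + sum(1 for reply in peer_replies if reply)  # 1 = this Sentinel itself
    return agreeing >= quorum
```

With three Sentinels and a quorum of 2, one agreeing peer is enough: `is_objectively_down(True, [True, False], 2)` evaluates to True.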

Electing the lead Sentinel

After a Sentinel marks a master server as objectively offline, the Sentinels monitoring that server negotiate to elect a lead Sentinel, using a leader election based on the Raft algorithm. It is recommended that you read up on the basics of Raft before moving on to the rest of the article.

Rules:

Every Sentinel is eligible to become the lead Sentinel
After each election, the configuration epoch is incremented by 1, whether or not a lead Sentinel was elected
Every Sentinel has exactly one vote per configuration epoch
We call a Sentinel that asks others to vote for it the source Sentinel, and a Sentinel that is asked to vote the target Sentinel
Each Sentinel that finds the primary server objectively offline, and has not yet voted for another Sentinel, asks the other Sentinels to set it as the leader
Within a configuration epoch, once a target Sentinel has voted for some Sentinel (possibly itself), it rejects every later vote request in that epoch
A target Sentinel answers a vote request with the ID of the Sentinel it voted for and its current configuration epoch
When the source Sentinel receives a reply to its vote request, it checks whether the reply carries the same configuration epoch as its own; if so, it then checks whether the leader chosen by the target Sentinel is itself
A Sentinel that is set as leader by more than half of the Sentinels becomes the lead Sentinel
At most one leader can be elected in a configuration epoch (because a leader needs the support of more than half of the Sentinels)
If no leader has been elected within a given period of time, a new election is held later (config epoch +1)
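The voting rules above can be condensed into a small in-memory simulation. This is a simplified sketch, not Sentinel's actual code: there is no networking or timing, `Sentinel`, `request_vote`, and `run_election` are names chosen for this example, and a candidate simply bumps its epoch before asking for votes.

```python
from typing import List, Optional, Tuple


class Sentinel:
    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.epoch = 0                          # configuration epoch known to this Sentinel
        self.voted_for: Optional[str] = None    # leader chosen in self.epoch, if any

    def request_vote(self, candidate_id: str, epoch: int) -> Tuple[Optional[str], int]:
        """Target side: first request in an epoch wins; later ones get the same answer."""
        if epoch > self.epoch:
            self.epoch = epoch
            self.voted_for = None               # a new epoch gives a fresh vote
        if epoch == self.epoch and self.voted_for is None:
            self.voted_for = candidate_id
        return self.voted_for, self.epoch


def run_election(candidate: Sentinel, peers: List[Sentinel]) -> bool:
    """Source side: start a new epoch, vote for itself, then ask every peer."""
    candidate.epoch += 1
    candidate.voted_for = candidate.run_id
    votes = 1
    for peer in peers:
        leader, epoch = peer.request_vote(candidate.run_id, candidate.epoch)
        # Count the reply only if it is for the same epoch and names this candidate.
        if epoch == candidate.epoch and leader == candidate.run_id:
            votes += 1
    return votes > (len(peers) + 1) // 2        # needs more than half of all Sentinels
```

Because each target hands out only one vote per epoch and a leader needs more than half of them, two candidates can never both win the same epoch. A split vote just leaves the epoch without a leader until a later election.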

Remember the question we asked at the beginning of this article about how to make the monitoring servers themselves highly available? The answer is to run several Sentinel servers and use the Raft consensus algorithm to keep them in agreement. As long as more than half of the Sentinel servers are healthy, the Sentinel cluster remains available.

Failover

The lead Sentinel will perform the following three steps for failover:

1. Select one of the offline primary server's secondary servers as the new primary server

2. Make the other secondary servers replicate from the new primary server

3. Record the offline primary server as a secondary of the new primary server. When the old primary comes back online, it keeps working as a secondary server
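The three steps above roughly translate into the commands below. This sketch only builds the list of commands and sends nothing. It assumes the classic SLAVEOF command (newer Redis versions also accept REPLICAOF), and the real Sentinel does more than this around each step. The `failover_commands` helper and the address tuples are invented for this example.

```python
from typing import List, Tuple

Address = Tuple[str, int]  # (ip, port)


def failover_commands(new_primary: Address,
                      other_secondaries: List[Address],
                      old_primary: Address) -> List[Tuple[Address, str]]:
    """Return (target, command) pairs in the order the three failover steps imply."""
    ip, port = new_primary
    # Step 1: the chosen secondary stops replicating and becomes a primary.
    commands = [(new_primary, "SLAVEOF NO ONE")]
    # Step 2: every other secondary is pointed at the new primary.
    for secondary in other_secondaries:
        commands.append((secondary, "SLAVEOF %s %d" % (ip, port)))
    # Step 3: the old primary is offline now, so this is sent once it comes back.
    commands.append((old_primary, "SLAVEOF %s %d" % (ip, port)))
    return commands
```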

The rules for selecting a new master server in step 1 are as follows:

1. Filter out all offline secondary servers

2. Filter out the secondary servers that have not replied to the lead Sentinel's INFO command in the last 5 seconds

3. Filter out the secondary servers whose connection to the offline primary server has been down for more than down-after-milliseconds * 10

4. Sort the remaining secondary servers by priority and select the one with the highest priority

5. If multiple secondary servers have the same priority, select the one with the largest replication offset

6. If several servers are still tied after the previous step, select the one with the smallest run ID
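The six selection rules fit naturally into one filter followed by one sort. The sketch below makes a few assumptions: the `Replica` fields are invented names, times are plain millisecond integers, and a lower numeric `priority` value means a higher priority, which is how Redis's replica priority setting works as far as I know.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    run_id: str
    online: bool
    last_info_reply_ms: int    # time of its last reply to INFO
    master_link_down_ms: int   # how long its link to the old primary has been down
    priority: int              # lower number = higher priority (assumption, see above)
    repl_offset: int


def select_new_primary(replicas: List[Replica], now_ms: int,
                       down_after_ms: int) -> Optional[Replica]:
    candidates = [
        r for r in replicas
        if r.online                                          # rule 1
        and now_ms - r.last_info_reply_ms <= 5000            # rule 2
        and r.master_link_down_ms <= down_after_ms * 10      # rule 3
    ]
    if not candidates:
        return None
    # Rules 4 to 6: best priority, then largest offset, then smallest run ID.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))
```

Putting the three tie-breakers into a single sort key keeps the order of precedence explicit: a later criterion only matters when all earlier ones are equal.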