Preface

In a second-round interview at a well-known company, I was asked about the high availability of Kafka and MySQL, and then asked to explain how I would design my own application for high availability. My answers were not good enough, especially on how my own project should be designed to absorb a sharp spike in QPS and handle a large number of requests quickly. In this article, I introduce and illustrate Redis's Sentinel strategy in detail. Comments and questions are very welcome.

The route this article follows

  • What problems does the Sentinel strategy solve?
  • How is the Sentinel strategy implemented?
    • How does Sentinel itself achieve high availability? For Redis to be highly available, Sentinel must be highly available too, so the first thing to show is how sentinels keep themselves highly available.
    • How does Sentinel monitor the state of the Redis cluster? Once Sentinel itself is safe from failure, it is time to monitor the Redis servers.
    • How does Sentinel detect server faults? Sentinel monitors the Redis servers precisely in order to detect and fix problems, and that starts with failure detection.
    • How does Sentinel fail over? Once a problem has been detected it needs to be solved, and failover is the heart of the Sentinel strategy.

What problem is the Sentinel strategy supposed to solve?

Background: We all know that Redis is a fast cache that stores its data in memory. Since data held in memory can be lost, what can we do about it? Redis offers two answers: persisting data to disk (more on that in a later article) and high availability.

High availability is easy to understand. Originally we deployed Redis on a single machine; once that machine goes down, our caching mechanism is gone. To prevent this, we deploy a master-slave architecture with one master and four slaves. Redis's master-slave architecture is designed to increase concurrency: the master handles writes, the slaves handle reads. Now the problem: if the master crashes, all the remaining machines are slaves, so nothing can be written and we can only read the cache. When ops notices the machine is down, someone has to restart it by hand. What about the requests in the meantime? A manual response is a high-latency operation. Put that way, it is obviously a clumsy way to solve the problem.

So Redis's Sentinel mechanism was designed to solve exactly this clumsy problem. The most important problem the Sentinel strategy solves is failing over (switching master and slave) when a server in the Redis cluster fails. Now that we know what Sentinel mode is for, let's look at how it solves the problem.

How the Sentinel strategy is implemented

How does the Sentinel strategy achieve high availability?

In short: sentinels discover each other automatically through the __sentinel__:hello channel, and then monitor each other.

How many machines need to be deployed

Since sentinels exist to make Redis highly available, the sentinels themselves must also be highly available. The Sentinel system achieves high availability by running as a cluster: to avoid a single point of failure, we have to use several machines. A Sentinel system generally requires at least three machines. You may ask: why not two? Two machines can still check each other's heartbeat, and if one breaks down the other can tell the administrator. The answer involves two functions of Sentinel mode, electing the lead sentinel and marking a server objectively offline, both of which require the sentinels to vote together (especially the leader election, which needs a strict majority). So you need at least three machines, and preferably an odd number.
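As a concrete illustration, here is a minimal sentinel configuration that a three-sentinel deployment might share. The master name mymaster and the addresses are placeholders; the directive names are standard Redis Sentinel configuration options, and the values mirror the ones used later in this article (a 5000 ms down-after period and a quorum of 2):

```conf
# sentinel.conf — the same file is deployed on each of the three sentinel machines
port 26379

# Monitor the master at 127.0.0.1:6379; 2 sentinels must agree before ODOWN
sentinel monitor mymaster 127.0.0.1 6379 2

# Consider the master subjectively offline after 5 seconds without a PING reply
sentinel down-after-milliseconds mymaster 5000

# Abort a failover attempt that takes longer than 60 seconds
sentinel failover-timeout mymaster 60000

# During failover, resynchronize slaves with the new master one at a time
sentinel parallel-syncs mymaster 1
```

Note that only the master is listed: the slaves and the other sentinels are discovered automatically, as described below.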

Note: a sentinel is essentially a Redis server running in a special mode, with a command table different from that of a regular Redis server. One Sentinel system can monitor multiple Redis clusters.

Now that you know how many machines the Sentinel strategy needs to be highly available, let's walk through the deployment process.

Sentinel machine deployment process

  1. Write a configuration file on each machine that will be deployed as a sentinel. The configuration file contains the master's IP address and port.
  2. Initialize the Sentinel server. This is similar to initializing a normal Redis server, except that some features are unavailable: SET, persistence, and so on, while other commands, such as SENTINEL, exist only in Sentinel mode.
  3. Use Sentinel's special command table. Sentinel uses sentinelcmds as its command list, containing commands such as INFO and SENTINEL; some commands a regular Redis server can execute cannot be executed by a sentinel, because sentinelcmds does not contain them.
  4. Initialize the sentinelState instance structure.

    The state of all Sentinel functions is stored in this instance structure; it is the basis for our understanding of the Sentinel strategy, which works by modifying the data in this structure.

Figure: sentinelState instance structure

  1. Create network connections to the master: a command connection, and a subscription connection to the master's __sentinel__:hello channel. The command connection is used to communicate with the master; the subscription connection is how sentinels discover each other and receive notifications such as a subsequent master/slave switch.

  2. Create the same network connections to each slave server: a command connection and a subscription connection.

So how is the sentinel cluster itself configured? It isn't, really: sentinels detect each other automatically through the __sentinel__:hello channel, connect to the other sentinels, and maintain state information about them.

How does Sentinel monitor the Redis cluster?

As mentioned above, each sentinel establishes command and subscription connections to the master. How does the Sentinel system build a complete monitoring picture from there?

Discovering slave servers and other sentinels

  1. Each sentinel sends an INFO command to the master every 10 seconds.
  2. The master's INFO reply contains its role and the address of every slave, so the sentinel learns about the slaves. This is how a sentinel automatically discovers the master's slave servers.
  3. After discovering a slave, the sentinel creates a command connection and a subscription connection to that slave as well.
  4. Each sentinel subscribes to the master's __sentinel__:hello channel.
  5. Every two seconds, each sentinel publishes a message of the form <sentinel_info><master_info> to the __sentinel__:hello channel of the master and slaves.
  6. Every sentinel subscribed to the channel receives the message. From this information each sentinel automatically discovers the other sentinels and builds a complete picture of the whole Sentinel system.
  7. Sentinels can then create command connections directly to one another.

At this stage, the Sentinel system has complete monitoring of the Redis cluster. Each sentinel holds the full server information of both the Sentinel cluster and the Redis cluster, stored in its own sentinelState. Armed with this information, we can move on to failover.
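The auto-discovery steps above can be sketched as a small simulation. This is not Redis's implementation: the hello channel is modelled here as a plain Python list of published messages, where a real deployment uses Redis pub/sub on the monitored servers, and the addresses are made up for illustration:

```python
# Minimal simulation of sentinel auto-discovery over the __sentinel__:hello channel.

class Sentinel:
    def __init__(self, run_id):
        self.run_id = run_id
        self.known_sentinels = {}  # run_id -> address, as kept in sentinelState

    def publish_hello(self, channel):
        # Every 2 seconds, each sentinel announces itself on the channel.
        channel.append({"run_id": self.run_id, "addr": f"10.0.0.{self.run_id}:26379"})

    def process_hello(self, channel):
        for msg in channel:
            if msg["run_id"] != self.run_id:  # ignore our own announcements
                self.known_sentinels[msg["run_id"]] = msg["addr"]

channel = []  # stands in for the __sentinel__:hello channel
sentinels = [Sentinel(i) for i in (1, 2, 3)]
for s in sentinels:
    s.publish_hello(channel)
for s in sentinels:
    s.process_hello(channel)

# Each sentinel now knows the other two without any static sentinel configuration.
print(sorted(sentinels[0].known_sentinels))  # [2, 3]
```

The key design point survives the simplification: no sentinel ever lists its peers in a config file; the monitored servers' channel is the rendezvous point.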

How is failover implemented?

Figure: Failover schematic (link to online image access at the end of article)

Any machine can go down or drop offline (network failure): master, slave, or sentinel. Redis Sentinel's failover mechanism targets the master node. Without Sentinel, ops staff have to be notified to check the machine's status and solve the problem by hand; because the master is responsible for writes, switching manually when it fails is cumbersome. (With only a master/slave architecture deployed, ops must maintain the master manually.)

Failover steps

Now let's explore the implementation of failover, divided into the following steps:

  1. Check whether the master is offline: SDOWN (subjective) and ODOWN (objective)
  2. Elect the lead sentinel
  3. The lead sentinel performs failover:

    Select the optimal slave; promote the optimal slave to be the new master; tell the other slaves to replicate data from the new master; tell the Redis application client the new master's address; when the old master comes back online, demote it to a slave.

1. Check whether the master is offline

Subjective offline

In the Sentinel configuration file we set up earlier, we configured down_after_period=5000; this value becomes the down_after_period attribute of the monitored master in the sentinelState we maintain. It means the sentinel sends a PING command once per second, and if the master fails to reply for more than 5 seconds (5000 ms), it is deemed subjectively offline. The sentinel then sets the flags of that master's sentinelRedisInstance to SRI_MASTER | SRI_S_DOWN.

Since each sentinel has its own configuration file, different down_after_period values can be configured, so each sentinel can reach a different subjective-offline judgment: one may think 5 seconds is enough, another 50 seconds.

This is the implementation of subjective offline, and it obviously causes a problem: everyone has their own threshold. You require 5 seconds, someone else requires 50 seconds, so when exactly is the master considered down? To resolve this, we have objective offline.
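The subjective-offline check boils down to comparing the time since the last valid PING reply against the sentinel's own down_after_period. A minimal sketch (the function name and argument layout are illustrative, not Redis internals):

```python
import time

# Sketch of the subjective-offline (SDOWN) check: each sentinel applies its
# own down_after_period to the time elapsed since the master's last reply.

def is_subjectively_down(last_pong_at, down_after_period_ms, now=None):
    now = now if now is not None else time.time()
    return (now - last_pong_at) * 1000 > down_after_period_ms

now = 1000.0
last_pong = now - 6.0  # the master last answered a PING 6 seconds ago

print(is_subjectively_down(last_pong, 5000, now))   # True  (sentinel with a 5s limit)
print(is_subjectively_down(last_pong, 50000, now))  # False (sentinel with a 50s limit)
```

The two calls make the disagreement concrete: the same silent master is SDOWN to one sentinel and perfectly fine to another, which is exactly why the objective-offline vote exists.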

Objective offline

Going back to the attributes in sentinelState: we set the quorum parameter, quorum=2, so once enough other sentinels also consider this master subjectively offline to satisfy the quorum, the current sentinel can add the objective-offline flag, giving SRI_MASTER | SRI_S_DOWN | SRI_O_DOWN.

The process of achieving objective downline labeling will be explained in the next section of Election lead Sentinel.
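Stripped of the network exchange, the objective-offline decision is just a vote count: the sentinel's own SDOWN opinion plus the agreeing peers, compared against quorum. A sketch under that simplification (the flag values and function name are illustrative; only the SRI_* names echo Redis's source):

```python
# Sketch of the objective-offline (ODOWN) decision: a sentinel that already
# considers the master SDOWN asks its peers and counts agreeing votes.

SRI_S_DOWN = 1 << 0  # subjectively down (this sentinel's own opinion)
SRI_O_DOWN = 1 << 1  # objectively down (enough sentinels agree)

def check_objectively_down(flags, peer_answers, quorum):
    # votes = this sentinel's own SDOWN opinion + peers that also report "down"
    votes = 1 + sum(1 for down_state in peer_answers if down_state)
    if flags & SRI_S_DOWN and votes >= quorum:
        flags |= SRI_O_DOWN
    return flags

flags = SRI_S_DOWN                       # we already marked the master SDOWN
flags = check_objectively_down(flags, [True, False], quorum=2)
print(bool(flags & SRI_O_DOWN))          # True: self + one peer meet quorum=2
```

Note the precondition: a sentinel only escalates to ODOWN for a master it has itself already flagged SDOWN; the peers' answers alone are not enough.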

Electing the lead sentinel

How is the objective-offline flag actually set?

  1. The current sentinel sends the command SENTINEL is-master-down-by-addr <ip> <port> <current_epoch> <runid> to the other sentinels.
  2. Each of the other sentinels returns a reply of the form:

    <down_state> <leader_runid> <leader_epoch>

  3. The replies are analyzed to determine whether the other sentinels also judge the master subjectively offline; if enough agree, per quorum, the master is set as objectively offline. The reply also records whether that sentinel sets the requester as its local lead sentinel.
  4. A sentinel that is set as local lead sentinel by more than half of all sentinels becomes the final lead sentinel.

Here’s an example:

Sentinel1: down_after_period=5000, quorum=1; Sentinel2: down_after_period=8000, quorum=1; Sentinel3: down_after_period=30000, quorum=2. This setup makes it almost impossible for Sentinel3 to become the leader, because by the time it notices anything wrong, one of the first two will already be the leader.

Note: if no sentinel wins the election, a new round is run.
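The election rule described above can be sketched in a few lines: within an epoch each sentinel grants its vote to the first peer that asks, and a candidate needs a strict majority of all sentinels to win. This is a deliberately simplified model (no networking, no epoch bumping), with illustrative names:

```python
# Sketch of lead-sentinel election: first-come-first-served votes, one vote
# per sentinel per epoch, strict majority required to win.

def run_election(request_order, num_sentinels):
    votes = {}  # voter -> candidate it voted for in this epoch
    tally = {}
    for voter, candidate in request_order:
        if voter not in votes:          # each sentinel votes at most once per epoch
            votes[voter] = candidate
            tally[candidate] = tally.get(candidate, 0) + 1
    for candidate, count in tally.items():
        if count > num_sentinels // 2:  # strict majority of all sentinels
            return candidate
    return None                         # no leader: a new epoch/election follows

# Sentinel s1's vote request reaches every peer first, so it wins 3 of 3 votes;
# the later requests from s2 and s3 are ignored because those voters already voted.
order = [("s1", "s1"), ("s2", "s1"), ("s3", "s1"),
         ("s2", "s2"), ("s3", "s3")]
print(run_election(order, num_sentinels=3))  # s1
```

This also shows why an even number of sentinels buys nothing: with 2 sentinels a majority still requires 2 votes, exactly as with 3 machines you need 2, so the extra machine in an odd-sized cluster is what adds failure tolerance.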

Now that the lead sentinel has been elected, we are ready to fail over.

Perform failover

Three main steps

  1. Select the optimal slave.
  2. Make the other slaves replicate data from the new master.
  3. The old master becomes a slave.

Selecting the best slave is essentially choosing according to a set of rules, including online status, replication offset, and so on; you can think of the result as the outcome of a series of filters and sorts.
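That "filters and sorts" description can be made concrete. The sketch below filters out disconnected or stale slaves, then sorts by priority (lower is better), replication offset (higher means more up-to-date data), and finally run id as a tie-breaker. The field names are illustrative, not Redis's internal struct fields:

```python
# Sketch of "pick the best slave": filter unusable candidates, then sort by
# (priority ascending, replication offset descending, run_id ascending).

def select_best_slave(slaves):
    candidates = [s for s in slaves if s["connected"] and not s["stale"]]
    if not candidates:
        return None
    candidates.sort(key=lambda s: (s["priority"], -s["repl_offset"], s["run_id"]))
    return candidates[0]

slaves = [
    {"run_id": "a", "connected": True,  "stale": False, "priority": 100, "repl_offset": 900},
    {"run_id": "b", "connected": True,  "stale": False, "priority": 100, "repl_offset": 950},
    {"run_id": "c", "connected": False, "stale": False, "priority": 1,   "repl_offset": 999},
]
print(select_best_slave(slaves)["run_id"])  # "b": best offset among usable slaves
```

Note that slave "c" would have won on every sort key, but it never reaches the sort: the filters run first, so an unreachable slave can never be promoted no matter how fresh its data looks.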

Once we've selected the best slave, we promote it to master. The process is as follows:

Figure: Performing failover (link to online image access at end of article)

  1. The lead sentinel sends the slaveof no one command to the optimal slave.
  2. The lead sentinel then sends INFO to that slave every second (instead of the normal 10 seconds) until the slave reports role:master in its reply. The promotion has succeeded.
  3. The lead sentinel sends slaveof <new_master_ip> <new_master_port> to the remaining slaves, so that they replicate data from the new master instead.
  4. When the old master comes back online, it is also sent slaveof <new_master_ip> <new_master_port>, making it a slave.
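The four steps above can be sketched as a small simulation of the command sequence the lead sentinel drives. The RedisNode class is a stand-in for real servers; only the slaveof semantics (with None modelling slaveof no one) mirror the actual commands:

```python
# Simulation of the failover sequence: promote the chosen slave, repoint the
# other slaves, then demote the old master when it returns.

class RedisNode:
    def __init__(self, name):
        self.name = name
        self.role = "slave"
        self.master = None

    def slaveof(self, target):
        if target is None:                 # SLAVEOF NO ONE -> promote to master
            self.role, self.master = "master", None
        else:                              # SLAVEOF <ip> <port> -> follow target
            self.role, self.master = "slave", target.name

def failover(best_slave, other_slaves, old_master):
    best_slave.slaveof(None)               # step 1: promote the optimal slave
    assert best_slave.role == "master"     # step 2: confirmed via INFO in reality
    for s in other_slaves:                 # step 3: repoint the remaining slaves
        s.slaveof(best_slave)
    old_master.slaveof(best_slave)         # step 4: demote the old master on return

m, s1, s2 = RedisNode("m"), RedisNode("s1"), RedisNode("s2")
failover(s1, [s2], m)
print(s1.role, s2.master, m.master)        # master s1 s1
```

The important property the simulation preserves is that the old master is not discarded: it rejoins the topology as a slave of the new master, restoring the original level of redundancy.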

What problems can the Sentinel strategy introduce?

Every adoption of a technology inevitably raises new questions, and every solution is the result of trade-offs. If you don't have a large number of users and don't demand much of your cache, you don't even need a master-slave architecture. So what problems does introducing the Sentinel strategy bring?

  • The system architecture becomes more complex, and more machine resources (money) are required.
  • Redis performance degrades to some extent.
  • Including but not limited to the above two points; readers are welcome to add more.

References

  • Redis Design and Implementation – Huang Jianhong

Schematic diagram of failover