This article introduces the Sentinel mechanism behind Redis high availability and explains its principles and functions in detail. I hope readers come away with a deeper understanding of how Sentinel works.
If you have used Redis to build applications, it helps to already be familiar with the Redis master-slave architecture.
Think about a problem
How to achieve high availability in Redis master-slave architecture?
First, what exactly is high availability?
High availability: if your system is available 99.99% of the time throughout the year, it is considered highly available.
Since we have defined high availability, there is naturally an opposite state: unavailable.
Unavailable means that, for one reason or another, your system is down and cannot provide service.
Common reasons:
- JVM OOM
- The server is down.
- CPU 100%
- .
In the Redis master-slave architecture, Redis becomes unavailable when the master fails (the Redis process crashes, or the machine running it goes down), because client data can no longer be written to Redis.
In this architecture, the master is responsible for writes and synchronizes the written data to the slaves; the slaves can only serve reads.
When the master is gone, no data can be written, because slaves can only replicate data from the master. At that point our system is effectively unavailable.
If a slave fails, the availability of the master-slave architecture is not affected: other slaves keep handling read requests and the master can still accept writes, so the failure of a single slave does not hurt availability.
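To make the read/write split concrete, here is a minimal sketch using the redis-py client. The addresses are assumptions for illustration (a master on 127.0.0.1:6379 and a replica on 127.0.0.1:6380), not something prescribed by this article:

```python
import redis

# Assumed topology: master on 6379, replica (slave) on 6380.
master = redis.Redis(host="127.0.0.1", port=6379)
replica = redis.Redis(host="127.0.0.1", port=6380)

# Writes go to the master, which asynchronously replicates them to the slave.
master.set("product:1001", "some product info")

# Reads can be served by the replica once replication has caught up.
print(replica.get("product:1001"))

# Writing to the replica is rejected because replicas are read-only by default.
try:
    replica.set("product:1001", "new value")
except redis.exceptions.ReadOnlyError:
    print("replica is read-only")
```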
But when the master goes down, the backbone and soul of the whole master-slave architecture is gone, and nothing useful gets done.
Suppose our system uses Redis as its cache layer and the master node fails. A request for, say, product information finds nothing in the cache, queries the database, and then tries to write the product information back into Redis, but the master is down and the write fails. From then on, every request for product information falls through to MySQL. With the master down for a while, MySQL has to absorb a flood of requests; if MySQL buckles, the whole system crashes with it.
When Redis is used as a cache, it not only speeds up queries (everything is in memory) but also absorbs high concurrency, greatly easing the pressure on MySQL.
The awkward chain is: Redis master node down -> Redis master-slave cluster unusable -> system unavailable.
So how do we keep the Redis master-slave setup highly available, so that a master failure does not take down all of Redis and, with it, the whole system? The answer is Sentinel.
Let's look at how Redis Sentinel implements master-slave high availability under the hood.
Basics of the Sentinel architecture
What is Sentinel?
Sentinel means a guard or lookout, and that is exactly the role it plays.
Sentinel is a very important component in the Redis cluster architecture. Its main functions are as follows:
- Cluster monitoring: monitor whether the Redis master and slave processes are working properly
- Message notification: if a Redis instance fails, Sentinel sends an alert message to the administrator
- Failover: if the master node fails, a slave is automatically promoted to master
- Configuration center: if a failover occurs, notify clients of the new master address
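For the configuration-center role in particular, a client usually asks the sentinels for the current master address instead of hard-coding it. Below is a minimal sketch with redis-py; the sentinel addresses and the master group name mymaster are assumptions for illustration:

```python
from redis.sentinel import Sentinel

# Assumed sentinel addresses and master group name.
sentinel = Sentinel(
    [("127.0.0.1", 26379), ("127.0.0.1", 26380), ("127.0.0.1", 26381)],
    socket_timeout=0.5,
)

# Ask the sentinels who the current master and slaves are.
print(sentinel.discover_master("mymaster"))
print(sentinel.discover_slaves("mymaster"))

# Connections that keep pointing at the current master / a slave,
# even after a failover has promoted a different node.
master = sentinel.master_for("mymaster", socket_timeout=0.5)
slave = sentinel.slave_for("mymaster", socket_timeout=0.5)

master.set("foo", "bar")
print(slave.get("foo"))
```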
The sentinels themselves are distributed and run as a cluster of sentinels that work together:
- When judging a master to be down, a majority of the sentinels need to agree, which is a distributed election problem
- Even if some sentinel nodes fail, the sentinel cluster keeps working, because a failover system that is itself a single point of failure would be a poor foundation for a high availability mechanism
Components that support high availability must also achieve high availability themselves
The current version is Sentinel 2, which rewrites much of Sentinel 1's code, mainly to make the failover mechanism and algorithms more robust and simpler.
Core knowledge about Sentinel
- A Sentinel deployment needs at least 3 instances to be robust
- Sentinel + Redis master-slave does not guarantee zero data loss; it only guarantees the high availability of the Redis cluster
- For a deployment architecture as complex as Sentinel + Redis master-slave, run thorough tests and drills in both test and production environments
Why can't a Redis sentinel cluster work with only 2 nodes?
A sentinel cluster must be deployed with more than two nodes.
Suppose the sentinel cluster deploys only two sentinel instances, with quorum = 1.
quorum is the number of sentinels that must agree the master is down before a failover can be triggered.
```
+----+         +----+
| M1 |---------| R1 |
| S1 |         | S2 |
+----+         +----+
```
Configuration: quorum = 1
If the master fails, the switchover can be triggered as long as one of S1 and S2 thinks the master is down, and then one of S1 and S2 is elected to perform the failover.
At the same time, the failover requires a majority of sentinels to be running: the majority of 2 sentinels is 2 (majority of 2 = 2, of 3 = 2, of 4 = 3, of 5 = 3). With both sentinels running, the failover is allowed.
But if the whole machine running M1 and S1 goes down, only one sentinel is left. There is no majority to authorize the failover, so even though R1 is still up on the other machine, the failover will not happen.
In summary, with only two sentinel nodes a majority cannot always be reached, so failover cannot be guaranteed.
Classic 3-node sentinel cluster
```
       +----+
       | M1 |
       | S1 |
       +----+
          |
+----+    |    +----+
| R2 |----+----| R3 |
| S2 |         | S3 |
+----+         +----+
```
Configuration: quorum = 2, majority = 2
If M1 goes down, two of the three sentinels remain. S2 and S3 can agree that the master is down and elect one of themselves to perform the failover.
At the same time, the majority of three sentinels is 2, and two sentinels are still running, so the failover is allowed.
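As a rough sketch, each of the three sentinels in this layout could run with a sentinel.conf along these lines; the master name mymaster, the address and the timeout values are illustrative, not prescribed by the article:

```
# Monitor the master named "mymaster" at 127.0.0.1:6379; the trailing 2 is the
# quorum: two sentinels must agree before the master is marked odown.
sentinel monitor mymaster 127.0.0.1 6379 2

# Mark the master subjectively down after 30s without a valid reply.
sentinel down-after-milliseconds mymaster 30000

# How long a failover attempt may take before another sentinel retries it.
sentinel failover-timeout mymaster 180000

# How many slaves may resync with the new master at the same time.
sentinel parallel-syncs mymaster 1
```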
Data loss during the Redis Sentinel master/slave switchover
The master/slave switchover performed by Sentinel may lose data, for two reasons.
Data loss caused by asynchronous replication
Because master -> slave replication is asynchronous, some data may not yet have been copied to the slave when the master goes down, and that part of the data is lost.
Data loss caused by cluster split brain
Split brain means that the machine running the master suddenly drops off the network and loses contact with the slaves, while the master process itself is still running.
At this point the sentinels may decide the master is down and start an election to promote one of the slaves to master.
There are now two masters in the cluster, which is what split brain means.
Even though a slave has been promoted to master, clients may not have switched to the new master yet, so data they keep writing to the old master may be lost.
When the old master recovers, it is reattached to the new master as a slave, its own data is wiped, and it re-copies the data from the new master.
Reducing data loss caused by asynchronous replication and split brain
```
min-slaves-to-write 1
min-slaves-max-lag 10
```
This requires at least one slave whose replication lag is no more than 10 seconds.
If there are at least min-slaves-to-write slaves connected, and the replication lag of each of them is less than min-slaves-max-lag seconds, the master will accept write requests from clients.
If the replication lag of every slave exceeds 10 seconds, the master stops accepting write requests.
The above two configurations can reduce data loss caused by asynchronous replication and split brain
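From the client's point of view, when these constraints are not satisfied the master answers writes with an error instead of silently accepting data it might lose. A sketch with redis-py (the address is illustrative):

```python
import redis

master = redis.Redis(host="127.0.0.1", port=6379)

try:
    master.set("order:42", "pending")
except redis.exceptions.ResponseError as e:
    # With min-slaves-to-write / min-slaves-max-lag unsatisfied, the master
    # rejects the write with a NOREPLICAS-style error; the application can
    # retry later or fail fast instead of losing the write.
    print("write rejected by master:", e)
```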
Reduce data loss in asynchronous replication
With min-slaves-max-lag, once every slave has fallen too far behind in replicating and acknowledging data, the master starts rejecting writes, on the assumption that too much data would be lost if it went down at that point. This keeps the data lost in a master failure (data not yet synchronized to any slave) within a controllable range.
In essence, it forces the slaves to keep up with the master.
Reduce data loss in split brain
If a master splits from the rest of the cluster and loses contact with its slaves, these two configurations ensure that once it cannot push data to at least the specified number of slaves, and no slave has acknowledged it for more than 10 seconds, it rejects client write requests.
In this way, the old master stops accepting new data from clients, which avoids losing it (the old master cannot reach any slave, so it refuses client writes).
The configuration above guarantees that if the master loses its connection to every slave and sees no slave ack for 10 seconds, it rejects new write requests.
Therefore, in the split brain scenario, at most about 10 seconds of writes are lost.
An in-depth look at the core principles underlying Redis Sentinel
sdown and odown conversion mechanism
There are two failure states: sdown and odown.
- sdown is a subjective down: a single sentinel, on its own, thinks the master is down
- odown is an objective down: once the number of sentinels that think the master is down reaches the quorum, the master is objectively down
A sentinel marks a master as sdown when its pings to the master get no valid reply for longer than the number of milliseconds specified by down-after-milliseconds.
The condition for escalating from sdown to odown is simple: if, within the specified time, a sentinel learns that a quorum of sentinels also consider the master sdown, the state becomes odown and the master is objectively considered down.
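You can watch these flags by querying a sentinel directly. A small sketch with redis-py, assuming a sentinel listening on 127.0.0.1:26379:

```python
import redis

# Connect to one sentinel instance (assumed address).
sentinel_node = redis.Redis(host="127.0.0.1", port=26379)

# SENTINEL MASTERS reports the state of every monitored master; the flags
# field contains s_down and/or o_down while the master is considered down.
for name, state in sentinel_node.sentinel_masters().items():
    print(name, state.get("flags"))
```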
Automatic discovery mechanism of sentinel cluster
Sentinels discover each other through Redis's pub/sub system. Every sentinel publishes a message to the __sentinel__:hello channel, and all the other sentinels consume the message and learn of that sentinel's existence.
Every two seconds, each sentinel publishes a message to the __sentinel__:hello channel of every master and slave it monitors, containing its own host, IP, run id and its monitoring configuration for that master.
Each sentinel also listens to the __sentinel__:hello channel of every master and slave it monitors, and thereby learns of the other sentinels that are monitoring the same master and slaves.
Each sentinel also exchanges its monitoring configuration for the master with the other sentinels, so that their monitoring configurations stay in sync.
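You can observe this discovery traffic yourself by subscribing to the hello channel on the monitored master. A sketch with redis-py (the master address is an assumption):

```python
import redis

# Subscribe on the monitored master itself; every sentinel publishes to this
# channel every two seconds.
master = redis.Redis(host="127.0.0.1", port=6379)
pubsub = master.pubsub()
pubsub.subscribe("__sentinel__:hello")

for message in pubsub.listen():
    if message["type"] == "message":
        # The payload carries the announcing sentinel's ip, port and run id,
        # plus its current view of the master's configuration.
        print(message["data"])
```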
Automatic correction of the slave configuration
The sentinels automatically correct some of the slaves' configurations. For example, if a slave is a potential master candidate, the sentinels make sure it is replicating the current master's data; if slaves are connected to the wrong master, for example after a failover, the sentinels make sure they reconnect to the correct master.
Slave -> Master election algorithm
If a master is considered odown and a majority of the sentinels authorize a master/slave switchover, one sentinel performs the switchover, and it must first elect a slave to promote.
The election takes the following information about each slave into account:
- duration of disconnection from the master
- slave priority
- replica offset
- run id
If a slave has been disconnected from the master for more than 10 times down-after-milliseconds, plus however long the master has already been in the sdown state, it is considered unfit to be elected master:
(down-after-milliseconds * 10) + milliseconds_since_master_is_in_SDOWN_state
The remaining slaves are then sorted:
- First by slave priority: the lower the slave-priority value, the higher the election priority
- If the priorities are equal, compare the replica offset: the more data a slave has replicated (the larger its offset), the higher its priority
- If those are also equal, pick the slave with the smaller run id
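The selection logic above can be sketched in a few lines of Python. This is an illustrative model, not Redis's internal code; the SlaveInfo fields and the pick_new_master helper are made up for the example:

```python
from dataclasses import dataclass

@dataclass
class SlaveInfo:
    run_id: str
    priority: int          # slave-priority: lower wins
    repl_offset: int       # bytes replicated: higher wins
    disconnected_ms: int   # time disconnected from the master

def pick_new_master(slaves, down_after_ms, master_sdown_ms):
    # Rule 1: drop slaves that have been disconnected for too long.
    max_disconnect = down_after_ms * 10 + master_sdown_ms
    candidates = [s for s in slaves if s.disconnected_ms <= max_disconnect]
    if not candidates:
        return None
    # Rule 2: sort by priority (asc), then replication offset (desc),
    # then run id (asc); the first candidate is promoted.
    candidates.sort(key=lambda s: (s.priority, -s.repl_offset, s.run_id))
    return candidates[0]

# Example with made-up values: "ccc" is filtered out (disconnected too long),
# and "bbb" wins because it has replicated the most data.
slaves = [
    SlaveInfo("aaa", priority=100, repl_offset=5000, disconnected_ms=1000),
    SlaveInfo("bbb", priority=100, repl_offset=8000, disconnected_ms=1000),
    SlaveInfo("ccc", priority=50,  repl_offset=100,  disconnected_ms=999_999),
]
print(pick_new_master(slaves, down_after_ms=30_000, master_sdown_ms=60_000))
```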
The quorum and majority
Before a sentinel can perform a master/slave switchover, a quorum of sentinels must first consider the master odown; a sentinel is then elected to carry out the switchover, and that sentinel must be authorized by a majority of the sentinels before it can officially execute it.
- If quorum < majority, the majority decides: for example, with 5 sentinels the majority is 3, so even if quorum is set to 2, 3 sentinels must authorize the switchover
- If quorum >= majority, at least quorum sentinels must authorize it: for example, with 5 sentinels and quorum = 5, all 5 sentinels must agree before the switchover can be performed
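The authorization rule described above boils down to a small piece of arithmetic. The helper below is only an illustration of that rule, not Sentinel's actual code:

```python
def votes_needed_for_failover(num_sentinels: int, quorum: int) -> int:
    """Votes the elected sentinel must gather before executing the switchover."""
    majority = num_sentinels // 2 + 1
    # The leader always needs at least the majority, and at least the quorum
    # when the quorum is configured larger than the majority.
    return max(majority, quorum)

print(votes_needed_for_failover(5, 2))  # 3: quorum < majority, majority decides
print(votes_needed_for_failover(5, 5))  # 5: quorum >= majority, all must agree
```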
Configuration epoch
A sentinel monitors a set of Redis master + slaves and keeps the corresponding monitoring configuration.
The sentinel that performs the switchover obtains a configuration epoch for the new master (the slave being promoted). This is a version number, and every switchover must use a unique version number.
If the first elected sentinel fails to complete the switchover, the other sentinels wait for the failover timeout and then take over the switchover; they then obtain a new configuration epoch as the new version number.
Configuration spread
When a sentinel completes a switchover, it updates the new master configuration locally and propagates it to the other sentinels through the pub/sub messaging mechanism.
This is where the version number mentioned above matters: messages are published and listened to through a channel, so when a sentinel completes a switchover, the new master configuration it publishes carries the new version number.
The other sentinels update their own copy of the master configuration according to which version number is larger.
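Clients and tools can observe this propagation as well: every sentinel publishes a +switch-master event on its own pub/sub interface when the master changes. A sketch with redis-py (the sentinel address is an assumption):

```python
import redis

# Connect to any sentinel (assumed address) and listen for the event it
# publishes when a failover completes.
sentinel_node = redis.Redis(host="127.0.0.1", port=26379)
pubsub = sentinel_node.pubsub()
pubsub.subscribe("+switch-master")

for message in pubsub.listen():
    if message["type"] == "message":
        # Payload: "<master-name> <old-ip> <old-port> <new-ip> <new-port>"
        print("master moved:", message["data"])
```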
References
To be honest, much of the content about the Sentinel principle in this article is based on the explanation of the Redis Sentinel principle in teacher China Huxwood's course on hundred-million-level traffic. I note this here, and I hope that with further study I can master both the principles and the practice of Redis and live up to the teacher's careful explanations. I also hope to keep making progress together with my readers. Thank you!