Redis high availability summary: Redis master-slave replication, Sentinel cluster, split brain…

This article is also collected in the GitHub repository JavaCommunity, which hosts interview notes and a series of source-code analysis articles: github.com/Ccww-lx/Jav…

In real projects, high availability of services matters a great deal. Redis is typically used as a cache to relieve pressure on the database, speed up data access, and improve site performance. If Redis runs in stand-alone mode, however, a single server outage takes the whole service down, which can degrade the applications that depend on it or even make them unavailable.

So what high-availability solutions does Redis provide?

  • Redis primary/secondary replication
  • Redis persistence
  • Sentinel cluster

Redis combines a topology of one master node and multiple slave nodes with its persistence mechanism, keeping a copy of the data on several instances to add redundancy. On top of that, the Sentinel mechanism handles master/slave switchover: when the master fails, Sentinel detects the failure automatically and promotes a slave to master. Together these give Redis high availability.

Redis primary/secondary replication

Redis master-slave replication stores a copy of the data on multiple slave instances to add redundancy, so Redis remains usable when some instances go down.

With multiple copies, however, the data can become inconsistent. So how does Redis keep the copies consistent?

To keep the copies consistent, Redis separates reads and writes between the master and the slaves:

  • Read operations: both the master and the slaves can serve them;
  • Write operations: performed only on the master, which then synchronizes them to the slaves.

The advantage of read/write separation is that it avoids the locking and other costly coordination that would be needed to keep the copies consistent if both the master and the slaves accepted writes.
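The routing rule behind read/write separation can be sketched in a few lines of Python. This is only a toy model: the class name ReadWriteRouter, the node addresses, and the tiny set of write commands are assumptions made for illustration, not part of Redis or of any client library.

```python
import itertools


class ReadWriteRouter:
    """Toy router: writes go to the master, reads rotate over all nodes."""

    # Hypothetical, deliberately tiny set of write commands.
    WRITE_COMMANDS = {"SET", "DEL", "INCR"}

    def __init__(self, master, slaves):
        self.master = master
        # Reads may be served by the master or by any slave.
        self._readers = itertools.cycle([master] + list(slaves))

    def node_for(self, command):
        if command.upper() in self.WRITE_COMMANDS:
            return self.master
        return next(self._readers)


router = ReadWriteRouter("master:6379", ["slave1:6380", "slave2:6381"])
print(router.node_for("SET"))                       # always the master
print([router.node_for("GET") for _ in range(3)])   # rotates over all nodes
```

Because only one node ever accepts writes, the slaves never have to agree with each other on write order; they simply apply what the master sends them.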

With read/write separation, writes are performed only on the master and then synchronized to the slaves. How do the master and the slaves synchronize data?

There are two ways to synchronize data between the master and the slaves:

  • Full synchronization: performed when a slave connects to the master for the first time.
  • Incremental synchronization: performed after full synchronization has completed, for example to catch up on the writes a slave missed while the network between master and slave was disconnected.

Full synchronization

The first full synchronization between the master and a slave has three stages:

  • When a slave starts, it sends the psync command to the master to request synchronization. The command carries the master’s runID and the replication offset; on the very first synchronization the slave knows neither, so it sends psync ? -1.

  • When the master receives the psync command, it generates an RDB file and sends it to the slave. While the file is being generated and transmitted, the master records all subsequent write operations in the replication buffer. After receiving the RDB file, the slave empties its current database and loads the file.

  • When the master has finished sending the RDB file, it sends the slave the contents of the replication buffer, that is, the writes performed in the meantime, and the slave replays them. At this point the master and the slave are in sync.
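The three stages above can be illustrated with a small in-memory simulation. It is a sketch under simplifying assumptions: a plain dict copy stands in for the RDB file, a list stands in for the replication buffer, and the class and method names (Master, Slave, start_full_sync, finish_full_sync) are invented for this example.

```python
class Master:
    def __init__(self):
        self.data = {}
        self.replication_buffer = []  # writes that arrive during a sync
        self.syncing = False

    def write(self, key, value):
        self.data[key] = value
        if self.syncing:
            self.replication_buffer.append((key, value))

    def start_full_sync(self):
        # Stage 2: take a point-in-time snapshot (stands in for the RDB file)
        # and start recording later writes.
        self.syncing = True
        self.replication_buffer = []
        return dict(self.data)

    def finish_full_sync(self):
        # Stage 3: hand over the writes recorded while the snapshot was in flight.
        self.syncing = False
        pending, self.replication_buffer = self.replication_buffer, []
        return pending


class Slave:
    def __init__(self):
        self.data = {"stale": "value"}

    def load_snapshot(self, snapshot):
        # Empty the current database, then load the snapshot.
        self.data = dict(snapshot)

    def replay(self, writes):
        for key, value in writes:
            self.data[key] = value


master, slave = Master(), Slave()
master.write("a", 1)
snapshot = master.start_full_sync()      # stage 1: the slave asked for a sync
master.write("b", 2)                     # arrives while the snapshot is in flight
slave.load_snapshot(snapshot)
slave.replay(master.finish_full_sync())
print(slave.data == master.data)
```

The write to "b" is not in the snapshot, so the slave only catches up because the master replays its buffer after the snapshot has been loaded.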

In addition, generating and transferring RDB files puts pressure on the master. To share that load, slaves can be cascaded in a “master-slave-slave” topology: some slaves replicate from another slave instead of from the master, so the work of generating and transferring RDB files moves from the master to that intermediate slave.

Incremental synchronization

Incremental synchronization is based on the circular buffer repl_backlog_buffer.

In the ring buffer, the master records the position it has written up to, master_repl_offset, and the slave records the position it has read up to, slave_repl_offset. The master synchronizes the data that lies between slave_repl_offset and master_repl_offset.

When the network between the master and a slave is disconnected and then restored, they resume synchronizing with incremental replication. The master writes the write commands received during the disconnection to the replication buffer and also to repl_backlog_buffer. After reconnecting, the master sends the slave the data between slave_repl_offset and master_repl_offset.

Because repl_backlog_buffer is a circular buffer, what happens when the master keeps writing after the buffer is full?

It overwrites the oldest operations. If a slave reads slowly, operations it has not yet read may be overwritten by new writes from the master, leaving the master and the slave inconsistent; the slave can then no longer catch up incrementally and needs a full synchronization again. Therefore, pay attention to the repl_backlog_size parameter (repl-backlog-size in redis.conf) and size the buffer generously enough that unread data is not overwritten.
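Whether a slave can still catch up incrementally comes down to simple offset arithmetic. The sketch below assumes offsets and sizes are counted in bytes; the function names and the safety factor of 2 are illustrative assumptions, not Redis settings.

```python
def can_sync_incrementally(master_repl_offset, slave_repl_offset, backlog_size):
    """Return True if the data the slave has not read is still in the ring buffer."""
    unread = master_repl_offset - slave_repl_offset
    # Anything older than `backlog_size` bytes has already been overwritten.
    return 0 <= unread <= backlog_size


def suggested_backlog_size(write_bytes_per_sec, max_disconnect_secs, safety_factor=2):
    """Rule of thumb: the traffic written during the longest expected outage,
    multiplied by a safety factor (the factor is an assumption of this sketch)."""
    return write_bytes_per_sec * max_disconnect_secs * safety_factor


ONE_MB = 1024 * 1024
print(can_sync_incrementally(5_000_000, 4_500_000, ONE_MB))  # slave is 500,000 bytes behind
print(can_sync_incrementally(5_000_000, 3_000_000, ONE_MB))  # slave is 2,000,000 bytes behind
print(suggested_backlog_size(100_000, 30))
```

In the second call the slave is further behind than the buffer is large, which is exactly the overwrite case described above.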

Besides data inconsistency, master-slave replication faces another problem: the master itself may go down. Redis has an automatic master/slave switchover mechanism for this case. How is it implemented?

Redis Sentinel mechanism

When the master goes down, Redis can no longer perform write operations or data synchronization. To avoid this, a new master can be elected from the slaves after the old master fails, and clients are notified of the change. Redis provides the Sentinel mechanism for this; a sentinel is a Redis process running in a special mode.

So how does Sentinel implement the master/slave switchover?

The Sentinel mechanism is the key to automatic master/slave switchover. It works in three stages:

  • Monitoring: the sentinel process periodically pings the master and all slaves to check whether they are still online.
  • Master selection: after the master goes down, the sentinels pick one slave instance as the new master according to a set of filtering and scoring rules.
  • Notification: the sentinel sends the new master’s information to the other slaves so that they connect to it and replicate from it. It also broadcasts the new master’s information to clients so that they send their requests to the new master.

During monitoring, how does a sentinel decide that the master is offline?

A sentinel’s offline judgment comes in two forms:

  • Subjective offline: the sentinel process uses the PING command to check the state of its network connection to the master and the slaves. If a single sentinel finds that the response to its PING has timed out, it first marks that instance as “subjectively offline”.
  • Objective offline: in a Sentinel cluster, following the majority principle, the master is considered “objectively offline” once most sentinel instances have judged it “subjectively offline”.

Why are there two states, “subjective offline” and “objective offline”?

A single sentinel can easily misjudge, and a switchover triggered by a misjudgment brings a series of extra overhead. To reduce misjudgments and avoid that overhead, a Sentinel cluster uses multiple sentinel instances to decide together, so that one sentinel with a poor network connection cannot declare the master offline on its own.

Following the majority principle, with N sentinel instances it is best to require N/2 + 1 of them to judge the master “subjectively offline” before it is declared “objectively offline” (the threshold is configurable).
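The N/2 + 1 rule is easy to get wrong for even and odd N, so here is a minimal sketch of the decision. The function names are made up for this example, and the optional threshold argument models the customizable threshold mentioned above.

```python
def majority(total_sentinels):
    """N/2 + 1, using integer division."""
    return total_sentinels // 2 + 1


def is_objectively_offline(subjective_votes, total_sentinels, threshold=None):
    """subjective_votes: how many sentinels marked the master subjectively offline.
    threshold: optional custom threshold; defaults to the majority of sentinels."""
    required = majority(total_sentinels) if threshold is None else threshold
    return subjective_votes >= required


print(majority(3), majority(4), majority(5))
print(is_objectively_offline(1, 3))
print(is_objectively_offline(2, 3))
```

With three sentinels, a single sentinel on a bad network link cannot trigger a switchover, but any two of them can.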

So how do sentinels communicate with each other?

The sentinel instances in a Sentinel cluster discover each other through the pub/sub mechanism provided by Redis.

Sentinels publish and subscribe to messages on the master, which has a channel named “__sentinel__:hello”. Sentinels discover each other and communicate through this channel; only applications subscribed to the same channel can exchange messages.

Sentinel 1 publishes its connection information (IP and port) to the “__sentinel__:hello” channel, to which sentinels 2 and 3 subscribe.

Sentinels 2 and 3 can then read sentinel 1’s connection information directly from this channel. In this way the sentinels form a cluster in which each one can communicate with the others.
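The discovery step can be mimicked with an in-memory publish/subscribe model. Everything here is a toy: ToyMaster and ToySentinel are invented names, no network or Redis server is involved, and the three-field message layout is a simplification (the real hello message carries more fields).

```python
from collections import defaultdict


class ToyMaster:
    """Stands in for the master's pub/sub facility."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in list(self.subscribers[channel]):
            callback(message)


class ToySentinel:
    CHANNEL = "__sentinel__:hello"

    def __init__(self, name, ip, port, master):
        self.name, self.ip, self.port = name, ip, port
        self.master = master
        self.known_peers = {}
        master.subscribe(self.CHANNEL, self.on_hello)

    def say_hello(self):
        # Simplified, hypothetical message layout: name,ip,port
        self.master.publish(self.CHANNEL, f"{self.name},{self.ip},{self.port}")

    def on_hello(self, message):
        name, ip, port = message.split(",")
        if name != self.name:  # ignore our own announcement
            self.known_peers[name] = (ip, int(port))


toy_master = ToyMaster()
sentinels = [ToySentinel(f"sentinel{i}", f"10.0.0.{i}", 26379, toy_master) for i in (1, 2, 3)]
for sentinel in sentinels:
    sentinel.say_hello()
print(sentinels[1].known_peers)
```

No sentinel is configured with the address of another one; each learns its peers only from what arrives on the shared channel.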

Once the sentinels can communicate, they can determine together whether the master is objectively offline.

How is a new master elected once the old one has been determined to be offline?

The new master is chosen by first filtering the slaves according to certain conditions and then scoring the qualified ones according to certain rules. The slave with the highest score becomes the new master.

The filtering conditions usually include:

  • The slave’s current online state;
  • Its previous network connection status, judged with down-after-milliseconds * num (num being the number of disconnections): when the number of disconnections exceeds the threshold, the slave is not suitable as the new master.

The scoring rules include:

  • Slave priority: the slave-priority configuration item sets different priorities for different slaves, and the slave with the highest priority gets the highest score. In Redis a lower slave-priority value means a higher priority, and a value of 0 means the slave is never promoted.
  • Replication progress: the slave whose data is closest to the old master’s scores highest. This is judged from the offsets recorded for repl_backlog_buffer: the slave with the smallest difference between the master’s master_repl_offset and its own slave_repl_offset scores highest.
  • Slave ID: the slave with the smaller ID scores higher.

The rules are applied in rounds: as soon as one slave scores highest in a given round, the selection ends and the sentinel initiates the master/slave switchover.
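The filter-then-score procedure maps naturally onto a filter followed by a tuple comparison. The sketch below is an approximation: the SlaveInfo fields and the max_disconnects threshold are hypothetical, and it relies on the Redis convention that a lower slave-priority value means a higher priority, with 0 meaning "never promote".

```python
from dataclasses import dataclass


@dataclass
class SlaveInfo:
    run_id: str
    online: bool
    disconnect_count: int
    priority: int       # slave-priority: lower value = higher priority, 0 = never promote
    repl_offset: int    # slave_repl_offset


def select_new_master(slaves, max_disconnects=10):
    # Filtering: drop slaves that are offline, too flaky, or excluded by priority 0.
    candidates = [
        s for s in slaves
        if s.online and s.disconnect_count <= max_disconnects and s.priority != 0
    ]
    if not candidates:
        return None
    # Scoring in rounds: priority first, then the largest offset (closest to the
    # old master), then the smallest run ID as the final tie-breaker.
    return min(candidates, key=lambda s: (s.priority, -s.repl_offset, s.run_id))


slaves = [
    SlaveInfo("c3", True, 0, 100, 900),
    SlaveInfo("a1", True, 0, 100, 1000),
    SlaveInfo("b2", True, 0, 100, 1000),
    SlaveInfo("d4", False, 0, 1, 1000),   # offline, filtered out
    SlaveInfo("e5", True, 50, 1, 1000),   # too many disconnections, filtered out
]
print(select_new_master(slaves).run_id)
```

Comparing tuples gives the "round by round" behaviour for free: a later element is only consulted when all earlier elements are equal.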

Leader sentinel

Once a new master has been chosen, the sentinels cannot each initiate the switchover on their own; a leader sentinel has to be elected to carry it out. How is the leader sentinel elected?

The leader sentinel is also elected by majority vote.

  • When any sentinel determines that the master is “subjectively offline”, it sends the is-master-down-by-addr command to the other sentinels to announce that it wants to become the leader.
  • The other sentinels respond according to their own connection to the master, Y for yes and N for no. If several sentinels request leadership, each sentinel can vote yes for only one of them and must vote no for the rest.

To become the leader, a sentinel must meet two conditions:

  • First, it must receive more than half of the votes;
  • Second, the number of votes must also be greater than or equal to the quorum value in the sentinel configuration file.
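The voting rule and the two conditions can be combined into a short sketch. It assumes that each sentinel grants its single vote to the first candidate that asks (a simplification) and that candidates vote for themselves; the function names and the (voter, candidate) request format are invented for this example.

```python
from collections import Counter


def tally_votes(vote_requests):
    """vote_requests: (voter, candidate) pairs in arrival order.
    Each voter says Y to the first candidate that asks and N to all later ones."""
    granted = {}
    for voter, candidate in vote_requests:
        granted.setdefault(voter, candidate)
    return Counter(granted.values())


def elect_leader(vote_requests, total_sentinels, quorum):
    for candidate, votes in tally_votes(vote_requests).items():
        # Condition 1: more than half of all sentinels. Condition 2: at least quorum.
        if votes > total_sentinels / 2 and votes >= quorum:
            return candidate
    return None


requests = [
    ("S1", "S1"),  # S1 votes for itself
    ("S2", "S2"),  # S2 votes for itself
    ("S3", "S1"),  # S3 hears from S1 first: Y
    ("S3", "S2"),  # S3 already voted: N
    ("S2", "S1"),  # S2 already voted: N
    ("S1", "S2"),  # S1 already voted: N
]
print(tally_votes(requests))
print(elect_leader(requests, total_sentinels=3, quorum=2))
```

If no candidate satisfies both conditions, the function returns None, which corresponds to a failed round that has to be repeated.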

Once the leader sentinel has been elected and has switched to the new master, how does it notify clients?

This is again based on pub/sub, this time the sentinel’s own: clients subscribe to the sentinel’s message channels to receive event notifications. Sentinel provides many channels; they include:

Event: related channels

  • Master offline event: +sdown (the instance enters the “subjectively offline” state), -sdown (the instance leaves the “subjectively offline” state), +odown (the instance enters the “objectively offline” state), -odown (the instance leaves the “objectively offline” state)
  • New master switch: +switch-master (the master’s address has changed)

When a client subscribes to the sentinel’s +switch-master channel, it receives the connection information of the new master as soon as the master is switched:

switch-master <master name> <oldip> <oldport> <newip> <newport>  

In this way the sentinel notifies clients that the master has been switched.
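On the client side, handling the notification mostly means parsing the five fields shown above. The following sketch assumes the payload arrives as one space-separated string without the channel name in front; the SwitchMaster type, the function name, and the sample addresses are made up for illustration.

```python
from typing import NamedTuple


class SwitchMaster(NamedTuple):
    master_name: str
    old_ip: str
    old_port: int
    new_ip: str
    new_port: int


def parse_switch_master(payload):
    """Parse '<master name> <oldip> <oldport> <newip> <newport>'."""
    name, old_ip, old_port, new_ip, new_port = payload.split()
    return SwitchMaster(name, old_ip, int(old_port), new_ip, int(new_port))


event = parse_switch_master("mymaster 192.168.1.10 6379 192.168.1.11 6380")
print(event.new_ip, event.new_port)
```

A client would typically drop its connection to the old address and reconnect to (new_ip, new_port) when such an event arrives.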

With the mechanisms above, Redis achieves high availability, but they also bring potential risks such as data loss.

Data problems

The high-availability mechanisms carry some risks:

  • Data loss caused by asynchronous replication during a master/slave switchover
  • Data loss caused by split brain
  • Data inconsistency caused by asynchronous replication during a master/slave switchover

Data loss – Primary/secondary asynchronous replication

Because the master replicates data to the slaves asynchronously, some of the master’s data may not yet have been copied to a slave when the master goes down, and that data is lost.

Summary: data on the master that has not yet been synchronized to a slave is lost when the master fails.

Data loss – Split brain

What is split brain? When the master in a cluster loses its connection to Sentinel, Sentinel considers the master offline and elects a slave as the new master, even though the old master is still running. The cluster then has two masters.

In this situation, clients that have not yet switched to the new master may keep writing to the old master. When the old master recovers its connection, it is attached to the new master as a slave, its own data is cleared, and it copies the data from the new master again. The writes it accepted in the meantime are lost.

Summary: writes accepted by the old master were never synchronized to the slaves; once a slave has been promoted to master and the old master has been demoted, those unsynchronized writes are lost.

Data loss solutions

Data loss can be mitigated by properly configuring the min-slaves-to-write and min-slaves-max-lag parameters, for example:

  • min-slaves-to-write 1
  • min-slaves-max-lag 10

With this configuration the master requires at least one slave whose replication and synchronization delay does not exceed 10 seconds. If no slave meets this condition, that is, if the delay of every slave exceeds 10 seconds, the master stops accepting write requests.
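The effect of the two parameters can be expressed as a single predicate. This is a sketch of the rule as described here, not the server's implementation; the function name is hypothetical and the lag values are sample data.

```python
def master_accepts_writes(slave_lags_secs, min_slaves_to_write=1, min_slaves_max_lag=10):
    """slave_lags_secs: current replication lag of each connected slave, in seconds.
    Writes are accepted only if enough slaves are within the allowed lag."""
    healthy = sum(1 for lag in slave_lags_secs if lag <= min_slaves_max_lag)
    return healthy >= min_slaves_to_write


print(master_accepts_writes([3, 25]))    # one slave is within 10 seconds
print(master_accepts_writes([12, 25]))   # every slave lags more than 10 seconds
print(master_accepts_writes([]))         # no slave connected at all
```

During a split brain, an isolated old master soon sees every slave exceed the lag limit and starts refusing writes, which bounds how much data can be lost.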

Data inconsistency

During asynchronous master-slave replication, a slave may execute the synchronized commands late, either because of network delay or because it is blocked by a command with high execution complexity. The data on the master and the slave is then inconsistent.

Solution: an external program can be developed to monitor the replication progress between the master and the slaves (master_repl_offset and slave_repl_offset). Comparing master_repl_offset with slave_repl_offset shows how far each slave lags behind. If a slave’s replication progress falls outside the expected threshold, clients stop reading data from that slave.
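A monitoring program of this kind could start from the text that INFO replication returns on the master. The sketch below parses such text and lists the slaves that are close enough to read from; the sample output is hand-written for illustration, and the function names and the 100-byte threshold are assumptions of this example.

```python
def parse_info_replication(text):
    """Extract master_repl_offset and each slave's offset from INFO replication text."""
    master_offset = None
    slave_offsets = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        if key == "master_repl_offset":
            master_offset = int(value)
        elif key.startswith("slave") and "offset=" in value:
            # Example value: ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0
            fields = dict(item.split("=", 1) for item in value.split(","))
            slave_offsets[f"{fields['ip']}:{fields['port']}"] = int(fields["offset"])
    return master_offset, slave_offsets


def readable_slaves(master_offset, slave_offsets, max_lag_bytes):
    """Slaves whose offset is within max_lag_bytes of the master's offset."""
    return [
        addr for addr, offset in slave_offsets.items()
        if master_offset - offset <= max_lag_bytes
    ]


SAMPLE = """\
# Replication
role:master
connected_slaves:2
slave0:ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0
slave1:ip=10.0.0.3,port=6379,state=online,offset=400,lag=1
master_repl_offset:1000
"""

master_offset, slave_offsets = parse_info_replication(SAMPLE)
print(readable_slaves(master_offset, slave_offsets, max_lag_bytes=100))
```

The list of readable slaves would then be handed to whatever component routes client reads, so that lagging slaves are skipped until they catch up.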

Conclusion

Redis achieves high availability through master-slave replication, persistence, the Sentinel mechanism, and other mechanisms. Understanding how they work, along with the risks involved and their solutions, helps you tune real projects and improve the reliability and stability of the system.
