Recently I chatted with a reader who was very upset: his company's Redis had gone down, and the online business was affected.
I was curious and asked him: if the Redis master node goes down, isn't there a standby? How could that affect the business?
Their architecture deployed only a single Redis instance, he said. When the node went down, the data was lost.
Well, speaking of backups, today we're going to talk about master-slave synchronization in Redis.
First of all, what are master and slave? A master-slave cluster, also called a primary-secondary cluster, deploys multiple Redis instances, as shown in the following figure:
Each instance has its own responsibility:
Primary library: receives read and write operations.
Secondary library: synchronizes data from the primary library and serves read operations.
Curious readers may ask: why can't we write to the secondary library?
Consider the complexity of merging data. If a key is updated several times and each update runs on a different instance, global locks would be needed to serialize operations across the cluster so that every update is based on the latest data. That is expensive.
To reduce system complexity and cost, a master-slave architecture typically sends writes to the master and lets the slaves serve reads. The division of labor is clear, and each role has a single responsibility.
Some of you may mention Redis Cluster mode, which is a different design. It uses CRC16(key) to map keys to hash slots spread across multiple instances. Each instance reads and writes only the data in its own slots, which spreads the load over the cluster. That is a different approach, and I won't go into it here.
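For readers who want to see the CRC16(key) mapping in concrete form, here is a minimal Python sketch. It assumes the CRC16 variant with polynomial 0x1021 and initial value 0 (which is what binascii.crc_hqx computes when started from 0) and the 16384 hash slots used by Redis Cluster. Hash tags ({...} in key names) are deliberately left out, and key_slot is a name chosen for illustration.

```python
import binascii

NUM_SLOTS = 16384  # Redis Cluster splits the key space into 16384 hash slots


def key_slot(key: bytes) -> int:
    """Map a key to a hash slot: CRC16(key) mod 16384 (hash tags ignored)."""
    # crc_hqx uses polynomial 0x1021; an initial value of 0 gives the XMODEM variant.
    return binascii.crc_hqx(key, 0) % NUM_SLOTS


for k in (b"user:1", b"user:2", b"order:42"):
    print(k.decode(), "-> slot", key_slot(k))
```

Each instance in the cluster owns a range of slots, so the slot number decides which instance serves the key.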
To ensure that data is not lost, Redis provides two data synchronization methods:
1. RDB: full data synchronization through a snapshot file
2. AOF: incremental data synchronization by replaying a log of write commands
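The difference between a full snapshot and a replayed command log can be shown with a toy model. This is only a sketch: the tuple-based commands, the apply helper, and the JSON dump are assumptions made for illustration and have nothing to do with the real RDB or AOF file formats.

```python
import json


def apply(state, command):
    """Apply one toy write command to an in-memory dict."""
    op, key, *rest = command
    if op == "SET":
        state[key] = rest[0]
    elif op == "DEL":
        state.pop(key, None)


commands = [("SET", "a", "1"), ("SET", "b", "2"), ("DEL", "a"), ("SET", "b", "3")]

live = {}
command_log = []
for cmd in commands:
    apply(live, cmd)
    command_log.append(cmd)      # AOF style: record every write as it happens

snapshot = json.dumps(live)      # RDB style: dump the whole state at one moment

# Recovery: load the snapshot directly, or replay the log from an empty state.
from_snapshot = json.loads(snapshot)
from_log = {}
for cmd in command_log:
    apply(from_log, cmd)

print(from_snapshot, from_log)
```

Both paths end in the same state. The snapshot is compact and loads in one go, while the log keeps every intermediate write and must be replayed in order.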
What’s the difference between the two?
When should you use RDB, and when AOF?
Next, let's analyze it step by step.
First, start two instances of Redis with IP addresses 192.168.0.1 and 192.168.0.2. At first, there is no relationship between them.
We log in to 192.168.0.2 from a terminal and execute the command
replicaof 192.168.0.1 6379
This makes the 192.168.0.2 instance a slave of 192.168.0.1.
Once the master and slave instances have been associated, the next step is data synchronization.
Master-slave synchronization
There are three steps in master/slave data synchronization:
Step 1
The slave (192.168.0.2) sends the psync command to the master (192.168.0.1) with two parameters: the master's runID and the synchronization progress offset.
When the connection is first established, the slave does not know the master's runID, so it sends ? instead. An offset of -1 indicates the first replication. Note: when each Redis instance starts, it automatically generates a random ID that identifies that instance.
The master responds to the psync request with FULLRESYNC, which carries two parameters: the master's runID and the master's current synchronization offset.
Note: FULLRESYNC indicates full replication.
Step 2
The master forks a child process to generate an RDB file and sends the file to the slave. When the slave receives it, the slave clears its current data and loads the RDB file. While the RDB file is being generated and transferred, the master keeps serving requests and is not blocked by the synchronization. Write commands that arrive in the meantime are saved to the replication buffer.
Step 3
The master sends the buffered write commands to the slave, and the slave replays them, which brings the slave in sync with the master.
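The three steps can be walked through with a small in-memory simulation. This is a sketch under simplifying assumptions, not how Redis is implemented: ToyMaster and ToyReplica are hypothetical classes, the "RDB file" is a plain dict copy, and the offset counts write commands rather than bytes.

```python
import uuid


class ToyMaster:
    def __init__(self):
        self.run_id = uuid.uuid4().hex   # random ID generated at startup
        self.data = {}
        self.offset = 0                  # number of writes applied so far
        self.syncing = False
        self.replication_buffer = []     # writes that arrive during a sync

    def write(self, key, value):
        self.data[key] = value
        self.offset += 1
        if self.syncing:
            self.replication_buffer.append((key, value))

    def psync(self, run_id, offset):
        # Step 1: "?" and -1 mean the slave has never replicated before.
        if run_id == "?" and offset == -1:
            self.syncing = True
            return "FULLRESYNC", self.run_id, self.offset
        raise NotImplementedError("partial resync is not modelled here")

    def make_snapshot(self):
        # Step 2: stand-in for the RDB file, a copy of the data right now.
        return dict(self.data)

    def drain_buffer(self):
        # Step 3: hand over the writes buffered while the sync was running.
        pending, self.replication_buffer = self.replication_buffer, []
        self.syncing = False
        return pending


class ToyReplica:
    def __init__(self):
        self.master_run_id = "?"
        self.offset = -1
        self.data = {"stale": "old"}

    def load_snapshot(self, snapshot, run_id, offset):
        self.data = dict(snapshot)       # old data is discarded
        self.master_run_id, self.offset = run_id, offset

    def replay(self, commands):
        for key, value in commands:
            self.data[key] = value
            self.offset += 1


master = ToyMaster()
master.write("a", "1")
master.write("b", "2")

replica = ToyReplica()
reply, run_id, offset = master.psync(replica.master_run_id, replica.offset)
snapshot = master.make_snapshot()
master.write("c", "3")                   # arrives mid-sync, goes to the buffer
replica.load_snapshot(snapshot, run_id, offset)
replica.replay(master.drain_buffer())
print(reply, replica.data, replica.offset)
```

The write to "c" happens after the snapshot was taken, so it reaches the replica only through the replayed buffer.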
That covers the core logic of master-slave synchronization.
However, a production environment usually has one master with many slaves. Each slave needs an RDB file from the master for its initial synchronization, which is obviously expensive for the master. What are the solutions?
With multiple slaves, the pressure on the master increases significantly in two ways:
1. Each full synchronization makes the master fork a child process that generates an RDB file, so more slaves mean more forks and more load on the master.
2. Each generated RDB file must be transferred to a slave, which consumes network bandwidth.
To get out of this dilemma, a new mode evolved: the “master-slave-slave” mode. It works as follows:
Although there are four slaves, only two instances, 192.168.0.2 and 192.168.0.3, connect directly to the master to synchronize data, which greatly reduces the pressure on the master.
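The saving is easy to count. In the sketch below, each slave is mapped to the node it replicates from. The addresses 192.168.0.4 and 192.168.0.5, and which upstream they attach to, are assumptions made for illustration, since the text only names the two slaves that connect directly.

```python
from collections import Counter

MASTER = "192.168.0.1"

# slave -> node it replicates from
flat = {
    "192.168.0.2": MASTER,
    "192.168.0.3": MASTER,
    "192.168.0.4": MASTER,          # hypothetical address
    "192.168.0.5": MASTER,          # hypothetical address
}
cascaded = {
    "192.168.0.2": MASTER,
    "192.168.0.3": MASTER,
    "192.168.0.4": "192.168.0.2",   # hypothetical: syncs from a slave
    "192.168.0.5": "192.168.0.3",   # hypothetical: syncs from a slave
}


def direct_slaves(topology):
    """Count how many slaves each node has to serve directly."""
    return Counter(topology.values())


print("flat:", direct_slaves(flat)[MASTER])
print("cascaded:", direct_slaves(cascaded)[MASTER])
```

In the cascaded layout the master serves two slaves instead of four, and the remaining synchronization work moves to 192.168.0.2 and 192.168.0.3.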
Nothing is guaranteed, and network transmission is a major risk. What happens if the network breaks? How does that affect master-slave synchronization?
Impact of intermittent network disconnection on master/slave synchronization
Data can be synchronized between master and slave instances by full synchronization or incremental synchronization. Full synchronization transfers an RDB file, but how is incremental synchronization achieved?
There is a buffer, repl_backlog_buffer, with a circular design, where incremental commands are stored first. The master tracks its write position in an offset called master_repl_offset. Each slave tracks its read position in an offset called slave_repl_offset.
Normally, master_repl_offset and slave_repl_offset are close, meaning that the data on the master and the slave is almost in sync.
Each time data is synchronized, the slave sends the psync command with its slave_repl_offset, and the master sends the incremental data that follows that offset. This is easy to understand.
Is it safe?
Because the buffer is a ring, if the master writes much faster than the slave reads, the master wraps around and overwrites data the slave has not read yet.
Why a ring? To reuse space. Most dash cams and surveillance devices on the market store recordings in the same cyclic-overwrite way: when the space is full, the oldest data is overwritten. Some data may be lost, but the approach is cost-effective.
Back to the question above: what if the data gets overwritten?
As shown in the figure above, the slave's psync command requests offset 4, but the master has already written up to 15, overwriting the earlier entries 1, 2, 3, 4, and 5.
Now we are stuck: the data that needs to be synchronized has been overwritten, which causes big trouble.
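The overwrite scenario can be reproduced with a toy ring buffer. This is a sketch only: ToyBacklog is a hypothetical class, offsets count commands (real Redis offsets count bytes), and the slave is assumed to report the last offset it has applied. A capacity of 10 is assumed so that writing 15 commands overwrites entries 1 to 5, as in the example.

```python
class ToyBacklog:
    """Fixed-size circular buffer of write commands, addressed by offset."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.master_repl_offset = 0      # offset of the newest command

    def append(self, command):
        self.master_repl_offset += 1
        # Wrap around: the newest command overwrites the oldest slot.
        self.slots[(self.master_repl_offset - 1) % self.size] = command

    def oldest_offset(self):
        """Smallest offset that has not been overwritten yet."""
        return max(1, self.master_repl_offset - self.size + 1)

    def psync(self, slave_repl_offset):
        """Return the reply type and the commands the slave is missing."""
        wanted = slave_repl_offset + 1
        if wanted < self.oldest_offset():
            return "FULLRESYNC", []      # the needed data is gone
        missing = [self.slots[(o - 1) % self.size]
                   for o in range(wanted, self.master_repl_offset + 1)]
        return "CONTINUE", missing


backlog = ToyBacklog(size=10)
for n in range(1, 16):                   # master writes commands 1..15
    backlog.append(f"cmd-{n}")

print(backlog.oldest_offset())
print(backlog.psync(4))
print(backlog.psync(12))
```

A slave at offset 4 needs entry 5, which no longer exists, so it cannot catch up incrementally. A slave at offset 12 only needs entries 13 to 15, which are still in the ring.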
There are two solutions:
1. Increase the size of repl_backlog_buffer, which is controlled by the repl-backlog-size parameter.
Buffer size = master write speed * operation size - slave pull speed * operation size
That's something we can control. For example, if you are worried about the traffic peak of a big promotion, you can set the value to 2, 3, or 4 times the calculated size, depending on your business situation.
2. Redis also provides a fallback of its own.
In this case, full replication is triggered, just like the synchronization after the master/slave relationship is first established. This closes the data gap between master and slave in one pass.
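The sizing rule from the first solution can be worked through numerically. The figures below (2000 writes per second on the master, 1000 commands per second pulled by the slave, 2 KB per operation) are made-up illustrative values, and backlog_size is a hypothetical helper name.

```python
def backlog_size(write_ops_per_sec, pull_ops_per_sec, op_size_bytes, safety_factor=2):
    """Buffer size = write speed * op size - pull speed * op size, then scaled."""
    net_growth = (write_ops_per_sec - pull_ops_per_sec) * op_size_bytes
    # A slave that keeps up (or pulls faster) needs no extra headroom.
    return max(0, net_growth) * safety_factor


size = backlog_size(2000, 1000, 2048, safety_factor=2)
print(size, "bytes")
```

With these numbers the gap grows by 2,048,000 bytes, and doubling it for safety gives 4,096,000 bytes, roughly 4 MB, as a starting point for repl-backlog-size. A larger safety_factor covers promotion peaks at the cost of more memory on the master.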
What if the primary node fails?
If the primary node fails in traditional master/slave mode, you need to perform the switchover manually.
That is slow, and an online production system simply cannot accept it.
This is where the Sentinel mechanism comes in. It switches between master and slave automatically and handles failover. The whole process is divided into three stages: monitoring, master selection, and notification.
1. Monitoring. The sentinel periodically pings all master and slave instances to check whether they are still in service. If a node does not reply within the configured time, it is considered offline.
Of course, network jitter can cause misjudgment. How can that be avoided?
Introduce a sentinel cluster: multiple sentinel instances judge together, which reduces the misjudgment rate. The criterion is that, with n sentinel instances, at least n/2+1 of them must agree that the node is offline.
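The n/2+1 rule is a simple majority check. The sketch below follows the rule exactly as stated here, using integer division. The function name is hypothetical, and a real Sentinel deployment takes its agreement threshold from a quorum value in the sentinel configuration, so treat this as an illustration of the idea only.

```python
def is_offline(offline_votes, total_sentinels):
    """True when at least n/2+1 sentinels report the node as unreachable."""
    needed = total_sentinels // 2 + 1
    return offline_votes >= needed


print(is_offline(1, 3))   # one sentinel with a bad network is not enough
print(is_offline(2, 3))   # two out of three agree
```

With three sentinels, a single sentinel suffering network jitter cannot declare the master offline on its own.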
2. Choose the master. This depends on scoring each slave. The scoring rules are applied in rounds: slave priority, slave replication progress, and slave ID. As soon as one round produces a single highest-scoring slave, that slave is elected master.
Slave priority. Different machines may have different hardware. Machines with better hardware can be given a higher priority, which is configured through slave-priority.
Replication progress. This depends on the slave_repl_offset value. The larger the value, the more data has been synchronized and the higher the score.
Slave ID. Each Redis instance generates an ID when it starts. With the same priority and replication progress, the slave with the smallest ID scores highest and is selected as the new master.
3. Notification. After the election, the new master's information is sent to all nodes. All slaves execute the replicaof command against the new master to establish the master/slave relationship and start replicating from it. The new master's information is also pushed to clients.
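The three selection rounds in stage 2 amount to comparing slaves on priority first, then replication progress, then ID. Here is a minimal sketch. The Candidate class, its field names, and the sample values are hypothetical. It assumes the Redis convention that a lower priority number means a more preferred slave and that a priority of 0 means the slave is never promoted.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    run_id: str
    priority: int            # lower number = more preferred, 0 = never promote
    slave_repl_offset: int   # larger = more data already synchronized


def pick_new_master(candidates):
    """Round 1: priority. Round 2: replication progress. Round 3: smallest ID."""
    eligible = [c for c in candidates if c.priority != 0]
    if not eligible:
        return None
    return min(eligible, key=lambda c: (c.priority, -c.slave_repl_offset, c.run_id))


slaves = [
    Candidate("b1", priority=100, slave_repl_offset=900),
    Candidate("a7", priority=100, slave_repl_offset=1000),
    Candidate("c3", priority=100, slave_repl_offset=1000),
    Candidate("d9", priority=0, slave_repl_offset=2000),
]
winner = pick_new_master(slaves)
print(winner.run_id)
```

Here all eligible slaves share the same priority, two of them tie on replication progress, and the smaller ID decides the result. The slave with priority 0 is skipped even though it has the most data.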