Many developers have configured master-slave replication, but few have a deep understanding of its workflow and common problems. This article, compiled over two days, covers all the key points of Redis master-slave replication.
The environment used in this article is CentOS 7.0 with Redis 4.0.
What is Redis master-slave replication?
Master-slave replication means having two Redis servers and synchronizing data from one to the other. The source is called the master node and the target is the slave node. Data flows in one direction only, from master to slave.
In practice, however, master-slave replication rarely involves just two Redis servers: a single server can be a slave of one node and at the same time the master of others.
In the example below, slave3 is both a slave node and the master of its own slaves.
Keep this concept in mind; the details follow.
Why do you need Redis master-slave replication?
Let’s say we only have one Redis server.
The first problem in this situation is that if the server goes down, data is lost outright. If the project involves payments or other money-related data, the consequences are easy to imagine.
The second problem is memory. With only one server, memory will eventually hit its ceiling, and a single machine cannot be upgraded indefinitely.
This raises many questions: how do master and slave connect? How is data synchronized? What happens if the master goes down? Don't worry, we'll take it one step at a time.
The role of Redis master-slave replication
Why does Redis use master-slave replication? Its main roles are:
- Let's continue with the diagram above to walk through the roles.
- First, data redundancy: master-slave replication provides a hot backup of data, a form of redundancy in addition to persistence.
- Second, recovery from single-machine failure: when the master node has a problem, a slave node can take over the service, giving fast failure recovery and service redundancy.
- Third, read/write splitting: the master mainly handles writes and the slaves mainly handle reads, which improves the servers' load capacity. Slave nodes can be added as requirements grow.
- Fourth, load balancing: building on read/write splitting, the master provides the write service and the slaves provide the read service, sharing the load. Especially in read-heavy, write-light scenarios, spreading reads across multiple slaves greatly increases the concurrency a Redis deployment can handle.
- Fifth, the cornerstone of high availability: master-slave replication is the foundation on which Sentinel and Cluster are built, so it can be called the cornerstone of Redis high availability.
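The read/write-splitting and load-balancing roles above can be sketched with a toy router: writes go to the master, reads are spread round-robin across slaves. This is a minimal Python simulation, not the Redis API; all class and variable names are made up for illustration.

```python
import itertools

class FakeNode:
    """Stand-in for a Redis instance; a dict plays the role of the keyspace."""
    def __init__(self, name):
        self.name = name
        self.store = {}

class ReadWriteRouter:
    """Send writes to the master, spread reads round-robin across slaves."""
    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves
        self._cycle = itertools.cycle(slaves)

    def set(self, key, value):
        self.master.store[key] = value
        for s in self.slaves:            # stand-in for asynchronous replication
            s.store[key] = value

    def get(self, key):
        node = next(self._cycle)         # each read lands on the next slave
        return node.store.get(key)

router = ReadWriteRouter(FakeNode("master"),
                         [FakeNode("slave1"), FakeNode("slave2")])
router.set("kaka", "123")
print(router.get("kaka"))  # 123
```

Real deployments would use a client library or proxy for this routing; the point is only that writes and reads take different paths.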
Configuring Redis master-slave replication
With that said, let's configure a simple master-slave replication example and then discuss how it is implemented.
The Redis installation path is /usr/local/redis
Logs and configuration files are stored in /usr/local/redis/data
We first prepare two configuration files, redis6379.conf and redis6380.conf
redis-server redis6380.conf
redis-cli -p 6380
redis-server redis6379.conf
redis-cli
1. Send the command from the client
When configuring master-slave replication, all operations are performed on the slave node.
So we run slaveof 127.0.0.1 6379 on the slave node (the 6380 instance), and once the command completes, the connection is established.
To verify, run set kaka 123 on the master (127.0.0.1:6379) and read the key back from the slave.
2. Use the configuration file
Before setting up master-slave replication through the configuration file, run slaveof no one on the slave to disconnect the existing replication, and add the line slaveof 127.0.0.1 6379 to redis6380.conf.
info
This shows the output of info on the master after the slave connected via the client command; note the slave0 entry.
slaveof no one
info
redis-server redis6380.conf
After the slave restarts with this configuration, its connection information appears directly on the master.
3. Start the Redis server
Pass --slaveof <host> <port> when starting the Redis server, e.g. redis-server redis6380.conf --slaveof 127.0.0.1 6379, to establish master-slave replication at startup.
4. View logs after the primary/secondary replication is started
This is the log information of the primary node
Working principle of master-slave replication
1. Three phases of master-slave replication
The complete master-slave replication workflow is divided into three phases, each with its own internal steps. Let's go through all three.
- Connection establishment phase: the slave establishes a connection with the master
- Data synchronization phase: data is synchronized from the master to the slave
- Command propagation phase: data changes are propagated continuously to keep the nodes in sync
2. Stage 1: Establishing the connection
- Save the master's information: the slave records the master's address and port
- Establish a socket connection (its use is described later)
- Send ping commands periodically
- Authenticate, if the master requires a password
- Send the slave's port information
At the end of this phase, the slave has saved the master's address and port, and the master has saved the slave's port.
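The handshake steps above can be sketched as a tiny state machine. This is a Python simulation built around a fake master object; none of these names are real Redis client APIs.

```python
class FakeMaster:
    """Simulated master side of the handshake."""
    def __init__(self, requirepass=None):
        self.requirepass = requirepass
        self.slave_port = None            # the master saves the slave's port

    def ping(self):
        return "PONG"

    def auth(self, password):
        return "OK" if password == self.requirepass else "ERR"

    def replconf_listening_port(self, port):
        self.slave_port = port
        return "OK"

def handshake(master, slave_port=6380, password=None):
    """Slave-side steps: ping, optional auth, then send our own port."""
    steps = []
    if master.ping() == "PONG":           # master is reachable
        steps.append("ping")
    if master.requirepass is not None:    # authenticate only when required
        assert master.auth(password) == "OK"
        steps.append("auth")
    master.replconf_listening_port(slave_port)
    steps.append("replconf listening-port")
    return steps

m = FakeMaster()
print(handshake(m))   # ['ping', 'replconf listening-port']
print(m.slave_port)   # 6380
```

The real slave performs these steps over the socket opened in the previous step, using the PING, AUTH, and REPLCONF commands.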
3. Stage 2: Data synchronization
When a slave connects to a master for the first time, a full replication is performed; this first full replication is unavoidable.
After the full replication completes, the master sends the contents of the replication backlog buffer, and the slave runs bgrewriteaof to restore the data; this is partial replication.
Three new terms appear in this phase: full replication, partial replication, and the replication backlog buffer. They are explained in detail in the FAQ below.
4. Stage 3: Command propagation
When data is modified on the master and master and slave become inconsistent, the change must be propagated so that the two sides converge again; this process is called command propagation.
The master sends the data-change command to the slave; the slave executes it on receipt, keeping master and slave consistent.
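Command propagation can be simulated in a few lines: every propagated byte advances the master's replication offset, and a slave that applies the same bytes ends up with the same offset. A toy Python sketch, not real Redis code:

```python
class Master:
    def __init__(self):
        self.offset = 0                  # grows with every byte propagated
        self.slaves = []

    def write(self, command):
        payload = command.encode()
        self.offset += len(payload)
        for s in self.slaves:            # propagate the change to every slave
            s.apply(payload)

class Slave:
    def __init__(self):
        self.offset = 0
        self.log = []

    def apply(self, payload):
        self.log.append(payload.decode())
        self.offset += len(payload)

m, s = Master(), Slave()
m.slaves.append(s)
m.write("set kaka 123")
print(m.offset == s.offset)  # True: master and slave are in sync
```

Comparing these two offsets is exactly how Redis later decides whether a reconnecting slave is up to date.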
5. Partial replication in the command propagation phase
- The connection is lost during command propagation, e.g. due to network jitter or a disconnect.
- The master keeps writing incoming data into the replication backlog buffer.
- The slave keeps trying to reconnect to the master.
- Once reconnected, the slave sends its runid and replication offset to the master and executes the psync command.
- If the master determines that the offset is within the range of the backlog buffer, it returns +CONTINUE and sends the buffered data to the slave.
- After receiving the data, the slave runs bgrewriteaof to restore it.
6. The principle of master-slave replication in detail (full replication + partial replication)
- The slave sends the command psync <runid> <offset> to request data from the master. On the very first connection, the slave knows nothing about the master's runid and offset, so the first command sent is psync ? -1, meaning: give me all of the master's data.
- The master runs bgsave to generate an RDB file and records the current replication offset.
- The master sends +FULLRESYNC <runid> <offset> to the slave, then transfers the RDB file over the socket.
- The slave receives +FULLRESYNC, saves the master's runid and offset, clears all of its current data, then receives the RDB file over the socket and restores the RDB data.
- After the full replication, the slave has the master's runid and offset and sends psync <runid> <offset>.
- The master receives the command and checks whether the runid matches and whether the offset is still within the replication backlog buffer.
- If either the runid or the offset check fails, the master goes back to step 2 and performs another full replication. A runid mismatch can only mean the master was restarted (addressed later); an offset miss means the replication backlog buffer overflowed. If both checks pass and the slave's offset differs from the master's, the master replies +CONTINUE <offset> and sends the data between the slave's offset and its own offset from the backlog buffer over the socket.
- After receiving the data, the slave runs bgrewriteaof to restore it.
Steps 1-4 are full replication; steps 5-8 are partial replication.
During step 3, the master keeps receiving data from clients, so the master's offset keeps changing. These changes have to be sent to every slave; this ongoing exchange is maintained by the heartbeat mechanism.
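The master's choice between +FULLRESYNC and +CONTINUE in the steps above boils down to two checks. A hedged Python sketch of that decision logic (the function and parameter names are mine, not Redis's):

```python
def psync_reply(master_runid, backlog_start, backlog_end,
                slave_runid, slave_offset):
    """Decide the reply to `psync <runid> <offset>` from a slave."""
    if slave_runid != master_runid:              # first connection or master restarted
        return "+FULLRESYNC"
    if not (backlog_start <= slave_offset <= backlog_end):
        return "+FULLRESYNC"                     # offset fell out of the backlog
    return "+CONTINUE"                           # partial resync is possible

# First connection: the slave sends `psync ? -1`, so a full resync is forced.
print(psync_reply("abc123", 100, 200, "?", -1))        # +FULLRESYNC
# Reconnection with a matching runid and an in-range offset:
print(psync_reply("abc123", 100, 200, "abc123", 150))  # +CONTINUE
```

Everything else in the flow (bgsave, RDB transfer, backlog transfer) follows from which branch this decision takes.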
7. Heartbeat mechanism
In the command propagation phase, the master and slave constantly exchange information, using a heartbeat mechanism to keep the connection between them alive.
- Master heartbeat task
- Command: ping
- Sent every 10 seconds by default; the interval is controlled by the repl-ping-slave-period parameter
- Its main purpose is to check whether the slave is still online
- Run info replication to check how long ago the slave last responded; a lag of 0 or 1 is normal
- Slave heartbeat task
- Command: replconf ack {offset}
- Sent once per second
- Its main purpose is to report the slave's own replication offset to the master and pull the latest data-change commands; it also lets the slave check that the master is still online
A note for the heartbeat phase: to keep the data on the master safe, when the number of live slaves drops too low or their lag is too high, the master refuses all writes and stops synchronizing data.
There are two parameters that can be configured:
min-slaves-to-write 2
min-slaves-max-lag 8
These two parameters mean: when fewer than 2 slaves are connected, or when slave lag exceeds 8 seconds, the master forcibly refuses writes and stops data synchronization.
How does the master know how many slaves are down or how far they lag? Through the heartbeat mechanism: each slave sends replconf ack every second, carrying its offset, from which the master can derive each slave's lag and count the healthy slaves.
The three core elements of partial replication
1. Server run ID (run ID)
Let's first look at what the run ID is; you can see it by executing the info command, and it also appears in the startup log shown earlier.
When master-slave replication starts for the first time, the master sends its runid to the slave, and the slave saves it; you can check the saved id with the info command.
On reconnection, the slave sends this saved runid back to the master. If it matches the master's current runid, the master attempts a partial replication (the other deciding factor is the offset). If the slave holds a different runid from the master's current one, a full replication is performed.
2. Replication backlog buffer
The replication backlog buffer is a first-in-first-out queue that stores the write commands the master has collected. Its default size is 1 MB.
You can control the buffer size with the repl-backlog-size parameter in the configuration file; adjust it according to your server's memory.
What exactly does the replication backlog buffer store?
After executing the command set name kaka, we can look at the persisted file to see the propagated command.
So why can the replication backlog buffer lead to full replication?
During command propagation, the master stores the collected write commands in the replication backlog buffer and then sends them to the slaves. The problem arises when the master receives a very large amount of data at once, exceeding the buffer's capacity: the oldest data is squeezed out, the slave's offset falls out of range, and the data on master and slave diverge, forcing a full replication. If the buffer is sized too small, this can even become an endless loop in which the slave repeatedly does a full copy, overflows, and starts over.
3. Replication offset
The offset is used to compare the progress of master and slave during synchronization and to recover data after a slave disconnects.
Its value is an offset into the replication backlog buffer.
Common problems in master-slave replication
1. Master restart (handled by an internal optimization)
After the master restarts, the value of runid changes, which would force every slave to perform a full replication.
We don't have to handle this ourselves, but it is worth knowing how the system optimizes it.
Once master-slave replication is established, the master creates a master_replid variable, generated with the same policy as the runid (it is 41 characters long versus 40 for the runid), and sends it to the slave.
When shutdown save is executed on the master, the runid and offset are saved into the RDB file; you can inspect them with the redis-check-rdb command.
2. A slave's offset goes out of range after a network interruption, leading to full replication
In a poor network environment the slave's connection drops, and if the replication backlog buffer is too small, data overflows and the slave's offset goes out of range, triggering a full replication, possibly over and over again.
Solution: increase the replication backlog buffer size via repl-backlog-size
Suggestion: measure the average time it takes a slave to reconnect to the master, and the average write volume the master generates per second (write_size_per_second); then set
replication backlog size = 2 * reconnect time (seconds) * write volume per second on the master
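As a worked example of that sizing rule (the numbers here are assumed measurements, not recommendations):

```python
# Assumed measurements for illustration only:
reconnect_seconds = 5                 # average time for a slave to reconnect
write_bytes_per_second = 100 * 1024   # average write volume on the master

# Rule of thumb from the text: backlog = 2 * reconnect time * write volume
repl_backlog_size = 2 * reconnect_seconds * write_bytes_per_second
print(repl_backlog_size)              # 1024000 bytes, i.e. roughly 1 MB
```

The factor of 2 leaves headroom so the backlog survives a reconnect that takes longer than average.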
3. Frequent network interruption
The master's CPU usage is too high, or slaves connect and disconnect frequently. Either way, the master's resources are heavily occupied, including but not limited to buffers, bandwidth, and connections.
Why are the master's resources heavily occupied?
In the heartbeat mechanism, every slave sends the replconf ack command to the master once per second; slaves may run slow queries that consume a lot of CPU; and the master invokes the replication timing function replicationCron every second, so a slave that stays unresponsive for a long time keeps wasting the master's effort.
Solution:
Release slaves that time out.
Parameter: repl-timeout
It defaults to 60 seconds; a slave that has not responded within 60 seconds is released.
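The repl-timeout behavior can be sketched as a sweep over the last replconf ack timestamps, dropping any slave that has been silent longer than the timeout. A Python simulation with made-up names, not Redis code:

```python
def surviving_slaves(last_ack, now, repl_timeout=60):
    """last_ack maps slave name -> timestamp (seconds) of its last replconf ack."""
    return {name: ts for name, ts in last_ack.items()
            if now - ts <= repl_timeout}   # keep only slaves that answered in time

acks = {"slave1": 100, "slave2": 30}       # slave2 has been silent for 70 s
print(surviving_slaves(acks, now=100))     # {'slave1': 100}
```

A released slave simply reconnects later and attempts a psync, so the cost of a false positive is one resynchronization.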
4. Data inconsistency
Due to network factors, the data on several slaves may be inconsistent at a given moment; this cannot be completely avoided.
There are two solutions to this problem:
The first applies when data must be highly consistent: configure a single Redis server for both reads and writes. This works only for small data volumes where strong consistency is required.
The second is to monitor the offsets of the master and the slaves: if a slave's lag is too large, temporarily block clients from reading that slave. This is controlled by the parameter slave-serve-stale-data yes|no; when set to no, a slave whose data is stale responds only to a few commands such as info and slaveof.
5. Slave node failure
The solution is to maintain a list of available nodes on the client side and switch to another node when a slave fails. This problem is solved more cleanly by clusters, covered later.
Summary
This article covered what master-slave replication is, the three phases of its workflow, the three core elements of partial replication, and the heartbeat mechanism of the command propagation phase, finishing with the common problems of master-slave replication.
It took two days to write, the most time-consuming article in a while. Future articles will probably follow the same format: rather than splitting one topic across several posts, everything goes into one. Incomplete or incorrect points will be improved as my knowledge grows. The article is written mainly for my own review; see you in the comments if you have any questions.
Kaka hopes we can all learn from each other; corrections are welcome.