How do you ensure that Redis is highly concurrent and highly available? (From the "hemlock code farming" course)
How Redis handles read requests at over 100,000 QPS through read/write separation
1. The relationship between high concurrency in Redis and high concurrency in the whole system
If you want Redis to support high concurrency, you inevitably have to get the underlying caching layer right
MySQL can also be made highly concurrent, but only through a series of complex measures such as splitting databases and tables; for an order system with transactional requirements, tens of thousands of QPS is already relatively high
For something like an e-commerce product-detail page with really high concurrency, QPS can reach one hundred thousand, or even a million requests per second
Redis alone is not enough, but Redis is a very important part of the larger cache architecture that supports high concurrency
First, your underlying cache middleware, the cache system, must itself support what we call high concurrency; second, a well-designed overall cache architecture (multi-level caching, hot-data caching) is needed to support hundreds of thousands or even millions of concurrent requests
2. Where is the bottleneck that Redis cannot support high concurrency?
stand-alone
3. What should redis do if it wants to support more than 100,000 + concurrent transactions?
A stand-alone Redis instance is almost never going to exceed 100,000 QPS, except in special cases: the machine's performance is excellent, the configuration is high-end, it is a physical machine, it is well maintained, and your overall operations are not too complex
A single machine handles tens of thousands of QPS
Read/write separation: generally speaking, a cache is used to support high read concurrency. Write requests are relatively few, maybe only one or two thousand writes per second
Most requests are reads, say 200,000 reads per second
Read/write separation
Master-slave architecture -> Read/write separation -> architecture that supports 100,000 + read QPS
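The read/write split above can be sketched as a tiny client-side router. This is a minimal illustration under stated assumptions, not redis-py's actual API: `FakeNode` is a hypothetical stand-in for a real Redis connection, writes always go to the master, and reads round-robin across the slaves.

```python
import itertools

class ReadWriteRouter:
    """Minimal sketch: writes go to the master, reads round-robin over slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self._slave_cycle = itertools.cycle(slaves)

    def set(self, key, value):
        # Writes always go to the master.
        return self.master.set(key, value)

    def get(self, key):
        # Reads are spread across the slave instances.
        return next(self._slave_cycle).get(key)

class FakeNode:
    """Hypothetical stand-in for a Redis connection, for the sketch only."""
    def __init__(self):
        self.data = {}
    def set(self, key, value):
        self.data[key] = value
        return True
    def get(self, key):
        return self.data.get(key)
```

In practice the same idea usually comes from the client library or a proxy; and because replication is asynchronous, a read from a slave may briefly lag the latest write.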
Security implications of Redis Replication and Master persistence for master-slave architectures
Course outline
1. The fundamentals of Redis Replication 2. The core mechanism of Redis Replication 3. The significance of master persistence for the security of the master/slave architecture
Redis Replication -> Master/slave Architecture -> Read/write Separation -> Horizontal expansion supports high read concurrency
The most fundamental principles of Redis Replication, as groundwork for what follows
1. Illustrate the fundamentals of Redis Replication
2. The core mechanism of Redis Replication
(1) Redis replicates data to slave nodes asynchronously; starting from Redis 2.8, the slave node periodically acknowledges the amount of data it has replicated. (2) A master node can be configured with multiple slave nodes. (3) A slave node can also connect to other slave nodes. (4) Replication does not block the master node's normal work. (5) Replication does not block the slave node's own query operations either: it keeps serving with its old data set. When replication completes, the old data set is deleted and the new one is loaded; only at that moment are external services briefly suspended. (6) Slave nodes are mainly used for horizontal capacity expansion and read/write separation
Slaves also have a lot to do with high availability
3. The significance of master persistence for the security of master/slave architecture
If you are using a master/slave architecture, it is recommended that persistence be enabled for the Master Node!
It is not recommended to rely on slave nodes as hot standby for the master's data: if you turn off persistence on the master, then when the master goes down and restarts, its data set will be empty, and after replication the slave nodes' data will be lost too
Master -> RDB and AOF are both off -> all in memory
The master goes down and reboots; with no local data to recover from, it considers its data set empty
The master then synchronizes this empty data set to the slaves, and all slave data is cleared
100% of data is lost
- Master nodes must use persistence
- Also have a backup plan for the master's data files, in case all local files are lost; select an RDB from the backups to restore the master, so that the master is guaranteed to have data when it starts up
Even with the high-availability mechanism described below, where a slave node can automatically take over, the master node may restart automatically before Sentinel detects the failure, which can still cause all slave node data to be cleared
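The failure mode described above can be made concrete with a toy model. `Node` and `full_resync` are hypothetical simplifications: `disk` stands in for the RDB/AOF files, and a full resync simply overwrites the slave's data set with the master's.

```python
class Node:
    """Toy Redis node: `disk` models persistence (RDB/AOF files)."""
    def __init__(self, persistence=False):
        self.persistence = persistence
        self.memory = {}
        self.disk = {}

    def write(self, key, value):
        self.memory[key] = value
        if self.persistence:
            self.disk[key] = value

    def restart(self):
        # On restart, only persisted data can be recovered.
        self.memory = dict(self.disk)

def full_resync(master, slave):
    # The slave drops its own data set and copies the master's.
    slave.memory = dict(master.memory)
```

With persistence off, the restarted master comes back empty and the next full resync wipes the slave as well; with persistence on, the master restores its data before any slave resyncs from it.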
The principle of Redis master-slave replication
1. Core principles of master-slave architecture
When a slave node is started, it sends a PSYNC command to the master node
If this is a slave node reconnecting to the master node, the master copies only the missing data to the slave; if the slave node is connecting to the master for the first time, a full resynchronization is triggered
When a full resynchronization starts, the master forks a background process to generate an RDB snapshot file, while caching all write commands received from clients in memory. Once the RDB file is generated, the master sends it to the slave; the slave writes the RDB to local disk and then loads it into memory. The master then sends the write commands cached in memory to the slave, and the slave replays them to catch up.
If a slave node disconnects from the master due to a network fault, it automatically reconnects. If the master finds multiple slave nodes reconnecting, it initiates a single RDB save operation and serves all of them with one copy of the data.
2. Resumable transmission of the master/slave replication
Since Redis 2.8, master/slave replication supports resuming from a breakpoint: if the network connection drops during replication, it can continue from where the last copy left off rather than starting over from scratch
The master node keeps a backlog in memory. Both master and slave store a replica offset and the master's run ID; the offset is tracked against the backlog. If the master/slave connection breaks, the slave asks the master to continue replicating from its replica offset
If no corresponding offset is found in the backlog, a full resynchronization is performed
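The backlog-based decision between CONTINUE and FULLRESYNC can be sketched as a ring buffer over the replication stream. This is an illustrative model, not Redis source code; the real backlog defaults to 1MB (`repl-backlog-size`).

```python
class ReplicationBacklog:
    """Sketch of the master-side backlog used for partial resynchronization."""

    def __init__(self, capacity=1024 * 1024):
        self.capacity = capacity
        self.buffer = b""       # tail of the replication stream still retained
        self.master_offset = 0  # offset of the last byte ever written

    def feed(self, data: bytes):
        # Append to the stream, keeping only the most recent `capacity` bytes.
        self.buffer = (self.buffer + data)[-self.capacity:]
        self.master_offset += len(data)

    def psync(self, slave_offset: int):
        """Return ('CONTINUE', missing_bytes) if the slave's offset is still
        covered by the backlog, otherwise ('FULLRESYNC', None)."""
        oldest = self.master_offset - len(self.buffer)
        if oldest <= slave_offset <= self.master_offset:
            return "CONTINUE", self.buffer[slave_offset - oldest:]
        return "FULLRESYNC", None
```

If the slave was disconnected long enough for its offset to fall out of the 1MB window, the master has no choice but to trigger a full resynchronization.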
3. Diskless replication
The master creates the RDB in memory and sends it to the slave directly, instead of landing it on its own disk first
repl-diskless-sync enables this mode; repl-diskless-sync-delay waits a certain period before starting the transfer, so that more slaves can reconnect and be served by the same transfer
4. Handle expired keys
The slave does not expire keys on its own; it waits for the master. If the master expires a key (or evicts one via LRU), it synthesizes a del command and sends it to the slave.
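A toy model of that expiry flow, with hypothetical `Master`/`Slave` classes: only the master checks TTLs, and on expiry it propagates an explicit delete to its slaves (real Redis sends an actual DEL through the replication stream).

```python
import time

class Slave:
    def __init__(self):
        self.data = {}

class Master:
    """Sketch: only the master expires keys, then pushes a DEL to slaves."""

    def __init__(self, slaves):
        self.data = {}       # key -> (value, expire_at or None)
        self.slaves = slaves

    def set(self, key, value, ttl=None):
        expire_at = time.time() + ttl if ttl is not None else None
        self.data[key] = (value, expire_at)
        for s in self.slaves:            # replicate the write
            s.data[key] = value

    def get(self, key, now=None):
        now = time.time() if now is None else now
        if key not in self.data:
            return None
        value, expire_at = self.data[key]
        if expire_at is not None and now >= expire_at:
            del self.data[key]
            for s in self.slaves:        # simulate sending DEL to every slave
                s.data.pop(key, None)
            return None
        return value
```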
Fourth, an in-depth analysis of the complete flow and principles of Redis Replication
1. Complete process of replication
(1) The slave node starts with only the master node's information saved, including the master's host and IP address; the replication process has not started yet
Where do the master host and IP come from? From slaveof in redis.conf
(2) A scheduled task inside the slave node checks every second whether there is a new master node to connect to and replicate from. (3) The slave node sends a ping command to the master node. (4) Password authentication: if the master has requirepass set, the slave node must send the masterauth password for authentication. (5) The first time, the master node performs a full replication, sending all its data to the slave node. (6) After that, the master node continuously replicates write commands to the slave node asynchronously
2. Core mechanisms related to data synchronization
This refers to the full replication performed the first time a slave connects to the master, and some details of the mechanism in that process
(1) Both master and slave maintain an offset
The master continually increments its own offset, and the slave continually increments its own offset as well. The slave reports its offset to the master every second, and the master also records each slave's offset
This is not specific to full replication; it is mainly so that master and slave both know each other's data offsets and can detect any data inconsistency between them
(2) the backlog
The master node has a backlog, 1MB by default. As the master replicates data to the slave nodes, it also writes the replication stream into the backlog; the backlog is used for incremental replication after a full replication is interrupted
(3) Master run ID
The run ID identifies a Redis instance. If the master node restarts or its data set changes, its run ID changes, and the slave distinguishes masters by run ID (a changed run ID requires a full resync). To restart Redis without changing the run ID, use the redis-cli debug reload command
(4) psync
The slave node uses psync to replicate data from the master node: psync runid offset. The master node returns a response based on its own state: possibly FULLRESYNC runid offset, triggering full replication, or possibly CONTINUE, triggering incremental replication
3. Full replication
(1) The master executes bgsave to generate an RDB snapshot file locally. (2) The master node sends the RDB snapshot file to the slave node; if the RDB transfer takes more than 60 seconds (repl-timeout), the slave node considers the replication failed. (3) While generating and sending the RDB, the master caches all new write commands in memory; after the slave node has saved the RDB, the master copies the cached write commands to the slave. (4) client-output-buffer-limit slave 256MB 64MB 60: if during replication the buffer stays above 64MB for 60 seconds, or exceeds 256MB in one burst, replication stops and is considered failed. (5) After receiving the RDB, the slave node clears its old data and loads the RDB into its own memory, serving requests from the old data version until the load completes. (6) If AOF is enabled on the slave, it immediately executes BGREWRITEAOF to rewrite the AOF
Generating the RDB, copying it over the network, clearing the slave's old data, and the slave's AOF rewrite all take time
If the amount of data to be replicated is 4GB to 6GB, full replication can take one and a half to two minutes
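A rough back-of-the-envelope check of that estimate, assuming network transfer at about 100 MB/s (a loaded gigabit NIC, an assumption for the sketch); actual full replication also spends time on RDB generation, disk I/O, loading and AOF rewrite, which is why the wall-clock total lands around 1.5 to 2 minutes.

```python
def full_sync_transfer_seconds(data_gb: float, nic_mb_per_s: float = 100.0) -> float:
    """Network transfer time alone for the RDB, ignoring bgsave, disk writes
    and the slave's load/rewrite phases (assumed NIC throughput: ~100 MB/s)."""
    return data_gb * 1024 / nic_mb_per_s

# A 6GB data set already takes about a minute just to ship over the wire,
# which is why the 60-second repl-timeout can matter.
```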
4. Incremental replication
(1) If the master-slave network connection drops during full replication, the slave reconnects to the master and triggers incremental replication. (2) The master fetches the missing data directly from its own backlog (1MB by default) and sends it to the slave node. (3) The master locates the data in the backlog using the offset in the psync sent by the slave
5. The heartbeat
Both the primary and secondary nodes send heartbeat information to each other
By default, the master sends a heartbeat every 10 seconds, and the slave node sends a heartbeat every 1 second
6. Asynchronous replication
Each time the master receives a write command, it first writes the data internally and then asynchronously sends it to the slave nodes
5. How does Redis master/slave architecture achieve high availability
1. What is 99.99% high availability?
Architecturally, high availability, 99.99% high availability
Academically, 99.99% comes from a formula: the time the system is available divided by the total time. If, out of 365 days, your system is available to the outside world 99.99% of the time, that is 99.99% high availability
System available time / total time = availability, plus a whole bunch of finer-grained definitions of the various time concepts
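The availability formula translates directly into a downtime budget; a quick sketch:

```python
def downtime_budget_minutes(availability: float, days: float = 365.0) -> float:
    """Maximum allowed downtime per period for a given availability target."""
    return days * 24 * 60 * (1 - availability)

# 99.99% over a year allows roughly 52.6 minutes of total downtime.
```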
2. What is not available in Redis? Single instance unavailable? Master/slave schema unavailable? What are the consequences of unavailability?
What does it mean for Redis to be unavailable?
- The Redis process died
- The machine on which the Redis process is running died
Single instance unavailable? Master/slave architecture unavailable?
- The failure of one slave does not affect availability; other slaves can still serve external queries with the same data
- If the master node dies, data can no longer be written, and the slave nodes become useless because there is no master to copy data from; the system is then essentially unusable
What are the consequences of unavailability?
The high-concurrency, high-performance cache becomes unavailable; traffic then exceeds MySQL's maximum capacity, heavy concurrent traffic floods into MySQL, and MySQL breaks down
3. How can Redis achieve high availability?
The high availability architecture of Redis is called failover, also known as master/slave switchover.
When a master node fails, it automatically detects and switches a slave node to the master node. This process is called active/standby switchover. This process implements high availability under the master-slave architecture of Redis.
If the master fails, it is switched over to a new master in a very short time, so Redis may be unavailable for only a few seconds to a few minutes.
It all depends on Sentinel nodes: the sentinels.
Sixth, the sentinel
1. Introduction of sentries
Sentinel, whose name literally means "sentry"
Sentinel is a very important component in redis cluster architecture. Its main functions are as follows
(1) Cluster monitoring: monitor whether the Redis master and slave processes are working properly. (2) Message notification: if a Redis instance fails, the sentinel sends an alarm notification to the administrator. (3) Failover: if the master node goes down, automatically transfer its role to a slave node. (4) Configuration center: notify clients of the new master address after a failover
The sentinels themselves are also distributed, operating as a cluster of sentinels, working cooperatively with each other
(1) Failover: deciding that a master node is down requires the agreement of a majority of the sentinels, which involves a distributed election. (2) Even if some sentinel nodes go down, the sentinel cluster can still work normally; it would be embarrassing if an important component of a high-availability failover mechanism were itself a single point of failure
The current version is Sentinel 2, which rewrites much of the code of Sentinel 1, mainly to make the failover mechanism and algorithms more robust and simple
2. Core knowledge of sentinels
(1) A sentinel cluster needs at least 3 instances to ensure its own robustness. (2) The sentinel + Redis master-slave deployment architecture does not guarantee zero data loss; it only guarantees high availability of the Redis cluster. (3) For the fairly complex sentinel + Redis master-slave deployment architecture, test and rehearse adequately in both the test environment and the production environment
3. Why can’t the Redis Sentinel cluster work properly with only 2 nodes?
A sentinel cluster must deploy more than two nodes
If the sentinel cluster has only two sentinel instances deployed, with quorum = 1:
+----+ +----+
| M1 |---------| R1 |
| S1 | | S2 |
+----+ +----+
Configuration: quorum = 1
If the master goes down, as long as one of S1 and S2 thinks the master is down (quorum = 1), a switchover can happen, and one of S1 and S2 will be elected to perform the failover
At the same time, a majority of the sentinels must also be running to authorize the failover; the majority of 2 sentinels is 2, and both are still running here, so failover is allowed
However, if the entire machine running M1 and S1 goes down, only one sentinel is left; then there is no majority to authorize a failover, even though the other machine still has R1
4. Classic 3-node sentinel cluster
+----+
| M1 |
| S1 |
+----+
|
+----+ | +----+
| R2 |----+----| R3 |
| S2 | | S3 |
+----+ +----+
Configuration: quorum = 2, majority = 2
If M1's machine goes down, two sentinels remain; S2 and S3 can agree that the master is down and then elect one of themselves to perform the failover
The majority of 3 sentinels is 2, so with the remaining 2 sentinels running, the failover is allowed
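The quorum/majority arithmetic in the two diagrams can be written down directly. This is a simplified sketch of the rule, not Sentinel's implementation:

```python
def majority(n_sentinels: int) -> int:
    """Smallest strict majority of a sentinel cluster."""
    return n_sentinels // 2 + 1

def failover_allowed(alive: int, total: int, sdown_votes: int, quorum: int) -> bool:
    """A failover needs quorum votes that the master is down, plus a majority
    of sentinels still alive to authorize the elected leader (simplified)."""
    return sdown_votes >= quorum and alive >= majority(total)
```

This captures why the two-node cluster fails when the M1+S1 machine dies (1 surviving sentinel is less than the majority of 2), while the three-node cluster with M1 down still succeeds (2 survivors meet the majority of 3).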
7. Data loss in a Redis sentinel active/standby switchover
Course outline
1. Two data-loss scenarios 2. Solving the data loss caused by asynchronous replication and split-brain
1. Two cases of data loss
Data may be lost during the active/standby switchover
(1) Data loss caused by asynchronous replication
Because master -> slave replication is asynchronous, the master may go down before some data has been replicated to the slave, and that portion of data is lost
(2) Data loss caused by split-brain
Split-brain means that the machine a master runs on is suddenly cut off from the normal network and cannot reach the other slave machines, while in fact the master itself is still running
At this point, the sentry might assume that the master is down and initiate an election, switching the other slaves to master
At this point, there will be two masters in the cluster, which is called split brain
At this point a slave is promoted to master, but any data that clients wrote to the old master before switching over to the new master may be lost
Therefore, when the old master recovers, it is attached to the new master as a slave; its own data is cleared and it re-copies data from the new master
2. Solve data loss caused by asynchronous replication and split brain
min-slaves-to-write 1 min-slaves-max-lag 10
There must be at least one slave and the delay of data replication and synchronization cannot exceed 10 seconds
If the data replication and synchronization delay for all slaves exceeds 10 seconds, then the master will not receive any more requests
The above two configurations can reduce data loss caused by asynchronous replication and split brain
(1) Reduce data loss in asynchronous replication
With min-slaves-max-lag configured, if a slave's replication and ack lag grows too long, the master decides that too much data would be lost were it to go down, and rejects write requests. This keeps the data lost because the master failed to sync some data to a slave within a manageable bound
(2) Reduce the data loss of split-brain
If a master has split-brained and lost contact with the other slaves, the above two configurations ensure that a master that cannot keep sending data to the required number of slaves, and has received no ack from a slave for more than 10 seconds, rejects write requests
In this way, the old master will not accept new data from the client, thus avoiding data loss
This configuration ensures that if the master loses its connections to the slaves and sees no ack from any slave for 10 seconds, it rejects all new write requests
So in the split-brain scenario, at most about 10 seconds' worth of data is lost
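The effect of min-slaves-to-write / min-slaves-max-lag can be sketched as a simple health check the master runs before accepting a write. This is an illustrative model, not Redis internals:

```python
def master_accepts_writes(slave_lags_s,
                          min_slaves_to_write=1,
                          min_slaves_max_lag=10):
    """Sketch: the master keeps accepting writes only while at least
    `min_slaves_to_write` slaves have acked within `min_slaves_max_lag`
    seconds. `slave_lags_s` is each connected slave's ack lag in seconds."""
    healthy = sum(1 for lag in slave_lags_s if lag <= min_slaves_max_lag)
    return healthy >= min_slaves_to_write
```

A split-brained old master sees all its slaves' lags grow past 10 seconds (or loses them entirely), so this check starts failing and it stops accepting new writes, bounding the loss.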
8. An in-depth analysis of several underlying principles of the Redis sentinel
1. Sdown and odown conversion mechanisms
Sdown and odown are two failure states
Sdown is a subjective outage: a single sentinel, on its own, believing that a master is down constitutes a subjective outage
Odown is an objective outage: if a quorum of sentinels believes that a master is down, that is an objective outage
The sdown condition is very simple: if a sentinel's pings to a master go unanswered for more than the number of milliseconds specified by down-after-milliseconds, the sentinel subjectively considers the master down
The sdown-to-odown conversion is also simple: if, within a specified period, a sentinel learns that a quorum of other sentinels also consider the master sdown, it considers the master odown
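The two checks can be stated in a few lines (a simplified sketch; 30000 ms is Sentinel's default down-after-milliseconds):

```python
def is_sdown(last_pong_ms_ago: int, down_after_milliseconds: int = 30000) -> bool:
    """One sentinel marks the master sdown when it has not answered a ping
    within down-after-milliseconds."""
    return last_pong_ms_ago > down_after_milliseconds

def is_odown(sdown_reports: int, quorum: int) -> bool:
    """sdown escalates to odown once at least `quorum` sentinels report the
    master as subjectively down."""
    return sdown_reports >= quorum
```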
2. Automatic discovery mechanism of sentinel cluster
Sentinels discover each other through Redis' pub/sub system: each sentinel publishes a message to the __sentinel__:hello channel, and all other sentinels can consume that message and become aware of its presence
Every two seconds, each sentinel publishes to the __sentinel__:hello channel of every master+slaves group it monitors a message containing its own host, IP and run ID, as well as its monitoring configuration for that master
Each sentinel also subscribes to the __sentinel__:hello channel of every master+slaves group it monitors, and thereby senses the other sentinels that are monitoring the same master+slaves
The sentinels also exchange master monitoring configurations with each other this way, keeping the monitoring configuration synchronized
3. Automatic correction of slave configuration
The sentry is responsible for automatically correcting some configurations of the slave. For example, if the slave is to become a potential master candidate, the sentry ensures that the slave is copying data from the existing master. If the slaves are connected to the wrong master, such as after a failover, then the sentinels ensure that they are connected to the correct master
4. Slave -> Master election algorithm
If a master is considered odown and the majority of sentinels authorize a master/standby switchover, one sentinel performs the switchover, and to do so a slave must first be elected
Some information about the slave is taken into account
(1) Disconnection duration (2) Slave priority (3) Replication offset (4) RUN ID
If a slave has been disconnected from the master for more than 10 times down-after-milliseconds, plus the length of time the master has been down, the slave is considered unfit to be elected master:
(down-after-milliseconds * 10) + milliseconds_since_master_is_in_SDOWN_state
The remaining slaves are then sorted:
(1) Sort by slave priority: the lower the priority value, the higher the precedence. (2) If slave priorities are equal, compare the replica offsets: the slave that has replicated more data, with the more up-to-date offset, takes precedence. (3) If both of the above are the same, select the slave with the smaller run ID
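The filter and sort above can be sketched as follows; the dict fields (disconnect_ms, priority, offset, run_id) are an illustrative shape, not Sentinel's actual data structures:

```python
def elect_slave(slaves, down_after_ms, master_sdown_ms):
    """Sketch of the slave election: filter out slaves that have been
    disconnected too long, then sort by (priority asc, offset desc, run_id asc)."""
    max_disconnect = down_after_ms * 10 + master_sdown_ms
    eligible = [s for s in slaves if s["disconnect_ms"] <= max_disconnect]
    # Lower priority value wins; larger offset (more data) wins; smaller
    # run ID breaks remaining ties.
    eligible.sort(key=lambda s: (s["priority"], -s["offset"], s["run_id"]))
    return eligible[0] if eligible else None
```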
5. Quorum and majority
Each time the sentinels perform a master/standby switchover, first a quorum of sentinels must consider the master odown; then a sentinel is elected to do the switchover, and that sentinel must also be authorized by the majority of sentinels
If quorum < majority, then authorization by a majority of sentinels is enough; for example, with five sentinels the majority is 3, and if quorum is set to 2, the authorization of 3 sentinels allows the switchover
However, if quorum >= majority, then a number of sentinels equal to the quorum must authorize it; for example, with five sentinels and quorum = 5, all five sentinels must agree before the switchover can take place
6. Configuration epoch
A sentinel monitors a set of Redis master+slaves and keeps the corresponding monitoring configuration
The sentinel performing the switchover obtains a configuration epoch for the new master (slave -> master) it is promoting. This is a version number, and the version number must be unique for every switchover
If the first elected sentinel fails to complete the switchover, the other sentinels wait for failover-timeout and then take over the switchover, obtaining a new configuration epoch as the new version number
7. Configuration propagation
After the switchover completes, the sentinel performing it updates the master configuration locally and synchronizes it to the other sentinels via pub/sub messaging
This is where the version number matters: messages are published and consumed over a channel, so when a sentinel completes a new switchover, the new master configuration carries the new version number
The other sentinels update their own master configuration according to whichever version number is larger
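The version-number rule can be sketched as: a sentinel only adopts a master configuration carrying a higher configuration epoch, so a stale failover result can never overwrite a newer one. `SentinelView` and `on_hello` are hypothetical names for the sketch.

```python
class SentinelView:
    """Sketch: adopt a master config only if its configuration epoch is newer."""

    def __init__(self):
        self.master_addr = None
        self.epoch = 0

    def on_hello(self, master_addr, epoch):
        # Stale announcements (lower or equal epoch) are simply ignored.
        if epoch > self.epoch:
            self.master_addr = master_addr
            self.epoch = epoch
```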
Conclusion
Redis high concurrency: a master-slave architecture, one master with many slaves, is generally enough for most projects. A single master writes data, handling tens of thousands of QPS on one machine, and multiple slaves serve queries; several slave instances together can provide 100,000 read QPS.
Besides high concurrency, Redis sometimes also needs to hold a large amount of data. With one master and many slaves, each instance holds the complete data set; if Redis has 10GB of memory, it can hold at most 10GB of data. If your cache needs to hold a lot of data (tens or hundreds of gigabytes, or even terabytes), then you need Redis cluster, and with Redis cluster you can also reach hundreds of thousands of concurrent reads and writes per second.
Redis high availability: if you deploy a master-slave architecture, adding sentinels on top of it achieves high availability; when any instance goes down, an automatic master/standby switchover is performed.