With the shift to remote work, demand from online-conferencing users surged. Tencent Meeting expanded by one million CPU cores of cloud servers in eight days, and its Redis clusters completed expansions of dozens of times their original capacity in as little as half an hour. How does Tencent Cloud Redis do it? This article is based on Wu Xufei's talk at the "Yunjia Community Online Salon" and walks through the practice and challenges of lossless expansion in Tencent Cloud Redis.
I. Challenges posed by the epidemic
The challenges brought by the epidemic this year are obvious: remote-work and online-education users skyrocketed, and from January 29 to February 6 an average of 1.5 million hosts were added. These businesses need 7×24 uninterrupted service; remote office work and online education cannot stop, and stopping for even a minute affects the study and work of millions of people, so they place very high demands on us.
Redis is widely used in online meetings and remote work, and Tencent Cloud Redis supports Tencent Meeting, whose user base was growing rapidly. The massive request volume in turn demanded that Redis be able to expand quickly. In one real case, a business started with 3 shards, expanded to 5 within a day, found that was still not enough and went to 12 shards, and then expanded again the next day.
II. The open-source Redis expansion scheme
1. Tencent Cloud's Redis cluster architecture
Tencent Cloud Redis differs from vanilla Redis in that we add a Proxy to improve usability, because not every language has a cluster-aware client. To stay compatible with those clients we did a lot of compatibility work in the Proxy: it supports the most common client usage patterns, manages routing automatically, transparently follows MOVED and ASK redirections during a switch, and adds end-to-end slow-query statistics, among other functions. Redis's default slowlog records only command execution time and excludes the network round trip; the Proxy can collect end-to-end slow logs, excluding delay caused by a slow local physical machine, which reflects the latency the business actually experiences more accurately.
Open-source Redis also has no support for multiple accounts, so we moved that function into the Proxy as well, along with read/write separation, which is very helpful for customers: they do not need to change any code, they just click on the console and we automatically split reads and writes in the Proxy. These functions could also be built into Redis itself, so why do we put them in the Proxy?
The main consideration is safety. Redis holds user data; if we iterated on features inside Redis itself, the frequent upgrades would pose a real threat to the safety of that data.
2. How Tencent Cloud Redis expands
How does Tencent Cloud Redis expand? We scale along three dimensions. The first is single-node capacity: for example, a cluster of three shards of 4 GB each can have each node expanded to 8 GB, which works as long as the machine hosting the node has enough capacity. The second is replica expansion: a customer running one master and one slave can enable read/write separation so that all reads hit the slave, and in that case the simplest way to raise read QPS is to add replicas, which also improves data safety. The third, and the one we mainly run into in recent real workloads, is shard expansion. The number of shards in a cluster is ultimately bounded by CPU processing power, so adding shards is equivalent to adding CPU, and expanding processing power also indirectly expands memory.
Tencent Cloud's earliest version reused the open-source expansion mechanism as-is. Briefly, it works as follows:
(1) Preparation
First, the Proxy must calculate the memory used by each slot; otherwise, once slots are moved, the new shard could run out of memory. After the per-slot memory is known, an allocation algorithm decides which slots go to the target shard. Those slots are then set to importing on the target node and migrating on the source node.
There is a catch here. In everyday development the order of these two settings seems not to matter, but it does. If you set the slot to migrating on the source node before setting it to importing on the target node, then in the window between the two commands, a command for that slot that reaches the source node and misses its key is redirected to the target with an ASK. Because the target has not yet been marked importing, it redirects the command back to the source, and the redirection loops forever.
Fortunately, common clients (such as Jedis) limit the number of redirects and throw an error once the limit is reached.
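To make the ordering concrete, here is a minimal sketch of the preparation step (using the Python redis-py client; the hosts, port, slot number, and node IDs are placeholders): the slot is marked importing on the target before it is marked migrating on the source.

```python
import redis

SLOT = 1234                     # slot being moved (placeholder)
SRC_ID = "src-node-id"          # cluster node IDs are placeholders
DST_ID = "dst-node-id"

src = redis.Redis(host="src-host", port=6379)
dst = redis.Redis(host="dst-host", port=6379)

# Mark the target first: it must already know the slot is importing
# before the source starts redirecting key misses to it with ASK.
dst.execute_command("CLUSTER", "SETSLOT", SLOT, "IMPORTING", SRC_ID)

# Only then mark the source, so any ASK it issues points at a node
# that will accept the redirected request instead of bouncing it back.
src.execute_command("CLUSTER", "SETSLOT", SLOT, "MIGRATING", DST_ID)
```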
(2) Relocation
Once this preparation is done, the next step is the relocation itself: the keys of a slot are obtained from the source node and moved, one by one, from the source process to the target node. This operation is synchronous; what does that mean?
Until a MIGRATE command finishes, the source process cannot handle other client requests. Internally, the source temporarily creates a socket to the target node, executes the command synchronously, and deletes the local key once the move succeeds. The cluster as a whole remains available to serve requests during the relocation.
If a read for a key in that slot arrives at the source process and the key still has data there, the request is served normally.
What if the key no longer exists there? The source sends an ASK back to the client. On receiving the ASK, the client knows the data is not in that process, immediately sends ASKING to the target node, and then issues the command there. What does this buy us?
The number of keys in the slot on the source side only decreases, never increases, because new keys for the slot are written to the target node. As the relocation proceeds, the key count on the source shrinks while the target's grows. Users see no errors, only a forwarding delay of a few tenths of a millisecond.
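A rough sketch of the per-slot relocation loop just described (redis-py again; the hosts, batch size, and timeout are placeholder assumptions): keys of the slot are listed on the source and moved with synchronous MIGRATE until the slot is empty.

```python
import redis

SLOT, BATCH, TIMEOUT_MS = 1234, 100, 5000       # placeholders
src = redis.Redis(host="src-host", port=6379)

while True:
    # Ask the source which keys still live in this slot.
    keys = src.execute_command("CLUSTER", "GETKEYSINSLOT", SLOT, BATCH)
    if not keys:
        break        # slot is empty on the source: ready for the switch
    # Synchronous move: the source blocks until the target confirms,
    # then deletes its local copies of the keys.
    src.execute_command("MIGRATE", "dst-host", 6379, "", 0, TIMEOUT_MS,
                        "KEYS", *keys)
```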
(3) Switching
When does the switch happen? When the source process finds that a slot no longer contains any data, all of that slot's data has been transferred to the target process. What happens then? The CLUSTER SETSLOT command assigning the slot is sent to the target node first, then to the source node, and finally to all the other nodes in the cluster.
This is done to update the routing faster, instead of waiting for the cluster's own gossip protocol to propagate it, which is slow. As with the pre-migration setup, there is a small caveat in the ordering.
If the slot assignment is sent to the source node first, the source returns MOVED to the client, because it now believes the slot permanently belongs to the target node. After a MOVED, the client does not send ASKING to the target; since the target, still in importing state, receives the command without an ASKING, it answers with a MOVED back to the source. The request bounces back and forth like a ping-pong ball, and the client perceives an error.
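A minimal sketch of the switch-over ordering (redis-py; hosts and node IDs are placeholders): the slot assignment goes to the target first, then the source, then every other master.

```python
import redis

SLOT = 1234
DST_ID = "dst-node-id"                                  # placeholder node ID
dst = redis.Redis(host="dst-host", port=6379)
src = redis.Redis(host="src-host", port=6379)
others = [redis.Redis(host=h, port=6379) for h in ["node3-host"]]  # remaining masters

# Target first, then source, then everyone else, so no request ever
# bounces between two nodes that disagree about who owns the slot.
for node in [dst, src] + others:
    node.execute_command("CLUSTER", "SETSLOT", SLOT, "NODE", DST_ID)
```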
III. Challenges of lossless expansion
1. The big key problem
Migrating this way works, and customers with normal access patterns do not notice it. But there are still some challenges. The first is the big key.
As mentioned above, the relocation is synchronous, and a synchronous relocation blocks the source. What determines how long it blocks? Not the network speed but the size of the key being moved, because relocation is done key by key: a whole list is one key and a whole hash table is one key, and either can hold tens of millions of elements, so a synchronous move can block for a long time. Synchronously moving 100 MB takes one or two seconds just to package, so the customer sees a one-to-two-second stall and every access times out. Business timeouts for Redis are usually around 200 milliseconds, sometimes 100 milliseconds; if a synchronous key move takes more than a second, there is a flood of timeouts and the customer's business slows down.
If the blocking lasts more than 15 seconds, counting the time to package the key, transmit it over the network, and load it on the target, the cluster decides the blocked source node has failed and promotes its slave to master. The promoted slave, however, does not carry the slot's migrating state. The migration target node then receives the new master's claim of ownership over the slot, and because the target is in importing state, it refuses to recognize that the new master owns the slot. From that node's point of view, slot coverage is now incomplete: some slots have no node serving them, and the cluster state becomes FAIL.
Once this happens, if customers keep writing, new keys land on the source node because the migrating flag is gone, while the same keys may also exist on the target node. Even with manual intervention it is hard to decide which side holds the newer data and which side should be used, so problems like this hurt availability and reliability for the user.
This is the core problem of open-source Redis migration: it blocks easily, stops serving, and can even threaten data safety. How does the open-source approach work around it? The usual rule is: if a slot contains a key larger than, say, 100 MB or 200 MB, do not move that slot. But the threshold is hard to choose. Because one MIGRATE command moves many keys at a time, a threshold that is too small means most slots cannot be migrated, while a threshold that is too large means most slots will cause stalls. You never know when a big key will be hit, so as long as the migration is not finished the customer stays on edge.
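A hedged sketch of the kind of pre-check this workaround implies (redis-py; the 100 MB threshold, host, and slot are placeholders): scan a slot on the source and skip it if any single key exceeds the threshold.

```python
import redis

SLOT, THRESHOLD = 1234, 100 * 1024 * 1024       # placeholder slot / 100 MB
src = redis.Redis(host="src-host", port=6379)

def slot_is_safe_to_move(slot: int) -> bool:
    total = src.execute_command("CLUSTER", "COUNTKEYSINSLOT", slot)
    if total == 0:
        return True
    keys = src.execute_command("CLUSTER", "GETKEYSINSLOT", slot, total)
    for key in keys:
        # MEMORY USAGE estimates the in-memory footprint of one key.
        if (src.memory_usage(key) or 0) > THRESHOLD:
            return False        # a big key would block the synchronous MIGRATE
    return True

print(slot_is_safe_to_move(SLOT))
```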
2. The Lua problem
Besides moving keys wholesale, we also have a problem with Lua.
If a business loads its Lua code with SCRIPT LOAD at startup and then runs it with EVALSHA, the expansion creates brand-new processes transparently to the business. The keys are moved the open-source way, so they end up on the target node, but the Lua code does not. Whenever EVALSHA is executed on the new node, it fails.
The reason is simple: key migration does not carry scripts, and Redis has no command to move Lua code to another process (other than master/slave replication).
There is no open-source fix for this. What can the business do? Change the code: when EVALSHA fails with a script-does-not-exist error, proactively run SCRIPT LOAD again. But for many businesses that is unacceptable, because adding the code and going through another release means the business suffers for a long time.
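The business-side workaround described here, sketched with redis-py (the script body and key are purely illustrative): fall back to SCRIPT LOAD and retry when EVALSHA reports that the script is missing on the node now serving the key.

```python
import redis
from redis.exceptions import NoScriptError

r = redis.Redis(host="redis-host", port=6379)
SCRIPT = "return redis.call('GET', KEYS[1])"     # illustrative script
sha = r.script_load(SCRIPT)                      # loaded once at startup

def run(key: str):
    try:
        return r.evalsha(sha, 1, key)
    except NoScriptError:
        # After expansion the key may be served by a new process that has
        # never seen the script: reload it there and retry.
        r.script_load(SCRIPT)
        return r.evalsha(sha, 1, key)
```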
3. Multi-key commands / reading from the slave
Another challenge is multi-key commands, and it is a serious one. If one key of an MGET or MSET is still in the source process while another key is missing there or already in the target process, the command fails outright.
Native Redis cannot solve this either: whenever keys are being moved, it is easy for part of an MGET's keys to sit on the source and part on the target.
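For illustration, a hedged sketch of what the client side can do when this happens (redis-py; the retry policy is our assumption, not a fix offered by Redis): multi-key commands whose keys are split across a migrating slot fail with a TRYAGAIN error, and the only option is to retry until those keys have finished moving.

```python
import time
import redis
from redis.exceptions import ResponseError

r = redis.Redis(host="redis-host", port=6379)

def mget_with_retry(keys, attempts=5):
    for _ in range(attempts):
        try:
            return r.mget(keys)
        except ResponseError as err:
            # During slot migration, multi-key commands whose keys are split
            # between source and target are rejected with TRYAGAIN.
            if "TRYAGAIN" not in str(err):
                raise
            time.sleep(0.05)                    # placeholder back-off
    raise RuntimeError("keys still split across a migrating slot")
```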
Beyond that there is another problem, unrelated to shard expansion but a defect in the open-source version nonetheless. Redis read/write separation, which our Proxy provides by sending all read commands to the slave, reduces the performance pressure on the master. It is very convenient for the business: turn it on with a click, and turn it off immediately if it does not work out.
The obvious problem appears when each shard holds a lot of data, say 20 GB or 30 GB, and a slave has only just been attached. The slave's identity promotion is independent of master/slave data synchronization: the cluster may already recognize the slave while it is still waiting for data from the master, because packaging 20 GB of data can take several minutes (depending on the data format). If a client sends a read to the slave at this point, it gets an error; and if the RDB has already been sent and the slave is loading it when the read arrives, the client receives a LOADING error.
Both errors are unacceptable, and customers clearly perceive them. Replica expansion was meant to improve performance, yet the expansion itself lasts several minutes to more than ten minutes and produces many business errors. The root cause lies in two basic Redis mechanisms, identity promotion and master/slave data synchronization, which are completely independent of each other; we had to modify this behavior to solve the problem, as detailed below.
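A hedged sketch of one way a proxy could gate reads until a newly attached slave is actually usable (redis-py; the readiness criteria are our assumption, not the fix Tencent ultimately implemented): only route reads once the replication link is up and the RDB load has finished.

```python
import redis

slave = redis.Redis(host="new-slave-host", port=6379)   # placeholder host

def slave_ready_for_reads() -> bool:
    repl = slave.info("replication")
    pers = slave.info("persistence")
    # Route reads to the slave only after the initial sync completed and
    # it is no longer loading the RDB received from the master.
    return repl.get("master_link_status") == "up" and pers.get("loading") == 0
```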
The last point is speed. As mentioned above, moving keys affects the business: because the operation is synchronous it is slow, and the business sees obvious latency, so it naturally wants the migration to finish as soon as possible. But the migration moves keys one by one and depends heavily on how fast keys can be moved, and it cannot run at full speed: Redis is single-threaded, with a practical ceiling of roughly 80,000 to 100,000 operations per second, and if keys are moved too fast the migration eats into the user's CPU. The user is already slowed down by the synchronous relocation, and the relocation also occupies their CPU, which makes things even worse, so this scheme can never migrate particularly fast. Moving 10,000 keys per second, for example, already consumes about 12.5% of that ceiling, or worse, which users cannot accept.
IV. Other industry solutions
If the open-source version has so many problems, why not just change it? There are many reasons; it is simply not easy to change.
Many people have reported how painful shard migration and expansion is in Redis, but there has been no response from the author and no clear sign of a fix so far. The most common solution in the industry is a DTS-based scheme.
DTS synchronization works much like redis-port: DTS masquerades as a slave and initiates a full synchronization from the source via SYNC or PSYNC; after the full sync it switches to incremental sync. DTS receives this data, translates the RDB into commands, and writes them to the target instance. The target therefore does not need the same number of shards as the source; DTS does the work in the middle.
Once the DTS migration has fully stabilized, incremental data keeps being pushed from the source to the target, and at that point you can consider switching over.
The switchover first checks that the total DTS delay, the lag between the source masters and the target masters, is within a threshold. If the backlog is small enough, you disconnect the clients, wait a short while for the target instance to catch up, then point the LB at the new instance and delete the source instance. That completes an expansion, and this is the common solution in the industry.
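Because DTS joins the source as a slave, a switchover controller can read the backlog straight out of the source master's replication info. A minimal sketch of such a lag check (redis-py; the DTS address and threshold are placeholders):

```python
import redis

src_master = redis.Redis(host="src-master-host", port=6379)
DTS_ADDR, MAX_LAG = "10.0.0.9", 1024            # placeholder DTS IP / byte lag

def dts_caught_up() -> bool:
    info = src_master.info("replication")
    master_offset = info["master_repl_offset"]
    # DTS appears as one of the slaveN entries because it attached via PSYNC.
    for i in range(info.get("connected_slaves", 0)):
        s = info.get(f"slave{i}", {})
        if s.get("ip") == DTS_ADDR:
            return master_offset - s.get("offset", 0) <= MAX_LAG
    return False
```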
Which problems does the DTS scheme solve? The big key problem is solved, because DTS synchronizes from the source the way a slave does rather than moving keys synchronously. Is the Lua problem solved? Yes: the RDB that DTS receives carries the Lua information, which it can translate into SCRIPT LOAD commands. Multi-key commands are also fine, normal user access is unaffected, and users notice nothing before the switchover. Migration speed also improves: the source is translated from an RDB and the translated commands are written to the target concurrently, so the copy can run at full speed. It is necessarily faster than open-source key migration, because the target instance serves no traffic before the switchover and can be written to flat out, so migration speed is guaranteed. HA, availability, and reliability during the migration are also fine. Of course, availability does take a hit at the switchover: clients are disconnected for 30 seconds to a minute, during which the service is unavailable, a short window that affects the user's availability.
Does DTS have drawbacks? It does. The first is complexity: the scheme depends on the DTS component, an external system that is complex and error-prone. The second is availability: as mentioned, the switchover kicks clients off for 30 seconds to a minute, during which the service is completely inaccessible. There is also a cost problem: during the migration, both the source and the target must exist at full size, which is very expensive for large migrations. If every customer added one shard at the same time, we would need twice the resources of the whole fleet, and many expansions would fail. That is fatal for a cloud service: in theory it means keeping half of the resources idle just to guarantee that expansions succeed, which is unacceptable. For these reasons we ultimately did not use DTS.
V. Tencent Cloud Redis expansion plan
Our approach is as follows. First, our goal is to depend on no third-party components and to drive the migration with commands alone. Second, unlike DTS, we must not have to reserve full resources for both the old and the new instances, which would put considerable pressure on us. The option we finally chose is to migrate by slot. The specific steps are:
The first step is to calculate the memory size of each slot and how many slots need to move; after allocation we also know which slots go to the target node. In the open-source scheme the source process is set to migrating so that new keys are automatically written to the target process, but in this version that is not necessary.
The source process receives a sync command, performs a fork, generates an RDB containing all the data of the slots being synchronized, and sends it to the target process.
The source process tracks the keys in each slot, so the RDB it generates contains only the keys of the slots being moved and is transmitted to the target process. The target loads the RDB and then receives AOF-style traffic, which likewise contains only incremental data for the relevant slots; the source never sends the target data that does not belong to those slots.
A target process can establish such connections from one or more source nodes. Once all the connections are established and synchronization is healthy, a failover can be initiated as soon as the offset lag is small enough, using the same mechanism as Redis's official manual failover. Before the failover the target node serves no traffic, which is the biggest difference from the open-source scheme.
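A minimal sketch of that final check-and-switch step (redis-py; the lag threshold, hosts, and the use of CLUSTER FAILOVER to drive the manual-failover mechanism are assumptions for illustration):

```python
import redis

source = redis.Redis(host="src-master-host", port=6379)
target = redis.Redis(host="target-host", port=6379)
MAX_LAG = 512                                   # placeholder byte threshold

def try_switch() -> bool:
    src_offset = source.info("replication")["master_repl_offset"]
    tgt = target.info("replication")
    # While syncing, the target behaves like a replica, so compare what it
    # has applied against what the source has produced.
    lag = src_offset - tgt.get("slave_repl_offset", 0)
    if tgt.get("master_link_status") == "up" and lag <= MAX_LAG:
        # Promote the target through the manual-failover path once the
        # remaining backlog is small enough.
        target.execute_command("CLUSTER", "FAILOVER")
        return True
    return False
```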
With this scheme, the big key problem is solved, because the data is produced by a forked child process rather than by the source node relocating keys one at a time. The target serves no traffic before the switchover, so it does not matter if it spends a minute or two loading; the customer never knows the node is loading.
The Lua problem is also solved: the new node loads the RDB, which contains the Lua information. The same goes for multi-key commands, because we never interfere with customers' normal access; multi-key commands keep working exactly as before. Migration is also much faster than single-key migration, because whole batches of slots are packed and shipped in RDB form.
As for the impact on HA, does it matter if a node fails during migration? In the open-source scheme, if a node that is migrating fails, one shard of the cluster stops serving. In our scheme this problem does not exist, and service continues even across the switchover, because the target node serves no traffic until the switch.
There is also the availability question: our scheme never breaks client connections. Clients are unaffected from start to finish; only the instant of switchover has a tiny, millisecond-level impact. Is the cost problem solved? Yes: during expansion only the final nodes are created, with no intermediate copies, so nothing is wasted.
VI. Q&A
Q: Does changing the number of shards in the cluster change the number of slots?
A: The number of slots does not change; it is always 16384, consistent with the open-source version.
Q: How do you verify, before the migration, that the new node will not overflow its memory?
A: As described above, there is a preparation phase in which we calculate the memory size of all the slots. We run a scan directly on the slave and compute the sizes in real time, which is basically accurate, rather than using the previous day's backup data; the scan's CPU usage is kept under control. From the per-slot totals we reserve headroom (a factor of 1.3) to ensure the target cannot be written full during the migration. Even if it were written full, only the migration process would fail; the user would not be affected.
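A rough sketch of the kind of per-slot memory scan this answer describes, run against a slave (redis-py; the SCAN batch size and host are placeholders, and the 1.3 headroom factor follows the answer):

```python
import redis
from collections import defaultdict

slave = redis.Redis(host="slave-host", port=6379)       # placeholder host

def estimate_slot_sizes() -> dict:
    sizes = defaultdict(int)
    # Walk all keys on the slave; SCAN bounds the cost of each iteration,
    # which helps keep the CPU usage of the scan under control.
    for key in slave.scan_iter(count=1000):
        slot = slave.execute_command("CLUSTER", "KEYSLOT", key)
        sizes[slot] += slave.memory_usage(key) or 0
    return sizes

# Reserve 1.3x headroom on the target so the migration cannot write it full.
needed = {slot: int(size * 1.3) for slot, size in estimate_slot_sizes().items()}
```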
Q: Do you have fallback plans and emergency measures for problems that might occur during a Redis expansion?
A: With the new expansion scheme, if a problem occurs the customer is not affected; at worst there are leftover resources, which we currently clean up with tools. Data safety itself relies on master/slave replication plus daily backups.
Q: Capacity has to be expanded during peak hours, but the request rate eventually returns to normal and the extra capacity then looks redundant. How can a Redis cluster quickly release the excess capacity while still making it easy to expand again for the next peak?
A: What you want is essentially serverless: pay-as-you-go with automatic scaling. Today most databases are offered as PaaS services; the PaaS layer has turned the manual resource management of the past into a semi-automatic process, but scale-out and scale-in still require operations work. Serverless is the next step, although even serverless may still face the problem of requiring manual resource scheduling, especially for very large computing workloads.
Q: If a target's slots come from multiple shards, does it synchronize from multiple nodes at the same time?
A: Yes, it does.
Q: How is new data generated during the slot migration synchronized?
A: Through a mechanism similar in concept to the AOF, it is streamed to the target process. The difference from the normal AOF stream sent to a slave is that only the data related to the target slots is synchronized.
Q: Are you still hiring?
A: Tencent Cloud Redis is still recruiting. Please send your resume to [email protected], or submit it on the official website. You can also search for the Tencent Cloud Database public account and check the article pushed on the 21st, which includes links to the open positions, or @ the community assistant, who will collect resumes and forward them to me.
About the speaker
Wu Xufei, senior engineer at Tencent Cloud and technical lead for Tencent Cloud Redis, has years of hands-on experience in game and database development, focusing on game development and NoSQL database practice in various domains.
Follow the Yunjia Community public account and reply "online salon" to get the speaker's slides.