Tencent Cloud Database's online technology salon on domestic databases is in full swing. Zheng Han's session on March 17 has concluded; for those who could not attend live, the video and a text review are below.

Follow the "Tencent Cloud database" public account and reply "0317 Zheng Han" to download the slides from the live session.

Click here for the live session replay.

Without further ado, let's get to today's share. Today's topic is "Smooth Capacity Expansion under 100-Million-Traffic Scenarios: TDSQL Horizontal Expansion in Practice".

Today’s sharing will mainly include the following four parts:

The first part introduces the background of horizontal expansion: why we expand horizontally, how it compares with vertical expansion, and the problems we generally encounter when expanding horizontally.

The second part briefly introduces how TDSQL performs horizontal expansion, to give you an intuitive impression.

The third part describes in detail the design principles behind TDSQL horizontal expansion, mirroring the first part: how TDSQL solves the common problems of horizontal expansion.

The fourth part will introduce practical cases.

I. Background and challenges of database horizontal expansion

First, let's look at the background of horizontal expansion. The reason for expansion is straightforward: as business traffic grows, the existing capacity or capability can no longer meet business needs. This shows up mainly as insufficient TPS or QPS, latency beyond what the business can tolerate, or insufficient capacity, the latter referring mainly to disk space or network bandwidth. When we hit these problems, we have to expand. There are two common ways to expand: vertical expansion and horizontal expansion. The two have different characteristics, and their advantages and disadvantages are both very clear.

1.1 Horizontal Expansion VS Vertical Expansion

First, vertical expansion. Vertical expansion mainly means upgrading the configuration of the machine or the instance. As we know, a database or instance purchased on the cloud is allocated on demand: a user whose business is still small may need only two CPUs and a few gigabytes of memory, but as the business grows, the instance may need to grow with it, to 20 CPUs or 40 GB of memory, say.

At this point, on the cloud we can dynamically adjust the resources an instance is given to meet business needs — for example, dynamically add CPUs on the same machine. The limit of this kind of expansion is reached when the whole machine's CPU and memory have been handed over and are still not enough; then a better machine has to be prepared. With MySQL this can be done via a master/slave switchover: pick a slave on the better machine, let it synchronize data, and once synchronization is complete, perform the switchover so the better machine takes over.

As you can see, this process has essentially no impact on the business — with a master/slave switchover, even if the IP changes, it is handled at the front end or through a VIP, so the business barely notices. But its biggest shortcoming is that it depends on single-machine resources: you can keep providing better machines up to a point, but as the business grows faster, you will eventually find that the best machine you can offer is still not enough. That dependence on single-machine resources is the fundamental disadvantage of vertical expansion.

1.2 Horizontal Capacity Expansion

The other approach, in contrast to vertical expansion, is what we call horizontal expansion. Its biggest advantage is precisely that it removes vertical expansion's limitation: in theory horizontal expansion is unbounded, and capacity can be adapted to business needs simply by adding machines.

Horizontal expansion solves the problem of vertical expansion, but it introduces problems of its own. Since it is much more complex than vertical expansion, let's first analyze the problems we may encounter; later we will introduce how TDSQL solves them:

First, with vertical expansion the data still lives on one node after expansion: in an active/standby architecture, the standby holds all the data too. With horizontal expansion, the data must be split. So the first question is: how do we split the data? If the split is done badly, hot data can all end up on one node, and even though the data has been split into many shards, the single node holding the hot data becomes the performance bottleneck.

Second, horizontal expansion involves data relocation and routing changes. Can the whole process be done without the business noticing? Or how intrusive is it to the business?

Third, the expansion process has many steps; if one of them fails, how do we roll back? And how do we guarantee data consistency during the switchover?

Furthermore, after expansion the data is spread across many nodes — how do we maintain performance? Ideally, performance grows linearly as machines are added. In practice, different architectures and methods affect this greatly, and sometimes you find that even after adding many machines, performance is hard to scale linearly.

Similarly, once the data has been split into multiple pieces, how do we preserve the database's distributed characteristics? On a single machine the data is a single copy, and MySQL's transaction atomicity ensures that all writes in a transaction either succeed or fail together. In a distributed architecture, that atomicity only covers a single node. Globally, with data spread across nodes, how do we guarantee that the writes of a transaction either all succeed or all roll back on every node involved? This is the problem of distributed transactions.

So the advantage of horizontal expansion is obvious — it removes the single-machine limitation of vertical expansion — but it is more complex and introduces more problems. With these problems in mind, let's look at how TDSQL performs horizontal expansion and how it solves each of them.

II. TDSQL horizontal expansion practice

2.1 TDSQL architecture

First, let’s look at the architecture of TDSQL. TDSQL simply consists of several parts:

The first part is the SQL engine layer: it serves as the access point and hides the details of how data is stored in the TDSQL back end. To the business, accessing TDSQL means accessing the SQL engine layer.

Next comes the data storage layer, composed of multiple SETs: in a distributed database, data is spread across nodes, and each SET is a unit of data management. A SET can be one master with one slave or one master with multiple slaves, deployed according to business needs — scenarios with high data-security requirements can use one master with multiple slaves. This is the data storage layer.

There is also the Scheduler module, responsible for monitoring and controlling the whole cluster. During expansion or a master/slave switchover, Scheduler acts as the brain of the system. The business only ever talks to the SQL engine layer; it neither sees Scheduler nor how the data is divided across nodes, so all of this is transparent to the business.

2.2 TDSQL Horizontal Expansion Process

At the beginning, the data lives in one SET, that is, on one node. Expansion essentially means spreading that data out — with 256 shards, it can eventually be spread across up to 256 machines. There are several key points in the expansion:

At first the data is on one node, one machine, but it has already been split — in this case into 256 shards.

Horizontal expansion simply means migrating some of these shards to other SETs, that is, to other machines, so that more machines serve the system.

To sum up, the data is partitioned from the start, and expansion is just moving shards to new nodes. During expansion the number of nodes grows — from 1 to 2 to 3, up to 256 — but the number of shards stays fixed. If the 256 shards start on one node and you expand to two nodes, each node holds 128 shards; at the extreme, with 256 nodes, each of the 256 machines is responsible for exactly one shard. So the whole expansion is, simply put, shard relocation. We'll come back to the details later.
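The idea above can be sketched as follows (an illustrative Python sketch, not TDSQL code — the even-spread layout function is an assumption): a fixed pool of 256 shards is simply reassigned as sets are added.

```python
# Sketch: the shard count is fixed when the table/cluster is created
# (e.g. 256), and scaling out only reassigns shards to more sets.

def assign_shards(num_shards, num_sets):
    """Spread a fixed number of shards as evenly as possible over the sets."""
    base, extra = divmod(num_shards, num_sets)
    layout, start = [], 0
    for i in range(num_sets):
        count = base + (1 if i < extra else 0)
        layout.append(range(start, start + count))
        start += count
    return layout

# 1 set -> all 256 shards; 2 sets -> 128 each; 256 sets -> 1 each.
print([len(r) for r in assign_shards(256, 2)])        # [128, 128]
print([len(r) for r in assign_shards(256, 256)][:3])  # [1, 1, 1]
```

Expanding from two sets to three then just means handing some shard ranges from the old sets to the new one; the shards themselves never change.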

In private cloud or public cloud, TDSQL provides a unified front page for the entire expansion, which is very convenient for users in the process of using.

Let's look at an example. There are currently two SETs, that is, two nodes, each responsible for part of the routing range: the first node covers 0-31, and the other, named SET 3, covers 32-63. If we want to expand from these two nodes, the front-end page provides an "add SET" button; clicking it pops up a dialog that defaults to the previous configuration, which the user can customize — the new SET's resources, memory, disk, and so on.

In addition, because route switching is required for capacity expansion, we can manually select a time to switch routes automatically or manually operate route switching based on the actual situation of services. These can be set according to the needs of the business.

After that first creation step, the brain module takes over allocating resources and running the whole initialization and data-synchronization logic. In the end you see that what was two nodes is now three: before expansion the first node was responsible for 0-31, and now it covers only 0-15, with the rest of that range handled by the new node. So, as you can see, a single click on the web page takes the cluster from two nodes to three very quickly — and depending on business needs we can keep adding SETs, continuing to expand with one click.

III. Design principles behind TDSQL horizontal expansion

Having introduced TDSQL's core architecture, its horizontal expansion features, and the front-end operations, you should now have an intuitive impression.


In this third part I will introduce the design principles behind TDSQL horizontal expansion in detail, returning to the problems raised in the first part and how we solve them. These problems must be solved by any system doing horizontal expansion.

3.1 Design Principle: How to consider compatibility and performance in partitioning key selection

First, as mentioned earlier, the first problem of horizontal expansion is how to split the data, because the split affects everything that follows. In TDSQL, the split logic is expressed in the table-creation syntax: the business specifies a shardkey — choosing a field as the partitioning key when designing the table structure. TDSQL then splits data by that partitioning key, and aggregates data by it when the table is accessed. We want the business to participate in table design to this extent — specifying one field as the shardkey — because it strikes a good balance between compatibility and performance.

We could instead let users create tables without specifying a shardkey and pick a key underneath at random, but that hurts subsequent efficiency — the distributed database's performance cannot be fully exploited. We believe a small amount of business involvement in table design buys a large performance advantage, balancing compatibility and performance. Moreover, when the business chooses the shardkey itself, it can look across related tables and pick a related column as the shardkey for each, ensuring that related data lands on the same node after the split. That avoids cross-node data interaction in many distributed operations.
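The shardkey idea can be sketched like this (illustrative Python; the 256-shard pool and the md5 hash are assumptions for the sketch, not TDSQL internals):

```python
import hashlib

NUM_SHARDS = 256   # fixed at cluster initialization (example value)

def shard_of(shardkey_value):
    """Map a partitioning-key value to one of the fixed shards.

    TDSQL's real hash function is internal; md5 here is purely illustrative.
    """
    digest = hashlib.md5(str(shardkey_value).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same shardkey always maps to the same shard, so rows that share a
# key (e.g. one user's orders and account rows) stay together on one node.
assert shard_of(10086) == shard_of(10086)
```

Because related tables share the same shardkey column, their rows for a given key value hash to the same shard, which is what keeps joint queries on one node.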

Sharded tables — created with a shardkey, with data split across nodes — are the ones we use most. Beyond these we provide two more kinds, three types of tables in total, mainly with performance in mind. One is the global table: its data is replicated in full on every node. It is not split at all, so it has none of the horizontal-split characteristics, but it suits small, rarely changed configuration tables, where the redundancy guarantees that subsequent access never has to cross nodes. As for the shardkey itself, we hash it, which has the advantage that data is spread evenly across the nodes and no single node becomes a hot spot.

3.2 Design Principle: High availability and reliability in capacity expansion

As mentioned just now, the expansion process is complicated. Can the whole process guarantee high availability and high reliability, and how much of it does the business perceive? What does TDSQL do here?
  • Data synchronization

The first stage is data synchronization. Suppose we have two SETs and find that one SET's disk usage is getting dangerous — say above 80% — and it needs to be expanded. We first create a new instance, copy a backup image to it, and establish a synchronization relationship. Establishing synchronization is invisible to the business, and from then on data is synchronized in real time.
  • Data validation

The second stage is continuous data catch-up and data verification. This may take a while, but once the replication delay between the two falls below a small threshold — say we set 5 seconds, and we observe the new instance is within 5 seconds of the source — we enter the third stage: the routing update.
  • Routing updates

In the routing-update stage, we first freeze write requests. If the business sends a write at this moment, we reject it and ask it to retry after two seconds, so the business is affected for a matter of seconds at most. The window is very short: with writes frozen, the third instance quickly catches up with all remaining data, the verification passes, and we then modify the routing. At the same time, as part of the same atomic operation, we mask the affected partitions at the underlying storage layer, so that even if a SQL engine's routing has not yet been updated it cannot write to the wrong place — the masked partitions at the bottom reject it. This guarantees data consistency. Once the route is updated, the third SET starts receiving user requests. At that point the first SET and the third SET both hold the full data for those shards, because of the synchronization.
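The sequencing of this stage can be sketched as follows (an illustrative sketch under the 5-second-threshold assumption above, not TDSQL source — the callback names are stand-ins):

```python
import time

CATCH_UP_THRESHOLD = 5.0   # seconds, the example threshold mentioned above

def switch_route(replication_delay, freeze_writes, mask_partitions, update_route):
    """Sketch of the routing-update stage.

    Writes are frozen only once the new set is nearly caught up, so the
    reject-and-retry window for the business stays within seconds.
    """
    if replication_delay() > CATCH_UP_THRESHOLD:
        return False                  # keep syncing; business unaffected so far
    freeze_writes()                   # new writes are rejected: "retry in 2s"
    while replication_delay() > 0:    # drain the last few seconds of backlog
        time.sleep(0.01)
    mask_partitions()                 # storage-layer fence: stale routes can't write
    update_route()                    # route flip to the new set
    return True
```

The key ordering is that the partition mask goes in before the route flip, so a SQL engine holding a stale route is blocked at the storage layer rather than writing to the old set.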
  • Delete redundant data

The last stage is deleting the redundant data. Deletion is delayed and throttled so it proceeds slowly, without causing large I/O fluctuations that would affect the live business. Throughout the deletion we keep the partitions masked and rewrite SQL at the SQL engine layer, so that even though redundant data still exists at the bottom, a user query — even a full scan — never sees extra rows.
As you can see, data synchronization, verification, and redundancy deletion can take a long time — copying the image and catching up on the binlog is slow when the data volume is large — but none of those stages affect the business, which is unaware of the new synchronization relationship. If a problem appears while establishing synchronization or building the new standby, we can simply replace it or retry, again with no business impact. In the routing-update stage some second-level impact on write requests is theoretically unavoidable, but we keep that window very short: writes are frozen only after the replication delay is already within the small threshold (say 5 seconds), and only then do we launch the routing update. Meanwhile the partition masking at the storage layer guarantees that no data can be written incorrectly even if different modules update their routing at different times. This is how we keep the expansion process highly available and highly reliable, with minimal impact on the business.

3.3 Design principle: distributed transaction

This is an overview of the expansion phase and how TDSQL solves the problem. Now let’s look at how to solve the problem of horizontal expansion after the expansion is completed. The first is distributed transactions.
  • Atomicity, decentralization, linear growth in performance

After expansion the data spans nodes; where there used to be one node there are now several. How is atomicity guaranteed? The entire handling logic is hidden from the business — using the distributed database feels like using standalone MySQL. If a SQL statement touches only one node, an ordinary transaction suffices; if we detect that a statement or transaction from the user touches multiple nodes, we use two-phase commit, with logging ultimately guaranteeing the atomicity of the distributed transaction. The implementation is fully decentralized: any SQL engine can act as the transaction coordinator (TM), so performance grows linearly. We have also built extensive exception-checking and retry mechanisms with robust failure handling, and the implementation has passed TPCC benchmark validation.
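The commit protocol mentioned here can be sketched as follows (a simplified illustration of two-phase commit, not TDSQL's implementation — the Node class and vote logic are stand-ins):

```python
# Two-phase commit sketch: the SQL engine acts as coordinator; each data
# node votes in phase one, and phase two commits only on unanimous yes.

class Node:
    def __init__(self, name, healthy=True):
        self.name, self.healthy, self.state = name, healthy, "idle"

    def prepare(self, txn):
        # Phase 1: persist the transaction and vote yes/no.
        self.state = "prepared" if self.healthy else "aborted"
        return self.healthy

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"

def two_phase_commit(nodes, txn):
    if all(n.prepare(txn) for n in nodes):   # phase 1: prepare/vote
        for n in nodes:
            n.commit()                       # phase 2: commit everywhere
        return True
    for n in nodes:
        n.rollback()                         # any "no" vote: roll back everywhere
    return False
```

Because the coordinator role lives in the stateless SQL engine layer, any engine instance can coordinate any given transaction, which is what keeps the design decentralized.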

3.4 Design Principle: How to achieve linear performance growth in capacity expansion

For horizontal expansion, splitting data across multiple nodes raises two main problems: one is transaction atomicity, solved by distributed transactions; the other is performance.

Vertical expansion gains performance roughly linearly by swapping in better CPUs and the like. With horizontal expansion, the data is split across many nodes, and really exploiting each of them — computing in parallel and realizing the advantage of a horizontally distributed database — requires a lot of work and many optimizations. TDSQL makes optimizations along the following lines.

First, keep related data on the same node. When designing the table structure, we ask the business to participate by choosing related columns as the shardkeys of related tables, which guarantees that the corresponding back-end data lives on the same node. Joint queries over that data then never need to cross nodes.


Second, we improve performance through parallel computing and streaming aggregation: the SQL engine splits a SQL statement, distributes it to the back-end nodes, lets each node compute in parallel, then aggregates the results in the SQL engine and returns them to the user. To reduce the data pulled from the back end to the SQL engine, we also do push-down queries — sending as many conditions as possible down to the DB layer. In addition, data redundancy is used to minimize cross-node data interaction.

Let's walk through a quick example to see how TDSQL achieves horizontal expansion with almost no business awareness, used just like standalone MySQL. Suppose the business inserts seven rows into a table, without caring whether the table lives on one node or several. The system parses the SQL and rewrites it automatically: it computes the partitioning key for each of the seven rows, finds that four belong on the first node and three on the second, and rewrites the statement into the corresponding inserts. To the user it looks like one insert, but across nodes it becomes multiple SQL statements wrapped in two-phase commit, ensuring that if anything goes wrong, both sides roll back together.
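The INSERT rewrite just described can be sketched like this (illustrative Python, not the SQL engine; "key % num_nodes" stands in for the real shard hash and routing table):

```python
# Group rows by the node that owns each row's partitioning-key hash,
# producing one rewritten INSERT batch per target node.

def route_rows(rows, key_index, num_nodes):
    """Split an insert's rows into per-node batches."""
    batches = {i: [] for i in range(num_nodes)}
    for row in rows:
        batches[row[key_index] % num_nodes].append(row)
    return batches

rows = [(i, "v%d" % i) for i in range(1, 8)]   # the seven inserted rows
batches = route_rows(rows, 0, 2)
# One rewritten INSERT goes to each node — here 4 rows to one node and
# 3 to the other; being cross-node, the pair is wrapped in 2PC.
```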

After the data is inserted, suppose the user runs a query — the user does not know the data is split; to him it is one complete table — using aggregate functions and so on. This SQL is also rewritten: the system sends it to both nodes, converting functions as needed, for example turning an average into per-node partial results that are merged on top. Each node pre-aggregates its own data first, and the SQL engine aggregates again over the partial results. The benefit of this extra step is streaming computation: we avoid pulling and caching too much raw data at once, so memory consumption stays low.
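The AVG case can be sketched as follows (illustrative; decomposing AVG into per-node SUM and COUNT is the standard technique, not necessarily TDSQL's exact rewrite):

```python
# AVG cannot be merged directly across nodes, so the engine pushes down
# SUM and COUNT to each node and combines the partials on top.

def node_partial(values):
    """Per-node pre-aggregation: (sum, count) over the node's rows."""
    return (sum(values), len(values))

def merge_avg(partials):
    """SQL-engine-side merge of the per-node partial results."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

node1 = node_partial([10, 20, 30, 40])   # rows on the first set
node2 = node_partial([50, 60, 70])       # rows on the second set
print(merge_avg([node1, node2]))         # 40.0, same as AVG over all 7 rows
```

Because each node ships only a tiny partial result rather than its raw rows, the SQL engine can merge the stream without buffering much data.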

For more complex SQL, such as multi-table queries or more elaborate subqueries, you can follow our later share on the SQL engine architecture and query practice.

In this third part we covered the principles of TDSQL's horizontal expansion in detail: how data is split, how the expansion itself is carried out, how problems during expansion are handled, and how TDSQL solves the problems that horizontal expansion introduces.

IV. Practice cases of horizontal expansion

In this fourth part, we briefly introduce some practices and cases.

4.1 Practice: How to select partitioning keys

As we said earlier, we want the business to participate in table design and consider the choice of partitioning key when creating tables. How should it be chosen? Here is a brief introduction by application type.

If it is a user-facing Internet application, we can use a user-related field, such as the user ID, as the partitioning key. This ensures that with a large user base, the data is split across the back-end nodes by user ID.

For games, the logical subject of the business is the player, so a player-related field works; for e-commerce, a buyer or seller field; for the Internet of Things, a device ID can serve as the partitioning key. The general principle is to choose a key that splits the data well and avoids skew — a field that spreads the data evenly across the nodes — and one that appears in the conditions of most SQL requests on the access side. Only then can the distributed advantage be fully used: if the condition carries the partitioning key, the SQL can be routed directly to a single node; if not, it must be sent to every back-end node.
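The routing difference can be sketched like this (illustrative; node selection via "% NUM_NODES" is a stand-in for the real shard hash and routing table):

```python
# With the partitioning key in the WHERE condition the SQL engine routes
# to one node; without it, the query fans out to every node.

NUM_NODES = 256

def target_nodes(where_shardkey=None):
    """Return the nodes a query must be sent to."""
    if where_shardkey is not None:
        return [where_shardkey % NUM_NODES]   # routed straight to one node
    return list(range(NUM_NODES))             # no key: scatter-gather to all

print(len(target_nodes(where_shardkey=10086)))  # 1
print(len(target_nodes()))                      # 256
```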

You can see that the further we have expanded — from one node up to 256 — the more a badly written SQL hurts: it may need to aggregate data from all 256 nodes, and performance will not be good.

In summary, we want the business to be involved as much as possible when creating tables and designing table structures. The operations themselves — aggregate functions, transactions, and so on — are transparent to the business, and this small amount of participation at design time buys a significant performance gain.

4.2 Practice: When to Expand capacity?

When do we expand? TDSQL collects a large amount of monitoring data: each module monitors the running state of the whole system, and various log information is reported from the machines. Based on this information, we can decide when to expand.


For example, expand when data-disk usage is too high, or when SQL request volume or CPU usage approaches the limit — currently we generally expand at around 80% usage. There is another situation: current request volume is small and resources are comfortable, but the business tells us in advance that a promotion at a certain time will multiply the request volume; then we can also complete the expansion ahead of time.
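A trigger along these lines might look like the following sketch (the ~80% thresholds are the figures mentioned above, used here as example defaults rather than exact TDSQL settings):

```python
# Decide whether to expand based on monitoring signals or a planned event.

def should_expand(disk_pct, cpu_pct, planned_event=False,
                  disk_limit=80.0, cpu_limit=80.0):
    """Expand on disk/CPU pressure, or proactively for a known traffic spike."""
    return planned_event or disk_pct >= disk_limit or cpu_pct >= cpu_limit

assert should_expand(85.0, 40.0)                        # disk pressure
assert should_expand(30.0, 95.0)                        # CPU pressure
assert should_expand(30.0, 30.0, planned_event=True)    # known activity ahead
assert not should_expand(30.0, 30.0)                    # healthy, no event
```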

Let's look at a few cluster examples on the cloud. Each SET is responsible for part of the shardkey range; with routing information 0-127, the cluster can eventually expand to 128 nodes, a 128-fold expansion. That "128" is fixed up front based on the business's initial estimate. There is no point making the pool too large: you could set it to thousands, but the data would be scattered very thin, and today's cloud and physical machines are good enough that you rarely need thousands of nodes.



Here is another cluster with more nodes — eight — each again responsible for part of the routing range. Its shard count is 64, so it can eventually expand to 64 nodes. These are examples from the cloud.

That is the main content of today's share. If you have any questions, feel free to leave a comment.

V. Q&A

Q: Before the expansion, were all the tables in a SET already partitioned?


A: Yes. Before expansion — say we have only one node — the shard count, say 256, was already determined when the cluster was initialized, and it does not change afterwards. We might set 256 because we expect the data volume to become very, very large. At that point all the data is on one node, but when the user creates a table with the syntax we just described, the table is actually divided into 256 shards underneath. So even without any expansion the table already has 256 shards; create another table and it also has 256. The user may see just two tables, but each has 256 partitions at the bottom, and expansion simply means moving some of those partitions elsewhere.


Q: How do I ensure the consistency of backup files on each node during restoration?


A: The nodes are independent of each other. Each node is responsible for part of the routing range and stores only part of the data, so their backups are unrelated to one another. Within a node we may have one master and two standbys, with a strong-synchronization mechanism guaranteeing strong data consistency during replication. You can refer to the earlier share, which introduces in detail how TDSQL ensures strong data consistency within a single node under its one-master-multiple-standby architecture.


Q: Can two-phase commit coordination avoid a single point of failure?


A: In two-phase commit, the SQL engine coordinates the transaction. That applies to a single transaction; transactions on other connections can be coordinated by other SQL engines, and every SQL engine is stateless and horizontally scalable. So there is no real single point here: we can add SQL engines as needed, performance grows linearly, and there is no centralization. The transaction logs are likewise distributed — they are recorded in the TDSQL back-end data nodes, each one master with multiple standbys and strongly consistent internally — so there is no single point of failure there either.


TDSQL is one of the three product lines of the Tencent TEG database working group. It is a financial-grade distributed database developed by Tencent, widely used in finance, government affairs, the Internet of Things, smart retail, and other industries, with a large body of distributed-database best practices.

Try cloud databases at a discount

Click to enjoy Tencent Cloud MySQL database preferential activities