Author introduction: Lei Yu, TiFlash R&D engineer at PingCAP, graduated from Sun Yat-sen University with a major in software engineering. He currently works mainly on the R&D of TiDB SQL MPP.
This article is shared by PingCAP R&D engineer Lei Yu. From a macro point of view, it analyzes what TiDB can do, what value it creates, and some of the design decisions made during development. The article is organized into four parts:
- First, the evolution of data management technology;
- Second, what TiDB can do;
- Third, how people use TiDB;
- Fourth, the future of TiDB HTAP.
Evolution of data management techniques
First, a brief review of the evolution of data management techniques.
- In the 1970s, IBM developed the world’s first relational database, System R. It was the first database to use SQL as its query language and laid the foundation for the design of later relational databases.
- In the 1980s and 1990s, relational databases grew explosively, and a large number of commercial products emerged, such as the well-known Oracle, IBM’s DB2, and Microsoft’s SQL Server, as well as popular open source relational databases such as PostgreSQL and MySQL. During this period, the technical focus was mainly on feature completeness, such as stored procedures, triggers, and various kinds of indexes, to meet different business requirements.
- In the early 2000s, the world entered the Internet era and data began to grow exponentially. Traditional relational databases could not accommodate such huge volumes of data. At the same time, some Internet companies began to open-source their in-house solutions for handling massive data. Around 2004, Google published three seminal papers, covering its distributed file system GFS, its distributed computing framework MapReduce, and its distributed storage system Bigtable. Guided by these three papers, the Hadoop ecosystem and community thrived. Distributed NoSQL databases such as Cassandra and MongoDB also appeared in this period. Traditional relational databases kept evolving as well, with key technologies such as MySQL’s InnoDB engine and Oracle RAC. No single database product could meet every user’s needs, and the data processing field split into distinct directions: OLTP remained dominated by traditional relational databases, while OLAP became the main battlefield of big data technology.
- In the early 2010s, thanks to advances in hardware, memory capacity, network bandwidth, and latency improved greatly, and database architecture underwent major changes. In-memory databases and distributed databases reached production at scale, with representative products such as Google Spanner, SAP HANA, SQL Server Hekaton, and Amazon Aurora. During this period, the concepts of OLTP and OLAP began to blur, and HTAP was proposed: mixing OLTP and OLAP together and processing both workloads on the same database, returning to the original purpose of database products.
- In the late 2010s, continuing the momentum of the early 2010s, a variety of NewSQL databases emerged that could carry more complex workloads. Representative products include CockroachDB, TiDB, VoltDB, and Azure Cosmos DB, each evolving in its own direction.
Overall, since 2000 big data technology has become part of the Internet ecosystem, and building data warehouses on it has become common. Although the concept of data warehousing dates back to the 1990s, data warehousing products at that time were not open source and the industry lacked consensus. After Hadoop was open-sourced, data warehouse architectures based on Hadoop gradually became mainstream, which is what we now call the traditional data warehouse architecture.
Traditional data warehouse architecture
As shown in the figure above, the left side is the database serving the online OLTP business. Since analysis cannot run directly on it, incremental changes or full data are transferred to the Hadoop platform, either through MySQL’s binlog CDC or through ETL jobs that read the database directly. Hive and other tools in the Hadoop ecosystem then analyze the data and generate reports, and the results are written to another OLTP database, the Data Serving layer on the right, which finally presents the offline analysis results to the application.
Because this pipeline is long and the Hadoop components involved vary, the architecture was initially limited to T+1: a day’s data is written on that day and computed the next day.
Although this solves the problems of massive storage and computation, it sacrifices the timeliness of data processing. In recent years, as demand for real-time results has grown, a new architecture, the Lambda architecture, emerged to restore that timeliness.
Lambda real-time data warehouse
The Lambda architecture adds a real-time computing layer, commonly called the Speed Layer, on top of offline Hadoop. In the early days it mainly used streaming engines such as Spark Streaming or Storm to collect OLTP data directly and compute real-time results, which are then merged with the offline T+1 data and served to the application. In this way, the application gets reasonably fresh data.
The traditional data warehouse could only achieve T+1; with the Lambda architecture, T plus a fraction of a day becomes possible, combining yesterday’s data with part of today’s. But can analysis become even more real-time than that?
Kappa real-time data warehouse
Hence the Kappa architecture. The pain point of the Lambda architecture is its complex maintenance: the same data has to be written to both the batch (T+1) path and the real-time path, and the results of the two must then be merged. With the Kappa architecture, OLTP business changes are simply pulled on demand through a single real-time computing layer, which produces the data to be served. However, this architecture has not been widely adopted, largely for performance reasons.
It is clear from this evolution of data warehouse architectures that real-time data has become a common demand, while the capacity to process massive data remains essential. With that in mind, let’s see what TiDB can do.
What can TiDB do?
TiDB before 4.0
When TiDB 1.0 was released, the diagram below was many people’s first impression of TiDB.
The architecture of TiDB is very simple. A load balancer at the front distributes users’ SQL requests across the TiDB servers. The TiDB server is a stateless compute layer that can be scaled out at will. The actual data lives in TiKV, a distributed key-value store. In addition, the PD component schedules and organizes the data.
The most prominent feature of this architecture is scalability, and scalability was the first priority. It shows in two ways. The first is storage: TiDB can hold, in distributed storage, data volumes that a traditional single-machine database cannot carry. The second is computation: a traditional single-machine database cannot sustain very high QPS, but by scaling out, the QPS can be spread across different compute nodes.
We kept this architecture until TiDB 4.0. Here is a summary of what TiDB could do before 4.0:
- A relational database compatible with the MySQL protocol and MySQL features;
- Storage that scales horizontally by nature, with no need to shard databases and tables;
- Able to carry online services at tens of millions of QPS;
- Separation of compute and storage, enabling elastic resource allocation;
- A good carrier for the data warehouse serving layer (data middle platform).
First of all, TiDB’s footing is that of a relational database compatible with the MySQL protocol and MySQL features, where both storage and compute can scale horizontally without sharding databases and tables. On top of that, because compute scales horizontally, it can support online businesses with high QPS, and the separation of storage and compute lays the foundation for elastic resource allocation.
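For example, because TiDB speaks the MySQL protocol, any ordinary MySQL client or driver can connect to it and create tables as usual, with no sharding middleware in between. A minimal sketch, with hypothetical host, schema, and table names:

```sql
-- Connect with any MySQL client, e.g. `mysql -h <tidb-host> -P 4000 -u root`
-- (4000 is TiDB's default SQL port); no sharding middleware is required.
CREATE DATABASE IF NOT EXISTS demo;

CREATE TABLE demo.orders (
    id         BIGINT PRIMARY KEY,
    user_id    BIGINT NOT NULL,
    amount     DECIMAL(10, 2),
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Ordinary MySQL-style reads and writes; behind the scenes the data is split
-- into Regions and balanced across TiKV nodes by PD.
INSERT INTO demo.orders VALUES (1, 42, 99.90, NOW());
SELECT * FROM demo.orders WHERE user_id = 42;
```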
But beyond our expectations, many members of the open source community were using TiDB as a data warehouse. TiDB can store massive amounts of data and offers a more convenient access interface, so many users naturally adopted it as the middle layer of their data warehouse.
Prior to TiDB 4.0, this usage was not considered in the design at all, so there were many problems: computation happened on a single TiDB node and could not be scaled out, and some heavy computing tasks were simply not supported. At the same time TiKV, TiDB’s storage engine, uses a row-oriented format. Row storage handles concurrent transactions well in OLTP scenarios, but its performance in OLAP scenarios is far from ideal.
Having received a variety of such user requirements, we developed TiFlash, TiDB’s columnar engine, to carry TiDB’s OLAP load. In TiDB 4.0, TiFlash officially became a member of the TiDB family.
TiDB after 4.0
Prior to 4.0, the community already provided TiSpark. TiSpark is essentially a Spark plugin that lets you access, read, and write data in a TiDB cluster from within Spark. However, using Spark to access TiDB data can be problematic, because its highly concurrent scan requests affect the OLTP load on TiKV itself.
With TiFlash, the OLAP and OLTP loads can be completely isolated while consistency is still guaranteed. TiFlash’s consistency comes from Raft’s replication protocol. For those familiar with Raft, it is a synchronous replication protocol in which all data is represented as a log. Each log entry has a globally consistent ID, which is also the index of its position. If there are two log entries, one at index 4 and one at index 5, the Raft protocol guarantees that 5 is written only after 4, and that when 5 is readable all clients (the TiDB servers) can also read 4, thus satisfying linearizability.
Generally, only the Raft leader serves reads and writes. However, with optimizations on the learner or follower state, a read can wait until the learner has applied up to the index observed on the leader, after which data can be read directly from the learner. TiFlash uses exactly this mechanism to synchronize data from the TiKV cluster and achieve linearizable reads. The advantages are:
First, if data were synchronized into the columnar analytics engine through something like a binlog, there would be extra transfer overhead or middleware-style processing overhead. By replicating directly at the Raft level, by the time a write to the leader completes Raft’s quorum confirmation, the data has already been sent to TiFlash as well. In addition, although the write does not wait for TiFlash’s acknowledgment, TiFlash’s data follows the same replication path as the high-availability replicas inside TiKV, which is the key to achieving consistency.
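As a small illustration of what this consistency means in practice, a hedged sketch, assuming a table `t` that already has a TiFlash replica; the `tidb_isolation_read_engines` variable forces the session to read from a specific engine:

```sql
-- Reads served by the TiFlash learner see all data committed through TiKV.
SET SESSION tidb_isolation_read_engines = 'tikv';
SELECT COUNT(*) FROM t;          -- row-store read from TiKV

SET SESSION tidb_isolation_read_engines = 'tiflash';
SELECT COUNT(*) FROM t;          -- column-store read from TiFlash, same result
```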
Overall, TiDB 4.0 takes analytics capability up a notch. At this point we can proudly say that TiDB is a true HTAP database, with the following characteristics:
- A true HTAP database;
- Isolated OLAP and OLTP workloads;
- Analysis-friendly columnar storage with strong freshness and strong consistency;
- Integrated deployment, operation, and maintenance, with the optimizer choosing the storage engine intelligently;
- `ALTER TABLE db.table SET TIFLASH REPLICA 1;` is all the SQL it takes to try out a TiFlash replica (see the sketch below).
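As a concrete illustration, a hedged sketch reusing the hypothetical `demo.orders` table from the earlier example: add a TiFlash replica and watch its replication progress through `INFORMATION_SCHEMA.TIFLASH_REPLICA`:

```sql
-- Ask for the TiKV data of demo.orders to be replicated to one TiFlash node.
ALTER TABLE demo.orders SET TIFLASH REPLICA 1;

-- Check replication status: AVAILABLE = 1 and PROGRESS = 1 mean the replica
-- has caught up and is ready to serve analytical queries.
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM INFORMATION_SCHEMA.TIFLASH_REPLICA
WHERE TABLE_SCHEMA = 'demo' AND TABLE_NAME = 'orders';
```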
TiDB HTAP in 5.0
In 5.0, to solve the pain points above, we built TiDB’s MPP engine. Let’s first get a sense of what MPP is.
TiDB executes SQL with a Volcano-style model. Its advantage is that operators are decoupled; its disadvantage is that upstream and downstream calls are coupled: the upstream operator must pull data from the downstream operator, which computes it and hands it back up, and the consumption and production rates of different operators can be badly mismatched. TiDB has done a lot of optimization here, parallelizing the work inside individual operators to speed them up, but at the end of the day it is still a single-node computing engine with a fairly low ceiling. To break through this, we make full use of the TiFlash nodes.
First, let’s see how to do it.
A SQL statement arrives at TiDB and passes through the parser and planner to produce a TiDB logical plan; as the logical plan goes through TiDB’s optimizer, the optimizer determines whether it is an OLAP request.
If it is an OLAP request, the optimizer chooses, based on cost estimates, whether to read from TiKV or from TiFlash. In the process, exchange operators are added around the joins, the parallelization approach described in the Volcano paper, to produce a parallel execution plan, and the fragments of that plan are pushed down to the corresponding TiFlash nodes for execution.
Let’s look at a practical example.
The data above come from the TPC-H dataset. TPC-H has a table called LINEITEM that contains information about all the line items, roughly 600 million rows, and an ORDERS table, the fact table for orders; we add a COUNT(*) aggregation on top of a simple join between them. Under the MPP architecture the plan looks different. Previously there were two table scans under the join, and if the calculation ran inside TiDB the two scans fed directly into the join operator. With MPP, the output of each table scan is first shuffled by the join key and pushed to the corresponding compute node; when the whole computation finishes, the result is pushed back to TiDB and returned to the user.
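As an illustration, a hedged sketch of this kind of query against the standard TPC-H schema (the column names follow TPC-H; the exact EXPLAIN output depends on the TiDB version and is not reproduced here):

```sql
-- Enable MPP for this session (ON by default in 5.0).
SET tidb_allow_mpp = ON;

-- A simple join of the two TPC-H tables plus a COUNT(*) aggregation.
EXPLAIN
SELECT COUNT(*)
FROM lineitem l
JOIN orders o ON l.l_orderkey = o.o_orderkey;

-- With TiFlash replicas on both tables, the plan is expected to contain
-- ExchangeSender / ExchangeReceiver operators that shuffle both scans by the
-- join key across TiFlash nodes, with only the final result returned to TiDB.
```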
This shuffle-based approach has two advantages. On the one hand, if a single TiDB node does the calculation, a large amount of data has to be held in memory, and the data may even exceed TiDB’s capacity, forcing it to spill to disk and making computation very inefficient. After shuffle partitioning, the amount of data each compute node must handle shrinks and can fit entirely in memory, which speeds things up. On the other hand, MPP can use the CPUs of multiple machines at the same time, which in theory gives very strong scalability.
To verify the performance of TiDB MPP, we compared it with other products. The test cluster has three nodes, each with an NVMe SSD, so that the impact of storage reads on overall computing speed is excluded as much as possible.
In the figure above, the blue bars show TiFlash MPP; bar length represents execution time, and shorter is better. Compared with Greenplum and Apache Spark, MPP wins for the vast majority of queries. The reasons are, first, that TiDB 5.0 itself integrates a columnar computing engine with very strong performance; and second, that the MPP architecture’s advantage over a batch engine is that all tasks run in parallel with no interdependence, so it achieves better concurrency. The drawback is that it cannot handle as much data as batch processing, but in most scenarios the MPP architecture is entirely adequate.
To summarize TiDB’s MPP:
- Multiple parallel execution algorithms are supported:
  - Broadcast Join;
  - Repartition (Shuffle) Join;
  - Two-Phase Aggregation;
  - One-Phase Aggregation;
- Scalable complex query processing capability;
- Deeply integrated with TiDB and chosen automatically by the optimizer;
- After upgrading to TiDB 5.0, simply turn on the switch: `SET tidb_allow_mpp = ON`.
Alongside the MPP architecture, several new features introduced in TiDB 5.0 greatly improve TiDB’s HTAP capability:
- OLTP:
  - Async Commit and 1PC provide lower transaction latency;
  - Clustered Index improves latency and throughput for certain workloads (see the sketch after this list).
- OLAP:
  - SQL MPP significantly improves TiDB’s ability to handle complex queries.
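For reference, a hedged sketch of how these OLTP features are toggled; both commit optimizations are controlled by system variables in 5.0 (they default to on in new 5.0 clusters), and the clustered-index layout is chosen at table creation time:

```sql
-- Async Commit and one-phase commit (1PC) reduce commit latency by cutting
-- round trips in the two-phase commit protocol.
SET GLOBAL tidb_enable_async_commit = ON;
SET GLOBAL tidb_enable_1pc = ON;

-- Clustered index: store rows directly by primary key instead of through a
-- separate row-ID mapping, saving a lookup on primary-key access.
CREATE TABLE account (
    id      BIGINT,
    balance DECIMAL(20, 2),
    PRIMARY KEY (id) CLUSTERED
);
```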
So far we have covered TiDB’s features and product capabilities at different stages. Next, let’s look in detail at how people use TiDB.
How do people use TiDB?
Based on user feedback and our own analysis, we have identified the most common TiDB scenarios today.
Transaction/analytics integration
The first is combined transaction and analytics processing. In this scenario the data volume is generally moderate, at the terabyte level.
With MySQL alone, heavy data computation does not work well, so the data usually has to be imported into an analytical database such as ClickHouse or Greenplum for computation before reports can be produced. With TiDB, the two parts can be combined: TP writes go directly to TiDB, AP computation also runs directly on TiDB, and the results are then presented. This greatly reduces operation and maintenance costs and may also improve performance.
This combined transaction-and-analytics scenario is quite common, for example in CRM and ERP systems, and it is the most complete HTAP scenario that we advocate. Internet companies, however, generally cannot rely on it alone and still need an offline layer to handle truly huge volumes of data.
In such systems, TiDB is therefore mainly used for real-time analytics.
Real-time analysis
Business data flows through Kafka and Flink, where it is pre-aggregated or assembled into wide tables, and the results are written into TiDB for querying. This is a common real-time analytics architecture. If the application’s online business already runs on TiDB, the whole architecture is even more natural: TiDB’s CDC capability can feed data directly into Flink for processing.
Because this architecture is practical, it is widely used in many business scenarios, as the examples below will illustrate.
Real-time analysis: The Flink architecture
There are also several common architectures for using Flink in real-time analysis.
- Parse MySQL CDC using the Flink MySQL Connector
In the first architecture, the front-end business runs on MySQL, for example under a sharded database-and-table scheme. Data changes in MySQL are captured through the Flink MySQL connector, and the data is then written to TiDB (see the sketch after this list).
- Push Canal JSON and similar formats through Kafka
In the second architecture, MySQL binlog data is processed by middleware such as Canal, written to Kafka, consumed by Flink, and then written to TiDB. This is a very common approach.
- Push Canal JSON to Kafka using TiCDC
In the third architecture, the user’s front end already runs on TiDB. Through TiDB’s CDC capability (TiCDC), changes are output in Canal JSON format to Kafka for consumption, and Flink then writes the data to another TiDB cluster or some other sink.
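As an illustration of the first architecture, here is a hedged Flink SQL sketch. The host names, credentials, and table names are hypothetical, and it assumes the flink-cdc `mysql-cdc` connector and Flink’s JDBC connector (with a MySQL driver) are on the classpath; TiDB is written to through plain JDBC because it speaks the MySQL protocol.

```sql
-- Hypothetical source: capture changes from a MySQL table via CDC.
CREATE TABLE orders_src (
    id      BIGINT,
    user_id BIGINT,
    amount  DECIMAL(10, 2),
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector'     = 'mysql-cdc',
    'hostname'      = 'mysql-host',
    'port'          = '3306',
    'username'      = 'flink',
    'password'      = '***',
    'database-name' = 'shop',
    'table-name'    = 'orders'
);

-- Hypothetical sink: write into TiDB over the MySQL wire protocol via JDBC.
CREATE TABLE orders_rt (
    id      BIGINT,
    user_id BIGINT,
    amount  DECIMAL(10, 2),
    PRIMARY KEY (id) NOT ENFORCED
) WITH (
    'connector'  = 'jdbc',
    'url'        = 'jdbc:mysql://tidb-host:4000/analytics',
    'table-name' = 'orders_rt',
    'username'   = 'tidb_user',
    'password'   = '***'
);

-- Continuously replicate (or pre-aggregate) the change stream into TiDB.
INSERT INTO orders_rt SELECT id, user_id, amount FROM orders_src;
```

The second and third architectures look similar on the Flink side, except that the source table reads Canal JSON from Kafka (`'connector' = 'kafka'`, `'format' = 'canal-json'`) instead of connecting to MySQL directly.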
Data warehouse acceleration layer / ODS layer
Another common use is as the acceleration layer or ODS (operational data store) layer of a data warehouse.
Most commonly, a data warehouse keeps the acceleration layer separate; with TiDB, the two parts can be combined. Users write data into TiDB in whatever way suits them, and TiDB directly provides real-time data analysis services to the outside world. This is also one of the most popular architectures.
Application cases
Next, I’ll share some examples of real-life companies.
ZTO Express
First is ZTO Express, which everyone knows and which is probably one of the largest express delivery companies in the world. In recent years they have been using TiDB for parcel tracking management. Early on they used TiSpark for computation, assembled the data into wide tables, wrote them to TiDB, and then ran aggregations on top. Recently they have been testing the 5.0 MPP architecture to see whether TiDB 5.0 can help even more.
- ZTO Express
  - The largest express delivery company in the world.
- Full-lifecycle management of the logistics chain
  - One TiDB platform serves both parcel tracking management and real-time reporting.
  - Peak QPS exceeds 120,000.
  - Real-time statistical analysis.
  - Connected to offline platforms via TiSpark.
ZTO Express’s architecture is shown above. Parcel tracking is an online business; its data is written into TiDB through Spark Streaming while real-time analysis runs at the same time. TiDB’s archived data is then sent to ZTO’s big data platform for computation, and finally the results from the big data platform are written back to TiDB. In this architecture, TiDB is the integration layer of the whole real-time computation.
Xiaohongshu
Xiaohongshu is a content-driven platform with vertical e-commerce business, and both its user base and traffic are very large.
Xiaohongshu’s early architecture used sharded MySQL databases and tables for its businesses. Business data was written to offline systems through ETL, and after T+1 computation it was written back to another sharded MySQL cluster to provide data services externally. The offline warehouse was also used for risk-control-related business.
The pain point of this architecture is T+1, which is hard on both the business and operations. After trying TiDB, they upgraded the architecture.
At present the online business layer still uses sharded databases and tables, but the business data is also written directly to TiDB in some simple ways. TiDB in turn feeds the data to the offline layer, and after offline processing the results are written back to TiDB.
In this architecture, data analysis and risk control services run directly on TiDB, and the overall pipeline has moved from T+1 to T+0. According to feedback from Xiaohongshu’s engineers, using TiDB also saves a great deal of the effort spent operating and maintaining sharded MySQL, which is another advantage of TiDB.
PatSnap
PatSnap (Zhihuiya) is a SaaS provider that delivers big data intelligence services to more than 10,000 technology companies, universities, research institutions, and financial institutions in more than 50 countries.
- PatSnap
  - A fast-growing technology-innovation SaaS provider, delivering big data intelligence services to more than 10,000 technology companies, universities, research institutions, and financial institutions in more than 50 countries worldwide.
- Real-time data warehouse
  - Deployed in an AWS cloud environment.
  - Data warehouse modeling with AWS Kinesis / AWS EMR Flink.
All of PatSnap’s services are deployed on AWS. Initially, PatSnap ran its data analysis on AWS Redshift, but Redshift’s speed was not ideal. To get better real-time performance, PatSnap began building a real-time data warehouse with TiDB. The warehouse architecture is very similar to that of the other companies: Flink handles real-time data processing, all kinds of data are written into TiDB, and the results are presented directly to data applications.
The cases above are very typical scenarios of real-time data analysis with TiDB. Some businesses lean more toward HTAP, such as Xiaohongshu’s architecture, where online business data is written directly into TiDB, making full use of TiDB’s OLTP capability.
Having looked at so many cases, we can now imagine the future of TiDB HTAP.
The future of TiDB HTAP
First, and most importantly, from 5.0 onward TiDB can be used for complex computation, and we can validate it in more real-time scenarios.
What does SQL MPP mean?
With SQL MPP, we get faster computation and can handle more complex computing tasks, on top of strong data freshness and strong consistency guarantees. What can we do with that?
Live streaming scenarios
First, live streaming. When a top streamer goes live, users flood in at once. Their viewing activity is inserted into a fact table of room visits, while the streamer’s room updates its own dimension table. A traditional architecture would probably process this with Flink, but the problem is that these operations are highly concurrent and must complete within a short time. Flink would have to maintain complex watermarks and other state, and the pre-processing may introduce delays.
If TiDB carries these loads directly, the data can be analyzed the moment it is written, and analysis reports can be generated and fed back to the platform or the streamer in time to adjust the business. Of course, this live streaming application is still hypothetical for now, and we look forward to seeing TiDB land in this scenario.
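To make the hypothetical concrete, a sketch with invented table names (`room_visits` as the fact table, `rooms` as the dimension table): writes land in TiDB over the MySQL protocol, and the same cluster answers the aggregate query immediately, ideally from a TiFlash replica.

```sql
-- Online writes: record a visit and update the room's dimension row.
INSERT INTO room_visits (room_id, user_id, visited_at)
VALUES (1001, 42, NOW());

UPDATE rooms SET online_viewers = online_viewers + 1 WHERE room_id = 1001;

-- The same cluster can answer real-time questions right away, e.g. visits
-- per room in the last 5 minutes, served from TiFlash when a replica exists.
SELECT r.room_id, r.title, COUNT(*) AS recent_visits
FROM room_visits v
JOIN rooms r ON r.room_id = v.room_id
WHERE v.visited_at >= NOW() - INTERVAL 5 MINUTE
GROUP BY r.room_id, r.title;
```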
Real-time risk control scenarios
Another scenario is real-time risk control. Online platforms constantly generate transactions and money transfers, and fraud cases regularly make the news. Financial and other trading platforms therefore generally run risk control services to detect and prevent such incidents.
One problem with traditional risk control is that fraud can happen so quickly that it is over before the risk control rules are even triggered. This not only causes financial losses for users but also hampers the efficiency of police investigations.
If TiDB is applied to the risk control business, suspicious transactions can be analyzed and risk control policies triggered the moment they occur. The end-to-end latency drops sharply, which also helps the relevant authorities solve cases faster.
As for other TiDB HTAP application scenarios, we welcome everyone to imagine the future of TiDB together with us.