Introduction | TBase is an enterprise-grade distributed HTAP database system developed by the Tencent TEG data platform team on top of open-source PostgreSQL. It can provide strongly consistent, highly concurrent distributed online transaction processing and high-performance online data analysis within the same database cluster. This article is an edited transcript of a talk given by Tencent TBase expert engineer Wu Xin at a Cloud + Community online salon, sharing TBase's exploration and practice in the distributed HTAP field.


First, an introduction to the TBase distributed database

1. Development history of TBase

Tencent has been experimenting with this on its internal business since 2009 and has accumulated experience in the in-house development of enterprise distributed databases. At that time the goal was mainly to meet some smaller requirements, such as introducing PostgreSQL as a supplement to TDW, to make up for TDW's weak performance on small-data analysis and small processing workloads.

However, as the business grew, more efficient online transaction processing capacity was needed and the database had to scale out, so we continued to invest in its development.

In 2014, the first version of TBase was used in Tencent's big data platform. In 2015, the TBase WeChat Pay merchant cluster went live, supporting more than 600 million transactions per day. In 2018, the V2 version made significant enhancements to transactions, query optimization, and enterprise features, and gradually became available to some external customers. In 2019, TBase won the bid for PICC Group's core business, helping to build a core system that leads the domestic insurance industry and providing stable service over a long period.

As TBase's overall capabilities continued to mature, we wanted to contribute them to the open source community to support more database localization projects. So in November 2019 we open-sourced the database, hoping to help drive industrial digital upgrades.

2. Introduction to the PostgreSQL database

TBase is a distributed database based on PostgreSQL. In addition to complete relational database capabilities, TBase has many enterprise-level capabilities. It also strengthens distributed transaction processing and provides better support for online analysis workloads, offering a one-stop solution.

In terms of data security we have distinctive features, including a "separation of three powers" security model, data masking, and encryption. TBase also supports flexible multi-site, multi-center deployment, providing fairly complete high-availability guarantees for online transactions in the financial industry and laying a solid foundation for the localization of core businesses such as finance and insurance.

PostgreSQL is an open source RDBMS. Its license is BSD-style, so the source code can be modified and commercialized.

PostgreSQL is an open source project led by Turing Award winner Michael Stonebraker. As you can see from the chart above, PostgreSQL has been iterating for a long time. Version 12 has now been released, and the project continues to iterate at a very active pace.

3. Development trend of PostgreSQL

PostgreSQL has received a lot of attention over the last decade, first and foremost because its kernel features and its active community have kept improving over the past few years. The DB-Engines chart above shows PostgreSQL's trend over the past year; despite some dips, it has maintained steady growth.

The yellow curve below is PostgreSQL, and you can see that it is performing well.

The current open source version of TBase is based on PostgreSQL 10. We continue to merge newer PostgreSQL features and feed changes back to the open source community, hoping to integrate and interact well with the PostgreSQL ecosystem as a whole.

Second, positioning and overall architecture of open source TBase

1. Positioning of open source TBase

Databases are classified into OLAP, OLTP, and HTAP based on service scenarios.

OLAP workloads are characterized by large data volumes, usually 10 PB or more, and are sensitive to storage costs. Their concurrency is not as high as OLTP's, but they require good support for complex queries.

OLTP data volumes are relatively small; many small and medium-sized systems never reach TB scale. But the demands on transactions and query requests are relatively high, with throughput of a million TPS or more. OLTP also has high requirements for disaster recovery (DR) capability.

Many domestic databases start from the OLAP field, because entering from the OLTP side is relatively difficult; that field is still dominated by IBM and Oracle. We hope to achieve localization in this field as soon as possible. TBase has served the insurance industry for a long time and has strong OLTP capability for core business.

The other one is HTAP. In most previous business deployments, TP and AP would be separated, and there would be ETL or streaming technology in the middle to interact between the two systems. Ideally, both types of business can be supported in a single system.

Of course, this is complicated. The business characteristics of the two are quite different, so the optimization directions and techniques in the kernel diverge substantially.

TBase moved toward HTAP based on concrete requirements. TBase leans toward TP while also providing good AP processing capacity, so that both can be accommodated in one system. For more extreme performance requirements, the two workloads need to be isolated within the cluster so that a complete service capability can still be provided to the user.

TBase's approach is also driven by requirements: it started as a transaction system inside Tencent and gradually added AP analysis capability.

First of all, the transaction data may exceed 1 TB and the analysis data may exceed 5 TB; concurrency must reach 2000 or more, and peak transactions per second may reach 10 million. Where scaling out is required, the impact on existing transaction processing, analysis workloads, and data redistribution must be minimized, while still providing full distributed consistency at the transaction level.

At the same time, TBase has enhanced many enterprise-level capabilities, such as the "separation of three powers" security model, data governance, and the separation of hot and cold data as well as of large and small merchants' data.

We’ve covered the evolution of TBase, and along the way we hope to contribute to the open source community.

In the domestic environment it is still difficult to replace core business systems; most replacements start from analysis systems, and only in recent years have systems begun to break into core transaction processing. Through open source, TBase hopes to give back to the community, help fill some of these gaps with its HTAP capability, and expand the ecosystem.

After going open source we have received more attention and adoption, including from the European Space Agency's Gaia mission, which uses the system for analysis of Milky Way star data. We hope more developers and friends will join TBase's development, and that this introduction makes it easier for everyone to engage and interact with the TBase open source community.

2. TBase overall architecture

A cluster consists of GTM, Coordinator, and Datanode components. The GTM (Global Transaction Manager) is mainly responsible for managing global transactions and is the cornerstone of the distributed consistency protocol. Coordinators are the access points for user requests: they parse user requests and dispatch them. The actual computation and data storage take place on the Datanodes.

Third, an introduction to HTAP capabilities

We have just been talking about HTAP; let's start with OLTP, where TBase's capabilities are particularly strong.

Users with demanding requirements on transaction volume or concurrency need a strong distributed transaction system. The specific requirements include high performance and low cost, where TBase has clear advantages over traditional IBM setups or more expensive foreign all-in-one appliances.

Another requirement is scalability: transaction processing capacity should scale almost linearly as nodes are added. So how do we get there?

PostgreSQL maintains a list of currently active transactions in a structure called the Proc Array, which records the active transaction of each user session that has started a new transaction. When determining the visibility of a tuple, a snapshot taken from the active transaction list is compared with the XID information recorded in the tuple header; this is how MVCC access control works.
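As a rough illustration of that visibility check, here is a minimal Python sketch (my own simplification with hypothetical names, not PostgreSQL's actual C implementation; it ignores aborted transactions and subtransactions):

```python
# Minimal sketch of snapshot-based MVCC visibility (illustrative only).
from dataclasses import dataclass, field

@dataclass
class Snapshot:
    xmin: int                                        # every xid < xmin has already finished
    xmax: int                                        # every xid >= xmax started after the snapshot
    active_xids: set = field(default_factory=set)    # in-progress xids copied from the Proc Array

@dataclass
class HeapTuple:
    xmin: int                      # xid that inserted the tuple
    xmax: int | None = None        # xid that deleted it, if any
    data: object = None

def xid_visible(xid: int, snap: Snapshot) -> bool:
    """A transaction's effects are visible if it committed before the snapshot was taken."""
    if xid >= snap.xmax:
        return False               # started after the snapshot
    if xid in snap.active_xids:
        return False               # still running when the snapshot was taken
    return True                    # treated as committed in this simplified sketch

def tuple_visible(tup: HeapTuple, snap: Snapshot) -> bool:
    if not xid_visible(tup.xmin, snap):
        return False               # inserting transaction not visible yet
    if tup.xmax is not None and xid_visible(tup.xmax, snap):
        return False               # deleting transaction already visible
    return True

# Example: xid 90 inserted the row; xid 95 deleted it but is still active.
snap = Snapshot(xmin=80, xmax=100, active_xids={95})
print(tuple_visible(HeapTuple(xmin=90, xmax=95, data="row"), snap))   # True: the delete is not visible yet
```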

An easy way to extend this to a distributed setting is to use a central node. Under the earlier architecture, the GTM kept a centralized active transaction list and assigned a snapshot to every request.

However, this has a serious problem: the central node becomes a bottleneck, and the GTM runs into issues such as overly large snapshots and high network usage.

With the GTM as a centralized node there is a single-point bottleneck: each request must read the snapshot correctly, which requires locking the active transaction list and causes heavy lock contention under high concurrency. How did we solve this problem?

This is actually a common problem in the industry. The solutions we see in the Internet industry today largely derive from the direction set by Google Spanner.

Google Spanner is a globally distributed database that provides consistent service across continents. Its concurrency control has two notable features: the KV store uses multi-version concurrency control based on global time, and it provides a TrueTime API, building a commit protocol on real time using a costly GPS-based, globally consistent timestamp service. Because of the global distribution, the average clock uncertainty is about 6 milliseconds, so overall transaction latency is relatively high.

CockroachDB is one of a number of systems that borrow their transaction model from Google Spanner.

Percolator is another system built by Google to serve its search engine more efficiently. It uses KV storage with MVCC based on a global logical timestamp for concurrency control, with timestamps provided by a dedicated timestamp service. In the first phase of a distributed transaction the modified records are locked, and the locks are released at the end of the commit phase. The commit time complexity is O(N), where N is the number of modified records, which affects commit performance. Of course, such a design also reflects that system's requirements.

Let’s take a look at TBase’s capabilities for distributed transactions, where we have also made significant improvements on the previous basis.

First, we changed the GTM cluster from issuing global XIDs to issuing a monotonically increasing GlobalTimeStamp (GTS). Based on GTS we designed a new MVCC visibility protocol, including vacuum and related mechanisms. This design offloads the commit protocol from the GTM single point to the individual nodes, reducing the pressure on the GTM, while GTM high availability is achieved through timestamp log replication.

Under this protocol the GTM only needs to allocate global GTS values, which largely resolves the single-point pressure. By our measurements, a Tencent TS85 server can allocate roughly 12 million timestamps per second, which basically covers all distributed load and user scenarios.

As mentioned earlier, the Percolator approach requires locking the modified records, which hurts performance. We optimized the commit protocol by deferring the write of the GTS into the tuple header: when a transaction commits, it does not need to modify the GTS of each tuple; instead the GTS is stored in a corresponding GTS Store File, which also serves as a safeguard for transaction recovery.

On the first scan of the data, the commit state is fetched from the GTS Store File and written back into the tuple header, so subsequent scans no longer need to consult the file. This speeds up both access and transaction processing, and overall gives the database a more efficient design at the transaction level.
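To make the idea concrete, here is a small Python sketch of GTS-based visibility with lazy backfill of the commit timestamp into the tuple header. It is my own simplification; structures such as GtsStoreFile are hypothetical stand-ins, not TBase's real on-disk format:

```python
# Illustrative sketch of GTS-based visibility with lazy backfill.
COMMITTING = None  # placeholder meaning "commit GTS not yet written to the header"

class GtsStoreFile:
    """Maps a transaction id to the GTS it committed at."""
    def __init__(self):
        self._commit_gts = {}
    def record_commit(self, xid, gts):
        self._commit_gts[xid] = gts
    def lookup(self, xid):
        return self._commit_gts.get(xid)

class TupleHeader:
    def __init__(self, xid):
        self.xid = xid               # transaction that wrote the tuple
        self.commit_gts = COMMITTING

def visible(header, snapshot_gts, gts_store):
    """A tuple is visible if its writer committed at a GTS <= the snapshot GTS."""
    if header.commit_gts is COMMITTING:
        gts = gts_store.lookup(header.xid)
        if gts is None:
            return False             # writer has not committed yet
        header.commit_gts = gts      # lazy backfill: later scans skip the lookup
    return header.commit_gts <= snapshot_gts

# Example: transaction 7 wrote a tuple and committed at GTS 100.
store = GtsStoreFile()
store.record_commit(7, 100)
tup = TupleHeader(xid=7)
print(visible(tup, snapshot_gts=120, gts_store=store))  # True, and the header is now backfilled
print(tup.commit_gts)                                   # 100
```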

Next, data distribution, which falls into three main categories.

The first is the replicated table: every storage node holds a complete copy of the data. This suits small tables that change rarely and speeds up join queries.

The second is hash distribution, the classic approach. In simple terms, data is hashed on the distribution column and spread across the storage nodes. If the hash key is poorly chosen, data skew may occur.

The third is range distribution, which breaks data into small fragments by value ranges. Compared with hash, it is less strict and gives the upper-layer nodes more elasticity, but for computation it does not perform as well as hash.

Overall, TBase opts for replicated tables and enhanced hash distribution.
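As a rough illustration of hash distribution, the routing of a row to a datanode by its distribution column can be sketched like this (toy node list and hash function; TBase's real hashing and shard mapping are more involved):

```python
# Minimal sketch of hash distribution: rows are routed to datanodes by
# hashing the distribution column. Illustrative only.
from zlib import crc32

DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def route(dist_key) -> str:
    """Pick the datanode that owns a row, based on its distribution-column value."""
    h = crc32(str(dist_key).encode("utf-8"))
    return DATANODES[h % len(DATANODES)]

rows = [{"f1": i, "f2": f"value-{i}"} for i in range(10)]
placement = {}
for row in rows:
    placement.setdefault(route(row["f1"]), []).append(row["f1"])

print(placement)  # which datanode holds which f1 values; skew depends on the key choice
```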

Next, let's look at distributed queries: push query versus pull data.

Early systems often chose the quicker implementation: storage is split across multiple DNs, and data is pulled up to the CN for computation.

The advantages and disadvantages of this are both obvious: it is straightforward and quick, but the CN becomes a bottleneck and the network comes under heavy pressure. So we prefer the approach on the right in the figure above, which pushes computation down to the DN nodes along with the data.

Ideally, all computation happens on the DNs. During redistribution, DNs need to be able to exchange data directly with each other, a capability that was greatly enhanced in TBase V2. Today TBase pushes computation down to the DN nodes as much as possible.

The figure above shows the difference between SQL shipping and plan shipping.

There are two situations when handling a query or a query plan. In the first, after analysis the CN ships the SQL directly to a DN for execution and is only responsible for collecting the results. This is better when it applies, because there is no need to run a distributed commit protocol across multiple nodes and it uses compute resources more efficiently; several of our OLTP optimizations take this approach.

The other scenario is the more formal plan shipping used in the OLAP domain: the CN builds an overall plan for the query, decomposes it into computation fragments according to how data must be redistributed, and distributes those fragments to the DNs for processing.

As mentioned earlier, OLTP works best when the query can be pushed down to a single DN. Let's take a simple example.

Suppose both tables use F1 as the distribution column and F2 as a data column. If the query joins on the distribution keys and hash distribution is used, the query can be pushed down to the individual DNs: because matching rows are constrained to land on the same DN by the distribution key, no cross-node computation or data redistribution is needed.

The second case is an equi-join on the distribution key combined with a fixed value for the distribution key.

In this case, CN can determine which DN to push to by the value of F1.
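A simplified sketch of that pushdown decision, using hypothetical helper names (the real planner logic is much richer), might look like this:

```python
# Sketch of single-DN pushdown decisions for two simple cases:
#   1) equi-join on the distribution key of both tables
#   2) distribution-key predicate with a fixed value
from zlib import crc32

DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def dn_for(value) -> str:
    return DATANODES[crc32(str(value).encode()) % len(DATANODES)]

def plan(query: dict):
    """Return how the query should be shipped, based on its distribution-key usage."""
    # Case 2: WHERE f1 = <constant> on the distribution column -> exactly one DN.
    if query.get("dist_key_const") is not None:
        return "push_single_dn", dn_for(query["dist_key_const"])
    # Case 1: join on distribution keys -> co-located, each DN handles its own shard
    # with no redistribution at all.
    if query.get("join_on_dist_keys"):
        return "push_per_dn_no_redistribution", None
    return "distributed_plan", None

print(plan({"dist_key_const": 42}))          # routed to exactly one DN
print(plan({"join_on_dist_keys": True}))     # pushed to all DNs, no shuffle
print(plan({"join_on_dist_keys": False}))    # needs a full distributed plan
```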

There are also more complex queries, such as cases where there are subqueries, that are analyzed in a similar way.

Subqueries can get complicated. If it can be determined that multiple levels of subquery all resolve to the same single node, the query can still be shipped to that DN, which benefits OLTP performance and the cluster's overall capacity.

For more complex queries, tuning the optimization configuration may be required.

There are two main approaches: rule-based optimization (RBO) and cost-based optimization (CBO). RBO matches the query against a fixed set of rules to choose a plan. It is an earlier technique; because it does little computation, it is efficient for some scenarios, but its obvious weakness is its lack of flexibility, and it cannot handle complex scenarios.

Most databases now use the CBO approach. Simply put, CBO enumerates candidate paths with dynamic programming and picks the one with the lowest estimated cost as the execution plan. Its advantages are broad applicability, suitability for complex scenarios, and stable performance; its disadvantages are implementation complexity and its preconditions, including statistics collection and the construction of a cost model.

This is not absolute; neither approach simply "wins" over the other, and in practice the two are combined. TBase mainly uses CBO. For example, when a small table is involved in a join, it can be replicated rather than redistributing the data.

Regarding how data redistribution is adjusted in the distributed case, let's use a simple example with two tables, TBL_A and TBL_B.

If both tables join on the distribution column F1, the join can be pushed down and the calculation performed directly on the DNs.

In the middle case, TBL_A joins on its distribution key while TBL_B joins on a non-distribution-key column. Here cost estimation comes in: if TBL_B is small enough, it is replicated to every DN; if TBL_B is large, it is redistributed on the join key instead.
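A toy cost comparison along those lines, with constants invented purely for illustration, could look like this:

```python
# Toy cost model for the middle case: TBL_A joins on its distribution key,
# TBL_B joins on a non-distribution column, so only TBL_B has to move.
# All constants and formulas are made up; a real cost model is far richer.

NUM_DN = 4
NET_COST_PER_ROW = 1.0       # relative cost of moving one row between nodes
SHUFFLE_SETUP_COST = 500.0   # assumed per-DN overhead of setting up a redistribution step

def replicate_cost(rows_b: int) -> float:
    """Broadcast TBL_B in full to every DN."""
    return rows_b * NUM_DN * NET_COST_PER_ROW

def redistribute_cost(rows_b: int) -> float:
    """Re-hash TBL_B on the join key so it lands next to TBL_A's shards."""
    return rows_b * NET_COST_PER_ROW + NUM_DN * SHUFFLE_SETUP_COST

def choose(rows_b: int) -> str:
    return ("replicate TBL_B" if replicate_cost(rows_b) <= redistribute_cost(rows_b)
            else "redistribute TBL_B")

for rows in (100, 1_000_000):
    print(f"{rows:>9} rows -> {choose(rows)}")
# 100 rows -> replicate TBL_B; 1,000,000 rows -> redistribute TBL_B
```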

As mentioned just now, TBase is also fairly strong on OLAP. The optimization here relies mainly on parallel computation, and this full parallelism shows up at several levels.

The first is node-level parallelism: as a distributed database, we can have many nodes or processes computing at once. The second is process-level parallelism: TBase has not switched to a thread model, so within a node parallelism is process-based, building on and enhancing PostgreSQL's parallel query capability. There is also instruction-set-level parallelism, which is being optimized and will continue to be enhanced.

So how does PostgreSQL parallelize a query plan? We started from PG 10, whose parallelism was not very strong: it provided only the basic framework and parallel versions of some operators, so this became one of TBase's optimization targets at the time.

In a distributed deployment there are many machines available for parallel work, yet some operations that parallelize on a single machine could not be parallelized in the distributed case. We therefore enhanced these capabilities to ensure a wider range of parallelism.

Parallel planning is a bottom-up deduction. If a node at the bottom of the plan can run in parallel, the planner recurses upward until it reaches a level that cannot; at that point it inserts a Gather node above the parallel part, collects the results from the multiple worker processes, and continues planning upward. Extending how far this parallelism can reach is one of the things we strengthened.
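The bottom-up idea can be sketched in a few lines of Python (a toy plan tree of my own, not PostgreSQL's planner):

```python
# Tiny sketch of bottom-up parallel planning: push parallelism up the plan tree
# until a node that cannot run in parallel is reached, then insert a Gather node
# there to collect worker results. Purely illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlanNode:
    name: str
    parallel_safe: bool
    children: List["PlanNode"] = field(default_factory=list)

def parallelize(node: PlanNode) -> PlanNode:
    node.children = [parallelize(c) for c in node.children]      # bottom-up
    if node.parallel_safe and all(c.parallel_safe for c in node.children):
        return node                              # the whole subtree stays parallel
    # This node breaks parallelism: put a Gather above each still-parallel child.
    node.children = [
        PlanNode("Gather", False, [c]) if c.parallel_safe else c
        for c in node.children
    ]
    node.parallel_safe = False                   # a Gather below means no parallelism above
    return node

def show(node: PlanNode, depth: int = 0):
    print("  " * depth + node.name)
    for c in node.children:
        show(c, depth + 1)

# Example: a non-parallel Sort on top of a parallel-safe join of two scans.
plan = PlanNode("Sort", False, [
    PlanNode("HashJoin", True, [PlanNode("SeqScan A", True), PlanNode("SeqScan B", True)])
])
show(parallelize(plan))
# Sort
#   Gather
#     HashJoin
#       SeqScan A
#       SeqScan B
```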

Let’s look at some specific optimizations.

In early PG, the hash join could parallelize the outer plan, but the build of the inner hash table was not parallel.

In short, a hash join has two main steps: first build the hash table from the inner side; then read the outer plan's data, compute the hash value, and probe for matches. We parallelized the construction of the inner hash table, so that both the left and right subtrees of the hash join can run in parallel, and parallelism keeps propagating to the nodes above.
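As a sketch of the idea (threads purely for illustration; the real engine uses cooperating backend processes and shared memory), a hash join whose build side is split across workers and then merged might look like this:

```python
# Sketch of a hash join whose build phase is split across workers and merged,
# mirroring the idea of parallelizing the inner hash-table construction.
from concurrent.futures import ThreadPoolExecutor
from collections import defaultdict

def build_partition(rows):
    """Each worker builds a hash table over its slice of the inner relation."""
    table = defaultdict(list)
    for key, payload in rows:
        table[key].append(payload)
    return table

def hash_join(inner, outer, workers=4):
    # Phase 1: parallel build -- split the inner side and hash each slice.
    slices = [inner[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partitions = list(pool.map(build_partition, slices))
    # Merge the per-worker tables into one table.
    table = defaultdict(list)
    for part in partitions:
        for key, payloads in part.items():
            table[key].extend(payloads)
    # Phase 2: probe -- stream the outer side and emit matches.
    return [(o_payload, i_payload)
            for key, o_payload in outer
            for i_payload in table.get(key, [])]

inner = [(i % 5, f"inner-{i}") for i in range(20)]
outer = [(k, f"outer-{k}") for k in range(5)]
print(len(hash_join(inner, outer)))  # 20 matches
```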

The other case is AGG (Aggregation).

In many cases aggregation is two-stage: a partial aggregation is performed on the DNs, and the upper layer then performs the final aggregation over those partial results. When a redistribution sits in the middle, the data has to be brought together on the DNs before the final aggregation.

With multiple levels of subqueries, every level that cannot run in parallel shrinks the overall parallel fraction. So in the hash-distributed case we also redistribute the partial aggregation results to the corresponding upper-layer DN nodes, so that the final aggregation itself can be parallelized.
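A minimal sketch of two-stage aggregation over toy data (my own illustration, not TBase's executor):

```python
# Sketch of two-stage aggregation: each DN computes a partial aggregate over its
# local rows, partial states are merged, and a final aggregation produces the result.
from collections import defaultdict

# Local rows on each DN: (group_key, value)
dn_rows = {
    "dn1": [("a", 1), ("b", 2), ("a", 3)],
    "dn2": [("b", 4), ("c", 5)],
    "dn3": [("a", 6), ("c", 7), ("c", 8)],
}

def partial_agg(rows):
    """Stage 1 on each DN: sum and count per group over local data only."""
    state = defaultdict(lambda: [0, 0])          # key -> [sum, count]
    for key, value in rows:
        state[key][0] += value
        state[key][1] += 1
    return dict(state)

def final_agg(partials):
    """Stage 2: merge partial states (here: compute a global average per group)."""
    merged = defaultdict(lambda: [0, 0])
    for part in partials:
        for key, (s, c) in part.items():
            merged[key][0] += s
            merged[key][1] += c
    return {key: s / c for key, (s, c) in merged.items()}

partials = [partial_agg(rows) for rows in dn_rows.values()]
print(final_agg(partials))   # {'a': 3.33..., 'b': 3.0, 'c': 6.66...}
```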

There are also some optimizations for data transfer.

We mentioned that the redistribution node improves parallelism, but its data sending and receiving also need optimization. Early versions handled this in a single process, so network latency was high and performance suffered; we optimized this part to achieve better parallel execution.

Fourth, an introduction to open source TBase enterprise capabilities

In fact, TBase has made some enterprise capability enhancements, and will continue to make open source contributions and optimizations.

At present, open source TBase can already support multi-site, multi-center, and multi-active deployments, along with security, management, and audit capabilities. TBase sets a high bar for security, and we will continue contributing in this area.

In terms of horizontal scaling, TBase can expand capacity with very little impact perceptible to users. Capacity expansion is a common pain point with large data volumes, and we will continue to improve our capabilities in this area.

In addition, TBase provides its own analysis tables and hot/cold data separation, which work well, reducing user costs and making data distribution more flexible.

TBase released v2.1.0 on July 13 this year, and we continue to build out the open source capabilities: ongoing enhancement of multi-active capability, operations and maintenance, performance, and security, as well as contributions addressing issues discovered by enterprise customers. The release also includes improved statistics and optimization of small-table redistribution. We hope you will continue to follow TBase and have more discussions and exchanges with us.

Fifth, Q&A

Q: What are the requirements for deploying TBase?

**A:** You can get the open source version of TBase; there is fairly clear documentation covering usage and how to build from source. If you want to try it out, an ordinary x86 server or any Linux machine is enough for a simple setup; for enterprise projects, feel free to keep in touch with us.

Q: Why did you choose PostgreSQL?

**A:** In practice the choice comes down to MySQL versus PostgreSQL; let me focus on why we chose PostgreSQL.

One reason is that PostgreSQL's license is friendlier and more flexible, allowing you to modify the code and redistribute it in full.

In addition, its kernel implementation is rigorous and has its own distinctive strengths, with continuous enhancement and a fairly fast release cadence. We followed PostgreSQL's features from early on, and as PostgreSQL has evolved we have iterated on the kernel quickly, applying further tuning based on our knowledge of it.

Q: Is the storage cluster of DN nodes based on Raft? Multi-leader or single-leader?

**A:** At present our DN nodes do not use the Raft protocol; they use primary/standby replication. I know many newer systems build consistency and high availability on Raft or similar replication protocols. However, Raft's multi-replica commit does affect the commit protocol's performance; the overall process has higher latency than the traditional approach. In CAP terms, C is strengthened while A takes some hit.

Since we lean toward OLTP, with high demands on transaction throughput and response latency, we made this choice.

Q: Can you elaborate on how distributed transactions are implemented? How are transactions guaranteed across multiple machines — two-phase commit?

**A:** We basically follow two-phase commit. There are two parts: the two-phase commit control flow and protocol, and the transaction isolation protocol. I focused on MVCC earlier; our commit protocol is essentially an enhanced two-phase commit.

Because a GTM is involved, it differs from the traditional single-machine mode: there is some unified coordination, as highlighted earlier, and the benefit is reduced pressure on the GTM. There is some blocking in the prepare stage, but after optimization its impact is very small, while the pressure on the GTM is greatly reduced.
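A minimal sketch of the overall flow, assuming a GTM that hands out commit timestamps and datanodes that prepare and then commit (purely illustrative; the real implementation builds on PostgreSQL's two-phase commit machinery and the GTM cluster):

```python
# Minimal sketch of a GTS-flavoured two-phase commit coordinator.
class GTM:
    """Hands out a monotonically increasing global timestamp (GTS)."""
    def __init__(self):
        self._gts = 0
    def next_gts(self) -> int:
        self._gts += 1
        return self._gts

class Datanode:
    def __init__(self, name):
        self.name = name
        self.prepared = {}
    def prepare(self, txn_id, writes) -> bool:
        # Phase 1: make the writes durable but not yet visible.
        self.prepared[txn_id] = writes
        return True
    def commit(self, txn_id, commit_gts):
        writes = self.prepared.pop(txn_id)
        print(f"{self.name}: commit {txn_id} at GTS {commit_gts} -> {writes}")
    def abort(self, txn_id):
        self.prepared.pop(txn_id, None)

def run_transaction(txn_id, per_dn_writes, gtm, datanodes):
    # Phase 1: prepare on every participating DN.
    if not all(datanodes[dn].prepare(txn_id, w) for dn, w in per_dn_writes.items()):
        for dn in per_dn_writes:
            datanodes[dn].abort(txn_id)
        return False
    # Phase 2: fetch one commit GTS from the GTM and commit everywhere with it,
    # so all DNs agree on the transaction's position in the global order.
    commit_gts = gtm.next_gts()
    for dn in per_dn_writes:
        datanodes[dn].commit(txn_id, commit_gts)
    return True

gtm = GTM()
dns = {"dn1": Datanode("dn1"), "dn2": Datanode("dn2")}
run_transaction("txn-1", {"dn1": ["row a"], "dn2": ["row b"]}, gtm, dns)
```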

Q: How does the underlying storage meet both row-store and column-store requirements? Or is data stored contiguously in blocks?

**A:** The underlying storage in the open source version is mainly row storage. We will keep enhancing column storage and HTAP to further improve HTAP capability, and will consider rolling this into the open source version once it has gradually stabilized.

Q: What is the minimum server setup required?

**A:** A single-machine setup is fine: one DN, one CN, and one GTM will work. Ideally, deploy two DNs so you can experience a more distributed setup. In our enterprise service we run clusters with a thousand or more nodes, which, together with eliminating the GTM single-point pressure, has improved the cluster's overall scalability. So feel free to start from two nodes and grow to more.

Q: Does GTM timestamp allocation use batching or pipelining? And what about read consistency from the standby replicas that TBase now supports?

**A:** Yes. We have put a lot of optimization into GTM timestamp allocation; simply put, GTS values can be issued in parallel while remaining monotonic. At the current scale, and by our estimates of customer scenarios, an x86 server can allocate roughly 12 million timestamps per second (QPS), and with stronger hardware the capacity is higher, so there is basically no bottleneck in this part.
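As an illustration of the batching idea in general (my own sketch, not TBase's GTM code), workers can grab a whole batch of timestamps under a single lock acquisition instead of locking once per request:

```python
# Rough sketch of batched timestamp allocation: each worker thread grabs a batch
# of GTS values under one lock acquisition, amortizing the lock cost. Values are
# unique and increase within each worker's batch. Illustration only.
import threading

class GtsAllocator:
    def __init__(self, batch_size=1024):
        self._lock = threading.Lock()
        self._next = 0
        self._batch_size = batch_size
        self._local = threading.local()

    def _refill(self):
        with self._lock:                      # one lock round-trip per batch
            start = self._next
            self._next += self._batch_size
        self._local.cur = start
        self._local.end = start + self._batch_size

    def next_gts(self) -> int:
        if getattr(self._local, "cur", None) is None or self._local.cur >= self._local.end:
            self._refill()
        gts = self._local.cur
        self._local.cur += 1
        return gts

alloc = GtsAllocator()
seen = []
def worker():
    seen.extend(alloc.next_gts() for _ in range(10_000))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(seen), len(set(seen)))   # 40000 40000 -- every allocated GTS is unique
```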

Q: What security mechanisms does TBase have?

**A:** From the business point of view, we provide security isolation, plus strong row-level security rules, column-level access control, and data encryption and masking. These move security responsibilities from the application down into the database for enterprises. Many of the capabilities from our enterprise service are already in TBase, and we will iterate further depending on the situation.