Improving TiDB's cross-data center deployment capability is an important focus of the upcoming TiDB 5.0 release. Among the related work, the new distributed local transaction capability and the corresponding redesign of the timestamp service form the foundation. This article starts from TiDB's existing timestamp service, explains step by step how the new distributed timestamp service was built and how local transactions perform, and finally shares an application scenario and hands-on steps for users who want to try the new feature.

Single-point TSO service

As a NewSQL distributed database, TiDB shows the qualities of a distributed system throughout: high availability and nearly unlimited horizontal scaling. Since the project committed to a shared-nothing architecture, these advantages come naturally. But if you look closely at a TiDB cluster, you will find that in this otherwise fully distributed system, one component stands out as particularly "unique": the cluster's TSO timestamp service.

The global timestamp server is called TSO (Timestamp Oracle). Despite the "Oracle" in its name, it has nothing to do with the Oracle database; the word is used in its original sense of an authoritative, unquestionable source. A TSO service guarantees that timestamps are allocated in increasing order and that no two callers ever receive the same timestamp. It is often used to order events in distributed systems, and its most common and important role is to guarantee monotonically increasing transaction version numbers and thus the ordering of distributed transactions. TSO is simple to implement, strictly ordered, and performs well, so many distributed systems adopt it as their clock scheme, and TiDB is one of them. In a TiDB cluster, the PD leader node acts as a global, single-point timestamp service, allocating timestamps and providing solid building blocks for the cluster's distributed transactions, isolation levels, version control, and more. PD's TSO service is battle-tested and can easily reach millions of QPS. But notice that the TSO service is a single point in the cluster, and a large distributed system that depends on a single point naturally has some problems.

The first is single point of failure. The whole system depends on the availability of TSO, which is largely addressed by the high availability of PD itself; meanwhile, etcd's strongly consistent persistent storage ensures that TSO does not roll back after a service switchover. However, because PD leader election is built on etcd's own Raft implementation, whenever a network partition or other event triggers a re-election, a window of unavailability during the election is still unavoidable.

Then there is the issue of latency. To obtain a timestamp, the PD client must communicate with the PD leader. As shown in the figure below, if the TiDB instance and the PD leader are not in the same data center, the latency is high; in a globally distributed deployment the cross-center latency becomes very large, which significantly affects cluster performance.

These two problems, especially the latter, greatly restrict deploying TiDB across data centers, and the single-point TSO timestamp service is a key factor. Imagine every transaction having to tolerate the latency of a cross-center timestamp request. Moreover, TiDB uses Percolator as its transaction model, so in write scenarios each transaction needs at least two timestamp acquisitions, one at Prewrite and one at Commit; under a heavy OLTP load, a PD leader sitting in another data center clearly becomes a bottleneck for the whole cluster because of network latency. TiDB 5.0 does include many optimizations for write scenarios: one of the two TSO acquisitions mentioned above can be saved by enabling the 1PC and Async Commit features, which improves performance. But these two features do not fundamentally eliminate the cross-center network latency. How can we optimize at the root, or even eliminate this latency entirely?

Cross-center deployment issues

In a cross-data center deployment, the key factor affecting TiDB performance is the high network latency between data centers. Combined with how TiDB works, this shows up in the following ways:

  • TiDB and the Region leader are not in the same data center

  • The leader and follower of a Region are in different data centers

  • TiDB and PD Leader are not in the same data center

For these reasons, users often set scheduling policies to concentrate service requests, Region leaders, and the PD leader in the same data center. The disadvantage is that only one data center serves traffic and hardware utilization is low, which is clearly a compromise rather than an optimal solution.

To serve multiple data centers, some users choose to split the database by line of business, deploying a separate cross-data-center cluster for each line of business and letting different lines of business serve from different data centers. This approach has two main drawbacks: operating and maintaining multiple clusters is more cumbersome, and transactions across clusters are not supported.

Based on this, we want a solution in which multiple lines of business run inside one cluster and serve from different data centers; when a transaction only touches local data, it does not pay the cross-data-center latency, and when a transaction reads and writes data in multiple data centers, transactional consistency is not broken. In other words, TiDB's transaction consistency model needed to be reworked to fulfill the important mission of balancing performance while letting a single cluster be deployed across data centers.

Distributed consistency

As a strongly consistent HTAP database, TiDB provides database services with strong consistency (linearizability) and high availability based on the Raft consensus algorithm, and its isolation level is Snapshot Isolation (called Repeatable Read for compatibility with MySQL). Although Serializability is the most ideal and strict concurrency control scheme, implementing it efficiently is hard, so TiDB, like many established database systems (PostgreSQL, Oracle, Microsoft SQL Server), uses Multi-Version Concurrency Control (MVCC) to deal with concurrency. The globally unique, strictly increasing timestamps provided by the TSO service serve as the version numbers in MVCC. In a TiDB cluster, a TSO timestamp is an int64 integer consisting of two parts, physical time and logical time:

  • Physical time is the current Unix system timestamp (in milliseconds)

  • Logical time is a counter in the range [0, 1 << 18), so each millisecond of physical time can be further subdivided into up to 262,144 TSOs (see the sketch after this list)
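
As a rough illustration of this layout (a sketch in Go based only on the description above, not PD's actual code), composing and decomposing such a timestamp is plain bit arithmetic:

package main

import "fmt"

const logicalBits = 18 // the logical part occupies the low 18 bits

// composeTSO packs a physical timestamp (Unix milliseconds) and a logical
// counter into a single int64, as described above.
func composeTSO(physicalMs, logical int64) int64 {
	return physicalMs<<logicalBits | logical
}

// decomposeTSO splits a TSO back into its physical and logical parts.
func decomposeTSO(tso int64) (physicalMs, logical int64) {
	return tso >> logicalBits, tso & ((1 << logicalBits) - 1)
}

func main() {
	tso := composeTSO(1617181920000, 42)
	p, l := decomposeTSO(tso)
	fmt.Println(p, l) // 1617181920000 42
}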

As a single point, the TSO service easily guarantees a total order over timestamps, with physical time tracking natural time and logical time providing finer granularity. Additional mechanisms ensure high availability and that timestamps never move backwards due to system clock skew, as described in our previous blog post on the TSO design.

A single point works well, but the high network latency between the TSO service and other nodes in the cross-center scenario described above keeps us from a good cross-data center deployment. In practice, user workloads usually have data-center locality: apart from a small number of transactions that read and write data across data centers, transactions originating from a data center usually only touch data in that data center. For these transactions, the TSO service only needs to guarantee linear consistency within the current data center, as shown in the figure below.

As a result, cross-data-center latency is no longer a burden for most transactions, and for global transactions we can still have a mechanism that synchronizes TSO across all data centers so that neither global nor local linear consistency is broken. In short, the former single-point TSO service becomes distributed while preserving consistency guarantees both within each data center and globally. This naturally leads to Local TSO, a new PD feature in TiDB 5.0, which surfaces to users as TiDB's new local transaction capability.

As shown in the figure above, TiDB checks the location properties of data during transaction execution through TiKV's location labels and the Placement Policy feature, ensuring that local transactions may only read and write data in the local data center. Global transactions have no such limitation and may read and write any data in the cluster. Local transactions get their timestamps from the Local TSO, which lives in the same data center as the TiDB instance executing the transaction, so there is no cross-center latency. Global transactions are served by the Global TSO, which has to be computed through cross-center synchronization.

Distributed TSO

As mentioned above, the PD leader has long provided the TSO service as a single point. With the concept of a Local TSO, however, each data center needs its own Local TSO dispenser, so we naturally decoupled the TSO service from the PD leader and let it run as a separate role in the cross-center PD cluster; this role is called the TSO Allocator. From the scenario above we can abstract two kinds of TSO:

  • Local TSO: serves local transactions, which only access data within a single data center, and guarantees linear growth within that data center; it is allocated by the Local TSO Allocator.

  • Global TSO: serves global transactions, which may access data in any data center, and guarantees linear growth across the entire cluster; it is allocated by the Global TSO Allocator.

Linear consistency

Timestamps issued by a single TSO service are totally ordered: any two of them can be compared. Any two Local TSOs, however, are no longer totally ordered, for a simple reason: they may come from different data centers, and differing system time and logical progress between nodes means the Local TSOs advance at different paces, so they cannot be meaningfully compared. Since local transactions are by design limited to data within a single data center, we can keep the previous TSO design. The only differences are that each data center's PD nodes elect their own Local TSO Allocator to provide the Local TSO service (think of it as every data center running its own single-point TSO service) and that the service's scope is restricted (only transactions initiated in that data center may request its Local TSO), as explained later in the availability section.

As for the Global TSO, it serves global transactions, which have no data access constraint. This means a Global TSO must be comparable, in a total order, with all Global TSOs and Local TSOs allocated before it. That is, for the Global TSO timestamp t held by any global transaction, we require:

  • In a natural time sense, the timestamp t1 held by any transaction that occurred before t satisfies t1 < t

  • In a natural time sense, the timestamp t2 held by any transaction that occurs after t satisfies t < t2

Only when these constraints are met is a Global TSO safe, guaranteeing linear consistency at the global scope even with multiple Local TSOs. To achieve this, the Global TSO is computed by an algorithm rather than stored and advanced like the previous single-point TSO. The algorithm works as follows:

  1. The Global TSO Allocator communicates with every Local TSO Allocator in the cluster and collects the maximum Local TSO that each has allocated.

  2. The Global TSO Allocator sends the largest collected Local TSO, as MaxTSO, to all Local TSO Allocators for synchronization.

  3. On receiving MaxTSO, each Local TSO Allocator checks whether MaxTSO is greater than its own current Local TSO progress. If so, it updates its in-memory Local TSO to MaxTSO and returns success; if not, it returns success without updating.

  4. The Global TSO Allocator collects all the successful replies and returns MaxTSO as the Global TSO.

Through this two-phase collect-and-write synchronization, the Global TSO Allocator guarantees that a Global TSO is the largest timestamp in the cluster at that moment and that no smaller TSO will be issued afterwards. For compatibility and design simplicity, unlike the Local TSO Allocators, the Global TSO Allocator is not elected independently: the PD leader still serves externally as the Global TSO Allocator.
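
The following Go sketch models the collect-then-write flow described above with a toy in-memory allocator; all names and types here are illustrative, not PD's actual implementation:

package main

import (
	"fmt"
	"sync"
)

// localTSOAllocator is a toy in-memory stand-in for a per-data-center Local
// TSO Allocator.
type localTSOAllocator struct {
	mu      sync.Mutex
	current int64 // largest Local TSO allocated so far
}

func (a *localTSOAllocator) maxAllocated() int64 {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.current
}

// syncMaxTSO advances the in-memory TSO to maxTSO if it is larger (step 3 of
// the algorithm above); either way the call succeeds.
func (a *localTSOAllocator) syncMaxTSO(maxTSO int64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if maxTSO > a.current {
		a.current = maxTSO
	}
}

// generateGlobalTSO performs the two-phase collect-and-write described above
// and returns MaxTSO as the Global TSO.
func generateGlobalTSO(allocators []*localTSOAllocator) int64 {
	var maxTSO int64
	for _, a := range allocators { // phase 1: collect
		if ts := a.maxAllocated(); ts > maxTSO {
			maxTSO = ts
		}
	}
	for _, a := range allocators { // phase 2: synchronize
		a.syncMaxTSO(maxTSO)
	}
	return maxTSO
}

func main() {
	dcs := []*localTSOAllocator{{current: 100}, {current: 250}, {current: 180}}
	fmt.Println(generateGlobalTSO(dcs)) // 250: no later Local TSO can be smaller
}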

In short, for the Local TSO transformation, PD dropped the original design of "TSO allocated solely by the PD leader" and made TSO Allocator election and Local TSO allocation distributed, with the data center as the unit. The Global TSO is the result of an algorithm that synchronizes with all Local TSO Allocators; in other words, the PD leader node and all Local TSO Allocator nodes reach a consensus, and that consensus is returned as the allocated Global TSO.

Global uniqueness

During execution, a TiDB instance uses the StartTS acquired at the beginning of a transaction as the transaction ID for storing related state. In some corner cases, for example when local and global transactions execute at the same time, or even when local transactions from two different data centers execute on the same TiDB instance at the same time, duplicate TSOs are possible because the Local TSOs advance independently. This not only leads to undefined behavior in TiDB due to identical transaction IDs, but also risks breaking consistency.

Our solution is admittedly "crude": the logical part of a TSO has 18 bits, so for a three-data-center cluster we only need to reserve the low 2 bits of the logical part as a suffix. The suffix uniquely distinguishes the Local TSOs of the different centers from the Global TSO, guaranteeing that every TSO is globally unique.
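
As a sketch of the idea (the exact bit layout PD uses may differ), reserving the low bits of the logical part as a data-center suffix could look like this in Go:

package main

import "fmt"

const (
	logicalBits = 18
	suffixBits  = 2 // enough to tell three data centers plus the global allocator apart
)

// composeTSOWithSuffix packs physical time, a logical counter, and a
// data-center suffix into one int64. The suffix occupies the lowest bits of
// the logical part, so timestamps issued by different allocators can never
// collide. Note that the usable counter range shrinks from 2^18 to 2^16,
// which is exactly the trade-off discussed below.
func composeTSOWithSuffix(physicalMs, logical, suffix int64) int64 {
	return physicalMs<<logicalBits | logical<<suffixBits | suffix
}

func main() {
	// The same physical and logical progress in two data centers still yields
	// distinct timestamps thanks to the suffix.
	fmt.Println(composeTSOWithSuffix(1617181920000, 7, 1)) // e.g. a Local TSO from DC-1
	fmt.Println(composeTSOWithSuffix(1617181920000, 7, 2)) // e.g. a Local TSO from DC-2
}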

To be sure, this isn’t a perfect solution, given the thousands of data centers that TiDB’s father, Google Spanner, has deployed on a global scale. TSO may run out of logical bits too fast in high-intensity requests and accelerate the advancement of physical time, resulting in deviation from the real time. But for now, it’s still a good “technical debt” solution for the complexity and necessity of the implementation.

Availability

As a service in a distributed cluster, the Local TSO Allocator naturally needs to be highly available, so it is elected. The mechanism is built on etcd transactions: all PD candidates try to write a special key through a single etcd transaction, and the one whose write succeeds becomes the Local TSO Allocator of that data center, much like the PD leader election. Although by default any PD node may campaign for the Local TSO Allocator of a data center, we also enforce an election priority: in most cases the Local TSO Allocator of a data center is chosen from the PD nodes in that data center, and only in extreme cases (for example, when all PD nodes of a data center are down) will PD nodes in other data centers serve it.
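
The following Go sketch shows such a campaign-style election with etcd's transaction API; the key layout, TTL, and function names are made up for illustration, and PD's real election logic additionally handles priorities and lease keep-alive:

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// campaignLocalAllocator tries to become the Local TSO Allocator for dcName by
// writing a special key in a single etcd transaction: the write only succeeds
// if the key does not exist yet, so exactly one candidate wins.
func campaignLocalAllocator(cli *clientv3.Client, dcName, memberName string) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	// Attach the key to a lease so the role can be re-elected if the winner
	// goes down (the TTL value is illustrative).
	lease, err := cli.Grant(ctx, 10)
	if err != nil {
		return false, err
	}

	key := "/tso/allocator/" + dcName // hypothetical key layout
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.CreateRevision(key), "=", 0)). // key must not exist yet
		Then(clientv3.OpPut(key, memberName, clientv3.WithLease(lease.ID))).
		Commit()
	if err != nil {
		return false, err
	}
	return resp.Succeeded, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{Endpoints: []string{"127.0.0.1:2379"}})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	won, err := campaignLocalAllocator(cli, "dc-1", "pd-1")
	fmt.Println(won, err)
}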

As mentioned above, the Global TSO Allocator is still served by the PD leader, so its availability comes with the PD leader's own high availability.

Performance trade-offs

After implementing the feature, we used Chaos Mesh in a Kubernetes environment to compare the performance of ordinary transactions (the old single-point TSO), Local TSO, and Global TSO. Chaos Mesh injected a 50 ms round-trip delay between nodes to simulate a cross-center deployment. The results were very promising: compared with the old single-point TSO, which pays the cross-center delay, local transactions using the Local TSO improved performance dramatically, with up to a 25 to 30 times difference in QPS in the Sysbench oltp_write_only test. The figure below shows the TPS of oltp_write_only under a 50 ms cross-center delay.

As the Global TSO algorithm itself implies, guaranteeing linear growth at the global scope requires at least two rounds of communication with the Local TSO Allocators in all centers, so it bears the cross-center delay in full and a significant performance loss is unavoidable. In our tests, the QPS of global transactions dropped by about 50% compared with the old single-point TSO service. However, the natural locality of data-center data offsets this shortcoming: if the business is dominated by local transactions and global transactions are rare or not latency-sensitive, the trade-off is well worth it. We are also further optimizing Global TSO acquisition latency, for example by estimating a MaxTSO to reduce communication between PDs within the cluster, so that it approaches the latency of the old single-point TSO service as closely as possible.

Application scenarios

When can you, as a user, use local transactions? In summary, local transactions fit the following requirements and business characteristics:

  • You want to deploy a single cluster across data centers with high latency between data centers

  • Data access is data-center-local, that is, most transactions only read and write data within their own data center, and only a few transactions need to read and write across data centers.

  • The business is insensitive to the latency of transactions that read and write across data centers and can accept their higher latency.

Under these conditions, consider three data centers DC-1, DC-2, and DC-3 with the PD leader elected in DC-1. Without local transactions, the PD leader provides the single-point TSO service to the whole cluster; with local transactions enabled, when TiDB instances in DC-2 and DC-3 read and write data in their own data centers, they avoid the latency cost of fetching TSO timestamps from DC-1. Each data center has its own TSO Allocator serving local transactions, and synchronization between data centers only happens when a cross-data-center transaction is needed. In this way, most local transactions stay at low latency, and the high latency cost is only borne by the few global transactions.

The local transaction feature will be available as an experimental feature in version 5.0. If you want to try TiDB's new cross-data-center deployment capability, you can follow these steps.

Instance topology configuration

First, we need to partition the TiDB, PD, and TiKV instances in the cluster at the data center level. Suppose we have three data centers named DC-1, DC-2, and DC-3, and we plan 1 TiDB, 3 PD, and 3 TiKV per data center. We use labels on these three components to describe which data center each instance belongs to. Take DC-1 as an example:

For TiKV, the topology label needs to be configured first. Currently we require a zone label as the data-center-level description; for detailed steps, refer to the documentation on replica scheduling by topology labels. After labeling the three TiKV instances of DC-1 with zone=dc-1 and setting the corresponding location-labels on PD, the TiKV configuration is complete.

For PD, we only need to set the same zone="dc-1" in the labels configuration of these three PD instances and turn on the enable-local-tso switch.

For TiDB, as with PD, set zone="dc-1" in the labels configuration. A TiDB instance without a zone label defaults new sessions to global transactions; a TiDB instance with a zone label defaults to local transactions. You can also manually set the session variable @@txn_scope to switch between the "local" and "global" modes.
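
For example, a client speaking the MySQL protocol can switch a single session between the two modes explicitly. The sketch below assumes the go-sql-driver/mysql driver and an illustrative DSN; the variable itself works as described above:

package main

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	ctx := context.Background()

	// DSN is illustrative: a TiDB instance labeled zone=dc-1, default port.
	db, err := sql.Open("mysql", "root@tcp(tidb-dc-1:4000)/test")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Session variables only stick to a single connection, so pin one instead
	// of executing against the pooled db handle.
	conn, err := db.Conn(ctx)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Force this session into local-transaction mode; timestamps will come
	// from the Local TSO Allocator of this instance's own data center.
	if _, err := conn.ExecContext(ctx, "SET @@txn_scope = 'local'"); err != nil {
		panic(err)
	}

	// ... reads and writes limited to data placed in this data center ...

	// Switch the same session back to global transactions when it needs to
	// touch data in other data centers.
	if _, err := conn.ExecContext(ctx, "SET @@txn_scope = 'global'"); err != nil {
		panic(err)
	}

	fmt.Println("txn_scope switched for this session")
}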

Data location configuration

After partitioning the instances, we also need to place the data so that TiDB can correctly check the data constraints and obtain the Local TSO from the correct PD, ensuring the consistency of local transactions within a center. This is configured through Placement Policy. Currently Placement Policy only supports partition-level control, so different partitions need to be created for different data centers when the table is built. For example, suppose we have table t1 as follows:

CREATE TABLE t1 (c int)
PARTITION BY RANGE (c) (
    PARTITION p1 VALUES LESS THAN (100),
    PARTITION p2 VALUES LESS THAN (200),
    PARTITION p3 VALUES LESS THAN MAXVALUE
);

Suppose the data of partition p1 corresponds to DC-1. We can constrain partition p1 with the following SQL:

ALTER TABLE t1 ALTER PARTITION p1 ADD PLACEMENT POLICY CONSTRAINTS='["+zone=dc-1"]' ROLE=leader REPLICAS=1;

After this setting, TiDB instances in DC-1 can access data in partition p1 through local transactions. Access to any partition without a zone constraint, or to partitions constrained to other data centers, will be rejected.

Once the topological division of component instances and data is complete, TiDB instances in the corresponding data center perform local transactions by default; they also automatically recognize DDL operations and run them as global transactions to keep schema information and other metadata consistent across the whole cluster.

Afterword

The new TSO service is still far from perfect. To tackle this biggest pain point of cross-center deployment, we made many trade-offs in the technical solution and many improvements on top of the existing design. Beyond what is written here, a number of questions remain: how do we optimize the Global TSO generation process to further reduce its cost? How do we handle network partitions to further improve the availability of the TSO service? How do we make configuring local versus global transactions more convenient and elegant? These are things we should be doing and are doing, and you can look forward to TiDB getting better every day. This has been a first look at the distributed timestamp service and local transaction features behind TiDB 5.0's cross-center deployment capability.