Author: Wei Wan

TiDB is a distributed HTAP database. It currently has two kinds of storage nodes: TiKV and TiFlash. TiKV uses row storage, which is better suited to TP workloads; TiFlash uses column storage and excels at AP workloads. TiFlash synchronizes data from TiKV nodes in real time via the Raft protocol, with millisecond-level latency and excellent analytical performance. It supports real-time synchronization of TiKV data updates as well as online DDL. TiFlash joins TiDB's Raft groups as a Raft Learner, so the two kinds of nodes form a single database cluster; queries go uniformly through the TiDB nodes at the upper layer, making TiDB a true HTAP database.

To support real-time updates on column storage, we developed a new columnar storage engine, Delta Tree, for TiFlash. It supports high-TPS writes while still maintaining good read performance. This article explains the design of Delta Tree in detail; discussion is welcome.

The overall architecture

The architecture of Delta Tree borrows design ideas from both the B+ Tree and the LSM Tree. At the top level, Delta Tree splits a table's data into range partitions by primary key; each partitioned data block is called a Segment. Inside a Segment, a hierarchical structure similar to an LSM Tree is used. Partitioning keeps the amount of data per partition small and reduces complexity, which, as we will see later, brings significant advantages.

Segment

The granularity of a Segment is usually around 1.5 million rows, far larger than a leaf node of a traditional B+ Tree. The number of Segments on a machine is usually below 100,000, so we can keep all Segment metadata in memory, which greatly simplifies the engineering. Like B+ Tree leaf nodes, Segments support Split and Merge. In the initial state, a table has only one Segment, whose range is [-∞, +∞].
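To make the structure concrete, here is a minimal, purely illustrative sketch (these are not the real TiFlash class definitions; the names, the integer primary key, and the layer placeholders are assumptions) of the per-Segment metadata that can be kept entirely in memory:

```cpp
#include <cstdint>
#include <limits>
#include <memory>

// Purely illustrative layer placeholders; the real classes are far richer.
struct DeltaLayer {};
struct StableLayer {};

// Hypothetical sketch of per-Segment metadata. The primary key is modeled
// as a 64-bit integer for simplicity.
struct Segment {
    int64_t range_begin;                  // inclusive
    int64_t range_end;                    // exclusive
    std::shared_ptr<DeltaLayer> delta;    // incremental updates (L0)
    std::shared_ptr<StableLayer> stable;  // bulk of the data (L1)
};

// Initial state of a table: a single Segment covering the whole key space.
Segment makeInitialSegment() {
    return Segment{std::numeric_limits<int64_t>::min(),
                   std::numeric_limits<int64_t>::max(),
                   std::make_shared<DeltaLayer>(),
                   std::make_shared<StableLayer>()};
}
```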

Two-level LSM Tree

Within a Segment, we organize data hierarchically, much like an LSM Tree. Since the amount of data in a Segment is far smaller than in other LSM Tree implementations, Delta Tree needs only two fixed levels, the Delta Layer and the Stable Layer, which correspond to L0 and L1 of an LSM Tree respectively. We know that for an LSM Tree, the fewer the levels, the lower the write amplification. By default, Delta Tree's theoretical write amplification (ignoring compression) is about 19x. Because column storage stores values of the same type contiguously, it is naturally friendly to compression algorithms, and the actual write amplification of the Delta Tree engine commonly seen in production is below 5x.

That covers the overall architecture of Delta Tree. Let's now look at the internal design of a Segment in detail.

Pack

The data management unit inside a Segment is the Pack, which usually contains 8K rows or more. The schema of a relational table consists of multiple column definitions, including the column name, column ID, column type, and default value. Because we support DDL operations such as adding columns, dropping columns, and changing data types, the schemas of different Packs may differ. The data inside a Pack is also organized by column, each column essentially being a one-dimensional array. Besides the primary key column (PK) and the columns defined by the schema, a Pack contains a version column and a del_mark column. version is the commit timestamp of the transaction, through which MVCC is implemented; del_mark is a boolean indicating whether the row has been deleted.
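As a rough illustration (field names and types here are assumptions, not the actual TiFlash definitions), a Pack can be pictured as a set of equally long column vectors plus the two hidden columns:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch of a Pack: pure column vectors of equal length.
// A real Pack also carries its schema (column id, name, type, default value).
struct Pack {
    std::vector<int64_t> pk;        // primary key column
    std::vector<uint64_t> version;  // transaction commit timestamp (for MVCC)
    std::vector<uint8_t> del_mark;  // 1 if the row is deleted, 0 otherwise
    // user-defined columns, one vector per column; values modeled as strings
    std::vector<std::vector<std::string>> columns;

    size_t rows() const { return pk.size(); }
};
```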

The point of breaking Segment data into Packs is that the Pack can serve as both the IO unit and the index filtering unit. In data analysis scenarios, data is fetched from the storage engine via Scan. To improve Scan performance, we prefer large IO blocks, so we read one or more Packs of data at a time. In addition, traditional row-level precise indexes are usually not useful in analytical scenarios, but we can still implement simple rough indexes, such as the min-max index, whose filtering unit is also the Pack.
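For example, a min-max index over the primary key only needs one (min, max) pair per Pack; a Scan can then skip whole Packs whose key range does not intersect the query range. A minimal sketch, under the same simplified integer-PK assumption as above:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-Pack min-max statistics on the primary key, used as a rough index.
struct PackMinMax {
    int64_t min_pk;
    int64_t max_pk;
};

// Return the indexes of Packs that may contain rows in [begin, end).
// Packs whose [min_pk, max_pk] does not overlap the range are skipped,
// so their column data is never read or decompressed.
std::vector<size_t> filterPacks(const std::vector<PackMinMax>& stats,
                                int64_t begin, int64_t end) {
    std::vector<size_t> hits;
    for (size_t i = 0; i < stats.size(); ++i) {
        if (stats[i].max_pk >= begin && stats[i].min_pk < end)
            hits.push_back(i);
    }
    return hits;
}
```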

If you are familiar with the architecture of TiDB, you will know that TiKV data is scheduled by Region, a virtual data block partitioned by range. The data in a Delta Tree Pack is sorted in ascending order by the combined (PK, version) key, the same order as the data in TiKV. In this way, TiFlash can seamlessly join a TiDB cluster and reuse the existing Region scheduling mechanism.

Delta Layer

The Delta Layer corresponds to L0 of an LSM Tree. It can be regarded as incremental updates to the Segment, hence the name Delta. Similar to the MemTable of an LSM Tree, the latest data is first written to an in-memory structure called the Delta Cache and flushed to the on-disk Delta Layer when it is full. When the Delta Layer itself is full, it is merged with the Stable Layer (this action is called Delta Merge) to produce a new Stable Layer.
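A highly simplified sketch of this write path follows; the threshold values and the commented function names are invented for illustration and are not the engine's real defaults:

```cpp
#include <cstddef>

// Hypothetical thresholds; the real engine tunes these dynamically.
constexpr size_t kDeltaCacheLimitRows = 8 * 1024;   // flush granularity (~one Pack)
constexpr size_t kDeltaLayerLimitRows = 80 * 1024;  // trigger Delta Merge

struct SegmentState {
    size_t delta_cache_rows = 0;  // rows buffered in memory
    size_t delta_layer_rows = 0;  // rows flushed to the on-disk Delta Layer
};

// Called after appending `rows` new rows to the Delta Cache.
// 1. When the cache is full, flush it to the Delta Layer on disk.
// 2. When the Delta Layer is full, merge it into the Stable Layer
//    (the "Delta Merge" action), producing a new Stable Layer.
void onWrite(SegmentState& s, size_t rows) {
    s.delta_cache_rows += rows;
    if (s.delta_cache_rows >= kDeltaCacheLimitRows) {
        s.delta_layer_rows += s.delta_cache_rows;  // flushDeltaCache()
        s.delta_cache_rows = 0;
    }
    if (s.delta_layer_rows >= kDeltaLayerLimitRows) {
        s.delta_layer_rows = 0;                    // deltaMergeIntoStable()
    }
}
```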

Stable Layer

The Stable Layer corresponds to L1 of an LSM Tree and is where most of a Segment's data lives. It is stored in an immutable file called a DTFile; a Segment has only one DTFile. The Stable Layer also consists of Packs, and the data is sorted in ascending order by the combined (PK, version) key. The difference is that the data in the Stable Layer is globally ordered, whereas the Delta Layer only guarantees order within each Pack. The reason is simple: Delta Layer data is flushed from the Delta Cache, so the ranges of its Packs may overlap, while Stable Layer data is sorted by the Delta Merge action and is therefore globally ordered.

When the total amount of data in a Segment exceeds the configured capacity limit, the midpoint of the Segment's range is used as the split point to split it into two Segments. If two adjacent Segments are both small, they may be merged into one Segment.

Storage

The Delta Layer and Stable Layer of the Delta Tree engine are stored differently on disk. The former uses PageStorage (PS) for storage, while the latter uses DTFile for storage.

PageStorage

PS is similar to an object store: it stores and manages blocks of data called Pages, which are simply byte blobs. It is designed to store the Pack data of the Delta Tree's Delta Layer as well as Segment metadata. PS supports Get, Insert, Delete, and Update operations on Pages, plus a WriteBatch operation for atomic writes. A Page is stored in a PageFile, and one PageFile can hold multiple Pages. We maintain Page metadata in memory in a PageMap, through which we can locate the PageFile and file offset where a Page is stored, supporting random reads of Pages.
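The following is a minimal in-memory sketch of such an interface, not the real PageStorage API: PageFiles are modeled as append-only byte buffers, and the PageMap maps each PageId to the PageFile and offset of its latest version, which is what enables random reads.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using PageId = uint64_t;
using Bytes = std::string;  // a Page is just a blob of bytes

// Where the latest version of a Page lives: which PageFile, at what offset.
struct PageEntry {
    size_t page_file;
    size_t offset;
    size_t size;
};

class PageStorageSketch {
public:
    // A WriteBatch groups puts and deletes so they take effect together.
    struct WriteBatch {
        std::vector<std::pair<PageId, Bytes>> puts;
        std::vector<PageId> deletes;
    };

    void write(const WriteBatch& wb) {
        for (const auto& [id, data] : wb.puts) {
            Bytes& file = page_files_.back();  // current writable PageFile
            page_map_[id] = {page_files_.size() - 1, file.size(), data.size()};
            file += data;                      // append-only write
        }
        for (PageId id : wb.deletes) page_map_.erase(id);
    }

    Bytes read(PageId id) const {              // random read of one Page
        const PageEntry& e = page_map_.at(id);
        return page_files_[e.page_file].substr(e.offset, e.size);
    }

private:
    std::vector<Bytes> page_files_{Bytes{}};   // PageFile 0 is writable
    std::map<PageId, PageEntry> page_map_;     // the in-memory PageMap
};
```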

PS may have one or more writable PageFiles. When a writable PageFile is full, it becomes a read-only PageFile. All updates to Pages are appended to the end of a PageFile; for example, a Page Update operation writes a new Page into a writable PageFile and then points the PageId at the new Page in the in-memory metadata table. Over time, PageFiles accumulate more and more stale data, so we designed a GC thread that merges PageFiles with low utilization in the background, improving read efficiency and reclaiming disk space.
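A sketch of the GC decision (the utilization threshold and structures are illustrative): a PageFile's utilization is the fraction of its bytes still referenced by the PageMap, and files that fall below the threshold have their live Pages rewritten into a fresh PageFile.

```cpp
#include <cstddef>
#include <vector>

// Illustrative per-PageFile statistics derived from the in-memory PageMap.
struct PageFileStat {
    size_t total_bytes;  // bytes ever appended to this PageFile
    size_t valid_bytes;  // bytes still referenced by the PageMap
};

// Pick the PageFiles whose utilization fell below a (made-up) threshold.
// The GC thread rewrites their live Pages into a new PageFile and deletes
// the old files, reclaiming the space taken by stale Page versions.
std::vector<size_t> pickFilesForGC(const std::vector<PageFileStat>& stats,
                                   double min_utilization = 0.5) {
    std::vector<size_t> victims;
    for (size_t i = 0; i < stats.size(); ++i) {
        double util = stats[i].total_bytes == 0
                          ? 1.0
                          : double(stats[i].valid_bytes) / double(stats[i].total_bytes);
        if (util < min_utilization)
            victims.push_back(i);
    }
    return victims;
}
```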

A Pack in the Delta Layer is serialized as a Page and stored in PS. Business queries typically touch only some of a table's columns, so PS also supports reading only part of the data in a Page.

DTFile

A DTFile takes the form of a folder on the file system, and inside the folder is a standard column storage layout, that is, each column is stored in its own file. DTFile is read-only and is read sequentially, which suits Stable Layer data. DTFiles are generated in three cases: Delta Merge, Segment Split, and Segment Merge.

The Stable Layer is suited to DTFile storage, while the Delta Layer is better served by PageStorage, mainly because of the following differences:

  • DTFile stores each column in its own file, so reads and writes of consecutive Packs can be merged into column-level IO, which fits the read and build patterns of the Stable Layer well.

  • Once a DTFile is written, it is read-only. If it were used to store Delta Layer data, the only option would be a one-to-one mapping between Packs and DTFiles, which would create a large number of small files and hurt system performance. PageStorage packs multiple Pages together, so it does not suffer from the small-file problem.

  • PageStorage supports random reads of Pages, which matches the read pattern of the Delta Layer.

  • PageStorage reduces the number of I/Os during writes. Writing a DTFile requires opening as many files as there are columns, and each file incurs at least one IO. Since Stable Layers are regenerated infrequently and multiple Packs can be written together with column-level IO merging, each IO is reasonably large and DTFile storage is appropriate there. The Delta Layer, however, writes only one Pack at a time; if it were stored as DTFiles, each IO unit would be a single column of one Pack. With PageStorage, we can serialize all of a Pack's column data into one Page and write it to disk at once, reducing the number of I/Os (see the sketch below).
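To make the last point concrete, here is a toy sketch of serializing all columns of a Pack into a single Page blob so that the whole Pack can be written with one IO; the length-prefixed format is invented for illustration (and also hints at how part of a Page can later be read selectively):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Toy serialization: concatenate every column of a Pack into one byte blob
// (the Page), each column prefixed with its length. Writing this single blob
// through PageStorage costs one IO, instead of one IO per column as with DTFile.
std::string serializePackToPage(const std::vector<std::string>& column_blobs) {
    std::string page;
    for (const std::string& col : column_blobs) {
        uint64_t len = col.size();
        page.append(reinterpret_cast<const char*>(&len), sizeof(len));
        page.append(col);
    }
    return page;
}
```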

Write optimization

TiFlash mainly targets real-time OLAP scenarios and is designed to support both high-frequency writes and batch writes. High-frequency writes come from synchronizing TiKV's update operations in OLTP scenarios in real time. Batch writes occur when importing data in bulk or right after TiFlash replica synchronization is enabled for a table, when a large amount of data must be written in a short period of time; the Delta Tree therefore also needs to support batch writes. We handle the two scenarios differently to achieve the best results.

Delta Cache

To relieve the IOPS pressure caused by frequent small writes, the Delta Layer of the Delta Tree includes an in-memory cache called the Delta Cache. Updates are first written to the Delta Cache and later flushed to disk. Batch writes do not go through the Delta Cache; that data is written directly to disk.

You may have noticed that the Delta Cache design risks losing data. A storage engine normally needs to write a WAL before writing data and then recover from the last flush point after a restart. But since Delta Tree is a storage engine built for TiDB, it takes full advantage of something TiDB already has: the Raft protocol. In Raft, every update is first written to the Raft log and applied to the state machine (the storage engine) only after a majority of replicas have acknowledged it. So we use the Raft log directly as the WAL and advance the Raft log applied index after each flush.
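A sketch of this recovery contract, with invented names: entries are applied from the Raft log into the Delta Cache, and the persisted applied index is advanced only after a flush, so everything newer than it can be replayed from the Raft log after a restart.

```cpp
#include <cstdint>

// Illustrative sketch: the Raft log itself acts as the WAL.
// Entries applied to the Delta Cache but not yet flushed are exactly the
// entries after `persisted_applied_index`; they are replayed on restart.
struct RaftApplyState {
    uint64_t applied_index = 0;            // last Raft log entry applied in memory
    uint64_t persisted_applied_index = 0;  // advanced only after a flush
};

void onApplyToDeltaCache(RaftApplyState& s, uint64_t log_index) {
    s.applied_index = log_index;           // data now lives in the Delta Cache
}

void onDeltaCacheFlushed(RaftApplyState& s) {
    // Everything up to applied_index is now durable in the Delta Layer,
    // so the Raft log before this point is no longer needed for recovery.
    s.persisted_applied_index = s.applied_index;
}
```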

Continuous write capability

Similar to an LSM Tree, Delta Tree uses background threads to merge the Delta Layer into the Stable Layer while data is being written. The purpose is to keep the Delta Layer's share of the data within a certain proportion so as to maintain good read performance. Delta Merge is the main source of write amplification in Delta Tree: write amplification ≈ average Segment size / Delta limit. Therefore, the higher the Delta Merge frequency, the greater the write amplification; and the smaller the Delta Layer's proportion, the better the read performance.
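Purely for illustration, plugging the numbers quoted earlier into this formula (about 1.5 million rows per Segment and a default theoretical write amplification of about 19x) gives an implied Delta limit on the order of tens of thousands of rows:

```latex
\text{write amplification} \approx \frac{\text{average Segment size}}{\text{Delta limit}}
\qquad\Longrightarrow\qquad
\text{Delta limit} \approx \frac{1.5 \times 10^{6}\ \text{rows}}{19} \approx 8 \times 10^{4}\ \text{rows}
```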

The design philosophy of TiFlash is to preserve write capability as much as possible and to find a balance between write performance and read performance. For example, if the business writes very heavily and the background Delta Merge cannot keep up, a write stall may occur, that is, foreground writes get blocked. In this case, Delta Tree dynamically throttles the write speed and lowers the Delta Merge frequency, reducing write amplification and rebalancing foreground writes against background Delta Merge. This sacrifices some read and write performance, but avoids the situation where writes stop entirely in extreme cases, which is a better experience for the business.

Read optimization

Instead of directly building a columnar LSM Tree, Delta Tree went to the trouble of the design above largely for the sake of read acceleration. The Segment design of Delta Tree reduces read amplification, and the two-level structure within a Segment also helps speed up reads.

Reading data from the storage engine corresponds to the Scan operator in the execution plan. Its main task is to return data sorted by (PK, version) for a given range (such as [x, y)). Three parts of this process are time-consuming, and the Delta Tree design optimizes all of them:

A. The time spent on read I/O and decompressing data.

B. The cost of the multi-way merge itself over multiple sorted data streams, for example with the common min-heap algorithm (a sketch follows this list).

C. Copying data from the multiple input streams into the merged output stream.
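For reference, part B is the classic k-way merge; a minimal sketch with a min-heap over (PK, version) keys (types simplified to integers) shows the per-row heap cost that the Delta Index later amortizes away:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Key = std::pair<int64_t, uint64_t>;  // (PK, version)

// Classic k-way merge of sorted runs with a min-heap: every output row pays
// a heap push/pop of cost O(log k). This per-row overhead is what the Delta
// Index lets the Delta Tree avoid on repeated Scans.
std::vector<Key> mergeSortedRuns(const std::vector<std::vector<Key>>& runs) {
    using HeapItem = std::pair<Key, size_t>;  // (key, run index)
    std::priority_queue<HeapItem, std::vector<HeapItem>, std::greater<HeapItem>> heap;
    std::vector<size_t> pos(runs.size(), 0);
    for (size_t i = 0; i < runs.size(); ++i)
        if (!runs[i].empty()) heap.push({runs[i][0], i});

    std::vector<Key> out;
    while (!heap.empty()) {
        auto [key, run] = heap.top();
        heap.pop();
        out.push_back(key);
        if (++pos[run] < runs[run].size()) heap.push({runs[run][pos[run]], run});
    }
    return out;
}
```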

Reduced read amplification

We know that data in a TiDB cluster is managed by Region. Region replicas are scheduled so that they are reasonably distributed among the cluster's storage nodes (TiKV and TiFlash), achieving load balancing and high data availability. A Region is only a logical block, and the Regions on a node are not contiguous, so when a Scan involves a large number of Regions, read amplification is inevitable. Obviously, the more files a Region's data is spread across, the larger its read amplification; in other words, the more layers, the larger the read amplification. Delta Tree uses Segment partitioning to reduce the amount of data per partition and the number of layers, thereby reducing read amplification and optimizing part A.

The read index: Delta Index

Since multi-way merging is time-consuming, can we avoid redoing it on every read? The answer is yes; in fact, some in-memory databases already apply a similar idea. The specific idea is: after the first Scan completes, we save the information generated by the multi-way merge so that the next Scan can reuse it, and only the incremental part needs to be processed. This reusable information is called the Delta Index, and it is implemented as a B+ Tree. The design of the Delta Index went through several iterations and referenced many existing database solutions.

  1. We start by laying the Stable Layer and the Delta Layer side by side, so that every row of data has a Tuple Id that can be indexed. Then, following an idea similar to a secondary index, we build a B+ Tree in memory. The Entries in the B+ Tree leaf nodes are also sorted in ascending (PK, version) order. The content of an Entry is ((PK, version), is_insert, Tuple Id), where is_insert indicates whether the entry is an Insert or a Delete.

  2. In this scheme, every row in the Segment needs a corresponding Entry, which puts a lot of pressure on memory, so the index needs to be compressed. We observe that (PK, version) is not actually necessary: since we have the Tuple Id, it can be fetched from the raw data and does not need to be stored in the Delta Index. This is critical for scenarios with large PKs.

  3. The Tuple Ids contain many consecutively increasing runs, because most of the data in a Segment is in the Stable Layer. Therefore, N consecutive Stable entries can be represented by a single (Tuple Id, N).

  4. Going further, the Tuple Ids of the Stable Layer data must be increasing, because the Stable Layer is globally sorted. So we only need to record entries for the Delta Layer data and, in addition, how many Stable tuples sit between two Delta entries. To do so, the Tuple Ids of the Stable Layer and the Delta Layer must be distinguished; they are called the Stable Id and the Delta Id respectively.

The final Entry format is (is_insert, Delta Id, (Stable Id, N)), and the number of Entries equals the number of rows in the Delta Layer. The Delta Layer generally holds less than 3% of the total data, and an Entry needs only 16 bytes, so for 10 billion rows on a single node the memory cost is roughly 10 billion × 3% × 16 bytes ≈ 4.5 GiB, which is acceptable.

With the Delta Index, a Scan can easily merge the Delta Layer and the Stable Layer into a single ordered stream. There is a slight problem: new data may be written into the Delta Layer between two Scans, so we still need to process that new Delta data and update the Delta Index. Since each Scan can reuse the result of the previous one, the cost is amortized; that is, this optimizes part B.
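The following sketch shows one simplified way such an entry stream could be replayed into a merged stream of row references; it handles only inserted Delta rows (deletes are omitted), and the entry layout and names are assumptions rather than the real B+ Tree node format:

```cpp
#include <cstdint>
#include <vector>

// Simplified Delta Index entry: "N Stable rows starting at stable_id come
// first, then Delta row delta_id is inserted". The real Delta Index stores
// its entries in a B+ Tree and also describes deletes.
struct DeltaIndexEntry {
    uint64_t delta_id;   // row id within the Delta Layer
    uint64_t stable_id;  // first Stable row id of the preceding run
    uint64_t n_stable;   // length of the preceding Stable run
};

// A row reference in the merged output stream.
struct RowRef {
    bool from_delta;
    uint64_t id;  // Delta Id or Stable Id
};

// Replay the entries to produce the merged, ordered stream of row references
// without re-running the k-way merge. Contiguous Stable runs can be copied
// in large batches, which is where most of the CPU savings come from.
std::vector<RowRef> replayInsertsOnly(const std::vector<DeltaIndexEntry>& entries,
                                      uint64_t total_stable_rows) {
    std::vector<RowRef> out;
    uint64_t next_stable = 0;
    for (const auto& e : entries) {
        for (uint64_t i = 0; i < e.n_stable; ++i)   // batch-copy a Stable run
            out.push_back({false, e.stable_id + i});
        next_stable = e.stable_id + e.n_stable;
        out.push_back({true, e.delta_id});          // then the inserted Delta row
    }
    for (uint64_t i = next_stable; i < total_stable_rows; ++i)
        out.push_back({false, i});                  // trailing Stable rows
    return out;
}
```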

For part C, the main optimization of the Delta Tree is that it can batch-copy contiguous Stable Layer data to save CPU cycles. The multi-way merge of an LSM Tree can of course also batch-copy, but it is harder to implement there. In addition, the data volume ratio between adjacent layers of an LSM Tree is about 10x, while in a Delta Tree it is usually more than 37x, so batch copying is more effective in a Delta Tree.

Delta Tree vs LSM Tree

At the beginning, TiFlash tried an implementation based on the LSM Tree architecture, but we later found that its read performance could not meet our requirements, and there were other problems as well; that is why we started the Delta Tree project. The following compares the Scan time of Delta Tree and an LSM-Tree-based storage engine under different numbers of tuples and different update TPS (transactions per second).

The following compares the elapsed time of different SQL queries on the OnTime data set. You can see that Delta Tree has an advantage in most cases, not only because of better Scan performance, but also because it supports rough indexes (such as min-max) that avoid reading irrelevant data.

And here is TiFlash with the Delta Tree engine compared against other databases.

How to handle transactions

TiDB clusters with TiFlash replicas enabled still support transactions and guarantee the same isolation level as TiKV, which is rare among AP databases; the details are in the TiDB documentation. Some of you might be curious about how TiFlash achieves this, so here is a brief introduction to the most important points. If you are interested, you can check out our paper "TiDB: A Raft-based HTAP Database", which was just published at VLDB.

TiFlash currently only synchronizes TiKV changes in real time and does not serve data for in-flight transactions, so it does not need to provide point-lookup capability, which sidesteps a weakness of column storage.

As in TiKV, every row in the Delta Tree carries a version field, which is the commit_ts of the transaction that wrote it. During a query, the data can therefore be filtered by query_ts so that the snapshot contains only versions with version <= query_ts, which implements MVCC.
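A minimal sketch of this MVCC filtering over rows already sorted by (PK, version), with simplified row-based types (the real engine works on column vectors):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Row {
    int64_t pk;
    uint64_t version;  // commit_ts of the writing transaction
    bool del_mark;     // true if this version deletes the key
};

// Given rows sorted ascending by (pk, version), keep for each pk the newest
// version with version <= query_ts, and drop it if that version is a delete.
// This yields the snapshot the query is allowed to see (MVCC).
std::vector<Row> mvccFilter(const std::vector<Row>& sorted, uint64_t query_ts) {
    std::vector<Row> snapshot;
    for (size_t i = 0; i < sorted.size(); ++i) {
        const Row& r = sorted[i];
        if (r.version > query_ts)
            continue;  // written after the query's snapshot: invisible
        bool newer_visible_exists =
            i + 1 < sorted.size() && sorted[i + 1].pk == r.pk &&
            sorted[i + 1].version <= query_ts;
        if (newer_visible_exists)
            continue;  // superseded by a newer visible version of the same key
        if (!r.del_mark)
            snapshot.push_back(r);
    }
    return snapshot;
}
```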

The distributed transaction model used by TiDB, Percolator, requires locks during transactions, and these locks generally need to be persisted. TiKV can easily persist these locks into its own store, RocksDB, but TiFlash's columnar Delta Tree cannot efficiently support high-TPS, low-latency reads and writes in a key-value style. TiFlash's current solution is very simple: since these locks are hard to persist efficiently, we keep them in memory. Only changes that have been committed are written to the Delta Tree store; uncommitted transactions stay in memory. Note that there is no need to worry about losing lock data or uncommitted data; two mechanisms guarantee this:

  1. As mentioned earlier, all data first lands in the Raft log. The parts of the in-memory state that have not been persisted can be recovered from the Raft log after a restart.

  2. The in-memory state, including locks and uncommitted data, is periodically snapshotted to disk, after which the Raft log applied index is advanced. After a restart, the snapshot is loaded from disk back into memory.

Conclusion

There has always been a huge technology gap between online business systems and analytics systems, because data updates do not flow between them smoothly in real time. The core reason is that the storage layer of analytical databases is usually poor at handling updates; just imagine the pain of synchronizing MySQL binlogs into Hive in real time. The Delta Tree columnar storage engine solves this problem, allowing the column storage best suited for analytical scenarios to be updated in real time.

By introducing TiFlash, TiDB integrates row storage and column storage in one database system, making real-time analysis of business data very simple. You can even use both row and column data in a single SQL statement while keeping the data strictly consistent. In TiDB, all it takes is one statement: ALTER TABLE my_table SET TIFLASH REPLICA 1;