HTAP (Hybrid Transactional/Analytical Processing) is a technical term that has attracted increasing attention in recent years. It describes a database that can meet both Transactional and Analytical tasks. TiDB 4.0 is a special design and architecture enhancement for HTAP. This time, I will bring you a paper interpretation on the topic of VLDB 2020 HTAP. What is more special is that this paper is written by PingCAP, about TiDB HTAP architecture. So this interpretation is written from the perspective of a team of authors. The original text is here, welcome correction.

To the point

This paper introduces the architecture and design of TiDB as a whole. Students who are interested in TiDB are recommended to read it completely, which will be of great help to understand the architecture. But since the focus is on HTAP, it seems to me that these three points are important:

  1. Real-time updated column storage
  2. Multi-raft replication system
  3. Intelligently select row/column stores based on business SQL

I’ll focus on these three parts later.

Say first store

TP and AP have traditionally relied on different storage formats: row storage corresponds to OLTP and column storage corresponds to OLAP. However, the advantages and disadvantages of the two are not obvious In Memory, so Hasso Plattner, author of SAP Hana, proposed to use in-memory + column storage technology to process OLTP and OLAP simultaneously. Then in 2014, Gartner proposed the concept of HTAP, which is also focused on memory computing. The key message here is that column storage is not suitable for TP-class scenarios. This is probably common sense for many people, but perhaps not everyone has thought about why column storage is not suitable for TP.

Fast data access depends on Locality, which simply means that you want to place data to be read and written as much as possible together based on your access mode. Data that is not together requires additional Seek and is less Cache efficient. Row storage and column storage, removing encoding and compression factors, essentially provide different Locality of data for different access modes. Row memory keeps the data in the same row together so that you can access a whole row at once and get good speed. A column store puts data from the same column together so that reads of only one column at a time are accelerated; Columns, on the other hand, are traditionally slow to update, partly because Naive methods of breaking a row into multiple columns to write to the desired location can lead to catastrophic write speeds. These effects are evident on disk, but are attenuated in memory, so when we think of HTAP over the years, the first thing that comes to mind is an in-memory database.

Although memory prices are falling, they are still expensive. While analysts tout the architectural simplicity that HTAP brings as a way to reduce overall costs, the reality is that in-memory databases are still only being used in a few specific areas: architects still need to convince their bosses that the benefits of HTAP are really worth using in-memory databases if not for the ultra-low latency scenarios that are undeniably obvious. As a result, HTAP can be used in very limited areas.

So, we’re still designing for disk, not memory.

It’s not that no one has tried to mix columns and columns before. This column blending can be a compromise format such as PAX, or it can be a clever algorithm combining the two forms in the same storage engine. However, the Locality problem mentioned above cannot be bypassed. It is difficult to approach the optimal solution of both sides at the same time even if the performance is squeezed through super engineering capabilities, not to mention that it will be several times more complicated than considering a single scenario technically.

TiDB didn’t want to give up either TP or AP, so although it knew Spanner was using PAX for HTAP, it didn’t jump in. Maybe there’s a better way?

TiDB as a whole has always believed in modularity as a solution to engineering problems, including the layering and module cutting of TiDB and TiKV. HTAP is no exception. After various Prototype experiments, including but not limited to using CDC schemes like Binlog to synchronize TP updates to the easily structured AP side, we ended up stripping/fusing row and column storage via Raft. Rather than tightly coupling the two formats in the same engine. This allows us to think of the two scenarios in isolation without having to change the existing engine too much, which greatly reduces the molding and stabilization cycle of the product. On the other hand, modularity makes it easier to leverage the power of other open source products (ClickHouse) because complex details don’t have to be sealed in the same box.

Other designs on the market use tighter coupling, such as MemSQL nodes running BOTH TP and AP services, Spanner choosing PAX for different read modes, and even traditional databases mostly add support for different data organizations in the same engine. Such an architecture would introduce overly complex designs and would not necessarily benefit either TP or AP.

With the loosely-coupled design chosen, we only had to focus on one problem to solve storage: how to design a column storage system that could be updated in real time based on primary keys. In fact, any column can be updated, but this update is usually done by covering a large chunk of data as a whole, which is why most traditional OLAP databases can only support batch data updates. If real-time primary key updates are not a concern, then the storage can be completely free of data de-duplication and sorting: the storage is organized by primary key order not only for fast read positioning, but also for write update acceleration. If a piece of data needs to be updated, the engine at least needs to have both new and old versions of the same piece of data quickly erased in some way, whether it is erased on read or overwritten directly. Traditional analytical databases or Hadoop columns do not have the ability to update in real time, so they do not have to pay the cost of reading or writing, which is one of the reasons they can support very fast batch loading and reading. But such a design would not satisfy our scenario. To achieve HTAP’s goals, TiDB’s column storage engine must be able to support real-time updates at a rate no lower than row storage.

In fact, we’re certainly not the first product in the industry to try to implement a storage update. A common practice in the industry for column updates, regardless of their variation, is called Delta Main. Since it is inefficient to do column store updates, why not use write optimization to store change data and then gradually merge the updated portion into the read-optimized primary column store? As long as we maintain sufficient merge frequency, a large proportion of the entire data will exist in read-optimized column storage to maintain performance. This is a design that has been thought of almost since the beginning of column storage: you can think of the c-Store, the ancestor of column storage, as a Delta Main design of sorts, which uses a row storage engine as a write area and constantly merges write area data into column storage.Our updatable memory engine, DeltaTree, was designed along very similar lines. On the macro level, DeltaTree sorts and splits data according to primary key order. Similar to TiDB’s Region concept, each data range forms a fragment independently, and the fragment will be split whenever the physical size of the fragment exceeds a threshold. Microscopically, each fragment is divided into two parts, Delta and Stable Space, as shown above. The Delta parts are mainly optimized for writes, which are small blocks of data arranged in batch order. Writing in write order instead of primary key order makes writing much faster, because data writes only need to be appended continuously. When enough Delta data is accumulated, the engine merges them into a Stable, which has a Parquet design of row-groups and columns cut, sorted and compressed. The Stable section is definitely optimized for reading, and if you only consider stables, it will be fast. In practice, however, Delta data that is not yet merged into a Stable may need to overwrite older data in the Stable at read time, so reading is an online merge process. To speed up this merge, the engine adds a secondary B+Tree index in memory to the Delta part, so that while the Delta is not physically ordered (keeping it physically ordered greatly reduces write performance), it remains logically ordered without the cost of sorting before merge. At the same time, due to the macroscopic division of data interval, it is not necessary to rewrite all data for merging, which reduces the pressure of merging.

Back to the LSM column storage scheme mentioned earlier. In fact, you can think of LSM and you can think of it almost as Delta Main. When data is written to MemTable, it is also written in a write-optimized appending form. Could LSM also be a design that supports column updates? We’ve tried it, it’s not impossible, but it’s not as good as DeltaTree: For range reading, the LSM needs to perform very heavy merging, because any new data at the upper level may overwrite the old data at the lower level, and there is an intersection between layers, so the LSM at the N level may need to perform n-merging to get a piece of data. We implemented a ClickHouse MergeTree based LSM column engine that was nearly twice as slow as the new DeltaTree.So far, we have solved the updatable column memory problem.

Besides, copy

Since the loosely-coupled storage engine has been chosen, the row and column stores are not in the same module, so the problem must be how to replicate the data. For traditional master-slave replication systems, we tend to use a High Level of replication such as MySQL Binlog. In fact, this replica system was also used in our first prototype iteration. Binlog-based replication is a good way to encapsulate unnecessary details, as long as the storage engine TiFlash can play back logs properly, without worrying about details such as transaction implementation. This quickly led to the first version of TiFlash, which connected row and column storage via binlog, but required fault tolerance, load balancing, and so on. To make matters worse, TiDB is a distributed, multi-master system. Each TiDB server generates a binlog, and in order to keep the data consistent and not overwrite old and new, the binlog actually needs to go through a layer of aggregation and sorting, which almost reduces the distributed dimension reduction to a single point of throughput, and the sorting pipeline greatly increases the delay of data arrival. Therefore, the prototype TiFlash did not provide mixed row and column queries: you could only query row and column stores separately, because the data could not be guaranteed to be consistent, and mixing the two in a query would create endless unknowable data errors.

So we replicated from a lower level of logs, and yes, we chose to dock in the Raft layer. The benefits of docking from a lower level are obvious, the Raft Log retains all the details needed for data replication, and we were able to design TiFlash as a specific TiKV node that directly benefits the multi-raft system: Data can be transparently migrated and expanded through PD, and fault tolerance itself is completely unnecessary. It is all done by Raft system. When a copy is lost, the storage layer will automatically initiate recovery, and the complex consistency guarantee of replication itself is also unnecessary. Everything from being faced with perfecting your clickhouse-based replica system to sitting back and enjoying the benefits. Of course, there is also a cost in the actual engineering implementation, the cost of switching from a high-level quasi-SQL level Binlog to a completely low-level Raft Log is also quite huge. We needed to implement all the complex operations needed for multi-raft systems on ClickHouse, such as splitting and merging regions, as well as migration and read tolerance.

The new design is key to the entire HTAP architecture, giving TiFlash the ability to seamlessly access the entire storage tier. The same replication system, the same scheduling system, the same transaction model, the same consistency guarantee. Its replication design is fully distributed, load balanced and automatically fault-tolerant. Instead of using master-slave replication or double-writing to the machine column and column, the AP and TP parts can operate independently and expand freely: if you need more AP power, add TiFlash nodes. If you need to increase TP power, please add TiKV node. Non-interference, in terms of Workload or computing resource expansion.

At the same time, this replication is automatically load balanced and directly linked point-to-point. Each Region Leader copy communicates independently with the stored copy without intermediate storage media. If the Region copy is too large to be split, the stored copy will also be split. When replicas migrate due to hot spot fragmentation, the replication pipes between them migrate as well. This is already implemented for TiDB’s Multi-raft system.One of the biggest benefits of Raft systems is the coexistence of consistency and asynchronous replication.

Traditionally, synchronous replication must be used if replication is required to keep copies consistent. In this way, either the high pressure of the storage nodes or the increased network latency will have a huge impact on TP services: in order to maintain data consistency, the row storage transaction must wait for the column to complete the write before returning, otherwise the failure during the period will lead to data loss and inconsistency. In addition, any additional storage nodes will increase the probability of experiencing network latency. Although many HTAP products do not take into account the interaction between AP and TP, we still expect the delicate TP to receive a greater degree of protection.

This is precisely what Raft solves. TiFlash accesses Raft systems via Learner roles, which allows columns to join the cluster without voting and only write asynchronously, meaning that it does not interfere with normal TP business because of its stability. When a transaction is written to TP side, TiKV does not need to wait for data synchronization of TiFlash and can return to the client to complete the transaction only after the normal fault-tolerant replication of the row storage copy is completed. So you might ask, is there any inconsistency in the data, is there some latency between the rows and the columns? Yes and no. Physically, yes, there is no way that a system can replicate asynchronously and still physically keep copies consistent. But we don’t really need to guarantee that the data is physically consistent from moment to moment, we just need to provide a consistent logical read. This is one of the core features of Raft itself, although multiple copies are not always consistent all the time, as long as they are read with the latest consistent data. When the actual read occurs, the column copy sends a collation request to the row Leader. The request itself is simple: please tell me what the latest log number was at the moment you received the request. And TiFlash will wait for the data replication progress to catch up with the colonel. That’s all. This allows TiFlash to ensure that the data is fresh enough to include the information written in the last instant. Yes, the latest data written from TiKV is guaranteed to be read from TiFlash, which forms the read water line. With timestamps and MVCC, TiFlash’s asynchronous synchronization can also provide the same strong consistency guarantee as TiKV. This makes TiFlash columns behave less like a heterogeneous replication system and more like a special type of column index, and allows us to freely mix two different engines in the same query without worrying about subtle, hard-to-trace errors caused by inconsistencies.

Smart choice

Smart choice comes last, because it’s the last thing we did. TiDB’s smart selection of row and column storage is to automatically select row or column storage through cost optimization. This part is also easy to say, just as we use statistics to select indexes, we can also use the cost formula to estimate the cost of using columns. By combining the cost of each access path, we can know which way to read data, and column storage is only one of them, and there is no particularity.Technically, this isn’t much new. However, by automatic selection, TiDB’s HTAP system expands from TP + reports to HTAP hybrid services. Some fuzzy business systems, through TiFlash support, become simple architecture. For example, in the logistics system, users want to be able to retrieve individual order numbers and delivery details on the same query platform, and also want to be able to collect statistics on the receiving and receiving situation of different types of goods in a certain period. There is no obstacle to detailed query for TiDB, but there is still a big gap between the performance of multidimensional analysis under big data sets and that of real analytical products when there is no column storage in the past. With TiFlash, the blurring boundaries between AP and TP will soon become rounded and complete. On the other hand, TiSpark reads in the original plan are not that much because TiDB is closer to business and DBA than big data.

The last

This article does not completely cover the content of our paper. The missing part is the non-HTAP part of the TiDB design. Interested students can click on the original text here. In addition, we also welcome you to use our products, your use and valuable advice is the most basic driving force of TiDB development.