Author introduction: Wei Wan, database development engineer at PingCAP, focusing on database storage engine development and system performance optimization.

Why do we need an HTAP database?

Before the Internet wave, the amount of data an enterprise held was generally not large, especially the core business data, which could be stored in a single database. Storage at that time did not require a complex architecture; all Online Transactional Processing (OLTP) and Online Analytical Processing (OLAP) workloads ran on the same database instance. As businesses became more complex and data volumes grew, DBAs could no longer keep up just by optimizing SQL. One obvious problem was that the TP requests supporting the online business were already heavy enough on a standalone database that it could not also run heavy AP analysis tasks: those tasks would either fail with OOM or affect the online business, and workarounds such as master-slave separation or sharding made it hard to meet business requirements afterwards.

In this context, big data technology, represented by Hadoop, began to flourish. It built data analysis platforms out of many relatively cheap x86 machines and used parallelism to crack the problem of computing over large data sets. So, to some extent, big data technology can be regarded as a branch of the evolution of traditional relational database technology. Of course, the big data field has also developed its own new scenarios and given birth to many new technologies, which we will not discuss in depth here.

Thus, architects split storage into two modules: one for the online business and one for data analysis. As shown in the figure below, data from the business database is extracted by an ETL tool and imported into a dedicated analysis platform. The business database focuses on providing TP capabilities, and the analytics platform focuses on providing AP capabilities. Each does what it is good at, which looks perfect, but this architecture has its own shortcomings.

The first is complexity. ETL is itself a very complicated process; doing ETL well is practically a business in its own right. Because there are now two systems, the architecture inevitably brings higher learning, maintenance, and integration costs. If you build an analysis platform out of open source big data tools, you will certainly run into friction between the various tools, as well as quality problems caused by tools of uneven maturity.

The second is the real-time problem. In general, the closer data is to real time, the more valuable it is. Many business scenarios have strict real-time requirements; for example, a risk control system needs to analyze data continuously and respond to dangerous situations as quickly as possible. However, ETL is usually a periodic operation, for example running once a day or once an hour, so the freshness of the data cannot be guaranteed.

Finally, there is the question of consistency. Consistency is a very important concept in databases, and database transactions are used to guarantee it. If data is stored in two different systems, consistency cannot be guaranteed, which means the query results of the AP system cannot be made to correspond exactly to the online business. The interaction between the two systems is also limited; for example, users cannot access data from both systems within a single transaction.

Because of these limitations of existing data platforms, we believe that a converged Hybrid Transactional/Analytical Processing (HTAP) database can relieve users of the anxiety of choosing between TP and AP. It lets them satisfy both OLTP and OLAP requirements in a single database without worrying about an overly complex architecture. This is what TiDB was originally designed for.

What is TiFlash?

TiDB is positioned as an HTAP database that solves both TP and AP problems. We know that TiDB can be used as a linearly scalable MySQL and was designed to meet TP needs. In 2017 we released TiSpark, which reads TiKV data directly and uses Spark's powerful computing capability to strengthen the AP side. However, because TiKV is a storage layer designed for TP scenarios, its ability to extract and analyze large amounts of data is limited, so we are introducing a new component for TiDB: TiFlash. Its mission is to further enhance TiDB's AP capability and make it a true HTAP database.

TiFlash is the AP extension of TiDB. In terms of positioning, it is a storage node on a par with TiKV and is deployed separately from TiKV. It can both store data and accept pushed-down computing logic, and its data is synchronized from TiKV via the Raft Learner protocol. The biggest differences between TiFlash and TiKV are its native vectorized execution model and its column storage, both of which are optimized for AP scenarios. The TiFlash project leverages ClickHouse's vectorization engine and therefore inherits its high computational performance.

Since TiFlash nodes and TiKV nodes are deployed separately, even heavy computing tasks have no impact on the online business.

The upper computing nodes, TiSpark and TiDB, can access both TiKV and TiFlash. Later we will see how this architecture serves both TP and AP scenarios within one system, with a 1+1>2 effect.

The technology inside TiFlash

For a database system, TP and AP pull the design in conflicting directions. In TP scenarios we focus on transaction correctness, and the performance indicators are QPS and latency; the workload is usually point writes and point lookups. AP cares more about throughput, that is, the ability to process large amounts of data, and the cost of doing so; for example, AP analytical queries often need to scan millions of rows and join dozens of tables. The design philosophy for that scenario is completely different from TP: TP systems usually use row storage, such as InnoDB or RocksDB, while AP systems typically use column storage. It is difficult to satisfy both sets of requirements within one design. In addition, AP queries are usually resource-hungry, and without good isolation they easily interfere with TP services. Building an HTAP system is therefore very difficult and tests a team's engineering design capability.

1. Column storage

Generally speaking, AP systems use column storage, and TiFlash is no exception. Column storage naturally supports column pruning and achieves a high compression ratio, which suits Scan scenarios over large data sets. In addition, column storage lends itself better to vectorized acceleration, and more aggregation operators can be pushed down to it. Compared with TiKV, TiFlash improves performance by an order of magnitude in Scan scenarios.

Row storage, on the other hand, is clearly better suited to TP scenarios: it is good for point lookups, reads only a small amount of data, and requires fewer and smaller I/O operations. For the vast majority of indexed queries, it delivers high QPS and low latency.
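To make the contrast concrete, here is a minimal, self-contained Go sketch (not TiFlash code; the table and field names are made up) showing the two layouts side by side. The columnar version filters one tightly packed slice, which is exactly the access pattern that compresses well and vectorizes well, while the row version must walk whole records just to read a single field.

    package main

    import "fmt"

    // Row layout: every record carries all of its fields together.
    type OrderRow struct {
    	ID     int64
    	UserID int64
    	Amount float64
    	Note   string
    }

    // Column layout: each field is stored as its own contiguous slice.
    type OrderColumns struct {
    	ID     []int64
    	UserID []int64
    	Amount []float64
    	Note   []string
    }

    // sumBigAmountsRows touches every full row even though it only needs Amount.
    func sumBigAmountsRows(rows []OrderRow, threshold float64) float64 {
    	var sum float64
    	for _, r := range rows {
    		if r.Amount > threshold {
    			sum += r.Amount
    		}
    	}
    	return sum
    }

    // sumBigAmountsColumns scans only the Amount column: a tight loop over a
    // packed slice, the pattern column storage and vectorized execution favor.
    func sumBigAmountsColumns(cols *OrderColumns, threshold float64) float64 {
    	var sum float64
    	for _, a := range cols.Amount {
    		if a > threshold {
    			sum += a
    		}
    	}
    	return sum
    }

    func main() {
    	rows := []OrderRow{{1, 10, 5.0, "a"}, {2, 11, 50.0, "b"}, {3, 12, 500.0, "c"}}
    	cols := &OrderColumns{
    		ID:     []int64{1, 2, 3},
    		UserID: []int64{10, 11, 12},
    		Amount: []float64{5.0, 50.0, 500.0},
    		Note:   []string{"a", "b", "c"},
    	}
    	fmt.Println(sumBigAmountsRows(rows, 20))    // 550
    	fmt.Println(sumBigAmountsColumns(cols, 20)) // 550
    }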

Since TiFlash and TiKV are both integrated inside TiDB, users can flexibly choose which storage form to use. After data is written to TiKV, users can decide whether to synchronize it to TiFlash for AP acceleration. Currently, the synchronization granularity is a table or a database.
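As a rough illustration of how this choice can be driven programmatically, the Go sketch below sends the per-table replication DDL that released TiDB versions (4.0 and later) expose, and then checks the replication progress through information_schema. The DSN, database name, and table name are assumptions for the example; if your TiDB version predates this syntax, configuration is done differently, so treat this purely as a sketch.

    package main

    import (
    	"database/sql"
    	"log"

    	_ "github.com/go-sql-driver/mysql" // TiDB speaks the MySQL protocol
    )

    func main() {
    	// Assumed local TiDB endpoint; adjust the DSN for your cluster.
    	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
    	if err != nil {
    		log.Fatal(err)
    	}
    	defer db.Close()

    	// Ask TiDB to keep one TiFlash replica of the (hypothetical) table `orders`.
    	// Syntax as documented for later TiDB releases; verify against your version.
    	if _, err := db.Exec("ALTER TABLE orders SET TIFLASH REPLICA 1"); err != nil {
    		log.Fatal(err)
    	}

    	// Replication progress is visible through an information_schema table.
    	var available bool
    	var progress float64
    	err = db.QueryRow(
    		"SELECT AVAILABLE, PROGRESS FROM information_schema.tiflash_replica " +
    			"WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'orders'",
    	).Scan(&available, &progress)
    	if err != nil {
    		log.Fatal(err)
    	}
    	log.Printf("tiflash replica available=%v progress=%.2f", available, progress)
    }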

2. Low-cost data replication

Data replication has always been one of the most important problems in distributed systems. As another storage layer of TiDB, TiFlash needs to synchronize TiKV's data in real time. The solution we adopted is a natural one: since TiKV already uses the Raft protocol for replication internally, we can also use Raft to replicate data from TiKV to TiFlash. TiFlash disguises itself as a TiKV node and joins the Raft group. The difference is that TiFlash only acts as a Raft Learner, never as a Raft Leader or Follower. The reason is that TiFlash currently does not support direct writes from the SQL side (TiDB/TiSpark); we will support this later.

As we know, Raft log replication from the Leader to Followers/Learners is usually optimized into asynchronous replication to improve efficiency. As long as a record is replicated to a "majority" of the Leader + Followers, it is considered committed, and Learners are excluded from this "majority"; in other words, an update does not wait for the Learner's confirmation. The disadvantage is that the data on the Learner may lag slightly behind; the advantage is that the extra replication overhead introduced by TiFlash is greatly reduced. Of course, if the replication lag grows too large, it indicates a problem with the network between nodes or with the machine's write performance, and we raise an alarm for further handling.
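The following toy Go sketch (not TiKV's actual Raft implementation) illustrates why adding a TiFlash learner is cheap: the commit index is computed from the voting members only, so a lagging learner never holds back a TP write from committing.

    package main

    import (
    	"fmt"
    	"sort"
    )

    // Peer is a member of a Raft group. Learners replicate the log but do not vote.
    type Peer struct {
    	ID        uint64
    	IsLearner bool
    	MatchIdx  uint64 // highest log index known to be replicated on this peer
    }

    // commitIndex returns the largest index replicated on a majority of the
    // *voting* members (leader + followers). Learners are ignored, so a slow
    // TiFlash learner never delays commit.
    func commitIndex(peers []Peer) uint64 {
    	var matches []uint64
    	for _, p := range peers {
    		if p.IsLearner {
    			continue // learners do not count toward the quorum
    		}
    		matches = append(matches, p.MatchIdx)
    	}
    	if len(matches) == 0 {
    		return 0
    	}
    	// Sort descending; the value at position (quorum-1) is on a majority.
    	sort.Slice(matches, func(i, j int) bool { return matches[i] > matches[j] })
    	quorum := len(matches)/2 + 1
    	return matches[quorum-1]
    }

    func main() {
    	peers := []Peer{
    		{ID: 1, MatchIdx: 100},                 // leader (TiKV)
    		{ID: 2, MatchIdx: 100},                 // follower (TiKV)
    		{ID: 3, MatchIdx: 99},                  // follower (TiKV)
    		{ID: 4, MatchIdx: 90, IsLearner: true}, // TiFlash learner, lagging
    	}
    	fmt.Println(commitIndex(peers)) // 100: the lagging learner does not block commit
    }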

3. Strong consistency

Given asynchronous replication, how do we ensure read consistency? Normally we would read from the Leader node, which always has the latest data, but TiFlash is only a Learner and cannot serve reads that way. Instead, we use the Raft Follower/Learner Read mechanism to read data directly on the TiFlash node. The principle is to combine the Raft log offset with a global timestamp: when a request is initiated, it obtains a read ts; for every Region involved, we make sure the local Region replica has caught up to a sufficiently new Raft log position, after which it is safe to read that replica directly. For each unique key, among all versions with commit ts <= read ts, the version with the largest commit ts is the one we should read.

How does the Learner know whether its current Region replica is new enough? Before reading data, the Learner sends a request carrying the read ts to the Leader to obtain the Raft log offset that guarantees the Region is new enough. The current TiFlash implementation waits, up to a timeout, until the local Region replica has caught up to that offset. We will add other strategies in the future, such as proactively requesting data synchronization (see Figures 6 and 7).
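The Go sketch below condenses this Learner Read flow into a toy, single-process form: ask the leader for the required log position, wait (with a timeout) until the local replica has applied up to it, then pick the newest MVCC version no newer than the read ts. The readIndexFromLeader function is a hypothetical stand-in for the real RPC, and the data structures are simplified for illustration only.

    package main

    import (
    	"errors"
    	"fmt"
    	"sort"
    	"time"
    )

    // Version is one MVCC version of a key inside a Region replica.
    type Version struct {
    	CommitTS uint64
    	Value    string
    }

    // regionReplica models the learner's local copy of a Region.
    type regionReplica struct {
    	appliedIndex uint64
    	data         map[string][]Version // versions sorted by CommitTS ascending
    }

    // readIndexFromLeader stands in for the RPC in which the learner sends its
    // read ts to the leader and gets back the Raft log index it must reach
    // before serving the read. Hypothetical signature, for illustration only.
    func readIndexFromLeader(readTS uint64) uint64 {
    	_ = readTS
    	return 42 // pretend the leader answered "wait until you have applied index 42"
    }

    // learnerRead waits (up to timeout) for the local replica to catch up to the
    // required index, then returns the newest version with CommitTS <= readTS.
    func (r *regionReplica) learnerRead(key string, readTS uint64, timeout time.Duration) (string, error) {
    	required := readIndexFromLeader(readTS)
    	deadline := time.Now().Add(timeout)
    	for r.appliedIndex < required {
    		if time.Now().After(deadline) {
    			return "", errors.New("region replica not fresh enough")
    		}
    		time.Sleep(time.Millisecond) // in reality: wait for the Raft apply loop
    	}
    	versions := r.data[key]
    	// Find the first version with CommitTS > readTS; the one before it is
    	// the newest visible version.
    	i := sort.Search(len(versions), func(i int) bool { return versions[i].CommitTS > readTS })
    	if i == 0 {
    		return "", errors.New("no visible version")
    	}
    	return versions[i-1].Value, nil
    }

    func main() {
    	r := &regionReplica{
    		appliedIndex: 50, // already past the required read index
    		data: map[string][]Version{
    			"k1": {{CommitTS: 10, Value: "v1"}, {CommitTS: 30, Value: "v2"}, {CommitTS: 70, Value: "v3"}},
    		},
    	}
    	v, err := r.learnerRead("k1", 40, 100*time.Millisecond)
    	fmt.Println(v, err) // "v2" <nil>: the newest version with CommitTS <= 40
    }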

4. Update support

TiFlash synchronizes all table changes from TiKV. Because this is synchronization between two heterogeneous systems, it runs into some very hard problems. The most representative one is how to let TiFlash replicate TiKV's updates in real time, and do so transactionally. Column storage is generally considered hard to update: columnar formats tend to use block compression, and the blocks are larger than in row storage, which makes write amplification more likely; storing columns separately also tends to produce more small I/Os. In addition, AP workloads require a great deal of scanning, so keeping Scan performance high while sustaining high-speed updates is also a big problem.

TiFlash's current solution is a storage engine built on an LSM-tree-like architecture, combined with MVCC to provide the same snapshot isolation (SI) level as TiDB. The LSM-tree architecture handles the high-frequency small writes typical of TP well, and its data keeps a degree of local ordering, which helps Scan optimization.
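The toy Go sketch below captures the LSM-style idea in miniature (it is an analogy, not TiFlash's engine): writes land in a small in-memory buffer and are periodically flushed as immutable, sorted runs, so updates never rewrite old data, while scans still walk locally ordered runs.

    package main

    import (
    	"fmt"
    	"sort"
    )

    type kv struct {
    	Key, Value string
    }

    // lsmLike keeps recent writes in an in-memory buffer and flushes them as
    // immutable sorted runs -- a rough analogy for how an LSM-style engine
    // absorbs high-frequency small writes while keeping data ordered for scans.
    type lsmLike struct {
    	memtable map[string]string
    	runs     [][]kv // each run is sorted by key; later runs are newer
    }

    func newLSMLike() *lsmLike { return &lsmLike{memtable: map[string]string{}} }

    // Put is cheap: it only touches the in-memory buffer.
    func (l *lsmLike) Put(k, v string) { l.memtable[k] = v }

    // Flush turns the buffer into one more sorted, immutable run.
    func (l *lsmLike) Flush() {
    	if len(l.memtable) == 0 {
    		return
    	}
    	run := make([]kv, 0, len(l.memtable))
    	for k, v := range l.memtable {
    		run = append(run, kv{k, v})
    	}
    	sort.Slice(run, func(i, j int) bool { return run[i].Key < run[j].Key })
    	l.runs = append(l.runs, run)
    	l.memtable = map[string]string{}
    }

    // Scan merges the sorted runs and the buffer; newer sources win on conflict.
    func (l *lsmLike) Scan(from, to string) []kv {
    	merged := map[string]string{}
    	for _, run := range l.runs { // older runs first, newer runs overwrite
    		for _, e := range run {
    			if e.Key >= from && e.Key < to {
    				merged[e.Key] = e.Value
    			}
    		}
    	}
    	for k, v := range l.memtable {
    		if k >= from && k < to {
    			merged[k] = v
    		}
    	}
    	out := make([]kv, 0, len(merged))
    	for k, v := range merged {
    		out = append(out, kv{k, v})
    	}
    	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
    	return out
    }

    func main() {
    	s := newLSMLike()
    	s.Put("a", "1")
    	s.Put("c", "3")
    	s.Flush()
    	s.Put("b", "2")
    	s.Put("c", "3'")              // the update lands in the buffer, old runs untouched
    	fmt.Println(s.Scan("a", "z")) // [{a 1} {b 2} {c 3'}]
    }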

The imagination space brought by TiFlash

TiFlash lets TiDB scale along a new business dimension. With the new TiFlash AP extension, TiDB gains true AP capability, that is, storage and computation optimized for AP. We can dynamically increase or decrease the TP or AP capacity of a TiDB cluster simply by adding or removing the corresponding nodes. Data no longer needs to be manually synchronized between two independent systems, and synchronization is both real-time and transactional.

AP and TP services are isolated from each other, minimizing the impact of AP workloads on TP workloads in TiDB. Because TiFlash runs on standalone nodes, usually deployed on machines separate from TiKV, resource isolation can be achieved at the hardware level. Within the TiDB system, we use labels to manage the different types of storage nodes.

From TiDB's point of view, TiFlash and TiKV sit at the same level, as storage nodes. The difference lies in the node labels they report to PD (the coordinator of the TiDB cluster) at startup. TiDB can then use this information to route different types of requests to the appropriate nodes. For example, heuristics and statistics may tell us that a SQL statement needs to scan a large amount of data and perform aggregation; it is then clearly more reasonable for that statement's Scan operators to request data from TiFlash nodes, so that the heavy I/O and computation never touch the TP services on the TiKV side.
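The Go sketch below is a deliberately simplified version of that routing idea, not TiDB's actual optimizer: stores are distinguished by the engine label they reported, and a made-up threshold decides whether a table access is a big un-indexed scan (send it to a TiFlash node) or an indexed lookup (keep it on TiKV).

    package main

    import "fmt"

    // Store is a storage node as seen by the planner; the label reported to PD
    // at startup tells us whether it is a TiKV or a TiFlash node.
    type Store struct {
    	ID     uint64
    	Engine string // "tikv" or "tiflash", taken from the node's label
    }

    // TableAccess is a simplified per-table access estimate.
    type TableAccess struct {
    	EstimatedRows  int64
    	HasUsableIndex bool
    }

    // chooseEngine is a toy heuristic: large scans without a usable index go to
    // column storage, indexed access stays on row storage. Real TiDB relies on
    // cost-based statistics; the threshold here is invented for illustration.
    func chooseEngine(a TableAccess) string {
    	const scanThreshold = 100000
    	if !a.HasUsableIndex && a.EstimatedRows > scanThreshold {
    		return "tiflash"
    	}
    	return "tikv"
    }

    func pickStore(stores []Store, engine string) *Store {
    	for i := range stores {
    		if stores[i].Engine == engine {
    			return &stores[i]
    		}
    	}
    	return nil
    }

    func main() {
    	stores := []Store{{ID: 1, Engine: "tikv"}, {ID: 2, Engine: "tikv"}, {ID: 3, Engine: "tiflash"}}

    	bigScan := TableAccess{EstimatedRows: 5_000_000, HasUsableIndex: false}
    	pointRead := TableAccess{EstimatedRows: 20, HasUsableIndex: true}

    	fmt.Println(pickStore(stores, chooseEngine(bigScan)).ID)   // 3: route to the TiFlash node
    	fmt.Println(pickStore(stores, chooseEngine(pointRead)).ID) // 1: stay on TiKV
    }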

TiFlash brings a new kind of fusion experience. TiFlash nodes do not merely synchronize data from TiKV nodes; the two can cooperate further, producing a 1+1>2 effect. The upper computing layer, TiDB or TiSpark, can read data from TiFlash and TiKV at the same time.

As shown in Figure 10, for example, if a SQL statement joins two tables, one requiring a full table Scan and the other able to use an index, we can obviously take advantage of both TiFlash's powerful Scan and TiKV's point lookups. It is worth mentioning that users typically keep 3 or 5 replicas in TiKV but may deploy only 1 replica in TiFlash to save cost, so when a TiFlash node fails, its data needs to be re-synchronized from TiKV.

Our next plan is to turn the TiFlash nodes into an MPP cluster; that is, after TiDB or TiSpark receives a SQL statement, it can choose to push the computation down entirely and use MPP to further improve AP computing efficiency.

The figure above shows performance data from one version of TiFlash, comparing TiSpark + TiFlash against Spark + Parquet. You can see that TiFlash achieves nearly the same performance while also providing real-time updates and transactional consistency. TiFlash is still iterating rapidly, and the latest version has improved considerably over the numbers shown here. In addition, we are developing a new storage engine designed specifically for TiFlash, which is expected to bring at least another 2x performance improvement. Stay tuned for those results.

Simplicity is productivity. Constrained by traditional data platform technology, enterprises have to do a great deal of heavy construction work: many technologies must be integrated to meet business requirements, and complex ETL processes are used to synchronize data between systems, resulting in long data pipelines and poor timeliness. TiDB wants to keep that complexity inside the tool layer and greatly simplify the user's application architecture.

TiFlash is currently running internal POCs with some partners, and a GA version is expected before the end of the year, so stay tuned.

(Readers interested in TiFlash are welcome to contact the author directly to discuss more technical details: [email protected])