Author: Ma Xiaoyu
It has been a while since the paper was released, and the article I promised earlier is also overdue for various reasons. My apologies to everyone.
As mentioned last time, there are several VLDB articles on HTAP (Hybrid Transactional/Analytical Processing). Since Gartner coined the term in 2014, HTAP has become a buzzword over the years, driven by the growing demand for real-time analysis of transactional data. Besides the PingCAP article, there is IBM's Replication at the Speed of Change – a Fast, Scalable Replication Solution for Near Real-time HTAP Processing, and today's Lightning paper from the Google F1 team. Even systems coming from a non-TP perspective have evolved into HTAP's twilight zone, such as Alibaba's Hologres paper and Databricks' Delta Lake paper.
Google has been a beacon in many areas thanks to its massive data volumes and advanced user scenarios. Although the two recent F1 papers are not particularly eye-catching, an article from Google still attracts plenty of attention. This time I will read this paper with you and make some comparisons with the earlier TiDB paper. In fact, we feel that TiDB HTAP covers almost all of the design advantages of F1 Lightning, and has much more to offer in terms of HTAP architecture design alone.
In the Related Work section, the paper cites a short HTAP survey article from IBM Research, and the summary is quite interesting; I recommend reading it here. When writing our own paper, our team also used this article to survey other people's work.
Existing HTAP systems are divided into two design types:
- Single System for OLTP and OLAP
- Separate OLTP and AP System
Many HTAP projects on the market start from scratch as a single integrated system, and because they start from scratch, they can adopt a more tightly coupled design without too much historical baggage. Within this category there are further variations, such as a single storage engine versus a hybrid row-column engine; this tight coupling is obviously good for performance. It is worth mentioning a key point the paper does not raise: mutual interference. A single-system design can easily let AP workloads interfere with TP, which is intolerable for an HTAP system that has to operate in serious production scenarios.
The advantage of a separate system is that TP and AP can be designed independently with little intrusion into each other, but in existing architectures the data is often transferred via offline ETL (see our analysis of storage in this article). The design of F1 Lightning also falls into the separate-system category, using CDC for real-time data synchronization instead of offline ETL. As a result, F1's column store, like ours, inevitably has to be designed to handle real-time changes.
By the way, TiDB and TiFlash are also mentioned in the related-work analysis of the paper, but that part of the description is wrong: TiDB, like Lightning, provides strong snapshot consistency, and thanks to its unique design it actually offers even better consistency and freshness.
Compared with existing HTAP systems, Lightning provides the following advantages (TiDB HTAP offers the same or even better):
- A read-only columnar replica provides better execution efficiency
- Simpler configuration and less duplicated pipeline work compared to ETL processes
- HTAP queries can be issued directly without modifying SQL
- Consistency and freshness
- A unified security model across TP and AP
- Separation of design concerns: TP and AP can each optimize their own domain without entangling the other
- Extensibility: the possibility of connecting to TP databases other than F1 DB and Spanner
System architecture
Since a premise of the project was to be non-invasive to the TP systems, the architectural design of F1 Lightning as an HTAP solution is relatively conservative. Most current HTAP research assumes a so-called Greenfield setting (not constrained by pre-existing systems), but F1 Lightning was required to avoid migrating existing businesses and to avoid modifying the TP systems as much as possible (at the organizational level, the F1 Lightning team does not even own the code of the TP systems). So they came up with a "loosely coupled HTAP solution", which put bluntly is simply: do HTAP via CDC. In effect, F1 Lightning is CDC plus a change-capable column store.
Lightning is divided into the following modules:
- Data storage: continuously receives CDC change information and stores Lightning's read-optimized (columnar) copy of the data.
- Change replication: a distributed CDC channel that receives transaction logs from the OLTP systems and distributes them to the different storage servers, replaying historical data for newly subscribed tables as needed.
- Metadata database: stores the state of the storage nodes and of change replication.
- Lightning Master: coordinates and manages all of the components.
However, two implicit modules are also needed to make up the complete HTAP system (a rough sketch of how all these pieces might fit together follows the list):
- OLTP data source: continuously publishes CDC change information.
- Query engine: uses F1 Query to retrieve data from Lightning in response to queries.
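To make the division of labor a bit more concrete, here is a minimal Go sketch of how these components might relate to one another; every interface and type name below is hypothetical and not taken from the paper.

```go
// Illustrative sketch of how the Lightning components described above might
// relate to each other. All names here are hypothetical, not from the paper.
package main

import "fmt"

// Change is one CDC record: a primary key, its new value, and a commit timestamp.
type Change struct {
	Key      string
	Value    []byte
	CommitTS uint64
	Deleted  bool
}

// ChangeSource abstracts the OLTP side (F1 DB, Spanner, ...) that publishes CDC.
type ChangeSource interface {
	Subscribe(table string, startTS uint64) (<-chan Change, error)
}

// StorageServer holds a read-optimized (columnar) replica of some shards.
type StorageServer interface {
	Apply(shard int, c Change) error                        // ingest one CDC change
	Read(shard int, key string, ts uint64) ([]byte, error)  // snapshot read at ts
	SafeTS(shard int) uint64                                 // per-shard safe query timestamp
}

// Master coordinates shard placement and the replication state kept in the metadata DB.
type Master interface {
	AssignShard(table string, shard int, server string) error
}

func main() { fmt.Println("architecture sketch only") }
```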
Since most of Google's existing TP systems use the MVCC model (MVCC is, in fact, the popular choice in distributed TP systems in general), Lightning also chooses MVCC and the corresponding snapshot-isolation read semantics: if the CDC stream emitted by the TP side carries timestamps, replicating it into Lightning under MVCC is the most natural and convenient design. Due to the constraints and characteristics of a distributed CDC architecture, Lightning chose to provide a Safe Timestamp to maintain consistency. This is a watermark-style design: because replication is distributed, different storage servers receive data with different latencies, so consistency cannot be guaranteed naively; a watermark gives queries snapshot consistency at the expense of some data freshness. TiDB HTAP, by contrast, chose a lower-level and more elegant replication scheme that provides the freshest consistent data via Raft's consistency guarantees.
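To make the watermark idea concrete, here is a minimal sketch (with made-up names) of how a safe query timestamp could be derived: the snapshot a query may use is capped by the slowest partition, which is exactly where the freshness loss comes from.

```go
// Sketch of the safe-timestamp (watermark) idea: a query may only read at a
// snapshot no newer than the minimum timestamp that every involved partition
// has fully applied. Hypothetical names; not the paper's actual code.
package main

import "fmt"

type partitionState struct {
	name      string
	appliedTS uint64 // everything at or below this timestamp has been applied
}

// safeTimestamp returns the highest timestamp at which a consistent snapshot
// read across all partitions is possible.
func safeTimestamp(parts []partitionState) uint64 {
	if len(parts) == 0 {
		return 0
	}
	min := parts[0].appliedTS
	for _, p := range parts[1:] {
		if p.appliedTS < min {
			min = p.appliedTS
		}
	}
	return min
}

func main() {
	parts := []partitionState{{"p1", 105}, {"p2", 98}, {"p3", 110}}
	// The query snapshot lags behind the freshest partition: consistency
	// is bought at the price of freshness.
	fmt.Println("safe read timestamp:", safeTimestamp(parts)) // 98
}
```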
Storage
Previous articles have analyzed why a special delta-main design is needed for a column store to absorb real-time changes, and Lightning's design reflects the same ideas. It is an LSM-like architecture in which delta information is kept in memory and can be queried in place. Because, as in TiDB, the authoritative copy lives in the OLTP system and can be re-fetched at any time, this data does not get its own WAL. Still, even if the data can always be re-fetched, replaying a huge amount of it would make recovery painfully slow, so it is no surprise that checkpoints were introduced. Since the in-memory delta is organized as a B+Tree, it can be written to disk as a checkpoint at any time without restructuring. Besides checkpoints, deltas are also transposed from rows to columns and written to disk when they grow too large and occupy too much memory. Like TiFlash or Parquet, Lightning's on-disk delta format uses the popular PAX-like columnar layout: a group of rows forms a row bundle, which is then cut by column, and each row bundle carries a sparse B+Tree index on the primary key. This layout serves both point lookups and range queries.
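For illustration, here is a rough sketch of what such a PAX-like row bundle might look like, with the sparse per-bundle index reduced to a min/max key pair for brevity; the layout details are assumptions, not the paper's actual format.

```go
// Rough sketch of a PAX-like row bundle: rows are grouped, then stored
// column by column inside the group, with a sparse index so a reader can
// skip whole bundles. Layout details are illustrative only.
package main

import (
	"fmt"
	"sort"
)

type RowBundle struct {
	MinKey, MaxKey string              // sparse index entry for this bundle
	Keys           []string            // primary keys, sorted, row-aligned
	Columns        map[string][]string // column name -> values, row-aligned
}

// mayContain lets a scan skip bundles whose key range cannot match.
func (b *RowBundle) mayContain(key string) bool {
	return key >= b.MinKey && key <= b.MaxKey
}

// lookup does a point read inside one bundle via binary search on the keys.
func (b *RowBundle) lookup(key, column string) (string, bool) {
	i := sort.SearchStrings(b.Keys, key)
	if i < len(b.Keys) && b.Keys[i] == key {
		return b.Columns[column][i], true
	}
	return "", false
}

func main() {
	b := RowBundle{
		MinKey: "a", MaxKey: "m",
		Keys:    []string{"a", "d", "m"},
		Columns: map[string][]string{"city": {"Beijing", "Shanghai", "Hangzhou"}},
	}
	if b.mayContain("d") {
		v, _ := b.lookup("d", "city")
		fmt.Println(v) // Shanghai
	}
}
```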
To support consistent reads, Lightning, like TiFlash, has to deal with deduplicating MVCC read versions and merging across different stores (disk delta and memory delta) at read time. Because real-time DDL means data may physically exist under multiple schema versions, the columnar read path takes one extra step of schema coercion: during the merge, data is normalized to the same schema. Incidentally, TiFlash's columnar engine also allows the stored structure to differ from the table schema, since dynamic schema changes are likewise taken into account. Lightning's LSM-like architecture means reads trigger a K-way merge, which can cost considerable read performance. To minimize this cost, Lightning continuously compacts data, including delta compaction, much like LSM and delta-main designs do. What is interesting in this section is that all compaction except minor merges is handed off to dedicated compaction worker resources. For a discussion of the principles behind this, see our previous post.
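A much-simplified sketch of that read-time merge follows: several sorted stores are merged and, for each key, only the newest version visible at the read timestamp survives. Schema coercion and a true heap-based K-way merge are omitted for brevity; all names are illustrative.

```go
// Simplified read-time merge across several sorted stores: pick, per key,
// the newest version visible at the read timestamp. Illustrative only;
// the K-way merge is reduced to sort-then-scan for brevity.
package main

import (
	"fmt"
	"sort"
)

type version struct {
	key     string
	ts      uint64
	value   string
	deleted bool
}

// mergeRead merges the versions from all stores and returns the visible
// value per key at readTS.
func mergeRead(stores [][]version, readTS uint64) map[string]string {
	var all []version
	for _, s := range stores {
		all = append(all, s...)
	}
	// Sort by key, then by timestamp descending, so the first visible
	// version per key wins.
	sort.Slice(all, func(i, j int) bool {
		if all[i].key != all[j].key {
			return all[i].key < all[j].key
		}
		return all[i].ts > all[j].ts
	})
	out := map[string]string{}
	seen := map[string]bool{}
	for _, v := range all {
		if v.ts > readTS || seen[v.key] {
			continue
		}
		seen[v.key] = true
		if !v.deleted {
			out[v.key] = v.value
		}
	}
	return out
}

func main() {
	memDelta := []version{{"k1", 120, "new", false}, {"k2", 130, "", true}}
	diskDelta := []version{{"k1", 80, "old", false}, {"k2", 90, "x", false}}
	fmt.Println(mergeRead([][]version{memDelta, diskDelta}, 100))
	// map[k1:old k2:x] -- versions newer than the snapshot are invisible
}
```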
Since Lightning needs to connect to different OLTP systems, it has, in addition to multi-version schema support, a special two-tier mapping of logical schema and physical schema. The first tier, the logical schema, corresponds to the OLTP system's native schema, which may contain complex types such as protobuf and structs, while the second tier, the physical schema, contains only Lightning's own native types such as integers, floating-point numbers, and strings. This largely decouples the system implementation from the type system of the OLTP data source. The two tiers are connected by a logical mapping, which defines how each data type is translated back and forth between the source type system and the F1 type system (source to F1 on write, F1 back to the source type on read).
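The following is a tiny sketch of what such a two-tier mapping might look like: a logical source type is bound to one of a handful of physical types plus an encode/decode pair. The type and field names are assumptions made for illustration.

```go
// Tiny sketch of the logical-to-physical schema mapping: source types such as
// protobuf messages are stored as one of a few physical types, with a pair of
// functions translating values on write and on read. Names are illustrative.
package main

import "fmt"

// PhysicalType is the small set of native storage types.
type PhysicalType int

const (
	PhysInt64 PhysicalType = iota
	PhysFloat64
	PhysBytes
)

// LogicalMapping ties one logical (source) column type to its physical
// storage representation.
type LogicalMapping struct {
	LogicalType string
	Physical    PhysicalType
	Encode      func(src any) ([]byte, error) // source value -> stored bytes
	Decode      func(b []byte) (any, error)   // stored bytes -> source value
}

func main() {
	// A protobuf-typed column is simply stored as bytes; the mapping knows
	// how to serialize on write and deserialize on read.
	m := LogicalMapping{
		LogicalType: "protobuf:UserProfile",
		Physical:    PhysBytes,
		Encode:      func(src any) ([]byte, error) { return []byte(fmt.Sprint(src)), nil },
		Decode:      func(b []byte) (any, error) { return string(b), nil },
	}
	raw, _ := m.Encode("serialized-profile")
	v, _ := m.Decode(raw)
	fmt.Println(m.LogicalType, "->", m.Physical, v)
}
```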
Lightning uses range sharding and supports online repartitioning for load balancing, without interrupting data ingestion or queries. A Lightning partition splits when its size or write pressure crosses a threshold. The split is metadata-only (only the shard's range in the metadata is modified; no data is physically cut), and the resulting shards inherit the old shard's delta data. At query time, because the delta is inherited from the original shard, it has to be filtered by the new shard's range to read the correct data (TiKV's logical Region split is handled similarly). During the handover, a new shard does not become active until it has established its replication connection and caught up with the data, and the old shard waits for in-flight queries to complete before it stops serving. Shard merging is supported in a similar fashion. The paper does not say how shards migrate between nodes beyond splitting and merging; however, since the delta lives in cloud storage once flushed, it seems migration could be done simply by reassigning the shard to a new container and having it replay the short stretch of memory delta since the latest checkpoint.
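Here is a small sketch of the metadata-only split idea: both children keep pointing at the parent's files and filter reads by their own key range until a later compaction physically separates the data. The structure is illustrative, not Lightning's actual code.

```go
// Small sketch of a metadata-only shard split: both children keep pointing at
// the parent's files and simply filter reads by their own key range until a
// later compaction rewrites the data. Illustrative names only.
package main

import "fmt"

type Shard struct {
	Start, End string   // key range [Start, End)
	Files      []string // inherited data files (possibly shared with a sibling)
}

// split cuts a shard at key mid without touching any data files.
func (s Shard) split(mid string) (Shard, Shard) {
	left := Shard{Start: s.Start, End: mid, Files: s.Files}
	right := Shard{Start: mid, End: s.End, Files: s.Files}
	return left, right
}

// owns is the read-time filter: rows outside the shard's range are ignored
// even though they are physically present in the inherited files.
func (s Shard) owns(key string) bool {
	return key >= s.Start && (s.End == "" || key < s.End)
}

func main() {
	parent := Shard{Start: "a", End: "z", Files: []string{"bundle-001"}}
	l, r := parent.split("m")
	fmt.Println(l.owns("c"), r.owns("c")) // true false: same file, different owner
}
```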
In general, the considerations for a change-capable column store are similar, and so are the designs: a write-optimized area buffers incoming writes, the main data area is a large, well-organized column store, and a background job keeps merging the write area into the read area to maintain read speed. That is what we do, what Lightning does, and what the old Sisto system did. There is not much more to say about this part. This compaction appears to have both a size dimension and a time dimension, but the paper does not go into much detail about the actual compaction strategy. Compaction design has to balance write amplification against read efficiency: more frequent compaction trades resources and write amplification for read performance (more aggressive compaction keeps the data in better shape for reads, but consumes more compute and causes more write amplification). That said, given the cloud architecture, where data is written to cloud storage and extra compaction workers can be attached, even a fairly aggressive tiering and compaction scheme should be affordable.
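Since the paper leaves the policy open, the sketch below only illustrates what a size- and age-driven compaction trigger with dedicated workers might look like; the thresholds and the minor/major distinction are entirely assumptions.

```go
// Sketch of a size- and age-driven compaction trigger: small merges run
// locally, larger ones are handed to dedicated compaction workers. The
// thresholds and the split between "minor" and "major" are assumptions.
package main

import (
	"fmt"
	"time"
)

type deltaFile struct {
	bytes int64
	age   time.Duration
}

const (
	minorLimit = 64 << 20  // merge in place below this total size
	majorLimit = 512 << 20 // hand off to a compaction worker above this
	maxAge     = 30 * time.Minute
)

// planCompaction decides what to do with the current set of delta files.
func planCompaction(deltas []deltaFile) string {
	var total int64
	oldest := time.Duration(0)
	for _, d := range deltas {
		total += d.bytes
		if d.age > oldest {
			oldest = d.age
		}
	}
	switch {
	case total >= majorLimit:
		return "schedule on dedicated compaction worker"
	case total >= minorLimit || oldest >= maxAge:
		return "minor merge on the storage server"
	default:
		return "no compaction needed"
	}
}

func main() {
	fmt.Println(planCompaction([]deltaFile{{bytes: 100 << 20, age: time.Minute}}))
}
```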
Replication
F1 Lightning relies primarily on replication at the CDC level, which shields it from much of the complexity of the upstream TP systems. Component-wise, change replication physically consists of two parts:
- Changepump: converts the CDC formats of different sources into the format Changepump accepts, and turns transaction-oriented logs into partition-oriented change logs. Since Lightning storage is sharded, replaying a source transaction may span multiple shards and servers, and Changepump has to maintain the transaction's information to preserve consistency during replay.
- Change subscriber: the client side of Changepump. Lightning divides a table into shards, and each shard maintains a subscription to Changepump. Each subscription carries a start timestamp (which can point to historical data), and Changepump replays changes from that point, which makes resumable transfer possible.
Delivery is guaranteed to be ordered for the same primary key, but not across primary keys. This presumably allows data for different primary keys to be distributed across different nodes without being ordered through a central single point. As a consequence, the latest data is generally not consistent across primary keys, so the mechanism again relies on a safe threshold to ensure that the queryable data is consistent: Changepump periodically generates a checkpoint timestamp and sends it to all subscribers, meaning that all data older than this timestamp has been delivered, and each Lightning server adjusts its own safe query threshold based on this information. However, this checkpoint timestamp is generated in a concurrent environment where it cannot be updated for every single change (the coordination cost would be far too high), so the system trades freshness against efficiency: updating the watermark faster and more precisely yields a fresher watermark but lower processing efficiency, and vice versa.
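As an illustration of that trade-off, here is a sketch (with invented names) of a pump that periodically broadcasts the minimum fully delivered timestamp across its sources; shrinking the interval buys freshness at the cost of more coordination.

```go
// Sketch of Changepump's periodic checkpoint timestamp: the pump tracks how
// far each upstream source has been fully delivered and, on a fixed interval,
// broadcasts the minimum to all subscribers, which then raise their safe
// query threshold. Interval and names are assumptions.
package main

import (
	"fmt"
	"time"
)

type pump struct {
	delivered map[string]uint64 // source -> fully delivered commit timestamp
}

// checkpointTS is the timestamp below which every source has been delivered.
func (p *pump) checkpointTS() uint64 {
	var min uint64
	first := true
	for _, ts := range p.delivered {
		if first || ts < min {
			min, first = ts, false
		}
	}
	return min
}

func main() {
	p := &pump{delivered: map[string]uint64{"spanner-part-1": 205, "spanner-part-2": 198}}
	// A shorter interval gives subscribers fresher safe timestamps but costs
	// more coordination; a longer one is cheaper but staler.
	ticker := time.NewTicker(50 * time.Millisecond)
	defer ticker.Stop()
	for i := 0; i < 2; i++ {
		<-ticker.C
		fmt.Println("broadcast checkpoint timestamp:", p.checkpointTS())
		p.delivered["spanner-part-2"] = 210 // more changes delivered meanwhile
	}
}
```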
Changepump itself is distributed. On the source side, the change log of the same source may be processed by different Changepump servers, and on the destination side, the same subscription may connect to different pumps. Changepump servers and Lightning servers do not correspond one-to-one, because Changepump needs to balance write volume while Lightning servers need to balance storage volume.
The biggest difference between F1 Lightning and TiDB HTAP lies in the replication design. TiDB's replication is based on the lower-level Raft log, which lets TiDB provide better consistency and freshness with very little modification to the OLTP side. F1's design opts for looser coupling and uses CDC for replication. A lower-level replication protocol can preserve more of the details and consistency of replication, while a higher-level protocol needs extra machinery to maintain consistency. The upside is that the OLAP storage layer does not have to care much about the OLTP side's design constraints (for example, TiDB has to replicate in Regions of the same size, while F1 Lightning does not). Unfortunately, the change replication section of the paper is not described in much detail, and there are quite a few details worth exploring in how consistency and fault tolerance are maintained. For example, how do you tell whether a server simply has no transactions, whether replay is stuck, or whether the CDC is not being generated? Is liveness determined by heartbeats that send empty packets to advance the timestamp? How is the mapping between source and target partitions tracked in order to advance the timestamp? For a distributed OLTP source like Spanner, maintaining consistency through CDC replay is genuinely hard, and the paper's treatment of these points is a bit too brief, which is a pity.
Fault tolerance
Since cloud storage is used and it already guarantees the durability and availability of the data, the Lightning servers only need to guarantee the availability of the service. Each shard is therefore assigned to multiple Lightning servers, each of which pulls the same change log from Changepump (Changepump keeps an in-memory cache to optimize this repeated pulling) and maintains its own memory delta. However, only one primary replica of the shard writes to cloud storage (both the disk-resident delta and the memory delta checkpoints), and when it does, it notifies the other replica servers to update their view of the LSM. All of these replicas can serve reads simultaneously. When a server goes down, the replacement node first tries to fetch data from the shard's other replicas (the data is already in sorted form, effectively a natural checkpoint) rather than replaying it directly from Changepump. During rolling upgrades, Lightning is also careful not to restart multiple servers of the same shard at the same time.
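A minimal sketch of this shared-storage arrangement, assuming a primary-flush-then-notify protocol, might look as follows; the structure is illustrative only.

```go
// Sketch of the shared-storage replication described above: only the shard
// primary flushes deltas to cloud storage; replicas are then told to refresh
// their view of the on-disk LSM. All structure here is illustrative.
package main

import "fmt"

type replica struct {
	name      string
	diskFiles []string // view of the shard's files in cloud storage
}

// refresh adopts the primary's latest flushed state.
func (r *replica) refresh(files []string) {
	r.diskFiles = files
}

func main() {
	cloud := []string{"delta-0001"} // files in shared cloud storage
	primary := &replica{name: "primary", diskFiles: cloud}
	follower := &replica{name: "follower", diskFiles: cloud}

	// The primary flushes its memory delta as a new file and notifies followers.
	cloud = append(cloud, "delta-0002")
	primary.refresh(cloud)
	follower.refresh(cloud)
	fmt.Println(follower.name, "now sees", follower.diskFiles)
}
```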
Lightning also comes with table-level failover. When a table becomes unavailable (blacklisted under pressure, corrupted data, and so on), the system can automatically route queries back to the OLTP system (since the query semantics and the data are essentially equivalent). However, users can choose whether to enable such failover, in case AP query pressure overwhelms TP.
In short, there is not much new in the fault-tolerance story. However, thanks to cloud storage's own fault tolerance, the ability to replay history from the pump at any time, and the absence of any obligation to serve the very latest data, F1 Lightning has a lot of design slack here: it needs no elaborate consistent-replication protocol and does not have to worry much about inconsistent reads.
F1 Query integration
The F1 Query paper can be found here; it can be considered a federated query engine, similar to Apache Spark or Presto, spanning multiple storage engines and source databases. F1 Query serves as Lightning's query engine. Thanks to this federated nature, users' queries can be answered regardless of whether the data lives in rows or columns. Similar to how TiDB handles it, F1 Lightning is exposed to the optimizer as a special access path, and the CBO decides whether to read from the column store. Interestingly, though, F1 Query does not differentiate at the logical-plan stage: it produces the same logical plan whether reading rows or columns, and only selects the concrete access path at the physical-plan stage. There seems to be room for further optimization here, such as choosing different join orders based on the different indexes and cost characteristics of rows versus columns. In addition, F1 Lightning supports subplan pushdown, similar to TiFlash's coprocessor, for computation that needs no shuffle, such as filters, partial aggregation, and projection.
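To illustrate the physical-plan decision, here is a sketch with a deliberately crude, made-up cost model: the optimizer compares a full row-store scan against a columnar scan that reads only the needed columns and benefits from pushed-down filters.

```go
// Sketch of choosing the access path at physical planning time: estimate the
// cost of the row store versus the columnar Lightning replica (with pushable
// operators such as filters and partial aggregations pushed down) and take
// the cheaper plan. The cost model is a deliberately crude stand-in.
package main

import "fmt"

type scan struct {
	rows         int64
	selectedCols int
	totalCols    int
	pushdownSel  float64 // selectivity of pushed-down filters, 0..1
}

// rowStoreCost reads whole rows; no pushdown assumed here.
func rowStoreCost(s scan) float64 { return float64(s.rows) }

// columnStoreCost reads only the needed columns and benefits from pushdown.
func columnStoreCost(s scan) float64 {
	colFraction := float64(s.selectedCols) / float64(s.totalCols)
	return float64(s.rows) * colFraction * s.pushdownSel
}

func main() {
	q := scan{rows: 1_000_000, selectedCols: 3, totalCols: 50, pushdownSel: 0.1}
	if columnStoreCost(q) < rowStoreCost(q) {
		fmt.Println("physical plan: scan Lightning (columnar) with pushed-down subplan")
	} else {
		fmt.Println("physical plan: scan the OLTP row store")
	}
}
```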
This part of the design is similar to the TiFlash + TiSpark/TiDB combination. The role F1 Query plays here roughly corresponds to TiSpark in TiDB HTAP; both bridge different stores and push down subplans.
Engineering practice
- Component reuse: many reusable components in the system are split out into separate libraries and reused in multiple other products. In addition, components originally designed for Lightning, such as Changepump, are also used by other products like TableCache.
- Validation: a validation mechanism was added to compare the row and column systems for correctness.
Benefits and Costs
There is no doubt that the columnar replica takes up extra space, but columnar execution is more computationally efficient, and computing resources are more expensive than storage. On analysis, trading additional storage for significant compute savings appears worthwhile for read-intensive applications. For mixed workloads, whether small distributed queries, ad-hoc queries written by humans, or ETL-style jobs, Lightning's storage replica and computation pushdown can save a lot of compute resources and time. The same holds for TiDB HTAP: we have received feedback that TiFlash lets users run with fewer servers.
Comparison with TiDB HTAP
What follows is a bit of touting our own product 🙂
Overall, TiDB HTAP uses a completely different design yet achieves, or even exceeds, the advantages described in the F1 Lightning paper. As mentioned in the previous article, our original TiFlash design was also based on CDC, but for various reasons it "evolved" into its current multi-Raft-based shape. Both aim to solve heterogeneous replication from a row store to a column store, where both ends may be distributed systems. Heterogeneous replication from a single-node OLTP system to a distributed OLAP system, as in SAP HANA, is undoubtedly simpler: the log is produced on a single machine, so ordering and consistency can be handled in a very straightforward way. Consistency for logs produced by a multi-master distributed system is trickier: since the source is a multi-point distributed system, funneling everything through a single receiving point would not keep up, yet multi-point reception makes global consistency hard to guarantee. The CDC scheme of F1 Lightning therefore designs a fairly complex timestamp watermark to provide a consistent query mechanism at the expense of freshness; admittedly, given my limited skill, I have not fully grasped every design point behind it. In any case, guaranteeing consistency when doing heterogeneous replication via CDC is a relatively complex design.

Heterogeneous replication over Raft has very different characteristics. TiFlash's replication automatically achieves load balancing through multi-Raft plus PD scheduling, without separately worrying about log transport, write pressure at the source, or storage balancing at the target. And because the Raft protocol provides a consistent-read algorithm for Follower/Learner replicas, TiFlash can easily serve the latest data by relying on Raft's linear log index, without maintaining complex watermarks. The nearly identical sharding (Region) design also means that global consistency across shards is fully guaranteed by Raft plus MVCC, while still reading the "freshest" data. "Freshest" versus "fresh enough" may look like a small difference in time, but the fact that TiFlash offers almost the same consistency as TiKV is what allows TiDB's row store and column store to be mixed in online business, even within a single query, without worrying that they might serve different data; in my view, this is what real HTAP means. Moreover, because of this design, TiFlash does not need a separate storage architecture, nor additional load-balancing or storage-scheduling machinery: it is operated through the existing PD with behavior almost identical to the TiKV engine, which greatly reduces the burden of deployment and operations. All in all, I think the TiDB HTAP design is more elegant and superior 🙂 and of course you are welcome to share your own opinions in the comments section.
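To illustrate why the learner approach needs no separate watermark, here is a schematic of the read-index idea under the Raft protocol; it is a sketch of the general technique, not TiFlash's actual implementation.

```go
// Schematic of the learner read-index idea that lets a columnar replica serve
// fresh, consistent snapshots without a global watermark: ask the Raft leader
// for its commit index, wait until the local apply index catches up, then read
// at the requested timestamp. Not TiFlash's actual implementation.
package main

import (
	"fmt"
	"time"
)

type learner struct {
	appliedIndex uint64
}

// leaderCommitIndex stands in for a ReadIndex request to the Raft leader.
func leaderCommitIndex() uint64 { return 42 }

// waitApplied blocks until the learner has applied everything the leader had
// committed at the time of the read request.
func (l *learner) waitApplied(target uint64) {
	for l.appliedIndex < target {
		time.Sleep(time.Millisecond) // in reality: wait on the apply loop
		l.appliedIndex++             // simulated log application
	}
}

func (l *learner) snapshotRead(ts uint64) string {
	target := leaderCommitIndex()
	l.waitApplied(target)
	return fmt.Sprintf("consistent read at ts=%d (applied index %d)", ts, l.appliedIndex)
}

func main() {
	l := &learner{appliedIndex: 40}
	fmt.Println(l.snapshotRead(100))
}
```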