PolarDB MySQL is a database system born in the cloud. Beyond OLTP scenarios, many customers have also asked PolarDB for high performance in real-time data analysis. In response, the PolarDB technology team proposed the In-Memory Column Index (IMCI) solution, which delivers a speedup of up to several hundred times in complex analytical query scenarios. This article explains the thinking behind IMCI's technical route and the concrete design choices.
Author: Beilou | Source: Alibaba Tech official account
Preface
Analytical databases have recently been very popular in both the capital markets and the technology community, with innovative products from startups springing up like mushrooms. On one hand, this is because enterprises increasingly need to mine their data to find growth opportunities; on the other hand, cloud-native technology keeps evolving and reshaping existing technical systems. The success of Snowflake proves that rebuilding analytical database systems with cloud-native technology is both necessary and a large market opportunity.
PolarDB MySQL is a database system born in the cloud. Beyond OLTP scenarios, many customers have also asked PolarDB for high performance in real-time data analysis. The PolarDB technology team therefore proposed the In-Memory Column Index (IMCI) solution, which achieves a speedup of up to several hundred times in complex analytical query scenarios.
This article explains the thinking behind IMCI's technical route and the concrete design choices. The column-store analysis capability of PolarDB MySQL is coming online on Alibaba Cloud; stay tuned.
HTAP database solutions in the MySQL ecosystem
MySQL is an open source database designed mainly for OLTP scenarios. The open source community's development focus is on strengthening its transaction processing capabilities: improving single-core performance, multi-core scalability, and cluster capabilities for higher availability. As for the capabilities needed to handle complex queries over large amounts of data, such as optimizer support for subqueries, hash joins, SQL parallel execution, and other high-performance operators, the community has always given them relatively low priority, so MySQL's data analysis capability has improved slowly.
As MySQL has grown into the world's most popular open source database system, users store vast amounts of data in it and run critical business logic on it, so real-time analysis of this data has become a growing demand. When standalone MySQL fails to meet this demand, users look for a better solution.
1 Building blocks: MySQL plus a dedicated AP database
There are many dedicated analytical database products to choose from. One option is to meet OLTP and OLAP requirements with two separate systems and synchronize data between them in real time using data synchronization tools. Users can even add a proxy layer that automatically routes TP loads to MySQL and analytical loads to the OLAP database, shielding the application layer from the deployment topology of the underlying databases.
This architecture is flexible: for example, the best product can be chosen for each of the TP and AP sides, and TP/AP loads are completely isolated. But its disadvantages are also obvious. First, two database systems built on different technology stacks must be maintained. Second, because the two systems process data differently, maintaining real-time data consistency between upstream and downstream is very challenging. Moreover, because of synchronization delay, the downstream AP system often stores stale data and cannot meet real-time analysis requirements.
2 Divergent Design based on multiple replicas
Many of the new database products that emerged with the Internet are compatible with the MySQL protocol and thus serve as MySQL alternatives. Most of these distributed databases adopt a distributed shared-nothing design, one of whose core features is using a distributed consensus protocol to ensure data consistency among the replicas of a partition. Since each replica holds a complete, independent copy of the data, storing different replicas in different formats to serve different query loads is easy to implement. A typical example is TiDB which, starting with TiDB 4.0, serves AP loads from a column-store replica (TiFlash) in each Raft group and automatically selects the data source through TiDB's intelligent routing. This lets one database system serve both OLTP and OLAP loads.
This approach has been used in many research and industry efforts and is increasingly becoming the de facto standard for integrated HTAP in distributed databases. However, applying it requires users to migrate to the corresponding NewSQL database system, which often brings various compatibility issues.
3 Integrated row-column hybrid storage solution
One step further than the multi-replica Divergent Design approach is adopting row-column hybrid storage within a single database instance, serving both TP and AP loads from it. This is the route taken by traditional commercial databases such as Oracle, SQL Server, and DB2:
- Oracle released the Database In-Memory suite in Oracle 12c in 2013. Its core feature is the In-Memory Column Store, which improves OLAP performance through row-column hybrid storage and advanced query optimizations (materialized expressions, join groups, and other techniques).
- Starting with SQL Server 2016 SP1, Microsoft provides the Column Store Indexes feature. Depending on load characteristics, users can flexibly choose pure row-store tables, pure column-store tables, row-column hybrid tables, column-store tables with row-store indexes, and other modes.
- In release 10.5 (Kepler) in 2013, IBM added the DB2 BLU Acceleration component, which dramatically improves performance in analysis scenarios through columnar data storage combined with in-memory computing and data-skipping technology.
That all three leading commercial database vendors chose the route of row-column hybrid storage combined with in-memory computing has an underlying technical logic: column storage offers better I/O efficiency (compression, data skipping, column pruning) and better CPU efficiency (cache friendliness), so column storage is a must for maximum analysis performance. On the other hand, the imprecise lookups caused by column storage's sparse indexing make it unsuitable as the storage format for TP scenarios, so row-column hybrid storage becomes mandatory. Still, in a hybrid architecture, column indexes lag behind row indexes when handling random updates, and the low read/write latency of DRAM must be used to compensate for the low update efficiency of column storage. Therefore, under the premise of low-latency online transaction processing plus high-performance real-time data analysis, row-column hybrid storage combined with in-memory computing is the only viable solution.
Comparing the three approaches above, from building blocks through Divergent Design to integrated row-column hybrid storage, the degree of integration rises and the user experience improves, but the engineering challenge inside the database kernel also grows with each step. The role of basic software is to keep complexity to itself and leave simplicity to the user, so the integrated approach follows the trend of technological development.
PolarDB MySQL AP capability evolution
PolarDB MySQL's capability stack is similar to open source MySQL: strong in TP but weak in AP. PolarDB provides up to 100 TB of storage per instance, and its transaction processing capability far exceeds user-built MySQL. As a result, PolarDB users tend to store more data on a single instance and run complex aggregate queries on that data. With PolarDB's one-writer, multiple-reader architecture, users can add read-only (RO) nodes to run complex read-only queries, avoiding interference between analytical queries and the TP load.
1 Defects of MySQL architecture in AP scenarios
MySQL's implementation performs poorly on complex queries for several reasons. Compared with dedicated OLAP systems, its performance bottlenecks show up in the following aspects:
- MySQL's SQL execution engine is based on the Volcano iterator model. In engineering practice this architecture relies on deep function nesting and many virtual function calls. When processing massive data, it hurts the efficiency of modern CPU pipelines and yields poor CPU cache efficiency. The iterator execution model also fails to take full advantage of the SIMD instructions provided by modern CPUs.
- The execution engine can only run serially and cannot exploit the parallel computing capability of modern multi-core CPUs. MySQL 8.0 added parallel execution for some basic queries such as count(*), but there is still a long way to go to build parallel execution for complex SQL.
- MySQL's most commonly used storage engine stores data by row. When analyzing massive data column by column, reading data row by row from disk consumes a large amount of I/O bandwidth. Moreover, the row format copies a large amount of unneeded column data during processing, which hurts memory read/write efficiency.
2 PolarDB parallel query breaks the CPU bottleneck
The Parallel Query framework developed by the PolarDB team automatically starts parallel execution when the amount of data a query touches exceeds a threshold: data is partitioned across threads at the storage layer, multiple threads compute in parallel and pipeline their results to a leader thread, and the leader thread performs a simple merge and returns the result to the user, improving query efficiency.
Parallel Query lets PolarDB break through the limit of single-core execution performance and exploit the parallel processing power of multi-core CPUs, cutting the execution time of some SQL queries on PolarDB by multiples.
3 Why We Need Column-Store
The parallel execution framework breaks through the limits of CPU scalability and brings significant performance improvement. However, constrained by the efficiency of row storage and row-oriented executors, single-core execution performance has a ceiling, and peak performance still lags behind dedicated OLAP systems. To further improve PolarDB MySQL's analysis performance, we need to introduce column storage:
- Analysis scenarios often access a large number of records of just a few columns; because column storage splits data by column, reading unneeded columns is avoided. Second, thanks to the contiguous storage of values with the same attributes, column storage compresses far more effectively than row storage, usually by more than 10x. Finally, block-based organization in column storage, combined with coarse index information such as MIN/MAX per block, enables filtering out large ranges of data. All of this greatly improves I/O efficiency. In today's storage-compute separated architecture, reducing the amount of data read over the network brings an immediate improvement in query response time.
- Column storage also improves CPU efficiency when processing data. First, the compact layout of column storage improves memory access efficiency and reduces execution stalls caused by L1/L2 cache misses. Second, column storage makes it easy to apply SIMD technology for further acceleration, which is the common technical route of modern high-performance analytical execution engines (Oracle/SQL Server/ClickHouse). The sketch below illustrates the layout difference.
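To make the cache-efficiency point concrete, here is a minimal C++ sketch (illustrative only, not PolarDB code; the `Row` struct and function names are invented). Summing one column in the row layout drags every other column through the cache, while the columnar layout touches only useful bytes:

```cpp
#include <cstdint>
#include <vector>

// Row with five columns, mirroring the (C1..C5) example later in this
// article. Scanning only c2 in row layout drags the other four columns
// through the CPU cache as well.
struct Row {
    int64_t c1, c2, c3, c4, c5;
};

// Row layout: each useful 8-byte c2 value arrives with 32 bytes of
// unrelated data, so only ~20% of each cache line carries payload.
int64_t sum_c2_rows(const std::vector<Row>& rows) {
    int64_t sum = 0;
    for (const Row& r : rows) sum += r.c2;
    return sum;
}

// Column layout: c2 is contiguous, every cache line is fully useful, and
// the tight loop is easy for the compiler to auto-vectorize with SIMD.
int64_t sum_c2_column(const std::vector<int64_t>& c2) {
    int64_t sum = 0;
    for (int64_t v : c2) sum += v;
    return sum;
}
```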
PolarDB In-Memory Column Index
PolarDB In-Memory Column Index brings column storage and in-memory computing to PolarDB, allowing users to run mixed TP and AP loads on one PolarDB database and greatly improving PolarDB's performance on complex queries over large amounts of data.
In-Memory Column Index uses row-column hybrid storage and builds on PolarDB's shared-storage-based, one-writer, multiple-reader architecture. It contains the following key technical innovations:
- PolarDB's storage engine (InnoDB) adds support for columnar indexes. Through DDL, users can choose to create a columnar index on all or some columns of a table. Columnar indexes are stored column-compressed and consume far less space than the row format. By default columnar indexes reside entirely in memory to maximize analysis performance, but they can also be persisted to shared storage when memory is insufficient.
- In the PolarDB SQL executor layer, we rewrote a column-oriented executor engine framework that fully exploits the advantages of column storage, such as accessing the storage layer in batches of 4096 rows, using SIMD instructions to raise the data-processing throughput of a single CPU core, and supporting parallel execution in all key operators. On column storage, the new executor outperforms MySQL's original row-oriented executor by orders of magnitude.
- An optimizer framework supporting mixed row-column execution. The optimizer decides whether column execution is possible based on whether the SQL can be covered by a columnar index and whether all functions and operators it depends on are supported by the column executor. It then estimates the cost of both the row plan and the column plan and selects the cheaper one.
- Users can designate an RO node in a PolarDB cluster as an analysis node and configure columnar index generation on it. Complex queries run on the columnar indexes there and use all available CPU power for maximum execution performance, without affecting the memory and CPU resources available to TP loads elsewhere in the cluster.
Together, these key technologies make PolarDB a true HTAP database system whose performance on complex queries over large amounts of data reaches the level of top commercial database systems such as Oracle and SQL Server.
The technical architecture of In-Memory Column Index
1 Row-column hybrid optimizer
PolarDB has a native row-store-oriented optimizer component, which needs enhancement once the engine layer adds column storage support: the optimizer must decide whether a query should be dispatched to row-store or column-store execution. We do this through a whitelist mechanism plus an execution-cost estimation framework. The system ensures that supported SQL is accelerated while unsupported SQL remains compatible.
How to achieve 100% MySQL compatibility
We achieve compatibility through a whitelist mechanism, which is based on several considerations. First, given limited system resources (mainly memory), columnar indexes are not created on all tables; when a query needs a column that does not exist in the column store, it cannot execute on the column store. Second, for performance we completely rewrote a SQL execution engine for column storage, including all physical operators and expression evaluation, and its coverage does not yet reach everything MySQL's native row-store engine supports. When an incoming SQL statement contains operator fragments or column types not supported by the IMCI execution engine, the system must recognize this, intercept the statement, and fall back to row-store execution. A simplified sketch of such a check follows.
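The following C++ sketch is purely illustrative; its types, names, and rules are hypothetical, not PolarDB's internals:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical plan node; the names and rules here are invented for
// illustration and are not PolarDB's internal API.
struct PlanNode {
    std::string op_type;               // e.g. "HASH_JOIN", "SORT", "UDF_CALL"
    std::vector<std::string> columns;  // columns this operator reads
    std::vector<PlanNode*> children;
};

// Whitelist-style check: offload to IMCI only if every operator in the
// tree is supported by the column executor AND every referenced column
// actually exists in a column index; otherwise fall back to the row store.
bool CanOffloadToIMCI(const PlanNode* node,
                      const std::unordered_set<std::string>& supported_ops,
                      const std::unordered_set<std::string>& columnar_cols) {
    if (supported_ops.count(node->op_type) == 0) return false;
    for (const std::string& col : node->columns)
        if (columnar_cols.count(col) == 0) return false;
    for (const PlanNode* child : node->children)
        if (!CanOffloadToIMCI(child, supported_ops, columnar_cols)) return false;
    return true;
}
```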
Query plan transformation
The purpose of plan transformation is to convert MySQL's native logical execution plan representation (the AST) into IMCI's logical plan. After the IMCI logical plan is generated, an optimization pass produces the physical plan. Plan transformation is simple and direct: it traverses the execution plan tree and converts the MySQL-optimized AST into an IMCI tree whose nodes are relational operators, a fairly direct translation. Along the way, some extra work is done, such as implicit type conversion, to accommodate MySQL's flexible type system.
An optimizer for mixed row-column execution
With both row-store and column-store execution engines available, the optimizer has more choices: it can compare the cost of the row plan against the cost of the column plan and pick the cheaper one. In PolarDB, besides MySQL's original serial execution, there is also the row-store-based Parallel Query feature that exploits multi-core computing. So the optimizer actually chooses among 1) row-store serial execution, 2) row-store Parallel Query, and 3) IMCI. In the current iteration, it follows the flow below (a code sketch follows the list):
- Parse the SQL and generate a logical plan, then invoke MySQL's native optimizer to perform optimizations such as join order adjustment. The logical plan obtained in this phase is also forwarded to IMCI's plan compilation module, which attempts to generate a column-store execution plan (this may be rejected by the whitelist and fall back to row storage).
- The PolarDB optimizer computes a row-oriented execution cost from the row-store plan. If this cost exceeds a threshold, the query is pushed down to the IMCI executor and runs with the IMCI plan.
- If IMCI cannot execute the SQL, PolarDB attempts to compile a Parallel Query execution plan and run it. If a PQ plan cannot be generated either, then neither IMCI nor PQ supports this SQL, and it falls back to serial row-store execution.
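The decision flow above can be condensed into a small dispatch function; this is an illustrative sketch under the stated rules, with hypothetical names rather than PolarDB's actual code:

```cpp
// Illustrative dispatch logic for the flow above; identifiers are
// hypothetical, not PolarDB's internal API.
enum class ExecMode { RowSerial, RowParallel, IMCI };

ExecMode ChoosePlan(double row_plan_cost, double cost_threshold,
                    bool imci_plan_ok, bool pq_plan_ok) {
    // Cheap queries stay on the row store: offloading is considered only
    // when the estimated row-store cost exceeds the threshold.
    if (row_plan_cost <= cost_threshold) return ExecMode::RowSerial;
    if (imci_plan_ok) return ExecMode::IMCI;       // column execution
    if (pq_plan_ok) return ExecMode::RowParallel;  // row-store Parallel Query
    return ExecMode::RowSerial;  // final fallback: 100% MySQL compatibility
}
```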
This strategy rests on a performance judgment: in execution performance, row-store serial < row-store parallel < IMCI; in SQL compatibility, IMCI < row-store parallel < row-store serial. The real situation is more complicated, however. For example, in some cases a parallel index join using a covering, ordered row-store index costs less than a column-store sort-merge join, yet the current policy may still select IMCI for column execution.
2 Execution engine for columnar storage
The IMCI execution engine is optimized for column storage and completely independent of MySQL's existing row-oriented executor. It was rewritten to eliminate two key bottlenecks of the row-store engine on analytical SQL: the per-row access overhead of virtual function calls, and the inability to execute in parallel.
Batch parallel operators
The IMCI executor engine uses the classic Volcano model, but relies on column storage and vectorized execution to improve performance.
In the Volcano model, each relational-algebra operation in the syntax tree generated from SQL is abstracted as an operator, and the execution engine builds the whole SQL statement into an operator tree. The query tree calls the Next() interface top-down, and data is pulled and processed bottom-up. The advantage is that the computation model is simple and direct: by abstracting the different physical operators into iterators, each operator cares only about its own internal logic, coupling between operators is reduced, and building a logical execution engine becomes easier.
- In IMCI's execution engine, each operator likewise exposes an iterator interface to access data, but with the difference that each call returns a batch of data rather than a single row; it can be seen as a Volcano model that supports batch processing (see the sketch after this list).
- Serial execution is limited by single-core computing efficiency, memory access latency, I/O latency, and so on. The IMCI executor supports parallel execution in several key physical operators (Scan, Join, Agg, etc.). Besides parallel physical operators, IMCI's optimizer must also generate parallel execution plans: when deciding a table's access method, the optimizer decides whether to enable parallel execution based on the amount of data to be accessed. If parallelism is enabled, the degree of parallelism (DOP) is determined from a series of state data, including available CPU/memory/I/O resources, currently scheduled and queued tasks, statistics, query complexity, and user-configurable parameters. From this, a recommended DOP value is computed for the operators, which all share the same DOP internally. Hints are also supported for setting the DOP.
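The following C++ sketch shows what a batch-at-a-time Volcano operator can look like; it is an illustrative skeleton, not the actual IMCI class hierarchy. Each Next() call returns a batch of column vectors instead of one row, amortizing virtual-call overhead across up to 4096 rows:

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Batches are capped at 4096 rows, matching the batch size cited above.
constexpr size_t kBatchSize = 4096;

struct ColumnBatch {
    std::vector<std::vector<int64_t>> columns;  // one vector per column
    size_t num_rows = 0;                        // 0 signals end-of-stream
};

class Operator {
public:
    virtual ~Operator() = default;
    virtual ColumnBatch Next() = 0;  // pull one batch from this operator
};

// A filter operator: pulls a batch from its child and keeps matching rows.
class FilterOp : public Operator {
public:
    explicit FilterOp(std::unique_ptr<Operator> child)
        : child_(std::move(child)) {}

    ColumnBatch Next() override {
        ColumnBatch in = child_->Next();
        ColumnBatch out;
        out.columns.resize(in.columns.size());
        for (size_t row = 0; row < in.num_rows; ++row) {
            if (in.columns[0][row] > 0) {  // example predicate: col0 > 0
                for (size_t c = 0; c < in.columns.size(); ++c)
                    out.columns[c].push_back(in.columns[c][row]);
                ++out.num_rows;
            }
        }
        return out;
    }

private:
    std::unique_ptr<Operator> child_;
};
```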
Vectorized execution solves the problem of single-core execution efficiency, while parallel execution breaks through the single-core computation bottleneck. Combined, they make IMCI orders of magnitude faster than traditional MySQL row-at-a-time execution.
Accelerating computation with SIMD vectorization
In AP scenarios, SQL often contains many computations over one or more values, operators, or functions; these fall under expression evaluation. Expression evaluation is computationally intensive, so its efficiency is a key factor in overall performance.
MySQL's traditional expression evaluation system operates one row at a time, the iterator model mentioned above. Because the iterator abstracts away the whole table, each expression is implemented as a tree, the implementation code is easy to understand, and the whole process is very clear.
But this abstraction carries a performance penalty: iterating causes multiple function calls per row fetched, and fetching data row by row wastes I/O and is cache-unfriendly. Moreover, MySQL's tree-shaped iterator model is constrained by the storage engine's access methods, which makes it hard to optimize complex logical computation.
In the column storage format, since each column's data is stored separately and sequentially, expression computation involving particular columns can be done in batches: each expression takes batches as its input and output units. In batch mode, the computation can be accelerated with SIMD instructions. The new expression system has two key optimizations (a kernel sketch follows the list):
- Exploiting columnar storage, we replaced the iterator model with batch processing and used SIMD instructions to rewrite the expression kernels for most common data types. For example, the basic arithmetic operations (+, -, *, /, abs) of all numeric types (int, decimal, double) have corresponding SIMD implementations. Using the AVX512 instruction set, single-core performance improves several times.
- An expression implementation similar to PostgreSQL's: during SQL compilation and optimization, IMCI stores expressions as trees (just like the existing row-oriented iterator model), but before execution it flattens each expression tree, via a post-order traversal, into a one-dimensional array; subsequent computation only needs to iterate over this array. Because the recursion of the tree iterator model is eliminated, computation is more efficient. The method also gives the computation a simple abstraction that separates data from the computation process, making it naturally suited to parallel execution.
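As a concrete example of such a batched kernel, here is a minimal AVX-512 sketch for adding two int32 columns; it is illustrative code rather than PolarDB's actual kernel, and it assumes a CPU with AVX-512F support:

```cpp
#include <immintrin.h>  // AVX-512 intrinsics
#include <cstddef>
#include <cstdint>

// Batched expression kernel (column + column -> column), in the spirit
// described above. Illustrative only; requires AVX-512F.
void add_int32_batch(const int32_t* a, const int32_t* b, int32_t* out,
                     size_t n) {
    size_t i = 0;
    // Process 16 int32 values per instruction (512 bits / 32 bits).
    for (; i + 16 <= n; i += 16) {
        __m512i va = _mm512_loadu_si512(a + i);
        __m512i vb = _mm512_loadu_si512(b + i);
        _mm512_storeu_si512(out + i, _mm512_add_epi32(va, vb));
    }
    // Scalar tail for the remaining (n mod 16) elements.
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```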
3 A storage engine supporting row-column hybrid storage
Transactional and analytical applications place completely different, sometimes contradictory, requirements on a storage engine: the former needs indexes that precisely locate each row and support efficient insert, update, and delete, while the latter needs efficient batch scanning.
Designing an integrated storage engine that serves both OLTP and OLAP loads is therefore very challenging. At present, only a few large vendors with decades of R&D accumulation have relatively mature HTAP storage engines, for example Oracle (In-Memory Column Store), SQL Server (In-Memory Column Index), and DB2 (BLU). Others, such as TiDB, can only support HTAP requirements by turning one replica of a multi-replica cluster into column storage.
An integrated HTAP storage engine generally uses row-column hybrid storage: the engine holds both row-format and column-format data, with the row store serving TP and the column store serving AP. Compared with deploying an independent OLTP database plus an independent OLAP database, a single HTAP engine has the following advantages:
- Real-time consistency between row data and column data satisfies many demanding business requirements: data is visible to analytical queries as soon as it is written.
- Lower cost: users can conveniently specify which columns of a table, or even which ranges of it, are also stored in column format for analysis, while the full data remains stored in row format.
- Easier management, operations, and maintenance: users do not need to worry about data synchronization or data consistency between two systems.
PolarDB adopts row-column hybrid storage technology similar to that of commercial databases such as Oracle and SQL Server, which we call In-Memory Column Index:
- When creating a table, you can specify column storage for all or part of its columns, or add column storage attributes to an existing table with an ALTER TABLE statement. Analytical queries automatically use the column format for acceleration.
- Column data is stored on disk in compressed format by default, and the In-Memory Column Store Area provides caching to accelerate queries. Traditional row-format data is still cached in the Buffer Pool to serve OLTP loads.
- All inserts, deletes, and updates in transactions are reflected in the column store in real time, ensuring data consistency at the transaction level.
Implementing a row-column hybrid storage engine is technically difficult, but adding column storage support to InnoDB, a mature storage engine optimized for OLTP, is a different situation:
- Meeting the requirements of OLTP business is the top priority, so adding column storage support must not hurt TP performance too much. This requires keeping the column store lightweight enough, sacrificing AP performance to preserve TP performance when necessary.
- The column store's design need not solve concurrent transactional modification, uniqueness checks, and similar problems by itself, because the row-store system already solves them; such problems are very hard for a standalone column engine such as ClickHouse to handle.
- Because a battle-tested row-store system is always present, queries can be switched back to the row store whenever the column store runs into problems.
These conditions are both advantages and constraints, and they shaped the design of PolarDB's overall hybrid storage scheme.
A column store represented as an index
Within MySQL's pluggable storage engine framework, the simplest way to add column storage support is to implement a separate storage engine, as Infobright and MariaDB's ColumnStore do. PolarDB instead chose to implement column storage as a secondary index of InnoDB, based on the following considerations:
- InnoDB natively supports multiple indexes: insert/update/delete operations are applied to the primary index and all secondary indexes at row granularity with transactional guarantees. Implementing the column store as a secondary index lets it reuse this transaction framework.
- In terms of data encoding, a column store implemented as a secondary index can use exactly the same internal format as the row-store indexes, and data can be copied directly in memory without worrying about charset and collation information, which is completely transparent to the upper executor.
- Secondary index operations are very flexible: the indexed columns can be specified at table creation, and columns can be added to or dropped from a secondary index later via DDL. For example, users can add the int/float/double columns they need to analyze to the column index, while TEXT/BLOB fields, which are usually only point-queried but occupy a lot of space, stay in the row store.
- Crash recovery can reuse InnoDB's redo log module and mesh seamlessly with the existing implementation. It is also convenient to support PolarDB's physical replication, generating column indexes on dedicated RO nodes or Standby nodes to provide analysis services.
- A secondary index also shares the life cycle of its base table, which makes management convenient.
As shown in the figure above, all primary and secondary indexes in PolarDB are implemented as B+trees. A column index is an index by definition, but it is actually a virtual index used to capture inserts, deletes, and updates on the columns it covers.
In the example table above, the primary index contains five columns (C1, C2, C3, C4, C5) and the secondary index contains two columns (C2, C1). In an ordinary secondary index, C2 and C1 are encoded into rows and stored in a B+tree. The column index covers three columns (C2, C3, C4); in physical storage the three columns are split and stored independently, each converted to column format in write order.
Another advantage of implementing the column store as a secondary index is that the executor's implementation is very simple. MySQL already has the concept of a covering index: when all columns a query needs exist in a secondary index, the data in that index alone can satisfy the query, and using the secondary index instead of the primary index greatly reduces the amount of data read. By the same logic, when all columns a query needs are covered by a column index, column storage's acceleration can improve query performance by tens or even hundreds of times.
Column storage data organization
Each column of a column index is stored in an unordered, append-only format; deleted space is reclaimed through a combination of delete marking and background asynchronous compaction. The implementation has the following key points (a structure sketch follows the list):
- Records in a column index are organized into RowGroups, and within each RowGroup the values of each column are packed into a DataPack.
- RowGroups are append-only, as are the DataPacks belonging to each column. For a given column index, only one active RowGroup accepts new writes; when it is full it is frozen, all its DataPacks are compressed and persisted to disk, and statistics for each data block are recorded to support filtering.
- Each row newly written into a RowGroup is assigned a RowID for locating it; all of a row's columns can be located through this RowID. The system also maintains a mapping index to RowIDs to support subsequent delete and update operations.
- Updates are supported through delete marking: the original position is computed from the RowID and a delete flag is set there, then the new data version is written to the active RowGroup.
- When the number of invalidated records in a RowGroup exceeds a threshold, a background asynchronous compaction rewrites it, compacting storage and improving the efficiency of analytical queries.
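The organization described above can be pictured with the following C++ structure sketch; the field names are invented for illustration and do not reflect PolarDB's internal definitions:

```cpp
#include <cstdint>
#include <vector>

// One column's values within a RowGroup, plus the pre-computed statistics
// used by the rough index described later.
struct DataPack {
    std::vector<uint8_t> compressed_values;  // compressed column values
    int64_t min_val, max_val, sum;           // pre-computed statistics
    uint32_t null_count, row_count;          // for rough-index filtering
};

struct RowGroup {
    uint64_t first_row_id;           // RowIDs are allocated sequentially
    std::vector<DataPack> packs;     // one DataPack per covered column
    std::vector<bool> delete_marks;  // set by delete/update (mark delete)
    bool frozen = false;             // full groups are compressed and frozen
};

// Only one active RowGroup accepts new writes; an update marks the old
// RowID deleted and appends the new row version to the active group.
struct ColumnIndex {
    std::vector<RowGroup> frozen_groups;
    RowGroup active_group;
    // plus a mapping index that resolves a row's location to its RowID,
    // supporting subsequent delete and update operations
};
```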
On one hand, this data organization satisfies analytical queries' need to scan and filter in batches by column. On the other hand, its impact on TP transaction operations is very small: writes merely append to memory column by column, deletes merely set a delete flag, and an update is a delete mark plus an appended write. Column storage can thus support transaction-level updates while barely affecting OLTP performance.
Full and incremental row-to-column conversion
Row-to-column conversion happens in two cases. The first is using DDL to create a column index on some columns, which requires scanning the whole table to build the index. The second is converting, in real time, the columns involved in each transaction's writes.
For full-table conversion, we scan InnoDB's primary key index in parallel and convert all involved columns into column format in turn. This is very fast, essentially bounded only by the server's available I/O throughput and CPU resources. It is an online DDL operation and does not block running services.
Once a table has a column index, all update transactions update the row data and column data simultaneously, maintaining transactional consistency. The figure below illustrates the difference with IMCI disabled and enabled. With IMCI disabled, a transaction locks each updated row before modifying data pages, and all locked records are committed together at transaction commit. With IMCI enabled, the transaction system additionally creates a column-store update cache: as data pages are modified, the corresponding column-store changes are recorded in this cache, which is applied to the column-store system before the transaction commits.
Under this implementation, the column store provides the same transaction isolation level as the row store. For every write, the RowGroup records the transaction ID that inserted each row, and every delete mark records the transaction ID that performed the deletion. With these insert and delete transaction IDs, AP queries can obtain a globally consistent snapshot in a very lightweight way (a sketch of the visibility rule follows).
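The visibility rule this implies can be sketched as follows. This is a simplified, illustrative example with invented names; among other things it ignores in-flight transactions, which a real implementation must additionally check against an active-transaction view:

```cpp
#include <cstdint>
#include <limits>

// Each row version in a RowGroup carries the transaction ID that wrote it;
// a delete mark records the transaction ID that deleted it.
struct RowVersion {
    uint64_t insert_trx_id;
    uint64_t delete_trx_id = std::numeric_limits<uint64_t>::max();  // "not deleted"
};

// A row is visible to a snapshot iff it was inserted at or before the
// snapshot point and not yet deleted as of that point.
bool VisibleInSnapshot(const RowVersion& v, uint64_t snapshot_trx_id) {
    return v.insert_trx_id <= snapshot_trx_id &&
           v.delete_trx_id > snapshot_trx_id;
}
```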
Rough indexes for column storage
As the storage format above shows, all DataPacks in IMCI are unordered and append-only, so IMCI cannot precisely filter out non-matching data the way InnoDB's ordinary ordered indexes can. Instead, IMCI uses statistics to filter data blocks and thereby reduce the cost of data access (see the classification sketch after this list).
- When an active DataPack finishes being written, a pre-computation generates the minimum, maximum, and sum of its values, the number of nulls, and the total record count. All of this is kept in the DataPack meta-information area, which resides in memory. Since frozen DataPacks can still have rows deleted, these statistics are updated and maintained in the background.
- For a query, DataPacks are classified by the query conditions into relevant, irrelevant, and possibly relevant categories, reducing actual data block accesses. For some aggregates such as count/sum, the answer can be computed directly from the pre-computed statistics, and those blocks need not even be decompressed.
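For a range predicate, the classification can be sketched like this (illustrative code; a real filter supports more predicate types than the single range shown here):

```cpp
#include <cstdint>

// Rough-index pack classification for a predicate `lo <= col <= hi`,
// driven by the min/max statistics in the DataPack metadata.
enum class PackKind { Irrelevant, Relevant, Suspect };

PackKind ClassifyPack(int64_t pack_min, int64_t pack_max,
                      int64_t lo, int64_t hi) {
    if (pack_max < lo || pack_min > hi)
        return PackKind::Irrelevant;  // skip: no row can match; never decompress
    if (pack_min >= lo && pack_max <= hi)
        return PackKind::Relevant;    // all rows match: count/sum can be
                                      // answered from pre-computed statistics
    return PackKind::Suspect;         // must decompress and check row by row
}
```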
The statistics-based rough index scheme is not very friendly to queries that need to pinpoint a small portion of the data. In a hybrid storage engine, however, the column index only needs to help accelerate queries that scan data extensively, where column storage has clear advantages; for SQL that touches only a small amount of data, the optimizer's cost model usually concludes that a row-store plan is cheaper.
4 TP and AP resource isolation under row-column hybrid storage
PolarDB's hybrid storage supports both AP and TP queries in one instance. However, many businesses have heavy OLTP loads, and a sudden OLAP load may interfere with the response latency of TP business, so load isolation support is a must in an HTAP database. With PolarDB's one-writer, multiple-reader architecture, AP and TP loads are easy to isolate. PolarDB's architecture offers the following deployment modes:
- First, enable row-column hybrid storage on the RW node. This mode supports lightweight AP queries and fits cases where the TP load is heavy and AP requests are few, or where PolarDB serves report queries over data imported in batches.
- Second, the RW node serves OLTP loads while a dedicated AP-type RO node, with row-column hybrid storage enabled, serves queries. In this mode CPU resources are 100% isolated, and all memory on the AP RO node can be devoted to column storage and the executor. But because the same shared storage is used, there is still some mutual I/O interference. To address this, we plan to support writing column data to external storage such as OSS, achieving I/O resource isolation and raising the I/O throughput on AP RO nodes.
- Third, the RW/RO nodes serve OLTP loads, and row-column hybrid storage is enabled on a separate Standby node to serve AP queries. Since Standby nodes use an independent shared storage cluster, this adds I/O resource isolation to the CPU and memory isolation of the second mode.
Beyond the resource isolation provided by these deployment modes, PolarDB supports dynamic parallelism adjustment (Auto DOP) for large queries that need parallel execution. This mechanism weighs the current system load and available CPU and memory resources, and limits how much a single query can consume so that it does not hog resources and affect the handling of other requests. A sketch of such a heuristic follows.
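A hypothetical Auto DOP heuristic in this spirit might look like the following sketch; the real policy also weighs queued tasks, statistics, query complexity, and user-configurable parameters:

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative DOP recommendation (invented policy, not PolarDB's).
int RecommendDOP(uint64_t rows_to_scan, uint64_t rows_per_worker,
                 int idle_cores, int user_max_dop) {
    // Small scans run serially; parallel setup would cost more than it saves.
    if (rows_per_worker == 0 || rows_to_scan < rows_per_worker) return 1;
    int by_data = static_cast<int>(rows_to_scan / rows_per_worker);
    // Never exceed idle CPU capacity or the user-configured ceiling, so one
    // big query cannot starve concurrent TP requests.
    return std::max(1, std::min({by_data, idle_cores, user_max_dop}));
}
```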
OLAP performance of PolarDB IMCI
To verify the effect of IMCI technology, we tested PolarDB MySQL IMCI on the TPC-H benchmark, comparing it in the same setting with both the native MySQL row-store execution engine and ClickHouse, the best-performing standalone OLAP engine. Test parameters in brief:
- Data: TPC-H 100 GB, 22 queries
- CPU: Intel(R) Xeon(R) E5-2682, 2 sockets
- Memory: 512 GB; data is loaded into memory after startup
1 PolarDB IMCI vs. MySQL serial execution
On TPC-H, IMCI processes all 22 queries tens to hundreds of times faster than native MySQL; on Q6 the speedup approaches 400x. This demonstrates IMCI's enormous advantage.
2 PolarDB IMCI vs. ClickHouse
Compared with ClickHouse, the community's most popular analytical database, IMCI's performance on TPC-H is roughly on par, with each system faster on some queries. Users can treat IMCI as a replacement for ClickHouse, and it also makes data management simpler.
Future Work
IMCI is PolarDB's first step toward the data analysis market, and its iteration will not stop. We will continue to research and explore the following directions to bring a better experience to customers:
- Currently, creating and dropping column indexes must be specified manually, which adds work for DBAs. We are researching automatic recommendation technology that creates column indexes automatically based on the characteristics of users' SQL workloads, reducing the maintenance burden.
- For purely analytical scenarios, removing the row store can further shrink storage; with the IMCI executor supporting reads and writes against OSS object storage, storage costs can be minimized.
- Mixed row-column execution, in which part of a SQL statement's plan executes on the row store and part on the column store, for maximum execution acceleration.