Abstract: In addition to the superior performance of ClickHouse, GaussDB(for MySQL) HTAP read-only analytics is superior to MaterilizeMySQL in terms of performance and stability, providing faster and more accurate data analysis.
1, the introduction
HTAP (Hybrid Transactional/Analytical Processing) is a term you hear a lot these days. It supports both on-line Transactional Processing (OLTP) and On-Line Analytical Processing (OLAP). Amazingly, ClickHouse, one of the most popular big data analysis systems in recent years, can be mounted as a slave library for MySQL through the MaterializeMySQL engine, serving as a “coprocessor” for MySQL to provide efficient data analysis capabilities for OLAP scenarios. This provides a new way to solve the problem of data sharing between heterogeneous databases. We can leverage ClickHouse’s analytics capabilities with tP-like engines such as MySQL to provide HTAP capabilities. However, ClickHouse still faces some challenges in practical application scenarios, so GaussDB(for MySQL) HTAP read-only analysis comes into being. In addition to the extreme performance of ClickHouse, The HTAP read-only analysis of GaussDB(for MySQL) is superior to the MaterilizeMySQL engine in terms of performance and stability, providing faster and more accurate data analysis.
2, the background,
With the advent of the era of big data, the data volume is increasing rapidly and the user structure is becoming more and more diversified. When these users process data, they find that they need to Extract, Transform and Load data just to create a visual report, and the whole cycle may take days or even weeks. In fact, the advantage of ETL mode is that it can process multi-source data by combining data lakes and so on, process massive data at a low cost and has a relatively perfect ecology. Of course, its disadvantages are also obvious. Traditional data warehouses and data lakes cannot support a large number of real-time concurrent updates, and the timeliness of data analysis is low. In addition, THE ABILITY of ETL mode to cope with changes is relatively weak, such as changes in upstream data sources (such as changes in table structure, etc.), the processing process of the entire data chain needs to be modified accordingly, which increases the difficulty of data maintenance.
How do you pursue real-time analytics? The answer is HTAP. HTAP can support a large number of concurrent updates and the data synchronization delay is usually in the second or millisecond level, effectively avoiding the tedious steps of data extraction, conversion and loading in traditional solutions, greatly improving the timeliness of data processing.
3. Ultimate performance -ClickHouse
ClickHouse
ClickHouse is Yandex’s open source distributed column database for OLAP, with real-time queries, a complete DBMS, efficient data compression, support for batch updates, and high availability. In addition, ClickHouse has very good SQL support and many other features right out of the box. ClickHouse is far ahead of its rivals in the official benchmark comparison.
Row Store & Column Store
MySQL storage uses Row Store. Data in a table is continuously stored in storage media as logical storage units based on Row. This storage mode is suitable for random add, delete, change and search operations, and is friendly to query by line. However, if only a few attributes of a Row are selected as the target of the query, the Row storage method also has to traverse all rows and filter out the target attributes. When the table is very wide (the table has many attributes), the query efficiency is usually low. Although indexes and other optimization schemes can improve efficiency in OLTP application scenarios, they are still inadequate in OLAP scenarios facing massive data background.
ClickHouse uses a Column Store, where data in a table is stored consecutively in a storage medium as a logical storage unit by Column. This storage mode is suitable for concurrent Data processing with Single Instruction Multiple Data (SIMD), which makes up for the defects of RowStore storage mode, especially in large and wide tables (with many attributes), the query efficiency is significantly improved. In addition, the column storage mode has the same adjacent data type, so it is naturally suitable for data compression, thus achieving the ultimate data compression ratio.
​Performance
The following table is the performance test data officially released by Yandex, with a data set of 100 million. The three data from top to bottom are shown as follows: With Cold Cache, Second Round, and Third Round query response times, ClickHouse is far ahead of all the other database engines, with performance more than 600 times that of MySQL.
Note: The following experimental data are single node: 2 * Intel (R) Xeon (R) CPU E5-2650 V2 @ 2.60GHz; 128 GiB RAM; mdRAID-5 on 8 6TB SATA HDD; ext4.
4. GaussDB(forMySQL) HTAP read-only analysis on giant shoulders
Despite ClickHouse’s extreme performance, there are still some challenges in practical production. For example, there are challenges such as unsupported data types and full replication performance issues. In addition, there are some performance issues related to the design of the engine itself, such as the query performance problems caused by FINAL deduplication, which brings some bad experience to the user.
Full parallel replication
The MaterializeMySQL engine subscribes to MySQL data by consuming binlogs. The data synchronization process is divided into three steps, the first is to verify whether the source MySQL parameters meet the specification, and then the full and incremental replication phases. ClickHouse data synchronization’s full replication process is single-threaded, with high replication latency for large volumes of data. GaussDB(forMySQL) HTAP read-only analysis parallelizes full replication and improves replication performance by 8-10 times on average, which is meaningful in actual production practices.
MVCC & Snapshot
The MaterializeMySQL engine adds two hidden fields by default during DDL transformation: _sign (-1 delete,1 insert/update) and _version (data version). DDL for the same table in MySQL and ClickHouse
Create Table: CREATE TABLE `runoob_tbl` ( `runoob_id` int unsigned NOT NULL AUTO_INCREMENT, PRIMARY KEY (`runoob_id`) ) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=utf8 --------------------------------------------------------------- ATTACH TABLE _ UUID '14dbff59-930e-4aa8-9f20-ccfddaf78077' ( `runoob_id` UInt32, `_sign` Int8 MATERIALIZED 1, /// _SIGN field '_VERSION' UInt64 MATERIALIZED 1 /// _VERSION field)Copy the code
The MaterializeMySQL engine does not currently provide a transactional consistency view of MySQL data, rows are synchronized to ClickHouse as bulk inserts, and the underlying engine uses ReplacingMergeTree. If the data has been modified, you need to specify FINAL (similar to GROUP BY) deduplication when retrieving the latest data, and use filters to hide deleted rows. However, when the data is large, the performance of FINAL operations tends to be less than ideal.
To detect transactions, GaussDB(forMySQL) HTAP read-only analysis implements transaction consistency and provides four isolation levels. You can select different isolation levels based on application scenarios. In addition, GaussDB(for MySQL) HTAP read-only analysis also provides a snapshot function to optimize query performance problems caused by FINAL.
Read_uncommitted: MVCC support is not provided and dirty reads may be introduced
Read_committed: Provides MVCC support (including SubQuery) to read the latest committed data
Query_snapshot: avoids the Merge cost in SELECT + FINAL queries and directly queries snapshots
Query_raw: Returns all data (including different deleted and updated versions) without any optimization
FINAL performance optimization
As mentioned above, the underlying layer of MaterializeMySQL uses ReplacingMergeTree. The engine performs Merge operation in the background according to certain rules, and the user must perform FINAL operation to obtain the latest data. In addition to the MVCC + Snapshot mechanism to ensure query performance, GaussDB(for MySQL) HTAP read-only analysis optimizes the FINAL operations of the ReplacingMergeTree engine in terms of indexes and filtering policies. You can provide good query performance without relying on MVCC + Snapshot.
5. GaussDB(for MySQL) HTAP read-only analysis compatibility and stability
Type support enhancement
There is a mapping between the basic data types of MySQL and ClickHouse (see table below), and it is worth noting that ClickHouse converts unsupported MySQL data types to String storage. ClickHouse types not supported by MySQL are also converted to MYSQL_TYPE_STRING. As you can see from the table below, ClickHouse still does not support a number of data types that might be available in real-world applications. Therefore, GaussDB(for MySQL) HTAP read-only analysis ADAPTS to common data types, such as BIT, TIME, and YEAR, to meet some users’ requirements.
Unique Key synchronization support
Currently, the MaterializeMySQL engine only supports synchronization of tables with Primary keys. In the actual production process, some tables may not have Primary keys but contain Unique keys. Therefore, it is necessary to support data synchronization of such tables. GaussDB(forMySQL) HTAP read-only analysis exclusively processes forms that contain only Unique keys (NOT NULL) and partitions them using Unique keys.
Elegant copy-interrupt reconnection
In practice, full data replication is usually large and takes a long time. In this case, ClickHouse aborts the synchronization of the current library and returns an error. In order to improve the stability of data synchronization, GaussDB(for MySQL) HTAP read-only analysis designed a reconnection for the MaterializeMySQL engine, which cleaned up the site when an interruption occurred and reconnected at a certain interval. Incremental replication is based on a BinLog Event. Data that has been incrementally synchronized does not need to be reconnected. After the connection is reestablished, the system finds the latest synchronization point based on the global GTID and starts the synchronization.
Improved exception handling mechanisms
GaussDB(for MySQL) HTAP read-only analysis not only introduces new features such as MVCC +Snapshot and parallel replication, but also incorporates a more complete exception handling mechanism into the kernel. Take full parallel replication as an example. GaussDB(forMySQL) HTAP read-only analysis maintains independent exception handling information and stacks for all parallel threads. With the new exception handling mechanism, GaussDB(forMySQL) HTAP read-only analysis is more stable and helps users find the root cause of potential problems.
6. Customize GaussDB(for MySQL) HTAP read-only analysis
Show Slave Status Supported
GaussDB(for MySQL) HTAP read-only analysis provides users with the SHOW SLAVE STATUS command similar to that between the master and SLAVE MySQL servers. By using this command, you can intuitively obtain the database STATUS synchronized from the MaterializeMySQL engine. The status information not only reflects whether the synchronization thread is abnormal, but also covers the current replication BinLog site, GITD and Second Behind Master and other valuable information, which provides great convenience for user operation and maintenance.
The ALTER Database support
Alter Database provides the following operations for MaterializeMySQL engine users:
ALTER DATABASE db MODIFY SETTING ... ALTER DATABASE DB ADD TABLE OVERRIDE TBL... ALTER DATABASE db MODIFY TABLE Override TBL... ALTER DATABASE DB DROP TABLE Override ALTER DATABASE DB DROP TABLECopy the code
Table definition overrides Override
To provide personalized database construction synchronization, GaussDB(forMySQL) HTAP read-only analysis adds Over Write functionality to the MaterializeMySQL engine. Users can overwrite the columns of a specified table and add new columns. The following is an example of adding an index and overwriting the PARTITION BY or SAMPLE BY field:
CREATE DATABASE test ENGINE=MaterializeMySQL('host:port', 'db', 'user', 'pw') TABLE OVERRIDE table1 (_uint8 MATERIALIZED column) PARTITION BY (UInt8) // Overwrite the partition fieldCopy the code
Adapter MySQL Partition
Data partitioning is one of the most important ways to improve database performance. ClickHouse’s partitioning strategy prioritizes dates, and otherwise hashes and partitions fields with smaller type lengths. As you can see, ClickHouse’s partition policy is different from MySQL’s. In order to support MySQL’s partition policy as much as possible, GaussDB(for MySQL) HTAP read-only analysis currently supports Range partitions. The default ClickHouse partitioning policy is used.
Black/white list filtering
The data synchronization established by the MaterializeMySQL engine is library level, which means that by default, all tables in the library will be replicated. In some practical application scenarios, it is not necessary to replicate all tables. In other words, some tables are NOT suitable for replication (for example, there is no Primary Key or Unique Key that is NOT NULL). GaussDB(for MySQL) HTAP read-only analysis does NOT want to fail to replicate the entire database because some tables cannot be replicated, but can selectively replicate the database. GaussDB(for MySQL) HTAP read-only analysis to solve this problem, the design of black/white list filtering, allowing users to customize the table to be replicated, is very meaningful in production applications, usage reference is as follows:
CREATE DATABASE test ENGINE = MaterializeMySQL('host:port', 'db', 'user', 'pw') SETTINGS black_list='T1,T2Copy the code
7. Scenario Example
The advantages of GaussDB(forMySQL) HTAP read-only analysis have been analyzed. What solution does GaussDB(forMySQL) HTAP read-only analysis provide to solve data problems?
In the preceding figure, MySQL + GaussDB(forMySQL) HTAP read-only analysis is used as an example. Users can enjoy the complete transaction guarantee of MySQL and the extreme analysis performance of GaussDB(forMySQL) HTAP read-only analysis. Users obtain data from different channels and load it to the MySQL engine. GaussDB(for MySQL) HTAP Read-only analysis synchronizes user data in real time and provides efficient data analysis capabilities.
High efficiency
Unlike the traditional ETL (T + 1) solution, GaussDB(for MySQL) HTAP read-only analysis works with MySQL’s HTAP solution to provide second-level data synchronization.
Data compression
GaussDB(for MySQL) HTAP read-only analysis uses Column Store as the underlying storage. This storage format is naturally suitable for data compression. Therefore, GaussDB(for MySQL) HTAP read-only analysis has an extreme data compression ratio. Under the same conditions, users can save a lot of storage costs.
Historical backup
GaussDB(for MySQL) HTAP read-only analysis has lower storage cost than backing up data in MySQL, and is more suitable for backing up historical data in some scenarios.
Storage layer
To further reduce storage costs, GaussDB(forMySQL) HTAP read-only analysis provides ESSD + EVS + OBS layered storage to store hot data on different storage media and cold data on different storage media to further reduce storage costs.
8, summary
HTAP is not a very new concept, but with the current data services becoming more and more fuzzy (AP services tP-oriented, TP services AP), the concept has come back to people’s attention. The constant iteration and upgrading of user demand for data processing and consumption also creates more opportunities for HTAP’s growth. GaussDB(for MySQL) HTAP Read-only Analysis Stands on the shoulder of ClickHouse’s extreme performance and makes a series of optimizations for actual production problems, resulting in a faster and better use experience. It is believed that the competition of GaussDB(for MySQL) HTAP read-only analysis will become increasingly fierce in the future, which is both a challenge and an opportunity for GaussDB(for MySQL) HTAP read-only analysis. GaussDB(for MySQL) HTAP read-only analysis will continue to provide users with efficient solutions for massive data and help enterprises in digital transformation. More products, please stamp www.huaweicloud.com/product/gau…
Note: This feature is currently in the invitation-testing stage. If you want to experience it, please leave a message in the comments section!
Click to follow, the first time to learn about Huawei cloud fresh technology ~