Author: TCeason, database development engineer

MySQL[1] has been the most popular OLTP (Online Transaction Processing) database in the world since 2000, and ClickHouse[2] has been the most popular OLAP (Online Analytical Processing) database in recent years. So is there a spark between the two?

This article will take you through the process of breaking down heterogeneous database barriers and synchronizing MySQL data to ClickHouse.

Background

1. The development of MySQL replication

Figure 1-1 details the evolution of MySQL replication.

In 2001, MySQL 3.23 already supported asynchronous replication between homogeneous databases. Because the replication was asynchronous, it could not be widely used in real production environments.

In 2013, MySQL 5.7.2 added enhanced semi-synchronous replication, making it an enterprise-grade data synchronization solution.

In 2016, MySQL 5.7.17 introduced MGR (MySQL Group Replication), which has continued to mature into a financial-grade data synchronization solution.

For homogeneous MySQL data synchronization, what remains is to keep refining the experience: improving synchronization timeliness and handling the various problems that arise under network exceptions.

On this basis, vendors have started building their own high-availability synchronization components. For example, Xenon, developed and open-sourced by the QingCloud database R&D team, delivers genuinely strong consistency and high availability.

2. MySQL + Xenon

Xenon, shown in Figure 1-2, is a high-availability component that uses a Raft-like algorithm to manage MySQL elections and health probes and to keep data correct. Data synchronization itself still relies on semi-sync replication or MGR, providing strong data consistency, decentralized automatic primary election, second-level failover, and cloud-based cross-region disaster recovery.

ClickHouse synchronizes MySQL data

To speed up OLAP queries, QingCloud MySQL Plus[3] (MySQL + Xenon) leverages ClickHouse to synchronize MySQL data.

1. ClickHouse Overview

ClickHouse is a column-oriented database management system (DBMS) for online analytical processing (OLAP). It was originally developed starting in 2008 for Yandex.Metrica, the world's second largest web analytics platform, served as a core component of that system for many years, and was open-sourced in 2016.

The latest DB-Engines ranking shows a steadily rising curve, and major companies are clearly deploying ClickHouse widely in important businesses. So it seems safe to assume that ClickHouse's popularity isn't just a temporary phenomenon; it's here to stay. Moreover, ClickHouse's flexible external table engines make it easy to synchronize data with MySQL, as we'll see next.

2. MySQL Table Engine

Features of the MySQL Table Engine:

  • Mapping to MySQL table
  • Fetch table struct from MySQL
  • Fetch data from MySQL when executing query

ClickHouse initially supported table-level synchronization of MySQL data through the external MySQL Table Engine, which maps to a MySQL table. It obtains the corresponding table structure from information_schema and converts it to data types supported by ClickHouse, at which point the table structure exists on the ClickHouse side. No data is actually synchronized yet, though: only when the table is queried in ClickHouse is the data pulled from the MySQL table.
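As a minimal sketch (the host, account, and table here are the same placeholder values used later in this article, and the column list is assumed to match the MySQL table), mapping a single MySQL table looks roughly like this:

-- ClickHouse side: map the MySQL table demo.t without copying any data.
CREATE TABLE mysql_t
(
    `id` Int32,
    `col_1` Nullable(String)
)
ENGINE = MySQL('172.17.0.3:3306', 'demo', 't', 'root', '123');

-- The data stays in MySQL; this query pulls it over the network at execution time.
SELECT count() FROM mysql_t;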

The MySQL Table Engine is fairly rudimentary to use, but it matters: it was the first data channel connecting ClickHouse and MySQL. Its disadvantages, however, are obvious:

1. It is only a table-by-table mapping of MySQL table relationships.

2. Transferring MySQL data to ClickHouse at query time can put unpredictable network and read pressure on MySQL, which may affect its normal use in production.

To address the first shortcoming, that the MySQL Table Engine can only map one MySQL table relationship at a time, the QingCloud ClickHouse team implemented the MySQL Database Engine.

3. MySQL Database Engine

Features of the MySQL Database Engine:

  • Mapping to MySQL Database
  • Fetch table list from MySQL
  • Fetch table struct from MySQL
  • Fetch data from MySQL when executing query

The MySQL Database Engine is a database-level mapping. It pulls the structure of all the tables in the MySQL database to be synchronized from information_schema, which solves the problem of having to create each table individually. However, it shares the other drawback of the MySQL Table Engine: transferring MySQL data to ClickHouse at query time can put unpredictable network and read pressure on MySQL, which may affect its normal use in production.
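A minimal sketch, reusing the same placeholder connection details: one statement maps the whole MySQL database, and every table in it becomes queryable without being declared by hand.

-- ClickHouse side: map the entire MySQL database 'demo'.
CREATE DATABASE mysql_demo
ENGINE = MySQL('172.17.0.3:3306', 'demo', 'root', '123');

SHOW TABLES FROM mysql_demo;            -- table list is fetched from MySQL
SELECT * FROM mysql_demo.t LIMIT 10;    -- data is still pulled from MySQL at query time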

4. Synchronizing via third-party software

In addition to the MySQL Table Engine and MySQL Database Engine described above, third-party software such as Canal or Kafka can be used to capture the data, with a custom program then writing it into ClickHouse. The big advantage is that the synchronization process is fully under your control, but it also introduces problems:

1. It increases the complexity of data synchronization.

2. It adds third-party software, which greatly increases the operations and maintenance burden.

With this in mind, can ClickHouse actively synchronize and subscribe to MySQL data?

MaterializeMySQL

To solve the remaining problems of the MySQL Database Engine and let ClickHouse proactively synchronize and subscribe to MySQL data, the QingCloud ClickHouse team developed the MaterializeMySQL[4] engine.

1. MaterializeMySQL in brief

The MaterializeMySQL engine is a database engine developed by the QingCloud ClickHouse team and is currently included as an experimental feature in the ClickHouse 20.8 release. It maps a MySQL database at the database level and subscribes to MySQL data by consuming the binlog and storing the data in MergeTree tables.

A simple example of the CREATE DATABASE SQL:

CREATE DATABASE test ENGINE = MaterializeMySQL(
  '172.17.0.3:3306', 'demo', 'root', '123'
)

-- 172.17.0.3:3306 : MySQL address and port
-- demo            : MySQL database name
-- root            : MySQL synchronization account
-- 123             : password of the MySQL synchronization account
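The synchronization account also needs replication- and dump-related privileges on the MySQL side. A sketch of the grants we would typically prepare (the user name here is hypothetical; the example above simply uses root):

-- On the source MySQL (a sketch; adjust host and password to your environment).
CREATE USER 'ch_sync'@'%' IDENTIFIED BY '123';
GRANT SELECT, RELOAD, SHOW DATABASES, REPLICATION SLAVE, REPLICATION CLIENT
    ON *.* TO 'ch_sync'@'%';
FLUSH PRIVILEGES;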

2. Design idea of MaterializeMySQL

  • Check MySQL Vars
  • Select history data
  • Consume new data

The design of MaterializeMySQL is as follows:

  1. First, check whether the source MySQL parameters meet the requirements;
  2. then split the data into historical data and incremental data by GTID;
  3. synchronize the historical data up to that GTID point;
  4. finally, keep consuming the incremental data from that point on (a rough sketch of the GTID split follows this list).
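Conceptually, the split can be pictured with a few MySQL statements (a rough sketch, not the engine's literal queries): record a consistent GTID point, treat everything before it as history to dump, and everything after it as binlog to consume.

-- On MySQL: take a consistent snapshot position.
FLUSH TABLES WITH READ LOCK;        -- briefly freeze writes
SELECT @@GLOBAL.gtid_executed;      -- history = all data up to this GTID set
UNLOCK TABLES;                      -- incremental data = binlog events after that GTID set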

3. The function flow of MaterializeMySQL

As shown in Figure 3-1, the main process of the MaterializeMySQL function is as follows:

CheckMySQLVars -> prepareSynchronized -> Synchronized

(1) CheckMySQLVars

Checking parameters is a simple matter of querying whether the parameters are as expected.

SHOW VARIABLES WHERE (Variable_name = 'log_bin'
AND upper(Value) = 'ON')
OR (Variable_name = 'binlog_format'
AND upper(Value) = 'ROW')
OR (Variable_name = 'binlog_row_image'
AND upper(Value) = 'FULL')
OR (Variable_name = 'default_authentication_plugin'
AND upper(Value) = 'MYSQL_NATIVE_PASSWORD')
OR (Variable_name = 'log_bin_use_v1_row_events'
AND upper(Value) = 'OFF');
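Since synchronization is split by GTID, it is also worth confirming on the MySQL side that GTID mode is on. A quick check we might run before creating the MaterializeMySQL database (a sketch; the variable list mirrors the engine's own check above, plus the GTID settings):

SHOW VARIABLES WHERE Variable_name IN
    ('log_bin', 'binlog_format', 'binlog_row_image',
     'default_authentication_plugin', 'log_bin_use_v1_row_events',
     'gtid_mode', 'enforce_gtid_consistency');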

(2) prepareSynchronized

This step implements the historical data pull:

  • Initialize the GTID information;
  • Clean up the tables under the ClickHouse MaterializeMySQL database to keep each resynchronization idempotent;
  • Pull the historical data and rewrite the MySQL table structures in ClickHouse;
  • Establish the binlog transfer channel with MySQL.
std::optional<MaterializeMetadata> MaterializeMySQLSyncThread::prepareSynchronized()
{
    connection = pool.get();

    // Reads the current binlog file / position / GTID set from MySQL and the
    // list of tables that still need a full dump.
    MaterializeMetadata metadata(
        connection,
        DatabaseCatalog::instance().getDatabase(database_name)->getMetadataPath() + "/.metadata",
        mysql_database_name, opened_transaction);

    if (!metadata.need_dumping_tables.empty())
    {
        Position position;
        position.update(metadata.binlog_position, metadata.binlog_file, metadata.executed_gtid_set);

        metadata.transaction(position, [&]()
        {
            // Drop previously synchronized tables so each resynchronization is idempotent,
            // then dump the historical data up to the recorded position.
            cleanOutdatedTables(database_name, global_context);
            dumpDataForTables(
                connection, metadata, query_prefix, database_name, mysql_database_name,
                global_context, [this] { return isCancelled(); });
        });
    }

    connection->query("COMMIT").execute();
}

In MySQL, the demo database has a table t with a primary key id and an ordinary column col_1.

CREATE TABLE demo.t (
  id int(11) NOT NULL,
  col_1 varchar(20) DEFAULT NULL,
  PRIMARY KEY (id)
) ENGINE = InnoDB;

In ClickHouse, id remains the key column, but two hidden columns are added: _sign and _version.

1. _sign: the value can only be 1 or -1; 1 means the row exists, and -1 means the row has been deleted.

2. _version: only the row with the highest version is read. Rows with the same primary key are continuously merged in the background, and the row with the highest version is retained.

CREATE TABLE test.t
(
    `id` Int32,
    `col_1` Nullable(String),
    `_sign` Int8 MATERIALIZED 1,
    `_version` UInt64 MATERIALIZED 1
)
ENGINE = ReplacingMergeTree(_version)
PARTITION BY intDiv(id, 4294967)
ORDER BY tuple(id)
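When you query through the MaterializeMySQL database, ClickHouse applies this version-and-sign filtering for you; written out by hand against the underlying ReplacingMergeTree, a read effectively resolves to something like the sketch below.

-- Collapse to the latest version of each row and hide rows deleted in MySQL.
SELECT id, col_1
FROM test.t FINAL
WHERE _sign > 0;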

(3) Synchronized

In prepareSynchronized, we obtained the historical data along with its GTID position and established the binlog transfer channel with MySQL. The next step is to synchronize the incremental data from that position. The readOneBinlogEvent function reads each binlog event, onEvent converts it into ClickHouse's internal format and buffers it, and finally flushBuffersData flushes the buffered data to storage so it is not lost.

client.connect();
client.startBinlogDumpGTID(
    randomNumber(), mysql_database_name, metadata.executed_gtid_set, metadata.binlog_checksum);

Buffers buffers(database_name);

while (!isCancelled())
{
    // Wait (up to the remaining flush interval) for the next binlog event.
    BinlogEventPtr binlog_event = client.readOneBinlogEvent(
        std::max(UInt64(1), max_flush_time - watch.elapsedMilliseconds()));

    // Convert the MySQL event into buffered ClickHouse data.
    if (binlog_event)
        onEvent(buffers, binlog_event, *metadata);

    // Flush the buffered data so it is not lost.
    if (!buffers.data.empty())
        flushBuffersData(buffers, *metadata);
}
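To see how these buffered events behave once written, here is a standalone sketch (a hand-built table named sketch_t, not one of the engine's actual tables) of how an INSERT, UPDATE, and DELETE arriving from the binlog land in ReplacingMergeTree:

CREATE TABLE sketch_t
(
    `id` Int32,
    `col_1` Nullable(String),
    `_sign` Int8,
    `_version` UInt64
)
ENGINE = ReplacingMergeTree(_version)
ORDER BY id;

INSERT INTO sketch_t VALUES (1, 'a',  1, 1);   -- MySQL INSERT
INSERT INTO sketch_t VALUES (1, 'a2', 1, 2);   -- MySQL UPDATE of id = 1
INSERT INTO sketch_t VALUES (1, 'a2', -1, 3);  -- MySQL DELETE of id = 1

-- After collapsing by version and filtering on _sign, the row is gone,
-- exactly as it is in MySQL.
SELECT * FROM sketch_t FINAL WHERE _sign > 0;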

HTAP application scenarios

Now that the replication channel between MySQL and ClickHouse is open, and given ClickHouse's impressive analytical capabilities, can we implement HTAP with MySQL + ClickHouse?

In the architecture shown in Figure 4-1, the high-availability component Xenon still manages MySQL replication, while also monitoring ClickHouse, which synchronizes MySQL data through MaterializeMySQL.

In the previous architecture, MySQL read-only instances handled business analysis, user profiling, and other analytical workloads. Now ClickHouse can join MySQL replication directly as the analysis instance, replacing some of the read-only instances used for analytical computation. ClickHouse itself provides a large number of analytical functions and supports standard SQL, making for a great user experience.

ClickHouse currently supports synchronizing data from MySQL 5.7 and 8.0, but not MySQL 5.6. As an experimental feature, MaterializeMySQL is roughly where MySQL replication was back in 2001, when it had only just been introduced. Everyone is welcome to help contribute to and maintain MaterializeMySQL.

[1]. MySQL: www.mysql.com/

[2]. ClickHouse: clickhouse.tech/docs/en/

[3]. MySQL Plus

[4]. MaterializeMySQL: clickhouse.tech/docs/en/eng…

About RadonDB

The RadonDB open source community is an open source community for cloud-oriented, containerized databases. It provides database technology enthusiasts with a platform for sharing technology around mainstream open source databases (MySQL, PostgreSQL, Redis, MongoDB, ClickHouse, etc.), and offers enterprise-grade RadonDB open source products and services.

The RadonDB open source database series is currently used by thousands of enterprise and community users, including Everbright Bank, SPD Silicon Valley Bank, Hami, Taikang Life Insurance, Taiping Insurance, AXA, Sunshine Life, Anji Logistics, AnChang Logistics, Blue Moon, TianCai ShangLong, Luokejia China, Zhehuirun Sports Technology, Wuxi and Beijing Telecom, Jiangsu Communications Holding, Sichuan Airlines, and Kunming Airlines.

RadonDB can be delivered on cloud platforms and on the Kubernetes container platform. It provides database product solutions covering multiple scenarios, along with professional cluster management and automated operations capabilities. Its main features include high-availability (HA) primary/secondary switchover, strong data consistency, read/write separation, one-click installation and deployment, multi-dimensional metric monitoring and alerting, flexible scaling up and down, free horizontal scaling, automatic backup and recovery, multi-active deployment within a city, and remote disaster recovery. With RadonDB, enterprise and community users only need to focus on business logic development, without worrying about complex issues such as cluster selection, management, and operations, greatly improving the efficiency of business development and value innovation.

GitHub:

github.com/radondb

This article describes how the MaterializeMySQL engine implements MySQL data synchronization to ClickHouse.