This is article 33 of my Stick to Technical Writing project (translations included). The small goal: 999 articles, at least two per week.

A data migration task required importing data from MySQL into ClickHouse. The options are summarized below: three methods supported by ClickHouse itself, plus two third-party tools.

create table with the MySQL engine

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
    INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
    INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
) ENGINE = MySQL('host:port', 'database', 'table', 'user', 'password'[, replace_query, 'on_duplicate_clause']);

The official documentation: clickhouse.yandex/docs/en/ope…

Note that the actual data is still stored in the remote MySQL database; the ClickHouse table can be thought of as a facade over it. You can verify this by adding or deleting rows on the MySQL side and watching the ClickHouse table change accordingly.
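For example, a minimal sketch (host, port, database, table, column names and credentials below are placeholders; the ClickHouse column types must be compatible with the MySQL schema):

-- Hypothetical MySQL-backed table: all names and credentials are placeholders
CREATE TABLE IF NOT EXISTS db.orders_mysql
(
    id     UInt32,
    name   String,
    amount Float64
) ENGINE = MySQL('127.0.0.1:3306', 'shop', 'orders', 'reader', 'password');

-- The query runs in ClickHouse, but the rows are fetched from the remote MySQL table
SELECT count() FROM db.orders_mysql;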

insert into select from

-- Create the table first
CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2],
    ...
) ENGINE = engine
-- Import data
INSERT INTO [db.]table [(c1, c2, c3)]
SELECT c1, c2, c3  -- or SELECT *
FROM mysql('host:port', 'db', 'table_name', 'user', 'password')

You can also convert column types in the SELECT, e.g. select toDate(xx) from mysql('host:port', 'db', 'table_name', 'user_name', 'password').
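Putting it together, a minimal sketch of the full flow (table, columns, host and credentials are placeholders; it assumes a MySQL DATETIME column visit_time that is converted to a ClickHouse Date):

-- Hypothetical target table in ClickHouse
CREATE TABLE IF NOT EXISTS db.visits
(
    user_id    UInt32,
    visit_date Date,
    url        String
) ENGINE = MergeTree()
ORDER BY (visit_date, user_id);

-- Pull the rows from MySQL, converting the type in the SELECT
INSERT INTO db.visits (user_id, visit_date, url)
SELECT user_id, toDate(visit_time), url
FROM mysql('127.0.0.1:3306', 'web', 'visits', 'reader', 'password');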

create table as select from

CREATE TABLE [IF NOT EXISTS] [db.]table_name
ENGINE = Log
AS 
SELECT *
FROM mysql('host:port', 'db', 'article_clientuser_sum', 'user', 'password')

Online article: jackpgao.github.io/2018/02/04/…

Custom columns are not supported, and the blogger cited in the references failed when testing with ENGINE = MergeTree. In those cases, use create table plus insert into select (as shown above) instead.

Altinity/clickhouse-mysql-data-reader

clickhouse-mysql-data-reader is an open-source Python tool from Altinity for migrating data from MySQL to ClickHouse (it supports both full imports and incremental updates via the binlog). Note that the official README is out of sync with the code, so the Quick Start does not work as written.

## Generate the table-creation SQL template
clickhouse-mysql \
    --src-host=127.0.0.1 \
    --src-user=reader \
    --src-password=Qwerty1# \
    --table-templates-with-create-database \
    --src-table=airline.ontime > create_clickhouse_table_template.sql

## Modify the script as needed
vim create_clickhouse_table_template.sql

## Create the table in ClickHouse
clickhouse-client -mn < create_clickhouse_table_template.sql

## Import the data
clickhouse-mysql \
    --src-host=127.0.0.1 \
    --src-user=reader \
    --src-password=Qwerty1# \
    --table-migrate \
    --dst-host=127.0.0.1 \
    --dst-table=logunified \
    --csvpool

Official documentation: github.com/Altinity/cl…

Note that all of the methods above pull data into ClickHouse directly from MySQL, which can put significant load on MySQL when the data volume is large. The following two approaches can work offline (streamsets supports both live and offline modes).

CSV

## Table creation omitted; the target table must already exist
clickhouse-client \
  -h host \
  --query="INSERT INTO [db].table FORMAT CSV" < test.csv

But if the source data is of poor quality, you often run into problems, such as fields containing special characters (delimiters, escape characters) or embedded newlines. These pitfalls are painful to debug.

  • Custom delimiter: --format_csv_delimiter="|"
  • Skip errors instead of aborting: --input_format_allow_errors_num=10 allows up to 10 bad rows, --input_format_allow_errors_ratio=0.1 allows up to 10% bad rows
  • Empty CSV fields in Nullable columns fail to parse, with errors like: Code: 27. DB::Exception: Cannot parse input: expected , before: xxxx: (at row 69) ERROR: garbage after Nullable(Date): "8002<LINE FEED>0205". Replace empty fields ,, with ,\N, first, for example: sed ':a; s/,,/,\\N,/g; ta' test.csv | clickhouse-client -h host --query "INSERT INTO [db].table FORMAT CSV"

python clean_csv.py --src=src.csv --dest=dest.csv --chunksize=50000 --cols --encoding=utf-8 --delimiter=,

For clean_csv.py, see my other article, 032 - CSV file fault tolerance.

streamsets

StreamSets supports full imports from MySQL or from CSV files, as well as incremental inserts by subscribing to the binlog. For installing StreamSets and subscribing to the MySQL binlog, see 025 - Big data ETL tool StreamSets installation and subscribing to MySQL binlog. This article only demonstrates a full import from MySQL into ClickHouse, and assumes the StreamSets service is already set up.

StreamSets talks to ClickHouse over JDBC, so the ClickHouse JDBC driver (and the MySQL connector) need to be added to StreamSets as external libraries. The pom.xml below declares the two drivers so that Maven can download them:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.anjia</groupId>
  <artifactId>demo</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>demo</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
        <groupId>ru.yandex.clickhouse</groupId>
        <artifactId>clickhouse-jdbc</artifactId>
        <version>0.1.54</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.47</version>
  </dependency>
  </dependencies>
</project>

If Maven is installed locally, run mvn dependency:copy-dependencies -DoutputDirectory=lib -DincludeScope=compile; all the required jars will be downloaded and copied into the lib directory.

Then copy the downloaded jars into the external JDBC library directory of StreamSets:
/opt/streamsets-datacollector-3.9.1/streamsets-libs-extras/streamsets-datacollector-jdbc-lib/lib/


References

  • My blog
  • My Juejin (Nuggets) page
  • Building data stream pipelines with CrateDB and StreamSets data collector
  • JDBC Query Consumer
  • Data Flow Pipeline Using StreamSets