Author: LemonNan
The original address: mp.weixin.qq.com/s/SUUHF9R_F…
Note: please credit the author and include the original address when reposting.
Introduction
ClickHouse is an analytical database that offers many ways to synchronize data with other components. This article uses Kafka as the data source and shows how to synchronize Kafka data into ClickHouse.
Flow chart
Without further ado, here is the data synchronization flow chart:
Creating the tables
Before synchronizing any data, we need to create the corresponding ClickHouse tables. According to the flow chart above, we need three tables:
1. Data table
2. Kafka engine table
3. Materialized view
Data table

```sql
CREATE DATABASE IF NOT EXISTS data_sync;

CREATE TABLE IF NOT EXISTS data_sync.test
(
    name String DEFAULT 'lemonNan' COMMENT 'name',
    age int DEFAULT 18 COMMENT 'age',
    gongzhonghao String DEFAULT 'lemonCode' COMMENT 'gongzhonghao',
    my_time DateTime64(3, 'UTC') COMMENT 'time'
)
ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(my_time)
ORDER BY my_time
```
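ReplacingMergeTree deduplicates rows that share the same ORDER BY key (here my_time) during background merges, keeping the most recently inserted row. A minimal Python sketch of that semantics (an illustration only, not ClickHouse code; the merge_replacing helper and the sample rows are made up for this example):

```python
# Illustrative sketch: ReplacingMergeTree keeps at most one row per sorting
# key after a background merge. With ORDER BY my_time, rows sharing the same
# my_time collapse to the last inserted one.

def merge_replacing(rows, key):
    """Simulate the deduplication a ReplacingMergeTree merge performs."""
    seen = {}
    for row in rows:               # later inserts overwrite earlier ones
        seen[key(row)] = row
    return list(seen.values())

rows = [
    {"name": "lemonNan", "age": 18, "my_time": "2022-03-06 18:00:00.001"},
    {"name": "lemonNan", "age": 20, "my_time": "2022-03-06 18:00:00.001"},
    {"name": "lemonNan", "age": 20, "my_time": "2022-03-06 18:00:00.002"},
]

merged = merge_replacing(rows, key=lambda r: r["my_time"])
print(len(merged))  # two distinct my_time values remain
```

Note that deduplication happens asynchronously at merge time, so duplicates may still be visible in queries until a merge runs (or FINAL is used).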
Kafka engine table

```sql
-- Create the Kafka engine table; broker: 172.16.16.4, topic: lemonCode
CREATE TABLE IF NOT EXISTS data_sync.test_queue
(
    name String,
    age int,
    gongzhonghao String,
    my_time DateTime64(3, 'UTC')
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = '172.16.16.4:9092',
    kafka_topic_list = 'lemonCode',
    kafka_group_name = 'lemonNan',
    kafka_format = 'JSONEachRow',
    kafka_row_delimiter = '\n',
    kafka_schema = '',
    kafka_num_consumers = 1
```
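The kafka_format = 'JSONEachRow' setting means each Kafka message must be a JSON object whose keys match the engine table's column names, one object per row. A quick Python sketch of building such payloads (the to_json_each_row helper is hypothetical, written just for this illustration):

```python
# JSONEachRow: newline-delimited JSON, one object per row, keys matching
# the Kafka engine table's columns (name, age, gongzhonghao, my_time).
import json

def to_json_each_row(records):
    """Serialize records as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records)

payload = to_json_each_row([
    {"name": "lemonNan", "age": 20, "gongzhonghao": "lemonCode",
     "my_time": "2022-03-06 18:00:00.001"},
    {"name": "lemonNan", "age": 20, "gongzhonghao": "lemonCode",
     "my_time": "2022-03-06 18:00:00.002"},
])
print(payload)
```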
Materialized view

```sql
CREATE MATERIALIZED VIEW IF NOT EXISTS test_mv TO test AS
SELECT name, age, gongzhonghao, my_time
FROM test_queue;
```
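Putting the three pieces together: the Kafka engine table behaves like a stream whose rows can be read once, and the materialized view moves each consumed row into the storage table. A toy in-memory model of this pipeline (names mirror the SQL above; the classes and functions here are illustrative, not ClickHouse APIs):

```python
# Toy model of the pipeline: Kafka engine table -> materialized view -> table.

class KafkaEngineTable:
    """Acts like the Kafka engine table: rows can be consumed only once."""
    def __init__(self):
        self._queue = []
    def push(self, row):           # a message arrives from Kafka
        self._queue.append(row)
    def consume(self):             # reading drains the stream
        rows, self._queue = self._queue, []
        return rows

test_queue = KafkaEngineTable()
test = []                          # stands in for the storage table

def materialized_view_tick():
    """Like the MV: SELECT ... FROM test_queue, INSERT INTO test."""
    test.extend(test_queue.consume())

test_queue.push({"name": "lemonNan", "age": 20})
materialized_view_tick()
print(len(test))  # the row landed in the storage table
```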
Data simulation
Next we simulate the flow in the diagram. If you already have Kafka installed, you can skip the installation step.
Installing Kafka
For demonstration purposes, Kafka is installed as a single node.
```shell
# Start Zookeeper
docker run -d --name zookeeper -p 2181:2181 -t wurstmeister/zookeeper

# Start Kafka; the IP in KAFKA_ADVERTISED_LISTENERS is the machine's IP
docker run -d --name kafka -p 9092:9092 \
  -e KAFKA_BROKER_ID=0 \
  -e KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 \
  --link zookeeper \
  -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://172.16.16.4:9092 \
  -e KAFKA_LISTENERS=PLAINTEXT://0.0.0.0:9092 \
  -t wurstmeister/kafka
```
Sending data with the Kafka CLI

```shell
# Start the producer
kafka-console-producer.sh --bootstrap-server 172.16.16.4:9092 --topic lemonCode

# Send the following messages
{"name":"lemonNan","age":20,"gongzhonghao":"lemonCode","my_time":"2022-03-06 18:00:00.001"}
{"name":"lemonNan","age":20,"gongzhonghao":"lemonCode","my_time":"2022-03-06 18:00:00.001"}
{"name":"lemonNan","age":20,"gongzhonghao":"lemonCode","my_time":"2022-03-06 18:00:00.002"}
{"name":"lemonNan","age":20,"gongzhonghao":"lemonCode","my_time":"2022-03-06 23:59:59.002"}
```
Check the ClickHouse data table

```sql
select * from test;
```
At this point, the data has been synchronized from Kafka into ClickHouse. Quite convenient.
About data replicas
The replication engine here is ReplicatedMergeTree (note: the example table above used ReplacingMergeTree, which deduplicates rather than replicates). One reason to use a Replicated* engine is to keep multiple copies of the data and reduce the risk of data loss: with ReplicatedMergeTree, data is automatically synchronized to the other nodes in the same shard.
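As a sketch, a replicated variant of the storage table might look like the following. The ZooKeeper path and the {shard}/{replica} macros are assumptions that depend on your cluster configuration; adjust them to match your setup:

```sql
-- Sketch only: replicated variant of the storage table.
-- The ZooKeeper path and {shard}/{replica} macro values are assumptions.
CREATE TABLE IF NOT EXISTS data_sync.test
(
    name String DEFAULT 'lemonNan' COMMENT 'name',
    age int DEFAULT 18 COMMENT 'age',
    gongzhonghao String DEFAULT 'lemonCode' COMMENT 'gongzhonghao',
    my_time DateTime64(3, 'UTC') COMMENT 'time'
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/data_sync/test', '{replica}')
PARTITION BY toYYYYMM(my_time)
ORDER BY my_time
```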
In practice, there is another way to keep replicas: have each node consume the data with a different Kafka consumer group.
See the figures below for details:
Replica scheme 1
The replication mechanism of ReplicatedMergeTree synchronizes data to the other nodes in the same shard, consuming resources on the consuming node during synchronization.
Replica scheme 2
Messages are broadcast to multiple ClickHouse nodes through Kafka's own consumer-group mechanism, so the synchronization consumes no additional ClickHouse resources.
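The broadcast in scheme 2 relies on Kafka's consumer-group semantics: consumers in the same group split a topic's messages between them, while each distinct group receives every message. Giving each ClickHouse node its own kafka_group_name therefore delivers the full stream to every node. A simplified in-memory model (the deliver function and the round-robin assignment are illustrative, not Kafka's actual partition-assignment protocol):

```python
# Simplified model of Kafka consumer-group delivery:
# - every group sees all messages
# - within a group, messages are split between the consumers

def deliver(messages, groups):
    """Return {group: {consumer: [messages]}} under round-robin assignment."""
    out = {}
    for group, consumers in groups.items():
        out[group] = {c: [] for c in consumers}
        for i, msg in enumerate(messages):      # each group sees all messages
            consumer = consumers[i % len(consumers)]
            out[group][consumer].append(msg)    # split within the group
    return out

msgs = ["m1", "m2", "m3", "m4"]
# two ClickHouse nodes, each with its own consumer group
result = deliver(msgs, {"ch-node-1": ["c1"], "ch-node-2": ["c2"]})
```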
Notes
Things worth knowing about the setup process:
- The address 172.16.16.4 that appears in this article is the machine's internal IP.
- By convention, Kafka engine tables end with "queue" and materialized views end with "mv", which makes them easier to recognize.
Conclusion
This article described how to synchronize data from Kafka to ClickHouse and how to keep multiple replicas. ClickHouse also provides many other integrations, including Hive, MongoDB, S3, SQLite, Kafka, and more. See the link below.
Table engines for integrations: clickhouse.com/docs/zh/eng…
Finally
Scan the QR code below or search LemonCode to exchange and learn together!