An overview of the ClickHouse MergeTree table engine

In ClickHouse, the table engine is responsible for the structured storage of data, query optimization, and so on.

To understand a table engine, we need to look at:

  • How the engine is created and used
  • The data storage structure
  • The principles of data queries

ClickHouse has a very rich table engine architecture, consisting of six main categories: the MergeTree family, external storage, memory, file, interface, and other engines. Because MergeTree offers the most complete feature set, including ALTER-related operations, primary key indexes, data partitioning, data replication, and data sampling, it is the core of ClickHouse. This article mainly summarizes the MergeTree engine and its derived engines.

MergeTree is the base engine of the family; it supports querying, indexing, partitioning, and other core operations.

Create & Use

When MergeTree writes a batch of data, the data is always written to disk as immutable data parts. Parts belonging to the same partition are periodically merged into a new, larger part. This continual merging of data parts is where the name MergeTree comes from.

Creation statement:

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name (
    name1 [type] [DEFAULT|MATERIALIZED|ALIAS expr],
    name2 [type] [DEFAULT|MATERIALIZED|ALIAS expr],
    ...
) ENGINE = MergeTree()
[PARTITION BY expr]
[ORDER BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[SETTINGS name = value, ...]
  • PARTITION BY: the partition key

  • ORDER BY: the sort key

  • Note: the PRIMARY KEY must be a prefix of the sort key

  • SAMPLE BY: the sampling expression

The sampling expression must be used together with the SAMPLE clause in queries, which makes it possible to read only a sampled subset of the data.
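For illustration, here is a minimal sketch of a MergeTree table that uses these clauses, followed by a sampled query; the table and column names are illustrative rather than taken from the text above.

-- Illustrative table: partitioned by month, sorted by (CounterID, EventDate, intHash32(UserID)).
-- The sampling expression must also appear in the sort key.
CREATE TABLE IF NOT EXISTS hits_sample
(
    EventDate Date,
    CounterID UInt32,
    UserID UInt64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate, intHash32(UserID))
SAMPLE BY intHash32(UserID)
SETTINGS index_granularity = 8192;

-- Read an approximate 10% sample of the data.
SELECT count() FROM hits_sample SAMPLE 0.1;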

File structure

MergeTree is built on a purpose-designed on-disk file layout. The core concepts are:

  • Partitions
  • Indexes
  • Compressed data blocks
    • Compression algorithms
    • File structure
  • Data marks

Partitions divide the data into separate directories, and each partition is stored independently. Every partition has its own index files, while the real data is stored by column and compressed in blocks. Data marks record the mapping between index entries and compressed file blocks.

With this combination, ClickHouse can skip as much irrelevant data as possible and read only the useful data blocks for each query, which is the key to ClickHouse's read efficiency.

The directory structure

ClickHouse's on-disk data layout is: ${clickhouse_data}/{db_name}/{table_name}/{partition directory}/{data files}.

Partitions

Partitioned directory structure

Data is organized in partitioned directories, and each partition directory contains files such as indexes and data blocks. The complete storage structure is as follows:

The complete physical structure of a data table is divided into three levels: the data table directory (database/table), the partition directory, and the specific data files under each partition. Here is what they do.

  • The .idx, [Column].mrk, [Column].bin, and other data files are all stored inside a partition directory; data belonging to the same partition is eventually merged into the same partition directory.

  • checksums.txt: the checksums file, stored in binary format. It records the sizes and size hashes of the other files (primary.idx, count.txt, etc.) and is used to quickly verify file integrity and correctness.

  • columns.txt: the column information file, stored in plain text format. It holds the column (field) information for this data partition.

  • count.txt: the count file, stored in plain text format. It records the total number of rows in the current data partition directory.

  • primary.idx: the primary index file, stored in binary format. It stores the sparse index; with its help, data ranges outside the primary key conditions can be excluded during queries.

  • [Column].bin: data files, stored in compressed format (LZ4 by default). Each column has its own .bin data file, named after the column field;

  • [Column].mrk: column data mark files, stored in binary format. The mark file holds the offsets of the data in the .bin file. Mark entries are aligned with the sparse index and correspond one-to-one with the .bin files, so MergeTree uses the mark files to establish the mapping between the primary.idx sparse index and the .bin data files.

  • partition.dat and minmax_[Column].idx: if a partition key is used, such as PARTITION BY EventTime, additional partition.dat and minmax index files are generated, stored in binary format. partition.dat holds the value of the partition expression for the current partition, and the minmax index records the minimum and maximum values of the partition column's original data within the current partition.

  • skp_idx_[Column].idx and skp_idx_[Column].mrk: if secondary indexes are declared in the table statement, additional secondary index and mark files are generated, also stored in binary format. Secondary indexes are also called data-skipping indexes in ClickHouse.

Note: if no partition key is used, that is, no partition expression is declared with PARTITION BY, the partition ID is named all by default, and all data is written into the all partition.

The partition concept still exists, but there is only the one default partition.
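As a hedged sketch (table name and data are illustrative), the default all partition can be observed through the system.parts table:

-- No PARTITION BY clause: every part belongs to the 'all' partition.
CREATE TABLE IF NOT EXISTS events_unpartitioned
(
    id UInt64,
    payload String
)
ENGINE = MergeTree()
ORDER BY id;

INSERT INTO events_unpartitioned VALUES (1, 'a'), (2, 'b');

-- partition_id is 'all' for every part of this table.
SELECT partition_id, name, rows
FROM system.parts
WHERE table = 'events_unpartitioned' AND active;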

Partition directory name: PartitionID_MinBlockNum_MaxBlockNum_Level

PartitionID: the partition ID, derived from the value of the partition key (for example, 201905 when partitioning by month).

MinBlockNum and MaxBlockNum: as the names suggest, the minimum and maximum data block numbers.

The block number n accumulates globally within a single MergeTree table, starting at 1 and increasing by 1 each time a new partition directory is created. For a newly written partition directory, MinBlockNum and MaxBlockNum have the same value, which is equal to n.

Level: the current merge level, that is, the number of times the partition has been merged; it acts as a kind of version number.

For example: 201905_1_2_0
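As a hedged sketch of how these names come about (the table and values are illustrative), each INSERT produces a new part whose name can be inspected in system.parts; the exact block numbers depend on the insert history:

CREATE TABLE IF NOT EXISTS visits
(
    EventDate Date,
    UserID UInt64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(EventDate)
ORDER BY UserID;

-- Two separate INSERTs into the same month create two parts,
-- e.g. 201905_1_1_0 and 201905_2_2_0.
INSERT INTO visits VALUES ('2019-05-01', 1);
INSERT INTO visits VALUES ('2019-05-02', 2);

SELECT partition, name, active
FROM system.parts
WHERE table = 'visits';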

The process of merging partitions

With each batch of data written (each INSERT statement), MergeTree generates a new batch of partition directories. Even if data written in different batches belongs to the same partition, different partition directories are generated.

At some later point (10 to 15 minutes after the write; an OPTIMIZE query can also be executed manually), ClickHouse runs a background task that merges multiple directories belonging to the same partition into a new directory. The old partition directories are not deleted immediately, but are removed later by another background task (after 8 minutes by default). However, the old partition directories are no longer active (active = 0), so they are automatically filtered out during queries.

Manually trigger merge operations:

OPTIMIZE TABLE <table_name> FINAL;

Merged partition directory:

PartitionID_{minimum MinBlockNum among the merged directories}_{maximum MaxBlockNum among the merged directories}_{maximum Level among the merged directories + 1}

For example, 201905_1_2_1 and 201905_5_5_0 are merged to 201905_1_5_2
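Continuing the illustrative visits table from above, a hedged sketch of forcing a merge and observing the resulting part names and active flags:

-- Force a merge of the parts within each partition.
OPTIMIZE TABLE visits FINAL;

-- The merged part (e.g. 201905_1_2_1) is active; the pre-merge parts remain
-- on disk for a while with active = 0 before a background task removes them.
SELECT partition, name, active
FROM system.parts
WHERE table = 'visits'
ORDER BY name;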

The partition name changes are shown in the following figure:

The index

The primary index

Primary key and deduplication rules:

The primary key used by the index must be a prefix of the ORDER BY sort key.

Deduplication is performed on the ORDER BY key: for rows with the same key, only the latest one is kept (this applies to variants such as ReplacingMergeTree). By default, the PRIMARY KEY is the same as the sort key.

Once the PRIMARY KEY is defined, MergeTree generates a level-1 (primary) index for the table at index_granularity intervals (8192 rows by default). The index entries are sorted by the primary key.

The data is sorted first and then divided into granules of the specified number of rows; the index file is stored separately from the data files.
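A hedged sketch (names are illustrative): the granularity can be set explicitly, and a query that filters on a prefix of the primary key lets MergeTree skip the granules whose key range cannot match:

CREATE TABLE IF NOT EXISTS orders
(
    OrderDate Date,
    CustomerID UInt64,
    Amount Float64
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(OrderDate)
ORDER BY (CustomerID, OrderDate)     -- sort key; PRIMARY KEY defaults to the same
SETTINGS index_granularity = 8192;   -- one sparse index entry per 8192 rows

-- CustomerID is a prefix of the primary key, so the sparse index can exclude
-- granules whose key range does not contain 42.
SELECT sum(Amount) FROM orders WHERE CustomerID = 42;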

Secondary indexes

By analogy with the primary index, a secondary index aggregates the values of the indexed expression over multiple index granules and stores them as a single index entry.

A MarkRange corresponds to a range of index marks and uses its start and end attributes to describe the range. From the index values at the start and end marks, the corresponding key range can be obtained, which represents the range of data covered by that MarkRange.
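A hedged sketch of declaring a secondary (data-skipping) index; the index name, expression, type, and granularity are illustrative:

CREATE TABLE IF NOT EXISTS requests
(
    Timestamp DateTime,
    Duration UInt32,
    URL String,
    -- minmax skip index over Duration; one index entry aggregates 4 primary-index granules.
    -- Declaring it produces additional skp_idx_*.idx and skp_idx_*.mrk files in each part.
    INDEX idx_duration Duration TYPE minmax GRANULARITY 4
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(Timestamp)
ORDER BY Timestamp;

-- Granule ranges whose recorded min/max of Duration excludes values > 1000 are skipped.
SELECT count() FROM requests WHERE Duration > 1000;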

Compressed data block

Core concepts:

  • Compression algorithms
  • File structure
  1. Data is compressed. Currently the LZ4, ZSTD, Multiple, and Delta algorithms are supported; LZ4 is used by default (see the sketch after this list).
  2. The data is sorted in advance according to the ORDER BY declaration;
  3. Data is organized into compressed data blocks and written to .bin files.
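As a hedged illustration of the compression choices above, the codec can also be overridden per column in the DDL (column names and codecs are illustrative):

CREATE TABLE IF NOT EXISTS metrics
(
    Timestamp DateTime CODEC(Delta, ZSTD),  -- chained ("Multiple") codecs: Delta encoding, then ZSTD
    Value Float64 CODEC(ZSTD(3)),           -- ZSTD at compression level 3
    Raw String                              -- no CODEC clause: the default LZ4 is used
)
ENGINE = MergeTree()
ORDER BY Timestamp;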
The structure of a compressed block

A compressed data block consists of a header and the compressed data. The header is 9 bytes: one UInt8 (1 byte) integer and two UInt32 (4 bytes each) integers, representing the compression algorithm used, the size of the compressed data, and the size of the data before compression, respectively.

For example: 0x82 | 12000 | 65536

0x82 is the encoding of the compression algorithm: LZ4

12000: the size of the compressed data, in bytes

65536: the size of the data before compression, in bytes

Rules for forming compressed data blocks (each batch covers 8192 rows):
  • If the batch size is < 64KB, wait for the next batch of data to accumulate (another 8192 rows);
  • If 64KB <= size <= 1MB, a compressed block is generated directly;
  • If the size is larger than 1MB, the data is first truncated at 1MB to generate a compressed block, and the remaining data is processed according to the preceding rules.
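A hedged way to see the effect of compression per column (reusing the illustrative metrics table above, and assuming it already contains data):

-- Compare the on-disk compressed size with the uncompressed size for each column.
SELECT
    name,
    data_compressed_bytes,
    data_uncompressed_bytes,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'metrics';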

Data tag

Data mark (.mrk) files are used to match index files and data block files.

Each row of mark data is a tuple containing two integer offsets: the starting offset of the corresponding compressed data block within the .bin file, and the starting offset of the data within that block once it has been decompressed.
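As a hedged sketch, the number of marks per part can be inspected through system.parts (the visits table from the earlier examples is reused here):

-- 'marks' counts the mark entries of the part, one per index granule.
SELECT name, rows, marks, marks_bytes
FROM system.parts
WHERE table = 'visits' AND active;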

Implementation process

Having looked at how the data is stored, this section walks through how MergeTree quickly locates and reads data during writes and queries.

Writing process

  1. Generate a partition directory. Each batch of written data generates a new partition directory; at a later point in time, directories belonging to the same partition are merged according to the rules described above.
  2. Generate the primary.idx level-1 index according to index_granularity (if secondary indexes are declared, secondary index files are also created);
  3. Generate a .mrk data mark file for each column;
  4. Generate a .bin compressed data file (binary) for each column.

Index and mark intervals are aligned, while marks and compressed blocks have many-to-one, one-to-one, or one-to-many relationships depending on the size of the data in each interval. The relationship is determined as follows:

  • Many-to-one: size < 64KB
  • One-to-one: 64KB <= size <= 1MB
  • One-to-many: size > 1MB

The query process

The nature of a data query can be seen as a process of narrowing the data range. In the best case, MergeTree can first minimize the range of data scans with partitioned indexes, primary indexes, and secondary indexes in turn. Then, with the help of data tags, minimize the range of data that needs to be decompressed and computed.
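As a hedged sketch (recent ClickHouse versions support EXPLAIN with the indexes = 1 setting; the illustrative orders table from the primary index section is reused), the pruning steps can be observed like this:

-- Shows which partitions, primary-key ranges, and skip indexes were used
-- to narrow the scan down to a subset of parts and granules.
EXPLAIN indexes = 1
SELECT sum(Amount)
FROM orders
WHERE OrderDate >= '2019-05-01' AND CustomerID = 42;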

MergeTree derived engine

In addition to the base MergeTree table engine, the MergeTree family includes a number of derived engines.

Single-replica derived engines

Commonly used derived table engines include ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree, and VersionedCollapsingMergeTree. Each MergeTree variant inherits the capabilities of the base MergeTree but adds its own unique feature; the derived engines retain MergeTree's compressed data blocks, columnar storage, data marks, and other characteristics.

All of their special logic is activated only when a merge is triggered.

For example:

  • ReplacingMergeTree: during a merge, only one row is kept for each set of rows with the same key value;

  • SummingMergeTree: rows with the same key are automatically summed (see the sketch below);
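A hedged sketch of these two variants (table and column names are illustrative); the deduplication and summing only take effect when parts are merged, which OPTIMIZE ... FINAL forces for demonstration:

-- ReplacingMergeTree: after a merge, only the latest row per sort key is kept
-- (the ver column decides which row wins).
CREATE TABLE IF NOT EXISTS user_state
(
    UserID UInt64,
    State String,
    ver UInt64
)
ENGINE = ReplacingMergeTree(ver)
ORDER BY UserID;

-- SummingMergeTree: during a merge, the Hits column is summed for rows
-- that share the same sort key.
CREATE TABLE IF NOT EXISTS daily_hits
(
    EventDate Date,
    PageID UInt64,
    Hits UInt64
)
ENGINE = SummingMergeTree(Hits)
ORDER BY (EventDate, PageID);

OPTIMIZE TABLE user_state FINAL;
OPTIMIZE TABLE daily_hits FINAL;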

The derived inheritance relationship is as follows:

Multi-replica derived engines

ReplicatedMergeTree family: if you add the Replicated prefix to a table engine in the MergeTree family, you get a table engine that supports data replicas, such as ReplicatedMergeTree. Its main function is to replicate the table's data automatically across replicas. The relationship between distributed tables, data replicas, and the various MergeTree derived engines is shown in the figure below:
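A hedged sketch of a replicated table; the ZooKeeper/Keeper path and the {shard} and {replica} macros are assumptions that depend on the cluster configuration:

-- Each replica registers itself under the given coordination path; data written
-- to one replica is replicated to the others automatically.
CREATE TABLE IF NOT EXISTS hits_replicated
(
    EventDate Date,
    CounterID UInt32,
    UserID UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/hits_replicated', '{replica}')
PARTITION BY toYYYYMM(EventDate)
ORDER BY (CounterID, EventDate);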

Other engines

Other types of engines, including Kafka, HDFS, MySQL, and so on, are not covered here.

Note: this post is a compilation of notes on the MergeTree engine chapters of ClickHouse Principle Analysis and Application Practice.