The first four chapters of this series took a detailed look at the design ideas behind ClickHouse's storage engine. This chapter gives you a complete overview of the storage model and how it works.

File organization

In most DBMSs, a database is essentially a directory consisting of various subdirectories and files, and ClickHouse is no exception. The default ClickHouse data directory is /var/lib/clickhouse/data, and every database creates a subfolder under it. The following diagram shows how ClickHouse organizes its data files.

ClickHouse directory structure

Each database has its own subdirectory under ClickHouse's data directory, which contains the default and system databases out of the box. default is the default database, and system stores ClickHouse server information such as connection counts and resource usage.

The figure above shows the on-disk layout after importing the database from the official introductory tutorial. As you can see, a tutorial folder was created containing two subdirectories, each of which is a data table. Entering the hits_v1 directory, you see two subdirectories and a text file.

Partition directory

The subdirectories and files inside a table directory have the following meanings:

| Name | Type | Description |
| --- | --- | --- |
| 201403_1_10_2 | Directory | Partition directory; there can be one or more, generated by partitioning plus LSM-style merges |
| detached | Directory | Where table partitions are placed after a DETACH statement |
| format_version.txt | Text file | Plain text recording the storage format |

Partition directory composition

A partition directory name is composed of partition ID_minimum block number_maximum block number_level. In this example, the partition ID is 201403, the minimum block number is 1, the maximum block number is 10, and the level is 2. Block numbers increase from 1, and a newly written data part has equal minimum and maximum block numbers. When parts are merged, the new part takes the minimum and maximum block numbers of the parts that were merged, and each merge increases the level by 1. For more details on merging, see the LSM chapter of this series.
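To make the naming convention concrete, here is a minimal Python sketch (not part of ClickHouse itself, just an illustration of the convention described above) that splits a part directory name into its four components:

```python
# Minimal sketch: split a MergeTree part directory name into its components,
# assuming the PartitionID_MinBlock_MaxBlock_Level convention described above.
def parse_part_name(name: str) -> dict:
    partition_id, min_block, max_block, level = name.rsplit("_", 3)
    return {
        "partition_id": partition_id,
        "min_block": int(min_block),
        "max_block": int(max_block),
        "level": int(level),
    }

print(parse_part_name("201403_1_10_2"))
# {'partition_id': '201403', 'min_block': 1, 'max_block': 10, 'level': 2}
```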

The partition ID is determined by the partition key the user specifies when creating the table. Multiple partition keys can be used, joined with a ‘-’. In this example only one partition key is used: the time field, partitioned by month. The benefits of partitioning are higher concurrency and faster queries that touch only some partitions.

The data directory

Entering a partition directory, you can see the files where the data is actually stored.

columns.txt

This is a text file that stores the table structure and can be opened with a text editor.

count.txt

This is also a text file; it stores the number of rows in the partition and can be opened with a text editor. When a user executes select count() from xxx, ClickHouse essentially returns the contents of this file directly instead of traversing the data, which is why ClickHouse's count() is so fast.
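Conceptually, the fast path amounts to nothing more than reading this file. A tiny sketch, with the path taken from the example partition in this article (adjust it for your own installation):

```python
# Conceptual sketch: count() for a partition is just the number stored in count.txt.
# The path below is the example partition used in this article, not a fixed location.
part_dir = "/var/lib/clickhouse/data/tutorial/hits_v1/201403_1_10_2"

with open(f"{part_dir}/count.txt") as f:
    row_count = int(f.read().strip())

print(row_count)
```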

By way of comparison, consider MySQL and PostgreSQL. The common storage engines of both relational databases do not use ClickHouse's approach; counting requires traversing the data. Can readers explain why MySQL and PG use traversal instead of this simple scheme?

The answer is transactions. When you run select count(*) inside a transaction, the result must reflect the inserts and deletes performed within that transaction. However, transaction B cannot see transaction A's changes until A commits, so the counts observed in transactions A and B may differ. A precomputed value like ClickHouse's cannot satisfy these requirements. ClickHouse, on the other hand, does not need to support transactions, so it can use this relatively simple solution.

primary.idx

Primary key index. For details, see Chapter 4 of this series.

checksums.txt

A binary file of checksums, used to quickly verify that the data has not been tampered with.

default_compression_codec.txt

A file added in newer versions that was not present in older ones. It is a text file recording the compression codec used for the data files. ClickHouse offers a variety of compression algorithms to choose from and uses LZ4 by default.

[column].mrk3

The mark file for the column. For details, see Chapter 4 of this series.

[column].bin

The data file that actually stores the column's data; each column gets its own bin file. The next section takes a closer look at this format.

skp_idx_[column].idx

Skip index, generated only when a secondary (data-skipping) index is defined; otherwise it is not created.

skp_idx_[column].mrk

Skip index mark file, generated only when a secondary index is defined; otherwise it is not created.

This section has detailed the file organization of ClickHouse's storage engine. By breaking a table down into concrete directories and files, readers should now have a feel for ClickHouse's storage system. The next section shows how to read the data files.

Data organization

This section shows how to read a bin file. Because it is a binary file, it cannot be opened with a text editor; you need a hex viewer. On Windows, WinHex is recommended; on macOS, Hex Fiend is recommended.

Data file structure

Data file structure

The figure above shows the structure of a bin file. Bin files are stored in little-endian byte order and arrange their data in blocks. Each block consists of a 16-byte checksum, a 1-byte compression method, a 4-byte compressed size, and a 4-byte uncompressed size, followed by the data itself. The starting address of each block is given by the following formula (a small parsing sketch follows):

  1. offset(n) = offset(n-1) + 25 + compressed_size(n-1), for n ≥ 2
  2. offset(1) = 0
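The Python sketch below walks the blocks of a bin file using exactly the layout and formula described above. It is an illustration, not ClickHouse's own reader, and the file name UserAgent.bin is simply the example used later in this chapter:

```python
import struct

# Walk the blocks of a .bin file using the layout described in this article:
# 16-byte checksum, 1-byte compression method, 4-byte compressed size,
# 4-byte uncompressed size (little-endian), followed by the compressed data.
CODECS = {0x82: "LZ4", 0x90: "ZSTD", 0x91: "Multiple", 0x92: "Delta"}

def walk_blocks(path: str):
    with open(path, "rb") as f:
        data = f.read()
    offset = 0
    while offset + 25 <= len(data):
        checksum = data[offset:offset + 16]
        method = data[offset + 16]
        compressed_size, uncompressed_size = struct.unpack_from("<II", data, offset + 17)
        yield {
            "offset": offset,
            "codec": CODECS.get(method, hex(method)),
            "compressed_size": compressed_size,
            "uncompressed_size": uncompressed_size,
        }
        # offset(n) = offset(n-1) + 25 + compressed_size(n-1), as in the formula above
        offset += 25 + compressed_size

for block in walk_blocks("UserAgent.bin"):
    print(block)
```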

The checksum

The first 16 bytes are a checksum, used to quickly verify data integrity.

Compression method

The default is 0x82. ClickHouse supports four compression methods: LZ4 (0x82), ZSTD (0x90), Multiple (0x91), and Delta (0x92).

Compressed size

The size of the compressed data stored in the data area. The offset of the next block is calculated from this size.

Size before compression

The size of the data in the data area before compression. The compression ratio can be calculated from this.

The data area

The data area holds the compressed data, whose length is given by bytes 18 to 21 of the header (the compressed size). Because this data is compressed, it cannot be read directly; it must first be decompressed with the compression method indicated in the header.
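As an illustration of this step, here is a hedged sketch that reads the first block's header and decompresses its data area, assuming the block uses the LZ4 codec (0x82) and that the third-party lz4 Python package is installed:

```python
import struct
import lz4.block  # third-party package: pip install lz4

# Sketch: read the first block of a .bin file and decompress its data area,
# assuming the header layout described above and the LZ4 codec (0x82).
with open("UserAgent.bin", "rb") as f:
    header = f.read(25)
    method = header[16]
    compressed_size, uncompressed_size = struct.unpack("<II", header[17:25])
    assert method == 0x82, "this sketch only handles the LZ4 codec"
    compressed = f.read(compressed_size)

raw = lz4.block.decompress(compressed, uncompressed_size=uncompressed_size)
print(len(raw))  # should equal the uncompressed size from the header
```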

Example

Below is a screenshot of the UserAgent.bin file of the hits_v1 table (imported by following the official tutorial), opened in a hex viewer with the header fields color-marked.

Reading the header tells us that the block is compressed with the LZ4 algorithm, the compressed size is 0x000049AC, i.e. 18860 bytes, and the uncompressed size is 0x00010000, i.e. 65536 bytes, for a compression ratio of roughly 29% (about 3.5:1). The figure shows the bytes AC 49 00 00, which read as 0x000049AC because ClickHouse uses little-endian byte order.
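The little-endian reading is easy to verify in Python; the hex string below is just the highlighted bytes from the screenshot:

```python
import struct

# The bytes AC 49 00 00 from the screenshot, read as a little-endian UInt32.
compressed_size = struct.unpack("<I", bytes.fromhex("AC490000"))[0]
print(compressed_size, hex(compressed_size))  # 18860 0x49ac
```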

We then copied the 18860 bytes of compressed data starting at the data area and decompressed them with the LZ4 algorithm to recover the original data, as shown in the figure below. The source code of the decompression program has been uploaded to GitHub; readers can visit the repository and try the experiment themselves: cfcz48/lz4compress (github.com).

Decompressed raw data

Next, we consult columns.txt and see that the UserAgent column has type UInt8, i.e. each value is an unsigned 8-bit integer. Combined with the raw data obtained in the previous step, every 8 bits (one byte) is therefore one row of data. For example, 0x04 is 4 in decimal, so the first value in the raw data above is 4. The image below shows the result of decoding the first 20 values of raw data:

Result after decoding the first 20 values
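For readers following along in code, the same decoding is a one-liner; the sketch below reuses the raw buffer produced by the decompression sketch earlier:

```python
# Each UInt8 row is one byte, so the first 20 rows are simply the first 20 bytes.
# `raw` is the decompressed data area from the earlier LZ4 sketch.
first_20 = list(raw[:20])
print(first_20)  # e.g. starts with 4 if the first byte is 0x04
```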

Finally, we use SQL to query the first 20 rows and check that they match:
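One way to script this check, assuming a local server and the tutorial.hits_v1 table from the official tutorial, is via the third-party clickhouse-driver package:

```python
from clickhouse_driver import Client  # third-party package: pip install clickhouse-driver

# Sketch: fetch the first 20 UserAgent values and compare them with the decoded bytes.
# Assumes a local ClickHouse server and the tutorial.hits_v1 table from the official tutorial.
client = Client("localhost")
rows = client.execute("SELECT UserAgent FROM tutorial.hits_v1 LIMIT 20")
print([r[0] for r in rows])
```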

Conclusion

This chapter has explained the file organization of ClickHouse's storage engine in detail and walked through an example to help you understand it. The next chapter describes the data insertion process.
