In section 3, you saw how ClickHouse compresses data in blocks and then writes it to data files on disk. This reduces both the data size and the disk I/O time. However, without an index, every query still has to read all of the data, which costs a lot of disk I/O even though compression has reduced the data volume by a factor of 6.2. This is where the index comes in: it further reduces the amount of data that has to be read at query time.

Before introducing ClickHouse’s index, let’s review the B+ tree, the indexing technique commonly used in relational databases such as MySQL. The B+ tree algorithm itself is beyond the scope of this article and will not be discussed in depth here; we mainly analyze why MySQL uses a B+ tree and what a B+ tree essentially is. A B+ tree is, in essence, an N-ary tree whose leaf nodes hold the index values in sorted order. A query can therefore locate data quickly by walking the tree, and because the leaves are ordered, it also supports range searches. The figure below shows a B+ tree.

B+ tree diagram

Now that you understand the nature of B+ trees, try to answer this question: should ClickHouse use a B+ tree for its index? Why or why not?

If your answer is no, then you already understand both ClickHouse and the MySQL storage engine well. If your answer is yes, or you are not sure, don’t worry. Here’s why.

The answer is no, because ClickHouse’s storage engine is designed very differently from MySQL’s. MySQL supports transactions through an MVCC-based concurrency control mechanism, which leads to a problem: the order in which rows are inserted differs from the order of the index. For example, if I index the age column and insert the values 44, 21, 20, 33, 19, the B+ tree has to rearrange them as 19, 20, 21, 33, 44. This is the mismatch between insertion order and index order: in a relational database, the order in which rows are inserted is the order in which the storage engine writes them to the data file.

In the ClickHouse storage engine, described in detail in Chapter 3 and the supplementary articles, the LSM mechanism ensures that data reaches the data file in primary key order. Even if the insertion order is 44, 21, 20, 33, 19, ClickHouse’s LSM mechanism writes this data to the data file as 19, 20, 21, 33, 44. In other words, ClickHouse already stores data in sorted order, so there is no need for a B+ tree to sort it again.
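To make this concrete, here is a minimal Python sketch of the sort-before-flush idea behind LSM-style writes. It is purely illustrative (the class and method names are made up), not ClickHouse’s actual implementation.

```python
# Minimal sketch of the "sort before flush" idea behind LSM-style writes.
# Illustrative only -- not ClickHouse's real code.

class MemBuffer:
    def __init__(self):
        self.rows = []          # rows buffered in insertion order

    def insert(self, row):
        self.rows.append(row)   # inserts arrive in arbitrary key order

    def flush(self, key):
        # Sort the batch by primary key, then emit it as one ordered part.
        part = sorted(self.rows, key=key)
        self.rows = []
        return part

buf = MemBuffer()
for age in [44, 21, 20, 33, 19]:
    buf.insert({"age": age})

print(buf.flush(key=lambda r: r["age"]))
# [{'age': 19}, {'age': 20}, {'age': 21}, {'age': 33}, {'age': 44}]
```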

The primary index

To sum up, ClickHouse’s primary index is very simple: it only needs to record the first value of each block. For example, take 100 million rows with primary keys ranging from 1 to 100,000,000. Stored in ClickHouse with 8,192 rows per block, they occupy a total of 12,208 blocks, and the index holds the values 1, 8193, 16385, and so on. A query only needs these values to decide which blocks to read. For example, to query rows with ID > 500 and ID < 12258, only block 0 and block 1 need to be read.

In the ClickHouse data directory, the primary index lives in the primary.idx file. Its essence is to store the minimum value of the data in each block, so that the block containing the data being queried can be determined: it maps data to blocks. Simply put, given a value, the primary index can quickly tell you which block that value lives in, which avoids scanning the entire data set just to find a single row.
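Below is a small Python sketch of this lookup, reusing the numbers from the example above (8,192 rows per block, IDs 1 to 100,000,000). The function name and the handling of the exclusive range bounds are my own illustrative choices, not ClickHouse internals.

```python
import bisect

# Sparse primary index: the first primary-key value of each block.
# With ids 1..100_000_000 and 8192 rows per block, the index is
# [1, 8193, 16385, ...] -- one entry per block, 12,208 entries total.
GRANULARITY = 8192
primary_index = list(range(1, 100_000_001, GRANULARITY))

def blocks_for_range(lo, hi):
    """Return the block numbers that may contain ids in the open range (lo, hi)."""
    # Last block whose first key is <= lo, and last block whose first key is < hi.
    first = max(bisect.bisect_right(primary_index, lo) - 1, 0)
    last = bisect.bisect_left(primary_index, hi) - 1
    return list(range(first, last + 1))

print(blocks_for_range(500, 12258))   # -> [0, 1]
```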

The mark file

The previous section introduced what a primary index is. But the primary index alone cannot achieve fast lookup: it only maps data to blocks. Even if I already know my data is stored in the first block, how do I locate that block on disk? The answer is the mark file. In other words, the mark file stores the mapping from blocks to file offsets.

We know that a block consists of 8,192 rows of data. If every block were exactly 64 MB in size, finding the address of the Nth block would be as simple as computing N × 64 MB. But after compression the size of each block is no longer fixed, so the starting position of every block must be stored explicitly to make lookup possible.

In ClickHouse’s actual processing, each mark (one line in the mark file) consists of three fields: the block id, the offset in the data file, and the offset within the decompressed block. Each field is 8 bytes, so each mark takes 24 bytes in total. Given a block id N, its mark can be located directly at offset N × 24 bytes.
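The fixed-size layout makes the mark file easy to read by seeking. Here is a Python sketch that reads the Nth mark, assuming the three 8-byte fields described above; the field order, endianness, and file name are assumptions for illustration rather than ClickHouse’s exact on-disk format.

```python
import struct

MARK_SIZE = 24  # three unsigned 64-bit fields: block id, offset in the
                # compressed data file, offset within the decompressed block

def read_mark(mark_path, n):
    """Read the n-th mark by seeking directly to n * 24 bytes."""
    with open(mark_path, "rb") as f:
        f.seek(n * MARK_SIZE)
        block_id, file_offset, block_offset = struct.unpack("<QQQ", f.read(MARK_SIZE))
    return block_id, file_offset, block_offset
```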

With the block id and the offset in the data file, you can already seek to the right place and read. So what is the third field, the offset within the compressed block, for? The answer has to do with how ClickHouse handles compressed blocks. As mentioned in Chapter 3, ClickHouse forms a block for every 8,192 rows. But what if 8,192 rows are still tiny? Take the UInt8 type, where each value is only 8 bits, i.e. 1 byte: 8,192 rows amount to just 8,192 bytes = 8 KB, which is too little data to be worth compressing on its own. ClickHouse therefore builds each compressed block from a minimum of 64 KB and a maximum of 1 MB of data. That is, if one block totals less than 64 KB, the next block is appended to it until the threshold is reached, so a single compressed block can contain several blocks. The third field in the mark file exists to handle exactly this situation.
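The following Python sketch illustrates this accumulation rule with the UInt8 example; the 64 KB threshold comes from the text above, while the function and variable names are made up for illustration.

```python
MIN_COMPRESS_BLOCK = 64 * 1024   # 64 KB lower bound before compressing

def group_blocks(blocks):
    """Accumulate raw 8192-row blocks until the buffer reaches 64 KB,
    then emit one chunk destined for a single compressed block (illustrative)."""
    buf, groups = b"", []
    for raw in blocks:
        buf += raw
        if len(buf) >= MIN_COMPRESS_BLOCK:
            groups.append(buf)
            buf = b""
    if buf:
        groups.append(buf)
    return groups

# A UInt8 column: each 8192-row block is only 8 KB, so eight of them
# end up together in a single compressed block.
raw_blocks = [bytes(8192) for _ in range(10)]
print([len(g) for g in group_blocks(raw_blocks)])   # -> [65536, 16384]
```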

So suppose a mark file has the following contents (block id, offset in the data file, offset in the decompressed block):

0 0 0
1 0 8192
2 0 16384
3 12016 0

Mark file contents

To read block 2: seek to 2 × 24 bytes = 48 in the mark file and read the mark (2, 0, 16384); seek to offset 0 in the .bin data file and decompress the compressed block found there; then skip 16384 bytes into the decompressed data to reach the original rows of block 2.
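Putting the pieces together, here is a Python sketch of the whole read path. The decompress callback, the file names, and the assumption of a 1-byte-per-row column (so one block is 8,192 bytes of decompressed data) are all illustrative, not ClickHouse’s real interfaces.

```python
import struct

GRANULE = 8192          # rows per block
MARK_SIZE = 24          # bytes per mark entry

def read_block(mark_path, bin_path, block_id, decompress):
    """Locate block `block_id` via the mark file, then read it from the .bin file.
    `decompress` is a placeholder for the real codec."""
    # 1. Seek to block_id * 24 in the mark file and read the mark.
    with open(mark_path, "rb") as f:
        f.seek(block_id * MARK_SIZE)
        _bid, file_offset, block_offset = struct.unpack("<QQQ", f.read(MARK_SIZE))

    # 2. Seek to the compressed block inside the .bin file and decompress it.
    with open(bin_path, "rb") as f:
        f.seek(file_offset)
        decompressed = decompress(f)

    # 3. Skip block_offset bytes in the decompressed data; for a
    #    1-byte-per-row column, one block is GRANULE bytes.
    return decompressed[block_offset:block_offset + GRANULE]
```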

Conclusion

Thanks to the design of its storage engine, ClickHouse’s indexing can be kept very simple. It consists mainly of the primary index and the marks: the primary index maps data to blocks, and the mark file maps blocks to file offsets. Moreover, because the primary index is tiny (100 million rows need only a little over 12,000 index entries), it can stay resident in memory, which speeds up lookups even further.

ClickHouse also provides secondary indexes, but they are simple, optional, and have little impact on overall performance, so I will cover them in a supplementary article.

Now that the design of ClickHouse’s storage engine has been covered, in the next installment I’ll walk you through a complete example of how the storage engine works end to end. I will also revise what I have already written, in particular by adding plenty of diagrams to make the concepts clearer.

This section should give you a feel for ClickHouse’s design philosophy. Here is a new question for you to ponder: as far as this article is concerned, ClickHouse is designed for speed, so why is ClickHouse written in C++? Could you achieve the same performance in Java? Why or why not? This question leads to the core of the next section: ClickHouse’s computing engine. Let’s meet again there for a more challenging journey.