ClickHouse is a column-oriented database management system for OLAP, released by Yandex in Russia. It is praised above all for its query speed. Table 1 shows how many times faster ClickHouse's queries are than those of several other databases.
Data volume | ClickHouse | Hive | MySQL | Greenplum |
---|---|---|---|---|
100 million | 1 | 294.48 | 844.58 | 24.53 |
1 billion | 1 | / | / | 18.68 |
Table 1 Performance comparison of ClickHouse with other databases (data from ClickHouse official website)
As you can see, at 100 million records ClickHouse is on average about 294 times faster than Hive, 844 times faster than MySQL, and 24 times faster than Greenplum. A gap this large has to come from sound architectural design.
ClickHouse is an excellent, classic engineering project, a distillation of computer engineering practice. We call it engineering because most of the query-speed optimizations in the ClickHouse kernel do not create new theory; they successfully apply earlier research to this field. For example, data compression builds on the LZ77 algorithm, and the storage engine borrows from the LSM design used by LevelDB…
To uncover what lies behind this performance, this series analyzes ClickHouse in three parts:
- Part 1: Principles.
- Part 2: Fundamentals: the architecture of ClickHouse's storage engine and computing engine.
- Part 3: Source code: how ClickHouse implements the designs above.
The first part, Principles, mainly answers one question: what factors affect query speed in online analysis of big data? Everything else starts from this question. The part then looks at what improvements have been made to address those factors.
The second part, Fundamentals, shows how ClickHouse optimizes further on the basis of the first part. It introduces the principles of ClickHouse's storage engine in detail, so that readers can see how far ClickHouse pushes storage optimization. Of course, any structure that has advantages also has disadvantages, so this part also helps the reader analyze what ClickHouse is not good at. Beyond the storage engine, ClickHouse has a major engineering innovation in its computing engine, the vectorized computing engine, which is also analyzed in this part. Together, these two engine designs form the foundation of ClickHouse.
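To give a rough feel for what "vectorized" means before Part 2 covers it properly, here is a minimal Python sketch. It is not ClickHouse's implementation: the function names, the block size, and the sample data are all hypothetical, and pure Python gains none of the SIMD benefits a real engine gets. It only illustrates the idea that an operator receives a block of values from one column at a time instead of one row at a time.

```python
from typing import Iterator, List


def scan_blocks(column: List[int], block_size: int) -> Iterator[List[int]]:
    """Yield consecutive blocks of a single column (hypothetical helper)."""
    for start in range(0, len(column), block_size):
        yield column[start:start + block_size]


def sum_where_greater(column: List[int], threshold: int, block_size: int = 4) -> int:
    """Sum the values above `threshold`, processing the column block by block."""
    total = 0
    for block in scan_blocks(column, block_size):
        # The filter runs once per block and produces a mask for the whole block.
        mask = [value > threshold for value in block]
        # The aggregate then consumes the block and its mask together.
        total += sum(value for value, keep in zip(block, mask) if keep)
    return total


# Sample column values, invented for illustration only.
prices = [1, 5, 10, 3, 8, 2, 7]
result = sum_where_greater(prices, threshold=4, block_size=3)
```

In a real engine each block is a contiguous array, so the per-block filter and aggregate can be tight loops that the compiler turns into SIMD instructions.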
The third part shows the reader how ClickHouse implements the Part 2 architecture at the source-code level. It also explores some of the relationships between architecture and the way code is written. Experienced programmers will have some sense that how the code is written affects what the architecture can deliver. I like to think of architecture as the fundamentals of a system. The fundamentals set the upper limit of the system, but poor code can pull that limit very low. Even the best architecture cannot show its full power if the code is badly written. For example, if too many locks are used, Amdahl's law means the system cannot reach the upper limit set by the architecture. At the same time, code quality affects reusability and ease of use, and these concerns constrain one another. This part will not cover much of that, but it will look at the ClickHouse source code and show how ClickHouse handles it. This is what I like to call microarchitecture, and I will write a separate column on it when I get the chance. This series will use some of those inferences directly. Readers may well disagree; these are only inferences, and you are welcome to discuss them with me.
Finally, this series will briefly introduce the distributed architecture of ClickHouse. Because ClickHouse is designed to emphasize the processing power of a single machine, it does not pay much attention to distributed architecture and does not use many distributed techniques. This may be because the open source version does not include that part of the architecture. However, judging from the existing source code, ClickHouse uses an MPP architecture, which is a relatively simple design among distributed architectures. Distributed systems will be covered in more detail in other series that will come out later, and ClickHouse will also be compared there.
Next, we begin our tour of ClickHouse.
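The remark about locks can be made concrete with Amdahl's law: if a fraction p of the work can run in parallel on n workers, the speedup is at most 1 / ((1 - p) + p / n). The sketch below is only an illustration; the function name and the example numbers are hypothetical and are not measurements of ClickHouse.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # Time spent under locks (or otherwise serialized) does not shrink with more workers.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical case: 10% of the time is serialized by locks, 16 cores available.
speedup_16 = amdahl_speedup(0.9, 16)
```

With 10% of the time serialized, 16 workers give a speedup of roughly 6.4 rather than 16, and no number of workers can push it past 10. This is the sense in which lock-heavy code lowers the ceiling that the architecture sets.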