The article reprinted in: https://mp.weixin.qq.com/s?__…
Apache Kylin and ClickHouse are both popular OLAP engines for big data in the market; Kylin was originally developed by the R&D center of eBay in China. It was opened source and contributed to Apache Software Foundation in 2014. With its sub-second query ability and super concurrent query ability, Kylin has been adopted by many major companies, including Meituan, Didi, Ctrip, Kezhaofang, Tencent, 58.com, etc.
ClickHouse, which has been hot in OLAP for two years, was developed by Russian search giant Yandex and opened source in 2016. Its typical users include well-known companies such as Bytedance, Sina and Tencent.
What are the differences between the two OLAP engines, what are the advantages of each, and how to choose? This article will try to compare the two OLAP engines from the aspects of technical principles, storage structure, optimization methods and advantageous scenarios, so as to provide some references for you to choose the technology.
01
The technology principle
In terms of technical principles, we mainly compare architecture and ecology.
1.1 Technical architecture
Kylin is Molap (Multi-Dimensional OLAP) technology based on Hadoop, and the core technology is OLAP Cube. Unlike traditional MOLAP technology, Kylin runs on Hadoop, a powerful and extensible platform that supports massive (terabyte to petabyte) amounts of data. It imports a well-calculated (executed by MapReduce or Spark) Cube into a low-latency distributed database called HBase, which enables subsecond query responses. More recently, Kylin 4 began replacing HBase with Spark + Parquet to further simplify the architecture. Since a large amount of aggregate computation has been completed in the offline task (Cube build), when executing SQL queries, it does not need to access the original data, but directly uses the index combined with the aggregate results to calculate the second time. The performance is hundreds or even thousands of times better than accessing the original data. Due to low CPU usage, it can support a high amount of concurrency, especially suitable for self-help analysis, fixed reports and other multi-user, interactive analysis scenarios.
ClickHouse is a Relational OLAP analysis engine based on MPP architecture. Each node has the same responsibility and each node is responsible for the processing of a shared nothing. ClickHouse developed a vectorized execution engine. Using log merge tree, sparse index and CPU’s SIMD (Single Instruction Multiple Data, Single Instruction Multiple Data) features, the hardware advantage is fully utilized to achieve the purpose of efficient computing. As a result, ClickHouse often reaches the limits of CPU performance when faced with large data volume scenarios.
1.2 Technology ecology
Kylin is written in Java and fully integrated into the Hadoop ecosystem. HDFS is used for distributed storage. MapReduce, Spark and Flink are optional computing engines. Storage engines HBase, Parquet (combined with Spark) are optional. The source data access supports Hive, Kafka, RDBMS, etc., and multi-node coordination relies on ZooKeeper. Compliant with Hive metadata, Kylin only supports SELECT query, schema changes and so on need to be completed in Hive, and then synchronized to Kylin; Modeling and other operations are completed through Web UI, and task scheduling is carried out through REST API. Task progress can be viewed on Web UI.
ClickHouse is written in C++ with its own system and less reliance on third-party tools. Support more complete DDL and DML, most operations can be done through the command line combined with SQL; Distributed clusters rely on Zookeper management, single nodes do not rely on Zookeper, most of the configuration needs to be done by modifying the configuration file.
02
storage
Kylin uses HBase or Parquet of Hadoop ecology as the storage structure and relies on the RowKey index of HBase or the Row Group sparse index of Parquet to speed up the query. Use HBase Region Server or Spark Executor for distributed parallel computing. ClickHouse manages its own data storage. Its storage features include: Mergetree as the primary storage structure, data compression and partitioning, sparse indexing, etc. A detailed comparison of the two engines is shown below.
2.1 Storage structure of Kylin
Kylin calculates the multidimensional Cube data through pre-aggregation, and dynamically selects the optimal Cuboid (similar to materialized view) according to the query conditions during the query, which will greatly reduce the CPU calculation and IO reading.
In the Cube construction process, Kylin encodes the dimension values to a certain extent, such as dictionary coding, in an effort to minimize data storage. Since Kylin’s storage engine and build engine are both pluggable, the storage structure varies for different storage engines.
HBase storage
In the case of using HBase as the storage engine, each dimension will be encoded in the predicted calculation to ensure that the dimension value has a fixed length. In addition, the dimensions in the calculated results will be stitched into RowKey and the aggregate value will be taken as value when generating HFile. The order of dimensions determines the design of the RowKey and directly affects the efficiency of the query.
Parquet storage engine
When Parquet is used as the storage format, dimension values and aggregate values are stored directly without the need for encoding and rowkey concatenation. Before saving as Parquet, the computing engine will sort the results according to the dimension, the more advanced the dimension field, the more efficient the filtering on it. The number of shards in the same partition and the number of row groups in the Parquet file also affect the efficiency of the query.
2.2 ClickHouse storage structure
ClickHouse generally requires the user to specify the partition column when creating the table structure. Using data compression and pure column storage technology, using Mergetree to store each column separately and compress into blocks,
Data is always written to disk as fragments, and ClickHouse periodically merges these fragments through background threads when certain conditions are met.
As the volume of data continues to grow, ClickHouse will merge the data from the partition directory to improve the efficiency of data scanning.
At the same time, ClickHouse provides a sparse index for each data block. In the processing of query requests, you can use sparse indexes to reduce data scanning to accelerate the effect.
03
An optimization method
Both Kylin and ClickHouse are big data processing systems. When the data magnitude continues to increase, appropriate optimization methods can often achieve twice the result with half the effort, greatly reducing the query response time, reducing the storage space and improving the query performance. Due to their different computing systems and storage systems, the optimization methods adopted are also different. The next section will focus on analyzing the optimization methods of Kylin and ClickHouse.
3.1 Optimization method of Kylin
The core principle of Kylin is predictive computation, as described in the first section of technical principles: Kylin’s computing engine uses Apache Spark and MapReduce; HBase for storage, Parquet; The SQL is parsed and post-computed using Apache Calcite. Kylin’s core technology is to develop a series of optimization methods to help solve the problem of dimension explosion and excessive scanning data. These methods include: setting aggregation group, setting union dimension, setting derived dimension, setting dimension table snapshot, setting Rowkey order, setting shard by column, etc.
- Set aggregation group: pruning through aggregation group to reduce unnecessary combination of predicted calculation;
- Set joint dimensions: Combine the dimensions that often occur in pairs together to reduce unnecessary calculation;
- Set derived dimensions: the dimensions that can be calculated from other dimensions (such as year, month, and day can be calculated from date) can be set as derived dimensions to reduce unnecessary calculation;
- Set the dimension table snapshot: put it into memory to count, reduce the storage space occupied;
- Dictionary coding: reduces storage footprint;
- ROWKEY encoding, setting SHARD BY columns: Accelerates query efficiency by reducing the number of rows scanned for data
3.2 ClickHouse optimization method
The most common way to optimize an MPP architecture system is to separate libraries and tables. Similarly, the most common ways to optimize ClickHouse include partitioning and sharding, and ClickHouse also includes a number of specific engines. To sum up, these optimization methods include:
- Use flat table structure instead of multi-table JOIN to avoid expensive JOIN operation and data shuffling
- Set reasonable partitioning key, sort key, secondary index, reduce data scanning
- Build ClickHouse Distributed Cluster to add sharding and replicas, and add computing resources
- Combined with materialized view, appropriate use of Summingmergetree, AggregateMergetree and other predict-oriented engines
With the higher and higher requirements of performance and concurrency, the resource consumption of the machine is also increasing. In the official website document of ClickHouse, it is recommended that the number of concurrency of ClickHouse should not exceed 100. When the concurrency requirement is high, in order to reduce the resource consumption of ClickHouse, some special engines of ClickHouse can be combined for optimization.
The most commonly used data structures for special engines are SummingMergetree and AggregateMergetree. These two data structures are derived from Mergetree and are essentially calculated in advance by predicating the data that needs to be queriedand stored in ClickHouse. This further reduces resource consumption when querying.
From the principle of use, Summingmergetree and AggregateMergetree are similar to Kylin’s Cube. However, when there are too many dimensions, it is unrealistic to manage many materialized views, and the problem of high management cost exists. Unlike ClickHouse, Kylin provides a series of simple and straightforward optimization methods to avoid the problem of exploding dimensions.
As you can see, both ClickHouse and Kylin provide ways to reduce the storage footprint and reduce the number of rows of data scanned when querying. It is generally accepted that both ClickHouse and Kylin can be properly optimized to meet business requirements in high-data scenarios. ClickHouse uses MPP for current calculation, while Kylin uses predicted calculation. Due to the different technical routes adopted by the two, the corresponding advantage scenarios are also different.
04
Advantage of scene
Kylin is suitable for aggregate queries with fixed patterns because it adopts predictive computing technology. For example, the conditional patterns of JOIN, GROUP BY and WHERE in SQL are relatively fixed. The larger the data volume is, the more obvious the advantages of using Kylin are. In particular, Kylin has great advantages in such scenes as COUNT distinct, TOP N, Percentile, etc., which are widely used in Dashboard, various reports, large screen display, traffic statistics, user behavior analysis and other scenes. Meituan, Aurora, ShellFinder and others use Kylin to build their data service platform, providing up to millions to tens of millions of queries every day, and most queries can be completed within 2-3 seconds. There are few better alternatives to such high concurrency scenarios.
ClickHouse uses MPP architecture because of its strong field computing capacity, which is suitable for flexible query requests, or for detailed query requirements with small concurrency. Scenarios include user tag filtering with very many columns and arbitrary combinations of WHERE conditions, complex AD hoc queries with small concurrent volumes, and so on. If the volume of data and traffic is large and you need to deploy a distributed ClickHouse cluster, the operational challenges can be high.
If some queries are very flexible, but not frequently checked, the use of on-the-spot calculation is more resource saving, because the number of queries is small, even if each query consumes a large amount of computing resources can be cost-effective overall. If some queries have fixed modes, a large number of queries is more suitable for Kylin. Because of the large number of queries, large computational resources are used to save the computational results, and the early computational cost can be spread out in each query, so it is the most economical.
05
conclusion
In this paper, Kylin and ClickHouse are compared in terms of technical principles, storage structure, optimization methods and advantageous scenarios.
Technical principle: ClickHouse adopts MPP + Shared Nothing architecture, so it is flexible to query, easy to install, deploy and operate. Since the data is stored locally, it is relatively troublesome to expand capacity and maintain operation. Kylin adopts Molap prediction, architecture based on Hadoop, separation of computing and storage (especially after using Parquet storage), and Shared storage, which is more suitable for scenes with relatively fixed scenes but large data volume. Based on Hadoop, it is easy to integrate with existing big data platform. Horizontal scaling is also easy (especially after upgrading from HBase to Spark + Parquet).
Storage structure: ClickHouse stores detailed data. Features include Mergetree storage structure and sparse indexes. On top of the detailed data, you can further create aggregate tables to accelerate performance. Kylin uses pre-aggregation and HBase or Parquet for storage. The materialized view is transparent to the query. Aggregate query is very efficient but does not support detailed query.
In terms of optimization methods: ClickHouse includes optimization methods such as partition sharding and secondary index, while Kylin adopts optimization methods such as aggregation group, union dimension, derivative dimension, hierarchy dimension and Rowkey sorting
Advantage scenario: ClickHouse is usually suitable for flexible queries on the order of hundreds of millions to billions (more orders of magnitude are also supported, but cluster operation and maintenance becomes more difficult). Kylin is more suitable for the relatively fixed query scenarios of billions to billions.
Here is a summary of many aspects:
Taken together, Kylin and ClickHouse are used in a variety of areas and scenarios. The modern field of data analytics does not have a single analytics engine that fits all scenarios. Enterprises need to choose the right tools to solve specific problems according to their business scenarios. I hope this paper can help enterprises to make a suitable technology selection.