ClickHouse is an OLAP database for big data scenarios. Compared to traditional OLAP big data analysis systems based on the Hadoop ecosystem, ClickHouse offers extreme query performance, lightweight architecture, and easy maintenance. At present, the community is highly active and the application practice in the industry is increasingly extensive.
Business introduction
Jingdong Energy Management platform is an energy consumption analysis product launched by THE IoT Product Department of Jingdong Technology for government and enterprise customers, which uses the Internet of Things, big data and AI technology to collect, monitor, analyze and alarm energy big data for enterprises and public institutions. It aims to help customers achieve energy conservation and emission reduction and reduce energy consumption per unit product.
Energy indicators include electricity consumption, water consumption and natural gas consumption. The dimensions include time dimension (year, month, week, day and hour), manufacturer, workshop, type of production line, production line and equipment. According to these indicators and dimensions, real-time data multidimensional analysis and diagnosis services are provided.
Technology selection
The concept of BI (business intelligence) was put forward in the last century for multi-dimensional analysis of data indicators. Compared to OLTP (online transaction) systems, the industry refers to such BI-oriented systems collectively as OLAP(Online analysis) systems. With the development of computer software technology, from single machine tools of a small amount of data analysis, such as Excel, to medium-sized data type by analyzing the relationship between database building OLAP (SSAS) such as Microsoft, to today’s era of big data, huge amounts of data real-time OLAP analysis engine, technical innovation, flowers on tool system schools of thought contend, Each has its advantages, but they can be broadly divided into two broad categories in terms of architectural patterns:
-
MPP architecture. The MPP architecture features that the service sends the received query request to each compute node. After the compute node completes the calculation, the final result is summarized by one node to obtain the final result. Typical implementations such as Presto, Impala, SparkSQL, Drill, etc. The MPP architecture is characterized by flexible data model and high memory overhead to achieve high performance.
-
Predictive computing system. The core idea of predictive calculation is to use space for time, through in-depth business understanding, preprocess the combination of data indicators and dimensions that need to be queried, store the calculated results in the database and establish corresponding indexes, and realize query acceleration. Typical implementations are Kylin and Druid. Predictive computing system is characterized by high performance, but poor flexibility. Generally, data model adjustment involves rerun of historical data, which makes maintenance difficult.
As can be seen from the above table, there is no OLAP engine in the industry that can meet the requirements of performance and flexibility at the same time. When selecting the technology of JINGdong Energy Management platform, factors such as flexibility of the model, ease of deployment, development cost, maintainability and suitability for cloud deployment are taken into consideration. The final decision was to use ClickHouse based on the MPP architecture as our OLAP engine.
The application of ClickHouse
1. System architecture
Jingdong energy Management platform is mainly used for multi-dimensional analysis and statistics, AI diagnosis and energy consumption statement issuance of counts reported by various meters (water meters, electricity meters, natural gas meters, etc.). The raw data of a meter is usually a cumulative value. For example, the kilowatt-hour is a total of all the electricity consumed since the meter was installed. Therefore, we will introduce a differentiator to preprocess the data before data access, so that the index data entering ClickHouse can be directly added to the index, and it is convenient to use SQL to connect ClickHouse to realize multidimensional query service. The architecture diagram is as follows:
Description:
-
Property management platform: manage the equipment, manage the property model and equipment status, collect equipment data.
-
Message bus: Kafka message queue, using JSON format data to achieve data interaction between object management platform and energy platform.
-
Differentiator: Calculates the difference between the cumulative value reported each time and the cumulative value reported last time to get the cumulative index.
-
Exception rule chain: Provides an exception rule set, which is used by the differentiator to determine whether the reported data is abnormal. If abnormal data is recorded, the data is not processed.
-
OLAP engine: An OLAP engine based on the ClickHouse implementation.
-
Multidimensional analysis service: provides universal multidimensional data analysis query service, which can realize the combination query of various dimensions and indicators through a unified API.
-
Government and Enterprise interfaces: WEB interfaces for government and enterprise customers.
2. ClickHouse application
As you can see from the architecture diagram above, the energy platform uses ClickHouse as an OLAP engine to provide multidimensional query services. The following focuses on data access, storage and universal interface design
ClickHouse applications:
- Data access
A typical ETL process for ClickHouse is to Extract data from a Kafka table and generate consumer subscriptions to Kafka topics. Data Transform is realized through materialized view. Data is eventually loaded into the MergeTree table to realize the actual data storage.
Example of creating Kafka:
1 CREATE TABLE statistics_kafka ON CLUSTER '{cluster}' (
2 timestamp UInt64,
3 level String,
4 message String
5 ) ENGINE = Kafka SETTINGS kafka_broker_list = 'kafka.jd.com:9092',
6 kafka_topic_list = 'statistics',
7 afka_group_name = 'gpst',
8 kafka_format = 'JSONEachRow',
9 kafka_skip_broken_messages = 1,
10 kafka_num_consumers = 3;
Copy the code
-
Kafka_broker_list: Kafka broker address.
-
Kafka_topic_list: topics consumed.
-
Kafka_group_name: consumption groupId.
-
Kafka_format: Data format JSONEachRow Indicates that the message body is in JSON format.
-
Kafka_skip_broken_messages: number of Kafka exception messages that are ignored. The default value is 0.
-
Kafka_num_consumers: indicates the number of consumers. The default value is 1.
Example of creating materialized views:
1 CREATE MATERIALIZED VIEW statistics_view ON CLUSTER '{cluster}' TO statistics_replica AS
2 SELECT timestamp,
3 level,
4 message5FROM statistics_kafka;
Copy the code
Example Of creating a MergeTree engine:
1 CREATE TABLE statistics_replica ON CLUSTER '{cluster}'{
2 timestamp UInt64,
3 dt String,
4 deviceId String,
5 level String,
6 message String
7 } ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/statistics_replica','{replica}')
8 PARTITION BY dt
9 ORDER BY (dt,deviceId,level);
Copy the code
- storage
I.C lickHouse table type
Local table: Actual data storage table, such as table STATISticS_Replica.
Distributed table: A logical table that can be thought of as a view in a database. The distributed table engine will route our query requests to the local table, then summarize them and return them to the user. Create a distributed representation:
1 CREATE TABLE statistics ON CLUSTER '{cluster}' AS statistics_replica
2 ENGINE = Distributed(ck_cluster_1,test,events_local,rand());
Copy the code
Ii. The Replication and Sharding
Replication is a Replication mechanism provided by ClickHouse. For Replicated MergeTree series Replicated tables, you can set up multiple copies of exactly the same data per table on different compute nodes, and each copy is complete and called a copy.
Shard: Divides the data in a table into multiple parts according to certain rules. The data in each part is stored on different compute nodes. The data on each compute node is called a fragment.
ClickHouse implements a Replicated table mechanism based on the Replicated MergeTree engine with Zookeeper, allowing you to determine whether the table is highly available when you create it. In the statistics_replica list in the previous section, / Clickhouse /tables/{shard}/ STATIStics_replica indicates the nodes in Zookeeper that correspond to the replica. When data is written to a ReplicatedMergeTree table, the process is as follows:
-
A ClickHouse node received a data write request.
-
Synchronize to other instances through the InterServer HTTP port.
-
Updated node information in the Zookeeper cluster.
3.OLAP universal interface design
ClickHouse provides a standard SQL query engine that enables the basic operation of multiple ClickHouses through JDBC references. OLAP regular operations such as roll up, drill down and slice will involve a variety of dimensions free combination, a variety of indicators of cross analysis process, if the server use Mybatis or conventional ORM operation such as JPA, engineers can easily according to the requirements of the different query scenario design a corresponding interface, according to a large number of branch operation or design complex judgmental interface, In view of this, the author designs a set of universal multidimensional service query interface optimized for OLAP.
First, a typical analysis SQL statement looks like this:
1 SELECT day_str,
2 factory_name,
3 workshop_name,
4 prodline_name,
5 device_id,
6 SUM(w_total) AS total
7 FROM statistics
8 WHERE day_str BETWEEN '2020-10-01' AND '2020-12-31'
9 GROUP BY day_str,factory_name,workshop_name,prodline_name,device_id
10 ORDER BY day_str ASC;
Copy the code
We translated the above statement into the business language as “Query the power consumption of all equipment in the plant in the fourth quarter of 2020”. From this, we can clearly know that the dimension here refers to “equipment name” and the indicator is “power consumption”. Based on this, it can be further classified. Dimensions usually appear after SQL statements SELECT, WHERE, GROUP BY, and ORDER BY, and metrics usually appear after SELECT, which can be summarized as follows:
1 SELECT {dimension},{dimension} 2 FROM table_name 3 WHERE {dimension}=' XXX '4 GROUP BY {dimension} 5 ORDER BY {dimension};Copy the code
Therefore, we can design the following generic interface methods:
List<Map<String,Object>> queryStatisticsResult(Query Query); Public class Query {private static final Long serialVersionUID = 4904019884726531900L; /** * dimensions */ private List<String> dimensions; /** * private List<Measure> measures; /** * private List<Filter> where; } // Public class implements Serializable private static final Long serialVersionUID = -8556179136317748835L; /** * @nonNULL private String name; /** * private String field; /** * Aggregation type */ @nonNULL private AggregationEnum} // Aggregation enumeration public enum AggregationEnum { SUM,AVG,COUNT,MIN,MAX,COUNT_DISTINCT,PERCENTILE; }Copy the code
conclusion
This paper focuses on the architecture and design of the multidimensional data analysis engine of JINGdong integrated energy management platform, and illustrates a typical application scenario of ClickHouse from the perspectives of data access, storage and multidimensional analysis service design. It is hoped that this paper will provide readers with an idea and method to deal with real-time OLAP field of big data. Of course, due to the limited space and my level, I did not further elaborate more possible solutions. As we deepen our business and upgrade the system iteratively, better solutions suitable for the future will be introduced step by step. Please look forward to it.
Recommended reading
-
How to use ClickHouse for temporal data management and mining?
-
Support 271.5 billion yuan of massive orders revealed jingdong promote the database cornerstone
-
ClickHouse best Practices distribution table writing process analysis
Welcome to [JINGdong Technology] to learn about the developer community
More wonderful technical practice and exclusive dry goods analysis
Welcome to “JINGdong Technology Developer” public account