Getting Started with Apache Kylin series directory

  • Getting Started with Apache Kylin 1 – Basic Concepts
  • Getting Started with Apache Kylin 2 – Principles and Architecture
  • Getting Started with Apache Kylin 3 – Installation and Configuration Parameters
  • Getting Started with Apache Kylin 4 – Building the Model
  • Getting Started with Apache Kylin 5 – Building the Cube
  • Getting Started with Apache Kylin 6 – Optimizing the Cube
  • Building a Kylin query-time monitoring page with ELKB

This article summarizes some Kylin best practices from several perspectives. It may not be entirely accurate, and corrections and criticism are welcome.

1. Dimension tables and fact tables

1. Dimension tables

  1. For data consistency, primary key values must be unique (otherwise the Kylin build will report an error); a quick uniqueness check is sketched after this list.
  2. The smaller the dimension table, the better, because Kylin loads dimension tables into memory for queries. Tables that are too large are not suitable as dimension tables; the default size threshold is 300 MB.
  3. The dimension table should change infrequently. Kylin tries to reuse the dimension table's snapshot across builds; if the table changes frequently, reuse fails and new snapshots are taken again and again.
  4. Dimension tables should not be Hive views, since a view has to be materialized for every build, which adds extra time overhead.
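As a pre-build sanity check, the sketch below verifies primary key uniqueness and looks up the table size for a dimension table. It is a minimal sketch only: it assumes a HiveServer2 instance reachable through PyHive and a hypothetical dimension table dim_region with primary key region_id; adapt the names to your schema.

```python
# Minimal sketch: sanity-check a dimension table before a Kylin build.
# Assumptions: HiveServer2 on localhost:10000 (via PyHive) and a hypothetical
# dimension table `dim_region` whose primary key is `region_id`.
from pyhive import hive

cur = hive.connect(host="localhost", port=10000, username="hive").cursor()

# 1. The primary key must be unique, otherwise the Kylin build fails.
cur.execute("""
    SELECT region_id, COUNT(*) AS cnt
    FROM dim_region
    GROUP BY region_id
    HAVING COUNT(*) > 1
""")
print("duplicate keys:", cur.fetchall())   # expect an empty list

# 2. Rough size check: tables much larger than ~300 MB are poor dimension
#    tables, since Kylin snapshots them and loads them into memory.
cur.execute("DESCRIBE FORMATTED dim_region")
for name, param, value in cur.fetchall():
    if param and "totalSize" in str(param):
        print("totalSize (bytes):", value)
```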

2. Fact tables

  1. Removing fields that do not participate in Cube construction speeds up the build and shrinks the build result.
  2. Where possible, split dimension attributes out of the fact table and extract shared dimensions into dimension tables.
  3. Ensure that fact table records map to the dimension tables, and filter out records that cannot be mapped.

If fact records cannot be mapped to a dimension, numeric fields (Number-family data types) can fail the build with NumberFormatException('\N'), since \N is how Hive physically stores NULL. In addition, a Left Join then produces a large number of NULL values that contribute nothing to actual queries.
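One way to enforce this mapping is to clean the fact data in Hive before Kylin reads it. The sketch below is illustrative only: the table names (fact_sales, dim_region), the join key (region_id) and the numeric measure (amount) are hypothetical. An inner join drops rows that cannot be mapped, and COALESCE keeps numeric fields from carrying NULL into the build.

```python
# Minimal sketch: materialize a cleaned intermediate fact table for Kylin.
# Table and column names are hypothetical; adapt them to your schema.
from pyhive import hive

cur = hive.connect(host="localhost", port=10000, username="hive").cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_sales_clean AS
    SELECT
        f.region_id,
        f.sale_date,
        COALESCE(f.amount, 0) AS amount   -- keep numeric fields NULL-free
    FROM fact_sales f
    JOIN dim_region d                     -- inner join drops unmappable rows
        ON f.region_id = d.region_id
""")
```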

3. Partition tables

Hive tables can have multiple partitions. Simply put, a partition is a file directory that stores a specific set of data files. When new data arrives it can be loaded into a specified partition, and a partition can be specified when data is read. For SQL queries that constrain the partition column, Hive intelligently selects only the matching partitions (directories), avoiding a full scan and reducing read and write load on the cluster.
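As a concrete illustration, the sketch below creates a date-partitioned Hive table and runs a query that touches only one partition; the table and column names are hypothetical.

```python
# Minimal sketch: a date-partitioned Hive table and a partition-pruned query.
# Table/column names are hypothetical.
from pyhive import hive

cur = hive.connect(host="localhost", port=10000, username="hive").cursor()

# Each distinct value of `dt` becomes its own directory under the table path.
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_sales_part (
        region_id BIGINT,
        amount    DOUBLE
    )
    PARTITIONED BY (dt STRING)
""")

# Constraining the partition column lets Hive read only the matching directory.
cur.execute("SELECT SUM(amount) FROM fact_sales_part WHERE dt = '2019-06-01'")
print(cur.fetchall())
```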

Kylin supports incremental Cube builds and typically extracts the data increment from the Hive table by a time attribute. If the Hive table is partitioned on that time attribute, each build can take advantage of Hive partitioning and skip data for irrelevant dates, saving Cube build time. Such a column is called the Partition Time Column in Kylin and is usually also a partition column of the Hive table.
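An incremental build over such a time range can be triggered through Kylin's REST API. The sketch below uses Python's requests library and assumes a Kylin instance at localhost:7070, default ADMIN credentials and a hypothetical cube named sales_cube; the endpoint and payload follow the documented build-cube API, but verify them against your Kylin version.

```python
# Minimal sketch: trigger an incremental Cube build for one day of data via
# Kylin's REST API.  Host, credentials and cube name are assumptions.
from datetime import datetime, timezone
import requests

KYLIN = "http://localhost:7070/kylin/api"
AUTH = ("ADMIN", "KYLIN")        # default credentials; change in production

def epoch_ms(day):
    """Convert 'YYYY-MM-DD' to epoch milliseconds (UTC), as Kylin expects."""
    dt = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

resp = requests.put(
    f"{KYLIN}/cubes/sales_cube/rebuild",         # hypothetical cube name
    json={
        "startTime": epoch_ms("2019-06-01"),
        "endTime": epoch_ms("2019-06-02"),
        "buildType": "BUILD",                    # MERGE / REFRESH also exist
    },
    auth=AUTH,
)
resp.raise_for_status()
print("submitted job:", resp.json().get("uuid"))
```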

2. Cube construction optimization

1. Dimension optimization

  1. Analyze the query patterns; do not select columns that never appear in GROUP BY or WHERE filters as dimensions.
  2. In general, fields from the table on the right side of a Left Join can be used as dimensions.
  3. There are four dimension optimization schemes: mandatory dimensions, hierarchy dimensions, joint dimensions, and derived dimensions.
  4. Aggregation groups further restrict which dimension combinations are built.

Mandatory dimensions, hierarchy dimensions, joint dimensions, derived dimensions, and aggregation groups are covered in detail in part 5 of this series.
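To see why these schemes matter, the back-of-the-envelope calculation below compares the full cuboid count, 2^N, with the count for a single aggregation group. It uses the usual counting rule (a mandatory dimension contributes a factor of 1, a hierarchy of h levels contributes h + 1, a joint group contributes 2, and each remaining normal dimension contributes 2); treat it as a sizing approximation rather than an exact plan.

```python
# Back-of-the-envelope cuboid count for one aggregation group.
# Factors: mandatory -> 1, hierarchy of h levels -> h + 1,
# joint group -> 2, plain (normal) dimension -> 2.
def cuboids(n_mandatory, hierarchy_sizes, n_joint_groups, n_normal):
    count = 1
    for h in hierarchy_sizes:
        count *= h + 1
    count *= 2 ** n_joint_groups
    count *= 2 ** n_normal
    return count                   # mandatory dims only fix columns, factor 1

n_dims = 10
print("no pruning:  ", 2 ** n_dims)            # 1024 cuboids
# e.g. 2 mandatory dims, one 3-level hierarchy, one joint group of 3 dims,
# and the remaining 2 dims kept as normal dimensions (2 + 3 + 3 + 2 = 10)
print("with pruning:", cuboids(2, [3], 1, 2))  # (3+1) * 2 * 4 = 32
```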

2. Other optimizations

  1. RowKey order: mandatory dimensions first, then dimensions that frequently appear in WHERE filters, then high-cardinality dimensions, then low-cardinality dimensions;
  2. ShardBy setting: choose a high-cardinality column as the ShardBy column so that data is distributed evenly (a quick cardinality check is sketched after this list);
  3. Data compression: Kylin applies dedicated compression to dimension dictionaries and dimension table snapshots. For the aggregated data stored in HBase, Kylin uses Hadoop's LZO or Snappy compression to keep the data in HBase and in memory as small as possible.
  4. For large fact tables, build incrementally by partition and schedule periodic Merge operations.
  5. Where possible, use the date type for date columns in the fact table;
  6. The ShardBy column helps spread data evenly across Regions, preventing data skew.
  7. When a column's ShardBy flag is set to true, rows with the same value of that column are stored together, which makes batch retrieval easier.
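For item 2 above, a quick cardinality comparison in Hive helps pick a ShardBy candidate; the sketch below uses hypothetical table and column names.

```python
# Minimal sketch: compare column cardinalities when choosing a ShardBy column.
# Table and column names are hypothetical.
from pyhive import hive

cur = hive.connect(host="localhost", port=10000, username="hive").cursor()
cur.execute("""
    SELECT COUNT(DISTINCT region_id) AS region_card,
           COUNT(DISTINCT user_id)   AS user_card
    FROM fact_sales
""")
print(cur.fetchall())   # the higher-cardinality column spreads data better
```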

3. Notes

1. Which dimensions are involved in Cuboid construction?

Many people mistakenly assume that only the dimensions selected in an aggregation group's Includes list take part in the Cube build. In fact, the dimensions added with the default (normal) type on the Dimensions settings page all participate in the computation, as do Left Join fields whose Derived option is checked.

Simply put, every dimension that appears on the RowKeys ordering screen participates in Cuboid construction. If a dimension shown in RowKeys should not take part in the calculation, change it on the Dimensions settings screen.

2. Set the hierarchy dimension according to the business

Setting hierarchy dimensions requires a deep understanding of the business. For example, should an enterprise be placed under its industry type or under its region? Purely in terms of hierarchical structure, either works. But after digging into the business requirements, we find that every query on enterprises carries a region filter, so it is more effective to place enterprises under the region hierarchy than under the industry type.

3. Be willing to experiment

In practice you will run into all kinds of uncertain ideas. The best approach is to experiment: build more models and Cubes, optimize at different levels, keep trying, and continually distill the lessons learned.

