Getting Started with Apache Kylin: series index

  • Getting Started with Apache Kylin 1 – Basic Concepts
  • Getting Started with Apache Kylin 2 – Principles and Architecture
  • Getting Started with Apache Kylin 3 – Installation and Configuration Parameters in Detail
  • Getting Started with Apache Kylin 4 – Building the Model
  • Getting Started with Apache Kylin 5 – Building the Cube
  • Getting Started with Apache Kylin 6 – Optimizing the Cube
  • Building a Kylin query-time monitoring page based on ELKB

Load tables from Hive

To import table definitions from Hive, perform the following steps:

  1. Log in to the Kylin web UI at http://ip:7070/kylin.
  2. Click the plus sign “+” below the site logo at the top left of the main page to create a new project.
  3. In the pop-up window, enter the project name (mandatory) and a project description, then click the “OK” button to create the project.
  4. Select the new project from the drop-down box to the right of the site logo, then click the “Model” menu.
  5. Click the “Data Source” tab. Three buttons follow the “Tables” label. The first, dark blue button imports tables from Hive by table name. The second, light blue button lets you browse Hive and pick the tables to import.
  6. Click the light blue button and select the Hive tables to import. Then click “Sync” in the lower right corner to import them.
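
The same import can also be scripted instead of clicked through. The sketch below only builds the HTTP request and does not send it. It assumes the "Load Hive Tables" endpoint of the Kylin REST API has the shape POST /kylin/api/tables/{tables}/{project} with HTTP Basic authentication and an optional "calculate" flag in the body; check this against the REST documentation of your Kylin version. The project name, table names, and credentials are hypothetical placeholders.

```python
import base64
import json
import urllib.request


def build_load_tables_request(base_url, project, tables, user, password,
                              calculate=True):
    """Build (but do not send) a request that loads Hive tables into a project."""
    # Assumed endpoint shape: POST {base_url}/api/tables/{tables}/{project},
    # where {tables} is a comma-separated list of Hive table names.
    url = f"{base_url}/api/tables/{','.join(tables)}/{project}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    # "calculate" is assumed to request column cardinality calculation.
    body = json.dumps({"calculate": calculate}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# Hypothetical project, tables, and credentials, for illustration only.
req = build_load_tables_request(
    "http://ip:7070/kylin",
    "my_project",
    ["default.sales", "default.region"],
    "some_user",
    "some_password",
)
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would actually send it; that step is left out here so the sketch can be inspected without a running Kylin instance.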

When importing, you can select the “Calculate column cardinality” option for the Hive table. Cardinality is the number of distinct values in a column. For example, if country is a dimension and it has 200 distinct values, the cardinality of that dimension is 200.
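
To make the definition concrete, here is a minimal sketch that counts distinct values per column over a few in-memory rows. It is an exact count for illustration only and makes no claim about how Kylin computes cardinality internally; the sample rows are made up.

```python
def column_cardinality(rows):
    """Return {column: number of distinct values} for a list of row dicts."""
    distinct = {}
    for row in rows:
        for column, value in row.items():
            # Collect the distinct values seen so far for each column.
            distinct.setdefault(column, set()).add(value)
    return {column: len(values) for column, values in distinct.items()}


# Made-up sample rows, for illustration only.
rows = [
    {"country": "CN", "city": "Beijing"},
    {"country": "CN", "city": "Shanghai"},
    {"country": "US", "city": "Boston"},
    {"country": "US", "city": "Boston"},
]
print(column_cardinality(rows))  # {'country': 2, 'city': 3}
```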

Create a Model

Click the “Models” tab to see the models and cubes already created for the project. Click the “+ New” button and select “New Model” to open the model creation window. Because data models differ from case to case, this article does not walk through a specific example; it focuses on the concepts you encounter while creating a model.

1. Model Info

Model Info holds the basic information about the model. “Model Name” is mandatory. Note two points about the model name:

  1. The model name must be globally unique: it cannot be reused even in a different project.
  2. Once a model is created, the model name cannot be changed.

2. Data Model

The Data Model step defines the overall data model. Whether your data follows a star schema or a snowflake schema, the relationships between its tables are established here.

2.1. Select the fact table

The first step in building the data model is to select the fact table. After selecting it, click the “Add Lookup Table” button to define the relationship between the fact table and a dimension table.

2.2. Establish data relationships

The following describes the Add Lookup Table page:

  1. Relationships can be defined not only between the fact table and dimension tables (star schema) but also between one dimension table and another (snowflake schema).
  2. Three join types can be added between tables: Left Join, Inner Join, and Right Join.
  3. Skip snapshot for this lookup table: this option controls whether to skip creating the snapshot table. Some lookup tables are very large (more than 300 MB), and when a dimension has high cardinality, building the snapshot can cause an out-of-memory (OOM) error. Kylin therefore builds a snapshot only when the source table is within a configured upper limit (kylin.snapshot.max-mb, default 300).
  4. A lookup table that skips snapshot creation cannot be queried directly, and its columns cannot be set as derived dimensions.
  5. In most cases Left Join is used; the other two join types are less common.
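
The difference between the two most common join types can be sketched on small in-memory tables. This is plain Python for illustration, not Kylin code; the table contents and the function name join_fact_to_lookup are hypothetical. A left join keeps every fact row even when no lookup row matches, while an inner join drops unmatched fact rows, which is one plausible reason left joins are the usual choice.

```python
def join_fact_to_lookup(fact_rows, lookup_rows, foreign_key, primary_key,
                        how="left"):
    """Join fact rows to a lookup (dimension) table on foreign_key = primary_key."""
    if how not in ("left", "inner"):
        raise ValueError("how must be 'left' or 'inner'")
    lookup_by_key = {row[primary_key]: row for row in lookup_rows}
    lookup_columns = [c for c in lookup_rows[0] if c != primary_key] if lookup_rows else []
    joined = []
    for fact in fact_rows:
        match = lookup_by_key.get(fact[foreign_key])
        if match is None and how == "inner":
            continue  # inner join: drop fact rows without a matching lookup row
        merged = dict(fact)
        for column in lookup_columns:
            # left join: unmatched fact rows get None for the lookup columns
            merged[column] = match[column] if match is not None else None
        joined.append(merged)
    return joined


# Made-up fact and dimension rows, for illustration only.
sales = [
    {"order_id": 1, "region_id": 1, "amount": 100},
    {"order_id": 2, "region_id": 2, "amount": 50},
    {"order_id": 3, "region_id": 9, "amount": 70},  # no matching region
]
regions = [
    {"region_id": 1, "region_name": "East"},
    {"region_id": 2, "region_name": "West"},
]

left = join_fact_to_lookup(sales, regions, "region_id", "region_id", how="left")
inner = join_fact_to_lookup(sales, regions, "region_id", "region_id", how="inner")
print(len(left), len(inner))  # 3 2
```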

2.3. Complete the construction of table relationships

Once the lookup tables and their joins are defined, the fact table and dimension tables are connected into a complete data model.

3. Dimensions

On the Dimensions page, select the columns that may be used as dimensions. Columns selected here are only candidates that become available when you build a Cube; they are not necessarily the dimensions that end up in the Cube. It is recommended to select all the fields of the dimension tables.

In general, date, type of product, region, and so on are used as dimensions.

4. Measures

On the Measures page, select the columns that you may use as measures.

In general, values such as sales, traffic, temperature, and humidity are used as measures.
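
The roles of dimensions and measures can be sketched with a small aggregation: dimensions are the columns you group by, and measures are the values you aggregate. The code below is a plain Python illustration with made-up rows, not a description of how Kylin stores or computes a Cube.

```python
from collections import defaultdict


def aggregate(rows, dimensions, measure):
    """Sum one measure column, grouped by the given dimension columns."""
    totals = defaultdict(float)
    for row in rows:
        # The tuple of dimension values identifies one group.
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# Made-up rows: "date" and "region" act as dimensions, "sales" as a measure.
rows = [
    {"date": "2019-01-01", "region": "East", "sales": 100.0},
    {"date": "2019-01-01", "region": "West", "sales": 50.0},
    {"date": "2019-01-02", "region": "East", "sales": 70.0},
]
print(aggregate(rows, ["region"], "sales"))
# {('East',): 170.0, ('West',): 50.0}
```

Grouping by ["date", "region"] instead would produce one total per combination of the two dimensions.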

5. Settings

On the Settings page, you can set the partition column and a filter condition. Partitioning exists to support incremental builds; Kylin currently supports date-based partitions. After choosing the partition column, select its date format. If a filter condition is set, Kylin builds only the data that satisfies it.

A few points to note:

  1. The time partition column can be a date column or a finer-grained time column.
  2. Data types supported for the time partition column include time, date, datetime, and integer.
  3. Write the filter condition without the WHERE keyword.
  4. The filter condition must not contain date dimensions.
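
How the partition column and the filter condition could combine for one incremental build can be sketched as follows. This is an illustrative helper, not Kylin's actual query generation: the function name, the half-open date range, and the sample column names are assumptions. It shows why the filter is written without WHERE (the keyword is supplied once, around the combined conditions).

```python
def segment_where_clause(partition_column, start, end, filter_condition=""):
    """Compose an illustrative WHERE clause for one incremental build segment."""
    # Assumption: the segment covers the half-open range [start, end).
    clauses = [
        f"{partition_column} >= '{start}'",
        f"{partition_column} < '{end}'",
    ]
    condition = filter_condition.strip()
    if condition and condition.split()[0].upper() == "WHERE":
        raise ValueError("write the filter condition without the WHERE keyword")
    if condition:
        clauses.append(f"({condition})")
    return "WHERE " + " AND ".join(clauses)


# Hypothetical column names, for illustration only.
print(segment_where_clause("PART_DT", "2019-01-01", "2019-01-02", "PRICE > 0"))
# WHERE PART_DT >= '2019-01-01' AND PART_DT < '2019-01-02' AND (PRICE > 0)
```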

6. Save

Save the model to complete its creation. You can then open the “Visualization” tab of the model to view its table joins.

Snapshot Table

Each Snapshot corresponds to a Hive dimension table and is generated as follows:

  1. Read every row and column value sequentially from the source Hive dimension table.
  2. Encode all of these values with a TrieDictionary, so that each value maps to one ID.
  3. Read the rows of the source table again, replace each column value with its encoded ID, and obtain a new table that contains only IDs.
  4. Save the new table together with the dictionary object (the mapping between IDs and values); together they preserve the entire dimension table.
  5. Kylin stores this data in its metadata store.
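
The two-pass procedure above can be sketched in a few lines. This is a simplified illustration, not Kylin's implementation: a plain Python dict over sorted distinct values stands in for the TrieDictionary, all values are assumed to be strings, and the function names and sample rows are hypothetical.

```python
def build_snapshot(rows):
    """Dictionary-encode a small table; returns (dictionary, encoded_rows)."""
    # Pass 1: read every value and assign one ID per distinct value.
    distinct = sorted({value for row in rows for value in row})
    dictionary = {value: idx for idx, value in enumerate(distinct)}
    # Pass 2: read the rows again and replace each value with its ID.
    encoded = [[dictionary[value] for value in row] for row in rows]
    return dictionary, encoded


def decode_snapshot(dictionary, encoded):
    """Rebuild the original rows from the dictionary and the ID-only table."""
    reverse = {idx: value for value, idx in dictionary.items()}
    return [[reverse[idx] for idx in row] for row in encoded]


# Made-up dimension table rows (region_id, region_name), all as strings.
rows = [["1", "East"], ["2", "West"], ["3", "East"]]
dictionary, encoded = build_snapshot(rows)
print(encoded)  # [[0, 3], [1, 4], [2, 3]]
print(decode_snapshot(dictionary, encoded) == rows)  # True
```

Storing the ID-only table plus the dictionary is enough to recover the full dimension table, as the round trip through decode_snapshot shows.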
