Apache Kylin starter series directory

  • Apache Kylin Getting Started 1 – Basic Concepts
  • Apache Kylin Getting Started 2 – Principles and Architecture
  • Apache Kylin Getting Started 3 – Installation and Configuration Parameters in Detail
  • Apache Kylin Getting Started 4 – Building a Model
  • Apache Kylin Getting Started 5 – Building a Cube
  • Apache Kylin Getting Started 6 – Optimizing a Cube
  • Building a Kylin query-latency monitoring page with ELKB

Install Kylin

The first two articles covered Kylin's basic concepts and how it works. This article moves on to installation, deployment, and configuration.

Big Data Environment Requirements (V2.5.1)

  • Hadoop: 2.7+, 3.1+
  • Hive: 0.13 – 1.2.1+
  • HBase: 1.1+, 2.0
  • Spark (optional): 2.1.1+
  • Kafka (optional): 0.10.0+
  • JDK: 1.8+
  • OS: Linux only, CentOS 6.5+ or Ubuntu 16.04+
  • HDP (officially tested): 2.2 – 2.6 and 3.0
  • CDH (officially tested): 5.7 – 5.11 and 6.0

Big Data Environment Requirements (V2.4.x)

  • Hadoop: 2.7+
  • Hive: 0.13 – 1.2.1+
  • HBase: 1.1+
  • Spark (optional): 2.1.1+
  • Kafka (optional): 0.10.0+
  • JDK: 1.7+
  • OS: Linux only, CentOS 6.5+ or Ubuntu 16.04+
  • HDP (officially tested): 2.2 – 2.6
  • CDH (officially tested): 5.7 – 5.11

As the requirements above show, the latest version (V2.5.1) brings several changes: it supports Hadoop 3.1 and HBase 2.0, and now requires JDK 8. CDH users should note that V2.5 already supports CDH 6.0.

Hardware configuration

Minimum configuration (official website)
  • 4-core CPU
  • 16 GB memory
  • 100 GB disk
Recommended configuration (official KAP documentation)
  • Two 6-core (or 8-core) Intel Xeon processors, 2.3 GHz or above
  • 64 GB memory
  • At least one 1 TB SAS hard disk (3.5-inch), 7200 RPM, RAID 1

Installation package directory description

  • bin: Kylin scripts, including start/stop management, metadata management, environment-check, and sample-creation scripts.
  • conf: Kylin configuration files, including Hive, job, and Kylin runtime parameters.
  • lib: JAR directory for the Kylin JDBC driver and the HBase coprocessor.
  • meta_backups: Kylin metadata backup directory.
  • sample_cube: scripts and data that the official sample depends on.
  • sys_cube: scripts that the system Cube build depends on.
  • spark: the bundled Spark by default; in the screenshot it is a symbolic link pointing to an independently deployed Spark.
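As an example of the metadata-management scripts in bin, Kylin's metastore.sh can back up metadata into the meta_backups directory (a sketch; it assumes Kylin is installed and KYLIN_HOME is set, and the restore path shown is illustrative):

```shell
# Back up Kylin metadata (writes a timestamped dump under $KYLIN_HOME/meta_backups)
$KYLIN_HOME/bin/metastore.sh backup

# Restore from a previous backup (directory name is illustrative)
$KYLIN_HOME/bin/metastore.sh restore meta_backups/meta_2018_11_01_00_00_00
```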

Install and deploy Kylin

For details, see the official website. In brief, the installation steps are:

  1. Download the appropriate version from the official website;
  2. Unpack the package and set the environment variable KYLIN_HOME to point to the Kylin directory;
  3. Check the Kylin runtime environment: $KYLIN_HOME/bin/check-env.sh;
  4. Start Kylin: $KYLIN_HOME/bin/kylin.sh start;
  5. Open http://hostname:7070/kylin in a browser; the initial user name and password are ADMIN/KYLIN;
  6. Run $KYLIN_HOME/bin/kylin.sh stop to stop Kylin.
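The steps above can be sketched as a command transcript (the download URL and package name assume the V2.5.1 binary for HBase 1.x; pick the package matching your Hadoop stack):

```shell
# 1–2: download and unpack, then point KYLIN_HOME at the unpacked directory
wget https://archive.apache.org/dist/kylin/apache-kylin-2.5.1/apache-kylin-2.5.1-bin-hbase1x.tar.gz
tar -zxvf apache-kylin-2.5.1-bin-hbase1x.tar.gz
export KYLIN_HOME=$(pwd)/apache-kylin-2.5.1-bin-hbase1x

# 3–4: verify the environment (Hadoop/Hive/HBase clients visible), then start
$KYLIN_HOME/bin/check-env.sh
$KYLIN_HOME/bin/kylin.sh start    # then open http://hostname:7070/kylin (ADMIN/KYLIN)

# 6: stop Kylin
$KYLIN_HOME/bin/kylin.sh stop
```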

Parameter configuration

Configuration File Overview

Component – file name – description:

  • Kylin – kylin.properties: the global configuration file used by Kylin.
  • Kylin – kylin_hive_conf.xml: Hive task configuration items; when Hive generates the intermediate flat table in the first step of a Cube build, Hive parameters are adjusted according to this file.
  • Kylin – kylin_job_conf_inmem.xml: MR task configuration items; when the Cube build algorithm is Fast Cubing, MR parameters in the build job are adjusted according to this file.
  • Kylin – kylin_job_conf.xml: MR task configuration items; when kylin_job_conf_inmem.xml does not exist, or the Cube build algorithm is Layer Cubing, MR parameters in the build job are adjusted according to this file.
  • Hadoop – core-site.xml: Hadoop global configuration file defining system-level parameters, such as the HDFS URL and the Hadoop temporary directory.
  • Hadoop – hdfs-site.xml: HDFS parameters, such as the storage locations of the NameNode and DataNodes, the number of file replicas, and file read permissions.
  • Hadoop – yarn-site.xml: cluster resource-management parameters, such as the communication ports of the ResourceManager and NodeManagers and the web monitoring port.
  • Hadoop – mapred-site.xml: MR parameters, such as the default number of reduce tasks and the default upper and lower memory limits for jobs.
  • HBase – hbase-site.xml: HBase runtime parameters, such as the master host name and port and the root data storage location.
  • Hive – hive-site.xml: Hive runtime parameters, such as the Hive data storage directory and database address.
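For illustration, the kylin_job_conf*.xml files use the standard Hadoop configuration XML format; a hedged sketch of overriding one MR parameter for build jobs (the property is standard Hadoop MR, the value is only an example):

```xml
<!-- Example only: standard Hadoop configuration format; the value is illustrative -->
<configuration>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>3072</value>
    <description>Memory per map task in Cube build MR jobs</description>
  </property>
</configuration>
```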

Hadoop Parameter Configuration

  • yarn.nodemanager.resource.memory-mb: at least 8192 MB
  • yarn.scheduler.maximum-allocation-mb: at least 4096 MB
  • mapreduce.reduce.memory.mb: at least 700 MB
  • mapreduce.reduce.java.opts: at least 512 MB
  • yarn.nodemanager.resource.cpu-vcores: no less than 8
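As a sketch, the YARN items above translate into yarn-site.xml entries like the following (values are the stated minimums; real clusters usually need more, and the mapreduce.* items go into mapred-site.xml instead):

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>8</value>
  </property>
</configuration>
```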

kylin.properties core parameters

Configuration item (default value) – description:

  • kylin.metadata.url (kylin_metadata@hbase): Kylin metadata path.
  • kylin.env.hdfs-working-dir (/kylin): HDFS working directory used by the Kylin service.
  • kylin.server.mode (all): run mode; can be all, job, or query.
  • kylin.source.hive.database-for-flat-table (default): Hive database in which the Hive intermediate tables are stored.
  • kylin.storage.hbase.compression-codec (none): compression algorithm used by HTables.
  • kylin.storage.hbase.table-name-prefix (kylin_): prefix of HTable names.
  • kylin.storage.hbase.namespace (default): default HTable namespace.
  • kylin.storage.hbase.region-cut-gb (5): region split size.
  • kylin.storage.hbase.hfile-size-gb (2): HFile size.
  • kylin.storage.hbase.min-region-count (1): minimum number of regions.
  • kylin.storage.hbase.max-region-count (500): maximum number of regions.
  • kylin.query.force-limit (-1): forces a LIMIT clause on select * statements.
  • kylin.query.pushdown.update-enabled (false): whether query pushdown update is enabled.
  • kylin.query.pushdown.cache-enabled (false): whether the pushdown query cache is enabled.
  • kylin.cube.is-automerge-enabled (true): whether segments are merged automatically.
  • kylin.metadata.hbase-client-scanner-timeout-period (10000): timeout for HBase data scans.
  • kylin.metadata.hbase-rpc-timeout (5000): timeout for an HBase RPC operation.
  • kylin.metadata.hbase-client-retries-number (1): number of HBase retries.

Some notes on the above parameters:

  • kylin.query.force-limit: defaults to no limit; 1000 is the recommended value.
  • kylin.storage.hbase.hfile-size-gb: can be set to 1 to help speed up the MR job.
  • kylin.storage.hbase.min-region-count: can be set to the number of HBase nodes to force data to be distributed across N nodes.
  • kylin.storage.hbase.compression-codec: no compression by default; configuring a compression codec for production environments is recommended.
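A hedged kylin.properties sketch applying the recommendations above (values are illustrative; the node count and the availability of snappy depend on your HBase cluster):

```properties
kylin.query.force-limit=1000
kylin.storage.hbase.hfile-size-gb=1
# e.g. with 5 HBase RegionServers:
kylin.storage.hbase.min-region-count=5
kylin.storage.hbase.compression-codec=snappy
```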

Spark Configuration

All Spark configuration properties prefixed with kylin.engine.spark-conf. can be managed in $KYLIN_HOME/conf/kylin.properties; of course, these parameters can also be overridden in a Cube's advanced configuration. The following Spark dynamic resource allocation settings are recommended:

```properties
# Run in yarn-cluster mode; you can also point to a standalone Spark cluster: spark://ip:7077
kylin.engine.spark-conf.spark.master=yarn
kylin.engine.spark-conf.spark.submit.deployMode=cluster
# Enable dynamic resource allocation
kylin.engine.spark-conf.spark.dynamicAllocation.enabled=true
kylin.engine.spark-conf.spark.dynamicAllocation.minExecutors=2
kylin.engine.spark-conf.spark.dynamicAllocation.maxExecutors=1000
kylin.engine.spark-conf.spark.dynamicAllocation.executorIdleTimeout=300
kylin.engine.spark-conf.spark.shuffle.service.enabled=true
kylin.engine.spark-conf.spark.shuffle.service.port=7337
# Memory settings
kylin.engine.spark-conf.spark.driver.memory=2g
# Increase executor memory for larger data or large dictionaries
kylin.engine.spark-conf.spark.executor.memory=4g
kylin.engine.spark-conf.spark.executor.cores=2
# Network timeout
kylin.engine.spark-conf.spark.network.timeout=600
# Partition size
kylin.engine.spark.rdd-partition-cut-mb=100
```

Cube Planner configuration

Cube Planner is a new feature added in V2.3. With it, you can see the number and combination of all cuboids after a Cube is built successfully. In addition, once configured, you can see how online queries match cuboids, which reveals hot, cold, and even unused cuboids; these observations guide further optimization of the Cube build. For how to use Cube Planner, refer to the official document: kylin.apache.org/cn/docs/tut…

References

  • Kylin official documentation
  • Kyligence_Enterprise_3_1-zh.pdf
  • Kylin 2.0 Spark Cubing optimization improvements
