

V3.0 Big Data E-commerce Warehouse Project (Scripts)

Ad Hoc Queries: Presto

Presto concept

Presto architecture

Presto pros and cons

Blog.csdn.net/u012551524/…

Impala is slightly ahead of Presto in performance, but Presto supports a richer set of data sources, including Hive, graph databases, traditional relational databases, Redis, and more.

Presto optimization: data storage

Partitioning properly

Similar to Hive, Presto reads partition data based on metadata information. Proper partitioning reduces Presto data reads and improves query performance.

Use column storage

Presto is optimized for reading ORC files. It is therefore recommended to store Hive tables queried by Presto in ORC format; Presto supports ORC better than Parquet.

Use compression

Data compression reduces the I/O bandwidth pressure caused by transferring data between nodes. For ad hoc queries, Snappy compression is recommended.

Presto optimization: query SQL

Select only the fields used

Because of columnar storage, selecting only the fields you need speeds up reading and reduces the amount of data transferred. Avoid SELECT * to read all fields.

[GOOD]: SELECT time, user, host FROM tbl

[BAD]:  SELECT * FROM tbl

Filter on the partition field

For partitioned tables, give priority to filtering on the partition field in the WHERE clause. Here acct_day is the partition field and visit_time is the specific access time.

[GOOD]: SELECT time, user, host FROM tbl where acct_day=20171101

[BAD]:  SELECT * FROM tbl where visit_time=20171101

GROUP BY statement optimization

Ordering the fields in a GROUP BY clause sensibly can improve performance. Sort the GROUP BY fields in descending order of how many distinct values each field has: fields with more distinct values come first.

[GOOD]: SELECT ... GROUP BY uid, gender

[BAD]:  SELECT ... GROUP BY gender, uid

Use LIMIT with ORDER BY

ORDER BY gathers all data onto a single worker node for sorting, which consumes a large amount of memory on that worker. If you are querying a Top-N or Bottom-N result, add LIMIT to reduce the sorting cost and memory pressure.

[GOOD]: SELECT * FROM tbl ORDER BY time LIMIT 100

[BAD]:  SELECT * FROM tbl ORDER BY time

Place the large table on the left in a JOIN

Presto's default join algorithm is broadcast join: the table on the left of the join is split across multiple workers, and the table on the right is copied in full to every worker for the computation. If the right-hand table is too large, an out-of-memory error may occur.

[GOOD] SELECT ... FROM large_table l join small_table s on l.id = s.id
[BAD]  SELECT ... FROM small_table s join large_table l on l.id = s.id

Caveats

Quoting field names

Avoid keyword conflicts: MySQL delimits field names with backticks, while Presto delimits field names with double quotation marks.

Time function

When comparing against a Timestamp in Presto, you need to add the TIMESTAMP keyword; in MySQL you can compare the Timestamp directly.

/*MySQL */
SELECT t FROM a WHERE t > '2017-01-01 00:00:00'; 

/*Presto */
SELECT t FROM a WHERE t > timestamp '2017-01-01 00:00:00';

INSERT OVERWRITE syntax is not supported

Presto does not support INSERT OVERWRITE; you can only DELETE the data and then INSERT INTO.

PARQUET format

Presto currently supports querying the Parquet format, but not inserting into it.

Ad Hoc Queries: Kylin

Kylin definition

Apache Kylin is an open-source distributed analysis engine that provides a SQL query interface on Hadoop/Spark and multidimensional analysis (OLAP) capabilities for extremely large datasets. Originally developed by eBay Inc. and contributed to the open-source community, it can query huge Hive tables with sub-second latency.

Kylin architecture

  • REST Server

REST Server is a set of entry points for building applications on the Kylin platform. Such applications can issue queries, fetch results, trigger cube build jobs, retrieve metadata, obtain user permissions, and so on. SQL queries can also be issued through the RESTful interface.

  • Query Engine

When the Cube is ready, the query engine is able to retrieve and parse the user query. It then interacts with other components in the system to return corresponding results to the user.

  • Routing

The initial design considered routing queries that Kylin could not execute down to Hive. In practice, however, the speed difference between Hive and Kylin turned out to be too large: users could not count on consistent query latency, with most queries returning in a few seconds while some took minutes, which made for a poor experience. This routing feature is therefore turned off by default in the release.

  • Metadata Management Tools

Kylin is a metadata-driven application. The metadata management tool is the key component for managing all metadata stored in Kylin, including the all-important cube metadata. All other components rely on the metadata management tool. Kylin's metadata is stored in HBase.

  • Task Build Engine

This engine is designed to handle all offline tasks, including shell scripts, Java APIs, and MapReduce jobs. The task engine manages and coordinates all tasks in Kylin to ensure that each one is executed and failures are resolved.

Kylin characteristics

Key features of Kylin include a standard SQL interface, support for very large datasets, sub-second response, scalability, high throughput, and BI tool integration.

  1. Standard SQL interface: Kylin uses standard SQL as the interface for external services.
  2. Support for large datasets: Kylin's capacity for big data may be the most advanced among currently available technologies. As early as 2015, eBay's production environment already supported second-level queries over tens of billions of records, and later hundreds of billions of records in mobile application scenarios.
  3. Sub-second response: Kylin offers excellent query latency thanks to pre-computation. Many complex operations, such as joins and aggregations, are completed during the offline pre-computation stage, which greatly reduces the computation required at query time and improves response speed.
  4. Scalability and high throughput: a single Kylin node can achieve 70 queries per second and can also build Kylin clusters.
  5. BI tool integration: Kylin integrates with existing BI tools, including ODBC (Tableau, Excel, PowerBI and other tools), JDBC (Saiku, BIRT and other Java tools), and a REST API (JavaScript and web pages). The Kylin development team has also contributed a plug-in for Zeppelin, which can likewise be used to access Kylin services.

Advanced usage

Duplicate key problems with full-snapshot and zipper dimension tables

The error occurs because, in the model, dwd_dim_user_info_his is a zipper dimension table and dwd_dim_sku_info is a daily full-snapshot table. Using the whole table as a dimension table therefore inevitably means that the same user_id or sku_id corresponds to multiple rows when joined against the order table. There are two solutions to this problem.

  • Solution 1: Create a temporary dimension table in Hive. The temporary table stores only the latest complete data of the dimension table. Select the temporary table as the dimension table when creating models in Kylin.
  • Solution 2: the idea is the same as Solution 1, but instead of a physical temporary table, use a view to achieve the same function.

Solution 2 is adopted here:

-- Zipper dimension table view
create view dwd_dim_user_info_his_view as select * from dwd_dim_user_info_his where end_date='9999-99-99';

-- Full-snapshot dimension table view
create view dwd_dim_sku_info_view as select * from dwd_dim_sku_info where dt=date_add(current_date, -1);

-- For now, create a view for 2020-03-10
create view dwd_dim_sku_info_view as select * from dwd_dim_sku_info where dt='2020-03-10';

How to build the cube automatically every day

Kylin provides a RESTful API, so the cube build command can be written into a script and handed to a scheduling tool such as Azkaban or Oozie for scheduled execution. The script is as follows:

#!/bin/bash
cube_name=order_cube
do_date=`date -d '-1 day' +%F`

# Epoch milliseconds at 00:00 of the target day
start_date=$(($(date -d "$do_date" +%s)*1000))
# 24 hours later
stop_date=$(($start_date+86400000))

curl -X PUT -H "Authorization: Basic QURNSU46S1lMSU4=" -H 'Content-Type: application/json' -d '{"startTime":'$start_date', "endTime":'$stop_date', "buildType":"BUILD"}' http://hadoop102:7070/kylin/api/cubes/$cube_name/build
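For reference, the one-day window the script passes to the REST API can be computed the same way in Python; a minimal sketch, assuming the build window is midnight to midnight in UTC (the shell script uses the machine's local time zone, and `build_window` is an illustrative name):

```python
import datetime

def build_window(day: datetime.date):
    """Return (startTime, endTime) in epoch milliseconds for one day,
    matching the {"startTime": ..., "endTime": ..., "buildType": "BUILD"}
    payload sent to Kylin's /kylin/api/cubes/<cube>/build endpoint."""
    start = int(datetime.datetime(day.year, day.month, day.day,
                                  tzinfo=datetime.timezone.utc).timestamp() * 1000)
    stop = start + 86_400_000  # exactly 24 hours later
    return start, stop

yesterday = datetime.date.today() - datetime.timedelta(days=1)
start_ms, stop_ms = build_window(yesterday)
```

The 86,400,000 constant is the same one the script adds to obtain stop_date.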

Kylin Cube construction principle

Each combination of dimensions is a Cuboid, and the set of all Cuboids is called the Cube.

Dimensions and measures

Dimension: the angle from which data is viewed. For example, employee data can be analyzed by gender, or in more detail by hire date or region. A dimension is a discrete set of values, such as male and female for gender, or each individual date for the time dimension. In statistics, records with the same dimension value can therefore be grouped together, and aggregate functions applied to compute sums, averages, maxima, minima, and so on.

Measure: the aggregated statistic being observed, i.e., the result of an aggregation operation, such as the number of employees of each gender, or the number of employees hired in a given year.

Cube and Cuboid

With dimensions and measures defined, every field on a data table or data model can be classified as either a dimension or a measure (something that can be aggregated). This gives rise to the Cube theory of pre-computation over dimensions and measures. Given a data model, we can aggregate over all of its dimensions; for N dimensions there are 2^N possible combinations. For each combination of dimensions, the measures are aggregated and the results saved as a materialized view called a Cuboid. The set of Cuboids for all dimension combinations, taken as a whole, is called the Cube.

As a simple example, assume that there is a sales data set of an e-commerce. The dimensions include time, item, location, and supplier, which are measured as sales volume.

  • Combinations of one dimension (1D) include [time], [item], [location], and [supplier], four in total;
  • Combinations of two dimensions (2D) include [time, item], [time, location], [time, supplier], [item, location], [item, supplier], and [location, supplier], six in total;
  • There are likewise four combinations of three dimensions (3D);
  • Finally, there is one zero-dimension (0D) and one four-dimension (4D) combination, for a total of 16. Sometimes the zero-dimension case is not counted, giving 15.

Note: each combination of dimensions is a Cuboid, and all 16 Cuboids together form the Cube.
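The counts above are easy to verify: every subset of the dimension set is one Cuboid. A quick sketch over the four example dimensions:

```python
from itertools import combinations

dims = ["time", "item", "location", "supplier"]

# Every subset of the dimension set is one Cuboid;
# the collection of all subsets is the Cube.
cuboids = [combo for r in range(len(dims) + 1)
           for combo in combinations(dims, r)]

by_size = {r: [c for c in cuboids if len(c) == r]
           for r in range(len(dims) + 1)}

print(len(cuboids))     # 2^4 = 16 cuboids in total
print(len(by_size[2]))  # six 2D combinations
```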

Principle of Cube storage

The Cuboid ID indicates which dimensions are present: a bit is 1 if the dimension exists in the Cuboid and 0 if it does not. The dimension values give the encoded values of the dimensions present in the Cuboid. For example, in the third row, "101+10" indicates that the customer_address and order_data dimensions exist; customer_address has encoded value 1, corresponding to Shanghai, and order_data has encoded value 0, corresponding to 2019-01-09.
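The rowkey layout described above, a Cuboid ID bitmask followed by the encoded dimension values, can be sketched as follows. The dimension order and dictionary encodings here are hypothetical, chosen so the example reproduces the "101+10" row from the text:

```python
# Hypothetical dimension order and dictionary encodings
dims = ["customer_address", "product_name", "order_data"]
encodings = {
    "customer_address": {"Beijing": "0", "Shanghai": "1"},
    "product_name": {"phone": "0", "laptop": "1"},
    "order_data": {"2019-01-09": "0", "2019-01-10": "1"},
}

def rowkey(values):
    """Cuboid ID (presence bitmask over the fixed dimension order)
    plus the encoded values of the dimensions that are present."""
    cuboid_id = "".join("1" if d in values else "0" for d in dims)
    dim_values = "".join(encodings[d][values[d]] for d in dims if d in values)
    return cuboid_id + "+" + dim_values

print(rowkey({"customer_address": "Shanghai", "order_data": "2019-01-09"}))
# -> "101+10": dimensions 1 and 3 present, Shanghai encoded as 1, 2019-01-09 as 0
```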

Cube construction algorithm

Layer-by-layer build algorithm

As we know, an N-dimensional Cube is composed of one N-dimensional subcube, N (N-1)-dimensional subcubes, N*(N-1)/2 (N-2)-dimensional subcubes, ..., N one-dimensional subcubes, and one zero-dimensional subcube, for a total of 2^N subcubes. In the layer-by-layer algorithm, the number of dimensions decreases layer by layer, and each layer's calculation (except the first, which is aggregated from the raw data) is based on the results of the layer above it. For example, the result of [Group by A, B] can be aggregated from the result of [Group by A, B, C] by removing C, which avoids repeated computation. Once the zero-dimensional Cuboid has been calculated, the entire Cube is complete. Each round of calculation is a MapReduce job, and the rounds run sequentially; an N-dimensional Cube requires at least N MapReduce jobs.
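The key step, deriving a child cuboid such as [Group by A, B] from its parent [Group by A, B, C] instead of rescanning the raw data, can be sketched in a few lines. This is an in-memory illustration, not Kylin's MapReduce implementation, and the keys and measure values are made up:

```python
from collections import defaultdict

# Measures already aggregated for the (A, B, C) cuboid: key tuple -> summed measure
abc = {("a1", "b1", "c1"): 10, ("a1", "b1", "c2"): 5, ("a2", "b1", "c1"): 7}

def roll_up(parent, drop_index):
    """Compute a child cuboid by removing one dimension from the parent's
    keys and re-aggregating the measure over the shortened keys."""
    child = defaultdict(int)
    for key, measure in parent.items():
        child[key[:drop_index] + key[drop_index + 1:]] += measure
    return dict(child)

ab = roll_up(abc, 2)  # drop C: the [Group by A, B] cuboid
print(ab)             # {('a1', 'b1'): 15, ('a2', 'b1'): 7}
```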

Advantages of the algorithm:

  • This algorithm makes full use of the advantages of MapReduce and handles the complicated sorting and shuffle tasks in the middle, so the algorithm code is clear, simple and easy to maintain.
  • Thanks to the growing maturity of Hadoop, this algorithm is very stable and can be guaranteed to complete even when cluster resources are tight.

Disadvantages of the algorithm:

  • When Cube has more dimensions, the number of MapReduce tasks required increases accordingly. Because the task scheduling of Hadoop requires extra resources, especially when the cluster is large, the extra cost caused by repeatedly submitting tasks will be considerable.
  • Because the Mapper logic does not perform aggregation operations, each round of MR shuffle has a large workload, resulting in low efficiency.
  • There are many read and write operations on the HDFS. Because the output of each layer’s calculation is used as the input of the next layer’s calculation, these key-values must be written to the HDFS. When all the calculations are complete, Kylin takes an extra round of work to convert the files to HBase HFile format for import into HBase.

In general, the efficiency of this algorithm is low, especially when the number of Cube dimensions is large.

Fast Build Algorithm (INMEM)

Also known as the "by segment" or "by split" algorithm, it was introduced in 1.5.x. Its main idea is that each Mapper computes the data block assigned to it into a complete small Cube segment (including all Cuboids). Each Mapper then outputs its computed segment to a Reducer for merging into the large Cube, which is the final result. This process is illustrated in the figure.

Compared with the old algorithm, the fast algorithm has two main differences:

  • The Mapper performs pre-aggregation in memory, computing all combinations; every key a Mapper outputs is distinct. This reduces the amount of data output to Hadoop MapReduce, and a Combiner is no longer needed;
  • A round of MapReduce completes all levels of computation, reducing the allocation of Hadoop tasks.

Kylin Cube build optimization

Use Aggregation Groups

The aggregation group is a powerful pruning tool. Dimensions within a group can be given any of the following relationships, described below.

Mandatory dimensions

If a dimension is defined as mandatory, every Cuboid generated by this group will contain that dimension. Each group can have zero, one, or more mandatory dimensions. If, according to the business logic of the group, the relevant queries always include this dimension in their filter or group-by conditions, you can set it as a mandatory dimension in the group.

Hierarchy dimensions

Each hierarchy contains two or more dimensions. Suppose a hierarchy contains the n dimensions D1, D2, ..., Dn; then in any Cuboid generated by this group, these n dimensions may only appear in one of the n+1 forms (), (D1), (D1, D2), ..., (D1, D2, ..., Dn). Each group can contain zero, one, or more hierarchies, and different hierarchies should not share dimensions. If, according to the business logic of the group, several dimensions form a direct hierarchy, they can be set as hierarchy dimensions in the group.
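The n+1 admissible forms of a hierarchy are exactly its prefixes, which can be sketched directly (the country/province/city hierarchy is a hypothetical example):

```python
def hierarchy_forms(dims):
    """For a hierarchy D1, D2, ..., Dn, only the n+1 prefixes
    (), (D1), (D1, D2), ... may appear in any cuboid of the group;
    all other subsets of these dimensions are pruned."""
    return [tuple(dims[:i]) for i in range(len(dims) + 1)]

print(hierarchy_forms(["country", "province", "city"]))
# [(), ('country',), ('country', 'province'), ('country', 'province', 'city')]
```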

Joint Dimensions

Each joint contains two or more dimensions. If some columns form a joint, then in any Cuboid generated by this group those dimensions either all appear or all do not appear. Each group can contain zero or more joints, but different joints should not share dimensions (otherwise they could be merged into one joint). If, according to the business logic of the group, several dimensions always appear together in queries, you can set them as joint dimensions in the group.

Rowkey optimization

Kylin concatenates all the dimensions into a complete rowkey and sorts all rows in a Cuboid in ascending rowkey order. A well-designed rowkey lets queries filter and locate data more effectively, reduces I/O, and improves query speed; it has a significant impact on query performance. The rowkey design principles are as follows:

The dimensions that are used as filters come first

Dimensions with large cardinality precede those with small cardinality.

Cardinality here means the number of distinct values. For example, in the figure below, dimension C has 6 distinct values and dimension D has 3, so C comes before D.
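Combining the two principles, the rowkey dimension order can be derived mechanically: filter dimensions first, then the remaining dimensions by descending cardinality. A sketch with hypothetical dimension names and distinct counts:

```python
def rowkey_order(dims, cardinality, filter_dims):
    """Order dimensions for the rowkey: dimensions used as query filters
    come first, then the rest sorted by cardinality, largest first."""
    filters = [d for d in dims if d in filter_dims]
    rest = sorted((d for d in dims if d not in filter_dims),
                  key=lambda d: cardinality[d], reverse=True)
    return filters + rest

card = {"A": 100, "B": 20, "C": 6, "D": 3}  # hypothetical distinct counts
print(rowkey_order(["A", "B", "C", "D"], card, {"B"}))  # ['B', 'A', 'C', 'D']
```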

Concurrency granularity optimization

When a Cuboid in a Segment exceeds a certain size threshold, the system shards that Cuboid's data into multiple partitions so that Cuboid data can be read in parallel, optimizing Cube query speed. Concretely, the build engine decides how many partitions are needed to store the Segment based on the Segment's estimated size and the parameter kylin.hbase.region.cut. If the storage engine is HBase, the number of partitions corresponds to the number of regions in HBase.

kylin.hbase.region.cut defaults to 5.0 (in GB), which means the build engine will allocate 10 partitions to a Segment estimated at 50 GB. Users can also set kylin.hbase.region.count.min (default 1) and kylin.hbase.region.count.max (default 500) to bound the minimum and maximum number of partitions each Segment is divided into.
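As a sanity check, the partition count follows a simple formula; the defaults below are taken from the parameters quoted above, and the function name and rounding choice are illustrative:

```python
import math

def segment_partitions(segment_gb, cut_gb=5.0, count_min=1, count_max=500):
    """Estimate how many partitions (HBase regions) a Segment gets:
    estimated size / kylin.hbase.region.cut, clamped to the range
    [kylin.hbase.region.count.min, kylin.hbase.region.count.max]."""
    n = math.ceil(segment_gb / cut_gb)
    return max(count_min, min(count_max, n))

print(segment_partitions(50))  # 50 GB / 5.0 GB -> 10 partitions
```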