Hive is a common data warehouse component in big data systems, and it must be used efficiently during design and development. Hive efficiency is affected not only by large data volumes; factors such as data skew, data redundancy, too many jobs or too much I/O, and an inappropriate MapReduce allocation also matter. Hive tuning therefore covers optimization of HiveQL statements, Hive configuration items, and the MapReduce jobs themselves.

This chapter covers three aspects: architecture optimization, parameter optimization, and SQL optimization.

1. Architecture

In terms of execution engines, we can choose a more suitable and faster engine such as MR, Tez, or Spark, according to the resources available on the company's platform.
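For example, assuming Tez is installed on the cluster, the engine can be switched per session (a sketch; `hive.execution.engine` accepts `mr`, `tez`, or `spark`):

```sql
-- Switch the execution engine for the current session
SET hive.execution.engine=tez;
```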

If the Tez engine is selected, the vectorized optimizer can be enabled, and the cost-based optimizer (CBO) can be turned on, as follows:

set hive.vectorized.execution.enabled = true;        -- default false
set hive.vectorized.execution.reduce.enabled = true; -- default false
SET hive.cbo.enable=true;                  -- default true since v0.14.0
SET hive.compute.query.using.stats=true;   -- default false
SET hive.stats.fetch.column.stats=true;    -- default false
SET hive.stats.fetch.partition.stats=true; -- default true

Optimize the table design: choose partitioned tables, bucketed tables, and a suitable storage format. To reduce data transfer, compression can be used. Here are several parameters (more can be found on the official website).
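A minimal sketch of such a table design, using a hypothetical `user_log` table: partitioned by date, bucketed by user id, and stored as ORC with Snappy compression:

```sql
-- Hypothetical example: partitioned + bucketed + columnar storage
CREATE TABLE user_log (
  user_id BIGINT,
  action  STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```

Partitioning by `dt` lets queries prune whole directories, and bucketing by `user_id` enables bucket map joins on that key.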

-- Intermediate result compression
SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Output compression
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

2. Parameter optimization

The second part is parameter optimization. In fact, some of the architecture choices above are also controlled by parameters. This part mainly covers the following aspects:

local mode, strict mode, JVM reuse, parallel execution, speculative execution, merging small files, and fetch mode.

2.1 Local Mode

If the amount of data is small, the overhead of starting distributed processing outweighs the work itself, and local mode is faster:

SET hive.exec.mode.local.auto=true;                    -- default false
SET hive.exec.mode.local.auto.inputbytes.max=50000000; -- total input size must be below this value
SET hive.exec.mode.local.auto.input.files.max=5;       -- number of map tasks must be below this value (default 4)

2.2 Strict Mode

Strict mode is a switch: when it is off, the following three kinds of statements execute normally; when it is on, they fail automatically.

SET hive.mapred.mode=strict; -- default nonstrict
-- 1. Queries on a partitioned table that do not filter on a partition column;
-- 2. Joins that produce a Cartesian product;
-- 3. order by without a limit clause.

2.3 JVM Reuse

In MR, the unit of execution is the process, and each process is a JVM. For short jobs these JVMs can be reused instead of being restarted for every task. The drawback is that reused task slots are held until the whole job completes, which is especially noticeable under data skew. Turn this on with the following parameter:

SET mapreduce.job.jvm.numtasks=5;

2.4 Parallel Execution

A Hive query is converted into one or more stages; stages that are not interdependent can be executed in parallel using the following parameters:

SET hive.exec.parallel=true;             -- default false
SET hive.exec.parallel.thread.number=16; -- default 8

2.5 Speculative Execution

Speculative execution trades extra resources for time. If some tasks run very slowly because of network or resource imbalance, a backup task is started to process the same data, and the result of whichever finishes first is taken as the final result.

set mapreduce.map.speculative=true;
set mapreduce.reduce.speculative=true;
set hive.mapred.reduce.tasks.speculative.execution=true;

2.6 Merging Small Files

Before map execution, merge small files to reduce the number of map tasks:

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

At the end of the task, merge the small files

-- Merge small files at the end of a map-only job, default true
SET hive.merge.mapfiles = true;
-- Merge small files at the end of a map-reduce job, default false
SET hive.merge.mapredfiles = true;
-- Target size of merged files, default 256M
SET hive.merge.size.per.task = 268435456;
-- When the average output file size is smaller than this value (default 16M),
-- start an extra map-reduce task to merge the files
SET hive.merge.smallfiles.avgsize = 16777216;

2.7 Fetch Mode

The last one is fetch mode, which tries not to run MR at all in certain cases, such as queries over a few fields, simple column lookups, and limit queries:

SET hive.fetch.task.conversion=more;

3. SQL optimization

This part is more involved and may run into data skew. Data skew has always been an unavoidable problem in big data processing, and there are many ways to handle it.

3.1 SQL optimization

SQL optimization is one of the areas developers can control most easily, and it is often a matter of experience.

Column and partition pruning, sort by instead of order by, group by instead of count(distinct), group by pre-aggregation (controlled by parameters), the skew configuration items, map join, filtering null values separately, and adjusting the number of map and reduce tasks appropriately: you will encounter almost all of these at work, and it is your job to optimize them as much as possible.
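As one example from the list, a map join broadcasts the small table to every mapper so that no shuffle is needed. A sketch, assuming hypothetical tables `big_table` and `small_dim` joined on `id` (recent Hive versions convert the join automatically when `hive.auto.convert.join` is on):

```sql
SET hive.auto.convert.join=true;               -- let Hive convert to map join automatically
SET hive.mapjoin.smalltable.filesize=25000000; -- small-table threshold, default 25M

-- Hypothetical tables; the small dimension table is broadcast to the mappers
SELECT /*+ MAPJOIN(s) */ b.id, b.amount, s.name
  FROM big_table b
  JOIN small_dim s ON b.id = s.id;
```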

3.2 Skew Balancing Configuration Item

Skew join: this configuration item is similar to the skew balancing configuration item for group by. It is enabled with hive.optimize.skewjoin (default false). When it is on, during the join Hive temporarily writes the rows whose keys exceed the threshold hive.skewjoin.key (default 100000 rows) to files, then starts another job to run a map join on them and merge the results. The parameter hive.skewjoin.mapjoin.map.tasks controls the number of mappers of this second job (default 10000).
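The settings described above, together with the analogous group-by switch this section mentions, look like this (a sketch; the values shown are my understanding of the defaults):

```sql
SET hive.optimize.skewjoin=true;           -- default false
SET hive.skewjoin.key=100000;              -- rows per key before a key counts as skewed
SET hive.skewjoin.mapjoin.map.tasks=10000; -- mappers for the follow-up map join
-- The analogous skew balancing switch for group by
SET hive.groupby.skewindata=true;          -- default false
```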

3.3 Handling Skewed Keys Separately

If the skewed keys are meaningful data, extract them (there are usually only a few): store the corresponding rows in a temporary table, prepend a small random prefix (for example 0~9), aggregate, then strip the prefix and aggregate again. Do not write too many joins in one select statement; make sure you know the business and you know the data. Split the work into multiple statements and execute them step by step (e.g. A0-A4, then A5-A9), and perform the joins between large tables and small tables first.
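A sketch of the random-prefix trick, assuming a hypothetical table `t(key, value)` in which a few keys are heavily skewed:

```sql
-- Stage 1: prepend a random 0~9 prefix and pre-aggregate, spreading each hot key
-- across up to 10 reducers. Stage 2: strip the prefix and sum the partial results.
SELECT substr(pkey, 3) AS key, sum(psum) AS total
  FROM (SELECT pkey, sum(value) AS psum
          FROM (SELECT concat(floor(rand() * 10), '_', key) AS pkey, value
                  FROM t) a
         GROUP BY pkey
       ) b
 GROUP BY substr(pkey, 3);
```

The prefix is computed in the innermost subquery so that each row gets exactly one prefix; grouping on a `rand()` expression directly would re-evaluate it.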

4. Two SQL Problems

4.1 Find all teams that have won in 3 or more consecutive years

Team, Year
Piston, 1990
Bull, 1991
Bull, 1992


-- 1. Rank each team's winning years
select team, year,
       row_number() over (partition by team order by year) as rank
  from t1;

-- 2. Obtain a group id: for consecutive years, year minus rank is a constant
select team, year,
       row_number() over (partition by team order by year) as rank,
       (year - row_number() over (partition by team order by year)) as groupid
  from t1;

-- 3. Group by team and group id, keep groups of 3 or more
select team, count(1) as years
  from (select team,
               (year - row_number() over (partition by team order by year)) as groupid
          from t1
       ) tmp
 group by team, groupid
having count(1) >= 3;

4.2 Find out all the peaks and troughs of each ID in a day

A peak is a row whose value is greater than both the value at the previous moment and the value at the next moment. Sample data, with the previous value (lag) and next value (lead) shown:

id       time   price  lag(price)  lead(price)
SH66688  9:35   29.48  NULL        28.72
SH66688  9:40   28.72  29.48       27.74
SH66688  9:45   27.74  28.72       26.75
SH66688  9:50   26.75  27.74       27.13
SH66688  9:55   27.13  26.75       26.30
SH66688  10:00  26.30  27.13       27.09
SH66688  10:05  27.09  26.30       26.46
SH66688  10:10  26.46  27.09       26.11
SH66688  10:15  26.11  26.46       26.88
SH66688  10:20  26.88  26.11       27.49
SH66688  10:25  27.49  26.88       26.70
SH66688  10:30  26.70  27.49       27.57
SH66688  10:35  27.57  26.70       28.26
SH66688  10:40  28.26  27.57       28.03
SH66688  10:45  28.03  28.26       NULL

The key is to identify the characteristics of peaks and troughs:
-- A peak is greater than both the previous value and the next value
-- A trough is smaller than both the previous value and the next value
With this characteristic, write the SQL:

select id, time, price,
       case when price > beforeprice and price > afterprice then 'peak'
            when price < beforeprice and price < afterprice then 'trough'
       end as feature
  from (select id, time, price,
               lag(price)  over (partition by id order by time) beforeprice,
               lead(price) over (partition by id order by time) afterprice
          from t2
       ) tmp
 where (price > beforeprice and price > afterprice)
    or (price < beforeprice and price < afterprice);
