As an important framework in the big data ecosystem, Hive has become one of the most widely used tools for building enterprise-level data warehouses thanks to its stability and ease of use.
However, it is hard to build a well-performing data warehouse if we use Hive without considering performance, so Hive performance tuning is a skill that big data practitioners must master. This article explains some of the ideas and techniques behind Hive performance tuning.
How to troubleshoot Hive performance problems
When we find that a SQL statement takes too long to execute, or its runtime seems unreasonable, we need to consider optimizing it. The first step of optimization is to locate the problem. So how do we go about troubleshooting it?
Those of you who regularly use relational databases probably know the standard trick for optimizing them: look at the execution plan. Oracle, for example, offers several kinds of execution plans. By combining them, you can see the plan derived from statistical information, that is, the plan Oracle infers without actually running the query, as well as the plan of the task that was actually executed, and you can observe the main flow from reading the data to producing the final result, along with the quantitative data at each intermediate step. It is fair to say that in Oracle development, SQL tuning is not difficult once you master the right steps and know how to read the different execution plans.
Hive also has execution plans, but they are predictive. Unlike Oracle and SQL Server, which provide actual plans with quantified data such as the amount of data processed at each stage, the resources consumed, and the processing time, the execution plan provided by Hive contains none of this. This means that while Hive users can understand the overall execution logic of a SQL statement, they cannot tell how many resources each stage consumes or where the overall bottleneck of the SQL lies.
To see how each stage of a HiveSQL query actually ran, check the logs provided by YARN. A link to the logs can be found in the messages printed on the console after each job finishes.
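For example, once a job finishes, the console output includes the application ID, and you can pull the complete logs with the YARN CLI (the application ID below is a made-up placeholder):

```bash
yarn logs -applicationId application_1612345678901_0001
```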
Hive currently provides the following ways to view the execution plan:
- View the basic information of the execution plan: EXPLAIN;
- View the extended information of the execution plan: EXPLAIN EXTENDED;
- View the data input dependencies of the SQL: EXPLAIN DEPENDENCY;
- View the permissions involved in the SQL operations: EXPLAIN AUTHORIZATION;
- View the vectorization information of the SQL: EXPLAIN VECTORIZATION.
Prefixing a query with the keyword EXPLAIN is the basic way to view its execution plan. A plan generated with EXPLAIN contains the following two parts:
- The job’s dependency diagram, i.e. Stage Dependencies;
- Details of each job, i.e. Stage Plans.
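As a minimal illustration, here is how you would prefix a simple aggregation with EXPLAIN (using the stu table from the examples later in this article):

```sql
EXPLAIN
select s_age, count(*) as cnt
from stu
group by s_age;
```

The output begins with STAGE DEPENDENCIES, describing how the stages depend on each other, followed by STAGE PLANS, the operator details of each stage.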
A detailed explanation of Hive's Explain execution plan can be found in my previous post:
The underlying principles of Hive: Explain the execution plan
Note: Using Explain to view the execution plan is a very important part of Hive performance tuning, so be sure to master it!
In summary, there are two ways to troubleshoot SQL performance problems in Hive:
- Use Explain to view the execution plan;
- Check out the logs provided by YARN.
How to tune Hive performance
Why is performance optimization difficult work? Because optimizing a technology is necessarily a comprehensive effort that combines multiple techniques. If we limit ourselves to a single technique, we certainly cannot optimize well.
Let’s take a look at the diversity of Hive optimization from several completely different perspectives.
1. SQL statement optimization
SQL statement optimization covers too much ground to introduce exhaustively here, so given the limited space, let's take a few typical examples to learn the underlying ideas; when you run into similar tuning problems later, you can think along these lines.
1. union all
```sql
insert into table stu partition(tp)
select s_age, max(s_birth) stat, 'max' tp
from stu_ori
group by s_age

union all

insert into table stu partition(tp)
select s_age, min(s_birth) stat, 'min' tp
from stu_ori
group by s_age;
```
Let's briefly analyze the SQL above: it takes the maximum and the minimum birthday for each age and writes them into the same table. The two statements before and after the union all both group the same table by s_age and then take the maximum and the minimum respectively. Grouping the same field of the same table twice is a huge waste. Can we rewrite it? Of course: Hive's multi-insert syntax, which puts the from clause first, lets a single scan of the table feed multiple inserts:
```sql
-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

from stu_ori
insert into table stu partition(tp)
select s_age, max(s_birth) stat, 'max' tp
group by s_age
insert into table stu partition(tp)
select s_age, min(s_birth) stat, 'min' tp
group by s_age;
```
The SQL above groups the s_age field of the stu_ori table only once and performs two separate inserts.
This example shows us that we need to know SQL more deeply. If we didn't know the multi-insert syntax, we wouldn't have thought of this rewrite.
2. distinct
First look at a SQL statement that counts the number of distinct age values in the stu table:
```sql
select count(1)
from (
    select s_age
    from stu
    group by s_age
) b;
```
This simply counts the enumerated age values. Why not use distinct instead?
```sql
select count(distinct s_age)
from stu;
```
Some people say that the first method can effectively avoid data skew on the Reduce side when the data volume is particularly large, but is that really the case?
Setting aside the scenario of extremely large data volumes, for the current business and environment, using distinct will definitely be more efficient than the subquery method above. There are several reasons for this:
- The enumerated values of age are very limited: even counting ages from 1 to 100, s_age has at most 100 distinct values. In MapReduce terms, s_age is deduplicated within each Map in the Map phase; since its enumerated values are limited, each Map emits a limited set of s_age values, and the amount of data shuffled is at most (number of Maps) × (number of distinct s_age values).
- distinct builds a hashtable in memory, so finding and deduplicating a value takes O(1) time. group by behavior varies greatly from version to version: some versions deduplicate by building a hashtable, while others deduplicate by sorting, and the best sorting time complexity cannot reach O(1). Moreover, the first method (group by) translates into two MapReduce tasks, consuming more disk and network I/O.
- Hive 3.0 added an optimization for count(distinct): with the configuration hive.optimize.countdistinct, even genuinely skewed data can be optimized automatically by rewriting the execution logic of the SQL (see the snippet after this list).
- The second method (distinct) is more concise than the first (group by), and what it expresses is simple and clear. If there is no special problem, concise code is better!
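A minimal sketch of this configuration (in Hive 3.0 it defaults to true, so setting it explicitly is only for illustration):

```sql
set hive.optimize.countdistinct=true;

select count(distinct s_age)
from stu;
```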
This example tells us not to over-optimize. Tuning should happen at the right time: tuning too early may produce useless work or even negative effects, and the effort invested may be out of proportion to the return. Tuning needs to follow certain principles.
2. Data format optimization
Hive provides a variety of data storage organization formats, and different formats can have a significant impact on the efficiency of the program.
Hive provides formats such as TextFile, SequenceFile, RCFile, ORC, and Parquet.
SequenceFile is a flat file with a binary key/value pair structure. It was widely used on the early Hadoop platform as a MapReduce input/output format as well as a data storage format.
Parquet is a columnar storage format compatible with multiple computing engines, such as MapReduce and Spark, and provides good performance for deeply nested data structures. It is one of the mainstream storage choices in current Hive production environments.
ORC is an optimization of RCFile. It provides an efficient way to store Hive data, improves the performance of reading, writing, and processing data in Hive, and is compatible with a variety of computing engines. In real production environments, ORC has likewise become one of the mainstream storage choices for Hive.
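As a quick sketch of switching formats (the table name stu_orc is a placeholder), converting an existing table is a matter of declaring the storage format:

```sql
-- create an ORC copy of an existing table via CTAS
create table stu_orc stored as orc
as select * from stu;
```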
We run the same SQL statement on the same data stored in different formats and get the following execution times:
Data format | CPU time | User wait time |
---|---|---|
TextFile | 33 min | 171 s |
SequenceFile | 38 min | 162 s |
Parquet | 2 min 22 s | 50 s |
ORC | 1 min 52 s | 56 s |
Note:
- CPU time: the CPU time consumed by the server to run the program.
- User wait time: all of the time the user waits from submitting the job to getting the result back.
Querying the TextFile table takes 33 minutes, while querying the ORC table takes only 1 minute 52 seconds, a dramatic reduction. This shows that the data storage format can have a great impact on HiveSQL performance.
3. Optimization for too many small files
If there are too many small files, Hive treats each small file as a block during a query and starts one Map task to process it. But starting and initializing a Map task takes far longer than the logical processing itself, which is a great waste of resources. Moreover, the number of Maps that can run at the same time is limited.
Therefore, it is necessary to optimize away the excess of small files. As for the solutions, I have written a dedicated article; for details, see:
Solved the Hive issue of excessive small files
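One common mitigation is Hive's built-in output-file merging; a sketch with illustrative (not recommended) thresholds:

```sql
set hive.merge.mapfiles=true;                 -- merge small files produced by map-only jobs
set hive.merge.mapredfiles=true;              -- merge small files produced by MapReduce jobs
set hive.merge.size.per.task=256000000;       -- target size of merged files, in bytes
set hive.merge.smallfiles.avgsize=16000000;   -- start an extra merge job when the average output file is smaller than this
```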
4. Parallel execution optimization
Hive turns a query into one or more stages, which can be a MapReduce stage, a sampling stage, a merge stage, a limit stage, or any other stage Hive may need during execution. By default, Hive executes only one stage at a time. A particular job, however, may contain many stages, and they may not be fully dependent on each other; that is, some stages can run in parallel, which may shorten the overall execution time of the job. The more stages that can run in parallel, the faster the job may complete.
Parallel execution can be enabled by setting the parameter hive.exec.parallel to true. On a shared cluster, note that cluster utilization will increase as the number of parallel stages in a job increases.
```sql
set hive.exec.parallel=true;              -- enable parallel execution of tasks
set hive.exec.parallel.thread.number=16;  -- maximum parallelism allowed for one SQL; the default is 8
```
Of course, this only helps when system resources are relatively idle; without spare resources, the parallel stages cannot start.
5. JVM optimization
JVM reuse is a Hadoop tuning parameter that can have a significant impact on Hive performance, especially in scenarios where small files are hard to avoid or where there are very many tasks, most of which have short execution times.
Hadoop's default behavior is usually to fork a new JVM for each map or Reduce task. The JVM startup overhead can be considerable, especially when the job contains hundreds or thousands of tasks. JVM reuse allows a JVM instance to be reused up to N times within the same job. The value of N can be configured in Hadoop's mapred-site.xml file; it is usually between 10 and 20, depending on the specific business scenario.
```xml
<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit.
  </description>
</property>
```
We can also set this in Hive:

```sql
set mapred.job.reuse.jvm.num.tasks=10;  -- number of tasks each JVM instance may run before being replaced
```
The disadvantage of this feature is that enabling JVM reuse will hold on to the used task slots for reuse until the job completes. If a few reduce tasks in an "unbalanced" job take much longer to execute than the other reduce tasks, the reserved slots stay idle and cannot be used by other jobs until all of that job's tasks finish.
6. Speculative execution optimization
In a distributed cluster environment, program bugs (including bugs in Hadoop itself), uneven load, or uneven resource distribution can cause the running speed of multiple tasks of the same job to be inconsistent. Some tasks may run significantly slower than the others (for example, one task of a job is only 50% complete when all the other tasks have already finished), and these tasks drag down the overall progress of the job. To prevent this, Hadoop uses a speculative execution mechanism: it identifies "lagging" tasks according to certain rules and launches a backup task for each of them, which processes the same data as the original task at the same time. Whichever task finishes first is taken as the final result.
To enable speculative execution, configure the following in Hadoop's mapred-site.xml file:
```xml
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some map tasks
  may be executed in parallel.</description>
</property>

<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some reduce tasks
  may be executed in parallel.</description>
</property>
```
Hive itself also provides a configuration item to control reduce-side speculative execution:

```sql
set hive.mapred.reduce.tasks.speculative.execution=true;
```
It is hard to give a specific recommendation for tuning these speculative execution parameters. If users are sensitive to runtime deviations, these features can be turned off. And if the input data volume is so large that a map or Reduce task naturally runs for a long time, the waste caused by launching speculative backup tasks can be enormous.
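For example, to turn speculative execution off for the current session when runtime deviation matters (a sketch; whether this is appropriate depends on your workload):

```sql
set mapreduce.map.speculative=false;
set mapreduce.reduce.speculative=false;
set hive.mapred.reduce.tasks.speculative.execution=false;
```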
Finally
Code optimization principles:
- Understand the requirements thoroughly; this is the foundation of optimization.
- Grasp the full data link; this is the context of optimization.
- Keep your code simple; this makes optimization easier.
- Talking about optimization before there is a bottleneck is asking for trouble.