Hive is an important framework for the big data platform. It is one of the most commonly used frameworks for building enterprise-level data warehouses due to its stability and ease of use.
However, if we limit ourselves to using Hive without considering performance, it is difficult to build a first-class warehouse, so Hive performance tuning is a skill that we big data practitioners must master. This article will show you some methods and techniques for Hive performance tuning.
How to troubleshoot Hive performance problems
When we find that a SQL statement's execution time is too long or unreasonable, we need to consider optimizing it. Optimization starts with troubleshooting the problem, so what means do we have to troubleshoot it?
Those of you who have used relational databases a lot may know the trick to optimizing them: examine the execution plan. Oracle, for example, offers several kinds of execution plans. Used together, they let you see the plan deduced from statistical information, that is, the plan Oracle infers without actually running the statement, as well as the plan from the actual execution of the task, where you can observe the main flow from data reading to final presentation and the quantified data at each intermediate step. It can be said that in Oracle development, given the right tools and the different execution plans, SQL tuning is not a difficult task.
Hive also has execution plans, but they are all predicted ones. Unlike Oracle and SQL Server, Hive has no real execution plan in which you can see quantified data for each stage, such as the amount of data processed, the resources consumed, and the processing time. The execution plan provided by Hive does not contain this data, which means that although a Hive user can see the entire SQL execution logic, the resource consumption of each phase and the bottleneck of the whole SQL statement remain unclear.
To view the runtime information for each phase of a HiveSQL job, check the logs provided by YARN. The link to the logs can be found in the information printed to the console after each job finishes.
You can view the following information about Hive execution plans:
- View the basic information of the execution plan: explain;
- View the extended information of the execution plan: explain extended;
- View the SQL statement's data input dependencies: explain dependency;
- View the permissions involved in the SQL operation: explain authorization;
- View the SQL vectorization description: explain vectorization.
Prefixing a query statement with the keyword explain is the basic way to view its execution plan. An execution plan opened with explain consists of the following two parts (see the example after this list):
- The dependency graph between the jobs, i.e. STAGE DEPENDENCIES;
- The details of each job, i.e. STAGE PLANS.
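As a minimal sketch (stu and s_age are the illustrative table and column names used later in this article), any query can be prefixed this way, and the output is organized into the two parts above:

```sql
-- Print the execution plan: STAGE DEPENDENCIES first, then STAGE PLANS.
explain
select s_age, count(1)
from stu
group by s_age;

-- The variants listed above are used the same way, for example:
explain dependency
select s_age, count(1)
from stu
group by s_age;
```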
For a detailed explanation of the explain execution plan in Hive, see my previous post:
Hive underlying principles: detailed explanation of the explain execution plan
Note: Using Explain to view execution plans is an important way to tune Hive performance.
Summary: How to troubleshoot SQL statement performance problems
- Use Explain to view the execution plan;
- View logs provided by YARN.
Methods for tuning Hive performance
Why does everyone say performance optimization is difficult work? Because optimizing a technology is necessarily comprehensive work: it is the combination of many technologies. If we confine ourselves to a single technology, we certainly cannot optimize well.
Let’s take a look at the variety of Hive optimizations from a number of different perspectives.
1. Optimize SQL statements
SQL statement optimization covers too much material to introduce piece by piece in the limited space here, so I will take a few typical examples to convey the way of thinking; when you meet similar tuning problems later, you can think along these lines.
1. union all
```sql
insert into table stu partition(tp)
select s_age,max(s_birth) stat,'max' tp
from stu_ori
group by s_age
union all
insert into table stu partition(tp)
select s_age,min(s_birth) stat,'min' tp
from stu_ori
group by s_age;
```
Looking at the SQL above, the two branches of the union all both group the same table stu_ori by the same field s_age; grouping the same field of the same table twice is a waste. Hive's multi-insert syntax (insert into ... with the from clause moved to the front) allows multiple inserts to be performed over a single scan of the table:
```sql
-- Enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

from stu_ori
insert into table stu partition(tp)
select s_age,max(s_birth) stat,'max' tp
group by s_age
insert into table stu partition(tp)
select s_age,min(s_birth) stat,'min' tp
group by s_age;
```
The rewritten SQL scans the stu_ori table only once, while the union all version scans it twice.
This example tells us to get familiar with SQL syntax: if we did not know the multi-insert syntax, we would never have thought of this rewrite.
2. distinct
Consider this SQL, which counts the number of distinct age values:
```sql
select count(1)
from(
  select s_age
  from stu
  group by s_age
) b;
```
This simply counts the distinct enumeration values of age, so why not use distinct directly?
```sql
select count(distinct s_age)
from stu;
```
Some people say that when the amount of data is very large, using the first method can effectively avoid data skew on the reduce side, but is this really true?
Even with a huge volume of data, in the current business scenario and environment, using distinct is definitely more efficient than using the subquery. The reasons are as follows:
- The enumeration values of s_age are limited. Even if ages from 1 to 100 are counted, s_age has at most 100 distinct values. Translated into MapReduce, each Map deduplicates s_age during the Map phase. Since the enumeration values are limited, the s_age values emitted by each Map are also limited, and the amount of data reaching Reduce is at most the number of Maps times the number of distinct s_age values.
- distinct builds a hashtable in memory, so a deduplication lookup has O(1) time complexity. group by varies widely from version to version: some versions deduplicate with a hashtable, others use sorting, whose optimal time cannot reach O(1). In addition, the first method (group by) splits the deduplication into two tasks, which consumes more disk and network I/O resources.
- Hive 3.0 adds a count(distinct) optimization: with the configuration hive.optimize.countdistinct (see the sketch after this list), even if data skew really occurs, Hive can optimize automatically and rewrite the SQL execution logic by itself.
- The second method (distinct) is simpler than the first (group by): its meaning is simple and clear, and in the absence of special problems, the concise code wins!
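A minimal sketch of the Hive 3.0 switch mentioned above (reportedly on by default in that version; shown here only to make the knob explicit):

```sql
-- Assuming Hive 3.0 or later: allow Hive to rewrite count(distinct)
-- automatically when data skew appears.
set hive.optimize.countdistinct=true;
```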
This example tells us that sometimes we should not over-optimize. Tuning means tuning at the right time: tuning too early may produce useless or even negative effects, and the cost of tuning will be out of proportion to the return. Tuning needs to follow certain principles.
2. Data format optimization
Hive provides a variety of data storage organization formats, which greatly affect the program running efficiency.
Hive supports TEXT, SequenceFile, RCFile, ORC, and Parquet formats.
SequenceFile is a flat file with a binary key/value pair structure. On early Hadoop platforms it was widely used as a MapReduce input/output format as well as a data storage format.
Parquet is a columnar storage format that is compatible with a variety of computing engines, such as MapReduce and Spark, and provides good performance support for multi-layer nested data structures. It is one of the mainstream data storage choices in Hive production environments.
The ORC format is an optimization of RCFile. It provides an efficient way to store Hive data, improves the read, write, and processing performance of Hive data, and is compatible with multiple computing engines. In fact, ORC has become one of the mainstream choices for Hive data storage in production environments.
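As a quick sketch (the schema reuses the illustrative stu columns from earlier), choosing the storage format comes down to the stored as clause:

```sql
-- The same schema stored as plain text and as ORC.
create table stu_text (s_age int, s_birth string)
stored as textfile;

create table stu_orc (s_age int, s_birth string)
stored as orc;

-- Convert existing text data into the columnar ORC layout.
insert into table stu_orc
select s_age, s_birth from stu_text;
```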
We run the same SQL statement on the same data stored in different formats and get the following execution times:

| Data format | CPU time | User wait time |
| --- | --- | --- |
| TextFile | 33 min | 171 s |
| SequenceFile | 38 min | 162 s |
| Parquet | 2 min 22 s | 50 s |
| ORC | 1 min 52 s | 56 s |

Note: CPU time is the time the running program occupies the server's CPU resources; user wait time is how long a user waits between submitting the job and seeing the result returned.
It takes 33 minutes to query a TextFile data table, and 1 minute and 52 seconds to query an ORC data table. Therefore, different data storage formats can greatly affect HiveSQL performance.
3. Optimization for too many small files
If there are too many small files, each small file is treated as a block and a Map task is started to complete the query. The startup and initialization time of a Map task is much longer than the logical processing time, which wastes a lot of resources. Moreover, the number of maps that can be executed at the same time is limited.
Therefore, it is necessary to optimize away excessive small files. As for solutions, I have written a dedicated article explaining them; see:
How to solve the problem of too many small Hive files
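As a quick illustration (the linked article covers these in depth, and defaults vary by Hive version), Hive offers settings to combine small files on the read side and to merge small output files:

```sql
-- Combine many small input files into fewer map splits.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Merge small files produced at the end of map-only and map-reduce jobs.
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
-- Target size of the merged files (about 256 MB here).
set hive.merge.size.per.task=256000000;
```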
4. Parallel execution optimization
Hive converts a query into one or more stages. Such stages can be a MapReduce stage, a sampling stage, a merge stage, a LIMIT stage, or any other stage Hive may need during execution. By default, Hive executes only one stage at a time. However, a particular job may contain many stages, and these stages may not be completely interdependent; that is, some stages can be executed in parallel, which may shorten the overall execution time of the job. The more stages that can run in parallel, the faster the job is likely to complete.
Parallel execution is enabled by setting the parameter hive.exec.parallel to true. On a shared cluster, note that if a job runs more stages in parallel, cluster utilization will rise accordingly.
```sql
set hive.exec.parallel=true;              -- enable parallel task execution
set hive.exec.parallel.thread.number=16;  -- maximum parallelism allowed for one SQL statement; the default is 8
```
Of course, this only pays off when system resources are relatively idle; if there are no spare resources, there is nothing for the parallel stages to run on.
5. JVM optimization
JVM reuse is one of the Hadoop tuning parameters that can have a very significant impact on Hive performance, especially in scenarios where small files are difficult to avoid or where there are a lot of tasks, most of which have very short execution times.
Hadoop's default configuration typically forks a new JVM to execute each map and reduce task. JVM startup can be quite expensive, especially when a job contains hundreds or thousands of tasks. JVM reuse lets a JVM instance be reused N times within the same job. The value of N can be configured in Hadoop's mapred-site.xml file; it is typically set between 10 and 20, determined by testing against the specific business scenario.
```xml
<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit.
  </description>
</property>
```
We can also set this in Hive:
```sql
set mapred.job.reuse.jvm.num.tasks=10;  -- the number of tasks each JVM is reused for
```
The downside of this feature is that enabling JVM reuse keeps the occupied task slots reserved for reuse until the job completes. If some reduce tasks in an "unbalanced" job take much longer to run than the others, the reserved slots sit idle and cannot be used by other jobs until every task has finished.
6. Speculative execution optimization
In a distributed cluster environment, program bugs (including bugs in Hadoop itself), load imbalance, or uneven resource distribution can make tasks within the same job run at very different speeds; some tasks may run significantly slower than the rest (for example, one task in a job is only 50% complete while all the other tasks have already finished), and these tasks drag down the overall execution of the job. To avoid this, Hadoop uses Speculative Execution: it identifies the tasks that are "lagging behind" and starts a backup task for each of them that processes the same data as the original. The result of whichever copy finishes first is taken as the final result.
The parameters that enable speculative execution are set in Hadoop's mapred-site.xml file:
```xml
<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some map tasks
  may be executed in parallel.</description>
</property>

<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some reduce tasks
  may be executed in parallel.</description>
</property>
```
Hive itself also provides a configuration item to control reduce-side speculative execution:
```sql
set hive.mapred.reduce.tasks.speculative.execution=true;
```
It is hard to give specific advice for tuning these speculative execution settings. If users are sensitive to runtime deviations, these features can be turned off. If the input data is so large that map or reduce tasks must run for a long time, the waste caused by launching speculative duplicates can be enormous.
Finally
Code optimization principles:
- Rationalize the requirements; this principle is the basis of optimization;
- Grasp the full data link; this principle is the thread that optimization follows;
- Insist on keeping code simple; this principle makes optimization easier;
- Talking about optimization when there is no bottleneck is asking for trouble.