Hive SQL covers most offline data-processing scenarios in the big-data field, so Hive SQL optimization is a skill we must master, and it comes up in interviews. I expect a candidate to cover roughly 80% of the optimization points below to pass this question.


  • Hive Optimization Goals
    • Execute more efficiently with limited resources
  • Q&A
    • Data skew
    • Setting the number of map tasks
    • Setting the number of reduce tasks
    • Other


  • Hive execution
    • HQL –> Job –> Map/Reduce
    • Execution plan
      • explain [extended] hql
      • Example
      • select col,count(1) from test2 group by col;
      • explain select col,count(1) from test2 group by col;


  • Hive table optimization
    • Partitioning
      • set hive.exec.dynamic.partition=true;
      • set hive.exec.dynamic.partition.mode=nonstrict;
      • Static partitioning
      • Dynamic partitioning
    • Bucketing
      • set hive.enforce.bucketing=true;
      • set hive.enforce.sorting=true;
    • Data layout
      • Try to cluster the same data together
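The partitioning and bucketing settings above can be combined in one DDL sketch. The table names (`logs_src`, `logs_part`, `logs_bucketed`) and columns are assumptions for illustration, not from the text:

```sql
-- Dynamic partitioning: the partition value comes from the data itself
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

create table logs_part (uid int, url string) partitioned by (dt string);
insert overwrite table logs_part partition (dt)
select uid, url, dt from logs_src;

-- Bucketing: cluster the same keys together, which helps joins and sampling
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

create table logs_bucketed (uid int, url string)
clustered by (uid) sorted by (uid) into 32 buckets;
insert overwrite table logs_bucketed select uid, url from logs_src;
```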




  • Hive Job optimization
    • Parallel execution
      • Hive converts each query into multiple stages; stages that do not depend on each other can run in parallel, shortening execution time
      • set hive.exec.parallel= true;
      • set hive.exec.parallel.thread.number=8;
    • Local execution
      • set hive.exec.mode.local.auto=true;
      • Local mode is used only when a job meets all of the following conditions:
      • The job's input size is smaller than hive.exec.mode.local.auto.inputbytes.max (default 128 MB)
      • The job's number of map tasks is smaller than hive.exec.mode.local.auto.tasks.max (default 4)
      • The job's number of reduce tasks is 0 or 1
    • Merging small input files
      • set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
      • How many files are combined into one split is governed by mapred.max.split.size
    • Merging small output files
      • set hive.merge.smallfiles.avgsize=256000000; If the average size of the output files is smaller than this value, an extra job is started to merge them
      • set hive.merge.size.per.task=64000000; Target size of the merged files
    • JVM reuse
      • set mapred.job.reuse.jvm.num.tasks=20;
      • JVM reuse keeps a job's task slots occupied until the whole job ends, which makes sense for jobs with many tasks or many small files, cutting JVM start-up time. Do not set the value too high: if a job still has reduce tasks running, the slots held for reuse by finished map tasks cannot be released, and other jobs may have to wait.
    • Compressed data
      • set hive.exec.compress.output=true;
      • set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
      • set mapred.output.compression.type=BLOCK;
      • set hive.exec.compress.intermediate=true;
      • set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
      • set hive.intermediate.compression.type=BLOCK;
      • Intermediate compression covers the data passed between the multiple jobs of a single Hive query; choose an efficient codec that is cheap on CPU, such as Snappy
      • The final output of a Hive query can be compressed as well
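All the job-level knobs above can be collected into a session preamble. This is a sketch using only the flags already listed in this section; tune the values to your cluster:

```sql
set hive.exec.parallel=true;                 -- run independent stages in parallel
set hive.exec.parallel.thread.number=8;
set hive.exec.mode.local.auto=true;          -- small jobs run locally
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.merge.smallfiles.avgsize=256000000; -- merge small output files
set hive.merge.size.per.task=64000000;
set mapred.job.reuse.jvm.num.tasks=20;       -- JVM reuse
set hive.exec.compress.output=true;          -- compress final output
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set hive.exec.compress.intermediate=true;    -- compress data between jobs
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```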
  • Hive Map optimization

    • set mapred.map.tasks=10; Setting this alone does not take effect; the actual number of maps is computed as follows
    • (1) Default number of maps
      • default_num=total_size/block_size;
    • (2) Desired number of maps
      • goal_num=mapred.map.tasks;
    • (3) Split size actually used
      • split_size=max(mapred.min.split.size,block_size);
      • split_num=total_size/split_size;
    • (4) Computed number of maps
      • compute_map_num=min(split_num,max(default_num,goal_num))
    • From the analysis above, setting the number of maps boils down to:
      • To increase the number of maps, set mapred.map.tasks to a larger value
      • To decrease the number of maps, set mapred.min.split.size to a larger value
      • Case 1: the input files are large and are not small files; raise mapred.min.split.size to get fewer, larger splits
      • Case 2: the input is many small files, each smaller than blockSize. Raising mapred.min.split.size does not help here; instead use CombineFileInputFormat to pack multiple input paths into one InputSplit per mapper, reducing the number of mappers.
    • Map-side aggregation
      • set hive.map.aggr=true;
    • Speculative execution
      • mapred.map.tasks.speculative.execution
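The map-count formula above can be worked through with concrete numbers. The input sizes here are assumptions chosen only to make the arithmetic visible:

```sql
-- Assume: total_size = 1 GB (1024 MB), block_size = 128 MB
-- default_num = 1024 MB / 128 MB = 8
-- goal_num    = mapred.map.tasks (say the default of 2)
-- split_size  = max(mapred.min.split.size, block_size) = 128 MB
-- split_num   = 1024 MB / 128 MB = 8
-- compute_map_num = min(split_num, max(default_num, goal_num)) = min(8, 8) = 8

-- To halve the number of maps, raise the minimum split size:
set mapred.min.split.size=268435456;  -- 256 MB
-- split_size = 256 MB, split_num = 4
-- compute_map_num = min(4, max(8, 2)) = 4
```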
  • Hive Shuffle optimization

    • Map side
      • io.sort.mb
      • io.sort.spill.percent
      • min.num.spill.for.combine
      • io.sort.factor
      • io.sort.record.percent
    • Reduce side
      • mapred.reduce.parallel.copies
      • mapred.reduce.copy.backoff
      • io.sort.factor
      • mapred.job.shuffle.input.buffer.percent
  • Hive Reduce optimization

    • Queries that require a reduce phase
      • Aggregate functions: sum,count,distinct…
      • Advanced queries: group by,join,distribute by,cluster by…
      • order by is special: it always uses a single reducer
    • Speculative execution
      • mapred.reduce.tasks.speculative.execution
      • hive.mapred.reduce.tasks.speculative.execution
    • Setting the number of reducers
      • set mapred.reduce.tasks=10; Sets the number directly
      • Otherwise a formula computes it:
      • numRTasks = min[maxReducers,input.size/perReducer]
      • maxReducers=hive.exec.reducers.max (default 999)
      • perReducer=hive.exec.reducers.bytes.per.reducer (default 1 GB)
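The reducer formula above, worked through with an assumed input size:

```sql
-- Assume: input.size = 10 GB
-- maxReducers = hive.exec.reducers.max               (default 999)
-- perReducer  = hive.exec.reducers.bytes.per.reducer (default 1 GB)
-- numRTasks = min(999, 10 GB / 1 GB) = 10

-- Halving bytes-per-reducer doubles the reducer count:
set hive.exec.reducers.bytes.per.reducer=536870912;  -- 512 MB => 20 reducers

-- Or bypass the formula entirely:
set mapred.reduce.tasks=10;
```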


  • Optimizing Hive query operations

  • Join optimization
    • mapjoin
      • Usage scenarios: one of the tables in the join is very small; non-equi join conditions
      • set hive.auto.convert.join=true;
      • hive.mapjoin.smalltable.filesize (default 25 MB) decides which table counts as small
      • select /*+mapjoin(t)*/ f.a,f.b from A t join B f on (f.a=t.a);
    • Skew join
      • set hive.optimize.skewjoin=true; Set to true if the join process is skewed
      • set hive.skewjoin.key=100000; Join keys with more records than this value are treated as skewed and optimized
  • Bucket join
    • Both tables are bucketed on the join key in the same way
    • The bucket counts of the two tables are multiples of each other
    • create table `order`(cid int,price float) clustered by(cid) into 32 buckets;
    • create table customer(id int,first string) clustered by(id) into 32 buckets;
    • select price from `order` t join customer s on t.cid=s.id;
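Bucketing alone does not make Hive use a bucket join; the classic switch in Hive 0.x-era configs is `hive.optimize.bucketmapjoin`, which in those versions also required the mapjoin hint. A sketch using the tables from the example above:

```sql
-- Assumes `order` and `customer` are both bucketed on the join key (32 buckets)
set hive.optimize.bucketmapjoin=true;

select /*+mapjoin(s)*/ price
from `order` t join customer s on t.cid = s.id;
```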

  • Before join optimization
    • select m.cid,u.id from `order` m join customer u on m.cid=u.id where m.dt='2013-12-12';

  • After join optimization (the filter is pushed down before the join)

    • select m.cid,u.id from (select cid from `order` where dt='2013-12-12') m join customer u on m.cid=u.id;
  • Group by optimization
    • set hive.groupby.skewindata=true; Set to true if the group by process is skewed
    • set hive.groupby.mapaggr.checkinterval=100000; Group keys with more records than this value are optimized
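The group-by settings above combine naturally with map-side aggregation; with `hive.groupby.skewindata=true`, Hive plans two jobs, spreading skewed keys randomly in the first and producing the final aggregate in the second. A sketch reusing the document's `test2` query:

```sql
set hive.map.aggr=true;            -- partial aggregation in the mapper
set hive.groupby.skewindata=true;  -- two-job plan to absorb skewed keys

select col, count(1) from test2 group by col;
```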


  • Count distinct optimization

    • Before optimization
      • select count(distinct id) from tablename;
    • After optimization
      • select count(1) from (select distinct id from tablename) tmp;
      • select count(1) from (select id from tablename group by id) tmp;
    • Before optimization
      • select a,sum(b),count(distinct c),count(distinct d) from test group by a;
    • After optimization
      • select a,sum(b) as b,count(c) as c,count(d) as d
        from(
          select a,0 as b,c,null as d from test group by a,c
          union all
          select a,0 as b,null as c,d from test group by a,d
          union all
          select a,b,null as c,null as d from test
        ) tmp1
        group by a;