1. union vs. union all

  • union all does not deduplicate:
select sex,address from test where dt='20210218' union all select sex,address from test where dt='20210218';
+------+----------+--+
| sex  | address  |
+------+----------+--+
| m    | A        |
| m    | A        |
| m    | B        |
| m    | B        |
| m    | B        |
| m    | B        |
+------+----------+--+ 
  • union deduplicates:
select sex,address from test where dt='20210218' union select sex,address from test where dt='20210218';
+------+----------+--+
| sex  | address  |
+------+----------+--+
| m    | A        |
| m    | B        |
+------+----------+--+ 

2. distribute by and sort by in SQL
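`distribute by` controls which reducer each row is hashed to, and `sort by` sorts rows within each reducer (unlike `order by`, which forces a single global sort). A minimal sketch against this document's `test` table:

```sql
-- Route rows to reducers by sex, then sort within each reducer by address.
-- Each reducer's output file is sorted, but there is no global order.
-- (When distributing and sorting on the same column, "cluster by sex" is shorthand.)
select sex, address
from test
where dt = '20210218'
distribute by sex
sort by address;
```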

3. Bucketed tables

clustered by (sno) sorted by (age desc) into 4 buckets

Data can only be loaded into a bucketed table with insert into / insert overwrite (load data just moves files and does not distribute rows into buckets)

In Hive 2.1.1, bucketing is enforced by default, so manually changing the number of reducers does not affect the number of final files; the file count is determined by the number of buckets.

1. The underlying implementation of the enforced bucketing policy is equivalent to `distribute by (bucket column) [sort by (sort column)]`; even if you do not write these clauses, bucketing and sorting are still applied.
2. `insert into` may omit the keyword `table`; `insert overwrite` must include the keyword `table`.
3. Because bucketing is enforced, changing `mapreduce.job.reduces` does not affect the number of bucket files, but it does affect the number of reduce tasks actually launched: the real reduce task count is the factor of the bucket count closest to `mapreduce.job.reduces`; if the bucket count is prime, the bucket count itself is used.
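Putting the clause above into context, a sketch of a full bucketed table. The table and column names (`test_bucket`, `test_staging`, `sno`, `name`, `age`) are illustrative assumptions, not from the original:

```sql
-- Bucketed table using the clause from this section: 4 buckets hashed on sno,
-- rows within each bucket sorted by age descending.
create table test_bucket (
    sno  int,
    name string,
    age  int
)
clustered by (sno) sorted by (age desc) into 4 buckets
row format delimited fields terminated by ',';

-- Bucketed data must be written via insert into / insert overwrite,
-- typically from a plain staging table:
insert overwrite table test_bucket
select sno, name, age from test_staging;
```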

4. Dynamic-partition small-file and OOM optimization

INSERT OVERWRITE TABLE ris_relation_result_prod partition(rel_id)
SELECT get_json_object(relation, '$.relationHashcode') AS relation_hashcode,
get_json_object(relation, '$.targetVariableValue') AS target_variable_value,
get_json_object(relation, '$.relId') AS rel_id
FROM ris_relation_old_prod733 where get_json_object(relation, '$.relId') in (**********)

set hive.optimize.sort.dynamic.partition=true;
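With `hive.optimize.sort.dynamic.partition` enabled, rows are sorted by the dynamic-partition column before the write, so each reducer keeps only one record writer open at a time, which reduces both small-file counts and memory pressure. It is commonly paired with the standard dynamic-partition settings (the parameter names are real Hive settings; the values below are illustrative):

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;    -- allow all partition columns to be dynamic
set hive.exec.max.dynamic.partitions=10000;        -- global cap on partitions created per query
set hive.exec.max.dynamic.partitions.pernode=1000; -- cap per mapper/reducer
```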

blog.csdn.net/lzw2016/art…

5. Common Hive memory problems

blog.csdn.net/qq_26442553…

Fault 1: The Hive/MR task reports a memory overflow

Running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 3.9 GB of 4.2 GB virtual memory used. Killing container.

Memory tuning parameters: blog.csdn.net/sn_zzy/arti…
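For the error above, the usual fix is to raise the YARN container sizes and keep the JVM heap at roughly 80% of the container. The parameter names are standard MapReduce settings; the values are illustrative for a 4 GB container:

```sql
set mapreduce.map.memory.mb=4096;        -- physical memory limit for map containers
set mapreduce.map.java.opts=-Xmx3276m;   -- map JVM heap, ~80% of the container
set mapreduce.reduce.memory.mb=4096;     -- physical memory limit for reduce containers
set mapreduce.reduce.java.opts=-Xmx3276m;
```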

6. Hive SQL optimization

blog.csdn.net/kwu_ganymed…

Map join optimization
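A map join broadcasts the small table to every mapper so the join completes map-side, with no shuffle or reduce phase. A sketch, where `small_table` and its column `col` are illustrative assumptions:

```sql
-- Modern Hive converts joins to map joins automatically when one side is small:
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000; -- "small" threshold in bytes (default 25 MB)

-- Older versions also accept an explicit hint naming the small table:
select /*+ MAPJOIN(small_table) */ t.sex, t.address, s.col
from test t
join small_table s on t.sex = s.sex;
```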

7. Hive inserts killed because of small files

When Hive inserts data into dynamic partitions, it creates many small files and the tasks are killed, as shown in Figure 1 below, with "GC overhead limit exceeded"
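One common mitigation is to have Hive merge small output files in a follow-up step after the job. These are real Hive merge settings; the size thresholds below are illustrative:

```sql
set hive.merge.mapfiles=true;               -- merge small outputs of map-only jobs
set hive.merge.mapredfiles=true;            -- merge small outputs of map-reduce jobs
set hive.merge.size.per.task=256000000;     -- target size of each merged file (bytes)
set hive.merge.smallfiles.avgsize=16000000; -- merge when the average output file is below this
```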

8. The Hive task is in a blocked state and times out

mapreduce.task.timeout

If a task makes no progress for a certain period, that is, it neither reads new data nor produces output, it is considered blocked, whether temporarily or permanently. To prevent a user program from blocking forever, a timeout (in milliseconds) is enforced. The default is 600000 (10 minutes); a value of 0 disables the timeout
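For a legitimately slow job, the inactivity timeout can be raised per session (the value below is illustrative):

```sql
-- Allow up to 30 minutes of inactivity before a task is killed
-- (default 600000 ms; 0 disables the check entirely):
set mapreduce.task.timeout=1800000;
```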

9. Hive window functions

max(b.occurtime) over (partition by prcid order by b.occurtime desc)
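The same pattern applied to this document's `test` table, as a self-contained sketch (using `dt` in place of `occurtime`, since `test` has no such column):

```sql
-- For each sex, attach the latest dt seen in that partition to every row:
select sex,
       address,
       max(dt) over (partition by sex order by dt desc) as latest_dt
from test;
```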

10. Hive client logs

hive --verbose=true
hive --hiveconf hive.root.logger=DEBUG,console

11. Oversized ~/.beeline/history and ~/.hivehistory files in version 2.1 can cause OOM

Delete these history files, or move them aside as a backup.
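A sketch of the cleanup. A temp directory stands in for `$HOME` here so the example is self-contained; in practice set `HIST_DIR="$HOME"`:

```shell
# Back up the Hive CLI / Beeline history files that can grow large enough
# to trigger OOM when the client starts up.
HIST_DIR="$(mktemp -d)"                  # in practice: HIST_DIR="$HOME"
mkdir -p "${HIST_DIR}/.beeline"
: > "${HIST_DIR}/.hivehistory"           # stand-ins for the real history files
: > "${HIST_DIR}/.beeline/history"

for f in "${HIST_DIR}/.hivehistory" "${HIST_DIR}/.beeline/history"; do
  if [ -f "$f" ]; then
    mv "$f" "${f}.bak"                   # keep a backup rather than deleting outright
  fi
done
```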