Reasons for data skew:

Uneven key distribution, characteristics of the business data itself, SQL statements that are inherently prone to skew, and poorly designed tables

For example:

1. Join of a large table and a small table. The data sent to one or a few Reduce tasks is far above the average

2. Join of two large tables whose join key contains many special values, such as 0 or null. All rows with such a value go to the same Reduce task, which then runs very slowly

3. group by on a dimension with too few distinct values, where one value has far more rows than the others. The Reduce task that handles that value takes a long time

4. count(distinct) over a column with too many special values. The Reduce task that handles those special values takes a long time

Symptoms:

Job progress stays at 99% or 100% for a long time while one or a few Reduce tasks remain unfinished, because the amount of data they process is very different from that of the other Reduce tasks

Solutions:

1. Prune columns in the SQL statement (select only the columns that are needed) to reduce the amount of data the two tables bring into the join
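As a minimal sketch of column pruning, the query below selects only the needed columns on each side of the join instead of using `select *`. The tables `orders` and `users` and their columns are hypothetical names chosen for illustration; they do not come from the original text.

```sql
-- Hypothetical tables: orders(order_id, user_id, amount, ...) and users(user_id, city, ...)
-- Each subquery keeps only the join key and the columns that the final result needs.
select o.order_id, o.amount, u.city
from (select order_id, user_id, amount from orders) o
join (select user_id, city from users) u
  on o.user_id = u.user_id;
```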

2. The group by dimension is too small

set hive.map.aggr=true; set hive.groupby.skewindata=true; -- when data is skewed, load balancing is performed by generating two MR jobs
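The sketch below shows how these settings might be applied, followed by a hand-written two-stage aggregation that follows the same idea: first aggregate on the key plus a random salt so a hot key is spread over several Reduce tasks, then combine the partial results. The table `user_events`, the column `city`, and the salt range of 10 are assumptions made for illustration.

```sql
-- Option A: let Hive balance the load (it plans two MR jobs for the group by)
set hive.map.aggr=true;
set hive.groupby.skewindata=true;

select city, count(*) as cnt
from user_events            -- hypothetical table
group by city;

-- Option B: manual two-stage aggregation with a random salt (illustrative)
select city, sum(cnt) as cnt
from (
  -- stage 1: partial counts per (city, salt); a hot city is split into up to 10 groups
  select city, salt, count(*) as cnt
  from (
    select city, cast(floor(rand() * 10) as int) as salt
    from user_events
  ) t
  group by city, salt
) s
-- stage 2: combine the partial counts for each city
group by city;
```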

3. Join of large and small tables: use a map join. Map join can be enabled automatically for small tables

set hive.auto.convert.join=true; set hive.mapjoin.smalltable.filesize=25000000; -- 25 MB
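A minimal sketch of a join that could be converted to a map join under these settings is shown below. The tables `fact_orders` (large) and `dim_city` (small, assumed to be under the 25 MB threshold) are hypothetical.

```sql
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;  -- small-table threshold in bytes

-- dim_city is assumed to be small enough to be loaded into memory on each mapper,
-- so the join can be done on the map side with no shuffle on city_id.
select f.order_id, d.city_name
from fact_orders f
join dim_city d
  on f.city_id = d.city_id;
```

Because the join is done on the map side, no Reduce task receives all the rows for a hot key. Older Hive versions also accept an explicit `/*+ MAPJOIN(d) */` hint, though whether the hint is honored depends on the version and configuration.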

4. Join of two large tables: replace null join keys with a string plus a random number, so that the rows are distributed across different Reduce tasks

For example:

nvl(field, concat('hive', rand()))
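To show where the expression fits in a query, here is a sketch that assumes two hypothetical tables, `logs` and `users`, joined on `user_id`, and assumes that no real `user_id` starts with `hive`. Rows with a null key get a random key that matches nothing, so they are spread over many Reduce tasks instead of piling up on one.

```sql
-- Hypothetical tables: logs(log_id, user_id, ...) and users(user_id, user_name, ...)
-- left join keeps the null-key rows from logs; their user_name is simply null.
select a.log_id, a.user_id, b.user_name
from logs a
left join users b
  on nvl(cast(a.user_id as string), concat('hive', rand()))
     = cast(b.user_id as string);
```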

5. When count(distinct xx) runs over a large number of special values, replace it with sum() combined with group by
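One possible form of this rewrite is sketched below, using a hypothetical table `user_events` and column `user_id`. The inner `group by` deduplicates the values in parallel across Reduce tasks, and the outer `sum()` only counts the deduplicated rows.

```sql
-- Skew-prone form: the distinct values are collected in a single place
select count(distinct user_id) as uv
from user_events;

-- Rewritten form: deduplicate with group by, then sum
select sum(1) as uv
from (
  select user_id
  from user_events
  where user_id is not null   -- count(distinct) ignores nulls, so exclude them here too
  group by user_id
) t;
```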

6. Data skew caused by joining columns of different types, such as int and string: convert both sides to the string type with cast
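A short sketch of the explicit conversion, assuming a hypothetical `logs.user_id` of type string and `users.user_id` of type int. A commonly cited explanation for the skew is that, without the cast, the join keys are implicitly converted and string values that are not valid numbers all end up with the same key.

```sql
-- Hypothetical schema: logs.user_id is string, users.user_id is int
select a.log_id, b.user_name
from logs a
join users b
  on a.user_id = cast(b.user_id as string);  -- compare string with string
```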

7. Merge small files

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; -- merge small files before the Map phase runs

Output side: merge the small files produced as output to reduce the number of small files generated.

Map side:

set hive.merge.mapfiles=true;

Setting: enables small-file merging for map-only jobs. The default value is true.

Reduce side:

set hive.merge.mapredfiles=true;

Setting: enables small-file merging for map-reduce jobs. The default value is false.

set hive.merge.size.per.task=268534456;

8. Reduce the number of jobs and set a proper number of Map and Reduce tasks

Handling data skew comes down to setting a proper number of Map and Reduce tasks so that large volumes of data are distributed evenly across the Reduce tasks.
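As a rough sketch, the number of Map and Reduce tasks is usually influenced through settings like the ones below. All values are illustrative assumptions rather than recommendations from the original text, and should be tuned to the data volume and cluster.

```sql
-- Reduce side: Hive estimates the reducer count from the input size
set hive.exec.reducers.bytes.per.reducer=256000000;  -- input bytes handled per reducer (illustrative)
set hive.exec.reducers.max=1009;                     -- upper bound on the number of reducers (illustrative)

-- Or fix the number of reducers explicitly (illustrative value)
set mapreduce.job.reduces=50;

-- Map side: a smaller maximum split size produces more map tasks (illustrative value)
set mapreduce.input.fileinputformat.split.maxsize=256000000;
```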