Causes of data skew:
Uneven key distribution, characteristics of the business data itself, SQL statements that are inherently prone to skew, and poorly designed tables.
For example:
1. A large table is joined with a small table, and one or a few Reduce tasks receive far more data than the average.
2. A large table is joined with another large table, and the join key contains too many special values such as 0 or null. All of these rows go to a single Reduce task, which is very slow.
3. The group by dimension has too few distinct values and one value has far too many rows, so the Reduce task that handles this value takes a long time.
4. count(distinct) runs over a column with too many occurrences of a special value, so the Reduce task that handles this value takes a long time.
Symptoms:
The job's progress stays at 99% (or 100%) for a long time because one or a few Reduce tasks have not finished: the amount of data they process differs greatly from that of the other Reduce tasks.
Solutions:
1. Prune columns in the SQL statement so that less data from the two tables takes part in the join.
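As a minimal sketch of column pruning, assuming two hypothetical tables `orders` and `users` that are not part of the original text, each side of the join selects only the columns it needs before the join runs:

```sql
-- Hypothetical tables: orders(order_id, user_id, amount, ...) and users(user_id, city, ...).
-- Each subquery keeps only the join key and the columns used in the result.
select o.order_id, o.amount, u.city
from (select order_id, user_id, amount from orders) o
join (select user_id, city from users) u
  on o.user_id = u.user_id;
```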
2. The group by dimension is too small:
set hive.map.aggr=true; -- partial aggregation on the map side
set hive.groupby.skewindata=true; -- when data is skewed, balance the load by generating two MR jobs
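The two-job idea can also be sketched by hand. The query below is only an illustration, assuming a hypothetical table `page_views` with a skewed column `city`: the first stage aggregates on the key plus a random salt so that a hot key is spread over several Reduce tasks, and the second stage merges the partial results.

```sql
-- Hypothetical table: page_views(city, ...). The salt range 0..9 is an arbitrary choice.
select city, sum(partial_cnt) as cnt
from (
  -- Stage 1: partial counts per (city, salt)
  select city, salt, count(1) as partial_cnt
  from (
    select city, cast(floor(rand() * 10) as int) as salt
    from page_views
  ) s
  group by city, salt
) t
-- Stage 2: merge the partial counts per city
group by city;
```

This manual form only works for aggregates that can be merged from partial results, such as count and sum.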
3. Joining a large table with a small table: use a map join. Map join can be enabled automatically for small tables:
set hive.auto.convert.join=true; set hive.mapjoin.smalltable.filesize=25000000; -- 25 MB
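A short sketch of how these settings are used, assuming a hypothetical large table `page_views` and a hypothetical small dimension table `dim_city`. With automatic conversion enabled, Hive can turn the join into a map join when the small table is below the size threshold, so the join no longer goes through a Reduce phase:

```sql
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;

-- Hypothetical tables: page_views(user_id, city_id, ...) is large, dim_city(city_id, city_name) is small.
select v.user_id, c.city_name
from page_views v
join dim_city c
  on v.city_id = c.city_id;
```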
4. Joining a large table with a large table: replace null join keys with a string plus a random number, so that these rows are distributed to different Reduce tasks.
For example:
nvl(field, concat('hive', rand()))
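Placed in a join condition, the expression above might look like the following sketch. The tables `logs` and `users` are hypothetical, and the join key is assumed to be a string on both sides. If the other side were numeric, the salted value could be converted back to null during comparison and the skew would return.

```sql
-- Hypothetical tables: logs(user_id string, ...) and users(user_id string, name string).
-- Rows with a null user_id get a unique key such as 'hive0.42', which matches no user
-- and is hashed to an arbitrary Reduce task instead of all landing on one task.
select l.*, u.name
from logs l
left join users u
  on nvl(l.user_id, concat('hive', cast(rand() as string))) = u.user_id;
```

The left join keeps the null-key rows in the result with a null `name`. The prefix 'hive' is assumed not to occur in real keys.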
5. If count(distinct xx) runs over a large number of special values, replace it with sum() ... group by.
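One possible form of this rewrite, assuming a hypothetical table `page_views` with a column `user_id`: the inner query deduplicates with group by, which can run on many Reduce tasks, and the outer query sums one per distinct value.

```sql
-- Original form: select count(distinct user_id) from page_views;
-- Hypothetical table: page_views(user_id, ...).
select sum(1) as uv
from (
  select user_id
  from page_views
  where user_id is not null   -- count(distinct) ignores nulls, so filter them here too
  group by user_id
) t;
```

Unlike count(distinct), this form returns null rather than 0 when no rows qualify.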
6. Skew caused by joining columns of different types, such as int and string: convert the keys to the same type, for example to string with cast.
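A minimal sketch of the type conversion, assuming hypothetical tables `users` (with an int key) and `logs` (with a string key):

```sql
-- Hypothetical tables: users(user_id int, ...) and logs(user_id string, event string).
-- Casting the int key to string makes both sides of the comparison the same type.
select u.user_id, l.event
from users u
left join logs l
  on cast(u.user_id as string) = l.user_id;
```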
7. Merge small files:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; -- merge small files before the Map phase runs
Output side: merge small output files to reduce the number of small files generated.
set hive.merge.mapfiles=true; -- map side: merge small files at the end of a map-only job; default true
set hive.merge.mapredfiles=true; -- reduce side: merge small files at the end of a map-reduce job; default false
set hive.merge.size.per.task=268534456; -- size of the merged files, in bytes
8. Reduce the number of jobs, and set a proper number of Map and Reduce tasks.
At its core, handling data skew means distributing a large amount of data evenly across Reduce tasks, which requires a proper number of Map and Reduce tasks.
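As a rough sketch, the number of Reduce tasks can be influenced with settings such as the following. These properties are not mentioned in the original text, and all values are illustrative assumptions that would need tuning for a real workload:

```sql
-- Let Hive estimate the number of Reduce tasks from the input size:
set hive.exec.reducers.bytes.per.reducer=256000000; -- input bytes handled per Reduce task (illustrative)
set hive.exec.reducers.max=1000;                    -- upper bound on Reduce tasks (illustrative)

-- Or fix the number of Reduce tasks explicitly:
set mapreduce.job.reduces=50;                       -- illustrative value
```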