Join

Join principles: 1) == Join small tables with large tables ==. The tables with relatively scattered keys and small amount of data are placed on the left of the Join, which can effectively reduce the probability of memory overflow errors. Further, you can use groups to advance small dimension tables (under 1000 entries) into memory. Complete reduce on the Map side. select count(distinct s_id) from score; select count(s_id) from score group by s_id; Aggregation on the Map side is more efficient

2) When multiple tables are associated, it is best to split them into small segments to avoid large SQL (intermediate jobs cannot be controlled) == 3) Join large table Join large table (1) Empty KEY filtering Sometimes Join timeout is because there is too much data corresponding to some keys, but data corresponding to the same keys are sent to the same Reducer, resulting in insufficient memory. In this case, we should carefully analyze the abnormal keys. In most cases, the data corresponding to these keys is abnormal data, which needs to be filtered in SQL statements. For example, if the key field is empty, perform the following operations:

create table ori(id bigint.time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

create table nullidtable(id bigint.time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

create table jointable(id bigint.time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

load data local inpath '/export/servers/hivedatas/hive_big_table/*' into table ori; 
load data local inpath '/export/servers/hivedatas/hive_have_null_id/*' into tablenullidtable; Don't filter:INSERT OVERWRITE TABLE jointable
SELECT a.* FROM nullidtable a JOIN ori b ONa.id = b.id; No rows affected (152.135 seconds)-----------------------------------------------Filtering:INSERT OVERWRITE TABLE jointable
SELECT a.* FROM (SELECT * FROM nullidtable WHERE id IS NOT NULL ) a JOIN ori b ONa.id = b.id; No rows affected (141.585 seconds)Copy the code

(2) Empty key transformation Sometimes there are a lot of data corresponding to an empty key, but the corresponding data is not abnormal data and must be included in the result of join. In this case, we can assign a random value to the field with empty key in Table A, so that the data can be randomly and evenly divided into different reducer. For example: Non-random distribution:

set hive.exec.reducers.bytes.per.reducer=32123456;
set mapreduce.job.reduces=7;
INSERT OVERWRITE TABLE jointable
SELECT a.*
FROM nullidtable a
LEFT JOIN ori b ON 
CASE WHEN 
a.id IS NULL 
THEN 'hive' 
ELSE a.id 
END 
= b.id;
No rows affected (41.668 seconds)   52.477
Copy the code

Results: The result is that all null ids become the same string “hive”, which is prone to data skew (all keys are the same, data with the same key will be sent to the same Reduce). To solve this problem, we can use the RAND function of Hive. Randomly assign a random value to each empty ID, so as not to skew the data randomly:

set hive.exec.reducers.bytes.per.reducer=32123456;
set mapreduce.job.reduces=7;
INSERT OVERWRITE TABLE jointable
SELECT a.*
FROM nullidtable a
LEFT JOIN ori b ON 
CASE WHEN 
id IS NULL 
THEN concat('hive'.rand()) 
ELSE a.id 
END 
= b.id;
No rows affected (42.594 seconds)
Copy the code

A case in field

SQL > alter table JOIN table (s); SQL > alter table JOIN table (s)

create table bigtable(id bigint.time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

create table smalltable(id bigint.time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

create table jointable2(id bigint.time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
Copy the code

(2) Import data into large table and small table respectively

hive (default)> load data local inpath '/export/servers/hivedatas/big_data' into table bigtable;
hive (default)>load data local inpath '/export/servers/hivedatas/small_data' into table smalltable;
Copy the code

(3) Disable mapJoin (enabled by default)

set hive.auto.convert.join = false;
Copy the code

(4) small table the JOIN predicate sentence

INSERT OVERWRITE TABLE jointable2
SELECT b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
FROM smalltable s
left JOIN bigtable  b
ONb.id = s.id; / / Time seems: 67.411 secondsCopy the code

JOIN (1) JOIN (2) JOIN (3

INSERT OVERWRITE TABLE jointable2
SELECT b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
FROM bigtable  b
left JOIN smalltable  s
ONs.id = b.id; / / Time seems: 69.376 secondsCopy the code

If a large table joins a small table or a small table joins a large table, there is almost no difference between the new version and the new version even if join is disabled on the map side.

MapJoin

If MapJoin is not specified or the conditions for MapJoin are not met, the Hive parser converts the Join operation to a Common Join (complete the Join in the Reduce phase). Prone to data skew. MapJoin can be used to load all small tables into the map end for join, avoiding reducer processing. 1) Enable MapJoin parameter setting: (1) Set MapJoin automatically

set hive.auto.convert.join = true; The default is trueCopy the code

(2) Threshold setting for large tables and small tables (the default value is 25M or less) :

set hive.mapjoin.smalltable.filesize=25123456;
Copy the code

2) MapJoin working mechanism

set hive.auto.convert.join = true; The default is trueCopy the code

(2) small table the JOIN predicate sentence

INSERT OVERWRITE TABLE jointable2
SELECT b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
FROM smalltable s
JOIN bigtable  b
ONs.id = b.id; / / Time seems: 31.814 secondsCopy the code

JOIN (3) JOIN (4) JOIN (4

INSERT OVERWRITE TABLE jointable2
SELECT b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
FROM bigtable  b
JOIN smalltable  s
ONs.id = b.id; / / Time seems: 28.46 secondsCopy the code