IN is suitable for the situation where the outside is large and the inside is small; EXISTS applies to a situation where the outer surface is small and the inner surface is large.

In /exists/left semi join does not produce cartesian product! Inner join may produce cartesian product!

1. in

select * from A
where A.id in (select B.id from B)
Copy the code

It looks up all id fields in table B and caches them. Then, check whether the IDS in table A are equal to those in table B. If so, add the records in table A to the result set until all the records in table A are traversed.

As you can see, it is not appropriate to use in() when table B is large, because it traverses all table B data once.

For example, if table A has 10,000 records and table B has 1,000,000 records, it is possible to traverse 10,000 x 1,000,000 records at most, which is inefficient.

Another example: Table A has 10000 records and Table B has 100 records, so it is possible to traverse 10000*100 times at most, which greatly reduces the number of traverse times and greatly improves the efficiency.

Conclusion: In () is suitable for the case where the data of table B is smaller than that of table A.

2. exists

select A.* from A
where exists (select B.id from B where A.id = B.id)
Copy the code

Use exists() when table B is larger than table A because it does not iterate and only needs to perform one more query.

For example, if there are 10,000 records in table A and 1,000,000 records in table B, exists() performs 10,000 executions to determine whether the ids in table A are equal to those in table B.

If exists(A) has 10000 entries and exists(B) has 100 entries, it is better to use in() to do 10000*100 entries, because in() does the comparison in memory, and exists() needs to query the database, which is known to consume higher performance. Memory is fast.

Conclusion: Exists () applies to the case where table B has larger data than table A

3. Usage scenarios of Left semi Join [It is recommended to use left semi Join instead of in/exist]

The above example should be converted to:

select A.id from A left semi join B on A.id = B.id;
Copy the code

Hive supports more subqueries after version 0.13, such as in and not in subqueries. If hive does not support subqueries such as in, Exists, and not in, it is probably older than version 0.13.

end