Yesterday afternoon when I was looking up information, I accidentally clicked on the official website of MySQL. A new version of MySQL has been found.

Mysql is a database that anyone is not familiar with? Don’t have to? I don’t think so.

In late 2019, the 8.0.18GA release of MySQL brought a number of new features and enhancements. Most notably, multi-table Join queries support Hash Joins.

Is the same, suggest English good students directly see here: https://dev.mysql.com/doc/refman/8.0/en/hash-joins.html

MySQL Hash Join features

  • 1. For table association with large amount of data, HJ(Hash Join) is significantly faster than NL(Nested Loop)
  • 2. Process in memory
  • 3. Use disk space if necessary
  • 4, for internal connection, can be extended to external connection, semi-connection and anti-connection
  • 5. Replace Block Nested Loop in query plan
  • SQL can be forced to HJ or NL by HINT

Some of you might be confused. What is a Hash Join? What is NL? HINT is a HINT.

The first part is to do a simple popular science

First, if we look at the execution plan of a joint query with multiple tables, we will see how the tables are joined. There are three ways to Join multiple tables: Nested Loops, Hash Join, and Sort Merge Join.

There must be some people who say that alibaba specifications have said that concurrent cases can not be queried with multiple tables. How concurrent are you? The background of any system will use multi-table joint query.

Hash Join is used in both Spark and Flink SQL joins.

[Three implementation methods of Spark SQL Join].

Hash Join is a common method used by CBO to Join large data sets, and is generally suitable for joining large and small tables. In general, small tables are used to create hash tables in memory using JOIN keys, store column data in hash tables, and then scan larger tables. The hash table is also hashed on JOIN keys, and the hash table is detected to find rows that match the hash table.

Some students are confused again. What is the CBO? CBO is a form of SQL optimization that evaluates the execution plan based on real data and selects the least expensive execution plan.

What is an execution plan? Baidu go ahead… [Black question mark face]

What is Nested Loops? To put it simply, it is a two-layer Loop. The first table is used as Outter Loop and the second table as Inner Loop. Each record of Outter Loop is compared with that of Inner Loop to find the data that meets the conditions. Of course, Nested Loops have many different scenarios. For the simplest example, the pseudocode looks like this:

for (r in R) { for (s in S) { if (r satisfy condition s) { output <r, s>; }}}Copy the code

What is Hint? “Hint” means Hint. In a nutshell, hints are very much like comments in code that indicate to the developer or others what the code means. So the Hint has a special role in SQL, as a Hint to the database that I want it to do what I want it to do. I’m not going to give you an example.

How to use Hash Join in MySQL?

Let’s go straight to the example on the website.

Suppose we have three tables as follows:

CREATE TABLE t1 (c1 INT, c2 INT);
CREATE TABLE t2 (c1 INT, c2 INT);
CREATE TABLE t3 (c1 INT, c2 INT);
Copy the code

There is a simple table associated query:

SELECT * 
    FROM t1 
    JOIN t2 
        ON t1.c1=t2.c1;
Copy the code

We can use EXPLAIN FORMAT=TREE command to see the above SQL execution plan:

mysql> EXPLAIN FORMAT=TREE -> SELECT * -> FROM t1 -> JOIN t2 -> ON t1.c1=t2.c1\G *************************** 1. row *************************** EXPLAIN: -> Inner Hash JOIN (T2.c1 = T1.c1) (cost=0.70 rows=1) -> Table scan on T2 (cost=0.35 rows=1) -> Hash -> Table scan on T1 (cost = 0.35 rows = 1)Copy the code

We see that the Inner Hash JOIN keyword means that this SQL is using a Hash join.

In addition, queries that use equivalent joins between multiple tables are optimized in this way. For example:


SELECT * 
    FROM t1
    JOIN t2 
        ON (t1.c1 = t2.c1 AND t1.c2 < t2.c2)
    JOIN t3 
        ON (t2.c1 = t3.c1);
Copy the code

Looking at the output of the EXPLAIN FORMAT=TREE command, we can also see that the condition of the unequal join ends up as a filter.

mysql> EXPLAIN FORMAT=TREE -> SELECT * -> FROM t1 -> JOIN t2 -> ON (t1.c1 = t2.c1 AND t1.c2 < t2.c2) -> JOIN t3 -> ON (t2.c1 = t3.c1)\G *************************** 1. row *************************** EXPLAIN: -> Inner Hash JOIN (t3.c1 = T1.c1) (cost=1.05 rows=1) -> Table scan on T3 (cost=0.35 rows=1) -> hash -> Filter: C2 < T2.c2) (cost=0.70 rows=1) -> Inner Hash join (T2.c1 = T1.c1) (cost=0.70 rows=1) -> Table scan on T2 (cost=0.35) Rows =1) -> Hash -> Table scan on T1 (cost=0.35 rows=1)Copy the code

As you can see from the above log, MySQL will use multiple Hash joins if your SQL contains multiple equivalent joins.

But, look out! Hash Join is not used if the on condition in your SQL is not equivalent Join.

Such as:


mysql> EXPLAIN FORMAT=TREE
    ->     SELECT * 
    ->         FROM t1
    ->         JOIN t2 
    ->             ON (t1.c1 = t2.c1)
    ->         JOIN t3 
    ->             ON (t2.c1 < t3.c1)\G
*************************** 1. row ***************************
EXPLAIN: <not executable by iterator executor>

Copy the code

Let’s EXPLAIN:

mysql> EXPLAIN -> SELECT * -> FROM t1 -> JOIN t2 -> ON (t1.c1 = t2.c1) -> JOIN t3 -> ON (t2.c1 < t3.c1)\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: t1 partitions: NULL type: ALL possible_keys: NULL Key: NULL KEY_len: NULL ref: NULL rows: 1 filtered: 100.00 Extra: NULL *************************** 2. row *************************** id: 1 select_type: SIMPLE table: t2 partitions: NULL type: ALL possible_keys: NULL key: NULL KEY_len: NULL ref: NULL rows: 1 filtered: 100.00 Extra: Using WHERE; Using join buffer (Block Nested Loop) *************************** 3. row *************************** id: 1 select_type: SIMPLE table: t3 partitions: NULL type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 1 filtered: 100.00 Extra: Using the where; Using join buffer (Block Nested Loop)Copy the code

See, MySQL now selects Nested Loop.

Cartesian product queries can also use HJ:


mysql> EXPLAIN FORMAT=TREE
    -> SELECT *
    ->     FROM t1
    ->     JOIN t2
    ->     WHERE t1.c2 > 50\G
*************************** 1. row ***************************
EXPLAIN: -> Inner hash join  (cost=0.70 rows=1)
    -> Table scan on t2  (cost=0.35 rows=1)
    -> Hash
        -> Filter: (t1.c2 > 50)  (cost=0.35 rows=1)
            -> Table scan on t1  (cost=0.35 rows=1)
Copy the code

Keep your eyes on the ball!

By default, MySQL uses Hash Joins whenever possible. It also provides two methods to control whether a Hash Join is used. For example, I just don’t like HJ, I like turtle speed Join NL, and then the project manager asked to open HJ query when optimization, wouldn’t it be delightful?

Either way:

  • Set the server system variable hash_join=on globally
  • Use Hint to specify a HASH in SQLJOIN 或者 NOHASH_JOIN

Even better, HJ itself is index-free, meaning that even without index optimization, HJ is still fast.

Here’s a test I took with other people online to show how powerful HJ can be.

Firstly, 1,000,000 records were generated for T1, T2 and T3 respectively:


set join_buffer_size=2097152000;

SET @@cte_max_recursion_depth = 99999999;

INSERT INTO t1
-- INSERT INTO t2
-- INSERT INTO t3
WITH RECURSIVE t AS (
  SELECT 1 AS c1, 1 AS c2
  UNION ALL
  SELECT t.c1 + 1, t.c1 * 2
    FROM t
   WHERE t.c1 < 1000000
)
SELECT *
  FROM t;

Copy the code

Hash join without index:

mysql> EXPLAIN ANALYZE -> SELECT COUNT(*) -> FROM t1 -> JOIN t2 -> ON (t1.c1 = t2.c1) -> JOIN t3 -> ON (t2.c1 = t3.c1)\G * * * * * * * * * * * * * * * * * * * * * * * * * * * 1 row * * * * * * * * * * * * * * * * * * * * * * * * * * * the EXPLAIN: - > Aggregate: count (0) (actual time = 22993.098. Rows =1 loops=1) -> Inner Hash JOIN (t3.c1 = t1.c1) (cost=9952535443663536.00 rows=9952435908880402) (actual) Time = 14489.176.. Rows =0 loops=1) -> Table scan on T3 (cost=0.00 rows= 0) (actual time=0.103.. Rows =1000000 loops=1) -> Hash -> Inner Hash JOIN (t2.c1 = t1.c1) (cost=99682753413.67 rows=99682653660) (actual) Time = 5663.592.. Rows =1000000 loops=1) -> Table scan on T2 (cost=0.01 rows= 99912) (actual time=0.067.. Rows =100 loops=1) -> Hash -> Table scan on T1 (cost=100 rows=100) (actual time=0.. Mysql > SELECT COUNT(*) -> FROM t1 -> JOIN T2 -> ON (t1.c1 =) t2.c1) -> JOIN t3 -> ON (t2.c1 = t3.c1); + -- -- -- -- -- -- -- -- -- -- + | COUNT (*) | + -- -- -- -- -- -- -- -- -- -- + | 1000000 | + -- -- -- -- -- -- -- -- -- -- + 1 row in the set (12.98 SEC)Copy the code

The actual run took 12.98 seconds. If block nested loop is used:

mysql> EXPLAIN FORMAT=TREE -> SELECT /*+ NO_HASH_JOIN(t1, t2, t3) */ COUNT(*) -> FROM t1 -> JOIN t2 -> ON (t1.c1 = t2.c1) -> JOIN t3 -> ON (t2.c1 = t3.c1)\G *************************** 1. row *************************** EXPLAIN: < executable by iterator executor> 1 row in set (0.00 SEC) SELECT /*+ NO_HASH_JOIN(t1, t2, t3) */ COUNT(*) FROM t1 JOIN t2 ON (t1.c1 = t2.c1) JOIN t3 ON (t2.c1 = t3.c1);Copy the code

EXPLAIN that hash join cannot be used. The query ran for dozens of minutes with no results, and one of the CPU utilization reached 100%; Because you’re doing nested loops all the time (1,000,000 to the power of 3).

If the nested block loop method has an index, add an index:

mysql> CREATE index idx1 ON t1(c1); Query OK, 0 rows affected (7.39 SEC) Records: 0 Duplicates: 0 Warnings: 0 mysql> CREATE index IDx2 ON T2 (c1); Query OK, 0 rows affected (6.77 SEC) Records: 0 Duplicates: 0 Warnings: 0 mysql> CREATE index idx3 ON t3(c1); Query OK, 0 rows affected (6.77 SEC) Records: 0 Duplicates: 0 Warnings: 0 mysql> CREATE index idx3 ON t3(c1); Query OK, 0 rows affected (7.23 SEC) Records: 0 Duplicates: 0 Warnings: 0Copy the code

View the execution plan and run the same query:

mysql> EXPLAIN ANALYZE -> SELECT COUNT(*) -> FROM t1 -> JOIN t2 -> ON (t1.c1 = t2.c1) -> JOIN t3 -> ON (t2.c1 = t3.c1)\G * * * * * * * * * * * * * * * * * * * * * * * * * * * 1 row * * * * * * * * * * * * * * * * * * * * * * * * * * * the EXPLAIN: - > Aggregate: count (0) (actual time = 47684.034. 47684.035 rows=1 loops=1) -> Nested loop inner join (cost=2295573.22 rows=998412) (actual time=0.116.. 46363.599 rows=1000000 loops=1) -> Nested loop inner join (cost=1198056.31 rows=998412) (actual time=0.087.. Rows =100 loops=1) -> Filter: (t1.c1 is not null) (cost=100 loops=1) (actual time= 100.. Rows =100 loops=1) -> Index scan on T1 using IDx1 (cost=100 loops=1) (actual time=0.043.. 3253.769 rows=1000000 loops=1) -> Index lookup on T2 using IDx2 (c1=t1.c1) (cost=1.00 rows=1) (actual time=0.012.. 0.015 rows=1 loops=1000000) -> Index lookup on T3 using idx3 (c1=t1.c1) (cost=1.00 rows=1) (actual time=0.012.. Mysql > SELECT COUNT(*) -> FROM t1 -> JOIN T2 -> ON (t1.c1 = t2.c1) -> JOIN t3 -> ON (t2.c1 = t3.c1); + -- -- -- -- -- -- -- -- -- -- + | COUNT (*) | + -- -- -- -- -- -- -- -- -- -- + | 1000000 | + -- -- -- -- -- -- -- -- -- -- + 1 row in the set (19.56 SEC)Copy the code

The actual run took 19.56 seconds. So the test results in our scenario are as follows:

Add Hash Join result 1.282s in Oracle 12C with no index. Add a Hash Join result for PostgreSQL 11.5 with no index: 6.234s. Add a Hash Join result of 5.207s without index in SQL Server 2017.

See how powerful Hash Join is? Did you learn?

Pay attention to my public number, background reply [JAVAPDF] get 200 pages of questions! 50000 people pay attention to the big data into the way of god, don’t you want to know? Fifty thousand people pay attention to the big data into the road of god, really not to understand it? Fifty thousand people pay attention to the big data into the way of god, sure really not to understand it?

Welcome your attentionBig Data as the Road to God