When executing a join statement, it is often said that the small table should drive the large table, because letting the small table drive the large table makes the query faster.
So why does MySQL recommend having small tables drive large tables, and why does this effectively improve query efficiency?
Simple Nested-Loop Join
Simple Nested-Loop Join is the naive join method. MySQL does not actually use it, because it scans row pairs on the scale of a Cartesian product: a huge number of rows are read while the tables are associated, so the query is very slow.
SLJ execution process
select * from t1 straight_join t2 on t1.a = t2.a;
No index exists on column a of either t1 or t2.
Execution process:
- Read a row of data from table t1 and take the value of its a field
- Use that value to do a full scan of table t2, comparing it with the a field of every t2 row
- If a match is found, add the matched row to the result set
- Repeat steps 1-3 until the full scan of table t1 is complete
Time complexity
During execution, table t1 is fully scanned once, and table t2 is fully scanned once for every row of t1, so the query is slow and the time complexity is N x M (t1 has N rows, t2 has M rows).
When the tables are large, the number of scanned rows explodes. For example, if t1 and t2 each have 100,000 rows, the total number of rows scanned is 100,000 x 100,000 = 10 billion.
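To make the process concrete, here is a minimal Python sketch of the SLJ idea, assuming the two tables are simply lists of dictionaries with an a field (illustrative pseudocode only, not MySQL's internal implementation):

```python
def simple_nested_loop_join(t1, t2):
    """Simple Nested-Loop Join: for every row of the driving table t1,
    do a full scan of the driven table t2 and compare the join field a."""
    result = []
    for row1 in t1:               # full scan of t1: N rows
        for row2 in t2:           # full scan of t2 for EVERY t1 row: M rows
            if row1["a"] == row2["a"]:
                result.append({**row1, **row2})
    return result                 # rows scanned in total: N * M

# With 100,000 rows in each table, the inner comparison runs
# 100,000 * 100,000 = 10,000,000,000 times -- the 10 billion rows mentioned above.
```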
Index Nested-Loop Join
This query process is similar to a nested query: the index on the driven table can be used during the lookup, so the algorithm is abbreviated NLJ (Index Nested-Loop Join).
NLJ execution process
select * from t1 straight_join t2 on t1.a = t2.a;
Here an index has been created on the a field of table t2.
straight_join is used to prevent the MySQL optimizer from choosing the driving table itself, so the join order is fixed.
In this SQL statement, t1 is the driving table and t2 is the driven table. The execution flow of this statement is as follows:
- Read a row from table t1
- Take the a field from that row and get its value
- Use the value of a to search the index tree on t2's a field, obtaining the primary key id of the matching t2 row
- Use that primary key id to search t2's primary key index tree and fetch the whole row
- Put the row from t1 and the row from t2 together into the result set
- Repeat steps 1-5 until the scan of table t1 is complete
Time complexity
During execution, the time complexity of NLJ is mainly composed of two parts:
- A full scan of the driving table t1, which has N rows
- For each driving row, a lookup in the driven table: searching two index trees (the secondary index on a, then the primary key index) costs roughly 2*logM, so N rows cost N*2*logM in total
The total time complexity is therefore N + N*2*logM
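As a rough sketch of NLJ in Python, the secondary index on t2.a and t2's primary key index are modeled as plain dictionaries (stand-ins for the real B+ trees, whose per-lookup cost of about logM each is where the 2*logM term comes from; the variable names are assumptions for illustration):

```python
def index_nested_loop_join(t1, t2_index_a, t2_primary):
    """Index Nested-Loop Join: for each row of the driving table t1,
    find matching t2 rows through the index on t2.a.

    t2_index_a : maps an a value -> list of t2 primary key ids (secondary index on a)
    t2_primary : maps a primary key id -> full t2 row (primary key index)
    """
    result = []
    for row1 in t1:                                  # full scan of t1: N rows
        for pk in t2_index_a.get(row1["a"], []):     # search the index tree on a (~logM)
            row2 = t2_primary[pk]                    # look up the primary key index (~logM)
            result.append({**row1, **row2})
    return result
```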
Small tables drive large tables
Obviously N has a far greater impact on the overall time complexity than M, so N should be kept as small as possible. When two tables are joined, the small table should therefore be the driving table: the time complexity is lower and the query is faster (a quick numeric comparison is sketched below).
Of course, the premise of "small table drives large table" is that the index on the driven table can be used; when it cannot, the time complexity is a different story.
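A quick back-of-the-envelope comparison using the N + N*2*logM formula, with made-up sizes of 100 rows for the small table and 100,000 rows for the large table:

```python
import math

def nlj_cost(n_driving, m_driven):
    """Approximate NLJ cost: a full scan of the driving table plus
    two index-tree searches (~log2(M) each) per driving row."""
    return n_driving + n_driving * 2 * math.log2(m_driven)

small, large = 100, 100_000
print(round(nlj_cost(small, large)))   # small drives large: about 3.4 thousand
print(round(nlj_cost(large, small)))   # large drives small: about 1.4 million
```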
Block Nested-Loop Join
To avoid the huge number of rows scanned by the SLJ algorithm, MySQL uses the BLJ (Block Nested-Loop Join) algorithm to join two tables when the driven table has no usable index.
BLJ execution process
select * from t1 straight_join t2 on t1.a = t2.a;
No index exists on column a of either t1 or t2.
- Read the data of table t1 into the thread-level memory area join_buffer; because the statement is select *, all fields of t1 are put into memory
- Scan table t2: take each row of t2 and match it against the t1 data in join_buffer; if the join condition is satisfied, add the row to the result set
- Repeat step 2 until the scan of table t2 is complete
Time complexity
During BLJ execution, both t1 and t2 are fully scanned once, so the number of scanned rows is N + M, and every row of t2 is compared against every buffered t1 row, so the number of comparisons is N x M. Although BLJ performs the same N x M comparisons as SLJ, the table scans amount to only N + M rows, which is almost negligible, and the N x M comparisons are done in memory, which is much faster, so the performance is good.
Whether t1 or t2 is used as the driving table, the time complexity is nearly the same; the only difference is which table's data is loaded into join_buffer.
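A minimal Python sketch of BLJ for the case where the whole driving table fits into join_buffer, which is modeled here simply as an in-memory list (illustrative only):

```python
def block_nested_loop_join(t1, t2):
    """Block Nested-Loop Join with a join_buffer large enough to hold all of t1."""
    join_buffer = list(t1)            # step 1: load every row (all fields) of t1 into memory
    result = []
    for row2 in t2:                   # step 2: a single full scan of t2 (M rows)
        for row1 in join_buffer:      # in-memory comparison against every buffered t1 row
            if row1["a"] == row2["a"]:
                result.append({**row1, **row2})
    return result
# Rows scanned from the tables: N + M; in-memory comparisons: N * M.
```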
Segmented query
Whether the driving table is large or small, the time complexity is the same; the difference is whether the driving table's data fits into join_buffer. When it does not fit, MySQL's strategy is to query in segments.
The size of join_buffer is set using the join_buffer_size parameter. The default size is 256K
Segmented query procedure
If the driving table's data is larger than join_buffer, the query is performed in segments:
- Scan table t1 and read rows sequentially into join_buffer; when join_buffer is full, stop and go to step 2
- Scan table t2: take each row of t2 and match it against the data in join_buffer; rows that satisfy the join condition are added to the result set
- Empty join_buffer
- Continue scanning table t1, reading the next batch of rows sequentially into join_buffer, and repeat steps 2 and 3 until t1 is exhausted
As you can see, join_buffer is reused from one segment to the next; a sketch of this segmented process is shown below.
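A minimal Python sketch of the segmented process, with join_buffer modeled as a chunk of driving-table rows of fixed capacity (capacity measured in rows rather than bytes, purely for simplicity):

```python
def block_nested_loop_join_segmented(t1, t2, buffer_rows):
    """Segmented BLJ: t1 is loaded into join_buffer in chunks of buffer_rows,
    and t2 is fully re-scanned once for every chunk."""
    result = []
    for start in range(0, len(t1), buffer_rows):
        join_buffer = t1[start:start + buffer_rows]   # fill join_buffer until full
        for row2 in t2:                               # re-scan t2 for this segment
            for row1 in join_buffer:                  # in-memory matching
                if row1["a"] == row2["a"]:
                    result.append({**row1, **row2})
        # join_buffer is emptied here and reused by the next segment
    return result
# With K = ceil(N / buffer_rows) segments, the rows scanned are N + K * M.
```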
Time complexity of segmented query
In a segmented query, the in-memory matching efficiency of join_buffer is still used, but table t2 has to be scanned multiple times, once for each segment.
Assume table t1 has N rows and must be split into K segments to scan completely, and the driven table t2 has M rows.
Then the time complexity is:
- The number of scanned rows is N + K x M
- The number of in-memory comparisons is N x M
Obviously the number of scanned rows depends on K, the number of segments. With join_buffer fixed in size, the more data the driving table holds, the more segments are needed and the more rows are scanned.
Small tables drive large tables
Therefore, from the time complexity above, the fewer rows the driving table has, the fewer rows are scanned in total, while the number of in-memory comparisons stays at N x M, so the query is more efficient.
Of course, the premise is that the driving table is too large to be loaded into join_buffer in one go. If join_buffer is large enough to hold the whole driving table at once, the choice of driving table barely affects query efficiency.
That is why it is often recommended to increase join_buffer_size; the sketch below illustrates how the driving table size and join_buffer_size affect the number of segments and scanned rows.
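A rough numeric sketch of this effect; the 100-byte row size and the table sizes below are made-up example values:

```python
import math

def blj_scanned_rows(n_driving, m_driven, row_bytes, join_buffer_size):
    """Scanned rows for segmented BLJ: N + K * M, where K is the number of
    join_buffer segments needed to hold the driving table."""
    k = math.ceil(n_driving * row_bytes / join_buffer_size)
    return k, n_driving + k * m_driven

buf = 256 * 1024                                        # the default join_buffer_size
print(blj_scanned_rows(10_000, 100_000, 100, buf))      # small drives large: K=4,  410,000 rows
print(blj_scanned_rows(100_000, 10_000, 100, buf))      # large drives small: K=39, 490,000 rows
print(blj_scanned_rows(100_000, 10_000, 100, 4 * buf))  # 1 MB buffer:        K=10, 200,000 rows
```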
Definition of small tables
What is a small table?
Fewer rows remain after filtering by the condition
select * from t1 straight_join t2 on t1.a = t2.a where t2.id < 50;
Table t1 has 10,000 rows, while table t2 has 100,000 rows.
This SQL query only involves the rows of t2 whose id is less than 50, i.e. at most the first 50 rows, so in this query t2 is the "small table" relative to t1.
Fewer fields are queried
select t1.b, t2.* from t1 straight_join t2 on t1.a = t2.a;
Assume only 100 rows of t1 and 100 rows of t2 participate in the join.
In the SQL statement above, t1 should be treated as the "small table", because t1 only needs to put field b into join_buffer, whereas t2 would have to put all of its fields into join_buffer.
Therefore, when deciding which table is the "small table", do not just look at the total row count; take the SQL filter conditions into account, and after filtering also consider the queried fields. The table that puts the smaller amount of data into the join is the "small table".
In short, during a MySQL join query the small table should always be chosen to drive the large table: this reduces the time complexity of the query and improves efficiency, and it matters most when the tables involved hold a large amount of data.