“This is the 15th day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021.”

preface

I remember once going to an interview, the interviewer asked a question about the database, the general content of the question is as follows:

Interviewer: Do you usually write SQL?

Me: Sure, this is not an essential skill.

Interviewer: So you know there is a correlation between multiple tables. How do you check the data?

Me: JOIN.

Interviewer: Does it work?

I: Thief easy to use, I especially LEFT JOIN, I use thief proficiently, write every day.

Interviewer: Emm…… Go home and wait for the announcement.

I don’t know if you have encountered this problem, or how you usually deal with the data associated with multiple tables, whether you also use JOIN statement to deal with it.

Instead of defining what’s wrong with the JOIN statement, let’s look at a simple example.

Preparing Test data

Here I use mysql database, because the test requires us to create two tables T/T2, the statement is relatively simple

CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  `a` int(11) NOT NULL,
  `b` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `a` (`a`),
  KEY `b` (`b`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
​
CREATE TABLE `t2` (
  `id` bigint(19) NOT NULL,
  `a` int(11) DEFAULT NULL,
  `c` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Copy the code

I’m going to use the function to create some test data, so I’m going to create 10 million in table T and 10 thousand in table T2.

CREATE DEFINER=`root`@`localhost` PROCEDURE `testdata`()
begin
 declare item int;
 set item=1;
 while(item<=10000000)do
 insert into t values(item, item, item);
 set item=item+1;
 end while;
end
Copy the code

JOIN the test

Write a sentence that elementary school students can test:

select t.*,t2.* from t left join t2 on t.a = t2.a 
Copy the code

emm…… I don’t know how long I have to wait, but still no end, so I added a limit condition, only 10 data query

Select t.*,t2.* from t left join t2 on t.a = t2.a limit 1000010,10Copy the code

This time the wait was much shorter, consuming 19.99 seconds in total

At this speed, I feel like I can carry my baggage home and farm…

Problem analysis

To understand why it takes so long, you need to understand how the query process is implemented before you can conclude whether a JOIN can be used at all.

First, take a look at the execution plan

Three important parameters are as follows:

  • Type: determines whether a full table scan or an index scan is performedALL, that is, full table scan;
  • Rows: Minimize the number of rows that need to be scanned to find the final result;
  • Extra: Additional information that is not suitable for other columns, which is the focus of this exampleBlock Nested Loop.

As we can see from the above execution plan, in order to get to the final 10 pieces of data, we need to scan nearly 10 million rows of data, which is very time-consuming and wasteful.

The whole process can be understood as the following steps:

  1. Read a piece of data R from table T;
  2. Compare field A in R with table T2;
  3. Load all the data of table T into the memory (join_buffer), take out the data of table T2 row by row, compare it in the memory and find that the conditions meett.a = t2.aTo form the result set;
  4. Repeat the above steps until all the data in table T has been traversed.

Time complexity:

  • The time complexity of a single query is M+N, where M drives table data and N is driven table data.
  • Total time complexity: M*N.

A preliminary solution

Alter table T2 alter table T2 alter table T2 alter table T2 alter table T2 alter table T2 alter table T2

Ok, let’s try adding an index and see how long it takes to query the same SQL.

Oh my God, this is not one point two points fast, now it only takes 0.04 seconds to query a time, almost fly

Let’s look at the execution plan:

Mysql > alter TABLE T2; alter table T2; alter table T2;

  1. Read a piece of data R from table T;
  2. Compare field A in R with table T2;
  3. According to index A, the match satisfies the conditiont.a = t2.aTo form the result set;
  4. Repeat the above steps until all the data in table T has been traversed.

As you can see, the only difference in step 3 is the search for the driven table T2. For each step 3, you only need to perform two searches, i.e. search index A and primary key index, and then find the data in the desired driven table.

Time complexity:

  • The time complexity of a single query is 1 + 2 * log2N, where N is the number of rows in the driven table T2.
  • Total time complexity: M * (1 + 2 * log2N)*, M drive table data, N driven table data.

Comparative analysis of JOIN algorithms

Above, two JOIN processes are analyzed, namely JOIN process with index and JOIN process without index:

  • Simple Nested-Loop Join

The algorithm is nested loop, read in turn drive the data in the table, for each of the data, and by comparing the driver table again, such as in this case we driver table data (10000000) * driven table data (10000) = 10 quintillion time complexity, Of course, no database will use this algorithm.

  • Block Nested-Loop Join

The algorithm is that when we have no new index, MySQL’s InnoDB engine will use this algorithm. The specific process is described above, and the time complexity is the same as the above algorithm, which is M * N.

  • Index Nested-Loop Join

This algorithm is used by MySQL after indexes are added. The detailed process has been described above.

JOIN can be used

If we use the Index nested-loop Join algorithm, it does not have much impact and can be used in most scenarios.

If Block nested-loop Join algorithm is used, a large number of rows will be scanned, especially tens of millions or even hundreds of millions of data in this example. As a result, the driven table will be scanned for many times, occupying a large number of system resources and cache resources. Therefore, it is not recommended to use this algorithm.

As for the question mentioned in the beginning of the interviewer, in fact, it really happened, the interviewer does not expect the answer is to process the table associated data in memory, in the SQL execution only query single table, in fact, this time complexity does not reduce, but will increase the NUMBER of SQL execution, I do not know how to use it?

In the next chapter we will look at how to optimize a JOIN when it is used. And what is the join_buffer in this example, and how to optimize it? And other effective means to improve query efficiency.