Today we are going to talk about the multi-table joint query in wechat. Should small table drive big table or big table drive small table?

1. in VS exists

Before we begin our analysis, let’s take a look at the keywords in and exists.

Suppose I now have two tables: employee table and department table, each employee has a department, the employee table holds the ID of the department, and this field is the index; The department table contains the id and name of the department. Id is the primary key and name is the unique index.

Here I will directly use the VHR table to do the experiment, not to give you a separate database script, partners can check the VHR project (github.com/lenve/vhr) get…

Suppose I now want to query all employees in the technical department. I have the following two query methods:

The first way to query is to use the in keyword:

select * from employee e where e.departmentId in(select d.id from department d where d.name='Technology') limit 10;
Copy the code

This SQL is easy to understand, I believe you can understand. SQL > select * from department; select * from department; select * from department;

As can be seen, first query the department table, there is an index to use the index, there is no index to scan the whole table, and then query the employee table, also use the index to query, overall efficiency is relatively high.

The second is to use the exists keyword:

select * from employee e where exists(select 1 from department d where d.id=e.departmentId and d.name='Technology') limit 10;
Copy the code

SQL > select * from ‘in’ where ‘in’ = ‘in’;

It can be seen that the employee table is scanned in the whole table first, and then departmentId in the employee table is taken for data comparison. If the subquery returns a value, it is true; if it does not return a value, it is false; if it is true, the employee record is retained; if it is false, the employee record is discarded. So you can use SELECT * instead of SELECT *, you can change it to SELECT 1 or whatever, MySQL official says that it will ignore the SELECT list in actual execution, so it doesn’t make much difference what you write.

Comparing the number of rows scanned in the two query plans, we can roughly see the difference, which is slightly more efficient using IN.

If the in keyword is used to query, the department table and then the employee table, generally speaking, the data in the department table is smaller than the data in the employee table, so this is the small table to drive the large table, relatively high efficiency.

If the exists keyword is used, the employee table is queried first and then the department table. Generally, the data in the department table is smaller than that in the employee table. Therefore, a large table drives a small table, which is inefficient.

In short, is to small table drive large table efficiency is high, large table drive small table efficiency will be lower. Therefore, if the amount of data in the department table is greater than that in the employee table, the efficiency of using the exists keyword in the above two TYPES of SQL is relatively high.

2. Why do small tables drive large tables

In MySQL, the principle of this multi-table joint query is: based on the data of the driven table, through a nested loop similar to our Java code to match the records of the driven table.

Take the table in section 1 as an example. Assume that our employee table E is a large table with 10000 records. Department table D is a small table with 100 records.

Suppose D drives E, the execution flow would look something like this:

For 100 departments {match 10000 employees (B+ tree search)}Copy the code

So the total number of lookups is 100 plus log base 10,000.

Assuming E drives D, the execution flow would look something like this:

For 10000 employees {match 100 departments (B+ tree search)}Copy the code

So the total number of lookups is 10,000 plus log base 100.

From the comparison of the two data we can see that the efficiency of the small table to drive the large table is high. The core reason is that when searching a driven table, it is usually indexed, and the index search is much faster and requires fewer searches.

3. What if there is no index?

The conclusion drawn in the previous section 2 is based on the premise that the fields associated with the driven table and the driven table are indexed. For example, table E stores the departmentId field, which corresponds to the ID field in table D, and the ID field is the primary key index in table D. If id is not a primary key index, it is a normal field. It doesn’t really matter if E is driving D or D is driving E.

In this case, MySQL uses an algorithm called Block nested-loop Join (BNL for short), where no indexes are available on the driven table:

  1. Read table E into join_buffer.
  2. Scan table D and compare each row in table D with the data in JOIN_buffer. Those that meet join conditions are returned as part of the result set.

Friends to see, if I put in the table E indexes on the departmentId field delete, put D id field in the table on the primary key index is deleted, we take a look at the following SQL execution plan:

As can be seen, both table E and table D are full table scans. In addition, it should be noted that these comparison operations are in memory, so the execution efficiency is OK.

But now that I’ve read all the data into memory, can I put it down? What if I can’t fit it in memory? Table Extra contains a Using join buffer (Block Nested Loop), resulting in a Nested Block. So it’s pretty clear what this means, if you can’t fit it in memory at a time, you read it in chunks, you read one part of it into memory, and then you read the other part into memory.

To check the size of join_buffer, run the following command:

262144/1024=256KB

The default size is 256 KB.

I will now increase this value and look at the new execution plan as follows:

Using join buffer (Block Nested Loop) is no longer displayed.

To sum up:

  • If join_buffer is large enough to read all the data into memory at once, it doesn’t matter whether the large table drives the small table or the small table drives the large table.
  • If join_buffer is limited in size, it is recommended that the small table drive the large table, so that even if the blocks are read, the number of reads is less.

But to be honest, this kind of multi-table joint query without index is inefficient and should be avoided as much as possible.

To sum up, it is recommended that small tables drive large tables when multiple tables are combined.