background

Because of the way developers use it, all kinds of strange SQL can be written. If you write a query the way SQL is written, it can be a waste of performance. Therefore, MySQL uses rewriting rules to optimize SQL for efficient execution, a process called query rewriting.

Rewrite rules

1. Conditional simplification

1.1 Remove unnecessary symbols

select * from (t1, (t2, t3)) where t1.a = t2.a and t2.b = t3.b
Copy the code

The optimizer removes unnecessary parentheses as follows:

select * from t1, t2, t3 where t1.a = t2.a and t2.b = t3.b
Copy the code

1.2 Constant Transfer

For example, if a = 5 and b > a, a has been fixed to be equal to 5, and b > a can be directly changed to b > 5.

1.3 Remove useless conditions

(a < 1 and b = b) or (a = 6 or 5 ! = 5)

B = b must always be true, 5! = 5 must always be false, so the expression is simplified as follows:

(a < 1 and TRUE) or (a = 6 or FALSE)

Further simplification:

a < 1 or a = 6

1.4 Expression Calculation

A is equal to 5 plus 1, 5 plus 1 is a constant, so it’s going to be reduced to a is equal to 6.

But the optimizer does not reduce a column if it does not exist in a separate form as an operand of the expression, such as ABS(a) > 5 or -a < -8.

The having clause is merged with the WHERE clause

If the aggregate functions sum, Max, group by, and so on are not present in the query, the optimizer merges the WHERE and HAVING clauses.

1.6 Constant scale test

MySQL considers the following two types of queries to be very fast:

  1. There is no or only one record in the query table

    How do you know how much data is in the table before you query it? Innodb statistics are not accurate, so this is only suitable for MyISAM or Memory engine tables.

  2. Use primary key equivalence matching or unique secondary index equivalence matching as query conditions

MySQL refers to these two types of tables as constant tables. When the optimizer analyzes a query statement, it first performs a constant table query, then changes the conditions involved in the table to constants, and finally analyzes the query cost.

For example, select * from table1 inner join table2 on table1.column1 = table2.column2 where table1.primary_key=1

Because table1 is queried with the primary key, so table1 can be used as a constant table, in the analysis of the query cost of Table2, the first query for Table1, and then the statement designed to replace all the query conditions for Table1, as follows:

Select table2.* from table1 inner join table2 on table1.column1 constant value = table2.column2

2. The external connection is removed

The position of the driver table and the driven table of the inner join can be converted, while the outer join is fixed, so the inner join can reduce the query cost by optimizing the order of the join table, while the outer join cannot optimize the order of the join table.

The essential difference between an outer join and an inner join is that if the outer join driven table does not find the corresponding record, the driven table record is still placed in the result set, and the driven table value is filled with NULL.

In a reject-NULL case, the outer join is eliminated, the optimizer can turn the outer join into an inner join, and then find the least expensive execution plan based on the different join order. The null-rejected statements are as follows:

  1. select * from t1 left join t2 on t1.x1 = t2.y2 where t2.y2 is not null

  2. select * from t1 left join t2 on t1.x1 = t2.y2 where t2.y2 = 2

Since the above statement does not allow null values in the driven table T2 in the WHERE condition, the same effect can be achieved using an inner join. In this case, the optimizer will do the inner join conversion for us to optimize the execution cost.

3. Sub-query optimization

An SQL statement can contain another SQL statement. The contained statement is called a subquery. The outer statement is called an outer query.

3.1 Location of subquery

1. The select clause

Select (select * from T1).

2. The from clause

Select * from (select * from T1) as t, from (select * from t1) as t;

3. Where or on clause

Select * from t1 where xx in (select xx from t2); select * from t1 where xx in (select xx from T2); select * from T1 where xx in (select xx from T2);

3.2 Subquery classification

Query by result set area molecule

Subqueries can be categorized based on the result set returned by them.

1. Scalar quantum query

A subquery that returns only a single value (scalar) is called a scalar quantum query, for example:

select (select m1 from t1 limit 1)

select * from t1 where m1 = (select min(m2) from t2)

A scalar query can appear as part of a single value or expression anywhere in a query statement.

2. Sub-query

Similar to a scalar quantum query, except that a scalar quantum query returns only one column, and a row subquery returns only one row with multiple columns, for example:

select * from t1 where (m1, n1) = (select m2 ,n2 from t2 limit 1)

Columns M1 and N1 in the T1 table need to be equal to m2 and N2 in the subquery, respectively.

3. Display subqueries

The column subquery queries only one column, but has multiple rows.

select * from t1 where m1 in (select m2 where t2)

4. Table subquery

Table subqueries return multiple columns, multiple rows.

select * from t1 where (m1, n1) in (select m2 ,n2 from t2 limit 1)

2. Perform the molecular query with the outer query relationship zone

1. Irrelevant sub-query

A subquery that can run independently without relying on the value of the outer query is called an irrelevant subquery.

2. Related sub-query

Subqueries that depend on the value of the outer query are called correlation subqueries.

select * from t1 where m1 in (select m2 from t2 where t2.xx = t1.xxx)

The query criteria for the T2 table depend on the value of T1

3. Use of subqueries in Boolean expressions

The most common scenario for using subqueries is to place them in a WHERE, on clause as a Boolean expression.

1.comparison_operator

Comparison_operator is =, >, <, >=, <=, <>,! =, <=> as operators of expressions, the syntax is as follows:

Operand comparison_operator (subquery)

The operand can be a column name, constant, expression, subquery, but must be a row subquery or scalar query, as follows:

select * from t1 where m1 < (select min(m2) from t2)

select * from t1 where (m1, n1) = (select m2, n2 from t2 limit 1)

2. in/not in/any/some/all

For column and table subqueries, because there are many records in the result set, a Boolean expression cannot be formed using the comparison_operator. Therefore, in/not in/any/some/all operations are required.

  1. In/not in, syntax: operand [not] in subquery

    Seelct * from t1 where (m1,n1) in (select m2,n2 from t2)

    Select * from t1 where columns M1 and N1 exist in the subquery result set.

  2. Any /some(the same meaning), syntax: operand comparison_operator any/some(subquery)

    When a comparison_operator is true, the entire expression is true. Otherwise, the expression is false.

    select * from t1 where t1.xx > any(select m2 from t2)

    If t1.xx is greater than any of the entries in the subquery, it will qualify.

    select * from t1 where t1.xx > (select min(m2) from t2)

    =any has the same meaning as in.

  3. All, syntax: operand comparison_operator all(subquery)

    If the comparison_operator is true, the entire expression is true. If the comparison_operator is true, the expression is false.

    select * from t1 where t1.m1 > all(select m2 from t2)

    This means that the m1 of T1 must be greater than all the results in the subquery for the expression to be valid, equivalent to:

    select * from t1 where t1.m1 > select max(m2) from t2

3. [not]exists Indicates a sub-query

If you only need to determine whether a subquery has a record, and do not care what the record is, you can put [not]exists before a subquery:

select * from t1 where exists (select 1 from t2)

The expression holds as long as the subquery returns a record, regardless of what the record is.

4. Precautions for subquery syntax

  1. Subqueries must be enclosed in parentheses

  2. The subquery in the SELECT clause must be a scalar quantum query

  3. To use a scalar query/row subquery without ensuring that only one record is returned, apply the limit 1 limit

  4. Limit statements are not allowed in/not in/any/some/all subqueries

    Since these subqueries do not allow limit statements, it does not make sense to use order BY, distinct, or group by clauses without aggregate functions and having clauses in subqueries, such as:

    Select * from t1 where m1 in (select m2 from T2 order by m2) select * from t1 where m1 in (select m2 from T2 order by m2

    Select * from T1 where m1 in (select distinct m2 from T2) select * from T1 where m1 in (select distinct m2 from T2

    Select * from t1 where m1 in (select m2 from T2 group by m2)

    For those clauses that do not affect the result, the optimizer simply removes them.

  5. It is not allowed to delete records of a table in a statement while performing subqueries on the table

    Delete from t1 where m1 < (select Max (m1) from t1)

4. How is the subquery executed

1. Execution mode of irrelevant subquery

select * from s1 where key1 in (select common_field from s2)

  1. Execute subqueries separately
  2. Take the result of the subquery as the outer query parameter and execute the outer query

2. Related sub-query execution modes

select * from s1 where key1 in (select common_field from s2 where s1.key2 = s2.key2)

  1. Start by fetching a record from the external query
  2. Retrieves the subquery value from the external query and executes the subquery
  3. If the sub-query results match where, the records are added to the result set; otherwise, the records are discarded
  4. Repeat Step 1

3. In sub-query optimization

1. Dichotomy optimization

If for key1 in (2,35,1,3,5,……) In this statement, several parameters in the IN will be sorted first, if the query cannot use the index to form a number of scan intervals, then the sorted parameters will be binary search, accelerate the efficiency of the IN expression.

2. In subquery optimization

Materialized table

Select * from t1 where key1 in (select m1 from s2 where key2 = ‘a’)

If the subquery returns too many results, it will be very inefficient, and even memory may not be able to hold so many results. In this case,

MySQL proposes the concept of materialized table, that is, the results of sub-query into a temporary table (also known as materialized table), the characteristics of temporary table are as follows:

  1. The columns in the temporary table are the columns in the subquery
  2. Records in temporary tables will be de-duplicated

Deduplication of temporary tables by creating unique indexes, all columns are treated as joint unique indexes. Deduplication can save space for temporary tables without affecting the results.

In general, if the temporary table is not too large, it will be stored in Memory, that is, Memory table, and set up a hash index for it. Through the hash index, the corresponding results in the temporary table can be searched faster.

If the temporary table is too large (larger than tMP_TABLE_size or max_HEAP_TABLE_size), the index structure is also converted to a B+ tree, through which records can also be found quickly.

Materialized table to connect

SQL > alter table SQL > alter table SQL > alter table SQL > alter table SQL > alter table SQL > alter table SQL > alter table SQL

select * from t1 where key1 in (select m1 from s2 where key2 = 'a')

If key1 is present in t1 and key1 is present in m1, add result set to t1.

At this point, you can transform the sub-query into an inner join query, and after turning into a join query, you can calculate the cost of the connection and choose the optimal execution plan.

The cost calculation with T1 as the driver table is as follows:

  1. The cost of subquerying the materialized table
  2. Cost of scanning table S1
  3. Table S1 records * the cost of single-table access to materialized tables through KEY1

The cost calculation of using the physical and chemical table as the driving table is as follows:

  1. The cost of subquerying the materialized table
  2. Cost of scanning physical tables
  3. Materialized tables record * the cost of single-table access to t1 tables through M1

MySQL will choose the lower-cost option to execute.

Half a connection (semi – join)

Materialized tables can convert subqueries into join queries. MySQL has made further optimization on the basis of materialized tables, that is, it does not create temporary tables, and directly converts subqueries into join queries.

select * from t1 where key1 in (select m1 from s2 where key2 = 'a')

Select * from t1 where key1 exists in column M1 and select * from s2 where key1 exists;

select t1.* from t1 innert join s2 on t1.key1 = s2.m1 where s2.key2 = '1'

This seems to satisfy the convergent trend, but three scenarios need to be considered:

  1. For T1 table, if no one in s2 result set meets the on condition, the result set will not be added
  2. For T1 table, if there is only one match in s2 result set, add it to the result set
  3. For T1 table, if there are multiple qualified entries in s2 result set, the record will be added to the result set for multiple times

For the inner join of cases 1 and 2, both meet the requirements of the face-saving query, but for result 3, only one record will appear in the sub-query, while multiple records will appear in the join query.

Therefore, the two are not completely equivalent, but the effect of join query is very good, so MySQL introduced the concept of semi-join.

MySQL > select * from t1; select * from t1; select * from t1; select * from t1;

Select t1.* from t1 semi join s2 on t1.key1 = s2.m1 where s2.key2 = ‘1’

Semi-connection can be implemented in the following ways:

  1. Table pullout in subquery

When a subquery has only a primary key or a unique secondary index in the column, the subquery is directly promoted to the outer layer to do join queries.

Select * from t1 where key1 in (select m1 from s2 where key2 = ‘a’)

If M1 is the primary key or unique index of S2, then the statement is optimized directly to:

select t1.* from t1 inner join s2 on t1.key1 = s2.m1 where s2.key2 = '1'

Since the primary key or unique index in the subquery itself is not duplicated, the join can be promoted directly without duplicate values.

  1. Duplicate Weedout

Select * from t1 where key1 in (select m1 from s2 where key2 = ‘a’)

If m1 is not the primary key or unique index of S2, then the optimized statement could still be:

select t1.* from t1 inner join s2 on t1.key1 = s2.m1 where s2.key2 = '1'

But we need a temporary table to help:

create table tmp (
    id int primary key
);
Copy the code

In this way, whenever a record of S1 is added to the result set, the id of the record is put into the temporary table. If the add fails, it means that the duplicate data has been added before, and it is discarded directly. In this way, the data will not be duplicated.

  1. Loose scanning (LooseScan)

select * from s1 where key3 in (select key1 from s2 where key1 > 'a' and key1 < 'b')

For the above query, assuming key1 is the index of S2, the optimization to half-join can be, scan key1 index, as shown below:

We get the value matching the search conditions in key1 index, and only take the first matching value and S1 to match, and then add it to the final result set. Although this is a scan index, only take the first matching record, which is called loose scan.

  1. Semi-join Materialization

Semi-join materialization is the way materialized table joins mentioned earlier.

  1. FirstMatch

The first match is the related subquery execution mentioned above.

For related subqueries:

select * from t1 where key1 in (select m1 from s2 where t1.key3 = s2.key3)

Can be easily converted to a semi-connection:

select * from t1 semi join s2 on t1.key1 = s2.m1 and t1.key3 = s2.key3

Semi-join can be used to eliminate duplicate values, loose scan, first match these several ways to query, but it needs to be noted that the related subquery cannot use materialized table, because the condition of the materialized table is that the subquery does not depend on the value of the outer query, can directly directly separate subquery into materialized table.

3. Applicable conditions of semi-connection

Not all in subqueries can be converted to half-join, only the following types can:

select ... from outer_tables where expr in (select ... from inner_tables ...) and ...

select ... from outer_tables where (o1,o2,...) in (select o1,o2,... from inner_tables ...) and ...

Text summary:

  1. The subquery must be a Boolean expression combined with the IN operator and appear in the WHERE or on clause of the outer query
  2. Outer queries can have other query conditions, but must be concatenated with the IN operator using the AND operator
  3. A subquery must be a single query, not several subqueries of a union join
  4. Subqueries cannot contain group by, HAVING statements, or aggregate functions

4. Semi-join inapplicable conditions (SQL that cannot be converted to semi JOIN)

  1. In the WHERE condition of the outer query, there are other search conditions that use Boolean expressions consisting of the OR operator and the IN subquery

select * from s1 where key1 in (select m1 from s2 where key3 = 'a') or key2 > 100

  1. Use not in, not in

select * from s1 where key1 not in (select m1 from s2 where key3 = 'a')

  1. In the SELECT clause

select key1 in (select m1 from s2 where key3 = 'a') from s1

  1. Subqueries contain group by, HAVING, or aggregate functions

select * from s1 where key1 not in (select count(*) from s2 group by key1)

  1. Subqueries include union

select * from s1 where key1 in (select m1 from s2 where key3 = 'a' union select m1 from s2 where key3 = 'b')

Optimization for subqueries that cannot be converted to semi-join

1. The physical and chemical

select * from s1 where key1 not in (select m1 from s2 where key3 = 'a')

For not in, you can materialize the subquery directly, and then determine whether the result is in the materialized table.

2. To the exists

An IN subquery in a WHERE or on clause, whether related or unrelated, can be converted to an EXISTS subquery.

outer_expr in (select inner_expr from ... where subquery_where)

Can be converted to

exists (select inner_expr from ... where subquery_where and outer_expr = inner_expr)

The advantage of converting to EXISTS is that indexes can be used effectively, for example:

select * from s1 where key1 in (select key3 from s2 where s1.common_field = s2.common_field) or key2 > 1000

Select * from table_name where key1, key3, and common_field are not indexes; select * from table_name where key1 and key3 are indexes;

select * from s1 where exists (select 1 from s2 where s1.common_field = s2.common_field and s1.key1 = s2.key3) or key2 > 1000

If in does not qualify as a semi-join, the lower-cost option between the subquery materialization and the exists transformation will be executed.

3. Any /all subquery optimization

For any/all subqueries that are not related, we can generally convert them to the familiar approach:

Original expression convert
< any(select inner_expr …) < (select max(inner_expr …) )
> any(select inner_expr …) > (select min(inner_expr …) )
< all(select inner_expr …) < (select min(inner_expr …) )
> all(select inner_expr …) > (select max(inner_expr …) )

4. [not]/exists Subquery optimization

For non-related subqueries in [not]/exists, execute the subquery first and then rewrite the original query statement as follows:

select * from s1 where exists (select 1 from s2 where key1 = 'a') or key2 > 100

If the result of the subquery is true, the original statement can be rewritten as follows:

select * from s1 where exists true or key2 > 100

Further simplification:

select * from s1 where exists true

The related subquery of [NOT]/ EXISTS is executed in the same way as the previous related subquery. If the condition in the subquery is an index, the query speed is accelerated.

5. Optimization of derived tables

Derived tables are typically written to an internal temporary table, which is then queried as a normal table.

1. Delayed materialization

The following SQL:

select * from (select * from s1 where key1 = 'a') as derived_s1 innert join s2 on derived_s1.key1 = s2.key1 where s2.key2 = 1

In the SQL WHERE condition, it is determined whether s2’s key2 is equal to 1. In order to reduce the generation of temporary tables, it will first look up whether S2 has key2=1. If not, it will not generate derived tables.

2. Merge derived table with outer table

The following SQL:

select * from (select * from s1 where key1 = 'a') as derived_s1

The above SQL can be directly optimized as:

select * from s1 where key1 = 'a'

SQL > alter table SQL > alter table SQL

select * from (select * from s1 where key1 = 'a') as derived_s1 innert join s2 on derived_s1.key1 = s2.key1 where s2.key2 = 1

If there is data in S2, we can merge the derived table with the outer query and put the derived table’s search criteria into the outer query’s search criteria:

seelct * from s1 innert join s2 on s1.key1 = s2.key1 where s1.key1 = 'a' and s2.key2 = 1

The derived table was successfully eliminated here.

A derived table cannot be merged with an outer query if it contains the following functions or statements:

  1. Aggregate functions: Max, min, sum, etc
  2. distinct
  3. group by
  4. having
  5. limit
  6. The union and union all
  7. The subquery SELECT clause corresponding to the derived table contains another subquery