Rule-based optimization

  • Mysql > select * from ‘Mysql’

1. Conditional simplification

Remove unnecessary parentheses

Sometimes expressions have a lot of useless parentheses, such as this:

((a = 5 AND b = c) OR ((a > c) AND (c < 5)))
Copy the code

It’s annoying to watch, because the optimizer will kill the parentheses that are not needed, and that’s it:

(a = 5 and b = c) OR (a > c AND c < 5)
Copy the code

Constant Propagation

Sometimes an expression is an equivalent match between a column and a constant, for example:

a = 5
Copy the code

When this expression is concatenated with other expressions involving column A using AND, the value of a in the other expressions can be replaced with 5, as in:

a = 5 AND b > a
Copy the code

Can be converted to:

a = 5 AND b > 5
Copy the code

Tip: Why can’t an OR concatenated expression be passed a constant? Think for yourself

Equality_Propagation

Sometimes there is an equivalence match between multiple columns, for example:

a = b and b = c and c = 5
Copy the code

This expression can be simplified as:

a = 5 and b = 5 and c = 5
Copy the code

Remove useless condition (trivial_condition_removal)

The optimizer removes expressions that are obviously always TRUE or FALSE, such as this one:

(a < 1 and b = b) OR (a = 6 OR 5 ! = 5)
Copy the code

Obviously, the expression b = b is always TRUE, 5! The = 5 expression is always FALSE, so the simplified expression looks like this:

(a < 1 and TRUE) OR (a = 6 OR FALSE)
Copy the code

Can continue to be reduced to

a < 1 OR a = 6
Copy the code

Expression calculation

If the expression contains only constants, its value will be computed before the query is executed, as in this example:

a = 5 + 1
Copy the code

Since the expression 5 + 1 contains only constants, it can be reduced to:

a = 6
Copy the code

But it’s important to note that if a column is not the operand of an expression in a separate form, such as in a function or in a more complex expression, it looks like this:

ABS(a) > 5
Copy the code

Or:

-a < - 8 -
Copy the code

The optimizer does not attempt to simplify these expressions. We said earlier that indexes can only be used if the index column and the constant in the search criteria are concatenated using some operator, so if possible, it is best to have the index column in a separate form in the expression.

The combination of the HAVING and WHERE clauses

If the query does not have aggregate functions such as SUM, MAX, and so on and a GROUP BY clause, the optimizer combines the HAVING and WHERE clauses.

Constant scale test

MySQL designers find the following two types of queries particularly fast:

  • The queried table has no or only one record.

    Tip: Do you think there’s something wrong with this one? I haven’t even looked it up yet. Haha, this actually depends on statistics. However, we mentioned that InnoDB’s statistics are inaccurate, so this cannot be used for tables using InnoDB as storage engine, only for tables using Memory or MyISAM storage engine.

  • Query a table using primary key equivalence matching or unique secondary index column equivalence matching as search criteria.

The uncle who designed MySQL considers these two queries to take so little time that they are negligible, so they are called constant tables. When the optimizer analyzes a query, it first performs a constant table query, then replaces all conditions related to the table in the query with constants, and finally analyzes the query cost of the remaining tables, such as this query:

SELECT * FROM table1 INNER JOIN table2
    ON table1.column1 = table2.column2 
    WHERE table1.primary_key = 1;
Copy the code

It is obvious that this query can use the primary key and constant value equivalent matching to query table1 table, that is, in this query Table1 table is equivalent to constant table, before the analysis of the query cost of Table2 table, will execute the query to Table1 table, and replace the conditions involved in the query table1 table, The above statement is converted to something like this:

SELECTTable1 The constant values of each field in the table record, table2.* FROM table1 INNER JOIN table2 
    ONTable1 Column1 constant value of the table= table2.column2;
Copy the code

2, external connection elimination

The positions of the inwardly connected driver and driven tables are interchangeable, while the left (outer) and right (outer) connected driver and driven tables are fixed. As a result, the inner join may reduce the overall query cost by optimizing the join order of the table, while the outer join cannot. For the sake of the story, let’s take a look at the structure of the t1 and T2 tables that we used in the introduction of the connection principle. In case you forget, let’s take a look at the structure of the two tables:

CREATE TABLE t1 (
    m1 int, 
    n1 char(1)
) Engine=InnoDB, CHARSET=utf8;

CREATE TABLE t2 (
    m2 int, 
    n2 char(1)
) Engine=InnoDB, CHARSET=utf8;
Copy the code

To refresh your memory, let’s show the data from these two tables again:

mysql> SELECT * FROM t1;
+------+------+
| m1   | n1   |
+------+------+
|    1 | a    |
|    2 | b    |
|    3 | c    |
+------+------+
3 rows in set (0.00 sec)

mysql> SELECT * FROM t2;
+------+------+
| m2   | n2   |
+------+------+
|    2 | b    |
|    3 | c    |
|    4 | d    |
+------+------+
3 rows in set (0.00 sec)
Copy the code

We said before, is that the essential difference between external and internal connections for the records of the driver table outside connection, if you can’t be found in the driver table records matching the filter conditions ON clause, then the record will be added to the result set, the corresponding driven table record each field to use NULL values fill; If no record matching the filter condition in the ON clause can be found in the driven table, the record will be discarded. The query looks like this:

mysql> SELECT * FROM t1 INNER JOIN t2 ON t1.m1 = t2.m2;
+------+------+------+------+
| m1   | n1   | m2   | n2   |
+------+------+------+------+
|    2 | b    |    2 | b    |
|    3 | c    |    3 | c    |
+------+------+------+------+
2 rows in set (0.00 sec)

mysql> SELECT * FROM t1 LEFT JOIN t2 ON t1.m1 = t2.m2;
+------+------+------+------+
| m1   | n1   | m2   | n2   |
+------+------+------+------+
|    2 | b    |    2 | b    |
|    3 | c    |    3 | c    |
|    1 | a    | NULL | NULL |
+------+------+------+------+
3 rows in set (0.00 sec)
Copy the code

For the (left) outer join in the above example, since the record m1=1 and n1=’a’ in the driven table T1 cannot be found in the ON clause T1.m1 = t2.m2 in the driven table T2, this record is directly added to the result set, and the values of m2 and n2 in the corresponding TABLE T2 are set to NULL.

Tip: Right (outer) join and left (outer) join are the same only in the way they are selected to drive the table, so the optimizer first converts right (outer) join queries to left (outer) join queries. We will not talk about the right (outer) join later.

We know that the WHERE clause is lethal, and records that do not meet the criteria in the WHERE clause will not join. As long as we specify that the value of the driven table column in the search condition is not NULL, then the driven table record in the outer join that does not meet the ON clause condition is excluded from the final result set. In this case, there is no difference between the outer join and the inner join. For example, this query:

mysql> SELECT * FROM t1 LEFT JOIN t2 ON t1.m1 = t2.m2 WHERE t2.n2 IS NOT NULL;
+------+------+------+------+
| m1   | n1   | m2   | n2   |
+------+------+------+------+
|    2 | b    |    2 | b    |
|    3 | c    |    3 | c    |
+------+------+------+------+
2 rows in set (0.01 sec)
Copy the code

The left (outer) join query on t1 and T2 is the same as the inner join query, because the n2 column of t2 is not allowed to be NULL. We do NOT explicitly specify that a column in the driven table IS NOT NULL, as long as it implicitly does, for example:

mysql> SELECT * FROM t1 LEFT JOIN t2 ON t1.m1 = t2.m2 WHERE t2.m2 = 2;
+------+------+------+------+
| m1   | n1   | m2   | n2   |
+------+------+------+------+
|    2 | b    |    2 | b    |
+------+------+------+------+
1 row in set (0.00 sec)
Copy the code

In this case, we specify that column m2 of t2 is equal to 2 in the WHERE clause, which indirectly specifies that column m2 is not NULL, so the left (outer) join above is equivalent to the inner join below:

mysql> SELECT * FROM t1 INNER JOIN t2 ON t1.m1 = t2.m2 WHERE t2.m2 = 2;
+------+------+------+------+
| m1   | n1   | m2   | n2   |
+------+------+------+------+
|    2 | b    |    2 | b    |
+------+------+------+------+
1 row in set (0.00 sec)
Copy the code

We call this null-null condition (reject-NULL) when the specified WHERE clause contains a non-null column in the driven table. An outer join and an inner join can be converted to each other after the WHERE clause of the driven table meets the null-value rejection condition. The benefit of this transformation is that the query optimizer can evaluate the cost of different join orders for the table and select the least expensive join order to execute the query.

3. Sub-query optimization

(1) Sub-query syntax

  • A query can also have another query at a location within a query. The query that occurs at a location within a query is called a subquery.

  • In the SELECT clause

    In the query list, for example:

    mysql> SELECT (SELECT m1 FROM t1 LIMIT 1);
    +-----------------------------+
    | (SELECT m1 FROM t1 LIMIT 1) |
    +-----------------------------+
    |                           1 |
    +-----------------------------+
    1 row in set (0.00 sec)
    Copy the code

    SELECT m1 FROM T1 LIMIT 1 (SELECT m1 FROM T1 LIMIT 1)

  • In the FROM clause

    Such as:

    SELECT m, n FROM (SELECT m2 + 1 AS m, n2 AS n FROM t2 WHERE m2 > 2) AS t;
    +------+------+
    | m    | n    |
    +------+------+
    |    4 | c    |
    |    5 | d    |
    +------+------+
    2 rows in set (0.00 sec)
    Copy the code

    SELECT * FROM t2 WHERE m2 + 1 AS m, n2 AS n FROM t2 WHERE m2 > 2. The FROM clause contains the name of the table we want to query. AS t indicates that the result of the subquery is equivalent to a table named T. The column of the table named T is the column of the subquery. For example, table T has two columns: m column and N column. The subquery in the FROM clause is essentially a table, but a little different FROM the tables we use. The uncle who designed MySQL calls this subquery result set a derived table.

  • WHERE or ON clause

    Placing subqueries in the WHERE or ON clause of the outer query is probably the most common way to use subqueries, such as this:

    mysql> SELECT * FROM t1 WHERE m1 IN (SELECT m2 FROM t2);
    +------+------+
    | m1   | n1   |
    +------+------+
    |    2 | b    |
    |    3 | c    |
    +------+------+
    2 rows in set (0.00 sec)
    Copy the code

    This query indicates that we want the result of the subquery (SELECT m2 FROM T2) as an IN parameter of the outer query. The whole query means that we want to find some records IN table T1 whose values IN column M1 match those IN column m2 of table T2.

  • ORDER BY clause

    Although syntactically supported, it doesn’t make any sense to talk about this situation.

  • GROUP BY clause

(2) Query according to the returned result set area molecule

Because subqueries are queries themselves, they can be categorized according to the type of result set they return:

  • Scalar quantum query

    Subqueries that return only a single value are called scalar quanta queries, for example:

    SELECT (SELECT m1 FROM t1 LIMIT 1);
    Copy the code

    Or this:

    SELECT * FROM t1 WHERE m1 = (SELECT MIN(m2) FROM t2);
    Copy the code

    The subqueries in both queries return a single value, that is, a scalar. These scalar queries can appear as part of a single value or expression anywhere in the query statement.

  • Muck query

    As the name implies, this is a subquery that returns a single record, but the record must contain multiple columns. Like this:

    SELECT * FROM t1 WHERE (m1, n1) = (SELECT m2, n2 FROM t2 LIMIT 1);
    Copy the code

    SELECT * FROM t1 where columns M1 and N1 are equal to columns M2 and n2 FROM t2 LIMIT 1; SELECT * FROM T1 where columns M1 and n1 are equal to columns M2 and n2.

  • Example query

    A subquery is a query for a single column of data, but the data in that column needs to contain multiple records. Like this:

    SELECT * FROM t1 WHERE m1 IN (SELECT m2 FROM t2);
    Copy the code

    SELECT * FROM t2; SELECT * FROM T2; SELECT * FROM T2; SELECT * FROM T2;

  • A table subquery

    The result of a subquery contains many records as well as many columns, for example:

    SELECT * FROM t1 WHERE (m1, n1) IN (SELECT m2, n2 FROM t2);
    Copy the code

    SELECT * FROM t2 where (SELECT m2, n2 FROM t2); SELECT * FROM t2 where (SELECT m2, n2 FROM t2);

(3) Partition the molecular query according to the relationship with the outer query

  • Irrelevant subqueries

    If a subquery can run on its own without depending on the value of the outer query, it can be called an irrelevant subquery.

  • Correlation subquery

    If the execution of a subquery depends on the value of the outer query, the subquery can be called a correlation subquery. Such as:

SELECT * FROM t1 WHERE m1 IN (SELECT m2 FROM t2 WHERE n1 = n2);
Copy the code

(4) The use of subqueries in Boolean expressions

What’s the point of writing a subquery like this:

SELECT (SELECT m1 FROM t1 LIMIT 1);
Copy the code

The most common use of subqueries is to use them as part of a Boolean expression as search criteria in a WHERE or ON clause. So let’s summarize the use of a single query in Boolean expressions.

  • Use =, >, <, >=, <=, <>,! =, <=> as Boolean expression operators

    For convenience, we call these operators “comparison_operator.” Boolean expressions for subqueries look like this:

    Operand comparison_operator (subquery)Copy the code

    The operand could be a column name, a constant, a more complex expression, or even another subquery. However, it is important to note that the subquery can only be a scalar or row subquery, that is, the result of the subquery can only return a single value or a single record. For example (scalar quantum query) :

    SELECT * FROM t1 WHERE m1 < (SELECT MIN(m2) FROM t2);
    Copy the code

    Or this (row subquery) :

    SELECT * FROM t1 WHERE (m1, n1) = (SELECT m2, n2 FROM t2 LIMIT 1);
    Copy the code
  • [NOT] IN/ANY/SOME/ALL sub-query

    For column subqueries and table subqueries, the result set contains many records that are equivalent to a set. Therefore, it is not possible to use the comparison_operator feature to form a Boolean expression with another operand. MySQL supports an operand and a set to form a Boolean expression using the following syntax:

    • IN or NOT IN

      The syntax is as follows:

      The operand [NOT] IN(Subquery)Copy the code

      The Boolean expression is used to determine whether an operand is in the result set of a subquery. For example, the following query is used to find some records in table T1 that are in the result set of a subquery:

      SELECT * FROM t1 WHERE (m1, n1) IN (SELECT m2, n2 FROM t2);
      Copy the code
    • ANY/SOME (ANY and SOME are synonyms)

      The syntax is as follows:

      The operand comparison_operatorANY/SOME(Subquery)Copy the code

      This Boolean expression means that the result of the entire expression is TRUE as long as there is a value in the subquery result set that comparison_operator with the given operand is TRUE. Otherwise, the result of the entire expression is FALSE. For example, the following query:

      SELECT * FROM t1 WHERE m1 > ANY(SELECT m2 FROM t2);
      Copy the code

      If there is a value smaller than m1 in the result set of the subquery (SELECT m2 FROM T2), then the value of the entire Boolean expression is TRUE. Otherwise, the value of the Boolean expression is FALSE. If the value of m1 column is greater than the minimum value in the result set of the subquery, the result of the entire expression is TRUE.

      SELECT * FROM t1 WHERE m1 > (SELECT MIN(m2) FROM t2);
      Copy the code

      IN addition, =ANY is equivalent to determining whether there is a value equal to the given operand IN the result set of a subquery.

    • ALL

      The syntax is as follows:

      The operand comparison_operatorALL(Subquery)Copy the code

      The Boolean expression means that when all values in the subquery result set are compared with the given operands by the comparison_operator function, the entire expression is TRUE; otherwise, the entire expression is FALSE. For example, the following query:

      SELECT * FROM t1 WHERE m1 > ALL(SELECT m2 FROM t2);
      Copy the code

      If all values in the result set of the subquery (SELECT m2 FROM T2) are less than the values in column M1, then the entire Boolean expression is TRUE, otherwise it is FALSE. If the value of m1 column is greater than the maximum value in the result set of the subquery, the result of the entire expression is TRUE.

      SELECT * FROM t1 WHERE m1 > (SELECT MAX(m2) FROM t2);
      Copy the code
  • The EXISTS the subquery

    Sometimes we only need to determine whether there is a record in the result set of a subquery, regardless of what the record is, we can use EXISTS or NOT EXISTS before the subquery statement, like this:

    [NOT] EXISTS(Subquery)Copy the code

    Let’s take an example:

    SELECT * FROM t1 WHERE EXISTS (SELECT 1 FROM t2);
    Copy the code

    For the subquery (SELECT 1 FROM T2), we don’t care what the result of the subquery is, so it doesn’t matter if the query list contains *, column names, or anything else. What we really care is whether there is a record in the result set of the subquery. That is, if there are records in the query (SELECT 1 FROM T2), the result of the EXISTS expression is TRUE.

(5) How is subquery executed in MySQL

Adapt single_table again:

CREATE TABLE single_table (
    id INT NOT NULL AUTO_INCREMENT,
    key1 VARCHAR(100),
    key2 INT,
    key3 VARCHAR(100),
    key_part1 VARCHAR(100),
    key_part2 VARCHAR(100),
    key_part3 VARCHAR(100),
    common_field VARCHAR(100),
    PRIMARY KEY (id),
    KEY idx_key1 (key1),
    UNIQUE KEY idx_key2 (key2),
    KEY idx_key3 (key3),
    KEY idx_key_part(key_part1, key_part2, key_part3)
) Engine=InnoDB CHARSET=utf8;
Copy the code

For convenience, we assume that there are two tables S1 and S2 that have the same structure as single_TABLE, and that there are 10000 entries in each table, with random values inserted in all columns except id.

As a naive teenager, I thought that sub-queries would be executed like this:

  • If the subquery is an unrelated subquery, such as the following query:

    SELECT * FROM s1 
        WHERE key1 IN (SELECT common_field FROM s2);
    Copy the code

    When I was younger, I thought this query would be executed like this:

    • Execute separately first(SELECT common_field FROM s2)This subquery.
    • The outer query is then executed using the results of the previous step as parameters of the outer querySELECT * FROM s1 WHERE key1 IN (...).
  • If the subquery is a related subquery, such as the following query:

    SELECT * FROM s1 
        WHERE key1 IN (SELECT common_field FROM s2 WHERE s1.key2 = s2.key2);
    Copy the code

    The condition s1.key2 = s2.key2 in the subquery in this query means that the subquery is executed depending on the value of the outer query, so when I was younger, I thought the query was executed like this:

    • Retrieves a record first from the outer query, in this case froms1Gets a record from the table.
    • It then finds the value involved in the subquery from the record retrieved in the previous step, in this cases1Select from the record retrieved from the tables1.key2Column value, and then executes a subquery.
    • Finally, the outer query is detected according to the query results of the sub-queryWHEREIf true, add the record of the outer query to the result set, otherwise discard.
    • Perform the first step again to retrieve the records in the second outer query, and so on ~

(6) the execution of standard quantum query and row sub-query

We often use scalar quantum queries or row subqueries in the following two scenarios:

  • SELECTClause, we said that the subquery in the query list must be a scalar quantum query.
  • Subquery use=,>,<,> =,< =,<>,! =,< = >Such operators and an operand form a Boolean expression. Such subqueries must be scalar or row subqueries.

For the above two scenarios, the execution of unrelated scalar quantum queries or row subqueries is simple. For example, the following query statement:

SELECT * FROM s1 
    WHERE key1 = (SELECT common_field FROM s2 WHERE key3 = 'a' LIMIT 1);
Copy the code

It works the way I thought it would when I was younger:

  • Execute separately first(SELECT common_field FROM s2 WHERE key3 = 'a' LIMIT 1)This subquery.
  • The outer query is then executed using the results of the previous step as parameters of the outer querySELECT * FROM s1 WHERE key1 = ....

That is, for queries that contain unrelated scalar or row subqueries, MySQL will execute the outer and subqueries independently, treating them as if they were two single-table queries.

For related scalar or row subqueries, such as the following query:

SELECT * FROM s1 WHERE 
    key1 = (SELECT common_field FROM s2 WHERE s1.key3 = s2.key3 LIMIT 1);
Copy the code

And it worked just as I thought it would when I was younger:

  • Retrieves a record first from the outer query, in this case froms1Gets a record from the table.
  • It then finds the value involved in the subquery from the record retrieved in the previous step, in this cases1Select from the record retrieved from the tables1.key3Column value, and then executes a subquery.
  • Finally, the outer query is detected according to the query results of the sub-queryWHEREIf true, add the record of the outer query to the result set, otherwise discard.
  • Perform the first step again to retrieve the records in the second outer query, and so on ~

(7) IN sub-query optimization

The proposal of physical and chemical table

For an unrelated IN subquery, for example:

SELECT * FROM s1 
    WHERE key1 IN (SELECT common_field FROM s2 WHERE key3 = 'a');
Copy the code
  • That’s how we start to feelINSubquery and irrelevant scalar quantum query or row subquery is the same, are the outer query and subquery as two independent single table query to treat, but unfortunately is designedMySQLThe uncle in order to optimizeINSubqueries are a lot of work (after allINSubqueries are the most commonly used subquery types in our daily lives),So the whole implementation process is not as simple as we thought (>_<).

Actually speak the truth, for unrelated IN the subquery, if the result of the subquery little focus on the number of records and then firing a query and outer query respectively as two separate single table query efficiency is very high, but if the result set after executing sub-queries alone too much, will lead to these problems:

  • There are too many result sets to fit in memory

  • For outer queries, if the result set of the subquery is too large, this means that there are too many parameters IN the IN clause, which results IN:

    • Index cannot be used effectively, only full table scan can be performed for outer layer query.

    • When a full table scan is performed IN the outer layer query, it takes too long to check whether a record matches the parameters IN the IN clause because there are too many parameters IN the IN clause.

      For example, the IN clause has only two arguments:

      SELECT * FROM tbl_name WHERE column IN (a, b);
      Copy the code

      Select * from tBL_NAME where column = a OR column = b; This is not a problem when there are fewer arguments IN the IN clause. If there are more arguments IN the IN clause, for example:

      SELECT * FROM tbl_name WHERE column IN(a, b, c ... ,...). ;Copy the code

      Column = a, column = B, column = c, OR… , so the performance cost can be much.

Instead of using the result set of an unrelated subquery as an argument to the outer query, write the result set to a temporary table. Writing to a temporary table looks like this:

  • The columns of the temporary table are the columns in the result set of the subquery.

  • Records written to temporary tables are de-duplicated.

    The IN statement is used to determine whether an operand is IN a set. It does not matter whether the value IN the set is repeated or not. Therefore, when we write the result set to a temporary table, we can make the temporary table smaller and less expensive

    Tip: How do temporary tables de-duplicate records? A temporary table is also a table. Create a primary key or unique index for all columns in the table

  • In general, the result set of the subquery will not be too large, so a temporary table based on Memory storage engine will be built for it, and the table will be indexed by hash.

    Tip: The essence of the IN statement is to determine whether an operand is IN a set. If the set has a hash index, the matching process is super fast. Any of you who don’t know what a hash index is? I here did not spread out, oneself get online search, won’t ask me again ~

    If the result set of a subquery is large enough to exceed the system variables tmp_TABLE_size or max_HEAP_table_size, the temporary table is switched to a disk-based storage engine to hold the records in the result set and the index type is changed to B+ tree index.

The uncle who designed MySQL calls this process of saving records from a subquery result set to a temporary table Materialize. For convenience, we call the temporary table that stores the result set of the subquery materialized table. Because the records IN the materialized table are indexed (memory based materialized tables have hash indexes, disk based tables have B+ tree indexes), it is very fast to determine if an operand is IN the subquery result set by executing IN statements through the index, which improves the performance of the subquery statement.

Materialized table to connect

SELECT * FROM s1 
    WHERE key1 IN (SELECT common_field FROM s2 WHERE key3 = 'a');
Copy the code

When we materialize the subquery, suppose the name of the materialized table of the subquery is materialized_TABLE and the column m_val of the result set of the subquery is stored in the materialized_table, then the query can actually be viewed from the following two perspectives:

  • From the tables1From the point of view of, the whole query actually means: fors1For each record in the table, if the recordkey1If the value of the column is in the materialized table corresponding to the subquery, the record is added to the final result set. So let me draw a picture of this:

  • From the perspective of the subquery materialized table, the whole query actually means: for each value of the subquery materialized table, if can be ins1Find the corresponding in the tablekey1Columns whose values are equal to that value are added to the final result set. So let me draw a picture of this:

Materialized_table (s1, materialized_table, materialized_table);

SELECT s1.* FROM s1 INNER JOIN materialized_table ON key1 = m_val;
Copy the code

The interesting thing about inner joins is that the query optimizer can evaluate the cost of different join sequences and select the least expensive way to execute the query. Let’s analyze the cost of inner join using outer query table S1 and materialized_table in the above query.

  • If table S1 is used as the driver table, the total query cost consists of the following components:

    • The cost of materializing a subquery
    • scannings1The cost of the table
    • S1 Number of records in the table x Passm_val = xxxrightmaterialized_tableThe cost of single-table access to the table (we said earlier that the records in the materialized table are not duplicated and the columns in the materialized table are indexed, so this step is obviously very fast).
  • If the materialized_table table is used as the driver table, the total query cost consists of the following components:

    • The cost of materializing a subquery
    • The cost of scanning the materialization table
    • Number of records in materialization table × passkey1 = xxxrights1Table cost of single table access (thankfullykey1Columns are indexed, so this step is very fast).

The MySQL query optimizer performs an operation to select the lower cost solution described above to execute the query.

Convert the subquery to semi-join

Although materializing a subquery and then executing a query has the cost of creating a temporary table, we have seen the power of converting a subquery into a join, and the designers of MySQL continue to wonder: can we convert a subquery into a join without materializing it? Let’s revisit the query statement above:

SELECT * FROM s1 
    WHERE key1 IN (SELECT common_field FROM s2 WHERE key3 = 'a');
Copy the code

We can think of this query as: For a record in s1, if we can find one or more records in S2 WHERE the value of common_field equals the value of key1 in S1, if we can find one or more records in S2 WHERE the value of common_field equals the value of key1 in S1 The records from that S1 table are then added to the final result set. This process is similar to joining s1 and S2:

SELECT s1.* FROM s1 INNER JOIN s2 
    ON s1.key1 = s2.common_field 
    WHERE s2.key3 = 'a';
Copy the code

Key1 = s2.common_field; key1 = s2.common_field; key1 = s2.common_field;

  • Case one: Fors1For a particular entry in the table,s2None of the records in the table satisfys1.key1 = s2.common_fieldThis condition, then the record will naturally not be added to the final result set.
  • Case two: Fors1For a particular entry in the table,s2There is one and only one record in the tables1.key1 = s2.common_fieldThis condition, then the record is added to the final result set.
  • Case three: Fors1For a particular entry in the table,s2At least two records in the table must meet the requirementss1.key1 = s2.common_fieldThis condition, then the record will be added to the final result set multiple times.

For a certain record IN s1 table, we only care about whether there is a record IN S2 table that meets the condition s1.key1 = s2.common_field, but not how many records match it, and because of the existence of case 3, the IN sub-query mentioned above is not completely equivalent to the join of the two tables. But turning a subquery into a join can really make the most of the optimizer’s power, so the uncle who designed MySQL came up with a new concept here – semi-join. Semi-join s1 and S2 means: for a certain record in S1, we only care about whether there is a matching record in S2, not the specific number of matching records, and only the records in S1 are retained in the final result set. MySQL > alter table subquery (‘ subquery ‘);

SELECT s1.* FROM s1 SEMI JOIN s2
    ON s1.key1 = s2.common_field
    WHERE key3 = 'a';
Copy the code

Tip: Semi-join is just a way of executing sub-queries within MySQL. MySQL does not provide user-oriented semi-join syntax, so we do not need or try to run the above statement in a black box. I just want to clarify that the above subquery is converted to a semi-join like the above statement in MySQL

Now that we have the concept, how do we implement this so-called semi-join? The uncle who designed MySQL has prepared several approaches.

  • Table pullout (Table pullout in subquery)

    When a subquery has only primary keys or unique index columns in the query list, you can directly pull up the table in the subquery into the FROM clause of the outer query, and merge the search criteria in the subquery into the search criteria in the outer query, such as this one

    SELECT * FROM s1 
        WHERE key2 IN (SELECT key2 FROM s2 WHERE key3 = 'a');
    Copy the code

    Since key2 is the only secondary index column of s2, we can directly pull S2 into the FROM clause of the outer query and merge the search criteria FROM the sub-query into the search criteria of the outer query. The drop-down query looks like this:

    SELECT s1.* FROM s1 INNER JOIN s2 
        ON s1.key2 = s2.key2 
        WHERE s2.key3 = 'a';
    Copy the code

    Why is it possible to convert a subquery to a join query if the subquery has only a primary key or unique index column in the query list? The data in a primary key or unique index column is itself unique. You can’t find more than two records that match s1.key2 = s2.key2 in the same table

  • DuplicateWeedout Execution strategy

    For this query:

    SELECT * FROM s1 
        WHERE key1 IN (SELECT common_field FROM s2 WHERE key3 = 'a');
    Copy the code

    After converting to a semi-join query, a record in s1 may have multiple matching records in S2, so the record may be added to the final result set several times. To eliminate duplication, we can create a temporary table, for example, the temporary table looks like this:

    CREATE TABLE tmp (
        id PRIMARY KEY
    );
    Copy the code

    Therefore, in the process of executing join query, whenever a record in s1 is to be added to the result set, the id value of the record is first added to the temporary table. If the adding is successful, it indicates that the record in S1 was not added to the final result set, and the record is now added to the final result set. DuplicateWeedout is used to eliminate duplicate values in the semi-join result set. This method is called DuplicateWeedout.

  • LooseScan Execution Strategy

    Look at this query:

    SELECT * FROM s1 
        WHERE key3 IN (SELECT key1 FROM s2 WHERE key1 > 'a' AND key1 < 'b');
    Copy the code

    In subquery, access to S2 can use the index of KEY1 column, and the query list of subquery is KEY1 column, so after the query is converted to semi-join query, if s2 is used as the drive table to execute the query, then the execution process is like this:

- As shown in the figure, there are three secondary index records in the 'idx_KEY1' index of 's2' with the value 'aa', so you only need to take the first value to 's1' and look for 's1. Key3 = 'aa' in' s1 '. If you can find the corresponding record in 's1', Add the corresponding record to the result set. Similarly, other secondary index records with the same value only need to take the value of the first record to find the matching record in 'S1' table. Although this is a scan index, it only takes the first value of the same record to perform the matching operation, which is called 'loose scan'.Copy the code
  • Semi-join Materialization execution strategy

    The IN clause of the outer query is materialized first, and then the join between the outer query and the materialized table is essentially a semi-join. However, since there are no duplicate records IN the materialized table, the subquery can be directly converted to join query.

  • FirstMatch execution strategy

    FirstMatch is a primitive form of semi-join execution, the same way we used to think of related subqueries when we were kids, which is to take one of the records in the outer query and look for a match in the subquery table. If you find a match, Then the records of the outer query are put into the final result set and the search for more matching records is stopped. If no records are found, the records of the outer query are discarded. Then start fetching the next record in the outer query and repeat the process.

For some related subqueries that use the IN statement, such as this query:

SELECT * FROM s1 
    WHERE key1 IN (SELECT common_field FROM s2 WHERE s1.key3 = s2.key3);
Copy the code

It can also be easily converted to a semi-join, and the converted statement looks something like this:

SELECT s1.* FROM s1 SEMI JOIN s2 
    ON s1.key1 = s2.common_field AND s1.key3 = s2.key3;
Copy the code

DuplicateWeedout, LooseScan, FirstMatch and other half-join execution strategies can then be used to execute the query. Of course, if the sub-query has only primary keys or secondary index columns in the query list, You can also use a table pullout strategy to execute the query directly, but it is important to note that since the associated subquery is not a standalone query, it cannot be converted into a materialized table to execute the query.

Conditions for semi-join

Of course, not all queries that contain IN subqueries can be converted to semi-JOIN. Only queries that look like this can be converted to semi-join:

SELECT.FROM outer_tables 
    WHERE expr IN (SELECT.FROM inner_tables ...) AND.Copy the code

Or something like this could work:

SELECT.FROM outer_tables 
    WHERE (oe1, oe2, ...) IN (SELECT ie1, ie2, ... FROM inner_tables ...) AND.Copy the code

To summarize, only subqueries that meet the following criteria can be converted to semi-join:

  • The subquery must be andINStatement composed of Boolean expressions, and queried in the outer layerWHEREorONClause.
  • Outer queries can also have other search criteria, except andINThe search criteria for the subquery must be usedANDConnect.
  • The subquery must be a single query and cannot be made up of several queriesUNIONJoin the form.
  • The subquery cannot containGROUP BYorHAVINGStatement or aggregate function.
  • . Still have a few conditions to compare rare, do not nag ~

This does not apply in the case of semi-join

For cases where sub-queries cannot be converted to semi-join, the following are typical:

  • Other search criteria IN the WHERE condition of the outer query are concatenated with the Boolean expression of the IN subquery using OR

    SELECT * FROM s1 
        WHERE key1 IN (SELECT common_field FROM s2 WHERE key3 = 'a')
            OR key2 > 100;
    Copy the code
  • The case where NOT IN is used instead of IN

    SELECT * FROM s1 
        WHERE key1 NOT IN (SELECT common_field FROM s2 WHERE key3 = 'a')
    Copy the code
  • The IN subquery IN the SELECT clause

    SELECT key1 IN (SELECT common_field FROM s2 WHERE key3 = 'a') FROM s1 ;
    Copy the code
  • Subqueries that contain GROUP BY, HAVING, or aggregate functions

    SELECT * FROM s1 
        WHERE key2 IN (SELECT COUNT(*) FROM s2 GROUP BY key1);
    Copy the code
  • The case where a subquery contains a UNION

    SELECT * FROM s1 WHERE key1 IN (
        SELECT common_field FROM s2 WHERE key3 = 'a' 
        UNION
        SELECT common_field FROM s2 WHERE key3 = 'b'
    );
    Copy the code

MySQL still has two tricks for optimizing subqueries that cannot be converted to semi-join queries:

  • For unrelated subqueries, try materializing them before participating in the query

    For example, the query we mentioned above:

    SELECT * FROM s1 
        WHERE key1 NOT IN (SELECT common_field FROM s2 WHERE key3 = 'a')
    Copy the code

    Materialize the subquery first, and then determine whether KEY1 can speed up query execution in the result set of the materialized table.

    Note that the subquery cannot be materialized as a join to the outer query table. The subquery can only be materialized by scanning s1 and then checking whether the key1 value of a record in S1 is in the materialized table.

  • An IN subquery can be converted to an EXISTS subquery regardless of whether the query is related or irrelevant

    An IN subquery can be converted to an EXISTS subquery.

    outer_expr IN (SELECT inner_expr FROM.WHERE subquery_where)
    Copy the code

    Can be converted to:

    EXISTS (SELECT inner_expr FROM.WHERE subquery_where AND outer_expr=inner_expr)
    Copy the code

    There are some special cases in this process, such as when the outer_expr or inner_expr value is NULL. Because expressions with NULL operands often result in NULL, for example:

    mysql> SELECT NULL IN (1.2.3);
    +-------------------+
    | NULL IN (1.2.3) |
    +-------------------+
    |              NULL |
    +-------------------+
    1 row in set (0.00 sec)
    
    mysql> SELECT 1 IN (1.2.3);
    +----------------+
    | 1 IN (1.2.3) |
    +----------------+
    |              1 |
    +----------------+
    1 row in set (0.00 sec)
    
    mysql> SELECT NULL IN (NULL);
    +----------------+
    | NULL IN (NULL) |
    +----------------+
    |           NULL |
    +----------------+
    1 row in set (0.00 sec)
    Copy the code

    The result of the EXISTS subquery must be TRUE or FASLE:

    mysql> SELECT EXISTS (SELECT 1 FROM s1 WHERE NULL = 1);
    +------------------------------------------+
    | EXISTS (SELECT 1 FROM s1 WHERE NULL = 1) |
    +------------------------------------------+
    |                                        0 |
    +------------------------------------------+
    1 row in set (0.01 sec)
    
    mysql> SELECT EXISTS (SELECT 1 FROM s1 WHERE 1 = NULL);
    +------------------------------------------+
    | EXISTS (SELECT 1 FROM s1 WHERE 1 = NULL) |
    +------------------------------------------+
    |                                        0 |
    +------------------------------------------+
    1 row in set (0.00 sec)
    
    mysql> SELECT EXISTS (SELECT 1 FROM s1 WHERE NULL = NULL);
    +---------------------------------------------+
    | EXISTS (SELECT 1 FROM s1 WHERE NULL = NULL) |
    +---------------------------------------------+
    |                                           0 |
    +---------------------------------------------+
    1 row in set (0.00 sec)
    Copy the code

    But fortunately, most of the scenarios WHERE we use the IN subquery place it IN the WHERE or ON clause, which does not distinguish between NULL and FALSE, for example:

    mysql> SELECT 1 FROM s1 WHERE NULL;
    Empty set (0.00 sec)
    
    mysql> SELECT 1 FROM s1 WHERE FALSE;
    Empty set (0.00 sec)
    Copy the code

    So as long as our IN subquery is placed IN a WHERE or ON clause, the IN -> EXISTS conversion is fine. With all that said, why switch? This is because the index may not be needed without the conversion, as in the following query:

    SELECT * FROM s1
        WHERE key1 IN (SELECT key3 FROM s2 where s1.common_field = s2.common_field) 
            OR key2 > 1000;
    Copy the code

    The subquery in this query is a related subquery and cannot be executed with an index, but an EXISTS subquery can be executed with an index:

    SELECT * FROM s1
        WHERE EXISTS (SELECT 1 FROM s2 where s1.common_field = s2.common_field AND s2.key3 = s1.key1) 
            OR key2 > 1000;
    Copy the code

    The IDX_KEY3 index of the S2 table can be used when converted to an EXISTS subquery.

    Note that an IN subquery is converted to an EXISTS query if it does not meet the criteria for converting to semi-join, cannot be converted to a materialized table, or the cost of converting to a materialized table is too high.

    Note: IN MySQL5.5 and earlier versions of MySQL5.5, when semi-join and materialization methods were not introduced, the optimizer used to convert an IN subquery to an EXISTS subquery. So at that time, there were a lot of voices suggesting that you should convert subqueries into joins. However, with the development of MySQL, there are a lot of subquery optimization strategies introduced in the recent version of MySQL. You can be relieved to use subqueries, and the internal conversion work optimizer will automatically implement it for you.

The subtotal
  • If the IN subquery meets the criteria for converting to semi-join, the query optimizer will first convert the subquery to semi-join, and then consider which of the following five strategies to perform semi-join is the least costly:

    • Table pullout
    • DuplicateWeedout
    • LooseScan
    • Materialization
    • FirstMatch

    Select the lowest cost execution strategy to execute the subquery.

  • If the IN subquery does not meet the criteria for conversion to semi-join, the query optimizer finds a cheaper way to execute the subquery from one of two strategies:

    • The subquery is materialized before the query is executed
    • performIN to EXISTSConversion.

ANY/ALL subquery optimization

If ANY/ALL subqueries are irrelevant subqueries, they can be executed in a familiar manner in many situations, such as:

Original expression convert
< ANY (SELECT inner_expr …) < (SELECT MAX(inner_expr) …)
> ANY (SELECT inner_expr …) > (SELECT MIN(inner_expr) …)
< ALL (SELECT inner_expr …) < (SELECT MIN(inner_expr) …)
> ALL (SELECT inner_expr …) > (SELECT MAX(inner_expr) …)

[NOT] EXISTS Indicates the execution of the sub-query

If the [NOT] EXISTS subquery is an unrelated subquery, execute the subquery first to determine whether the result of the [NOT] EXISTS subquery is TRUE or FALSE, and rewrite the original query statement, for example, for this query:

SELECT * FROM s1 
    WHERE EXISTS (SELECT 1 FROM s2 WHERE key1 = 'a') 
        OR key2 > 100;
Copy the code

Since the subquery in this statement is an unrelated subquery, the optimizer first executes the subquery, assuming that the EXISTS subquery results in TRUE, and then overwrites the query as:

SELECT * FROM s1 
    WHERE TRUE OR key2 > 100;
Copy the code

Further simplification would be:

SELECT * FROM s1 
    WHERE TRUE;
Copy the code

For related [NOT] EXISTS subqueries, such as this one:

SELECT * FROM s1 
    WHERE EXISTS (SELECT 1 FROM s2 WHERE s1.common_field = s2.common_field);
Copy the code

Unfortunately, this query can only be executed in the same way that we executed related subqueries when we were kids. If an index is used in the [NOT] EXISTS subquery, the speed of the query is also much faster.

SELECT * FROM s1 
    WHERE EXISTS (SELECT 1 FROM s2 WHERE s1.common_field = s2.key1);
Copy the code

Idx_key1 can be used in the EXISTS subquery above to speed up the query.

Optimization for derived tables

If you place a subquery after the FROM clause of the outer query, the result of the subquery is equivalent to a derived table, such as the following query:

SELECT * FROM  (
        SELECT id AS d_id,  key3 AS d_key3 FROM s2 WHERE key1 = 'a'
    ) AS derived_s1 WHERE d_key3 = 'a';
Copy the code

The result of a subquery (SELECT id AS d_id, key3 AS d_key3 FROM s2 WHERE key1 = ‘a’) is equivalent to a derived table with the name derived_s1. The table has two columns, d_id and d_key3.

MySQL provides two execution strategies for queries that contain derived tables:

  • The easiest thing to think of is to materialize derived tables.

    We can write the result set of a derived table to an internal temporary table and then participate in the query as if it were a normal table. Of course, when it comes to materializing derived tables, the uncle who designed MySQL uses a strategy called delayed materialization, which is to try to materialize derived tables only when they are actually used in the query, rather than materializing derived tables before the query has even started. For example, for the following query with derived tables:

    SELECT * FROM (
            SELECT * FROM s1 WHERE key1 = 'a'
        ) AS derived_s1 INNER JOIN s2
        ON derived_s1.key1 = s2.key1
        WHERE s2.key2 = 1;
    Copy the code

    If adopt the way of chemical derived tables to executing the query, then executes is first to find meet s2 s2 table. The record of key2 = 1 if all can not find, that participate in the connection of s2 table record is empty, so the query result set is empty, so there would be no need to go to the materialized query derived tables.

  • To merge a derived table with an outer table, you rewrite the query to a form without derived tables

    Let’s look at a simple query that contains a derived table:

    SELECT * FROM (SELECT * FROM s1 WHERE key1 = 'a') AS derived_s1;
    Copy the code

    Select * from s1 where key1 = ‘a’ where key1 = ‘a’;

    SELECT * FROM s1 WHERE key1 = 'a';
    Copy the code

    For slightly more complex statements containing derived tables, such as the one we mentioned above:

    SELECT * FROM (
            SELECT * FROM s1 WHERE key1 = 'a'
        ) AS derived_s1 INNER JOIN s2
        ON derived_s1.key1 = s2.key1
        WHERE s2.key2 = 1;
    Copy the code

    We can merge the derived table with the outer query’s table, and then place the search criteria from the derived table in the outer query’s search criteria, like this:

    SELECT * FROM s1 INNER JOIN s2 
        ON s1.key1 = s2.key1
        WHERE s1.key1 = 'a' AND s2.key2 = 1;
    Copy the code

    This successfully eliminates the derived tables by merging the outer query with the derived tables, which means that we no longer have to pay the cost of creating and accessing temporary tables. However, not all queries with derived tables can be successfully merged with the outer query. When a derived table has these statements, it cannot be merged with the outer query:

    • Aggregate functions, such as MAX(), MIN(), SUM(), etc
    • DISTINCT
    • GROUP BY
    • HAVING
    • LIMIT
    • UNION 或者 UNION ALL
    • Of the subquery corresponding to the derived tableSELECTClause contains another subquery
    • . There are also some uncommon cases not to mention ha ~

SQL > alter table materialize (); SQL > alter table materialize (); SQL > alter table materialize ();