MySQL query optimization needs to go through three steps: parsing, preprocessing and optimization. In each of these processes, errors are possible. This article will not delve into error handling, but will help you understand how MySQL executes queries so that you can write better queries.

Parsers and preprocessors

In the beginning, MySQL’s parser breaks the query into a series of instructions and builds a “parse tree” from them. The parser uses MySQL’s SQL syntax to translate and validate queries. For example, the parser ensures that the instructions in the query are valid and in the correct order, and checks for errors such as string quotes being mismatched.

The preprocessor checks the built parse tree for semantic information that the parser can’t handle. For example, check whether tables and columns exist, and process field names and aliases to ensure that column references are unambiguous. Next, the preprocessor checks permissions, which is usually pretty quick (unless your server has a bunch of permissions configured).

Query optimizer

After going through the parser and preprocessor, the parse tree is determined to be valid and can be processed by the optimizer and eventually transformed into a query plan. There are usually many ways to execute a query with the same result, and the optimizer’s job is to find the best option.

MySQL uses an optimizer based on cost estimation, which means it views the cost of many execution plans and selects the one with the lowest cost. The initial unit cost was a random 4KB data page read, but now it is more complex, including costs such as performing WHERE comparison conditions. You can see how much the query optimizer estimates the cost of the query statement by displaying the Last_query_cost session variable.

SELECT SQL_NO_CACHE COUNT(*) FROM sakila.film_actor;
SHOW STATUS LIKE 'Last_query_cost';
Copy the code

The Last_query_cost displayed means that the optimizer estimates that it will need to perform the corresponding number of random data page accesses to complete the query. This is based on the following statistical estimates:

  • The number of data pages occupied by a data table or index;
  • Candidate values for the index;
  • The data length corresponding to the row, key, and key value distribution.

The optimizer does not consider the cache of the estimated content — it assumes that the results are read from disk I/O every time. The optimizer does not always choose the optimal execution plan for the following reasons:

  • Statistics themselves can be wrong. Server-side statistics depend on the storage engine, which can be very accurate or very inaccurate. For example, InnoDB does not keep the exact number of rows in a table due to its MVCC architecture.
  • The estimated cost is not the same as the actual cost of running, so even if the statistics are accurate, the cost of the query will be more or less different from MySQL’s estimate. A query plan that reads more data pages may also be cheaper, for example faster if the disk I/O access is ordered, or the results themselves are already in the cache. Therefore, the optimizer itself does not know how many I/O operations the query will cause.
  • MySQL’s artificial optimizations may not be what we expected. Faster execution time is probably what we want, and MySQL isn’t just looking for speed, it’s looking to minimize costs. Therefore, passing costs is not necessarily scientific.
  • MySQL does not consider queries in concurrency, which can affect the speed at which queries run.
  • MySQL is not always optimized for cost estimates. Sometimes it just follows rules, such as using full-text indexes if there is a full-text MATCH condition (the MATCH method). MySQL does not perform queries in a faster manner, even if there is a faster alternative index and non-full-text conditional query.
  • The optimizer does not consider the cost of operations that are not under its control, such as executing stored procedures or custom functions.
  • The optimizer is not always able to estimate every execution plan, and sometimes it will ignore a better plan.

The MySQL query optimizer is a very complex part of this, using many optimizations to turn a query statement into a query execution plan. There are usually two kinds of optimization: static optimization and dynamic optimization. Static optimization can be done simply by examining the parse tree. For example, the optimizer can convert a WHERE condition into an equation using mathematical rules. Static optimization is independent of specific values, such as constant values for WHERE conditions. They remain valid once executed, even if the query is executed again with a different value. Think of it as “compile-time optimization.”

In contrast, dynamic optimization is scenario-based and depends on multiple factors. For example, the number of rows of data corresponding to a value in a WHERE condition or an index. This process needs to be re-estimated on each query, which can be understood as “runtime optimization.” Here are some typical MySQL optimizations:

  • Federated query reordering: Data tables do not necessarily need to be federated in the order of query statements. Determining the optimal federated query order is an important optimization.
  • Transform an outer join into an inner join: An outer join does not necessarily need to be queried by an outer join. Some factors, such as WHERE conditions and data table structure, can equate outer join queries to inner joins. MySQL can recognize these cases and rewrite the federated query.
  • Application mathematical equivalence formulas: MySQL uses mathematical equivalence conversions to simplify expressions. You can expand and reduce constants, rule out impossible cases and constant expressions. For example, the expression (5=5 AND a>5) is reduced to (a>5). Similarly, (a 5 AND b=c AND a=5. These rules are useful for conditional queries.
  • COUNT(), MIN(), and MAX() optimizations: Indexes and null-value columns often help MySQL optimize these functions. For example, to find the minimum value of the leftmost column of the binary tree, MySQL can request only the first row of the index. You can even do this during the query optimization phase and treat it as a constant value for the rest of the query. The same is true for the maximum value of the query, only the last u line is read. If the server uses this optimization, you can see “Select Tables Optimized Away” in EXPLAIN. This means that the optimizer removes the table from the query plan and replaces it with a constant. Similarly, COUNT(*) queries can be optimized in some storage engines without specifying a WHERE condition (MyISAM, for example, keeps the exact number of rows of the table).
  • Evaluate and reduce constant expressions: Once MySQL detects that an expression can be reduced to a constant, it does so during the optimization phase. For example, a user-defined variable can be converted to a constant if it does not change during the query. Surprisingly, during the optimization phase, some statements that you thought were queries are also converted to constants. An example is MIN() on an index. This situation can also be extended to constant queries against primary keys or independent indexes. If the WHERE condition specifies a constant for such an index, the optimizer knows that MySQL looks for the corresponding value at the start of the query. This value is then treated as a constant for the rest of the query. Here’s an example:

EXPLAIN SELECT film.film_id, film_actor.actor_id
FROM sakila.film
	INNER JOIN sakila.film_actor USING(film_id)
WHERE film.film_id = 1;
Copy the code

MySQL will split the query into 2 steps, so the analysis result will have 2 rows. The first step is to find the corresponding data row in the FILM table. Since the primary key film_id was queried, MySQL knows that there is only one row of data. Therefore, the ref of the query analysis result is constant. In the second step, MySQL takes the film_id as a known value, so the ref for the film_actor query is also constant. Other similar scenarios include WHERE, USING, or ON conditions WHERE the constraint is an equality. In this case, MySQL knows that the film_id of the USING condition is the same value in the query statement, and this value must be the same as the film_id of the WHERE condition.

  • Overwrite indexes: MySQL sometimes uses index data to avoid reading row data if the index contains all the columns needed for a query.
  • Subquery optimization: MySQL is able to convert some types of subqueries into more efficient variants, simplifying them into indexed queries rather than independent queries.
  • Early abort: MySQL can abort the query process after the query results are satisfied. The most obvious example is the LIMIT condition. There are other cases of early termination. For example, MySQL can abort the entire query after detecting a possible condition, as shown in the following example:
EXPLAIN SELECT film.film_id FROM sakila.film WHERE film_id=1;
Copy the code

“Impossible WHERE Noticed after Reading const tables” will be displayed in the Extra field in the analysis result. There are other cases of early termination, such as:

SELECT film.film_id
FROM sakila.film
	LEFT OUTER JOIN sakila.film_actor USING(film_id)
WHERE sakila.film_actor.film_id IS NULL;
Copy the code

This query excludes movies with actors. Each power can have multiple actors, but once one is found, MySQL will stop processing the current movie and move on to the next one. A similar situation applies to DISTINCT, NOT EXISTS.

  • Equivalent passing: MySQL recognizes whether the columns held in the query are equivalent. For example, in a JOIN condition, the WHERE condition affects the same columns, as in the following query:
SELECT film.film_id
FROM sakila.film
	INNER JOIN sakila.film_actor USING(film_id)
WHERE film.film_id > 500;
Copy the code

MySQL knows that the WHERE constraint applies not only to film but also to film_actor. But other databases may not have the same optimization effect.

  • IN query comparison: For many database servers, the IN query ratio is equivalent to multiple OR conditions, which are logically equivalent. This is not the case IN MySQL, which sorts the list values of the IN query and uses binary lookup to check whether the query value is IN the list. This reduces the complexity of the algorithm from order n to order log n.

In fact, MySQL uses many more optimizations than listed above, and I can’t list them all here. Just keep in mind the complexity and intelligence of MySQL’s optimizer. Therefore, you should let the optimizer do its job, rather than optimizing the query indefinitely until the MySQL optimizer is no longer useful. Of course, while MySQL’s optimizer is clever, it doesn’t always give optimal results, and sometimes you know optimal results that MySQL doesn’t necessarily know. In this case, you can optimize the query statement to help MySQL optimize, but sometimes you need to add hints to the query, or rewrite the query, modify the table design, or add indexes.