10 why sometimes choose the wrong | MySQL index?

In MySQL, a table can support multiple indexes. But when you write SQL statements, you do not actively specify which index to use. That is, it is up to MySQL to determine which index to use.

Sometimes a statement that could have been executed very quickly is slowed down because MySQL picked the wrong index?

For example, create a simple table with two columns, A and B, respectively.

CREATE TABLE `t` ( `id` int(11) NOT NULL, `a` int(11) DEFAULT NULL, `b` int(11) DEFAULT NULL, PRIMARY KEY (`id`), KEY 'a' (' a '), KEY 'b' (' b ')) ENGINE=InnoDB;Copy the code

We insert 100,000 rows into table t in integer increments, i.e. :(1,1,1), (2,2,2), (3,3,3) up to (100000,100000,100000).

delimiter ;; 
create procedure idata() 
begin 
  declare i int; 
  set i=1; 
  while(i<=100000)do 
  insert into t values(i, i, i); 
  set i=i+1; 
  end while; 
end;; 
delimiter ; 
call idata();
Copy the code

Let’s analyze an SQL statement:

mysql> select * from t where a between 10000 and 20000;

The key field is ‘a’, indicating that the optimizer has selected index A.

Let’s do this again on a table with 100,000 rows.

Session A starts A transaction. Session B then deletes all the data and calls the idATA stored procedure to insert 100,000 rows.

Select fromt where a between 10000 and 20000 from session B. We can use slow log to see how this works.

To illustrate whether the optimizer’s selection is correct, a comparison is added: force index(a) is used to force the optimizer to use index A.

The following three SQL statements are the experimental process.

set long_query_time=0; select * from t where a between 10000 and 20000; /*Q1*/ select * from t force index(a) where a between 10000 and 20000; /*Q2*/Copy the code
  1. The first sentence is to set the threshold of the slow query log to 0, indicating that the following statements of the thread will be recorded in the slow query log.
  2. Q1 is the original query of session B;
  3. Q2 adds force index(a) to compare session B’s original query execution.

The following figure shows the slow query logs after the execution of the three SQL statements. As you can see, Q1 scans 100,000 rows, which is obviously a full table scan, and the execution time is 40 milliseconds. Q2 scans 10001 rows and executes for 21 milliseconds. That is, MySQL used the wrong index when we did not use force Index, resulting in a longer execution time.

This example corresponds to our usual scenario of constantly deleting historical data and adding new data. Is it strange that MySQL should choose the wrong index? The following is an analysis of the optimizer logic.

Optimizer logic

01 mentioned that the optimizer can select indexes in order to find an optimal solution to execute statements with minimal cost. For databases, fewer rows scanned means fewer visits to disk data and less CPU consumption. Of course, the optimizer also considers whether to use temporary tables, whether to sort, and so on.

How to determine the number of lines scanned?

MySQL does not know exactly how many records meet this condition until it actually starts executing the statement, but can only estimate the number of records based on statistics.

This statistic is the “differentiation” of the index. Obviously, the more different values there are on an index, the better the index is differentiated. The number of different values in an index is called cardinality. That is, the larger the cardinality, the more differentiated the index.

We can use the showindex method to see the cardinality of an index. As shown, this is the result of the showindex for table T. Although the values of the three fields in each row of the table are the same, the cardinality values of the three indexes are not the same in the statistics, and none of them are actually accurate.

How does MySQL get the cardinality of the index?

Generally, “sampling statistics” is selected. Although accurate results can be obtained by taking out the whole table for row by row statistics, the cost is too high. InnoDB will select N data pages by default, count the different values of these pages, get an average, and then multiply by the number of pages in the index to get the index base, which is easy to miscalculate.

In addition, the index count is only an input, and for a particular statement, the optimizer also determines how many rows to scan to execute the statement itself. Next, let’s take a look at what the optimizer estimates will be the number of lines scanned for these two statements.

The rows field represents the number of rows expected to be scanned. Where, the result of Q1 is still in line with the expectation, the value of rows is 104620; But Q2’s rows value is 37116, which is a big deviation. In Figure 1, we see only 10001 rows using the Explain command. This bias misleads the optimizer.

At this point, the first question you might ask is not why it is not working, but why does the optimizer have an execution plan that scans 37,000 rows instead of 100,000 rows?

The optimizer estimates the cost of both choices, and from the results, the optimizer decides that it is faster to scan the primary key index directly. Of course, this option is not optimal in terms of execution time. Using plain indexes requires the cost of returning to the table to be factored in. The cost of this strategy was taken into account when executing Explain, but it was the right choice. In other words, there is nothing wrong with this strategy.

So MySQL chose the wrong index, and this is also due to the fact that MySQL did not accurately determine the number of rows to scan. Using the analyze table t command, you can re-analyze index information.

Use this method if you find that the predicted rows values in the explain results differ greatly from the actual situation. The analyze command can solve a lot of problems if the index statistics are inaccurate, but the optimizer does more than just look at the number of rows scanned.

Again based on table T, let’s look at another statement:

mysql> select * from t where (a between 1 and 1000) and (b between 50000 and 100000) order by id

Conditionally, this query has no eligible records and therefore returns an empty collection. If you were to choose an index, which one would you choose? For the sake of analysis, let’s first look at the structure diagram of the two indexes A and B.

If the query is performed using index A, the first 1000 values of index A are scanned, the corresponding IDS are retrieved, each row is checked by primary key index, and then filtered by field B. Obviously, 1000 rows need to be scanned.

If the query is performed using index B, the last 50001 values of index B will be scanned. The process is the same as the above, and the values need to be determined by the primary key index. Therefore, row 50001 needs to be scanned.

If index A is used, execution is significantly faster. Use the explain command to see if this is the case.

mysql> explain select * from t where (a between 1 and 1000) and (b between 50000 and 100000) order by b limit 1

As you can see, the key field is returned, this time the optimizer selects index B, and the Rows field shows 50198 rows to scan. From this result, you can draw two conclusions:

  1. Estimates of the number of scanned rows are still inaccurate;
  2. MySQL selects the wrong index in this example.

Index selection exception and handling

Most of the time the optimizer will find the correct index, but once in a while you’ll run into two situations like the one above: an SQL statement that could have been executed quickly turns out to be much slower than you expected.

The first method is to force an index with force index. MySQL analyzes possible indexes as candidates based on the results of lexical parsing, and then determines how many rows each index needs to scan in the candidate list. If the index specified by force index is in the list of candidate indexes, the index is directly selected without evaluating the execution cost of other indexes.

Let’s look at a second example. When we started our analysis, we thought it would be better to choose index A. The original statement takes 2.23 seconds to execute, but when you use Force Index (a), it takes 0.05 seconds, more than 40 times faster than the optimizer’s choice. That is, the optimizer does not select the right index, and force Index acts as a “corrective”.

If the name of the index is changed, this statement must also be changed. And if you migrate to another database later, this syntax may not be compatible. Therefore, database problems are best solved within the database. So, how do you solve this in the database?

Since the optimizer has abandoned the use of index A, a is not suitable enough. Second, we can consider modifying the statement to instruct MySQL to use the index we expect. For example, in this example, it is clear that the semantic logic is the same to change “order by b limit 1” to “Order by B,a limit 1”.

Before, the optimizer chose to use index B because it thought that using index B could avoid sorting (b itself is an index, which is already ordered, and if you choose index B, there is no need to do sorting, only need to traverse), so even if the number of rows scanned is large, it is judged to be less costly.

Now the way we write order by B,a, requires sorting by B,a, which means we need to sort both indexes. Therefore, the number of rows scanned becomes the main factor influencing the decision, so the optimizer chose index A, which only needs to scan 1000 rows.

The third approach is that, in some scenarios, we can create a new index that is more appropriate to provide the optimizer with options, or drop the misused index.