We all know that a table can have many indexes. When we execute a query, we do not specify which index to use, so who decides for us? It is the MySQL optimizer that decides which index to use, and the optimizer’s judgment is not always correct.

Optimizer logic

The optimizer chooses an index in order to find the optimal execution plan and execute the statement at the lowest cost. The factors in that cost include the number of rows scanned, whether temporary tables are used, whether sorting is required, and so on.

How is the number of scanned rows estimated?

MySQL does not know exactly how many records meet a condition until it actually starts executing the statement; it can only estimate the count from statistics. The statistic in question is the index’s distinctness: obviously, the more distinct values there are in an index, the better it distinguishes rows. The number of distinct values in an index is called its cardinality, so the larger the cardinality, the more distinct the index. For example, suppose a table has a gender field: you can certainly build an index on it, but the index is of little value, because its cardinality is too small to tell rows apart. We can use the show index command to see an index’s cardinality.
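For example, for the table t built later in this article, show index reports this estimate (the Cardinality values are sampled and will vary from run to run):

show index from t;
-- The Cardinality column holds the estimated number of distinct values for each index.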

How does MySQL get the cardinality of the index?

Why sampling statistics? Although going through the whole table row by row would produce an exact count, the cost is too high, so MySQL can only settle for “sampling statistics”.

By default, InnoDB selects N data pages, counts the distinct values on those pages to obtain an average, and then multiplies it by the number of pages in the index to get the index’s cardinality.

Tables are constantly updated, so index statistics do not stay fixed. When the number of changed rows exceeds 1/M of the total, a new round of index statistics is triggered automatically.

In MySQL, there are two ways to store index statistics, which can be selected by setting the value of innodb_stats_persistent:

  • If this parameter is set to on, statistics will be stored persistently. By default, N is 20 and M is 10.

  • If this parameter is set to off, statistics are stored only in memory. By default, N is 8 and M is 16.
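You can check these settings with show variables; in current MySQL versions the two N values correspond to the innodb_stats_persistent_sample_pages and innodb_stats_transient_sample_pages variables (the comments below are an illustration of the defaults, not output captured from a specific server):

show variables like 'innodb_stats%';
-- innodb_stats_persistent               ON  -> statistics stored persistently
-- innodb_stats_persistent_sample_pages  20  -> N for persistent statistics
-- innodb_stats_transient_sample_pages   8   -> N for in-memory statistics
-- innodb_stats_auto_recalc              ON  -> recalculate when roughly 1/M of rows change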

To set the stage, let’s create a table t with two fields a and b and build an index on each:

CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `a` (`a`),
  KEY `b` (`b`)
) ENGINE=InnoDB;

We insert 100,000 rows into table t, with the values increasing together, i.e. (1,1,1), (2,2,2), (3,3,3), up to (100000,100000,100000).

delimiter ;;
create procedure idata()
begin
  declare i int;
  set i=1;
  while(i<=100000) do
    insert into t values(i, i, i);
    set i=i+1;
  end while;
end;;
delimiter ;
call idata();

Let’s start with the first SQL statement:

select * from t where a between 10000 and 20000;

You might say: does this statement even need analysis? There is an index on a, so obviously index a will be used.
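Indeed, explain confirms it (the output is summarized in a comment; the rows figure is the estimate referred to later in this article):

explain select * from t where a between 10000 and 20000;
-- key: a, rows: 10001 -- the optimizer picks index a and expects to scan about 10001 rows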

The query does perform as expected: the key field value is ‘a’, indicating that the optimizer selected index a. Now let’s look at a more complicated scenario.

| session A | session B |
| --- | --- |
| start transaction with consistent snapshot; | |
| | delete from t; |
| | call idata(); |
| | explain select * from t where a between 10000 and 20000; |
| commit; | |

Session A should look familiar: it starts a transaction with a consistent snapshot. Session B then deletes all the data and calls the idata stored procedure to insert 100,000 rows again.

At this point, session B’s query select * from t where a between 10000 and 20000 no longer chooses index a. We can use the slow query log to see how it actually executes.
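If the slow query log is not already enabled on your server, it can be switched on first (slow_query_log is a standard global variable; the logging threshold is set as part of the experiment below):

set global slow_query_log=on;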

To check whether the optimizer’s choice is correct, I added a comparison: using force index(a) forces the optimizer to use index a (I will come back to force index later in this article).

The following three SQL statements are the experimental process.

set long_query_time=0;
select * from t where a between 10000 and 20000; /*Q1*/
select * from t force index(a) where a between 10000 and 20000; /*Q2*/
  • The first statement sets the slow query log threshold to 0, meaning that every subsequent statement in this thread will be recorded in the slow query log;
  • Q1 is the original query of session B;
  • Q2 adds force index(a) for comparison with session B’s original query.

The rows field represents the number of rows expected to be scanned.

Here, Q1’s result is still in line with expectations: its rows value is 104620. But Q2’s rows value is 37116, which is a large deviation. For comparison, the explain command in the first figure showed only 10001 rows; it is this deviation that misled the optimizer’s judgment.

At this point, your first question may not be why the estimate is inaccurate, but why the optimizer ignores the execution plan that scans 37,000 rows and chooses one that scans 100,000 rows instead.

This is because, if index a is used, every value fetched from index a requires going back to the primary key index to look up the whole row, and the optimizer takes this cost into account as well.

If it chooses to scan 100,000 rows instead, the scan happens directly on the primary key index, with no such extra cost. The optimizer estimates the cost of both options and concludes from those numbers that scanning the primary key index directly is faster. Of course, in terms of actual execution time, this choice is not optimal.

MySQL selected the wrong index, and this again comes down to its failure to estimate the number of scanned rows accurately.

Since the statistics are wrong, correct them: the analyze table t command re-collects a table’s index statistics. If inaccurate index statistics are the only problem, analyze solves it, as sketched below.
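A minimal sketch of the fix (the comment describes the plan we expect afterwards, not captured output):

analyze table t;
-- With fresh statistics, explain should show key=a again for the earlier query:
explain select * from t where a between 10000 and 20000;

But as we said earlier, the optimizer does more than just look at the number of rows scanned. Again based on table t, let’s look at another statement: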

select * from t where (a between 1 and 1000) and (b between 50000 and 100000) order by b limit 1;

For the sake of analysis, let’s first consider the structure of the two indexes a and b.

If the query uses index a, it scans the first 1,000 values of index a, fetches the corresponding id for each, looks up the full row on the primary key index, and then filters on field b. Obviously, 1,000 rows need to be scanned.

If the query uses index b, it scans the last 50,001 values of index b and, just as in the process above, must go back to the primary key index for each value to fetch and check the row, so 50,001 rows need to be scanned.

explain select * from t where (a between 1 and 1000) and (b between 50000 and 100000) order by b limit 1;

As you can see, the key field in the returned result shows b: this time the optimizer selected index b, and the rows field shows that 50198 rows are expected to be scanned.

From this result, you can draw two conclusions:

  1. Estimates of the number of scanned rows are still inaccurate;
  2. MySQL selects the wrong index in this example.

Index selection exception and handling

Most of the time the optimizer finds the correct index, but occasionally you will run into situations like the two examples above. Here are some ways to handle them:

  1. One way is to force an index, as in our first example, with force index.
  2. Second, we can consider modifying the statement to guide MySQL toward the index we expect. In this example, changing “order by b limit 1” to “order by b,a limit 1” keeps the semantic logic the same (see the sketch after this list).
  3. The third approach is that, in some scenarios, we can create a new, more suitable index to give the optimizer a better option, or drop the misused index.
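A quick sketch of the first two approaches applied to this example (the order by b,a form is semantically equivalent here because limit 1 only needs some row satisfying both range conditions; with b,a in the order by, the optimizer can no longer assume index b avoids the sort, so the scan-count estimate prevails and index a wins):

select * from t force index(a) where (a between 1 and 1000) and (b between 50000 and 100000) order by b limit 1; /* approach 1 */
select * from t where (a between 1 and 1000) and (b between 50000 and 100000) order by b,a limit 1; /* approach 2 */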