In front of the

As mentioned in the previous SQL Advanced Tips (part 2), using the following SQL results in slow queries

SELECT COUNT(*) FROM SomeTable
SELECT COUNT(1) FROM SomeTable
Copy the code

MySQL is optimized for COUNT(*) without where_clause. The optimizer selects the lowest cost auxiliary index query COUNT, which actually has the highest performance

To answer this question, I first went to the production to find a table of 10 million level and used EXPLAIN to query the execution plan

EXPLAIN SELECT COUNT(*) FROM SomeTable
Copy the code

The results are as follows

As shown in the figure: In this case, MySQL will use the lowest cost secondary index query for both COUNT(1) and COUNT(*). Use COUNT(*) because MySQL is optimized to ensure that it has the best query performance! By the way, COUNT(*) is the standard syntax for counting rows defined by SQL92 and is efficient, so use COUNT(*) directly to query the number of rows in the table!

So the reader is right. However, there is a premise that this optimization is available only after MySQL 5.6.

How do you define the minimum cost? Sometimes when you specify multiple conditions in WHERE, why does MySQL end up selecting another index or even not selecting an index?

This article will give you the answer, which will be analyzed from the following two aspects

  • How is the execution cost of SQL selection index calculated
  • Example is given to illustrate

How is the execution cost of SQL selection index calculated

As mentioned above, in the case of multiple indexes, MySQL will choose the index with the least cost before querying the data. The cost here mainly includes two aspects.

  • IO cost: That the cost of the load data from disk into memory, by default, the read data page I/o cost is 1, MySQL is read data in the form of pages, namely when some data are used, can not only read the data, and will put the data of the adjacent data is read into memory together, this is the famous program locality principle, So MySQL is going to read a whole page at a time, and the cost per page is 1. So the cost of IO is mainly related to the size of the page
  • CPU cost: The cost of CPU operations such as checking for conditions and sorting after reading data into memory, which is obviously related to the number of rows. By default, the cost of detecting records is 0.2.

Example is given to illustrate

To figure out the final cost of using an index based on these two costs, prepare a table (the following is based on MySQL 5.7.18)

CREATE TABLE `person` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) NOT NULL.`score` int(11) NOT NULL.`create_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `name_score` (`name`(191),`score`),
  KEY `create_time` (`create_time`))ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
Copy the code

In addition to the primary key index, this table has two other indexes, name_SCORE and CREATE_time. We then insert 10 w rows of data into this table by writing a stored procedure call as follows:

CREATE PROCEDURE insert_person()
begin
    declare c_id integer default 1;
    while c_id<=100000 do
    insert into person values(c_id, concat('name',c_id), c_id+100, date_sub(NOW(), interval c_id second));
    set c_id=c_id+1;
    end while;
end
Copy the code

After the insert we now use EXPLAIN to calculate which index is used for the total number of rows

EXPLAIN SELECT COUNT(*) FROM person
Copy the code

The create_time secondary index is chosen. MySQL believes that this index is the cheapest to use for the query, which is in line with our expectation that the secondary index is the highest performance!

Let’s see which index is used in the following SQL

SELECT * FROM person WHERE NAME >'name84059' AND create_time>'the 2020-05-23 14:39:18' 
Copy the code

Full table scan was used! SELECT * from WHERE WHERE SELECT * from WHERE WHERE name_SCORE = create_time = name_score = create_time = create_time

SELECT name FROM person WHERE NAME >'name84059'
Copy the code

MySQL still selects full table scan! It is interesting to see why MySQL has chosen a full table scan instead of an overwritten index. Since MySQL believes that a full table scan performs better than an overwritten index, let’s compare the query times between the two

-- Full table scan execution time: 4.0ms
SELECT create_time FROM person WHERE NAME >'name84059' AND create_time>'the 2020-05-23 14:39:18' 

Execution time using overwrite index: 2.0 ms
SELECT create_time FROM person force index(create_time) WHERE NAME >'name84059' AND create_time>'the 2020-05-23 14:39:18' 
Copy the code

The actual performance of the overridden index query was twice as fast as that of the full table scan! MySQL in the query before the cost estimate is not allowed! Let’s first look at how much MySQL costs to do a full table scan.

The main IO costs are related to CPU costs, which in the case of a full table scan are related to the number of pages used by the cluster index and the number of records in the table, respectively. Run the following command

SHOW TABLE STATUS LIKE 'person'
Copy the code

You can find

  1. The number of rows is 100264, we inserted 10 W rows of data, how to calculate the data instead of more, in fact, the calculation here is an estimate, it is possible that the number of rows here is less than 10 W, the estimation method interested in you go to the Internet to find, here is not the focus of this article, will not expand. Given the number of rows, we know that the CPU cost is 100264 * 0.2 = 20052.8.

  2. The data length is 5783552, and InnoDB’s per-page size is 16 KB, so you can calculate 353 pages.

That is, the cost of a full table scan is 20052.8 + 353 = 20406.

If this is true, we can use a tool to verify that. In MySQL 5.6 and later, we can use the Optimizer Trace feature to view the entire process of generating the optimizer plan. It lists the execution plan cost for each index selected and the final selection result. We can rely on this information to further optimize our SQL.

The optimizer_Trace function is used as follows

SET optimizer_trace="enabled=on";
SELECT create_time FROM person WHERE NAME >'name84059' AND create_time > 'the 2020-05-23 14:39:18';
SELECT * FROM information_schema.OPTIMIZER_TRACE;
SET optimizer_trace="enabled=off";
Copy the code

After execution we mainly looked at the cost of using name_SCORE, CREATE_time index, and full table scan.

Let’s look at the estimated execution cost of using the name_SCORE index:

{
    "index": "name_score"."ranges": [
      "name84059 <= name"]."index_dives_for_eq_ranges": true."rows": 25372."cost": 30447
}
Copy the code

You can see that the execution cost is 30447, which is higher than the 20406 cost we calculated earlier for a full table scan. Therefore, this index is not selected for execution

Note: 30447 is the sum of the IO cost and CPU cost of querying the secondary index, plus the IO cost and CPU cost of querying the cluster index back to the table.

Now look at the estimated execution cost for the create_time index:

{
    "index": "create_time",
    "ranges": [
      "0x5ec8c516 < create_time"
    ],
    "index_dives_for_eq_ranges": true,
    "rows": 50132,
    "cost": 60159,
    "cause": "cost"
}
Copy the code

You can see that the cost is 60159, which is much higher than the full table scan cost of 20406, so this index is not selected.

Let’s look at the calculated full table scan cost:

{
    "considered_execution_plans": [
      {
        "plan_prefix": [
        ],
        "table": "`person`",
        "best_access_path": {
          "considered_access_paths": [
            {
              "rows_to_scan": 100264,
              "access_type": "scan",
              "resulting_rows": 100264,
              "cost": 20406,
              "chosen": true
            }
          ]
        },
        "condition_filtering_pct": 100,
        "rows_for_plan": 100264,
        "cost_for_plan": 20406,
        "chosen": true
      }
    ]
}
Copy the code

Note that the cost: 20406 is exactly the same as what we calculated before! This value is the lowest of the three calculated execution costs, so MySQL chose to execute the SQL with a full table scan.

The Optimizer Trace actually lists the cost statistics of the overwrite index and the return table in detail.

Can be seen from the above analysis, the MySQL execution plan is not necessarily the best choice, because there are a lot, speak and the number of rows on such as statistics, such as MySQL again believe the best is not the same as we think, we can think of is optimal execution time is short, but the MySQL think small cost does not necessarily mean the execution time is short.

conclusion

This article uses an example to deeply analyze how MySQL’s execution plan is selected and why its selection may not be optimal as we think. It also reminds us that if there are multiple indexes in production, using WHERE filtering may not select the indexes you think. We can use EXPLAIN, Optimizer trace to optimize our query ahead of time.

Finally, welcome to pay attention to public communication, common progress!

Shoulders of giants

  • Time.geekbang.org/column/arti…