Author: IT Wang Xiaoer

Blog: itwxe.com

This article covers Order by and Group by optimization, paging query optimization, Join query optimization, in and exists query optimization, and count(*) query optimization.

1. Ten common optimization tips

In fact, these ten tips are not absolute rules. From the previous article on the MySQL execution plan trace tool, we already know that MySQL performs a cost analysis when executing a query, and both the amount of data and the actual data values affect how the query is really executed. So don't be surprised if what you observe differs from what is written here.

So... Xiao Er hesitated for a long time before writing up these ten common SQL optimization tips, since many blogs have already covered them, but after thinking it over, here they are anyway.

```sql
-- Sample table
CREATE TABLE `employees` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(24) NOT NULL DEFAULT ' ' COMMENT 'name',
  `age` int(11) NOT NULL DEFAULT '0' COMMENT 'age',
  `position` varchar(20) NOT NULL DEFAULT ' ' COMMENT 'job',
  `hire_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Entry Time',
  PRIMARY KEY (`id`),
  KEY `idx_name_age_position` (`name`,`age`,`position`) USING BTREE
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='Staff Record';

INSERT INTO employees(name,age,position,hire_time) VALUES('itwxe',22,'manager',NOW());
INSERT INTO employees(name,age,position,hire_time) VALUES('weiwei',23,'test',NOW());
INSERT INTO employees(name,age,position,hire_time) VALUES('leilei',23,'dev',NOW());

-- Insert 100,000 (10W) rows of test data
drop procedure if exists insert_employees; 
delimiter $$
create procedure insert_employees()        
begin
  declare i int;                    
  set i = 1;                          
  while(i <= 100000) do
    insert into employees(name, age, position) values(CONCAT('itwxe', i), rand() * 42 + 18, 'dev');
    set i = i + 1;                       
  end while;
end$$
delimiter ;
call insert_employees();
```

1. Try to match all the index columns (full value match)

```sql
explain select * from employees where name = 'itwxe';
explain select * from employees where name = 'itwxe' and age = 22;
explain select * from employees where name = 'itwxe' and age = 22 and position = 'manager';
```

Note the three key_len values. idx_name_age_position consists of the (name, age, position) columns: key_len = 74 means only name is used; 78 means name and age are used; 140 means name, age, and position are all used.
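
For reference, this is roughly how those key_len values are derived, assuming the utf8 charset from the table definition (3 bytes per character plus a 2-byte length prefix for varchar, 4 bytes for int, and no null flag because the columns are NOT NULL):

```sql
-- name     varchar(24) NOT NULL -> 24 * 3 + 2 = 74
-- age      int(11)     NOT NULL -> 4               => 74 + 4  = 78
-- position varchar(20) NOT NULL -> 20 * 3 + 2 = 62 => 78 + 62 = 140
explain select * from employees where name = 'itwxe' and age = 22 and position = 'manager';
```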

2. Left-most prefix rule

Pay special attention to the left-most prefix rule when using a composite (joint) index: the query must start from the left-most column of the index and must not skip columns in the index.

```sql
explain select * from employees where name = 'itwxe' and age = '18';
explain select * from employees where name = 'itwxe' and position = 'manager';
explain select * from employees where position = 'manager';
```

Note that "left-most" refers to the column order in the index definition, not the order in which the conditions are written in the query. For example, even if the three columns of idx_name_age_position (name, age, position) appear in reverse order in the WHERE clause, the index can still be used.

```sql
explain select * from employees where name = 'itwxe' and age = 22 and position = 'manager';
explain select * from employees where age = 22 and position = 'manager' and name = 'itwxe';
explain select * from employees where position = 'manager' and name = 'itwxe' and age = 22;
```

You can see that even if the order is reversed, the three execution plans are exactly the same.

3. Do not perform any operation (calculation, function, (automatic/manual) type conversion) on the index column, which will cause the index to fail and turn to full table scan

Note that this refers to operations on the index column in the WHERE condition; applying a calculation or function to a column in the select list does not affect index usage.

For example, applying the left() function to the name column:

```sql
explain select * from employees where name = 'weiwei';
explain select * from employees where left(name,6) = 'weiwei';
```
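
If you are on MySQL 8.0.13 or later, one possible way to keep such a query on an index is a functional index. This is only a sketch; the sample table above does not define this index:

```sql
-- MySQL 8.0.13+ only: index the expression itself so the left() comparison can use it
alter table employees add key idx_name_left6 ((left(name, 6)));
explain select * from employees where left(name, 6) = 'weiwei';
```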

4. The storage engine cannot use the column to the right of the range condition in the index

```sql
explain select * from employees where name = 'itwxe' and age = 22 and position = 'manager';
explain select * from employees where name = 'itwxe' and age > 22 and position = 'manager';
```

You can see that the second SQL uses the name and age columns of the index, but not position.

5. Use covering indexes as much as possible and reduce `select *` statements

Covering indexes were covered in the previous article, so I won't repeat the details here.
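
A quick illustration against the sample table: selecting only columns contained in idx_name_age_position lets MySQL answer the query from the index alone, while `select *` forces a lookup back to the clustered index for every matching row.

```sql
-- Served entirely from idx_name_age_position ("Using index" in Extra)
explain select name, age, position from employees where name = 'itwxe';
-- Has to go back to the clustered index for the remaining columns
explain select * from employees where name = 'itwxe';
```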

6. When MySQL uses not-equal (!= or <>), it usually cannot use the index and falls back to a full table scan

For <, >, <=, and >=, the MySQL optimizer estimates the query cost based on factors such as the proportion of rows retrieved and the table size, and decides whether using the index is worthwhile.

```sql
explain select * from employees where name != 'itwxe';
```

7. is null and is not null generally cannot use the index either

```sql
explain select * from employees where name is null;
explain select * from employees where name is not null;
```

8. A like starting with a wildcard ('%itwxe...') makes the index fail and MySQL performs a full table scan

```sql
explain select * from employees where name like 'wei%';
explain select * from employees where name like '%wei';
```

Think about why, given the B+tree structure of the index: with a leading wildcard the left-most characters are unknown, so the tree cannot be navigated. If the business really needs a '%xxx%' style search, the usual options are:

  • Build a suitable composite index for the business and use covering index queries, so the scan at least stays within the index.
  • If a covering index cannot satisfy the query, use a search engine such as Elasticsearch instead.

A like 'KK%' condition is essentially treated as an equality against a constant, while '%KK%' behaves like a range condition.

Here we introduce the concept of index condition pushdown (ICP):

```sql
explain select * from employees where name like 'weiwei%' and age = 22 and position = 'manager';
explain select * from employees where name like 'itwxe%' and age = 22 and position = 'manager';
```

For name like 'weiwei%', the execution plan shows key_len = 140, which means MySQL used all of the name, age, and position columns of the index for the query. This is because index condition pushdown was introduced as an optimization in MySQL 5.6.

With index condition pushdown, MySQL can evaluate all the conditions on columns contained in the index while traversing the index, filter out the non-matching records first, and only then go back to the table, which effectively reduces the number of table lookups.

In other words, for name like 'weiwei%', index condition pushdown is used: while filtering on name, the age and position conditions are also checked inside the index, and only the primary key ids of the remaining index entries are used to look up the full rows in the table.
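
If you want to see the effect for yourself, index condition pushdown can be switched off for the current session via optimizer_switch and the explain re-run; this is just a small experiment, not something the optimization requires:

```sql
set session optimizer_switch = 'index_condition_pushdown=off';
explain select * from employees where name like 'weiwei%' and age = 22 and position = 'manager';
-- Extra should no longer show "Using index condition"; remember to switch it back on
set session optimizer_switch = 'index_condition_pushdown=on';
```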

For name like 'itwxe%' (which matches almost every row in the test data), MySQL chooses a full table scan instead, because the cost of scanning the secondary index plus the table lookups would be greater than the cost of a full table scan.

9. Index failure caused by type mismatch

Comparing a string column to a value without single quotes (or a numeric column to a quoted value) triggers the automatic type conversion (also known as implicit conversion) mentioned in point 3, and the index on the column can fail.
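
For example (the value 1000 is hypothetical, just for illustration): comparing the varchar column name to a number forces MySQL to cast every name value, so the index on name cannot be used, whereas quoting the value keeps the index usable.

```sql
explain select * from employees where name = 1000;    -- implicit cast on the column, index fails
explain select * from employees where name = '1000';  -- plain string comparison, index can be used
```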

10. Use or and in sparingly; MySQL does not necessarily use an index when querying with them

The MySQL optimizer estimates the query cost based on factors such as the proportion of rows retrieved and the table size, and decides whether to use an index.

2. Order by and Group by optimization

1. Order by optimization

A previous article introduced the underlying data structure of indexes: an index is itself a sorted data structure, so the best way to optimize sorting is to let the index do it, meaning the rows come out of the query already in order. In MySQL this shows up in the execution plan as Using index. If the data is not already sorted by the required field, Using filesort appears instead.

Optimizing order by is therefore mainly about eliminating the inefficient Using filesort: build an appropriate composite index and use a covering index so the sort is done by the index.
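
As a small sketch against the sample table: when the order by columns follow the index column order after the equality condition, the sort can be done by the index; otherwise Using filesort shows up.

```sql
-- name is an equality and age, position follow the index order: the index returns rows already sorted
explain select name, age, position from employees where name = 'itwxe' order by age, position;
-- position, age is not the index order, so Extra typically shows "Using filesort"
explain select name, age, position from employees where name = 'itwxe' order by position, age;
```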

How Using filesort (file sorting) works

Using filesort is divided into single-way sorting and double-way sorting (also called the back-to-table sort mode).

  • Single-way sort: fetch all the fields needed by the query for the qualifying rows at once and sort them in the sort buffer. With the trace tool, the sort_mode field shows <sort_key, additional_fields> or <sort_key, packed_additional_fields>.
  • Double-way sort: first fetch only the sort field plus the row id that can locate the row, sort those in the sort buffer, and after sorting go back to fetch the other fields the query needs. With the trace tool, the sort_mode field shows <sort_key, rowid>.

MySQL decides which sort mode to use by comparing the system variable max_length_for_sort_data (default 1024 bytes) with the total size of the fields to be selected.

  • If the total field length is less than max_length_for_sort_data, single-way sort is used.
  • If the total field length is greater than max_length_for_sort_data, double-way sort is used.

Let’s use the trace tool to look at single-path and double-path sorts:

```sql
-- Single-path sort
mysql> set session optimizer_trace="enabled=on", end_markers_in_json=on;

mysql> select * from employees where name = 'itwxe' order by position;

mysql> select * from information_schema.OPTIMIZER_TRACE;

"join_execution": {  -- SQL execution phase
  "select#": 1,
  "steps": [
    {
      "filesort_information": [
        {
          "direction": "asc",
          "table": "`employees`",
          "field": "position"
        }
      ] /* filesort_information */,
      "filesort_priority_queue_optimization": {
        "usable": false,
        "cause": "not applicable (no LIMIT)"
      } /* filesort_priority_queue_optimization */,
      "filesort_execution": [
      ] /* filesort_execution */,
      "filesort_summary": {  -- File ordering information
        "rows": 1.-- Expected number of rows scanned
        "examined_rows": 1.The number of rows participating in the sort
        "number_of_tmp_files": 0.The number of temporary files used. If 0 represents the total sort_buffer memory sort used, otherwise the disk file sort used
        "sort_buffer_size": 262056.-- Size of the sort cache, in bytes
        "sort_mode": "<sort_key, packed_additional_fields>"  -- Sorting mode, here we use single-path sorting
      } /* filesort_summary */}]/* steps */
} /* join_execution */

-- Dual-path sort
mysql> set max_length_for_sort_data = 10;  -- the selected fields of any employees row add up to well over 10 bytes, forcing double-way sort

mysql> select * from employees where name = 'itwxe' order by position;

mysql> select * from information_schema.OPTIMIZER_TRACE;

"join_execution": {
  "select#": 1,
  "steps": [
    {
      "filesort_information": [
        {
          "direction": "asc",
          "table": "`employees`",
          "field": "position"
        }
      ] /* filesort_information */,
      "filesort_priority_queue_optimization": {
        "usable": false,
        "cause": "not applicable (no LIMIT)"
      } /* filesort_priority_queue_optimization */,
      "filesort_execution": [
      ] /* filesort_execution */,
      "filesort_summary": {
        "rows": 1,
        "examined_rows": 1,
        "number_of_tmp_files": 0,
        "sort_buffer_size": 262136,
        "sort_mode": "<sort_key, rowid>"  -- Sorting mode, here we use double - way sorting
      } /* filesort_summary */
    }
  ] /* steps */
} /* join_execution */
```

The detailed process of single-path sorting:

  1. From the name index, find the first primary key id that satisfies name = 'itwxe'.
  2. Look up the whole row by that primary key id, take out the values of all the fields, and put them in the sort buffer.
  3. From the name index, find the next primary key id that satisfies name = 'itwxe'.
  4. Repeat steps 2 and 3 until name = 'itwxe' is no longer satisfied.
  5. Sort the data in the sort buffer by the position field.
  6. Return the result to the client.

The detailed process of double sort:

  1. From the name index, find the first primary key id that satisfies name = 'itwxe'.
  2. Put the position field and the primary key id into the sort buffer.
  3. From the name index, fetch the next primary key id that satisfies name = 'itwxe'.
  4. Repeat steps 2 and 3 until name = 'itwxe' is no longer satisfied.
  5. Sort the (position, primary key id) pairs in the sort buffer by position.
  6. Iterate over the sorted ids, go back to the original table by id, and return the values of all the required fields.

Summary: single-way sort caches the entire row in the sort buffer; double-way sort puts only the primary key id and the sort field into the sort buffer, and after sorting goes back to the original table by id to fetch the data that is returned to the client.

How does MySQL choose between single-way and double-way sort?

The MySQL optimizer has its own algorithm for choosing between the two. If the total length of the columns in the query is larger than max_length_for_sort_data, double-way sort is used; otherwise single-way sort is used, which is faster but needs more memory in the sort buffer. If memory allows and you want single-way sort, you can increase max_length_for_sort_data.

Note: sorting entirely in the sort buffer (default 1M) is generally faster than sorting with disk files, but that is not a reason to blindly increase sort_buffer_size; many MySQL parameters are already well tuned and should not be changed casually.

2. Group by optimization

group by is very similar to order by: it essentially sorts first and then groups, so the order by optimization principles above also apply to group by. If the grouping result does not need to be ordered, add order by null to disable the implicit sort.
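
A minimal example; note that before MySQL 8.0 group by implied sorting by the grouping columns, while from 8.0 on it no longer does, so order by null mainly matters on older versions:

```sql
-- Grouping without caring about the order of the groups: order by null skips the implicit sort (pre-8.0)
explain select age, count(*) from employees group by age order by null;
```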

3. Paging query optimization

The sample table for paging query optimization remains employees.

```sql
select * from employees limit 90000,10;
```

Business systems often implement paging with SQL like the one above. It looks as if MySQL only fetches 10 records starting from row 90001, but in fact MySQL reads the first 90010 records, discards the first 90000, and returns rows 90001-90010 to the client. So querying pages deep into a large table is very inefficient.

1. Paging queries sorted by an auto-increment, consecutive primary key

```sql
select * from employees where id > 90000 limit 10;
```

Principle: the primary key index is used to skip directly past the rows with id <= 90000 and take the next 10 rows, instead of scanning and discarding the first 90000 rows.

Drawback: this does not work if the primary key ids are not consecutive, for example when rows have been deleted in the middle, so the second method below is usually used instead.

2. Paging queries sorted by non-primary key fields

```sql
select * from employees order by name limit 90000,10;
select * from employees ed_all inner join (select id from employees order by name limit 90000,10) ed_id on ed_all.id = ed_id.id;
```

Both queries return the same result, but the speed difference is already large with only 100,000 rows of test data; with 1,000,000 rows the gap would be far bigger. The trick is that the subquery sorts and pages using only the id values from the name index (a covering index), and only the final 10 ids are joined back to fetch the full rows.

The Alibaba Java Development Manual also describes this situation; it is well worth reading, as it has a dedicated chapter on MySQL development guidelines.

4. Join query optimization

```sql
-- Sample tables
CREATE TABLE `t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

create table t2 like t1;

-- Insert 10,000 (1W) rows into t1
drop procedure if exists insert_t1;
delimiter $$
create procedure insert_t1()
begin
  declare i int;
  set i = 1;
  while(i <= 10000) do
    insert into t1(a,b) values(i,i);
    set i = i + 1;                  
  end while;
end$$
delimiter ;
call insert_t1();

-- Insert 100 rows into t2
drop procedure if exists insert_t2;
delimiter $$
create procedure insert_t2()
begin
  declare i int;
  set i = 1;
  while(i <= 100) do
    insert into t2(a,b) values(i,i);
    set i = i + 1;                  
  end while;
end$$
delimiter ;
call insert_t2();
```

1. Two common algorithms for table association

Nested-Loop Join (NLJ) algorithm

The NLJ algorithm reads rows one at a time from the first table (the driver table), takes the join field from each row, looks up the matching rows in the other table (the driven table) by that field, and merges them into the result set.

```sql
explain select * from t1 inner join t2 on t1.a = t2.a;
```

It can be seen from the execution plan that T2 is the driver table and T1 is the driven table. The driver table is executed first (if the id of the execution plan result is the same, the SQL will be executed from top to bottom). The optimizer usually selects the small table as the driver table first, filters the driver table with the WHERE condition, and then performs the associated query with the driven table. So when using an inner join, the first table is not necessarily the driver table.

  • When left JOIN is used, the left table is the driver table and the right table is the driven table.
  • When using the right Join, the right table is the drive table and the left table is the driven table.
  • When using inner Join, MySQL usually selects a small table as the driver table and a large table as the driven table.

Note: when deciding which table should be the driver table, first apply each table's own filter conditions, then compare the amount of data (of the columns participating in the join) left after filtering; the table with less data is the "small table" and should be the driver table. It is not simply a comparison of the two tables' total row counts.

The general flow of SQL above is as follows:

  • Read a row from t2 (if t2 has query filter criteria, filter the row from t2).
  • Extract the associated field A from the data in step 1 and look it up in table T1.
  • Take out the rows that meet the conditions in table T1, merge them with the results obtained in table T2, and return them to the client as the results.
  • Repeat the above three steps.

The whole process reads all of table t2 (100 rows scanned), then for each row takes the value of column a and uses the index on t1.a to locate the matching row in t1 (the t1 index is probed 100 times, and each probe can be counted as scanning one complete row of t1, so t1 contributes another 100 rows). In total the process scans about 200 rows.

If the join column has no index, the NLJ algorithm would perform poorly, so MySQL chooses the BNL algorithm instead.

Block Nested-Loop Join (BNL) algorithm

The BNL algorithm reads the driver table's data into the join_buffer, then scans the driven table and compares each row of the driven table against the rows in the join_buffer.

```sql
explain select * from t1 inner join t2 on t1.b = t2.b;
```

Using join Buffer (Block Nested Loop) in Extra indicates that the BNL algorithm is used in this associated query. You can also see that T2 is still the driver table and T1 is the driven table.

The general flow of SQL above is as follows:

  • Put all t2 data into join_buffer.
  • Take each row from t1 and compare it to join_buffer.
  • Returns data that meets the join criteria.

A full table scan is performed on both tables T1 and T2, so the total number of rows scanned is 10000(total data in table T1) + 100(total data in table T2) = 10100. And the data in join_buffer is unordered, so every row in table T1 needs to be judged 100 times, so the number of judgments in memory is 100 * 10000= 1 million.

Table T2 has only 100 rows. What if table T2 is a large table and join_buffer does not fit?

The size of join_buffer is set by the join_buffer_size parameter (default 256K). If all of table t2's data cannot fit, the strategy is simply to load it in chunks. The process is as follows (a snippet for inspecting join_buffer_size follows the list):

  • Take some data from the driver table and put it into join_buffer until join_buffer cannot fit. (For example, if the T2 table has 1000 rows and join_buffer can only hold 800 rows at a time, the procedure is to first add 800 rows to join_buffer.)
  • Each row of the driven table is scanned and compared to join_buffer. The data that meets the join criteria is returned as part of the result set. (Compare t1 data with join_buffer data to obtain partial result)
  • Empty join_buffer. (empty join_buffer)
  • Continue reading the remaining data of the driver table and repeat the previous three steps until the driver table has been fully scanned. (In the example, the remaining 200 rows of t2 are then put into join_buffer and t1's data is compared against it again, so the driven table t1 ends up being scanned multiple times.)

Therefore, in this chunked example, if the driver table has N rows and needs to be split into K chunks, and the driven table has M rows, then the number of rows scanned is N + K * M, i.e. 1000 + 2 * 10000 = 21000 rows, and the total number of in-memory comparisons is (800 + 200) * 10000 = 10 million.
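
As mentioned above, the chunk size is governed by join_buffer_size. A quick way to inspect or adjust it for the current session (the 1MB value is only an example):

```sql
show variables like 'join_buffer_size';       -- default is 256KB
set session join_buffer_size = 1024 * 1024;   -- e.g. 1MB for this session: fewer chunks for BNL
```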

Why use the BNL algorithm instead of NLJ algorithm when the associated field of the driven table has no index?

If the NLJ algorithm were chosen without an index, the number of rows scanned would be 100 * 10000 = 1,000,000, and all of those would be disk scans.

Obviously, disk scans with the BNL algorithm are much less frequent, and BNL memory calculations are much faster than disk scans.

Therefore, when the driven table's join column has no index, MySQL generally uses the BNL algorithm; when there is an index, the NLJ algorithm is generally chosen, and with an index NLJ performs better than BNL.

2. Optimization of associated SQL

Having looked at the NLJ and BNL algorithms, you probably already have some ideas about how to optimize join SQL.

  • Avoid joining more than three tables. In multi-table joins, make sure the join columns have indexes so the NLJ algorithm can be used as much as possible, and make sure the joined columns have the same data type.
  • Let the small table drive the large table. When writing multi-table join SQL, if you already know which table is the small one, you can use straight_join to fix the join order and save the optimizer the work of deciding.

straight_join: works like inner join, but forces the left table to drive the right table, changing the join order the optimizer would otherwise choose. For example, in select * from t2 straight_join t1 on t2.a = t1.a; MySQL will use t2 as the driver table (see the example after the list below).

  • straight_join only applies to inner joins, not left or right joins (left join and right join already determine the join order).
  • Use straight_join sparingly and let the optimizer decide in most cases, because a manually fixed join order is not necessarily better than the optimizer's choice.
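
The straight_join form mentioned above, written out against the sample tables t1 and t2 so it can be checked with explain:

```sql
-- Forces t2 (the left table) to be the driver table regardless of the optimizer's own preference
explain select * from t2 straight_join t1 on t2.a = t1.a;
```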

5. in and exists optimization

Principle: Small tables drive large tables.

In optimization: In is superior to EXISTS when the data set of table B is smaller than that of table A.

```sql
select * from A where id in (select id from B);
# Roughly equivalent to:
# for (select id from B) {
#     select * from A where A.id = B.id;
# }
```

exists optimization: when the data set of table A is smaller than that of table B, exists is better than in.

```sql
select * from A where exists (select 1 from B where B.id = A.id);
# Roughly equivalent to:
# for (select * from A) {
#     select 1 from B where B.id = A.id;
# }
```
  • exists(subquery) only returns true or false, so the select * in the subquery can also be written as select 1; the official documentation says the select list is ignored during actual execution, so there is no difference.
  • An exists subquery can often also be rewritten as a join; which form is faster has to be analyzed case by case (see the sketch below).
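
For instance, assuming B.id is unique (e.g. a primary key), the exists form above could be rewritten as a join; A and B are the placeholder tables from the text, and whether the join is actually faster still needs to be checked with explain:

```sql
-- One possible join rewrite of the exists subquery (needs distinct or a semi-join if B.id is not unique)
select A.* from A inner join B on A.id = B.id;
```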

6. count(*) query optimization

1. Comparing the count queries

```sql
-- Temporarily disable the MySQL query cache
set global query_cache_size=0;
set global query_cache_type=0;

-- The four count queries
explain select count(*) from employees;
explain select count(1) from employees;
explain select count(id) from employees;
explain select count(name) from employees;
```

You may have heard DBAs or bloggers advise against using count(*) to count rows, but that is not really the case. You can see that the four execution plans above are identical.

Since the execution plans are the same, the four statements are similar in efficiency; in fact count(*) has extra optimizations since MySQL 5.7. Note, however, that count(name) does not count rows whose name is null.

  • When the field has an index: count(*) ≈ count(1) > count(field) > count(primary key id). With an index on the field, count(field) walks the secondary index, which stores less data than the primary key index, so it is faster than count(primary key id).
  • When the field has no index: count(*) ≈ count(1) > count(primary key id) > count(field). Without an index, count(field) cannot use an index for the statistics, while count(primary key id) can still use the primary key index, so count(primary key id) > count(field).
  • count(1) and count(field) execute similarly, but count(1) does not need to extract the field values; it counts using the constant 1, while count(field) still has to take the field values out row by row, so in theory count(1) is slightly faster than count(field).
  • count(*) is the exception: MySQL does not take all the column values out; it is specially optimized to count rows without expanding the values, so there is no need to replace count(*) with count(column) or count(constant).

Why does MySQL end up choosing the secondary index rather than the primary key clustered index for count(id)?

Because the secondary index stores less data than the primary key index, the retrieval performance should be higher, and some internal optimizations have been made in MySQL5.7.

2. Common optimization methods

a. Query the total number of rows maintained by MySQL

For tables using the MyISAM storage engine, count queries without a WHERE condition are very fast, because MyISAM stores the table's total row count on disk and the query does not need to compute it.

```sql
select count(*) from test_myisam;
explain select count(*) from test_myisam;
```

You can see from the execution plan that no table is actually read; the row count is returned directly.

b. show table status

Running show table status like 'employees' returns the total row count of the table very quickly, but the result is only an estimate.
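
For the sample table it looks like this (the Rows column is only an estimate for InnoDB tables):

```sql
show table status like 'employees';
```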

c. Maintain the total count in Redis

When inserting or deleting a row, also update a Redis key that holds the table's total row count (using the incr or decr command). However, this may be inaccurate, because it is hard to guarantee transactional consistency between the table operation and the Redis operation.

d. Add a count table in the database

Maintain a count table whenever rows are inserted or deleted, so that the count update and the data change happen in the same transaction.
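
A rough sketch of what that might look like; the row_counts table and its columns are made up for illustration and are not part of the sample schema:

```sql
-- Hypothetical counter table, maintained in the same transaction as the data change
create table row_counts (
  table_name varchar(64) primary key,
  total      bigint not null default 0
) engine=InnoDB;

insert into row_counts(table_name, total) values('employees', 0);

begin;
insert into employees(name, age, position) values('somebody', 20, 'dev');
update row_counts set total = total + 1 where table_name = 'employees';
commit;
```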

That's all for these common SQL performance tuning tips. It's late at night...

Of course, the core SQL optimization skill is still how to design proper indexes, and that would make this article far too long... So... in the next article, Xiao Er will talk about index design principles and techniques through an interesting case study. See you in the next one 😋.