Note: If the SQL in this document fails to execute, paste it into a plain-text editor and clean it up before running it again; the web page may transcode spaces or special symbols.

Query optimization

Data initialization

# DDL statement
CREATE TABLE `employees` (
	`id` INT(10) NOT NULL AUTO_INCREMENT,
	`name` VARCHAR(24) NOT NULL DEFAULT '' COMMENT 'name' COLLATE 'utf8_general_ci',
	`age` INT(10) NOT NULL DEFAULT '0' COMMENT 'age',
	`position` VARCHAR(20) NOT NULL DEFAULT '' COMMENT 'job' COLLATE 'utf8_general_ci',
	`hire_time` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT 'Entry Time',
	PRIMARY KEY (`id`) USING BTREE,
	INDEX `idx_name_age_position` (`name`, `age`, `position`) USING BTREE,
	INDEX `idx_age` (`age`) USING BTREE
)
COMMENT='Staff Record';

# random number function (randomly generates an employee age)
drop function if exists `rand_num`;
delimiter ;;
create function `rand_num`(
    `start_num` integer,
    `end_num` integer
)
    returns int
    comment ''
begin
    return floor(start_num + rand() * (end_num - start_num + 1));
end;;
delimiter ;

# random string function
drop function if exists `rand_str`;
delimiter ;;
create
    definer = `root`@`localhost` function `rand_str`(
    `n` int
)
    returns varchar(255)
begin
    declare chars_str varchar(100) default 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    declare return_str varchar(100) default '';
    declare i int default 0;
    while i < n
        do
            set return_str = concat(return_str, substring(chars_str, floor(1 + rand() * 52), 1));
            set i = i + 1;
        end while;
    return return_str;
end;;
delimiter ;

# generate test data
drop procedure if exists `insert_emp`;
delimiter ;;
create procedure `insert_emp`(
    in `max_num` int(10)
)
begin
    declare i int default 0;
    set autocommit = 0;
    repeat
        set i = i+1;
        insert into `employees` (`name`, `age`, `position`, `hire_time`)
        values (rand_str(6), rand_num(20,40), 'dev', now());

    until i = max_num
        end repeat;
    commit;
end ;;

delimiter ;
call insert_emp(100000);

Most business systems implement paging with SQL like the following:

select * from employees limit 50000,10;

This means: starting from row 50001 of the employees table, fetch 10 rows. MySQL actually reads the first 50010 rows and then discards the first 50000, returning only the last 10, which is why paging deep into a large table is very inefficient.

LIMIT paging optimization

1. Paging queries sorted by an auto-increment, contiguous primary key

Let's start with an example of a paging query sorted by an auto-increment, contiguous primary key:

mysql> select * from employees limit 90000,5;

This fetches rows 90001 through 90005. Because the primary key is auto-increment and contiguous, the query can be rewritten to seek on the primary key instead:

select * from employees where id > 90000 limit 5; 

The two queries return the same results. Now compare the execution plans:

EXPLAIN select * from employees limit 90000,5;

EXPLAIN select * from employees where id > 90000 limit 5;

Clearly, the rewritten SQL uses the index and scans far fewer rows, so it executes much more efficiently. However, this rewrite is unusable in many scenarios: some rows may have been deleted from the table, leaving gaps in the primary key sequence, so the two queries return different results. You can verify this with the experiment sketched after the list below (delete an earlier record, then run the original SQL and the optimized SQL): the plain limit query and the id > x limit query no longer return the same rows, so the optimization above cannot be used when the primary key has gaps. There is another problem: if the original SQL ordered by a non-primary-key field, the rewrite would also change the results. So this rewrite must satisfy the following two conditions:

  • The primary key is auto-increment and contiguous
  • The results are sorted by primary key
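
A minimal sketch of that experiment, assuming insert_emp filled the table with contiguous ids starting at 1:

# delete an earlier record so the primary key sequence has a gap
delete from employees where id = 1000;

# the original paging still skips exactly 90000 remaining rows,
# so it now returns rows starting at id 90002
select * from employees limit 90000,5;

# the rewritten paging still returns rows starting at id 90001, a different result
select * from employees where id > 90000 limit 5;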

2. Paging queries sorted by non-primary key fields

A paginated query that sorts by non-primary key fields looks like this:

select * from employees order by name limit 90000,5;

explain select * from employees order by name limit 90000,5;

The index on the name field is not used (the key column is NULL). Why not? Scanning the entire index and then looking up the rows it does not cover (which may mean traversing several index trees) costs more than a full table scan, so the optimizer abandons the index. Knowing why the index is skipped, how do we optimize? The key point is to make the sort return as few fields as possible: sort on the covering index to locate the primary key values first, then look up the remaining fields by primary key. The optimized SQL is as follows:

select * from employees e inner join (select id from employees order by name limit 90000,5) ed on e.id = ed.id;

The result matches the original SQL, and the execution time drops by more than half. Compare the execution plans before and after the optimization:

explain select * from employees e inner join (select id from employees order by name limit 90000,5) ed on e.id = ed.id;

The original SQL sorts with filesort, while the optimized SQL sorts using the index.

Join query optimization

Data preparation:

Example table:
CREATE TABLE `t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_a` (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

create table t2 like t1;

-- insert some sample data
-- insert 10000 rows into table t1
drop procedure if exists insert_t1;
delimiter ;;
create procedure insert_t1()
begin
    declare i int;
set i=1;
while(i <= 10000) do
    insert into t1(a,b) values(i,i);
set i=i+1;
end while;
end;;
delimiter ;
call insert_t1();

-- insert 100 rows into table t2
drop procedure if exists insert_t2;
delimiter ;;
create procedure insert_t2()
begin
    declare i int;
set i=1;
while(i <= 100) do
    insert into t2(a,b) values(i,i);
set i=i+1;
end while;
end;;
delimiter ;
call insert_t2();

MySQL has two common table join algorithms:

  • Nested-Loop Join (NLJ) algorithm
  • Block Nested-Loop Join (BNL) algorithm

1. Nested-Loop Join (NLJ) algorithm

Read rows one at a time in a loop from the first table (called the driving table), take the join column from each row, fetch the rows that satisfy the condition from the second table (the driven table) using that column, and then merge the result sets of the two tables.

mysql> explain select * from t1 inner join t2 on t1.a = t2.a;

+----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key   | key_len | ref       | rows | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+-------------+
|  1 | SIMPLE      | t2    | NULL       | ALL  | idx_a         | NULL  | NULL    | NULL      |  100 |   100.00 | Using where |
|  1 | SIMPLE      | t1    | NULL       | ref  | idx_a         | idx_a | 5       | oemp.t2.a |    1 |   100.00 | NULL        |
+----+-------------+-------+------------+------+---------------+-------+---------+-----------+------+----------+-------------+
2 rows in set, 1 warning (0.01 sec)

You can see this information from the execution plan:

  • The driving table is t2 and the driven table is t1. The driving table is executed first (when the ids in an execution plan are the same, the SQL executes from top to bottom). The optimizer normally picks the smaller table as the driving table, filters it with the WHERE conditions first, and then joins it to the driven table. So with an inner join, the table listed first is not necessarily the driving table.
  • With left join, the left table drives and the right table is driven; with right join, the right table drives and the left table is driven; with a plain (inner) join, MySQL picks the smaller table as the driving table and the larger one as the driven table.
  • The NLJ algorithm is used here. In an ordinary join statement, if Using join buffer does not appear in the Extra column of the execution plan, the join uses the NLJ algorithm.

The general flow of SQL above is as follows:

  1. Read a row from table t2 (if t2 has filter conditions, apply them first and take the row from the filtered result);
  2. Take the join column a from that row and look it up in table t1;
  3. Take the matching rows from t1, merge them with the row from t2, and return the result to the client;
  4. Repeat the three steps above.
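
As a rough sketch, in the same pseudocode style used for in/exists later in this article, the flow is:

# NLJ pseudocode sketch (not executable SQL)
# for (each row r2 in t2) {
#     a = r2.a                          # take the join column from the driving row
#     select * from t1 where t1.a = a  # index lookup on idx_a of the driven table
#     merge the matching t1 rows with r2 and return them to the client
# }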

The whole process reads all rows of table t2 (scanning 100 rows), then for each row takes the value of column a and uses it to index into table t1 to scan the matching rows (100 index lookups on t1; each lookup can be treated as ultimately scanning one complete row of t1, so t1 also scans about 100 rows). The whole process therefore scans about **200 rows**. If the join column of the driven table has no index, the NLJ algorithm performs poorly (explained below), and MySQL chooses the Block Nested-Loop Join algorithm instead.

2. Block Nested-Loop Join (BNL) algorithm

Read the rows of the driving table into join_buffer, then scan the driven table and compare each row of the driven table against the rows in join_buffer.

mysql> EXPLAIN select * from t1 inner join t2 on t1.b= t2.b;

+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra                                              |
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------------------------------------------+
|  1 | SIMPLE      | t2    | NULL       | ALL  | NULL          | NULL | NULL    | NULL |   100 |   100.00 | NULL                                               |
|  1 | SIMPLE      | t1    | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 10337 |    10.00 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+------+---------------+------+---------+------+-------+----------+----------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)


Using join buffer (Block Nested Loop) in the Extra column indicates that this join uses the BNL algorithm. The general flow of the SQL above is as follows:

  1. Put all the data of t2 into **join_buffer**
  2. Scan t1 and compare each row against the data in join_buffer
  3. Return the rows that satisfy the join condition

A full table scan is performed on both t1 and t2, so the total number of rows scanned is 10000 (all of t1) + 100 (all of t2) = 10100. Because the data in join_buffer is unordered, each of the 10000 rows of t1 must be compared against all 100 buffered rows, so the number of in-memory comparisons is 100 * 10000 = 1 million.

Here t2 has only 100 rows. What if t2 were a large table that does not fit in join_buffer? The size of join_buffer is set by the join_buffer_size parameter, 256KB by default. If all of t2 does not fit, the strategy is simply to load it in chunks: say t2 had 1000 rows and join_buffer could hold only 800 at a time, then first put 800 rows into join_buffer and compare all of t1 against it to get a partial result, then empty join_buffer, load the remaining 200 rows of t2, and compare all of t1 against the buffer again. So t1 gets scanned one extra time.

Why use BNL instead of Nested-Loop Join when the join column of the driven table has no index? If the second SQL above used Nested-Loop Join, the number of rows scanned on disk would be 100 * 10000 = 1 million (a full scan of t1 for every row of t2). Clearly BNL performs far fewer disk scans, and in-memory comparisons are much faster than disk scans. So for joins where the driven table's join column has no index, MySQL generally uses the BNL algorithm; when there is an index, the NLJ algorithm is generally selected, and with an index NLJ performs better than BNL.
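
You can check the current buffer size and, if needed, raise it for the session (the 1MB value below is only illustrative):

# check the BNL buffer size (256KB by default)
show variables like 'join_buffer_size';

# raise it for the current session only
set session join_buffer_size = 1024 * 1024;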

Optimizing join SQL

  • Index the join columns. Try to let MySQL choose the NLJ algorithm when joining. The driving table has to be read in full anyway, so make its filter conditions use indexes wherever possible and avoid full table scans; in short, let every filter condition that can use an index use one.
  • Let small tables drive large tables. When writing a multi-table join, if you know for certain which table is small, you can use straight_join to fix the driving table and save the MySQL optimizer the time it would spend deciding on its own.

An explanation of straight_join: straight_join works like inner join, except that it forces the left table to drive the right table, overriding the join order the optimizer would choose. For example, in select * from t2 straight_join t1 on t2.a = t1.a, MySQL is told to use t2 as the driving table.

  • straight_join only applies to inner join, not to left join or right join (those already fix the order in which the tables are executed).
  • Let the optimizer decide whenever possible, because in most cases the MySQL optimizer is smarter than a human. Use straight_join with caution: a hand-picked execution order is not necessarily better than the one the optimization engine chooses.

Be precise about what "small table" means. When deciding which table should be the driving table, first filter both tables by their own conditions; after filtering, compare the total data volume of the columns that participate in the join, and the table with the smaller volume is the "small table" that should drive.

In and Exists optimization

Principle: small tables drive large tables, i.e. a small data set drives a large data set. **in:** when the data set of table B is smaller than the data set of table A, in is preferable to exists:

select * from A where id in (select id from B)

# Equivalent to:
# for (select id from B) {
#     select * from A where A.id = B.id
# }

**exists:** when the data set of table A is smaller than the data set of table B, exists is preferable to in. Each row of the outer query A is checked against the subquery on B, and whether the row is kept depends on whether the subquery returns TRUE or FALSE:

select * from A where exists (select 1 from B where B.id = A.id)

# Equivalent to:
# for (select * from A) {
#     select * from B where B.id = A.id
# }

  1. EXISTS (subquery) only returns TRUE or FALSE, so the SELECT * in the subquery can just as well be SELECT 1; MySQL ignores the select list.
  2. The actual execution of an EXISTS subquery may be optimized rather than performing the row-by-row comparison we imagine.
  3. An EXISTS subquery can often also be replaced with a join; which is faster needs case-by-case analysis.

Several forms of counting

-- Temporarily disable the mysql query cache to see the true time of multiple SQL executions
mysql> set global query_cache_size=0;
mysql> set global query_cache_type=0;

mysql> EXPLAIN select count(1) from employees;
mysql> EXPLAIN select count(id) from employees;
mysql> EXPLAIN select count(name) from employees;
mysql> EXPLAIN select count(*) from employees;

Note: the four SQL statements above have identical execution plans, which means their efficiency should be similar. The difference is that count(field) does not count rows where the field is NULL.

Field with an index: count(*) ≈ count(1) > count(field) > count(primary key id). When the field has an index, count(field) counts via the secondary index; a secondary index stores less data than the primary key index, so count(field) > count(primary key id).

Field without an index: count(*) ≈ count(1) > count(primary key id) > count(field). Without an index, count(field) cannot count via an index and must read the rows, while count(primary key id) can still use the primary key index, so count(primary key id) > count(field).

count(1) is similar to count(field), except that count(1) does not need to extract field values: it counts with the constant 1, so in theory count(1) is slightly faster than count(field).

count(*) is the exception: MySQL does not fetch all the columns; it is specifically optimized to skip value extraction and simply accumulate by row, so it is very efficient.

Why does MySQL end up choosing a secondary index rather than the primary key clustered index for count(id)? Because a secondary index stores less data than the primary key index, retrieval should be faster; MySQL added this internal optimization (apparently in version 5.7).

The MySQL official documentation on count(*) versus count(1):

InnoDB handles SELECT COUNT(*) and SELECT COUNT(1) operations in the same way. There is no performance difference.

Source: dev.mysql.com/doc/refman/…

Common optimization methods

1. Query the total row count that MySQL itself maintains

For MyISAM tables, a count query without a WHERE clause is extremely fast, because MyISAM stores the table's total row count on disk and no computation is needed. For InnoDB tables, MySQL does not store the total row count (because of the MVCC mechanism, covered later), so a count query has to be computed in real time.

2. show table status

If you only need an estimate of the table's total row count, the following query performs very well:
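
For example, for the employees table used earlier (for InnoDB, the Rows column is only an estimate):

show table status like 'employees';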

3. Maintain the total in Redis

Whenever a row is inserted into or deleted from the table, maintain a count key for the table in Redis (using the incr or decr commands). This may not be perfectly accurate, though, because it is hard to guarantee transactional consistency between the table operation and the Redis operation.

4. Add a count table in the database

Maintain a count table whenever table rows are inserted or deleted, so that both operations happen in the same transaction.
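
A minimal sketch with a hypothetical counter table (the table and column names are illustrative):

# hypothetical counter table
CREATE TABLE `row_counts` (
  `table_name` VARCHAR(64) NOT NULL,
  `cnt` BIGINT NOT NULL DEFAULT 0,
  PRIMARY KEY (`table_name`)
) ENGINE=InnoDB;

INSERT INTO row_counts VALUES ('employees', 0);

# the row change and the counter update share one transaction
START TRANSACTION;
INSERT INTO employees (name, age, position) VALUES ('test01', 25, 'dev');
UPDATE row_counts SET cnt = cnt + 1 WHERE table_name = 'employees';
COMMIT;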

Choosing MySQL data types

In MySQL, choosing the right data type is critical for performance. In general, follow two steps: (1) determine the appropriate broad type: number, string, time, or binary; (2) determine the specific type: signed or unsigned, value range, variable or fixed length, and so on. When setting MySQL data types, prefer smaller types where possible, since they generally perform better and consume fewer hardware resources. Also, try to define fields as NOT NULL rather than NULL.

1. Numeric types

| Type | Size | Range (signed) | Range (unsigned) | Use |
| --- | --- | --- | --- | --- |
| TINYINT | 1 byte | (-128, 127) | (0, 255) | Small integer values |
| SMALLINT | 2 bytes | (-32768, 32767) | (0, 65535) | Large integer values |
| MEDIUMINT | 3 bytes | (-8388608, 8388607) | (0, 16777215) | Large integer values |
| INT or INTEGER | 4 bytes | (-2147483648, 2147483647) | (0, 4294967295) | Large integer values |
| BIGINT | 8 bytes | (-9223372036854775808, 9223372036854775807) | (0, 18446744073709551615) | Very large integer values |
| FLOAT | 4 bytes | (-3.402823466E+38, -1.175494351E-38), 0, (1.175494351E-38, 3.402823466E+38) | 0, (1.175494351E-38, 3.402823466E+38) | Single-precision floating-point values |
| DOUBLE | 8 bytes | (-1.7976931348623157E+308, -2.2250738585072014E-308), 0, (2.2250738585072014E-308, 1.7976931348623157E+308) | 0, (2.2250738585072014E-308, 1.7976931348623157E+308) | Double-precision floating-point values |
| DECIMAL | For DECIMAL(M,D): M+2 bytes if M>D, otherwise D+2 | Depends on the values of M and D | Depends on the values of M and D | Small precise numeric values |

Optimization Suggestions

  1. If an integer field holds no negative values, such as an ID, declare it UNSIGNED, which doubles its positive capacity.
  2. Prefer TINYINT over ENUM, BIT, and SET.
  3. Avoid integer display widths (see the end of the document): do not declare field widths like INT(10), just use INT.
  4. DECIMAL is best for data that must be exact and participates in calculations, such as prices; when using DECIMAL, be careful with the length settings.
  5. For real numbers, it can be better to multiply by an appropriate factor and store them as integers (see the sketch after this list).
  6. Integers are usually the best data type because they are fast and work with AUTO_INCREMENT.
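
For example, a sketch of suggestion 5, storing a price in cents (the table and column names are illustrative):

# 19.99 is stored as the integer 1999; divide by 100 when reading
CREATE TABLE `product_demo` (
  `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `price_cents` INT UNSIGNED NOT NULL DEFAULT 0 COMMENT 'price * 100',
  PRIMARY KEY (`id`)
);

INSERT INTO product_demo (price_cents) VALUES (1999);
SELECT price_cents / 100 AS price FROM product_demo;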

2. Date and time

| Type | Size (bytes) | Range | Format | Use |
| --- | --- | --- | --- | --- |
| DATE | 3 | 1000-01-01 to 9999-12-31 | YYYY-MM-DD | Date values |
| TIME | 3 | '-838:59:59' to '838:59:59' | HH:MM:SS | Time values or durations |
| YEAR | 1 | 1901 to 2155 | YYYY | Year values |
| DATETIME | 8 | 1000-01-01 00:00:00 to 9999-12-31 23:59:59 | YYYY-MM-DD HH:MM:SS | Combined date and time values |
| TIMESTAMP | 4 | 1970-01-01 00:00:01 to 2038-01-19 03:14:07 | YYYYMMDDhhmmss | Combined date and time values, timestamps |

Optimization Suggestions

  1. The minimum time granularity that MySQL can store is seconds.
  2. The DATE data type is recommended for storing dates. The default date format in MySQL is YYYY-MM-DD.
  3. Use MySQL’s built-in DATE, TIME, and DATETIME types to store TIME instead of strings.
  4. For TIMESTAMP and DATETIME columns, CURRENT_TIMESTAMP can be used as the default (from MySQL 5.6 on), and MySQL then records the exact insertion time automatically (see the sketch after this list).
  5. TIMESTAMP stores a UTC timestamp and is time-zone dependent.
  6. DATETIME is stored as an integer in YYYYMMDD HH:MM:SS format and is independent of time zone.
  7. Unless there is a specific need, companies generally recommend TIMESTAMP, which is more space-efficient than DATETIME; but companies like Alibaba usually use DATETIME, which avoids worrying about the 2038 limit of TIMESTAMP.
  8. Sometimes people save Unix timestamps as integer values, but this usually doesn’t do any good, it’s not a convenient format to process, and we don’t recommend it.
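
A sketch of suggestion 4 (the table and column names are illustrative; the DATETIME default requires MySQL 5.6 or later):

CREATE TABLE `audit_demo` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `created_at` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `updated_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
);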

3. Strings

| Type | Size | Use |
| --- | --- | --- |
| CHAR | 0-255 bytes | Fixed-length string. For CHAR(n), if fewer than n characters are inserted (n is the character count), the value is padded with spaces; trailing spaces are stripped on retrieval. |
| VARCHAR | 0-65535 bytes | Variable-length string. For VARCHAR(n), n is the maximum character count; no padding is added when the value is shorter. |
| TINYBLOB | 0-255 bytes | Binary string of at most 255 bytes |
| TINYTEXT | 0-255 bytes | Short text string |
| BLOB | 0-65535 bytes | Long text data in binary form |
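
A quick sketch of the trailing-space behavior described above for CHAR (the table name is illustrative):

CREATE TABLE `str_demo` (`c` CHAR(4), `v` VARCHAR(4));
INSERT INTO str_demo VALUES ('ab  ', 'ab  ');

# CHAR strips trailing spaces on retrieval, VARCHAR keeps them:
SELECT CONCAT('(', c, ')'), CONCAT('(', v, ')') FROM str_demo;
# returns (ab) and (ab  )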

References

  1. dev.mysql.com
  2. Alibaba Java Coding Specification
  3. MySQL LIMIT paging optimization
  4. MySQL join algorithms and optimization ideas
  5. The difference between count(1) and count(*)