A very interesting SQL optimization experience

This article should take about 6 minutes to read.

preface

Brush online to a database optimization article, to study a wave of their own.

scenario

Database version: 5.7.25, running on a VIRTUAL machine.

The curriculum

# the curriculum
create table Course(
c_id int PRIMARY KEY,
name varchar(10)
)Copy the code

Add 100 pieces of data

Add 100 pieces of data to the class schedule
DROP PROCEDURE IF EXISTS insert_Course;
DELIMITER $
CREATE PROCEDURE insert_Course()
BEGIN
    DECLARE i INT DEFAULT 1;
        WHILE i<=100 DO
        INSERT INTO Course(`c_id`,`name`) VALUES(i, CONCAT('Chinese',i+' '));
        SET i = i+1;
    END WHILE;
END $
CALL insert_Course();Copy the code

Run time consuming

CALL insert_Course(); > OK > Time: 0.152sCopy the code

Course data

Students table

# students table
create table Student(
s_id int PRIMARY KEY,
name varchar(10)
)Copy the code

Add 7W pieces of data

Add 70000 entries to student table
DROP PROCEDURE IF EXISTS insert_Student;
DELIMITER $
CREATE PROCEDURE insert_Student()
BEGIN
    DECLARE i INT DEFAULT 1;
        WHILE i<=70000 DO
        INSERT INTO Student(`s_id`,`name`) VALUES(i, CONCAT('Joe',i+' '));
        SET i = i+1;
    END WHILE;
END $
CALL insert_Student();Copy the code

The results

CALL insert_Student(); > OK > Time: 175.838sCopy the code

The student data

League tables

The result for #
CREATE table Result(
r_id int PRIMARY KEY,
s_id int,
c_id int,
score int
)Copy the code

Add 70W pieces of data

Add 70W data to the score table
DROP PROCEDURE IF EXISTS insert_Result;
DELIMITER $
CREATE PROCEDURE insert_Result()
BEGIN
    DECLARE i INT DEFAULT 1;
        DECLARE sNum INT DEFAULT 1;
        DECLARE cNum INT DEFAULT 1;
        WHILE i<=700000 DO
                if (sNum%70000 = 0) THEN
                    set sNum = 1;
                elseif (cNum%100 = 0) THEN 
                    set cNum = 1;
                end if;
        INSERT INTO Result(`r_id`,`s_id`,`c_id`,`score`) VALUES(i,sNum ,cNum , (RAND()*99)+1);
        SET i = i+1;
                SET sNum = sNum+1;
                SET cNum = cNum+1;
    END WHILE;
END $
CALL insert_Result();Copy the code

The results

CALL insert_Result(); > OK > Time: 2029.5sCopy the code

Results data

test

The business requirements

Find Chinese 1 score 100 test takers

The query

Select * from test1 where test1 = 100
select s.* from Student s where s.s_id in 
(select s_id from Result r where r.c_id = 1 and r.score = 100)Copy the code

Execution time: 0.937s

Query result: 32 students meet the conditions

0.9s, to view the query plan:

EXPLAIN
select s.* from Student s where s.s_id in 
(select s_id from Result r where r.c_id = 1 and r.score = 100)Copy the code

Select * from ‘where’; select * from ‘where’; select * from ‘where’;

In the query result, column TYPE: all indicates that all tables are scanned, and index indicates that index is scanned.

Create index (c_id, score)

CREATE index result_c_id_index on Result(c_id);

CREATE index result_score_index on Result(score);Copy the code

Run the preceding query statement again in 0.027 seconds

Fast 34.7 times (rounded), greatly shorten the query time, it seems that the index can greatly improve the query efficiency, it is necessary to build an index in the right column, many times forget to build an index, when the amount of data is small, there is no feeling, this optimization feeling is very nice.

If the same SQL statement is executed multiple times, you will find that the first time is the longest, and subsequent executions will take slightly less time than the first time because the same statement is read directly from the cache the second time.

0.027s is very short, but can it be optimized again? Take a closer look at the execution plan:

View the optimized SQL:

SELECT
    `example`.`s`.`s_id` AS `s_id`,
    `example`.`s`.`name` AS `name` 
FROM
    `example`.`Student` `s` semi
    JOIN ( `example`.`Result` `r` ) 
WHERE
    (
    ( `example`.`s`.`s_id` = `<subquery2>`.`s_id` ) 
    AND ( `example`.`r`.`score` = 100 ) 
    AND ( `example`.`r`.`c_id` = 1 ) 
    )Copy the code

How do I view optimized statements?

The method is as follows (executed in a command window) :

# to perform
EXPLAIN
select s.* from Student s where s.s_id in 
(select s_id from Result r where r.c_id = 1 and r.score = 100);
# in the implementation
show warnings;Copy the code

The results are as follows

A type = all

As previously thought, the order in which this SQL is executed is to execute subqueries

select s_id from Result r where r.c_id = 1 and r.score = 100Copy the code

Time: 1.402 s

The following results are obtained (partially)

And then execute

select s.* from Student s where s.s_id in(12871409 87467 29613 81395 5106 87140 47269 17288 97311 74388 96565 18107 74250 30977 8125 44247 21272 95603 61, 38479469 90669 88679 0359 95461 92475 78581 71632 20668 5673 72462 79646 93.Copy the code

Time: 0.222 s

MySQL > execute (MATERIALIZED subquery); MySQL > execute (MATERIALIZED subquery); MySQL > execute (MATERIALIZED subquery); MySQL > select * from student where ID = 1; MySQL > select * from student where ID = 1;

Materialized subqueries: The optimizer uses materialization to process subqueries more efficiently. Materialization speeds up query execution by treating subquery results as a temporary table, normally in memory. The first time mysql needs a subquery result, it materializes the result into a temporary table. Whenever the result set is needed later, mysql references the temporary table again. The optimizer may use a hash index to make the query faster and cheaper. Indexes are unique, eliminating duplication and making the table less data.

What about joining queries instead?

In order to re-analyze the connection query, delete the indexes result_C_ID_index and result_score_index temporarily.

DROP index result_c_id_index on Result;
DROP index result_score_index on Result;Copy the code

Join queries

select s.* from 
Student s 
INNER JOIN Result r 
on r.s_id = s.s_id 
where r.c_id = 1 and r.score = 100;Copy the code

Execution time: 1.293s

The query results

EXPLAIN + query SQL to check the execution plan:

Select * from s_id; select * from s_id

CREATE index result_s_id_index on Result(s_id);
show index from Result;Copy the code

A join query is being executed

Time: 1.17s (a little weird, it should be longer depending on the article you are reading)

Take a look at the execution plan:

The optimized query statement is as follows:

SELECT `example`.`s`.`s_id` AS `s_id`, `example`.`s`.`name` AS `name` FROM `example`.`Student` `s` JOIN `example`.`Result` `r` WHERE ( ( `example`.`s`.`s_id` =  `example`.`r`.`s_id` ) AND ( `example`.`r`.`score` = 100 ) AND ( `example`.`r`.`c_id` = 1 ) )Copy the code

SQL > select ‘where’ from ‘join’;

Back to the previous execution plan:

SQL > alter table SQL > alter table SQL > alter table SQL > alter table SQL > alter table SQL > alter table SQL

Normally, join is performed before WHERE filtering, but in our case, if join is performed first, 70W data will be sent to join. Therefore, the wise scheme of WHERE filtering is implemented first. Now, in order to exclude mysql query optimization, I write an optimized SQL.

Delete index first

DROP index result_s_id_index on Result;Copy the code

Execute your own written optimized SQL

SELECT
    s.* 
FROM
    (
        SELECT * FROM Result r WHERE r.c_id = 1 AND r.score = 100 
    ) t
INNER JOIN Student s ON t.s_id = s.s_idCopy the code

The time is 0.413 seconds

Shorter than the previous SQL time.

Viewing the Execution Plan

It is much more efficient to extract result first and then join the table. The problem now is that the scan table appears when extracting result, so it is clear that relevant indexes need to be established.

CREATE index result_c_id_index on Result(c_id);
CREATE index result_score_index on Result(score);Copy the code

Execute the query again

SELECT
    s.* 
FROM
    (
        SELECT * FROM Result r WHERE r.c_id = 1 AND r.score = 100 
    ) t
INNER JOIN Student s ON t.s_id = s.s_idCopy the code

The time is 0.044 seconds

That’s a good time, 10 times faster.

Execution Plan:

We’re going to see that the index is used to get the result and then the table.

SQL > execute SQL

EXPLAIN
select s.* from 
Student s 
INNER JOIN Result r 
on r.s_id = s.s_id 
where r.c_id = 1 and r.score = 100;Copy the code

Execution time: 0.050 seconds

Execution Plan:

SQL > select * from ‘where’; SQL > select * from ‘where’; SQL > select * from ‘where’;

Expand the test data, adjust the content of the result table to increase the data to 300W, student data is more scattered.

DROP PROCEDURE IF EXISTS insert_Result_TO300W;
DELIMITER $
CREATE PROCEDURE insert_Result_TO300W()
BEGIN
    DECLARE i INT DEFAULT 700001;
        DECLARE sNum INT DEFAULT 1;
        DECLARE cNum INT DEFAULT 1;
        WHILE i<=3000000 DO
        INSERT INTO Result(`r_id`,`s_id`,`c_id`,`score`) 
                VALUES(i,(RAND()*69999)+1 ,(RAND()*99)+1 , (RAND()*99)+1);
        SET i = i+1;
    END WHILE;
END $
CALL insert_Result_TO300W();Copy the code

Replaced the way of data generation, all using random number format.

A quick recap:

show index from Result;Copy the code

Execute SQL

select s.* from 
Student s
INNER JOIN Result r
on r.s_id = s.s_id
where r.c_id = 81 and r.score = 84;Copy the code

Execution time: 1.278s

Execution Plan:

Intersect union operation is used here, that is, the union of the results of the two indexes is obtained at the same time, and then the distinction between score and C_id is looked at. However, the distinction between score and C_id is not very high in a single field. If the Result table is retrieved, c_id = 81, the Result is 81. Score = 84 is 84.

The result of c_id = 81 and score = 84 is 19881, that is, the distinction between the two fields combined is relatively high, so the query efficiency of establishing the joint index will be higher. From another perspective, the data of the table is 300W, and there will be more in the future, which is not a small number in terms of index storage. As the amount of data increases, indexes cannot be loaded into the memory, but must be read from disk. In this way, the more indexes there are, the more the cost of reading disk is. Therefore, it is necessary to establish a multi-column joint index according to specific service conditions.

DROP index result_c_id_index on Result;
DROP index result_score_index on Result;
CREATE index result_c_id_score_index on Result(c_id,score);Copy the code

Points to the above query statement

Consumption time: 0.025s

That’s pretty fast and acceptable.

The optimization of this statement is over for now.

conclusion

MySQL nested subqueries are really inefficient
It can be optimized for join queries
When joining a table, you can filter the table with a WHERE condition and then join the table.
Set up appropriate indexes, and set up multi-column federated indexes if necessary
Learn to analyze SQL execution plans, mysql will optimize SQL, all analysis plans are important

Knowledge extension

The index optimization

I talked about optimization of subqueries and how to build indexes, and in cases where multiple fields are indexed, a single index is created for each field.

Later, it is found that in fact, it is more efficient to establish joint index, especially in the case of large data volume and low differentiation of single column.

Single index

The query statement is as follows:

select * from user_test_copy where sex = 2 and type = 2 and age = 10Copy the code

Index:

CREATE index user_test_index_sex on user_test_copy(sex);
CREATE index user_test_index_type on user_test_copy(type);
CREATE index user_test_index_age on user_test_copy(age);Copy the code

Sex, Type and age fields are indexed, and the data volume is 300W

Query time: 0.415s

Execution Plan:

Find type = index_merge

This is mysql’s optimization for multiple single-column indexes with the INTERSECT union operation on the result set

The column index more

We can create multiple column indexes on these three columns and copy the table for testing purposes.

create index user_test_index_sex_type_age on user_test(sex,type,age);Copy the code

Query statement:

select * from user_test where sex = 2 and type = 2 and age = 10Copy the code

Execution time: 0.032s

More than 10 times faster, and the higher the differentiation of multi-column indexes, the faster the improvement.

Execution Plan:

The most left prefix

Multi-column indexes also have the left-most prefix feature:

That is, the first field in the index, sex, should appear in the WHERE condition.

Execute the following statement:

select * from user_test where sex = 2
select * from user_test where sex = 2 and type = 2
select * from user_test where sex = 2 and age = 10Copy the code

Indexes cover

The columns of the query are indexed, so that when the result set is retrieved, the index data can be returned directly without having to fetch data of other columns from disk

Such as:

select sex,type,age from user_test where sex = 2 and type = 2 and age = 10Copy the code

Execution time: 0.003s

It’s much faster than getting all the fields

The sorting

select * from user_test where sex = 2 and type = 2 ORDER BY user_nameCopy the code

Time: 0.139 s

Indexing sorted fields improves sorting efficiency

select * from user_test where sex = 2 and type = 2 ORDER BY user_nameCopy the code

Finally, some SUMMARY of SQL tuning is attached for further study when there is time

Column types should be defined as numeric types and as short as possible, such as primary and foreign keys, type fields, and so on
Create a single-column index
Create multi-column federated indexes as needed
If there is a lot of data after a single column is filtered, the efficiency of the index will be low, that is, the differentiation of the columns will be low. If the index is built on multiple columns, the differentiation of the columns will be large and the efficiency will be significantly improved.
Build coverage indexes based on business scenarios
Query only the fields required by services. If these fields are covered by indexes, the query efficiency is greatly improved
Indexes are required on fields of multiple table joins
This greatly improves the efficiency of table joins
An index is required on the WHERE condition field
An index is required on the sort field
Indexes need to be created on grouped fields
Do not use operations on Where conditions to avoid index invalidation

, END,

The growth path of programmers

Though the road is long, the journey is sure to come

This article was originally posted on the wechat public account of the same name “The Growth of programmers”, reply to “1024” you know, give a thumbs up.

Wechat ID: cxyDCzzl

Highlights from the past

7 big platform tools for programmers to connect private work

The growth path of Java programmers

Why does TCP require a three-way handshake

50 Details of Java Performance Optimization (Collector’s Edition)

Design e-commerce platform coupon system

A conversation that tells you what an architect does?

Teach you a way to use IDE programming to improve efficiency SAO operation!

The classic ebook package for programmers

A very interesting SQL optimization experience

The index optimization

Single index

The column index more

The most left prefix

Indexes cover

The sorting

Related Posts

NodeJs version of I3S service based on Quick_SLPK_Server is released

InfluxDB general articles

After reading this article, Docker you will get started