SQL tuning summary

preface

As developers, we cannot avoid dealing with SQL. A query may run without problems when the business is young, but as traffic and complexity grow, the same SQL can gradually show its fatigue. This is where SQL tuning comes in.

So how do you tune? Different people take different approaches. The first thing most people think of is probably indexing.

Yes, indexing is a typical and cheap way to do it.

But how should the index be added? On which columns? Will it affect existing SQL? These all need to be considered.

At the same time, we should realize that an index is a double-edged sword: it speeds up queries but slows down inserts and deletes.

Of course, besides adding indexes, there are mature rules of thumb that help prevent problems when writing SQL in the first place. For example: follow the leftmost-prefix matching principle, select only the specific fields you need, avoid applying functions to indexed columns, avoid joining more than three tables, and so on.

But in reality, the problems faced by different business scenarios are not the same. The execution results of the same SQL may be completely different in the case of different data volume and distribution. When SQL is no longer able to sustain business growth, it needs to be tuned accordingly.

methodology

I have a methodology for the whole process of SQL tuning, which can be roughly divided into the following four stages:

1. White-box analysis

At this stage, we can guess the reason why SQL may be slow based on our own accumulated knowledge and experience.

2. Execution plan interpretation

Interpret the EXPLAIN output to understand and simulate how the MySQL server actually executes the SQL.

3. Determine the bottleneck point

After white-box analysis + execution plan reading, you can basically determine why SQL may be slow.

4. Suit the remedy to the case

When you find the bottlenecks, break them one by one.

Next, I will use two examples (single table query and join query) to share some of the experience of tuning.

The following tuning process is optimized mainly at the SQL level, not at the system level (such as splitting tables or switching storage media).

Practice 1

The main SQL is a single-table query, and its execution time can reach 2.3 seconds.

select
  /* many fields */
from
  t_voucher_header
where
  period_year = 2020
  and period_month = 10
  and user_je_source_name = 'xxx'
  and `status` in (10, 30, 50)
  and employee_number != '1000'
  and can_auto_push = 1
  and is_delete = 0
  and batch_id > xxx
limit
  200

SQL background

The table holds 7 million+ rows in total. The batch_id field is the primary key.

The existing indexes are:

Composite index 1: (user_je_source_name, voucher_category, record_type, company_code, status)

Composite index 2: (period_year, period_month, user_je_source_name, voucher_category, is_delete, company_code, status)

Composite index 3: (period_year, period_month, user_je_source_name, status)

White box analysis

For single-table queries, which are simpler, there are two main considerations.

1. Does the single-table query use an index?

2. If it does, is it using the right index?

Execution plan interpretation
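The plan is obtained by prefixing the query with EXPLAIN. The output sketched below is illustrative, consistent with the analysis that follows rather than a verbatim capture (the key and Extra values are the ones discussed next):

EXPLAIN select /* many fields */ from t_voucher_header
  where period_year = 2020 and period_month = 10
    and user_je_source_name = 'xxx' and `status` in (10, 30, 50)
    and employee_number != '1000' and can_auto_push = 1
    and is_delete = 0 and batch_id > xxx limit 200;

-- illustrative output (abridged):
-- key: composite index 1 | key_len: the width of user_je_source_name only
-- Extra: Using index condition; Using where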

From the execution plan above, we can read the following information:

1. The used key length is low; only the first field of the composite index is used to accelerate the lookup.

2. Using index condition is a feature introduced in MySQL 5.6 (index condition pushdown). It kicks in when an indexed column appears in the search conditions but cannot be used for the lookup itself (no leftmost-prefix match); because the index still contains the column, the storage engine can use it to filter. In this SQL, the query conditions are period_year, period_month, user_je_source_name, status, employee_number, can_auto_push, and is_delete, and the chosen index is (user_je_source_name, voucher_category, record_type, company_code, status). In other words, the index's user_je_source_name column is used to locate records; its status column, plus the primary key batch_id implicitly stored in every secondary index, is used to filter them; and only then are the surviving primary key IDs used to query the primary index.

3. Using where appearing together with Using index condition indicates that rows fetched by primary key ID are filtered against the remaining conditions back on the MySQL server. user_je_source_name and status are matched in the Using index condition phase; the remaining period_year, period_month, employee_number, can_auto_push, and is_delete are filtered by the server in the Using where phase.

Thus, we have a mental outline of the execution of this SQL: locate entries via the index's user_je_source_name column, filter inside the index by status and batch_id, fetch the surviving rows from the primary index by ID, and filter the remaining conditions on the server.

Bottleneck determination

The execution of a single-table query can be roughly divided into two steps: the secondary-index lookup and the back-to-table lookup.

Secondary-index lookup.

This has two aspects: on the one hand, using the secondary index to locate records; on the other, using the index itself to filter them (that is, index condition pushdown).

Accordingly, using an index correctly also has two aspects: first, the longer the used key length, the better; second, the more rows the index can filter out before the table lookup, the better.

Back-to-table lookup.

After all, a secondary index contains a limited set of columns. If the selected fields are not all contained in the index, then after the secondary-index lookup we must fetch the remaining columns from the primary index using the primary key IDs we obtained. This process is the table lookup (going back to the table). The primary key IDs produced by the lookup phase are not ordered, which means the table lookup is random I/O, destined to be an expensive operation. This is also why MySQL sometimes prefers a full table scan over an index.

Index condition pushdown was invented precisely to save this back-to-table cost.

Of course, if the selected fields are fully contained in the secondary index, then after the lookup phase no table lookup is needed at all; this is called a covering index. When a covering index applies, the Extra column of the execution plan shows Using index.
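As a quick illustration of index covering (a sketch; it assumes, for argument's sake, a query whose select list stays inside composite index 3):

select period_year, period_month, user_je_source_name, `status`
from t_voucher_header
where period_year = 2020 and period_month = 10;

-- every selected column lives in the secondary index, so no table lookup is needed
-- and the plan's Extra column would show: Using index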

In our SQL, there is a composite index (period_year, period_month, user_je_source_name, status) that perfectly matches the query conditions, yet MySQL does not use it. What does this tell us?

That period_year, period_month, user_je_source_name, and status do not have high selectivity. The truly selective columns are among employee_number, can_auto_push, and is_delete!

Using the control-variable method, we compared the row counts under employee_number, can_auto_push, and is_delete, and found that the counts for is_delete = 0 and is_delete = 1 differ enormously.
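A sketch of that comparison (the fixed conditions here are assumptions; the idea is to hold everything else constant and vary only is_delete):

select is_delete, count(*) as cnt
from t_voucher_header
where period_year = 2020
  and period_month = 10
  and user_je_source_name = 'xxx'
group by is_delete;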

As you can see, for the is_delete field there are 500,000+ rows with the value 1 and only 1000+ with the value 0.

And it is exactly is_delete = 0 that we want to query.

Combined with the execution plan interpretation from stage two, we can finally conclude that the bottleneck of this SQL is the back-to-table lookup.

The query ultimately wants only the 1000+ rows with is_delete = 0, which could be fetched in a few passes. But the index-find phase cannot filter on is_delete, so it yields 500,000+ primary key IDs that all go back to the table; is_delete is only filtered with Using where after the table lookup.

The whole process is slow precisely because of those massive table lookups.

Suit the remedy to the case

The above analysis has identified the bottleneck point.

That is, the table's existing indexes do not filter the data efficiently before the table lookup, so a huge amount of time is spent going back to the table. So we append an is_delete column to the existing composite index (period_year, period_month, user_je_source_name, status). This solves the current SQL's problem without affecting existing SQL.
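A sketch of the change (both index names are assumptions; MySQL cannot widen an index in place, so the new index is added first and the old one dropped afterwards):

alter table t_voucher_header
  add index idx_period_source_status_del
    (period_year, period_month, user_je_source_name, `status`, is_delete);

alter table t_voucher_header
  drop index idx_period_source_status;  -- the old 4-column index, name assumed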

After this adjustment, the query correctly hit the index (period_year, period_month, user_je_source_name, status, is_delete), and its time dropped from 2.3 seconds to milliseconds.

Comparison diagram before and after optimization:

summary

The business scenario of this SQL dictates that the rows we finally want have is_delete = 0. If instead we wanted is_delete = 1, appending is_delete to the existing index would not solve the problem, because it would not reduce the 500,000+ is_delete = 1 rows that must go back to the table.

Why does the business accumulate so much soft-deleted data (is_delete = 1)? Is that accidental or normal? If it is normal, is there a problem with the table design; should the table be split? And so on.

And that is exactly my point: SQL tuning = business scenario + SQL execution analysis. Both are indispensable, and discussing SQL tuning in isolation from either is inadvisable.

Practice 2

The protagonist SQL of the second practice is:

SELECT
  /* many fields from b */
FROM
  t_voucher_line b
  INNER JOIN (
    SELECT
      batch_id batchId
    FROM
      t_voucher_header
    WHERE
      combine_batch_id = 2007044062
      AND is_delete = 0
  ) a ON b.batch_id = a.batchId
  AND b.is_delete = 0
WHERE
  b.segment3 LIKE '1002%'
  AND b.id > 18496282
ORDER BY
  b.id ASC
LIMIT
  300

The execution time is 1s+.

This is a join query involving multiple tables. Before tuning, I want to briefly explain the basic principles of join.

The process of joining queries

Joining itself is a Cartesian product.

For a simple join query SQL:

SELECT * FROM t1, t2 WHERE t1.c1 > 1 AND t1.c1 = t2.c1 AND t2.c2 < 'd';

Its conditions can be divided into:

Single-table conditions: t1.c1 > 1 and t2.c2 < 'd'

Join condition: t1.c1 = t2.c1

The whole join process can be roughly divided into the following steps:

1. Determine the driving table. Assume table t1 is the driving table. Apply t1's single-table conditions to find the matching records.

2. For each record matched in Step 1, look up rows in the driven table t2. Suppose two records were found in t1; then two single-table queries are performed against t2. This is where the multi-table condition t1.c1 = t2.c1 comes into play.

For each row fetched from t1, the value of t1.c1 is already known, so the join condition t1.c1 = t2.c1 degenerates into a constant filter on t2.c1, that is, into an ordinary single-table condition on t2.

3. Merge the query results obtained in Step 2.

In a join query, the driving table is accessed only once, while the driven table is accessed multiple times; how many times depends on the number of rows left in the driving table's result set after its single-table conditions are applied.

The method above is the straightforward one: iterate over the results of Step 1 and query the driven table once per row (even though each such query can be accelerated by the driven table's index). It is called a nested-loop join.

Pseudocode similar to the following

for each row in t1 matching t1's single-table conditions {
    # one single-table query against t2 for each qualifying row of t1
    for each row in t2 matching t2's single-table conditions {
        if row satisfies join conditions, send to client
    }
}

If M records are read from the driving table, the driven table is accessed M times. Each access to the driven table reads multiple data pages from disk; if the driven table spans N data pages, accessing it costs up to M × N page reads. Each page read is an I/O operation, which is extremely time-consuming at large data volumes.

On top of the nested-loop join, MySQL made a horizontal optimization: the join buffer.

With a join buffer, multiple records from the driving table can be joined against the driven table in one pass, reducing the number of visits to the driven table. This approach is called the Block Nested-Loop Join. If the join buffer is large enough to hold every record from the driving table, the driven table needs to be accessed only once.
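If the join buffer itself is the constraint, it can be inspected and enlarged; a sketch using the standard MySQL variable join_buffer_size (the 4MB figure is only an example):

SHOW VARIABLES LIKE 'join_buffer_size';          -- 256KB by default
SET SESSION join_buffer_size = 4 * 1024 * 1024;  -- raise to 4MB for this session only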

Of course, in the vertical direction there are other join algorithms, such as the merge join and the hash join. Each has its own uses.

A nested-loop join suits a small outer (driving) result set with an efficient, indexed lookup on the inner (driven) table; it is light on CPU and memory but heavy on I/O. A merge join suits the case where both sides are ordered on the join key (much like merge sort); its CPU and memory demands are moderate and its I/O relatively low. A hash join suits large data volumes with no usable index; it is demanding on CPU and memory, while its I/O may be high or low.

Based on the above analysis, continue to optimize the SQL in accordance with the established methodology.

SQL background

The t_voucher_header table was introduced in the first practice.

The t_voucher_line table holds 17 million+ rows; id is its primary key.

Its existing index is on the batch_id column. t_voucher_header and t_voucher_line are related through the batch_id field: one t_voucher_header record corresponds to many t_voucher_line records.

White box analysis

1. The SQL contains a subquery, which costs a temporary table. As for rewriting between the two forms: a join query can always be rewritten as a subquery, but a subquery cannot always be rewritten as a join query. Can this SQL's subquery be replaced with a join?

2. The t_voucher_header subquery filters on combine_batch_id and is_delete and joins on batch_id (the primary key). Is there an index containing combine_batch_id and is_delete that would reduce table lookups?

3. All selected columns come from table b, so going back to table b is unavoidable. The t_voucher_line query involves is_delete, segment3, and id (the primary key), and the join uses batch_id. Is the join field batch_id indexed? Does the query use that index well? Can the index cover as many of the queried fields as possible?

4. Which table should drive, and which should be driven?

5. Can the ORDER BY exploit index order and thereby avoid a time-consuming sort over a large amount of data?

Execution plan interpretation

With this analysis in mind, take a look at the MySQL execution plan:

1. t_voucher_header is the driving table; t_voucher_line is the driven table.

2. For the t_voucher_header query, Using index indicates that the query conditions and selected fields are all covered by the index; no table lookup is needed.

3. For the t_voucher_line query, Using index condition + Using where indicates that the index performs only partial filtering; there may be many table lookups, which may be slow.

4. The t_voucher_header subquery scans 100,000+ rows and returns all of them. Using temporary and Using filesort indicate that the subquery result is stored in a temporary table and then sorted for the ORDER BY.

5. Using join buffer (Block Nested Loop) does not appear in the plan, indicating that lookups on the driven table are accelerated by an index.

Bottleneck determination

At this point, we can basically simulate MySQL's execution of this SQL, and the bottlenecks fall out roughly as follows:

1. The subquery is materialized into an extra temporary table, and creating that storage also takes time.

2. The back-to-table cost of accessing the driven table is not minimized.

3. Sorting large amounts of data is time-consuming.

Suit the remedy to the case

1. Equivalent SQL rewrite: turn the subquery into a join query

SELECT
    /* many fields from b */
FROM
    t_voucher_line b
    INNER JOIN t_voucher_header a
    ON b.batch_id = a.batch_id
WHERE
    a.combine_batch_id = 2007044062
    AND a.is_delete = 0
    AND b.is_delete = 0
    AND b.segment3 LIKE '1002%'
    AND b.id > 18496282
ORDER BY
    b.id ASC
    LIMIT 300

2. Besides the primary key id, the t_voucher_line query touches batch_id, segment3, and is_delete, while the existing index idx_batch_id covers only batch_id; segment3 and is_delete can only be checked after the table lookup. As in the first practice, we avoid that unnecessary back-to-table cost by widening the index: extend (batch_id) to (batch_id, segment3, is_delete), then look at the execution plan again.
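A sketch of that widening (idx_batch_id is named in the text above; the new index name is an assumption):

alter table t_voucher_line
  add index idx_batch_segment3_del (batch_id, segment3, is_delete);

alter table t_voucher_line
  drop index idx_batch_id;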

For t_voucher_line, the Using where is now gone, i.e., the table-lookup consumption is minimized, and the execution time drops to 0.6s.

3. Eliminate the sort

The result set of a join query naturally comes out ordered by the index fields of the driving table.

In our SQL, the driving table is queried through the combine_batch_id index, whose access type is const, and batch_id is the driving table's primary key; the two tables join on batch_id. Can we use the natural order of batch_id to avoid the sort?

Unfortunately, in this business the join fields of the driving and driven tables are one-to-many, so the driving table's index order cannot yield rows sorted by the driven table's id.

So it looks like you can’t avoid sorting, right?

Look at the pattern (b.id > 18496282) + (ORDER BY b.id ASC) + (LIMIT 300): this is pagination. Since we cannot exploit the natural order of the driving table's batch_id, we might as well change how we paginate: fetch the full set of IDs first, then select by ID. The first step of the final SQL becomes:

SELECT
    b.id
FROM
    t_voucher_line b
    INNER JOIN t_voucher_header a ON b.batch_id = a.batch_id
WHERE
    a.combine_batch_id = 2007044062
    AND a.is_delete = 0
    AND b.is_delete = 0
    AND b.segment3 LIKE '1002%'
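The second step is then driven from application code: walk the ID list in pages of 300 and fetch rows by primary key. A sketch (the ID values here are placeholders, not real data):

SELECT
    /* many fields */
FROM
    t_voucher_line
WHERE
    id IN (18496283, 18496291, 18496305)  -- the next page of IDs from step one
ORDER BY
    id ASC;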

The way the SQL is used changes accordingly. In the end, execution time fell from 1s+ to milliseconds.

Of course, one thing to watch with this approach is max_allowed_packet, which limits the size of a packet exchanged between server and client. Taking 32MB as an example, with a bigint primary key (8 bytes per ID), one packet can carry roughly 32 × 1024 × 1024 / 8 = 4,194,304, i.e., about 4.19 million IDs.

summary

This SQL's business scenario is a batch job, so the (fetch all IDs) + (select by ID) approach poses no real problem. But what if it backed a front-end paginated query that needs only one page of data; how would (fetch all IDs) + (select by ID) work then?

Again: SQL tuning = business scenario + SQL execution analysis. Both are indispensable, and discussing SQL tuning in isolation from either is inadvisable.

conclusion

Having worked through these two SQL optimizations, I have summed up some experience that I hope helps you:

1. In SQL optimization, minimize the back-to-table cost wherever possible.

2. If sorting is needed, try to sort by the driving table's index fields; their natural order avoids sorting a large result set in a temporary table, which is very expensive.

3. When we hit slow SQL, the execution plan is the starting point of optimization. It lets us clearly understand the concrete execution of the whole SQL, find the bottleneck, and then solve the problem.

4. The tuning of today solves the problem of today. As the business develops, data volume grows and data distribution becomes more uneven, so the same SQL may need tuning again.

5. When writing SQL, you can follow mature rules of thumb to head off some problems in advance. But systems keep moving forward; one SQL cannot serve every scenario forever. Tuning is an ongoing process.

6. Details matter. A single slow SQL may be the accumulation of many small costs: ten 0.1s steps make 1s.