Preface

SQL optimization is a hot topic that everyone pays close attention to. Whether you are in an interview or at work, you are likely to encounter it.

If one of your online interfaces has a performance problem and needs to be optimized, your first thought is probably to optimize the SQL statement, since that is usually much cheaper than changing the code.

So, how do you optimize SQL statements?

This article shares SQL optimization tips from 15 angles, which I hope will help you.


1 Avoid using select *

Most of the time, when we write SQL statements, we like to use select * directly to find all columns in a table at once.

Example:

select * from user where id=1;

In a real business scenario, we may only need one or two of those columns. Querying a lot of data and then not using it wastes database resources such as memory and CPU.

In addition, the extra data increases the time spent on network IO.

Most importantly, select * generally cannot use a covering index, which leads to a large number of back-to-table lookups and poor query performance.

So, how do you optimize?

A better way:

select name,age from user where id=1;

During a SQL query, fetch only the columns you need and leave the redundant ones out entirely.

2 Use union all instead of union

We all know that the union keyword in a SQL statement returns the combined result set with duplicates removed.

If you use the union all keyword, you can get all data, including duplicate data.

Example:

(select * from user where id=1) 
union 
(select * from user where id=2);

Deduplication requires traversing, sorting, and comparing rows, which costs more time and CPU.

So whenever you can use union all, avoid union.

A better way:

(select * from user where id=1) 
union all
(select * from user where id=2);

The exception is special scenarios: for example, when union all would leave duplicate rows in the result set and the business does not allow duplicates, use union instead.
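
To see the difference concretely, here is a runnable sketch; it uses SQLite's in-memory database as a stand-in for MySQL, since both treat union and union all the same way here.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("create table user (id integer, name text)")
conn.executemany("insert into user values (?, ?)", [(1, "Alice"), (2, "Bob")])

# union deduplicates the combined result set; union all keeps every row.
union_rows = conn.execute(
    "select * from user where id = 1 union select * from user where id = 1"
).fetchall()
union_all_rows = conn.execute(
    "select * from user where id = 1 union all select * from user where id = 1"
).fetchall()

print(len(union_rows))      # union: duplicates removed
print(len(union_all_rows))  # union all: duplicates kept
```

Because union must sort and compare rows to deduplicate, union all is the cheaper choice whenever duplicates are acceptable.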

3 Small table drives large table

The small table drives the large table, that is, the data set of the small table drives the data set of the large table.

Suppose you have two tables, order and user, where order has 10,000 entries and user has 100 entries.

Now suppose you want to look up the list of all orders placed by valid users.

This can be done using the in keyword:

select * from order
where user_id in (select id from user where status=1)

It can also be implemented using the exists keyword:

select * from order
where exists (select 1 from user where order.user_id = user.id and status=1)

For the business scenario above, using the in keyword is the more appropriate implementation.

Why is that?

Because if a SQL statement contains the in keyword, the subquery inside in executes first, and then the statement outside in. If the data set inside in is small, the query runs fast.

If the SQL statement contains the exists keyword, the statement to the left of exists (the main query) executes first, and each of its rows is then matched against the statement on the right. Rows that match are kept; rows that don't are filtered out.

In this requirement, the order table has 10,000 rows and the user table has 100 rows: order is the large table and user is the small table. With the order table on the left, the in keyword performs better.

To sum up:

  • in applies when the large table is on the left and the small table is on the right.
  • exists applies when the small table is on the left and the large table is on the right.

Whether in or EXISTS is used, the core idea is to use small tables to drive large tables.
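
As a runnable sketch of the two forms, the snippet below builds tiny order and user tables in SQLite (standing in for MySQL) and shows that in and exists return the same rows; the table and column names follow the example above.

```python
import sqlite3

# Tiny data set: SQLite stands in for MySQL; both support in and exists.
conn = sqlite3.connect(":memory:")
conn.execute("create table user (id integer primary key, status integer)")
conn.execute('create table "order" (id integer primary key, user_id integer)')
conn.executemany("insert into user values (?, ?)", [(1, 1), (2, 0), (3, 1)])
conn.executemany('insert into "order" values (?, ?)',
                 [(10, 1), (11, 2), (12, 3), (13, 3)])

# in: the subquery over the small user table is evaluated first.
in_rows = conn.execute(
    'select id from "order" where user_id in '
    "(select id from user where status = 1) order by id"
).fetchall()

# exists: the outer query drives; the subquery is checked per row.
exists_rows = conn.execute(
    'select id from "order" o where exists '
    "(select 1 from user u where o.user_id = u.id and u.status = 1) order by id"
).fetchall()

print(in_rows == exists_rows)  # same rows either way
```

The two queries are semantically equivalent; the performance difference comes from which side drives the evaluation, as described above.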

4 Batch Operations

What if you have a batch of data that needs to be inserted after business processing?

Example:

for(Order order: list){
   orderMapper.insert(order);
}

This inserts data row by row in a loop, producing SQL like:

insert into order(id,code,user_id) 
values(123,'001',100);

This operation requires multiple requests to the database to complete the batch of data inserts.

But as we all know, every remote request to the database from our code costs some performance. If the code has to hit the database many times to complete one business function, it consumes even more.

So how do you optimize?

A better way:

orderMapper.insertBatch(list);

This calls a batch-insert method, producing SQL like:

insert into order(id,code,user_id) 
values(123,'001',100),(124,'002',100),(125,'003',101);

In this way, you only need to remotely request the database once, and SQL performance will improve. The more data, the greater the improvement.

However, it is not recommended to batch too much data at once; if a batch is too large, the database responds slowly. Keep each batch to fewer than 500 rows, and if there are more than 500 rows, process them in multiple batches.
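
A minimal sketch of the batching idea, using a plain SQLite connection instead of the orderMapper shown above: rows are inserted in chunks of at most 500 rather than one request per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('create table "order" (id integer, code text, user_id integer)')

rows = [(i, f"{i:03d}", 100 + i % 3) for i in range(1, 1201)]  # 1200 rows

BATCH_SIZE = 500  # keep each batch under ~500 rows, as recommended above

# One batched call per chunk instead of one insert per row.
for start in range(0, len(rows), BATCH_SIZE):
    batch = rows[start:start + BATCH_SIZE]
    conn.executemany(
        'insert into "order" (id, code, user_id) values (?, ?, ?)', batch
    )

total = conn.execute('select count(*) from "order"').fetchone()[0]
print(total)
```

Note that a real insertBatch in MyBatis builds one multi-row values statement per batch; executemany here just illustrates the chunking pattern.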

5 Make good use of limit

Sometimes, we need to query the first item in some data, for example, to query the first order that a user placed to see the time of his first order.

Example:

select id, create_date 
 from order 
where user_id=123 
order by create_date asc;

This queries the user's orders by user ID, sorts them by order time, fetches all of the user's orders into a collection, and then, in code, takes the first element to get the first order and its time.

List<Order> list = orderMapper.getOrderList();
Order order = list.get(0);

This approach works, but it is very inefficient: it queries all the rows first, which wastes resources.

So, how do you optimize?

A better way:

select id, create_date 
 from order 
where user_id=123 
order by create_date asc 
limit 1;

Use limit 1 to return only the item with the smallest order time.

In addition, when deleting or modifying data, you can also add limit at the end of the SQL statement to prevent misoperations that may cause irrelevant data to be deleted or modified.

Such as:

update order set status=0,edit_time=now(3) 
where id>=100 and id<200 limit 100;

This way, even if you do something wrong, like get the ID wrong, you won’t affect too much data.
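
Here is the limit 1 version as a runnable sketch (SQLite in place of MySQL, with a text create_date column for simplicity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('create table "order" (id integer, user_id integer, create_date text)')
conn.executemany('insert into "order" values (?, ?, ?)', [
    (1, 123, "2023-03-01"),
    (2, 123, "2023-01-15"),  # earliest order for user 123
    (3, 123, "2023-02-10"),
])

# limit 1 lets the database stop after the first (earliest) row
# instead of returning the whole sorted result set.
first_order = conn.execute(
    'select id, create_date from "order" '
    "where user_id = 123 order by create_date asc limit 1"
).fetchone()

print(first_order)
```

The application code then receives exactly one row instead of a list it only reads the head of.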

6 Don't put too many values in in

For bulk query interfaces, we usually filter out data using the IN keyword. For example, you want to batch query user information using specified ids.

The SQL statement is as follows:

select id,name from category
where id in (1,2,3,...,100000000);

If we do not make any restrictions, the query statement may query too much data at once, which can easily cause the interface to time out.

What should I do?

select id,name from category
where id in (1,2,3,...,100)
limit 500;

You can limit data in SQL.

But more importantly, we want to add restrictions to the business code, pseudo code is as follows:

public List<Category> getCategory(List<Long> ids) {
   if (CollectionUtils.isEmpty(ids)) {
      return null;
   }
   if (ids.size() > 500) {
      throw new BusinessException("A maximum of 500 records can be queried at a time");
   }
   return mapper.getCategoryList(ids);
}

Alternatively, if ids has more than 500 entries, you can query the data in batches with multiple threads, 500 records per batch, and aggregate the results before returning.

However, this is only a stopgap and does not suit scenarios where ids is extremely large. With too many ids, even if the rows can be found quickly, the large response body makes network transmission very expensive, so interface performance is still poor.
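
A sketch of the batching fallback in plain Python; query_by_ids is a hypothetical stand-in for the real data-access call:

```python
# Batch an oversized id list: query at most 500 ids per call and merge
# the results. query_by_ids is a hypothetical placeholder for a real
# "select ... where id in (...)" data-access method.
MAX_IDS_PER_QUERY = 500

def query_by_ids(ids):
    # Placeholder: pretend each id resolves to one row.
    return [{"id": i} for i in ids]

def get_category(ids):
    if not ids:
        return []
    results = []
    for start in range(0, len(ids), MAX_IDS_PER_QUERY):
        batch = ids[start:start + MAX_IDS_PER_QUERY]
        results.extend(query_by_ids(batch))  # at most 500 ids per query
    return results

categories = get_category(list(range(1, 1301)))  # 1300 ids -> 3 batches
print(len(categories))
```

Each database call stays within the 500-id limit, and the caller still sees one aggregated result list.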

7 Incremental Query

Sometimes we need to query data through a remote interface and then synchronize it to another database.

Example:

select * from user;

Simply pulling all the data and synchronizing it is very convenient, but it brings a big problem: if there is a lot of data, query performance will be very poor.

What should I do?

A better way:

select * from user 
where id>#{lastId} and create_time >= #{lastCreateTime} 
limit 100;

Sort ascending by ID and time, and synchronize only one batch of 100 records at a time. After each batch, save the maximum ID and time from those 100 records for use when pulling the next batch.

This incremental query improves the efficiency of a single query.
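
The loop below sketches the consumer side, tracking only the last id for brevity (the version above also tracks create_time); SQLite stands in for the source database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table user (id integer primary key, name text)")
conn.executemany("insert into user values (?, ?)",
                 [(i, f"user{i}") for i in range(1, 251)])

BATCH = 100

def fetch_batch(last_id, batch=BATCH):
    # Incremental pull: only rows after the last synced id.
    return conn.execute(
        "select id, name from user where id > ? order by id limit ?",
        (last_id, batch),
    ).fetchall()

synced = []
last_id = 0
while True:
    rows = fetch_batch(last_id)
    if not rows:
        break
    synced.extend(rows)
    last_id = rows[-1][0]  # remember the max id for the next batch

print(len(synced))
```

Each individual query stays cheap, and the saved last_id makes the sync restartable.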

8 Efficient paging

Sometimes, when the list page is querying data, in order to avoid the impact of too much data returned at one time on the interface performance, we generally do paging processing on the query interface.

In MySQL, paging is generally implemented with the limit keyword:

select id,name,age 
from user limit 10,20;

If the amount of data in the table is small, use the limit keyword to do pagination, no problem. However, if there is a lot of data in the table, it can cause performance problems.

For example, now the paging parameter is:

select id,name,age 
from user limit 1000000,20;

MySQL fetches 1,000,020 rows, discards the first 1,000,000, and returns only the last 20, which wastes resources.

So how do you paginate this huge amount of data?

Optimized SQL:

select id,name,age 
from user where id > 1000000 limit 20;

First find the maximum ID of the last page, then use the index on the ID to query. However, this scheme requires that ids be continuous and ordered.

You can also use BETWEEN to optimize paging.

select id,name,age 
from user where id between 1000000 and 1000020;

It is important to note that between should paginate on a unique index; otherwise page sizes will be inconsistent.
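
A runnable comparison of offset paging and id-based (keyset) paging, using SQLite with 2,000 rows of continuous ids:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table user (id integer primary key, name text, age integer)")
conn.executemany("insert into user values (?, ?, ?)",
                 [(i, f"u{i}", 20 + i % 50) for i in range(1, 2001)])

# Offset paging: the database still walks past all the skipped rows.
offset_page = conn.execute(
    "select id from user order by id limit 20 offset 1000"
).fetchall()

# Keyset paging: jump straight past the last seen id via the primary key.
keyset_page = conn.execute(
    "select id from user where id > 1000 order by id limit 20"
).fetchall()

print(offset_page == keyset_page)
```

With continuous, ordered ids the two pages are identical, but the keyset form lets the index skip directly to the start of the page instead of scanning and discarding rows.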

9 Replace subqueries with join queries

If you need to query data from more than two tables in mysql, there are generally two implementation methods: subquery and join query.

An example of a subquery is as follows:

select * from order
where user_id in (select id from user where status=1)

A subquery can be implemented with the in keyword: the condition of one query is taken from the results of another select statement. The innermost nested statement runs first, then the outer statement.

The advantage of a subquery statement is that it is simple and structured, if the number of tables involved is small.

The disadvantage is that temporary tables need to be created when mysql executes sub-queries. After the query is completed, these temporary tables need to be deleted, resulting in some additional performance consumption.

You can change this to join queries. Examples are as follows:

select o.* from order o
inner join user u on o.user_id = u.id
where u.status=1
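
To check that the join rewrite is equivalent, this sketch runs both forms against toy tables in SQLite (standing in for MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table user (id integer primary key, status integer)")
conn.execute('create table "order" (id integer primary key, user_id integer)')
conn.executemany("insert into user values (?, ?)", [(1, 1), (2, 0)])
conn.executemany('insert into "order" values (?, ?)', [(10, 1), (11, 2), (12, 1)])

# Subquery form: in with a nested select.
sub = conn.execute(
    'select id from "order" where user_id in '
    "(select id from user where status = 1) order by id"
).fetchall()

# Join form: same rows, no temporary table for a subquery result.
join = conn.execute(
    'select o.id from "order" o inner join user u on o.user_id = u.id '
    "where u.status = 1 order by o.id"
).fetchall()

print(sub == join)
```

Both return the orders of valid users, so the rewrite is safe as long as the join does not duplicate rows (here user.id is unique).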

10 Do not add too many join tables

According to the Alibaba developer manual, the number of join tables should not exceed 3.

Example:

select a.name,b.name,c.name,d.name
from a 
inner join b on a.id = b.a_id
inner join c on c.b_id = b.id
inner join d on d.c_id = c.id
inner join e on e.d_id = d.id
inner join f on f.e_id = e.id
inner join g on g.f_id = f.id

If there are too many joins, index selection becomes very complicated and MySQL is more likely to pick the wrong index.

In addition, if the join does not hit an index, MySQL falls back to a nested-loop join, reading rows from the two tables and comparing them pair by pair, with O(n²) complexity.

So we should try to limit the number of join tables.

A better way:

select a.name,b.name,c.name,a.d_name 
from a 
inner join b on a.id = b.a_id
inner join c on c.b_id = b.id

If a business scenario needs data from additional tables, you can create redundant fields in tables a, b, and c instead. For example, add a d_name field to table a to hold the data you would otherwise join for.

However, I have also seen some ERP systems before. The amount of concurrency is not large, but the business is complicated, and more than ten tables need to be joined to query the data.

Therefore, the number of join tables should be determined according to the actual situation of the system.

11 Pay attention to join

We usually use the join keyword when we are involved in the joint query of multiple tables.

Left Join and inner join are most commonly used.

  • left join: Find the intersection of the two tables plus the remaining data of the left table.
  • inner join: Find the intersection of two tables.

The following is an example of an inner join:

select o.id,o.code,u.name 
from order o 
inner join user u on o.user_id = u.id
where u.status=1;

If two tables are associated with an inner join, mysql will automatically select the smaller table from the two tables to drive the larger table, so there will be no major performance problems.

The following is an example of a left join:

select o.id,o.code,u.name 
from order o 
left join user u on o.user_id = u.id
where u.status=1;

If two tables are joined with left join, MySQL by default uses the left table to drive the right table, so performance problems can occur if the left table holds a lot of data.

Note that when using a left join, put the small table on the left and the large table on the right; and where an inner join would do, use left join as little as possible.

12 Control the number of indexes

It is well known that indexes can significantly improve query SQL performance, but more indexes is not always better.

When data is written to a table, its indexes must be maintained at the same time, which requires extra storage space and costs some performance.

According to alibaba’s developer manual, the number of indexes in a single table should be limited to 5, and the number of fields in a single index should not exceed 5.

Mysql uses the B+ tree structure to store indexes, which need to be updated during insert, UPDATE, and DELETE operations. Too many indexes can consume a lot of extra performance.

So, what if there are more than five indexes in the table?

Look at this dialectically: if your system's concurrency is low and the table doesn't hold much data, more than 5 indexes is actually fine, as long as you don't go too far over.

However, for highly concurrent systems, be sure to adhere to the limit of no more than five indexes in a single table.

So how do high-concurrency systems optimize the number of indexes?

Prefer composite indexes over single-column indexes where possible, and delete single-column indexes that are no longer useful.

Migrate some query functionality to other types of storage, such as Elasticsearch or HBase, so the business tables only need a few key indexes.

13 Select a proper field type

char is a fixed-length string type. A char column always occupies its full declared length, so short values waste storage space.

alter table order 
add column code char(20) NOT NULL;

Varchar indicates a variable length string. The storage space of this type of field is adjusted according to the actual data length without wasting storage space.

alter table order 
add column code varchar(20) NOT NULL;

If a field has a fixed length, such as a mobile phone number, which is usually 11 characters, it can be defined as char(11).

However, if the enterprise name field is defined as a char, there is a problem.

If the length is defined too long, say 200 bytes, while the actual enterprise name is only 50 bytes, 150 bytes of storage are wasted.

If the length is defined too short, such as 50 bytes, but the actual enterprise name has 100 bytes, it will not be stored and an exception will be thrown.

Therefore, it is recommended to use varchar for the enterprise name. A variable-length field takes only the space it needs, saving storage, and searching over a smaller field is obviously more efficient.

When selecting a field type, we should follow the following principles:

  1. Prefer numeric types over strings, because characters tend to be slower to process than numbers.
  2. Use small types whenever possible, such as bit for booleans and tinyint for enumerations.
  3. Use char for fixed-length string fields.
  4. Use varchar for variable-length string fields.
  5. Use decimal for amount fields to avoid loss of precision.

There are many more principles, which are not listed here.

14 improve the efficiency of Group by

We use the group by keyword in many business scenarios. Its main functions are de-duplication and grouping.

It is usually used in conjunction with HAVING to filter data based on certain criteria after grouping.

Example:

select user_id,user_name from order
group by user_id
having user_id <= 200;

This does not perform well: it groups all orders by user ID first, and only then keeps the groups whose user_id is less than or equal to 200.

Grouping is a relatively time-consuming operation, so why don’t we narrow down the scope of the data first and then group it?

A better way:

select user_id,user_name from order
where user_id <= 200
group by user_id

Using the WHERE condition filters out excess data before grouping, making grouping more efficient.

In fact, this is a way of thinking, not limited to group by optimization. To improve the overall performance of our SQL statements, try to narrow the data range as much as possible before performing time-consuming operations.
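
A quick sketch confirming that filtering with where before grouping returns the same groups as filtering with having afterward (SQLite with synthetic order rows):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('create table "order" (id integer, user_id integer)')
conn.executemany('insert into "order" values (?, ?)',
                 [(i, i % 300) for i in range(1, 901)])

# Filtering with having: group everything first, then discard groups.
having_rows = conn.execute(
    'select user_id from "order" group by user_id '
    "having user_id <= 200 order by user_id"
).fetchall()

# Filtering with where: shrink the data set before grouping.
where_rows = conn.execute(
    'select user_id from "order" where user_id <= 200 '
    "group by user_id order by user_id"
).fetchall()

print(having_rows == where_rows)
```

When the filter condition only involves the grouping column (as here), the two forms are equivalent, and the where version lets the database group far fewer rows.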

15 Index Optimization

A very important part of SQL optimization is index optimization.

Most of the time, the difference in execution efficiency between a SQL statement that uses an index and one that doesn't is dramatic. So index optimization is the first choice in SQL optimization.

The first step in index optimization is to check whether the SQL statement is indexed.

So, how to check the SQL index?

You can use the explain command to view the mysql execution plan.

Such as:

explain select * from `order` where code='002';

Result: the columns of the execution plan show whether and how indexes are used. If you want more detail on using explain, check out my other article, "Explain | the best sword for index optimization, can you really use it?"

Frankly, if a SQL statement does not use an index, and it is not because the index is missing, the most likely cause is index failure.
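
explain itself is MySQL-specific, but SQLite's explain query plan plays a similar role, so a runnable sketch can at least show what "the query uses an index" looks like; idx_code is an illustrative index name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('create table "order" (id integer primary key, code text)')
conn.execute('create index idx_code on "order" (code)')

# SQLite's analogue of MySQL's explain: explain query plan.
plan = conn.execute(
    "explain query plan select * from \"order\" where code = '002'"
).fetchall()
detail = plan[0][-1]  # e.g. "SEARCH ... USING INDEX idx_code (code=?)"
print(detail)

uses_index = "idx_code" in detail
```

In MySQL you would instead read the key, type, and rows columns of the explain output to draw the same conclusion.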


There are several common reasons for index failure (summarized in a figure in the original article). If none of them apply, you need to look for other causes.

In addition, have you ever run into this: the same SQL with different input parameters sometimes uses index A and sometimes uses index B?

Yes, sometimes mysql picks the wrong index.

If necessary, you can use force index to force the query to use a specific index.

For example (the index name here is illustrative):

select * from `order` force index(idx_code) where code='002';