Preface
In scenarios that require fast responses, such as web services, SQL performance directly determines whether the system is usable. In small and medium-sized applications especially, SQL performance is often the deciding factor in whether the service responds quickly.
When tuning query performance seriously, it is important to understand the characteristics of the database you are using. Queries can be slow not only because of the SQL statements themselves, but also because of poor memory allocation, an unsuitable file structure, dirty page flushing, and other causes.
Therefore, the SQL tuning methods described in this article do not solve every performance problem, but they do solve many of the problems caused by poorly written SQL.
Here are some general optimization tips that make SQL run faster and use less memory. They do not depend on a specific database implementation and only require adjusting the SQL statements.
Environment preparation
What follows works at the SQL level and is not specific to one database, so it applies to almost any relational database.
However, with so many relational databases available, demonstrating every example on each of them is not realistic, so the examples use the widely used MySQL.
MySQL: 5.7.30-log, storage engine: InnoDB
Prepare two tables: tbl_customer and tbl_recharge_record
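The original table definitions are not reproduced here, so the following is only a minimal sketch of what the two tables might look like. The columns id, customer_id, and recharge_amount and the index idx_c_id are the ones the examples in this article refer to; the name column and all data types are assumptions made for illustration.

```sql
-- Customer table; the name column and the data types are assumed.
CREATE TABLE tbl_customer (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(50)  NOT NULL,
  PRIMARY KEY (id)
) ENGINE = InnoDB;

-- Recharge records; each row points at a customer through customer_id.
CREATE TABLE tbl_recharge_record (
  id              INT UNSIGNED  NOT NULL AUTO_INCREMENT,
  customer_id     INT UNSIGNED  NOT NULL,
  recharge_amount DECIMAL(15,2) NOT NULL,
  PRIMARY KEY (id),
  KEY idx_c_id (customer_id)   -- index used by the later examples
) ENGINE = InnoDB;
```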
Use efficient queries
A given query can often be written in several ways in SQL, for example by converting between IN, EXISTS, and joins.
In theory, different SQL statements that return the same result should perform the same. Unfortunately, the execution plan generated by the query optimizer is heavily influenced by how the statement is written.
So to optimize query performance, you need to know how to write SQL statements that let the optimizer generate an efficient execution plan.
Use EXISTS instead of IN
Everyone is familiar with IN: it is easy to use and easy to understand. However, it can become a performance bottleneck.
If IN takes a list of values such as 1, 2, 3, you generally don’t need to worry, but if IN takes a subquery, you do.
Most of the time, [NOT] IN and [NOT] EXISTS return the same result, but EXISTS is usually faster when both are used with subqueries.
Suppose we want to query the customers that have recharge records. How should the SQL be written?
IN: SELECT * FROM tbl_customer WHERE id IN (SELECT customer_id FROM tbl_recharge_record);
IN is simple to use and very easy to understand. Let’s take a look at its execution plan.
Let’s look at the EXISTS execution plan:
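The EXISTS statement itself is not shown above, so here is a sketch of how the two forms might be compared. The EXISTS query is an assumed rewrite of the IN query, not necessarily the author's exact statement; the aliases tc and trr follow the ones used later in the article. Prefixing a statement with EXPLAIN makes MySQL print its execution plan.

```sql
-- IN with a subquery
EXPLAIN
SELECT *
FROM tbl_customer
WHERE id IN (SELECT customer_id FROM tbl_recharge_record);

-- Assumed EXISTS rewrite: a correlated subquery that only checks
-- whether at least one matching recharge record exists.
EXPLAIN
SELECT *
FROM tbl_customer tc
WHERE EXISTS (SELECT 1
              FROM tbl_recharge_record trr
              WHERE trr.customer_id = tc.id);
```

The plans depend on the MySQL version and on the data, so it is worth running both on your own tables rather than relying on a fixed expectation.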
As you can see, the execution plan for IN creates a new temporary table, which makes the query slow.
Generally speaking, EXISTS is faster than IN for two reasons:
1. If there is an index on the join column customer_id, tbl_recharge_record can be looked up through the index instead of being scanned in full.
2. EXISTS stops searching as soon as one row that meets the condition is found; it does not need to scan the entire table as IN does.
When IN takes a subquery, the database first executes the subquery, stores the result in a temporary table (an inline view), and then scans that entire view, which in many cases is very expensive.
If EXISTS is used, the database does not generate a temporary table.
However, IN is better than EXISTS in terms of readability: code that uses IN is more obvious and easier to understand.
Therefore, there is no need to switch to EXISTS if you are sure that IN returns results quickly enough.
Many databases also try to improve the performance of IN.
In Oracle, if IN is used on an indexed column, the index is scanned first.
PostgreSQL has also improved the speed of queries that pass a subquery to IN since version 7.4.
Perhaps one day IN will perform as well as EXISTS on every relational database.
For more details on EXISTS, see: Predicates of Amazing SQL → Unintelligible EXISTS
Use joins instead of IN
In fact, in daily work, joins are used more often than EXISTS to replace IN and improve query performance. This is not because joins are better, but because EXISTS is harder to master.
Back to the question: how would we write the SQL to query customers that have recharge records using a join?
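The join statement is not shown at this point, so the following is a sketch of one way to write it, not necessarily the author's exact statement. It assumes an INNER JOIN so that recharge records without a matching customer produce no row, and it uses DISTINCT because a customer with several recharge records would otherwise appear once per record.

```sql
-- Customers that have at least one recharge record, written as a join.
SELECT DISTINCT tc.*
FROM tbl_recharge_record trr
INNER JOIN tbl_customer tc
        ON trr.customer_id = tc.id;
```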
This makes full use of indexes. Also, because there is no subquery, the database does not generate an intermediate table, so the query is efficient.
It is hard to say whether a join is better than EXISTS. EXISTS may be superior when there is no index, but with an index the two perform almost the same.
Avoid sorting
ORDER BY returns the results sorted by the specified columns.
However, besides the explicit sort requested by ORDER BY, many operations inside the database sort implicitly. Typical examples, covered in the sections below, are the set operators, DISTINCT, the MAX and MIN functions, and GROUP BY.
If the sort fits in memory, that’s fine. But if memory runs out and the sort spills to the hard disk, performance degrades dramatically.
Therefore, avoiding (or reducing) unnecessary sorting can greatly improve query efficiency.
Use the ALL option of the set operators flexibly
SQL has three set operators, UNION, INTERSECT, and EXCEPT, which compute the union, intersection, and difference of sets, respectively.
By default, these operators sort in order to exclude duplicate rows.
Using temporary in the execution plan indicates that sorting or grouping is performed. Since this SQL has no grouping, it must be a sort.
If we don’t care whether the result contains duplicates, or we know in advance that it won’t, we can use UNION ALL instead of UNION.
As you can see, there is no sorting operation in the execution plan.
The same is true for INTERSECT and EXCEPT: with the ALL option, no sort is performed.
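The UNION statements discussed above are not shown, so here is a minimal sketch of the comparison. The choice of columns is an assumption made for illustration; any two queries with compatible column lists would do.

```sql
-- UNION removes duplicates, which requires extra work (look for
-- "Using temporary" in the Extra column of the plan).
EXPLAIN
SELECT id FROM tbl_customer
UNION
SELECT customer_id FROM tbl_recharge_record;

-- UNION ALL keeps duplicates and simply appends the second result.
EXPLAIN
SELECT id FROM tbl_customer
UNION ALL
SELECT customer_id FROM tbl_recharge_record;
```

Note that the two statements are only interchangeable when duplicates are acceptable or cannot occur; here UNION ALL returns a customer ID once for the customer row and again for every recharge record.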
Adding the ALL option is a very effective optimization, but support for it varies from database to database.
Note: Oracle uses MINUS instead of EXCEPT. MySQL 5.7, the version used here, does not implement INTERSECT or EXCEPT at all (they were added in MySQL 8.0.31).
Use EXISTS instead of DISTINCT
DISTINCT also sorts in order to exclude duplicate rows.
Recall the join that replaced IN. Without DISTINCT, the SQL would be:
SQL: SELECT tc.* FROM tbl_recharge_record trr LEFT JOIN tbl_customer tc ON trr.customer_id = tc.id;
The result contains many duplicate rows, so let’s improve the SQL:
SELECT DISTINCT tc.* FROM tbl_recharge_record trr LEFT JOIN tbl_customer tc ON trr.customer_id = tc.id;
The execution plan contains Using temporary, which indicates that a sort is performed.
We can use EXISTS to optimize this.
As you can see, the sort has been avoided.
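The EXISTS statement used for this optimization is not shown, so the following is an assumed version of it. Because each customer row is tested once against the recharge records, no duplicates are produced and DISTINCT is unnecessary.

```sql
-- Each customer is returned at most once, so no DISTINCT is needed.
SELECT tc.*
FROM tbl_customer tc
WHERE EXISTS (SELECT 1
              FROM tbl_recharge_record trr
              WHERE trr.customer_id = tc.id);
```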
Use indexes in extremum functions
SQL has two extremum functions, MAX and MIN, and both of them sort.
For example: SELECT MAX(recharge_amount) FROM tbl_recharge_record;
This performs a full table scan and an implicit sort to find the largest single recharge amount.
However, if the argument column has an index, only the index needs to be scanned, not the whole table.
For example: SELECT MAX(customer_id) FROM tbl_recharge_record;
This scans the index idx_c_id to find the largest customer ID in the recharge records.
This approach does not eliminate the sort, but it speeds up the lookup that precedes it, reducing the impact of sorting on overall performance.
Conditions that can be in the WHERE clause should not be in the HAVING clause
Let’s look at two SQL statements and their execution results
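The two statements are not shown here, so the pair below is an assumed illustration of the pattern: the same condition placed first in HAVING and then in WHERE. The filter value 1 and the alias recharge_count are hypothetical.

```sql
-- 1. Condition in HAVING: every group is aggregated first,
--    then all groups except customer 1 are discarded.
SELECT customer_id, COUNT(*) AS recharge_count
FROM tbl_recharge_record
GROUP BY customer_id
HAVING customer_id = 1;

-- 2. Condition in WHERE: rows are filtered before aggregation,
--    so only the rows of customer 1 are grouped.
SELECT customer_id, COUNT(*) AS recharge_count
FROM tbl_recharge_record
WHERE customer_id = 1
GROUP BY customer_id;
```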
The two statements return the same result, but the second one is more efficient, for two reasons:
Reduce the amount of data sorted
GROUP BY sorts when it aggregates, so filtering out rows with the WHERE clause beforehand reduces the sorting workload.
Efficient use of indexes
Conditions in the WHERE clause can use indexes.
The HAVING clause filters the aggregated view, and in many cases the aggregated view does not inherit the index structure of the original table.
See more about HAVING: The magic of SQL HAVING → Easy to despise protagonist
Use indexes in the GROUP BY and ORDER BY clauses
In general, both GROUP BY and ORDER BY sort.
If the columns used in GROUP BY and ORDER BY are indexed, query efficiency improves.
In particular, some databases skip the sort entirely if the column has a unique index.
Use indexes
Using indexes is the most common SQL optimization technique, and everyone knows it. What we may not know is why a query is still slow even though an index exists, that is, why the index is not used.
For the cases in which an index is not used, see: the magic of SQL pass by → really used index. This article will not go into detail.
In short: write queries so that they can use an index, and avoid patterns that prevent the index from being used.
Reduce temporary tables
In SQL, the result of a subquery is treated as a new table (a temporary table) that can be used in SQL just like an ordinary table.
However, heavy use of temporary tables creates two problems:
1. A temporary table is effectively a copy of the original table’s data, so it consumes memory.
2. In many cases (especially after aggregation), a temporary table does not inherit the index structure of the original table.
Therefore, minimizing the use of temporary tables is also an important way to improve performance.
Be flexible with the HAVING clause
Using the HAVING clause is the basic rule for filtering aggregated results.
But if we are not familiar with HAVING, we often reach for an alternative, such as aggregating in a subquery and then filtering its result with a WHERE clause in the outer query.
However, there is no need to generate an intermediate table just to filter aggregated results; the HAVING clause can express the filter directly.
The HAVING clause is evaluated together with the aggregation, so it is more efficient, and the code is cleaner, than applying a WHERE clause after generating a temporary table.
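The two forms contrasted above can be sketched as follows. The statements are assumed illustrations rather than the author's originals; the threshold 100 and the aliases tmp and max_amount are hypothetical.

```sql
-- Alternative without HAVING: aggregate in a derived table,
-- then filter the intermediate result with WHERE.
SELECT *
FROM (SELECT customer_id, MAX(recharge_amount) AS max_amount
      FROM tbl_recharge_record
      GROUP BY customer_id) tmp
WHERE max_amount >= 100;

-- With HAVING: the filter is applied as part of the aggregation,
-- so no derived table is needed.
SELECT customer_id, MAX(recharge_amount) AS max_amount
FROM tbl_recharge_record
GROUP BY customer_id
HAVING MAX(recharge_amount) >= 100;
```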
When using the IN predicate on multiple fields, combine them in one place
Row value comparison was added in SQL-92, so the arguments of the =, <, >, and IN predicates can be lists of values rather than only scalar values.
Consider a query in which several fields each use their own IN predicate.
Such a query uses two subqueries. We can optimize it by concatenating the columns so that the logic is written in one place.
This way, the subquery no longer needs to be correlated and is executed only once.
You can simplify further by writing the combination of fields directly in IN, as a row value.
With this simplification, you don’t have to worry about type conversion when concatenating fields, and because the fields are not processed, indexes can be used.
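The three statements described above are not shown, so the following sketch illustrates the progression. The tables tbl_address_a and tbl_address_b and their columns are hypothetical and exist only for this example. CONCAT is used for concatenation because MySQL treats || as logical OR by default.

```sql
-- Hypothetical tables, for illustration only.
CREATE TABLE tbl_address_a (
  id    INT         NOT NULL PRIMARY KEY,
  state VARCHAR(20) NOT NULL,
  city  VARCHAR(20) NOT NULL
);
CREATE TABLE tbl_address_b (
  id    INT         NOT NULL PRIMARY KEY,
  state VARCHAR(20) NOT NULL,
  city  VARCHAR(20) NOT NULL
);

-- 1. One correlated subquery per field.
SELECT id, state, city
FROM tbl_address_a a
WHERE state IN (SELECT state FROM tbl_address_b b WHERE a.id = b.id)
  AND city  IN (SELECT city  FROM tbl_address_b b WHERE a.id = b.id);

-- 2. Concatenate the fields: a single uncorrelated subquery,
--    but the columns are processed, so indexes cannot be used.
SELECT id, state, city
FROM tbl_address_a
WHERE CONCAT(id, state, city)
      IN (SELECT CONCAT(id, state, city) FROM tbl_address_b);

-- 3. Row value comparison: no concatenation, no type conversion.
SELECT id, state, city
FROM tbl_address_a
WHERE (id, state, city)
      IN (SELECT id, state, city FROM tbl_address_b);
```

Concatenating without a separator can also make different rows look identical (for example 'ab' + 'c' and 'a' + 'bc'), which is one more reason to prefer the row value form.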
Join first and then aggregate
When a join and an aggregation are used together, join first to avoid creating an intermediate table.
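As a sketch of the idea, the two queries below compute the total recharge amount per customer. They are assumed examples built on the tables used in this article; the aliases tmp and total_amount are hypothetical.

```sql
-- Aggregate first, then join: the derived table tmp is an
-- intermediate table.
SELECT tc.id, tmp.total_amount
FROM tbl_customer tc
INNER JOIN (SELECT customer_id, SUM(recharge_amount) AS total_amount
            FROM tbl_recharge_record
            GROUP BY customer_id) tmp
        ON tmp.customer_id = tc.id;

-- Join first, then aggregate: no derived table is needed.
SELECT tc.id, SUM(trr.recharge_amount) AS total_amount
FROM tbl_customer tc
INNER JOIN tbl_recharge_record trr
        ON trr.customer_id = tc.id
GROUP BY tc.id;
```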
Use views wisely
Views are very handy tools that we use a lot in our daily work
However, defining complex views without much thought can lead to significant performance problems
Especially if the view definition statement contains the following operations, SQL can be very inefficient and slow to execute
Conclusion
This article covers quite a few points, but the core idea of optimization is to find the performance bottleneck and resolve it.
This is not limited to databases and SQL: access to the hard disk or file system is a performance bottleneck throughout computing (which is why adding memory or using faster disks improves performance).
Whether it is reducing sorts, using indexes, or avoiding temporary tables, the essence is to reduce access to the hard disk.
Summary of tips
1. If the argument of IN is a subquery, use EXISTS or a join instead of IN.
2. Many operations in SQL sort implicitly; try to avoid them.
3. Write SQL that can use indexes, and avoid patterns that prevent index use.
4. Minimize the use of intermediate tables.
Source: www.cnblogs.com/youzhibing/…