SQL itself is not hard to learn, and writing queries is easy, but writing queries that run efficiently can be difficult.
Query optimization is a complex undertaking that involves hardware, parameter configuration, the parser and optimizer implementations of different databases, the execution order of SQL statements, indexes and statistics collection, and even the overall architecture of the application and system.
This article introduces a few key rules that can help you write efficient SQL queries; for beginners especially, these rules will at least keep us from writing poorly performing queries.
The following rules apply to all relational databases, including but not limited to MySQL, Oracle, SQL Server, PostgreSQL, and SQLite.
Rule 1: Return only what you need
Be sure to specify a WHERE condition for the query statement to filter out unwanted rows.
Typically, OLTP systems need to return only a few records at a time from a large amount of data; specifying query criteria lets the database locate results through an index rather than a full table scan.
Performance is usually better with an index because searching a B-tree (or B+ tree, B* tree) has logarithmic rather than linear time complexity.
(Figure: MySQL clustered index)
For example, assuming each index node can hold 100 entries, one million (100³) records can be indexed with a B-tree only three levels deep.
When a row is looked up through the index, the database reads three index nodes (each read is one disk I/O on a whole node), plus one more disk I/O to fetch the actual row.
In contrast, a full table scan may require orders of magnitude more disk I/Os. And when the data volume grows to 100 million (100⁴) records, the B-tree lookup needs only one additional I/O, while a full table scan still requires several orders of magnitude more.
Similarly, we should avoid SELECT *, because it retrieves every column in the table.
As a result, the database needs to read more data, and the network needs to transfer more data, resulting in performance degradation.
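A minimal sketch of both points, reusing the employee table from the examples later in this article (the assumption here is that emp_name is selective enough to be worth filtering on):

-- Avoid: reads and transfers every column of every row
SELECT * FROM employee;

-- Prefer: filter the rows and list only the columns the application needs
SELECT emp_id, emp_name, salary
FROM employee
WHERE emp_name = 'Zhang Fei';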
Rule 2: Make sure the query uses the correct index
If a suitable index is missing, data will not be found through the index even if the query criteria are specified. Therefore, we first need to ensure that the appropriate index is created.
In general, the following fields need to be indexed:
- Index fields that frequently appear in WHERE conditions to avoid full table scans.
- Add fields used in ORDER BY to the index to avoid extra sorting operations.
- Index the join fields of multi-table queries to improve join performance.
- Add GROUP BY fields to the index so that grouping can be completed using the index (see the sketch after this list).
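As a sketch only (idx_emp_dept matches the index used in the examples below; the composite index name and column order are assumptions for illustration), such indexes might be created like this:

-- Support WHERE filters and joins on dept_id
CREATE INDEX idx_emp_dept ON employee (dept_id);

-- A composite index can also cover GROUP BY dept_id and ORDER BY salary
CREATE INDEX idx_emp_dept_salary ON employee (dept_id, salary);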
Even if a proper index is created, the database will not use the index if the SQL statement is written incorrectly.
Common problems that cause indexes to fail include:
- Applying expressions or functions to an indexed field in the WHERE clause prevents the index from being used; this includes implicit conversions caused by mismatched data types, such as comparing a string column with an integer.
- An index cannot be used when a wildcard appears on the left side of a LIKE pattern. For fuzzy matching over large text data, consider the full-text search features provided by the database, or even a dedicated full-text search engine (Elasticsearch, etc.).
- If an index is created on a field used in WHERE conditions, try to declare that field NOT NULL; not all databases can use an index for IS [NOT] NULL predicates. (The sketch after this list illustrates the first two points.)
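A sketch of the first two cases, using MySQL syntax and assuming hypothetical indexes on hire_date and emp_name (hire_date does not appear elsewhere in this article):

-- Function on the indexed column: the index on hire_date cannot be used
SELECT emp_id FROM employee WHERE YEAR(hire_date) = 2023;
-- Rewritten as a range predicate, the index can be used
SELECT emp_id FROM employee WHERE hire_date >= '2023-01-01' AND hire_date < '2024-01-01';

-- Leading wildcard: the index on emp_name cannot be used
SELECT emp_id FROM employee WHERE emp_name LIKE '%Fei';
-- A prefix match can still use the index
SELECT emp_id FROM employee WHERE emp_name LIKE 'Zhang%';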
An execution plan describes the steps the database takes to execute a SQL statement: whether table data is accessed through an index or a full table scan, how the query is implemented, and in what order tables are joined.
If a SQL statement performs poorly, the first thing to do is to inspect its execution plan (EXPLAIN) and confirm that the query uses the correct index.
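For instance, in MySQL (the table and columns are taken from the examples below; the literal value and the exact output format are illustrative and differ between databases):

-- Show how MySQL plans to execute the query, without running it
EXPLAIN SELECT emp_id, emp_name
FROM employee
WHERE dept_id = 1;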
Rule 3: Avoid subqueries whenever possible
Using MySQL as an example, the following query returns information about employees whose monthly salary is greater than the department’s average monthly salary:
EXPLAIN ANALYZE
SELECT emp_id, emp_name
FROM employee e
WHERE salary > (SELECT AVG(salary) FROM employee WHERE dept_id = e.dept_id);

-> Filter: (e.salary > (select #2))  (cost=2.75 rows=25) (actual time=0.232.. rows=6 loops=1)
    -> Table scan on e  (cost=2.75 rows=25) (actual time=0.099..0.190 rows=25 loops=1)
    -> Select #2 (subquery in condition; dependent)
        -> Aggregate: avg(employee.salary)  (actual time=0.147.. loops=25)
            -> Index lookup on employee using idx_emp_dept (dept_id=e.dept_id)  (cost=1.12 rows=5) (actual time=0.068..0.104 rows=7 loops=25)
According to the execution plan, MySQL uses a nested-loop implementation: the dependent subquery is executed 25 times, even though the average monthly salary of each department could be computed and cached in a single scan.
The following statement replaces the subquery with an equivalent JOIN, performing the subquery unnesting manually:
EXPLAIN ANALYZE
SELECT e.emp_id, e.emp_name
FROM employee e
JOIN (SELECT dept_id, AVG(salary) AS dept_average
      FROM employee
      GROUP BY dept_id) t
  ON e.dept_id = t.dept_id
WHERE e.salary > t.dept_average;

-> Nested loop inner join  (actual time=0.722.. loops=1)
    -> Table scan on e  (cost=2) (actual time=0.096..0.205 rows=25 loops=1)
    -> Filter: (e.salary > t.dept_average)  (actual time=0.068..0.076 rows=0 loops=25)
        -> Index lookup on t using <auto_key0> (dept_id=e.dept_id)  (actual time=0.011..0.015 rows=1 loops=25)
            -> Materialize  (actual time=0.048..0.057 rows=1 loops=25)
                -> Group aggregate: avg(employee.salary)  (actual time=0.228.. loops=1)
                    -> Index scan on employee using idx_emp_dept  (cost=2.75 rows=25) (actual time=0.181..0.348 rows=25 loops=1)
The rewritten query uses materialization: the subquery result is stored in an in-memory temporary table, which is then joined with the employee table. The actual execution times show that this approach is faster.
In Oracle and SQL Server, the subquery in the first statement is unnested automatically, so both forms perform the same. In PostgreSQL, the behavior is similar to MySQL: the first statement uses a Nested Loop Join, while the rewritten one uses a Hash Join and performs better.
Similar conclusions can be drawn for IN and EXISTS subqueries. Since optimizer capabilities vary from database to database, we should avoid subqueries where possible and consider rewriting them as joins.
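A sketch of such a rewrite (the employee and department tables match those used later in this article; the equivalence assumes dept_id is unique in department, and the department name is illustrative):

-- IN subquery
SELECT emp_name
FROM employee
WHERE dept_id IN (SELECT dept_id FROM department WHERE dept_name = 'Administration');

-- Equivalent join, which many optimizers handle better
SELECT e.emp_name
FROM employee e
JOIN department d ON e.dept_id = d.dept_id
WHERE d.dept_name = 'Administration';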
Rule 4: Do not use OFFSET for paging
A paging query skips a specified number of rows and then returns the top-N records.
Databases generally support FETCH/LIMIT together with OFFSET to implement top-N and paging queries. However, this approach can cause performance problems when the table contains a large amount of data.
Take MySQL as an example:
-- MySQL
SELECT *
FROM large_table
ORDER BY id
LIMIT 10 OFFSET N;
The above query becomes slower and slower as OFFSET increases, because even though only 10 records are returned, the database still has to access and discard the first N (say, 1,000,000) rows; even with an index, this involves a lot of unnecessary scanning.
For the above paging query, it is better to remember the maximum ID obtained last time and pass it in as a condition in the next query:
-- MySQL
SELECT *
FROM large_table
WHERE id > last_id
ORDER BY id
LIMIT 10;
If there is an index on the ID field, this paging query can be largely unaffected by the amount of data.
Rule 5: Know the logical execution order of SQL clauses
The following is the syntax order of each clause in SQL, with the number in parentheses representing the logical order in which they are executed:
(6) SELECT [DISTINCT | ALL] col1, col2, agg_func(col3) AS alias
(1) FROM t1 JOIN t2
(2)   ON (join_conditions)
(3) WHERE where_conditions
(4) GROUP BY col1, col2
(5) HAVING having_condition
(7) UNION [ALL]
    ...
(8) ORDER BY col1 ASC, col2 DESC
(9) OFFSET m ROWS FETCH NEXT num_rows ROWS ONLY;
That is, SQL statements are not executed in the order in which they are written: execution does not start with the SELECT clause, but with the FROM clause.
Logically, SQL statements are executed in the following order:
- **First,** FROM and JOIN are the starting point of execution. Their logical result is a Cartesian product, which determines the data set to be operated on next. Note that the logical execution order does not represent the physical execution order; in practice the database uses the ON and WHERE conditions to optimize access before it actually reads the table data.
- **Second,** the ON condition is applied to filter the result of the previous step and produce a new data set.
- **Then,** the WHERE clause filters the data set from the previous step again. WHERE and ON have the same effect in most cases, but they differ in outer join queries, as illustrated below.
- **Next,** rows are grouped according to the expressions specified in the GROUP BY clause, and the aggregate function agg_func is computed for each group. After GROUP BY, the structure of the data set changes: only the grouping fields and the results of aggregate functions are preserved.
- If a GROUP BY clause exists, HAVING can be used to further filter the grouped results, usually on the results of aggregate functions.
- **Next,** SELECT specifies the columns to return; if the DISTINCT keyword is specified, the result set is deduplicated. Aliases defined with AS are also generated at this point.
- If there are set operators (UNION, INTERSECT, EXCEPT) combining this query with other SELECT statements, those queries are executed and the result sets are merged. Databases can typically execute the SELECT statements of a set operation concurrently.
- **Then,** the ORDER BY clause sorts the results. If there is a GROUP BY clause or a DISTINCT keyword, only grouping fields and aggregate functions can be used for sorting; otherwise, any field of the tables in FROM and JOIN can be used.
- **Finally,** OFFSET and FETCH (LIMIT, TOP) limit the number of rows returned.
Understanding the logical execution order of SQL helps with optimization. For example, since the WHERE clause is executed before the HAVING clause, filter data with WHERE whenever possible to avoid unnecessary work, unless the business requirement is to filter on the result of an aggregate function.
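A small sketch of this point (the hire_date column is a hypothetical addition; the other names follow the examples in this article):

-- Filter rows with WHERE before grouping whenever the condition does not involve aggregates
SELECT dept_id, AVG(salary) AS dept_average
FROM employee
WHERE hire_date >= '2020-01-01'
GROUP BY dept_id;

-- Use HAVING only when filtering on an aggregate result
SELECT dept_id, AVG(salary) AS dept_average
FROM employee
GROUP BY dept_id
HAVING AVG(salary) > 10000;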
In addition, understanding the logical execution order of SQL can also help us avoid some common errors, such as the following statements:
SELECT emp_name AS empname FROM employee WHERE empname = 'Zhang Fei';
The error in this statement is that column aliases are referenced in the WHERE condition; As you can see from the logical order above, the SELECT clause has not been executed when the WHERE condition is executed, and thus the aliases for the fields are not generated.
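The fix is simply to reference the original column name in the WHERE clause:

SELECT emp_name AS empname FROM employee WHERE emp_name = 'Zhang Fei';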
Another operation to watch out for is GROUP BY, for example:
SELECT dept_id, emp_name, AVG(salary) FROM employee GROUP BY dept_id;
The emp_name field in this example is no longer available, because after GROUP BY processing only the grouping fields and the results of aggregate functions remain in the result set.
From a business logic point of view, it does not make sense to display an employee’s name after grouping statistics by department. If you want to display both employee information and a summary of the department, you can use window functions.
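A sketch with a window function (supported by MySQL 8.0+, Oracle, SQL Server, PostgreSQL, and SQLite 3.25+):

-- Show each employee together with the average salary of their department
SELECT emp_name, dept_id,
       AVG(salary) OVER (PARTITION BY dept_id) AS dept_average
FROM employee;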
If GROUP BY is used, subsequent clauses such as SELECT and ORDER BY can only reference grouping fields or aggregate functions; otherwise, they may reference any field of the tables in FROM and JOIN.
There are also logic problems that may not directly cause the query to fail, but may return incorrect results; Examples are ON and WHERE conditions in outer join queries.
Here is an example of a left outer join query:
-- Join condition in ON, filter condition in WHERE
SELECT e.emp_name, d.dept_name
FROM employee e
LEFT JOIN department d ON (e.dept_id = d.dept_id)
WHERE e.emp_name = 'Zhang Fei';

emp_name |dept_name     |
---------|--------------|
Zhang Fei|Administration|

-- All conditions in the ON clause
SELECT e.emp_name, d.dept_name
FROM employee e
LEFT JOIN department d ON (e.dept_id = d.dept_id AND e.emp_name = 'Zhang Fei');

emp_name   |dept_name     |
-----------|--------------|
Liu Bei    |[NULL]        |
Guan Yu    |[NULL]        |
Zhang Fei  |Administration|
Zhuge Liang|[NULL]        |
...
The first query specifies the join condition in the ON clause and finds information about “Zhang Fei” through the WHERE clause.
The second query puts all the filtering criteria in the ON clause and returns all the employee information.
This is because the left outer join returns all the data in the left table, even if the employee name is specified in the ON clause. The WHERE condition logically filters the result of the join operation.
Conclusion
SQL optimization is essentially about understanding how the optimizer works, creating the right indexes, and writing statements the optimizer can handle well; and when the optimizer is not smart enough, helping it along by rewriting the query manually.