preface
MySQL > SELECT * from MySQL > SELECT * from MySQL > SELECT * from MySQL > SELECT * from MySQL Do you really understand these optimization techniques? Do you understand how it works? Is there a real performance improvement in the real world? I don’t think so. Therefore, it is important to understand the principles behind these optimization suggestions. I hope this article will help you review these optimization suggestions and apply them properly in actual business scenarios.
MySQL Logical Architecture
It helps to have an architectural picture in your head of how the MySQL components work together. The following figure shows the logical architecture of MySQL.
MySQL Logical Architecture, from: High performance MySQL Study Notes
The overall logical architecture of MySQL is divided into three layers. The client layer is the top layer, which is not unique to MySQL. Functions such as connection processing, authentication and security are all handled in this layer.
Most of MySQL’s core services are in the middle layer, including query parsing, analysis, optimization, caching, and built-in functions (such as timing, math, encryption, etc.). All cross-storage engine functionality is also implemented in this layer: stored procedures, triggers, views, and so on.
At the lowest level is the storage engine, which is responsible for data storage and extraction in MySQL. Like file systems under Linux, each storage engine has its advantages and disadvantages. The service layer in the middle communicates with storage engines through apis that mask the differences between storage engines.
MySQL > select * from ‘MySQL’;
We always want MySQL to get better query performance, and the best way to do that is to understand how MySQL optimizes and executes queries. Once you understand this, you’ll see that a lot of query optimization is really just following some principles to make the MySQL optimizer work the way it’s supposed to.
What exactly does MySQL do when it sends a request to MySQL?
Client/server communication protocol
The MySQL client/server communication protocol is “half-duplex” : at any given moment, either the server sends data to the client or the client sends data to the server. The two actions cannot occur at the same time. Once one end starts sending a message, the other end needs to receive the entire message before it can respond to it, so there is no way or need to cut a message into small pieces and send it independently, and there is no way to control the flow.
The client sends the query request to the server in a single packet, so the max_allowed_packet parameter needs to be set when the query statement is long. Note, however, that if the query is too large, the server will refuse to receive more data and throw an exception.
In contrast, the server typically responds to the user with a large number of data packets. But when the server responds to a client request, the client must receive the entire result, not simply take the first few results and tell the server to stop sending. Therefore, in the actual development, it is a very good habit to keep the query simple and only return the necessary data, and reduce the size and number of packets between communications. This is also one of the reasons to avoid the use of SELECT * and LIMIT in the query.
The query cache
Before parsing a query statement, if the query cache is open, MySQL checks to see if the query statement matches the data in the query cache. If the current query happens to hit the query cache, the result in the cache is returned after the user permission is checked once. In this case, the query is not parsed, nor is an execution plan generated, nor is it executed.
MySQL stores the cache in a reference table, indexed by a hash value calculated from the query itself, the current database being queried, the client protocol version number, and other information that may affect the results. So any difference between two queries on any character (e.g., space, comment) will result in a cache miss.
If the query contains any user-defined functions, storage functions, user variables, temporary tables, or system tables in the mysql library, the query results will not be cached. For example, the functions NOW() or CURRENT_DATE() may return different results for different query times. For example, a query containing CURRENT_USER or CONNECION_ID() may return different results for different users. It doesn’t make sense to cache such query results.
Since it’s a cache, it’s invalidated, so when is the query cache invalidated? MySQL’s query caching system keeps track of every table involved in a query, and if these tables (data or structure) change, all cached data associated with that table is invalidated. Because of this, MySQL must invalidate all caches of the corresponding table during any write operation. If the query cache is very large or fragmented, this operation can be very costly to the system, or even cause the system to freeze for a while. The additional cost of the query cache is not limited to write operations, but also to read operations:
- Any query statement must be checked before it is started, even if the SQL statement will never hit the cache
- If the query results can be cached, the results will be cached after execution, with additional system overhead
Based on this, it is important to know that not all query caching will improve system performance. Caching and invalidation will cost additional resources, and only when the resource savings from caching are greater than the resources consumed will the system improve performance. However, evaluating whether or not opening the cache will lead to performance gains is difficult and beyond the scope of this article. If the system does have some performance problems, you can try to turn on the query cache and make some optimizations in the database design, such as:
- Replace one large table with several small ones, and be careful not to over-design
- Batch insert replaces cyclic single insert
- Reasonable control of cache space size, generally speaking, its size is set to tens of megabytes is more appropriate
- You can use SQL_CACHE and SQL_NO_CACHE to control whether a query statement needs to be cached
A final word of advice is not to open query caching lightly, especially in write-intensive applications. If you can’t resist, you can set query_cache_type to DEMAND so that only queries that are added to SQL_CACHE are cached, giving you a lot of freedom over which queries need to be cached.
Of course, the query caching system itself is very complex, and this is only a small part of the discussion. Other more in-depth topics, such as: how does the cache use memory? How to control memory fragmentation? You can read about how transactions affect the query cache and so on.
Parsing and preprocessing
MySQL parses SQL statements by keyword and generates a corresponding parse tree. The procedure parser validates and parses mainly through syntax rules. For example, whether the SQL is using the wrong keywords or whether the keywords are in the correct order and so on. Preprocessing further checks whether the parse tree is valid based on MySQL rules. For example, check whether the tables and columns to be queried exist, and so on.
Query optimization
The syntax tree generated by the previous steps is considered valid and is turned into a query plan by the optimizer. In most cases, a query can be executed in many ways, all of which return the corresponding result. The optimizer’s job is to find the best execution plan.
MySQL uses a cost-based optimizer, which tries to predict the cost of a query using a certain execution plan and selects the one with the least cost. In MySQL, you can query the value of last_query_cost of the current session to find the cost of calculating the current query.
mysql> select * from t_message limit 10; . Mysql > show status like'last_query_cost';
+-----------------+-------------+
| Variable_name | Value |
+-----------------+-------------+
| Last_query_cost | 6391.799000 |
+-----------------+-------------+
Copy the code
The results in the example indicate that the optimizer thinks it would take about 6391 random data pages to complete the above query. This result is calculated based on column statistics such as the number of pages per table or index, index cardinality, index and data row length, index distribution, and so on.
There are so many reasons lead to MySQL choose wrong execution plan, such as statistical information is not accurate, not consider not subject to the control of operating costs (user-defined functions, stored procedures), MySQL think optimal is not the same as we think (we hope that the execution time as short as possible, but it think cost small MySQL value choice, But low cost does not mean short execution time) and so on.
MySQL’s query optimizer is a very complex part that uses a number of optimization strategies to generate an optimal execution plan:
- Redefine the association order of tables (When a query is associated with multiple tables, it is not always in the order specified in SQL, but there are techniques to specify the association order)
- Optimize MIN() and MAX() functions (find the minimum value of a column, if the column has an index, only need to find the left end of the B+Tree index, otherwise can find the maximum value, see below)
- Terminate the query early (e.g., when Limit is used, terminate the query as soon as a sufficient number of result sets are found)
- Optimization (in the old version of MySQL will use two transmission sorting, namely first read row pointer and need sort fields in memory for the sorting, then according to the result of sorting to read the data line, and the new version USES a single transmission sequence, which is a read all of the data line, and then according to the given column sorting. For I/O intensive applications, efficiency is much higher.)
With the continuous development of MySQL, the optimization strategies used by the optimizer are also constantly evolving. Here are just a few of the most common and easy to understand optimization strategies.
Query execution engine
After completing the parsing and optimization phase, MySQL will generate the corresponding execution plan, and the query execution engine will execute the results step by step according to the instructions given by the execution plan. Most of the execution is done by calling interfaces implemented by the storage engine, known as the Handler API. Each table in the query is represented by a handler instance. In fact, MySQL creates a handler instance for each table during the query optimization phase. The optimizer can obtain information about the table based on the interface of these instances, including all column names of the table, index statistics, etc. The storage engine interface provides a lot of functionality, but there are only a few dozen interfaces underneath that do most of the work of a query like building blocks.
Returns the result to the client
The final stage of query execution is to return the results to the client. Even if no data is retrieved, MySQL still returns information about the query, such as the number of rows affected by the query and the execution time.
If the query cache is enabled and the query can be cached, MySQL will also store the results in the cache.
The result set return to the client is an incremental and gradual return process. It is possible that MySQL will start gradually returning the result set to the client as soon as the first result is generated. In this way, the server does not need to store too many results and consume too much memory, and the client can get the results in the first time. Note that each row in the result set will be sent as a packet that meets the communication protocol described in ①, and then transmitted through TCP. During transmission, MySQL packets may be cached and sent in batches.
MySQL query execution process is divided into 6 steps:
- The client sends a query request to the MySQL server
- The server first checks the query cache and immediately returns the results stored in the cache if a hit is made. Otherwise move on to the next stage
- The server parses the SQL, preprocesses it, and generates the corresponding execution plan by the optimizer
- MySQL invokes the storage engine API to execute the query according to the execution plan
- The result is returned to the client and the query result is cached
Performance Optimization Suggestions
After reading so much, you might expect some optimization tips, and yes, here are some optimization tips from three different areas. But wait, one more word of advice: Don’t listen to the “absolute truths” you read about optimization, including the ones discussed in this article, but test your assumptions about execution plans and response times in real business scenarios.
Scheme design and data type optimization
Choose data types that are small and simple. Smaller data types are usually faster, use less disk, less memory, and require fewer CPU cycles for processing. Simpler data types require fewer CPU cycles to compute. For example, integers are less expensive than character operations, so they are used to store IP addresses and DATETIME is used to store time instead of strings.
Here are a few tips that can go wrong:
- In general, changing a nullable column to NOT NULL will NOT do much to improve performance, but if you plan to create indexes on a column, you should set the column to NOT NULL.
- Specifying the width of an integer type, such as INT(11), does nothing. INT uses 32 bits (4 bytes) of storage space, so its range of representation is already defined, so INT(1) and INT(20) are the same for storage and computation.
- UNSIGNED does not allow negative values and roughly doubles the maximum number of positive numbers. For example, TINYINT stores in the range of -128 to 127, while UNSIGNED TINYINT stores in the range of 0-255.
- In general, there is not much need to use DECIMAL data types. You can still use BIGINT even when you need to store financial data. For example, to be accurate to 1/10,000, you can multiply the data by a million and store it using BIGINT. This avoids the problems of inaccurate floating-point calculations and costly accurate DECIMAL calculations.
- TIMESTAMP uses 4 bytes of storage space and DATETIME uses 8 bytes of storage space. Thus, TIMESTAMP can only represent the years 1970-2038, a much smaller range than DATETIME, and the value of TIMESTAMP varies from time zone to time zone.
- In most cases, there is no need to use enumerated types. One drawback is that the list of enumerated strings is fixed, and adding and deleting strings (the enumeration option) must use ALTER TABLE (if you simply append elements to the end of the list, you do not need to rebuild the TABLE).
- Do not have too many columns in the schema. The reason is that the storage engine API works by copying data between the server layer and the storage engine layer in row buffer format, and then decoding the cached content into columns at the server layer, a very expensive conversion process. If there are too many columns and too few columns are actually used, the CPU usage can be too high.
- Big TABLE ALTER TABLE is very time consuming, MySQL perform most of the modification result TABLE operation method is to use a new structure to create an empty TABLE, find out all of the data from the old TABLE to insert a new TABLE, and then delete the old TABLE. Especially if memory is low and the table is large and has large indexes, it takes longer. Of course, there are some strange techniques can solve this problem, interested to consult.
Creating a High-performance index
Indexes are an important way to improve the performance of MySQL queries, but too many indexes may lead to high disk usage and high memory usage, which may affect the overall performance of the application. You should avoid adding an index as an afterthought, because you may need to monitor a lot of SQL to locate the problem, and adding an index takes much longer than the time required to add an index in the first place.
I’ll show you a series of strategies for creating high-performance indexes and how each strategy works. But before I do that, it’s helpful to understand some of the algorithms and data structures associated with indexing.
Index related data structures and algorithms
Usually we refer to the index refers to the B-tree index, which is the most common and effective index for searching data in the relational database. Most storage engines support this index. The term b-tree is used because MySQL uses this keyword in CREATE TABLE and other statements, but in fact different storage engines may use different data structures. For example, InnoDB uses B+Tree.
B+Tree B+Tree Note that a B+ tree index does not find a specific row for a given key value. It only finds the page of the row being searched. The database then reads the page into memory, searches the page in memory, and finally retrieves the data to be searched.
Before introducing B+Tree, let’s take a look at the binary search Tree. It is a classical data structure in which the value of the left subtree is always less than the value of the root and the value of the right subtree is always greater than the value of the root, as shown in figure 1. If you want to find a record with a value of 5 in this tree, the general process is as follows: find the root, which has a value of 6 and is greater than 5, so look for the left subtree, find 3, which is greater than 3, and then look for the right subtree of 3, three times in total. In the same way, if you look for a record with a value of 8, you need to look three times. Therefore, the average search times of the binary search tree is (3 + 3 + 3 + 2 + 2 + 1) / 6 = 2.3 times. In the case of sequential search, it takes only one search for the record with the value of 2, but six searches for the record with the value of 8. Therefore, the average search times of sequential search is: (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.3 times, so in most cases the average binary lookup tree is faster than the sequential lookup.
Since the binary search tree can be constructed arbitrarily, the binary search tree as shown in Figure 2 can be constructed with the same value. Obviously, the query efficiency of this binary search tree is almost the same as that of sequential search. If the binary lookup number has the highest query performance, the binary lookup tree needs to be balanced, i.e., balanced binary tree (AVL tree).
A balanced binary tree must first meet the definition of a binary search tree, and secondly must satisfy that the height difference between the two subtrees of any node cannot be greater than 1. Obviously, Figure 2 does not satisfy the definition of balanced binary tree, while Figure 1 is a balanced binary tree. The search performance of balanced binary tree is relatively high (the best performance is optimal binary tree), and the better the query performance, the higher the maintenance cost. For example, in the balanced binary tree in Figure 1, when the user needs to insert a new node with a value of 9, the following changes are required.
Turning the inserted tree back into a balanced binary tree with a single left rotation is the simplest case, and may require multiple rotations in a practical application scenario. The search efficiency of balanced binary tree is not bad, the implementation is also very simple, the corresponding maintenance cost is acceptable, why MySQL index does not directly use balanced binary tree?
As the data in the database increases, the size of the index itself increases. Therefore, it is impossible to store all the indexes in memory. Therefore, indexes are usually stored on disks in the form of index files. In this case, index lookups incur disk I/O costs that are orders of magnitude higher than memory accesses. You can imagine what is the depth of a binary tree with millions of nodes? If you put a binary tree with such a large depth on disk, it would require an I/O read from disk for every node read, and the overall search time would obviously be unacceptable. So how do you reduce the number of I/O accesses during lookups?
An effective solution is to reduce the depth of the Tree and change the binary Tree into an M-tree (multi-path search Tree), which B+Tree is. To understand B+Tree, we only need to understand its two most important features: first, all the keywords (which can be understood as data) are stored in the Leaf Page, while the non-leaf node (Index Page) does not store real data. All the record nodes are stored in the same Leaf node in the order of key values. Second, all leaf nodes are connected by Pointers. A simplified B+Tree of height 2 is shown below.
How to understand these two characteristics? MySQL sets the size of each node to an integer multiple of a page (for reasons explained below), which means that each node can store more internal nodes for a given amount of node space, so that each node can be indexed more accurately. The advantage of pointer links for all leaf nodes is that they can be accessed in intervals. For example, in the figure above, if records greater than 20 but less than 30 are searched, node 20 is only needed to be found, and then the pointer can be traversed to find 25 and 30 successively. You can’t do interval lookup without a link pointer. This is why MySQL uses B+Tree as its index storage structure.
Why MySQL sets the node size to integer multiples of pages requires an understanding of disk storage. Disk itself is a lot slower than main memory access, combined with the loss of mechanical movement (especially common mechanical hard disk, the disk access speed is often hundreds of one over ten thousand of the main memory, in order to minimize disk I/O, disk is often not in accordance with the need to read, but every time to proofread, even if only need one byte, disk will begin from this position, A certain length of data is read backwards and stored in the memory. The length of preread is usually an integer multiple of a page.
A page is a logical block of computer-managed memory. Hardware and OS often divide main memory and disk storage into contiguous, equal-sized blocks, each of which is called a page (in many OSS, pages are typically 4K in size). Main memory and disk exchange data on a page basis. When the data the program is trying to read is not in main memory, a page missing exception will be triggered. At this point, the system will send a disk read signal to the disk. The disk will find the starting location of the data and load one or more pages back into memory, and then return together.
MySQL takes advantage of disk prefetch by setting the size of a node to be equal to one page so that each node can be fully loaded with only one I/O. To achieve this, each time a new node is created, a page space is requested directly. This ensures that a node is physically stored on a page, and computer storage allocation is page-aligned, so that a node can be read in one I/O. Assuming that the height of B+Tree is H, a search requires a maximum of H-1 I/ OS (the root node is resident in memory), and the complexity is O(h) = O(logmN). In practical application scenarios, M is usually large, often more than 100, so the height of the tree is usually small, usually less than 3.
Finally, a brief understanding of the operations of B+Tree nodes is given to have a general understanding of index maintenance. Although indexes can greatly improve query efficiency, it still costs a lot to maintain indexes. Therefore, it is particularly important to create reasonable indexes.
Using the tree above as an example, we assume that each node can store only four inner nodes. The first node, 28, is inserted, as shown below.
The next node 70 should be inserted into the leaf node between 50 and 70 according to the query in Index Page, but the leaf node is full, so the operation of splitting the leaf node is needed. The starting point of the current leaf node is 50, so the leaf node is split according to the median value, as shown in the figure below.
Finally, insert a node 95, at which time both Index Page and Leaf Page are full, and split twice, as shown in the figure below.
The result is a tree like this.
To maintain balance, B+Tree requires a large number of page splitting operations for newly inserted values, and page splitting requires I/O operations. To minimize page splitting operations, B+Tree also provides rotation functions similar to balanced binary trees. When a Leaf Page is full but its left and right siblings are not, B+Tree does not rush to split, but moves the record to the sibling node of the current Page. Normally, left brothers are checked first for rotation operations. As in the second example above, when 70 is inserted, page splitting is not done, but left rotation is done.
Rotation minimizes page splitting, reduces disk I/O operations during index maintenance, and improves index maintenance efficiency. Note that deleting a node is similar to inserting a node, but still requires rotation and splitting operations, which will not be explained here.
High Performance Strategy
MySQL > create index (); MySQL > create index (); As a simple example, suppose we have the following data table:
CREATE TABLE People(
last_name varchar(50) not null,
first_name varchar(50) not null,
dob date not null,
gender enum(`m`.`f`) not null,
key(last_name,first_name,dob)
);
Copy the code
For each row in the table, the index contains the values of the last_name, first_name, and DOB columns. The following figure shows how the index organizes the data store.
How does index organize data storage, from: High Performance MySQL Study Notes
As you can see, the index is sorted first by the first field, and then by the third field, date of birth, when the name is the same. This is the reason for the “leftmost rule” of indexes.
The principle of left
1, MySQL does not use index: columns that are not independent
“Independent column” means that the index column cannot be part of an expression or arguments to a function. Such as:
select * from where id + 1 = 5
Copy the code
It is easy to see that this is equivalent to id = 4, but MySQL cannot parse this expression automatically, and the same is true with functions.
2. Prefix index
If the column is long, you can usually index the first part of the character, which can effectively save index space and improve index efficiency.
3, multi-column index and index order
In most cases, creating independent indexes on multiple columns does not improve query performance. The reason for this is simple: MySQL does not know which index to choose for better query efficiency, so prior to older versions such as MySQL5.0, it randomly selects a column index, while newer versions use the merge index strategy. For a simple example, in a movie actor table, create an independent index on each column actor_id and film_id, and then query as follows:
select film_id,actor_id from film_actor where actor_id = 1 or film_id = 1
Copy the code
Older versions of MySQL randomly selected an index, but newer versions of MySQL have the following optimizations:
select film_id,actor_id from film_actor where actor_id = 1
union all
select film_id,actor_id from film_actor where film_id = 1 and actor_id <> 1
Copy the code
- When multiple indexes intersect (multiple AND conditions), an index that contains all related columns is generally preferable to multiple independent indexes.
- When multiple indexes perform joint operations (multiple OR conditions), the operations such as merging and sorting of result sets need to consume a lot of CPU and memory resources. Especially, when some indexes are not highly selective and need to return a large amount of data, the query cost is higher. So in this case it’s better to go full table scan.
Therefore, if there is an index merge (Extra field appears Using union), we should check whether the query and table structure is already optimal. If there is no problem with the query and table, it can only indicate that the index is built very badly, and we should carefully consider whether the index is appropriate. It is possible that a multi-column index containing all related columns would be more appropriate.
As you can see from the diagram, the order of the indexes is critical to the query. It is obvious that the most selective fields should be placed in front of the index, so that the first field will filter out most of the data that does not meet the criteria.
Index selectivity refers to the ratio of the number of unique indexes to the total number of records in the table. The more selective the index, the more efficient the query is, because the more selective the index, the more rows MySQL filters out during the query. Unique index selectivity is 1, which is the best index selectivity and the best performance.
Once you understand the concept of index selectivity, it is not difficult to determine which fields are more selective, just look them up, for example:
SELECT * FROM payment where staff_id = 2 and customer_id = 584
Copy the code
Should the index (staff_id,customer_id) be created or should the order be reversed? Perform the following query and index the fields that are closer to 1.
select count(distinct staff_id)/count(*) as staff_id_selectivity,
count(distinct customer_id)/count(*) as customer_id_selectivity,
count(*) from payment
Copy the code
There’s nothing wrong with using this principle most of the time, but keep an eye out for special situations in your data. For example, if you want to query the information of users who have made transactions under a user group:
select user_id from trade where user_group_id = 1 and trade_amount > 0
Copy the code
MySQL selects the index (user_group_id,trade_amount) for this query, which doesn’t seem to be a problem unless you consider special cases, but the reality is that most of the data in this table is migrated from the old system, and due to data incompatibility between the old and new systems, So the data migrated from the old system is given a default user group. In this case, the number of rows through an index scan is basically the same as a full table scan, and the index is useless.
Broadly speaking, rules of thumb and inferences are useful in most situations and can guide development and design, but the reality is often more complex, and specific situations in real business scenarios can ruin your entire design.
4. Avoid multiple scope conditions
In real development, we often use multiple scope criteria, such as when we want to query users who have logged in during a certain period of time:
select user.* from user where login_time > '2017-04-01' and age between 18 and 30;
Copy the code
There is one problem with this query: it has two scope conditions, the login_time column and the AGE column. MySQL can use either the index of the login_time column or the index of the AGE column, but not both.
5. Overwrite indexes
If an index contains or overwrites the values of all the fields to be queried, there is no need to go back to the table to query, which is called an overwriting index. Overwriting an index is a very useful tool that can greatly improve performance, as a query that only needs to scan the index provides many benefits:
- Index entries are much smaller than the data row size, and reading only indexes greatly reduces data access
- Indexes are stored in order of column values, much less for I/O intensive range queries than for random IO reads of each row from disk
6. Use index scans to sort
MySQL has two ways to produce ordered result sets. One is to sort the result sets, and the other is to scan the results in order by index. If the type column in the explain result has a value of index, it indicates that an index scan was used for sorting.
Scanning the index itself is fast because you only need to move from one index record to the next adjacent record. But if the index itself does not cover all the columns that need to be queried, then you have to query the corresponding row back to the table for each index record scanned. This read operation is basically random I/O, so reading data in index order is usually slower than sequential full table scans.
When designing an index, it is best to have an index that satisfies both sorts and queries.
The index can be used to sort results only if the index column ORDER is exactly the same as the ORDER BY clause, and all columns are sorted in the same direction. If a query requires multiple tables to be associated, an index can be used to sort only if all the fields referenced BY the ORDER BY clause are the first table. The ORDER BY clause has the same constraints as the query, requiring the left-most prefix (except for the case where the left-most column is specified as a constant, as shown in a simple example), but otherwise requiring sorting rather than index sorting.
// Left column constant, index: (date,staff_id,customer_id) select staff_id,customer_id from demo where date = ‘2015-06-01’ order by staff_id,customer_id
7. Redundant and duplicate indexes
Redundant indexes refer to indexes of the same type that are created on the same columns in the same order. Avoid such indexes and delete them immediately after discovery. For example, if you have an index (A,B), creating index (A) is A redundant index. Redundant indexes often occur when A new index is added to A table, such as when someone creates A new index (A,B) that does not extend an existing index (A).
In most cases, you should try to extend existing indexes rather than create new ones. But there are rare instances where performance considerations require redundant indexes, such as extending an existing index so that it becomes too large to affect other queries that use the index.
8. Delete indexes that have not been used for a long time
Periodically deleting indexes that have not been used for a long time is a good habit.
I’m going to leave the topic of indexes at that. Finally, an index is not always the best tool, and is only effective if the benefit of helping to speed up queries outweighs the extra work it does. For very small tables, a simple full table scan is more efficient. For medium to large tables, indexes are very effective. With very large tables, the cost of creating and maintaining indexes increases, and other techniques may be more effective, such as partitioning tables. Last but not least, it is a virtue to explain.
Specific type of query optimization
Optimize the COUNT() query
COUNT() is probably the most misunderstood function. It counts the number of columns and the number of rows. Column values are required to be non-empty when counted; it does not count NULL. If you confirm that the expression in parentheses cannot be null, you are actually counting the number of lines. The simplest is that when COUNT(*) is used, instead of expanding to all columns as we might expect, it actually ignores all columns and counts rows directly.
This is where the most common misconception is to specify a column in parentheses and expect the result to be the number of rows, often under the mistaken impression that the former will perform better. This is not the case. If you want to COUNT the number of rows, use COUNT(*) directly, which has clear meaning and better performance.
Sometimes some business scenarios do not require a perfectly accurate COUNT value and can be replaced with an approximation. The number of rows produced by EXPLAIN is a good approximation, and the execution of EXPLAIN does not require actual query execution, so the cost is very low. In general, performing COUNT() requires scanning a large number of rows to get accurate data, making it difficult to optimize, and the only thing MySQL can do is overwrite indexes. If it doesn’t solve the problem, it can only be solved at an architectural level, such as adding summary tables or using external caching systems such as Redis.
Optimizing associated query
In big data scenarios, tables are associated by a redundant field, which provides better performance than directly using JOIN. If you do need to use relational queries, you need to pay special attention to the following:
- Make sure there are indexes ON the columns in the ON and USING clauses. The order of associations should be considered when creating indexes. When tables A and B are associated with column C, there is no need to create an index on the corresponding column of table A if the optimizer is associated in order A and B. Unused indexes are an additional burden, and in general, only the corresponding column of the second table in the association order needs to be created unless there is another reason (as explained below).
- Ensure that any expressions in GROUP BY and ORDER BY refer to only one column in a table so that MySQL can use indexes for optimization.
To understand the first tip for optimizing associated queries, you need to understand how MySQL performs associated queries. The current MySQL association strategy is very simple. It performs nested loop association for any association, which loops through one table to fetch a single piece of data, then loops through the nested loop to the next table to find matching rows, and so on, until matching behavior is found in all tables. It then returns the columns needed in the query based on the rows that each table matches.
Too abstract? To illustrate, for example, a query like this:
SELECT A.xx,B.yy
FROM A INNER JOIN B USING(c)
WHERE A.xx IN (5.6)
Copy the code
Given that MySQL associates the query in order A and B, we can use the following pseudocode to show how MySQL completes the query:
outer_iterator = SELECT A.xx,A.c FROM A WHERE A.xx IN (5.6);
outer_row = outer_iterator.next;
while(outer_row) {
inner_iterator = SELECT B.yy FROM B WHERE B.c = outer_row.c;
inner_row = inner_iterator.next;
while(inner_row) {
output[inner_row.yy,outer_row.xx];
inner_row = inner_iterator.next;
}
outer_row = outer_iterator.next;
}
Copy the code
As you can see, the outermost query is queried against the A.x column. If there is an index on A.c, the entire associated query will not be used. Looking at the inner query, it is clear that an index on B.c will speed up the query, so you only need to create an index on the corresponding column of the second table in the association order.
Optimize LIMIT paging
When paging is required, the LIMIT plus offset is usually implemented with an appropriate ORDER BY clause. If there is a corresponding index, it is usually efficient, otherwise, MySQL needs to do a lot of file sorting.
A common problem is when the offset is very large, such as LIMIT 10000 20, MySQL needs to query 10020 records and only return 20 records, the first 10000 will be discarded, which is very expensive.
One of the easiest ways to optimize such queries is to use overwrite index scans whenever possible, rather than querying all columns. Then do an associated query as needed and return all columns. For large offsets, the efficiency gains are huge. Consider the following query:
SELECT film_id,description FROM film ORDER BY title LIMIT 50.5;
Copy the code
If the table is very large, the query would be better off looking like this:
SELECT film.film_id,film.description
FROM film INNER JOIN (
SELECT film_id FROM film ORDER BY title LIMIT 50.5
) AS tmp USING(film_id);
Copy the code
Deferred association here will greatly improve query efficiency, allowing MySQL to scan as few pages as possible, obtain the records that need to be accessed, and then return the columns required by the original table query according to the associated columns.
Sometimes if you can use a bookmark to record where the data was last fetched, then the next scan can start directly from the bookmark, thus avoiding the use of OFFSET, as in the following query:
SELECT id FROM t LIMIT 10000.10; SELECT id FROM t WHERE id >10000 LIMIT 10;
Copy the code
Other optimizations include using pre-computed summary tables or associating with a redundant table that contains only primary key columns and columns that need to be sorted.
Optimization of the UNION
MySQL’s strategy for dealing with unions is to create a temporary table first, then insert each query result into the temporary table, and then do the query. Therefore, many optimization strategies do not work well in UNION queries. It is often necessary to manually “push” WHERE, LIMIT, ORDER BY, and so on into each subquery so that the optimizer can take full advantage of these conditions to optimize first.
Use UNION ALL unless you really need to de-duplicate the server. If you do not have the ALL keyword, MySQL will add a DISTINCT option to the temporary table. This will result in a unique check for the data in the entire temporary table, which is very expensive. Of course, even with the ALL keyword, MySQL always puts the results into a temporary table, reads them out, and returns them to the client. Most of the time this is not necessary, for example you can sometimes return the results of each subquery directly to the client.
conclusion
Understanding how queries are executed and where time is spent, along with some knowledge of the optimization process, will help you better understand MySQL and understand the principles behind common optimization techniques. I hope the principles and examples in this paper can help you better connect theory with practice and apply theoretical knowledge to practice.
There is nothing else to say, leave two questions for you to think about, you can think about the answer in your head, this is also what we often hang in the mouth, but few people will think why?
- There are many programmers who share the idea that stored procedures should never be used at all. Stored procedures are very difficult to maintain and expensive to use, and business logic should be put on the client side. Why store procedures when clients can do all these things?
- JOIN itself is very convenient, just query, why need a view?