Don’t use SELECT *, avoid nullable columns, don’t create too many indexes, choose appropriate data types for columns… you have probably heard these rules many times.

But do you really understand these optimization techniques? Do you understand how they work? Do they actually improve performance in real-world scenarios? Probably not. It is therefore important to understand the rationale behind these suggestions, and hopefully this article will help you revisit them and apply them in real business scenarios.

MySQL Logical Architecture

Being able to build a mental diagram of how MySQL’s components work together will help you understand the MySQL server more deeply. The following figure shows the logical architecture of MySQL.

The MySQL logical architecture is divided into three layers. The top layer is the client layer, which is not unique to MySQL: functions such as connection handling, authentication, and security are handled here.

Most of MySQL’s core services live in the middle layer, including query parsing, analysis, optimization, caching, and built-in functions (time, math, encryption, and so on). All cross-storage-engine functionality is also implemented here: stored procedures, triggers, views, and so on. The lowest layer is the storage engine, which is responsible for storing and retrieving data in MySQL. Much like file systems under Linux, each storage engine has its own strengths and weaknesses. The middle service layer communicates with storage engines through APIs that mask the differences between them.

MySQL Query Procedure

We’re always looking for better query performance from MySQL, and the best way to get it is to understand how MySQL optimizes and executes queries. Once you understand this, you’ll see that much of query optimization is simply following a few principles so the MySQL optimizer can work the way it is meant to.

What does MySQL do when a client sends it a request?

① Client/Server Communication Protocol. The MySQL client/server communication protocol is “half-duplex”: at any given moment, either the server is sending data to the client or the client is sending data to the server. The two cannot happen simultaneously.

Once one end starts sending a message, the other end must receive the entire message before it can respond. So we can’t (and don’t need to) cut a message into small pieces and send them independently, and there is no way to do flow control.

The client sends the query to the server in a single packet, so the max_allowed_packet parameter matters when query statements are long. Note, however, that if the packet is too large, the server will refuse to receive any more data and throw an error.

In contrast, the server usually responds to the client with many packets. And when the server responds to a request, the client must receive the entire result set; it cannot simply fetch the first few rows and then tell the server to stop sending. This is why, in practice, it is a good habit to keep queries simple and return only the data you need, reducing both the size and the number of packets exchanged. It is also one of the reasons to avoid SELECT * and to add LIMIT to queries.

② Query Cache. Before parsing a query, if the query cache is turned on, MySQL checks whether the statement matches data in the query cache. If it does, the result is returned directly from the cache after a one-time check of the user’s permissions. In this case the query is never parsed, no execution plan is generated, and nothing is executed.

MySQL stores the cache in a reference table (don’t think of it as a database table; think of it as a data structure similar to a HashMap), indexed by a hash computed from the query text itself, the current database, the client protocol version, and other information that might affect the result. So any single-character difference between two queries (spaces, comments, and so on) causes a cache miss.
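The lookup can be pictured as a plain hash map keyed by the raw query text plus context. This is a toy sketch of the idea, not MySQL internals — the key components and names here are illustrative:

```python
import hashlib

def cache_key(sql: str, database: str, protocol_version: int) -> str:
    """Hash the raw query text plus context that could change the result."""
    raw = f"{sql}\x00{database}\x00{protocol_version}"
    return hashlib.sha256(raw.encode()).hexdigest()

query_cache = {}
query_cache[cache_key("SELECT * FROM users", "shop", 10)] = [("alice",), ("bob",)]

# Byte-identical query: cache hit.
hit = cache_key("SELECT * FROM users", "shop", 10) in query_cache
# One extra space changes the hash: cache miss, full parse/execute path.
miss = cache_key("SELECT *  FROM users", "shop", 10) in query_cache
```

Because the key is computed over the exact bytes of the statement, even a cosmetic difference sends the query down the slow path.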

If the query contains any user-defined functions, stored functions, user variables, temporary tables, or system tables in the mysql database, its results will not be cached. For example, functions such as NOW() or CURRENT_DATE() return different results at different times, and queries involving CURRENT_USER or CONNECTION_ID() return different results for different users. Caching such results would make no sense.

Any cache must be invalidated at some point — so when does the query cache invalidate? MySQL’s query cache tracks every table involved in a query, and if a table changes (data or structure), all cached entries associated with that table are invalidated.

Because of this, any write operation forces MySQL to invalidate all cache entries for the affected tables. If the query cache is very large or fragmented, this can be very costly and may even cause the system to freeze for a while. And the overhead of the query cache is not limited to writes; reads pay a cost as well:

1. Every query must be checked against the cache before it runs, even if the statement will never hit the cache.
2. If a query’s result is cacheable, the result is written to the cache after execution, which incurs additional overhead.

Given all this, it is important to understand that the query cache does not always improve system performance. Caching and invalidation both carry costs, and performance only improves when the resources the cache saves exceed the resources it consumes. Evaluating whether turning on the cache is a net win is very difficult and beyond the scope of this article. If the system does have performance problems, you can try enabling the query cache and making some adjustments in the database design, such as:

1. Replace one large table with several smaller ones, being careful not to over-design.
2. Keep the cache space reasonably small; generally, a few dozen megabytes is appropriate.
3. Use SQL_CACHE and SQL_NO_CACHE to control whether an individual query should be cached.

A final piece of advice: do not turn on the query cache lightly, especially for write-intensive applications. If you really can’t resist, set query_cache_type to DEMAND, so that only queries marked with SQL_CACHE are cached and everything else is skipped. This gives you fine-grained control over which queries are cached.

Of course, the query cache system itself is very complex, and this is only a small part of the story. Other, deeper topics — how the cache uses memory, how memory fragmentation is controlled, how transactions affect the query cache, and so on — are left for the reader to explore; this is merely an introduction.

③ Syntax Parsing and Preprocessing. MySQL parses the SQL statement by keywords and builds a corresponding parse tree. In this step the parser mainly validates and interprets the statement using syntax rules: for example, whether a wrong keyword was used or whether the keywords are in the right order. Preprocessing then further checks whether the parse tree is valid according to MySQL’s rules, for example whether the tables and columns being queried actually exist.

④ Query Optimization. The syntax tree produced by the previous step is now known to be valid, and the optimizer turns it into a query plan. In most cases a query can be executed in many different ways, all returning the same result; the optimizer’s job is to find the best one.

MySQL uses a cost-based optimizer: it estimates the cost of each candidate execution plan and chooses the cheapest. You can inspect the Last_query_cost value of the current session to see the estimated cost of the most recent query.

mysql> select * from t_message limit 10;
...

mysql> show status like 'last_query_cost';
+-----------------+-------------+
| Variable_name   | Value       |
+-----------------+-------------+
| Last_query_cost | 6391.799000 |
+-----------------+-------------+

The result in the example means the optimizer estimated that roughly 6,391 random data-page reads would be needed to complete the query. The estimate is based on column statistics such as the number of pages per table or index, index cardinality, the length of indexes and data rows, index distribution, and so on.

There are many reasons MySQL may choose the wrong execution plan: the statistics may be inaccurate; some costs are outside its control (user-defined functions, stored procedures); and MySQL’s notion of “optimal” may differ from ours — we want the shortest execution time, but MySQL picks the plan with the lowest estimated cost, and a low cost does not guarantee a short execution time.

MySQL’s query optimizer is a very complex piece of machinery, and it uses many strategies to generate an optimal execution plan:

1. Reordering joins (when querying multiple tables, MySQL does not necessarily join them in the order written in the SQL; there are also hints to force a particular join order)
2. Optimizing MIN() and MAX() (to find the minimum value of an indexed column, MySQL only needs to read the leftmost end of the B+Tree index; the maximum is found symmetrically at the rightmost end — more on this below)
3. Terminating the query early (for example, once LIMIT has found enough rows, the query stops)
4. Optimizing sorts (older versions of MySQL used a two-pass sort: read the row pointers and sort columns in memory, then fetch the rows in sorted order; newer versions use a single-pass sort that reads all rows at once and sorts them by the given columns, which is far more efficient for I/O-intensive workloads)

As MySQL continues to evolve, the optimization strategies used by the optimizer are also evolving. Here are just a few of the most common and easy to understand optimization strategies.

⑤ Query Execution Engine. After the parsing and optimization stages, MySQL has generated an execution plan, and the query execution engine executes the plan’s instructions step by step to produce the result. Most of the execution work is done by calling interfaces implemented by the storage engine, known as the handler API. Each table in the query is represented by a handler instance.

In fact, MySQL creates a handler instance for each table during the optimization phase; the optimizer uses these instances’ interfaces to obtain information about the tables, including column names, index statistics, and so on. The storage engine interface provides a great deal of functionality through only a few dozen underlying calls, which, like building blocks, can accomplish most of what a query needs.

⑥ Returning Results to the Client. The final stage of query execution is returning the result to the client. Even if the query fails, MySQL still returns information about it, such as the number of rows affected and the execution time.

If the query cache is enabled and the query is cacheable, MySQL stores the result in the cache at this point. Returning the result set to the client is an incremental process: MySQL may begin streaming results as soon as the first row is produced. This way the server does not have to hold the full result set in memory, and the client receives results sooner.

Note that each row in the result set is sent as a packet conforming to the communication protocol described in ①, and is then transmitted over TCP. During transmission, packets may be buffered and sent in batches.

In summary, MySQL query execution breaks down into six steps:

1. The client sends a query request to the MySQL server.
2. The server first checks the query cache; on a hit, it immediately returns the cached result. Otherwise it moves on to the next stage.
3. The server parses the SQL and preprocesses it.
4. The optimizer generates the corresponding execution plan.
5. Following the plan, MySQL calls the storage engine API to execute the query.
6. The result is returned to the client, and cached at the same time if appropriate.

After all this reading, you might be expecting some concrete optimization techniques — and yes, below are suggestions from three different areas. But first, a word of advice: don’t take the “absolute truths” you read about optimization on faith, including the ones in this article; always test your assumptions about execution plans and response times in your real business scenario.

Choose data types that are small and simple. Smaller data types are usually faster: they use less disk and memory and need fewer CPU cycles to process. Simpler data types also cost fewer CPU cycles to compare — for example, integer comparisons are cheaper than character comparisons, so prefer integers for storing IP addresses and DATETIME over strings for storing time. Here are a few points that are often misunderstood:

  1. Changing a NULL column to NOT NULL generally does not help performance much, but if you plan to create an index on a column, you should make it NOT NULL.
  2. Specifying a display width for an integer type, such as INT(11), has no effect. INT uses 32 bits (4 bytes) of storage, so its range is fixed; INT(1) and INT(20) are identical for storage and computation.
  3. UNSIGNED disallows negative values, roughly doubling the upper limit for positive numbers. For example, TINYINT ranges from -128 to 127, while UNSIGNED TINYINT ranges from 0 to 255.
  4. In general there is little need for the DECIMAL type. Even financial data can be stored in BIGINT: if you need accuracy to 1/10,000, multiply the values by 10,000 and store them as BIGINT. This avoids both the inaccuracy of floating-point arithmetic and the cost of exact DECIMAL computation.
  5. TIMESTAMP uses 4 bytes of storage and DATETIME uses 8. Consequently TIMESTAMP can only represent the years 1970 through 2038, a much smaller range than DATETIME, and a TIMESTAMP value varies by time zone.
  6. In most cases there is no need for an ENUM type. One drawback is that the list of enumeration strings is fixed: adding or removing an option requires ALTER TABLE (though simply appending to the end of the list does not rebuild the table).
  7. Don’t put too many columns in a schema. The storage engine API copies data between the server layer and the storage engine layer in a row-buffer format, and the server then decodes the buffer into columns. This conversion is expensive; with many columns and only a few actually used, CPU usage can climb.
  8. ALTER TABLE is expensive: MySQL creates an empty table with the new structure, copies all rows from the old table into it, and then deletes the old table. This is especially painful when memory is scarce and the table is large with big indexes. There are tricks for working around this; interested readers can look them up.
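Point 4 above — storing exact financial values as scaled integers instead of DECIMAL — can be sketched as follows. The 1/10,000 scale factor is the example from the text; the helper names are ours:

```python
SCALE = 10_000  # store amounts as integer ten-thousandths, BIGINT-friendly

def to_scaled(amount_str: str) -> int:
    """Parse a decimal string into an exact scaled integer."""
    whole, _, frac = amount_str.partition(".")
    frac = (frac + "0000")[:4]  # pad/truncate to 4 decimal places
    sign = -1 if whole.startswith("-") else 1
    return int(whole) * SCALE + sign * int(frac)

def add(a: int, b: int) -> int:
    return a + b  # exact integer arithmetic, no float rounding error

# 0.1 + 0.2 is exactly 0.3 in scaled integers, unlike binary floats.
total = add(to_scaled("0.1"), to_scaled("0.2"))
```

The same trick works in SQL: keep a BIGINT column of scaled units and divide by the scale only when displaying.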

Indexes are one of the most important ways to improve MySQL query performance, but too many indexes cause high disk and memory usage and hurt the application’s overall performance. Try to avoid adding indexes as an afterthought: you may have to monitor a large amount of SQL to find the problem, and retrofitting an index takes far longer than adding it up front. Designing indexes well is a genuine skill.

What follows is a series of strategies for creating a high performance index and how each strategy works. But before you do that, it will be helpful to understand some of the algorithms and data structures associated with indexing.

The B-Tree index is the most common and most effective index for finding data in relational databases, and most storage engines support it. The term B-Tree is used because MySQL uses that keyword in CREATE TABLE and other statements, but different storage engines may actually use different data structures underneath — InnoDB, for example, uses a B+Tree.

Note that a B+Tree index cannot find the specific row for a given key value; it only finds the page containing the row. The database then reads that page into memory and searches within it to find the target data.

Before introducing the B+Tree, let’s look at the binary search tree, a classic data structure: every value in the left subtree is less than the root’s value, and every value in the right subtree is greater, as shown in Figure ①. To find the record with value 5 in such a tree, the process is roughly: start at the root, 6, which is greater than 5, so search the left subtree and find 3; 3 is less than 5, so search 3’s right subtree and find 5 — three comparisons in total. Likewise, finding the record with value 8 also takes three comparisons.

So the average number of comparisons in the binary search tree is (3 + 3 + 3 + 2 + 2 + 1) / 6 ≈ 2.3. With sequential search, finding the record with value 2 takes only 1 comparison, but finding the record with value 8 takes 6, so the average is (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5. A binary search tree is therefore faster than sequential search on average in most cases.
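The averages above can be checked with a few lines of code. This sketch builds a tree of the shape we assume the figure shows (root 6, with 3 and 7 below it, then 2, 5, 8 on the bottom level):

```python
class Node:
    def __init__(self, value):
        self.value, self.left, self.right = value, None, None

def insert(root, value):
    """Standard binary-search-tree insertion."""
    if root is None:
        return Node(value)
    if value < root.value:
        root.left = insert(root.left, value)
    else:
        root.right = insert(root.right, value)
    return root

def comparisons(root, target):
    """Number of nodes visited to find target (1 for the root itself)."""
    count, node = 0, root
    while node is not None:
        count += 1
        if target == node.value:
            return count
        node = node.left if target < node.value else node.right
    raise KeyError(target)

values = [6, 3, 7, 2, 5, 8]  # insertion order reproducing the assumed shape
root = None
for v in values:
    root = insert(root, v)

avg_tree = sum(comparisons(root, v) for v in values) / len(values)       # 14/6
avg_sequential = sum(range(1, len(values) + 1)) / len(values)            # 3.5
```

Finding 5 indeed costs three comparisons, and the tree’s average beats the sequential average.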

Since a binary search tree can be constructed in many ways, the same values can also produce the tree shown in Figure ②. Obviously that tree’s query efficiency is about the same as a sequential search. To guarantee the best performance, the binary search tree must be kept balanced — that is, a balanced binary tree (AVL tree).

A balanced binary tree must first satisfy the definition of a binary search tree, and additionally the heights of any node’s two subtrees must differ by at most 1. Clearly Figure ② does not satisfy this, while Figure ① is a balanced binary tree. A balanced binary tree performs lookups well (the best case being an optimal binary tree), but the better the query performance, the higher the maintenance cost. For example, when a new node with value 9 is inserted into the balanced tree of Figure ①, the following changes are needed.

The simplest case requires only a single left rotation to restore balance after the insertion; in practice, many rotations may be needed. At this point we can ask: if a balanced binary tree has good lookup efficiency, is simple to implement, and has acceptable maintenance cost, why doesn’t a MySQL index use one directly?

As the data in a database grows, the index grows with it, and it cannot all be held in memory, so the index is usually stored on disk as index files. Index lookups then incur disk I/O, which is orders of magnitude slower than memory access.

Imagine the depth of a binary tree with millions of nodes. If a tree that deep lived on disk and every node visit cost a disk I/O, the total lookup time would clearly be unacceptable. So how do we cut the number of I/O accesses per lookup?

An effective solution is to reduce the depth of the tree by turning the binary tree into an M-way search tree — and the B+Tree is exactly that. To understand the B+Tree you only need its two most important properties. First, all keys (think of them as the data) are stored in leaf pages; non-leaf pages hold no real data, and all record nodes are stored, ordered by key, in leaf nodes on the same level. Second, all leaf nodes are linked together by pointers. A simplified B+Tree of height 2 is shown below.

How should we understand these two properties? MySQL sets each node’s size to an integer multiple of a page (the reason is explained below), which means that for a given node size each node can hold more keys, so each node can index a larger, more precise range of data.

The benefit of linking all leaf nodes with pointers is range access: in the figure above, to find records greater than 20 and less than 30, you only need to locate node 20 and then follow the leaf pointers to visit 25 and 30 in turn. Without the link pointers there would be no range lookup. This is a key reason MySQL uses the B+Tree as its index storage structure.
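The leaf-chain walk can be sketched in a few lines — a toy model of linked leaf pages, not InnoDB’s actual page format:

```python
class LeafPage:
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.next = None  # pointer to the right sibling leaf

def range_scan(first_leaf, low, high):
    """Walk the leaf chain once, collecting keys strictly between low and high."""
    result, leaf = [], first_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k >= high:
                return result  # keys are ordered, so we can stop early
            if k > low:
                result.append(k)
        leaf = leaf.next  # follow the sibling pointer instead of re-descending
    return result

# Two adjacent leaves, as in the figure: ... 15 20 | 25 30 ...
a, b = LeafPage([10, 15, 20]), LeafPage([25, 30, 35])
a.next = b
found = range_scan(a, 20, 35)  # records greater than 20 and less than 35
```

One descent into the tree finds the starting leaf; the sibling pointers then deliver the rest of the range without any further tree traversal.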

To see why MySQL sets the node size to an integer multiple of a page, you need a little background on disk storage. Disk access is far slower than main-memory access, and mechanical drives add the latency of physical movement — disk access speed is often on the order of ten-thousandths of memory speed. To minimize disk I/O, disks do not read only what is asked for; they read ahead. Even if only one byte is needed, the disk starts at that position and reads a block of data of a certain length into memory, and the read-ahead length is usually an integer multiple of the page size.

A page is the logical unit in which the computer manages memory: hardware and operating systems divide main memory and disk storage into contiguous, equally sized blocks, each called a page (on many operating systems a page is typically 4 KB). Main memory and disk exchange data in whole pages.

When a program reads data that is not in main memory, a page fault is triggered: the system sends a read signal to the disk, the disk finds the starting position of the data and reads one or more consecutive pages into memory, and then the program continues running.

MySQL exploits this read-ahead behavior by setting the size of a node equal to one page, so each node can be fully loaded with a single I/O. To achieve this, every time a new node is created, a full page of space is allocated, guaranteeing that a node is physically stored within one page; and since storage allocation is page-aligned, reading a node takes exactly one I/O.

If the height of the B+Tree is h, a lookup requires at most h − 1 I/O operations (the root stays resident in memory), and the complexity is O(h) = O(log_M N). In real-world scenarios M is usually large, often over 100, so the tree height is small — usually no more than 3.
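The log_M N relationship is easy to verify numerically. This sketch computes the minimum height needed for a given row count and fanout (the fanout of 100 is an assumed round number, per the text):

```python
import math

def btree_height(n_records: int, fanout: int) -> int:
    """Minimum height h such that fanout**h >= n_records, i.e. ceil(log_M N)."""
    return math.ceil(math.log(n_records) / math.log(fanout))

# With ~100 keys per page, ten million rows fit in a tree of height 4,
# so a lookup costs only a handful of page reads (fewer with cached upper levels).
h = btree_height(10_000_000, 100)
```

This is why even huge tables can be searched in a few page reads: height grows with the logarithm of the row count, with a very large base.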

Finally, let’s briefly walk through B+Tree node operations to get a general sense of how an index is maintained. An index can greatly improve query efficiency, but maintaining it has a real cost, which is why creating sensible indexes matters so much.

Using the tree above as an example, assume each node can hold at most four keys. First insert node 28, as shown in the figure below.

Next insert node 70. Searching the index page shows it belongs in the leaf node between 50 and 70, but that leaf is full, so the leaf node must be split.

Finally, insert a node 95. At this point, both Index Page and Leaf Page are full, and you need to do two splits, as shown in the figure below.

After the splits, the resulting tree looks like this.

To stay balanced, a B+Tree may perform many page splits for newly inserted values, and page splits require I/O operations. To minimize splits, the B+Tree also provides a rotation feature similar to a balanced binary tree’s.

When a leaf page is full but its left or right sibling is not, the B+Tree does not rush to split; it moves records into a sibling page instead. Usually the left sibling is checked first for rotation. In the second example above, inserting 70 would then not split the page but rotate records into the left sibling.

Rotation minimizes page splits and reduces disk I/O during index maintenance, improving maintenance efficiency. Note that deleting a node is similar to inserting one — rotations and splits may still be needed — but we won’t cover it here.

You now have a general picture of the B+Tree data structure, but how does an index actually organize data storage in MySQL? As a simple example, suppose you have the following table:

CREATE TABLE People(
    last_name  varchar(50) not null,
    first_name varchar(50) not null,
    dob        date        not null,
    gender     enum('m','f') not null,
    key(last_name, first_name, dob)
);

For each row in the table, the index contains the values of the last_name, first_name, and dob columns. The figure below shows how the index organizes the data storage.

As you can see, the index is sorted first by last_name; when last names are equal, by first_name; and when those are equal too, by dob, the date of birth. This is the origin of the index’s “leftmost prefix” rule.
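The sort order of such an index can be mimicked with sorted tuples — a toy sketch (with made-up names) of why a (last_name, first_name, dob) index answers last_name-prefix lookups cheaply, while a first_name-only search must examine everything:

```python
import bisect
from datetime import date

# Entries sorted exactly as a (last_name, first_name, dob) index would be.
index = sorted([
    ("Allen", "Cuba", date(1960, 1, 1)),
    ("Allen", "Kim", date(1930, 7, 12)),
    ("Barrymore", "Julia", date(2000, 5, 16)),
    ("Basinger", "Vivien", date(1976, 12, 8)),
])

def lookup_last_name(name):
    """Leftmost-prefix lookup: binary search works because last_name sorts first."""
    lo = bisect.bisect_left(index, (name,))
    hi = bisect.bisect_right(index, (name, chr(0x10FFFF)))
    return index[lo:hi]

rows = lookup_last_name("Allen")             # uses the sort order: O(log n)
# A first_name-only search cannot use the order; it must scan every entry.
scan = [r for r in index if r[1] == "Kim"]
```

Skipping the leading column throws the ordering away, which is exactly what the leftmost-prefix rule says about real multi-column indexes.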

1. Use independent columns. “Independent columns” means an index column cannot be part of an expression or an argument to a function — otherwise MySQL cannot use the index. For example, select * from t where id + 1 = 5 cannot use an index on id, but the equivalent select * from t where id = 4 can.

2. Prefix indexes. If an indexed column is very long, you can index just its leading part. This saves index space and improves index efficiency.

3. Multi-column indexes and index order. In most cases, creating separate indexes on several columns does not improve query performance. The reason is simple: MySQL does not know which single-column index would be most efficient for a given query, so older versions such as MySQL 5.0 would simply pick one of them, while newer versions use an index-merge strategy. As a simple example, suppose a film_actor table has separate indexes on actor_id and film_id, and we run the following query:

select film_id,actor_id from film_actor where actor_id = 1 or film_id = 1

Previous versions of MySQL randomly selected an index, but the new version optimizes it as follows:

select film_id,actor_id from film_actor where actor_id = 1 

union all

select film_id,actor_id from film_actor where film_id = 1 and actor_id <> 1

When multiple indexes are intersected (several AND conditions), a single index containing all the relevant columns is generally better than multiple independent indexes.

When multiple indexes are unioned (several OR conditions), merging and sorting the result sets consumes a lot of CPU and memory — especially when some of the indexes are not very selective and large amounts of data must be returned and merged. In that case, a full table scan may actually be cheaper.

So if you see “Using union” in the Extra column of EXPLAIN output, check whether the query and table structure are really optimal. If there is nothing wrong with either, it means the indexes themselves are badly designed, and you should reconsider them carefully. A multi-column index covering all the relevant columns may well be more appropriate.

We mentioned earlier how indexes organize data storage. As the figure showed, the order of columns in a multi-column index is critical for queries. It seems obvious to put the most selective column first, so that the first column filters out most of the non-matching rows.

Index selectivity is the ratio of the number of distinct index values to the total number of rows in the table. The higher the selectivity, the more rows MySQL can filter out during a lookup. A unique index has a selectivity of 1 — the best possible — and gives the best performance.

With the concept of index selectivity in hand, figuring out which column is more selective is easy — just measure it. For example:

SELECT * FROM payment where staff_id = 2 and customer_id = 584

Should the index be (staff_id, customer_id), or the reverse order? Run the following query, and put whichever column’s selectivity is closer to 1 first.

select count(distinct staff_id)/count(*) as staff_id_selectivity,

       count(distinct customer_id)/count(*) as customer_id_selectivity,

       count(*) from payment
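The ratio that query computes can be reproduced on a toy data set — a sketch with made-up rows, not real payment data:

```python
from fractions import Fraction

# Toy payment rows: (staff_id, customer_id)
payments = [(1, 101), (1, 102), (2, 103), (2, 104), (1, 101), (2, 105)]

def selectivity(rows, column):
    """count(distinct col) / count(*): closer to 1 means more selective."""
    return Fraction(len({r[column] for r in rows}), len(rows))

staff_sel = selectivity(payments, 0)      # 2 distinct staff over 6 rows
customer_sel = selectivity(payments, 1)   # 5 distinct customers over 6 rows

# customer_id is far more selective, so it should lead the multi-column index.
best_first = "customer_id" if customer_sel > staff_sel else "staff_id"
```

On this sample, staff_id’s selectivity is 1/3 and customer_id’s is 5/6, so (customer_id, staff_id) would be the better column order.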

This rule of thumb works most of the time, but watch out for special cases in your data. For example, suppose you want to query the users in a given user group who have made a trade:

select user_id from trade where user_group_id = 1 and trade_amount > 0

For this query, MySQL would pick the index (user_group_id, trade_amount). That seems fine if you ignore special cases — but in fact most of the data in this table was migrated from an old system, and all migrated rows were assigned a default user group. For those rows, an index scan reads essentially as many rows as a full table scan, and the index is nearly useless.

Broadly speaking, rules of thumb and corollaries like these are useful guides for most development and design work, but the real world is usually messier, and one special case in a real business scenario can undermine your whole design.

4. Avoid multiple range conditions. In real development we often use multiple range conditions — for example, to query users within an age range who have logged in within a certain period:

select user.* from user where login_time > '2017-04-01' and age between 18 and 30;

There is a problem with this query: it has two range conditions, the login_time column and the age column. MySQL can use either the login_time column index or the age column index, but not both.

5. Covering indexes. If an index contains (covers) the values of all the columns a query needs, there is no need to go back to the table at all. This is called a covering index. Covering indexes are a very powerful tool and can greatly improve performance, because a query that only scans the index gains several benefits:

Index entries are stored in order of column values, so for I/O-intensive range queries, scanning the index needs far less I/O than randomly reading each row from disk.
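The effect is easy to see in any database that reports its plans. The sketch below uses SQLite (not MySQL — its EXPLAIN QUERY PLAN output is convenient to inspect from a script) to show a plan that never touches the table, only the index:

```python
import sqlite3

# SQLite stands in here purely to demonstrate the idea: when an index contains
# every column a query touches, the table itself is never read.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payment (staff_id INT, customer_id INT, amount INT)")
conn.execute("CREATE INDEX idx_staff_cust ON payment (staff_id, customer_id)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT staff_id, customer_id FROM payment WHERE staff_id = 2"
).fetchall()
detail = plan[0][-1]  # plan text mentions a COVERING INDEX, no table access
```

In MySQL the equivalent signal is “Using index” in the Extra column of EXPLAIN output.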

6. Use index scans for sorting. There are two ways to produce an ordered result set: sort the result set, or return the results in index-scan order. If the type column in EXPLAIN output shows “index”, an index scan was used for the sort.

Scanning the index itself is fast, since it only requires moving from one index record to the next adjacent one. But if the index does not cover all the columns the query needs, then every index record scanned requires a lookup back to the table for the corresponding row. That access is essentially random I/O, so reading data in index order is usually slower than a sequential full table scan. When designing indexes, it is best if one index can serve both the sort and the lookup.

MySQL can use an index to sort results only if the index's columns are in exactly the same order as the ORDER BY clause and all columns are sorted in the same direction. If the query joins multiple tables, ORDER BY can be satisfied by an index only when all of the fields it references come from the first table.

The ORDER BY clause has the same limitation as lookup queries: it must satisfy the leftmost-prefix requirement of the index (with one exception: when the leftmost column is fixed to a constant, as in the simple example below). In all other cases MySQL must perform a sort operation and cannot use the index for ordering.

-- The leftmost column is constant, index: (date, staff_id, customer_id)
select staff_id, customer_id from demo where date = '2015-06-01' order by staff_id, customer_id;
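Why does pinning the leftmost column to a constant work? Within the index entries that share that constant, the remaining columns are already sorted. A small Python model of the index (date, staff_id, customer_id), with made-up data:

```python
# The index (date, staff_id, customer_id) modeled as a sorted list of tuples.
index = sorted([
    ("2015-06-01", 3, 9), ("2015-06-01", 1, 4), ("2015-06-02", 2, 7),
    ("2015-05-30", 5, 1), ("2015-06-01", 1, 2),
])

# With date fixed to a constant, the matching entries come out already
# ordered by (staff_id, customer_id) -- no extra sort step is needed.
rows = [(s, c) for d, s, c in index if d == "2015-06-01"]
assert rows == sorted(rows)
print(rows)  # [(1, 2), (1, 4), (3, 9)]
```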

7. Redundant and duplicate indexes

Duplicate indexes are indexes of the same type created on the same columns in the same order. Avoid them, and delete them as soon as they are discovered. A redundant index is slightly different: if there is an index (A, B), then an index (A) is redundant, because (A) is a prefix of (A, B). Redundant indexes usually appear when a new index is added to a table, for example when someone creates a new index (A, B) instead of extending the existing index (A).

In most cases you should try to extend existing indexes rather than create new ones. But occasionally performance considerations justify a redundant index, for example when extending an existing index would make it so large that it hurts the other queries that use it.

It is good practice to periodically delete indexes that have not been used for a long time. One more point before moving on: indexes are not always the best tool, and they are only effective when the benefit of speeding up queries outweighs the extra work they bring. For very small tables, a simple full table scan is more efficient.

For medium to large tables, indexes are very useful. For very large tables, the cost of creating and maintaining indexes increases, and other techniques, such as partitioned tables, may be more effective. Finally, running EXPLAIN and then verifying with real tests is a virtue.

Optimizing Specific Types of Queries

Optimizing COUNT() Queries

COUNT() is probably the most misunderstood function. It does two different things: count the number of values in a column, or count the number of rows. When counting values in a column, the values must be non-NULL; NULLs are not counted. If the expression in parentheses can never be NULL, you are effectively counting rows. The simplest case is COUNT(*): it does not expand to all the columns as you might expect; in fact it ignores the columns entirely and simply counts the rows.

The most common mistake here is to specify a column in parentheses while expecting the number of rows, often in the mistaken belief that it performs better. If you want to count rows, use COUNT(*): its meaning is clear and it performs better. Sometimes a business scenario does not need a perfectly accurate count, and an approximation will do; the row estimate from EXPLAIN is a good one, and running EXPLAIN is very cheap because it does not actually execute the query. In general, COUNT() has to scan a large number of rows to produce an accurate result, so it is hard to optimize, and the only thing you can do at the MySQL level is use a covering index. If that is not enough, you have to work at the architecture level, such as adding summary tables or an external caching system like Redis.
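The NULL-skipping behavior is easy to verify. A minimal sketch using SQLite from Python's standard library (MySQL behaves the same way here); the table name is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (val INT)")
conn.executemany("INSERT INTO t (val) VALUES (?)", [(1,), (2,), (None,), (None,)])

rows = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]    # counts rows
vals = conn.execute("SELECT COUNT(val) FROM t").fetchone()[0]  # skips NULLs
print(rows, vals)  # 4 2
```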

Optimizing Joined Queries

In big-data scenarios, linking tables through a redundant field performs better than a direct JOIN. If you do need a joined query, pay special attention to the following:

  1. Make sure the columns in the ON or USING clauses are indexed, and consider the join order when creating those indexes. When table A and table B are joined on column c, and the optimizer joins them in the order A, B, there is no need to index the corresponding column on table A: unused indexes are just extra burden. In general you only need to create an index on the corresponding column of the second table in the join order, unless you have another reason (discussed below).
  2. Make sure that any expressions in GROUP BY and ORDER BY only refer to the columns in one table so that MySQL can use indexes to optimize.

To understand the first tip, you need to know how MySQL executes joins. MySQL's current join execution strategy is simple: it performs a nested-loop join for every join. That is, it loops over one table fetching a single row, then loops over the next table looking for matching rows, and so on, until it has found matching rows in all of the joined tables. It then builds the columns the query asked for from the rows matched in each table.

Too abstract? Let's make it concrete. Take a query like this:

SELECT A.xx, B.yy FROM A INNER JOIN B USING(c) WHERE A.xx IN (5,6);

Given that MySQL performs association operations in the order A, B in the query, the following pseudocode represents how MySQL completes the query:

outer_iterator = SELECT A.xx, A.c FROM A WHERE A.xx IN (5,6);
outer_row = outer_iterator.next;
while (outer_row) {
    inner_iterator = SELECT B.yy FROM B WHERE B.c = outer_row.c;
    inner_row = inner_iterator.next;
    while (inner_row) {
        output[outer_row.xx, inner_row.yy];
        inner_row = inner_iterator.next;
    }
    outer_row = outer_iterator.next;
}

As you can see, the outermost query filters on the A.xx column, so even if there is an index on A.c, this join will not use it. The inner query, on the other hand, clearly benefits from an index on B.c. This is why you generally only need to create an index on the corresponding column of the second table in the join order.
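The pseudocode above translates almost line for line into runnable Python. This sketch keeps the names (A, B, xx, yy, c) from the example and uses in-memory lists in place of tables; the inner loop is exactly where an index on B.c would pay off:

```python
A = [{"xx": 5, "c": 1}, {"xx": 6, "c": 2}, {"xx": 7, "c": 1}]
B = [{"yy": "a", "c": 1}, {"yy": "b", "c": 2}, {"yy": "d", "c": 1}]

result = []
for outer_row in (r for r in A if r["xx"] in (5, 6)):  # outer scan of A
    # Without an index, this inner lookup scans all of B for every outer row;
    # an index on B.c would turn it into a direct lookup.
    for inner_row in (r for r in B if r["c"] == outer_row["c"]):
        result.append((outer_row["xx"], inner_row["yy"]))

print(result)  # [(5, 'a'), (5, 'd'), (6, 'b')]
```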

Optimizing LIMIT Pagination

When paging is required, it is usually implemented with a LIMIT plus an offset, together with an appropriate ORDER BY clause. With a suitable index this is usually efficient; otherwise MySQL has to do a lot of filesorts.

A common problem arises when the offset is very large. For a query such as LIMIT 10000, 20, MySQL fetches 10020 records, returns 20 of them, and throws away the first 10000, which is very expensive.

One of the easiest ways to optimize such queries is to use a covering index scan wherever possible, rather than querying all the columns, and then join back to the original table to fetch the remaining columns. For large offsets the efficiency gain is huge. Consider the following query:

SELECT film_id,description FROM film ORDER BY title LIMIT 50,5;

If the table is very large, the query should look something like this:

SELECT film.film_id, film.description FROM film INNER JOIN (SELECT film_id FROM film ORDER BY title LIMIT 50,5) AS tmp USING(film_id);

The deferred join here greatly improves efficiency: it lets MySQL scan as few pages as possible in the subquery to locate the needed records, and then joins back to the original table on the primary key to fetch the required columns.
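The effect is easy to check end to end. The sketch below uses SQLite from Python's standard library as a stand-in for MySQL, with an invented film table, and confirms the deferred join returns the same rows as the direct query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE film (film_id INTEGER PRIMARY KEY, title TEXT, description TEXT)")
conn.executemany("INSERT INTO film (title, description) VALUES (?, ?)",
                 [(f"title{i:03d}", f"desc{i}") for i in range(100)])

direct = conn.execute(
    "SELECT film_id, description FROM film ORDER BY title LIMIT 50, 5"
).fetchall()

# Deferred join: the subquery only needs (title, film_id); the wide
# description column is fetched for just the 5 surviving rows.
deferred = conn.execute(
    "SELECT film.film_id, film.description FROM film "
    "INNER JOIN (SELECT film_id FROM film ORDER BY title LIMIT 50, 5) AS tmp "
    "ON film.film_id = tmp.film_id"
).fetchall()

assert sorted(direct) == sorted(deferred)
print(sorted(direct)[0])  # (51, 'desc50')
```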

Sometimes you can also convert a LIMIT with an offset into a query at a known position, which MySQL can execute as a range scan on the index. For example, if the primary key values are contiguous, a query such as SELECT id FROM t LIMIT 10000, 10; can be rewritten as:

SELECT id FROM t WHERE id > 10000 LIMIT 10;
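The difference between the two forms can be sketched with a sorted list standing in for the primary-key index; note the rewrite is only equivalent when the ids are contiguous, with no gaps:

```python
from bisect import bisect_right

ids = list(range(1, 10021))  # contiguous primary keys 1..10020

# LIMIT 10000, 10: the engine walks past 10000 entries before keeping 10.
offset_scan = ids[10000:10010]

# WHERE id > 10000 LIMIT 10: one index seek, then 10 adjacent entries.
start = bisect_right(ids, 10000)
range_scan = ids[start:start + 10]

assert offset_scan == range_scan == list(range(10001, 10011))
```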

Other optimizations include using precomputed summary tables, or associating with a redundant table that contains only primary key columns and columns that need to be sorted.

Optimizing UNION

MySQL executes UNION by first creating a temporary table, inserting the results of each query into it, and then querying that table. Many optimization strategies therefore work poorly with UNION queries, and you often need to manually "push" the WHERE, LIMIT, ORDER BY and similar clauses into each subquery so that the optimizer can take full advantage of them.

Use UNION ALL unless you really need the server to deduplicate the results. Without the ALL keyword, MySQL adds a DISTINCT option to the temporary table, which forces a uniqueness check on the whole table and is very expensive. Of course, even with ALL, MySQL always materializes the result into a temporary table, reads it back, and returns it to the client, although this is often unnecessary (for example, MySQL could in principle return each subquery's results directly to the client).
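A quick demonstration of what the deduplication costs buy you, again with SQLite from the Python standard library (table names invented): UNION must compare every row for uniqueness, while UNION ALL simply concatenates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (x INT)")
conn.execute("CREATE TABLE b (x INT)")
conn.executemany("INSERT INTO a (x) VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO b (x) VALUES (?)", [(2,), (3,), (4,)])

union_all = conn.execute("SELECT x FROM a UNION ALL SELECT x FROM b").fetchall()
union_dedup = conn.execute("SELECT x FROM a UNION SELECT x FROM b").fetchall()

print(len(union_all), len(union_dedup))  # 6 4
```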

Conclusion

Understanding how queries are executed and where the time is spent, together with some knowledge of the optimization process, will help you better understand MySQL and the principles behind common optimization techniques. I hope the principles and examples in this article help you connect theory and practice, and turn more theoretical knowledge into real-world results.

That's all I have to say, but I'll leave you with two questions to think about. They are things people often repeat, yet very few stop to ask why:

  1. Many programmers share the view that stored procedures should be avoided where possible: they are hard to maintain, costly to use, and business logic belongs on the application side. If the client can do all these things, why do stored procedures exist at all?
  2. JOINs are convenient enough, and a direct query works fine, so why do views exist?