4. Query performance optimization

4.1 Why Are Queries Slow

  • What matters most for query performance is response time.
  • If you think of a query as a task, it consists of a series of subtasks, each of which consumes a certain amount of time. Optimizing the query really means optimizing its subtasks: eliminate some of them, reduce the number of times they execute, or make them run faster. (Sometimes you also need to modify a query to reduce its impact on the other queries running on the system; in that case you are reducing its resource consumption.)
  • Query lifecycle:
    • From the client
    • To the server
    • On the server:
      • parsing
      • Generate an execution plan
      • Execution: The most important stage. It includes a large number of calls to the storage engine to retrieve data, plus processing of that data afterwards, such as sorting and grouping.
    • Returns the result to the client
  • To accomplish these tasks, time is spent in many different places: networking, CPU computation, generating statistics and execution plans, lock waits (mutex waits), and especially calls to the underlying storage engine to retrieve data, which cost memory and CPU time, plus I/O time when the data is not in memory. Depending on the storage engine, there may also be a lot of context switching and system calls.

4.2 Slow Query Basics: Optimize Data Access

The most fundamental reason for poor query performance is that too much data is accessed. Some queries may inevitably require filtering large amounts of data, but this is not common. Most low-performing queries can be optimized by reducing the amount of data accessed. The following two steps can be used to analyze inefficient queries:

  • Verify that your application is retrieving more data than it needs. This usually means that too many rows are accessed, but sometimes it can also mean that too many columns are accessed.
  • Verify that the MySQL server layer is analyzing far more rows than necessary.

4.2.1 Whether unnecessary data is requested from the database

Some queries request more data than they actually need, and the excess data is then discarded by the application. This places an additional burden on the MySQL server and increases network overhead, as well as consuming the CPU and memory resources of the application server.

Typical cases:

  • Querying unneeded records: A common mistake is to assume that MySQL returns only the data you need, when in fact MySQL computes and returns the entire result set. The simplest and most effective fix is to append a LIMIT to such queries.
  • Returning all columns in a multi-table join: the correct approach is to select only the required columns.
  • Always fetch all columns:
    • Fetching all columns prevents the optimizer from using optimizations such as covering index scans, and also costs the server extra resources. Be careful.
    • However, fetching all columns can simplify development, allow reuse of the same code fragment, or suit an application-level caching mechanism. If you understand the performance impact of doing this, it can be an acceptable trade-off.
  • Repeatedly querying the same data: cache the data the first time it is queried and read it from the cache afterwards.
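The last point ("cache data queried repeatedly") can be sketched in application code. This is a minimal illustration; `user_name` and `CALLS` are hypothetical names standing in for a real data-access layer, not MySQL APIs:

```python
from functools import lru_cache

# Illustrative stand-in for a real database call; on a cache miss the
# function would run the SQL, on a hit the cached row is returned.
CALLS = {"count": 0}

@lru_cache(maxsize=128)
def user_name(user_id):
    CALLS["count"] += 1          # each cache miss "hits the database" once
    return f"user-{user_id}"     # pretend this row came from MySQL

first = user_name(42)    # miss: executes the "query"
second = user_name(42)   # hit: served from the cache, no second query
```

With a real driver, the decorated function would execute the SQL only on a miss; the cache key is simply the function's arguments.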

4.2.2 Whether MySQL is scanning for additional records

With MySQL, the three simplest measures of query overhead are response time, number of rows scanned, and number of rows returned. No metric is a perfect measure of the cost of a query, but they give you a rough idea of how much data MySQL needs to access when executing queries internally, and can give you a rough idea of how long queries will take to run. All three metrics are recorded in the MySQL slow log, and checking the slow log is a good way to find queries that scan too many rows.

  • Response time:
    • Response time = service time + queue time. Unfortunately, there is no practical way to break response time down into these two parts, or to measure them separately.
      • Service time: The actual time spent by the database processing the query.
      • Queue time: The time that the server does not actually execute a query because it is waiting for some resource.
    • When you see a query's response time, evaluate whether it is reasonable. In a nutshell: know what indexes the query needs and what its execution plan is, estimate roughly how many sequential and random I/Os are required, multiply by the per-I/O time on the specific hardware, and add up the pieces to get a reference value.
  • Number of rows scanned and number of rows returned
    • When analyzing a query, it is helpful to look at the number of rows it scans. To a certain extent, this indicates how efficiently the query finds the data it needs.
    • This metric is not perfect, however, because not all rows have the same access costs.
    • Ideally, the number of rows scanned and the number returned would be the same, but in practice this is rare (consider joins, for example). The ratio of rows scanned to rows returned is usually between 1:1 and 10:1.
  • Number of rows scanned and type of access
    • When evaluating the cost of a query, consider the cost of finding a row of data from a table.
    • Access type (the type column in EXPLAIN output; the rows column shows the number of rows scanned):
      • ref
      • ALL(Full table scan)
    • MySQL applies the WHERE condition as follows:
      • Apply the WHERE condition against the index to filter out non-matching records. This happens in the storage engine.
      • Use a covering index scan ("Using index" appears in the Extra column) to return records, filtering out unwanted records directly from the index and returning the hits. This happens at the MySQL server layer without going back to the table.
      • Return data from the table, then filter out records that do not meet the criteria ("Using where" appears in the Extra column). This happens at the MySQL server layer: MySQL first reads records from the table and then filters them.
    • Optimization tips:
      • Use a covering index: put all required columns into the index so that the storage engine can return the result without going back to the table to fetch the corresponding rows.
      • Change the schema. For example, use separate summary tables.
      • Rewrite this complex query so that the MySQL optimizer can execute the query in a more optimized manner.
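The back-of-envelope response-time estimate described under "Response time" above can be sketched as simple arithmetic. The per-I/O timings are assumed sample values for a spinning disk, not measurements from any real system:

```python
# Back-of-envelope response-time estimate: count the I/Os the execution
# plan implies, then multiply by per-I/O times for the hardware.
# The timings below are assumed sample values, not measurements.
SEQ_IO_US = 100       # one sequential page read, in microseconds
RAND_IO_US = 10_000   # one random page read (seek + rotation)

def estimate_us(sequential_ios, random_ios):
    return sequential_ios * SEQ_IO_US + random_ios * RAND_IO_US

# e.g. a plan needing 100 sequential index-page reads and 50 random
# row lookups: 100*100us + 50*10_000us = 510_000us, about half a second
cost_us = estimate_us(100, 50)
```

If the measured response time is far above such a reference value, the query is waiting on something (queueing, locks) or the plan is worse than assumed.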

4.3 Reconstructing the Query Mode

4.3.1 One Complex Query or Multiple Simple Queries

  • MySQL is designed to be lightweight in connection and disconnection, and efficient in returning a small query result.
  • MySQL can internally scan millions of rows per second in memory, but returning rows to the client is much slower.
  • Use as few queries as possible, but in some scenarios it may be necessary to decompose a large query into several smaller ones.

4.3.2 Split Query

  • Sometimes you need to split a large query into smaller queries that are functionally identical but each complete only a small portion of the work, returning only part of the result at a time.
  • For example, when deleting old data, chunk the query to minimize the performance impact and reduce MySQL replication lag. Deleting 10,000 rows at a time is generally efficient while keeping the impact on the server small. With a transactional engine, many small transactions are often more efficient than one large one. In addition, pausing briefly after each delete spreads the one-time pressure over a longer period, greatly reducing the impact on the server and shortening how long locks are held during the delete.
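The chunked-delete pattern above can be sketched as follows. `execute` stands in for a real database-driver call that returns the number of affected rows, and the table/column names in the SQL are illustrative:

```python
import time

def delete_in_chunks(execute, chunk_size=10_000, pause_s=0.0):
    """Repeatedly delete up to chunk_size old rows until none remain.

    `execute` is a stand-in for a database driver call that runs the
    statement and returns the number of rows affected.
    """
    total = 0
    while True:
        affected = execute(
            "DELETE FROM messages WHERE created < '2020-01-01' "
            f"LIMIT {chunk_size}"
        )
        total += affected
        if affected < chunk_size:     # fewer rows than the limit: done
            break
        time.sleep(pause_s)           # spread the load over time
    return total

# Simulated driver: 25,000 matching rows in the "table".
remaining = [25_000]
def fake_execute(sql):
    n = min(10_000, remaining[0])
    remaining[0] -= n
    return n

deleted = delete_in_chunks(fake_execute)   # 10_000 + 10_000 + 5_000
```

Each iteration is a short transaction that holds locks briefly; the optional pause keeps replication from falling behind.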

4.3.3 Disintegrating Associated Query

Many high-performance applications decompose join queries: simply run a single-table query for each table, then join the results in the application.

SELECT * FROM tag
    JOIN tag_post ON tag_post.tag_id = tag.id
    JOIN post ON tag_post.post_id = post.id
WHERE tag.tag = 'mysql';

-- can be broken down into:

SELECT * FROM tag WHERE tag = 'mysql';

SELECT * FROM tag_post WHERE tag_id = 1234;

SELECT * FROM post WHERE post.id IN (123, 456);
  • Advantage:
    • Caching becomes more efficient. Many applications can conveniently cache the result object of a single-table query. With MySQL's query cache, if any table involved in a join changes, the cached entry is invalidated; after splitting, if a table rarely changes, queries against it can keep reusing the cached results.
    • Performing a single query can reduce lock contention.
    • Association at the application layer makes it easier to split the database and achieve high performance and scalability.
    • The query itself may also be more efficient. For example, using IN() instead of a join lets MySQL look up rows in ID order, which may be faster than a random-order join.
    • Redundant record lookups are reduced. Joining at the application layer means a given record is fetched only once, whereas a join in the database may access some data repeatedly. From this point of view, the refactoring may also reduce network and memory consumption.
    • Going further, this amounts to implementing a hash join in the application instead of MySQL's nested-loop join. In some scenarios a hash join is much more efficient.
  • Scenarios where refactoring the query to join in the application is more efficient:
    • When the application can easily cache the results of a query
    • When data can be distributed to different MySQL servers
    • When IN() can be used instead of associative query
    • When the same table is used in the query
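A minimal sketch of the application-side join described above, written as a hash join over in-memory rows. The tag_post/post names echo the earlier SQL example; the helper function is illustrative, not a library API:

```python
def hash_join(left_rows, right_rows, left_key, right_key):
    """Join two row lists in the application instead of in SQL.

    Builds a hash table over the right side, then probes it with each
    left row -- the hash join that MySQL's nested-loop executor lacks.
    """
    index = {}
    for r in right_rows:
        index.setdefault(r[right_key], []).append(r)
    return [
        {**l, **r}
        for l in left_rows
        for r in index.get(l[left_key], [])
    ]

# Rows as they might come back from the two single-table queries:
tag_post = [{"tag_id": 1234, "post_id": 123}, {"tag_id": 1234, "post_id": 456}]
post = [{"post_id": 123, "title": "a"}, {"post_id": 456, "title": "b"}]
joined = hash_join(tag_post, post, "post_id", "post_id")
```

Building the hash table costs one pass over the smaller side; each probe is then O(1), instead of a scan per outer row.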

4.4 Basis for Query Execution

When sending a request to MySQL, MySQL works like this:

Query execution path:

  • The client sends a query to the server
  • The server first checks the query cache, and if a hit is made, it immediately returns the result stored in the cache. Otherwise move on to the next stage
  • The server performs SQL parsing, preprocessing, and the optimizer generates the corresponding execution plan.
  • MySQL invokes the storage engine API to execute the query based on the execution plan generated by the optimizer
  • Returns the result to the client

4.4.1 MySQL Client/Server Communication Protocol

  • The communication protocol between client and server is "half-duplex": at any given moment, only one of the two (client or server) can be sending data.
    • Limitations:
      • There is no way to control the flow, and once one end starts sending a message, the other end cannot respond until it receives the full message.
    • The client sends the query to the server in a single packet. If the query is too large, the server refuses to receive more data and throws an error, according to the max_allowed_packet setting.
    • The server's response usually consists of many packets. Once the server starts responding, the client must receive the entire result set; it cannot fetch just the first few rows and then abandon the connection. This is why it is advisable to add a LIMIT when only a few rows are needed.
    • Most client libraries for MySQL (e.g. for Java and Python) can either fetch the entire result set and buffer it in memory, or fetch the rows one at a time. The default is usually to fetch and buffer the whole result set.
      • MySQL normally cannot release the resources used by a query until all the data has been sent to the client, so fetching all results and buffering them client-side lets the server release its resources as soon as the query completes, reducing pressure on the server.
      • When you fetch data through such a library, it looks like you are reading from the MySQL server, but you are actually reading from the library's buffer. For a large result set, however, buffering costs the library a lot of time and memory. If you can start processing rows as early as possible, memory consumption drops greatly; in that case, process the results directly instead of buffering them. The downside is that the server cannot release its resources until the query completes, so they remain tied up for its whole duration.
    • Query state: a MySQL connection (thread) is in some state at any given time. The easiest way to view it is the SHOW FULL PROCESSLIST command:
      • SLEEP: The thread is waiting for the client to send a new request
      • QUERY: The thread is executing a QUERY or sending the result of the QUERY to the client
      • LOCKED: At the MySQL server layer, the thread is waiting for a table lock. Locks implemented at the storage engine level, such as InnoDB’s row locks, are not reflected in thread state. This is a typical state for MyISAM, and is common in other engines that do not have row locks.
      • Analyzing and Statistics: The thread is collecting storage engine statistics and generating query execution plans
      • Copying to tmp table [on disk]: The thread is executing the query and copying its result set into a temporary table, typically for a GROUP BY, a file sort, or a UNION. If the state ends with "on disk", MySQL is writing an in-memory temporary table out to disk.
      • Sorting result: Threads are Sorting result sets.
      • Sending data: The thread may be transferring data between the stages of the query, generating the result set, or returning data to the client.
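The buffered-versus-streaming trade-off described above can be simulated without a server. The generator below stands in for the server sending rows; a real driver exposes the same choice (fetch-all versus row-at-a-time):

```python
def result_stream(n):
    """Simulates a server sending rows; the server stays 'busy'
    (resources held) until the generator is exhausted."""
    for i in range(n):
        yield {"id": i}

# Buffered fetch (the usual client-library default): the whole result
# set is pulled into client memory at once, so the server can release
# the query's resources immediately -- at the cost of client memory.
buffered = list(result_stream(1000))
peak_rows_in_memory = len(buffered)        # all 1000 rows held at once

# Streaming fetch: process one row at a time. Client memory stays flat,
# but the server holds the query's resources until the last row is read.
total = 0
for row in result_stream(1000):
    total += 1                             # handle each row as it arrives
```

The right choice depends on result-set size: buffer small results to free the server quickly; stream very large results to bound client memory.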

4.4.2 Querying the Cache

  • Before parsing a query statement, if the query cache is turned on, MySQL checks first to see if the query matches the data in the query cache.
    • This check is implemented through a case-sensitive hash lookup.
    • If the query differs from a cached query by even a single byte, it will not match the cached result, and the query proceeds to the next stage of processing.
    • If the current query happens to hit the query cache, MySQL checks user permissions before returning the query result. There is also no need to parse the query SQL statement, because the query cache already stores the table information that the current query needs to access. If the permissions are ok, MySQL skips all other stages and gets the results directly from the cache and returns them to the client.

4.4.3 Query Optimization

The next step in the query lifecycle is to turn the SQL into an execution plan; MySQL then interacts with the storage engine according to this plan. This includes several sub-stages: parsing the SQL, preprocessing, and optimizing the execution plan. Any error in this process, such as a syntax error, terminates the query. In practice these parts may run together or separately.

  • Syntax parsers and preprocessing:

    • Process:
      • MySQL parses the SQL by keyword and generates a corresponding "parse tree". The parser validates the query using MySQL's syntax rules: for example, it verifies that no invalid keywords are used, that keywords appear in the correct order, and that quotes are properly matched.
      • The preprocessor further checks that the parse tree is valid based on some MySQL rules. For example, tables and columns are checked for existence, and names and aliases are resolved to see if they are ambiguous.
      • The preprocessor validates the permissions. This is usually quick, unless there are a lot of permissions configured on the server.
  • Query optimizer:

    • After parsing and preprocessing, the syntax tree is considered valid, and the optimizer converts it into an execution plan.

      • A query can be executed in many ways, all returning the same result.
      • The optimizer’s job is to find the best execution plan.
    • MySQL uses a cost-based optimizer, which will try to predict the cost of a query using a certain execution plan and select the one with the least cost.

      • Initially, the minimum unit of cost was the cost of randomly reading a 4K data page. Later the cost formula became more complex, introducing "factors" to estimate the cost of certain operations, such as a WHERE-condition comparison.
      • You can run SHOW STATUS LIKE 'Last_query_cost'; to see the cost of the last query in the current session. A value of N means the MySQL optimizer estimated that roughly N random data-page reads were needed to complete the query.
      • Last_query_cost is computed from statistics such as the number of pages per table or index, index cardinality (the number of distinct values in an index), the length of rows and index entries, and the index distribution. The optimizer does not account for any caching layer when estimating cost; it assumes that reading any data requires one disk I/O.
    • Why MySQL may choose a wrong execution plan:

      • The statistics are inaccurate. MySQL relies on statistics provided by the storage engine to estimate cost, and some engines' statistics can be wildly off. For example, because of its MVCC architecture, InnoDB cannot maintain an exact row count for a table.
      • The cost estimate in the execution plan does not equal the actual execution cost, so even with accurate statistics, the plan the optimizer picks may not be optimal. For example, a plan that reads more pages may actually be cheaper, because those reads may be sequential or the pages may already be in memory, making their access cost small.
      • MySQL's notion of "optimal" may not match yours. You probably want the shortest execution time, but MySQL only chooses the best plan under its cost model, and sometimes that is not the fastest plan. Choosing a plan by estimated cost is not a perfect model.
      • MySQL never considers other queries that are executed concurrently, which can affect the speed of the current query.
      • MySQL does not always optimize based on cost; sometimes it follows fixed rules. For example, if a query contains a full-text MATCH() clause and a full-text index exists, MySQL will use that full-text index, even when another index combined with the WHERE condition would be much faster.
      • MySQL does not consider the cost of operations that are not under its control, such as the cost of executing stored procedures or user-defined functions.
      • The optimizer is sometimes unable to estimate all possible execution plans, so it may miss the actual optimal execution plan.
    • Optimization strategy:

      • Static optimization: Analyze the parse tree directly and complete the optimization. For example, the optimizer can transform a WHERE condition into an equivalent form via simple algebraic transformations. Static optimization does not depend on particular values, such as constants in the WHERE condition; once done, it persists even when the query is re-executed with different parameters. Think of it as "compile-time optimization".
      • Dynamic optimization: Context-dependent, possibly depending on many other factors, such as the values in the WHERE condition or the number of rows corresponding to an index entry. This must be re-evaluated on every execution, and may even be redone during execution; think of it as "runtime optimization".
    • Types of optimizations MySQL can handle:

      • Redefine the order of associated tables: Data table associations are not always performed in the order specified in the query. Determining the order of associations is an important part of the optimizer’s function.

      • Convert OUTER JOIN to INNER JOIN: Not every OUTER JOIN must actually execute as an outer join. For example, the WHERE condition or the schema may make an outer join equivalent to an inner join.

      • Use equivalence transformation rules: MySQL uses some equivalence transformation to simplify and standardize expressions. It can merge and reduce some comparisons, and it can remove some judgments that are always true and some that are never true.

      • Optimize COUNT(), MIN(), and MAX(): Indexes and whether columns are nullable can often help MySQL optimize these expressions. For example, to find the minimum value of a column, MySQL only needs to read the leftmost record of the B-Tree index; the optimizer treats the expression as a constant when generating the execution plan, and you will see "Select tables optimized away" in EXPLAIN. Similarly, a COUNT(*) query without any WHERE condition can often use an optimization provided by the storage engine (MyISAM maintains a variable holding the table's row count).

      • Evaluate and reduce constant expressions: When MySQL detects that an expression can be reduced to a constant, it treats it as a constant throughout. Examples include a user-defined variable that does not change within the query, mathematical expressions, certain special queries (MIN() on an index column, even a primary-key or unique-key lookup), and constant values propagated from one table to another through an equality in a WHERE, USING, or ON clause.

      • Covering index scan: When an index contains all the columns the query needs, MySQL can return the required data from the index alone, without reading the corresponding rows.

      • Subquery optimization: In some cases, subqueries can be converted to a more efficient form, reducing the need for multiple queries to access data multiple times.

      • Terminate the query early: Stop as soon as the query's requirement is satisfied, for example when a LIMIT is reached, or when a condition is found to be impossible (return an empty result immediately). MySQL also uses this optimization when the storage engine only needs to retrieve "distinct values" or check for existence, as in DISTINCT, NOT EXISTS(), or LEFT JOIN queries.

      • Equivalence propagation: If the values of two columns are related by an equality, then the WHERE condition of one column can be passed to the other column.

        SELECT film.film_id

        FROM sakila.film

            INNER JOIN sakila.film_actor USING(film_id)

        WHERE film.film_id > 500;

        MySQL propagates the condition to film_actor automatically, so there is no need to repeat it on both tables by hand, which would only make the query harder to maintain:

        ... WHERE film.film_id > 500 AND film_actor.film_id > 500;
      • IN() list comparison: Unlike many other databases, where IN() is exactly equivalent to a chain of OR conditions, MySQL sorts the values in the list and then uses binary search to decide whether a value matches. The OR chain is O(n) while the binary search is O(log n), so for lists with many values MySQL's approach is much faster.
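The IN() evaluation described above can be illustrated with Python's `bisect`; the helper mirrors the sorted-list-plus-binary-search strategy and is not MySQL code:

```python
from bisect import bisect_left

def in_list_match(sorted_values, x):
    """MySQL-style IN() check: binary search over the sorted value
    list, O(log n) per row instead of the O(n) OR-chain scan."""
    i = bisect_left(sorted_values, x)
    return i < len(sorted_values) and sorted_values[i] == x

values = sorted([7, 3, 500, 42, 19])   # MySQL sorts the IN() list once
hit = in_list_match(values, 42)        # value present in the list
miss = in_list_match(values, 8)        # value absent from the list
```

The sort is paid once per query; each row then costs only a logarithmic number of comparisons.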

  • Statistics for data and indexes:

    • The MySQL architecture puts the query optimizer in the server layer, but the server layer does not store statistics on data and indexes. Statistics are implemented by the storage engines, so different engines may keep different statistics.
    • When the MySQL query optimizer generates the query execution plan, it needs to obtain the corresponding statistics from the storage engine. The storage engine provides the optimizer with statistics such as how many pages per table or index, the cardinality of each index in each table, the length of data rows and indexes, and the distribution of indexes.
  • How MySQL executes joins:

    • MySQL considers every query a join: not only queries that match rows between two tables, but every query and every fragment (including subqueries, and even a SELECT on a single table).
      • UNION example: MySQL first puts the results of the individual queries into a temporary table, then re-reads the temporary table to complete the UNION. Reading the results back from the temporary table is itself a join.
    • MySQL executes every join as a nested-loop join: it loops through one table fetching a single row, then nests a loop over the next table looking for matching rows, and so on, until it finds matching rows in all the tables. It then builds the columns the query needs from the rows each table matched. MySQL tries to find every matching row in the last joined table; when the last table yields no more rows, MySQL backtracks to the previous table to see whether it can find more matching records, and so on.
      • Essentially, all types of queries run the same way. For example, when MySQL encounters a subquery in the FROM clause, it executes the subquery, puts the result into a temporary table (MySQL's temporary tables have no indexes; the same applies to UNION queries), and then treats that temporary table like an ordinary table. In short, all query types are converted into similar execution plans (this changed significantly in MySQL 5.6 and MariaDB, which both introduced more sophisticated execution plans).
      • Not every query can be converted this way, however. For example, a FULL OUTER JOIN cannot be executed by nested loops and backtracking when a table has no matching rows, because the join may need to start from a table that has no matches at all. This is probably why MySQL does not support FULL OUTER JOIN.
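The nested-loop strategy described above can be sketched in a few lines. This is a simplified model (no indexes, no backtracking optimizations), with illustrative row data:

```python
def nested_loop_join(outer_rows, inner_rows, predicate):
    """MySQL-style nested-loop join: for each row of the first table,
    scan the next table for matching rows, then emit combined rows."""
    out = []
    for o in outer_rows:
        for i in inner_rows:
            if predicate(o, i):
                out.append({**o, **i})   # columns from both tables
    return out

tag_post = [{"tag_id": 1, "post_id": 10}, {"tag_id": 1, "post_id": 20}]
post = [{"post_id": 10, "title": "a"}, {"post_id": 20, "title": "b"}]
rows = nested_loop_join(tag_post, post,
                        lambda o, i: o["post_id"] == i["post_id"])
```

With more tables, each match in the inner table becomes the outer row for the next nested loop, which is exactly the "backtrack when no more rows" behavior described above.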
  • Execution Plan:

    • Unlike other relational databases that generate query bytecode to execute the query, MySQL generates an instruction tree for the query, then executes the tree through the storage engine and returns the result.
    • The final execution plan contains all the information for the refactored query. You can perform EXPLAIN EXTENDED on the query and then SHOW WARNINGS to see the refactored query (it has the same semantics as the original query, but the query statements may not be identical)
    • Any multi-table query can be represented by a tree, which we can think of as a balanced tree, but MySQL’s execution plan is a left-depth-first tree.
  • Join optimizer:

    • The most important part of the MySQL optimizer is the join optimizer, which decides the order in which multiple tables are joined. In general, a multi-table join can be executed in several different orders, and the join optimizer estimates the cost of each order and picks the cheapest.
    • Sometimes the optimizer does not choose the optimal join order; in that case you can rewrite the query with the STRAIGHT_JOIN keyword to force the order you believe is best. Most of the time, though, the optimizer's choice is more accurate than a human's.
    • The optimizer enumerates as many candidate join orders as practical, evaluates the cost of each possible execution-plan tree via nested loops, and returns the best plan it found.
      • With N tables to join, there are N-factorial join orders to examine; this is called the "search space" of all possible execution plans. If the search space is very large and the number of tables exceeds the optimizer_search_depth limit, the optimizer falls back to a greedy search to find a "good" join order.
    • Occasionally, the optimizer will choose an execution plan that is not optimal.
    • Sometimes the order of the individual queries is not arbitrary, and the join optimizer can use such constraints to greatly shrink the search space, for example with LEFT JOIN or correlated subqueries. Queries on later tables depend on results from earlier tables, and this dependency often lets the optimizer greatly reduce the number of execution plans it must examine.
  • Order to optimize

    • Sorting is an expensive operation in any case, so for performance reasons, avoid sorting or sorting large amounts of data whenever possible.

    • File sorting (filesort): When an index cannot be used to produce sorted results, MySQL sorts the results itself, in memory if the data is small or on disk if it is large.

      • If the amount of data to be sorted is smaller than the sort buffer, MySQL uses memory to perform a quicksort operation.
      • If there is not enough memory to sort, MySQL first divides the data into chunks, uses “quicksort” to sort each individual block, stores the sorted results of each block to disk, then merges the sorted blocks, and finally returns the sorted results.
    • Sorting algorithm:

      • Two-pass sort (used in older versions): Read the row pointers and the columns to sort by, sort them, then read the needed rows again in sorted order. This transfers the data twice, i.e. reads the table twice; the second read is expensive because fetching all the records in sorted-column order produces a great deal of random I/O. The cost can be even higher for MyISAM tables, because MyISAM reads data through system calls (relying heavily on the operating system's data cache). The advantage is that as little data as possible is stored during the sort, so the sort buffer holds as many rows as possible.
      • Single-pass sort (used since 4.1): Read all the columns the query needs, sort by the given columns, and return the sorted result directly. The advantage is that all data is read in one sequential pass with no random I/O; the disadvantage is that if the returned columns are large, a lot of extra space is wasted on columns that are useless for the sort itself, and because each sort record is larger, there may be more sort runs to merge.
      • It is hard to say which of the two algorithms is more efficient; each wins in different cases. MySQL uses the single-pass sort when the total length of the columns in the query does not exceed max_length_for_sort_data.
    • A filesort may need far more temporary space than you expect, because MySQL allocates a fixed-length, sufficiently long slot for every record it sorts.

      • The fixed-length slot must be long enough to hold the longest possible value. A VARCHAR column is allocated its full declared length, and with the UTF-8 character set three bytes are reserved per character.
    • Sort when associated query:

      • If all columns in the ORDER BY clause are from the first table associated, then MySQL performs file sorting when associated with the first table. You can see “Using filesort” in the Extra field in EXPLAIN.

      • In any other case, MySQL first puts the join result into a temporary table and performs the filesort after the whole join completes. EXPLAIN then shows "Using temporary; Using filesort". If the query has a LIMIT, the LIMIT is applied after the sort, so even when few rows are ultimately returned, the temporary table and the data to be sorted can still be very large.

        Version 5.6 improved this: when only part of the sorted result is needed, as with a LIMIT clause, MySQL no longer sorts the entire result set; instead it discards rows that cannot qualify as it sorts.
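The on-disk filesort described above (sort buffer-sized runs, then merge) can be sketched with an in-memory stand-in; `buffer_size` plays the role of the sort buffer and the lists stand in for disk files:

```python
import heapq

def external_sort(rows, buffer_size):
    """Filesort sketch: quicksort buffer-sized chunks ('sort runs'),
    then merge the sorted runs -- what MySQL does when the data to
    sort exceeds the sort buffer."""
    runs = [
        sorted(rows[i:i + buffer_size])            # in-memory quicksort
        for i in range(0, len(rows), buffer_size)  # one run per chunk
    ]
    return list(heapq.merge(*runs))                # merge sorted runs

data = [9, 1, 8, 2, 7, 3, 6, 4, 5]
result = external_sort(data, buffer_size=3)
```

In the real filesort each run is written to disk before the merge; the merge itself only needs one buffered row per run in memory at a time.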

4.4.4 The Query Execution Engine

  • During the parsing and optimization phase, MySQL generates the corresponding execution plan for the query, and the query execution engine of MySQL completes the entire query according to this execution plan.

    • The execution plan here is a tree-like data structure, not bytecode as generated by many other relational databases.
  • The query execution phase is not that complicated: MySQL simply executes step by step according to the instructions in the execution plan. Along the way, a large number of operations are performed by calling the “handler API” interfaces implemented by the storage engine.

    • Each table in the query is identified by a handler instance. In fact, MySQL creates a handler instance for each table during the optimization phase, and the optimizer can retrieve information about the table based on the interface of these instances.
    • The interfaces of the storage engine have very rich functions, but there are only a dozen low-level interfaces, which can cooperate with each other to complete most of the operations of the query
  • Not all operations are done by handlers. For example, the server handles table locks; a storage engine may implement its own finer-grained locking, as InnoDB does with row locks, but that does not replace server-level table locks. Features common to all storage engines are implemented at the server layer, such as date and time functions, views, and triggers.

4.4.5 Sending a Result to the Client

  • The last stage of query execution. Even if the query does not need to return a result set to the client, MySQL still returns some information about the query, such as the number of rows affected by the query.
  • If the query can be cached, MySQL will also place the results in the query cache at this stage.
  • Returning the result set to the client is an incremental, step-by-step process. For example, as soon as the server has processed the last table of a join and begins generating the first result row, MySQL can start returning results to the client.
    • Benefits:
      • The server does not have to store too many results, so it does not consume too much memory by returning too many results.
      • The client gets the first results as early as possible. (This behavior can be changed with SQL_BUFFER_RESULT, which makes the server buffer the full result set first.)
    • Each row in the result set is sent in a packet that meets the MySQL client/server communication protocol, and then transmitted over TCP. During TCP transmission, MySQL packets may be cached and then transmitted in batches.

4.5 Limitations of MySQL query optimizer

MySQL’s universal “nested loop” join is not optimal for every query, but it is a poor fit only for a minority of queries, and we can often make MySQL work efficiently by rewriting the query. In addition, version 5.6 removes many of the original restrictions, allowing more queries to be executed as efficiently as possible.

4.5.1 Associating SubQueries

MySQL’s subqueries are implemented very poorly, and the worst kind is a WHERE clause containing an IN() subquery.

SELECT * FROM sakila.film

WHERE film_id IN(

    SELECT film_id FROM sakila.film_actor WHERE actor_id = 1 );

MySQL has a special optimization strategy for a literal IN() list, but not for a correlated subquery like this one. MySQL pushes the condition on the related outer table down into the subquery, rewriting it as a correlated EXISTS, which it believes will find rows more efficiently. The query is rewritten as:

SELECT * FROM sakila.film

WHERE EXISTS(

    SELECT film_id FROM sakila.film_actor WHERE actor_id = 1

    AND film_actor.film_id = film.film_id);

MySQL decides the subquery cannot be executed first because it depends on the outer film_id field. EXPLAIN shows the subquery as a DEPENDENT SUBQUERY: the film table is scanned in full, and the subquery is executed once for each film_id returned. If the outer table is large, query performance will be poor.

-- Optimized rewrite method 1:

SELECT film.* FROM sakila.film

    INNER JOIN sakila.film_actor USING(film_id) 

WHERE actor_id =1;

Optimized rewrite method 2: use GROUP_CONCAT() to construct a comma-separated IN() list.
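A sketch of that rewrite (the literal film_id values below are hypothetical, standing in for whatever the subquery returns):

```sql
-- Run the subquery on its own first ...
SELECT GROUP_CONCAT(film_id) FROM sakila.film_actor WHERE actor_id = 1;

-- ... then paste its result into a literal IN() list, which MySQL
-- optimizes well (the values here are hypothetical):
SELECT * FROM sakila.film WHERE film_id IN (1, 23, 25, 106, 140);
```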

-- Optimize rewrite method 3, using the EXISTS() equivalent rewrite query:

SELECT * FROM sakila.film

WHERE EXISTS(

    SELECT film_id FROM sakila.film_actor WHERE actor_id = 1

    AND film_actor.film_id = film.film_id);
  • How to use correlated subqueries well
    • Not all correlated subqueries perform poorly; test before judging. In many cases, a correlated subquery is a perfectly reasonable, natural, and even high-performance way to write a query.

4.5.2 Restrictions of UNION

  • Sometimes MySQL cannot “push down” conditions from the outer query into the inner queries, which prevents conditions that would restrict the rows returned from being applied to the optimization of the inner queries.

  • If you want each clause of a UNION to return only part of its result set according to a LIMIT, or want each result set sorted before merging, you need to place those clauses inside each branch of the UNION. Also, the order in which rows are read back from the temporary table is not guaranteed; to obtain a correct order you must add a global ORDER BY and LIMIT.

    (SELECT first_name, last_name

     FROM sakila.actor

     ORDER BY last_name)

    UNION ALL

    (SELECT first_name, last_name

     FROM sakila.customer

     ORDER BY last_name)

    LIMIT 20;

    Use LIMIT separately in each UNION branch:

    (SELECT first_name, last_name

     FROM sakila.actor

     ORDER BY last_name

     LIMIT 20)

    UNION ALL

    (SELECT first_name, last_name

     FROM sakila.customer

     ORDER BY last_name

     LIMIT 20)

    LIMIT 20;

4.5.3 Index Merge Optimization

  • In version 5.0 and newer, when a WHERE clause contains multiple complex conditions, MySQL can access multiple indexes of a single table and merge (union or intersect) their results to find the matching rows.
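A sketch against the sakila sample schema (film_actor has an index on each of actor_id and film_id), where an OR across two indexed columns can trigger an index merge; the exact Extra text depends on version and statistics:

```sql
-- EXPLAIN may show something like
--   "Using union(PRIMARY, idx_fk_film_id); Using where"
-- in the Extra column, meaning both indexes were scanned and merged.
EXPLAIN SELECT film_id, actor_id
FROM sakila.film_actor
WHERE actor_id = 1 OR film_id = 1;
```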

4.5.4 Equality Propagation

  • In some cases, equality propagation can cause unexpected additional cost. For example, suppose you have a very large IN() list and the MySQL optimizer finds a WHERE, ON, or USING clause that equates the listed column with a column in another table.
    • The optimizer then copies the entire IN() list and applies it to the associated table as well. Usually this added per-table filter lets the storage engine filter records more efficiently, but if the list is very large, both optimization and execution become slower.
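A sketch of this behavior using the sakila sample schema; the IN() list here is small, but the optimizer copies it all the same:

```sql
SELECT film.film_id
FROM sakila.film
    INNER JOIN sakila.film_actor USING(film_id)
WHERE film.film_id IN (1, 2, 3, 4, 5);

-- Through the USING(film_id) equality, the optimizer effectively runs:
--   ... WHERE film.film_id       IN (1, 2, 3, 4, 5)
--         AND film_actor.film_id IN (1, 2, 3, 4, 5)
-- With a huge list, this duplication slows optimization and execution.
```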

4.5.5 Parallel Execution

  • Unlike some other relational databases, MySQL cannot use multiple cores to execute a single query in parallel.

4.5.6 Hash association

  • MySQL does not support hash joins; all joins are nested-loop joins. A workaround is to create a hash index, which makes the join behave somewhat like a hash join.
  • If the Memory storage engine is used, its indexes are hash indexes, so the join resembles a hash join.

4.5.7 Loose Index Scanning

  • MySQL does not support loose index scans, that is, scanning an index non-contiguously. In general, a MySQL index scan needs a defined start point and end point; even if only a few rows are needed, MySQL scans every entry in that range.

  • Example: suppose we have an index on (a, b) and the query SELECT … FROM tb1 WHERE b BETWEEN 2 AND 3. MySQL cannot use the index, because column b alone does not match the leftmost prefix of the index, so it can only find matching rows via a full table scan.

    (Figure: MySQL uses a full table scan to find the desired records.)

    If you understand the index structure, there is a faster way to execute this query. The physical structure of the index (though not the storage engine API) would allow scanning the range of b values within the first value of column a, then the range of b within the second value of a, and so on: a loose index scan. No WHERE filtering would be needed, since all non-matching entries are skipped. But MySQL does not support loose index scans.

    (Figure: a loose index scan.)

  • In MySQL 5.0 and later, loose index scans can be used in some special scenarios, for example finding the maximum and minimum values in a grouped query:

    -- "Using index for group-by" is displayed in the Extra field, indicating that loose index scanning is used

    EXPLAIN SELECT actor_id, MAX(film_id)

    FROM sakila.film_actor

    GROUP BY actor_id\G
  • Before MySQL supported loose index scans well, a simple workaround was to add all possible constant values of the leading columns to the query. In 5.6, some of the limitations on loose index scans are addressed by index condition pushdown.
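The leading-constants workaround can be sketched as follows, assuming a hypothetical table tb1 with an index on (a, b) where column a takes only the values 1, 2, and 3:

```sql
-- Enumerating every value of the leading column lets MySQL turn the
-- query into a series of range scans on the (a, b) index, instead of
-- a full table scan:
SELECT * FROM tb1
WHERE a IN (1, 2, 3)
  AND b BETWEEN 2 AND 3;
```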

4.5.8 Optimization of maximum and minimum values

  • MySQL is not optimized very well for MIN() and MAX() queries.

    SELECT MIN(actor_id) FROM sakila.actor WHERE first_name = 'PENELOPE';

    MySQL performs a full table scan because there is no index on first_name. If MySQL could scan the primary key in order, then in theory the first record it reads that meets the condition is the minimum, because primary keys are in strict actor_id order.

    Workaround: remove MIN() and use LIMIT instead

    SELECT actor_id FROM sakila.actor USE INDEX(PRIMARY) WHERE first_name = 'PENELOPE' LIMIT 1;

    SQL is supposed to tell the server what data is needed and let the server decide how best to get it, but sometimes you have to give up such principles for higher performance.

4.5.9 Querying and Updating information in the Same Table

  • MySQL does not allow you to query and update the same table at the same time. This is not really an optimizer limitation, and if you know how MySQL executes queries you can circumvent it by using a derived table, which MySQL materializes as a temporary table.

    -- Standard SQL, but does not run in MySQL

    mysql> UPDATE tbl AS outer_tbl

    -> SET cnt = (

    -> SELECT count(*) FROM tbl AS inner_tbl

    -> WHERE inner_tbl.type = outer_tbl.type

    -> );

    Use a derived table to circumvent this restriction:

    mysql> UPDATE tbl

    -> INNER JOIN(

    -> SELECT type, count(*) AS cnt

    -> FROM tbl

    -> GROUP BY type

    -> ) AS der USING(type)

    -> SET tbl.cnt = der.cnt;

4.6 Query optimizer Hints

If you are not satisfied with the execution plan the optimizer chooses, you can use the hints the optimizer provides to control the final plan. However, these hints may become invalid after a MySQL upgrade and should then be re-examined.

  • Some types of hints:

    • LOW_PRIORITY and HIGH_PRIORITY:

      Tell MySQL the relative priority of statements accessing a table. This works only for storage engines that use table locks, but be careful even with MyISAM, as both hints can disable concurrent inserts, potentially causing serious performance degradation.

      • HIGH_PRIORITY: for SELECT statements, MySQL reschedules this statement before all statements that are waiting for table locks to modify data. It is actually placed at the top of the queue for the table, rather than waiting in the normal order. For INSERT statements, the effect is simply to cancel out the effect of the global LOW_PRIORITY setting on the statement.
      • LOW_PRIORITY: used with SELECT, INSERT, UPDATE, and DELETE statements; keeps the statement waiting as long as any other statements in the queue need to access the same table, even statements submitted to the server later than this one.
    • DELAYED:

      • Only valid for INSERT and REPLACE.
      • MySQL returns the statement using the prompt to the client immediately, puts the inserted row data into a buffer, and then writes the data in batches when the table is idle.
      • This is useful for logging systems, or for other applications where a lot of data needs to be written but the client does not need to wait for a single statement to complete the IO.
      • Limitations: not all storage engines support it, and the hint causes LAST_INSERT_ID() to stop working properly.
    • STRAIGHT_JOIN:

      STRAIGHT_JOIN is useful when MySQL does not choose the right join order, or when there are so many possible orders that MySQL cannot evaluate them all. In the latter case, MySQL may spend a long time in the “statistics” state, and this hint greatly reduces the optimizer’s search space.

      You can first use EXPLAIN to see the join order the optimizer chose, then use this hint to rewrite the query and fix what you determine to be the optimal order. But re-examine such queries when upgrading MySQL.

      • Placed after the SELECT keyword: joins all tables in the query in the order they appear in the statement.
      • Placed between the names of any two associated tables: fixes the association order of the two tables before and after it.
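Both placements, sketched against the sakila sample schema:

```sql
-- After SELECT: join every table in the order written.
SELECT STRAIGHT_JOIN f.title, a.last_name
FROM sakila.film_actor fa
    INNER JOIN sakila.film f USING(film_id)
    INNER JOIN sakila.actor a USING(actor_id);

-- Between two table names: fix the order of just those two tables.
SELECT f.title
FROM sakila.film f
    STRAIGHT_JOIN sakila.film_actor fa ON fa.film_id = f.film_id;
```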
    • SQL_SMALL_RESULT and SQL_BIG_RESULT:

      • This is valid only for SELECT statements and tells the optimizer how to use temporary tables and sorts for GROUP BY or DISTINCT queries.
      • SQL_SMALL_RESULT tells the optimizer that the result set is small and can be placed in an in-memory indexed temporary table to avoid sorting operations
      • SQL_BIG_RESULT tells the optimizer that the result set can be very large, and it is recommended to use disk temporary tables for sorting
    • SQL_BUFFER_RESULT:

      • Tells the optimizer to put the query results into a temporary table and release table locks as soon as possible.
    • SQL_CACHE and SQL_NO_CACHE

      • Tell MySQL whether the result set should be cached in the query cache
    • SQL_CALC_FOUND_ROWS:

      • Strictly speaking, it is not an optimizer hint, and it does not tell the optimizer anything about the execution plan.
      • Make the result set returned by MySQL contain more information.
      • Add this hint and MySQL will compute the total number of rows the query would return without the LIMIT clause, while still returning only the rows the LIMIT asks for.
    • FOR UPDATE and LOCK IN SHARE MODE

      • It’s not really an optimizer hint either.
      • Controls the locking mechanism for SELECT statements, but only for storage engines that implement row-level locking.
      • Rows that match the query criteria are locked. For INSERT … SELECT statements these hints are unnecessary, because MySQL 5.0 and newer places read locks on the source rows by default.
      • InnoDB is the only built-in engine that supports both hints. However, they can prevent certain optimizations, such as covering index scans, from working: InnoDB cannot lock rows exclusively without accessing the primary key, because row version information is stored in the primary key.
      • These two tips are often abused, can easily cause lock contention problems on the server, and should be avoided whenever possible.
    • USE INDEX, IGNORE INDEX and FORCE INDEX:

      • Tell the optimizer which indexes to use or not to use to query records (for example, which indexes to use when determining association order).
      • Add FOR ORDER BY or FOR GROUP BY to restrict the hint to index use for sorting or grouping.
      • USE INDEX and FORCE INDEX are basically the same, but FORCE INDEX additionally tells the optimizer that a full table scan costs far more than an index scan, even if the index is not really useful. Use these hints when you find the optimizer has chosen the wrong index, or when you want a different index for some reason (for example, to get ordered results without an ORDER BY).
  • New parameters for controlling optimizer behavior in versions 5.0 and newer:

    • optimizer_search_depth: controls how exhaustively the optimizer examines candidate plans. If queries stay in the “Statistics” state for a long time, lower this value.
    • optimizer_prune_level: on by default; allows the optimizer to skip certain execution plans based on the number of rows that would need to be scanned.
    • optimizer_switch: contains flags that enable or disable individual optimizer features.

4.7 Optimizing a Specific Type of Query

4.7.1 Optimizing count() queries

  • COUNT() has two different jobs:

    • Counting values: it counts the number of non-NULL values in a column (NULLs are not counted). If a column or expression is specified inside the parentheses, COUNT() returns the number of times that expression has a value (is not NULL).
    • Counting rows: when MySQL knows the expression inside the parentheses can never be NULL, it actually counts rows.
      • COUNT(*) does not, as often assumed, expand to all columns; it ignores the columns entirely and simply counts rows.
      • Common mistake: specifying a column in the parentheses while expecting to count the number of rows in the result set.
  • Myths about MyISAM:

    • MyISAM’s COUNT() is very fast only for COUNT(*) without a WHERE clause, because then no rows actually need to be counted: MySQL can read the exact row count directly from the storage engine.
    • COUNT(col) is not interchangeable with COUNT(*); do not assume they behave or perform the same.
    • When counting with a WHERE clause, MyISAM’s COUNT() is no different from any other storage engine, and no longer has the mythical speed.
  • Simple optimization

    • Take advantage of MyISAM’s very fast full-table COUNT(*) to speed up certain COUNT queries with conditions.

      -- Uses the standard World sample database

      SELECT count(*) FROM world.city WHERE ID > 5;

      -- Reversing the condition reduces the number of rows scanned to 5 or fewer

      SELECT (SELECT count(*) FROM world.city) - COUNT(*) 

      FROM world.city WHERE ID <= 5;
    • Example: Suppose you need a query to return the number of items in various colors

      -- Using SUM():

      SELECT SUM(IF(color = 'blue', 1, 0)) AS blue,SUM(IF(color = 'red', 1, 0)) AS red FROM items;

      -- Using COUNT(): the expression is true only when the condition holds, and NULL otherwise (NULLs are not counted)

      SELECT COUNT(color = 'blue' OR NULL) AS blue, COUNT(color = 'red' OR NULL) AS red FROM items;
  • Using approximations:

    • Sometimes some business scenarios do not require an exact COUNT value, and an approximation value can be used instead.
    • The number of rows estimated by the optimizer that comes out of EXPLAIN is a good approximation, and executing EXPLAIN does not require actually executing the query, so the cost is low.
    • For example, if you want to accurately count how many people are online on your site, often WHERE conditions can be complex, filtering current inactive users on the one hand and default users with certain ids in the system on the other. Removing these constraints will have little impact on the total number, but it can greatly improve the performance of the query. Further optimization can try to remove constraints such as DISTINCT to avoid file sorting. This rewritten query is much faster than an exact query and returns nearly the same result.
  • More complex optimizations:

    • In general, COUNT() requires a large number of rows to be scanned (meaning a large amount of data to be accessed) to get accurate results, making it difficult to optimize.
    • Optimization method:
      • The method mentioned earlier
      • The only thing that helps at the MySQL level alone is a covering index scan
      • Consider changing your application architecture by adding summary tables or an external caching system like Memcached. You may soon find yourself in a familiar dilemma where “fast, accurate, and easy to implement” always satisfy two, and one must go.

4.7.2 Optimizing Joins

  • Ensure that the columns in the ON or USING clauses are indexed.
    • Consider the join order when creating indexes. If tables A and B are joined on column C and the optimizer’s join order is B, A, then there is no need to index the corresponding column on B; an unused index only adds overhead.
    • In general, unless there is another reason, you only need to create an index on the corresponding column of the second table in the join order.
  • Ensure that any expression in GROUP BY or ORDER BY refers to columns from only one table, so that MySQL can use an index to optimize the process.
  • Be careful when upgrading MySQL: join syntax, operator precedence, and other things may change. What was previously an ordinary join may become a Cartesian product, and different join types may produce different results.

4.7.3 Optimizing Subqueries

  • Use joins instead whenever possible. This is not absolute: in 5.6 and newer versions, or in MariaDB, subqueries can be used directly.

4.7.4 Optimizing GROUP BY and DISTINCT

  • In many cases, MySQL optimizes both types of queries in the same way; in fact, the MySQL optimizer converts the two types of queries to each other during internal processing. They can all be optimized using indexes, which is the most efficient way to optimize.
  • When indexes are unavailable, GROUP BY uses one of two strategies: a temporary table or a file sort to do the grouping. You can use the SQL_BIG_RESULT and SQL_SMALL_RESULT hints to steer the optimizer toward the one you want.
  • If you need to GROUP an associated query BY a column in the lookup table, it is usually more efficient to GROUP the identity columns of the lookup table than the other columns
    • The non-grouped column that appears after SELECT must be directly dependent on the grouped column and have a unique value within each group, or business does not care what the value is.
    • It is usually not a good idea to use non-grouped columns directly in the SELECT of grouped queries, because such results are often variable and can vary when the index changes, or when the optimizer chooses a different optimization strategy.
  • If the ORDER BY clause is not explicitly sorted, the result set is automatically sorted BY grouped fields when the query uses the GROUP BY clause. If you don’t care about sorting the result set, and the default sort results in a file sort, you can use ORDER BY NULL to make MySQL stop sorting files.
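The ORDER BY NULL technique, sketched against the sakila sample schema:

```sql
-- GROUP BY implicitly sorts by the grouped column; if that ordering is
-- not needed and would cause a file sort, ORDER BY NULL suppresses it.
SELECT actor_id, COUNT(*) AS cnt
FROM sakila.film_actor
GROUP BY actor_id
ORDER BY NULL;
```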
  • Optimize GROUP BY WITH ROLLUP:
    • A variant of group queries is required to perform a super aggregation of the returned group results. This logic can be implemented using the WITH ROLLUP clause, but it may not be optimized enough.
      • In many cases, it is better if you can do super aggregation in your application, although you need to return more results to the client.
      • You can also use nested subqueries in the FROM clause, or you can use a temporary table to hold intermediate data and then UNION with the temporary table to get the final result
      • The best approach is to move the WITH ROLLUP functionality into the application whenever possible.
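The WITH ROLLUP variant itself, sketched against the sakila sample schema:

```sql
-- Adds a super-aggregate row (actor_id is NULL) holding the grand
-- total across all groups; computing this in the application instead
-- is often cheaper for MySQL.
SELECT actor_id, COUNT(*) AS cnt
FROM sakila.film_actor
GROUP BY actor_id WITH ROLLUP;
```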

4.7.5 Optimize LIMIT paging

  • When paging is required on a system, it is usually done using LIMIT plus offset, along with an appropriate ORDER BY clause. If you have a corresponding index, it is usually very efficient, otherwise, MySQL needs to do a lot of file sorting.
  • Optimization method when the offset is very large:
    • Restrict how many pages of results users may view
    • Optimize performance at large offsets:
      • Whenever possible, use covering index scans instead of querying all columns; for example, use a deferred join
      • Sometimes you can convert a LIMIT query to a query with a known location and let MySQL get the corresponding result through a range scan.
      • Use “bookmark” to record the location of the last fetched data.
      • Use a pre-computed summary table, or associate it with a redundant table that contains only primary key columns and data columns that need to be sorted.
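The deferred-join and “bookmark” techniques can be sketched against the sakila sample schema (the literal offset and id values are examples):

```sql
-- Deferred join: the derived table finds just 5 primary keys via a
-- covering index scan; only those 5 full rows are then fetched.
SELECT film.film_id, film.description
FROM sakila.film
    INNER JOIN (
        SELECT film_id FROM sakila.film
        ORDER BY title LIMIT 50, 5
    ) AS lim USING(film_id);

-- Bookmark: remember the last primary key of the previous page and
-- turn LIMIT-with-offset into a cheap range scan.
SELECT * FROM sakila.rental
WHERE rental_id < 16030
ORDER BY rental_id DESC LIMIT 20;
```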

4.7.6 Optimizing SQL_CALC_FOUND_ROWS

Another common technique for paging is to add the SQL_CALC_FOUND_ROWS hint to a LIMIT statement. This gives you the number of rows that would satisfy the condition without the LIMIT, which can be used as the total page count. With this hint, MySQL scans all rows that match the condition and then discards the unneeded ones, instead of stopping once the LIMIT is satisfied, so the hint can be very expensive.
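How the hint is used, sketched against the sakila sample schema:

```sql
-- The first query returns only one page but forces a scan of every
-- matching row; FOUND_ROWS() then reads back the full count.
SELECT SQL_CALC_FOUND_ROWS film_id, title
FROM sakila.film
ORDER BY title LIMIT 50, 10;

SELECT FOUND_ROWS();
```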

  • Design option 1: replace specific page numbers with a “next page” button. If each page shows 20 records, fetch 21 with LIMIT but display only 20; if the 21st exists, show the “next page” button, otherwise there is no more data to display.
  • Design scenario 2: Fetch and cache a large amount of data, and then fetch from this cache for each page.
  • Other design options: sometimes you can use the value of the rows column in EXPLAIN output as an approximation of the total result count (in fact, the total number of Google search results is also an approximation). When precise results are needed, run a separate COUNT(*); if it can use a covering index, it is usually much faster than SQL_CALC_FOUND_ROWS.

4.7.7 Optimizing the UNION Query

  • MySQL always performs UNION queries by creating and populating temporary tables. Therefore, many optimization strategies cannot be used well. It is often necessary to manually push the WHERE, LIMIT, and ORDER BY clauses down into the UNION’s subqueries so that the optimizer can take full advantage of these conditions for optimization
  • Use UNION ALL unless you really need the server to eliminate duplicate rows.
    • Without the ALL keyword, MySQL adds a DISTINCT option to temporary tables, which can result in a very expensive uniqueness check for the entire temporary table.
    • Even with the ALL keyword, MySQL still uses temporary tables to store results.

4.7.8 Static Query Analysis

Percona Toolkit contains pt-query-advisor, a tool that parses a log of queries, analyzes the query patterns, and gives annoyingly detailed advice about potentially bad practices in them.

4.7.9 Using User-defined Variables

  • A user-defined variable is a temporary container used to store content that exists throughout the connection to MySQL. This feature is useful when a mixture of procedural and relational logic is used in queries.

  • Usage:

    SET @one       := 1;

    SET @min_actor := (SELECT MIN(actor_id) FROM sakila.actor);

    SET @last_week := CURRENT_DATE - INTERVAL 1 WEEK;

    SELECT ... WHERE col <= @last_week;

    -- Has the "lvalue" property, which uses a variable while assigning a value to it

    SELECT actor_id, @rownum := @rownum + 1 AS rownum ...
  • Unusable scenarios:

    • Queries that use custom variables cannot use query caching
    • You cannot use custom variables where constants or identifiers are used, such as table names, column names, and LIMIT clauses.
    • User-defined variables live only within one connection, so they cannot be used to communicate between connections.
    • If you use connection pooling or persistent connections, custom variables can cause seemingly unrelated code interactions (if so, it’s usually a code or connection pool bug)
    • Before version 5.0, variable names were case sensitive, so be aware of compatibility issues across MySQL versions.
    • A variable’s type cannot be explicitly declared, and the point at which an undefined variable’s type is decided can also vary between MySQL versions. If you want an integer, float, or string, it is best to initialize the variable with 0, 0.0, or '' respectively. User-defined variables are dynamically typed; the type changes on assignment.
    • The MySQL optimizer may optimize these variables away in some scenarios, which may cause the code to not behave as expected.
    • The order and time of assignment are not always fixed, depending on the optimizer’s decision. The reality can be confusing.
    • The assignment symbol := has very low precedence, so it is important to note that assignment expressions should use unambiguous parentheses.
    • Using an undefined variable does not cause a syntax error, so it is very easy to make a mistake without noticing.
  • Application Scenarios:

    • Optimized ranking statement:

      Find the top 10 actors by the number of films they appeared in, ranked by that count; actors with the same film count share the same rank.

      mysql> SET @curr_cnt := 0, @prev_cnt := 0, @rank := 0;

      -> SELECT actor_id,

      -> @curr_cnt := cnt AS cnt,

      -> @rank := IF(@prev_cnt <> @curr_cnt, @rank + 1, @rank) AS rank,

      -> @prev_cnt := @curr_cnt AS dummy

      -> FROM (

      -> SELECT actor_id, COUNT(*) AS cnt

      -> FROM sakila.film_actor

      -> GROUP BY actor_id

      -> ORDER BY cnt DESC

      -> LIMIT 10

      -> ) as der;
    • Avoid repeated queries for newly updated data:

      Update a row and then retrieve information about it. Although this still looks like two queries and two network round-trips, the second query touches no tables and is very fast.

      UPDATE t1 SET lastUpdated = NOW() WHERE id = 1 AND @now := NOW();

      SELECT @now;
    • Count the number of updates and inserts

      When using INSERT … ON DUPLICATE KEY UPDATE, count how many rows were inserted versus how many were updated because of key conflicts.

      The essence of the approach is to increment the @x variable every time a conflict causes an update, and to multiply the expression by 0 so that it does not affect the value being updated.

      SET @x := 0;

      INSERT INTO t1(c1, c2) VALUES(4, 4), (2, 1), (3, 1)

      ON DUPLICATE KEY UPDATE

      c1 = VALUES(c1) + ( 0 * ( @x := @x + 1 ) );
    • Determine the order of values

      • One of the most commonly overlooked problems is that assignment and reading of a variable can happen at different stages of query execution.

        -- WHERE and SELECT are executed at different stages of query execution, while WHERE is executed before the ORDER BY file sorting operation.

        mysql> SET @rownum := 0;

        mysql> SELECT actor_id, @rownum := @rownum + 1 AS cnt

        -> FROM sakila.actor

        -> WHERE @rownum <= 1;

        +----------+------+

        | actor_id | cnt  |

        +----------+------+

        |        1 |    1 |

        |        2 |    2 |

        +----------+------+
      • Try to have variable assignments and values occur in the same phase of query execution.

        mysql> SET @rownum := 0;

        mysql> SELECT actor_id, @rownum AS rownum

        -> FROM sakila.actor

        -> WHERE (@rownum := @rownum + 1) <= 1;
      • Place the assignment inside LEAST() so that it completes without changing the sort order at all. This trick is useful when you want to assign a variable without affecting the result of the clause it appears in. Other functions that can absorb a value this way are GREATEST(), LENGTH(), ISNULL(), NULLIF(), IF(), and COALESCE().

        -- LEAST(0, ...) always returns 0, so the sort key is unchanged

        mysql> SET @rownum := 0;

        mysql> SELECT actor_id, first_name, @rownum AS rownum

        -> FROM sakila.actor

        -> WHERE @rownum <= 1

        -> ORDER BY first_name, LEAST(0, @rownum := @rownum + 1);
  • Write lazy UNION:

    Suppose you need a UNION query whose first branch is executed preferentially: if a matching row is found there, the second branch is skipped. This is useful in some business scenarios, for example looking up “hot” data in a frequently accessed table first, and only then looking for “cold” data in a less frequently accessed table. (Separating hot and cold data is a good way to improve cache hit ratios.)

    Suppose we look up a user in two places: a primary users table and a table of long-inactive users, kept separate so that archiving is more efficient.

    The naive UNION query below looks up the archive table even when the record has already been found in the users table.

    SELECT id FROM users WHERE id = 123

    UNION ALL

    SELECT id FROM users_archived WHERE id = 123;

    A lazy UNION query suppresses this extra work: the second table is queried only when nothing is found in the first. Once a record is found in the first table, a variable @found is set, implemented by doing the assignment in the result column and wrapping it in GREATEST() so that no extra data is returned. To make it clear which table each result came from, a new column containing the table name is added. Finally, the variable must be reset to NULL at the end of the query so that it does not interfere with subsequent executions.

    SELECT GREATEST(@found := -1, id) AS id, 'users' AS which_tbl

    FROM users WHERE id = 1

    UNION ALL

    SELECT id, 'users_archived'

    FROM users_archived WHERE id = 1 AND @found IS NULL

    UNION ALL

    SELECT 1, 'reset' FROM DUAL WHERE ( @found := NULL ) IS NOT NULL;
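The same short-circuit can also live in the application when an extra round trip is acceptable. A sketch using Python's sqlite3 module as a stand-in database (table names mirror the example; the data is made up):

```python
import sqlite3

# In-memory stand-ins for the hot and cold tables from the example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE users_archived (id INTEGER PRIMARY KEY);
    INSERT INTO users VALUES (1);
    INSERT INTO users_archived VALUES (2);
""")

def find_user(user_id):
    """Check the hot table first; hit the cold table only on a miss."""
    row = conn.execute("SELECT id FROM users WHERE id = ?",
                       (user_id,)).fetchone()
    if row:
        return row[0], "users"
    row = conn.execute("SELECT id FROM users_archived WHERE id = ?",
                       (user_id,)).fetchone()
    if row:
        return row[0], "users_archived"
    return None

print(find_user(1))  # hot hit; the archive table is never touched
print(find_user(2))  # falls through to the archive
```

The lazy UNION achieves the same behavior in a single statement, saving the second round trip.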
  • Other uses of user-defined variables:

    • Variables can be assigned in any type of SQL statement, not just in SELECT statements. For example, you can improve the UPDATE statement in the same way you improved the ranking statement using subqueries earlier.
    • Sometimes tricks are needed to get the desired result, because the optimizer may treat a variable as a compile-time constant instead of performing the assignment. Placing assignments inside functions like LEAST() usually avoids this problem. Another option is to check whether the variable has been assigned before the query executes.
  • Other uses:

    • Calculate running totals and averages while the query executes
    • Emulate FIRST() and LAST() functions for GROUP BY queries
    • Perform calculations over large amounts of data
    • Calculate the MD5 hash of a large table
    • Write a sampling handler that resets a sample value to 0 when it exceeds a certain boundary
    • Emulate read/write cursors
    • Add a variable value to the WHERE clause of a SHOW statement

4.8 Case Study

4.8.1 Build a queue table using MySQL

Using MySQL to implement a queue is tricky, and many such systems perform poorly under high traffic and high concurrency. The typical pattern is a table containing several kinds of records: unprocessed, in process, processed, and so on. One or more consumer threads look for unprocessed records in the table, claim them as being processed, and update them to the processed state when the work completes. Mail sending, command processing, comment moderation, and similar tasks commonly use this pattern, but it often works less well than expected.

Why the naive approach performs poorly:

  • As the queue table grows and the index depth increases, finding unprocessed records becomes slower and slower.
  • Processing generally happens in two steps: find an unprocessed record, then lock it. Finding records adds load on the server, while locking causes contention among consumer processes, because claiming is a serialized operation.

Optimization process:

  • Split the queue table in two: archive processed records, or move them to a history table, so that the queue table always stays small.

  • Finding unprocessed records is then generally not a problem; if it becomes one, use messages to notify individual consumers instead of polling.

    • You can use an annotated SLEEP() function for timeout handling. This blocks the thread until it times out or another thread kills the current SLEEP with KILL QUERY. So when another batch of data is added to the queue table, you can use SHOW PROCESSLIST to find the sleeping threads by their comment and KILL them. Notification can also be implemented with GET_LOCK() and RELEASE_LOCK(), or outside the database, for example with a messaging service.

      SELECT /* waiting on unsent_emails */ SLEEP(10000), col1 FROM table;
  • The final issue is how consumers mark the records they are working on, so that multiple consumers do not process the same record.

    • Try to avoid SELECT FOR UPDATE, which is often the root cause of scalability problems: it causes many transactions to block and wait. This applies not just to queue tables but to all situations.

    • You can use UPDATE directly to claim records, then check whether any records still need processing. (Almost any SELECT FOR UPDATE can be rewritten in a similar way.)

      The owner column stores the connection ID of the consumer currently processing the record, as returned by CONNECTION_ID(); a value of 0 means no consumer is processing it.

      CREATE TABLE unsent_emails (

          id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,

          -- columns for the message, from, to, subject, etc.

          status ENUM('unsent', 'claimed', 'sent'),

          owner INT UNSIGNED NOT NULL DEFAULT 0,

          ts TIMESTAMP,

          KEY (owner, status, ts)

      );

      -- The usual approach. The SELECT query uses two columns of the index, so it should in theory be fast. The problem is the "gap" between the two queries, during which the locks held by one consumer block all the other, identical queries: every consumer uses the same index, scans the same part of it, and is likely to be blocked.

      BEGIN;

      SELECT id FROM unsent_emails

          WHERE owner = 0 AND status = 'unsent'

          LIMIT 10 FOR UPDATE;

      -- result: 123, 456, 789

      UPDATE unsent_emails

          SET status = 'claimed', owner = CONNECTION_ID()

          WHERE id IN(123, 456, 789);

      COMMIT;

      An improved, more efficient version eliminates the SELECT needed to find unclaimed records. The client protocol reports how many rows were updated, which tells you how many records were claimed this time.

      SET AUTOCOMMIT = 1;

      COMMIT;

      UPDATE unsent_emails

          SET status = 'claimed', owner = CONNECTION_ID()

          WHERE owner = 0 AND status = 'unsent'

          LIMIT 10;

      SET AUTOCOMMIT = 0;

      SELECT id FROM unsent_emails

          WHERE owner = CONNECTION_ID() AND status = 'claimed';

      -- result: 123, 456, 789
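The claim-first-then-read pattern translates directly to application code. A sketch with Python's sqlite3 (SQLite has no UPDATE ... LIMIT by default, so this version claims rows through a subquery; MY_ID is a made-up stand-in for CONNECTION_ID()):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE unsent_emails (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    status TEXT DEFAULT 'unsent',
    owner INTEGER DEFAULT 0)""")
conn.executemany("INSERT INTO unsent_emails (status) VALUES (?)",
                 [("unsent",)] * 15)

MY_ID = 42  # stand-in for CONNECTION_ID()

# Claim up to 10 unclaimed rows in a single UPDATE -- no SELECT FOR UPDATE,
# no gap between finding and locking.
conn.execute("""UPDATE unsent_emails
                SET status = 'claimed', owner = ?
                WHERE id IN (SELECT id FROM unsent_emails
                             WHERE owner = 0 AND status = 'unsent'
                             LIMIT 10)""",
             (MY_ID,))

# Read back exactly the rows this consumer claimed.
claimed = [r[0] for r in conn.execute(
    "SELECT id FROM unsent_emails WHERE owner = ? AND status = 'claimed'",
    (MY_ID,))]
print(len(claimed))  # 10
```

Because the claim is a single statement, two concurrent consumers can never read the same row as claimable and then race to lock it.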
  • Finally, a special case must be handled: records claimed by a process that then exits for some reason.

    • Periodically reset stale records with an UPDATE statement: first execute SHOW PROCESSLIST to get the IDs of the currently working threads, then add WHERE conditions so that records being actively processed are not touched.

      Given the live thread IDs (10, 20, 30), the following statement resets records that have been claimed for more than 10 minutes to the initial state.

      By placing the range condition at the end of the WHERE clause, this query can use all the columns of the index, and other queries can reuse the same index, avoiding the need to create an additional index just for this statement.

      UPDATE unsent_emails

          SET owner = 0, status = 'unsent'

        WHERE owner NOT IN(0, 10, 20, 30) AND status = 'claimed'

          AND ts < CURRENT_TIMESTAMP - INTERVAL 10 MINUTE;
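The reclaim step can be sketched the same way. An illustrative Python/sqlite3 version using Unix timestamps in place of MySQL's TIMESTAMP (the thread IDs, owners, and timings are all made up):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE unsent_emails (
    id INTEGER PRIMARY KEY, status TEXT, owner INTEGER, ts REAL)""")
now = time.time()
conn.executemany("INSERT INTO unsent_emails VALUES (?, ?, ?, ?)", [
    (1, "claimed", 99, now - 1200),  # stale: dead consumer, claimed 20 min ago
    (2, "claimed", 10, now - 1200),  # old, but owner 10 is still alive
    (3, "claimed", 99, now - 60),    # recent claim, leave it alone
])

alive = (0, 10, 20, 30)  # thread IDs from SHOW PROCESSLIST, plus 0
placeholders = ",".join("?" * len(alive))
# Reset rows whose owner is gone and whose claim is older than 10 minutes.
conn.execute(f"""UPDATE unsent_emails
                 SET owner = 0, status = 'unsent'
                 WHERE owner NOT IN ({placeholders})
                   AND status = 'claimed' AND ts < ?""",
             alive + (now - 600,))

reset = [r[0] for r in conn.execute(
    "SELECT id FROM unsent_emails WHERE status = 'unsent'")]
print(reset)  # only the genuinely abandoned row
```

Only row 1 is reset: row 2 belongs to a live thread, and row 3 was claimed too recently.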

Some basic principles in this case:

  • Do as little work as possible, and nothing at all if you can. Don’t use polling unless you have to, because it adds load and generates a lot of low-yield work.
  • Do what needs to be done as quickly as possible. Use UPDATE instead of SELECT FOR UPDATE, because the faster the transaction commits, the shorter the lock time, which can greatly reduce contention and speed up serial execution. Separate processed and unprocessed data to keep the data set small enough.
  • Another lesson from this case is that some queries cannot be optimized as written; consider using a different query or a different strategy to achieve the same goal. SELECT FOR UPDATE usually calls for this treatment.

Sometimes the best thing to do is to migrate the task queue out of the database, and Redis and memcached are good queue containers.

4.8.2 Calculate the distance between two points

MySQL is not recommended for storing and computing complex spatial data; PostgreSQL is a better choice for that. A typical example is finding all points within a certain radius of a center point: all houses available for rent near a given location, or “matching” nearby users on a social networking site.

Suppose we have the following table, where longitude and latitude are both measured in degrees:

CREATE TABLE locations (

  id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,

  name VARCHAR(30),

  lat FLOAT NOT NULL,

  lon FLOAT NOT NULL

);

INSERT INTO locations(name, lat, lon)

VALUES ('Charlottesville, Virginia', 38.03, -78.48),

       ('Chicago, Illinois', 41.85, -87.65),

       ('Washington, DC', 38.89, -77.04);

Assume the earth is a perfect sphere; then the great-circle formula gives the distance between two points. Given the coordinates latA and lonA, latB and lonB, the distance between point A and point B is calculated as follows:

ACOS(

COS(latA) * COS(latB) * COS(lonA - lonB)

+ SIN(latA) * SIN(latB)

)
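The same formula is easy to check in application code. A Python sketch using the sample coordinates from the table (3979 miles is the radius value the queries in this section use):

```python
from math import acos, cos, radians, sin

def great_circle_miles(lat_a, lon_a, lat_b, lon_b, radius=3979):
    """Spherical law of cosines, as in the SQL above; inputs in degrees."""
    lat_a, lon_a, lat_b, lon_b = map(radians, (lat_a, lon_a, lat_b, lon_b))
    central_angle = acos(cos(lat_a) * cos(lat_b) * cos(lon_a - lon_b)
                         + sin(lat_a) * sin(lat_b))
    return radius * central_angle  # angle in radians times Earth's radius

# Charlottesville -> Washington, DC (coordinates from the sample table)
d = great_circle_miles(38.03, -78.48, 38.89, -77.04)
print(round(d))
```

This puts the trigonometry in the application, which the optimization discussion below recommends when the database does not need to do the filtering itself.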

The result is in radians; to convert it into miles or kilometers, multiply it by the radius of the Earth.

SELECT * FROM locations WHERE 3979 * ACOS(

  COS(RADIANS(lat)) * COS(RADIANS(38.03)) * COS(RADIANS(lon) - RADIANS(-78.48))

  + SIN(RADIANS(lat)) * SIN(RADIANS(38.03))

) <= 100;

Queries like this not only fail to use an index, they also consume a lot of CPU time, put heavy load on the server, and repeat the same computation for every row.

Optimization:

  • First, consider whether we really need this much precision. The algorithm already has plenty of built-in imprecision:

    • The distance as the crow flies may be 100 miles, but the distance actually traveled between the two points is probably quite different.
    • If a person’s location is determined from a zip code, and distances to others are computed from the center of that area, the result is already an estimate.
  • If you don’t need much precision, you can treat the earth as a perfect sphere. For further optimization, do the trigonometry in the application rather than in the database.

  • Next, consider whether you really need to search a circle; a square is often good enough. The distance from the center of a square with 200-mile sides to a vertex is about 141 miles, which is not too far off from the precise 100-mile radius. Using the square, filter with a center-to-edge distance of 0.0253 radians (about 100 miles):

    SELECT * FROM locations

      WHERE lat BETWEEN 38.03 - DEGREES(0.0253) AND 38.03 + DEGREES(0.0253)

      AND lon BETWEEN -78.48 - DEGREES(0.0253) AND -78.48 + DEGREES(0.0253);
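Computed in the application, the square pre-filter is just four comparisons. A Python sketch using the sample coordinates (the function name is illustrative):

```python
from math import degrees

RADIUS_RADIANS = 0.0253          # about 100 miles expressed as an angle
center_lat, center_lon = 38.03, -78.48
delta = degrees(RADIUS_RADIANS)  # roughly 1.45 degrees of latitude/longitude

lat_lo, lat_hi = center_lat - delta, center_lat + delta
lon_lo, lon_hi = center_lon - delta, center_lon + delta

def in_square(lat, lon):
    """Cheap pre-filter equivalent to the two BETWEEN conditions."""
    return lat_lo <= lat <= lat_hi and lon_lo <= lon <= lon_hi

print(in_square(38.89, -77.04))  # Washington, DC: inside the square
print(in_square(41.85, -87.65))  # Chicago: outside
```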

Now let’s see how to optimize this query with indexes:

  • Adding an index on (lat, lon) does not work very well, because of the range conditions.
  • Optimize with IN().

Add two new columns to store FLOOR() approximations of the coordinates, then use IN() in the query to list all the integer values in the range:

mysql> ALTER TABLE locations

-> ADD lat_floor INT NOT NULL DEFAULT 0,

-> ADD lon_floor INT NOT NULL DEFAULT 0,

-> ADD KEY(lat_floor, lon_floor);

It is now possible to search on the approximate coordinate range, from the floor value to the ceiling value (for latitude, these correspond geographically to south and north):

-- Query all points in a range. Values need to be computed in the application, not MySQL

mysql> SELECT FLOOR(38.03 - DEGREES(0.0253)) AS lat_lb,

-> CEILING(38.03 + DEGREES(0.0253)) AS lat_ub,

-> FLOOR(-78.48 - DEGREES(0.0253)) AS lon_lb,

-> CEILING(-78.48 + DEGREES(0.0253)) AS lon_ub;

+--------+--------+--------+--------+

| lat_lb | lat_ub | lon_lb | lon_ub |

+--------+--------+--------+--------+

|     36 |     40 |    -80 |    -77 |

+--------+--------+--------+--------+

-- Generates integers IN the IN() list:

SELECT * FROM locations

  WHERE lat BETWEEN 38.03 - DEGREES(0.0253) AND 38.03 + DEGREES(0.0253)

  AND lon BETWEEN -78.48 - DEGREES(0.0253) AND -78.48 + DEGREES(0.0253)

  AND lat_floor IN(36,37,38,39,40) AND lon_floor IN(-80,-79,-78,-77);

Using approximations skews the results, so an additional condition is needed to filter out points that are in the square but outside the circle. This is similar to the CRC32 hash-index approach described earlier: use an index on the approximation to filter cheaply, then apply the exact condition to the matching records and discard those that fail it.

In fact, once rows are filtered by the integer approximation, the square's BETWEEN conditions can be dropped and the great-circle formula (or the Pythagorean theorem) applied directly:

SELECT * FROM locations

  WHERE lat_floor IN(36,37,38,39,40) AND lon_floor IN(-80,-79,-78,-77)

  AND 3979 * ACOS(

  COS(RADIANS(lat)) * COS(RADIANS(38.03)) * COS(RADIANS(lon) - RADIANS(-78.48))

  + SIN(RADIANS(lat)) * SIN(RADIANS(38.03))

) <= 100;

At this point, the calculation is back to an exact circle, but the query is now much faster. As long as most points can be filtered out cheaply, for example with the integer approximations and the index, the exact math afterwards is not expensive. Just don't run the great-circle formula against every row, because that is what makes it slow.

Optimization strategy used in this case:

  • Do as little as possible, or nothing at all. In this case, don’t evaluate the great-circle formula for every point; filter most of the data with a cheap approximation, then apply the expensive formula only to the much smaller remaining set.
  • Get things done quickly. Be sure to design queries with appropriate indexes as much as possible, and use approximations to avoid complex calculations.
  • Let the application do as much computation as possible when needed.

4.8.3 Using User-defined Functions

  • When SQL statements can no longer do certain tasks efficiently, C and C++ are good choices if you need more speed.
  • Example: you need to XOR pairs of random 64-bit numbers to check whether they match, against roughly 35 million records, with sub-second latency.
    • A simple calculation shows that it is impossible to do this in MySQL with current hardware conditions.
    • Write a matching program that runs as a daemon on the same server, then write a user-defined function that talks to it over a simple network protocol. Run in distributed fashion, the program can perform 4 million matches in 130 milliseconds. This approach pushes intensive computation onto general-purpose servers while remaining completely transparent to clients: it looks as if MySQL is doing all the work.

4.9 summarize

If creating high-performance applications is an interlocking puzzle, query optimization is the last piece, alongside the schema, index, and query-statement design described earlier.

Understanding how queries are executed and where the time goes is part of the response-time analysis described earlier. Adding knowledge of the parsing and optimization process gives a deeper view of how MySQL accesses tables and indexes, as discussed in the previous chapter, and another dimension for understanding the relationship between queries and indexes.

Optimization usually requires a three-pronged approach: do nothing, do little, and do it fast.