Query operations are the most frequently used operations in relational databases and form the basis of other SQL statements such as DELETE and UPDATE. When you want to delete or update some records, you first query them and then perform the appropriate SQL operation on them. So SELECT – based query operations are very important. Query processing can be divided into logical query processing and physical query processing. Logical query processing represents what results should be produced by executing the query, while physical queries represent how the MySQL database obtained the results. The methods of the two queries may be completely different, but the results must be the same

Logical query processing

The SQL language differs from other programming languages such as C, C++, Java, and Python, most notably in the order in which the code is processed. In most programming languages, code is processed in the order in which it is encoded. But in SQL, the first clause to be processed is always the FROM clause

(8)SELECT (9)DISTINCT <select_list>

(1)FROM <left_table>

(3)<join_type>JOIN <right_table>

(2)ON<join_condition>

(4)WHERE<where_condition>

(5)GROUP BY<group_by_list>

(6)WITH {CUBE|ROLLUP}

(7)HAVING<having_condition>

(10)ORDER BY<order_by_list>

(11)LIMIT<limit_number>

Copy the code

The sequence number of a query statement indicates the processing order of the query statement

You can see that there are 11 steps, starting with the FROM operation and ending with the LIMIT operation. Each operation produces a virtual table that serves as input to a process. These virtual tables are transparent to the user, and only the virtual tables generated in the last step are returned to the user. If a clause is not specified in the query, the corresponding step is skipped

Analyze the various stages of query processing:

  1. FROM: Perform the Cartesianproduct of the left table < left_TABLE > and the right table

    in the FROM clause to produce the virtual table VT1
  2. ON: Applies ON filtering to virtual table VT1. Only rows matching

    are inserted into virtual table VT2
  3. JOIN: If the OUTER JOIN (such as LEFT OUTER JOIN, RIGHT OUTER JOIN) is specified, then unmatched rows from the table are added to the virtual table VT2 as external rows, resulting in virtual table VT3. If the FROM clause contains more than two tables, repeat steps 1) to 3) for the resulting table VT3 FROM the previous join and for the next table until all tables are processed
  4. WHERE: The WHERE filter is applied to the virtual table VT3. Only the records matching

    are inserted into the virtual table VT4
  5. GROUP BY: GROUP the records in VT4 according to the columns in the GROUP BY clause, resulting in VT5
  6. CUBE | ROLLUP: CUBE or a ROLLUP operation on table VT5 VT6 produce table
  7. HAVING: The HAVING filter is applied to virtual table VT6. Only records matching

    are inserted into virtual table VT7.
  8. SELECT: Performs the second SELECT operation to SELECT the specified column and insert it into the virtual table VT8
  9. DISTINCT: deletes duplicate data and generates virtual table VT9
  10. ORDER BY: The virtual table VT9 is sorted according to

    to produce virtual table VT10. 11) 11. LIMIT: Retrieve the records in the specified row, generate virtual table VT11, and return to the query user

The following details the 11 stages of logical processing with a query example. Start by creating a user data table, customers and Orders, and populating it with a certain amount of data, according to the code below

create table customers

(

    customer_id VARCHAR(10),

    city VARCHAR(10) NOT NULL,

    PRIMARY KEY(customer_id)

)ENGINE=InnoDB;



INSERT INTO customers VALUES('163'.'HangZhou'), ('9you'.'ShangHai'), ('TX'.'HangZhou'), ('baidu'.'HangZhou');



create table orders

(

    order_id INT AUTO_INCREMENT,

    customer_id VARCHAR(10),

    PRIMARY KEY(order_id)

)ENGINE=InnoDB;



INSERT INTO orders VALUES(1, '163'), (2,'163'), (3,'9you'), (4,'9you'), (5,'9you'), (6,'TX'),(7, NULL);

Copy the code

Customers table records

The orders table records

[Through the following statement to query from Hangzhou and the order number of less than 2 customers, and query out their order number, query results in order from small to large]

SELECT c.customer_id,count(o.order_id) AS total_orders 

FROM customers as c

LEFT JOIN orders as o

ON c.customer_id = o.customer_id

WHERE c.city = 'HangZhou'

GROUP BY c.customer_id

HAVING count(o.order_id) < 2

ORDER BY total_orders DESC;

Copy the code

Customers from Hangzhou with order number less than 2

The following section analyzes the execution process of the SQL

The first step is to perform a cartesian product, also known as a CrossJoin, on the two tables before and after the FROM clause to generate the virtual table VT1. If the table before the FROM clause contains row A and the table after the FROM clause contains row B, then the virtual table VT1 will contain row A *b. The columns of the virtual table VT1 are defined by the source table. For the previous SQL query, the Cartesian product of the tables Orders and Customers is performed first

FROM customers as c ....... JOIN orders as o

Copy the code

The virtual table VT1 returned by cartesian product

(2) Apply ON filter to SELECT query, there are three filter processes, respectively ON, WHERE, HAVING. ON is the first filtering process performed. According to the virtual table VT1 generated in the previous section, the filter criteria are:

ON c.customer_id = o.customer_id

Copy the code

For most programming languages, there are only two values for logical expressions: TRUE and FALSE. But there are not only two types of logical expressions that play a role in relational databases. There is also an expression called three-valued logic. This is because NULL values are compared differently in the database than in most programming languages. In C, a comparison of NULL ==NULL returns 1, which is equal. In a relational database, a comparison of NULL is completely different. For example:A comparison of the first NULL value returns NULL instead of 0, and a comparison of the second NULL value still returns NULL instead of 1. If the comparison returns a NULL value, the user should treat it as UNKNOWN, which means UNKNOWN. Because in some cases, a NULL return value may represent 1, that is, NULL equals NULL, and sometimes a NULL return value may represent 0.

For the comparison of NULL values in the ON filter condition, the comparison result is UNKNOWN but is treated as FALSE, that is, the two NULL values are not the same. However, a comparison of two NULL values is considered equal in two cases:

  • The GROUP BY clause groups all NULL values into the same GROUP
  • The ORDER BY clause arranges all NULL values together (you can test this for yourself)

Therefore, when the virtual table VT2 is generated, an extra column is added to represent the return value of the ON filter. The return value can be TRUE, FALSE, or UNKNOWN. VT2 = VT2; VT2 = VT2

(3) Add external rows

customers as c LEFT JOIN orders as o

Copy the code

This step occurs only when the JOIN type is OUTERJOIN, such as LEFT OUTERJOIN, RIGHT OUTERJOIN, or FULL OUTERJOIN. Although we can omit the OUTER keyword most of the time, the OUTER row represents the OUTER row. Sign LEFT the LEFT OUTER JOIN submission to survive in a table, RIGHT OUTER JOIN the sign RIGHT submission to survive in a table, FULL OUTER JOIN are around the table to keep watch. Adding external rows is to add the data filtered out by the filter conditions in the reserved table on the basis of VT2 table. The data in the non-reserved table is given NULL values. Finally, the virtual table VT3 is generatedIn this example, the reserved table is Customers, and the customer Baidu is filtered in the VT2 table because there is no order, so Baidu is added as an external row to the virtual table VT2, assigning NULL to the data in the non-reserved table. If more than two tables need to be joined, redo steps (1) to (3) at the beginning of this section for the virtual table VT3, and the resulting virtual table serves as the output of the next step

(4) Apply WHERE filter to perform WHERE condition filtering on the virtual table VT3 generated in the previous step. Only the records matching <where_condition> will be output to the virtual table VT4

There are two types of filtering that are not allowed when the WHERE filter is currently applied:

  • Filters such as where_condition=MIN(col) for statistics are not yet available in the WHERE filter because the data is not yet grouped
  • SELECT city as c FROM t WHERE c=’ShangHai’ SELECT city as c FROM t WHERE c=’ShangHai

Look at an example WHERE using a group filter query in the WHERE filter condition causes an error

SELECT customer_id,count(customer_id)

FROM orders

WHERE COUNT(customer_id)<2;

Copy the code

You can see that the MySQL database prompts that the grouping function is incorrectly used. Let’s look at an example of a column alias that goes wrong:

SELECT order_id as o, customer_id as c

FROM orders

WHERE c='163';

Copy the code

In the current step, the column alias is not supported and the MySQL database has thrown an error indicating the unknown column C

Select * from WHERE city=’HangZhou’ WHERE city=’HangZhou’ WHERE city=’HangZhouIn addition, filtering in the WHERE filter is different from filtering in the ON filter. For the OUTERJOIN filtering, records filtered out by the ON condition in the reserved table are added after the ON filter, while records filtered out by the WHERE condition are permanently filtered

(5) Grouping In this step, the virtual tables generated in the previous step are grouped according to the specified columns, and the virtual table VT5 is finally obtained

GROUP BY c.customer_id



SELECT * FROM customers as c LEFT JOIN orders as o ON c.customer_id = o.customer_id WHERE c.city='HangZhou' GROUP BY c.customer_id

Copy the code

Virtual table VT5

If the ROLLUP option is specified, an additional record is created and added to the end of the virtual table VT5 and the virtual table VT6 is generated. Because our query does not use ROLLUP, we will skip this step

This is the last conditional filter, HAVING already applied the ON and WHERE filters. In this step, the HAVING filter is applied to the virtual table generated in the previous step. HAVING is a filter that filters for grouping conditions. For the example query statement, the grouping condition is

HAVING count(o.order_id < 2)



SELECT c.customer_id,c.city,o.order_id,o.customer_id FROM customers as c LEFT JOIN orders as o ON c.customer_id = o.customer_id WHERE c.city = 'HangZhou' GROUP BY c.customer_id HAVING count(o.order_id)<2;

Copy the code

Therefore, delete the order customer_id 163 from the virtual table, and the generated virtual table VT6 virtual table VT6It is important to note that you cannot use COUNT (1) or COUNT (*) in this group, as this would COUNT rows added by OUTER JOIN and cause the final query result to be different from the expected result. In this example, you can only use COUNT O.oder_id to get the expected result

Note: Subqueries cannot be used as grouping aggregation functions, such as HAVING COUNT(SELECT…) <2 is illegal

Although SELECT is the first part of the query to be specified, it is not processed until Step 8. In this step, SELECT the column specified in SELECT from the virtual table generated in the previous step as:

SELECT c.customer_id,count(o.customer_id) AS total_orders



SELECT c.customer_id,count(o.customer_id) AS total_orders FROM customers as c LEFT JOIN orders as o ON c.customer_id = o.customer_id WHERE c.city = 'HangZhou' GROUP BY c.customer_id HAVING count(o.order_id)<2;

Copy the code

Virtual table VT7

(9) Applying DISTINCT clauses If a DISTINCT clause is specified in the query, a temporary table in memory is created (if not available in memory, it is put on disk). The structure of this in-memory temporary table is the same as that of the virtual table generated in the previous step, except that a unique index is added to the DISTINCT column to remove duplicate data.

Because DISTINCT is not specified in this SQL query, skip this step. In addition, for queries that use GROUP BY, using DISTINCT is redundant because no rows are removed because the groups are already in place

(10) Apply the ORDER BY clause to ORDER the virtual table from the previous step according to the columns specified in the ORDER BY clause, and return the new virtual table. You can also specify the sequence number of the columns in the SELECT list in the ORDER BY clause, as follows:

SELECT order_id,customer_id 

FROM orders

The ORDER BY 2, 1;



Is equivalent to



SELECT order_id,customer_id 

FROM orders

ORDER BY customer_id,order_id;

Copy the code

In general, it is not recommended to do sorting in this way, because the programmer might change the columns in the SELECT list and forget to change the list in ORDER BY. However, if the user has high requirements for network transmission, this can be regarded as a way to save the network transmission bytes

For the example, the ORDER BY clause is:

ORDER BY total_orders DESC



SELECT c.customer_id,count(o.customer_id) AS total_orders FROM customers as c LEFT JOIN orders as o ON c.customer_id = o.customer_id WHERE c.city = 'HangZhou' GROUP BY c.customer_id HAVING count(o.order_id)<2 ORDER BY total_orders DESC;

Copy the code

The final result is the virtual tableIt is believed that many DBAs and developers make the mistake of thinking that when selecting data in a table, the records are fetched in ORDER of the size of the primary key in the table, that is, as if the result were an ORDER BY. The main reason for this classic error is a failure to understand what a relational database really is

Relational database is developed on the basis of mathematics, and relation corresponds to the concept of set in mathematics. Common query operations in databases actually correspond to certain operations of sets: select, projection, join, union, intersection, difference, division. The final result is presented to the user in the form of a two-dimensional table, but from the perspective of the database is a series of set operations. Therefore, the records in a table need to be understood by the user in terms of collections

Because the data in a table is an element of a set, and a set is unordered. Therefore, for SQL statements without an ORDER BY clause, the parsing result should be: select the desired subset from the collection. This shows that the results don’t have to be orderly

Note: In the MySQL database, NULL values are always selected first during the ascending process, that is, NULL values are treated as minimum values in the ORDER BY clause

The LIMIT clause is applied in this step to select data from the virtual table in the previous step for the specified row starting at the specified location. For a LIMIT clause that does not apply ORDER BY, the result can also be unordered, so the LIMIT clause is often used with the ORDER BY clause

Since the SQL statement in the example does not have a LIMIT clause, the final result should be as follows:

Physical query processing

The previous section described logical query processing and described what results should be obtained from executing the query. But the database may not perform queries exactly the way logical query processing does. We know that there are two components in the MySQL database layer: Parser and Optimizer. The job of the Parser is to analyze THE SQL statement, and the job of the Optimizer is to optimize the SQL statement and select the best path to select the data, but the final result of the physical query processing must be equal to the logical query processing

If there is an index on the table, the optimizer determines whether the SQL statement can use the index for optimization. If there is no index available, the execution of the entire SQL statement can be very costly. Here’s an example:

CREATE TABLE x(

    a int

)ENGINE=InnoDB;



CREATE TABLE y(

    a int

)ENGINE=InnoDB;

Copy the code

Suppose 10W and 18W pieces of data are inserted into table X and table Y respectively, and no indexes are created in the two tables. Therefore, the final execution result parsed by SQL parser is a logical processing step, that is, according to the analysis above, a total of 11 steps are taken to query the data. First, a virtual table VT1 is generated based on the Cartesian product. Table X has 100,000 rows, and table Y has 180,000 rows. This means that the cartesian product produced virtual table VT1 has a total of 18 billion rows! So running this SQL statement on a dual-core laptop with an InnoDB buffer pool of 128MB takes a total of more than 50 minutes.

One might argue that the 128MB InnoDB buffer pool is too small to hold so much data in memory that execution would take such a long time. In fact, neither table X nor table Y is larger than 20MB, which is large enough to be stored in a 128MB memory buffer pool. The main reason for the slow execution of the statement is that it needs to generate data 18 billion times. Even generating data that many times in memory takes a long time. However, if you add a primary key to table Y and execute the SQL statement, you will be surprised to find that it takes less than a second

Performance increased by more than 3000 times! The reason for this significant reduction in query time is simply that adding indexes prevents cartesian tables from being created, thus dramatically shortening the statement run time. We can use the EXPLAIN command to see how the MySQL database actually chooses to execute after being optimized by the SQL optimizer. Indexes in MySQL, and what each column means in the EXPLAIN execution plan, I have previously put together an article called “An Article To Familiarize You with MyQL Indexes.”

{

    "code": 1,

    "result"false.

    "message""Please log in first.".

    "meta": {

        "request_time": 0.093.

        "timestamp": 1594522447

    }

}

Copy the code