It takes about 20 minutes to read this article!
Hi, everybody. We’ve already seen some implementations of count(*). Today we’ll look at how ORDER BY is implemented and the concepts involved along the way.
Full-field sort
So let’s prepare some examples, and let’s say our table looks like this.
CREATE TABLE `t` (
  `id` INT(11) NOT NULL,
  `city` VARCHAR(16) NOT NULL,
  `name` VARCHAR(16) NOT NULL,
  `age` INT(11) NOT NULL,
  `addr` VARCHAR(128) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `city` (`city`)
) ENGINE = InnoDB;
Based on the above table structure, we execute the following SQL
select city, name, age from t where city='hangzhou' order by name limit 1000;
I’m sure many of you write SQL like this fluently. So today let’s talk a bit about how it works internally.
Let’s borrow a diagram from Mr. Ding Qi.
As you can see from the figure, the rows that satisfy the condition city='hangzhou' are the records from ID_X to ID_(X+N).
So its execution flow is:
- Initialize sort_buffer, putting in the three fields to be returned: name, city, and age.
- Find the first primary key id that satisfies city='hangzhou' from index city; call it ID_X.
- Use ID_X to look up the whole row in the primary key index tree, take the values of name, city, and age, and store them in sort_buffer.
- Go back to index city and fetch the primary key id of the next record.
- Repeat steps 3 and 4 until the city value no longer satisfies the condition, then exit.
- Sort the data in sort_buffer by the field name.
- Take the first 1000 rows of the sorted result and return them to the client (without the `limit 1000`, all matching rows would be returned).
The following is the flow chart
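The flow above can be sketched as a toy Python function. This is not MySQL internals; `table` and `city_index` are hypothetical stand-ins for the primary key index (id → whole row) and the secondary index on city (city value → primary key ids), just to make the steps concrete.

```python
# Toy stand-ins for the primary key index and the secondary index on city.
table = {
    1: {"city": "hangzhou", "name": "bob",   "age": 30, "addr": "x"},
    2: {"city": "shanghai", "name": "eve",   "age": 25, "addr": "y"},
    3: {"city": "hangzhou", "name": "alice", "age": 28, "addr": "z"},
    4: {"city": "hangzhou", "name": "carol", "age": 35, "addr": "w"},
}
city_index = {"hangzhou": [1, 3, 4], "shanghai": [2]}

def full_field_sort(city, limit):
    sort_buffer = []
    # Steps 2-5: for each matching id in the city index, go back to the
    # primary key index for the whole row and keep (name, city, age).
    for pk in city_index.get(city, []):
        row = table[pk]
        sort_buffer.append((row["name"], row["city"], row["age"]))
    # Step 6: sort by name inside sort_buffer.
    sort_buffer.sort(key=lambda r: r[0])
    # Step 7: return only the first `limit` rows.
    return sort_buffer[:limit]

print(full_field_sort("hangzhou", 1000))  # all rows in name order
```

All three queried fields travel through sort_buffer, which is why this variant is called full-field sort.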
sort_buffer_size
MySQL makes the following optimization for sort performance. The boundary is the parameter sort_buffer_size. If the data to be sorted is smaller than this threshold, the sort is done in memory. If it is larger, the sort has to fall back to temporary files on disk, because holding that much data in memory would clearly be insufficient and would affect other SQL.
You can use the following method to determine whether a sort statement uses temporary files.
/* Open optimizer_trace, valid only for this thread */
SET optimizer_trace='enabled=on';
/* @a saves the initial value of Innodb_rows_read */
select VARIABLE_VALUE into @a from performance_schema.session_status where variable_name = 'Innodb_rows_read';
/* Execute the statement */
select city, name,age from t where city='hangzhou' order by name limit 1000;
/* View the OPTIMIZER_TRACE output */
SELECT * FROM `information_schema`.`OPTIMIZER_TRACE`\G
/* @b saves the current value of Innodb_rows_read */
select VARIABLE_VALUE into @b from performance_schema.session_status where variable_name = 'Innodb_rows_read';
/* Calculate the Innodb_rows_read difference */
select @b-@a;
Copy the code
Executing the above SQL yields the OPTIMIZER_TRACE output, which includes
number_of_tmp_files
This field tells you whether temporary files were used, and if so how many.
It can be seen from the figure that MySQL divides the data to be sorted into 12 pieces, and each piece is sorted separately and stored in these temporary files. Then combine the 12 ordered files into one large ordered file.
If sort_buffer_size is larger than the amount of data to sort, number_of_tmp_files is 0 and the sort completes in memory; if it is smaller, the data is split into chunks that each fit in sort_buffer_size, each chunk is sorted and spilled to a temporary file, and the sorted files are then merged. The smaller sort_buffer_size is, the more chunks the data is split into, and the larger number_of_tmp_files becomes.
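The external merge sort MySQL falls back to can be sketched as follows. This is a minimal illustration, not MySQL’s actual file format: `chunk_size` stands in for sort_buffer_size, and each temporary file holds one value per line.

```python
import heapq
import tempfile

def external_sort(values, chunk_size):
    if len(values) <= chunk_size:
        # Fits in "sort_buffer": in-memory sort, number_of_tmp_files = 0.
        return sorted(values), 0
    tmp_files = []
    # Split into chunks that "fit in memory", sort each, spill to disk.
    for start in range(0, len(values), chunk_size):
        chunk = sorted(values[start:start + chunk_size])
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{v}\n" for v in chunk)
        f.seek(0)
        tmp_files.append(f)
    # k-way merge of the sorted runs into one ordered result.
    runs = [(line.rstrip("\n") for line in f) for f in tmp_files]
    merged = list(heapq.merge(*runs))
    for f in tmp_files:
        f.close()
    return merged, len(tmp_files)

names = ["wang", "li", "zhao", "chen", "sun", "zhou"]
merged, n_files = external_sort(names, chunk_size=2)
print(merged)   # fully sorted
print(n_files)  # 3 temporary files: 6 names split into chunks of 2
```

The second return value plays the role of number_of_tmp_files: shrink `chunk_size` and it grows, just as the trace output does when sort_buffer_size shrinks.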
examined_rows
Here examined_rows = 4000, meaning 4000 rows participated in the sort.
packed_additional_fields
The packed_additional_fields in sort_mode means that the sorting process “packs” the strings: even though the name field is defined as varchar(16), only the actual length of each value is allocated during sorting.
Meanwhile, the final query, select @b-@a, returns 4000, meaning only 4000 rows were scanned in total.
A note on the engine used for internal temporary tables:
To avoid interfering with the conclusion, I set internal_tmp_disk_storage_engine to MyISAM. Otherwise, the result of select @b-@a is 4001. This is because a temporary table is required to query OPTIMIZER_TRACE, and internal_tmp_disk_storage_engine defaults to InnoDB. If the InnoDB engine is used, the value of Innodb_rows_read will be increased by 1 when data is fetched from temporary tables.
Rowid sort
The process above reads data from the original table only once; everything else happens in sort_buffer and temporary files. The problem with this algorithm is that if the query returns many fields, sort_buffer holds very wide rows, so fewer rows fit in memory at a time, the data is split across many temporary files, and sorting performance suffers.
In short, if single rows are large, this method is not efficient enough.
max_length_for_sort_data
This is where the parameter max_length_for_sort_data comes in.
max_length_for_sort_data is a MySQL parameter that controls the maximum row length used for sorting. If the total length of a single row to be sorted exceeds this value, MySQL considers the row too large and switches algorithms.
If max_length_for_sort_data is set to 16, sort_buffer holds only the name and primary key id. The sorted result cannot be returned directly, because the city and age fields are missing, and the whole execution flow becomes:
- Initialize sort_buffer, putting in two fields: name and id;
- Find the first primary key id that satisfies city='hangzhou' from index city (ID_X);
- Use that id to look up the whole row in the primary key index, take the values of name and id, and store them in sort_buffer;
- Fetch the primary key id of the next record from index city;
- Repeat steps 3 and 4 until the condition city='hangzhou' is no longer met, i.e. ID_Y in the figure;
- Sort the data in sort_buffer by the field name;
- Iterate through the sorted result, take the first 1000 rows, and go back to the original table by id to retrieve the city, name, and age fields and return them to the client. See the following figure.
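The rowid variant can be sketched with the same toy structures (hypothetical stand-ins, not MySQL internals): only (name, id) pairs enter sort_buffer, and after sorting, the first `limit` ids go back to the table for the remaining columns.

```python
# Toy stand-ins for the primary key index and the secondary index on city.
table = {
    1: {"city": "hangzhou", "name": "bob",   "age": 30, "addr": "x"},
    2: {"city": "shanghai", "name": "eve",   "age": 25, "addr": "y"},
    3: {"city": "hangzhou", "name": "alice", "age": 28, "addr": "z"},
    4: {"city": "hangzhou", "name": "carol", "age": 35, "addr": "w"},
}
city_index = {"hangzhou": [1, 3, 4], "shanghai": [2]}

def rowid_sort(city, limit):
    sort_buffer = []
    for pk in city_index.get(city, []):
        sort_buffer.append((table[pk]["name"], pk))  # only (name, id) stored
    sort_buffer.sort(key=lambda r: r[0])             # sort by name
    result = []
    # Extra step vs. full-field sort: go back to the table for each of the
    # first `limit` sorted ids to fetch city, name, age.
    for _name, pk in sort_buffer[:limit]:
        row = table[pk]
        result.append((row["city"], row["name"], row["age"]))
    return result

print(rowid_sort("hangzhou", 1000))
```

The narrower sort_buffer rows are what let more rows fit per sort pass; the price is the second loop of table lookups.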
Compared with the first algorithm, rowid sort adds one more round of lookups back to the table. As a result:
examined_rows is still 4000, meaning 4000 rows participated in the sort. But select @b-@a now returns 5000, because besides the sort itself, after sorting finishes the server goes back to the original table by id to fetch the remaining columns. Since the statement is limit 1000, 1000 extra rows are read.
sort_mode also becomes sort_key, rowid, which means that only the name and id fields participated in the sort.
Number_of_tmp_files becomes 10 because the number of rows involved in sorting is still 4000, but each row is smaller, so the total amount of data to sort is smaller, and fewer temporary files are required.
Full-field sort vs. rowid sort
- If MySQL worries that the sort memory is too small and sorting efficiency will suffer, it uses the rowid sort algorithm, which lets each sort pass handle more rows, but requires going back to the original table for the remaining data.
- If MySQL determines that the memory is large enough, it will select a full-field sort and place all the required fields in sort_buffer. This will return the query result directly from memory, without having to go back to the original table to retrieve the data.
This reflects one of MySQL’s design philosophies: if you have enough memory, use more memory to minimize disk access.
For InnoDB tables, rowid sort requires extra disk reads back to the table, so it is not preferred.
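The decision between the two algorithms can be illustrated with a toy function. This is a hypothetical sketch, not MySQL’s actual accounting: the field lengths are the declared sizes from our table definition, and the rule is simply “switch to rowid sort when the queried columns exceed max_length_for_sort_data”.

```python
# Declared field sizes from the `t` table definition (illustrative only).
FIELD_LEN = {"id": 4, "city": 16, "name": 16, "age": 4, "addr": 128}

def choose_sort_algorithm(query_fields, max_length_for_sort_data):
    row_len = sum(FIELD_LEN[f] for f in query_fields)
    if row_len <= max_length_for_sort_data:
        return "full-field sort"   # all queried columns fit in sort_buffer rows
    return "rowid sort"            # row too wide: keep only sort key + id

# (city, name, age) = 36 bytes fits under a generous threshold:
print(choose_sort_algorithm(["city", "name", "age"], 1024))
# With max_length_for_sort_data = 16, 36 > 16, so rowid sort is chosen:
print(choose_sort_algorithm(["city", "name", "age"], 16))
```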
So much for the introduction. Sorting in MySQL is a costly operation. You might ask: does every ORDER BY require a sort? If the correct result can be produced without sorting, the cost drops considerably and the statement runs faster.
Does every ORDER BY need a sort?
Definitely not. Not all ORDER BY queries need sorting. A sort was required above because the rows retrieved from the table were unordered by name; if the retrieved rows were already ordered, no sort would be necessary.
If you can ensure that rows retrieved from city are naturally sorted incrementally by name, you don’t have to do that anymore.
That’s exactly what a joint index on (city, name) guarantees:
alter table t add index city_user(city, name);
Let’s use Mr. Ding Qi’s diagram again.
In this index, we can still use tree search to locate the first record that meets city='hangzhou', and we are also guaranteed that, as long as the city value stays 'hangzhou', traversing to the “next record” keeps the name values in order.
Then the flow of the whole query process becomes:
- Find the first primary key id that satisfies city='hangzhou' from the (city,name) index;
- Use that id to look up the whole row in the primary key index, take the values of name, city, and age, and return them as part of the result set;
- Fetch the primary key id of the next record from the (city,name) index;
- Repeat steps 2 and 3 until the 1000th record is found, or the condition city='hangzhou' is no longer met.
As you can see, this query process does not require temporary tables, nor does it require sorting. Next, we use the results of explain to confirm.
As you can see from the figure, Extra no longer contains Using filesort. And because the joint index (city,name) is itself ordered, the query does not have to read all 4000 rows; it can exit as soon as the first 1000 records that meet the condition are found. So in our case, only 1000 rows need to be scanned.
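The early exit can be sketched as follows. `city_name_index` is a hypothetical stand-in for the (city, name) joint index, kept sorted the way a B+ tree would keep it: (city, name) → primary key id.

```python
# Toy stand-ins: the (city, name) joint index and the primary key index.
city_name_index = [
    (("hangzhou", "alice"), 3),
    (("hangzhou", "bob"),   1),
    (("hangzhou", "carol"), 4),
    (("shanghai", "eve"),   2),
]
table = {
    1: {"city": "hangzhou", "name": "bob",   "age": 30},
    2: {"city": "shanghai", "name": "eve",   "age": 25},
    3: {"city": "hangzhou", "name": "alice", "age": 28},
    4: {"city": "hangzhou", "name": "carol", "age": 35},
}

def ordered_scan(city, limit):
    result, scanned = [], 0
    for (c, name), pk in city_name_index:
        if c != city:
            continue  # (a real B+ tree seeks straight to the first match)
        scanned += 1
        row = table[pk]
        result.append((row["city"], row["name"], row["age"]))
        if len(result) == limit:
            break  # early exit: no need to read all matching rows
    return result, scanned

rows, scanned = ordered_scan("hangzhou", 2)
print(rows)     # already in name order, no sort step
print(scanned)  # only `limit` rows examined
```

No sort_buffer, no temporary files: the index order is the output order.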
Further optimization
A covering index removes the table lookups that the previous joint index still required.
For this query, we can create a joint index of city, name, and age. The corresponding SQL statement is as follows:
alter table t add index city_user_age(city, name, age);
In this case, rows with the same value of the city field are sorted incrementally by the value of the name field, and the query does not need to be sorted. The entire query execution flow becomes:
- Find the first record that meets city='hangzhou' from the (city,name,age) index, take the values of the three fields city, name, and age, and return them directly as part of the result set;
- Fetch the next record from the (city,name,age) index, also take these three fields, and return them directly as part of the result set;
- Repeat step 2 until the 1000th record is found, or the condition city='hangzhou' is no longer met. The flow chart and explain output are shown below.
The Extra field contains “Using index”, indicating that a covering index is used, and performance is much faster. Of course, this does not mean you should create a joint index over every field a statement touches just so every query can be covered; indexes have maintenance costs. It’s a tradeoff.
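The difference from the previous section can be made concrete with one more toy sketch: with a hypothetical (city, name, age) index, the index entry itself carries every column the query needs, so there is no lookup back to the primary key index at all.

```python
# Toy stand-in for the (city, name, age) covering index: every queried
# column lives in the index entry, sorted the way a B+ tree would sort it.
city_name_age_index = [
    (("hangzhou", "alice", 28), 3),
    (("hangzhou", "bob",   30), 1),
    (("hangzhou", "carol", 35), 4),
    (("shanghai", "eve",   25), 2),
]

def covering_scan(city, limit):
    result = []
    for (c, name, age), _pk in city_name_age_index:
        if c != city:
            continue  # (a real B+ tree seeks straight to the first match)
        result.append((c, name, age))  # answered from the index entry alone
        if len(result) == limit:
            break
    return result

print(covering_scan("hangzhou", 2))
```

Note that the primary key id is never used: that is what “Using index” in Extra is telling you.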
Conclusion
That’s a rough introduction to the two ORDER BY algorithms and how to optimize around them.
Only by understanding how things work under the hood can good table-design instincts take root!