When I hear the word "sorting", my first impression is that almost every app sorts something somewhere: Taobao orders are sorted by purchase time, Bilibili comments are sorted by popularity… Of course, today we are not talking about how to sort elegantly in big data or how to improve sort performance. We are talking about sorting in MySQL.

What is the first thing that comes to mind when you think of sorting in MySQL? The order by keyword? That the order by field should ideally be indexed, since the leaf nodes of an index are already in order? Or that you should avoid sorting inside MySQL whenever possible?

How it all started

Now suppose we have a table of users’ friends:

CREATE TABLE `user` (
  `id` int(10) AUTO_INCREMENT,
  `user_id` int(10),
  `friend_addr` varchar(1000),
  `friend_name` varchar(100),  
  PRIMARY KEY (`id`),
  KEY `user_id` (`user_id`)
) ENGINE=InnoDB;

There are two points to note in the table:

  1. The fields are the user's user_id, the friend's name friend_name, and the friend's address friend_addr
  2. user_id is indexed

One day a junior developer, Little Ape, received a requirement from a junior product manager, Xiao Wang.

Xiao Wang: Comrade Little Ape, we need to add a feature to the admin backend. Given a user ID, it should look up the names and addresses of all of that user's friends, with the friends' names sorted in dictionary order.

Little Ape: OK, that is simple. I will get it online right away.

So Little Ape wrote this SQL:

select friend_name, friend_addr from user where user_id=? order by friend_name

In the blink of an eye, Little Ape's change went live, and everything ran smoothly until one day a colleague from operations triggered this query:

select friend_name, friend_addr from user where user_id=10086 order by friend_name

However, this query was unexpectedly much slower than usual, and the database reported a slow query. Little Ape panicked: what is going on? user_id is indexed, and the query only selects friend_name and friend_addr, so why is it slow? Little Ape kept telling himself to stay calm, and then suddenly remembered the explain command. He used explain to view the execution plan of that SQL and found a dangerous-looking phrase in the Extra column: Using filesort.

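The only thing that sets this query apart from the usual ones is the user: far more rows match user_id=10086 than for a typical user, so this query has much more data to read and sort.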

So how does Using filesort actually sort, and what is the principle behind it?

Dissecting file sort

Someone might say that the problem is simply too much data, and that adding limit 1000 to the query will fix it. That certainly eases the network bandwidth problem, since the packets are smaller overall, but the Using filesort problem is still not solved. How does it sort, and does it really sort in a file? Or let me put it another way: what would you do if you were asked to design the sorting? With these questions in mind, let's look at the technical difficulties behind Using filesort and how they are solved.

  1. Starting from the user_id index tree, find the first entry with user_id=10086 and get its primary key id. The query needs friend_name and friend_addr, but the user_id index alone cannot provide the values of these two fields
  2. So go back to the table: use the primary key id from the user_id index to look up the row in the primary key index, and read friend_name and friend_addr from it
  3. What next? Return the row immediately? No, because the results must be sorted by friend_name, and not all of the data has been read yet. The rows found so far have to be kept somewhere, and that place is sort_buffer. As the name suggests, sort_buffer is the buffer used for sorting in this situation. Note that each thread has its own sort_buffer, mainly to avoid lock contention caused by multiple threads operating on the same block of memory.
  4. Once the friend_name and friend_addr of the first row are in sort_buffer, the same steps are repeated until the friend_name and friend_addr of all rows with user_id=10086 are in sort_buffer
  5. Now the data in sort_buffer can be sorted: MySQL quick-sorts it by friend_name, after which the rows in sort_buffer are in friend_name order
  6. Finally, the first 1000 rows in sort_buffer are returned.
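The six steps above can be sketched in Python. This is only a toy simulation under stated assumptions, not MySQL's implementation: the dictionaries `clustered_index` and `user_id_index`, the function `full_field_sort`, and the sample rows are all hypothetical stand-ins for the primary key index, the user_id index, the executor, and real data.

```python
# Toy model of full-field sorting. All names and rows are illustrative
# assumptions, not MySQL internals.

# Primary key index: id -> full row.
clustered_index = {
    1: {"user_id": 10086, "friend_name": "carol", "friend_addr": "Shanghai"},
    2: {"user_id": 10010, "friend_name": "zed", "friend_addr": "Beijing"},
    3: {"user_id": 10086, "friend_name": "alice", "friend_addr": "Shenzhen"},
    4: {"user_id": 10086, "friend_name": "bob", "friend_addr": "Hangzhou"},
}

# Secondary index on user_id: user_id -> primary key ids.
user_id_index = {10086: [1, 3, 4], 10010: [2]}


def full_field_sort(user_id, limit):
    sort_buffer = []
    # Steps 1-4: walk the user_id index, go back to the table for each id,
    # and copy every selected field into sort_buffer.
    for pk in user_id_index.get(user_id, []):
        row = clustered_index[pk]
        sort_buffer.append((row["friend_name"], row["friend_addr"]))
    # Step 5: sort the buffer by friend_name (Python's built-in sort stands
    # in for the quicksort described in the text).
    sort_buffer.sort(key=lambda entry: entry[0])
    # Step 6: return the first `limit` rows.
    return sort_buffer[:limit]


result = full_field_sort(10086, limit=2)
print(result)
```

Note that every entry in `sort_buffer` carries friend_addr even though only friend_name takes part in the comparison.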

Memory is not infinite, so sort_buffer must have an upper limit, although it cannot be too small either. Its size is set by sort_buffer_size, which is 256K by default:

mysql> show variables  like 'sort_buffer_size';
+------------------+--------+
| Variable_name    | Value  |
+------------------+--------+
| sort_buffer_size | 262144 |
+------------------+--------+

If more than 256K of data has to go into sort_buffer, quick-sorting it entirely inside sort_buffer no longer works. You may ask: can't MySQL grow the buffer automatically to fit the data? Well, MySQL uses a multi-threaded model. If every thread grew its own buffer, less memory would be left for other features (such as the change buffer), which would degrade them.

A different way of sorting is needed, and this is where the real file sort comes in: temporary files on disk. MySQL borrows the idea of merge sort. It splits the data to be sorted into several chunks, sorts each chunk in memory and writes it to a temporary file, and finally merges the sorted temporary files. This is a typical divide-and-conquer approach, and the specific steps are as follows:

  1. Split the data to be sorted into chunks, each small enough to fit into sort_buffer
  2. Sort each chunk in sort_buffer and write it to a temporary file
  3. Once all the data has been written to temporary files, each temporary file is ordered internally, but the files are not yet ordered as a whole, so they still have to be merged
  4. Suppose there are two temporary files, tmpX and tmpY. Read part of tmpX into memory, then part of tmpY. Why only a part, rather than the whole file or a single record? Because disk is slow, as much data as possible should be read into memory at a time, but not too much, because buffer space is limited.
  5. Suppose tmpX[0-5] is read from tmpX and tmpY[0-5] is read from tmpY. The comparison then goes as follows:

If tmpX[0] < tmpY[0], then tmpX[0] must be the smallest. Next, tmpX[1] is compared with tmpY[0]; if tmpX[1] > tmpY[0], then tmpY[0] must be the second smallest… In the end, tmpX and tmpY are merged into one ordered file tmpZ, and several such tmpZ files can be merged again… Eventually, all the data is merged into one large ordered file.
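The merge procedure above can be illustrated with a small external sort in Python. This is a sketch of the general technique, not MySQL's code: `external_sort` and `buffer_rows` are hypothetical names, the buffer limit is counted in rows rather than bytes, and `heapq.merge` performs a k-way merge, which generalizes the two-file tmpX/tmpY comparison described above. The values are assumed to contain no newline characters.

```python
import heapq
import os
import tempfile


def external_sort(values, buffer_rows):
    """Sort strings using sorted runs on disk followed by a merge."""
    run_paths = []
    # Steps 1-2: take one buffer-sized chunk at a time, sort it in memory,
    # and write it out as a sorted temporary file (a "run").
    for start in range(0, len(values), buffer_rows):
        chunk = sorted(values[start:start + buffer_rows])
        with tempfile.NamedTemporaryFile("w", delete=False, suffix=".run") as run:
            run.writelines(value + "\n" for value in chunk)
            run_paths.append(run.name)
    # Steps 3-5: each run is ordered internally, so merging only needs to
    # compare the current head of every run and emit the smallest one.
    files = [open(path) for path in run_paths]
    try:
        streams = [(line.rstrip("\n") for line in f) for f in files]
        merged = list(heapq.merge(*streams))
    finally:
        for f in files:
            f.close()
        for path in run_paths:
            os.remove(path)
    return merged, len(run_paths)


names = ["mallory", "bob", "trent", "alice", "eve", "carol", "dave"]
merged, runs = external_sort(names, buffer_rows=3)
print(runs, merged)
```

With seven names and room for three rows per chunk, the data is split into three runs before merging. The run files are read lazily, line by line, so only a small part of each file is in memory at any moment.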

File sorting is slow. Is there another way?

File sorting involves sorting in batches and then merging, which is time-consuming. The root cause is that sort_buffer is too small for the data. You may have noticed that only friend_name is needed for sorting, yet we also put friend_addr into sort_buffer, so each row takes up the length of friend_name plus the length of friend_addr. If sort_buffer held only what the sort really needs, each row would be smaller and more rows would fit. That is the other kind of sort we are going to talk about, an optimization called rowid sort.

The idea of rowid sort is to keep unnecessary data out of sort_buffer and store only what is necessary. What counts as necessary? Only friend_name? That will not work, because then how would we get friend_addr after sorting? So the primary key id has to be stored as well. After sorting, the id is used to go back to the table once more and fetch friend_addr. The general flow is as follows:

  1. Use the user_id index to find a matching entry, go back to the table, and put only id and friend_name into sort_buffer
  2. Repeat step 1 until all the target data is in sort_buffer
  3. Sort the data in sort_buffer by the friend_name field
  4. After sorting, use each id to go back to the table and fetch friend_addr, returning rows until 1000 have been returned.

Here are a few things to note:

  1. This method goes back to the table twice
  2. Each row in sort_buffer is smaller, but if the amount of data is still too large, temporary files are still used for sorting
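A minimal Python sketch of rowid sort makes the trade-off visible. As before, `clustered_index`, `user_id_index`, `rowid_sort`, and the sample rows are hypothetical; the `table_lookups` counter exists only in this sketch to show the extra trips to the table.

```python
# Toy model of rowid sorting. All names and rows are illustrative
# assumptions, not MySQL internals.

clustered_index = {
    1: {"user_id": 10086, "friend_name": "carol", "friend_addr": "Shanghai"},
    2: {"user_id": 10010, "friend_name": "zed", "friend_addr": "Beijing"},
    3: {"user_id": 10086, "friend_name": "alice", "friend_addr": "Shenzhen"},
    4: {"user_id": 10086, "friend_name": "bob", "friend_addr": "Hangzhou"},
}
user_id_index = {10086: [1, 3, 4], 10010: [2]}


def rowid_sort(user_id, limit):
    table_lookups = 0
    sort_buffer = []
    # Steps 1-2: first trip to the table. Only the sort key and the primary
    # key id go into sort_buffer; friend_addr is left out.
    for pk in user_id_index.get(user_id, []):
        row = clustered_index[pk]
        table_lookups += 1
        sort_buffer.append((row["friend_name"], pk))
    # Step 3: sort by friend_name.
    sort_buffer.sort(key=lambda entry: entry[0])
    # Step 4: second trip to the table, only for the rows that are returned.
    result = []
    for friend_name, pk in sort_buffer[:limit]:
        row = clustered_index[pk]
        table_lookups += 1
        result.append((friend_name, row["friend_addr"]))
    return result, table_lookups


rows, lookups = rowid_sort(10086, limit=2)
print(rows, lookups)
```

Each buffer entry is now just a name and an id, but the query pays for it with additional table lookups: three in the first pass plus two for the returned rows in this example.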

So the question is, how does MySQL choose between the two methods? It decides based on the length of a single row. If the row to be placed in sort_buffer (here friend_name + friend_addr) is too long, rowid sort is used; otherwise, full-field sort is used. The threshold is max_length_for_sort_data, which is 1024 bytes by default here:

mysql> show variables like 'max_length_for_sort_data';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| max_length_for_sort_data | 1024  |
+--------------------------+-------+
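The selection rule can be written down as a tiny Python function. This is a deliberately simplified sketch: `choose_sort_mode` is a hypothetical name, the row length is counted naively from the declared column lengths, and the exact way MySQL measures row length (character set, per-field overhead, behavior at the boundary) is not modeled here.

```python
def choose_sort_mode(row_length, max_length_for_sort_data=1024):
    """Pick a sort strategy from the single-row length, per the rule above."""
    # Assumption: a row strictly longer than the threshold triggers rowid sort.
    if row_length > max_length_for_sort_data:
        return "rowid sort"
    return "full-field sort"


# friend_name varchar(100) + friend_addr varchar(1000), counted naively.
wide_row = choose_sort_mode(100 + 1000)
# The same query if friend_addr were only 200 long (hypothetical).
narrow_row = choose_sort_mode(100 + 200)
print(wide_row, narrow_row)
```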

No going back to the table, and no sorting either

Both methods above have to go back to the table because the secondary index does not contain the target fields, and both have to sort because the data is not in order. If the secondary index contained the target fields and was already in the required order, we would get the best of both.

That's right: a joint index. We only need to create a joint index on (user_id, friend_name, friend_addr). In one move, there is no need to go back to the table and no need to sort. For the SQL above, the general flow is then as follows:

  1. Use the joint index to find the first entry with user_id=10086 and return its friend_name and friend_addr directly
  2. Repeat step 1, following the leaf nodes forward until the first entry whose user_id is not 10086, then stop.
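The two steps above amount to a range scan over an ordered structure. The following Python sketch models the joint index as a sorted list of tuples; `joint_index`, `covering_index_scan`, and the sample entries are hypothetical, and a sorted list is only a stand-in for the leaf level of a B+ tree.

```python
import bisect

# Entries of a joint index on (user_id, friend_name, friend_addr), kept in
# key order. The rows are made up for illustration.
joint_index = sorted([
    (10086, "carol", "Shanghai"),
    (10010, "zed", "Beijing"),
    (10086, "alice", "Shenzhen"),
    (10086, "bob", "Hangzhou"),
    (10100, "amy", "Chengdu"),
])


def covering_index_scan(user_id, limit):
    result = []
    # Step 1: locate the first entry whose user_id matches. The one-element
    # tuple (user_id,) sorts before every entry that starts with user_id.
    pos = bisect.bisect_left(joint_index, (user_id,))
    # Step 2: follow the entries in order until user_id changes or enough
    # rows have been collected. No table lookup and no sort is needed.
    while (pos < len(joint_index)
           and joint_index[pos][0] == user_id
           and len(result) < limit):
        _, friend_name, friend_addr = joint_index[pos]
        result.append((friend_name, friend_addr))
        pos += 1
    return result


friends = covering_index_scan(10086, limit=1000)
print(friends)
```

Because entries with the same user_id are already ordered by friend_name inside the index, the rows come out sorted without any extra work.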

Although a joint index solves this problem, do not create joint indexes blindly in practice. Decide based on the actual business logic: if such queries are infrequent, a joint index is not worth it, because it takes up more storage space and adds maintenance overhead.

Conclusion

  1. When order by cannot use an index, Using filesort appears in the Extra column of explain
  2. Don't panic when Using filesort appears. If the amount of data is small, such as a few dozen rows, it is sorted quickly in sort buffer
  3. If the amount of data exceeds the size of sort buffer, temporary files are needed for sorting, that is, merge sort. This part is decided by the MySQL optimizer
  4. If the query returns many fields and you want to reduce the use of temporary files, you can try setting max_length_for_sort_data to less than the total length of the queried fields, so that rowid sort is used
  5. In real business scenarios, you can also create a joint index on field combinations that are queried frequently, which needs neither a trip back to the table nor a separate sort, but a joint index takes up more storage and adds overhead
  6. When querying large amounts of data, try to fetch it in batches. Running explain in advance to inspect the SQL execution plan is also a good habit.

Finally

Search WeChat for [pretend to understand programming] to learn and improve together with the author. Writing is not easy; your likes, favorites, and shares are the biggest support for the author and the biggest motivation to keep writing. See you next time.
