English and algorithms are two legs of a programmer

This article is intended for MySQL 5.6 and later

First, the problem

If the category column has no index and contains duplicate values, combining order by category with limit may not return rows in the order you expect.

Reproducing the problem:

Table structure (just two fields)

CREATE TABLE `ratings` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `category` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

Query all rows, sorted by category: select * from ratings order by category;

id category
1 1
5 1
10 1
3 2
4 2
6 2
9 2
2 3
7 3
8 3

Now limit the result to 5 rows: select * from ratings order by category limit 5;

The expected order of ids is 1, 5, 10, 3, 4.

But here are the actual results:

id category
1 1
10 1
5 1
3 2
4 2

What is going on? A MySQL bug?

Some readers may have run into this problem and fixed it after a quick search on Baidu or Google. But have you ever wondered whether the fix you found is the best one? How did other people work it out? Why does MySQL behave this way? Does it depend on the version?

Conclusions first:

  1. The optimal solution is to append a unique column to the sort, for example: order by category,id;
  2. Why does MySQL do this? The answer is speed! (MySQL 5.6 and later)
  3. The suboptimal solution is to add an index on the category column used in order by (why suboptimal? You’ll find out by the end of this article);

The rest of this article walks through how these three conclusions were reached.

1. The optimal solution

The MySQL documentation 8.2.1.19 LIMIT Query Optimization describes this scenario as follows:

If multiple rows have identical values in the ORDER BY columns, the server is free to return those rows in any order, and may do so differently depending on the overall execution plan. In other words, the sort order of those rows is nondeterministic with respect to the nonordered columns. One factor that affects the execution plan is LIMIT, so an ORDER BY query with and without LIMIT may return rows in different orders.

To sum up:

If the ORDER BY column contains duplicate values, the order of the rows returned can differ depending on whether the query has a LIMIT.

This is MySQL's default optimization for this scenario. If you need the row order to be the same with and without LIMIT, the documentation also gives a solution:

If it is important to ensure the same row order with and without LIMIT, include additional columns in the ORDER BY clause to make the order deterministic.

In other words, append an extra sort column (such as the id column) to the ORDER BY clause.

The description above first appeared in the MySQL 5.6 documentation; the ORDER BY ... LIMIT optimization was introduced in that release.

For the case in this article, select * from ratings order by category,id; solves the problem.

So why did MySQL introduce an optimization that looks like a bug?
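
Why the tiebreaker works can be sketched outside the database. The Python snippet below is only an illustration under stated assumptions: the rows are the ten (id, category) pairs from the ratings table above, and `order_by_category_id` is a hypothetical helper, not anything MySQL provides. Because the pair (category, id) is unique for every row, the result cannot depend on the order in which the rows are read.

```python
# Rows of the ratings table from this article, as (id, category) tuples.
ROWS = [(1, 1), (2, 3), (3, 2), (4, 2), (5, 1),
        (6, 2), (7, 3), (8, 3), (9, 2), (10, 1)]

def order_by_category_id(rows, limit=None):
    """Mimic ORDER BY category, id [LIMIT n] on in-memory tuples."""
    ordered = sorted(rows, key=lambda row: (row[1], row[0]))
    return ordered if limit is None else ordered[:limit]

# Read the rows in a different physical order: the result does not change,
# because no two rows share the same (category, id) sort key.
first_five = order_by_category_id(ROWS, limit=5)
same_five = order_by_category_id(list(reversed(ROWS)), limit=5)

print([row[0] for row in first_five])  # [1, 5, 10, 3, 4]
print(first_five == same_five)         # True
```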

2. MySQL ORDER BY logic

As the name implies, ORDER BY sorts the result set.

Run EXPLAIN on the query: explain select * from ratings order by category limit 5;

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ratings
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 10
     filtered: 100.00
        Extra: Using filesort
1 row in set, 1 warning (0.00 sec)


Extra: Using filesort shows that MySQL has to perform a sort to answer the query.

MySQL sorts in one of two ways, in memory (internal sort) or on disk (external sort):

  • If the amount of data to sort is less than the sort buffer size, the sort is done in memory (quicksort);

  • If the amount of data to sort is larger than the sort buffer size, external sort is done using temporary files (merge sort);
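
The two paths can be sketched roughly as follows, in Python for illustration. This is not MySQL's code: `filesort_sketch` and `buffer_capacity` are hypothetical names, the capacity counts keys rather than bytes, sorted in-memory chunks stand in for temporary files, and Python's `sorted` (Timsort) stands in for quicksort.

```python
import heapq

def filesort_sketch(keys, buffer_capacity):
    """Sort in memory if the keys fit, otherwise sort chunks and merge them."""
    if len(keys) <= buffer_capacity:
        return sorted(keys), "internal"
    # External path: each sorted chunk stands in for one run written to a temp file.
    chunks = [sorted(keys[i:i + buffer_capacity])
              for i in range(0, len(keys), buffer_capacity)]
    # heapq.merge lazily merges already-sorted inputs, like a merge pass over the runs.
    return list(heapq.merge(*chunks)), "external"

# Category values of the ten rows, in id order.
categories = [1, 3, 2, 2, 1, 2, 3, 3, 2, 1]
print(filesort_sketch(categories, buffer_capacity=16))
print(filesort_sketch(categories, buffer_capacity=4))
```

Both calls return the same sorted keys; only the path taken differs.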

Both methods sort the entire data set, so the presence or absence of a LIMIT does not change the order of the rows returned.

However, MySQL 5.6 added a small optimization for ORDER BY ... LIMIT (for the case where the sort column has no index and its values are not unique): the optimizer uses a priority queue for such statements.

The following pseudocode from filesort.cc describes this optimization:

while (get_next_sortkey())
{
  if (using priority queue)
    push sort key into queue
  else
  {
    try to put sort key into buffer;
    if (no free space in sort buffer)
    {
      do {
        allocate new, larger buffer;
        retry putting sort key into buffer;
      } until (record fits or no space for new buffer)
      if (no space for new buffer)
      {
        sort record pointers (all buffers);
        dump sorted sequence to 'tempfile';
        dump Merge_chunk describing sequence location into 'chunk_file';
      }
    }
    if (key was packed)
      tell sort buffer the actual number of bytes used;
  }
}
if (buffer has some elements && dumped at least once)
  sort-dump-dump as above;
else
  don't sort, leave sort buffer to be sorted by caller.

The logic of the optimization is described in WL#1393: Optimizing filesort with small limit:

Many web customers have to do
"SELECT ... ORDER BY non_index_column LIMIT X",

When X * <Row Size> is smaller than sort_buff_size we can use
the following algoritm to speed up the sort:

- Create a queue to hold 'limit' keys.
- Scan through the table and store the first (last if DESC) keys in the queue
- Return values from queue

This is much faster than the current algoritm that works as:

The worklog also records the result: 10 to 20 times faster than a quicksort.

So the reason is speed!

MySQL treats this scenario as a top-N problem, which a priority queue solves well.

3. Priority queue

A priority queue is essentially a heap; the java.util.PriorityQueue class in Java, for example, is built on a heap.

A quick explanation of what a heap is:

The heap is a complete binary tree;

The value of each node in the heap must be greater than or equal to (max-heap) or less than or equal to (min-heap) the value of every node in its subtree.
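
The two rules can be checked mechanically. The sketch below assumes the usual array layout for a complete binary tree, where the parent of index i sits at (i - 1) // 2; `is_heap` is a hypothetical helper written for this article.

```python
def is_heap(values, max_heap=False):
    """Check the heap property on an array-backed complete binary tree."""
    for child in range(1, len(values)):
        parent = (child - 1) // 2
        if max_heap and values[parent] < values[child]:
            return False  # a child is larger than its parent
        if not max_heap and values[parent] > values[child]:
            return False  # a child is smaller than its parent
    return True

print(is_heap([1, 1, 1, 2, 2]))                 # True: a min-heap
print(is_heap([3, 2, 2, 1, 1], max_heap=True))  # True: a max-heap
print(is_heap([2, 1, 3]))                       # False: the root is larger than a child
```

Checking each node against its direct parent is enough, since the property then holds for the whole subtree by transitivity.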

If MySQL used merge sort or quicksort, it would have to sort all the data and then take the first LIMIT rows; the work spent sorting the remaining rows would be wasted.

With a priority queue, MySQL can maintain a heap sized to the LIMIT value and pass every row through it once to get the result.

To verify that MySQL uses the priority queue, run the following statements:
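
A rough way to see the saving is to count key comparisons instead of measuring time. The sketch below is an illustration only: `CountedKey` and both counting functions are hypothetical, the input is 10,000 pseudo-random integers rather than real table rows, and Python's built-in sort and `heapq` stand in for MySQL's sort code.

```python
import heapq
import random

class CountedKey:
    """Wraps a value and counts every `<` comparison made on it."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        CountedKey.comparisons += 1
        return self.value < other.value

def comparisons_full_sort(values):
    """Sort everything, as a plain ORDER BY would."""
    CountedKey.comparisons = 0
    sorted(CountedKey(v) for v in values)
    return CountedKey.comparisons

def comparisons_bounded_heap(values, limit):
    """Keep only the `limit` smallest values in a heap of fixed size."""
    CountedKey.comparisons = 0
    heap = []
    for v in values:
        candidate = CountedKey(-v)  # negated so heapq's min-heap acts as a max-heap
        if len(heap) < limit:
            heapq.heappush(heap, candidate)
        elif heap[0] < candidate:   # v is smaller than the largest value kept
            heapq.heapreplace(heap, candidate)
    return CountedKey.comparisons

rng = random.Random(0)  # fixed seed so runs are repeatable
values = [rng.randrange(1000) for _ in range(10_000)]
print(comparisons_full_sort(values), comparisons_bounded_heap(values, limit=5))
```

Most rows cost the heap a single comparison against its root, which is why the second number comes out much smaller than the first.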

SET optimizer_trace='enabled=on';
select * from ratings order by category limit 5;
SELECT * FROM `information_schema`.`OPTIMIZER_TRACE`\G

The relevant part of the trace:

 "filesort_priority_queue_optimization": {
              "limit": 5,
              "chosen": true
            },

filesort_priority_queue_optimization.chosen = true shows that MySQL chose the priority queue for this query.
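
If you would rather check this from a script than read the trace by eye, a small search over the parsed JSON is enough. The snippet below is a sketch under assumptions: `find_key` is a hypothetical helper, and `trace_text` is a trimmed stand-in in which only the filesort_priority_queue_optimization fragment comes from the real output above; the surrounding "steps" wrapper is made up for the example.

```python
import json

def find_key(node, wanted):
    """Depth-first search for the first value stored under `wanted`."""
    if isinstance(node, dict):
        if wanted in node:
            return node[wanted]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, wanted)
        if found is not None:
            return found
    return None

# Trimmed stand-in for the TRACE column; the "steps" wrapper is an assumption.
trace_text = ('{"steps": [{"filesort_priority_queue_optimization": '
              '{"limit": 5, "chosen": true}}]}')
decision = find_key(json.loads(trace_text), "filesort_priority_queue_optimization")
print(decision["chosen"])  # True
```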

Priority queue execution logic (LIMIT 5 as an example)

Note: the min-heap in the figures is ordered by category value.

  1. Take the first five rows to form a min-heap.

  2. Take the next row (6,2). Its category, 2, is less than the largest category currently in the heap, 3, so delete (2,3) from the heap and add (6,2).

  3. Repeat step 2 until every row that matches the query has been compared against the heap.

That is how the priority queue finds the five rows with the smallest category values.

Finally, the result is produced by taking rows off the heap one at a time. Each time the smallest element is removed from the top, the last element is moved to the top and the heap is re-heapified as a min-heap.
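
The scan can be replayed in a few lines of Python. This is a sketch, not MySQL's implementation: it assumes a bounded max-heap (the largest category sits at the root so the row to evict is always on top), builds it with `heapq` by negating the category, and breaks ties by arrival order, which is purely a choice of this sketch. `top_n_by_category` is a hypothetical name.

```python
import heapq

# Rows of the ratings table in id order, as (id, category) tuples.
ROWS = [(1, 1), (2, 3), (3, 2), (4, 2), (5, 1),
        (6, 2), (7, 3), (8, 3), (9, 2), (10, 1)]

def top_n_by_category(rows, limit):
    """Keep the `limit` rows with the smallest category using a bounded heap."""
    heap = []       # entries: (-category, arrival, row); root = largest category
    evictions = []  # (row removed, row added), recorded for inspection
    for arrival, row in enumerate(rows):
        entry = (-row[1], arrival, row)
        if len(heap) < limit:
            heapq.heappush(heap, entry)
        elif row[1] < -heap[0][0]:  # strictly smaller than the largest kept category
            removed = heapq.heapreplace(heap, entry)
            evictions.append((removed[2], row))
    return [entry[2] for entry in heap], evictions

kept, evictions = top_n_by_category(ROWS, limit=5)
print(evictions)                       # [((2, 3), (6, 2)), ((3, 2), (10, 1))]
print(sorted(row[1] for row in kept))  # [1, 1, 1, 2, 2]
print(sorted(row[0] for row in kept))  # [1, 4, 5, 6, 10]
```

The first eviction is exactly step 2 above. Note that this sketch ends up keeping ids 4 and 6 for category 2, whereas MySQL returned 3 and 4: which of the tied rows survive depends on how the heap treats equal keys, which is the nondeterminism the documentation warns about.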

The rows come off the heap in the same order as the output of select * from ratings order by category limit 5;
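
The removal step (take the root, move the last element to the top, sift it down) can be sketched in Python as below. The two starting layouts are assumptions made for illustration, not a transcription of the original figures; both are valid min-heaps over the same five rows when only category is compared. `pop_min` and `drain` are hypothetical helpers.

```python
def pop_min(heap, key):
    """Remove the root, move the last element to the top, then sift it down."""
    root = heap[0]
    last = heap.pop()
    if heap:
        heap[0] = last
        i, n = 0, len(heap)
        while True:
            left, right, smallest = 2 * i + 1, 2 * i + 2, i
            if left < n and key(heap[left]) < key(heap[smallest]):
                smallest = left
            if right < n and key(heap[right]) < key(heap[smallest]):
                smallest = right
            if smallest == i:
                break
            heap[i], heap[smallest] = heap[smallest], heap[i]
            i = smallest
    return root

def drain(heap, key):
    """Pop until the heap is empty, returning rows in the order they come out."""
    out = []
    while heap:
        out.append(pop_min(heap, key))
    return out

def by_category(row):
    return row[1]  # rows are (id, category); ids are never compared

layout_a = [(1, 1), (5, 1), (10, 1), (3, 2), (4, 2)]
layout_b = [(1, 1), (10, 1), (5, 1), (3, 2), (4, 2)]
print([row[0] for row in drain(layout_a, by_category)])  # [1, 5, 10, 3, 4]
print([row[0] for row in drain(layout_b, by_category)])  # [1, 10, 5, 3, 4]
```

The same five rows come out in two different id orders depending only on where the tied rows happened to sit in the heap. The second order is the one MySQL returned in this article.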

4. Why is indexing a suboptimal solution

Clearly, following the ORDER BY logic above, we could also solve the problem by adding an index on the sort column, which skips the in-memory sort step entirely.

However, indexes are not a silver bullet. An extra index on category increases the maintenance cost of the table. Without a clear business need, adding an index just to sidestep the priority queue optimization costs more than it gains.

When the table is very large, the index itself can take up considerable space. In addition, in this article's scenario category is a classification column with a high proportion of repeated values, so MySQL may not choose this index even for business queries that filter by category.

To sum up, for this scenario I personally believe that order by category,id is the optimal solution.
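
A quick way to gauge how repetitive a column is, sketched here in Python: divide the number of distinct values by the number of rows. This ratio is only a common rule of thumb and an assumption of this example, not MySQL's actual cost model; `selectivity` is a hypothetical helper and the values are the ten rows from this article.

```python
def selectivity(values):
    """Distinct values divided by total rows; 1.0 means every value is unique."""
    return len(set(values)) / len(values) if values else 0.0

categories = [1, 3, 2, 2, 1, 2, 3, 3, 2, 1]  # category column, in id order
ids = list(range(1, 11))                      # id column

print(selectivity(categories))  # 0.3
print(selectivity(ids))         # 1.0
```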

PS: "This has nothing to do with me, I have never written SQL with LIMIT!"

Have you never written a CRUD feature with pagination? Have you read the PageHelper source code?

5. Summary

The case in this article is a real problem the class representative ran into while taking a feature to production. After asking around, it turned out that several colleagues had hit it too. Most articles online only scratch the surface, which is why this one was written.

It touches on data structures, PageHelper, and the MySQL documentation; the related reference materials are listed at the end of the article. If you have time to read the references yourself, I believe you will get even more out of them.
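
Pagination is where this bites in practice, because each page is a separate query. The Python sketch below assumes a page size of 5 and two hypothetical row orders, each a valid answer to order by category, standing in for what two separate executions might return. `page` is a made-up helper.

```python
# Rows of the ratings table in id order, as (id, category) tuples.
ROWS = [(1, 1), (2, 3), (3, 2), (4, 2), (5, 1),
        (6, 2), (7, 3), (8, 3), (9, 2), (10, 1)]
PAGE_SIZE = 5

def page(ordered_rows, number):
    """Return page `number` (1-based), like LIMIT offset, PAGE_SIZE."""
    start = (number - 1) * PAGE_SIZE
    return ordered_rows[start:start + PAGE_SIZE]

# Two hypothetical orders; both are sorted by category, only the ties differ.
order_for_page_1 = [(1, 1), (5, 1), (10, 1), (3, 2), (4, 2),
                    (6, 2), (9, 2), (2, 3), (7, 3), (8, 3)]
order_for_page_2 = [(10, 1), (5, 1), (1, 1), (9, 2), (6, 2),
                    (4, 2), (3, 2), (2, 3), (7, 3), (8, 3)]

shown = [r[0] for r in page(order_for_page_1, 1) + page(order_for_page_2, 2)]
repeated = sorted({i for i in shown if shown.count(i) > 1})
missing = sorted({r[0] for r in ROWS} - set(shown))
print(repeated, missing)  # [3, 4] [6, 9]

# With the id tiebreaker there is only one valid order, so pages never overlap.
stable = sorted(ROWS, key=lambda r: (r[1], r[0]))
stable_ids = [r[0] for r in page(stable, 1) + page(stable, 2)]
print(stable_ids)  # [1, 5, 10, 3, 4, 6, 9, 2, 7, 8]
```

In the first case a user paging through the list sees rows 3 and 4 twice and never sees rows 6 and 9.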

6. References:

  1. “The Beauty of Data Structures and Algorithms” by Wang Zheng, lectures 28 and 29
  2. “MySQL in Practice: 45 Lectures” by Lin Xiaobin, lectures 04, 05, 10, 16, 17
  3. 8.2.1.16 LIMIT Query Optimization: dev.mysql.com/doc/refman/…
  4. MySQL · Q&A · MySQL Sort pagination: mysql.taobao.org/monthly/201…
  5. filesort.cc: dev.mysql.com/doc/dev/mys…
  6. WL#1393: Optimizing filesort with small limit: dev.mysql.com/worklog/tas…

Follow “Java class representative” for the latest practical Java content.