This article applies to MySQL 5.6 and later.
0. The problem first
When the category column is not indexed and contains duplicate values, combining ORDER BY category with LIMIT can return rows in an order you do not expect.
Reproducing the problem:
Table structure (just two columns):
CREATE TABLE `ratings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`category` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;
Select * from ratings order by category;
id | category |
---|---|
1 | 1 |
5 | 1 |
10 | 1 |
3 | 2 |
4 | 2 |
6 | 2 |
9 | 2 |
2 | 3 |
7 | 3 |
8 | 3 |
Select * from ratings order by category limit 5;
The expected order of IDs is 1, 5, 10, 3, 4.
But the actual results are as follows:
id | category |
---|---|
1 | 1 |
10 | 1 |
5 | 1 |
3 | 2 |
4 | 2 |
What is going on? Is this a bug in MySQL?
Some readers may have run into this problem and fixed it right away with a quick search on Baidu or Google. But have you ever wondered whether the fix you found is the best one? How did others arrive at it? Why does MySQL behave this way? Is it version related?
The conclusions first:
- The best solution is to add a second sort column whose values are unique, such as order by category, id;
- Why does MySQL do this? The answer: for speed! (MySQL 5.6 and later.)
- The suboptimal solution is to add an index on category, the column used in order by (why is it suboptimal? Read on to find out).
Below, the class representative (the author) reconstructs how these three conclusions were reached.
1. The optimal solution
Section 8.2.1.19, LIMIT Query Optimization, of the MySQL documentation describes this scenario as follows:
If multiple rows have identical values in the ORDER BY columns, the server is free to return those rows in any order, and may do so differently depending on the overall execution plan. In other words, the sort order of those rows is nondeterministic with respect to the nonordered columns.
One factor that affects the execution plan is LIMIT, so an ORDER BY query with and without LIMIT may return rows in different orders.
To sum up:
When the ORDER BY column contains duplicate values, the order of the rows returned by the ORDER BY statement can differ depending on whether LIMIT is present.
If you need the row order to be the same with and without LIMIT, the documentation gives an official way to achieve it:
If it is important to ensure the same row order with and without LIMIT, include additional columns in the ORDER BY clause to make the order deterministic.
That is, add an additional sorting field (such as the ID field) after ORDER BY.
The description above first appeared in the MySQL 5.6 documentation; this optimization for ORDER BY ... LIMIT was introduced in that release.
For the case in this article, select * from ratings order by category, id; is all that is needed to solve the problem.
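To see why the extra column helps, here is a minimal Python sketch (not MySQL code). It assumes the article's ten rows written as (id, category) tuples; the function names are made up for illustration. Python's sort stands in for any sort whose tie order depends on how the rows happen to arrive: sorting on category alone leaves ties in arrival order, while sorting on (category, id) has exactly one possible result.

```python
import random

# The article's ten rows as (id, category) tuples.
ROWS = [(1, 1), (2, 3), (3, 2), (4, 2), (5, 1),
        (6, 2), (7, 3), (8, 3), (9, 2), (10, 1)]

def sort_by_category(rows):
    # Ties on category keep whatever order the input happened to have.
    return sorted(rows, key=lambda row: row[1])

def sort_by_category_then_id(rows):
    # id is unique, so (category, id) never ties: only one result is possible.
    return sorted(rows, key=lambda row: (row[1], row[0]))

shuffled = ROWS[:]
random.Random(42).shuffle(shuffled)  # simulate rows arriving in another order

first_five = sort_by_category_then_id(shuffled)[:5]
print(first_five)  # [(1, 1), (5, 1), (10, 1), (3, 2), (4, 2)]
```

Whatever order the rows arrive in, the first five ids are 1, 5, 10, 3, 4, which is the order expected at the start of the article.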
So the question is: why does MySQL make an optimization that looks like a bug?
2. MySQL ORDER BY logic
As the name implies, ORDER BY means sort.
explain select * from ratings order by category limit 5\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: ratings
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 10
filtered: 100.00
Extra: Using filesort
1 row in set, 1 warning (0.00 sec)
Extra: Using filesort shows that a sort is required.
Normally, MySQL has two kinds of sort, in-memory and external:
- If the amount of data to be sorted is smaller than sort_buffer_size, the sort is done in memory (quicksort);
- If the amount of data to be sorted is larger than sort_buffer_size, an external sort (merge sort) using temporary files is performed.
Obviously, both kinds of sort order the entire result set. By that logic LIMIT should make no difference: with or without it, the requested number of rows is simply taken in order from the fully sorted result.
However, MySQL 5.6 introduced a small optimization for ORDER BY ... LIMIT (relevant when the sort column is not indexed and its values are not unique): when the optimizer encounters an ORDER BY ... LIMIT statement, it sorts with a priority queue.
filesort.cc contains the following pseudocode describing this optimization:
while (get_next_sortkey())
{
  if (using priority queue)
    push sort key into queue
  else
  {
    try to put sort key into buffer;
    if (no free space in sort buffer)
    {
      do {
        allocate new, larger buffer;
        retry putting sort key into buffer;
      } until (record fits or no space for new buffer)
      if (no space for new buffer)
      {
        sort record pointers (all buffers);
        dump sorted sequence to 'tempfile';
        dump Merge_chunk describing sequence location into 'chunk_file';
      }
    }
    if (key was packed)
      tell sort buffer the actual number of bytes used;
  }
}
if (buffer has some elements && dumped at least once)
  sort-dump-dump as above;
else
  don't sort, leave sort buffer to be sorted by caller.
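The non-priority-queue branch of the pseudocode above can be approximated with a short Python sketch. This is an illustration only, not MySQL's implementation: external_sort and buffer_capacity are hypothetical names, plain lists stand in for the temporary files, and the buffer never grows. When the buffer fills up it is sorted and dumped as one run, and the runs are merged at the end.

```python
import heapq

def external_sort(keys, buffer_capacity):
    """Sort keys with a bounded buffer; sorted runs stand in for temp files."""
    if buffer_capacity < 1:
        raise ValueError("buffer_capacity must be at least 1")
    runs = []    # each run is one sorted chunk "dumped" to a temp file
    buffer = []
    for key in keys:
        buffer.append(key)
        if len(buffer) == buffer_capacity:
            # No free space left: sort the buffer and dump it as one run.
            runs.append(sorted(buffer))
            buffer = []
    if not runs:
        # Nothing was dumped, so everything fits in memory: sort once.
        return sorted(buffer)
    if buffer:
        # Leftover keys after at least one dump: sort and dump them too.
        runs.append(sorted(buffer))
    # Merge the sorted runs into one sorted sequence.
    return list(heapq.merge(*runs))

categories = [1, 3, 2, 2, 1, 2, 3, 3, 2, 1]
print(external_sort(categories, buffer_capacity=4))
# [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
```

Either way, every key ends up sorted, even if the query only needs the first few rows.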
The optimization logic is described in WL#1393: Optimizing filesort with small limit:
Many web customers have to do
"SELECT ... ORDER BY non_index_column LIMIT X",
When X * <max row size> is smaller than sort_buff_size we can use
the following algorithm to speed up the sort:
- Create a queue to hold 'limit' keys.
- Scan through the table and store the first (last if DESC) keys in the queue
- Return values from queue
This is much faster than the current algorithm, which sorts all matching rows and then returns the first 'limit' rows.
The worklog also records the result of the optimization: 10 to 20 times faster than a quicksort (see the original worklog for details).
So: it is all about speed!
MySQL treats this scenario as a top-N problem, which a priority queue solves well.
A priority queue is essentially a heap; the java.util.PriorityQueue class in Java, for example, is built on a heap data structure.
A brief explanation of what a heap is:
The heap is a complete binary tree;
The value of each node in the heap must be greater than or equal to the value of every node in its subtree (a max-heap), or less than or equal to the value of every node in its subtree (a min-heap).
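As a small illustration of this definition, the Python sketch below checks the min-heap property on a list. It assumes the usual array layout of a complete binary tree, where the parent of index i sits at (i - 1) // 2; is_min_heap is a hypothetical helper, while heapq.heapify is the standard library's min-heap builder.

```python
import heapq

def is_min_heap(items):
    """Return True if every parent is <= its children in the array layout."""
    for child in range(1, len(items)):
        parent = (child - 1) // 2
        if items[parent] > items[child]:
            return False
    return True

categories = [3, 2, 2, 1, 1]
print(is_min_heap(categories))  # False: the root 3 is larger than its child 2
heapq.heapify(categories)       # rearrange in place into a min-heap
print(is_min_heap(categories))  # True
print(categories[0])            # 1, the smallest value sits at the top
```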
If MySQL used merge sort or quicksort, it would have to sort all the data and then take the first LIMIT rows; the rest of the sorted data would be wasted.
A priority queue instead maintains a heap sized to the LIMIT row count, so a single pass over all the data through the heap is enough to produce the result.
The following SQL verifies that MySQL uses the priority queue:
SET optimizer_trace='enabled=on';
select * from ratings order by category limit 5;
SELECT * FROM `information_schema`.`OPTIMIZER_TRACE`\G
"filesort_priority_queue_optimization": {
"limit": 5,
"chosen": true
},
You can see that filesort_priority_queue_optimization.chosen = true.
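If you fetch the trace from application code, the check can be automated. The Python sketch below is only an illustration: find_key is a hypothetical helper, and trace_text is a heavily trimmed, made-up wrapper around the fragment shown above (a real trace is much larger and nested differently). It searches the JSON recursively, so it does not depend on the exact nesting.

```python
import json

def find_key(node, wanted):
    """Depth-first search for the first value stored under `wanted`."""
    if isinstance(node, dict):
        if wanted in node:
            return node[wanted]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, wanted)
        if found is not None:
            return found
    return None

# Hypothetical trimmed trace; only the inner fragment comes from the article.
trace_text = """
{"steps": [{"join_execution": {"steps": [
  {"filesort_priority_queue_optimization": {"limit": 5, "chosen": true}}
]}}]}
"""

decision = find_key(json.loads(trace_text), "filesort_priority_queue_optimization")
print(decision["chosen"])  # True
```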
The execution logic of the priority queue, step by step (using LIMIT 5 as the example):
Tip: rows are written as (id, category), and the heap is ordered by category value.
- Take the first five rows to build a min-heap.
- Take the next row, (6,2). Its category, 2, is smaller than the largest category currently in the heap, 3, so (2,3) is removed from the heap and (6,2) is put in.
- Repeat step 2 until every row that satisfies the query conditions has been compared against the heap; the heap then holds the final five rows.
This is how the priority queue finds the five rows with the smallest category values.
Finally, the result is read out of the heap. Each time the smallest element is taken off the top, the last element is moved to the top and the heap is re-heapified as a min-heap, until the heap is empty.
As you can see, this matches the output of select * from ratings order by category limit 5;
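The walk-through above can be sketched in Python. This is an illustration under stated assumptions, not MySQL's code: Entry and top_n_by_category are hypothetical names, the rows are the article's data as (id, category) tuples, and the sketch keeps the largest category at the top of the heap (by negating it) so that the eviction in step 2 is cheap. Entries compare on category only, so nothing in the code decides how rows with equal categories are ordered.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Entry:
    # Only neg_category takes part in comparisons: a sort key without id.
    neg_category: int
    row_id: int = field(compare=False)

def top_n_by_category(rows, limit):
    """Return `limit` rows with the smallest category from (id, category) rows."""
    if limit <= 0:
        return []
    heap = []  # heapq is a min-heap; negating category puts the largest on top
    for row_id, category in rows:
        if len(heap) < limit:
            heapq.heappush(heap, Entry(-category, row_id))
        elif category < -heap[0].neg_category:
            # Smaller than the largest kept category: evict it, keep this row.
            heapq.heapreplace(heap, Entry(-category, row_id))
    # Popping yields the largest category first, so reverse for ascending order.
    popped = [heapq.heappop(heap) for _ in range(len(heap))]
    return [(entry.row_id, -entry.neg_category) for entry in reversed(popped)]

ROWS = [(1, 1), (2, 3), (3, 2), (4, 2), (5, 1),
        (6, 2), (7, 3), (8, 3), (9, 2), (10, 1)]
result = top_n_by_category(ROWS, 5)
print([category for _, category in result])  # [1, 1, 1, 2, 2]
```

The categories always come out sorted, but the ids within one category appear in whatever order the heap operations leave them, which is why a unique tiebreaker column is needed for a deterministic order.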
4. Why adding an index is suboptimal
Given the ORDER BY logic above, adding an index on the sort column also solves the problem, because it eliminates the in-memory sort step.
However, indexes are not silver bullets. An extra index on category increases the maintenance cost of the table. In the class representative's view, if there is no clear business need, adding an index just to bypass the priority queue optimization is not worth the cost.
When the table is very large, the index itself can take up considerable space. Moreover, in this article's scenario category is a classification column with a high proportion of duplicate values, so even if some business SQL filters by category, MySQL will not necessarily choose this index.
To sum up, for this scenario I personally think order by category, id is the optimal solution.
"This has nothing to do with me. I never write SQL with LIMIT."
Really? Do the CRUD features you write never use pagination? Have you looked at the PageHelper source code?
5. Summary
The case in this article is a real problem the class representative ran into during a production release. After asking around, it turned out that several colleagues had hit the same problem. Most articles online only scratch the surface and leave the underlying questions unanswered, which is why this article was written.
It touches on data structures, PageHelper and the MySQL documentation. The reference materials are listed at the end of the article. If you have time, following the thread of this article through the referenced documents yourself should deepen your understanding.
6. References:
- “The Beauty of Data Structures and Algorithms”, lectures 28 and 29, by Wang Zheng
- “MySQL Practice 45 Lectures”, lectures 04, 05, 10, 16 and 17, by Lin Xiaobin
- 8.2.1.16 LIMIT Query Optimization: https://dev.mysql.com/doc/ref…
- MySQL · Q&A · MySQL Sort pagination: http://mysql.taobao.org/month…
- filesort.cc: https://dev.mysql.com/doc/dev…
- WL#1393: Optimizing filesort with small limit: https://dev.mysql.com/worklog…