The story

Now, suppose you read this text back to the time of the second grade primary school, at this time you are constantly pursuing the monitor of the next class xiao Hong, anxious to give all the things at home to TA. So the question is, if you want to move all the things in the house to Xiao Hong, how many ways do you have? Here’s what COMES to mind

  • If you can’t move it, break it up. (You may get beat up by your parents.)
  • Trying to take pills to make yourself strong

The above example may seem comical, but it is the solution to the problem of large numbers that mankind has always had: either increase our capacity to cope with large numbers, or divide and conquer.

In the IT industry, there are mainframes because of the limited processing power of traditional minicomputers. What if you don’t use mainframes? The services had to be split, and so came microservices.

Classic Interview questions

Interviewer: Suppose you only have 100 megabytes of memory available, and you have a file of 1 gigabyte containing integers, each of which is 4 bytes, and you need to sort the data in this file. What’s your solution?

I: make a phone call to find administrative sister with her to an 8G DDR4 memory, in order to express gratitude by the way about her to have a meal, perhaps also can smoothly off single.

Interviewer: EmMM… . Go back and wait for notice

The solution

Me: To solve this problem, first we need to divide into two cases:

  • If the data is not duplicated, we can use a bitmap to mark the corresponding data and iterate over the bitmap when we need to output the results. (This solution is relatively simple and is beyond the scope of this article)

  • Data duplication Since only 100M memory is available, fully utilizing this 100M memory means that we can sort 2,6214,400 integers (100 * 1024 * 1024/4) at a time, which means that we have to read files in batches and sort what we read, The result of each sort is saved to the file system, and then the files are merged.

Interviewer: Can you draw it in a picture?

Me: The process is shown in the picture below

Interviewer: Sure, why don’t you write the code on the spot

Implementation of the solution

The solution implementation generally consists of the following steps

  • Read in the amount of data according to the size of the buffer, convert them to an integer array, sort them, and write to a file, repeating this step until there is no data to read from the original data file.

  • Merge the sorted files until only one file remains

Taking the problem apart, we need to solve the following sub-problems

  • Since we are using four bytes of data to hold integers, we need to solve the byte access problem of integers

Why do we use four bytes to access integers? Instead of converting it to a string

  • An algorithm for merging sorted files

Scheme 1: Prefetch some data and write it into the cache. Then merge and sort the data in the split file (all the data in the split file is in order). When the data is used up, read the data in the file

Plan 2: Read integers one at a time from both files, compare them, and write larger/smaller values (depending on whether you want to increment or descend) to the file

Scheme one is relatively simple and fast and I leave it to you to implement.

As for scheme 2, I adopt the design mode of state machine to realize scheme 2 because state machine is involved in recent development.

The state machine is shown below

Implementation of external sorting

To everyone to provide a reference, MY implementation of the scheme and further optimization of the space 😄

test

To give an idea, we sort a 16MB file with a buffer of 512KB.

The following are the test results

  • File segmentation stage, it can be seen that the time used for file segmentation is almost the same

  • In the merge phase, you can see that the time it takes to merge sorted files is increasing because the size of the merged files is increasing

What if we just set the buffer to 16MB? Here are the test results, not even for the merge phase.

Interviewer: Good. Can you tell me the application scenario?

I: Using file to sort work = filesort, seems to have seen…

Interviewer: Give you one word explain

FileSort

Me: Come to think of it, suppose we have a table

CREATE TABLE `users` (

  `id` int(11NOT NULL.

  `account` varchar(45COLLATE utf8mb4_bin DEFAULT NULL.

  `nickname` varchar(45COLLATE utf8mb4_bin DEFAULT NULL.

  `password` varchar(45COLLATE utf8mb4_bin DEFAULT NULL.

  PRIMARY KEY (`id`)

ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin

Copy the code

If we need to continue sorting by a field without adding an index, we will see filesort in Extra if we query the SQL using Explain

As shown in the figure below

This means that MySQL cannot sort the data by index (if there is an index, it can just fetch it, no sort operation is required).

Had to prioritize the to sort field, but the production environment can be very large amount of data, if all loaded into memory, memory is bound to cause leading to collapse, database therefore must draw a special memory area for sorting, and this memory area is likely to hold this huge amount of data, is bound to use external file system for sorting, That’s where filesort comes in

Interviewer: Good. Do you know how to look at the size of this chunk of memory?

I: buffer = buffer, according to the tradition of mysql, the following statement should be found

show variables like '%buffer%'

Copy the code


(The area highlighted in blue in the figure, i.e
sort_buffer_size)

Interviewer: Good. Do you know how to optimize it?

I: add index bai, return ability how kind, don’t call operation and maintenance to server again add a memory bar? Or replace the CPU of toothpaste factory (Intel) with that of farm factory (AMD! YES)

Interviewer: As long as you like AMD, we’re brothers from another mother. Oh, no, I mean how do I index it

Me: We know that indexes are ordered, and if the information on the index already meets our requirements, then filsort is not necessary.

For example, in the users table mentioned above, we create an index

alter table users add index(nickname, account)

Copy the code

Consider whether filesort is required for the following statements

select nickname,account from users order by nickname

Copy the code
select * from users order by nickname

Copy the code
select * from users order by account

Copy the code

The answer is

  • The first statement does not require filesort because the index already contains the information we need
  • The second statement can directly use the index (index ordered storage), after reading the index corresponding to the primary key value and directly return the corresponding data to the client, do not need to usesort_buffer
  • The third statement requiresfilesortHowever, because account and nickname are combined into indexes, the account corresponding to each nickname is in order, so the account corresponding to different nicknames can be used to merge sort (as mentioned in the merge stage above).

conclusion

So today’s summary is just three pictures

The appendix

Q1: Why use 4 bytes to store integers

To save space, if you use strings, you can save as many bytes as the integer is long

Q2: How do I use the external sorting DEMO provided in this article


The three files in the source code are

  • Print the data in the file
  • Sorts the specified file
  • Generate random number files

Q3: Why use a state machine for merge sort

I have to say that using a state machine to sort out the logic is relatively clear, and I suggest you try it. But in this case performance is much faster if you use a buffer to hold an array of integers.

The resources

High Performance MySQL(Version 3)

The relevant section of the index

MySQL King of the Road to Promotion

Section 3.4