We have 10GB of order data and we want to sort by order amount (assuming they are all positive integers), but we only have a few hundred MB of memory and can’t load 10GB of data into memory at once. What to do at this point?
Bucket sorting (Bucket sort)
Bucket sorting, as the name implies, uses “buckets”. The core idea is that the data to be sorted is sorted into several ordered buckets, and the data in each bucket is sorted separately. After the bucket is sorted, the data in each bucket is taken out in sequence to form an orderly sequence. Let’s say we have 10GB of order data, and we want to sort by order amount (assuming they’re all positive integers), but we only have a few hundred MB of memory, so we can’t load 10GB of data into memory at once. What to do at this point?
We can scan the file first to see the data range of the order amount. Suppose that after scanning, we get that the minimum amount of the order is 1 yuan and the maximum is 100,000 yuan. We divide all the orders into 100 buckets according to the amount. In the first bucket, we store the orders with the amount between 1 yuan and 1000 yuan, in the second bucket, we store the orders with the amount between 1001 yuan and 2000 yuan, and so on. Each bucket corresponds to a file and is numbered and named according to the size of the bucket (00,01,02… 99).
Ideally, if the order size is evenly distributed between $10,000 and $100,000, the order will be evenly divided into 100 files, each containing about 100MB of order data. We can place the 100 small files in memory one by one and use quicksort to sort them. After all the files are sorted, we only need to read the order data in each small file from small to large according to the file number, and write it into a document, then the file stored is the order data sorted according to the amount from small to large.
Orders between 1 yuan and 100,000 yuan are not evenly distributed, so 10GB order data cannot be evenly divided into 100 files. There may be too much data in a certain amount area, and the corresponding file will be too large to be read into memory at one time. What should I do about it?
For large files after these divisions, we can continue to divide, for example, the order amount between 1 yuan and 1000 yuan is more, we will continue to divide this interval into 10 small areas, 1 yuan to 100 yuan, 101 yuan to 200 yuan, 201 yuan to 300 yuan… 901 to 1000 yuan. If, after partitioning, there are still too many orders between $101 and $200 to read into memory at once, continue partitioning until all files can be read into memory.
Counting sort (Counting sort)
Counting sort is actually a special case of bucket sort. When the n numbers that we want to sort are on a small scale, for example, the maximum is k, we can divide the data into k buckets. The data values in each bucket are the same, saving the sorting time in the bucket.
We all went through the college entrance examination, the high test score system. Do you remember? When we check the score, the system will show us our score and the rank of our province. If there are 500,000 examinees in your province, how do you rank them by quick ranking?
The full score of examinees is 900 and the minimum score is 0. This data range is very small, so we can divide it into 901 buckets with scores ranging from 0 to 900. According to the scores of the examinees, we divided the 500,000 examinees into these 901 buckets. The data in the bucket are all examinees with the same score, so there is no need to sort. We just need to scan each bucket in turn, output the candidates in the bucket in turn into an array, and achieve the order of 500,000 candidates. Because only scan traversal is involved, the time complexity is O(n).
Why is this sorting algorithm called ** “counting” sort? Where does the meaning of counting ** come from?
I’ve simplified the size of the data. Suppose there are only eight examinees with a score between 0 and 5. The scores of these 8 examinees are put into an array A[8], they are: 2,5,3,0,2,3,0,3 respectively. Candidates’ scores range from 0 to 5, and buckets are represented by an array C[6] of size 6, where subscripts correspond to scores. However, C[6] does not store examinees, but the corresponding number of examinees. In an example like the one I just gave, we can get a value of C[6] by simply iterating through the test taker’s score.
There are 3 candidates with a score of 3 and 4 candidates with a score of less than 3. Therefore, candidates with a score of 3 will save the positions of subscripts 4, 5, and 6 in the ordered array R[8] after sorting.
If we sum the C[6] array sequentially, the data stored in C[6] looks like this. C[k] stores the number of candidates whose score is less than or equal to k.
We scan array A from back to front. For example, when scanning to 3, we can extract the value 7 with subscript 3 from array C, which means that so far there are 7 candidates with a score of 3 or less including ourselves, which means that 3 is the seventh element in array R (the position with subscript 6 in array R). When 3 is added to R, there are only 6 elements less than or equal to 3 left, so C[3] is reduced by 1 to 6.
Similarly, when we scan for a second candidate with a score of 3, we place it at the position of the sixth element in the array R (subscript 5). When we scan the entire array A, the data in group R is sorted in order of fractions from smallest to largest.
Radix sort (Radix sort)
Suppose we have 100,000 mobile phone numbers and we want to sort them from smallest to largest. What’s a quick way to sort them?
Can bucket sort, count sort come in handy? The 11-digit range of mobile phone numbers is too large to use either sorting algorithm. Is there an O(n) algorithm for this sorting problem? Now I’m going to introduce a new sort algorithm, radix sort.
If you want to compare the size of two mobile phone numbers a and B, if a is already larger than B in the first few digits, you don’t need to look at the next few digits.
Sort the phone number by the last digit, then by the second to last, and so on, and finally by the first digit.
After 11 sequences, the phone numbers were in order. The phone number is a little bit long, so it’s a little bit hard to see, but I’ve drawn a breakdown of radix sort using the string sort example, so you can look at it.
The sorting algorithm that we’re doing here by bits has to be stable, otherwise this implementation is not going to be right. Because if the algorithm is unstable, then the last sort will only consider the highest bit of the order, completely regardless of the size of the other bits, then the lowest bit of the order is completely meaningless.
Sort by each bit, we could do bucket sort or count sort, which is order n in time. If the data to be sorted has K bits, then we need k bucket sorting or counting sorting, the total time is O(k*n). When k is small, as in the case of mobile phone number sorting, k is at most 11, so the time complexity of radix sorting is approximately O(n).
In fact, sometimes the data to sort are not all the same length, for example, we sort the Oxford Dictionary of ten thousand English words, the shortest is only a letter, the longest I specially checked, there is a letter, the Chinese translation is pneumoconiosis. Does radix sort still work for data of unequal lengths?
In fact, we can add all the words to the same length. If we don’t have enough digits, we can add ** “0” **. Since all letters are greater than “0” according to the ASCII value, adding “0” will not affect the original size order. So we can continue to sort by radix.