An interview question
When interviewing software development engineers, especially for big data positions, we often encounter questions about sorting and deduplicating massive data sets.
Example 1: Given two files A and B, each containing 5 billion URLs, where each URL occupies 64 bytes, and given a memory limit of 4 GB, find the URLs common to A and B.
- The most obvious approach is to read file A and build a hash table from its URLs, then read file B and look up each of its URLs in that hash table. If a URL is found, it is common to both files, and we store it in a result set.
- However, this method has an obvious problem: loading one file alone requires 5 billion × 64 bytes = 320 GB, far more than the 4 GB of available memory, and the hash table data structure needs additional space on top of that. So it is impossible to build a hash table over all the data in a file at once.
How to answer
For this problem, we apply the divide-and-conquer idea.
step1
Iterate over file A, compute hash(URL) % 1000 for each URL, and write the URL into one of 1000 small files (denoted a0, a1, …, a999) according to that value. Why 1000? We choose the number of pieces based on the memory limit and the file size: splitting 320 GB into 1000 pieces gives small files of roughly 320 MB each.
step2
Iterate through file B and store its URLs into 1000 small files (denoted b0, b1, …, b999) in the same way as for A.
The hash function used for file A and file B must be the same, so that identical URLs end up in corresponding small files. For example, if a URL in A hashes into a99, the same URL in B must hash into b99.
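As an illustration of steps 1 and 2, here is a minimal sketch, assuming the URLs are stored one per line in plain text files named A.txt and B.txt (the file names and the bucket count of 1000 follow the description above; in practice, keeping 1000 output streams open may require raising the OS open-file limit or partitioning in several passes):

```java
import java.io.*;

public class UrlPartitioner {

    // Split a large URL file into `buckets` small files so that identical
    // URLs always land in bucket files with the same index: hash(url) % buckets.
    public static void partition(String inputFile, String prefix, int buckets) throws IOException {
        BufferedWriter[] out = new BufferedWriter[buckets];
        for (int i = 0; i < buckets; i++) {
            out[i] = new BufferedWriter(new FileWriter(prefix + i));   // e.g. a0 ... a999
        }
        try (BufferedReader in = new BufferedReader(new FileReader(inputFile))) {
            String url;
            while ((url = in.readLine()) != null) {
                int bucket = Math.floorMod(url.hashCode(), buckets);   // same function for A and B
                out[bucket].write(url);
                out[bucket].newLine();
            }
        } finally {
            for (BufferedWriter w : out) {
                if (w != null) w.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        partition("A.txt", "a", 1000);   // step 1
        partition("B.txt", "b", 1000);   // step 2
    }
}
```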
So the problem now becomes: find the common URLs in each of the 1000 pairs of small files (a_i, b_i).
step3
Since each small file is only about 320 MB, we can now apply the hash-table idea from the naive solution above: load one small file into an in-memory hash set and scan its counterpart.
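A sketch of this step for the bucket names assumed above; each pair (a_i, b_i) is intersected via an in-memory hash set and the common URLs are appended to a single result file (common_urls.txt is a placeholder name):

```java
import java.io.*;
import java.util.HashSet;
import java.util.Set;

public class BucketIntersect {

    // Load one small file into a hash set, then scan its counterpart and
    // write every URL found in both to the shared result writer.
    static void intersect(String fileA, String fileB, BufferedWriter result) throws IOException {
        Set<String> urlsInA = new HashSet<>();
        try (BufferedReader in = new BufferedReader(new FileReader(fileA))) {
            String url;
            while ((url = in.readLine()) != null) urlsInA.add(url);
        }
        try (BufferedReader in = new BufferedReader(new FileReader(fileB))) {
            String url;
            while ((url = in.readLine()) != null) {
                if (urlsInA.remove(url)) {   // remove() so each common URL is written only once
                    result.write(url);
                    result.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (BufferedWriter result = new BufferedWriter(new FileWriter("common_urls.txt"))) {
            for (int i = 0; i < 1000; i++) {
                intersect("a" + i, "b" + i, result);   // one pair of small files at a time
            }
        }
    }
}
```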
General approaches
Common efficient solutions include heap sort, divide-and-conquer and BitMap.
Heap sort
Heap sort is one of the classic sorting algorithms with average time complexity O(N log N). Its particular strength is finding the n largest or smallest numbers among M numbers. So heap sort works well when we only need the top m maximum or minimum values from a large amount of data and do not need anything else.
Find the 100 largest integers out of 100 million
- Read the first 100 numbers and build a min-heap of size 100 from them.
- Read the remaining numbers one by one; whenever a number is larger than the heap root (the smallest of the current top 100), replace the root and restore the heap. You can read one disk page at a time and process the numbers on each page in sequence to save I/O time.
- Finally, pop the heap to obtain the 100 largest numbers in sorted order.
This approach has very low space complexity: even though 100 million numbers are processed, only 100 numbers (or whatever heap size you choose) need to be held in memory at a time, so you never have to load all the data at once, which keeps memory requirements small.
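A minimal sketch of this top-100 procedure using Java's PriorityQueue as the size-100 min-heap; random numbers stand in here for the 100 million integers that would normally be streamed from disk:

```java
import java.util.PriorityQueue;
import java.util.Random;

public class Top100 {
    public static void main(String[] args) {
        final int K = 100;
        // Min-heap of size K: the root is always the smallest of the current top-K candidates.
        PriorityQueue<Integer> heap = new PriorityQueue<>(K);

        Random rnd = new Random(42);
        for (long i = 0; i < 100_000_000L; i++) {
            int x = rnd.nextInt();          // stands in for one number read from disk
            if (heap.size() < K) {
                heap.offer(x);              // fill the heap with the first K numbers
            } else if (x > heap.peek()) {
                heap.poll();                // evict the smallest of the current top K
                heap.offer(x);              // keep only the K largest seen so far
            }
        }

        // Draining the min-heap yields the 100 largest numbers in ascending order.
        while (!heap.isEmpty()) {
            System.out.println(heap.poll());
        }
    }
}
```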
Divide and conquer strategy
General idea: divide and conquer. Split the big data into small pieces, process each piece, and then merge the results.
First distinguish between internal sort and external sort:
- Internal sorting: the entire sequence to be sorted can be loaded into memory and sorted there. It is suitable for small sequences of elements.
- External sorting: sorting of large files whose records are stored on external storage. The file to be sorted cannot be loaded into memory at once, so data must be exchanged between memory and external storage many times in order to sort the whole file.
Steps:
- Extract samples from the data and use them to split the value range into multiple intervals containing roughly equal numbers of samples.
- Split the big data file into multiple small data files, taking I/O counts and hardware resources into account. For example, the size of each small data file can be set to 1 GB (memory must be left over for the running program).
- Sort each small data file with a suitable in-memory algorithm and store the sorted results into the intervals defined in the first step.
- Process the sorted result files within each interval until each interval holds a single sorted result file.
- Merge the sorted results of the intervals.
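The final merge is typically a k-way merge driven by a small heap. Below is a hedged sketch, assuming the pre-sorted chunk files contain one integer per line; the names chunk0.txt, chunk1.txt, … and sorted.txt are placeholders:

```java
import java.io.*;
import java.util.PriorityQueue;

public class KWayMerge {

    // One entry per chunk file: the current smallest unread value plus its reader.
    static class Entry implements Comparable<Entry> {
        long value;
        BufferedReader reader;
        Entry(long value, BufferedReader reader) { this.value = value; this.reader = reader; }
        public int compareTo(Entry o) { return Long.compare(value, o.value); }
    }

    public static void merge(String[] sortedChunks, String outFile) throws IOException {
        PriorityQueue<Entry> heap = new PriorityQueue<>();
        for (String chunk : sortedChunks) {             // prime the heap with each chunk's first value
            BufferedReader r = new BufferedReader(new FileReader(chunk));
            String line = r.readLine();
            if (line != null) heap.offer(new Entry(Long.parseLong(line), r)); else r.close();
        }
        try (BufferedWriter out = new BufferedWriter(new FileWriter(outFile))) {
            while (!heap.isEmpty()) {
                Entry e = heap.poll();                  // overall smallest remaining value
                out.write(Long.toString(e.value));
                out.newLine();
                String line = e.reader.readLine();      // refill from the same chunk
                if (line != null) { e.value = Long.parseLong(line); heap.offer(e); } else e.reader.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        merge(new String[] { "chunk0.txt", "chunk1.txt", "chunk2.txt" }, "sorted.txt");
    }
}
```

Only one line per chunk is held in memory at a time, so the merge itself needs almost no memory regardless of the total data volume.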
Second, pay attention to the characteristics of the data to be sorted: if the data has certain properties, the problem can often be solved more efficiently. This way of thinking is also closer to how real big data applications are designed.
BitMap
On a 32-bit machine, an integer such as `int a` occupies 4 bytes, i.e. 32 bits. To record merely whether a non-negative integer is present, a single bit is enough: bit i is set to 1 if the value i has appeared. The BitMap algorithm uses this idea to sort and query large amounts of data.
Its advantages are high efficiency (no comparisons or element moves are needed) and low memory usage. For example, for N = 10,000,000 values, only N/8 = 1,250,000 bytes ≈ 1.25 MB of memory is required.
Worked examples
Example 2: Sort one billion non-repeating integers; or, find the duplicated numbers among one billion integers.
Such problems usually impose a memory limit. Let's use a similar problem to demonstrate the solution process.
Example 3: A machine with 2 GB of memory has a file containing 4 billion distinct unsigned ints. Given an integer, how do we quickly determine whether it is among those 4 billion numbers?
We can solve the problem above in several ways.
Traversal
If memory were large enough to hold all 4 billion numbers, we could simply iterate through them one by one in O(N) time. However, since memory is insufficient, we must read the data into memory in batches and check each batch, adding I/O time on top; the overall cost is therefore far greater than O(N).
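For illustration, a minimal sketch of such a batched scan, assuming the 4 billion unsigned ints are stored as 4-byte big-endian values in a binary file (the file name and buffer size are arbitrary):

```java
import java.io.*;

public class BatchScan {

    // Returns true if `target` (an unsigned 32-bit value held in a long) occurs in the file.
    public static boolean contains(String file, long target) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file), 8 * 1024 * 1024))) {
            while (true) {
                long value;
                try {
                    value = in.readInt() & 0xFFFFFFFFL;    // read 4 bytes as an unsigned int
                } catch (EOFException eof) {
                    return false;                          // scanned the whole file, not found
                }
                if (value == target) return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(contains("numbers.bin", 123456789L));
    }
}
```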
At this point, the cost is concentrated in I/O operations and in traversing the data. So is there a way to reduce the time complexity? Yes: if we assume memory is sufficient and optimize only for time, we arrive at the following methods.
Direct addressing table method
Allocate an array char a[0 .. 2^32-1], initialized to all zeros, and set a[x] = 1 for every integer x that appears in the file. For example, if the file contains the integer 1000022, set a[1000022] = 1.
a | 0 | 1 | 2 | ... | 2^32-1
---|---|---|---|---|---
value | 1 | 0 | 1 | 0 | 0
The lookup time is O(1), but the space problem remains: 2^32 bytes is 4 GB, which still exceeds 2 GB. Looking at the algorithm, we use the integer itself as the array index and use 0/1 to mark whether it is present, but we spend a whole byte on a flag where a single bit would suffice. If that factor of 8 is optimized away, the table needs only 4 GB / 8 = 512 MB < 2 GB, and the problem is solved. See the following method.
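A small sketch of the direct addressing idea, scaled down to a 10-million-value universe (a full byte-per-value table over all 2^32 unsigned ints would occupy 4 GB, which is exactly why this method alone does not fit in 2 GB):

```java
public class DirectAddressTable {

    // One byte per possible value: 1 means "present", 0 means "absent".
    // For the real problem the universe is 2^32 values, i.e. 4 GB of flags,
    // which is precisely what does not fit into 2 GB of memory.
    private final byte[] table;

    public DirectAddressTable(int universeSize) {
        table = new byte[universeSize];
    }

    public void add(int value)         { table[value] = 1; }
    public boolean contains(int value) { return table[value] == 1; }

    public static void main(String[] args) {
        DirectAddressTable t = new DirectAddressTable(10_000_000);   // scaled-down universe
        t.add(1000022);
        System.out.println(t.contains(1000022));   // true, O(1) lookup
        System.out.println(t.contains(1000023));   // false
    }
}
```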
BitMap
In the previous two articles, we covered the concept and application of Bitmaps.
Map each integer to a bit. For the integer 10, for example, 10 / 8 = 1 and 10 % 8 = 2, so we set bit b[2] of byte a[1] to 1. Lookup is still O(1), and the memory is compressed by a factor of 8.
a | 0 | 1 | 2 | ... | 2^32/8-1
---|---|---|---|---|---
bit | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | 0 1 2 3 4 5 6 7 | ... | 0 1 2 3 4 5 6 7
flag | 0 0 0 0 0 0 0 0 | 0 0 1 0 0 0 0 0 | 0 0 0 0 0 0 0 0 | ... | 0 0 0 0 0 0 0 0
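A minimal bitmap sketch for Example 3. One bit per possible unsigned int value means 2^32 bits = 512 MB, which fits in 2 GB; since java.util.BitSet addresses bits with an int, a plain long[] is used here instead (running it at full size needs a heap of roughly 1 GB):

```java
public class UIntBitmap {

    // One bit per possible unsigned 32-bit value: 2^32 bits = 512 MB in total.
    private final long[] bits = new long[1 << 26];   // 2^32 / 64 words of 64 bits each

    // `value` holds an unsigned int in the range [0, 2^32 - 1].
    public void set(long value) {
        bits[(int) (value >>> 6)] |= 1L << (value & 63);
    }

    public boolean contains(long value) {
        return (bits[(int) (value >>> 6)] & (1L << (value & 63))) != 0;
    }

    public static void main(String[] args) {
        UIntBitmap bitmap = new UIntBitmap();
        bitmap.set(10);                                        // in byte terms: byte 10/8 = 1, bit 10%8 = 2
        bitmap.set(4_000_000_000L);
        System.out.println(bitmap.contains(10));               // true
        System.out.println(bitmap.contains(4_000_000_000L));   // true
        System.out.println(bitmap.contains(7));                // false
    }
}
```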
Reverse thinking optimization
Unsigned int values only go up to about 4.3 billion (the maximum unsigned int is 2^32-1 = 4,294,967,295), so if the file holds 4 billion distinct values, roughly 300 million possible values never appear. We could instead store those ~300 million absent numbers (in a bitmap whose size in bytes is the largest absent number divided by 8): if the given integer is among the absent values, it is not among the 4 billion. Storing the absent values generally takes less space than storing the 4 billion. PS: a bitmap covering 4,294,967,296 values needs 512 MB, while one covering only 294,967,296 values needs about 35.16 MB.
Note that the time and space required by BitMap sorting depend on the largest value in the data. When the data is sparse (for example, only 3 values spread across a very large range), the BitMap's time and space costs are disproportionately large; it is advantageous only when the data is dense.
conclusion
When dealing with massive amounts of data, we have to think about how that data is stored. Among data structures optimized for lookup performance, the most commonly used is the hash table, which offers O(1) lookup by key — neat and elegant. But with very large data sets there is simply not enough memory for it. You can also fall back on something like external sorting, at the cost of extra I/O time. A BitMap is a bit-level mapping that uses a single bit to mark the value associated with an element, with the element itself as the key; since it stores data at bit granularity, it can save a great deal of storage space.
This article has summarized several commonly used methods for processing massive data, which can be applied flexibly according to the actual constraints on space and time. See the previous two articles for hash tables and bitmaps.
Finally, you are welcome to check out the author's new book, "Spring Cloud Advanced Microservices Architecture".
thinking
Finally, here is a thinking question for everyone, similar to the solution process above; if you are interested, leave a comment below the article to discuss it.
Example 4: You have 3 GB of data, all integers; find the duplicated values. Data: each value is less than 2 billion, and each value is repeated at most once. Memory: at most 300 MB of memory may be used for the operation.
reference
- Big data sorting algorithm: external sorting, bitmap algorithm; Big data deduplication algorithm: Hash algorithm, bitmap algorithm
- Massive Data Deduplication: Bitmaps and Bloom Filters