Hi, I’m Feng.
A while back, a reader asked me on WeChat about a big-data scenario question from a Didi interview. Today let’s talk through how to solve it.
In interviews, besides writing algorithms by hand, you may also run into algorithm scenario questions, for example: the data volume is far too large and memory is nowhere near enough, so what do you do? Today we’ll work through two concrete problems.
Problem 1
File A has 5 billion URLs, and file B also has 5 billion URLs; each URL is 64 bytes. How can we find the URLs that appear in both files on a machine with only 4 GB of memory?
You don’t have to do the math to know the data won’t fit in memory; otherwise there would be no question to ask. In an interview, if you’re not confident in your arithmetic, don’t embarrass yourself with exact calculations: just explain your reasoning out loud.
For this kind of big-data problem, the usual idea is “divide and conquer.” Purely for the sake of explanation, I’ll do the math here.
Total size of 5 billion URLs: 50 * 10^8 * 64 bytes = 3.2 * 10^11 bytes ≈ 320 GB, far more than the 4 GB available.
Scan files A and B with a hash function and spread their URLs across small files, 1,000 per side. At 320 GB / 1,000, each small file is roughly 320 MB, which fits in memory. Number the files 0–999.
First, iterate over file A and compute hash(URL) % 1000 for each URL; the result is the index of the small file that URL belongs to. After the pass, all of A’s URLs are spread across 1,000 small files named a0, a1, …, a999.
Then iterate over file B the same way, storing its URLs in the 1,000 small files b0, b1, …, b999.
Hash functions have one key property we rely on here: the same input always hashes to the same value. So any URL that appears in both A and B necessarily lands in a pair of small files with the same index, ai and bi.
If the hash function is well designed and the data is evenly distributed, each small file is roughly 320 MB.
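Here is a minimal sketch of that partitioning pass in Python. The file names follow the text; using MD5 as the hash function is my own choice (Python’s built-in hash() is randomized per process, so a stable hash is safer for this kind of job).

```python
# A minimal sketch of the partitioning pass.
import hashlib

NUM_BUCKETS = 1000

def partition(input_path, prefix):
    # Keep all 1,000 bucket files open at once; this may require
    # raising the OS open-file limit (e.g. ulimit -n).
    outs = [open(f"{prefix}{i}", "w") for i in range(NUM_BUCKETS)]
    try:
        with open(input_path) as f:
            for line in f:
                url = line.strip()
                # Same URL -> same digest -> same bucket index.
                i = int(hashlib.md5(url.encode()).hexdigest(), 16) % NUM_BUCKETS
                outs[i].write(url + "\n")
    finally:
        for out in outs:
            out.close()

partition("A", "a")  # writes a0 .. a999
partition("B", "b")  # writes b0 .. b999
```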
We then read each pair of small files with the same index into memory and check for common URLs, for example by loading ai into a hash set and probing it with every URL from bi.
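A sketch of that comparison step, reusing the a0…a999 / b0…b999 names from above; one side’s ~320 MB of raw URLs, even with hash-set overhead, sits well within the 4 GB budget.

```python
# A sketch of the comparison step.
def common_urls(i):
    # Load the "a" side into a hash set, then stream the "b" side.
    with open(f"a{i}") as fa:
        seen = {line.strip() for line in fa}
    with open(f"b{i}") as fb:
        return [url for url in (line.strip() for line in fb) if url in seen]

duplicates = []
for i in range(1000):
    duplicates.extend(common_urls(i))
```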
Isn’t that simple?
This reminds me of another memory-bound problem, this one from Geek Time.
Problem 2
How do you sort 10 GB of order data by amount?
This calls for the idea of bucket sort: order amounts generally fall within a bounded range, without a huge span. First scan the 10 GB of order data to find the range of amounts, then divide the orders into buckets by sub-range.
For example, if order amounts range from 1 to 100,000 yuan, divide them into 100 buckets, each covering a span of 1,000 yuan. Each bucket is a small file, numbered by its range: file 00 holds 1–1,000 yuan, file 01 holds 1,001–2,000 yuan, and so on.
Assuming the amounts are fairly uniform, each file is about 100 MB; read each file into memory and sort it. Since the files were partitioned by amount, the files themselves are already ordered relative to one another. Finally, append the small files to the result file in numbered order, and every order is sorted by amount.
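A minimal sketch of both phases, assuming one integer amount (in yuan) per line of a file named orders, and the 1–100,000 range from the example; the file names are made up for illustration.

```python
# A sketch of bucket sort over order amounts.

def bucket_index(amount):
    # 1..1000 -> bucket 0, 1001..2000 -> bucket 1, ..., up to bucket 99.
    return min((amount - 1) // 1000, 99)

# Phase 1: scan the 10 GB file once, routing each order to its bucket.
outs = [open(f"bucket_{i:02d}", "w") for i in range(100)]
with open("orders") as f:
    for line in f:
        outs[bucket_index(int(line))].write(line)
for out in outs:
    out.close()

# Phase 2: sort each ~100 MB bucket in memory and append to the result;
# the buckets themselves are already ordered relative to each other.
with open("sorted_orders", "w") as result:
    for i in range(100):
        with open(f"bucket_{i:02d}") as f:
            for amount in sorted(int(line) for line in f):
                result.write(f"{amount}\n")
```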
In practice, order amounts are rarely distributed evenly. In that case, apply the same method again to any bucket file that is still too large: split it into smaller files over narrower sub-ranges, sort those separately, and then stitch the results back together in order, as sketched below.
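One way to sketch that re-splitting, continuing the same assumptions (integer amounts, one per line); the memory budget constant is made up:

```python
import os

MEM_LIMIT = 200 * 1024 * 1024  # rough in-memory budget; an assumption

def sort_range(path, lo, hi, result):
    # Small enough (or a single repeated value): sort in memory.
    if os.path.getsize(path) <= MEM_LIMIT or lo == hi:
        with open(path) as f:
            for amount in sorted(int(line) for line in f):
                result.write(f"{amount}\n")
        return
    # Still too big: split [lo, hi] into 10 narrower sub-ranges,
    # route each line into its sub-file, then recurse in order.
    width = (hi - lo) // 10 + 1
    subs = [open(f"{path}.{j}", "w") for j in range(10)]
    with open(path) as f:
        for line in f:
            subs[min((int(line) - lo) // width, 9)].write(line)
    for s in subs:
        s.close()
    for j in range(10):
        sub_lo = lo + j * width
        sort_range(f"{path}.{j}", sub_lo, min(sub_lo + width - 1, hi), result)

# e.g. re-sort an oversized bucket covering 42,001..43,000 yuan:
# with open("sorted_orders", "a") as result:
#     sort_range("bucket_42", 42001, 43000, result)
```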
What these two problems show is that the central idea in big-data processing is “divide and conquer.” Next time you meet this kind of question, start from that angle; even if you can’t fully win over the interviewer, you can still get most of the way to an answer.
Finally, the reader didn’t let Brother Feng’s careful coaching go to waste, and successfully landed the Didi offer.