How to find the most frequent integers among 2 billion, 4 billion, and 8 billion with only 2GB of memory?

These days Xiaoqiu went for an interview, but recently Xiaoqiu has studied a lot of articles related to bitwise algorithms, such as

How do you tell if a number is among 4 billion integers?

Algorithm techniques bit operation installation guide

For the algorithm or a little confidence ，，，， so, found the following dialogue.

The 2 billion level

Interviewer: If I gave you 2GB of memory, and I gave you 2 billion ints and asked you to find the number that came up the most, what would you do?

Pelle :(huh? How does that feel a little bit like the previous question of whether a number appears in 4 billion integers? If still using bitmap algorithm, however, seems to be unable to statistics the number of occurrences of a number, can only judge whether a number is), I can use hash table to the statistics, the number as the key, the number of the number of occurrences of the value, I then traverse the hash table which number appears most most times.

Interviewer: Can you calculate how much memory you need for this method?

Pelle: Both key and value are ints. An int takes up 4B of memory, so a hash table record takes up 8B. In the worst case, these 2 billion numbers are all different numbers, which takes up about 16GB of memory.

Interviewer: Your analysis is correct, however, I only gave you 2GB of MEMORY.

Pelle :(feel this problem is somewhat similar, but I don’t know why, no idea, this cool cool), there is no better way at present.

Interviewer: With your method, you can only record about 200 million different records at most, 200 million different records, which is about 1.6GB of memory.

Pelle :(huh? Is the interviewer giving me a hint? I’ve got an idea. I can put these two billion numbers in different files and filter them.

Interview question: Can you tell me more about it?

Xiaoqiu: Just now, you said that my method can only record about 200 million different records at most, so I can map these 2 billion numbers to different files, for example, the values between 0 and 200 million are stored in file 1, and the values between 200 and 400 million are stored in file 2…. Since there are about 4.2 billion different int numbers, I can map them to 21 files, as shown in the figure below

Obviously, the same number must be in the same file, so we can use my method at this time, count the number that appears most frequently in each file, and then choose the number that appears most frequently from these numbers again, and it is ok.

Interviewer: Well, that’s a good idea, but if the 2 billion numbers I gave you are concentrated, say between 1 and 20 million, then you would map them all into the same file. Do you have an optimization idea?

Pelle: Well, I can map each number to the hash function first, and then store them in the corresponding file. If the hash function is well designed, then these numbers will be evenly distributed. (I won’t say anything about the design of hash functions, I’m just giving you an idea.)

The 4 billion level

Interviewer: What if I add two billion numbers to four billion numbers?

Pelle :(it’s not easy, mapping to 42 files) I can increase the number of files.

Interviewer: If all four billion numbers are the same, then you have a hash table where the value of a key is four billion, but the maximum value of an int is about 2.1 billion, and then there is an overflow. What do you do?

Small autumn: I can set the initial value of “value” to negative 2.1 billion, so if “value” is 2.1 billion, That’s 4.2 billion occurrences of a key.

For the record, 21 files will be enough, because 21 files will limit the number of values in each file to 200 million, which means that the hash table will still hold no more than 200 million records.

The 8 billion level

Interviewer: That’s a quick response. What if I increased it from 4 billion to 8 billion?

Pelle :(I depend, this change aggravation)……… I know, AND I can go over it, and if I’m counting and I find that one key has been used more than 4 billion times, then there’s no way that another key has been used more than that, so I just return that key and I’m done.

Interviewer: Ok, this interview is over. Go back and wait for the notice.

conclusion

Today, this article mainly talks about some problems related to big data processing. Later, we may also find some similar problems, but different ways of dealing with them. If you feel good, you might as well:

Have a harvest? Hope the old iron people come to a triple whammy, give more people to see this article

1, give me a thumbs up, can let more people see this article, by the way inspire me, hee hee.

2, old friends, follow my original wechat public account “Shuaidi Play Programming **”, focusing on writing algorithm + basic knowledge of computer (computer network + operating system + database +Linux). Save it so you can read it. Hit me.

Finally, I recommend a long-cherished Github, where hundreds of commonly used e-books are collected and the download address is provided. Some screenshots are shown below

mo4tech.com (Moment For Technology) is a global community with thousands techies from across the global hang out!Passionate technologists, be it gadget freaks, tech enthusiasts, coders, technopreneurs, or CIOs, you would find them all here.

How to find the most frequent integers among 2 billion, 4 billion, and 8 billion with only 2GB of memory?

The 2 billion level

The 4 billion level

The 8 billion level

conclusion

Have a harvest? Hope the old iron people come to a triple whammy, give more people to see this article

Here’s the address:Click on Github

How to find the most frequent integers among 2 billion, 4 billion, and 8 billion with only 2GB of memory?

The 2 billion level

The 4 billion level

The 8 billion level

conclusion

Have a harvest? Hope the old iron people come to a triple whammy, give more people to see this article

Here’s the address:Click on Github

Related Posts

Linux command line illegal character processing

Go Web Programming (ii) – Custom table names and JSONB type parameter parsing

260. The number III that occurs only once