How to quickly crack 99% of the massive data processing interview questions
After careful revision, this article has been included as Chapter 6 of my new book, The Way to Code: Interviews and Algorithms, which is out now on JD, Dangdang, Amazon, and other major online stores.
Source: the blog "The Method of Structure, The Way of Algorithms"
Preface
Generally speaking, titles containing phrases like "instant kill", "99%", or "the most complete/strongest ever" tend to be sensational, but I will go one step further: if a reader finishes this article and gets nothing out of it, I am willing to take the blame :-). At the same time, this article can be read as a more general, abstract summary of an earlier post: Ten massive data processing interview questions and ten methods summary.
Due to limits of length and of a purely theoretical treatment, this article sets aside most implementation details and discusses only methods and models, trying to explain the relevant problems in the plainest and most straightforward language possible. It must also be stressed that the whole article is framed around the analysis of interview questions; in actual practice each case has to be analyzed on its own terms, and the details each scenario demands are far more complex than any of the solutions described here.
OK, if you have any questions, please feel free to comment. Thank you.
What is massive data processing?
So-called massive data processing is nothing more than storage, processing, and computation over massive amounts of data. "Massive" simply means the data is either too large to be processed quickly within a short time, or too large to be loaded into memory at once.
And the solution? For the time dimension, we can use clever algorithms together with the appropriate data structures, such as a Bloom filter, hash, bit-map, heap, database, inverted index, or trie tree. For the space dimension there is really only one trick: scale down and divide and conquer (hash mapping). If the scale is too big, turn the large scale into many small scales and defeat them one by one.
As for single machine versus cluster: a single machine processes data on one device with limited resources (only the data interaction among CPU, memory, and hard disk needs to be considered), while a cluster has many machines and is suited to distributed processing and parallel computing (the nodes and the data interaction between nodes need more consideration).
Broadly speaking, handling a massive data problem comes down to nothing more than the following:
- Divide and conquer / hash mapping + hash_map statistics + heap/quick/merge sort;
- Multi-layer bucket division;
- Bloom filter/Bitmap;
- Trie tree/database/inverted index;
- External sort;
- Hadoop/Mapreduce for distributed processing.
So, to begin, the first part of this article starts from set/map and hashtable/hash_map/hash_set, briefly introducing set/map/multiset/multimap along with hash_set/hash_map/hash_multiset/hash_multimap.
Hashtable /hash_map/hash_set
We will use hash_map/hash_set heavily in Part 2 of this article, so let's first take a quick look at these containers. In general, STL containers fall into two groups:
- Sequential containers (vector/list/deque/stack/queue/heap),
- Associative containers. Associative containers are divided into set and map (mapping table), plus their derivatives multiset (multi-key set) and multimap (multi-key mapping table), all of which are implemented on top of an RB-tree. In addition there is a third family of associative containers built on a hashtable: hash_set (hash set), hash_map (hash mapping table), hash_multiset (hash multi-key set), and hash_multimap (hash multi-key mapping table). In other words, set/map/multiset/multimap each contain an RB-tree, while hash_set/hash_map/hash_multiset/hash_multimap each contain a hashtable.
An associative container is similar to a relational database: each piece of data, or element, has a key and a real value, i.e. a key-value pair. When an element is inserted into an associative container, the container's internal structure (RB-tree or hashtable) places the element in the proper position according to a specific rule based on its key.
For example, in MongoDB a document is the most basic unit of data organization, and each document is itself organized as key-value pairs. A document can contain multiple key-value pairs, and each value may be of a different type, such as String, Integer, List, and so on: { "name" : "July", "sex" : "male", "age" : 23 }
set/map/multiset/multimap
In both set and map, all elements are automatically sorted by their key, because every operation on set/map simply delegates to the corresponding RB-tree operation. Note, too, that neither container allows two elements to have the same key. The difference between them: in a set, an element's value and key are one and the same, i.e. the key is the value and the value is the key, whereas every element of a map is a pair holding both a value and a key, with the first component treated as the key and the second as the value. As for multiset/multimap, their properties and usage are exactly the same as set/map, except that they allow duplicate key values; that is, all their insertions use the RB-tree's insert_equal() instead of insert_unique().
hash_set/hash_map/hash_multiset/hash_multimap
hash_set and hash_map are both implemented on top of a hashtable. The difference: a hash_set, like a set, treats the value as the key (the value is the key and the key is the value), while a hash_map, like a map, has a distinct value and key, so its usage is basically the same as a map's. However, because hash_set/hash_map are based on a hashtable, they have no automatic sorting. Why? Because a hashtable does not sort its elements. As for hash_multiset/hash_multimap, their properties are exactly the same as the multiset/multimap above, except that their underlying implementation is a hashtable, so their elements are not automatically sorted either; they do, however, also allow duplicate keys.
So, in conclusion and in plain language, the underlying structure determines the behavior: set/map/multiset/multimap are built on an RB-tree and therefore sort automatically, while hash_set/hash_map/hash_multiset/hash_multimap are built on a hashtable and therefore do not, and the multi_ prefix simply means duplicate keys are allowed. Summarized:
| Container | Underlying structure | Automatically sorted | Duplicate keys allowed |
| --- | --- | --- | --- |
| set / map | RB-tree | yes | no |
| multiset / multimap | RB-tree | yes | yes |
| hash_set / hash_map | hashtable | no | no |
| hash_multiset / hash_multimap | hashtable | no | yes |
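As a quick hedged illustration of that table, here is a minimal C++11 sketch using the standardized counterparts of these containers (std::unordered_set/std::unordered_map correspond to the SGI hash_set/hash_map discussed above); the values are arbitrary toy data:

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <unordered_map>
#include <unordered_set>

int main() {
    // RB-tree based: iteration visits keys in sorted order, duplicate keys rejected.
    std::set<int> s{3, 1, 2};
    std::map<std::string, int> m{{"july", 23}, {"abc", 1}};
    for (int x : s) std::cout << x << ' ';                  // 1 2 3
    std::cout << '\n';
    for (const auto& kv : m) std::cout << kv.first << ' ';  // abc july
    std::cout << '\n';

    // Hashtable based (hash_set/hash_map in SGI STL, unordered_* in the standard):
    // no ordering guarantee on iteration.
    std::unordered_set<int> us{3, 1, 2};
    for (int x : us) std::cout << x << ' ';                 // some unspecified order
    std::cout << '\n';

    // multi_ variants allow duplicate keys (insert_equal semantics).
    std::multiset<int> ms{1, 1, 2};
    std::cout << ms.count(1) << '\n';                       // 2
}
```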
In addition,
- For hash, see this blog post;
- For red-black trees, see the blog series,
- For more details on how to use hash_map, see here. For hash_set, see this article.
OK, now on to the second part of this article: six keys for handling massive data problems.
Part two: Six keys to deal with massive data problems
Key one: divide and conquer / hash mapping + hash_map statistics + heap/quick/merge sort
1. Massive log data: extract the IP address that visited Baidu the most times in one day.
- Divide and conquer / hash mapping: if the data is too large and memory is limited, the only option is to turn one big file into many small files (by hashing and taking the modulo). In other words, the 16-character guideline: shrink the big into the small, divide and conquer, break the whole into pieces, and defeat them one by one.
- Hash_map statistics: once the large file has been turned into small files, we can use an ordinary hash_map(ip, value) to do the frequency counting.
- Heap/quick sort: after counting, sort (heap sort works) and take the IP with the highest frequency.
Concretely it is this: "First extract from the logs the IPs that accessed Baidu on that day and write them, one by one, into a large file. Note that an IP address is 32 bits, so there are at most 2^32 distinct IPs. We can use a mapping method such as taking the value modulo 1000 to map the whole large file into 1000 small files, then find the most frequent IP in each small file (a hash_map can count the frequency of every IP in each of those 1000 files) along with its frequency. Finally, among those 1000 local winners, pick the IP with the highest overall frequency, and we are done." (From Ten massive data processing interview questions and ten methods summary.)
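As a hedged illustration of the hash_map counting step, here is a minimal C++ sketch that handles a single small file; the filename is hypothetical, and in the full solution the same routine would run over each of the 1000 shard files, after which the 1000 local winners are compared:

```cpp
#include <fstream>
#include <iostream>
#include <string>
#include <unordered_map>

// Count IPs in one small shard file and report the most frequent one.
// "ip_shard_042.txt" is a hypothetical name produced by an earlier
// hash(IP) % 1000 splitting pass.
int main() {
    std::ifstream in("ip_shard_042.txt");
    std::unordered_map<std::string, long long> freq;  // hash_map(IP, count)
    std::string ip;
    while (in >> ip) ++freq[ip];

    std::string best_ip;
    long long best_count = 0;
    for (const auto& kv : freq) {
        if (kv.second > best_count) {
            best_count = kv.second;
            best_ip = kv.first;
        }
    }
    std::cout << best_ip << " appeared " << best_count << " times\n";
}
```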
With regard to this question, a few further points:
1. Taking the hash value modulo is an equivalence mapping, so the same element will never be scattered into different small files. That is, with the mod 1000 scheme used here, the same IP can only land in the same small file after the hash-and-mod step; it cannot be split across files: if two IPs are equal, then Hash(IP) produces the same hash value, and taking that value modulo (say, modulo 1000) must still give the same result. 2. So what exactly is hash mapping? Simply put, in order to let a computer process big data within limited memory, we use a mapping (a hash) to distribute the data evenly across the corresponding memory locations or files; for example, big data can be mapped by taking remainders into smaller pieces held in memory, or one huge file can be mapped into many small files. This mapping is what we usually call a hash function, and a well-designed hash function spreads the data evenly and reduces collisions. Although the data is mapped to different locations, it is still the same data; it has merely taken another form to represent the original.
OK, if you are interested, you can also learn more about the consistent hashing algorithm; see part five of this blog post: blog.csdn.net/v_july_v/ar… .
2. Find the 10 most popular queries among 3 million query strings.
A search engine records, via log files, every query string the user submits, each string being 1-255 bytes long. Suppose there are currently 10 million records (these query strings are highly repetitive: although the total is 10 million, after removing duplicates there are no more than 3 million of them). Please report the 10 most popular query strings, using no more than 1 GB of memory.
Answer: from question 1 above we know to turn big data into small data. For example, to compute the Top 10 among 100 million IP addresses, we can first split the IPs into 1000 small files, making sure that any given IP appears in only one file, then count the IPs in each small file with a hash_map and sort them by count, and finally use a merge or a min-heap over the per-file top 10s to obtain the final result.
But what if the data is small enough to fit into memory all at once? For example, although there are 10 million queries here, the repetition is so high that there are really only 3 million distinct queries, each at most 255 bytes, so we can consider keeping them all in memory (3 million strings of at most 255 bytes occupy roughly 3M × 255 bytes ≈ 0.75 GB, so all the strings fit within the 1 GB limit). In that case all we need is a suitable data structure, and here a HashTable is definitely the preferred choice.
So we drop the divide-and-conquer/hash-mapping step and go straight to hash statistics followed by sorting. For a typical TOP K problem like this, the usual solution is hash_map + heap. As follows:
- Hash_map statistics: preprocess this batch of data by maintaining a HashTable whose Key is the query and whose Value is the query's number of occurrences, i.e. hash_map(Query, Value). Read one query at a time; if the string is not in the table, add it and set its Value to 1; if it is, increment its count by one. In the end we complete the counting with the hash table in O(N) time;
- Heap sort: in the second step, use the heap data structure to find the Top K, at a cost of N'·logK. That is, with a heap we can look up and adjust/move in O(logK) time. So maintain a min-heap of size K (10 in this problem) and iterate over the 3 million distinct queries, comparing each with the root element. Our final time complexity is therefore O(N) + N'·O(logK), where N = 10 million and N' = 3 million.
Don't forget the idea of heap sort described in that article: "maintain a min-heap of K elements, that is, a min-heap of capacity k holding the first k numbers encountered, which we assume for now to be the largest k numbers. Building the heap costs O(k), and adjusting it costs O(logk), with k1 > k2 > ... > kmin (kmin being the smallest element in the heap). Then keep iterating over the sequence, one element x at a time, comparing it with the heap top: if x > kmin, update the heap (push x in, costing logk), otherwise leave the heap unchanged. The total time is O(k·logk + (n-k)·logk) = O(n·logk). This approach benefits from the fact that operations such as lookup inside the heap all take O(logk) time." (Chapter 3: Implementation of the Top K algorithm.) Of course, you can also use a trie, with the keyword field storing the number of occurrences of each query string (0 when absent); at the end, a min-heap of 10 elements sorts the queries by occurrence frequency. A minimal sketch of the hash_map + min-heap combination follows.
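Here is a minimal, self-contained sketch of that hash_map + size-K min-heap combination; the function name and the toy input are illustrative, not from the original article:

```cpp
#include <functional>
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hash_map statistics + a size-K min-heap: O(N) counting + O(N' log K) selection.
std::vector<std::pair<std::string, long long>>
topK(const std::vector<std::string>& queries, size_t k) {
    std::unordered_map<std::string, long long> freq;
    for (const auto& q : queries) ++freq[q];            // O(N) counting

    using Item = std::pair<long long, std::string>;     // (count, query)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;  // min-heap
    for (const auto& kv : freq) {
        if (heap.size() < k) {
            heap.push({kv.second, kv.first});
        } else if (kv.second > heap.top().first) {      // beats the current k-th largest
            heap.pop();
            heap.push({kv.second, kv.first});
        }
    }

    std::vector<std::pair<std::string, long long>> result;
    while (!heap.empty()) {
        result.push_back({heap.top().second, heap.top().first});
        heap.pop();
    }
    return result;                                      // ascending by count
}

int main() {
    std::vector<std::string> queries{"a", "b", "a", "c", "a", "b"};  // toy data
    for (const auto& r : topK(queries, 2))
        std::cout << r.first << ": " << r.second << '\n';
}
```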
3. There is a 1 GB file in which every line is a word; each word is at most 16 bytes and the memory limit is 1 MB. Return the 100 most frequent words.
From the two examples above, divide and conquer + hash statistics + heap/quick sort is starting to look like it works every time. Now let's take a few more problems and verify that a little further. What do we do when the file is large and the memory is limited? What else could it be but:
- Divide and conquer / hash mapping: read the file sequentially; for each word x compute hash(x) % 5000 and write x into the corresponding one of 5000 small files (x0, x1, ..., x4999), about 200 KB per file on average. If any file grows larger than 1 MB, keep splitting it in the same way until no small file exceeds 1 MB. (A sketch of this splitting step appears after this list.)
- Hash_map statistics: For each small file, the trie tree /hash_map is used to count the words in each file and the corresponding frequency.
- Heap/merge sort: take each small file's 100 most frequent words (a min-heap of 100 nodes will do) and write those 100 words and their frequencies to a new file, which yields another 5000 files. The final step is to merge these 5000 files (essentially a merge sort).
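A minimal sketch of the splitting step referenced in the first bullet above; the input and shard filenames are hypothetical, and note that opening 5000 output files at once may exceed the operating system's file-descriptor limit, in which case you would raise the limit or split in several passes:

```cpp
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Split a large word file into NUM_SHARDS small files by hash(word) % NUM_SHARDS,
// so every occurrence of the same word lands in the same shard.
// "big_words.txt" and the shard naming scheme are illustrative.
int main() {
    const int NUM_SHARDS = 5000;   // may require raising the open-file limit
    std::ifstream in("big_words.txt");
    std::vector<std::ofstream> shards(NUM_SHARDS);
    for (int i = 0; i < NUM_SHARDS; ++i)
        shards[i].open("shard_" + std::to_string(i) + ".txt");

    std::hash<std::string> hasher;
    std::string word;
    while (in >> word)
        shards[hasher(word) % NUM_SHARDS] << word << '\n';
}
```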
4. Massive data is distributed among 100 computers. Find a way to efficiently collect the TOP10 of this batch of data.
- Heap sort: first find the TOP10 on each computer, using a heap of 10 elements (to find the 10 smallest use a max-heap, to find the 10 largest use a min-heap). For the TOP10 largest, for example, take the first 10 elements and adjust them into a min-heap; then scan the remaining data, comparing each element with the heap top; if it is larger than the heap top, replace the top with that element and re-adjust into a min-heap. The final heap contents are that computer's TOP10.
- After finding the TOP10 on each computer, combine the TOP10 lists from the 100 computers, which gives 1000 candidates in total, and then use a similar method to pick the overall TOP10 (see the sketch after this list).
- Note, however, that this assumes the same element does not appear on several machines. Otherwise, first iterate over all the data and hash it again so that each element appears on only a single computer; then use the method above to count each element's occurrences per computer, take the per-machine TOP10, and finally combine the 100 TOP10 lists to find the final TOP10.
- Or, brute force: directly count the number of occurrences of each element on every computer, then add up the counts of the same element across machines, and take the TOP10 over all the data.
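A hedged sketch of the merge step from the second bullet, assuming (as discussed above) that the data has been hashed so each element lives on exactly one machine; the struct and toy data are illustrative:

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Combine the 100 per-machine TOP10 lists (1000 candidates in total)
// and keep the overall TOP10 by count.
struct Entry {
    std::string key;
    long long count;
};

std::vector<Entry> globalTop10(std::vector<Entry> candidates) {
    size_t k = std::min<size_t>(10, candidates.size());
    std::partial_sort(candidates.begin(), candidates.begin() + k, candidates.end(),
                      [](const Entry& a, const Entry& b) { return a.count > b.count; });
    candidates.resize(k);
    return candidates;
}

int main() {
    std::vector<Entry> candidates = {{"x", 900}, {"y", 750}, {"z", 1200}};  // toy data
    for (const auto& e : globalTop10(candidates))
        std::cout << e.key << ": " << e.count << '\n';   // z, x, y
}
```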
5. There are 10 files of 1 GB each; every line of every file holds a user's query, and queries may repeat across files. You are asked to sort the queries by frequency.
Scheme 1: the direct approach:
- Hash mapping: read the 10 files sequentially and, according to hash(query) % 10, write each query into one of another 10 files (named A0, A1, ..., A9). Each newly generated file is then also about 1 GB in size (assuming the hash function is random).
- Hash_map statistics: on a machine with about 2 GB of memory, use hash_map(query, query_count) to count the number of times each query appears. Note: hash_map(query, query_count) is used to count the occurrences, not to store the raw queries.
- Heap/quick/merge sort: use quick/heap/merge sort to sort by occurrence count, write the sorted queries and their query_count to a file, and thus obtain 10 sorted files. Finally, merge-sort these 10 files (a combination of internal and external sorting). Following scheme 1, here is one implementation: Github.com/ooooola/sor….
Scheme 3: similar to scheme 1, but after the hashing is done and the data is split into multiple files, those files can be processed with a distributed architecture (such as MapReduce) and the results merged at the end.
6. Given two files A and B, each containing 5 billion URLs of 64 bytes each, with a memory limit of 4 GB, find the URLs common to A and B.
We can estimate that each file is about 5G × 64 = 320 GB, far larger than the 4 GB memory limit, so it is impossible to load everything into memory at once. Consider a divide-and-conquer approach.
- Divide and conquer / hash mapping: iterate over file A, compute hash(url) % 1000 for each URL, and write the URLs into 1000 small files (call them a0, a1, ..., a999); each small file is then roughly 300 MB. Iterate over file B and store its URLs into 1000 small files in the same way (b0, b1, ..., b999). After this processing, any URLs that could possibly match end up in the pair of small files with the same index (ai and bi); small files with different indices cannot contain the same URL. So we only need to find the common URLs within each of the 1000 pairs of small files.
- Hash_set statistics: to find the common URLs in one pair of small files, load the URLs of one of the two files into a hash_set, then iterate over each URL of the other file and check whether it is in the hash_set just built; if it is, it is a common URL, so write it out to a result file. A sketch of this step follows.
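A minimal sketch of the per-pair hash_set step; the shard filenames follow a hypothetical a_i/b_i naming scheme:

```cpp
#include <fstream>
#include <string>
#include <unordered_set>

// For one pair of shard files (a_i, b_i): load the URLs of a_i into a hash_set,
// then stream b_i and emit every URL that is also present in the set.
void intersectShard(const std::string& fileA, const std::string& fileB,
                    std::ofstream& out) {
    std::unordered_set<std::string> urls;
    std::ifstream a(fileA), b(fileB);
    std::string url;
    while (a >> url) urls.insert(url);
    while (b >> url)
        if (urls.count(url)) out << url << '\n';
}

int main() {
    std::ofstream out("common_urls.txt");        // illustrative output file
    for (int i = 0; i < 1000; ++i)               // one pass per pair of shards
        intersectShard("a_" + std::to_string(i), "b_" + std::to_string(i), out);
}
```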
7. How do we find the item repeated the most times in a mass of data?
Solution: first hash, then map by taking the hash value modulo into small files, find the most repeated item in each small file and record its count, and then among those candidates find the one repeated most often (refer to the previous problem).
8. Given tens of millions or hundreds of millions of data items (with duplicates), find the N items that occur most frequently.
Solution: tens or hundreds of millions of items should fit in the memory of today's machines, so consider using a hash_map, binary search tree, red-black tree, etc. to count the occurrences, and then use a heap to extract the N most frequent items.
9. A text file of about 10,000 lines, one word per line: report the 10 most frequent words, give your approach, and analyze its time complexity.
10. There are 10 million strings, some of which are duplicates. Remove all the duplicates and keep the strings that are not duplicated. How would you design and implement this?
- Option 1: Use a trie tree, or a hash_map.
- Scheme 2: from reader xjbzju: an insertion workload of 10 million items into an STL set is completely impractical; a previous attempt to insert 1 million elements into a set was already unbearably slow, and a hash-based implementation is not dramatically better than a red-black tree. It is recommended to hash the strings into small files first and then process each file separately.
- 1. Is hash_set really faster than set for insertion? Is the benchmark data given in this post reliable: t.cn/zOibP7t?
- 2. How does map compare with hash_map? Has anyone run the experiment?
- 3. And what about query operations? Is it as the following passage describes:
Or: use map for small data volumes, since construction is fast, and hash_map for large data volumes?
rbtree vs. hashtable
According to a friend's performance test of red-black trees versus hash tables, when the keys are basically ints the hash table is 3-4 times as fast as the rbtree, but the hash table usually wastes about half of its memory.
The rbtree, for its part, offers more: for example, you can iterate over the values in key order; but each node needs three extra pointers (or offsets), and if you want additional functionality, such as counting the number of keys falling in a given range, you also have to add a count member.
(For more on the underlying structures, see: From B trees, B+ trees, B* trees to R trees.)
11. A text file: find the 10 most frequent words, but this time the file is very long, say hundreds of millions or even a billion lines; in short, it cannot be read into memory at once. Ask for the optimal solution.
12. Find the largest 100 numbers among 1,000,000 numbers.
Next, let's look at the second key: multi-layer bucket division.
Key two: multi-layer bucket division
Multi-layer bucket division is essentially still the idea of divide and conquer; the emphasis is on the technique of "dividing"! Basic principles and key points: the range of the elements is too large for a direct addressing table, so we narrow the range step by step through several rounds of partitioning and finally work within an acceptable range. Example problems:
13. Find the non-repeating integers among 250 million integers, where memory is not large enough to hold all 250 million of them. This is a bit like the pigeonhole principle: there are 2^32 possible integer values, so we can divide the 2^32 values into 2^8 regions (for example, one file per region), distribute the data into the different regions, and then each region can be solved directly with a bitmap. In other words, as long as there is enough disk space, the problem can be handled very conveniently.
14. Find the median of 500 million ints.
- Idea 1: this example is even more direct than the one above. First divide the int range into 2^16 regions, then read the data and count how many numbers fall into each region. From these counts we can determine which region the median falls into, and also which position within that region the median occupies. A second scan then only needs to count the numbers that fall into that region. In fact, if the values were int64 rather than int, three rounds of such partitioning would bring the range down to an acceptable level: divide int64 into 2^24 regions, determine which region holds the target value, then divide that region into 2^20 sub-regions and determine which sub-region holds it; that sub-region contains only about 2^20 values, so a direct addressing table can finish the count.
- Idea 2: the same approach when the data lives on disk and must be read in two passes, somewhat like radix sort. Open an int array of size 65536. On the first pass, bucket each Int32 by its high 16 bits: values 0-65535 count into bucket 0, 65536-131071 into bucket 1, and so on, which is the same as dividing by 65536; since Int32 / 65536 can take at most 65536 distinct values, an array of length 65536 is enough. For each number read, increment the corresponding counter (negative numbers need an offset of 32768 so that they land in the right bucket). After the first pass, walk the array and accumulate the counts to see which interval the median lies in: say it lies in interval k, meaning the cumulative count of intervals 0 through k-1 is still below half of the total (250 million), while including interval k reaches or passes the halfway point. A second pass then only examines the numbers that fall in interval k to pin down the median exactly. A minimal sketch of this two-pass bucket counting follows.
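Below is a hedged C++14 sketch of that two-pass bucket counting; the `readAll` callback is a placeholder for whatever re-readable stream of int32 values you actually have (for example, a second pass over the file on disk):

```cpp
#include <cstdint>
#include <vector>

// Two-pass bucket counting to find the median of a huge sequence of int32 values
// that can only be streamed, never held in memory at once.
template <class ReadAll>
int32_t streamedMedian(uint64_t n, ReadAll readAll) {
    // Pass 1: count how many values fall into each of the 65536 buckets keyed by
    // the high 16 bits (the sign bit is flipped so negative ints order correctly).
    std::vector<uint64_t> hi(65536, 0);
    readAll([&](int32_t x) { ++hi[(uint32_t(x) ^ 0x80000000u) >> 16]; });

    uint64_t target = (n - 1) / 2;      // index of the (lower) median
    uint64_t before = 0;
    uint32_t bucket = 0;
    while (before + hi[bucket] <= target) before += hi[bucket++];

    // Pass 2: within the chosen bucket, count by the low 16 bits.
    std::vector<uint64_t> lo(65536, 0);
    readAll([&](int32_t x) {
        if (((uint32_t(x) ^ 0x80000000u) >> 16) == bucket) ++lo[uint32_t(x) & 0xFFFF];
    });
    uint32_t low = 0;
    while (before + lo[low] <= target) before += lo[low++];
    return int32_t(((bucket << 16) | low) ^ 0x80000000u);
}

int main() {
    std::vector<int32_t> data{5, -3, 9, 1, 7};   // toy stand-in for the on-disk stream
    auto readAll = [&](auto&& f) { for (int32_t x : data) f(x); };  // C++14 generic lambda
    return streamedMedian(data.size(), readAll) == 5 ? 0 : 1;       // median of toy data is 5
}
```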
Key three: Bloom Filter /Bitmap
Bloom filter
For more on what a Bloom Filter is, see this blog post:
A Bloom filter maps the elements of a set into a bit array and uses whether all of an element's mapped bits are 1 to indicate whether the element is in the set. A Counting Bloom Filter (CBF) extends each bit of the array into a counter, which makes element deletion possible. A Spectral Bloom Filter (SBF) associates the counters with the occurrence counts of the set's elements; SBF uses the minimum value among an element's counters to approximate its occurrence frequency.
Please refer to question 6 above:
“6. Given two files A and B, each holding 5 billion URLs, each URL occupying 64 bytes, with a memory limit of 4 GB, find the URLs common to A and B. And what if there were three, or even n, files?
With an error rate of 0.01 we would need roughly 65 billion bits; we only have about 34 billion available, which is not that far off, so the error rate would rise a little. In addition, if these URLs correspond one-to-one with IPs, they can be converted to IPs, which is much simpler.
Given two files A and B, each holding 5 billion URLs of 64 bytes each, with a memory limit of 4 GB: if a certain error rate is allowed, you can use a Bloom filter; 4 GB of memory can represent roughly 34 billion bits. Map the URLs of one of the files onto those 34 billion bits using the Bloom filter, then read the URLs of the other file one by one and check whether each is in the Bloom filter; if it is, it should be a common URL (mind the error rate).”
A minimal Bloom filter sketch follows.
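The sketch below uses toy sizes and a simple double-hashing scheme derived from std::hash; the real problem would need tens of billions of bits, which is exactly the sizing discussion above:

```cpp
#include <functional>
#include <string>
#include <vector>

// A minimal Bloom filter: k bit positions per element, derived from two base
// hashes (double hashing). False positives are possible, false negatives are not.
class BloomFilter {
public:
    BloomFilter(size_t numBits, int numHashes)
        : bits_(numBits, false), k_(numHashes) {}

    void add(const std::string& s) {
        for (int i = 0; i < k_; ++i) bits_[index(s, i)] = true;
    }
    bool possiblyContains(const std::string& s) const {
        for (int i = 0; i < k_; ++i)
            if (!bits_[index(s, i)]) return false;   // definitely absent
        return true;                                 // present, or a false positive
    }

private:
    size_t index(const std::string& s, int i) const {
        size_t h1 = std::hash<std::string>{}(s);
        size_t h2 = std::hash<std::string>{}(s + "#");  // cheap second hash; fine for a sketch
        return (h1 + i * h2) % bits_.size();
    }
    std::vector<bool> bits_;
    int k_;
};

int main() {
    BloomFilter bf(1u << 27, 7);        // ~16 MB of bits, 7 hash functions (toy sizes)
    bf.add("http://example.com/a");
    bool hit = bf.possiblyContains("http://example.com/a");    // true
    bool miss = bf.possiblyContains("http://example.com/b");   // almost certainly false
    (void)hit; (void)miss;
}
```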
Bitmap
- For more information on what bitmaps are, see the second part of this blog post: blog.csdn.net/v_july_v/ar… .
For the application of Bitmap, see question 13 above, and another new question:
“13. Find the non-repeating integers in the 250 million integers. Note that there is not enough memory to hold the 250 million integers.
Scheme 1: use a 2-bitmap (allocate 2 bits for each number: 00 means the number is absent, 01 means it has appeared once, 10 means it has appeared multiple times, 11 is meaningless), which takes 2^32 * 2 bits = 1 GB of memory in total, an acceptable amount. Then scan the 250 million integers and check each number's corresponding bits in the bitmap: 00 becomes 01, 01 becomes 10, and 10 stays unchanged. When the scan is done, walk the bitmap and output the integers whose bits are 01. Scheme 2: similar to question 1, the data can also be split into small files; then find the non-repeating integers in each small file and sort them, and finally merge, taking care to remove duplicate elements.” A minimal sketch of scheme 1's 2-bit map follows.
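This is a hedged sketch of the 2-bit-per-value map; note that, exactly as the scheme says, it really allocates 1 GB:

```cpp
#include <cstdint>
#include <vector>

// 2-bit-per-value map over the full 32-bit range: 00 = unseen, 01 = seen once,
// 10 = seen more than once. Needs 2^32 * 2 bits = 1 GB, as stated above.
class TwoBitMap {
public:
    TwoBitMap() : words_((1ull << 32) * 2 / 64, 0) {}   // 1 GB of zeroed memory

    void record(uint32_t x) {
        uint64_t bit = uint64_t(x) * 2;
        uint64_t cur = (words_[bit / 64] >> (bit % 64)) & 0x3;
        if (cur == 0) set(bit, 1);          // first sighting: 00 -> 01
        else if (cur == 1) set(bit, 2);     // second sighting: 01 -> 10
        // cur == 2: stays 10
    }
    bool seenExactlyOnce(uint32_t x) const {
        uint64_t bit = uint64_t(x) * 2;
        return ((words_[bit / 64] >> (bit % 64)) & 0x3) == 1;
    }

private:
    void set(uint64_t bit, uint64_t value) {
        words_[bit / 64] &= ~(0x3ull << (bit % 64));
        words_[bit / 64] |= value << (bit % 64);
    }
    std::vector<uint64_t> words_;
};

int main() {
    TwoBitMap m;                            // really allocates 1 GB, as the scheme requires
    const uint32_t xs[] = {5, 7, 5, 9};     // toy stand-in for the 250 million integers
    for (uint32_t x : xs) m.record(x);
    bool once = m.seenExactlyOnce(7);       // true
    bool repeated = !m.seenExactlyOnce(5);  // true: 5 was seen twice
    (void)once; (void)repeated;
}
```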
15. Given 4 billion unsorted unsigned ints, and then one more number, how do we quickly determine whether that number is among the 4 billion? Scheme 1 (from reader OO): use a bitmap. Allocate 512 MB of memory and let one bit represent each possible unsigned int value. Read the 4 billion numbers and set the corresponding bits; then read the number to be queried and check whether its bit is 1: 1 means the number is present, 0 means it is not. (A short sketch follows.)
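And a shorter sketch of the 1-bit-per-value bitmap for question 15 (it really allocates 512 MB; the sample numbers stand in for the actual 4-billion-number input):

```cpp
#include <cstdint>
#include <vector>

// 2^32 bits = 512 MB: one bit per possible unsigned int value.
int main() {
    std::vector<uint64_t> bits((1ull << 32) / 64, 0);
    auto set  = [&](uint32_t x) { bits[x >> 6] |= 1ull << (x & 63); };
    auto test = [&](uint32_t x) { return (bits[x >> 6] >> (x & 63)) & 1ull; };

    // First pass: mark every number that occurs in the input
    // (reading the real 4 billion numbers is omitted; two literals stand in).
    set(42);
    set(4000000000u);

    // Query: O(1) per lookup.
    bool present = test(42) != 0;       // true
    bool absent  = test(7) == 0;        // true
    (void)present; (void)absent;
}
```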
Key four: Trie tree/database/inverted index
Trie tree
Scope of application: large data volumes with much repetition, but the number of distinct items is small enough to fit in memory. Basic principles and key points: the implementation, and how a node's children are represented. Extension: compressed implementations. Example problems:
- Search for hot queries: the query string has a high degree of repetition, although the total number is 10 million, but if you remove the repetition, no more than 3 million, each no longer than 255 bytes.
- There are 10 files, 1 gb each, and each row of each file contains the user’s query, which can be repeated for each file. Ask you to sort by the frequency of queries.
- 10 million strings, some of which are identical (duplicate), you need to remove all the duplicates and keep the unduplicated strings. How to design and implement?
- Question 9 above: a text file with about 10,000 lines, one word per line; count the 10 most frequent words. The solution: use a trie to count the number of occurrences of each word in O(n*le) time (le being the average word length), and then use a min-heap of 10 elements to pick the 10 most frequent words. A minimal trie sketch appears after the link below.
For more on Trie trees, see this article: From Trie trees (dictionary trees) to suffix trees.
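A minimal trie sketch for the word-counting use case from the list above, restricted to lowercase a-z for brevity; the toy word list is illustrative, and the Top-10 step would then push the counts into a 10-element min-heap as described earlier:

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <vector>

// A small trie over lowercase words; each terminal node stores an occurrence
// count, so inserting all words costs O(n * le), le being the average word length.
struct TrieNode {
    int count = 0;
    std::unique_ptr<TrieNode> child[26];
};

void insert(TrieNode* root, const std::string& word) {
    TrieNode* node = root;
    for (char c : word) {
        int i = c - 'a';
        if (!node->child[i]) node->child[i] = std::make_unique<TrieNode>();
        node = node->child[i].get();
    }
    ++node->count;   // one more occurrence of this word
}

int query(const TrieNode* root, const std::string& word) {
    const TrieNode* node = root;
    for (char c : word) {
        node = node->child[c - 'a'].get();
        if (!node) return 0;
    }
    return node->count;
}

int main() {
    TrieNode root;
    std::vector<std::string> words{"data", "hash", "data", "trie", "data"};  // toy data
    for (const auto& w : words) insert(&root, w);
    std::cout << query(&root, "data") << '\n';   // 3
}
```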
Database indexing
Scope of application: insertion, deletion, update, and query over large data volumes. Basic principles and key points: apply the design and implementation methods of databases to the insertion, deletion, update, and query of massive data.
- About database index and its optimization, more can see this article: www.cnblogs.com/pkuoliver/a… ;
- About the data structure and algorithm principle behind the MySQL index, there is a good article: blog.codinglabs.org/articles/th… ;
- There is an excellent article on B trees, B+ trees, B* trees and R trees in this blog: blog.csdn.net/v_JULY_v/ar… .
Inverted index
What is an inverted index? It is an indexing method used in full-text search to store, for each word, a mapping to its locations in a document or a group of documents. For example, take the following texts to be indexed:
T0 = "it is what it is", T1 = "what is it", T2 = "it is a banana"
We obtain the following inverted file index:
"a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1}
The query "what", "is", "it" then corresponds to the intersection of the relevant sets. A forward index, by contrast, stores a list of words for each document; forward-index queries suit ordered, frequent full-text queries over each document and the verification of words within a given document. In a forward index the documents take center stage: each document points to the sequence of index terms it contains, i.e. the document points to the words it contains, whereas an inverted index has each word point to the documents containing it, hence the "inverted" relationship is easy to see. Example problem: a document retrieval system that looks up the files containing a given word, such as the common keyword search over academic papers. A small sketch of building and querying an inverted index follows.
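The sketch below builds the inverted index for the T0/T1/T2 example above and answers the conjunctive query "what is it" by intersecting posting lists:

```cpp
#include <iostream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> docs = {
        "it is what it is",   // T0
        "what is it",         // T1
        "it is a banana"      // T2
    };

    // Build the inverted index: word -> set of document ids.
    std::map<std::string, std::set<int>> index;
    for (int id = 0; id < (int)docs.size(); ++id) {
        std::istringstream words(docs[id]);
        std::string w;
        while (words >> w) index[w].insert(id);
    }

    // Intersect the postings of "what", "is", "it"  ->  {0, 1}
    std::set<int> result = index["what"];
    for (std::string w : {"is", "it"}) {
        std::set<int> next;
        for (int id : index[w])
            if (result.count(id)) next.insert(id);
        result = next;
    }
    for (int id : result) std::cout << "T" << id << ' ';   // T0 T1
    std::cout << '\n';
}
```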
For more on the use of inverted indexes, see:
- Chapter 23 and 4: Young’s matrix search, inverted index keywords Hash not repeated coding practice,
- Chapter 26: Coding and practice of generating inverted index based on a given document.
Key five: External sort
Scope of application: sorting and de-duplication of big data. Basic principles and key points: the merge method of external sorting, the principle of replacement selection with a loser tree, and the optimal merge tree. Example problem: 1) There is a 1 GB file in which every line is a word; each word is at most 16 bytes and the memory limit is 1 MB. Return the 100 most frequent words. This data has an obvious characteristic: words are at most 16 bytes each, but 1 MB of memory is not quite enough for hashing, so the memory can instead serve as an input buffer for an external sort. A sketch of the k-way merge step follows.
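A hedged sketch of the k-way merge step of an external sort, assuming the sorted runs have already been produced (one word per line per run file; the filenames are illustrative):

```cpp
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// K-way merge of sorted run files: a min-heap of (current word, run index)
// streams the merged output without loading any run fully into memory.
void mergeRuns(const std::vector<std::string>& runFiles, const std::string& outFile) {
    std::vector<std::ifstream> runs;
    for (const auto& f : runFiles) runs.emplace_back(f);

    using Item = std::pair<std::string, size_t>;               // (word, which run)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;

    std::string w;
    for (size_t i = 0; i < runs.size(); ++i)
        if (runs[i] >> w) heap.push({w, i});                   // prime the heap

    std::ofstream out(outFile);
    while (!heap.empty()) {
        Item top = heap.top();
        heap.pop();
        out << top.first << '\n';
        if (runs[top.second] >> w) heap.push({w, top.second}); // refill from the same run
    }
}

int main() {
    mergeRuns({"run_0.txt", "run_1.txt", "run_2.txt"}, "sorted.txt");  // illustrative names
}
```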
For concrete application scenarios of multi-way merging and external sorting, please refer to this blog post:
Key six: MapReduce for distributed processing
MapReduce is a computing model that, put simply, decomposes a large amount of work (data) into pieces (MAP), executes them, and then merges the results into a final result (REDUCE). The advantage is that once the task has been decomposed, a large number of machines can work on it in parallel, which shortens the overall running time. If you want an even plainer description: at its core, MapReduce behaves much like a merge sort.
Scope of application: the data volume is large, but the number of distinct kinds of data is small enough to fit in memory. Basic principles and key points: hand the data to different machines for processing, partition the data, and reduce the results. Example problems:
- The canonical example application of MapReduce is a program that counts the appearances of each different word in a set of documents (a single-process sketch appears after this list);
- With a huge amount of data distributed among 100 computers, find a way to efficiently compile the TOP10 of this batch of data.
- There are N machines, each holding N numbers. Each machine can store at most O(N) numbers and operate on them. How do we find the median of the N^2 numbers?
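The canonical word-count example, written as a hedged single-process C++ simulation of the map/shuffle/reduce data flow (a real Hadoop job would distribute the mappers and reducers across machines, but the logic is the same):

```cpp
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// map() emits (word, 1) pairs, the shuffle groups pairs by key,
// and reduce() sums the values for each key.
using KV = std::pair<std::string, int>;

std::vector<KV> mapPhase(const std::string& document) {
    std::vector<KV> emitted;
    std::istringstream words(document);
    std::string w;
    while (words >> w) emitted.push_back({w, 1});
    return emitted;
}

int reducePhase(const std::vector<int>& valuesForOneKey) {
    int sum = 0;
    for (int v : valuesForOneKey) sum += v;
    return sum;
}

int main() {
    std::vector<std::string> documents = {"hello big data", "big data big model"};  // toy input

    // Shuffle: group all emitted values by key.
    std::map<std::string, std::vector<int>> grouped;
    for (const auto& doc : documents)
        for (const auto& kv : mapPhase(doc))
            grouped[kv.first].push_back(kv.second);

    for (const auto& g : grouped)
        std::cout << g.first << ": " << reducePhase(g.second) << '\n';
    // big: 3, data: 2, hello: 1, model: 1
}
```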
For more details, please refer to the blog:
- Massive data processing from the Hadoop framework and the MapReduce model,
- And preliminary understanding and learning of MapReduce technology.
Other models/methodologies, combined with operating system knowledge
- Physical address: the address placed on the address bus. For a read, the circuitry puts the contents of the physical memory at that address onto the data bus according to the value of the address bits; for a write, the circuitry stores the contents of the data bus into the physical memory at that address. Physical memory is addressed in units of bytes (8 bits).
- Virtual address: an address in the 4 GB virtual address space, which is what programs use. Under paging, the 4 GB address space is divided into fixed-size pages; each page is mapped to physical memory, to a swap file on the hard disk, or to nothing at all. For a typical program only a small portion of the 4 GB address space maps to physical memory, and a large chunk maps to nothing. Physical memory is also paged in order to map the address space. For 32-bit Win2k the page size is 4 KB. The information the CPU uses to translate virtual addresses into physical addresses is stored in structures called page directories and page tables.
Borrowing this method from the operating system: treat the 4 GB address table as something to split into 4 MB chunks and index, i.e. a two-level index, where the first ten bits select which 4 MB chunk and the remaining bits give the position within that chunk, and so on. Storage is designed around key-value pairs, with the key used to build the index.
Finally, what if I only have 10,000 numbers and want to randomly pick 100 of them? This is left for the reader to think about. For more massive data processing interview questions, see the first part of this post: blog.csdn.net/v_july_v/ar… .
References
- Ten massive data processing interview questions and ten methods summary;
- Mass data processing interview questions collection and bitmap detailed explanation;
- Parsing Hash table algorithms from start to finish
- Bloom Filter for mass data processing;
- From Trie tree (dictionary tree) to suffix tree;
- Chapter 3: Implementation of the Top K algorithm;
- Chapter 10, how to sort 10^7 disk files;
- From B tree, B+ tree, B* tree to R tree;
- Chapter 23 and chapter 4: Young’s matrix search, inverted index keywords Hash non-repetitive coding practice;
- Chapter 26: Coding and practice of generating inverted index based on a given document;
- Massive data processing from the Hadoop framework and the MapReduce model;
- Chapter 16 ~ 20: full arrangement, jump step, odd and even sort, the first one only appears once and so on;
- Blog.csdn.net/v_JULY_v/ar… ;
- STL Source Code Analysis, Chapter 5, by Hou Jie;
- 2012 Baidu intern written exam: blog.csdn.net/hackbuteer1… ;
- A great Redis/MongoDB site: blog.nosqlfan.com/;
- One foreign test website: www.careercup.com/.
Afterword
Job hunting / algorithms / problem practice:
www.julyedu.com/category/in…
Big news: my new book, The Way to Code: Interviews and Algorithms, finally hit the shelves on October 14, 2015! Address: item.jd.com/11786791.ht… . JD, Dangdang, Amazon, and the other major online stores have it in stock.