Personally, I think this is a weak spot for job seekers: students rarely get hands-on exposure to this kind of problem, yet interviewers love to ask about it.
For massive data, the common methods are: hashing, bitmaps, Bloom filters, database optimization, inverted indexes, external sorting, Trie trees, heaps, double-layer bucket sorting, and MapReduce.
Hash method
The main advantage of a hash table is O(1) lookup. Since a hash table does not need to maintain any storage order, the time to store an element does not grow as the data size grows. However, hash access is prone to collisions, so it may not be suitable for massive data.
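A minimal sketch of the idea (assuming the data fits in memory), using a hash set for average O(1) membership tests:

```python
# Minimal sketch: deduplicate values with a hash set (average O(1) per lookup).
# Assumes the whole data set fits in memory.
def dedup(values):
    seen = set()
    unique = []
    for v in values:
        if v not in seen:        # average O(1) hash lookup
            seen.add(v)
            unique.append(v)
    return unique

print(dedup([3, 1, 3, 2, 1]))    # [3, 1, 2]
```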
Bitmap method
As the name implies, the bitmap method represents each datum with a single bit. To do this we need to know the range of the data in advance. For example, if the data range is 1~10, we apply for 10 bits of space; if the data range is $1 \sim 10^3$, we apply for 1,000 bits of space.
The main uses of the bitmap method are O(N) sorting and O(N) deduplication. Compared with other $O(N \log_2 N)$ sorts, the bitmap method is fast, but in essence it still trades space for time, so it is a poor fit when the data distribution is sparse, where the cost-effectiveness is undoubtedly low.
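A minimal sketch of bitmap sorting/deduplication for integers in a known range (the range bounds are parameters here):

```python
# Minimal sketch: bitmap deduplication/sorting for integers in a known range [lo, hi].
# One bit per possible value, so memory is roughly (hi - lo + 1) / 8 bytes.
def bitmap_sort(values, lo, hi):
    bits = bytearray((hi - lo) // 8 + 1)
    for v in values:
        idx = v - lo
        bits[idx >> 3] |= 1 << (idx & 7)           # mark value as present
    return [lo + i for i in range(hi - lo + 1)
            if bits[i >> 3] & (1 << (i & 7))]      # scan bits in order

print(bitmap_sort([7, 3, 9, 3, 1], 1, 10))         # [1, 3, 7, 9]
```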
Bloom filter
The Bloom filter combines the two previous methods: it defines an m-bit array and k different hash functions. When adding a datum, the k hash functions produce k positions (which are not necessarily distinct from each other), and all k positions in the array are set to 1. When looking up a datum p, if the k hashed positions are not all 1, then p definitely does not exist.
To sum up, although this method saves space, it sacrifices accuracy. Accuracy here means that only a negative result (the datum does not exist) is guaranteed to be correct.
One more point: the error rate is minimized when the number of hash functions is $k = (\ln 2) \cdot (m/n) \approx 0.7\,(m/n)$, where m is the number of bits and n is the number of inserted elements. CBF and SBF were developed from this: CBF (Counting Bloom Filter) expands each bit of the bit array into a counter, which supports deletion, while SBF (Spectral Bloom Filter) uses the minimum counter value to approximate the occurrence frequency of an element.
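A minimal Bloom filter sketch; the class name and the double-hashing scheme used to derive the k positions are illustrative assumptions, not a specific library's API:

```python
# Minimal Bloom filter sketch: m-bit array with k hash positions per item.
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, item):
        # Derive k positions from two hash values (double hashing) -- an assumption.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

bf = BloomFilter(m=10_000, k=7)
bf.add("hello")
print(bf.might_contain("hello"), bf.might_contain("world"))  # True False (very likely)
```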
Database optimization method
Common Optimization Ideas
- Data partitioning
- Indexes
- Caching mechanisms
- Virtual storage
- Batch processing
- Use temporary tables and intermediate tables
- Optimize query statements
- Use views
- Use stored procedures
- Sorted access instead of unsorted access
- Data mining on sampled data
Inverted index
In practical search-engine applications, it is sometimes necessary to look up records by certain keyword values, so an index is built on the keywords; this is called an inverted index.
Inverted indexes take two forms
- A record-level inverted index contains, for each word, a list of the documents that contain it
- A word-level inverted index additionally contains the position of each word within a document
The second form offers more functionality, but requires more time and space
Compared with a forward index, an inverted index is better suited to full-text queries, because it points from each word to the documents containing it. Therefore, when querying with multiple keywords, logical operations such as union and intersection can be performed on the posting lists first, before the records are actually accessed, which improves search speed.
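A minimal sketch of a record-level inverted index (word → set of document ids), with AND/OR queries as set operations; the sample documents are made up:

```python
# Minimal sketch of a record-level inverted index: word -> set of document ids.
# Multi-keyword queries become set intersections/unions on the posting lists.
from collections import defaultdict

docs = {1: "big data hash bitmap", 2: "bloom filter and hash", 3: "external sort"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

# AND query: documents containing both "hash" and "bitmap"
print(index["hash"] & index["bitmap"])   # {1}
# OR query: documents containing "hash" or "sort"
print(index["hash"] | index["sort"])     # {1, 2, 3}
```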
External sort
Compared with internal sorting, external sorting keeps the records to be sorted in external storage. The files to be sorted cannot be loaded into memory all at once, so data must be exchanged between memory and external storage many times. External sorting is usually implemented with merge sort.
Merge sort:
- Divide the data into N segments and sort each segment in memory
- Merge the segments pairwise, until finally the whole file is in order
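A minimal sketch of this two-phase scheme (sorted runs written to disk, then a k-way merge); the file handling and chunk size are illustrative assumptions:

```python
# Minimal external merge sort sketch: split a large file of integers (one per line)
# into in-memory-sorted chunk files, then k-way merge the chunks with heapq.merge.
import heapq, os, tempfile

def external_sort(in_path, out_path, chunk_size=100_000):
    chunk_paths = []
    with open(in_path) as f:
        while True:
            chunk = [int(line) for _, line in zip(range(chunk_size), f)]
            if not chunk:
                break
            chunk.sort()                               # phase 1: sort each segment in memory
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(f"{x}\n" for x in chunk)
            chunk_paths.append(path)

    files = [open(p) for p in chunk_paths]
    with open(out_path, "w") as out:
        merged = heapq.merge(*((int(line) for line in fp) for fp in files))
        out.writelines(f"{x}\n" for x in merged)       # phase 2: k-way merge of sorted runs
    for fp in files:
        fp.close()
    for p in chunk_paths:
        os.remove(p)
```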
Trie tree
The trie, also known as a dictionary tree, saves space and time by exploiting the common prefixes of strings; it is mainly used for counting and storing large numbers of strings. I plan to do a separate issue on trie algorithms
- The root node contains no character, and every node other than the root contains exactly one character
- The string corresponding to a node is obtained by concatenating the characters on the path from the root to that node
- Each node maintains an integer, initialized to 0 to indicate that it is not the end of a string; otherwise it records how many strings end at that node
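A minimal trie sketch following these three properties, here used to count string frequencies:

```python
# Minimal trie sketch for counting string frequencies; each node stores a
# child map and a counter that is non-zero only at word-ending nodes.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0          # 0 means "not the end of any inserted string"

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.count += 1

def frequency(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count

root = TrieNode()
for w in ["abc", "ab", "abc"]:
    insert(root, w)
print(frequency(root, "abc"), frequency(root, "ab"), frequency(root, "a"))  # 2 1 0
```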
The heap
A heap is a tree structure, so it is not fully ordered internally, but every parent must be larger than its children (max-heap, or big-root heap), or every child must be larger than its parent (min-heap, or small-root heap). From this structure we can guess that heaps are good at finding maximum (or minimum) values. With a pair of heaps, we can also find the median.
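A minimal sketch of the double-heap median: a max-heap for the lower half and a min-heap for the upper half (Python's heapq is a min-heap, so the lower half stores negated values):

```python
# Minimal double-heap median sketch: lower half in a max-heap (negated values),
# upper half in a min-heap; the two heap tops straddle the median.
import heapq

class MedianFinder:
    def __init__(self):
        self.lower = []   # max-heap via negation
        self.upper = []   # min-heap

    def add(self, x):
        heapq.heappush(self.lower, -x)
        heapq.heappush(self.upper, -heapq.heappop(self.lower))
        if len(self.upper) > len(self.lower):
            heapq.heappush(self.lower, -heapq.heappop(self.upper))

    def median(self):
        if len(self.lower) > len(self.upper):
            return -self.lower[0]
        return (-self.lower[0] + self.upper[0]) / 2

mf = MedianFinder()
for x in [5, 2, 8, 1]:
    mf.add(x)
print(mf.median())   # 3.5
```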
Bucket sort
Bucket sorting is really the classic divide-and-conquer idea: based on the data range and count, the data is divided into several buckets, each bucket is sorted internally, and the buckets are then combined in order.
The number of buckets is $n = (max - min)/len + 1$; if the bucket length $len$ is chosen too small, there will be too many buckets, which wastes resources.
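A minimal bucket sort sketch using that bucket-count formula; the bucket length is a parameter here:

```python
# Minimal bucket sort sketch: n = (max - min) // bucket_len + 1 buckets,
# distribute by range, sort within each bucket, then concatenate the buckets in order.
def bucket_sort(values, bucket_len=10):
    lo, hi = min(values), max(values)
    n = (hi - lo) // bucket_len + 1
    buckets = [[] for _ in range(n)]
    for v in values:
        buckets[(v - lo) // bucket_len].append(v)   # distribute by range
    result = []
    for b in buckets:
        result.extend(sorted(b))                    # sort inside each bucket
    return result

print(bucket_sort([42, 7, 19, 88, 3, 56]))   # [3, 7, 19, 42, 56, 88]
```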
MapReduce method
MapReduce is a distributed programming model that simplifies parallel computing. It aims to enable large-scale cluster systems to run parallel operations on large data sets. In the architecture, the MapReduce API provides Map and Reduce processing, while the GFS distributed file system and BigTable distributed database provide data access.
MapReduce is divided into Map and Reduce. Map performs a one-to-one transformation of the data, and Reduce reduces many data items to one, for example by summing them.
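A minimal sketch of the Map/Reduce idea (word count) without any Hadoop dependency; map emits (word, 1) pairs, the shuffle groups them by key, and reduce sums each group:

```python
# Minimal Map/Reduce sketch (word count): map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_phase(line):
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    return word, sum(counts)

lines = ["big data big", "data sets"]
shuffled = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):        # map
        shuffled[word].append(one)           # shuffle: group by key
print([reduce_phase(w, c) for w, c in shuffled.items()])
# [('big', 2), ('data', 2), ('sets', 1)]
```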
Here are some common problem types
TopK problem
Finding the largest k numbers, the k most frequent numbers, hot-search rankings, the most popular query terms, and so on.
Generally, a better scheme is [divide and conquer + Trie tree/hash + small top heap]: first split the data set into multiple small data sets by hashing, then use a trie or hash table to count the frequency of the search terms in each small data set, then use a small top heap (min-heap) to find the k most frequent terms in each data set, and finally merge the results (see the sketch after the list below).
- Local elimination: always maintain an array of size k
- Divide and conquer
- Hash-based deduplication
- Use a small top heap (min-heap)
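A minimal sketch of the [divide and conquer + hash + min-heap] scheme described above; the number of partitions is an illustrative parameter:

```python
# Minimal TopK sketch: partition words by hash so identical words land in the same
# partition, count per partition, take the per-partition top k with a heap, then merge.
import heapq
from collections import Counter

def topk_frequent(words, k, partitions=4):
    buckets = [[] for _ in range(partitions)]
    for w in words:
        buckets[hash(w) % partitions].append(w)       # same word -> same bucket

    candidates = []
    for bucket in buckets:
        counts = Counter(bucket)                      # hash-based frequency count
        candidates.extend(heapq.nlargest(k, counts.items(), key=lambda kv: kv[1]))

    return heapq.nlargest(k, candidates, key=lambda kv: kv[1])   # merge step

words = ["a", "b", "a", "c", "b", "a", "d"]
print(topk_frequent(words, 2))   # [('a', 3), ('b', 2)]
```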
Of course, this calls for case-by-case analysis, so let's subdivide the problem further
Single machine, single core, large memory
This means the data volume is small relative to memory, so it can simply be processed in memory.
Single machine, multiple cores, large memory
Use hashing to divide the data into N pieces, process each piece with its own thread, and finally merge the results with one thread. This method has an obvious bottleneck: data skew. Each thread may process at a different speed, and the gains from the fast threads are swallowed by the slow ones. The solution is to partition further: when a thread finishes its current small partition, it picks up the next small partition (see the sketch below).
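A minimal sketch of that further-partitioning idea, using a worker pool that pulls the next small partition as soon as it is free (note: in CPython a process pool would be needed for true CPU parallelism; the thread pool here just mirrors the description):

```python
# Minimal sketch: split the data into many small partitions so that fast workers
# simply pick up more partitions, which counteracts data skew.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_partition(part):
    return Counter(part)           # per-partition work (word frequency here)

def parallel_count(words, n_workers=4, part_size=1000):
    parts = [words[i:i + part_size] for i in range(0, len(words), part_size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for partial in pool.map(count_partition, parts):
            total.update(partial)  # merge step
    return total
```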
Single machine, single core, small memory
Cut data into small files
The handling is the same as in the previous case. In fact, the previous case partitions for parallelism, while this one partitions out of necessity.
Multiple machines with small memory
Distribute the data to multiple machines [hash + socket], and each machine then applies the single-machine, single-core, small-memory approach.
The TopK problem is very well suited to MapReduce: the user only needs to write one Map function and two Reduce functions and submit them to Hadoop (using Mapchain and Reducechain). Map sends data with the same hash value to the same Reduce task; the first Reduce task counts the occurrence frequency of each word, and the second Reduce task computes the top K over the outputs of all the first-stage Reduce tasks.
Duplicate problems
The problem with a plain bitmap here is that a single bit has only two states, 0 or 1, so it can only express "present" or "absent"
So we use two bits to represent each datum instead
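A minimal sketch of the 2-bit variant (00 = absent, 01 = seen once, 10 = seen more than once), here used to pick out the values that appear exactly once; the state encoding is an assumption:

```python
# Minimal sketch: 2 bits per value (00 = absent, 01 = seen once, 10 = seen more
# than once), so values that appear exactly once can be recovered.
def find_unique(values, lo, hi):
    span = hi - lo + 1
    bits = bytearray((span * 2) // 8 + 1)
    for v in values:
        idx = (v - lo) * 2                            # even offset: 2 bits never span a byte
        state = (bits[idx >> 3] >> (idx & 7)) & 0b11
        if state == 0b00:
            bits[idx >> 3] |= 0b01 << (idx & 7)       # mark "seen once"
        elif state == 0b01:
            bits[idx >> 3] ^= 0b11 << (idx & 7)       # 01 -> 10, "seen many times"
    return [lo + i for i in range(span)
            if ((bits[(i * 2) >> 3] >> ((i * 2) & 7)) & 0b11) == 0b01]

print(find_unique([3, 5, 3, 7, 5, 9], 1, 10))   # [7, 9]
```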
Sorting problems
- Database sorting method
- Divide and conquer method
- Bitmap method