Recently, I have been studying how Sphinx works. Even after an introduction to and exploration of Sphinx's principles, many questions remained, such as what underlying data structures and algorithms it uses, so I went a level deeper and looked at how Sphinx works from the data-structure perspective. I did a lot of searching online and found few articles on the topic, but I did find a book called "This Is a Search Engine". Its third chapter explains the data structures used by the major search engines and how they work.
Note: This article does not make a strict distinction between Sphinx and search engines.
Attached is a picture:
Index basics
First, some basic search-engine concepts. Understanding them is essential for following the later discussion of how indexing works.
Word-document matrix
The word-document matrix is a conceptual model expressing the containment relationship between words and documents. As shown in the figure below, each column represents a document, each row represents a word, and a check mark indicates that the document contains the word.
Read column-wise, you can see which words each document contains; read row-wise, you can see which documents contain a particular word. A search engine index is a concrete data structure implementing the word-document matrix. The conceptual model can be implemented in different ways, such as inverted indexes, signature files, and suffix trees. Experimental data show, however, that the inverted index is the best way to implement the word-to-document mapping.
Basic concepts of the inverted index
Document: a storage object in text form, such as a web page or a Word, PDF, or XML file.
Document collection: a set of documents, for example a large number of web pages.
Document ID: a unique number that identifies a document inside the search engine.
Word ID: a unique number that identifies a word inside the search engine.
Inverted index: a concrete storage form of the word-document matrix, consisting mainly of a word dictionary and an inverted file.
Word dictionary (lexicon): the set of all words that appear in the document collection. Each entry records some information about the word itself and a pointer to its inverted list.
Inverted list (posting list): the list of all documents in which a word appears, together with information such as where the word occurs in each document. Each record in the list is called a posting.
Inverted file: the physical file on disk that stores the inverted lists of all words.
The relationship between concepts is shown in the figure below.
Simple instance of inverted index
Let’s take an example to get a better sense of the inverted index.
Assume that the document set consists of five documents, each of which has the following contents:
The inverted index is as follows:
Word ID: the number assigned to each word;
Word: the word itself;
Document frequency (DF): how many documents in the collection contain the word;
Inverted list: the document IDs and other necessary information, including:
TF: the number of times the word appears in a given document;
POS: the position(s) where the word appears in the document.
Take the word "join" as an example. Its word ID is 8 and its document frequency is 3, meaning three documents in the collection contain the word. Its inverted list is {(2; 1; <4>), (3; 1; <7>), (5; 1; <5>)}: the word appears in documents 2, 3, and 5, once in each. In document 2 its POS is 4, i.e. "join" is the fourth word of document 2; the other entries are read similarly.
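To make the structure concrete, here is a minimal Python sketch of building such an index. The document contents and IDs are made up for illustration; a real engine would also assign word IDs and store the lists in an on-disk inverted file.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build a toy inverted index: word -> list of (doc_id, TF, positions).

    `docs` maps document IDs to already-segmented word lists; positions
    are 1-based, as in the article's example.
    """
    index = defaultdict(list)
    for doc_id, words in docs.items():
        positions = defaultdict(list)
        for pos, word in enumerate(words, start=1):
            positions[word].append(pos)
        for word, pos_list in positions.items():
            index[word].append((doc_id, len(pos_list), pos_list))
    return dict(index)

# hypothetical document contents
docs = {
    1: ["google", "map", "father", "jobs", "join"],
    2: ["search", "engine", "technology", "join"],
}
index = build_inverted_index(docs)
# the document frequency of a word is the length of its inverted list
print(len(index["join"]))   # 2
print(index["join"][0])     # (1, 1, [5]): doc 1, TF 1, position 5
```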
This is already a fairly complete index structure; the index structures of real search systems are essentially the same.
Word dictionary
The word dictionary maintains information about every word that appears in the document collection and records where each word's inverted list is located in the inverted file. At query time, the dictionary is consulted to obtain the corresponding inverted list, which then serves as the basis for subsequent ranking.
Common data structures: hash plus linked list, and tree-based dictionary structures.
Hash and linked list
The figure below is a schematic of the hash-plus-linked-list dictionary. The main body is a hash table; each table entry holds a pointer to a collision chain, and words with the same hash value form a linked list.
Construction process:
Segment the document into words; for each word, compute its hash value; use the hash value to locate the corresponding collision chain in the hash table; if the word is already in the chain, do nothing, otherwise append it to the chain.
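The construction steps above can be sketched as follows; the bucket count and the use of Python lists as collision chains are illustrative choices, not Sphinx's actual implementation.

```python
class HashChainDictionary:
    """Sketch of a hash-plus-linked-list term dictionary (chaining via lists)."""

    def __init__(self, num_buckets=8):
        # each bucket is a collision chain of (word, word_id) pairs
        self.buckets = [[] for _ in range(num_buckets)]
        self.next_word_id = 0

    def add(self, word):
        """Return the word's ID, inserting it into its collision chain if new."""
        chain = self.buckets[hash(word) % len(self.buckets)]
        for stored_word, word_id in chain:
            if stored_word == word:      # already present: do nothing
                return word_id
        word_id = self.next_word_id      # new word: append to the chain
        self.next_word_id += 1
        chain.append((word, word_id))
        return word_id

d = HashChainDictionary()
for w in ["search", "engine", "search"]:
    d.add(w)
print(d.next_word_id)  # 2 distinct words
```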
Tree structure
This uses a B-tree or B+ tree. Unlike a hash table, it requires dictionary entries to be ordered, numerically or lexicographically, and lookup proceeds level by level: interior nodes record which subtree holds the dictionary items in a given range, and the leaf nodes at the bottom store the address information of each word.
Inverted list
Inverted lists record which documents contain a particular word. An inverted list consists of postings, each holding the document ID, the number of occurrences of the word in that document (TF), and the positions where the word appears. The postings for a word together form its inverted list. The figure below is a schematic of an inverted list:
Indexing
The index structure was introduced earlier. So, how is the index built after having the data? There are three main ways to build an index.
Two-pass in-memory inversion
This method builds the index entirely in memory, so memory must be large enough. The first pass collects global statistics: the number of documents N in the collection, the number of distinct words M, and for each word its document frequency DF, i.e. how many documents it appears in. Summing the DF values of all words tells you how much memory the final index needs. With this information, memory and other resources are allocated according to the statistics, and at the same time the location of each word's inverted list in memory is established.
The second pass builds the inverted lists word by word: for each document containing a word, it records the document ID and the number of times the word appears in that document (TF), gradually filling the memory allocated during the first pass. At the end of the second pass the allocated memory is exactly full; each word points to a "fragment" of the memory area, and the data between its start and end positions is that word's inverted list.
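A toy version of the two passes might look like this in Python. Real systems allocate raw memory fragments rather than Python lists, so the fixed-length lists here only mimic the pre-allocation idea.

```python
from collections import Counter

def two_pass_inversion(docs):
    """Two-pass in-memory inversion sketch.

    Pass 1 collects document frequency (DF) per word so that exact space
    can be pre-allocated; pass 2 fills each word's pre-sized posting slots.
    """
    # Pass 1: global statistics (DF of every word)
    df = Counter()
    for words in docs.values():
        df.update(set(words))
    # "Allocate" exactly DF posting slots per word
    postings = {w: [None] * n for w, n in df.items()}
    fill = {w: 0 for w in df}                 # next free slot per word
    # Pass 2: fill in (doc_id, TF) postings word by word
    for doc_id, words in docs.items():
        for word, tf in Counter(words).items():
            postings[word][fill[word]] = (doc_id, tf)
            fill[word] += 1
    return postings

p = two_pass_inversion({1: ["a", "b", "a"], 2: ["b", "c"]})
print(p["b"])  # [(1, 1), (2, 1)]: "b" occurs once in doc 1 and once in doc 2
```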
Sort-based inversion
In the process of indexing, a fixed size of space is always allocated in memory to store dictionary information and intermediate results of the index. When the allocated space is consumed, the intermediate results are written to the disk and the space occupied by the intermediate results in memory is cleared to be used as the storage area for the next round of intermediate results of the index. Refer to the following figure:
The figure above is a schematic diagram of the intermediate results indexed by sorting. Establishment process:
After a document is read, it is numbered with a unique document ID and its content is parsed; each word is mapped to a word ID; a (word ID, document ID, word frequency) triple is created and appended to the end of the intermediate-result storage area; then the next document is processed in the same way. When the allocated memory quota is full, the intermediate results are sorted (by word ID first, then document ID) and the sorted triples are written to a disk file.
Note: During indexing by sorting, the dictionary is always stored in memory. Since the allocated memory is a fixed size, the dictionary gradually takes up more and more memory. As a result, less and less space is available for storing triples.
After all documents are processed, the intermediate result files must be merged. During merging, the system sets up a data buffer in memory for each intermediate result file to hold part of that file's data. Triples with the same word ID from the different buffers are merged; once all the triples for a word ID have been merged, that word's inverted list is complete and is written to the final index, and at the same time the triples for that word ID are cleared from each buffer. The buffers then read subsequent triples from the intermediate files for the next round of merging. When all intermediate files have been read and merged, the final index file is complete.
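The buffer/sort/spill/merge flow can be sketched as follows. For simplicity the "disk files" are kept as in-memory sorted runs, and the quota is a triple count rather than a byte budget; both are illustrative assumptions.

```python
from collections import Counter
import heapq

def sort_based_inversion(docs, quota=4):
    """Sort-based inversion sketch: buffer (word_id, doc_id, tf) triples,
    spill sorted runs when the quota fills, then merge the runs."""
    word_ids = {}                  # the dictionary stays in memory throughout
    runs, buffer = [], []
    for doc_id, words in docs.items():
        for word, tf in Counter(words).items():
            wid = word_ids.setdefault(word, len(word_ids))
            buffer.append((wid, doc_id, tf))
            if len(buffer) >= quota:          # quota exhausted: sort and spill
                runs.append(sorted(buffer))
                buffer = []
    if buffer:
        runs.append(sorted(buffer))
    # Merge the sorted runs; triples sharing a word ID become one posting list
    postings = {}
    for wid, doc_id, tf in heapq.merge(*runs):
        postings.setdefault(wid, []).append((doc_id, tf))
    return word_ids, postings

word_ids, postings = sort_based_inversion({1: ["a", "b", "a"], 2: ["b", "c", "d", "e"]})
print(postings[word_ids["b"]])  # [(1, 1), (2, 1)]
```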
Merge-based inversion
The merge-based method is similar to the sort-based one, except that each time memory is flushed to disk, all intermediate results, including the dictionary, are written out, so memory can be cleared completely and subsequent indexing can use the full memory quota. The schematic diagram of the merge-based method is shown below:
Differences from the sort-based method:
1. The sort-based method keeps dictionary information and triple data in memory with no direct connection between them; the dictionary only maps words to word IDs. The merge-based method builds a complete in-memory index structure, which is itself a part of the final index.
2. When intermediate results are written to a temporary disk file, the merge-based method writes this in-memory inverted index to the file and then clears memory completely; the sort-based method writes only the sorted triple data to the temporary file and keeps the dictionary in memory as a mapping table throughout.
3. When merging, the sort-based method merges the triples of the same word in turn; the merge-based method's temporary files already contain partial inverted lists for each word, so the partial lists of each word are merged to form its final inverted list.
Dynamic index
In the real world, some documents in the collection a search engine processes may be deleted or have their content modified. If such changes are to be reflected in the search results immediately, a dynamic index can meet this real-time requirement. A dynamic index has three key structures: the inverted index, a temporary index, and a deleted-document list.
Temporary index: Inverted index built in real time in memory. When new documents enter the system, the documents are parsed and appended to this temporary index structure in real time.
Deleted-document list: stores the document IDs of deleted documents as a list. When a document is modified, you can treat the change as deleting the old document and then adding the new version, which indirectly supports content modification.
When a new document arrives, the system adds it to the temporary index immediately. When a document is deleted, its ID is added to the deleted-document list. When a document is changed, the original document is put on the deleted list and the modified content is parsed and added to the temporary index. In this way the real-time requirement is satisfied.
When processing a user query, the search engine reads the query words' inverted lists from both the inverted index and the temporary index, finds the documents matching the query in each, and merges the two result sets. The merged results are then filtered through the deleted-document list, removing any deleted documents that would otherwise appear, and the final results are returned to the user.
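The query path just described can be sketched as follows, with each index reduced to a word-to-document-ID mapping and scoring omitted.

```python
def dynamic_query(word, main_index, temp_index, deleted):
    """Dynamic-index lookup sketch: union of the on-disk and in-memory
    posting lists for `word`, filtered by the deleted-document list."""
    docs = set(main_index.get(word, [])) | set(temp_index.get(word, []))
    return sorted(docs - deleted)

main_index = {"sphinx": [1, 2, 3]}   # old on-disk inverted index
temp_index = {"sphinx": [4]}         # newly added doc 4, in memory only
deleted = {2}                        # doc 2 was deleted (or modified)
print(dynamic_query("sphinx", main_index, temp_index, deleted))  # [1, 3, 4]
```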
Index update policy
A dynamic index can meet real-time search needs, but as more and more documents are added, the temporary index consumes more and more memory. The contents of the temporary index must therefore be merged into the on-disk index to free memory for subsequent documents, which requires a reasonable and effective index update strategy.
Complete Re-build Strategy
Re-index all documents. Once the new index is built, the old index is discarded and released, and the new index takes over responding to user queries. During reconstruction the old index must still be maintained to answer user queries, as shown in the figure below.
Re-merge strategy
When a new document enters the search system, the search system maintains a temporary inverted index in memory to record its information. When the new document reaches a certain number, or the specified size of memory is consumed, the temporary index and the inverted index of the old document are combined to generate a new index. The process is as follows:
Update steps:
1. When a new document enters the system, it is parsed and the temporary index maintained in memory is updated: each word in the document is appended to the end of its inverted list. This temporary index can be called the incremental index.
2. Once the incremental index consumes the specified memory, its contents are merged with the old inverted index.
Why it is efficient: because indexed words are stored in lexicographic order, the old inverted index can be read sequentially during merging, which reduces disk seek time.
Disadvantage: since a completely new inverted index file is generated, inverted lists in the old index must be read out and written to the new index even when they have not changed at all, which increases I/O consumption.
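A stripped-down re-merge might look like this; sorting the union of words stands in for the sequential, lexicographically ordered walk of the on-disk index, and posting lists are plain document-ID lists.

```python
def re_merge(old_index, incremental):
    """Re-merge sketch: write a brand-new index by walking the old index
    and the in-memory incremental index in sorted word order."""
    new_index = {}
    for word in sorted(set(old_index) | set(incremental)):
        # old postings are copied unchanged; incremental postings are appended
        new_index[word] = old_index.get(word, []) + incremental.get(word, [])
    return new_index

old = {"engine": [1, 3], "search": [1, 2]}
inc = {"search": [5], "sphinx": [5]}   # incremental index from new doc 5
merged = re_merge(old, inc)
print(merged["search"])  # [1, 2, 5]
```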
In-place update strategy
The starting point of the in-place update strategy is to address this shortcoming of the re-merge strategy.
When indexes are merged, no new index file is generated. Instead, the incremental postings are appended in place to the end of the corresponding inverted lists in the old index file. This achieves the goal stated above: only the information for words that actually appear in the incremental index is updated, while everything else is left unchanged.
To support append operations, the in-place update strategy reserves disk space at the end of each word's inverted list in the initial index, so that during index merging the incremental postings can be appended into the reserved space. See the diagram below:
Experimental data show that the index update efficiency of the in-place update strategy is actually lower than that of the re-merge strategy, for two reasons: 1. To migrate lists quickly when their reserved space runs out, the strategy must track and manage free disk space, which is very costly. 2. During data migration, some words and their inverted lists are moved out of their original position in the old index, destroying the contiguity of the data, so a mapping table from each word to the location of its inverted list must be maintained. This slows disk reads and consumes a lot of memory for the mapping information.
Hybrid strategy
Classify words by their properties and apply a different update strategy to each class. A common practice is to distinguish words by the length of their inverted lists: some words appear in many documents and so have long inverted lists, while rare words have short ones. Words are accordingly divided into long-inverted-list words, which use the in-place update strategy, and short-inverted-list words, which use the re-merge strategy.
Because the read/write cost of long inverted lists is much greater than that of short ones, the in-place strategy saves disk I/O for them. The cost of reading and writing the many short inverted lists is comparatively small, so the re-merge strategy can fully exploit sequential reads and writes for those words.
Query processing
After the index is established, how to respond to the user’s query with an inverted index? There are three main query processing mechanisms.
(1) Document at a Time
Taking documents as the unit, the final similarity score between one document and the query is computed, then the next document's, and so on until every document is scored. The documents are then sorted by score and the K highest-scoring documents are output as the search result, completing the response to the query. In a practical implementation, you only need to maintain a priority queue of size K in memory. The following figure illustrates the document-at-a-time mechanism:
The dotted arrows indicate the direction of the query-processing computation. For document 1, both query words' inverted lists contain the document, so its similarity to each query word can be computed from parameters such as TF and IDF, and the two scores are added to give document 1's similarity score for the query, Score1. Other documents are handled similarly. Finally, documents are sorted by score and the K highest-scoring documents are output as the search result.
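Here is a sketch of document-at-a-time scoring with a size-K heap; the per-word partial scores are made-up integers standing in for TF/IDF-based values.

```python
import heapq

def doc_at_a_time(query_postings, k=2):
    """Document-at-a-time sketch: each posting list maps doc_id -> partial
    score; walk documents one by one, total their scores, keep top-k."""
    all_docs = set()
    for plist in query_postings:
        all_docs.update(plist)
    heap = []
    for doc_id in all_docs:                      # one document at a time
        score = sum(plist.get(doc_id, 0) for plist in query_postings)
        heapq.heappush(heap, (score, doc_id))
        if len(heap) > k:                        # keep only the k best
            heapq.heappop(heap)
    return sorted(heap, reverse=True)            # [(score, doc_id), ...]

# hypothetical partial scores for two query words
postings = [{1: 5, 2: 3}, {1: 2, 3: 9}]
print(doc_at_a_time(postings, k=2))  # [(9, 3), (7, 1)]
```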
(2) Term at a Time
Unlike document-at-a-time, term-at-a-time works "horizontally first, then vertically": it first takes the inverted list of one query word and computes a partial similarity score for every document in that list, i.e. it moves horizontally along one row of the word-document matrix. After finishing one word's inverted list, it processes the documents in the next word's list, moving vertically to the next row; if a document already has a score, the new partial score is added to it. Once all query words have been processed, each document's final similarity score is known; the documents are sorted by score and the K highest-scoring documents are output as the search results. The following diagram shows the term-at-a-time mechanism.
The dashed arrows indicate the direction of the calculation. To preserve intermediate data, a hash table in memory stores the intermediate and final results. During the query, document 1's similarity score for the word "search engine" is computed from parameters such as TF and IDF and saved in the hash table, keyed by document ID. After the other documents in that list are scored in turn, the next word (here, "technology") is processed. For document 1 again, after its partial score is computed, the hash table lookup finds an existing score; the stored score and the new one are added, and the hash table entry for document 1 is updated, yielding document 1's final similarity score for the query. Other documents are handled similarly. Finally the results are sorted and the K highest-scoring documents are output as the search results.
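The same kind of query under term-at-a-time, with a hash table (a Python dict) as the score accumulator; again the partial scores are made-up numbers.

```python
def term_at_a_time(query_postings, k=2):
    """Term-at-a-time sketch: process one word's whole posting list before
    the next, accumulating per-document scores in a hash table."""
    accumulator = {}                              # doc_id -> running score
    for plist in query_postings:                  # one word at a time
        for doc_id, partial in plist.items():
            accumulator[doc_id] = accumulator.get(doc_id, 0) + partial
    ranked = sorted(accumulator.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]                             # [(doc_id, score), ...]

postings = [{1: 5, 2: 3}, {1: 2, 3: 9}]
print(term_at_a_time(postings, k=2))  # [(3, 9), (1, 7)]
```

Note that both mechanisms produce the same top-K set; they differ in traversal order and in the memory they need (a size-K heap versus a full accumulator table).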
Skip Pointers
Basic idea: partition the inverted list into several fixed-size blocks, treat each block as a group, and add some management (meta) information for each block. Even when the inverted list is compressed, this brings two benefits during inverted-list merging:
1. No need to decompress all postings; only part of the data has to be decompressed.
2. No need to compare any two document IDs.
The following figure shows the inverted list of the query word "Google" after skip pointers have been added.
Suppose that for the inverted list of the word "Google" the block size is 3. Management information is then added before each block; for example, the management information of the first block is <5, Pos1>, where 5 is the first (smallest) document ID in the block and Pos1 is the skip pointer, pointing to the start of the second block. Now suppose we want to find the document with ID 7 in the compressed inverted list of "Google". First, the first two values of the list are decompressed, giving the first block's skip-pointer data <5, Pos1>; since Pos1 marks where the second block's skip pointer begins, the two values there can be decompressed, giving <13, Pos2>. 5 and 13 are the smallest document IDs of the two blocks (i.e. the first document ID in each block). We are looking for 7, so if document 7 is in the list at all, it must be in the first block. The first block is decompressed and the original document IDs are restored from the block's minimum ID: the entry <2, 1> decodes to document 5 + 2 = 7, which matches the ID we are looking for, so document 7 is in the inverted list of "Google" and the search can stop.
As this search shows, only one data block needs to be decompressed to locate a document ID, rather than the whole list, which clearly speeds up the search and saves memory.
Disadvantages: Increases the number of pointer comparison operations.
Practice shows that if the inverted list has length L (i.e. contains L document IDs), using the square root of L as the block size works well.
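The lookup walk above can be sketched as follows. Each block is modeled as its smallest document ID plus d-gaps for the remaining IDs, which stands in for the compressed representation; only the matching block is "decompressed".

```python
def find_doc(blocks, target):
    """Skip-pointer lookup sketch.

    Each block is (first_doc_id, gaps): the block's smallest doc ID,
    then d-gaps for the remaining IDs in the block.
    """
    for i, (first_id, gaps) in enumerate(blocks):
        next_first = blocks[i + 1][0] if i + 1 < len(blocks) else None
        if target < first_id:
            return False           # would have been in an earlier block
        if next_first is None or target < next_first:
            # decompress just this block: restore IDs from the d-gaps
            doc_id = first_id
            if doc_id == target:
                return True
            for gap in gaps:
                doc_id += gap
                if doc_id == target:
                    return True
            return False
    return False

# the "Google" posting list split into blocks of 3, as in the example:
blocks = [(5, [2, 1]), (13, [1, 3])]   # docs 5, 7, 8 and 13, 14, 17
print(find_doc(blocks, 7))   # True: only the first block is decoded
print(find_doc(blocks, 9))   # False
```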
Multi-field index
This means indexing multiple fields of a document. A multi-field index can be implemented in three ways: multiple indexes, the inverted-list method, and the extended-list method.
Multiple indexing mode
Build a separate index for each field. When the user restricts the search to one field, results can be pulled from the corresponding index. When no field is specified, the engine must search every field's index and combine the relevance scores of the fields, which is inefficient. The schematic diagram of the multiple-index method is as follows:
Inverted list mode
The field information is stored directly in the keyword's inverted list: each posting carries the field information at its end. While reading the query word's inverted list, the field information shows whether the keyword appears in a given field, allowing filtering. The schematic diagram of the inverted-list method is as follows:
Extended list mode
This is one of the more commonly used ways of supporting multi-field indexing: build a separate list for each field, recording the field's position in each document. Below is a schematic diagram of the extended list:
For simplicity, the extended list is built only for the "Title" field. For example, the first item <1,(1,4)> means that in document 1 the title spans the range from the first word to the fourth word; the other items are read similarly.
For a query, suppose the user searches for "search engine" in the title field, and the inverted list shows that documents 1, 3, and 4 contain the query term. The next step is to check whether the term actually occurs in the title field of those documents. In document 1, "search engine" appears at positions 6 and 10, but the title extended list says document 1's title spans positions 1 to 4, so the title does not contain the query word and document 1 does not qualify. In document 3, "search engine" appears at positions 2, 8, and 15, and the title spans positions 1 to 3; the occurrence at position 2 falls inside the title, so document 3 qualifies and can be output as a search result. Document 4 is handled similarly.
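The title check above can be sketched as follows, using the positions and ranges from the example:

```python
def in_title(title_ranges, doc_id, word_positions):
    """Extended-list sketch: `title_ranges` maps doc_id -> (start, end) word
    positions of the title field; report whether any occurrence of the
    query word falls inside the title range."""
    start, end = title_ranges[doc_id]
    return any(start <= pos <= end for pos in word_positions)

# data mirroring the example: doc 1's title spans words 1..4, doc 3's 1..3
title_ranges = {1: (1, 4), 3: (1, 3)}
print(in_title(title_ranges, 1, [6, 10]))     # False: outside doc 1's title
print(in_title(title_ranges, 3, [2, 8, 15]))  # True: position 2 is in 1..3
```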
Phrase queries
The essence of a phrase query is maintaining the sequential relationships, i.e. the position information, between words in the index. Common techniques for supporting phrase queries include the position-information index, the two-word index, and the phrase index; the three can also be combined.
Position Index
Phrase queries can be supported simply by recording word position information in the index, but the storage and computation costs are high. The schematic diagram is as follows:
<5,2,[3,7]> means that document 5 contains the word "love", which appears twice in the document, at positions 3 and 7; the other entries have the same structure.
At query time, the inverted lists show that documents 5 and 9 both contain the two query words. To determine whether the words occur as a phrase in these documents, the position information must also be checked. In document 5, "love" appears at positions 3 and 7 and "trade" appears at position 4; since positions 3 and 4 are adjacent, the two words form a phrase in document 5. The same analysis of document 9 shows that they do not form a phrase there, so only document 5 is returned as a search result.
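The adjacency check at the heart of a positional phrase query can be sketched like this; the position lists mirror the example above.

```python
def is_phrase_match(positions_a, positions_b):
    """Positional phrase check sketch: words A and B form the phrase "A B"
    in a document iff some occurrence of B is exactly one position after
    an occurrence of A."""
    next_positions = set(positions_b)
    return any(p + 1 in next_positions for p in positions_a)

# in document 5 "love" occurs at 3 and 7 and the second word at 4
print(is_phrase_match([3, 7], [4]))   # True: positions 3 and 4 are adjacent
print(is_phrase_match([3, 7], [9]))   # False: no adjacent pair
```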
Two-word Index (Nextword Index)
Statistics show that two-word phrases account for the largest share of phrases, so providing fast lookup for two-word phrases solves most of the phrase-query problem. Doing this naively, however, makes the number of inverted lists explode. The data structure of the two-word index is as follows:
As the figure shows, memory holds two dictionaries: a "first word" dictionary and a "next word" dictionary. Each entry in the first-word dictionary points to a section of the next-word dictionary, which stores the common second words that follow that first word; each next-word entry in turn points to the inverted list of the resulting two-word phrase. For example, the phrase beginning with "my" has an inverted list containing documents 5 and 7, while the phrase "the father of" has an inverted list containing document 5; the remaining dictionary entries have similar meanings.
For a query such as "my father", the search engine segments it into two-word phrases and looks each up in the dictionaries. It finds that one phrase's inverted list contains documents 5 and 7 while the other's contains document 5; checking the corresponding occurrence positions shows that document 5 is a qualifying result, thus completing support for the phrase query.
A two-word index makes the index grow sharply, so in general it is built not for all words but only for phrases that are otherwise expensive to compute.
Phrase Index
Simply add frequent phrases to the dictionary and maintain an inverted list for each phrase. The drawback is that it is impossible to index all phrases in advance, so the common practice is to mine popular phrases. The figure below shows the overall index structure after a phrase index is added:
For a query, the search engine first looks it up in the phrase index; if found, it computes and returns the search results. Otherwise it falls back to the regular index for query processing.
Hybrid method
After receiving a user query, the system first searches the phrase index and returns the result if it is found; otherwise it searches the two-word index and returns the result if found; otherwise it processes the phrase using the regular index. This plays to the strengths of each method. The mixed index structure combining the three is shown in the figure below:
The phrase index stores popular, high-frequency phrases, while the two-word index stores phrases that are expensive to query, including those containing stop words.
Parallel Indexing
When search engines need to process too many collections of documents, a distributed solution needs to be considered. Each machine maintains a portion of the overall index, and multiple machines collaborate to build the index and respond to queries.
Document Partitioning
The entire document collection is divided into subsets, and each machine is responsible for indexing one subset and answering queries over it. The document-partitioned scheme is shown below:
How it works: when the query distributor receives a user query, it broadcasts the query to all index servers. Each index server maintains the index for some subset of documents; on receiving the query, it computes the relevant documents and sends its K highest-scoring documents back to the distributor. The distributor merges the results from all index servers and returns the M highest-scoring documents to the user as the final search result.
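A toy version of the document-partitioned query flow; each "server" here is just a local dict of pre-scored (score, doc_id) postings, and the network hop is omitted.

```python
import heapq

def distribute_query(index_servers, word, k=2):
    """Document-partitioning sketch: broadcast the query to every index
    server, collect each server's local top-k hits, then merge them at
    the distributor."""
    partial_results = []
    for server_index in index_servers:            # broadcast to all servers
        hits = server_index.get(word, [])
        partial_results.extend(heapq.nlargest(k, hits))  # local top-k
    return heapq.nlargest(k, partial_results)     # final merge

servers = [
    {"sphinx": [(0.9, 1), (0.4, 2)]},   # server holding documents 1..2
    {"sphinx": [(0.7, 3)]},             # server holding documents 3..4
]
print(distribute_query(servers, "sphinx", k=2))  # [(0.9, 1), (0.7, 3)]
```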
Term Partitioning
Each index server is responsible for building and maintaining the inverted lists of some of the words in the dictionary. The term-partitioned scheme is shown below:
How it works: one word at a time. Suppose the query contains the three words A, B, and C. When the query distributor receives it, it forwards the query to index server node 1, which holds A's inverted list. Node 1 extracts A's inverted list and accumulates partial scores, then passes the query and the intermediate results on to the node holding B's inverted list. Index server node 2 processes it similarly and continues to index server node 3. The final result is returned to the query distributor, which selects the K highest-scoring documents as the search result.
Comparing the two schemes
Partitioning by document is more common; partitioning by word is used only in special applications. Scalability: the document collection a search engine works with changes constantly. With document partitioning, you only need to add an index server, which is easy. With word partitioning, a new document may contain words from almost anywhere in the dictionary, so the inverted lists on nearly every index server must be updated, which is much more complex to implement.
Load balancing: the inverted lists of common words are very large, possibly tens of megabytes. With document partitioning, a word's inverted list is spread fairly evenly across the index servers, whereas with word partitioning, a common word's entire inverted list is maintained by a single server. If that word is also a hot query word, the server can become an overloaded performance bottleneck.
Fault tolerance: suppose a server fails. With document partitioning, only part of the document collection is affected and the other index servers can still respond. With word partitioning, the inverted lists of the failed server's words become inaccessible, and users querying those words get no results at all, which directly hurts the user experience.
Query processing support: with word partitioning, only term-at-a-time processing is possible, whereas document partitioning supports document-at-a-time as well.
Conclusion
Understanding the data structures and algorithms used by search engines gives us a deeper understanding of how they work. For Sphinx, an online environment can combine incremental indexing with periodic full re-indexing to achieve near-real-time search.
Because my fundamentals are weak, it took me more than half a month of repeated reading to understand Chapter 3, and I truly came to appreciate the importance of data structures and algorithms. Although they are seldom used directly in daily work, knowing the common data structures and algorithms gives me more ideas for solving problems and a foundation to build on.
That is the end of this article. If you have questions or suggestions, feel free to discuss. This is an original article and my writing skill is limited; if anything in it is wrong, please let me know.
If this article helped you, please share it. Thanks ^_^