Search engine index



1. Word — Document Matrix

The word-document matrix is a conceptual model that expresses the inclusion relationship between the two. Figure 3-1 shows its meaning. Each column in Figure 3-1 represents a document, each line represents a word, and the checkbox position represents the inclusion relationship.

                          

Figure 3-1 Word-document matrix

From the vertical dimension of the document, each column represents which words are contained in the document, for example, document 1 contains terms 1 and 4, but no other words. In terms of the horizontal (word) dimension, each line represents which documents contain a particular word. For example, for term 1, the word 1 appears in documents 1 and 4, and the other documents do not contain term 1. Other columns and columns in the matrix can also be read in this way.

A search engine index is a concrete data structure that implements a “word-document matrix”. The conceptual model can be implemented in different ways, such as “inverted index”, “signature file”, “suffix tree”, etc. However, various experimental data show that “inverted index” is the best way to realize the word-to-document mapping relationship, so this chapter mainly introduces the technical details of “inverted index”.

 

2. Basic concept of inverted index

Document: The general search engine processing object is the Internet web page, while the concept of Document is broader, representing the storage object in the form of text, compared with the web page, covering more forms, such as Word, PDF, HTML, XML and other files in different formats can be called documents. An email, a text message, a tweet can also be called a document. Throughout the rest of this book, documents are often used to represent textual information.

Document Collection: A Collection composed of several documents is called a Document Collection. For example, the vast number of Web pages on the Internet or the large number of e-mails are concrete examples of document collections.

The reference Document (the Document ID) : in the search engine inside, each Document in the Document collection will be given a unique internal number, this number as a unique identifier for this Document, so convenient internal processing, the interior of the each Document number is called “Document number”, later DocID is sometimes used to easily represent the Document number.

Word ID: Similar to document ID, a search engine internally uses a unique number to represent a Word, which can be used as a unique representation of a Word.

Inverted Index: Inverted Index is a concrete form of storage that implements a “word-document matrix”, by which a list of documents containing a word can be quickly obtained by Inverted Index. The inverted index consists of two main parts: “word dictionary” and “inverted file”.

Lexicon: The usual unit of index for a search engine is a word. A Lexicon is a collection of strings made up of all the words that have ever appeared in a collection of documents. Each index entry in a Lexicon contains information about the word itself and a pointer to an inverted list.

PostingList: An inverted list is a list of all documents in which a word appears and information about where the word appears in the document. Each record is called a Posting. The inverted list tells you which documents contain a word.

Inverted files: An Inverted list of all Inverted words is often stored sequentially in a File on disk, known as an Inverted File. Inverted files are physical files that store Inverted indexes.

The relationship between these concepts can be clearly seen in Figure 3-2.

                                        

 

Figure 3-2 Basic concept of inverted index

 

3. Simple example of inverted index

The inverted index is very simple in terms of logical structure and basic idea. Below we carry on the illustration through the concrete example, causes the reader to be able to have a macroscopic and the direct feeling to the inverted index.

Assume that the document set contains five documents, and the content of each document is shown in Figure 3-3. In the figure, the left-most column is the corresponding document number of each document. Our task is to create an inverted index of this collection of documents.

                         

Figure 3-3 Document collection

 

Unlike languages such as Chinese and English, there is no clear separator between words, so the first step is to use a word segmentation system to automatically cut the document into word sequences. In this way, each document is converted into a data stream composed of word sequences. In order to facilitate the subsequent processing of the system, it is necessary to assign a unique word number to each different word and record which documents contain this word. After such processing, the simplest inverted index can be obtained (see Figure 3-4). In Figure 3-4, the column of “Word ID” records the word number of each word, the second column is the corresponding word, and the third column is the inverted list of each word. For example, the word “Google” has a word number of 1 and an inverted list of {1,2,3,4,5}, indicating that every document in the document collection contains the word.

       

Figure 3-4 Simple inverted index

The inverted index shown in Figure 3-4 is the simplest because it only records which documents contain a particular word, but in fact it can record much more than that. Figure 3-5 is a relatively complicated inverted index, the basic index system than to figure 3-4, in the word list of inversion not only record the document number, also documented information word frequency (TF), namely the occurrences of the word in a document, is to record the information, because the word frequency information in the search results when ordering, The similarity of queries and documents is an important factor to calculate, so record it in the inverted list for subsequent sorting purposes. In figure 3-5 example, the words “founder” number is 7, the corresponding inversion list for: (3:1), three of them on behalf of the document contains the word document number 3, number 1 on behalf of the word frequency information, namely the word in the no. 3 document have arisen only once, in other words the corresponding inverted list represents the same meanings.

                

Figure 3-5 Inverted index with word frequency information

Practical inverted index can also account for more information, as shown in figure 3-6 indexes system in addition to record the document number and word frequency information, additional documents the two types of information, namely corresponding to each word “information” document frequency (FIG. 3-6 in the third column) as well as in inverted list to record the location of words in a document.

                        Figure 3-6 Inverted index with word frequency, document frequency, and occurrence location information

 

“Document frequency information” represents how many documents in the document set contain a certain word. This information is recorded for the same reason as word frequency information, which is a very important factor in the calculation of search results ranking. And the location of the word in a document information is not a index system must record, can be included in a practical index system, also can choose not to include this information, so, because this information is not necessary for search system, location information only when support “phrase query” can come in handy.

Take the word “Ras” for example, its word number is 8 and document frequency is 2, indicating that there are two documents in the whole document set that contain this word, and the corresponding inverted list is: {(3; 1; < 4 >), (5; 1; <4>)}, which means that the word appears in document 3 and document 5 with a frequency of 1, and the occurrence position of the word “ras” is 4 in both documents, i.e. the fourth word in the document is “Ras”.

As shown in Figure 3-6, inverted index is already a very complete index system. The index structure of the actual search system is basically like this. The difference is nothing more than which specific data structure is adopted to realize the above logical structure.

With this indexing system, search engines can easily respond to users’ queries. For example, when users input the query word “Facebook”, the search system finds the inverted index, from which it can read the documents containing the word. These documents are the search results provided to users. Using word frequency information and document frequency information can sort these candidate search results, calculate the similarity of documents and queries, and sort the output according to the similarity score from high to low. This is part of the internal process of the search system, and the specific implementation scheme will be described in detail in chapter 5 of the book.



4. Word dictionary


The word dictionary is a very important part of the inverted index, which is used to maintain the relevant information of all the words in the document set and record the position information of the inverted list corresponding to a word in the inverted file. In support of search, according to the user’s query words, go to the word dictionary to search, you can get the corresponding inverted list, and use this as the basis for the subsequent sorting.


For a large collection of documents that may contain hundreds of thousands or even millions of different words, can quickly locate a particular word, it directly affects the response speed of the search, so you need to efficient data structure to build and word dictionary lookup, the commonly used data structure including the hash chain table structure and tree structure of dictionary.



4.1 Hash plus linked list


Figure 1-7 is a schematic of this dictionary structure. This dictionary structure consists of two main parts:

The main body is a hash table, and each hash table entry holds a pointer to a conflicting list, in which words with the same hash value form a linked list structure. The reason for a conflicting list is that two different words get the same hash value. If this is the case, it is called a conflict in the hashing method, and words with the same hash value can be stored in the list for later lookup.

                      

Figure 1-7 Hash and linked list dictionary structure

During the indexing process, dictionary structures are constructed accordingly. For example, when parsing a new document, for a word T that appears in the document, the hash function is first used to obtain its hash value, and then the pointer stored in it is read according to the hash table entry corresponding to the hash value, and the corresponding conflict linked list is found. If the word already exists in the conflict list, it appears in the document that was parsed before. If the word is not found in the conflict list, it is the first time the word has been encountered and added to the conflict list. In this way, when all the documents in the document collection have been parsed, the corresponding dictionary structure is established.

When responding to a user query request, the process is similar to building a dictionary, except that a word is not added to the dictionary even if it does not appear in the dictionary. Taking Figure 1-7 as an example, it is assumed that the query request input by the user is word 3, and the word is hashed to slot 2 in the hash table. The conflict linked list can be obtained from its reserved pointer. Word 3 is compared with words in the conflict linked list in turn, and word 3 is found to be in the conflict linked list. Then you can read the inverted list corresponding to this word for subsequent work. If this word is not found, it means that no document in the document set contains a word, and the search result is empty.

4.2 Tree Structure

B tree (or B+ tree) is another efficient search structure. Figure 1-8 is a schematic diagram of a B tree structure. Unlike hash lookup, which requires dictionary items to be sorted by size (numeric or character order), b-tree lookup does not require data to be sorted by size.

B tree forms a hierarchical search structure. The middle node is used to point out which sub-tree dictionary items in a certain order range are stored, which plays a role of navigation according to the comparison of the size of dictionary items. The bottom leaf node stores the address information of words, and word strings can be extracted according to this address.

                 

Figure 1-8 B-tree search structure





References:

Chapter 3 in This is The Search Engine: The core technology