What is an inverted index?

We know by its name, we have inverted index, and we have positive index.

May I have an inverted index? May I have an inverted index? May I have an inverted index?

Each file in the search engine corresponds to a file ID, and the contents of the file are represented as a set of keywords (in fact, keywords are also translated into keyword ids in the search engine index library). For example, “Document 1” extracts 20 keywords through word segmentation, and each keyword will record its occurrence times and locations in the document.

The structure of the forward index is as follows:

ID of “document 1” > word 1: occurrence times, occurrence location list; Word 2: Number of occurrences, list of occurrences; ………………… .

ID of Document 2 > list of keywords that appear in this document.

 

Usually you go to value by key.

When a user searches for the keyword “Huawei mobile phone” on the home page, assuming that only a forward index exists, the user needs to scan all documents in the index library to find out all documents containing the keyword “Huawei Mobile phone”, score them according to the scoring model, and then rank them and present them to the user. Because of the astronomical number of documents included in search engines on the Internet, such an index structure is simply not sufficient to return ranking results in real time.

So, the search engine reconstructs the forward index into an inverted index, that is, a mapping of file IDS to keywords to a mapping of keyword to file IDS, each of which corresponds to a series of files in which the keyword appears.

The structure of the inverted index is as follows:

Keywords 1: ID of Document 1, ID of Document 2, ………… .

Keyword 2: list of document ids with this keyword.

 

From the keyword of the word, go to the document.

1. Word — Document Matrix

The word-document matrix is a conceptual model that expresses the inclusion relationship between the two, and Figure 1 shows its meaning. Each column in Figure 3-1 represents a document, each line represents a word, and the checkbox position represents the inclusion relationship.

            

Figure 1 Word-document matrix

From the vertical dimension of the document, each column represents which words are contained in the document, for example, document 1 contains terms 1 and 4, but no other words. In terms of the horizontal (word) dimension, each line represents which documents contain a particular word. For example, for term 1, the word 1 appears in documents 1 and 4, and the other documents do not contain term 1. Other columns and columns in the matrix can also be read in this way.

A search engine index is a concrete data structure that implements a “word-document matrix”. The conceptual model can be implemented in different ways, such as “inverted index”, “signature file”, “suffix tree”, etc. However, various experimental data show that “inverted index” is the best way to realize the word-to-document mapping relationship, so this blog mainly introduces the technical details of “inverted index”.

2. Basic concept of inverted index

Document: The general search engine processing object is the Internet web page, while the concept of Document is broader, representing the storage object in the form of text, compared with the web page, covering more forms, such as Word, PDF, HTML, XML and other files in different formats can be called documents. An email, a text message, a tweet can also be called a document. Throughout the rest of this book, documents are often used to represent textual information.

Document Collection: A Collection composed of several documents is called a Document Collection. For example, the vast number of Web pages on the Internet or the large number of e-mails are concrete examples of document collections.

The reference Document (the Document ID) : in the search engine inside, each Document in the Document collection will be given a unique internal number, this number as a unique identifier for this Document, so convenient internal processing, the interior of the each Document number is called “Document number”, later DocID is sometimes used to easily represent the Document number.

Word ID: Similar to document ID, a search engine internally uses a unique number to represent a Word, which can be used as a unique representation of a Word.

Inverted Index: Inverted Index is a concrete form of storage that implements a “word-document matrix”, by which a list of documents containing a word can be quickly obtained by Inverted Index. The inverted index consists of two main parts: “word dictionary” and “inverted file”.

Lexicon: The usual unit of index for a search engine is a word. A Lexicon is a collection of strings made up of all the words that have ever appeared in a collection of documents. Each index entry in a Lexicon contains information about the word itself and a pointer to an inverted list.

PostingList: An inverted list is a list of all documents in which a word appears and information about where the word appears in the document. Each record is called a Posting. The inverted list tells you which documents contain a word.

Inverted files: An Inverted list of all Inverted words is often stored sequentially in a File on disk, known as an Inverted File. Inverted files are physical files that store Inverted indexes.

The relationship between these concepts can be clearly seen in Figure 2.

              

3. Simple example of inverted index

The inverted index is very simple in terms of logical structure and basic idea. Below we carry on the illustration through the concrete example, causes the reader to be able to have a macroscopic and the direct feeling to the inverted index.

Suppose that the document set contains five documents, the contents of each document are shown in Figure 3, and the left-most column in the figure is the corresponding document number for each document. Our task is to create an inverted index of this collection of documents.

              

Figure 3 document collection

Unlike languages such as Chinese and English, there is no clear separator between words, so the first step is to use a word segmentation system to automatically cut the document into word sequences. In this way, each document is converted into a data stream composed of word sequences. In order to facilitate the subsequent processing of the system, it is necessary to assign a unique word number to each different word and record which documents contain this word. After such processing, the simplest inverted index can be obtained (see Figure 3-4). In Figure 4, the “Word ID” column records the word number of each word, the second column is the corresponding word, and the third column is the inverted list of each word. For example, the word “Google” has a word number of 1 and an inverted list of {1,2,3,4,5}, indicating that every document in the document collection contains the word.

              

Figure 4. Simple inverted index

The inverted index shown in Figure 4 is the simplest because the indexing system only records which documents contain a word, and in fact it can record much more than that. Figure 5 is a relatively complicated inverted index, and the basic index system of figure 4, the word of the inverted list not only record the document number, also documented information word frequency (TF), namely the occurrences of the word in a document, is to record the information, because of the word frequency information in the search results when ordering, The similarity of queries and documents is an important factor to calculate, so record it in the inverted list for subsequent sorting purposes. In figure 5 the example, the word “founder” word number is 7, the corresponding inversion list for: (3:1), three of them on behalf of the document number 3 document contains the words, the number 1 on behalf of the word frequency information, namely the word in the no. 3 document have arisen only once, with other words corresponding inversion lists represent the same meanings.

              

Figure 5. Inverted index with word frequency information

Practical inverted index can also account for more information, as shown in figure 6 indexes system in addition to record the document number and word frequency information, additional documents the two types of information, namely each word corresponding “document frequency information” in the third column (figure 6), and the inversion list to record the location of words in a document.

                  

Figure 6. Inverted index with word frequency, document frequency, and occurrence location information

“Document frequency information” represents how many documents in the document set contain a certain word. This information is recorded for the same reason as word frequency information, which is a very important factor in the calculation of search results ranking. And the location of the word in a document information is not a index system must record, can be included in a practical index system, also can choose not to include this information, so, because this information is not necessary for search system, location information only when support “phrase query” can come in handy.

Take the word “Ras” for example, its word number is 8 and document frequency is 2, indicating that there are two documents in the whole document set that contain this word, and the corresponding inverted list is: {(3; 1; < 4 >), (5; 1; <4>)}, which means that the word appears in document 3 and document 5 with a frequency of 1, and the occurrence position of the word “ras” is 4 in both documents, i.e. the fourth word in the document is “Ras”.

As shown in Figure 6, inverted index is already a very complete index system. The index structure of the actual search system is basically like this. The difference is nothing more than which specific data structure is adopted to realize the above logical structure.

With this indexing system, search engines can easily respond to users’ queries. For example, when users input the query word “Facebook”, the search system finds the inverted index, from which it can read the documents containing the word. These documents are the search results provided to users. Using word frequency information and document frequency information can sort these candidate search results, calculate the similarity of documents and queries, and sort the output according to the similarity score from high to low. This is part of the internal process of the search system, and the specific implementation scheme will be described in detail in chapter 5 of the book.

4. Word dictionary

The word dictionary is a very important part of the inverted index, which is used to maintain the relevant information of all the words in the document set and record the position information of the inverted list corresponding to a word in the inverted file. In support of search, according to the user’s query words, go to the word dictionary to search, you can get the corresponding inverted list, and use this as the basis for the subsequent sorting. For a large collection of documents that may contain hundreds of thousands or even millions of different words, can quickly locate a particular word, it directly affects the response speed of the search, so you need to efficient data structure to build and word dictionary lookup, the commonly used data structure including the hash chain table structure and tree structure of dictionary. 4.1 Hash plus Linked List Figure 7 is a schematic of this dictionary structure. This dictionary structure consists of two main parts:

The main body is a hash table, and each hash table entry holds a pointer to a conflicting list, in which words with the same hash value form a linked list structure. The reason for a conflicting list is that two different words get the same hash value. If this is the case, it is called a conflict in the hashing method, and words with the same hash value can be stored in the list for later lookup.

                      

During the indexing process, dictionary structures are constructed accordingly. For example, when parsing a new document, for a word T that appears in the document, the hash function is first used to obtain its hash value, and then the pointer stored in it is read according to the hash table entry corresponding to the hash value, and the corresponding conflict linked list is found. If the word already exists in the conflict list, it appears in the document that was parsed before. If the word is not found in the conflict list, it is the first time the word has been encountered and added to the conflict list. In this way, when all the documents in the document collection have been parsed, the corresponding dictionary structure is established.

When responding to a user query request, the process is similar to building a dictionary, except that a word is not added to the dictionary even if it does not appear in the dictionary. Take Figure 7 as an example, assuming that the query request input by the user is word 3, hash the word and locate it to slot 2 in the hash table. The conflict list can be obtained from its reserved pointer. Compare word 3 with words in the conflict list in turn and find word 3 is in the conflict list, so find this word. Then you can read the inverted list corresponding to this word for subsequent work. If this word is not found, it means that no document in the document set contains a word, and the search result is empty.

B tree (or B+ tree) is another efficient search structure. Figure 8 is a schematic diagram of a B tree structure. Unlike hash lookup, which requires dictionary items to be sorted by size (numeric or character order), b-tree lookup does not require data to be sorted by size. B tree forms a hierarchical search structure. The middle node is used to point out which sub-tree dictionary items in a certain order range are stored, which plays a role of navigation according to the comparison of the size of dictionary items. The bottom leaf node stores the address information of words, and word strings can be extracted according to this address.

                

Figure 8 B tree lookup structure

conclusion

Word ID: Record the word number of each word; 2. A corresponding word; Document frequency: represents how many documents in the document set contain an inverted list of a word: contains the word ID and other necessary information DocId: documents in which the word appears idTF: The number of times the word appears in a document POS: Take the word “join” as an example, its word number is 6 and document frequency is 3, indicating that there are three documents in the whole document set that contain this word, and the corresponding inverted list is {(2; 1; < 4 >), (3; 1; < 7 >), (5; 1; <5>)}, meaning that the word appears in document 2, 3, and 5, once in each document, the word “join” in the first document POS is 4, i.e. the fourth word in the document is “join”, the other similar. The inverted index is a very complete index system, and the index structure of the actual search system is basically like this.