May I know what happened to Inverted Indexes
In relational database systems, indexes are the most efficient way to retrieve data. But for search engines, it can not meet its special requirements, such as massive data such as Baidu or Google to search tens of billions of web pages, if the use of similar relational database using B+ tree index, it can be imagined that its CPU computing capacity requirements are much higher. Secondly, relational databases generally store structured data with certain data formats and simple operations such as CURD.
An inverted index differs from a forward index in that it is used for full-text search. For example, I have a book with 10W words and 3k words, and I want to search for the chapter in which a certain word appears. What should I do?
Forward index: Walk through the book, recording the occurrence of the chapter. We have to go through almost 10 million words to count them all.
Inverted index: an inverted index is created, with each word as key and the section in which the word appears as value. We just have to find our target word in 3k words.
In this case, inverted indexes are obviously better for full-text search performance.
A normal forward index uses a key to find a value, whereas an inverted index uses a value to find a key. The key-value relationship is reversed.
Reverse index in ES
In es, only inverted index is created by default. In other types, inverted index is also created. Of course, ES supports customization. In this case, the forward index is the Doc Value. In this chapter we mainly introduce the inverted index. Let’s look at an example of how an inverted index can be created.
Let’s say we have two doc’s, both of which have a content field
Doc_1: The quick Brown fox jumped over The lazy dog
Doc_2: Quick Brown Foxes Jump over lazy dogs in summer
First of all, the word segmentation of DOC will be carried out in the bottom es classifier to obtain one term (word) after another, and then a mapping relationship will be established to record the documents of each word. First let’s analyze the documents where each word exists.
Because each DOC is uniquely identified by its ID, it establishes a mapping.
When ES sets up this mapping, when we search for a word, we don’t need to traverse every document. Of course, the inverted index of ES is not so simple.
Term optimization, for example, we used Baidu to search for the word “JUmped”
It’s easy to see that case is case-sensitive and only matches different tenses. So the same is true for ES. The participle of ES does certain things to the word, such as:
1 Case conversion: Quick –> Quick
Mother –> mom
Jumped –> jump
4. Dogs –> dog
. Note: different word segmentation methods and algorithms are not the same. Be aware of that.
When es is term optimized, let’s look at the inverted index:
When the inverted index is shown above, we can easily conduct a full-text search.
3. Tf-idf algorithm
Term Frequency — Inverse Document Frequency (TF-IDF) is a commonly used weighting technique for information retrieval and information exploration. Tf-idf is a statistical method used to assess the importance of a word to one of the documents in a document set or a corpus. The importance of a word increases with the number of times it appears in the document, but decreases inversely with the frequency of its occurrence in the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the degree of correlation between files and user queries. In addition to TF-IDF, search engines on the Internet also use link analysis-based rating methods to determine the order in which documents appear in search results.
Term frequency: How many times each Term in the search text appears in the field text, the more times it appears, the more relevant it is
When conducting full-text search in ES, tF-IDF algorithm is also adopted for the matching degree of search results. This match can be reflected in the metadata _score attribute of ES. Let’s verify it experimentally.
Start by creating an index
PUT /my_index? pretty
Insert data
PUT /my_index/my_index_type/1 { "content":"The quick brown fox jumped over the lazy dog" } PUT /my_index/my_index_type/2 { "content":"Quick brown foxes jump over lazy dogs in summer" }
search
GET /my_index/my_index_type/_search { "query":{ "match":{ "content": "quick" } } }
The search results
According to the above results, it is easy to find that ES calculates relevancy _score through TF-IDF algorithm. And don’t omit case.
If we search for the word “summer”, we get something like this, with only DOC1 matching.