“This is the 23rd day of my participation in the Gwen Challenge in November. Check out the details: The Last Gwen Challenge in 2021”
A shard is the smallest unit of work in Elasticsearch. But what exactly is a shard, and how does it work?
Traditional databases store a single value per field, but this is not sufficient for full-text search. Every word in a text field needs to be searchable, which means the database must be able to index multiple values for a single field. The data structure that best supports multiple values per field is the inverted index.
4.6.1 Inverted Index
Elasticsearch uses a structure called an inverted index, which is suitable for fast full-text searches.
The name speaks for itself: if there is an inverted index, there must also be a forward index.
With a forward index, the search engine assigns each file an ID and maps that ID to the keywords found in the file, forming key-value pairs, and then counts the keywords.
However, the number of documents indexed by Internet search engines is astronomical, and a forward index cannot deliver ranked results in real time, because finding the files that contain a keyword means scanning every entry. So the search engine rebuilds the forward index into an inverted index: the mapping from file IDs to keywords becomes a mapping from keywords to file IDs, where each keyword corresponds to the list of files in which it appears.
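The conversion from a forward index to an inverted index can be sketched in a few lines of Python. This is only an illustration: the `forward_index` contents and the `invert` helper are hypothetical, and a real search engine uses far more compact structures.

```python
from collections import defaultdict

# Hypothetical forward index: file ID -> keywords found in that file.
forward_index = {
    1: ["quick", "brown", "fox"],
    2: ["quick", "lazy", "dog"],
}

def invert(forward):
    """Turn a {file_id: [keywords]} mapping into {keyword: [file_ids]}."""
    inverted = defaultdict(list)
    for file_id in sorted(forward):
        for keyword in forward[file_id]:
            # Record each file only once per keyword.
            if file_id not in inverted[keyword]:
                inverted[keyword].append(file_id)
    return dict(inverted)

inverted_index = invert(forward_index)
print(inverted_index["quick"])  # [1, 2]
```

Looking up a keyword is now a single dictionary access instead of a scan over every file.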
An inverted index consists of a list of all the unique words that appear in any document and, for each word, a list of the documents in which it appears. For example, suppose we have two documents, each with a content field containing the following:
- The quick brown fox jumped over the lazy dog
- Quick brown foxes leap over lazy dogs in summer
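To create an inverted index, we first split the content field of each document into individual words (which we call terms or tokens), create a sorted list of all the unique terms, and then list the documents in which each term appears. In the result, for example, brown, lazy, and over are listed under both documents, while Quick is listed only under the second document and quick only under the first.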
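The steps above can be sketched in Python as follows. This is a minimal illustration, not how Elasticsearch is implemented: splitting on whitespace stands in for real tokenization, and `build_inverted_index` is a hypothetical helper name.

```python
# The two example documents, keyed by document ID.
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

def build_inverted_index(documents):
    """Map each term to the set of IDs of the documents that contain it."""
    index = {}
    for doc_id, content in documents.items():
        for term in content.split():  # naive whitespace tokenization
            index.setdefault(term, set()).add(doc_id)
    return index

index = build_inverted_index(docs)

# Print one row per term in sorted order, with an X under each matching document.
for term in sorted(index):
    marks = ["X" if doc_id in index[term] else " " for doc_id in sorted(docs)]
    print(f"{term:<8}| {' | '.join(marks)}")
```

Because no normalization is applied, Quick and quick end up as separate entries, as do The and the.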
Now, if we want to search for quick brown, we just need to find the documents in which each term appears: quick appears only in the first document, while brown appears in both.
Both documents match, but the first document contains both terms and the second only one. If we use a simple similarity algorithm that only counts the number of matching terms, then the first document is more relevant to our query than the second.
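A minimal sketch of this counting approach is shown below. The `index` holds only the entries relevant to the query, taken from the two example documents, and the `score` function is a hypothetical name; real relevance scoring is considerably more sophisticated.

```python
# Only the index entries that matter for this query (term -> document IDs).
index = {
    "quick": {1},
    "Quick": {2},
    "brown": {1, 2},
}

def score(index, query):
    """Count, per document, how many query terms it contains."""
    scores = {}
    for term in query.split():
        for doc_id in index.get(term, set()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return scores

print(score(index, "quick brown"))  # {1: 2, 2: 1}
```

Document 1 matches two terms and document 2 matches one, so document 1 ranks higher.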
However, there are some problems with our current inverted index:
- Quick and quick appear as separate entries, but users may think of them as the same word.
- fox and foxes are very similar, as are dog and dogs; they share the same root.
- jumped and leap, although they don't share the same root, have similar meanings. They are synonyms.
With the current index, a search for +Quick +fox does not match any document. (Remember, the + prefix indicates that the term must be present.) Only a document that contains both terms satisfies the query, but the first document contains quick fox and the second contains Quick foxes.
Our users can reasonably expect both documents to match the query. We can do better.
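To see why the query fails, here is a rough sketch of a must-match search against the unnormalized index. The `must_match` function is a hypothetical helper that only understands the + prefix; it is not the Elasticsearch query parser.

```python
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Unnormalized inverted index: exact term -> set of document IDs.
index = {}
for doc_id, content in docs.items():
    for term in content.split():
        index.setdefault(term, set()).add(doc_id)

def must_match(index, query):
    """Return the IDs of documents that contain every +term in the query."""
    required = [t[1:] for t in query.split() if t.startswith("+")]
    if not required:
        return set()
    # Intersect the document sets of all required terms.
    return set.intersection(*(index.get(term, set()) for term in required))

print(must_match(index, "+Quick +fox"))  # set()
```

Quick is found only in document 2 and fox only in document 1, so the intersection is empty.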
If we normalize terms into a standard form, we can find documents that do not contain exactly the terms the user searched for, but are still sufficiently relevant. For example:
- Quick can be lowercased to quick.
- foxes can be stemmed, that is, reduced to its root form, fox. Similarly, dogs can be stemmed to dog.
- jumped and leap are synonyms and can both be indexed as the single term jump.
In the index now, brown, dog, fox, jump, lazy, over, and quick are each listed under both documents; in and summer appear only in the second document, and the only in the first.
But we are not there yet. Our search for +Quick +fox will still fail because Quick is no longer in our index. However, if we apply the same normalization rules to the query string that we applied to the content field, the query becomes +quick +fox, which matches both documents. This process of tokenization and normalization is called analysis.
This is very important. You can only search for terms that appear in the index, so both the indexed text and the query string must be normalized to the same form.
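The whole idea can be sketched in Python as follows. This is a toy illustration under stated assumptions: the `STEMS` and `SYNONYMS` tables are hand-written for the two example documents, and `analyze`, `build_index`, and `search_all` are hypothetical helpers, not Elasticsearch APIs.

```python
docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}

# Hand-written rules covering only the example documents (assumption).
STEMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump"}
SYNONYMS = {"leap": "jump"}

def analyze(text):
    """Split on whitespace, then lowercase, stem, and map synonyms."""
    terms = []
    for token in text.split():
        term = token.lower()
        term = STEMS.get(term, term)
        term = SYNONYMS.get(term, term)
        terms.append(term)
    return terms

def build_index(documents):
    """Build an inverted index from analyzed (normalized) terms."""
    index = {}
    for doc_id, content in documents.items():
        for term in analyze(content):
            index.setdefault(term, set()).add(doc_id)
    return index

def search_all(index, query):
    """Return documents containing every +term, analyzing the query first."""
    required = analyze(" ".join(t.lstrip("+") for t in query.split()))
    postings = [index.get(term, set()) for term in required]
    return set.intersection(*postings) if postings else set()

index = build_index(docs)
print(sorted(search_all(index, "+Quick +fox")))  # [1, 2]
```

Because the same `analyze` function runs at index time and at query time, the query terms and the indexed terms always share one form.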