This is the sixth day of my participation in the November Gwen Challenge. Check out the event details: The last Gwen Challenge 2021

One question first: why use a search engine when a database can already query with select?

The traditional search method is to compare the content against the keyword one item at a time and pull out whatever matches. For example, if I have the keyword “learning” in hand, I compare it with each word in turn and take out every match I find. That is looking up the value by its key, which is the forward index.

That leads up to today’s main topic, the inverted index.

For example, if the word “learn” appears 10 times in an article, the positions of those 10 occurrences are recorded once, when the article is segmented into words. When you later search for “learn”, the recorded positions are returned directly, so you don’t have to go through the text again.

For example, take the following 3 documents:

  • 0. "it is what it is"
  • 1. "what is it"
  • 2. "it is a banana"

These are broken down into the following index, where each entry is a (document number, word position) pair:

"a":      {(2, 2)}
"banana": {(2, 3)}
"is":     {(0, 1), (0, 4), (1, 1) , (2, 1)}
"it":     {(0, 0), (0, 3), (1, 2) , (2, 0)} 
"what":   {(0, 2), (1, 0) }
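
As a minimal sketch of how such an index could be built, the Python below records a (document number, word position) pair for every word. The function name `build_positional_index` is made up for this illustration, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to a list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Whitespace splitting is an assumption; a real engine uses an analyzer.
        for position, word in enumerate(text.split()):
            index[word].append((doc_id, position))
    return dict(index)

docs = ["it is what it is", "what is it", "it is a banana"]
index = build_positional_index(docs)

for word in sorted(index):
    print(word, index[word])
```

Because documents and words are visited in order, each list comes out already sorted by document number and then by position.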

Searching for the terms “what”, “is” and “it” gives the intersection of their document sets: {0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}

If we run a phrase search for “what is it”, all the words of the phrase are found in documents 0 and 1, but they appear consecutively and in that order only in document 1.
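
The two lookups can be sketched in Python as follows. The index literal is copied from the table above; the helper names `docs_containing_all` and `phrase_search` are invented for this example and are not part of any library.

```python
# Positional index from the example: word -> [(doc_id, position), ...]
INDEX = {
    "a": [(2, 2)],
    "banana": [(2, 3)],
    "is": [(0, 1), (0, 4), (1, 1), (2, 1)],
    "it": [(0, 0), (0, 3), (1, 2), (2, 0)],
    "what": [(0, 2), (1, 0)],
}

def docs_containing_all(index, words):
    """AND query: intersect the document sets of every word."""
    doc_sets = [{doc_id for doc_id, _ in index.get(w, [])} for w in words]
    return set.intersection(*doc_sets) if doc_sets else set()

def phrase_search(index, words):
    """Return documents where the words occur consecutively, in order."""
    if not words:
        return set()
    matches = set()
    for doc_id, start in index.get(words[0], []):
        # Word k of the phrase must sit at position start + k in the same document.
        if all((doc_id, start + k) in index.get(w, []) for k, w in enumerate(words)):
            matches.add(doc_id)
    return matches

query = ["what", "is", "it"]
print(docs_containing_all(INDEX, query))  # {0, 1}
print(phrase_search(INDEX, query))        # {1}
```

Document 0 drops out of the phrase search because its “what” at position 2 is followed by “it”, not “is”.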

That covers the speed problem. There is one more problem, which is matching terms that are similar but not identical. Here is a small example:

  1. The quick brown fox jumped over the lazy dog
  2. Quick brown foxes leap over lazy dogs in summer

Quick and quick appear as separate entries, but users may think they are the same word. fox and foxes are very similar, as are dog and dogs; they have the same root. jumped and leap, although they don’t have the same root, have similar meanings. They are synonyms.

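I get a problem if I search for “Quick fox”: both Quick and fox must appear in the same document, but the first document contains quick fox and the second contains Quick foxes, so neither one matches.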

Elasticsearch normalizes terms into a standard format, so you can find documents whose terms are not exactly the same as what the user searched for, but are still sufficiently relevant.

Quick can be lowercased to quick. foxes can be stemmed, reducing it to its root form fox. Similarly, dogs can be stemmed to dog. jumped and leap are synonyms and can both be indexed as the single term jump.

After doing this, did we succeed?

No!

A search for Quick fox will still fail, because the exact term Quick is no longer in our index. However, if we apply the same normalization rules to the query string as we did to the content field, it becomes the query +quick +fox, and both documents match!
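
To make the effect concrete, here is a toy Python sketch, not Elasticsearch's actual analyzers. The hand-written `STEMS` and `SYNONYMS` tables and the function names are assumptions made for this example; the point is only that the same `analyze` step runs at index time and at query time.

```python
# Toy normalization rules; these hand-written tables are assumptions for
# illustration, not Elasticsearch's real analyzers.
STEMS = {"foxes": "fox", "dogs": "dog", "jumped": "jump"}
SYNONYMS = {"leap": "jump"}

def analyze(text):
    """Lowercase, then apply the stem and synonym tables to each token."""
    terms = []
    for token in text.lower().split():
        token = STEMS.get(token, token)
        token = SYNONYMS.get(token, token)
        terms.append(token)
    return terms

def build_index(docs):
    """Map each normalized term to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search_all(index, query, normalize_query=True):
    """AND query; normalize_query=False skips analysis to show the failure."""
    terms = analyze(query) if normalize_query else query.split()
    doc_sets = [index.get(t, set()) for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

docs = {
    1: "The quick brown fox jumped over the lazy dog",
    2: "Quick brown foxes leap over lazy dogs in summer",
}
index = build_index(docs)

print(search_all(index, "Quick fox", normalize_query=False))  # set()
print(search_all(index, "Quick fox"))                         # {1, 2}
```

The unnormalized query finds nothing because the index only holds the lowercase term quick, while the normalized query matches both documents.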

The original website: www.elastic.co/guide/cn/el…