
Analysis consists of the following steps:

  • First, a block of text is divided into individual terms suitable for the inverted index,
  • Then, those terms are normalized into a standard form to improve their “searchability,” or recall.

An analyzer does this work. An analyzer packages three functions together:

  • Character filters

    First, the string passes through each character filter in order. Their job is to tidy up the string before tokenization. A character filter can be used to strip out HTML, or to convert & to and.

  • Tokenizer

    Next, the string is divided into individual terms by the tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.

  • Token filters

    Finally, the terms pass through each token filter in order. This process may change terms (for example, lowercasing Quick), remove terms (for example, stopwords such as a, and, the), or add terms (for example, synonyms such as jump and leap).

Elasticsearch provides character filters, tokenizers, and token filters out of the box. These can be combined to build custom analyzers for different purposes.
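As a sketch of how the three pieces combine, the index settings below define a custom analyzer (the names my_index and my_analyzer are made up for illustration) using only built-in components: the html_strip character filter, the standard tokenizer, and the lowercase and stop token filters:

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type":        "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer":   "standard",
          "filter":      [ "lowercase", "stop" ]
        }
      }
    }
  }
}
```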

Built-in analyzers

However, Elasticsearch also ships with pre-packaged analyzers that you can use directly. Below we list the most important ones. To demonstrate the differences, let’s see what terms each analyzer produces from the following string:

"Set the shape to semi-transparent by calling set_trans(5)"
  • Standard analyzer

    The standard analyzer is Elasticsearch’s default. It is the most common choice for analyzing text in any language. It splits text on word boundaries as defined by the Unicode Text Segmentation algorithm, removes most punctuation, and finally lowercases the terms. It produces

    set, the, shape, to, semi, transparent, by, calling, set_trans, 5
  • Simple analyzer

    The simple analyzer splits the text on anything that is not a letter and lowercases the terms. It produces

    set, the, shape, to, semi, transparent, by, calling, set, trans
  • Whitespace analyzer

    The whitespace analyzer splits the text on whitespace. It produces

    Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
  • Language analyzers

    Language-specific analyzers are available for many languages. They can take the characteristics of a given language into account. For example, the english analyzer comes with a set of English stopwords (common words such as and or the that contribute little to relevance), which it removes. Because it understands the rules of English grammar, this analyzer can also reduce English words to their stems.

    The english analyzer produces the following terms:

    set, shape, semi, transpar, call, set_tran, 5

    Notice that transparent, calling, and set_trans have been reduced to their stem forms.
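A quick way to check this stemming behavior yourself is to run the sample sentence through the english analyzer with the analyze API (covered later in this article); a request along these lines should reproduce the stemmed terms listed above:

```
GET /_analyze
{
  "analyzer": "english",
  "text":     "Set the shape to semi-transparent by calling set_trans(5)"
}
```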

When analyzers are used

When we index a document, its full-text fields are analyzed into terms that are used to create the inverted index. However, when we search within a full-text field, we need to pass the query string through the same analysis process, to ensure that the terms we search for are in the same form as the terms stored in the index.

Full-text queries understand how each field is defined, so they can do the right thing:

  • When you query a full-text field, the same analyzer is applied to the query string to produce the correct list of terms to search for.
  • When you query an exact-value field, the query string is not analyzed; instead, the exact value you specify is searched for.

Now you can understand why the query in the opening section returned that result:

  • The date field contains an exact value: the single term 2014-09-15.
  • The _all field is a full-text field, so the analysis process converted the date into three terms: 2014, 09, and 15.

When we query for 2014 in the _all field, it matches all 12 tweets because they all contain 2014:

GET /_search?q=2014

When we query the _all field for 2014-09-15, it first analyzes the query string, producing a query that matches any of the terms 2014, 09, or 15. This also matches all 12 tweets, because they all contain the term 2014:

GET /_search?q=2014-09-15

When we query 2014-09-15 in the date field, it looks for the exact date and finds only one tweet:

GET /_search?q=date:2014-09-15

When we query 2014 in the date field, it cannot find any documents, because no document contains that exact date:

GET /_search?q=date:2014

Testing analyzers

Sometimes it’s hard to understand the analysis process and the terms that are actually stored in the index, especially if you’re new to Elasticsearch. To see what is happening, you can use the analyze API to observe how text is analyzed. Specify the analyzer to use and the text to analyze in the body of the request:
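For example, a request along these lines produces the response shown below (the input text Text to analyze is an assumption based on the tokens in that response; older Elasticsearch versions instead accepted GET /_analyze?analyzer=standard with the raw text as the body):

```
GET /_analyze
{
  "analyzer": "standard",
  "text":     "Text to analyze"
}
```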

Each element in the result represents a single term:

{
   "tokens": [
      {
         "token":        "text",
         "start_offset": 0,
         "end_offset":   4,
         "type":         "<ALPHANUM>",
         "position":     1
      },
      {
         "token":        "to",
         "start_offset": 5,
         "end_offset":   7,
         "type":         "<ALPHANUM>",
         "position":     2
      },
      {
         "token":        "analyze",
         "start_offset": 8,
         "end_offset":   15,
         "type":         "<ALPHANUM>",
         "position":     3
      }
   ]
}

The token field holds the term actually stored in the index. position indicates the order of the term in the original text. start_offset and end_offset are the character offsets of the term in the original string.

The Analyze API is a useful tool for understanding what happens inside the Elasticsearch index, and we’ll discuss it more as we go on.

Specifying analyzers

When Elasticsearch detects a new string field in your documents, it automatically configures it as a full-text string field and analyzes it with the standard analyzer.

You won’t always want this. Perhaps you want to apply a different analyzer, one better suited to the language your data is in. And sometimes you want a string field to be just a string field: to index the exact value you pass in, without any analysis, such as a user ID or an internal status field or tag.

To do this, we must manually specify the mapping of these fields.
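As a sketch (the index and field names here are hypothetical), a mapping like the following tells Elasticsearch to analyze tweet with the english analyzer while indexing user_id as an exact, unanalyzed value. In recent Elasticsearch versions the exact-value string type is keyword; older versions expressed the same idea with "type": "string", "index": "not_analyzed":

```
PUT /my_index
{
  "mappings": {
    "properties": {
      "tweet":   { "type": "text",    "analyzer": "english" },
      "user_id": { "type": "keyword" }
    }
  }
}
```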