Text analysis is the process of converting unstructured text into a structured format optimized for search.

When to Configure Text Analysis

Elasticsearch performs text analysis when indexing or searching fields of type text.

If you use text fields, or your text searches do not return the expected results, configuring text analysis often helps.

Text Analysis Overview

Text analysis enables full-text search in Elasticsearch, where a search returns all relevant results rather than only exact matches.

For example, if you search for Quick fox jumps, you probably want the document that contains A quick brown fox jumps over the lazy dog. You might also want documents that contain related phrases such as fast fox or foxes leap.

Tokenization

Analysis makes full-text search possible through tokenization: breaking text into smaller chunks called tokens. In most cases, these tokens are individual words.

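If you index the phrase the quick brown fox jumps as a single string and a user searches for quick fox, the search does not match. But if you split the phrase into tokens first and then index each word individually, the terms in the query string can be looked up one by one and the document is found.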
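
The difference between indexing a whole string and indexing its tokens can be sketched in a few lines of Python. This is only an illustration, not how Elasticsearch is implemented: the tokenize, build_index, and search helpers are hypothetical names, the tokenizer is a simple regular expression, and search assumes that every query token must be present.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into word tokens (a simplification of a real tokenizer)."""
    return re.findall(r"\w+", text)


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = {1: "the quick brown fox jumps"}

# Indexed as one string: only the exact phrase can be looked up.
whole_string_index = {text: {doc_id} for doc_id, text in docs.items()}
print(whole_string_index.get("quick fox", set()))

# Indexed token by token: each query term is looked up individually.
token_index = build_index(docs)
print(search(token_index, "quick fox"))
```

With the token index, word order in the query no longer matters, so a query such as fox brown finds the same document.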

Normalization

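Tokenization allows a search to match individual terms, but each token is still matched literally. For example:

  • Quick does not match quick, because the case differs
  • fox does not match foxes, even though they share the same root word
  • jumps does not match leaps, even though they are synonyms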

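To solve this problem, text analysis can normalize these tokens into a standard format, so that a search can match tokens that are not exactly the same as the search terms but are similar enough to be relevant. For example:

  • Quick is converted to lowercase: quick
  • foxes is stemmed, that is, reduced to its root word: fox
  • jump and leap are synonyms and can both be indexed as jump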

To ensure that search terms match these words as intended, apply the same tokenization and normalization rules to the query string. For example, a search for Foxes leap is normalized to a search for fox jump.
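
A minimal Python sketch of this idea follows. It is an illustration under stated assumptions, not Elasticsearch code: the STEMS and SYNONYMS tables are tiny hand-written stand-ins for a real stemmer and synonym list, and normalize and analyze are hypothetical helper names. The point is that the same analyze function runs on both the indexed text and the query.

```python
import re

# Hypothetical, hand-written rules for this example only.
STEMS = {"foxes": "fox", "jumps": "jump", "leaps": "leap"}
SYNONYMS = {"leap": "jump"}


def normalize(token):
    """Lowercase a token, reduce it to its root, then map synonyms to one term."""
    token = token.lower()
    token = STEMS.get(token, token)
    token = SYNONYMS.get(token, token)
    return token


def analyze(text):
    """Tokenize on word characters and normalize every token."""
    return [normalize(token) for token in re.findall(r"\w+", text)]


# The same rules are applied at index time and at search time.
indexed_terms = set(analyze("A quick brown fox jumps"))
query_terms = analyze("Foxes leap")

print(query_terms)
print(all(term in indexed_terms for term in query_terms))
```

If the query were analyzed with different rules than the indexed text, for example without lowercasing, the two sets of terms could no longer be compared reliably.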

Customize Text Analysis

Text analysis is performed by an analyzer, a set of rules that governs the entire process.

Elasticsearch includes a default analyzer, the standard analyzer, which works well for most use cases.

If you want to tailor your search experience, you can choose a different built-in analyzer or configure a custom analyzer. A custom analyzer gives you control over every step of the analysis process, including:

  • Changes to the input text before tokenization
  • How text is converted into tokens
  • Normalization applied to tokens before indexing or searching
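
As a rough sketch, the three steps above map onto the char_filter, tokenizer, and filter entries of a custom analyzer definition in the index settings. The example below builds such a definition as a Python dictionary and prints it as JSON. The analyzer name my_custom_analyzer and the field name body are made up for illustration, and the particular choice of html_strip, standard, lowercase, and stemmer is an assumption, not a recommendation from the text above.

```python
import json

# Sketch of index settings for a custom analyzer.
# "my_custom_analyzer" and the "body" field are hypothetical names.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    # Step 1: change the input text before tokenization.
                    "char_filter": ["html_strip"],
                    # Step 2: decide how text becomes tokens.
                    "tokenizer": "standard",
                    # Step 3: normalize tokens before indexing or searching.
                    "filter": ["lowercase", "stemmer"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # Use the custom analyzer for this text field.
            "body": {"type": "text", "analyzer": "my_custom_analyzer"}
        }
    },
}

print(json.dumps(index_body, indent=2))
```

The printed JSON could be used as the body of a request that creates an index. No client call is shown here, because how the request is sent depends on your setup.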