This is day 8 of my participation in the August More Text Challenge. For details, see: August More Text Challenge

If you find this article helpful, please like and follow. It is the greatest encouragement for me to keep writing technical content. More past articles are in my personal column.

Elasticsearch Analysis analyzer

Analysis (text analysis) is the process of converting full text into a series of terms (tokens), also known as word segmentation.
Analysis is carried out by an analyzer.
- You can use Elasticsearch's built-in analyzers or define custom ones as needed
- Besides producing terms when data is written, the same analyzer is also applied to the query string at search time, so that query terms can match the indexed terms
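
To see why the query has to go through the same analysis as the indexed text, here is a minimal sketch of a toy inverted index in plain Python. It is not how Elasticsearch is implemented; the `analyze` function, the `ToyIndex` class, and the "all terms must match" rule are assumptions made only for illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: split on non-letter characters, then lowercase."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]


class ToyIndex:
    """A tiny inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, doc_id, text):
        # Index time: store the analyzed terms, not the raw text.
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Query time: run the query through the same analyzer, then
        # return the documents that contain every query term.
        terms = analyze(query)
        if not terms:
            return set()
        hits = [self.postings.get(t, set()) for t in terms]
        return set.intersection(*hits)


idx = ToyIndex()
idx.index(1, "Quick brown-foxes leap over lazy dogs")
idx.index(2, "Lazy summer evening")
print(idx.search("QUICK Foxes"))  # finds doc 1 even though the case differs
print(idx.search("lazy"))         # finds both documents
```

If the query were not lowercased and split the same way, "QUICK" would never match the stored term "quick".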

Analyzer composition

An analyzer is the component dedicated to word segmentation. It consists of three parts, applied in order:

Character Filters: process the raw text, for example by stripping HTML
Tokenizer: splits the text into terms according to rules
Token Filters: process the resulting terms, for example lowercasing them, removing stop words, and adding synonyms
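
The three stages can be sketched as a chain of plain Python functions. This is only a rough model under stated assumptions: the function names, the whitespace tokenizer, and the `SYNONYMS` table are hypothetical and are not Elasticsearch APIs.

```python
import re

STOP_WORDS = {"the", "a", "is"}   # small illustrative stop word list
SYNONYMS = {"quick": ["fast"]}    # hypothetical synonym table


def strip_html(text):
    """Character filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text):
    """Tokenizer: split the filtered text on runs of whitespace."""
    return text.split()


def lowercase(tokens):
    """Token filter: lowercase every term."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens):
    """Token filter: drop terms found in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


def add_synonyms(tokens):
    """Token filter: emit each term followed by its synonyms, if any."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out


def analyze(text, char_filters, tokenizer, token_filters):
    """Run character filters, then the tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


terms = analyze(
    "<b>The Quick</b> fox is here",
    char_filters=[strip_html],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase, remove_stop_words, add_synonyms],
)
print(terms)  # ['quick', 'fast', 'fox', 'here']
```

The order matters: the stop word filter here only works because lowercasing runs before it.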

Use Analyzer for word segmentation

Built-in analyzers:

Standard Analyzer – The default analyzer; splits on word boundaries and lowercases
Simple Analyzer – Splits on non-letter characters (symbols are discarded) and lowercases
Stop Analyzer – Lowercases and filters out stop words (the, a, is)
Whitespace Analyzer – Splits on whitespace and does not lowercase
Keyword Analyzer – Does no segmentation; the input is output as a single term
Pattern Analyzer – Splits by regular expression; the default is \W+ (splits on non-word characters)
Language Analyzers – Provide word segmentation for more than 30 common languages
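
As a rough illustration of how these analyzers differ, the sketch below approximates four of them with Python regular expressions and applies them to the sample sentence used in this article. These are approximations under my own assumptions (ASCII letters only, for example), not the actual Elasticsearch implementations.

```python
import re

TEXT = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."


def simple_analyzer(text):
    # Split on anything that is not a letter, then lowercase.
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]


def whitespace_analyzer(text):
    # Split on whitespace only; case and punctuation are left untouched.
    return text.split()


def keyword_analyzer(text):
    # No segmentation: the whole input becomes one term.
    return [text]


def pattern_analyzer(text, pattern=r"\W+"):
    # Split on the regular expression (non-word characters by default),
    # then lowercase.
    return [t.lower() for t in re.split(pattern, text) if t]


for analyzer in (simple_analyzer, whitespace_analyzer,
                 keyword_analyzer, pattern_analyzer):
    print(analyzer.__name__, analyzer(TEXT))
```

Note how the digit "2" disappears under the simple approximation but survives under the pattern one, and how the whitespace version keeps "brown-foxes" and "evening." intact.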

See the effects of different analyzers

Standard analyzer (default)

GET _analyze
{
  "analyzer": "standard",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

=================== result ===================

{
  "tokens": [
    {"token": "2", "start_offset": 0, "end_offset": 1, "type": "<NUM>", "position": 0},
    {"token": "running", "start_offset": 2, "end_offset": 9, "type": "<ALPHANUM>", "position": 1},
    ...
    {"token": "evening", "start_offset": 62, "end_offset": 69, "type": "<ALPHANUM>", "position": 12}
  ]
}

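The offsets and positions in a result like the one above can be reproduced with a short Python sketch. It assumes a plain `\w+` regular expression in place of the real standard tokenizer, which happens to give the same boundaries for this sentence, and it leaves out the "type" field.

```python
import re

TEXT = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."


def tokenize_with_offsets(text):
    """Return lowercase tokens with character offsets and positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "position": position,           # ordinal of the token in the text
        })
    return tokens


tokens = tokenize_with_offsets(TEXT)
print(tokens[0])
print(tokens[-1])
```

The offsets point back into the original text, while the position counts tokens from 0.
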
Stop Analyzer – lowercasing plus stop word filtering

GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

=================== result ===================

{
  "tokens": [
    {"token": "running", "start_offset": 2, "end_offset": 9, "type": "word", "position": 0},
    {"token": "quick", "start_offset": 10, "end_offset": 15, "type": "word", "position": 1},
    ...
    {"token": "evening", "start_offset": 62, "end_offset": 69, "type": "word", "position": 11}
  ]
}

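In the stop analyzer result, "evening" sits at position 11 even though fewer than twelve tokens are returned: the removed stop words still consume a position. The sketch below models that behavior in Python. The letter-only regular expression and the short `STOP_WORDS` set are simplifying assumptions, not the analyzer's real rules or its full stop word list.

```python
import re

TEXT = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
STOP_WORDS = {"a", "is", "the", "in"}  # small illustrative subset


def stop_analyzer(text):
    """Split on non-letters, lowercase, and drop stop words."""
    tokens = []
    position = 0
    for match in re.finditer(r"[A-Za-z]+", text):
        term = match.group().lower()
        if term not in STOP_WORDS:
            tokens.append({
                "token": term,
                "start_offset": match.start(),
                "end_offset": match.end(),
                "position": position,
            })
        # A removed stop word still advances the position counter,
        # which leaves a gap in the positions of the surviving tokens.
        position += 1
    return tokens


tokens = stop_analyzer(TEXT)
print([(t["token"], t["position"]) for t in tokens])
```

The digit "2" is dropped by the letter-only split, so "running" starts at position 0, and "in" and "the" leave a gap between "dogs" (7) and "summer" (10).
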
More word segmentation examples

#simple
GET _analyze
{
  "analyzer": "simple",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#stop
GET _analyze
{
  "analyzer": "stop",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#whitespace
GET _analyze
{
  "analyzer": "whitespace",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#keyword
GET _analyze
{
  "analyzer": "keyword",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#pattern
GET _analyze
{
  "analyzer": "pattern",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

#english
GET _analyze
{
  "analyzer": "english",
  "text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

POST _analyze
{
  "analyzer":
}

POST _analyze
{
  "analyzer": "standard",
  "text":
}

#icu_analyzer
POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "this apple is not delicious"
}

Note that icu_analyzer, like the ik analyzer, does not ship with Elasticsearch 7.8.0. Run ./bin/elasticsearch-plugin install analysis-icu to install it, then restart Elasticsearch.
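
If you would rather compare analyzers from a script than paste each request into the console, something like the following sketch may help. It only uses the `_analyze` endpoint with the `analyzer` and `text` fields shown above. The `base_url` value (a local, unsecured node on port 9200) and the helper names are assumptions, so adjust them to your own cluster.

```python
import json
import urllib.request

TEXT = "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
ANALYZERS = ["simple", "stop", "whitespace", "keyword", "pattern", "english"]


def build_analyze_body(analyzer, text):
    """Return the JSON body for an _analyze request."""
    return json.dumps({"analyzer": analyzer, "text": text})


def run_analyze(analyzer, text, base_url="http://localhost:9200"):
    """Send the request and return the list of token strings.

    base_url is an assumption: a local node without authentication.
    """
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=build_analyze_body(analyzer, text).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return [token["token"] for token in result["tokens"]]


def compare(analyzers=ANALYZERS, text=TEXT):
    """Print the tokens each analyzer produces for the same text."""
    for name in analyzers:
        print(name, run_analyze(name, text))
```

Calling `compare()` against a running node prints one line per analyzer, which makes the differences easy to scan side by side.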

Chinese word segmentation

ik

Supports custom dictionaries and hot updates of the dictionary: gitee.com/mirrors/ela…

THULAC

A Chinese word segmentation toolkit from the Natural Language Processing and Social Humanities Computing Laboratory at Tsinghua University: gitee.com/puremilk/TH…

Further reading

www.elastic.co/guide/en/el…
www.elastic.co/guide/en/el…
