Elasticsearch can match synonyms of a user's search terms, and can segment text semantically so that the most relevant results are returned.

I. ES Search with the Analyzer

1. ES Search Process

To understand this, we first need to understand how an ES search is processed. This capability of Elasticsearch is implemented by the Analyzer. An Analyzer in ES processes text in three stages: character filters first, then a tokenizer, and finally token filters.

As you can see from this pipeline, synonyms are added in the token filter stage, which is why ES supports synonym search.
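You can observe this pipeline directly with the _analyze API, which lets you assemble the three stages ad hoc. A minimal sketch (the input text is illustrative):

{
    "char_filter": [ "html_strip" ],
    "tokenizer": "standard",
    "filter": [ "lowercase" ],
    "text": "<b>The QUICK Fox</b>"
}

Sent to POST _analyze, this returns the tokens the, quick, fox: the char filter strips the tags, the tokenizer splits the text into words, and the token filter lowercases them.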

2. Analyzer

1) Components of an Analyzer

An Analyzer consists of zero or more character filters (char_filter), exactly one tokenizer, and zero or more token filters.

A char_filter preprocesses the original text before tokenization, for example removing HTML tags.

A tokenizer breaks the text into individual tokens, for example splitting a sentence into words on delimiters or whitespace.

A token filter post-processes the tokens emitted by the tokenizer. Common operations include deleting or modifying tokens, for example removing stop words such as "a" and "the", converting uppercase to lowercase, and adding synonyms.

The Analyzer configuration parameters are:

Parameter                Description
tokenizer                A built-in or registered tokenizer.
filter                   Built-in or registered token filters.
char_filter              Built-in or registered character filters.
position_increment_gap   The position gap inserted between the values of a multi-valued text field, so that phrase queries do not match across separate values. Default: 100.
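Putting these parameters together, a custom analyzer is declared under the index's analysis settings. A minimal sketch (the analyzer name my_custom_analyzer is illustrative; position_increment_gap is shown at its default):

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "char_filter": [ "html_strip" ],
                    "tokenizer": "standard",
                    "filter": [ "lowercase" ],
                    "position_increment_gap": 100
                }
            }
        }
    }
}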

2) Differences Between analyzer and search_analyzer

There are two main situations in which ES uses an analyzer. The first is at index time: text fields are segmented and the resulting tokens are written to the inverted index. The second is at query time: the query text is segmented first, and the resulting tokens are then looked up in the inverted index. The selection rules are as follows. At index time, if the field defines an analyzer, that analyzer is used; otherwise the ES default is used. At query time, if the field defines a search_analyzer, it is used; if not, the field's analyzer is used if defined; otherwise the ES default is used.
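A minimal mapping sketch that sets a different analyzer for indexing and for searching (the index name demo_index and field title are illustrative; standard and simple are built-in analyzers):

curl -XPUT 'http://localhost:9200/demo_index?pretty' -H 'Content-Type: application/json' -d '
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "standard",
                "search_analyzer": "simple"
            }
        }
    }
}'

The IK example later in this article uses exactly this pattern, indexing with ik_max_word and searching with ik_smart.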

3. Character Filters

Elasticsearch provides only three character filters:

1) HTML Strip Char Filter

It removes HTML tags, for example converting <p>I <br>Love</br> cat</p> to: I Love cat.

Sample code:

{ "tokenizer":"keyword", "char_filter":[ "html_strip" ], "text":"&lt; p&gt; I' &lt; br&gt; Love&lt; /br&gt; cat&lt; /p&gt;" }Copy the code

The result: the char filter strips the tags, and the keyword tokenizer emits the remaining text as a single token:

I Love cat

2) Mapping Char Filter

It replaces content in the input according to a configured mapping, for example changing "trump" to "TTT".

{ "settings":{ "analysis":{ "analyzer":{ "my_analyzer":{ "tokenizer":"keyword", "char_filter":[ "my_char_filter" ] } }, "Char_filter" : {" my_char_filter ": {" type" : "mapping", "the mappings" : [" trump = & gt; TTT "]}}}}}Copy the code

With this analyzer, indexing or searching will convert "trump" to "TTT".
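Assuming the settings above were applied when creating an index named my_index (the name is illustrative), the replacement can be verified with the _analyze API:

curl -XGET 'http://localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d '
{ "analyzer": "my_analyzer", "text": "I support trump" }'

Because the keyword tokenizer emits the whole string as one token, the response contains the single token "I support TTT".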

3) Pattern Replace Char Filter

It uses a regular expression to match and replace characters in the input; its capabilities are those of standard Java regular expressions. A minimal configuration sketch follows.
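A sketch of a pattern_replace char filter that collapses runs of digits into a # placeholder (the filter name digits_to_hash and the pattern are illustrative):

{
    "settings": {
        "analysis": {
            "char_filter": {
                "digits_to_hash": {
                    "type": "pattern_replace",
                    "pattern": "\\d+",
                    "replacement": "#"
                }
            }
        }
    }
}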

4. Tokenizers

Here are a few common tokenizers.

1) Standard Tokenizer

The standard tokenizer is very friendly to European languages such as English and supports Unicode. Its properties are listed below, followed by a short example:

Parameter          Description
max_token_length   The maximum token length. If a token exceeds this length, it is split at max_token_length intervals. Default: 255.
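For example, running the standard tokenizer through the _analyze API (the text is illustrative):

{
    "tokenizer": "standard",
    "text": "The 2 QUICK Brown-Foxes"
}

This yields the tokens The, 2, QUICK, Brown, Foxes: the tokenizer splits on whitespace and punctuation such as the hyphen, but does not change case.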

2) NGram Tokenizer

The NGram tokenizer slides a window over the text and emits every substring whose length is between min_gram and max_gram. For example, with min_gram=2 and max_gram=3, the text "king" is segmented into: ki, kin, in, ing, ng (see the _analyze sketch after the properties table).

Properties:

Parameter     Description
min_gram      The minimum gram (token) length. Default: 1.
max_gram      The maximum gram (token) length. Default: 2.
token_chars   The character classes to keep in tokens: letter, digit, whitespace, punctuation, symbol (e.g. &). Characters outside the configured classes act as token boundaries. Default: [] (keep all characters).
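The example above, expressed as an _analyze request with an inline tokenizer definition (which the _analyze API accepts):

{
    "tokenizer": {
        "type": "ngram",
        "min_gram": 2,
        "max_gram": 3
    },
    "text": "king"
}

This returns ki, kin, in, ing, ng.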

3) Edge NGram Tokenizer

Very similar to the NGram tokenizer, except that grams are anchored to the beginning of each token, which makes it well suited to autocomplete.
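A sketch with an inline tokenizer definition (the gram lengths are chosen for illustration):

{
    "tokenizer": {
        "type": "edge_ngram",
        "min_gram": 1,
        "max_gram": 4
    },
    "text": "king"
}

This returns the prefixes k, ki, kin, king, which are exactly the partial inputs a user types while searching.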

II. Chinese Word Segmentation

1. Scenario

Assume that the ik_demo index contains the following records, as shown in the table below (the message values are Chinese sentences; English glosses are shown alongside):

_index    _type   _id   _score   message
ik_demo   _doc    1     1        我爱中华人民共和国 (I love the People's Republic of China)
ik_demo   _doc    2     1        我来自中国湖北枣阳 (I come from Zaoyang, Hubei, China)

In this example, the character 国 (guó, "country") behaves differently in different contexts: in the second sentence it is part of the single word 中国 (China), so a search for 国 on its own should not return that record.

2. elasticsearch-analysis-ik

Elasticsearch supports Chinese word segmentation via the elasticsearch-analysis-ik plugin: github.com/medcl/elast…

IK installation:

./bin/elasticsearch-plugin install github.com/medcl/elast…

Note: Elasticsearch 7.3.2 is used in this article; the IK plugin version must match your Elasticsearch version.
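To see why the scenario above plays out the way it does, compare IK's two analyzers with the _analyze API. A quick sketch (the text is the first document's message):

{
    "analyzer": "ik_max_word",
    "text": "中华人民共和国"
}

ik_max_word exhaustively emits overlapping sub-words, down to the single character 国, while ik_smart on the same text keeps the whole phrase as one coarse token. This is why the mapping below indexes with ik_max_word (so fine-grained terms like 国 are searchable) and searches with ik_smart (so the query is not over-fragmented).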

3. Set Up the Mapping

curl -XPUT 'http://localhost:9200/ik_demo/_mapping?pretty' -H 'Content-Type: application/json' -d '
{
    "properties": {
        "message": {
            "type": "text",
            "analyzer": "ik_max_word",
            "search_analyzer": "ik_smart",
            "fields": {
                "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                }
            }
        }
    }
}'

4. Add Data

curl -XPUT 'http://localhost:9200/ik_demo/_doc/1?pretty' -H 'Content-Type: application/json' -d '
{ "message": "我爱中华人民共和国" }'

curl -XPUT 'http://localhost:9200/ik_demo/_doc/2?pretty' -H 'Content-Type: application/json' -d '
{ "message": "我来自中国湖北枣阳" }'

5. Search

During word segmentation, 中国 (China) is kept as a single word, so a search for 国 ("country") alone will not find the record "我来自中国湖北枣阳" ("I come from Zaoyang, Hubei, China").

curl -XGET 'http://localhost:9200/ik_demo/_search?pretty=true' -H 'Content-Type: application/json' -d '
{ "query": { "match": { "message": "国" } } }'

Return result:

{ "took":3, "timed_out":false, "_shards":{ "total":5, "successful":5, "skipped":0, "failed":0 }, "Hits" : {" total ": {" value" : 1, the "base" : "eq"}, "max_score" : 0.2876821, "hits" : [{" _index ":" ik_demo ", "_type" : "_doc." "_id" : "1", "_score" : 0.2876821, "_source" : {" message ":" I love the People's Republic of China "}}}}]Copy the code

This is only a brief introduction; for the specifics of IK word segmentation, refer to: github.com/medcl/elast…

III. Synonym Search

1. Scenario

Assume the synonym_demo index contains the following entries, as shown in the table below:

_index         _type   _id   _score   message
synonym_demo   _doc    1     1        I like cats
synonym_demo   _doc    2     1        I like the cat
synonym_demo   _doc    3     1        I like dogs

"cats" and "the cat" refer to the same animal, so if a user searches for "cat", records [1, 2] should both be returned. How do we achieve that?

2. elasticsearch-dynamic-synonym

Install the synonym plugin elasticsearch-dynamic-synonym: github.com/ginobefun/e…

Installation:

git clone github.com/ginobefun/e…

mvn clean install -DskipTests

Unzip target/releases/elasticsearch-dynamic-synonym-<version>.zip into the ES_HOME/plugins/dynamic-synonym directory.

Edit plugin-descriptor.properties: comment out site, jvm, and isolated, and set elasticsearch.version to your Elasticsearch version, as follows:

#site=${elasticsearch.plugin.site}

#jvm=true

#isolated=${elasticsearch.plugin.isolated}

elasticsearch.version=7.3.2

3. Create the Index

Add the synonym pair to ES_HOME/config/analysis/synonyms.txt: cat, cats
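The file uses the standard Solr synonym format, one rule per line. A sketch of what synonyms.txt can contain (the commented-out second rule only illustrates the explicit-mapping form; it is not needed for this demo):

# Equivalent synonyms: with expand=true, a query for any of them matches all of them
cat, cats
# Explicit mapping: tokens on the left are rewritten to the tokens on the right
# kitty => cat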

Create index:

curl -XPUT 'http://localhost:9200/synonym_demo?pretty' -H 'Content-Type: application/json' -d '
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_dynamic_synonym": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [ "my_synonym" ]
                }
            },
            "filter": {
                "my_synonym": {
                    "type": "synonym",
                    "expand": true,
                    "ignore_case": true,
                    "synonyms_path": "analysis/synonyms.txt"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "message": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "my_dynamic_synonym",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            }
        }
    }
}'

4. Add Data

curl -XPUT 'http://localhost:9200/synonym_demo/_doc/1?pretty' -H 'Content-Type: application/json' -d '
{ "message": "I like cats" }'

curl -XPUT 'http://localhost:9200/synonym_demo/_doc/2?pretty' -H 'Content-Type: application/json' -d '
{ "message": "I like the cat" }'

curl -XPUT 'http://localhost:9200/synonym_demo/_doc/3?pretty' -H 'Content-Type: application/json' -d '
{ "message": "I like dogs" }'

5. Search

Since "cat" and "cats" are configured as synonyms, a search for "cat" should also return the record containing "cats".

curl -XGET 'http://localhost:9200/synonym_demo/_search?pretty=true' -H 'Content-Type: application/json' -d '
{ "query": { "match": { "message": "cat" } } }'

Return result:

{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 2, the "base" : "eq"}, "max_score" : 1.2039728, "hits" : [{" _index ": "Synonym_demo _type", "" :" _doc ", "_id" : "1", "_score" : 1.2039728, "_source" : {" message ":" I like cats "}}, {" _index ": "Synonym_demo _type", "" :" _doc ", "_id" : "2", "_score" : 1.2039728, "_source" : {" message ":" I like the cat "}}}}]Copy the code

IV. Reference Documentation

search_analyzer: www.elastic.co/guide/en/el…

Custom Analyzer: www.elastic.co/guide/en/el…

Elasticsearch built-in character filters: www.cnblogs.com/Neeo/articl…

elasticsearch-dynamic-synonym: github.com/ginobefun/e…