Elasticsearch’s built-in standard analyzer is not friendly to Chinese: it splits Chinese text into single characters and cannot form words. For example:

POST _analyze
{
  "text": "我爱北京天安门",
  "analyzer": "standard"
}

If we use the standard analyzer, the result is:

{" tokens ": [{" token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "< IDEOGRAPHIC >", "position" : 0}, {" token ":" love ", "start_offset" : 1, "end_offset" : 2, "type" : "< IDEOGRAPHIC >", "position" : 1}, {... "token" : "Door", "start_offset" : 6, "end_offset" : 7, "type" : "< IDEOGRAPHIC >", "position" : 6}]}Copy the code

Obviously this is not friendly to Chinese: every single character becomes its own token. Thankfully, Medcl from Elastic has already built the IK Chinese analyzer for us. Below is a detailed look at how to install and use it. The detailed installation procedure can be found at github.com/medcl/elast…

 

Installation

First, check whether there is a release that matches your version of Elasticsearch at:

github.com/medcl/elast…

As of this writing, the latest release is v7.3.1. You need to install the IK plugin version that matches your own Elasticsearch distribution.

So let’s go directly to our Elasticsearch installation directory and type the following command:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.3.1/elasticsearch-analysis-ik-7.3.1.zip

Replace 7.3.1 above with whatever version matches your own installation.

Once it is installed, we can verify the installation with the following command:

localhost:elasticsearch-7.3.0 liuxg$ ./bin/elasticsearch-plugin list
analysis-ik

The output above shows that the IK plugin has been installed successfully.

At this point we need to restart our Elasticsearch so that the plugin can be loaded.
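Once Elasticsearch is back up, we can also confirm that the plugin has been loaded by asking the cluster itself, using the standard _cat API in the same way as the other requests in this article:

GET _cat/plugins?v

Each node in the output should list analysis-ik together with the plugin version.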

 

Using the IK analyzer

According to the IK documentation, the plugin provides the following analyzers and tokenizers (a sketch of how the tokenizers can be used in a custom analyzer follows the list):

Analyzer:

  • ik_smart
  • ik_max_word

Tokenizer:

  • ik_smart
  • ik_max_word
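The two analyzers can be referenced directly in a mapping, as we will do below. The tokenizers can also be combined with token filters in a custom analyzer of your own. Here is a minimal sketch, where the index name my_index and the analyzer name my_ik_analyzer are made-up examples:

# my_index and my_ik_analyzer are example names, not part of the IK plugin
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}

This only illustrates how the ik_max_word tokenizer plugs into the standard custom-analyzer mechanism; in the rest of this article we use the ik_smart and ik_max_word analyzers directly.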

Let’s test with the sentence from before, “我爱北京天安门” (“I love Beijing Tiananmen”):

POST _analyze
{
  "text": "我爱北京天安门",
  "analyzer": "ik_smart"
}

The ik_smart analyzer above returns:

{" tokens ": [{" token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0}, {" token ": "Love", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1}, {" token ":" Beijing ", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2}, {" token ":" tiananmen square ", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 3 } ] }Copy the code

Let’s then try the ik_max_word analyzer on the same sentence:

POST _analyze
{
  "text": "我爱北京天安门",
  "analyzer": "ik_max_word"
}

The ik_max_word analyzer above returns:

{" tokens ": [{" token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0}, {" token ": "Love", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1}, {" token ":" Beijing ", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2}, {" token ":" tiananmen square ", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 3}, {" token ":" day ", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 4}, {" token ":" door ", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 5}]}Copy the code

Comparing the two outputs, the difference lies in granularity: ik_smart extracts coarser-grained terms, while ik_max_word splits the text as finely as possible and produces more tokens.

Next we create an index:

PUT chinese

Next, let’s create a mapping for this index:

PUT /chinese/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_smart"
    }
  }
}

After the preceding command is executed, the following information is displayed:

{
  "acknowledged" : true
}

This indicates that the mapping was created successfully.
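To double-check, we can read the mapping back with the standard _mapping API:

GET /chinese/_mapping

The response should show the content field with ik_max_word as the index-time analyzer and ik_smart as the search analyzer.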

Before indexing any documents, let’s check how this index analyzes our sentence:

GET /chinese/_analyze
{
  "text": "我爱北京天安门",
  "analyzer": "ik_max_word"
}

The result displayed is:

{" tokens ": [{" token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0}, {" token ": "Love", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1}, {" token ":" Beijing ", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2}, {" token ":" tiananmen square ", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 3}, {" token ":" day ", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 4}, {" token ":" door ", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 5}]}Copy the code

We can see from the above results that the tokens include “北京”, “天安”, and “天安门”. This is very different from the single-character output of the standard analyzer.
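As a side note, instead of naming the analyzer explicitly we can point the _analyze API at the content field; it then uses the index-time analyzer configured in the mapping (ik_max_word in our case). A small sketch:

GET /chinese/_analyze
{
  "field": "content",
  "text": "我爱北京天安门"
}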

Next, let’s index two documents:

PUT /chinese/_doc/1
{
  "content": "我爱北京天安门"
}

PUT /chinese/_doc/2
{
  "content": "北京，你好"
}

So we can search in the following way:

GET /chinese/_search
{
  "query": {
    "match": {
      "content": "北京"
    }
  }
}

The result is:

{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 2, the "base" : "eq"}, "max_score" : 0.15965709, "hits" : [{" _index ":" Chinese ", "_type" : "_doc", "_id" : "2", "_score" : 0.15965709, "_source" : {" content ":" Beijing, how are you "}}, {" _index ":" Chinese ", "_type" : "_doc", "_id" : "1", "_score" : 0.100605845, "_source" : {" content ":" I love Beijing tiananmen "}}}}]Copy the code

Since both documents contain “北京” (Beijing), both of them are returned.
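To see exactly which terms matched in each document, we can add the standard highlight option to the same query. A sketch:

GET /chinese/_search
{
  "query": {
    "match": {
      "content": "北京"
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

The matched term “北京” should come back wrapped in <em> tags inside each hit.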

Let’s also do another search:

GET /chinese/_search
{
  "query": {
    "match": {
      "content": "天安门"
    }
  }
}

The result is:

{ "took" : 0, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 1, the "base" : "eq"}, "max_score" : 0.73898095, "hits" : [{" _index ":" Chinese ", "_type" : "_doc", "_id" : "1", "_score" : 0.73898095, "_source" : {" content ":" I love Beijing tiananmen "}}}}]Copy the code

Because “天安门” (Tiananmen) appears only in the first document, only one result is returned.
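It is also worth noting why ik_max_word was chosen as the index-time analyzer: because it produced the extra token “天安” when document 1 was indexed, even a search for just “天安” should be able to hit that document (assuming ik_smart also keeps “天安” as a single term at search time). A sketch:

GET /chinese/_search
{
  "query": {
    "match": {
      "content": "天安"
    }
  }
}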

Let’s try one more search:

GET /chinese/_search
{
  "query": {
    "match": {
      "content": "北京天安门"
    }
  }
}

Here we search for “北京天安门” (“Beijing Tiananmen”). Notice that in the mapping we specified

"search_analyzer": "ik_smart"

That is, the search analyzer ik_smart breaks “北京天安门” into the two terms “北京” and “天安门”, and both terms are used in the search. A match query uses an OR relation by default, so a document that matches either “北京” or “天安门” counts as a match:

{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 2, the "base" : "eq"}, "max_score" : 0.7268042, "hits" : [{" _index ":" Chinese ", "_type" : "_doc", "_id" : "1", "_score" : 0.7268042, "_source" : {" content ":" I love Beijing tiananmen "}}, {" _index ":" Chinese ", "_type" : "_doc", "_id" : "2", "_score" : 0.22920427, "_source" : {" content ":" Beijing, how are you "}}}}]Copy the code

The results above show that “我爱北京天安门” (“I love Beijing Tiananmen”) is the most relevant result, since it scores highest.
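If we want both terms to be required rather than the default OR, the match query accepts an operator parameter. A minimal sketch:

GET /chinese/_search
{
  "query": {
    "match": {
      "content": {
        "query": "北京天安门",
        "operator": "and"
      }
    }
  }
}

With "operator": "and", only document 1 should be returned, since document 2 does not contain “天安门”.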

If you are interested in pinyin analysis, see the article “Elasticsearch: Pinyin Analyzer”.

Reference:

[1] github.com/medcl/elast…