Before we create an index and query data, note that the default analyzer handles Chinese poorly: it splits the text field into individual Chinese characters, and a search sentence is likewise segmented character by character. So we need a smarter analyzer for Chinese text: the IK analyzer.

Experimental environment

  • Operating system: CentOS 7
  • ES version: 7.10.0
  • IK: elasticsearch-analysis-ik-7.10.0.zip
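The IK plugin version must match the Elasticsearch version exactly, or the node will refuse to load the plugin. One quick way to confirm the running version (assuming ES listens on the default `localhost:9200`) is to query the cluster root endpoint:

```shell
# Query the cluster root endpoint; the "number" field under "version" should
# read 7.10.0, the same version as the IK zip listed above.
ES_HOST="http://localhost:9200"   # default host/port; adjust if yours differ
curl -s "$ES_HOST" | grep '"number"'
```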

IK analyzer: download, install, and test

download

Download the release file from the GitHub releases page:

```
https://github.com/medcl/elasticsearch-analysis-ik/releases/tag/v7.10.0
```
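On a headless server the same file can be fetched with `wget`; the URL below follows the repository's usual release-asset naming, which is an assumption worth double-checking against the releases page:

```shell
# Download the IK release whose version matches Elasticsearch exactly.
IK_VERSION="7.10.0"
IK_ZIP="elasticsearch-analysis-ik-${IK_VERSION}.zip"
wget "https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v${IK_VERSION}/${IK_ZIP}"
```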

Unpack

Copy the unpacked files into the `plugins/ik` directory under the ES installation directory:

```
[es@localhost ik]$ ll    # inside <es-install-dir>/plugins/ik
total 1432
-rw-r--r--. 1 es es 263965 May  6  2018 commons-codec-1.9.jar
-rw-r--r--. 1 es es  61829 May  6  2018 commons-logging-1.2.jar
drwxr-xr-x. 2 es es   4096 Dec 25  2019 config
-rw-r--r--. 1 es es  54625 Nov 12 10:01 elasticsearch-analysis-ik-7.10.0.jar
-rw-r--r--. 1 es es 736658 May  6  2018 httpclient-4.5.2.jar
-rw-r--r--. 1 es es 326724 May  6  2018 httpcore-4.4.4.jar
-rw-r--r--. 1 es es   1807 Nov 12 10:01 plugin-descriptor.properties
-rw-r--r--. 1 es es    125 Nov 12 10:01 plugin-security.policy
```
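The directory state above can be produced with steps along these lines (the install path is a placeholder; substitute your actual Elasticsearch directory):

```shell
# Unpack the IK zip straight into plugins/ik under the ES install directory.
# ES_HOME is a placeholder path, not a value from the original article.
ES_HOME="$HOME/elasticsearch-7.10.0"
mkdir -p "$ES_HOME/plugins/ik"
unzip -o elasticsearch-analysis-ik-7.10.0.zip -d "$ES_HOME/plugins/ik"
ls -l "$ES_HOME/plugins/ik"
```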

No changes to Elasticsearch's `elasticsearch.yml` configuration file are needed.

restart

Restart Elasticsearch so that it loads the plugin.
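After the restart you can confirm that the node actually picked the plugin up; `_cat/plugins` is the standard endpoint for this (default host/port assumed):

```shell
# List the plugins loaded by each node; once IK is installed correctly the
# output should contain a line ending in: analysis-ik 7.10.0
ES_HOST="http://localhost:9200"
PLUGINS=$(curl -s "$ES_HOST/_cat/plugins?v")
echo "$PLUGINS"
```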

test

  • Segmentation without the IK analyzer (default analyzer):

```
POST book/_analyze
{
  "text": "我是中国人"
}
```

The result:

```
{
  "tokens": [
    { "token": "我", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "是", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "中", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "国", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "人", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 }
  ]
}
```
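The same console request can be issued from the command line; note that the JSON body needs an explicit `Content-Type` header (the index name `book` and the default host are taken from the example above):

```shell
# POST the _analyze request with curl; equivalent to the console call above.
ES_HOST="http://localhost:9200"
BODY='{"text": "我是中国人"}'
curl -s -H 'Content-Type: application/json' -X POST "$ES_HOST/book/_analyze" -d "$BODY"
echo
```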
  • Segmentation with the IK analyzer:

```
POST book/_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国"
}
```

The results are as follows:

```
{
  "tokens": [
    { "token": "中华人民共和国", "start_offset": 0, "end_offset": 7, "type": "CN_WORD", "position": 0 },
    { "token": "中华人民", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 1 },
    { "token": "中华", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 2 },
    { "token": "华人", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 3 },
    { "token": "人民共和国", "start_offset": 2, "end_offset": 7, "type": "CN_WORD", "position": 4 },
    { "token": "人民", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 5 },
    { "token": "共和国", "start_offset": 4, "end_offset": 7, "type": "CN_WORD", "position": 6 },
    { "token": "共和", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 7 },
    { "token": "国", "start_offset": 6, "end_offset": 7, "type": "CN_CHAR", "position": 8 }
  ]
}
```

Explanation of the two segmentation results above:

  1. If the IK analyzer is not installed, a request that specifies `"analyzer": "ik_max_word"` fails with an error, because Elasticsearch does not know that analyzer.
  2. If the IK analyzer is installed but the request does not specify `"analyzer": "ik_max_word"`, the default analyzer is used, and the result is the same as without IK: the text is split into individual Chinese characters.
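In practice the analyzer is set once in the index mapping rather than passed to every request. A minimal sketch follows; the index name `book2` and field name `content` are illustrative, and pairing `ik_max_word` for indexing with `ik_smart` for search is a commonly recommended combination:

```shell
# Create an index whose "content" field is analyzed with ik_max_word when
# documents are indexed and with ik_smart when queries are analyzed.
ES_HOST="http://localhost:9200"
MAPPING='{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}'
curl -s -H 'Content-Type: application/json' -X PUT "$ES_HOST/book2" -d "$MAPPING"
echo
```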

The segmentation modes of the IK analyzer

ik_max_word: splits the text at the finest granularity. For example, "中华人民共和国国歌" (the national anthem of the People's Republic of China) is split into "中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和, 和, 国国, 国歌", exhausting every possible combination.

ik_smart: does the coarsest-grained split. For example, "中华人民共和国国歌" is split into "中华人民共和国, 国歌".
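The two modes are easy to compare by rerunning the earlier `_analyze` call with the analyzer name switched:

```shell
# Analyze the anthem example with ik_smart; per the description above, the
# expected tokens are just 中华人民共和国 and 国歌.
ES_HOST="http://localhost:9200"
BODY='{"analyzer": "ik_smart", "text": "中华人民共和国国歌"}'
curl -s -H 'Content-Type: application/json' -X POST "$ES_HOST/book/_analyze" -d "$BODY"
echo
```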