Elasticsearch’s built-in standard analyzer is not friendly to Chinese: it splits Chinese text character by character and cannot form multi-character words. For example:
POST /_analyze
{
  "text": "我爱北京天安门",
  "analyzer": "standard"
}
If we use the standard analyzer, the result is:
{" tokens ": [{" token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "< IDEOGRAPHIC >", "position" : 0}, {" token ":" love ", "start_offset" : 1, "end_offset" : 2, "type" : "< IDEOGRAPHIC >", "position" : 1}, {... "token" : "Door", "start_offset" : 6, "end_offset" : 7, "type" : "< IDEOGRAPHIC >", "position" : 6}]}Copy the code
Obviously this is not friendly to Chinese: every Chinese character becomes its own token. Thankfully, Medcl of Elastic has already built the IK Chinese analyzer for us. Here is a detailed walkthrough of how to install and use it. The detailed installation instructions can also be found at github.com/medcl/elast…
Installation
First, check the releases page to see which versions are available:
Github.com/medcl/elast…
As of this writing, the latest release is v7.3.1. You need to install the IK plugin version that matches the version of your Elasticsearch distribution.
So let’s go directly to our Elasticsearch installation directory and type the following command:
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.3.1/elasticsearch-analysis-ik-7.3.1.zip
Replace the 7.3.1 above with the version that matches your own Elasticsearch installation.
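If you are not sure which version of Elasticsearch you are running, the root endpoint reports it. The version.number field in the response is the number the IK release has to match (shown here as a Kibana console request; curl localhost:9200 gives the same answer for a default local install):
GET /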
Once the plugin is installed, we can verify the installation with the following command:
localhost:elasticsearch-7.3.0 liuxg$ ./bin/elasticsearch-plugin list
analysis-ik
The output above shows that the analysis-ik plugin has been installed successfully.
At this point we need to restart our Elasticsearch so that the plugin can be loaded.
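How exactly you restart depends on how Elasticsearch was started. As a rough sketch, assuming either a tarball install run in the foreground or a deb/rpm install managed by systemd:
# Tarball install run in the foreground: stop it with Ctrl+C, then start it again
./bin/elasticsearch

# deb/rpm install managed by systemd
sudo systemctl restart elasticsearch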
Using the IK analyzer
According to the IK plugin’s documentation, it provides the following analyzers and tokenizers; the analyzers can be used directly, and the tokenizers can be wired into a custom analyzer of your own, as sketched right after this list:
Analyzer:
- ik_smart
- ik_max_word
Tokenizer:
- ik_smart
- ik_max_word
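Here is a minimal sketch of such a custom analyzer built on the ik_max_word tokenizer; the index name my_ik_index, the analyzer name my_ik_analyzer, and the added lowercase filter are illustrative choices, not something the IK documentation prescribes:
PUT my_ik_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_analyzer": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}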
Let’s check the sentence from before, "我爱北京天安门" ("I love Beijing Tiananmen"), with the ik_smart analyzer:
POST _analyze
{
  "text": "我爱北京天安门",
  "analyzer": "ik_smart"
}
The ik_smart analyzer gives the following result:
{" tokens ": [{" token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0}, {" token ": "Love", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1}, {" token ":" Beijing ", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2}, {" token ":" tiananmen square ", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 3 } ] }Copy the code
Then let’s try the same sentence with the ik_max_word analyzer:
POST _analyze
{
  "text": "我爱北京天安门",
  "analyzer": "ik_max_word"
}
The ik_max_word analyzer gives the following result:
{" tokens ": [{" token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0}, {" token ": "Love", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1}, {" token ":" Beijing ", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2}, {" token ":" tiananmen square ", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 3}, {" token ":" day ", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 4}, {" token ":" door ", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 5}]}Copy the code
Comparing the two outputs, the difference is granularity: ik_smart segments at a coarser granularity and returns fewer tokens, while ik_max_word segments at a finer granularity and returns more tokens.
Next we create an index:
PUT chinese
Next, let’s create a mapping for this index:
PUT /chinese/_mapping
{
"properties": {
"content": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
}
}
}
After the preceding command is executed, the following information is displayed:
{
"acknowledged" : true
}
This indicates that the mapping was created successfully.
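As an aside, the index and its mapping could also be created in a single request. A sketch of the equivalent call (it assumes the chinese index does not already exist, so you would either delete it first or use a different index name):
PUT chinese
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}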
Next, let’s check how our sentence is analyzed against this index:
GET chinese/_analyze
{
  "text": "我爱北京天安门",
  "analyzer": "ik_max_word"
}
The result displayed is:
{" tokens ": [{" token" : "I", "start_offset" : 0, "end_offset" : 1, "type" : "CN_CHAR", "position" : 0}, {" token ": "Love", "start_offset" : 1, "end_offset" : 2, "type" : "CN_CHAR", "position" : 1}, {" token ":" Beijing ", "start_offset" : 2, "end_offset" : 4, "type" : "CN_WORD", "position" : 2}, {" token ":" tiananmen square ", "start_offset" : 4, "end_offset" : 7, "type" : "CN_WORD", "position" : 3}, {" token ":" day ", "start_offset" : 4, "end_offset" : 6, "type" : "CN_WORD", "position" : 4}, {" token ":" door ", "start_offset" : 6, "end_offset" : 7, "type" : "CN_CHAR", "position" : 5}]}Copy the code
We can see from the results above that the tokens include "北京" (Beijing), "天安" (Tian’an), and "天安门" (Tiananmen). This is quite different from the character-by-character output of the standard analyzer.
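As a small aside, instead of naming the analyzer explicitly we can point _analyze at the content field, and Elasticsearch will use the analyzer configured for that field in the mapping. A sketch:
GET chinese/_analyze
{
  "field": "content",
  "text": "我爱北京天安门"
}
Because content was mapped with "analyzer": "ik_max_word", this returns the same tokens as above.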
Next, let’s index two documents:
PUT chinese/_doc/1
{
  "content": "我爱北京天安门"
}

PUT chinese/_doc/2
{
  "content": "北京，你好"
}
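The same two documents could also be indexed in a single round trip with the bulk API; a sketch of the equivalent request:
POST _bulk
{ "index" : { "_index" : "chinese", "_id" : "1" } }
{ "content" : "我爱北京天安门" }
{ "index" : { "_index" : "chinese", "_id" : "2" } }
{ "content" : "北京，你好" }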
So we can search in the following way:
GET chinese/_search
{
  "query": {
    "match": {
      "content": "北京"
    }
  }
}
The result is:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 2, the "base" : "eq"}, "max_score" : 0.15965709, "hits" : [{" _index ":" Chinese ", "_type" : "_doc", "_id" : "2", "_score" : 0.15965709, "_source" : {" content ":" Beijing, how are you "}}, {" _index ":" Chinese ", "_type" : "_doc", "_id" : "1", "_score" : 0.100605845, "_source" : {" content ":" I love Beijing tiananmen "}}}}]Copy the code
Since both documents contain "北京" (Beijing), both of them are returned.
Let’s also do another search:
GET chinese/_search
{
  "query": {
    "match": {
      "content": "天安门"
    }
  }
}
The result is:
{ "took" : 0, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 1, the "base" : "eq"}, "max_score" : 0.73898095, "hits" : [{" _index ":" Chinese ", "_type" : "_doc", "_id" : "1", "_score" : 0.73898095, "_source" : {" content ":" I love Beijing tiananmen "}}}}]Copy the code
Because "天安门" (Tiananmen) appears only in the first document, only one result is returned.
Let’s do another search:
GET chinese/_search
{
  "query": {
    "match": {
      "content": "北京天安门"
    }
  }
}
Here we search for "北京天安门" (Beijing Tiananmen). Notice that in the mapping we defined
"search_analyzer": "ik_smart"
That is, the search analyzer ik_smart splits "北京天安门" into the two terms "北京" (Beijing) and "天安门" (Tiananmen), and these two terms are used for the search. By default, a match query combines its terms with OR, so a document matches if it contains either "北京" or "天安门":
{ "took" : 3, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : {" total ": {" value" : 2, the "base" : "eq"}, "max_score" : 0.7268042, "hits" : [{" _index ":" Chinese ", "_type" : "_doc", "_id" : "1", "_score" : 0.7268042, "_source" : {" content ":" I love Beijing tiananmen "}}, {" _index ":" Chinese ", "_type" : "_doc", "_id" : "2", "_score" : 0.22920427, "_source" : {" content ":" Beijing, how are you "}}}}]Copy the code
The results above show that both documents match, and that "我爱北京天安门" ("I love Beijing Tiananmen") is the best result.
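If we wanted both terms to be required instead of either one, the match query accepts an operator parameter. A sketch (with "operator": "and", a document has to contain both "北京" and "天安门" to match, so only the first document would be returned):
GET chinese/_search
{
  "query": {
    "match": {
      "content": {
        "query": "北京天安门",
        "operator": "and"
      }
    }
  }
}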
If you are interested in pinyin analysis, see the article “Elasticsearch: Pinyin Analyzer”.
Reference:
【 1 】 github.com/medcl/elast…