The Smart Chinese Analysis plugin integrates Lucene's Smart Chinese analysis module into Elasticsearch for analyzing Chinese or mixed Chinese-English text. The analyzer uses probabilistic knowledge, based on a Hidden Markov Model trained on a large corpus, to find the optimal word segmentation for Simplified Chinese text. Its strategy is to first break the input text into sentences and then segment each sentence into words. The plugin provides an analyzer called smartcn and a tokenizer called smartcn_tokenizer. Note that neither can be configured with any parameters.
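To make the "sentences first, then words" strategy more concrete, here is a minimal sketch in Python. It is not the plugin's Hidden Markov Model: it swaps in a much simpler dictionary-based dynamic program that picks the segmentation with the highest product of word probabilities. The probability table, the constants, and all function names are made up for illustration.

```python
import math
import re

# Hypothetical toy probabilities; a real model is estimated from a large corpus.
WORD_PROB = {
    "我": 0.05, "爱": 0.03, "北京": 0.02, "北": 0.005, "京": 0.002,
    "天安门": 0.01, "天": 0.01, "安": 0.005, "门": 0.008,
}
UNKNOWN_PROB = 1e-8  # fallback for single characters missing from the table
MAX_WORD_LEN = 4     # longest candidate word we try


def split_sentences(text):
    """Step 1: break the input into sentences at common punctuation."""
    return [s for s in re.split(r"[。！？；，,.!?;\s]+", text) if s]


def segment_sentence(sentence):
    """Step 2: pick the word sequence with the highest total log-probability."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)  # best[i]: best score for sentence[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = sentence[start:end]
            prob = WORD_PROB.get(word)
            if prob is None:
                if len(word) > 1:
                    continue      # unknown multi-character strings are not words
                prob = UNKNOWN_PROB
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return words[::-1]


def segment(text):
    """Split into sentences, then segment each sentence into words."""
    return [w for s in split_sentences(text) for w in segment_sentence(s)]
```

With the toy table above, segment("我爱北京天安门。") returns ['我', '爱', '北京', '天安门']: the two-character word 北京 wins over 北 + 京 because its probability is higher than the product of the two single characters.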

To install the SmartCN analysis plugin, either on a local installation or inside an Elasticsearch Docker container, use the command below. We then restart Elasticsearch (or the container) for the plugin to take effect:

./bin/elasticsearch-plugin install analysis-smartcn

Run the above command in the Elasticsearch installation directory. The following information is displayed:

$ ./bin/elasticsearch-plugin install analysis-smartcn
-> Downloading analysis-smartcn from elastic
[=================================================] 100%
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.bouncycastle.jcajce.provider.drbg.DRBG (file:/Users/liuxg/elastic/elasticsearch-7.3.0/lib/tools/plugin-cli/bcprov-jdk15on-1.61.jar) to constructor sun.security.provider.Sun()
WARNING: Please consider reporting this to the maintainers of org.bouncycastle.jcajce.provider.drbg.DRBG
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
-> Installed analysis-smartcn
(base) localhost:elasticsearch-7.3.0 liuxg$ ./bin/elasticsearch-plugin list
analysis-icu
analysis-ik
analysis-smartcn
pinyin

The output shows that analysis-smartcn was installed successfully. For a Docker deployment, we first open a shell inside the container with the following command, and then run the installation command there:

$ docker exec -it es01 /bin/bash
[root@ec4d19f59a7d elasticsearch]# ls
LICENSE.txt  README.textile  config  jdk  logs     plugins
NOTICE.txt   bin             data    lib  modules
[root@ec4d19f59a7d elasticsearch]# 

Here, es01 is the name of the Elasticsearch container running in Docker. See my article “Elastic: Deploying an Elastic Stack with Docker” for details on the installation.
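The Docker steps can also be scripted instead of typed into an interactive shell. The sketch below is only an illustration: the helper names are made up, it assumes the container is called es01 as above, and it assumes the container's working directory is the Elasticsearch home (as the ls output suggests), so that the relative path bin/elasticsearch-plugin resolves. By default it only prints the commands.

```python
import subprocess

CONTAINER = "es01"            # container name used in this article
PLUGIN = "analysis-smartcn"


def install_command(container=CONTAINER, plugin=PLUGIN):
    """Build the command that installs the plugin inside the container."""
    # --batch answers any confirmation prompts automatically.
    return ["docker", "exec", container,
            "bin/elasticsearch-plugin", "install", "--batch", plugin]


def restart_command(container=CONTAINER):
    """Build the command that restarts the container so the plugin is loaded."""
    return ["docker", "restart", container]


def install_and_restart(dry_run=True):
    """Print both commands; execute them only when dry_run is False."""
    for cmd in (install_command(), restart_command()):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    install_and_restart(dry_run=True)
```

Calling install_and_restart(dry_run=False) runs the two commands one after the other and stops with an exception if either of them fails.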

Note: After installing the SmartCN plugin, we have to restart Elasticsearch for it to take effect.
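After the restart, one way to confirm that the plugin was actually loaded is to ask the cluster through the _cat/plugins API. The following is a minimal sketch: the URL assumes an unsecured node on localhost:9200, and the function names are made up for illustration.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node without security enabled


def fetch_plugins(base_url=ES_URL):
    """Return the rows of GET _cat/plugins as a list of dicts."""
    with urllib.request.urlopen(f"{base_url}/_cat/plugins?format=json") as resp:
        return json.load(resp)


def nodes_with_plugin(rows, plugin="analysis-smartcn"):
    """Names of the nodes that report the given plugin as installed."""
    return sorted({row["name"] for row in rows if row.get("component") == plugin})


if __name__ == "__main__":
    print(nodes_with_plugin(fetch_plugins()))
```

An empty list means that no node has loaded the plugin yet, which usually indicates that the restart is still missing.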

Example

Below, we run a request in Kibana to demonstrate how the analyzer is used:

POST _analyze
{
  "text": "<Chinese sample text>",
  "analyzer": "smartcn"
}

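The result is displayed below. The token values appear here in English translation; the actual response contains the original Chinese words: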

{
  "tokens" : [
    { "token" : "stock market", "start_offset" : 0, "end_offset" : 2, "type" : "word", "position" : 0 },
    { "token" : "investment", "start_offset" : 3, "end_offset" : 5, "type" : "word", "position" : 2 },
    { "token" : "stable", "start_offset" : 6, "end_offset" : 7, "type" : "word", "position" : 4 },
    { "token" : "make", "start_offset" : 8, "end_offset" : 9, "type" : "word", "position" : 6 },
    { "token" : "no", "start_offset" : 10, "end_offset" : 11, "type" : "word", "position" : 8 },
    { "token" : "lost", "start_offset" : 12, "end_offset" : 13, "type" : "word", "position" : 10 },
    { "token" : "required", "start_offset" : 14, "end_offset" : 17, "type" : "word", "position" : 12 },
    { "token" : "how", "start_offset" : 18, "end_offset" : 20, "type" : "word", "position" : 14 },
    { "token" : "do", "start_offset" : 21, "end_offset" : 22, "type" : "word", "position" : 16 },
    { "token" : "good", "start_offset" : 23, "end_offset" : 24, "type" : "word", "position" : 18 },
    { "token" : "position", "start_offset" : 25, "end_offset" : 26, "type" : "word", "position" : 20 },
    { "token" : "a", "start_offset" : 27, "end_offset" : 28, "type" : "word", "position" : 22 },
    { "token" : "management", "start_offset" : 29, "end_offset" : 31, "type" : "word", "position" : 24 },
    { "token" : "and", "start_offset" : 32, "end_offset" : 33, "type" : "word", "position" : 26 },
    { "token" : "emotion", "start_offset" : 34, "end_offset" : 36, "type" : "word", "position" : 28 },
    { "token" : "management", "start_offset" : 37, "end_offset" : 39, "type" : "word", "position" : 30 }
  ]
}
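The same request can be sent from a script instead of Kibana. The sketch below uses only the Python standard library; the URL (an unsecured node on localhost:9200), the sample sentence, and the function names are assumptions for illustration and are not taken from the request above.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node without security enabled


def build_analyze_body(text, analyzer="smartcn"):
    """Request body for POST _analyze using the smartcn analyzer."""
    return {"analyzer": analyzer, "text": text}


def analyze(text, analyzer="smartcn", base_url=ES_URL):
    """Send the _analyze request and return the parsed JSON response."""
    data = json.dumps(build_analyze_body(text, analyzer)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/_analyze",
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def token_spans(response):
    """Reduce a response to (token, start_offset, end_offset) tuples."""
    return [(t["token"], t["start_offset"], t["end_offset"])
            for t in response["tokens"]]


if __name__ == "__main__":
    # Any Chinese sentence works here; this one is just a sample.
    for span in token_spans(analyze("我爱北京天安门")):
        print(span)
```

The token_spans helper keeps only the fields that matter when checking a segmentation by eye: the word itself and the character offsets it covers in the input.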