The HanLP Chinese tokenizer plugin is an open-source analysis plugin for Elasticsearch. It is built on HanLP and provides most of HanLP's segmentation modes. Its source code is located at:
Github.com/KennFalcon/…
Plugin releases have tracked the corresponding Elasticsearch releases since version 5.2.2.
1. Installation
1) Method 1:
A. Download the installation package for the corresponding release. The latest release packages can be downloaded from Baidu Netdisk (link: pan.baidu.com/s/1mFPNJXgi… password: i0o7)
B. Run the following command to install the plugin, where ${PATH} is the absolute path of the plugin package:
./bin/elasticsearch-plugin install file://${PATH}
2) Method 2:
A. Install directly with the elasticsearch-plugin script:
./bin/elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.4.2/elasticsearch-analysis-hanlp-7.4.2.zip
After the installation, we can verify that it succeeded as follows:
$ ./bin/elasticsearch-plugin list
analysis-hanlp
If the installation succeeded, we will see the output above.
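Alternatively, once the node is running, the installed plugins can also be listed through Elasticsearch's cat API (the node name in the output will vary per cluster):

```console
GET _cat/plugins?v
```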
2. Installation package
The release package contains the default segmentation data from the HanLP source. For the complete data package, see HanLP Releases.
Package directory: ES_HOME/plugins/analysis-hanlp
Note: Because some user-defined dictionary files in the original data package are named in Chinese, hanlp.properties has been changed to use English file names. Please rename the files accordingly.
3. Restart Elasticsearch
Note: ES_HOME above refers to your Elasticsearch installation path; an absolute path is required.
This step is important: without a restart, the newly installed tokenizer will not take effect.
4. Hot update
In this version, dictionary hot update has been added. The steps are as follows:
A. In the ES_HOME/plugins/analysis-hanlp/data/dictionary/custom directory, add a custom dictionary
B. Modify hanlp.properties, updating CustomDictionaryPath to add the custom dictionary configuration
C. Wait about 1 minute; the dictionary is loaded automatically
Note: Each node needs to make the above changes
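The CustomDictionaryPath change in step B can be sketched as follows. This is a minimal excerpt, not the full file, and mydict.txt is a hypothetical dictionary file name; HanLP separates dictionary entries with semicolons:

```properties
# hanlp.properties (excerpt) -- a minimal sketch, not the full file.
# Entries are separated by semicolons; mydict.txt is a hypothetical custom
# dictionary placed under data/dictionary/custom/.
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; data/dictionary/custom/mydict.txt;
```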
5. Description of the provided tokenizers
- hanlp: the HanLP default tokenizer
- hanlp_standard: the standard tokenizer
- hanlp_index: the index tokenizer
- hanlp_nlp: the NLP tokenizer
- hanlp_n_short: the N-shortest-path tokenizer
- hanlp_dijkstra: the Dijkstra shortest-path tokenizer
- hanlp_crf: the CRF tokenizer (latest form available)
- hanlp_speed: the extreme-speed dictionary tokenizer
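As a sketch of wiring one of these into an index mapping (my_index and content are hypothetical names; this assumes the plugin registers the names above as analyzers as well as tokenizers, as the _analyze example below suggests):

```console
PUT my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "hanlp_index",
        "search_analyzer": "hanlp"
      }
    }
  }
}
```

A common pattern is to index with the finer-grained hanlp_index analyzer and search with the coarser hanlp analyzer.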
Let’s do a simple example:
GET _analyze
{
  "text": "美国阿拉斯加州发生8.0级地震",
  "tokenizer": "hanlp"
}
Then the result displayed is:
{" tokens ": [{" token" : "the United States", "start_offset" : 0, "end_offset" : 2, "type" : "NSF", "position" : 0}, {" token ": "Alaska", "start_offset" : 2, "end_offset" : 7, "type" : "the NSF", "position" : 1}, {" token ":" ", "start_offset" : 9, 7, "end_offset" : "type" : "v", "position" : 2}, {" token ":" 8.0 ", "start_offset" : 9, "end_offset" : 12, "type" : "M", "position" : 3}, {" token ":" grade ", "start_offset" : 12, "end_offset" : 13, "type" : "q", "position" : 4}, {" token ":" earthquake ", "start_offset" : 13, "end_offset" : 15, "type" : "n", "position" : 5}]}Copy the code
For more details, see github.com/KennFalcon/…