The HanLP Chinese tokenizer plugin is an open-source analysis plugin for Elasticsearch. It is built on HanLP and provides most of HanLP's segmentation modes. Its source code is located at:
Github.com/KennFalcon/…
Plugin releases have tracked the corresponding Elasticsearch releases since version 5.2.2.
1. Installation
1) Method 1:
A. Download the installation package for the corresponding release. The latest release packages can be downloaded from Baidu Netdisk (link: pan.baidu.com/s/1mFPNJXgi… password: i0o7)
B. Run the following command to install the plugin, where ${PATH} is the absolute path of the plugin package:
./bin/elasticsearch-plugin install file://${PATH}
2) Method 2:
A. Install directly with the elasticsearch-plugin script:
./bin/elasticsearch-plugin install https://github.com/KennFalcon/elasticsearch-analysis-hanlp/releases/download/v7.4.2/elasticsearch-analysis-hanlp-7.4.2.zip
After the installation, we can verify that it succeeded as follows:
$ ./bin/elasticsearch-plugin list
analysis-hanlp
If the installation succeeded, we will see the output above.
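Alternatively, once the node is running, the installed plugins can also be listed through Elasticsearch's cat API (the node name in the output will vary per cluster):

```console
GET _cat/plugins?v
```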
2. Installation package
The release package contains the default segmentation data from the HanLP source. For the complete data package, see HanLP Releases.
Package directory: ES_HOME/plugins/analysis-hanlp
Note: Because some user-defined dictionary files in the original data package are named in Chinese, hanlp.properties has been changed to use English file names. Please rename the files accordingly.
3. Restart Elasticsearch
Note: ES_HOME above refers to your Elasticsearch installation path; an absolute path is required.
This step is important: without a restart, the newly installed tokenizer will not take effect.
4. Hot update
In this version, dictionary hot update has been added. The steps are as follows:
A. In the ES_HOME/plugins/analysis-hanlp/data/dictionary/custom directory, add a custom dictionary
B. Modify hanlp.properties, updating CustomDictionaryPath to add the custom dictionary configuration
C. Wait about 1 minute; the dictionary is loaded automatically
Note: Each node needs to make the above changes
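The CustomDictionaryPath change in step B can be sketched as follows. This is a minimal excerpt, not the full file, and mydict.txt is a hypothetical dictionary file name; HanLP separates dictionary entries with semicolons:

```properties
# hanlp.properties (excerpt) -- a minimal sketch, not the full file.
# Entries are separated by semicolons; mydict.txt is a hypothetical custom
# dictionary placed under data/dictionary/custom/.
CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt; data/dictionary/custom/mydict.txt;
```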
5. Description of the provided tokenizers
- hanlp: the HanLP default tokenizer
- hanlp_standard: the standard tokenizer
- hanlp_index: the index tokenizer
- hanlp_nlp: the NLP tokenizer
- hanlp_n_short: the N-shortest-path tokenizer
- hanlp_dijkstra: the Dijkstra shortest-path tokenizer
- hanlp_crf: the CRF tokenizer (latest form available)
- hanlp_speed: the extreme-speed dictionary tokenizer
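As a sketch of wiring one of these into an index mapping (my_index and content are hypothetical names; this assumes the plugin registers the names above as analyzers as well as tokenizers, as the _analyze example below suggests):

```console
PUT my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "hanlp_index",
        "search_analyzer": "hanlp"
      }
    }
  }
}
```

A common pattern is to index with the finer-grained hanlp_index analyzer and search with the coarser hanlp analyzer.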
Let’s do a simple example:
GET _analyze
{
  "text": "美国阿拉斯加州发生8.0级地震",
  "tokenizer": "hanlp"
}
Then the result displayed is:
{" tokens ": [{" token" : "the United States", "start_offset" : 0, "end_offset" : 2, "type" : "NSF", "position" : 0}, {" token ": "Alaska", "start_offset" : 2, "end_offset" : 7, "type" : "the NSF", "position" : 1}, {" token ":" ", "start_offset" : 9, 7, "end_offset" : "type" : "v", "position" : 2}, {" token ":" 8.0 ", "start_offset" : 9, "end_offset" : 12, "type" : "M", "position" : 3}, {" token ":" grade ", "start_offset" : 12, "end_offset" : 13, "type" : "q", "position" : 4}, {" token ":" earthquake ", "start_offset" : 13, "end_offset" : 15, "type" : "n", "position" : 5}]}Copy the code
For more details, see github.com/KennFalcon/…