1. Built-in analyzers
- You can send a GET request to the _analyze API, specifying the analyzer and the text to be analyzed
- The standard analyzer splits at the finest granularity; for Chinese text it emits one token per character
GET _analyze
{
  "analyzer": "standard",
  "text": ["中国人ABC"]
}
- The results of the analysis
{
  "tokens" : [
    {
      "token" : "中",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "国",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "人",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "abc",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
- The keyword analyzer treats the entire input as a single keyword and does not split it
GET _analyze
{
  "analyzer": "keyword",
  "text": ["中国人ABC"]
}
- The results of the analysis
{
  "tokens" : [
    {
      "token" : "中国人ABC",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    }
  ]
}
2. The IK analyzer
2.1 IK analyzer overview
- The IK analyzer provides two tokenization algorithms: ik_smart and ik_max_word
- ik_smart: the coarsest split, producing the fewest tokens
- ik_max_word: the finest-grained split, producing every plausible word
2.2 Installing the IK analyzer
- Download from github.com/medcl/elast…
- Note: the plugin version must match your Elasticsearch version
- Unzip the IK analyzer into the Elasticsearch plugins directory, into a folder named ik (see the sketch below)
- Restart Elasticsearch to finish the installation; the startup log will show the plugin being loaded
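A minimal install sketch, assuming the Elasticsearch home directory is ./elasticsearch and an IK 7.17.0 release zip has already been downloaded; the version and paths are placeholders, adjust them to your setup:
# Paths and version are assumptions; the zip version must match your ES version
mkdir -p ./elasticsearch/plugins/ik
unzip elasticsearch-analysis-ik-7.17.0.zip -d ./elasticsearch/plugins/ik
# Restart Elasticsearch; the startup log should report the loaded plugin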
2.3 Using the IK analyzer
- You can test either tokenizer with the _analyze API
GET _analyze
{
  "analyzer": "<ik_smart or ik_max_word>",
  "text": "我是中国人码坐标"
}
2.3.1 ik_smart
- The coarsest split
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人码坐标"
}
- The tokenization result
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "码",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "坐标",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
2.3.2 ik_max_word
- The most fine-grained split
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人码坐标"
}
- The tokenization result
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "码",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "坐标",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}
2.4 Custom dictionary
- Create a new .dic file under elasticsearch/plugins/ik/config, here named codecoord.dic
- Edit codecoord.dic and add your entries, one word per line; each entry is treated as a single word by the tokenizer and will not be split further
- Edit ik/config/IKAnalyzer.cfg.xml and add the new codecoord.dic to the ext_dict entry, separating multiple dictionaries with semicolons (a sketch of the file follows the result below)
- After restarting Elasticsearch, re-running the _analyze request above now keeps the dictionary entry as a single token
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "码坐标",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "坐标",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
2.5 Searching with the IK analyzer
- Create an index and specify the analyzers: ik_max_word at index time for maximum recall, and the coarser ik_smart at search time
PUT index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
- Index a few documents
POST index/_doc/1
{
  "content": "美国留给伊拉克的是个烂摊子吗"
}
POST index/_doc/2
{
  "content": "公安部：各地校车将享最高路权"
}
POST index/_doc/3
{
  "content": "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
}
POST index/_doc/4
{
  "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
}
- Search, specifying the highlight settings
GET index/_search
{
  "query": {
    "match": {
      "content": "中国"
    }
  },
  "highlight": {
    "pre_tags": ["<tag1>", "<tag2>"],
    "post_tags": ["</tag1>", "</tag2>"],
    "fields": {
      "content": {}
    }
  }
}
- The highlighted fragments are returned under the highlight key
{
  "took" : 50,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.642793,
    "hits" : [
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.642793,
        "_source" : {
          "content" : "中韩渔警冲突调查：韩警平均每天扣1艘中国渔船"
        },
        "highlight" : {
          "content" : [
            "中韩渔警冲突调查：韩警平均每天扣1艘<tag1>中国</tag1>渔船"
          ]
        }
      },
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.642793,
        "_source" : {
          "content" : "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
        },
        "highlight" : {
          "content" : [
            "<tag1>中国</tag1>驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
          ]
        }
      }
    ]
  }
}