One, built-in analyzers

  1. Send a GET request to the _analyze endpoint, specifying the analyzer and the text to be analyzed
  2. Standard analyzer (standard): splits at the smallest granularity; for Chinese text this means one token per character
GET _analyze
{
  "analyzer": "standard",
  "text": ["中国人abc"]
}
  • The analysis result
{
  "tokens" : [
    {
      "token" : "中",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "国",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "人",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "abc",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
  3. Keyword analyzer (keyword): does not split at all; the entire input becomes a single token
GET _analyze
{
  "analyzer": "keyword",
  "text": ["中国人abc"]
}
  • The analysis result
{
  "tokens" : [
    {
      "token" : "中国人abc",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    }
  ]
}

Two, the IK analyzer

2.1 IK analyzer overview

  1. The IK analyzer provides two segmentation modes: ik_smart and ik_max_word
    • ik_smart: coarsest split (fewest tokens)
    • ik_max_word: finest-grained split (most tokens)

2.2 Installing the IK analyzer

  1. Download from github.com/medcl/elast…
  2. Note: the plugin version must match your Elasticsearch version
  3. Unzip the IK analyzer into es/plugins, naming the folder ik
  4. Restart Elasticsearch; installation is complete when the plugin's loading information appears in the startup log

2.3 Using the IK analyzer

  1. Test either mode with _analyze; the sample text 我是中国人码坐标 is "I am Chinese" followed by 码坐标 ("code coordinates", used below as a custom dictionary word)
GET _analyze
{
  "analyzer": "<ik_smart | ik_max_word>",
  "text": "我是中国人码坐标"
}
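The same test can be driven from code. A minimal sketch that builds the _analyze request body; the endpoint URL (assumed here to be a local instance on the default port 9200) would be http://localhost:9200/_analyze, adjusted to your setup.

```python
import json

# Build the JSON body for GET _analyze with one of the two IK modes
# (sketch; send it to your cluster's /_analyze endpoint yourself).
def analyze_request(analyzer, text):
    if analyzer not in ("ik_smart", "ik_max_word"):
        raise ValueError("expected an IK analyzer name")
    return json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)

print(analyze_request("ik_smart", "我是中国人码坐标"))
```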

2.3.1 ik_smart

  • Coarsest split (fewest tokens)
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人码坐标"
}
  • The split result
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "码",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "坐标",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}

2.3.2 ik_max_word

  • Finest-grained split
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人码坐标"
}
  • The split result
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "码",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "坐标",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}

2.4 Custom dictionary

  1. Create a new .dic file under elasticsearch/plugins/ik/config, for example codecoord.dic here
  2. Edit codecoord.dic and add words to it, one per line; each entry is treated by the analyzer as a whole word and will not be split

  3. Edit the ik/config/IKAnalyzer.cfg.xml file and register the newly created codecoord.dic in the ext_dict entry; multiple dictionary files can be listed together
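For reference, the ext_dict entry in IKAnalyzer.cfg.xml looks roughly like this (a sketch based on the plugin's stock config file; check the comments in your own copy for the exact layout):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- register the custom dictionary created above -->
    <entry key="ext_dict">codecoord.dic</entry>
</properties>
```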

  4. After restarting Elasticsearch, re-running the analysis shows the dictionary entry returned as a single token
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "码坐标",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "坐标",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}

2.5 Querying with the IK analyzer

  1. Create an index and specify the analyzers: ik_max_word at index time stores the most sub-words for recall, while ik_smart at query time keeps search terms coarse
PUT index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
  2. Create documents
POST index/_doc/1
{
	"content": "Is Iraq a mess?"
}

POST index/_doc/2
{
	"content": "Ministry of Public Security: School buses will enjoy the highest right of way"
}

POST index/_doc/3
{
	"content": "Investigation into China-ROK fishery police clash: Rok police detain 1 Chinese fishing boat on average every day"
}

POST index/_doc/4
{
	"content": "Suspect in Shooting of Chinese Consulate in Los Angeles turns himself in"
}
  3. Specify highlighting information in the search
GET index/_search
{
  "query": {
    "match": {
      "content": "China"
    }
  },
  "highlight": {
    "pre_tags": ["<tag1>", "<tag2>"],
    "post_tags": ["</tag1>", "</tag2>"],
    "fields": {
      "content": {}
    }
  }
}
  4. The highlighted fragments are returned in the highlight field
{
  "took" : 50,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.642793,
    "hits" : [
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.642793,
        "_source" : {
          "content" : "Investigation into China-ROK fishery police clash: ROK police detain 1 Chinese fishing boat on average every day"
        },
        "highlight" : {
          "content" : [
            "Investigation into China-ROK fishery police clash: ROK police detain 1 <tag1>Chinese</tag1> fishing boat on average every day"
          ]
        }
      },
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.642793,
        "_source" : {
          "content" : "Suspect in Shooting of Chinese Consulate in Los Angeles turns himself in"
        },
        "highlight" : {
          "content" : [
            "Suspect in Shooting of <tag1>Chinese</tag1> Consulate in Los Angeles turns himself in"
          ]
        }
      }
    ]
  }
}