One, built-in analyzers

  1. Send a GET request to the _analyze endpoint, specifying the analyzer and the text to be analyzed
  2. Standard analyzer (standard): splits at the smallest granularity; for Chinese text this means one token per character
GET _analyze
{
  "analyzer": "standard",
  "text": ["中国人abc"]
}
  • The analysis result
{
  "tokens" : [
    {
      "token" : "中",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "国",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "人",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "abc",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}
  3. Keyword analyzer (keyword): does not split at all; the entire input becomes a single token
GET _analyze
{
  "analyzer": "keyword",
  "text": ["中国人abc"]
}
  • The analysis result
{
  "tokens" : [
    {
      "token" : "中国人abc",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    }
  ]
}

Two, the IK analyzer

2.1 IK analyzer overview

  1. The IK analyzer provides two segmentation modes: ik_smart and ik_max_word
    • ik_smart: coarsest split (fewest tokens)
    • ik_max_word: finest-grained split (most tokens)

2.2 Installing the IK analyzer

  1. Download from github.com/medcl/elast…
  2. Note: the plugin version must match your Elasticsearch version
  3. Unzip the IK analyzer into es/plugins, naming the folder ik
  4. Restart Elasticsearch; installation is complete when the plugin's loading information appears in the startup log

2.3 Using the IK analyzer

  1. Test either mode with _analyze; the sample text 我是中国人码坐标 is "I am Chinese" followed by 码坐标 ("code coordinates", used below as a custom dictionary word)
GET _analyze
{
  "analyzer": "<ik_smart | ik_max_word>",
  "text": "我是中国人码坐标"
}
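The same test can be driven from code. A minimal sketch that builds the _analyze request body; the endpoint URL (assumed here to be a local instance on the default port 9200) would be http://localhost:9200/_analyze, adjusted to your setup.

```python
import json

# Build the JSON body for GET _analyze with one of the two IK modes
# (sketch; send it to your cluster's /_analyze endpoint yourself).
def analyze_request(analyzer, text):
    if analyzer not in ("ik_smart", "ik_max_word"):
        raise ValueError("expected an IK analyzer name")
    return json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)

print(analyze_request("ik_smart", "我是中国人码坐标"))
```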

2.3.1 ik_smart

  • Coarsest split (fewest tokens)
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人码坐标"
}
  • The split result
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "码",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "坐标",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}

2.3.2 ik_max_word

  • Finest-grained split
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人码坐标"
}
  • The split result
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "中国",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 3
    },
    {
      "token" : "国人",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 4
    },
    {
      "token" : "码",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "CN_CHAR",
      "position" : 5
    },
    {
      "token" : "坐标",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 6
    }
  ]
}

2.4 Custom dictionary

  1. Create a new .dic file under elasticsearch/plugins/ik/config, for example codecoord.dic here
  2. Edit codecoord.dic and add words to it, one per line; each entry is treated by the analyzer as a whole word and will not be split

  3. Edit the ik/config/IKAnalyzer.cfg.xml file and register the newly created codecoord.dic in the ext_dict entry; multiple dictionary files can be listed together
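For reference, the ext_dict entry in IKAnalyzer.cfg.xml looks roughly like this (a sketch based on the plugin's stock config file; check the comments in your own copy for the exact layout):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- register the custom dictionary created above -->
    <entry key="ext_dict">codecoord.dic</entry>
</properties>
```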

  4. After restarting Elasticsearch, re-running the analysis shows the dictionary entry returned as a single token
{
  "tokens" : [
    {
      "token" : "我",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "CN_CHAR",
      "position" : 0
    },
    {
      "token" : "是",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "CN_CHAR",
      "position" : 1
    },
    {
      "token" : "中国人",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "码坐标",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "CN_CHAR",
      "position" : 3
    },
    {
      "token" : "坐标",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}

2.5 Querying with the IK analyzer

  1. Create an index and specify the analyzers: ik_max_word at index time stores the most sub-words for recall, while ik_smart at query time keeps search terms coarse
PUT index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
  2. Create documents
POST index/_doc/1
{
	"content": "Is Iraq a mess?"
}

POST index/_doc/2
{
	"content": "Ministry of Public Security: School buses will enjoy the highest right of way"
}

POST index/_doc/3
{
	"content": "Investigation into China-ROK fishery police clash: Rok police detain 1 Chinese fishing boat on average every day"
}

POST index/_doc/4
{
	"content": "Suspect in Shooting of Chinese Consulate in Los Angeles turns himself in"
}
  3. Specify highlighting information in the search
GET index/_search
{
  "query": {
    "match": {
      "content": "China"
    }
  },
  "highlight": {
    "pre_tags": ["<tag1>", "<tag2>"],
    "post_tags": ["</tag1>", "</tag2>"],
    "fields": {
      "content": {}
    }
  }
}
  4. The highlighted fragments are returned in the highlight field
{
  "took" : 50,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.642793,
    "hits" : [
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.642793,
        "_source" : {
          "content" : "Investigation into China-ROK fishery police clash: ROK police detain 1 Chinese fishing boat on average every day"
        },
        "highlight" : {
          "content" : [
            "Investigation into China-ROK fishery police clash: ROK police detain 1 <tag1>Chinese</tag1> fishing boat on average every day"
          ]
        }
      },
      {
        "_index" : "index",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.642793,
        "_source" : {
          "content" : "Suspect in Shooting of Chinese Consulate in Los Angeles turns himself in"
        },
        "highlight" : {
          "content" : [
            "Suspect in Shooting of <tag1>Chinese</tag1> Consulate in Los Angeles turns himself in"
          ]
        }
      }
    ]
  }
}