Medcl of Elastic provides a Pinyin analyzer for Elasticsearch. Pinyin search is used in many application scenarios. For example, in Baidu search, we type Pinyin and matching Chinese characters are suggested.

For Chinese users, Pinyin search is very intuitive. So how do you search by Pinyin in Elasticsearch? The answer is the elasticsearch-analysis-pinyin analyzer created by Medcl. Let's take a quick look at how to install and test it.


Download the Pinyin analyzer source code to compile and install

As elasticsearch-analysis-pinyin currently does not provide a downloadable installer, we have to download the source code and compile it ourselves. First, use the following command to download it:

```
$ git clone https://github.com/medcl/elasticsearch-analysis-pinyin
```

After downloading the source code, go to the root directory of your project. The source code for the whole project is shown as follows:

```
$ tree -L 2 .
.
├── LICENSE.txt
├── README.md
├── lib
│   └── nlp-lang-1.7.jar
├── pom.xml
└── src
    ├── main
    └── test
```

Before compiling **elasticsearch-analysis-pinyin**, we need to change the version number in the project to match our version of Elasticsearch; otherwise the plugin will not load correctly. Our Elasticsearch version is 7.3.0, so we modify the pom.xml file accordingly:
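The target version lives in a Maven property inside pom.xml. As a sketch (the property name `elasticsearch.version` is how Medcl's plugin projects typically parameterize it; check the pom.xml in your checkout for the exact name), the change looks like:

```xml
<!-- pom.xml: make the plugin build match the running Elasticsearch version -->
<properties>
    <elasticsearch.version>7.3.0</elasticsearch.version>
</properties>
```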

Maven must be installed on our computer. Then go to the root directory of the project and type the following command on the command line:

```
$ mvn install
```

This completes the compilation of the entire project. We type the following command on the command line:

```
$ find ./ -name "*.zip"
./target/releases/elasticsearch-analysis-pinyin-7.3.0.zip
```

It shows that a zip file called elasticsearch-analysis-pinyin-7.3.0.zip has been produced in the target directory. This version number matches our Elasticsearch version.

Create a subdirectory named pinyin under the plugins directory of the Elasticsearch installation:

```
/Users/liuxg/elastic/elasticsearch-7.3.0/plugins
localhost:plugins liuxg$ ls
analysis-ik  pinyin
```

Then unzip the elasticsearch-analysis-pinyin-7.3.0.zip file we produced in the previous step into the pinyin directory we just created. The pinyin folder will then look like this:

```
localhost:plugins liuxg$ tree pinyin/ -L 3
pinyin/
├── elasticsearch-analysis-pinyin-7.3.0.jar
├── nlp-lang-1.7.jar
└── plugin-descriptor.properties
```
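Put together, the copy steps look roughly like this (a sketch only; the paths below are for this machine, so adjust `ES_HOME` and the location of the built zip to your own setup):

```shell
# Assumed paths -- change ES_HOME to your Elasticsearch installation directory
ES_HOME=/Users/liuxg/elastic/elasticsearch-7.3.0
mkdir -p "$ES_HOME/plugins/pinyin"
unzip target/releases/elasticsearch-analysis-pinyin-7.3.0.zip \
      -d "$ES_HOME/plugins/pinyin"
```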

Now that the installation is complete, we need to restart Elasticsearch.


Test the Pinyin analyzer


Let's test the Pinyin analyzer we just installed to see if it works. We can copy the examples from github.com/medcl/elast… . Here are some simple tests:

Create a custom Pinyin tokenizer

```
PUT /medcl/
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    }
}
```

Test some Chinese characters

```
GET /medcl/_analyze
{
  "text": ["天安门"],
  "analyzer": "pinyin_analyzer"
}
```

The command output is as follows:

```
{
  "tokens" : [
    {
      "token" : "tian",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "天安门",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "tam",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "an",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "men",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    }
  ]
}
```

The tokens above show that if we search for tam, we can still find our document.

Create the mapping

```
POST /medcl/_mapping
{
  "properties": {
    "name": {
      "type": "keyword",
      "fields": {
        "pinyin": {
          "type": "text",
          "store": false,
          "term_vector": "with_offsets",
          "analyzer": "pinyin_analyzer",
          "boost": 10
        }
      }
    }
  }
}
```

Index a document

```
POST /medcl/_create/andy
{ "name" : "刘德华" }
```

Search for documents

```
curl http://localhost:9200/medcl/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E
curl http://localhost:9200/medcl/_search?q=name.pinyin:%e5%88%98%e5%be%b7
curl http://localhost:9200/medcl/_search?q=name.pinyin:liu
curl http://localhost:9200/medcl/_search?q=name.pinyin:ldh
curl http://localhost:9200/medcl/_search?q=name.pinyin:de+hua
```

Or:

```
GET medcl/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E
GET medcl/_search?q=name.pinyin:%e5%88%98%e5%be%b7
GET medcl/_search?q=name.pinyin:liu
GET medcl/_search?q=name.pinyin:ldh
GET medcl/_search?q=name.pinyin:de+hua
```

The first percent-encoded string above is "刘德华" (Andy Lau) and the second is "刘德" (Liu De).
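You can verify what a percent-encoded query string decodes to yourself. A quick sketch in the shell (this assumes bash, whose built-in printf understands \xHH escapes in %b; any URL decoder would do):

```shell
# Turn %E5%88%98... into raw UTF-8 bytes by rewriting each "%" as "\x"
# and letting printf %b interpret the resulting escape sequences.
encoded='%E5%88%98%E5%BE%B7%E5%8D%8E'
decoded=$(printf '%b' "$(printf '%s' "$encoded" | sed -e 's/%/\\x/g')")
echo "$decoded"   # prints: 刘德华
```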

Use the Pinyin token filter

```
PUT /medcl1/
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "user_name_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : "pinyin_first_letter_and_full_pinyin_filter"
                }
            },
            "filter" : {
                "pinyin_first_letter_and_full_pinyin_filter" : {
                    "type" : "pinyin",
                    "keep_first_letter" : true,
                    "keep_full_pinyin" : false,
                    "keep_none_chinese" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "trim_whitespace" : true,
                    "keep_none_chinese_in_first_letter" : true
                }
            }
        }
    }
}
```

Token test: 刘德华 张学友 郭富城 黎明 四大天王 (Andy Lau, Jacky Cheung, Aaron Kwok, Leon Lai, the Four Heavenly Kings)

```
GET /medcl1/_analyze
{
  "text": ["刘德华 张学友 郭富城 黎明 四大天王"],
  "analyzer": "user_name_analyzer"
}
```

The output is:

```
{
  "tokens" : [
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zxy",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "gfc",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "lm",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "sdtw",
      "start_offset" : 15,
      "end_offset" : 19,
      "type" : "word",
      "position" : 4
    }
  ]
}
```

For other examples, please see github.com/medcl/elast… .

See the article "Elasticsearch: IK Chinese Analyzer" for more information about Chinese analysis in Elasticsearch.