Inverted index
- Forward index: maps a document ID to the words it contains
- Inverted index: maps a word to the IDs of the documents that contain it
Example: construct an inverted index for three documents after removing stop words (a hypothetical sketch follows the query walkthrough below)
Inverted index – query process
Query for documents that contain “search engine”:
- Look up “search engine” in the inverted index to obtain the list of matching document IDs, here 1 and 3
- Fetch the complete contents of documents 1 and 3 through the forward index
- Return the final result
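A minimal sketch of both indexes, using three hypothetical documents (the original example documents are not reproduced here); documents 1 and 3 both contain “search engine”, matching the walkthrough above:

# Hypothetical documents (stop words removed):
#   1: "elasticsearch is the best search engine"
#   2: "elasticsearch is a distributed document store"
#   3: "lucene powers the search engine"

# Forward index: document ID -> words
{
  "1": ["elasticsearch", "best", "search", "engine"],
  "2": ["elasticsearch", "distributed", "document", "store"],
  "3": ["lucene", "powers", "search", "engine"]
}

# Inverted index: word -> document IDs (posting list)
{
  "elasticsearch": [1, 2],
  "best":          [1],
  "search":        [1, 3],
  "engine":        [1, 3],
  "distributed":   [2],
  "document":      [2],
  "store":         [2],
  "lucene":        [3],
  "powers":        [3]
}

# Query "search engine": intersect the lists for "search" and "engine" -> [1, 3],
# then fetch documents 1 and 3 from the forward index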
Inverted index – composition
- Term Dictionary
- Posting List
Term Dictionary
The term dictionary is generally implemented as a B+ tree; the online B+ Tree Visualization tool shows step by step how a B+ tree is built.
B trees and B+ trees
- Wikipedia: B-tree
- Wikipedia: B+ tree
- Illustrated explanation of B-tree and B+ tree insertion and deletion
Posting List
- The posting list records the set of documents that contain a given word; it consists of posting entries
- A posting entry mainly contains the following information:
  - Document ID, used to retrieve the original document
  - Term frequency (TF), the number of times the word occurs in the document, used for relevance scoring
  - Position, the word's position(s) within the document, used for phrase search
  - Offset, the start and end character positions of the word in the document, used for highlighting
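As an illustration, a single posting entry for the term “search” in hypothetical document 1 (“elasticsearch is the best search engine”) could be pictured like this; the field names and numbers are made up for illustration and are not Lucene's actual on-disk format:

{
  "doc_id": 1,
  "tf": 1,
  "position": [4],
  "offset": { "start": 26, "end": 32 }
}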
The internal nodes of a B+ tree store the index and the leaf nodes store the data; here the term dictionary is the B+ tree index and the posting lists are the data.
Note: does the B+ tree index compare terms by Unicode code point or by pinyin?
ES stores documents in JSON format; a document contains multiple fields, and each field has its own inverted index.
Analysis (word segmentation)
Word segmentation is the process of converting text into a series of words (tokens). It is also called text analysis, and in ES it is referred to as Analysis.
Analyzer
An Analyzer is the ES component dedicated to word segmentation. It consists of the following parts:
- Character Filters: Process raw text, such as removing HTML tags
- Tokenizer: Splits the original text into words according to certain rules
- Token Filters: further process the tokens produced by the Tokenizer, e.g. lowercasing, deleting tokens, or adding new ones
Analyzer call order: Character Filters → Tokenizer → Token Filters
Analyze API
ES provides an API for testing analysis so that you can verify how text is tokenized. The endpoint is _analyze
- You can specify an analyzer directly for testing
- You can specify a field of an index for testing, for example:
POST test_index/doc
{
  "username": "whirly",
  "age": 22
}

POST test_index/_analyze
{
  "field": "username",
  "text": ["hello world"]
}
- You can also define a custom analyzer inline for testing:
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": ["Hello World"]
}
Built-in analyzers
ES ships with the following built-in analyzers:
- Standard Analyzer
  - The default analyzer
  - Splits text by words; supports multiple languages
  - Lowercases tokens
- Simple Analyzer
  - Splits on non-letter characters
  - Lowercases tokens
- Whitespace Analyzer
  - Splits on whitespace
- Stop Analyzer
  - Like Simple Analyzer, but also removes stop words
  - Stop words are function words such as the, an, this, etc.
- Keyword Analyzer
  - Outputs the input unchanged as a single token, without any segmentation
- Pattern Analyzer
  - Splits with a custom regular expression
  - The default pattern is \W+, i.e. non-word characters act as delimiters
- Language Analyzers
  - Provide analyzers for more than 30 common languages
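For example, testing the default standard analyzer; the output should be lowercased word tokens, with stop words such as “the” kept:

POST _analyze
{
  "analyzer": "standard",
  "text": ["The 2 QUICK Brown Foxes."]
}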
Example: the Stop Analyzer

POST _analyze
{
  "analyzer": "stop",
  "text": ["The 2 QUICK Brown Foxes jumped over the lazy dog's bone."]
}
# Result
{
  "tokens": [
    { "token": "quick",  "start_offset": 6,  "end_offset": 11, "type": "word", "position": 1 },
    { "token": "brown",  "start_offset": 12, "end_offset": 17, "type": "word", "position": 2 },
    { "token": "foxes",  "start_offset": 18, "end_offset": 23, "type": "word", "position": 3 },
    { "token": "jumped", "start_offset": 24, "end_offset": 30, "type": "word", "position": 4 },
    { "token": "over",   "start_offset": 31, "end_offset": 35, "type": "word", "position": 5 },
    { "token": "lazy",   "start_offset": 40, "end_offset": 44, "type": "word", "position": 7 },
    { "token": "dog",    "start_offset": 45, "end_offset": 48, "type": "word", "position": 8 },
    { "token": "s",      "start_offset": 49, "end_offset": 50, "type": "word", "position": 9 },
    { "token": "bone",   "start_offset": 51, "end_offset": 55, "type": "word", "position": 10 }
  ]
}
Chinese word segmentation
- Difficulties
  - Chinese word segmentation splits a sequence of Chinese characters into separate words. In English, words are naturally delimited by spaces, whereas Chinese has no formal delimiter
  - The correct segmentation depends on context; crossing ambiguities are a typical example
- Common word segmentation systems
  - IK: Chinese and English segmentation, custom dictionaries, hot dictionary updates
  - jieba: segmentation and part-of-speech tagging, traditional Chinese, custom dictionaries, parallel segmentation, etc.
  - HanLP: a Java toolkit of models and algorithms that aims to bring natural language processing into production environments
  - THULAC: Chinese word segmentation and part-of-speech tagging
Installing the IK Chinese analysis plugin
Run the following command in the Elasticsearch installation directory, then restart ES:
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip
If the installation fails because the network is slow, download the zip package first, change the path in the following command to its actual location, run it, and then restart ES:

bin/elasticsearch-plugin install file:///path/to/elasticsearch-analysis-ik-6.3.0.zip
- IK test: ik_smart (the input text means “Ministry of Public Security: school buses around the country will enjoy the highest right of way”)
POST _analyze
{
  "analyzer": "ik_smart",
  "text": ["公安部:各地校车将享最高路权"]
}

# Result
{
  "tokens": [
    { "token": "公安部", "start_offset": 0,  "end_offset": 3,  "type": "CN_WORD", "position": 0 },
    { "token": "各地",   "start_offset": 4,  "end_offset": 6,  "type": "CN_WORD", "position": 1 },
    { "token": "校车",   "start_offset": 6,  "end_offset": 8,  "type": "CN_WORD", "position": 2 },
    { "token": "将",     "start_offset": 8,  "end_offset": 9,  "type": "CN_CHAR", "position": 3 },
    { "token": "享",     "start_offset": 9,  "end_offset": 10, "type": "CN_CHAR", "position": 4 },
    { "token": "最高",   "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 5 },
    { "token": "路",     "start_offset": 12, "end_offset": 13, "type": "CN_CHAR", "position": 6 },
    { "token": "权",     "start_offset": 13, "end_offset": 14, "type": "CN_CHAR", "position": 7 }
  ]
}
- IK test: ik_max_word (same input text)
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": ["公安部:各地校车将享最高路权"]
}

# Result
{
  "tokens": [
    { "token": "公安部", "start_offset": 0,  "end_offset": 3,  "type": "CN_WORD", "position": 0 },
    { "token": "公安",   "start_offset": 0,  "end_offset": 2,  "type": "CN_WORD", "position": 1 },
    { "token": "部",     "start_offset": 2,  "end_offset": 3,  "type": "CN_CHAR", "position": 2 },
    { "token": "各地",   "start_offset": 4,  "end_offset": 6,  "type": "CN_WORD", "position": 3 },
    { "token": "校车",   "start_offset": 6,  "end_offset": 8,  "type": "CN_WORD", "position": 4 },
    { "token": "将",     "start_offset": 8,  "end_offset": 9,  "type": "CN_CHAR", "position": 5 },
    { "token": "享",     "start_offset": 9,  "end_offset": 10, "type": "CN_CHAR", "position": 6 },
    { "token": "最高",   "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 7 },
    { "token": "路",     "start_offset": 12, "end_offset": 13, "type": "CN_CHAR", "position": 8 },
    { "token": "权",     "start_offset": 13, "end_offset": 14, "type": "CN_CHAR", "position": 9 }
  ]
}
- What is the difference between ik_max_word and ik_smart?
  - ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” (“the national anthem of the People's Republic of China”) is split into every possible combination of sub-words, such as 中华人民共和国, 中华人民, 中华, 人民共和国, 人民, 共和国, 国歌, and so on
  - ik_smart: splits at the coarsest granularity. The same text is split into just 中华人民共和国 and 国歌
Custom analysis
When the built-in analyzers do not meet your needs, you can build a custom analyzer by combining Character Filters, a Tokenizer, and Token Filters
Character Filters
- Process the raw text before it reaches the Tokenizer, e.g. adding, deleting, or replacing characters
- Built-in character filters:
  - HTML Strip Character Filter: strips HTML tags and decodes HTML entities
  - Mapping Character Filter: performs character replacement
  - Pattern Replace Character Filter: performs regular-expression match and replace
- Character filters affect the position and offset information produced by the subsequent Tokenizer
Character Filters test
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": ["<p>I&apos;m so <b>happy</b>!</p>"]
}

# Result
{
  "tokens": [
    { "token": """ I'm so happy! """, "start_offset": 0, "end_offset": 32, "type": "word", "position": 0 }
  ]
}
Tokenizers
- Splits the raw text into words (terms/tokens) according to certain rules
- Built-in tokenizers:
  - Standard: splits by word
  - Letter: splits on non-letter characters
  - Whitespace: splits on whitespace
  - UAX URL Email: like Standard, but keeps email addresses and URLs as single tokens
  - NGram and Edge NGram: split into (edge) n-grams
  - Path Hierarchy: splits on file-path separators
Tokenizers test
POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": ["/path/to/file"]
}

# Result
{
  "tokens": [
    { "token": "/path",         "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "/path/to",      "start_offset": 0, "end_offset": 8,  "type": "word", "position": 0 },
    { "token": "/path/to/file", "start_offset": 0, "end_offset": 13, "type": "word", "position": 0 }
  ]
}
Token Filters
- Add, delete, or modify the tokens produced by the Tokenizer
- Built-in token filters:
  - Lowercase: converts all tokens to lowercase
  - Stop: removes stop words
  - NGram and Edge NGram: split tokens into (edge) n-grams
  - Synonym: adds synonyms for a token (a sketch follows the test below)
Token Filters test
POST _analyze
{
  "text": ["a Hello World!"],
  "tokenizer": "standard",
  "filter": [
    "stop",
    "lowercase",
    {
      "type": "ngram",
      "min_gram": 4,
      "max_gram": 4
    }
  ]
}

# Result
{
  "tokens": [
    { "token": "hell", "start_offset": 2, "end_offset": 7,  "type": "<ALPHANUM>", "position": 1 },
    { "token": "ello", "start_offset": 2, "end_offset": 7,  "type": "<ALPHANUM>", "position": 1 },
    { "token": "worl", "start_offset": 8, "end_offset": 13, "type": "<ALPHANUM>", "position": 2 },
    { "token": "orld", "start_offset": 8, "end_offset": 13, "type": "<ALPHANUM>", "position": 2 }
  ]
}
Custom analyzers
A custom analyzer is defined in the index settings by configuring char_filter, tokenizer, filter, and analyzer
Example of a custom analyzer:
- Analyzer name: my_custom_analyzer
- The uppercase filter converts tokens to uppercase
PUT test_index_1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "uppercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}
Custom analyzer test
POST test_index_1/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": ["<p>I&apos;m so <b>happy</b>!</p>"]
}

# Result
{
  "tokens": [
    { "token": "I'M",   "start_offset": 3,  "end_offset": 11, "type": "<ALPHANUM>", "position": 0 },
    { "token": "SO",    "start_offset": 12, "end_offset": 14, "type": "<ALPHANUM>", "position": 1 },
    { "token": "HAPPY", "start_offset": 18, "end_offset": 27, "type": "<ALPHANUM>", "position": 2 }
  ]
}
Notes on using analysis
Analysis is applied in the following two situations:
- Index time: when a document is created or updated, the document's fields are analyzed
- Search time: the query string is analyzed
  - The analyzer parameter can specify the analyzer at query time
  - Or search_analyzer can be set in the index mapping (a sketch follows this list)
  - In general, do not specify a separate search-time analyzer; use the index-time analyzer, otherwise queries may fail to match
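A sketch of setting search_analyzer through the index mapping; the index and field names are hypothetical, the mapping uses the 6.x single-type style seen earlier, and ik_max_word/ik_smart assume the IK plugin is installed:

PUT test_index_2
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
      }
    }
  }
}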
Suggestions for using analysis
- Decide explicitly whether each field needs analysis; set type to keyword for fields that do not, which saves space and improves write performance (see the mapping sketch after this list)
- Use the _analyze API to inspect how text is tokenized
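A sketch of the keyword suggestion above, with hypothetical index and field names in the same 6.x mapping style:

PUT test_index_3
{
  "mappings": {
    "doc": {
      "properties": {
        "status":  { "type": "keyword" },
        "content": { "type": "text", "analyzer": "ik_max_word" }
      }
    }
  }
}

# status is stored as a single un-analyzed term; content is analyzed at index time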
For more information please visit my personal website: laijianfeng.org
- Elasticsearch official documentation
- imooc course: Elastic Stack from getting started to practice
Welcome to follow my wechat official account