Inverted index

  • Forward index: maps a document ID to the words it contains
  • Inverted index: maps a word to the document IDs that contain it

Example: construct an inverted index, after removing stop words, for the following three documents (a hypothetical sketch is given after the query steps below)

Inverted index – query process

Query documents that contain “Search engine”

  1. Look up "search engine" in the inverted index to obtain the list of matching document IDs, here documents 1 and 3
  2. Fetch the full contents of documents 1 and 3 through the forward index
  3. Return the final result
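
The three documents from the original illustration are not reproduced here, so the sketch below uses invented contents purely for illustration. After stop-word removal, the term dictionary and posting lists might look like this:

Doc 1: "a search engine tutorial"
Doc 2: "distributed systems overview"
Doc 3: "search engine internals"

Term          Posting list (document IDs)
-----------   ----------------------------
search        1, 3
engine        1, 3
tutorial      1
distributed   2
systems       2
overview      2
internals     3

A query for "search engine" intersects the posting lists of "search" and "engine", yielding documents 1 and 3, which is exactly steps 1–3 above.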

Inverted index – composition

  • Term Dictionary
  • Posting List

Term Dictionary

The term dictionary is generally implemented as a B+ tree; its construction process can be explored step by step at the B+ Tree Visualization site

B trees and B+ trees

  1. Wikipedia: B-tree
  2. Wikipedia: B+ tree
  3. Illustrated guide to B-tree and B+ tree insertion and deletion

Posting List

  • The posting list records the set of documents that contain a given word. It consists of posting entries
  • Each posting entry contains the following information:
    1. Document ID, used to retrieve the original document
    2. Term Frequency (TF), the number of times the word occurs in the document, used for subsequent relevance scoring
    3. Position, the location(s) of the word within the document, used for phrase search
    4. Offset, the start and end character positions of the word in the document, used for highlighting
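
As a purely conceptual sketch (the field names and values are invented for illustration and do not reflect the actual Lucene on-disk format), a single posting entry can be pictured as carrying the document ID, the term frequency, the in-document positions, and the character offsets:

{
  "doc_id": 1,
  "term_frequency": 2,
  "positions": [3, 11],
  "offsets": [ { "start": 10, "end": 16 }, { "start": 42, "end": 48 } ]
}

The document ID locates the original document, the term frequency feeds relevance scoring, the positions support phrase search, and the offsets drive highlighting.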

The internal nodes of a B+ tree store the index and the leaf nodes store the data; here the term dictionary plays the role of the B+ tree index and the posting lists are the data

Note: does the B+ tree index compare terms by Unicode code point or by pinyin?

ES stores JSON documents containing multiple fields, and each field has its own inverted index

Word segmentation

Word segmentation is the process of converting text into a series of words or tokens. It can also be called text analysis, and in ES it is referred to as Analysis

Analyzer

An Analyzer is the ES component dedicated to word segmentation. It consists of the following parts:

  • Character Filters: Process raw text, such as removing HTML tags
  • Tokenizer: Splits the original text into words according to certain rules
  • Token Filters: post-process the words produced by the Tokenizer, for example lowercasing, deleting tokens, or adding new ones

Analyzer call order: Character Filters → Tokenizer → Token Filters

Analyze API

ES provides an API for testing analyzers so you can verify the segmentation result. The endpoint is _analyze

  • You can specify an analyzer directly for testing, as in the example below
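A minimal request of this form, naming the built-in standard analyzer (the sample text is arbitrary):

POST _analyze
{
  "analyzer": "standard",
  "text": ["hello world"]
}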

  • You can directly specify fields in the index for testing
POST test_index/doc
{
  "username": "whirly",
  "age": 22
}

POST test_index/_analyze
{
  "field": "username",
  "text": ["hello world"]
}
  • You can combine analyzer components on the fly for testing
POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": ["Hello World"]
}

Built-in analyzers

ES ships with the following built-in analyzers:

  • Standard Analyzer
    • The default analyzer
    • Splits text by word boundaries; supports multiple languages
    • Lowercases tokens
  • Simple Analyzer
    • Splits on non-letter characters
    • Lowercases tokens
  • Whitespace Analyzer
    • Whitespace character as separator
  • Stop Analyzer
    • Like the Simple Analyzer, but additionally removes stop words
    • Stop words refer to modifiers such as the, an, this, etc
  • Keyword Analyzer
    • Outputs the input unchanged as a single term, without any segmentation
  • Pattern Analyzer
    • Customize delimiters through regular expressions
    • The default is \W+, that is, non-word symbols as delimiters
  • Language Analyzer
    • Provides word segmentation for more than 30 common languages

Example: the Stop Analyzer

POST _analyze
{
  "analyzer": "stop",
  "text": ["The 2 QUICK Brown Foxes jumped over the lazy dog's bone."]
}

The result:

{
  "tokens": [
    { "token": "quick", "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 },
    { "token": "brown", "start_offset": 12, "end_offset": 17, "type": "word", "position": 2 },
    { "token": "foxes", "start_offset": 18, "end_offset": 23, "type": "word", "position": 3 },
    { "token": "jumped", "start_offset": 24, "end_offset": 30, "type": "word", "position": 4 },
    { "token": "over", "start_offset": 31, "end_offset": 35, "type": "word", "position": 5 },
    { "token": "lazy", "start_offset": 40, "end_offset": 44, "type": "word", "position": 7 },
    { "token": "dog", "start_offset": 45, "end_offset": 48, "type": "word", "position": 8 },
    { "token": "s", "start_offset": 49, "end_offset": 50, "type": "word", "position": 9 },
    { "token": "bone", "start_offset": 51, "end_offset": 55, "type": "word", "position": 10 }
  ]
}

Chinese word segmentation

  • Difficulties
    • Chinese word segmentation means splitting a sequence of Chinese characters into individual words. In English, words are naturally delimited by spaces, while Chinese has no formal delimiter
    • The segmentation result depends on context; cross ambiguity is a typical problem
  • Common word segmentation systems
    • IK: Chinese and English segmentation, customizable dictionaries, hot reloading of dictionaries
    • Jieba: segmentation and part-of-speech tagging, traditional Chinese, custom dictionaries, parallel segmentation, etc.
    • HanLP: a Java toolkit of models and algorithms that aims to bring natural language processing to production environments
    • THULAC: Chinese word segmentation and part-of-speech tagging

Install the IK Chinese word segmentation plug-in

Run the following command in the Elasticsearch installation directory, then restart ES:
bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.3.0/elasticsearch-analysis-ik-6.3.0.zip

If the installation fails because of a slow network, download the zip package manually, change the following command to the actual path, run it, and restart ES:

bin/elasticsearch-plugin install file:///path/to/elasticsearch-analysis-ik-6.3.0.zip
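
To double-check that the plugin was picked up after the restart, listing the installed plugins from the same directory should show an entry for analysis-ik (assuming a standard installation layout):

bin/elasticsearch-plugin list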
  • IK test: ik_smart
POST _analyze
{
  "analyzer": "ik_smart",
  "text": ["Ministry of Public Security: Local school buses to enjoy highest right of way"]
}

# The result
{
  "tokens": [
    { "token": "Ministry of Public Security", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 0 },
    { "token": "Around", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 1 },
    { "token": "School bus", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 2 },
    { "token": "Will", "start_offset": 8, "end_offset": 9, "type": "CN_CHAR", "position": 3 },
    { "token": "Enjoy", "start_offset": 9, "end_offset": 10, "type": "CN_CHAR", "position": 4 },
    { "token": "The highest", "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 5 },
    { "token": "The road", "start_offset": 12, "end_offset": 13, "type": "CN_CHAR", "position": 6 },
    { "token": "Right", "start_offset": 13, "end_offset": 14, "type": "CN_CHAR", "position": 7 }
  ]
}

(The tokens are English glosses of the original Chinese headline; the offsets refer to positions in the Chinese text.)
  • IK test: ik_max_word
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": ["Ministry of Public Security: Local school buses to enjoy highest right of way"]
}

# The result
{
  "tokens": [
    { "token": "Ministry of Public Security", "start_offset": 0, "end_offset": 3, "type": "CN_WORD", "position": 0 },
    { "token": "Public security", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 1 },
    { "token": "Department", "start_offset": 2, "end_offset": 3, "type": "CN_CHAR", "position": 2 },
    { "token": "Around", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 3 },
    { "token": "School bus", "start_offset": 6, "end_offset": 8, "type": "CN_WORD", "position": 4 },
    { "token": "Will", "start_offset": 8, "end_offset": 9, "type": "CN_CHAR", "position": 5 },
    { "token": "Enjoy", "start_offset": 9, "end_offset": 10, "type": "CN_CHAR", "position": 6 },
    { "token": "The highest", "start_offset": 10, "end_offset": 12, "type": "CN_WORD", "position": 7 },
    { "token": "The road", "start_offset": 12, "end_offset": 13, "type": "CN_CHAR", "position": 8 },
    { "token": "Right", "start_offset": 13, "end_offset": 14, "type": "CN_CHAR", "position": 9 }
  ]
}
  • What is the difference between ik_max_word and ik_smart?
    • ik_max_word: splits the text at the finest granularity, exhausting all plausible word combinations; for example, "National Anthem of the People's Republic of China" is broken into every embedded word, such as "People's Republic of China", "People's Republic", "Republic", "national anthem", and so on

    • ik_smart: does the coarsest-grained split; the same text is split into just "People's Republic of China" and "national anthem"

Custom analyzers

When the built-in analyzers do not meet your requirements, you can build a custom analyzer from your own combination of Character Filters, Tokenizer, and Token Filters

Character Filters

  • The original text is processed before Tokenizer, such as adding, deleting, or replacing characters
  • The built-in ones are as follows:
    • HTML Strip Character Filter: Strip HTML tags and transform HTML entities
    • Mapping Character Filter: Performs Character replacement
    • Pattern Replace Character Filter: Performs regular match replacement
  • Affects the position and offset information subsequently produced by the Tokenizer

Character Filters test

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": ["<p>I&apos;m so <b>happy</b>!</p>"]
}

# The result
{
  "tokens": [
    {
      "token": """

I'm so happy!

""",
      "start_offset": 0,
      "end_offset": 32,
      "type": "word",
      "position": 0
    }
  ]
}

Tokenizers

  • Split the original text into words (term or token) according to certain rules
  • The built-in ones are as follows:
    • standard splits on word boundaries
    • letter splits on non-letter characters
    • whitespace splits on whitespace
    • uax_url_email splits like standard, but keeps email addresses and URLs as single tokens
    • ngram and edge_ngram split text into contiguous character n-grams
    • path_hierarchy splits on file path separators

Tokenizers test

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": ["/path/to/file"]
}

# The result
{
  "tokens": [
    { "token": "/path", "start_offset": 0, "end_offset": 5, "type": "word", "position": 0 },
    { "token": "/path/to", "start_offset": 0, "end_offset": 8, "type": "word", "position": 0 },
    { "token": "/path/to/file", "start_offset": 0, "end_offset": 13, "type": "word", "position": 0 }
  ]
}

Token Filters

  • Adds, deletes, or modifies the words output by the Tokenizer
  • The built-in ones are as follows:
    • lowercase converts all terms to lowercase
    • stop removes stop words
    • ngram and edge_ngram split terms into contiguous character n-grams
    • synonym adds synonyms for terms

Token Filters test

POST _analyze
{
  "text": ["a Hello World!"],
  "tokenizer": "standard",
  "filter": [
    "stop",
    "lowercase",
    {
      "type": "ngram",
      "min_gram": 4,
      "max_gram": 4
    }
  ]
}

# The result
{
  "tokens": [
    { "token": "hell", "start_offset": 2, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 },
    { "token": "ello", "start_offset": 2, "end_offset": 7, "type": "<ALPHANUM>", "position": 1 },
    { "token": "worl", "start_offset": 8, "end_offset": 13, "type": "<ALPHANUM>", "position": 2 },
    { "token": "orld", "start_offset": 8, "end_offset": 13, "type": "<ALPHANUM>", "position": 2 }
  ]
}

Defining a custom analyzer

A custom analyzer is defined in the index settings: configure char_filter, tokenizer, and filter under analysis and combine them in an analyzer definition

Example of a custom analyzer:

  • Analyzer name: my_custom_analyzer
  • The filters uppercase the tokens and apply ASCII folding
PUT test_index_1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "uppercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

Custom analyzer test

POST test_index_1/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": ["<p>I&apos;m so <b>happy</b>!</p>"]
}

# The result
{
  "tokens": [
    { "token": "I'M", "start_offset": 3, "end_offset": 11, "type": "<ALPHANUM>", "position": 0 },
    { "token": "SO", "start_offset": 12, "end_offset": 14, "type": "<ALPHANUM>", "position": 1 },
    { "token": "HAPPY", "start_offset": 18, "end_offset": 27, "type": "<ALPHANUM>", "position": 2 }
  ]
}

When analysis is applied

Analysis is applied in the following two situations:

  • Index time: when a document is created or updated, the document's fields are analyzed
  • Search time: the query string is analyzed
    • You can specify the analyzer directly in the query
    • You can set search_analyzer in the index mapping, as in the sketch below
    • In general, there is no need to specify a separate search-time analyzer; use the same analyzer as at index time, otherwise queries may fail to match
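
A minimal mapping sketch (an ES 6.x-style mapping with a doc type; the index and field names are only illustrative) that pairs an index-time analyzer with a different search_analyzer:

PUT blog_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_smart"
        }
      }
    }
  }
}

Here documents are indexed with the fine-grained ik_max_word while queries are analyzed with the coarser ik_smart; if you do not need this, omit search_analyzer and both sides use the same analyzer.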

Suggestions on analyzer use

  • Decide explicitly whether each field needs analysis, and set the type of fields that do not need it to keyword; this saves space and improves write performance (see the sketch below)
  • Use the _analyze API to inspect the analysis result of a document
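
For example, a hypothetical mapping in the same 6.x style where a tag field is stored as keyword and therefore skips analysis entirely:

PUT blog_index_2
{
  "mappings": {
    "doc": {
      "properties": {
        "tag": {
          "type": "keyword"
        }
      }
    }
  }
}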

For more information please visit my personal website: laijianfeng.org

  1. Elasticsearch official documentation
  2. Mooc course: Elastic Stack from Getting Started to Practice

Welcome to follow my WeChat official account