1. Getting started with Elasticsearch

1. Glossary introduction

  • Document: a JSON document, the data that the user stores in ES
  • Index: a collection of documents with the same fields
  • Node: a running instance of Elasticsearch; one component of a cluster
  • Cluster: consists of one or more nodes and provides service to the outside world

2. Introduction to the Document

A document is a JSON object made up of fields. Common field data types are as follows:

  • String: text, keyword
  • Numeric types: long, integer, short, byte, double, float, half_float, scaled_float
  • Boolean: boolean
  • Date: date
  • Binary: binary
  • Range types: integer_range, float_range, long_range, double_range, date_range

Each document has a unique ID, which can be:

  • Specified by the user
  • Generated automatically by ES

Document MetaData

Metadata annotates a document with relevant information:

  • _index: the name of the index the document belongs to
  • _type: the name of the document type
  • _id: the unique ID of the document
  • _uid: a combined ID made up of _type and _id (since 6.x, _type is deprecated and _uid is equivalent to _id)
  • _source: the raw JSON data of the document, from which the contents of each field can be obtained
  • _all: concatenates the values of all fields into this one field; disabled by default

3. Introduction to the Index

  • The index stores documents with the same structure
    • Each index has its own mapping definition that defines field names and types
  • A cluster can have multiple indices; for example:
    • Nginx logs can be stored in one index per day, split by date
      • nginx-log-2017-01-01
      • nginx-log-2017-01-02
      • nginx-log-2017-01-03
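With one index per day, a single query can cover a whole date range by using a wildcard in the index name. A minimal sketch against the example indices above:

    // Search all Nginx log indices for January 2017 at once
    GET /nginx-log-2017-01-*/_search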

4. Introduction to the REST API

  • Elasticsearch provides RESTful APIs
    • REST – REpresentational State Transfer
    • The URI specifies the resource, such as an Index or a Document
    • The HTTP method indicates the type of operation on the resource, such as GET, POST, PUT, and DELETE
  • There are two common modes of interaction
    • The Curl command line
    • Kibana DevTools
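As a sketch, the same kind of request can be issued from the command line with curl against a local node (assuming Elasticsearch is listening on its default port 9200):

    # Create an index named test_index from the command line
    curl -XPUT "http://localhost:9200/test_index"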

5. index_api

ES provides a dedicated Index API for creating, updating, and deleting index configurations

    Thpffcj:elasticsearch-6.5.4 Thpffcj$ bin/elasticsearch
    Thpffcj:kibana-6.5.4-darwin-x86_64 Thpffcj$ bin/kibana
  • Visit localhost:5601 to open the Kibana graphical interface; from there we can use the REST API through Kibana DevTools
    PUT /test_index

    #! Deprecation: the default number of shards will change from [5] to [1] in 7.0.0; if you wish to continue using the default of [5] shards, you must manage this on the create index request or with an index template
    {
      "acknowledged" : true,
      "shards_acknowledged" : true,
      "index" : "test_index"
    }

    DELETE /test_index

    {
      "acknowledged" : true
    }

6. document_api

ES provides a dedicated Document API for the following operations:

  • Create a document

  • Query the document

  • Update the document

  • Delete the document

  • Create a document in test_index with type doc and id 1 (newer versions no longer have the concept of type)

    PUT /test_index/doc/1
    {
      "username": "thpffcj",
      "age": 22
    }

    {
      "_index" : "test_index",
      "_type" : "doc",
      "_id" : "1",
      "_version" : 1,
      "result" : "created",
      "_shards" : {
        "total" : 2,
        "successful" : 1,
        "failed" : 0
      },
      "_seq_no" : 0,
      "_primary_term" : 1
    }
  • Create a document without specifying an ID
    POST /test_index/doc
    {
      "username": "tom",
      "age": 20
    }

    {
      "_index" : "test_index",
      "_type" : "doc",
      "_id" : "yfg31mwBWHG_wS6wM641",
      "_version" : 1,
      "result" : "created",
      "_shards" : {
        "total" : 2,
        "successful" : 1,
        "failed" : 0
      },
      "_seq_no" : 0,
      "_primary_term" : 1
    }
  • Query a document by specifying its ID
    GET /test_index/doc/1

    {
      "_index" : "test_index",
      "_type" : "doc",
      "_id" : "1",
      "_version" : 1,
      "found" : true,
      "_source" : {
        "username" : "thpffcj",
        "age" : 22
      }
    }
  • To search all documents, use _search
    GET /test_index/doc/_search

    {
      "took" : 12,                 // query time, in ms
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 2,               // total number of matching documents
        "max_score" : 1.0,
        "hits" : [                 // document details, first 10 by default
          {
            "_index" : "test_index",           // index name
            "_type" : "doc",
            "_id" : "yfg31mwBWHG_wS6wM641",    // document id
            "_score" : 1.0,                    // document score
            "_source" : {                      // document content
              "username" : "tom",
              "age" : 20
            }
          },
          {
            "_index" : "test_index",
            "_type" : "doc",
            "_id" : "1",
            "_score" : 1.0,
            "_source" : {
              "username" : "thpffcj",
              "age" : 22
            }
          }
        ]
      }
    }
  • Specify search criteria
    GET /test_index/doc/_search
    {
      "query": {
        "term": {
          "_id": "1"
        }
      }
    }
  • ES can also operate on multiple documents in one request, reducing network transport overhead and increasing the write rate
    • The endpoint for batch writes is _bulk
    • The endpoint for batch reads is _mget
    POST _bulk
    {"index":{"_index":"test_index","_type":"doc","_id":"3"}}
    {"username":"lilei","age":10}
    {"delete":{"_index":"test_index","_type":"doc","_id":"1"}}
    {"update":{"_index":"test_index","_type":"doc","_id":"2"}}
    {"doc":{"age":100}}

    GET _mget
    {
      "docs": [
        {
          "_index": "test_index",
          "_type": "doc",
          "_id": "1"
        },
        {
          "_index": "test_index",
          "_type": "doc",
          "_id": "2"
        }
      ]
    }

2. Inverted index and analysis in Elasticsearch

1. Table of contents and index of books

How do you find the pages where the keyword "ACID" appears?

Books and search engines

  • The table of contents page corresponds to a forward index
  • Index pages correspond to inverted indexes

2. Introduction to forward and inverted indexes

  • Forward index
    • Maps a document ID to the document's content and words
  • Inverted index
    • Maps a word to the IDs of the documents containing it

  • Querying for documents that contain "search engine":
    • Through the inverted index, find that the document IDs for "search engine" are 1 and 3
    • Through the forward index, fetch the complete content of documents 1 and 3
    • Return the final result to the user

3. Inverted index details

  • The inverted index is the core of a search engine and consists of two main parts:
    • Term Dictionary
      • An important component of the inverted index
      • Records all the words appearing in the documents; it is usually large
      • Records the association from each word to its posting list
      • Typically implemented as a B+ tree
    • Posting List
      • Records the set of documents corresponding to each word; made up of posting entries
      • A posting entry contains the following information:
      • Document ID, used to fetch the original document
      • Term Frequency (TF): how often the word occurs in the document, used for subsequent relevance scoring
      • Position: the positions of the word within the document (there may be several), used for Phrase Queries
      • Offset: the start and end character offsets of the word in the document, used for highlighting

  • ES stores documents in JSON format containing multiple fields, and each field has its own inverted index

4. Introduction to analysis

  • Analysis (word segmentation) is the process of converting text into a series of words (terms or tokens); in ES this is called Analysis
  • The analyzer is the ES component dedicated to this work. It is composed of:
    • Character Filters: preprocess the raw text, e.g. removing special HTML markup
    • Tokenizer: splits the text into words according to certain rules
    • Token Filters: rework the words produced by the Tokenizer, e.g. lowercasing, deleting, or adding terms

5. analyze_api

ES provides an API for testing analyzers and verifying their output; the endpoint is _analyze

  • You can specify an analyzer directly for testing
    POST _analyze
    {
      "analyzer": "standard",     // the analyzer to test
      "text": "hello world!"      // the test text
    }

    {
      "tokens" : [
        {
          "token" : "hello",      // the token
          "start_offset" : 0,     // start offset
          "end_offset" : 5,       // end offset
          "type" : "<ALPHANUM>",
          "position" : 0          // token position
        },
        {
          "token" : "world",
          "start_offset" : 6,
          "end_offset" : 11,
          "type" : "<ALPHANUM>",
          "position" : 1
        }
      ]
    }
  • You can directly specify fields in the index for testing
    POST test_index/_analyze
    {
      "field": "username",
      "text": "hello world!"
    }
  • You can combine a custom tokenizer and filters for testing
    POST _analyze
    {
      "tokenizer": "standard",
      "filter": ["lowercase"],
      "text": "hello world!"
    }

6. Built-in analyzers

ES comes with the following analyzers

  • Standard
    • The default analyzer
    • Splits text into words; supports multiple languages
  • Simple
    • Splits on non-letter characters
  • Whitespace
    • Splits on whitespace
  • Stop
    • Like Simple, but additionally removes stop words
  • Keyword
    • Outputs the input directly as a single term, without tokenizing
  • Pattern
    • Defines the separator through a regular expression
    • The default is \W+, i.e. non-word characters act as separators
  • Language
    • Provides analyzers for 30+ common languages
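The analyzers above can be compared quickly with the _analyze API from the previous section; a sketch:

    POST _analyze
    {
      "analyzer": "whitespace",
      "text": "Hello, World!"
    }
    // tokens: "Hello," and "World!" (split on spaces only; punctuation and case kept)

    POST _analyze
    {
      "analyzer": "simple",
      "text": "Hello, World!"
    }
    // tokens: "hello" and "world" (split on non-letters and lowercased)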

7. Chinese word segmentation

  • Difficulties
    • Chinese word segmentation means cutting a sequence of Chinese characters into individual words. In English, spaces act as natural delimiters between words, whereas in Chinese there is no formal delimiter
    • The same text may be segmented differently depending on context
  • Common word segmentation systems
    • IK
      • Implements segmentation of both Chinese and English
      • Supports custom dictionaries and hot dictionary updates
    • jieba
      • The most popular word segmentation system in Python, supporting segmentation and part-of-speech tagging
      • Supports traditional Chinese, custom dictionaries, parallel segmentation, etc.
    • Hanlp
      • A Java toolkit of models and algorithms aimed at bringing natural language processing to production environments
    • THULAC
      • THU Lexical Analyzer for Chinese, a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Laboratory of Tsinghua University, providing Chinese word segmentation and part-of-speech tagging

8. Custom analyzers: the Character Filter

  • If the built-in analyzers cannot meet your requirements, you can define a custom analyzer
    • By combining custom Character Filters, a Tokenizer, and Token Filters

Character Filter

  • Processes the raw text before the Tokenizer, e.g. adding, deleting, or replacing characters
  • The built-in character filters are:
    • HTML Strip: removes HTML tags and decodes HTML entities
    • Mapping: performs character replacements
    • Pattern Replace: performs regular-expression match-and-replace
  • Affects the position and offset information computed by the subsequent tokenizer
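A character filter can be tested on its own through _analyze; the snippet below (adapted from the html_strip example in the official docs) uses the keyword tokenizer so that only the filter's effect is visible:

    POST _analyze
    {
      "tokenizer": "keyword",
      "char_filter": ["html_strip"],
      "text": "<p>I&apos;m so <b>happy</b>!</p>"
    }
    // returns the single term "\nI'm so happy!\n"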

9. Custom analyzers: the Tokenizer

  • Splits the raw text into words (terms or tokens) according to certain rules
  • The built-in tokenizers are:
    • Standard: splits on word boundaries
    • Letter: splits on non-letter characters
    • Whitespace: splits on whitespace
    • UAX URL Email: splits like Standard, but keeps email addresses and URLs as single tokens
    • NGram and Edge NGram: n-gram segmentation
    • Path Hierarchy: splits on file path separators
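The path_hierarchy tokenizer, for example, produces one token per path prefix; a sketch:

    POST _analyze
    {
      "tokenizer": "path_hierarchy",
      "text": "/usr/local/bin"
    }
    // tokens: "/usr", "/usr/local", "/usr/local/bin"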

10. Custom analyzers: the Token Filter

  • Adds, deletes, or modifies the words output by the Tokenizer
  • The built-in token filters are:
    • Lowercase: converts all terms to lowercase
    • Stop: deletes stop words
    • NGram and Edge NGram: n-gram segmentation
    • Synonym: adds synonym terms
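Token filters can be chained and tested through _analyze as well; a sketch combining two of the filters above:

    POST _analyze
    {
      "tokenizer": "whitespace",
      "filter": ["lowercase", "stop"],
      "text": "The Quick Fox"
    }
    // tokens: "quick" and "fox" ("The" is lowercased, then removed as a stop word)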

11. Defining a custom analyzer

  • A custom analyzer is defined in the index settings
    PUT test_index_1
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_custom_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "char_filter": [
                "html_strip"
              ],
              "filter": [
                "lowercase",
                "asciifolding"
              ]
            }
          }
        }
      }
    }
  • Test the result
    POST test_index_1/_analyze
    {
      "analyzer": "my_custom_analyzer",
      "text": "Is this <b>a box</b>?"
    }

    {
      "tokens" : [
        {
          "token" : "is",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "this",
          "start_offset" : 3,
          "end_offset" : 7,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "a",
          "start_offset" : 11,
          "end_offset" : 12,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "box",
          "start_offset" : 13,
          "end_offset" : 20,
          "type" : "<ALPHANUM>",
          "position" : 3
        }
      ]
    }

12. When analysis is applied

  • Analysis is applied in the following two cases:
    • Index time: when a document is created or updated, its fields are analyzed
    • Search time: the query string is analyzed
  • In general there is no need to specify a separate search-time analyzer; use the index-time analyzer directly, otherwise queries may fail to match

Suggestions for using analyzers

  • Decide explicitly whether each field needs analysis; setting type to keyword for fields that do not saves space and improves write performance
  • Use the _analyze API to check exactly how a document will be tokenized
  • Experiment and test frequently

3. Mapping Settings for Elasticsearch

1. Introduction to mapping

  • Similar to the definition of a table structure in a database; its main functions are:
    • Define the field names under the index
    • Define the type of each field, such as numeric, string, or boolean
    • Define configuration related to the inverted index, such as whether to index a field and whether to record positions
    GET /test_index/_mapping

    {
      "test_index" : {
        "mappings" : {
          "doc" : {
            "properties" : {
              "age" : {
                "type" : "long"
              },
              "username" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        }
      }
    }

2. Customize the Mapping

  • Once a field's type is set in the mapping, do not change it directly, because:
    • Lucene's inverted index cannot be changed once it has been generated
    • If a change is required, create a new index with the new mapping and reindex the data into it
  • New fields can still be added
    • Automatic addition of fields is controlled by the dynamic parameter:
      • true (default): new fields are added to the mapping automatically
      • false: new fields are not added to the mapping; the document can still be written and the field is stored, but it cannot be queried
      • strict: documents containing unmapped fields are rejected with an error

3. Mapping demo

    PUT my_index
    {
      "mappings": {
        "doc": {
          "dynamic": false,
          "properties": {
            "title": {
              "type": "text"
            },
            "name": {
              "type": "keyword"
            },
            "age": {
              "type": "integer"
            }
          }
        }
      }
    }
  • Write data
    PUT my_index/doc/1
    {
      "title": "hello world",
      "desc": "nothing here"
    }

    {
      "_index" : "my_index",
      "_type" : "doc",
      "_id" : "1",
      "_version" : 1,
      "result" : "created",
      "_shards" : {
        "total" : 2,
        "successful" : 1,
        "failed" : 0
      },
      "_seq_no" : 0,
      "_primary_term" : 1
    }
  • Query data
    GET my_index/doc/_search
    {
      "query": {
        "match": {
          "title": "hello"
        }
      }
    }

    {
      "took" : 10,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 1,
        "max_score" : 0.2876821,
        "hits" : [
          {
            "_index" : "my_index",
            "_type" : "doc",
            "_id" : "1",
            "_score" : 0.2876821,
            "_source" : {
              "title" : "hello world",
              "desc" : "nothing here"
            }
          }
        ]
      }
    }

    GET my_index/doc/_search
    {
      "query": {
        "match": {
          "title": "here"
        }
      }
    }

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 0,
        "max_score" : null,
        "hits" : [ ]
      }
    }

4. The copy_to parameter

  • Copies the value of this field into the target field, similar in effect to _all
  • The copied value does not appear in _source; it is only used for search
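A sketch of copy_to (the index and field names below are illustrative), following the pattern from the official docs:

    PUT my_index2
    {
      "mappings": {
        "doc": {
          "properties": {
            "first_name": { "type": "text", "copy_to": "full_name" },
            "last_name":  { "type": "text", "copy_to": "full_name" },
            "full_name":  { "type": "text" }
          }
        }
      }
    }
    // a search on full_name now matches values written to first_name or last_name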

5. The index parameter

  • Controls whether the current field is indexed. The default is true, i.e. the field is indexed and searchable; false means the field is not indexed and cannot be searched
    PUT my_index
    {
      "mappings": {
        "doc": {
          "properties": {
            "cookie": {
              "type": "text",
              "index": false
            }
          }
        }
      }
    }

6. The index_options parameter

  • index_options controls what the inverted index records. The options are as follows:
    • docs: records only the doc IDs
    • freqs: records doc IDs and term frequencies
    • positions: records doc IDs, term frequencies, and term positions
    • offsets: records doc IDs, term frequencies, term positions, and character offsets
  • The default for the text type is positions; for other types the default is docs
  • The more that is recorded, the more space is used
    PUT my_index
    {
      "mappings": {
        "doc": {
          "properties": {
            "cookie": {
              "type": "text",
              "index_options": "offsets"
            }
          }
        }
      }
    }
  • null_value
    • When a field's value is null, ES ignores that value by default, so it cannot be searched. Setting null_value substitutes a default value for null so the field remains searchable
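A sketch of null_value (the index name is illustrative; note that the substitute value must match the field's type):

    PUT my_index3
    {
      "mappings": {
        "doc": {
          "properties": {
            "status_code": {
              "type": "keyword",
              "null_value": "NULL"
            }
          }
        }
      }
    }
    // a document written with "status_code": null can now be found by searching for "NULL"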

7. Data types

  • Core data types
    • String: text, keyword
    • Numeric types: long, integer, short, byte, double, float, half_float, scaled_float
    • Date type: date
    • Boolean type: boolean
    • Binary type: binary
    • Range types: integer_range, float_range, long_range, double_range, date_range
  • Complex data types
    • Array type Array
    • Object type Object
    • Nested type Nested object
  • Geolocation data type
    • geo_point
    • geo_shape
  • Special types
    • ip: stores IP addresses
    • completion: provides autocomplete
    • token_count: records the number of tokens in a string
    • murmur3: stores the hash value of a string
    • percolator
    • join
  • Multi-fields feature
    • The same field can be indexed with different configurations, such as different analyzers. A common example is pinyin search on a person's name, implemented by adding a pinyin subfield to the name field
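The pinyin subfield requires a separate analysis plugin; as a self-contained sketch of the same multi-field mechanism, the subfield below uses the built-in english analyzer instead (index and field names are illustrative):

    PUT my_index4
    {
      "mappings": {
        "doc": {
          "properties": {
            "name": {
              "type": "text",
              "fields": {
                "english": {
                  "type": "text",
                  "analyzer": "english"
                }
              }
            }
          }
        }
      }
    }
    // the subfield is queried as name.english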

8. Introduction to dynamic mapping

  • ES can automatically identify document field types, reducing the user's effort
    • ES infers the field type from the JSON type of each field in the document

9. Dynamic date and numeric detection

  • Automatic date detection is configurable, so the recognized date formats can be adapted to various requirements
  • A string containing a number is not automatically recognized as a numeric type by default, because digits inside a string are perfectly normal
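Both behaviors are controlled by mapping parameters; a sketch (the index name is illustrative):

    PUT my_index5
    {
      "mappings": {
        "doc": {
          "numeric_detection": true,
          "date_detection": false
        }
      }
    }
    // now "age": "22" is mapped as a numeric type, and date-like strings stay as text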

10. Introduction to dynamic templates

  • Allows the field type to be set dynamically based on the data type and field name recognized by ES, which can achieve effects such as:
    • Map all string types to keyword, i.e. not analyzed by default
    • Map all fields beginning with message to text, i.e. analyzed
    • Map all fields beginning with long_ to type long
    • Map all automatically detected doubles to float to save space
  • A matching rule generally uses the following parameters:
    • match_mapping_type: matches the field type automatically identified by ES, such as boolean, long, or string
    • match, unmatch: match the field name
    • path_match, path_unmatch: match the field path
    PUT test_index
    {
      "mappings": {
        "doc": {
          "dynamic_templates": [
            {
              "message_as_text": {
                "match_mapping_type": "string",
                "match": "message",
                "mapping": {
                  "type": "text"
                }
              }
            },
            {
              "string_as_keywords": {
                "match_mapping_type": "string",
                "mapping": {
                  "type": "keyword"
                }
              }
            }
          ]
        }
      }
    }

    PUT test_index/doc/1
    {
      "name": "Thpffcj",
      "message": "hello world"
    }
  • View the resulting mapping
GET test_index/_mapping

{
  "test_index" : {
    "mappings" : {
      "doc" : {
        "dynamic_templates" : [
          {
            "message_as_text" : {
              "match" : "message",
              "match_mapping_type" : "string",
              "mapping" : {
                "type" : "text"
              }
            }
          },
          {
            "string_as_keywords" : {
              "match_mapping_type" : "string",
              "mapping" : {
                "type" : "keyword"
              }
            }
          }
        ],
        "properties" : {
          "message" : {
            "type" : "text"
          },
          "name" : {
            "type" : "keyword"
          }
        }
      }
    }
  }
}

11. Suggestions for customizing mapping

  • To customize the Mapping, proceed as follows:
    • Write a document to the temporary index of ES to get the mapping automatically generated by ES
    • Modify the mapping obtained in Step 1 and customize related configurations
    • Use the mapping in Step 2 to create the required index

12. Index template
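Index templates automatically apply predefined settings and mappings to newly created indices whose names match a pattern. A minimal 6.x-style sketch (the template name, pattern, and fields below are illustrative):

PUT _template/nginx_log_template
{
  "index_patterns": ["nginx-log-*"],
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "doc": {
      "properties": {
        "message": { "type": "text" }
      }
    }
  }
}
// a new index named e.g. nginx-log-2017-01-04 picks up these settings and mappings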

Finally

You can follow my WeChat official account to learn and progress together.