1. Getting started with Elasticsearch

1. Glossary introduction

  • Document: a JSON document, the data that the user stores in ES
  • Index: a collection of documents with the same fields
  • Node: a running instance of Elasticsearch; one component of a cluster
  • Cluster: consists of one or more nodes and provides service to the outside world

2. Introduction to the Document

A document is a JSON object made up of fields. Common field data types are as follows:

  • String: text, keyword
  • Numeric types: long, integer, short, byte, double, float, half_float, scaled_float
  • Boolean: boolean
  • Date: date
  • Binary: binary
  • Range types: integer_range, float_range, long_range, double_range, date_range

Each document has a unique ID, which can be:

  • Specified by the user
  • Generated automatically by ES

Document MetaData

Metadata annotates a document with relevant information:

  • _index: the name of the index the document belongs to
  • _type: the name of the document type
  • _id: the unique ID of the document
  • _uid: a combined ID made up of _type and _id (since 6.x, _type is deprecated and _uid is equivalent to _id)
  • _source: the raw JSON data of the document, from which the contents of each field can be obtained
  • _all: concatenates the values of all fields into this one field; disabled by default

3. Introduction to the Index

  • The index stores documents with the same structure
    • Each index has its own mapping definition that defines field names and types
  • A cluster can have multiple indices; for example:
    • Nginx logs can be stored in one index per day, split by date
      • nginx-log-2017-01-01
      • nginx-log-2017-01-02
      • nginx-log-2017-01-03
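With one index per day, a single query can cover a whole date range by using a wildcard in the index name. A minimal sketch against the example indices above:

    // Search all Nginx log indices for January 2017 at once
    GET /nginx-log-2017-01-*/_search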

4. Introduction to the REST API

  • Elasticsearch provides RESTful APIs
    • REST – REpresentational State Transfer
    • The URI specifies the resource, such as an Index or a Document
    • The HTTP method indicates the type of operation on the resource, such as GET, POST, PUT, and DELETE
  • There are two common modes of interaction
    • The Curl command line
    • Kibana DevTools
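As a sketch, the same kind of request can be issued from the command line with curl against a local node (assuming Elasticsearch is listening on its default port 9200):

    # Create an index named test_index from the command line
    curl -XPUT "http://localhost:9200/test_index"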

5. index_api

ES provides a dedicated Index API for creating, updating, and deleting index configurations

    Thpffcj:elasticsearch-6.5.4 Thpffcj$ bin/elasticsearch
    Thpffcj:kibana-6.5.4-darwin-x86_64 Thpffcj$ bin/kibana
  • Visit localhost:5601 to open the Kibana graphical interface; from there we can use the REST API through Kibana DevTools
    PUT /test_index

    #! Deprecation: the default number of shards will change from [5] to [1] in 7.0.0; if you wish to continue using the default of [5] shards, you must manage this on the create index request or with an index template
    {
      "acknowledged" : true,
      "shards_acknowledged" : true,
      "index" : "test_index"
    }

    DELETE /test_index

    {
      "acknowledged" : true
    }

6. document_api

ES provides a dedicated Document API for the following operations:

  • Create a document

  • Query the document

  • Update the document

  • Delete the document

  • Create a document in test_index with type doc and id 1 (newer versions no longer have the concept of type)

    PUT /test_index/doc/1
    {
      "username": "thpffcj",
      "age": 22
    }

    {
      "_index" : "test_index",
      "_type" : "doc",
      "_id" : "1",
      "_version" : 1,
      "result" : "created",
      "_shards" : {
        "total" : 2,
        "successful" : 1,
        "failed" : 0
      },
      "_seq_no" : 0,
      "_primary_term" : 1
    }
  • Create a document without specifying an ID
    POST /test_index/doc
    {
      "username": "tom",
      "age": 20
    }

    {
      "_index" : "test_index",
      "_type" : "doc",
      "_id" : "yfg31mwBWHG_wS6wM641",
      "_version" : 1,
      "result" : "created",
      "_shards" : {
        "total" : 2,
        "successful" : 1,
        "failed" : 0
      },
      "_seq_no" : 0,
      "_primary_term" : 1
    }
  • Query a document by specifying its ID
    GET /test_index/doc/1

    {
      "_index" : "test_index",
      "_type" : "doc",
      "_id" : "1",
      "_version" : 1,
      "found" : true,
      "_source" : {
        "username" : "thpffcj",
        "age" : 22
      }
    }
  • To search all documents, use _search
    GET /test_index/doc/_search

    {
      "took" : 12,                 // query time, in ms
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 2,               // total number of matching documents
        "max_score" : 1.0,
        "hits" : [                 // document details, first 10 by default
          {
            "_index" : "test_index",           // index name
            "_type" : "doc",
            "_id" : "yfg31mwBWHG_wS6wM641",    // document id
            "_score" : 1.0,                    // document score
            "_source" : {                      // document content
              "username" : "tom",
              "age" : 20
            }
          },
          {
            "_index" : "test_index",
            "_type" : "doc",
            "_id" : "1",
            "_score" : 1.0,
            "_source" : {
              "username" : "thpffcj",
              "age" : 22
            }
          }
        ]
      }
    }
  • Specify search criteria
    GET /test_index/doc/_search
    {
      "query": {
        "term": {
          "_id": "1"
        }
      }
    }
  • ES can also operate on multiple documents in one request, reducing network transport overhead and increasing the write rate
    • The endpoint for batch writes is _bulk
    • The endpoint for batch reads is _mget
    POST _bulk
    {"index":{"_index":"test_index","_type":"doc","_id":"3"}}
    {"username":"lilei","age":10}
    {"delete":{"_index":"test_index","_type":"doc","_id":"1"}}
    {"update":{"_index":"test_index","_type":"doc","_id":"2"}}
    {"doc":{"age":100}}

    GET _mget
    {
      "docs": [
        {
          "_index": "test_index",
          "_type": "doc",
          "_id": "1"
        },
        {
          "_index": "test_index",
          "_type": "doc",
          "_id": "2"
        }
      ]
    }

2. Inverted index and analysis in Elasticsearch

1. Table of contents and index of books

How do you find the pages where the keyword "ACID" appears?

Books and search engines

  • The table of contents page corresponds to a forward index
  • Index pages correspond to inverted indexes

2. Introduction to forward and inverted indexes

  • Forward index
    • Maps a document ID to the document's content and words
  • Inverted index
    • Maps a word to the IDs of the documents containing it

  • Querying for documents that contain "search engine":
    • Through the inverted index, find that the document IDs for "search engine" are 1 and 3
    • Through the forward index, fetch the complete content of documents 1 and 3
    • Return the final result to the user

3. Inverted index details

  • The inverted index is the core of a search engine and consists of two main parts:
    • Term Dictionary
      • An important component of the inverted index
      • Records all the words appearing in the documents; it is usually large
      • Records the association from each word to its posting list
      • Typically implemented as a B+ tree
    • Posting List
      • Records the set of documents corresponding to each word; made up of posting entries
      • A posting entry contains the following information:
      • Document ID, used to fetch the original document
      • Term Frequency (TF): how often the word occurs in the document, used for subsequent relevance scoring
      • Position: the positions of the word within the document (there may be several), used for Phrase Queries
      • Offset: the start and end character offsets of the word in the document, used for highlighting

  • ES stores documents in JSON format containing multiple fields, and each field has its own inverted index

4. Introduction to analysis

  • Analysis (word segmentation) is the process of converting text into a series of words (terms or tokens); in ES this is called Analysis
  • The analyzer is the ES component dedicated to this work. It is composed of:
    • Character Filters: preprocess the raw text, e.g. removing special HTML markup
    • Tokenizer: splits the text into words according to certain rules
    • Token Filters: rework the words produced by the Tokenizer, e.g. lowercasing, deleting, or adding terms

5. analyze_api

ES provides an API for testing analyzers and verifying their output; the endpoint is _analyze

  • You can specify an analyzer directly for testing
    POST _analyze
    {
      "analyzer": "standard",     // the analyzer to test
      "text": "hello world!"      // the test text
    }

    {
      "tokens" : [
        {
          "token" : "hello",      // the token
          "start_offset" : 0,     // start offset
          "end_offset" : 5,       // end offset
          "type" : "<ALPHANUM>",
          "position" : 0          // token position
        },
        {
          "token" : "world",
          "start_offset" : 6,
          "end_offset" : 11,
          "type" : "<ALPHANUM>",
          "position" : 1
        }
      ]
    }
  • You can directly specify fields in the index for testing
    POST test_index/_analyze
    {
      "field": "username",
      "text": "hello world!"
    }
  • You can combine a custom tokenizer and filters for testing
    POST _analyze
    {
      "tokenizer": "standard",
      "filter": ["lowercase"],
      "text": "hello world!"
    }

6. Built-in analyzers

ES comes with the following analyzers

  • Standard
    • The default analyzer
    • Splits text into words; supports multiple languages
  • Simple
    • Splits on non-letter characters
  • Whitespace
    • Splits on whitespace
  • Stop
    • Like Simple, but additionally removes stop words
  • Keyword
    • Outputs the input directly as a single term, without tokenizing
  • Pattern
    • Defines the separator through a regular expression
    • The default is \W+, i.e. non-word characters act as separators
  • Language
    • Provides analyzers for 30+ common languages
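The analyzers above can be compared quickly with the _analyze API from the previous section; a sketch:

    POST _analyze
    {
      "analyzer": "whitespace",
      "text": "Hello, World!"
    }
    // tokens: "Hello," and "World!" (split on spaces only; punctuation and case kept)

    POST _analyze
    {
      "analyzer": "simple",
      "text": "Hello, World!"
    }
    // tokens: "hello" and "world" (split on non-letters and lowercased)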

7. Chinese word segmentation

  • Difficulties
    • Chinese word segmentation means cutting a sequence of Chinese characters into individual words. In English, spaces act as natural delimiters between words, whereas in Chinese there is no formal delimiter
    • The same text may be segmented differently depending on context
  • Common word segmentation systems
    • IK
      • Implements segmentation of both Chinese and English
      • Supports custom dictionaries and hot dictionary updates
    • jieba
      • The most popular word segmentation system in Python, supporting segmentation and part-of-speech tagging
      • Supports traditional Chinese, custom dictionaries, parallel segmentation, etc.
    • Hanlp
      • A Java toolkit of models and algorithms aimed at bringing natural language processing to production environments
    • THULAC
      • THU Lexical Analyzer for Chinese, a Chinese lexical analysis toolkit developed by the Natural Language Processing and Social Humanities Computing Laboratory of Tsinghua University, providing Chinese word segmentation and part-of-speech tagging

8. Custom analyzers: the Character Filter

  • If the built-in analyzers cannot meet your requirements, you can define a custom analyzer
    • By combining custom Character Filters, a Tokenizer, and Token Filters

Character Filter

  • Processes the raw text before the Tokenizer, e.g. adding, deleting, or replacing characters
  • The built-in character filters are:
    • HTML Strip: removes HTML tags and decodes HTML entities
    • Mapping: performs character replacements
    • Pattern Replace: performs regular-expression match-and-replace
  • Affects the position and offset information computed by the subsequent tokenizer
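A character filter can be tested on its own through _analyze; the snippet below (adapted from the html_strip example in the official docs) uses the keyword tokenizer so that only the filter's effect is visible:

    POST _analyze
    {
      "tokenizer": "keyword",
      "char_filter": ["html_strip"],
      "text": "<p>I&apos;m so <b>happy</b>!</p>"
    }
    // returns the single term "\nI'm so happy!\n"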

9. Custom analyzers: the Tokenizer

  • Splits the raw text into words (terms or tokens) according to certain rules
  • The built-in tokenizers are:
    • Standard: splits on word boundaries
    • Letter: splits on non-letter characters
    • Whitespace: splits on whitespace
    • UAX URL Email: splits like Standard, but keeps email addresses and URLs as single tokens
    • NGram and Edge NGram: n-gram segmentation
    • Path Hierarchy: splits on file path separators
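The path_hierarchy tokenizer, for example, produces one token per path prefix; a sketch:

    POST _analyze
    {
      "tokenizer": "path_hierarchy",
      "text": "/usr/local/bin"
    }
    // tokens: "/usr", "/usr/local", "/usr/local/bin"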

10. Custom analyzers: the Token Filter

  • Adds, deletes, or modifies the words output by the Tokenizer
  • The built-in token filters are:
    • Lowercase: converts all terms to lowercase
    • Stop: deletes stop words
    • NGram and Edge NGram: n-gram segmentation
    • Synonym: adds synonym terms
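Token filters can be chained and tested through _analyze as well; a sketch combining two of the filters above:

    POST _analyze
    {
      "tokenizer": "whitespace",
      "filter": ["lowercase", "stop"],
      "text": "The Quick Fox"
    }
    // tokens: "quick" and "fox" ("The" is lowercased, then removed as a stop word)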

11. Defining a custom analyzer

  • A custom analyzer is defined in the index settings
    PUT test_index_1
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_custom_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "char_filter": [
                "html_strip"
              ],
              "filter": [
                "lowercase",
                "asciifolding"
              ]
            }
          }
        }
      }
    }
  • Test the result
    POST test_index_1/_analyze
    {
      "analyzer": "my_custom_analyzer",
      "text": "Is this <b>a box</b>?"
    }

    {
      "tokens" : [
        {
          "token" : "is",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "this",
          "start_offset" : 3,
          "end_offset" : 7,
          "type" : "<ALPHANUM>",
          "position" : 1
        },
        {
          "token" : "a",
          "start_offset" : 11,
          "end_offset" : 12,
          "type" : "<ALPHANUM>",
          "position" : 2
        },
        {
          "token" : "box",
          "start_offset" : 13,
          "end_offset" : 20,
          "type" : "<ALPHANUM>",
          "position" : 3
        }
      ]
    }

12. When analysis is applied

  • Analysis is applied in the following two cases:
    • Index time: when a document is created or updated, its fields are analyzed
    • Search time: the query string is analyzed
  • In general there is no need to specify a separate search-time analyzer; use the index-time analyzer directly, otherwise queries may fail to match

Suggestions for using analyzers

  • Decide explicitly whether each field needs analysis; setting type to keyword for fields that do not saves space and improves write performance
  • Use the _analyze API to check exactly how a document will be tokenized
  • Experiment and test frequently

3. Mapping Settings for Elasticsearch

1. Introduction to mapping

  • Similar to the definition of a table structure in a database; its main functions are:
    • Define the field names under the index
    • Define the type of each field, such as numeric, string, or boolean
    • Define configuration related to the inverted index, such as whether to index a field and whether to record positions
    GET /test_index/_mapping

    {
      "test_index" : {
        "mappings" : {
          "doc" : {
            "properties" : {
              "age" : {
                "type" : "long"
              },
              "username" : {
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        }
      }
    }

2. Customize the Mapping

  • Once a field's type is set in the mapping, do not change it directly, because:
    • Lucene's inverted index cannot be changed once it has been generated
    • If a change is required, create a new index with the new mapping and reindex the data into it
  • New fields can still be added
    • Automatic addition of fields is controlled by the dynamic parameter:
      • true (default): new fields are added to the mapping automatically
      • false: new fields are not added to the mapping; the document can still be written and the field is stored, but it cannot be queried
      • strict: documents containing unmapped fields are rejected with an error

3. Mapping demo

    PUT my_index
    {
      "mappings": {
        "doc": {
          "dynamic": false,
          "properties": {
            "title": {
              "type": "text"
            },
            "name": {
              "type": "keyword"
            },
            "age": {
              "type": "integer"
            }
          }
        }
      }
    }
  • Write data
    PUT my_index/doc/1
    {
      "title": "hello world",
      "desc": "nothing here"
    }

    {
      "_index" : "my_index",
      "_type" : "doc",
      "_id" : "1",
      "_version" : 1,
      "result" : "created",
      "_shards" : {
        "total" : 2,
        "successful" : 1,
        "failed" : 0
      },
      "_seq_no" : 0,
      "_primary_term" : 1
    }
  • Query data
    GET my_index/doc/_search
    {
      "query": {
        "match": {
          "title": "hello"
        }
      }
    }

    {
      "took" : 10,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 1,
        "max_score" : 0.2876821,
        "hits" : [
          {
            "_index" : "my_index",
            "_type" : "doc",
            "_id" : "1",
            "_score" : 0.2876821,
            "_source" : {
              "title" : "hello world",
              "desc" : "nothing here"
            }
          }
        ]
      }
    }

    GET my_index/doc/_search
    {
      "query": {
        "match": {
          "title": "here"
        }
      }
    }

    {
      "took" : 2,
      "timed_out" : false,
      "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : 0,
        "max_score" : null,
        "hits" : [ ]
      }
    }

4. The copy_to parameter

  • Copies the value of this field into the target field, similar in effect to _all
  • The copied value does not appear in _source; it is only used for search
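A sketch of copy_to (the index and field names below are illustrative), following the pattern from the official docs:

    PUT my_index2
    {
      "mappings": {
        "doc": {
          "properties": {
            "first_name": { "type": "text", "copy_to": "full_name" },
            "last_name":  { "type": "text", "copy_to": "full_name" },
            "full_name":  { "type": "text" }
          }
        }
      }
    }
    // a search on full_name now matches values written to first_name or last_name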

5. The index parameter

  • Controls whether the current field is indexed. The default is true, i.e. the field is indexed and searchable; false means the field is not indexed and cannot be searched
    PUT my_index
    {
      "mappings": {
        "doc": {
          "properties": {
            "cookie": {
              "type": "text",
              "index": false
            }
          }
        }
      }
    }

6. The index_options parameter

  • index_options controls what the inverted index records. The options are as follows:
    • docs: records only the doc IDs
    • freqs: records doc IDs and term frequencies
    • positions: records doc IDs, term frequencies, and term positions
    • offsets: records doc IDs, term frequencies, term positions, and character offsets
  • The default for the text type is positions; for other types the default is docs
  • The more that is recorded, the more space is used
    PUT my_index
    {
      "mappings": {
        "doc": {
          "properties": {
            "cookie": {
              "type": "text",
              "index_options": "offsets"
            }
          }
        }
      }
    }
  • null_value
    • When a field's value is null, ES ignores that value by default, so it cannot be searched. Setting null_value substitutes a default value for null so the field remains searchable
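A sketch of null_value (the index name is illustrative; note that the substitute value must match the field's type):

    PUT my_index3
    {
      "mappings": {
        "doc": {
          "properties": {
            "status_code": {
              "type": "keyword",
              "null_value": "NULL"
            }
          }
        }
      }
    }
    // a document written with "status_code": null can now be found by searching for "NULL"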

7. Data types

  • Core data types
    • String: text, keyword
    • Numeric types: long, integer, short, byte, double, float, half_float, scaled_float
    • Date type: date
    • Boolean type: boolean
    • Binary type: binary
    • Range types: integer_range, float_range, long_range, double_range, date_range
  • Complex data types
    • Array type Array
    • Object type Object
    • Nested type Nested object
  • Geolocation data type
    • geo_point
    • geo_shape
  • Special types
    • ip: stores IP addresses
    • completion: provides autocomplete
    • token_count: records the number of tokens in a string
    • murmur3: stores the hash value of a string
    • percolator
    • join
  • Multi-fields feature
    • The same field can be indexed with different configurations, such as different analyzers. A common example is pinyin search on a person's name, implemented by adding a pinyin subfield to the name field
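The pinyin subfield requires a separate analysis plugin; as a self-contained sketch of the same multi-field mechanism, the subfield below uses the built-in english analyzer instead (index and field names are illustrative):

    PUT my_index4
    {
      "mappings": {
        "doc": {
          "properties": {
            "name": {
              "type": "text",
              "fields": {
                "english": {
                  "type": "text",
                  "analyzer": "english"
                }
              }
            }
          }
        }
      }
    }
    // the subfield is queried as name.english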

8. Introduction to dynamic mapping

  • ES can automatically identify document field types, reducing the user's effort
    • ES infers the field type from the JSON type of each field in the document

9. Dynamic date and numeric detection

  • Automatic date detection is configurable, so the recognized date formats can be adapted to various requirements
  • A string containing a number is not automatically recognized as a numeric type by default, because digits inside a string are perfectly normal
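Both behaviors are controlled by mapping parameters; a sketch (the index name is illustrative):

    PUT my_index5
    {
      "mappings": {
        "doc": {
          "numeric_detection": true,
          "date_detection": false
        }
      }
    }
    // now "age": "22" is mapped as a numeric type, and date-like strings stay as text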

10. Introduction to dynamic templates

  • Allows the field type to be set dynamically based on the data type and field name recognized by ES, which can achieve effects such as:
    • Map all string types to keyword, i.e. not analyzed by default
    • Map all fields beginning with message to text, i.e. analyzed
    • Map all fields beginning with long_ to type long
    • Map all automatically detected doubles to float to save space
  • A matching rule generally uses the following parameters:
    • match_mapping_type: matches the field type automatically identified by ES, such as boolean, long, or string
    • match, unmatch: match the field name
    • path_match, path_unmatch: match the field path
    PUT test_index
    {
      "mappings": {
        "doc": {
          "dynamic_templates": [
            {
              "message_as_text": {
                "match_mapping_type": "string",
                "match": "message",
                "mapping": {
                  "type": "text"
                }
              }
            },
            {
              "string_as_keywords": {
                "match_mapping_type": "string",
                "mapping": {
                  "type": "keyword"
                }
              }
            }
          ]
        }
      }
    }

    PUT test_index/doc/1
    {
      "name": "Thpffcj",
      "message": "hello world"
    }
  • View the resulting mapping
GET test_index/_mapping

{
  "test_index" : {
    "mappings" : {
      "doc" : {
        "dynamic_templates" : [
          {
            "message_as_text" : {
              "match" : "message",
              "match_mapping_type" : "string",
              "mapping" : {
                "type" : "text"
              }
            }
          },
          {
            "string_as_keywords" : {
              "match_mapping_type" : "string",
              "mapping" : {
                "type" : "keyword"
              }
            }
          }
        ],
        "properties" : {
          "message" : {
            "type" : "text"
          },
          "name" : {
            "type" : "keyword"
          }
        }
      }
    }
  }
}

11. Suggestions for customizing mapping

  • To customize the Mapping, proceed as follows:
    • Write a document to the temporary index of ES to get the mapping automatically generated by ES
    • Modify the mapping obtained in Step 1 and customize related configurations
    • Use the mapping in Step 2 to create the required index

12. Index template
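Index templates automatically apply predefined settings and mappings to newly created indices whose names match a pattern. A minimal 6.x-style sketch (the template name, pattern, and fields below are illustrative):

PUT _template/nginx_log_template
{
  "index_patterns": ["nginx-log-*"],
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "doc": {
      "properties": {
        "message": { "type": "text" }
      }
    }
  }
}
// a new index named e.g. nginx-log-2017-01-04 picks up these settings and mappings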

Finally

You can follow my WeChat official account to learn and progress together.