1. Search API for Elasticsearch

1. Search API overview

  • The data stored in ES can be queried and analyzed. The endpoint is _search
  • There are two main forms of query
    • URI Search
      • Easy to operate and convenient for quick tests from the command line
      • Supports only part of the query syntax
    • Request Body Search
      • Uses the Query DSL (Domain Specific Language), the complete query syntax provided by ES

2. URI Search

  • Search is implemented using URL query parameters; the commonly used ones are listed below (a combined example follows the list):
    • q specifies the query statement, written in Query String Syntax
    • df specifies the default field; if q does not name a field and df is not set, ES queries all fields
    • sort specifies the sort order
    • timeout specifies the timeout period; by default there is no timeout
    • from and size are used for paging
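As an illustrative sketch (parameter values here are arbitrary and only show how the options combine), one URI search can use all of them at once:

GET test_search_index/_search?q=alfred&df=username&sort=age:desc&from=0&size=10&timeout=1s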

Query String Syntax

  • Term vs. phrase
    • alfred way is equivalent to alfred OR way
    • "alfred way" (in quotes) is a phrase: the words must appear together in that order
  • Generic query
    • alfred on its own matches the term against all fields
  • Specified field
    • name:alfred
  • Grouping: use parentheses to group matching rules
    • (quick OR brown) AND fox
    • status:(active OR pending) title:(full text search)
PUT test_search_index
{
  "settings": {
    "index": {
      "number_of_shards": "1"
    }
  }
}

POST test_search_index/doc/_bulk
{"index":{"_id":"1"}}
{"username":"alfred way","job":"java engineer","age":18,"birth":"1990-01-02","isMarried":false}
{"index":{"_id":"2"}}
{"username":"alfred","job":"java senior engineer and java specialist","age":28,"birth":"1980-05-07","isMarried":true}
{"index":{"_id":"3"}}
{"username":"lee","job":"java and ruby engineer","age":22,"birth":"1985-08-07","isMarried":false}
{"index":{"_id":"4"}}
{"username":"alfred junior way","job":"ruby engineer","age":23,"birth":"1989-08-07","isMarried":false}
  • Now let's run some real queries, starting with a generic query that returns documents containing alfred in any field
GET test_search_index/_search?q=alfred

{
  "took": 29,
  "timed_out": false,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
  "hits": {
    "total": 3,
    "max_score": 1.2039728,
    "hits": [
      {"_index": "test_search_index", "_type": "doc", "_id": "2", "_score": 1.2039728,
       "_source": {"username": "alfred", "job": "java senior engineer and java specialist", "age": 28, "birth": "1980-05-07", "isMarried": true}},
      {"_index": "test_search_index", "_type": "doc", "_id": "1", "_score": 0.33698124,
       "_source": {"username": "alfred way", "job": "java engineer", "age": 18, "birth": "1990-01-02", "isMarried": false}},
      {"_index": "test_search_index", "_type": "doc", "_id": "4", "_score": 0.27601978,
       "_source": {"username": "alfred junior way", "job": "ruby engineer", "age": 23, "birth": "1989-08-07", "isMarried": false}}
    ]
  }
}
  • To see how ES actually executes the query, use the profile parameter:
GET test_search_index/_search?q=alfred
{
  "profile": true
}
  • Query by field
GET test_search_index/_search?q=username:alfred
  • Note that in q=username:alfred way, only alfred is bound to username; way is a generic term matched against all fields, so matching either condition is enough:
GET test_search_index/_search?q=username:alfred way
{
  "profile": true
}
  • There are two ways to bind both terms to username: a quoted phrase or a parenthesized group
GET test_search_index/_search?q=username:"alfred way"
{
  "profile": true
}

GET test_search_index/_search?q=username:(alfred way)
{
  "profile": true
}
  • Boolean operator
    • AND (&&), OR (||), NOT (!)
    • name:(tom NOT lee)
    • Note that the operators must be uppercase
  • + and - correspond to must and must_not, respectively
    • name:(tom +lee -alfred)
    • In a URL, + is parsed as a space, so it must be URL-encoded as %2B
GET test_search_index/_search?q=username:alfred AND way
{
  "profile": true
}

GET test_search_index/_search?q=username:(alfred AND way)
{
  "profile": true
}

GET test_search_index/_search?q=username:(alfred NOT way)

GET test_search_index/_search?q=username:(alfred +way)
{
  "profile": true
}

GET test_search_index/_search?q=username:(alfred %2Bway)
{
  "profile": true
}
  • Range query, supporting values and dates
    • Interval notation: [] for closed intervals, {} for open intervals
    • Comparison-operator notation: >, >=, <, <=
GET test_search_index/_search?q=username:alfred age:>26

GET test_search_index/_search?q=username:alfred AND age:>20

GET test_search_index/_search?q=birth:(>1980 AND <1990)
  • Wildcard query
    • ? matches a single character; * matches zero or more characters
    • Wildcard queries are not recommended because they are inefficient and can use a lot of memory
    • Unless absolutely necessary, do not put the wildcard at the beginning of the term
GET test_search_index/_search?q=username:alf*
  • Regular expression matching
GET test_search_index/_search?q=username:/[a]?l.*/
  • Fuzzy matching Fuzzy Query
    • name:roam~1
    • Matches words within an edit distance of 1 from roam, such as foam or roams
  • Proximity search
    • "fox quick"~5
    • Similar to fuzzy matching, but the distance is measured in words (terms) rather than characters
GET test_search_index/_search?q=username:alfed

GET test_search_index/_search?q=username:alfed~1

GET test_search_index/_search?q=username:alfd~2

GET test_search_index/_search?q=job:"java engineer"

GET test_search_index/_search?q=job:"java engineer"~1

GET test_search_index/_search?q=job:"java engineer"~2

3. Introduction to Query DSL

  • The query is sent to ES in the HTTP request body and contains the following parameters (a combined example follows this list):
    • query: the query statement, written in Query DSL syntax
    • from and size, for paging
    • timeout
    • sort
    • …
  • The JSON-defined query language includes the following two types:
    • Field-level queries
      • e.g. term, match, range: query a single field
    • Compound queries
      • e.g. bool query: contains one or more field-level queries or other compound queries
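As an illustrative sketch (values are arbitrary), a request body combining these parameters might look like this:

GET test_search_index/_search
{
  "query": {
    "match": {"username": "alfred"}
  },
  "from": 0,
  "size": 10,
  "sort": [
    {"age": "desc"}
  ],
  "timeout": "1s"
}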

4. Field query introduction and match-query

  • Field queries fall into the following two categories:
    • Full-text matching
      • Full-text retrieval against text fields; the query string is analyzed first, e.g. match, match_phrase
    • Term-level matching
      • The query string is not analyzed and is matched directly against the field's inverted index, e.g. term, terms, range
GET test_search_index/_search
{
  "query": {
    "match": {
      "username": "alfred way"
    }
  }
}
  • The operator parameter controls how the words are combined; the options are or and and
GET test_search_index/_search
{
  "query": {
    "match": {
      "username": {
        "query": "alfred way",
        "operator": "and"
      }
    }
  }
}
  • The minimum_should_match parameter controls the number of words to match
GET test_search_index/_search
{
  "query": {
    "match": {
      "job": {
        "query": "java ruby engineer",
        "minimum_should_match": "3"
      }
    }
  }
}

5. Relevance scoring

  • The relevance score measures how well a document matches the query statement
    • The inverted index gives us the list of documents matching the query; how do we put the documents that best satisfy the user's query at the front?
    • It is essentially a sorting problem, and relevance is the basis for that sorting
  • The important concepts in relevance scoring are as follows:
    • Term Frequency (TF): how many times a term appears in the document; the higher the frequency, the higher the relevance
    • Document Frequency (DF): the number of documents in which the term appears
    • Inverse Document Frequency (IDF): the inverse of document frequency, roughly 1/DF; the fewer documents a term appears in, the higher the relevance
    • Field-length Norm: the shorter the field, the higher the relevance
  • ES currently has two main relevance scoring models:
    • TF/IDF model
    • BM25 model: the default model since 5.x

  • You can use the explain parameter to see the exact calculation, but note that:
    • ES computes scores per shard, i.e. each shard scores independently, so pay attention to the number of shards when using explain
    • You can avoid this problem by setting the number of shards in the index to 1
GET test_search_index/_search
{
  "explain": true,
  "query": {
    "match": {
      "username": "alfred way"
    }
  }
}
  • In the BM25 model, BM stands for Best Match and 25 refers to the 25th iteration of the algorithm; it is an optimization of TF/IDF
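For reference, a commonly cited simplified form of the BM25 score is sketched below (not taken from this article; k1 and b are tuning parameters with usual defaults around 1.2 and 0.75, f(t,d) is the term frequency, |d| the field length, avgdl the average field length):

\[ \mathrm{score}(q,d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)} \]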

6. match-phrase-query

  • Performs a phrase search against a field; the terms must appear in the specified order
GET test_search_index/_search
{
  "query": {
    "match_phrase": {
      "job": "java engineer"
    }
  }
}


GET test_search_index/_search
{
  "query": {
    "match_phrase": {
      "job": "engineer java"
    }
  }
}
  • The slop parameter controls the spacing between words
GET test_search_index/_search
{
  "query": {
    "match_phrase": {
      "job": {
        "query": "java engineer",
        "slop": "2"
      }
    }
  }
}

7. query-string-query

  • Similar to the q parameter query in URI Search
GET test_search_index/_search
{
  "profile": true,
  "query": {
    "query_string": {
      "default_field": "username",
      "query": "alfred AND way"
    }
  }
}

GET test_search_index/_search
{
  "profile": true,
  "query": {
    "query_string": {
      "fields": ["username", "job"],
      "query": "alfred OR (java AND ruby)"
    }
  }
}

8. simple-query-string-query

  • Similar to query_string, but it ignores invalid query syntax and supports only part of the syntax
GET test_search_index/_search
{
  "profile": true,
  "query": {
    "simple_query_string": {
      "query": "alfred +way \"java",
      "fields": ["username"]
    }
  }
}

GET test_search_index/_search
{
  "query": {
    "query_string": {
      "default_field": "username",
      "query": "alfred +way \"java"
    }
  }
}

9. term-terms-query

  • The query string is matched as a whole term against the field, i.e. it is not analyzed
GET test_search_index/_search
{
  "query": {
    "term": {
      "username": "alfred"
    }
  }
}

GET test_search_index/_search
{
  "query": {
    "term": {
      "username": "alfred way"
    }
  }
}
  • terms: pass in multiple terms for a single query
GET test_search_index/_search
{
  "query": {
    "terms": {
      "username": [
        "alfred",
        "way"
      ]
    }
  }
}

10. range-query

  • Range queries focus on numeric and date types
GET test_search_index/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 10,
        "lte": 30
      }
    }
  }
}

GET test_search_index/_search
{
  "query": {
    "range": {
      "birth": {
        "gte": "1980-01-01"
      }
    }
  }
}
  • Date math provides a friendlier way to express date calculations
GET test_search_index/_search
{
  "query": {
    "range": {
      "birth": {
        "gte": "now-30y"
      }
    }
  }
}

GET test_search_index/_search
{
  "query": {
    "range": {
      "birth": {
        "gte": "2010||-20y"
      }
    }
  }
}

11. Introduction to compound query and ConstantScore

  • A compound query contains field-level queries or other compound queries; it includes the following types:
    • constant_score_query
    • bool query
    • dis_max query
    • function_score_query
    • boosting query

Constant Score Query

  • This query sets the score of every document matched by its inner filter to 1 (or to the boost value)
    • Mostly used in combination with bool query to achieve a custom score
GET test_search_index/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "match": {
          "username": "alfred"
        }
      }
    }
  }
}

GET test_search_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "match": {"job": "java"}
            }
          }
        },
        {
          "constant_score": {
            "filter": {
              "match": {"job": "ruby"}
            }
          }
        }
      ]
    }
  }
}

12. bool-query

  • A bool query consists of one or more Boolean clauses, of the following four kinds:
    • filter: only keeps documents matching the condition; no relevance score is calculated
    • must: documents must match all conditions in must, which affects the relevance score
    • must_not: documents must not match any condition in must_not
    • should: documents may match the conditions in should, which affects the relevance score

Filter

  • A filter clause only filters documents matching the condition and does not compute relevance
    • ES caches filters intelligently, so they execute very efficiently
    • For simple matching where scoring is not needed, it is recommended to use filter instead of query
GET test_search_index/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"username": "alfred"}}
      ]
    }
  }
}

Must

GET test_search_index/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"username": "alfred"}},
        {"match": {"job": "specialist"}}
      ]
    }
  }
}

Must_Not

GET test_search_index/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"job": "java"}}
      ],
      "must_not": [
        {"match": {"job": "ruby"}}
      ]
    }
  }
}

should

  • When a bool query contains only should, the document must satisfy at least one of the conditions
    • Minimum_should_match controls the number or percentage of conditions that are met
GET test_search_index/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"username": "junior"}},
        {"match": {"job": "ruby"}}
      ]
    }
  }
}

GET test_search_index/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": {"job": "java"}},
        {"term": {"job": "ruby"}},
        {"term": {"job": "specialist"}}
      ],
      "minimum_should_match": 2
    }
  }
}
  • When you include both should and must, the document doesn’t have to satisfy the condition in should, but if it does, it increases the relevance score
GET test_search_index/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"username": "alfred"}}
      ],
      "should": [
        {"term": {"job": "ruby"}}
      ]
    }
  }
}
  • ES executes a query differently depending on whether it appears in a Query context or a Filter context
    • Query context: finds the documents that best match the query, scoring and sorting all of them by relevance
    • Filter context: only determines which documents match the query; no scoring is done
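A minimal sketch contrasting the two contexts: the must clause runs in query context and is scored, while the filter clause only filters:

GET test_search_index/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"job": "java"}}
      ],
      "filter": [
        {"term": {"username": "alfred"}}
      ]
    }
  }
}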

13. count and source filtering

  • Get the number of documents that match the conditions. Endpoint is _count
GET test_search_index/_count
{
  "query": {
    "match": {
      "username": "alfred"
    }
  }
}

source filtering

  • Filter the fields in _source in the returned result
GET test_search_index/_search

GET test_search_index/_search?_source=username

GET test_search_index/_search
{
  "_source": false
}

GET test_search_index/_search
{
  "_source": ["username", "age"]
}

GET test_search_index/_search
{
  "_source": {
    "includes": "*i*",
    "excludes": "birth"
  }
}

2. How Elasticsearch search works

1. Query Then Fetch

  • A search is executed in two phases
    • The Query phase: each shard runs the query locally and returns matching document IDs and scores to the coordinating node, which merges and sorts them
    • The Fetch phase: the coordinating node fetches the _source of the top documents from the relevant shards by ID

2. Relevance scoring

  • Relevance scores are computed independently on each shard, which means statistics such as a term's IDF can differ between shards; a document's score therefore depends on which shard it lives on
  • When the number of documents is small, the relevance calculation can be severely skewed
POST test_search_relevance/doc
{
  "name":"hello"
}

POST test_search_relevance/doc
{
  "name":"hello,world"
}

POST test_search_relevance/doc
{
  "name":"hello,world! a beautiful world"
}

GET test_search_relevance/_search
{
  "explain": true,
  "query": {
    "match": {"name": "hello"}
  }
}
  • There are two ways to solve the problem:
    • Set the number of shards to 1, which eliminates the problem at its root; this is reasonable when the document count is modest, e.g. millions to tens of millions of documents
    • Use DFS Query Then Fetch
  • DFS Query Then Fetch first gathers the statistics from all shards and only then computes the relevance score, so the result is accurate, but it consumes more CPU and memory and performs worse; it is generally not recommended. Usage is as follows:
GET test_search_relevance/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": {"name": "hello"}
  }
}

3. Sorting, doc values and fielddata

  • ES sorts by relevance by default; users can customize the sorting rules with the sort parameter
GET test_search_index/_search
{
  "query": {
    "match": {"username": "alfred"}
  },
  "sort": {
    "birth": "desc"
  }
}

GET test_search_index/_search
{
  "query": {
    "match": {"username": "alfred"}
  },
  "sort": [
    {"birth": "desc"},
    {"_score": "desc"},
    {"_doc": "desc"}
  ]
}
  • Sorting on string fields is special because ES has both text and keyword types
    • Sorting on a text field raises an error
    • Sorting on a keyword field returns the expected results
GET test_search_index/_search
{
  "sort": {
    "username.keyword": "desc"
  }
}

The sorting process

  • Sorting is essentially sorting on the original field values. The inverted index is of no use here; a forward index is needed, i.e. a way to quickly get the original field value from a document ID and field name
  • Es provides two ways to do this:
    • Fielddata is disabled by default
    • Doc values are enabled by default, except for the text type

Fielddata

  • Fielddata is off by default and can be turned on using the following API:
    • In that case, strings are sorted by their analyzed terms, so the result often does not match expectations
    • It is usually only enabled when aggregation analysis is needed on analyzed (text) fields
PUT test_search_index/_mapping/doc
{
  "properties": {
    "job": {
      "type": "text",
      "fielddata": true
    }
  }
}

Doc Values

  • Doc Values are enabled by default and can be turned off at index creation time:
    • If you want to enable Doc Values again, you need to perform reindex
PUT test_doc_values/_mapping/doc
{
  "properties": {
    "username": {
      "type": "keyword",
      "doc_values": false
    },
    "hobby": {
      "type": "keyword"
    }
  }
}

docvalue_fields

  • You can use this parameter to return the content stored in fielddata or doc values
GET test_search_index/_search
{
  "docvalue_fields": [
    "username",
    "username.keyword",
    "age"
  ]
}

4. Paging and traversal: from/size

  • ES provides three ways to handle paging and traversal:
    • from/size
      • from specifies the start position
      • size specifies the number of documents to return
    • scroll
    • search_after
  • Deep paging is a classic problem: how do you get documents 990 to 1000 when the data is split across shards?
    • Each shard first returns its top 1000 documents; the Coordinating Node then merges the results of all shards, sorts them, and takes the requested documents
    • The deeper the page, the more documents are processed, the more memory is used and the longer it takes, so try to avoid deep paging; ES limits from + size to 10000 via index.max_result_window
GET test_search_index/_search
{
  "from": 0,
  "size": 2
}

GET test_search_index/_search
{
  "from": 10000,
  "size": 2
}

5. Paging and traversal: scroll

  • Scroll is an API for iterating over a snapshot of the matching document set, which avoids the deep paging problem
    • It can’t be used for real-time search, because the data isn’t real-time
    • Try not to use complex sort conditions, _doc is most efficient
    • It’s a little complicated to use
  • The first step is to initiate a scroll search
    • Es creates a snapshot of the collection of document ids based on the query criteria after receiving the request
GET test_search_index/_search?scroll=5m
{
  "size": 1
}

(size specifies the number of documents returned per scroll call)
  • The second step is to use the Scroll Search API to get the collection of documents
    • Iterate over the call until the hits.hits array is returned empty
POST _search/scroll
{
  "scroll": "5m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAABswWX3FLSTZFOF9URFdqWHlvX3gtYmhtdw=="
}
  • Because scroll works on a snapshot, newly indexed documents cannot be retrieved
PUT test_search_index/doc/10
{
  "username":"doc10"
}
  • Scroll snapshots occupy a large amount of memory; use the clear scroll API to delete snapshots that are no longer needed

    DELETE _search/scroll/_all

6. Paging and traversal: search_after

  • Avoids the performance problems of deep paging and provides real-time retrieval of the next page of documents
    • The drawback is that you cannot use the from parameter, i.e. you cannot jump to an arbitrary page
    • It can only go to the next page, not back to a previous one
    • It is simple to use
  • The first step is a normal search, but the sort must include a value that is unique per document
GET test_search_index/_search
{
  "size": 1,
  "sort": {
    "age": "desc",
    "_id": "desc"
  }
}
  • The second step is to query using the sort value of the last document in the previous step
GET test_search_index/_search
{
  "size": 1,
  "search_after": [23, "4"],
  "sort": {
    "age": "desc",
    "_id": "desc"
  }
}

Application scenarios

  • from/size: you need the top documents in real time and need to jump freely between pages
  • scroll: you need all documents, for example to export all data
  • search_after: you need all documents in real time but do not need to jump between pages

3. Aggregation analysis for Elasticsearch

1. Introduction to aggregation analysis

  • Search engines are used to answer questions like:
    • Could you please tell me all the orders addressed to Shanghai?
    • Please tell me all orders created in the last 1 day that have not been paid?
  • Aggregation analysis can answer the following questions:
    • Please tell me the daily order volume in the last week.
    • Please tell me the average daily order amount of the recent one month.
    • Could you please tell me what are the top five best-selling items in the last six months?

Aggregate analysis

  • Aggregation analysis, or Aggregation, is a statistical analysis of ES data provided by ES in addition to the search function
    • With rich functions, Bucket, Metric, Pipeline and other analysis methods can meet most analysis requirements
    • With high real-time performance, all calculation results are returned in time, while big data systems such as Hadoop are generally T+1

Classification

  • For ease of understanding, ES divides aggregation analysis into the following four categories
    • Bucket: bucketing, similar to GROUP BY in SQL
    • Metric: metric analysis, such as computing the maximum, minimum, average, etc.
    • Pipeline: pipeline analysis, which re-analyzes the results of other aggregations
    • Matrix: matrix analysis

2. Metric Aggregation analysis

POST test_search_index/doc/_bulk
{"index":{"_id":"1"}}
{"username":"alfred way","job":"java engineer","age":18,"birth":"1990-01-02","isMarried":false,"salary":10000}
{"index":{"_id":"2"}}
{"username":"tom","job":"java senior engineer","age":28,"birth":"1980-05-07","isMarried":true,"salary":30000}
{"index":{"_id":"3"}}
{"username":"lee","job":"ruby engineer","age":22,"birth":"1985-08-07","isMarried":false,"salary":15000}
{"index":{"_id":"4"}}
{"username":"Nick","job":"web engineer","age":23,"birth":"1989-08-07","isMarried":false,"salary":8000}
{"index":{"_id":"5"}}
{"username":"Niko","job":"web engineer","age":18,"birth":"1994-08-07","isMarried":false,"salary":5000}
{"index":{"_id":"6"}}
{"username":"Michell","job":"ruby engineer","age":26,"birth":"1987-08-07","isMarried":false,"salary":12000}
  • Metric aggregations fall into the following two categories:
    • Single-value analysis, which outputs a single result
      • min, max, avg, sum
      • cardinality
    • Multi-value analysis, which outputs multiple results
      • stats, extended stats
      • percentiles, percentile ranks
      • top hits

Min

  • Returns the minimum value of a numeric class field
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "min_age": {
      "min": {"field": "age"}
    }
  }
}

Max

  • Returns the maximum value of a numeric class field
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "max_age": {
      "max": {"field": "age"}
    }
  }
}

Avg

  • Returns the average value of a numeric class field
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "avg_age": {
      "avg": {"field": "age"}
    }
  }
}

Sum

  • Returns the sum of numeric class fields
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "sum_age": {
      "sum": {"field": "age"}
    }
  }
}
  • Returns multiple aggregate results at once
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "min_age": {
      "min": {"field": "age"}
    },
    "max_age": {
      "max": {"field": "age"}
    },
    "avg_age": {
      "avg": {"field": "age"}
    },
    "sum_age": {
      "sum": {"field": "age"}
    }
  }
}

Cardinality

  • Cardinality is the number of distinct values of a field, similar to count(distinct field) in SQL
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "count_of_job": {
      "cardinality": {"field": "job.keyword"}
    }
  }
}

Stats

  • Returns a set of statistics for a numeric field, including min, max, avg, sum and count
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "stats_age": {
      "stats": {"field": "age"}
    }
  }
}

Extended Stats

  • An extension of stats that adds more statistics, such as variance and standard deviation
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "exstats_salary": {
      "extended_stats": {"field": "salary"}
    }
  }
}

Percentile

  • Percentile statistics
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "per_salary": {
      "percentiles": {"field": "salary"}
    }
  }
}

GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "per_age": {
      "percentiles": {
        "field": "salary",
        "percents": [95, 99, 99.9]
      }
    }
  }
}

Percentile Rank

  • Percentile rank statistics: for the given values, returns the percentile each value falls at
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "per_salary": {
      "percentile_ranks": {
        "field": "salary",
        "values": [11000, 30000]
      }
    }
  }
}

Top Hits

  • Generally used to get the top matching documents within each bucket, i.e. the detail data
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "top_employee": {
          "top_hits": {
            "size": 10,
            "sort": [
              {"age": {"order": "desc"}}
            ]
          }
        }
      }
    }
  }
}

3. Bucket aggregation analysis

  • Bucket aggregations allocate documents to different buckets according to a given rule, so they can be analyzed per category
  • Based on Bucket splitting policy, common Bucket aggregation analysis is as follows:
    • Terms
    • Range
    • Date Range
    • Histogram
    • Date Histogram

Terms

  • The simplest bucketing strategy: one bucket per term. If the field type is text, buckets are built from the analyzed terms
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job",
        "size": 5
      }
    }
  }
}

Range

  • Sets bucket splitting rules based on the range of specified values
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "salary_range": {
      "range": {
        "field": "salary",
        "ranges": [
          {"key": "<10000", "to": 10000},
          {"from": 10000, "to": 20000},
          {"key": ">20000", "from": 20000}
        ]
      }
    }
  }
}

Date Range

  • Set bucket splitting rules by specifying a range of dates
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "date_range": {
      "range": {
        "field": "birth",
        "format": "yyyy",
        "ranges": [
          {"from": "1980", "to": "1990"},
          {"from": "1990", "to": "2000"},
          {"from": "2000"}
        ]
      }
    }
  }
}

Histogram

  • Histogram: splits the data into buckets of a fixed interval
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "salary_hist": {
      "histogram": {
        "field": "salary",
        "interval": 5000,
        "extended_bounds": {
          "min": 0,
          "max": 40000
        }
      }
    }
  }
}

Date Histogram

  • A histogram over dates, a very common aggregation in time-series data analysis
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "by_year": {
      "date_histogram": {
        "field": "birth",
        "interval": "year",
        "format": "yyyy"
      }
    }
  }
}

4. Bucket and metric aggregation analysis

  • Bucket aggregation analysis allows further analysis by adding subanalyses, either Bucket or Metric, which makes ES extremely powerful
  • Bucketing inside buckets (sub-bucketing)
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "age_range": {
          "range": {
            "field": "age",
            "ranges": [
              {"to": 20},
              {"from": 20, "to": 30},
              {"from": 30}
            ]
          }
        }
      }
    }
  }
}
  • Metric analysis within each bucket
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "salary": {
          "stats": {"field": "salary"}
        }
      }
    }
  }
}

5. Pipeline aggregation analysis

  • Re-aggregate analysis of the results of the aggregate analysis, with support for chain calls, can answer the following questions:
    • What is the average monthly sales of orders?
  • Pipeline results are added to the existing output; based on where the result is placed, they fall into the following two categories:
    • Parent: the result is embedded inside the existing aggregation result
      • Derivative
      • Moving Average
      • Cumulative Sum
    • Sibling: the result sits alongside the existing aggregation results
      • Max/Min/Avg/Sum Bucket
      • Stats/Extended Stats Bucket
      • Percentiles Bucket

Min Bucket

GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {"field": "salary"}
        }
      }
    },
    "min_salary_by_job": {
      "min_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}

Derivative

  • Compute the derivative of the Bucket value
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "birth": {
      "date_histogram": {
        "field": "birth",
        "interval": "year",
        "min_doc_count": 0
      },
      "aggs": {
        "avg_salary": {
          "avg": {"field": "salary"}
        },
        "derivative_avg_salary": {
          "derivative": {
            "buckets_path": "avg_salary"
          }
        }
      }
    }
  }
}

Moving Average

  • Calculates the moving average of Bucket values
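A minimal sketch, reusing the yearly birth histogram and avg_salary metric from the derivative example above (the pipeline aggregation is called moving_avg in the ES versions this article appears to target; newer releases replace it with moving_fn):

GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "birth": {
      "date_histogram": {
        "field": "birth",
        "interval": "year",
        "min_doc_count": 0
      },
      "aggs": {
        "avg_salary": {
          "avg": {"field": "salary"}
        },
        "moving_avg_salary": {
          "moving_avg": {"buckets_path": "avg_salary"}
        }
      }
    }
  }
}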

Cumulative Sum

  • Calculates the cumulative sum of bucket values
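A minimal sketch along the same lines, again assuming the yearly birth histogram used above:

GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "birth": {
      "date_histogram": {
        "field": "birth",
        "interval": "year",
        "min_doc_count": 0
      },
      "aggs": {
        "avg_salary": {
          "avg": {"field": "salary"}
        },
        "cumulative_salary": {
          "cumulative_sum": {"buckets_path": "avg_salary"}
        }
      }
    }
  }
}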

6. Scope of action

  • The default scope of ES aggregation analysis is the result set of Query, and the scope can be changed as follows:
    • filter
    • post_filter
    • global

filter

  • Filter criteria are set for an aggregation analysis, changing scope without changing the overall Query statement
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs_salary_small": {
      "filter": {
        "range": {
          "salary": {"to": 10000}
        }
      },
      "aggs": {
        "jobs": {
          "terms": {"field": "job.keyword"}
        }
      }
    },
    "jobs": {
      "terms": {"field": "job.keyword"}
    }
  }
}

post_filter

  • Filters the returned documents, but takes effect after the aggregation analysis has run, so it does not affect the aggregation results
GET test_search_index/_search
{
  "aggs": {
    "jobs": {
      "terms": {"field": "job.keyword"}
    }
  },
  "post_filter": {
    "match": {"job.keyword": "java engineer"}
  }
}

global

  • Analysis is performed based on all documents regardless of query filtering conditions
GET test_search_index/_search
{
  "query": {
    "match": {
      "job.keyword": "java engineer"
    }
  },
  "aggs": {
    "java_avg_salary": {
      "avg": {"field": "salary"}
    },
    "all": {
      "global": {},
      "aggs": {
        "avg_salary": {
          "avg": {"field": "salary"}
        }
      }
    }
  }
}

7. Sorting buckets

  • You can sort buckets using built-in keys, for example:
    • _count: the document count of the bucket
    • _key: the bucket's key value
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10,
        "order": [
          {"avg_salary": "desc"}
        ]
      },
      "aggs": {
        "avg_salary": {
          "avg": {"field": "salary"}
        }
      }
    }
  }
}
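For comparison, a minimal sketch ordering the buckets by the built-in _count key instead of a sub-aggregation:

GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10,
        "order": {"_count": "asc"}
      }
    }
  }
}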

4. Data Modeling for Elasticsearch

1. Introduction to data modeling

  • Data Modeling, the process of creating a Data model
  • Data Model
    • A tool or method for abstracting the real world
    • Mapping to the real world is achieved by describing business rules in the form of abstract entities and relationships between them

Process of data modeling

  • Conceptual model
    • Defines the system's core requirements and scope boundaries, and designs the entities and the relationships between them
  • Logical model
    • Further refines the business requirements, identifying each entity's attributes, relationships, constraints, and so on
  • Physical model
    • Based on a specific database product, finalizes the definitions while satisfying read/write performance requirements
    • MySQL, MongoDB, Elasticsearch, etc.
    • Third normal form

2. Introduction to ES data modeling configuration

  • ES is a storage system built on Lucene's inverted index; it does not follow the normalization conventions of relational databases

Mapping field settings

  • enabled
    • true | false
    • Only store the field; it cannot be searched or used in aggregation analysis
  • index
    • true | false
    • Whether to build an inverted index
  • index_options
    • docs | freqs | positions | offsets
    • What information is stored in the inverted index
  • norms
    • true | false
    • Whether to store normalization factors; disable it if the field is only used for filtering and aggregation analysis
  • doc_values
    • true | false
    • Whether to enable doc_values for sorting and aggregation analysis
  • fielddata
    • false | true
    • Whether to enable fielddata on text fields for sorting and aggregation analysis
  • store
    • false | true
    • Whether to store the field value
  • coerce
    • true | false
    • Whether to enable automatic data type conversion, such as string to number, floating point to integer, etc
  • fields (multi-fields)
    • Flexible use of multi-field features to address diverse business requirements
  • dynamic
    • true | false | strict
    • Controls the automatic update of mapping
  • date_detection
    • true | false
    • Whether to automatically recognize the date type

Setting process

  • What type is it?
    • String type
    • Enumerated type
    • Numeric types
    • Other types
  • Does it need to be searched? (A combined mapping sketch follows this list.)
    • Fields that do not need to be searched, sorted or aggregated
      • Set enabled to false
    • Fields that do not need to be searched
      • Set index to false
    • For fields that do need to be searched, the following settings control what is stored
      • Set index_options as required
      • Disable norms when normalization factors are not needed
  • Does it need sorting or aggregation analysis?
    • If not, set doc_values to false
    • and set fielddata to false
  • Does it need to be stored separately?
    • If the field value must be retrievable on its own, set store to true
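A minimal sketch of how these settings combine in a mapping; the index and field names below are hypothetical and only illustrate the decision process above:

PUT test_modeling_index
{
  "mappings": {
    "doc": {
      "properties": {
        "tracking_data": {
          "enabled": false
        },
        "title": {
          "type": "text",
          "index_options": "offsets"
        },
        "status_code": {
          "type": "keyword",
          "index": false,
          "norms": false
        }
      }
    }
  }
}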

3. Example of ES data modeling

  • Blog post blog_index
    • title: the post title
    • publish_date: the publish date
    • author: the author
    • abstract: the post abstract
    • content: the post body
    • url: the post's web address
PUT blog_index
{
  "mappings": {
    "doc": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 100
            }
          },
          "store": true
        },
        "publish_date": {
          "type": "date",
          "store": true
        },
        "author": {
          "type": "keyword",
          "ignore_above": 100,
          "store": true
        },
        "abstract": {
          "type": "text",
          "store": true
        },
        "content": {
          "type": "text",
          "store": true
        },
        "url": {
          "type": "keyword",
          "doc_values": false,
          "norms": false,
          "ignore_above": 100,
          "store": true
        }
      }
    }
  }
}
  • Query, returning stored fields and highlighting matches in content:
GET blog_index/_search
{
  "stored_fields": ["title", "publish_date", "author", "abstract", "url"],
  "query": {
    "match": {
      "content": "blog"
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

4. Nested_Object

  • ES is not good at handling the associations found in relational databases, e.g. a comment table referencing the blog table through blog_id. ES offers the following two ways to model such relationships:
    • Nested Object
    • Parent/Child

Handling the relationship

  • blog_id: the article ID
  • username: the commenter
  • date: the comment date
  • content: the comment body
DELETE blog_index_nested

PUT blog_index_nested
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "keyword": {"type": "keyword", "ignore_above": 100}
          }
        },
        "publish_date": {"type": "date"},
        "author": {"type": "keyword", "ignore_above": 100},
        "abstract": {"type": "text"},
        "url": {"enabled": false},
        "comments": {
          "type": "nested",
          "properties": {
            "username": {"type": "keyword", "ignore_above": 100},
            "date": {"type": "date"},
            "content": {"type": "text"}
          }
        }
      }
    }
  }
}

PUT blog_index_nested/doc/2
{
  "title": "Blog Number One",
  "author": "alfred",
  "comments": [
    {"username": "lee", "date": "2017-01-02", "content": "awesome article!"},
    {"username": "fax", "date": "2017-04-02", "content": "thanks!"}
  ]
}

GET blog_index_nested/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "bool": {
          "must": [
            {"match": {"comments.username": "lee"}},
            {"match": {"comments.content": "thanks"}}
          ]
        }
      }
    }
  }
}

5. Parent_Child

  • ES also provides an implementation similar to join in a relational database, using the Join data type
PUT blog_index_parent_child
{
  "mappings": {
    "doc": {
      "properties": {
        "join": {
          "type": "join"."relations": {
            "blog": "comment"
          }
        }
      }
    }
  }
}

PUT blog_index_parent_child/doc/1
{
  "title": "blog",
  "join": "blog"
}

PUT blog_index_parent_child/doc/2
{
  "title": "blog2",
  "join": "blog"
}
The routing value must be specified so that the child document is stored on the same shard as its parent; join.name marks the document as the child type and join.parent points to the parent ID:

PUT blog_index_parent_child/doc/comment-1?routing=1
{
  "comment": "comment world",
  "join": {
    "name": "comment",
    "parent": 1
  }
}

PUT blog_index_parent_child/doc/comment-2?routing=2
{
  "comment": "comment hello",
  "join": {
    "name": "comment",
    "parent": 2
  }
}
  • The common query types include the following:
    • parent_id: returns the child documents of a given parent document
    • has_child: returns parent documents whose child documents match a query
    • has_parent: returns child documents whose parent document matches a query
GET blog_index_parent_child/_search
{
  "query": {
    "parent_id": {
      "type": "comment",
      "id": "2"
    }
  }
}
GET blog_index_parent_child/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "query": {
        "match": {
          "comment": "world"
        }
      }
    }
  }
}
GET blog_index_parent_child/_search
{
  "query": {
    "has_parent": {
      "parent_type": "blog",
      "query": {
        "match": {
          "title": "blog"
        }
      }
    }
  }
}

6. nested_vs_parent_child

nested object

  • Advantages: Documents are stored together, so read performance is high
  • Disadvantages: Updating parent or child documents requires updating the entire document
  • Scenario: Subdocuments are updated occasionally and queries are frequent

parent child

  • Advantages: parent and child documents can be updated independently without affecting each other
  • Disadvantages: maintaining the join relationship consumes memory, and read performance is worse
  • Scenario: Subdocuments are updated frequently

It is generally recommended to prefer nested object when it can solve the problem

7. reindex

  • Reindexing is the process of rebuilding all the data in an index, typically needed when:
    • The mapping changes, e.g. a field type change or an analyzer/dictionary update
    • The index settings change, e.g. the number of shards
    • Data needs to be migrated
  • ES provides ready-made APIs for this
    • _update_by_query rebuilds the documents in place on the existing index
    • _reindex rebuilds the data into another index
POST blog_index/_update_by_query?conflicts=proceed

POST _reindex
{
  "source": {
    "index": "blog_index"
  },
  "dest": {
    "index": "blog_new_index"
  }
}
  • The time needed to rebuild the data depends on the size of the original index: the larger it is, the longer it takes. In that case, set the URL parameter wait_for_completion to false to run the task asynchronously and check on it via the task API
POST blog_index/_update_by_query?conflicts=proceed&wait_for_completion=false

GET _tasks/_qKI6E8_TDWjXyo_x-bhmw:11996

8. Other suggestions

Data model versioning

  • Manage mapping versions
    • Keep the mapping in code or in a dedicated file, with comments, and put it in a version-control repository such as Git for easy review
    • Add a metadata field to each document to record schema information, which makes data management easier (see the sketch below)
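A minimal sketch of the second point; the index name and metadata field names here are hypothetical:

PUT blog_index_v2/doc/1
{
  "title": "Blog Number One",
  "author": "alfred",
  "schema_version": 2
}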

Preventing too many fields

  • Too many fields have the following drawbacks:
    • Hard to maintain: with hundreds or thousands of fields, almost nobody knows exactly what each field means
    • Mapping information is stored in the cluster state; too many fields make the mapping too large and slow down updates
    • A common cause of too many fields is the lack of careful data modeling, for example leaving dynamic set to true (see the sketch after this list for locking it down)
    • Consider splitting the data into multiple indexes to solve the problem
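A minimal sketch (hypothetical index name) of locking down the mapping with dynamic set to strict, so documents containing unknown fields are rejected instead of silently growing the mapping:

PUT strict_blog_index
{
  "mappings": {
    "doc": {
      "dynamic": "strict",
      "properties": {
        "title": {"type": "text"},
        "author": {"type": "keyword"}
      }
    }
  }
}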

Finally

You can follow my WeChat official account to learn and improve together.