1. Search API for Elasticsearch

1. Search API overview

  • The data stored in ES can be queried and analyzed. The endpoint is _search
  • There are two main forms of query
    • URI Search
      • Easy to operate and convenient for quick tests from the command line
      • Supports only part of the query syntax
    • Request Body Search
      • Uses the Query DSL (Domain Specific Language), the complete query syntax provided by ES

2. URI Search

  • Search is implemented using URL query parameters; the commonly used ones are listed below (a combined example follows the list):
    • q specifies the query statement, written in Query String Syntax
    • df specifies the default field; if q does not name a field and df is not set, ES queries all fields
    • sort specifies the sort order
    • timeout specifies the timeout period; by default there is no timeout
    • from and size are used for paging
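As an illustrative sketch (parameter values here are arbitrary and only show how the options combine), one URI search can use all of them at once:

GET test_search_index/_search?q=alfred&df=username&sort=age:desc&from=0&size=10&timeout=1s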

Query String Syntax

  • Term vs. phrase
    • alfred way is equivalent to alfred OR way
    • "alfred way" (in quotes) is a phrase: the words must appear together in that order
  • Generic query
    • alfred on its own matches the term against all fields
  • Specified field
    • name:alfred
  • Grouping: use parentheses to group matching rules
    • (quick OR brown) AND fox
    • status:(active OR pending) title:(full text search)
PUT test_search_index
{
  "settings": {
    "index": {
      "number_of_shards": "1"
    }
  }
}

POST test_search_index/doc/_bulk
{"index":{"_id":"1"}}
{"username":"alfred way","job":"java engineer","age":18,"birth":"1990-01-02","isMarried":false}
{"index":{"_id":"2"}}
{"username":"alfred","job":"java senior engineer and java specialist","age":28,"birth":"1980-05-07","isMarried":true}
{"index":{"_id":"3"}}
{"username":"lee","job":"java and ruby engineer","age":22,"birth":"1985-08-07","isMarried":false}
{"index":{"_id":"4"}}
{"username":"alfred junior way","job":"ruby engineer","age":23,"birth":"1989-08-07","isMarried":false}
  • Now let's run some real queries, starting with a generic query that returns documents containing alfred in any field
GET test_search_index/_search?q=alfred

{
  "took": 29,
  "timed_out": false,
  "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
  "hits": {
    "total": 3,
    "max_score": 1.2039728,
    "hits": [
      {"_index": "test_search_index", "_type": "doc", "_id": "2", "_score": 1.2039728,
       "_source": {"username": "alfred", "job": "java senior engineer and java specialist", "age": 28, "birth": "1980-05-07", "isMarried": true}},
      {"_index": "test_search_index", "_type": "doc", "_id": "1", "_score": 0.33698124,
       "_source": {"username": "alfred way", "job": "java engineer", "age": 18, "birth": "1990-01-02", "isMarried": false}},
      {"_index": "test_search_index", "_type": "doc", "_id": "4", "_score": 0.27601978,
       "_source": {"username": "alfred junior way", "job": "ruby engineer", "age": 23, "birth": "1989-08-07", "isMarried": false}}
    ]
  }
}
  • To see how ES actually executes the query, use the profile parameter:
GET test_search_index/_search?q=alfred
{
  "profile": true
}
  • Query by field
GET test_search_index/_search?q=username:alfred
  • Note that in q=username:alfred way, only alfred is bound to username; way is a generic term matched against all fields, so matching either condition is enough:
GET test_search_index/_search?q=username:alfred way
{
  "profile": true
}
  • There are two ways to bind both terms to username: a quoted phrase or a parenthesized group
GET test_search_index/_search?q=username:"alfred way"
{
  "profile": true
}

GET test_search_index/_search?q=username:(alfred way)
{
  "profile": true
}
  • Boolean operator
    • AND (&&), OR (||), NOT (!)
    • name:(tom NOT lee)
    • Note that the operators must be uppercase
  • + and - correspond to must and must_not, respectively
    • name:(tom +lee -alfred)
    • In a URL, + is parsed as a space, so it must be URL-encoded as %2B
GET test_search_index/_search?q=username:alfred AND way
{
  "profile": true
}

GET test_search_index/_search?q=username:(alfred AND way)
{
  "profile": true
}

GET test_search_index/_search?q=username:(alfred NOT way)

GET test_search_index/_search?q=username:(alfred +way)
{
  "profile": true
}

GET test_search_index/_search?q=username:(alfred %2Bway)
{
  "profile": true
}
  • Range query, supporting values and dates
    • Interval notation: [] for closed intervals, {} for open intervals
    • Comparison-operator notation: >, >=, <, <=
GET test_search_index/_search?q=username:alfred age:>26

GET test_search_index/_search?q=username:alfred AND age:>20

GET test_search_index/_search?q=birth:(>1980 AND <1990)
  • Wildcard query
    • ? matches a single character; * matches zero or more characters
    • Wildcard queries are not recommended because they are inefficient and can use a lot of memory
    • Unless absolutely necessary, do not put the wildcard at the beginning of the term
GET test_search_index/_search?q=username:alf*
  • Regular expression matching
GET test_search_index/_search?q=username:/[a]?l.*/
  • Fuzzy matching Fuzzy Query
    • name:roam~1
    • Matches words within an edit distance of 1 from roam, such as foam or roams
  • Proximity search
    • "fox quick"~5
    • Similar to fuzzy matching, but the distance is measured in words (terms) rather than characters
GET test_search_index/_search?q=username:alfed

GET test_search_index/_search?q=username:alfed~1

GET test_search_index/_search?q=username:alfd~2

GET test_search_index/_search?q=job:"java engineer"

GET test_search_index/_search?q=job:"java engineer"~1

GET test_search_index/_search?q=job:"java engineer"~2

3. Introduction to Query DSL

  • The query is sent to ES in the HTTP request body and contains the following parameters (a combined example follows this list):
    • query: the query statement, written in Query DSL syntax
    • from and size, for paging
    • timeout
    • sort
    • …
  • The JSON-defined query language includes the following two types:
    • Field-level queries
      • e.g. term, match, range: query a single field
    • Compound queries
      • e.g. bool query: contains one or more field-level queries or other compound queries
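As an illustrative sketch (values are arbitrary), a request body combining these parameters might look like this:

GET test_search_index/_search
{
  "query": {
    "match": {"username": "alfred"}
  },
  "from": 0,
  "size": 10,
  "sort": [
    {"age": "desc"}
  ],
  "timeout": "1s"
}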

4. Field query introduction and match-query

  • Field queries fall into the following two categories:
    • Full-text matching
      • Full-text retrieval against text fields; the query string is analyzed first, e.g. match, match_phrase
    • Term-level matching
      • The query string is not analyzed and is matched directly against the field's inverted index, e.g. term, terms, range
GET test_search_index/_search
{
  "query": {
    "match": {
      "username": "alfred way"
    }
  }
}
  • The operator parameter controls how the words are combined; the options are or and and
GET test_search_index/_search
{
  "query": {
    "match": {
      "username": {
        "query": "alfred way",
        "operator": "and"
      }
    }
  }
}
  • The minimum_should_match parameter controls the number of words to match
GET test_search_index/_search
{
  "query": {
    "match": {
      "job": {
        "query": "java ruby engineer",
        "minimum_should_match": "3"
      }
    }
  }
}

5. Relevance scoring

  • The relevance score measures how well a document matches the query statement
    • The inverted index gives us the list of documents matching the query; how do we put the documents that best satisfy the user's query at the front?
    • It is essentially a sorting problem, and relevance is the basis for that sorting
  • The important concepts in relevance scoring are as follows:
    • Term Frequency (TF): how many times a term appears in the document; the higher the frequency, the higher the relevance
    • Document Frequency (DF): the number of documents in which the term appears
    • Inverse Document Frequency (IDF): the inverse of document frequency, roughly 1/DF; the fewer documents a term appears in, the higher the relevance
    • Field-length Norm: the shorter the field, the higher the relevance
  • ES currently has two main relevance scoring models:
    • TF/IDF model
    • BM25 model: the default model since 5.x

  • You can use the explain parameter to see the exact calculation, but note that:
    • ES computes scores per shard, i.e. each shard scores independently, so pay attention to the number of shards when using explain
    • You can avoid this problem by setting the number of shards in the index to 1
GET test_search_index/_search
{
  "explain": true,
  "query": {
    "match": {
      "username": "alfred way"
    }
  }
}
  • In the BM25 model, BM stands for Best Match and 25 refers to the 25th iteration of the algorithm; it is an optimization of TF/IDF
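For reference, a commonly cited simplified form of the BM25 score is sketched below (not taken from this article; k1 and b are tuning parameters with usual defaults around 1.2 and 0.75, f(t,d) is the term frequency, |d| the field length, avgdl the average field length):

\[ \mathrm{score}(q,d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)} \]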

6. match-phrase-query

  • Performs a phrase search against a field; the terms must appear in the specified order
GET test_search_index/_search
{
  "query": {
    "match_phrase": {
      "job": "java engineer"
    }
  }
}


GET test_search_index/_search
{
  "query": {
    "match_phrase": {
      "job": "engineer java"
    }
  }
}
  • The slop parameter controls the spacing between words
GET test_search_index/_search
{
  "query": {
    "match_phrase": {
      "job": {
        "query": "java engineer",
        "slop": "2"
      }
    }
  }
}

7. query-string-query

  • Similar to the q parameter query in URI Search
GET test_search_index/_search
{
  "profile": true,
  "query": {
    "query_string": {
      "default_field": "username",
      "query": "alfred AND way"
    }
  }
}

GET test_search_index/_search
{
  "profile": true,
  "query": {
    "query_string": {
      "fields": ["username", "job"],
      "query": "alfred OR (java AND ruby)"
    }
  }
}

8. simple-query-string-query

  • Similar to query_string, but it ignores invalid query syntax and supports only part of the syntax
GET test_search_index/_search
{
  "profile": true,
  "query": {
    "simple_query_string": {
      "query": "alfred +way \"java",
      "fields": ["username"]
    }
  }
}

GET test_search_index/_search
{
  "query": {
    "query_string": {
      "default_field": "username",
      "query": "alfred +way \"java"
    }
  }
}

9. term-terms-query

  • The query string is matched as a whole term against the field, i.e. it is not analyzed
GET test_search_index/_search
{
  "query": {
    "term": {
      "username": "alfred"
    }
  }
}

GET test_search_index/_search
{
  "query": {
    "term": {
      "username": "alfred way"
    }
  }
}
  • terms: pass in multiple terms for a single query
GET test_search_index/_search
{
  "query": {
    "terms": {
      "username": [
        "alfred",
        "way"
      ]
    }
  }
}

10. range-query

  • Range queries focus on numeric and date types
GET test_search_index/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 10,
        "lte": 30
      }
    }
  }
}

GET test_search_index/_search
{
  "query": {
    "range": {
      "birth": {
        "gte": "1980-01-01"
      }
    }
  }
}
  • Date math provides a friendlier way to express date calculations
GET test_search_index/_search
{
  "query": {
    "range": {
      "birth": {
        "gte": "now-30y"
      }
    }
  }
}

GET test_search_index/_search
{
  "query": {
    "range": {
      "birth": {
        "gte": "2010||-20y"
      }
    }
  }
}

11. Introduction to compound query and ConstantScore

  • A compound query contains field-level queries or other compound queries; it includes the following types:
    • constant_score_query
    • bool query
    • dis_max query
    • function_score_query
    • boosting query

Constant Score Query

  • This query sets the score of every document matched by its inner filter to 1 (or to the boost value)
    • Mostly used in combination with bool query to achieve a custom score
GET test_search_index/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "match": {
          "username": "alfred"
        }
      }
    }
  }
}

GET test_search_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "match": {"job": "java"}
            }
          }
        },
        {
          "constant_score": {
            "filter": {
              "match": {"job": "ruby"}
            }
          }
        }
      ]
    }
  }
}

12. bool-query

  • A bool query consists of one or more Boolean clauses, of the following four kinds:
    • filter: only keeps documents matching the condition; no relevance score is calculated
    • must: documents must match all conditions in must, which affects the relevance score
    • must_not: documents must not match any condition in must_not
    • should: documents may match the conditions in should, which affects the relevance score

Filter

  • A filter clause only filters documents matching the condition and does not compute relevance
    • ES caches filters intelligently, so they execute very efficiently
    • For simple matching where scoring is not needed, it is recommended to use filter instead of query
GET test_search_index/_search
{
  "query": {
    "bool": {
      "filter": [
        {"term": {"username": "alfred"}}
      ]
    }
  }
}

Must

GET test_search_index/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"username": "alfred"}},
        {"match": {"job": "specialist"}}
      ]
    }
  }
}

Must_Not

GET test_search_index/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"job": "java"}}
      ],
      "must_not": [
        {"match": {"job": "ruby"}}
      ]
    }
  }
}

should

  • When a bool query contains only should, the document must satisfy at least one of the conditions
    • Minimum_should_match controls the number or percentage of conditions that are met
GET test_search_index/_search
{
  "query": {
    "bool": {
      "should": [
        {"match": {"username": "junior"}},
        {"match": {"job": "ruby"}}
      ]
    }
  }
}

GET test_search_index/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": {"job": "java"}},
        {"term": {"job": "ruby"}},
        {"term": {"job": "specialist"}}
      ],
      "minimum_should_match": 2
    }
  }
}
  • When you include both should and must, the document doesn’t have to satisfy the condition in should, but if it does, it increases the relevance score
GET test_search_index/_search
{
  "query": {
    "bool": {
      "must": [
        {"term": {"username": "alfred"}}
      ],
      "should": [
        {"term": {"job": "ruby"}}
      ]
    }
  }
}
  • ES executes a query differently depending on whether it appears in a Query context or a Filter context
    • Query context: finds the documents that best match the query, scoring and sorting all of them by relevance
    • Filter context: only determines which documents match the query; no scoring is done
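A minimal sketch contrasting the two contexts: the must clause runs in query context and is scored, while the filter clause only filters:

GET test_search_index/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"job": "java"}}
      ],
      "filter": [
        {"term": {"username": "alfred"}}
      ]
    }
  }
}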

13. count and source filtering

  • Get the number of documents that match the conditions. Endpoint is _count
GET test_search_index/_count
{
  "query": {
    "match": {
      "username": "alfred"
    }
  }
}

source filtering

  • Filter the fields in _source in the returned result
GET test_search_index/_search

GET test_search_index/_search?_source=username

GET test_search_index/_search
{
  "_source": false
}

GET test_search_index/_search
{
  "_source": ["username", "age"]
}

GET test_search_index/_search
{
  "_source": {
    "includes": "*i*",
    "excludes": "birth"
  }
}

2. How Elasticsearch search works

1. Query Then Fetch

  • A search is executed in two phases
    • The Query phase: each shard runs the query locally and returns matching document IDs and scores to the coordinating node, which merges and sorts them
    • The Fetch phase: the coordinating node fetches the _source of the top documents from the relevant shards by ID

2. Relevance scoring

  • Relevance scores are computed independently on each shard, which means statistics such as a term's IDF can differ between shards; a document's score therefore depends on which shard it lives on
  • When the number of documents is small, the relevance calculation can be severely skewed
POST test_search_relevance/doc
{
  "name":"hello"
}

POST test_search_relevance/doc
{
  "name":"hello,world"
}

POST test_search_relevance/doc
{
  "name":"hello,world! a beautiful world"
}

GET test_search_relevance/_search
{
  "explain": true,
  "query": {
    "match": {"name": "hello"}
  }
}
  • There are two ways to solve the problem:
    • Set the number of shards to 1, which eliminates the problem at its root; this is reasonable when the document count is modest, e.g. millions to tens of millions of documents
    • Use DFS Query Then Fetch
  • DFS Query Then Fetch first gathers the statistics from all shards and only then computes the relevance score, so the result is accurate, but it consumes more CPU and memory and performs worse; it is generally not recommended. Usage is as follows:
GET test_search_relevance/_search?search_type=dfs_query_then_fetch
{
  "query": {
    "match": {"name": "hello"}
  }
}

3. Sorting, doc values and fielddata

  • ES sorts by relevance by default; users can customize the sorting rules with the sort parameter
GET test_search_index/_search
{
  "query": {
    "match": {"username": "alfred"}
  },
  "sort": {
    "birth": "desc"
  }
}

GET test_search_index/_search
{
  "query": {
    "match": {"username": "alfred"}
  },
  "sort": [
    {"birth": "desc"},
    {"_score": "desc"},
    {"_doc": "desc"}
  ]
}
  • Sorting on string fields is special because ES has both text and keyword types
    • Sorting on a text field raises an error
    • Sorting on a keyword field returns the expected results
GET test_search_index/_search
{
  "sort": {
    "username.keyword": "desc"
  }
}

The sorting process

  • Sorting is essentially sorting on the original field values. The inverted index is of no use here; a forward index is needed, i.e. a way to quickly get the original field value from a document ID and field name
  • Es provides two ways to do this:
    • Fielddata is disabled by default
    • Doc values are enabled by default, except for the text type

Fielddata

  • Fielddata is off by default and can be turned on using the following API:
    • In that case, strings are sorted by their analyzed terms, so the result often does not match expectations
    • It is usually only enabled when aggregation analysis is needed on analyzed (text) fields
PUT test_search_index/_mapping/doc
{
  "properties": {
    "job": {
      "type": "text",
      "fielddata": true
    }
  }
}

Doc Values

  • Doc Values are enabled by default and can be turned off at index creation time:
    • If you want to enable Doc Values again, you need to perform reindex
PUT test_doc_values/_mapping/doc
{
  "properties": {
    "username": {
      "type": "keyword",
      "doc_values": false
    },
    "hobby": {
      "type": "keyword"
    }
  }
}

docvalue_fields

  • You can use this parameter to return the content stored in fielddata or doc values
GET test_search_index/_search
{
  "docvalue_fields": [
    "username",
    "username.keyword",
    "age"
  ]
}

4. Paging and traversal: from/size

  • ES provides three ways to handle paging and traversal:
    • from/size
      • from specifies the start position
      • size specifies the number of documents to return
    • scroll
    • search_after
  • Deep paging is a classic problem: how do you get documents 990 to 1000 when the data is split across shards?
    • Each shard first returns its top 1000 documents; the Coordinating Node then merges the results of all shards, sorts them, and takes the requested documents
    • The deeper the page, the more documents are processed, the more memory is used and the longer it takes, so try to avoid deep paging; ES limits from + size to 10000 via index.max_result_window
GET test_search_index/_search
{
  "from": 0,
  "size": 2
}

GET test_search_index/_search
{
  "from": 10000,
  "size": 2
}

5. Paging and traversal: scroll

  • Scroll is an API for iterating over a snapshot of the matching document set, which avoids the deep paging problem
    • It can’t be used for real-time search, because the data isn’t real-time
    • Try not to use complex sort conditions, _doc is most efficient
    • It’s a little complicated to use
  • The first step is to initiate a scroll search
    • Es creates a snapshot of the collection of document ids based on the query criteria after receiving the request
GET test_search_index/_search?scroll=5m
{
  "size": 1
}

(size specifies the number of documents returned per scroll call)
  • The second step is to use the Scroll Search API to get the collection of documents
    • Iterate over the call until the hits.hits array is returned empty
POST _search/scroll
{
  "scroll": "5m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAABswWX3FLSTZFOF9URFdqWHlvX3gtYmhtdw=="
}
  • Because scroll works on a snapshot, newly indexed documents cannot be retrieved
PUT test_search_index/doc/10
{
  "username":"doc10"
}
  • Scroll snapshots occupy a large amount of memory; use the clear scroll API to delete snapshots that are no longer needed

    DELETE _search/scroll/_all

6. Paging and traversal: search_after

  • Avoids the performance problems of deep paging and provides real-time retrieval of the next page of documents
    • The drawback is that you cannot use the from parameter, i.e. you cannot jump to an arbitrary page
    • It can only go to the next page, not back to a previous one
    • It is simple to use
  • The first step is a normal search, but the sort must include a value that is unique per document
GET test_search_index/_search
{
  "size": 1,
  "sort": {
    "age": "desc",
    "_id": "desc"
  }
}
  • The second step is to query using the sort value of the last document in the previous step
GET test_search_index/_search
{
  "size": 1,
  "search_after": [23, "4"],
  "sort": {
    "age": "desc",
    "_id": "desc"
  }
}

Application scenarios

  • from/size: you need the top documents in real time and need to jump freely between pages
  • scroll: you need all documents, for example to export all data
  • search_after: you need all documents in real time but do not need to jump between pages

3. Aggregation analysis for Elasticsearch

1. Introduction to aggregation analysis

  • Search engines are used to answer questions like:
    • Could you please tell me all the orders addressed to Shanghai?
    • Please tell me all orders created in the last 1 day that have not been paid?
  • Aggregation analysis can answer the following questions:
    • Please tell me the daily order volume in the last week.
    • Please tell me the average daily order amount of the recent one month.
    • Could you please tell me what are the top five best-selling items in the last six months?

Aggregate analysis

  • Aggregation analysis, or Aggregation, is a statistical analysis of ES data provided by ES in addition to the search function
    • With rich functions, Bucket, Metric, Pipeline and other analysis methods can meet most analysis requirements
    • With high real-time performance, all calculation results are returned in time, while big data systems such as Hadoop are generally T+1

Classification

  • For ease of understanding, ES divides aggregation analysis into the following four categories
    • Bucket: bucketing, similar to GROUP BY in SQL
    • Metric: metric analysis, such as computing the maximum, minimum, average, etc.
    • Pipeline: pipeline analysis, which re-analyzes the results of other aggregations
    • Matrix: matrix analysis

2. Metric Aggregation analysis

POST test_search_index/doc/_bulk
{"index":{"_id":"1"}}
{"username":"alfred way","job":"java engineer","age":18,"birth":"1990-01-02","isMarried":false,"salary":10000}
{"index":{"_id":"2"}}
{"username":"tom","job":"java senior engineer","age":28,"birth":"1980-05-07","isMarried":true,"salary":30000}
{"index":{"_id":"3"}}
{"username":"lee","job":"ruby engineer","age":22,"birth":"1985-08-07","isMarried":false,"salary":15000}
{"index":{"_id":"4"}}
{"username":"Nick","job":"web engineer","age":23,"birth":"1989-08-07","isMarried":false,"salary":8000}
{"index":{"_id":"5"}}
{"username":"Niko","job":"web engineer","age":18,"birth":"1994-08-07","isMarried":false,"salary":5000}
{"index":{"_id":"6"}}
{"username":"Michell","job":"ruby engineer","age":26,"birth":"1987-08-07","isMarried":false,"salary":12000}
  • Metric aggregations fall into the following two categories:
    • Single-value analysis, which outputs a single result
      • min, max, avg, sum
      • cardinality
    • Multi-value analysis, which outputs multiple results
      • stats, extended stats
      • percentiles, percentile ranks
      • top hits

Min

  • Returns the minimum value of a numeric class field
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "min_age": {
      "min": {"field": "age"}
    }
  }
}

Max

  • Returns the maximum value of a numeric class field
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "max_age": {
      "max": {"field": "age"}
    }
  }
}

Avg

  • Returns the average value of a numeric class field
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "avg_age": {
      "avg": {"field": "age"}
    }
  }
}

Sum

  • Returns the sum of numeric class fields
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "sum_age": {
      "sum": {"field": "age"}
    }
  }
}
  • Returns multiple aggregate results at once
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "min_age": {
      "min": {"field": "age"}
    },
    "max_age": {
      "max": {"field": "age"}
    },
    "avg_age": {
      "avg": {"field": "age"}
    },
    "sum_age": {
      "sum": {"field": "age"}
    }
  }
}

Cardinality

  • Cardinality is the number of distinct values of a field, similar to count(distinct field) in SQL
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "count_of_job": {
      "cardinality": {"field": "job.keyword"}
    }
  }
}

Stats

  • Returns a set of statistics for a numeric field, including min, max, avg, sum and count
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "stats_age": {
      "stats": {"field": "age"}
    }
  }
}

Extended Stats

  • An extension of stats that adds more statistics, such as variance and standard deviation
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "exstats_salary": {
      "extended_stats": {"field": "salary"}
    }
  }
}

Percentile

  • Percentile statistics
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "per_salary": {
      "percentiles": {"field": "salary"}
    }
  }
}

GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "per_age": {
      "percentiles": {
        "field": "salary",
        "percents": [95, 99, 99.9]
      }
    }
  }
}

Percentile Rank

  • Percentile rank statistics: for the given values, returns the percentile each value falls at
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "per_salary": {
      "percentile_ranks": {
        "field": "salary",
        "values": [11000, 30000]
      }
    }
  }
}

Top Hits

  • Generally used to get the top matching documents within each bucket, i.e. the detail data
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "top_employee": {
          "top_hits": {
            "size": 10,
            "sort": [
              {"age": {"order": "desc"}}
            ]
          }
        }
      }
    }
  }
}

3. Bucket aggregation analysis

  • Bucket aggregations allocate documents to different buckets according to a given rule, so they can be analyzed per category
  • Based on Bucket splitting policy, common Bucket aggregation analysis is as follows:
    • Terms
    • Range
    • Date Range
    • Histogram
    • Date Histogram

Terms

  • The simplest bucketing strategy: one bucket per term. If the field type is text, buckets are built from the analyzed terms
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job",
        "size": 5
      }
    }
  }
}

Range

  • Sets bucket splitting rules based on the range of specified values
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "salary_range": {
      "range": {
        "field": "salary",
        "ranges": [
          {"key": "<10000", "to": 10000},
          {"from": 10000, "to": 20000},
          {"key": ">20000", "from": 20000}
        ]
      }
    }
  }
}

Date Range

  • Set bucket splitting rules by specifying a range of dates
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "date_range": {
      "range": {
        "field": "birth",
        "format": "yyyy",
        "ranges": [
          {"from": "1980", "to": "1990"},
          {"from": "1990", "to": "2000"},
          {"from": "2000"}
        ]
      }
    }
  }
}

Histogram

  • Histogram: splits the data into buckets of a fixed interval
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "salary_hist": {
      "histogram": {
        "field": "salary",
        "interval": 5000,
        "extended_bounds": {
          "min": 0,
          "max": 40000
        }
      }
    }
  }
}

Date Histogram

  • A histogram over dates, a very common aggregation in time-series data analysis
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "by_year": {
      "date_histogram": {
        "field": "birth",
        "interval": "year",
        "format": "yyyy"
      }
    }
  }
}

4. Bucket and metric aggregation analysis

  • Bucket aggregation analysis allows further analysis by adding subanalyses, either Bucket or Metric, which makes ES extremely powerful
  • Bucketing inside buckets (sub-bucketing)
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "age_range": {
          "range": {
            "field": "age",
            "ranges": [
              {"to": 20},
              {"from": 20, "to": 30},
              {"from": 30}
            ]
          }
        }
      }
    }
  }
}
  • Metric analysis within each bucket
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "salary": {
          "stats": {"field": "salary"}
        }
      }
    }
  }
}

5. Pipeline aggregation analysis

  • Re-aggregate analysis of the results of the aggregate analysis, with support for chain calls, can answer the following questions:
    • What is the average monthly sales of orders?
  • Pipeline results are added to the existing output; based on where the result is placed, they fall into the following two categories:
    • Parent: the result is embedded inside the existing aggregation result
      • Derivative
      • Moving Average
      • Cumulative Sum
    • Sibling: the result sits alongside the existing aggregation results
      • Max/Min/Avg/Sum Bucket
      • Stats/Extended Stats Bucket
      • Percentiles Bucket

Min Bucket

GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {"field": "salary"}
        }
      }
    },
    "min_salary_by_job": {
      "min_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}

Derivative

  • Compute the derivative of the Bucket value
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "birth": {
      "date_histogram": {
        "field": "birth",
        "interval": "year",
        "min_doc_count": 0
      },
      "aggs": {
        "avg_salary": {
          "avg": {"field": "salary"}
        },
        "derivative_avg_salary": {
          "derivative": {
            "buckets_path": "avg_salary"
          }
        }
      }
    }
  }
}

Moving Average

  • Calculates the moving average of Bucket values
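A minimal sketch, reusing the yearly birth histogram and avg_salary metric from the derivative example above (the pipeline aggregation is called moving_avg in the ES versions this article appears to target; newer releases replace it with moving_fn):

GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "birth": {
      "date_histogram": {
        "field": "birth",
        "interval": "year",
        "min_doc_count": 0
      },
      "aggs": {
        "avg_salary": {
          "avg": {"field": "salary"}
        },
        "moving_avg_salary": {
          "moving_avg": {"buckets_path": "avg_salary"}
        }
      }
    }
  }
}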

Cumulative Sum

  • Calculates the cumulative sum of bucket values
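A minimal sketch along the same lines, again assuming the yearly birth histogram used above:

GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "birth": {
      "date_histogram": {
        "field": "birth",
        "interval": "year",
        "min_doc_count": 0
      },
      "aggs": {
        "avg_salary": {
          "avg": {"field": "salary"}
        },
        "cumulative_salary": {
          "cumulative_sum": {"buckets_path": "avg_salary"}
        }
      }
    }
  }
}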

6. Scope of action

  • The default scope of ES aggregation analysis is the result set of Query, and the scope can be changed as follows:
    • filter
    • post_filter
    • global

filter

  • Filter criteria are set for an aggregation analysis, changing scope without changing the overall Query statement
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs_salary_small": {
      "filter": {
        "range": {
          "salary": {"to": 10000}
        }
      },
      "aggs": {
        "jobs": {
          "terms": {"field": "job.keyword"}
        }
      }
    },
    "jobs": {
      "terms": {"field": "job.keyword"}
    }
  }
}

post_filter

  • Filters the returned documents, but takes effect after the aggregation analysis has run, so it does not affect the aggregation results
GET test_search_index/_search
{
  "aggs": {
    "jobs": {
      "terms": {"field": "job.keyword"}
    }
  },
  "post_filter": {
    "match": {"job.keyword": "java engineer"}
  }
}

global

  • Analysis is performed based on all documents regardless of query filtering conditions
GET test_search_index/_search
{
  "query": {
    "match": {
      "job.keyword": "java engineer"
    }
  },
  "aggs": {
    "java_avg_salary": {
      "avg": {"field": "salary"}
    },
    "all": {
      "global": {},
      "aggs": {
        "avg_salary": {
          "avg": {"field": "salary"}
        }
      }
    }
  }
}

7. Sorting buckets

  • You can sort buckets using built-in keys, for example:
    • _count: the document count of the bucket
    • _key: the bucket's key value
GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10,
        "order": [
          {"avg_salary": "desc"}
        ]
      },
      "aggs": {
        "avg_salary": {
          "avg": {"field": "salary"}
        }
      }
    }
  }
}
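For comparison, a minimal sketch ordering the buckets by the built-in _count key instead of a sub-aggregation:

GET test_search_index/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10,
        "order": {"_count": "asc"}
      }
    }
  }
}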

4. Data Modeling for Elasticsearch

1. Introduction to data modeling

  • Data Modeling, the process of creating a Data model
  • Data Model
    • A tool or method for abstracting the real world
    • Mapping to the real world is achieved by describing business rules in the form of abstract entities and relationships between them

Process of data modeling

  • Conceptual model
    • Defines the system's core requirements and scope boundaries, and designs the entities and the relationships between them
  • Logical model
    • Further refines the business requirements, identifying each entity's attributes, relationships, constraints, and so on
  • Physical model
    • Based on a specific database product, finalizes the definitions while satisfying read/write performance requirements
    • MySQL, MongoDB, Elasticsearch, etc.
    • Third normal form

2. Introduction to ES data modeling configuration

  • ES is a storage system built on Lucene's inverted index; it does not follow the normalization conventions of relational databases

Mapping field settings

  • enabled
    • true | false
    • Only store the field; it cannot be searched or used in aggregation analysis
  • index
    • true | false
    • Whether to build an inverted index
  • index_options
    • docs | freqs | positions | offsets
    • What information is stored in the inverted index
  • norms
    • true | false
    • Whether to store normalization factors; disable it if the field is only used for filtering and aggregation analysis
  • doc_values
    • true | false
    • Whether to enable doc_values for sorting and aggregation analysis
  • fielddata
    • false | true
    • Whether to enable fielddata on text fields for sorting and aggregation analysis
  • store
    • false | true
    • Whether to store the field value
  • coerce
    • true | false
    • Whether to enable automatic data type conversion, such as string to number, floating point to integer, etc
  • fields (multi-fields)
    • Flexible use of multi-field features to address diverse business requirements
  • dynamic
    • true | false | strict
    • Controls the automatic update of mapping
  • date_detection
    • true | false
    • Whether to automatically recognize the date type

Setting process

  • What type is it?
    • String type
    • Enumerated type
    • Numeric types
    • Other types
  • Does it need to be searched? (A combined mapping sketch follows this list.)
    • Fields that do not need to be searched, sorted or aggregated
      • Set enabled to false
    • Fields that do not need to be searched
      • Set index to false
    • For fields that do need to be searched, the following settings control what is stored
      • Set index_options as required
      • Disable norms when normalization factors are not needed
  • Does it need sorting or aggregation analysis?
    • If not, set doc_values to false
    • and set fielddata to false
  • Does it need to be stored separately?
    • If the field value must be retrievable on its own, set store to true
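A minimal sketch of how these settings combine in a mapping; the index and field names below are hypothetical and only illustrate the decision process above:

PUT test_modeling_index
{
  "mappings": {
    "doc": {
      "properties": {
        "tracking_data": {
          "enabled": false
        },
        "title": {
          "type": "text",
          "index_options": "offsets"
        },
        "status_code": {
          "type": "keyword",
          "index": false,
          "norms": false
        }
      }
    }
  }
}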

3. Example of ES data modeling

  • Blog post blog_index
    • title: the post title
    • publish_date: the publish date
    • author: the author
    • abstract: the post abstract
    • content: the post body
    • url: the post's web address
PUT blog_index
{
  "mappings": {
    "doc": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 100
            }
          },
          "store": true
        },
        "publish_date": {
          "type": "date",
          "store": true
        },
        "author": {
          "type": "keyword",
          "ignore_above": 100,
          "store": true
        },
        "abstract": {
          "type": "text",
          "store": true
        },
        "content": {
          "type": "text",
          "store": true
        },
        "url": {
          "type": "keyword",
          "doc_values": false,
          "norms": false,
          "ignore_above": 100,
          "store": true
        }
      }
    }
  }
}
  • Query, returning stored fields and highlighting matches in content:
GET blog_index/_search
{
  "stored_fields": ["title", "publish_date", "author", "abstract", "url"],
  "query": {
    "match": {
      "content": "blog"
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

4. Nested_Object

  • ES is not good at handling the associations found in relational databases, e.g. a comment table referencing the blog table through blog_id. ES offers the following two ways to model such relationships:
    • Nested Object
    • Parent/Child

Handling the relationship

  • blog_id: the article ID
  • username: the commenter
  • date: the comment date
  • content: the comment body
DELETE blog_index_nested

PUT blog_index_nested
{
  "mappings": {
    "doc": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "keyword": {"type": "keyword", "ignore_above": 100}
          }
        },
        "publish_date": {"type": "date"},
        "author": {"type": "keyword", "ignore_above": 100},
        "abstract": {"type": "text"},
        "url": {"enabled": false},
        "comments": {
          "type": "nested",
          "properties": {
            "username": {"type": "keyword", "ignore_above": 100},
            "date": {"type": "date"},
            "content": {"type": "text"}
          }
        }
      }
    }
  }
}

PUT blog_index_nested/doc/2
{
  "title": "Blog Number One",
  "author": "alfred",
  "comments": [
    {"username": "lee", "date": "2017-01-02", "content": "awesome article!"},
    {"username": "fax", "date": "2017-04-02", "content": "thanks!"}
  ]
}

GET blog_index_nested/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "bool": {
          "must": [
            {"match": {"comments.username": "lee"}},
            {"match": {"comments.content": "thanks"}}
          ]
        }
      }
    }
  }
}

5. Parent_Child

  • ES also provides an implementation similar to join in a relational database, using the Join data type
PUT blog_index_parent_child
{
  "mappings": {
    "doc": {
      "properties": {
        "join": {
          "type": "join"."relations": {
            "blog": "comment"
          }
        }
      }
    }
  }
}

PUT blog_index_parent_child/doc/1
{
  "title": "blog",
  "join": "blog"
}

PUT blog_index_parent_child/doc/2
{
  "title": "blog2",
  "join": "blog"
}
The routing value must be specified so that the child document is stored on the same shard as its parent; join.name marks the document as the child type and join.parent points to the parent ID:

PUT blog_index_parent_child/doc/comment-1?routing=1
{
  "comment": "comment world",
  "join": {
    "name": "comment",
    "parent": 1
  }
}

PUT blog_index_parent_child/doc/comment-2?routing=2
{
  "comment": "comment hello",
  "join": {
    "name": "comment",
    "parent": 2
  }
}
  • The common query types include the following:
    • parent_id: returns the child documents of a given parent document
    • has_child: returns parent documents whose child documents match a query
    • has_parent: returns child documents whose parent document matches a query
GET blog_index_parent_child/_search
{
  "query": {
    "parent_id": {
      "type": "comment",
      "id": "2"
    }
  }
}
GET blog_index_parent_child/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "query": {
        "match": {
          "comment": "world"
        }
      }
    }
  }
}
GET blog_index_parent_child/_search
{
  "query": {
    "has_parent": {
      "parent_type": "blog",
      "query": {
        "match": {
          "title": "blog"
        }
      }
    }
  }
}

6. nested_vs_parent_child

nested object

  • Advantages: Documents are stored together, so read performance is high
  • Disadvantages: Updating parent or child documents requires updating the entire document
  • Scenario: Subdocuments are updated occasionally and queries are frequent

parent child

  • Advantages: parent and child documents can be updated independently without affecting each other
  • Disadvantages: maintaining the join relationship consumes memory, and read performance is worse
  • Scenario: Subdocuments are updated frequently

It is generally recommended to prefer nested object when it can solve the problem

7. reindex

  • Reindexing is the process of rebuilding all the data in an index, typically needed when:
    • The mapping changes, e.g. a field type change or an analyzer/dictionary update
    • The index settings change, e.g. the number of shards
    • Data needs to be migrated
  • ES provides ready-made APIs for this
    • _update_by_query rebuilds the documents in place on the existing index
    • _reindex rebuilds the data into another index
POST blog_index/_update_by_query?conflicts=proceed

POST _reindex
{
  "source": {
    "index": "blog_index"
  },
  "dest": {
    "index": "blog_new_index"
  }
}
  • The time needed to rebuild the data depends on the size of the original index: the larger it is, the longer it takes. In that case, set the URL parameter wait_for_completion to false to run the task asynchronously and check on it via the task API
POST blog_index/_update_by_query?conflicts=proceed&wait_for_completion=false

GET _tasks/_qKI6E8_TDWjXyo_x-bhmw:11996

8. Other suggestions

Data model versioning

  • Manage mapping versions
    • Keep the mapping in code or in a dedicated file, with comments, and put it in a version-control repository such as Git for easy review
    • Add a metadata field to each document to record schema information, which makes data management easier (see the sketch below)
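A minimal sketch of the second point; the index name and metadata field names here are hypothetical:

PUT blog_index_v2/doc/1
{
  "title": "Blog Number One",
  "author": "alfred",
  "schema_version": 2
}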

Preventing too many fields

  • Too many fields have the following drawbacks:
    • Hard to maintain: with hundreds or thousands of fields, almost nobody knows exactly what each field means
    • Mapping information is stored in the cluster state; too many fields make the mapping too large and slow down updates
    • A common cause of too many fields is the lack of careful data modeling, for example leaving dynamic set to true (see the sketch after this list for locking it down)
    • Consider splitting the data into multiple indexes to solve the problem
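A minimal sketch (hypothetical index name) of locking down the mapping with dynamic set to strict, so documents containing unknown fields are rejected instead of silently growing the mapping:

PUT strict_blog_index
{
  "mappings": {
    "doc": {
      "dynamic": "strict",
      "properties": {
        "title": {"type": "text"},
        "author": {"type": "keyword"}
      }
    }
  }
}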

Finally

You can follow my WeChat official account to learn and improve together.