1. Introduction
Hi! Thanks for waiting! After 10 days, the next installment of Daydream's Elasticsearch notes is finally done, and it is packed with good stuff!
Inside: 32 query methods, 15 aggregation methods, and 7 query optimization techniques. Shares and reposts are welcome!
If the various ES concepts are still fuzzy to you, read the earlier article, Daydream's ES Notes: Basics. Even a few unclear concepts won't stop you from following the query methods introduced here.
In the next article (the third in Daydream's ES notes series) we will go back to basics and do a systematic pass over the concepts!
The final article (the fourth in the ES notes series) will be hands-on in an actual programming language and, barring surprises, will reach you as a video.
Welcome to Daydream! Be the first to catch every update!
2. A Perk: Borrow an ES Account
Good news!! If you don't feel like installing ES yourself and want a ready-made ES to learn on, you can use the free ES instance I set up on the public internet (it has more than 340 days left and should expire in early 2022). Reply "freebie" in the background of this public account to get the account and password.
Notice!!! I can't guarantee it stays safe to use, because an IP exposed directly to the public network can be attacked. If you find the service unavailable, let me know: I made an image in advance, so I can quickly bring the system back to normal. (For security, I will also periodically rotate the IP, account, and password.) So use it for learning only, and don't put important data on it!
Follow Daydream (a Baidu backend developer focused on technology) and reply "freebie" in the background to get the account and password.
Also, I recommend reading the original post; the JSON formatting looks much better there!
3. The _search API
The Search API is one of the most important APIs to know and master, because most of what you do with ES is retrieval. So let's look at the search APIs ES offers; the ultimate goal, of course, is the ability to choose the search method that fits your business.
Here I go again!
If you don't learn the query methods and techniques Daydream shows you here, I bet you won't understand the search code people write in Java or Golang.
Conversely, if you understand the dozens of cases below, I dare say you could independently write the corresponding query code in a familiar programming language in minutes!
3.1. What is Query String Search?
So-called query string search is simply one retrieval style ES provides. The request below is a typical query string search.
In practice this style is rarely used. Its most visible feature is that all of its request parameters are written into the URI.
```
GET /your_index/your_type/_search?q=*&sort=account_number:asc&pretty
```
Reading the query string search above: q=* matches all docs under index=bank; sort=account_number:asc tells ES to sort the results ascending by the account_number field; pretty tells ES to return nicely formatted JSON.
The q above can also be written as follows:
```
GET /your_index/your_type/_search?q=custom_field:expected_value
GET /your_index/your_type/_search?q=+custom_field:expected_value
GET /your_index/your_type/_search?q=-custom_field:expected_value
```
ES returns a response like the following (the query DSL and query optimization examples below return the same shape):
```
{
  "took" : 63,          // time spent, in ms
  // By default there is no timeout: if your search takes 1 minute, ES waits 1 minute.
  // You can specify a timeout when sending the search request;
  // e.g. with a 10ms timeout, ES returns whatever data it managed to collect within 10ms.
  "timed_out" : false,
  "_shards" : {         // how many shards your search request hit
    // A primary shard can serve both read and write traffic; a replica shard serves reads.
    // Since I am on the defaults, there are five primary shards,
    // so the search request fans out to 5 shards, and all of them succeeded.
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,      // 0 shards skipped
    "failed" : 0        // 0 shards failed
  },
  "hits" : {            // the hits
    "total" : 1000,     // 1000 docs matched
    // _score is used in full-text retrieval: the higher the relevance score,
    // the better the doc matches the search content. max_score is the largest _score.
    "max_score" : null,
    // By default the first 10 docs are returned, each with its complete data.
    "hits" : [
      {
        "_index" : "bank",  // index
        "_type" : "_doc",   // type
        "_id" : "0",        // id
        "sort" : [0],
        "_score" : null,    // relevance score
        // _source holds the doc's actual data
        "_source" : {"account_number":0,"balance":16623,"firstname":"Bradshaw","lastname":"Mckenzie","age":29,"gender":"F","address":"244 Columbus Place","employer":"Euron","email":"[email protected]","city":"Hobucken","state":"CO"}
      },
      {
        "_index" : "bank",
        "_type" : "_doc",
        "_id" : "1",
        "sort" : [1],
        "_score" : null,
        "_source" : {"account_number":1,"balance":39225,"firstname":"Amber","lastname":"Duke","age":32,"gender":"M","address":"880 Holmes Lane","employer":"Pyrami","email":"[email protected]","city":"Brogan","state":"IL"}
      },
      ...
    ]
  }
}
```
Specifying a timeout: GET /_search?timeout=10ms. When optimizing, consider using a timeout: for example, normally we might get 2000 records within 10s, but with a timeout specified we get back just the 100 records that could be fetched within 10ms.
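As a quick sketch (the index name and the match_all query are placeholders of mine), the timeout can also be set in the request body of a query DSL search:

```
GET /your_index/_search
{
  "timeout": "10ms",    # return whatever each shard managed to collect within 10ms
  "query": {
    "match_all": {}
  }
}
```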
3.2. What is Query DSL?
DSL: Domain Specific Language.
Both the query string search above and the query DSL in this section are, at bottom, RESTful network requests. Rather than writing all the request parameters into the URI the way query string search does, a query DSL request typically looks like this:
```
GET /yourIndex/yourType/_search
{
  # many request parameters
}
```
To put it bluntly: query string search is like an HTTP GET request in that it carries no request body, while the query DSL of this section is more like an HTTP POST request.
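For a side-by-side feel, here is a minimal sketch (the name field is a placeholder of mine) of the same search written both ways:

```
# query string search: everything lives in the URI
GET /your_index/your_type/_search?q=name:Daydream

# query DSL: the same search, with parameters in the request body
GET /your_index/your_type/_search
{
  "query": {
    "match": { "name": "Daydream" }
  }
}
```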
3.3. The Good Stuff: 32 Query Cases!
Let's look at how these query DSLs are used. (The return value has the same shape as the response shown above, so the focus below is on how to write the queries, not on how to read the responses.)
1. Query all docs under the specified index
```
GET /your_index/your_type/_search
{
  "query": {
    "match_all": {}
  }
}
```
2. Full-text retrieval on the name field (match query)
ES takes the user's input string, breaks it into terms with an analyzer, then scans the inverted index for matches. (The next Daydream note will come back to the core concepts involved, including the inverted index.) Even a single term hit in the inverted index returns a result.
```
GET /yourIndex/yourType/_search
{
  "query": {
    # match = full-text search, so "Daydream" is first analyzed,
    # e.g. into "day", "dream", "daydream"
    "match": {
      "name": "Daydream"
    }
  }
}

# Under the hood, the match query is rewritten into roughly this form:
# {
#   "bool": {
#     "should": [
#       {"term": {"title": "Day"}},
#       {"term": {"title": "Daydream"}},
#       {"term": {"title": "Dream"}}
#     ]
#   }
# }
```
3. Full-text retrieval: manually controlling its precision
```
GET /your_index/your_type/_search
{
  "query": {
    "match": {
      "name": {
        "query": "bairi meng",
        # Without "and", the relationship between bairi and meng is "or":
        # a doc matches as soon as either term appears.
        "operator": "and"
      }
    }
  }
}

# Adding operator makes ES rewrite the query into the form below,
# turning the should above into must:
# {
#   "bool": {
#     "must": [
#       {"term": {"title": "bairi"}},
#       {"term": {"title": "meng"}}
#     ]
#   }
# }
```
4. Trimming the long tail of full-text retrieval
GET /your_index/your_type/_search {"query": {
"match": {
"name": {"query":"Welcome to daydreaming!"."operator":"and"The query above can be split into five words: welcome, attention, daydream, welcome attention, attention daydream. By default, if you hit one of these words, the doc will be returned, so there is a long tail. # Go to the long tail: Control if the doc hits at least 3/4 of the word."minimum_should_match":"75%"}}}} # adding minimum_should_match will be converted by ES to the following format # # {#"bool":{
# "should":[
# {"term": {"title":"Day"#}}, {"term": {"title":"Dream"}} #, #"minimum_should_match":3#} #} #Copy the code
5. Full-text retrieval: controlling weight through boost.
In this case: a doc whose name field contains "follow" matches at the default weight; if the name also contains "daydream", the weight of that match is raised to 3; and if it contains "public account", the weight is raised to 2. After this treatment, a doc whose name field contains "follow daydream public account" gets the highest weight and ranks higher in the search results.
```
GET /your_index/your_type/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          # by default every field has the same weight: 1
          "name": {
            "query": "follow"
          }
        }
      },
      "should": [
        {
          "match": {
            "name": {
              "query": "daydream",
              # raise the weight of this match to 3
              "boost": 3
            }
          }
        },
        {
          "match": {
            "name": {
              "query": "public account",
              # raise the weight of this match to 2
              "boost": 2
            }
          }
        }
      ]
    }
  }
}
```
6. A slightly more complex multi-condition query: the bool query
```
GET /your_index/your_type/_search
{
  "query": {
    # If your query is complex and involves many subqueries,
    # consider wrapping the subqueries in a bool query.
    # Each subquery computes a relevance score for the doc against its own clause;
    # the bool query then combines those scores into one final score.
    "bool": {
      # must: the doc must match, and the match contributes a relevance score.
      # Here: address must include "mill".
      "must": [
        { "match": { "address": "mill" } }
      ],
      # if there is no must clause, at least one should clause must match
      "should": [
        { "match": { "address": "lane" } }
      ],
      # must_not: the doc must not contain this
      "must_not": [
        { "match": { "address": "mill" } }
      ]
    }
  }
}
```
7. Bool query + trimming the long tail.
```
GET /your_index/your_type/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name": "Daydream 1" } },
        { "match": { "name": "Daydream 2" } },
        { "match": { "name": "Daydream 3" } }
      ],
      "minimum_should_match": 3
    }
  }
}
```
8. Best Fields strategy: Take the highest score from multiple queries as the doc’s final score.
A query can contain multiple match clauses (call it a multi-field query), and each match contributes its own relevance score. That means the doc's final relevance score is computed by some mechanism from the scores contributed by all those matches, and the higher the final score, the higher the doc appears in the search results.
If you don't want the doc's final score to combine all the matches, use the dis_max query. It takes the single highest-scoring match among all the matches as the doc's final score.
```
GET /your_index/your_type/_search
{
  "query": {
    # take the highest-scoring subquery as the final score
    "dis_max": {
      "queries": [
        { "match": { "name": "Daydream" } },
        { "match": { "content": "Follow daydream!" } }
      ]
    }
  }
}
```
9. Optimize dis_max with tie_breaker
The dis_max query from the case above is the key to implementing best fields: it takes the highest score among all the matches as the doc's final score.
tie_breaker softens that: it makes dis_max also factor in the scores of the other matches, scaled down by the given factor, e.g. 0.4 below. The doc's final score thus still reflects the other matches, but only at 0.4 of their original contribution.
```
GET /your_index/your_type/_search
{
  # optimize dis_max with tie_breaker, which lets dis_max
  # take the other fields' scores into account
  "query": {
    # take the highest-scoring subquery as the final score
    # (this is the best-fields strategy)
    "dis_max": {
      "queries": [
        { "match": { "name": "follow" } },
        { "match": { "content": "daydream" } }
      ],
      "tie_breaker": 0.4
    }
  }
}
```
10. Retrieve multiple fields you specify simultaneously: multi_match
```
GET /your_index/your_type/_search
{
  # retrieve docs containing "this is a test" in either of the two fields below
  "query": {
    "multi_match": {
      "query": "this is a test",
      "fields": [ "subject", "message" ]
    }
  }
}
```
11. Simplify dis_max with multi_match query
```
# the dis_max version:
GET /your_index/your_type/_search
{
  # optimize dis_max with tie_breaker, which lets dis_max
  # take the other fields' scores into account
  "query": {
    # take the highest-scoring subquery as the final score
    # (the best-fields strategy)
    "dis_max": {
      "queries": [
        { "match": { "name": "follow" } },
        { "match": { "content": "daydream" } }
      ],
      "tie_breaker": 0.4
    }
  }
}

# the equivalent multi_match version:
GET /your_index/your_type/_search
{
  "query": {
    "multi_match": {
      "query": "follow daydream",
      # specify the best_fields strategy (dis_max is the best-fields strategy)
      "type": "best_fields",
      # content^2 raises the field's weight, equivalent to boost: 2
      "fields": ["name", "content^2"],
      "tie_breaker": 0.4,
      "minimum_should_match": 3
    }
  }
}
```
12. The most-fields strategy
Unlike best fields, most fields returns first the docs in which more of your fields match the given keywords, rather than the doc whose single field best matches them.
Also, most_fields does not support trimming the long tail with minimum_should_match.
```
GET /your_index/your_type/_search
{
  # most_fields strategy: docs that hit keywords in more fields come back first
  "query": {
    "multi_match": {
      "query": "follow daydream",
      # specify the most_fields strategy
      "type": "most_fields",
      "fields": ["title", "name", "content"]
    }
  }
}
```
13. The cross_fields policy
```
GET /your_index/your_type/_search
{
  "query": {
    "multi_match": {
      "query": "golang java",
      # cross_fields requires each term to appear in at least one field:
      # golang must appear in title or content, and so must java
      "type": "cross_fields",
      "fields": ["title", "content"]
    }
  }
}
```
14. A query that matches nothing
```
GET /your_index/your_type/_search
{
  "query": {
    "match_none": {}
  }
}
```
15. Exact match with term and terms
```
# use term for an exact match on a single value
GET /your_index/your_type/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "name": "daydream"
        }
      }
    }
  }
}

# use terms for exact matches against multiple values;
# the example below is equivalent to SQL: where name in ('tom', 'jerry')
GET /your_index/your_type/_search
{
  # exact match
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "your_field_name": ["tom", "jerry"]
        }
      }
    }
  }
}
```
16. Phrase retrieval: requires the doc's field value to be identical to the value you give, in the same order; high precision, but low recall.
```
GET /your_index/your_type/_search
{
  # phrase retrieval: term order is enforced via term positions
  # high precision but low recall
  "query": {
    # the name field must contain the complete phrase "daydream":
    # not just "day" alone, not just "dream" alone, and not "day ... dream"
    "match_phrase": {
      "name": "daydream"
    }
  }
}
```
17. Improve the recall rate of phrase retrieval
Using match_phrase essentially requires the field value in the doc to match the given phrase exactly, in exactly the same order. To raise recall you may want to tolerate a little looseness in the phrase match: for example, when searching for "i love world" you may also want to find "world love i".
This is what slop does: a doc is still returned if the terms of the given phrase can be lined up with it after at most slop position moves.
```
GET /your_index/your_type/_search
{
  # phrase retrieval
  "query": {
    # With slop, the search terms no longer have to be adjacent;
    # they may be at most slop moves apart.
    # The closer the terms (the fewer moves needed), the higher the relevance score.
    # match_phrase + slop resembles a proximity match:
    # it balances precision and recall.
    "match_phrase": {
      "address": {
        "query": "mill lane",
        # the terms in the search text may be moved at most this many
        # positions and still match a doc
        "slop": 2
      }
    }
  }
}
```
18. Mixing match and match_phrase to balance precision and recall
```
GET /your_index/your_type/_search
{
  # mix match and match_phrase to balance precision and recall
  "query": {
    "bool": {
      "must": {
        # Full-text retrieval matches many docs but cannot control how far
        # apart the terms are: "i love world" may sit close together in doc1
        # and still be ranked near the back of the result set.
        "match": {
          "title": "i love world"
        }
      },
      "should": {
        # slop rewards closeness: the fewer moves needed, the higher the score.
        # match_phrase + slop is aware of term positions, so it contributes
        # extra score to the docs where the terms sit closest,
        # pushing them to the front.
        "match_phrase": {
          "title": {
            "query": "i love world",
            "slop": 15
          }
        }
      }
    }
  }
}
```
19. Re-scoring with rescore_query to improve precision and recall.
```
GET /your_index/your_type/_search
{
  # the rescore mechanism
  "query": {
    "match": {
      "title": {
        "query": "i love world",
        "minimum_should_match": "50%"
      }
    }
  },
  # re-score the full-text search results
  "rescore": {
    # only re-score the top 50 full-text hits
    "window_size": 50,
    "query": {
      # keyword
      "rescore_query": {
        # match_phrase + slop is aware of term positions and contributes score
        "match_phrase": {
          "title": {
            "query": "i love world",
            "slop": 50
          }
        }
      }
    }
  }
}
```
20. Prefix match: search for docs whose user field starts with "daydream"
```
GET /your_index/your_type/_search
{
  # Prefix search does not compute relevance scores: every doc's score is 1.
  # The shorter the prefix, the more docs match, and the worse the performance.
  "query": {
    "prefix": { "user": "daydream" }
  }
}
```
21. Prefix search + add weight
```
GET /your_index/your_type/_search
{
  "query": {
    "prefix": {
      "name": {
        "value": "daydream",
        "boost": 2.0
      }
    }
  }
}
```
22. Wildcard search
```
GET /your_index/your_type/_search
{
  # wildcard search
  "query": {
    "wildcard": {
      "title": "daydream*notes"
    }
  }
}

GET /your_index/your_type/_search
{
  # wildcard search with a boost
  "query": {
    "wildcard": {
      "title": {
        "value": "daydream*notes",
        "boost": 2.0
      }
    }
  }
}
```
23. Regexp search
```
GET /your_index/your_type/_search
{
  # regexp search
  "query": {
    "regexp": {
      "name.first": {
        "value": "s.*y",
        "boost": 1.2
      }
    }
  }
}
```
24. Search suggestion: match_phrase_prefix. The end effect is similar to Baidu's search box.
match_phrase_prefix works like match_phrase, except that it treats the last term as a prefix for the search. That is also why it is called search-as-you-type: while you type, it fires a new request to fetch suggestions, so it is less efficient overall.
```
GET /your_index/your_type/_search
{
  "query": {
    # prefix match (keyword)
    "match_phrase_prefix": {
      "message": {
        # E.g. if you search "follow daydream", the analyzer leaves "daydream"
        # as the last term, and ES searches again using it as a prefix.
        # You might then find things like:
        #   "follow daydream's WeChat public account"
        #   "follow daydream's friend circle"
        "query": "follow daydream",
        # the maximum number of terms the prefix may expand to;
        # beyond this count the inverted index is not searched further
        "max_expansions": 10,
        # improve recall: use slop to loosen term positions, contributing score
        "slop": 10
      }
    }
  }
}
```
25. The function score query
The function score query lets users customize how a doc's score is boosted. For example, you can define a function that multiplies the score ES computed by the value of some field, and use the product as the doc's final score.
```
# Case 1
GET /your_index/your_type/_search
{
  "query": {
    "function_score": {
      # write the query as usual
      "query": {
        "match": {
          "title": "es"
        }
      },
      # custom boost strategy:
      # multiply the retrieved doc's final score by the value of its star field
      "field_value_factor": {
        "field": "star"
      },
      "boost_mode": "multiply",
      # cap the score at the value given by max_boost
      "max_boost": 3
    }
  }
}

# Case 2
GET /your_index/your_type/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "title": "es"
        }
      },
      "field_value_factor": {
        "field": "star",
        # soften the star factor:
        # newScore = oldScore * log(1 + star)
        "modifier": "log1p"
      },
      "boost_mode": "multiply",
      "max_boost": 3
    }
  }
}

# Case 3
GET /your_index/your_type/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "title": "es"
        }
      },
      "field_value_factor": {
        "field": "star",
        "modifier": "log1p",
        # newScore = oldScore * log(1 + star * factor)
        "factor": 0.1
      },
      # boost_mode decides how the function value combines with the
      # original score: multiply, sum, min, max, replace
      "boost_mode": "multiply",
      "max_boost": 3
    }
  }
}
```
26. The fuzzy query provides fault tolerance
```
GET /your_index/your_type/_search
{
  # the fuzzy query provides fault tolerance
  "query": {
    "fuzzy": {
      "user": {
        "value": "daydream",
        "boost": 1.0,
        # the maximum edit distance; usually set to AUTO
        "fuzziness": 2,
        # the number of leading characters that will not be "fuzzed";
        # this helps reduce the number of terms that must be checked. Default: 0
        "prefix_length": 0,
        # the maximum number of terms the fuzzy query expands to. Default: 50
        "max_expansions": 100,
        # allow transpositions (ab -> ba). Default: false
        "transpositions": true
      }
    }
  }
}
```
27. Interpreting a practical case
```
GET /your_index/your_type/_search
{
  "query": {
    # If your query is complex with many subqueries, consider wrapping them
    # in a bool query. Each subquery computes a relevance score for the doc
    # against its own clause; the bool query then combines those scores
    # into one final score.
    "bool": {
      # must: the doc must match, and the match contributes a relevance score.
      # Here: address must include "mill".
      "must": [
        { "match": { "address": "mill" } }
      ],
      # if there is no must clause, at least one should clause must match
      "should": [
        { "match": { "address": "lane" } }
      ],
      # must_not: the doc must not contain this
      "must_not": [
        { "match": { "address": "mill" } }
      ],
      # The expressions in filter only filter data; they never affect the
      # relevance score. So if you don't want a condition to influence the
      # final doc order, put it in filter. Conditions in query context
      # compute relevance scores: the higher the score, the higher the doc ranks.
      "filter": {
        "range": {          # filter by range
          "balance": {      # the field to filter on
            "gte": 20000,   # at least 20000
            "lte": 30000    # at most 30000
          }
        }
      }
    }
  }
}
```
The default sort order is descending by _score. But as mentioned above, if all your conditions sit in filter, no scores are computed, which means every doc scores the same. In that case you need to define your own sort order.
28. Query docs whose name contains "daydream" and sort them by star
Highlighting, sorting, paging, and selecting fields with _source can all be applied on top of a query's results.
```
# ES's default sort order is descending by the _score field,
# but ES lets you customize the sort order, like this:
GET /your_index/your_type/_search
{
  "query": {
    "match": { "name": "daydream" }
  },
  # specify the sort conditions
  "sort": [
    # sort on the star field, descending
    { "star": "desc" }
  ]
}
```
29. Paging query
For example: start from the first doc and fetch 10. (Without from and size, the default is to return the top 10.)
```
GET /your_index/your_type/_search
{
  "query": { "match_all": {} },
  "from": 0,    # 0 = the first doc
  "size": 10
}

# the same thing via the URI:
GET /your_index/your_type/_search?size=10
GET /your_index/your_type/_search?size=10&from=20
```

A word on deep paging: suppose a system has 3 primary shards with 1 replica each, 60,000 docs in total, and the user asks for docs 10001~10010 (10 per page, deep into the result set). The user's paging request lands on one of the shards in the ES cluster. What happens next?

The shard that receives the request acts as the coordinate node. It forwards the request to the three primary shards, and each shard returns the ids of its own top 1~10010 docs, so the coordinate node receives 30030 ids in total. It then issues an mget for the corresponding data, sorts all 30030 results, and finally returns the 10 with the highest relevance scores to the user. That is why paging too deep is very expensive in memory, network bandwidth, and CPU.
30. Query only some fields of each doc, as follows:
```
# suppose daydream's doc json looks like this:
# {"name": "daydream", "address": "beijing", "gender": "man"}
GET /your_index/your_type/_search
{
  "query": { "match_all": {} },
  # ES returns the doc's full json by default;
  # _source narrows the response to the fields you want
  "_source": ["name"]
}
```
31. Query docs whose name contains "daydream" and whose star is greater than 100.
```
GET /your_index/your_type/_search
{
  "query": {
    # bool can wrap multiple query conditions
    "bool": {
      "must": {
        "match": { "name": "daydream" }
      },
      # filter on the range of star
      "filter": {
        # range can sit in either query or filter context.
        # In filter it does not affect the final score;
        # in query it does.
        "range": {
          "star": { "gt": 100 }
        }
      }
    }
  }
}

# Extension: range can also filter on time.
# Docs whose birthday falls within the last month, via date math:
# "range": { "birthday": { "gt": "2021-01-20||-30d" } }
# or with the now syntax:
# "range": { "birthday": { "gt": "now-30d" } }
```
32. Highlight the specified word in the specified field of the returned docs.
```
GET /your_index/your_type/_search
{
  "query": {
    "match": { "name": "daydream" }
  },
  "highlight": {          # highlighting
    "fields": {
      "firstname": {}     # the field to highlight
    }
  }
}

# response excerpt:
# "hits" : {
#   "total" : 1000,
#   "max_score" : null,
#   "hits" : [
#     {
#       "_index" : "bank",
#       "_type" : "_doc",
#       "_id" : "0",
#       "sort" : [0],
#       "_score" : 0.777777,
#       "_source" : {"account_number":0,"balance":16623,"firstname":"I am white","lastname":"Day dream","state":"CO"},
#       "highlight" : {
#         "firstname" : ["<em>I am white</em>"]
#       }
#     }
#   ]
# }
# ...
```
Reference: www.elastic.co/guide/en/el…
4. Aggregation Analysis
4.1. What is aggregation analysis?
Aggregation analysis is similar to GROUP BY in SQL (for example, grouping the docs where age > 20 and age < 30). The most common aggregation groups docs by some field, and that field must not be analyzed: if the aggregated field were analyzed and stored in the inverted index, the whole inverted index would have to be scanned just to collect the field's values, which is very inefficient.
Aggregation analysis therefore runs on a data structure called doc values, which is essentially a forward index.
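This is why the examples below aggregate on fields like gender.keyword: a text field is analyzed, while its keyword sub-field is stored un-analyzed with doc values. A minimal mapping sketch, with hypothetical index, type, and field names, in the same 6.x style as the murmur3 mapping later in this article:

```
PUT /your_index
{
  "mappings": {
    "your_type": {
      "properties": {
        "gender": {
          "type": "text",       # analyzed: good for full-text search
          "fields": {
            "keyword": {        # un-analyzed sub-field backed by doc values;
              "type": "keyword" # aggregate on gender.keyword
            }
          }
        }
      }
    }
  }
}
```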
Three concepts matter in aggregation analysis:

- bucket: especially when you use the ES APIs in Java or Golang, you will keep running into the bucket keyword; a bucket is the result set of a grouping operation.
- metric: a metric is a computation over a bucket, such as the maximum, minimum, or average.
- drill down: drilling down means grouping existing buckets further, for example grouping by gender first, then drilling down into each group by age.
4.2. The Good Stuff: 15 Aggregation Cases!
1. Say I want to know how many people in my company are named Tom and how many are named Jerry; in other words, how many people share each name. Then we aggregate by name, like this.
The aggregation result carries a natural metric: the count of each bucket, which is exactly what we want.
```
GET /your_index/your_type/_search
{
  "size": 0,
  # every aggregation comes with a natural metric: the bucket's count
  "aggs": {
    "group_by_name": {    # custom name
      "terms": {
        "field": "name"   # group by name
      }
    }
  }
}

GET /your_index/your_type/_search
{
  "size": 0,
  "aggs": {
    "group_by_xxx": {     # custom name
      # besides a single value, terms also lets you list several values
      "terms": {
        # group by value1, value2, value3
        "field": ["value1", "value2", "value3"]
      }
    }
  }
}
```
2. Search first, then aggregate over the search results. For example: among all the men, how many share each name?
```
GET /your_index/your_type/_search
{
  "query": {
    "term": { "gender": "man" }
  },
  # then aggregate
  "aggs": {
    "group_by_name": {
      "terms": {
        "field": "name"   # group by name
      }
    }
  }
}
```
3. Group people with the same name together, and get each group's average age. You can do it like this:
```
GET /your_index/your_type/_search
{
  "size": 0,
  # group by name first, then take the average age inside each group
  "aggs": {
    "group_by_name": {
      "terms": {
        "field": "name"
      },
      # aggregate by age on top of the name groups above
      "aggs": {
        "average_age": {
          # the aggregation function is avg
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}
```
4. I want to know the age distribution in my company: how many people are 20~25, 25~30, 30~35, and 35~40 years old, and within each age band, how many are women and how many are men.
```
GET /your_index/your_type/_search
{
  "size": 0,
  # group by age range first, then by gender
  "aggs": {
    "group_by_age": {
      "range": {
        "field": "age",
        "ranges": [
          { "from": 20, "to": 25 },
          { "from": 25, "to": 30 },
          { "from": 30, "to": 35 },
          { "from": 35, "to": 40 }
        ]
      },
      "aggs": {
        "group_by_gender": {
          "terms": {
            # aggregate on the un-analyzed keyword sub-field
            "field": "gender.keyword"
          }
        }
      }
    }
  }
}
```
5. I'd like to know the average account balance for each age band and each gender.
```
GET /your_index/your_type/_search
{
  "size": 0,
  # group by age band, then by gender, then take the average balance;
  # the final result is the average account balance per age band per gender
  "aggs": {
    "group_by_age": {
      "range": {
        "field": "age",
        "ranges": [
          { "from": 20, "to": 30 }
        ]
      },
      "aggs": {
        "group_by_gender": {
          "terms": {
            "field": "gender.keyword"
          },
          # on top of the gender groups, take the avg of balance
          "aggs": {
            "average_balance": {
              "avg": {
                "field": "balance"
              }
            }
          }
        }
      }
    }
  }
}
```
6. Nesting aggregations and using the inner aggregation's result set
```
GET /your_index/your_type/_search
{
  "size": 0,
  # nest aggregations, and order the buckets by the inner aggregation's result
  "aggs": {
    "group_by_state": {
      "terms": {
        "field": "state.keyword",
        "order": {
          # average_balance is the result of the inner aggregation below;
          # sort the buckets by it, descending
          "average_balance": "desc"
        }
      },
      # this agg produces multiple metrics per bucket:
      # bucket1 => {state=..., avg=..., min=..., max=..., sum=...}
      "aggs": {
        "average_balance": {
          "avg": {            # metric: average
            "field": "balance"
          }
        },
        "min_price": {
          "min": {            # metric: minimum
            "field": "price"
          }
        },
        "max_price": {
          "max": {            # metric: maximum
            "field": "price"
          }
        },
        "sum_price": {
          "sum": {            # metric: total
            "field": "price"
          }
        }
      }
    }
  }
}
```
8. Besides grouping by discrete values (men vs women, as above), you can also aggregate by numeric interval with a histogram.
```
GET /your_index/your_type/_search
{
  "size": 0,
  # histogram, like terms, performs bucketing.
  # It requires a field (age below) and groups docs by ranges of that field.
  "aggs": {
    "group_by_price": {
      "histogram": {
        "field": "age",
        # interval = 10 buckets the values as 0-10, 10-20, 20-30, ...
        # a record with age 21 falls into the 20-30 bucket
        "interval": 10
      },
      # an aggregation nested inside the aggregation
      "aggs": {
        "average_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
```
9. Aggregate by date
```
GET /your_index/your_type/_search
{
  "size": 0,
  "aggs": {
    "agg_by_time": {
      # keyword
      "date_histogram": {
        "field": "age",
        # interval: one bucket per month
        "interval": "1M",
        "format": "yyyy-MM-dd",
        # return a bucket even if it contains no data
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2021-01-01",
          "max": "2021-01-01"
        }
      }
    }
  }
}

# additionally, "interval": "quarter" buckets by quarter
```
10. Filter plus aggregation
```
# Case 1
# e.g. filter for age > 20 first, then take the average salary
GET /your_index/your_type/_search
{
  "size": 0,
  "query": {
    "constant_score": {
      # this filter runs against the global data in ES
      "filter": {
        "range": {
          "age": { "gte": 20 }
        }
      }
    }
  },
  "aggs": {
    "avg_salary": {
      "avg": {
        "field": "salary"
      }
    }
  }
}

# Case 2
# bucket-level filters
POST /sales/_search
{
  "aggs": {
    # the agg for the t-shirt bucket
    "agg_t_shirts": {
      "filter": {
        "term": {
          "type": "t-shirt"
        }
      },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    },
    # the agg for the sweater bucket
    "agg_sweater": {
      "filter": {
        "term": {
          "type": "sweater"
        }
      },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}
```
11. Nested aggregation: breadth first
An example scenario: we retrieve movie reviews, but aggregate them first by actor and then by review count. Assume each actor has appeared in 10 movies.
Analysis: with depth first, while ES builds each actor bucket it also computes the review-count information beneath each of that actor's movies. With 100,000 actors that is 100,000 x 10 = 1,000,000 movies, each with many reviews to process. So there may be millions of records in memory even though we only end up keeping a handful, which is very expensive.
With breadth first, ES deals with the actor buckets first and ignores the nested review aggregation: it prunes the 100,000 actors down to the 10 we keep, and only then aggregates under those 10 actors.
"aggs": {"target_actors": {"terms": {"field":"actors"."size":10."collect_mode":"breadth_first"}}}Copy the code
12. Global aggregation
Global aggregation: below, a query does a full-text search, followed by aggregations; the two aggregations actually run over two different result sets.

- The first aggregation carries the global keyword, which means it aggregates over all docs present in ES, computing the average price across all products.
- The second aggregation aggregates over the full-text search results only.
```
POST /sales/_search?size=0
{
  # full-text search for products with type = t-shirt
  "query": {
    "match": { "type": "t-shirt" }
  },
  "aggs": {
    "all_products": {
      # global means all_products aggregates over all data in ES
      "global": {},
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    },
    # without the global keyword, this aggregates the full-text search results
    "t_shirts": {
      "avg": { "field": "price" }
    }
  }
}
```
13. Cardinality aggregation
The cardinality metric is commonly used in ES aggregations: it deduplicates the given field within each bucket and returns the deduplicated count.
It has an error rate of roughly 5%, but its performance is excellent.
```
POST /sales/_search?size=0
{
  "aggs": {
    # bucket by month
    "agg_by_month": {
      "date_histogram": {
        "field": "my_month",
        "interval": "month"
      },
      # within each month bucket, count the distinct brands;
      # the final result is the number of brands sold per month
      "aggs": {
        "dis_by_brand": {
          "cardinality": {
            "field": "brand"
          }
        }
      }
    }
  }
}
```
To optimize a cardinality aggregation, add precision_threshold to tune precision and memory overhead.
With precision_threshold at 100, deduplication is 100% accurate as long as there are fewer than 100 distinct brands, and memory usage is 100 x 8 = 800 bytes.
Raise the value to 1000 and deduplication is 100% accurate below 1000 distinct products, at a memory cost of 1000 x 8 bytes = 8KB.
The official figure is that with precision_threshold set, the error rate is kept within about 5%.
```
POST /sales/_search?size=0
{
  "aggs": {
    "type_count": {
      "cardinality": {
        # keyword
        "field": "brand",
        "precision_threshold": 100
      }
    }
  }
}
```
Going further: the algorithm underneath cardinality is HyperLogLog++.
Because the algorithm hashes every unique value and approximates the distinct count from those hashes, we can declare a hash field in the mapping so the hash is computed once, when the doc is added. HyperLogLog++ then reuses the stored hash instead of recomputing it, which speeds things up.
```
PUT /index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": {
          "type": "text",
          "fields": {
            "hash": {
              # requires the mapper-murmur3 plugin
              "type": "murmur3"
            }
          }
        }
      }
    }
  }
}
```
14. Controlling the sort order of an aggregation
For example: I want the average price of each color of product, presented in ascending order of price.
As below, products of the same color are first bucketed by color, then the price is averaged within each bucket. Finally, order sorts the color buckets ascending by that price aggregation. This is a common sorting trick in drill-down analysis.
```
GET /index/type/_search
{
  "size": 0,
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color",
        "order": {
          # sort the buckets ascending by the avg_price agg below
          "avg_price": "asc"
        }
      },
      # on top of the color buckets, aggregate the price
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
```
15. Percentiles aggregation
Percentile calculations are commonly used for questions like: what share of page loads complete within 200ms, within 500ms, within 1000ms; or what share of total sales comes from products priced at 1000 yuan, at 2000 yuan, and so on.
Example: for the load_time field in the docs, compute the load-time outliers at different percentiles.
```
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_outlier": {
      # keyword
      "percentiles": {
        "field": "load_time"
      }
    }
  }
}
```
Reading the response: the 50th-percentile load_time is 445.0, i.e. half of all requests load within 445.0ms, and 99% of requests load within 980.1ms.
```
{
  ...
  "aggregations": {
    "load_time_outlier": {
      "values": {
        "1.0": 9.9,
        "5.0": 29.500000000000004,
        "25.0": 167.5,
        "50.0": 445.0,
        "75.0": 722.5,
        "95.0": 940.5,
        "99.0": 980.1000000000001
      }
    }
  }
}
```
You can also specify the percentile points yourself:
```
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_outlier": {
      "percentiles": {
        "field": "load_time",
        "percents": [95, 99, 99.9]
      }
    }
  }
}
```
Optimization: percentiles are computed with the TDigest algorithm underneath, which uses many nodes to estimate the percentiles approximately; there is some error, and the more nodes, the more accurate the result.
The default compression is 100. ES caps the node count at compression x 20 = 2000, because more nodes also means worse performance.
Each node occupies 32 bytes, so 100 x 20 x 32 = 64KB.
```
GET latency/_search
{
  "size": 0,
  "aggs": {
    "load_time_outlier": {
      "percentiles": {
        "field": "load_time",
        "percents": [95, 99, 99.9],
        # the default value is 100
        "compression": 100
      }
    }
  }
}
```
Reference: www.elastic.co/guide/en/el…
5. Seven Query Optimization Tips
- First: in multi-field retrieval, control the weights cleverly.
- Second: change how the query is written to shift how much weight each clause carries.
- Third: if you don't want relevance scoring at all, use filter context (see the sketch after this list).
- Fourth: query flexibly.
- Fifth: say I am searching the title field and want results containing "java"; results containing "golang" are allowed, but! if a result's title contains "golang", I want that doc ranked lower.
- Sixth: re-scoring.
- Seventh: a tip for improving recall and precision together: mix match with match_phrase + slop to raise recall, and mind the nesting levels of the query: bool, must, should.
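The full implementations live behind the "read the original" link; as a taste, here are minimal sketches of tips 3 and 5. The index and field names are placeholders, and the exact queries are my assumption of what the tips describe, not the author's exact code.

```
# Tip 3 sketch: skip relevance scoring entirely by keeping
# the condition in filter context (constant_score)
GET /your_index/your_type/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "title": "java" }
      }
    }
  }
}

# Tip 5 sketch: a boosting query keeps docs containing "golang"
# in the results but demotes their score via negative_boost
GET /your_index/your_type/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": { "title": "java" }
      },
      "negative": {
        "match": { "title": "golang" }
      },
      "negative_boost": 0.2
    }
  }
}
```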
The concrete implementation code for these seven ways of tuning relevance scores can be viewed in the original post on the public account. I recommend reading the original; the JSON formatting looks much better there. The ES series is still being serialized; welcome to follow.
References:
Official documentation: www.elastic.co/guide/en/el…
Query DSL: www.elastic.co/guide/en/el…
Aggregation analysis: www.elastic.co/guide/en/el…
Welcome to follow!
You know how to find me ~