Grouping data into buckets
Preparing the data
- Create the index with field mappings:
curl -X PUT ip:port/tvs
{
"mappings": {
"properties": {
"price": {
"type": "long"
},
"color": {
"type": "keyword"
},
"brand": {
"type": "keyword"
},
"sold_date": {
"type": "date"
}
}
}
}
- Index the test data:
curl -X POST ip:port/tvs/_bulk
{"index":{}}
{"price":1000,"color":"red","brand":"changhong","sold_date":"2016-10-28"}
{"index":{}}
{"price":2000,"color":"red","brand":"changhong","sold_date":"2016-11-05"}
{"index":{}}
{"price":3000,"color":"green","brand":"xiaomi","sold_date":"2016-05-18"}
{"index":{}}
{"price":1500,"color":"blue","brand":"TCL","sold_date":"2016-07-02"}
{"index":{}}
{"price":1200,"color":"green","brand":"TCL","sold_date":"2016-08-19"}
{"index":{}}
{"price":2000,"color":"red","brand":"changhong","sold_date":"2016-11-05"}
{"index":{}}
{"price":8000,"color":"red","brand":"samsung","sold_date":"2017-01-01"}
{"index":{}}
{"price":2500,"color":"blue","brand":"xiaomi","sold_date":"2017-02-12"}
1. Basic functions
A metric is an aggregate calculation performed on a bucket: count, avg, max, min, sum.
Counting documents per group
To find which color of TV sells the most:
curl -X GET ip:port/tvs/_search
{
"size" : 0,
"aggs" : {
"popular_colors" : {
"terms" : {
"field" : "color"
}
}
}
}
Request parameters
- size: return only the aggregation results, not the raw documents the aggregation ran over;
- aggs: fixed syntax indicating that an aggregation is to be performed on the data;
- popular_colors: a custom name for this aggregation;
- terms: groups documents by the value of a field;
- field: the field to group by.
Returns the result
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 8, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "popular_colors" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : "red", "doc_count" : 4 },
        { "key" : "green", "doc_count" : 2 },
        { "key" : "blue", "doc_count" : 2 }
      ]
    }
  }
}
Response fields explained
- hits.hits: empty because we specified size=0 in the request; otherwise the raw documents being aggregated would be returned.
- aggregations: the aggregation results.
- popular_colors: the user-defined aggregation name.
- buckets: the groups formed from the field we specified.
- key: the field value for the bucket.
- doc_count: the number of docs in the bucket.
Counting documents per group is not a metric; it is the default behavior of Elasticsearch aggregations, implemented with the terms aggregation.
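The bucketing logic of a terms aggregation can be sketched in plain Python. This is a minimal illustration (not Elasticsearch's actual implementation); the dataset mirrors the test documents indexed above:

```python
from collections import Counter

# Test documents (mirrors the bulk-indexed data above).
docs = [
    {"price": 1000, "color": "red", "brand": "changhong"},
    {"price": 2000, "color": "red", "brand": "changhong"},
    {"price": 3000, "color": "green", "brand": "xiaomi"},
    {"price": 1500, "color": "blue", "brand": "TCL"},
    {"price": 1200, "color": "green", "brand": "TCL"},
    {"price": 2000, "color": "red", "brand": "changhong"},
    {"price": 8000, "color": "red", "brand": "samsung"},
    {"price": 2500, "color": "blue", "brand": "xiaomi"},
]

def terms_buckets(docs, field):
    """Group docs by a field value and count docs per bucket,
    like a terms aggregation (buckets sorted by doc_count desc)."""
    counts = Counter(d[field] for d in docs)
    return [{"key": k, "doc_count": n} for k, n in counts.most_common()]

print(terms_buckets(docs, "color"))
# [{'key': 'red', 'doc_count': 4}, {'key': 'green', 'doc_count': 2}, {'key': 'blue', 'doc_count': 2}]
```

The buckets match the popular_colors result above: red 4, green 2, blue 2.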
Computing averages
To compute the average price of each TV color:
curl -X GET ip:port/tvs/_search
{
"size" : 0,
"aggs": {
"colors": {
"terms": {
"field": "color"
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
The nested aggs sits at the same level as terms and performs a metric calculation on each bucket.
Returns the result
{
  ...
  "aggregations" : {
    "colors" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : "red", "doc_count" : 4, "avg_price" : { "value" : 3250.0 } },
        { "key" : "green", "doc_count" : 2, "avg_price" : { "value" : 2100.0 } },
        { "key" : "blue", "doc_count" : 2, "avg_price" : { "value" : 2000.0 } }
      ]
    }
  }
}
The avg_price value is the metric result: the average of the price field over all docs in each bucket.
Drill-down analysis
Buckets can be split again into sub-buckets, and aggregation analysis performed on each of the smallest groups. For example: group TVs by color, then compute the average price of each brand within each color.
curl -X GET ip:port/tvs/_search
{
"size": 0,
"aggs": {
"group_by_color": {
"terms": {
"field": "color"
},
"aggs": {
"color_avg_price": {
"avg": {
"field": "price"
}
},
"group_by_brand": {
"terms": {
"field": "brand"
},
"aggs": {
"brand_avg_price": {
"avg": {
"field": "price"
}
}
}
}
}
}
}
}
The nested group_by_brand groups by the brand field to find each brand's average price.
{
  ...
  "aggregations" : {
    "group_by_color" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "red",
          "doc_count" : 4,
          "color_avg_price" : { "value" : 3250.0 },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : "changhong", "doc_count" : 3, "brand_avg_price" : { "value" : 1666.6666666666667 } },
              { "key" : "samsung", "doc_count" : 1, "brand_avg_price" : { "value" : 8000.0 } }
            ]
          }
        },
        {
          "key" : "green",
          "doc_count" : 2,
          "color_avg_price" : { "value" : 2100.0 },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : "TCL", "doc_count" : 1, "brand_avg_price" : { "value" : 1200.0 } },
              { "key" : "xiaomi", "doc_count" : 1, "brand_avg_price" : { "value" : 3000.0 } }
            ]
          }
        },
        {
          "key" : "blue",
          "doc_count" : 2,
          "color_avg_price" : { "value" : 2000.0 },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : "TCL", "doc_count" : 1, "brand_avg_price" : { "value" : 1500.0 } },
              { "key" : "xiaomi", "doc_count" : 1, "brand_avg_price" : { "value" : 2500.0 } }
            ]
          }
        }
      ]
    }
  }
}
Computing extremes
To find the highest and lowest price for each TV color:
curl -X GET ip:port/tvs/_search
{
"size" : 0,
"aggs": {
"colors": {
"terms": {
"field": "color"
},
"aggs": {
"min_price" : { "min": { "field": "price"} },
"max_price" : { "max": { "field": "price"} }
}
}
}
}
2. Interval grouping
The histogram keyword groups a numeric field into fixed-size intervals; to group a field of date type, use the date_histogram keyword instead.
It takes a field and assigns each document to the bucket covering its value range:
curl -X GET ip:port/tvs/_search
{
"size" : 0,
"aggs":{
"price":{
"histogram":{
"field": "price",
"interval": 2000
}
}
}
}
The request above groups the price field into buckets with an interval of 2000. The result:
{
  ...
  "aggregations" : {
    "price" : {
      "buckets" : [
        { "key" : 0.0, "doc_count" : 3 },
        { "key" : 2000.0, "doc_count" : 4 },
        { "key" : 4000.0, "doc_count" : 0 },
        { "key" : 6000.0, "doc_count" : 0 },
        { "key" : 8000.0, "doc_count" : 1 }
      ]
    }
  }
}
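The bucket key for a histogram is the value rounded down to the nearest multiple of the interval. A minimal Python sketch of that assignment rule, using the prices from the test data:

```python
import math

def histogram_buckets(values, interval):
    """Assign each value to the bucket whose key is
    floor(value / interval) * interval, like a histogram aggregation."""
    buckets = {}
    for v in values:
        key = math.floor(v / interval) * interval
        buckets[key] = buckets.get(key, 0) + 1
    return [{"key": k, "doc_count": n} for k, n in sorted(buckets.items())]

prices = [1000, 2000, 3000, 1500, 1200, 2000, 8000, 2500]
print(histogram_buckets(prices, 2000))
# [{'key': 0, 'doc_count': 3}, {'key': 2000, 'doc_count': 4}, {'key': 8000, 'doc_count': 1}]
```

Note that Elasticsearch also emits the empty in-between buckets (4000.0 and 6000.0 in the result above), which this sketch omits.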
After grouping by interval, we can perform metric operations on each bucket, such as calculating the sum:
curl -X GET ip:port/tvs/_search
{
"size" : 0,
"aggs":{
"price":{
"histogram":{
"field": "price",
"interval": 2000
},
"aggs":{
"revenue": {
"sum": {
"field" : "price"
}
}
}
}
}
}
2.1. date_histogram
When the field being grouped into intervals is of date type, the date_histogram keyword is required, for example:
curl -X GET ip:port/tvs/_search
{
  "size" : 0,
  "aggs": {
    "sales": {
      "date_histogram": {
        "field": "sold_date",
        "interval": "month",
        "format": "yyyy-MM-dd",
        "min_doc_count" : 0,
        "extended_bounds" : {
          "min" : "2016-01-01",
          "max" : "2017-12-31"
        }
      }
    }
  }
}
Parameter explanation
- min_doc_count: a date interval is returned only if it contains at least this many docs.
- extended_bounds: constrains the generated buckets to this start and end date.
TV sales per brand per quarter:
curl -X GET ip:port/tvs/_search
{
"size": 0,
"aggs": {
"group_by_sold_date": {
"date_histogram": {
"field": "sold_date",
"interval": "quarter",
"format": "yyyy-MM-dd",
"min_doc_count": 0,
"extended_bounds": {
"min": "2016-01-01",
"max": "2017-12-31"
}
},
"aggs": {
"total_sum_price": {
"sum": {
"field": "price"
}
},
"group_by_brand": {
"terms": {
"field": "brand"
},
"aggs": {
"sum_price": {
"sum": {
"field": "price"
}
}
}
}
}
}
}
}
Group by date, drill down into each group by brand, and then run a sum metric on each subgroup. The results:
{
  ...
  "aggregations" : {
    "group_by_sold_date" : {
      "buckets" : [
        {
          "key_as_string" : "2016-01-01",
          "key" : 1451606400000,
          "doc_count" : 0,
          "total_sum_price" : { "value" : 0.0 },
          "group_by_brand" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ ] }
        },
        {
          "key_as_string" : "2016-04-01",
          "key" : 1459468800000,
          "doc_count" : 1,
          "total_sum_price" : { "value" : 3000.0 },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : "xiaomi", "doc_count" : 1, "sum_price" : { "value" : 3000.0 } }
            ]
          }
        },
        {
          "key_as_string" : "2016-07-01",
          "key" : 1467331200000,
          "doc_count" : 2,
          "total_sum_price" : { "value" : 2700.0 },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : "TCL", "doc_count" : 2, "sum_price" : { "value" : 2700.0 } }
            ]
          }
        },
        {
          "key_as_string" : "2016-10-01",
          "key" : 1475280000000,
          "doc_count" : 3,
          "total_sum_price" : { "value" : 5000.0 },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : "changhong", "doc_count" : 3, "sum_price" : { "value" : 5000.0 } }
            ]
          }
        },
        {
          "key_as_string" : "2017-01-01",
          "key" : 1483228800000,
          "doc_count" : 2,
          "total_sum_price" : { "value" : 10500.0 },
          "group_by_brand" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              { "key" : "samsung", "doc_count" : 1, "sum_price" : { "value" : 8000.0 } },
              { "key" : "xiaomi", "doc_count" : 1, "sum_price" : { "value" : 2500.0 } }
            ]
          }
        },
        {
          "key_as_string" : "2017-04-01",
          "key" : 1491004800000,
          "doc_count" : 0,
          "total_sum_price" : { "value" : 0.0 },
          "group_by_brand" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ ] }
        },
        {
          "key_as_string" : "2017-07-01",
          "key" : 1498867200000,
          "doc_count" : 0,
          "total_sum_price" : { "value" : 0.0 },
          "group_by_brand" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ ] }
        },
        {
          "key_as_string" : "2017-10-01",
          "key" : 1506816000000,
          "doc_count" : 0,
          "total_sum_price" : { "value" : 0.0 },
          "group_by_brand" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0, "buckets" : [ ] }
        }
      ]
    }
  }
}
3. Aggregation qualification
An aggregation scope specifies the range of docs that aggregation analysis runs over. It can be combined with query and filter.
Aggregate analysis is used in conjunction with full-text retrieval
All aggregations in Elasticsearch are performed within a scope, which is the retrieved result when combined with a normal search request.
Counting the sales of each color for a given brand:
curl -X GET ip:port/tvs/_search
{
  "size": 0,
  "query": {
    "term": {
      "brand": {
        "value": "xiaomi"
      }
    }
  },
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color"
      }
    }
  }
}
Returns the result
{
"took" : 34,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_color" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "green",
"doc_count" : 1
},
{
"key" : "blue",
"doc_count" : 1
}
]
}
}
}
Aggregation analysis is used in conjunction with filter
Computing the average price of all TVs priced at 1200 or more:
curl -X GET ip:port/tvs/_search
{
"size": 0,
"query": {
"constant_score": {
"filter": {
"range": {
"price": {
"gte": 1200
}
}
}
}
},
"aggs": {
"avg_price": {
"avg": {
"field": "price"
}
}
}
}
To apply a filter to a single bucket, use aggs.filter. For example, the average price of changhong TVs over the last 1, 3, and 6 months (note that in Elasticsearch date math uppercase M means months, while lowercase m means minutes):
curl -X GET ip:port/tvs/_search
{
  "size": 0,
  "query": { "term": { "brand": { "value": "changhong" } } },
  "aggs": {
    "recent_1m": {
      "filter": { "range": { "sold_date": { "gte": "now-1M" } } },
      "aggs": { "recent_1m_avg_price": { "avg": { "field": "price" } } }
    },
    "recent_3m": {
      "filter": { "range": { "sold_date": { "gte": "now-3M" } } },
      "aggs": { "recent_3m_avg_price": { "avg": { "field": "price" } } }
    },
    "recent_6m": {
      "filter": { "range": { "sold_date": { "gte": "now-6M" } } },
      "aggs": { "recent_6m_avg_price": { "avg": { "field": "price" } } }
    }
  }
}
4. Global grouping
A single aggregation request can return two sets of results by using a global bucket:
- aggregation results within the query scope;
- aggregation results over all docs, unrestricted by that scope.
Comparing the average price of changhong TVs with the average price across all brands:
curl -X GET ip:port/tvs/_search
{
  "size": 0,
  "query": { "term": { "brand": { "value": "changhong" } } },
  "aggs": {
    "single_brand_avg_price": { "avg": { "field": "price" } },
    "all": {
      "global": {},
      "aggs": {
        "all_brand_avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}
The query in the request above limits the scope, so single_brand_avg_price aggregates only the docs within that scope, while the nested global keyword makes the all aggregation run over all docs.
Returns the result
{
  "took" : 35,
  "timed_out" : false,
  "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 },
  "hits" : {
    "total" : { "value" : 3, "relation" : "eq" },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "all" : {
      "doc_count" : 8,
      "all_brand_avg_price" : { "value" : 2650.0 }
    },
    "single_brand_avg_price" : { "value" : 1666.6666666666667 }
  }
}
5. Shard parallelism and approximation
Generally speaking, some metric operations of aggregation analysis are easy to run in parallel across shards, such as max, min, and avg. After receiving each shard's result, the coordinating node needs only a simple calculation to get the final answer. For max:
- The coordinating node broadcasts the request to all shards;
- Each shard computes the local maximum of the field and returns it to the coordinating node;
- The coordinating node picks the largest of the values the shards return, and that is the final maximum.
These algorithms scale horizontally and linearly as machines are added, require no coordination (machines exchange no intermediate results), and consume almost no memory (a single number holds the running maximum).
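The shard-parallel max described above can be sketched in a few lines of Python; the per-shard value lists here are an illustrative split of the test data, not how Elasticsearch actually routes documents:

```python
# Each shard computes a local maximum; the coordinating node
# merges them with one more max() call.
shards = [
    [1000, 2000, 3000],   # shard 0's price values
    [1500, 1200, 2000],   # shard 1
    [8000, 2500],         # shard 2
]

local_maxes = [max(shard) for shard in shards]   # computed in parallel on each shard
global_max = max(local_maxes)                    # tiny merge step on the coordinating node

print(local_maxes, global_max)  # [3000, 2000, 8000] 8000
```

The merge step touches only one number per shard, which is why the memory and network cost stays constant no matter how many documents each shard holds.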
However, other algorithms are difficult to run in parallel, such as count(distinct). It is not enough for each shard to report a local distinct count: the coordinating node needs the values returned by every shard in order to deduplicate them in memory, and when the data volume is very large this process can be time-consuming.
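A small sketch of why count(distinct) does not combine like max: the same value can appear on several shards, so summing per-shard distinct counts overcounts, and the exact answer forces the coordinating node to merge every shard's full value set (the shard contents here are illustrative):

```python
# Distinct colors observed on each shard; "red" appears on all three.
shards = [
    {"red", "green"},          # shard 0
    {"red", "blue"},           # shard 1
    {"green", "blue", "red"},  # shard 2
]

naive_sum = sum(len(s) for s in shards)   # 7 -- wrong: double-counts repeats
exact = len(set().union(*shards))         # 3 -- correct, but needs the full sets

print(naive_sum, exact)  # 7 3
```

The memory needed for the exact merge grows with the true cardinality, which is the cost the approximation algorithms below are designed to avoid.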
As a result, Elasticsearch uses approximation algorithms for such calculations: they trade small estimation errors (results are close to exact, but not 100% accurate) for fast execution and minimal memory consumption.