There is a data analysis feature for products on our platform that produces statistics for a given merchant over a given period (today, yesterday, 7 days, 30 days…). For such analysis, querying with MySQL count, sum and GROUP BY is obviously unreliable, especially with a large amount of data; the efficiency speaks for itself. I originally intended to do this with HBase MapReduce, or to pre-compute the various dimensions with Hadoop MR and store the results in HBase. Later, after chatting with a friend, I was thoroughly despised: he basically said our data volume was far too small for HBase, let alone MR. So we decided to use ElasticSearch (ES for short), and set off down the traditional road of building it ourselves, falling into one pit after another.

1. Download and configure ElasticSearch

On ElasticSearch's website, www.elastic.co/products/el… , find the version you need. I chose 5.6.9 here. Don't ask me why: the latest version is uncharted territory with even more pits for me, and 5.6.9 is the version I'm currently most comfortable with, so download it directly. ES requires at least JDK 1.8, so the Java environment variables must be configured first.

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.6.9.tar.gz
tar zxvf elasticsearch-5.6.9.tar.gz -C /usr/local/
cd /usr/local/elasticsearch-5.6.9

The ES configuration files are all in the config directory, where elasticsearch.yml is the main configuration file.

vim config/elasticsearch.yml
cluster.name: tsk-es

node.name: tsk1

path.data: /opt/data/elastic/data

path.logs: /opt/data/elastic/log

network.host: 0.0.0.0

http.port: 9200

If you now run bin/elasticsearch as root, it will give you a bunch of errors. The first one tells you that you can't start ES as root, so you need to create a user. I created one called elastic; remember to give that user permission on the ES and data folders, then su to that user to start. Before switching to elastic, though, there are a few things you need to change:

1. Modify the /etc/security/limits.conf file, otherwise you will get the error max file descriptors [4096] for elasticsearch process likely too low, increase to at least [65536]

vim /etc/security/limits.conf
* soft nofile 65536

* hard nofile 65536

* soft nproc 2048

* hard nproc 4096

2. Modify /etc/sysctl.conf, otherwise you will get the error max virtual memory areas vm.max_map_count [65530] likely too low, increase to at least [262144]

vim /etc/sysctl.conf

vm.max_map_count=262144

After saving, run sysctl -p to make the change take effect.

Once you've made all these changes, you can switch to the elastic user and start ES:

su elastic
bin/elasticsearch

If nothing goes wrong, your ES should now be running. Visit 192.168.0.1:9200 (your server's address) in a browser and you should see a string of JSON proving that ES started successfully. If you want to run ES in the background, you can do it directly with:

nohup bin/elasticsearch > /opt/data/elastic/elastic.log 2>&1 &

2. Initial understanding of ES operation

ES operations are carried out through HTTP requests; different request methods and parameters correspond to different operations. For example, PUT creates an index, DELETE deletes one, GET queries when there is no body to send, and POST submits data. The first step is to create an index:

PUT: http://192.168.0.1:9200/shopsinfo
{
    "mappings": {
        "shopsOrder": {
            "properties": {
                "shopid": { "type": "string", "index": "not_analyzed" },
                "createdate": { "type": "string", "index": "not_analyzed" },
                "timestamp": { "type": "long" },
                "paymentType": { "type": "string", "index": "not_analyzed" },
                "amount": { "type": "long" }
            }
        }
    }
}

This creates an index named shopsinfo with a mapping called shopsOrder, containing the fields shopid, createdate, timestamp, paymentType and amount, along with their corresponding types.

It's easy to insert data: just POST, with a JSON document as the body:

POST: http://192.168.0.1:9200/shopsinfo/shopsOrder
{
    "shopid": "96119",
    "createdate": "20180410",
    "timestamp": 1523289600000,
    "paymentType": "alipay",
    "amount": 6917
}
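If you have a batch of documents to insert, ES also provides a _bulk endpoint. A minimal sketch (the second document here is invented for illustration), where each source line is preceded by an action line and every line, including the last, must end with a newline:

```json
POST: http://192.168.0.1:9200/shopsinfo/shopsOrder/_bulk
{ "index": {} }
{ "shopid": "96119", "createdate": "20180410", "timestamp": 1523289600000, "paymentType": "alipay", "amount": 6917 }
{ "index": {} }
{ "shopid": "96119", "createdate": "20180411", "timestamp": 1523376000000, "paymentType": "wechatpay", "amount": 1200 }
```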

Deleting data is done by query:

POST: http://192.168.0.1:9200/shopsinfo/shopsOrder/_delete_by_query
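_delete_by_query takes a query as its body and removes everything that matches. A minimal sketch (the shopid value is just an illustration) that deletes all orders of one shop:

```json
{
    "query": {
        "term": {
            "shopid": "96119"
        }
    }
}
```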

Querying uses GET/POST: http://192.168.0.1:9200/shopsinfo/shopsOrder/_search

There is Chinese documentation on the official website and plenty of material on Baidu. The most important part is ES's statistical queries; they are the key to ES.

3. Key queries

ES is built on an inverted index, so query efficiency is particularly high, but ES query statements are quite cumbersome. Both queries and statistics in ES are sent as JSON in the body of a POST. For example, to query order records with timestamp greater than 1523671189000 for the shops 100000002 and 100000006, the SQL would be:

select * from shopsOrder where timestamp>1523671189000 and shopid in ("100000002","100000006")

In ES, it looks like this:

POST: http://192.168.0.1:9200/shopsinfo/shopsOrder/_search
{
    "size": 20,
    "query": {
        "bool": {
            "must": [
                { "range": { "timestamp": { "gte": 1523671189000 } } },
                { "terms": { "shopid": ["100000002", "100000006"] } }
            ]
        }
    }
}

Here I pass the size parameter; if you don't, ES returns 10 documents by default. The result also comes back as JSON: the hits field contains total, the total number of matching results, and hits also contains the result content itself.
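Besides size, the request body can also carry from for paging. A sketch that skips the first 20 hits and returns the next 20:

```json
{
    "from": 20,
    "size": 20,
    "query": {
        "match_all": {}
    }
}
```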

The above is a simple query. For statistics, ES takes aggs (short for aggregations) as the parameter. For example, to sum the amount over the results of the previous query, similar to this SQL:

select sum(amount)query_amount from shopsOrder where timestamp>1523671189000 and shopid in ("100000002","100000006")

In ES you do this:

{
    "aggs":{
        "query_amount":{
            "sum":{
                "field":"amount"
            }
        }
    },
    "query":{
        "bool":{
            "must":[
                {
                    "range":{
                        "timestamp":{
                            "gte":1523671189000
                        }

                    }
                },
                {
                    "terms":{
                        "shopid":["100000002","100000006"]
                    }
                }
            ]
        }
    }
}

The sum appears under query_amount inside the aggregations field of the response, something like this:

. "aggregations": { "query_amount": { "value": 684854 } } ......Copy the code

Now, to make things a little more complicated, group the query by day.

select createdate,sum(amount)query_amount from shopsOrder where timestamp>1523671189000 and shopid in ("100000002","100000006")
group by createdate order by createdate

In ES it looks like this:

{
    "size":0,
    "aggs":{
        "orderDate":{
            "terms":{
                "field":"createdate",
                "order":{
                    "_term":"asc"
                }
            },
            "aggs":{
                "query_amount":{
                    "sum":{
                        "field":"amount"
                    }
                }
            }
        }
    },
    "query":{
        "bool":{
            "must":[
                {
                    "range":{
                        "timestamp":{
                            "gte":1523671189000
                        }

                    }
                },
                {
                    "terms":{
                        "shopid":["100000002","100000006"]
                    }
                }
            ]
        }
    }
}

The query result is:

. "aggregations": { "orderDate": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 99, "buckets": [ ...... { "key": "20180415", "doc_count": 8, "query_amount": { "value": 31632 } }, { "key": "20180417", "doc_count": 3, "query_amount": { "value": 21401 } }, { "key": "20180418", "doc_count": 2, "query_amount": { "value": 2333}}......] }}Copy the code

Here key is the createdate value, doc_count is like a count, and query_amount is the sum for that day. As for the size: 0 in my parameters: I don't need the specific records, i.e. hits, so I pass 0 here.
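Incidentally, a terms aggregation can be ordered by a sub-aggregation rather than by key. A sketch that sorts the days by total amount, largest first:

```json
"orderDate": {
    "terms": {
        "field": "createdate",
        "order": { "query_amount": "desc" }
    },
    "aggs": {
        "query_amount": {
            "sum": { "field": "amount" }
        }
    }
}
```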

Finally, let's do a more complicated one: 1. sum the total amount; 2. group the amount by paymentType first, then within each payment method group the amount by day:

{
    "size":0,
    "aggs":{
        "amount":{
            "sum":{
                "field":"amount"
            }
        },
        "paymenttype":{
            "terms":{
                "field":"paymentType"
            },
            "aggs":{
                "query_amount":{
                    "sum":{
                        "field":"amount"
                    }
                },
                "payment_date":{
                    "terms":{
                        "field":"createdate"
                    },
                    "aggs":{
                        "query_amount":{
                            "sum":{
                                "field":"amount"
                            }
                        }
                    }
                }
            }
        }
    },
    "query":{
        "bool":{
            "must":[
                {
                    "range":{
                        "timestamp":{
                            "gte":1523671189000
                        }

                    }
                },
                {
                    "terms":{
                        "shopid":["100000002","100000006"]
                    }
                }
            ]
        }
    }
}

Query results are as follows:

. "amount": { "value": 684854 }, "paymenttype":{ ...... "buckets": [ { "key": "wechatpay", "doc_count": 73, "amount": { "value": 351142 }, "payment_date": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 25, "buckets": [ ...... { "key": "20180415", "doc_count": 6, "amount": { "value": 29032 } }, { "key": "20180425", "doc_count": 6, "amount": { "value": 21592 } } ...... ] } }, { "key": "alipay", "doc_count": 67, "amount": { "value": 333712 }, "payment_date": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 23, "buckets": [ ...... { "key": "20180506", "doc_count": 8, "amount": { "value": 38280 } }, { "key": "20180426", "doc_count": 6, "amount": { "value": 41052 } } ...... ] } } ] }Copy the code

4. Operating ES from Java

Download the jar packages matching the ES version you installed. I installed 5.6.9, so my jar versions are also 5.6.9:

<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>5.6.9</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>transport</artifactId>
    <version>5.6.9</version>
</dependency>

Since my ES project is based on Spring Boot, I let Spring manage the helper class; writing it as a singleton would also work. First, create a client connection:

import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.transport.client.PreBuiltTransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import java.net.InetAddress;

Settings settings = Settings.builder().put("cluster.name", "tsk-es").put("client.transport.sniff", true)
                    .build();
TransportClient client = new PreBuiltTransportClient(settings)
                    .addTransportAddresses(new InetSocketTransportAddress(InetAddress.getByName(HOST), PORT));

Inserting data is relatively easy. You can pass a JSON string directly, or a Java bean:

import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.common.xcontent.XContentType;

IndexResponse response = client.prepareIndex(index, mapping).setSource(jsonStr, XContentType.JSON)
                .get();

Querying is more complicated. Here is the statistics query from the previous section again, summing the amount grouped by paymentType and then by day, this time in Java:

public void getAmountData(Long startTimestamp, String... shopIds) {
    // Initialize a SearchRequestBuilder for the shopsinfo index, shopsOrder type
    SearchRequestBuilder sbuilder = client.prepareSearch("shopsinfo").setTypes("shopsOrder");
    // Build the query conditions
    TermsQueryBuilder mpq = QueryBuilders.termsQuery("shopid", shopIds);
    RangeQueryBuilder mpq2 = QueryBuilders.rangeQuery("timestamp").gte(startTimestamp);
    QueryBuilder queryBuilder = QueryBuilders.boolQuery().must(mpq).must(mpq2);
    // Add the query to the SearchRequestBuilder; size 0 because only aggregations are needed
    sbuilder.setQuery(queryBuilder);
    sbuilder.setSize(0);
    // Sum aggregation over amount
    SumAggregationBuilder salaryAgg = AggregationBuilders.sum("query_amount").field("amount");
    // Group by payment type, with the sum as a sub-aggregation
    TermsAggregationBuilder paymentAgg = AggregationBuilders.terms("paymentType").field("paymentType");
    paymentAgg.size(100);
    paymentAgg.subAggregation(salaryAgg);
    // Within each payment type, group by day, again summing the amount
    TermsAggregationBuilder groupDateAff = AggregationBuilders.terms("payment_date").field("createdate")
            .order(Order.term(true));
    groupDateAff.size(100);
    groupDateAff.subAggregation(salaryAgg);
    paymentAgg.subAggregation(groupDateAff);
    // Attach the aggregations to the SearchRequestBuilder
    sbuilder.addAggregation(salaryAgg).addAggregation(paymentAgg);
    // Execute and read the results
    SearchResponse response = sbuilder.execute().actionGet();
    Map<String, Aggregation> aggMap = response.getAggregations().asMap();
    InternalSum shopGroupAllAmount = (InternalSum) aggMap.get("query_amount");
    Double amount = shopGroupAllAmount.getValue();
    // ......
}

Everything can be obtained from the SearchResponse; how you parse the data from there depends on your business needs and personal habits. ES's operations are both simple and complex. As my friend put it, once you're familiar with it, it's naturally simple, and that is indeed true: whatever you use, understand it thoroughly, or it's a pit! For example, when I ran the statistics query, the returned result was always much less than it should have been, with the missing amount always showing up in sum_other_doc_count. After studying it for a long time, I found that aggregations also need a size parameter, otherwise the default is 10!
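Concretely, the fix for that last pit is to give the terms aggregation its own size, for example:

```json
"orderDate": {
    "terms": {
        "field": "createdate",
        "size": 100,
        "order": { "_term": "asc" }
    }
}
```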

Finally, thanks to my friend, who scoffed at my attempts to build various big data platforms (because we really don't have that much data), but still gave me the best solution he could think of.
