Elasticsearch tutorial live replay
1. The problem
This comes from a real question asked by a community member.
The requirements are roughly as follows:
The index holds about 36 million documents. The key fields are:
_id | creator |
---|---|
doc_1 | [Zhang San, Li Si, Wang Wu, Zhao Liu] |
doc_2 | [Chen Sheng, Wu Guang, Zhang San] |
A cardinality aggregation reports about 13 million distinct creators.
Problem: at a cardinality this high, aggregation performance is not ideal.
2. Concept: what is high cardinality?
For a precise definition, here is the relevant passage from the official Elastic blog:
The performance of terms aggregations can be greatly impacted by the cardinality of the field that is being aggregated.
Cardinality refers to the uniqueness of values stored in a particular field.
High cardinality means that a field contains a large percentage of unique values.
Low cardinality means that a field contains a lot of repeated values.
For example, a field storing country names will be relatively low cardinality since there are less than two hundred countries in the world. Alternatively, a field storing IBAN numbers or email addresses is high cardinality since there may be millions of unique values stored.
In other words, the performance of a terms aggregation depends heavily on the cardinality of the field being aggregated, where cardinality describes how unique the values stored in that field are.
- High cardinality: the field contains a high percentage of unique values.
  Example: email addresses, which may have tens of millions of unique values.
- Low cardinality: the field contains many repeated values.
  Example: country names, since there are fewer than 200 countries in the world.
3. The nature of the problem
After repeated discussion, the essential problem turned out to be this: in high-cardinality business scenarios, aggregation is slow and does not meet expectations.
I remember that early in my career I once asked my mentor’s mentor a question in person. I talked for a long time until he lost patience and said, “So what is your problem?” I remember it to this day.
Since then, before asking anyone a question, I draft an outline in advance and state the key points quickly and directly, which has greatly improved communication.
It is no exaggeration to say that if you can describe a problem clearly in a few words, you have already solved more than half of it.
4. How to improve it?
After repeated discussion, and drawing on earlier practice from community members, the approach is:
- First: store a hash of each field value (computed at write time).
- Second: run the aggregation and statistics on the hash instead of the raw value.
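The two steps can be sketched outside Elasticsearch as follows. This is a minimal illustration under stated assumptions, not the plugin's implementation: it uses `hashlib.blake2b` with an 8-byte digest from Python's standard library as a stand-in hash, and the names `hash_value`, `index_doc`, `count_by_hash`, and the `creator_hash` field are hypothetical.

```python
import hashlib
from collections import Counter


def hash_value(value: str) -> int:
    """Map a string to a 64-bit integer (stand-in hash, NOT MurmurHash3)."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def index_doc(doc: dict) -> dict:
    """Step 1: at write time, store the hashes next to the original values."""
    return {**doc, "creator_hash": [hash_value(c) for c in doc["creator"]]}


def count_by_hash(docs: list) -> Counter:
    """Step 2: aggregate on the fixed-width hashes instead of the strings."""
    counts = Counter()
    for doc in docs:
        # Count each creator at most once per document, like a doc count.
        counts.update(set(doc["creator_hash"]))
    return counts


docs = [
    index_doc({"_id": "doc_1", "creator": ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu"]}),
    index_doc({"_id": "doc_2", "creator": ["Chen Sheng", "Wu Guang", "Zhang San"]}),
]
counts = count_by_hash(docs)
print("docs containing Zhang San:", counts[hash_value("Zhang San")])
print("distinct creator hashes:", len(counts))
```

The point of the design is that every bucket key has the same fixed width, whatever the length of the original string.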
Does Elasticsearch have a hash field type?
Not as a built-in type, but the mapper-murmur3 plugin adds a murmur3 field type that computes the hash of a field's values at index time and stores it in the index.
Plugin documentation:
www.elastic.co/guide/en/el…
A brief introduction to Murmur3:
MurmurHash is a non-cryptographic hash function suited to general hash-based lookup.
It was created by Austin Appleby in 2008 and has since appeared in several variants, all of which have been released into the public domain.
Compared with other popular hash functions, MurmurHash gives a more random distribution for keys that follow regular patterns.
Redis has used two different hash algorithms to implement its dictionaries; MurmurHash is one of them (DJB is the other). It is used widely in Redis, including in databases, clusters, hash keys, and blocking operations.
The author of the algorithm was later invited to work at Google. The latest version, MurmurHash3, fixes a few minor flaws in MurmurHash2 and is faster. It comes in a 32-bit variant (low latency) and a 128-bit variant (suited to large blocks of data), with a well-balanced distribution and a low collision rate.
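To make the description concrete, here is a minimal pure-Python sketch of the 32-bit MurmurHash3 variant (x86_32), written from the public-domain reference algorithm. It is for illustration only: the function names `murmur3_32` and `_rotl32` are ours, and values produced here should not be expected to match the hashes the mapper-murmur3 plugin stores inside Elasticsearch.

```python
def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit integer left by r bits."""
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 (x86_32 variant) of a byte string."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & 0xFFFFFFFF
    n = len(data)
    body_len = n & ~3  # length of the part made of whole 4-byte blocks

    # Body: mix one little-endian 4-byte block at a time.
    for i in range(0, body_len, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[body_len:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & 0xFFFFFFFF
        k = _rotl32(k, 15)
        k = (k * c2) & 0xFFFFFFFF
        h ^= k

    # Finalization: fold in the length and force the bits to avalanche.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h


for name in ["Zhang San", "Li Si", "Wang Wu"]:
    print(name, hex(murmur3_32(name.encode("utf-8"))))
```

The hash is deterministic for a given seed, so the same creator name always maps to the same 32-bit value.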
6. Using the mapper-murmur3 plugin
Step 1: Plug-in installation
bin/elasticsearch-plugin install mapper-murmur3
Step 2: Create a demo index and test it
PUT my_index
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "keyword",
        "fields": {
          "hash": {
            "type": "murmur3"
          }
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "my_field": "This is a document"
}

PUT my_index/_doc/2
{
  "my_field": "This is a document"
}

GET my_index/_search
{
  "aggs": {
    "my_field_cardinality": {
      "terms": {
        "field": "my_field.hash"
      },
      "aggs": {
        "top_sales_hits": {
          "top_hits": {
            "size": 2
          }
        }
      }
    }
  }
}
The aggregation results are as follows:
This makes the role of murmur3 clear:
- It is a dedicated field type declared in the mapping.
- It can be attached to a keyword field as a sub-field (multi-field).
- The hash value is not stored in _source.
- The hash values are only visible in aggregation results.
7. How well does the mapper-murmur3 hash work?
The feedback is as follows:
How does aggregation perform when cardinality is relatively low?
Let's test it.
8.1 Generating 10+ million simulated documents
A text file holds 39,415 randomly generated names. A Python script (using the Python DSL client) picks names from it at random and writes the documents to the ES cluster.
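The generation script itself is not shown, so the following is only a sketch of one way to produce such data. It builds the newline-delimited body expected by the `_bulk` API using the standard library alone. The function name `bulk_lines`, the six sample names, and the choice of one to four creators per document are assumptions made for illustration. In the real test the names would be read from the text file.

```python
import json
import random


def bulk_lines(names, n_docs, index="my-index-000002", seed=42):
    """Yield _bulk NDJSON lines: one action line, then one document line."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    for _ in range(n_docs):
        # Assumption: each document gets 1 to 4 distinct random creators.
        creators = rng.sample(names, k=rng.randint(1, 4))
        yield json.dumps({"index": {"_index": index}})
        yield json.dumps({"creator": creators}, ensure_ascii=False)


# Hypothetical sample; the article's file holds 39,415 generated names.
names = ["Zhang San", "Li Si", "Wang Wu", "Zhao Liu", "Chen Sheng", "Wu Guang"]

# A _bulk request body must end with a newline.
body = "\n".join(bulk_lines(names, n_docs=3)) + "\n"
print(body)
```

The resulting string can be sent as the body of a `POST _bulk` request, in batches, until the target document count is reached.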
The write result is as follows:
Index 1 (plain keyword field):
PUT my-index-000002
{
  "mappings": {
    "properties": {
      "creator": { "type": "keyword" }
    }
  }
}

Index 2 (keyword field with a murmur3 sub-field):
PUT my-index-000003
{
  "mappings": {
    "properties": {
      "creator": {
        "type": "keyword",
        "fields": {
          "hash": {
            "type": "murmur3"
          }
        }
      }
    }
  }
}
8.2 How low is the cardinality?
POST my-index-000002/_search
{
  "size": 0,
  "aggs": {
    "count_aggs": {
      "cardinality": {
        "field": "creator"
      }
    }
  }
}
- Distinct values after deduplication: 39,415.
- Total number of documents: 11,509,010.
- Ratio: 0.342%
Although there is no strict threshold separating high from low cardinality, this is clearly low.
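The ratio is simply the number of distinct values divided by the number of documents. A quick check of the arithmetic, using the two figures reported above (the variable names are ours):

```python
distinct_creators = 39_415   # result of the cardinality aggregation
total_docs = 11_509_010      # number of documents in the index

ratio_percent = distinct_creators / total_docs * 100
print(f"{ratio_percent:.3f}%")  # prints 0.342%
```

For comparison, the original problem has about 13 million distinct creators across about 36 million documents, a far higher share of distinct values.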
8.3 Comparison of aggregation results
POST my-index-000002/_search
{
  "aggs": {
    "terms_agg": {
      "terms": {
        "field": "creator",
        "size": 3,
        "shard_size": 1000
      }
    }
  }
}
In this test, the terms aggregation on the un-hashed index ran about twice as fast as the one on the hashed index.
This is a preliminary indication that hash-based aggregation does not help at low cardinality.
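Only the query against my-index-000002 is shown; the one run against the hashed index is not. The sketch below builds both request bodies so the comparison can be repeated. It assumes that the hashed index is queried on the `creator.hash` sub-field with otherwise identical parameters. The helper name `terms_query` is hypothetical.

```python
import json


def terms_query(field: str, size: int = 3, shard_size: int = 1000) -> dict:
    """Build the terms-aggregation body used in the comparison."""
    return {
        "aggs": {
            "terms_agg": {
                "terms": {"field": field, "size": size, "shard_size": shard_size}
            }
        }
    }


requests = {
    "my-index-000002": terms_query("creator"),
    # Assumption: the hashed index is aggregated on its murmur3 sub-field.
    "my-index-000003": terms_query("creator.hash"),
}

for index, body in requests.items():
    print(f"POST {index}/_search")
    print(json.dumps(body, indent=2))
```

Each search response includes a `took` field in milliseconds, which is the number to compare between the two runs. Repeat each query several times so that caching does not skew a single measurement.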
9. Summary
The verification and tests above are for reference only. Any real design choice should be validated thoroughly against your own business scenario.
If you have dealt with similar high-cardinality aggregation scenarios, which optimizations worked for you in practice? Feel free to leave a comment.
Reference:
www.elastic.co/cn/blog/imp…
Cloud.tencent.com/developer/a…