
One, Basic principles of Elasticsearch document score calculation

1) Boolean model

Filter the docs down to those that contain the terms specified in the user's query condition

  • Query "hello world" --> hello / world / hello & world
  • bool --> must / must_not / should --> filter --> contains / does not contain / may contain
  • Purpose: improve performance by reducing the number of docs that need to be scored later
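
A minimal sketch of such a bool query (the index and the content field are illustrative, not from the original example):

GET /_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "content": "hello" } }
      ],
      "should": [
        { "match": { "content": "world" } }
      ],
      "must_not": [
        { "match": { "content": "spam" } }
      ]
    }
  }
}

Docs that fail the must/must_not clauses are filtered out before any scoring happens, which is exactly how the boolean model cuts down the number of docs to score.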

2) Relevance Score algorithm

In essence, it computes the relevance between the text indexed in a document and the search text.

Elasticsearch uses the Term Frequency / Inverse Document Frequency algorithm, TF/IDF for short.

Term frequency: how many times each term in the search text appears in a document's field text; the more often a term appears, the more relevant that document is. Search request: hello world

  • Doc1: Hello you, and world is very good
  • Doc2: Hello, how are you

Inverse document frequency: how many times each term in the search text appears across all documents in the entire index; the more often a term appears index-wide, the less relevant (and the less weight) it carries

Search request: Hello World

  • Doc1: Hello, tuling is very good
  • Doc2: Hi world, how are you

For example, suppose the index has 10,000 documents, the word hello appears 1,000 times across all of the documents, and the word world appears only 100 times: world is then the rarer term and carries more weight than hello.

Field-length norm: the length of the field; the longer the field, the weaker the relevance

Search request: Hello World

  • Doc1: { "title": "hello article", "content": "... N words" }
  • Doc2: { "title": "my article", "content": "... N words, hi world" }

Assuming hello and world each appear the same number of times in the index, Doc1 is more relevant because its match lands in the shorter title field. You can inspect how a document's score was computed with the _explain API:

GET /es_db/_doc/1/_explain
{
  "query": {
    "match": {
      "remark": "java developer"
    }
  }
}
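
You can also ask for an explanation of every hit in a normal search by setting explain to true (a sketch against the same es_db index):

GET /es_db/_search
{
  "explain": true,
  "query": {
    "match": {
      "remark": "java developer"
    }
  }
}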

Two, word segmentation workflow

1. Word splitting and normalization (improving recall)

Given a sentence, the analyzer breaks it down into individual words one by one while applying some normalization to each word; as a result, a search can match more of the documents that should be found.

  • Character filter: pre-processes the text before it is tokenized, e.g. stripping HTML tags (<span>hello</span> --> hello) and mapping characters (& --> and, so I&you --> I and you)
  • Tokenizer: splits the text into terms, e.g. hello you and me --> hello, you, and, me
  • Token filter: lowercasing, stop-word removal, synonyms, e.g. Tom --> tom, a/the/an --> dropped, small --> little

An analyzer is very important: it takes a piece of text, runs it through all of this processing, and only the final result is used to build the inverted index.
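
The three stages can be tried out directly with the _analyze API by assembling an ad-hoc analyzer from built-in components (a sketch, not from the original notes):

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>Hello, YOU and me</b>"
}

The html_strip char filter removes the <b> tags, the standard tokenizer splits the text into terms, and the lowercase token filter normalizes them, yielding: hello, you, and, me.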

2. Introduction to the built-in analyzers

Example sentence: Set the shape to semi-transparent by calling set_trans(5)

Standard Analyzer: set, the, shape, to, semi, transparent, by, calling, set_trans, 5 (the default)

Simple Analyzer: set, the, shape, to, semi, transparent, by, calling, set, trans

Whitespace Analyzer: Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

Stop Analyzer: like the Simple Analyzer, but also removes stop words such as a, the, it, and so on

Testing:

POST _analyze
{
  "analyzer": "standard",
  "text": "Set the shape to semi-transparent by calling set_trans(5)"
}

3. Customizing analyzers

1) The default analyzer

The standard analyzer consists of:

  • standard token filter: does nothing
  • lowercase token filter: converts all letters to lowercase
  • stop token filter (disabled by default): removes stop words such as a, the, it, etc.

2) Modify the analyzer settings to enable the token filter for English stop words

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
} 

GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
} 

GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
} 

3) Define a fully custom analyzer

PUT /my_index { "settings": { "analysis": { "char_filter": { "&_to_and": { "type": "mapping", "mappings": [ "&=> and" ] } }, "filter": { "my_stopwords": { "type": "stop", "stopwords": [ "the", "a" ] } }, "analyzer": { "my_analyzer": { "type": "custom", "char_filter": [ "html_strip", "&_to_and" ], "tokenizer": "standard", "filter": [ "lowercase", "my_stopwords" ] } } } } } GET /my_index/_analyze { "text": "tom&jerry are a friend in the house, <a>, HAHA!!" , "analyzer": "my_analyzer" } PUT /my_index/_mapping/my_type { "properties": { "content": { "type": "text", "analyzer": "my_analyzer" } } }Copy the code

4) The ik Chinese analyzer explained

The ik configuration files live in es/plugins/ik/config:

  • IKAnalyzer.cfg.xml: used to configure custom dictionaries
  • main.dic: the native built-in Chinese dictionary of ik, more than 270,000 words; text is split into terms according to the words in this file
  • quantifier.dic: words related to units of measure
  • surname.dic: Chinese surnames
  • stopword.dic: stop words such as a, the, and, at, but; they are dropped during analysis and never enter the inverted index

The two most important native ik files are main.dic (the Chinese words used for tokenizing) and stopword.dic (the stop words removed at analysis time).

5) ik custom dictionaries

(1) Build your own word dictionary: every year some special popular words appear (internet slang and meme words such as 网红, 蓝瘦香菇, 鬼畜) that are generally not in the native ik dictionary. Add the latest words yourself to the ik dictionary by pointing the ext_dict entry in IKAnalyzer.cfg.xml at your file, e.g. custom/mydict.dic. (2) Build your own stop-word dictionary: words you do not want indexed or searchable go into an extended stop-word file, e.g. custom/ext_stopword.dic.
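
A sketch of the static configuration (the file name custom/mydict.dic is an example): set the entry in IKAnalyzer.cfg.xml,

<entry key="ext_dict">custom/mydict.dic</entry>

put one word per line into custom/mydict.dic, restart es, and then verify the result with the _analyze API:

POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "蓝瘦香菇"
}

Before the word is added, ik breaks it into smaller pieces; afterwards it comes back as a single term.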

6) ik hot update

Extending the dictionaries this way means adding new words by hand every time, which has two pitfalls: (1) after every addition you must restart es for it to take effect, which is a real nuisance; (2) es is distributed and may have hundreds of nodes, and you cannot go modify every node one by one. With a hot update, new words are added in some external location while es stays up, and es immediately hot-loads them. This is configured in IKAnalyzer.cfg.xml:

<properties>
  <comment>IK Analyzer extension configuration</comment>
  <!-- users can configure their own extended dictionary here -->
  <entry key="ext_dict">location</entry>
  <!-- users can configure their own extended stop-word dictionary -->
  <entry key="ext_stopwords">location</entry>
  <!-- users can configure a remote extended dictionary -->
  <entry key="remote_ext_dict">words_location</entry>
  <!-- users can configure a remote extended stop-word dictionary -->
  <entry key="remote_ext_stopwords">words_location</entry>
</properties>

Three, Highlight

In search results, the search keywords often need to be highlighted. Highlighting has its own set of common parameters, demonstrated here with typical cases. For example: search the cars index for documents whose remark field contains "Volkswagen", highlight that keyword with HTML tags, style the font red, and if the remark text is too long display only the first 20 characters.

PUT /news_website { "mappings": { "properties": { "title": { "type": "text", "analyzer": "ik_max_word" }, "content": { "type": "text", "analyzer": "ik_max_word" } } } } PUT /news_website { "settings" : { "index" : {" analysis. Analyzer. Default. Type ":" ik_max_word "}}} PUT/news_website _doc / 1 {" title ":" this is the first article I wrote ", "content" : "Hi everyone, this is the first article I wrote, I love this article portal!!" }Copy the code

Query the title for "文章" (article):

GET /news_website/_doc/_search
{
  "query": {
    "match": {
      "title": "文章"
    }
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}

The query result:

{
  "took" : 878,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "news_website",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "title" : "这是我写的第一篇文章",
          "content" : "大家好,这是我写的第一篇文章,特别喜欢这个文章门户网站!!!"
        },
        "highlight" : {
          "title" : [
            "这是我写的第一篇<em>文章</em>"
          ]
        }
      }
    ]
  }
}

So if the specified field contains the search term, that term is wrapped in <em> tags in the returned field text, which the page can then style, for example in red.

GET /news_website/_doc/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "文章" } },
        { "match": { "content": "文章" } }
      ]
    }
  },
  "highlight": {
    "fields": {
      "title": {},
      "content": {}
    }
  }
}

The fields listed in highlight must correspond to the fields used in the query.

Plain highlight is the lucene highlighter and the default. Posting highlight requires index_options=offsets in the mapping.

Advantages of posting highlight: (1) better performance than plain highlight, because the highlighted text does not need to be re-analyzed; (2) less disk consumption.

DELETE /news_website

PUT /news_website
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "index_options": "offsets"
      }
    }
  }
}

PUT /news_website/_doc/1
{
  "title": "我写的第一篇文章",
  "content": "大家好,这是我写的第一篇文章,特别喜欢这个文章门户网站!!!"
}

GET /news_website/_doc/_search
{
  "query": {
    "match": {
      "content": "文章"
    }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

Fast vector highlight: used when term vectors are enabled in the mapping at index time ("term_vector": "with_positions_offsets"). (1) It performs better on large fields (greater than 1 MB).


DELETE /news_website

PUT /news_website
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}

You can also force a specific highlighter, for example forcing plain highlight on a field that has term vectors enabled:

GET /news_website/_doc/_search {"query": {"match": {"content": "article"}}, "highlight": {"fields": {"content": { "type": "plain" } } } }Copy the code

In general, plain highlight is good enough and you do not need any extra settings. If you need higher-performance highlighting, try enabling posting highlight; and if the field value is very large (more than 1 MB), use fast vector highlight.

3. Set the HTML tag used for highlighting; the default is the <em> tag

GET /news_website/_doc/_search
{
  "query": {
    "match": {
      "content": "文章"
    }
  },
  "highlight": {
    "pre_tags": ["<span color='red'>"],
    "post_tags": ["</span>"],
    "fields": {
      "content": {
        "type": "plain"
      }
    }
  }
}

4. Highlight fragment settings

GET /news_website/_doc/_search {"query" : {"match": {"content": "article"}}, "highlight" : {"fields" : {"content": {"fragment_size" : 150, "number_of_fragments" : 3 } } } }Copy the code

fragment_size: if a field value is very long, say 10,000 characters, you cannot display all of it on a page; this sets the length of each highlighted fragment to display (default 100). number_of_fragments: the highlighted text may yield multiple fragments; this specifies how many of them to display.

Four, Aggregation search technology in depth

1. Overview of the bucket and metric concepts

A bucket is a group of data in an aggregated search. For example, if the sales department has Zhang San and Li Si, and the development department has Wang Wu and Zhao Liu, then grouping by department yields two buckets: the sales bucket holding Zhang San and Li Si, and the development bucket holding Wang Wu and Zhao Liu.

A metric is a statistical computation performed on a bucket. In the example above, counting 2 employees in the development department and 2 in the sales department is a metric. There are many kinds of metrics: sum, max, min, avg, and so on.

In easy-to-understand SQL terms: in select count(*) from table group by column, each group produced by group by column is a bucket, and the count(*) executed on each group is the metric. A minimal DSL equivalent is sketched below.
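
The SQL analogy maps directly onto the ES DSL (a sketch against the cars index created below):

# SQL: select color, count(*) from cars group by color
GET /cars/_search
{
  "size": 0,
  "aggs": {
    "group_by_color": {
      "terms": { "field": "color" }
    }
  }
}

Each distinct color value becomes a bucket; the doc_count returned for each bucket is the metric.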

2. Prepare case data

DELETE /cars

PUT /cars
{
  "mappings": {
    "properties": {
      "price": {
        "type": "long"
      },
      "color": {
        "type": "keyword"
      },
      "brand": {
        "type": "keyword"
      },
      "model": {
        "type": "keyword"
      },
      "sold_date": {
        "type": "date",
        "format": "yyyy-MM-dd"
      },
      "remark": {
        "type": "text",
        "analyzer": "ik_max_word"
      }
    }
  }
}

Bulk-write the data, paying attention to the date format:

POST /cars/_bulk
{"index": {}}
{"price": 258000, "color": "golden", "brand": "Volkswagen", "model": "Volkswagen Magotan", "sold_date": "2015-01-11", "remark": "popular mid-range car"}
{"index": {}}
{"price": 123000, "color": "golden", "brand": "Volkswagen", "model": "Volkswagen Sagitar", "sold_date": "2015-02-11", "remark": "Volkswagen's divine car"}
{"index": {}}
{"price": 239800, "color": "white", "brand": "Peugeot", "model": "Peugeot 508", "sold_date": "2015-03-11", "remark": "Peugeot's flagship model launched worldwide"}
{"index": {}}
{"price": 148800, "color": "white", "brand": "Peugeot", "model": "Peugeot 408", "sold_date": "2015-04-11", "remark": "a larger compact car"}
{"index": {}}
{"price": 1998000, "color": "black", "brand": "Volkswagen", "model": "Volkswagen Phaeton", "sold_date": "2015-05-11", "remark": "the most heart-breaking Volkswagen"}
{"index": {}}
{"price": 218000, "color": "red", "brand": "Audi", "model": "Audi A4", "sold_date": "2015-06-11", "remark": "a small luxury model"}
{"index": {}}
{"price": 489000, "color": "black", "brand": "Audi", "model": "Audi A6", "sold_date": "2015-07-11", "remark": "For government use only?"}
{"index": {}}
{"price": 1899000, "color": "black", "brand": "Audi", "model": "Audi A8", "sold_date": "2021-10-11", "remark": "an expensive, bigger A6..."}

# verify
GET /cars/_doc/_search
{
  "query": {
    "match_all": {}
  }
}

1. Count sales grouped by color

This performs only aggregation grouping, without any further statistics. The most basic aggregation in ES is terms, which is roughly the equivalent of count in SQL. ES sorts the resulting groups by doc_count in descending order by default. You can sort differently using the _key metadata (sort by the value of the grouped field) or the _count metadata (sort by each group's document count).

Aggs: {"terms": {"field": "color", "order": {"field": "color", "color" : {"field": "color", "order": { "_count": "desc" } } } } }Copy the code

2. Average price of vehicles of each color

In this example we first group by color, and on top of that grouping we compute statistics over the data inside each group; such in-group statistics are the metrics. Sorting can also use them: because the statistic is named avg_by_price, the sort logic can reference that aggregation name. Scenario: drill-down analysis.

GET /cars/_search
{
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color",
        "order": {
          "avg_by_price": "asc"
        }
      },
      "aggs": {
        "avg_by_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

size can be set to 0, meaning ES returns no documents, only the aggregation results, which speeds the query up. If you do need the documents, set size according to the actual situation.

GET /cars/_search
{
  "size": 0,
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "group_by_brand": {
          "terms": {
            "field": "brand",
            "order": {
              "avg_by_price": "desc"
            }
          },
          "aggs": {
            "avg_by_price": {
              "avg": {
                "field": "price"
              }
            }
          }
        }
      }
    }
  }
}

3. Average price of vehicles per color and per brand

First aggregate by color, then aggregate again by brand within each group; this operation can be called drill-down analysis. The aggs syntax has a fairly fixed structure. Simply put: aggs can be nested (nesting is the drill-down analysis) or defined side by side (laying out multiple parallel groupings).

# syntax
GET /index_name/type_name/_search
{
  "aggs": {
    "<name of the outermost group>": {
      "<grouping strategy, e.g. terms, avg, sum>": {
        "field": "<field to group by>",
        "<other parameters>": "..."
      },
      "aggs": {
        "<name of subgroup 1>": {},
        "<name of subgroup 2>": {}
      }
    }
  }
}

# practical case
GET /cars/_search
{
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color",
        "order": {
          "avg_by_price_color": "asc"
        }
      },
      "aggs": {
        "avg_by_price_color": {
          "avg": {
            "field": "price"
          }
        },
        "group_by_brand": {
          "terms": {
            "field": "brand",
            "order": {
              "avg_by_price_brand": "desc"
            }
          },
          "aggs": {
            "avg_by_price_brand": {
              "avg": {
                "field": "price"
              }
            }
          }
        }
      }
    }
  }
}

4. Max price, min price, and total price per color

GET /cars/_search
{
  "aggs": {
    "group_by_color": {
      "terms": {
        "field": "color"
      },
      "aggs": {
        "max_price": {
          "max": {
            "field": "price"
          }
        },
        "min_price": {
          "min": {
            "field": "price"
          }
        },
        "sum_price": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}

In everyday business, the most common aggregation types are count, max, min, avg, and sum. They usually account for more than 60% of aggregation work, and even more than 85% in small projects.
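
Since these five statistics are so common, ES also offers the stats aggregation, which returns count, min, max, avg, and sum in a single request (not covered in the notes above; a minimal sketch):

GET /cars/_search
{
  "size": 0,
  "aggs": {
    "stats_price": {
      "stats": {
        "field": "price"
      }
    }
  }
}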

5. The highest-priced model of each brand

After grouping, you may want to sort the data within each group and pick the top entries. In the top_hits aggregation, the size attribute is how many documents to take from each group (default 10); sort is the in-group sort order (default: ascending by _doc); _source selects which fields of the document to include in the result (all fields by default).

GET /cars/_search
{
  "size": 0,
  "aggs": {
    "group_by_brand": {
      "terms": {
        "field": "brand"
      },
      "aggs": {
        "top_car": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "price": {
                  "order": "desc"
                }
              }
            ],
            "_source": {
              "includes": ["model", "price"]
            }
          }
        }
      }
    }
  }
}

6. Histogram interval statistics

histogram is similar to terms: it also does bucket grouping, but it partitions the data into intervals over a numeric field. For example, using 1,000,000 as the interval width, count the sales volume and average price of vehicles in each price range. With a histogram aggregation, field specifies the price field and interval: 1000000 makes ES divide the price range into [0, 1000000), [1000000, 2000000), [2000000, 3000000), and so on. While partitioning, histogram also counts the documents in each interval (like terms does), and nested aggs can run further aggregation analysis on each of the resulting groups.

GET /cars/_search
{
  "aggs": {
    "histogram_by_price": {
      "histogram": {
        "field": "price",
        "interval": 1000000
      },
      "aggs": {
        "avg_by_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
} 

7. date_histogram interval grouping

date_histogram performs interval grouping on date fields, for example monthly sales or annual sales. Example: count, per month, how many cars were sold and the total sales amount. field specifies the field used for grouping; interval is the interval width (possible values: year, quarter, month, week, day, hour, minute, second); format is the date format; min_doc_count is the minimum number of documents an interval needs in order to be returned (if not specified, the default is 0, so bucket groups with no documents in their range are also shown); extended_bounds fixes the start and end of the histogram (if not specified, the minimum and maximum dates found in the field are used by default).

# syntax before ES 7.x
GET /cars/_search
{
  "aggs": {
    "histogram_by_date": {
      "date_histogram": {
        "field": "sold_date",
        "interval": "month",
        "format": "yyyy-MM-dd",
        "min_doc_count": 1,
        "extended_bounds": {
          "min": "2021-01-01",
          "max": "2022-12-31"
        }
      },
      "aggs": {
        "sum_by_price": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}

After execution, a deprecation warning is returned: #! Deprecation: [interval] on [date_histogram] is deprecated, use [fixed_interval] or [calendar_interval] in the future.

Syntax since ES 7.x:

GET /cars/_search { "aggs": { "histogram_by_date": { "date_histogram": { "field": "sold_date", "calendar_interval": "Month" and "format", "yyyy MM ‐ ‐ dd", "min_doc_count" : 1, "extended_bounds" : {" min ":" 2021 ‐ ‐ 01 01 ", "Max" : "2022 12 ‐ ‐ 31}}", "aggs" : {" sum_by_price ": {" sum" : {" field ":" the price "}}}}}}Copy the code

8. The global bucket

When aggregating, you sometimes need to compare a subset of the data with the overall aggregate, for example the average price of one brand's vehicles versus the average price of all vehicles. global defines a global bucket that ignores the query conditions and aggregates over all documents.

GET /cars/_search
{
  "size": 0,
  "query": {
    "match": {
      "brand": "Volkswagen"
    }
  },
  "aggs": {
    "volkswagen_of_avg_price": {
      "avg": {
        "field": "price"
      }
    },
    "all_avg_price": {
      "global": {},
      "aggs": {
        "all_of_price": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}

9. aggs + order

Sorting by an aggregated statistic. For example: count each brand's sales volume and total sales amount, sorted by total sales amount in descending order.

GET /cars/_search
{
  "aggs": {
    "group_of_brand": {
      "terms": {
        "field": "brand",
        "order": {
          "sum_of_price": "desc"
        }
      },
      "aggs": {
        "sum_of_price": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}

With multiple nested aggs, you can also sort by the innermost aggregation while drilling down. For example: compute the total sales amount of each color within each brand, sorted by total sales amount in descending order. This works like grouped sorting in SQL: you can only sort the data within a group, not across groups.

GET /cars/_search
{
  "aggs": {
    "group_by_brand": {
      "terms": {
        "field": "brand"
      },
      "aggs": {
        "group_by_color": {
          "terms": {
            "field": "color",
            "order": {
              "sum_of_price": "desc"
            }
          },
          "aggs": {
            "sum_of_price": {
              "sum": {
                "field": "price"
              }
            }
          }
        }
      }
    }
  }
}

10. search + aggs

Aggregation is similar to the GROUP BY clause in SQL, and search is similar to the WHERE clause. In ES it is perfectly possible to combine search and aggregation for relatively complex search statistics. For example: count a brand's sales quantity and amount per quarter.

GET /cars/_search
{
  "query": {
    "match": {
      "brand": "Volkswagen"
    }
  },
  "aggs": {
    "histogram_by_date": {
      "date_histogram": {
        "field": "sold_date",
        "calendar_interval": "quarter",
        "min_doc_count": 1
      },
      "aggs": {
        "sum_by_price": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}

11. filter + aggs

In ES, filter can also be combined with aggs for relatively complex filtered aggregation analysis. For example: compute the average price of vehicles priced between 100,000 and 500,000.

GET /cars/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "range": {
          "price": {
            "gte": 100000,
            "lte": 500000
          }
        }
      }
    }
  },
  "aggs": {
    "avg_by_price": {
      "avg": {
        "field": "price"
      }
    }
  }
}

12. Use filter in aggregation

filter can also be used inside the aggs syntax, and where the filter sits determines what it filters. For example: count a brand's car sales in the last year. A filter placed inside aggs filters only the results of the query search; a filter placed outside aggs (in the query) filters all data. In date math, 12M/M means 12 months, 1y/y means one year, and d means day.

GET /cars/_search
{
  "query": {
    "match": {
      "brand": "Volkswagen"
    }
  },
  "aggs": {
    "count_last_year": {
      "filter": {
        "range": {
          "sold_date": {
            "gte": "now-12M"
          }
        }
      },
      "aggs": {
        "sum_of_price_last_year": {
          "sum": {
            "field": "price"
          }
        }
      }
    }
  }
}