The previous article in this series introduced the basic concepts of Elasticsearch and some of its common APIs. This article continues with Elasticsearch's advanced search features:
Search relevance and the scoring mechanism
What is the relevance score?
- The relevance score describes how well a document matches a query. ES scores every document a query returns; the purpose of scoring is ranking
- Before ES 5.0, the default relevance scoring used the TF-IDF algorithm. TF-IDF is regarded as one of the most important inventions in information retrieval, and modern search engines have made many subtle optimizations on top of it
- From ES 5.0 on, the default is the BM25 algorithm (an improvement on TF-IDF). As TF grows without bound, the BM25 score levels off at a stable value instead of growing indefinitely
- Add "explain": true to a query to see how each result was scored
- Several factors affect the relevance score:
1. Term Frequency (TF): how often the search term appears in a document. The higher the frequency, the greater the weight.
2. Document Frequency (DF): the proportion of documents that contain the search term out of the total number of documents. The larger the DF, the more documents the term appears in, the less it helps to tell documents apart, and the lower its relevance, for example "and" and "is".
3. Inverse Document Frequency (IDF): because DF values cover a very wide range, IDF is introduced to reduce the influence of DF on scoring. In effect, it takes the logarithm of the inverse of DF to dampen the effect.
4. Field length: the shorter the field being searched, the higher the relevance.
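The saturation behaviour of BM25 described above can be checked numerically. The following is a minimal sketch of the textbook BM25 formula, not the exact code Lucene runs; the function names are made up for illustration, and the defaults k1 = 1.2 and b = 0.75 are the commonly cited ones.

```python
import math


def bm25_tf_component(tf: float, k1: float = 1.2, b: float = 0.75,
                      field_len: float = 100.0, avg_field_len: float = 100.0) -> float:
    """Term-frequency part of textbook BM25; it approaches k1 + 1 as tf grows."""
    # Length normalisation: fields longer than average are penalised.
    norm = k1 * (1.0 - b + b * field_len / avg_field_len)
    return tf * (k1 + 1.0) / (tf + norm)


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """Inverse document frequency: rare terms get a larger value."""
    return math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


if __name__ == "__main__":
    # The TF component rises quickly at first, then flattens out below k1 + 1.
    for tf in (1, 2, 5, 20, 1000):
        print(tf, round(bm25_tf_component(tf), 3))
```

The same function also shows the field-length factor: with the same tf, a shorter field gives a higher value than a longer one.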
How can you manually influence the relevance score?
- Use the boost attribute to control the weight of a query:
// The first match query has a weight of 2; the second has the default weight of 1.
// The final score is not simply multiplied by 2.
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": { "query": "quick brown fox", "boost": 2 } } },
        { "match": { "content": "quick brown fox" } }
      ]
    }
  }
}
- Use indices_boost to weight indices differently:
GET /docs_2014_*/_search
{
  // Control the weight of docs_2014_10 and docs_2014_09 respectively
  // (array form; the older object form is deprecated)
  "indices_boost": [
    { "docs_2014_10": 3 },
    { "docs_2014_09": 2 }
  ],
  "query": {
    "match": { "text": "quick brown fox" }
  }
}
- Boosting query: separates the query that selects documents from the query that adjusts their score
POST testscore/_search
{
  "query": {
    "boosting": {
      // The query that selects documents; every result must match "positive"
      "positive": {
        "term": { "content": "elasticsearch" }
      },
      // The query that lowers the relevance score. If a result also matches "negative",
      // final score = positive query score * negative_boost
      "negative": {
        "term": { "content": "like" }
      },
      // Range: 0 to 1.0
      "negative_boost": 0.2
    }
  }
}
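The scoring rule stated in the comments of the boosting query can be written down directly. This is a small sketch of that rule only; `boosting_score` is an illustrative name, not an Elasticsearch API.

```python
def boosting_score(positive_score: float, matches_negative: bool,
                   negative_boost: float) -> float:
    """Score of a document under a boosting query, as described in the text."""
    if not 0.0 <= negative_boost <= 1.0:
        raise ValueError("negative_boost must be between 0 and 1.0")
    # Documents that also match the negative query are demoted, not removed.
    if matches_negative:
        return positive_score * negative_boost
    return positive_score
```

For example, a document scoring 2.0 on the positive query that also contains "like" would end up at 2.0 * 0.2 = 0.4, while a document that does not match the negative query keeps its score of 2.0.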
- constant_score:
// constant_score turns a query into a filter, which skips relevance scoring and improves query performance.
// Filters can be cached, and every matching document gets the same constant score.
// constant_score is generally used for structured queries.
POST /products/_search
{
  "explain": true,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "productID.keyword": "XHDK-A-1293-#fJ3"
        }
      }
    }
  }
}
Function Score Query
If the scoring controls introduced so far are not enough, ES also provides the Function Score Query. After the query runs, it rescores every matched document on top of its original score and then reorders the results, which makes it the ultimate tool for controlling the scoring process. ES provides the following scoring functions by default:
- Weight: similar to the boost set above, except that the weight is not normalized. When a document's weight is 2, the result is 2 * _score
- Field Value Factor: lets certain fields in the document contribute to the relevance score. For example, you can modify _score by using fields such as "popularity" and "likes" as scoring factors
POST /blogs/_search
{
  "query": {
    "function_score": {
      // Multi Match Query
      "query": {
        "multi_match": {
          "query": "popularity",
          "fields": [ "title", "content" ]
        }
      },
      // new score = _score * log(1 + votes * factor)
      "field_value_factor": {
        "field": "votes",       // the field that influences the score
        // scoring modifier, which can be none, log, log1p, log2p, and others; see
        // https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html
        "modifier": "log1p",
        "factor": 0.1           // influence coefficient
      }
    }
  }
}
- Random Score: generates a random score between 0 and 1 for each document. With a per-user seed, it can be used for personalized recommendation scenarios
POST /blogs/_search
{
  // Different users should see different random orderings,
  // but the same user should always see the same relative order.
  "query": {
    "function_score": {
      "random_score": {
        "seed": 911119
      }
    }
  }
}
- Decay functions:
1. A decay function takes the value of a field as a reference: the closer the value is to a given origin, the higher the score.
2. Three decay functions are available: linear, exp, and gauss.
3. Decay functions can operate on numeric, date, and geo-point (latitude/longitude) fields.
4. For example, combine publish_date to favor recently published documents, combine geo_location to favor documents closer to a specific latitude and longitude (lat/lon), and combine price to favor documents closer to a specific price.
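The three decay shapes can be compared with a few lines of Python. This is a sketch of the formulas as I understand them from the function score documentation, using its parameter names (`origin`, `scale`, `offset`, `decay`); the Python function names are illustrative. In all three, a value exactly `scale` away from the origin (beyond `offset`) gets the score `decay`.

```python
import math


def _distance(value: float, origin: float, offset: float) -> float:
    # Values within `offset` of the origin are treated as being at the origin.
    return max(0.0, abs(value - origin) - offset)


def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-_distance(value, origin, offset) ** 2 / (2.0 * sigma_sq))


def exp_decay(value, origin, scale, offset=0.0, decay=0.5):
    lam = math.log(decay) / scale
    return math.exp(lam * _distance(value, origin, offset))


def linear_decay(value, origin, scale, offset=0.0, decay=0.5):
    s = scale / (1.0 - decay)
    return max((s - _distance(value, origin, offset)) / s, 0.0)


if __name__ == "__main__":
    # Price example: origin 100, scale 20. Compare how fast each curve falls off.
    for price in (100, 110, 120, 140, 200):
        print(price,
              round(gauss_decay(price, 100, 20), 3),
              round(exp_decay(price, 100, 20), 3),
              round(linear_decay(price, 100, 20), 3))
```

The linear curve reaches exactly 0 at some distance, while gauss and exp only approach 0, so linear acts more like a soft cutoff.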
- Script Score: custom scoring function
Script Score lets you write a script to compute relevance, which provides great flexibility. Through the script we can access every field in the document, the current score, and even the term frequency, inverse document frequency, and field-length norm values. It also removes the field type limitations of the other functions: Field Value Factor generally works only with numeric types, while decay functions generally work only with numeric, geo, and date types.
POST /blogs/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": { "message": "elasticsearch" }
      },
      "script_score": {
        "script": {
          "source": "Math.log(2 + doc['likes'].value)"
        }
      }
    }
  }
}
A Function Score Query can use any combination of the five functions above at the same time. The following parameters control how the results are combined:
- score_mode: defines how the results of multiple functions are combined:
1. multiply: the scores are multiplied (default)
2. sum: the scores are added
3. avg: the scores are averaged
4. first: the first function whose filter matches is applied
5. max: the largest score is used
6. min: the smallest score is used
- max_boost: the upper limit that the combined function score cannot exceed
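The combination modes listed above can be illustrated with a small function. This is a simplified sketch: it takes the scores of the functions that applied to one document, ignores per-function weights (the real avg mode weights the average), and assumes that `max_boost` caps the combined value. `combine_function_scores` is an illustrative name.

```python
from functools import reduce
from typing import Sequence
import operator


def combine_function_scores(scores: Sequence[float], score_mode: str = "multiply",
                            max_boost: float = float("inf")) -> float:
    """Combine the scores of the functions that matched one document."""
    if not scores:
        return 1.0  # assumption: with no matching function the query score is left unchanged
    if score_mode == "multiply":
        combined = reduce(operator.mul, scores, 1.0)
    elif score_mode == "sum":
        combined = sum(scores)
    elif score_mode == "avg":
        combined = sum(scores) / len(scores)
    elif score_mode == "first":
        combined = scores[0]
    elif score_mode == "max":
        combined = max(scores)
    elif score_mode == "min":
        combined = min(scores)
    else:
        raise ValueError(f"unknown score_mode: {score_mode}")
    # max_boost caps the combined function score.
    return min(combined, max_boost)
```

With function scores 2.0 and 3.0, multiply gives 6.0 and sum gives 5.0; adding `max_boost=4.0` caps both at 4.0.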
Classification of search
From Elasticsearch's point of view, there are two ways to search documents: URI Search and Request Body Search. From the point of view of user search and analysis, the question is different: when a user enters a query string, should it be searched as a whole, or should it first be split into terms that are each searched? From this perspective, search divides into term queries and full-text queries:
Term queries
- A term is the smallest unit that carries meaning; both search and language models in natural language processing operate on terms
- In ES, a term query does not analyze its input; it looks up the input as a whole in the inverted index
- A term query is an exact match against the terms in the inverted index. It does not handle word variations, such as case conversion
- With term queries we generally know exactly what we are looking for, because the match is exact, so we can use Constant Score to turn the query into a filter, which skips relevance scoring, takes advantage of the cache, and improves query performance
- The following query types are term queries; they were covered in the previous article:
1. Term Query: exact match on a term
2. Range Query
3. Exists Query: checks whether a field exists
- Examples of term queries:
// A term query on productID for "XHDK-A-1293-#fJ3" does not match,
// because the text field was analyzed (split and lowercased) at index time
POST /products/_search
{
  "query": {
    "term": {
      "productID": {
        "value": "XHDK-A-1293-#fJ3"
      }
    }
  }
}

// To match "XHDK-A-1293-#fJ3" with a term query, use productID.keyword so that the whole value is matched exactly
POST /products/_search
{
  // "explain": true,
  "query": {
    "term": {
      "productID.keyword": {
        "value": "XHDK-A-1293-#fJ3"
      }
    }
  }
}

// The {field}.keyword sub-field can be queried because the productID mapping
// contains a "fields" attribute with a sub-field named keyword
"properties": {
  "productID": {
    "type": "text",
    "fields": {
      // With the keyword sub-field, the whole productID value is also indexed as a single term,
      // in addition to the analyzed tokens of productID
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
Full-text queries
- For full-text queries, text is analyzed both at index time and at search time: the query string is split by the analyzer into a list of terms to search for
- A full-text query runs a low-level query for each term in the list, merges the results at the upper level, and generates a score for every matched document
- The following query types are full-text queries:
1. Match Query
2. Match Phrase Query
3. Query String Query: queries through the q parameter in the URL; for the syntax, see the previous article, [Elasticsearch learning](https://zhuanlan.zhihu.com/p/104215274)
4. Match Phrase Prefix Query: similar to the Match Phrase Query, except that the last term is matched as a prefix
5. Multi Match Query: searches multiple fields at the same time
- What is Match Phrase Query
1. Match Phrase Query finds documents in which the search terms appear close to each other. It is mainly used where the position of words matters.
2. As before, the query string is analyzed into a list of terms and then searched, but only documents that contain all the terms, in the same relative positions as in the query, are kept.
3. By default the position requirement is strict: apart from punctuation and spaces, the words and their order must be exactly the same. For example, a search for "I like riding" cannot match "I like swimming and riding!" because of the two extra words "swimming" and "and".
4. Match Phrase Query provides the optional slop parameter to address point 3. slop specifies how many position moves are allowed for the query terms to line up with the document. For the "I like riding" query above, "riding" only needs to move two positions, so setting slop = 2 produces a match.
POST groups/_search
{
  "query": {
    "match_phrase": {
      "names": {
        "query": "I like riding",
        "slop": 2
      }
    }
  }
}
- Match Phrase Prefix Query example:
POST groups/_search
{
  "query": {
    "match_phrase_prefix": {
      "names": {
        "query": "I like ri",    // the last term is "ri", so it can match any word that starts with "ri"
        "slop": 2,
        "max_expansions": 10     // max_expansions limits how many terms the prefix may expand to; the default is 50
      }
    }
  }
}
Structured queries
- Structured search is another way to classify queries: it refers to queries on data that is inherently structured
- For example, dates, times, and numbers are structured: they have a precise format and are often used in range and comparison queries
- Some text is also structured in certain scenarios, namely a limited set of discrete values such as "male" and "female"
- The result of a structured query is binary: a document is either in the result set or not
- Structured queries have no concept of "similarity", so they are generally not scored, which improves performance
- Structured text is usually matched exactly or by prefix
POST products/_search
{
  "query": {
    "constant_score": {    // do not score
      "filter": {
        "range": {
          "date": {
            // date greater than or equal to one year ago
            // (units: y - year, M - month, w - week, d - day, H/h - hour, m - minute, s - second)
            "gte": "now-1y"
          }
        }
      }
    }
  }
}
Compound queries
A compound query combines simple queries into one set of conditions for document retrieval. There are two main kinds of compound query:
Bool Query
- A Bool query is a combination of one or more query clauses
- The following four options control how the clauses are combined:
1. must: must match; contributes to the score
2. should: optional match; contributes to the score. The minimum_should_match parameter controls the minimum number of should clauses that must match. The default is 0, or 1 when the Bool query contains only should clauses
3. must_not: must not match
4. filter: must match, like must, except that the conditions in filter do not participate in scoring
- The values of must, should, must_not, and filter are JSON arrays. You can add multiple search criteria, including both term queries and full-text queries
- The structure of the query affects the relevance score. Clauses at the same level have the same weight, so you can change their impact on the score by nesting Bool queries
POST /products/_search
{
  "_source": "topics",
  "from": 0,
  "size": 100,
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              { "term": { "topics": 1 } },
              { "term": { "topics": 2 } }
            ]
          }
        },
        {
          "bool": {
            "must": [
              { "term": { "topics": 3 } },
              { "term": { "topics": 4 } }
            ]
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
Disjunction Max Query
A Disjunction Max Query runs several queries against each document and uses the score of the best-matching query (field) as the document's score, instead of adding the scores of all fields.
The following two requests have the same matching effect:
// For each returned document, the final score is the highest matching score among title and body.
POST blogs/_search
{
  "query": {
    "dis_max": {    // the score of the best-matching field becomes the final score
      "queries": [
        { "match": { "title": "Quick pets" } },
        { "match": { "body": "Quick pets" } }
      ],
      // To separate documents whose best score is the same, the scores of the other matching
      // fields are multiplied by tie_breaker and added to the highest score
      "tie_breaker": 0.2
    }
  }
}

POST blogs/_search
{
  "query": {
    "multi_match": {
      "type": "best_fields",
      "query": "Quick pets",
      "fields": [ "title", "body" ],
      "tie_breaker": 0.2
    }
  }
}
Suggester API
Search engines such as Google and Baidu provide auto-complete and auto-correct while you type, which improves both the match quality and the user experience of search. ES provides the Suggester API for the same purpose.
The Suggester API essentially breaks the input text into tokens, looks up similar terms in the index dictionary, and returns them. Each token can receive several suggestions, and each suggestion comes with a similarity score and an occurrence frequency.
Suggester categories
ES offers four categories of suggester:
- Term Suggester: the input text is split into terms, and for each term, similar terms are suggested through fuzzy matching
// suggest_mode has three options:
// 1. missing: suggestions are provided only for terms that are not in the index
// 2. popular: only terms that occur in more documents than the original term are suggested
// 3. always: suggestions are always provided
POST /articles/_search
{
  "suggest": {
    "term-suggestion": {
      "text": "lucen rock",            // the text to get suggestions for
      "term": {
        "suggest_mode": "missing",     // if the term already exists in the index, no suggestion is provided
        "field": "body"                // look up the terms in the body field; for a missing term, suggest similar terms from body
      }
    }
  }
}
- Phrase Suggester: builds on the Term Suggester and adds logic and parameters that consider the relationships between terms, such as whether they appear together in the indexed source text, how close they are to each other, and their frequency
POST /articles/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "lucne and elasticsear rock hello world ",
      "phrase": {
        "field": "body",
        "max_errors": 2,       // the maximum number of terms that may be misspelled
        "confidence": 0,       // limits the results returned: only suggestions that score above confidence are returned; the default is 1.0
        "direct_generator": [
          {
            "field": "body",
            "suggest_mode": "always"    // the same three modes as the Term Suggester: missing, popular, and always
          }
        ],
        // highlight the corrected terms
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}
- Completion Suggester: its main use case is auto-completion, where latency requirements are very strict. The lookup therefore does not go through the inverted index: the analyzed data is encoded as an FST (finite state transducer) and stored together with the index. For an open index, ES loads the FST into memory, so prefix lookups are very fast
// Note: the field queried by the Completion Suggester must be mapped with the type completion
POST articles/_search?pretty
{
  "size": 0,
  "suggest": {
    "article-suggester": {
      "prefix": "elk ",      // find all suggestions in the title_completion field that start with "elk "
      "completion": {
        "field": "title_completion"
      }
    }
  }
}
- Context Suggester: an extension of the Completion Suggester that makes recommendations context-aware. Two types of context can be defined:
1. Category: any string
2. Geo: a geographic location
The Context Suggester is used in the following steps:
// 1. Define a mapping
PUT comments/_mapping
{
  "properties": {
    "comment_autocomplete": {
      "type": "completion",                // the field type must be completion
      "contexts": [
        {
          "type": "category",              // the context type is category
          "name": "comment_category"       // the context name is comment_category
        }
      ]
    },
    "comment": {
      "type": "text"
    }
  }
}

// 2. Index a document and attach a context to it
POST comments/_doc
{
  "comment": "I love the star war movies",
  "comment_autocomplete": {
    "input": [ "star wars" ],
    "contexts": {
      "comment_category": "movies"         // "star wars" is suggested only when the context is "movies"
    }
  }
}

// 3. Search with a context
POST comments/_search
{
  "suggest": {
    // find suggestions with the prefix "sta" in the context "movies"
    "MY_SUGGESTION": {
      "prefix": "sta",
      "completion": {
        "field": "comment_autocomplete",
        "contexts": {
          "comment_category": "movies"
        }
      }
    }
  }
}
Suggester summary and usage recommendations
The different suggesters compare as follows:
- Precision: Completion > Phrase > Term
- Recall: Term > Phrase > Completion. Recall here refers to how many of the relevant results are returned
- Performance: Completion > Phrase > Term
The Completion Suggester handles prefix matching for search recommendations and error correction. If the Completion Suggester does not return a reasonable recommendation, the user may have mistyped something: try the Phrase Suggester next, and fall back to the Term Suggester if that still finds nothing that makes sense.
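The fallback order recommended above (Completion, then Phrase, then Term) is easy to express as a chain. The sketch below is generic: each suggester is any callable that takes the user input and returns a list of suggestions, and how those callables talk to Elasticsearch is deliberately left out. `suggest_with_fallback` is an illustrative name.

```python
from typing import Callable, List, Sequence

Suggester = Callable[[str], List[str]]


def suggest_with_fallback(text: str, suggesters: Sequence[Suggester]) -> List[str]:
    """Try each suggester in order and return the first non-empty result."""
    for suggester in suggesters:
        results = suggester(text)
        if results:
            return results
    return []


if __name__ == "__main__":
    # Hypothetical stand-ins for calls to the completion, phrase, and term suggesters.
    completion = lambda text: []
    phrase = lambda text: ["lucene and elasticsearch rock"]
    term = lambda text: ["lucene"]
    print(suggest_with_fallback("lucne and elasticsear rock", [completion, phrase, term]))
```

Ordering the chain from the most precise to the highest-recall suggester means the cheaper, more precise answer wins whenever it exists.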
Some search optimizations for development
Search Template
A Search Template decouples the application from the query and makes the responsibilities of application developers and search engineers explicit. The search engineer defines and tunes the query template; application developers only need to call the template by its id and pass parameters, without knowing the details of the query:
// Define the query template as a stored script
POST _scripts/tmdb
{
  "script": {
    "lang": "mustache",
    "source": {
      "_source": [ "title", "overview" ],
      "size": 20,
      "query": {
        "multi_match": {
          "query": "{{q}}",                 // template parameters are written with {{}}
          "fields": [ "title", "overview" ]
        }
      }
    }
  }
}

// Query with the defined template
POST tmdb/_search/template
{
  "id": "tmdb",
  "params": {
    "q": "basketball with cartoon aliens"
  }
}
Index Alias
In some cases the index name changes over time, for example when indices are created per period. You can give the index an alias with Index Alias, so that when the index name changes, only the alias needs to be updated and the index name in the code stays the same
An alias can also carry a filter, so that it exposes only the documents that meet some criteria:
// Create the alias movies-lastest-highrate for documents with a rating greater than or equal to 4
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "movies-2019",
        "alias": "movies-lastest-highrate",
        "filter": {
          "range": {
            "rating": {
              "gte": 4
            }
          }
        }
      }
    }
  ]
}

POST movies-lastest-highrate/_search
{
  "query": {
    "match_all": {}
  }
}
Search across clusters
Pain points of scaling a single cluster horizontally: when the cluster holds too much metadata (nodes, indices, and cluster state), updating that metadata becomes expensive, and the single active master becomes a performance bottleneck that can bring down the whole cluster. Searching across several clusters avoids this. ES 5.3 introduced a new cross-cluster search feature. The steps to set up a cross-cluster search are:
- Starting multiple clusters
bin/elasticsearch -E node.name=cluster0node -E cluster.name=cluster0 -E path.data=cluster0_data -E discovery.type=single-node -E http.port=9200 -E transport.port=9300
bin/elasticsearch -E node.name=cluster1node -E cluster.name=cluster1 -E path.data=cluster1_data -E discovery.type=single-node -E http.port=9201 -E transport.port=9301
bin/elasticsearch -E node.name=cluster2node -E cluster.name=cluster2 -E path.data=cluster2_data -E discovery.type=single-node -E http.port=9202 -E transport.port=9302
- Dynamically register the remote clusters on each cluster:
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster0": {
          "seeds": [ "127.0.0.1:9300" ],
          "transport.ping_schedule": "30s"
        },
        "cluster1": {
          "seeds": [ "127.0.0.1:9301" ],
          "transport.compress": true,
          "skip_unavailable": true
        },
        "cluster2": {
          "seeds": [ "127.0.0.1:9302" ]
        }
      }
    }
  }
}
- Perform a cross-cluster query
// On cluster0, search the users index of all three clusters
GET /users,cluster1:users,cluster2:users/_search
{
  "query": {
    "range": {
      "age": {
        "gte": 20,
        "lte": 40
      }
    }
  }
}