Foreword | Elasticsearch (hereinafter "ES") is a popular open-source full-text search engine. It lets us build a search platform quickly and conveniently, but the default configuration usually needs further tuning for the actual content of the platform before it can return results that satisfy users. This article introduces practical experience with optimizing ES search ranking, and I hope to exchange ideas with you. Cao Yi is an application development engineer at Tencent.

One, Introduction

Although ES makes it convenient and fast to build a search platform, the results often do not meet expectations. ES is a general-purpose full-text search engine: it does not understand the content being searched, and no general-purpose configuration fits every kind of content. Therefore, when ES is used for search, it needs to be optimized for the specific platform to achieve good results.

ES sorts search results by computing a relevance score between the query keywords and the document content. Relevance scoring is not easy to master. First, ES is not very friendly to Chinese out of the box; it requires installing plug-ins and doing some pre-processing. Second, many factors affect the relevance score, and they are highly configurable and flexible.

The following introduces practical experience with ES search ranking optimization. The sample index data in this article comes from report documents.
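The original figure showing the sample data is not reproduced here. As a rough sketch (the index name report and the field names are assumptions, reused in the later examples), one indexed report document might look like this:

```
PUT report/_doc/1
{
  "title": "2020 WeChat User Research Report",
  "summary": "An overview of WeChat user behavior in 2020",
  "content": "...",
  "tags": ["Tencent CDC", "User Research"],
  "category": "report",
  "likes": 128,
  "created_at": "2020-06-01"
}
```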

Two, Optimize the ES Query DSL

After building the search platform, the first thing to optimize is the ES Query DSL. The Query DSL is the key to ES optimization: as long as the DSL is well constructed, the search results will be good.

1. The original multi_match

Once the index is built and the data is synchronized, the fastest way to implement full-text search is to use multi_match, for example:
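The original query screenshot is not reproduced; a minimal multi_match sketch, assuming the index and field names from the sample document above, might look like this:

```
GET report/_search
{
  "query": {
    "multi_match": {
      "query": "2020 WeChat user research report",
      "fields": ["title", "summary", "content", "tags"]
    }
  }
}
```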

This is convenient and fast, and it meets the basic full-text search requirement. But the results may not be what you expect. Don't worry, though; we can keep optimizing.

2. Add filters with the bool query's filter clause

In our applications, we should avoid letting users query across all of the content without restriction. That returns a huge number of hits, and if the ranking is even slightly off, users will not be able to find accurate results.

For this situation, we can add filter options such as tags and categories to the content for users to choose from, which leads to better result ranking. With filters applied, the ES engine scores fewer candidate results during the search, so scoring jitter has less effect.

To do this, we use the bool query's filter. The bool query provides four clauses: must / filter / should / must_not, where filter and must_not belong to the filter context and must and should belong to the query context. Here are two things you need to know about filters:

  • Filters do not compute relevance scores; a filter only decides whether a document is returned, not how it is ranked;

  • Filters can use ES's internal caches, so filters speed up queries.

Note here: although a must query looks like a positive filter, its results are returned with a relevance score computed together with the other queries, so it cannot use the cache; this is what distinguishes it from a filter.

Typically, a document has multiple attributes that can be filtered on, such as ID, time, tag, and category. Documents should be labeled and categorized carefully, because search quality depends on it: once a filter is selected, documents that fail it will not be returned even if they match the user's search terms. An example Query DSL looks like this:
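The original DSL is not reproduced; a sketch of a bool query with filters, assuming a tag_ids field that stores tag ids and a created_at date field, might look like this:

```
GET report/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "2020 WeChat user research report",
            "fields": ["title", "summary", "content", "tags"]
          }
        }
      ],
      "filter": [
        { "term":  { "tag_ids": "1001" } },
        { "range": { "created_at": { "gte": "2020-01-01" } } }
      ]
    }
  }
}
```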

In the example above, there is a trick: filter by the tag's id rather than its text. Term queries are exact matches, so they should not be applied directly to fields of type text such as tags. If a text field needs to be used in a filter, map it as a multi-field with both text and keyword sub-fields, or simply as keyword.
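A sketch of such a mapping (index and field names are assumptions): the tags field is mapped as text with a keyword sub-field, and a term filter can then target tags.keyword instead of the analyzed tags field.

```
PUT report
{
  "mappings": {
    "properties": {
      "tags": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }
  }
}
```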

3. Use match_phrase to increase the weight of search phrases

At this stage, search results often do not match the search terms contiguously. For example, for the search keyword "2020 WeChat user research report", most of the returned results match scattered terms such as "WeChat", "user", "research", and "report", while the results that match the whole phrase, which is what users want, are ranked further down.

This is not an ES bug. Before we can understand this behavior, we need to understand how ES handles match.

First, multi_match converts the query into a combination of match queries, one per field, and executes a match query against each field. When match executes, the query keywords are first processed by the analyzer set via search_analyzer, and each resulting token is placed into a should clause of a bool query. These should clauses have no particular weighting or ordering among them, and any document that hits any of them is returned. The rewritten query is shown below: the original query first, followed by the rewritten form.
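The original before/after screenshot is not reproduced; roughly, a match query like this (token values are illustrative and depend on the analyzer):

```
{ "match": { "title": "WeChat user research" } }
```

is rewritten, after analysis, into approximately:

```
{
  "bool": {
    "should": [
      { "term": { "title": "wechat" } },
      { "term": { "title": "user" } },
      { "term": { "title": "research" } }
    ]
  }
}
```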

As a result, documents that contain only a few of the query's words can score higher than documents that match the entire phrase. So how do we take word order into account? Let's look at the inverted index in ES.

We all know that an inverted index maps a term to the IDs of the documents containing that term, but an inverted index is not quite that simple. A postings list records the set of documents corresponding to a term, and these postings lists make up the inverted index. Each posting contains the following information:

  • Document ID: used to retrieve the document;

  • Term frequency (TF): used for relevance calculation (TF-IDF, BM25);

  • Position: the position(s) of the term within the document; there can be more than one, used for phrase queries;

  • Offset: the start and end character offsets within the document, used for highlighting.

ES records term position information precisely so that it can be used in queries. In the DSL, match_phrase is the query for this. A match_phrase query requires that all of the analyzed terms match, and the returned documents must contain the terms in the same order as the query phrase. The allowed distance between terms can be set with slop.

Although match_phrase solves the ordering problem, it is much more demanding, since it requires all of the terms to match. If you use it alone, you will find that it returns significantly fewer results than match, because documents that match only some of the terms, or match them in a different order, are not returned.

In this case, we can use the bool query's should clause and apply match and match_phrase at the same time. This way, match_phrase adds weight for phrase order, so documents that match the terms in order get a higher relevance score. Here is an example DSL:
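The original DSL is not reproduced; a sketch combining match and match_phrase in should clauses (the slop value is an assumption) might look like this:

```
GET report/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": "2020 WeChat user research report" } },
        {
          "match_phrase": {
            "title": {
              "query": "2020 WeChat user research report",
              "slop": 2
            }
          }
        }
      ]
    }
  }
}
```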

One thing to note: in the inverted index, the elements of a text array are recorded consecutively. For example, if a document's tags are ["Tencent CDC", "Jingdong Research Institute"], the positions of "CDC" and "Jingdong" are adjacent, so a search for "CDC Jingdong" would give this document a relatively high score.

This should not happen. When setting up the index mappings, we should set position_increment_gap on the tags field so that the position gap between array elements is larger than the slop used in queries, for example:
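A sketch of such a mapping (the gap value is an assumption; it just needs to exceed any slop used in queries):

```
PUT report
{
  "mappings": {
    "properties": {
      "tags": {
        "type": "text",
        "position_increment_gap": 100
      }
    }
  }
}
```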

4. Use boost to adjust the weight of query clauses

One obvious problem with the search implementation so far is that the fields are not weighted differently. Common sense tells us that the title should carry more weight than the other fields; it obviously should not be scored the same as them.

The boost setting can be used to increase the weight of a query; note that it is applied to a query clause rather than to a field. When set, the clause's score equals its default score multiplied by the boost.

There are a few caveats to setting up boost:

  • The weight of fields with high data quality can be increased accordingly;

  • The boost of a match_phrase clause should be higher than that of the match query on the same field. A document usually contains only a few phrases that match in order, but it contains many of the individual terms produced by analyzing the query keywords, so the match score tends to be relatively high and dilutes the effect of match_phrase;

  • In the mappings, you can set a per-field boost, and then no longer set a boost on that field in queries.

An example DSL is as follows:
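The original DSL is not reproduced; a sketch with boosts (the boost values are assumptions to be tuned against real data) might look like this:

```
GET report/_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": { "query": "2020 WeChat user research report", "boost": 2 } } },
        {
          "match_phrase": {
            "title": { "query": "2020 WeChat user research report", "slop": 2, "boost": 5 }
          }
        },
        { "match": { "content": "2020 WeChat user research report" } },
        {
          "match_phrase": {
            "content": { "query": "2020 WeChat user research report", "slop": 2, "boost": 3 }
          }
        }
      ]
    }
  }
}
```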

5. Use function_score to add more scoring factors

There are other factors that affect how a document should be scored. For example, we often consider questions like the following:

  • More recent documents are timelier and more useful to users, and should be ranked higher;

  • Popular documents on the platform are more likely to be what users want, and should rank above other documents;

  • High-quality documents should be more visible to users, while documents missing tags or summaries usually should not be;

  • Operators sometimes want the documents being promoted to show up in search.

These scenarios can be handled by adding more factors that affect a document's score, such as time, popularity, quality rating, and operational weight.

These factors share one feature: their weights can be determined at data construction time and have nothing to do with the query keywords. These document-level weight attributes can be regarded as a static score, while the relevance score computed from the search keywords is the dynamic score, so a document's final score should combine the dynamic score and the static score.

The properties used for static scoring should not be set arbitrarily. To give users a good experience, static scoring should have the following characteristics:

  • Stability: static scores should not change drastically or frequently; otherwise, users searching for the same keywords over a period of time will get different results;

  • Continuity: static scores should not swamp the formula, so that our other optimizations can still affect the total score. For example, with documents of popularity 0.1 and 1000, even if the 0.1-popularity document matches the query 100 times better, it will still rank below the 1000-popularity document;

  • Differentiation: given stability and continuity, static scores should still have a reasonable spread. If there are 1000 documents and their scores all fall between 1.0 and 1.001, the static score is effectively meaningless because its impact on ranking is too small.

There is no generic query clause for these new factors, but ES provides function_score, which lets us customize the scoring formula and also offers several built-in types for quick use. function_score provides five types:

  • script_score: the most flexible way, a fully custom scoring script;

  • weight: multiplies the score by a weight factor;

  • random_score: a random score;

  • field_value_factor: uses a field of the document to influence the total score;

  • decay functions: gauss, exp, and linear.

Since there are quite a few types, the following only gives practical examples of field_value_factor and the decay functions.

(1) field_value_factor

The influence of popularity, recommendation weight, and the like can be applied to the score as a multiplier, which is exactly what field_value_factor is for, as follows:
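The original DSL is not reproduced; a sketch matching the description below (the likes field name is an assumption) might look like this:

```
GET report/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "user research" } },
      "field_value_factor": {
        "field": "likes",
        "factor": 1.2,
        "modifier": "sqrt",
        "missing": 1
      },
      "boost_mode": "multiply"
    }
  }
}
```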

This computes sqrt(1.2 * doc['likes'].value), using 1 when the field is missing. There is a small trick in the missing setting. For example, suppose a document missing its summary is considered low quality and its weight should be reduced accordingly. We can then set missing to a value less than 1 and leave factor at 1.

(2) Decay function

The decay function is very useful for smoothly giving higher scores to documents closer to some point and lower scores to documents farther away. Using a decay function, it is easy to implement scenarios such as giving more recent documents a higher score.

ES provides three decay functions. Let's first look at how they differ; the official documentation illustrates their curves:

  • linear: two straight lines decaying from the origin; beyond the point where the lines meet the horizontal axis the score is 0;

  • exp: exponential decay, dropping sharply at first and then slowly;

  • gauss: Gaussian decay, the most commonly used; it decays slowly at first, then sharply, then slowly again, with the steepest decay near the scale point.

When we want to prefer results within a certain range, or results within a certain range are more important, such as a time span, a geographic circle, or a price range, we can use the Gaussian decay function. The Gaussian decay function has four parameters:

  • origin: the center point, i.e. the ideal value of the field; a document that falls exactly on origin gets the full score of 1.0;

  • scale: the decay distance, i.e. the distance from origin at which the score has dropped to the decay value;

  • decay: the _score given to a document at distance scale from origin; defaults to 0.5;

  • offset: sets a non-zero range around origin instead of a single point; all documents within origin ± offset get a _score of 1.0.

Suppose documents from the last three years are more important to our search and older information is less valuable. We can then choose origin as today, scale as three years, decay as 0.5, and offset as three months. The DSL is as follows:
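The original DSL is not reproduced; a sketch of the Gaussian decay (the created_at field name is an assumption, and three years / three months are written as day-based date math) might look like this:

```
GET report/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "user research" } },
      "gauss": {
        "created_at": {
          "origin": "now",
          "scale": "1095d",
          "offset": "90d",
          "decay": 0.5
        }
      }
    }
  }
}
```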

6. Final result

At this point our Query DSL is more or less optimized, and the search results are now satisfactory. Let's look at an example of the resulting DSL statement (the example is illustrative, not production code):
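A sketch of what the combined query might look like, putting together the bool query with filters, match plus match_phrase with boosts, and the function_score factors (all names and numbers are illustrative assumptions):

```
GET report/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": [
            { "term": { "tag_ids": "1001" } }
          ],
          "should": [
            { "match": { "title": { "query": "2020 WeChat user research report", "boost": 2 } } },
            {
              "match_phrase": {
                "title": { "query": "2020 WeChat user research report", "slop": 2, "boost": 5 }
              }
            },
            { "match": { "content": "2020 WeChat user research report" } },
            {
              "match_phrase": {
                "content": { "query": "2020 WeChat user research report", "slop": 2, "boost": 3 }
              }
            }
          ],
          "minimum_should_match": 1
        }
      },
      "functions": [
        {
          "field_value_factor": {
            "field": "likes",
            "factor": 1.2,
            "modifier": "sqrt",
            "missing": 1
          }
        },
        {
          "gauss": {
            "created_at": {
              "origin": "now",
              "scale": "1095d",
              "offset": "90d",
              "decay": 0.5
            }
          }
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "multiply"
    }
  }
}
```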

Three, Optimize the relevance algorithm

We discussed above that the final score should combine the dynamic score and the static score. We have optimized the static score; below we discuss how the dynamic score is calculated.

Dynamic scoring means that on every query, the relevance between the user's query keywords and the document is calculated; more specifically, the relevance of the full-text search fields is calculated in real time.

ES implements this in a nicely pluggable way, allowing us to control the relevance algorithm for each field. At the mappings stage, we can adjust the similarity parameter and assign different similarity settings to different fields to tune the relevance algorithm. ES offers several similarity implementations; we will focus on BM25 next.

The default relevance algorithm in ES is BM25, a probabilistic model of the relevance between terms and documents. It can be regarded as an upgrade of the TF-IDF algorithm based on the vector space model. ES dropped the classic TF-IDF similarity in version 7.0.0 and replaced it entirely with BM25. A detailed comparison of BM25 and TF-IDF is beyond the scope of this article.

Take a look at the formula for BM25 on Wikipedia:
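The formula image is not reproduced here; in its standard form (as given on Wikipedia):

$$\mathrm{score}(D,Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\cdot\frac{f(q_i,D)\cdot (k_1+1)}{f(q_i,D)+k_1\cdot\left(1-b+b\cdot\frac{|D|}{\mathrm{avgdl}}\right)}$$

where f(q_i, D) is the frequency of term q_i in document D, |D| is the length of D in words, and avgdl is the average document length in the collection.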

Does it look hard to understand at first glance, with so many variables? Don't worry: there are really only two parameters we need to adjust. Apart from k1 and b, the remaining variables can be computed directly from the documents, so in ES the BM25 formula is tuned by adjusting these two parameters.

The parameter k1 controls how quickly term frequency saturates. The default value is 1.2: the smaller the value, the faster saturation is reached; the larger the value, the slower. Term frequency saturation is illustrated in the official documentation, which plots the score against term frequency; k1 controls the TF part of BM25.

The parameter b controls the effect of field-length normalization: 0.0 disables normalization and 1.0 applies full normalization. The default value is 0.75.

When tuning k1 and b for BM25, we should start from the characteristics of the content being searched and carefully analyze the retrieval requirements.

For example, the quality of the content field in our sample index data is uneven, and some documents even lack this field entirely. However, the real data behind such a document (perhaps a file or a video) may be of high quality, so the length of the content field in ES does not reflect the true state of the document. We also do not want documents with short content to be favored, so we want to minimize the effect of document length on the score.

Based on the descriptions of k1 and b, we lower the b value of the BM25 model from the default 0.75. Finding the right value requires further experimentation; here I use 0.2 as an example and write the corresponding settings and mappings:
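A sketch, assuming the index name report and a custom similarity called my_bm25 applied to the content field:

```
PUT report
{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": {
          "type": "BM25",
          "k1": 1.2,
          "b": 0.2
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "similarity": "my_bm25"
      }
    }
  }
}
```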

The default values of k1 and b work for most document collections, but the optimal values vary from collection to collection, and the parameters must be validated repeatedly to find the best values for your documents.

Four, Optimization suggestions

When optimizing ES search, most of the effort should go into improving document data quality and tuning the combination of Query DSL clauses. You need to repeatedly try different query combinations and adjust weights, and avoid touching similarity until the DSL optimization has reached a good state.

Do not introduce too many plug-ins early on, such as synonyms or pinyin; they will interfere with your optimization. They are only tools to improve search recall and do not necessarily improve precision. A more professional platform should offer better search guidance and suggestions rather than leaving users to search blindly.

Search tuning cannot focus only on technology; it must also focus on users. Search quality is a subjective judgment. To know whether users are satisfied with the results, we can only monitor search results and user behavior, such as how often they repeat a search and how often they page through results.

If a search returns relevant documents, users should get what they want on the first search; if it returns less relevant results, users may click back and forth and try new search terms.

Five, Use _explain for bad case analysis

That is the end of our search ranking optimization, but some users may still report inaccurate results from time to time. While it is possible that users are not searching well (and the product should guide them toward better queries), it is more likely that the ranking optimization is not good enough.

Therefore, when bad cases are found, we should record them, analyze the underlying problems carefully, and then optimize step by step.

To analyze bad cases, we can use the _explain API provided by ES, which returns the scoring details of a specific document for a given DSL query. From these details we can find the cause of the bad case and then optimize it in a targeted way.
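A minimal sketch of calling _explain (the index name, document id, and query are assumptions):

```
GET report/_explain/1
{
  "query": {
    "match": { "title": "2020 WeChat user research report" }
  }
}
```

The response breaks the score down into its components (term frequency, IDF, field norms, boosts), which is usually enough to see why a document ranked where it did.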

Six, Conclusion

ES is a general-purpose full-text search engine. To build a professional search platform with ES, you must go through search optimization to reach a usable state. This article summarizes search optimization with ES, mainly optimizing the Query DSL and the relevance calculation. I hope readers can take away something useful from it.

Tencent Technology | Learn about cloud computing at the Cloud+ Community: cloud.tencent.com/developer