
preface

When using ElasticSearch, most of the time it retrieves the documents we want, but occasionally the results come back in an unexpected order, with less relevant documents ranked first, which is not what we want. So how do we troubleshoot that?

Before ElasticSearch 5, the default relevance algorithm was TF-IDF; from version 5 onward it is BM25.

How does ElasticSearch calculate relevance from these complex formulas, and what do their parameters mean?

Let’s first understand what TF-IDF and BM25 are.

TF-IDF

TF

TF is short for Term Frequency, also known as word frequency. You can think of it simply as the number of times a keyword occurs in a document divided by the total number of terms in that document.

TF = (occurrences of the term in the document) / (total number of terms in the document).

So how do we measure the relevance between a query and a matching document? Simply add up the TF of each query keyword in that document.
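
As a minimal sketch of that idea (plain Python for illustration; this is not how ElasticSearch implements it):

# Minimal TF sketch -- illustration only, not ElasticSearch's implementation.
def tf(term, doc_tokens):
    # Occurrences of the term divided by total terms in the document.
    return doc_tokens.count(term) / len(doc_tokens)

def naive_relevance(query_terms, doc_tokens):
    # Sum the TF of every query keyword in the document.
    return sum(tf(t, doc_tokens) for t in query_terms)

doc = "to school everywhere is the white one school".split()
print(tf("school", doc))                 # 2 / 8 = 0.25
print(naive_relevance(["school"], doc))  # 0.25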

By this point, readers will also have noticed that stopwords such as "the", "is" and "and" should not be counted in TF.

IDF

Besides removing stopwords, there is another way to measure how important a term is after tokenization: the Inverse Document Frequency (IDF).

The simple idea is:

IDF = log(total number of documents / number of documents in which the term appears).

It lowers the weight of common terms and raises the weight of rare ones.
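
Putting TF and IDF together, a toy scorer might look like this (textbook form, for intuition only; Lucene smooths the formulas differently, as we will see below):

import math

# Toy TF-IDF -- textbook form, for intuition only.
def idf(term, docs):
    containing = sum(1 for d in docs if term in d)  # docs that contain the term
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return (doc.count(term) / len(doc)) * idf(term, docs)

docs = [
    "to school everywhere is the white one school".split(),
    "home school".split(),
    "behind them there was a curved path".split(),
]
print(tf_idf("school", docs[0], docs))  # a rarer term would score higher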

BM25

BM25 is short for Best Match; the 25 means it is the 25th iteration of the algorithm.

ElasticSearch adopted BM25 from version 5 onward. The formula itself can make your head spin, but deriving it is not our goal; what matters is understanding the mechanism, which paves the way for troubleshooting scoring-related retrieval problems. The main parameters in the formula are listed below, with a small sketch after the list.

  • D: the document
  • Q: the query
  • k1: a free parameter, default 1.2, which controls term-frequency saturation
  • b: adjusts how much document length affects relevance, default 0.75
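
As a sketch of how these parameters combine, here is the per-term score in the Lucene/ElasticSearch flavor of BM25, assembled from the formulas that the explain output below reports (boost * idf * tf). Treat it as an illustration under those assumptions, not the engine's actual code:

import math

# Lucene-style BM25 per-term score -- a sketch assembled from the
# formulas shown in the explain output later, not ElasticSearch's code.
def bm25_term_score(freq, n, N, dl, avgdl, k1=1.2, b=0.75, boost=2.2):
    # freq: occurrences of the term in this document
    # n: documents containing the term; N: documents with the field
    # dl: this document's field length; avgdl: average field length
    idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
    tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))
    return boost * idf * tf

We will plug real numbers into exactly these two formulas in the hands-on section.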

These parameters may still feel abstract, so let's look at them in action.

Hands-on

indexing

Create an index named test with a single shard and the default standard analyzer, then insert 7 documents:

PUT test
{
  "mappings": {
    "properties": {
      "title":   { "type": "text" },
      "content": { "type": "text" },
      "remark":  { "type": "text" }
    }
  },
  "settings": {
    "number_of_shards": 1
  }
}

PUT test/_bulk
{"index":{"_id":1}}
{"title":"To school, everywhere is the white one, school","content":"the snow is still one child to jump from the sky"}
{"index":{"_id":2}}
{"title":"First of the big brothers and sisters are braving the cold","content":"braving heavy snow snow yet"}
{"index":{"_id":3}}
{"title":"Behind them there was a curved path","content":"junior high school English composition"}
{"index":{"_id":4}}
{"title":"we walked convenient","content":"small writing on the National Day is not smooth"}
{"index":{"_id":5}}
{"title":"but they must be tired","content":"very hard."}
{"index":{"_id":6}}
{"title":"Home school","content":"love made several small partner"}
{"index":{"_id":7}}
{"remark":"remark school"}

Use a match query to retrieve the documents whose title field matches school:

POST test/_search
{
  "query": {
    "match": {
      "title": "school"
    }
  }
}

Results

id    score        title
6     1.4157268    Home school
1     1.2943789    To school, everywhere is the white one, school

Take the second result, the document with id 1 (only its title field matters here), and work out how the 1.2943789 was calculated. Set explain: true to analyze the scoring details.

Score analysis

POST test/_search
{
  "explain": true."query": {
    "match": {
      "title": "school"}}}Copy the code

Here we take only the _explanation section for the second result.

        "_explanation" : {
          "value" : 1.2943789."details": [{"value" : 1.2943789."description" : "Score (FREq =2.0), computed as Boost * IDF * TF from:"."details": [{"value" : 2.2."description" : "boost"}, {"value" : 1.0296195."description" : "Idf, computed as log(1 + (n-n + 0.5)/(N + 0.5)) from:". }, {"value" : 0.5714286."description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:". }]}]}Copy the code

With explain: true enabled, the reader can see an extra _explanation section in each result, which also states that score = boost * idf * tf.

Part one: weight

{" value ": 2.2," description ":" boost ", "details" : []}Copy the code

The first part of _explanation is the weight. We did not specify a boost in the query, so the default value 2.2 is shown (recent Lucene versions fold the constant k1 + 1 = 2.2 into this boost factor). If you specify a boost of X, this value becomes 2.2 * X.
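
A quick sketch of that scaling, reusing the idf and tf values from this explain output (X here stands for a hypothetical query-time boost):

# How a query-time boost X would scale this document's score (sketch).
idf, tf = 1.0296195, 0.5714286
for x in (1, 2, 3):  # hypothetical boost values
    print(x, 2.2 * x * idf * tf)
# x = 1 reproduces the original score, about 1.2943789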

Part two: IDF

            {
              "value" : 1.0296195,
              "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
              "details" : [
                {
                  "value" : 2,
                  "description" : "n, number of documents containing term",
                  "details" : [ ]
                },
                {
                  "value" : 6,
                  "description" : "N, total number of documents with field",
                  "details" : [ ]
                }
              ]
            }
  • n: number of documents whose title field contains the keyword school
  • N: total number of documents that have a title field

This should be obvious. Only documents 6 and 1 contain the keyword school in the title field, so n = 2. And 6 documents have a title field (document 7 has only a remark field), so N = 6.

idf = log(1 + (N - n + 0.5) / (n + 0.5)) = log(1 + (6 - 2 + 0.5) / (2 + 0.5)) = 1.0296195

The log here is the natural logarithm (base e).
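
A quick check of that arithmetic:

import math

# Verify the idf component with n = 2, N = 6.
print(math.log(1 + (6 - 2 + 0.5) / (2 + 0.5)))  # 1.0296194171811581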

Part three: TF

            {
              "value" : 0.5714286,
              "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
              "details" : [
                {
                  "value" : 2.0,
                  "description" : "freq, occurrences of term within document",
                  "details" : [ ]
                },
                {
                  "value" : 1.2,
                  "description" : "k1, term saturation parameter",
                  "details" : [ ]
                },
                {
                  "value" : 0.75,
                  "description" : "b, length normalization parameter",
                  "details" : [ ]
                },
                {
                  "value" : 8.0,
                  "description" : "dl, length of field",
                  "details" : [ ]
                },
                {
                  "value" : 6.0,
                  "description" : "avgdl, average length of field",
                  "details" : [ ]
                }
              ]
            }
  • freq: number of times the keyword occurs in the document
  • k1: term saturation parameter, default 1.2
  • b: length normalization parameter, default 0.75
  • dl: length of this document's title field after tokenization
  • avgdl: average tokenized length of the title field across all documents that have it

This can be verified with the Analyze API. Look at how the title field of document 1 is tokenized:

POST test/_analyze
{
  "field": "title",
  "text": ["To school, everywhere is the white one, school"]
}

Result (other token attributes are omitted below; only the tokens themselves are shown):

{
  "tokens" : [
    {
      "token" : "to",
    },
    {
      "token" : "school",
    },
    {
      "token" : "everywhere",
    },
    {
      "token" : "is",
    },
    {
      "token" : "the",
    },
    {
      "token" : "white",
    },
    {
      "token" : "one",
    },
    {
      "token" : "school",
    }
  ]
}

Clearly:

  • school appears twice, so freq = 2
  • tokenization yields 8 terms, so dl = 8

So how is avgdl, the average field length, calculated?

avgdl = (total number of tokens across all title fields) / (number of documents that have a title field)

To verify this, I simply concatenated the contents of all the title fields and ran them through the Analyze API to count the resulting tokens.

The result contains 36 tokens; dividing by the 6 documents that have a title field gives avgdl = 36 / 6 = 6.

tf = freq / (freq + k1 * (1 - b + b * dl / avgdl)) = 2 / (2 + 1.2 * (1 - 0.75 + 0.75 * 8 / 6)) = 0.57142857142857
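
And the same check for tf:

# Verify the tf component with freq = 2, k1 = 1.2, b = 0.75, dl = 8, avgdl = 6.
freq, k1, b, dl, avgdl = 2.0, 1.2, 0.75, 8.0, 6.0
print(freq / (freq + k1 * (1 - b + b * dl / avgdl)))  # 0.5714285714285714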

Total score

Boost was not specified, so the default value 2.2 is used.

score = boost * idf * tf = 2.2 * 1.0296195 * 0.57142857142857 = 1.2943788

(There may be small floating-point differences, but the result is essentially the same.)
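
Recomputing the whole chain in Python confirms this (the small difference is float precision):

import math

# Recompute document 1's score: boost * idf * tf.
boost = 2.2
idf = math.log(1 + (6 - 2 + 0.5) / (2 + 0.5))   # n = 2, N = 6
tf = 2 / (2 + 1.2 * (1 - 0.75 + 0.75 * 8 / 6))  # freq = 2, dl = 8, avgdl = 6
print(boost * idf * tf)  # ~1.2943787, vs 1.2943789 in the explain output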

conclusion

As the analysis above shows, the complex-looking formula is just one parameter combined with another. Once we understand the scoring rules, we can approach everyday questions about result ordering with a clear picture of what is happening.

extension

All the statistics described above were computed with a single index shard. With more shards, the score is affected to some extent.

This time the index is created with 10 shards, and routing is used to place the seven documents in seven different shards.

PUT test2
{
  "mappings": {
    "properties": {
      "title":   { "type": "text" },
      "content": { "type": "text" },
      "remark":  { "type": "text" }
    }
  },
  "settings": {
    "number_of_shards": 10
  }
}

POST test2/_doc/1?routing=1
{"title":"To school, everywhere is the white one, school","content":"the snow is still one child to jump from the sky"}
POST test2/_doc/2?routing=2
{"title":"First of the big brothers and sisters are braving the cold","content":"braving heavy snow snow yet"}
POST test2/_doc/3?routing=3
{"title":"Behind them there was a curved path","content":"junior high school English composition"}
POST test2/_doc/4?routing=4
{"title":"we walked convenient","content":"small writing on the National Day is not smooth"}
POST test2/_doc/5?routing=5
{"title":"but they must be tired","content":"very hard."}
POST test2/_doc/6?routing=6
{"title":"Home school","content":"love made several small partner"}
POST test2/_doc/7?routing=7
{"remark":"remark school"}

Run the query again

POST test2/_search
{
  "explain": true, 
  "query": {
    "match": {
      "title": "school"
    }
  }
}

Again, we take the explain output of the second result as an example.

. {"value" : 0.39556286, "description" :" Score (freq=2.0), computed as Boost * IDF * TF from:", "details" : [{"description" : "boost",}, {"value" : 0.2876821, "description" : "IDF ", "details" : [{"value" : 1, "description" : "n, number of documents containing term", }, { "value" : 1, "description" : "N, total number of documents with field",}]}, {"value" : 0.625, "description" : "tf", "details" : [{"value" : Oracleoccurrences of term within document", {"value" : 2oracleOccurrences of term within document", {"value" : 2oracleOccurrences of term within document",}, {"value" : 2oracleOccurrences of term within document", {"description" : "K1, term arc parameter",}, {"value" : 0.75, "description" : "B, length normalization parameter",}, {"value" : 8.0, "description" :" DL, length of field",}, {"value" : 8.0, "description" : "Average length of field",}]}]}Copy the code

The n and N above are both 1, because document 1 is the only document in the shard it was routed to.

Similarly, dl and avgdl are both 8: the statistics are collected per shard, and this shard holds only this one document.
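
The same kind of check reproduces the per-shard score (n = N = 1, dl = avgdl = 8):

import math

# Per-shard score for document 1 when it sits alone in its shard.
idf = math.log(1 + (1 - 1 + 0.5) / (1 + 0.5))   # n = 1, N = 1
tf = 2 / (2 + 1.2 * (1 - 0.75 + 0.75 * 8 / 8))  # dl = avgdl = 8
print(2.2 * idf * tf)  # ~0.3955628, matching the explain output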

Therefore, if the data volume is small, too many shards are not recommended; in fact a single shard can handle most workloads. Alternatively, you can use search_type=dfs_query_then_fetch to make ElasticSearch collect term statistics across all shards before scoring, at the cost of some performance.

POST test2/_search?search_type=dfs_query_then_fetch
{
  ......
}