Elasticsearch master (25)

A deep dive into the TF/IDF algorithm and the vector space model

boolean model

  • A logical operator such as and first filters down to the docs that contain the specified terms

    • Query “hello world” –> filter –> docs containing hello / world / hello & world
    • bool –> must / must_not / should –> filter –> must contain / must not contain / may contain
    • Docs get no score at this stage –> just a true or false –> this reduces the number of docs that need scoring and improves performance
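A minimal sketch of the idea in plain Python (toy data and helper names made up for illustration; this is not how ES implements filtering internally):

```python
# Boolean-model filtering: docs are only included or excluded (true / false),
# no relevance score is calculated at this stage.
docs = {
    1: "java is my favourite programming language, hello world !!!",
    2: "hello java, you are very good, oh hello world!!!",
    3: "goodbye java",
}

def tokenize(text):
    return set(text.lower().replace(",", " ").replace("!", " ").split())

def bool_filter(docs, must=(), must_not=(), should=()):
    hits = []
    for doc_id, text in docs.items():
        terms = tokenize(text)
        if (all(t in terms for t in must)
                and not any(t in terms for t in must_not)
                and (not should or any(t in terms for t in should))):
            hits.append(doc_id)
    return hits

print(bool_filter(docs, must=["hello", "world"]))   # [1, 2]
print(bool_filter(docs, should=["hello", "java"]))  # [1, 2, 3]
```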

TF/IDF

  • Scoring a single term against a doc

    • query: hello world –> doc.content
    • doc1: java is my favourite programming language, hello world !!!
    • doc2: hello java, you are very good, oh hello world!!!
  • Example: the score of hello against doc1

TF: term frequency

  • Count how many times hello appears in doc1 (once) and give a score based on the number of occurrences
  • The more times a term appears in a doc, the higher its relevance score

IDF: Inverse Document Frequency

  • Count how many times hello appears across all docs: 3
  • The more frequently a term appears across all docs, the lower its relevance score

The length norm

  • Looks at the length of the field that hello is matched in: the longer the field, the lower the relevance score; the shorter the field, the higher the relevance score

Finally, TF, IDF, and the length norm are combined to calculate a comprehensive score for the term hello against doc1
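As a rough sketch, the three factors for hello against the two example docs could be combined like this (simplified formulas for illustration only; the exact Lucene formulas are covered in the next chapter):

```python
import math

docs = [
    "java is my favourite programming language, hello world !!!",  # doc1
    "hello java, you are very good, oh hello world!!!",            # doc2
]

def tokenize(text):
    return text.lower().replace(",", " ").replace("!", " ").split()

def term_score(term, doc, all_docs):
    tokens = tokenize(doc)
    tf = tokens.count(term)                               # times the term occurs in this doc
    df = sum(1 for d in all_docs if term in tokenize(d))  # docs containing the term
    idf = math.log(len(all_docs) / (df + 1)) + 1          # rarer term -> higher idf
    norm = 1 / math.sqrt(len(tokens))                     # longer field -> lower score
    return tf * idf * norm

for i, doc in enumerate(docs, 1):
    print(f"hello vs doc{i}: {term_score('hello', doc, docs):.3f}")
# doc2 scores higher: hello appears twice there, only once in doc1
```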

hello world –> doc1 –> score of hello for doc1, score of world for doc1 –> but what we ultimately need is the score of the whole hello world query for doc1 –> vector space model

vector space model

  • The total score of multiple terms for a DOC

    • hello world –> es calculates a query vector based on the scores of hello and world across all docs
    • The term hello is given a score of 2 based on all docs
    • The term world is given a score of 5 based on all docs
    • [2, 5]
  • This [2, 5] is the query vector

    • Doc vectors: three docs, one containing only one of the terms, one containing only the other, and one containing both
    • Three docs:
      • Doc1: contains hello –> [2, 0]
      • Doc2: contains world –> [0, 5]
      • Doc3: contains hello, world –> [2, 5]
  • For each doc, a score is calculated for each term: hello gets a score, world gets a score, and the scores of all the terms together form the doc vector

Plot the vectors on a graph, take the angle between each doc vector and the query vector, and from that angle derive each doc's total score for the query's multiple terms

  • For each doc vector, compute the angle (in radians) to the query vector; the doc's total score for the query's terms is based on this angle: the larger the angle, the lower the score; the smaller the angle, the higher the score

  • With many terms the vectors are high-dimensional; the angle is then computed with linear algebra and cannot be represented on a graph
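A small sketch of that angle comparison using cosine similarity, with the vectors from the example above (the smaller the angle between a doc vector and the query vector, the larger the cosine and the higher the score):

```python
import math

query = [2, 5]                              # query vector for "hello world"
doc_vectors = {
    "doc1 (hello)":        [2, 0],
    "doc2 (world)":        [0, 5],
    "doc3 (hello, world)": [2, 5],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

for name, vec in doc_vectors.items():
    print(name, round(cosine(query, vec), 3))
# doc3 points in the same direction as the query (angle 0), so it ranks first
```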

Elasticsearch master (26)

Deep search technology: demystifying the Lucene relevance scoring algorithm

We have already covered the Boolean model, TF/IDF, and the vector space model

Now let's explain the TF/IDF algorithm in depth: at the bottom layer of Lucene, what is the complete formula used in the TF/IDF calculation?

boolean model

 query: hello world

 "match": {
     "title": "hello world"
 }

 "bool": {
     "should": [
         { "match": { "title": "hello" }},
         { "match": { "title": "world" }}
     ]
 }

An ordinary multi-value match query is converted into a bool query; this is the Boolean model

lucene practical scoring function

The practical scoring function is what Lucene uses to calculate the score of a doc for a query
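The formula, as documented for Lucene's classic TF/IDF similarity, looks like this (each factor is explained below):

```
score(q,d)  =
            queryNorm(q)
          · coord(q,d)
          · ∑ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )   (t in q)
```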

  • score(q,d) is the relevance score of document d for query q.

    • The final result of the formula is the total score of a query (q) for a doc (d)
  • queryNorm(q) is the query normalization factor (new).

    • queryNorm keeps doc scores within a reasonable range so they don't diverge wildly, e.g. one doc scoring 10,000 and another scoring 0.1
  • coord(q,d) is the coordination factor (new).

    • Simply put, it rewards docs that match more of the query terms with a score multiplier
  • The sum of the weights for each term t in the query q for document d.

    • ∑ : symbol for summation
      • ∑ (t in q) : for each term t in the query; with query = hello world, the terms are hello and world
      • The doc score of each term in the query is summed; the per-term scores of a doc are what form the vector in the vector space model
  • tf(t in d) is the term frequency for term t in document d.

    • When calculating the doc score for each term, this is the TF component
  • idf(t) is the inverse document frequency for term t.

    • When calculating the doc score for each term, this is the IDF component
  • t.getBoost() is the boost that has been applied to the query (new).

    • This is the boost (weight) we specified earlier in the query, which affects the score
  • norm(t,d) is the field-length norm, combined with the index-time field-level boost, if any. (new).

    • Looks at the length of the matched field: the longer the field, the lower the score; the shorter the field, the higher the score
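Putting the factors together, a toy re-implementation of the classic formula might look like this (simplified tokenization, made-up corpus, and the field-level boost passed in as a plain dict; real Lucene also encodes norms lossily, which is ignored here):

```python
import math

def practical_score(query_terms, doc_id, docs, boosts=None):
    """Toy sketch of Lucene's classic TF/IDF practical scoring function."""
    boosts = boosts or {}
    doc_tokens = docs[doc_id].lower().split()
    n_docs = len(docs)

    def idf(term):
        df = sum(1 for d in docs.values() if term in d.lower().split())
        return 1 + math.log(n_docs / (df + 1))

    total = 0.0   # the sum over every term t in the query
    matched = 0
    for t in query_terms:
        tf = math.sqrt(doc_tokens.count(t))                  # tf(t in d)
        if tf:
            matched += 1
        norm = 1 / math.sqrt(len(doc_tokens))                # field-length norm
        total += tf * idf(t) ** 2 * boosts.get(t, 1.0) * norm

    coord = matched / len(query_terms)                       # coord(q,d)
    query_norm = 1 / math.sqrt(sum((idf(t) * boosts.get(t, 1.0)) ** 2
                                   for t in query_terms))    # queryNorm(q)
    return query_norm * coord * total

docs = {1: "hello world", 2: "hello java world hello", 3: "java"}
for d in docs:
    print(d, round(practical_score(["hello", "world"], d, docs), 3))
```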

query normalization factor

  • sumOfSquaredWeights = the sum of the squared IDF of each term in the query; queryNorm = 1 / √sumOfSquaredWeights

    • Mainly used to normalize the score
      • Taking the square root first produces a smaller number
      • Then dividing 1 by that square root produces a small factor
      • Multiplying by this factor keeps scores from blowing up to ridiculous values like tens or hundreds of thousands
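A quick worked example with made-up IDF values:

```python
import math

idf_scores = [2.0, 3.0]   # hypothetical IDF of the two query terms
sum_of_squared_weights = sum(w ** 2 for w in idf_scores)   # 4.0 + 9.0 = 13.0
query_norm = 1 / math.sqrt(sum_of_squared_weights)         # 1 / 3.606 ≈ 0.277
print(query_norm)
```

Every term weight is then multiplied by this small factor, which keeps the final doc scores in a reasonable range.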

query coordination

Rewards docs that match more of the query terms with a higher score

  • The original calculation, without the reward

    • Document 1 with hello → score: 1.5
    • Document 2 with hello world → score: 3.0
    • Document 3 with hello world java → score: 4.5
  • The calculation with the reward applied

    • Document 1 with Hello → score: 1.5 * 1/3 = 0.5
    • Document 2 with Hello World → Score: 3.0 * 2/3 = 2.0
    • Document 3 with Hello World Java → Score: 4.5 * 3/3 = 4.5

The total score is multiplied by (number of matched query terms / total number of query terms), so docs that match different numbers of the query's terms are separated by score

  • Docs that match more of the query terms have their scores boosted, while docs that match fewer terms are ranked lower
  • This is why a doc matching more of the query terms ends up with a higher score than one that matches fewer
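The reward from the example above, as a tiny sketch:

```python
def coord(raw_score, matched_terms, total_query_terms):
    # multiply the raw score by the fraction of query terms the doc matched
    return raw_score * matched_terms / total_query_terms

print(coord(1.5, 1, 3))   # doc with "hello"            -> 0.5
print(coord(3.0, 2, 3))   # doc with "hello world"      -> 2.0
print(coord(4.5, 3, 3))   # doc with "hello world java" -> 4.5
```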

field level boost

Different fields can be assigned different boosts (weights), so a match in a boosted field contributes more to the score