Another key concept in Elasticsearch is relevance scoring. In the results of the query API we often see the _score field, which holds the relevance score. Relevance describes how well a document matches the query.

The purpose of scoring is ranking: Elasticsearch places the documents that best meet the user’s needs first.

Before Elasticsearch 5.0, the default relevance scoring algorithm was TF-IDF; from 5.0 onward, the default is BM25. You might be wondering what these two algorithms look like, so let’s take a look.

TF-IDF

First of all, let’s look at the name. TF is short for Term Frequency, that is, word frequency. IDF is short for Inverse Document Frequency.

Word frequency

Word frequency is the easier of the two to understand: it is how often the search term appears in the document. It is calculated as the number of occurrences of the search term divided by the total word count of the document. The simplest relevance algorithm segments the query into terms and adds up their word frequencies. For example, if I search for “my algorithm”, which in this example is segmented into the three terms “I”, “of”, and “algorithm”, the relevance can be expressed as:

TF(I) + TF(of) + TF(algorithm)

But there’s a problem here. Words like “of” appear very often yet contribute little to relevance, so they should be left out of the relevance calculation. We call them stop words.
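The word-frequency sum described above can be sketched in a few lines of Python. This is only an illustration, not how Elasticsearch computes scores: the whitespace tokenizer, the STOP_WORDS set, the sample document, and the function names are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical stop-word list, for this example only.
STOP_WORDS = {"of", "the", "a"}

def term_frequency(term, doc_tokens):
    """Occurrences of `term` divided by the total number of tokens in the document."""
    if not doc_tokens:
        return 0.0
    return Counter(doc_tokens)[term] / len(doc_tokens)

def tf_score(query, document):
    """Sum the term frequencies of all query terms that are not stop words."""
    doc_tokens = document.lower().split()
    query_terms = [t for t in query.lower().split() if t not in STOP_WORDS]
    return sum(term_frequency(t, doc_tokens) for t in query_terms)

doc = "the design of the algorithm and the analysis of the algorithm"
print(tf_score("analysis of algorithm", doc))
```

Here “of” is dropped before scoring, so only “analysis” and “algorithm” contribute to the sum.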

Inverse document frequency

Having covered TF, let’s take a look at IDF. Before we can understand inverse document frequency, we need to know what document frequency (DF) is.

DF measures how many of all the documents contain the search term. For example, “I” appears in many documents, “of” appears in even more, and “algorithm” appears in few. That’s the document frequency. The inverse document frequency is then simply:

log(total number of documents / number of documents in which the search term appears)

Let’s make the example above more concrete, using a base-2 logarithm. If we have 100 million documents and 50 million of them contain the word “I”, then its IDF is log(2) = 1. “Of” appears in all 100 million documents, so its IDF is log(1) = 0. “Algorithm” appears in only 200,000 documents, so its IDF is log(500), which is about 8.97.
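The arithmetic in this example can be checked with a short Python sketch. The numbers only work out with a base-2 logarithm, so that is what the sketch assumes; the function name idf is made up for the example, and Lucene’s real IDF uses a smoothed formula rather than this plain ratio.

```python
import math

def idf(total_docs, docs_with_term):
    """Inverse document frequency with a base-2 logarithm, as in the example above."""
    return math.log2(total_docs / docs_with_term)

TOTAL = 100_000_000  # 100 million documents

print(idf(TOTAL, 50_000_000))         # "I": log2(2) = 1.0
print(idf(TOTAL, 100_000_000))        # "of": log2(1) = 0.0
print(round(idf(TOTAL, 200_000), 2))  # "algorithm": log2(500), about 8.97
```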

Thus, the larger the IDF, the more important the word.

OK, now you should have some understanding of TF and IDF. TF-IDF is essentially a weighted sum of TF values, with each term’s IDF as its weight:

TF(I) * IDF(I) + TF(of) * IDF(of) + TF(algorithm) * IDF(algorithm)
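To see the weighted sum in action, here is a minimal Python sketch over a tiny, made-up corpus. The corpus, the pre-tokenized documents, and the function names are assumptions for illustration; it uses the plain formulas from this article (with a base-2 logarithm), not Lucene’s actual implementation.

```python
import math

def tf(term, doc_tokens):
    """Occurrences of the term divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens) if doc_tokens else 0.0

def idf(term, corpus):
    """log2(N / number of documents containing the term); 0.0 if no document has it."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log2(len(corpus) / df) if df else 0.0

def tf_idf_score(query_terms, doc_tokens, corpus):
    """Sum of TF * IDF over all query terms."""
    return sum(tf(t, doc_tokens) * idf(t, corpus) for t in query_terms)

# Made-up corpus of already tokenized documents.
corpus = [
    ["i", "like", "the", "design", "of", "search"],
    ["the", "analysis", "of", "my", "algorithm"],
    ["i", "study", "the", "history", "of", "music"],
    ["notes", "of", "a", "traveller"],
]
query = ["i", "of", "algorithm"]
for doc in corpus:
    print(doc, round(tf_idf_score(query, doc, corpus), 3))
```

Because “of” occurs in every document, its IDF is 0 and it adds nothing to any score, while the rare term “algorithm” lifts the second document to the top.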

BM25

BM25 can be regarded as an optimization of TF-IDF. The difference is that as TF grows without bound, the TF-IDF score keeps increasing, while the BM25 score approaches a fixed upper limit. This limits the impact that any single term can have on the overall relevance score.
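The saturation effect is easy to see numerically. The sketch below evaluates only the term-frequency factor of BM25 for a document of average length, so length normalization is left out. Note that tf here is the raw occurrence count, and k1 = 1.2 is Lucene’s default; the function name bm25_tf is made up for the example.

```python
K1 = 1.2  # Lucene's default k1; controls how quickly term frequency saturates

def bm25_tf(tf, k1=K1):
    """BM25 term-frequency factor for a document of average length."""
    return tf * (k1 + 1) / (tf + k1)

for tf in (1, 2, 5, 10, 100, 1000):
    print(tf, round(bm25_tf(tf), 3))
```

However large tf becomes, the factor stays below k1 + 1 (2.2 with the default), whereas a plain TF weight would keep growing.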

The formula of the BM25 algorithm is as follows:

[Figure: BM25 formula, from “BM25 The Next Generation of Lucene Relevance”]
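For reference, the commonly cited form of BM25 can be written as below. This is the standard textbook definition, not a copy of the original figure, and the symbol names are the usual conventions rather than anything taken from this article.

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(q_i, D) is the number of times the query term q_i occurs in document D, |D| is the length of the document, and avgdl is the average document length in the index. k_1 controls how quickly term frequency saturates and b controls how strongly document length is normalized; Elasticsearch’s defaults are k_1 = 1.2 and b = 0.75.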

Explain API

If you want to see how a query is scored, you can use Elasticsearch’s explain feature. In a search request, add the following to the request body:

"explain": true
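As a rough sketch of where that flag goes, the Python below builds a complete search request body with explain enabled and, optionally, sends it to the _search endpoint using only the standard library. The function names, the field message, and the base URL you would pass in are assumptions for the example; the script itself only prints the body and does not contact a cluster.

```python
import json
import urllib.request

def build_explain_search(field, text):
    """Build a search request body with scoring explanations enabled."""
    return {"explain": True, "query": {"match": {field: text}}}

def search_with_explain(base_url, index, field, text):
    """Send the search request and return the decoded JSON response."""
    body = json.dumps(build_explain_search(field, text)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/{index}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Only prints the request body; no request is sent here.
print(json.dumps(build_explain_search("message", "elasticsearch"), indent=2))
```

With this flag set, each hit in the search response carries an _explanation object next to its _score.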

You can also add _explain to the path, for example:

curl -X GET "localhost:9200/my-index-000001/_explain/0?pretty" -H 'Content-Type: application/json' -d'
{
  "query" : {
    "match" : { "message" : "elasticsearch" }
  }
}
'

In this case, the returned result contains an explanation field that describes the scoring process in detail.
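The explanation is a nested tree: each node has a value, a description, and a details list of child nodes. A small Python helper, sketched below, can flatten it into indented lines that are easier to read. The sample dictionary is hand-written for illustration, so its values and descriptions are made up rather than taken from a real response.

```python
def format_explanation(node, depth=0):
    """Return indented lines for an explanation node and its nested details."""
    lines = [f"{'  ' * depth}{node['value']} : {node['description']}"]
    for child in node.get("details", []):
        lines.extend(format_explanation(child, depth + 1))
    return lines

# Hand-written sample; the numbers and descriptions are invented.
sample = {
    "value": 1.5,
    "description": "weight(message:elasticsearch), result of:",
    "details": [
        {"value": 3.0, "description": "idf", "details": []},
        {"value": 0.5, "description": "tf", "details": []},
    ],
}
print("\n".join(format_explanation(sample)))
```

With a real response, you would pass in the explanation field of the _explain result, or the _explanation field of a search hit.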

Summary

You should now have a preliminary understanding of how Elasticsearch calculates relevance scores. If you are interested, you can explore Elasticsearch’s scoring in more depth.