BM25 is a classical algorithm for scoring the similarity between a query and a document in information retrieval. Unlike TF-IDF, the BM25 formula consists of three main parts:
- The relevance between each term q_i in the query and the document D
- The relevance between the term q_i and the query Q
- The weight of each term
The general formula of BM25 is $Score(Q, D) = \sum_i^n W_i \, R(q_i, D)$, where $Q$ represents a query, $q_i$ represents a term in the query, and $D$ represents a document to be scored.
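As a minimal sketch of this decomposition (the helper names `term_weight` and `term_doc_relevance` are placeholders of my own, filled in by the components described in the following sections), the overall score is just a sum over query terms:

```python
def bm25_score(query_terms, doc, term_weight, term_doc_relevance):
    """Sum W_i * R(q_i, D) over every term q_i in the query."""
    return sum(term_weight(q) * term_doc_relevance(q, doc) for q in query_terms)
```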
The weight of each term
Here the weight $W_i$ is the IDF: $IDF(q_i) = \log \frac{N - df_i + 0.5}{df_i + 0.5}$, where $N$ is the total number of documents in the index and $df_i$ is the number of documents that contain $q_i$. The intuition is that the more documents contain $q_i$, the less important, or less discriminative, $q_i$ is, and the smaller its IDF; IDF can therefore be used as the weight of $q_i$.
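A quick sketch of this IDF (function and argument names are my own, not from any particular library):

```python
import math

def idf(term, doc_freqs, num_docs):
    """IDF(q_i) = log((N - df_i + 0.5) / (df_i + 0.5)).

    doc_freqs maps a term to df_i, the number of documents containing it;
    num_docs is N, the total number of documents in the index.
    Note: this variant goes negative for very common terms (df_i > N / 2);
    some implementations add 1 inside the log to avoid that.
    """
    df = doc_freqs.get(term, 0)
    return math.log((num_docs - df + 0.5) / (df + 0.5))
```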
The relevance between the term and the document
BM25 is designed around an important observation: the relationship between term frequency and relevance is nonlinear. Each term's relevance contribution to a document is bounded; once the number of occurrences of a term reaches a saturation point, additional occurrences no longer increase the score linearly, and that saturation point depends on the document itself. BM25 therefore models the relevance between a term and a document as $S(q_i, D) = \frac{(k_1 + 1)\, tf_{td}}{K + tf_{td}}$, with $K = k_1 (1 - b + b \cdot \frac{L_d}{L_{ave}})$.
Here $tf_{td}$ is the frequency of term $t$ in document $D$, $L_d$ is the length of document $D$, and $L_{ave}$ is the average length of all documents. The variable $k_1$ is a positive tuning parameter that calibrates the term frequency scaling: when $k_1 = 0$ the model is binary (term frequency is ignored), and larger values retain more of the raw term frequency information. $b$ is another tunable parameter ($0 \le b \le 1$) that controls how strongly document length is normalized: $b = 1$ applies full length normalization, while $b = 0$ applies none.
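The sketch below shows this term-document component and its saturation behavior; the parameter defaults ($k_1 = 1.5$, $b = 0.75$) are assumptions for illustration, not values mandated by the formula:

```python
def term_doc_relevance(tf_td, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """S(q_i, D) = (k1 + 1) * tf / (K + tf), K = k1 * (1 - b + b * L_d / L_ave).

    The score grows with tf_td but saturates: it can never exceed k1 + 1.
    b = 0 disables length normalization; b = 1 applies it fully.
    """
    K = k1 * (1 - b + b * doc_len / avg_doc_len)
    return (k1 + 1) * tf_td / (K + tf_td)

# Saturation in action: for an average-length document with k1 = 1.5,
# tf_td of 1, 10 and 100 gives roughly 1.0, 2.17 and 2.46 -- the contribution
# approaches but never exceeds k1 + 1 = 2.5.
```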
The relevance between the term and the query
When the query is long, we also need to characterize the relevance between the term and the query; this factor is unnecessary for short queries. $S(q_i, Q) = \frac{(k_3 + 1)\, tf_{tq}}{k_3 + tf_{tq}}$, where $tf_{tq}$ is the frequency of term $t$ in the query $Q$ and $k_3$ is a tunable parameter that calibrates the term frequency range within the query.
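A small sketch of this term-query component (the default $k_3 = 1.2$ is an assumed example value):

```python
def term_query_relevance(tf_tq, k3=1.2):
    """S(q_i, Q) = (k3 + 1) * tf_tq / (k3 + tf_tq).

    tf_tq is the frequency of the term within the query itself. For short
    queries tf_tq is almost always 1, which makes this factor exactly 1 --
    this is why it is usually dropped for short queries.
    """
    return (k3 + 1) * tf_tq / (k3 + tf_tq)
```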
Therefore, combining the three parts above, the final formula of BM25 is: $Score(Q, D) = \sum_i^n IDF(q_i) \cdot \frac{(k_1 + 1)\, tf_{td}}{K + tf_{td}} \cdot \frac{(k_3 + 1)\, tf_{tq}}{k_3 + tf_{tq}}$
Empirically, the three tunable parameters are usually set to $k_1$ and $k_3$ in the range 1.2~2, and $b = 0.75$.
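Putting the pieces together, here is a self-contained sketch of the full formula over tokenized documents, using the assumed defaults $k_1 = 1.5$, $b = 0.75$, $k_3 = 1.2$ (tokenization is out of scope here):

```python
import math
from collections import Counter

def bm25(query, docs, k1=1.5, b=0.75, k3=1.2):
    """Score every tokenized document in `docs` against the tokenized `query`."""
    N = len(docs)
    avg_len = sum(len(d) for d in docs) / N
    # df[term]: number of documents that contain the term.
    df = Counter(term for d in docs for term in set(d))
    q_tf = Counter(query)

    scores = []
    for d in docs:
        d_tf = Counter(d)
        K = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term, tf_tq in q_tf.items():
            if term not in d_tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5))
            s_doc = (k1 + 1) * d_tf[term] / (K + d_tf[term])
            s_query = (k3 + 1) * tf_tq / (k3 + tf_tq)
            score += idf * s_doc * s_query
        scores.append(score)
    return scores

# Example:
# docs = [["bm25", "is", "a", "ranking", "function"],
#         ["tf", "idf", "is", "a", "weighting", "scheme"]]
# print(bm25(["bm25", "ranking"], docs))
```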
For reference, here is an article comparing BM25 and TF-IDF on term frequency saturation, field length normalization, and BM25 tuning: – link –