Coursera course: Text Retrieval and Search Engines

What is TF

TF (term frequency) uses the number of occurrences of a term as its weight: each occurrence adds 1. Used alone, however, this cannot distinguish important words from unimportant ones such as "the", which occur frequently in almost every document.
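As a minimal sketch (the function name is illustrative), raw TF is just an occurrence count per word:

```python
from collections import Counter

def term_frequency(doc):
    """Raw term frequency: each occurrence of a word adds 1 to its weight."""
    return Counter(doc.lower().split())

tf = term_frequency("the cat sat on the mat")
# "the" occurs twice, so it gets the highest weight
# even though it carries no topical information
```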

What is DF

DF (document frequency) is the number of documents that contain the keyword.

What is IDF

IDF (inverse document frequency) penalizes common words: the more documents a word appears in, the lower its IDF value. The relationship is shown in the figure below.
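A sketch of one common smoothed form, IDF(w) = log((M + 1) / DF(w)), where M is the total number of documents (the exact variant used in the lectures may differ):

```python
import math

def idf(doc_freq, num_docs):
    """Smoothed IDF: log((M + 1) / DF). Words in many documents get low IDF."""
    return math.log((num_docs + 1) / doc_freq)

# With M = 1000 documents, a rare word (DF = 1) scores far higher
# than a very common one (DF = 900):
rare = idf(1, 1000)      # ~6.91
common = idf(900, 1000)  # ~0.11
```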

TF-IDF model

That is, score documents with TF * IDF. The simplest TF is binary: 1 if the term appears in the document, 0 otherwise. For example, suppose you search for "news about presidential campaign" and there are three documents in the document library:

D4 scores as the most relevant, which is acceptable, but D3 and D2 get the same score, which is not: "presidential" is clearly more informative than "about". In other words, different words should carry different weights, and the more documents a word appears in, the less important it should be. This is exactly what IDF captures. Suppose the IDF of each word is as follows:
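A sketch of how TF * IDF reweights the match; the documents and IDF values here are made up for illustration, not taken from the course:

```python
def tf_idf_score(query, doc, idf_table):
    """Score = sum over query words of TF(word in doc) * IDF(word)."""
    doc_words = doc.lower().split()
    return sum(doc_words.count(w) * idf_table.get(w, 0.0)
               for w in query.lower().split())

# Hypothetical IDF values: rare words like "presidential" weigh
# far more than frequent words like "about".
idf_table = {"news": 1.5, "about": 0.5, "presidential": 2.5, "campaign": 3.0}
q = "news about presidential campaign"
d2 = "news about organic food campaign"   # matches: news, about, campaign
d3 = "news of presidential campaign"      # matches: news, presidential, campaign
score2 = tf_idf_score(q, d2, idf_table)   # 1.5 + 0.5 + 3.0 = 5.0
score3 = tf_idf_score(q, d3, idf_table)   # 1.5 + 2.5 + 3.0 = 7.0
# score3 > score2: "presidential" now contributes more than "about"
```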

The rate at which the weight grows with TF is best controlled explicitly. Empirically, the best-performing transformation is the one used in BM25, (k + 1) * TF / (TF + k), which is bounded above by k + 1.

Why do long documents need to be normalized?

In general, a long document contains more words, so it is more likely to match the query terms even when its real topic is not the query. There therefore needs to be a way to penalize long documents. But a document can be long for two reasons: it may simply use too many words, or it may genuinely have more content on the topic, and the latter should not be penalized. The general idea is to penalize moderately, and one standard strategy is pivoted length normalization: divide by 1 - b + b * |d| / avdl, where |d| is the document length, avdl the average document length, and b in [0, 1] controls the strength of the penalty.

A double logarithm, ln(1 + ln(1 + TF)), is used to achieve a sublinear transformation: the weight grows more and more slowly as TF increases. The ranking function at this point multiplies this transformed TF by the pivoted length normalizer and the IDF of each matched query word.
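A sketch of that pivoted-normalization ranking function, assembled from the pieces above (parameter names are mine; b = 0.75 is just a common default):

```python
import math

def pivoted_score(query_counts, doc_counts, doc_len,
                  avg_doc_len, df, num_docs, b=0.75):
    """f(q, d) = sum over matched words w of
         c(w, q) * ln(1 + ln(1 + c(w, d)))
         / (1 - b + b * |d| / avdl)
         * log((M + 1) / df(w))"""
    norm = 1 - b + b * doc_len / avg_doc_len
    score = 0.0
    for w, qc in query_counts.items():
        c = doc_counts.get(w, 0)
        if c > 0:
            score += (qc * math.log(1 + math.log(1 + c)) / norm
                      * math.log((num_docs + 1) / df[w]))
    return score

# A longer document with the same term counts scores lower:
s_short = pivoted_score({"cat": 1}, {"cat": 2}, 10, 10.0, {"cat": 1}, 10)
s_long = pivoted_score({"cat": 1}, {"cat": 2}, 20, 10.0, {"cat": 1}, 10)
```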

General architecture for text retrieval (TR)

  • Tokenization: extract words, determine word boundaries, and map words with similar forms or meanings (e.g. via stemming) to the same term
  • Indexing: convert documents into data structures that support fast lookup, usually an inverted index (a dictionary that stores statistics for each word, such as how many documents it appears in, how many times, which documents they are, and at which positions)
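The inverted index described above can be sketched as follows (a toy positional index; real systems add compression and on-disk layouts):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: [positions]}. From this dictionary,
    DF is the number of doc_ids and TF is the length of a position list."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

docs = ["news about presidential campaign", "campaign news today"]
index = build_inverted_index(docs)
# index["campaign"] -> {0: [3], 1: [0]} : DF = 2, one position per doc
```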

Zipf’s law

The law states that the product of a word’s frequency and its rank (by frequency) is approximately constant: the r-th most frequent word has frequency roughly proportional to 1/r.
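A quick numeric illustration with a synthetic Zipfian distribution (the constant C = 1200 is arbitrary):

```python
# Zipf's law: rank * frequency ~ constant. If the r-th most common
# word occurs C / r times, the product r * f(r) stays flat across ranks.
C = 1200
ranks = range(1, 6)
freqs = [C // r for r in ranks]            # [1200, 600, 400, 300, 240]
products = [r * f for r, f in zip(ranks, freqs)]
# products == [1200, 1200, 1200, 1200, 1200]
```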