In earlier posts we covered the structure of the inverted index and how to implement retrieval on top of it. The next question is: after retrieving, how do we sort? Since ranking as a whole is a big topic, this article only covers text relevance sorting for an inverted index, and only the simplest variant, TF-IDF sorting. What a complete search ranking pipeline looks like is something we can talk about later.
Text relevance sorting
First, understand a few concepts:
- Term: the smallest unit after word segmentation. For example, "Write a search engine with Golang" segments into `write`, `a`, `search engine`, `with`, `golang`, and each of those words is a Term.
- TF (Term Frequency): the frequency of a Term within one article, i.e. the number of times that Term appears divided by the total number of Terms. In the example above, the TF of `search engine` is 1/5. The higher the TF, the more important the word is within that article.
- DF (Document Frequency): the frequency of a Term across all documents. For example, if there are 100 documents in total and `search engine` appears in 5 of them, its DF is 5/100 = 0.05.
- IDF (Inverse Document Frequency): the opposite of DF. Divide the total number of documents by the number of documents containing the term, then take the logarithm. For the example above, the IDF of `search engine` is log(100/5). (The arithmetic below works through these numbers.)
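Putting numbers on the example above (using the natural log; the base only changes the scale, not the ordering):

TF(search engine) = 1/5 = 0.2
DF(search engine) = 5/100 = 0.05
IDF(search engine) = log(100/5) = log(20) ≈ 3.0
TF × IDF ≈ 0.2 × 3.0 = 0.6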
Inverted index technology has already solved the problem of finding the articles that contain a keyword: as long as word segmentation is accurate, finding the articles is not a problem. The real problem is how to sort that pile of articles so that the most relevant ones come first. Here is how to sort by relevance.
TF-IDF relevance sorting
We have just seen the concepts of TF and IDF. The role of TF is obvious: it indicates how important a term is within one article; the higher the TF, the more important the term is in that article. IDF, in turn, captures how well a term distinguishes documents across the whole collection: the higher the IDF, the greater the term's power to differentiate, and the better it reflects the importance of the term.
Why take the logarithm? Personally, I think TF-IDF is just one idea for computing text relevance, not a rigorously derived formula; the log simply compresses the values so the differences don't get too large. But from the perspective of information theory, Shannon's self-information formula is exactly a logarithm: the larger the value, the more information. So the bigger the IDF, the more information the term carries.
You can Baidu the details of self-information, but briefly: if the probability of an event is P, then the information in it happening is -log P. Note the minus sign. For example, suppose China plays Brazil at football and the probability of China winning is 0.01 (that may be an overestimate). If Brazil wins, the formula gives almost no information, because everyone knew Brazil would win; but if, hypothetically, China won, the computed information would be huge, front-page news for sure. That matches our intuition. IDF uses the same formula, just with the minus sign moved inside: -log P = log(1/P), where P is the DF, the fraction of all documents in which the term appears.
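To make that concrete with log base 2 (and ignoring draws, so Brazil wins with probability 0.99): -log₂(0.01) ≈ 6.64 for a China win, versus -log₂(0.99) ≈ 0.014 for a Brazil win. The unlikely outcome carries hundreds of times more information, exactly as intuition says.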
Multiplying the two values together, TF × IDF, gives the term's relevance contribution.
Why combine these two concepts? If TF alone can already describe a term's importance, why bring in IDF? Mainly to solve two problems:
- Removing the noise of high-frequency words. IDF can be loosely understood as the amount of information a term carries, so it mainly removes the influence of terms that carry very little information. Take the word `the`: its TF is very high, but it has essentially no meaning, and its IDF works out to basically 0, so TF × IDF is still 0. That effectively removes the interference of such common words (the quick example below puts numbers on this).
- At the same time, IDF helps pick out the important words. The higher a term's IDF, the better the articles containing that term are represented by it. This is easy to see: if a term appears in only one article, that word is a very good description of that article's content.
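To put numbers on the first point: suppose `the` appears in all 100 documents. Then its IDF is log(100/100) = log(1) = 0, so however high its TF is, TF × IDF = 0 and the word contributes nothing to the score.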
Finally, when several terms are searched together, the overall relevance is the sum of the TF-IDF of each term.
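In other words, for a query with terms t1, t2, …: score(doc) = TF(t1, doc) × IDF(t1) + TF(t2, doc) × IDF(t2) + …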
That's all there is to TF-IDF. When implementing it, since the total number of documents in a full (offline-built) index is known up front, the TF-IDF of each term is generally computed while building the index, and the posting lists are pre-sorted by it. My implementation has no full-index concept: every time a document is added, its TF values are computed and stored, and at retrieval time the IDF is derived from the number of documents the term's posting list recalls, so TF-IDF is computed in real time. If the number of documents is very large, this real-time calculation is a clear disadvantage, which is why a full index is really necessary. I just haven't fully implemented full-index building yet; I'll talk about how to build full indexes later.
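To make that concrete, here is a minimal Go sketch of the query-time calculation just described. The structures and names (`Posting`, `Index`, `Score`) are my own for illustration, not the actual implementation: TF is stored when a document is added, and IDF is derived at search time from the length of each term's posting list.

```go
package main

import (
	"fmt"
	"math"
)

// Posting records one document hit for a term; TF is precomputed
// when the document is added, as described above.
type Posting struct {
	DocID int
	TF    float64 // term count in doc / total terms in doc
}

// Index maps each term to its posting list. totalDocs is the
// current number of documents in the index.
type Index struct {
	postings  map[string][]Posting
	totalDocs int
}

// Score computes TF-IDF for a multi-term query in real time:
// IDF comes from the length of each term's posting list, and a
// document's score is the sum of TF*IDF over all query terms.
func (idx *Index) Score(terms []string) map[int]float64 {
	scores := make(map[int]float64)
	for _, term := range terms {
		plist := idx.postings[term]
		if len(plist) == 0 {
			continue // term recalls nothing, contributes no score
		}
		idf := math.Log(float64(idx.totalDocs) / float64(len(plist)))
		for _, p := range plist {
			scores[p.DocID] += p.TF * idf
		}
	}
	return scores
}

func main() {
	idx := &Index{
		postings: map[string][]Posting{
			"search engine": {{DocID: 1, TF: 0.2}, {DocID: 2, TF: 0.05}},
			"golang":        {{DocID: 1, TF: 0.2}},
		},
		totalDocs: 100,
	}
	fmt.Println(idx.Score([]string{"search engine", "golang"}))
}
```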
Word distance
Besides TF-IDF, some other text factors can be used in relevance sorting. The first is the distance between terms, i.e. word distance. If the query is "Xiaomi phone", then clearly an article where the two terms (Xiaomi, phone) sit right next to each other should score better. Compare "The Xiaomi phone is a very popular phone" with "Among the phone's apps there are many articles about health, for example: what are the benefits of eating millet?" (in Chinese, 小米 means both "Xiaomi" and "millet", so both articles hit both terms). Obviously the first article is more relevant than the second.
So, to keep word-distance information available, we also need to save each term's position information when writing the inverted index, and at retrieval time use those positions to compute the distance between the query words in each document, combining it with TF-IDF to express text relevance.
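As a sketch of how those stored positions might be used (assuming each term's posting carries a sorted list of the term's offsets in the document; the names here are hypothetical):

```go
package main

import (
	"fmt"
	"math"
)

// minDistance returns the smallest gap between any occurrence of
// one term and any occurrence of another within a document, given
// the two position lists stored in the inverted index. Both lists
// are assumed sorted ascending, so one merge-style pass suffices.
func minDistance(posA, posB []int) int {
	best := math.MaxInt // stays MaxInt if either list is empty
	i, j := 0, 0
	for i < len(posA) && j < len(posB) {
		d := posA[i] - posB[j]
		if d < 0 {
			d = -d
		}
		if d < best {
			best = d
		}
		// advance whichever pointer sits at the smaller position
		if posA[i] < posB[j] {
			i++
		} else {
			j++
		}
	}
	return best
}

func main() {
	// "xiaomi" at positions 3 and 40, "phone" at positions 4 and 19
	fmt.Println(minDistance([]int{3, 40}, []int{4, 19})) // 1: adjacent terms
}
```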
Hit position
Besides word distance, another factor affects relevance ranking: where the term is hit. This is also easy to understand: if the term is hit in the title or the abstract, its weight should obviously be higher than a hit in the body text, and that weight feeds into the final relevance calculation.
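A minimal sketch of how such a hit-position weight could be applied, assuming the index records which field each hit came from; the boost values are made-up illustrations, not tuned numbers:

```go
package main

import "fmt"

// fieldWeight is an illustrative boost table: a hit in the title
// counts for more than the same hit in the body. The exact values
// here are assumptions for the sketch.
var fieldWeight = map[string]float64{
	"title":    3.0,
	"abstract": 2.0,
	"body":     1.0,
}

// weightedTFIDF scales a term's TF-IDF contribution by the field
// in which the hit occurred; unknown fields fall back to 1.0.
func weightedTFIDF(tfidf float64, field string) float64 {
	if w, ok := fieldWeight[field]; ok {
		return tfidf * w
	}
	return tfidf
}

func main() {
	fmt.Println(weightedTFIDF(0.6, "title")) // 1.8: title hit boosted 3x
	fmt.Println(weightedTFIDF(0.6, "body"))  // 0.6: body hit unchanged
}
```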
Other models
Besides using TF-IDF directly, there are now many other text relevance models, such as BM25, a probability-based ranking model; I won't expand on it here. If you're interested, later articles can cover how ranking works, including text relevance and document importance, how to use machine learning to compute document importance and rank documents offline, and, when we get to ranking, how all of these signals are scored together.
The next article will cover some of the things I didn't implement in the inverted index storage, such as index compression, and then how to build the inverted index, add documents incrementally, and merge indexes.
If you found this useful, feel free to share it with others, and follow my WeChat public account (search XJJ267), where I mainly write about search, recommendation, and advertising technology, plus the occasional nonsense. Articles are posted there first :)