The weight of each keyword w in a query should reflect how much information that word contributes to the query. A simple way to do this is to use the word's information content as its weight, i.e.:


$$I(w) = -P(w)\log{P(w)} = -\frac{TF(w)}{N}\log{\frac{TF(w)}{N}} = \frac{TF(w)}{N}\log{\frac{N}{TF(w)}}$$

where N is the size of the entire corpus. Since N is a constant, the factor 1/N can be omitted, and the formula simplifies to:


$$I(w) = TF(w)\log{\frac{N}{TF(w)}}$$
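To make the formula concrete, here is a minimal Python sketch that computes this weight over a tiny made-up corpus (the corpus and the whitespace tokenization are illustrative assumptions, not part of the original text):

```python
import math
from collections import Counter

# A toy corpus: three short "documents" (purely illustrative).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "atomic energy is applied peacefully",
]

tf = Counter(w for doc in corpus for w in doc.split())  # TF(w): corpus-wide counts
N = sum(tf.values())                                    # N: size of the entire corpus

def information_weight(w: str) -> float:
    """Simplified weight I(w) = TF(w) * log(N / TF(w)), with the constant 1/N dropped."""
    return tf[w] * math.log(N / tf[w])

for w in ("the", "sat", "atomic"):
    print(f"I({w}) = {information_weight(w):.3f}")
```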

However, this formula still has a defect: suppose two words have the same total term frequency TF, but one is concentrated in a single article while the other is scattered across many articles. The first word is clearly more discriminative and should carry greater weight, yet this formula does not capture that distinction. To fix it, make two assumptions:

  1. Every document is roughly the same size, containing $M$ words each, i.e. $M = \frac{N}{D} = \frac{\sum_w TF(w)}{D}$, where $D$ is the total number of documents in the corpus.
  2. Once a keyword appears in a document, its contribution is the same no matter how many times it appears. Such a word thus either appears $c(w) = \frac{TF(w)}{D(w)}$ times in a document or does not appear at all, where $D(w)$ is the number of documents containing $w$. Note that $c(w) < M$.

Under these assumptions, $N = MD$ and $TF(w) = c(w)D(w)$, so:

$$I(w) = TF(w)\log{\frac{N}{TF(w)}} = TF(w)\log{\frac{MD}{c(w)D(w)}} = TF(w)\log{\left(\frac{D}{D(w)} \cdot \frac{M}{c(w)}\right)}$$

Hence:

$$TF\text{-}IDF(w) = TF(w)\log{\frac{D}{D(w)}} = I(w) - TF(w)\log{\frac{M}{c(w)}}$$
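Since $M$ is defined as $N/D$, this decomposition is an exact identity, which can be checked numerically. The sketch below (reusing the toy corpus from above, still an illustrative assumption) computes both sides for a few words:

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "atomic energy is applied peacefully",
]
docs = [doc.split() for doc in corpus]

D = len(docs)                                      # D: total number of documents
tf = Counter(w for doc in docs for w in doc)       # TF(w): corpus-wide count
df = Counter(w for doc in docs for w in set(doc))  # D(w): documents containing w
N = sum(tf.values())                               # N: size of the entire corpus
M = N / D                                          # M = N / D: average document size

for w in ("the", "sat", "atomic"):
    c = tf[w] / df[w]                              # c(w): average occurrences per hit document
    tf_idf = tf[w] * math.log(D / df[w])           # TF-IDF(w) = TF(w) * log(D / D(w))
    rhs = tf[w] * math.log(N / tf[w]) - tf[w] * math.log(M / c)  # I(w) - TF(w) * log(M / c(w))
    print(f"{w}: TF-IDF = {tf_idf:.3f}, I(w) - TF(w)*log(M/c(w)) = {rhs:.3f}")
```

The two printed values agree for every word, matching the derivation above.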

It follows that the greater a word's information content $I(w)$, the larger its TF-IDF value; at the same time, the more often $w$ appears on average in the documents that contain it (i.e. the larger $c(w)$), the smaller the second term $TF(w)\log{\frac{M}{c(w)}}$, and the larger the TF-IDF value. Proof of the second conclusion: