
What is TF-IDF?

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical method for assessing how important a word is to one document in a collection or corpus. A word's importance increases with the number of times it appears in the document, but decreases with the number of documents in the corpus that contain it. Search engines often apply variants of TF-IDF weighting to score how relevant a document is to a user's query.

Algorithm description

Suppose we have an article called “Zen and the Art of Motorcycle Maintenance” and we want to find out what its keywords are.

Our first idea is to compute the term frequency (TF) of each word in the article and pick keywords by frequency. But when we count, we find that the most frequent words are ones like “of”, “yes”, and “ba”. Such words tell us nothing about what the article is about, so we need to filter them out. These are called stop words.
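As a minimal sketch of this counting step, the snippet below counts words after dropping stop words. The stop-word set, the token list, and the helper name `word_counts` are all hypothetical and chosen only for illustration; real text would first need to be tokenized.

```python
from collections import Counter

# Hypothetical stop-word set, for illustration only.
STOP_WORDS = {"of", "the", "is", "and"}

def word_counts(tokens, stop_words=STOP_WORDS):
    """Count how often each non-stop word occurs in a list of tokens."""
    return Counter(t for t in tokens if t not in stop_words)

# Hypothetical, already tokenized text.
tokens = ["zen", "and", "the", "art", "of", "motorcycle", "maintenance",
          "is", "the", "art", "of", "zen"]
counts = word_counts(tokens)
print(counts)
```

Without the filter, “the” and “of” would tie with the content words at the top of the counts.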

After removing the stop words, we are left with the words that appear most often in the article: “Zen”, “motorcycle”, “maintenance”, and so on. But if these words appear equally often, are they equally important as keywords? Of course not. If “Zen” appears in fewer other articles than “motorcycle” or “maintenance”, it is more distinctive, so we can treat “Zen” as a more important keyword for this article.

In other words, the importance of a word to an article depends not only on how often it appears in that article, but also on how often it appears in other documents: the more common it is elsewhere, the less important it is. The measure that captures this is the inverse document frequency (IDF).
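The quantity behind IDF is the document frequency: the number of documents that contain a word. A minimal sketch, assuming a tiny hypothetical corpus that is already split into tokens (the helper name `document_frequency` is also an assumption):

```python
def document_frequency(word, docs):
    """Number of documents (token lists) that contain the word at least once."""
    return sum(1 for doc in docs if word in doc)

# Hypothetical three-document corpus, already tokenized.
docs = [
    ["zen", "motorcycle", "maintenance"],
    ["motorcycle", "maintenance", "manual"],
    ["motorcycle", "racing", "news"],
]
df = {w: document_frequency(w, docs) for w in ["zen", "motorcycle", "maintenance"]}
print(df)
```

Here “zen” occurs in the fewest documents, so it would receive the highest IDF of the three words.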

Once the term frequency (TF) and the inverse document frequency (IDF) have been calculated, multiplying them gives the TF-IDF value: TF × IDF. The larger a word's TF-IDF value, the more important the word is to the article and the stronger a candidate it is as a keyword.
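To make the multiplication concrete, here is a small worked example with made-up numbers: three words with the same term frequency but different document frequencies. The corpus size and counts are hypothetical, and the `+ 1` in the denominator follows the smoothing used in the code later in this article.

```python
import math

n_docs = 1000  # hypothetical corpus size
tf = 0.02      # hypothetical: each word makes up 2% of the article
# Hypothetical number of documents containing each word.
doc_freq = {"zen": 9, "motorcycle": 99, "maintenance": 199}

# TF-IDF = TF * log(N / (document frequency + 1))
scores = {w: tf * math.log(n_docs / (df + 1)) for w, df in doc_freq.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Although all three words are equally frequent in the article, the rarest word across the corpus gets the highest score.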

To sum up, TF-IDF judges the importance of a word to an article from two quantities: how often the word occurs in that article and how often it occurs in other articles. Importance increases with the former and decreases with the latter.

Algorithm steps

Calculate the term frequency
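No formula is given for this step. One common definition, which matches the code later in this article, divides the number of times the word occurs in the document by the total number of words counted in it. The symbols $n_{w,d}$ (occurrences of word $w$ in document $d$) and the index $k$ over the counted words are notation assumed here:

```latex
\mathrm{TF}(w, d) = \frac{n_{w,d}}{\sum_{k} n_{k,d}}
```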

Calculate the inverse document frequency

At this point we need a corpus, so that we can calculate the inverse document frequency of the word relative to the other articles in it. The inverse document frequency decreases as the number of documents containing the word increases.
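One common form, which matches the code later in this article, takes the logarithm of the corpus size divided by the number of documents containing the word. The symbols $N$ (number of documents in the corpus $D$) are notation assumed here, and the $+1$ in the denominator is a smoothing choice that avoids division by zero when no document contains the word:

```latex
\mathrm{IDF}(w) = \log \frac{N}{\left|\{\, d \in D : w \in d \,\}\right| + 1}
```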

Calculate the TF-IDF
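The final score is the product of the two quantities, written here with $w$ for the word and $d$ for the document (notation assumed for illustration):

```latex
\mathrm{TF\text{-}IDF}(w, d) = \mathrm{TF}(w, d) \times \mathrm{IDF}(w)
```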

Code implementation

import math

def TF_IDF(sentence, docs):
    # sentence: list of tokens from the target document
    # docs: the corpus, a list of documents (each a list of tokens or a string)
    tf = dict()
    idf = dict()
    tf_idf = dict()
    stop_words = ["Yes", "The", "呢", "吧", "Ah"]

    # calculate TF: count of each word / total count of non-stop words
    words_cnt = dict()
    for w in sentence:
        if w in stop_words:  # filter stop words
            continue
        words_cnt[w] = words_cnt.get(w, 0) + 1
    total = sum(words_cnt.values())
    for w in words_cnt:
        tf[w] = words_cnt[w] / total

    # calculate IDF: log(number of docs / (docs containing the word + 1))
    words_cnt_in_docs = dict()
    for w in words_cnt:
        words_cnt_in_docs[w] = 0
        for doc in docs:
            if w in doc:
                words_cnt_in_docs[w] += 1
    for w in words_cnt_in_docs:
        # +1 avoids division by zero when no document contains the word
        idf[w] = math.log(len(docs) / (words_cnt_in_docs[w] + 1))

    # calculate TF-IDF
    for w in words_cnt:
        tf_idf[w] = tf[w] * idf[w]

    return tf_idf
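As a usage sketch, the snippet below applies the same calculation to a tiny corpus and picks the top-scoring word. To stay self-contained it uses a compact variant of the function above under the hypothetical name `tf_idf_scores`; the corpus is made up and already tokenized.

```python
import math
from collections import Counter

def tf_idf_scores(tokens, docs, stop_words=()):
    """Compact variant of the TF_IDF function above (the name is hypothetical)."""
    counts = Counter(t for t in tokens if t not in stop_words)
    total = sum(counts.values())
    scores = {}
    for word, count in counts.items():
        # Number of documents that contain the word at least once.
        doc_freq = sum(1 for doc in docs if word in doc)
        scores[word] = (count / total) * math.log(len(docs) / (doc_freq + 1))
    return scores

# Hypothetical, pre-tokenized corpus; the first document is the one we score.
docs = [
    ["zen", "motorcycle", "maintenance", "zen"],
    ["motorcycle", "maintenance", "manual"],
    ["motorcycle", "racing", "news"],
    ["garden", "tools", "news"],
    ["cooking", "recipes"],
    ["travel", "guide"],
]
scores = tf_idf_scores(docs[0], docs)
top = max(scores, key=scores.get)
print(top, scores)
```

Note that with the `+ 1` smoothing, a word that appears in every document gets a slightly negative IDF, so very small corpora can produce negative scores.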

References

  • Application of TF-IDF and cosine similarity (I): automatic extraction of keywords
  • Wikipedia: tf-idf