1. CountVectorizer

The CountVectorizer class converts the words in a text corpus into a word-frequency matrix: element a[i][j] of the matrix is the frequency of word j in document i. It uses the fit_transform method to count the occurrences of each word, get_feature_names() to get all the keywords in the bag of words, and toarray() to view the word-frequency matrix.

from sklearn.feature_extraction.text import CountVectorizer
# corpora
corpus = [
    'This is the first document.',
    'This is the this second second document.',
    'And the third one.',
    'Is this the first document?'
]
# Convert the words in the text to a word frequency matrix
vectorizer = CountVectorizer()
print(vectorizer)
# Count the number of occurrences of a word
X = vectorizer.fit_transform(corpus)
print(type(X),X)
# Get all the text keywords in the word bag
word = vectorizer.get_feature_names()
print(word)
# Check word frequency results
print(X.toarray())

Results:

  (0, 2)	1
  (0, 6)	1
  (0, 3)	1
  (0, 8)	1
  (1, 5)	2
  (1, 1)	1
  (1, 6)	1
  (1, 3)	1
  (1, 8)	2
  (2, 4)	1
  (2, 7)	1
  (2, 0)	1
  (2, 6)	1
  (3, 1)	1
  (3, 2)	1
  (3, 6)	1
  (3, 3)	1
  (3, 8)	1
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 2]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]

2. TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a text weighting scheme based on a statistical idea: it combines how often a term occurs in a document with how many documents in the corpus contain it, in order to measure the importance of a word.

Advantage: it filters out words that are common but carry little information.


$$tfidf_{i,j} = tf_{i,j} \times idf_{i,j}$$

Term Frequency (TF) is the frequency with which a keyword appears in a document (the number of times the word appears in the document divided by the total number of words in the document):


$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

Inverse Document Frequency (IDF) is the inverse of document frequency. Document frequency is the number of documents in the corpus that contain a given keyword; its inverse is used to down-weight words that appear in many documents but contribute little to distinguishing any particular one.

According to the official documentation for TfidfTransformer:

The default (with smooth_idf=True):


$$IDF(x) = \log\frac{N+1}{N(x)+1} + 1$$

Here, N is the total number of documents, and N(x) is the number of documents containing the word x.

Textbook standard IDF definition:


$$IDF(x) = \log\frac{N}{N(x)+1}$$

Where N represents the total number of documents in the corpus and N(x) represents the number of documents containing the word x.
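To make the default (smoothed) formula concrete, here is the computation for the two-document toy corpus 'aa bb.' / 'aa cc.' used in the examples below; the values match the idf vector printed by method 3 further down:

$$IDF(aa) = \log\frac{2+1}{2+1} + 1 = 1, \qquad IDF(bb) = IDF(cc) = \log\frac{2+1}{1+1} + 1 \approx 1.405$$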

TF-IDF is usually implemented by first running CountVectorizer and then passing the counts through TfidfTransformer to obtain the tf-idf vectors; there is also a ready-made TfidfVectorizer API that does both steps at once.

Signature:

TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Example:

from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
import numpy as np
# corpora
cc = [
      'aa bb.', 'aa cc.'
]
# method 1
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(cc)
print('feature',vectorizer.get_feature_names())
print(X.toarray())

Results:

feature ['aa', 'bb', 'cc']
[[0.57973867 0.81480247 0.        ]
 [0.57973867 0.         0.81480247]]

It is worth noting that the default token pattern discards single-character tokens, much like stop words. If single-character words need to be retained, change the tokenization regex: token_pattern=r'(?u)\b\w+\b'.
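A minimal sketch of this option (the two single-letter documents here are made up purely for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['a b b.', 'a c c.']                           # hypothetical corpus with one-letter words
vec = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')   # keep single-character tokens
X = vec.fit_transform(docs)
print(vec.get_feature_names())                        # ['a', 'b', 'c'] (get_feature_names_out() in newer sklearn)
print(X.toarray())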

In addition, the above can be implemented as follows:

# method 2
vectorizer = CountVectorizer()  # optionally: token_pattern='(?u)\\b\\w+\\b'
transformer = TfidfTransformer()
cntTf = vectorizer.fit_transform(cc)
print('feature', vectorizer.get_feature_names())
print(cntTf)
cnt_array = cntTf.toarray()
X = transformer.fit_transform(cntTf)
print(X.toarray())

Results:

feature ['aa', 'bb', 'cc']
  (0, 1)        1
  (0, 0)        1
  (1, 2)        1
  (1, 0)        1
[[0.57973867 0.81480247 0.        ]
 [0.57973867 0.         0.81480247]]

In the sparse output, the first index is the document number, the second is the feature index, and the value is the corresponding word count.

To better understand what TfidfTransformer does, here is a simple step-by-step decomposition that reproduces its output:

# method 3
vectorizer = CountVectorizer()
cntTf = vectorizer.fit_transform(cc)
cnt_array = cntTf.toarray()
# term frequency: counts normalized by each document's total word count
tf = cnt_array/np.sum(cnt_array, axis = 1, keepdims = True)
print('tf',tf)
# smoothed idf, as in the default TfidfTransformer formula above
idf = np.log((1+len(cnt_array))/(1+np.sum(cnt_array,axis = 0))) + 1
print('idf', idf)
t = tf*idf
print('tfidf',t)
# L2-normalize each row
print('norm tfidf', t/np.sqrt(np.sum(t**2, axis = 1, keepdims=True)))

Results:

tf [[0.5 0.5 0. ]
 [0.5 0.  0.5]]
idf [1.         1.40546511 1.40546511]
tfidf [[0.5        0.70273255 0.        ]
 [0.5        0.         0.70273255]]
norm tfidf [[0.57973867 0.81480247 0.        ]
 [0.57973867 0.         0.81480247]]

That is, by default TfidfTransformer normalizes each resulting vector by dividing it by its L2 norm.


$$v_{\text{norm}} = \frac{v}{\|v\|_{2}} = \frac{v}{\sqrt{v_{1}^{2}+v_{2}^{2}+\cdots+v_{n}^{2}}}$$
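As a quick cross-check, the same row-wise L2 normalization can be done with sklearn.preprocessing.normalize. This small sketch reuses the un-normalized 'tfidf' values printed by method 3 above:

import numpy as np
from sklearn.preprocessing import normalize

# The un-normalized 'tfidf' rows printed by method 3 above.
t = np.array([[0.5, 0.70273255, 0.0],
              [0.5, 0.0, 0.70273255]])
print(normalize(t, norm='l2'))  # reproduces the 'norm tfidf' output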

Information theory basis of TF-IDF

The weight of each keyword w in a query should reflect how much information that word provides for the query. A simple approach is to use the amount of information carried by each word as its weight.

However, if two words appear with the same term frequency TF, but one is concentrated in a specific article while the other is scattered across many articles, the first word clearly has more discriminating power and should receive a larger weight.
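One way to make this precise (a sketch of the standard information-theoretic reading, not a derivation from the reference): if a word w appears in N(w) of the N documents, the probability that a randomly chosen document contains it is roughly N(w)/N, and its self-information is

$$I(w) = -\log P(w) \approx \log\frac{N}{N(w)}$$

which is essentially the (unsmoothed) IDF: the rarer a word is across documents, the more information it carries and the larger its weight.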

3. HashingVectorizer

Syntax

HashingVectorizer(alternate_sign=True, analyzer='word', binary=False,
         decode_error='strict', dtype=<class 'numpy.float64'>,
         encoding='utf-8', input='content', lowercase=True,
         n_features=1048576, ngram_range=(1, 1), non_negative=False,
         norm='l2', preprocessor=None, stop_words=None, strip_accents=None,
         token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None)

Characteristics

The generic CountVectorizer works, but it consumes a lot of memory when the vocabulary is large, so using the hashing trick and storing the resulting matrix in sparse form is a good way around this problem.

The pseudo-code is as follows:

 function hashing_vectorizer(features : array of string, N : integer):
     x := new vector[N]
     for f in features:
         h := hash(f)
         x[h mod N] += 1
     return x

The pseudocode does not take into account hash collisions, and the actual implementation is more complex.
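For illustration, here is a minimal Python sketch of the same idea (not sklearn's actual implementation; sklearn uses MurmurHash3, and with alternate_sign=True one bit of the hash chooses a sign so that collisions tend to cancel out rather than accumulate):

import re
import numpy as np

def hashing_vectorize(doc, n_features=16):
    """Toy hashing-trick vectorizer (a sketch only, not sklearn's code)."""
    x = np.zeros(n_features)
    for token in re.findall(r'(?u)\b\w\w+\b', doc.lower()):
        h = hash(token)                      # Python's built-in hash as a stand-in
        sign = 1.0 if h % 2 == 0 else -1.0   # signed buckets, like alternate_sign=True
        x[(h // 2) % n_features] += sign
    return x

print(hashing_vectorize('This is the first document.'))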

from sklearn.feature_extraction.text import HashingVectorizer
corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
 ]
vectorizer = HashingVectorizer(n_features=2**4)
X = vectorizer.fit_transform(corpus)
print(X.toarray())
print(X.shape)

Results:

[[0.57735027  0.          0.          0.          0.          0.
   0.          0.         0.57735027  0.          0.          0.
   0.          0.57735027  0.          0.        ]
 [0.81649658  0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.40824829
   0.          0.40824829  0.          0.        ]
 [ 0.          0.          0.          0.         0.70710678  0.70710678
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.        ]
 [0.57735027  0.          0.          0.          0.          0.
   0.          0.         0.57735027  0.          0.          0.
   0.          0.57735027  0.          0.        ]]
(4, 16)

4. Summary

In general, all three methods are bag-of-words approaches. Among them, the TfidfVectorizer approach reduces the interference of high-frequency words that carry little information and is therefore the most widely used.


References:

  1. (recommended) sklearn tf-idf;
  2. TF-IDF blog;
  3. Liu Jianping's blog;
  4. sklearn official documentation, Feature extraction;
  5. Text feature extraction with sklearn;
  6. Wikipedia, feature hashing;
  7. "The Beauty of Mathematics", Wu Jun