## What is feature extraction?
Feature extraction reduces the dimensionality of raw input data, or recombines the original features, to produce features for subsequent use.

For example, raw data often has many features: some are highly correlated with each other, and some are irrelevant to the final goal, so we need to discard the irrelevant ones (reducing the data dimensionality). For images, each image carries a large amount of data, and computing directly on the raw pixels is slow and bad for real-time processing, so we need to extract new features (again reducing the dimensionality). Or we derive a new feature from several of the original features and use that new feature to guide decisions (sorting out the existing features).
## What is the role of feature extraction?
- Reducing data dimensionality
- Extracting or sorting out effective features for subsequent use
## Feature extraction methods in sklearn
Sklearn currently includes methods for extracting features from text and images.
| Method | Description |
|---|---|
| `feature_extraction.DictVectorizer(*[, ...])` | Feature extraction from dicts |
| `feature_extraction.FeatureHasher([...])` | Feature hashing |
### Text feature extraction
| Method | Description |
|---|---|
| `feature_extraction.text.CountVectorizer(*[, ...])` | Convert a collection of text documents to a word frequency matrix |
| `feature_extraction.text.HashingVectorizer(*)` | Vectorize a collection of text documents using the hashing trick |
| `feature_extraction.text.TfidfTransformer(*)` | Transform the sparse count matrix produced by CountVectorizer into a TF-IDF feature matrix |
| `feature_extraction.text.TfidfVectorizer(*[, ...])` | Transform a collection of raw documents into a TF-IDF feature matrix |
## DictVectorizer – loading features from dicts
The DictVectorizer class can be used to transform feature arrays represented as lists of standard Python dict objects into the NumPy/SciPy representation used by scikit-learn estimators.

Although not particularly fast to process, a Python dict has the advantages of being easy to use, being sparse (absent features need not be stored), and storing feature names in addition to feature values.

DictVectorizer implements what is called "one-hot" coding for categorical features. Categorical features are "attribute-value" pairs where the value is restricted to a list of discrete, unordered possibilities (for example, topic identifiers, object types, tags, names, and so on).
In the following example, "city" is a categorical attribute and "temperature" is a traditional numerical feature:
```python
>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Francisco', 'temperature': 18.},
... ]

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
```
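A quick follow-up sketch (my own addition, not part of the original example): the fitted vectorizer keeps the learned column layout, and transform() silently ignores feature values it never saw during fitting.

```python
>>> # 'Tokyo' was not seen during fit, so all three city columns stay 0
>>> # and only the 'temperature' column is filled in.
>>> vec.transform([{'city': 'Tokyo', 'temperature': 21.}]).toarray()
array([[ 0.,  0.,  0., 21.]])
```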
## FeatureHasher – feature hashing
FeatureHasher is a high-speed, low-memory vectorizer class that uses a technique known as feature hashing (the "hashing trick").

This class is a low-memory alternative to DictVectorizer and CountVectorizer, intended for large-scale (online) learning and for situations where memory is tight.

The output of FeatureHasher is always a scipy.sparse matrix in CSR format.
```python
>>> from sklearn.feature_extraction import FeatureHasher
>>> hasher = FeatureHasher(n_features=10)
>>> D = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
>>> f = hasher.transform(D)
>>> f.toarray()
array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])
```
Parameter: n_features is the number of features (columns) in the output matrix. In linear learning, a small n_features easily causes hash collisions, while a large one blows up the dimensionality of the coefficients.
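To make the trade-off concrete, here is a minimal sketch (my own illustration): the output width always equals n_features, so shrinking it saves memory at the cost of squeezing distinct feature names into fewer columns, where they are more likely to collide.

```python
from sklearn.feature_extraction import FeatureHasher

D = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
for n in (4, 1024):
    hasher = FeatureHasher(n_features=n)
    f = hasher.transform(D)
    # The column count always equals n_features; with only 4 columns,
    # the 4 distinct feature names above are likely to share buckets.
    print(n, f.shape)  # prints: 4 (2, 4), then 1024 (2, 1024)
```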
## Text feature extraction
### CountVectorizer – word frequency matrix
CountVectorizer is a text feature extraction method: for each training text, it considers only the frequency with which each word occurs in that text.

The CountVectorizer class has many parameters, and its processing is divided into three steps: preprocessing, tokenizing, and n-gram generation.
```python
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
>>> cv = CountVectorizer()  # create the bag-of-words structure
>>> cv_fit = cv.fit_transform(texts)
>>> print(cv.get_feature_names())  # the vocabulary extracted from the texts, as a list
['bird', 'cat', 'dog', 'fish']
>>> print(cv.vocabulary_)  # as a dict -- key: word, value: column index
{'dog': 2, 'cat': 1, 'fish': 3, 'bird': 0}
>>> print(cv_fit.toarray())  # the word frequency matrix
[[0 1 1 1]
 [0 2 1 0]
 [1 0 0 1]
 [1 0 0 0]]
```
CountVectorizer converts the words in the text into a word frequency matrix via the fit_transform method: matrix element a[i][j] is the frequency of word j in the i-th text, that is, the number of times each word occurs. get_feature_names() shows the keywords of all the texts, and toarray() shows the word frequency matrix itself.
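The n-gram generation step can be seen by changing one parameter. As a small sketch of my own (not part of the original example), ngram_range=(1, 2) keeps both single words and adjacent word pairs as features:

```python
>>> cv2 = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
>>> _ = cv2.fit(["dog cat fish"])
>>> print(cv2.get_feature_names())
['cat', 'cat fish', 'dog', 'dog cat', 'fish']
```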
### HashingVectorizer
HashingVectorizer vectorizes text using the hashing trick.

Characteristics: an ordinary CountVectorizer works, but takes up a lot of memory when the vocabulary is large. HashingVectorizer solves this problem by applying the hashing trick and storing the resulting matrix as a sparse matrix.
```python
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x16 sparse matrix of type '<class 'numpy.float64'>'
    with 16 stored elements in Compressed Sparse Row format>
>>> X.toarray()
array([[-0.57735027,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        , -0.57735027,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.57735027,  0.        ,
         0.        ],
       [ 0.81649658,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.40824829,  0.        ,  0.40824829,  0.        ,
         0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ,  0.70710678,
         0.70710678,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ],
       [ 0.57735027,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.57735027,  0.        ,
         0.        ,  0.        ,  0.        ,  0.57735027,  0.        ,
         0.        ]])
```
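A small sketch of the trade-off (my own illustration): because no vocabulary is stored, the vectorizer is stateless (it can transform text without ever being fitted), but the word behind each column can no longer be recovered.

```python
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> vectorizer.transform(["dog cat fish"]).shape  # no fit needed: hashing is stateless
(1, 16)
>>> hasattr(vectorizer, "vocabulary_")  # no word-to-column mapping is kept
False
```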
### TfidfTransformer
TfidfTransformer computes the TF-IDF weight of each word from the word frequency matrix produced by CountVectorizer.

TF-IDF is a statistical method for assessing how important a word is to a document within a collection or corpus: a word's importance increases with the number of times it appears in that document, but decreases with how frequently it appears across the whole corpus.
```python
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> transformer
TfidfTransformer(smooth_idf=False)

>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]

>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<class 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse Row format>

>>> tfidf.toarray()
array([[0.81940995, 0.        , 0.57320793],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.47330339, 0.88089948, 0.        ],
       [0.58149261, 0.        , 0.81355169]])
```
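To see where these numbers come from, here is a hand computation of the first row (my own sketch): with smooth_idf=False, sklearn uses idf(t) = ln(n / df(t)) + 1, multiplies it by the term frequency, and then l2-normalizes each row.

```python
import numpy as np

counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]], dtype=float)
n = counts.shape[0]               # 6 documents
df = (counts > 0).sum(axis=0)     # document frequencies: [6, 1, 2]
idf = np.log(n / df) + 1          # [1.0, ~2.792, ~2.099]
row = counts[0] * idf             # raw tf-idf of the first document
print(row / np.linalg.norm(row))  # ~[0.8194, 0.0, 0.5732], matching above
```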
### TfidfVectorizer
TfidfVectorizer accepts text data and performs both bag-of-words feature extraction and the TF-IDF transformation.

TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer.
```python
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer()
>>> result = vectorizer.fit_transform(corpus)
>>> result.toarray()
array([[0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674],
       [0.        , 0.27230147, 0.        , 0.27230147, 0.        ,
        0.85322574, 0.22262429, 0.        , 0.27230147],
       [0.55280532, 0.        , 0.        , 0.        , 0.55280532,
        0.        , 0.28847675, 0.55280532, 0.        ],
       [0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674]])
```
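As a quick check of that equivalence (my own sketch, using the same corpus), composing the two classes by hand should reproduce TfidfVectorizer's output, given the same default parameters:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

# Two steps by hand: count the words, then apply the TF-IDF transform.
composed = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(corpus))

# One step with the combined class.
direct = TfidfVectorizer().fit_transform(corpus)

print(np.allclose(composed.toarray(), direct.toarray()))  # True
```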
## Conclusion
sklearn supports converting text into a word frequency matrix (CountVectorizer), a TF-IDF matrix (TfidfVectorizer) and a hash matrix (HashingVectorizer); all of these are bag-of-words methods. TfidfVectorizer reduces the interference of high-frequency words that carry little information, and is the most broadly applicable of the three.
## References

- sklearn feature extraction